Federal State Autonomous Educational Institution for Higher Education «Moscow Institute of Physics and Technology (National Research University)»
On the rights of a manuscript
Le The Anh
DEEP NEURAL NETWORK MODELS FOR SEQUENCE LABELING AND COREFERENCE TASKS
Specialty 05.13.01 – «System analysis, control theory, and information processing (information and technical systems)»
A dissertation submitted in fulfillment of the requirements for the degree of Candidate of Technical Sciences
Supervisor: Candidate of Physical and Mathematical Sciences Burtsev Mikhail Sergeevich
Dolgoprudny - 2020

Contents
Abstract
Acknowledgments
Abbreviations
List of Figures
List of Tables

1 Introduction
  1.1 Overview of Deep Learning
    1.1.1 Artificial Intelligence, Machine Learning, and Deep Learning
    1.1.2 Milestones in Deep Learning History
    1.1.3 Types of Machine Learning Models
  1.2 Brief Overview of Natural Language Processing
  1.3 Dissertation Overview
    1.3.1 Scientific Actuality of the Research
    1.3.2 The Goal and Task of the Dissertation
    1.3.3 Scientific Novelty
    1.3.4 Theoretical and Practical Value of the Work in the Dissertation
    1.3.5 Statements to be Defended
    1.3.6 Presentations and Validation of the Research Results
    1.3.7 Publications
    1.3.8 Dissertation Structure

2 Deep Neural Network Models for NLP Tasks
  2.1 Word Representation Models
    2.1.1 Word Representation
    2.1.2 Prediction-based Models
    2.1.3 Count-based Models
  2.2 Deep Neural Network Models
    2.2.1 Convolutional Neural Network
    2.2.2 Recurrent Neural Network
    2.2.3 Long Short-Term Memory Cells
    2.2.4 LSTM Networks
  2.3 Pre-trained Language Models
    2.3.1 ELMo
    2.3.2 Transformer
    2.3.3 OpenAI's GPT
    2.3.4 Google's BERT
  2.4 Summary

3 Sequence Labeling with Character-aware Deep Neural Networks and Language Models
  3.1 Introduction to the Sequence Labeling Tasks
  3.2 Related Work
    3.2.1 Rule-based Models
    3.2.2 Feature-based Models
    3.2.3 Deep Learning-based Models
    3.2.4 Related Work on Vietnamese Named Entity Recognition
    3.2.5 Related Work on Russian Named Entity Recognition
  3.3 Tagging Schemes
  3.4 Evaluation Metrics
  3.5 WCC-NN-CRF Models for Sequence Labeling Tasks
    3.5.1 Backbone WCC-NN-CRF Architecture
    3.5.2 Language Model-based Architecture
  3.6 Application of WCC-NN-CRF Models for Named Entity Recognition
    3.6.1 Overview of Named Entity Recognition Task
    3.6.2 Datasets and Pre-trained Word Embeddings
    3.6.3 Evaluation of Backbone WCC-NN-CRF Model
    3.6.4 Evaluation of ELMo-based WCC-NN-CRF Model
    3.6.5 Evaluation of BERT-based Multilingual WCC-NN-CRF Model
  3.7 Application of WCC-NN-CRF Model for Sentence Boundary Detection
    3.7.1 Introduction to the Sentence Boundary Detection Task
    3.7.2 Sentence Boundary Detection as a Sequence Labeling Task
    3.7.3 Evaluation of WCC-NN-CRF SBD Model
  3.8 Conclusions

4 Coreference Resolution with Sentence-level Coreferential Scoring
  4.1 The Coreference Resolution Task
  4.2 Related Work
    4.2.1 Rule-based Models
    4.2.2 Deep Learning Models
  4.3 Coreference Resolution Evaluation Metrics
  4.4 Baseline Model Description
  4.5 Sentence-level Coreferential Relation-based Model
  4.6 BERT-based Coreference Model
  4.7 Experiments and Results
    4.7.1 Datasets
    4.7.2 Evaluation of Proposed Models
  4.8 Conclusions

5 Conclusions
  5.1 Conclusions for Sequence Labeling Task
  5.2 Conclusions for Coreference Resolution Task
  5.3 Summary of the Main Contributions of the Dissertation

Bibliography
Abstract
Deep neural network models have recently received tremendous attention from both academia and industry and have achieved impressive results in a variety of domains ranging from Computer Vision and Speech Recognition to Natural Language Processing (NLP). They have significantly lifted the performance of machine learning-based systems to a whole new level, close to human-level performance. As a matter of course, the number of deep learning projects has also increased year by year. The IPavlov project1, based at the Neural Networks and Deep Learning Lab of the Moscow Institute of Physics and Technology (MIPT), is one of them, aiming at building a set of pre-trained network models, predefined dialogue system components, and pipeline templates. This thesis is based on the work carried out as a part of this project and focuses on deep neural network models for the Sequence Labeling and Coreference Resolution tasks.

This thesis consists of three main parts. First, we systematically review three key concepts in the field of Deep Learning for NLP that are closely related to the work carried out in this thesis: (1) two approaches to word representation learning, (2) deep neural network models often used to address machine learning tasks in general and NLP tasks in particular, and (3) cutting-edge pre-trained language models and their applications in downstream tasks. Second, we propose three deep neural network models for Sequence Labeling tasks: (1) a hybrid model consisting of three sub-networks that fully captures character-level and capitalization features as well as word context features, (2) a language modeling-based model, and (3) a multilingual model. These proposed models were evaluated on the task of Named Entity Recognition. Experiments conducted on six datasets covering four languages (Russian, Vietnamese, English, and Chinese) showed that our models achieved state-of-the-art performance.
Besides that, we reformulated the task of Sentence Boundary Detection as a Sequence Labeling task and used the proposed model to address it. The results obtained on two conversational datasets showed that the proposed model achieves impressive accuracy. Third, we propose two models for the Coreference Resolution task: (1) a Sentence-level Coreferential Relation-based model that can take as input a paragraph, a discourse, or even a document with hundreds of sentences and predict the coreference relations between sentences, and (2) a language modeling-based
1https://ipavlov.ai/
model that leverages the power of modern language models to boost performance. Experimental results on two Russian datasets and comparisons with other existing models showed that our models obtain state-of-the-art results on both the Anaphora and Coreference Resolution tasks.
Acknowledgments
The pursuit of a Ph.D. is a long and challenging journey through unknown lands. I would like to express my sincere thanks to those who have accompanied and encouraged me on this journey. Without their support, this thesis would not have been completed. First and foremost, I would like to thank my supervisor Burtsev Mikhail Sergeevich, the Head of the Neural Networks and Deep Learning Lab2 at the Moscow Institute of Physics and Technology, who provided me with a good research environment and gave me the right directions during my Ph.D. study. I would also like to thank my colleagues Yura Kuratov, Mikhail Arkhipov, Valentin Malykh, Maxim Petrov, Alexey Sorokin, and the others in our lab for the highly fruitful and enlightening discussions we shared. Besides, I would like to express my thanks to the reviewers Artem Shelmanov, Natalia Lukashevich, and Aleksandr Panov, the chair of the pre-defense seminar, for their valuable comments and suggestions on my dissertation. Finally, this thesis is dedicated to my family for their love, mental encouragement, and for always being by my side.
This work was supported by National Technology Initiative and PAO Sberbank project ID 0000000007417F630002.
Dolgoprudny - 2020
2https://mipt.ru/science/labs/laboratoriya-neyronnykh-sistem-i-glubokogo-obucheniya/
Abbreviations
General notation

a.k.a. also known as
e.g. exempli gratia, meaning "for example"
et al. et alia, meaning "and others"
etc. et cetera, meaning "and the other things"
i.e. id est, meaning "in other words"
w.r.t. with respect to
Natural language processing
BERT Bidirectional Encoder Representations from Transformers model [44]
Bi-LM Bidirectional Language Model
BPE Byte Pair Encoding
CBOW Continuous Bag-of-Words model [123]
CEAF Constrained Entity Alignment F-measure
CR Coreference Resolution
CRF Conditional Random Field
ELMo Embeddings from Language Models [69]
GPT Generative Pre-Training [14]
IOB Inside-Outside-Beginning tagging scheme
IOBES Inside-Outside-Beginning-End-Single tagging scheme
IOE Inside-Outside-End tagging scheme
LM Language Modeling
LR-MVL Low-Rank Multi-View Learning model [87]
LSA Latent Semantic Analysis
Masked LM Masked Language Modeling
MT Machine Translation
NER Named Entity Recognition
NLP Natural Language Processing
NPs Noun Phrases
OOV Out-Of-Vocabulary
POS Part-Of-Speech
QA Question Answering
SRL Semantic Role Labeling
WER Word Error Rate
Neural networks
AGI Artificial General Intelligence
AI Artificial Intelligence
ANI Artificial Narrow Intelligence
ANN Artificial Neural Network
Bi-LSTM Bi-directional Long Short-Term Memory
BPTT Back-Propagation Through Time
CNN Convolutional Neural Network
DL Deep Learning
FFNN Feed-Forward Neural Network
GRU Gated Recurrent Unit
HMM Hidden Markov Model
LSTM Long Short-Term Memory
ML Machine Learning
ReLU Rectified Linear Unit
RNN Recurrent Neural Network
Probability and statistics
CCA Canonical Correlation Analysis
PCA Principal Component Analysis
PPL Perplexity
SVD Singular Value Decomposition
t-SNE t-distributed Stochastic Neighbor Embedding
List of Figures
1.1 The Artificial Intelligence taxonomy

2.1 Word meaning similarity visualization
2.2 The most similar words to france − paris + moscow along with their cosine similarities, generated by the GloVe embedding model
2.3 Neural probabilistic language model architecture [145]
2.4 Supervised multitasking model [102]
2.5 Recurrent neural network-based language model [124]
2.6 Continuous Bag-of-Words model [123]
2.7 Continuous Skip-gram model [123]
2.8 SVD visualization
2.9 Architecture of LeNet-5 [140]
2.10 Connection architecture in a layer of neurons
2.11 RNN architecture and its unfolded structure through time
2.12 Different types of Recurrent Neural Networks
2.13 Backpropagation through time in RNN
2.14 LSTM cell at time step t
2.15 Peephole LSTM cell
2.16 A variant of LSTM cell with coupled input and forget gate
2.17 GRU cell
2.18 Standard LSTM network
2.19 Stacked LSTM network
2.20 Stacked Bidirectional LSTM network
2.21 Bi-LSTM based model of ELMo
2.22 Attention mechanism treated as a weighted average function
2.23 Scaled Dot-Product Attention and Multi-Head Attention [14]
2.24 High-level view of the Transformer model
2.25 GPT architecture
2.26 Self-Attention architecture proposed in [91]
2.27 Fine-tuning GPT for downstream tasks

3.1 Example of Sequence Labeling output space
3.2 Proposed model for the Sequence Labeling task
3.3 Character feature extraction using CNN
3.4 Capitalization feature extraction using a Bi-LSTM network
3.5 BERT-based Multilingual NER model
3.6 Importance of different components for performance of the backbone WCC-NN-CRF model across the datasets
3.7 Five-run average tagging performance of the backbone WCC-NN-CRF with different amounts of training data across the datasets
3.8 Five-run average tagging performance of variants of the backbone WCC-NN-CRF model on the CoNLL-2003 dataset with different amounts of training samples
3.9 Annotating step of a voice-enabled chatbot
3.10 Example of the sequence-to-sequence-based approach to the task of sentence boundary detection
3.11 Example of the sequence-labeling-based approach to the task of SBD

4.1 Mention-pair Encoder and Cluster-pair Encoder [59]
4.2 Coreference Resolution model with Deep Biaffine Attention by Joint Mention Detection and Mention Clustering [103]
4.3 Baseline coreference resolution model [56]
4.4 Visualization of the coreference relationship between sentences
4.5 Binary matrix representing the coreference relationship between sentences
4.6 Sentence-level Coreferential Relation-based (SCRb) model
4.7 BERT-based coreference model
4.8 SCRb model's modifications M1-M7 performance on OntoNotes 5.0
List of Tables
1.1 Some milestones in DL history
1.2 Some outstanding NLP achievements in the last twenty years

2.1 Prediction-based word embedding models
2.2 Count-based word embedding models

3.1 Example of applying different tagging schemes to the same sentence
3.2 Capitalization types of words
3.3 Statistics of NER datasets
3.4 Tagging performance of the backbone WCC-NN-CRF model on the Named Entity 5 and Named Entity 3 datasets
3.5 Tagging performance of the backbone WCC-NN-CRF model on Gareev's dataset using 5-fold cross-validation
3.6 Tagging performance of the backbone WCC-NN-CRF on Gareev's dataset with pre-training on NE3
3.7 Tagging performance of the backbone WCC-NN-CRF on the VLSP-2016 dataset
3.8 Tagging performance of the backbone WCC-NN-CRF on the CoNLL-2003 dataset
3.9 Tagging performance of the backbone WCC-NN-CRF on the VLSP-2016 dataset compared with some state-of-the-art models
3.10 Tagging performance of the backbone WCC-NN-CRF on the MSRA dataset
3.11 F1-score of the ELMo WCC-NN-CRF model on Russian datasets compared with some published models
3.12 Tagging performance of ELMo WCC-NN-CRF on the CoNLL-2003 dataset compared with some state-of-the-art models
3.13 Tagging performance of the Multilingual BERT-based WCC-NN-CRF model on VLSP-2016, CoNLL-2003, and NE3
3.14 SBD dataset statistics
3.15 The SBD model setup in detail
3.16 Tagging performance of ELMo WCC-NN-CRF on the Cornell Movie-Dialog dataset
3.17 Tagging performance of ELMo WCC-NN-CRF on the DailyDialog dataset

4.1 Illustration of global decisions in the CR task
4.2 Statistics on CR datasets
4.3 Effect of sentence-level coreferential relations on the baseline SCRb model performance
4.4 Testing results on the OntoNotes dataset
4.5 Testing results on the RuCor dataset using only raw text of baseline model + ELMo, and baseline model + SCRb
4.6 Testing results on the AnCor dataset with raw text
4.7 Coreference track results on Dialog Conference 2019. * denotes result without the gold mentions
4.8 Anaphora track results on Dialog Conference 2019

5.1 Summary of obtained results over different languages
Chapter 1
Introduction
1.1 Overview of Deep Learning
This section provides a brief overview of Deep Learning, starting by clarifying some fundamental definitions of Artificial Intelligence, Machine Learning, and Deep Learning, as well as their taxonomy. Subsequently, some milestones in the field of AI are highlighted. Finally, some types of machine learning models are briefly summarized.
1.1.1 Artificial Intelligence, Machine Learning, and Deep Learning
In 1950, Alan Turing, an English mathematician considered the father of computer science, who helped the Allied Forces win World War II by breaking the Enigma machine, published the famous paper titled "Computing Machinery and Intelligence" [3]. This paper considered the question "Can machines think?" and introduced the concept of what is now known as the Turing test. In this test, there are three participants: a human questioner, a human respondent, and a computer. During the test, the questioner poses a set of questions to both the human respondent and the computer and receives their answers. Based on the answers, the questioner decides which respondent is the computer. The computer's intelligence is measured by the difficulty of making that decision. This paper, along with the Turing test, is probably the first cornerstone of what is known today as Artificial Intelligence. At its core, Artificial Intelligence is a sub-field of computer science aiming at making machines perceive, interpret, and act like human beings. Whenever a machine performs a task the way humans tend to do it, such intelligent behavior can be called "artificial intelligence". For instance, a camera in a car park can recognize a license plate number to decide whether to open the barrier and let the car in. In general, Artificial Intelligence can be divided into two categories:
• ANI, short for Artificial Narrow Intelligence (a.k.a. weak AI), is a kind of AI focusing on performing a narrow task. Most of the artificial intelligence systems nowadays fall
Figure 1.1: The Artificial Intelligence taxonomy (nested fields: Computer Science ⊃ Artificial Intelligence ⊃ Machine Learning ⊃ Deep Learning).
into ANI, such as personal assistants, customer service bots, spam filters, and recommendation systems.
• Artificial General Intelligence (AGI for short), or strong AI, is the next generation of AI, which refers to the ability of machines to learn multiple skills in different contexts and apply the learned knowledge not only to the trained task but also to others. We can think of Jarvis in "Iron Man" and Artoo-Detoo in "Star Wars" as two examples of AGI. Of course, they belong to the future, not the present. However, AGI has recently garnered attention from research academies and big companies alike. For instance, Microsoft has invested one billion USD to support building beneficial AGI through OpenAI1.
Machine Learning is considered a subset of Artificial Intelligence which, as the name indicates, aims at empowering machines with the ability to learn by themselves directly from given data, without being explicitly programmed, and to use the learned knowledge to make accurate decisions. Deep Learning is a deeper branch of Machine Learning inspired by the human biological neural network, which comprises neurons working in parallel to convert input signals into the desired outcome. Deep Learning focuses on building end-to-end deep neural networks that are able to learn data representations directly from unstructured data such as natural language text, speech, or images, without human-crafted feature extraction. The Artificial Intelligence taxonomy showing the relationships between these fields is visualized in Fig. 1.1.
1https://openai.com/blog/microsoft/
1.1.2 Milestones in Deep Learning History
With a history of about 70 years, Artificial Intelligence and its sub-fields are not new but remain very attractive. Related research has been constantly increasing year by year, both in quantity and quality. Artificial Intelligence in general, and Machine Learning in particular, have achieved a lot of success, especially after the advent of Deep Learning along with the availability of large-scale datasets and massive amounts of computational power. Since it is difficult, if not impossible, to condense all Artificial Intelligence achievements into one section of a dissertation, Table 1.1 highlights some milestones in Deep Learning history according to our subjective opinion.
1.1.3 Types of Machine Learning Models
There are several ways to categorize machine learning models based on different criteria. These classifications are not exclusive, however: given the variety of machine learning models, some could belong to more than one type. According to whether or not models are trained with human supervision, they can be subdivided into three categories: Supervised Learning, Unsupervised Learning, and Reinforcement Learning.

Supervised Learning refers to the class of models aiming at learning a mapping between input examples and the corresponding targets. Training examples, therefore, must consist of input-target pairs. Supervised Learning can be divided into two sub-classes: Classification and Regression. Classification involves assigning an object to one or more classes. If there are only two possible classes, the task is binary classification; with more than two classes, it is multi-class classification. Different from classification models, regression models involve predicting numerical values. Some important supervised learning algorithms include k-Nearest Neighbors, Linear Regression, Logistic Regression, Support Vector Machines, and supervised neural networks.

Unsupervised Learning, in contrast with Supervised Learning, tries to extract hidden relationships from unlabeled data. In general, unsupervised learning algorithms can be grouped into three main types: (1) Clustering, referring to algorithms focused on finding groups in data (e.g., k-Means, Hierarchical Cluster Analysis, Expectation Maximization), (2) Dimensionality Reduction, aiming to simplify the data without losing too much information, and (3) Association Rule Learning, whose goal is to discover interesting relations between data attributes (e.g., Apriori, Eclat).
Different from Supervised Learning and Unsupervised Learning, Reinforcement Learning is a family of goal-oriented models where an agent observes the state of the surrounding environment to choose an action, performs it, gets a reward along with a new state from
1950 - Imitation Game: Alan Turing considered the question "Can machines think?" in [3] and introduced the Turing test, which is used to test whether a computer is capable of intelligence.
1956 - The Dartmouth AI conference: This conference is widely considered the birthplace of AI. At the conference, some potential topics were discussed, such as language and cognition, gaming, reasoning, and vision.
1957 - The Perceptron: Frank Rosenblatt introduced the Perceptron, which garnered a lot of attention from researchers and was widely covered in the media.
1966 - ELIZA: ELIZA was one of the earliest chatbots, developed by Joseph Weizenbaum at the MIT Artificial Intelligence Laboratory.
1970 - Reverse Mode of Differentiation: In 1970, Seppo Linnainmaa, a Finnish mathematician and computer scientist, introduced a general method for automatic differentiation of discrete connected networks. Backpropagation, the widely used method for training neural networks, is considered a descendant of this method.
1997 - Deep Blue: This chess computer program developed by IBM [78], which can examine up to 200 million positions per second, defeated the chess master Garry Kasparov.
1997 - Long Short-Term Memory (LSTM): LSTM, introduced by Sepp Hochreiter and Jürgen Schmidhuber [111], is widely used in recent neural network models.
2012 - Unsupervised learning-based model: Quoc Viet Le et al. [94] pointed out the usefulness of using unlabeled data in training neurons to be selective for high-level concepts.
2014 - DeepFace: Yaniv Taigman et al. [137] built a deep neural network combining a 3D model-based alignment with deep feed-forward neural networks and achieved an impressive accuracy of 97.35%.
2016 - AlphaGo: David Silver et al. from Google DeepMind introduced a Go program [22], namely AlphaGo, which combines deep neural networks and tree search. AlphaGo defeated Lee Sedol, one of the greatest Go players, in 2016.
Table 1.1: Some milestones in DL history.
the environment, and uses them for the next decision. Reinforcement Learning is applied in a wide range of areas, from robotics, games, and finance to computer vision and natural language processing.
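As a toy contrast between the supervised and unsupervised paradigms described above, the following self-contained Python sketch (with invented one-dimensional data) fits a 1-nearest-neighbor classifier from labeled input-target pairs and, separately, lets a tiny two-means clustering routine discover the same two groups without any labels. It is an illustration only, not part of the models studied in this thesis.

```python
# Supervised learning fits a mapping from inputs to given targets;
# unsupervised learning must find structure in unlabeled data.
# All data below is invented for illustration.

def nearest_neighbor_predict(train_xs, train_ys, x):
    """1-nearest-neighbor: the simplest supervised classifier."""
    best = min(range(len(train_xs)), key=lambda i: abs(train_xs[i] - x))
    return train_ys[best]

def two_means_cluster(xs, iters=10):
    """Tiny 1-D k-means (k=2): unsupervised grouping with no labels."""
    c0, c1 = min(xs), max(xs)          # initialize the two centroids
    for _ in range(iters):
        g0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return g0, g1

# Supervised: input-target pairs are required.
xs, ys = [1.0, 1.2, 8.0, 8.3], ["short", "short", "long", "long"]
print(nearest_neighbor_predict(xs, ys, 7.5))   # prints: long

# Unsupervised: only inputs; the algorithm discovers the two groups.
print(two_means_cluster([1.0, 1.2, 8.0, 8.3]))  # ([1.0, 1.2], [8.0, 8.3])
```

The only difference between the two calls is whether the targets `ys` are available, which is exactly the criterion used in the categorization above.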
1.2 Brief Overview of Natural Language Processing
Humans traditionally communicate with computers using programming languages such as Java, C, or Python, which are structured, completely precise, and unambiguous. However, this is still an inconvenient way to interact. So how can computers interact with humans in the way humans communicate with each other daily? The efforts to answer this question led to the formation of Natural Language Processing (NLP for short). NLP is a branch of Computer Science, nowadays relying heavily on ML algorithms, that aims to eliminate the gap in human-computer interaction. There is a variety of NLP tasks, which can be classified into four classes:
• Morphology: This class explores the stems of words, how words are formed, and how they change depending on their context. Tasks like Stemming, Gender Detection, Lemmatization, or Spellchecking belong to this class. Since the object of interest is the word, most of the models designed for these tasks operate at the word level.

• Syntactic analysis: The tasks in this class aim at finding the underlying structure of a sentence, exploiting the arrangement of words in a sentence, and building valid sentences. Tasks of this class include Part-of-Speech tagging and Dependency Parsing.

• Semantic analysis: This class refers to the task of finding the meaning and interpretation of words. Named Entity Recognition, Relation Extraction, and Semantic Role Labeling are three examples of this class.

• Pragmatics: Different from the above tasks, which are often at the word or sentence level, the tasks in this class consider the text as a whole. Some typical tasks in this class include Topic Modeling, Coreference Resolution, Text Summarization, and Question Answering.
Thanks to the development of Machine Learning, NLP has achieved many successes. Especially after the advent of Deep Learning, NLP has been lifted to a new level, closer to human-level performance. Table 1.2 highlights some outstanding achievements in NLP in the last twenty years.
2001 - The first neural network-based language model: Bengio et al. [145] introduced the first neural network-based language model, which uses a look-up table to convert some preceding words to vectors and uses them to predict the next word. Such vectors are nowadays known as word vector representations.
2008 - Multitasking model: Collobert and Weston [102] proposed to build a model for several tasks which uses a shared word embedding to share common low-level information between tasks.
2013 - Word2Vec: Mikolov et al. [123] introduced a word representation model trained on large datasets.
2013 - Bidirectional Long Short-Term Memory for NLP: Graves et al. [7] achieved state-of-the-art performance on the phoneme recognition task by using an end-to-end Bidirectional Long Short-Term Memory model.
2014 - Convolutional Neural Networks for text: Convolutional Neural Networks, which are more parallelizable than RNNs, started to be applied to NLP tasks ([79], [143]).
2014 - Sequence-to-sequence model: Sutskever et al. [43] introduced the encoder-decoder architecture and applied it to the Machine Translation task.
2015 - Attention mechanism for Machine Translation: Bahdanau et al. [23] introduced the attention mechanism to alleviate the shortcoming of compressing an input sequence of arbitrary length into a fixed-size vector.
2017 - Attention-based model: Ashish Vaswani et al. [14] proposed the Transformer architecture, which relies solely on the attention mechanism.
2018 - Pre-trained language models: NLP tasks are approached in a new way, whereby the model is first trained on unsupervised or semi-supervised tasks using large corpora and then fine-tuned on a specific task (ELMo [70], BERT [44], GPT [5]).
Table 1.2: Some outstanding NLP achievements in the last twenty years.
1.3 Dissertation Overview
1.3.1 Scientific Actuality of the Research
Deep Learning has become one of the most exciting fields of science, attracting both academia and industry. Applying deep neural network models to natural language processing (NLP) tasks is the mainstream approach, which has completely overwhelmed traditional machine learning-based approaches. This dissertation focuses on studying deep neural network models to address the Sequence Labeling and Coreference Resolution tasks.

The Sequence Labeling task is a kind of pattern recognition task involving the assignment of a categorical label to each element in an observed sequence. This task plays an important role in the field of Natural Language Processing (NLP), since many NLP tasks amount to labeling sequence data. Typical tasks include Named Entity Recognition (NER), Part-Of-Speech (POS) tagging, Speech Recognition, and Word Segmentation. Therefore, solving the Sequence Labeling task has been attracting a lot of attention from NLP researchers.

Coreference Resolution is an NLP task aiming at finding all mentions that refer to the same entity in a text. This task plays an important role in many higher-level NLP tasks, including Question Answering, Text Summarization, and Information Extraction. However, it is one of the hardest NLP tasks, since accurate prediction requires the model to deeply understand the meaning of the whole input text, which could be a couple of sentences or even a document with hundreds of sentences. So far, not many works have been published, and the achieved results are still not impressive compared with other NLP tasks. Hence, there remains large potential for better solutions.
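For illustration, the sequence labeling formulation can be sketched in a few lines of Python: each token receives one categorical label, here toy NER labels in the IOB scheme (B- begins an entity, I- continues it, O marks non-entity tokens). The sentence, tags, and helper function are invented for this example and are not part of the proposed models.

```python
# One label per token: the defining property of a sequence labeling task.
tokens = ["Alan", "Turing", "was", "born", "in", "London"]
tags   = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]

def extract_entities(tokens, tags):
    """Collect (entity_text, entity_type) spans from IOB tags."""
    entities, current, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new entity begins
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and current:  # current entity continues
            current.append(tok)
        else:                             # outside any entity
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:                           # flush a trailing entity
        entities.append((" ".join(current), etype))
    return entities

print(extract_entities(tokens, tags))
# [('Alan Turing', 'PER'), ('London', 'LOC')]
```

A sequence labeling model predicts the `tags` row from the `tokens` row; the span extraction above is the trivial post-processing step that turns per-token labels into entities.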
1.3.2 The Goal and Task of the Dissertation
The main goal of this dissertation is to study supervised learning models, focusing on deep neural network models such as Bidirectional Long Short-Term Memory (Bi-LSTM), Bidirectional Gated Recurrent Unit (Bi-GRU), Convolutional Neural Network (CNN), attention-based models, as well as state-of-the-art language models, to address the Sequence Labeling and Coreference Resolution tasks. The major tasks in this PhD work include:
• Systematically review publications on the application of neural networks to sequence labeling and coreference resolution tasks, with a focus on recent supervised deep neural networks;
• Propose and implement new models for the sequence labeling task, with a focus on extracting useful features for the word vector representation;

• Propose and implement new models for coreference resolution, focusing on dealing with the long-term dependency problem;
• Apply the proposed models to three specific tasks: Named Entity Recognition, Sentence Boundary Detection, and Coreference Resolution;
• Based on the experiments conducted on the given tasks, analyse the performance of the proposed models.
1.3.3 Scientific Novelty
In this dissertation, we introduce a hybrid model for sequence labeling tasks which differs from previous approaches in:
• Generating rich semantic and syntactic word embeddings by combining (1) pre-trained word embedding, (2) character-level word embedding using a deep CNN network, and (3) additional features such as POS and Chunk;
• Encoding capitalization features of whole input sequence with Bi-LSTM;
• Improving the word embedding quality by integrating and fine-tuning context-based word embeddings.
Application of the proposed model to the NER task achieves state-of-the-art performance on Russian and Vietnamese datasets (99.17% and 94.43% F1 on the NE3 and VLSP-2016 datasets, respectively). The SBD task can be reformulated as a sequence labeling task and is handled well by the proposed model (89.99% and 95.88% F1 on the Cornell Movie-Dialog and DailyDialog datasets, respectively). In addition, this dissertation proposes a new approach to the coreference resolution task that differs from previous approaches in:
• Enhancing mention detection by utilizing modern language models;
• Improving mention detection and mention clustering by learning sentence-level coreferential relations.
Application of the model with the sentence coreference module to the Russian language achieves a state-of-the-art average F1 of 58.42% on the RuCor dataset.
1.3.4 Theoretical and Practical Value of the Work in the Dissertation
• The proposed sequence labeling models, WCC-NN-CRF and ELMo WCC-NN-CRF, can be applied to several NLP tasks, such as Named Entity Recognition, Part-of-Speech tagging, Chunking, and Sentence Boundary Detection.
• The WCC-NN-CRF and ELMo WCC-NN-CRF models are implemented in the open-source DeepPavlov framework2. Trained NER models for Vietnamese and English datasets are
2https://deeppavlov.ai
available for download at https://github.com/deepmipt/DeepPavlov/tree/master/deeppavlov/configs/ner. These models are ready to use or can be further fine-tuned on specific domains.
• The SBD model trained on conversational datasets can be used in the annotating phase of a socialbot. The SBD model trained on the DailyDialog dataset can be found at https://github.com/deepmipt/DeepPavlov/tree/master/deeppavlov/configs/sentence_segmentation. This trained model was integrated into our socialbot, which was built to participate in the Alexa Prize - Socialbot Grand Challenge 3.
• The Sentence-level Coreferential Relation-based model can be used to extract relationships between sentences in a document-level context, and has promising potential not only for the coreference resolution task but also for question answering, relation extraction, or any other task that needs information about the relationships between sentences.
1.3.5 Statements to be Defended
• The WCC-NN-CRF model for the sequence labeling task, which utilizes (1) a CNN to generate vector representations of words from their characters and (2) a Bi-LSTM to encode the capitalization features of an input sequence. This model achieved state-of-the-art performance on Vietnamese and Russian datasets, with 98.21% and 94.43% F1 on the NE3 and VLSP-2016 datasets;
• Extensions of WCC-NN-CRF with context-based word vector representations generated by modern language models. This model obtained state-of-the-art performance on Russian and English datasets, with 99.17%, 92.91%, and 92.27% F1 on the NE3, Gareev's, and CoNLL-2003 datasets;
• The Sentence Boundary Detection task can be reformulated as a sequence labeling task and addressed by the proposed sequence labeling models, with F1 scores of 89.99% and 95.88% on the Cornell Movie-Dialog and DailyDialog datasets;
• The Sentence-level Coreferential Relation-based (SCRb) model, which extracts the relationships between sentences in the coreference context;
• Coreference resolution models extended with SCRb obtained a state-of-the-art average F1 of 58.42% on the RuCor dataset.
1.3.6 Presentations and Validation of the Research Results
The main findings and contributions of the dissertation were presented and discussed at four conferences:
• Artificial Intelligence and Natural Language Conference, ITMO University, Saint Petersburg, Russia, September 20-23, 2017.
• The 2nd Asia Conference on Machine Learning and Computing, Ton Duc Thang University, Ho Chi Minh City, Vietnam, December 7-9, 2018.
• The 25th International Conference on Computational Linguistics and Intellectual Technologies, Russian State University for the Humanities, Moscow, Russia, May 29 - June 1, 2019.
• The 4th International Conference on Machine Learning and Soft Computing, Mercure, Haiphong, Vietnam, January 17-19, 2020.
1.3.7 Publications
The proposed models applied to the NER, SBD, and Coreference Resolution tasks, along with the achieved results, were published in five papers, the first four of which are indexed by Scopus:
1. Le T. A., Arkhipov M. Y., Burtsev M. S., “Application of a Hybrid Bi-LSTM-CRF Model to the Task of Russian Named Entity Recognition”. In: Filchenkov A., Pivovarova L., Zizka J. (eds) Artificial Intelligence and Natural Language. AINL 2017. Communications in Computer and Information Science, pp. 91–103. 2017. Url: https://link.springer.com/chapter/10.1007/978-3-319-71746-3_8
2. Le T. A. and Burtsev M. S., “A Deep Neural Network Model for the Task of Named Entity Recognition”. In: International Journal of Machine Learning and Computing. Vol. 9. 1., pp. 8–13. 2019. Url: http://www.ijmlc.org/vol9/758-ML0025.pdf
3. Le T. A., Kuratov Y. M. , Petrov M. A., Burtsev M. S., “Sentence Level Representation and Language Models in the task of Coreference Resolution for Russian”. In: 25th International Conference on Computational Linguistics and Intellectual Technologies. 2019. Url: http://www.dialog-21.ru/media/4609/letaplusetal-160.pdf
4. Le T. A., “Sequence Labeling Approach to the Task of Sentence Boundary Detection”. In: Proceedings of the 4th International Conference on Machine Learning and Soft Computing. New York, NY, USA: Association for Computing Machinery, pp. 144–148. 2020. Url: https://dl.acm.org/doi/10.1145/3380688.3380703
5. Yuri Kuratov, Idris Yusupov, Dilyara Baymurzina, Denis Kuznetsov, Daniil Cherniavskii, Alexander Dmitrievskiy, Elena Ermakova, Fedor Ignatov, Dmitry Karpov, Daniel Kornev, The Anh Le, Pavel Pugin, Mikhail Burtsev, "DREAM technical report for the Alexa Prize 2019". In: The 3rd Proceedings of Alexa Prize. 2020. Url: https://m.media-amazon.com/images/G/01/mobile-apps/dex/alexa/alexaprize/assets/challenge3/proceedings/Moscow-DREAM.pdf
Personal contribution of the author in the works with co-authors is as follows:
1. Development and implementation of a hybrid Bi-LSTM CRF model for solving the task of Named Entity Recognition in Russian.
2. Development and implementation of a character-aware WCC-NN-CRF model and its application to the task of Named Entity Recognition for four languages: Vietnamese, Russian, English, and Chinese.
3. Development and implementation of the Sentence-level Coreferential Relation-based (SCRb) model for determining the relationship of sentences in the coreference context; integration of the SCRb model into the baseline coreference model for solving the task of Coreference Resolution for the Russian language.
5. Implementation and integration of the ELMo WCC-NN-CRF model for NER and SBD tasks in the socialbot architecture.
1.3.8 Dissertation Structure
The dissertation is organized into five chapters:
• Chapter 1 is an introductory chapter that gives an overview of the dissertation, briefly reviews Deep Learning and Natural Language Processing, and then summarizes the dissertation structure and contributions.
• Chapter 2 presents the major Deep Learning concepts closely related to the proposed models described in the next two chapters. The first is word embedding models, which are responsible for converting raw words into vectors that can be processed by neural networks. The second is the deep neural network models most commonly used for NLP tasks. The last is general-purpose language models, which can be applied to most NLP tasks to boost model performance; applying these models is easy and requires little effort.
• Chapter 3 describes the proposed models for the Sequence Labeling task and the experiments conducted on two tasks: Named Entity Recognition (NER) and Sentence Boundary Detection.
• Chapter 4 introduces a new approach to the task of Coreference Resolution, which aims at building a model to extract sentence-level coreferential relationships. This model can be used in two ways: (1) the output of the model is used as an additional feature for the baseline model, or (2) the model is jointly trained with the baseline model.
• Chapter 5 summarizes the completed work, the achieved results, and the conclusions.
Chapter 2
Deep Neural Network Models for NLP Tasks
2.1 Word Representation Models
2.1.1 Word Representation
From the NLP perspective, words can be considered core elements that can occur in isolation and carry meaning. Understanding word semantics is therefore the foundation for understanding larger units (e.g., phrases, sentences, documents), and has proven very useful for most, if not all, NLP tasks. Nevertheless, learning word semantics is not easy. Unlike digital images, whose elements are numbers that can be fed directly into neural networks, raw text is categorical data that usually1 needs to be represented in numeric form. Furthermore, an image pixel has no meaning when standing alone, but a word by itself can have more than one meaning, and which meaning a word carries can only be determined from its context. This is one of the major challenges in building word representation models. To deal with it, most machine learning-based models represent words as distributional vectors (a.k.a. word embeddings), following the distributional hypothesis, which states that “words with similar meaning tend to occur in a similar context” [146]. In simple terms, a word vector is a fixed-length vector of real numbers in which each number expresses one dimension of the word's meaning, and two words with similar meaning should be represented by similar vectors. A common measure for evaluating the similarity of two vectors is the cosine similarity, which is derived from the Euclidean dot product:
$$u \cdot v = \|u\|\,\|v\| \cos\theta \qquad (2.1)$$
Given two $n$-dimensional vectors $u$ and $v$, their cosine similarity is computed from their dot product and magnitudes:

$$\mathrm{similarity}(u, v) = \frac{u \cdot v}{\|u\|\,\|v\|} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2}\,\sqrt{\sum_{i=1}^{n} v_i^2}} \qquad (2.2)$$

1Some algorithms (e.g., decision trees) can work directly with categorical data.
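The computation of Eq. 2.2 can be sketched in a few lines of NumPy (the vectors below are made up for illustration):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity per Eq. 2.2: dot product over the product of magnitudes."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])    # parallel to u
w = np.array([-3.0, 0.0, 1.0])   # orthogonal to u

print(cosine_similarity(u, v))   # 1.0: same direction
print(cosine_similarity(u, w))   # 0.0: unrelated direction
```

Cosine similarity ranges from -1 (opposite meaning directions) through 0 (unrelated) to 1 (same direction), which is why it is preferred over the raw dot product: it ignores vector magnitude.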
Figure 2.1: Word meaning similarity visualization.
To see intuitively how well fine-tuned word vectors capture word meaning, we visualize some word vectors generated by GloVe [46] using Principal Component Analysis (PCA) (see Fig. 2.1). The visualization shows that these vectors capture similarity in word meaning well: words with similar meaning, which usually appear in similar contexts, have similar vectors. An interesting property of word vectors obtained after training on large-scale data is that word analogies can be solved with vector arithmetic:

$$\vec{france} - \vec{paris} + \vec{moscow} \approx \vec{russia} \qquad (2.3)$$

Fig. 2.2 shows the vectors most similar to $\vec{france} - \vec{paris} + \vec{moscow}$ as generated by the GloVe embedding model; remarkably, the word russia has the highest similarity. Several papers have been published to explain this interesting property of word vectors ([13], [6]). In a recent paper [54], Ethayarajh et al. gave a good explanation of word analogy arithmetic for two popular word representation models, GloVe [46] and Skip-gram with negative sampling [122], without making strong assumptions about the distribution of word frequencies. Motivated by the famous statement “You shall know a word by the company it keeps” [31], most word representation models try to extract the semantic meaning of a word
Figure 2.2: The most similar words to france − paris + moscow, along with their cosine similarities, generated by the GloVe embedding model.

from the context in which it appears. In general, there are two major approaches: (1) the prediction-based approach and (2) the count-based approach. The former utilizes neural networks to build a language model (LM) that predicts a target word given its context; the context of a given word may be only several preceding words, or both the previous and the next words. The neural networks used here can be recurrent neural networks (RNNs), convolutional neural networks (CNNs), or even simple feed-forward neural networks. Count-based approaches leverage global corpus-level statistics to build word vector representations. Models following this approach often build word co-occurrence matrices from the entire corpus and then use dimension reduction algorithms (e.g., SVD, PCA) along with similarity measures to obtain word vector representations in a continuous low-dimensional space. Both approaches are systematically summarized below.
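The analogy arithmetic of Eq. 2.3 amounts to a nearest-neighbor search over embeddings. A minimal sketch follows; the toy vectors below are invented for illustration, whereas real experiments use vectors trained on large corpora:

```python
import numpy as np

# Toy embeddings, invented for illustration only -- real analogy results
# require GloVe/word2vec vectors trained on billions of tokens.
emb = {
    "paris":  np.array([1.0, 0.0, 0.2]),
    "france": np.array([1.0, 1.0, 0.2]),
    "moscow": np.array([0.0, 0.0, 0.9]),
    "russia": np.array([0.0, 1.0, 0.9]),
    "banana": np.array([0.5, 0.1, 0.1]),
}

def most_similar(query: np.ndarray, exclude: set) -> str:
    """Return the vocabulary word whose vector is closest to `query` by cosine."""
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(emb[w], query))

# france - paris + moscow ≈ russia (Eq. 2.3); the query words themselves
# are excluded, as is standard in analogy evaluation.
query = emb["france"] - emb["paris"] + emb["moscow"]
print(most_similar(query, exclude={"france", "paris", "moscow"}))  # russia
```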
2.1.2 Prediction-based Models
As mentioned above, most, if not all, prediction-based word embedding models are built for the LM task, in which the role of the first layer (i.e., the embedding layer) is to map a word to its vector representation. Thus, before diving into prediction-based embedding models, the LM task itself should be defined. The LM task aims at estimating the probability of a word sequence in a natural language,
$P(w_1 w_2 \cdots w_T)$. This joint probability can be factorized into conditional probabilities:

$$P(w_1 w_2 \cdots w_T) = \prod_{t=1}^{T} P(w_t \mid w_1 \cdots w_{t-1}), \qquad (2.4)$$

where $T$ denotes the length of the input sequence.
Since the number of possible combinations of a word and its context ($w_t$ and $w_1 \cdots w_{t-1}$ in Eq. 2.4) is enormous, in practice only several preceding words of the context are used to compute $P(w_t \mid w_1 \cdots w_{t-1})$:
$$P(w_1 w_2 \cdots w_T) = \prod_{t=1}^{T} P(w_t \mid w_1 \cdots w_{t-1}) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1} \cdots w_{t-1}), \qquad (2.5)$$
Figure 2.3: Neural probabilistic language model architecture [145].

where $t - n + 1$ must be positive and $n$ denotes the number of preceding words considered. For instance, when $n$ is set to 2 we have:
$$P(w_1 w_2 \cdots w_T) \approx P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1 w_2)\,P(w_4 \mid w_2 w_3) \cdots P(w_T \mid w_{T-2} w_{T-1}).$$
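The truncated factorization of Eq. 2.5 can be illustrated with a maximum-likelihood bigram model (one preceding word of context) estimated from a toy corpus; real language models use far larger corpora and add smoothing:

```python
from collections import Counter
import math

# A toy corpus for illustration; real LMs use large corpora and smoothing.
corpus = "the cat sat on the mat the cat ate".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w_prev: str, w: str) -> float:
    """Maximum-likelihood estimate of P(w | w_prev)."""
    return bigrams[(w_prev, w)] / unigrams[w_prev]

def log_prob(sentence):
    """log P(w_1 ... w_T) with one preceding word of context,
    i.e. the truncation of Eq. 2.5."""
    lp = math.log(unigrams[sentence[0]] / len(corpus))  # log P(w_1)
    for prev, w in zip(sentence, sentence[1:]):
        lp += math.log(p_bigram(prev, w))
    return lp

print(p_bigram("the", "cat"))   # 2/3: "the" occurs 3 times, twice followed by "cat"
```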
The goal of most language models is to maximize the log-likelihood (or, equivalently, minimize the negative log-likelihood, possibly with additional constraints). One of the first such models, the Neural Probabilistic Language Model, was proposed by Bengio et al. in 2003 [145]; its objective is to maximize the log-likelihood of $P(w_t \mid w_{t-n+1}^{t-1})$. The model architecture is simple: the first layer maps each word to its vector representation in a low-dimensional continuous space, and a hidden layer with the tanh activation function followed by a softmax layer produces the probabilities of the words in the vocabulary (see Fig. 2.3 for details of the model architecture). The conducted experiments showed that this model achieved better results than n-gram models ([90], [97], [118]). The main weakness of Bengio et al.'s model [145] is that it takes a long time to train. To overcome this drawback, in 2005 Morin et al. proposed a novel architecture, the Hierarchical Probabilistic Neural Network Language Model [33], which speeds up neural language models by using a binary tree structure. The idea is to organize the word vocabulary in a binary tree, in which each word is represented by a leaf, and to utilize the tree structure to speed up normalization. This technique reduces the number of words
to choose from $N$ down to $O(\log N)$, where $N$ is the number of words in the vocabulary. From the probabilistic perspective, a sequence of $O(\log N)$-way normalizations is used instead of one $N$-way normalization. The tree was built using the IS-A taxonomy Directed Acyclic Graph from WordNet2. In experiments on the Brown corpus, this model was remarkably faster than Bengio et al.'s model; however, its performance was slightly worse than the original model. In 2007, Mnih and Hinton proposed the Log-bilinear (LBL) model [11], which is the foundation for many later models. In order to predict the next word $w_n$ given the sequence of preceding words (a.k.a. the context) $w_{1:n-1}$, the LBL model first linearly combines the feature vectors of the context words to create the predicted feature vector of the next word; it then uses the inner product to measure the similarity between the predicted feature vector and the feature vector of each word in the vocabulary, and finally uses the softmax function to compute the distribution for the next word:
$$\hat{r} = \sum_{i=1}^{n-1} C_i\, r_{w_i}, \qquad (2.6)$$

$$P(w_n = w \mid w_1^{n-1}) = \frac{e^{\hat{r}^\top r_w + b_w}}{\sum_j e^{\hat{r}^\top r_j + b_j}}, \qquad (2.7)$$

where $w_1^{n-1}$, $C_i$, and $b_w$ denote the given context, the position-dependent weight matrices, and the bias of word $w$, respectively. One year later, in 2008, Mnih and Hinton [10] modified the Hierarchical Probabilistic Neural Network Language Model [33] by using the Log-bilinear model to compute the probabilities at each node of the tree. Below is a short description of how decisions are made at each node:
• Encode the path from the root node to the leaf containing the word as a binary string $d$ of left-right decisions.
• Assign a feature vector to each non-leaf node.
• The probability of taking the left branch at the $i$-th node in the sequence is:

$$P(d_i = 1 \mid q_i, w_1^{n-1}) = \sigma(\hat{r}^\top q_i), \qquad (2.8)$$
where $\hat{r}$ is computed as in the Log-bilinear model above, and $q_i$ denotes the feature vector of the node.
• The probability of $w$ being the next word is:
$$P(w_n = w \mid w_1^{n-1}) = \prod_i P(d_i \mid q_i, w_1^{n-1}). \qquad (2.9)$$

2https://wordnet.princeton.edu/
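The hierarchical prediction of Eqs. 2.8 and 2.9 can be sketched as follows, with a made-up path and random node vectors standing in for a trained tree:

```python
import numpy as np

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))

def word_probability(r_hat: np.ndarray, path) -> float:
    """P(w_n = w | context) as the product over the tree path (Eq. 2.9),
    where each factor is a sigmoid branch decision (Eq. 2.8);
    d_i = 1 means taking the left branch at the node with vector q_i."""
    p = 1.0
    for d_i, q_i in path:
        p_left = sigmoid(float(r_hat @ q_i))
        p *= p_left if d_i == 1 else 1.0 - p_left
    return p

rng = np.random.default_rng(0)
r_hat = rng.normal(size=4)            # predicted feature vector from the LBL model
path = [(1, rng.normal(size=4)),      # root: go left
        (0, rng.normal(size=4))]      # next node: go right, reaching the leaf
print(word_probability(r_hat, path))  # a probability in (0, 1)
```

By construction the probabilities of all leaves sum to 1, since the two branch probabilities at every node sum to 1; this is what replaces the expensive $N$-way softmax normalization.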
Figure 2.4: Supervised multitasking model [102].
The experiments showed that the model's performance is comparable with Kneser-Ney n-gram models, while its speed is over 200 times faster than the previous Log-bilinear model. Also in 2008, Collobert and Weston proposed a multi-task model [102] in which word embeddings are utilized as a useful tool for downstream tasks such as Part-of-Speech tagging (POS), Chunking, NER, and Semantic Role Labeling (SRL). They jointly trained these supervised tasks on labeled data and the LM task on unlabeled data, which is abundant and can be freely crawled by search engines. The word embeddings are themselves model parameters shared between tasks (Fig. 2.4). The model was trained on a binary classification task answering the question “Is the middle word of the input sequence related to its context?”. The dataset therefore consists of fixed-size windows of text with two types of samples: (1) positive samples, which are original windows from Wikipedia, and (2) negative samples, made by replacing the middle word with a random word from the vocabulary. To avoid the expensive softmax normalization, the model is trained to output a high score for a positive sample and a low score for a negative sample:
$$J_\theta = \sum_{s \in S} \sum_{w \in V} \max\{0,\; 1 - f_\theta(s) + f_\theta(s^w)\}, \qquad (2.10)$$
where $S$, $V$, $f_\theta(\cdot)$, and $s^w$ denote the set of text windows, the word dictionary, the neural network with parameters $\theta$, and the corrupted version of a window in which the middle word is replaced by $w$, respectively. One inherent drawback of n-gram models, which feed-forward neural network language models did not overcome, is the fixed number of words in the context; in other words, the number of context words must be set before training. To address this, in 2010 Mikolov et al. proposed applying a Recurrent Neural Network (RNN) to the LM task [124]. RNNs use recurrent connections to compress information about the preceding words into one vector, called the context vector, and use it in combination with the input word to improve the prediction (Fig. 2.5). More formally, let $w(t)$ and $s(t-1)$ be the vector of the current word at time step $t$ and the output of the context layer at time step $t-1$. The output is computed as follows:
Figure 2.5: Recurrent neural network-based language model [124].
$$x(t) = w(t) + s(t-1), \qquad (2.11)$$

$$s_j(t) = f\Big(\sum_i x_i(t)\, u_{ji}\Big), \qquad (2.12)$$

$$y_k(t) = g\Big(\sum_j s_j(t)\, v_{kj}\Big), \qquad (2.13)$$

where $f(z)$ is the sigmoid function:

$$f(z) = \frac{1}{1 + e^{-z}} \qquad (2.14)$$

and $g(z)$ is the softmax function:

$$g(z_m) = \frac{e^{z_m}}{\sum_k e^{z_k}}. \qquad (2.15)$$

In 2013, Mikolov et al. published a stunning paper in which they proposed a word representation model named Word2Vec [123]. The idea is to build a model whose loss function measures how well a word can predict its context words. The paper introduces two model architectures, Continuous Bag-of-Words (CBOW) and Skip-gram. Unlike the previous language models, which predict the next word based only on the preceding words, the CBOW model uses both the previous and the next words to predict the target word. The name CBOW stems from using continuous vector representations and an unordered bag of context words. Thus, the loss function of CBOW is slightly different from [145]:

$$J_\theta = \frac{1}{T} \sum_{t=1}^{T} \log p(w_t \mid w_{t-n}^{t-1}, w_{t+1}^{t+n}), \qquad (2.16)$$

where $w_j^i = w_j w_{j+1} \cdots w_i$. The model is similar to the Feed-forward Neural Network Language Model [145], except that it removes the non-linear hidden layer and shares the projection layer for all words (see Fig. 2.6). In contrast to CBOW, Skip-gram uses the middle word
Figure 2.6: Continuous Bag-of-Words model [123].
Figure 2.7: Continuous Skip-gram model [123].

to predict the surrounding words (see Fig. 2.7). Thus, the objective function of the Skip-gram model computes the loss by summing the log probabilities of all the surrounding words:
$$J_\theta = \frac{1}{T} \sum_{t=1}^{T} \sum_{-n \le j \le n,\, j \ne 0} \log p(w_{t+j} \mid w_t), \qquad (2.17)$$
where $p(w_{t+j} \mid w_t)$ is computed using the softmax function3:
$$p(w_O \mid w_I) = \frac{\exp\big({v'_{w_O}}^\top v_{w_I}\big)}{\sum_{w=1}^{W} \exp\big({v'_w}^\top v_{w_I}\big)}, \qquad (2.18)$$
where $v_w$ and $v'_w$ are the input and output vector representations of $w$, and $W$ is the number of words in the vocabulary.

3To reduce the computational cost, hierarchical softmax is used instead of the full softmax.
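Eq. 2.18 can be sketched directly with random toy vectors (a full softmax; as the footnote notes, practical implementations replace it with hierarchical softmax or negative sampling):

```python
import numpy as np

rng = np.random.default_rng(1)
W, d = 6, 8                      # toy vocabulary size and embedding dimension
V_in = rng.normal(size=(W, d))   # input ("center") vectors v_w
V_out = rng.normal(size=(W, d))  # output ("context") vectors v'_w

def p_context_given_center(center: int) -> np.ndarray:
    """P(w_O | w_I) for every w_O, per Eq. 2.18 (full softmax over the vocabulary)."""
    scores = V_out @ V_in[center]        # v'_w . v_{w_I} for every word w
    exp = np.exp(scores - scores.max())  # max-shift for numerical stability
    return exp / exp.sum()

probs = p_context_given_center(0)
print(probs.sum())   # 1.0: a valid distribution over the vocabulary
```

The cost of this normalization is linear in $W$, which for real vocabularies (hundreds of thousands of words) is exactly what hierarchical softmax and negative sampling are designed to avoid.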
Table 2.1 briefly summarizes the approaches of the prediction-based language models and their main contributions.
2.1.3 Count-based Models
Unlike the predictive models discussed above, which are based on neural network language models that predict a target word given its context, count-based models produce word embeddings by exploiting co-occurrence statistics of words from large text corpora. To do this, most of them apply matrix factorization methods for rank reduction on a large term-frequency matrix to generate low-rank word vector representations. One of the first count-based models is Latent Semantic Analysis (LSA) by Deerwester et al. [109]. Let $m$ and $n$ denote the number of documents in the corpus and the number of terms, respectively. The model first builds a word-by-document co-occurrence matrix $A_{m \times n}$ in which each row represents a document and each column represents a term. This results in a very high-dimensional, sparse, and noisy matrix; thus, Singular Value Decomposition (SVD) is used to reduce the dimensionality of $A$ into two thin matrices $U$ and $V$ (see Fig. 2.8 for a visualization), with the dimensionality typically ranging from 100 to 300 features, condensing all important features into a small vector space:
$$A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, (V_{n \times r})^\top, \qquad (2.19)$$

where:
• $U$ is the matrix of left singular vectors, of size ($m$ documents, $r$ concepts);
• $\Sigma$ is the diagonal matrix of singular values, expressing the strength of each “concept”;
• $V$ is the matrix of right singular vectors, of size ($n$ terms, $r$ concepts).
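The decomposition of Eq. 2.19 and the resulting term similarities can be sketched with NumPy on a toy document-term matrix (the counts below are invented for illustration):

```python
import numpy as np

# A toy document-term count matrix A (4 documents x 5 terms); terms 0-1
# co-occur in documents 0-1, terms 2-3 in documents 2-3 (invented counts).
A = np.array([
    [2, 1, 0, 0, 1],
    [1, 2, 0, 0, 0],
    [0, 0, 3, 1, 0],
    [0, 0, 1, 2, 1],
], dtype=float)

# Truncated SVD: keep only the r strongest "concepts" (Eq. 2.19).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = 2
U_r, s_r, V_r = U[:, :r], s[:r], Vt[:r].T   # rows of V_r are r-dim term vectors

def term_similarity(i: int, j: int) -> float:
    """Cosine similarity of terms i and j in the reduced concept space."""
    u, v = V_r[i] * s_r, V_r[j] * s_r
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(term_similarity(0, 1))   # co-occurring terms: high similarity
print(term_similarity(0, 2))   # terms that never co-occur: low similarity
```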
Once $U$ and $V$ are computed, similarity metrics (e.g., cosine similarity) can be applied to evaluate the similarity of documents and terms. The model achieves performance comparable to prediction-based models when the training data is small; when large training data is available, however, prediction-based models capture similarity better [24]. In 2011, the Low Rank Multi-View Learning (LR-MVL) model was proposed by Dhillon et al. [87]; it leverages Canonical Correlation Analysis (CCA) between the left and right contexts of a given word to find a common latent structure. Experiments on the NER and Chunking tasks showed that the model achieved state-of-the-art accuracy at that time. Another count-based model was introduced in 2014 by Lebret and Collobert [99]; it also builds a word co-occurrence probability matrix, but instead of using Euclidean PCA, the
• Feed-forward Neural Network Language Model [145]. Approach: represent words in a low-dimensional space, automatically clustering similar words together; train the neural network with the backpropagation algorithm. Main results: better performance in comparison with traditional n-gram models.
• Hierarchical Probabilistic Language Model [33]. Approach: use a hierarchical decomposition of the conditional probabilities to speed up the model. Main results: an exponential speed-up in both training and prediction.
• Log-bilinear model [11]. Approach: train a neural network language model with a log-bilinear energy function to model the word context. Main results: remarkably faster than the one-hidden-layer model from Bengio [145].
• Log-bilinear model with hierarchical softmax [10]. Approach: use a binary tree to organize words and compute the probabilities at each node with the log-bilinear model. Main results: speed over 200 times faster than the previous model [11].
• Supervised multi-task model [102]. Approach: build a multi-tasking model sharing the same word embedding; treat the LM task as a binary classification task that predicts whether the middle word is related to the given context. Main results: produces good word embeddings, even for uncommon words, and shows that they are a very useful tool for many downstream tasks (e.g., POS, Chunking, NER, SRL).
• RNN-based Language Model [124]. Approach: use an RNN to deal with the limitation of fixed-length contexts. Main results: outperforms backoff models (both PPL and WER of the RNN model are much smaller), even when the backoff models were trained on 5 times more data.
• CBOW and Continuous Skip-gram models [123]. Approach: build a fast and accurate language model, able to use both left and right contexts to predict the target word, for learning high-quality word vectors from large datasets. Main results: produce high-quality word vectors capturing both semantic relationships and word analogies.

Table 2.1: Prediction-based word embedding models.
Figure 2.8: SVD visualization.

authors perform Hellinger PCA to represent words in a low-dimensional space. Experiments on the NER and IMDB review datasets showed that the Hellinger PCA word embeddings outperformed LR-MVL [87], the CW model from SENNA4, and some other models. In the follow-up work [98], the authors studied different factors impacting the quality of the word embedding model. They concluded that (1) using a large window of context improves capturing both the syntactic and the semantic information of words, (2) a context dictionary including rare words enhances capturing semantics, but good performance can be obtained using just a fraction of the most frequent words, and (3) the stochastic low-rank approximation approach beats the traditional PCA approach. Also in 2014, a cutting-edge count-based model named GloVe (short for Global Vectors for Word Representation) was introduced by Pennington et al. [46]; it also learns word vectors by performing dimensionality reduction on the word-word co-occurrence matrix extracted from corpora. The authors first build the word-word co-occurrence matrix $X$ whose element $X_{ij}$ expresses the number of times word $j$ appears in the context of word $i$. After that, the probability of word $j$ appearing in the context of word $i$ is computed:
$$P_{ij} = P(j \mid i) = \frac{X_{ij}}{X_i}. \qquad (2.20)$$

By empirical methods, the authors discovered that it makes more sense to learn word vector representations from ratios of co-occurrence probabilities rather than from the probabilities themselves; therefore they introduced the general model form:
$$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}, \qquad (2.21)$$

where $w \in \mathbb{R}^d$ are word vectors and $\tilde{w} \in \mathbb{R}^d$ are separate context word vectors. The vector

4http://www.nec-labs.com/senna/
difference is chosen for representing the ratio $P_{ik}/P_{jk}$ in the word vector space:
$$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}. \qquad (2.22)$$

The dot product is then used to convert the vectors on the left-hand side of Eq. 2.22 into a scalar, matching the scalar on the right-hand side:
$$F\big((w_i - w_j)^\top \tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}}. \qquad (2.23)$$

Since, as seen from the word-word co-occurrence matrix, the roles of words and their context words are exchangeable, the function $F$ is required to be a homomorphism between $(\mathbb{R}, +)$ and $(\mathbb{R}_{>0}, \times)$:

$$F(a + b) = F(a)\,F(b), \quad \forall a, b \in \mathbb{R}. \qquad (2.24)$$
Since the left-hand side of Eq. 2.23 is defined in terms of the difference between $w_i$ and $w_j$, the additive inverse (i.e., subtraction) is used instead of addition; correspondingly, the multiplicative inverse appears on the right-hand side, preserving the homomorphism property:

$$F(a - b) = \frac{F(a)}{F(b)}, \quad \forall a, b \in \mathbb{R}. \qquad (2.25)$$

By relabeling and applying distributivity, we get:
$$F\big((w_i - w_j)^\top \tilde{w}_k\big) = \frac{F(w_i^\top \tilde{w}_k)}{F(w_j^\top \tilde{w}_k)}, \qquad (2.26)$$

which can be combined with Eqs. 2.23 and 2.20 to produce a simple equation:
$$F(w_i^\top \tilde{w}_k) = P_{ik} = \frac{X_{ik}}{X_i}, \qquad (2.27)$$

where $X_{ik}$ and $X_i$, as denoted before, are the number of times word $k$ occurs in the context of word $i$ and the number of times any word occurs in the context of word $i$, respectively. The best candidate function satisfying the above conditions is $F = \exp$. Taking the natural logarithm of both sides of Eq. 2.27 produces:
$$w_i^\top \tilde{w}_k = \log P_{ik} = \log X_{ik} - \log X_i. \qquad (2.28)$$
Note that $X_{ik} = X_{ki}$, since the appearance of word $i$ in the context of word $k$ also means that word $k$ appears in the context of word $i$. In addition, $\log X_i$ can be replaced by a bias term, since it is independent of $k$. The symmetric form is finally obtained:
$$w_i^\top \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}. \qquad (2.29)$$
The authors also introduce a new weighted least-squares regression model:
$$J = \sum_{i,j=1}^{V} f(X_{ij})\,\big(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\big)^2, \qquad (2.30)$$
where $V$ denotes the vocabulary size, and $f$ is the weighting function:

$$f(x) = \begin{cases} (x/x_{\max})^\alpha & \text{if } x < x_{\max} \\ 1 & \text{otherwise,} \end{cases} \qquad (2.31)$$

which satisfies three conditions: (1) it tends to 0 as $x$ goes to 0, (2) it is non-decreasing, and (3) it is relatively small for large values of $x$. The model was evaluated on several tasks, including word analogy, word similarity, and named entity recognition, and achieved better results than the other models. Table 2.2 summarizes the count-based language models along with their main contributions.
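Eqs. 2.30 and 2.31 can be sketched as follows; the defaults $x_{\max} = 100$ and $\alpha = 3/4$ are the values reported in the GloVe paper, and the matrices passed to the loss here would be trainable parameters in a real implementation:

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f of Eq. 2.31 (defaults from the GloVe paper)."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_tilde, b, b_tilde):
    """Weighted least-squares objective of Eq. 2.30, summed over pairs
    with nonzero counts (log X_ij is undefined for X_ij = 0)."""
    J = 0.0
    V = X.shape[0]
    for i in range(V):
        for j in range(V):
            if X[i, j] > 0:
                err = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
                J += glove_weight(X[i, j]) * err ** 2
    return J

print(glove_weight(np.array([1.0, 50.0, 100.0, 1000.0])))  # capped at 1 from x_max on
```

The weighting dampens the influence of very rare pairs (condition 1) while preventing extremely frequent pairs, such as those involving stop words, from dominating the objective (condition 3).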
2.2 Deep Neural Network Models
2.2.1 Convolutional Neural Network
In [21], Hubel and Wiesel described the receptive field types of the monkey striate cortex, introducing the two concepts of “simple cells” and “complex cells” and showing that both types of cells are used for pattern recognition. Motivated by this study, in 1980 Kunihiko Fukushima introduced the Neocognitron [61], a hierarchical neural network for pattern recognition that can be considered the predecessor of recent CNNs. This model has two components, “S-cells” and “C-cells”, which use mathematical operations to mimic the biological cells. A decade later, in the 1990s, Yann LeCun et al. introduced a multi-layer neural network called LeNet-5 [139] that can be trained using the backpropagation algorithm. The architecture of LeNet-5 is straightforward: two convolutional layers and two average pooling layers, followed by two fully connected layers (see Fig. 2.9 for details of the architecture). Experiments on the MNIST dataset5 showed that LeNet achieved the best result compared with many other methods, including Linear and Pairwise Linear Classifiers, a Baseline Nearest Neighbor Classifier, Principal Component Analysis with a Polynomial Classifier, a Radial Basis Function Network, One-Hidden-Layer and Two-Hidden-Layer Fully Connected Multilayer Neural Networks, a Tangent Distance Classifier, and Support Vector Machines. After the success of LeNet-5, numerous variants of CNNs have been proposed, and their components differ accordingly. Nevertheless, in general there are three main components that play key roles in the CNN architecture: the Convolutional Layer, the Pooling Layer, and the Fully-Connected Layer.
5A dataset of handwritten digits, available for download at http://yann.lecun.com/exdb/mnist/
• LSA [109]. Approach: build a word-by-document co-occurrence matrix and use SVD to perform dimension reduction. Main results: although this approach is mostly used for document representations or topic modeling, it is also used for word vector representations; the model performs quite well on small amounts of data.
• LR-MVL [87]. Approach: leverage CCA between the past and future views of the data to extract common latent structure. Main results: the word embeddings produced by LR-MVL were reported to be better than those of [10], [102], and some others; state-of-the-art results on the NER and Chunking tasks.
• Hellinger PCA [99], [98]. Approach: apply a variant of PCA (Hellinger PCA) to the word co-occurrence matrix, and study different factors impacting the quality of the word embedding. Main results: outperformed the LR-MVL model [87], CW from SENNA, and some other models on the NER CoNLL-2003 and IMDB review datasets.
• GloVe [46]. Approach: combine methods from both the prediction-based and count-based approaches (global matrix factorization and local context window methods); learn word vector representations from ratios of co-occurrence probabilities instead of the probabilities themselves. Main results: significantly better results than the others on word analogy, word similarity, and named entity recognition.

Table 2.2: Count-based word embedding models.
Figure 2.9: Architecture of LeNet-5 [140].
Convolutional Layer
The Convolutional Layer is considered the most important component; it aims to capture feature patterns from low to high levels of abstraction. This is done through the connection architecture inside the convolutional layers. In particular, a convolutional layer consists of several layers of neurons. Unlike in the traditional fully-connected neural network, where each neuron is connected to all neurons in the previous layer, here each neuron is only connected to the neurons located in a small rectangle of the previous layer, producing a locally low-level feature. These features are then assembled to create a feature map that captures higher-level, more abstract features. This architecture is depicted in Fig. 2.10. Neurons in the same layer share the same weights, called a filter. More mathematically, let x be an input matrix with n rows and m columns, and f be a filter with r rows and t columns. The output of the neuron located at the ith row and jth column is computed as below:
o[i, j] = \sum_{u=0}^{r-1} \sum_{v=0}^{t-1} x[i \cdot s_h + u,\, j \cdot s_w + v] \cdot f[u, v], \quad (2.32)

where sh, sw are the vertical and horizontal strides. r and t, along with sh and sw, determine which neurons in the previous layer are connected to the target neuron in the current layer. Let's take Fig. 2.10 as an example. Suppose the filter size is 4×3 and both the horizontal and vertical strides are 1. The output of the last neuron, located at the 9th row and 7th column, is computed as follows:
o[9, 7] = x[9, 7] \cdot f[0, 0] + x[10, 7] \cdot f[1, 0] + x[11, 7] \cdot f[2, 0] + x[12, 7] \cdot f[3, 0]
        + x[9, 8] \cdot f[0, 1] + x[10, 8] \cdot f[1, 1] + x[11, 8] \cdot f[2, 1] + x[12, 8] \cdot f[3, 1]
        + x[9, 9] \cdot f[0, 2] + x[10, 9] \cdot f[1, 2] + x[11, 9] \cdot f[2, 2] + x[12, 9] \cdot f[3, 2].
In the more general case where the input has one more dimension, called the channel (e.g., an RGB image), the filter also needs one more dimension, and Eq. 2.32 generalizes to:

o[i, j] = \sum_{u=0}^{r-1} \sum_{v=0}^{t-1} \sum_{k=0}^{c-1} x[i \cdot s_h + u,\, j \cdot s_w + v,\, k] \cdot f[u, v, k], \quad (2.33)
where c denotes the number of channels. The Convolutional Layer uses a set of filters to transform an input into a set of feature maps. During the training stage, the weights in the filters are fine-tuned in order to create feature maps that are useful for a specific task. The repeated use of convolutional layers creates more complex patterns; this process is part of representation learning.

Figure 2.10: Connection architecture in a layer of neurons.
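As an illustration, Eq. 2.33 can be sketched directly in NumPy with explicit loops over output positions. The `conv2d` helper name is illustrative, and the loop-based form is for clarity, not efficiency:

```python
import numpy as np

def conv2d(x, f, sh=1, sw=1):
    """Valid convolution of Eq. 2.33: x has shape (n, m, c), f has shape (r, t, c)."""
    n, m, c = x.shape
    r, t, _ = f.shape
    out_h = (n - r) // sh + 1
    out_w = (m - t) // sw + 1
    o = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # sum over the r x t x c receptive field (the triple sum in Eq. 2.33)
            o[i, j] = np.sum(x[i * sh:i * sh + r, j * sw:j * sw + t, :] * f)
    return o

x = np.arange(25.0).reshape(5, 5, 1)   # single-channel 5x5 input
f = np.ones((3, 3, 1)) / 9.0           # 3x3 averaging filter
print(conv2d(x, f).shape)              # (3, 3): a 5x5 input shrinks to 3x3
```

With stride 1 and no padding, an n×m input and an r×t filter yield an (n−r+1)×(m−t+1) feature map, which matches the loop bounds above.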
Pooling Layer
The key role of a pooling layer is to shrink the input to reduce the spatial dimensions of the feature map. Each neuron in a pooling layer is also connected to a small group of neurons in the previous layer (like in the convolutional layer). The output of the neuron located at ith row and jth column is computed as:
o[i, j] = \mathrm{pool\_op}\big(x[i \cdot s_h + u,\, j \cdot s_w + v]\big), \quad u = 0, \dots, r-1, \; v = 0, \dots, t-1, \quad (2.34)

where pool_op denotes the pooling operation (e.g., max pooling or average pooling), sh, sw are the vertical and horizontal strides, and r, t are the height and width of the pooling filter.
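A minimal sketch of Eq. 2.34 in NumPy, where `pool_op` can be `np.max` or `np.mean` (the `pool2d` helper name is illustrative):

```python
import numpy as np

def pool2d(x, r, t, sh, sw, pool_op=np.max):
    """Pooling of Eq. 2.34 with an r x t window and strides sh, sw."""
    n, m = x.shape
    out_h = (n - r) // sh + 1
    out_w = (m - t) // sw + 1
    o = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            o[i, j] = pool_op(x[i * sh:i * sh + r, j * sw:j * sw + t])
    return o

x = np.array([[ 1.,  2.,  3.,  4.],
              [ 5.,  6.,  7.,  8.],
              [ 9., 10., 11., 12.],
              [13., 14., 15., 16.]])
print(pool2d(x, 2, 2, 2, 2))           # max pooling: [[6, 8], [14, 16]]
print(pool2d(x, 2, 2, 2, 2, np.mean))  # average pooling: [[3.5, 5.5], [11.5, 13.5]]
```

A 2×2 window with stride 2 halves each spatial dimension, which is the most common configuration in practice.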
Fully-Connected Layer
This kind of layer is just like the traditional fully-connected feed-forward network where each neuron at the current layer is connected to all neurons in the previous layer. Given an input vector x ∈ Rn, the output is computed as:
o = σ(Wx + b), (2.35) where σ is the activation function, W ∈ Rm×n, b ∈ Rm are weight matrix and bias vector, respectively.
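Eq. 2.35 amounts to a single matrix-vector product followed by a nonlinearity; a minimal NumPy sketch (the helper name and the choice of tanh for σ are illustrative):

```python
import numpy as np

def fully_connected(x, W, b, sigma=np.tanh):
    """Eq. 2.35: o = sigma(W x + b), with W in R^{m x n} and b in R^m."""
    return sigma(W @ x + b)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))  # m = 3 output neurons, n = 5 inputs
b = np.zeros(3)
x = rng.standard_normal(5)
o = fully_connected(x, W, b)
print(o.shape)  # (3,)
```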
Figure 2.11: RNN architecture and its unfolded structure through time.
2.2.2 Recurrent Neural Network
Recurrent Neural Networks [104] and their variants have been widely adopted in many fields, from Natural Language Processing and Speech Recognition to Computer Vision, due to their effectiveness in handling sequence data such as text, audio, and video. Unlike in traditional FFNNs or CNNs, where the output prediction depends only on the input at the same time step, in an RNN the state from the previous step is used along with the current input for making a prediction. More concretely, in RNNs, hidden layers consist of recurrent cells whose state is updated based on both the past state and the current input (refer to Fig. 2.11 for a depiction of the RNN architecture and its unfolded structure). There are several types of RNNs, designed to suit the input and output types of different tasks. For example, a "one-to-many" RNN is designed for the task of image captioning, where the input is an image (a fixed-length vector) and the output is a text describing the content of the image (a variable-length sequence). Four main types of RNNs along with their applications are summed up in Fig. 2.12. The idea behind RNNs is that a sequence of vectors can be processed by applying a recurrence formula at every time step:
ht = fW (ht−1, xt),

where the same function fW, along with the same set of parameters, is shared across every time step.
In particular, given an input sequence x = x1, x2, .., xn, the corresponding output sequence y = y1, y2, .., yn is computed as below:
ht = σh(Wxxt + Whht−1 + bh), (2.36)
yt = σo(Wyht + by), (2.37)

where xt, yt denote the input and output at time step t, ht, ht−1 are the hidden states at time steps t and t − 1, Wx, Wh, Wy and bh, by are the weight matrices and bias vectors, respectively, and σh, σo are activation functions.
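Eqs. 2.36–2.37 can be unrolled over a sequence in a few lines of NumPy. The sketch below assumes tanh for σh and the identity for σo; all variable names are illustrative:

```python
import numpy as np

def rnn_forward(xs, Wx, Wh, Wy, bh, by, h0):
    """Apply Eqs. 2.36-2.37 step by step over an input sequence xs."""
    h, ys = h0, []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h + bh)  # Eq. 2.36, sigma_h = tanh
        ys.append(Wy @ h + by)             # Eq. 2.37, sigma_o = identity
    return np.array(ys), h

d_in, d_h, d_out = 4, 8, 2
rng = np.random.default_rng(1)
xs = rng.standard_normal((10, d_in))          # a sequence of 10 input vectors
Wx = rng.standard_normal((d_h, d_in)) * 0.1
Wh = rng.standard_normal((d_h, d_h)) * 0.1
Wy = rng.standard_normal((d_out, d_h)) * 0.1
ys, h = rnn_forward(xs, Wx, Wh, Wy, np.zeros(d_h), np.zeros(d_out), np.zeros(d_h))
print(ys.shape)  # (10, 2): one output per time step
```

Note that the same Wx, Wh, Wy are reused at every step, which is exactly the parameter sharing expressed by the recurrence formula above.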
Figure 2.12: Different types of Recurrent Neural Networks: (a) one-to-many RNN (e.g., Image Captioning, Music Generation); (b) many-to-one RNN (e.g., Sentiment Analysis, Document Classification); (c) symmetrical many-to-many RNN (e.g., Part of Speech tagging, NER); (d) asymmetrical many-to-many RNN (e.g., Machine Translation, Question Answering).
Backpropagation Through Time
Backpropagation is one of the simplest and most popular optimization algorithms for training neural networks. However, since an RNN has a special architecture working with sequence data, this algorithm has been extended to suit RNNs, under the name Backpropagation Through Time (BPTT) [88]. Generally, this algorithm first unfolds the network in time, thereby converting the recurrent neural network into an equivalent feed-forward neural network, and then propagates the errors backward through time. Let's see how the weight matrix W is updated through time (see Fig. 2.13), starting from the original weight update formula:

W \leftarrow W - \alpha \frac{\partial L}{\partial W}, \quad (2.38)

where W, L, α denote the recurrent weight matrix, the loss, and the learning rate, respectively. The gradient ∂L/∂W is obtained by summing over the per-step losses Lt:

\frac{\partial L}{\partial W} = \sum_{t=1}^{n} \frac{\partial L_t}{\partial W}, \quad (2.39)

and ∂Lt/∂W is computed through the hidden states hk:

\frac{\partial L_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial L_t}{\partial h_k} \frac{\partial h_k}{\partial W}. \quad (2.40)
Figure 2.13: Backpropagation through time in RNN.
Since Lt does not depend explicitly on hk, the chain rule is used to compute ∂Lt/∂hk indirectly:

\frac{\partial L_t}{\partial h_k} = \frac{\partial L_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} = \frac{\partial L_t}{\partial y_t} \frac{\partial y_t}{\partial h_t} \frac{\partial h_t}{\partial h_k}, \quad (2.41)

The chain rule is applied again to compute ∂ht/∂hk:

\frac{\partial h_t}{\partial h_k} = \prod_{l=k+1}^{t} \frac{\partial h_l}{\partial h_{l-1}}. \quad (2.42)

The final backpropagation equation is:

\frac{\partial L}{\partial W} = \sum_{t=1}^{n} \sum_{k=1}^{t} \frac{\partial L_t}{\partial y_t} \frac{\partial y_t}{\partial h_t} \left( \prod_{l=k+1}^{t} \frac{\partial h_l}{\partial h_{l-1}} \right) \frac{\partial h_k}{\partial W}. \quad (2.43)
The Vanishing and Exploding Gradients
Let’s take a look back at the formula of hidden state at time step l:
hl = σ(Wxxl + Whhl−1 + bh). (2.44)
This can be rewritten as
al = Wxxl + Whhl−1 + bh, (2.45)
hl = σ(al). (2.46)
Hence the partial derivative of hl w.r.t. hl−1 in Eq. 2.43 can be computed using the chain rule:

\frac{\partial h_l}{\partial h_{l-1}} = \frac{\partial h_l}{\partial a_l} \frac{\partial a_l}{\partial h_{l-1}}. \quad (2.47)
Since al is a vector [al1, al2, . . . , ald] and hl = [σ(al1), σ(al2), . . . , σ(ald)] is an element-wise activation of al, the derivative ∂hl/∂al is a Jacobian matrix:

\frac{\partial h_l}{\partial a_l} =
\begin{bmatrix}
\frac{\partial h_{l1}}{\partial a_{l1}} & \frac{\partial h_{l2}}{\partial a_{l1}} & \cdots & \frac{\partial h_{ld}}{\partial a_{l1}} \\
\frac{\partial h_{l1}}{\partial a_{l2}} & \frac{\partial h_{l2}}{\partial a_{l2}} & \cdots & \frac{\partial h_{ld}}{\partial a_{l2}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial h_{l1}}{\partial a_{ld}} & \frac{\partial h_{l2}}{\partial a_{ld}} & \cdots & \frac{\partial h_{ld}}{\partial a_{ld}}
\end{bmatrix}
=
\begin{bmatrix}
\sigma'(a_{l1}) & 0 & \cdots & 0 \\
0 & \sigma'(a_{l2}) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma'(a_{ld})
\end{bmatrix}
= \mathrm{diag}(\sigma'(a_l)), \quad (2.48)

where diag denotes a diagonal matrix. Next, from Eq. 2.45 we have:
\frac{\partial a_l}{\partial h_{l-1}} = W_h. \quad (2.49)

Finally,

\frac{\partial h_l}{\partial h_{l-1}} = \mathrm{diag}(\sigma'(a_l)) W_h. \quad (2.50)

Since σ' is a bounded function, ‖diag(σ'(al))‖ ≤ γ, where γ = 1/4 if we use the sigmoid activation function and γ = 1 in the case of tanh. Let λm be the largest singular value of Wh, and suppose λm < 1/γ:
∂hl 0 0 1 = diag(σ (al))Wh ≤ diag(σ (al)) kWhk < γ = 1, ∀l. (2.51) ∂hl−1 γ
Let β satisfy ‖∂hl/∂hl−1‖ ≤ β < 1 for all l. By using Eq. 2.41 in combination with Eq. 2.42, we have:

\left\| \frac{\partial L_t}{\partial h_k} \right\| = \left\| \frac{\partial L_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \right\| = \left\| \frac{\partial L_t}{\partial h_t} \prod_{l=k+1}^{t} \frac{\partial h_l}{\partial h_{l-1}} \right\| \le \left\| \frac{\partial L_t}{\partial h_t} \right\| \beta^{t-k}. \quad (2.52)
Eq. 2.52 shows that the long-term gradient components ∂Lt/∂hk (those for which t − k is large) decay dramatically toward 0, since β < 1. This phenomenon is called the vanishing gradient problem. By inverting the assumption, when the largest singular value of Wh satisfies λm > 1/γ, the gradients of some long-term components may explode. In practice, one way to avoid the vanishing/exploding gradient problem is Truncated Backpropagation Through Time, where the product in Eq. 2.42 is restricted to θ (< t − k) terms. This method also helps to reduce memory and computation costs, as well as processing time. Another way to deal with the exploding gradient problem is Gradient Clipping, introduced by Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio in [96]. The pseudo-code of this algorithm is shown in Algorithm 1. In 2015, Quoc V. Le, Navdeep Jaitly, and Geoffrey E. Hinton also introduced a simple way to initialize RNNs of Rectified Linear Units [95]. Generally, this is a vanilla RNN whose recurrent weights are initialized with the identity matrix, whose biases are set to zero, and whose hidden units use the ReLU activation function.
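The geometric decay in Eq. 2.52 can also be observed numerically. The sketch below multiplies Jacobians of the form diag(σ′(al))Wh for tanh activations while holding the spectral norm of Wh below 1; the pre-activations al are drawn at random purely for illustration, rather than computed from an actual forward pass:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
Wh = rng.standard_normal((d, d))
Wh *= 0.9 / np.linalg.norm(Wh, 2)  # force the largest singular value to 0.9 < 1/gamma

# Accumulate the product of 50 Jacobians diag(tanh'(a_l)) @ Wh (Eq. 2.50)
J = np.eye(d)
for l in range(50):
    a = rng.standard_normal(d)
    J = np.diag(1.0 - np.tanh(a) ** 2) @ Wh @ J  # tanh'(a) = 1 - tanh(a)^2

# By sub-multiplicativity the norm is bounded by 0.9**50 ~ 0.005: the gradient vanishes
print(np.linalg.norm(J, 2))
```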
Algorithm 1 Pseudo-code for norm clipping the gradients whenever they explode
1: ĝ ← ∂L/∂W
2: if ‖ĝ‖ ≥ threshold then
3:     ĝ ← (threshold / ‖ĝ‖) · ĝ
4: end if
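Algorithm 1 translates almost line for line into NumPy (the `clip_gradient` helper name is illustrative):

```python
import numpy as np

def clip_gradient(g, threshold):
    """Algorithm 1: rescale g to the threshold norm when its norm exceeds it."""
    norm = np.linalg.norm(g)
    if norm >= threshold:
        g = (threshold / norm) * g
    return g

g = np.array([3.0, 4.0])      # norm 5
print(clip_gradient(g, 1.0))  # rescaled to norm 1: [0.6, 0.8]
print(clip_gradient(g, 10.0)) # below the threshold, so unchanged: [3.0, 4.0]
```

Note that clipping preserves the direction of the gradient and only shrinks its magnitude, which is why it helps with exploding (but not vanishing) gradients.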
2.2.3 Long Short-Term Memory Cells
In theory, RNNs are able to capture "long-term dependencies" (Eq. 2.36). In practice, however, they do not work well, primarily because of the simple recurrent cell architecture: the history information is often forgotten after the unit activations have been multiplied several times by small numbers. In 1997, Sepp Hochreiter and Jürgen Schmidhuber introduced the Long Short-Term Memory network (LSTM, for short) [111], which has a more complex structure allowing it to better control long-term dependencies and mitigate the vanishing gradient problem. Unlike the RNN cell, which has only one activation function taking as input the history information and the current input, an LSTM cell has two major components:
• Cell state: a kind of information memory, to store information over time.
• Gates: use sigmoid/tanh functions to control adding new information to, or removing existing information from, the cell state.
There are three gates in the standard LSTM cell:
• Forget gate: Decide which part of information should be removed from the cell state. The decision is made by a sigmoid function that takes as input the previous hidden state and current input:
ft = σ(Wf · [ht−1, xt] + bf ) (2.53)
• Input gate: Update the cell state. This consists of three steps:
– Decide which part of new information should be added to the cell state:
it = σ(Wi · [ht−1, xt] + bi) (2.54)
– Create a vector of new candidate values:
C̄t = tanh(WC · [ht−1, xt] + bC) (2.55)
– Update the cell state:

Ct = ft ∗ Ct−1 + it ∗ C̄t (2.56)
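The gate computations in Eqs. 2.53–2.56 can be sketched for a single time step as follows, assuming the concatenated input [ht−1, xt] and illustrative weight names; the third gate of the standard cell (the output gate) is omitted here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, Wf, Wi, WC, bf, bi, bC):
    """Forget gate, input gate, and cell-state update of Eqs. 2.53-2.56."""
    z = np.concatenate([h_prev, x_t])  # [h_{t-1}, x_t]
    f_t = sigmoid(Wf @ z + bf)         # Eq. 2.53: forget gate
    i_t = sigmoid(Wi @ z + bi)         # Eq. 2.54: input gate
    C_bar = np.tanh(WC @ z + bC)       # Eq. 2.55: candidate values
    C_t = f_t * C_prev + i_t * C_bar   # Eq. 2.56: new cell state
    return C_t

d_h, d_in = 8, 4
rng = np.random.default_rng(2)
shape = (d_h, d_h + d_in)
Wf, Wi, WC = (rng.standard_normal(shape) * 0.1 for _ in range(3))
C_t = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h),
                Wf, Wi, WC, np.zeros(d_h), np.zeros(d_h), np.zeros(d_h))
print(C_t.shape)  # (8,)
```

Because the cell state is updated additively (ft ∗ Ct−1 + it ∗ C̄t) rather than being repeatedly squashed through an activation, gradients can flow through it over many steps, which is the key to mitigating the vanishing gradient problem described above.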