Federal State Autonomous Educational Institution of Higher Education «Moscow Institute of Physics and Technology (National Research University)»

On the rights of a manuscript

Le The Anh

DEEP NEURAL NETWORK MODELS FOR SEQUENCE LABELING AND COREFERENCE TASKS

Specialty 05.13.01 - «System analysis, control theory, and information processing (information and technical systems)»

A dissertation submitted in fulfillment of the requirements for the degree of Candidate of Technical Sciences

Supervisor: Candidate of Physical and Mathematical Sciences Burtsev Mikhail Sergeevich

Dolgoprudny - 2020

Федеральное государственное автономное образовательное учреждение высшего образования «Московский физико-технический институт (национальный исследовательский университет)»

На правах рукописи

Ле Тхе Ань

ГЛУБОКИЕ НЕЙРОСЕТЕВЫЕ МОДЕЛИ ДЛЯ ЗАДАЧ РАЗМЕТКИ ПОСЛЕДОВАТЕЛЬНОСТИ И РАЗРЕШЕНИЯ КОРЕФЕРЕНЦИИ

Специальность 05.13.01 – «Системный анализ, управление и обработка информации (информационные и технические системы)»

Диссертация на соискание учёной степени кандидата технических наук

Научный руководитель: кандидат физико-математических наук Бурцев Михаил Сергеевич

Долгопрудный - 2020

Contents

Abstract 4

Acknowledgments 6

Abbreviations 7

List of Figures 11

List of Tables 13

1 Introduction 14
1.1 Overview of Deep Learning 14
1.1.1 Artificial Intelligence, Machine Learning, and Deep Learning 14
1.1.2 Milestones in Deep Learning History 16
1.1.3 Types of Machine Learning Models 16
1.2 Brief Overview of Natural Language Processing 18
1.3 Dissertation Overview 20
1.3.1 Scientific Actuality of the Research 20
1.3.2 The Goal and Task of the Dissertation 20
1.3.3 Scientific Novelty 21
1.3.4 Theoretical and Practical Value of the Work in the Dissertation 21
1.3.5 Statements to be Defended 22
1.3.6 Presentations and Validation of the Research Results 23
1.3.7 Publications 23
1.3.8 Dissertation Structure 24

2 Deep Neural Network Models for NLP Tasks 26
2.1 Word Representation Models 26
2.1.1 Word Representation 26
2.1.2 Prediction-based Models 28
2.1.3 Count-based Models 34
2.2 Deep Neural Network Models 38
2.2.1 Convolutional Neural Network 38
2.2.2 Recurrent Neural Network 42
2.2.3 Long Short-Term Memory Cells 46
2.2.4 LSTM Networks 49
2.3 Pre-trained Language Models 50
2.3.1 ELMo 50
2.3.2 Transformer 52
2.3.3 OpenAI's GPT 56
2.3.4 Google's BERT 58
2.4 Summary 60

3 Sequence Labeling with Character-aware Deep Neural Networks and Language Models 62
3.1 Introduction to the Sequence Labeling Tasks 62
3.2 Related Work 63
3.2.1 Rule-based Models 64
3.2.2 Feature-based Models 64
3.2.3 Deep Learning-based Models 65
3.2.4 Related Work on Vietnamese Named Entity Recognition 67
3.2.5 Related Work on Russian Named Entity Recognition 68
3.3 Tagging Schemes 69
3.4 Evaluation Metrics 70
3.5 WCC-NN-CRF Models for Sequence Labeling Tasks 71
3.5.1 Backbone WCC-NN-CRF Architecture 71
3.5.2 Language Model-based Architecture 77
3.6 Application of WCC-NN-CRF Models for Named Entity Recognition 78
3.6.1 Overview of Named Entity Recognition Task 78
3.6.2 Datasets and Pre-trained Word Embeddings 78
3.6.3 Evaluation of backbone WCC-NN-CRF Model 80
3.6.4 Evaluation of ELMo-based WCC-NN-CRF model 85
3.6.5 Evaluation of BERT-based Multilingual WCC-NN-CRF Model 85
3.7 Application of WCC-NN-CRF Model for Sentence Boundary Detection 87
3.7.1 Introduction to the Sentence Boundary Detection Task 87
3.7.2 Sentence Boundary Detection as a Sequence Labeling Task 89
3.7.3 Evaluation of WCC-NN-CRF SBD Model 91
3.8 Conclusions 93

4 Coreference Resolution with Sentence-level Coreferential Scoring 95
4.1 The Coreference Resolution Task 95
4.2 Related Work 96
4.2.1 Rule-based Models 96
4.2.2 Deep Learning Models 97
4.3 Coreference Resolution Evaluation Metrics 100
4.4 Baseline Model Description 102
4.5 Sentence-level Coreferential Relation-based Model 104
4.6 BERT-based Coreference Model 108
4.7 Experiments and Results 108
4.7.1 Datasets 108
4.7.2 Evaluation of Proposed Models 110
4.8 Conclusions 115

5 Conclusions 117
5.1 Conclusions for Sequence Labeling Task 117
5.2 Conclusions for Coreference Resolution Task 120
5.3 Summary of the Main Contributions of the Dissertation 121

Bibliography 123

Abstract

Deep neural network models have recently received tremendous attention from both academia and industry and have achieved impressive results in a variety of domains, ranging from Speech Recognition to Natural Language Processing (NLP). They have significantly lifted the performance of machine learning-based systems to a whole new level, close to human-level performance. As a matter of course, the number of deep learning projects has also increased year by year. The IPavlov project1, based at the Neural Networks and Deep Learning Lab of the Moscow Institute of Physics and Technology (MIPT), is one of them, aiming at building a set of pre-trained network models, predefined dialogue system components, and pipeline templates. This thesis is based on the work carried out as a part of this project, focusing on studying deep neural network models to address the Sequence Labeling and Coreference Resolution tasks.

This thesis consists of three main parts. Firstly, we systematically synthesize three key concepts in the field of Deep Learning for NLP closely related to the work carried out in this thesis: (1) two approaches to word representation learning, (2) deep neural network models often used to address machine learning tasks in general and NLP tasks in particular, and (3) cutting-edge pre-trained language models and their applications in downstream tasks. Secondly, we propose three deep neural network models for Sequence Labeling tasks: (1) a hybrid model consisting of three sub-networks to fully capture character-level and capitalization features as well as word context features, (2) a language modeling-based model, and (3) a multilingual model. These proposed models were evaluated on the task of Named Entity Recognition. Experiments conducted on six datasets covering four languages (Russian, Vietnamese, English, and Chinese) showed that our models achieved state-of-the-art performance. Besides that, we reformulated the task of Sentence Boundary Detection as a Sequence Labeling task and used the proposed model to address it. The results obtained on two conversational datasets show that the proposed model achieves impressive accuracy. Thirdly, we propose two models for the Coreference Resolution task: (1) a Sentence-level Coreferential Relation-based model that can take as input a paragraph, a discourse, or even a document with hundreds of sentences and predict the coreference relations between sentences, and (2) a language modeling-based

1https://ipavlov.ai/

model that leverages the power of modern language models to boost performance. The experimental results on two Russian datasets and comparisons with other existing models show that our models obtained state-of-the-art results on both the Anaphora and Coreference Resolution tasks.

Acknowledgments

The pursuit of a Ph.D. is a long and challenging journey through unknown lands. I would like to express my sincere thanks to those who have accompanied and encouraged me on this journey. Without their support, this thesis would not have been completed. First and foremost, I would like to thank my supervisor Burtsev Mikhail Sergeevich, the Head of the Neural Networks and Deep Learning Lab2 at the Moscow Institute of Physics and Technology, who provided me with a good research environment and gave me the right directions during my Ph.D. study. I would also like to thank my colleagues Yura Kuratov, Mikhail Arkhipov, Valentin Malykh, Maxim Petrov, Alexey Sorokin, and the others in our lab for the highly fruitful and enlightening discussions we shared. Besides, I would like to express my thanks to the reviewers Artem Shelmanov, Natalia Lukashevich, and Aleksandr Panov (the chair of the pre-defense seminar) for their valuable comments and suggestions on my dissertation. Finally, this thesis is dedicated to my family for their love, mental encouragement, and for always being by my side.

This work was supported by National Technology Initiative and PAO Sberbank project ID 0000000007417F630002.

Dolgoprudny - 2020

2https://mipt.ru/science/labs/laboratoriya-neyronnykh-sistem-i-glubokogo-obucheniya/

Abbreviations

General notation

a.k.a. also known as

e.g. exempli gratia, meaning "for example"

et al. et alia, meaning "and others"

etc. et cetera, meaning "and the other things"

i.e. id est, meaning "in other words"

w.r.t. with respect to

Natural language processing

BERT Bidirectional Encoder Representations from Transformers model [44]

Bi-LM Bidirectional Language Model

BPE Byte Pair Encoding

CBOW Continuous Bag-of-Words model [123]

CEAF Constrained Entity Alignment F-measure

CR Coreference Resolution

CRF Conditional Random Field

ELMo Embeddings from Language Models [69]

GPT Generative Pretraining [14]

IOB Inside-Outside-Beginning tagging scheme

IOBES Inside-Outside-Beginning-End-Single tagging scheme

IOE Inside-Outside-End tagging scheme

LM Language Modeling

LR-MVL Low-Rank Multi-View Learning model [87]

LSA Latent Semantic Analysis

Masked LM Masked Language Modeling

MT Machine Translation

NER Named Entity Recognition

NLP Natural Language Processing

NPs Noun Phrases

OOV Out-Of-Vocabulary

POS Part-Of-Speech

QA Question Answering

SRL Semantic Role Labeling

WER Word Error Rate

Neural networks

AGI Artificial General Intelligence

AI Artificial Intelligence

ANI Artificial Narrow Intelligence

ANN Artificial Neural Network

Bi-LSTM Bi-directional Long Short-Term Memory

BPTT Back-Propagation Through Time

CNN Convolutional Neural Network

DL Deep Learning

FFNN Feed-Forward Neural Network

GRU Gated Recurrent Unit

HMM Hidden Markov Model

LSTM Long Short-Term Memory

ML Machine Learning

ReLU Rectified Linear Unit

RNN Recurrent Neural Network

Probability and statistics

CCA Canonical Correlation Analysis

PCA Principal Component Analysis

PPL Perplexity

SVD Singular Value Decomposition

T-SNE T-distributed Stochastic Neighbor Embedding

List of Figures

1.1 The Artificial Intelligence taxonomy. 15

2.1 Word meaning similarity visualization. 27
2.2 The most similar words to france − paris + moscow along with their cosine similarities, generated by GloVe embedding model. 28
2.3 Neural probabilistic language model architecture [145]. 29
2.4 Supervised multitasking model [102]. 31
2.5 Recurrent neural network-based language model [124]. 32
2.6 Continuous Bag-of-Words model [123]. 33
2.7 Continuous Skip-gram model [123]. 33
2.8 SVD Visualization. 36
2.9 Architecture of LeNet-5 [140]. 40
2.10 Connection architecture in a layer of neurons. 41
2.11 RNN architecture and its unfolded structure through time. 42
2.12 Different types of Recurrent Neural Networks. 43
2.13 Backpropagation through time in RNN. 44
2.14 LSTM cell at time step t. 47
2.15 Peephole LSTM cell. 48
2.16 A variant of LSTM cell with coupled input and forget gate. 48
2.17 GRU cell. 48
2.18 Standard LSTM network. 49
2.19 Stacked LSTM network. 50
2.20 Stacked Bidirectional LSTM network. 51
2.21 Bi-LSTM based model of ELMo. 52
2.22 Attention mechanism treated as a weighted average function. 53
2.23 Scaled Dot-Product Attention and Multi-Head Attention [14]. 54
2.24 High-level view of Transformer model. 55
2.25 GPT architecture. 56
2.26 Self Attention architecture proposed in [91]. 58
2.27 Fine-tuning GPT for downstream tasks. 59

3.1 Example of Sequence Labeling output space. 63
3.2 Proposed Model for the Sequence Labeling Task. 72
3.3 Character Features Extraction using CNN. 74
3.4 Capitalization Feature Extraction using Bi-LSTM Network. 76
3.5 BERT-based Multilingual NER Model. 78
3.6 Importance of different components for performance of backbone WCC-NN-CRF model across the datasets. 84
3.7 Five-run average tagging performance of the backbone WCC-NN-CRF with different amounts of training data across the datasets. 85
3.8 Five-run average tagging performance of variants of the backbone WCC-NN-CRF model on CoNLL-2003 dataset with different amounts of training samples. 86
3.9 Annotating step of a voice-enabled chatbot. 88
3.10 Example of sequence-to-sequence-based approach to the task of sentence boundary detection. 89
3.11 Example of sequence-labeling-based approach to the task of SBD. 89

4.1 Mention-pair Encoder and Cluster-pair Encoder [59]. 98
4.2 Coreference Resolution model with Deep Biaffine Attention by Joint Mention Detection and Mention Clustering [103]. 99
4.3 Baseline coreference resolution model [56]. 104
4.4 Visualization of coreference relationship between sentences. 105
4.5 Binary matrix representing the coreference relationship between sentences. 106
4.6 Sentence-level Coreferential Relation-based (SCRb) model. 107
4.7 BERT-based coreference model. 109
4.8 SCRb model's modifications M1-M7 performance on the OntoNotes 5.0. 112

List of Tables

1.1 Some milestones in DL history. 17
1.2 Some outstanding NLP achievements in the last twenty years. 19

2.1 Prediction-based word embedding models. 35
2.2 Count-based word embedding models. 39

3.1 Example of applying different tagging schemes to the same sentence. 70
3.2 Capitalization types of words. 75
3.3 Statistics of NER datasets. 80
3.4 Tagging performance of backbone WCC-NN-CRF model for Named Entity 5 and Named Entity 3 datasets. 81
3.5 Tagging performance of backbone WCC-NN-CRF model on Gareev's dataset using 5-fold cross-validation. 81
3.6 Tagging performance of backbone WCC-NN-CRF on Gareev's dataset with pre-training on NE3. 82
3.7 Tagging performance of backbone WCC-NN-CRF on VLSP-2016 dataset. 82
3.8 Tagging performance of backbone WCC-NN-CRF on CoNLL-2003 dataset. 82
3.9 Tagging performance of backbone WCC-NN-CRF on VLSP-2016 dataset compared with some state-of-the-art models. 83
3.10 Tagging performance of backbone WCC-NN-CRF on MSRA dataset. 83
3.11 F1-score of ELMo WCC-NN-CRF model on Russian datasets compared with some published models. 86
3.12 Tagging performance of ELMo WCC-NN-CRF on CoNLL-2003 dataset compared with some state-of-the-art models. 87
3.13 Tagging performance of Multilingual BERT-based WCC-NN-CRF model on VLSP-2016, CoNLL-2003, and NE3. 87
3.14 SBD dataset statistics. 92
3.15 The SBD model setup in detail. 92
3.16 Tagging performance of ELMo WCC-NN-CRF on the CornellMovie-Dialog dataset. 92
3.17 Tagging performance of ELMo WCC-NN-CRF on the DailyDialog dataset. 93

4.1 Illustration of global decisions in CR task. 96
4.2 Statistics on CR datasets. 109
4.3 Effect of sentence-level coreferential relations on the baseline SCRb model performance. 111
4.4 Testing results on OntoNotes dataset. 113
4.5 Testing results on the RuCor dataset using only raw text of baseline model + ELMo, and baseline model + SCRb. 113
4.6 Testing results on AnCor dataset with raw text. 114
4.7 Coreference track results on Dialog Conference 2019. * denotes result without the gold mentions. 114
4.8 Anaphora track results on Dialog Conference 2019. 115

5.1 Summarization of obtained results over different languages. 120

Chapter 1

Introduction

1.1 Overview of Deep Learning

This section provides a brief overview of Deep Learning, starting by clarifying some fundamental definitions of Artificial Intelligence, Machine Learning, and Deep Learning, as well as their taxonomy. Subsequently, some milestones in the field of AI are highlighted. Finally, the main types of machine learning models are briefly summarized.

1.1.1 Artificial Intelligence, Machine Learning, and Deep Learning

In 1950, Alan Turing, an English mathematician widely considered the father of computer science, who helped the Allied Forces win World War II by breaking the Enigma machine, published the famous paper titled "Computing Machinery and Intelligence" [3]. This paper considered the question "Can machines think?" and introduced the concept of what is now known as the Turing test. In this test, there are three participants: a human questioner, a human respondent, and a computer. During the test, the questioner poses a set of questions to both the human respondent and the computer and receives their answers. Based on the answers, the questioner decides which respondent is the computer. The computer's intelligence is measured by how difficult it is to make that decision. The paper, along with the Turing test, is arguably one of the first cornerstones of modern Artificial Intelligence. At its core, Artificial Intelligence is a sub-field of computer science aiming at making machines perceive, interpret, and act like human beings. Whenever a machine performs a task the way humans tend to do it, such intelligent behavior can be called "artificial intelligence". For instance, a camera in a car park can recognize a license plate number to decide whether to open the barrier and let the car in. In general, Artificial Intelligence can be divided into two categories:

• ANI, short for Artificial Narrow Intelligence (a.k.a. weak AI), is a kind of AI focused on performing a narrow task. Most artificial intelligence systems nowadays fall into ANI, such as personal assistants, customer service bots, spam filters, and recommendation systems.

Figure 1.1: The Artificial Intelligence taxonomy (Computer Science ⊃ Artificial Intelligence ⊃ Machine Learning ⊃ Deep Learning).

• Artificial General Intelligence (AGI, for short), or strong AI, is the next generation of AI, which refers to the ability of machines to learn multiple skills in different contexts and to apply the learned knowledge not only to the trained task but also to others. We can think of Jarvis in "Iron Man" and Artoo-Detoo in "Star Wars" as two examples of AGI. Of course, they belong to the future, not the present time. However, AGI has recently garnered attention from research academies and big companies as well. For instance, Microsoft has invested one billion USD to support building beneficial AGI through OpenAI1.

Machine Learning is considered a subset of Artificial Intelligence which, as the name indicates, aims at empowering machines with the ability to learn by themselves directly from given data, without being explicitly programmed, and to use the learned knowledge to make accurate decisions. Deep Learning is a deeper branch of Machine Learning which is inspired by the human biological neural network, comprising neurons working in parallel to convert input signals into the desired outcome. Deep Learning focuses on building end-to-end deep neural networks that are able to learn data representations directly from unstructured data such as natural language text, speech, or images, without hand-crafted feature extraction. The Artificial Intelligence taxonomy showing the relationships between these fields is visualized in Fig. 1.1.

1https://openai.com/blog/microsoft/

1.1.2 Milestones in Deep Learning History

With a history of about 70 years, Artificial Intelligence and its sub-fields are not new but remain highly attractive. The amount of related research has been constantly increasing2 year by year, both in quantity and quality. Artificial Intelligence in general and Machine Learning in particular have achieved a lot of success, especially after the advent of Deep Learning along with the availability of large-scale datasets and massive amounts of computational power. Since it is difficult, if not impossible, to condense all the Artificial Intelligence achievements into one section of a dissertation, Table 1.1 highlights only some milestones in Deep Learning history, selected according to our subjective opinion.

1.1.3 Types of Machine Learning Models

There are several ways to categorize machine learning models based on different criteria. However, these classifications are not exclusive due to the variety of machine learning models, and some of them could belong to more than one type. According to whether or not models are trained with human supervision, they can be subdivided into three categories: Supervised Learning, Unsupervised Learning, and Reinforcement Learning.

Supervised Learning refers to the class of models aiming at learning a mapping between input examples and the corresponding targets. Training examples, therefore, must consist of input-target pairs. Supervised Learning can be divided into two sub-classes: Classification and Regression. Classification involves assigning an object to one or more classes. If there are only two possible classes, the model is called a binary classification model; otherwise, it is a multi-class classification model. Different from classification models, regression models involve predicting numerical values. Some important supervised learning algorithms include k-Nearest Neighbors, Linear Regression, Logistic Regression, Support Vector Machines, as well as supervised neural networks.

Unsupervised Learning, in contrast with Supervised Learning, tries to extract hidden relationships from unlabeled data. In general, unsupervised learning algorithms can be grouped into three main types: (1) Clustering, referring to algorithms focusing on finding groups in data (e.g., k-Means, Hierarchical Cluster Analysis, Expectation Maximization), (2) Dimension Reduction, aiming to simplify the data without losing too much information, and (3) Association rule learning, whose goal is to discover interesting relations between data attributes (e.g., Apriori, Eclat).

Different from Supervised Learning and Unsupervised Learning, Reinforcement Learning is a family of goal-oriented models where an agent observes the state of the surrounding environment to choose an action, performs it, gets a reward along with a new state from the environment, and uses them for the next decision.

Year | Event | Description
1950 | Imitation Game | Alan Turing considered the question "Can machines think?" in [3] and introduced the Turing test, which is used to test whether a computer is capable of intelligence.
1956 | The Dartmouth AI conference | This conference is widely considered the birthplace of AI. Potential topics such as language and cognition, gaming, reasoning, and vision were discussed.
1957 | The Perceptron | Frank Rosenblatt introduced the Perceptron, which garnered a lot of attention from researchers and was widely covered in the media.
1966 | ELIZA | ELIZA was one of the earliest chatbots, developed by Joseph Weizenbaum at the MIT Artificial Intelligence Laboratory.
1970 | Reverse Mode of Differentiation | Seppo Linnainmaa, a Finnish mathematician and computer scientist, introduced a general method for automatic differentiation of discrete connected networks. Backpropagation, the widely used method for training neural networks, is considered a descendant of this method.
1997 | Deep Blue | This chess computer program developed by IBM [78], which could examine up to 200 million positions per second, defeated the chess master Garry Kasparov.
1997 | Long Short-Term Memory (LSTM) | LSTM cells, introduced by Sepp Hochreiter and Jürgen Schmidhuber [111], are widely used in recent neural network models.
2012 | Unsupervised learning-based model | Quoc Viet Le et al. [94] pointed out the usefulness of using unlabeled data in training neurons to be selective for high-level concepts.
2014 | DeepFace | Yaniv Taigman et al. [137] built a deep neural network combining a 3D model-based alignment with deep feed-forward neural networks and achieved an impressive accuracy of 97.35%.
2016 | AlphaGo | David Silver et al. from Google DeepMind introduced a Go program [22], named AlphaGo, which combines deep neural networks and tree search. AlphaGo defeated Lee Sedol, one of the greatest Go players, in 2016.

Table 1.1: Some milestones in DL history.

Reinforcement Learning is applied in a wide range of areas, from robotics, games, and finance to computer vision and natural language processing.

1.2 Brief Overview of Natural Language Processing

Humans traditionally communicate with computers using programming languages such as Java, C, or Python, which are structured, completely precise, and unambiguous. This is still an inconvenient way to interact. So how can computers be made to interact with humans in the way humans communicate with each other every day? The efforts to answer this question led to the formation of Natural Language Processing (NLP, for short). NLP is a branch of Computer Science, which nowadays relies heavily on ML algorithms, aiming to eliminate the gap in human-computer interaction. There is a variety of NLP tasks, which can be classified into four classes:

• Morphology: This class explores the stems of words, how words are formed, how they change depending on their context, etc. Tasks like Stemming, Gender Detection, Lemmatization, and Spellchecking belong to this class. Since the object of interest is the word, most of the models designed for these tasks operate at the word level.

• Syntactic analysis: The tasks in this class aim at finding the underlying structure of a sentence, exploiting the arrangement of words in a sentence, and building a valid sentence. Tasks in this class include Part-of-Speech tagging and dependency tree parsing.

• Semantic analysis: This class refers to the task of finding the meaning and interpretation of words. Named Entity Recognition, Relation Extraction, and Semantic Role Labeling are three examples of this class.

• Pragmatics: Different from the above tasks, which are often at the word level or sentence level, the tasks in this class consider the text as a whole. Some typical tasks in this class include Topic Modeling, Coreference Resolution, Text Summarization, and Question Answering.

Thanks to the development of Machine Learning, NLP has achieved many successes. Especially after the advent of Deep Learning, NLP has been lifted to a new level, closer to human-level performance. Table 1.2 highlights some outstanding achievements in NLP in the last twenty years.

Year | Achievement | Description
2001 | The first neural network-based language model | Bengio et al. [145] introduced the first neural network-based language model, which uses a look-up table to convert some preceding words to vectors and uses them to predict the next word. Such vectors are nowadays known as word vector representations.
2008 | Multitasking model | Collobert and Weston [102] proposed to build a model for several tasks which uses a shared word embedding to share common low-level information between tasks.
2013 | Word2Vec | Mikolov et al. [123] introduced a word representation model trained on large datasets.
2013 | Bidirectional Long Short-Term Memory for NLP | Graves et al. [7] achieved state-of-the-art performance on the phoneme recognition task by using an end-to-end Bidirectional Long Short-Term Memory model.
2014 | Convolutional Neural Networks for text | Convolutional Neural Networks, which are more parallelizable than RNNs, started to be applied to NLP tasks ([79], [143]).
2014 | Sequence to sequence model | Sutskever et al. [43] introduced the encoder-decoder architecture and applied it to the Machine Translation task.
2015 | Attention mechanism for Machine Translation | Bahdanau et al. [23] introduced the attention mechanism to alleviate the shortcoming of compressing an input sequence of arbitrary length into a fixed-size vector.
2017 | Attention-based model | Ashish Vaswani et al. [14] proposed the Transformer architecture, which relies solely on the attention mechanism.
2018 | Pre-trained language models | NLP tasks are approached in a new way, whereby the model is first trained on unsupervised or semi-supervised tasks using large corpora and then fine-tuned on a specific task (ELMo [70], BERT [44], GPT [5]).

Table 1.2: Some outstanding NLP achievements in the last twenty years.

1.3 Dissertation Overview

1.3.1 Scientific Actuality of the Research

Deep Learning has become one of the most exciting fields of science, attracting both academia and industry. Applying deep neural network models to natural language processing (NLP) tasks has become the mainstream approach, largely displacing traditional machine learning-based approaches. This dissertation focuses on studying deep neural network models to address the Sequence Labeling and Coreference Resolution tasks. Sequence Labeling is a kind of task that involves assigning a categorical label to each element of an observed sequence. This task plays an important role in the field of Natural Language Processing (NLP), since many NLP tasks are related to labeling sequence data. Typical tasks include Named Entity Recognition (NER), Part-of-Speech (POS) tagging, Speech Recognition, as well as Word Segmentation. Therefore, solving the Sequence Labeling task has been attracting a lot of attention from NLP researchers. Coreference Resolution is an NLP task aiming at finding all mentions that refer to the same entity in a text. This task plays an important role in many higher-level NLP tasks, including Question Answering, Text Summarization, and Information Extraction. However, it is one of the hardest tasks, since accurate prediction requires the model to deeply understand the meaning of the whole input text, which could be a couple of sentences or even a document with hundreds of sentences. So far, not many works have been published, and the achieved results are still not impressive compared with other NLP tasks. Hence, there remains large potential for better solutions.

1.3.2 The Goal and Task of the Dissertation

The main goal of this dissertation is to study supervised learning models, focusing on deep neural network models such as Bidirectional Long Short-Term Memory (Bi-LSTM), Bidirectional Gated Recurrent Unit (Bi-GRU), Convolutional Neural Network (CNN), Attention-based models, as well as state-of-the-art language models, to address Sequence Labeling and Coreference Resolution tasks. The major tasks in this PhD work include:

• Systematically review publications on the application of neural networks to sequence labeling and coreference resolution tasks, with a focus on recent supervised deep neural networks;

• Propose and implement new models for the sequence labeling task, with a focus on extracting useful features for word vector representations;

• Propose and implement new models for coreference resolution focusing on dealing with the long-term dependency problem;

• Apply the proposed models to three specific tasks: Named Entity Recognition, Sentence Boundary Detection, and Coreference Resolution;

• Based on the experiments conducted on the given tasks, analyse the performance of the proposed models.

1.3.3 Scientific Novelty

In this dissertation, we introduce a hybrid model for sequence labeling tasks which is different from previous approaches in:

• Generating rich semantic and syntactic word embeddings by combining (1) pre-trained word embedding, (2) character-level word embedding using a deep CNN network, and (3) additional features such as POS and Chunk;

• Encoding the capitalization features of the whole input sequence with a Bi-LSTM;

• Improving the word embedding quality by integrating and fine-tuning context-based word embeddings.

Application of the proposed model to the NER task achieves state-of-the-art performance on Russian and Vietnamese datasets (99.17% and 94.43% on the NE3 and VLSP-2016 datasets). The SBD task can be reformulated as a sequence labeling task and is handled well by the proposed model (89.99% and 95.88% F1 on the Cornell Movie-Dialog and DailyDialog datasets). In addition, in this dissertation we propose a new approach to the coreference resolution task which differs from previous approaches in:

• Enhancing mention detection by utilizing modern language models;

• Improving mention detection and mention clustering by learning sentence-level coreferential relations.

Application of the model with the sentence coreference module to the Russian language achieves a state-of-the-art result of 58.42% average F1 on the RuCor dataset.

1.3.4 Theoretical and Practical Value of the Work in the Dissertation

• The proposed sequence labeling models, including WCC-NN-CRF and ELMo WCC-NN-CRF, can be applied to several NLP tasks such as Named Entity Recognition, Part-of-Speech tagging, Chunking, and Sentence Boundary Detection.

• The WCC-NN-CRF and ELMo WCC-NN-CRF models are implemented in the open-source DeepPavlov framework2. NER models trained on Vietnamese and English datasets are

2https://deeppavlov.ai

available for download at https://github.com/deepmipt/DeepPavlov/tree/master/deeppavlov/configs/ner. These models are ready to use or can be further fine-tuned on specific domains.

• The SBD model trained on conversational datasets can be used in the annotating phase of a socialbot. The SBD model trained on the DailyDialog dataset can be found at https://github.com/deepmipt/DeepPavlov/tree/master/deeppavlov/configs/sentence_segmentation. This trained model was integrated into our socialbot, which was built to participate in the Alexa Prize - Socialbot Grand Challenge 3.

• The Sentence-level Coreferential Relation-based model can be used to extract sentence relationships in a document-level context and has promising potential not only for the coreference resolution task but also for question answering, relation extraction, or any other task that needs information about the relationships between sentences.

1.3.5 Statements to be Defended

• A WCC-NN-CRF model for the sequence labeling task which utilizes (1) a CNN to generate vector representations of words from their characters and (2) a Bi-LSTM to encode the capitalization features of an input sequence. This model achieved state-of-the-art performance on Vietnamese and Russian datasets with 98.21% and 94.43% on NE3 and VLSP-2016;

• Extensions of WCC-NN-CRF with context-based word vector representations generated by modern language models. This model obtained cutting-edge performance on Russian and English datasets with 99.17%, 92.91%, and 92.27% F1 on the NE3, Gareev's, and CoNLL-2003 datasets;

• The Sentence Boundary Detection task can be reformulated as a sequence labeling task and addressed by the proposed sequence labeling models, with impressive results of 89.99% and 95.88% on the Cornell Movie-Dialog and DailyDialog datasets;

• Sentence-level Coreferential Relation-based (SCRb) model to extract the sentence relationship in the coreference context;

• Coreference resolution models extended with SCRb obtained a state-of-the-art result of 58.42% average F1 on the RuCor dataset.

22 1.3.6 Presentations and Validation of the Research Results

The main findings and contributions of the dissertation were presented and discussed at four conferences:

• Artificial Intelligence and Natural Language Conference, ITMO University, Saint Pe- tersburg, Russia, September 20-23, 2017.

• The 2nd Asia Conference on Machine Learning and Computing, Ton Duc Thang Uni- versity, Ho Chi Minh, Viet Nam, December 7 - 9, 2018.

• The 25th International Conference on Computational Linguistics and Intellectual Technologies, Russian State University for the Humanities, Moscow, Russia, May 29 - June 01, 2019.

• The 4th International Conference on Machine Learning and Soft Computing, Mercure, Haiphong, Vietnam, January 17-19, 2020.

1.3.7 Publications

The proposed models applied to NER, SBD, and Coreference Resolution tasks along with the achieved results were published in five papers, out of which the first four papers were indexed by SCOPUS:

1. Le T.A., Arkhipov M. Y., Burtsev M. S., "Application of a Hybrid Bi-LSTM-CRF Model to the Task of Russian Named Entity Recognition". In: Filchenkov A., Pivovarova L., Zizka J. (eds) Artificial Intelligence and Natural Language. AINL 2017. Communications in Computer and Information Science, pp. 91–103. 2017. Url: https://link.springer.com/chapter/10.1007/978-3-319-71746-3_8

2. Le T. A. and Burtsev M. S., “A Deep Neural Network Model for the Task of Named Entity Recognition”. In: International Journal of Machine Learning and Computing. Vol. 9. 1., pp. 8–13. 2019. Url: http://www.ijmlc.org/vol9/758-ML0025.pdf

3. Le T. A., Kuratov Y. M. , Petrov M. A., Burtsev M. S., “Sentence Level Representation and Language Models in the task of Coreference Resolution for Russian”. In: 25th International Conference on Computational Linguistics and Intellectual Technologies. 2019. Url: http://www.dialog-21.ru/media/4609/letaplusetal-160.pdf

4. Le T. A., “Sequence Labeling Approach to the Task of Sentence Boundary Detection”. In: Proceedings of the 4th International Conference on Machine Learning and Soft Computing. New York, NY, USA: Association for Computing Machinery, pp. 144–148. 2020. Url: https://dl.acm.org/doi/10.1145/3380688.3380703

5. Yuri Kuratov, Idris Yusupov, Dilyara Baymurzina, Denis Kuznetsov, Daniil Cherniavskii, Alexander Dmitrievskiy, Elena Ermakova, Fedor Ignatov, Dmitry Karpov, Daniel Kornev, The Anh Le, Pavel Pugin, Mikhail Burtsev, "DREAM technical report for the Alexa Prize 2019". In: The 3rd Proceedings of Alexa Prize. 2020. Url: https://m.media-amazon.com/images/G/01/mobile-apps/dex/alexa/alexaprize/assets/challenge3/proceedings/Moscow-DREAM.pdf

The personal contribution of the author to the works with co-authors is as follows:

1. Development and implementation of a hybrid Bi-LSTM CRF model for solving the task of Named Entity Recognition in Russian.

2. Development and implementation of a character-aware WCC-NN-CRF model and its application to the task of Named Entity Recognition for four languages: Vietnamese, Russian, English, and Chinese.

3. Development and implementation of the Sentence-level Coreferential Relation-based (SCRb) model for determining the relationship of sentences in the context of corefer- ence; integration of the SCRb model into the baseline coreference model for solving the task of Coreference Resolution for Russian language.

5. Implementation and integration of the ELMo WCC-NN-CRF model for NER and SBD tasks in the socialbot architecture.

1.3.8 Dissertation Structure

The dissertation is organized into five chapters:

• Chapter 1 is an introductory chapter that first gives an overview of the dissertation, gives a brief overview of Deep Learning and Natural Language Processing, and then briefly summarizes the dissertation structure and contributions.

• Chapter 2 presents some major Deep Learning concepts that are closely related to the proposed models described in the next two chapters. The first is word embedding models, which are responsible for converting raw words into vectors that can be processed by neural networks. The second is the deep neural network models that are most commonly used for NLP tasks. The last is general-purpose language models, which can be applied to most NLP tasks to boost model performance; applying these models is easy and requires little effort.

• Chapter 3 describes the proposed models for the Sequence Labeling task and the experiments conducted on two tasks: Named Entity Recognition (NER) and Sentence Boundary Detection.

• Chapter 4 introduces a new approach to the task of Coreference Resolution, which aims at building a model to extract the sentence-level coreferential relationship. This model can be used in two ways: (1) the output of the model is used as an additional feature for the baseline model, or (2) the model is jointly trained with the baseline model.

• Chapter 5 summarizes the completed work, the achieved results, and the conclusions.

Chapter 2

Deep Neural Network Models for NLP Tasks

2.1 Word Representation Models

2.1.1 Word Representation

From the NLP perspective, words can be considered core elements that can occur in isolation and have meaning. Understanding word semantics is, therefore, the foundation for understanding larger units (e.g., phrases, sentences, documents). This has been found very useful for most, if not all, NLP tasks. Nevertheless, the task of learning word semantics is not easy. Unlike digital images, whose elements are integer numbers that can be directly fed into neural networks, raw text is categorical data that contains labeled values and usually1 needs to be represented in numeric form. Furthermore, an image pixel has no meaning when standing alone, but a word itself can have more than one meaning. Which meaning a word carries can only be determined by its context. This is one of the major challenges in building word representation models. To deal with this challenge, most of the proposed machine learning-based models represent words as distributional vectors (a.k.a. word embeddings), following the distributional hypothesis stating that "words with similar meaning tend to occur in a similar context" [146]. In simple terms, a word vector is a fixed-length vector of real numbers in which each number expresses a dimension of the word's meaning, and two words with similar meaning should be represented by similar vectors. A common measure for evaluating the similarity between two vectors is cosine similarity, which is derived from the Euclidean dot product:

\[ u \cdot v = \|u\| \, \|v\| \cos\theta \quad (2.1) \]

Given two n-dimensional vectors u and v, their cosine similarity is computed using their dot product and magnitudes:
\[ \mathrm{similarity}(u, v) = \frac{u \cdot v}{\|u\|\,\|v\|} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2}\,\sqrt{\sum_{i=1}^{n} v_i^2}} \quad (2.2) \]

1 Some algorithms can directly work with categorical data (e.g., decision trees).
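As a small illustration of Eq. 2.2, the following NumPy sketch computes the cosine similarity of toy word vectors; the three-dimensional vectors and the example words are assumptions made up for this snippet, not vectors from the experiments in this thesis.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two vectors, as in Eq. 2.2."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional "word vectors" (illustrative values only).
king = np.array([0.80, 0.65, 0.10])
queen = np.array([0.75, 0.70, 0.12])
apple = np.array([0.10, 0.20, 0.90])

print(cosine_similarity(king, queen))  # close to 1.0: similar meaning
print(cosine_similarity(king, apple))  # noticeably smaller
```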

Figure 2.1: Word meaning similarity visualization.

To see intuitively how well fine-tuned word vectors can capture word meaning, we visualize some word vectors generated by GloVe [46] using Principal Component Analysis (PCA) (see Fig. 2.1). The visualization shows that these vectors capture the similarity in the meaning of words well. Words with the same meaning, which usually appear in the same contexts, have similar vectors. An interesting property of word vectors obtained after training on large-scale data is that word analogies can be solved with vector arithmetic:
\[ \vec{france} - \vec{paris} + \vec{moscow} \approx \vec{russia} \quad (2.3) \]
Fig. 2.2 shows the most similar vectors to france − paris + moscow generated by the GloVe embedding model. It is surprising that the word russia has the highest similarity. To explain this interesting property of word vectors, several papers have been published ([13], [6]). In the recent paper [54], Kawin Ethayarajh et al. gave a good explanation of word analogy arithmetic for two popular word representation models, GloVe [46] and Skip-gram with negative sampling [122], without making strong assumptions about the distribution of word frequencies.

Figure 2.2: The most similar words to france − paris + moscow along with their cosine similarities, generated by the GloVe embedding model.

Motivated by the famous statement "You shall know a word by the company it keeps" [31], most word representation models try to extract the semantic meaning of a word from the context in which it appears. In general, there are two major approaches: (1) the prediction-based approach and (2) the count-based approach. The former utilizes neural networks to build a language model (LM) aiming at predicting the target word given its context. The context of a given word could be only several preceding words or both previous and next words. The neural networks used here could be recurrent neural networks (RNNs), convolutional neural networks (CNNs), or even simple feed-forward neural networks. The count-based approaches leverage global information at the corpus level to build word vector representations. Models following this approach often focus on building word co-occurrence matrices from the entire corpus and then use dimension reduction algorithms (e.g., SVD, PCA) along with similarity measurements to obtain word vector representations in a continuous low-dimensional space. Below, both of these approaches are systematically summarized.
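The analogy arithmetic of Eq. 2.3 and Fig. 2.2 amounts to a nearest-neighbour search by cosine similarity. The sketch below shows only the mechanics under a toy assumption: the embedding matrix is random, so the returned ranking is meaningless here, whereas with real GloVe vectors the word russia would be expected to rank first.

```python
import numpy as np

def most_similar(query, emb, vocab, exclude, top_k=3):
    """Rank vocabulary words by cosine similarity to the query vector."""
    q = query / np.linalg.norm(query)
    m = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    scores = m @ q
    order = np.argsort(-scores)
    return [(vocab[i], float(scores[i])) for i in order if vocab[i] not in exclude][:top_k]

vocab = ["france", "paris", "moscow", "russia", "berlin", "germany"]
rng = np.random.default_rng(0)
emb = rng.normal(size=(len(vocab), 50))          # random stand-in for pre-trained vectors
w = {word: emb[i] for i, word in enumerate(vocab)}

query = w["france"] - w["paris"] + w["moscow"]   # Eq. 2.3
print(most_similar(query, emb, vocab, exclude={"france", "paris", "moscow"}))
```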

2.1.2 Prediction-based Models

As mentioned above, most, if not all, prediction-based word embedding models are built for the LM task, in which the role of the first layer (i.e., the embedding layer) is to map a word to its vector representation. Thus, before diving into the prediction-based embedding models, the definition of the LM task should be clarified. The LM task aims at estimating the probability of a word sequence in a natural language,

\(P(w_1 w_2 \cdots w_T)\). This joint probability can be factorized into conditional probabilities:

\[ P(w_1 w_2 \cdots w_T) = \prod_{t=1}^{T} P(w_t \mid w_1 \cdots w_{t-1}), \quad (2.4) \]
here T denotes the length of the input sequence.

Since there is an enormous number of possible combinations of a word and its context (\(w_t\) and \(w_1 \cdots w_{t-1}\) in Eq. 2.4), in practice only several preceding words of the context are used to compute \(P(w_t \mid w_1 \cdots w_{t-1})\):

\[ P(w_1 w_2 \cdots w_T) = \prod_{t=1}^{T} P(w_t \mid w_1 \cdots w_{t-1}) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1} \cdots w_{t-1}), \quad (2.5) \]

Figure 2.3: Neural probabilistic language model architecture [145].

where \(t - n + 1\) must be positive and \(n - 1\) preceding words are considered. For instance, when n is set to 3 we have:

\[ P(w_1 w_2 \cdots w_T) \approx P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1 w_2)\,P(w_4 \mid w_2 w_3) \cdots P(w_T \mid w_{T-2} w_{T-1}). \]
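To make the truncated chain rule of Eq. 2.5 concrete, here is a minimal sketch that assembles the sentence log-probability from conditional probabilities restricted to at most n − 1 preceding words. The probability table and its values are hypothetical placeholders; a real language model would estimate them from data.

```python
import math

def sequence_log_prob(words, cond_prob, n=3):
    """log P(w_1 ... w_T) under an n-gram approximation: each word is
    conditioned on at most the n-1 preceding words (Eq. 2.5)."""
    total = 0.0
    for t, w in enumerate(words):
        context = tuple(words[max(0, t - (n - 1)):t])
        total += math.log(cond_prob(w, context))
    return total

# Hypothetical conditional probabilities; unseen events get a small constant.
table = {("cat", ("the",)): 0.2, ("sat", ("the", "cat")): 0.3}
cond_prob = lambda w, ctx: table.get((w, ctx), 1e-3)

print(sequence_log_prob(["the", "cat", "sat"], cond_prob, n=3))
```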

The goal of most language models is to maximize the log-likelihood (or, equivalently, minimize the negative log-likelihood), with or without additional constraints. One of the first such models, proposed by Bengio et al. in 2003 [145] and named the Neural Probabilistic Language Model, has the objective of maximizing the log-likelihood of \(P(w_t \mid w_{t-n+1}^{t-1})\). The model architecture is simple. The first layer maps each word to its vector representation in a low-dimensional continuous space. A hidden layer with the tanh activation function followed by a softmax layer produces the probabilities of the words in the vocabulary (refer to Figure 2.3 for more detail about the model architecture). The conducted experiments showed that this model achieved better results compared with n-gram models ([90], [97], [118]). The main weakness of Bengio et al.'s model [145] is that it needs a long time for training. To overcome this drawback, in 2005 Morin et al. proposed a novel architecture, named the Hierarchical Probabilistic Neural Network Language Model [33], for speeding up neural language models by using a binary tree structure. The idea is to organize the word vocabulary in a binary tree and utilize its structure to speed up normalization. Each word in the vocabulary is represented by a leaf. This technique helps to reduce the number of words

to choose from N down to O(log N), where N is the number of words in the vocabulary. From the probabilistic perspective, a sequence of O(log N)-way normalizations is used instead of an N-way normalization. The tree was built using the IS-A taxonomy directed acyclic graph from WordNet2. When evaluated on the Brown corpus, this model was remarkably faster than Bengio et al.'s model; however, its performance was a bit worse than the original model. In 2007, Mnih and Hinton proposed the Log-bilinear (LBL) Model [11], which is the foundation for many later models. In order to predict the next word \(w_n\) given the sequence of preceding words (a.k.a. the context) \(w_{1:n-1}\), the LBL model first linearly combines the feature vectors of the context words to create the predicted feature vector of the next word; then it uses the inner product to measure the similarity between the predicted feature vector and the feature vector of each word in the vocabulary; and finally it uses the softmax function to compute the distribution of the next word:

\[ \hat{r} = \sum_{i=1}^{n-1} C_i r_{w_i}, \quad (2.6) \]
\[ P(w_n = w \mid w_1^{n-1}) = \frac{e^{\hat{r}^{\top} r_w + b_w}}{\sum_j e^{\hat{r}^{\top} r_j + b_j}}, \quad (2.7) \]
where \(w_1^{n-1}\), C, and \(b_w\) denote the given context, the weight matrix, and the bias of word w, respectively. One year later, in 2008, Mnih and Hinton [10] modified the Hierarchical Probabilistic Neural Network Language Model [145] by using the Log-bilinear model to compute the probabilities at each node in the tree. Below is a short description of how a decision is made at each node (a small code sketch follows the list):

• Encode the path from the root node to the leaf where the word is located by a binary string d of left-right decisions.

• Give a feature vector for each non-leaf node.

• The probability of taking the left branch at the i-th node in the sequence is:

\[ P(d_i = 1 \mid q_i, w_1^{n-1}) = \sigma(\hat{r}^{\top} q_i), \quad (2.8) \]

here \(\hat{r}\) is computed as shown above in the Log-bilinear model, and \(q_i\) denotes the feature vector of the node.

• The probability of w being the next word is:

\[ P(w_n = w \mid w_1^{n-1}) = \prod_i P(d_i \mid q_i, w_1^{n-1}). \quad (2.9) \]

2 https://wordnet.princeton.edu/
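The sketch below ties Eq. 2.6 to Eqs. 2.8-2.9 under assumed toy dimensions and a made-up tree path: the predicted feature vector r̂ is the weighted combination of the context word vectors, and the word probability is the product of sigmoid branch decisions along the path to its leaf.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

dim, context_len = 16, 3                         # toy sizes (assumptions)
rng = np.random.default_rng(1)
R = rng.normal(size=(context_len, dim))          # feature vectors r_{w_i} of the context words
C = [rng.normal(size=(dim, dim)) for _ in range(context_len)]  # weight matrices C_i

# Eq. 2.6: predicted feature vector of the next word.
r_hat = sum(C[i] @ R[i] for i in range(context_len))

# Eqs. 2.8-2.9: word probability = product of branch decisions along a
# hypothetical path; q_i are node feature vectors, 1 = left, 0 = right.
path_nodes = rng.normal(size=(4, dim))
path_decisions = [1, 0, 1, 1]

p_word = 1.0
for q, d in zip(path_nodes, path_decisions):
    p_left = sigmoid(r_hat @ q)                  # Eq. 2.8
    p_word *= p_left if d == 1 else 1.0 - p_left
print(p_word)
```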

Figure 2.4: Supervised multitasking model [102].

The experiments showed that the model performance is comparable with Kneser-Ney n-gram models, and the speed is over 200 times faster than the previous Log-bilinear model. Also in 2008, Collobert and Weston proposed the multi-task model [102], in which word embeddings are utilized as a useful tool for downstream tasks such as Part-Of-Speech tagging (POS), Chunking, NER, as well as Semantic Role Labeling (SRL). They jointly trained these supervised tasks on labeled data and the LM task on unlabeled data, which is abundant and can be freely crawled by search engines. Word embeddings are themselves model parameters and are shared between tasks (Fig. 2.4). The model was trained to learn a binary classification task answering the question "Is the middle word of the input sequence related to its context?". The dataset, therefore, consists of fixed-size windows of text with two types of samples: (1) positive samples that are original windows from Wikipedia, and (2) negative samples made by replacing the middle word with a random word from the vocabulary. To avoid the expensive softmax normalization, they train the model to output a high score for a positive sample and a low score for a negative sample:

\[ J_\theta = \sum_{s \in S} \sum_{w \in V} \max\{0,\; 1 - f_\theta(s) + f_\theta(s^w)\}, \quad (2.10) \]

where S, V, \(f_\theta(\cdot)\), and \(s^w\) denote the windows of text, the word dictionary, the neural network with the set of parameters \(\theta\), and the incorrect version of the window in which the middle word is replaced by w, respectively. One inherent drawback of n-gram models, which has not been overcome by feed-forward neural network language models, is the fixed number of words in the context. In other words, the number of context words needs to be set before training. To this end, in 2010 Mikolov et al. proposed to apply a Recurrent Neural Network (RNN) to the LM task [124]. RNNs use recurrent connections to compress information about the preceding words into one vector, namely the context vector, and use it in combination with the input word to improve the prediction (Fig. 2.5). More formally, let w(t) and s(t − 1) be the vector of the current word at time step t and the output of the context layer at time step t − 1. The output is computed as follows:

31 Figure 2.5: Recurrent neural network-based language model [124].

\[ x(t) = w(t) + s(t-1), \quad (2.11) \]
\[ s_j(t) = f\Big(\sum_i x_i(t)\, u_{ji}\Big), \quad (2.12) \]
\[ y_k(t) = g\Big(\sum_j s_j(t)\, v_{kj}\Big), \quad (2.13) \]
where f(z) is the sigmoid function:
\[ f(z) = \frac{1}{1 + e^{-z}} \quad (2.14) \]
and g(z) is the softmax function:
\[ g(z_m) = \frac{e^{z_m}}{\sum_k e^{z_k}}. \quad (2.15) \]
In 2013, Mikolov et al. published a stunning paper in which they proposed a word representation model, namely Word2Vec [123]. The idea is to build a model with a loss function computed by measuring how well a word can predict its context words. In the paper, two model architectures are introduced, namely Continuous Bag-of-Words (CBOW) and Skip-gram. Unlike the previous language models, which predict the next word based only on the preceding words, the CBOW model uses both previous and next words to predict the target word. The name CBOW stems from using continuous vector representations and unordered words of the context. Thus, the loss function of CBOW is a bit different from [145]:
\[ J_\theta = \frac{1}{T} \sum_{t=1}^{T} \log p(w_t \mid w_{t-n}^{t-1}, w_{t+1}^{t+n}), \quad (2.16) \]
where \(w_j^i = w_j w_{j+1} \cdots w_i\). The model is similar to the Feed-forward Neural Network Language Model [145], except that it removes the non-linear hidden layer and shares the projection layer for all words (see Fig. 2.6). In contrast to CBOW, Skip-gram uses the middle word to predict the surrounding words (see Fig. 2.7).

Figure 2.6: Continuous Bag-of-Words model [123].

Figure 2.7: Continuous Skip-gram model [123].

Thus, the objective function of the Skip-gram model computes the loss by summing up the log probabilities of all the surrounding words:

\[ J_\theta = \frac{1}{T} \sum_{t=1}^{T} \sum_{-n \le j \le n,\, j \ne 0} \log p(w_{t+j} \mid w_t), \quad (2.17) \]

where \(p(w_{t+j} \mid w_t)\) is computed using the softmax function3:

\[ p(w_O \mid w_I) = \frac{\exp\big({v'_{w_O}}^{\top} v_{w_I}\big)}{\sum_{w=1}^{W} \exp\big({v'_w}^{\top} v_{w_I}\big)}, \quad (2.18) \]

here \(v_w\) and \(v'_w\) are vector representations of w, and W is the number of words in the vocabulary.

3 To reduce the computational cost, the hierarchical softmax is used instead of the full softmax.
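A minimal sketch of Eqs. 2.17-2.18, assuming randomly initialized input and output embedding matrices as stand-ins for trained Skip-gram parameters; it evaluates the full-softmax probability p(w_O | w_I) and the log-likelihood contribution of one centre word and its context window.

```python
import numpy as np

rng = np.random.default_rng(2)
W, d = 1000, 100                              # assumed vocabulary size and dimension
V_in = rng.normal(scale=0.1, size=(W, d))     # v_w : input ("center") vectors
V_out = rng.normal(scale=0.1, size=(W, d))    # v'_w: output ("context") vectors

def p_context_given_center(w_o: int, w_i: int) -> float:
    """Full-softmax probability p(w_O | w_I), Eq. 2.18."""
    logits = V_out @ V_in[w_i]                # scores v'_w^T v_{w_I} for all words
    logits -= logits.max()                    # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs[w_o])

# Skip-gram contribution of one center word with window n = 2 (Eq. 2.17).
center, context = 42, [40, 41, 43, 44]
log_lik = sum(np.log(p_context_given_center(c, center)) for c in context)
print(log_lik)
```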

Table 2.1 briefly summarizes the approaches of prediction-based language models and their main contributions.

2.1.3 Count-based Models

Unlike the predictive models mentioned before, which are based on neural network language models with the goal of predicting a target word given its context, count-based models produce word embeddings by exploiting the co-occurrence statistics of words in large text corpora. To do this, most of them use matrix factorization methods for rank reduction on a large term-frequency matrix to generate low-rank word vector representations. One of the first count-based models is Latent Semantic Analysis (LSA), authored by Deerwester et al. [109]. Let m and n denote the number of documents in the corpus and the number of terms, respectively. The model first builds a word-document co-occurrence matrix \(A_{m \times n}\) in which each row represents a document and each column represents a term. This results in a very high-dimensional, sparse, and noisy matrix. Thus, Singular Value Decomposition (SVD) is used to reduce the dimensionality of A into two thin matrices U and V (see Fig. 2.8 for a visualization; a brief code sketch is given after the list below), with the dimensionality typically ranging from around 100 to 300 features, condensing all important features into a small vector space:

\[ A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, (V_{n \times r})^{\top}, \quad (2.19) \]
here:

• U is the matrix of left singular vectors, of size (m documents, r concepts),

• Σ is the diagonal matrix of singular values, expressing the strength of each "concept",

• V is the matrix of right singular vectors, of size (n terms, r concepts).
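The brief sketch promised above applies the rank-r truncation of Eq. 2.19 to a random stand-in for the document-term count matrix; the matrix sizes and the choice of U·Σ and V·Σ as document and term representations are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, r = 200, 500, 100                           # documents, terms, retained concepts
A = rng.poisson(0.3, size=(m, n)).astype(float)   # stand-in for a count matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U diag(s) V^T
U_r, S_r, V_r = U[:, :r], np.diag(s[:r]), Vt[:r].T  # Eq. 2.19 truncation

doc_vectors = U_r @ S_r                           # r-dimensional document representations
term_vectors = V_r @ S_r                          # r-dimensional term representations

# Cosine similarity (Eq. 2.2) between two terms in the reduced "concept" space.
t0, t1 = term_vectors[0], term_vectors[1]
print(t0 @ t1 / (np.linalg.norm(t0) * np.linalg.norm(t1)))
```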

Once U, V are computed, similarity metrics (e.g., cosine similarity) can be applied to eval- uate the similarity of documents, terms. The model achieves a comparative performance compared to prediction-based models when the training data is small. In the opposite case, when a large training data is available, prediction-based models capture similarity in a better manner [24]. In 2011, Low Rank Multi-View Learning (LR-MVL) model was proposed by Dhillon et al. [87] which leverages the Canonical Correlation Analysis (CCA) between left and right contexts of a given word to find the common latent structure. The conducted experiments on NER and Chunking tasks showed that the model achieved state-of-the-art accuracy at that time. Another count-based model was introduced in 2014 by Lebret and Collobert [99], which also build a word co-occurrence probability matrix, but instead of using Euclidean PCA, the

34 Model Approach Main Results Feed-forward Neural Represent words in a low dimensional space; au- Better performance in Network Language tomatically cluster similar words together; train comparison with tradi- Model [145] neural network with backpropagation algorithm tional n-gram models Hierarchical Prob- Use hierarchical decomposition of the condi- An exponential speed-up abilistic Language tional probabilities to speed up model perfor- in both training and pre- Model [33] mance diction Log-bilinear model Train a neural network language model with a Remarkably faster than [11] log-bilinear energy function to model the word the one-hidden-layer context model from Bengio [145] Log-bilinear model Use the binary tree to organize words and com- Speed is over 200 times supported by Hi- pute the probabilities at each node by using log- faster than the previous erarchical softmax bilinear model model [11] [10] Supervised multi- Build a multi-tasking model sharing the same Produce a good word em- task model [102] word embedding; treat LM task as binary clas- bedding, even for uncom- sification task that tries to predict if the middle mon words and show that word is related to given context it is a very useful tool for many downstream tasks (e.g. POS, Chunking, NER, SRL) RNN-based Lan- Use RNN to deal with the limitation of fixed- Outperform backoff guage Model [124] length contexts models (Both PPL and WER of RNN model are much smaller than that of backoff models), even when backoff models were trained on 5 times more data than RNN model CBOW and Con- Build a fast and accurate language model, which Produce high-quality tinuous Skip-gram is able to use both left and right contexts to pre- word vectors capturing models [123] dict the target word, for learning high-quality both semantic relation- word vectors from large datasets ships and analogy of words

Table 2.1: Prediction-based word embedding models.

Figure 2.8: SVD Visualization.

Conducted experiments on the NER and IMDB review datasets showed that Hellinger PCA's word embeddings outperformed LR-MVL [87], the CW model from SENNA (http://www.nec-labs.com/senna/), and some other models. In the follow-up work [98], the authors studied several factors impacting the quality of the word embedding model. They concluded that (1) using a large context window improves capturing both the syntactic and semantic information of words, (2) a context dictionary of rare words enhances capturing semantics, but good performance can be obtained using just a fraction of the most frequent words, and (3) the stochastic low-rank approximation approach beats the traditional PCA approach. Also in 2014, a cutting-edge count-based model was introduced by Pennington et al. [46], namely GloVe (short for Global Vectors for Word Representation), which also learns word vectors by performing dimensionality reduction on the word-word co-occurrence matrix extracted from corpora. The authors first build the word-word co-occurrence matrix X whose element $X_{ij}$ expresses the number of times word j appears in the context of word i. After that, the probability of word j appearing in the context of word i is computed:

$P_{ij} = P(j \mid i) = \dfrac{X_{ij}}{X_i}$. (2.20)

By using empirical methods, the authors discovered that it makes more sense to learn word vector representations from ratios of co-occurrence probabilities rather than from the probabilities themselves; therefore, they introduced the general model form:

$F(w_i, w_j, \tilde{w}_k) = \dfrac{P_{ik}}{P_{jk}}$, (2.21)

where $w \in \mathbb{R}^d$ are word vectors and $\tilde{w} \in \mathbb{R}^d$ are separate context word vectors.

The vector difference is chosen to represent the ratio $P_{ik}/P_{jk}$ in the word vector space:

$F(w_i - w_j, \tilde{w}_k) = \dfrac{P_{ik}}{P_{jk}}$. (2.22)

The dot product is then used to convert the vector arguments on the left-hand side of Eq. 2.22 into a scalar, so that they match the scalar on the right-hand side:

$F\big((w_i - w_j)^\top \tilde{w}_k\big) = \dfrac{P_{ik}}{P_{jk}}$. (2.23)

Since, in the word-word co-occurrence matrix, the roles of a word and its context words are exchangeable, the function F is required to be a homomorphism between the groups $(\mathbb{R}, +)$ and $(\mathbb{R}_{>0}, \times)$:

$F(a + b) = F(a)F(b), \quad \forall a, b \in \mathbb{R}$. (2.24)

Since the left-hand side of Eq. 2.23 is defined in terms of a difference between $w_i$ and $w_j$, the additive inverse (i.e., subtraction) is used instead of addition, which corresponds to a multiplicative inverse on the right-hand side in order to preserve the homomorphism property:

$F(a - b) = \dfrac{F(a)}{F(b)}, \quad \forall a, b \in \mathbb{R}$. (2.25)

By distributing the dot product over the difference and applying this property, we get:

$F\big((w_i - w_j)^\top \tilde{w}_k\big) = \dfrac{F(w_i^\top \tilde{w}_k)}{F(w_j^\top \tilde{w}_k)}$, (2.26)

which can be combined with Eq. 2.23 and Eq. 2.20 to produce a simple equation:

$F(w_i^\top \tilde{w}_k) = P_{ik} = \dfrac{X_{ik}}{X_i}$, (2.27)

where $X_{ik}$ and $X_i$, as denoted before, are the number of times word k occurs in the context of word i and the number of times any word occurs in the context of word i, respectively. The best candidate function satisfying the above conditions is the exponential, $F = \exp$. Taking the natural logarithm of both sides of Eq. 2.27 produces a new equation:

$w_i^\top \tilde{w}_k = \log P_{ik} = \log X_{ik} - \log X_i$. (2.28)

We note that $X_{ik} = X_{ki}$, since the appearance of word i in the context of word k also means that word k appears in the context of word i. In addition, the term $\log X_i$ can be replaced by a bias $b_i$ since it is independent of k; adding a bias $\tilde{b}_k$ for the context word restores the symmetry, and the final symmetric form is:

$w_i^\top \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}$. (2.29)

The authors also introduce a new weighted least squares regression model:

$J = \displaystyle\sum_{i,j=1}^{V} f(X_{ij})\big(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\big)^2$, (2.30)

where V denotes the vocabulary size and f is the weighting function

$f(x) = \begin{cases} (x/x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise,} \end{cases}$ (2.31)

which satisfies three conditions: (1) it has a limit of 0 as x goes to 0, (2) it is non-decreasing, and (3) it is relatively small for large values of x. The model was evaluated on several tasks, including word analogy, word similarity, and named entity recognition, and achieved better results than the other models. Table 2.2 summarizes the count-based language models along with their main contributions.
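To make Eqs. 2.30–2.31 more concrete, the following is a minimal NumPy sketch of the GloVe weighted least-squares objective and its gradients; the array names, the toy co-occurrence counts, and the hyperparameter values (x_max, α, learning rate) are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def glove_loss_and_grads(X, W, W_tilde, b, b_tilde, x_max=100.0, alpha=0.75):
    """Objective of Eq. 2.30 with the weighting function f of Eq. 2.31.

    X          : (V, V) co-occurrence counts
    W, W_tilde : (V, d) word and context word vectors
    b, b_tilde : (V,) biases
    """
    # f(X_ij): (x / x_max)^alpha capped at 1; zero counts get zero weight (log 0 undefined)
    f = np.where(X < x_max, (X / x_max) ** alpha, 1.0)
    f[X == 0] = 0.0
    logX = np.log(np.where(X > 0, X, 1.0))
    R = W @ W_tilde.T + b[:, None] + b_tilde[None, :] - logX   # residuals
    J = np.sum(f * R ** 2)
    G = 2.0 * f * R                        # (V, V) weighted residuals
    grad_W       = G @ W_tilde             # dJ/dW
    grad_W_tilde = G.T @ W                 # dJ/dW~
    grad_b       = G.sum(axis=1)           # dJ/db
    grad_b_tilde = G.sum(axis=0)           # dJ/db~
    return J, (grad_W, grad_W_tilde, grad_b, grad_b_tilde)

# toy usage: random counts, one gradient step
rng = np.random.default_rng(0)
V, d = 50, 10
X = rng.poisson(1.0, size=(V, V)).astype(float)
W, Wt = rng.normal(scale=0.1, size=(V, d)), rng.normal(scale=0.1, size=(V, d))
b, bt = np.zeros(V), np.zeros(V)
J, (gW, gWt, gb, gbt) = glove_loss_and_grads(X, W, Wt, b, bt)
lr = 0.05
W, Wt, b, bt = W - lr * gW, Wt - lr * gWt, b - lr * gb, bt - lr * gbt
```

The final word representation is typically taken as the sum $W + \tilde{W}$ after training, but that choice is outside the scope of this sketch.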

2.2 Deep Neural Network Models

2.2.1 Convolutional Neural Network

In [21], Hubel and Wiesel described receptive field types of the monkey striate cortex, introducing the two concepts of "simple cell" and "complex cell". The authors showed that both of these cell types are used for pattern recognition. Motivated by this study, Fukushima introduced the Neocognitron [61] in 1980, a hierarchical neural network for pattern recognition that can be considered the predecessor of modern CNNs. This model has two components, "S-cells" and "C-cells", that use mathematical operations to mimic biological cells. A decade later, in the 1990s, Yann LeCun et al. introduced a multi-layer neural network called LeNet-5 [139] that can be trained using the backpropagation algorithm. The architecture of LeNet-5 is straightforward and consists of two convolutional layers and two average pooling layers, followed by two fully connected layers (see Fig. 2.9 for more details of the LeNet-5 architecture). Experiments conducted on the MNIST dataset5 showed that LeNet achieved the best result compared with many other methods, including a Linear Classifier and a Pairwise Linear Classifier, a Baseline Nearest Neighbor Classifier, Principal Component Analysis with a Polynomial Classifier, a Radial Basis Function Network, One-Hidden-Layer and Two-Hidden-Layer Fully Connected Multilayer Neural Networks, a Tangent Distance Classifier, and a Support Vector Machine. After the success of LeNet-5, numerous variants of CNN have been proposed, and their components differ accordingly. Nevertheless, in general, there are three main components that play key roles in the CNN architecture, namely the Convolutional Layer, the Pooling Layer, and the Fully-Connected Layer.

5A dataset of handwritten digits, available for download at http://yann.lecun.com/exdb/mnist/

• LSA [109]. Approach: build a term-document co-occurrence matrix and use SVD to perform dimension reduction. Main results: although this approach is mostly used for document representations or topic modeling, it is also used for word vector representations; the model performs quite well on small amounts of data.

• LR-MVL [87]. Approach: leverage CCA between the past and future views of the data to extract a common latent structure. Main results: the word embeddings produced by LR-MVL were reported to be better than those of [10], [102] and some others; achieves state-of-the-art results on NER and Chunking tasks.

• Hellinger PCA [99], [98]. Approach: apply a variant of PCA (Hellinger PCA) to the word co-occurrence matrix and study different factors impacting the quality of the word embeddings. Main results: outperformed the LR-MVL model [87], CW from SENNA, and some other models on the NER CoNLL-2003 and IMDB review datasets.

• GloVe [46]. Approach: combine the prediction-based and count-based approaches (global matrix factorization and local context window methods); learn word vector representations from ratios of co-occurrence probabilities instead of the probabilities themselves. Main results: achieved significantly better results in comparison with the others on word analogy, word similarity, and named entity recognition.

Table 2.2: Count-based word embedding models.

Figure 2.9: Architecture of LeNet-5 [140].

Convolutional Layer

The Convolutional Layer is considered the most important component; it aims to capture feature patterns from low to high levels of abstraction. This is done through the connection architecture inside the convolutional layers. In particular, a convolutional layer consists of several layers of neurons. Unlike in the traditional fully-connected neural network, where each neuron is connected to all neurons in the previous layer, here each neuron is only connected to the neurons located in a small rectangle, producing a local low-level feature. These features are then assembled into a feature map that captures higher-level, more abstract features. This architecture is depicted in Fig. 2.10. Neurons in the same layer share the same weights, called a filter. More formally, let x be an input matrix with n rows and m columns, and f be a filter with r rows and t columns. The output of the neuron located at the ith row and jth column is computed as below:

$o[i, j] = \displaystyle\sum_{u=0}^{r-1}\sum_{v=0}^{t-1} x[i \cdot s_h + u,\, j \cdot s_w + v]\cdot f[u, v]$, (2.32)

where $s_h$, $s_w$ are the vertical and horizontal strides. Together with $s_h$ and $s_w$, the filter dimensions r, t decide which neurons in the previous layer are connected to the target neuron in the current layer. Let's take Fig. 2.10 as an example. Suppose the filter size is 4×3 and both horizontal and vertical strides are 1. The output of the last neuron, located at the 9th row and 7th column, is computed as follows:

o[9, 7] = x[9, 7] ∗ f[0, 0] + x[10, 7] ∗ f[1, 0] + x[11, 7] ∗ f[2, 0] + x[12, 7] ∗ f[3, 0]
        + x[9, 8] ∗ f[0, 1] + x[10, 8] ∗ f[1, 1] + x[11, 8] ∗ f[2, 1] + x[12, 8] ∗ f[3, 1]
        + x[9, 9] ∗ f[0, 2] + x[10, 9] ∗ f[1, 2] + x[11, 9] ∗ f[2, 2] + x[12, 9] ∗ f[3, 2].

In the more general case, when the input has one more dimension, called the channel dimension (e.g., an RGB image), the filter also needs one more dimension, and Eq. 2.32 is generalized to:

$o[i, j] = \displaystyle\sum_{u=0}^{r-1}\sum_{v=0}^{t-1}\sum_{k=0}^{c-1} x[i \cdot s_h + u,\, j \cdot s_w + v,\, k]\cdot f[u, v, k]$, (2.33)

Figure 2.10: Connection architecture in a layer of neurons.

where c denotes the number of channels. The Convolutional Layer uses a set of filters to transform an input into a set of feature maps. During the training stage, the weights of the filters are fine-tuned so that they produce feature maps useful for the specific task. The repeated use of convolutional layers creates increasingly complex patterns; this process is part of representation learning.
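As a concrete illustration of Eq. 2.33, here is a minimal NumPy sketch of a single-filter convolution over a multi-channel input; the function name, the "valid"-padding choice, and the toy shapes are illustrative assumptions.

```python
import numpy as np

def conv2d_single_filter(x, f, s_h=1, s_w=1):
    """Direct implementation of Eq. 2.33: one filter, multi-channel input, valid padding.

    x : (n, m, c) input
    f : (r, t, c) filter
    """
    n, m, c = x.shape
    r, t, _ = f.shape
    out_h = (n - r) // s_h + 1
    out_w = (m - t) // s_w + 1
    o = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # element-wise product of the filter with the receptive field, summed over u, v, k
            patch = x[i * s_h:i * s_h + r, j * s_w:j * s_w + t, :]
            o[i, j] = np.sum(patch * f)
    return o

# toy usage: a 12x9 three-channel input and a 4x3 filter with stride 1,
# matching the neighbourhood sizes used in the example around Fig. 2.10
rng = np.random.default_rng(0)
x = rng.normal(size=(12, 9, 3))
f = rng.normal(size=(4, 3, 3))
print(conv2d_single_filter(x, f).shape)  # (9, 7)
```

In a real convolutional layer there are many such filters, each producing one feature map; deep learning frameworks additionally vectorize these loops for efficiency.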

Pooling Layer

The key role of a pooling layer is to reduce the spatial dimensions of the feature maps. Each neuron in a pooling layer is also connected to a small group of neurons in the previous layer (as in the convolutional layer). The output of the neuron located at the ith row and jth column is computed as:

$o[i, j] = \text{pool\_op}\big(x[i \cdot s_h + u,\, j \cdot s_w + v]\big),\quad u = 0, \dots, r - 1,\ v = 0, \dots, t - 1$, (2.34)

where pool_op denotes the pooling operation (e.g., max pooling or average pooling), $s_h$, $s_w$ are the vertical and horizontal strides, and r, t are the height and width of the pooling filter.
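A minimal sketch of Eq. 2.34 with pool_op = max follows; the 2×2 window with stride 2 and the toy feature map are illustrative assumptions (average pooling is obtained by replacing `.max()` with `.mean()`).

```python
import numpy as np

def max_pool2d(x, r=2, t=2, s_h=2, s_w=2):
    """Eq. 2.34 with pool_op = max: slide an r x t window with strides (s_h, s_w)."""
    n, m = x.shape
    out_h = (n - r) // s_h + 1
    out_w = (m - t) // s_w + 1
    o = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            o[i, j] = x[i * s_h:i * s_h + r, j * s_w:j * s_w + t].max()
    return o

# a 4x4 feature map shrinks to 2x2
print(max_pool2d(np.arange(16.0).reshape(4, 4)))
```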

Fully-Connected Layer

This kind of layer is just like the traditional fully-connected feed-forward network where each neuron at the current layer is connected to all neurons in the previous layer. Given an input vector x ∈ Rn, the output is computed as:

$o = \sigma(Wx + b)$, (2.35)

where σ is the activation function, and $W \in \mathbb{R}^{m\times n}$, $b \in \mathbb{R}^m$ are the weight matrix and bias vector, respectively.

Figure 2.11: RNN architecture and its unfolded structure through time.

2.2.2 Recurrent Neural Network

Recurrent Neural Networks [104] and their variants have been widely adopted in many fields, from Natural Language Processing and Speech Recognition to Computer Vision, due to their effectiveness in handling sequence data such as text, audio, and video. Unlike in traditional FFNNs or CNNs, where the output prediction depends only on the input at the same time step, in an RNN the output of the previous step is used along with the current input to make a prediction. More concretely, in RNNs, hidden layers consist of recurrent cells whose state is updated based on both the past state and the current input (refer to Fig. 2.11 for a depiction of the RNN architecture and its unfolded structure). There are several types of RNNs designed to suit the input and output types of different tasks. For example, a "one-to-many" RNN is designed for the task of image captioning, where the input is an image represented by a fixed-length vector and the output is a text, a variable-length sequence describing the content of the image. Four main types of RNNs along with their applications are summed up in Fig. 2.12. The idea behind RNNs is that a sequence of vectors can be processed by applying a recurrence formula at every time step:

$h_t = f_W(h_{t-1}, x_t)$,

where the same function along with the same set of parameters is shared across all time steps.

In particular, given an input sequence $x = x_1, x_2, \dots, x_n$, the corresponding output sequence $y = y_1, y_2, \dots, y_n$ is computed as below:

$h_t = \sigma_h(W_x x_t + W_h h_{t-1} + b_h)$, (2.36)

$y_t = \sigma_o(W_y h_t + b_y)$, (2.37)

where $x_t$, $y_t$ denote the input and output at time step t; $h_t$, $h_{t-1}$ are the hidden states at time steps t and t − 1; $W_x$, $W_h$, $W_y$ and $b_h$, $b_y$ are the weights and biases, respectively; and $\sigma_h$, $\sigma_o$ are activation functions.

(a) One-to-many RNN (e.g., Image Captioning, Music Generation). (b) Many-to-one RNN (e.g., Sentiment Analysis, Document Classification).

(c) Symmetrical many-to-many RNN (e.g., Part of Speech, NER). (d) Asymmetrical many-to-many RNN (e.g., Machine Translation, Question Answering).

Figure 2.12: Different types of Recurrent Neural Networks.
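The following is a minimal NumPy sketch of the recurrence in Eqs. 2.36–2.37, applied to a toy sequence; the tanh and softmax choices for $\sigma_h$ and $\sigma_o$, and all array names and dimensions, are illustrative assumptions.

```python
import numpy as np

def rnn_forward(xs, W_x, W_h, W_y, b_h, b_y):
    """Vanilla RNN: h_t = tanh(W_x x_t + W_h h_{t-1} + b_h), y_t = softmax(W_y h_t + b_y)."""
    h = np.zeros(W_h.shape[0])                 # h_0
    hs, ys = [], []
    for x_t in xs:
        h = np.tanh(W_x @ x_t + W_h @ h + b_h)
        o = W_y @ h + b_y
        y = np.exp(o - o.max())
        y /= y.sum()                           # softmax output
        hs.append(h)
        ys.append(y)
    return hs, ys

# toy usage: input dim 5, hidden dim 8, output dim 3, sequence length 4
rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 5, 8, 3, 4
params = (rng.normal(scale=0.1, size=(d_h, d_in)),   # W_x
          rng.normal(scale=0.1, size=(d_h, d_h)),    # W_h
          rng.normal(scale=0.1, size=(d_out, d_h)),  # W_y
          np.zeros(d_h), np.zeros(d_out))            # b_h, b_y
xs = rng.normal(size=(T, d_in))
hs, ys = rnn_forward(xs, *params)
```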

Backpropagation Through Time

Backpropagation is one of the simplest and most popular optimization algorithms for training neural networks. However, since an RNN has a special architecture that works with sequence data, the algorithm had to be extended to suit RNNs; the result is called Backpropagation Through Time (BPTT) [88]. Generally, this algorithm first unfolds the network in time, converting the recurrent neural network into an equivalent feed-forward neural network, and then propagates the errors backward through time. Let us see how the weight matrix W is updated through time (see Fig. 2.13), starting from the original weight update formula:

$W \leftarrow W - \alpha \dfrac{\partial L}{\partial W}$, (2.38)

where W, L, α denote the recurrent weight matrix, the loss, and the learning rate, respectively. The gradient $\frac{\partial L}{\partial W}$ is determined as a summation over the per-step losses $L_t$:

$\dfrac{\partial L}{\partial W} = \displaystyle\sum_{t=1}^{n} \dfrac{\partial L_t}{\partial W}$, (2.39)

and $\frac{\partial L_t}{\partial W}$ is computed through the hidden states $h_k$:

$\dfrac{\partial L_t}{\partial W} = \displaystyle\sum_{k=1}^{t} \dfrac{\partial L_t}{\partial h_k}\dfrac{\partial h_k}{\partial W}$. (2.40)

Figure 2.13: Backpropagation through time in RNN.

Since $L_t$ does not depend explicitly on $h_k$, the chain rule is used to compute $\frac{\partial L_t}{\partial h_k}$ indirectly:

$\dfrac{\partial L_t}{\partial h_k} = \dfrac{\partial L_t}{\partial h_t}\dfrac{\partial h_t}{\partial h_k} = \dfrac{\partial L_t}{\partial y_t}\dfrac{\partial y_t}{\partial h_t}\dfrac{\partial h_t}{\partial h_k}$, (2.41)

and the chain rule is used again to compute $\frac{\partial h_t}{\partial h_k}$:

$\dfrac{\partial h_t}{\partial h_k} = \displaystyle\prod_{l=k+1}^{t} \dfrac{\partial h_l}{\partial h_{l-1}}$. (2.42)

The final backpropagation equation is:

$\dfrac{\partial L}{\partial W} = \displaystyle\sum_{t=1}^{n}\sum_{k=1}^{t} \dfrac{\partial L_t}{\partial y_t}\dfrac{\partial y_t}{\partial h_t}\left(\prod_{l=k+1}^{t}\dfrac{\partial h_l}{\partial h_{l-1}}\right)\dfrac{\partial h_k}{\partial W}$. (2.43)

The Vanishing and Exploding Gradients

Let's take a look back at the formula of the hidden state at time step l:

$h_l = \sigma(W_x x_l + W_h h_{l-1} + b_h)$. (2.44)

This can be rewritten as

$a_l = W_x x_l + W_h h_{l-1} + b_h$, (2.45)

$h_l = \sigma(a_l)$. (2.46)

Hence the partial derivative of $h_l$ w.r.t. $h_{l-1}$ in Eq. 2.43 can be computed using the chain rule:

$\dfrac{\partial h_l}{\partial h_{l-1}} = \dfrac{\partial h_l}{\partial a_l}\dfrac{\partial a_l}{\partial h_{l-1}}$. (2.47)

Since $a_l$ is a vector $[a_{l1}, a_{l2}, \dots, a_{ld}]$ and $h_l = [\sigma(a_{l1}), \sigma(a_{l2}), \dots, \sigma(a_{ld})]$ is an element-wise activation of $a_l$, the derivative $\frac{\partial h_l}{\partial a_l}$ is a Jacobian matrix:

$\dfrac{\partial h_l}{\partial a_l} = \begin{pmatrix} \frac{\partial h_{l1}}{\partial a_{l1}} & \frac{\partial h_{l2}}{\partial a_{l1}} & \cdots & \frac{\partial h_{ld}}{\partial a_{l1}} \\ \frac{\partial h_{l1}}{\partial a_{l2}} & \frac{\partial h_{l2}}{\partial a_{l2}} & \cdots & \frac{\partial h_{ld}}{\partial a_{l2}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial h_{l1}}{\partial a_{ld}} & \frac{\partial h_{l2}}{\partial a_{ld}} & \cdots & \frac{\partial h_{ld}}{\partial a_{ld}} \end{pmatrix} = \begin{pmatrix} \sigma'(a_{l1}) & 0 & \cdots & 0 \\ 0 & \sigma'(a_{l2}) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma'(a_{ld}) \end{pmatrix} = \mathrm{diag}(\sigma'(a_l))$, (2.48)

where diag denotes a diagonal matrix. Next, from Eq. 2.45 we have:

$\dfrac{\partial a_l}{\partial h_{l-1}} = W_h$. (2.49)

Finally,

$\dfrac{\partial h_l}{\partial h_{l-1}} = \mathrm{diag}(\sigma'(a_l))\, W_h$. (2.50)

Since σ′ is a bounded function, $\lVert \mathrm{diag}(\sigma'(a_l)) \rVert \le \gamma$, where $\gamma = \frac{1}{4}$ if we use the sigmoid activation function and $\gamma = 1$ in the case of the tanh function. Let $\lambda_m$ be the largest singular value of $W_h$ and suppose $\lambda_m < \frac{1}{\gamma}$:

$\left\lVert \dfrac{\partial h_l}{\partial h_{l-1}} \right\rVert = \left\lVert \mathrm{diag}(\sigma'(a_l))\, W_h \right\rVert \le \left\lVert \mathrm{diag}(\sigma'(a_l)) \right\rVert \, \lVert W_h \rVert < \gamma \cdot \dfrac{1}{\gamma} = 1, \quad \forall l$. (2.51)

Let β be such that $\left\lVert \frac{\partial h_l}{\partial h_{l-1}} \right\rVert \le \beta < 1$ for all l. By using Eq. 2.41 in combination with Eq. 2.42, we have:

$\left\lVert \dfrac{\partial L_t}{\partial h_k} \right\rVert = \left\lVert \dfrac{\partial L_t}{\partial h_t} \dfrac{\partial h_t}{\partial h_k} \right\rVert = \left\lVert \dfrac{\partial L_t}{\partial h_t} \prod_{l=k+1}^{t} \dfrac{\partial h_l}{\partial h_{l-1}} \right\rVert \le \left\lVert \dfrac{\partial L_t}{\partial h_t} \right\rVert \beta^{\,t-k}$. (2.52)

Eq. 2.52 points out that $\frac{\partial L_t}{\partial h_k}$ (for t − k large) decreases dramatically toward 0, since β < 1. This phenomenon is called the vanishing gradient problem. By inverting the assumption, when the largest singular value of $W_h$ satisfies $\lambda_m > \frac{1}{\gamma}$, the gradients of some long-term components may explode. In practice, one way to mitigate the vanishing/exploding gradient problem is to use Truncated Backpropagation Through Time, where the product in Eq. 2.42 is restricted to θ (< t − k) terms. This method also helps to reduce memory and computation costs, as well as processing time. Another way to deal with the exploding gradient problem is Gradient Clipping, introduced by Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio in [96]. The pseudo-code of this algorithm is shown in Algorithm 1. In 2015, Quoc V. Le, Navdeep Jaitly, and Geoffrey E. Hinton also introduced a simple way to initialize RNNs of Rectified Linear Units [95]. Generally, this is a vanilla RNN whose recurrent weights are initialized with the identity matrix, whose biases are set to zero, and whose hidden units use the ReLU activation function.

Algorithm 1 Pseudo-code for norm clipping the gradients whenever they explode
1: ĝ ← ∂L/∂W
2: if ‖ĝ‖ ≥ threshold then
3:     ĝ ← (threshold / ‖ĝ‖) · ĝ
4: end if
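A minimal NumPy rendering of Algorithm 1 follows; the threshold value in the usage line is an arbitrary illustration.

```python
import numpy as np

def clip_gradient_by_norm(grad, threshold):
    """Algorithm 1: rescale the gradient so its norm never exceeds `threshold`."""
    norm = np.linalg.norm(grad)
    if norm >= threshold:
        grad = (threshold / norm) * grad
    return grad

g = np.array([3.0, 4.0])                 # norm 5
print(clip_gradient_by_norm(g, 1.0))     # [0.6 0.8], norm clipped to 1
```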

2.2.3 Long Short-Term Memory Cells

In theory, RNNs are able to capture “long-term dependencies” (Eq. 2.36). In practice, however, they do not work well, primarily because of the simple recurrent cell architecture: the history information is often forgotten after the unit activations are multiplied several times by small numbers. In 1997, Sepp Hochreiter and Jürgen Schmidhuber introduced the Long Short-Term Memory network (LSTM, for short) [111], which has a more complex structure that allows it to better control long-term dependencies and mitigate the vanishing gradient problem. Unlike the RNN cell, which has only one activation function taking as input the history information and the current input, an LSTM cell has two major components:

• Cell state: a kind of memory that stores information over time.

• Gates: use sigmoid/tanh functions to control adding new information to, or removing information from, the cell state.

There are three gates in the standard LSTM cell:

• Forget gate: Decide which part of the information should be removed from the cell state. The decision is made by a sigmoid function that takes as input the previous hidden state and the current input:

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$ (2.53)

• Input gate: Update the cell state. This consists of three steps:

– Decide which part of the new information should be added to the cell state:

$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$ (2.54)

– Create a vector of new candidate values:

$\bar{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$ (2.55)

– Update the cell state:

$C_t = f_t * C_{t-1} + i_t * \bar{C}_t$ (2.56)

Figure 2.14: LSTM cell at time step t.

• Output gate: Decide which information should be output, based on the cell state:

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$, (2.57)

$h_t = o_t * \tanh(C_t)$. (2.58)

The graphical illustration of an LSTM cell is shown in Fig. 2.14.
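To make Eqs. 2.53–2.58 concrete, here is a minimal NumPy sketch of a single LSTM cell step; concatenating $[h_{t-1}, x_t]$ explicitly and the parameter names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    """One LSTM step implementing Eqs. 2.53-2.58 on the concatenated vector [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)            # forget gate, Eq. 2.53
    i_t = sigmoid(W_i @ z + b_i)            # input gate, Eq. 2.54
    c_bar = np.tanh(W_C @ z + b_C)          # candidate values, Eq. 2.55
    c_t = f_t * c_prev + i_t * c_bar        # cell state update, Eq. 2.56
    o_t = sigmoid(W_o @ z + b_o)            # output gate, Eq. 2.57
    h_t = o_t * np.tanh(c_t)                # hidden state, Eq. 2.58
    return h_t, c_t

# toy usage: input dim 4, hidden dim 6
rng = np.random.default_rng(0)
d_in, d_h = 4, 6
W = lambda: rng.normal(scale=0.1, size=(d_h, d_h + d_in))
b = lambda: np.zeros(d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_in), h, c, W(), b(), W(), b(), W(), b(), W(), b())
```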

Variants of LSTM Cell

In 2000, Gers and Schmidhuber introduced a variant of the LSTM cell with “peephole connections” [30] that allow the gates to take into account the information from the cell state. The peephole LSTM cell is described by Eqs. 2.59–2.61 along with Fig. 2.15.

$f_t = \sigma(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f)$, (2.59)

$i_t = \sigma(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i)$, (2.60)

$o_t = \sigma(W_o \cdot [C_t, h_{t-1}, x_t] + b_o)$. (2.61)

Another variant is created by modifying the way information is added to the cell state (see Fig. 2.16). Instead of separately deciding which part of the information should be thrown away and which part should be added, these decisions are made together:

$C_t = f_t * C_{t-1} + (1 - f_t) * \bar{C}_t$ (2.62)

The Gated Recurrent Unit (GRU) is another variant of the LSTM cell, introduced by Kyunghyun Cho et al. in 2014 [62]. In this variant, the input gate and forget gate are combined into one gate, named the “update gate”, so there are only two gates in the GRU: the update gate and the reset gate. In addition, the cell state and hidden state are merged (see Fig. 2.17). These changes in the architecture make the GRU simpler, with fewer parameters than the LSTM, and hence faster to run, while its performance is almost on par with LSTMs.

Figure 2.15: Peephole LSTM cell.

Figure 2.16: A variant of the LSTM cell with coupled input and forget gates.

The GRU's formulas are described below:

$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$, (2.63)

$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$, (2.64)

$\bar{h}_t = \tanh(W \cdot [r_t * h_{t-1}, x_t])$, (2.65)

$h_t = (1 - z_t) * h_{t-1} + z_t * \bar{h}_t$. (2.66)
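Analogously to the LSTM sketch above, a minimal NumPy rendering of the GRU step in Eqs. 2.63–2.66; the parameter names and the omission of biases (as in the equations above) are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W):
    """One GRU step implementing Eqs. 2.63-2.66."""
    z_t = sigmoid(W_z @ np.concatenate([h_prev, x_t]))          # update gate, Eq. 2.63
    r_t = sigmoid(W_r @ np.concatenate([h_prev, x_t]))          # reset gate, Eq. 2.64
    h_bar = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))    # candidate state, Eq. 2.65
    return (1.0 - z_t) * h_prev + z_t * h_bar                   # new hidden state, Eq. 2.66

# toy usage: input dim 4, hidden dim 6
rng = np.random.default_rng(0)
d_in, d_h = 4, 6
mk = lambda: rng.normal(scale=0.1, size=(d_h, d_h + d_in))
h = gru_step(rng.normal(size=d_in), np.zeros(d_h), mk(), mk(), mk())
```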

Figure 2.17: GRU cell.

Figure 2.18: Standard LSTM network.

2.2.4 LSTM Networks

The standard LSTM network, as shown in Fig. 2.18, is actually a modified version of the vanilla RNN obtained by replacing the RNN cell with an LSTM cell. In this standard network, there is only one hidden layer; the architecture is therefore exactly the same as the vanilla RNN's, except for the cell type. Another version is the Stacked LSTM network, in which several hidden layers are stacked to make the model deeper. Compared with the standard network, nothing changes along the time dimension, and there are no recurrent connections between layers. Along the depth dimension (i.e., the layer dimension), the hidden state $h_t^l$ at the lth layer is used as input to the (l + 1)th layer (see Fig. 2.19). The formulas of the Stacked LSTM network are listed below:

$f_t^l = \sigma(W_f^l \cdot [h_{t-1}^l, h_t^{l-1}] + b_f^l)$, (2.67)

$i_t^l = \sigma(W_i^l \cdot [h_{t-1}^l, h_t^{l-1}] + b_i^l)$, (2.68)

$\bar{C}_t^l = \tanh(W_C^l \cdot [h_{t-1}^l, h_t^{l-1}] + b_C^l)$, (2.69)

$C_t^l = f_t^l * C_{t-1}^l + i_t^l * \bar{C}_t^l$, (2.70)

$o_t^l = \sigma(W_o^l \cdot [h_{t-1}^l, h_t^{l-1}] + b_o^l)$, (2.71)

$h_t^l = o_t^l * \tanh(C_t^l)$. (2.72)

The LSTM networks discussed above are only able to take into account the preceding context. In some cases, however, the following context also needs to be considered. For instance, in the task of Named Entity Recognition, the tagging decision for the current word depends not only on the preceding words but also on the following ones. The Bidirectional LSTM network was created to deal with this problem. Fig. 2.20 depicts the architecture of a Stacked Bidirectional LSTM network. Basically, it consists of two parts: (1) a forward LSTM network and (2) a backward LSTM network. At each time step, the output is the combination of the two hidden states produced by the forward and backward LSTM networks, which capture the preceding and following contexts, respectively. The gates, cell state, and output in the former are computed exactly as in the Stacked LSTM network, whereas in the latter they are computed in the reverse time order:

Figure 2.19: Stacked LSTM network.

$f_t^l = \sigma(W_f^l \cdot [h_{t+1}^l, h_t^{l-1}] + b_f^l)$, (2.73)

$i_t^l = \sigma(W_i^l \cdot [h_{t+1}^l, h_t^{l-1}] + b_i^l)$, (2.74)

$\bar{C}_t^l = \tanh(W_C^l \cdot [h_{t+1}^l, h_t^{l-1}] + b_C^l)$, (2.75)

$C_t^l = f_t^l * C_{t+1}^l + i_t^l * \bar{C}_t^l$, (2.76)

$o_t^l = \sigma(W_o^l \cdot [h_{t+1}^l, h_t^{l-1}] + b_o^l)$, (2.77)

$h_t^l = o_t^l * \tanh(C_t^l)$. (2.78)

2.3 Pre-trained Language Models

2.3.1 ELMo

As mentioned in Section 2.1.1, a word by itself can have several meanings. For instance, the word “rose” in the sentences below carries absolutely different meanings: the first is a noun indicating the name of a flower, while the second is a verb describing an action.

My favorite flower is Bulgarian rose. He quickly rose from his seat.

To determine the meaning of words correctly, the word context must be considered. In this case, the words “flower”, “quickly”, and “seat” in the context play a crucial role in determining the meaning of the word “rose”.

Figure 2.20: Stacked Bidirectional LSTM network.

However, traditional word representation models, as described in Sections 2.1.2 and 2.1.3, assign a vector to each word regardless of the context in which the word occurs. To deal with this problem, Matthew E. Peters et al. [70] introduced a new word embedding model, named Embeddings from Language Models (ELMo, for short). The pillar of this model is the bidirectional language model (Bi-LM), which estimates the probability of word sequences in both forward and backward directions:

$\overrightarrow{p}(t_1, \dots, t_n) \approx \displaystyle\prod_{i=1}^{n} p(t_i \mid t_1, \dots, t_{i-1})$, (2.79)

$\overleftarrow{p}(t_1, \dots, t_n) \approx \displaystyle\prod_{i=1}^{n} p(t_i \mid t_{i+1}, \dots, t_n)$. (2.80)

In the ELMo model, these are computed by an L-layer stacked bidirectional LSTM (as shown in Fig. 2.21). For each token $t_k$, ELMo computes 2L + 1 representations:

$R_k = \{x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \dots, L\} = \{h_{k,j}^{LM} \mid j = 0, \dots, L\}$, (2.81)

where $h_{k,0}^{LM}$ denotes the token layer and $h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}]$ for each Bi-LSTM layer.

In downstream tasks, the set of vector representations $R_k$ is collapsed into a single vector by first applying a weighted summation and then scaling:

$\mathrm{ELMo}_k^{task} = \gamma^{task} \displaystyle\sum_{j=0}^{L} s_j^{task}\, h_{k,j}^{LM}$, (2.82)

Figure 2.21: Bi-LSTM based model of ELMo.

where $s_j^{task}$ denotes the softmax-normalized weights and $\gamma^{task}$ is a scalar parameter used to scale the entire resulting vector. These parameters are tuned while training on the downstream task. The Bi-LM is jointly trained by minimizing the negative log-likelihood of both the forward and backward directions:

$L = -\displaystyle\sum_{k=1}^{N}\Big(\log p(t_k \mid t_1, \dots, t_{k-1};\, \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \dots, t_n;\, \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s)\Big)$, (2.83)

where $\Theta_x$, $\Theta_s$ are the parameters of the token representation layer and the softmax layer, respectively; both are shared between the forward and backward directions. $\overrightarrow{\Theta}_{LSTM}$, $\overleftarrow{\Theta}_{LSTM}$ denote two separate sets of parameters for the forward and backward LSTMs.
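As an illustration of Eq. 2.82, the following NumPy sketch collapses the 2L + 1 layer representations of one token into a single task-specific vector; the layer values, the s-weights, and γ here are random placeholders, not values from the ELMo paper.

```python
import numpy as np

def elmo_combine(layer_reps, s_logits, gamma):
    """Eq. 2.82: ELMo_k = gamma * sum_j softmax(s)_j * h_{k,j}.

    layer_reps : (L + 1, d) stacked representations h_{k,0..L} of one token
    s_logits   : (L + 1,) task-specific weights before softmax normalization
    gamma      : task-specific scalar
    """
    s = np.exp(s_logits - s_logits.max())
    s /= s.sum()                                   # softmax-normalized weights s_j^task
    return gamma * (s[:, None] * layer_reps).sum(axis=0)

# toy usage: L = 2 Bi-LSTM layers plus the token layer, dimension 8
rng = np.random.default_rng(0)
h = rng.normal(size=(3, 8))                        # h_{k,0}, h_{k,1}, h_{k,2}
print(elmo_combine(h, np.zeros(3), gamma=1.0).shape)  # (8,)
```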

2.3.2 Transformer

The Transformer was introduced in the paper titled “Attention Is All You Need” [14], authored by Ashish Vaswani and his colleagues from Google. It is not a language model per se, since it was trained for the machine translation task; however, it is the inspiration for the pre-trained language models described below. The overall architecture of the Transformer follows the Encoder-Decoder model [62], but instead of relying on RNN or CNN networks to build the encoder and decoder components, the model is based solely on the attention mechanism.

Figure 2.22: Attention mechanism treated as a weighted average function.

In the simplest terms, the attention mechanism is a weighting function that takes as input a sequence of vectors and outputs another sequence of the same length. The left part of Fig. 2.22 illustrates the attention mechanism as a black box taking as input a sequence of 3 vectors and outputting another sequence of the same length. In the field of NLP, where words are usually represented as vectors, the attention mechanism is applied to produce a new word vector based on the surrounding words. The new word vectors are weighted averages of the current word and the surrounding words and therefore contain context information. The weights represent the importance of each element. For instance, when computing a new vector representing the word “rose” in the sentence “My favorite flower is Bulgarian rose”, the weights of the words “rose” and “flower” should be large because they play the main role in determining the meaning of the word “rose” (see the right part of Fig. 2.22). In this respect, the technique is quite similar to RNNs. One property making this method better than RNNs is that attention can compute the new word vectors in parallel over time. Below, the attention mechanisms used in [14] are described.

Given $d_k$-dimensional query and key vectors and $d_v$-dimensional value vectors of $n_w$ words packed into matrices $Q, K \in \mathbb{R}^{n_w \times d_k}$, $V \in \mathbb{R}^{n_w \times d_v}$, Scaled Dot-Product Attention (the left part of Fig. 2.23) is computed as below:

$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\left(\dfrac{QK^\top}{\sqrt{d_k}}\right)V$, (2.84)

where each query vector $q_i$ in Q represents the word that is paying attention, and each key vector $k_j$ in K represents the word to which attention is being paid. In the right part of Fig. 2.22, the query is the vector of the word “rose” on the right side and the keys are the vectors of the words on the left side. $QK^\top$ produces a matrix of size $[n_w, n_w]$ representing the similarity between each pair of words in the input sequence. More concretely, the similarity between two words is measured by the dot product:

Figure 2.23: Scaled Dot-Product Attention and Multi-Head Attention in the Transformer model [14].

$q_i^\top k_j = \cos(q_i, k_j)\, \lVert q_i \rVert_2\, \lVert k_j \rVert_2$. (2.85)

This reflects how similar the directions of the vectors $q_i$ and $k_j$ are and how large their lengths are: a larger dot product indicates closer directions and larger lengths. The matrix $QK^\top$ is then scaled by $\sqrt{d_k}$, the square root of the depth, to ensure that the dot product between two vectors does not grow too large in proportion to $d_k$. Multi-head Attention is a variant of Scaled Dot-Product Attention in which the queries, keys, and values are projected h times using different learnable sets of weights (see the right part of Fig. 2.23).

Given $Q, K, V \in \mathbb{R}^{n_w \times d_{model}}$ and weight matrices $W_i^Q, W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$, $W^O \in \mathbb{R}^{h d_v \times d_{model}}$, Multi-head Attention is computed as below:

$\mathrm{head}_i = \mathrm{Att}(QW_i^Q, KW_i^K, VW_i^V), \quad i = 1, \dots, h$, (2.86)

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O$. (2.87)

Masked Multi-head Self-Attention is a slight modification of Multi-head Self-Attention in which a masking technique, implemented as a pointwise product of $QK^\top$ with a triangular mask of ones, is applied to prevent leakage of future information. It is used in the decoder to guarantee that the prediction for the ith position depends only on the previously known outputs.
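A minimal NumPy sketch of Eq. 2.84 with an optional causal mask follows; the additive minus-infinity masking used here is one common way to realize the masking described above and is an illustrative choice, not a claim about the reference implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Eq. 2.84: softmax(Q K^T / sqrt(d_k)) V, optionally masking future positions."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_w, n_w) pairwise similarities
    if causal:
        # position i may only attend to positions j <= i
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

# toy usage: 6 words, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V, causal=True).shape)  # (6, 8)
```

Multi-head attention (Eqs. 2.86–2.87) simply applies this function h times to linearly projected copies of Q, K, V and concatenates the results.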

Figure 2.24: High-level view of Transformer model.

So far, the self-attention types that are the pillars of the Transformer model have been described. The Transformer can be thought of as an encoder-decoder model whose encoder and decoder are both stacks of 6 blocks. Each encoder block comprises two main components: (1) multi-head self-attention and (2) a feed-forward network. Decoder blocks are composed of three components: masked multi-head self-attention, encoder-decoder attention, and a feed-forward network. The encoder-decoder attention is just a multi-head attention, except that the keys and values come from the encoder while the queries are the output of the previous decoder layer. Fig. 2.24 provides a high-level view of the Transformer architecture. Since the attention mechanism cannot make use of the sequence order the way RNN or CNN architectures do, the output of an attention-based network would be the same for the same input sequence presented in a different order. To deal with this problem, a positional encoding, which determines the position of each element in the sequence, is added to the input embedding:

$PE(pos, 2i) = \sin\left(\dfrac{pos}{10000^{2i/d_{model}}}\right)$, (2.88)

$PE(pos, 2i + 1) = \cos\left(\dfrac{pos}{10000^{2i/d_{model}}}\right)$, (2.89)

where pos and i denote the position and the dimension, respectively. If the input shape is $[N, d_{model}]$, then pos is in the range $[0, N)$ and 2i, 2i + 1 index the embedding dimensions in $[0, d_{model})$.
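A minimal NumPy sketch of the sinusoidal positional encoding in Eqs. 2.88–2.89; the toy sequence length and (even) model dimension are illustrative assumptions.

```python
import numpy as np

def positional_encoding(N, d_model):
    """Eqs. 2.88-2.89: PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(N)[:, None]                       # (N, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((N, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

# the encoding is simply added to the input embeddings
pe = positional_encoding(N=10, d_model=16)
print(pe.shape)  # (10, 16)
```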

Figure 2.25: GPT architecture.

2.3.3 OpenAI’s GPT

GPT, short for Generative Pre-training, was introduced by OpenAI6 in 2018 [5]. It uses a semi-supervised learning approach involving general-task unsupervised pre-training, to learn a universal representation, followed by task-specific supervised fine-tuning. Different from ELMo, which is based on a bidirectional language model implemented with a stacked Bi-LSTM network, GPT relies on the decoder architecture of the Transformer model [14]. In general, GPT can be treated as a stack of transformer decoder blocks, each of which consists of a Masked Multi-headed Self-Attention layer followed by a normalization layer and a Feed-Forward layer (see Fig. 2.25). In the unsupervised pre-training phase, the authors use a standard language model to learn the probability distribution of a sequence $U = \{u_1, \dots, u_n\}$. The objective is to maximize the log-likelihood, as in the ELMo model but without the backward direction:

$L_{LM}(U) = \displaystyle\sum_{i} \log p(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)$, (2.90)

where k denotes the context window size and Θ the model parameters. Instead of using a word vocabulary, the GPT model uses a sub-word vocabulary generated

6https://openai.com/

by Byte Pair Encoding (BPE). BPE is a data compression algorithm that was used by Sennrich et al. [100] to build a sub-word dictionary based on word frequencies. The algorithm can be summarized as follows (a small sketch is given after the list):

• Collect a large corpus for generating the sub-word vocabulary.

• Generate characters, called symbols, from the corpus by splitting words, and use them as the initial vocabulary.

• Generate a new symbol by finding and merging the pair of symbols with the highest frequency in the corpus.

• Iterate the previous step until some stopping condition is met (e.g., the vocabulary is full, or a pre-defined number of merge steps has been performed).
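A minimal sketch of the merge loop described above, operating on a toy word-frequency dictionary; the corpus, the end-of-word marker `</w>`, and the number of merges are illustrative assumptions rather than the exact procedure used for GPT.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the chosen symbol pair with the merged symbol."""
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq for word, freq in vocab.items()}

# toy corpus: words split into characters with an end-of-word marker
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(10):                       # pre-defined number of merge steps
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)      # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
print(vocab)
```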

The input sequence is embedded and then fed into the transformer blocks, which form a multilayer transformer decoder [91], to produce the output distribution over the target tokens:

$h^0 = U W_e + W_p$, (2.91)

$h^l = \mathrm{transformer\_block}(h^{l-1}), \quad \forall l \in [1, n]$, (2.92)

$P(u) = \mathrm{softmax}(h^n W_e^\top)$, (2.93)

where $W_e$, $W_p$ are the word embedding and position embedding matrices, which are part of the model parameters, and n is the number of transformer blocks. The transformer decoder here is borrowed from [91] (see Fig. 2.26); it is a modification of the Multi-Head Attention in the Transformer model [14] with two key changes intended to handle long sequences and reduce memory usage:

• Local attention: long sequences are first split into equal blocks and then fed into the attention heads; each block is processed independently.

• Memory-compressed attention: the number of values and keys is reduced by using a convolutional layer with kernel size 3 and stride 3.

After training with $L_{LM}$, the model is fine-tuned on supervised tasks. Instead of building a task-specific model, the authors directly use the pre-trained language model with slight modifications of the input structure, along with adding a linear layer followed by a softmax layer (see Fig. 2.27). For example, in the task of text classification, the input sequence $x = (x_1, \dots, x_m)$ is fed through the transformer blocks of the pre-trained language model to produce the hidden representation $h_m^n$ of the last token $x_m$ at the final (nth) transformer block. After that, a linear layer followed by a softmax layer is used to produce the probability distribution over classes:

$P(y \mid x_1, \dots, x_m) = \mathrm{softmax}(h_m^n W_y)$, (2.94)

Figure 2.26: Self Attention architecture proposed in [91].

where $W_y$ is the trainable weight matrix of the linear layer. The model is trained to maximize the log-likelihood of the true labels:

$L_{cls} = \displaystyle\sum_{(x,y)} \log P(y \mid x_1, \dots, x_m) = \sum_{(x,y)} \log\, \mathrm{softmax}(h_m^n W_y)$. (2.95)

Jointly training the LM task with the specific task also helps to improve the generalization of the supervised model and speeds up convergence:

$L = L_{cls} + \lambda \cdot L_{LM}$, (2.96)

where λ is a weighting hyperparameter used to control the importance of the auxiliary objective

$L_{LM}$.

2.3.4 Google’s BERT

Different from ELMo, which generates features for downstream tasks by independently concatenating the outputs of the forward and backward LSTMs, and GPT, which is based on left-to-right transformer decoder blocks, BERT [44] (short for Bidirectional Encoder Representations from Transformers) jointly conditions on both left and right contexts. In general, BERT is a stack of transformer encoder blocks whose core is the Multi-head Self-Attention discussed before. Two versions of BERT, named BERT-Base and BERT-Large, were released:

• BERT-Base is a stack of 12 layers, each of which has 12 attention heads, containing a total of about 110 million parameters.

Figure 2.27: Fine-tuning GPT for downstream tasks.

• BERT-Large is huge, with 24 layers, each of which has 16 attention heads, consisting of about 340 million parameters.

The input embeddings used in BERT are the combination of three types of embeddings:

• Token embeddings: Instead of word units, sub-word units are used to mitigate the issue of rare words. The sub-word vocabulary is built using WordPiece [72].

• Position embeddings: Since the Transformer architecture does not capture order information, positional embeddings are injected as part of the input embeddings.

• Segment embeddings: BERT can take a pair of sentences as input for the next sentence prediction task. For this reason, segment embeddings are added to identify the sentence boundaries.

BERT was pre-trained using two unsupervised tasks including

• Masked Language Modeling (Masked LM): Unlike in the traditional LM task, where word prediction is based only on some previous words (refer to Eq. 2.5), in this task some tokens in the input sentences are first masked randomly, and then the model is trained to predict those masked words based on both the preceding and succeeding words. This task is therefore called Masked LM. However, the drawback of this method is a mismatch between the pre-training and fine-tuning phases, since the

“MASK” token does not exist during the fine-tuning stage. Hence, after randomly choosing 15% of the input tokens for masking, the three heuristic rules below are applied to mitigate the mismatch (a small sketch of this procedure is given after the list):

– 80% of the time, replace the chosen word with “MASK” token;

– 10% of the time, replace the chosen word with a random one;

– 10% of the time, keep it unchanged.

• Next Sentence Prediction: To allow the model to perform better on downstream tasks that rely on the relationship between two sentences (e.g., Question Answering, Natural Language Inference), the model is also trained on the Next Sentence Prediction task, where the input is a pair of sentences and the goal is to predict whether the first sentence immediately precedes the second one.
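A minimal sketch of the 80/10/10 masking procedure described above; the tokenized sentence, the mask rate, and the toy vocabulary are illustrative assumptions.

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """Apply the Masked LM corruption rule to a token list.

    Returns the corrupted tokens and a dict of {position: original token} to predict.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = {}
    for pos, tok in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue                            # token not chosen for prediction
        targets[pos] = tok
        r = rng.random()
        if r < 0.8:
            corrupted[pos] = "[MASK]"           # 80%: replace with the mask token
        elif r < 0.9:
            corrupted[pos] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, targets

vocab = ["the", "cat", "sat", "on", "a", "mat", "dog", "ran"]
tokens = ["the", "cat", "sat", "on", "the", "mat"]
print(mask_tokens(tokens, vocab))
```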

2.4 Summary

To sum up, this chapter presented three key components of modern NLP, namely word representation models, deep neural network models, and general-purpose language models, which are the pillars of the proposed models described in the two following chapters. In the field of NLP, vectorization is currently a widely used methodology for mapping discrete words to numeric vectors that can be processed by neural networks. Traditionally, there are two families of word representation models: prediction-based models and count-based models. The prediction-based models are based on language modeling. The underlying idea behind these models is to build a model for the LM task whose first layer is responsible for mapping discrete words to continuous low-dimensional vectors. Since the objective of the language modeling task is to predict the next word based on some previous words, called the context, after training the model the first layer can be used to convert raw words into vectors representing their meaning. Some outstanding prediction-based models include CBOW and Continuous Skip-gram (Google's Word2Vec). Different from the prediction-based models, the count-based models analyze a large corpus of text to exploit the co-occurrence statistics of words. Most of the count-based models first build a large term-frequency matrix from the corpora and then use matrix factorization methods for rank reduction to generate a low-rank matrix of word representations. One of the most impressive count-based models is GloVe, introduced by Jeffrey Pennington, Richard Socher, and Christopher Manning [46]. In addition to the two families of word representation models mentioned above, another approach, called contextualized word representation (a.k.a. context-based word embedding), is also presented at the end of this chapter. Instead of learning a function mapping one word to one vector regardless of the context, context-based word representation models learn a

60 function that takes as input not only the word but also the context. This enables the model to deal with the problem of multi-meaning of words. The second concept described in this section is deep neural network models. Recently, RNNs, GRU, LSTM as well as their variants are widely used in most of NLP tasks due to their capability of learning long short-term dependencies in sequence data. It is hard to say which model is the best. Each of them has its own advantages and disadvantages. In recent paper [60], Klaus Greff et al. conducted a relatively comprehensive survey about variants of the LSTM architecture and concluded that their performance is all about the same. According to our experiments on the Named Entity Recognition task, the standard LSTM achieves slightly better results, especially in case of large training data, but quite slower than GRU. Finally, some recently state-of-the-art general-purpose language models are analyzed. These models approach NLP tasks in a new way which can be divided into two steps

• Unsupervised or semi-supervised learning task: This is usually done by training a language model on a large unlabelled text corpus.

• Fine-tuning task: Fine-tune the trained model on a specific NLP task. This step is neither costly nor time-consuming due to the small number of parameters to fine-tune.

If we had to describe these models succinctly in a few words, they would be “deep” and “large-scale”. Indeed, all of them are deep neural network models built by stacking many identical layers or blocks. For instance, ELMo is a two-layer stacked Bi-LSTM model, GPT is a stack of 12 transformer decoder blocks, and the BERT-Large model is based on 24 transformer encoder blocks. Besides that, all of these models are trained on huge datasets in order to create meaningful word representations and obtain generalization ability. Owing to their deep network architectures and large-scale training data, these models can achieve state-of-the-art results on a variety of NLP tasks, ranging from Text Classification and Sentiment Analysis to Question Answering, by fine-tuning on small domain-specific datasets. These models can also be considered word representation models, since they are often used to convert raw input words into the corresponding vector representations. Their main advantage over the traditional models described in Section 2.1 is that they generate a word vector representation based not only on the word itself but also on its context.

Chapter 3

Sequence Labeling with Character-aware Deep Neural Networks and Language Models

3.1 Introduction to the Sequence Labeling Tasks

The Sequence Labeling task has received a lot of attention from the research community, since it can be widely applied to many fields ranging from Computer Vision and Natural Language Processing to Bioinformatics. Some typical applications are Video Analysis, Named Entity Recognition, Part of Speech tagging, Word Segmentation, and Protein Structure Prediction. Generally, given a sequence of observations, the goal of the Sequence Labeling task is to determine an appropriate state for each observation. From the supervised machine learning perspective, the Sequence Labeling task can be formally defined as follows. Given the training data $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$, where

$x^{(i)} = x_1 \dots x_n, \quad x_j \in X$,

$y^{(i)} = y_1 \dots y_n, \quad y_j \in Y$,

where X denotes the set of all possible observations and Y the set of all possible states, the task is to build a predictor by learning a mapping function $f : x \to y$ that works well on an unseen input sequence x. One of the two key factors making Sequence Labeling tasks more difficult than conventional multi-class classification tasks is the exponential size of the output space. Let's take the task of Named Entity Recognition as an example. Suppose we need to find and extract two types of entities: person names and organization names. One extra type is used to mark the words that do not belong to these types. Let us denote them by PER, ORG, and O, respectively. Suppose also that there are no nested entities. Then, for a sentence of length 7,

An Nhien studies in Vietnam National University,

Figure 3.1: Example of Sequence Labeling output space.

there are in total $3^7 = 2187$ possible output sequences (see Fig. 3.1). In addition to the huge output space, the complex constraints between classes make the Sequence Labeling task even harder. In general, there are two types of constraints:

• Local constraints: Such constraints, as the name suggests, apply to each word itself. For instance, the word “can” is more likely to be a verb than a noun. However, sometimes these constraints are inconsistent with each other. For example, the word “can” in the sentence “There is no trash can in the kitchen.” is a noun, not a verb.

• Global constraints: These constraints are conditioned on two or several consecutive elements of the sequence. For example, a noun is more likely than a verb to follow a determiner.

Therefore, to make accurate predictions, sequence labeling models need to be able to learn not only the input context but also the relationships between output tags. Up to now, many models have been proposed. These models, along with related work, are summarized in the following section.

3.2 Related Work

As mentioned at the beginning of this chapter, Sequence Labeling can be applied to many fields such as Computer Vision, Natural Language Processing, and Bioinformatics. It is difficult to give an overview of all applications of Sequence Labeling, since each of them has its own specifics. For this reason, and in order to focus on the domain of the thesis, this section summarizes Sequence Labeling models in the field of Natural Language Processing. In general, Sequence Labeling models can be divided into two groups: (1) traditional models and (2) deep learning-based models. The first group can be further classified into two sub-groups: (1) rule-based models and (2) feature-based models.

Unlike traditional approaches, where each model has its own architecture, deep learning-based models share a similar architecture with three components: (1) an input encoder, (2) a context encoder, and (3) a tag decoder. However, the implementations of these components and the ways they are combined are very diverse.

3.2.1 Rule-based Models

Rule-based models, as their name indicates, are made up of hand-crafted rules based on syntactic features, grammatical features, and domain-specific gazetteers. In 1992, Eric Brill proposed a simple and portable rule-based model for the Part of Speech task [27] which can automatically acquire rules and tags; according to the reported experiments, it achieved accuracy comparable to existing stochastic models. In 1999, Eric Brill et al. [16] introduced a hybrid approach that combines unsupervised and supervised rule-based training algorithms to create a highly accurate tagger. In the first step, all possible tags are assigned to the corresponding words; then, in the second step, rules are used to reduce the number of tags in order to remove ambiguity. In the evaluation task of the Message Understanding Conference (MUC-7), many rule-based systems were introduced, such as SAR [17], Facile [131], NetOwl [35], and LaSIE-II [51]; all of them relied on hand-crafted semantic and syntactic rules. Another model, for the named entity extraction task and based on domain-specific gazetteers, was proposed by Oren Etzioni et al. in [85]. In this paper, the authors introduced three methods, namely pattern learning, subclass extraction, and list extraction, to improve the recall of the baseline KnowItAll system. In a recent paper [25], Elaheh Sadredini et al. introduced a scalable solution for rule-based Part-of-Speech tagging in which they utilize two state-of-the-art hardware accelerators, the Automata Processor and Field-Programmable Gate Arrays, converting rules to regular expressions and exploiting the highly parallel regular-expression-matching ability of these accelerators.

3.2.2 Feature-based Models

In general, feature-based models are extended variants of rule-based models. In rule-based models, the rules are designed based on language characteristics and then used directly to classify tokens into tags. In feature-based models, instead, the rules are designed to represent training examples as feature vectors, i.e., vectors of boolean or numeric values. This step is called feature engineering. The features can be character-level features, word-level features, or even document features, depending on the demands of each specific task. Once the feature vectors are built, machine learning algorithms such as the Hidden

Markov Model (HMM), Conditional Random Field (CRF), Support Vector Machine (SVM), and Maximum Entropy (ME) are used to train a model that works well on unseen data. The Hidden Markov Model is one of the earliest models used for Sequence Labeling tasks. In [119], Thorsten Brants proposed using a second-order Markov model for the Part of Speech task and achieved impressive results for both English and German. In [38], GuoDong Zhou and Jian Su introduced a variant of HMM applied to the task of Named Entity Recognition. In this model, an input word is represented by its features and the word itself. The word features consist of simple deterministic internal features (digits, abbreviations, capitalized words), internal semantic features (person titles, location suffixes, organization suffixes), internal gazetteer features (pre-defined person names, names of locations), and external macro context features. The tags are characterized by three parts: (1) boundary categories, used to determine the word position within the entity, (2) entity categories, used to encode the class of the entity name, and (3) word features. The Maximum Entropy Model has also been applied to Sequence Labeling tasks such as Part of Speech [1] and Named Entity Recognition [83]. In [114], Taku Kudoh and Yuji Matsumoto used a Support Vector Machine for the Chunk Identification task and obtained 93.48% overall F1 on the CoNLL-2000 dataset. Compared with the above models, the Conditional Random Field model seems to be more widely used across tasks and domains ([50], [80], [71]). Unlike HMM, which has strict independence assumptions, CRF can accommodate arbitrary context information, and its feature design is flexible. CRF is better than ME because, instead of defining the distribution of the next state conditioned on the current state, CRF computes the joint probability distribution of the whole label sequence given the observation sequence intended for labeling.

3.2.3 Deep Learning-based Models

Both rule-based and feature-based models have obvious disadvantages. Rule-based models heavily rely on hand-crafted semantic and syntactic rules to recognize and extract entities in a sequence. The rule design process is time-consuming and costly, and this type of model can only obtain high accuracy when the lexicon is exhaustive. Since the rules are designed manually for specific domains, the precision of these models is often higher than their recall. The two pillars of the learning-based models are (1) feature engineering for input representation and (2) machine learning algorithms for training and inference. Feature engineering is time-consuming, costly, and requires expert linguistic knowledge. Standard linear HMMs and linear-chain CRFs assume that (1) the current state only depends on the previous state and (2) the previous observation influences the current one only via the states, so the resulting methods struggle with complex constraints on input observations and output states. Thanks to

deep learning networks, these limitations are largely resolved. Deep learning-based models trained on large corpora can learn complex features thanks to non-linear activation functions. Moreover, one important advantage of deep learning-based models is the powerful ability of end-to-end representation learning, which eliminates the need for hand-crafted rule design and feature engineering and simplifies the sequence processing pipeline. In general, a deep learning-based sequence labeling model has three components: (1) an input encoder, (2) a context encoder, and (3) a tag decoder. The input encoder is the first step, responsible for converting the raw input text into low-dimensional dense vectors with each dimension representing a latent feature. This component can be just a simple look-up table which is initialized either randomly or by a pre-trained word embedding model such as Word2Vec [123], GloVe [46], or FastText1. In practice, this look-up table can be either fixed or further fine-tuned during the training stage [108]. Some works consider the representation of each word as a sequence of characters and use deep neural network models such as LSTM, GRU, and CNN to generate the word vector representations. In [36], Guillaume Lample and his colleagues utilized a Bi-LSTM to generate a word embedding from its characters. In another work [45], instead of using a Bi-LSTM network, Jason P.C. Chiu and Eric Nichols used a CNN network. More recently, in [84], Onur Kuru et al. developed a character-based model for the NER task, namely CharNER. CharNER treats the input text sequence as a sequence of characters and utilizes a deep-stacked Bi-LSTM network to generate word vector representations. In [148], Zhilin Yang et al. proposed a deep hierarchical recurrent neural network with a multi-task and cross-lingual joint training approach for sequence labeling tasks. In this model, they used a Bi-GRU model to capture the morphological features of words from their characters. They reported impressive results in multiple languages over several tasks, including Part of Speech, Chunking, and NER. Different from the input encoder, the context encoder is used to extract the context information of words, i.e., the meaning of a word inside a context. Recurrent neural networks and their variants such as LSTM and GRU are very suitable for this task, since their architecture is designed for handling sequential data. More concretely, in these models, hidden layers are made of recurrent cells whose state is updated based on both the previous state and the current input. For this reason, in many studies the context information was encoded by applying Bi-LSTM ([36], [45], [135], [147]) or Bi-GRU ([76], [37]). The last component in a sequence labeling model, the tag decoder, is often implemented in one of three ways: (1) using a multilayer perceptron followed by a softmax function, (2) applying a Conditional Random Field, or (3) using a recurrent neural network. Application of the first approach ([89], [26]) assumes that tagging is a multi-class classification problem which assigns each word a label without taking into account the interaction among

1https://fasttext.cc

neighboring output tags. On the contrary, many other studies ([147], [135], [136], [76], [37]) utilized a CRF layer to jointly decode the labels of the whole sequence instead of modeling tagging decisions independently. A few works proposed using a recurrent neural network for tag decoding. In [149], Peng Zhou et al. used an LSTM network for tag decoding. Likewise, in [142], Yanyao Shen et al. used an LSTM tag decoder and reported that the LSTM outperforms CRF and is faster to train, especially when the set of possible tags is large.

3.2.4 Related Work on Vietnamese Named Entity Recognition

Motivated by the drawback of feature-based models, namely their heavy reliance on feature engineering, in [42] Huong Thanh Le and Luan Van Tran proposed a genetic algorithm to automatically select a minimal subset of features in order to produce high accuracy for a feature-based NER model. For the learning stage, two learning models were explored: Conditional Random Field and k-Nearest Neighbor. The authors experimented with the model on an original dataset consisting of 8242 sentences crawled from news websites. The dataset covered four topics: economics, politics, culture, and education. The paper reported impressive results, 92.65% and 94.23% of F1 using k-Nearest Neighbor and CRF, respectively. However, it is hard to compare current studies to this result since the dataset is not in use anymore. In [93], Phuong Le-Hong applied a feature-based Maximum Entropy model, a popular discriminative model for classification, to the Vietnamese NER task. In this model, a complex feature set was designed to classify a token at a given position in a sentence. These features are divided into four subsets:

• Basic features: current word, current part-of-speech, current chunk tag, previous named entity tags.

• Word shape features: lower word, capitalized word, the word with all capitalized letters, mixed case letters, digits, date, etc.

• Basic joint features: previous word, joint of the current and previous word, next word, joint of current and next word, previous part-of-speech, joint of current and previous part-of-speech, next part-of-speech, joint of current and next part-of-speech, joint of previous and next part-of-speech, joint of the current word and previous named entity tag.

• Regular expression types: current regular expression type, previous regexp type, etc.

For the tag decoding, the author used the Viterbi algorithm and reported that a significant improvement can be achieved by combining two decoding directions, both forward decoding

and backward decoding. The author trained the model on the VLSP-2016 dataset containing 16858 tagged sentences, totaling 386520 words, with four entity types: person, location, organization, and miscellaneous. The model obtained an impressive performance, 89.66% of F1 on the development set. In [92], Thai-Hoang Pham and Phuong Le-Hong introduced two models, a word-based model and a character-based model, which share the same architecture based on Bi-LSTM and CRF. In the first, word-based model, the authors studied four types of word embeddings: (1) random vectors, (2) skip-gram vectors, (3) a combination of skip-gram vectors with CNN-generated word features, and (4) a combination of skip-gram vectors with LSTM-generated word features. In the second, character-based model, they just used random vectors, since they claimed that the size of the Vietnamese character set is relatively small. According to their experiments on the VLSP-2016 dataset, the word-based model with skip-gram vectors combined with CNN-generated word features obtained the better results, about 88.38% of F1. In the follow-up paper [116], these authors addressed the task of NER by using a 2-layer Bi-LSTM network with a softmax classifier on top instead of a CRF layer. The authors report a series of experiments on the VLSP-2016 dataset, starting with a vanilla LSTM, which obtained 65.80% of F1. Then, a Bi-LSTM with a dropout of 0.5 showed better performance, 74.02% of F1. Finally, they used additional features including part-of-speech, chunk, case-sensitivity, and regular expression features and obtained a significant improvement, 92.05% of F1. In a recent paper [75], Pham Quang Nhat Minh introduced a feature-based model that combines word, word-shape, PoS, chunk, Brown-cluster-based, and word-embedding-based features in a CRF model. The pre-trained word embedding used in this model was generated with GloVe [46]. The model was also evaluated on the VLSP-2016 dataset and got an impressive performance of 93.93% F1.

3.2.5 Related Work on Russian Named Entity Recognition

In [101], Rinat Gareev et al. introduced one of the early datasets for Russian Named Entity Recognition, annotated manually with two entity types, (1) person and (2) organization, on which the authors evaluated two models: (1) a knowledge-based model and (2) a feature-based model. In the first model, three types of knowledge resources were utilized, (1) name dictionaries, (2) local context indicators and patterns, and (3) document-level rules, corresponding to the three stages of the model: (1) dictionary matching, (2) context pattern matching, and (3) document-wide analysis of entity names. The second model is based on a third-party implementation of Conditional Random Fields2 with Viterbi inference. The

2http://mallet.cs.umass.edu

knowledge-based and CRF-based models achieved 62.17% and 75.05% of F1, respectively. In [126], Valentin Malykh and Alexey Ozerin addressed the NER task by jointly training a character-based LSTM model to predict the character-level NER tag and the next character. However, the achieved performance on Gareev's dataset [101] was not impressive, 62.49% of F1. In [128], Valerie Mozharova and Natalia Loukachevitch proposed a two-stage approach to the Russian NER task. In this approach, named entities are first extracted by a trained machine learning method. Then, the statistics of the extracted labels are collected and transformed into an additional feature set, which is used for training a new classifier together with the baseline feature set. Three types of two-stage features were explored: (1) the previous history, (2) the document statistics, and (3) the global statistics. These types correspond to the context levels used for the statistics. The dataset used for model evaluation was extended from the Person-1000 dataset3 by adding two entity types, ORG and LOC. The model obtained the best performance, 92.92%, by combining all three two-stage feature types. In the follow-up paper [127], these authors addressed the Russian NER task by utilizing a CRF model. Here, the input text was preprocessed with a morphological analyzer to extract useful features such as part of speech, gender, number, and case. Features for each token are grouped into three categories: token features, context features, and lexicon-based features. The model was tested on two Russian datasets, NE3 and NE54, and obtained impressive results of 91.71% F1 and 89.93% F1. Besides, the authors also explored different tagging schemes, IO and BIO, and found that the BIO tagging scheme gave the better result, by about 1% of F1. In the recent paper [129], Vitaly Romanov and Albina Khusainova evaluated the benefits of incorporating morphology into word embeddings for sequence labeling tasks, including Part of Speech, Chunking, and NER, with a focus on the Russian language. They studied two morphology-based word embedding models, comparing them with other models such as Skip-gram. The model used in the experiments was a CNN-CRF model. The results show that there was no remarkable difference between these word embeddings; FastText was slightly better across the three tasks. The authors also compared FastText with BERT [44] and reported that BERT completely outperformed it on all tasks.

3.3 Tagging Schemes

In some sequence labeling tasks, the goal is to find and classify each phrase, a small group of words standing together as a conceptual unit, in the input text into one class (a.k.a. tag or label). In contrast, most modern machine learning systems are based on word-level models, which classify every word of the input text into a predefined class. This leads to a division of the original classes into sub-classes; such a division is called a tagging scheme.

3http://ai-center.botik.ru/Airec/index.php/ru/collections/28-persons-1000 4http://labinform.ru/pub/named_entities/descr_ne.htm

        Le     An     Nhien  studies  in  MIPT   in  Moscow  Oblast  .
IOB     I-PER  I-PER  I-PER  O        O   I-ORG  O   I-LOC   I-LOC   O
IOB2    B-PER  I-PER  I-PER  O        O   B-ORG  O   B-LOC   I-LOC   O
IOBES   B-PER  I-PER  E-PER  O        O   S-ORG  O   B-LOC   E-LOC   O

Table 3.1: Example of applying different tagging schemes to the same sentence.

Consider building a tagging scheme for the task of Named Entity Recognition. The goal of the NER task is to find named entities (e.g., person names, organization names, location names) in the text. For example, in the sentence:

"Le An Nhien studies in MIPT in Moscow Oblast’." we have three named entities: (1) Le An Nhien is a Vietnamese person name, (2) MIPT is a university name, and (3) Moscow Oblast’ is the location name referring to a region of Russia. To locate and tag these words, several tagging schemes are proposed such as IOB, IOB2, IOBES:

• IOB: In this scheme, I is assigned to tokens that are inside an entity. Tokens that do not belong to any entity are labeled with O. B is used for a token that begins an entity only when that entity immediately follows another entity of the same type.

• IOB2: is the same as the IOB scheme except that B is used for every entity.

• IOBES: This tagging scheme is an extended version of the IOB2 scheme with additional E and S tags. E labels the last token of an entity, and single-token entities are labeled with S instead of B.

Table 3.1 shows the tag sequences of the above sentence in these schemes.
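As an illustration of the schemes above, the following is a small Python sketch (assuming the input is already valid IOB2) that converts IOB2 tags into IOBES tags; it reproduces the last two rows of Table 3.1.

```python
def iob2_to_iobes(tags):
    """Convert a sentence tagged in the IOB2 scheme into the IOBES scheme."""
    iobes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            iobes.append(tag)
            continue
        prefix, etype = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        entity_continues = next_tag == "I-" + etype
        if prefix == "B":
            # B stays B if the entity continues, otherwise it is a single token: S.
            iobes.append(("B-" if entity_continues else "S-") + etype)
        else:
            # I stays I inside the entity, or becomes E at its end.
            iobes.append(("I-" if entity_continues else "E-") + etype)
    return iobes

tags = ["B-PER", "I-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "I-LOC", "O"]
print(iob2_to_iobes(tags))
# ['B-PER', 'I-PER', 'E-PER', 'O', 'O', 'S-ORG', 'O', 'B-LOC', 'E-LOC', 'O']
```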

3.4 Evaluation Metrics

Most of the datasets for the NER task are highly imbalanced, with the O class dominating the others. Therefore, the Accuracy metric is not an adequate choice for model performance evaluation. Let us take an example to make this clear. Suppose the output tags for the sentence “MIPT is one of the most famous universities in Russia” are “O O O O O O O O O O”.

In this case, the accuracy is 80%, but the model did not actually detect any entity. To deal with this problem, Precision, Recall, and the F-measure are used instead of the Accuracy metric. Precision is the percentage of entities detected by the model that are correct. Recall is the percentage of entities actually present in the dataset that are detected by the model. F1 is the harmonic mean of Precision and Recall. Here the arithmetic mean is not used since Precision and Recall are ratio variables.

The harmonic mean of $n$ numbers $x_1, x_2, \ldots, x_n$ is computed as:

$$H = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}}. \quad (3.1)$$

Let us use $P, R$ to denote Precision and Recall. $F_1$ is obtained by applying formula (3.1) to Precision and Recall:

$$F_1 = \frac{2 \times P \times R}{P + R}. \quad (3.2)$$

In Equation 3.2, Precision and Recall share the same importance. In some cases, when we want to give more weight to one of them, Equation 3.3 is used. For instance, setting $\beta$ to 2 means that Recall is twice as important as Precision:

$$F_{\beta} = \frac{(\beta + 1) \times P \times R}{\beta \times P + R}. \quad (3.3)$$
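The following small Python sketch computes entity-level Precision, Recall, and the F-measure exactly as written in Equation 3.3 above (with $\beta$ rather than $\beta^2$); the counts in the usage example are made up for illustration.

```python
def precision_recall_f(num_correct, num_predicted, num_gold, beta=1.0):
    """Entity-level Precision, Recall, and F-measure as in Eq. 3.3,
    where beta weights Recall relative to Precision (beta=1 gives F1)."""
    p = num_correct / num_predicted if num_predicted else 0.0
    r = num_correct / num_gold if num_gold else 0.0
    if p == 0.0 and r == 0.0:
        return p, r, 0.0
    f = (beta + 1) * p * r / (beta * p + r)
    return p, r, f

# Example: 8 correctly predicted entities out of 10 predictions, 12 gold entities.
p, r, f1 = precision_recall_f(8, 10, 12)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")   # P=0.800 R=0.667 F1=0.727
```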

3.5 WCC-NN-CRF Models for Sequence Labeling Tasks

3.5.1 Backbone WCC-NN-CRF Architecture

The family of models proposed and studied in this dissertation has the same core backbone architecture. This architecture follows a hybrid approach and consists of three encoding sub-networks to (1) encode the semantic and grammatical features of words, (2) capture character-level word representation as well as capitalization features of words, (3) capture the meaning of words in their contexts. In addition, a CRF layer is utilized to exploit the output tag dependencies. A graphical illustration of the proposed model is shown in Fig. 3.2. We refer to this backbone model as WCC-NN-CRF in the text.

Let $V_{word}$, $V_{char}$, $V_{cap}$ be the word, character, and capitalization type vocabularies, respectively; here $|V|$ denotes the size of the vocabulary $V$. In the proposed model, three kinds of lookup tables are used to map words, characters, and capitalization types of words to dense vectors. Let us denote them as $L_{word} \in \mathbb{R}^{|V_{word}| \times d_{word}}$, $L_{char} \in \mathbb{R}^{|V_{char}| \times d_{char}}$, $L_{cap} \in \mathbb{R}^{|V_{cap}| \times d_{cap}}$. Here $d_{word}$, $d_{char}$, and $d_{cap}$ are the lengths of the dense vectors representing words, characters, and capitalization types of words, respectively. $L_{word}$ is initialized by a pre-trained word embedding (GloVe, for example). $L_{char}$ and $L_{cap}$ are initialized randomly with values drawn from a uniform distribution over [-0.5, 0.5]. All of these lookup tables are then fine-tuned during the training stage.

Figure 3.2: Proposed Model for the Sequence Labeling Task. (Diagram: the input sentence is mapped to character, word, and capitalization embeddings; character embeddings pass through a convolutional neural network and capitalization embeddings through a Bi-LSTM network; the results are concatenated with the word embeddings and fed into a Bi-LSTM network, a feed-forward network, and a CRF layer that outputs the tag sequence.)

Given an input sentence with $n$ words $x = \{x_1, \ldots, x_n\}$, it is transformed into the word, character, and capitalization index forms: $x_{word} \in \mathbb{R}^{n}$, $x_{char} \in \mathbb{R}^{n \times nb_{char}}$, $x_{cap} \in \mathbb{R}^{n}$. Here, $nb_{char}$ denotes the number of characters in a word. The padding technique is used to ensure that all words have the same number of characters. After that, the word, capitalization, and character embeddings $e_{word} \in \mathbb{R}^{n \times d_{word}}$, $e_{cap} \in \mathbb{R}^{n \times d_{cap}}$, $e_{char} \in \mathbb{R}^{n \times nb_{char} \times d_{char}}$ are created by looking up $x_{word}$, $x_{cap}$, $x_{char}$ in $L_{word}$, $L_{cap}$, and $L_{char}$, respectively. The mapping from character and capitalization embeddings to word embeddings is detailed in the two sub-sections below.
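A minimal PyTorch sketch of the three lookup tables and the index-to-embedding step described above; all sizes are illustrative, and the GloVe/FastText initialization of $L_{word}$ is only indicated in a comment.

```python
import torch
import torch.nn as nn

# Illustrative vocabulary sizes and embedding dimensions (d_word, d_char, d_cap).
V_word, V_char, V_cap = 10000, 120, 4
d_word, d_char, d_cap = 100, 30, 8

# The three lookup tables L_word, L_char, L_cap.
L_word = nn.Embedding(V_word, d_word)   # would be loaded from GloVe/FastText in practice
L_char = nn.Embedding(V_char, d_char)
L_cap = nn.Embedding(V_cap, d_cap)
nn.init.uniform_(L_char.weight, -0.5, 0.5)   # random uniform init, as in the text
nn.init.uniform_(L_cap.weight, -0.5, 0.5)

n, nb_char = 6, 12                                    # sentence length, padded word length
x_word = torch.randint(0, V_word, (1, n))             # [1, n]
x_char = torch.randint(0, V_char, (1, n, nb_char))    # [1, n, nb_char]
x_cap = torch.randint(0, V_cap, (1, n))               # [1, n]

e_word = L_word(x_word)   # [1, n, d_word]
e_char = L_char(x_char)   # [1, n, nb_char, d_char]
e_cap = L_cap(x_cap)      # [1, n, d_cap]
print(e_word.shape, e_char.shape, e_cap.shape)
```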

Character Feature Extraction with CNN

CNN has shown its ability to capture morphological information of words such as prefixes and suffixes ([132], [144]). Hence, in the proposed model, a CNN is utilized to build representations of words from their characters. This helps to increase the tagging accuracy, especially in the case of unknown words (words that are not in the vocabulary).

Recall that $e_{char} \in \mathbb{R}^{n \times nb_{char} \times d_{char}}$ are the vector embeddings representing the characters of the input sentence. These vectors are fed into three convolutional layers with different window sizes $[1, 3]$, $[1, 4]$, and $[1, 5]$. Let us denote them as $f_1 \in \mathbb{R}^{3 \times d_{char} \times d_{conv_1}}$, $f_2 \in \mathbb{R}^{4 \times d_{char} \times d_{conv_2}}$, $f_3 \in \mathbb{R}^{5 \times d_{char} \times d_{conv_3}}$. Let $O_1 \in \mathbb{R}^{n \times nb_{char} \times d_{conv_1}}$, $O_2 \in \mathbb{R}^{n \times nb_{char} \times d_{conv_2}}$, and $O_3 \in \mathbb{R}^{n \times nb_{char} \times d_{conv_3}}$ denote the outputs of these convolutional layers. Each position $(i, j)$ of the $t$-th slice of these outputs is computed using Equation 3.4. The proposed model uses a $(1, 1)$ stride along with the padding technique to ensure that the outputs of the convolutional layers have the same shape as the inputs. In other words, the number of words and the number of characters remain unchanged after applying the convolutional layers.

$$O[i, j, t] = \sum_{k=0}^{d_{char}-1} \sum_{r=r_{left}}^{r_{right}} e[i, j + r, k] \times f[r + \lfloor f_w/2 \rfloor, k, t] + b[k, t], \quad (3.4)$$

where:

$$r_{left} = -\lfloor f_w/2 \rfloor, \quad (3.5)$$

$$r_{right} = \lfloor f_w/2 \rfloor - (f_w - 1) \bmod 2, \quad (3.6)$$

here $\lfloor \cdot \rfloor$ and $\bmod$ denote the floor division and modulo operations, respectively; $b$ and $f_w$ denote the biases and the filter width, respectively.

$O_1$, $O_2$, $O_3$ are then concatenated along the feature dimension to create the feature tensor $O \in \mathbb{R}^{n \times nb_{char} \times d_{sum}}$. Finally, the character-based word representations are computed by reducing the character dimension:

$$O = [O_1, O_2, O_3], \quad (3.7)$$

$$e^{char}_{word} = \mathrm{max\_pooling}(O), \quad (3.8)$$

here $[\cdot]$ denotes the concatenation operation. Fig. 3.3 shows the complete illustration of the CNN network for character feature extraction.
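The following is a rough PyTorch sketch of the character-level encoder described by Equations 3.4-3.8: three convolutions with window widths 3, 4, and 5 over the character embeddings, concatenation of their outputs, and max-pooling over the character dimension. The exact padding differs slightly from the description above (PyTorch cannot pad even-width kernels symmetrically), which does not matter here because of the pooling; all sizes are illustrative.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Character-level word encoder: three convolutions over the character
    embeddings followed by max-pooling over the character dimension."""

    def __init__(self, d_char=30, d_conv=50):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(d_char, d_conv, kernel_size=k, padding=k // 2)
            for k in (3, 4, 5)
        ])

    def forward(self, e_char):               # [batch, n, nb_char, d_char]
        b, n, nb_char, d_char = e_char.shape
        x = e_char.view(b * n, nb_char, d_char).transpose(1, 2)  # [b*n, d_char, nb_char]
        outs = []
        for conv in self.convs:
            o = torch.relu(conv(x))          # [b*n, d_conv, ~nb_char]
            outs.append(o.max(dim=2).values) # max-pool over characters
        e_char_word = torch.cat(outs, dim=1) # concatenate the three feature maps
        return e_char_word.view(b, n, -1)    # [batch, n, 3*d_conv]

cnn = CharCNN()
e_char = torch.randn(2, 6, 12, 30)           # 2 sentences, 6 words, 12 chars each
print(cnn(e_char).shape)                     # torch.Size([2, 6, 150])
```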

Capitalization Feature and Sentence Context Extraction with Bi-LSTM

In order to achieve good performance, some additional features are often utilized, such as part of speech, gazetteers, the capitalization feature, and the chunking feature. Among these, the capitalization feature is not an external feature, and it is very useful for text processing tasks, especially for the NER task, because entities such as person names, names of locations, and names of organizations are often combinations of several capitalized words in a sentence. Hence, in the proposed model, a Bi-LSTM network is used to capture the capitalization feature of words in their left and right contexts (see Fig. 3.4). Input sentences are transformed into the capitalization form, which encodes each word with one integer value representing the capitalization of that word. For example, the sentence “MIPT is located in Dolgoprudny.” will be encoded as “0 1 1 1 2”. Here the value 0 denotes a word with all characters uppercase, the value 1 means that all characters of the word are lowercase, and the value 2 is used for words starting with a capitalized character. All capitalization types are shown in Table 3.2.

Figure 3.3: Character Features Extraction using CNN. (Diagram: the character vector representations of each word are passed through the convolutional layers and the results are concatenated.)

Code  Capitalization Type  Description
0     UPPER_CASE           All characters are uppercase
1     lower_case           All characters are lowercase
2     First_Cap            The first letter is capitalized
3     Otherwise            Words that do not belong to the three formats above

Table 3.2: Capitalization types of words.
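A small Python sketch of the capitalization encoding from Table 3.2; the handling of punctuation-only or digit-only tokens under code 3 is an assumption of this sketch.

```python
def capitalization_code(word):
    """Map a word to its capitalization type (Table 3.2):
    0 = all uppercase, 1 = all lowercase, 2 = first letter capitalized, 3 = other."""
    if word.isupper():
        return 0
    if word.islower():
        return 1
    if word[:1].isupper() and word[1:].islower():
        return 2
    return 3

sentence = ["MIPT", "is", "located", "in", "Dolgoprudny"]
print([capitalization_code(w) for w in sentence])  # [0, 1, 1, 1, 2]
```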

Recall from the beginning of Section 3.5.1 that $e_{cap} \in \mathbb{R}^{n \times d_{cap}}$ are the capitalization vector representations of the input sentence. These vectors are fed into a Bi-LSTM network to capture the sentence-level context of words and produce the capitalization-based vector representation of words:

$$e^{cap}_{word} = [\overrightarrow{O}_{lstm}(e_{cap}), \overleftarrow{O}_{lstm}(e_{cap})], \quad (3.9)$$

here $\overrightarrow{O}_{lstm}$, $\overleftarrow{O}_{lstm}$ denote the outputs of the forward and backward LSTM layers, respectively; $[\cdot]$ is the concatenation operation. Once the three kinds of word vector representations $e_{word}$, $e^{char}_{word}$, $e^{cap}_{word}$ are computed, the word vector representation is just their concatenation:

$$e = [e_{word}, e^{char}_{word}, e^{cap}_{word}]. \quad (3.10)$$

$e$ is then fed into another Bi-LSTM network to produce the final word representation in the sentence context:

$$r_{word} = [\overrightarrow{O}_{lstm}(e), \overleftarrow{O}_{lstm}(e)]. \quad (3.11)$$

Let $nb_{tags}$ be the number of tags predefined for the given task. A feed-forward neural network is used to produce the scores of words in the input sentence with respect to the tags, $S_{ffnn}[n, nb_{tags}]$. Here $S_{ffnn}[i, j]$ represents the score of the $j$-th tag for the $i$-th word.
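A minimal PyTorch sketch of Equations 3.9-3.11 and the feed-forward scoring layer: the capitalization Bi-LSTM, the concatenation of the three word representations, the sentence-level Bi-LSTM, and the linear layer producing $S_{ffnn}$. All dimensions are illustrative.

```python
import torch
import torch.nn as nn

class WordContextScorer(nn.Module):
    """Concatenate word, character-based, and capitalization-based embeddings,
    run a Bi-LSTM over the sentence, and score each word against the tag set."""

    def __init__(self, d_word=100, d_char_word=150, d_cap=8,
                 d_cap_lstm=16, d_lstm=128, nb_tags=9):
        super().__init__()
        # Bi-LSTM over capitalization embeddings (Eq. 3.9).
        self.cap_lstm = nn.LSTM(d_cap, d_cap_lstm, batch_first=True,
                                bidirectional=True)
        # Bi-LSTM over the concatenated word representation (Eq. 3.11).
        d_in = d_word + d_char_word + 2 * d_cap_lstm
        self.word_lstm = nn.LSTM(d_in, d_lstm, batch_first=True,
                                 bidirectional=True)
        self.ffnn = nn.Linear(2 * d_lstm, nb_tags)   # S_ffnn scores

    def forward(self, e_word, e_char_word, e_cap):
        e_cap_word, _ = self.cap_lstm(e_cap)                      # Eq. 3.9
        e = torch.cat([e_word, e_char_word, e_cap_word], dim=-1)  # Eq. 3.10
        r_word, _ = self.word_lstm(e)                             # Eq. 3.11
        return self.ffnn(r_word)                                  # [batch, n, nb_tags]

scorer = WordContextScorer()
scores = scorer(torch.randn(2, 6, 100), torch.randn(2, 6, 150), torch.randn(2, 6, 8))
print(scores.shape)  # torch.Size([2, 6, 9])
```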

Tag Dependencies Extraction with CRF

The two sub-sections above describe the ways character-level and capitalization features are extracted in the proposed model. The word representation is a concatenation of a pre-trained word embedding, the character-level word embedding produced by the CNN network, and the capitalization embedding generated by the Bi-LSTM network. This word representation is then fed into the second Bi-LSTM network to produce the final word representation in the left and right contexts. From here, there are two ways to produce the final tag sequence. The first is to feed the final word representation directly into a softmax layer to compute the probabilities of tags. In this case, predicting the output tag depends only on the input sequence, not on the previously assigned tags. The second is to use a CRF or another recurrent neural network to make tagging decisions based on the previous and next output tags. In our experiments, using a CRF to capture tag dependencies performs better than a recurrent neural network.

Figure 3.4: Capitalization Feature Extraction using a Bi-LSTM Network. (Diagram: the input word sequence is mapped to the capitalization code sequence, embedded, and processed by forward and backward LSTMs whose outputs are concatenated.)

In addition to $S_{ffnn}$, the CRF model uses another type of score called the transition score $T$, which represents how likely one tag follows another. The score for a pair of an input sentence $x$ and a tagging sequence $y = \{y_1, \ldots, y_n\}$ is calculated by the equation below:

$$S_{crf}(x, y) = T[y_0, y_1] + \sum_{i=1}^{n} \left( S_{ffnn}[i, y_i] + T[y_i, y_{i+1}] \right), \quad (3.12)$$

where $y_0$, $y_{n+1}$ are dummy tags added to represent the beginning and the end of the tag sequence. Hence, the transition matrix has a shape of $[nb_{tags} + 2, nb_{tags} + 2]$. This transition matrix is fine-tuned during the training stage. After that, the softmax function is applied to estimate the conditional probability of the tag sequence:

$$p(y|x) = \frac{e^{S_{crf}(x, y)}}{\sum_{\hat{y} \in Y_x} e^{S_{crf}(x, \hat{y})}}, \quad (3.13)$$

here $Y_x$ denotes the set of all possible output tag sequences. During the training stage, the log-probability of the correct tag sequence is maximized. At the inference stage, the output tag sequence is the sequence that maximizes the score:

$$y^{*} = \operatorname{argmax}_{\hat{y} \in Y_x} S_{crf}(x, \hat{y}). \quad (3.14)$$
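The scoring function of Equation 3.12 can be written down directly; the sketch below (a minimal illustration, not a full CRF layer) computes the score of one (sentence, tag sequence) pair given emission scores and a transition matrix with the two extra dummy tags. Training with the log-likelihood of Equation 3.13 and Viterbi decoding for Equation 3.14 are omitted.

```python
import torch

def crf_sentence_score(s_ffnn, transitions, tags, start_tag, end_tag):
    """Score of a (sentence, tag sequence) pair as in Eq. 3.12.
    s_ffnn: [n, nb_tags] emission scores; transitions: [nb_tags+2, nb_tags+2]."""
    n = s_ffnn.size(0)
    padded = [start_tag] + list(tags) + [end_tag]   # y_0 ... y_{n+1}
    score = torch.tensor(0.0)
    for i in range(n):
        score = score + s_ffnn[i, padded[i + 1]]               # S_ffnn[i, y_i]
        score = score + transitions[padded[i], padded[i + 1]]  # T[y_{i-1}, y_i]
    score = score + transitions[padded[n], padded[n + 1]]      # transition to the end tag
    return score

nb_tags = 3                              # e.g., B-PER, I-PER, O
start_tag, end_tag = nb_tags, nb_tags + 1
s_ffnn = torch.randn(5, nb_tags)
transitions = torch.randn(nb_tags + 2, nb_tags + 2)
print(crf_sentence_score(s_ffnn, transitions, [0, 1, 2, 2, 0], start_tag, end_tag))
```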

3.5.2 Language Model-based Architecture

To further improve the system's performance, state-of-the-art language models such as BERT [44] and ELMo [70] are leveraged to enhance the quality of the input word vector representations. These language models can be integrated into the model from Section 3.5.1 in two ways: (1) word vector representations are extracted from the language model and used as additional input features, or (2) the language model is directly integrated into the architecture and fine-tuned during the training stage. According to our experiments, treating the language model as a part of the overall architecture and fine-tuning it during training for the NER task works better than just using it as an additional feature. This section presents two language model-based architectures for the NER task: (1) WCC-NN-CRF + ELMo to address the monolingual NER task, and (2) WCC-NN-CRF + BERT to deal with the multilingual task.

WCC-NN-CRF with ELMo

This model is an extension of the model described in Section 3.5.1, in which two word embedding types are used: (1) a context-free word embedding such as GloVe, and (2) a context-based word embedding such as ELMo.

WCC-NN-CRF with BERT-based Multilingual Model

This section describes another proposed model for multilingual tasks. The model is a combination of the WCC-NN-CRF model described in Section 3.5.1 with BERT [44]. A BERT model5 pre-trained on a large text corpus covering more than 100 languages is fine-tuned to produce word vector representations specific to the NER task. Besides that, a character CNN network is still used to generate word vector representations from characters to better represent Out-Of-Vocabulary (OOV) words. The final word vector representations are the combination of the BERT output and the character-level word vectors. A Bi-LSTM network is then utilized to produce the word vectors in the sentence context. Finally, a feed-forward network followed by a CRF layer is used to estimate the probabilities of the tags. See Fig. 3.5 for more details.

5https://github.com/google-research/bert

Figure 3.5: BERT-based Multilingual NER Model. (Diagram: the input sentence goes through a character extractor into a convolutional neural network and through the BERT tokenizer into BERT; the two outputs are concatenated and passed through a Bi-LSTM network, a feed-forward network, and a CRF layer to produce the tag sequence.)
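As an illustration of the BERT-based input encoding (a rough sketch, not the dissertation's original code), the snippet below uses the HuggingFace transformers library and the bert-base-multilingual-cased checkpoint to obtain word-level vectors; averaging sub-word vectors per word is an assumption of this sketch, since the pooling strategy is not specified above.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

words = ["MIPT", "is", "located", "in", "Dolgoprudny"]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]      # [num_subword_tokens, 768]

# Pool sub-word vectors back to word level (mean over the pieces of each word).
word_ids = enc.word_ids()                           # sub-word -> word alignment
word_vectors = []
for idx in range(len(words)):
    piece_idx = [i for i, w in enumerate(word_ids) if w == idx]
    word_vectors.append(hidden[piece_idx].mean(dim=0))
word_vectors = torch.stack(word_vectors)            # [len(words), 768]
print(word_vectors.shape)
```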

3.6 Application of WCC-NN-CRF Models for Named Entity Recog- nition

3.6.1 Overview of Named Entity Recognition Task

Named Entity Recognition (NER) is a subtask of Information Extraction that aims to locate and classify tokens in text documents into predefined classes (a.k.a. tags) such as person names, names of locations, names of organizations, expressions of time, etc. NER is often one of the first steps in an NLP pipeline and plays an important role in many NLP tasks, including but not limited to Question Answering, Coreference Resolution, Topic Modeling, and Text Summarization.

3.6.2 Datasets and Pre-trained Word Embeddings

Many datasets have been published for evaluating the performance of NER systems. In 1995, the sixth in a series of Message Understanding Conferences (MUC 6)6 aimed at evaluating information extraction systems applied to NLP tasks such as Named Entity Recognition and Coreference Resolution.

6https://cs.nyu.edu/cs/faculty/grishman/muc6.html

In 2002 and 2003, the shared tasks ([28], [29]) at the Conference on Computational Natural Language Learning focused on evaluating the performance of language-independent NER systems. Jin-Dong Kim et al. [48] introduced a shared task on finding bio-medical terms in paper abstracts and classifying them into five categories: protein, RNA, DNA, cell line, and cell type. In 2017, the shared task “Emerging and rare entity recognition” organized by Leon Derczynski et al. [67] aimed at evaluating the performance of models when facing rare, previously unseen entities. To provide a comprehensive analysis of the WCC-NN-CRF models, six datasets were selected. These datasets cover four languages: Vietnamese, Russian, English, and Chinese:

• CoNLL-2003 [29]: Datasets for NER task published by Erik F. Tjong Kim Sang and Fien De Meulder at the Conference on Computational Natural Language Learning 7 in 2003.

• VLSP-2016: The Vietnamese dataset8 built and copyrighted by the Vietnamese Language and Speech Processing community.

• Gareev's dataset: A Russian dataset by Rinat Gareev et al. [101]. This is a small dataset with only two entity types: Person and Organization.

• Named Entity 3 (NE3) and Named Entity 5 (NE5) [128]: Two Russian datasets created by the Information Research Laboratory9. NE5 is annotated with 5 entity types: Person, Location, Organization, Media, and Geopolit. NE3 is a variant of NE5 which combines Media with Organization and Geopolit with Location.

• MSRA dataset10: A Chinese NER dataset labeled by the Natural Language Computing group at Microsoft Research Asia.

The complete statistics of these datasets are reported in Table 3.3. To initialize the lookup tables that map input words to dense vector representations, four pre-trained word embeddings are used, corresponding to the four languages: Vietnamese, Russian, English, and Chinese:

• Word2vecvn2016 [134]: The Vietnamese pre-trained word embedding published by Xuan-Son Vu.

• Lenta: The Russian pre-trained word embedding generated by training a FastText11 model on the Lenta corpus12.

7https://www.clips.uantwerpen.be/CoNLL-2003/
8http://vlsp.org.vn/VLSP-2016/eval/ner
9http://labinform.ru
10https://www.microsoft.com/en-us/download/details.aspx?id=52531
11https://fasttext.cc/
12https://github.com/yutkin/lenta.ru-news-dataset

Dataset                      Language    Per              Org              Loc              Misc            Geo   Med
NE5                          Russian     10623            7032             3143             -               4103  1509
NE3                          Russian     10623            8541             7244             -               -     -
Gareev's                     Russian     485              1311             -                -               -     -
VLSP-2016 (train/test)       Vietnamese  7480/1294        1210/274         6244/1377        282/49          -     -
MSRA (train/test)            Chinese     17610/1973       20584/1330       36616/2863       -               -     -
CoNLL-2003 (train/dev/test)  English     6600/1842/1617   6321/1341/1661   7140/1181/1668   3438/1010/702   -     -

Table 3.3: Statistics of NER datasets.

• Glove6B100d [46]: Vector representations for English words released by Jeffrey Pennington, Richard Socher, and Christopher D. Manning, researchers from Stanford University.

• Wiki100utf8 13: The Chinese pre-trained word embedding generated by using word2vec model on Chinese wiki corpus.

3.6.3 Evaluation of backbone WCC-NN-CRF Model

All experiments were run on an NVIDIA GeForce GTX 1080 Ti GPU with 11GB of memory. Training on each of the datasets mentioned above takes from 1 to 3 hours. Unless specified otherwise, datasets are split into training, validation, and test sets in the ratio 60% : 20% : 20%. For evaluation, the conlleval script14 was used; it is a Perl script used to evaluate NER systems at the CoNLL-2003 shared task by computing the P, R, and F1 metrics. The experiments are divided into three groups:

• The experiments in the first group aimed at evaluating the backbone WCC-NN-CRF model on six datasets.

• The goal of the second one is to analyze the effect of different input features on the backbone model performance in different languages.

• The last group of experiments aimed to find out how well the backbone model deals with small datasets.

13https://github.com/zjy-ucas/ChineseNER
14https://www.clips.uantwerpen.be/conll2000/chunking/output.html

Dataset  Metric  Per    Org    Loc    Geo    Med    Overall
         P       97.13  90.35  93.92  95.89  90.06  94.33
NE5      R       98.43  91.75  91.67  98.08  90.06  95.29
         F1      97.78  91.04  92.78  96.97  90.06  94.81
         P       98.79  97.15  98.16  -      -      98.09
NE3      R       99.24  97.83  97.59  -      -      98.34
         F1      99.02  97.49  97.87  -      -      98.21

Table 3.4: Tagging performance of backbone WCC-NN-CRF model for Named Entity 5 and Named Entity 3 datasets.

Metric  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Overall
P       88.11   89.66   88.30   86.02   83.24   87.07
R       93.42   89.66   89.61   91.32   88.00   90.40
F1      90.69   89.66   88.95   88.59   85.56   88.69

Table 3.5: Tagging performance of backbone WCC-NN-CRF model on Gareev's dataset using 5-fold cross-validation.

The first group of experiments starts with evaluating the backbone WCC-NN-CRF model on two Russian datasets: NE3 and NE5. The experimental results show that the model achieved good performance. Table 3.4 shows the evaluation metrics calculated on the test sets of the NE3 and NE5 datasets. The next experiment evaluates the proposed model on Gareev's dataset. The k-fold cross-validation method is used for this experiment due to the small size of the dataset. The experimental results, shown in Table 3.5, point out that the model still demonstrates good performance when trained on a small dataset. To further improve the model performance on Gareev's dataset, it was pre-trained on the NE3 dataset. This helped to improve the model performance by about 3%, from 88.68% to 91.10% F1 (refer to Table 3.6 for the complete comparison). After that, the proposed model is tested on Vietnamese and English datasets: VLSP-2016 and CoNLL-2003. Both of them contain additional POS and Chunk features. The model is evaluated both with and without POS and Chunk features. The results of these experiments are shown in Tables 3.7 and 3.8. Table 3.9 shows the proposed model performance in comparison with some cutting-edge models for Vietnamese. Evaluation on the MSRA dataset is the last experiment in the first group. Because of the difficulty in finding an official Chinese dataset for the NER task, the MSRA dataset, a Chinese dataset designed for the task of word segmentation, is used. In this experiment, information about word segmentation is leveraged instead of character features. The testing result is shown in Table 3.10.

Metric  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Overall
P       93.63   88.53   88.20   90.83   87.47   89.73
R       92.60   91.73   93.18   91.60   93.71   92.56
F1      93.11   90.10   90.62   91.21   90.48   91.10

Table 3.6: Tagging performance of backbone WCC-NN-CRF on Gareev's dataset with pre-training on NE3.

Model        Metric  Per    Org    Misc   Loc    Overall
             P       95.35  73.79  100    88.58  90.61
WCC-NN-CRF   R       91.96  55.47  79.59  89.41  87.25
             F1      93.63  63.33  88.64  88.99  88.90
WCC-NN-CRF   P       96.43  90.17  100    94.15  94.91
+ pos        R       95.98  77.01  87.76  95.65  93.96
+ chunk      F1      96.20  83.07  93.48  94.89  94.43

Table 3.7: Tagging performance of backbone WCC-NN-CRF on VLSP-2016 dataset.

As described in Section 3.5.1, the final word vector representation is a combination of four components: (1) the word vector initialized by an external word embedding, (2) character features extracted by a CNN network, (3) capitalization features produced by a Bi-LSTM network, and (4) additional POS and Chunk features. To study the effect of each component on the model performance in different languages, four variants of the full model were tested:

Model        Metric  Per    Org    Misc   Loc    Overall
             P       97.75  90.16  77.50  89.74  90.44
WCC-NN-CRF   R       94.12  86.90  83.98  94.16  90.76
             F1      95.90  88.50  80.61  91.90  90.60
WCC-NN-CRF   P       97.10  90.01  80.61  90.35  90.91
+ pos        R       95.24  88.16  83.41  94.65  91.52
+ chunk      F1      96.16  89.08  81.99  92.45  91.22

Table 3.8: Tagging performance of backbone WCC-NN-CRF on CoNLL-2003 dataset.

Model                      P      R      F1
Le et al. (2016) [93]      89.56  89.75  89.66
Pham et al. (2017) [116]   91.09  93.03  92.05
Pham et al. (2017) [117]   92.76  93.07  92.91
Nguyen et al. (2018) [75]  93.87  93.99  93.93
WCC-NN-CRF                 94.91  93.96  94.43

Table 3.9: Tagging performance of backbone WCC-NN-CRF on VLSP-2016 dataset com- pared with some state-of-the-art models.

Metric  Per    Org    Loc    Overall
P       91.18  89.85  93.51  91.99
R       94.41  91.62  94.82  93.92
F1      92.77  90.73  94.16  92.95

Table 3.10: Tagging performance of backbone WCC-NN-CRF on MSRA dataset.

• M1: Backbone WCC-NN-CRF without character and capitalization components: In this model, the word vector representation is just a dense vector initialized by an external pre-trained model.

• M2: Backbone WCC-NN-CRF without capitalization component: This variant combines a word representation with a CNN network to extract character features.

• M3: Backbone WCC-NN-CRF: Full backbone model with character and capitalization components.

• M4: The fourth variant is the full model with additional POS and Chunk features.

Three datasets are used in these experiments: VLSP-2016, CoNLL-2003, and Gareev's dataset. Among them, the third one lacks additional POS and Chunk features. Because of that, a third-party system, UDPipe15, is utilized to generate these features. The results are presented in Fig. 3.6. According to the experimental results, character features extracted by the CNN network play an important role in enhancing the model performance: about 12%, 5%, and 15% for the VLSP-2016, CoNLL-2003, and Gareev's datasets, respectively. Adding a Bi-LSTM network to extract the capitalization features of words yielded a further improvement of about 3%. It is interesting to find from these experiments that using POS and Chunk features helps to significantly boost the model performance, about 5%, when testing on the VLSP-2016 dataset, whereas the improvement is almost negligible on the CoNLL-2003 dataset. This partially shows that syntactic features play a more important role for the NER task in Vietnamese than in English.

15http://ufal.mff.cuni.cz/udpipe: The tagging system published by Milan Straka and Jana Strakov at Institute of Formal and Applied Linguistics, Charles University, Czech Republic.

Figure 3.6: Importance of different components for performance of backbone WCC-NN-CRF model across the datasets. M1: WCC-NN-CRF without Character CNN and Capitalization Bi-LSTM. M2: WCC-NN-CRF without Capitalization Bi-LSTM. M3: WCC-NN-CRF. M4: WCC-NN-CRF with POS and Chunk features.

In the last group of experiments, the backbone WCC-NN-CRF model was tested on different sizes of training data to evaluate how the model performs on small data and to analyze the effectiveness of the main architecture components. Each dataset, therefore, was split into five pairs of training and validation sets that contain 100, 200, 500, 800, and 1000 entities per entity type. The ratio of training to validation data was set at 4:1. First, the proposed model was tested on four datasets: CoNLL-2003, VLSP-2016, NE3, and MSRA. The achieved results show that the WCC-NN-CRF model can obtain acceptable performance, about 70% F1, on almost all given datasets using only 80 samples for training and 20 samples for validation. The performance on the Chinese dataset is lower than on the others due to the lack of capitalization and character features. The proposed model with 800 samples for training and 200 samples for validation nearly reaches the performance achieved by training on the full datasets. See Fig. 3.7 for more details. Second, we tested the four variants, described in the second group of experiments, on the CoNLL-2003 dataset. According to the achieved results (see Fig. 3.8), character features are very useful, especially when training with small data. In addition, using additional features such as POS and Chunk also helps to boost the performance, but the improvement is not really impressive.

Figure 3.7: Five-run average tagging performance of the backbone WCC-NN-CRF with different amounts of training data across the datasets. At each data point, the vertical line shows the range from the lowest to the highest F1 of five runs.

3.6.4 Evaluation of ELMo-based WCC-NN-CRF model

In this section we evaluate the ELMo-based (ELMo WCC-NN-CRF) model described in Section 3.5.2 on the CoNLL-2003 dataset and two Russian datasets, NE3 and Gareev's dataset. The performance of the model compared with some published models is shown in Tables 3.11 and 3.12. By combining three kinds of word embeddings, (1) a context-free word embedding, (2) a context-based word embedding, and (3) a character-based word embedding generated by the CNN, ELMo WCC-NN-CRF obtained 92.91%, 99.19%, and 92.27% of F1 on the Gareev's, NE3, and CoNLL-2003 datasets, respectively. These results are higher than those of WCC-NN-CRF and much higher than those of some previous models.

3.6.5 Evaluation of BERT-based Multilingual WCC-NN-CRF Model

Performance of BERT-Multi WCC-NN-CRF was evaluated on Vietnamese, English, and Russian datasets. The obtained results, compared with WCC-NN-CRF and ELMo WCC-NN-CRF, are shown in Table 3.13. According to this experiment, ELMo WCC-NN-CRF has the best performance.

The BERT-Multi WCC-NN-CRF performance is not as high as we expected. Using BERT-base trained on multilingual corpora might not be better than using ELMo fine-tuned on a specific language. WCC-NN-CRF is already a complex model with many sub-networks (CNN, Bi-LSTM, FFNN, CRF). Integrating BERT into such a complex model may make it hard to fine-tune when training on a limited number of training samples.

Figure 3.8: Five-run average tagging performance of variants of the backbone WCC-NN-CRF model on CoNLL-2003 dataset with different amounts of training samples. At each data point, the vertical line shows the range from the lowest to the highest F1 of five runs.

                                              Gareev's                 NE3
Model                                         PER    ORG    OVERALL    PER    ORG    LOC    OVERALL
Gareev et al. [101] (Knowledge-based model)   79.30  55.48  62.17      -      -      -      -
Gareev et al. [101] (CRF-based model)         84.84  71.31  75.05      -      -      -      -
Malykh et al. [126] (Character-based LSTM)    92.89  69.14  62.49      -      -      -      -
Mozharova et al. [127] (CRF-based model)      -      -      -          96.08  83.84  94.57  91.71
Mozharova et al. [128] (Feature-based model)  -      -      -          97.21  95.21  85.60  92.92
Romanov et al. [129] (FastText-CNN-CRF)       -      -      -          -      -      -      95.00
Romanov et al. [129] (BERT-based model)       -      -      -          -      -      -      96.00
Our model (WCC-NN-CRF)                        96.08  85.48  88.76      99.02  97.49  97.87  98.21
Our model (ELMo WCC-NN-CRF)                   98.70  90.32  92.91      99.90  99.05  99.30  99.17

Table 3.11: F1-score of ELMo WCC-NN-CRF model on Russian datasets compared with some published models.

Model                        P      R      F1
Huang et al. (2015) [147]    -      -      90.10
Strubell et al. (2017) [26]  -      -      90.54
Xu et al. (2017) [74]        92.13  89.61  90.85
Passos et al. (2014) [8]     -      -      90.90
Lample et al. (2016) [36]    -      -      90.94
Luo et al. (2015) [34]       91.50  91.40  91.20
Ma et al. (2016) [135]       -      -      91.21
Wang et al. (2017) [19]      91.39  91.09  91.24
Devlin et al. (2018) [44]    -      -      92.80
Akbik et al. (2018) [2]      -      -      93.09
ELMo WCC-NN-CRF              92.04  92.50  92.27

Table 3.12: Tagging performance of ELMo WCC-NN-CRF on CoNLL-2003 dataset com- pared with some state-of-the-art models.

                       VLSP-2016 (Vietnamese)   CoNLL-2003 (English)   NE3 (Russian)
Model                  P      R      F1         P      R      F1       P      R      F1
WCC-NN-CRF             90.61  87.25  88.90      90.44  90.76  90.60    98.08  98.34  98.21
ELMo WCC-NN-CRF        -      -      -          92.04  92.50  92.27    99.05  99.30  99.17
BERT-Multi WCC-NN-CRF  91.24  90.34  90.79      89.27  91.91  90.47    99.05  99.24  99.15

Table 3.13: Tagging performance of Multilingual BERT-based WCC-NN-CRF model on VLSP-2016, CoNLL-2003, and NE3.

3.7 Application of WCC-NN-CRF Model for Sentence Boundary Detection

3.7.1 Introduction to the Sentence Boundary Detection Task

Sentence Boundary Detection (SBD) is the NLP problem of deciding where sentences begin and end in an unpunctuated text. It is one of the first stages in the preprocessing pipeline for NLP tasks such as Named Entity Recognition, Part-of-Speech tagging, Spell Correction, and Grammar Error Correction. In addition, SBD is also used in voice-enabled chatbot systems. Unlike a simple voice-enabled chatbot, where each user's utterance is treated as one sentence, in a modern one an input utterance can be composed of several sentences. This is a more natural way of communication but makes understanding the user input harder. To handle this, an SBD module is integrated into the Natural Language Understanding component to split the user's utterance into punctuated sentences. These sentences are then used as input to downstream modules such as Intent Detection, Topic Classification, and Sentiment Analysis (see Fig. 3.9).

Figure 3.9: Annotating step of a voice-enabled chatbot. (Diagram: a multi-sentence user utterance is first split into punctuated sentences by the SBD module and then passed to downstream modules such as sentence rewriting, NER, Topic Classification, Sentiment Analysis, and Toxic Detection.)

One approach to the SBD task is to use structured prediction models such as Maximum Entropy Modeling and Conditional Random Fields (CRF). In [49], Jing Huang and Geoffrey Zweig introduced a Maximum Entropy-based model for annotating spontaneous conversational speech with punctuation marks. Wei Lu and Hwee Tou Ng, in [130], proposed a CRF-based model that jointly performs punctuation prediction together with sentence boundary and sentence type prediction on speech utterances. In 2013, Nicola Ueffing, Maximilian Bisani, and Paul Vozila [81] also built auto-punctuation models which combine various kinds of textual features using Conditional Random Fields.

Another approach is to treat the SBD task as a machine translation task. This approach has received quite a lot of attention from researchers ([39], [138], [112]). More recently, as deep learning models for sequence-to-sequence tasks in general and machine translation in particular have achieved impressive results, sequence-to-sequence-based models have also been applied to the SBD task [18]. Sequence-to-sequence models are not only able to deal with inputs and outputs of different lengths but can also be trained in an end-to-end manner. Furthermore, this class of models has been proven effective for many natural language processing tasks such as machine translation, question answering, and text summarization. Fig. 3.10 shows an example of solving the SBD task using the sequence-to-sequence approach. However, training a sequence-to-sequence model needs a huge amount of training data and consumes a lot of time. In addition, the number of model parameters has to be large because of the encoder-decoder architecture.

Figure 3.10: Example of the sequence-to-sequence-based approach to the task of sentence boundary detection.

Figure 3.11: Example of the sequence-labeling-based approach to the task of SBD. B-E, B-S, B-Q denote the beginning of an exclamation sentence, a statement sentence, and a question, respectively.

The third approach is to treat the SBD task as a sequence labeling task and apply deep neural networks. In [86], Ottokar Tilk and Tanel Alumae proposed to use an LSTM for restoring periods and commas in speech transcripts. However, in that work, only the past contextual input is leveraged. In [52], Kaituo Xu et al. introduced a hybrid model that utilizes a Bi-LSTM network to capture both the left and right context of the input and a CRF layer to condition on the output sequence of tags. Inspired by these works, we first reformulate the SBD task as a sequence labeling task and then apply a variant of ELMo WCC-NN-CRF. The combination of deep neural networks for representation learning with structured prediction models for conditioning on the output structure has been proven to be very useful for the sequence labeling task ([45], [19], [135], [136], [66]). Moreover, training a sequence tagging model is often less time-consuming and requires fewer parameters than training a corresponding sequence-to-sequence model. An example of the SBD task in the sequence tagging formulation is shown in Fig. 3.11. Each word is labeled with a tag; here B-E, B-S, and B-Q denote the beginning of an exclamation sentence, a statement sentence, and a question, respectively.

3.7.2 Sentence Boundary Detection as a Sequence Labeling Task

Let $x$ be the unpunctuated text consisting of $n$ tokens output by an automatic speech recognition device:

$$x = \{x_1, x_2, \ldots, x_n\}. \quad (3.15)$$

Let $t$ be the list of $k$ predefined punctuation mark types:

$$t = \{t_1, t_2, \ldots, t_k\}, \quad (3.16)$$

corresponding to the $k$ types of sentences. The goal of the SBD task is to split $x$ into $m$ sentences $s$ by adding punctuation marks:

$$s = \{s_1, s_2, \ldots, s_m\}, \quad (3.17)$$

where:

$$s_i = \{x_a, x_{a+1}, \ldots, x_b, t_{s_i}\}, \quad (3.18)$$

$$\sum_{i=1}^{m} |s_i| = n + m, \quad (3.19)$$

here $s_i$ is a segment of $x$ after adding the corresponding punctuation mark $t_{s_i}$, and $s_i$ is consecutive with the previous segment in the order of $x$; $|s_j|$ denotes the length of the punctuated sentence $s_j$. We transform $s$ into the sequence $y$:

$$y = \{y_1, y_2, \ldots, y_n\}, \quad (3.20)$$

which has the same length as the input sequence $x$, by traversing each sentence in $s$, one word at a time, and creating the corresponding tag according to the rules below:

• If the current word is the first word of the sentence, replace it with the tag $t_{s_i}$ corresponding to the last element of the sentence (its punctuation mark).

• Remove the last word of the sentence.

• Replace the remaining words with tag O marking that these words are not the first one.

• Finally, concatenate all modified sentences to create the tag sequence.

We use the first word in the sentence to mark the sentence boundary instead of the last one since the first word (e.g., who, what, when, how, do, am) often contains more information for determining the type of sentence. Now the task can be reformulated. Given an unpunctuated lowercase text x consisting of n tokens:

$$x = \{x_1, x_2, \ldots, x_n\}, \quad (3.21)$$

the model needs to predict the sequence of tags:

$$y = \{y_1, y_2, \ldots, y_n\}, \quad (3.22)$$

where $y_i \in t$. Since we focus on building an SBD module for our conversational chatbot system, which is responsible for extracting two types of sentences, (1) statement sentences and (2) questions, for downstream modules, texts with more than three sentences are removed, and three tags are used:

• B-S to label the first word of a statement sentence,

• B-Q to label the first word of a question,

• O to label the other words.

Now a modification of the backbone WCC-NN-CRF model described in Section 3.5 can be used to address the SBD task.
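As an illustration of the reformulation above, the following is a small Python sketch that converts punctuated sentences into the token and tag sequences; folding exclamations into the B-S class is an assumption of this sketch, since only B-S, B-Q, and O are used in the final setup.

```python
def build_sbd_tags(sentences):
    """Construct the SBD tag sequence from punctuated sentences, following the
    rules above: the first word of each sentence gets the sentence-type tag
    (B-S for statements, B-Q for questions), the trailing punctuation mark is
    dropped, and all remaining words are tagged O."""
    tokens, tags = [], []
    for sent in sentences:
        words = sent.split()
        mark = words[-1]                       # trailing punctuation token
        tag = "B-Q" if mark == "?" else "B-S"  # exclamations folded into B-S here
        for j, word in enumerate(words[:-1]):  # drop the punctuation token
            tokens.append(word.lower())
            tags.append(tag if j == 0 else "O")
    return tokens, tags

sentences = ["hey alexa .", "that is incredible !", "when did alphazero crush stockfish ?"]
tokens, tags = build_sbd_tags(sentences)
print(tokens)
print(tags)
# ['hey', 'alexa', 'that', 'is', 'incredible', 'when', 'did', 'alphazero', 'crush', 'stockfish']
# ['B-S', 'O', 'B-S', 'O', 'O', 'B-Q', 'O', 'O', 'O', 'O']
```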

3.7.3 Evaluation of WCC-NN-CRF SBD Model

We evaluated the modified backbone WCC-NN-CRF model on two conversational datasets. These datasets consist of multi-sentence discourses and are suitable for training an SBD model to be integrated into a chatbot system:

• The Cornell Movie-Dialog dataset released by Cristian Danescu-Niculescu-Mizil and Lillian Lee [20]: a large metadata-rich collection of fictional conversations extracted from raw movie scripts. This corpus has a total of 304713 utterances, which constitute 220579 conversational exchanges between 10292 pairs of movie characters, involving 9035 characters from 617 movies.

• The DailyDialog dataset published by Yanran Li et al. [141]: a multi-turn dialog dataset that reflects everyday communication and covers various topics of daily life.

Table 3.14 shows the statistics of the datasets we generated from the Cornell Movie-Dialog and DailyDialog datasets. From now on we call them CornellMovie-Dialog and DailyDialog for short. The details of the modified backbone WCC-NN-CRF model are shown in Table 3.15. It takes less than a day to train the model on one GTX-1080-Ti GPU with 11GB of memory. The evaluation metrics used in our experiments are Precision, Recall, and F1-measure. We use conlleval, a Perl script that was used for evaluating the performance of systems participating in the CoNLL-2000 shared task, to compute these metrics. Our model achieved an impressive result on DailyDialog: the precision, recall, and F1 on the test set are 95.99%, 95.77%, and 95.88%, respectively. The results are also relatively high on CornellMovie-Dialog, about 90% F1 on the test set. The detailed results are shown in Tables 3.16 and 3.17.

Item                  CornellMovie-Dialog  DailyDialog
Number of samples     819668               99299
Total words           8565560              1139540
Number of B-S words   957835               111838
Number of B-Q words   236317               37447
Number of O words     7371408              990255

Table 3.14: SBD dataset statistics.

Layer         Component                     Size
GloVe         Word embedding                100
ELMo          Word embedding                1024
              Hidden units (FFNN)           128
Char. CNN     Char embedding                64
              Convolution 1 (filter size)   1 × 3 × 50
              Convolution 2 (filter size)   1 × 4 × 50
              Convolution 3 (filter size)   1 × 5 × 50
              Max pooling                   1 × 50
Word Bi-LSTM  Forward LSTM (hidden units)   128
              Backward LSTM (hidden units)  128
Dense layer   Hidden units                  128
CRF layer     Transition matrix             5 × 5

Table 3.15: The SBD model setup in detail.

Result on        Tag         Precision  Recall  F1
                 Questions   87.62      76.44   81.65
Development set  Statements  91.29      92.42   91.85
                 Overall     90.64      89.26   89.95
                 Questions   87.99      76.11   81.62
Test set         Statements  91.27      92.54   91.90
                 Overall     90.70      89.30   89.99

Table 3.16: Tagging performance of ELMo WCC-NN-CRF on the CornellMovie-Dialog dataset.

Result on        Tag         Precision  Recall  F1
                 Questions   95.70      94.15   94.22
Development set  Statements  96.73      96.59   96.66
                 Overall     96.48      95.97   96.22
                 Questions   95.54      93.79   94.66
Test set         Statements  96.14      96.43   96.29
                 Overall     95.99      95.77   95.88

Table 3.17: Tagging performance of ELMo WCC-NN-CRF on the DailyDialog dataset.

3.8 Conclusions

In conclusion, this chapter introduced the main concepts of the Sequence Labeling task, including the task definition and common metrics for performance evaluation. Then, related works, along with their pros and cons, were systematically summarized. Next, the family of WCC-NN-CRF hybrid deep neural network models was introduced. The idea behind this family of models is to generate word vector representations capturing as much as possible of the meaning of words in the sentence context and to use a structured prediction model to condition on the sequence of output tags. To generate word vector representations with useful features, we combine three types of word vectors: (1) a context-free word vector representation, (2) a context-based word vector representation, and (3) a character-level word vector representation. The first two are initialized from pre-trained models and fine-tuned during the training step. The last one is learned directly by a deep neural model, which is useful in the case of out-of-vocabulary words. In order to take into account the word context, a Bi-LSTM network is utilized, which has been proven to be useful in capturing both the left and right contexts of words. In addition, to condition on the output sequence of tags, we use a Conditional Random Field. To further improve the performance, the word vector representation is enhanced by combining both context-free and context-based models. This helps to not only improve the model performance but also reduce the training time.

The proposed models were evaluated on two tasks: (1) Named Entity Recognition and (2) Sentence Boundary Detection. For the first task, a series of experiments was conducted covering four languages: Vietnamese, Russian, English, and Chinese. The experimental results showed that the proposed models obtained state-of-the-art results. The detailed model descriptions and achieved results were published in two papers [64], [66]. The idea behind the second task, Sentence Boundary Detection, is to reformulate it as a sequence labeling task and then apply the proposed models. The conducted experiments showed that the proposed models achieved impressive results. The detailed model description and

achieved results were presented in the paper [65]. The main findings and contributions drawn from this chapter are listed below:

• The current state-of-the-art approach to the Sequence Labeling task is the hybrid deep learning-based approach that combines deep neural networks with traditional structured prediction models. This modern approach leverages the power of deep neural networks in extracting useful hidden features of the input sequence and the efficiency of traditional structured prediction models in capturing correlations and dependencies in the output sequence of tags.

• The word vector representation generated from characters is a crucial feature, especially in the case of small training data or a large number of out-of-vocabulary words. The Char-CNN network of the WCC-NN-CRF model improves the model performance on all given datasets covering three languages: Vietnamese, Russian, and English. The improvement is clearly evident on the VLSP-2016 and Gareev's datasets, about 12% and 15% F1, respectively.

• POS and Chunk features also play an important role in the NER task. However, such additional features are not always available in NER datasets. Fortunately, recent POS tagging and Chunking systems achieve impressive performance, higher than 90% F1.

• Fine-tuning modern language models trained on huge corpora to address NLP tasks has been a trend in recent years. This helps to (1) remarkably improve the model performance, (2) deal with tasks for which training data is scarce, and (3) make the model more robust and generalizable. In the case of the NER task, using ELMo improves F1 by about 2%.

• Using the multilingual pre-trained BERT model is a good choice when building a NER model for a low-resource language. This is also very useful in practice, especially in specific domains where collecting large enough amounts of data for training deep neural networks is time-consuming and costly. Our multilingual BERT-based model shows impressive performance in all given languages, including Vietnamese, Russian, and English. Our WCC-NN-CRF model also achieves good performance with small amounts of training data.

• Reformulating the SBD task as a sequence labeling task and applying the ELMo WCC-NN-CRF model is an effective way to address it.

Chapter 4

Coreference Resolution with Sentence-level Coreferential Scoring

4.1 The Coreference Resolution Task

Coreference Resolution (CR) is the task of identifying and clustering all expressions that refer to the same entity in a text. Let us take the two sentences below as an example:

My sister has a friend called John. She thinks he is so funny.

There are two people mentioned in these sentences: My sister and John. A CR model needs to find and cluster all mentions referring to these people. Hence, the output should be {My sister, She}, {John, he}. CR is very useful for information extraction, text summarization, as well as question answering systems. For example, CR plays an important role in question answering systems, as it can be used to retrieve the named entities that pronouns refer to (i.e., CR enables question answering systems to answer the question “What are we talking about?”). Building a CR model is one of the most difficult tasks in the field of NLP for the following main reasons:

• The huge search space: The input of CR models can be a discourse, a chat conversation, or even a document with hundreds of sentences. This leads to a big challenge in finding and clustering mentions.

• Clustering mentions into mention chains is a global decision that depends on the whole input context. More specifically, considering only the sentences to which mentions belong is not enough to make clustering decisions. Let us take the two discourses in Table 4.1 as an example. The first and the third sentence in both discourses are exactly the same. However, in the left discourse, the person name John is the antecedent of the pronoun him, whereas in the right one there is no coreferential link between these two mentions. It is obvious that in the CR task, clustering decisions have to be made based on the document context.

Discourse 1                                    Discourse 2
- Are you waiting for John?                    - Are you waiting for John?
- Yes, for two hours.                          - He arrived yesterday. I am waiting for my son.
- Oh, I've just seen him in the restaurant.    - Oh, I've just seen him in the restaurant.

Table 4.1: Illustration of global decisions in CR task


• Many sources of information affect mention clustering: the lexical form of words, number, gender, speaker information, etc. None of them is a completely reliable indicator.

4.2 Related Work

4.2.1 Rule-based Models

Rule-based models focus on manually building a set of rules that aid in finding mentions, searching for and eliminating antecedents, and pairing them. One of the early algorithms for the CR task was introduced by Jerry Hobbs [47] in 1978. It implements a tree-search procedure with a left-to-right breadth-first search strategy to search for an antecedent. Hand-crafted rules are utilized here to reduce the antecedent search space. In 1997, Baldwin introduced a pronoun resolution system named CogNIAC [15] that achieved high precision. CogNIAC has two key components: a set of high-precision rules, which work as effective filters, and a control structure that could control general inference techniques. The experiments showed that CogNIAC obtained better results in comparison with other rule-based systems at that time. Another rule-based system, which used a knowledge-based method to address the anaphora resolution task, was proposed by Liang and Wu [125] in 2004. This system leveraged the WordNet ontology and heuristic rules and was able to tackle intra-sentential and inter-sentential anaphora. It parses every utterance to extract tokens, POS features, and lemma features, and stores them in an internal data structure. Additional features (e.g., gender, person name identification) are also leveraged. Besides that, NP finding is addressed using a finite state machine. The extracted features are then sequentially checked for anaphora resolution. The achieved results are comparable with prior works. In 2009, Aria Haghighi and Dan Klein [12] proposed a simple approach to the CR task which completely modularizes three aspects: syntactic, semantic, and discourse constraints. Their rule-based system is deterministic and is driven entirely by syntactic and

semantic compatibility. They noted that rich syntactic and semantic processing vastly reduces the need to rely on discourse effects or evidence reconciliation for reference resolution. Despite being simple and having no tunable parameters, their model outperformed state-of-the-art unsupervised coreference resolution systems and demonstrated performance comparable to state-of-the-art supervised systems. In 2010, Raghunathan et al. introduced the sieve model [53], a multi-tier model implemented as a succession of independent coreference models. Each tier builds on the entity clusters generated by the previous one. The experimental results showed that this model outperforms or performs comparably to state-of-the-art prior models.

4.2.2 Deep Learning Models

Building rules that capture both syntactic and semantic features often requires a lot of labor and knowledge of language experts. Therefore, implementing a rule-based model is time-consuming and costly, and in some cases is not feasible. Moreover, such models are hard to adapt to other languages. Fortunately, learning-based models are able to overcome these long-standing shortcomings due to their ability to automatically learn rules from training data. In 2015, one of the early deep learning-based models for the CR task was introduced by Sam Wiseman et al. in [106]. This is a non-linear mention ranking model aiming at learning distinct feature representations for anaphoricity detection and antecedent ranking. By leveraging a neural network that takes raw features as input, this model does not require manually pre-specified feature relationships. In addition, by pre-training on two sub-tasks, namely anaphoricity detection and antecedent ranking of known anaphoric mentions, the model is encouraged to learn useful features. The model was evaluated on the CoNLL-2012 dataset [28] and achieved impressive performance, 63.39% F1, compared with state-of-the-art models at that time. One year later, in [105], Sam Wiseman, Alexander M. Rush, and Stuart M. Shieber utilized an RNN to learn global representations of entity clusters directly from their mentions. They showed that such representations are especially useful for the prediction of pronominal mentions, and can be incorporated into an end-to-end coreference system. The model achieved a better result, 64.21% F1, outperforming the previous one. Later, Kevin Clark et al. in [58] utilized a variant of a reinforcement learning solution with the reward-rescaled max-margin objective to directly optimize a mention-ranking model for coreference evaluation metrics. This model obtained remarkable results on the CoNLL-2012 dataset [28], 65.73% and 63.88% on the English and Chinese test sets, respectively. In the follow-up paper [59], another model was proposed that represents each pair of coreference clusters by a high-dimensional vector, and this vector is used in learning when clusters should be combined. The two pillars of this model are the mention-pair encoder and the cluster-pair encoder (see Fig. 4.1).

Figure 4.1: Mention-pair Encoder and Cluster-pair Encoder [59].

The mention-pair encoder is responsible for computing a distributed representation for each pair of mention and candidate antecedent by using a feed-forward neural network. The authors also tried an LSTM network and reported that the performance was slightly worse than with a feed-forward neural network because of heavy overfitting to the training data. The cluster-pair encoder is used to compute the distributed representation of two clusters of mentions by applying a pooling operation over the representations of the relevant mention pairs. The cluster-ranking model then scores pairs of clusters by passing their representations through a single neural network layer. They evaluated their model on the English and Chinese datasets of CoNLL-2012 [28] and achieved 65.29% and 63.66% F1, respectively. More recently, Kenton Lee et al. proposed two end-to-end models ([56], [55]) to jointly learn mention detection and mention clustering. The first one is a simple model with two steps: 1) create a span representation from context-dependent boundary representations and a head-finding attention mechanism, 2) cluster mentions. This model can be treated as a first-order model because of its mention scoring function. The second model is an improvement of the first one, with an inference procedure involving iterations of refining span representations. The model achieved 73% average F1 on the test set of the English CoNLL-2012 shared task [28]. In 2018, Rui Zhang et al. [103] proposed to improve the end-to-end coreference resolution system by (1) using a biaffine attention model to get antecedent scores for each possible mention, and (2) jointly optimizing the mention detection accuracy and the mention clustering log-likelihood given the mention cluster labels. In this work, the authors focused on finding good designs of the scoring architecture and the learning strategy for both mention detection and antecedent scoring. Instead of using a feed-forward neural network to compute antecedent scores, they utilized a biaffine attention model. The mention detection loss was explicitly optimized to extract mentions and also perform anaphoricity detection (see Fig. 4.2).

Figure 4.2: Coreference Resolution model with Deep Biaffine Attention by Joint Mention Detection and Mention Clustering [103].

They performed experiments on the CoNLL-2012 dataset and obtained 67.8% F1 for a single model and 69.2% F1 for a 5-model ensemble. In a more recent paper [41], Hongliang Fei et al. introduced a reinforcement learning-based model, which extends [55] by replacing the higher-order mention ranking algorithm with a reinforced policy gradient model and by adding maximum entropy regularization for adequate exploration, to prevent the model from prematurely converging to a bad local optimum. There are a number of recent studies of coreference resolution for Russian. In [113], Sysoev, Andrianov, and Khadzhiiskaia proposed a two-stage model for the CR task, including (1) mention detection, which is responsible for finding mentions in the given text, and (2) coreference resolution, which clusters the extracted mentions into groups. The mention detection phase consists of two steps: (1) finding all mention heads in the given text and (2) expanding them to full mentions. Finding mention heads is treated as a binary classification task predicting whether a token is a mention head. Head expansion, i.e. mention boundary detection, is done by iterating through the tokens surrounding the mention head and applying a binary classifier until it predicts a False label. At the coreference resolution stage, mention pairs are generated first, and then converted to feature vectors based on sets of linguistic features such as word forms, lemmas, POS, gender, number, as well as positional features, NER features, etc. These feature vectors are used to classify extracted pairs as belonging to one coreference cluster or not. Finally, mention pairs are merged into clusters with the Easy-First Mention-Pair approach [82]. This model obtained a MUC F1 of 69.28% on the RuCor dataset [120] with gold mentions. In [121], Toldova and Ionov introduced a series of models: (1) four rule-based models, namely StrMatch, StrMatchPro, HeadMatch, and HeadMatchPro, that use lemma, gender, and number features to determine whether two noun phrases are coreferential,

and (2) machine learning-based models that use a decision tree classifier with 11 features, such as whether the noun phrases are pronouns, whether they agree in gender, whether they agree in number, etc. These models were evaluated on the RuCor dataset [120]. The best result of the machine learning-based models was reported as a MUC F1 of 70.25%, while the rule-based models only reached 64.87%.

4.3 Coreference Resolution Evaluation Metrics

Evaluating performance on the coreference resolution task has been a tricky issue. Many evaluation metrics have been proposed (e.g., MUC, B3, CEAF, BLANC, ACE, LEA), but none of them is perfect. This section briefly describes the three most common metrics: MUC, B3, and CEAF. One of the first evaluation metrics for the CR task was introduced at the Sixth Message Understanding Conference (MUC6)1 by Vilain et al. [68]. MUC calculates Recall based on two factors: (1) the minimal number of correct links needed to generate the equivalence class S, and (2) the number of missing links in the system response w.r.t. the key set S:

R = 1 - \frac{m(S)}{c(S)} = \frac{c(S) - m(S)}{c(S)} = \frac{(|S| - 1) - (|p(S)| - 1)}{|S| - 1} = \frac{|S| - |p(S)|}{|S| - 1},    (4.1)

R_T = \frac{\sum_i \left( |S_i| - |p(S_i)| \right)}{\sum_i \left( |S_i| - 1 \right)},    (4.2)

here:

• R, R_T are the Recall for a single key equivalence class and for the whole test set, respectively.

• S denotes the ground-truth equivalence set.

• p(S) is the partition of S relative to the response.

• m(S) is the number of missing links in the system response.

• c(S) is the minimal number of correct links needed to generate the equivalence class.
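To make the link-based computation concrete, the following minimal Python sketch (our own illustration under a simplified data layout, not code from the cited paper) computes the MUC recall of Formula 4.2 given key chains and response chains represented as sets of mention identifiers:

from typing import List, Set

def partition_size(key_chain: Set[str], response_chains: List[Set[str]]) -> int:
    """|p(S)|: number of parts into which a key chain S is split by the response.
    Mentions of S that appear in no response chain each form their own part."""
    parts, covered = 0, set()
    for r in response_chains:
        if key_chain & r:
            parts += 1
            covered |= key_chain & r
    return parts + len(key_chain - covered)

def muc_recall(key_chains: List[Set[str]], response_chains: List[Set[str]]) -> float:
    """MUC recall (Formula 4.2): sum_i (|S_i| - |p(S_i)|) / sum_i (|S_i| - 1)."""
    num = sum(len(s) - partition_size(s, response_chains) for s in key_chains)
    den = sum(len(s) - 1 for s in key_chains)
    return num / den if den else 0.0

# Toy example: one key chain split into two response chains.
key = [{"m1", "m2", "m3"}]
response = [{"m1", "m2"}, {"m3", "m4"}]
print(muc_recall(key, response))  # (3 - 2) / (3 - 1) = 0.5

MUC precision is obtained from the same function by swapping the roles of key and response chains, as stated below.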

Precision is computed using the same formulas as Recall, with the roles of response and key swapped. Two main drawbacks of MUC are: (1) it skips singleton entities that are mentioned only once, and (2) it treats all error types equally. To overcome these limitations, Bagga and Baldwin proposed B3 [9], which uses an entity-based approach instead of the link-based one.

1https://cs.nyu.edu/cs/faculty/grishman/muc6.html

B3 first computes Precision and Recall for each entity, and then takes a weighted sum over all entities in the test set to produce the final Precision and Recall. The formulas of B3 are shown below.

P_i = \frac{\text{number of correct elements in the output chain containing } entity_i}{\text{number of elements in the output chain containing } entity_i},    (4.3)

R_i = \frac{\text{number of correct elements in the output chain containing } entity_i}{\text{number of elements in the truth chain containing } entity_i},    (4.4)

\text{Precision} = \sum_{i=1}^{N} w_i \times P_i,    (4.5)

\text{Recall} = \sum_{i=1}^{N} w_i \times R_i,    (4.6)

here N is the number of entities in the test set, and w_i denotes the weight of the i-th entity2. In [133], Xiaoqiang Luo introduced an entity similarity-based method named CEAF (short for Constrained Entity-Alignment F-measure). He proposed four kinds of metrics to calculate the entity3 similarity:

\phi_1(R, S) = \begin{cases} 1 & \text{if } R = S \\ 0 & \text{otherwise,} \end{cases}    (4.7)

\phi_2(R, S) = \begin{cases} 1 & \text{if } R \cap S \neq \emptyset \\ 0 & \text{otherwise,} \end{cases}    (4.8)

\phi_3(R, S) = |R \cap S|,    (4.9)

\phi_4(R, S) = \frac{2 \times |R \cap S|}{|R| + |S|},    (4.10)

here R, S denote the response entity and the truth entity, respectively. These entity similarity metrics are used to find the best one-to-one map between response coreference chains and truth coreference chains. Choosing the best map is treated as a maximum bipartite matching problem and solved using the Kuhn-Munkres method ([40], [77]). Once this map is found, precision, recall, and F-measure are computed using the equations below:

P_\phi = \frac{\Phi(g^*)}{\sum_i \phi(S_i, S_i)},    (4.11)

R_\phi = \frac{\Phi(g^*)}{\sum_i \phi(R_i, R_i)},    (4.12)

F_1 = \frac{2 \times P \times R}{P + R},    (4.13)

here Φ is the total similarity for the best map g*.

2In the CR task, where the importance of entities is equal, these weights are equal.
3An entity in this context is a set of mentions in a cluster.

At the CoNLL-2012 shared task4, Pradhan et al. [107] used the average of the MUC, B3, and CEAF F1 scores to evaluate CR model performance:

F_1 = \frac{\text{MUC}_{F_1} + B^3_{F_1} + \text{CEAF}_{F_1}}{3}.    (4.14)
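As an illustration of how the entity alignment can be computed in practice, the sketch below (our own simplified example, using φ4 from Formula 4.10 and scipy's implementation of the Kuhn-Munkres algorithm) finds the best one-to-one map between response and truth entities and reports the CEAF F1; the last line shows the CoNLL average of Formula 4.14 for hypothetical per-metric F1 values:

import numpy as np
from scipy.optimize import linear_sum_assignment  # Kuhn-Munkres (Hungarian) method

def phi4(r: set, s: set) -> float:
    """Entity similarity of Formula 4.10."""
    return 2.0 * len(r & s) / (len(r) + len(s))

def ceaf_f1(response: list, truth: list) -> float:
    """CEAF F1 with phi4: align entities one-to-one to maximize total similarity."""
    sim = np.array([[phi4(r, t) for t in truth] for r in response])
    rows, cols = linear_sum_assignment(-sim)           # maximize total similarity
    total = sim[rows, cols].sum()                      # Phi(g*)
    p = total / sum(phi4(r, r) for r in response)      # normalize by self-similarity
    r_ = total / sum(phi4(t, t) for t in truth)
    return 2 * p * r_ / (p + r_) if p + r_ else 0.0

response = [{"m1", "m2"}, {"m3", "m4"}]
truth = [{"m1", "m2", "m3"}, {"m4"}]
print(ceaf_f1(response, truth))

# CoNLL average (Formula 4.14) for hypothetical MUC, B3, and CEAF F1 scores.
print((72.0 + 61.0 + 58.0) / 3)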

4.4 Baseline Model Description

Among recent CR models, we choose the work of Kenton Lee et al. as the baseline model because of its state-of-the-art performance and code availability. In our work, we extend the baseline model and focus on building an end-to-end model for the Russian language. Below, we summarize the two models of Kenton Lee et al. ([56], [55]). These models treat the CR task as a set of decisions for every possible span in the document. Let D be a document of T words and N = T(T+1)/2 the number of possible spans in D. The span i is determined by its start and end indices START(i) and END(i). The task is to assign each span i an antecedent y_i, where y_i ∈ Y = {ε, 1, . . . , i − 1}. Here ε is a dummy antecedent representing two possible scenarios: (1) the span is not a mention, or (2) the span is a mention that is not coreferent with any previous span. The goal of the model is to learn a conditional probability distribution:

P(y_1, \ldots, y_N \mid D) = \prod_{i=1}^{N} P(y_i \mid D) = \prod_{i=1}^{N} \frac{\exp(s(i, y_i))}{\sum_{y' \in Y} \exp(s(i, y'))},    (4.15)

where s(i, j) is the coreferential score between span i and span j. The model visualization is shown in Fig. 4.3. Below, we describe step by step how the coreferential score is computed:

• Word vector representations, {x_1, ..., x_T}, are the concatenation of pre-trained word embeddings and character-based word embeddings generated by a 1-dimensional CNN network.

• Compute contextualized word vector representations by using a Bi-LSTM network:

x_t^* = [\overrightarrow{\text{LSTM}}_t(x), \overleftarrow{\text{LSTM}}_t(x)].    (4.16)

• Compute the vector representation of span i:

g_i = [x^*_{\text{START}(i)}, x^*_{\text{END}(i)}, \hat{x}_i, \phi(i)],    (4.17)

4http://conll.cemantix.org/2012/introduction.html

here φ(i) is the vector encoding the size of span i, and \hat{x}_i is the weighted sum of word vectors in span i, computed by using an attention mechanism over the words of span i:

\alpha_t = w_\alpha \cdot \text{FFNN}_\alpha(x_t^*),    (4.18)

\alpha_{i,t} = \frac{\exp(\alpha_t)}{\sum_{k=\text{START}(i)}^{\text{END}(i)} \exp(\alpha_k)},    (4.19)

\hat{x}_i = \sum_{t=\text{START}(i)}^{\text{END}(i)} \alpha_{i,t} \cdot x_t,    (4.20)

here FFNN denotes a feed-forward neural network.

• Use feed-forward neural networks to compute the mention score and the antecedent score:

s_m(i) = w_m \cdot \text{FFNN}_m(g_i),    (4.21)

s_a(i, j) = w_a \cdot \text{FFNN}_a([g_i, g_j, g_i \circ g_j, \phi(i, j)]),    (4.22)

where · indicates the dot product, ◦ denotes element-wise multiplication, and φ(i, j) is the feature vector encoding speaker and genre information from the metadata and the distance between two spans.

• Compute the coreference score:

s(i, j) = \begin{cases} 0 & j = \varepsilon \\ s_m(i) + s_m(j) + s_a(i, j) & j \neq \varepsilon \end{cases}    (4.23)
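The following numpy sketch (our own toy re-implementation with random weights and hypothetical dimensions, not the authors' code) traces this span scoring pipeline of Formulas 4.17-4.23 for a single pair of spans; for brevity, a single non-linear layer stands in for each FFNN of the original model:

import numpy as np

rng = np.random.default_rng(0)
D, T = 8, 6                        # toy hidden size and sentence length
x_star = rng.normal(size=(T, D))   # contextualized token vectors x_t^* from the Bi-LSTM

# Toy attention parameters (Formulas 4.18-4.20).
W_att = rng.normal(size=(D, D)); b_att = np.zeros(D); w_alpha = rng.normal(size=D)

def span_repr(start, end, width_feature):
    """g_i of Formula 4.17: boundary states, attended head word, span-width feature."""
    raw = np.array([w_alpha @ np.tanh(W_att @ x_star[t] + b_att)       # 4.18
                    for t in range(start, end + 1)])
    att = np.exp(raw - raw.max()); att /= att.sum()                    # 4.19
    x_hat = att @ x_star[start:end + 1]                                # 4.20
    return np.concatenate([x_star[start], x_star[end], x_hat, [width_feature]])

G = 3 * D + 1                      # size of a span representation
w_m = rng.normal(size=G)           # mention-scoring weights
w_a = rng.normal(size=3 * G + 1)   # antecedent-scoring weights

def coref_score(g_i, g_j, pair_feature):
    s_m_i, s_m_j = w_m @ g_i, w_m @ g_j                                # 4.21
    s_a = w_a @ np.concatenate([g_i, g_j, g_i * g_j, [pair_feature]])  # 4.22
    return s_m_i + s_m_j + s_a                                         # 4.23, j != epsilon

g_a, g_b = span_repr(0, 1, 2.0), span_repr(3, 5, 3.0)
print(coref_score(g_b, g_a, pair_feature=3.0))  # compared against 0 for the dummy antecedent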

The main drawback of the first model lies in the pairwise scoring function s(i, j): it makes only local decisions by considering pairs of spans. To enhance this function, in the second model [55], an iterative inference method was proposed to refine span representations. This entails refining the antecedent distributions:

P_n(y_i) = \frac{\exp(s(g_i^n, g_{y_i}^n))}{\sum_{y \in Y(i)} \exp(s(g_i^n, g_y^n))}.    (4.24)

The first model is used to produce the representation g_i^1 for each span i at the first iteration. At the t-th iteration, the attention mechanism is utilized to compute the expected antecedent representation a_i^t, and then the span representation is updated:

a_i^t = \sum_{y_i \in Y(i)} P_t(y_i) \cdot g_{y_i}^t,    (4.25)

f_i^t = \sigma(W_f [g_i^t, a_i^t]),    (4.26)

g_i^{t+1} = f_i^t \circ g_i^t + (1 - f_i^t) \circ a_i^t.    (4.27)
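A minimal sketch of one refinement iteration (Formulas 4.25-4.27), under our own assumptions that span representations are stacked in a matrix G of shape [num_spans, dim] and that an antecedent score matrix is already available; all tensors here are random toy data, and a real implementation would additionally mask out non-preceding spans:

import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def refine_once(G, scores, W_f):
    """One higher-order iteration: gated interpolation of each span representation
    with its expected antecedent representation (Formulas 4.25-4.27)."""
    # Softmax over candidate antecedents of each span (Formula 4.24).
    P = np.exp(scores - scores.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    A = P @ G                                               # expected antecedent a_i^t (4.25)
    F = sigmoid(np.concatenate([G, A], axis=1) @ W_f.T)     # gate f_i^t                (4.26)
    return F * G + (1.0 - F) * A                            # updated g_i^{t+1}         (4.27)

N, D = 4, 6
G = rng.normal(size=(N, D))
scores = rng.normal(size=(N, N))          # s(g_i, g_y) for candidate antecedents
W_f = rng.normal(size=(D, 2 * D))
print(refine_once(G, scores, W_f).shape)  # (4, 6)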

Figure 4.3: Baseline CR model [56]. Only a subset of possible spans is depicted. (The figure shows the pipeline from word and character embeddings through the Bi-LSTM, span heads, and span representations to the mention, antecedent, and coreference scores for the example sentence “John visited his grand mother yesterday”.)

To deal with the computational challenge and the large space of possible antecedents, an alternate inference procedure with three-stage beam search was proposed:

• Keep the top M spans based on the mention score sm(i) of each span.

• Keep the top K antecedents of each remaining span i based on the sum of the mention scores s_m(i), s_m(j) and an alternate antecedent score s_c(i, j):

s_c(i, j) = g_i^\top W_c g_j.    (4.28)

• Only the remaining span pairs are used to compute the overall coreference score:

s(i, j) = s_m(i) + s_m(j) + s_c(i, j) + s_a(i, j).    (4.29)
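A toy sketch of this pruning, under assumed inputs (mention scores s_m, span representations G, and a bilinear matrix W_c): keep the top M spans, then for each kept span keep the top K candidate antecedents by the cheap score of Formula 4.28 before the full score of Formula 4.29 is ever computed.

import numpy as np

rng = np.random.default_rng(2)
N, D, M, K = 20, 8, 6, 3
s_m = rng.normal(size=N)          # mention scores for all candidate spans
G = rng.normal(size=(N, D))       # span representations
W_c = rng.normal(size=(D, D))

# Stage 1: keep the top-M spans by mention score (indices kept in textual order).
kept = np.sort(np.argsort(-s_m)[:M])

# Stage 2: cheap bilinear antecedent score s_c (Formula 4.28) plus mention scores.
G_k, s_k = G[kept], s_m[kept]
cheap = s_k[:, None] + s_k[None, :] + G_k @ W_c @ G_k.T
mask = np.tril(np.ones((M, M), dtype=bool), k=-1)     # antecedents must precede the span
cheap = np.where(mask, cheap, -np.inf)
top_k_antecedents = np.argsort(-cheap, axis=1)[:, :K]

# Stage 3: only these (span, antecedent) pairs would receive the full score of Formula 4.29.
print(kept, top_k_antecedents[-1])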

4.5 Sentence-level Coreferential Relation-based Model

The models of Kenton Lee et al. ([56], [55]), as well as other recent deep learning models, lack the ability to represent and process the semantic meaning of mentions in the entire document context. This leads to inaccurate clustering decisions, especially for long documents. To solve this problem, we propose a new model named Sentence-level

Figure 4.4: Visualization of the coreference relationship between sentences in the document chtb_0219.v4_gold_conll of the OntoNotes 5.0 dataset. [ ] is used to separate sentences. Mentions highlighted with the same color belong to the same cluster.

Coreferential Relation-based model (SCRb), which can take as input a paragraph, a discourse, or even a document, and predict the coreferential relations between the sentences of the input text. In order to illustrate this idea clearly, we visualize an input sample, the document chtb_0219.v4_gold_conll in the OntoNotes 5.0 dataset provided at the CoNLL-2012 Shared Task [107] (see Fig. 4.4), and the corresponding output matrix (see Fig. 4.5). The SCRb model generates two useful features:

• Sentence representation sri that is concatenated with span vector representation gi of the baseline model.

• Sentence relation score s_s(i, j) that is added to the scoring function (Formula 4.29) of the baseline model. The new scoring function is the weighted sum of the two mention scores s_m, the two antecedent scores of the baseline model s_c, s_a, and the sentence relation score s_s:5

s(i, j) = \alpha \times (s_m(i) + s_m(j) + s_c(i, j) + s_a(i, j)) + \beta \times s_s(i, j)    (4.30)

The SCRb model can be combined with the baseline model in two ways:

• Directly integrate the SCRb model into the baseline model; both models are jointly trained together.

• Use the output of the SCRb model as an additional feature for the baseline model.

5Both parameters α, β are fine-tuned during the training stage.

Figure 4.5: Binary matrix representing the coreference relationship between sentences in the document chtb_0219.v4_gold_conll, corresponding to the document shown in Fig. 4.4, which the SCRb model learns to predict. The value 1 at cell (1, 5) indicates that there is a coreference link between sentence 1 and sentence 5: {reporter Baixin Zhang, this reporter}.
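Target matrices of this kind can be derived directly from gold mention clusters. The sketch below (our own assumed data layout: each mention is a (sentence_index, cluster_id) pair) marks a pair of sentences as related whenever they contain mentions from the same cluster, mirroring the matrix in Fig. 4.5:

import numpy as np
from collections import defaultdict

def sentence_relation_matrix(mentions, num_sentences):
    """mentions: iterable of (sentence_index, cluster_id) pairs."""
    by_cluster = defaultdict(set)
    for sent_idx, cluster_id in mentions:
        by_cluster[cluster_id].add(sent_idx)
    M = np.zeros((num_sentences, num_sentences), dtype=int)
    for sents in by_cluster.values():
        for i in sents:
            for j in sents:
                M[i, j] = 1          # symmetric by construction
    return M

# Toy document with 6 sentences and two clusters "A" and "B".
mentions = [(1, "A"), (5, "A"), (2, "B"), (3, "B")]
print(sentence_relation_matrix(mentions, 6))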

The SCRb model consists of the following stages. The model uses three types of word embeddings:

1. a free-context embedding efc;

2. a context-based embedding ecb;

3. a character-based embedding ech learned by CNN network.

All of these vectors are concatenated to generate the final word embedding:

e_w = [e_{fc}, e_{cb}, e_{ch}],    (4.31)

here [,] denotes the concatenation operator. The resulting word embeddings of each sentence are then fed into a Bi-LSTM network to capture both the left and right sentence contexts:

w = [\overrightarrow{\text{LSTM}}(e_w), \overleftarrow{\text{LSTM}}(e_w)],    (4.32)

here \overrightarrow{\text{LSTM}}(e_w), \overleftarrow{\text{LSTM}}(e_w) denote the outputs of the forward and backward LSTM networks, respectively. A max pooling or attention mechanism is utilized to reduce the word dimension and generate the sentence representation:

s = max_pooling(w), (4.33)

Figure 4.6: Sentence-level Coreferential Relation-based (SCRb) model.

The second Bi-LSTM is then used to capture the final sentence representation in the document context:

s_{dc} = [\overrightarrow{\text{LSTM}}(s), \overleftarrow{\text{LSTM}}(s)].    (4.34)
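A compact tf.keras sketch of this two-level encoder, under assumed (hypothetical) dimensions and with the concatenated word embeddings of Formula 4.31 precomputed and fed as a [sentences, words, features] tensor:

import tensorflow as tf

num_sent, max_words, emb_dim, hidden = 10, 30, 400, 128   # assumed sizes

# Word-level Bi-LSTM (Formula 4.32) followed by max pooling (Formula 4.33) for one sentence.
words = tf.keras.Input(shape=(max_words, emb_dim))
h = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(hidden, return_sequences=True))(words)
sent_vec = tf.keras.layers.GlobalMaxPooling1D()(h)
sentence_encoder = tf.keras.Model(words, sent_vec)

# Sentence-level Bi-LSTM over the whole document (Formula 4.34).
doc = tf.keras.Input(shape=(num_sent, max_words, emb_dim))
sent_seq = tf.keras.layers.TimeDistributed(sentence_encoder)(doc)
sent_in_doc = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(hidden, return_sequences=True))(sent_seq)
scrb_encoder = tf.keras.Model(doc, sent_in_doc)
scrb_encoder.summary()

The output tensor has one 2*hidden-dimensional vector per sentence, which corresponds to s_dc above.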

To generate coreferential relations between sentences, we modified the Multi-dimensional Self-attention mechanism [115] proposed by Tao Shen et al. in 2018.

Let s_i \in \mathbb{R}^{d_s} be the vector representing the i-th sentence in the document, where d_s denotes the length of the sentence vectors generated by the last Bi-LSTM network. Let ed_{ij} \in \mathbb{R}^{d_d} be the distance embedding between s_i and s_j, where d_d denotes the length of the position encoding vectors. Let W \in \mathbb{R}^{d_s}, W_1, W_2 \in \mathbb{R}^{d_s \times d_s}, W_d \in \mathbb{R}^{d_s \times d_d} be weight matrices, and b_1 \in \mathbb{R}^{d_s}, b \in \mathbb{R} be bias terms. Then the alignment score between s_i and s_j is computed as a fully connected feed-forward layer:

f(s_i, s_j) = W^\top \sigma(W_1 s_i + W_2 s_j + W_d ed_{ij} + b_1) + b,    (4.35)

where σ is the activation function. The final antecedent score is computed as the sum of f(s_{m_i}, s_{m_j}) and the antecedent score s_a(i, j) of the baseline model, where s_{m_i} denotes the sentence to which mention i belongs. The graphical illustration of the SCRb model is shown in Fig. 4.6.
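A numpy sketch of this pairwise scorer with toy dimensions (our own illustration of Formula 4.35; the log-scale distance bucket and tanh activation are assumptions, standing in for the learned distance embedding and σ):

import numpy as np

rng = np.random.default_rng(3)
d_s, d_d, num_buckets = 16, 8, 10
W  = rng.normal(size=d_s)
W1 = rng.normal(size=(d_s, d_s)); W2 = rng.normal(size=(d_s, d_s))
Wd = rng.normal(size=(d_s, d_d)); b1 = np.zeros(d_s); b = 0.0
E_dist = rng.normal(size=(num_buckets, d_d))   # distance embeddings (random placeholders)

def distance_bucket(i, j):
    """Log-scale bucket of |i - j|, a common choice for position features."""
    return min(int(np.log2(abs(i - j) + 1)), num_buckets - 1)

def alignment_score(s_i, s_j, i, j):
    ed = E_dist[distance_bucket(i, j)]
    hidden = np.tanh(W1 @ s_i + W2 @ s_j + Wd @ ed + b1)   # sigma = tanh here
    return W @ hidden + b                                  # Formula 4.35

sents = rng.normal(size=(6, d_s))                          # sentence vectors s_dc
scores = np.array([[alignment_score(sents[i], sents[j], i, j) for j in range(6)]
                   for i in range(6)])
probs = 1.0 / (1.0 + np.exp(-scores))                      # per-pair relation probability
print(np.round(probs, 2))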

4.6 BERT-based Coreference Model

Cutting-edge language models (e.g., BERT [44], GPT [4], ELMo [69]) have shown the ability to significantly improve model performance in a variety of NLP tasks such as Question Answering, Named Entity Recognition, Document Classification, and Sentiment Analysis. Kenton Lee et al. used ELMo in both models ([56], [55]) and achieved state-of-the-art performance on the OntoNotes 5.0 dataset [107]. Pre-trained language models can be used to independently provide contextualized word vector representations to other NLP tasks, or can be jointly trained as a part of another task. In the proposed model, we decided to use the pre-trained BERT model due to its strong results on eleven natural language processing tasks as reported in [44]. The BERT-base model has a total of 12 layers. We conducted experiments with the outputs of different layers to manually study how BERT attends at each layer in the coreference context. Based on this analysis, we decided to take for our model the outputs of layers 1, 6, and 12, which show the most coreference-relevant attention patterns. The contextualized word vector representations are the weighted sum of these outputs. The final word embedding is the concatenation of two types of embeddings: (1) free-context word embeddings (GloVe + character-based word embeddings), and (2) the context-based word embedding generated by the BERT-base model. Fig. 4.7 shows the graphical illustration of this model.
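The layer mixing itself is simple. A hedged sketch (assuming the per-layer hidden states have already been extracted, e.g., with shape [tokens, 768] per layer; the mixing weights and the GloVe/character vectors here are random placeholders, whereas in the model they are learned or precomputed):

import numpy as np

rng = np.random.default_rng(4)
tokens, bert_dim, glove_dim, char_dim = 12, 768, 300, 100
layer_outputs = {l: rng.normal(size=(tokens, bert_dim)) for l in (1, 6, 12)}

# Weighted sum of the selected BERT layers (the weights would normally be learned).
w = np.array([0.2, 0.3, 0.5])
context_emb = sum(w_k * layer_outputs[l] for w_k, l in zip(w, (1, 6, 12)))

# Final word embedding: free-context part (GloVe + char CNN) + BERT-based part.
glove = rng.normal(size=(tokens, glove_dim))
char_cnn = rng.normal(size=(tokens, char_dim))
word_emb = np.concatenate([glove, char_cnn, context_emb], axis=1)
print(word_emb.shape)   # (12, 1168)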

4.7 Experiments and Results

4.7.1 Datasets

Three datasets were used to evaluate the proposed models:

• OntoNotes 5.0: An English dataset for CR task provided at the CoNLL-2012 Shared Task [107]. This dataset consists of about 194480 mentions clustered into 44221 chains.

• RuCor [120]: A Russian dataset from the shared task at the Dialogue conference, available for download at http://rucoref.maimbava.net.

• AnCor: A small Russian dataset provided at the shared task of the 25th International Conference on Computational Linguistics and Intellectual Technologies6.

The statistics of these datasets are shown in table 4.2.

6http://www.dialog-21.ru/en/evaluation/

Figure 4.7: BERT-based coreference model. (The architecture mirrors Fig. 4.3, with the word and character embeddings replaced by BERT embeddings for the example sentence “John visited his grand mother yesterday”.)

Dataset       Language   Mentions   Chains
CoNLL-2012    English    194480     44221
RuCor         Russian    16558      3638
AnCor         Russian    28961      5678

Table 4.2: Statistics on CR datasets.

4.7.2 Evaluation of Proposed Models

All proposed models are implemented in TensorFlow7. ELMo and BERT are re-trained for the Russian language using the DeepPavlov library [73]. Experiments on the Russian datasets are done with only raw text; additional features (e.g., speaker id, morphological tags) are not used. Experiments related to the BERT-based and ELMo-based models on the two Russian datasets, RuCor and AnCor, were done by Yura Kuratov and Maxim Petrov. To train the SCRb model, additional binary matrices representing sentence relationships (as shown in Fig. 4.5) were generated from the original datasets. All experiments are run on a GeForce GTX 1080 Ti with 11 GB of memory. Training on each dataset takes about one day. The evaluation metrics used in all experiments below are MUC, B3, CEAF_e, and the average F1.

A series of experiments was run to make a comprehensive evaluation of the SCRb models. The first experiment aims at studying the significance of the sentence-level coreferential relations. We analyse how the information about sentence-level coreferential relations affects the baseline model performance. To do this, we train one model with exactly the same parameters on two datasets: (1) the original OntoNotes 5.0 dataset, and (2) the same dataset augmented with ground truth sentence-level coreferential relations. Different amounts of noise were added in different training sessions to get a better understanding of the importance of the sentence-level information. As a result, we obtained the model's performance for different values of accuracy of the provided sentence relations. The results presented in table 4.3 show that information about the sentence relations is a very useful feature for the CR task. Specifically, if this feature is provided with an accuracy of 91%, the model performance increases by about 2.5%, a promising margin. Under the ideal condition, when training with ground truth sentence-level coreferential relations, the model performance reached 78.84%, an improvement of 5.84% compared with the performance of the baseline model trained on the original OntoNotes 5.0 dataset.

In the second experiment, we implement seven modifications of the SCRb model and test their performance on the OntoNotes 5.0 dataset:

• SCRb M1: The baseline model that is the combination of GloVe embedding, Bi-LSTM, CNN, and Self-attention.

• SCRb M2: M1 + log scale distance embedding inside self-attention.

• SCRb M3: M2 + weighted losses to deal with the unbalanced data problem.

• SCRb M4: M3 + ELMo embedding.

7http://www.tensorflow.org/

Dataset                                                                   Max. F1 (%) on the dev. set
Original OntoNotes 5.0                                                    72.90
OntoNotes 5.0 + sent.-level coref. relationship with the acc. of 80.0%    74.13
OntoNotes 5.0 + sent.-level coref. relationship with the acc. of 84.0%    74.73
OntoNotes 5.0 + sent.-level coref. relationship with the acc. of 91.0%    75.56
OntoNotes 5.0 + sent.-level coref. relationship with the acc. of 94.0%    76.36
OntoNotes 5.0 + sent.-level coref. relationship with the acc. of 96.5%    77.01
OntoNotes 5.0 + sent.-level coref. relationship with the acc. of 98.5%    77.92
OntoNotes 5.0 + ground truth sent.-level coref. relationship              78.84

Table 4.3: Effect of sentence-level coreferential relations on the baseline model performance.

• SCRb M5: M4 + BERT embedding.

• SCRb M6: A modification of M5 that produces non-symmetric outputs.

• SCRb M7: M5 - GloVe embedding.

The detailed results of these variants on the validation and test sets are shown in Fig. 4.8. The findings drawn from these experiments are listed below:

• The position feature encoding the distance between sentences plays an important role in determining the coreferential relation between sentences. Using this feature improves the model accuracy by about 2%. This is quite understandable: if two sentences are far apart, the probability that they are linked by coreference is low, and vice versa.

• Since the OntoNotes dataset [107] is highly imbalanced, using class weights helps to remarkably improve the model accuracy, by about 2%.

• Using the ELMo language model also significantly boosts the model performance, from 80.60% to 82.60% on the validation set.

• Adding the word vector representation generated by the BERT language model further improves the model accuracy, up to 83.85% on the validation set.

• Since the output of the SCRb model is a symmetric matrix, the experiment with non-symmetric output (M6) achieved lower accuracy compared with the previous one (M5).

Figure 4.8: Performance of SCRb model variants on OntoNotes 5.0. M1: Baseline model (Bi-LSTM + CNN + Self-attention). M2: M1 + log scale distance inside self-attention. M3: M2 + weighted classes (to deal with the imbalanced data problem). M4: M3 + ELMo. M5: M4 + BERT. M6: M5 + BERT with non-symmetric output. M7: M5 - GloVe.

• The highest accuracy was achieved by the modification M7, not M5. This is probably because in M5 the word vector representation is composed of too many components (GloVe, ELMo, and BERT), resulting in complex vector representations that are hard to fine-tune, especially with a limited amount of training data.

Next, we test the SCRb M7 model on the OntoNotes dataset and compare the obtained results with some existing models. The details are shown in table 4.4. Due to the proposed improvements, SCRb M7 is better than Kenton Lee's model and slightly better than [41]. Both SCRb M7 and [41] are extended from [55]. However, our model keeps the higher-order mention ranking approach unchanged and boosts the performance by leveraging BERT and sentence-level coreference relations, while [41] uses a reinforced policy gradient model that incorporates the reward associated with a sequence of coreference linking actions.

The next experiment compares three models: (1) baseline model + ELMo, (2) baseline model + SCRb, and (3) Sysoev's model [113]. The only difference between the first two models is that the Baseline + SCRb model learns sentence-level coreferential relations, while the Baseline + ELMo does not. In this experiment, only raw text data is used. The results show that the baseline model incorporating the SCRb model obtains better performance than the modification with ELMo (see table 4.5). This points out that the sentence-level coreferential relations learned by the SCRb model really help to improve the model performance.

Model                    MUC (P / R / F1)       B3 (P / R / F1)        CEAFe (P / R / F1)     Avg. F1
Martschat et al. [110]   76.7 / 68.1 / 72.2     66.1 / 54.2 / 59.6     59.5 / 52.3 / 55.7     62.5
Clark et al. [57]        76.1 / 69.4 / 72.6     65.6 / 56.0 / 60.4     59.4 / 53.0 / 56.0     63.0
Wiseman et al. [106]     76.2 / 69.3 / 72.6     66.2 / 55.8 / 60.5     59.4 / 54.9 / 57.1     63.4
Wiseman et al. [105]     77.5 / 69.8 / 73.4     66.8 / 57.0 / 61.5     62.1 / 53.9 / 57.7     64.2
Clark et al. [59]        79.9 / 69.3 / 74.2     71.0 / 56.5 / 63.0     63.8 / 54.3 / 58.7     65.3
Clark et al. [58]        79.2 / 70.4 / 74.6     69.9 / 58.0 / 63.4     63.5 / 55.5 / 59.2     65.7
Lee et al. [56]          78.4 / 73.4 / 75.8     68.6 / 61.8 / 65.0     62.7 / 59.0 / 60.8     67.2
Lee et al. [55]          81.4 / 79.5 / 80.4     72.2 / 69.5 / 70.8     68.2 / 67.1 / 67.6     72.9
Fei et al. [41]          85.4 / 77.9 / 81.4     77.9 / 66.4 / 71.7     70.6 / 66.3 / 68.4     73.8
SCRb M7 (ours)           81.9 / 80.6 / 81.3     72.9 / 71.2 / 72.0     69.9 / 68.0 / 68.9     74.1

Table 4.4: Testing results on the OntoNotes dataset.

Model             MUC     B3      CEAFe   Avg. F1
Sysoev [113]      41.89   34.31   29.06   35.09
Baseline + ELMo   67.26   52.29   53.18   57.58
Baseline + SCRb   66.32   54.09   54.86   58.42

Table 4.5: Testing results on the RuCor dataset using only raw text for baseline model + ELMo and baseline model + SCRb.

The last experiment aims to compare the performance of five models:

1. Baseline model + ELMo

2. Baseline model + BERT

3. Baseline model + SCRb

4. Baseline model + ELMo + RuCor

5. Baseline model + BERT + RuCor

The obtained results show that the Baseline + SCRb model outperforms Baseline + ELMo and is on par with the Baseline + BERT model (refer to table 4.6 for more details). Below we list three conclusions drawn from this experiment:

• Encoding with BERT fine-tuned on the Russian language significantly improves the model performance.

Model                      MUC     B3      CEAFe   Avg. F1
Baseline + ELMo            50.29   48.89   46.99   51.72
Baseline + SCRb            60.00   48.89   50.39   53.61
Baseline + BERT            60.95   51.08   49.24   53.76
Baseline + ELMo + RuCor    65.01   52.67   50.19   55.96
Baseline + BERT + RuCor    66.74   54.88   51.72   57.78

Table 4.6: Testing results on AnCor dataset with raw text

Team                                            muc     bcube   ceafe   mean
DP (our team) (additionally trained on RuCor)   82.62   73.95   72.14   76.24
Legacy                                          75.83   66.16   64.84   68.94
SagTeam *                                       62.23   52.79   52.29   55.77
DP (our) *                                      62.06   53.54   51.46   55.68
MorphoBabushka                                  61.36   53.39   51.95   55.57
Julia Serebrennikova                            48.07   34.7    38.48   40.42

Table 4.7: Coreference track results on Dialog Conference 2019. * denotes result without the gold mentions.

• Sentence-level coreferential relations produced by the SCRb model remarkably boost the model performance. They are better than using ELMo and comparable with using BERT.

• By training on a larger training set that combines the AnCor and RuCor training sets, the model performance increases dramatically.

To compare our model with other existing models, we participated in the shared task on Anaphora and Coreference Resolution for Russian at the 25th International Conference on Computational Linguistics and Intellectual Technologies. In this task, the organizers provided a new dataset generated from the Russian open corpus (OpenCorpora8), consisting of 3769 texts of various genres such as fiction, publicistic texts, and scientific texts. The detailed description of these tasks, along with the results achieved by the participating teams, is reported on the website of the Dialogue 2019 conference9. In tables 4.7 and 4.8, we summarize the results of our BERT-based model along with those of the other participating teams.

8http://opencorpora.org/
9http://www.dialog-21.ru/en/evaluation/2019/disambiguation/anaphora/

                                 soft                                strong
Team             Run         Acc     Prec    Rec     F1          Acc     Prec    Rec     F1
DP (our team)    On gold     91.00   91.40   91.00   91.20       83.50   83.90   83.50   83.70
DP (our team)    Full        76.30   79.20   76.30   77.80       68.10   70.70   68.10   69.40
Legacy                       70.80   75.70   70.80   73.20       59.10   63.10   59.10   61.00
Meanotek                     52.40   78.70   52.40   62.90       39.20   58.80   39.20   47.00
Etap                         52.40   78.70   52.40   62.90       39.10   58.70   39.10   46.90
Morphobabushka               62.90   63.50   62.90   63.20       38.80   39.10   38.80   39.00
NSU_ai                       23.20   43.30   23.20   30.20       6.90    12.90   6.90    9.00

Table 4.8: Anaphora track results on Dialog Conference 2019.

4.8 Conclusions

In conclusion, this chapter presents one of the most difficult tasks in the field of NLP, the task of coreference resolution. Evaluation metrics are summarized along with their pros and cons. Related works are also systematically reviewed. Starting with the baseline model, two new models are proposed:

• The Sentence-level Coreferential Relation-based model, which has the ability to capture sentence relations in the coreference context. This model can be used to boost the baseline model in two ways: (1) it is directly integrated into the baseline model, or (2) its output is used as an additional feature for the baseline model.

• Language modeling-based model that leverages the BERT model to improve the performance of the baseline model.

The comparative study of the proposed models demonstrates that:

1. Sentence-level coreferential relations are a very useful feature for the coreference resolution task. We experimented with two models: (1) Baseline model + ELMo and (2) Baseline model + sentence-level coreference relations (SCRb). The results show that the proposed SCRb model performs better, by about 1% for Russian (see table 4.5). If this feature is accurate enough, it can significantly improve the model performance: the F1 of the baseline model [55] can be improved by about 5%, up to 78.84%, with ground truth sentence-level coreferential relationships.

2. The position feature encoding the distance between sentences, class weighting, and modern language models are the key components of the SCRb model, boosting the model accuracy from 77% up to 84%.

3. Using modern language models such as ELMo or BERT significantly boosts the model performance and reduces the training time as well.

4. Our SCRb model obtained state-of-the-art performance on the RuCor dataset.

Chapter 5

Conclusions

Over the last decade, we have witnessed the dominance of deep learning models over traditional machine learning techniques in a variety of tasks covering a wide range from Computer Vision to Speech Recognition and Natural Language Processing. These models can automatically learn a multi-layered hierarchy of representations from low to high levels of abstraction, reducing the need for human-crafted features and thereby allowing to save time and resources. To train deep neural network models up to human-level performance, two things are required: (1) a massive amount of data and (2) a huge amount of computational power. Nowadays, both of them are available. It is estimated that about 2.5 quintillion bytes of data are created each day, and computational power has been increasing exponentially. Just recently, in October 2019, Google scientists published the paper titled “Quantum supremacy using a programmable superconducting processor” [32] to officially announce quantum supremacy, opening a new age of quantum computing. It is no exaggeration to say that now is the age of deep learning. The number of deep learning projects has therefore increased year by year. Among them, DeepPavlov1, developed by the Neural Networks and Deep Learning Lab at the Moscow Institute of Physics and Technology, is a global open-source deep learning project aiming to advance the current state of the art in conversational intelligence. The work within this dissertation was carried out as a part of this project, focusing on building deep neural network models to address two tasks: (1) the Sequence Labeling task and (2) the task of Coreference Resolution.

5.1 Conclusions for Sequence Labeling Task

Sequence Labeling is a family of tasks involving the assignment of a tag to each element of a sequence. In a sequence labeling task, both input and output are sequences of the same length. Some applications of sequence labelling include Part of Speech Tagging, Named Entity Recognition, and Speech Recognition as well. Sequence labeling tasks can be treated

1http://deeppavlov.ai

as classification tasks since each output tag is a categorical value. Nevertheless, they differ from other classification tasks in that the accuracy can be boosted by finding the globally best sequence of tags instead of tagging each element of the input sequence independently.

Back in the days before the advent of Deep Learning, structured prediction models including the Hidden Markov Model (HMM) and the Conditional Random Field (CRF) were two good choices for sequence labeling tasks. The main difference between them is that HMM is a generative model, while CRF is a discriminative model. More concretely, HMM aims to maximize the joint probability P(x, y) between the sequence of observations x and the sequence of states y, whereas CRF aims at associating the sequence of states with the sequence of observations by computing the conditional probability P(y|x). The main advantage of CRF compared to HMM is its capacity for representing the sequence of observations as vectors whose components are related to features attached to the observations, which is very suitable for sequence labeling tasks. Despite being very powerful models for sequence labeling tasks, a long-standing disadvantage when applying them to NLP tasks is the lack of semantic awareness. For instance, they treat the word “rose” in the two sentences “He rose quickly from his seat.” and “The Bulgarian rose is really unique.” equally. This leads to incorrect predictions and difficulty generalizing to unseen data. In other words, the meaning of words in context plays a key role in NLP tasks. This motivates the use of a deep learning-based approach that is able to learn latent features automatically and directly from the input data. Besides that, deep learning models can leverage massive amounts of data to generalize better.

To address the sequence labeling tasks, we proposed and studied a series of end-to-end deep neural network models focusing on capturing as many useful word features in context as possible, as well as the latent constraints between output tags. In 2017, we proposed the first hybrid model incorporating a Bi-LSTM network with the CRF model to address the task of NER with a focus on the Russian language. The model consists of three main components: (1) an external pre-trained word embedding generated by training FastText2 on the Lenta corpus3, (2) a Bi-LSTM network to capture the meaning of words in both left and right contexts, and (3) the CRF model to condition on output tags. The experiments conducted on Russian datasets showed that the proposed model outperformed existing models. The details of the model architecture, as well as experimental results, were published in the paper titled “Application of a Hybrid Bi-LSTM-CRF Model to the Task of Russian Named Entity Recognition” [64].

2An open-source library for learning text representations and text classifiers: https://fasttext.cc.
3A Russian public corpus for some tasks of natural language processing: https://github.com/yutkin/lenta.ru-news-dataset

Experiments with the first proposed model demonstrated that one of the keys to a robust and powerful model lies in building rich word vector representations. Hence, to further improve the model performance, we proposed the WCC-NN-CRF model, a hybrid deep neural network model comprising three sub-networks to fully exploit additional useful features such as character-level and capitalization features. A series of experiments on six datasets covering four languages (Vietnamese, Russian, English, and Chinese) showed that the WCC-NN-CRF model achieves cutting-edge results. The paper titled “A Deep Neural Network Model for the Task of Named Entity Recognition” [66] describes this proposed model along with the obtained results in detail.

After the successful application of the proposed models to the named entity recognition task, we trained them to solve the task of Sentence Boundary Detection. For this, we reformulated the task as a Sequence Labeling task and then applied the models that we developed. The achieved results were detailed in the paper “Sequence Labeling Approach to the Task of Sentence Boundary Detection” [65] presented at the ICMLSC 2020 conference.

One of the main drawbacks of previous machine learning-based models is the difficulty in building one multilingual model. In this dissertation, we deal with this problem by using a pre-trained multilingual language model like BERT [44], which allows capturing the meaning of words in both left and right contexts in many languages based on an attention mechanism instead of RNNs, to initialize the word vector representations. This helps to significantly improve the performance of the proposed multilingual model.

For the Vietnamese and Russian NER tasks, most previously proposed models were feature-based models ([42], [93], [75], [101], [127], [128]). More recently, deep learning-based models have been widely applied ([116], [117], [126], [129]). Different from these models, our model focuses on building rich semantic and grammatical representation vectors by utilizing three sub-networks, including two Bi-LSTM networks and one CNN network, to fully exploit character-level features, capitalization features, and word-level contextual representations. The CNN network consists of three convolutional layers with different window sizes followed by a max-pooling layer. The two Bi-LSTM networks are used to extract capitalization features and word contextual representations. In addition, our model leverages the power of modern language models trained on huge datasets in an unsupervised manner. Table 5.1 summarizes the obtained results on Vietnamese, English, and Russian datasets.

The main findings and conclusions from our development and study of state-of-the-art sequence labelling models are highlighted below:

• From the machine learning perspective, one of the keys to sequence labeling tasks is the ability to learn, for each word, as many useful latent features as possible in the task-specific context of the whole input sequence.

Dataset                   Precision   Recall   F1
NE3 (Russian)             99.05       99.30    99.17
CoNLL-2003 (English)      92.04       92.50    92.27
VLSP-2016 (Vietnamese)    94.91       93.96    94.43

Table 5.1: Summarization of obtained results over different languages

• Using pre-trained word embeddings to initialize word lookup table helps to increase the model performance and reduce the training time.

• The vector representation of a word learned from its characters is a crucial factor that significantly improves the model performance since it plays a key role when dealing with out-of-vocabulary words.

• Encoding the input sequence in its capitalization form also helps to increase the model performance since named entities are often combinations of several capitalized words. This can be done by using a Bi-LSTM network or a CNN.

• Additional features such as Part of Speech and Chunk features can also help to enhance the performance of named entity recognition, especially for the Vietnamese and Russian languages, due to the important role of syntactic features in these languages.

• Language modeling is a pillar of modern NLP. Using word vector representations produced by cutting-edge language models can significantly improve performance on most, if not all, NLP tasks. Importantly, language models can be trained in an unsupervised manner, so abundant unlabeled data can be exploited. Integrating language models into our proposed models significantly boosts performance and allows the models to generalize better.

5.2 Conclusions for Coreference Resolution Task

The Coreference Resolution task consists of finding mentions in a text that refer to the same entity. It is one of the important tasks in the field of NLP. Referential relations are very important and useful information for many higher-level NLP tasks, including Text Summarization, Question Answering, and Information Extraction. However, coreference is one of the most challenging tasks and still needs better and more accurate models. The main difficulty is due to the fact that finding and clustering mentions require deep understanding and the use of background knowledge. Some cases with long texts and complex coreference links even confuse humans. In this dissertation, two models are proposed to

deal with the coreference resolution task: (1) the Sentence-level Coreferential Relation-based model (SCRb), and (2) a language modeling-based model. While the most recent models proposed for coreference ([58], [59], [56], [55]) focus on learning mention representations and building a mention-antecedent scoring function, we approach the task in a new way. We argue that one of the shortcomings of the previous models is their limited ability to capture long-term dependencies, since the input of the CR task is usually a long text consisting of many sentences. This drawback motivated our study to focus on finding a way to capture long-term coreferential dependencies. That is the starting point of our proposed model.

We deal with the long-term dependencies problem by building the SCRb model, which can capture the relationship between sentences in the coreference context. More concretely, the SCRb model takes as input a text comprising a set of sentences and outputs a matrix expressing the sentence relationships in the coreference context. A higher value in the sentence coreference matrix indicates a higher probability of the existence of a coreference link between mentions inside these sentences. We suppose that such a relationship is a very useful feature for finding and clustering mentions. To study this hypothesis, we performed an experiment, and the obtained results (refer to table 4.3) demonstrated that our hypothesis was correct. The proposed model uses three kinds of word vector representations: (1) context-free word representations, (2) context-based representations, and (3) character-level word representations. Both the word-level context and the sentence-level context are extracted by using two Bi-LSTM networks. The self-attention mechanism is used to generate the final sentence relation matrix from the sentence representations.

5.3 Summary of the Main Contributions of the Dissertation

In conclusion, the main contributions of the dissertation are:

1. An original hybrid model for sequence labeling task was proposed and studied. This

model extended existing Bi-LSTM CRF architectures with (1) a trainable CNN for generation of a character-level representation of the input sequence, and (2) a Bi-LSTM network for encoding capitalization features. The model achieved state-of-the-art performance on Russian and Vietnamese datasets with F1 of 98.21% and 94.43% on NE3 and VLSP-2016, respectively. Ablation studies demonstrated that character-level encoding produces a larger improvement than capitalization encoding.

2. Extensions of the original architecture with encoders based on the language models ELMo and BERT were evaluated on Russian and English datasets. They obtained state-of-the-art performance of 99.17% and 92.91% F1 on NE3 and Gareev's dataset, respectively, and comparable performance, 92.27% F1, on CoNLL-2003.

3. Application of proposed sequence labeling model to the sentence boundary detection task produced solid results of 89.99% F1 and 95.88% F1 on the Cornell Movie-Dialog and DailyDialog datasets.

4. Sentence-level coreferential relations can significantly improve the performance of solving the coreference resolution task. The experiments on the OntoNotes dataset show that the quality of the solution can be boosted by up to 5.84%.

5. An original model for learning sentence-level coreferential relationships was introduced. Incorporation of this model into the baseline coreference architecture improved its performance for English.

6. Application of the model with the sentence coreference module allowed achieving a state-of-the-art result of 58.42% average F1 on the RuCor dataset.

Bibliography

1. Adwait Ratnaparkhi. “A Maximum Entropy Model for Part-Of-Speech Tagging”. In: Conference on Empirical Methods in Natural Language Processing. 1996. url: https: //www.aclweb.org/anthology/W96-0213.

2. Alan Akbik, Duncan Blythe, Roland Vollgraf. “Contextual String Embeddings for Sequence Labeling”. In: Proceedings of the 27th International Conference on Compu- tational Linguistics. 2018. url: https://www.aclweb.org/anthology/C18-1139/.

3. Alan Mathison Turing. “Computing Machinery and Intelligence”. In: MIND 59 (236 1950), pp. 433–460. url: https://academic.oup.com/mind/article/LIX/236/ 433/986238.

4. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. “Improving language understanding with unsupervised learning”. In: Technical report, OpenAI. 2018. url: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.

5. Alec Radford, Kartik Narsimhan, Tim Salimans, Ilya Sutskever. Improving Lan- guage Understanding by Generative Pre-Training. 2018. url: https : / / pdfs . semanticscholar.org/cd18/800a0fe0b668a1cc19f2ec95b5003d0a5035.pdf?_ga= 2.68781521.1765043536.1574952036-6193859.1560422113.

6. Alex Gittens, Dimitris Achlioptas, Michael W. Mahoney. “Skip-Gram – Zipf + Uni- form = Vector Additivity”. In: Proceedings of the 55th Annual Meeting of the Associ- ation for Computational Linguistics. Vancouver, Canada, July 2017, pp. 69–76. url: https://www.aclweb.org/anthology/P17-1007.

7. Alex Graves, Abdel-rahman Mohamed and Geoffrey Hinton. Speech Recognition with Deep Recurrent Neural Networks. 2013. arXiv: 1303.5778 [cs.NE]. url: https://arxiv.org/abs/1303.5778.

8. Alexandre Passos, Vineet Kumar, Andrew McCallum. “Lexicon Infused Phrase Embeddings for Named Entity Resolution”. In: Proceedings of the Eighteenth Conference on Computational Language Learning, Baltimore, Maryland USA. Vol. 10565. 2014, pp. 78–86. url: https://www.aclweb.org/anthology/W14-1609.

9. Amit Bagga, Breck Baldwin. “Algorithms for scoring coreference chains”. In: Proceedings of the 1st International Conference on Language Resources and Evaluation, Granada, Spain. 1998, pp. 563–566. url: https://pdfs.semanticscholar.org/4b51/2f10838e05f5b2eee94bfbd20f3d9c4ecb9b.pdf.

10. Andriy Mnih, Geoffrey E. Hinton. “A Scalable Hierarchical Distributed Language Model”. In: Advances in Neural Information Processing Systems 21 (NIPS 2008). Vancouver, British Columbia, Canada, Dec. 2008. url: https : / / papers . nips . cc/paper/3583-a-scalable-hierarchical-distributed-language-model.pdf.

11. Andriy Mnih, Geoffrey E. Hinton. “Three new graphical models for statistical language modelling”. In: Proceedings of the 24th International Conference on Machine learning. Corvalis, Oregon, USA, June 2007, pp. 641–648. url: https://www.cs.toronto. edu/~hinton/absps/threenew.pdf.

12. Aria Haghighi, Dan Klein. “Simple coreference resolution with rich syntactic and semantic features”. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Vol. 3. 2009. url: https://dl.acm.org/doi/10.5555/1699648.1699661.

13. Arora Sanjeev, Li Yuanzhi, Liang Yingyu, Ma Tengyu, Risteski, Andrej. “A Latent Variable Model Approach to PMI-based Word Embeddings”. In: Transactions of the Association for Computational Linguistics 4 (2016), pp. 385–399. doi: 10.1162/tacl_ a_00106. url: https://www.aclweb.org/anthology/Q16-1028.

14. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, L ukasz Kaiser. “Attention Is All You Need”. In: Proceedings of the 31st Conference on Neural Information Processing Systems. Long Beach, CA, USA, 2017. url: https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.

15. Baldwin Breck. “CogNIAC: high precision coreference with limited knowledge and linguistic resources”. In: Operational Factors in Practical, Robust Anaphora Resolution for Unrestricted Texts. 1997. url: https://www.aclweb.org/anthology/W97-1306.

16. Brill E., Pop M. “Unsupervised Learning of Disambiguation Rules for Part-of-Speech Tagging”. In: Armstrong S., Church K., Isabelle P., Manzi S., Tzoukermann E., Yarowsky D. (eds) Natural Language Processing Using Very Large Corpora. Text, Speech and Language Technology. Vol. 11. Springer, Dordrecht, 1999, pp. 152–155. url: https://doi.org/10.1007/978-94-017-2390-9_3.

17. Chinatsu Aone, Lauren Halverson, Tom Hampton, Mila Ramos-Santacruz. “SRA: Description of the IE2 System Used for MUC-7”. In: Seventh Message Understanding Conference (MUC-7). 1998. url: https://www.aclweb.org/anthology/M98-1012.

18. Chun-Yen Chen, Dian Yu, Weiming Wen, Yi Mang Yang, Jiaping Zhang, Mingyang Zhou Kevin Jesse, Austin Chau, Antara Bhowmick, Shreenath Iyer, Giritheja Sreeni- vasulu Runxiang Cheng, Ashwin Bhandare, Zhou Yu. “Gunrock: Building A Human- Like Social Bot By Leveraging Large Scale Real User Data”. In: 2nd Proceed- ings of Alexa Prize. 2018. url: https : / / pdfs . semanticscholar . org / b402 / b85ad45e3ac51f1da8ee718373082ce24f47.pdf.

19. Chunqi Wang, Wei Chen, Bo Xu. “Named Entity Recognition with Gated Convolu- tional Neural Networks”. In: Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data, CCL 2017, NLP-NABD 2017. Vol. 10565. Springer, Cham, 2017, pp. 110–121. url: https://link.springer.com/ chapter/10.1007/978-3-319-69005-6_10.

20. Cristian Danescu-Niculescu-Mizil and Lillian Lee. “Chameleons in imagined conversa- tions: A new approach to understanding coordination of linguistic style in dialogs.” In: Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, ACL 2011. 2011. url: https://www.cs.cornell.edu/~cristian/Cornell_Movie- Dialogs_Corpus.html.

21. David H. Hubel, Torsten N. Wiesel. “Receptive Fields and Functional Architecture of Monkey Striate Cortex”. In: The Journal of Physiology 195 (1968), pp. 215–243. url: https://physoc.onlinelibrary.wiley.com/doi/pdf/10.1113/jphysiol.1968. sp008455.

22. David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Grae- pel, and Demis Hassabis. “Mastering the game of Go with deep neural networks and tree search”. In: Nature 529 (2016), pp. 484–489. url: https://www.nature.com/ articles/nature16961.

23. Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. 2015. arXiv: 1409.0473 [cs.CL]. url: https://arxiv.org/abs/1409.0473.

24. Edgar Altszyler, Mariano Sigman, Sidarta Ribeiro, and Diego Fernández Slezak. Comparative study of LSA vs Word2vec embeddings in small corpora: a case study in dreams database. 2017. arXiv: 1610.01520v2 [cs.CL]. url: https://arxiv.org/pdf/1610.01520.pdf.

25. Elaheh Sadredini, Deyuan Guo, Chunkun Bo, Reza Rahimi, Kevin Skadron, Hongning Wang. “A Scalable Solution for Rule-Based Part-of-Speech Tagging on Novel Hardware Accelerators”. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2018, pp. 665–674. url: https://dl.acm.org/doi/pdf/10.1145/3219819.3219889.

26. Emma Strubell, Patrick Verga, David Belanger, Andrew McCallum. “Fast and accurate entity recognition with iterated dilated convolutions”. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017, pp. 2660–2670. url: https://www.aclweb.org/anthology/D17-1283.

27. Eric Brill. “A simple rule-based part of speech tagger”. In: Proceedings of the third conference on Applied natural language processing. 1992, pp. 152–155. url: https://dl.acm.org/doi/10.3115/974499.974526.

28. Erik F. Tjong Kim Sang. “Introduction to the CoNLL-2002 shared task: language-independent named entity recognition”. In: COLING-02 proceedings of the 6th conference on Natural language learning. Vol. 20. 2002, pp. 1–4. url: https://www.aclweb.org/anthology/W02-2024.

29. Erik F. Tjong Kim Sang, Fien De Meulder. “Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition”. In: Proceedings of the Seventh Conference on Natural Language Learning. 2003, pp. 142–147. url: https://www.aclweb.org/anthology/W03-0419.

30. Felix A. Gers, Jürgen Schmidhuber. “Recurrent Nets that Time and Count”. In: Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium. Como, Italy, July 2000, pp. 189–194. url: https://ieeexplore.ieee.org/document/861302.

31. Firth, J. R. “A Synopsis of Linguistic Theory 1930-1955”. In: Studies in Linguistic Analysis (1957). url: http://cs.brown.edu/courses/csci2952d/readings/lecture1-firth.pdf.

32. Frank Arute, Kunal Arya, Ryan Babbush, Dave Bacon, Joseph C. Bardin, Rami Barends, Rupak Biswas, Sergio Boixo, Fernando G. S. L. Brandao, David A. Buell, Brian Burkett, Yu Chen, Zijun Chen, Ben Chiaro, Roberto Collins, William Courtney, Andrew Dunsworth, Edward Farhi, Brooks Foxen, Austin Fowler, Craig Gidney, Marissa Giustina, Rob Graff, Keith Guerin, Steve Habegger, Matthew P. Harrigan, Michael J. Hartmann, Alan Ho, Markus Hoffmann, Trent Huang, Travis S. Humble, Sergei V. Isakov, Evan Jeffrey, Zhang Jiang, Dvir Kafri, Kostyantyn Kechedzhi, Julian Kelly, Paul V. Klimov, Sergey Knysh, Alexander Korotkov, Fedor Kostritsa, David Landhuis, Mike Lindmark, Erik Lucero, Dmitry Lyakh, Salvatore Mandrà, Jarrod R. McClean, Matthew McEwen, Anthony Megrant, Xiao Mi, Kristel Michielsen, Masoud Mohseni, Josh Mutus, Ofer Naaman, Matthew Neeley, Charles Neill, Murphy Yuezhen Niu, Eric Ostby, Andre Petukhov, John C. Platt, Chris Quintana, Eleanor G. Rieffel, Pedram Roushan, Nicholas C. Rubin, Daniel Sank, Kevin J. Satzinger, Vadim Smelyanskiy, Kevin J. Sung, Matthew D. Trevithick, Amit Vainsencher, Benjamin Villalonga, Theodore White, Z. Jamie Yao, Ping Yeh, Adam Zalcman, Hartmut Neven, John M. Martinis. “Quantum supremacy using a programmable superconducting processor”. In: Nature 574 (Oct. 2019), pp. 505–510. url: https://www.nature.com/articles/s41586-019-1666-5.

33. Frederic Morin and Yoshua Bengio. “Hierarchical Probabilistic Neural Network Language Model”. In: Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics. Jan. 2005, pp. 246–252. url: http://www.gatsby.ucl.ac.uk/aistats/aistats2005_eproc.pdf.

34. Gang Luo, Xiaojiang Huang, Chin-Yew Lin, Zaiqing Nie. “Joint named entity recognition and disambiguation”. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015, pp. 879–888. url: https://www.aclweb.org/anthology/D15-1104.

35. George R. Krupka, Kevin Hausman. “IsoQuest Inc.: Description of the NetOwl™ Extractor System as Used for MUC-7”. In: Seventh Message Understanding Conference (MUC-7). 1998. url: https://www.aclweb.org/anthology/M98-1015.

36. Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, Chris Dyer. “Neural Architectures for Named Entity Recognition”. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics. 2016, pp. 260–270. url: https://www.aclweb.org/anthology/N16-1030.

37. Guo J., Wang S., Yu C., Song J. “Chinese POS Tagging Method Based on Bi-GRU+CRF Hybrid Model”. In: Xhafa F., Barolli L., Greguš M. (eds) Advances in Intelligent Networking and Collaborative Systems. INCoS 2018. Lecture Notes on Data Engineering and Communications Technologies. Vol. 23. Springer, Cham, 2019. url: https://link.springer.com/chapter/10.1007/978-3-319-98557-2_41.

38. GuoDong Zhou and Jian Su. “Named Entity Recognition using an HMM-based Chunk Tagger”. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002, pp. 473–480. url: https://www.aclweb.org/anthology/P02-1060.pdf.

39. Hany Hassan, Yanjun Ma, and Andy Way. “MaTrEx: the DCU machine translation system for IWSLT 2007”. In: Proceedings of the International Workshop on Spoken Language Translation. Trento, Italy, 2007. url: http://doras.dcu.ie/561/1/Hassanetal_IWSLT07.pdf.

40. Harold W. Kuhn. “The Hungarian Method for the assignment problem”. In: Naval Research Logistics Quarterly 2 (1955), pp. 83–97. url: https://link.springer.com/chapter/10.1007/978-3-540-68279-0_2.

41. Hongliang Fei, Xu Li, Dingcheng Li, Ping Li. “End-to-end Deep Reinforcement Learning Based Coreference Resolution”. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, Aug. 2019. url: https://www.aclweb.org/anthology/P19-1064.pdf.

42. Huong Thanh Le and Luan Van Tran. “Automatic feature selection for named entity recognition using genetic algorithm”. In: Proceedings of the Fourth Symposium on Information and Communication Technology. 2013, pp. 81–87. url: https://dl.acm.org/doi/abs/10.1145/2542050.2542056.

43. Ilya Sutskever, Oriol Vinyals, Quoc V. Le. “Sequence to sequence learning with neural networks”. In: NIPS’14: Proceedings of the 27th International Conference on Neural Information Processing Systems. Vol. 2. Montreal, Canada, Dec. 2014, pp. 3104–3112. url: https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf.

44. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2018. arXiv: 1810.04805v2 [cs.CL]. url: https://arxiv.org/abs/1810.04805.

45. Jason P.C. Chiu, Eric Nichols. “Named Entity Recognition with Bidirectional LSTM-CNNs”. In: Transactions of the Association for Computational Linguistics. Vol. 4. 2016, pp. 357–370. url: https://aclweb.org/anthology/Q16-1026.

46. Jeffrey Pennington, Richard Socher, Christopher D. Manning. “GloVe: Global Vectors for Word Representation”. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. 2014, pp. 1532–1543. url: https://www.aclweb.org/anthology/D14-1162.

47. Jerry R. Hobbs. “Resolving pronoun references”. In: Lingua 44 (1978): 4, pp. 311–338. url: https://www.sciencedirect.com/science/article/pii/0024384178900062.

48. Jin-Dong Kim, Tomoko Ohta, Yoshimasa Tsuruoka, Yuka Tateisi. Introduction to the Bio-Entity Recognition Task at JNLPBA. 2004. url: http://www.nactem.ac.uk/tsujii/GENIA/ERtask/shared_task_intro.pdf.

49. Jing Huang and Geoffrey Zweig. “Maximum entropy model for punctuation annotation from speech”. In: Proceedings of the Annual Conference of the International Speech Communication Association. Denver, Colorado, USA, Sept. 2002. url: https://pdfs.semanticscholar.org/0982/93bd4de50541696806758f881b3bd58bb992.pdf.

50. John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. “Conditional random fields: Probabilistic models for segmenting and labeling sequence data”. In: Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001). 2001, pp. 282–289.

51. K. Humphreys, R. Gaizauskas, S. Azzam, C. Huyck, B. Mitchell, H. Cunningham, Y. Wilks. “University of Sheffield: Description of the LaSIE-II System as Used for MUC-7”. In: Seventh Message Understanding Conference (MUC-7). 1998. url: https://www.aclweb.org/anthology/M98-1007.

52. Kaituo Xu, Lei Xie, Kaisheng Yao. “Investigating LSTM for punctuation prediction”. In: Proceedings of the 10th International Symposium on Chinese Spoken Language Processing (ISCSLP). 2016. url: https://ieeexplore.ieee.org/abstract/document/7918492.

53. Karthik Raghunathan, Heeyoung Lee, Sudarshan Rangarajan, Nathanael Chambers, Mihai Surdeanu, Dan Jurafsky, Christopher Manning. “A Multi-pass Sieve for Coreference Resolution”. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. EMNLP ’10. Cambridge, Massachusetts: Association for Computational Linguistics, 2010, pp. 492–501. url: http://dl.acm.org/citation.cfm?id=1870658.1870706.

54. Kawin Ethayarajh, David Duvenaud, Graeme Hirst. Towards Understanding Linear Word Analogies. 2019. arXiv: 1810.04882v6 [cs.CL]. url: https://arxiv.org/pdf/1810.04882.pdf.

55. Kenton Lee, Luheng He, and Luke Zettlemoyer. “Higher-order Coreference Resolution with Coarse-to-fine Inference”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics. New Orleans, Louisiana, USA: Association for Computational Linguistics, June 2018, pp. 687–692. url: https://www.aclweb.org/anthology/N18-2108.

56. Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. “End-to-end Neural Coreference Resolution”. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark: Association for Computational Linguistics, Sept. 2017, pp. 188–197. url: https://www.aclweb.org/anthology/D17-1018.

57. Kevin Clark and Christopher D. Manning. Entity-Centric Coreference Resolution with Model Stacking. 2015. url: https://nlp.stanford.edu/pubs/clark-manning-acl15-entity.pdf.

58. Kevin Clark, Christopher D. Manning. “Deep Reinforcement Learning for Mention-Ranking Coreference Models”. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas, USA: Association for Computational Linguistics, Nov. 2016. doi: 10.18653/v1/D16-1245. url: https://www.aclweb.org/anthology/D16-1245.

59. Kevin Clark, Christopher D. Manning. “Improving Coreference Resolution by Learning Entity-Level Distributed Representations”. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Vol. 1. Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 643–653. doi: 10.18653/v1/P16-1061. url: https://www.aclweb.org/anthology/P16-1061.

60. Klaus Greff, Rupesh K. Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. “LSTM: A Search Space Odyssey”. In: IEEE Transactions on Neural Networks and Learning Systems 28.10 (Oct. 2017), pp. 2222–2232. url: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7508408.

61. Kunihiko Fukushima. “Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position”. In: Biological Cybernetics 36 (Apr. 1980): 4, pp. 193–202. url: https://link.springer.com/content/pdf/10.1007%2FBF00344251.pdf.

62. Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio. “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation”. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1724–1734. url: https://www.aclweb.org/anthology/D14-1179.pdf.

63. Le T. A., Kuratov Y. M., Petrov M. A., Burtsev M. S. “Sentence Level Representation and Language Models in the task of Coreference Resolution for Russian”. In: 25th International Conference on Computational Linguistics and Intellectual Technologies. May 2019. url: http://www.dialog-21.ru/media/4609/letaplusetal-160.pdf.

64. Le T.A., Arkhipov M.Y., Burtsev M.S. “Application of a Hybrid Bi-LSTM-CRF Model to the Task of Russian Named Entity Recognition”. In: Filchenkov A., Pivovarova L., Žižka J. (eds) Artificial Intelligence and Natural Language. AINL 2017. Communications in Computer and Information Science. 2017, pp. 91–103. url: https://link.springer.com/chapter/10.1007/978-3-319-71746-3_8.

65. Le The Anh. “Sequence Labeling Approach to the Task of Sentence Boundary Detection”. In: Proceedings of the 4th International Conference on Machine Learning and Soft Computing. New York, NY, USA: Association for Computing Machinery, 2020, pp. 144–148. doi: 10.1145/3380688.3380703. url: https://dl.acm.org/doi/10.1145/3380688.3380703.

66. Le The Anh and Mikhail S. Burtsev. “A Deep Neural Network Model for the Task of Named Entity Recognition”. In: International Journal of Machine Learning and Computing. Vol. 9. 1. 2019, pp. 8–13. url: http://www.ijmlc.org/vol9/758-ML0025.pdf.

67. Leon Derczynski, Eric Nichols, Marieke van Erp, Nut Limsopatham. “Results of the WNUT2017 Shared Task on Novel and Emerging Entity Recognition”. In: The 3rd Workshop on Noisy User-generated Text. 2017, pp. 140–147. doi: 10.18653/v1/W17-4418. url: https://www.aclweb.org/anthology/W17-4418.

68. Marc Vilain, John Burger, John Aberdeen, Dennis Connolly, Lynette Hirschman. “A model-theoretic coreference scoring scheme”. In: Proceedings of the 6th Message Understanding Conference (MUC-6). 1995, pp. 45–52. url: https://www.aclweb.org/anthology/M95-1005.

69. Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer. “Deep contextualized word representations”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. New Orleans, Louisiana, USA: Association for Computational Linguistics, June 2018, pp. 2227–2237. url: https://aclweb.org/anthology/N18-1202.

70. Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer. “Deep contextualized word representations”. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2018, pp. 2227–2237. url: https://aclweb.org/anthology/N18-1202.

71. Miikka Silfverberg, Teemu Ruokolainen, Krister Lindén, Mikko Kurimo. “Part-of-Speech Tagging using Conditional Random Fields: Exploiting Sub-Label Dependencies for Improved Accuracy”. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2014, pp. 259–264.

72. Mike Schuster, Kaisuke Nakajima. “Japanese and Korean voice search”. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Kyoto, Japan: IEEE, Mar. 2012, pp. 5149–5152. url: https://ieeexplore.ieee.org/document/6289079.

73. Mikhail Burtsev, Alexander Seliverstov, Rafael Airapetyan, Mikhail Arkhipov, Dilyara Baymurzina, Nickolay Bushkov, Olga Gureenkova, Taras Khakhulin, Yuri Kuratov, Denis Kuznetsov, Alexey Litinsky, Varvara Logacheva, Alexey Lymar, Valentin Malykh, Maxim Petrov, Vadim Polulyakh, Leonid Pugachev, Alexey Sorokin, Maria Vikhreva, Marat Zaynutdinov. “DeepPavlov: Open-Source Library for Dialogue Systems”. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics - System Demonstrations. Melbourne, Australia: Association for Computational Linguistics, July 2018, pp. 122–127. url: https://www.aclweb.org/anthology/P18-4021.

74. Mingbin Xu, Hui Jiang, Sedtawut Watcharawittayakul. “A Local Detection Approach for Named Entity Recognition and Mention Detection”. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Vancouver, Canada: Association for Computational Linguistics, July 2017, pp. 1237–1247. url: https://www.aclweb.org/anthology/P17-1114.

75. Pham Quang Nhat Minh. A Feature-Rich Vietnamese Named-Entity Recognition Model. 2018. arXiv: 1803.04375 [cs.CL]. url: https://arxiv.org/abs/1803.04375.

76. Mourad Gridach, Hatem Haddad. “Arabic Named Entity Recognition: A Bidirectional GRU-CRF Approach”. In: Gelbukh A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2017. Lecture Notes in Computer Science. Vol. 10761. Springer, Cham, 2018. url: https://link.springer.com/chapter/10.1007/978-3-319-77113-7_21.

77. Munkres James. “Algorithms for the Assignment and Transportation Problems”. In: Journal of the Society for Industrial and Applied Mathematics 5 (1957), pp. 32–38. url: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.228.3911&rep=rep1&type=pdf.

78. Murray Campbell, A. Joseph Hoane Jr., Feng-hsiung Hsu. “Deep Blue”. In: Artificial Intelligence 134 (2002), pp. 57–83. url: https://www.sciencedirect.com/science/article/pii/S0004370201001291.

79. Nal Kalchbrenner, Edward Grefenstette, Phil Blunsom. “A Convolutional Neural Network for Modelling Sentences”. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, Maryland: Association for Computational Linguistics, June 2014, pp. 655–665. url: https://www.aclweb.org/anthology/P14-1062/.

80. Nguyen Viet Cuong, Nan Ye, Wee Sun Lee. “Conditional Random Field with High-order Dependencies for Sequence Labeling and Segmentation”. In: Journal of Machine Learning Research 15 (2014), pp. 981–1009. url: http://www.jmlr.org/papers/volume15/cuong14a/cuong14a.pdf.

81. Nicola Ueffing, Maximilian Bisani, and Paul Vozila. “Improved models for automatic punctuation prediction for spoken and written text”. In: Proceedings of the 14th Annual Conference of the International Speech Communication Association. Lyon, France, Aug. 2013. url: https://research.nuance.com/wp-content/uploads/2014/11/AutoPunc_Interspeech2013_paper_finalsubmission.pdf.

82. Olga Uryupina and Alessandro Moschitti. “A State-of-the-Art Mention-Pair Model for Coreference Resolution”. In: Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics (*SEM 2015). 2015. url: https://www.aclweb.org/anthology/S15-1034.pdf.

83. Oliver Bender, Franz Josef Och and Hermann Ney. “Maximum entropy models for named entity recognition”. In: Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003. Vol. 4. 2003, pp. 148–151. url: https://dl.acm.org/doi/10.3115/1119176.1119196.

84. Onur Kuru, Ozan Arkan Can, and Deniz Yuret. “CharNER: Character-Level Named Entity Recognition”. In: COLING. 2016.

85. Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, Alexander Yates. “Unsupervised named-entity extraction from the Web: An experimental study”. In: Artificial Intelligence 165 (1 June 2005), pp. 91–134. url: https://doi.org/10.1016/j.artint.2005.03.001.

86. Ottokar Tilk, Tanel Alumae. “LSTM for Punctuation Restoration in Speech Transcripts”. In: Proceedings of the 16th Annual Conference of the International Speech Communication Association. 2015. url: https://www.isca-speech.org/archive/interspeech_2015/i15_0683.html.

87. Paramveer S. Dhillon, Dean Foster, Lyle Ungar. “Multi-View Learning of Word Embeddings via CCA”. In: Proceedings of Advances in Neural Information Processing Systems 24 (NIPS 2011). Granada, Spain, Dec. 2011. url: https://papers.nips.cc/paper/4193-multi-view-learning-of-word-embeddings-via-cca.pdf.

88. Paul J. Werbos. “Backpropagation through time: what it does and how to do it”. In: Proceedings of the IEEE. Vol. 78. 10. 1990, pp. 1550–1560. url: https://ieeexplore.ieee.org/document/58337.

89. Peng-Hsuan Li, Ruo-Ping Dong, Yu-Siang Wang, Ju-Chieh Chou. “Leveraging Linguistic Structures for Named Entity Recognition with Bidirectional Recursive Neural Networks”. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2017, pp. 2664–2669. url: https://www.aclweb.org/anthology/D17-1282.pdf.

90. Peter F. Brown, Vincent J. Della Pietra, Peter V. deSouza, Jenifer C. Lai, Robert L. Mercer. “Class-based n-gram models of natural language”. In: Computational Linguistics 18 (1992): 4, pp. 467–479. url: https://www.aclweb.org/anthology/J92-4003.

91. Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Łukasz Kaiser, Noam Shazeer. “Generating Wikipedia by Summarizing Long Sequences”. In: Proceedings of the Sixth International Conference on Learning Representations. Vancouver, Canada, 2018. url: https://openreview.net/forum?id=Hyg0vbWC-.

92. Pham T.H., Le-Hong P. “End-to-End Recurrent Neural Network Models for Vietnamese Named Entity Recognition: Word-Level Vs. Character-Level”. In: Hasida K., Pa W. (eds) Computational Linguistics. PACLING 2017. Communications in Computer and Information Science. Vol. 781. 2017. url: https://link.springer.com/chapter/10.1007/978-981-10-8438-6_18.

93. Phuong Le-Hong. Vietnamese Named Entity Recognition using Token Regular Expressions and Bidirectional Inference. 2016. arXiv: 1610.05652 [cs.CL]. url: https://arxiv.org/abs/1610.05652v2.

94. Quoc V. Le, Marc’Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeff Dean, Andrew Y. Ng. “Building high-level features using large scale unsupervised learning”. In: Proceedings of the 29th International Conference on Machine Learning. Edinburgh, Scotland, July 2012, pp. 507–514. url: https://icml.cc/2012/papers/73.pdf.

95. Quoc V. Le, Navdeep Jaitly, Geoffrey E. Hinton. A Simple Way to Initialize Recurrent Networks of Rectified Linear Units. Apr. 2015. arXiv: 1504.00941v2 [cs.NE]. url: https://arxiv.org/pdf/1504.00941.pdf.

96. Razvan Pascanu, Tomas Mikolov, Yoshua Bengio. “On the difficulty of training recurrent neural networks”. In: ICML’13 Proceedings of the 30th International Conference on Machine Learning. Vol. 28. 3. Atlanta, GA, USA, June 2013, pp. 1310–1318. url: http://proceedings.mlr.press/v28/.

97. Reinhard Kneser, Hermann Ney. “Improved Clustering Techniques for Class-Based Statistical Language Modelling”. In: Third European Conference on Speech Communication and Technology (EUROSPEECH ’93). Berlin, Germany, Sept. 1993, pp. 973–976. url: https://www.isca-speech.org/archive/eurospeech_1993/e93_0973.html.

98. Remi Lebret, Ronan Collobert. “Rehabilitation of Count-Based Models for Word Vector Representations”. In: International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2015): Computational Linguistics and Intelligent Text Processing. Cairo, Egypt, Apr. 2015, pp. 417–429. url: https://link.springer.com/chapter/10.1007/978-3-319-18111-0_31.

99. Remi Lebret, Ronan Collobert. “Word Embeddings through Hellinger PCA”. In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics. Gothenburg, Sweden: Association for Computational Linguistics, Apr. 2014, pp. 482–490. url: https://www.aclweb.org/anthology/E14-1051.

100. Rico Sennrich, Barry Haddow, Alexandra Birch. “Neural Machine Translation of Rare Words with Subword Units”. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Vol. 1. Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 1715–1725. url: https://www.aclweb.org/anthology/P16-1162.pdf.

101. Rinat Gareev, Maksim Tkachenko, Valery Solovyev, Andrey Simanovsky, Vladimir Ivanov. “Introducing Baselines for Russian Named Entity Recognition”. In: Computational Linguistics and Intelligent Text Processing. Vol. 7816. Springer, Berlin, Heidelberg, 2013, pp. 329–342. url: https://link.springer.com/chapter/10.1007/978-3-642-37247-6_27.

102. Ronan Collobert, Jason Weston. “A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning”. In: Proceedings of the 25th International Conference on Machine Learning - ICML ’08. Helsinki, Finland, July 2008, pp. 160–167. url: https://icml.cc/Conferences/2008/papers/icml2008proceedings.pdf.

103. Rui Zhang, Cícero Nogueira dos Santos, Michihiro Yasunaga, Bing Xiang, Dragomir Radev. “Neural Coreference Resolution with Deep Biaffine Attention by Joint Mention Detection and Mention Clustering”. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Vol. 2. Association for Computational Linguistics, 2018, pp. 102–107. url: https://www.aclweb.org/anthology/P18-2017/.

104. Rumelhart, D., Hinton, G. and Williams, R. “Learning representations by back-propagating errors”. In: Nature 323 (1986), pp. 533–536. url: https://www.nature.com/articles/323533a0.

105. Sam Wiseman, Alexander M. Rush, and Stuart M. Shieber. “Learning Global Features for Coreference Resolution”. In: Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics. San Diego, California, USA: Association for Computational Linguistics, 2016. url: https://www.aclweb.org/anthology/N16-1114.

106. Sam Wiseman, Alexander M. Rush, Stuart Shieber, Jason Weston. “Learning Anaphoricity and Antecedent Ranking Features for Coreference Resolution”. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. Beijing, China: Association for Computational Linguistics, July 2015. url: https://www.aclweb.org/anthology/P15-1137/.

107. Sameer Pradhan, Alessandro Moschitti, Nianwen Xue. “CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes”. In: Proceedings of the Joint Conference on EMNLP and CoNLL: Shared Task. 2012, pp. 1–40. url: https://www.aclweb.org/anthology/W12-4501.

108. Scharolta Katharina Siencnik. “Adapting word2vec to Named Entity Recognition”. In: Proceedings of the 20th Nordic Conference of Computational Linguistics. 2015, pp. 239–243.

109. Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman. “Indexing by latent semantic analysis”. In: Journal of the American Society for Information Science 41.6 (1990), pp. 391–407. url: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.108.8490&rep=rep1&type=pdf.

110. Sebastian Martschat and Michael Strube. “Latent Structures for Coreference Resolution”. In: Transactions of the Association for Computational Linguistics 3 (July 2015), pp. 405–418. url: https://www.aclweb.org/anthology/Q15-1029.pdf.

111. Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-Term Memory”. In: Neural Computation 9 (1997): 8, pp. 1735–1780. doi: 10.1162/neco.1997.9.8.1735. url: https://www.bioinf.jku.at/publications/older/2604.pdf.

112. Stephan Peitz, Markus Freitag, Arne Mauser, Hermann Ney. “Modeling Punctuation Prediction as Machine Translation”. In: Proceedings of the International Workshop on Spoken Language Translation. San Francisco, USA, Dec. 2011, pp. 238–245. url: http://www.mt-archive.info/10/IWSLT-2011-Peitz.pdf.

113. Sysoev A. A., Andrianov I. A., Khadzhiiskaia A. Y. “Coreference Resolution in Russian: State-of-the-Art Approaches Application and Evolvement”. In: Proceedings of the International Conference on Computational Linguistics and Intellectual Technologies: Dialogue 2017. Vol. 1. Russian State University for the Humanities, Moscow, Russia, June 2017, pp. 327–338. url: http://www.dialog-21.ru/media/3954/sysoevaaetal.pdf.

114. Taku Kudoh and Yuji Matsumoto. “Use of Support Vector Learning for Chunk Identification”. In: Proceedings of CoNLL-2000 and LLL-2000. 2000, pp. 142–144. url: https://www.aclweb.org/anthology/W00-0730.pdf.

115. Tao Shen, Jing Jiang, Tianyi Zhou, Shirui Pan, Guodong Long, Chengqi Zhang. “DiSAN: Directional Self-Attention Network for RNN/CNN-Free Language Understanding”. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18). New Orleans, Louisiana, USA: AAAI Press, Palo Alto, California, USA, Feb. 2018, pp. 5446–5455. url: https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewFile/16126/16099.

116. Thai-Hoang Pham, Phuong Le-Hong. The Importance of Automatic Syntactic Features in Vietnamese Named Entity Recognition. 2017. arXiv: 1705.10610 [cs.CL]. url: https://arxiv.org/pdf/1705.10610.pdf.

117. Thai-Hoang Pham, Xuan-Khoai Pham, Tuan-Anh Nguyen, Phuong Le-Hong. “NNVLP: A Neural Network-Based Vietnamese Language Processing Toolkit”. In: The 8th International Joint Conference on Natural Language Processing, Taipei, Taiwan. 2017, pp. 37–40. url: https://www.aclweb.org/anthology/I17-3010.

118. Thomas R. Niesler, E.W.D. Whittaker, and P.C. Woodland. “Comparison of part-of-speech and automatically derived category-based language models for speech recognition”. In: Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing. Seattle, Washington, USA: IEEE, May 1998, pp. 177–180. url: https://ieeexplore.ieee.org/document/674396.

119. Thorsten Brants. “TnT – A Statistical Part-of-Speech Tagger”. In: Sixth Applied Natural Language Processing Conference. Association for Computational Linguistics, 2000, pp. 224–231. url: https://www.aclweb.org/anthology/A00-1031.

120. Toldova S. Ju., Roytberg A., Nedoluzhko A., Kurzukov M., Ladygina A. A., Vasilyeva M. D., Azerkovich I. L., Grishina Y., Sim G., Ivanova A., Gorshkov D. “Evaluating Anaphora and Coreference Resolution for Russian”. In: Proceedings of the International Conference on Computational Linguistics and Intellectual Technologies: Dialogue 2014. Russian State University for the Humanities, Moscow, Russia, June 2014, pp. 681–695. url: http://www.dialog-21.ru/digests/dialog2014/materials/pdf/ToldovaSJu.pdf.

121. Toldova S., Ionov M. “Coreference resolution for Russian: The impact of semantic features”. In: Proceedings of the International Conference on Computational Linguistics and Intellectual Technologies: Dialogue 2017. Vol. 1. Russian State University for the Humanities, Moscow, Russia, June 2017, pp. 339–349. url: http://www.dialog-21.ru/media/3956/toldovasionovm.pdf.

122. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean. “Distributed representations of words and phrases and their compositionality”. In: NIPS’13 Proceedings of the 26th International Conference on Neural Information Processing Systems. Lake Tahoe, Nevada, USA, Dec. 2013, pp. 3111–3119. url: https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf.

123. Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. 2013. arXiv: 1301.3781v3 [cs.CL]. url: https://arxiv.org/pdf/1301.3781.pdf.

124. Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan “Honza” Černocký, Sanjeev Khudanpur. “Recurrent neural network based language model”. In: Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010. Makuhari, Chiba, Japan, Sept. 2010, pp. 1045–1048. url: https://www.isca-speech.org/archive/archive_papers/interspeech_2010/i10_1045.pdf.

125. Tyne Liang and Dian-Song Wu. “Automatic pronominal anaphora resolution in English texts”. In: International Journal of Computational Linguistics and Chinese Language Processing 9.1 (2004), pp. 21–40. url: https://www.aclweb.org/anthology/O03-1007.

126. Valentin Malykh and Alexey Ozerin. “Reproducing Russian NER Baseline Quality without Additional Data”. In: CDUD@CLA. 2016.

127. Valerie Mozharova and Natalia Loukachevitch. “Combining Knowledge and CRF-Based Approach to Named Entity Recognition in Russian”. In: Ignatov D. et al. (eds) Analysis of Images, Social Networks and Texts. AIST 2016. Communications in Computer and Information Science. Vol. 661. 2016. url: https://link.springer.com/chapter/10.1007/978-3-319-52920-2_18.

128. Valerie Mozharova and Natalia Loukachevitch. “Two-stage approach in Russian named entity recognition”. In: 2016 International FRUCT Conference on Intelligence, Social Media and Web (ISMW FRUCT). 2016. url: https://ieeexplore.ieee.org/document/7584769.

129. Vitaly Romanov, Albina Khusainova. “Evaluation of Morphological Embeddings for the Russian Language”. In: Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval. June 2019. url: https://doi.org/10.1145/3342827.3342846.

130. Wei Lu and Hwee Tou Ng. “Better punctuation prediction with dynamic conditional random fields”. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, ser. EMNLP ’10. Stroudsburg, PA, USA, Oct. 2010, pp. 177–186. url: http://statnlp.com/people/luwei/publications/emnlp10ln.pdf.

131. William J Black, Fabio Rinaldi, David Mowatt. “FACILE: Description of the NE System Used for MUC-7”. In: Seventh Message Understanding Conference (MUC-7). 1998. url: https://www.aclweb.org/anthology/M98-1014.

132. Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. 2015. arXiv: 1509.01626v3 [cs.LG]. url: https://arxiv.org/abs/1509.01626.

133. Xiaoqiang Luo. “On coreference resolution performance metrics”. In: HLT ’05: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. 2005, pp. 25–32. url: https://www.aclweb.org/anthology/H05-1004.

134. Xuan-Son Vu. Pre-trained Word2Vec models for Vietnamese. 2016. url: https://github.com/sonvx/word2vecVN.

135. Xuezhe Ma and Eduard Hovy. “End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF”. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Vol. 1. Berlin, Germany: Association for Computational Linguistics, 2016, pp. 1064–1074. url: https://aclweb.org/anthology/P16-1101.

136. Yan Shao, Christian Hardmeier, Jörg Tiedemann, and Joakim Nivre. “Character-based Joint Segmentation and POS Tagging for Chinese using Bidirectional RNN-CRF”. In: Proceedings of the 8th International Joint Conference on Natural Language Processing. Vol. 1. Taipei, Taiwan, Nov. 2017, pp. 173–183. url: https://www.aclweb.org/anthology/I17-1018.

137. Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, Lior Wolf. “DeepFace: Closing the Gap to Human-Level Performance in Face Verification”. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE, Sept. 2014, pp. 1701–1708. url: https://ieeexplore.ieee.org/document/6909616.

138. Yanjun Ma, John Tinsley, Hany Hassan, Jinhua Du, and Andy Way. “Exploiting Alignment Techniques in MaTrEx: the DCU Machine Translation System for IWSLT08”. In: Proceedings of the International Workshop on Spoken Language Translation. Hawaii, USA, 2008, pp. 26–33. url: http://www2.nict.go.jp/astrec-att/workshop/IWSLT2008/proceedings/EC_2_dcu.pdf.

139. Yann LeCun, Leon Bottou, Yoshua Bengio, Patrick Haffner. “Gradient-based learning applied to document recognition”. In: Proceedings of the IEEE. Vol. 86. IEEE, Nov. 1998, pp. 2278–2324. url: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=726791.

140. Yann LeCun, Yoshua Bengio. “Convolutional networks for images, speech, and time series”. In: The handbook of brain theory and neural networks. 1998. url: http://yann.lecun.com/exdb/publis/pdf/lecun-bengio-95a.pdf.

141. Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. “DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset”. In: Proceedings of The 8th International Joint Conference on Natural Language Processing (IJCNLP 2017). 2017. url: http://yanran.li/dailydialog.html.

142. Yanyao Shen, Hyokun Yun, Zachary C. Lipton, Yakov Kronrod, Animashree Anandkumar. “Deep Active Learning for Named Entity Recognition”. In: Proceedings of the 2nd Workshop on Representation Learning for NLP. Association for Computational Linguistics, 2017, pp. 252–256. url: https://www.aclweb.org/anthology/W17-2630.pdf.

143. Yoon Kim. “Convolutional Neural Networks for Sentence Classification”. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1746–1751. url: https://www.aclweb.org/anthology/D14-1181/.

144. Yoon Kim, Yacine Jernite, David Sontag, Alexander M. Rush. Character-Aware Neural Language Models. 2015. arXiv: 1508.06615v4 [cs.CL]. url: https://arxiv.org/abs/1508.06615.

145. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin. “A Neural Probabilistic Language Model”. In: Journal of Machine Learning Research 3 (2003), pp. 1137–1155. url: http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf.

146. Zellig S. Harris. “Distributional Structure”. In: WORD 10 (1954): 2-3, pp. 146–162. url: https://www.tandfonline.com/doi/abs/10.1080/00437956.1954.11659520.

147. Zhiheng Huang, Wei Xu, Kai Yu. Bidirectional LSTM-CRF Models for Sequence Tagging. 2015. arXiv: 1508.01991v1 [cs.CL]. url: https://arxiv.org/abs/1508.01991.

148. Zhilin Yang, Ruslan Salakhutdinov, William Cohen. Multi-Task Cross-Lingual Sequence Tagging from Scratch. 2016. arXiv: 1603.06270v2 [cs.CL]. url: https://arxiv.org/abs/1603.06270.

149. Zhou P., Zheng S., Xu J., Qi Z., Bao H., Xu B. “Joint Extraction of Multiple Relations and Entities by Using a Hybrid Neural Network”. In: Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. NLP-NABD 2017, CCL 2017. Lecture Notes in Computer Science. Vol. 10565. 2017. url: https://link.springer.com/chapter/10.1007/978-3-319-69005-6_12.
