Separating the Signal from the Noise: Predicting the Correct Entities in Named-Entity Linking

Drew Perkins

Uppsala University
Department of Linguistics and Philology
Master Programme in Language Technology
Master's Thesis in Language Technology, 30 ECTS credits
June 9, 2020

Supervisors: Gongbo Tang, Uppsala University; Thorsten Jacobs, Seavus

Abstract

In this study, I constructed a named-entity linking system that maps between contextual word embeddings and knowledge graph embeddings to predict correct entities. To establish a named-entity linking system, I first applied named-entity recognition to identify the entities of interest. I then performed candidate generation via locality sensitivity hashing (LSH), where a candidate group of potential entities was created for each identified entity. Afterwards, my named-entity disambiguation component was performed to select the most probable candidate. By concatenating contextual word embeddings and knowledge graph embeddings in my disambiguation component, I present a novel approach to named-entity linking. I conducted the experiments with the Kensho-Derived Wikimedia Dataset and the AIDA CoNLL-YAGO Dataset; the former dataset was used for deployment and the latter is a benchmark dataset for entity linking tasks. Three deep learning models were evaluated on the named-entity disambiguation component with different context embeddings. The evaluation was treated as a classification task, where I trained my models to select the correct entity from a list of candidates. By optimizing the named-entity linking through this methodology, the entire system can be used in recommendation engines, achieving a high F1 of 86% on the former dataset. With the benchmark dataset, the proposed method is able to achieve an F1 of 79%.

Contents

Acknowledgments

1. Introduction
   1.1. Purpose and Motivation
   1.2. Outline

2. Background
   2.1. Graph Theory and Concepts
      2.1.1. Algorithms
   2.2. Knowledge Graphs
      2.2.1. Knowledge Representation
      2.2.2. Knowledge Bases
   2.3. Named-Entity Linking Components
      2.3.1. Named-Entity Recognition
      2.3.2. Candidate Generation via Locality Sensitivity Hashing
      2.3.3. Named-Entity Disambiguation
   2.4. Feature Embeddings
      2.4.1. Word Embeddings
      2.4.2. Graph Embeddings
   2.5. Neural Networks and Deep Learning
      2.5.1. Long Short Term Memory (LSTM)
      2.5.2. Convolutional Neural Networks (CNN)
      2.5.3. Contextual Embeddings from Language Models (ELMo)

3. Methodology
   3.1. Named-Entity Linking
   3.2. Disambiguation Models
      3.2.1. Embeddings
      3.2.2. BiLSTM Model
      3.2.3. CNN-BiLSTM Model
      3.2.4. ELMo Model
      3.2.5. Feed-Forward Neural Network (FFNN)

4. Experiments
   4.1. Data
      4.1.1. Kensho-Derived Wikimedia Dataset (KDWD)
      4.1.2. AIDA CoNLL-YAGO Dataset (CoNLL03/AIDA)
   4.2. Settings
   4.3. Evaluation Metrics
      4.3.1. Precision and Recall
      4.3.2. Classification Metrics
   4.4. Results
      4.4.1. The Effect of Training Data Size
      4.4.2. Disambiguation Models
      4.4.3. Candidate List Accuracy
      4.4.4. Analysis and Discussion

5. Conclusion and Future Work

A. Named-Entity Linking Examples

Acknowledgments

I would like to thank my university supervisor Gongbo Tang for his help and guidance in the structure and pragmatics of my thesis. I am deeply indebted to the Seavus AB team, particularly my company supervisor Thorsten Jacobs, for their support, feedback, resources, ideas, and deadlines for this thesis. I would like to thank COVID-19 for destroying my social life in the months leading up to completing my thesis. Finally, I would like to thank my family, friends, and girlfriend for their ongoing support during my Master's studies.

Seavus AB is an IT consulting firm that provides enterprise-wide business solutions across the world, mainly covering the US and European markets. The department I conducted this work in was their artificial intelligence and machine learning division, located in Stockholm. Their current work includes but is not limited to chatbots, QA systems, and business intelligence.

1. Introduction

"I am convinced that the crux of the problem of learning is recognizing relationships and being able to use them" Christopher Strachey in a letter to Alan Turing, 1954

Knowledge graphs have been exploding in recent years within the scope of natural language processing. Whether it be natural language generation, question-answering, or named-entity recognition and relation linking, when common natural language tasks are leveraged with knowledge graphs, improvements can be made across tasks and domains. That is why I sought to construct a named-entity linking system, whereupon ambiguities in the named-entities can be detected and properly clarified with assistance from knowledge graphs. During some initial work with named-entity recognition on a corpus, I noticed "Bush" would come up several times without any clarity as to whether this was in reference to "George H.W. Bush" or "George W. Bush". For our purposes, this seemed like a glaring oversight and one that we chose to expand on to find a proper solution. One clear method to solve this problem is through named-entity linking. A named-entity linking system consists of three primary components: named-entity recognition, candidate generation, and named-entity disambiguation. With all three of these components, we effectively identify entities, construct a list of possible candidates for the identified entities, and finally disambiguate these entities from the candidate list and link them to a distinctive identifier within a knowledge graph. It is this final component, named-entity disambiguation, that I focused on for my research and evaluation. The data I originally trained the disambiguation component on was the Kensho-Derived Wikimedia Dataset, which includes text, links, and the knowledge base. I then conducted further studies with the AIDA CoNLL-YAGO Dataset, a benchmark in named-entity linking. Furthermore, I performed named-entity linking with three disambiguation models that map between contextual word embeddings and knowledge graph embeddings. To optimize named-entity linking with deep learning, I treated the problem as a classification task; the models predict the correct entity among a series of candidates. With my best model performance on the benchmark dataset, I achieved 87% recall, 73% precision, and a ROC-AUC of 83%.

The thesis project was conducted at Seavus AB, an IT consulting firm in Stockholm, Sweden. It was founded in Malmö, Sweden, yet has offices throughout Northern and Southeastern Europe. Seavus AB offers state-of-the-art machine learning and artificial intelligence services to companies from around the world.

1.1. Purpose and Motivation

The purpose of this thesis is to examine and evaluate different ways to improve on named-entity linking systems by mapping between context embeddings and knowledge graph embeddings to find correct entities; there appears to be a dearth of research into the usage of both of these embeddings together. The motivation came from my initial findings of ambiguity with certain persons and places, such as in the "Bush" example. Particularly when we are working with news corpora, a last name can often be found alone, making this more than a mere isolated incident. Named-entity linking is a step toward more accurate semantic representation and synonym extraction, which remain challenges in NLP research despite the robust ontology networks and lexicons widely available. Not only that, the relations we make in how we think, what we watch, and who we connect with are all inexorably linked to how we have come to understand the world, and are fundamental to language technology. There are several research questions I considered to help me move forward with my work. Can entities be adequately predicted through the concatenation of knowledge graph embeddings with contextual word embeddings? Most current disambiguation methods rely purely on the contextual information of documents. Can my system manage name variations, i.e., the same entity appearing under various naming conventions? These may be caused by aliases, spelling errors, or abbreviations. Can my system manage ambiguity, i.e., the same mention being polysemous (having multiple meanings) depending on the specific context? Can my system manage incomplete information when there is a limited amount of knowledge? Will my system be able to fill the contextual gaps?

1.2. Outline

The outline of the thesis is as follows. In Chapter 2, I introduce essential graph theory concepts and terminology that will be necessary to understand the rest of the thesis. This is followed by knowledge graphs, the components used for the named-entity linking system, feature embeddings, and the deep learning necessary to understand the later chapters. In Chapter 3, I discuss the methodology that went towards the system, embeddings, and disambiguation models. In Chapter 4, I first investigate the datasets used. I then break down the settings of the data, the earlier entity linking components, and, of course, the model environments for disambiguation. This is followed by soft introductions to the evaluation metrics used in this work. I finish the chapter by reporting my results, with an analysis and discussion afterwards. In Chapter 5, I conclude my findings and examine a few different ways this research could be expanded on in the future.

2. Background

First, I note the fundamental graph concepts and algorithms. I then discuss knowledge graphs from their foundations to their graph structure. I follow with explanations of the named-entity linking components, which include named-entity recognition, candidate generation, and named-entity disambiguation. Afterwards, I delve into word and graph feature embeddings. Finally, I explain the core deep learning necessary for this research.

2.1. Graph Theory and Concepts

In graph theory, graphs are a high-fidelity way of modeling pairwise relations between objects. Graphs are composed of objects known as vertices, entities, or nodes that are connected by edges, links, or relationships. A label on a node marks it as part of a larger group. In classic graph theory, a label traditionally signified one node, but it has since taken on the meaning of a node group. Relationships are classified into types rather than labels. Nodes and relationships can also have embedded attributes known as properties that contain various data types, whether numerical or categorical. A subgraph is a smaller section within a larger graph. A path is a group of nodes and their connecting relationships.

Figure 2.1.: Semantic Triple

In addition, there are common graph attributes that should be considered. Graphs can have nodes that are connected or disconnected by relations. Nodes and relationships can carry certain weights. Nodes can have relations with a fixed direction. In this case, start nodes are known as heads and end nodes are known as tails; a head, relation, and tail together are called a semantic triple. A path can be cyclic or acyclic depending on which node it starts and ends on. The relationship-to-node ratio can be sparse or dense, which can lead to divergent results. Monopartite, bipartite, and k-partite graphs are those that connect nodes of one, two, or any number of node types. Each can express a subgraph of a knowledge graph.

2.1.1. Algorithms

Pathfinding

Two fundamental algorithms to traverse an entire graph are depth-first search and breadth-first search. A depth-first search algorithm iterates outward from a starting node to some end node before repeating a similar search down a different path from the same start node. Breadth-first search iterates the graph one layer at a time, first visiting each node at depth 1, then depth 2, and so on, until the entire graph has been visited. Pathfinding algorithms are built on top of graph search algorithms such as these two and explore routes between nodes until the goal node has been reached. The pathfinding algorithms primarily covered in my work are shortest path (the shortest path between nodes) and random walk (a set of random nodes following any relationship, selected arbitrarily).
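To make the traversal concrete, below is a minimal, self-contained sketch of breadth-first search that returns a fewest-hops path between two nodes; the toy adjacency list and all names are illustrative, not part of the system built in this thesis.

```python
from collections import deque

def bfs_path(graph, start, goal):
    """Explore the graph one layer at a time; the first time we reach
    `goal`, the path found uses the fewest possible hops."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # graph is disconnected: no path exists

# Toy undirected graph as an adjacency list
graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"],
         "D": ["B", "C", "E"], "E": ["D"]}
print(bfs_path(graph, "A", "E"))  # ['A', 'B', 'D', 'E']
```

A depth-first variant only differs in popping from the right end of the queue (a stack), trading the shortest-path guarantee for lower memory on wide graphs.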

Centrality

Centrality is often implemented to retrieve the most important people or the most relevant answers in response to a query. Algorithms such as PageRank (Page et al., 1998), which was devised at Google, permit traversal through a graph's link structure to measure the most important web pages. PageRank counts the number and quality of links to a page to determine a rough estimate of how important a web page is. The underlying assumption here is that more important pages are likely to receive more links from other pages:

$$PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}$$

The PageRank value for a page u relies on the PageRank values for each page v contained in the set B_u (the pages linking to u), divided by the number of links L(v) going out from page v.
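As an illustration, here is a minimal iterative PageRank in Python. The damping factor of 0.85 is the standard value from the literature but is not part of the formula above, and the three-page link structure is made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterate PR(u) = (1 - d)/N + d * sum(PR(v)/L(v) for v in B_u),
    where B_u is the set of pages linking to u and L(v) counts v's outlinks."""
    nodes = list(links)
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        pr = {u: (1 - damping) / n
              + damping * sum(pr[v] / len(links[v])
                              for v in nodes if u in links[v])
              for u in nodes}
    return pr

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # toy web graph
print(pagerank(links))  # "c" edges out "a" as the most important page
```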

Community Detection

Social networks are the most striking, and paradigmatic, examples of relationships between individuals (or communities) within a graph. Whether in a group composed of your coworkers, family, or friends, people gravitate toward forming groups. Zachary's network of karate club members, a standard benchmark in community detection, demonstrated aggregations of people as they drifted apart and formed two factions (Zachary, 1977).

Figure 2.2.: Zachary’s Karate Club

By looking at Figure 2.2, it is possible to infer two major aggregations of people, pulled apart around vertex 34 (the club president) and vertex 1 (the instructor). The Louvain method is a clustering algorithm for community detection that evaluates how much more densely connected the nodes within a community are in comparison with a random network (Lu et al., 2014), but graph neural networks have also been used to detect overlapping and disjoint communities (Shchur and Günnemann, 2019).

Link Prediction

When we want to foreshadow the most likely future relations in Figure 2.2, we use link prediction algorithms to predict possible future connections in the network. In addition, they can be used to propose missing links for obstructed or missing data. The Adamic-Adar algorithm (Adamic and Adar, 2003) was adopted early on to predict links in social networks, such as in Figure 2.2, using the formula:

$$A(x, y) = \sum_{u \in N(x) \cap N(y)} \frac{1}{\log |N(u)|}$$

In Adamic and Adar (2003), N(u) is the set of nodes adjacent to u. A value of 0 asserts that two nodes are not close, while higher values indicate closeness.
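A small sketch of the Adamic-Adar score under the formula above; the neighbor sets form a made-up toy network, and skipping single-neighbor nodes (where log|N(u)| would be 0) is an assumption made to keep the example safe.

```python
import math

def adamic_adar(neighbors, x, y):
    """A(x, y) = sum over shared neighbors u of 1 / log|N(u)|."""
    shared = neighbors[x] & neighbors[y]
    return sum(1.0 / math.log(len(neighbors[u]))
               for u in shared if len(neighbors[u]) > 1)

neighbors = {
    "anna": {"bob", "carl", "dana"},
    "erik": {"bob", "carl"},
    "bob":  {"anna", "erik"},
    "carl": {"anna", "erik", "dana"},
    "dana": {"anna", "carl"},
}
# Shared neighbors of anna and erik are bob (2 links) and carl (3 links):
# 1/log(2) + 1/log(3) ~ 2.35, so a future anna-erik link looks plausible.
print(adamic_adar(neighbors, "anna", "erik"))
```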

Similarity

Different vector-based metrics are applied when we want to compute the similarity of pairs of nodes. Node similarity is calculated by looking at how many neighbors two nodes share, as in an approximate nearest neighbor algorithm. This algorithm constructs a k-nearest neighbors (k-NN) graph for a set of objects based on a given similarity measure. If I use the Euclidean distance and k-NN to calculate the distance between two nodes:

$$d(p, q) = d(q, p) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \dots + (q_n - p_n)^2} = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$

The above formula sums over the n dimensions (or features). The similarity measures that can be leveraged with k-nearest neighbors include Cosine similarity, Jaccard similarity, Euclidean distance, and Pearson similarity, to name a few.
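A minimal sketch of a k-NN lookup under Euclidean distance, assuming toy two-dimensional node vectors; all names and values are illustrative.

```python
import numpy as np

def euclidean(p, q):
    """d(p, q) = sqrt(sum_i (q_i - p_i)^2) over the n feature dimensions."""
    return np.sqrt(np.sum((np.asarray(q) - np.asarray(p)) ** 2))

def knn(query, vectors, k=3):
    """Return the k node ids nearest to `query` under Euclidean distance."""
    return sorted(vectors, key=lambda node: euclidean(query, vectors[node]))[:k]

vectors = {"n1": [0.1, 0.9], "n2": [0.2, 0.8], "n3": [0.9, 0.1], "n4": [0.85, 0.2]}
print(knn([0.15, 0.85], vectors, k=2))  # ['n1', 'n2']
```

Swapping in cosine or another of the measures listed above only changes the `key` function.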

2.2. Knowledge Graphs

2.2.1. Knowledge Representation

Knowledge graphs model information in the form of entities and the relationships between them. This sort of knowledge representation is a field, long explored in logic and reasoning, focused on representing abstract information about the world in a way that a computer can interpret. A few examples of knowledge representation formalisms include semantic nets, systems architecture, frames, rules, and ontologies.

2.2.2. Knowledge Bases

A knowledge base is a centralized database for storing, organizing, and disseminating represented knowledge. The general representation for a knowledge base is an object model with classes, subclasses, and instances. Some of what I touched on, such as semantic nets and ontologies, are these object models. The two main forms of knowledge bases are machine-readable and human-readable. Machine-readable knowledge bases store data that can only be analyzed by artificial intelligence systems. Human-readable knowledge bases store documents and physical texts that can be accessed by humans. The key factors to consider when determining the usefulness of a knowledge base are the completeness, accuracy, and quality of its information. When we structure large, unstructured data, we often use a graphical representation of this knowledge known as a knowledge graph.

2.3. Named-Entity Linking Components

2.3.1. Named-Entity Recognition

Many natural language processing applications require identifying named-entities in text data and classifying them. Named-entities can be, for example, person or company names, dates and time expressions, organizations, locations, etc. The task of identifying these in a text is called named-entity recognition and is often performed for a specific domain of unstructured data.

Figure 2.3.: Named-Entity Recognizer for Music

Named-entity recognition is usually a supervised task because of its reliance on annotated data, such as CoNLL-2003 (Tjong Kim Sang and De Meulder, 2003). With this strong need for annotated data for certain domains, such as medicine or law, the task becomes knowledge-intensive by nature. The most common NLP applications that benefit from named-entity recognition are QA, relation extraction, and coreference resolution. In fact, coreference resolution solves the ambiguities of named-entity recognition by finding the references to the same entity in a text.

2.3.2. Candidate Generation via Locality Sensitivity Hashing

In the candidate generation process, I needed to find the k-NN among the candidates (Dong et al., 2011). Using brute force to process all possible candidate combinations would have given us the exact nearest neighbors, but it is neither scalable nor fast. I also calculated the frequency with which an anchor link (Wikipedia hyperlink) corresponds to a target page, but this method alone was not enough. Thankfully, there are heuristics that lead to a promising approximation to this k-NN search task, locality sensitivity hashing (LSH) being one such algorithm (Gionis et al., 1999).

Figure 2.4.: Locality Sensitivity Hashing

LSH hashes data points into buckets so that data points near each other are located in the same buckets with high probability, while data points far from each other are likely to be in different buckets. This makes it possible to observe data within various degrees of similarity. The reason why I don't simply rely on the anchor link for candidate generation is that even a slight character difference could result in no matches. Similarity metrics, such as Jaccard similarity, can successfully retrieve entities with names similar to the identified entity text (E. Zhu et al., 2016). Jaccard similarity checks to see how similar two texts are and can be efficiently approximated by MinHash LSH:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

If I have a greater intersection of characters between two words, then I can expect a higher Jaccard index. The frequency and rarity of candidates in a set should be given proper consideration when choosing the similarity measure. Entities that constantly reoccur in the text tend to have embeddings with large norms. This can dominate the candidate generation process, so variants of similarity measures that put less emphasis on the norm of an entity embedding should be applied. Regularization can aid in the inverse issue of rare entities being selected over more relevant ones.

Figure 2.5.: Jaccard Index of 1/4

As in Figure 2.5, I have one word "Floyd" and another word "Florida"; I can expect an overall Jaccard index of 1/4 because of the characters they have in common out of all the characters overall. MinHashing is the most critical aspect of the algorithm and was chosen because of its effectiveness with Jaccard similarity, that is, mapping similarity between sets.
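To make the approximation concrete, here is a small, self-contained sketch of character-shingle Jaccard similarity and a MinHash estimate of it. The bigram shingling, the 64 seeded MD5 hash functions, and all names are assumptions for the example rather than the exact setup used in this work; note that on bigram shingles the two words from Figure 2.5 do yield a Jaccard index of 1/4.

```python
import hashlib

def shingles(word, n=2):
    """Character n-grams, e.g. 'floyd' -> {'fl', 'lo', 'oy', 'yd'}."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard(a, b):
    """J(A, B) = |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b)

def minhash(items, num_perm=64):
    """For each of num_perm seeded hash functions, keep the minimum
    hash value over the set; near-identical sets share many minima."""
    return [min(int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16)
                for x in items)
            for seed in range(num_perm)]

def minhash_similarity(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a, b = shingles("floyd"), shingles("florida")
print(jaccard(a, b))                               # 0.25, as in Figure 2.5
print(minhash_similarity(minhash(a), minhash(b)))  # noisy estimate near 0.25
```

In a full LSH index, the signatures would further be cut into bands so that any two sets sharing a band land in the same bucket.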

2.3.3. Named-Entity Disambiguation

Named-entity disambiguation and named-entity linking are used interchangeably, but for the purposes of our research, I distinguish the overall system as named-entity linking and the linking component itself as named-entity disambiguation. This component performs the task of mapping words of interest, such as names of persons, locations, and companies, from an input text document to corresponding unique entities in a target knowledge base such as Wikipedia. When performing named-entity disambiguation, I do not directly employ Wikipedia; there are databases better suited for accessing and retrieving information from its knowledge base, such as DBpedia or Wikidata. Kulkarni et al. (2009) were the first to annotate and bridge unstructured text with entity IDs from a knowledge base to disambiguate entities. Two years later, Hoffart et al. (2011) made one of the preeminent contributions to the disambiguation task by adding rich context through a combined framework of popularity priors, similarity measures, and coherence algorithms. Their robust framework became the standard that most state-of-the-art models augment and attempt to surpass. Finally, Parravicini et al. (2019) achieved state-of-the-art accuracy by leveraging knowledge graph embeddings for the disambiguation task.

Figure 2.6.: Named-Entity Disambiguation with Wikidata (Parravicini et al., 2019)

Fields such as author disambiguation (Franzoni et al., 2019), natural language generation (NLG) (Koncel-Kedziorski et al., 2019; Logan et al., 2019), and QA systems (Reddy et al., 2017) benefit from higher-level representations of text that we cannot reach with simple recognition or recommendation algorithms. Named-entity disambiguation can discover underlying meanings, letting us find concepts relevant to the task or application that are separate from the text itself. As one example of the benefits of using named-entity disambiguation, consider a simple query, "Floyd revolutionized rock with the Wall". As I mentioned with PageRank and LSH, search and recommendation engines try to find the most relevant documents to recommend to a user and to find supplementary information that may be to their liking. Without a named-entity disambiguation component, the search engine only looks for information that mentions "Floyd", "rock", and "wall". This engine may provide us "Pink Floyd", "rock music", and "The Wall", but it could also give us false negatives, meaning that it misses out on retrieving additional information pertaining to Pink Floyd, such as "prog rock", "David Gilmour", and "Comfortably Numb". Even worse, the engine could provide us a series of false positives, such as information on "rock wall climbing", "Floyd Mayweather", "border walls", and "The Rock".

2.4. Feature Embeddings

One of the more assured ways to encode the kind of properties that I have in mind is through the use of feature vectors. Feature vectors consist of numeric or nominal values in which we embed specific information as an input to many machine learning algorithms. This embedded information is a hidden low-dimensional vector representation used to preserve linguistic, spatial, or other extraneous features in a new space for effective learning.

2.4.1. Word Embeddings

Word embeddings focus on the word features of a certain lexicon. They are capable of capturing the context of a word, such as its semantic and syntactic similarity. Words or phrases are mapped to vectors of real numbers. This involves a mathematical embedding from a space with many dimensions per word to a continuous vector space with a much lower dimension. Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. By reducing the size of word embeddings, we can improve their utility in memory-constrained pipelines.


Word2Vec is one of the most popular techniques, developed at Google, to learn word embeddings using a group of models (Goldberg and Levy, 2014). One of these models in Word2Vec is known as the skip-gram model. It takes every word in a large corpus along with, one by one, the words that surround it within a defined window, and feeds these to a neural network that, after training, will predict the probability for each word to actually appear in the window around the target word.

Figure 2.8.: Context Window
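As a minimal illustration of skip-gram training, here is a sketch using gensim (an assumption: the thesis does not name a toolkit, and the API shown is gensim 4). The tiny corpus is made up, so the resulting neighbors are noise; it only shows the moving parts.

```python
from gensim.models import Word2Vec

sentences = [
    ["pink", "floyd", "released", "the", "wall"],
    ["george", "bush", "was", "president"],
    ["the", "wall", "is", "an", "album", "by", "pink", "floyd"],
]

# sg=1 selects skip-gram; window=5 is the context window around each word
model = Word2Vec(sentences, vector_size=50, window=5,
                 min_count=1, sg=1, epochs=50)
print(model.wv.most_similar("floyd", topn=3))
```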

Context2Vec is a neural model that develops a generic embedding function for these context windows of target words. Melamud et al. (2016) demonstrated that efficient training on billions of words (within reasonable time constraints) could maintain high-quality context representations which significantly outperform traditional word embeddings.

2.4.2. Graph Embeddings

Graph embeddings are the transformation of property graphs to a vector or a set of vectors. Embeddings should capture the graph topology, node-to-node relationships, and other relevant information about graphs, subgraphs, and nodes. The more properties embedded, the better the potential results. Graph embeddings are often divided into three main groups:

• Node embeddings: We encode each node with its own vector representation. We would use this embedding when we want to perform visualization or prediction on the node level, e.g. visualization of vertices in the 2D plane, or prediction of new connections based on vertex similarities. In many ways, this method is very similar in its mapping to Word2Vec. A few examples are DeepWalk (Perozzi et al., 2014) and Node2Vec (Grover and Leskovec, 2016).

• Bilinear-based embeddings: We encode the relationships between two entity vectors using multiple matrices. Assuming a total number of entities E and a total number of relations R, the total number of parameters will be E × E × R. Bilinear-based models like RESCAL (Nickel et al., 2011) generate the score s of a triple (h, r, t) via tensor factorization:

$$s(h, r, t) = \theta_h^T M_r \theta_t$$

The head embedding $\theta_h$ is transposed ($T$) and each relation is represented as a matrix $M_r$. There is a need for weight decay with RESCAL because each relation carries many parameters, which generally leads to overfitting and downgrades the overall performance (Nickel et al., 2011).

• Translation-based embeddings: We encode the whole relation with a single vector. The fundamental notion is that the model is making the sum of the head vector and relation vector as close as possible to the tail vector. Translation-based models like TransE (Bordes et al., 2013) and TransD (Ji et al., 2015) solve link prediction in multi-relational data by interpreting relationships as translations operating on a learned low-dimensional embedding of the entities in a knowledge graph, rather than on the graph structure itself.

Figure 2.9.: TransE

TransE is one of the most notable translation-based models for knowledge graph embeddings due to the sheer simplicity of its method:

$$s(h, r, t) = d(\theta_h + v_r - \theta_t)$$

As two embeddings are compared to generate the score s of their triple (h, r, t), the head embedding $\theta_h$ is first translated by the relation vector $v_r$. TransE returns lower scores for entities that are close; the semantic triple score is computed as above, where d is a dissimilarity function such as the $L_1$ or $L_2$ norm.

DistMult (Yang et al., 2014) is similar to both RESCAL and TransE. Instead of full relation matrices, Yang et al. (2014) reduce the number of relation parameters by using only diagonal matrices, represented as vectors v, to generate the score s of a triple (h, r, t):

$$s(h, r, t) = \langle \theta_h, v_r, \theta_t \rangle = \sum_{d=1}^{n} \theta_{h,d} \, v_{r,d} \, \theta_{t,d}$$

Above, d indexes the diagonal, which is limited to representing only symmetric relations; the same embedding space is on the left and right sides. DistMult and TransE both use a low number of parameters to achieve state-of-the-art results. However, having a model that focuses solely on diagonal matrices is not without its limitations.
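To make the two scoring functions concrete, here is a minimal numpy sketch; the random 100-dimensional vectors stand in for trained embeddings and are purely illustrative.

```python
import numpy as np

def transe_score(h, r, t, norm=1):
    """TransE: s(h, r, t) = d(theta_h + v_r - theta_t), with d the L1 or
    L2 norm; lower scores indicate a more plausible triple."""
    return np.linalg.norm(h + r - t, ord=norm)

def distmult_score(h, r, t):
    """DistMult: s(h, r, t) = sum_d theta_h[d] * v_r[d] * theta_t[d];
    the relation acts as a diagonal matrix, and higher is more plausible."""
    return np.sum(h * r * t)

rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, 100))  # stand-ins for learned embeddings
print(transe_score(h, r, t), distmult_score(h, r, t))
```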

2.5. Neural Networks and Deep Learning

2.5.1. Long Short Term Memory (LSTM)

Long Short Term Memory (LSTM) networks are a form of recurrent neural network (RNN) capable of learning long-term dependencies (Hochreiter and Schmidhuber, 1997). In standard RNNs, the repeating module has a very simple structure, such as a single tanh layer.

Figure 2.10.: LSTM Unit

LSTMs also have this chain-like structure, but the repeating module is built differently. Instead of a single neural network layer, there are four layers interacting in a way that cleverly manages time intervals. In Figure 2.10, at center is an LSTM unit composed of a cell, an input gate, an output gate, and a forget gate. The cell remembers values over these time intervals, and the gates regulate and control the information that comes in and out of the cell. LSTMs were created to deal with the vanishing gradient problem that is encountered with traditional RNNs.

2.5.2. Convolutional Neural Networks (CNN)

A convolutional neural network will apply 1D convolutions to map features of text, and concurrently apply max pooling operations over the time-step dimension to obtain a fixed-length output. We are often talking about 1D convolutions when working with text data and 2D convolutions when working with image data. With graphs in mind, studies have explored the encapsulation of graphs through 2D and 3D convolutions, respectively. However, learning a graph through convolutions is one difficulty, and learning an entire knowledge graph through convolutions is another (Battaglia et al., 2018). The general purpose of using various convolutions within one network of graph data is to capture larger representations in different dimensions. For example, one knowledge graph will have nodes and edges, and each node and edge will more than likely have additional labels and types. Unsurprisingly, where LSTMs fail to capture these features beyond linear representations, CNNs show more promise in capturing these complexities. In link prediction, densely connected convolutional neural networks have been effective in conjunction with classic graph heuristics and similarity metrics (W. Wang et al., 2019).

Figure 2.11.: CNN with Semantic Triples

2.5.3. Contextual Embeddings from Language Models (ELMo)

Looking for a way to advance my context embeddings, where I can look at the entire sentence before assigning each word a corresponding embedding, I decided to implement ELMo embeddings. ELMo performs many tasks with state-of-the-art precision and recall in predicting following words in sentences (Peters et al., 2018). ELMo uses two layers of bidirectional LSTMs (BiLSTMs) in its training, with both layers bridged by a residual connection. A residual connection is used to allow gradients to flow through a network directly, without passing through the non-linear activation functions. The high-level intuition is that residual connections help neural networks to train more successfully (Peters et al., 2018).

Figure 2.12.: BiLSTM Layers in ELMo

ELMo embeddings are character-based, which allows a neural network to use morphological notions to form representations for out-of-vocabulary tokens unseen in training. It is for this reason that static word embeddings like Word2Vec and GloVe usually fall short. Even when we create an embedding with a wide context window, a word will ultimately have the same vector representation regardless of the context. ELMo embeddings change with context. It is this text prediction, achieved by the forward and backward language models in ELMo, that makes it one of the best at tracking language patterns and transfer learning.

3. Methodology

I demonstrate the entire named-entity linking system from input to output. I then discuss the embeddings, as well as the general model pipelines of the disambiguation component and the feed-forward neural network (FFNN).

3.1. Named-Entity Linking

The input text is processed by a named-entity recognition component set up with the SpaCy implementation. The start and end positions of each named-entity are catalogued along with the raw text input and the identified named-entity. A few named-entity types that I excluded were number-related, such as money, time, and percentages. Afterwards, the input text is cleaned and realigned with new start and end positions. The top candidates are then retrieved from the anchor link frequency and the LSH algorithm, and finally, the named-entity disambiguation model is run to deliver the output.
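As a minimal illustration of this first step, the spaCy API exposes exactly the pieces catalogued here: the entity text, its label, and its start and end character positions. The model name en_core_web_sm is an assumption; the thesis only specifies SpaCy.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed small English model
doc = nlp("George Bush visited Stockholm in June 2019.")

for ent in doc.ents:
    # entity text, entity type, and character offsets into the raw input
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
```

Date- and number-like entity types such as the one covering "June 2019" would be among those excluded above.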

Figure 3.1.: Named-Entity Linking Components

For evaluation purposes, I focus on the named-entity disambiguation component. To prepare the training data, I performed text feature engineering for some key elements that were used in our work. I first set up clean candidate/anchor link lists. Soon after, I processed the 1.5 million sections of Wikipedia that I was using. This took several days, and the text had to be processed a few times over. The reason to reprocess the text was alignment: I had to ensure the alignment of the textual mention, in our case the anchor link, in the section text was the same before and after processing. This required recalculating positions each time I cleaned the section text. In the case of number substitution, I opted to replace numbers with hashes, as this was a useful way to avoid problems that would arise from mixed data types.

3.2. Disambiguation Models

For my deep learning models, I treated the task of named-entity disambiguation as a classification task. For each candidate in the candidate list of an identified entity in a text, the model predicts whether this candidate is the true named-entity for the identified mention. Specifically, given the knowledge embedding of the candidate and the local context embedding of the text as inputs, the model predicts true if the candidate is the correct entity and false otherwise. For example, our baseline model uses word embeddings with a local context window, trained as part of the embedding layer of a BiLSTM. Once this occurs, the context embeddings are concatenated with knowledge graph embeddings and fed into a feed-forward neural network. With this framework, I successfully mapped between the knowledge graph embeddings and local context embeddings to disambiguate the named-entities. The notion of incorporating graph embeddings with local context embeddings to map ambiguous named-entities to those in a knowledge base stems from Parravicini et al. (2019). In their framework, Parravicini et al. (2019) successfully leveraged graph embeddings to achieve named-entity disambiguation. Using DBpedia as the knowledge base and existing graph algorithms for candidate generation, they were able to achieve state-of-the-art accuracy on a number of datasets and fast retrieval of entities in real-world engines. In comparison to my work, they use different similarity metrics in their candidate generation rather than Jaccard similarity and the LSH algorithm, and their graph embeddings were node embeddings (DeepWalk) rather than higher-level knowledge graph embeddings. Our approach is also novel since it is the first of its kind to concatenate contextual word embeddings, such as ELMo, with knowledge graph embeddings to conduct disambiguation as a classification task.

3.2.1. Embeddings

The Context2Vec embeddings were trained on the 500k most frequently occurring words in the dataset. With this subset, I mapped them to embeddings with a context window of (+/-) 10 words, which is fed into an embedding layer of the model. The ELMo embeddings were trained by taking 1.5 million Wikipedia section texts from the dataset. I truncated the text to a window of (+/-) 10 words to make ELMo simple to compare with Context2Vec and also to ease the computation. The TransE embeddings were trained from 5 million vectors of entities and relations in Wikidata and Wikipedia. This includes general-domain entities such as concepts, people, and things. I used the graph embedding engine GraphVite (X. Wang et al., 2019; Z. Zhu et al., 2019) to run the knowledge graph embedding algorithms from Chapter 2.4.2 and generate embeddings in a short amount of time. DistMult was used in some preliminary experiments, but ultimately I chose TransE as my knowledge graph embedding for the final results.

3.2.2. BiLSTM Model

My baseline model takes the Context2Vec embeddings as input. This input is fed into a BiLSTM, whose output is concatenated with the knowledge graph embeddings; the result is then fed into a FFNN. The FFNN maps between the context embeddings and knowledge graph embeddings for the final output. The motivation for the baseline model was to augment and tune a model that builds on the Context2Vec embeddings.

Figure 3.2.: BiLSTM Model Pipeline

I also know that a BiLSTM is quite effective on sequential tagging and word classification. Particularly when we are looking at a window size of 10 words before and after our target word, an LSTM is fundamental for Context2Vec.

3.2.3. CNN-BiLSTM Model

The motivation for the second model is much like the baseline model, yet the purpose was to develop a stack on top of the BiLSTM. A CNN captures hierarchical relations, and there have been positive results with stacked CNN-LSTM models in text classification (Zhou et al., 2015). I use the CNN to extract a sequence of higher-level phrase representations, which is then fed into the BiLSTM for sentence representation. I fed my context embeddings into one convolutional layer to see how it would perform.

Figure 3.3.: CNN-BiLSTM Model Pipeline

3.2.4. ELMo Model

The ELMo model replaces the Context2Vec embeddings with ELMo embeddings as input, and since the ELMo embeddings are already built from BiLSTMs, I only use a basic LSTM to process them; this output is concatenated in the same way as in the other two models. Depending on the window size of the embedding, runtime would vary greatly.

Figure 3.4.: ELMo Model Pipeline

3.2.5. Feed-Forward Neural Network (FFNN)

Each model utilizes a FFNN towards the back of its architecture. The FFNN has a sigmoid function that transforms the vectors coming out of the previously established network into the range (0, 1) before the loss computation:

$$f(s_i) = \frac{1}{1 + e^{-s_i}}$$

The sigmoid is applied independently to each element and is also known as the logistic function. Unlike softmax loss, each vector component (class) is independent; therefore, the loss computed for each output class is not affected by the other classes. A sigmoid activation function is applied to the scores before computing the cross-entropy loss:

$$CE = -\sum_{i} t_i \log(s_i)$$

I use cross-entropy to calculate the difference between two (or more) probability distributions, and it measures the performance of my models, whose output is a probability for a target value of either 0 or 1.
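A small numpy sketch of the two computations; the explicit (1 - t) log(1 - s) term for the negative class is the standard binary form of cross-entropy with sigmoid outputs (an addition beyond the summation shown above), and the scores and targets are made up.

```python
import numpy as np

def sigmoid(s):
    """f(s_i) = 1 / (1 + exp(-s_i)): squash raw scores into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-s))

def binary_cross_entropy(t, s, eps=1e-12):
    """CE = -sum_i [t_i log(s_i) + (1 - t_i) log(1 - s_i)]."""
    s = np.clip(s, eps, 1 - eps)  # avoid log(0)
    return -np.sum(t * np.log(s) + (1 - t) * np.log(1 - s))

scores = sigmoid(np.array([2.1, -0.7, 0.3]))  # one score per candidate
targets = np.array([1.0, 0.0, 0.0])           # first candidate is the true entity
print(binary_cross_entropy(targets, scores))
```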

4. Experiments

I perform a number of experiments and present these results on openly available datasets using different feature combinations, models, hyperparameters, and sample sizes to exemplify this performance. The datasets are detailed for the named-entity disambiguation, as well as their settings. The Kensho-Derived Wikimedia Dataset (KDWD) was used for the QA system and the AIDA CoNLL-YAGO Dataset (CoNLL03/AIDA) was used as an entity linking benchmark. Afterwards, I define the evaluation metrics used for named-entity disambiguation, perform my experiments, and compare the results with other notable works.

4.1. Data

4.1.1. Kensho-Derived Wikimedia Dataset (KDWD)

Wikipedia, the free encyclopedia, and Wikidata, the free knowledge base, are crowd-sourced projects supported by the Wikimedia Foundation. Recently, Wikipedia added its six millionth English article after two decades of operation. Wikidata, a more machine-readable sister project, holds more than 75 million items since its creation 8 years ago. The Wikimedia Foundation disseminates this information under a free license; it has therefore been heavily researched by data science and computer science groups, particularly in the field of natural language processing (NLP). The Kensho-Derived Wikimedia Dataset (KDWD)¹ is a concentrated subset of the raw Wikimedia data in a condition more fit for NLP research.

Pages   Tokens   Entities   Relations
5.3M    2.3B     51M        140M

Table 4.1.: KDWD Dataset

The KDWD dataset is structured with three layers of data: the text from the Wikipedia pages, the hyperlinks between the pages, and the entities and relations built from the Wikidata graph. Entities and relations are synonymous with items (Q) and statements (P) in Wikidata. For example, the Noam Chomsky (Q9049) item in Wikidata has statements that the item is an instance of (P31) human and has the occupation (P106) linguist (Q14467526) and political writer (Q15958642). If I only observe the "Introduction" sections of these pages, I am still left with 460M tokens in our corpus. Sometimes evaluation can prove more difficult depending on the referent used in the data, whereby a system annotates an entity with an encrypted redirect rather than the direct entity in the URI. DBpedia (Auer et al., 2007), which relies on the Wikimedia project for its knowledge base, will have redirects such as http://dbpedia.org/resource/PEHDTSCKJBMA that are completely apt for referencing http://dbpedia.org/resource/Tom_Waits. In both datasets, I established the direct links without relying on such redirects for our network.

¹ https://www.kaggle.com/kenshoresearch/kensho-derived-wikimedia-data

4.1.2. AIDA CoNLL-YAGO Dataset (CoNLL03/AIDA)

The AIDA CoNLL-YAGO Dataset created by Hoffart et al. (2011) contains assignments of entities to the mentions of named-entities that were annotated in the original CoNLL 2003 NER task (Tjong Kim Sang and De Meulder, 2003). The entities are identified by a YAGO2 identifier, by Wikipedia URL, or by Freebase mid. For our purposes, I used the YAGO2 entity identifier as the target ID, rather than the Wikidata ID used in KDWD. Each mention of an entity has an accompanying text section that can be used to train the model.

        Documents   Entities
TRAIN   946         18k
VALID   215         4.6k
TEST    230         4.3k

Table 4.2.: CoNLL03/AIDA Dataset

The referent used is the Wikipedia link, or anchor link, that I used in KDWD. This makes the comparison between KDWD and CoNLL03/AIDA easier, since their entity linking is sourced from similar knowledge bases. CoNLL03/AIDA is the standard benchmark for disambiguation.

4.2. Settings

I discuss the environment I created, from the data preparation to the NER component and candidate generation component, and provide a comprehensive look at the three models in the disambiguation component.

Data Setting: Due to the size of KDWD, I train on only the "Introduction" sections of 1.5 million pages. The amount of text in KDWD can vary dramatically, so feature engineering was quite computationally expensive. The data was split into the traditional 70% training, 15% validation, and 15% testing. I keep the 500k most frequent tokens as our Context2Vec lexicon. For the CoNLL03/AIDA dataset, I use all documents, which were already partitioned. The entire document text accompanies each of the entities mentioned in the dataset, whereas this structure preexists in KDWD. I use the 21k most frequent tokens as our Context2Vec lexicon. A max length of 50 tokens was set for processing the ELMo embeddings in both datasets. Text normalization was standard, but numbers were replaced with hashes.

NER Setting: The entities were recognized using SpaCy v2.0, which uses subword features and Bloom embeddings to parse entities (Serrà and Karatzoglou, 2017). The entity types recognized were person, location, organization, etc.

Candidate Generation Setting: The top candidates were calculated from the anchor link frequency and the LSH algorithm in Chapter 2.3.2. I use Jaccard similarity and MinHashing to map the similarity between the given sets of candidates. This setting accounts for erroneous spelling and helps reduce the dimensionality of the data.

NED Setting: The general pipelines were discussed in Chapter 3, but here I give the exact sizes of the input (with sequences no longer than 20 tokens) and output, along with the detailed layers of each architecture. The candidate lists used for disambiguation label one true entity among ten false entities.

• BiLSTM Model: The BiLSTM has an output size of 100 and the FFNN has a dense layer of size 256 with ReLU activation. A dropout layer is added, and the final output layer is a dense layer of size 1 with sigmoid activation to classify the entities. The learning rate was 0.007 and the batch size was 256, with 100 epochs that stopped once the validation loss peaked, after a patience of 2 (a minimal code sketch of this pipeline follows the model descriptions below).

Figure 4.1.: BiLSTM Model Architecture

• CNN-BiLSTM Model: Most of my hyperparameters and features were chosen from Yenter and Verma (2017), who used a CNN-LSTM for binary classification of movie review sentiment. I have a dropout layer before the convolution. The convolution has a kernel size of 5, 64 filters, ReLU activation, valid padding, and a stride of 1. I add a max pooling layer and batch normalization before feeding into the same BiLSTM and FFNN as in the BiLSTM model. The learning rate was 0.007 and the batch size was 256, with 100 epochs that stopped once the validation loss peaked, after a patience of 2.

Figure 4.2.: CNN-BiLSTM Model Architecture

• ELMo Model: I have an output size of 1024 in our LSTM to match the size of the ELMo embeddings. The LSTM has a recurrent dropout of 0.2. The final dropout is 0.2. I used the Adam optimizer with a batch size of 64, a learning rate of 0.001, and 20 epochs, also with early stopping (Broscheit, 2019). The rest of the model stays consistent with the other two after concatenation.

Figure 4.3.: ELMo Model Architecture
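Below is a minimal Keras sketch of the baseline BiLSTM pipeline under the settings above (BiLSTM size 100, dense layer of 256 with ReLU, dropout, a sigmoid output, learning rate 0.007, batch size 256, and early stopping with a patience of 2). The vocabulary size matches the KDWD lexicon, but the embedding dimensions and the dropout rate are assumptions, and the actual implementation may differ.

```python
import tensorflow as tf
from tensorflow.keras import layers

SEQ_LEN = 20                 # input sequences no longer than 20 tokens
VOCAB = 500_000              # Context2Vec lexicon size for KDWD
WORD_DIM, KG_DIM = 300, 512  # embedding dimensions: assumed for illustration

context_in = layers.Input(shape=(SEQ_LEN,), name="context_tokens")
kg_in = layers.Input(shape=(KG_DIM,), name="candidate_kg_embedding")

x = layers.Embedding(VOCAB, WORD_DIM)(context_in)  # trained as the Context2Vec layer
x = layers.Bidirectional(layers.LSTM(100))(x)      # BiLSTM with output size 100

x = layers.Concatenate()([x, kg_in])               # context + knowledge graph embedding
x = layers.Dense(256, activation="relu")(x)        # FFNN dense layer
x = layers.Dropout(0.5)(x)                         # dropout rate assumed
out = layers.Dense(1, activation="sigmoid")(x)     # true/false candidate

model = tf.keras.Model([context_in, kg_in], out)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.007),
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])

early = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2)
# model.fit([token_ids, kg_vectors], labels, batch_size=256, epochs=100,
#           validation_data=val_data, callbacks=[early])
```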

4.3. Evaluation Metrics

4.3.1. Precision and Recall

The general statistical measures I observe are F1, precision, and recall. Precision and recall both give us indications of the accuracy of a model, but provide deeper meaning about what the model is actually predicting. Precision means the percentage of our results which are relevant to the task, whereas recall means the percentage of total relevant results which are correctly classified:

$$\text{Precision} = \frac{|\text{Relevant Results} \cap \text{Retrieved Results}|}{|\text{Retrieved Results}|}$$

$$\text{Recall} = \frac{|\text{Relevant Results} \cap \text{Retrieved Results}|}{|\text{Truly Relevant Results}|}$$

The tradeoff is that lowering precision will give us irrelevant results not suitable for a user, while raising precision tends to come at the cost of recall, leaving relevant results unretrieved. This inverse relationship is why we use their harmonic mean, the F1 score, to balance precision and recall:

$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
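As a quick worked example of the three formulas (the counts are made up, chosen to land near the benchmark scores reported in Chapter 4.4):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 87 true positives, 32 false positives, 13 false negatives
print(precision_recall_f1(87, 32, 13))  # -> (0.73, 0.87, 0.79)
```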

4.3.2. Classification Metrics

There are two ways I observe the classification task in my method. I first observe a confusion matrix of predicted results. A confusion matrix displays the predicted values on the y-axis and the actual values on the x-axis, broken down by each class; in my task, I expect two classes. A confusion matrix allows us to understand not just the errors being made by the classifier but, more importantly, the types of errors being produced by our model. I secondly observe a receiver operating characteristic (ROC) curve and the area under the curve (AUC). It expresses how well a model is capable of distinguishing between classes: true named-entities and false named-entities in our classification task. The higher the AUC, the better the model is at predicting true and false entities. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR). The TPR is synonymous with recall; in contrast to precision, the FPR measures the ratio of false positives among the negative samples.
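For reference, both diagnostics are one-liners in scikit-learn (an assumption: the thesis does not say which toolkit produced its matrices and curves), and the label vectors here are made up.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = [1, 0, 0, 1, 0, 1, 0, 0]                  # true vs. false entities
y_prob = [0.9, 0.4, 0.2, 0.6, 0.7, 0.8, 0.1, 0.3]  # model probabilities
y_pred = [int(p >= 0.5) for p in y_prob]           # threshold at 0.5

print(confusion_matrix(y_true, y_pred))  # sklearn: rows = actual, columns = predicted
print(roc_auc_score(y_true, y_prob))     # area under the ROC curve
```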

4.4. Results

First, I juxtapose the sample sizes and the correlative effects they have on the precision and recall of my models. Then I compare confusion matrices of the two datasets with the ELMo model and contrast the micro-precision of my model with state-of-the-art models. I note the accuracy of the candidate lists that I am basing my predictions on. I conclude with a brief analysis of the ROC curves and a discussion of remaining tangents.

4.4.1. The Effect of Training Data Size

The sample sizes listed in Table 4.3 and Table 4.4 are the approximate numbers of training and validation instances yielded. Since the CoNLL03/AIDA dataset is far smaller than KDWD, this is reflected in the experiment sizes.

                   Precision   Recall   F1     ROC-AUC   Sample Size
BiLSTM Model       0.71        0.66     0.68   0.76      500k
CNN-BiLSTM Model   0.72        0.67     0.69   0.78      500k
ELMo Model         0.86        0.72     0.78   0.90      500k
BiLSTM Model       0.74        0.75     0.74   0.82      1M
CNN-BiLSTM Model   0.72        0.82     0.77   0.85      1M
ELMo Model         0.85        0.80     0.82   0.91      1M
BiLSTM Model       0.86        0.86     0.86   0.92      2.8M
CNN-BiLSTM Model   0.84        0.88     0.86   0.92      2.8M
ELMo Model         0.88        0.81     0.84   0.93      2.8M

Table 4.3.: Average Disambiguation Scores of 5 Runs - KDWD

Unsurprisingly, increasing the number of samples I train and validate on correlates with an improvement on the classification task across all models. The BiLSTM model and CNN-BiLSTM model slightly edge out the ELMo model in performance with 2.8M samples. With less data, state-of-the-art models like ELMo can be effective, performing near an F1 of 80% with the smallest sample size in Table 4.3.

                   Precision   Recall   F1     ROC-AUC   Sample Size
BiLSTM Model       0.66        0.85     0.74   0.80      20k
CNN-BiLSTM Model   0.63        0.86     0.73   0.80      20k
ELMo Model         0.73        0.87     0.79   0.83      20k

Table 4.4.: Average Disambiguation Scores of 5 Runs - CoNLL03/AIDA

4.4.2. Disambiguation Models

While training my models, I include at most 10 false entities alongside the true entity as the potential candidates. When predicting the candidates, the BiLSTM model and CNN-BiLSTM model repeatedly have higher recall and lower precision, indicating that the classifier decides too many of the candidates may be the true candidate without pinpointing the exactly true one. The ELMo model has higher precision and lower recall, which substantiates the notion that state-of-the-art language models like ELMo and BERT (Sun et al., 2019) can often have a harder time generalizing, thus generating more false negatives, despite generating sufficient true positives. However, I find a direct contrast between the two datasets with the ELMo model. In Figure 4.5, I get a much higher recall (87%) and lower precision (73%). It should be further noted that both confusion matrices in Figure 4.4 and Figure 4.5 correspond to the datasets with the largest numbers of training instances. The BiLSTM model and CNN-BiLSTM model have consistent confusion matrices that hold between both datasets, but it could stand to reason that CoNLL03/AIDA generates more false positives from the lack of sufficient diversity in the data. For example, when a specific subject or theme manifests in a text, the model has a harder time differentiating between the macro-context of the text and the micro-context of the entity within the text.

Figure 4.4.: ELMo CM - KDWD
Figure 4.5.: ELMo CM - CoNLL03/AIDA

The precision that I have noted thus far has been the micro-precision. The micro-precision is the fraction of correctly disambiguated named-entities in an entire corpus, whereas the macro-precision is the fraction of correctly disambiguated named-entities averaged over their respective documents. In Table 4.5 there is a 9% drop from our highest micro-precision with CoNLL03/AIDA to the lowest-scoring state-of-the-art disambiguation model, by Hoffart et al. (2011).

                                 Micro-Precision
J. Raiman and O. Raiman (2018)   0.95
Sil et al. (2018)                0.94
Le and Titov (2018)              0.93
Hoffart et al. (2011)            0.82
Our Model                        0.73

Table 4.5.: Disambiguation Models - CoNLL03/AIDA

The disambiguation models in Table 4.5 are comparatively more complex networks, some of which implement a rich integration of the other components of the entity linking system that is absent in our work. J. Raiman and O. Raiman (2018) integrated symbolic knowledge into the reasoning process of a neural network. Sil et al. (2018) trained fine-grained similarities and dissimilarities between the query and candidate documents. Le and Titov (2018) used multi-relational learning with candidates. Hoffart et al. (2011) approximate effective joint mention-entity mapping.

4.4.3. Candidate List Accuracy

I evaluated the performance of the candidate lists that we select the candidates from, where the candidate with the highest probability is picked. It is not guaranteed that the true candidate appears in the list; it may happen to be missing. This is noted in Table 4.6 and highlights the strengths of our models in predicting the correct candidates. The accuracy is calculated with the recall-at-k approach.

                   KDWD   CoNLL03/AIDA
BiLSTM Model       0.92   0.83
CNN-BiLSTM Model   0.93   0.83
ELMo Model         0.92   0.85

Table 4.6.: Average Candidate List Accuracy of 5 Runs

In the classification task, there is either the possibility for more than one candidate to be predicted or the potential for no candidate to be predicted as the true entity. I have chosen to predict the candidate with the highest probability as the true entity. In candidate generation, bottleneck problems materialize with gaps in the breadth of a knowledge base. I extrapolate solutions and penalties for this issue in Chapter 4.4.4.

4.4.4. Analysis and Discussion

As I found, the greatest advantages in our models came from careful consideration of the context embeddings and the tradeoff between precision and recall. With more advanced models like the ELMo model, I can expect higher precision at the cost of recall; with the BiLSTM model and CNN-BiLSTM model, I can expect higher recall at the cost of precision. On testing with the CNN-BiLSTM model, I found a slight recall and ROC-AUC boost compared to the BiLSTM model. The ELMo embeddings often perform more strongly than the Context2Vec embeddings with less data.

Figure 4.6.: CNN-BiLSTM - KDWD
Figure 4.7.: CNN-BiLSTM - CoNLL03/AIDA

With a sample size of 2.8M training and validation instances, the BiLSTM model and CNN-BiLSTM model show stronger performance. The BiLSTM model improves by approximately 8-10% over the smaller sample size tier. The CNN-BiLSTM model and ELMo model begin to converge in their performance levels with more data. However, when working with a smaller dataset and an incomplete knowledge base, it stands to reason that ELMo embeddings would yield a more robust model. Comparing Figure 4.6 and Figure 4.7, I observe that the ROC curve is far more inconsistent with CoNLL03/AIDA. I get an average of 26-30% false positives with CoNLL03/AIDA and Context2Vec.

Figure 4.8.: ELMo - KDWD
Figure 4.9.: ELMo - CoNLL03/AIDA

Nonetheless, Figure 4.8 and Figure 4.9 express two smooth curves for both datasets when working with ELMo embeddings and the smallest sample sizes. In general, when relying too heavily on Wikipedia as the fundamental knowledge base, I will have a harder time generalizing my application to new data that does not reflect this structure well (Hachey et al., 2013). In my work, the models perform exceedingly well with KDWD and fairly with CoNLL03/AIDA; more datasets should be analyzed for a deeper understanding of how effective the performance is across different texts and formats. I must consider how much the model overfits or underfits the data, and monitoring loss can be one clear indication. There is a 0.2-0.3 difference between training and validation loss on CoNLL03/AIDA, whereas there is a magnitude smaller difference of 0.02-0.03 on KDWD. This illustrates that our models are overfitting to the CoNLL03/AIDA dataset, but I believe this to be a healthy amount. Additional dropout layers and batch normalization, along with larger data, were used to address these issues where I could apply them.

                                       P(Target|Anchor)   P(Anchor|Target)
Talking Heads                          0.972              0.973
Talking Heads (series)                 0.031              0.869
Talking Heads (Australian TV series)   0.021              1.000
Talking Heads (play)                   0.015              1.000
Pundit                                 0.004              0.007

Table 4.7.: Anchor Link of "talking heads"

In Table 4.7, the top 5 most likely target candidates come from the anchor link "talking heads". However, when I retrieve candidates for this same entry, I will get results such as "the walking seeds", "walking through fire", "headshaking", and "talking horse" from the LSH algorithm. Penalties, such as those that observe text dissimilarity wherein the difference is given weight, could filter out superfluous results during the disambiguation stage. Furthermore, time must be considered. I ran our models on Tesla K80 or P100 GPUs. ELMo embeddings took approximately twice the amount of time to run as the other deep learning models. When working with the 2.8M samples from KDWD, our runtimes would finish in 100-110 min with Context2Vec and 160-170 min with ELMo. For the final deployment of the entire named-entity linking system, I used the CNN-BiLSTM model with 2.8M samples for its modest improvement in the response time of query retrieval.

5. Conclusion and Future Work

In my work and research, I constructed a complete named-entity linking system that addresses many of the research questions I originally posited. I manage name variations in the case of misspellings, so that if we have "George Bosh" rather than "George Bush", the system will conclude that the two refer to the same entity. Name variations are also managed in the case of aliases, where the system can understand that "Bush Jr" and "George Bush" can be the same person given the proper context. However, abbreviations remain a challenge and usually rely on the primary representation of the identity. The ambiguity challenge is largely covered by my work, where the right context often gives way to the correct entity. In the case of coreference resolution, there remains room for improvement as a downstream task. Nonetheless, ambiguity is the most evident challenge solved. In the case of incomplete information, my system can determine context for those candidates within this snapshot of the dataset. The bottleneck problem of the candidate lists remains a limitation on the success of all of the challenges I addressed.

In order to begin construction, I first had to recognize the named entities. I then performed candidate generation by combining anchor-link frequency with the LSH algorithm, where a candidate group of potential entities was created for each identified entity. Afterwards, my named-entity disambiguation component selected the most probable candidate; this final selection was treated as a classification task.

I successfully incorporated two types of context embeddings (Context2Vec, ELMo) to concatenate with knowledge graph embeddings (TransE), a fresh approach to named-entity linking that performs well with both KDWD and CoNLL03/AIDA (a schematic sketch of this concatenation follows below). The focus of evaluation was mainly on the disambiguation component of the named-entity linking system. Running various models, hyperparameters, sample sizes, and embeddings, I was able to achieve a final F1 of 79% on the CoNLL03/AIDA benchmark.

For future work, there are a few different aspects of my research that should be expanded on. The first would be improving the other components of my system. Joint modeling of named-entity recognition, candidate generation, and named-entity disambiguation has been done both heuristically and with neural networks (Broscheit, 2019). My named-entity recognition component is a vanilla model; augmenting it with an LSTM and conditional random fields (CRF) would likely improve it. CRFs predict label sequences by using contextual information from neighboring labels to inform each decision. Given their strength with contextual information, it would make compelling work to develop a CRF for the disambiguation component as well, yet this usage lacks proper research and implementation.

The candidate generation component could be improved by experimenting with more graph algorithms for greater semantic understanding of text. I could have evaluated Jaccard similarity against cosine similarity, yet previous work suggested that Jaccard similarity performs best with the LSH algorithm (E. Zhu et al., 2016). In addition, candidate generation has bottleneck limitations that could be improved with more careful curation at the feature engineering stage rather than by changing the similarity metric.
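To make the embedding-concatenation approach recapped above concrete, the following is a minimal sketch of a candidate scorer in PyTorch. The dimensions, names, and feed-forward depth are illustrative assumptions, not my exact deployed model:

    import torch
    import torch.nn as nn

    class CandidateScorer(nn.Module):
        # Scores each candidate by concatenating a contextual mention embedding
        # (e.g. Context2Vec/ELMo) with the candidate's TransE entity embedding.
        def __init__(self, ctx_dim=1024, kg_dim=200, hidden=256):
            super().__init__()
            self.ffnn = nn.Sequential(
                nn.Linear(ctx_dim + kg_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 1),  # one logit per (mention, candidate) pair
            )

        def forward(self, ctx_emb, cand_embs):
            # ctx_emb: (ctx_dim,); cand_embs: (n_candidates, kg_dim)
            ctx = ctx_emb.expand(cand_embs.size(0), -1)  # repeat context per candidate
            pair = torch.cat([ctx, cand_embs], dim=-1)   # concatenate the two views
            return self.ffnn(pair).squeeze(-1)           # one score per candidate

    scorer = CandidateScorer()
    scores = scorer(torch.randn(1024), torch.randn(5, 200))
    print(scores.argmax().item())  # index of the predicted entity among 5 candidates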

Finally, Luan et al. (2018) assert that multi-task identification of entities, relations, and coreference resolution outperforms models, such as mine, that focus purely on entities. There is a lot of unused relational data in my knowledge graph embeddings that could be incorporated into deep learning, as was shown in Figure 2.11. I primarily use link prediction and similarity metrics, but graph algorithms like PageRank and Louvain modularity have been leveraged with deep learning models too (Cao et al., 2018).

A. Named-Entity Linking Examples

Figure A.1.: Named-Entity Linking Example 1

Figure A.2.: Named-Entity Linking Example 2

Bibliography

Adamic, Lada A. and Eytan Adar (2003). "Friends and neighbors on the Web". Social Networks 25, pp. 211–230.
Auer, Sören, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives (2007). "DBpedia: A Nucleus for a Web of Open Data". In: vol. 6. Jan. 2007, pp. 722–735. doi: 10.1007/978-3-540-76298-0_52.
Battaglia, Peter, Jessica Blake Chandler Hamrick, Victor Bapst, Alvaro Sanchez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Caglar Gulcehre, Francis Song, Andy Ballard, Justin Gilmer, George E. Dahl, Ashish Vaswani, Kelsey Allen, Charles Nash, Victoria Jayne Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matt Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu (2018). "Relational inductive biases, deep learning, and graph networks". arXiv. url: https://arxiv.org/pdf/1806.01261.pdf.
Bordes, Antoine, Nicolas Usunier, Alberto Garcia-Durán, Jason Weston, and Oksana Yakhnenko (2013). "Translating Embeddings for Modeling Multi-Relational Data". In: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2. NIPS'13. Lake Tahoe, Nevada: Curran Associates Inc., pp. 2787–2795.
Broscheit, Samuel (2019). "Investigating Entity Knowledge in BERT with Simple Neural End-To-End Entity Linking". In: Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL). Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 677–685. doi: 10.18653/v1/K19-1063. url: https://www.aclweb.org/anthology/K19-1063.
Cao, Jinxin, Di Jin, Liang Yang, and Jianwu Dang (2018). "Incorporating network structure with node contents for community detection on large networks using deep learning". Neurocomputing 297 (Feb. 2018). doi: 10.1016/j.neucom.2018.01.065.
Dong, Wei, Moses Charikar, and Kai Li (2011). "Efficient K-nearest neighbor graph construction for generic similarity measures". In: Jan. 2011, pp. 577–586. doi: 10.1145/1963405.1963487.
Franzoni, Valentina, Michele Lepri, and Alfredo Milani (2019). "Topological and Semantic Graph-based Author Disambiguation on DBLP Data in Neo4j". CoRR abs/1901.08977. arXiv: 1901.08977. url: http://arxiv.org/abs/1901.08977.
Gionis, Aristides, Piotr Indyk, and Rajeev Motwani (1999). "Similarity Search in High Dimensions via Hashing". In: Proceedings of the 25th International Conference on Very Large Data Bases. VLDB '99. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., pp. 518–529. isbn: 1558606157.
Goldberg, Yoav and Omer Levy (2014). "word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method". CoRR abs/1402.3722. arXiv: 1402.3722. url: http://arxiv.org/abs/1402.3722.
Grover, Aditya and Jure Leskovec (2016). "node2vec: Scalable Feature Learning for Networks". CoRR abs/1607.00653. arXiv: 1607.00653. url: http://arxiv.org/abs/1607.00653.
Hachey, Ben, Will Radford, Joel Nothman, Matthew Honnibal, and James Curran (2013). "Evaluating Entity Linking with Wikipedia". Artificial Intelligence 194 (Jan. 2013), pp. 130–150. doi: 10.1016/j.artint.2012.04.005.

Hochreiter, Sepp and Jürgen Schmidhuber (1997). "Long Short-term Memory". Neural Computation 9 (Dec. 1997), pp. 1735–1780. doi: 10.1162/neco.1997.9.8.1735.
Hoffart, Johannes, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum (2011). "Robust Disambiguation of Named Entities in Text". In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. Edinburgh, Scotland, UK: Association for Computational Linguistics, July 2011, pp. 782–792. url: https://www.aclweb.org/anthology/D11-1072.
Ji, Guoliang, Shizhu He, Liheng Xu, Kang Liu, and Jun Zhao (2015). "Knowledge Graph Embedding via Dynamic Mapping Matrix". In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Beijing, China: Association for Computational Linguistics, July 2015, pp. 687–696. doi: 10.3115/v1/P15-1067. url: https://www.aclweb.org/anthology/P15-1067.
Koncel-Kedziorski, Rik, Dhanush Bekal, Yi Luan, Mirella Lapata, and Hannaneh Hajishirzi (2019). "Text Generation from Knowledge Graphs with Graph Transformers". CoRR abs/1904.02342. arXiv: 1904.02342. url: http://arxiv.org/abs/1904.02342.
Kulkarni, Sayali, Ganesh Ramakrishnan, and Soumen Chakrabarti (2009). "Collective annotation of Wikipedia entities in web text". In: Jan. 2009, pp. 457–466. doi: 10.1145/1557019.1557073.
Le, Phong and Ivan Titov (2018). "Improving Entity Linking by Modeling Latent Relations between Mentions". In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics, July 2018, pp. 1595–1604. doi: 10.18653/v1/P18-1148. url: https://www.aclweb.org/anthology/P18-1148.
Logan, Robert, Nelson F. Liu, Matthew E. Peters, Matt Gardner, and Sameer Singh (2019). "Barack's Wife Hillary: Using Knowledge Graphs for Fact-Aware Language Modeling". In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, July 2019, pp. 5962–5971. doi: 10.18653/v1/P19-1598. url: https://www.aclweb.org/anthology/P19-1598.
Lu, Hao, Mahantesh Halappanavar, and Ananth Kalyanaraman (2014). "Parallel Heuristics for Scalable Community Detection". Parallel Computing 486 (Oct. 2014). doi: 10.1016/j.parco.2015.03.003.
Luan, Yi, Luheng He, Mari Ostendorf, and Hannaneh Hajishirzi (2018). "Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction". In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics, Oct. 2018, pp. 3219–3232. doi: 10.18653/v1/D18-1360. url: https://www.aclweb.org/anthology/D18-1360.
Melamud, Oren, Jacob Goldberger, and Ido Dagan (2016). "context2vec: Learning Generic Context Embedding with Bidirectional LSTM". In: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning. Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 51–61. doi: 10.18653/v1/K16-1006. url: https://www.aclweb.org/anthology/K16-1006.
Nickel, Maximilian, Volker Tresp, and Hans-Peter Kriegel (2011). "A Three-Way Model for Collective Learning on Multi-Relational Data". In: Jan. 2011, pp. 809–816.
Page, Larry, Sergey Brin, R. Motwani, and T. Winograd (1998). The PageRank Citation Ranking: Bringing Order to the Web.
Parravicini, Alberto, Rhicheek Patra, Davide B. Bartolini, and Marco D. Santambrogio (2019). "Fast and Accurate Entity Linking via Graph Embedding". In: Proceedings of the 2nd Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA). GRADES-NDA'19. Amsterdam, Netherlands: Association for Computing Machinery. isbn: 9781450367899. doi: 10.1145/3327964.3328499. url: https://doi.org/10.1145/3327964.3328499.
Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena (2014). "DeepWalk: Online Learning of Social Representations". CoRR abs/1403.6652. arXiv: 1403.6652. url: http://arxiv.org/abs/1403.6652.
Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer (2018). "Deep contextualized word representations". In: Proc. of NAACL.
Raiman, Jonathan and Olivier Raiman (2018). "DeepType: Multilingual Entity Linking by Neural Type System Evolution". In: AAAI.
Reddy, Sathish, Dinesh Raghu, Mitesh M. Khapra, and Sachindra Joshi (2017). "Generating Natural Language Question-Answer Pairs from a Knowledge Graph Using a RNN Based Question Generation Model". In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. Valencia, Spain: Association for Computational Linguistics, Apr. 2017, pp. 376–385. url: https://www.aclweb.org/anthology/E17-1036.
Serrà, Joan and Alexandros Karatzoglou (2017). "Getting deep recommenders fit: Bloom embeddings for sparse binary input/output networks". CoRR abs/1706.03993. arXiv: 1706.03993. url: http://arxiv.org/abs/1706.03993.
Shchur, Oleksandr and Stephan Günnemann (2019). "Overlapping Community Detection with Graph Neural Networks". arXiv: 1909.12201 [cs.LG].
Sil, Avirup, Gourab Kundu, Radu Florian, and Wael Hamza (2018). "Neural Cross-Lingual Entity Linking". In: AAAI.
Sun, Chi, Xipeng Qiu, Yige Xu, and Xuanjing Huang (2019). "How to Fine-Tune BERT for Text Classification?" CoRR abs/1905.05583. arXiv: 1905.05583. url: http://arxiv.org/abs/1905.05583.
Tjong Kim Sang, Erik F. and Fien De Meulder (2003). "Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition". In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147. url: https://www.aclweb.org/anthology/W03-0419.
Wang, Wentao, Lintao Wu, Ye Huang, Hao Wang, and Rongbo Zhu (2019). "Link Prediction Based on Deep Convolutional Neural Network". Information 10 (May 2019), p. 172. doi: 10.3390/info10050172.
Wang, Xiaozhi, Tianyu Gao, Zhaocheng Zhu, Zhiyuan Liu, Juanzi Li, and Jian Tang (2019). KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation. arXiv: 1911.06136 [cs.CL].
Yang, Bishan, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng (2014). "Embedding Entities and Relations for Learning and Inference in Knowledge Bases" (Dec. 2014).
Yenter, Alec and Abhishek Verma (2017). "Deep CNN-LSTM with combined kernels from multiple branches for IMDb review sentiment analysis". In: Oct. 2017, pp. 540–546. doi: 10.1109/UEMCON.2017.8249013.
Zachary, Wayne W. (1977). "An information flow model for conflict and fission in small groups". Journal of Anthropological Research, pp. 452–473.
Zhou, Chunting, Chonglin Sun, Zhiyuan Liu, and Francis C. M. Lau (2015). "A C-LSTM Neural Network for Text Classification". CoRR abs/1511.08630. arXiv: 1511.08630. url: http://arxiv.org/abs/1511.08630.
Zhu, Erkang, Fatemeh Nargesian, Ken Q. Pu, and Renée J. Miller (2016). "LSH Ensemble: Internet-Scale Domain Search". Proc. VLDB Endow. 9.12 (Aug. 2016), pp. 1185–1196. issn: 2150-8097. doi: 10.14778/2994509.2994534. url: https://doi.org/10.14778/2994509.2994534.

Zhu, Zhaocheng, Shizhen Xu, Meng Qu, and Jian Tang (2019). "GraphVite: A High-Performance CPU-GPU Hybrid System for Node Embedding". CoRR abs/1903.00757. arXiv: 1903.00757. url: http://arxiv.org/abs/1903.00757.
