Barack's Wife Hillary: Using Knowledge Graphs for Fact-Aware Language Modeling

Barack’s Wife Hillary: Using Knowledge Graphs for Fact-Aware Language Modeling Robert L. Logan IV∗ Nelson F. Liu†§ Matthew E. Peters§ Matt Gardner§ Sameer Singh∗ ∗ University of California, Irvine, CA, USA † University of Washington, Seattle, WA, USA § Allen Institute for Artificial Intelligence, Seattle, WA, USA {rlogan, sameer}@uci.edu, {mattg, matthewp}@allenai.org, nfl[email protected] Abstract [Super Mario Land] is a [1989] [side-scrolling] [platform video game] developed and published Modeling human language requires the ability by [Nintendo] as a [launch title] for their [Game to not only generate fluent text but also en- Boy] [handheld game console]. code factual knowledge. However, traditional publisher Super Mario Land Nintendo launch game language models are only capable of remem- Q647249 Q8093 Q1425505 Publication platform manufacturer bering facts seen at training time, and often genre have difficulty recalling them. To address this, Date 21 April 1989 platform game Game Boy we introduce the knowledge graph language Date Q828322 Q186437 model (KGLM), a neural language model with instance of mechanisms for selecting and copying facts side-scrolling video game from a knowledge graph that are relevant to Q2281714 handheld game console Q941818 the context. These mechanisms enable the model to render information it has never seen Figure 1: Linked WikiText-2 Example. A localized before, as well as generate out-of-vocabulary knowledge graph containing facts that are (possibly) tokens. We also introduce the Linked WikiText- conveyed in the sentence above. The graph is built by it- 2 dataset,1 a corpus of annotated text aligned to eratively linking each detected entity to Wikidata, then the Wikidata knowledge graph whose contents adding any relations to previously mentioned entities. (roughly) match the popular WikiText-2 bench- Note that not all entities are connected, potentially due mark (Merity et al., 2017). In experiments, we to missing relations in Wikidata. demonstrate that the KGLM achieves significantly better performance than a strong base- line language model. We additionally compare different language models’ ability to com- training. For instance, when conditioned on the text plete sentences requiring factual knowledge, at the top of Figure 1, an AWD-LSTM language and show that the KGLM outperforms even model (Merity et al., 2018) trained on Wikitext-2 very large language models in generating facts. assigns higher probability to the word “PlaySta- tion” than “Game Boy”, even though this sentence 1 Introduction appears verbatim in the training data. This is not surprising—existing models represent the distribu- For language models to generate plausible sen- tion over the entire vocabulary directly, whether tences, they must be both syntactically coherent as they are common words, references to real world well as consistent with the world they describe. Al- entities, or factual information like dates and num- though language models are quite skilled at generat- bers. As a result, language models are unable to ing grammatical sentences, and previous work has generate factually correct sentences, do not gen- shown that language models also possess some de- eralize to rare/unseen entities, and often omit rare gree of common-sense reasoning and basic knowl- tokens from the vocabulary (instead generating UN- edge (Vinyals and Le, 2015; Serban et al., 2016; KNOWN tokens). Trinh and Le, 2019), their ability to generate fac- We introduce the knowledge graph language tually correct text is quite limited. The clearest model (KGLM), a neural language model with limitation of existing language models is that they, mechanisms for selecting and copying information at best, can only memorize facts observed during from an external knowledge graph. The KGLM 1https://rloganiv.github.io/linked-wikitext-2 maintains a dynamically growing local knowledge graph, a subset of the knowledge graph that con- 2 Knowledge Graph Language Model tains entities that have already been mentioned in In this section we introduce a language model that the text, and their related entities. When generating is conditioned on an external, structured knowledge entity tokens, the model either decides to render source, which it uses to generate factual text. a new entity that is absent from the local graph, thereby growing the local knowledge graph, or to 2.1 Problem Setup and Notation render a fact from the local graph. When render- A language model defines a probability distribution ing, the model combines the standard vocabulary over each token within a sequence, conditioned on with tokens available in the knowledge graph, thus the sequence of tokens observed so far. We denote supporting numbers, dates, and other rare tokens. the random variable representing the next token as Figure 1 illustrates how the KGLM works. Ini- xt and the sequence of the tokens before t as x<t, tially, the graph is empty and the model uses the i.e. language models compute p(xt|x<t). RNN lan- entity Super Mario Land to render the first three guage models (Mikolov et al., 2010) parameterize tokens, thus adding it and its relations to the local this distribution using a recurrent structure: knowledge graph. After generating the next two to- W h b kens (“is”, “a”) using the standard language model, p(xt|x<t) = softmax( h t + ), (1) the model selects Super Mario Land as the parent ht = RNN(ht−1, xt−1). entity, Publication Date as the relation to render, and copies one of the tokens of the date entity as We use LSTMs (Hochreiter and Schmidhuber, the token (“1989” in this case). 1997) as the recurrent module in this paper. A knowledge graph (KG) is a directed, labeled To facilitate research on knowledge graph-based graph consisting of entities E as nodes, with edges language modeling, we collect the distantly su- defined over a set of relations R, i.e. KG = pervised Linked WikiText-2 dataset. The underly- {(p, r, e) | p ∈ E, r ∈ R, e ∈ E}, where p is a par- ing text closely matches WikiText-2 (Merity et al., ent entity with relation r to another entity e. Prac- 2017), a popular benchmark for language model- tical KGs have other aspects that make this for- ing, allowing comparisons against existing mod- mulation somewhat inexact: some relations are to els. The tokens in the text are linked to entities in literal values, such as numbers and dates, facts Wikidata (Vrandeciˇ c´ and Krötzsch, 2014) using a may be expressed as properties on relations, and combination of human-provided links and off-the- entities have aliases as the set of strings that can shelf linking and coreference models. We also use refer to the entity. We also define a local knowl- relations between these entities in Wikidata to con- edge graph for a subset of entities E<t as KG<t = struct plausible reasons for why an entity may have {(p, r, e) | p ∈ E<t, r ∈ R, e ∈ E}, i.e. contains been mentioned: it could either be related to an entities E<t and all facts they participate in. entity that is already mentioned (including itself) or a brand new, unrelated entity for the document. 2.2 Generative KG Language Model We train and evaluate the KGLM on Linked The primary goal of the knowledge graph lan- WikiText-2. When compared against AWD-LSTM, guage model (KGLM) is to enable a neural lan- a recent and performant language model, KGLM guage model to generate entities and facts from obtains not only a lower overall perplexity, but also a knowledge graph. To encourage the model to a substantially lower unknown-penalized perplex- generate facts that have appeared in the context ity (Ueberla, 1994; Ahn et al., 2016), a metric that already, KGLM will maintain a local knowledge allows fair comparisons between models that accu- graph containing all facts involving entities that rately model rare tokens and ones that predict them have appeared in the context. As the model decides to be unknown. We also compare factual com- to refer to entities that have not been referred to pletion capabilities of these models, where they yet, it will grow the local knowledge graph with predict the next word after a factual sentence (e.g., additional entities and facts to reflect the new entity. “Barack is married to ”) and show that KGLM Formally, we will compute p(xt, Et|x<t, E<t) is significantly more accurate. Lastly, we show that where x<t is the sequence of observed tokens, E<t the model is able to generate accurate facts for rare is the set of entities mentioned in x<t, and KG<t is entities, and can be controlled via modifications the local knowledge graph determined by E<t, as the knowledge graph. described above. The generative process is: knowledge graph, denoted by ve for entity e and graph, the text is made up of disjoint sentences that vr for relation r. To select et from all entities in do not provide sufficient context to train a pow- case tt = new, we use: erful language model. Our goals are much more aligned to the data-to-text task (Ahn et al., 2016; v h h p(et) = softmax( e · ( t,p + t,r)) Lebret et al., 2016; Wiseman et al., 2017; Yang et al., 2017; Gardent et al., 2017; Ferreira et al., over all e ∈ E. The reason we add h and h is t,p t,r 2018), where a small table-sized KB is provided to to mimic the structure of TransE, which we use to generate a short piece of text; we are interested in obtain entity and relation embeddings. Details on language models that dynamically decide the facts TransE will be provided in Section 4. For mention to incorporate from the knowledge graph, guided of a related entity, t = related, we pick a parent t by the discourse.

Load more