The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)
Latent Relation Language Models
Hiroaki Hayashi,1∗ Zecong Hu,1∗ Chenyan Xiong,2 Graham Neubig1 1Carnegie Mellon University, 2Microsoft Research AI {hiroakih, zeconghu, gneubig}@cs.cmu.edu, [email protected]
Abstract

In this paper, we propose Latent Relation Language Models (LRLMs), a class of language models that parameterizes the joint distribution over the words in a document and the entities that occur therein via knowledge graph relations. This model has a number of attractive properties: it not only improves language modeling performance, but is also able to annotate the posterior probability of entity spans for a given text. Experiments demonstrate improvements over both word-based language models and a previous approach that incorporates knowledge graph information. Qualitative analysis further demonstrates the proposed model's ability to learn to predict appropriate relations in context.†

Figure 1: Overview of our task of language modeling conditioned on a knowledge graph. For a given topic, we want to learn a language model that leverages the knowledge graph through relations when modeling the text. (The figure pairs the topic "Barack Obama" with an article excerpt, "Barack Hussein Obama II (...; born August 4, 1961) is an American[nationality] attorney[occupation] and politician[occupation] who served as the 44th president of the United States[position held] from 2009 to 2017. ...", and with knowledge-graph objects such as "politician", "lawyer", "American", and "president of the United States".)

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
∗Equal Contribution.
†Code & Data: https://github.com/neulab/lrlm.

1 Introduction

Language models (LMs) calculate the probability P(X) of textual data X, and are a core model class of interest to NLP. LMs are used as testbeds for evaluation of generative models of text, and have applications such as rescoring of upstream language generation inputs (Sundermeyer, Schlüter, and Ney 2012), grammatical error correction (Felice et al. 2014), or pre-training of sentence representations (Peters et al. 2018). Neural networks are used to model this probability in state-of-the-art LMs (Bengio et al. 2003; Mikolov et al. 2010; Merity et al. 2017).

Textual data X comprise a wide variety of words to be modeled, from closed-class function words, to common nouns or verbs, to named entities and numbers (Zipf 1949). Notably, words on the rarer end of this spectrum are often more semantically or topically important, as evidenced by the success of heuristics such as TF-IDF (Salton and McGill 1986), which up-weight words with low frequency. Previous work has noted that while neural LMs greatly outperform alternatives such as n-gram models on frequent words, they often under-perform on these rare words due to their limited parameter budget, which puts them at a disadvantage compared to non-parametric models like count-based n-grams (Neubig and Dyer 2016).

Methods to mitigate this bottleneck have been proposed in the context of conditional LMs, which instead model the conditional probability P(X | C), where C is some context given to the model. For instance, in sequence transduction tasks, there are mechanisms to copy from the source sequence (Gu et al. 2016) or use word or phrase dictionaries (Arthur, Neubig, and Nakamura 2016) to improve modeling of low-frequency words. Perhaps more interesting from an LM perspective are methods conditioned on information from structured knowledge sources such as knowledge graphs (Ahn et al. 2016; Parvez et al. 2018; Logan et al. 2019), tables (Lebret, Grangier, and Auli 2016), or grammars (Konstas and Lapata 2013). These methods are analogous to human language production, where the underlying knowledge is converted into linguistic realizations.

In this work, we propose Latent Relation Language Models (LRLMs), a class of conditional LMs that take relational information between entities in a knowledge graph as context. Specifically, our model is able to generate either words from a fixed word vocabulary, or a span of words defined according to their relations with a topic entity of interest, as shown in Figure 1. The choice of which method of generation to use is defined as a latent variable sequence Z. We
use Latent Predictor Networks (LPNs; Ling et al. (2016)) to jointly learn P(X, Z | C), thus tractably marginalizing over all the possible spans. Compared to other word-by-word generation methods that condition LMs on knowledge graphs (KGs; Ahn et al. (2016); Wang et al. (2018)), the span-based generation from the KGs alleviates problems of malformed or incomplete mentions. Moreover, the posterior probabilities of Z can be considered as entity links, which are of interest in their own right in the information extraction field (Ceccarelli et al. 2013; Ganea and Hofmann 2017).

We apply the model on articles from Wikipedia (X), with the help of relational information (C) such as Wikidata (Vrandečić and Krötzsch 2014) or Freebase (Bollacker et al. 2008) regarding each article topic. Empirical results on open-vocabulary language modeling show that the proposed model outperforms previous approaches on the same task, demonstrating that LRLMs provide an effective way to condition on this context. We also demonstrate the merit of explicitly modeling latent relations by examining the posterior probabilities over the chosen relations Z, which are in concert with human intuitions about how relations are being expressed in the text.

2 Language Modeling Conditioned on Structured Knowledge

In this section, we define the task of open-vocabulary language modeling conditioned on structured data.

Task Definition

Knowledge graphs (KGs) can be represented as a directed labeled graph G = (V, E) consisting of a set of nodes V = {v_1, ..., v_{|V|}} and a set of relation edges E = {e_i : ⟨s_i, ω_i, o_i⟩ | s_i, o_i ∈ V, ω_i ∈ R}. Relation e_i contains s_i, ω_i, and o_i as the subject, relation type, and object. R is the set of all relation types. Each node v_i ∈ V represents either an entity or an attribute,¹ and is associated with a set of surface forms (also called aliases) A(v_i) = {a_{i,1}, ..., a_{i,|A(v_i)|}} that can be used to refer to v_i. For instance in Figure 1, the subject "Barack Obama" is connected to both "politician" and "lawyer" with the relation ⟨occupation⟩, and the object entity "politician" has "political figure" and "polit." as additional aliases. Notably, surface forms of many objects in the KG can be multiple words, and thus it is necessary to have machinery to deal with this fact.

Given this KG, we further define a topic entity s about which we would like to generate a piece of text. Our conditional language modeling problem is then defined as the problem of modeling the conditional probability of text X: P(X | G, s). In particular, we consider a subgraph G′ = (V′, E′) of the original KG by extracting nodes and edges directly related to the topic entity s:

V′ : {s} ∪ {o_i | ⟨s, ∗, o_i⟩ ∈ E},
E′ : {e_i : ⟨s, ω_i, o_i⟩ | ⟨s, ω_i, o_i⟩ ∈ E ∧ o_i ∈ V′}.

¹A value specified with a relation from an entity (e.g., dates).

We consider an open-vocabulary setting where all word types within X are incorporated. Perplexity under this setting provides a more realistic measure than under the closed-vocabulary setting by taking into account words that rarely or never appear in the training set, which, as previously noted, are particularly important for conveying the main content of the text.

Why Condition on Knowledge Graphs?

KGs provide two important benefits for neural LMs. First, their high coverage of rarer words, due to entities often being infrequent, addresses the lack of textual supervision for predicting these words. More importantly, KGs have the potential to help LMs generate factually consistent text by providing consistent associations between entities. Normal LMs would have to rely on supervision purely from textual data, which may not provide a learning signal strong enough to accurately generate these facts. For instance, results from Radford et al. (2019) show that even with a very large model trained on massive amounts of data, samples can be factually incorrect, despite being fluent and coherent.

3 Latent Relation Language Models

In this section, we describe our proposed framework of Latent Relation Language Models (LRLMs).

Definition

Knowledge from the KG subgraph G′ can be incorporated into generation by copying aliases from related entities into the generated text. For instance in Figure 2, to generate Obama's birth date, the model can of course pick words from its vocabulary. But it is more straightforward to copy from the ⟨date of birth⟩ relation of the topic entity "Barack Obama", which gives the correct birth date.

However, it is insufficient to model probabilities for such choices conditioning only on G′ and s, because it is unknown to us which text spans are matched to which relations. Naïve solutions like simple text matching algorithms would yield many false positives. For example, "New York City" has an alias "New York", which matches "New York" (state) and parts of "New York City Council".

To circumvent this lack of relation annotation, we treat relations corresponding to such text spans as latent variables. Formally, let X = {x_i}_{i=1}^N be the sequence of N tokens, and Z = {(σ_t, π_t, ρ_t)}_{t=1}^T a sequence of latent variable triplets describing text span matches:
• The span variable σ_t := (ℓ_t, r_t) specifies a token subsequence x_{σ_t} = {x_i}_{i=ℓ_t}^{r_t}.
• The source variable π_t ∈ {REL, WORD} denotes the generation source of the span x_{σ_t}.
• The relation variable ρ_t := (e_t, a_t) describes the matching relation and surface form of the span x_{σ_t}, and is only used when π_t = REL.

For Z to be a valid sequence of latent variables, the following conditions must be satisfied:
• Span variables {σ_t}_{t=1}^T form a segmentation of X, i.e., ℓ_t = r_{t−1} + 1 for t = 2, ..., T. This also implies T ≤ N.
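As a concrete sketch of the topic-entity subgraph G′ = (V′, E′) defined above: the data structure, function name, and toy entities below are illustrative assumptions for exposition, not the authors' released implementation.

```python
from collections import namedtuple

# An edge <s_i, omega_i, o_i>: subject, relation type, object.
Edge = namedtuple("Edge", ["subj", "rel_type", "obj"])

def extract_subgraph(edges, topic):
    """Build G' = (V', E'): keep edges whose subject is the topic
    entity s, and collect s together with the objects of those edges."""
    sub_edges = [e for e in edges if e.subj == topic]
    nodes = {topic} | {e.obj for e in sub_edges}
    return nodes, sub_edges

kg = [
    Edge("Barack Obama", "occupation", "politician"),
    Edge("Barack Obama", "occupation", "lawyer"),
    Edge("politician", "subclass of", "person"),  # not about the topic; dropped
]
nodes, sub_edges = extract_subgraph(kg, "Barack Obama")
```

On the toy KG above, the extracted V′ contains the topic entity plus "politician" and "lawyer", and E′ keeps only the two ⟨occupation⟩ edges.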
Figure 2: While generating, our model switches between the two sources: "Relation" and "Word". Circles represent hidden states up to each token, and edges represent possible span matches. Here we show one valid derivation with solid lines, and other options as dashed lines. We also show an "annotation" of the generated tokens by the spans and sources we choose. (The figure depicts the generated tokens "Barack Hussein Obama II ... born August 4 , 1961" as x_1 ... x_12, with chosen and possible spans σ_1, σ_3, σ_4 marked over them.)
• If π_t = WORD, then ℓ_t = r_t.
• If π_t = REL, then ρ_t = (e_t, a_t) where e_t = ⟨s, ω_t, o_t⟩ should satisfy e_t ∈ E′, a_t ∈ A(o_t), and x_{σ_t} = a_t, i.e., ρ_t must correspond to a valid surface form of an object that is related to the topic entity s and matches the text span.

Let 𝒵 be the set of all valid latent variable sequences. We can now model the probability by marginalizing over 𝒵:

P(X | G′, s) = Σ_{Z ∈ 𝒵} P(X, Z | G′, s).  (1)

For sake of brevity, unless noted otherwise, we drop G′ and s from the conditions in the following sections.

Training

Given the latent variable sequence Z, we follow Ling et al. (2016) in factoring the joint probability:

P(X, Z) = Π_{t=1}^{T} P(σ_t, π_t, ρ_t, x_{σ_t} | x_{<ℓ_t})
        = Π_{t=1}^{T} P(π_t | x_{<ℓ_t}) P(σ_t, x_{σ_t}, ρ_t | π_t, x_{<ℓ_t}).

The marginal in Eq. (1) can then be computed with dynamic programming: the forward probability α_i accumulates α_{ℓ−1} P(σ, π, ρ, x_σ | x_{<ℓ}) over all triplets (σ : (ℓ, r), π, ρ) such that r = i, i.e., all valid spans ending at the i-th token. The marginal probability we optimize for is then α_N. The backward probability β_i, which is required for gradient computation, can be similarly calculated.

Parameterization

We use neural networks to parameterize all probability distributions mentioned above. Decisions for time step t are based on a D-dimensional hidden state h_{ℓ_t}. This hidden state can be generated by any neural sequence model, and we experiment with multiple models to demonstrate the generality of our approach.

Source Selection  Source selection is done using a simple linear model followed by a softmax function applied to the latest word-level hidden state h_{ℓ_t}:

P(π_t | x_{<ℓ_t}) = softmax(W_π h_{ℓ_t} + b_π),

where W_π ∈ R^{2×D}, b_π ∈ R^2 are trainable parameters.

Word Generation  Like conventional word-level neural language models, we have the option to generate the next token from a fixed vocabulary. This option is used to generate any word that isn't part of an object entity participating in a relation. The probability is:

P(x_t | x_{<t}) = softmax(Linear_w(h_t)),

where Linear_w(h) is a linear transform with a bottleneck. Out-of-vocabulary words are spelled out with a character-level LM, which assigns probability P(c_1 ... c_{|c|}; θ_char) to a word's character sequence, where θ_char are the parameters of the character LM. We pre-train this model on the set of all unique words in the training set and fix its parameters while training LRLM.
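The forward computation over latent segmentations can be sketched as follows. This is a minimal illustration assuming span log-probabilities have already been scored; in the full model these scores come from the neural components, and the function name and interface here are hypothetical.

```python
import math

def forward_marginal(n_tokens, span_logprobs):
    """Compute log alpha_N, the log marginal probability of the token
    sequence: alpha_i sums alpha_{l-1} * P(span) over all spans (l, r)
    with r == i (1-indexed, inclusive).

    span_logprobs: dict mapping (l, r) -> log P(span, source, relation).
    """
    NEG_INF = float("-inf")
    alpha = [NEG_INF] * (n_tokens + 1)
    alpha[0] = 0.0  # log-probability of the empty prefix
    for i in range(1, n_tokens + 1):
        # all valid spans ending at the i-th token, with reachable left ends
        terms = [alpha[l - 1] + lp
                 for (l, r), lp in span_logprobs.items()
                 if r == i and alpha[l - 1] > NEG_INF]
        if terms:
            m = max(terms)  # log-sum-exp for numerical stability
            alpha[i] = m + math.log(sum(math.exp(t - m) for t in terms))
    return alpha[n_tokens]
```

For example, with three tokens, single-word spans of probability 0.5 each, and one relation span (2, 3) of probability 0.25, the two derivations (word-word-word and word-relation) each contribute 0.125, giving a marginal of 0.25.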
Algorithm 1 Generative Process of LRLM
Input: previous span σ_{t−1} = (ℓ_{t−1}, r_{t−1}), previously generated tokens x
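Algorithm 1 is truncated in this copy, so the following is only a hedged sketch of the generative loop implied by the model definition: alternate between sampling a source π_t and emitting either a single vocabulary word (WORD) or a full copied alias span (REL). All names are hypothetical, and the uniform sampling calls stand in for the learned distributions.

```python
import random

def generate(vocab, rel_aliases, max_len=20, seed=0):
    """Toy generative loop. rel_aliases is a list of alias token lists
    for objects related to the topic entity (multi-word spans allowed)."""
    rng = random.Random(seed)
    tokens = []
    while len(tokens) < max_len:
        source = rng.choice(["WORD", "REL"])        # stands in for P(pi_t | x_<t)
        if source == "WORD" or not rel_aliases:
            tokens.append(rng.choice(vocab))         # stands in for P(x_t | x_<t)
        else:
            tokens.extend(rng.choice(rel_aliases))   # copy a whole alias span
    return tokens[:max_len]

out = generate(["the", "was", "born"],
               [["Barack", "Obama"], ["August", "4", ",", "1961"]])
```

Note that copying emits the entire alias at once, which is what lets the model avoid the malformed or incomplete entity mentions that word-by-word KG-conditioned generation can produce.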