
AFET: Automatic Fine-Grained Entity Typing by Hierarchical Partial-Label Embedding

Xiang Ren+, Wenqi He+, Meng Qu+, Lifu Huang#, Heng Ji#, Jiawei Han+
+ University of Illinois at Urbana-Champaign, Urbana, IL
# Rensselaer Polytechnic Institute, Troy, NY

Context-Dependent Entity Typing

• Given token spans of entity mentions in text, the task is to classify them into types of interest

  [Barack Obama] (PERSON) arrived this afternoon in [Washington, D.C.] (LOCATION). [President Obama]'s wife [Michelle] accompanied him.

  [TNF alpha] (PROTEIN) is produced chiefly by activated [macrophages] (CELL).

Entity Types: From Coarse Labels to Fine-Grained Labels

ID | Sentence
S2 | The fourth movie in the Predator series, entitled 'The Predator', may see the return of action-movie star Arnold Schwarzenegger to the franchise.

[Figure: a set of a few common types (Person, Location, Organization, ...) versus a type hierarchy with 100+ types (root → person, location, organization, product, ...; person → politician, artist, businessman, ...; artist → author, actor, singer, ...)]

• Fine-grained entity type features help deeper NLP tasks
  – Relation extraction: a 93% improvement on NYT news articles, reported by (Ling and Weld, 2012)
  – Coreference resolution
• Assists downstream applications
  – Question answering systems
  – Knowledge base completion

*Ling and Weld, "Fine-Grained Entity Recognition", AAAI 2012

How to Get Labeled Data?

• Human annotation (for 100+ entity types)
  – Cost cannot scale up!
  – Error-prone
• Crowdsourcing? Hard for (non-expert) annotators to distinguish over 100 types consistently
• Distant supervision* — heuristically labels a large corpus with KB types

ID | Sentence
S1 | Governor Arnold Schwarzenegger gives a speech at Mission Serve's service project on Veterans Day 2010.
S2 | The fourth movie in the Predator series, entitled 'The Predator', may see the return of action-movie star Arnold Schwarzenegger to the franchise.
S3 | Schwarzenegger's first property investment was a block of six units, for which he scraped together $US27,000.

[Figure: target type hierarchy, with the candidate sub-tree under person highlighted]

Noisy training examples for entity Arnold Schwarzenegger (the candidate type set is the sub-tree of its types in the knowledge base):
  – S1: Mention "Arnold Schwarzenegger"; Context: S1; Candidate type set: {person, politician, artist, actor, author, businessman, athlete}
  – S2: Mention "Arnold Schwarzenegger"; Context: S2; Candidate type set: {person, politician, artist, actor, author, businessman, athlete}
  – S3: Mention "Schwarzenegger"; Context: S3; Candidate type set: {person, politician, artist, actor, author, businessman, athlete}

*Mintz et al., "Distant supervision for relation extraction without labeled data", ACL 2009
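To make the distant-supervision step concrete, here is a minimal sketch of how candidate type sets can be generated from KB facts; the toy KB, alias table, and helper names (`kb_types`, `alias_to_entity`, `label_mention`) are illustrative assumptions, not the authors' pipeline.

```python
# Minimal sketch of distant supervision for entity typing.
# All data and helper names here are hypothetical illustrations.

# Entity-type facts from a knowledge base: entity -> set of KB types.
kb_types = {
    "Arnold_Schwarzenegger": {"person", "politician", "artist", "actor",
                              "author", "businessman", "athlete"},
}

# Toy "entity linker": map a mention string to a KB entity (if any).
alias_to_entity = {
    "Arnold Schwarzenegger": "Arnold_Schwarzenegger",
    "Schwarzenegger": "Arnold_Schwarzenegger",
}

def label_mention(mention, sentence_id):
    """Return a distantly supervised training example (or None if unlinkable)."""
    entity = alias_to_entity.get(mention)
    if entity is None:
        return None  # unlinkable mention: left for the typing model at test time
    # Every KB type of the entity becomes a candidate label for the mention,
    # regardless of the sentence context -- this is the source of label noise.
    return {"mention": mention, "context": sentence_id,
            "candidate_types": kb_types[entity]}

print(label_mention("Schwarzenegger", "S3"))
```

Note how the same candidate set is produced for every context, which is exactly the noise problem described next.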

Automatic Fine-Grained Entity Typing

• Problem: how to learn an effective model that predicts a single type-path for each unlinkable entity mention, using the automatically labeled training corpus

[Pipeline figure: text corpus → NER + distant supervision (with knowledge bases) → automatically labeled corpus → typing model → predictions for unlinkable mentions]

Challenges

• Noisy type labels
  – Context-agnostic type assignment to entity mentions: distant supervision gives every mention of an entity the same candidate type set, regardless of context (S1 suggests politician, S2 actor, S3 businessman, yet all three mentions of Arnold Schwarzenegger receive {person, politician, artist, actor, author, businessman, athlete})
  – How severe? A noisy mention is an entity mention that is assigned multiple sibling types in the given type hierarchy

  – What about just removing the noisy mentions*? A significant loss (>20%) of training instances!

*Gillick et al., "Context-Dependent Fine-Grained Entity Type Tagging", 2014

• Type correlation: from independent to correlated types
  [Figure: in the embedding space, actor should be more similar to singer and less similar to politician]
• How to deal with infrequent (fine-grained) entity types?

Our Solution: "AFET"

• Jointly embed entity mentions and type labels into a low-dimensional vector space
• Design a noise-robust loss function to model "false positive" labels in the training data
• Enforce adaptive margins on entity mentions, to encode type correlation
• Contributions
  – Minimally supervised: distant supervision only
  – Effective: consistent gains over the state of the art
  – Efficient: scalable to large training corpora; fast inference

AFET: Framework Overview

ID | Sentence
S1 | Governor [Arnold Schwarzenegger] gives a speech at Mission Serve's service project on Veterans Day 2010.

1. Extract text features*
2. Partition the training set into clean and noisy mentions
   – "S1_Arnold Schwarzenegger" — noisy mention; candidate types: {person, politician, artist, actor, author, businessman, athlete}
   – "S4_Ted Cruz" — clean mention; candidate types: {person, politician}
3. Jointly embed text features and type labels
4. Type inference

*Yogatama et al., "Embedding methods for fine grained entity type classification", ACL 2015.
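The partition in step 2 can be computed directly from each mention's candidate type set: a mention is clean when its candidates lie on a single type-path in the hierarchy, and noisy otherwise. A minimal sketch under a toy hierarchy; the parent map and helper names are my assumptions.

```python
# Sketch of the clean/noisy training-set partition.
# A mention is "clean" if its candidate types form a single type-path
# in the hierarchy, and "noisy" otherwise (e.g., it has sibling types).
# The toy hierarchy and helper names are illustrative assumptions.

PARENT = {  # child -> parent in the target type hierarchy
    "person": "root", "politician": "person", "artist": "person",
    "businessman": "person", "athlete": "person",
    "actor": "artist", "author": "artist", "singer": "artist",
}

def path_to_root(t):
    path = [t]
    while path[-1] != "root":
        path.append(PARENT[path[-1]])
    return path

def is_clean(candidate_types):
    """True iff all candidate types lie on one root-to-node path."""
    deepest = max(candidate_types, key=lambda t: len(path_to_root(t)))
    return set(candidate_types) <= set(path_to_root(deepest))

print(is_clean({"person", "politician"}))           # True  -> clean mention
print(is_clean({"person", "politician", "actor"}))  # False -> noisy mention
```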

(Paper excerpt, Sections 2-3.) In specific domains (e.g., reviews, tweets) where public datasets are unavailable, one can utilize distant supervision (Ling and Weld, 2012) to automatically label the corpus: an entity linker (Shen et al., 2014) detects mentions m_i (in set M) and maps them to one or more entities e_i in E; the types of e_i in the KB are then associated with m_i to form its type set Y_i = {y | (e_i, y) ∈ T, y ∈ Y}.

Problem Description. Since Y_i is annotated for entity e_i, it includes all possible types of e_i and thus may contain types that are irrelevant to m_i's specific context c_i. Ideally, the type labels for m_i should form a type-path (not required to end at a leaf) in Y (Yogatama et al., 2015; Gillick et al., 2014; Yosef et al., 2012), which serves as a context-dependent type annotation for m_i. However, as discussed in (Gillick et al., 2014), Y_i may contain type-paths that are irrelevant to m_i in c_i; and even when Y_i is already a type-path, it may be overly specific for c_i. We denote the true type-path for mention m_i as Y_i*. This work focuses on estimating Y_i* from Y_i based on the mention m_i as well as its context c_i, where the candidate type set Y_i may contain (1) types that are irrelevant to c_i, and (2) types that are overly specific for c_i. Formally:

Definition 1 (Problem Definition). Given a training corpus D = {(m_i, c_i, Y_i)}_{i=1}^N, a KB with type schema Y_KB and entity-type facts T = {(e, y)}, and a target type hierarchy Y ⊆ Y_KB, the task of automatic fine-grained entity typing is to estimate a single type-path Y_i* ⊆ Y_i for each mention m_i ∈ M, based on m_i itself and its context c_i.

3. Hierarchical Partial-Label Embedding

This section follows the notations in Table 3 to formulate an optimization problem for joint embedding of entity mentions and type labels into a low-dimensional vector space.

Table 3: Notations.
D — text corpus
M = {m_i}_{i=1}^N — entity mentions in D (size N)
M_c, M_n — clean and noisy mentions in M
Y = {y_k}_{k=1}^K — target entity types (size K)
Y_i — candidate types of m_i
Ȳ_i = Y \ Y_i — non-candidate types of m_i
F = {f_j}_{j=1}^M — text features in D (size M)
m_i ∈ R^M — feature vector for m_i
y_k ∈ R^K — type label vector for y_k
U ∈ R^{d×M} — mapping for M to the d-dim space; u_j ∈ R^d is the embedding of f_j (the j-th column of U)
V ∈ R^{d×K} — mapping for Y to the d-dim space; v_k ∈ R^d is the embedding of y_k (the k-th column of V)

The Joint Mention-Type Model. We learn mappings into a low-dimensional vector space in which both entity mentions and type labels are represented, and in which two objects are embedded close to each other if and only if they share similar types. The mapping functions for mentions and for type labels differ, since the two have different representations in the raw feature space, but they are learned jointly by optimizing a single objective that handles the aforementioned challenges.

We start with the representation of entity mentions. To capture the shallow syntax and distributional semantics of a mention m_i ∈ M, we extract various features from both m_i itself (e.g., head token) and its context c_i (e.g., bigrams). Table 2 lists the text features used in this work, which are similar to those in (Yogatama et al., 2015; Ling and Weld, 2012). We denote the set of M unique features extracted from D as F = {f_j}_{j=1}^M. Each mention m_i ∈ M is represented by an M-dimensional feature vector m_i ∈ R_+^M, where m_{i,j} is the number of occurrences of feature f_j for m_i. Each type label y_k ∈ Y is represented by a K-dimensional indicator vector y_k ∈ {0,1}^K with y_{k,k} = 1 and zeros elsewhere.

We aim to learn a mapping from the mention feature space to a low-dimensional vector space, φ(m_i): R^M → R^d, and a mapping from the type label space to the same space, ψ(y_k): R^K → R^d. In this work we adopt linear maps:

  φ(m_i) = U m_i;  ψ(y_k) = V y_k,   (1)

where U ∈ R^{d×M} and V ∈ R^{d×K} are the projection matrices for mentions and type labels, respectively.

Mapping Mentions & Types into A Joint Space

[Diagram: the binary feature vector for "S4_Ted Cruz" (dim = M) times the feature matrix U (d-by-M) gives a point in the d-dim vector space; the indicator vector for politician (dim = K) times the type matrix V (d-by-K) gives a point in the same space]

Similarity is measured by the dot product:

  f_k(m_i) = m_i^T U^T V y_k
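To make Eq. (1) and the dot-product score concrete, a tiny NumPy sketch; the dimensions and random projection matrices are placeholders (in AFET, U and V are learned).

```python
import numpy as np

# Toy dimensions: M text features, K types, d-dimensional joint space.
M, K, d = 1000, 113, 50
rng = np.random.default_rng(0)

U = rng.normal(scale=0.1, size=(d, M))  # feature projection matrix (learned)
V = rng.normal(scale=0.1, size=(d, K))  # type projection matrix (learned)

m_i = np.zeros(M)          # sparse count vector of a mention's features
m_i[[3, 57, 402]] = 1.0    # e.g., head token, context bigram, word shape

def score(m, k):
    """f_k(m) = m^T U^T V y_k; y_k is an indicator, so V y_k = V[:, k]."""
    return (U @ m) @ V[:, k]

scores = np.array([score(m_i, k) for k in range(K)])
print("best-scoring type id:", scores.argmax())
```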

How to Learn Mapping Matrices?

[Diagram: in f_k(m_i) = m_i^T U^T V y_k, both the feature matrix U (d-by-M) and the type matrix V (d-by-K) are unknown and must be learned]

Feature | Description | Example
Head | Syntactic head token of the mention | "HEAD_Turing"
Token | Tokens in the mention | "Turing", "Machine"
POS | Part-of-speech tag of tokens in the mention | "NN"
Character | All character trigrams in the head of the mention | ":tu", "tur", ..., "ng:"
Word Shape | Word shape of the tokens in the mention | "Aa" for "Turing"
Length | Number of tokens in the mention | "2"
Context | Unigrams/bigrams before and after the mention | "CXT_B:Maserati ,", "CXT_A:and the"
Brown Cluster | Brown cluster ID for the head token (learned using D) | "4_1100", "8_1101111", "12_111011111111"
Dependency | Stanford syntactic dependency (Manning et al., 2014) associated with the head token | "GOV:nn", "GOV:turing"

Table 2: Text features used in this paper. "Turing Machine" is used as an example mention from "The band's former drummer Jerry Fuchs—who was also a member of Maserati, Turing Machine and The Juan MacLean—died after falling down an elevator shaft."
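A minimal sketch of how a few of the Table 2 features could be generated for one mention; the tokenization, head-token choice, and feature-name formats are illustrative guesses, not the authors' exact pipeline.

```python
# Illustrative extraction of a few Table 2 features for one mention.
# Feature-name formats mimic the table's examples but are assumptions.

def extract_features(tokens, start, end):
    """Features for the mention tokens[start:end] within a tokenized sentence."""
    mention = tokens[start:end]
    head = mention[-1]  # crude head choice: last token (a real pipeline uses a parser)
    feats = []
    feats.append(f"HEAD_{head}")
    feats += [f"TOKEN_{t}" for t in mention]
    padded = f":{head.lower()}:"
    feats += [padded[i:i + 3] for i in range(len(padded) - 2)]   # char trigrams
    feats.append("SHAPE_" + "".join("A" if c.isupper() else "a" for c in head))
    feats.append(f"LEN_{len(mention)}")
    if start >= 2:                       # context bigram before the mention
        feats.append("CXT_B:" + " ".join(tokens[start - 2:start]))
    if end + 2 <= len(tokens):           # context bigram after the mention
        feats.append("CXT_A:" + " ".join(tokens[end:end + 2]))
    return feats

sent = "a member of Maserati , Turing Machine and The Juan MacLean".split()
print(extract_features(sent, 5, 7))  # mention: ["Turing", "Machine"]
```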

Modeling Each Clean Mention

• For a clean mention, its "positive types" should be ranked higher than all its "negative types".

We formulate the hierarchy-induced WARP loss as follows:

  ℓ_c(m_i, Y_i, Ȳ_i) = Σ_{y_k ∈ Y_i} Σ_{y_k̄ ∈ Ȳ_i} L( rank_{y_k}(f(m_i)) ) · Θ_{i,k,k̄},   (2)

  Θ_{i,k,k̄} = max{ 0, γ_{k,k̄} − f_k(m_i) + f_k̄(m_i) },   (3)

  rank_{y_k}(f(m_i)) = Σ_{y_k̄ ∈ Ȳ_i} 1( γ_{k,k̄} + f_k̄(m_i) > f_k(m_i) ).   (4)

• The double sum runs over every pair of (positive type, negative type) of m_i.
• Θ_{i,k,k̄} is a margin-based indicator (a hinge): it is nonzero whenever negative type y_k̄ comes within the margin γ_{k,k̄} of positive type y_k.
• L(·) is a rank-based weight: the penalty is larger if the positive labels get ranked lower.
• γ_{k,k̄} is an adaptive margin that encodes type correlation (next slide).
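A NumPy sketch of Eqs. (2)-(4) for a single clean mention. The rank weight L(r) = Σ_{s≤r} 1/s is the usual WARP choice (the slides only call it a "rank-based weight", so the exact form is an assumption), and the constant margin stands in for the adaptive γ_{k,k̄}.

```python
import numpy as np

def warp_weight(r):
    """Rank-based weight L(r) = sum_{s=1..r} 1/s (a common WARP choice;
    the exact form of L is not spelled out in the slides)."""
    return sum(1.0 / s for s in range(1, r + 1))

def clean_mention_loss(scores, pos_idx, margins):
    """Eqs. (2)-(4): scores[k] = f_k(m_i); pos_idx = candidate (positive) types;
    margins[k, kbar] = margin gamma_{k,kbar}."""
    neg_idx = [k for k in range(len(scores)) if k not in pos_idx]
    loss = 0.0
    for k in pos_idx:
        # Eq. (4): margin-violating rank of positive type y_k among negatives.
        rank = sum(margins[k, kb] + scores[kb] > scores[k] for kb in neg_idx)
        for kb in neg_idx:
            # Eq. (3): margin-based hinge indicator.
            theta = max(0.0, margins[k, kb] - scores[k] + scores[kb])
            loss += warp_weight(rank) * theta   # Eq. (2)
    return loss

K = 5
scores = np.array([2.0, 1.5, 0.3, -0.2, 0.9])
margins = np.ones((K, K))                      # constant margin for the demo
print(clean_mention_loss(scores, pos_idx=[0, 1], margins=margins))
```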

Adaptive Margin to Incorporate Type Correlation

Recall the score f_k(m_i) = m_i^T U^T V y_k. The margin γ_{k,k̄} in Eqs. (3)-(4) adapts to how correlated the positive and negative types are:
• More correlated (positive type, negative type) pair → smaller margin
• Less correlated (positive type, negative type) pair → larger margin

Modeling Type-Path Correlation. In the target type hierarchy Y, types closer to each other (i.e., connected by a shorter path) tend to be more related (e.g., actor is more related to artist than to person). In the KB, types assigned to similar sets of entities should be more related to each other than types assigned to quite different entities (Jiang et al., 2015) (e.g., actor is more related to director than to author). We model type correlation based on the following hypothesis.

Hypothesis 1 (Type Correlation). If high correlation exists between two target types, based on either the type hierarchy or the KB, they should be embedded close to each other.

A simple way to measure the correlation between two types is their distance in the target type hierarchy (a tree). Specifically, a link (y_k, y_k') is formed if there exists a path between types y_k and y_k' in Y (paths passing through the root node are excluded), with weight w_{kk'} = 1 / (1 + ρ(y_k, y_k')), where ρ(y_k, y_k') is the length of the shortest path between y_k and y_k' in Y. Although shortest paths are efficient to compute, their accuracy is limited: it is not always true that a type (e.g., athlete) is more related to its parent type (person) than to its sibling types (e.g., coach), or that all sibling types are equally related to each other (e.g., actor is more related to director than to author).

An alternative approach that avoids this accuracy issue is to exploit the entity-type facts T in the KB: the correlation between two target types y_k, y_k' ∈ Y is proportional to the number of entities they share. Let E_k = {e | (e, y_k) ∈ T} denote the set of entities assigned type y_k in the KB. The weight w_{kk'} of link (y_k, y_k') ∈ G_YY is then defined as

  w_{kk'} = ( |E_k ∩ E_k'| / |E_k| + |E_k ∩ E_k'| / |E_k'| ) / 2,   (5)

where |E_k| denotes the size of set E_k.

Knowledge base example. For the context in Sn, "The effective end of Ted Cruz's presidential campaign came on a call ...", entity-type facts such as (Ben Affleck, actor), (Ben Affleck, director), (Woody Allen, actor), (Woody Allen, director), (J. K. Rowling, author), (Kobe Bryant, athlete) yield type-type correlation scores via Eq. (5) (e.g., Corr = (0.6 + 0.6)/2 = 0.6 for one type pair and Corr = (0.25 + 0.55)/2 = 0.4 for another), which translate into adaptive margins when scoring "Sn_Ted Cruz": Margin = 1 / sim(politician, athlete) = 3; Margin = 1 / sim(politician, businessman) = 1.5.

Finally, to account for whole type-path correlations, we propose a path edit distance that measures the semantic difference between two type-paths in the given hierarchy; Algorithm 1 derives it. We compare these three correlation measures in our experiments. Entity-entity facts of various relationships in the KB could also be used to model type correlation, as in KB embedding (Hu et al., 2015; Bordes et al., 2013); we leave this as future work.

Algorithm 1: Path Edit Distance
  Input: path 1 p1[1..m], path 2 p2[1..n], cost matrix W[1..T, 1..T]
  Output: edit distance between p1 and p2
  Edit[1, 1] ← 1(p1[1] ≠ p2[1]) × W[p1[1], p2[1]]
  for i ← 2 to m do
      Edit[i, 1] ← Edit[i−1, 1] + W[p1[i−1], p1[i]]
  for j ← 2 to n do
      Edit[1, j] ← Edit[1, j−1] + W[p2[j−1], p2[j]]
  for i ← 2 to m do
      for j ← 2 to n do
          if p1[i] = p2[j] then
              Edit[i, j] ← Edit[i−1, j−1]
          else
              Edit[i, j] ← min{ Edit[i−1, j−1] + W[p1[i], p2[j]],
                                Edit[i−1, j] + W[p1[i−1], p1[i]],
                                Edit[i, j−1] + W[p2[j−1], p2[j]] }
  return Edit[m, n]
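A sketch of the three correlation measures side by side, under a toy hierarchy and toy KB facts; the data and the uniform cost matrix in the edit-distance demo are assumptions for illustration.

```python
# Three type-correlation measures: shortest path in the hierarchy,
# shared KB entities (Eq. 5), and path edit distance (Algorithm 1).
# The toy hierarchy, KB facts, and uniform costs are assumptions.

PARENT = {"person": "root", "artist": "person", "politician": "person",
          "actor": "artist", "director": "artist", "author": "artist"}
KB_FACTS = {("BenAffleck", "actor"), ("BenAffleck", "director"),
            ("WoodyAllen", "actor"), ("WoodyAllen", "director"),
            ("JKRowling", "author")}

def path_to_root(t):
    path = [t]
    while path[-1] != "root":
        path.append(PARENT[path[-1]])
    return path

def w_shortest_path(a, b):
    """Hierarchy measure: w = 1 / (1 + rho), rho = tree distance."""
    pa, pb = path_to_root(a), path_to_root(b)
    common = len(set(pa) & set(pb))
    rho = (len(pa) - common) + (len(pb) - common)
    return 1.0 / (1.0 + rho)

def w_shared_entities(a, b):
    """KB measure, Eq. (5): average of the two entity-overlap ratios."""
    ea = {e for e, t in KB_FACTS if t == a}
    eb = {e for e, t in KB_FACTS if t == b}
    inter = len(ea & eb)
    return (inter / len(ea) + inter / len(eb)) / 2 if ea and eb else 0.0

def path_edit_distance(p1, p2, W):
    """Algorithm 1: edit distance between two type-paths with cost matrix W."""
    m, n = len(p1), len(p2)
    E = [[0.0] * n for _ in range(m)]
    E[0][0] = 0.0 if p1[0] == p2[0] else W[p1[0]][p2[0]]
    for i in range(1, m):
        E[i][0] = E[i - 1][0] + W[p1[i - 1]][p1[i]]
    for j in range(1, n):
        E[0][j] = E[0][j - 1] + W[p2[j - 1]][p2[j]]
    for i in range(1, m):
        for j in range(1, n):
            if p1[i] == p2[j]:
                E[i][j] = E[i - 1][j - 1]
            else:
                E[i][j] = min(E[i - 1][j - 1] + W[p1[i]][p2[j]],
                              E[i - 1][j] + W[p1[i - 1]][p1[i]],
                              E[i][j - 1] + W[p2[j - 1]][p2[j]])
    return E[m - 1][n - 1]

print(w_shortest_path("actor", "director"))    # siblings: rho = 2 -> 1/3
print(w_shared_entities("actor", "director"))  # (2/2 + 2/2)/2 = 1.0
types = ["root", "person", "artist", "actor", "director", "author", "politician"]
W = {a: {b: (0.0 if a == b else 1.0) for b in types} for a in types}  # uniform costs
p1 = list(reversed(path_to_root("actor")))      # root -> person -> artist -> actor
p2 = list(reversed(path_to_root("politician"))) # root -> person -> politician
print(path_edit_distance(p1, p2, W))
```

Note how the two measures disagree: shortest path treats the siblings actor and director as moderately related, while shared entities rates them as highly related.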

Modeling Each Noisy Mention

ID | Sentence
S1 | Governor Arnold Schwarzenegger gives a speech at Mission Serve's service project ...

Mention: "S1_Arnold Schwarzenegger"; candidate types: {person, politician, athlete, businessman, artist, actor, author}

• For a noisy mention, its "best candidate type" should be ranked higher than all its "negative types" (non-candidate types).
Modeling Noisy Type Labels. To effectively model the noisy mention-type links in subgraph G_MY, we extend the margin-based loss in (Nguyen and Caruana, 2008), used there to learn linear classifiers, to enforce the hypothesis above. The intuition is simple: for mention m_i, the maximum score over its candidate types Y_i should be greater than the maximum score over its non-candidate types Ȳ_i = Y \ Ȳ_i's complement, where scores are measured with the current embedding vectors. We define the partial-label loss ℓ_n for m_i ∈ M_n as

  ℓ_n(m_i, Y_i, Ȳ_i) = L( rank_{y_k*}(f(m_i)) ) · Ω_i,   (6)

  Ω_i = max{ 0, γ_{k*,k̄*} − f_{k*}(m_i) + f_{k̄*}(m_i) },   (7)

where y_k* and y_k̄* are the best candidate type and the best negative type, respectively:

  y_k* = argmax_{y_k ∈ Y_i} f_k(m_i);  y_k̄* = argmax_{y_k̄ ∈ Ȳ_i} f_k̄(m_i).

Minimizing ℓ_n encourages a large margin between the maximum scores max_{y ∈ Y_i} f(m_i, y) and max_{y' ∈ Ȳ_i} f(m_i, y'). This forces m_i to be embedded closer to its most "relevant" type in the noisy candidate set than to any non-candidate type. This contrasts sharply with multi-label learning (Yosef et al., 2012), where a large margin is enforced between all candidate types and non-candidate types without considering noisy types.

Note that one can also consider all the non-candidate types instead of just the "best non-candidate type", by modifying Eq. (6) as follows:

  ℓ_n(m_i, Y_i, Ȳ_i) = Σ_{y_k̄ ∈ Ȳ_i} L( rank_{y_k*}(f(m_i)) ) · Ω_{i,k̄},   (8)

  Ω_{i,k̄} = max{ 0, γ_{k*,k̄} − f_{k*}(m_i) + f_k̄(m_i) }.   (9)
graph graph G intointo a a d-- Specifically, we use vectorsscore ofu(im, iv,ykk) isR definedto rep- as the dot product of their dimensional vector space, following the three pro- Candidate types:score{personscore of (,mpolitician ofi,y(mki),yis, k) definedis defined as the as the dot2 dot product product of oftheir their dimensionaldimensional vector vector space, space, following following the the three three pro- pro- resent mention mi embeddings,and type yi.e.k , s(min,y the)=vTheT u . Joint We define Optimization the posed Problem. hypothesesOur in goal Sec. is??. Intuitively, one can athlete businessmanembeddings,, artist, actor, authori.e.2 M, s(} m ,y )=T v2T uY .i Wek definek i the posedposed hypotheses hypotheses in in Sec. Sec. ??.. Intuitively, Intuitively, one one can can embeddings,d-dimensional embeddingi.e., s(mpartial-labeli,y space,ki)=k respectively.v lossk uik`.for Wei m The definetoas the embed follows. the heterogeneouscollectively graphminimizeG ??into the a objectivesd- of the three sub- i i 2 M collectively partial-labelscorepartial-label of (mi,yk loss) is defined loss`i for`i asform thei m doti productas follows.as follows. of their dimensionalcollectively vector space,graphsminimizeminimize followingGMY the, the theG objectivesMF objectives threeand pro-GYY of of, theas the mentions three three sub- sub-and 2T M2 M M embeddings, i.e., s(mi,yk)=v ui. We define the posed hypothesesgraphsgraphsG inMYG Sec.MYtypes, G,??GMF.MFare Intuitively,and sharedandGGYY acrossYY one,, as ascan them. mentions mentions To achieveand theand goal, `n(mi, k i, i)=L rankyk f(mi) ⌦i; (6) partial-label loss ` for m Yas follows.Y ⇤ collectively· minimize the objectivesY of the three sub- MM `n(m`ni,(mii,, ii)=, i iL)=rankLi ranky y f(mfi()mi) ⌦i;⌦i; (6)(6) typestypes areare sharedwe shared formulate across across a them. joint them. optimization To To achieve achieve problem the the goal, goal, as fol- 2 Mk⇤ k⇤ j ⇣ ⌘k Y Y YY Y ⌦i = max 0, · ¯ · fk (mgraphsi)+f¯G(MYmi,) G, MF(7)Yandlows.GYY, as mentions and k⇤,k⇤ ⇤ k⇤ wewe formulate formulate a a joint joint optimization optimization problem problem as as fol- fol- j j ⇣ ⇣ ⌘k⌘k types are shared across them. To achieve theM goal, `n(mi, i, i)=L rankyk f(mi) ⌦i; (6) ⌦i =⌦i max= max0, k0,,k¯⇤ ¯ fk⇤f(mk ni()+mi)+fk¯ f(¯m(im) i,) (7), (7)Y lows.lows.o Y Y ⇤ k⇤⇤,k⇤ ⇤· ⇤ k⇤ min = + (10) where yk ,y¯ are defined aswe follows. formulate a joint optimizationc problemn as fol- j ⇣ ⇤⌘kk⇤ U, V O O O ⌦ = max 0, ¯ f ({m )+f¯}(m ) , i n k⇤n,k⇤ k⇤ i k⇤ i (7) olows.o y ,y¯ minmin == c +c + n=n ` (m , , )+ ` (m(10)(10), , ), wherewherek⇤ ykk⇤,y¯areare definedyk defined= argmax as follows. as follows.fk(mi); y¯ = argmax fUk,(mVi). c i i i n i i i best{candidaten⇤ } ktype⇤ ⇤ o k⇤ U, OV O OO OO Y Y Y Y { } yk i miny = + m (10) m where yk ,y¯ are defined as follows.2Y k i c n i c i n ⇤ k⇤ U, V O2Y O O X2M X2M y {= argmax} f (m ); y¯ = argmax f (m ). = `c(mi, i, i)+ `n(mi, i, i), k⇤ y = argmaxk f i(m );k⇤y¯ = argmaxk f (im ). = `c(mi, i, i)+ `n(mi, i, i), k⇤ k i k⇤ k i Y Y Y Y y best= argmaxnegativeyk fi (typem ); y¯ = argmaxyfk (mi ). = `c(mmi,i i, ci)+ Y`n(Ymi, mi, i i), n Y Y k⇤ 2Yykk i i Notek⇤ that onek cany i also consider all the non- m m y 2Y 2Yk i XY2iMY c Y YX2i M n k i yk i 2Y mi c X2M mi n X2M 2Y candidate types2Y instead of just the “bestX2M non- 3.1 ModelX2M Learning and Inference Note that one cancandidate also type”, consider by modifying all the Eq. non- (6) as follows. 
We propose an alternative minimization algo- NoteNote that one that can one also can consider also consider all the non- all the non- 3.1 Modelrithm Learning based and on block-wiseInference coordinate descent candidatecandidate types types instead instead of just of justthe “best the non- “best3.1 non- Model3.1 Learning Model and Learning Inference and Inference candidate types instead`n(mi, i, ofi)= just theL rank “bestyk f non-(mi) ⌦i,k¯; (8) candidate type”, by modifyingY Y Eq. (6) as follows.⇤ schema (Tseng, 2001) to jointly optimize the objec- candidatecandidate type”, type”, by modifying by modifying Eq. (6) as Eq. follows.k¯ (6) as follows.We proposeWe an propose alternative an minimization alternative algo-minimization algo- X2Yi j ⇣ ⌘k We proposetive anin Eq. alternative (10). minimization algo- rithm basedrithm on based block-wise onO coordinate block-wise descent coordinate descent ⌦i,k = max 0, k ,k¯ fk⇤ (mi)+fk¯(mi)rithm. (9) basedWe on first block-wise take the derivative coordinate of with descent respect to `nn((mmi,i, i,i, i)=i)= L rankL rankyk fy(mi)f(⌦mi,⇤ik¯); (8)⌦ ¯; (8) ⇤ k⇤ i,k O `nY(YmYi,Y i, i)= L ranky f(mi) ⌦schema¯; (8) (Tseng,schema 2001) (Tseng, toUjointlywhile 2001)optimize fixing to Vjointly. the The objec-optimize derivative the of `c objec-(mi, i, i) k¯ ¯ nk⇤ i,k o Y YX2Yki ji ⇣ ⌘k schema (Tseng, 2001) to jointly optimize the objec-Y Y X2Y¯ j ⇣ ⌘k tive in Eq. (10). (denoted as `c,i) with respect to U is computed as k i j ⇣ ⌘k O tive in Eq. (10). ⌦i,k = max 0, ¯ X2fkY (mi)+f¯(mi) . (9) We first taketiveO thein derivative Eq. (10). of with respect to ⌦ = max 0k,⇤,k ¯ ⇤ f (mk)+f¯(m ) . i,k k⇤,k k⇤ i k i (9) WeO first take the derivativeO of with respect to ⌦i,k = max 0, ¯ fk (mi)+f¯(mi) U. while(9) fixingWeV. first The derivative take the derivative of `c(mi, i, ofOi) with respect to n k⇤,k ⇤ o k U V Y Y ` (m , , ) n o (denoted as `while) with fixing respect to.U Theis computed derivative asO of c i i i Uc,iwhile fixing V. The derivative of `c(mi,Y i,Y i) n o (denoted as `c,i) with respect to U is computedY Y as (denoted as `c,i) with respect to U is computed as w of link (y ,y ) G is defined as follows. Text corpus kk0 k k0 YY D 2 = m N Entity mentions in (size N) M { i}i=1 D wkk0 = k k0 / k + k k0 / k0 /2, (5) c, n Clean and noisy mentions in E \ E E E \ E E M M K M wkk of link (⇣yk,yk ) GYY is defined as follows.⌘ = yk k=1Text corpusTarget entity types (size K) 0 0 2 D Y { N } where k denotes the size of set k. = mi i=1 Entity mentions in (sizemN) |E | E M i{ } Candidate typesD of i w = k / k + k / /2, (5) ,Y Clean and noisy mentions in In thiskk0 work, tok0 consider thek type-path0 k0 correla- c i =n i Non-candidate types of mi E \ E E E \ E E M M K M ⇣ ⌘ =Yyk Y\YM Target entity types (size K) tions, we propose a path edit distance to measure Y { =}k=1fj j=1 Text features in (size M) where k denotes the size of set k. i F { M} Candidate types of mi D the semantic|E | differences betweenE two type-paths in Y mi R Feature vector for mi In this work, to consider the type-path correla- i = 2 iK Non-candidate types of mi 2 M the given hierarchy. Algorithm 1 summarizes our Y ykY\YMR Type label vector for yk tions, we propose a path edit distance to measure = fj2 j=1d M Text features in (size M) 2 Y algorithm for deriving the path edit distance. F U{ M}R ⇥ MappingsD for to d-dim space the semantic differences between two type-paths in mi R2 Feature vector for miM 2 K d Embedding of 2fjM(j-th column of theWe given compare hierarchy. 
these Algorithm three methods 1 summarizes for measuring our yk URj R Type label vector for yk 2 d 2M U) 2 Y algorithmtype correlation for deriving in our the experiments. path edit distance. Entity-entity U R ⇥ d K Mappings for to d-dim space 2V ⇥ MappingsM for to d-dim space d R Embedding of fj (j-th column of factsWe of compare various these relationships three methods in the for KB measuring can also be Uj R2 Y 2 d U) Embedding of yk (k-th column of typeutilized correlation to model in type our experiments. correlation, as Entity-entity discussed in Vkd KR V R ⇥2 MappingsV) for to d-dim space factsKB embedding of various relationships (Hu et al., 2015; in the Bordes KB can et also al., 2013). be 2 Y d EmbeddingTable 3: ofNotations.yk (k-th column of utilized to model type correlation, as discussed in Vk R We leave this as future work. 2 V) KB embedding (Hu et al., 2015; Bordes et al., 2013). Modeling Noisy Type Labels. To effectively model Table 3: Notations. We leave this as future work. Minimizing `n,i encourages a large margin be- the noisy mention-type links in subgraph GMY , we Modeling Noisy Type Labels. To effectively model tween the maximum scores maxy i s(mi,y) and extend the margin-based loss in (Nguyen and Caru- Minimizing `n,i encourages a large margin2Y be- max s(mi,y0). This forces mi to be embed- the noisy mention-type links in subgraph GMY , we y0 i ana, 2008) (used to learn linear classifiers) to enforcetween the2Y maximum scores maxy s(mi,y) and extend the margin-based loss in (Nguyen and Caru- ded closer to the most “relevant”2Yi type in the noisy max s(mi,y0). This forces mi to be embed- Hypothesis ??. The intuition of the loss is simple: y0 i ana, 2008) (used to learn linear classifiers) to enforce candidate2Y type set, i.e., y⇤ = argmaxy s(mi,y), for mention m , the maximum score associated withded closer to the most “relevant” type in the2 noisyYi Hypothesis ??.i The intuition of the loss is simple: than to any other non-candidatey = argmax typess( (mi.e.,y,) Hypoth- its candidate types is greater than the maximumcandidate type set, i.e., ⇤ y i i , for mention m , the maximumi score associated with 2Y i Y thanesis to any??). other Thisnon-candidate constrasts types sharply (i.e. with, Hypoth- multi-label itsscore candidate associated types withis any greater other than non-candidate the maximum types Yi esislearning??). This (Yosef constrasts et al., sharply 2012), with where multi-label a large margin scorei = associatedi, where with any the other scores non-candidate are measured types using Y Y\Y learningis enforced (Yosef et between al., 2012),all wherecandidate a large types margin and non- currenti = embeddingi, where vectors. the scores are measured using Y Y\Y d is enforcedcandidate between types withoutall candidate considering types and noisy non- types. currentSpecifically, embedding we vectors. use vectors ui, vk R to rep- 2d candidate types without considering noisy types. resent mention mi andu , typev yk in the The Joint Optimization Problem. Our goal is Specifically, we use2 vectorsM i k R2 toY rep- resentd-dimensional mention m embeddingand space, type respectively.y 2 in the TheTheto Joint embed Optimization the heterogeneous Problem. graphOur goalG into is a d- i 2 M k 2 Y dscore-dimensional of (mi,y embeddingk) is defined space, as the respectively. dot product of The theirto embeddimensional the heterogeneous vector space, graph followingG into the a threed- pro- T scoreembeddings, of (mi,yki.e.) is, defineds(mi,yk as)= thev dotk u producti. 
We of define their thedimensionalposed hypotheses vector space, in Sec. following??. the Intuitively, three pro- one can T embeddings,partial-labeli.e. loss, s(`mfori,ykm)=v uasi. follows.We define the posedcollectively hypothesesminimize in Sec. ?? the. objectives Intuitively, of one the can three sub- i i 2 Mk partial-label loss `i for mi as follows. collectivelygraphs minimizeG , G the objectivesand G of, as the mentions three sub- and 2 M MY MF YY M graphstypesGMYare, GMF sharedand acrossGYY, them. as mentions To achieveand the goal, `n(mi, i, i)=L rankyk f(mi) ⌦i; (6) M Y Y ⇤ · types areY shared across them. To achieve the goal, `n(mi, i, i)=L rankyk f(mi) ⌦i; (6) we formulate a joint optimization problem as fol- Y Y j ⇤ ⇣ ⌘k· PuttingY Things Together: The Optimization Problem ⌦ = max 0, ¯ f (m )+f¯ (m ) , we formulate a joint optimization problem as fol- i j k⇤,k⇤ ⇣ k⇤ ⌘ki k⇤ i (7) lows. ⌦i = max 0, k ,k¯ fk (mi)+fk¯ (mi) , (7) lows. n ⇤ ⇤ ⇤ ⇤ o y ,y¯n o min = c + n (10) where k⇤ k⇤ are defined as follows. minU, V= + (10) where {yk ,y¯ }are defined as follows. O c O n O { ⇤ k⇤ } U, V O O O yk = argmax fk(mi); y¯ = argmax fk(mi). = `c(mi, i, i)+ `n(mi, i, i), y ⇤ = argmax f (m ); y¯ k=⇤ argmax f (m ). = `c(mi, i, i)+ `n(mi, i, i), k⇤ y k i k⇤ k i Y Y Y Y y k i yk i mi c Y Y mi n Y Y k 2Yi yk i2Y mi c 2M mi n 2M 2Y 2Y X2M X X2M X NoteNote that that one one can can also also consider consider all all the the non- non- Accounts for clean mentions M candidate types types instead instead of of just just the the “best “best non- non-3.13.1 Model Model Learning Learning andc Inference andAccounts Inferencefor noisy mentions Mn candidate type”, type”, by by modifying modifying Eq. Eq. (6) (6) as follows.as follows. qWeMinimizeWe propose proposethe anobjective alternative anà alternative minimization minimization algo- algo- rithm–rithmClean basedmentions: based onpositive block-wise on types block-wiseare ranked coordinatehigher coordinatethan descentnegative types descent ¯ ``nn((mmii,, i,i, i)=i)= LLrankrankyk yk f(mf(im) i)⌦i,k⌦; i,k¯(8); (8) YY YY ⇤ ⇤ schema–schemaNoisy (Tseng,mentions: (Tseng, 2001)best candidate 2001) to jointly totypejointlyisoptimizerankedoptimizehigher thethan objec-negative the objec- types k¯ ¯ k i i j ⇣ ⌘k X2XY2Y j ⇣ ⌘k tive in Eq. (10). tiveO in Eq. (10). ⌦ = max 0, ¯ f (m )+f¯(m ) . (9) We firstO take the derivative of with respect to ⌦i,ki,k = max 0,k⇤,k ¯ kf⇤k (imi)+k f¯(i mi) . (9) We first take the derivative of with respect to k⇤,k ⇤ k O U while fixing V. The derivative of `cO(mi, i, i) nn o o U while fixing V. The derivative of Y`c(mY i, i, i) (denoted as `c,i) with respect to U is computed as Y Y (denoted as `c,i) with respect to U is computed as Model Learning

Model Learning

q Alternating minimization (between U and V)
  – Block-wise coordinate descent algorithm
  – Converges to a local minimum

q Can also apply SGD for online update

q Easy to parallelize by partitioning the mention set

What we need for inference: embedding vectors for text features (U) & type labels (V)
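A minimal sketch of this training loop, assuming the per-mention (sub)gradients of the losses (Eqs. (12)–(16) in the backup slides) are wrapped in caller-supplied `grad_U`/`grad_V` and the objective of Eq. (10) in `objective`; `alpha` and `C` mirror the learning rate and norm constant of Algorithm 2. All names are illustrative.

```python
import numpy as np

def train(mentions, U, V, grad_U, grad_V, objective, alpha=0.01, C=1.0,
          tol=1e-4, max_iter=200):
    """Block-wise coordinate descent: step in U with V fixed, then in V."""
    prev = np.inf
    for _ in range(max_iter):
        U -= alpha * sum(grad_U(U, V, m) for m in mentions)   # U-block update
        V -= alpha * sum(grad_V(U, V, m) for m in mentions)   # V-block update
        # Keep column norms bounded by C, as in Algorithm 2 (step 14).
        U *= np.minimum(1.0, C / (np.linalg.norm(U, axis=0, keepdims=True) + 1e-12))
        V *= np.minimum(1.0, C / (np.linalg.norm(V, axis=0, keepdims=True) + 1e-12))
        obj = objective(U, V, mentions)        # O = O_c + O_n, Eq. (10)
        if prev - obj < tol:                   # stop at a local minimum
            break
        prev = obj
    return U, V
```

Because each mention contributes an independent gradient term, the two sums parallelize naturally over partitions of the mention set, and replacing them with single-mention steps yields the SGD variant mentioned above.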

Tseng. “Convergence of a block coordinate descent method for non-differentiable minimization.” JOTA 2001.

Hierarchical Partial-label Embedding

Training mentions with extracted features:

Mention: “S1_Arnold Schwarzenegger”; Context: S1;
Candidate Type Set: {person, politician, artist, actor, author, businessman, athlete}
Text Features: {HEAD_Arnold, CXT1_B:Governor, CXT1_A:gives, POS:NN, TKN_arnold, TKN_schwarzenegger, SHAPE_Aa, ...}

Mention: “S2_Arnold Schwarzenegger”; Context: S2;
Candidate Type Set: {person, politician, artist, actor, author, businessman, athlete}
Text Features: {HEAD_Arnold, CXT1_B:star, CXT2_B:action-movie star, CXT3_A:to the franchise, POS:NN, SHAPE_Aa, ...}

Mention: “Ted Cruz”; Context: Sn;
Candidate Type Set: {person, politician}
Text Features: {HEAD_Ted, CXT1_B:senator, CXT1_B:told, CXT3_B:campaign of senator, POS:NN, SHAPE_Aa, ...}

q Joint embedding: partition training mentions into clean and noisy sets, then jointly embed mentions, text features, and type labels into the same space (learned embeddings for text features U and type labels V)

q Type inference: top-down nearest neighbor search in the given type hierarchy
  – Test mention: “S1_Arnold Schwarzenegger” (“Governor ... gives ... speech”)

Experiments

q Datasets:
  – Wiki (780k Wikipedia articles)
  – OntoNotes (13,109 news articles)
  – BBN (2,311 WSJ news articles)

q Compared Methods
  – Bootstrapping: ClusType
  – Classifier: FIGER, HYENA, Hybrid Neural Model
  – Embedding: DeepWalk, LINE, PTE, WSABIE
  – Partial-Label Learning: CLPL, PL-SVM
  – Variants of AFET: AFET-NoCo, AFET-NoPa, AFET-CoH

Example Output

AFET predicts fine-grained types more accurately

AFET avoids overly-specific predictions

Performance Comparison on Fine-Grained Typing

q AFET vs. supervised methods (classifiers & embedding models)
  – Partial-label loss for careful modeling of noisy type labels

Performance Comparison on Fine-Grained Typing

q AFET vs. partial-label learning methods
  – Adaptive margins for incorporating type correlation

Performance Comparison on Fine-Grained Typing

q AFET vs. AFET-NoCo → gain from incorporating type correlation
q AFET vs. AFET-NoPa → gain from noise-robust loss function

Comparing on Different Type Levels

q Type correlation signal helps fine-grained (long-tailed) types
q AFET achieves a 22.36% improvement in Accuracy on level-3 types, compared to the next best method, FIGER

Conclusion

q A challenging problem: Fine-Grained Entity Typing with Noisy Distant Supervision

q An effective and efficient solution: AFET
  – Noise-robust embedding of type labels
  – Adaptive margins for type correlation

q Improvements on three public entity typing datasets
  – https://github.com/shanzhenren/AFET
q Acknowledgement
  – Thanks to the Google PhD Fellowship for supporting my research

Backup: Performance on Pruned Training Data

q AFET vs. classifiers on pruned training data

Backup: Performance Study

Varying the training set size | Varying the dimensionality d

Backup: Performance on Frequent/Infrequent Types

Infrequent Types | Frequent Types

Backup: Model Learning

q Discussions:
  – Since our type hierarchy size is acceptable (~100 types), we don't have to do negative sampling to speed up
  – We don't need to approximate the rank function

Algorithm 2: Model Learning of AFET

Input: feature vectors {m_i}_{i=1}^N, type vectors {y_k}_{k=1}^K, learning rate α, normalization constant C
Output: feature embeddings U, type embeddings V

1   Initialize: U and V as random matrices; while O in Eq. (11) has not converged do
2       for m_i ∈ M_c do
3           Compute the margin-infused rank for y_k
4           Compute ∂ℓ_{c,i}/∂U using Eq. (12)
5           Compute ∂ℓ_{c,i}/∂V using Eq. (15)
6       end
7       for m_i ∈ M_n do
8           Compute the margin-infused rank for y_{k*}
9           Compute ∂ℓ_{n,i}/∂U using Eq. (13)
10          Compute ∂ℓ_{n,i}/∂V using Eq. (16)
11      end
12      U ← U − α ( Σ_{m_i ∈ M_c} ∂ℓ_{c,i}/∂U + Σ_{m_i ∈ M_n} ∂ℓ_{n,i}/∂U )
13      V ← V − α ( Σ_{m_i ∈ M_c} ∂ℓ_{c,i}/∂V + Σ_{m_i ∈ M_n} ∂ℓ_{n,i}/∂V )
14      Normalize the norms of U and V to C
15  end

The derivative of ℓ_{c,i} with respect to U is computed as follows.

  ∂ℓ_{c,i}/∂U = Σ_{y_k ∈ Y_i} L( rank_{y_k}(f(m_i)) ) · V ŷ_{i,k} m_i^T
              = V { Σ_{y_k ∈ Y_i} L( rank_{y_k}(f(m_i)) ) ŷ_{i,k} } m_i^T,   (12)

where we define 1(·) as the indicator function and the vector ŷ_{i,k} as follows.

  ŷ_{i,k} = Σ_{k̄ ∈ Ȳ_i} [ 1( γ(y_k, y_{k̄}) + f_{k̄}(m_i) > f_k(m_i) ) / rank_{y_k}(f(m_i)) ] · (y_{k̄} − y_k).

Note that if we follow the negative sampling process in (Weston et al., 2011), vector ŷ_{i,k} simply changes into (y_{k̄} − y_k), where y_{k̄} is sampled following the procedure in (Weston et al., 2011). While computing the rank, we can compute Eq. (??) at the same time, which is efficient as |Y| ≈ 100.

The derivative of ℓ_n(m_i, Y_i, Ȳ_i) (denoted as ℓ_{n,i}) with respect to U is computed as follows.

  ∂ℓ_{n,i}/∂U = [ L( rank_{y_{k*}}(f(m_i)) ) / rank_{y_{k*}}(f(m_i)) ] · V y_i* m_i^T,   (13)

where vector y_i* is defined as y_i* = 1( γ(y_{k*}, y_{k̄*}) + f_{k̄*}(m_i) > f_{k*}(m_i) ) · (y_{k̄*} − y_{k*}). If we use the definition of ℓ_{n,i} in Eq. (9), then y_i* takes the following form.

  y_i* = Σ_{k̄ ∈ Ȳ_i} 1( γ(y_{k*}, y_{k̄}) + f_{k̄}(m_i) > f_{k*}(m_i) ) · (y_{k̄} − y_{k*}).   (14)

Second, we can minimize O with respect to V while fixing U. The derivative of ℓ_{c,i} with respect to V is computed as follows.

  ∂ℓ_{c,i}/∂V = Σ_{y_k ∈ Y_i} L( rank_{y_k}(f(m_i)) ) · U m_i ŷ_{i,k}^T
              = U m_i { Σ_{y_k ∈ Y_i} L( rank_{y_k}(f(m_i)) ) ŷ_{i,k} }^T.   (15)

The derivative of ℓ_{n,i} in Eq. (6) with respect to V is computed as follows.

  ∂ℓ_{n,i}/∂V = [ L( rank_{y_{k*}}(f(m_i)) ) / rank_{y_{k*}}(f(m_i)) ] · U m_i (y_i*)^T.   (16)

Similarly, if we use the definition of ℓ_{n,i} in Eq. (9), then y_i* takes the form in Eq. (14).

Algorithm 2 summarizes our algorithm. Eq. (11) can also be solved by a mini-batch extension of the Pegasos algorithm (Shalev-Shwartz et al., 2011), which is a stochastic sub-gradient descent method and thus can efficiently handle massive text corpora. Due to lack of space, we do not include derivation details here.

Type Inference. With the learned mention embeddings {u_i} and type embeddings {v_k}, we perform top-down search in the candidate type sub-tree Y_i to estimate the correct type-path Y_i*. Starting from the tree's root (denoted as r), we recursively find the best type among the children types (denoted as C(r)) by measuring the dot product of the corresponding mention and type embeddings, i.e., s(u_i, v_k). The search process stops when we reach a leaf type, or the similarity score is below a pre-defined threshold η > 0. Algorithm ?? summarizes the proposed type inference process.
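To complement the type-inference paragraph above, here is a minimal sketch of the greedy top-down search (the `children` map, `root`, and `eta` are illustrative stand-ins; the referenced inference algorithm itself is not shown in this excerpt).

```python
import numpy as np

def infer_type_path(u_i, V, children, root, eta=0.1):
    """Greedy top-down search over the type hierarchy.

    u_i: learned mention embedding; V: d x K type embeddings;
    children: dict mapping a type id to its child type ids;
    eta: pre-defined similarity threshold (> 0).
    """
    path, node = [], root
    while True:
        kids = children.get(node, [])
        if not kids:                                       # reached a leaf type
            break
        scores = {k: float(V[:, k] @ u_i) for k in kids}   # s(u_i, v_k)
        best = max(scores, key=scores.get)
        if scores[best] < eta:                             # score below threshold
            break
        path.append(best)                                  # extend the type-path
        node = best
    return path
```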