Entity Linking at Web Scale

Thomas Lin, Mausam, Oren Etzioni
Computer Science & Engineering
University of Washington
Seattle, WA 98195, USA
{tlin, mausam, etzioni}@cs.washington.edu

Abstract

This paper investigates entity linking over millions of high-precision extractions from a corpus of 500 million Web documents, toward the goal of creating a useful knowledge base of general facts. This paper is the first to report on entity linking over this many extractions, and describes new opportunities (such as corpus-level features) and challenges we found when entity linking at Web scale. We present several techniques that we developed and also lessons that we learned. We envision a future where Open Information Extraction and entity linking are paired to automatically generate knowledge bases with billions of assertions over millions of linked entities.

Figure 1: Entity Linking elevates textual argument strings to meaningful entities that hold properties, semantic types, and relationships with each other.

1 Introduction

Information Extraction techniques such as Open IE (Banko et al., 2007; Weld et al., 2008) operate at unprecedented scale. The REVERB extractor (Fader et al., 2011) was run on 500 million Web pages, and extracted 6 billion (Subject, Relation, Object) extractions such as ("Orange Juice", "is rich in", "Vitamin C"), over millions of textual relations. Linking each textual argument string to its corresponding Wikipedia entity, known as entity linking (Bunescu and Pașca, 2006; Cucerzan, 2007), would offer benefits such as semantic type information, integration with Linked Data resources (Bizer et al., 2009), and disambiguation (see Figure 1).

Existing entity linking research has focused primarily on linking all the entities within individual documents into Wikipedia (Milne and Witten, 2008; Kulkarni et al., 2009; Dredze et al., 2010). To link a million documents, such systems would simply repeat the single-document procedure a million times. However, there are opportunities to do better when we know ahead of time that the task is large scale linking. For example, information on one document might help link an entity on another document. This relates to cross-document coreference (Singh et al., 2011), but is not the same because cross-document coreference does not offer all the benefits of linking to Wikipedia. Another opportunity is that after linking a million documents, we can discover systematic linking errors when particular entities are linked to many more times than expected.

In this paper we entity link millions of high-precision extractions from the Web, and present our initial methods for addressing some of the opportunities and practical challenges that arise when linking at this scale.

    entity            assertions   wiki inlinks   ratio
    "Barack Obama"        16,094         16,415    0.98
    "Larry Page"          13,871            588    23.6
    "Bill Clinton"         5,710         11,176    0.51
    "Microsoft"            5,681         12,880    0.44
    "Same"                 6,975             36     193

Table 1: The ratio between an entity's linked assertion count and its inlink prominence can help to detect systematic errors to correct or filter out.

2 Entity Linking

Given a textual assertion, we aim to find the Wikipedia entity that corresponds to the argument. For example, the assertion ("New York", "acquired", "Pineda") should link to the Wikipedia article for New York Yankees, rather than New York City.

Speed is a practical concern when linking this many assertions, so instead of designing a system with sophisticated features that rely on the full Wikipedia graph structure, we start with a faster system leveraging linking features such as string matching, prominence, and context matching. Ratinov et al. (2011) found that these "local" features already provide a baseline that is very difficult to beat with the more sophisticated "global" features that take more time to compute. For efficient high-quality entity linking of Web scale corpora, we focus on the faster techniques, and then later incorporate corpus-level features to increase precision.

2.1 Our Basic Linker

Given an entity string, we first obtain the most prominent Wikipedia entities that meet string matching criteria. As in (Fader et al., 2009), we measure prominence using inlink count, which is the number of Wikipedia pages that link to a Wikipedia entity's page. In our example, candidate matches for "New York" include entities such as:

• New York (State) at 92,253 inlinks
• New York City at 87,974 inlinks
• New York Yankees at 8,647 inlinks
• New York University at 7,983 inlinks

After obtaining a list of candidate entities, we employ a context matching (Bunescu and Pașca, 2006) step that uses cosine similarity to measure the semantic distance between the assertion and each candidate's Wikipedia article. For example, if our assertion came from the sentence "New York acquired Pineda on January 23," then we would calculate the similarity between this sentence and the Wikipedia articles for New York (State), New York City, etc.

As a final step we calculate a link score for each candidate as the product of string match level, prominence score, and context match score. We also calculate a link ambiguity as the 2nd highest link score divided by the highest link score. The best matches have high link score and low link ambiguity.

2.2 Corpus-Level Features

This section describes two novel features we used that are enabled when linking at the corpus level.

2.2.1 Collective Contexts

One corpus-level feature we leverage here is collective contexts. We observe that in an extraction corpus, the same assertion is often extracted from multiple source sentences across different documents. If we collect together the various source sentences, this can provide a stronger context signal for entity linking. While "New York acquired Pineda on January 23" may not provide strong signal by itself, adding another source sentence such as "New York acquired Pineda to strengthen their pitching staff" could be enough for the linker to choose the New York Yankees over the others. Figure 2 shows the gain in linking accuracy we observed when using 6 randomly sampled source sentences per assertion instead of 1. At each link ambiguity level, we took 200 random samples.

Figure 2: Context matching using more source sentences can increase entity linking accuracy, especially in cases where link ambiguity is high.

2.2.2 Link Count Expectation

Another corpus-level feature we found to be useful is link count expectation. When linking millions of general assertions, we do not expect strong relative deviation between the number of assertions linking to each entity and the known prominence of the entities. For example, we would expect many more assertions to link to "Lady Gaga" than "Michael Pineda." We formalize this notion by calculating an inlink ratio for each entity as the number of assertions linking to it divided by its inlink prominence. When linking 15 million assertions, we found that ratios significantly greater than 1 were often signs of systematic errors. Table 1 shows ratios for several entities that had many assertions linked to them. It turns out that many assertions of the form "(Page, loaded in, 0.23 seconds)" were being incorrectly linked to "Larry Page," and assertions like "(Same, goes for, women)" were being linked to a city in East Timor named "Same." We filtered systematic errors detected in this way, but these errors could also serve as valuable negative labels in training a better linker.

2.3 Speed and Accuracy

Some of the existing linking systems we looked at (Hoffart et al., 2011; Ratinov et al., 2011) can take up to several seconds to link documents, which makes them difficult to run at Web scale without massive distributed computing. By focusing on the fastest local features and then improving precision using corpus-level features, our initial implementation was able to link at an average speed of 60 assertions per second on a standard machine without using multithreading. This translated to 3 days to link the set of 15 million textual assertions that REVERB identified as having the highest precision over its run of 500 million Web pages. On our Figure 2 data, overall linking accuracy was above 70%.

2.4 Unlinkable Entities

Aside from the speed benefit, another advantage of using the "extract then entity-link" pipeline (rather than entity linking all the source documents and then running extraction only for linked entities) is that it also allows us to capture assertions concerning the long tail of entities (e.g., "prune juice") that are not prominent enough to have their own Wikipedia article. Wikipedia has dedicated articles for "apple juice" and "orange juice" but not for "prune juice" or "wheatgrass juice." Searching Wikipedia for "prune juice" instead redirects to the article for "prune," which works for an encyclopedia, but not for many entity linking end tasks because "prunes" and "prune juice" are not the same and do not have the same semantic types. Out of 15 million assertions, we observed that around 5 million could not be linked. Even if an argument is not in Wikipedia, we would still like to assign semantic types to it and disambiguate it. We have been exploring handling these unlinkable entities over 3 steps, which can be performed in sequence or jointly:

2.4.1 Detect Entities

Noun phrase extraction arguments that cannot be linked tend to be a mix of entities that are too new or not prominent enough (e.g., "prune juice," "fiscal year 2012") and non-entities (e.g., "such techniques," "serious change"). We have had success training a classifier to separate these categories by using features derived from the Google Books Ngrams corpus, as unlinkable entities tend to have different usage-over-time characteristics than non-entities. For example, many non-entities are seen in books usage going back hundreds of years, with little year-to-year frequency variation.

2.4.2 Predict Types

We found that we can predict the Freebase types of unlinkable entities using instance-to-instance class propagation from the linked entities. For example, if "prune juice" cannot be linked, we can predict its semantic types by observing that the collection of relations it appears with (e.g., "is a good source of") also occur with linkable entities such as "orange juice" and "apple juice," and propagate the semantic types from these similar entities. Linking at Web scale means that unlinkable entities often have many relations to use for this process. When available, shared term heads (e.g., "juice") could also serve as a signal for finding entities that are likely to share semantic types.

    top assertions
    rank by freq:        "(teachers, teach at, school)"
                         "(friend, teaches at, school)"
                         "(Mike, teaches at, school)"
                         "(biologist, teaches at, Harvard)"
    rank by link score:  "(Jorie Graham, teaches at, Harvard)"
                         "(Pauline Oliveros, teaches at, RPI)"
                         "(Azar Nafisi, teaches at, Johns Hopkins)"
                         "(Steven Plaut, teaches at, Univ of Haifa)"
                         "(Niall Ferguson, teaches at, NYU)"
                         "(Ha Jin, teaches at, Boston University)"

Table 2: Ranking based on link score gives higher quality results than ranking based on frequency.

Figure 3: On top of an Entity Linked Open IE corpus we could learn inference rules and commonsense knowledge.
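The link score and link ambiguity computation from Section 2.1 can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the candidate table and article snippets below are invented stand-ins (only the inlink counts come from the paper), prominence is naively normalized by the largest candidate's inlink count, and the string match level is assumed to be a precomputed number in [0, 1].

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two strings treated as bags of words."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def link_candidates(context, candidates):
    """Score candidates for one assertion.  `candidates` maps entity title ->
    (inlink_count, article_text, string_match_level).  Returns candidates
    sorted best-first by link score, plus the link ambiguity
    (2nd highest link score divided by the highest)."""
    max_inlinks = max(inlinks for inlinks, _, _ in candidates.values())
    scored = []
    for entity, (inlinks, article, str_match) in candidates.items():
        prominence = inlinks / max_inlinks        # normalized inlink prominence
        context_match = cosine(context, article)  # context matching step
        scored.append((entity, str_match * prominence * context_match))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    ambiguity = scored[1][1] / scored[0][1] if scored[0][1] else 1.0
    return scored, ambiguity

# Toy candidate table for "New York" (inlink counts from Section 2.1;
# the article snippets are hypothetical).
candidates = {
    "New York (State)":    (92253, "state united states albany hudson", 1.0),
    "New York City":       (87974, "city borough manhattan brooklyn", 1.0),
    "New York Yankees":    (8647,  "baseball team acquired pitcher pineda", 1.0),
    "New York University": (7983,  "private research university manhattan", 1.0),
}
ranked, ambiguity = link_candidates(
    "New York acquired Pineda on January 23", candidates)
```

Here the context match dominates: the less prominent New York Yankees wins because only its article shares words with the source sentence, which is exactly the behavior the context matching step is meant to provide.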

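The link count expectation check of Section 2.2.2 is straightforward to state in code. A minimal sketch using the Table 1 counts; the flagging threshold of 10 is an illustrative choice, not a value reported in the paper:

```python
def inlink_ratio(linked_assertions, inlinks):
    """Number of assertions linked to an entity divided by its inlink
    prominence.  Ratios much greater than 1 often signal systematic errors."""
    return linked_assertions / inlinks

def flag_suspicious(entity_counts, threshold=10.0):
    """Return entities whose inlink ratio exceeds the threshold.
    `entity_counts` maps entity -> (linked assertion count, wiki inlinks)."""
    return [entity for entity, (assertions, inlinks) in entity_counts.items()
            if inlink_ratio(assertions, inlinks) > threshold]

# Counts from Table 1.
counts = {
    "Barack Obama": (16094, 16415),
    "Larry Page":   (13871, 588),
    "Bill Clinton": (5710, 11176),
    "Microsoft":    (5681, 12880),
    "Same":         (6975, 36),
}
suspicious = flag_suspicious(counts)  # -> ["Larry Page", "Same"]
```

On the Table 1 data this flags exactly the two entities the paper identifies as systematic errors ("Larry Page" and "Same"), while well-behaved entities with ratios near or below 1 pass through.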
2.4.3 Disambiguation

In cases where we predict mutually exclusive types (e.g., film and person can be observed to be mutually exclusive in Freebase instances), this signifies that the argument is a name shared by multiple entities. We plan to use clustering to recover the most likely types of the multiple entities and then divide assertions among them.

3 Resources Enabled

We observed that entity linking of 15 million textual extractions enables several resources.

3.1 Freebase Selectional Preferences

Each Wikipedia entity that gets linked is easily annotated with its Freebase (Bollacker et al., 2008) semantic types using data from the Freebase Wikipedia Extraction (WEX) project. On the 15 million extractions, the entities that we linked to encompassed over 1,300 Freebase types. Knowing these entity types then allows us to compute the Freebase selectional preferences of all our textual relations. For example, we can observe from our linked entities that the "originated in" relation most often has types such as food, sport, and animal breed in the domain. Selectional preferences have been calculated for WordNet (Agirre and Martinez, 2002), but have not been calculated at scale for Freebase, which is something that we get for free in our scenario. Freebase has a much greater focus on named entities than WordNet, so these selectional preferences could be valuable in future applications.

3.2 Improved Instance Ranking Function

We observed link score to be a better ranking function than assertion frequency for presenting query results. For example, Table 2 shows the top results when searching the extractions for instances of the "teaches at" textual relation. When results are sorted by frequency in the corpus, assertions like "(friend, teaches at, school)" and "(Mike, teaches at, school)" are returned first. When results are sorted by link score, the top hundred results are all specific instances of professors and the schools they teach at, and are noticeably more specific and generally correct than the top frequency-sorted instances.

3.3 Inference

Disambiguated and typed entities are especially valuable for inference applications over extracted data. For example, if we observe enough instances like "Orange Juice is rich in Vitamin C," "Vitamin C helps prevent scurvy," and "Orange Juice helps prevent scurvy," then we can learn the inference rule shown in Figure 3. Schoenmackers et al. (2010) explored this, but without entity linking they had to rely on heavy filtering against hypernym data, losing most of their extraction instances in the process. We plan to explore how much gain we get in inference rule learning when using entity linking instead of hypernym filtering. Linked instances would also be higher precision input than what is currently available for learning implicit common sense properties of textual relations (Lin et al., 2010).

4 Conclusions

While numerous entity-linking systems have been developed in recent years, we believe that going forward, researchers will increasingly be considering the opportunities and challenges that arise when scaling up from the single document level toward the Web-scale corpus level. This paper is the first to run and report back on entity linking over millions of textual extractions, and we proposed novel ideas in areas such as corpus-level features and unlinkable entities. There are potentially many other corpus-level features and characteristics to explore, as well as additional challenges (e.g., how to best evaluate recall at this scale), and we look forward to seeing additional research in Entity Linking at Web scale over the coming years.

5 Acknowledgements

This research was supported in part by NSF grant IIS-0803481, ONR grant N00014-08-1-0431, and DARPA contract FA8750-09-C-0179, and carried out at the University of Washington's Turing Center.

References

Eneko Agirre and David Martinez. 2002. Integrating selectional preferences in WordNet. In Proceedings of the First International WordNet Conference.

Michele Banko, Michael J. Cafarella, Stephen Soderland, Matt Broadhead, and Oren Etzioni. 2007. Open information extraction from the Web. In Proceedings of IJCAI.

Christian Bizer, Tom Heath, and Tim Berners-Lee. 2009. Linked data - the story so far. International Journal on Semantic Web and Information Systems (IJSWIS).

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of SIGMOD '08, pages 1247-1250, New York, NY, USA.

Razvan Bunescu and Marius Pașca. 2006. Using encyclopedic knowledge for named entity disambiguation. In Proceedings of the 11th Conference of the European Chapter of the Association of Computational Linguistics (EACL).

Silviu Cucerzan. 2007. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of EMNLP.

Mark Dredze, Paul McNamee, Delip Rao, Adam Gerber, and Tim Finin. 2010. Entity disambiguation for knowledge base population. In Proceedings of COLING.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2009. Scaling Wikipedia-based named entity disambiguation to arbitrary Web text. In IJCAI-09 Workshop on User-contributed Knowledge and Artificial Intelligence (WikiAI09).

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of EMNLP.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust disambiguation of named entities in text. In Proceedings of EMNLP.

Sayali Kulkarni, Amit Singh, Ganesh Ramakrishnan, and Soumen Chakrabarti. 2009. Collective annotation of Wikipedia entities in text. In Proceedings of KDD.

Thomas Lin, Mausam, and Oren Etzioni. 2010. Commonsense from the web: Relation properties. In AAAI Fall Symposium on Commonsense.

David Milne and Ian H. Witten. 2008. Learning to link with Wikipedia. In Proceedings of the 17th ACM International Conference on Information and Knowledge Management (CIKM).

Lev Ratinov, Dan Roth, Doug Downey, and Mike Anderson. 2011. Local and global algorithms for disambiguation to Wikipedia. In Proceedings of the Annual Meeting of the Association of Computational Linguistics (ACL).

Stefan Schoenmackers, Oren Etzioni, Daniel S. Weld, and Jesse Davis. 2010. Learning first-order Horn clauses from web text. In Proceedings of EMNLP.

Sameer Singh, Amarnag Subramanya, Fernando Pereira, and Andrew McCallum. 2011. Large-scale cross-document coreference using distributed inference and hierarchical models. In Proceedings of ACL.

Daniel S. Weld, Raphael Hoffmann, and Fei Wu. 2008. Using Wikipedia to bootstrap open information extraction. In SIGMOD Record.