How to Keep a Knowledge Base Synchronized with Its Encyclopedia Source
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17)

Jiaqing Liang1,2, Sheng Zhang1, Yanghua Xiao1,3,4∗
1School of Computer Science, Shanghai Key Laboratory of Data Science, Fudan University, Shanghai, China
2Shuyan Technology, Shanghai, China
3Shanghai Internet Big Data Engineering Technology Research Center, China
4Xiaoi Research, Shanghai, China
[email protected], {shengzhang16, [email protected]}

∗Corresponding author. This paper was supported by the National Key Basic Research Program of China under No. 2015CB358800, by the National NSFC (No. 61472085, U1509213), by the Shanghai Municipal Science and Technology Commission foundation key project under No. 15JC1400900, and by Shanghai Municipal Science and Technology projects under No. 16511102102 and No. 16JC1420401.

Abstract

Knowledge bases are playing an increasingly important role in many real-world applications. However, most of these knowledge bases tend to be outdated, which limits their utility. In this paper, we investigate how to keep a knowledge base fresh by synchronizing it with its data source (usually encyclopedia websites). A direct solution is revisiting the whole encyclopedia periodically and rerunning the entire knowledge-base construction pipeline, as most existing methods do. However, this solution is wasteful and incurs massive network overhead, which limits the update frequency and leads to knowledge obsolescence. To overcome this weakness, we propose a set of synchronization principles upon which we build an Update System for knowledge Base (USB), with an update-frequency predictor for entities as its core component. We also design a set of effective features and realize the predictor. We conduct extensive experiments to justify the effectiveness of the proposed system and model, as well as the underlying principles. Finally, we deploy USB on a Chinese knowledge base to improve its freshness.

1 Introduction

Knowledge bases (KBs) are playing an increasingly important role in many real-world applications. Many knowledge bases, such as Freebase [Bollacker et al., 2008] and DBpedia [Auer et al., 2007; Lehmann et al., 2015], are extracted from encyclopedia websites such as Wikipedia, where many volunteers maintain unstructured or semi-structured knowledge every day. Knowledge bases extracted from encyclopedia websites usually have high precision and coverage, and thus have been widely used in many real applications, such as search intent detection, recommendation, and summarization.

However, most of these knowledge bases tend to be outdated, which limits their utility. For example, in many knowledge bases, Donald Trump is still only a businessman even after the inauguration. Obviously, it is important to let machines know that Donald Trump is the president of the United States, so that they can understand that the topic of an article mentioning Donald Trump is probably related to politics. Moreover, new entities are continuously emerging, and most of them are popular, such as iPhone 8. However, it is hard for a knowledge base to cover these entities in time, even when the encyclopedia websites have already covered them.

One way to keep the knowledge base fresh is to synchronize it with the data source from which it is built. For example, encyclopedia websites such as Wikipedia are obviously good sources for updates, not only because many knowledge bases are extracted from these websites, but also because of their high-quality and up-to-date information, since many volunteers manually update existing articles or add articles for new entities. Thus, our problem becomes: how should the knowledge base be kept synchronized with an online encyclopedia?

In the best situation, the encyclopedia website provides access (such as a live feed) to recent changes made on the website. The DBpedia live extraction systems [Hellmann et al., 2009; Morsey et al., 2012] rely on this access mechanism. However, most encyclopedia websites do not provide such access. In this case, a direct update solution is revisiting the whole encyclopedia periodically and rerunning the entire knowledge-base construction pipeline, as most existing methods do. There are two alternative ways to access the encyclopedia articles. The first is directly downloading the dump file provided by the encyclopedia website, as Wikipedia offers. However, the dump is usually updated only monthly, which cannot satisfy the real requirement for knowledge freshness. Moreover, only a few encyclopedia websites provide dumps for download. The second way is crawling all the articles in the encyclopedia. But crawling an entire online encyclopedia with tens of millions of articles usually takes more than one month on a single machine. No matter which way we choose, we have to download gigabytes of data, which consumes too much network bandwidth. What is worse, the encyclopedia websites might ban the crawler if the access is too frequent. As a result, the time delay of the update is still significant, leading to the obsolescence of knowledge.

The above solution is not only wasteful but also unnecessary. We notice that most entities have stable properties and are seldom changed, such as basic concepts like orange or historical persons like Newton. In contrast, some very hot entities are subject to change, such as Donald Trump. Distinguishing the entities subject to change (hot entities) from others with stable properties, and then updating only the hot entities, is clearly a smarter strategy, which not only saves network bandwidth but also shortens the time delay. Thus, the key is how to estimate the update frequency of an entity in an encyclopedia website.

In this paper, we systematically study the update frequency of entities in encyclopedias. We find that only a few of them follow a Poisson distribution. Thus, we can only use the change rate (λ) of the Poisson distribution to estimate the update frequency for quite few entities. This weakness motivates us to build a more effective predictor of update frequency that employs not only the historical update information but also the semantic information in the article page of the entity.

Based on the predictor, we build a KB update system, USB (Update System for Knowledge Base), whose effectiveness is justified in our experiments. The update system was further deployed on our knowledge base CN-DBpedia1, which is created from the largest Chinese encyclopedia, BaiduBaike2. The system updates 1K entities per day, and about 70% of them actually contain newly updated facts.

2 Framework

In this section, we present the basic framework of USB.

2.1 Problem Model and Solution Framework

Problem model Since network resources are always limited and some encyclopedia websites restrict access, we assume that there is an upper limit K on the number of entities we can access every day. In other words, a good update strategy should maximize the number of crawled entities that have a newer version than the one in our knowledge base, with at most K entities to crawl. The formal objective is:

    arg max_{R, |R| ≤ K} |{ x | x ∈ R, t_n(x) > t_s(x) }|    (1)

where R is the entity set that we crawl, t_n(x) is the last update time of x in the online encyclopedia, and t_s(x) is the last synchronization time of x. When x is a new entity that is not in the current KB, t_s(x) = −∞.

A baseline solution Recall that the key of a smart update strategy is predicting whether an entity has been updated since its last synchronization.

• Second, this method can only update the existing entities in KB, and it misses new entities that are not in KB.

To overcome the weaknesses above, we highlight that many entities have stable properties, and running the predictor on these entities is wasteful. In contrast, the entities with changed facts are actually not too many. We find that hot entities mentioned on the Web (in hot news or hot topics) are good candidates, because hot entities on the Web are in general either newly emerging entities or old entities that have a large chance to change. This rationale motivates us to use hot entities on the Web as a seed set to start the updating procedure. This strategy alone is not good enough, since the number of hot entities on the Web per day is usually small. Hence, we further propose entity expansion to improve the recall.

Solution framework Based on the ideas above, we develop a framework to update a KB, namely USB, which is illustrated in Algorithm 1. USB mainly consists of four steps:

1. Seed finding. We find hot entities from the Web.
2. Seed synchronizing. We visit/crawl the latest encyclopedia page for each seed entity, and synchronize (update or insert) the information from the online encyclopedia into the knowledge base via our knowledge extractor for online encyclopedias.
3. Entity expanding. We use the newest pages of the synchronized entities to find more related entities via the hyperlinks in the encyclopedia pages.
4. Expanded-entity synchronizing. For the expanded entities, we synchronize the entities according to their priority. The priority is provided by a predictor of update frequency.

Algorithm 1 USB: Update System for KB
Input: K: upper limit on the number of accesses to the online encyclopedia
Func: VISIT(x): access x's page in the online encyclopedia; can be called at most K times
Func: EXTRACT SEED(hots): extract entities from the sentence set hots
Func: EXPAND(x): return all hyperlinked entities of x
1: // Step 1: Seed finding
2: hots ← crawl hot sentences/phrases from the Web
3: seeds ← EXTRACT SEED(hots)
4: // Steps 2 & 3: Seed synchronizing and seed expanding
5: pq ← ∅ // pq is a priority queue
6: for s ∈ seeds do
7:   page(s) ← VISIT(s)
8:   if page(s) exists in the online encyclopedia then
9:     update s in KB with page(s)
10:    for x in EXPAND(s) do
11:      insert x into pq with priority value E[u(x)]
12:    end for
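The synchronization loop of Algorithm 1 can be sketched in Python. This is a minimal sketch, not the paper's implementation: the crawler, extractor, and predictor are passed in as plain callables, and the names `usb_sync`, `extract_seed`, `visit`, `update_kb`, `expand`, and `expected_updates` are our illustrative stand-ins for EXTRACT SEED, VISIT, EXPAND, and the priority E[u(x)].

```python
import heapq

def usb_sync(hot_sentences, extract_seed, visit, update_kb, expand,
             expected_updates, k):
    """One synchronization round of the USB framework (a sketch).

    hot_sentences    -- hot sentences/phrases crawled from the Web
    extract_seed     -- maps the sentences to a list of seed entities
    visit            -- fetches an entity's encyclopedia page (None if absent);
                        called at most k times in total
    update_kb        -- writes a page's facts into the knowledge base
    expand           -- returns the entities hyperlinked from a page
    expected_updates -- priority E[u(x)]: expected update count of entity x
    """
    budget = k
    # Step 1: seed finding.
    seeds = extract_seed(hot_sentences)

    # Steps 2 & 3: synchronize the seeds and expand them via hyperlinks.
    pq = []        # max-heap emulated with negated priorities
    synced = []
    for s in seeds:
        if budget == 0:
            return synced
        page = visit(s)
        budget -= 1
        if page is not None:
            update_kb(s, page)
            synced.append(s)
            for x in expand(page):
                heapq.heappush(pq, (-expected_updates(x), x))

    # Step 4: synchronize expanded entities by priority until the budget is spent.
    while pq and budget > 0:
        _, x = heapq.heappop(pq)
        page = visit(x)
        budget -= 1
        if page is not None:
            update_kb(x, page)
            synced.append(x)
    return synced
```

With a toy in-memory encyclopedia and a budget of K = 3, the seed is synchronized first and the expanded entities follow in priority order.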
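The introduction notes that only a few entities' edit histories fit a Poisson model; for those that do, the change rate λ is simply the observed updates per unit time, and the chance an entity has changed since its last synchronization is 1 − e^(−λΔt). A minimal sketch of that baseline (the function names are ours, not the paper's):

```python
import math

def poisson_rate(n_updates, window_days):
    # Maximum-likelihood estimate of the Poisson change rate lambda:
    # observed number of updates divided by the observation window.
    return n_updates / window_days

def prob_changed_since(rate, elapsed_days):
    # Probability of at least one update in `elapsed_days` under a
    # Poisson model: P = 1 - exp(-lambda * dt).
    return 1.0 - math.exp(-rate * elapsed_days)
```

For example, an entity edited 10 times over a 100-day window has λ = 0.1 per day, so the probability it changed during the week since its last synchronization is 1 − e^(−0.7) ≈ 0.50.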