FarsBase: The Persian Knowledge Graph
Semantic Web 1 (0) 1–5
IOS Press

Majid Asgari^a, Ali Hadian^a and Behrouz Minaei-Bidgoli^a,*
^a Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
*Corresponding author. E-mail: [email protected]
1570-0844/0-1900/$35.00 © 0 – IOS Press and the authors. All rights reserved

Abstract. Over the last decade, extensive research has been done on automatically constructing knowledge graphs from web resources, resulting in a number of large-scale knowledge graphs such as YAGO, DBpedia, BabelNet, and Wikidata. Although some of these knowledge graphs are multilingual, they contain little or no linked data in Persian and do not support tools for extracting knowledge from Persian information sources. FarsBase is a multi-source knowledge graph designed specifically for semantic search engines in the Persian language. FarsBase uses hybrid and flexible techniques to extract and integrate knowledge from various sources, such as Wikipedia, web tables and unstructured texts. It also supports entity linking, which allows integration with other knowledge bases. To maintain high accuracy for triples, we adopt a low-cost mechanism in which candidate knowledge is verified by human experts who are assisted by automated heuristics. FarsBase is being used as the semantic-search system of a Persian search engine and efficiently answers hundreds of semantic queries per day.

Keywords: Semantic Web, Linked Data, Persian, Knowledge Graph

1. Introduction

Extracting a knowledge graph (KG) from open-access structured data such as Wikipedia has paved the way for a revolution in information retrieval systems, including search engines and personal assistants like Siri, Google Assistant, Alexa, and Cortana. When issuing a search query, users prefer to receive an exact answer rather than a list of web pages. For example, the desired response to “How many children does the Queen have?” is simply “Four”. This requires a credible and up-to-date knowledge graph with enough information to answer most user queries. In fact, most of the information-acquisition effort that was traditionally made by the user (including fact-checking, conflict resolution, and credibility analysis of the information sources) must instead be made when the knowledge graph is being constructed.

In the last decade, extensive research has been done on constructing knowledge graphs. This includes knowledge graphs constructed from Wikipedia, such as DBpedia [1]; systems that extract knowledge from raw text, e.g. NELL [2]; and hybrid systems that exploit multiple types of information sources, including YAGO [3].

In this paper we present FarsBase, a Persian knowledge graph constructed from various information sources, including Wikipedia, web tables and raw text. FarsBase is specifically designed to fit the requirements of structural query answering in Persian search engines. Our contributions are as follows:

– We provide a hybrid architecture for knowledge graph construction from multiple sources that leverages both top-down and bottom-up approaches: a preliminary version of the knowledge graph is constructed from Wikipedia infoboxes and is subsequently used to extract more knowledge from other knowledge graphs, raw text, and tables.
– Contrary to other knowledge graphs, FarsBase is tailor-made for [Persian] search engines. Therefore, the entire process of data collection, data filtering, and query processing is specifically designed to improve the user experience. In that respect, the query log plays a key role in prioritizing data sources, entities, classes, infoboxes, properties, and images at various stages of the system. The workload-driven design of FarsBase requires fewer experts for building a knowledge graph, because the human resources and system-tuning efforts focus on the data records that matter most to the user, e.g. those that are frequently searched.
– FarsBase supports rule-based methods that enable flexible data extraction and manipulation in several components of our architecture, including infobox extraction, raw text extraction, data transformation, and data cleansing.
– FarsBase supports efficient human labeling for managing and cleansing data from various sources and in multiple versions. It benefits from various types of metadata provided by the different extractors, e.g. the time flags and the accuracy/confidence of the different extraction modules for each triple. Such features can be used for prioritizing and grouping entities for cost-effective batch verification of triples by human experts.
– We provide a mechanism for integrating data from different knowledge extractors. Our mechanism handles different versions of data sources with minimal expert intervention. This requires extracting temporal facts and versioning triples in order to handle conflicts between new and current information. To the best of our knowledge, FarsBase is the only multi-source knowledge base that supports timeliness [4] by handling different versions of data from multiple sources.

The remainder of this paper is organized as follows. The preliminaries and motivation are briefly introduced in Section 2. Section 3 describes a cost-based solution for selecting knowledge sources for FarsBase. We give an overview of the FarsBase architecture in Section 4. Section 5 explains knowledge extraction from different sources, including Wikipedia, web tables and raw text. In Section 6, we describe how extracted triples are mapped and integrated into a unified knowledge graph. Evaluation and statistics about FarsBase are reported in Section 7. Section 8 describes related work in knowledge graph construction, quality assessment, mapping, relation extraction from raw texts, never-ending paradigms and knowledge augmentation. Finally, Section 9 concludes the paper with directions for future work.

2. Preliminaries and Motivation

In this section, we briefly introduce the basics of knowledge graph construction and representation. We also explain the challenges of constructing a multi-domain Persian knowledge graph.

2.1. Knowledge Base and Knowledge Graph

A knowledge base contains a set of facts, assumptions and rules that allows knowledge to be stored in a computer system. Knowledge bases can be specific to certain domains, e.g. a medical knowledge base containing facts about medical drugs (such as their properties and interactions). Knowledge from multiple domains can also be integrated to build a general-domain knowledge base. For example, DBpedia [1] is a multi-domain knowledge base that is semi-automatically constructed from the entire body of Wikipedia articles.

Knowledge bases require a data model to organize the facts. A typical approach is to define an ontology, where data instances (a.k.a. entities) are assigned to classes. Each class can be a subclass of another class, which results in a hierarchy known as an ontology tree. The facts of a knowledge base are commonly represented using a knowledge representation format. Modern multi-domain knowledge bases use the Resource Description Framework (RDF) for knowledge representation. RDF is primarily designed to represent resources on the web, but it can also be used for knowledge management, and it supports essential features for constructing a knowledge base, such as Is-A relations and object properties.

In the Semantic Web and linked data communities, there are different definitions of knowledge graph (KG); Ehrlinger et al. tried to clarify the term in [5]. They mentioned five selected definitions of knowledge graph and presented an architecture for it. They assumed that a knowledge graph is somehow superior to and more complex than a knowledge base, because it contains a reasoning engine and also integrates knowledge from one or more sources.

2.2. RDF Knowledge Representation Format

RDF is a standard for conceptualizing structured data. In this model, data is represented as a set of triples, each consisting of a subject, a predicate, and an object. A set of triples forms an RDF graph. For example, the phrase "Einstein was born in Ulm" can be represented as (subj:Albert_Einstein, pred:birth_place, obj:Ulm). Triples are the atomic component of the RDF data model.

[…]ing language [9], such as Persian. This is mostly due to the fact that tools required for automatic knowledge […]

The RDF format enables knowledge representation using web resources, where each resource has a Uniform Resource Identifier (URI). For instance, the URI corresponding to “Albert Einstein” can be defined as http://example.name/Albert_Einstein, where http://example.name is the prefix address of each entity in the knowledge base. In RDF, subjects and predicates are URIs, and objects can be either URIs or literal values.
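
The triple model of Section 2.2 can be sketched in a few lines of Python. This is only an illustrative sketch, not the representation used inside FarsBase: the classes URI, Literal and Triple, the helpers uri and objects, and the predicate "name" are hypothetical names introduced here, and the prefix http://example.name/ is the placeholder from the text. Only the standard library is used.

```python
from dataclasses import dataclass
from typing import Union

# Placeholder prefix taken from the text; not a real namespace.
PREFIX = "http://example.name/"


@dataclass(frozen=True)
class URI:
    """A resource identified by a URI (hypothetical helper class)."""
    value: str


@dataclass(frozen=True)
class Literal:
    """A plain literal value, e.g. a string (hypothetical helper class)."""
    value: str


@dataclass(frozen=True)
class Triple:
    """Subject and predicate are URIs; the object is a URI or a literal."""
    subject: URI
    predicate: URI
    obj: Union[URI, Literal]


def uri(local_name: str) -> URI:
    """Build a full URI from a local name under the shared prefix."""
    return URI(PREFIX + local_name)


# An RDF graph is a set of triples.
graph = {
    # "Einstein was born in Ulm": the object is another resource (a URI).
    Triple(uri("Albert_Einstein"), uri("birth_place"), uri("Ulm")),
    # Assumed extra fact with a literal object; "name" is a made-up predicate.
    Triple(uri("Albert_Einstein"), uri("name"), Literal("Albert Einstein")),
}


def objects(graph, subject: URI, predicate: URI):
    """Return every object linked to `subject` through `predicate`."""
    return [t.obj for t in graph
            if t.subject == subject and t.predicate == predicate]


birth_places = objects(graph, uri("Albert_Einstein"), uri("birth_place"))
print(birth_places[0].value)  # prints http://example.name/Ulm
```

Frozen dataclasses are used so that triples are hashable and can live in a set, and so that a URI and a literal with the same text are never treated as equal. A production system would more likely rely on an established RDF library and a triple store.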