
FarsBase: The Persian Knowledge Graph

Majid Asgari (a), Ali Hadian (a) and Behrouz Minaei-Bidgoli (a,*)
(a) Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
* Corresponding author. E-mail: [email protected]

Abstract. Over the last decade, extensive research has been done on automatically constructing knowledge graphs from web resources, resulting in a number of large-scale knowledge graphs such as YAGO, DBpedia, BabelNet, and Wikidata. Despite the fact that some of these knowledge graphs are multilingual, they contain few or no facts in Persian, and do not support tools for extracting knowledge from Persian information sources. FarsBase is a multi-source knowledge graph especially designed for semantic search engines in the Persian language. FarsBase uses hybrid and flexible techniques to extract and integrate knowledge from various sources, such as Wikipedia, web tables and unstructured texts. It also supports a linking mechanism that allows integration with other knowledge bases. In order to maintain a high accuracy for triples, we adopt a low-cost mechanism for verifying candidate knowledge by human experts, who are assisted by automated heuristics. FarsBase is being used as the semantic-search system of a Persian search engine and efficiently answers hundreds of semantic queries per day.

Keywords: Semantic Web, Linked Data, Persian, Knowledge Graph

1. Introduction

Extracting a knowledge graph (KG) from open-access structural data such as Wikipedia has paved the path for a revolution in information retrieval systems, including search engines and personal assistants like Siri, Google Assistant, Alexa, and Cortana. In a search query, users prefer to find the exact answer rather than a list of web pages. For example, the desired response for "How many children does the Queen have?" is simply "Four". This requires a credible and up-to-date knowledge graph with comprehensive information for responding to most user queries. In fact, most of the effort for information acquisition that was traditionally made by the user —including fact-checking, conflict resolution, and credibility analysis of the information sources— has to be made when the knowledge graph is being constructed.

In the last decade, extensive research has been done on constructing knowledge graphs. This includes knowledge graphs constructed from Wikipedia such as DBpedia [1]; systems that extract knowledge from raw text, e.g. NELL [2]; as well as hybrid systems that exploit multiple types of information sources, including Yago [3].

In this paper we present FarsBase, a Persian knowledge graph constructed from various information sources, including Wikipedia, web tables and raw text. FarsBase is specifically designed to fit the requirements of structural query answering in Persian search engines. Our contributions are as follows:

– We provide a hybrid architecture for knowledge graph construction from multiple sources that leverages both top-down and bottom-up approaches: a preliminary version of the knowledge graph is constructed from Wikipedia infoboxes, which is consequently used to extract more knowledge from other knowledge graphs, raw text, and tables.
– Contrary to other knowledge graphs, FarsBase is tailor-made for Persian search engines. Therefore, the entire process of data collection, data filtering, and query processing is specifically designed to boost the user experience. In that respect, the query log has a key role in prioritizing data sources, entities, classes, infoboxes, properties, and images in various stages of the system. The workload-driven design of FarsBase requires fewer experts for building the knowledge graph, because the human resources and system-tuning efforts focus on data records that are more important to the user, e.g. frequently searched ones.


– FarsBase supports rule-based methods that enable flexibility for data extraction and manipulation in several components of our architecture, including infobox extraction, raw text extraction, data transformation, and data cleansing.
– FarsBase supports efficient human labeling for managing and cleansing data from various sources and in multiple versions. It benefits from various types of metadata provided by the different extractors, e.g. the time-flags and the accuracy/confidence of the different extraction modules for each triple. Such features can be used for prioritizing and grouping the entities for cost-effective batch verification of triples by human experts.
– We provide a mechanism for integrating data from different knowledge extractors. Our mechanism handles different versions from data sources with minimum expert intervention. This requires extracting temporal facts and triple versioning for handling further conflicts between the new and current information. To the best of our knowledge, FarsBase is the only multi-source knowledge base that supports timeliness [4] by handling different versions of data from multiple sources.

The remainder of this paper is organized as follows. The preliminaries and motivation are briefly introduced in section 2. Section 3 describes a cost-based solution to select knowledge sources for FarsBase. We give an overview of the FarsBase architecture in section 4. Section 5 explicates knowledge extraction from different sources, including Wikipedia, web tables and raw text. In section 6, we describe how extracted triples are mapped and integrated into a unified knowledge graph. Evaluation and statistics about FarsBase are reported in Section 7. Section 8 describes related work in knowledge graph construction, quality assessment, mapping, relation extraction from raw texts, never-ending learning paradigms and knowledge augmentation. Finally, section 9 concludes the paper with directions for future work.

2. Preliminaries and Motivation

In this section, we briefly introduce the basics of knowledge graph construction and representation. Also, we explain the challenges of constructing a multi-domain Persian knowledge graph.

2.1. Knowledge Base and Knowledge Graph

A knowledge base contains a set of facts, assumptions and rules that allows storing knowledge in a computer system. Knowledge bases can be specific to certain domains, e.g. a medical knowledge base containing facts about medical drugs (such as their properties and interactions). Also, knowledge from multiple domains can be integrated to build a general-domain knowledge base. For example, DBpedia [1] is a multi-domain knowledge base that is semi-automatically constructed based on the entire set of Wikipedia articles.

Knowledge bases require a data model to organize the facts. A typical approach is to define an ontology, where data instances (a.k.a. entities) are assigned to classes. Each class can be a subclass of another class, which results in a hierarchy known as the ontology tree. The facts of a knowledge base are commonly represented using a knowledge representation format. Modern multi-domain knowledge bases use the Resource Description Framework (RDF) for knowledge representation. RDF is primarily designed to represent resources on the web, but it can also be used for knowledge management and supports essential features for constructing a knowledge base, such as Is-A relations and object properties.

In the Semantic Web and linked data, there are different definitions of knowledge graph (KG); Ehrlinger et al. tried to clarify the term in [5]. They mentioned five selected definitions of knowledge graph and presented an architecture for it. They assumed a knowledge graph is somehow superior to and more complex than a knowledge base because it contains a reasoning engine and also integrates knowledge from one or more sources.

2.2. RDF Knowledge Representation Format

RDF is a standard for conceptualizing structural data. In this model, data is represented as a set of triples consisting of a subject, a predicate, and an object. A set of triples forms an RDF graph. For example, the phrase "Einstein was born in Ulm" can be represented as (subj:Albert_Einstein, pred:birth_place, obj:Ulm). Triples are the atomic component of the RDF data model.

The RDF format enables knowledge representation using web resources, where each resource has a Uniform Resource Identifier (URI). For instance, the URI corresponding to "Albert Einstein" can be defined as http://example.name/Albert_Einstein, where http://example.name is the prefix address of each entity in the knowledge base.
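To make the notation above concrete, the following is a minimal sketch — using the Python rdflib library, which is not part of FarsBase — that builds the "Einstein was born in Ulm" triple under the hypothetical http://example.name prefix used in this section, adds a Persian label, and serializes the graph:

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDFS

EX = Namespace("http://example.name/")   # hypothetical prefix from the running example

g = Graph()
g.bind("ex", EX)
# one RDF triple: (subject, predicate, object)
g.add((EX.Albert_Einstein, EX.birth_place, EX.Ulm))
# objects can also be language-tagged literals, e.g. a Persian label
g.add((EX.Albert_Einstein, RDFS.label, Literal("آلبرت اینشتین", lang="fa")))
print(g.serialize(format="turtle"))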

In RDF, subjects and predicates are URIs, and objects can be either URIs or literal values. RDF data is serialized and stored using different textual syntaxes, such as Turtle, NTriples, XML, RDFa, TriG, N-Quads, and JSON-LD. For instance, Turtle serializes triples into the following syntax [6]:

<subject> <predicate> <object> .

Therefore, the fact that "Einstein knows Niels Bohr" can be represented in Turtle syntax as follows:

<http://example.name/Albert_Einstein> <http://example.name/knows> <http://example.name/Niels_Bohr> .

Note that entities with similar names (e.g. "London, UK" and "London, Canada") must have different URIs in order to avoid further ambiguity in the facts.

RDF can be easily used for knowledge bases derived from non-English data. String literals can have a language tag, which is very useful for building multilingual knowledge bases. For example, Albert Einstein can be represented as "Albert_Einstein"@en or "آلبرت_اینشتین"@fa. Also, RDF supports "IRI"s, an extended version of URIs that allows non-ASCII characters in resource names. For the sake of simplicity, in this paper we simply assume that a URI may also contain UTF-8 characters from other languages.

2.3. FarsBase: A Persian Knowledge Graph

Constructing a multi-domain and comprehensive knowledge graph from unstructured and semi-structured web contents has been of interest for a while. The DBpedia project was initiated a decade ago to construct a knowledge graph from Wikipedia. Further works like Yago [3] and others integrated additional sources, e.g. WordNet and GeoNames [7], in order to construct a more enriched knowledge graph. Raw text data collected from the web can also be used to supplement knowledge graphs [8].

Thanks to the large body of research on knowledge graph construction techniques, multi-domain knowledge graphs such as DBpedia and Yago have a comprehensive set of facts extracted from English and other European languages. However, these knowledge graphs do not contain enough facts from a low-resource and challenging language [9], such as Persian. This is mostly due to the fact that the tools required for automatic knowledge graph construction are not mature enough to be used in multilingual knowledge extraction engines. Also, Persian NLP toolsets suffer from lower accuracies due to the small size of the Persian corpora. We tackled these challenges with various techniques that boost the accuracy of knowledge graph construction in Persian. The main challenges for FarsBase construction are summarized as follows:

– Knowledge graph construction engines adopt state-of-the-art methods for extracting knowledge from raw text and semi-structured data such as web tables. Essential software libraries include entity linking, base phrase chunking, dependency parsing, coreference resolution, and entity disambiguation. The errors caused by each of these tools are propagated throughout the rest of the system and result in low-precision output. Whenever the accuracy of standard NLP toolsets does not suffice for certain subtasks in knowledge extraction, the inaccurate methods should be either enhanced or replaced with alternative approaches to deliver enough accuracy for knowledge graph construction.
– Knowledge graph construction requires human supervision as a part of the process, e.g. in the mapping phase (explained later in sec. 6). DBpedia and other projects mostly focused on English and other widespread languages, and did not put effort into mapping and cleaning Persian data. We argue that human supervision for yet another language is non-trivial. For example, the human supervision process for dependency parsing (used for knowledge extraction from raw text) is entirely different in Persian.
– Even with an ideal set of high-precision NLP toolkits, it is challenging to extract enough knowledge from Persian web sources to satisfy a certain application. In particular, FarsBase is primarily constructed to be used as a backbone for semantic search in Persian search engines. This requires that the knowledge graph be accurate for user queries, specifically for the frequent queries. Also, since a significant number of user queries target recent knowledge (e.g. details about a new celebrity or a recent event), the knowledge construction mechanism should be specifically eager to extract knowledge about trending entities and relationships. To the best of our knowledge, none of the accessible knowledge graphs are optimized with respect to the user search query log.

FarsBase is designed to be specifically precise on the parts of the knowledge graph that are frequently searched. Aside from having precise knowledge for frequent queries, a significant share of the structural queries in a search engine correspond to recent content, such as queries about new celebrities and recent events. Therefore, our knowledge construction engine extracts new entities and relationships that have recently been introduced in Persian web sources.

FarsBase is automatically constructed from the Persian (Farsi) section of Wikipedia, and is expanded with other data sources such as raw text and web tables. Despite the fact that FarsBase is connected to various multilingual knowledge graphs, the main focus of FarsBase is on extracting knowledge from Persian sources. In fact, some multilingual information extraction tools already support Persian, but their accuracy on Persian sources is very low. In DBpedia, for example, the knowledge extraction engine is run on the entire Persian Wikipedia but very few triples are extracted. In contrast, FarsBase is primarily constructed to extract knowledge from Persian sources on the web.

FarsBase has been designed to be a multi-source knowledge graph. Even though other knowledge graphs also use multiple sources, none of them is designed to be a knowledge graph which accepts structural data and raw texts concurrently. For example, DBpedia mainly extracts triples from Wikipedia (some research has been done on augmenting DBpedia from other sources, e.g. [10], [11], [12]). Similarly, BabelNet is constructed from Wikipedia and WordNet only, and NELL has mainly focused on raw texts and extracts a limited number of predicates. To our knowledge, Yago is the only knowledge graph that supports multiple structural sources and can extract from raw text using an extension [13]. However, the architecture of Yago requires language-specific libraries that are of very poor quality for Persian.

3. Knowledge Sources

FarsBase is constructed by integrating data from different sources, including Wikipedia infoboxes, tables extracted from web pages, and raw text collected from various sources.

The information sources for building a knowledge graph should be rich and accurate. Various types of input sources may be used for knowledge graph construction, including:

– Existing knowledge graphs: Freebase, Wikidata.
– Encyclopedias: Wikipedia, World Book Encyclopedia.
– Domain-specific databases: IMDb, BrainyQuote, TripAdvisor, etc.
– Semi-structured web sources: web tables, infoboxes.
– Unstructured (raw) text, collected from the web and book texts.
– Direct human input: collected by human experts or crowdsourcing.

Each of the existing knowledge graphs used one or a couple of the above sources that were accessible at the time of knowledge graph construction. Table 1 summarizes the input sources for common knowledge graphs.

3.1. Availability of Sources in Persian

Sources for constructing Persian knowledge graphs are limited, both in terms of the number of available options and in terms of their quantity and quality. Wikipedia, for example, contains over 5.6 million English articles, but only 0.6 million articles in Persian. Aside from Wikipedia and raw text, the other common data sources for knowledge graph construction, shown in Table 1, have no Persian alternative or are so small that they are not worth extracting. For example, there are a few Persian websites similar to the IMDb movie database, such as sourehcinema.com, but their databases are almost a subset of the information already available in Persian Wikipedia.

Persian sources are sparse. In Wikipedia, for example, English articles usually contain more information than their Persian versions. More specifically, Persian Wikipedia articles are less verbose, have fewer links, and many of them have no infoboxes. This is mostly due to the smaller community of contributors for Persian, which also impacts the quality of the content. Other types of sources such as web pages have even poorer quality due to the massive amount of hoaxes and false information, which even appear on credible news sources.

Due to the various challenges associated with Persian sources, the first step of FarsBase construction is to select a set of sources that have comprehensive and accurate information.

Table 1
Sources used in famous knowledge graphs

Knowledge Graph          | Data Sources
Knowledge Vault          | Freebase, Wikidata, Wikipedia, raw text, tables and web pages, human knowledge
Wikidata                 | Wikipedia, human knowledge
Freebase                 | Wikipedia, domain-specific databases like NNDB, Fashion Model Directory, MusicBrainz, ...
DBpedia                  | Wikipedia
BabelNet                 | Wikipedia, WordNet
Yago                     | Wikipedia, WordNet, GeoNames
NELL                     | Raw text, crowdsourcing
Google Knowledge Graph   | Wikidata, Freebase, Wikipedia, CIA World Factbook, user feedback
Bing Knowledge Graph     | Freebase, Wikipedia, LinkedIn, user feedback, online databases like Foursquare, TripAdvisor, Yelp, BrainyQuote, IMDb, ...

3.2. Source Selection: A Cost-based Approach

In order to use FarsBase for query answering in search engines, its primary application, it should contain a diverse and comprehensive set of facts from all domains to cover a large share of user queries. Such a large knowledge graph should be collected with an affordable effort, hence we should consider the cost of knowledge extraction from each source. Extracting information from texts, tables, and other sources on the web is far more challenging and costly than from structured sources like Wikipedia, mainly due to the lack of structure and to quality issues.

General cost measures: The cost of exploiting a source can be defined as the human effort required for verifying its facts, e.g. in terms of "seconds per tuple". Using an unreliable data source might lead to false information and increase the demand for human verification. Errors generated by the knowledge extraction modules can also call for more verification.

Application-specific cost measures: In most applications, the facts of the knowledge graph are not of the same importance. In a search engine, for example, the importance of an entity or relationship is proportional to how frequently it appears in user queries. A false relationship about frequently-searched entities, e.g. "Barack Obama was born in Africa", has a higher impact on users than the same false information about non-famous people. Moreover, the cost of extracting a false relationship is much higher than that of missing a relationship from a source, which should be considered when tuning the sensitivity of the extraction algorithms.

We considered the following criteria for selecting and prioritizing the sources:

1. Quality of source: Some sources such as blogs and many news websites contain a high share of hoaxes and wrong statements. The amount of false information in a source should be quite low, say <5%, so that the data can be verified in batch or using automated methods.
2. Precision of the available extractor: Information extraction from raw text and web tables is more challenging compared to structured sources such as databases and wiki infoboxes. Only a few knowledge graphs are constructed based on raw text, most notably Knowledge Vault and NELL. Such knowledge graphs either use crowdsourcing (which needs a high amount of user contribution) or massively rely on other knowledge graphs, e.g. Freebase and Wikidata, which are not available for Persian. Moreover, the difficulty of the extraction task depends on the maturity of state-of-the-art NLP tools. Persian NLP libraries cannot efficiently extract knowledge, especially if the content has a high rate of misspellings and slang words. The immaturity of Persian NLP tools, along with the low quality of the Persian sources, makes it very challenging to extract high-quality knowledge in Persian.
3. Verification cost: Knowledge extraction should ideally be an automatic process. If the precision of the extracted information is higher than a certain threshold, say 95%, it could be permissible to accept all extracted data without manual verification. For lower qualities, however, the extracted information should be verified by human experts or human-in-the-loop labeling methods [14, 15].
4. Overlapping of sources: If a relationship is available in multiple sources, we prefer to extract it from the source with the lowest cost. Considering that most of the knowledge in FarsBase comes from encyclopedias such as Wikipedia, for other sources we are interested in how much extra information the new source can add to the existing knowledge graph.

To achieve a good trade-off between the quality and quantity of the extracted knowledge, in the following we provide a brief cost analysis for each of the available sources for Persian knowledge extraction.

3.2.1. Wikipedia
Wikipedia is a rich knowledge resource developed by millions of contributors. Most multi-domain knowledge graphs such as DBpedia, Yago, and BabelNet use Wikipedia as their primary knowledge source [4]. Although the Persian Wikipedia is smaller than the English version, it is still the most valuable and accurate Persian knowledge source in comparison with other sources. Many Wikipedia articles have one or more infoboxes. Knowledge graph construction from infoboxes is very cost-efficient, because the human effort is mostly spent on the mapping and transformation (sec 6.1) and the extracted data are accurate enough that no human verification is required on the extracted triples.

Aside from the structured data in Wikipedia (infoboxes, abstracts, categories, redirects, etc.), the raw content of the articles is very rich and has the highest quality compared to other raw text sources. This is further discussed in section 3.2.3.

3.2.2. Web Tables
Another rich source for triple extraction is web tables, from which one can easily extract a considerable number of entities and their relations. Due to the high variety of structures in web tables, the cost of extracting information from tables is much higher than from infoboxes, because it needs entity linking [16] and special extraction approaches [17][18][19].

3.2.3. Raw Text
Due to the limited size of Persian Wikipedia, it is inevitable to extract additional knowledge from other sources, most notably raw text and web tables. Hence, it is very important that the knowledge source contain a substantial amount of information with high quality, and that the source not be too complicated, e.g. not have complex grammar or table structure, so that the available tools can efficiently extract the knowledge.

The raw-text extractor can be applied to any type of raw text, such as web content, books, and OCR-extracted texts. However, raw text requires significant effort to extract and verify knowledge, because each triple must be verified by human experts. Therefore, we should select the most informative texts that contain a high number of entities and relationships, in order to make the high cost of extraction worthwhile.

To select the most cost-effective sources as input for the raw-text extractor, we investigated four types of raw-text data: Wikipedia article bodies, news articles, and technical and personal blog posts. We picked a set of randomly selected articles from each source, and manually counted the number of triples that are suitable for the knowledge graph. The data from all sources was collected on November 1, 2016.

To investigate the difficulty of knowledge extraction from each raw text source, we investigate the impact of co-reference resolution for each data source. Co-reference resolution can help the extractor find more triples in the raw texts, but the quality of state-of-the-art Persian NLP toolsets for co-reference resolution is not very high. In all experiments, experts counted the number of triples with and without co-reference resolution.

We applied a few preprocessing steps to each data source to ensure a valid comparison:

– Wikipedia article bodies: We randomly sampled 50 Wikipedia articles with a minimum length of 500 words, and triples were extracted by experts.
– News articles: We sampled 50 news articles from a famous news agency website (Tabnak, http://tabnak.com). Note that most Persian news websites contain a large amount of auto-generated values, such as tables and summaries of stock, weather, currency, gold, etc. Since the auto-generated content reflects daily events and is not valuable for KGs, we excluded it from the body of the samples. We observed that most news articles contain a very small number of useful triples.
– Blogs: We sampled 50 posts from Persian technical weblogs, and another 50 samples from literary weblogs.

3.2.4. Evaluation of Raw Text Sources
Table 2 shows the average number of triples in each category of raw text sources, measured by human experts with and without co-reference resolution. Results are normalized by length to represent the number of triples per 1000 words.
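The normalization used for Table 2 is a simple per-1000-words scaling of the expert counts; the numbers below are hypothetical per-article counts chosen only to illustrate the computation:

def triples_per_1000_words(triple_count, word_count):
    """Normalize an expert's triple count by the length of the annotated text."""
    return 1000.0 * triple_count / word_count

# hypothetical counts for one 500-word sampled article (illustration only)
without_cr = triples_per_1000_words(triple_count=23, word_count=500)   # -> 46.0
with_cr = triples_per_1000_words(triple_count=34, word_count=500)      # -> 68.0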

Table 2
Number of triples per 1000 words for various raw text sources

Source          | # Triples | # Triples with C.R.
Wikipedia       | 46        | 68
News sites      | 3         | 3.3
Technical blogs | 1.4       | 1.6
Literary blogs  | 0.2       | 0.26

Table 3 summarizes various cost aspects of the available data sources for FarsBase, including the need for natural language processing (especially co-reference resolution), the number of triples, the quality of each triple, and the overall cost of extracting knowledge from the source. Considering all these parameters, we built FarsBase using the structured data parts of Persian Wikipedia (primarily infoboxes), along with a selection of Persian news websites to cover recent information. We also support a high-precision approach to extract data from web tables, which requires explicit human supervision.

In more detail, we use the following data: Wikipedia structured data (Wikipedia infoboxes, entity categories, ambiguities, external links, web pages, images, connections between entities, rewriting ids and so on), tables (with explicit human labeling), Wikipedia article bodies, and news websites.

4. Architecture

FarsBase consists of a construction system for extracting knowledge from various sources, as well as a search system for answering knowledge-graph-based user queries and suggesting similar entities in a search engine. Figure 1 shows an overview of the FarsBase architecture.

The construction and search systems benefit from shared components and data in a symbiotic manner: the construction system updates the knowledge graph with the most recent triples, and this helps the Resource Extractor to extract more entities and ontology predicates from the input sources to feed the construction system.

4.1. Knowledge Graph Construction

FarsBase extracts knowledge from various types of sources, including Wikipedia, semi-structured web data (e.g. web tables) and raw texts. The extracted data is then cleaned and organized with partial supervision by the expert.

Extractors. FarsBase supports various input sources with different data formats. The three main extractors are as follows:

– Wikipedia extractor (WE): Extracts triples from the Wikipedia dump, including the infoboxes, abstracts, categories, redirects, and ambiguities.
– Table extractor (TE): Extracts triples from web pages that contain tables with informational records. The schema of the table, i.e. the mapping to the KG, is suggested by the system but should be verified by a human expert. The current version of TE is mostly designed for parsing tables in Wikipedia, where the mapping is easier to infer.
– Raw-text triple extractor (RTE): Consists of four extractor modules, namely rule-based, distant supervision, dependency pattern and unsupervised triple extraction. RTE continuously extracts new triples for the KG, and prioritizes its input sources (websites) based on the facts that are in demand. The raw-text data for RTE is provided by sources crawled by a search engine; RTE consumes and archives the fed raw texts.

Each extraction module preserves various metadata, including the version and URL of the source that the triple has been extracted from. For example, when WE reads a new dump to generate triples, the version of the dump is stored along with all the extracted triples.

Source Selector. The raw-text extractor is fed by the crawler of a Persian search engine, although it can use an independent crawler such as Heritrix or Apache Nutch. The Source Selector module prioritizes web pages based on the cost-efficiency of the sources, as discussed in sec. 3.2. The module also exploits the query log of the search engine to prioritize web pages based on the frequency with which they appear in search results and their click-through rates.

Experts. The extracted predicates must be mapped to a unified ontology. The mapping table is incrementally constructed by the human experts and stored in the CFM-Store. The experts are also used for various verification tasks, most importantly for verifying the mapping schemata for TE and all the triples extracted by RTE.

Mapper. The Mapper integrates triples generated by all extractors, and converts them into mapped triples. The mapped triples and their corresponding meta-data are written to the Candidate Fact and Meta-data Store (CFM-Store). The meta-data of triples include document URLs, the extraction module, version information, the date of extraction, source texts (if available), and expert votes and notes.
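As an illustration of what the Mapper hands to the CFM-Store, the sketch below models one candidate fact together with the kinds of meta-data listed above; the field names are ours and only approximate the actual FarsBase schema:

from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class CandidateFact:
    subject: str                       # mapped URI, e.g. "fbr:..."
    predicate: str                     # mapped ontology predicate, e.g. "fbo:birthPlace"
    obj: str                           # URI or literal value
    source_url: str                    # document the triple was extracted from
    extractor: str                     # "WE", "TE" or one of the four RTE modules
    source_version: str                # e.g. the Wikipedia dump version
    extracted_on: date                 # date of extraction
    source_text: Optional[str] = None  # source sentence, if available
    expert_votes: list = field(default_factory=list)
    expert_notes: list = field(default_factory=list)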

Table 3
Cost of available sources for FarsBase

Source                          | Co-references | Number of Triples | Accuracy of Triples | Cost       | Used in FarsBase
Wikipedia (excl. raw text)      | -             | Very High         | Very High           | Affordable | Yes
Web tables                      | -             | Very High         | Very High           | Very High  | Limited
News sites (text)               | Average       | Low               | High                | Very High  | Yes
Formal/technical weblogs (text) | Low           | Low               | Low                 | Very High  | No
Personal weblogs (text)         | Low           | Very Low          | Very Low            | Very High  | No

Fig. 1. FarsBase architecture

Candidate Fact and Meta-data Store (CFM-Store). The CFM-Store stores all triples and their meta-data. FarsBase supports versioning, i.e. the CFM-Store contains various versions of each triple. The latest version of these triples, after expert approval, is propagated periodically at specified intervals (currently once a day) to the Belief-Store, without any metadata.

Filtering/Updating Triples. When transferring the triples to the Belief-Store, the Update Handler re-builds the Belief-Store using the latest version of the mapped triples. Therefore, if a relation is removed from the input sources, it will no longer be written to the Belief-Store.

Resource Extractor. Qualified triples in the Belief-Store are used by the extraction engine RTE, along with the search engine and the related-entity recommender that respond to the end-users. Nevertheless, all three of these systems rely on a single shared component, the resource extractor, which extracts all possible entities and relationships from a given text in a brute-force manner. The resource extractor does not apply any filtering or disambiguation on the extracted entities. Rather, it extracts all combinations of entities and relationships that might overlap with each other, and further provides features such as the positions and proximities of each URI, so that the target components can adopt custom policies for filtering and disambiguation. This is very important in Persian, because many Persian words are written with one or more spaces.

Note that the RTE relies on the resource extractor, which in turn is backed by the Belief-Store. Therefore, the Belief-Store should be initially filled by extracting and mapping data from Wikipedia and web tables, and then the resource extractor can be used to run RTE.
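A rough sketch of the daily promotion step described above — the Update Handler rebuilding the Belief-Store from the latest expert-approved version of each mapped triple — assuming CandidateFact-like records with an approved flag (the data shapes are ours, for illustration only):

def rebuild_belief_store(cfm_store):
    """cfm_store: iterable of records with subject, predicate, obj,
    source_version and approved fields (see the sketch in Sec. 4.1)."""
    latest = {}
    for fact in cfm_store:
        key = (fact.subject, fact.predicate, fact.obj)
        # keep only the newest version of each mapped triple
        if key not in latest or fact.source_version > latest[key].source_version:
            latest[key] = fact
    # the Belief-Store is rebuilt from scratch on every run, so relations that
    # disappeared from the input sources are simply not re-added; only
    # expert-approved facts are promoted, stripped of their meta-data
    return {(f.subject, f.predicate, f.obj) for f in latest.values() if f.approved}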

4.2. The Search System

The FarsBase system is designed to respond to unstructured queries with structured output (entities) as the result. While the details of the search system are beyond the scope of this paper, we briefly explain the main components.

4.2.1. Searcher
Much of the work on search over knowledge graphs targets question answering systems, where the goal is to retrieve snippets of the documents related to the question which contain the answer [20]. However, we aim to design a system which answers specific questions with highly reliable results instead of returning snippets.

An important requirement for KB-based search in search engines is that any wrong entity is intolerable: a single wrong result by the KG-based search system for "President of the US" could be regarded as a disaster for the entire search business. As a result, KG-based search is only admissible if the precision is very close to 1. With this in mind, we follow a template-based approach which is similar to TBSL [21], where text queries are mapped to structural SPARQL queries according to a set of templates. Unlike [21, 22], we do not directly allow auto-generated templates to be fed into the searcher. Instead, we first generate a set of templates, and then each template is verified by human experts to ensure that the precision is well preserved. To handle the very large number of generated templates, we exploit the query log to rank templates based on how frequently they are triggered, and then the most frequent templates are verified by the experts.

4.2.2. Related Entity Recommender
Entity recommendation, i.e. suggesting related entities for a given entity, is an interesting service for search engines. FarsBase exploits relevance propagation through heterogeneous paths in the knowledge graph to estimate the relevance between entities. The training is done using an active learning approach.

5. Extraction

FarsBase extracts knowledge from three types of data sources, namely Wikipedia, web tables, and raw text. We briefly explain each extractor, and the specific information that can be extracted from each source.

5.1. Wikipedia Extraction

The Wikipedia extractor in FarsBase is similar to those of DBpedia [23] and Yago [3]. The most important data types extracted from Wikipedia are as follows:

Infobox Data. Infoboxes contain a list of attributes and values. Note that an attribute might contain more than one value, e.g. "occupation: [job1, job2, ...]". A value can be a string literal, an image, a link to other Wikipedia articles, an external link, or a combination of these types. The values can also be an object that contains a list of key-values.

Abstracts and Body of Articles. Wikipedia's body and abstract contain internal links which are valuable sources for the raw-text and table extractors. In particular, internal links are used to automatically extract high-confidence patterns for extracting facts from raw texts.

Redirects. Entities may have multiple titles. For example, "سعدی شیرازی" (Saadi Shirazi, a Persian poet) has multiple alternative titles that refer to his name in Persian Wikipedia, such as "استاد سخن" (The Master of Speech) and "افصح المتکلمین" (The Most Eloquent Speaker).

Disambiguation pages. Different entities might have a common title, such as people with the same name. For each ambiguous term, there is a disambiguation page that specifies all entities that might refer to it. Disambiguations and redirects of Wikipedia are specifically helpful for accurate entity resolution in the other extractors.

Categories. Each Wikipedia article has one or more categories, e.g. "Saadi Shirazi" can belong to "13th-century Persian poets" and "Persian Poets". Despite the fact that categories have some sort of hierarchy, e.g. they might have sub-categories, the categories are not well-structured and can be treated as a set of tags assigned by different people to each entity. Nevertheless, data analytics applications such as related-entity recommendation or knowledge graph search can leverage these labels. Moreover, various enhancements can be applied to improve the quality of category labels [24].

Images. The images of entities can be used for constructing the image-based FarsBase, but the details of this system do not fall within the scope of this paper.

Inter-language and Inter-KG Links. Articles belonging to the same entity in different languages are linked to each other in Wikipedia. Also, other knowledge graphs and ontologies might have links to Wikipedia articles. Linking entities is further done in the mapping stage to integrate all records from different sources that belong to the same entity.

5.2. Tables

Information extraction from tables is a challenging task, specifically because the schemas of web tables are very flexible. Indeed, web tables are primarily designed for viewing purposes, not for storing a collection of data, such that many tables have split and merged cells.

In most long tables, the subjects of the underlying triples are in a specific column of the table, although the subjects might also be in a specific row. More importantly, it is not trivial to recognize whether a table contains significant information to be extracted. Despite these challenges, some research has been done to automatically extract entities or triples from tables [16][19][25].

5.3. Raw Text

The Raw Text Extractor (RTE) extracts triples from unstructured text based on the current knowledge of the KG. Even though FarsBase's RTE engine has some aspects of never-ending learning (NELL), it is primarily designed to supplement the KB with information that is missing from Wikipedia. The triples supplied by the RTE should be of almost the same accuracy as the information extracted from Wikipedia, in order not to degrade the quality of the KG for semantic search. Therefore, RTE mostly adopts methods that maintain a high precision, such as rule-based approaches, even though this can sacrifice recall. Note that all triples extracted by RTE are verified by the expert; however, evaluating a highly accurate result requires much less effort because the triples can be grouped and verified in batches.

Before extracting the triples from the raw text, some preprocessing tasks must be applied to the text, including sentence boundary detection, word tokenization, part-of-speech (POS) tagging, base phrase chunking, dependency parsing, co-reference resolution and entity linking. Most of the tools developed for these preprocessing steps in the Persian language are not precise enough or are only developed at the laboratory level and cannot be used in real environments; many basic NLP tasks are still open research problems for Persian [26][27][28]. Thus, the lack of precise NLP tools for the Persian language causes error propagation in the triple extraction process.

Developing an RTE engine for Persian requires much more effort compared to English RTEs. For example, there are no co-reference resolution and entity linking modules with enough accuracy, so we developed these libraries with only the basic functionalities needed for RTE. Moreover, a base phrase chunker (BPC) could be very useful in relation extraction, but there is no reliable library for BPC in the Persian language, thus we did not use BPC for preprocessing the RTE input. Nevertheless, for the other preprocessing tasks we used the JHazm library (https://github.com/mojtaba-khallash/JHazm).

FarsBase has four modules for raw-text triple extraction, namely rule-based extraction, distant supervision, dependency patterns, and unsupervised extraction. In the following, we briefly describe how the different RTE methods pre-process and extract triples in FarsBase.

5.3.1. Entity Linking
Entity linking is the task of linking the entities mentioned in raw text to their corresponding KB URIs. It is a very essential task for triple extraction, because the subject of a triple must be linked to an entity URI. Once the URI of an entity in the text is known, we can leverage its type (ontology class) for further post-processing and filtering.

Entity linking in RTE uses the resource extractor module, which finds the links of all resources in a given input text. More specifically, the entity linker identifies the entities, categories and ontology predicates in the text, based on the known labels in the KG. Each resource may have more than one label, and one word in the text may be shared between more than one resource.

The resource extractor does not perform any disambiguation analysis and merely finds the labels of all potential resources. For example, a phrase may be linked to more than one entity, or even some verbs and adverbs may be detected as a Village or Person entity. This is mainly because the modules that use the resource extractor, such as search and RTE, have different requirements and thresholds for disambiguation. RTE has a separate module, named entity linker, which gets the extracted URIs as input and disambiguates the entities in the sentence, so that each word is linked to one and only one entity. The entity linker removes the other types of resources, including the predicate and category links. The disambiguation process uses three heuristics:

POS Tags. The POS tag of each entity is determined. If a link contains just one word with certain POS tags, it is eliminated. The set of POS tags consists of P, Pe, CONJ, POSTP, PUNC, DET, NUM, V, PRO and ADV. The labels of many entities in FarsBase are homographs of verbs or prepositions, e.g. "می‌رود" can refer to a village or to the present continuous form of the verb "go", and "به" can refer to a preposition or to a fruit (quince).
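The POS-tag heuristic above amounts to a small filter over the candidate links produced by the resource extractor; a minimal sketch (the candidate-link structure is simplified and ours):

# single-word candidate links with these POS tags are dropped (Sec. 5.3.1, "POS Tags")
BLOCKED_POS = {"P", "Pe", "CONJ", "POSTP", "PUNC", "DET", "NUM", "V", "PRO", "ADV"}

def filter_links_by_pos(candidate_links):
    """candidate_links: dicts like {"words": ["به"], "pos": ["P"], "uri": "fbr:به"}."""
    kept = []
    for link in candidate_links:
        if len(link["words"]) == 1 and link["pos"][0] in BLOCKED_POS:
            continue  # e.g. "به" tagged as a preposition is not linked to the fruit entity
        kept.append(link)
    return kept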

Homograph. Homographs are words that are written the same but have different meanings. For example, even though the word "very" is a common modifier in English, in a small portion of texts it may refer to other entities such as the "Very" company. State-of-the-art NLP methods are still not very reliable for disambiguating such rare homographs [29]. Therefore, specific and rare entities are ignored using a manually-created list.

Classes with Ambiguous Instances. Some entities have a very generic name that may cause a high level of ambiguity for raw-text extraction. For example, "چهل سالگی" ("At the age of 40") is a famous Persian movie, although it can also be part of an ordinary sentence, e.g. "Alex died at the age of 40". Such entities are very common in special classes, like movies (movie names), books, and artworks. To alleviate the ambiguity issue, if the detected entity belongs to certain classes, such as Movies, we look for more evidence in the surrounding context using a reference list. For example, if a sentence contains "At the age of 40" (as a movie name), we require that the surrounding context also contain phrases such as "movie", "channel", "video", etc.; otherwise the linking is ignored.

Method 1. This type of disambiguation is based on the context of the words. If a word is linked to more than one entity, the word's context is compared with the body of the corresponding Wikipedia article for each entity; the cosine similarity between the context and each article body is used as a measure to sort the entities. The entity with the highest probability is linked to the word and the other entities are added to the word's "ambiguity-list". For each ambiguous word, the entity linker returns 0 or 1 linked entity, an "ambiguity-list", and the probability of each ambiguous entity.

Method 2. Suppose we have a text, and the resource extractor has extracted a total of n entity links in a list L. For each of these entity links, all internal links in the corresponding Wikipedia article are added to a list V together with their frequencies. For each link in L, we then find its frequency in V. The greater the link frequency, the higher the probability in the disambiguation process.

Probability Calculation. We calculate the probability of each entity based on the formula 0.7·P1 + 0.3·P2, where P1 is the probability calculated by Method 1 and P2 is the probability calculated by Method 2.

Low Probabilities. When the probability of all entities is lower than an adjustable threshold, none of them is linked to the word, but all of them are added to the ambiguity-list.

The precision of FarsBase entity linking is 67.8%, as evaluated by human experts on 30,000 words.

5.3.2. Pronoun Resolution
Table 2 shows that co-reference resolution increases the recall of triple extraction by up to 46 percent. To have a basic co-reference resolution, we have used a baseline algorithm for pronoun resolution [30]. In this approach, for each pronoun in the text, we choose as the antecedent the most recent subject that agrees in number and gender. To detect the gender of person entities, we have used a gazetteer (which was built for a named entity recognizer) that contains a list of male and female first names in the Persian, Arabic, Turkish, English and French languages. For the first names that are not in the list, we randomly choose a gender. The success rate of this algorithm on a small corpus of 15,321 words of Persian Wikipedia was 68.9%.

5.3.3. Rule-based Triple Extraction
The rule-based extraction module is a high-confidence extraction module which works based on Stanford TokensRegex (https://nlp.stanford.edu/software/tokensregex.html) [31]. Instead of using named entity tags, we use the class of each entity after entity linking has been applied to the text. 60 rules have been written based on words, POS tags and entity classes. The subject and object are defined in the body of each regular expression, and each expression points to a predicate in FarsBase. For example, when the following TokensRegex pattern matches in a sentence, RTE extracts a fb:birthPlace triple:

[{word:/پایتخت/}]
(?$object [{ner:/.+Country.*/}])
[{word:/شهر/}]?
(?$subject [{ner:/.+Settlement.*/}]+)

5.3.4. Extraction with Distant Supervision
Distant supervision is an approach for information extraction from raw text [10, 32]. It is based on the intuition that the facts in the knowledge graph already have instances in the raw text sources. By aligning known facts, i.e. triples, with the corresponding sentences, we can automatically create a training set, so that a classifier can be trained to extract more facts from the raw text.
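A minimal sketch of this alignment idea — known FarsBase triples whose subject and object labels co-occur in a sentence become weakly labeled positive examples — with helper names and data shapes of our own choosing:

def align_sentences_with_facts(sentences, facts, labels_of):
    """sentences: list of raw-text sentences.
    facts: iterable of (subject_uri, predicate, object_uri) triples from the KG.
    labels_of: function mapping a URI to its known surface labels."""
    examples = []
    for sent in sentences:
        for subj, pred, obj in facts:
            if (any(label in sent for label in labels_of(subj))
                    and any(label in sent for label in labels_of(obj))):
                # weakly labeled positive training example for predicate `pred`
                examples.append((sent, subj, obj, pred))
    return examples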

Inspired by Aprosio et al. [10], we developed training data by automatically aligning known FarsBase facts with the text of Persian Wikipedia. Out of 2.5 million sentences from Persian Wikipedia articles, 172,368 sentences were mapped to facts in FarsBase. The mapped sentences were evaluated by human experts, and 16,745 sentences were verified. To improve the accuracy, we added a number of negative examples, i.e. sentences without any relationship between their entities.

We used LibSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) [33] to train a kernel SVM classifier. We used the following features:

– POS tags between the two entities
– Words after the first and before the second entity within a certain window
– Words before the first and after the second entity within a certain window
– Class of each entity.

Using the above feature set, we reached a precision of 54% on the gold data. To increase the precision, we used a type-checking algorithm, which eliminates triples that are not compliant with the domain and range constraints of their predicates. For example, the subject of nationalTeam must be a Person, and its object must be a Team. Therefore, if the extracted triple is "Iran nationalTeam Africa", it will be eliminated due to non-compliance with the domain constraints.

5.3.5. Extraction with Dependency Pattern
Extraction with dependency patterns is an innovative method that attempts to extract triples using "unique dependency trees". By definition, two dependency parse trees have the same dependency pattern if replacing each word with its corresponding POS tag generates two identical trees; in other words, the sequences of the POS tags in both sentences are exactly the same, and the dependency head and type for each position are also the same. Figure 2 shows two sentences with a unique dependency pattern.

Extraction with dependency patterns is based on the intuition that if a sentence contains one or more triples, other sentences with the same structure (the same dependency pattern) contain triples too. In such cases, the subject, object and predicate can be extracted from the words with the same indexes in all sentences. For example, if in one sentence the subject consists of the first two words, then in all other sentences with the same dependency pattern, the subject can be extracted from the first and second words.

Note that the triples extracted by this method are not linked to entities, and a mapping process must link the subject and predicate to the KG resources.

This method is a supervised method, and human experts must define the position of the subject, predicate, and object in the sentences. Currently, 2000 frequent dependency patterns are extracted automatically from Wikipedia texts and annotated by the experts. This method has extracted 240,320 triples from Wikipedia articles based on these 2000 patterns.

5.3.6. Unsupervised Extraction Method
In this section, we introduce an unsupervised method for triple extraction which works based on dependency parsing and the constituency tree. Unfortunately, there are no good libraries for constituency parsing in the Persian language. Our unsupervised method needs the main phrases of a sentence, and we have used the dependency parse tree to obtain these phrases. A base phrase chunker could be used in the future to this end.

In the following, we explain the method with an example. Consider the sentence shown in Figure 3. Given the dependency parse and constituency trees, consider the following definitions:

– A verb phrase (VP) is a phrase which contains at least one verb.
– If the head of a phrase is not connected to the verb in the dependency parse tree, it is called an ignored phrase (IP) and is not involved in the extraction process.
– After entity linking, a phrase in which all of the words are linked to one and only one entity is called a linked phrase (LP).
– A phrase which includes at least one verb or noun is called a candidate phrase (CP). Phrases which do not include any noun or verb will not be included in triple extraction.

Triples are extracted under the following conditions:

– If a sentence does not have one and only one VP, it is ignored in the extraction process.
– If the sentence has one VP and two LPs, the method extracts a triple as (LP1, VP, LP2).
– If the sentence has one VP, one LP and more than one CP (N > 1), the method extracts N triples as (LP, VP, CPi), where i ∈ {1..N} and CPi is the i-th candidate phrase.

Fig. 2. Two sentences with the same dependency pattern. The pattern is placed at the bottom of the figure.

Fig. 3. Dependency parse tree and main phrases of a sample Persian sentence.

Also, we can extract some relations under the following condition, but the relations extracted in this case are not KG triples because the subject has not been linked to any KG entity:

– If the sentence has a VP and more than two CPs (M > 2), the method extracts relations of the form (CPi, VP, CPj) for all pairs of i and j.

The precision of this method is 67%, as evaluated on 600 sentences. Given this precision value, the main weakness of this method is that no confidence score is assigned to the extracted triples.
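The extraction conditions of the unsupervised method (Sec. 5.3.6) can be summarized in a short sketch; the phrase objects and their grouping are assumptions, since the real module works on the Persian dependency parse:

def extract_from_sentence(vps, lps, cps):
    """vps, lps, cps: verb, linked and candidate phrases of one sentence."""
    if len(vps) != 1:                    # sentences without exactly one VP are ignored
        return []
    vp = vps[0]
    if len(lps) == 2:                    # (LP1, VP, LP2)
        return [(lps[0], vp, lps[1])]
    if len(lps) == 1 and len(cps) > 1:   # (LP, VP, CP_i) for every candidate phrase
        return [(lps[0], vp, cp) for cp in cps]
    if not lps and len(cps) > 2:         # relations only: subject not linked to the KG
        return [(cps[i], vp, cps[j])
                for i in range(len(cps)) for j in range(len(cps)) if i != j]
    return []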

6. Mapping and Integration

The triples extracted from the knowledge sources originate from Wikipedia, web tables, and raw texts. We may obtain the same predicate in several forms; for example, "place of birth" and "birthplace" have the same meaning but are extracted from different sources. We may also extract a predicate in a different language from a Persian Wikipedia article. Therefore, a unification process is needed to turn such cases into a single, specific predicate; the mapping process is designed to do this task. Mapping was originally suggested by DBpedia [1] to map Wikipedia infoboxes to an ontology. DBpedia uses mapping rules, which are hard to write and maintain; we use rule tables instead, which are much more straightforward.

From our point of view, integration means taking the triples generated by all possible sources and converting them to mapped data. Our integration system supports a highly flexible approach based on function-based triggers.

In this section, we first explain the mapping process and our initial mapping phase for data triples. We then show how FarsBase integrates mapped data from multiple sources, how candidate facts are updated, and by which process they are promoted to beliefs.

6.1. Mapping Wikipedia Infoboxes

Mapping is one of the most important components of FarsBase. Wikipedia infoboxes contain predicates in different syntactic shapes and different languages. Without data mapping, practically no SPARQL queries are usable, and a large number of predicates would have to be queried in each query.

6.1.1. The DBpedia Approach
[23] proposed an approach to map Wikipedia infobox data to triples based on a mapping markup that has been maintained by a community since 2010. DBpedia mapping can be divided into two major parts: the mapping of infoboxes to ontology classes, and the mapping of the properties in the infoboxes to ontology predicates. The first part requires an ontology to be built; the ontology construction process is bottom-up, that is, each Wikipedia infobox must be mapped to a class in the DBpedia ontology. In the second part, experts individually assign each attribute in every infobox to the corresponding ontology predicate. All of this mapping data is stored in the DBpedia Mappings Wiki (http://mappings.dbpedia.org/).

In DBpedia, all infobox attributes in all languages that have the same meaning are mapped to the corresponding ontology predicates. Similarly, URIs of ontology predicates in FarsBase are not in the Persian language and are mapped to internationalized URIs. The experts who enter this mapping information must be sufficiently familiar with standard vocabularies such as OWL, RDFS, and FOAF.

6.1.2. Enhancements over DBpedia
Our mapping approach differs from the DBpedia approach; the main differences are the enhanced user interface, automated mapping proposals, transformer functions, and mapping support for raw texts.

Mapping Tables. DBpedia uses a Mappings Wiki, which is rather unfamiliar to linguistic experts; it is also hard to define, read, and maintain the mapping entries in it. We propose a more friendly method to store and handle the mapping data. This method is described completely in Section 6.2; it shows how we can convert any Wikipedia infobox attribute/value pair to ontology triples without limitations, and also how to ignore useless data in the transformation phase.

Automated Mapping and Proposals. Part of the mapping operation is semi-automatic, and the rest is done by experts. In the Persian Wikipedia, some infobox types as well as their attributes are written in English (but displayed in Persian in the user interface); in such cases, the DBpedia mappings are transferred to our mapping data. In other cases, translations are used to generate mapping proposals: the system automatically translates Persian properties into English and looks up their equivalents in the English DBpedia, and human experts finally confirm them. Therefore, this process is semi-automatic. In the remaining cases, the system proposes standard predicates to the expert while typing.

Transformers. We propose transformer functions, which automate data cleansing and normalization and require minimal input from the experts.

Mapping Support for Raw Text Extraction. We propose an ontology-aware mapping method for raw texts in Section 6.4; the mapping entries generated from Wikipedia are reused in raw text extraction.
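As an illustration of the proposal step, the sketch below looks up a machine-translated attribute name in a small table of English DBpedia-style mappings and emits a proposal for expert confirmation. It is a hedged sketch under our own assumptions: the translation function, the sample mapping table, and the names propose_predicate and ENGLISH_MAPPINGS are hypothetical and not FarsBase's actual interfaces.

    from typing import Callable, Dict, Optional

    # Hypothetical excerpt of English DBpedia-style attribute-to-predicate mappings.
    ENGLISH_MAPPINGS: Dict[str, str] = {
        "birth_place": "fbo:birthPlace",
        "birth_date": "fbo:birthDate",
        "mayor": "fbo:mayor",
    }

    def propose_predicate(persian_attribute: str,
                          translate: Callable[[str], str]) -> Optional[dict]:
        """Translate a Persian infobox attribute and look it up in the English mappings.
        The returned proposal still needs confirmation by a human expert."""
        english = translate(persian_attribute).strip().lower().replace(" ", "_")
        predicate = ENGLISH_MAPPINGS.get(english)
        if predicate is None:
            return None
        return {"attribute": persian_attribute, "predicate": predicate,
                "status": "proposed"}  # promoted only after an expert confirms it

    # Usage with a toy translator standing in for the real translation service.
    toy_translate = {"محل تولد": "birth place"}.get
    print(propose_predicate("محل تولد", lambda s: toy_translate(s, s)))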

6.1.3. Ontology Construction
The ontology is one of the most important parts of any knowledge graph. In FarsBase, each entity is an instance of one of the classes. The ontology is a tree of classes, each of which has one and only one parent. For example, سی_و_سه_پل is a Bridge, ArchitecturalStructure, Infrastructure, RouteOfTransportation, Place, and Thing. The main class of each entity (the deepest class in the tree) is declared by the fbo:instanceOf predicate:

    fbr:سی_و_سه_پل  fbo:instanceOf  fbo:Bridge .

All classes of an entity are declared with the rdf:type predicate:

    fbr:سی_و_سه_پل  rdf:type  fbo:Bridge .
    fbr:سی_و_سه_پل  rdf:type  fbo:ArchitecturalStructure .
    fbr:سی_و_سه_پل  rdf:type  fbo:Infrastructure .
    fbr:سی_و_سه_پل  rdf:type  fbo:RouteOfTransportation .
    fbr:سی_و_سه_پل  rdf:type  fbo:Place .
    fbr:سی_و_سه_پل  rdf:type  fbo:Thing .

The ontology of FarsBase consists of 767 classes. The main body of this ontology is borrowed from the DBpedia ontology. Classes with no entities in the Persian Wikipedia have been removed (such classes are usually not used extensively in the Persian language and culture), and new classes have been added based on their presence in the Persian Wikipedia. The added classes are listed in Table 4.

Table 4
New classes added to the original ontology based on the Persian language and culture
New class                              Parent class
Mineralogic Zone (Index Mineral)       Mine
Rural District                         Governmental Administrative Region
Qanāt (Kāriz, Underground Canal)       Canal
Waterfall                              Stream
Electoral District                     District
County                                 Governmental Administrative Region
Short Story                            Book
Marja'                                 Cleric
Imāmzādeh                              Religious Building
Scholar (Muslim Scholar)               Scientist
Martial Art                            Sport
Seminary                               Educational Institution
Imām                                   Cleric
Militant Group                         Military Unit
Intangible Cultural Heritage           Thing
Exchange Market                        Organisation

Although the ontology construction is a bottom-up approach, the resulting ontology has been approved by experts. The experts have also added valuable information to the ontology, e.g. alternative labels for classes and predicates, and ranges and domains. Ranges and domains are two of the most valuable pieces of information about properties and are very useful for the type-checking process. Defining the range and domain of a property is hard work even for experts, so FarsBase supports this process by suggesting a range and a domain for each property: the system calculates the range and domain of each property automatically, and this automatically generated information is saved as triples in the KG (fbo:autoRange and fbo:autoDomain). Note that the number of distinct properties in the KG is very high (25,032), so the system proposes the more frequent properties to the experts for approval. Many of these properties are never approved by the experts, yet the automatically extracted range and domain data remains valuable.
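A minimal sketch of how such suggestions can be derived, assuming the suggested range of a property is simply the most frequent class observed among its objects (the domain would be computed symmetrically over subjects); the triple layout and the function name suggest_ranges are illustrative, not FarsBase's internal API.

    from collections import Counter, defaultdict
    from typing import Dict, Iterable, Tuple

    Triple = Tuple[str, str, str]  # (subject, predicate, object)

    def suggest_ranges(triples: Iterable[Triple],
                       instance_of: Dict[str, str]) -> Dict[str, str]:
        """Suggest an fbo:autoRange value per property: the most frequent class of its objects."""
        classes_per_property = defaultdict(Counter)
        for _, predicate, obj in triples:
            if obj in instance_of:              # only objects whose main class is known
                classes_per_property[predicate][instance_of[obj]] += 1
        return {p: counts.most_common(1)[0][0]
                for p, counts in classes_per_property.items()}

    # Toy example: fbo:mayor mostly points to Person entities, so fbo:Person is suggested.
    triples = [("fbr:CityA", "fbo:mayor", "fbr:X"), ("fbr:CityB", "fbo:mayor", "fbr:Y"),
               ("fbr:CityC", "fbo:mayor", "fbr:CityD")]
    types = {"fbr:X": "fbo:Person", "fbr:Y": "fbo:Person", "fbr:CityD": "fbo:City"}
    print(suggest_ranges(triples, types))   # {'fbo:mayor': 'fbo:Person'}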
Defining the range In this example, two triples for the type (fbo:City) and 40 41 and domain of a property is a hard work even for ex- location (fbr:Iran) are generated. 41 42 perts, FarsBase help these process by suggesting range In the next step, mapper checks the infobox attribute 42 43 and domain for each property. System calculates the in all triples and looks it up in the rule set. For example, 43 44 range and domain of each property automatically. This the mayor’s attribute turns into a corresponding predi- 44 45 automatically generated information is saved as triples cate. Also wheat_production matches with two rules in 45 46 in the KG (fbo:autoRange and fbo:autoDomain). Note the rule set. As a result, several triples corresponding 46 47 that the number of different properties in the KG is to such attribute are generated. 47 48 very high (25032) and system propose more frequent The transformer functions are certain functions de- 48 49 properties to the expert to approve them. Many of these fined in the system. The inputs of these functions 49 50 properties are not approved by the experts and this au- are always a string value and output is numerical, 50 51 tomatically extracted data is valuable yet. string, or date value. For example, the minMaxRange- 51 16 M. Asgari et al. / FarsBase: The Persian Knowledge Graph

For example, the minMaxRangeToMin function receives an interval string (a minimum and a maximum) and returns the minimum value (input: "12-13", output: 12), and the latLong function receives a string corresponding to a latitude or longitude, detects its format, and generates a decimal number (input: "35 41' 36"", output: 35.69333333). This method offers great potential for converting existing Wikipedia data into RDF triples.

Also, for some attributes in Wikipedia infoboxes, we have mapped the attribute to NULL, meaning that the key is ignored in the mapping operation and its information is not stored in the knowledge graph. For example, the size of an image in a Wikipedia article is not valuable information.

Storing the mapping data in rule sets makes the mapping process easier to represent and maintain, and the above method is friendly and flexible enough to handle all complicated cases.
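The following sketch shows how rules of the kind listed in Table 5 could be applied, including two illustrative transformer functions. It is a simplified, hypothetical rendering of the mechanism described above; the rule fields mirror Table 5, but the Python data structures and function names are our own assumptions rather than FarsBase's implementation.

    import re
    from typing import Callable, Dict, List, NamedTuple, Optional

    def min_max_range_to_min(value: str) -> int:
        """'12-13' -> 12"""
        return int(value.split("-")[0])

    def lat_long(value: str) -> float:
        """Normalize '35 41\' 36"' (DMS) or '35.69333333' (signed degrees) to a float."""
        m = re.match(r'\s*(\d+)\s+(\d+)\'\s+([\d.]+)"', value)
        if m:
            d, mi, s = float(m.group(1)), float(m.group(2)), float(m.group(3))
            return d + mi / 60 + s / 3600
        return float(value)

    class Rule(NamedTuple):
        attribute: Optional[str]   # None means "constant rule", executed unconditionally
        predicate: str
        constant: Optional[str]
        transformer: Optional[Callable[[str], object]]

    RULES: List[Rule] = [
        Rule(None, "rdf:type", "fbo:City", None),
        Rule(None, "fbo:location", "fbr:Iran", None),
        Rule("mayor", "fbo:mayor", None, None),
        Rule("wheat_production", "fbo:wheatProductionMin", None, min_max_range_to_min),
        Rule("latitude", "fbo:lat", None, lat_long),
    ]

    def apply_rules(entity: str, infobox: Dict[str, str]) -> List[tuple]:
        triples = []
        for rule in RULES:
            if rule.attribute is None:                      # step 1: constant rules
                triples.append((entity, rule.predicate, rule.constant))
            elif rule.attribute in infobox:                 # step 2: attribute rules
                raw = infobox[rule.attribute]
                value = rule.transformer(raw) if rule.transformer else raw
                triples.append((entity, rule.predicate, value))
        return triples

    print(apply_rules("fbr:SampleCity",
                      {"mayor": "fbr:SomePerson", "latitude": "35 41' 36\""}))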

6.3. N-ary Relations

Most knowledge graphs, including FarsBase, are based on triples of the form subject-predicate-object, as defined by the RDF model. This representation is suitable for relations involving two entities, but it is not straightforward for n-ary relations. Rouces et al. reviewed several modelling patterns for representing n-ary relations: the basic-triple pattern, triple reification, the singleton-property pattern, the specific-role neo-Davidsonian pattern, the general-role neo-Davidsonian pattern, and the role-class pattern.

Wikipedia infoboxes support n-ary relations through ordered attribute-value pairs. Figure 4 shows an example of n-ary relations in Wikipedia infoboxes. In such cases, all related attributes end with the same digit, e.g. years4, caps4, and goals4, and Wikipedia displays these related attributes in a proper view (e.g. a table) based on the template of the infobox. FarsBase uses the triple-reification modelling pattern to handle n-ary relations. For example, the following triples are generated from the years4, caps4, and goals4 attributes (fbr:علی_دایی/relation_4 is an automatically created entity):

    fbr:علی_دایی  fbo:relatedPredicates  fbr:علی_دایی/relation_4 .
    fbr:علی_دایی/relation_4  rdf:type  fbo:RelatedPredicates .
    fbr:علی_دایی/relation_4  fbo:mainPredicate  fbo:club .
    fbr:علی_دایی/relation_4  fbo:club  fbr:پرسپولیس .
    fbr:علی_دایی/relation_4  fbo:caps  38 .
    fbr:علی_دایی/relation_4  fbo:goals  23 .

This method also handles relations that contain temporal data.

Fig. 4. N-ary relations in Wikipedia infoboxes.
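A small sketch of this grouping step, assuming the reified entity is named by appending /relation_<index> to the subject URI, as in the example above; the attribute names in the usage line and the helper reify_numbered_attributes are our own simplification, not the exact FarsBase code.

    import re
    from collections import defaultdict
    from typing import Dict, List, Tuple

    def reify_numbered_attributes(subject: str, infobox: Dict[str, str],
                                  predicate_map: Dict[str, str],
                                  main_predicate: str = "fbo:club") -> List[Tuple]:
        """Group attributes like caps4/goals4 into one reified relation entity per index."""
        groups = defaultdict(dict)
        for attribute, value in infobox.items():
            m = re.match(r"([a-z_]+)(\d+)$", attribute)
            if m and m.group(1) in predicate_map:
                groups[m.group(2)][predicate_map[m.group(1)]] = value

        triples = []
        for index, predicates in groups.items():
            relation = f"{subject}/relation_{index}"
            triples.append((subject, "fbo:relatedPredicates", relation))
            triples.append((relation, "rdf:type", "fbo:RelatedPredicates"))
            triples.append((relation, "fbo:mainPredicate", main_predicate))
            triples.extend((relation, p, v) for p, v in predicates.items())
        return triples

    mapping = {"clubs": "fbo:club", "caps": "fbo:caps", "goals": "fbo:goals"}
    print(reify_numbered_attributes("fbr:SamplePlayer",
                                    {"clubs4": "fbr:SampleClub", "caps4": "38", "goals4": "23"},
                                    mapping))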
Assuming that the entities are correctly iden- performing the transfer operation from CFM-Store to 39 Belief Store, all the information erased, and then only 39 40 tified and disambiguated, if the entity has an infobox 40 in the Wikipedia data, the predicate declared by the the data that has the final version number is written to 41 the Belief Store. Suppose an update process. The pre- 41 42 raw text is searched in the mappings of that type of in- 42 vious version number for “wiki” module was 11 and 43 fobox. If this property is found, it is exactly the same 43 the new version is 12, 44 as Wikipedia mapping. Otherwise, we try to find any 44 45 rules in any rule-sets which contains extracted predi- – Triple (Movie Z, director, Mr. W) with version 12, 45 46 cate exists in the rule. In this case, the mapping is done has just been added to our information. Here, be- 46 47 as usual, otherwise the mapping does not apply. cause the triple version matches the latest version, 47 48 In this mechanism, errors can be propagated, so we it is transferred to Belief Store. 48 49 need a more supervision. Hence, every triple which is – Triple (Organization Y, Location, Tehran) with 49 50 extracted from raw texts, must be approved by experts version 11 already existed. But this organization 50 51 before appearing in final RDF data. has been dissolved and no longer has its place. In 51 18 M. Asgari et al. / FarsBase: The Persian Knowledge Graph

Suppose an update process in which the previous version number for the "wiki" module was 11 and the new version is 12:

– The triple (Movie Z, director, Mr. W) with version 12 has just been added to our data. Because the triple's version matches the latest version, it is transferred to the Belief Store.
– The triple (Organization Y, location, Tehran) already existed with version 11, but the organization has been dissolved and no longer has a location. Since the triple's version (11) is not equal to the latest version (12), this data is deleted (not transferred).
– The triple (Village X, population, 12,000) existed with version 11, and in the new Wikipedia dump 12,000 has changed to 12,050, so the triple (Village X, population, 12,050) has been produced with version 12. During the transfer, since the version of (Village X, population, 12,000) differs from the latest version (12), it is not transferred, whereas (Village X, population, 12,050) is transferred.

6.6. Expert Approval

Automatic approaches for knowledge graph construction, such as NELL [2], have recently reached considerably high accuracy, but to construct a high-quality KG, NELL uses a crowdsourcing method to verify the extracted facts. The human expert is therefore still essential to ensure the correctness, consistency, and completeness of the data.

Human experts collaborate in FarsBase in four areas: triple approval, ontology construction, the mapping process, and evaluations (e.g. creating gold data for the evaluation of distant supervision).

Brute-force verification of over 20M triples by human experts is not affordable. To overcome this problem, triples with a confidence higher than a threshold are transferred to the qualified triple store automatically. Each remaining triple must be voted on by three experts; some experts can verify triples based on their own knowledge, without the need for any other votes.

The cost of a wrong piece of knowledge in FarsBase is proportional to the number of users that see the relation. Therefore, verifying triples about frequently seen entities is more important than verifying other triples, and FarsBase supports triple-approval proposals that prioritize the more important triples.

The number of new attributes in Wikipedia infoboxes is not increasing significantly, so the maintenance of the mapping data by experts is not too expensive.
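As a sketch of how the approval queue could prioritize such triples, the snippet below orders pending triples by how often their subject appears in the query log; the scoring, the function name, and the data layout are our own illustrative assumptions about the proposal mechanism described above, not its actual implementation.

    from collections import Counter
    from typing import Iterable, List, Tuple

    Triple = Tuple[str, str, str]

    def prioritize_for_approval(pending: Iterable[Triple],
                                query_log_entities: Iterable[str]) -> List[Triple]:
        """Order candidate triples so that those about frequently searched entities come first."""
        frequency = Counter(query_log_entities)
        return sorted(pending, key=lambda t: frequency[t[0]], reverse=True)

    pending = [("fbr:RareVillage", "fbo:population", "120"),
               ("fbr:FamousActor", "fbo:birthDate", "1970-01-01")]
    log = ["fbr:FamousActor", "fbr:FamousActor", "fbr:RareVillage"]
    print(prioritize_for_approval(pending, log))  # frequently searched entities are verified first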

6.7. Linking to External Data Sets

One of the most important features of a knowledge graph is its external links to other datasets. Besides making the knowledge graph usable in other systems and tools, these links provide the means for importing data from other datasets into the knowledge graph. To link FarsBase to other datasets, a baseline method is used. Many Wikipedia articles have inter-language links (many Persian articles are linked to the corresponding English articles), and each English article is equivalent to an entity in DBpedia. Fortunately, DBpedia entities have links to 36 other datasets. With this method, a number of FarsBase entities can be linked to English DBpedia entities, and consequently those entities are linked to other datasets via owl:sameAs predicates in DBpedia. In total, 439,445 FarsBase entities have at least one link to an external dataset, and FarsBase has 5,582,589 links to external datasets.

This solution provides a low-cost method to connect FarsBase entities to English and multilingual datasets, but it does not cover linking to Persian datasets. FarsNet [34], a lexical ontology, is one of the most important NLP datasets for the Persian language.
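As an illustration of the owl:sameAs hop, here is a hedged sketch using the public DBpedia SPARQL endpoint: given the English DBpedia resource reached through a Persian article's inter-language link, it collects the owl:sameAs targets that could then be attached to the corresponding FarsBase entity. The endpoint and the SPARQLWrapper library are standard, but the function itself is only a sketch of the described baseline, not FarsBase's actual linking code.

    from SPARQLWrapper import SPARQLWrapper, JSON

    def external_links_for(english_dbpedia_uri: str) -> list:
        """Return owl:sameAs targets of an English DBpedia resource."""
        sparql = SPARQLWrapper("https://dbpedia.org/sparql")
        sparql.setQuery(f"""
            SELECT ?target WHERE {{
                <{english_dbpedia_uri}> <http://www.w3.org/2002/07/owl#sameAs> ?target .
            }}
        """)
        sparql.setReturnFormat(JSON)
        results = sparql.query().convert()
        return [b["target"]["value"] for b in results["results"]["bindings"]]

    # Example: links for the resource reached through a Persian article's inter-language link.
    print(external_links_for("http://dbpedia.org/resource/Si-o-se-pol"))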

7. Evaluation and Statistics

We have evaluated our knowledge graph from the following aspects: size, precision, mapping quality, coverage, freshness (timeliness), and the number of resources linked to other LOD datasets.

7.1. Size

Tables 6 and 7 show the number of entities and triples per class in FarsBase. The Settlement and Village classes contain the largest numbers of entities (69,317 and 49,512), and more than 4 million triples belong to these two classes. In general, more than 190,000 entities are instances of the Place class, and there are more than 6 million triples in this class; the Place class hierarchy therefore accounts for the largest share of the knowledge graph. Table 8 shows the number of entities and triples for the first level of the ontology.

Table 6
Number of entities per class
Class                 Entities
Settlement            69,317
Village               49,512
Person                26,035
Species               24,350
Historic Place        21,881
Film                  19,093
Ship                  18,356
Soccer Player         15,180
Planet                11,814
Airport               11,148
Actor                  8,106
Music Artist           7,227
Chemical Compound      6,290
Office Holder          5,900
Canal                  5,520
City                   5,301

Table 7
Number of triples per class
Class                 Triples
Settlement            2,965,740
Village               1,155,247
Soccer Player           670,884
Film                    610,592
Historic Place          566,268
Species                 558,861
Person                  476,467
Ship                    405,828
Planet                  342,807
Airport                 282,147
Chemical Compound       273,666
Musical Artist          201,786
Office Holder           196,276
Actor                   191,179
City                    155,559
Canal                   152,603

Table 8
Number of entities and triples for the first-level ontology classes
Class                     Entities    Triples
Place                     196,073     6,329,219
Agent                     106,669     2,951,735
Work                       35,035     1,133,014
Species                    24,465       561,777
Mean of Transportation     22,944       539,869
Chemical Substance          7,630       322,617
Event                       2,942       118,451
Device                      1,403        44,974
Disease                     1,306        28,594
Time Period                   874         9,883
Language                      571        17,666
Food                          423         7,838
Flag                          364         6,425
Award                         344         8,683
Currency                      290         7,181
Topical Concept               225         7,528
Holiday                       129         3,360

7.2. Precision

To evaluate the precision of the KG, a set of queries was collected and three experts checked the answer of each query. The queries were collected from three different sources:

– Expert queries: 62 queries proposed by three experts, including 14 queries about entities.
– Search-engine log: 63 queries selected randomly from the end-user query log. If the answer of a query was not in the knowledge graph (checked by human experts), the query was replaced by another query.
– Class-centric queries: 253 queries selected based on the ontology classes; these queries were constructed automatically from the triples of each entity.

Precision has been calculated for each source in Table 9. The average precision over all classes is 93.5, and the weighted-average precision (based on the frequency of each class) is 89.8.

Table 9
Precision for queries from different sources
Source                   Precision
Expert queries           97.5
Search-engine queries    93.2
Class-centric queries    93.5

7.3. Mapping

The number of unique templates in the Persian Wikipedia is 1,712, of which 683 templates are mapped to ontology classes. The number of unique attributes in Persian Wikipedia infoboxes is 25,032, of which 7,808 attributes are mapped to the ontology. Table 10 reports the numbers of mapped templates, attributes, and Wikipedia infobox triples. Note that 90 percent of the triples extracted from Wikipedia templates and infoboxes are mapped with this amount of template and attribute mapping.

Table 10
Statistics about mapped templates, attributes, and triples
                   Total count    Mapped count    Percent
Templates          1,712          683             40%
Attributes         25,032         7,808           31%
Infobox triples    11,917,420     10,682,099      90%

7.4. Coverage

FarsBase has been constructed to improve the responses of a search engine, so coverage is a very important metric: FarsBase must cover queries about the most frequent entities. We have evaluated the coverage of FarsBase by three methods: Wikipedia articles, named entities, and information needs.

Wikipedia Articles. In this method, 505 random articles were selected from Wikipedia, and experts checked the existence of the corresponding entities in FarsBase. The results are shown in Table 11.

Named Entities. In this evaluation, three gazetteers of the named-entity recognizer developed by the Iran Telecommunication Research Center (ITRC) were used, and the existence of each popular named entity in FarsBase was checked by experts. Table 12 shows the results for the person, location, and organization classes. The results show that FarsBase covers famous persons more widely than the other kinds of named entities.

Table 11
Coverage of random Wikipedia articles
Metric                            Value
Number of random articles         505
Number of entities in FarsBase    502
Coverage                          99.4

Table 12
Coverage of named entities
Named entity class    Count     Coverage
Person                1,771     95.65
Location              4,697     89.71
Organization          5,524     76.64
All                   11,992    85.31

7.4.1. Information Needs
In this method, we used 321 popular queries from the search-engine log. The validity and existence of the answer to each query was checked by human experts. FarsBase covered 91.85 percent of the queries.

7.5. Data Freshness

To evaluate the freshness of the data in FarsBase, a list of famous recent events in a time interval was collected. This list was extracted automatically from the "Portal:Current_events" and "Deaths_in_2018" pages of Wikipedia, and the updating process was evaluated with a data-freshness metric. After checking 100 entities from the list, all of the entities and 91.02% of their attributes existed in FarsBase.

7.6. Linked Data

In total, 5,582,589 FarsBase resources have been linked to 33 external datasets. 439,445 of the 541,927 FarsBase entities (81.09%) have at least one link to other datasets. Table 13 lists all linked datasets and the number of links to each of them.

8. Related Work

8.1. Knowledge Graph Construction

Semantic MediaWiki. As an extension of the MediaWiki software, Semantic MediaWiki tried to facilitate knowledge searching, browsing, and processing on Wikipedia by using semantic technologies. The project proposed a new syntax to convert Wikipedia content into more machine-interpretable structured knowledge, but the conversion process required authors to deal with the syntax [39][40].

Freebase. Freebase, developed by Metaweb, was a public, practical knowledge graph launched in early 2007, containing more than 125,000,000 tuples, more than 4,000 types, and more than 7,000 properties [41]. Freebase data was harvested from multiple sources including Wikipedia and was accessible using the Metaweb Query Language (MQL).

YAGO. Suchanek et al. [42] presented YAGO as a light-weight, high-quality, and extensible ontology with 14 relations, 149,162 classes, 907,462 entities, and 5 million facts. YAGO unified Wikipedia articles and WordNet synsets using heuristic methods. At this stage, YAGO did not perform any extraction on Wikipedia infoboxes; it used the Wikipedia category hierarchy to generate is-a facts. In the same year, [43] derived a large taxonomy of is-a relations from the Wikipedia category system.

YAGO was later extended with Wikipedia infoboxes and a quality-control system, combining them with WordNet taxonomic relations. YAGO extracted facts using heuristics including infobox heuristics, type heuristics, word heuristics, and category heuristics [24].

YAGO2 was an extension of YAGO that added temporal and spatial aspects to the knowledge graph with high quality. To provide geographical information, YAGO2 included the GeoNames database; it contained 447 million facts about 9.8 million entities [44].

YAGO3 covered mapped Wikipedia information from 10 different languages using multilingual information extraction techniques [3][45].

DBpedia. DBpedia converted Wikipedia content into a large multi-domain RDF dataset with 103 million RDF triples. DBpedia, as a nucleus for the web of open data, was interlinked with other open data sources including FOAF, GeoNames, Berlin, the World Factbook, and MusicBrainz. The authors also developed a series of modules that made DBpedia accessible via web services [46].

Table 13
Datasets linked from FarsBase
Dataset                                                       URI prefix                         Count
All links                                                     -                                  5,582,589
All DBpedia                                                   .org                               3,802,782
English DBpedia                                               dbpedia.org                        652,773
Wikidata                                                      wikidata.org                       637,062
YAGO                                                          yago-knowledge.org                 410,538
Freebase                                                      freebase.com                       397,261
GeoNames                                                      geonames.org                       113,314
Deutsche Nationalbibliothek                                   d-nb.info                          64,378
LinkedGeoData                                                 linkedgeodata.org                  31,413
Virtual International Authority File (VIAF) [35]              viaf.org                           24,252
EU Open Data                                                  data.europa.eu                     22,999
GeoVocab                                                      geovocab.org                       15,021
OpenCyc                                                       .com                               13,783
rdfabout.com                                                  rdfabout.com                       7,723
fu-berlin.de (Drugbank, Sider, WikiCompany, Factbook, ...)    fu-berlin.de                       6,250
isprambiente.it                                               isprambiente.it                    5,718
GeoSpecies                                                    geospecies.org                     5,094
LinkedOpenData                                                linkedopendata.org                 4,881
New York Times                                                nytimes.com                        4,784
LinkedMDB                                                     linkedmdb.org                      4,565
MusicBrainz [36]                                              zitgist.com                        4,080
European Nature Information System                            eunis.eea.europa.eu                1,675
BBC Wildlife Finder                                           bbc.co.uk                          1,593
Linked Open Data of Ecology (LODE) [37]                       ecowlim.tfri.gov.tw                959
Camera dei deputati                                           camera.it                          675
The Linked Web APIs ontology [38]                             linked-web-apis.fit.cvut.cz        347
Dublin Core Metadata Initiative                               purl.org                           345
OpenEI                                                        openei.org                         304
270a Linked                                                   270a.info                          299
Eurostat (Linked Statistics)                                  eurostat.linked-statistics.org     187
GHO                                                           ghodata                            140
logainm.ie                                                    data.logainm.ie                    88
UK Learning Providers                                         id.learning-provider.data.ac.uk    58
DBTune                                                        dbtune.org                         17
Revyu                                                         revyu.com                          4

Further development of DBpedia led to the creation of a multilingual knowledge graph from 97 different language editions of Wikipedia. The DBpedia extraction framework contained four extraction modules: mapping-based infobox extraction, raw infobox extraction, feature extraction (e.g. geographic coordinates), and statistical extraction. With the mapping-based infobox extraction system, 15 languages were mapped to a single ontology [47].

DBpedia supported 111 different languages in 2012. The infoboxes of 27 languages are mapped to the shared DBpedia ontology with 320 classes and 1,650 properties. DBpedia was developed as a live system architecture with four main components: a local Wikipedia mirror kept in real-time synchronization, the Mappings Wiki serving as another live input source, the DBpedia live extraction manager that takes all Wikipedia sources as input and extracts triples, and a synchronization tool that allows third-party software to keep live mirrors. In the last quarter of 2012, the English DBpedia dataset had been downloaded over 100,000 times [1].

BabelNet. BabelNet integrated lexicographic and encyclopedic knowledge across multiple languages and presented a lightweight method to map Wikipedia articles to WordNet senses with a 78% F1 score. For resource-poor languages, it used human-edited and statistical machine translations of Wikipedia articles in other languages. BabelNet also provides a multilingual word-sense disambiguation system based on its integrated knowledge graph [48].

Wikidata. Wikidata ("Wikipedia for data") targets managing the factual information of Wikipedia. Wikidata uses facts instead of triples: users enter facts directly into the database, so it also contains data from outside Wikipedia. Each entity has an ID (the ID is not a URI). There is no extraction process in Wikidata; each fact is entered by the users. Instead of triples, each entity has multiple statements, and each statement has one claim and one or more references. Wikidata is not registered in the LOD cloud [49][50].

Other Knowledge Bases. An automatic knowledge graph construction approach that combines statistical text analysis over database systems (MADDEN), a deductive reasoning system (PROBKB), and human feedback (CAMEL) was proposed in [51].

Knowledge Vault was introduced as a probabilistic knowledge graph that mixes the extraction of information from web content with prior knowledge derived from existing KGs, and uses a supervised method to fuse these information sources. Knowledge Vault has three main components: triple extractors (from text documents, HTML trees, HTML tables, and human-annotated pages); graph-based priors (systems that learn the prior probability of each possible triple based on an existing KG, Freebase); and knowledge fusion. Triple extractors may extract unreliable and noisy knowledge; to reduce the noise in the extracted data, the prior knowledge is used [52].

DeepDive cleans and integrates data from multiple sources such as text documents, PDFs, and structured databases. The system uses statistical inference and machine learning to extract tuples and assigns a probability score to each tuple [53][54].

Most knowledge graphs are not extracted based on the needs of end users and miss out on many relevant predicates. QKBfly proposes an approach to construct an on-the-fly knowledge graph from the user's queries in a question answering system [55].

FrameBase [56] proposed a KG schema that uses FrameNet [57] to store and query n-ary relations (facts with more than two entities or literals) from heterogeneous sources, combining efficiency and expressiveness. The article investigated different triple representations of n-ary relations and used NLP frames to handle such relations in other knowledge graphs.

Comparative Survey. Färber et al. define 35 aspects in seven categories (general information, format and representation, genesis and usage, entities, relations, schema, and particularities) for a comparative study on knowledge graphs, and compare these aspects for five famous knowledge graphs: DBpedia, Freebase, OpenCyc, Wikidata, and YAGO [58].

8.2. Quality Assessment

Paulheim [59] provided a survey on knowledge graph refinement approaches (completion vs. error detection, target of refinement, internal vs. external methods) and evaluation methodologies (partial gold standards, knowledge graphs as silver standards, retrospective evaluation, computational performance). These evaluation methods were assessed on Cyc [60] and OpenCyc, Freebase, DBpedia, YAGO, NELL, and Knowledge Vault. The article also presented some statistics about Wikidata, Google's Knowledge Graph, Yahoo!'s Knowledge Graph [61], and Microsoft's Satori and Entity Graph.

Färber et al. proposed 34 metrics for knowledge graph quality assessment and analyzed them on DBpedia, Freebase, OpenCyc, Wikidata, and YAGO [4]. The metrics are categorized as intrinsic (accuracy, trustworthiness, and consistency), contextual (relevancy, completeness, timeliness), representational data quality (ease of understanding and interoperability), and accessibility (accessibility, license, interlinking) metrics.

Rashid et al. [62] also proposed four quality assessments (persistency, historical persistency, consistency, and completeness) and assessed these metrics on 11 releases of DBpedia and 8 releases of 3cixty [63].

8.3. Mapping

Dimou et al. [64] proposed a uniform assessment approach for mappings, rather than validating the data generated by the mapping process, using the RML mapping language and RDFUnit, a test-driven approach applicable to every vocabulary, ontology, dataset, or application. They also applied this validation to DBpedia [65].

Ahmeti et al. [66] proposed using the DBpedia mapping infrastructure to enhance Wikipedia content with an Ontology-Based Data Management (OBDM) approach, for example using conflict-resolution policies to ensure the consistency of updates on Wikipedia infoboxes.

8.4. Relation Extraction from Raw Texts

Many researchers have worked on triple and relation extraction from raw texts. In knowledge graph construction, we must use triple extraction, and each entity must be linked to our knowledge graph; thus entity linking is an essential task in triple extraction. Some relation-extraction approaches, such as Open Information Extraction (OIE), focus on extracting relations without concern for linking them to a particular knowledge graph. In this section, we review recent research on entity linking, triple and relation extraction, and the use of distant supervision methods to extract data from different structured and unstructured sources.

Entity Extraction and Entity Linking. Han et al. [67] introduced collective entity linking, which works by jointly considering the name mentions in the same document through a representation called the Referent Graph.

Exner et al. [68][69] proposed an entity-extraction pipeline that includes a semantic parser and a coreference resolver and works based on coreference chains. This approach extracted more than 1 million triples from 114,000 Wikipedia articles.

Oramas et al. [70] presented a rule-based approach for extracting knowledge about music from the songfacts.com website. Their extraction pipeline includes Babelfy [71] as a state-of-the-art entity linker with the highest precision on musical entities.

Nguyen et al. [72] presented J-REED, a joint approach to entity linking in OIE based on graphical models.

Röder et al. [73] proposed GERBIL, which includes an evaluation algorithm for entity linking that compares two entity URIs without depending on a specific knowledge graph.

Triple and Relation Extraction. Bach et al. [74] presented a review of the most important supervised and semi-supervised relation-extraction methods in 2007.

Kasneci et al. [13] proposed YAGO-NAGA, which extracts candidate facts from raw text sources and integrates them into YAGO. NAGA also has a consistency-checking approach for controlling the quality of the generated facts.

Wanderlust was an unsupervised dependency-grammar approach for extracting semantic relations from raw texts [75].

Nakashole [76] presented automatic extraction for a web-scale knowledge graph by proposing a robust method for extracting high-quality facts from noisy text sources. The work also proposed a method for handling new entities in dynamic web sources such as news.

TokensRegex [31], proposed by the Stanford Natural Language Processing Group, facilitates rule-based and regular-expression approaches to relation extraction by implementing a framework for cascaded RE over sequences of tokens.

Madaan et al. [77] proposed a relation extractor for numerical data (e.g. an atomic number such as "Aluminium, 13" or an inflation rate such as "India, 10.9%") that tries to handle units with minimal human supervision.

Presutti et al. [78] proposed Legalo, an unsupervised open-domain knowledge extractor for raw texts. Legalo is based on the hypothesis that hyperlinks between two entities indicate semantic relations between them.

Speer et al. [79] proposed ConceptNet, a knowledge graph extracted from many sources including Open Mind Common Sense (OMCS) [80], Wiktionary, games with a purpose, WordNet, JMDict [81] (a Japanese-multilingual dictionary), OpenCyc, and a subset of DBpedia.

Vo et al. [82] described the second generation of OIE, which is able to extract relations mediated by noun-preposition and verb-preposition constructions using deep linguistic analysis, in addition to the extraction of relations from verb phrases on which the first generation of OIE focused.

Stanovsky et al. [83] presented a supervised sequence-tagging approach to OIE that is aided by Semantic Role Labeling models. To provide training data, they automatically converted a question-answering dataset into an open IE corpus.

Distant Supervision. Distant supervision (DS) was first proposed by Mintz et al. [32] on the Freebase knowledge graph; the model was able to extract 10,000 instances and 102 relations with 67.6% precision.

Aprosio et al. [10] used DS to extract missing values from Wikipedia articles in order to enrich the DBpedia knowledge graph.

Augenstein et al. [84] used DS on web sources with data sparsity, noise, and lexical ambiguity.

To handle such data, they used an entity-recognition tool and an unsupervised coreference resolver. Finally, they presented methods for information integration that aggregate the extracted knowledge into the main KG.

Heist et al. [85] proposed a language-independent approach based on distant supervision and extracted 1.6 million triples with 95% precision from the abstracts of 21 Wikipedia language editions, using machine-learning approaches based on language-agnostic features.

8.5. Never-Ending Paradigms

The Never-Ending Language Learner (NELL) was first presented in 2010 [86], containing the modules Coupled Pattern Learner (CPL), Coupled SEAL (CSEAL), Coupled Morphological Classifier (CMC), Rule Learner (RL), and Knowledge Integrator (KI) to extract and integrate knowledge from text sources. NELL crafted 242,000 beliefs with a precision of 74% after running for 67 days. This number had increased to 80 million beliefs by 2015, after adding human supervision and four extra modules (actively searching web text (OpenEval), inferring new beliefs from old ones (PRA), an image classifier (NEIL), and an ontology extender (OntExt)) [2], and to 120 million in 2018 [8] after adding another module (learned embeddings (LE)). Prophet is another module in NELL that assists the extraction process through link prediction [87].

The never-ending paradigm is not restricted to relation extraction from text. NEQA [88] is a continuous-learning paradigm for question answering over knowledge bases (KB-QA). NEQA tries to overcome two problems in KB-QA: the need for large training data and answering questions about unseen domains.

8.6. Knowledge Graph Augmentation

FACT EXTRACTOR [89] presented an n-ary relation extractor using FrameNet; it populates the KG with supervised yet reasonably priced NLP layers and tries to reduce supervision costs through crowdsourcing.

Chen et al. [90] proposed a parallel first-order rule mining (path-finding) algorithm and a pruning system to augment Freebase, mining 36,625 rules and 0.9 billion new facts from it.

8.6.1. Using Web Tables for Knowledge Augmentation
Limaye et al. [25] proposed machine-learning techniques to find entities and entity types in web tables and to extract relation types from the tables.

InfoGather [19] is an entity-augmentation system that works on entity-attribute tables. This information-gathering system is based on three core operations: 1. augmentation by attribute name, which takes some entities (of the same type) and an attribute as a query and searches for them in a table corpus; 2. augmentation by example, which takes some entities, an attribute, and the corresponding value for one of the entities and searches for the values of the other entities in the corpus; and 3. attribute discovery, which takes some entities of the same type and searches for related attributes in the corpus. InfoGather crawls, extracts, and identifies relational tables from the Web and builds a table graph from them; the system takes query tables and answers them with the three core operations.

TabEL [16] is an entity-linking tool especially designed for web tables. Weakening the assumption that the set of entities or relations in a table belongs to a particular type, it assigns higher likelihood to entities with higher co-occurrence in Wikipedia articles.

Ritze et al. worked on web-table matching to find missing triples for knowledge graphs. They proposed the T2K Match framework [17] to match triples in the Web Tables Corpus (147 million tables) with a knowledge graph, and used the matched tables to augment DBpedia as a cross-domain knowledge graph [18].

9. Conclusion and Future Work

We presented FarsBase, a multi-source knowledge graph that is specifically designed to provide linked data for answering semantic queries in Persian search engines. It leverages multiple engines for extracting knowledge from different types of data sources such as Wikipedia, web tables, and unstructured text. The extracted triples are then integrated based on a unified ontology. In order to perform as the semantic-search component of a search engine, we adopted two policies for designing FarsBase. First, the extracted information should be managed and updated in a never-ending manner; hence FarsBase supports versioning and keeps track of all operations adopted in its various components. Second, the quality of the triples should be close to 1, such that almost no wrong result is returned in the semantic search. This is addressed by adopting conservative approaches that favor high precision while maintaining an affordable cost for human labeling, such as using rule-based methods for information extraction and batch verification of triples by human experts.

We identified several ideas for improving FarsBase. Future work will include:

FarsNet. FarsNet [34] is a valuable lexical dataset for the Persian language. It provides a clean and carefully handcrafted hierarchy of concepts. Although we have already linked parts of FarsBase with FarsNet, a complete linking between the two sources would increase the extendability of both and benefit certain applications for the Persian language. Interestingly, many FarsNet entries are already mapped to WordNet synsets, which could pave the path for designing multilingual applications.

Integrating with Other Persian Sources. Structured datasets such as Persian wikis, theses, academic information, and books can augment FarsBase in the next steps.

OWL Reasoning. One of the methods for knowledge graph augmentation is reasoning [91][92][93][94]. OWL supports standard reasoning capabilities, including symmetric, inverse, transitive, and functional properties. Triples produced by reasoning can be verified by experts.

Deep Learning for Raw Texts. The idea of applying deep learning to information extraction is becoming popular. Deep learning can be applied as an independent extraction tool or can be integrated with the current stack using frameworks like Snorkel [14] and DeepDive [54].

Temporal and Spatial Aspects. FarsBase currently extracts temporal information from the infoboxes and exploits it to model n-ary relations. However, temporal information can also be extracted and enhanced using other sources, such as Wikipedia categories. Similarly, integrating the spatial information and geographical locations of entities into FarsBase can benefit location-based applications.

References

[1] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P.N. Mendes, S. Hellmann, M. Morsey, P. Van Kleef, S. Auer et al., DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia, Semantic Web 6(2) (2015), 167-195.
[2] T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, J. Krishnamurthy, N. Lao, K. Mazaitis, T. Mohamed, N. Nakashole, E. Platanios, A. Ritter, M. Samadi, B. Settles, R. Wang, D. Wijaya, A. Gupta, X. Chen, A. Saparov, M. Greaves and J. Welling, Never-Ending Learning, in: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15), 2015.
[3] F. Mahdisoltani, J. Biega and F. Suchanek, YAGO3: A knowledge base from multilingual Wikipedias, in: 7th Biennial Conference on Innovative Data Systems Research (CIDR), 2014.
[4] M. Färber, F. Bartscherer, C. Menne and A. Rettinger, Linked data quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO, Semantic Web (2016), 1-53.
[5] L. Ehrlinger and W. Wöß, Towards a Definition of Knowledge Graphs, in: SEMANTiCS (Posters, Demos, SuCCESS), 2016.
[6] W3C, RDF 1.1 Turtle, https://www.w3.org/TR/turtle (2014).
[7] B. Vatant and M. Wick, GeoNames ontology, available online: http://www.geonames.org/ontology/ontology_v3.1 (2012).
[8] T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, B. Yang, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel et al., Never-ending learning, Communications of the ACM 61(5) (2018), 103-115.
[9] M. Shamsfard, Challenges and open problems in Persian text processing, Proceedings of LTC 11 (2011).
[10] A.P. Aprosio, C. Giuliano and A. Lavelli, Extending the Coverage of DBpedia Properties using Distant Supervision over Wikipedia, in: NLP-DBPEDIA@ISWC, 2013.
[11] E. Munoz, A. Hogan and A. Mileo, Triplifying Wikipedia's Tables, LD4IE@ISWC 1057 (2013).
[12] E. Muñoz, A. Hogan and A. Mileo, Using linked data to mine RDF from Wikipedia's tables, in: Proceedings of the 7th ACM International Conference on Web Search and Data Mining, ACM, 2014, pp. 533-542.
[13] G. Kasneci, M. Ramanath, F. Suchanek and G. Weikum, The YAGO-NAGA approach to knowledge discovery, SIGMOD Record 37(4) (2008), 41-47.
[14] A. Ratner, S.H. Bach, H. Ehrenberg, J. Fries, S. Wu and C. Ré, Snorkel: Rapid training data creation with weak supervision, Proceedings of the VLDB Endowment 11(3) (2017), 269-282.
[15] P. Li, H. Wang, H. Li and X. Wu, Employing Semantic Context for Sparse Information Extraction Assessment, ACM Transactions on Knowledge Discovery from Data (TKDD) 12(5) (2018), 54.
[16] C.S. Bhagavatula, T. Noraset and D. Downey, TabEL: entity linking in web tables, in: International Semantic Web Conference, Springer, 2015, pp. 425-441.
[17] D. Ritze, O. Lehmberg and C. Bizer, Matching HTML tables to DBpedia, in: Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics, ACM, 2015, p. 10.
[18] D. Ritze, O. Lehmberg, Y. Oulabi and C. Bizer, Profiling the Potential of Web Tables for Augmenting Cross-domain Knowledge Bases, in: Proceedings of the 25th International Conference on World Wide Web (WWW '16), International World Wide Web Conferences Steering Committee, 2016, pp. 251-261.
[19] M. Yakout, K. Ganjam, K. Chakrabarti and S. Chaudhuri, InfoGather: entity augmentation and attribute discovery by holistic matching with web tables, in: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, ACM, 2012, pp. 97-108.
[20] A. Moschitti, K. Tymoshenko, P. Alexopoulos, A. Walker, M. Nicosia, G. Vetere, A. Faraotti, M. Monti, J.Z. Pan, H. Wu et al., Question Answering and Knowledge Graphs, in: Exploiting Linked Data and Knowledge Graphs in Large Organisations, Springer, 2017, pp. 181-212.

[21] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber and P. Cimiano, Template-based question answering over RDF data, in: Proceedings of the 21st International Conference on World Wide Web, WWW, 2012, pp. 639-648.
[22] A. Abujabal, M. Yahya, M. Riedewald and G. Weikum, Automated template generation for question answering over knowledge graphs, in: Proceedings of the 26th International Conference on World Wide Web, WWW, 2017, pp. 1191-1200.
[23] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P.N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer and C. Bizer, DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia, Semantic Web Journal 1 (2012), 1-5.
[24] F.M. Suchanek, G. Kasneci and G. Weikum, Yago: A large ontology from Wikipedia and WordNet, Web Semantics: Science, Services and Agents on the World Wide Web 6(3) (2008), 203-217.
[25] G. Limaye, S. Sarawagi and S. Chakrabarti, Annotating and searching web tables using entities, types and relationships, Proceedings of the VLDB Endowment 3(1-2) (2010), 1338-1347.
[26] S. Mohtaj, B. Roshanfekr, A. Zafarian and H. Asghari, Parsivar: A Language Processing Toolkit for Persian, in: LREC, 2018.
[27] A. Mirzaei and P. Safari, Persian Discourse Treebank and coreference corpus, in: LREC, 2018.
[28] K. Dashtipour, M. Gogate, A. Adeel, A. Algarafi, N. Howard and A. Hussain, Persian Named Entity Recognition, in: Cognitive Informatics & Cognitive Computing (ICCI*CC), 2017 IEEE 16th International Conference on, IEEE, 2017, pp. 79-83.
[29] F. Liu, H. Lu and G. Neubig, Handling Homographs in Neural Machine Translation, arXiv preprint arXiv:1708.06510 (2017).
[30] R. Mitkov, Robust pronoun resolution with limited knowledge, in: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 2, Association for Computational Linguistics, 1998, pp. 869-875.
[31] A.X. Chang and C.D. Manning, TokensRegex: Defining cascaded regular expressions over tokens, Tech. Rep. CSTR 2014-02 (2014).
[32] M. Mintz, S. Bills, R. Snow and D. Jurafsky, Distant supervision for relation extraction without labeled data, in: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, Association for Computational Linguistics, 2009, pp. 1003-1011.
[33] C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines, ACM Transactions on Intelligent Systems and Technology (TIST) 2(3) (2011), 27.
[34] M. Shamsfard, A. Hesabi, H. Fadaei, N. Mansoory, A. Famian, S. Bagherbeigi, E. Fekri, M. Monshizadeh and S.M. Assi, Semi automatic development of FarsNet; the Persian WordNet, in: Proceedings of 5th Global WordNet Conference, Mumbai, India, Vol. 29, 2010.
[35] B.B. Tillett, A Virtual International Authority File (2001).
[36] A. Swartz, MusicBrainz: A semantic web service, IEEE Intelligent Systems 17(1) (2002), 76-77.
[37] G.-S. Mai, Y.-H. Wang, Y.-J. Hsia, S.-S. Lu, C.-C. Lin et al., Linked Open Data of Ecology (LODE): a new approach for ecological data sharing, Taiwan Journal of Forest Science 26(4) (2011), 417-424.
[38] M. Dojchinovski and T. Vitvar, Linked web APIs dataset, Semantic Web (2018), 1-11.
[39] M. Krötzsch, D. Vrandecic and M. Völkel, Wikipedia and the semantic web - the missing links, in: Proceedings of Wikimania 2005, Citeseer, 2005.
[40] M. Völkel, M. Krötzsch, D. Vrandecic, H. Haller and R. Studer, Semantic Wikipedia, in: Proceedings of the 15th International Conference on World Wide Web, ACM, 2006, pp. 585-594.
[41] K. Bollacker, C. Evans, P. Paritosh, T. Sturge and J. Taylor, Freebase: a collaboratively created graph database for structuring human knowledge, in: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, ACM, 2008, pp. 1247-1250.
[42] F.M. Suchanek, G. Kasneci and G. Weikum, Yago: a core of semantic knowledge, in: Proceedings of the 16th International Conference on World Wide Web, ACM, 2007, pp. 697-706.
[43] S.P. Ponzetto and M. Strube, Deriving a Large Scale Taxonomy from Wikipedia, in: Proceedings of the 22nd Conference on the Advancement of Artificial Intelligence (AAAI), 2007, pp. 1440-1445.
[44] J. Hoffart, F.M. Suchanek, K. Berberich and G. Weikum, YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia, IJCAI International Joint Conference on Artificial Intelligence 194 (2013), 3161-3165.
[45] T. Rebele, F. Suchanek, J. Hoffart, J. Biega, E. Kuzey and G. Weikum, YAGO: a multilingual knowledge base from Wikipedia, WordNet and GeoNames, in: International Semantic Web Conference, Vol. 94, Springer, 2016, pp. 1-26.
[46] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak and Z. Ives, DBpedia: A nucleus for a web of open data, in: The Semantic Web, Springer, 2007, pp. 722-735.
[47] P.N. Mendes, M. Jakob and C. Bizer, DBpedia: A Multilingual Cross-Domain Knowledge Base, Language Resources and Evaluation (LREC) (2012), 1813-1817.
[48] R. Navigli and S.P. Ponzetto, BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network, Artificial Intelligence 193 (2012), 217-250.
[49] D. Vrandečić and M. Krötzsch, Wikidata: a free collaborative knowledgebase, Communications of the ACM 57(10) (2014), 78-85.
[50] A. Ismayilov, D. Kontokostas, S. Auer, J. Lehmann, S. Hellmann et al., Wikidata through the Eyes of DBpedia, Semantic Web (2018), 1-11.
[51] D.Z. Wang, Y. Chen, S. Goldberg, C. Grant and K. Li, Automatic knowledge base construction using probabilistic extraction, deductive reasoning, and human feedback, in: Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, Association for Computational Linguistics, 2012, pp. 106-110.
[52] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun and W. Zhang, Knowledge Vault: a web-scale approach to probabilistic knowledge fusion, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14), 2014, pp. 601-610.

[53] A. Halevy, Technical Perspective: Incremental Knowledge Base Construction Using DeepDive, SIGMOD Record 45 (2016).
[54] C. Zhang, C. Ré, M. Cafarella, C. De Sa, A. Ratner, J. Shin, F. Wang and S. Wu, DeepDive: Declarative knowledge base construction, Communications of the ACM 60(5) (2017), 93-102.
[55] D.B. Nguyen, A. Abujabal, N.K. Tran, M. Theobald and G. Weikum, Query-Driven On-The-Fly Knowledge Base Construction, Proceedings of the VLDB Endowment 11(1) (2017), 66-77.
[56] J. Rouces, G. De Melo and K. Hose, FrameBase: Enabling integration of heterogeneous knowledge, Semantic Web Journal 8(6) (2017), 817-850.
[57] C.F. Baker, C.J. Fillmore and J.B. Lowe, The Berkeley FrameNet project, in: Proceedings of the 17th International Conference on Computational Linguistics - Volume 1, Association for Computational Linguistics, 1998, pp. 86-90.
[58] M. Färber, B. Ell, C. Menne and A. Rettinger, A Comparative Survey of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO, Semantic Web 1 (2015), 1-5.
[59] H. Paulheim, Knowledge Graph Refinement: A Survey of Approaches and Evaluation Methods, Semantic Web (2015).
[60] D.B. Lenat, CYC: A large-scale investment in knowledge infrastructure, Communications of the ACM 38(11) (1995), 33-38.
[61] R. Blanco, B.B. Cambazoglu, P. Mika and N. Torzec, Entity recommendations in web search, in: International Semantic Web Conference, Springer, 2013, pp. 33-48.
[62] M. Rashid, M. Torchiano, G. Rizzo and N. Mihindukulasooriya, A Quality Assessment Approach for Evolving Knowledge Bases, Semantic Web Preprint (2018).
[63] G. Rizzo, R. Troncy, O. Corcho, A. Jameson, J. Plu, J.C.B. Hermida, A. Assaf, C. Barbu, A. Spirescu, K.-D. Kuhn et al., 3cixty@Expo Milano 2015: Enabling Visitors to Explore a Smart City (2015).
[64] A. Dimou, D. Kontokostas, M. Freudenberg, R. Verborgh, J. Lehmann, E. Mannens, S. Hellmann and R. Van de Walle, Assessing and refining mappings to RDF to improve dataset quality, in: International Semantic Web Conference, Springer, 2015, pp. 133-149.
[65] A. Dimou, D. Kontokostas, M. Freudenberg, R. Verborgh, J. Lehmann, E. Mannens and S. Hellmann, DBpedia mappings quality assessment, CEUR Workshop Proceedings 1690 (2016).
[66] A. Ahmeti, J.D. Fernández, A. Polleres and V. Savenkov, Updating Wikipedia via DBpedia mappings and SPARQL, Lecture Notes in Computer Science 10249 LNCS (2017), 485-501.
[67] X. Han, Collective Entity Linking in Web Text: A Graph-Based Method, SIGIR (2011), 765-774.
[68] P. Exner and P. Nugues, Entity extraction: From unstructured text to DBpedia RDF triples, CEUR Workshop Proceedings 906 (ISWC) (2012), 58-69.
[69] P. Exner and P. Nugues, Entity Extraction: From Unstructured Text to DBpedia RDF Triples, in: The Web of Linked Entities Workshop (WoLE 2012), 2012, pp. 58-69.
[70] S. Oramas, L. Espinosa-Anke, M. Sordo, H. Saggion and X. Serra, Information extraction for knowledge base construction in the music domain, Data & Knowledge Engineering 106 (2016), 70-83.
[71] A. Moro, A. Raganato and R. Navigli, Entity linking meets word sense disambiguation: a unified approach, Transactions of the Association for Computational Linguistics 2 (2014), 231-244.
[72] D.B. Nguyen, M. Theobald and G. Weikum, J-REED: Joint Relation Extraction and Entity Disambiguation, in: Proceedings of the 26th ACM International Conference on Information and Knowledge Management, 2017, pp. 2227-2230.
[73] M. Röder, R. Usbeck and A.-C. Ngonga Ngomo, GERBIL - Benchmarking Named Entity Recognition and Linking Consistently, Semantic Web (2017), 1-21.
[74] N. Bach and S. Badaskar, A survey on relation extraction, Language Technologies Institute, Carnegie Mellon University (2007).
[75] A. Akbik and J. Broß, Wanderlust: Extracting semantic relations from natural language text using dependency grammar patterns, CEUR Workshop Proceedings 491 (2009), 6-15.
[76] N.T. Nakashole, Automatic Extraction of Facts, Relations, and Entities for Web-Scale Knowledge Base Population, PhD thesis, University of Saarland, 2012.
[77] A. Madaan, A. Mittal, Mausam, G. Ramakrishnan and S. Sarawagi, Numerical Relation Extraction with Minimal Supervision, in: Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI 2016), 2016, pp. 2764-2771.
[78] V. Presutti, A.G. Nuzzolese, S. Consoli, D.R. Recupero and A. Gangemi, From hyperlinks to Semantic Web properties using Open Knowledge Extraction, Semantic Web Journal 7(4) (2016), 1-5.
[79] R. Speer, J. Chin and C. Havasi, ConceptNet 5.5: An Open Multilingual Graph of General Knowledge, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), 2017, pp. 4444-4451.
[80] P. Singh et al., The public acquisition of commonsense knowledge, in: Proceedings of AAAI Spring Symposium: Acquiring (and Using) Linguistic (and World) Knowledge for Information Access, 2002.
[81] J. Breen, JMDict: a Japanese-multilingual dictionary, in: Proceedings of the Workshop on Multilingual Linguistic Resources, Association for Computational Linguistics, 2004, pp. 71-79.
[82] D.-T. Vo and E. Bagheri, Open information extraction, Encyclopedia with Semantic Computing and Robotic Intelligence 01(01) (2017), 1630003.
[83] G. Stanovsky, J. Michael, L. Zettlemoyer and I. Dagan, Supervised Open Information Extraction, in: Proceedings of NAACL-HLT, Association for Computational Linguistics, New Orleans, Louisiana, pp. 885-895.
[84] I. Augenstein, D. Maynard and F. Ciravegna, Distantly supervised Web relation extraction for knowledge base population, Semantic Web Journal 7(4) (2016), 335-349.
[85] N. Heist and H. Paulheim, Language-agnostic relation extraction from Wikipedia abstracts, in: International Semantic Web Conference, Vol. 10587 LNCS, Springer, 2017, pp. 383-399.

1 ending language learning., in: AAAI, Vol. 5, Atlanta, 2010, 1 2 p. 3. 2 3 [87] S.D.S. Pedro, A.P. Appel and E.R. Hruschka, Autonomously 3 reviewing and validating the knowledge base of a never-ending 4 4 learning system, Proceedings of the 22nd International Con- 5 ference on World Wide Web - WWW ’13 Companion (2013), 5 6 1195–1204. 6 7 [88] A. Abujabal, R.S. Roy and G. Weikum, Never-Ending Learn- 7 8 ing for Open-Domain Question Answering over Knowledge 8 Bases, in: In Proceedings of the 2018 World Wide Web Confer- 9 9 ence on World Wide Web, International World Wide Web Con- 10 ferences Steering Committee, 2018., 2018, pp. 1053–1062. 10 11 [89] M. Fossati, E. Dorigatti and C. Giuliano, N-ary Relation Ex- 11 12 traction for Joint T-Box and A-Box Knowledge Base Augmen- 12 13 tation, Semantic Web Journal 0(0) (2015), 1–28. 13 [90] Y. Chen, D.Z. Wang and S. Goldberg, ScaLeKB: scalable 14 14 learning and inference over large knowledge bases, The VLDB 15 Journal 25(6) (2016), 893–918. 15 16 [91] S. Colucci, F.M. Donini and E. Di Sciascio, Reasoning over 16 17 RDF Knowledge Bases: where we are, in: Conference of the 17 18 Italian Association for Artificial Intelligence, Springer, 2017, 18 pp. 243–255. 19 19 [92] J.-H. Qian, X. Jin, Z.-J. Zhang and C. Shao, Construction of 20 Knowledge Base Based on Ontology, in: Proceedings of the 20 21 2017 International Conference on Wireless Communications, 21 22 Networking and Applications, ACM, 2017, pp. 77–83. 22 23 [93] Z. Quan and V. Haarslev, A parallel computing architecture 23 for high-performance OWL reasoning, Parallel Computing 24 24 (2018). 25 [94] S. Batsakis, E.G. Petrakis, I. Tachmazidis and G. Antoniou, 25 26 Temporal representation and reasoning in OWL 2, Semantic 26 27 Web 8(6) (2017), 981–1000. 27 28 28 29 29 30 30 31 31 32 32 33 33 34 34 35 35 36 36 37 37 38 38 39 39 40 40 41 41 42 42 43 43 44 44 45 45 46 46 47 47 48 48 49 49 50 50 51 51