BabelNet: Building a Very Large Multilingual Semantic Network


BabelNet: Building a Very Large Multilingual Semantic Network

Roberto Navigli, Dipartimento di Informatica, Sapienza Università di Roma, [email protected]
Simone Paolo Ponzetto, Department of Computational Linguistics, Heidelberg University, [email protected]

Abstract

In this paper we present BabelNet – a very large, wide-coverage multilingual semantic network. The resource is automatically constructed by means of a methodology that integrates lexicographic and encyclopedic knowledge from WordNet and Wikipedia. In addition, Machine Translation is also applied to enrich the resource with lexical information for all languages. We conduct experiments on new and existing gold-standard datasets to show the high quality and coverage of the resource.

1 Introduction

In many research areas of Natural Language Processing (NLP) lexical knowledge is exploited to perform tasks effectively. These include, among others, text summarization (Nastase, 2008), Named Entity Recognition (Bunescu and Paşca, 2006), Question Answering (Harabagiu et al., 2000) and text categorization (Gabrilovich and Markovitch, 2006). Recent studies in the difficult task of Word Sense Disambiguation (Navigli, 2009b, WSD) have shown the impact of the amount and quality of lexical knowledge (Cuadros and Rigau, 2006): richer knowledge sources can be of great benefit to both knowledge-lean systems (Navigli and Lapata, 2010) and supervised classifiers (Ng and Lee, 1996; Yarowsky and Florian, 2002).

Various projects have been undertaken to make lexical knowledge available in a machine-readable format. A pioneering endeavor was WordNet (Fellbaum, 1998), a computational lexicon of English based on psycholinguistic theories. Subsequent projects have also tackled the significant problem of multilinguality. These include EuroWordNet (Vossen, 1998), MultiWordNet (Pianta et al., 2002), the Multilingual Central Repository (Atserias et al., 2004), and many others. However, manual construction methods inherently suffer from a number of drawbacks. First, maintaining and updating lexical knowledge resources is expensive and time-consuming. Second, such resources are typically lexicographic, and thus contain mainly concepts and only a few named entities. Third, resources for non-English languages often have much poorer coverage, since the construction effort must be repeated for every language of interest. As a result, an obvious bias exists towards conducting research in resource-rich languages, such as English.

A solution to these issues is to draw upon a large-scale collaborative resource, namely Wikipedia [1]. Wikipedia represents the perfect complement to WordNet, as it provides multilingual lexical knowledge of a mostly encyclopedic nature. While the contribution of any individual user might be imprecise or inaccurate, the continual intervention of expert contributors in all domains results in a resource of the highest quality (Giles, 2005). But while a great deal of work has recently been devoted to the automatic extraction of structured information from Wikipedia (Wu and Weld, 2007; Ponzetto and Strube, 2007; Suchanek et al., 2008; Medelyan et al., 2009, inter alia), the knowledge extracted is organized in a looser way than in a computational lexicon such as WordNet.

In this paper, we make a major step towards the vision of a wide-coverage multilingual knowledge resource. We present a novel methodology that produces a very large multilingual semantic network: BabelNet. This resource is created by linking Wikipedia to WordNet via an automatic mapping and by integrating lexical gaps in resource-poor languages with the aid of Machine Translation. The result is an "encyclopedic dictionary" that provides concepts and named entities lexicalized in many languages and connected with large amounts of semantic relations.

[1] http://download.wikipedia.org. We use the English Wikipedia database dump from November 3, 2009, which includes 3,083,466 articles. Throughout this paper, we use Sans Serif for words, SMALL CAPS for Wikipedia pages and CAPITALS for Wikipedia categories.

Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 216–225, Uppsala, Sweden, 11–16 July 2010. © 2010 Association for Computational Linguistics.

[Figure 1: An illustrative overview of BabelNet. A babel synset for balloon (balloon_EN, Ballon_DE, aerostato_ES, globus_CA, pallone aerostatico_IT, ballon_FR, montgolfière_FR) is linked by is-a and has-part relations to concepts such as gasbag, hot-air balloon, wind and the Montgolfier brothers, drawing on Wikipedia sentences, SemCor sentences, WordNet and a Machine Translation system.]

2 BabelNet

We encode knowledge as a labeled directed graph G = (V, E) where V is the set of vertices – i.e. concepts [2] such as balloon – and E ⊆ V × R × V is the set of edges connecting pairs of concepts. Each edge is labeled with a semantic relation from R, e.g. {is-a, part-of, ..., ε}, where ε denotes an unspecified semantic relation. Importantly, each vertex v ∈ V contains a set of lexicalizations of the concept for different languages, e.g. { balloon_EN, Ballon_DE, aerostato_ES, ..., montgolfière_FR }.

[2] Throughout the paper, unless otherwise stated, we use the general term concept to denote either a concept or a named entity.

Concepts and relations in BabelNet are harvested from the largest available semantic lexicon of English, WordNet, and a wide-coverage collaboratively edited encyclopedia, the English Wikipedia (Section 3.1). We collect (a) from WordNet, all available word senses (as concepts) and all the semantic pointers between synsets (as relations); (b) from Wikipedia, all encyclopedic entries (i.e. pages, as concepts) and semantically unspecified relations from hyperlinked text.

In order to provide a unified resource, we merge the intersection of these two knowledge sources (i.e. their concepts in common) by establishing a mapping between Wikipedia pages and WordNet senses (Section 3.2). This avoids duplicate concepts and allows their inventories of concepts to complement each other. Finally, to enable multilinguality, we collect the lexical realizations of the available concepts in different languages by using (a) the human-generated translations provided in Wikipedia (the so-called inter-language links), as well as (b) a machine translation system to translate occurrences of the concepts within sense-tagged corpora, namely SemCor (Miller et al., 1993) – a corpus annotated with WordNet senses – and Wikipedia itself (Section 3.3). We call the resulting set of multilingual lexicalizations of a given concept a babel synset. An overview of BabelNet is given in Figure 1 (we label vertices with English lexicalizations): unlabeled edges are obtained from links in the Wikipedia pages (e.g. BALLOON (AIRCRAFT) links to WIND), whereas labeled ones come from WordNet (e.g. balloon_n^1 has-part gasbag_n^1).

In this paper we restrict ourselves to concepts lexicalized as nouns. Nonetheless, our methodology can be applied to all parts of speech, but in that case Wikipedia cannot be exploited, since it mainly contains nominal entities.

3 Methodology

3.1 Knowledge Resources

WordNet. The most popular lexical knowledge resource in the field of NLP is certainly WordNet, a computational lexicon of the English language. A concept in WordNet is represented as a synonym set (called synset), i.e. the set of words that share the same meaning. For instance, the concept wind is expressed by the following synset:

{ wind_n^1, air current_n^1, current of air_n^1 },

where each word's subscripts and superscripts indicate their parts of speech (e.g. n stands for noun) and sense number, respectively. For each synset, WordNet provides a textual definition, or gloss. For example, the gloss of the above synset is: "air moving from an area of high pressure to an area of low pressure".

[3] In the following we use WordNet version 3.0. We denote with w_p^i the i-th sense of a word w with part of speech p. We use word senses to unambiguously denote the corresponding synsets (e.g. plane_n^1 for { airplane_n^1, aeroplane_n^1, plane_n^1 }). Hereafter, we use word sense and synset interchangeably.

Wikipedia. Our second resource, Wikipedia, is a Web-based collaborative encyclopedia. A Wikipedia page (henceforth, Wikipage) presents the knowledge about a specific concept (e.g. BALLOON (AIRCRAFT)) or named entity (e.g. MONTGOLFIER BROTHERS). The page typically contains hypertext linked to other relevant Wikipages. For instance, BALLOON (AIRCRAFT) is linked to WIND, GAS, and so on. The title of a Wikipage (e.g. BALLOON (AIRCRAFT)) is composed of the lemma of the concept defined (e.g. balloon) plus an optional label in parentheses which specifies its meaning if the lemma is ambiguous (e.g. AIRCRAFT vs. TOY). Wikipages also provide inter-language links to their counterparts in other languages (e.g. BALLOON (AIRCRAFT) links to the Spanish page AEROSTATO). Finally, some Wikipages are redirections to other pages, e.g.

• Sense labels: e.g. given the page BALLOON (AIRCRAFT), the word aircraft is added to the disambiguation context.

• Links: the titles' lemmas of the pages linked from the target Wikipage (i.e., outgoing links). For instance, the links in the Wikipage BALLOON (AIRCRAFT) include wind, gas, etc.

• Categories: Wikipages are typically classified according to one or more categories. For example, the Wikipage BALLOON (AIRCRAFT) is categorized as BALLOONS, BALLOONING, etc. While many categories are very specific and do not appear in WordNet (e.g., SWEDISH WRITERS or SCIENTISTS WHO COMMITTED SUICIDE), we use their syntactic heads as disambiguation context (i.e. writer and scientist, respectively).

Given a Wikipage w, we define its disambiguation context Ctx(w) as the set of words obtained from all of the three sources above.

3.2.2 Disambiguation Context of a WordNet
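The labeled directed graph of Section 2, with vertices carrying multilingual lexicalizations and edges carrying a relation label from R or left unspecified, can be sketched as a small data structure. This is an illustrative reconstruction, not the authors' implementation; class and identifier names are invented for this example.

```python
from collections import defaultdict

class BabelNetSketch:
    """Toy labeled directed graph: vertices are concepts carrying
    multilingual lexicalizations; edges carry a relation label from R,
    or None for a semantically unspecified (Wikipedia-derived) link."""

    def __init__(self):
        self.lexicalizations = {}       # concept id -> {lang: [words]}
        self.edges = defaultdict(list)  # concept id -> [(relation, target)]

    def add_concept(self, concept, lex):
        self.lexicalizations[concept] = lex

    def add_edge(self, source, target, relation=None):
        # relation=None models the unspecified relation from hyperlinks
        self.edges[source].append((relation, target))

    def babel_synset(self, concept):
        """All multilingual lexicalizations of a concept (a babel synset)."""
        return {f"{w}_{lang}"
                for lang, words in self.lexicalizations[concept].items()
                for w in words}

g = BabelNetSketch()
g.add_concept("balloon#n#1", {"EN": ["balloon"], "DE": ["Ballon"],
                              "ES": ["aerostato"],
                              "FR": ["ballon", "montgolfière"]})
g.add_concept("gasbag#n#1", {"EN": ["gasbag"]})
g.add_concept("wind#n#1", {"EN": ["wind", "air current", "current of air"]})
g.add_edge("balloon#n#1", "gasbag#n#1", relation="has-part")  # WordNet-style labeled edge
g.add_edge("balloon#n#1", "wind#n#1")                         # Wikipedia-style unlabeled edge
```

Merging WordNet-labeled and Wikipedia-unlabeled edges into one edge list mirrors how BabelNet treats the two sources uniformly at the graph level.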
Recommended publications
  • Probabilistic Topic Modelling with Semantic Graph
    Probabilistic Topic Modelling with Semantic Graph. Long Chen, Joemon M. Jose, Haitao Yu, Fajie Yuan, and Huaizhi Zhang. School of Computing Science, University of Glasgow, Sir Alwyns Building, Glasgow, UK. [email protected]

    Abstract. In this paper we propose a novel framework, topic model with semantic graph (TMSG), which couples a topic model with the rich knowledge from DBpedia. To begin with, we extract the disambiguated entities from the document collection using a document entity linking system, i.e., DBpedia Spotlight, from which two types of entity graphs are created from DBpedia to capture local and global contextual knowledge, respectively. Given the semantic graph representation of the documents, we propagate the inherent topic-document distribution with the disambiguated entities of the semantic graphs. Experiments conducted on two real-world datasets show that TMSG can significantly outperform the state-of-the-art techniques, namely, author-topic model (ATM) and topic model with biased propagation (TMBP). Keywords: Topic model · Semantic graph · DBpedia

    1 Introduction. Topic models, such as Probabilistic Latent Semantic Analysis (PLSA) [7] and Latent Dirichlet Allocation (LDA) [2], have been remarkably successful in analyzing textual content. Specifically, each document in a document collection is represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. Such a paradigm is widely applied in various areas of text mining. In view of the fact that the information used by these models is limited to the document collection itself, some recent progress has been made on incorporating external resources, such as time [8], geographic location [12], and authorship [15], into topic models.
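The generative view summarized above, in which each document is a random mixture over latent topics and each topic a distribution over words, can be made concrete with a few lines of sampling code. This is a toy illustration of the general paradigm, not the TMSG model; topics, vocabulary and weights are invented:

```python
import random

random.seed(0)

# Two hand-built "topics": word distributions (toy values, not learned).
topics = {
    "sports": {"game": 0.5, "team": 0.3, "score": 0.2},
    "politics": {"vote": 0.4, "party": 0.4, "law": 0.2},
}

def sample_document(topic_mixture, length=10):
    """Draw each word by first drawing a topic, then a word from it."""
    words = []
    for _ in range(length):
        topic = random.choices(list(topic_mixture),
                               weights=list(topic_mixture.values()))[0]
        dist = topics[topic]
        words.append(random.choices(list(dist), weights=list(dist.values()))[0])
    return words

doc = sample_document({"sports": 0.7, "politics": 0.3})
```

Inference in PLSA or LDA runs this process in reverse, recovering the topic-word and document-topic distributions from observed documents.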
  • Semantic Memory: a Review of Methods, Models, and Current Challenges
    Psychonomic Bulletin & Review. https://doi.org/10.3758/s13423-020-01792-x. Semantic memory: A review of methods, models, and current challenges. Abhilasha A. Kumar. © The Psychonomic Society, Inc. 2020

    Abstract. Adult semantic memory has been traditionally conceptualized as a relatively static memory system that consists of knowledge about the world, concepts, and symbols. Considerable work in the past few decades has challenged this static view of semantic memory, and instead proposed a more fluid and flexible system that is sensitive to context, task demands, and perceptual and sensorimotor information from the environment. This paper (1) reviews traditional and modern computational models of semantic memory, within the umbrella of network (free association-based), feature (property generation norms-based), and distributional semantic (natural language corpora-based) models, (2) discusses the contribution of these models to important debates in the literature regarding knowledge representation (localist vs. distributed representations) and learning (error-free/Hebbian learning vs. error-driven/predictive learning), and (3) evaluates how modern computational models (neural network, retrieval-based, and topic models) are revisiting the traditional "static" conceptualization of semantic memory and tackling important challenges in semantic modeling, such as addressing temporal, contextual, and attentional influences, as well as incorporating grounding and compositionality into semantic representations. The review also identifies new challenges
  • Knowledge Graphs on the Web – an Overview (arXiv:2003.00719v3 [cs.AI])
    January 2020. Knowledge Graphs on the Web – an Overview. Nicolas Heist, Sven Hertling, Daniel Ringler, and Heiko Paulheim. Data and Web Science Group, University of Mannheim, Germany.

    Abstract. Knowledge Graphs are an emerging form of knowledge representation. While Google coined the term Knowledge Graph first and promoted it as a means to improve their search results, they are used in many applications today. In a knowledge graph, entities in the real world and/or a business domain (e.g., people, places, or events) are represented as nodes, which are connected by edges representing the relations between those entities. While companies such as Google, Microsoft, and Facebook have their own, non-public knowledge graphs, there is also a larger body of publicly available knowledge graphs, such as DBpedia or Wikidata. In this chapter, we provide an overview and comparison of those publicly available knowledge graphs, and give insights into their contents, size, coverage, and overlap. Keywords: Knowledge Graph, Linked Data, Semantic Web, Profiling

    1 Introduction. Knowledge Graphs are increasingly used as a means to represent knowledge. Due to their versatile means of representation, they can be used to integrate different heterogeneous data sources, both within as well as across organizations. [8,9] Besides such domain-specific knowledge graphs, which are typically developed for specific domains and/or use cases, there are also public, cross-domain knowledge graphs encoding common knowledge, such as DBpedia, Wikidata, or YAGO. [33] Such knowledge graphs may be used, e.g., for automatically enriching data with background knowledge to be used in knowledge-intensive downstream applications.
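A knowledge graph in the sense described above, entities as nodes and relations as edges, reduces to a set of (subject, predicate, object) triples. A minimal pattern-matching query over such triples might look like this (toy data, not DBpedia's actual API or vocabulary):

```python
# A tiny triple store: each fact is a (subject, predicate, object) edge.
triples = [
    ("Mannheim", "locatedIn", "Germany"),
    ("Heidelberg", "locatedIn", "Germany"),
    ("Germany", "partOf", "Europe"),
]

def query(s=None, p=None, o=None):
    """Return triples matching the given pattern; None is a wildcard."""
    return [(ts, tp, to) for ts, tp, to in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

# "Which entities are located in Germany?"
cities = [s for s, _, _ in query(p="locatedIn", o="Germany")]
```

SPARQL engines over DBpedia or Wikidata generalize exactly this wildcard-matching idea, with indexes and joins on top.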
  • Large Semantic Network Manual Annotation
    Large Semantic Network Manual Annotation. Václav Novák, Institute of Formal and Applied Linguistics, Charles University, Prague. [email protected]

    Abstract. This abstract describes a project aiming at manual annotation of the content of natural language utterances in parallel text corpora. The formalism used in this project is MultiNet – Multilayered Extended Semantic Network. The annotation should be incorporated into the Prague Dependency Treebank as a new annotation layer.

    1 Introduction. A formal specification of the semantic content is the aim of numerous semantic approaches such as TIL [6], DRT [9], MultiNet [4], and others. As far as we can tell, there is no large "real life" text corpus manually annotated with such markup. The projects usually work only with automatically generated annotation, if any [1, 6, 3, 2]. We want to create a parallel Czech-English corpus of texts annotated with the corresponding semantic network.

    1.1 Prague Dependency Treebank. From the linguistic viewpoint there are language resources such as the Prague Dependency Treebank (PDT) which contain a deep manual analysis of texts [8]. PDT contains annotations on three layers, namely morphological, analytical (shallow dependency syntax) and tectogrammatical (deep dependency syntax). The units of each annotation level are linked with corresponding units on the preceding level. The morphological units are linked directly with the original text. The theoretical basis of the treebank lies in the Functional Generative Description of language system [7]. PDT 2.0 is based on the long-standing Praguian linguistic tradition, adapted for the current computational-linguistics research needs. The corpus itself is embedded into the latest annotation technology.
  • Detecting Personal Life Events from Social Media
    Open Research Online: The Open University's repository of research publications and other research outputs. Detecting Personal Life Events from Social Media. Thesis. How to cite: Dickinson, Thomas Kier (2019). Detecting Personal Life Events from Social Media. PhD thesis, The Open University. © 2018 The Author. https://creativecommons.org/licenses/by-nc-nd/4.0/ Version: Version of Record. http://dx.doi.org/doi:10.21954/ou.ro.00010aa9

    Detecting Personal Life Events from Social Media: a thesis presented by Thomas K. Dickinson to The Department of Science, Technology, Engineering and Mathematics in partial fulfilment of the requirements for the degree of Doctor of Philosophy in the subject of Computer Science. The Open University, Milton Keynes, England, May 2019. Thesis advisors: Professor Harith Alani & Dr Paul Mulholland.

    Abstract. Social media has become a dominating force over the past 15 years, with the rise of sites such as Facebook, Instagram, and Twitter. Some of us have been with these sites since the start, posting all about our personal lives and building up a digital identity of ourselves. But within this myriad of posts, what actually matters to us, and what do our digital identities tell people about ourselves? One way that we can start to filter through this data is to build classifiers that can identify posts about our personal life events, allowing us to start to self-reflect on what we share online.
  • Universal Or Variation? Semantic Networks in English and Chinese
    Universal or variation? Semantic networks in English and Chinese

    Understanding the structures of semantic networks can provide great insights into lexico-semantic knowledge representation. Previous work reveals small-world structure in English, the structure that has the following properties: short average path lengths between words and strong local clustering, with a scale-free distribution in which most nodes have few connections while a small number of nodes have many connections [1]. However, it is not clear whether such semantic network properties hold across human languages. In this study, we investigate the universal structures and cross-linguistic variations by comparing the semantic networks in English and Chinese.

    Network description. To construct the Chinese and the English semantic networks, we used Chinese Open Wordnet [2,3] and English WordNet [4]. The two wordnets have different word forms in Chinese and English but common word meanings. Word meanings are connected not only to word forms, but also to other word meanings if they form relations such as hypernyms and meronyms (Figure 1).

    1. Cross-linguistic comparisons. Analysis: The large-scale structures of the Chinese and the English networks were measured with two key network metrics, small-worldness [5] and scale-free distribution [6]. Results: The two networks have similar size and both exhibit small-worldness (Table 1). However, the small-worldness is much greater in the Chinese network (σ = 213.35) than in the English network (σ = 83.15); this difference is primarily due to the higher average clustering coefficient (ACC) of the Chinese network. The scale-free distributions are similar across the two networks, as indicated by ANCOVA, F (1, 48) = 0.84, p = .37.
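The two ingredients of small-worldness discussed above, strong local clustering and short average path lengths, can be computed directly. Below is a stdlib-only sketch on a toy undirected graph; it is illustrative only, as the study above works with full wordnets and the σ small-worldness index against random baselines:

```python
from itertools import combinations
from collections import deque

graph = {  # a tiny undirected graph: a triangle a-b-c plus a tail c-d
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"},
}

def clustering(node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    nbrs = graph[node]
    if len(nbrs) < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in graph[u])
    return links / (len(nbrs) * (len(nbrs) - 1) / 2)

def shortest_path_length(src, dst):
    """Breadth-first search distance between two nodes."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, d = frontier.popleft()
        if node == dst:
            return d
        for nxt in graph[node] - seen:
            seen.add(nxt)
            frontier.append((nxt, d + 1))
    return float("inf")

avg_clustering = sum(clustering(n) for n in graph) / len(graph)
```

A small-world network combines a high `avg_clustering` (like a lattice) with short `shortest_path_length` values (like a random graph).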
  • Structure at Every Scale: a Semantic Network Account of the Similarities Between Very Unrelated Concepts
    De Deyne, S., Navarro, D. J., Perfors, A. and Storms, G. (2016). Structure at every scale: A semantic network account of the similarities between very unrelated concepts. Journal of Experimental Psychology: General, 145, 1228-1254. https://doi.org/10.1037/xge0000192

    Structure at every scale: A semantic network account of the similarities between unrelated concepts. Simon De Deyne, Danielle J. Navarro, Amy Perfors, University of Adelaide; Gert Storms, University of Leuven. Word Count: 19586

    Abstract. Similarity plays an important role in organizing the semantic system. However, given that similarity cannot be defined on purely logical grounds, it is important to understand how people perceive similarities between different entities. Despite this, the vast majority of studies focus on measuring similarity between very closely related items. When considering concepts that are very weakly related, little is known. In this paper we present four experiments showing that there are reliable and systematic patterns in how people evaluate the similarities between very dissimilar entities. We present a semantic network account of these similarities showing that a spreading activation mechanism defined over a word association network naturally makes correct predictions about weak similarities and the time taken to assess them, whereas, though simpler, models based on direct neighbors between word pairs derived using the same network cannot. Keywords: word associations, similarity, semantic networks, random walks.

    This work was supported by a research grant funded by the Research Foundation - Flanders (FWO), ARC grant DE140101749 awarded to the first author, and by the interdisciplinary research project IDO/07/002 awarded to Dirk Speelman, Dirk Geeraerts, and Gert Storms.
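A spreading-activation account of the kind described above can be sketched as repeated propagation of activation over a weighted word-association graph. This is a toy reconstruction under invented words and weights, not the authors' model:

```python
graph = {  # toy word-association network: word -> {associate: strength}
    "lion": {"tiger": 0.6, "mane": 0.4},
    "tiger": {"lion": 0.7, "stripes": 0.3},
    "mane": {"lion": 0.5, "hair": 0.5},
    "stripes": {"tiger": 1.0},
    "hair": {"mane": 1.0},
}

def spread(start, steps=3, decay=0.5):
    """Propagate activation from a seed word for a few steps,
    attenuating by `decay` at each hop."""
    activation = {start: 1.0}
    for _ in range(steps):
        nxt = dict(activation)
        for word, act in activation.items():
            for assoc, weight in graph.get(word, {}).items():
                nxt[assoc] = nxt.get(assoc, 0.0) + decay * act * weight
        activation = nxt
    return activation

act = spread("lion")
```

Words with no direct edge to the seed (here "stripes") still receive some activation through intermediate associates, which is what lets the mechanism predict weak similarities that direct-neighbor models miss.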
  • Untangling Semantic Similarity: Modeling Lexical Processing Experiments with Distributional Semantic Models
    Untangling Semantic Similarity: Modeling Lexical Processing Experiments with Distributional Semantic Models. Farhan Samir, Department of Computer Science, University of Toronto ([email protected]); Barend Beekhuizen, Department of Language Studies, University of Toronto, Mississauga ([email protected]); Suzanne Stevenson, Department of Computer Science, University of Toronto ([email protected])

    Abstract. Distributional semantic models (DSMs) are substantially varied in the types of semantic similarity that they output. Despite this high variance, the different types of similarity are often conflated as a monolithic concept in models of behavioural data. We apply the insight that word2vec's representations can be used for capturing both paradigmatic similarity (substitutability) and syntagmatic similarity (co-occurrence) to two sets of experimental findings (semantic priming and the effect of semantic neighbourhood density) that have previously been modeled with monolithic conceptions of DSM-based semantic similarity. Using paradigmatic and syntagmatic similarity based on word2vec, we show that for some tasks and types of items the two types of similarity play complementary explanatory roles, whereas for others, only syntagmatic similarity seems to matter.

    DSMs thought to instantiate one but not the other kind of similarity have been found to explain different priming effects on an aggregate level (as opposed to an item level). The underlying conception of semantic versus associative similarity, however, has been disputed (Ettinger & Linzen, 2016; McRae et al., 2012), as has one of the main ways in which it is operationalized (Günther et al., 2016). In this paper, we instead follow another distinction, based on distributional properties (e.g. Schütze & Pedersen, 1993), namely, that between syntagmatically related words (words that occur in each other's near proximity, such as drink–coffee), and paradigmatically related words (words that can be sub-
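The distinction drawn above can be illustrated with raw counts: syntagmatic similarity from direct co-occurrence, paradigmatic similarity from the overlap of context distributions. The corpus and the count-based measures below are a stdlib-only toy; the paper itself derives both from word2vec representations:

```python
from math import sqrt
from collections import Counter, defaultdict

corpus = [
    "drink hot coffee", "drink hot tea", "pour hot coffee",
    "pour hot tea", "drink cold water",
]

# Symmetric within-sentence co-occurrence counts.
cooc = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for c in words[:i] + words[i + 1:]:
            cooc[w][c] += 1

def syntagmatic(a, b):
    """Direct co-occurrence count: drink-coffee style relatedness."""
    return cooc[a][b]

def paradigmatic(a, b):
    """Cosine of context-count vectors: coffee-tea style substitutability."""
    shared = set(cooc[a]) | set(cooc[b])
    dot = sum(cooc[a][c] * cooc[b][c] for c in shared)
    norm_a = sqrt(sum(v * v for v in cooc[a].values()))
    norm_b = sqrt(sum(v * v for v in cooc[b].values()))
    return dot / (norm_a * norm_b)
```

In this toy corpus "coffee" and "tea" never co-occur, yet their context vectors are identical, so their paradigmatic similarity is maximal while their syntagmatic similarity is zero.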
  • Unsupervised, Knowledge-Free, and Interpretable Word Sense Disambiguation
    Unsupervised, Knowledge-Free, and Interpretable Word Sense Disambiguation. Alexander Panchenko¹, Fide Marten¹, Eugen Ruppert¹, Stefano Faralli², Dmitry Ustalov³, Simone Paolo Ponzetto², and Chris Biemann¹. ¹Language Technology Group, Department of Informatics, Universität Hamburg, Germany; ²Web and Data Science Group, Department of Informatics, Universität Mannheim, Germany; ³Institute of Natural Sciences and Mathematics, Ural Federal University, Russia. {panchenko,marten,ruppert,biemann}@informatik.uni-hamburg.de, {simone,faralli}@informatik.uni-mannheim.de, [email protected]

    Abstract. Interpretability of a predictive model is a powerful feature that gains the trust of users in the correctness of the predictions. In word sense disambiguation (WSD), knowledge-based systems tend to be much more interpretable than knowledge-free counterparts as they rely on the wealth of manually-encoded elements representing word senses, such as hypernyms, usage examples, and images. We present a WSD system that bridges the gap between these two so far disconnected groups of methods. Namely, our system, providing access to several state-of-the-art WSD models, aims to be interpretable as a knowledge-

    manually in one of the underlying resources, such as Wikipedia. Unsupervised knowledge-free approaches, e.g. (Di Marco and Navigli, 2013; Bartunov et al., 2016), require no manual labor, but the resulting sense representations lack the above-mentioned features enabling interpretability. For instance, systems based on sense embeddings are based on dense uninterpretable vectors. Therefore, the meaning of a sense can be interpreted only on the basis of a list of related senses. We present a system that brings interpretability of the knowledge-based sense representations into the world of unsupervised knowledge-free WSD models. The contribution of this paper is the first system for word sense induction and disambiguation, which is unsupervised, knowledge-free, and interpretable at the same time.
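A minimal example of the interpretable, knowledge-based style of disambiguation contrasted above is Lesk-style gloss overlap: pick the sense whose human-readable definition shares the most words with the context. This is a textbook baseline, not the system described in the paper; the sense inventory is invented:

```python
senses = {  # toy sense inventory with human-readable glosses
    "bank#1": "sloping land beside a body of water such as a river",
    "bank#2": "a financial institution that accepts deposits and lends money",
}

def disambiguate(word_senses, context):
    """Return the sense whose gloss overlaps most with the context words."""
    ctx = set(context.lower().split())

    def overlap(sense):
        return len(ctx & set(word_senses[sense].split()))

    return max(word_senses, key=overlap)

best = disambiguate(senses, "she sat on the bank of the river to watch the water")
```

The prediction is interpretable precisely because the evidence, which gloss words matched the context, can be shown to the user, unlike a dense sense-embedding score.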
  • Ten Years of BabelNet: A Survey
    Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI-21), Survey Track. Ten Years of BabelNet: A Survey. Roberto Navigli¹, Michele Bevilacqua¹, Simone Conia¹, Dario Montagnini² and Francesco Cecconi². ¹Sapienza NLP Group, Sapienza University of Rome, Italy; ²Babelscape, Italy. {roberto.navigli, michele.bevilacqua, conia}@uniroma1.it, {montagnini, cecconi}@babelscape.com

    Abstract. The intelligent manipulation of symbolic knowledge has been a long-sought goal of AI. However, when it comes to Natural Language Processing (NLP), symbols have to be mapped to words and phrases, which are not only ambiguous but also language-specific: multilinguality is indeed a desirable property for NLP systems, and one which enables the generalization of tasks where multiple languages need to be dealt with, without translating text. In this paper we survey BabelNet, a popular wide-coverage lexical-semantic knowledge resource obtained by merging heterogeneous sources into a unified semantic network that helps to scale tasks and applications to hundreds of languages.

    to integrate symbolic knowledge into neural architectures [d'Avila Garcez and Lamb, 2020]. The rationale is that the use of, and linkage to, symbolic knowledge can not only enable interpretable, explainable and accountable AI systems, but it can also increase the degree of generalization to rare patterns (e.g., infrequent meanings) and promote better use of information which is not explicit in the text. Symbolic knowledge requires that the link between form and meaning be made explicit, connecting strings to representations of concepts, entities and thoughts. Historical resources such as WordNet [Miller, 1995] are important endeavors which systematize symbolic knowledge about the words of a language, i.e., lexicographic knowledge, not only in a machine-readable format, but also in structured form, thanks to the organization of concepts into a semantic network.
  • Ontology-Based Approach to Semantically Enhanced Question Answering for Closed Domain: a Review
    Information (Review). Ontology-Based Approach to Semantically Enhanced Question Answering for Closed Domain: A Review. Ammar Arbaaeen¹,* and Asadullah Shah². ¹Department of Computer Science, Faculty of Information and Communication Technology, International Islamic University Malaysia, Kuala Lumpur 53100, Malaysia; ²Faculty of Information and Communication Technology, International Islamic University Malaysia, Kuala Lumpur 53100, Malaysia; [email protected]. *Correspondence: [email protected]

    Abstract: For many users of natural language processing (NLP), it can be challenging to obtain concise, accurate and precise answers to a question. Systems such as question answering (QA) enable users to ask questions and receive feedback in the form of quick answers to questions posed in natural language, rather than in the form of lists of documents delivered by search engines. This task is challenging and involves complex semantic annotation and knowledge representation. This study reviews the literature detailing ontology-based methods that semantically enhance QA for a closed domain, by presenting a literature review of the relevant studies published between 2000 and 2020. The review reports that 83 of the 124 papers considered acknowledge the QA approach, and recommend its development and evaluation using different methods. These methods are evaluated according to accuracy, precision, and recall. An ontological approach to semantically enhancing QA is found to be adopted in a limited way, as many of the studies reviewed concentrated instead on NLP and information retrieval (IR) processing. While the majority of the studies reviewed focus on open domains, this study investigates the closed domain.
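The three evaluation measures used across the reviewed studies, accuracy, precision, and recall, are straightforward to compute from a system's answers against a gold standard. The question IDs and labels below are invented for illustration:

```python
# Gold answers vs. a hypothetical QA system's predictions.
gold = {"q1": "yes", "q2": "no", "q3": "yes", "q4": "yes"}
predicted = {"q1": "yes", "q2": "yes", "q3": "yes", "q4": "no"}

# Treat "yes" as the positive class.
tp = sum(1 for q in gold if gold[q] == "yes" and predicted[q] == "yes")
fp = sum(1 for q in gold if gold[q] == "no" and predicted[q] == "yes")
fn = sum(1 for q in gold if gold[q] == "yes" and predicted[q] == "no")

accuracy = sum(gold[q] == predicted[q] for q in gold) / len(gold)
precision = tp / (tp + fp)   # of the answers given, how many were right
recall = tp / (tp + fn)      # of the right answers, how many were given
```

Precision and recall pull apart exactly when a system trades answering fewer questions for answering them more reliably, which is why reviews report both.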
  • KOI at SemEval-2018 Task 5: Building Knowledge Graph of Incidents
    KOI at SemEval-2018 Task 5: Building Knowledge Graph of Incidents. Paramita Mirza¹, Fariz Darari²*, Rahmad Mahendra²*. ¹Max Planck Institute for Informatics, Germany; ²Faculty of Computer Science, Universitas Indonesia, Indonesia. [email protected], {fariz,rahmad.mahendra}@cs.ui.ac.id

    Abstract. We present KOI (Knowledge of Incidents), a system that, given news articles as input, builds a knowledge graph (KOI-KG) of incidental events. KOI-KG can then be used to efficiently answer questions such as "How many killing incidents happened in 2017 that involve Sean?" The required steps in building the KG include: (i) document preprocessing involving word sense disambiguation, named-entity recognition, temporal expression recognition and normalization, and semantic role labeling; (ii) incidental event extraction and coreference resolution via document clustering; and (iii) KG construction and population.

    Subtask S3. In order to answer questions of type (ii), participating systems are also required to identify participant roles in each identified answer incident (e.g., victim, subject-suspect), and use such information along with victim-related numerals ("three people were killed") mentioned in the corresponding answer documents, i.e., documents that report on the answer incident, to determine the total number of victims.

    Datasets. The organizers released two datasets: (i) test data, stemming from three domains of gun violence, fire disasters and business, and (ii) trial data, covering only the gun violence domain. Each dataset
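The kind of question the abstract's knowledge graph answers can be mimicked over a toy set of extracted incident records. The data and record layout below are invented stand-ins for KOI-KG's actual structure:

```python
incidents = [  # toy stand-ins for extracted incidental events
    {"type": "killing", "year": 2017, "participants": {"Sean": "victim"}},
    {"type": "killing", "year": 2017, "participants": {"Sean": "subject-suspect"}},
    {"type": "fire", "year": 2017, "participants": {"Sean": "victim"}},
    {"type": "killing", "year": 2016, "participants": {"Ann": "victim"}},
]

def count_incidents(etype, year, participant):
    """Count incidents of a given type, in a given year, involving a participant."""
    return sum(1 for ev in incidents
               if ev["type"] == etype and ev["year"] == year
               and participant in ev["participants"])

# "How many killing incidents happened in 2017 that involve Sean?"
answer = count_incidents("killing", 2017, "Sean")
```

Storing participant roles per incident is what makes the finer-grained S3-style questions (victims only, suspects only) answerable with the same records.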