Galician DBpedia: Resources and Applications in Language Processing

Procesamiento del Lenguaje Natural, Revista nº 57, September 2016, pp. 139-142. Received 14-03-2016, revised 13-04-2016, accepted 28-04-2016.

Original title: DBpedia del gallego: recursos y aplicaciones en procesamiento del lenguaje

Miguel Anxo Solla Portela, Xavier Gómez Guinovart
Universidade de Vigo, Grupo TALG

Abstract: In this presentation, we review the methodology used in the development of the Galician DBpedia and some of its applications for language processing in the fields of entity recognition and lexical extraction.

Keywords: DBpedia, Wikipedia, WordNet, linked open data, semantic web

1 Introduction

This article [1] describes the methodology followed in creating the Galician DBpedia and some of its applications in the field of language processing. The resource was built with funding from the Red de Investigación Tecnoloxías e análise dos datos lingüísticos, a research network devoted to developing resources for the linguistic processing of Galician. One of the network's main goals is to launch new applications and tools built on semantic technologies.

DBpedia [2] (Lehmann et al., 2015) is an international project to create a structured version of the contents of Wikipedia [3] and publish it freely on the Internet, interlinked with the set of knowledge bases that make up the semantic web.

DBpedia supports complex queries over the dataset derived from Wikipedia and allows these data to be linked with other datasets on the web, following the Linked Open Data [4] specifications established by the W3C (World Wide Web Consortium) (Auer et al., 2007).

2 Resources

The Galician DBpedia, developed and maintained by the TALG Group (Tecnoloxías e Aplicacións da Lingua Galega) at the Universidade de Vigo, contains 11 million semantic tuples extracted from all the information contained in the Galipedia [5], and is hosted on the official dbpedia.org subdomain for the Galician language [6].

Building the Galician DBpedia required adapting the application that extracts data from the dump files of Wikipedia, Wikimedia Commons [7] and Wikidata [8] so that it would work properly with Galipedia data. The changes made to the application code can be consulted on GitHub [9] and have already been incorporated into the main DBpedia extraction application [10].

Likewise, for the same purpose of building the resource, the conversion files (mappings) needed to obtain structured information from the infoboxes and navigation boxes of the Galipedia were written [11]. Although this task is still being completed, the coverage achieved by the work done so far is fairly broad, as the available DBpedia mapping statistics show [12]. The dataset has also been completed with the abstracts of the Galipedia articles linked to each resource.

The RDF files of the Galician DBpedia generated from the Galipedia can be freely downloaded from the DBpedia site [13]. Their contents can be queried and visualized on the group's website through the Lodview [14] and LodLive [15] applications (both localized into Galician as part of the project), through the adapted interface of DBpedia itself [16], or through the Virtuoso SPARQL endpoint to the structured data [17].

Publishing the SPARQL endpoint also prompted the modelling, as linked open data, of Galnet [18] (Solla Portela and Gómez Guinovart, 2015), the Galician WordNet 3.0 developed by the TALG Group, which is part of the Multilingual Central Repository (MCR) distribution (González Agirre, Laparra, and Rigau, 2012). The RDF version of Galnet can be queried through the SPARQL server of the Galician DBpedia using the graph http://sli.uvigo.gal/rdf_galnet.

The design of the RDF data structure was based on version 3.1 of the Princeton WordNet [19] and follows the lemon model [20], with slight modifications to the original in order to incorporate the links to the semantic classifications and ontologies present in the MCR and Galnet [21] and to preserve its multilingual nature through an interlingual index (ILI). In addition, to extend its coverage to external queries, each synset was aligned with its counterpart in Princeton version 3.1 and with version 3.0 in lemonUby format [22]. As a result of this alignment, the WordNet interlingual index present in the MCR is compatible with the numerous linked data sources already available on the semantic web.

3 Applications

3.1 DBpedia Spotlight

Once the resources had been built and open access to the structured data enabled, a version of the DBpedia Spotlight application (Daiber et al., 2013) adapted to Galician was developed, in order to offer a first tool for immediately exploiting the Galician DBpedia data in the field of language processing.

DBpedia Spotlight is a tool for annotating texts with references to DBpedia concepts. The forms that refer to concepts are identified in context by an adaptable system that automatically locates and disambiguates the mentions of DBpedia resources found in natural language. In this respect, the entity identification performed by DBpedia Spotlight has a less restricted scope than named entity recognition, which is usually limited to certain predefined categories such as persons, organizations and places.

The Galician adaptation of DBpedia Spotlight carried out in this project identifies and annotates references to Galician DBpedia concepts in texts, and can be used freely through its user interface [23] or as a web service [24].

[1] This research was carried out within the Red de Investigación Tecnoloxías e análise dos datos lingüísticos, funded by the Consellería de Cultura, Educación e Ordenación Universitaria of the Xunta de Galicia, ref. CN 2014/007.
[2] http://dbpedia.org
[3] http://wikipedia.org
[4] https://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
[5] http://gl.wikipedia.org
[6] http://gl.dbpedia.org
[7] https://commons.wikimedia.org
[8] https://www.wikidata.org
[9] https://github.com/galician/extraction-framework/
[10] https://github.com/dbpedia/extraction-framework/
[11] http://mappings.dbpedia.org/index.php/Mapping_gl
[12] http://mappings.dbpedia.org/server/statistics/gl/
[13] http://downloads.dbpedia.org/2015-10/core-i18n/gl/
[14] http://sli.uvigo.gal/dbpedia/lodview/
[15] http://sli.uvigo.gal/dbpedia/lodlive/
[16] https://github.com/dbpedia/dbpedia-vad-i18n
[17] http://gl.dbpedia.org/sparql/
[18] http://sli.uvigo.gal/galnet/
[19] http://wordnet-rdf.princeton.edu
[20] http://lemon-model.net
[21] Specifically, the WordNet Domains (Bentivogli et al., 2004), the Adimen-SUMO ontology (Álvez, Lucio, and Rigau, 2012), the Top Ontology (Álvez et al., 2008), the Basic Level Concepts (Izquierdo, Suárez, and Rigau, 2007) and the epinonyms (Solla Portela and Gómez Guinovart, 2015).
[22] http://lemon-model.net/lexica/uby/wn/
[23] http://sli.uvigo.gal/dbpedia/spotlight/
[24] https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Web-service

3.2 Lexical extraction

In order to test the possibilities of exploiting these LOD resources in other [...]

[...] of WordNet 3.0 obtained, and the related resources of the Galician DBpedia are proposed as variant candidates.

This strategy yielded 910 candidates with nominal variants pointing to synsets that did not yet have any variant in Galician. The precision obtained in the extraction experiment, after human review, reached 82.3%, as shown in the results in Table 1. During the review it was also observed that, apart from a few isolated cases in which the cross-language equivalence in DBpedia is incorrect, in most of the cases where validity cannot be established the error originates in an inadequate alignment between the DBpedia resource and the WordNet 3.1 identifier in BabelNet. Figure 1 illustrates this process of extracting Galnet variants from the LOD resources of DBpedia, BabelNet and Galnet with an example of an accepted candidate [27].

Figure 1: Variant extraction (1).
  BabelNet:    http://dbpedia.org/resource/Cairn | 107288507-n
  Galnet RDF:  107288507-n | ili-30-07273802-n
  DBpedia:     http://dbpedia.org/resource/Cair | http://gl.dbpedia.org/resource/Amilladoiro
  Candidate:   amilladoiro | ili-30-07273802-n
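The chain of links illustrated in Figure 1 can be read as a join over three lookups. The following is a minimal sketch of that reading, not the authors' implementation: the function names, the dictionary layout and the lowercasing of the resource name are hypothetical, and the sketch assumes that the truncated DBpedia URI in the figure is the same Cairn resource that appears in the BabelNet row. The paper does not state how the join was actually computed; with the published endpoints it would more plausibly be done through SPARQL queries.

```python
from typing import Dict, List, Set, Tuple
from urllib.parse import unquote


def lemma_from_resource(uri: str) -> str:
    """Derive a candidate lemma from the local name of a DBpedia resource URI.

    Lowercasing and underscore replacement are assumptions of this sketch,
    based only on the 'Amilladoiro' -> 'amilladoiro' pair shown in Figure 1.
    """
    local_name = unquote(uri.rsplit("/", 1)[-1])
    return local_name.replace("_", " ").lower()


def propose_variants(
    babelnet_wn31: Dict[str, str],   # English DBpedia URI -> WordNet 3.1 id (BabelNet)
    wn31_to_ili: Dict[str, str],     # WordNet 3.1 id -> ILI of the 3.0 synset (Galnet RDF)
    en_to_gl: Dict[str, str],        # English DBpedia URI -> Galician DBpedia URI
    covered_synsets: Set[str],       # synsets that already have a Galician variant
) -> List[Tuple[str, str]]:
    """Return (candidate lemma, ILI synset) pairs for synsets lacking a Galician variant."""
    candidates = []
    for en_resource, wn31_id in babelnet_wn31.items():
        ili = wn31_to_ili.get(wn31_id)
        gl_resource = en_to_gl.get(en_resource)
        if ili is None or gl_resource is None:
            continue  # one of the links in the chain is missing
        if ili in covered_synsets:
            continue  # only synsets without Galician variants are of interest
        candidates.append((lemma_from_resource(gl_resource), ili))
    return candidates


# The single example given in Figure 1 (identifiers copied from the figure).
BABELNET_WN31 = {"http://dbpedia.org/resource/Cairn": "107288507-n"}
WN31_TO_ILI = {"107288507-n": "ili-30-07273802-n"}
EN_TO_GL = {
    "http://dbpedia.org/resource/Cairn": "http://gl.dbpedia.org/resource/Amilladoiro"
}

candidates = propose_variants(BABELNET_WN31, WN31_TO_ILI, EN_TO_GL, set())
print(candidates)  # [('amilladoiro', 'ili-30-07273802-n')]
```

Candidates produced this way would still need the human review described above, since an incorrect BabelNet alignment or interlanguage link propagates directly into the proposed variant.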