From Information Extraction to Knowledge Discovery: Semantic Enrichment of Multilingual Content with Linked Open Data

Total Page:16

File Type:pdf, Size:1020Kb

From Information Extraction to Knowledge Discovery: Semantic Enrichment of Multilingual Content with Linked Open Data From Information Extraction to Knowledge Discovery: Semantic Enrichment of Multilingual Content with Linked Open Data Th`esepr´esent´eeen vue de l'obtention du grade acad´emiquede Docteur en Information et communication [Max De Wilde] Promoteur Prof. Isabelle Boydens Co-promoteur Prof. Pierrette Bouillon Ann´eeacad´emique 2015{2016 “Prise entre deux langues, la pensée n’est pas réduite à un double servage ; au contraire, elle tire de cette confrontation permanente un regain de liberté.” Hélène Monsacré “We don’t need more information. We need more meaning.” Paul Salopek Contents Introduction......................................1 1 Motivation..............................3 2 Objectives..............................7 3 Structure............................... 15 I Information Extraction.............................. 19 1 Background............................. 21 2 Named-Entity Recognition..................... 29 3 Relations and Events........................ 40 II Semantic Enrichment with Linked Data................... 51 1 Making Sense of the Web...................... 53 2 Semantic Resources......................... 64 3 Enriching Content.......................... 76 III The Humanities and Empirical Content................... 89 1 Empirical Information....................... 91 2 Digital Humanities......................... 96 3 Historische Kranten......................... 107 IV Quality, Language, and Time........................... 123 1 Data Quality............................. 125 2 Multilingualism........................... 137 3 Language Evolution......................... 147 V Knowledge Discovery................................ 157 1 MERCKX: A Knowledge Extractor................. 159 2 Evaluation.............................. 175 3 Validation.............................. 190 Conclusions...................................... 207 1 Overview............................... 209 2 Outcomes.............................. 213 3 Perspectives............................. 218 A Source Code...................................... 223 B Guidelines for Annotators............................ 225 C Follow-up........................................ 226 Detailed Contents.................................. 227 Bibliography...................................... 233 i Acknowledgements I would like to express my gratitude on the one hand to Pierrette Bouillon whose Master course of Ingénierie linguistique allowed me to get acquainted with natural language processing – and who was also the first one to suggest I take the job as a teaching assistant which financed this thesis – and on the other hand to Isabelle Boydens and Françoise D’Hautcourt for making a bet on me after a quick, last-minute interview one evening in late August, 2009. Isabelle and Pierrette proved to be patient, understanding, and complemen- tary supervisors: I sincerely thank them both for their precious counselling and continuing support. My interest in NLP subsequently led me to enrol, concurrently with my doctorate, for an advanced Master in computational linguistics at Universiteit Antwerpen, where Walter Daelemans successfully managed to rid me of all the administrative obstacles I encountered. Under his committed supervision and the mentoring of Roser Morante, I completed a highly rewarding intensive course in biomedical information extraction and attended my first interna- tional workshop in Cambridge in 2011. Later on, in 2013, I met Tobias Blanke during another workshop at the CEGES and was pleased to discover that we shared many research interests, from named-entity recognition to OCR and from the digital humanities to graph-based approaches. Walter and Tobias readily accepted to sit on my thesis committee, which I must admit makes me feel honoured. My thanks go to Liesbeth Thiers, Hilde Cuyt, and Renee Mestdagh from the CO7 Erfgoedcel, and especially to Jochen Vermote from the Ypres City Archive, for welcoming me in their work environment and entrusting me with their dataset and analytics, even allowing me on first encounter to drive home with their 1TB hard drive of data, to be brought back the following week! They had to wait for a few years before getting any tangible results, I hope they are not disappointed with the outcome of the thesis. Thanks also to Robert Tiessen and Walter Tromp from Picturae for inviting me to their headquarters in Heiloo, near Amsterdam, for a presentation and in-depth talk. iii iv Acknowledgements Cheers to my colleagues Seth van Hooland and Ruben Verborgh for all our trips and talks as part of the Free Your Metadata world tour, from 8:00 a.m. pastrami with Amalia in a deserted diner to Marvin Gaye in cheap hotel rooms, and from Whit Monday library sequestration to ukulele recordings in Seth’s living room. Special thanks to Ruben for asking me to be co-author of Using OpenRefine, which was a great collaborative experience and also a (morally, if not financially) rewarding achievement. Thanks to my fellow PhD students Laurence, Raphaël and Simon (in al- phabetical order) for partially relieving me of my teaching and administrative chores in the final months of my redacting, I doubt I could have made it in time without them. Having worked alone for the first four years, I can appreciate the added value of team work and constant brainstorming on matters ranging from the best definition of ontologies to the choice of the cutest lemur picture for a new research project. I am definitely indebted to my family for providing constant support – to my dad in particular for technical assistance, from language detection to Python helpdesk, and to my mum for reading this work three times over with- out getting bored to death. To my brother Sam and his fiancée Vané then for delaying their wedding just after my submission (or is it the other way round?), and to my brother Tom for our not-so-productive-but-still-rewarding working nights while he wrote his Master thesis. Big thanks also to my other diligent proof-readers: Simon again, Lionel, and my uncle Marc. Of course my gratefulness goes to my partner Hélène for encouraging, sup- porting and putting up with me during these six long years. She anticipated right from the start that I should have begun the redaction much sooner, so I’m extra grateful for her having granted me long undisturbed periods of writ- ing time that proved crucial for the finalisation of this dissertation. I deeply admire her for the patience she could keep with the whole process. I feel these acknowledgements would not be complete without a quick nod to the 7e tasse teashop which provided me the fuel to go on. I obviously cannot count the number of cups gulped since 2009 but 13 000 is my rough guess, which averages to about 50 for every page written. Music-wise, Mr. Django remained as great a source of inspiration in late night hours, as ever. And last but not least, this one goes out to my 2-year-old daughter Jeanne for regularly diverting me from my work, which could sometimes be annoying but in the end proved a blessing in disguise, allowing me to put things into perspective... List of Figures 1 Disambiguation and enrichment with a knowledge base.... 13 I.1 Illustration of the Turing test.................... 22 I.2 Information retrieval........................ 25 I.3 Information extraction....................... 25 I.4 Text mining.............................. 27 I.5 SoNaR named entity typology................... 33 I.6 Illustration of the Ypres Salient.................. 40 I.7 Candidate event for template filling................ 48 II.1 DIKW Pyramid............................ 55 II.2 The Semantic Web layer cake................... 57 II.3 Overly complex TEI workflow................... 59 II.4 Snapshot of the Linked Open Data cloud............. 63 II.5 Preview of the “Ostend” resource in DBpedia.......... 65 II.6 Google’s Knowledge Graph..................... 77 II.7 Arbitrariness of the linguistic sign................. 78 III.1 Close and distant reading...................... 101 III.2 Browsing named entities...................... 102 III.3 The Hype cycle............................ 104 III.4 Historische Kranten homepage................... 107 III.5 Bilingual article from De Handboog................ 116 III.6 Noisy Google Analytics data.................... 119 IV.1 Quality trade-off........................... 127 IV.2 OCRised article from the Messager d’Ypres............ 129 IV.3 Problematic NER due to multilingual ambiguity......... 141 IV.4 Evolution of language........................ 151 IV.5 Relative salience of terms in US inaugural addresses...... 154 V.1 NERD platform........................... 161 v vi List of Figures V.2 Calais Viewer............................. 163 V.3 Babelfy................................ 166 V.4 Natural Language Toolkit pipeline................. 167 V.5 X-Link................................. 168 V.6 SemEval workflow.......................... 175 V.7 From corpus to knowledge..................... 191 V.8 Place Browser............................ 195 V.9 People Finder............................ 196 V.10 Discovering related entities..................... 197 V.11 Geographical dispersion of Perelman’s correspondents..... 205 Unless otherwise stated, all images are either own work, screenshots from referenced websites, or in the public domain. Images under Creative Commons licences are indicated with one of these: CC BY-SA (Attribution – Share Alike), CC BY-NC-SA (Attribution – Non Commercial – Share Alike), or CC BY-NC-ND (Attribution – Non Commercial – No Derivatives). Visit https://creativecommons.org/
Recommended publications
  • Amber Billey Senylrc 4/1/2016
    AMBER BILLEY SENYLRC 4/1/2016 Photo by twm1340 - Creative Commons Attribution-ShareAlike License https://www.flickr.com/photos/89093669@N00 Created with Haiku Deck Today’s slides: http://bit.ly/SENYLRC_BF About me... ● Metadata Librarian ● 10 years of cataloging and digitization experience for cultural heritage institutions ● MLIS from Pratt in 2009 ● Zepheira BIBFRAME alumna ● Curious by nature [email protected] @justbilley Photo by GBokas - Creative Commons Attribution-NonCommercial-ShareAlike License https://www.flickr.com/photos/36724091@N07 Created with Haiku Deck http://digitalgallery.nypl.org/nypldigital/id?1153322 02743cam 22004094a 45000010008000000050017000080080041000250100017000660190014000830200031000970200028001280200 03800156020003500194024001600229035002400245035003900269035001700308040005300325050002200378 08200120040010000300041224500790044225000120052126000510053330000350058449000480061950400640 06675050675007315200735014066500030021416500014021717000023021858300049022089000014022579600 03302271948002902304700839020090428114549.0080822s2009 ctua b 001 0 eng a 2008037446 a236328594 a9781591585862 (alk. paper) a1591585864 (alk. paper) a9781591587002 (pbk. : alk. paper) a159158700X (pbk. : alk. paper) a99932583184 a(OCoLC) ocn236328585 a(OCoLC)236328585z(OCoLC)236328594 a(NNC)7008390 aDLCcDLCdBTCTAdBAKERdYDXCPdUKMdC#PdOrLoB-B00aZ666.5b.T39 200900a0252221 aTaylor, Arlene G., d1941-14aThe organization of information /cArlene G. Taylor and Daniel N. Joudrey. a3rd ed. aWestport, Conn. :bLibraries Unlimited,c2009. axxvi,
    [Show full text]
  • Документ – Книга – Семантический Веб: Вклад Старой Науки О Документации Milena Tsvetkova
    Документ – книга – семантический веб: вклад старой науки о документации Milena Tsvetkova To cite this version: Milena Tsvetkova. Документ – книга – семантический веб: вклад старой науки о документации. Scientific Enquiry in the Contemporary World: Theoretical basiсs and innovative approach, 2016,San Francisco, United States. pp.115-128, ⟨10.15350/L_26/7/02⟩. ⟨hal-01687965⟩ HAL Id: hal-01687965 https://hal.archives-ouvertes.fr/hal-01687965 Submitted on 31 Jan 2018 HAL is a multi-disciplinary open access archive L’archive ouverte pluridisciplinaire HAL, est des- for the deposit and dissemination of scientific re- tinée au dépôt et à la diffusion de documents scien- search documents, whether they are published or not. tifiques de niveau recherche, publiés ou non, émanant The documents may come from teaching and research des établissements d’enseignement et de recherche institutions in France or abroad, or from public or pri- français ou étrangers, des laboratoires publics ou vate research centers. privés. RESEARCH ARTICLES. SOCIOLOGICAL SCIENCES DOCUMENT – BOOK – SEMANTIC WEB: OLD SCIENCE DOCUMENTATION`S CONTRIBUTION M. Tsvetkova1 DOI: http://doi.org/10.15350/L_26/7/02 Abstract The key focus of the research is the contribution of the founder of the documentation science, Paul Otlet, to the evolution of communication technologies and media. Methods used: systematic-mediological approach to research in book studies, documentology, information and communication sciences, retrospective discourse analysis of documents, studies, monographs. The way in which Otlet presents arguments about the document “book” as the base technology of the universal documentation global network is reviewed chronologically. His critical thinking has been established: by expanding the definition of a “book”, he projects its future transformation in descending order – from The Universal Book of Knowledge (Le Livre universel de la Science) through the “thinking machine” (machine à penser), to its breakdown to Biblions, the smallest building blocks of written knowledge.
    [Show full text]
  • ISO/TC46 (Information and Documentation) Liaison to IFLA
    ISO/TC46 (Information and Documentation) liaison to IFLA Annual Report 2015 TC46 on Information and documentation has been leading efforts related to information management since 1947. Standards1 developed under ISO/TC46 facilitate access to knowledge and information and standardize automated tools, computer systems, and services relating to its major stakeholders of: libraries, publishing, documentation and information centres, archives, records management, museums, indexing and abstracting services, and information technology suppliers to these communities. TC46 has a unique role among ISO information-related committees in that it focuses on the whole lifecycle of information from its creation and identification, through delivery, management, measurement, and archiving, to final disposition. *** The following report summarizes activities of TC46, SC4, SC8 SC92 and their resolutions of the annual meetings3, in light of the key-concepts of interest to the IFLA community4. 1. SC4 Technical interoperability 1.1 Activities Standardization of protocols, schemas, etc. and related models and metadata for processes used by information organizations and content providers, including libraries, archives, museums, publishers, and other content producers. 1.2 Active Working Group WG 11 – RFID in libraries WG 12 – WARC WG 13 – Cultural heritage information interchange WG 14 – Interlibrary Loan Transactions 1.3 Joint working groups 1 For the complete list of published standards, cfr. Appendix A. 2 ISO TC46 Subcommittees: TC46/SC4 Technical interoperability; TC46/SC8 Quality - Statistics and performance evaluation; TC46/SC9 Identification and description; TC46/SC 10 Requirements for document storage and conditions for preservation - Cfr Appendix B. 3 The 42nd ISO TC46 plenary, subcommittee and working groups meetings, Beijing, June 1-5 2015.
    [Show full text]
  • A Comparison of Knowledge Extraction Tools for the Semantic Web
    A Comparison of Knowledge Extraction Tools for the Semantic Web Aldo Gangemi1;2 1 LIPN, Universit´eParis13-CNRS-SorbonneCit´e,France 2 STLab, ISTC-CNR, Rome, Italy. Abstract. In the last years, basic NLP tasks: NER, WSD, relation ex- traction, etc. have been configured for Semantic Web tasks including on- tology learning, linked data population, entity resolution, NL querying to linked data, etc. Some assessment of the state of art of existing Knowl- edge Extraction (KE) tools when applied to the Semantic Web is then desirable. In this paper we describe a landscape analysis of several tools, either conceived specifically for KE on the Semantic Web, or adaptable to it, or even acting as aggregators of extracted data from other tools. Our aim is to assess the currently available capabilities against a rich palette of ontology design constructs, focusing specifically on the actual semantic reusability of KE output. 1 Introduction We present a landscape analysis of the current tools for Knowledge Extraction from text (KE), when applied on the Semantic Web (SW). Knowledge Extraction from text has become a key semantic technology, and has become key to the Semantic Web as well (see. e.g. [31]). Indeed, interest in ontology learning is not new (see e.g. [23], which dates back to 2001, and [10]), and an advanced tool like Text2Onto [11] was set up already in 2005. However, interest in KE was initially limited in the SW community, which preferred to concentrate on manual design of ontologies as a seal of quality. Things started changing after the linked data bootstrapping provided by DB- pedia [22], and the consequent need for substantial population of knowledge bases, schema induction from data, natural language access to structured data, and in general all applications that make joint exploitation of structured and unstructured content.
    [Show full text]
  • National Standardization Plan 2019-2022
    FINAL APRIL 2020 NATIONAL STANDARDIZATION PLAN 2019-2022 Table of Contents 1 Introduction ............................................................................................................................................................. 2 2 Background ............................................................................................................................................................. 4 3 Methodology ............................................................................................................................................................ 4 3.1 Economic Priorities (Economic Impact Strategy) ................................................................................ 5 3.2 Government Policy Priorities ............................................................................................................ 12 3.3 Non-Economic Priorities (Social Impact Strategy) ............................................................................ 14 3.4 Stakeholders requests (Stakeholder Engagement Strategy) ............................................................. 15 3.5 Selected Sectors of Standardization and Expected Benefits ............................................................. 16 3.5.2 Benefits of Selected Sectors and Sub-Sectors of Standardization ................................................ 17 4 Needed Human and Financial Resources and Work Items Implementation Plan............................................ 19 4.1 Human Resources by Type of Work Item and Category ...................................................................
    [Show full text]
  • A Combined Method for E-Learning Ontology Population Based on NLP and User Activity Analysis
    A Combined Method for E-Learning Ontology Population based on NLP and User Activity Analysis Dmitry Mouromtsev, Fedor Kozlov, Liubov Kovriguina and Olga Parkhimovich ITMO University, St. Petersburg, Russia [email protected], [email protected], [email protected], [email protected] Abstract. The paper describes a combined approach to maintaining an E-Learning ontology in dynamic and changing educational environment. The developed NLP algorithm based on morpho-syntactic patterns is ap- plied for terminology extraction from course tasks that allows to interlink extracted terms with the instances of the system’s ontology whenever some educational materials are changed. These links are used to gather statistics, evaluate quality of lectures’ and tasks’ materials, analyse stu- dents’ answers to the tasks and detect difficult terminology of the course in general (for the teachers) and its understandability in particular (for every student). Keywords: Semantic Web, Linked Learning, terminology extraction, education, educational ontology population 1 Introduction Nowadays reusing online educational resources becomes one of the most promis- ing approaches for e-learning systems development. A good example of using semantics to make education materials reusable and flexible is the SlideWiki system[1]. The key feature of an ontology-based e-learning system is the possi- bility for tutors and students to treat elements of educational content as named objects and named relations between them. These names are understandable both for humans as titles and for the system as types of data. Thus educational materials in the e-learning system thoroughly reflect the structure of education process via relations between courses, modules, lectures, tests and terms.
    [Show full text]
  • Die Welt in 100 Jahren En
    "Everyone will have his own pocket telephone that will enable him to get in touch with anyone he wishes. People living in the Wireless Age will be able to go everywhere with their transceivers, which they will be able to affix wherever they like—to their hat, for instance…" Robert Sloss: The Wireless Century in “The World in 100 Years,” Berlin 1910 The World in 100 Years A Journey through the History of the Future June 16-September 19, 2010 Ars Electronica Center Linz (Linz, June 16, 2010) Our longing to know the future is timeless. Just like our burning desire to co-determine and change the course of events transpiring in this world. The exhibition “The World in 100 Years – A Journey through the History of the Future” pays tribute to some great thinkers and activists who were ahead of their times, men and women who have displayed creativity, courage and resourcefulness in their commitment to a vision of the future. We begin by presenting Albert Robida (FR) and Paul Otlet (BE), two prominent visionaries of the late 19 th and early 20 th centuries. Then we shift the spotlight to contemporary artists and scientists and their NEXT IDEAS. “The World in 100 Years” will run from June 16 to September 19, 2010 at the Ars Electronica Center Linz. “Conquer interplanetary space; free humankind from the earthly bonds that have kept aerial navigation within our own atmosphere; colonize the Moon and then communicate with the other planets, our sisters in this cosmic void … This will be humanity’s next quest, bequeathed to our descendents of the twenty-first century!” Albert Robida: “Le Vingtième Siècle,” Paris 1883 Albert Robida and Paul Otlet, or: The Future in Now Buildings that rotate to follow the sun, weather machines, artificial islands in the oceans, venturing into outer space, the universal library—amazingly, all of these up-to-the-minute ideas were elements of futuristic visions by Albert Robida (FR) and Paul Otlet (BE).
    [Show full text]
  • Information Extraction Using Natural Language Processing
    INFORMATION EXTRACTION USING NATURAL LANGUAGE PROCESSING Cvetana Krstev University of Belgrade, Faculty of Philology Information Retrieval and/vs. Natural Language Processing So close yet so far Outline of the talk ◦ Views on Information Retrieval (IR) and Natural Language Processing (NLP) ◦ IR and NLP in Serbia ◦ Language Resources (LT) in the core of NLP ◦ at University of Belgrade (4 representative resources) ◦ LR and NLP for Information Retrieval and Information Extraction (IE) ◦ at University of Belgrade (4 representative applications) Wikipedia ◦ Information retrieval ◦ Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on full-text or other content-based indexing. ◦ Natural Language Processing ◦ Natural language processing is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve: natural language understanding, enabling computers to derive meaning from human or natural language input; and others involve natural language generation. Experts ◦ Information Retrieval ◦ As an academic field of study, Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collection (usually stored on computers). ◦ C. D. Manning, P. Raghavan, H. Schutze, “Introduction to Information Retrieval”, Cambridge University Press, 2008 ◦ Natural Language Processing ◦ The term ‘Natural Language Processing’ (NLP) is normally used to describe the function of software or hardware components in computer system which analyze or synthesize spoken or written language.
    [Show full text]
  • Forgotten Prophet of the Internet Philip Ball Ponders the Tale of a Librarian Who Dreamed of Networking Information
    BOOKS & ARTS COMMENT INFORMATION TECHNOLOGY Forgotten prophet of the Internet Philip Ball ponders the tale of a librarian who dreamed of networking information. he Internet is considered establish the league in Geneva, in neu- a key achievement of the tral Switzerland, rather than Brussels. computer age. But as for- But their objective was much more Tmer New York Times staffer Alex grandiose, utopian and strange. Wright shows in the meticulously Progressive thinkers such as researched Cataloging the World, H. G. Wells (whom Otlet read) desired BELGIUM MUNDANEUM, the concept predates digital tech- world government in the interwar nology. In the late nineteenth cen- period, but Otlet’s plans often seemed tury, Belgian librarian Paul Otlet detached from mundane realities. conceived schemes to collect, store, They veered into mystical notions of automatically retrieve and remotely transcendence of the human spirit, distribute all human knowledge. influenced by theosophy, and Otlet His ideas have clear analogies with seems to have imagined that learn- information archiving and net- ing could be transmitted not only by working on the web. Wright makes careful study of documents but by a a persuasive case that Otlet — now symbolic visual language in posters largely forgotten —deserves to and displays. In the late 1920s, he and be ranked among the conceptual architect Le Corbusier devised a plan inventors of the Internet. to realize the Mundaneum as a build- Wright locates Otlet’s work in a ing complex full of sacred symbolism, broader narrative about collation as much temple as library. Wright and cataloguing of information. overlooks the real heritage of these Compendia of knowledge date back ideas: Otlet’s predecessor here was at least to Pliny the Elder’s Natural Paul Otlet’s Mondothèque workstation.
    [Show full text]
  • On the Composition of ISO 25964 Hierarchical Relations (BTG, BTP, BTI)
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Springer - Publisher Connector Int J Digit Libr (2016) 17:39–48 DOI 10.1007/s00799-015-0162-2 On the composition of ISO 25964 hierarchical relations (BTG, BTP, BTI) Vladimir Alexiev1 · Antoine Isaac2 · Jutta Lindenthal3 Received: 12 January 2015 / Revised: 29 July 2015 / Accepted: 4 August 2015 / Published online: 20 August 2015 © The Author(s) 2015. This article is published with open access at Springerlink.com Abstract Knowledge organization systems (KOS) can use In addition, we relax some of the constraints assigned to the different types of hierarchical relations: broader generic ISO properties, namely the fact that hierarchical relationships (BTG), broader partitive (BTP), and broader instantial apply to SKOS concepts only. This allows us to apply them (BTI). The latest ISO standard on thesauri (ISO 25964) to the Getty Art and Architecture Thesaurus (AAT), where has formalized these relations in a corresponding OWL they are also used for non-concepts (facets, hierarchy names, ontology (De Smedt et al., ISO 25964 part 1: thesauri for guide terms). In this paper, we present extensive examples information retrieval: RDF/OWL vocabulary, extension of derived from the recent publication of AAT as linked open SKOS and SKOS-XL. http://purl.org/iso25964/skos-thes, data. 2013) and expressed them as properties: broaderGeneric, broaderPartitive, and broaderInstantial, respectively. These Keywords Thesauri · ISO 25964 · BTG · BTP · BTI · relations are used in actual thesaurus data. The composition- Broader generic · Broader partitive · Broader instantial · ality of these types of hierarchical relations has not been AAT investigated systematically yet.
    [Show full text]
  • A Terminology Extraction System for Technology Related Terms
    TEST: A Terminology Extraction System for Technology Related Terms Murhaf Hossari Soumyabrata Dev John D. Kelleher ADAPT Centre, Trinity College ADAPT Centre, Trinity College ADAPT Centre, Technological Dublin Dublin University Dublin Dublin, Ireland Dublin, Ireland Dublin, Ireland [email protected] [email protected] [email protected] ABSTRACT of applications [6], ranging from e-mail spam filtering, advertising Tracking developments in the highly dynamic data-technology and marketing industry [4, 7], and social media. Particularly, in landscape are vital to keeping up with novel technologies and tools, these realm of online articles and blog articles, the growth in the in the various areas of Artificial Intelligence (AI). However, It is amount of information is almost in an exponential nature. Therefore, difficult to keep track of all the relevant technology keywords. In it is important for the researchers to develop a system that can this paper, we propose a novel system that addresses this problem. automatically parse the millions of documents, and identify the key This tool is used to automatically detect the existence of new tech- technological terms. Such system will greatly reduce the manual nologies and tools in text, and extract terms used to describe these work, and save several man-hours. new technologies. The extracted new terms can be logged as new In this paper, we develop a machine-learning based solution, AI technologies as they are found on-the-fly in the web. It can be that is capable of the discovery and extraction of information re- subsequently classified into the relevant semantic labels and AIdo- lated to AI technologies.
    [Show full text]
  • Ontology Learning and Its Application to Automated Terminology Translation
    Natural Language Processing Ontology Learning and Its Application to Automated Terminology Translation Roberto Navigli and Paola Velardi, Università di Roma La Sapienza Aldo Gangemi, Institute of Cognitive Sciences and Technology lthough the IT community widely acknowledges the usefulness of domain ontolo- A gies, especially in relation to the Semantic Web,1,2 we must overcome several bar- The OntoLearn system riers before they become practical and useful tools. Thus far, only a few specific research for automated ontology environments have ontologies. (The “What Is an Ontology?” sidebar on page 24 provides learning extracts a definition and some background.) Many in the and is part of a more general ontology engineering computational-linguistics research community use architecture.4,5 Here, we describe the system and an relevant domain terms WordNet,3 but large-scale IT applications based on experiment in which we used a machine-learned it require heavy customization. tourism ontology to automatically translate multi- from a corpus of text, Thus, a critical issue is ontology construction— word terms from English to Italian. The method can identifying, defining, and entering concept defini- apply to other domains without manual adaptation. relates them to tions. In large, complex application domains, this task can be lengthy, costly, and controversial, because peo- OntoLearn architecture appropriate concepts in ple can have different points of view about the same Figure 1 shows the elements of the architecture. concept. Two main approaches aid large-scale ontol- Using the Ariosto language processor,6 OntoLearn a general-purpose ogy construction. The first one facilitates manual extracts terminology from a corpus of domain text, ontology engineering by providing natural language such as specialized Web sites and warehouses or doc- ontology, and detects processing tools, including editors, consistency uments exchanged among members of a virtual com- checkers, mediators to support shared decisions, and munity.
    [Show full text]