Lexical Cohesion Analysis for Topic Segmentation, Summarization and Keyphrase Extraction


LEXICAL COHESION ANALYSIS FOR TOPIC SEGMENTATION, SUMMARIZATION AND KEYPHRASE EXTRACTION

A dissertation submitted to the Department of Computer Engineering and the Graduate School of Engineering and Science of Bilkent University in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

By Gönenç Ercan
December, 2012

Each member of the thesis committee certified: "I certify that I have read this thesis and that in my opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy."

Prof. Dr. Fazlı Can (Supervisor)
Prof. Dr. İlyas Çiçekli (Co-supervisor)
Prof. Dr. Özgür Ulusoy
Prof. Dr. Ferda Nur Alpaslan
Assoc. Prof. Dr. Hakan Ferhatosmanoğlu
Assist. Prof. Dr. Seyit Koçberber

Approved for the Graduate School of Engineering and Science:
Prof. Dr. Levent Onural, Director of the Graduate School

ABSTRACT

LEXICAL COHESION ANALYSIS FOR TOPIC SEGMENTATION, SUMMARIZATION AND KEYPHRASE EXTRACTION
Gönenç Ercan
Ph.D. in Computer Engineering
Supervisor: Prof. Dr. Fazlı Can
December, 2012

When we express an idea or tell a story, it is inevitable that we use words that are semantically related to each other. This phenomenon can be exploited from two aspects. From the aspect of the words in a language, it is possible to infer the level of semantic relationship between words by observing their distribution and use in discourse. From the aspect of discourse, it is possible to model the structure of a document by observing the changes in lexical cohesion, and to use this model to attack high-level natural language processing tasks. In this research, lexical cohesion is investigated from both of these aspects: first by building methods for measuring the semantic relatedness of word pairs, and then by using these methods in the tasks of topic segmentation, summarization and keyphrase extraction.

Measuring the semantic relatedness of words requires prior knowledge about the words. Two different knowledge bases are investigated in this research. The first is a manually built network of semantic relationships, while the second relies on the distributional patterns found in raw text corpora. In order to discover which method is effective in lexical cohesion analysis, a comprehensive comparison of state-of-the-art methods in semantic relatedness is made.

For topic segmentation, several methods using some form of lexical cohesion are present in the literature. While some of these confine the relationships to word repetition or to strong semantic relationships such as synonymy, no earlier work uses semantic relatedness measures that can be calculated for any word pair in the vocabulary. Our experiments suggest that topic segmentation performance improves over methods that use classical relationships and word repetition. Furthermore, the experiments compare the performance of different semantic relatedness methods in a high-level task.
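As a minimal sketch of the corpus-based approach described above, semantic relatedness between two words can be estimated as the cosine similarity of their co-occurrence vectors, collected from raw text with a sliding window. The function names and the toy corpus below are illustrative assumptions, not the dissertation's actual implementation.

```python
# Sketch: corpus-based semantic relatedness via co-occurrence vectors.
# All names and the toy corpus are illustrative, not from the dissertation.
import math
from collections import defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Count, for each word, how often every other word appears nearby."""
    vectors = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def relatedness(vectors, w1, w2):
    """Cosine similarity between the co-occurrence vectors of w1 and w2."""
    v1, v2 = vectors[w1], vectors[w2]
    dot = sum(v1[k] * v2[k] for k in v1.keys() & v2.keys())
    norm = math.sqrt(sum(c * c for c in v1.values()))
    norm *= math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

corpus = [
    "the doctor treated the patient in the hospital".split(),
    "the nurse helped the patient at the hospital".split(),
    "the judge ruled in the court".split(),
]
vecs = cooccurrence_vectors(corpus)
# Words used in similar contexts come out as more related:
print(relatedness(vecs, "doctor", "nurse") > relatedness(vecs, "doctor", "judge"))  # → True
```

In the thesis's setting such vectors would be built from a large corpus; the three-sentence toy corpus here only shows the mechanics.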
The detected topic segments are used in summarization, where they achieve better results than a lexical chains based method that uses WordNet.

Finally, the use of lexical cohesion analysis in keyphrase extraction is investigated. Previous research shows that keyphrases are useful tools in document retrieval and navigation. While these studies point to a relation between keyphrases and document retrieval performance, no earlier work uses this relationship to identify the keyphrases of a given document. We aim to establish a link between the problems of query performance prediction (QPP) and keyphrase extraction. To this end, features used in QPP are evaluated for keyphrase extraction using a Naive Bayes classifier. Our experiments indicate that these features improve the effectiveness of keyphrase extraction in documents of different lengths. More importantly, the commonly used features of frequency and first position in text perform poorly on shorter documents, whereas QPP features are more robust and achieve better results.

Keywords: Lexical Cohesion, Semantic Relatedness, Topic Segmentation, Summarization, Keyphrase Extraction.

ÖZET (Turkish abstract, translated)

TOPIC SEGMENTATION, SUMMARIZATION AND KEYPHRASE EXTRACTION THROUGH LEXICAL COHESION ANALYSIS
Gönenç Ercan
Ph.D. in Computer Engineering
Supervisor: Prof. Dr. Fazlı Can
December, 2012

When people tell a story or express an idea, they cannot avoid using words that are semantically related to one another. This phenomenon can be exploited from two different perspectives. Looking from the perspective of words, it may be possible to identify semantically related words by observing their statistical distribution and their use in discourse. Looking at lexical cohesion from the perspective of discourse, it is possible to model the structure of a text by observing the changes in the semantic relations of its words, and to use this model in different natural language processing problems. In this research, lexical cohesion is examined from both of these perspectives: lexical cohesion is first used to measure the semantic relatedness of words, and these word relations are then used in the problems of topic segmentation, summarization and keyphrase extraction.

Measuring the semantic relatedness of words requires a knowledge base. Two different knowledge bases are exploited in this research. The first is a semantic network in which word relations are entered manually. The second method uses the distributional patterns of words in a raw text corpus. Within the scope of the research, the relative performance of these methods is measured and a comprehensive analysis is made.

Different methods that use lexical cohesion for the topic segmentation problem exist in the literature. Some of them rely only on word repetition, while others exploit strong semantic relations such as synonymy; however, the much more comprehensive semantic relatedness methods have not previously been used for this problem. The experiments show that topic segmentation performance can be increased by using semantic relatedness, and that they can also serve to compare different semantic relatedness measures. Texts segmented by topic are used in automatic summarization and achieve better results than lexical chain based methods.

Finally, lexical cohesion analysis is investigated for the keyphrase extraction problem. Past research shows that keyphrases are effective tools for document retrieval and navigation. Although these studies point to a relation between keyphrases and document retrieval, no other work has used retrieval performance prediction to find the keyphrases of a document. This research examines the use of query performance prediction (QPP) methods in keyphrase extraction: the features used in QPP are employed in the keyphrase extraction problem together with a Naive Bayes classifier. The experiments show that these features increase performance on documents of different sizes. More importantly, unlike the commonly used features of phrase frequency and first position in the document, they are more successful on short documents.

Keywords: Lexical cohesion, semantic relatedness, topic segmentation, summarization, keyphrase extraction.

Acknowledgement

I would like to thank Prof. İlyas Çiçekli, who has supervised and supported me since the day I started my graduate studies. I am indebted to Prof. Fazlı Can for all the encouragement and guidance he has provided me. Their support, expertise and invaluable recommendations were crucial in this long academic journey. I would like to thank all my committee members for spending their time and effort to read and comment on my dissertation. Many thanks to the former and current members of the Computer Engineering Department; it was both a pleasure and an honour to get to know and work with you all. I would like to express my deepest gratitude to my family, Nurhayat-Sadık Ercan, İlkay-Görkem Ercan and Berrin Gözen, for their patience and support throughout this long journey. Finally, I thank and dedicate this dissertation to my wife Pınar, who has given me both strength and courage to take on this challenge.

Contents

1 Introduction
  1.1 Goals and Contributions
  1.2 Outline
2 Related Work
  2.1 Related Linguistic Theories
    2.1.1 Semiotics and Meaning
    2.1.2 Analytic View
    2.1.3 Generative Grammar and Meaning
  2.2 Linguistic Background
    2.2.1 Morphology
    2.2.2 Syntax
    2.2.3 Coherence
    2.2.4 Cohesion
    2.2.5 Lexical Cohesion
  2.3 Literature Survey
    2.3.1 Semantic Relatedness
    2.3.2 Topic Segmentation
    2.3.3 Summarization
    2.3.4 Keyphrase Extraction
3 Semantic Relatedness
  3.1 Semantic Relatedness Methods
    3.1.1 WordNet Based Semantic Relatedness Measures
    3.1.2 WordNet Based Semantic Relatedness Functions
    3.1.3 Corpora Based Semantic Relatedness Measures
    3.1.4 Building the Vocabulary
    3.1.5 Building the Co-occurrence Matrix
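The segmentation idea running through these chapters — that a topic boundary appears where lexical cohesion between adjacent text blocks drops — can be sketched in a minimal, TextTiling-style form. The stoplist, function names and example document below are assumptions for illustration, not the algorithm developed in the dissertation.

```python
# Sketch: place a topic boundary where lexical cohesion (cosine similarity of
# word-frequency vectors) between the text before and after a candidate point
# is lowest. Names, stoplist and example are illustrative assumptions.
import math
from collections import Counter

STOPWORDS = {"the", "of", "and", "a", "in", "when"}  # toy stoplist

def cohesion(block_a, block_b):
    """Cosine similarity between the word-frequency vectors of two blocks."""
    ca, cb = Counter(block_a), Counter(block_b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values()))
    norm *= math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def deepest_boundary(sentences):
    """Sentence index where cohesion between the two sides is lowest."""
    words = [[w for w in s if w not in STOPWORDS] for s in sentences]
    def gap(i):
        left = [w for s in words[:i] for w in s]
        right = [w for s in words[i:] for w in s]
        return cohesion(left, right)
    return min(range(1, len(sentences)), key=gap)

doc = [
    "stars emit light across the galaxy".split(),
    "the galaxy contains billions of stars".split(),
    "bread rises when yeast ferments the dough".split(),
    "the dough needs yeast and warm water".split(),
]
print(deepest_boundary(doc))  # → 2: the topic shifts from astronomy to baking
```

A full segmenter would compare fixed-size windows at every gap and accept several boundaries at local cohesion minima; the single deepest gap is enough to show the principle.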