Algorithms for Enhancing Information Retrieval Using Semantic Web


ALGORITHMS FOR ENHANCING INFORMATION RETRIEVAL USING SEMANTIC WEB

A dissertation submitted to Kent State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy by Moy’awiah Al-Shannaq, August 2015.

Dissertation written by Moy’awiah A. Al-Shannaq
B.S., Yarmouk University, 2003
M.S., Yarmouk University, 2005
Ph.D., Kent State University, 2015

Approved by:
Austin Melton, Professor, Ph.D., Computer Science, Doctoral Advisor
Johnnie Baker, Professor Emeritus, Ph.D., Computer Science
Angela Guercio, Associate Professor, Ph.D., Computer Science
Farid Fouad, Associate Professor, Ph.D., Chemistry
Donald White, Professor, Ph.D., Mathematical Sciences

Accepted by:
Javed I. Khan, Professor, Ph.D., Chair, Department of Computer Science
James L. Blank, Professor, Ph.D., Dean, College of Arts and Sciences

Table of Contents

LIST OF FIGURES
LIST OF TABLES
DEDICATION
ACKNOWLEDGMENTS
ABSTRACT

CHAPTER 1: INTRODUCTION
  1.1 Goals of the Research
  1.2 Contributions
  1.3 Dissertation Outline

CHAPTER 2: BACKGROUND
  2.1 Introduction
  2.2 Information Retrieval
  2.3 Web Evolution
    2.3.1 Web 1.0
    2.3.2 Web 2.0
    2.3.3 Web 3.0
    2.3.4 Web 4.0
  2.4 Semantic Web
  2.5 Standards, Recommendations, and Models
    2.5.1 URI and Unicode
    2.5.2 Extensible Markup Language
    2.5.3 RDF: Describing Web Resources
    2.5.4 RDF Schema: Adding Semantics
    2.5.5 OWL: Web Ontology Language
    2.5.6 Logic and Proof
    2.5.7 Trust
  2.6 What is SPARQL?
  2.7 Integrating Information Retrieval and Semantic Web

CHAPTER 3: RELATED WORK
  3.1 Topic Modeling
    3.1.1 From Vector Space Modeling to Latent Semantic Indexing
    3.1.2 The Probabilistic Latent Semantic Indexing (PLSI) Model
    3.1.3 The Latent Dirichlet Allocation Model (LDA)
    3.1.4 Topics in LDA
      3.1.4.1 Model
  3.2 Automatic Query Expansion (AQE)
  3.3 Document Ranking with AQE
  3.4 Why and When AQE Works
  3.5 How AQE Works
    3.5.1 Preprocessing of Data Source
    3.5.2 Generation and Ranking of Candidate Expansion Features
      3.5.2.1 One-to-One Associations
      3.5.2.2 One-to-Many Associations
      3.5.2.3 Analysis of Feature Distribution in Top-Ranked Documents
      3.5.2.4 Query Language Modeling
      3.5.2.5 A Web Search Example
  3.6 Selection of Expansion Features
  3.7 Query Reformulation
  3.8 Related Work

CHAPTER 4: SEMANTIC WEB AND ARABIC LANGUAGE
  4.1 Importance of Arabic Language
  4.2 Right-to-Left Languages and the Semantic Web
  4.3 Arabic Ontology
  4.4 The Arabic Language and the Semantic Web: Challenges and Opportunities

CHAPTER 5: GENERATING TOPICS
  5.1 Introduction
  5.2 Data Set
  5.3 Experimental Results
    5.3.1 Experiments on an English Corpus
      5.3.1.1 Preprocessing Steps
      5.3.1.2 English Corpus Creation
      5.3.1.3 Experiment 1
      5.3.1.4 Experiment 2
      5.3.1.5 Experiment 3
      5.3.1.6 Experiment 4
      5.3.1.7 Experiment 5
      5.3.1.8 Experiment 6
    5.3.2 Experiments on an Arabic Corpus
      5.3.2.1 Arabic Corpus Creation
      5.3.2.2 Experiment 1
      5.3.2.3 Experiment 2
      5.3.2.4 Experiment 3
      5.3.2.5 Experiment 4
      5.3.2.6 Experiment 5
      5.3.2.7 Experiment 6
  5.4 Discussion

CHAPTER 6: TOPIC MODELING AND QUERY EXPANSION
  6.1 Introduction
  6.2 Why Use the Combination of Topic Modeling and Query Expansion?
  6.3 Semantic Search
    6.3.1 Handling Generalizations
    6.3.2 Handling Morphological Variations
    6.3.3 Handling Concept Matches
    6.3.4 Handling Synonyms with Correct Sense
  6.4 Methodology
    6.4.1 Query Expansion
    6.4.2 Our Work
      6.4.2.1 Stemming Subsystem
      6.4.2.2 Query Expansion Subsystem
  6.5 Experimental Results and Discussion
    6.5.1 Experiment 1
    6.5.2 Experiment 2
    6.5.3 Experiment 3
    6.5.4 Experiment 4
    6.5.5 Experiment 5
    6.5.6 Experiment 6
    6.5.7 Experiment 7
    6.5.8 Experiment 8

CHAPTER 7: CONCLUSION AND FUTURE WORK
APPENDIX A
REFERENCES

List of Figures

Figure 1: The classic search model
Figure 2: Architecture of a textual IR system
Figure 3: Evolution of the Web
Figure 4: Web of documents
Figure 5: The structure of Web data
Figure 6: Semantic Web layered architecture
Figure 7: An RDF graph storing information about Arab Bank
Figure 8: An RDF graph storing information about transportation services between cities
Figure 9: Two-dimensional plot of terms and documents
Figure 10: Two-dimensional plot of terms and documents along with the query "application theory"
Figure 11: Two-dimensional plot of terms and documents using the SVD of a reconstructed term-document matrix
Figure 12: Plate notation representing the LDA model
Figure 13: Main steps of Automatic Query Expansion
Figure 14: The first ten results returned by Google in response to the query "foreign minorities Germany"
Figure 15: The first ten results returned by Google in response to the expanded query "foreign minorities Germany sorbs Frisians"
Figure 16: The real strength of the top ten languages
Figure 17: The three main components of the semantic annotation process
Figure 18: A pie chart showing the distribution of languages used in creating ontologies stored in the OntoSelect library
Figure 19: A visualization of the probabilistic generative process for three documents
Figure 20: The intuitions behind Latent Dirichlet Allocation
Figure 21: Segmentation of the Arabic agglutinated form سيعلمونه
Figure 22: System components for Arabic corpus using LSI topic modeling
Figure 23: System components for Arabic corpus using LDA topic modeling
Figure 24: System components for English corpus using LDA topic modeling
Figure 25: System components for English corpus using LSI topic modeling
Figure 26: Query one for
Recommended publications
  • Enhanced Thesaurus Terms Extraction for Document Indexing
    Enhanced Thesaurus Terms Extraction for Document Indexing
    Frane Šarić, Jan Šnajder, Bojana Dalbelo Bašić, Hrvoje Eklić
    Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000 Zagreb, Croatia
    E-mail: {Frane.Saric, Jan.Snajder, Bojana.Dalbelo, Hrvoje.Eklic}@fer.hr

    Abstract. In this paper we present an enhanced method for the thesaurus term extraction regarded as the main support to a semi-automatic indexing system. The enhancement is achieved by neutralising the effect of language morphology, applying lemmatisation on both the text and the thesaurus, and by implementing an efficient recursive algorithm for term extraction. Formal definition and statistical evaluation of the experimental results of the proposed method for thesaurus term extraction are given. The need for disambiguation methods and the effect of lemmatisation in the realm of thesaurus term extraction are discussed.

    Keywords. Information retrieval, term extraction, NLP, lemmatisation, Eurovoc.

    [From the introduction] …mogeneous due to diverse background knowledge and expertise of human indexers. The task of building semi-automatic and automatic systems, which aim to decrease the burden of work borne by indexers, has recently attracted interest in the research community [4], [13], [14]. Automatic indexing systems still do not achieve the performance of human indexers, so semi-automatic systems are widely used (CINDEX, MACREX, MAI [10]). In this paper we present a method for thesaurus term extraction regarded as the main support to a semi-automatic indexing system. Term extraction is a process of finding all verbatim occurrences of all terms in the text. Our method of term extraction is a part of…
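The lemmatisation-based matching described in this excerpt can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy `LEMMAS` table stands in for a real morphological lemmatiser, and the greedy longest-match loop stands in for the paper's recursive extraction algorithm.

```python
# Sketch: thesaurus term extraction over lemmatised text, so that
# morphological variants of a term still match verbatim occurrences.

LEMMAS = {"networks": "network", "neural": "neural", "trained": "train",
          "training": "train", "models": "model"}

def lemmatise(tokens):
    """Map each token to its lemma (identity if unknown)."""
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]

def extract_terms(text, thesaurus):
    """Find all occurrences of thesaurus terms in the text, comparing
    lemma sequences; longer terms are preferred over shorter ones."""
    lemmas = lemmatise(text.split())
    # Pre-lemmatise thesaurus terms, longest first so longer terms win.
    lemma_terms = sorted(
        (tuple(lemmatise(term.split())) for term in thesaurus),
        key=len, reverse=True)
    found = []
    i = 0
    while i < len(lemmas):
        for term in lemma_terms:
            if tuple(lemmas[i:i + len(term)]) == term:
                found.append(" ".join(term))
                i += len(term)
                break
        else:
            i += 1
    return found

print(extract_terms("Training deep neural networks", ["neural network", "training"]))
# → ['train', 'neural network']
```

Matching on lemma sequences is what lets "Training" hit the thesaurus entry "training" and "networks" hit "neural network" despite the surface differences.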
  • A Data-Driven Framework for Assisting Geo-Ontology Engineering Using a Discrepancy Index
    University of California, Santa Barbara
    A Data-Driven Framework for Assisting Geo-Ontology Engineering Using a Discrepancy Index
    A Thesis submitted in partial satisfaction of the requirements for the degree Master of Arts in Geography by Bo Yan
    Committee in charge: Professor Krzysztof Janowicz, Chair; Professor Werner Kuhn; Professor Emerita Helen Couclelis
    June 2016
    The Thesis of Bo Yan is approved: Professor Werner Kuhn; Professor Emerita Helen Couclelis; Professor Krzysztof Janowicz, Committee Chair. May 2016
    Copyright © 2016 by Bo Yan

    Acknowledgements
    I would like to thank the members of my committee for their guidance and patience in the face of obstacles over the course of my research. I would like to thank my advisor, Krzysztof Janowicz, for his invaluable input on my work. Without his help and encouragement, I would not have been able to find the light at the end of the tunnel during the last stage of the work, because he provided insight that helped me think out of the box. There is no better advisor. I would like to thank Yingjie Hu, who has offered me numerous pieces of feedback, suggestions and inspiration on my thesis topic. I would like to thank all my other intelligent colleagues in the STKO lab and the Geography Department – those who have moved on and started anew, those who are still in the quagmire, and those who have just begun – for their support and friendship. Last, but most importantly, I would like to thank my parents for their unconditional love.
  • Multi-View Learning for Hierarchical Topic Detection on Corpus of Documents
    Multi-view learning for hierarchical topic detection on corpus of documents Juan Camilo Calero Espinosa Universidad Nacional de Colombia Facultad de Ingenieria, Departamento de Ingenieria de Sistemas e Industrial. Bogot´a,Colombia 2021 Multi-view learning for hierarchical topic detection on corpus of documents Juan Camilo Calero Espinosa Tesis presentada como requisito parcial para optar al t´ıtulode: Magister en Ingenier´ıade Sistemas y Computaciøn´ Director: Ph.D. Luis Fernando Ni~noV. L´ıneade Investigaci´on: Procesamiento de lenguaje natural Grupo de Investigaci´on: Laboratorio de investigaci´onen sistemas inteligentes - LISI Universidad Nacional de Colombia Facultad de Ingenieria, Departamento de Ingenieria en Sistemas e Industrial. Bogot´a,Colombia 2021 To my parents Maria Helena and Jaime. To my aunts Patricia and Rosa. To my grandmothers Lilia and Santos. Acknowledgements To Camilo Alberto Pino, as the original thesis idea was his, and for his invaluable teaching of multi-view learning. To my thesis advisor, Luis Fernando Ni~no,and the Laboratorio de investigaci´onen sistemas inteligentes - LISI, for constantly allowing me to learn new knowl- edge, and for their valuable recommendations on the thesis. V Abstract Topic detection on a large corpus of documents requires a considerable amount of com- putational resources, and the number of topics increases the burden as well. However, even a large number of topics might not be as specific as desired, or simply the topic quality starts decreasing after a certain number. To overcome these obstacles, we propose a new method- ology for hierarchical topic detection, which uses multi-view clustering to link different topic models extracted from document named entities and part of speech tags.
  • Semi-Automated Ontology-Based Question Answering System
    Journal of Advanced Research in Computing and Applications 17, Issue 1 (2019) 1-5
    Journal homepage: www.akademiabaru.com/arca.html, ISSN: 2462-1927, Open Access

    Semi-automated Ontology-based Question Answering System
    Khairul Nurmazianna Ismail, Faculty of Computer Science, Universiti Teknologi MARA Melaka, Malaysia

    Abstract: Question answering systems enable users to retrieve an exact answer for a question submitted in natural language. Demand for such systems is increasing, since they deliver a precise answer instead of a list of links. This study proposes an ontology-based question answering system, and the research includes an explanation of the question answering system's architecture. The system develops its ontology using a semi-automatic ontology development (ontology learning) approach.
    Keywords: Question answering system; knowledge base; ontology; online learning
    Copyright © 2019 PENERBIT AKADEMIA BARU - All rights reserved

    1. Introduction
    In the era of the WWW, people need to search for and obtain information quickly, online or offline, which makes them rely on Question Answering Systems (QAS). One common, traditional QAS is the Frequently Asked Questions site, which contains common and straightforward answers. Currently, QAS have emerged as powerful platforms using various techniques such as information retrieval, knowledge bases, natural language processing, and hybrid approaches, which enable users to retrieve exact answers for questions posed in natural language using either a pre-structured database or a collection of natural language documents [1,4]. The demand for QAS increases day by day, since they deliver short, precise, question-specific answers [10]. QAS built on the knowledge-base paradigm are better suited to restricted-domain QA because of their ability to focus [12].
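The knowledge-base style of QA described above can be illustrated with a minimal triple store. This is a sketch only: the entities, the `located_in` relation, and the single hard-coded question template are invented for the example and are not from the paper, which builds its ontology via ontology learning.

```python
# Minimal sketch of ontology-based QA: the "ontology" is a set of
# (subject, predicate, object) triples, and one question form is
# mapped onto a triple pattern.

TRIPLES = {
    ("kent_state_university", "located_in", "ohio"),
    ("ohio", "located_in", "usa"),
    ("lda", "subclass_of", "topic_model"),
}

def query(subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [(s, p, o) for (s, p, o) in TRIPLES
            if (subject is None or s == subject)
            and (predicate is None or p == predicate)
            and (obj is None or o == obj)]

def answer(question):
    """Map one hard-coded question form onto a triple pattern."""
    if question.startswith("Where is "):
        entity = question[len("Where is "):].rstrip("?").strip().lower().replace(" ", "_")
        matches = query(subject=entity, predicate="located_in")
        return matches[0][2] if matches else None
    return None

print(answer("Where is Kent State University?"))
# → ohio
```

A production system would replace the string template with real question analysis and the Python pattern match with a SPARQL query against an RDF store, but the triple-pattern core is the same.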
  • Nonstationary Latent Dirichlet Allocation for Speech Recognition
    Nonstationary Latent Dirichlet Allocation for Speech Recognition
    Chuang-Hua Chueh and Jen-Tzung Chien
    Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, ROC
    {chchueh,chien}@chien.csie.ncku.edu.tw

    Abstract: Latent Dirichlet allocation (LDA) has been successful for document modeling. LDA extracts the latent topics across documents. Words in a document are generated by the same topic distribution. However, in real-world documents, the usage of words in different paragraphs is varied and accompanied with different writing styles. This study extends the LDA and copes with the variations of topic information within a document. We build the nonstationary LDA (NLDA) by incorporating a Markov chain which is used to detect the stylistic segments in a document. Each segment corresponds to a particular style in composition of a document. This NLDA can exploit the topic information between documents as well as the word variations within a document.

    [From the related work] …model was presented for document representation with time evolution. Current parameters served as the prior information to estimate new parameters for the next time period. Furthermore, a continuous-time dynamic topic model [12] was developed by incorporating a Brownian motion in the dynamic topic model, and so the continuous-time topic evolution was fulfilled. Sparse variational inference was used to reduce the computational complexity. In [6][7], LDA was merged with the hidden Markov model (HMM) as the HMM-LDA model, where the Markov states were used to discover the syntactic structure of a document. Each word was generated either from the topic or the syntactic state. The syntactic words and content words were modeled separately. This study also considers the superiority of HMM in…
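Standard LDA, the baseline this paper extends, can be made concrete with a tiny collapsed Gibbs sampler. The corpus, hyperparameters, and iteration count below are toy values chosen for the sketch, not those of the paper, and the sampler implements plain LDA, without the paper's nonstationary Markov-chain extension.

```python
import random
from collections import defaultdict

# A tiny collapsed Gibbs sampler for standard LDA on a toy corpus.

def lda_gibbs(docs, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    # Count tables: topic-word, doc-topic, topic totals.
    nkw = defaultdict(int); ndk = defaultdict(int); nk = defaultdict(int)
    z = []  # topic assignment for every token
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            t = rng.randrange(K)
            zd.append(t); nkw[(t, w)] += 1; ndk[(d, t)] += 1; nk[t] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]  # remove the current assignment
                nkw[(t, w)] -= 1; ndk[(d, t)] -= 1; nk[t] -= 1
                # Sample a new topic from the collapsed conditional.
                weights = [(nkw[(k, w)] + beta) / (nk[k] + V * beta)
                           * (ndk[(d, k)] + alpha) for k in range(K)]
                r = rng.random() * sum(weights)
                for k, wt in enumerate(weights):
                    r -= wt
                    if r <= 0:
                        break
                z[d][i] = k; nkw[(k, w)] += 1; ndk[(d, k)] += 1; nk[k] += 1
    # Report the most probable word per topic.
    tops = {k: max(vocab, key=lambda w: nkw[(k, w)]) for k in range(K)}
    return z, tops

docs = [["ball", "goal", "team"], ["vote", "law", "senate"],
        ["team", "goal", "vote"]]
assignments, top_words = lda_gibbs(docs, K=2)
print(top_words)
```

The NLDA of the paper would additionally condition each token's topic draw on a latent style segment tracked by a Markov chain; the count-table bookkeeping above stays the same.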
  • Large-Scale Hierarchical Topic Models
    Large-Scale Hierarchical Topic Models
    Jay Pujara, Department of Computer Science, University of Maryland, College Park, MD 20742, [email protected]
    Peter Skomoroch, LinkedIn Corporation, 2029 Stierlin Ct., Mountain View, CA 94043, [email protected]

    Abstract: In the past decade, a number of advances in topic modeling have produced sophisticated models that are capable of generating hierarchies of topics. One challenge for these models is scalability: they are incapable of working at the massive scale of millions of documents and hundreds of thousands of terms. We address this challenge with a technique that learns a hierarchy of topics by iteratively applying topic models and processing subtrees of the hierarchy in parallel. This approach has a number of scalability advantages compared to existing techniques, and shows promising results in experiments assessing runtime and human evaluations of quality. We detail extensions to this approach that may further improve hierarchical topic modeling for large-scale applications.

    1 Motivation
    With massive datasets and corresponding computational resources readily available, the Big Data movement aims to provide deep insights into real-world data. Realizing this goal can require new approaches to well-studied problems. Complex models that, for example, incorporate many dependencies between parameters have alluring results for small datasets and single machines but are difficult to adapt to the Big Data paradigm. Topic models are an interesting example of this phenomenon. In the last decade, a number of sophisticated techniques have been developed to model collections of text, from Latent Dirichlet Allocation (LDA) [1] through extensions using statistical machinery such as the nested Chinese Restaurant Process [2][3] and Pachinko Allocation [4].
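The iterative subtree strategy in this abstract can be sketched as follows. The "topic model" here is a deliberately trivial stand-in (most frequent non-stopword) rather than LDA, used only to show the recursive partition-and-reapply structure; the documents are invented toy data.

```python
from collections import Counter

# Sketch: apply a (toy) topic model at the root, partition documents by
# their dominant topic, and recurse on each partition to grow a hierarchy.

def dominant_word(doc, stop):
    """The doc's most frequent word, ignoring already-used topic words."""
    counts = Counter(w for w in doc if w not in stop)
    return counts.most_common(1)[0][0] if counts else None

def build_hierarchy(docs, depth=2, stop=frozenset()):
    if depth == 0 or not docs:
        return {}
    groups = {}
    for doc in docs:
        key = dominant_word(doc, stop)
        if key is not None:
            groups.setdefault(key, []).append(doc)
    # Each subtree is independent, which is why the real system can
    # process subtrees of the hierarchy in parallel.
    return {topic: build_hierarchy(group, depth - 1, stop | {topic})
            for topic, group in groups.items()}

docs = [["sports", "sports", "football"], ["sports", "sports", "tennis"],
        ["politics", "politics", "law"]]
tree = build_hierarchy(docs, depth=2)
print(tree)
# → {'sports': {'football': {}, 'tennis': {}}, 'politics': {'law': {}}}
```

Swapping `dominant_word` for an LDA fit per node, and dispatching the recursive calls to workers, recovers the shape of the paper's approach.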
  • Journal of Applied Sciences Research
    Journal of Applied Sciences Research, ISSN: 1819-544X, EISSN: 1816-157X
    Journal home page: http://www.aensiweb.com/JASR
    2015 October; 11(19): pages 50-55. Published Online 10 November 2015.
    Copyright © 2015, American-Eurasian Network for Scientific Information publisher.

    Research Article: A Survey on Correlation between the Topic and Documents Based on the Pachinko Allocation Model
    Dr. C. Sundar (Associate Professor) and V. Sujitha (PG Scholar), Department of Computer Science and Engineering, Christian College of Engineering and Technology, Dindigul, Tamilnadu-624619, India.
    Received: 23 September 2015; Accepted: 25 October 2015. © 2015 AENSI PUBLISHER. All rights reserved.

    Abstract: Latent Dirichlet allocation (LDA) and other related topic models are increasingly popular tools for summarization and manifold discovery in discrete data. In the existing system, a novel information filtering model, the Maximum matched Pattern-based Topic Model (MPBTM), is used. The patterns are generated from the words in the word-based topic representations of a traditional topic model such as the LDA model. This ensures that the patterns can represent the topics well, because these patterns are comprised of the words which are extracted by LDA based on sample occurrence of the words in the documents. The maximum matched patterns, which are the largest patterns in each equivalence class that exist in the received documents, are used to calculate the relevant words to represent topics. However, LDA does not capture correlations between topics and does not find the hidden topics in the document. To deal with this problem, the Pachinko Allocation Model (PAM) is proposed.
  • An Evaluation of Machine Learning Approaches to Natural Language Processing for Legal Text Classification
    Imperial College London, Department of Computing
    An Evaluation of Machine Learning Approaches to Natural Language Processing for Legal Text Classification
    Author: Clavance Lim
    Supervisors: Prof Alessandra Russo, Nuri Cingillioglu
    Submitted in partial fulfillment of the requirements for the MSc degree in Computing Science of Imperial College London, September 2019

    Contents
    Abstract
    Acknowledgements
    1 Introduction
      1.1 Motivation
      1.2 Aims and objectives
      1.3 Outline
    2 Background
      2.1 Overview
        2.1.1 Text classification
        2.1.2 Training, validation and test sets
        2.1.3 Cross validation
        2.1.4 Hyperparameter optimization
        2.1.5 Evaluation metrics
      2.2 Text classification pipeline
      2.3 Feature extraction
        2.3.1 Count vectorizer
        2.3.2 TF-IDF vectorizer
        2.3.3 Word embeddings
      2.4 Classifiers
        2.4.1 Naive Bayes classifier
        2.4.2 Decision tree
        2.4.3 Random forest
        2.4.4 Logistic regression
        2.4.5 Support vector machines
        2.4.6 k-Nearest Neighbours
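The count-vectorizer and Naive Bayes stages listed in this report's contents can be illustrated from scratch. The legal/sport toy documents and labels below are invented for the sketch and are not the report's data; a real pipeline would typically use library vectorizers and classifiers instead.

```python
import math
from collections import Counter, defaultdict

# Sketch: multinomial Naive Bayes over bag-of-words counts with
# add-one smoothing, i.e. the count-vectorizer + classifier stages.

def train_nb(docs, labels):
    class_docs = defaultdict(list)
    for doc, lab in zip(docs, labels):
        class_docs[lab].extend(doc)
    vocab = {w for doc in docs for w in doc}
    priors = {lab: math.log(labels.count(lab) / len(labels))
              for lab in set(labels)}
    logliks = {}
    for lab, words in class_docs.items():
        counts = Counter(words)
        total = len(words) + len(vocab)  # add-one smoothing denominator
        logliks[lab] = {w: math.log((counts[w] + 1) / total) for w in vocab}
    return priors, logliks, vocab

def predict_nb(model, doc):
    priors, logliks, vocab = model
    scores = {lab: priors[lab] + sum(logliks[lab][w] for w in doc if w in vocab)
              for lab in priors}
    return max(scores, key=scores.get)

docs = [["contract", "breach", "court"], ["statute", "court", "appeal"],
        ["goal", "match", "referee"], ["season", "match", "team"]]
labels = ["legal", "legal", "sport", "sport"]
model = train_nb(docs, labels)
print(predict_nb(model, ["court", "appeal"]))
# → legal
```

Working in log space avoids underflow when documents are long, which is the standard trick behind every practical Naive Bayes implementation.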
  • Incorporating Domain Knowledge in Latent Topic Models
    INCORPORATING DOMAIN KNOWLEDGE IN LATENT TOPIC MODELS
    by David Michael Andrzejewski
    A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Sciences) at the UNIVERSITY OF WISCONSIN–MADISON, 2010
    © Copyright by David Michael Andrzejewski 2010. All Rights Reserved.

    For my parents and my Cho.

    ACKNOWLEDGMENTS
    Obviously none of this would have been possible without the diligent advising of Mark Craven and Xiaojin (Jerry) Zhu. Taking a bioinformatics class from Mark as an undergraduate initially got me excited about the power of statistical machine learning to extract insights from seemingly impenetrable datasets. Jerry's enthusiasm for research and relentless pursuit of excellence were a great inspiration for me. On countless occasions, Mark and Jerry have selflessly donated their time and effort to help me develop better research skills and to improve the quality of my work. I would have been lucky to have even a single advisor as excellent as either Mark or Jerry; I have been extremely fortunate to have them both as co-advisors. My committee members have also been indispensable. Jude Shavlik has always brought an emphasis on clear communication and solid experimental technique which will hopefully stay with me for the rest of my career. Michael Newton helped me understand the modeling issues in this research from a statistical perspective. Working with prelim committee member Ben Liblit gave me the exciting opportunity to apply machine learning to a very challenging problem in the debugging work presented in Chapter 4. I also learned a lot about how other computer scientists think from meetings with Ben.
  • Extraction of Semantic Information from Web 2.0 Content
    Master's in Informatics Engineering — Dissertation, Final Report
    Extraction of Semantic Information from Web 2.0 Content
    Ana Rita Bento Carvalheira, [email protected]
    Supervisor: Paulo Jorge de Sousa Gomes, [email protected]
    Date: 1 July 2014

    Acknowledgements: I would like to begin by thanking Professor Paulo Gomes for his professionalism and unconditional support, for his sincere friendship, and for the total availability he showed throughout the year. His support was not only decisive for the writing of this thesis; it also motivated me to want to learn more and to do better. To my grandmother Maria and grandfather Francisco, for always being there when I needed them, for their warmth and affection, and for all the effort they made so that I never lacked anything. I hope one day to be able to repay, in some way, everything they have done for me. To my parents, for the lessons and values they passed on to me, for everything they have given me, and for all the availability and dedication they constantly offer me. Everything I am, I owe to you. To David, I am grateful for all the help and understanding throughout the year, for all the affection and support in all my decisions, and for always encouraging me to follow my dreams. I admire you above all for your competence and humility, and for the strength and confidence you give me at every moment.

    Abstract: The massive proliferation of blogs and social networks has made user-generated content, present on platforms such as Twitter or Facebook, quite valuable for the amount of information that can be extracted and explored.
  • Using Lexico-Syntactic Ontology Design Patterns for Ontology Creation and Population
    Using Lexico-Syntactic Ontology Design Patterns for ontology creation and population
    Diana Maynard, Adam Funk and Wim Peters
    Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello, S1 4DP, Sheffield, UK

    Abstract. In this paper we discuss the use of information extraction techniques involving lexico-syntactic patterns to generate ontological information from unstructured text and either create a new ontology from scratch or augment an existing ontology with new entities. We refine the patterns using a term extraction tool and some semantic restrictions derived from WordNet and VerbNet, in order to prevent the overgeneration that occurs with the use of the Ontology Design Patterns for this purpose. We present two applications developed in GATE and available as plugins for the NeOn Toolkit: one for general use on all kinds of text, and one for specific use in the fisheries domain.
    Key words: natural language processing, relation extraction, ontology generation, information extraction, Ontology Design Patterns

    1 Introduction
    Ontology population is a crucial part of knowledge base construction and maintenance that enables us to relate text to ontologies, providing on the one hand a customised ontology related to the data and domain with which we are concerned, and on the other hand a richer ontology which can be used for a variety of semantic web-related tasks such as knowledge management, information retrieval, question answering, semantic desktop applications, and so on. Automatic ontology population is generally performed by means of some kind of ontology-based information extraction (OBIE) [1, 2]. This consists of identifying the key terms in the text (such as named entities and technical terms) and then relating them to concepts in the ontology.
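A single Hearst-style "such as" pattern, in the spirit of the lexico-syntactic patterns discussed above, can be sketched with a regular expression. This is a bare-bones illustration: the paper's approach additionally refines such patterns with a term extraction tool and WordNet/VerbNet restrictions, none of which is modeled here.

```python
import re

# Sketch: extract candidate (hypernym, hyponym) pairs from raw text
# using one "X such as Y, Z and W" lexico-syntactic pattern.

PATTERN = re.compile(
    r"(\w+)\s+such\s+as\s+((?:\w+(?:,\s*|\s+and\s+|\s+or\s+))*\w+)")

def extract_isa_pairs(text):
    pairs = []
    for match in PATTERN.finditer(text):
        hypernym = match.group(1)
        hyponyms = re.split(r",\s*|\s+and\s+|\s+or\s+", match.group(2))
        pairs.extend((hypernym, h) for h in hyponyms)
    return pairs

print(extract_isa_pairs("They farm fish such as salmon, trout and carp."))
# → [('fish', 'salmon'), ('fish', 'trout'), ('fish', 'carp')]
```

Without the semantic restrictions the paper adds, a pattern like this overgenerates (e.g. it would happily pair a modifier with its head noun), which is exactly the problem the WordNet- and VerbNet-based filtering is meant to curb.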
  • LASLA and Collatinus
    L.A.S.L.A. and Collatinus: a convergence in lexica
    Philippe Verkerk, Yves Ouvrard, Margherita Fantoli, Dominique Longrée
    To cite this version: Philippe Verkerk, Yves Ouvrard, Margherita Fantoli, Dominique Longrée. L.A.S.L.A. and Collatinus: a convergence in lexica. Studi e saggi linguistici, ETS, In press. hal-02399878v1
    HAL Id: hal-02399878, https://hal.archives-ouvertes.fr/hal-02399878v1. Submitted on 9 Dec 2019 (v1), last revised 14 May 2020 (v2).
    HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

    L.A.S.L.A. (Laboratoire d'Analyse Statistique des Langues Anciennes, University of Liège, Belgium) began in 1961 a project of lemmatisation and morphosyntactic tagging of Latin texts. This project is still running, with new texts lemmatised each year. The resulting files have recently been opened to interested scholars, and they now count approximately 2,500,000 words, the lemmatisation of which has been checked by a philologist. In the early 2000s, Collatinus was developed by Yves Ouvrard for teaching.