Interpretability in Word Sense Disambiguation Using Tsetlin Machine


Rohan Kumar Yadav, Lei Jiao, Ole-Christoffer Granmo and Morten Goodwin
Centre for Artificial Intelligence Research, University of Agder, Grimstad, Norway

Keywords: Tsetlin Machine, Word Sense Disambiguation, Interpretability.

Abstract: Word Sense Disambiguation (WSD) is a longstanding unresolved task in Natural Language Processing. The challenge lies in the fact that words with the same spelling can have completely different senses, sometimes depending on subtle characteristics of the context. A weakness of the state-of-the-art supervised models, however, is that they can be difficult to interpret, making it harder to check whether they capture senses accurately or not. In this paper, we introduce a novel Tsetlin Machine (TM) based supervised model that distinguishes word senses by means of conjunctive clauses. The clauses are formulated based on contextual cues, represented in propositional logic. Our experiments on the CoarseWSD-balanced dataset indicate that the learned word senses can be relatively effortlessly interpreted by analyzing the converged model of the TM. Additionally, the classification accuracy is higher than that of FastText-Base and similar to that of FastText-CommonCrawl.

DOI: 10.5220/0010382104020409. In Proceedings of the 13th International Conference on Agents and Artificial Intelligence (ICAART 2021), Volume 2, pages 402-409. ISBN: 978-989-758-484-8. Copyright © 2021 by SCITEPRESS – Science and Technology Publications, Lda.

1 INTRODUCTION

Word Sense Disambiguation (WSD) is one of the unsolved tasks in Natural Language Processing (NLP) (Agirre and Edmonds, 2007), with rapidly increasing importance, particularly due to the advent of chatbots. WSD consists of distinguishing the meaning of homographs – identically spelled words whose sense or meaning depends on the surrounding context words in a sentence or a paragraph. WSD is one of the main NLP tasks that still lacks a perfect solution for sense classification and indication (Navigli et al., 2017), and it usually fails to be integrated into NLP applications (de Lacalle and Agirre, 2015). Many supervised approaches attempt to solve the WSD problem by training a model on sense-annotated data (Liao et al., 2010). However, most of them fail to produce interpretable models. Because word senses can be radically different depending on the context, interpretation errors can have adverse consequences in real applications, such as chatbots. It is therefore crucial for a WSD model to be easily interpretable for human beings, by showing the significance of context words for WSD.

NLP is one of the disciplines underpinning chatbot applications. With the recent proliferation of chatbots, the limitations of state-of-the-art WSD have become increasingly apparent. In real-life operation, chatbots are notoriously poor at distinguishing the meaning of words with multiple senses, distinct for different contexts. For example, consider the word "book" in the sentence "I want to book a ticket for the upcoming movie". Although a traditional chatbot can classify "book" as "reservation" rather than "reading material", it does not give us an explanation of how it learns the meaning of the target word "book". An unexplained model raises several questions, like: "How can we trust the model?" or "How did the model make the decision?". Answering these questions would undoubtedly make a chatbot more trustworthy. In particular, deciding word senses for the wrong reasons may lead to undesirable consequences, e.g., leading the chatbot astray or falsely categorizing a CV. Introducing a high level of interpretability while maintaining classification accuracy is a challenge that the state-of-the-art NLP techniques so far have failed to solve satisfactorily.
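To make the targeted form of interpretability concrete, the following minimal sketch (our illustration; the clauses and literals are invented rather than taken from a trained model) shows how a conjunctive clause over context-word literals, expressed in propositional logic as described in the abstract, could separate the two senses of "book":

```python
# A hand-written conjunctive clause in the style the TM learns (illustrative
# literals only): every plain literal must be present in the context and every
# negated literal must be absent for the clause to output 1 (True).
def clause(context_words, positive_literals, negated_literals):
    return (all(w in context_words for w in positive_literals)
            and all(w not in context_words for w in negated_literals))

# Bag-of-words context of the target word "book" in
# "I want to book a ticket for the upcoming movie".
context = {"i", "want", "to", "a", "ticket", "for", "the", "upcoming", "movie"}

reservation_sense = clause(context, {"ticket"}, {"read"})       # -> True
reading_material_sense = clause(context, {"read"}, {"ticket"})  # -> False
print(reservation_sense, reading_material_sense)
```

A clause fires only when all of its plain literals occur in the context and all of its negated literals are absent; the TM learns many such clauses per sense and aggregates their votes into a classification, which is what makes the decision directly readable.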
Although some of the rule-based methods, like decision trees, are somewhat easy to interpret, other methods, such as Deep Neural Networks (DNNs), are out of reach for comprehensive interpretation (Wang et al., 2018). Despite the excellent accuracy achieved by DNNs, their "black box" nature impedes their impact (Rudin, 2018). It is difficult for human beings to interpret the decision-making process of artificial neurons. The weights and biases of deep neural networks are fine-tuned continuous values, which makes it intricate to distinguish the context words that drive the classification decision. Some straightforward techniques such as the Naive Bayes classifier, logistic regression, decision trees, random forests, and support vector machines are therefore still widely used because of their simplicity and interpretability. However, they provide reasonable accuracy only when the data is limited.

In this paper, we aim to obtain human-interpretable classification of the CoarseWSD-balanced dataset, using the recently introduced Tsetlin Machine (TM). Our goal is to achieve a viable balance between accuracy and interpretability by introducing a novel model for linguistic patterns. The TM is a human-interpretable pattern recognition method that composes patterns in propositional logic. Recently, it has provided accuracy comparable to DNNs with arguably less computational complexity, while maintaining high interpretability. We demonstrate how our model learns pertinent patterns based on the context words, and explore which context words drive the classification decisions for each particular word sense. The rest of the paper is arranged as follows: The related work on WSD and TM is explained in Section 2. Our TM-based WSD architecture, the learning process, and our approach to interpretability are covered in Section 3. Section 4 presents the experimental results for interpretability and accuracy. We conclude the paper in Section 5.

2 RELATED WORK

The research area of WSD is attracting increasing attention in the NLP community (Agirre and Edmonds, 2007) and has lately experienced rapid progress (Yuan et al., 2016; Tripodi and Pelillo, 2017; Hadiwinoto et al., 2019). In all brevity, WSD methods can be categorized into two groups: knowledge-based and supervised WSD. Knowledge-based methods involve selecting the sense of an ambiguous word from the semantic structure of lexical knowledge bases (Navigli and Velardi, 2004). For instance, the semantic structure of BabelNet has been used to measure word similarity (Dongsuk et al., 2018). The benefit of such models is that they do not require annotated or unannotated data, but they rely heavily on synset relations. Regarding supervised WSD, traditional approaches generally depend on extracting features from the context words present around the target word (Zhong and Ng, 2010).

The success of deep learning has significantly fueled WSD research. For example, Le et al. have reproduced the state-of-the-art performance of an LSTM-based approach to WSD on several openly available datasets: GigaWord, SemCor (Miller et al., 1994), and OMSTI (Taghipour and Ng, 2015). Apart from traditional supervised WSD, embeddings are becoming increasingly popular for capturing the senses of words (Mikolov et al., 2013). Further, Majid et al. improve state-of-the-art supervised WSD by assigning vector coefficients to obtain more precise context representations, and then applying PCA dimensionality reduction to find a better transformation of the features (Sadi et al., 2019). Salomonsson presents a supervised classifier based on a bidirectional LSTM for the lexical sample task of the Senseval dataset (Kågebäck and Salomonsson, 2016).
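As a rough, self-contained sketch of the kind of feature transformation described in the preceding paragraph (our own toy example with random stand-in embeddings and an assumed downstream classifier, not the cited implementation), the context of a target word can be averaged into a vector, reduced with PCA, and classified:

```python
# Toy pipeline: context words -> averaged vector -> PCA reduction -> classifier.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
embedding_dim = 50
vocab = ["ticket", "movie", "reserve", "read", "novel", "author"]
embeddings = {w: rng.normal(size=embedding_dim) for w in vocab}  # stand-in vectors

def context_vector(words):
    # Average the embeddings of the known context words.
    return np.mean([embeddings[w] for w in words if w in embeddings], axis=0)

# Invented contexts of the target word "book", labelled with their sense.
contexts = [["ticket", "movie"], ["reserve", "ticket"], ["read", "novel"], ["novel", "author"]]
labels = [0, 0, 1, 1]  # 0 = reservation, 1 = reading material

X = np.stack([context_vector(c) for c in contexts])
X_reduced = PCA(n_components=2).fit_transform(X)   # dimensionality reduction
clf = LogisticRegression().fit(X_reduced, labels)
print(clf.predict(X_reduced))
```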
Contextually-aware word embeddings have been extensively addressed with other machine learning approaches across many disciplines. Perhaps the most relevant is the work on neural network embeddings (Rezaeinia et al., 2019; Khattak et al., 2019; Lazreg et al., 2020). There is, however, a fundamental difference between our work and previous ones in terms of interpretability: existing methods yield complex vectorized embeddings, which can hardly be claimed to be human interpretable.

Furthermore, natural language processing has, in recent years, been dominated by neural network-based attention mechanisms (Vaswani et al., 2017; Sonkar et al., 2020). Even though attention and attention-based transformers (Devlin et al., 2018) provide state-of-the-art results, the methods are overly complicated and far from interpretable. Recently introduced work (Loureiro et al., 2020) shows how contextual information influences the sense of a word via an analysis of WSD on BERT.

All these contributions clearly show that supervised neural models can achieve state-of-the-art performance in terms of accuracy without considering external language-specific features. However, such neural network models are criticized for being difficult to interpret due to their black-box nature (Buhrmester et al., 2019). To introduce interpretability, we employ the newly developed TM for WSD in this study. The TM paradigm is inherently interpretable, producing rules in propositional logic (Granmo, 2018). TMs have demonstrated promising results in various classification tasks involving image data (Granmo et al., 2019), NLP tasks (Yadav et al., 2021; Bhattarai et al., 2020; Berge et al., 2019; Saha et al., 2020), and board games (Granmo, 2018). Although the TM operates on binary data, recent work suggests that a threshold-based representation of continuous input allows the TM to perform successfully beyond binary data, e.g., applied to disease outbreak forecasting (Abeyrathna et al., 2019). Additionally, the convergence of the TM has been analysed in (Zhang et al., 2020).

[Figure: Binarization of input text. Example sentences ("Apple will launch Iphone 12 next year."; "I like apple more than orange.") are stemmed and stripped of stopwords to build a vocabulary list (apple, launch, year, iphone, like, ...), and each input is then encoded as a binary vector indicating which vocabulary words it contains, e.g., Input 1 = 1 0 0 1 1 1 0.]
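The preprocessing indicated in the figure above can be sketched as follows (a minimal illustration under our own simplifying assumptions: a toy stopword list and NLTK's Porter stemmer, not the paper's exact preprocessing code):

```python
# Sketch of the binarization pipeline from the figure: stem the text, drop
# stopwords, build a vocabulary, and encode each document as a binary
# bag-of-words vector suitable for the TM.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stopwords = {"i", "will", "next", "more", "than"}  # toy stopword list

def preprocess(text):
    tokens = text.lower().replace(".", "").split()
    return [stemmer.stem(t) for t in tokens if t not in stopwords]

corpus = ["Apple will launch Iphone 12 next year.", "I like apple more than orange."]
documents = [preprocess(text) for text in corpus]

vocabulary = sorted({token for doc in documents for token in doc})
binary_inputs = [[1 if word in doc else 0 for word in vocabulary] for doc in documents]

for text, bits in zip(corpus, binary_inputs):
    print(text, "->", bits)
```

Each position in the binary vector simply records whether the corresponding (stemmed) vocabulary word occurs in the input, which is the propositional representation the TM clauses operate on.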