A Wordnet-Based Algorithm for Word Sense Disambiguation



A WordNet-based Algorithm for Word Sense Disambiguation

Xiaobin Li (Department of Computer Science, Concordia University, Montreal, Quebec, Canada H3G 1M8, [email protected])
Stan Szpakowicz and Stan Matwin (Department of Computer Science, University of Ottawa, Ottawa, Ontario, Canada K1N 6N5, {szpak,stan}@csi.uottawa.ca)

Abstract

We present an algorithm for automatic word sense disambiguation based on lexical knowledge contained in WordNet and on the results of surface-syntactic analysis. The algorithm is part of a system that analyzes texts in order to acquire knowledge in the presence of as little pre-coded semantic knowledge as possible. On the other hand, we want to make the best use of public-domain information sources such as WordNet. Rather than depend on large amounts of hand-crafted knowledge or statistical data from large corpora, we use syntactic information and information in WordNet, and minimize the need for other knowledge sources in the word sense disambiguation process. We propose to guide disambiguation by semantic similarity between words and heuristic rules based on this similarity. The algorithm has been applied to the Canadian Income Tax Guide. Test results indicate that even on a relatively small text the proposed method produces the correct noun meaning more than 72% of the time.

* This work was done while the first author was with the Knowledge Acquisition and Machine Learning group at the University of Ottawa. The authors are grateful to the Natural Sciences and Engineering Research Council of Canada for having supported this work with Strategic Grant No. STR0117764.

1 Introduction

This work is part of a project that aims at a synergistic integration of Machine Learning and Natural Language Processing. The long-term goal of the project is a system that performs machine learning on the results of text analysis to acquire a useful collection of production rules. Because such a system should not require extensive domain knowledge up front, text analysis is to be done in a knowledge-scant setting and with minimal user involvement. A domain-independent surface-syntactic parser produces an analysis of a text fragment (usually a sentence) that undergoes interactive semantic interpretation. By design, we only need the user to approve the system's findings or prompt it for alternatives. Also by design, we limit ourselves to information sources in the public domain: inexpensive dictionaries and other lexical sources such as WordNet.

WordNet [Miller, 1990; Beckwith et al., 1991] is a very rich source of lexical knowledge. Since most entries have multiple senses, we face a severe problem of ambiguity. The motivation for the work described here is the desire to design a word sense disambiguation (WSD) algorithm that satisfies the needs of our project (learning from text) without large amounts of hand-crafted knowledge or statistical data from large corpora. We concentrate on using information in WordNet to the fullest extent and minimizing the need for other knowledge sources in the WSD algorithm. Semantic similarity between words (defined in the next section) plays an important role in the algorithm. We propose several heuristic rules to guide WSD. Tested on an unrestricted, real text (the Canadian Income Tax Guide), this automatic WSD method gives encouraging results.

Word sense disambiguation is essential in natural language processing. Early symbolic methods [Hirst, 1987; Small and Rieger, 1982; Wilks, 1975] rely heavily on large amounts of hand-crafted knowledge; as a result, they can only work in a specific domain. To overcome this weakness, a variety of statistical WSD methods [Brown et al., 1991; Gale et al., 1992; Resnik, 1992; Schütze, 1992; Charniak, 1993; Lehman, 1994] have been put forward. They scale up easily, and this makes them useful for large unrestricted corpora. One of the most important steps in statistical WSD methods, however, is the statistically motivated extraction of word-word relationships from corpora. As Resnik points out [1993], "the corpus may fail to provide sufficient information about relevant word/word relationships, and word/word relationships, even those supported by the data, may not be the appropriate relationships to look at for some tasks". Most of the statistical methods suffer from this limitation. Resnik proposes constraining the set of possible word classes by using WordNet. Rather than improving statistical approaches [Resnik, 1993; Sussna, 1993; Voorhees, 1993], we propose a completely different WordNet-based algorithm for WSD.

2 WordNet and Semantic Similarity

WordNet is a lexical database with remarkably broad coverage. One of its most outstanding qualities is its word sense network. A word sense node in this network is a group of strict synonyms called a "synset", which is regarded as a basic object in WordNet. Each sense of a word is mapped to such a word sense node (i.e. a synset) in WordNet, and all word sense nodes in WordNet are linked by a variety of semantic relationships, such as IS-A (hypernymy/hyponymy), PART-OF (meronymy/holonymy), synonymy and antonymy. The IS-A relationship is dominant: synsets are grouped by it into hierarchies. Our algorithm only accesses information about nouns and verbs, for which there exist such lexical hierarchies.

It has become common to use some measure of semantic similarity between words to support word sense disambiguation [Resnik, 1992; Schütze, 1992]. In this work, we have adopted the following definition: semantic similarity between words is inversely proportional to the semantic distance between words in a WordNet IS-A hierarchy. By investigating the semantic relationships between two given words in WordNet hierarchies, semantic similarity can be measured and roughly divided into the following four levels:

Level 1: The words are strict synonyms according to WordNet: one word is in the same synset as the other word.

Level 2: The words are extended synonyms according to WordNet: one word is the immediate parent node of the other word in an IS-A hierarchy of WordNet. When used to get the synonyms of a word, WordNet not only produces the strict synonyms (a synset), but also the immediate parent nodes of this synset in an IS-A hierarchy of WordNet. Here, these immediate parent nodes are called extended synonyms.

Level 3: The words are hyponyms according to WordNet: one word is a child node of the other word in an IS-A hierarchy of WordNet.

Level 4: The words have a coordinate relationship in WordNet: one word is a sibling node of the other word (i.e. both words have the same parent node) in an IS-A hierarchy of WordNet.

Level 1 concerns the semantic similarity between words inside a synset; Levels 2-4 describe the semantic similarity between words that belong to different synsets (a synset, which is composed of a group of strict synonyms, is a node in the hierarchy of WordNet). The semantic similarity at Levels 1 and 4 is symmetric, but the semantic similarity at Levels 2 and 3 is not, although both concern the relation between a parent synset and its immediate child synset. For example, a child synset "property, belonging, holding, material possession" is not a synonym of its immediate parent synset "possession" at all. Using the information about a parent synset to decide the intended meaning of a word in its immediate child synset is different from using the information about a child synset to decide the intended meaning of a word in its immediate parent synset. So here we have divided the semantic similarity between a parent synset and its immediate child synset into two levels (Level 2 and Level 3).

Although we have only applied the algorithm to WSD of noun objects in a text (i.e. nouns that are objects of verbs in sentences analyzed by the system), it can also be applied to other noun phrases in a sentence, in particular subjects.

In this approach, we must consider contexts that are relevant to our method and the semantic similarity in these contexts. For all practical purposes, the possible senses of a word can be found in WordNet, but due to its extremely broad coverage most words have multiple word senses. A context must be considered in order to decide a particular word sense. The notion of context and its use differ widely across WSD methods: one may consider a whole text, a 100-word window, a sentence, some specific words nearby, and so on. In our work, we assume that a group of words co-occurring in a sentence will determine each other's sense, despite each of them being multiply ambiguous. A simple and effective way is to consider as context the verbs that dominate noun objects in sentences. That is, we investigate verb-noun pairs to determine the intended meaning of noun objects in sentences.

In this work, then, we focus on investigating two aspects of semantic similarity:

• The semantic similarity of the noun objects
• The semantic similarity of their verb contexts

3 Heuristic Rules and Confidence of Results

We have set out to determine the intended meaning of a noun object in a given verb context: from all candidate word senses in WordNet of the noun object, select the one sense that best fits the given verb context. Suppose that the algorithm seeks the intended meaning of a noun object NOBJ in its verb context VC (that is, the intended meaning of NOBJ in a verb-object pair (VC, NOBJ)). NOBJ has n candidate word senses in WordNet; s(k) means the kth word sense of NOBJ in WordNet, 1 ≤ k ≤ n. SS means the semantic similarity between words in an IS-A hierarchy of WordNet. An arrow is used to describe the relationship between words or between a word and its word …
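The four-level similarity test described above maps directly onto modern programmatic WordNet interfaces. Below is a minimal sketch, assuming NLTK's WordNet module as a stand-in for the WordNet version the authors used; the function name and the direction chosen for the asymmetric Levels 2 and 3 are illustrative choices, not the paper's code.

```python
# Minimal sketch of the paper's four similarity levels over WordNet
# IS-A hierarchies, using NLTK (run nltk.download('wordnet') once).
from nltk.corpus import wordnet as wn

def similarity_level(word1, word2, pos=wn.NOUN):
    """Return the strongest of the four levels (1-4), or None."""
    pairs = [(s1, s2)
             for s1 in wn.synsets(word1, pos=pos)
             for s2 in wn.synsets(word2, pos=pos)]
    if any(s1 == s2 for s1, s2 in pairs):
        return 1  # strict synonyms: the words share a synset
    if any(s2 in s1.hypernyms() for s1, s2 in pairs):
        return 2  # extended synonyms: word2's synset is word1's immediate parent
    if any(s1 in s2.hypernyms() for s1, s2 in pairs):
        return 3  # hyponyms: word1's synset is an immediate child of word2's
    if any(set(s1.hypernyms()) & set(s2.hypernyms()) for s1, s2 in pairs):
        return 4  # coordinates: both synsets share an immediate parent
    return None

# Expected: 1 ('car' and 'automobile' share a synset) and
# 4 ('dog' and 'wolf' are both immediate children of the 'canine' synset).
print(similarity_level('car', 'automobile'), similarity_level('dog', 'wolf'))
```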
Recommended publications
  • Automatic Wordnet Mapping Using Word Sense Disambiguation*
    Automatic WordNet Mapping Using Word Sense Disambiguation*

    Changki Lee and Geunbae Lee, Natural Language Processing Lab, Dept. of Computer Science and Engineering, Pohang University of Science & Technology, San 31, Hyoja-Dong, Pohang, 790-784, Korea ({leeck,gblee}@postech.ac.kr)
    Seo JungYun, Natural Language Processing Lab, Dept. of Computer Science, Sogang University, Sinsu-dong 1, Mapo-gu, Seoul, Korea ([email protected])

    Abstract: This paper presents the automatic construction of a Korean WordNet from pre-existing lexical resources. A set of automatic WSD techniques is described for linking Korean words collected from a bilingual MRD to English WordNet synsets. We will show how the individual linking provided by each WSD method is then combined to produce a Korean WordNet for nouns.

    1 Introduction: There is no doubt about the increasing importance of using wide-coverage ontologies for NLP tasks, especially for information … a bilingual Korean-English dictionary. The first sense of 'gwan-mog' has 'bush' as a translation in English, and 'bush' has five synsets in WordNet. Therefore the first sense of 'gwan-mog' has five candidate synsets. Somehow we decide on a synset {shrub, bush} among the five candidate synsets and link the sense of 'gwan-mog' to this synset. As seen from this example, when we link the senses of Korean words to WordNet synsets, there are semantic ambiguities. To remove the ambiguities we develop new word sense disambiguation heuristics and an automatic mapping method to construct a Korean WordNet based on the existing English WordNet. This paper is organized as follows. In section 2, we describe multiple heuristics for word sense disambiguation for sense linking.
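    The candidate-synset step that the 'gwan-mog' example walks through is easy to reproduce against the English WordNet. Below is a minimal sketch, assuming NLTK's WordNet interface; the disambiguation heuristics that pick one synset out of the candidates are the paper's contribution and are not reproduced here.

```python
# Sketch of collecting candidate synsets for a bilingual-dictionary
# translation, as in the 'gwan-mog' -> 'bush' example above.
from nltk.corpus import wordnet as wn

def candidate_synsets(english_translation):
    """All noun synsets a translation may denote; a WSD heuristic
    must then select one, e.g. {shrub, bush} for 'gwan-mog'."""
    return wn.synsets(english_translation, pos=wn.NOUN)

for synset in candidate_synsets('bush'):
    print(synset.name(), '-', synset.definition())
```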
  • Universal Or Variation? Semantic Networks in English and Chinese
    Universal or variation? Semantic networks in English and Chinese

    Understanding the structures of semantic networks can provide great insights into lexico-semantic knowledge representation. Previous work reveals small-world structure in English, a structure with the following properties: short average path lengths between words and strong local clustering, with a scale-free distribution in which most nodes have few connections while a small number of nodes have many connections [1]. However, it is not clear whether such semantic network properties hold across human languages. In this study, we investigate the universal structures and cross-linguistic variations by comparing the semantic networks in English and Chinese.

    Network description: To construct the Chinese and the English semantic networks, we used Chinese Open Wordnet [2,3] and English WordNet [4]. The two wordnets have different word forms in Chinese and English but common word meanings. Word meanings are connected not only to word forms, but also to other word meanings if they form relations such as hypernyms and meronyms (Figure 1).

    1. Cross-linguistic comparisons

    Analysis: The large-scale structures of the Chinese and the English networks were measured with two key network metrics, small-worldness [5] and scale-free distribution [6].

    Results: The two networks have similar size and both exhibit small-worldness (Table 1). However, the small-worldness is much greater in the Chinese network (σ = 213.35) than in the English network (σ = 83.15); this difference is primarily due to the higher average clustering coefficient (ACC) of the Chinese network. The scale-free distributions are similar across the two networks, as indicated by ANCOVA, F(1, 48) = 0.84, p = .37.
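    The measurements named above can be approximated with standard graph tooling. Below is a small sketch, assuming NLTK's English WordNet (the study also used Chinese Open Wordnet, which NLTK does not bundle) and networkx; the sample size is only to keep the run short.

```python
# Build a form-meaning network as described: word forms link to their
# synsets, and synsets link to related synsets (here: hypernyms).
import networkx as nx
from nltk.corpus import wordnet as wn

G = nx.Graph()
for synset in list(wn.all_synsets('n'))[:3000]:   # small sample for speed
    for lemma in synset.lemmas():
        G.add_edge('form:' + lemma.name(), synset.name())
    for hyper in synset.hypernyms():
        G.add_edge(synset.name(), hyper.name())

# Two ingredients of small-worldness: local clustering and path length.
print('average clustering:', nx.average_clustering(G))
giant = G.subgraph(max(nx.connected_components(G), key=len))
print('avg shortest path:', nx.average_shortest_path_length(giant))
# Degree sequence, for eyeballing the scale-free property.
print('top degrees:', sorted((d for _, d in G.degree()), reverse=True)[:10])
```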
  • The Case for Word Sense Induction and Disambiguation
    Unsupervised Does Not Mean Uninterpretable: The Case for Word Sense Induction and Disambiguation

    Alexander Panchenko‡, Eugen Ruppert‡, Stefano Faralli†, Simone Paolo Ponzetto† and Chris Biemann‡
    ‡Language Technology Group, Computer Science Dept., University of Hamburg, Germany
    †Web and Data Science Group, Computer Science Dept., University of Mannheim, Germany
    {panchenko,ruppert,biemann}@informatik.uni-hamburg.de, {faralli,simone}@informatik.uni-mannheim.de

    Abstract: The current trend in NLP is the use of highly opaque models, e.g. neural networks and word embeddings. While these models yield state-of-the-art results on a range of tasks, their drawback is poor interpretability. On the example of word sense induction and disambiguation (WSID), we show that it is possible to develop an interpretable model that matches the state-of-the-art models in accuracy. Namely, we present an unsupervised, knowledge-free WSID approach, which is interpretable at three levels: word sense inventory, sense feature representations, and disambiguation procedure.

    Word sense induction from domain-specific corpora is supposed to solve this problem. However, most approaches to word sense induction and disambiguation, e.g. (Schütze, 1998; Li and Jurafsky, 2015; Bartunov et al., 2016), rely on clustering methods and dense vector representations that make a WSD model uninterpretable as compared to knowledge-based WSD methods. Interpretability of a statistical model is important as it lets us understand the reasons behind its predictions (Vellido et al., 2011; Freitas, 2014; Li et al., 2016). Interpretability of WSD models (1) lets a user understand why in the given context one observed a given sense (e.g., for educational applications); (2) permits a comprehensive analysis of correct and erroneous predictions, giving rise to …
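    To make the contrast with opaque models concrete, here is a toy induction step in the same spirit: cluster sparse bag-of-words contexts of an ambiguous word, so that every induced sense is described by human-readable features. The four contexts and the choice of K-Means are illustrative; this is not the authors' pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

contexts = [
    'deposit money at the bank branch',
    'the bank approved the mortgage loan',
    'fishing on the bank of the river',
    'the river bank was muddy after the rain',
]
X = TfidfVectorizer().fit_transform(contexts)        # sparse, interpretable
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for context, label in zip(contexts, labels):
    print(label, context)   # each cluster ~ one induced sense of 'bank'
```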
  • NL Assistant: a Toolkit for Developing Natural Language: Applications
    NL Assistant: A Toolkit for Developing Natural Language Applications

    Deborah A. Dahl, Lewis M. Norton, Ahmed Bouzid, and Li Li
    Unisys Corporation

    Introduction: We will be demonstrating a toolkit for developing natural language-based applications, and two applications. The goals of this toolkit are to reduce development time and cost for natural language based applications by reducing the amount of linguistic and programming work needed. Linguistic work has been reduced by integrating large-scale linguistic resources: Comlex (Grishman et al., 1993) and WordNet (Miller, 1990). Programming work is reduced by automating some of the programming tasks. The toolkit is designed for both speech- and text-based interface applications. It runs in a Windows NT environment. Applications can run in either Windows NT or Unix.

    … Large-scale lexical resources have been integrated with the toolkit. These include Comlex and WordNet as well as additional, internally developed resources. The second strategy is to provide easy-to-use editors for entering linguistic information.

    Servers: Lexical information is supplied by four external servers which are accessed by the natural language engine during processing. Syntactic information is supplied by a lexical server based on the 50K-word Comlex dictionary available from the Linguistic Data Consortium. Semantic information comes from two servers: a KB server based on the noun portion of WordNet (70K concepts), and a semantics server containing case frame information for 2500 English verbs. A denotations server, using unique concept names generated for each WordNet synset at ISI, links the words in the lexicon to the concepts in the KB.

    System Components: The NL Assistant toolkit consists of …
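    As a rough stand-in for the kind of query the KB server answers, the snippet below pulls noun concepts and their parent concepts from WordNet via NLTK. Comlex and the toolkit's other servers are licensed or internal resources, so this is only an illustration of the lookup, not the toolkit's API.

```python
from nltk.corpus import wordnet as wn

def kb_lookup(noun):
    """Concepts (synsets) for a noun plus their parent concepts,
    roughly the kind of answer a WordNet-based KB server returns."""
    return [(s.name(), [h.name() for h in s.hypernyms()])
            for s in wn.synsets(noun, pos=wn.NOUN)]

print(kb_lookup('keyboard'))
```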
  • Lexical Sense Labeling and Sentiment Potential Analysis Using Corpus-Based Dependency Graph
    Lexical Sense Labeling and Sentiment Potential Analysis Using Corpus-Based Dependency Graph

    Tajana Ban Kirigin 1,*, Sanda Bujačić Babić 1 and Benedikt Perak 2
    1 Department of Mathematics, University of Rijeka, R. Matejčić 2, 51000 Rijeka, Croatia; [email protected]
    2 Faculty of Humanities and Social Sciences, University of Rijeka, Sveučilišna Avenija 4, 51000 Rijeka, Croatia; [email protected]
    * Correspondence: [email protected]

    Abstract: This paper describes a graph method for labeling word senses and identifying lexical sentiment potential by integrating a corpus-based syntactic-semantic dependency graph layer with lexical semantic and sentiment dictionaries. The method, implemented as the ConGraCNet application on different languages and corpora, projects a semantic function onto a particular syntactic dependency layer and constructs a seed lexeme graph with collocates of high conceptual similarity. The seed lexeme graph is clustered into subgraphs that reveal the polysemous semantic nature of a lexeme in a corpus. The construction of the WordNet hypernym graph provides a set of synset labels that generalize the senses for each lexical cluster. By integrating sentiment dictionaries, we introduce graph propagation methods for sentiment analysis. Original dictionary sentiment values are integrated into the ConGraCNet lexical graph to compute sentiment values of node lexemes and lexical clusters, and to identify the sentiment potential of lexemes with respect to a corpus. The method can be used to resolve the sparseness of sentiment dictionaries and to enrich the sentiment evaluation of lexical structures in sentiment dictionaries by revealing the relative sentiment potential of polysemous lexemes with respect to a specific corpus.
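    The propagation idea can be sketched in a few lines: lexemes missing from the sentiment dictionary inherit the mean sentiment of their scored graph neighbours. The graph, seed values and update rule below are toy stand-ins, not the ConGraCNet resources or formulas.

```python
import networkx as nx

G = nx.Graph([('joy', 'delight'), ('joy', 'cheer'),
              ('cheer', 'noise'), ('noise', 'din')])
seed = {'delight': 0.8, 'cheer': 0.6, 'din': -0.4}   # dictionary values

def propagate(G, seed, iterations=2):
    values = dict(seed)
    for _ in range(iterations):
        for node in G:
            if node in seed:                 # keep dictionary entries fixed
                continue
            scored = [values[n] for n in G[node] if n in values]
            if scored:
                values[node] = sum(scored) / len(scored)
    return values

print(propagate(G, seed))   # 'joy' and 'noise' receive propagated scores
```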
  • Similarity Detection Using Latent Semantic Analysis Algorithm Priyanka R
    Similarity Detection Using Latent Semantic Analysis Algorithm

    Priyanka R. Patil, PG Student, Department of Computer Engineering, North Maharashtra University, Jalgaon, Maharashtra, India
    Shital A. Patil, Associate Professor, Department of Computer Engineering, North Maharashtra University, Jalgaon, Maharashtra, India
    International Journal of Emerging Research in Management & Technology, ISSN: 2278-9359, Volume 6, Issue 8, August 2017 (Research Article)

    Abstract: Similarity View is an application for visually comparing and exploring multiple models of text and collections of documents. Friendbook finds users' lifestyles from user-centric sensor data, measures the similarity of lifestyles between users, and recommends friends to users if their lifestyles have high similarity. Motivated by this, a user's daily life is modeled as life documents, from which lifestyles are extracted by using the Latent Dirichlet Allocation algorithm. Manual techniques cannot be used for checking research papers, as the assigned reviewer may have insufficient knowledge of the research discipline or differing subjective views, causing possible misinterpretations. There is an urgent need for an effective and feasible approach to check submitted research papers with the support of automated software. Text mining methods address the problem of automatically checking research papers semantically. The proposed method finds the similarity of text across a collection of documents by using the Latent Dirichlet Allocation (LDA) algorithm and Latent Semantic Analysis (LSA) with a synonym algorithm, which finds synonyms of text index-wise by using the English WordNet dictionary; a variant, LSA without synonyms, finds the similarity of text based on the index alone.
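    The LSA core of such a method fits in a few lines with standard tooling. Below is a minimal sketch, assuming scikit-learn; the documents are toy data, and the WordNet synonym-expansion step described above is omitted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    'latent semantic analysis finds hidden topics in documents',
    'hidden topic models support semantic analysis of documents',
    'football scores from the weekend league',
]
X = TfidfVectorizer(stop_words='english').fit_transform(docs)
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
print(cosine_similarity(Z))   # docs 0 and 1 should score far above doc 2
```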
  • Introduction to Wordnet: an On-Line Lexical Database
    Introduction to WordNet: An On-line Lexical Database

    George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine Miller (Revised August 1993)

    WordNet is an on-line lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, and adjectives are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets. Standard alphabetical procedures for organizing lexical information put together words that are spelled alike and scatter words with similar or related meanings haphazardly through the list. Unfortunately, there is no obvious alternative, no other simple way for lexicographers to keep track of what has been done or for readers to find the word they are looking for. But a frequent objection to this solution is that finding things on an alphabetical list can be tedious and time-consuming. Many people who would like to refer to a dictionary decide not to bother with it because finding the information would interrupt their work and break their train of thought. In this age of computers, however, there is an answer to that complaint. One obvious reason to resort to on-line dictionaries (lexical databases that can be read by computers) is that computers can search such alphabetical lists much faster than people can. A dictionary entry can be available as soon as the target word is selected or typed into the keyboard. Moreover, since dictionaries are printed from tapes that are read by computers, it is a relatively simple matter to convert those tapes into the appropriate kind of lexical database.
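    Here is a short sketch of the kind of on-line lookup the authors argue for, using the modern NLTK distribution of WordNet (an assumption; the 1993 system predates it): a word is typed in, and its synonym sets, glosses and related sets come back immediately.

```python
from nltk.corpus import wordnet as wn   # run nltk.download('wordnet') once

for synset in wn.synsets('board', pos=wn.NOUN)[:3]:
    print(synset.name(), '->', synset.definition())
    print('  synonym set:', [lemma.name() for lemma in synset.lemmas()])
    print('  hypernyms:  ', [h.name() for h in synset.hypernyms()])
```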
  • Web Search Result Clustering with Babelnet
    Web Search Result Clustering with BabelNet

    Marek Kozlowski ([email protected]) and Maciej Kazula ([email protected]), OPI-PIB

    Abstract: In this paper we present a well-known search result clustering method enriched with BabelNet information. The goal is to verify how BabelNet/Babelfy can improve the clustering quality in the web search results domain. During the evaluation we tested three algorithms (Bisecting K-Means, STC, Lingo). At the first stage, we performed experiments only with textual features coming from snippets. Next, we introduced new semantic features from BabelNet (such as disambiguated synsets, categories and glosses describing synsets, or semantic edges) in order to verify how they influence the clustering quality of the search result clustering. The algorithms were evaluated on the AMBIENT dataset in terms of clustering quality.

    1 Introduction: In previous years, Web clustering engines (Carpineto, 2009) have been proposed as a solution to the issue of lexical ambiguity in Information …

    2 Related Work; 2.1 Search Result Clustering: The goal of text clustering in information retrieval is to discover groups of semantically related documents. Contextual descriptions (snippets) of documents returned by a search engine are short, often incomplete, and highly biased toward the query, so establishing a notion of proximity between documents is a challenging task that is called Search Result Clustering (SRC). Search Result Clustering is a specific area of document clustering. Approaches to search result clustering can be classified as data-centric or description-centric (Carpineto, 2009). The data-centric approach (e.g. Bisecting K-Means) focuses more on the problem of data clustering than on presenting the results to the user. Other data-centric methods use hierarchical agglomerative clustering (Maarek, 2000) that replaces single terms with lexical affinities (2-grams of words) as features, or exploit link information (Zhang, 2008).
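    A data-centric baseline of the kind discussed in the related work is easy to sketch: cluster snippet texts over TF-IDF features with K-Means. The snippets below are toy stand-ins for search results; STC and Lingo implementations live in the Carrot2 project and the BabelNet features require its API, so neither is shown here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

snippets = [
    'jaguar is a british luxury car maker',
    'used jaguar cars for sale near you',
    'the jaguar is a big cat of the americas',
    'jaguar habitat and diet in the rainforest',
]
X = TfidfVectorizer(stop_words='english').fit_transform(snippets)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for snippet, label in zip(snippets, labels):
    print(label, snippet)   # clusters ~ car sense vs animal sense
```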
  • Sethesaurus: Wordnet in Software Engineering
    SEthesaurus: WordNet in Software Engineering

    Xiang Chen, Member, IEEE, Chunyang Chen, Member, IEEE, Dun Zhang, and Zhenchang Xing, Member, IEEE
    (Author's version; the version of record is available at http://dx.doi.org/10.1109/TSE.2019.2940439)

    Abstract—Informal discussions on social platforms (e.g., Stack Overflow, CodeProject) have accumulated a large body of programming knowledge in the form of natural language text. Natural language processing (NLP) techniques can be utilized to harvest this knowledge base for software engineering tasks. However, a consistent vocabulary for a concept is essential to make effective use of these NLP techniques. Unfortunately, the same concepts are often intentionally or accidentally mentioned in many different morphological forms (such as abbreviations, synonyms and misspellings) in informal discussions. Existing techniques to deal with such morphological forms are either designed for general English or mainly resort to domain-specific lexical rules. A thesaurus, which contains software-specific terms and commonly used morphological forms, is desirable for normalizing software engineering text. However, constructing this thesaurus manually is a challenging task. In this paper, we propose an automatic unsupervised approach to build such a thesaurus. In particular, we first identify software-specific terms by utilizing a software-specific corpus (e.g., Stack Overflow) and a general corpus (e.g., Wikipedia). Then we infer morphological forms of software-specific terms by combining distributed word semantics, domain-specific lexical rules and transformations.
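    The first step, identifying software-specific terms by contrasting a domain corpus with a general one, can be sketched with a smoothed log-ratio of term frequencies. The two corpora below are toy stand-ins for Stack Overflow and Wikipedia, and the scoring function is illustrative, not the paper's exact measure.

```python
import math
from collections import Counter

domain = 'push the commit to the repo then open a pull request'.split()
general = 'push the door open and walk along the garden path'.split()
dom, gen = Counter(domain), Counter(general)

def domain_score(term):
    """Smoothed log-ratio of domain frequency to general frequency."""
    p_dom = (dom[term] + 1) / (sum(dom.values()) + len(dom))
    p_gen = (gen[term] + 1) / (sum(gen.values()) + len(gen))
    return math.log(p_dom / p_gen)

for term in ['commit', 'repo', 'door', 'the']:
    print(term, round(domain_score(term), 2))   # software terms score high
```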
  • Natural Language Processing Based Generator of Testing Instruments
    NATURAL LANGUAGE PROCESSING BASED GENERATOR OF TESTING INSTRUMENTS

    A Project Presented to the Faculty of California State University, San Bernardino, in Partial Fulfillment of the Requirements for the Degree Master of Science in Computer Science, by Qianqian Wang, September 2017. Approved by: Dr. Kerstin Voigt, Advisor, Computer Science and Engineering; Dr. Tong Lai Yu, Committee Member; Dr. George M. Georgiou, Committee Member.

    Recommended Citation: Wang, Qianqian, "NATURAL LANGUAGE PROCESSING BASED GENERATOR OF TESTING INSTRUMENTS" (2017). Electronic Theses, Projects, and Dissertations. 576. https://scholarworks.lib.csusb.edu/etd/576

    ABSTRACT: Natural Language Processing (NLP) is the field of study that focuses on the interactions between human language and computers. By "natural language" we mean a language that is used for everyday communication by humans. Unlike programming languages, natural languages are hard to define with precise rules. NLP is developing rapidly and has been widely used in different industries.
  • Download Natural Language Toolkit Tutorial
    Natural Language Processing Toolkit

    About the Tutorial: Language is a method of communication with the help of which we can speak, read and write. Natural Language Processing (NLP) is the subfield of computer science, especially Artificial Intelligence (AI), that is concerned with enabling computers to understand and process human language. We have various open-source NLP tools, but NLTK (Natural Language Toolkit) scores very high when it comes to ease of use and explanation of concepts. Python's learning curve is very fast, and since NLTK is written in Python, NLTK also comes with a very good learning kit. NLTK incorporates most common NLP tasks, such as tokenization, stemming, lemmatization, punctuation handling, character count, and word count. It is very elegant and easy to work with.

    Audience: This tutorial will be useful for graduates, post-graduates, and research students who either have an interest in NLP or have this subject as a part of their curriculum. The reader can be a beginner or an advanced learner.

    Prerequisites: The reader must have basic knowledge of artificial intelligence. He/she should also be aware of basic terminology used in English grammar and of Python programming concepts.

    Copyright & Disclaimer: Copyright 2019 by Tutorials Point (I) Pvt. Ltd. All the content and graphics published in this e-book are the property of Tutorials Point (I) Pvt. Ltd. The user of this e-book is prohibited from reusing, retaining, copying, distributing or republishing any contents or a part of the contents of this e-book in any manner without the written consent of the publisher.
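    The tasks the tutorial lists take only a few lines of NLTK. Below is a minimal sketch; the tokenizer and lemmatizer models must be downloaded once, as noted in the comments.

```python
# Run nltk.download('punkt') and nltk.download('wordnet') once beforehand.
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = 'The leaves were falling faster than we expected.'
tokens = word_tokenize(text)                                  # tokenization
stems = [PorterStemmer().stem(t) for t in tokens]             # stemming
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]   # lemmatization
print(len(tokens), 'tokens')                                  # word count
print(stems)
print(lemmas)
```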
  • Modeling and Encoding Traditional Wordlists for Machine Applications
    Modeling and Encoding Traditional Wordlists for Machine Applications

    Shakthi Poornima, Department of Linguistics, University at Buffalo, Buffalo, NY USA ([email protected])
    Jeff Good, Department of Linguistics, University at Buffalo, Buffalo, NY USA ([email protected])

    Abstract: This paper describes work being done on the modeling and encoding of a legacy resource, the traditional descriptive wordlist, in ways that make its data accessible to NLP applications. We describe an abstract model for traditional wordlist entries and then provide an instantiation of the model in RDF/XML which makes clear the relationship between our wordlist database and interlingua approaches aimed towards machine translation, and which also allows for straightforward interoperation with data from full lexicons.

    Clearly, descriptive linguistic resources can be of potential value not just to traditional linguistics, but also to computational linguistics. The difficulty, however, is that the kinds of resources produced in the course of linguistic description are typically not easily exploitable in NLP applications. Nevertheless, in the last decade or so, it has become widely recognized that the development of new digital methods for encoding language data can, in principle, not only help descriptive linguists to work more effectively but also allow them, with relatively little extra effort, to produce resources which can be straightforwardly repurposed for, among other things, NLP (Simons et al., 2004; Farrar and Lewis, 2007). Despite this, it has proven difficult to create significant electronic descriptive resources due to the complex and specific problems inevitably associated with the conversion of legacy data.

    1 Introduction: When looking at the relationship between NLP and linguistics, it is typical to focus on the different approaches taken with respect to issues …
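    An RDF/XML instantiation of a single wordlist entry might look like the sketch below, assuming rdflib; the namespace, the property names and the form itself are illustrative stand-ins, not the authors' actual schema.

```python
from rdflib import Graph, Literal, Namespace, URIRef

WL = Namespace('http://example.org/wordlist#')   # hypothetical vocabulary
g = Graph()
entry = URIRef('http://example.org/wordlist/entry/1')
g.add((entry, WL.form, Literal('madzi')))               # attested form (made up)
g.add((entry, WL.gloss, Literal('water', lang='en')))   # interlingua-style gloss
g.add((entry, WL.source, Literal('fieldwork wordlist, p. 3')))
print(g.serialize(format='xml'))                 # RDF/XML, as in the paper
```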