Using Wordnet in NLP
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Text Mining Course for KNIME Analytics Platform
Text Mining Course for KNIME Analytics Platform KNIME AG Copyright © 2018 KNIME AG Table of Contents 1. The Open Analytics Platform 2. The Text Processing Extension 3. Importing Text 4. Enrichment 5. Preprocessing 6. Transformation 7. Classification 8. Visualization 9. Clustering 10. Supplementary Workflows Licensed under a Creative Commons Attribution- ® Copyright © 2018 KNIME AG 2 Noncommercial-Share Alike license 1 https://creativecommons.org/licenses/by-nc-sa/4.0/ Overview KNIME Analytics Platform Licensed under a Creative Commons Attribution- ® Copyright © 2018 KNIME AG 3 Noncommercial-Share Alike license 1 https://creativecommons.org/licenses/by-nc-sa/4.0/ What is KNIME Analytics Platform? • A tool for data analysis, manipulation, visualization, and reporting • Based on the graphical programming paradigm • Provides a diverse array of extensions: • Text Mining • Network Mining • Cheminformatics • Many integrations, such as Java, R, Python, Weka, H2O, etc. Licensed under a Creative Commons Attribution- ® Copyright © 2018 KNIME AG 4 Noncommercial-Share Alike license 2 https://creativecommons.org/licenses/by-nc-sa/4.0/ Visual KNIME Workflows NODES perform tasks on data Not Configured Configured Outputs Inputs Executed Status Error Nodes are combined to create WORKFLOWS Licensed under a Creative Commons Attribution- ® Copyright © 2018 KNIME AG 5 Noncommercial-Share Alike license 3 https://creativecommons.org/licenses/by-nc-sa/4.0/ Data Access • Databases • MySQL, MS SQL Server, PostgreSQL • any JDBC (Oracle, DB2, …) • Files • CSV, txt -
Automatic Wordnet Mapping Using Word Sense Disambiguation*
Automatic WordNet mapping using word sense disambiguation* Changki Lee Seo JungYun Geunbae Leer Natural Language Processing Lab Natural Language Processing Lab Dept. of Computer Science and Engineering Dept. of Computer Science Pohang University of Science & Technology Sogang University San 31, Hyoja-Dong, Pohang, 790-784, Korea Sinsu-dong 1, Mapo-gu, Seoul, Korea {leeck,gblee }@postech.ac.kr seojy@ ccs.sogang.ac.kr bilingual Korean-English dictionary. The first Abstract sense of 'gwan-mog' has 'bush' as a translation in English and 'bush' has five synsets in This paper presents the automatic WordNet. Therefore the first sense of construction of a Korean WordNet from 'gwan-mog" has five candidate synsets. pre-existing lexical resources. A set of Somehow we decide a synset {shrub, bush} automatic WSD techniques is described for among five candidate synsets and link the sense linking Korean words collected from a of 'gwan-mog' to this synset. bilingual MRD to English WordNet synsets. As seen from this example, when we link the We will show how individual linking senses of Korean words to WordNet synsets, provided by each WSD method is then there are semantic ambiguities. To remove the combined to produce a Korean WordNet for ambiguities we develop new word sense nouns. disambiguation heuristics and automatic mapping method to construct Korean WordNet based on 1 Introduction the existing English WordNet. There is no doubt on the increasing This paper is organized as follows. In section 2, importance of using wide coverage ontologies we describe multiple heuristics for word sense for NLP tasks especially for information disambiguation for sense linking. -
Application of Machine Learning in Automatic Sentiment Recognition from Human Speech
Application of Machine Learning in Automatic Sentiment Recognition from Human Speech Zhang Liu Ng EYK Anglo-Chinese Junior College College of Engineering Singapore Nanyang Technological University (NTU) Singapore [email protected] Abstract— Opinions and sentiments are central to almost all human human activities and have a wide range of applications. As activities and have a wide range of applications. As many decision many decision makers turn to social media due to large makers turn to social media due to large volume of opinion data volume of opinion data available, efficient and accurate available, efficient and accurate sentiment analysis is necessary to sentiment analysis is necessary to extract those data. Business extract those data. Hence, text sentiment analysis has recently organisations in different sectors use social media to find out become a popular field and has attracted many researchers. However, consumer opinions to improve their products and services. extracting sentiments from audio speech remains a challenge. This project explores the possibility of applying supervised Machine Political party leaders need to know the current public Learning in recognising sentiments in English utterances on a sentiment to come up with campaign strategies. Government sentence level. In addition, the project also aims to examine the effect agencies also monitor citizens’ opinions on social media. of combining acoustic and linguistic features on classification Police agencies, for example, detect criminal intents and cyber accuracy. Six audio tracks are randomly selected to be training data threats by analysing sentiment valence in social media posts. from 40 YouTube videos (monologue) with strong presence of In addition, sentiment information can be used to make sentiments. -
Text KM & Text Mining
Text KM & Text Mining An Introduction to Managing Knowledge in Unstructured Natural Language Documents Dekai Wu HKUST Human Language Technology Center Department of Computer Science and Engineering University of Science and Technology Hong Kong http://www.cs.ust.hk/~dekai © 2008 Dekai Wu Lecture Objectives Introduction to the concept of Text KM and Text Mining (TM) How to exploit knowledge encoded in text form How text mining is different from data mining Introduction to the various aspects of Natural Language Processing (NLP) Introduction to the different tools and methods available for TM HKUST Human Language Technology Center © 2008 Dekai Wu Textual Knowledge Management Text KM oversees the storage, capturing and sharing of knowledge encoded in unstructured natural language documents 80-90% of an organization’s explicit knowledge resides in plain English (Chinese, Japanese, Spanish, …) documents – not in structured relational databases! Case libraries are much more reasonably stored as natural language documents, than encoded into relational databases Most knowledge encoded as text will never pass through explicit KM processes (eg, email) HKUST Human Language Technology Center © 2008 Dekai Wu Text Mining Text Mining analyzes unstructured natural language documents to extract targeted types of knowledge Extracts knowledge that can then be inserted into databases, thereby facilitating structured data mining techniques Provides a more natural user interface for entering knowledge, for both employees and developers Reduces -
Knowledge-Powered Deep Learning for Word Embedding
Knowledge-Powered Deep Learning for Word Embedding Jiang Bian, Bin Gao, and Tie-Yan Liu Microsoft Research {jibian,bingao,tyliu}@microsoft.com Abstract. The basis of applying deep learning to solve natural language process- ing tasks is to obtain high-quality distributed representations of words, i.e., word embeddings, from large amounts of text data. However, text itself usually con- tains incomplete and ambiguous information, which makes necessity to leverage extra knowledge to understand it. Fortunately, text itself already contains well- defined morphological and syntactic knowledge; moreover, the large amount of texts on the Web enable the extraction of plenty of semantic knowledge. There- fore, it makes sense to design novel deep learning algorithms and systems in order to leverage the above knowledge to compute more effective word embed- dings. In this paper, we conduct an empirical study on the capacity of leveraging morphological, syntactic, and semantic knowledge to achieve high-quality word embeddings. Our study explores these types of knowledge to define new basis for word representation, provide additional input information, and serve as auxiliary supervision in deep learning, respectively. Experiments on an analogical reason- ing task, a word similarity task, and a word completion task have all demonstrated that knowledge-powered deep learning can enhance the effectiveness of word em- bedding. 1 Introduction With rapid development of deep learning techniques in recent years, it has drawn in- creasing attention to train complex and deep models on large amounts of data, in order to solve a wide range of text mining and natural language processing (NLP) tasks [4, 1, 8, 13, 19, 20]. -
Universal Or Variation? Semantic Networks in English and Chinese
Universal or variation? Semantic networks in English and Chinese Understanding the structures of semantic networks can provide great insights into lexico- semantic knowledge representation. Previous work reveals small-world structure in English, the structure that has the following properties: short average path lengths between words and strong local clustering, with a scale-free distribution in which most nodes have few connections while a small number of nodes have many connections1. However, it is not clear whether such semantic network properties hold across human languages. In this study, we investigate the universal structures and cross-linguistic variations by comparing the semantic networks in English and Chinese. Network description To construct the Chinese and the English semantic networks, we used Chinese Open Wordnet2,3 and English WordNet4. The two wordnets have different word forms in Chinese and English but common word meanings. Word meanings are connected not only to word forms, but also to other word meanings if they form relations such as hypernyms and meronyms (Figure 1). 1. Cross-linguistic comparisons Analysis The large-scale structures of the Chinese and the English networks were measured with two key network metrics, small-worldness5 and scale-free distribution6. Results The two networks have similar size and both exhibit small-worldness (Table 1). However, the small-worldness is much greater in the Chinese network (σ = 213.35) than in the English network (σ = 83.15); this difference is primarily due to the higher average clustering coefficient (ACC) of the Chinese network. The scale-free distributions are similar across the two networks, as indicated by ANCOVA, F (1, 48) = 0.84, p = .37. -
NL Assistant: a Toolkit for Developing Natural Language: Applications
NL Assistant: A Toolkit for Developing Natural Language Applications Deborah A. Dahl, Lewis M. Norton, Ahmed Bouzid, and Li Li Unisys Corporation Introduction scale lexical resources have been integrated with the toolkit. These include Comlex and WordNet We will be demonstrating a toolkit for as well as additional, internally developed developing natural language-based applications resources. The second strategy is to provide easy and two applications. The goals of this toolkit to use editors for entering linguistic information. are to reduce development time and cost for natural language based applications by reducing Servers the amount of linguistic and programming work needed. Linguistic work has been reduced by Lexical information is supplied by four external integrating large-scale linguistics resources--- servers which are accessed by the natural Comlex (Grishman, et. al., 1993) and WordNet language engine during processing. Syntactic (Miller, 1990). Programming work is reduced by information is supplied by a lexical server based automating some of the programming tasks. The on the 50K word Comlex dictionary available toolkit is designed for both speech- and text- from the Linguistic Data Consortium. Semantic based interface applications. It runs in a information comes from two servers, a KB Windows NT environment. Applications can server based on the noun portion of WordNet run in either Windows NT or Unix. (70K concepts), and a semantics server containing case frame information for 2500 System Components English verbs. A denotations server using unique concept names generated for each The NL Assistant toolkit consists of WordNet synset at ISI links the words in the lexicon to the concepts in the KB. -
Similarity Detection Using Latent Semantic Analysis Algorithm Priyanka R
International Journal of Emerging Research in Management &Technology Research Article August ISSN: 2278-9359 (Volume-6, Issue-8) 2017 Similarity Detection Using Latent Semantic Analysis Algorithm Priyanka R. Patil Shital A. Patil PG Student, Department of Computer Engineering, Associate Professor, Department of Computer Engineering, North Maharashtra University, Jalgaon, North Maharashtra University, Jalgaon, Maharashtra, India Maharashtra, India Abstract— imilarity View is an application for visually comparing and exploring multiple models of text and collection of document. Friendbook finds ways of life of clients from client driven sensor information, measures the S closeness of ways of life amongst clients, and prescribes companions to clients if their ways of life have high likeness. Roused by demonstrate a clients day by day life as life records, from their ways of life are separated by utilizing the Latent Dirichlet Allocation Algorithm. Manual techniques can't be utilized for checking research papers, as the doled out commentator may have lacking learning in the exploration disciplines. For different subjective views, causing possible misinterpretations. An urgent need for an effective and feasible approach to check the submitted research papers with support of automated software. A method like text mining method come to solve the problem of automatically checking the research papers semantically. The proposed method to finding the proper similarity of text from the collection of documents by using Latent Dirichlet Allocation (LDA) algorithm and Latent Semantic Analysis (LSA) with synonym algorithm which is used to find synonyms of text index wise by using the English wordnet dictionary, another algorithm is LSA without synonym used to find the similarity of text based on index. -
Introduction to Wordnet: an On-Line Lexical Database
Introduction to WordNet: An On-line Lexical Database George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine Miller (Revised August 1993) WordNet is an on-line lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, and adjectives are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets. Standard alphabetical procedures for organizing lexical information put together words that are spelled alike and scatter words with similar or related meanings haphazardly through the list. Unfortunately, there is no obvious alternative, no other simple way for lexicographers to keep track of what has been done or for readers to ®nd the word they are looking for. But a frequent objection to this solution is that ®nding things on an alphabetical list can be tedious and time-consuming. Many people who would like to refer to a dictionary decide not to bother with it because ®nding the information would interrupt their work and break their train of thought. In this age of computers, however, there is an answer to that complaint. One obvious reason to resort to on-line dictionariesÐlexical databases that can be read by computersÐis that computers can search such alphabetical lists much faster than people can. A dictionary entry can be available as soon as the target word is selected or typed into the keyboard. Moreover, since dictionaries are printed from tapes that are read by computers, it is a relatively simple matter to convert those tapes into the appropriate kind of lexical database. -
Web Search Result Clustering with Babelnet
Web Search Result Clustering with BabelNet Marek Kozlowski Maciej Kazula OPI-PIB OPI-PIB [email protected] [email protected] Abstract 2 Related Work In this paper we present well-known 2.1 Search Result Clustering search result clustering method enriched with BabelNet information. The goal is The goal of text clustering in information retrieval to verify how Babelnet/Babelfy can im- is to discover groups of semantically related docu- prove the clustering quality in the web ments. Contextual descriptions (snippets) of docu- search results domain. During the evalua- ments returned by a search engine are short, often tion we tested three algorithms (Bisecting incomplete, and highly biased toward the query, so K-Means, STC, Lingo). At the first stage, establishing a notion of proximity between docu- we performed experiments only with tex- ments is a challenging task that is called Search tual features coming from snippets. Next, Result Clustering (SRC). Search Results Cluster- we introduced new semantic features from ing (SRC) is a specific area of documents cluster- BabelNet (as disambiguated synsets, cate- ing. gories and glosses describing synsets, or Approaches to search result clustering can be semantic edges) in order to verify how classified as data-centric or description-centric they influence on the clustering quality (Carpineto, 2009). of the search result clustering. The al- The data-centric approach (as Bisecting K- gorithms were evaluated on AMBIENT Means) focuses more on the problem of data clus- dataset in terms of the clustering quality. tering, rather than presenting the results to the user. Other data-centric methods use hierarchical 1 Introduction agglomerative clustering (Maarek, 2000) that re- In the previous years, Web clustering engines places single terms with lexical affinities (2-grams (Carpineto, 2009) have been proposed as a solu- of words) as features, or exploit link information tion to the issue of lexical ambiguity in Informa- (Zhang, 2008). -
Sethesaurus: Wordnet in Software Engineering
This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TSE.2019.2940439 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 14, NO. 8, AUGUST 2015 1 SEthesaurus: WordNet in Software Engineering Xiang Chen, Member, IEEE, Chunyang Chen, Member, IEEE, Dun Zhang, and Zhenchang Xing, Member, IEEE, Abstract—Informal discussions on social platforms (e.g., Stack Overflow, CodeProject) have accumulated a large body of programming knowledge in the form of natural language text. Natural language process (NLP) techniques can be utilized to harvest this knowledge base for software engineering tasks. However, consistent vocabulary for a concept is essential to make an effective use of these NLP techniques. Unfortunately, the same concepts are often intentionally or accidentally mentioned in many different morphological forms (such as abbreviations, synonyms and misspellings) in informal discussions. Existing techniques to deal with such morphological forms are either designed for general English or mainly resort to domain-specific lexical rules. A thesaurus, which contains software-specific terms and commonly-used morphological forms, is desirable to perform normalization for software engineering text. However, constructing this thesaurus in a manual way is a challenge task. In this paper, we propose an automatic unsupervised approach to build such a thesaurus. In particular, we first identify software-specific terms by utilizing a software-specific corpus (e.g., Stack Overflow) and a general corpus (e.g., Wikipedia). Then we infer morphological forms of software-specific terms by combining distributed word semantics, domain-specific lexical rules and transformations. -
LDA V. LSA: a Comparison of Two Computational Text Analysis Tools for the Functional Categorization of Patents
LDA v. LSA: A Comparison of Two Computational Text Analysis Tools for the Functional Categorization of Patents Toni Cvitanic1, Bumsoo Lee1, Hyeon Ik Song1, Katherine Fu1, and David Rosen1! ! 1Georgia Institute of Technology, Atlanta, GA, USA (tcvitanic3, blee300, hyeoniksong) @gatech.edu (katherine.fu, david.rosen) @me.gatech.edu! !! Abstract. One means to support for design-by-analogy (DbA) in practice involves giving designers efficient access to source analogies as inspiration to solve problems. The patent database has been used for many DbA support efforts, as it is a pre- existing repository of catalogued technology. Latent Semantic Analysis (LSA) has been shown to be an effective computational text processing method for extracting meaningful similarities between patents for useful functional exploration during DbA. However, this has only been shown to be useful at a small-scale (100 patents). Considering the vastness of the patent database and realistic exploration at a large- scale, it is important to consider how these computational analyses change with orders of magnitude more data. We present analysis of 1,000 random mechanical patents, comparing the ability of LSA to Latent Dirichlet Allocation (LDA) to categorize patents into meaningful groups. Resulting implications for large(r) scale data mining of patents for DbA support are detailed. Keywords: Design-by-analogy ! Patent Analysis ! Latent Semantic Analysis ! Latent Dirichlet Allocation ! Function-based Analogy 1 Introduction Exposure to appropriate analogies during early stage design has been shown to increase the novelty, quality, and originality of generated solutions to a given engineering design problem [1-4]. Finding appropriate analogies for a given design problem is the largest challenge to practical implementation of DbA.