Alicg: Fine-Grained and Evolvable Conceptual Graph Construction for Semantic Search at Alibaba

Total Page:16

File Type:pdf, Size:1020Kb

Load more

AliCG: Fine-grained and Evolvable Conceptual Graph Construction for Semantic Search at Alibaba Ningyu Zhang1,2∗, Qianghuai Jia4∗, Shumin Deng1,2∗, Xiang Chen1,2, Hongbin Ye1,2, Hui Chen3, Huaixiao Tou3, Gang Huang5, Zhao Wang1, Nengwei Hua3, Huajun Chen1,2¢ 1Zhejiang University, China & AZFT Joint Lab for Knowledge Engine, China 3Alibaba Group, China 2Hangzhou Innovation Center, China, Zhejiang University, China, 4AntGroup, China 5Zhejiang Lab, China {zhangningyu,231sm,xiang_chen,yehongbin,huanggang,zhao_wang,huajunsir}@zju.edu.cn, {qianghuai.jqh,weidu.ch,huaixiao.thx,nengwei.huanw}@alibaba-inc.com ABSTRACT 1 INTRODUCTION Conceptual graphs, which is a particular type of Knowledge Graphs, Knowledge is important for text-related applications such as se- play an essential role in semantic search. Prior conceptual graph mantic search. Knowledge Graphs (KGs) organize facts in a struc- construction approaches typically extract high-frequent, coarse- tured graph way as triples in the form of <subject, predicate, object>, grained, and time-invariant concepts from formal texts. In real abridged as ¹B, ?, >º, where B and > denote entities and ? builds rela- applications, however, it is necessary to extract less-frequent, fine- tions between entities. Conceptual Graph, which is a special type grained, and time-varying conceptual knowledge and build taxon- of KGs, builds the semantic connections between concepts and has omy in an evolving manner. In this paper, we introduce an approach proven to be valuable in short text understanding [21], Word Sense to implementing and deploying the conceptual graph at Alibaba. Disambiguation [20], enhanced entity linking [5], semantic query Specifically, We propose a framework called AliCG which is capa- rewriting [22], etc. Essentially, conceptualization helps humans ble of a) extracting fine-grained concepts by a novel bootstrapping generalize previously gained knowledge and experience to new with alignment consensus approach, b) mining long-tail concepts settings, which may reveal paths to high-level cognitive System 2 with a novel low-resource phrase mining approach, c) updating the [2] in a conscious manner. In real-life applications, the conceptual graph dynamically via a concept distribution estimation method graph provides valuable knowledge to support many applications based on implicit and explicit user behaviors. We have deployed the [25], such as semantic search. Web search engines (e.g., Google and framework at Alibaba UC Browser. Extensive offline evaluation as Bing) leverage a taxonomy to better understand user queries and well as online A/B testing demonstrate the efficacy of our approach. improve the search quality. Moreover, many online retailers (e.g., Alibaba and Amazon) organize products into categories of different granularities so that customers can easily search and navigate this CCS CONCEPTS category taxonomy to find the items they want to purchase. • Information systems ! Query representation; Information In this paper, we introduce the Alibaba Conceptual Graph (Al- extraction. iCG), which is a large-scale conceptual graph of more than 5,000,000 fine-grained concepts, still in fast growth, automatically extracted KEYWORDS from noisy search logs. As shown in Figure 1, AliCG comprises level-1 Concept Mining; Taxonomy Construction; Knowledge Graph four levels: consists of concepts expressing the domain that those instances belong to; level-2 consists of concepts referred to ACM Reference Format: level-3 1,2∗ 4∗ 1,2∗ 1,2 the type or subclass of instances; consists of concepts that Ningyu Zhang , Qianghuai Jia , Shumin Deng , Xiang Chen , are the fine-grained conceptualization of instances expressing the Hongbin Ye1,2, Hui Chen3, Huaixiao Tou3, Gang Huang5, Zhao Wang1, implicit user intentions; instance layer includes all instances such Nengwei Hua3, Huajun Chen1,2¢ . 2021. AliCG: Fine-grained and Evolvable as entities and non-entity phrases. AliCG is currently deployed at arXiv:2106.01686v1 [cs.AI] 3 Jun 2021 Conceptual Graph Construction for Semantic Search at Alibaba. In Proceed- ings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Alibaba to support a variety of business scenarios, including the Mining (KDD ’21), August 14–18, 2021, Virtual Event, Singapore. ACM, New product Alibaba UC Browser. AliCG has been applied to more than York, NY, USA, 12 pages. https://doi.org/10.1145/3447548.3467057 dozens of applications in Alibaba UC Browser, including intent classification, named entity recognition, query rewriting, andso ∗ Equal contribution and shared co-first authorship. ¢ Corresponding author. on, and it receives more than 300 million requests per day. Building AliCG is not a trivial task. Previous studies such as Permission to make digital or hard copies of all or part of this work for personal or YAGO [18] and DBPedia [1] have investigated the extraction of classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation knowledge from formal texts (e.g., Wikipedia). Probase [23] pro- on the first page. Copyrights for components of this work owned by others than ACM poses an approach for extracting concepts from semi-structured must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, Web documents. However, these approaches could not be adapted to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. to our applications because several challenges remain unresolved. KDD ’21, August 14–18, 2021, Virtual Event, Singapore Fine-grained Concept Acquisition. Conventional approaches © 2021 Association for Computing Machinery. devoted to extracting coarse-grained concepts such as categories ACM ISBN 978-1-4503-8332-5/21/08...$15.00 https://doi.org/10.1145/3447548.3467057 or types. However, in Alibaba’s scenario of question fine-grained KDD ’21, August 14–18, 2021, Virtual Event, Singapore Ningyu Zhang et al. Query: Thing 1.0 1.0 Is it safe to eat durian during 1.0 1.0 1.0 confinement? level1 concept 健康 health 娱乐 entertainment 人物 people 组织 organization tool Concepts of the Query: 工具 • during confinement isA 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 postpartum confinement period 饮食 生理 疾病 歌曲 电影 虚拟人物 公司 厨具 • eat durian isA virtual level2 concept diet physiology disease song film company kitchenware eat tropical fruit character 1.0 1.0 0.6 1.0 产褥期 古风歌曲 神话角色 互联网公司 0.8 0.9 1.0 postpartum 1.0 antiquity mythological 1.0 Internet confinement period song character company Concept Level Matching level3 concept eat tropical rare mental Chinese folk tale barbecue fruit disorder animation film character tool 0.9 吃热带水果 罕见精神障碍 中国动漫电影 传说人物 0.9 烧烤工具 0.9 0.9 Concepts of the Documents: 0.1 0.6 0.8 0.1 • postpartum confinement period 0.9 0.9 0.8 0.9 during aboulomania Google • eat tropical fruit instance confinement 坐月子 body integrity 意志缺失狂 谷歌 • eat fruit eat durian Nezha Zeus Alibaba grill identity disorder • dietary contraindications 吃榴莲 身体完整性认同障碍 哪吒 宙斯 阿里巴巴 烧烤架 Figure 1: Data hierarchy of Alibaba Conceptual Graph (AliCG) for semantic search. concepts are necessary to increase the recall of answer results. edges, it is necessary to align those nodes with the same meaning For example, “grill (烤¶)” is a “tool (工w)” and “scarf (围巾)” is in the conceptual graph. Besides, as there are many multiple-to- a "clothes (服p)”. However, it would be more helpful if we can multiple nodes in the conceptual graph, it is prohibitive to update infer that a user searching for these items may be more interested such complex graphs over time. In other words, it is difficult to es- in “barbecue tool (烧烤工w)” or "keep warm clothes (保暖服 timate the confidence distribution of the concepts given instances. p)” rather than another "tool (工w)" like “wrench (sK)”—these To address challenges mentioned above, we propose the follow- concepts are rare in existing conceptual graphs. ing contributions in the design of AliCG: Long-tail Concept Mining. Conventional approaches [23] gen- First, we propose a novel bootstrapping with the alignment erally extract concepts based on Hearst patterns, e.g., "especially" consensus approach to tackle the first challenge of extracting fine- and "such as." However, these approaches cannot extract long-tail grained concepts from noisy search logs. Specifically, we utilize concepts from extremely short or noisy queries, which are com- a small number of predefined string patterns to extract concepts, mon in search engines. For instance, it is non-trivial to extract the which are then used to expand the pool of such patterns. Further, the concept "rare mental disorder (UÁ精^障碍)" of the instance new mined concepts are verified with query-title alignment; that is, "body integrity identity disorder («S完t'¤同障碍)" from the an essential concept in a query should repeat several times in the search log as only 35 instances mentioned "rare mental disorder". It document title frequently clicked by the user. Second, we introduce is rather difficult to extract such concepts from the short text with a novel conceptualized phrase mining and self-training with pattern matching (the pattern is too general [6–8, 26, 27], and there an ensemble consensus approach to extract long-tail concepts. On is little context information as well as co-occurrence samples), as the one hand, we extend the off-the-shelf phrase mining algorithm Figure 2 shows. Besides, there exist lots of scattered concepts in with conceptualized features to mine concepts unsupervisedly. On user search engines such as "traditional activities Tibetan New Year the other hand, we propose a novel low-resource sequence tagging (Ï历新t`俗)". Recent approaches usually regard such concept framework, namely, self-training with an ensemble consensus, to extraction procedure as sequence labeling tasks [13], which rely extract those scattered concepts. Finally, we propose a novel con- on a tremendous amount of training data for each concept.
Recommended publications
  • Formal Concept Analysis in Knowledge Processing: a Survey on Applications ⇑ Jonas Poelmans A,C, , Dmitry I

    Formal Concept Analysis in Knowledge Processing: a Survey on Applications ⇑ Jonas Poelmans A,C, , Dmitry I

    Expert Systems with Applications 40 (2013) 6538–6560 Contents lists available at SciVerse ScienceDirect Expert Systems with Applications journal homepage: www.elsevier.com/locate/eswa Review Formal concept analysis in knowledge processing: A survey on applications ⇑ Jonas Poelmans a,c, , Dmitry I. Ignatov c, Sergei O. Kuznetsov c, Guido Dedene a,b a KU Leuven, Faculty of Business and Economics, Naamsestraat 69, 3000 Leuven, Belgium b Universiteit van Amsterdam Business School, Roetersstraat 11, 1018 WB Amsterdam, The Netherlands c National Research University Higher School of Economics, Pokrovsky boulevard, 11, 109028 Moscow, Russia article info abstract Keywords: This is the second part of a large survey paper in which we analyze recent literature on Formal Concept Formal concept analysis (FCA) Analysis (FCA) and some closely related disciplines using FCA. We collected 1072 papers published Knowledge discovery in databases between 2003 and 2011 mentioning terms related to Formal Concept Analysis in the title, abstract and Text mining keywords. We developed a knowledge browsing environment to support our literature analysis process. Applications We use the visualization capabilities of FCA to explore the literature, to discover and conceptually repre- Systematic literature overview sent the main research topics in the FCA community. In this second part, we zoom in on and give an extensive overview of the papers published between 2003 and 2011 which applied FCA-based methods for knowledge discovery and ontology engineering in various application domains. These domains include software mining, web analytics, medicine, biology and chemistry data. Ó 2013 Elsevier Ltd. All rights reserved. 1. Introduction mining, i.e. gaining insight into source code with FCA.
  • Concept Maps Mining for Text Summarization

    Concept Maps Mining for Text Summarization

    UNIVERSIDADE FEDERAL DO ESPÍRITO SANTO CENTRO TECNOLÓGICO DEPARTAMENTO DE INFORMÁTICA PROGRAMA DE PÓS-GRADUAÇÃO EM INFORMÁTICA Camila Zacché de Aguiar Concept Maps Mining for Text Summarization VITÓRIA-ES, BRAZIL March 2017 Camila Zacché de Aguiar Concept Maps Mining for Text Summarization Dissertação de Mestrado apresentada ao Programa de Pós-Graduação em Informática da Universidade Federal do Espírito Santo, como requisito parcial para obtenção do Grau de Mestre em Informática. Orientador (a): Davidson CurY Co-orientador: Amal Zouaq VITÓRIA-ES, BRAZIL March 2017 Dados Internacionais de Catalogação-na-publicação (CIP) (Biblioteca Setorial Tecnológica, Universidade Federal do Espírito Santo, ES, Brasil) Aguiar, Camila Zacché de, 1987- A282c Concept maps mining for text summarization / Camila Zacché de Aguiar. – 2017. 149 f. : il. Orientador: Davidson Cury. Coorientador: Amal Zouaq. Dissertação (Mestrado em Informática) – Universidade Federal do Espírito Santo, Centro Tecnológico. 1. Informática na educação. 2. Processamento de linguagem natural (Computação). 3. Recuperação da informação. 4. Mapas conceituais. 5. Sumarização de Textos. 6. Mineração de dados (Computação). I. Cury, Davidson. II. Zouaq, Amal. III. Universidade Federal do Espírito Santo. Centro Tecnológico. IV. Título. CDU: 004 “Most of the fundamental ideas of science are essentially simple, and may, as a rule, be expressed in a language comprehensible to everyone.” Albert Einstein 4 Acknowledgments I would like to thank the manY people who have been with me over the years and who have contributed in one way or another to the completion of this master’s degree. Especially my advisor, Davidson Cury, who in a constructivist way disoriented me several times as a stimulus to the search for new answers.
  • Empowering Video Storytellers: Concept Discovery and Annotation for Large Audio-Video-Text Archives

    Empowering Video Storytellers: Concept Discovery and Annotation for Large Audio-Video-Text Archives

    Empowering Video Storytellers: Concept Discovery and Annotation for Large Audio-Video-Text Archives A large amount of video content is available and stored by broadcasters, libraries and other enterprises. However, the lack of effective and efficient tools for searching and retrieving has prevented video becoming a major value asset. Much work has been done for retrieving video information from constrained structured domains, such as news and sports, but general solutions for unstructured audio-video-language intensive content are missing. Our goal is to transform the multimedia archive into an organized and searchable asset so that human efforts can focus on a more important aspect - storytelling. This project involves a cross- disciplinary effort towards development of new techniques for organizing, summarizing, and question answering over audio-video-language intensive content, such as raw documentary footage. Such domains offer the full range of real-world situations and events, manifested with interaction of rich multimedia information. The team includes investigators from Schools of Engineering and Applied Science, Journalism, and Arts. It has extensive experience and deep expertise in analysis of multimedia information - video, audio, and language - as well as broad experience with professional documentary film production and interactive media designs. The project includes two major content partners, WITNESS and WNET, both providing unique video archives, well-defined application domains and user communities. In particular, WITNESS collects and manages a documentary video archive from activists all over the world to support the promotion of human rights. This archive has several important qualities: it is diverse (including interviews, documentation of events, and ‘hidden camera’ reporting), it is relatively ‘raw’ (shot by semi-professional operators), it has been shot for specific purposes (e.g.
  • Technical Phrase Extraction for Patent Mining: a Multi-Level Approach

    Technical Phrase Extraction for Patent Mining: a Multi-Level Approach

    2020 IEEE International Conference on Data Mining (ICDM) Technical Phrase Extraction for Patent Mining: A Multi-level Approach Ye Liu, Han Wu, Zhenya Huang, Hao Wang, Jianhui Ma, Qi Liu, Enhong Chen*, Hanqing Tao, Ke Rui Anhui Province Key Laboratory of Big Data Analysis and Application, School of Data Science & School of Computer Science and Technology, University of Science and Technology of China {liuyer, wuhanhan, wanghao3, hqtao, kerui}@mail.ustc.edu.cn, {huangzhy, jianhui, qiliuql, cheneh}@ustc.edu.cn Abstract—Recent years have witnessed a booming increase of TABLE I: Technical Phrases vs. Non-technical Phrases patent applications, which provides an open chance for revealing Domain Technical phrase Non-technical phrase the inner law of innovation, but in the meantime, puts forward Electricity wireless communication, wire and cable, TV sig- higher requirements on patent mining techniques. Considering netcentric computer service nal, power plug Mechanical fluid leak detection, power building materials, steer that patent mining highly relies on patent document analysis, Engineering transmission column, seat back this paper makes a focused study on constructing a technology portrait for each patent, i.e., to recognize technical phrases relevant works have been explored on phrase extraction. Ac- concerned in it, which can summarize and represent patents cording to the extraction target, they can be divided into Key from a technology angle. To this end, we first give a clear Phrase Extraction [8], Named Entity Recognition (NER) [9] and detailed description about technical phrases in patents and Concept Extraction [10]. Key Phrase Extraction aims to based on various prior works and analyses. Then, combining characteristics of technical phrases and multi-level structures extract phrases that provide a concise summary of a document, of patent documents, we develop an Unsupervised Multi-level which prefers those both frequently-occurring and closed to Technical Phrase Extraction (UMTPE) model.
  • Information Extraction in Text Mining Matt Ulinsm Western Washington University

    Information Extraction in Text Mining Matt Ulinsm Western Washington University

    Western Washington University Western CEDAR Computer Science Graduate and Undergraduate Computer Science Student Scholarship 2008 Information extraction in text mining Matt ulinsM Western Washington University Follow this and additional works at: https://cedar.wwu.edu/computerscience_stupubs Part of the Computer Sciences Commons Recommended Citation Mulins, Matt, "Information extraction in text mining" (2008). Computer Science Graduate and Undergraduate Student Scholarship. 4. https://cedar.wwu.edu/computerscience_stupubs/4 This Research Paper is brought to you for free and open access by the Computer Science at Western CEDAR. It has been accepted for inclusion in Computer Science Graduate and Undergraduate Student Scholarship by an authorized administrator of Western CEDAR. For more information, please contact [email protected]. Information Extraction in Text Mining Matt Mullins Computer Science Department Western Washington University Bellingham, WA [email protected] Information Extraction in Text Mining Matt Mullins Computer Science Department Western Washington University Bellingham, WA [email protected] ABSTRACT Text mining’s goal, simply put, is to derive information from text. Using multitudes of technologies from overlapping fields like Data Mining and Natural Language Processing we can yield knowledge from our text and facilitate other processing. Information Extraction (IE) plays a large part in text mining when we need to extract this data. In this survey we concern ourselves with general methods borrowed from other fields, with lower-level NLP techniques, IE methods, text representation models, and categorization techniques, and with specific implementations of some of these methods. Finally, with our new understanding of the field we can discuss a proposal for a system that combines WordNet, Wikipedia, and extracted definitions and concepts from web pages into a user-friendly search engine designed for topic- specific knowledge.
  • Concept Mining: a Conceptual Understanding Based Approach

    Concept Mining: a Conceptual Understanding Based Approach

    Concept Mining: A Conceptual Understanding based Approach by Shady Shehata A thesis presented to the University of Waterloo in ful¯llment of the thesis requirement for the degree of Doctor of Philosophy in Electrical and Computer Engineering Waterloo, Ontario, Canada, 2009 °c Shady Shehata 2009 I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required ¯nal revisions, as accepted by my examiners. I understand that my thesis may be made electronically available to the public. ii Abstract Due to the daily rapid growth of the information, there are considerable needs to extract and discover valuable knowledge from data sources such as the World Wide Web. Most of the common techniques in text mining are based on the statistical analysis of a term either word or phrase. These techniques consider documents as bags of words and pay no attention to the meanings of the document content. In addition, statistical analysis of a term frequency captures the importance of the term within a document only. However, two terms can have the same frequency in their documents, but one term contributes more to the meaning of its sentences than the other term. Therefore, there is an intensive need for a model that captures the meaning of linguistic utterances in a formal structure. The underlying model should indicate terms that capture the semantics of text. In this case, the model can capture terms that present the concepts of the sentence, which leads to discover the topic of the document. A new concept-based model that analyzes terms on the sentence, document and corpus levels rather than the traditional analysis of document only is introduced.
  • A User-Centered Concept Mining System for Query and Document Understanding at Tencent

    A User-Centered Concept Mining System for Query and Document Understanding at Tencent

    A User-Centered Concept Mining System for Query and Document Understanding at Tencent Bang Liu1∗, Weidong Guo2∗, Di Niu1, Chaoyue Wang2 Shunnan Xu2, Jinghong Lin2, Kunfeng Lai2, Yu Xu2 1University of Alberta, Edmonton, AB, Canada 2Platform and Content Group, Tencent, Shenzhen, China ABSTRACT ACM Reference Format: 1∗ 2∗ 1 2 2 Concepts embody the knowledge of the world and facilitate the Bang Liu , Weidong Guo , Di Niu , Chaoyue Wang and Shunnan Xu , Jinghong Lin2, Kunfeng Lai2, Yu Xu2. 2019. A User-Centered Concept Min- cognitive processes of human beings. Mining concepts from web ing System for Query and Document Understanding at Tencent. In The documents and constructing the corresponding taxonomy are core 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining research problems in text understanding and support many down- (KDD ’19), August 4–8, 2019, Anchorage, AK, USA. ACM, New York, NY, USA, stream tasks such as query analysis, knowledge base construction, 11 pages. https://doi.org/10.1145/3292500.3330727 recommendation, and search. However, we argue that most prior studies extract formal and overly general concepts from Wikipedia or static web pages, which are not representing the user perspective. 1 INTRODUCTION In this paper, we describe our experience of implementing and de- The capability of conceptualization is a critical ability in natural ploying ConcepT in Tencent QQ Browser. It discovers user-centered language understanding and is an important distinguishing factor concepts at the right granularity conforming to user interests, by that separates a human being from the current dominating machine mining a large amount of user queries and interactive search click intelligence based on vectorization.
  • Unsupervised Extraction and Clustering of Key Phrases from Scientific Publications

    Unsupervised Extraction and Clustering of Key Phrases from Scientific Publications

    Unsupervised Extraction and Clustering of Key Phrases from Scientific Publications Xiajing Li Uppsala University Department of Linguistics and Philology Master Programme in Language Technology Master’s Thesis in Language Technology, 30 ects credits September 25, 2020 Supervisors: Dr. Fredrik Wahlberg, Uppsala University Dr. Marios Daoutis, Ericsson AB Abstract Mapping a research domain can be of great signicance for understanding and structuring the state-of-art of a research area. Standard techniques for system- atically reviewing scientic literature entail extensive selection and intensive reading of manuscripts, a laborious and time consuming process performed by human experts. Researchers have spent eorts on automating methods in one or more sub-tasks of reviewing process. The main challenge of this work lies in the gap in semantic understanding of text and background domain knowledge. In this thesis we investigate the possibility of extracting keywords from scien- tic abstracts in an automated way. We intended to use the categories of these keywords to form a basis of a classication scheme in the context of systemati- cally mapping studies. We propose a framework by joint unsupervised keyphrase extraction and semantic keyphrase clustering. Specically, we (1) explore the eect of domain relevance and phrase quality measures in keyphrase extraction; (2) explore the eect of knowledge graph based word embedding in embedding rep- resentation of phrase semantics; (3) explore the eect of clustering for grouping semantically related keyphrases. Experiments are conducted on a dataset of publications pertaining the do- main of "Explainable Articial Intelligence (XAI)”. We further test the perfor- mance of clustering using terms and labels from publicly available academic taxonomies and keyword databases.
  • Alicoco: Alibaba E-Commerce Cognitive Concept

    Alicoco: Alibaba E-Commerce Cognitive Concept

    AliCoCo: Alibaba E-commerce Cognitive Concept Net Xusheng Luo∗ Luxin Liu, Le Bo, Yuanpeng Cao, Alibaba Group, Hangzhou, China Jinhang Wu, Qiang Li [email protected] Alibaba Group, Hangzhou, China Yonghua Yang, Keping Yang Kenny Q. Zhu Alibaba Group, Hangzhou, China Shanghai Jiao Tong University, Shanghai, China ABSTRACT 1 INTRODUCTION One of the ultimate goals of e-commerce platforms is to One major functionality of e-commerce platforms is to match satisfy various shopping needs for their customers. Much the shopping need of a customer to a small set of items from efforts are devoted to creating taxonomies or ontologies in an enormous candidate set. With the rapid developments e-commerce towards this goal. However, user needs in e- of search engine and recommender system, customers are commerce are still not well defined, and none of the existing able to quickly find those items they need. However, the ontologies has the enough depth and breadth for universal experience is still far from “intelligent”. One significant rea- user needs understanding. The semantic gap in-between pre- son is that there exists a huge semantic gap between what vents shopping experience from being more intelligent. In users need in their mind and how the items are organized this paper, we propose to construct a large-scale e-commerce in e-commerce platforms. The taxonomy to organize items Cognitive Concept net named “AliCoCo”, which is prac- in Alibaba (actually almost every e-commerce platforms) is ticed in Alibaba, the largest Chinese e-commerce platform generally based on CPV (Category-Property-Value): thou- in the world.
  • Unsupervised Concept Hierarchy Learning: a Topic Modeling Guided Approach V

    Unsupervised Concept Hierarchy Learning: a Topic Modeling Guided Approach V

    Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 89 ( 2016 ) 386 – 394 Twelfth International Multi-Conference on Information Processing-2016 (IMCIP-2016) Unsupervised Concept Hierarchy Learning: A Topic Modeling Guided Approach V. S. Anoopa,∗,S.Asharafa and P. Deepakb aIndian Institute of Information Technology and Management, Kerala (IIITM-K), Thiruvananthapuram, India bQueen’s University Belfast, UK Abstract This paper proposes an efficient and scalable method for concept extraction and concept hierarchy learning from large unstructured text corpus which is guided by a topic modeling process. The method leverages “concepts” from statistically discovered “topics” and then learns a hierarchy of those concepts by exploiting a subsumption relation between them. Advantage of the proposed method is that the entire process falls under the unsupervised learning paradigm thus the use of a domain specific training corpus can be eliminated. Given a massive collection of text documents, the method maps topics to concepts by some lightweight statistical and linguistic processes and then probabilistically learns the subsumption hierarchy. Extensive experiments with large text corpora such as BBC News dataset and Reuters News corpus shows that our proposed method outperforms some of the existing methods for concept extraction and efficient concept hierarchy learning is possible if the overall task is guided by a topic modeling process. © 2016 The Authors. Published by Elsevier B.V. © 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license Peer-review under responsibility of organizing committee of the Twelfth International Multi-Conference on Information (http://creativecommons.org/licenses/by-nc-nd/4.0/).
  • Generating Word and Document Embeddings for Sentiment Analysis

    Generating Word and Document Embeddings for Sentiment Analysis

    Generating Word and Document Embeddings for Sentiment Analysis Cem Rıfkı Aydın, Tunga Gung¨ or¨ and Ali Erkan Bogazic¸i˘ University Computer Engineering Department Bebek, Istanbul, Turkey, 34342 [email protected], [email protected], [email protected] Abstract. Sentiments of words differ from one corpus to another. Inducing gen- eral sentiment lexicons for languages and using them cannot, in general, produce meaningful results for different domains. In this paper, we combine contextual and supervised information with the general semantic representations of words occurring in the dictionary. Contexts of words help us capture the domain-specific information and supervised scores of words are indicative of the polarities of those words. When we combine supervised features of words with the features extracted from their dictionary definitions, we observe an increase in the suc- cess rates. We try out the combinations of contextual, supervised, and dictionary- based approaches, and generate original vectors. We also combine the word2vec approach with hand-crafted features. We induce domain-specific sentimental vec- tors for two corpora, which are the movie domain and the Twitter datasets in Turkish. When we thereafter generate document vectors and employ the support vector machines method utilising those vectors, our approaches perform better than the baseline studies for Turkish with a significant margin. We evaluated our models on two English corpora as well and these also outperformed the word2vec approach. It shows that our approaches are cross-domain and portable to other languages. Keywords: Sentiment Analysis, Opinion Mining, Word Embeddings, Machine Learning 1 Introduction Sentiment analysis has recently been one of the hottest topics in natural language pro- arXiv:2001.01269v2 [cs.CL] 7 Dec 2020 cessing (NLP).
  • A Survey of Building a Reverse Dictionary

    A Survey of Building a Reverse Dictionary

    Jincy A K et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (6) , 2014, 7727-7728 A Survey of Building a Reverse Dictionary Jincy A K, Sindhu L Department of Computer Science& Engg,College Of Engineering, Cochin university of science and Technology, Poonjar,Kottayam,kerala,India Abstract- A reverse dictionary takes as input a phrase or II. GENERAL TECHNIQUE a sentence describing a concept, and returns a set of The different steps for the implementation of reverse candidate words that satisfy the meaning of the input dictionary includes lexical analysis, elimination of stop phrase. This paper presents a literature survey words, stemming, pos tagging, concept mapping, building regarding the design and implementation of a reverse the reverse mapping set, querying the reverse mapping set, dictionary. The reverse dictionary identifies a ranking candidate words, term similarity... concept/idea/definition to words and phrases used to describe that concept. It has significant application for A. Stemming those who work closely with words and also in the Stemming is the process of finding the root word of a general field of conceptual search. Mainly for linguists, word contained in a given phrase. Normally stemming is poets, anthropologists and forensics specialist accomplished by means of porter stemmer but studies examining a damaged text that had only the final proved that it has certain drawbacks portion of a particular word preserved. B. POS Tagging Part-Of-Speech Tagging is the process of assigning parts Keywords- Thesaurus, Stemming, Pos Tagging, Concept of speech to each word (and other token), such as noun, Mining verb, adjective, etc.