Erik Cambria Dipankar Das Sivaji Bandyopadhyay Antonio Feraco Editors a Practical Guide to Sentiment Analysis Socio-Affective Computing


Socio-Affective Computing
Volume 5

Series Editor: Amir Hussain, University of Stirling, Stirling, UK
Co-Editor: Erik Cambria, Nanyang Technological University, Singapore

This exciting Book Series aims to publish state-of-the-art research on socially intelligent, affective and multimodal human-machine interaction and systems. It will emphasize the role of affect in social interactions and the humanistic side of affective computing by promoting publications at the cross-roads between engineering and human sciences (including biological, social and cultural aspects of human life).

Three broad domains of social and affective computing will be covered by the book series: (1) social computing, (2) affective computing, and (3) interplay of the first two domains (for example, augmenting social interaction through affective computing). Examples of the first domain will include, but are not limited to: all types of social interactions that contribute to the meaning, interest and richness of our daily life, for example, information produced by a group of people used to provide or enhance the functioning of a system. Examples of the second domain will include, but are not limited to: computational and psychological models of emotions, bodily manifestations of affect (facial expressions, posture, behavior, physiology), and affective interfaces and applications (dialogue systems, games, learning etc.).

This series will publish works of the highest quality that advance the understanding and practical application of social and affective computing techniques. Research monographs, introductory and advanced level textbooks, volume editions and proceedings will be considered.
More information about this series at http://www.springer.com/series/13199

Erik Cambria • Dipankar Das • Sivaji Bandyopadhyay • Antonio Feraco
Editors

A Practical Guide to Sentiment Analysis

Editors

Erik Cambria
School of Computer Science and Engineering
Nanyang Technological University
Singapore, Singapore

Dipankar Das
Computer Science and Engineering Department
Jadavpur University
Kolkata, India

Sivaji Bandyopadhyay
Computer Science and Engineering Department
Jadavpur University
Kolkata, India

Antonio Feraco
Fraunhofer IDM@NTU
Nanyang Technological University
Singapore, Singapore

ISSN 2509-5706
ISSN 2509-5714 (electronic)
Socio-Affective Computing
ISBN 978-3-319-55392-4
ISBN 978-3-319-55394-8 (eBook)
DOI 10.1007/978-3-319-55394-8
Library of Congress Control Number: 2017938021

© Springer International Publishing AG 2017
Chapter 4 is published with kind permission of Her Majesty the Queen in Right of Canada.

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication.
Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

While sentiment analysis research has become very popular in the past ten years, most companies and researchers still approach it simply as a polarity detection problem. In reality, sentiment analysis is a “suitcase problem” that requires tackling many natural language processing (NLP) subtasks, including microtext analysis, sarcasm detection, anaphora resolution, subjectivity detection, and aspect extraction. In this book, we present an overview of the main issues and challenges associated with current sentiment analysis research and provide some insights on practical tools and techniques that can be exploited to both advance the state of the art in all sentiment analysis subtasks and explore new areas in the same context.

In Chap. 1, we discuss the state of the art of affective computing and sentiment analysis research, including recent deep learning techniques and linguistic patterns for emotion and polarity detection from different modalities, e.g., text and video.

In Chap. 2, Bing Liu describes different aspects of sentiment analysis and different types of opinions. In particular, he uses product reviews as examples to introduce general key concepts and definitions that are applicable to all forms of formal and informal opinion text and all kinds of domains including social and political domains.

In Chap. 3, Jiwei Li and Eduard Hovy describe possible directions for deeper understanding about what opinions or sentiments are, why people hold them, and why and how their facets are chosen and expressed, helping bridge the gap between psychology/cognitive science and computational approaches.

In Chap. 4, Saif Mohammad discusses different sentiment analysis problems and the challenges that are to be faced in order to go beyond simply determining whether a piece of text is positive, negative, or neutral. In particular, the chapter aims to equip researchers and practitioners with pointers to the latest developments in sentiment analysis and encourage more work in the diverse landscape of problems, especially those areas that are relatively less explored.

In Chap. 5, Aditya Joshi, Pushpak Bhattacharyya, and Sagar Ahire contrast the process of lexicon creation for a new language or a resource-scarce language from a resource-rich one and, hence, show how the produced sentiment resources can be exploited to solve classic sentiment analysis problems.

In Chap. 6, Hongning Wang and ChengXiang Zhai show how generative models can be used to integrate opinionated text data and their companion numerical sentiment ratings, enabling deeper analysis of sentiment and opinions to obtain not only subtopic-level sentiment but also latent relative weights on different subtopics.

In Chap. 7, Vasudeva Varma, Litton Kurisinkel, and Priya Radhakrishnan present an overview of general approaches to automated text summarization with more emphasis on extractive summarization techniques. They also describe recent works on extractive summarization and the nature of scoring function for candidate summary.

In Chap. 8, Paolo Rosso and Leticia Cagnina describe the very challenging problems of deception detection and opinion spam detection, as lies and spam are becoming increasingly serious issues with the rise, both in size and importance, of social media and public opinion.

Finally, in Chap. 9, Federica Bisio et al. describe how to enhance the accuracy of any algorithm for emotion or polarity detection through the integration of commonsense reasoning resources, e.g., by embedding a concept-level knowledge base for sentiment analysis.

Singapore, Singapore: Erik Cambria
Kolkata, India: Dipankar Das
Kolkata, India: Sivaji Bandyopadhyay
Singapore, Singapore: Antonio Feraco

Contents

1 Affective Computing and Sentiment Analysis
  Erik Cambria, Dipankar Das, Sivaji Bandyopadhyay, and Antonio Feraco
2 Many Facets of Sentiment Analysis
  Bing Liu
3 Reflections on Sentiment/Opinion Analysis
  Jiwei Li and Eduard Hovy
4 Challenges in Sentiment Analysis
  Saif M. Mohammad
5 Sentiment Resources: Lexicons and Datasets
  Aditya Joshi, Pushpak Bhattacharyya, and Sagar Ahire
6 Generative Models for Sentiment Analysis and Opinion Mining
  Hongning Wang and ChengXiang Zhai
7 Social Media Summarization
  Vasudeva Varma, Litton J. Kurisinkel, and Priya Radhakrishnan
8 Deception Detection and Opinion Spam
  Paolo Rosso and Leticia C. Cagnina
9 Concept-Level Sentiment Analysis with SenticNet
  Federica Bisio, Claudia Meda, Paolo Gastaldo, Rodolfo Zunino, and Erik Cambria
Index

Chapter 1
Affective Computing and Sentiment Analysis

Erik Cambria, Dipankar Das, Sivaji Bandyopadhyay, and Antonio Feraco

Abstract Understanding emotions is one of the most important aspects of personal development and growth and, as such, it is a key tile for the emulation of human intelligence. Besides being important for the advancement of AI, emotion processing is also important for the closely related task of polarity detection. The opportunity automatically to capture the sentiments of the general public