Linguistic‑Inspired Chinese Sentiment Analysis: From Characters to Radicals and Phonetics
This document is downloaded from DR‑NTU (https://dr.ntu.edu.sg), Nanyang Technological University, Singapore.

Peng, H. (2019). Linguistic‑inspired Chinese sentiment analysis: from characters to radicals and phonetics. Doctoral thesis, Nanyang Technological University, Singapore.
https://hdl.handle.net/10356/84297
https://doi.org/10.32657/10220/48173
Downloaded on 27 Sep 2021 07:13:48 SGT

LINGUISTIC-INSPIRED CHINESE SENTIMENT ANALYSIS: FROM CHARACTERS TO RADICALS AND PHONETICS

HAIYUN PENG
SCHOOL OF COMPUTER SCIENCE AND ENGINEERING

A thesis submitted to the Nanyang Technological University in partial fulfillment of the requirement for the degree of Doctor of Philosophy

2019

Statement of Originality

I hereby certify that the work embodied in this thesis is the result of original research, is free of plagiarised materials, and has not been submitted for a higher degree to any other University or Institution.

May 2, 2019
Haiyun Peng

Supervisor Declaration Statement

I have reviewed the content and presentation style of this thesis and declare it is free of plagiarism and of sufficient grammatical clarity to be examined. To the best of my knowledge, the research and writing are those of the candidate except as acknowledged in the Author Attribution Statement. I confirm that the investigations were conducted in accord with the ethics policies and integrity standards of Nanyang Technological University and that the research data are presented honestly and without prejudice.

May 2, 2019
Erik Cambria

Authorship Attribution Statement

This thesis contains material from 4 papers published in the following peer-reviewed journals / from papers accepted at conferences in which I am listed as an author.

Chapter 2 is published as Haiyun Peng, Erik Cambria, and Amir Hussain. "A review of sentiment analysis research in Chinese language." Cognitive Computation 9, no. 4 (2017): 423-435.
The contributions of the co-authors are as follows:
• Prof Cambria suggested the review area and edited the manuscript drafts.
• I reviewed the literature and wrote the review manuscript draft.
• Prof Hussain proofread the manuscript.

Chapter 3 is published as Haiyun Peng and Erik Cambria. "CSenticNet: A Concept-level Resource for Sentiment Analysis in Chinese Language." In International Conference on Computational Linguistics and Intelligent Text Processing (CICLing), 90-104, 2017.
The contributions of the co-authors are as follows:
• Prof Cambria suggested the topic, and edited and proofread the paper.
• I designed the algorithm, conducted the experiments, and wrote the paper.

Chapter 4 is published as Haiyun Peng, Erik Cambria, and Xiaomei Zou. "Radical-based hierarchical embeddings for Chinese sentiment analysis at sentence level." In FLAIRS conference, 347-352, 2017.
The contributions of the co-authors are as follows:
• Prof Cambria participated in discussion, and edited and proofread the paper.
• I designed the methodology, conducted the experiments, and wrote the paper.
• Xiaomei handled the parsing of Chinese characters into Chinese radicals.

Chapter 5 is published as Haiyun Peng, Yukun Ma, Yang Li, and Erik Cambria. "Learning multigrained aspect target sequence for Chinese sentiment analysis." Knowledge-Based Systems 148 (2018): 167-176.
The contributions of the co-authors are as follows:
• Prof Cambria participated in discussion, and edited and proofread the manuscript.
• I designed the models, ran the experiments, and wrote the manuscript.
• Yukun participated in discussion and helped design experimental validation.
• Yang participated in discussion and extracted visual features.

May 2, 2019
Haiyun Peng

Acknowledgments

First of all, I would like to express my sincere gratitude towards my PhD supervisor Prof. Erik Cambria. For the past four years, he has been continuously supportive and encouraging.
Without his patient and insightful guidance, I would not have acquired the knowledge and skills to reach this stage.

I would like to thank my TAC panel members and co-supervisor, Prof. Quek Hiok Chai, Prof. Francis Bond and Dr. Chi Xu, for their helpful comments and advice. In addition, I would like to thank Dr. Soujanya Poria and Dr. Yukun Ma for their supportive and inspiring discussions and assistance during my PhD study. My study would not have been complete without their collaboration.

Furthermore, my PhD journey would not have been as pleasant and rewarding without the friendship and help of Sa Gao, Dr. Yang Li, Sandro Cavallari, Qian Chen, Edoardo Ragusa, Pranav Rai and Xiaomei Zou.

I would like to thank my girlfriend, Xiuhua Geng, for all her love and support.

Last but not least, I would like to express my deep gratitude towards my parents for all their love and support. I am especially thankful to my mother, Ying Chen, for her understanding and relentless support during the crucial stages of my life. This thesis is completely dedicated to them.

Contents

Statement of Originality
Supervisor Declaration Statement
Authorship Attribution Statement
Acknowledgments
List of Figures
List of Tables
Abstract

1 Introduction
  1.1 Background
  1.2 Challenges
  1.3 Contributions
  1.4 Organization

2 Literature Review
  2.1 Sentiment Resource
    2.1.1 Corpus
    2.1.2 Lexicon
  2.2 Monolingual Approach
    2.2.1 Preprocessing
    2.2.2 Machine Learning-based Approach
    2.2.3 Knowledge-based Approach
    2.2.4 Mix Models
  2.3 Multilingual Approach
  2.4 Text Representation
    2.4.1 General Embedding Methods
    2.4.2 Chinese Text Representation
  2.5 Chinese Phonology
  2.6 Summary

3 CSenticNet
  3.1 Introduction
  3.2 Background
  3.3 Overview
    3.3.1 Resources
    3.3.2 Two Versions
  3.4 First Version: SentiWordNet + NTU MC
  3.5 Second Version: SenticNet + NTU MC
    3.5.1 SenticNet and Preprocessing
    3.5.2 Mapping SenticNet to WordNet
    3.5.3 Find and Extract the Overlap
  3.6 Evaluation
  3.7 Summary

4 Radical-Based Hierarchical Embeddings
  4.1 Introduction
  4.2 Background
    4.2.1 Chinese Radicals
  4.3 Hierarchical Chinese Embedding
    4.3.1 Skip-Gram Model
    4.3.2 Radical-Based Embedding
    4.3.3 Hierarchical Embedding
  4.4 Experimental Evaluation
    4.4.1 Dataset
    4.4.2 Experimental Setting
    4.4.3 Results and Discussion
  4.5 Summary

5 Multi-grained Aspect Target Sequence Modeling
  5.1 Introduction
  5.2 Background
  5.3 Method Overview
    5.3.1 Aspect Target Sequence
    5.3.2 Task Definition
    5.3.3 Overview of the Algorithm
  5.4 Adaptive Embedding Learning
    5.4.1 Sentence Sequence Learning
    5.4.2 Aspect Target Unit Learning
  5.5 Sequence Learning of Aspect Target
  5.6 Fusion of Multi-Granularity Representation
    5.6.1 Early Fusion
    5.6.2 Late Fusion
  5.7 Evaluation
    5.7.1 Datasets
    5.7.2 Comparison Methods
    5.7.3 Result Analysis
    5.7.4 Visual Case Study
    5.7.5 Granularity and Fusion Analysis
  5.8 Summary

6 Phonetic-enriched Text Representation
  6.1 Introduction
  6.2 Model
    6.2.1 Textual Embedding
    6.2.2 Training Visual Features
    6.2.3 Learning Phonetic Features
    6.2.4 DISA
    6.2.5 Fusion of Modalities
  6.3 Experiments and Results
    6.3.1 Experimental Setup
    6.3.2 Experiments on Unimodality
    6.3.3 Experiments on Fusion of Modalities
    6.3.4 Cross-domain Evaluation
    6.3.5 Ablation Tests
    6.3.6