
Application of Machine Learning in Automatic Sentiment Recognition from Human Speech

Zhang Liu (Anglo-Chinese Junior College, Singapore) and Ng EYK (College of Engineering, Nanyang Technological University (NTU), Singapore)
[email protected]

Abstract— Opinions and sentiments are central to almost all human activities and have a wide range of applications. As many decision makers turn to social media due to the large volume of opinion data available, efficient and accurate sentiment analysis is necessary to extract those data. Hence, text sentiment analysis has recently become a popular field and has attracted many researchers. However, extracting sentiments from audio speech remains a challenge. This project explores the possibility of applying supervised Machine Learning to recognise sentiments in English utterances at the sentence level. In addition, the project also aims to examine the effect of combining acoustic and linguistic features on classification accuracy. Six audio tracks are randomly selected as training data from 40 YouTube videos (monologues) with a strong presence of sentiment. Speakers express sentiments towards products, films, or political events. These sentiments are manually labelled as negative or positive based on the independent judgement of 3 experimenters. A wide range of acoustic and linguistic features are then analysed and extracted using sound editing and text mining tools respectively. A novel approach is proposed, which uses a simplified sentiment score to integrate linguistic features and estimate sentiment valence. This approach improves negation analysis and hence increases overall accuracy. Results show that when both linguistic and acoustic features are used, the accuracy of sentiment recognition improves significantly, and that excellent prediction is achieved when the four classifiers are trained respectively, with kNN and Neural Network having higher accuracies. Possible sources of error and inherent challenges of audio sentiment analysis are discussed to provide potential directions for future work.

Keywords – Sentiment Analysis; Natural Language Processing; Machine Learning; Affective Computing; Data Mining; Speech Processing; Computational Linguistics.

I. INTRODUCTION

Sentiment analysis, also called opinion mining, is the field of study that analyses people's opinions, sentiments, appraisals, attitudes, and emotions toward entities and their attributes [1]. Opinions and sentiments are central to almost all human activities and have a wide range of applications. As many decision makers turn to social media due to the large volume of opinion data available, efficient and accurate sentiment analysis is necessary to extract those data. Business organisations in different sectors use social media to find out consumer opinions and improve their products and services. Political party leaders need to know the current public sentiment to come up with campaign strategies. Government agencies also monitor citizens' opinions on social media. Police agencies, for example, detect criminal intent and cyber threats by analysing sentiment valence in social media posts. In addition, sentiment can be used to make predictions, such as in the stock market, electoral politics and even box office revenue. Moreover, sentiment analysis that moves towards emotion recognition can potentially enhance psychiatric treatment, as the emotions of patients are more accurately identified.

Since 2000, researchers have made many successful attempts at text sentiment analysis. In comparison, audio sentiment analysis does not seem to receive as much attention. It is, however, equally significant. Many people in contemporary society share their opinions through online multimedia platforms such as YouTube videos, Instagram stories, TV talk shows and TED talks. It is difficult to classify the sentiments in them manually due to the sheer amount of data. With the help of machine automation, we can recognise, with acceptable accuracy, the general sentiments about certain products, movies and socio-political events, hence aiding the decision-making processes of corporations, societal organisations and governments.

This project explores the possibility of using a machine learning approach to recognise sentiments accurately and automatically from natural audio speech in English. In addition, the project also aims to examine the effect of combining acoustic and linguistic features on classification accuracy. Training data consist of 150 speech segments extracted from 6 YouTube videos of different genres. Both acoustic and linguistic features will be examined in order to increase the accuracy of automatic sentiment recognition. Sentiments will be categorised into 2 target classes, positive and negative.

II. LITERATURE REVIEW

There have been previous attempts to combine acoustic and linguistic features of speech in sentiment analysis. Lee & Narayanan (2005) [2] explored the detection of domain-specific emotions using language and discourse information in conjunction with acoustic correlates of emotion in speech signals. The specific focus was a case study on detecting negative and non-negative emotions using spoken language data obtained from a call centre application. Results showed that combining all the information, rather than using only acoustic information, improved emotion classification by 40.7% for males and 36.4% for females (with a linear discriminant classifier used for the acoustic information). This study suggested a comprehensive range of features and provided some insights for my project: acoustic features (Fundamental Frequency (F0), Energy, Duration, Formants) and textual features (emotional salience, discourse information). However, with its speech data collected from a call centre, the research focused on emotions in human-machine interactions rather than in natural human speech.

Another study, Kaushik, Sangwan & Hansen (2013) [3], provided an alternative source of speech data: YouTube videos. The authors proposed a system for automatic sentiment detection in natural audio streams such as those found on YouTube. The proposed technique uses Part of Speech (POS) tagging and Maximum Entropy (ME) modelling to develop a text-based sentiment detection model. Using decoded Automatic Speech Recognition (ASR) transcripts and the ME sentiment model, the proposed system is able to estimate sentiments in YouTube videos. Their results showed that it is possible to perform sentiment analysis on natural spontaneous speech data despite poor error rates. This study provided a systematic approach and proved that such audio sentiment analysis is possible. It did not, however, include enough acoustic features of audio speech, possibly due to the limitation of document-level analysis.

Ding et al. [4] proposed a holistic lexicon-based approach to the problem of insufficient acoustic features by exploiting external evidence and the linguistic conventions of natural language expressions. Inspired by the above work, a simplified sentiment score model is proposed in this project. The proposed method provides sentence-level audio speech analysis. The details of the method are explained in section III.

III. METHODOLOGY

[Figure 1. An overview of the methodology: speech data from YouTube videos are transcribed by Automatic Speech Recognition (ASR) software (.txt) and converted to audio (.wav); the sound editing software Praat produces sentence-level speech data; linguistic analysis (sentiment lexicons, negation analysis, emotionally salient words) yields the numbers of positive words, negative words and negators, combined by a formula into a sentiment score; acoustic analysis in Praat yields Low Level Descriptors (LLDs): intensity (amplitude, energy, power), pitch (average, maximum, standard deviation) and voice quality (HNR, jitter, shimmer); the linguistic features, the acoustic features, and both together are then passed to Naive Bayes, Neural Network, SVM and kNN classifiers to compare accuracies.]

3.1 DATA COLLECTION AND PROCESSING

The speech data used in the experiments are obtained from YouTube, a social media platform. The source is chosen because thousands of YouTube users share their personal opinions or reviews on their channels, so there is a huge amount of accessible speech data containing sentiment valence. More importantly, their ways of speaking are usually closest to natural, spontaneous human speech. Six videos are randomly selected from 40 YouTube videos that have a strong presence of negative or positive sentiment. Subject matters include: 1) Product Review; 2) Movie Review; 3) Political Opinion.

During the pre-processing stage, the videos are converted into .wav files. Speech transcriptions are generated using the Automatic Speech Recognition (ASR) software Speechmatics (https://www.speechmatics.com) and checked manually to increase reliability. Each sound file (.wav) is then edited in the vocal toolkit Praat (http://www.fon.hum.uva.nl/praat/). The TextGrid annotation (as shown in Figure 2) includes 2 tiers, transcription text and numbering, which are useful in keeping track of the data. Meanwhile, the sound file is segmented into smaller sections containing 1 to 5 sentences of related meaning and the same sentiment. Each segment is pre-assigned a sentiment label ('negative' or 'positive') based on the independent judgement of 3 experimenters, so as to minimise bias and subjective errors. There is a total of 150 sound segments (70 positive, 80 negative) in the training data set. The segmentation process is necessary as most opinion videos contain mixed sentiments.
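Once interval boundaries have been marked in a TextGrid, cutting the .wav file into labelled segments is mechanical. The paper performs this editing inside Praat; purely as an illustration of the same step, the hypothetical helper below slices a .wav between two timestamps using only the Python standard library (TextGrid parsing is not shown; the boundaries come from the manual annotation):

```python
import wave

def extract_segment(src_path: str, start_s: float, end_s: float, dst_path: str) -> int:
    """Copy the samples between start_s and end_s into a new .wav file.

    Returns the number of frames written. Illustrative sketch only:
    in the project this step is done interactively in Praat.
    """
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        start_frame = int(start_s * rate)
        n_frames = int((end_s - start_s) * rate)
        src.setpos(start_frame)           # seek to segment start
        frames = src.readframes(n_frames) # read the segment's samples
        params = src.getparams()
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)  # header frame count is patched on close
        dst.writeframes(frames)
    return n_frames
```

Each extracted file would then carry its pre-assigned 'positive'/'negative' label into the training set.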

Figure 2. Using Praat to annotate speech.

3.2 FEATURE EXTRACTION

3.2.1 LINGUISTIC FEATURES

The Natural Language Processing toolkit Orange3 Text Mining is used at this stage. Speech transcripts are transformed into lowercase, tokenised into words, and normalised using the WordNet Lemmatiser. A Part of Speech (POS) tagger is used to label each word as, for instance, a noun, a verb or an adjective, in order to preserve the linguistic function of each word in the sentence. The workflow and text preprocessor parameters are shown in Figures 3 and 4.

Figure 3. Text Mining workflow.

Figure 4. Text preprocessor parameters.

Negation cues are detected by looking words up against a list of explicit negation cues (compiled manually), as shown in Table I.

TABLE I. LIST OF NEGATION CUES
aint, arent, barely, cannot, cant, darent, didnt, doesnt, dont, doubt, few, hadnt, hardly, hasnt, havent, havnt, improbable, isnt, lack, lacked, lacking, lacks, little, mightnt, mustnt, neednt, neither, never, no, nobody, none, nor, not, nothing, nowhere, oughtnt, prevent, rarely, scarcely, seldom, shant, shouldnt, unlikely, wasnt, werent, without, wouldnt
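A cue lookup of this kind reduces to a set-membership test per token. A minimal sketch (with a shortened cue list for brevity; the full 47-item list is given in Table I):

```python
# Minimal negation-cue detector. The cue set below is a SUBSET of
# Table I, chosen for illustration only.
NEGATION_CUES = {
    "aint", "cannot", "cant", "didnt", "doesnt", "dont", "isnt",
    "neither", "never", "no", "nobody", "none", "nor", "not",
    "nothing", "without", "wouldnt",
}

def tokenize(text):
    """Lowercase and strip punctuation. Apostrophes are removed so
    that "can't" matches the cue spelling "cant" used in Table I."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return cleaned.split()

def count_negators(tokens):
    """Return the number of explicit negation cues in a tokenised segment."""
    return sum(1 for t in tokens if t in NEGATION_CUES)
```

For example, `count_negators(tokenize("I can't live without it"))` finds two cues, "cant" and "without".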

Textual feature extraction is done by filtering for the emotionally salient words (negatively connoted words and positively connoted words). Words in the training corpus are looked up against the Harvard General Inquirer and the Opinion Lexicon of Bing Liu [5] to decide whether they are negatively or positively connoted. The frequencies of negatively and positively connoted words in each segment are then counted respectively and the numerical values stored in the training data set.

Last but not least, a simplified sentiment score model is proposed to "integrate" all the linguistic features that provide emotion-related information. For every opinion segment s, a sentiment score S(s) is calculated using the formula below:

S(s) = pos(s) − neg(s) + 2 × ( Σ_w (−1)^num_pos(w) − Σ_w (−1)^num_neg(w) )

where pos(s) is the number of positive words in the opinion segment s; neg(s) is the number of negative words in the opinion segment s; num_pos(w) is the number of times each positive word w is negated; and, similarly, num_neg(w) is the number of times each negative word w is negated. The first sum runs over the positive words of the segment and the second over the negative words. Note that the exponents num_pos(w) and num_neg(w) are counted manually to give the most reliable values. The following are some advantages of this model.

1) The problem of multiple negation can be solved. When a word is negated twice (without loss of generality, suppose it is a positive word, as in "can't live without"), the formula will correctly give a positive value that signifies positive sentiment.
2) It allows semi-automated negation analysis and has the potential to be developed into a fully automated process.

3.2.2 ACOUSTIC FEATURES

Acoustic features of the sound segments were extracted manually using built-in functions in Praat, as shown in Figure 5. In order to achieve a more comprehensive representation of the sound, I chose a sufficiently wide range of acoustic features: intensity (amplitude, total energy, mean power), pitch (maximum pitch, average pitch, standard deviation, mean absolute slope), and voice quality (jitter, shimmer, mean harmonics-to-noise ratio). Considering the inherent differences in pitch between females and males, an attribute "gender" was included to normalise the data.

3.3 MACHINE LEARNING

In the Orange Canvas software [6], kNN, Neural Network, Naïve Bayes and SVM classifiers are used to evaluate the proposed method. Training data are sent to the appropriate classifier (Figure 6). The sentiment label (positive, negative) is selected as the target class and the rest of the features as attributes. Stratified 10-fold cross-validation is used to measure model performance: each time, the training data are split into ten folds and one of the ten folds is held out for testing. After multiple experiments, the optimal configuration for each classifier was determined and used in the machine learning process.

TABLE II. OPTIMAL CONFIGURATION FOR DIFFERENT CLASSIFIERS
k Nearest Neighbours: k = 74 (weighting by distances); Euclidean metric (normalize continuous attributes).
Naïve Bayes: Prior: Relative Frequency; Conditional: M-Estimate (parameter = 2.0); Size of LOESS window = 1.0; LOESS sample points = 11; Adjust threshold.
Neural Network: Hidden layer neurons = 11; Regularization factor = 1.0; Max iterations = 300; Normalize data.
Support Vector Machine: C-SVM (C = 1.00); Linear kernel, x·y; Numerical tolerance = 0.0010; Estimate class probabilities.
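The sentiment score model of section 3.2.1 can be sketched in a few lines. In this illustration the per-word negation counts stand in for the manually checked negation analysis described above:

```python
def sentiment_score(pos_negations, neg_negations):
    """Simplified sentiment score S(s) for one opinion segment.

    pos_negations: one entry per positive word in the segment, giving
        how many times that word is negated (num_pos(w)).
    neg_negations: the same for the negative words (num_neg(w)).
    A positive return value signifies positive sentiment.
    """
    pos = len(pos_negations)  # pos(s)
    neg = len(neg_negations)  # neg(s)
    # (-1)^n flips a word's polarity once per negation, so an even
    # number of negations (including zero) restores the original sign.
    correction = (sum((-1) ** n for n in pos_negations)
                  - sum((-1) ** n for n in neg_negations))
    return pos - neg + 2 * correction
```

For a doubly negated positive word ("can't live without"), `sentiment_score([2], [])` gives 3, correctly positive; for a once-negated negative word ("not bad"), `sentiment_score([], [1])` gives 1, also positive.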

Figure 5. Extracting acoustic features using Praat.
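The intensity descriptors above come from Praat's built-in functions. Purely as an illustration of what such a descriptor measures, root-mean-square amplitude can be computed directly from 16-bit PCM samples; this is a sketch, not Praat's exact algorithm (Praat applies windowing and calibration not reproduced here):

```python
import math
import struct

def rms_amplitude(pcm_bytes):
    """RMS amplitude of 16-bit little-endian mono PCM, normalised to [0, 1]."""
    n = len(pcm_bytes) // 2
    samples = struct.unpack("<%dh" % n, pcm_bytes[: 2 * n])
    if not samples:
        return 0.0
    mean_square = sum(s * s for s in samples) / n
    return math.sqrt(mean_square) / 32768.0

def rms_db(pcm_bytes, floor_db=-96.0):
    """RMS level in dB relative to full scale; floor_db is returned for silence."""
    r = rms_amplitude(pcm_bytes)
    return floor_db if r == 0.0 else 20.0 * math.log10(r)
```

A half-scale square wave, for instance, comes out at about −6 dBFS, which matches the familiar "halving amplitude loses 6 dB" rule of thumb.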

Figure 6. Illustration for machine learning workflow.
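The stratified 10-fold cross-validation described in section 3.3 is run inside Orange, but the splitting logic itself is simple enough to sketch. The helper below (an illustration, not Orange's implementation) deals each class's sample indices round-robin into k folds, so every fold keeps roughly the same positive/negative ratio as the full training set:

```python
from collections import defaultdict

def stratified_kfold(labels, k=10):
    """Split sample indices into k folds preserving class proportions.

    labels: one class label per sample (e.g. 'pos' / 'neg').
    Returns a list of k index lists; fold i serves as the test set in
    round i while the remaining folds are used for training.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for j, idx in enumerate(indices):
            folds[j % k].append(idx)  # deal round-robin across folds
    return folds
```

With the paper's 70 positive and 80 negative segments, each of the 10 folds receives exactly 7 positive and 8 negative samples.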

IV. RESULTS AND DISCUSSIONS

4.1 RESULTS ANALYSIS

The evaluation focuses on the Area Under the ROC Curve (AUC), as it has "better statistical foundations than most other measures" [7]. ROC area benchmarks [7]: 1.0: perfect prediction; 0.9: excellent prediction; 0.8: good prediction; 0.7: mediocre prediction; 0.6: poor prediction; 0.5: random prediction; <0.5: something wrong.

As shown in Table III, accuracy improves significantly when both acoustic and linguistic features are used, instead of only acoustic features or only linguistic features. When both acoustic and linguistic features are extracted, excellent classification of sentiments is achieved by each of the four classifiers, with kNN and Neural Network having higher accuracies. The shapes of the ROC curves for these four classifiers resemble the shape of the ROC curve for excellent prediction (Figures 7 and 8).
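AUC itself is easy to compute from raw scores and labels: it equals the probability that a randomly chosen positive segment is scored higher than a randomly chosen negative one, with ties counting half. A small sketch of that pairwise definition (not the implementation used by Orange):

```python
def auc(scores, labels):
    """Area under the ROC curve via the pairwise-comparison definition.

    scores: classifier scores, higher meaning "more positive";
    labels: 1 for positive segments, 0 for negative segments.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0    # positive ranked above negative
            elif p == n:
                wins += 0.5    # tie counts half
    return wins / (len(pos) * len(neg))
```

Perfectly separated scores give 1.0, matching the "perfect prediction" benchmark; indistinguishable scores give 0.5, the "random prediction" baseline.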

TABLE III. AUC WHEN DIFFERENT CLASSIFIERS & FEATURES ARE USED
Classifier | AUC (acoustic features only) | AUC (linguistic features only) | AUC (both acoustic & linguistic features)
kNN | 0.8750 | 0.8420 | 0.9321 (> 0.9)
Naïve Bayes | 0.7964 | 0.8348 | 0.8929 (≈ 0.9)
Neural Network | 0.9018 | 0.8384 | 0.9304 (> 0.9)
SVM | 0.8589 | 0.8607 | 0.9232 (> 0.9)

Figure 8. Shapes of ROC curves indicate different levels of accuracy.

4.2 LIMITATIONS AND SOURCES OF ERROR

The training data set might not be large enough. There has to be sufficient representation of the different sentiments and different speech types for more comprehensive learning, and hence more accurate recognition. Errors can occur in the Automatic Speech Recognition (ASR) step, even after the manual check. Negation analysis is subject to human error, and there might be inaccurate detection of the context-specific meanings of polysemes1.

4.3 INHERENT CHALLENGES

Sentiment analysis is a challenging task due to ambiguities in language, such as subtlety, concession, manipulation, sarcasm and irony in speech. To address this problem, an accurate conclusion might entail examination of other features such as physiological signals (blood pressure, etc.) and facial expressions. Although inaccuracies arising from ambiguities could be minimised by analysing data from multiple dimensions, cultural differences and multilinguality2 further complicate the process. Due to differences in cultural backgrounds, the ways people express their sentiments vary among individuals. (For example, the way a Japanese speaker

Figure 7. ROC curves for different classifiers when both acoustic and linguistic features are used.

1 A word or lexical unit that has several or multiple meanings.
2 Multilinguality is a characteristic of tasks that involve the use of more than one natural language. (Kay, n.d.)

expresses a sentiment differs from the way an American expresses the same sentiment.) Moreover, sentiment expressions depend on the context of speech, and hence vary even for the same person at different times. In addition, speakers might constantly change subject or compare with another subject, which might be hard to detect.

There are also issues with mutual interpretability. Interjections that express feelings (such as "urggghhh") might be deemed irrelevant by the machine. It might be hard, if not impossible, for the machine to "master" contextual knowledge such as some exophoric references3 to historical figures ("the German dictator", which refers to Hitler). The issue becomes more significant when dialects are used. For example, the negation analysis is based on Standard English usage, which might not be useful for other varieties of English. Speakers of certain dialects like African American Vernacular English (AAVE) usually employ double negatives to emphasise the negative meaning.

V. CONCLUSION AND FUTURE WORK

In this study, we have built a machine learning model combining acoustic and linguistic features. As the results have shown, this model has significantly higher accuracy than models with only acoustic or only linguistic features. Under this model, excellent prediction can be achieved if Neural Network or kNN is used as the classifier. Although the limitations and challenges are real and a considerable amount of manual work is necessary, the positive results of this study clearly point to the possibility of achieving fully automated audio sentiment analysis in the future.

Based on the limitations and challenges discussed in section IV, the following three main directions of research are proposed.
1. From audio sentiment analysis towards video sentiment analysis, by incorporating facial expression features, and further towards multi-dimensional sentiment analysis, by incorporating physiological features such as blood pressure and heart rate.
2. From semi-automatic sentiment analysis towards fully automatic sentiment analysis, by reducing the amount of manual processing of data.
3. From sentiment recognition towards emotion recognition, by enabling classification of specific emotions such as fear, anger, happiness and sadness.

ACKNOWLEDGMENT

I would like to express my sincere gratitude to my project supervisor, Professor Ng Yin Kwee, for being open-minded about my decision to change project topic, without which I could not have delved deep into my field of interest. Although his primary research is not in this area specifically, his insightful suggestions and unwavering support have guided me through doubts and difficulties.

REFERENCES

[1] Liu, Bing. (2015). Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Cambridge University Press.
[2] Chul Min Lee & Shrikanth S. Narayanan. (2005). Toward Detecting Emotions in Spoken Dialogs. IEEE Transactions on Speech and Audio Processing, Vol. 13, No. 2, March 2005.
[3] Kaushik, Lakshmish, Sangwan, Abhijeet & Hansen, John H. L. (2013). Sentiment Extraction from Natural Audio Streams. IEEE ICASSP 2013.
[4] Ding, Xiaowen, Bing Liu, and Philip S. Yu. (2008). A Holistic Lexicon-Based Approach to Opinion Mining. In Proceedings of the Conference on Web Search and Web Data Mining (WSDM-2008).
[5] Minqing Hu and Bing Liu. (2004). Mining and Summarizing Customer Reviews. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA.
[6] Demšar, J., Curk, T., & Erjavec, A. (2013). Orange: Data mining toolbox in Python. Journal of Machine Learning Research, 14, 2349−2353.
[7] Unknown. Cornell University. (2003). Retrieved from: https://www.cs.cornell.edu/courses/cs578/2003fa/performance_measures.pdf
[8] Tape, Thomas G. (n.d.). Interpreting Diagnostic Tests. University of Nebraska Medical Center. Retrieved from: http://gim.unmc.edu/dxtests/roc3.htm

3 Exophoric reference refers to a situation or entities outside the text. (University of Pennsylvania, 2006)