
Lexicons for Sentiment, Affect, and Connotation Affective meaning

Drawing on literatures in ◦ affective computing (Picard 95) ◦ linguistic subjectivity (Wiebe and colleagues) ◦ social psychology (Pennebaker and colleagues) Can we model the lexical semantics relevant to: ◦ sentiment ◦ emotion ◦ personality ◦ mood ◦ attitudes

Why compute affective meaning? Detecting: ◦ sentiment towards politicians, products, countries, ideas ◦ frustration of callers to a help line ◦ stress in drivers or pilots ◦ depression and other medical conditions ◦ confusion in students talking to e-tutors ◦ emotions in novels (e.g., for studying groups that are feared over time) Could we generate: ◦ emotions or moods for literacy tutors in the children’s storybook domain ◦ emotions or moods for computer games ◦ personalities for dialogue systems to match the user Connotation in the lexicon

Words have connotation as well as sense Can we build lexical resources that represent these connotations? And use them in these computational tasks?

Scherer Typology of Affective States


Emotion: brief organically synchronized … evaluation of a major event ◦ angry, sad, joyful, fearful, ashamed, proud, elated Mood: diffuse non-caused low-intensity long-duration change in subjective feeling ◦ cheerful, gloomy, irritable, listless, depressed, buoyant Interpersonal stances: affective stance toward another person in a specific interaction ◦ friendly, flirtatious, distant, cold, warm, supportive, contemptuous Attitudes: enduring, affectively colored beliefs, dispositions towards objects or persons ◦ liking, loving, hating, valuing, desiring Personality traits: stable personality dispositions and typical behavior tendencies ◦ nervous, anxious, reckless, morose, hostile, jealous What is a Lexicon?

A (usually hand-built) list of words that correspond to some meaning or class Possibly with numeric values Commonly used as simple classifiers, or as features to more complex classifiers

Lexicons for Sentiment, Affect, and Connotation: Emotion

Scherer’s typology of affective states Emotion: relatively brief episode of synchronized response of all or most organismic subsystems in response to the evaluation of an event as being of major significance ◦ angry, sad, joyful, fearful, ashamed, proud, desperate Mood: diffuse affect state …change in subjective feeling, of low intensity but relatively long duration, often without apparent cause ◦ cheerful, gloomy, irritable, listless, depressed, buoyant Interpersonal stance: affective stance taken toward another person in a specific interaction, coloring the interpersonal exchange ◦ distant, cold, warm, supportive, contemptuous Attitudes: relatively enduring, affectively colored beliefs, preferences, and predispositions towards objects or persons ◦ liking, loving, hating, valuing, desiring Personality traits: emotionally laden, stable personality dispositions and behavior tendencies, typical for a person ◦ nervous, anxious, reckless, morose, hostile, envious, jealous Two families of theories of emotion

Atomic basic emotions ◦ A finite list of 6 or 8, from which others are generated Dimensions of emotion ◦ Valence (positive, negative) ◦ Arousal (strong, weak) ◦ Control

Ekman’s 6 basic emotions: happiness, surprise, fear, sadness, anger, disgust

Ekman & Matsumoto 1989 Plutchik’s wheel of emotion

• 8 basic emotions • in four opposing pairs: • joy–sadness • anger–fear • trust–disgust • anticipation–surprise

Wikipedia Alternative: spatial model

An emotion is a point in 2- or 3-dimensional space ◦ valence: the pleasantness of the stimulus ◦ arousal: the intensity of emotion provoked by the stimulus ◦ (sometimes) dominance: the degree of control exerted by the stimulus Valence/Arousal Dimensions

[Figure: valence/arousal space. Horizontal axis: valence; vertical axis: arousal. Quadrants: high arousal/low pleasure (anger), high arousal/high pleasure (excitement), low arousal/low pleasure (sadness), low arousal/high pleasure.]

Simple sentiment lexicons The General Inquirer

Philip J. Stone, Dexter C. Dunphy, Marshall S. Smith, Daniel M. Ogilvie. 1966. The General Inquirer: A Computer Approach to Content Analysis. MIT Press.

Positiv (1915 words) Negativ (2291 words) MPQA Subjectivity Cues Lexicon

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann (2005). Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. Proc. of HLT-EMNLP-2005.

Riloff and Wiebe (2003). Learning extraction patterns for subjective expressions. EMNLP-2003.

6885 words Is a subjective word positive or negative? ◦ Strongly or weakly? http://mpqa.cs.pitt.edu/lexicons/ GNU GPL

type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative
type=weaksubj len=1 word1=abandonment pos1=noun stemmed1=n priorpolarity=negative
type=weaksubj len=1 word1=abandon pos1=verb stemmed1=y priorpolarity=negative
type=strongsubj len=1 word1=abase pos1=verb stemmed1=y priorpolarity=negative
type=strongsubj len=1 word1=abasement pos1=anypos stemmed1=y priorpolarity=negative
type=strongsubj len=1 word1=abash pos1=verb stemmed1=y priorpolarity=negative
type=weaksubj len=1 word1=abate pos1=verb stemmed1=y priorpolarity=negative
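A minimal sketch of reading these entries programmatically, assuming the one-entry-per-line format shown above (the filename follows the MPQA release but may differ in your copy):

```python
def parse_mpqa_line(line):
    """Parse one MPQA entry like
    'type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative'
    into a dict of its key=value fields."""
    return dict(field.split("=", 1) for field in line.split())

# Build a {word: (subjectivity strength, prior polarity)} lookup table.
lexicon = {}
with open("subjclueslen1-HLTEMNLP05.tff") as f:  # filename from the MPQA release
    for line in f:
        entry = parse_mpqa_line(line.strip())
        lexicon[entry["word1"]] = (entry["type"], entry["priorpolarity"])

print(lexicon.get("abase"))  # ('strongsubj', 'negative')
```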


Words with consistent sentiment across lexicons

work in the cognitive psychology of word meaning (Osgood et al., 1957). The General Inquirer has a lexicon of 1915 positive words and a lexicon of 2291 negative words (as well as other lexicons discussed below). The MPQA Subjectivity lexicon (Wilson et al., 2005) has 2718 positive and 4912 negative words drawn from prior lexicons plus a bootstrapped list of subjective words and phrases (Riloff and Wiebe, 2003). Each entry in the lexicon is hand-labeled for sentiment and also labeled for reliability (strongly subjective or weakly subjective). The polarity lexicon of Hu and Liu (2004) gives 2006 positive and 4783 negative words, drawn from product reviews, labeled using a bootstrapping method from WordNet.

Positive admire, amazing, assure, celebration, charm, eager, enthusiastic, excellent, fancy, fantastic, frolic, graceful, happy, joy, luck, majesty, mercy, nice, patience, perfect, proud, rejoice, relief, respect, satisfactorily, sensational, super, terrific, thank, vivid, wise, wonderful, zest
Negative abominable, anger, anxious, bad, catastrophe, cheap, complaint, condescending, deceit, defective, disappointment, embarrass, fake, fear, filthy, fool, guilt, hate, idiot, inflict, lazy, miserable, mourn, nervous, objection, pest, plot, reject, scream, silly, terrible, unfriendly, vile, wicked
Figure 20.3 Some words with consistent sentiment across the General Inquirer (Stone et al., 1966), the MPQA Subjectivity lexicon (Wilson et al., 2005), and the polarity lexicon of Hu and Liu (2004).

Slightly more general than these sentiment lexicons are lexicons that assign each word a value on all three affective dimensions. The NRC Valence, Arousal, and Dominance (VAD) lexicon (Mohammad, 2018a) assigns valence, arousal, and dominance scores to 20,000 words. Some examples are shown in Fig. 20.4.

Valence            Arousal            Dominance
delightful   .918  enraged     .962   powerful     .991
vacation     .840  party       .840   authority    .935
whistle      .653  organized   .337   saxophone    .482
consolation  .408  effortless  .120   discouraged  .0090
torture      .115  napping     .046   weak         .045
Figure 20.4 Values of sample words on the emotional dimensions of Mohammad (2018a).

EmoLex The NRC Word-Emotion Association Lexicon, also called EmoLex (Mohammad and Turney, 2013), uses the Plutchik (1980) 8 basic emotions defined above. The lexicon includes around 14,000 words including words from prior lexicons as well as frequent nouns, verbs, adverbs and adjectives. Values from the lexicon for some sample words:

Word        anger  anticipation  disgust  fear  joy  sadness  surprise  trust  positive  negative
reward        0         1           0      0     1      0        1        1       1         0
worry         0         1           0      1     0      1        0        0       0         1
tenderness    0         0           0      0     1      0        0        0       1         0
sweetheart    0         1           0      0     1      1        0        1       1         0
suddenly      0         0           0      0     0      0        1        0       0         0
thirst        0         1           0      0     0      1        1        0       0         0
garbage       0         0           1      0     0      0        0        0       0         1

Let’s look at two emotion lexicons!

1. 8 basic emotions: ◦ NRC Word-Emotion Association Lexicon (Mohammad and Turney 2011) 2. Dimensions of valence/arousal/dominance ◦ NRC Valence-Arousal-Dominance Lexicon (Mohammad 2018) Plutchik’s wheel of emotion

• 8 basic emotions • in four opposing pairs: • joy–sadness • anger–fear • trust–disgust • anticipation–surprise

Wikipedia NRC Word-Emotion Association Lexicon Mohammad and Turney 2011

amazingly anger 0
amazingly anticipation 0
amazingly disgust 0
amazingly fear 0
amazingly joy 1
amazingly sadness 0
amazingly surprise 1
amazingly trust 0
amazingly negative 0
amazingly positive 1
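Such a word–emotion–flag file can be used as a simple counting classifier. A minimal sketch, assuming tab-separated lines in the layout shown above (the filename is from the NRC distribution and may vary):

```python
from collections import Counter, defaultdict

# Load EmoLex: word <TAB> emotion <TAB> 0/1, as in the entries above.
emolex = defaultdict(set)
with open("NRC-Emotion-Lexicon-Wordlevel-v0.92.txt") as f:
    for line in f:
        word, emotion, flag = line.strip().split("\t")
        if flag == "1":
            emolex[word].add(emotion)

def emotion_counts(text):
    """Count how many tokens in the text are associated with each emotion/polarity."""
    counts = Counter()
    for token in text.lower().split():
        counts.update(emolex.get(token.strip(".,!?"), ()))
    return counts

print(emotion_counts("the reward was amazingly good"))
# From the sample entries above, e.g.:
# Counter({'joy': 2, 'surprise': 2, 'positive': 2, 'anticipation': 1, 'trust': 1, ...})
```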

Lexicons for Sentiment, Affect, and Connotation: Other Useful Lexicons

For a smaller set of 5,814 words, the NRC Emotion/Affect Intensity Lexicon (Mohammad, 2018b) contains real-valued scores of association for anger, fear, joy, and sadness; Fig. 20.5 shows examples.

Anger              Fear               Joy              Sadness
outraged    0.964  horror      0.923  superb    0.864  sad           0.844
violence    0.742  anguish     0.703  cheered   0.773  guilt         0.750
coup        0.578  pestilence  0.625  rainbow   0.531  unkind        0.547
oust        0.484  stressed    0.531  gesture   0.387  difficulties  0.421
suspicious  0.484  failing     0.531  warms     0.391  beggar        0.422
nurture     0.059  confident   0.094  hardship  0.031  sing          0.017
Figure 20.5 Sample emotional intensities for words for anger, fear, joy, and sadness from Mohammad (2018b).

LIWC LIWC, Linguistic Inquiry and Word Count, is a widely used set of 73 lexicons containing over 2300 words (Pennebaker et al., 2007), designed to capture aspects of lexical meaning relevant for social psychological tasks. In addition to sentiment-related lexicons like ones for negative emotion (bad, weird, hate, problem, tough) and positive emotion (love, nice, sweet), LIWC includes lexicons for categories like anger, sadness, cognitive mechanisms, perception, tentative, and inhibition, shown in Fig. 20.6.

LIWC: Linguistic Inquiry and Word Count

Positive Emotion  Negative Emotion  Insight   Inhibition  Family     Negate
appreciat*        anger*            aware*    avoid*      brother*   aren’t
comfort*          bore*             believe   careful*    cousin*    cannot
great             cry               decid*    hesitat*    daughter*  didn’t
happy             despair*          feel      limit*      family     neither
interest          fail*             figur*    oppos*      father*    never
joy*              fear              know      prevent*    grandf*    no
perfect*          griev*            knew      reluctan*   grandm*    nobod*
please*           hate*             means     safe*       husband    none
safe*             panic*            notice*   stop        mom        nor
terrific          suffers           recogni*  stubborn*   mother     nothing
value             terrify           sense     wait        niece*     nowhere
wow*              violent*          think     wary        wife       without
Figure 20.6 Samples from 5 of the 73 lexical categories in LIWC (Pennebaker et al., 2007). The * means the previous letters are a word prefix and all words with that prefix are included in the category.
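Because LIWC categories mix literal words with * prefix patterns, lookup is a prefix test rather than exact matching. A small sketch (category entries abridged from Fig. 20.6):

```python
def make_matcher(patterns):
    """Compile a LIWC-style category: 'appreciat*' matches any word with
    that prefix; entries without '*' must match exactly."""
    prefixes = tuple(p[:-1] for p in patterns if p.endswith("*"))
    exact = {p for p in patterns if not p.endswith("*")}
    return lambda w: w in exact or w.startswith(prefixes)

posemo = make_matcher(["appreciat*", "comfort*", "great", "happy", "joy*"])
print([w for w in "i greatly appreciated the joyful comfort".split() if posemo(w)])
# ['appreciated', 'joyful', 'comfort']  ('greatly' does not match: 'great' has no *)
```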

There are various other hand-built affective lexicons. The General Inquirer includes additional lexicons for dimensions like strong vs. weak, active vs. passive, overstated vs. understated, as well as lexicons for categories like pleasure, pain, virtue, vice, motivation, and cognitive orientation. Another useful feature for various tasks is the distinction between concrete words like banana or bathrobe and abstract words like belief and although. The lexicon in Brysbaert et al. (2014) used crowdsourcing to assign a rating from 1 to 5 of the concreteness of 40,000 words, thus assigning banana, bathrobe, and bagel 5, belief 1.19, although 1.07, and in between words like brisk a 2.5.

LIWC (Linguistic Inquiry and Word Count)

Pennebaker, J.W., Booth, R.J., & Francis, M.E. (2007). Linguistic Inquiry and Word Count: LIWC 2007. Austin, TX http://www.liwc.net/ 2300 words >70 classes The General Inquirer

Philip J. Stone, Dexter C. Dunphy, Marshall S. Smith, Daniel M. Ogilvie. 1966. The General Inquirer: A Computer Approach to Content Analysis. MIT Press ◦ Home page: http://www.wjh.harvard.edu/~inquirer ◦ List of Categories: http://www.wjh.harvard.edu/~inquirer/homecat.htm ◦ Spreadsheet: http://www.wjh.harvard.edu/~inquirer/inquirerbasic.xls Categories: ◦ Positiv (1915 words) and Negativ (2291 words) ◦ Strong vs Weak, Active vs Passive, Overstated versus Understated ◦ Pleasure, Pain, Virtue, Vice, Motivation, Cognitive Orientation, etc. Free for Research Use Concreteness versus abstractness

The degree to which the concept denoted by a word refers to a perceptible entity. Brysbaert, M., Warriner, A. B., and Kuperman, V. (2014) Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods 46, 904-911. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. 37,058 English words and 2,896 two-word expressions (e.g., “zebra crossing” and “zoom in”). Rating from 1 (abstract) to 5 (concrete) Concreteness versus abstractness

Brysbaert, M., Warriner, A. B., and Kuperman, V. (2014) Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods 46, 904-911. Some example ratings from the final dataset of 40,000 words and phrases: banana 5, bathrobe 5, bagel 5, brisk 2.5, badass 2.5, basically 1.32, belief 1.19, although 1.07

Lexicons for Sentiment, Affect, and Connotation: Building Lexicons using Human Labelers

Where do lexicons come from?

• One method: crowdsourcing!!! • 10,000 words • Collected from earlier lexicons • Labeled by workers on Amazon Mechanical Turk • “Turkers” • 5 Turkers per hit The AMT Hit

… NRC Valence, Arousal, Dominance (VAD) lexicon Mohammad (2018)

20,000 words, 3 emotional dimensions: ◦ valence (the pleasantness of the stimulus) ◦ arousal (the intensity of emotion provoked by the stimulus) ◦ dominance (the degree of control exerted by the stimulus) Best-worst scaling: valence Q1. Which of the four words below is associated with the MOST happiness / pleasure / positiveness / satisfaction / contentedness / hopefulness OR LEAST unhappiness / annoyance / negativeness / dissatisfaction / melancholy / despair? vacation, consolation, whistle, torture

Q2. Which of the four words below is associated with the LEAST happiness / pleasure / positiveness / satisfaction / contentedness / hopefulness OR MOST unhappiness / annoyance / negativeness / dissatisfaction / melancholy / despair? Lexicon of valence, arousal, and dominance

Valence            Arousal            Dominance
delightful   .918  enraged     .962   powerful     .991
vacation     .840  party       .840   authority    .935
whistle      .653  organized   .337   saxophone    .482
consolation  .408  effortless  .120   discouraged  .0090
torture      .115  napping     .046   weak         .045
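Real-valued scores like these are commonly derived from best-worst annotations with the standard counting procedure in this literature: score(w) = %best(w) − %worst(w), then rescaled (e.g., to [0, 1]). A sketch of that aggregation, assuming annotations arrive as (four-word list, best, worst) tuples:

```python
from collections import Counter

def best_worst_scores(annotations):
    """annotations: iterable of (four_words, best, worst) tuples from
    best-worst scaling HITs. Returns score(w) = (#best - #worst) / #appearances,
    a value in [-1, 1] that can be linearly rescaled to [0, 1]."""
    best, worst, seen = Counter(), Counter(), Counter()
    for words, b, w in annotations:
        seen.update(words)
        best[b] += 1
        worst[w] += 1
    return {w: (best[w] - worst[w]) / seen[w] for w in seen}

anns = [(["vacation", "consolation", "whistle", "torture"], "vacation", "torture"),
        (["vacation", "consolation", "whistle", "torture"], "vacation", "torture")]
print(best_worst_scores(anns))
# {'vacation': 1.0, 'consolation': 0.0, 'whistle': 0.0, 'torture': -1.0}
```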

Issues to keep in mind with crowdsourcing lexicons

Native (or very fluent) speakers Making the task clear for non-linguists or non-computer-scientists Paying minimum wage (fairwork.stanford.edu)

Lexicons for Sentiment, Affect, and Connotation: Semi-supervised Induction of Affect Lexicons

Semantic Axis Methods (An et al. 2018, Turney and Littman 2003)

Start with seed words like good or bad for the two poles For each word to be added to lexicon • Compute a word representation • Use this to measure its distance from the poles • Assign it to the pole it is closer to

20.4 Semi-supervised Induction of Affect Lexicons

Another common way to learn sentiment lexicons is to start from a set of seed words that define two poles of a semantic axis (words like good or bad), and then find ways to label each word w by its similarity to the two seed sets. Here we summarize two families of seed-based semi-supervised lexicon induction algorithms, axis-based and graph-based.

20.4.1 Semantic Axis Methods

One of the most well-known lexicon induction methods, the Turney and Littman (2003) algorithm, is given seed words like good or bad, and then for each word w to be labeled, measures both how similar it is to good and how different it is from bad. Here we describe a slight extension of the algorithm due to An et al. (2018), which is based on computing a semantic axis.

Initial seeds for different domains

In the first step, we choose seed words by hand. There are two methods for dealing with the fact that the affect of a word is different in different contexts: (1) start with a single large seed lexicon and rely on the induction algorithm to fine-tune it to the domain, or (2) choose different seed words for different genres. Hellrich et al. (2019) suggests that for modeling affect across different historical time periods, starting with a large modern affect dictionary is better than small seedsets tuned to be stable across time. As an example of the second approach, Hamilton et al. (2016) define one set of seed words for general sentiment analysis, a different set for Twitter, and yet another set for sentiment in financial text:

Domain   Positive seeds                             Negative seeds
General  good, lovely, excellent, fortunate,        bad, horrible, poor, unfortunate,
         pleasant, delightful, perfect, loved,      unpleasant, disgusting, evil, hated,
         love, happy                                hate, unhappy
Twitter  love, loved, loves, awesome, nice,         hate, hated, hates, terrible, nasty,
         amazing, best, fantastic, correct, happy   awful, worst, horrible, wrong, sad
Finance  successful, excellent, profit,             negligent, loss, volatile, wrong,
         beneficial, improving, improved,           losses, damages, bad, litigation,
         success, gains, positive                   failure, down, negative

In the second step, we compute embeddings for each of the pole words. These embeddings can be off-the-shelf word2vec embeddings, or can be computed directly on a specific corpus (for example using a financial corpus if a finance lexicon is the goal), or we can fine-tune off-the-shelf embeddings to a corpus. Fine-tuning is especially important if we have a very specific genre of text but don’t have enough data to train good embeddings. In fine-tuning, we begin with off-the-shelf embeddings like word2vec, and continue training them on the small target corpus. Once we have embeddings for each pole word, we create an embedding that represents each pole by taking the centroid of the embeddings of each of the seed words; recall that the centroid is the multidimensional version of the mean. Given a set of embeddings for the positive seed words $S^{+} = \{E(w_1^{+}), E(w_2^{+}), \ldots, E(w_n^{+})\}$, and embeddings for the negative seed words $S^{-} = \{E(w_1^{-}), E(w_2^{-}), \ldots, E(w_m^{-})\}$, the pole centroids are computed as in Eq. 20.1 below.

Compute representation

Can just use off-the-shelf static embeddings • word2vec, GloVe, etc. Or compute on a corpus Or fine-tune pre-trained embeddings to a corpus



Represent each pole

Start with embeddings for seed words. The pole centroids are:

$$V^{+} = \frac{1}{n}\sum_{i=1}^{n} E(w_i^{+}), \qquad V^{-} = \frac{1}{m}\sum_{i=1}^{m} E(w_i^{-}) \qquad (20.1)$$

The semantic axis defined by the poles is computed just by subtracting the two vectors:

$$V_{\mathrm{axis}} = V^{+} - V^{-} \qquad (20.2)$$

$V_{\mathrm{axis}}$, the semantic axis, is a vector in the direction of positive sentiment. Finally, we compute (via cosine similarity) the angle between the vector in the direction of positive sentiment and the direction of w’s embedding; a higher cosine means that w is more aligned with $S^{+}$ than $S^{-}$. Word score is cosine with the axis:

$$\mathrm{score}(w) = \cos\big(E(w), V_{\mathrm{axis}}\big) = \frac{E(w) \cdot V_{\mathrm{axis}}}{\lVert E(w)\rVert\,\lVert V_{\mathrm{axis}}\rVert} \qquad (20.3)$$

If a dictionary of words with sentiment scores is sufficient, we’re done! Or if we need to group words into a positive and a negative lexicon, we can use a threshold or other method to give us discrete lexicons.
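A compact sketch of Eqs. 20.1–20.3 using numpy; `embeddings` is assumed to be a {word: vector} mapping from any pretrained model (word2vec, GloVe, etc.):

```python
import numpy as np

def semantic_axis_scores(embeddings, pos_seeds, neg_seeds, vocab):
    """Score each word in vocab by cosine with the axis V+ - V- (Eqs. 20.1-20.3)."""
    v_pos = np.mean([embeddings[w] for w in pos_seeds], axis=0)  # Eq. 20.1
    v_neg = np.mean([embeddings[w] for w in neg_seeds], axis=0)
    axis = v_pos - v_neg                                         # Eq. 20.2
    axis /= np.linalg.norm(axis)
    scores = {}
    for w in vocab:
        e = embeddings[w]
        scores[w] = float(e @ axis / np.linalg.norm(e))          # Eq. 20.3
    return scores

# Usage: threshold the scores (e.g. > 0) to induce a discrete lexicon.
```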
Label Propagation Methods

Alternative to axis methods: propagate sentiment labels on word graphs First proposed by Hatzivassiloglou and McKeown (1997) Let’s see the method of Hamilton et al. (2016)


20.4.2 Label Propagation

An alternative family of methods defines lexicons by propagating sentiment labels on graphs, an idea suggested in early work by Hatzivassiloglou and McKeown (1997). We’ll describe the simple SentProp (Sentiment Propagation) algorithm of Hamilton et al. (2016), which has four steps:

1. Define a graph: Given word embeddings, build a weighted lexical graph by connecting each word with its k nearest neighbors (according to cosine similarity). The weights of the edge between words $w_i$ and $w_j$ are set as:

$$E_{i,j} = \arccos\!\left(-\frac{w_i^{\top} w_j}{\lVert w_i\rVert\,\lVert w_j\rVert}\right) \qquad (20.4)$$

2. Define a seed set: Choose positive and negative seed words.

3. Propagate polarities from the seed set: Now we perform a random walk on this graph, starting at the seed set. In a random walk, we start at a node and then choose a node to move to with probability proportional to the edge probability. A word’s polarity score for a seed set is proportional to the probability of a random walk from the seed set landing on that word (Fig. 20.7).

4. Create word scores: We walk from both positive and negative seed sets, resulting in positive ($\mathrm{rawscore}^{+}(w_i)$) and negative ($\mathrm{rawscore}^{-}(w_i)$) raw label scores. We then combine these values into a positive-polarity score as:

$$\mathrm{score}^{+}(w_i) = \frac{\mathrm{rawscore}^{+}(w_i)}{\mathrm{rawscore}^{+}(w_i) + \mathrm{rawscore}^{-}(w_i)} \qquad (20.5)$$

It’s often helpful to standardize the scores to have zero mean and unit variance within a corpus.

Label Propagation (Hamilton et al., 2016 version)
1. Define a graph: connecting each word with its k nearest neighbors
2. Define a seed set (pos and neg words): love, hate, etc.

5. Assign confidence to each score: Because sentiment scores are influenced by the seed set, we’d like to know how much the score of a word would change if a different seed set is used. We can use bootstrap sampling to get confidence regions, by computing the propagation B times over random subsets of the positive and negative seed sets (for example using B = 50 and choosing 7 of the 10 seed words each time). The standard deviation of the bootstrap sampled polarity scores gives a confidence measure.

[Figure 20.7 Intuition of the SENTPROP algorithm. (a) Run random walks from the seed words. (b) Assign polarity scores (shown here as colors green or red) based on the frequency of random walk visits.]

Label Propagation (Hamilton et al., 2016 version)
3. Propagate polarities from the seed set: randomly walk on the graph • Polarity score is proportional to probability of random walk landing on word

20.4.3 Other Methods The core of semisupervised algorithms is the metric for measuring similarity with the seed words. The Turney and Littman (2003) and Hamilton et al. (2016) approaches above used embedding cosine as the distance metric: words were labeled as positive basically if their embeddings had high cosines with positive seeds and low cosines with negative seeds. Other methods have chosen other kinds of distance metrics besides embedding cosine. For example the Hatzivassiloglou and McKeown (1997) algorithm uses syntactic cues; two adjectives are considered similar if they were frequently conjoined by and and rarely conjoined by but. This is based on the intuition that adjectives conjoined by the word and tend to have the same polarity; positive adjectives are generally coordinated with positive, negative with negative: fair and legitimate, corrupt and brutal but less often positive adjectives coordinated with negative: *fair and brutal, *corrupt and legitimate By contrast, adjectives conjoined by but are likely to be of opposite polarity: fair but brutal Another cue to opposite polarity comes from morphological negation (un-, im-, -less). Adjectives with the same root but differing in a morphological negative (adequate/inadequate, thoughtful/thoughtless) tend to be of opposite polarity. Yet another method for finding words that have a similar polarity to seed words is to make use of a thesaurus like WordNet (Kim and Hovy 2004, Hu and Liu 2004). A word’s synonyms presumably share its polarity while a word’s antonyms probably have the opposite polarity. After a seed lexicon is built, each lexicon is updated as follows, possibly iterated. Lex+: Add synonyms of positive words (well) and antonyms (like fine) of negative words
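A sketch of one round of this thesaurus-based expansion using NLTK's WordNet interface (in the style of Kim and Hovy 2004 and Hu and Liu 2004, not their exact procedure):

```python
from nltk.corpus import wordnet as wn

def synonyms_and_antonyms(word):
    """Collect WordNet synonyms and antonyms for all senses of a word."""
    syns, ants = set(), set()
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            syns.add(lemma.name())
            ants.update(a.name() for a in lemma.antonyms())
    return syns, ants

def expand_once(pos_lex, neg_lex):
    """One iteration: synonyms keep polarity, antonyms flip it."""
    new_pos, new_neg = set(pos_lex), set(neg_lex)
    for w in pos_lex:
        syns, ants = synonyms_and_antonyms(w)
        new_pos |= syns   # synonyms of positive words stay positive
        new_neg |= ants   # antonyms of positive words become negative
    for w in neg_lex:
        syns, ants = synonyms_and_antonyms(w)
        new_neg |= syns
        new_pos |= ants
    return new_pos, new_neg

pos, neg = expand_once({"good", "fine"}, {"bad"})  # iterate until convergence
```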


Label Propagation (Hamilton et al., 2016 version)
4. Create word scores: • Walking from positive and negative seed sets • Gives $\mathrm{rawscore}^{+}(w_i)$ and $\mathrm{rawscore}^{-}(w_i)$ • Combine into one score:

$$\mathrm{score}^{+}(w_i) = \frac{\mathrm{rawscore}^{+}(w_i)}{\mathrm{rawscore}^{+}(w_i) + \mathrm{rawscore}^{-}(w_i)} \qquad (20.5)$$

5. Assign confidence to each score: Because sentiment scores are influenced by the seed set, we’d like to know how much the score of a word would change if a different seed set is used. We can use bootstrap sampling to get confidence regions, by computing the propagation B times over random subsets of the positive and negative seed sets (for example using B = 50 and choosing 7 of the 10 seed words each time). The standard deviation of the bootstrap sampled Label Propagationpolarity scores gives (Hamilton a confidence measure. et al., 2016 version)

loathe loathe like like abhor abhor find find love idolize hate love idolize hate dislike dislike see uncover see uncover adore despise adore despise disapprove disapprove appreciate notice appreciate notice (a) (b) Figure 20.7 Intuition of the SENTPROP algorithm. (a) Run random walks from the seed words. (b) Assign polarity5. Assign scores (shown here as colors green via or red) bootstrap based on the frequency sampling: of random walk visits. • Compute the propagation B times over random subsets of the positive and negative seed sets 20.4.3 Other Methods • The standard deviation of the bootstrap sampled The core of semisupervised algorithms is the metric for measuring similarity with polaritythe seed scores words. gives The Turney a confidence and Littman (2003) measure.and Hamilton et al. (2016) ap- proaches above used embedding cosine as the distance metric: words were labeled as positive basically if their embeddings had high cosines with positive seeds and low cosines with negative seeds. Other methods have chosen other kinds of distance metrics besides embedding cosine. For example the Hatzivassiloglou and McKeown (1997) algorithm uses syntactic cues; two adjectives are considered similar if they were frequently conjoined by and and rarely conjoined by but. This is based on the intuition that adjectives conjoined by the words and tend to have the same polarity; positive adjectives are generally coordinated with positive, negative with negative: fair and legitimate, corrupt and brutal but less often positive adjectives coordinated with negative: *fair and brutal, *corrupt and legitimate By contrast, adjectives conjoined by but are likely to be of opposite polarity: fair but brutal Another cue to opposite polarity comes from morphological negation (un-, im-, -less). Adjectives with the same root but differing in a morphological negative (ad- equate/inadequate, thoughtful/thoughtless) tend to be of opposite polarity. Yet another method for finding words that have a similar polarity to seed words is to make use of a thesaurus like WordNet (Kim and Hovy 2004, Hu and Liu 2004). A word’s synonyms presumably share its polarity while a word’s antonyms probably have the opposite polarity. After a seed lexicon is built, each lexicon is updated as follows, possibly iterated. Lex+: Add synonyms of positive words (well) and antonyms (like fine) of negative words Other metrics besides cosine: Vasileios Hatzivassiloglou and Kathleen R. McKeown. 1997. Predicting the Semantic Orientation of Adjectives. ACL, 174–181
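A simplified sketch of the propagation steps: a power-iteration random walk with restart at the seed set stands in for the full SentProp machinery. The k-NN graph construction is elided, and the row-stochastic adjacency matrix `A` is assumed given:

```python
import numpy as np

def propagate(A, seeds, beta=0.9, iters=50):
    """Random walk with restart. A is a row-stochastic adjacency matrix over
    the vocabulary; seeds is a list of seed-word indices. Returns the
    probability that a walk restarting at the seed set lands on each word."""
    n = A.shape[0]
    s = np.zeros(n)
    s[seeds] = 1.0 / len(seeds)          # restart distribution = seed set
    p = s.copy()
    for _ in range(iters):
        p = beta * (A.T @ p) + (1 - beta) * s
    return p

def sentprop_scores(A, pos_seeds, neg_seeds):
    raw_pos = propagate(A, pos_seeds)                 # rawscore+
    raw_neg = propagate(A, neg_seeds)                 # rawscore-
    score = raw_pos / (raw_pos + raw_neg + 1e-12)     # Eq. 20.5
    return (score - score.mean()) / score.std()       # standardize, as suggested
```

Bootstrap confidence (step 5) would simply call `sentprop_scores` B times on random subsets of the seeds and take the standard deviation of the results.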

Other metrics besides cosine: Vasileios Hatzivassiloglou and Kathleen R. McKeown. 1997. Predicting the Semantic Orientation of Adjectives. ACL, 174–181

Adjectives conjoined by “and” have same polarity ◦ fair and legitimate, corrupt and brutal ◦ *fair and brutal, *corrupt and legitimate Adjectives conjoined by “but” do not ◦ fair but brutal

Lexicons for Sentiment, Affect, and Connotation: Supervised Learning of Word Sentiment

Learn word sentiment supervised by online review scores Potts, Christopher. 2011. On the negativity of negation. SALT 20, 636-659. Potts 2011 NSF Workshop talk. Review datasets ◦ IMDB, Goodreads, Open Table, Amazon, Trip Advisor Each review has a score (1-5, 1-10, etc.) Just count how many times each word occurs with each score ◦ (and normalize)

Online review data

Movie review excerpts (IMDb)
10 A great movie. This film is just a wonderful experience. It’s surreal, zany, witty and slapstick all at the same time. And terrific performances too.
1 This was probably the worst movie I have ever seen. The story went nowhere even though they could have done some interesting stuff with it.
Restaurant review excerpts (Yelp)
5 The service was impeccable. The food was cooked and seasoned perfectly... The watermelon was perfectly square... The grilled octopus was... mouthwatering...
2 ...it took a while to get our waters, we got our entree before our starter, and we never received silverware or napkins until we requested them...
Book review excerpts (GoodReads)
1 I am going to try and stop being deceived by eye-catching titles. I so wanted to like this book and was so disappointed by it.
5 This book is hilarious. I would recommend it to anyone looking for a satirical read with a romantic twist and a narrator that keeps butting in
Product review excerpts (Amazon)
5 The lid on this blender though is probably what I like the best about it... enables you to pour into something without even taking the lid off! ... the perfect pitcher! ... works fantastic.
1 I hate this blender... It is nearly impossible to get frozen fruit and ice to turn into a smoothie... You have to add a TON of liquid. I also wish it had a spout...
Figure 20.9 Excerpts from some reviews from various review websites, all on a scale of 1 to 5 stars except IMDb, which is on a scale of 1 to 10 stars.

ber of words occurring in 1-star reviews (25,395,214), so the IMDb estimate of P(disappointing|1) is .0003. A slight modification of this weighting, the normalized likelihood, can be used as an illuminating visualization (Potts, 2011)¹:

$$P(w|c) = \frac{\mathrm{count}(w,c)}{\sum_{w \in C} \mathrm{count}(w,c)}$$

$$\mathrm{PottsScore}(w) = \frac{P(w|c)}{\sum_{c} P(w|c)} \qquad (20.6)$$

Dividing the IMDb estimate P(disappointing|1) of .0003 by the sum of the likelihood P(w|c) over all categories gives a Potts score of 0.10. The word disappointing thus is associated with the vector [.10, .12, .14, .14, .13, .11, .08, .06, .06, .05]. The Potts diagram (Potts, 2011) is a visualization of these word scores, representing the prior sentiment of a word as a distribution over the rating categories. Fig. 20.10 shows the Potts diagrams for 3 positive and 3 negative scalar adjectives. Note that the curves for strongly positive scalars have the shape of the letter J, while strongly negative scalars look like a reverse J. By contrast, weakly positive and negative scalars have a hump-shape, with the maximum either below the mean (weakly negative words like disappointing) or above the mean (weakly positive words like good). These shapes offer an illuminating typology of affective meaning. Fig. 20.11 shows the Potts diagrams for emphasizing and attenuating adverbs. Note that emphatics tend to have a J-shape (most likely to occur in the most positive reviews) or a U-shape (most likely to occur in the strongly positive and negative). Attenuators all have the hump-shape, emphasizing the middle of the scale.

¹ Potts shows that the normalized likelihood is an estimate of the posterior P(c|w) if we make the incorrect but simplifying assumption that all categories c have equal probability.

Analyzing the polarity of each word in IMDB Potts, Christopher. 2011. On the negativity of negation. SALT 20, 636-659. How likely is each word to appear in each sentiment class? Count(“bad”) in 1-star, 2-star, 3-star, etc. But can’t use raw counts. Instead, likelihood: $P(w|c) = \frac{f(w,c)}{\sum_{w \in c} f(w,c)}$ Make them comparable between words ◦ Scaled likelihood: $\frac{P(w|c)}{P(w)}$
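A sketch of computing the normalized likelihood of Eq. 20.6 from a count table; `counts[c][w]` is assumed to hold the frequency of word w in reviews with rating category c:

```python
def potts_score(word, counts):
    """counts: {category: {word: freq}} over rating categories (e.g. 1..10 stars).
    Returns the Potts score vector: P(w|c) renormalized over categories (Eq. 20.6)."""
    likelihood = {}
    for c, freqs in counts.items():
        total = sum(freqs.values())              # all tokens in category c
        likelihood[c] = freqs.get(word, 0) / total
    z = sum(likelihood.values())
    return {c: p / z for c, p in likelihood.items()}

# A word like 'disappointing' should put most of its mass on the low-star
# categories, yielding the reverse-J / hump shapes of the Potts diagrams.
```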

Example: attenuators

somewhat/r

IMDB – 53,775 tokens OpenTable – 3,890 tokens Goodreads – 3,424 tokens Amazon/Tripadvisor – 2,060 tokens

Cat = 0.33 (p = 0.004) Cat = 0.11 (p = 0.707) Cat = -0.55 (p = 0.128) Cat = 0.42 (p = 0.207) Cat^2 = -4.02 (p < 0.001) Cat^2 = -6.2 (p = 0.014) Cat^2 = -5.04 (p = 0.016) Cat^2 = -2.74 (p = 0.05) 0.38 0.36

0.28

0.15 0.19 0.09 0.12 0.08 0.05 0.08 0.06 0.17 0.28 0.39 0.50 0.00 0.25 0.50 0.00 0.25 0.50 0.00 0.25 0.50 -0.50 -0.39 -0.28 -0.17 -0.06 -0.50 -0.25 -0.50 -0.25 -0.50 -0.25 Category Category Category Category

"Potts diagrams" (Potts, Christopher. 2011. NSF workshop on restructuring adjectives.)

[Potts diagrams for four word classes across the IMDB, OpenTable, Goodreads, and Amazon/Tripadvisor corpora: positive scalars (good, great, excellent), negative scalars (disappointing, bad, terrible), emphatics (totally, absolutely, utterly), and attenuators (somewhat/r, fairly/r, pretty/r). Positive scalars rise toward the top of the rating scale, negative scalars toward the bottom, emphatics show J- or U-shapes, and attenuators show hump shapes centered on the middle of the scale.]

Or use regression coefficients to weight words:
◦ Train a classifier based on supervised data
◦ Predict: human-labeled connotation of a document
◦ From: all the words and bigrams in it
◦ Use the regression coefficients as the weights (a sketch follows below)
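This recipe can be sketched with scikit-learn: fit a logistic regression on labeled documents using word and bigram counts, then read off each feature's coefficient as its weight. The tiny corpus here is invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# hypothetical labeled documents: 1 = positive connotation, 0 = negative
docs   = ["a great movie and terrific performances",
          "the worst movie i have ever seen",
          "a wonderful experience , witty and zany",
          "the story went nowhere , just awful"]
labels = [1, 0, 1, 0]

# all words and bigrams in the training set as count features
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)

# the learned coefficient for each word or bigram is its weight
weights = dict(zip(vectorizer.get_feature_names_out(), clf.coef_[0]))
print(sorted(weights, key=weights.get)[:3])    # most negatively weighted features
```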

20.5 Supervised Learning of Word Sentiment

20.5.1 Log Odds Ratio Informative Dirichlet Prior

One thing we often want to do with word polarity is to distinguish between words that are more likely to be used in one category of texts than in another. We may, for example, want to know the words most associated with 1-star reviews versus those associated with 5-star reviews. These differences may not be related just to sentiment. We might want to find words used more often by Democratic than Republican members of Congress, or words used more often in menus of expensive restaurants than cheap restaurants.

Given two classes of documents, to find words more associated with one category than another, we could measure the difference in frequencies (is a word w more frequent in class A or class B?). Or instead of the difference in frequencies we could compute the ratio of frequencies, or compute the log odds ratio (the log of the ratio between the odds of the two words). We could then sort words by whichever association measure we pick, ranging from words overrepresented in category A to words overrepresented in category B.

The problem with simple log-likelihood or log odds methods is that they don't work well for very rare words or very frequent words: for words that are very frequent, all differences seem large, and for words that are very rare, no differences seem large.

In this section we walk through the details of one solution to this problem: the "log odds ratio informative Dirichlet prior" method of Monroe et al. (2008), a particularly useful method for finding words that are statistically overrepresented in one particular category of texts compared to another. It's based on the idea of using another large corpus to get a prior estimate of what we expect the frequency of each word to be.

Monroe, B. L., Colaresi, M. P., and Quinn, K. M. (2008). Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis 16(4), 372-403.

Let's start with the goal: assume we want to know whether the word horrible occurs more in corpus i or corpus j. We could compute the log likelihood ratio, using $f^i(w)$ to mean the frequency of word w in corpus i, and $n^i$ to mean the total number of words in corpus i:

$$llr(horrible) = \log \frac{P^i(horrible)}{P^j(horrible)} = \log P^i(horrible) - \log P^j(horrible) = \log \frac{f^i(horrible)}{n^i} - \log \frac{f^j(horrible)}{n^j} \qquad (20.7)$$

Instead, let's compute the log odds ratio: does horrible have higher odds in i or in j:

$$lor(horrible) = \log\left(\frac{P^i(horrible)}{1 - P^i(horrible)}\right) - \log\left(\frac{P^j(horrible)}{1 - P^j(horrible)}\right) = \log\left(\frac{\frac{f^i(horrible)}{n^i}}{1 - \frac{f^i(horrible)}{n^i}}\right) - \log\left(\frac{\frac{f^j(horrible)}{n^j}}{1 - \frac{f^j(horrible)}{n^j}}\right) = \log\left(\frac{f^i(horrible)}{n^i - f^i(horrible)}\right) - \log\left(\frac{f^j(horrible)}{n^j - f^j(horrible)}\right) \qquad (20.8)$$
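Eqs. 20.7 and 20.8 translate directly into code. A minimal sketch with hypothetical counts (note that for rare words the two measures are nearly identical, since the odds of a rare word are close to its probability):

```python
import math

def log_likelihood_ratio(f_i, n_i, f_j, n_j):
    """Eq. 20.7: log P_i(w) - log P_j(w), from raw counts."""
    return math.log(f_i / n_i) - math.log(f_j / n_j)

def log_odds_ratio(f_i, n_i, f_j, n_j):
    """Eq. 20.8: log odds of w in corpus i minus its log odds in corpus j."""
    return (math.log(f_i / (n_i - f_i))
            - math.log(f_j / (n_j - f_j)))

# hypothetical counts of "horrible" in a 1-star corpus i vs. a 5-star corpus j
print(log_likelihood_ratio(500, 1_000_000, 40, 2_000_000))  # ~3.22
print(log_odds_ratio(500, 1_000_000, 40, 2_000_000))        # ~3.22 as well
```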

The Dirichlet intuition is to use a large background corpus to get a prior estimate of what we expect the frequency of each word w to be. We'll do this very simply by adding the counts from that corpus to the numerator and denominator, so that we're essentially shrinking the counts toward that prior. It's like asking how large the differences between i and j are, given what we would expect given their frequencies in a well-estimated large background corpus.

The method estimates the difference between the frequency of word w in two corpora i and j via the prior-modified log odds ratio for w, $\delta_w^{(i-j)}$, which is estimated as:

$$\delta_w^{(i-j)} = \log\left(\frac{f_w^i + \alpha_w}{n^i + \alpha_0 - (f_w^i + \alpha_w)}\right) - \log\left(\frac{f_w^j + \alpha_w}{n^j + \alpha_0 - (f_w^j + \alpha_w)}\right) \qquad (20.9)$$

(where $n^i$ is the size of corpus i, $n^j$ is the size of corpus j, $f_w^i$ is the count of word w in corpus i, $f_w^j$ is the count of word w in corpus j, $\alpha_0$ is the size of the background corpus, and $\alpha_w$ is the count of word w in the background corpus.)

In addition, Monroe et al. (2008) make use of an estimate for the variance of the log-odds-ratio:

$$\sigma^2\left(\hat{\delta}_w^{(i-j)}\right) \approx \frac{1}{f_w^i + \alpha_w} + \frac{1}{f_w^j + \alpha_w} \qquad (20.10)$$

The final statistic for a word is then the z-score of its log-odds-ratio:

$$\frac{\hat{\delta}_w^{(i-j)}}{\sqrt{\sigma^2\left(\hat{\delta}_w^{(i-j)}\right)}} \qquad (20.11)$$

The Monroe et al. (2008) method thus modifies the commonly used log odds ratio in two ways: it uses the z-scores of the log odds ratio, which controls for the amount of variance in a word's frequency, and it uses counts from a background corpus to provide a prior count for words.

Fig. 20.12 shows the method applied to a dataset of restaurant reviews from Yelp, comparing the words used in 1-star reviews to the words used in 5-star reviews (Jurafsky et al., 2014). The largest difference is in obvious sentiment words, with the 1-star reviews using negative sentiment words like worse, bad, awful and the 5-star reviews using positive sentiment words like great, best, amazing. But there are other illuminating differences. 1-star reviews use logical negation (no, not), while 5-star reviews use emphatics and emphasize universality (very, highly, every, always). 1-star reviews use first person plurals (we, us, our) while 5-star reviews use the second person. 1-star reviews talk about people (manager, waiter, customer) while 5-star reviews talk about dessert and properties of expensive restaurants like courses and atmosphere. See Jurafsky et al. (2014) for more details.
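Putting Eqs. 20.9-20.11 together, here is a compact sketch of the full statistic. The toy corpora are illustrative, and using the union of the two corpora as the background prior is a shortcut for brevity; Monroe et al. use a separate large background corpus:

```python
import math
from collections import Counter

def fightin_words(corpus_i, corpus_j, background):
    """Prior-modified log odds ratio with z-scoring (Eqs. 20.9-20.11).

    Each argument is a Counter mapping word -> count. Returns a dict
    {word: z-score}; positive scores mean overrepresented in corpus i.
    """
    n_i, n_j = sum(corpus_i.values()), sum(corpus_j.values())
    a0 = sum(background.values())              # alpha_0: background corpus size
    scores = {}
    for w in set(corpus_i) | set(corpus_j):
        aw = background[w]                     # alpha_w: background count of w
        if aw == 0:                            # skip words missing from the prior
            continue
        fi, fj = corpus_i[w], corpus_j[w]
        # Eq. 20.9: prior-smoothed log odds in each corpus
        delta = (math.log((fi + aw) / (n_i + a0 - fi - aw))
                 - math.log((fj + aw) / (n_j + a0 - fj - aw)))
        # Eq. 20.10: variance estimate; Eq. 20.11: z-score
        sigma2 = 1.0 / (fi + aw) + 1.0 / (fj + aw)
        scores[w] = delta / math.sqrt(sigma2)
    return scores

# toy corpora; the background here is just their union for brevity
one_star  = Counter("the food was terrible terrible and the waiter was rude".split())
five_star = Counter("the food was amazing and the service was great great".split())
scores = fightin_words(one_star, five_star, one_star + five_star)
print(sorted(scores, key=scores.get)[:2])      # most 5-star-associated words
```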

20.6 Using Lexicons for Sentiment Recognition

In Chapter 4 we introduced the naive Bayes algorithm for sentiment analysis. The lexicons we have focused on throughout the chapter so far can be used in a number of ways to improve sentiment detection. In the simplest case, lexicons can be used when we don't have sufficient training data to build a supervised sentiment analyzer; it can often be expensive to have a human assign sentiment to each document to train the supervised classifier.

Top 50 words associated with bad (= 1-star) reviews by the Monroe et al. (2008) method (Jurafsky et al., 2014):

Words in 1-star reviews:
◦ Negative: worst, rude, terrible, horrible, bad, awful, disgusting, bland, tasteless, gross, mediocre, overpriced, worse, poor
◦ Negation: no, not
◦ 1Pl pro: we, us, our
◦ 3 pro: she, he, her, him
◦ Past verb: was, were, asked, told, said, did, charged, waited, left, took
◦ Sequencers: after, then
◦ Nouns: manager, waitress, waiter, customer, customers, attitude, waste, poisoning, money, bill, minutes
◦ Irrealis modals: would, should
◦ Comp: to, that

Words in 5-star reviews:
◦ Positive: great, best, love(d), delicious, amazing, favorite, perfect, excellent, awesome, friendly, fantastic, fresh, wonderful, incredible, sweet, yum(my)
◦ Emphatics/universals: very, highly, perfectly, definitely, absolutely, everything, every, always
◦ 2 pro: you
◦ Articles: a, the
◦ Advice: try, recommend
◦ Conjunct: also, as, well, with, and
◦ Nouns: atmosphere, dessert, chocolate, wine, course, menu
◦ Auxiliaries: is/'s, can, 've, are
◦ Prep, other: in, of, die, city, mouth

Figure 20.12 The top 50 words associated with one-star and five-star restaurant reviews in a Yelp dataset of 900,000 reviews, using the Monroe et al. (2008) method (Jurafsky et al., 2014).

In such situations, lexicons can be used in a rule-based algorithm for classification. The simplest version is just to use the ratio of positive to negative words: if a document has more positive than negative words (using the lexicon to decide the polarity of each word in the document), it is classified as positive. Often a threshold $\lambda$ is used, in which a document is classified as positive only if the ratio is greater than $\lambda$. If the sentiment lexicon includes positive and negative weights for each word, $\theta_w^+$ and $\theta_w^-$, these can be used as well. Here's a simple such sentiment algorithm:

$$f^+ = \sum_{w \,\textrm{s.t.}\, w \in \textrm{positivelexicon}} \theta_w^+ \, count(w)$$

$$f^- = \sum_{w \,\textrm{s.t.}\, w \in \textrm{negativelexicon}} \theta_w^- \, count(w)$$

$$sentiment = \begin{cases} + & \textrm{if } \frac{f^+}{f^-} > \lambda \\ - & \textrm{if } \frac{f^-}{f^+} > \lambda \\ 0 & \textrm{otherwise} \end{cases} \qquad (20.12)$$

If supervised training data is available, these counts computed from sentiment lexicons, sometimes weighted or normalized in various ways, can also be used as features in a classifier along with other lexical or non-lexical features. We return to such algorithms in Section 20.8.
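A direct rendering of Eq. 20.12 in Python; the lexicon weights and the threshold are illustrative:

```python
from collections import Counter

def lexicon_sentiment(tokens, pos_weights, neg_weights, lam=1.0):
    """Rule-based document sentiment from weighted lexicon counts (Eq. 20.12)."""
    counts = Counter(tokens)
    f_pos = sum(weight * counts[w] for w, weight in pos_weights.items())
    f_neg = sum(weight * counts[w] for w, weight in neg_weights.items())
    # the ratio tests f+/f- > lambda and f-/f+ > lambda, rewritten as
    # products to avoid dividing by zero
    if f_pos > 0 and f_pos > lam * f_neg:
        return "+"
    if f_neg > 0 and f_neg > lam * f_pos:
        return "-"
    return "0"

# hypothetical weighted lexicons
pos = {"great": 1.5, "wonderful": 1.2, "terrific": 1.0}
neg = {"worst": 1.5, "awful": 1.2, "bland": 0.8}
print(lexicon_sentiment("a great and wonderful movie".split(), pos, neg))  # '+'
```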

20.7 Other tasks: Personality

Many other kinds of affective meaning can be extracted from text and speech. For example, detecting a person's personality from their language can be useful for dialog systems (users tend to prefer agents that match their personality), and can play a useful role in computational social science questions like understanding how personality is related to other kinds of behavior.

Using the lexicons to detect affect

Lexicons for detecting document affect: Simplest unsupervised method

Sentiment:
◦ Sum the weights of each positive word in the document
◦ Sum the weights of each negative word in the document
◦ Choose whichever value (positive or negative) has the higher sum

Emotion:
◦ Do the same for each emotion lexicon (a sketch follows below)
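A minimal sketch of this sum-and-compare recipe for emotion, assuming tiny hypothetical weighted lexicons (in practice these would come from a resource like the NRC emotion lexicon):

```python
def strongest_emotion(tokens, emotion_lexicons):
    """Sum each emotion lexicon's weights over the document's tokens
    and return the emotion with the highest total."""
    totals = {emotion: sum(lexicon.get(t, 0.0) for t in tokens)
              for emotion, lexicon in emotion_lexicons.items()}
    return max(totals, key=totals.get), totals

lexicons = {   # tiny hypothetical lexicons for illustration
    "joy":   {"wonderful": 1.0, "delicious": 0.9, "love": 1.0},
    "anger": {"rude": 0.9, "awful": 0.8, "hate": 1.0},
}
doc = "the food was delicious and we love this place".split()
print(strongest_emotion(doc, lexicons))   # ('joy', {'joy': 1.9, 'anger': 0.0})
```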


Lexicons for detecting document affect: Slightly more complex unsupervised method

If the sentiment lexicon includes positive and negative weights for each word, $\theta_w^+$ and $\theta_w^-$, sum the weighted counts $f^+$ and $f^-$ and apply a threshold $\lambda$, as in Eq. 20.12 above.

Lexicons for detecting document affect: Simplest supervised method

Use the lexicons as features for a classifier. Given a training set:
◦ Each observation has a label (review X has sentiment Y)
◦ Assign features to each observation
◦ Use counts of lexicon categories as features (see the sketch after this list), e.g.:
  ◦ NRC Emotion category "anticipation" had a count of 2 (2 words in this document were in the "anticipation" lexicon)
  ◦ LIWC category "cognition" had a count of 7

Baseline:
◦ Use counts of all the words and bigrams in the training set, like the naive Bayes algorithm
◦ This is hard to beat
◦ But "using all the words" only works if the training and test sets are very similar; in real life, sometimes the test set is very different
◦ Lexicons are useful in that situation
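A sketch of the counts-of-lexicon-categories idea; the category word lists here are invented stand-ins for NRC/LIWC categories, which are distributed separately:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

CATEGORIES = {                     # hypothetical stand-ins for NRC/LIWC categories
    "anticipation": {"soon", "eventually", "hope"},
    "cognition":    {"think", "because", "know"},
    "negemo":       {"awful", "worst", "hate"},
}

def category_counts(tokens):
    """One feature per lexicon category: how many tokens fall in it."""
    return [sum(t in words for t in tokens) for words in CATEGORIES.values()]

# hypothetical labeled training documents (1 = positive, 0 = negative)
docs = ["i hope the sequel comes soon", "the worst awful film , i hate it"]
y = [1, 0]
X = np.array([category_counts(d.split()) for d in docs])

clf = LogisticRegression().fit(X, y)
test = category_counts("we know it will come eventually".split())
print(clf.predict(np.array([test])))
```

Because the feature space is a handful of lexicon categories rather than the full vocabulary, such a classifier can transfer better when the test set's wording differs from the training set's.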

Computing entity-centric affect

Suppose we want an affect score for an entity in a text (not the entire document). This is the entity-centric method of Field and Tsvetkov (2019).

Entity-centric affect (Field and Tsvetkov 2019), step 1: Train a classifier to predict V/A/D from embeddings

1. For each word w in the training corpus:
◦ Use off-the-shelf encoders (like BERT) to extract a contextual embedding e for each instance of the word.
◦ Average over the e embeddings of each instance of w to obtain a single embedding vector for one training point w.
◦ Use the NRC VAD Lexicon to get V, A, and D scores for w.
2. Train (three) regression models on all words w to predict V, A, and D scores from a word's average embedding.

Given an entity mention m in a text, assign affect scores as follows: • Use the same pretrained LM to get contextual embeddings for m in context. • Feed this embedding through the 3 regression models to get V, A, D scores for the entity. 20.10 • CONNOTATION FRAMES 19

(b) Average over the e embeddings of each instance of w to obtain a single embedding vector for one training point w. (c) Use the NRC VAD Lexicon to get S, A, and P scores for w. 2. Train (three) regression models on all words w to predict V, A, D scores from a word’s average embedding. Now given an entity mention m in a text, we assign affect scores as follows: 1. Use the same pretrained LM to get contextual embeddings for m in context. 2. Feed this embedding through the 3 regression models to get S, A, P scores for the entity. This results in a (S,A,P) tuple for a given entity mention; To get scores for the rep- resentation of an entity in a complete document, we can run coreference resolution and average the (S,A,P) scores for all the mentions. Fig. 20.14 shows the scores from their algorithm for characters from the movie The Dark Knight when run on WikpediaEntity plot summary-centric texts with affect gold coreference. (Field and Tsvetkov 2019)

[Figure 20.14 Power (dominance), sentiment (valence), and agency (arousal) scores for characters in the movie The Dark Knight, computed from embeddings trained on the NRC VAD Lexicon. Note the protagonist (Batman) and the antagonist (the Joker) have high power and agency scores but differ in sentiment, while the love interest Rachel has low power and agency but high sentiment.]

20.10 Connotation Frames

The lexicons we've described so far define a word as a point in affective space. A connotation frame, by contrast, is a lexicon that incorporates a richer kind of grammatical structure, by combining affective lexicons with the frame semantic lexicons of Chapter 10. The basic insight of connotation frame lexicons is that a predicate like a verb expresses connotations about the verb's arguments (Rashkin et al. 2016, Rashkin et al. 2017).

Consider sentences like:

(20.15) Country A violated the sovereignty of Country B
(20.16) the teenager ... survived the Boston Marathon bombing

By using the verb violate in (20.15), the author is expressing their sympathies with Country B, portraying Country B as a victim, and expressing antagonism toward the agent, Country A.

Connotation Frames intuition

A predicate expresses connotations about its arguments (Rashkin et al. 2016, Rashkin et al. 2017).

By using violate, the author is sympathizing with B and expressing negative sentiment toward A:
• Country A violated the sovereignty of Country B

By using survive, the author is saying that the bombing is negative and sympathizing with the teenager:
• the teenager ... survived the Boston Marathon bombing

By contrast, in using the verb survive, the author of (20.16) is expressing that the bombing is a negative experience, and the subject of the sentence, the teenager, is a sympathetic character. These aspects of connotation are inherent in the meaning of the verbs violate and survive, as shown in Fig. 20.15.

Connotation Frames (Rashkin et al., 2016, 2017)

[Figure 20.15 Connotation frames for "Role1 survives Role2" (a) and "Role1 violates Role2" (b). (a) For survive, the writer and reader have positive sentiment toward Role1, the subject (a sympathetic victim), and negative sentiment toward Role2, the direct object (some type of hardship). (b) For violate, the writer and reader have positive sentiment instead toward Role2, the direct object (a sympathetic victim), while Role1 is the antagonist.]

The connotation frame lexicons of Rashkin et al. (2016) and Rashkin et al. (2017) also express other connotative aspects of the predicate toward each argument, including effect (something bad happened to x), value (x is valuable), and mental state (x is distressed by the event). Connotation frames can also mark the power differential between the arguments (using the verb implore means that the theme argument has greater power than the agent), and the agency of each argument (waited is low agency). Fig. 20.16 shows a visualization from Sap et al. (2017).

[Figure 20.16 The connotation frames of Sap et al. (2017), showing that the verb implore implies the agent has lower power than the theme (in contrast, say, with a verb like demanded), and showing the low level of agency of the subject of waited. Figure example from Sap et al. (2017).]

Connotation frames can be built by hand (Sap et al., 2017), or they can be learned by supervised learning (Rashkin et al., 2016), for example using hand-labeled training data to supervise classifiers for each of the individual relations, e.g., whether S(writer→Role1) is + or -, and then improving accuracy via global constraints across all relations.
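One natural way to represent such a lexicon is as a structured record per verb. A minimal sketch, with field names invented for illustration and values taken from Fig. 20.15 and the paragraph above:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ConnotationFrame:
    verb: str
    writer_to_role1: Optional[str] = None  # writer's sentiment toward the subject
    writer_to_role2: Optional[str] = None  # writer's sentiment toward the object
    power: Optional[str] = None            # relative power of agent vs. theme
    agency_agent: Optional[str] = None     # agency level of the agent

# entries grounded in Fig. 20.15 and the text above
frames = {
    "survive": ConnotationFrame("survive", writer_to_role1="+", writer_to_role2="-"),
    "violate": ConnotationFrame("violate", writer_to_role1="-", writer_to_role2="+"),
    "implore": ConnotationFrame("implore", power="agent<theme"),
    "wait":    ConnotationFrame("wait", agency_agent="-"),
}

print(frames["violate"].writer_to_role2)   # '+': the object of violate is the victim
```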
