
Multiple Instance Learning Networks for Fine-Grained Sentiment Analysis

Stefanos Angelidis and Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB
[email protected], [email protected]

Abstract

We consider the task of fine-grained sentiment analysis from the perspective of multiple instance learning (MIL). Our neural model is trained on document sentiment labels, and learns to predict the sentiment of text segments, i.e. sentences or elementary discourse units (EDUs), without segment-level supervision. We introduce an attention-based polarity scoring method for identifying positive and negative text snippets and a new dataset which we call SPOT (as shorthand for Segment-level POlariTy annotations) for evaluating MIL-style sentiment models like ours. Experimental results demonstrate superior performance against multiple baselines, whereas a judgement elicitation study shows that EDU-level opinion extraction produces more informative summaries than sentence-based alternatives.

[Rating: ★★] I had a very mixed experience at The Stand. The burger and fries were good. The chocolate shake was divine: rich and creamy. The drive-thru was horrible. It took us at least 30 minutes to order when there were only four cars in front of us. We complained about the wait and got a half-hearted apology. I would go back because the food is good, but my only hesitation is the wait.

+ The burger and fries were good
+ The chocolate shake was divine
+ I would go back because the food is good
– The drive-thru was horrible
– It took us at least 30 minutes to order

Figure 1: An EDU-based summary of a 2-out-of-5 star review with positive and negative snippets.

1 Introduction

Sentiment analysis has become a fundamental area of research in Natural Language Processing thanks to the proliferation of user-generated content in the form of online reviews, blogs, internet forums, and social media. A plethora of methods have been proposed in the literature that attempt to distill sentiment information from text, allowing users and service providers to make opinion-driven decisions.

The success of neural networks in a variety of applications (Bahdanau et al., 2015; Le and Mikolov, 2014; Socher et al., 2013) and the availability of large amounts of labeled data have led to an increased focus on sentiment classification. Supervised models are typically trained on documents (Johnson and Zhang, 2015a; Johnson and Zhang, 2015b; Tang et al., 2015; Yang et al., 2016), sentences (Kim, 2014), or phrases (Socher et al., 2011; Socher et al., 2013) annotated with sentiment labels and used to predict sentiment in unseen texts. Coarse-grained document-level annotations are relatively easy to obtain due to the widespread use of opinion grading interfaces (e.g., star ratings accompanying reviews). In contrast, the acquisition of sentence- or phrase-level sentiment labels remains a laborious and expensive endeavor, despite its relevance to various opinion mining applications, e.g., detecting or summarizing consumer opinions in online product reviews. The usefulness of finer-grained sentiment analysis is illustrated in the example of Figure 1, where snippets of opposing polarities are extracted from a 2-star restaurant review. Although, as a whole, the review conveys negative sentiment, aspects of the reviewer's experience were clearly positive. This goes largely unnoticed when focusing solely on the review's overall rating.

In this work, we consider the problem of segment-level sentiment analysis from the perspective of Multiple Instance Learning (MIL; Keeler and Rumelhart, 1992).


Transactions of the Association for Computational Linguistics, vol. 6, pp. 17–31, 2018. Action Editor: Ani Nenkova. Submission batch: 7/2017; Revision batch: 11/2017; Published 1/2018. © 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Instead of learning from individually labeled segments, our model only requires document-level supervision and learns to introspectively judge the sentiment of constituent segments. Beyond showing how to utilize document collections of rated reviews to train fine-grained sentiment predictors, we also investigate the granularity of the extracted segments. Previous research (Tang et al., 2015; Yang et al., 2016; Cheng and Lapata, 2016; Nallapati et al., 2017) has predominantly viewed documents as sequences of sentences. Inspired by recent work in summarization (Li et al., 2016) and sentiment classification (Bhatia et al., 2015), we also represent documents via Rhetorical Structure Theory's (Mann and Thompson, 1988) Elementary Discourse Units (EDUs). Although definitions for EDUs vary in the literature, we follow standard practice and take the elementary units of discourse to be clauses (Carlson et al., 2003). We employ a state-of-the-art discourse parser (Feng and Hirst, 2012) to identify them.

Our contributions in this work are three-fold: a novel multiple instance learning neural model which utilizes document-level sentiment supervision to judge the polarity of its constituent segments; the creation of SPOT, a publicly available dataset which contains Segment-level POlariTy annotations (for sentences and EDUs) and can be used for the evaluation of MIL-style models like ours; and the empirical finding (through automatic and human-based evaluation) that neural multiple instance learning is superior to more conventional neural architectures and other baselines at detecting segment sentiment and extracting informative opinions in reviews.¹

2 Background

Our work lies at the intersection of multiple research areas, including sentiment classification, opinion mining, and multiple instance learning. We review related work in these areas below.

Sentiment Classification Sentiment classification is one of the most popular tasks in sentiment analysis. Early work focused on unsupervised methods and the creation of sentiment lexicons (Turney, 2002; Hu and Liu, 2004; Wiebe et al., 2005; Baccianella et al., 2010), based on which the overall polarity of a text can be computed (e.g., by aggregating the sentiment scores of constituent words). More recently, Taboada et al. (2011) introduced SO-CAL, a state-of-the-art method that combines a rich sentiment lexicon with carefully defined rules over syntax trees to predict sentence sentiment.

Machine learning techniques have subsequently dominated the literature (Pang et al., 2002; Pang and Lee, 2005; Qu et al., 2010; Xia and Zong, 2010; Wang and Manning, 2012; Le and Mikolov, 2014), thanks to user-generated sentiment labels or large-scale crowd-sourcing efforts (Socher et al., 2013). Neural network models in particular have achieved state-of-the-art performance on various sentiment classification tasks due to their ability to alleviate feature engineering. Kim (2014) introduced a very successful CNN architecture for sentence-level classification, whereas other work (Socher et al., 2011; Socher et al., 2013) uses recursive neural networks to learn sentiment for segments of varying granularity (i.e., words, phrases, and sentences). We describe Kim's (2014) approach in more detail, as it is also used as part of our model.

Let x_i denote the k-dimensional embedding of the i-th word in text segment s of length n. The segment's input representation is the concatenation of word embeddings x_1, ..., x_n, resulting in word matrix X. Let X_{i:i+j} refer to the concatenation of embeddings x_i, ..., x_{i+j}. A convolution filter W ∈ R^{l×k}, applied to a window of l words, produces a new feature c_i = ReLU(W ∘ X_{i:i+l−1} + b), where ReLU is the Rectified Linear Unit non-linearity, '∘' denotes the entrywise product followed by a sum over all elements, and b ∈ R is a bias term. Applying the same filter to every possible window of word vectors in the segment produces a feature map c = [c_1, c_2, ..., c_{n−l+1}]. Multiple feature maps for varied window sizes are applied, resulting in a fixed-size segment representation v via max-over-time pooling. We will refer to the application of convolution to an input word matrix X as CNN(X). A final sentiment prediction is produced using a softmax classifier, and the model is trained via back-propagation using sentence-level sentiment labels.
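For concreteness, the following is a minimal PyTorch sketch of this encoder; module and parameter names are ours and do not come from the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNSegmentEncoder(nn.Module):
    """Kim-style convolutional segment encoder: for each window size l, a bank
    of filters is convolved over the word matrix X, and max-over-time pooling
    yields a fixed-size segment vector."""
    def __init__(self, emb_dim=300, window_sizes=(3, 4, 5), feature_maps=100):
        super().__init__()
        # One Conv1d per window size; each filter plays the role of W in
        # c_i = ReLU(W o X_{i:i+l-1} + b).
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, feature_maps, l) for l in window_sizes]
        )

    def forward(self, X):
        # X: (batch, n_words, emb_dim); Conv1d expects (batch, emb_dim, n_words)
        X = X.transpose(1, 2)
        pooled = []
        for conv in self.convs:
            c = F.relu(conv(X))                 # feature map: (batch, maps, n-l+1)
            pooled.append(c.max(dim=2).values)  # max-over-time pooling
        return torch.cat(pooled, dim=1)         # segment vector v

# v = CNNSegmentEncoder()(torch.randn(2, 20, 300))  # -> (2, 300)
```

Because of the max-over-time pooling, the output size depends only on the number of filters, so segments of different lengths map to vectors of the same dimensionality.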

¹Our code and SPOT dataset are publicly available at: https://github.com/stangelid/milnet-sent

The availability of large-scale datasets (Diao et al., 2014; Tang et al., 2015) has also led to the development of document-level sentiment classifiers which exploit hierarchical neural representations.

These are obtained by first building representations of sentences and aggregating those into a document feature vector (Tang et al., 2015). Yang et al. (2016) further acknowledge that words and sentences are differentially important in different contexts. They present a model which learns to attend (Bahdanau et al., 2015) to individual text parts when constructing document representations. We describe such an architecture in more detail, as we use it as a point of comparison with our own model.

Given document d comprising segments (s_1, ..., s_m), a Hierarchical Network with attention (henceforth HIERNET; based on Yang et al., 2016) produces segment representations (v_1, ..., v_m) which are subsequently fed into a bidirectional GRU module (Bahdanau et al., 2015), whose resulting hidden vectors (h_1, ..., h_m) are used to produce attention weights (a_1, ..., a_m) (see Section 3.2 for more details on the attention mechanism). A document is represented as the weighted average of the segments' hidden vectors, v_d = Σ_i a_i h_i. A final sentiment prediction is obtained using a softmax classifier, and the model is trained via back-propagation using document-level sentiment labels. The architecture is illustrated in Figure 2(a). In their proposed model, Yang et al. (2016) use bidirectional GRU modules to represent segments as well as documents, whereas we use a more efficient CNN encoder to compose words into segment vectors (i.e., v_i = CNN(X_i)).² Note that models like HIERNET do not naturally predict sentiment for individual segments; we discuss how they can be used for segment-level opinion extraction in Section 5.2.

Our own work draws inspiration from representation learning (Tang et al., 2015; Kim, 2014), especially the idea that not all parts of a document convey sentiment-worthy clues (Yang et al., 2016). Our model departs from previous approaches in that it provides a natural way of predicting the polarity of individual text segments without requiring segment-level annotations. Moreover, our attention mechanism directly facilitates opinion detection, rather than simply aggregating sentence representations into a single document vector.

Opinion Mining A standard setting for opinion mining and summarization (Lerman et al., 2009; Carenini et al., 2006; Ganesan et al., 2010; Di Fabbrizio et al., 2014; Gerani et al., 2014) assumes a set of documents that contain opinions about some entity of interest (e.g., a camera). The goal of the system is to generate a summary that is representative of the average opinion and speaks to its important aspects (e.g., picture quality, battery life, value). Output summaries can be extractive (Lerman et al., 2009) or abstractive (Gerani et al., 2014; Di Fabbrizio et al., 2014), and the underlying systems exhibit varying degrees of linguistic sophistication, from identifying aspects (Lerman et al., 2009) to using RST-style discourse analysis and manually defined templates (Gerani et al., 2014; Di Fabbrizio et al., 2014). Our proposed method departs from previous work in that it focuses on detecting opinions in individual documents. Given a review, we predict the polarity of every segment, allowing for the extraction of sentiment-heavy opinions. We explore the usefulness of EDU segmentation inspired by Li et al. (2016), who show that EDU-based summaries align with near-extractive summaries constructed by news editors. Importantly, our model is trained in a weakly-supervised fashion on large-scale document classification datasets, without recourse to fine-grained labels or gold-standard opinion summaries.

Multiple Instance Learning Our models adopt a Multiple Instance Learning (MIL) framework. MIL deals with problems where labels are associated with groups of instances, or bags (documents in our case), while instance labels (segment-level polarities) are unobserved. An aggregation function is used to combine instance predictions and assign labels on the bag level. The goal is either to label bags (Keeler and Rumelhart, 1992; Dietterich et al., 1997; Maron and Ratan, 1998) or to simultaneously infer bag and instance labels (Zhou et al., 2009; Wei et al., 2014; Kotzias et al., 2015). We view segment-level sentiment analysis as an instantiation of the latter variant.

Initial MIL efforts for binary classification made the strong assumption that a bag is negative only if all of its instances are negative, and positive otherwise (Dietterich et al., 1997; Maron and Ratan, 1998; Zhang et al., 2002; Andrews and Hofmann, 2004; Carbonetto et al., 2008).

²When applied to the YELP'13 and IMDB document classification datasets, the use of CNNs results in a relative performance decrease of < 2% compared to Yang et al.'s (2016) model.

Subsequent work relaxed this assumption, allowing for prediction combinations better suited to the tasks at hand. Weidmann et al. (2003) introduced a generalized MIL framework, where a combination of instance types is required to assign a bag label. Zhou et al. (2009) used graph kernels to aggregate predictions, exploiting relations between instances in object and text categorization. Xu and Frank (2004) proposed a multiple-instance logistic regression classifier where instance predictions were simply averaged, assuming equal and independent contribution toward bag classification. More recently, Kotzias et al. (2015) used sentence vectors obtained by a pre-trained hierarchical CNN (Denil et al., 2014) as features under an unweighted average MIL objective. Prediction averaging was further extended by Pappas and Popescu-Belis (2014; 2017), who used a weighted summation of predictions, an idea which we also adopt in our work.

Applications of MIL are many and varied. MIL was first explored by Keeler and Rumelhart (1992) for recognizing handwritten post codes, where the position and value of individual digits was unknown. MIL techniques have since been applied to drug activity prediction (Dietterich et al., 1997), image retrieval (Maron and Ratan, 1998; Zhang et al., 2002), object detection (Zhang et al., 2006; Carbonetto et al., 2008; Cour et al., 2011), text classification (Andrews and Hofmann, 2004), image captioning (Wu et al., 2015), paraphrase detection (Xu et al., 2014), and information extraction (Hoffmann et al., 2011).

When applied to sentiment analysis, MIL takes advantage of supervision signals on the document level in order to train segment-level sentiment predictors. Although their work is not couched in the framework of MIL, Täckström and McDonald (2011) show how sentence sentiment labels can be learned as latent variables from document-level annotations using hidden conditional random fields. Pappas and Popescu-Belis (2014) use a multiple instance regression model to assign sentiment scores to specific aspects of products. The Group-Instance Cost Function (GICF), proposed by Kotzias et al. (2015), averages sentence sentiment predictions during training, while ensuring that similar sentences receive similar polarity labels. Their work uses a pre-trained hierarchical CNN to obtain sentence embeddings, but is not trainable end-to-end, in contrast with our proposed network. Additionally, none of the aforementioned efforts explicitly evaluate opinion extraction quality.

3 Methodology

In this section we describe how multiple instance learning can be used to address some of the drawbacks seen in previous approaches, namely the need for expert knowledge in lexicon-based sentiment analysis (Taboada et al., 2011), expensive fine-grained annotation on the segment level (Kim, 2014; Socher et al., 2013), or the inability to naturally predict segment sentiment (Yang et al., 2016).

3.1 Problem Formulation

Under multiple instance learning (MIL), a dataset D is a collection of labeled bags, each of which is a group of unlabeled instances. Specifically, each document d is a sequence (bag) of segments (instances). This sequence d = (s_1, s_2, ..., s_m) is obtained from a document segmentation policy (see Section 4 for details). A discrete sentiment label y_d ∈ [1, C] is associated with each document, where the labelset is ordered and classes 1 and C correspond to maximally negative and maximally positive sentiment. It is assumed that y_d is an unknown function of the unobserved segment-level labels:

y_d = f(y_1, y_2, ..., y_m)   (1)

Probabilistic sentiment classifiers will produce document-level predictions ŷ_d by selecting the most probable class according to class distribution p_d = ⟨p_d^(1), ..., p_d^(C)⟩. In a non-MIL framework, a classifier would learn to predict the document's sentiment by directly conditioning on its segments' feature representations or their aggregate:

p_d = f̂_θ(v_1, v_2, ..., v_m)   (2)

In contrast, a MIL classifier will produce a class distribution p_i for each segment and additionally learn to combine these into a document-level prediction:

p_i = ĝ_θs(v_i) ,   (3)
p_d = f̂_θd(p_1, p_2, ..., p_m) .   (4)

In this work, ĝ and f̂ are defined using a single neural network, described below.
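The division of labor in Equations (3) and (4) can be made concrete with a small NumPy sketch; the uniform-average f̂ below is only a toy stand-in for the learned, attention-based combination of Section 3.2:

```python
import numpy as np

def segment_distributions(V, W_c, b_c):
    """g_theta(v_i): a shared softmax classifier applied to every segment."""
    scores = V @ W_c + b_c                     # (m, C)
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)    # rows are p_1 .. p_m

m, d, C = 4, 300, 5                            # segments, vector size, classes
rng = np.random.default_rng(0)
P = segment_distributions(rng.normal(size=(m, d)),
                          rng.normal(size=(d, C)) * 0.01,
                          np.zeros(C))
p_doc = P.mean(axis=0)                         # simplest f: uniform average
print(p_doc.argmax() + 1)                      # predicted document class in [1, C]
```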

Figure 2: A Hierarchical Network (HIERNET) for document-level sentiment classification and our proposed Multiple Instance Learning Network (MILNET). The models use the same attention mechanism to combine segment vectors and predictions, respectively.

3.2 Multiple Instance Learning Network

Hierarchical neural models like HIERNET have been used to predict document-level polarity by first encoding sentences and then combining these representations into a document vector. Hierarchical vector composition produces powerful sentiment predictors, but lacks the ability to introspectively judge the polarity of individual segments.

Our Multiple Instance Learning Network (henceforth MILNET) is based on the following intuitive assumptions about opinionated text. Each segment conveys a degree of sentiment polarity, ranging from very negative to very positive. Additionally, segments have varying degrees of importance, in relation to the overall opinion of the author. The overarching polarity of a text is an aggregation of segment polarities, weighted by their importance. Thus, our model attempts to predict the polarity of segments and decides which parts of the document are good indicators of its overall sentiment, allowing for the detection of sentiment-heavy opinions. An illustration of MILNET is shown in Figure 2(b); the model consists of three components: a CNN segment encoder, a softmax segment classifier, and an attention-based prediction weighting module.

Segment Encoding An encoding v_i = CNN(X_i) is produced for each segment, using the CNN architecture described in Section 2.

Segment Classification Obtaining a separate representation v_i for every segment in a document allows us to produce individual segment sentiment predictions p_i = ⟨p_i^(1), ..., p_i^(C)⟩. This is achieved using a softmax classifier:

p_i = softmax(W_c v_i + b_c) ,   (5)

where W_c and b_c are the classifier's parameters, shared across all segments. Individual distributions p_i are shown in Figure 2(b) as small bar-charts.

Document Classification In the simplest case, document-level predictions can be produced by taking the average of segment class distributions: p_d^(c) = 1/m Σ_i p_i^(c), c ∈ [1, C]. This is, however, a crude way of combining segment sentiment, as not all parts of a document convey important sentiment clues. We opt for a segment attention mechanism which rewards text units that are more likely to be good sentiment predictors.

Our attention mechanism is based on a bidirectional GRU component (Bahdanau et al., 2015) and inspired by Yang et al. (2016). However, in contrast to their work, where attention is used to combine sentence representations into a single document vector, we utilize a similar technique to aggregate individual sentiment predictions.
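A minimal PyTorch rendering of the segment classifier of Equation (5), together with the crude unweighted average mentioned above (names are ours):

```python
import torch
import torch.nn as nn

class SegmentClassifier(nn.Module):
    """Equation (5): a single softmax layer, shared across all segments,
    maps each segment vector v_i to a class distribution p_i."""
    def __init__(self, dim=300, num_classes=5):
        super().__init__()
        self.linear = nn.Linear(dim, num_classes)  # W_c, b_c

    def forward(self, V):
        # V: (m, dim) segment vectors -> (m, C) distributions p_1 .. p_m
        return torch.softmax(self.linear(V), dim=-1)

P = SegmentClassifier()(torch.randn(7, 300))
p_doc_uniform = P.mean(dim=0)   # the crude unweighted-average document prediction
```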

[Figure 3 shows three EDUs from a restaurant review — "The starters were quite bland." (att: 0.3), "I didn't enjoy most of them," (att: 0.2), "but the burger was brilliant!" (att: 0.5) — with their class probability distributions (top) and resulting polarity scores in [−1, 1] (bottom).]

Figure 3: Polarity scores (bottom) obtained from class probability distributions for three EDUs (top) extracted from a restaurant review. Attention weights (top) are used to fine-tune the obtained polarities.

We first use separate GRU modules to produce forward and backward hidden vectors, which are then concatenated:

→h_i = →GRU(v_i) ,   (6)
←h_i = ←GRU(v_i) ,   (7)

h_i = [→h_i, ←h_i] ,  i ∈ [1, m] .   (8)

The importance of each segment is measured with the aid of a vector h_a, as follows:

h′_i = tanh(W_a h_i + b_a) ,   (9)
a_i = exp(h′_i^T h_a) / Σ_i exp(h′_i^T h_a) ,   (10)

where Equation (9) defines a one-layer MLP that produces an attention vector for the i-th segment. Attention weights a_i are computed as the normalized similarity of each h′_i with h_a. Vector h_a, which is randomly initialized and learned during training, can be thought of as a trained key, able to recognize sentiment-heavy segments. The attention mechanism is depicted in the dashed box of Figure 2, with attention weights shown as shaded circles.

Finally, we obtain a document-level distribution over sentiment labels as the weighted sum of segment distributions (see top of Figure 2(b)):

p_d^(c) = Σ_i a_i p_i^(c) ,  c ∈ [1, C] .   (11)

Training The model is trained end-to-end on documents with user-generated sentiment labels. We use the negative log likelihood of the document-level prediction as an objective function:

L = − Σ_d log p_d^(y_d)   (12)
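A compact PyTorch sketch of the attention-based aggregation of Equations (6)–(12); the layer names and the single-document batching are our own simplifications:

```python
import torch
import torch.nn as nn

class AttentionAggregator(nn.Module):
    """A bidirectional GRU runs over the segment vectors, a one-layer MLP
    plus a trained key h_a scores each segment, and the resulting weights
    combine segment distributions p_i into a document distribution p_d."""
    def __init__(self, dim=300, gru_dim=50, att_dim=100):
        super().__init__()
        self.gru = nn.GRU(dim, gru_dim, bidirectional=True, batch_first=True)
        self.mlp = nn.Linear(2 * gru_dim, att_dim)      # W_a, b_a of Eq. (9)
        self.key = nn.Parameter(torch.randn(att_dim))   # h_a, learned

    def forward(self, V, P):
        # V: (1, m, dim) segment vectors; P: (m, C) segment distributions
        H, _ = self.gru(V)                       # h_i = [fwd; bwd], Eqs. (6)-(8)
        Hp = torch.tanh(self.mlp(H.squeeze(0)))  # h'_i, Eq. (9)
        a = torch.softmax(Hp @ self.key, dim=0)  # attention weights a_i, Eq. (10)
        p_d = (a.unsqueeze(1) * P).sum(dim=0)    # weighted sum, Eq. (11)
        return p_d, a

V = torch.randn(1, 6, 300)                       # six segment vectors
P = torch.softmax(torch.randn(6, 5), dim=-1)     # six segment distributions
p_d, a = AttentionAggregator()(V, P)
# Training objective of Eq. (12): negative log likelihood of the gold class,
# e.g. loss = -torch.log(p_d[y_gold])
```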

4 Polarity-based Opinion Extraction

After training, our model can produce segment-level sentiment predictions for unseen texts in the form of class probability distributions. A direct application of our method is opinion extraction, where highly positive and negative snippets are selected from the original document, producing extractive sentiment summaries, as described below.

Polarity Scoring In order to extract opinion summaries, we need to rank segments according to their sentiment polarity. We introduce a method that takes our model's confidence in the prediction into account, by reducing each segment's class probability distribution p_i to a single real-valued polarity score. To achieve this, we first define a real-valued class weight vector w = ⟨w^(1), ..., w^(C)⟩, w^(c) ∈ [−1, 1], that assigns uniformly-spaced weights to the ordered labelset, such that w^(c+1) − w^(c) = 2/(C−1). For example, in a 5-class scenario, the class weight vector would be w = ⟨−1, −0.5, 0, 0.5, 1⟩. We compute the polarity score of a segment as the dot-product of the probability distribution p_i with vector w:

polarity(s_i) = Σ_c p_i^(c) w^(c) ∈ [−1, 1]   (13)

Gated Polarity As a way of increasing the effectiveness of our method, we introduce a gated extension that uses the attention mechanism of our model to further differentiate between segments that carry significant sentiment cues and those that do not:

gated-polarity(s_i) = a_i · polarity(s_i) ,   (14)

where a_i is the attention weight assigned to the i-th segment. This forces the polarity scores of segments the model does not attend to closer to 0.

An illustration of our polarity scoring function is provided in Figure 3, where the class predictions (top) of three restaurant review segments are mapped to their corresponding polarity scores (bottom). We observe that our method produces the desired result; segments 1 and 2 convey negative sentiment and receive negative scores, whereas the third segment is mapped to a positive score. Although the same discrete class label is assigned to the first two, the second segment's score is closer to 0 (neutral), as its class probability mass is more evenly distributed.

                      Yelp'13    IMDB
Documents             335,018    348,415
Average #Sentences    8.90       14.02
Average #EDUs         19.11      37.38
Average #Words        152        325
Vocabulary Size       211,245    115,831
Classes               1–5        1–10

Table 1: Document-level sentiment classification datasets used to train our models.

               Yelp'13seg          IMDBseg
               Sent.     EDUs      Sent.    EDUs
#Segments      1,065     2,110     1,029    2,398
#Documents     100                 97
Classes        {–, 0, +}           {–, 0, +}

Table 2: SPOT dataset: numbers of documents and segments with polarity annotations.
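The polarity machinery of Equations (13) and (14) is small enough to sketch directly; the class distributions and attention weights below are hypothetical values chosen to mimic the three EDUs of Figure 3:

```python
import numpy as np

def class_weights(C):
    """Uniformly-spaced class weights of Section 4, e.g. C=5 -> [-1, -0.5, 0, 0.5, 1]."""
    return np.linspace(-1.0, 1.0, C)

def polarity(p):
    """Equation (13): dot product of a class distribution with w."""
    return p @ class_weights(len(p))

def gated_polarity(p, a):
    """Equation (14): attention-gated polarity score."""
    return a * polarity(p)

# Hypothetical distributions/attention for the three EDUs of Figure 3:
P = np.array([[.60, .25, .10, .04, .01],   # "The starters were quite bland."
              [.35, .30, .20, .10, .05],   # "I didn't enjoy most of them,"
              [.02, .05, .13, .35, .45]])  # "but the burger was brilliant!"
att = np.array([0.3, 0.2, 0.5])
scores = [gated_polarity(p, a) for p, a in zip(P, att)]
```

Ranking segments by these scores is all that is needed for the extractive summaries evaluated in Section 6.2.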

Segmentation Policies As mentioned earlier, one of the hypotheses investigated in this work regards the use of subsentential units as the basis of extraction. Specifically, our model was applied to sentences and Elementary Discourse Units (EDUs), obtained from a Rhetorical Structure Theory (RST) parser (Feng and Hirst, 2012). According to RST, documents are first segmented into EDUs, corresponding roughly to independent clauses, which are then recursively combined into larger discourse spans. This results in a tree representation of the document, where connected nodes are characterized by discourse relations. We only utilize RST's segmentation, and leave the potential use of the tree structure to future work.

The example in Figure 3 illustrates why EDU-based segmentation might be beneficial for opinion extraction. The second and third EDUs correspond to the sentence: I didn't enjoy most of them, but the burger was brilliant. Taken as a whole, the sentence conveys mixed sentiment, whereas the EDUs clearly convey opposing sentiment.

5 Experimental Setup

In this section we describe the data used to assess the performance of our model. We also give details on model training and comparison systems.

5.1 Datasets

Our models were trained on two large-scale sentiment classification collections. The Yelp'13 corpus was introduced in Tang et al. (2015) and contains customer reviews of local businesses, each associated with human ratings on a scale from 1 (negative) to 5 (positive). The IMDB corpus of movie reviews was obtained from Diao et al. (2014); each review is associated with user ratings ranging from 1 to 10. Both datasets are split into training (80%), validation (10%), and test (10%) sets. A summary of statistics for each collection is provided in Table 1.

In order to evaluate model performance on the segment level, we constructed a new dataset named SPOT (as a shorthand for Segment POlariTy) by annotating documents from the Yelp'13 and IMDB collections. Specifically, we sampled reviews from each collection such that all document-level classes are represented uniformly, and the document lengths are representative of the respective corpus. Documents were segmented into sentences and EDUs, resulting in two segment-level datasets per collection. Statistics are summarized in Table 2.

Each review was presented to three Amazon Mechanical Turk (AMT) annotators who were asked to judge the sentiment conveyed by each segment (i.e., sentence or EDU) as negative, neutral, or positive.

We assigned labels using a majority vote, or a fourth annotator in the rare cases of no agreement (< 5%). Figure 4 shows the distribution of segment labels for each document-level class. As expected, documents with positive labels contain a larger number of positive segments compared to documents with negative labels, and vice versa. Neutral segments are distributed in an approximately uniform manner across document classes. Interestingly, the proportion of neutral EDUs is significantly higher compared to neutral sentences. The observation reinforces our argument in favor of EDU segmentation, as it suggests that a sentence with positive or negative overall polarity may still contain neutral EDUs. Discarding neutral EDUs could therefore lead to more concise opinion extraction compared to relying on entire sentences.

[Figure 4 comprises four bar-chart panels — Yelp'13 Sentences, Yelp'13 EDUs, IMDB Sentences, IMDB EDUs — plotting the proportion of negative, neutral, and positive segments against the document class.]

Figure 4: Distribution of segment-level labels per document-level class on the SPOT datasets.

We further experimented on two collections introduced by Kotzias et al. (2015) which also originate from the YELP'13 and IMDB datasets. Each collection consists of 1,000 randomly sampled sentences annotated with binary sentiment labels.

5.2 Model Comparison

On the task of segment classification we compared MILNET, our multiple instance learning network, against the following methods:

Majority: Majority class applied to all instances.

SO-CAL: State-of-the-art lexicon-based system that classifies segments into positive, neutral, and negative classes (Taboada et al., 2011).

Seg-CNN: Fully-supervised CNN segment classifier trained on SPOT's labels (Kim, 2014).

GICF: The Group-Instance Cost Function model introduced in Kotzias et al. (2015). This is an unweighted average prediction aggregation MIL method that uses sentence features from a pre-trained convolutional neural model.

HIERNET: HIERNET does not explicitly generate individual segment predictions. Segment polarity scores are obtained by assigning the document-level prediction to every segment. We can then produce finer-grained polarity distinctions via gating, using the model's attention weights.

We further illustrate the differences between HIERNET and MILNET in Figure 5, which includes short descriptions and simplified equations for each model. MILNET naturally produces distinct segment polarities, while HIERNET assigns a single polarity score to every segment. In both cases, gating is a further means of identifying neutral segments.

Finally, we differentiate between variants of HIERNET and MILNET according to:

Polarity source: Controls whether we assign polarities via segment-specific or document-wide predictions. HIERNET only allows for document-wide predictions. MILNET can use both.

Attention: We use models without gating (no subscript), with gating (gt subscript), as well as models trained with the attention mechanism disabled, falling back to simple averaging (avg subscript).

5.3 Model Training and Evaluation

We trained MILNET and HIERNET using Adadelta (Zeiler, 2012) for 25 epochs. Mini-batches of 200 documents were organized based on the reviews' segment and document lengths, so that the amount of padding was minimized. We used 300-dimensional pre-trained embeddings. We tuned hyper-parameters on the validation sets of the document classification collections, resulting in the following configuration (unless otherwise noted).

Figure 5: System pipelines for HIERNET and MILNET showing 4 distinct phases for sentiment analysis.

For the CNN segment encoder, we used window sizes of 3, 4, and 5 words, with 100 feature maps per window size, resulting in 300-dimensional segment vectors. The GRU hidden vector dimensions for each direction were set to 50 and the attention vector dimensionality to 100. We used L2-normalization and dropout to regularize the softmax classifiers, and additional dropout on the internal GRU connections.

Real-valued polarity scores produced by the two models are mapped to discrete labels using two appropriate thresholds t_1, t_2 ∈ [−1, 1], so that a segment s is classified as negative if polarity(s) < t_1, positive if polarity(s) > t_2, or neutral otherwise.³ To evaluate performance, we use macro-averaged F1, which is unaffected by class imbalance. We select optimal thresholds using 10-fold cross-validation and report mean scores across folds.

The fully-supervised convolutional segment classifier (Seg-CNN) uses the same window size and feature map configuration as our segment encoder. Seg-CNN was trained on SPOT using segment labels directly and 10-fold cross-validation (identical folds as in our main models). Seg-CNN is not directly comparable to MILNET (or HIERNET) due to differences in supervision type (segment vs. document labels) and training size (1K–2K segment labels vs. ∼250K document labels). However, the comparison is indicative of the utility of fine-grained sentiment predictors that do not rely on expensive segment-level annotations.

³The discretization of polarities is only used for evaluation purposes and is not necessary for summary extraction, where we only need a relative ranking of segments.
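A sketch of the evaluation mapping just described, assuming polarity scores in [−1, 1]; the thresholds below are illustrative, whereas the paper selects them by 10-fold cross-validation:

```python
import numpy as np
from sklearn.metrics import f1_score

def discretize(scores, t1, t2):
    """Map real-valued polarity scores to {negative, neutral, positive}
    using two thresholds t1 < t2, as in Section 5.3."""
    scores = np.asarray(scores)
    labels = np.full(len(scores), "neutral", dtype=object)
    labels[scores < t1] = "negative"
    labels[scores > t2] = "positive"
    return labels

gold = np.array(["negative", "neutral", "positive", "positive"])
pred = discretize([-0.7, -0.05, 0.4, 0.9], t1=-0.2, t2=0.2)
print(f1_score(gold, pred, average="macro"))  # macro-averaged F1
```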

6 Results

We evaluated models in two ways. We first assessed their ability to classify segment polarity in reviews using the newly created SPOT dataset and, additionally, the sentence corpora of Kotzias et al. (2015). Our second suite of experiments focused on opinion extraction: we conducted a judgment elicitation study to determine whether extracts produced by MILNET are useful and of higher quality compared to HIERNET and other baselines. We were also interested to find out whether EDUs provide a better basis for opinion extraction than sentences.

6.1 Segment Classification

Table 3 summarizes our results. The first block in the table reports the performance of the majority class baseline. The second block considers models that do not utilize segment-level predictions, namely HIERNET, which assigns polarity scores to segments using its document-level predictions, as well as the variant of MILNET which similarly uses document-level predictions only (Equation (11)). In the third block, MILNET's segment-level predictions are used. Each block further differentiates between three levels of attention integration, as previously described. The final block shows the performance of SO-CAL and the Seg-CNN classifier.

                        Yelp'13seg            IMDBseg
Method                  Sent      EDU         Sent      EDU
Majority                19.02†    17.03†      18.32†    21.52†
Document
  HIERNETavg            54.21†    50.90†      46.99†    49.02†
  HIERNET               55.33†    51.43†      48.47†    49.70†
  HIERNETgt             56.64†    58.75       62.12     57.38†
  MILNETavg             58.43†    48.63†      53.40†    51.81†
  MILNET                52.73†    53.59†      48.75†    47.18†
  MILNETgt              59.74†    59.47       61.83†    58.24†
Segment
  MILNETavg             51.79†    46.77†      45.69†    38.37†
  MILNET                61.41     59.58       59.99†    57.71†
  MILNETgt              63.35     59.85       63.97     59.87
SO-CAL                  56.53†    58.16†      53.21†    60.40
Seg-CNN                 56.18†    59.96       58.32†    62.95†

Table 3: Segment classification results (in macro-averaged F1). † indicates that the system in question is significantly different from MILNETgt (approximate randomization test (Noreen, 1989), p < 0.05).

Neutral Segments      Non-Gtd   Gated
Sent  HIERNET         4.67      36.60
      MILNET          39.61     44.60
EDU   HIERNET         2.39      55.38
      MILNET          52.10     56.60

Table 4: F1 scores for neutral segments (Yelp'13).

Method       Yelp    IMDB
GICF         86.3    86.0
GICF_HN      92.9    86.5
GICF_MN      93.2    91.0
MILNET       94.0    91.9

Table 5: Accuracy scores on the sentence classification datasets introduced in Kotzias et al. (2015).

When considering models that use document-level supervision, MILNET with gated, segment-specific polarities obtains the best classification performance across all four datasets. Interestingly, it performs comparably to Seg-CNN, the fully-supervised segment classifier, which provides additional evidence that MILNET can effectively identify segment polarity without the need for segment-level annotations. Our model also outperforms the strong SO-CAL baseline in all but one dataset, which is remarkable given the expert knowledge and linguistic information used to develop the latter. Document-level polarity predictions result in lower classification performance across the board. Differences between the standard hierarchical and multiple instance networks are less pronounced in this case, as MILNET loses the advantage of producing segment-specific sentiment predictions. Models without attention perform worse in most cases. The use of gated polarities benefits all model configurations, indicating the method's ability to selectively focus on segments with significant sentiment cues.

We further analyzed the polarities assigned by MILNET and HIERNET to positive, negative, and neutral segments. Figure 6 illustrates the distribution of polarity scores produced by the two models on the Yelp'13 dataset (sentence segmentation). In the case of negative and positive sentences, both models demonstrate appropriately skewed distributions. However, the neutral class appears to be particularly problematic for HIERNET, where polarity scores are scattered across a wide range of values. In contrast, MILNET is more successful at identifying neutral sentences, as its corresponding distribution has a single mode near zero. Attention gating addresses this issue by moving the polarity scores of sentiment-neutral segments towards zero. This is illustrated in Table 4, where we observe that gated variants of both models do a better job at identifying neutral segments. The effect is very significant for HIERNET, while MILNET benefits slightly and remains more effective overall. Similar trends were observed in all four SPOT datasets.

[Figure 6 shows density plots of predicted polarity scores in [−1, 1] for negative, neutral, and positive sentences, one row per model (HierNet, MILNet).]

Figure 6: Distribution of predicted polarity scores across three classes (Yelp'13 sentences).

In order to examine the effect of training size, we trained multiple models using subsets of the original document collections.

We trained on five random subsets for each training size, ranging from 100 documents to the full training set, and tested segment classification performance on SPOT. The results, averaged across trials, are presented in Figure 7. With the exception of the IMDB EDU-segmented dataset, MILNET only requires a few thousand training documents to outperform the supervised Seg-CNN. HIERNET follows a similar curve, but is inferior to MILNET. A reason for MILNET's inferior performance on the IMDB corpus (EDU-split) can be low-quality EDUs, due to the noisy and informal style of language used in IMDB reviews.

[Figure 7 comprises four panels — Yelp Sentences, Yelp EDUs, IMDB Sentences, IMDB EDUs — plotting macro-F1 (roughly 40–70) against training size (0–250,000 documents) for MILNet, HierNet, and Seg-CNN.]

Figure 7: Performance of HIERNETgt and MILNETgt for varying training sizes.

Finally, we compared MILNET against the GICF model (Kotzias et al., 2015) on their Yelp and IMDB sentence sentiment datasets.⁴ Their model requires sentence embeddings from a pre-trained neural model. We used the hierarchical CNN from their work (Denil et al., 2014) and, additionally, pre-trained HIERNET and MILNET sentence embeddings. The results in Table 5 show that MILNET outperforms all variants of GICF. Our models also seem to learn better sentence embeddings, as they improve GICF's performance on both collections.

⁴GICF only handles binary labels, which makes it unsuitable for the full-scale comparisons in Table 3. Here, we binarize our training datasets and use same-sized sentence embeddings for all four models (R^150 for Yelp, R^72 for IMDB).

6.2 Opinion Extraction

In our opinion extraction experiments, AMT workers (all native English speakers) were shown an original review and a set of extractive, bullet-style summaries produced by competing systems using a 30% compression rate. Participants were asked to decide which summary was best according to three criteria: Informativeness (Which summary best captures the salient points of the review?), Polarity (Which summary best highlights positive and negative comments?), and Coherence (Which summary is more coherent and easier to read?). Subjects were allowed to answer "Unsure" in cases where they could not discriminate between summaries. We used all reviews from our SPOT dataset and collected three responses per document. We ran four judgment elicitation studies: one comparing HIERNET and MILNET when summarizing reviews segmented as sentences; a second one comparing the two models with EDU segmentation; a third which compares EDU- and sentence-based summaries produced by MILNET; and a fourth where EDU-based summaries from MILNET were compared to a LEAD (the first N words from each document) and a RANDOM (random EDUs) baseline.

Method          Informativeness   Polarity   Coherence
HIERNETsent     43.7              33.6       43.5
MILNETsent      45.7              36.7       44.6
Unsure          10.7              29.6       11.8

HIERNETedu      34.2†             28.0†      48.4
MILNETedu       53.3              61.1       45.0
Unsure          12.5              11.0       6.6

MILNETsent      35.7†             33.4†      70.4†
MILNETedu       55.0              51.5       23.7
Unsure          9.3               15.2       5.9

LEAD            34.0              19.0†      40.3
RANDOM          22.9†             19.6†      17.8†
MILNETedu       37.4              46.9       33.3
Unsure          5.7               14.6       8.6

Table 6: Human evaluation results (in percentages). † indicates that the system in question is significantly different from MILNET (sign-test, p < 0.01).

Table 6 summarizes our results, showing the proportion of participants that preferred each system.

[Rating: ★★★★] As with any family-run hole in the wall, service can be slow. What the staff lacked in speed, they made up for in charm. The food was good, but nothing wowed me. I had the Pierogis while my friend had swedish meatballs. Both dishes were tasty, as were the sides. One thing that was disappointing was that the food was a a little cold (lukewarm). The restaurant itself is bright and clean. I will go back again when i feel like eating outside the box.

Extracted via HIERNETgt (EDU-based):
(0.13) [+0.26] The food was good +
(0.10) [+0.26] but nothing wowed me. +
(0.09) [+0.26] The restaurant itself is bright and clean +
(0.13) [+0.26] Both dishes were tasty +
(0.18) [+0.26] I will go back again +

Extracted via MILNETgt (EDU-based):
(0.16) [+0.12] The food was good +
(0.12) [+0.43] The restaurant itself is bright and clean +
(0.19) [+0.15] I will go back again +
(0.09) [–0.07] but nothing wowed me. −
(0.10) [–0.10] the food was a a little cold (lukewarm) −

Extracted via HIERNETgt (sentence-based):
(0.12) [+0.23] Both dishes were tasty, as were the sides +
(0.18) [+0.23] The food was good, but nothing wowed me +
(0.22) [+0.23] One thing that was disappointing was that the food was a a little cold (lukewarm) +

Extracted via MILNETgt (sentence-based):
(0.13) [+0.26] Both dishes were tasty, as were the sides +
(0.20) [+0.59] I will go back again when I feel like eating outside the box +
(0.18) [–0.12] The food was good, but nothing wowed me −

(number): attention weight   [number]: non-gated polarity score   text+: extracted positive opinion   text−: extracted negative opinion

Figure 8: Example EDU- and sentence-based opinion summaries produced by HIERNETgt and MILNETgt.

The first block in the table shows a slight preference for MILNET across criteria. The second block shows significant preference for MILNET against HIERNET on informativeness and polarity, whereas HIERNET was more often preferred in terms of coherence, although the difference is not statistically significant. The third block compares sentence and EDU summaries produced by MILNET. EDU summaries were perceived as significantly better in terms of informativeness and polarity, but not coherence. This is somewhat expected, as EDUs tend to produce more terse and telegraphic text and may seem unnatural due to segmentation errors. In the fourth block we observe that participants find MILNET more informative and better at distilling polarity compared to the LEAD and RANDOM (EDUs) baselines. We should point out that the LEAD system is not a strawman; it has proved hard to outperform by more sophisticated methods (Nenkova, 2005), particularly in the newswire domain.

Example EDU- and sentence-based summaries produced by the gated variants of HIERNET and MILNET are shown in Figure 8, with attention weights and polarity scores of the extracted segments shown in round and square brackets, respectively. For both granularities, HIERNET's positive document-level prediction results in a single polarity score assigned to every segment, further adjusted using the corresponding attention weights. The extracted segments are informative, but fail to capture the negative sentiment of some segments. In contrast, MILNET is able to detect positive and negative snippets via individual segment polarities. Here, EDU segmentation produced a more concise summary with a clearer grouping of positive and negative snippets.

7 Conclusions

In this work, we presented a neural network model for fine-grained sentiment analysis within the framework of multiple instance learning. Our model can be trained on large-scale sentiment classification datasets, without the need for segment-level labels. As a departure from the commonly used vector-based composition, our model first predicts sentiment at the sentence or EDU level and subsequently combines predictions up the document hierarchy. An attention-weighted polarity scoring technique provides a natural way to extract sentiment-heavy opinions. Experimental results demonstrate the superior performance of our model against more conventional neural architectures. Human evaluation studies also show that MILNET opinion extracts are preferred by participants and are effective at capturing informativeness and polarity, especially when using EDU segments. In the future, we would like to focus on multi-document, aspect-based extraction (Cao et al., 2017) and on ways of improving the coherence of our summaries by taking into account more fine-grained discourse information (Daumé III and Marcu, 2002).

Acknowledgments

The authors gratefully acknowledge the support of the European Research Council (award number 681760). We thank TACL action editor Ani Nenkova and the anonymous reviewers whose feedback helped improve the present paper, as well as Charles Sutton, Timothy Hospedales, and members of EdinburghNLP for helpful discussions and suggestions.

References

Stuart Andrews and Thomas Hofmann. 2004. Multiple instance learning via disjunctive programming boosting. In Advances in Neural Information Processing Systems 16, pages 65–72. Curran Associates, Inc.

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the 5th Conference on International Language Resources and Evaluation, volume 10, pages 2200–2204, Valletta, Malta.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, California, USA.

Parminder Bhatia, Yangfeng Ji, and Jacob Eisenstein. 2015. Better document-level sentiment analysis from RST discourse parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2212–2218, Lisbon, Portugal.

Ziqiang Cao, Wenjie Li, Sujian Li, and Furu Wei. 2017. Improving multi-document summarization via text classification. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 3053–3058, San Francisco, California, USA.

Peter Carbonetto, Gyuri Dorkó, Cordelia Schmid, Hendrik Kück, and Nando de Freitas. 2008. Learning to recognize objects with little supervision. International Journal of Computer Vision, 77(1):219–237.

Giuseppe Carenini, Raymond Ng, and Adam Pauls. 2006. Multidocument summarization of evaluative text. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 305–312, Trento, Italy.

Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski. 2003. Building a discourse-tagged corpus in the framework of rhetorical structure theory. In Current and New Directions in Discourse and Dialogue, pages 85–112. Springer.

Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 484–494, Berlin, Germany.

Timothee Cour, Ben Sapp, and Ben Taskar. 2011. Learning from partial labels. Journal of Machine Learning Research, 12(May):1501–1536.

Hal Daumé III and Daniel Marcu. 2002. A noisy-channel model for document compression. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 449–456, Philadelphia, Pennsylvania, USA.

Misha Denil, Alban Demiraj, and Nando de Freitas. 2014. Extraction of salient sentences from labelled documents. Technical report, University of Oxford.

Giuseppe Di Fabbrizio, Amanda Stent, and Robert Gaizauskas. 2014. A hybrid approach to multi-document summarization of opinions in reviews. In Proceedings of the 8th International Natural Language Generation Conference (INLG), pages 54–63, Philadelphia, Pennsylvania, USA.

Qiming Diao, Minghui Qiu, Chao-Yuan Wu, Alexander J. Smola, Jing Jiang, and Chong Wang. 2014. Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS). In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 193–202, New York, NY, USA.

Thomas G. Dietterich, Richard H. Lathrop, and Tomás Lozano-Pérez. 1997. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1):31–71.

Wei Vanessa Feng and Graeme Hirst. 2012. Text-level discourse parsing with rich linguistic features. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 60–68, Jeju Island, Korea.

Kavita Ganesan, ChengXiang Zhai, and Jiawei Han. 2010. Opinosis: A graph based approach to abstractive summarization of highly redundant opinions. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 340–348, Beijing, China.

Shima Gerani, Yashar Mehdad, Giuseppe Carenini, Raymond T. Ng, and Bita Nejat. 2014. Abstractive summarization of product reviews using discourse structure. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1602–1613, Doha, Qatar.

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 541–550, Portland, Oregon, USA.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177, Seattle, Washington, USA.

Rie Johnson and Tong Zhang. 2015a. Effective use of word order for text categorization with convolutional neural networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 103–112, Denver, Colorado, USA.

Rie Johnson and Tong Zhang. 2015b. Semi-supervised convolutional neural networks for text categorization via region embedding. In Advances in Neural Information Processing Systems 28, pages 919–927. Curran Associates, Inc.

Jim Keeler and David E. Rumelhart. 1992. A self-organizing integrated segmentation and recognition neural net. In Advances in Neural Information Processing Systems 4, pages 496–503. Morgan-Kaufmann.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1746–1751, Doha, Qatar.

Dimitrios Kotzias, Misha Denil, Nando De Freitas, and Padhraic Smyth. 2015. From group to individual labels using deep features. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 597–606, Sydney, Australia.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, pages 1188–1196, Beijing, China.

Kevin Lerman, Sasha Blair-Goldensohn, and Ryan McDonald. 2009. Sentiment summarization: Evaluating and learning user preferences. In Proceedings of the 12th Conference of the European Chapter of the ACL, pages 514–522, Athens, Greece.

Junyi Jessy Li, Kapil Thadani, and Amanda Stent. 2016. The role of discourse units in near-extractive summarization. In Proceedings of the SIGDIAL 2016 Conference, the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 137–147, Los Angeles, California, USA.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text – Interdisciplinary Journal for the Study of Discourse, 8(3):243–281.

Oded Maron and Aparna Lakshmi Ratan. 1998. Multiple-instance learning for natural scene classification. In Proceedings of the 15th International Conference on Machine Learning, volume 98, pages 341–349, San Francisco, California, USA.

Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 3075–3081, San Francisco, California.

Ani Nenkova. 2005. Automatic text summarization of newswire: Lessons learned from the document understanding conference. In Proceedings of the 20th AAAI, pages 1436–1441, Pittsburgh, Pennsylvania, USA.

Eric Noreen. 1989. Computer-intensive Methods for Testing Hypotheses: An Introduction. Wiley.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 115–124. Association for Computational Linguistics.

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, pages 79–86, Pittsburgh, Pennsylvania, USA.

Nikolaos Pappas and Andrei Popescu-Belis. 2014. Explaining the stars: Weighted multiple-instance learning for aspect-based sentiment analysis. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 455–466, Doha, Qatar.

Nikolaos Pappas and Andrei Popescu-Belis. 2017. Explicit document modeling through weighted multiple-instance learning. Journal of Artificial Intelligence Research, 58:591–626.

Lizhen Qu, Georgiana Ifrim, and Gerhard Weikum. 2010. The bag-of-opinions method for review rating prediction from sparse text patterns. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 913–921, Beijing, China.

Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 151–161, Edinburgh, Scotland, UK.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA.

Maite Taboada, Julian Brooke, Milan Tofiloski, Kimberly Voll, and Manfred Stede. 2011. Lexicon-based methods for sentiment analysis. Computational Linguistics, 37(2):267–307.

Oscar Täckström and Ryan McDonald. 2011. Discovering fine-grained sentiment with latent variable structured prediction models. In Proceedings of the 39th European Conference on Information Retrieval, pages 368–374, Aberdeen, Scotland, UK.

Duyu Tang, Bing Qin, and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1422–1432, Lisbon, Portugal.

Peter D. Turney. 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 417–424, Pittsburgh, Pennsylvania, USA.

Sida Wang and Christopher D. Manning. 2012. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, pages 90–94, Jeju Island, Korea.

Xiu-Shen Wei, Jianxin Wu, and Zhi-Hua Zhou. 2014. Scalable multi-instance learning. In Proceedings of the IEEE International Conference on Data Mining, pages 1037–1042, Shenzhen, China.

Nils Weidmann, Eibe Frank, and Bernhard Pfahringer. 2003. A two-level learning method for generalized multi-instance problems. In Proceedings of the 14th European Conference on Machine Learning, pages 468–479, Dubrovnik, Croatia.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2):165–210.

Jiajun Wu, Yinan Yu, Chang Huang, and Kai Yu. 2015. Deep multiple instance learning for image classification and auto-annotation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3460–3469, Boston, Massachusetts, USA.

Rui Xia and Chengqing Zong. 2010. Exploring the use of word relation features for sentiment classification. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 1336–1344, Beijing, China.

Xin Xu and Eibe Frank. 2004. Logistic regression and boosting for labeled bags of instances. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 272–281. Springer-Verlag.

Wei Xu, Alan Ritter, Chris Callison-Burch, William B. Dolan, and Yangfeng Ji. 2014. Extracting lexically divergent paraphrases from Twitter. Transactions of the Association for Computational Linguistics, 2:435–448.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, San Diego, California, USA.

Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. CoRR, abs/1212.5701.

Qi Zhang, Sally A. Goldman, Wei Yu, and Jason E. Fritts. 2002. Content-based image retrieval using multiple-instance learning. In Proceedings of the 19th International Conference on Machine Learning, volume 2, pages 682–689, Sydney, Australia.

Cha Zhang, John C. Platt, and Paul A. Viola. 2006. Multiple instance boosting for object detection. In Advances in Neural Information Processing Systems 18, pages 1417–1424. MIT Press.

Zhi-Hua Zhou, Yu-Yin Sun, and Yu-Feng Li. 2009. Multi-instance learning by treating instances as non-i.i.d. samples. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1249–1256, Montréal, Québec.