Constructiveness-Based Product Review Classification

Ugo Loobuyck

Uppsala University
Department of Linguistics and Philology
Master Programme in Language Technology
Master's Thesis in Language Technology, 30 ECTS credits
June 8, 2020

Supervisor: Prof. Joakim Nivre, Uppsala University

Abstract

Promoting constructiveness in online comment sections is an essential step towards making the internet a more productive place. On online marketplaces, customers often have the opportunity to voice their opinion and relate their experience with a given product. In this thesis, we investigate the possibility of modeling constructiveness in product reviews in order to promote the most informative and argumentative customer feedback. We develop a new 4-class constructiveness taxonomy based on heuristics and specific categorical criteria. We use this taxonomy to annotate 4000 Amazon customer reviews as our training set, referred to as the Corpus for Review Constructiveness (CRC). In addition to the 4-class constructiveness tag, we include a binary tag to compare modeling performance with previous work. We train and test several computational models such as Bidirectional Encoder Representations from Transformers (BERT), a stacked bidirectional LSTM and a Gradient Boosting Machine. We demonstrate our annotation scheme's reliability with a set of inter-annotator agreement experiments, and show that good levels of performance can be reached in both the multiclass setting (0.69 F1 and 57% error reduction over the baseline) and the binary setting (0.85 F1 and 71% error reduction). Different features are evaluated individually and in combination. Moreover, we compare the advantages, downsides and performance of both feature-based and neural network models. Finally, these models trained on CRC are tested on out-of-domain data (news article comments) and shown to be nearly as proficient as on in-domain data. This work extends constructiveness modeling to a new type of data and provides a new non-binary taxonomy for data labeling.

Contents

Preface

1. Introduction
   1.1. Purpose
   1.2. Outline

2. Background
   2.1. Constructiveness
   2.2. Toxicity
   2.3. Machine Learning
        2.3.1. Classic feature-based models
        2.3.2. Neural networks

3. Data
   3.1. Data Gathering
        3.1.1. Training Data
        3.1.2. Test Data
   3.2. Possible Shortcomings

4. Annotation
   4.1. Annotation Scheme
        4.1.1. Four-class scale (CRC_multi)
        4.1.2. Two-class scale (CRC_bin)
   4.2. Annotation agreement
        4.2.1. Experiment 1
        4.2.2. Experiment 2
   4.3. Annotated Data Sets

5. Experimental Methodology
   5.1. Preprocessing
   5.2. Baseline
   5.3. Feature-based Model
        5.3.1. Model
        5.3.2. Features
        5.3.3. Hyperparameters
   5.4. Neural Network Models
        5.4.1. Stacked Bi-LSTM Network
        5.4.2. BERT
   5.5. Evaluation

6. Results and Discussion
   6.1. Results
   6.2. Multiclass vs. Binary Classification
   6.3. Feature Performance
   6.4. Feature-Based vs. Neural Networks
   6.5. In- vs. Out-of-Domain

7. Conclusion

Appendices

A. Amazon categories in CRC
   A.1. Amazon official Dataset 1995-2015
   A.2. Amazon Review Data 1996-2018 (University of California San Diego)

B. Examples of review annotations with 4-class scheme
   B.1. Class A
   B.2. Class B
   B.3. Class C
   B.4. Class D

C. Stacked Bidirectional LSTM Architecture

Preface

Constructiveness is the human way.
Dalai Lama, A Policy of Kindness

Beyond the technical aspects of the master's thesis, from which I learned a lot, and the challenges that such a large project implies, the subject I chose offered me a number of unexpected key takeaways. I have learned how to give better feedback, be more positive and measure the impact of words, but also how to receive feedback and capitalize on it. I would like to thank my supervisor, Joakim Nivre, for all the judicious advice he gave me and for bringing a lot of experience into my project. Many thanks to Clara, who has shown tremendous support during these four months, in the form of advice, moral support and proofreading, and without whom the project would have been much harder. Last but not least, thanks to my friends and family, from whom I was far away during the project, for their remote support and many encouragements.

1. Introduction

Constructiveness has always been a core component of improvement-oriented human communication, and plays a key role in feedback systems. Instead of giving feedback by simply pointing out mistakes or attempting to hurt, constructive feedback uses argumentation and respectful discourse techniques to turn past mistakes into future improvements. Similarly, there are many ways of giving positive feedback, but constructive positive feedback is usually supported by relevant examples and details. With the rapid expansion of online communities in the past decades, moderation tools have been continuously developed to face the spread of toxicity and hate in such places. It is important for these media and businesses that the space specifically designed for flows of ideas, constructive feedback and respectful discussions is not polluted by a minority of disrupters. On the other side of this perpetual fight to make the internet a safer place, the recent NLP task of constructiveness analysis has arisen, addressing the promotion of informative online content that aims for general improvement. While it is critical to ensure that a minimal standard of respect is followed by users when posting feedback (Reich, 2011), the benefit of boosting the exposure of constructive comments is twofold: first, it can have a positive impact on the way other users act within the same environment, through mimetic isomorphism (DiMaggio and Powell, 1983); second, it allows users to have immediate access to the most informative content by means of filtering. The latter is perfectly illustrated by the New York Times newspaper,1 which employs a team of human moderators to pick and highlight remarkably valuable article comments on a broad range of subjects, referred to as the NYT Picks (Diakopoulos, 2015). Ghose and Ipeirotis (2006) also designed a tool that promotes helpful reviews on Amazon,2 usable by product manufacturers to retrieve the most insightful comments and by customers to have rapid access to missing information, for example.

1.1. Purpose

Most of the previous work in this area has focused on the analysis and classification of online news article comments. This type of data is convenient in that users often express their points of view and opinions with a certain degree of constructiveness. We take a different approach: we believe that constructiveness modeling can also be performed on product feedback, where customers relate their personal experience with items they have acquired. Although the intention clearly differs from news article comments, which are more oriented towards inter-user discussion, product reviews still exhibit a wide range of explicit argumentation and informativeness features. Moreover, most of the constructiveness classification work has used a binary framework, i.e. constructive or not constructive. This setting is broadly used since it is the simplest form of classification, and usually offers a good balance between realism and performance. For instance, sentiment analysis is often performed in a binary

1 https://www.nytimes.com
2 https://www.amazon.com

setting although it is known that more target classes can legitimately be added, e.g. "neutral". We believe that classifying the constructiveness of user inputs in a binary framework is not representative of the concept of constructiveness itself. For instance, the three product reviews below show increasingly constructive features:

(a). That just sucks!!

(b). My daughter loves these paints! She got them for Christmas and uses them almost daily.

(c). This is a rugged great lighting tripod that will support a fair amount of weight safely. The only issue I had was there wasn't detailed instructions on how to set it up. One of the extensions was actually inside a part of the unit and it took some time for me to actually find that part to make it all come together. Overall great product just wished they had done a little bit better on the instructions.

This setup raises a few questions: where does the threshold between constructive and non-constructive lie? Review (b) is clearly less destructive than review (a), but also less constructive than review (c), so where does it stand in a binary setting? To address this problem, we introduce a new annotation scheme built on four ranked categories representing constructiveness levels. Our taxonomy, which partly relies on a set of categorical heuristics, allows a more thorough coverage of the constructiveness spectrum and tends to fill in the previously mentioned grey areas. Our 4-class scale is also interpretable on the binary level, as we can divide it into two distinct classes, namely constructive and non-constructive, therefore allowing us to compare performance between the binary scale and our new multiclass scale while relying on the same scheme. A recurrent issue in this field of study is the lack of annotated data. To our knowledge, the only available English data annotated for constructiveness derives from news articles (comments or comment threads) in a binary framework (Kolhatkar et al., 2020; Napoles et al., 2017). A major contribution of this thesis is the Corpus for Review Constructiveness (CRC),3 a data set composed of 4000 Amazon reviews annotated by hand with our 4-class scheme. In this thesis we (1) show that our annotation scheme is reliable and that constructiveness classification shows comparable performance in multiclass and binary settings, (2) empirically determine the most and least effective features for constructiveness modeling, (3) evaluate the performance of feature-based and neural network models in both multiclass and binary settings, and for both in- and out-of-domain data, and (4) show that models trained on product reviews can detect patterns of constructiveness in out-of-domain data.

1.2. Outline

First, in Chapter 2 we present constructiveness-related work as well as some of the state-of-the-art machine learning tools available for general text classification. We also discuss spam, hate speech and toxicity detection. In Chapter 3, we detail our data selection process for both experimental training and testing, and discuss our data set's advantages and shortcomings. In Chapter 4, we introduce our annotation scheme for review constructiveness classification, and illustrate our various criteria with precise examples. We discuss our various experimental setups, methods and validation results in Chapter 5. In Chapter 6, we present our results and discuss the

3The annotations are available on the project’s GitHub: https://github.com/ugolbck/AFCC.

different outcomes in order to answer our research questions. We further analyze our results both quantitatively and qualitatively, in order to gain insights into what can be improved in future efforts. Finally, we conclude and reflect on this thesis in Chapter 7.

2. Background

2.1. Constructiveness

Even though constructiveness classification is a relatively new addition to the range of available NLP tasks, the notion of constructiveness is at least as old as the notion of feedback, and the two are paired. Constructiveness is a well-known yet subjective concept that constitutes the very core of our ability to use discussion to improve something that has been done, or to bring thoughtful input to a conversation in order to make it better, for example. The concept of constructiveness is broadly used in several areas of work such as teaching and learning (Ovando, 1994), social psychology (Dost and Yagmurlu, 2008) or journalism (Diakopoulos, 2015; Haagerup, 2017). We focus on the analysis of constructiveness in an online feedback context, on which a decent amount of work has already been done. This small area of research aims at gaining insights into the relations among and importance of constructiveness features. To date, several definitions have been used. Niculae and Danescu-Niculescu-Mizil (2016) perform experiments to determine the impact of good (or bad) conversations on the outcome of a multiplayer guessing game where players form a team and have to guess a point on a map. Conversations resulting in a better final guess than the original individual player guesses are judged "constructive", in the sense that they were fruitful and show signs of reasoning. Napoles et al. (2017) define constructive conversations as ERICs (Engaging, Respectful, and/or Informative Conversations) and annotate a large corpus of comment threads from Yahoo News1 articles. They perform a series of experiments in order to detect markers of qualitative online conversations with respect to ERIC features. Kolhatkar and Taboada (2017a) present the results of a survey to get different inputs on the matter, and combined the answers into this definition: "Constructive comments intend to create a civil dialogue through remarks that are relevant to the article and not intended to merely provoke an emotional response. They are typically targeted to specific points and supported by appropriate evidence". Similarly, Fujita et al. (2019) rely on a survey to get insights about what constitutes a constructive comment. Based on the answers, they set a pre-condition and several main conditions to determine if a comment is constructive during the annotation process. A currently popular way of creating new annotated data for a task like constructiveness analysis is crowd-sourcing: a variable number of human annotators are given an annotation scheme along with a few dummy questions to train on, before annotating real data. The resulting constructiveness annotations therefore mostly rely on the clarity of the annotation instructions and the reliability of the researchers' scheme. Previous research on constructiveness using this method includes Fujita et al. (2019), Kolhatkar et al. (2020, 2019), and Napoles et al. (2017). In our research work, we diverge from this technique and annotate the data ourselves, with the aim of reducing the potential noise that might be introduced by many annotators. We discuss our method's reliability in Chapter 4. Several papers tackle constructiveness-based classification of news article thread comments and individual comments. To our knowledge, all research work so far

1https://news.yahoo.com

has exploited supervised learning for classification purposes. It would however be interesting to see constructiveness analysis as a regression problem, and assign a constructiveness grade to new input (as shown in the Perspective API2 for toxicity detection). In Kolhatkar and Taboada (2017a), the authors train a bidirectional LSTM-based model (Hochreiter and Schmidhuber, 1997) using GloVe embeddings (Pennington et al., 2014) on comments from constructive comment threads from the Yahoo News Annotated Comments Corpus (YNACC) (Napoles et al., 2017) as constructive instances and non-constructive instances from the Argument Extraction Corpus (AEC) (Swanson et al., 2015). They obtain 72.56% accuracy on their own test set composed of news article comments. Their follow-up work (Kolhatkar and Taboada, 2017b) uses NYT Picks as constructive instances and YNACC comments from non-constructive threads as non-constructive examples. Their experiments yield a 0.81 F1-score with a linear SVM trained on a set of features (argumentation, text quality, named entities), suggesting that NYT Picks are good representatives of constructiveness. In their newest work (Kolhatkar et al., 2020), which is contemporaneous with this thesis, the authors annotate 12000 comments from a Canadian online newspaper, forming the Constructive Comments Corpus (C3) (as an enhancement of the SFU Opinion and Comments Corpus (Kolhatkar et al., 2019)). They perform binary classification with neural networks and feature-based models. Their best result is 0.93 F1 with a biLSTM network, BERT (Devlin et al., 2018), and a length-based model, all trained and tested on C3. Fujita et al. (2019) introduced the notion of constructiveness score (C-score). During the annotation process, they request binary annotations from crowdworkers (i.e. constructive or not constructive), and then normalize the number of positive votes into a constructiveness-based ranking. They demonstrate that constructiveness in comments is not correlated with user feedback (i.e. the number of likes on the comment). They perform pairwise ranking experiments and obtain NDCG@10=78.25 and precision@10=42.2. We adhere to the idea that constructiveness is not a binary concept in practice and should be ranked. It is necessary to dive into the analysis of many types of features (lexical, semantic, syntactic, etc.) in order to capture the essence of what makes a piece of text constructive or not. Measuring the impact of each feature in that piece of text can provide valuable insights for constructiveness modeling experiments. Constructiveness features are discussed in several papers. In this work, we draw on them and try to infer the best possible set of relevant constructiveness features for product reviews. Diakopoulos (2015) finds that a set of editorial criteria (argument quality, criticality, internal coherence, personal experience, readability, length and thoughtfulness) was mainly responsible for the selection of comments as NYT Picks. Park et al. (2016) also articulate a set of constructiveness-related criteria to discriminate NYT Picks (article relevance, conversational relevance, length, readability, recommendation) and achieve 13% precision and 60% recall on a highly skewed data set with an SVM. Kolhatkar and Taboada (2017b) find a strong association in individual article comments between constructiveness and the presence of argumentation discourse markers.
In their experiments, they use a large set of these markers as features for classification: discourse connectives, reasoning verbs, modals, abstract nouns and stance adverbials. They also use more classic features such as length, NER, text quality, word count and TF-IDF. Kolhatkar et al. (2020) use similar features and combine them with additional features provided by the Perspective API (attack on the author, toxicity, identity hate, text coherence, etc.). They show that length is a major component of constructiveness, which greatly influences feature-based models, due to the strong

2http://www.perspectiveapi.com

correlation between length features and the constructiveness label. This correlation guarantees good performance during the testing phase but results in an "easy-to-fool" model that mostly relies on the length of the comment to assign a label to it. They further discuss the possibility of using neural networks as a solution to that issue, as they do not suffer as much from such strongly correlated features and usually produce sturdier models.

2.2. Toxicity

Toxicity detection can be seen as both the antithesis of constructiveness classification and its necessary complement: this classification task primarily aims at filtering out any text that could constitute an offense, inconvenience or direct attack on other entities. In our case, we try to promote argumentation in text and point out constructive reviews. Therefore, and even though the two processes are opposed, we believe that these two tasks should be treated together in order to cover the full spectrum of argumentation. This intuition is supported by several papers: Gautam and Taboada (2019) show that depending on the data source, the number of constructive comments can be consistently higher than the number of non-constructive comments as the level of toxicity increases. They also show that unlike constructiveness, toxicity seems to be evenly distributed across topics. Their most interesting finding is that a large proportion of constructive comments do contain a small amount of toxicity (~15%), and that for low levels of toxicity, there are about twice as many non-constructive comments as constructive ones (we will discuss the latter phenomenon later on in this thesis). A similar observation is made in Napoles et al. (2017), where the authors discuss the different correlations resulting from their annotation scheme on news article comments (YNACC). They find that constructiveness is positively correlated with informativeness and persuasiveness, but also with controversy. Indeed, even though controversy is shown to be correlated with flamewars, it seems to be an essential part of constructive comment threads. They also show that non-constructive threads are correlated with non-persuasiveness, negativity and mean comments, which is less surprising. Kolhatkar and Taboada (2017a) also investigate the proportion and level of toxicity in constructive and non-constructive comments, and find a similar distribution between the two categories. They discuss examples of comments in which both constructiveness and toxicity coexist, and conclude that they are therefore orthogonal concepts. In Kolhatkar et al. (2020), aggregated constructiveness scores are compared to toxicity scores, resulting in near-null correlation coefficients (Pearson's r = -0.02, Spearman's ρ = 0.04 and Kendall's τ = -0.04), indicating that the two dimensions are essentially uncorrelated. All the research elements above suggest that although constructiveness and toxicity represent opposed concepts, features of both should be considered to generalize efficiently to different kinds of user input data.

sets using Vowpal Wabbit.3 Another significant effort is made by the Conversation AI4 research group, whose main purpose is to improve and understand machine learning algorithms that counter conversational toxicity. They regularly publish annotated data sets with toxicity markers, as well as convenient API tools such as the Perspective API introduced in this chapter, which allows live classification of input text in terms of toxicity-related features. Several relevant online competitions have also recently been organized to harness community effort and build robust models.5,6,7 Ren and Ji (2017) investigate the detection of spam in product reviews. They use a convolutional neural network (CNN) to capture the semantic information of a sentence and feed the output to a gated recurrent neural network (GRNN) to capture discourse relations and build a document representation. They compare their model to a feature-based logistic regression model and to an SVM feature-based model from Li et al. (2014). The neural model outperforms the SVM model, but they also show that concatenating the discrete and neural features into a single representation vector outperforms both the neural and logistic regression models, suggesting a good complementarity between types of features. Sarcasm classification is also relevant to our study, as sarcasm is commonly seen in non-constructive comments, as shown in Napoles et al. (2017). Gautam and Taboada (2019) and Kolhatkar et al. (2020) both include sarcasm as a component of non-constructiveness in their taxonomies, and Kolhatkar and Taboada (2017a) describe sarcastic article comments as "toxic" in their statistical experiment on the relation between constructiveness and toxicity, and find that "toxic" comments appear about four times more often in non-constructive comments than in constructive ones (5.21% against 1.33%). The matter has been tackled from many angles and using numerous features, as shown in Joshi et al. (2017).

2.3. Machine Learning

Supervised machine learning techniques using feature-based models or neural network models have become the standard approach to many automated NLP tasks, such as sentiment analysis, machine translation or named entity recognition.

2.3.1. Classic feature-based models

Feature-based models have been used extensively for all types of data-driven NLP tasks. A wide variety of models is available for both binary and multiclass classification, e.g. decision trees and random forests (Breiman et al., 1984), Gradient Boosting Machines (GBM) (Friedman, 2001), support vector machines (SVM) (Chang and Lin, 2011; Drucker et al., 1999), logistic regression (Hosmer Jr et al., 2013), perceptron algorithms (Freund and Schapire, 1999), etc. Each classifier has its own characteristics and is more or less effective depending on the type of data and the selected features. Most of them are described in Aggarwal and Zhai (2012). For our constructiveness classification experiments, it is important to consider non-neural models. Indeed, one recurrent issue in this field is the lack of labeled data, which can be problematic for models that train several million parameters. Linear models, which can learn from a small number of pre-selected features, can give us insights into the most and least informative features. In previous work, authors tend to carry out experiments using only one or a

3 https://github.com/VowpalWabbit/vowpal_wabbit
4 https://conversationai.github.io
5 https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
6 https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification
7 https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/overview

few feature-based models (Fujita et al., 2019; Kolhatkar and Taboada, 2017b; Kolhatkar et al., 2020). Support Vector Machines are usually picked by researchers as they are straightforward classifiers allowing both linear and non-linear learning, and usually perform well on textual data. We diverge from this method and use a Gradient Boosting classifier instead, a tree-based ensemble model.
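To make this concrete, a Gradient Boosting classifier over simple bag-of-words features could be set up as follows with scikit-learn (a minimal sketch; the features and hyperparameters shown here are placeholders rather than the exact configuration described in Chapter 5):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy reviews with 4-class constructiveness labels (A-D).
texts = [
    "That just sucks!!",
    "Light and sturdy, thanks.",
    "I like them, they are nice, sturdy but super heavy.",
    "Great tripod, holds a lot of weight; the instructions could be clearer.",
]
labels = ["A", "B", "C", "C"]

# TF-IDF features feeding a tree-based ensemble (Gradient Boosting Machine).
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3),
)
model.fit(texts, labels)
print(model.predict(["My daughter loves these paints!"]))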

2.3.2. Neural networks

Neural networks tend to perform better than feature-based models and logically achieve state-of-the-art results in most NLP tasks. Their ability to build complex representations of text sequences by performing linear and non-linear transformations allows them to effectively grasp the intent or meaning of that text. A typical approach to deep learning for text classification is to see text as a sequence of inputs represented by vectors of real numbers called embeddings (Turian et al., 2010), and feed these embeddings into a series of neuron layers, which perform non-linear transformations on the vectors. The output is finally fed into a classification layer, usually activated by a sigmoid function for a yes/no problem or a softmax function for a multiclass problem. Popular embedding techniques include pre-trained word embeddings such as GloVe (Pennington et al., 2014) and word2vec (Goldberg and Levy, 2014), which facilitate the representation learning of text inputs. Kolhatkar and Taboada (2017a,b) and Kolhatkar et al. (2020) consistently use GloVe embeddings in their experiments.
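As a minimal sketch of this generic pipeline (token embeddings averaged, passed through a dense layer and a softmax classification layer), assuming PyTorch and randomly initialized embeddings rather than pre-trained GloVe vectors:

import torch
import torch.nn as nn

class SimpleTextClassifier(nn.Module):
    """Embeds token ids, averages them, and applies a dense layer plus softmax."""
    def __init__(self, vocab_size=10000, emb_dim=100, hidden_dim=64, num_classes=4):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, emb_dim, mode="mean")
        self.hidden = nn.Linear(emb_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids, offsets):
        x = self.embedding(token_ids, offsets)     # averaged word embeddings
        x = torch.relu(self.hidden(x))             # non-linear transformation
        return torch.softmax(self.out(x), dim=-1)  # class probabilities

# One toy "sentence" of five token ids, starting at offset 0.
model = SimpleTextClassifier()
probs = model(torch.tensor([1, 7, 42, 7, 3]), torch.tensor([0]))
print(probs.shape)  # torch.Size([1, 4])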

Figure 2.1.: Example of a simple convolution with only one filter of size s = 2 sliding along the sentence, performing matrix multiplication against the selected text window and producing successive features after a pooling step. The resulting feature map is a vector with dimension (n + s - 1). Each word of the sentence is represented by a real-valued vector of length dim.

Very effective ways to create informative sequence representations are available to us. For example, convolutions allow the detection of underlying patterns in a sentence by sliding a window of fixed-size filters that try to intensify informative meaning (Dos Santos and Gatti, 2014). The result is max- or average-pooled to reduce the dimensionality and fed into the following layers for activation and classification. A simple representation of a single convolution is shown in Figure 2.1. This type of network is broadly used in image classification (Ciresan et al., 2011) because filters are efficient at detecting visual edges, but it can also infer good meaning representations in text classification, as shown in Ren and Ji (2017). Long short-term memory (LSTM) cells (Hochreiter and Schmidhuber, 1997) are also very useful to detect distant dependencies between words, and solve the vanishing gradient issue of regular recurrent architectures (Bengio et al., 1994). A basic representation of an LSTM cell is shown in Figure 2.2. Such cells are often organised in an encoder-decoder architecture, where an input is encoded into a meaningful representation and decoded into the output.
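A minimal sketch of the convolution-based approach described above, assuming PyTorch (a single 1-D convolution over word embeddings followed by max-pooling; dimensions and filter size are illustrative):

import torch
import torch.nn as nn

class ConvTextClassifier(nn.Module):
    """One 1-D convolution sliding over a sentence of word embeddings, max-pooled."""
    def __init__(self, emb_dim=100, num_filters=8, filter_size=2, num_classes=4):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, num_filters, kernel_size=filter_size)
        self.out = nn.Linear(num_filters, num_classes)

    def forward(self, embedded):      # embedded: (batch, seq_len, emb_dim)
        x = embedded.transpose(1, 2)  # Conv1d expects (batch, emb_dim, seq_len)
        feature_map = torch.relu(self.conv(x))  # (batch, num_filters, reduced length)
        pooled = feature_map.max(dim=2).values  # max-pooling over the sequence
        return self.out(pooled)                 # class scores

model = ConvTextClassifier()
sentence = torch.randn(1, 7, 100)  # one sentence of 7 words, 100-dim embeddings
print(model(sentence).shape)       # torch.Size([1, 4])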

Figure 2.2.: LSTM cell architecture (Hochreiter and Schmidhuber, 1997). x_t is the input at timestep t, c_{t-1} is the previous cell state, h_{t-1} is the previous hidden state, c_t is the updated cell state and h_t is the updated hidden state. Yellow nodes represent pointwise operations (multiplication, addition or tanh). Orange nodes represent activation functions (sigmoid or tanh).

Figure 2.3.: The Transformer architecture (Vaswani et al., 2017). The left block corresponds to the encoder, the right block corresponds to the decoder.

An issue with models based on recurrent mechanisms is that the tokens of an input sequence are processed iteratively: the representation of a given token undergoes several transformations in successive hidden cells before being decoded, and it becomes harder for the model to preserve a good representation of the input. This issue has been addressed by the relatively recent Transformer architecture, introduced by Vaswani et al. (2017). Transformers also use an encoder-decoder architecture, but process all tokens of an input simultaneously through many matrix multiplications, which makes them fast to run on a GPU. Figure 2.3, presented in Vaswani et al. (2017), shows

the basic transformer block. The multi-head attention in both the encoder and the decoder captures contextual relationships in the sentence, and the feed-forward layers are classic densely-connected neuron layers. One of the state-of-the-art models based on transformers is Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018), which draws input representations from unsupervised pre-training and has been found adequate for language understanding and generative tasks. This model is composed of 12 chained layers of transformer-attention blocks and has been pre-trained on huge amounts of data to solve two NLP tasks in parallel (Masked Language Modeling and Next Sentence Prediction); it can then be fine-tuned for specific downstream tasks. Similarly, XLNet (Generalized Autoregressive Pretraining for Language Understanding) (Yang et al., 2019) tries to correct BERT's shortcomings and outperforms it on many language understanding and language generation tasks. In this thesis we investigate both feature-based and neural network models with two goals: to compare the different types of models and the impact of model complexity on constructiveness classification performance, and to compare individual and combined sets of classic features using Gradient Boosting.
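To make the fine-tuning setup concrete, a BERT classifier for a 4-class problem could be instantiated as follows (a minimal sketch using the Hugging Face transformers library; the model name, label encoding and single optimization step are illustrative, not necessarily the configuration used in Chapter 5):

import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)

reviews = ["One Star", "Great tripod, but the instructions could be clearer."]
labels = torch.tensor([0, 2])  # e.g. 0 = class A, ..., 3 = class D

# Tokenize, run a forward pass with labels, and take one optimization step.
batch = tokenizer(reviews, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs.loss.backward()
optimizer.step()
print(outputs.logits.shape)  # torch.Size([2, 4])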

3. Data

As discussed before, one major issue of constructiveness classification is the lack of annotated data usable for supervised learning. Some linear models might generalize decently by learning from a small amount of data, but neural networks usually train up to dozens of millions of parameters and therefore require large amounts of labeled data. Recent efforts have been made to annotate sets of data via crowdsourcing, allowing us to train models on consistent data inputs. This solves the problem encountered in Kolhatkar and Taboada (2017a), where the authors had to train a biLSTM model on two different data sets, one for positive instances of constructiveness (but annotated at the comment thread level) and the other for negative instances (annotated for argument quality). Unfortunately, there is to this day no available labeled data for product reviews or non-binary constructiveness (Fujita et al. (2019) infer a constructiveness-based ranking by normalizing binary annotations from crowdworkers). This is why we undertake the labeling of a new data set adapted to our research needs. We use this set for training and testing, in both multiclass and binary settings. We also use several labeled data sets created in previous research to test our models' ability to generalize to out-of-domain data. The data is gathered from the different official releases and open-source data sets listed in Section 3.1, solely for academic research purposes.

3.1. Data Gathering

3.1.1. Training Data

• Corpus for Review Constructiveness (CRC). This data will serve as our main data set. It originates from Amazon Review Data 1995-20151 and 1996-20182 (Ni et al., 2019). We choose Amazon reviews as our training set, as this allows our models to learn from a wide range of diverse user feedback inputs and hopefully spot useful patterns for classification. The variety in types of products is unrivaled among review data sets and facilitates our data gathering task. However, we discuss several potential drawbacks of this type of data in Section 3.2. 4000 reviews were randomly selected from a wide collection of categories including Video games, Electronics, Music, Movies, etc. The complete list of selected categories can be found in Appendix A. The reason for random extraction is that we want to maximize the representativeness of our final set in terms of diversity of inputs, with regard to future unseen samples. Training and testing models on a single or a handful of product types would certainly improve performance during test experiments, but would most likely not generalize well to new data in real-world applications. We keep 80% of this data to train all our models, that is 3200 reviews.

1 https://s3.amazonaws.com/amazon-reviews-pds/readme.html
2 https://nijianmo.github.io/amazon/index.html

3.1.2. Test Data

• Corpus for Review Constructiveness (CRC). This test set is a part of our main data set, hence coming from Amazon Review Data 1995-2015 and 1996-2018. We use stratified splitting based on the constructiveness tag for the train/test division to keep an equal target class distribution over the training and test set, therefore ensuring a good representativeness of subsequent experiments. This means that the proportion of each constructiveness label will be the same across the training set and the test set (a splitting sketch is given after this list). 20% of the original set (4000 reviews) is used in our experiments, that is 800 reviews.

• Constructive Comments Corpus (C3) (Kolhatkar et al., 2020). This data set is composed of 12000 news article top-level comments from the Canadian newspaper "The Globe and Mail".3 The comments are annotated individually for binary constructiveness, toxicity, and a set of constructiveness and non-constructiveness characteristics. In this work, we only use the binary constructiveness tag to see how well our models, trained on our annotated binary Amazon data, generalize to out-of-domain data.

• The Yahoo News Annotated Comments Corpus (YNACC) (Napoles et al., 2017). This data set is composed of Yahoo News comments from 2016. It contains information both on the comment thread level (with a maximum answer embedding of 1) and on the comment level. Unfortunately, the constructiveness tag concerns comment threads. In their work, Kolhatkar and Taboada (2017a) consider YNACC comments from constructive threads as all constructive, and Kolhatkar and Taboada (2017b) use comments from non-constructive YNACC threads as non-constructive examples to train their machine learning models. Even though some toxic comments might occur in constructive threads, hence introducing noise into test results, this data set still gives us an idea of our models' performance. The public release file contains 23383 lines of annotated comments, 22795 of which are labeled for binary constructiveness. Again, we use these comments to measure the performance of our binary model on out-of-domain data.
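The stratified train/test division described for CRC above can be reproduced with scikit-learn, for example (a minimal sketch on toy data; the column names are hypothetical):

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical CRC frame with review texts and their 4-class constructiveness tags.
crc = pd.DataFrame({
    "review_text": [f"review {i}" for i in range(20)],
    "constructiveness": ["A", "B", "C", "D"] * 5,
})

# 80/20 split, stratified on the constructiveness tag so that class
# proportions are identical in the training and test sets.
train_df, test_df = train_test_split(
    crc,
    test_size=0.2,
    stratify=crc["constructiveness"],
    random_state=42,
)
print(len(train_df), len(test_df))  # 16 4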

3.2. Possible Shortcomings

The type of data we have chosen for this task covers a range of products and services that is broad enough to ensure decent classification of new instances based on patterns inferred during training. Indeed, CRC is composed of many different product categories and allows our feature-based models and neural networks to draw patterns from a large set of linguistic structures and words. However, product reviews differ from article comments in many ways and tend to be more descriptive, so it is possible that our models might not generalize well to out-of-domain data. Training on Amazon reviews entails several drawbacks. First, some types of items are surprisingly hard to review constructively, which can result in biased models. For instance, "gift cards" are straightforward products that do not necessarily require feedback on their content. Such reviews are only rarely informative even though they are also rarely toxic. On the flip side, movies are often carefully described and customers relate what they liked or disliked, often in a constructive way. Also, we unfortunately do not have access to the full scope of possible user inputs, since the Amazon review system's algorithm filters out most hateful and insulting comments that go against the reviewing rules. We are therefore lacking a part of the trainable material corresponding

3https://www.theglobeandmail.com

to the lowest category of our constructiveness scale (described in Chapter 4), which can have a strong impact on the classification of unseen data. The feature-based models we select use neither pre-trained embeddings nor pre-trained weights, and will therefore probably yield decent results on CRC in the case of non-constructive samples, but generalize poorly to more lax out-of-domain sources or live user inputs allowing strong toxicity, spam or hate speech in general.

4. Annotation

We have seen in Section 2.1 that several papers already present constructiveness-based schemes and taxonomies, but to our knowledge none of them has yet investigated either non-binary labeling or product reviews (again, Fujita et al. (2019) infer rankings based on binary labeling). In this chapter we describe and discuss our ranked annotation scheme for multiclass and binary constructiveness classification of product feedback. Most existing data sets have been annotated by crowdworkers, who usually reach a good level of agreement (Kolhatkar et al., 2020). However, we believe that asking crowdworkers to answer questions requiring numerical non-binary answers about a subjective concept makes it harder to reach a consensus, as we multiply the chances of non-agreement or chance agreement with each new ranked category (this potential issue has been raised by Fujita et al. (2019)). It is thus also harder to keep track of the crowdworkers' performance with random gold questions having a single correct answer. We take the risk of annotating all 4000 collected Amazon reviews composing CRC by hand with a single annotator, hoping that personal bias will not affect the annotation process. As a complement, we perform two inter-annotator agreement experiments to test the reliability of both our new 4-class scheme and the derived 2-class scheme.

4.1. Annotation Scheme

In Section 2.1 we have reviewed the different definitions and features of constructiveness provided in relevant papers investigating news article comments, in which authors have found that constructive user inputs were generally respectful, argumentative, informative and showing proof of personal experience. This set of features perfectly suits article comments, which usually aim at discussing, showing agreement or disagreement, stating a position, showing frustration, etc. Many comments are reactions towards other users or the author instead of the article. This generally increases potential toxicity, often occurring when the conversation shifts off-topic or in case of disagreement. Moreover, users in online comment sections tend to display more toxicity when anonymous and with a lack of eye contact (Lapidot-Lefler and Barak, 2012). Online customers show a slightly different behavior. First, the non-conversational aspect of reviews mitigates conversation-related features such as inter-user toxicity or inter-user disagreement leading to off-topic divergences. A few reviews show signs of inter-customer references, but these usually state agreement and nearly never include sarcastic remarks. Reviews serve two main purposes: describing a personal experience with a product and/or describing the features of a product. In order to adapt the definition of constructiveness to customer reviews, the descriptive dimension must therefore be taken into account, since the description of an item or a situation carries a large amount of information and participates in the overall argumentation of the review. Before laying out our taxonomy, we must disambiguate the use of subjectivity and objectivity. Ghose and Ipeirotis (2006) notice that reviews of feature-based products (electronics, hardware, tools, etc.) tend to be found more helpful by other potential customers when they are more objective, e.g. "The laptop does not fit in most 13"

cases, but the extra fan really helps reducing the heat". On the other hand, experience-based products (movies, music, video-games, etc.) tend to be found more helpful when consisting of mostly personal and sentimental elements, e.g. "I just love this game. The characters are amazing and I feel like I'm 10yo again". However, we diverge from these assumptions in that constructiveness, which is different from helpfulness, heavily relies on argumentation, and argumentation itself mostly relies on factual statements. We therefore consider objective reviews to be more constructive than subjective reviews. The key point of an annotation process for an NLP task is its consistency. Indeed, the entire data set must be annotated according to the same criteria for two main reasons: first, to avoid confusing our machine learning models in their pattern detection process and therefore to minimize the amount of random classification; second, to ensure its reusability in further experiments. Because this work mainly investigates multiclass classification and most of the recent work in this area has been done on binary classification, we decide to experiment on two closely related scales, the first composed of four ranked classes and the second composed of two classes derived from the former. We apply this scheme to our main data set, CRC. The resulting data sets are referred to as CRC_multi and CRC_bin.

4.1.1. Four-class scale (CRC_multi)

We first rely on a set of simple heuristics to label the most clear-cut reviews:

• One-word reviews are class A, e.g. "thanks".

• Amazon’s default review text is class A, e.g. “One Star”, “Five Stars”.

• Reviews containing aggressive frustration and stating purchase regret are class A, e.g. “this was just a WASTE of money, thanks for nothing”.

• Short reviews composed of two adjectives are class B, e.g. “light and sturdy, thanks”.

• Reviews composed of at least three adjectives and personal impression are class C, e.g. “I like them, they are nice, sturdy but super heavy”.

• Extremely long and non-insulting reviews are class D.

To label more complex reviews that do not precisely fall into these pre-defined patterns, we use the taxonomy shown in Table 4.1. For a given review, we use Table 4.1 as a checklist where each row corresponds to a core component of constructiveness.

The components are ordered by decreasing importance, with argumentation and informativeness being the most important ones and objectivity as the least important one. We make the labeling decision based on the presence or absence of each component, weighted by that component's importance: if a review is argumentative and informative but shows no sign of experience and is subjective, the label will tend towards classes C or D because the former components are more important. In a perfect world, a review belonging to class A would not be argumentative nor informative, would be short, would not aim for improvement, would be very toxic and subjective, and would not show signs of personal experience. In contrast, a review from class C would contain a few elements of argumentation and some information on the product/experience, would have a decent length and show some interest for future improvement, would not be toxic but instead respectful, would be objective and contain no spelling mistakes, and would show signs of experience with the product. For example:

    I love this machine! I have already made a ton of vinyl decals for cups, glasses, car windows, tote bags, t-shirts, etc. This is the machine you need to do it all! I would recommend to anyone!

This review fulfills features from all classes and is therefore hard to annotate: it provides no to little argumentation (B) and little information (B), is rather short (A/B), does not state negative aspects and therefore no suggestion for improvement (A/B), is not toxic nor sarcastic (C/D), is respectful (C/D) and shows experience with the product (C/D), and is rather subjective (A/B). By looking at the overall grades for each component and comparing this specific review to other reviews, we can still determine that the best fit would most likely be class B, since it is fundamentally a non-toxic review that does not bring much information on the product. Here is an easier example:

    Having been a pro guitar player for a long time I remember playing through a Fender Leslie back in the day. I have about 4 pedals that are supposed to sound like a Leslie but don't quite make the grade. Even though the reviews are mostly positive I was skeptical about the Leslie pedal but ordered one anyway and was pleasantly surprised. It sounds great. The drawbacks are it's size and the fact that it does change they way your amp sounds a bit. It does sound good though and I can live with the slight change in the way my amp sounds. Definitely a permanent fixture on my pedal board.

This review is somewhat argumentative (C) and informative (D), is long and well written (D), states the negative aspects of the product (C/D) but is not toxic (C/D), is respectful (C/D) and rather objective (C/D), and contains good proof of experience (C/D). We can determine with a fair degree of confidence that this review belongs to either class C or D, with a slight preference for D because the more important components of class D are verified. A list of reviews with their annotation and a few taxonomic explanations is available in Appendix B, and can help to get a better understanding of the labeling process.

4.1.2. Two-class scale (CRC_bin)

In order to compare our results on constructiveness classification for product reviews to previous work done on other types of data (mostly article comments and comment threads), we also provide a binary classification scheme. This second scale simply represents the two sides of the main 4-class scheme, i.e. constructive and non-constructive.

21 Figure 4.1.: 4-class scheme with increasing constructiveness.

A (Not constructive) | B (Rather not constructive) | C (Rather constructive) | D (Constructive)
Provides no argumentation | Provides no to little argumentation | Provides at least some argumentation | Provides thorough argumentation
Brings no information about the product | Brings no to little information about the product | Brings a moderate amount of information about the product | Brings substantial information about the product
Rather short or poorly written | Rather short or poorly written | Rather long and/or well written | Long or very long and well written
Shows no interest for improvement | Shows no to little interest for improvement | Shows some interest for improvement | Shows interest for improvement if necessary
Contains a high level of toxicity, sarcasm, direct attacks or insults | Contains no to high level of toxicity or sarcasm | Contains no to moderate level of toxicity or sarcasm | Contains a low amount of or no toxicity or sarcasm
Disrespectful | Rather respectful | Respectful | Respectful
Unspecified or no proof of user experience | Proof of user experience | Proof of user experience | Proof of user experience
Rather subjective | Rather subjective | Rather objective | Rather objective

Table 4.1.: Taxonomy for all four target classes A (Not constructive), B (Rather not constructive), C (Rather constructive) and D (Constructive). Each row addresses a component of constructiveness. Rows are ordered by component importance.

This means that the two lowest categories of the scale collapse into one class, and the two highest categories collapse into another class, as shown in Figure 4.2. The reviews are automatically mapped from classes A, B, C and D into classes AB and CD after our annotation.
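This collapse is a simple deterministic mapping, which could be written as follows (the label names AB and CD follow Figure 4.2):

# Collapse the 4-class annotation into the binary scheme: A, B -> AB and C, D -> CD.
FOUR_TO_TWO = {"A": "AB", "B": "AB", "C": "CD", "D": "CD"}

def to_binary(label_4class: str) -> str:
    return FOUR_TO_TWO[label_4class]

print([to_binary(label) for label in ["A", "B", "C", "D"]])  # ['AB', 'AB', 'CD', 'CD']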

22 Figure 4.2.: Binary scheme derived from the 4-class scheme presented in 4.1. The four classes collapse into two, allowing binary classification.

4.2. Annotation agreement

It is common practice to proof-test any human annotation in supervised tasks. The general purpose is to ensure that subsequent prediction models base their estimation of the function that maps inputs to outputs on robust data sets. The annotation process is a major component of what guides these models; it is therefore essential that, in a context where several human annotators are providing labeled data, a consistent level of agreement is ensured so that the resulting data is not skewed. In this thesis, only one annotator provides the labeled data used for training and part of testing; nevertheless, it is still important to perform reliability tests. These tests can help us during the annotation process, in case of an unreliable scheme for example, which allows us to quickly adapt and further develop that scheme. Finally, they ensure that the annotation process is replicable in future work. Many test statistics and coefficients are available to estimate an agreement as precisely as possible. We perform two experiments to test our scheme's reliability: the first one, early on in the annotation process, on a large number of untrained and unpaid annotators, to get an idea of possible annotation tendencies in a 4-class context, and the second one, closer to the end of the annotation process, on a larger sample of reviews and with only one other trained human annotator, to take a more mathematical approach and calculate the agreement in both multiclass and binary settings.

4.2.1. Experiment 1

We start by carrying out a straightforward survey with 16 participants on a sample of 20 reviews, aiming at gaining insights into possible flaws in the scheme described in Section 4.1. The subjects of this survey are fluent or native English speakers, and are given a short and simple definition of each of the four ranked categories, without any actual examples or details. It is important to note that the annotators are not paid and we

only make the survey available on our personal Facebook1 account, so we only expect minimal emotional investment from participants. We remove the annotation results of two annotators whom we consider outliers, since they do not seem to have labeled the reviews seriously and introduce heavy noise into the data.

Figure 4.3.: Agreement trend visualization between the 14 survey participants and us on 20 reviews, with 4 ranked constructiveness categories. The black lines on the blue bars represent the standard deviation for each review.

We compute the mean and standard deviation of the survey results in a 4-class context, and compare them to our annotation to see the difference. The results are shown in Figure 4.3. The "difference" bars are measured in terms of category distance, which means that a difference of 2 between our annotation and the survey mean annotation would roughly correspond to a disagreement between us and the general trend that spans 2 categories, i.e. a very strong disagreement. We can see that in most cases, the difference does not exceed 1. Only three reviews show differences above 1, and four other reviews come relatively close to 1. The remaining thirteen reviews show a negligible difference between our labels and those of the survey participants. These results are encouraging since open numerical questions about a subjective concept are expected to be hard to answer for untrained and unpaid annotators (Fujita et al., 2019). Out of the 20 reviews, two showed strong signs of ambiguity in the participants' results:

(a). The only problem I have is the fact that I have to push down a bit harder to get the buttons, but it’s just a mild inconvenience. Great product.

(b). low quality product, cannot be used due to ghosting.I got it from US thinking people using good quality product there.I didn’t notice early review. It has been really true.

In review (a), the standard deviation is 0.95, which shows a high level of disagreement among participants. The review shows characteristics from several classes of our taxonomy, which can explain its ambiguous character and the poor overall agreement. The standard deviation for review (b) is 0.77, which again shows moderate disagreement among participants. Many considered this review to be a class B, although four of them labeled it as class C and one as class D, which is surprising. According to our

1https://www.facebook.com

taxonomy, the review is simply toxic and sarcastic, and does not bring arguments or much information, which would correspond to class A. These two samples are good examples of the difficulty of reaching a consensus with four target classes, both with ambiguous reviews and with straightforwardly toxic reviews. The average standard deviation over the 20 reviews is 0.58, which shows that the survey participants' annotations span a reasonably small range. Also, the average difference between our annotation and the survey mean annotation is -0.38, meaning that on average we annotated each review slightly more negatively than the participants, and that the average difference is also reasonably small. These two values suggest that even with untrained annotators and vague instructions, it is possible to achieve decent agreement, both among the untrained annotators and between us and the participants.

• Fleiss' κ (Fleiss, 1971): Fleiss' κ is an inter-annotator agreement metric appropriate in situations where more than two annotators are involved. However, the calculation is designed for nominal or binary values, i.e. discrete categories with no notion of distance between them. The kappa score can also be computed for ordinal values but will not take into account any order or distance between categories: say we give two annotators a review to grade (on a scale from 1 to 4). Fleiss' κ will consider the outcome {1, 1} as an agreement, and the outcomes {1, 2} and {1, 4} as equal disagreements, when it should ideally consider the outcome {1, 4} as a stronger disagreement than {1, 2}. We get κ = 0.279, which corresponds to a "fair agreement" according to Landis and Koch (1977).
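A value like the one above can be computed from a ratings matrix with, for example, the statsmodels implementation (a minimal sketch on toy ratings, not the actual survey data):

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy ratings: 5 reviews (rows) rated by 3 annotators (columns) on the 4-class scale.
ratings = np.array([
    [1, 1, 2],
    [3, 3, 3],
    [2, 3, 2],
    [4, 4, 3],
    [1, 2, 1],
])

# aggregate_raters turns (items x raters) ratings into (items x categories) counts.
table, _ = aggregate_raters(ratings)
print(fleiss_kappa(table))  # treats the four levels as unordered categories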

4.2.2. Experiment 2

Since the annotation scheme is one major contribution of this thesis, we decided to perform a second set of experiments in order to get a more precise assessment of the reliability of our scheme. Here we annotate 50 randomly picked reviews from CRC, and ask another human annotator to also rate each review. Two differences are introduced in this new experiment: first, the assessment of inter-annotator agreement is more easily handled by statistical metrics when performed by only two annotators; second, the annotation scheme itself and the different characteristics of constructiveness were explained to the other annotator, which was not the case in Experiment 1.

• Kendall's τ (Kendall, 1938): This statistic is a non-parametric hypothesis test that measures the ordinal association, or concordance, between our two ranked annotation sets, i.e. the degree of statistical dependence between two samples. The SciPy Python library offers a computation of Kendall's τ-b, which takes tied pairs into account (a minimal SciPy sketch is given after this list):

\tau_b = \frac{n_c - n_d}{\sqrt{(n_c + n_d + T_x)(n_c + n_d + T_y)}}   (4.1)

where n_c and n_d are the number of concordant and discordant pairs, respectively, and T_x and T_y are the number of ties occurring only in the x and y samples, respectively. The resulting coefficient ranges from -1 to 1, respectively meaning complete discordance and complete concordance between the two samples. We compute τ in both 4-class and 2-class contexts:

– In a 4-class context, τ = 0.778, which indicates a strong agreement between annotations.
– In a 2-class context, τ = 0.797, which indicates an even stronger agreement between annotations.
These results show that when discussing the 4-class scheme with a single extra annotator and precisely stipulating the characteristics of each category, a very strong agreement can be reached in both 4-class and 2-class contexts.
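As mentioned above, SciPy provides this computation directly; the sketch below is a minimal illustration with placeholder annotation lists, not our actual annotations.

```python
# Minimal sketch: Kendall's tau-b between two annotators (illustrative labels only).
from scipy.stats import kendalltau

annotator_1 = [1, 2, 2, 3, 4, 1, 3, 4, 2, 3]  # e.g. our labels (A=1 ... D=4)
annotator_2 = [1, 2, 3, 3, 4, 1, 3, 3, 2, 4]  # e.g. the second annotator's labels

tau, p_value = kendalltau(annotator_1, annotator_2)  # tau-b by default, handles ties
print(tau, p_value)
```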

The two inter-annotator agreement experiments have shown the reliability of our taxonomy and validated the possibility of distributing constructiveness over more than two classes.

4.3. Annotated Data Sets

Corpus Train (4-class) Test (4-class) ABCD ABCD

CRC

Table 4.2.: Data distribution for CRC (4000 reviews), C3 (12000 comments) and YNACC (22795 comments) in a 4-class setting. A, B, C and D correspond to the classes described in 4.1.1.

Corpus     Train (2-class)         Test (2-class)
           AB        CD            AB        CD
CRC_bin    1550      1650          388       412
C3         —         —             5484      6516
YNACC      —         —             10983     11812

Table 4.3.: Data distribution for CRC (4000 reviews), C3 (12000 comments) and YNACC (22795 comments) in a 2-class setting. AB and CD are described in 4.1.2.

We annotate CRC by relying on our scheme and collect out-of-domain test data in accordance with the explanations in 3.1. Table 4.2 shows the distribution of constructiveness tags over each data set in a 4-class context. Table 4.3 shows the distribution of constructiveness tags over each data set in a binary context. The sample distribution for our CRC sets in both contexts is constant since we use a stratified splitting method to ensure that the test set yields a correct sample representation. We notice that for all binary constructiveness sets, the proportion of constructive/non-constructive samples is similar: 51.5%, 54.3% and 51.8% of constructive samples for CRC, C3 and YNACC, respectively. This suggests that constructiveness levels are similar in article comments and product reviews. Table 4.2 also shows that, in a 4-class scheme, CRC seems to show a normal-like constructiveness distribution, with most of the product reviews belonging to classes B and C. This can be explained by the fact that we label reviews in class A or D when they mostly show fully constructive or fully non-constructive features, which happens less often than reviews that show a mix of features from several classes and are logically labeled as B or C. This will likely cause imbalance issues during classification: indeed,

computational models learning from imbalanced data sets usually tend to overpredict the majority classes because they do not see enough instances of minority classes, resulting in worse generalization.
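As a hedged illustration of the stratified splitting mentioned above, scikit-learn's train_test_split can preserve class proportions between the training and test partitions; the data below is a placeholder standing in for the CRC reviews and tags.

```python
# Minimal sketch: stratified 80/20 split preserving the 4-class proportions (placeholder data).
from sklearn.model_selection import train_test_split

reviews = [f"review text {i}" for i in range(100)]                  # placeholder review texts
labels = ["A"] * 15 + ["B"] * 35 + ["C"] * 35 + ["D"] * 15          # placeholder skewed 4-class tags

X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.2, stratify=labels, random_state=42
)
print(len(X_train), len(X_test))
```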

5. Experimental Methodology

In this chapter we describe the different steps taken in order to organize our experimental setup. We here take a technical approach to the procedures that lead us to prove or disprove our four research questions. These involve data preprocessing, feature extraction, model architecture and hyperparameter tuning.

5.1. Preprocessing

A major component of the machine learning pipeline for text classification is preprocessing of the raw data. Adapting the text data with preprocessing is an absolute requirement for two complementary reasons: first, to minimize the noise contained inside the text, such as symbols, HTML tags, links, case, punctuation, etc.; second, to maximize the informativeness of the text by keeping the meaningful structures, which will guide our models during pattern detection. Here is an example of a review that requires some cleaning:

Nik Kershaw was an unfortunately typical artist in the video-happy 80's. Tuneful, photogenic and big on synths, he was a cross between George Micheal and Howard Jones. That is epitomized on his biggest international hit, \\"Wouldn't It be Good.\\" The video was typical of the period, wild haircuts, ersatz Bowie posturing and satellite dishes. The \\"[[ASIN:B0000071BO Human Racing]]\\" became his biggest hit and he established himself overseas, but is pretty much a one-hit wonder in the US.

As shown in the example, Amazon customer reviews happen to be rather noisy, therefore we perform several basic operations to clean the text as much as possible. We remove HTML tags and URLs as well as other undesirable characters, e.g. backslashes, square brackets, etc. The reviews are tokenized and character repetitions normalized to three repetitions, i.e. "greaaaaat" becomes "greaaat". We lowercase each token and expand word contractions to further normalize the text, i.e. "don't" becomes "do not". The punctuation is removed and numbers are spelled out to reduce the vocabulary size. It is important to note that we do not remove stopwords from the reviews, because unlike tasks such as sentiment analysis, our main goal is not to make strongly polarized words stand out while ignoring frequent words. Instead, we are interested in entire argumentative structures, and we believe that removing the very frequent words might hurt pattern recognition. We preprocess all training and test sets before commencing our experiments, and feed the exact same data to our different models (except baselines).
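A condensed sketch of these cleaning steps is given below; it is a simplification of the actual pipeline (a proper tokenizer is used in the thesis), the contraction table is an illustrative subset, and digits are spelled out one by one rather than as full numbers.

```python
# Minimal sketch of the review cleaning steps (simplified; not the exact thesis pipeline).
import re

CONTRACTIONS = {"don't": "do not", "can't": "can not", "won't": "will not"}  # illustrative subset
DIGITS = "zero one two three four five six seven eight nine".split()

def preprocess(review: str) -> str:
    text = re.sub(r"<[^>]+>", " ", review)                 # strip HTML tags
    text = re.sub(r"https?://\S+", " ", text)              # strip URLs
    text = re.sub(r"[\\\[\]]", " ", text)                  # strip backslashes and square brackets
    text = text.lower()                                    # lowercase
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)                   # expand contractions
    text = re.sub(r"(.)\1{3,}", r"\1\1\1", text)           # cap character repetitions at three
    text = re.sub(r"\d", lambda m: " " + DIGITS[int(m.group())] + " ", text)  # spell out digits
    text = re.sub(r"[^\w\s]", " ", text)                   # remove punctuation
    return " ".join(text.split())                          # collapse whitespace (crude tokenization)

print(preprocess("Greaaaaat!! Don't miss it: <b>5 stars</b> https://example.com"))
```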

5.2. Baseline

For both multiclass and binary classification, we choose a random stratified classification baseline, implemented with scikit-learn. This weak baseline suits our problem because it randomly predicts an output while taking class distributions into account.

A random baseline also offers the possibility to measure the relative improvement or deterioration of a trained model's performance, therefore allowing us to compare multiclass and binary classification in terms of error reduction over the baseline.
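scikit-learn exposes such a stratified random baseline through its DummyClassifier; a minimal sketch with placeholder data could look as follows.

```python
# Minimal sketch: stratified random baseline with scikit-learn (placeholder data).
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

y_train = ["A"] * 15 + ["B"] * 35 + ["C"] * 35 + ["D"] * 15   # mimics a skewed 4-class distribution
X_train = [[0]] * len(y_train)                                # features are ignored by the dummy model
y_test = ["A"] * 4 + ["B"] * 8 + ["C"] * 8 + ["D"] * 5
X_test = [[0]] * len(y_test)

baseline = DummyClassifier(strategy="stratified", random_state=0)
baseline.fit(X_train, y_train)
print(f1_score(y_test, baseline.predict(X_test), average="weighted"))
```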

5.3. Feature-based Model

Even though extremely complex models involving neural networks are nowadays used to deal with matters as subjective as text constructiveness, one of our goals in this thesis is to better understand what constitutes constructiveness and how to effectively model it. For this purpose, classic models relying on sets of pre-defined features are relatively easy to interpret and can give us better indications about the different features' importance.

5.3.1. Model

We choose to implement a Gradient Boosting Classifier, also called Gradient Boosting Machine (GBM), from the scikit-learn library (Friedman, 2001; Pedregosa et al., 2011). GBM is an ensemble method that uses "weak learners", here decision trees, and combines them in an iterative way to learn from the different input features and minimize a loss function. This popular model is widely used in regression and classification problems and showed better performance than Support Vector Machines, Random Forests and other classic models during our preliminary experiments; we therefore choose a GBM to carry out our feature-based experiments.

5.3.2. Features

With feature extraction, we build an informative representation of the data that is to be fed to the machine learning algorithm. Since these algorithms only accept numerical values as input, each sample is represented by a matrix which allows the algorithm to conveniently learn the function that maps input data to output category. Feature extraction begins during the preprocessing step, because as we clean the data, we might lose some valuable information that cannot be retrieved after data cleaning. Say we want to use the ratio of uppercase words as a feature: we then need to perform the extraction of that feature after cleaning the noise from the text, but before lowercasing the review. As discussed in Section 4.1.1, not all features carry the same importance and some of them are essential to constructive speech. Kolhatkar and Taboada (2017b) and Kolhatkar et al. (2020) discuss possible feature sets and find that length features as well as text quality features were mostly responsible for the performance of their SVM model on article comments. The issue raised with length features is that since constructive inputs are usually longer than non-constructive inputs, models relying on these features might be ineffective in detecting long non-constructive text, or short constructive text. Adding more features such as text quality helps create more robust models that better understand the problem. Our goal here is to assemble the best possible set of features so as to maximize the ability of our computational model to spot underlying patterns of constructiveness. We rely on our annotation scheme to integrate what we believe are informative features of constructive behavior, and lay out the different feature types in Table 5.1. We use token unigrams as a text feature and TFIDF weighting to give more importance to rare words. Preliminary experiments have shown that unigrams were performing better than bigrams, trigrams or combinations of n-grams, therefore we stick to unigrams in our experiments.

Feature type          Features
Lexical (1)           Token unigrams with TFIDF weighting
Syntactic (1)         POS unigrams with TFIDF weighting
Discourse (2)         Number of discourse markers, number of modals
Text quality (4)      Readability score, number of uppercase words, number of punctuation marks, number of unknown words
Length (3)            Number of tokens, number of characters, average token length
Semantic (3)          Sentiment score, number of positive tokens, number of negative tokens
Named entities (1)    Number of named entities

Table 5.1.: Feature set used for feature-based classification.

Length and text quality features show promising results in detecting constructiveness when used individually (Kolhatkar et al., 2020), as constructive text is usually long and substantial. We therefore integrate these in our system. For text quality, we compute the Flesch Reading Ease score (Flesch, 1948). Argumentation is also a major component of our scheme, therefore we add several discourse features to our set, consisting of several lexicon matching counts. We call "discourse markers" the set of stance connectors from different discourse categories, such as addition ("moreover"), contrast ("however"), cause ("because"), consequence ("therefore"), etc. A feature that has not yet been investigated for constructiveness modeling is part-of-speech: as we believe that descriptiveness is another component of constructive behavior in product reviews, and based on our observations of descriptive patterns during annotation, such as the recurrent use of adjectives, we add a syntactic dimension to our model to detect these patterns. We use Stanford's CoreNLP pipeline (Manning et al., 2014) to extract the POS tags for each token. Again, we weigh POS unigrams with TFIDF. Named-entity features are integrated into our feature set since they fit the context of product reviews, where customers often refer to the purchased item, and can be a sign of descriptiveness or argumentation. Named entities are also extracted with the CoreNLP pipeline, before lowercasing the reviews. Finally, we add a semantic set of features including a sentiment score and polarity lexicon match counts, because an important aspect of constructiveness is that both positive and negative aspects should be stated in a positive way in order to sustain future improvements. Our intuition is that more positive reviews tend to be more constructive. The sentiment score is computed with the VADER sentiment analyzer (Hutto and Gilbert, 2014).
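To make a few of the hand-crafted features in Table 5.1 concrete, the sketch below computes them for a single review. It assumes the textstat and vaderSentiment packages (the thesis does not name its exact implementation of the readability score), and the discourse-marker and modal lexicons shown are only illustrative subsets.

```python
# Minimal sketch of a few hand-crafted features (illustrative lexicons; assumes textstat and vaderSentiment).
import string
import textstat
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

DISCOURSE_MARKERS = {"moreover", "however", "because", "therefore", "although"}  # illustrative subset
MODALS = {"can", "could", "may", "might", "should", "would", "must"}

def extract_features(review: str) -> dict:
    tokens = review.split()
    lowered = [t.strip(string.punctuation).lower() for t in tokens]
    return {
        "n_tokens": len(tokens),                                         # length features
        "n_chars": len(review),
        "avg_token_len": sum(len(t) for t in tokens) / max(len(tokens), 1),
        "readability": textstat.flesch_reading_ease(review),             # text quality
        "n_uppercase": sum(t.isupper() for t in tokens),                  # fully uppercase tokens
        "n_punct": sum(c in string.punctuation for c in review),
        "n_discourse": sum(t in DISCOURSE_MARKERS for t in lowered),      # discourse
        "n_modals": sum(t in MODALS for t in lowered),
        "sentiment": SentimentIntensityAnalyzer().polarity_scores(review)["compound"],  # semantic
    }

print(extract_features("The zoom is great; however, the battery could last longer."))
```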

5.3.3. Hyperparameters

We perform hyperparameter tuning with grid search on CRC; the selected values are shown in Table 5.2. We set the number of estimators to 300, which is rather high because some models take some time to converge. For models that learn quickly and might end up overfitting, we set an early stopping parameter to 5 epochs with a low tolerance. Finally, we set the maximum node depth in each tree to 8 so that trees remain interpretable in case of visualization.

Hyperparameter      Value
Learning rate       0.03
# of estimators     300
Max tree depth      8
ES iterations       5
ES tolerance        0.005

Table 5.2.: Hyperparameter set for Gradient Boosting classification, where ES=Early Stopping.
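A hedged sketch of how these hyperparameters map onto scikit-learn's GradientBoostingClassifier is given below; n_iter_no_change and tol play the role of the early stopping settings, and the feature matrix and labels are placeholders rather than our extracted CRC features.

```python
# Minimal sketch: GBM with the hyperparameters of Table 5.2 (placeholder data).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # placeholder feature matrix
y = rng.integers(0, 4, size=200)      # placeholder 4-class labels

gbm = GradientBoostingClassifier(
    learning_rate=0.03,
    n_estimators=300,
    max_depth=8,
    n_iter_no_change=5,   # early stopping iterations
    tol=0.005,            # early stopping tolerance
    random_state=0,
)
gbm.fit(X, y)
print(gbm.score(X, y))
```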

5.4. Neural Network Models

Neural networks have developed tremendously in the past decade and using them is now a standard procedure in most NLP tasks. By allowing non-linear transformations and offering deep learning from embedded structures, these models usually manage to pick up underlying patterns and understand language effectively. We use two different model architectures in our experiments: first, our own simple model based on the LSTM cell, and second, BERT, which is considered a current state-of-the-art transformer-based architecture in many NLP tasks.

5.4.1. Stacked Bi-LSTM Network

We first build a model of our own that consists of only a few neural layers, so as to contrast with the depth and complexity of BERT. Our model is composed of two vertically stacked bidirectional Long Short-Term Memory (LSTM) layers. Stacked LSTM has been successfully implemented and tested in past studies (He et al., 2016; Prakash et al., 2016), and is a good compromise in complexity between a shallow network and the current state-of-the-art models. Similarly, bidirectional LSTM is known to help build better representations, since it looks at the input from beginning to end and from end to beginning (Devlin et al., 2018; Graves et al., 2013). More generally, recurrent neural networks (RNN) process text timestep by timestep and allow informative long-term dependencies to be remembered through time as they pass through the network. Our intuition is that constructiveness relies on argumentation, which itself involves rather long and thought-through sentences (Kolhatkar et al., 2020). An LSTM-based architecture is therefore appropriate. To build word representations we use GloVe embeddings (Pennington et al., 2014) trained on 2 billion tweets, resulting in 200-dimensional vectors. The mapping from token to embedding is performed on the preprocessed data sets. Our model is implemented with Keras,1 which provides convenient lightweight implementation tools for deep learning. The model architecture is visualized in Appendix C, and the sets of hyperparameters for multiclass and binary classification are shown in Table 5.3. We find an optimal learning rate value with a function ("LR Finder") that measures loss while increasing the learning rate exponentially (Smith, 2018). The batch size, hidden layer size and dropout are tuned empirically until we find a good balance between training speed and learning curve. We train the models for 5 epochs while

1https://keras.io

using early stopping to prevent overfitting: the epoch values in Table 5.3 show that our models were unfortunately prone to quick overfitting, and were stopped after 4 and 3 epochs in the multiclass and binary settings, respectively. The input length is set to 150 tokens, which approximately corresponds to the 90th percentile of the review length distribution, and is a good compromise to keep most reviews untouched and reduce the few extremely long outliers. Finally, we use class weights during training to give more weight to under-represented classes: a penalty cost proportional to the imbalance is passed to the loss function when misclassification occurs.

Hyperparameter      Value (multiclass)    Value (binary)
Learning rate       0.005                 0.005
Batch size          64                    64
Epochs              4                     3
Input length        150                   150
Hidden layer size   128                   128
Dropout             0.1                   0.1
ES iterations       2                     2
ES tolerance        0.001                 0.001

Table 5.3.: Hyperparameter set for stacked bi-LSTM, where ES=Early Stopping, Batch size is the number of instances fed in the model at each step, Input length is the maximum number of tokens in each instance.
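A minimal Keras sketch of the architecture described above (two stacked bidirectional LSTM layers over GloVe-style embeddings) is shown below; the layer sizes follow Table 5.3, but the vocabulary size, output layer and embedding loading are assumptions rather than our exact implementation (the full architecture is in Appendix C).

```python
# Minimal sketch: stacked bidirectional LSTM in Keras (dimensions follow Table 5.3; details are assumptions).
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000   # placeholder vocabulary size
EMBED_DIM = 200      # GloVe Twitter vectors are 200-dimensional
MAX_LEN = 150        # maximum input length
N_CLASSES = 4        # 2 in the binary setting

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),                         # GloVe weights would be loaded here
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),   # first Bi-LSTM layer
    layers.Bidirectional(layers.LSTM(128)),                          # second, stacked Bi-LSTM layer
    layers.Dropout(0.1),
    layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```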

5.4.2. BERT

The second model we use, Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018), is considered a state-of-the-art model and is used in many NLP tasks involving language understanding or generation. We use BERT-Base for resource limitation reasons. In contrast with recurrent neural networks, BERT uses the "transformer" architecture, which allows simultaneous processing of inputs and simultaneous transformations within the network. Moreover, it contains more than 100 million parameters in total, where our model trains only 3.3 million parameters. We use a pre-trained version of BERT, and perform fine-tuning on our task to adapt the trainable weights to product review vocabulary and classification. An additional preprocessing step is performed in order to normalize the inputs to something readable by BERT. The lowercased text is re-tokenized using WordPiece, which drastically reduces the number of out-of-vocabulary words by splitting unknown words into several known pieces: for example, the word "drastic" and the suffix "ally" might be known by the model, so a potential unknown word like "drastically" will be split into "drastic" and "##ally". We implement BERT for classification using the ktrain Python library (Maiya, 2020), a lightweight wrapper for HuggingFace Transformers.2 The set of hyperparameters used for multiclass and binary classification is shown in Table 5.4. Once again the learning rate finder function is used to start the learning process as fast as possible. The "OneCycle" learning rate policy (Smith, 2017) is used to optimize the learning process by successively increasing and decreasing the learning rate. The batch size is limited to 6 by the library. Similarly to our bi-LSTM model, we choose 150 as the input length. For computational resources and time reasons, we only train BERT for 2 epochs.

2https://huggingface.co/transformers

In contrast with the other two models, which were first tuned on a validation set and retrained on the full training set, we tune and test BERT in a single run, therefore its training data is only composed of 2880 instances instead of 3200.

Hyperparameter          Value (multiclass)    Value (binary)
Learning rate           0.0001                0.00005
Learning rate policy    one cycle             one cycle
Batch size              6                     6
Epochs                  2                     2
Input length            150                   150

Table 5.4.: Hyperparameter set for BERT fine-tuning, where Batch size is the number of instances fed in the model at each step, Input length is the maximum number of tokens in each instance, and Learning rate policy, also called "scheduler", makes the learning rate vary according to a pattern.
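The ktrain workflow roughly corresponds to the sketch below; the model name "bert-base-uncased" and the toy training data are assumptions, and the learning rate and number of epochs mirror the binary column of Table 5.4.

```python
# Minimal sketch: BERT fine-tuning with ktrain (placeholder data; settings mirror Table 5.4, binary setting).
import ktrain
from ktrain import text

x_train = ["great camera, sharp pictures", "five stars"] * 50   # placeholder reviews
y_train = ["CD", "AB"] * 50                                     # placeholder binary labels

t = text.Transformer("bert-base-uncased", maxlen=150, class_names=["AB", "CD"])
trn = t.preprocess_train(x_train, y_train)
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, batch_size=6)

# learner.lr_find(show_plot=True)   # learning rate finder, used to pick the value below
learner.fit_onecycle(5e-5, 2)       # "OneCycle" policy, 2 epochs
```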

5.5. Evaluation

Throughout our experiments, we measure all our models' performance (including baseline performance) with a weighted F1 score. This metric allows us to take the imbalance between classes into account by weighting each class's F1 score with the number of instances corresponding to that class. The F1 computation is

F_1 = \frac{2 \cdot (P \cdot R)}{P + R}   (5.1)

where P is the Precision and R the Recall. Both precision and recall therefore have equal weight in the calculation. The score is computed for each class individually, and averaged with the number of positive instances in each class. The evaluation is performed with scikit-learn for all experiments. To compare the gap in performance between multiclass and binary classification, we measure the error reduction over the random baseline, calculated with

ErrRed = \frac{I}{E_R}   (5.2)

where I is the absolute improvement over the baseline in terms of F1 score, and E_R is the baseline's error rate. For example, say a random baseline is at 0.25 F1 for a 4-class classification problem, and a model achieves 0.60 F1 on a given test set; then the error reduction over the baseline is ErrRed = (0.60 − 0.25)/(1 − 0.25) = 0.47.
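Both evaluation quantities are straightforward to compute; the sketch below uses scikit-learn's weighted F1 on placeholder labels and reproduces the error reduction example from the text.

```python
# Minimal sketch: weighted F1 and error reduction over the baseline (placeholder labels).
from sklearn.metrics import f1_score

y_true = ["A", "B", "B", "C", "C", "C", "D", "D"]
y_pred = ["A", "B", "C", "C", "C", "B", "D", "C"]
print(f1_score(y_true, y_pred, average="weighted"))

def error_reduction(model_f1: float, baseline_f1: float) -> float:
    """Absolute improvement over the baseline divided by the baseline's error rate."""
    return (model_f1 - baseline_f1) / (1 - baseline_f1)

print(error_reduction(0.60, 0.25))  # roughly 0.47, as in the worked example above
```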

6. Results and Discussion

Our main interest concerns the ability of computational models to detect patterns of constructiveness in product reviews. To this end, we perform a series of experiments to specifically answer the four research questions established in Chapter 1, namely: comparing the difference in performance between multiclass and binary classification, evaluating the performance of individual features for constructiveness modeling, comparing classic feature-based models with neural networks, and measuring modeling generalization on out-of-domain data.

6.1. Results

Multiclass Binary Model CRC

Table 6.1.: Results for multiclass and binary classification experiments in terms of weighted F1 score, for both Gradient Boosting and neural network models. The second column (CRC

Our experiments are set up according to the settings detailed in Chapter 5. Table 6.1 shows the results of all our experiments for both multiclass and binary classification, with different feature sets as well as feature-based and neural network architectures, and for in- and out-of-domain test sets (in a binary setting only). It is important to note that the models used for multiclass classification are all trained on CRC

on development sets as well in order to better evaluate the generalization ability of each model, but for time reasons we choose to only collect and analyze test results.

6.2. Multiclass vs. Binary Classification

We have established a new labeling scheme designed to capture any product review's constructiveness in a multiclass setting, i.e. from "not constructive at all" to "completely constructive". To make our work comparable with previous efforts, we derive a binary scheme from our main scheme, i.e. "not constructive" or "constructive". One goal of this thesis is to show that our 4-class scheme can be used reliably in further experiments and that the ensuing labeled data can be classified with similar performance in both multiclass and binary settings. To that end, we have carried out, in Section 4.2, two inter-annotator agreement experiments suggesting that our 4-class and 2-class annotation schemes were reliable. We here benchmark several models' performance on multiclass and binary classification tasks on our CRC data set, and compare the results. Table 6.1 shows that on CRC

Figure 6.1.: Confusion matrix of a Gradient Boosting Machine trained and tested on CRC

Figure 6.1 shows a confusion matrix for one of our two best models, GBM with all features. The total number of instances can be found in Table 4.2. The figure shows good indicators of robust classification, i.e. a rather high number of true positives. Moreover, misclassified instances often end up in neighbour classes (class C is the direct neighbour of classes B and D), which supports the hypothesis that an ascending constructiveness scale can be modeled rather effectively. An exception to that is class A, which only reaches 0.47 recall individually, showing that class A is the hardest to predict. Indeed, as discussed in Chapter 4, class A holds different types of reviews based on several heuristics, which may confuse the model. Our other best performing model, using BERT, suffers from a similar issue and yields 0.51 recall on class A.
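Confusion matrices such as Figure 6.1 can be produced directly with scikit-learn; the sketch below uses placeholder predictions rather than our actual model outputs.

```python
# Minimal sketch: 4-class confusion matrix with scikit-learn (placeholder predictions).
from sklearn.metrics import confusion_matrix, classification_report

labels = ["A", "B", "C", "D"]
y_true = ["A", "A", "B", "B", "B", "C", "C", "C", "D", "D"]
y_pred = ["B", "A", "B", "B", "C", "C", "C", "B", "D", "C"]

print(confusion_matrix(y_true, y_pred, labels=labels))       # rows = true class, columns = predicted class
print(classification_report(y_true, y_pred, labels=labels))  # per-class precision, recall and F1
```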

On CRC_bin, BERT achieves the best performance with 0.85 F1, which corresponds to a 71% error reduction over the baseline, closely followed by our stacked bi-LSTM model, which yields 0.84 F1, and GBM with different feature sets. Figure 6.2 shows a confusion matrix for GBM with all features in a binary context. The total number of instances can again be found in Table 4.3. We notice a similar number of errors for both classes AB and CD, and that the majority class CD performs slightly better than the minority class AB (0.82 F1 against 0.81). These results suggest a good balance in our binary scale.

Figure 6.2.: Confusion matrix of a Gradient Boosting Machine trained and tested on CRC_bin, achieving 0.81 F1 score.

We deduce from our empirical observations that both multiclass and binary classification for constructiveness can be performed relatively well on product reviews. However, binary classification, which is by nature easier to perform than a multiclass task, achieves a higher error reduction over the baseline (71% at most against 57% at most for multiclass), suggesting that binary classification for product reviews is preferable to multiclass classification. Both sets of results could still be improved in future work by using machine learning techniques such as under-sampling or over-sampling, more thorough exploration of features, feature selection, etc.

6.3. Feature Performance

A key aspect of our work on constructiveness modeling is to find out which types of features provide the highest amount of information to computational models. On the flipside, finding out the least informative features is also valuable in that it gives us indications on features that might hurt the model. Here, we focus on measuring modeling performance of individual and combined feature sets. We can see in Table 6.1 that named entities perform poorly as an individual feature on every dataset (multiclass or binary, in- or out-of-domain). We believe that this is also due to the skewed aspect of the feature: for instance on CRC, out of 3200 instances, more than 2200 do not contain any named entities, resulting in no or nearly no association between the number of named entities and the constructiveness tag. However, by performing an extra ablation experiment, we find that GBM with all features except named entities only reaches 0.67 F1 on CRC in the multiclass setting, which suggests that the named entity feature still helps the model overall, and simply under-performs on its own. Performance on binary test sets remains unchanged with the ablation of the named entity feature. Table 6.1 also shows that on CRC

6.4. Feature-Based vs. Neural Networks

Investigating the best and worst individual features as well as the overall performance of combined features with a Gradient Boosting Machine gave us insight into how to model constructiveness with classic feature-based models. In this section, we take a

step back and compare feature-based models with two neural networks: a stacked bidirectional LSTM and BERT. Table 6.1 shows that on CRC

Figure 6.3.: Confusion matrix of BERT trained and tested on CRC

BERT also performs better than the other models when trained and tested on CRC_bin, and achieves 0.85 F1. In contrast, it under-performs on out-of-domain data (C3 and YNACC), seemingly because of a slight overfitting of CRC. On CRC_bin, the bi-LSTM model yields 0.84 F1 and even seems to overfit less on out-of-domain data, hence achieving 0.80 F1 on C3. On YNACC, none of the models perform well and we obtain the best results with GBM including only text quality features or length features individually (0.56 F1 each). We further discuss the reasons for the poor inter-domain performance in Section 6.5. Regardless of quantitative evaluations, transformer-based and LSTM-based models tend to solve an issue that often occurs in feature-based models: instead of simply relying on pre-defined features, they try to actually understand the input and detect deep semantic dependencies based on word meanings. For example, we feed BERT, trained on CRC, the following long review, the kind of input on which a model relying heavily on length features is easily misled:

h-o-r-r-i-b-l-e disney should consider the actual toddler that has to wear these too narrow and she could not even walk in them without the back of her foot popping out of them the strap is positioned way too low to keep the shoe on her foot while walking toddlers cant wear flats without a decent strap to hold their feet in if any reviewer said these shoes were great then they did not buy them for a toddler do yourself a favor go to payless shoes com they offer a similar shoe with a strap that will keep your girlie's foot in the shoe and the shoe has a decent width

Another issue occurring with feature-based models on this particular data set is the strong bias due to customer review redundancy: we have explained in Section 4.1 that we rely on a set of heuristics for annotation, in order to gain time. The reviews that share the same lexical patterns are all labeled the same way, resulting in this type of misclassification: the following preprocessed review contains the words "five stars", which happens to be a frequent Amazon default review text that we always label as class A; the model's reliance on lexical features biased it to classify the review as class A, even though we labeled it as class C:

i would have rated this five stars but it freezes up often i really like the variety of games and they are always adding new ones please leave older games on to play as well offers hours of exciting play

These evaluations show that constructiveness classification on product reviews can be performed with both feature-based and neural network models, which reach comparable performance. However, the analysis of two shortcomings of feature-based models suggests that neural networks are more robust when it comes to understanding complex dependencies, intent or general meaning, even though they take much longer to train. The stacked bidirectional LSTM we proposed reaches error reduction results similar to those of the GBM and BERT, and offers a good balance between model complexity and performance on several test sets.

6.5. In- vs. Out-of-Domain

Since constructiveness classification is a relatively new task, it is important to compare our work on new data with previous work. For this purpose, we have created a binary scale derived from our main annotation scheme in order to compare the classification of product reviews (CRC) and article comments (C3 and YNACC). The three rightmost columns of Table 6.1 show that with 0.80 F1, the Gradient Boosting model with all features manages to effectively classify news article comments from C3 when trained on CRC_bin. Shortly behind in performance, our stacked bi-LSTM also reaches a good score (0.78 F1). BERT seems to not generalize well at all and only reaches 0.64 F1. As mentioned before, we believe that this is due to overfitting, as we have seen similar patterns in our preliminary experiments on the stacked bi-LSTM when training for too many epochs, which still achieved decent performance on CRC_bin, but very poor results on C3 and YNACC. Our models all perform poorly on the third data set, YNACC. The best performance is set by text quality and length features alone, improving the baseline by only 0.06 F1 (12% error reduction). The stacked bi-LSTM does not improve over the baseline and BERT yields 0.43 F1, meaning a 0.07 F1 deterioration under the baseline (14% error increase). A possible explanation is the type of data we are trying to classify. YNACC is composed of news article comments, but the annotation for constructiveness is performed on the comment thread level only, which forces us to consider each comment of a constructive thread as equally constructive, and vice versa with non-constructive threads. This experiment was inspired by Kolhatkar and Taboada (2017a,b), who train a classifier on

YNACC before testing on their own data set. Kolhatkar et al. (2020) train and test several models on their own data set and another data set containing YNACC comments in order to show cross-domain adaptation, and obtain good results (up to 0.84 F1). In their paper, "cross-domain" means that the topics discussed in online news articles differ from one data set to another; in this thesis, we use the term "cross-domain" to compare article comments and product reviews. It seems that models trained on product reviews are able to adequately classify in-domain data, but show decreasing performance when the annotation of the out-of-domain data is flawed: here, the scheme used to annotate C3 resembles ours, whereas the annotation scheme for YNACC completely differs. Table 6.1 also shows that when moving away from in-domain data, less complex models perform increasingly better than complex models: indeed, BERT, which is by far our most complex model, achieves the best performance on CRC_bin, does not generalize well to C3, on which GBM with all features achieves the top performance, and a single feature suffices to outperform both neural networks on YNACC. This could suggest that constructive structures in product reviews might differ from those in article comments. However, given that both of our neural models seem to overfit the training data, and that the annotation scheme of YNACC does not resemble ours, it is hard to draw meaningful conclusions from these experiments regarding concrete inter-domain constructiveness structuring. Our observations encourage us to validate our models' ability to generalize to resembling out-of-domain data such as news article comments when trained on product review data, but have also shown that classification performance substantially decreases when testing on inappropriate data.

7. Conclusion

To answer this thesis' research questions, we have hand-annotated 4000 product reviews, the Corpus for Review Constructiveness (CRC), with a constructiveness tag by following our new 4-class annotation scheme. Inter-annotator agreement experiments have verified the reliability of this scheme, and modeling constructiveness has shown promising results as well as good indicators that an improvement in performance was still possible. We have compared multiclass and binary classification for constructiveness analysis on our data set, and shown that a consistent improvement over a random baseline could be obtained with advanced sets of features and the use of state-of-the-art pre-trained models such as BERT. It has also been shown that the different features picked for this task are not equally important during classification, as they do not perform equally well individually. Most of these features contribute to the good performance of our feature-based Gradient Boosting Machine because they reflect well the concept and definition of constructiveness. However, their importance strongly depends on the classification setting, i.e. multiclass or binary. A classic feature-based model was compared to much more complex neural network models, which, despite showing equivalent results in terms of pure performance, address two issues revealed by our qualitative analysis, namely length feature bias and lexical feature bias. The main drawback of state-of-the-art neural networks is their complexity and fine-tuning time, therefore we have introduced a compromise architecture based on bidirectional LSTM cells, which is tunable much faster than BERT but performs slightly worse. Finally, we have compared in- and out-of-domain performance of our models in a binary context, on two different data sets already annotated for constructiveness. We have found that when the out-of-domain data set provides an annotation on the comment level, classification using non-complex models trained on in-domain data yields performance similar to that of in-domain testing. In contrast, testing on out-of-domain data annotated on the comment thread level was not successful, suggesting that the two types of data were too dissimilar. In future work, it would be interesting to tackle some of the shortcomings we faced in this thesis. For instance, it would be useful to design a labeling taxonomy that does not rely as much on heuristics but simply performs a component-wise computation of Table 4.1. Concerning the experimental setup, it would be interesting to run a complete set of feature ablation experiments complementary to our work, in order to have access to the full scope of feature usefulness.

Appendices

A. Amazon categories in CRC

The following two lists give the names of the data files used to build our Corpus for Review Constructiveness (CRC). The link for each release can be found in 3.1.

A.1. Amazon official Dataset 1995-2015

Wireless, Video games, Toys, Software, Shoes, Music, Mobile Apps, Jewelry, Home, Electronics, Camera, Books v1, Digital Ebook Purchase, Luggage, Baby.

A.2. Amazon Review Data 1996-2018 (University of California San Diego)

The data was downloaded from the "Small subsets" section (5-core). Appliances, All Beauty, Art Crafts and Sewing, Gift Cards, Industrial and Scientific, Musical Instruments, Movies and TV, Sports and Outdoors.

B. Examples of review annotations with 4-class scheme

B.1. Class A

• Q: What could be more painfully dull than sitting through this movie? A:Sitting through it TWICE–the second time as part of a class on multiculturalism. A trait shared by all substandard-to-mediocre movies is predictability, and this movie is about 95% predictable. That's a lazy way to make a film, as well as an insult to the audience's intelligence. After the first few scenes, you know exactly how the movie will evolve. There are no surprises. The characters are wooden and one-dimensional. Good actors were wasted in this movie, which is preachy and moralistic while having nothing important to say. Amazingly, this has become a must-see in classes having to do with "issues" such as racism. I suppose there are people out there who found the movie thought-provoking and interesting, but these are individuals who feel more comfortable being told what to think rather than use critical-thinking skills to interpret more nuanced works. If you really feel compelled to rent this movie, feel free to turn it off after the first five minutes: you'll be able to guess the rest. → uses a lot of sarcasm and aims for destruction, even though it brings some kind of argument.

• Four Stars → Amazon template rating, does not bring anything.

• does not last very long if you are doing long term erasing and kind of worse is that you have to buy over $25 worth of stuff just to get one pack of these crappy eraser but overall good for the money. → Rather destructive and disrespectful, does not aim at all for improvement.

• Why would you rate a gift card? → Rhetorical and sarcastic question.

• Nice try thanx → Probably sarcastic and uninformative.

• This movie is really bad. I never understood why this movie won 11 Oscars. The actors are very bad in this movie, especially Leonardo di Caprio and most of the scenes are unbelieveble and stupid (di Caprio with his arms extended and yelling "I am the King of the world is ridiculous. Kate Winslet's character really sucks. How can she fall in love with such a jerk!!! It doesn't make any sense. There were much, but much better movies in 1997 The plot: The commonplace: the rich girl falls for the poor but goodhearted guy. That's it. Come on! Except( maybe) for the special effects, you don't get anything from this movie. Don't buy this piece of garbage! They can release it in any form. That's not going to change the fact that this movie SUCKS. Don't waste your money. Save your money for something better. → Absolutely toxic and hateful, does not bring anything and just unloads frustration and sarcasm.

B.2. Class B

• It seems like a good sight for a tactical rifle. → "seems" shows user non-experience, rather short.

• Very flimsy. Lots of noise too. → Short description of flaws, but no argumentation whatsoever.

• This movie was pretty good. But, I can see why I don’t remember it being in the theaters. → Tells shortly about the quality of the movie, but uses sarcasm without argumentation.

• Broke inside in two days. → Proof of user experience and gives simple feedback, but no argumentation whatsoever nor description of how it broke.

B.3. Class C

• Wonderful material, prints like a dream. The glow intensity is good but nothing to write home about. I’m making some Kuchi Kopi nightlights with it before diving into my brain slug project. → positive and respectful, states the qualities of the product. The last part is irrelevant and personal.

• Not impressed. Too many mistakes for a Disney movie (just one example: when Ella is riding her horse 'bareback' , the prince stops her & you see she's wearing riding gloves/or has reigns wrapped around her hands then they disappear next scene). Unoriginal. I thought the young lady whom portrayed Cinderella was lovely though. Sad that they thought there had to be so much cleavage for a children's movie. My 8 year old son even said cover up. You do make movies for children, right Disney Studios? I wonder what Walt would think about so many of your releases lately? → Makes some pertinent points, well written and argues well, although slightly sarcastic.

B.4. Class D

• It has horrible acting, very poor plot, and some definite awkward moments. But, Miami Connection is such an honest and earnest movie, that you can't help but love it. It's dripping with cheese, and the morale of the story is a bit skewed, but again, it's so endearing that I couldn't help but have a huge goofy smile on my face the entire time. The few interspersed fight scenes are actually quite good, with YK Kim and company displaying some pretty impressive fighting acumen. Overall, this film should not be missed by film enthusiasts. It's worth watching simply for the excellent "Friends Forever" song performed near the beginning of the movie. This movie contains Ninja fight scenes. → Even though quite rough about the flaws of the movie, still turns the review in a positive manner, and provides good argumentation.

• Nice cover. Great protection. Good value. Heavy enough, without being too heavy or too sti. Easily waterproof. I feel my table is safe from spills, kids, and my cat. Fits nice. Looks good. Should be all 5-star reviews. → Although it lacks some connectors, the description of the product is extremely informative.

C. Stacked Bidirectional LSTM Architecture

Figure C.1.: Stacked Bidirectional LSTM model architecture.

Bibliography

Aggarwal, Charu C and ChengXiang Zhai (2012). "A survey of text classification algorithms". In: Mining text data. Springer, pp. 163–222.
Badjatiya, Pinkesh, Shashank Gupta, Manish Gupta, and Vasudeva Varma (2017). "Deep learning for hate speech detection in tweets". In: Proceedings of the 26th International Conference on World Wide Web Companion, pp. 759–760.
Bengio, Yoshua, Patrice Simard, and Paolo Frasconi (1994). "Learning long-term dependencies with gradient descent is difficult". IEEE transactions on neural networks 5.2, pp. 157–166.
Breiman, Leo, Jerome Friedman, Charles J Stone, and Richard A Olshen (1984). Classification and regression trees. CRC press.
Chang, Chih-Chung and Chih-Jen Lin (2011). "LIBSVM: A library for support vector machines". ACM transactions on intelligent systems and technology (TIST) 2.3, pp. 1–27.
Ciresan, Dan Claudiu, Ueli Meier, Jonathan Masci, Luca Maria Gambardella, and Jürgen Schmidhuber (2011). "Flexible, high performance convolutional neural networks for image classification". In: Twenty-Second International Joint Conference on Artificial Intelligence.
Davidson, Thomas, Dana Warmsley, Michael Macy, and Ingmar Weber (2017). "Automated hate speech detection and the problem of offensive language". In: Eleventh international aaai conference on web and social media.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova (2018). "BERT: Pre-training of deep bidirectional transformers for language understanding". arXiv preprint arXiv:1810.04805.
Diakopoulos, Nicholas (2015). "Picking the NYT picks: Editorial criteria and automation in the curation of online news comments". ISOJ Journal 6.1, pp. 147–166.
DiMaggio, Paul J and Walter W Powell (1983). "The iron cage revisited: Institutional isomorphism and collective rationality in organizational fields". American sociological review, pp. 147–160.
Dos Santos, Cicero and Maira Gatti (2014). "Deep convolutional neural networks for sentiment analysis of short texts". In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pp. 69–78.
Dost, Ayfer and Bilge Yagmurlu (2008). "Are constructiveness and destructiveness essential features of guilt and shame feelings respectively?" Journal for the Theory of Social Behaviour 38.2, pp. 109–129.
Drucker, Harris, Donghui Wu, and Vladimir N Vapnik (1999). "Support vector machines for spam categorization". IEEE Transactions on Neural networks 10.5, pp. 1048–1054.
Fleiss, Joseph L (1971). "Measuring nominal scale agreement among many raters." Psychological bulletin 76.5, p. 378.
Flesch, Rudolph (1948). "A new readability yardstick." Journal of applied psychology 32.3, p. 221.
Fortuna, Paula and Sérgio Nunes (2018). "A Survey on Automatic Detection of Hate Speech in Text". ACM Comput. Surv. 51.4 (July 2018). issn: 0360-0300. doi: 10.1145/3232676. url: https://doi-org.ezproxy.its.uu.se/10.1145/3232676.
Freund, Yoav and Robert E Schapire (1999). "Large margin classification using the perceptron algorithm". Machine learning 37.3, pp. 277–296.

Friedman, Jerome H (2001). "Greedy function approximation: a gradient boosting machine". Annals of statistics, pp. 1189–1232.
Fujita, Soichiro, Hayato Kobayashi, and Manabu Okumura (2019). "Dataset Creation for Ranking Constructive News Comments". In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, July 2019, pp. 2619–2626. doi: 10.18653/v1/P19-1250. url: https://www.aclweb.org/anthology/P19-1250.
Gautam, Vasundhara and Maite Taboada (2019). "Constructiveness and Toxicity in Online News Comments".
Ghose, Anindya and Panagiotis G Ipeirotis (2006). "Designing ranking systems for consumer reviews: The impact of review subjectivity on product sales and review quality". In: Proceedings of the 16th annual workshop on information technology and systems, pp. 303–310.
Goldberg, Yoav and Omer Levy (2014). "word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method". arXiv preprint arXiv:1402.3722.
Graves, Alex, Navdeep Jaitly, and Abdel-rahman Mohamed (2013). "Hybrid speech recognition with deep bidirectional LSTM". In: 2013 IEEE workshop on automatic speech recognition and understanding. IEEE, pp. 273–278.
Haagerup, Ulrik (2017). Constructive news: How to save the media and democracy with journalism of tomorrow. ISD LLC.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun (2016). "Deep residual learning for image recognition". In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
Hochreiter, Sepp and Jürgen Schmidhuber (1997). "Long short-term memory". Neural computation 9.8, pp. 1735–1780.
Hosmer Jr, David W, Stanley Lemeshow, and Rodney X Sturdivant (2013). Applied logistic regression. Vol. 398. John Wiley & Sons.
Hutto, Clayton J and Eric Gilbert (2014). "Vader: A parsimonious rule-based model for sentiment analysis of social media text". In: Eighth international AAAI conference on weblogs and social media.
Joshi, Aditya, Pushpak Bhattacharyya, and Mark J Carman (2017). "Automatic sarcasm detection: A survey". ACM Computing Surveys (CSUR) 50.5, pp. 1–22.
Kendall, Maurice G (1938). "A new measure of rank correlation". Biometrika 30.1/2, pp. 81–93.
Kolhatkar, Varada and Maite Taboada (2017a). "Constructive language in news comments". In: Proceedings of the First Workshop on Abusive Language Online, pp. 11–17.
Kolhatkar, Varada and Maite Taboada (2017b). "Using New York Times Picks to Identify Constructive Comments". In: Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism. Copenhagen, Denmark: Association for Computational Linguistics, Sept. 2017, pp. 100–105. doi: 10.18653/v1/W17-4218. url: https://www.aclweb.org/anthology/W17-4218.
Kolhatkar, Varada, Nithum Thain, Jeffrey Sorensen, Lucas Dixon, and Maite Taboada (2020). "C3: The Constructive Comments Corpus. Jigsaw and Simon Fraser University". doi: 10.25314/ea49062a-5cf6-4403-9918-539e15fd7b52.
Kolhatkar, Varada, Hanhan Wu, Luca Cavasso, Emilie Francis, Kavan Shukla, and Maite Taboada (2019). "The SFU Opinion and Comments Corpus: A corpus for the analysis of online news comments". Corpus Pragmatics, pp. 1–36.
Landis, J. Richard and Gary G. Koch (1977). "The Measurement of Observer Agreement for Categorical Data". Biometrics 33.1, pp. 159–174. issn: 0006341X, 15410420. url: http://www.jstor.org/stable/2529310.

Lapidot-Lefler, Noam and Azy Barak (2012). "Effects of anonymity, invisibility, and lack of eye-contact on toxic online disinhibition". Computers in human behavior 28.2, pp. 434–443.
Li, Jiwei, Myle Ott, Claire Cardie, and Eduard Hovy (2014). "Towards a general rule for identifying deceptive opinion spam". In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1566–1576.
Maiya, Arun S (2020). "ktrain: A Low-Code Library for Augmented Machine Learning". arXiv preprint arXiv:2004.10703.
Manning, Christopher D, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky (2014). "The Stanford CoreNLP natural language processing toolkit". In: Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, pp. 55–60.
Napoles, Courtney, Joel Tetreault, Aasish Pappu, Enrica Rosato, and Brian Provenzale (2017). "Finding good conversations online: The yahoo news annotated comments corpus". In: Proceedings of the 11th Linguistic Annotation Workshop, pp. 13–23.
Ni, Jianmo, Jiacheng Li, and Julian McAuley (2019). "Justifying Recommendations using Distantly-Labeled Reviews and Fine-Grained Aspects". In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 188–197. doi: 10.18653/v1/D19-1018. url: https://www.aclweb.org/anthology/D19-1018.
Niculae, Vlad and Cristian Danescu-Niculescu-Mizil (2016). "Conversational markers of constructive discussions". arXiv preprint arXiv:1604.07407.
Nobata, Chikashi, Joel Tetreault, Achint Thomas, Yashar Mehdad, and Yi Chang (2016). "Abusive language detection in online user content". In: Proceedings of the 25th international conference on world wide web, pp. 145–153.
Ovando, Martha N (1994). "Constructive Feedback". International Journal of Educational Management.
Park, Deokgun, Simranjit Sachar, Nicholas Diakopoulos, and Niklas Elmqvist (2016). "Supporting comment moderators in identifying high quality online news comments". In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 1114–1125.
Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011). "Scikit-learn: Machine Learning in Python". Journal of Machine Learning Research 12, pp. 2825–2830.
Pennington, Jeffrey, Richard Socher, and Christopher D. Manning (2014). "GloVe: Global Vectors for Word Representation". In: Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. url: http://www.aclweb.org/anthology/D14-1162.
Prakash, Aaditya, Sadid A Hasan, Kathy Lee, Vivek Datla, Ashequl Qadir, Joey Liu, and Oladimeji Farri (2016). "Neural paraphrase generation with stacked residual lstm networks". arXiv preprint arXiv:1610.03098.
Reich, Zvi (2011). "User comments". Participatory journalism: Guarding open gates at online newspapers, pp. 96–117.
Ren, Yafeng and Donghong Ji (2017). "Neural networks for deceptive opinion spam detection: An empirical study". Information Sciences 385, pp. 213–224.
Schmidt, Anna and Michael Wiegand (2017). "A survey on hate speech detection using natural language processing". In: Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pp. 1–10.

Smith, Leslie N (2017). "Cyclical learning rates for training neural networks". In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, pp. 464–472.
Smith, Leslie N (2018). "A disciplined approach to neural network hyper-parameters: Part 1–learning rate, batch size, momentum, and weight decay". arXiv preprint arXiv:1803.09820.
Swanson, Reid, Brian Ecker, and Marilyn Walker (2015). "Argument mining: Extracting arguments from online dialogue". In: Proceedings of the 16th annual meeting of the special interest group on discourse and dialogue, pp. 217–226.
Turian, Joseph, Lev Ratinov, and Yoshua Bengio (2010). "Word representations: a simple and general method for semi-supervised learning". In: Proceedings of the 48th annual meeting of the association for computational linguistics. Association for Computational Linguistics, pp. 384–394.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin (2017). "Attention Is All You Need". CoRR abs/1706.03762. arXiv: 1706.03762. url: http://arxiv.org/abs/1706.03762.
Yang, Zhilin, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le (2019). "XLNet: Generalized Autoregressive Pretraining for Language Understanding". In: Advances in Neural Information Processing Systems 32. Ed. by H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett. Curran Associates, Inc., pp. 5753–5763. url: http://papers.nips.cc/paper/8812-xlnet-generalized-autoregressive-pretraining-for-language-understanding.pdf.
