Recognizing Themes in Amazon Reviews Through Unsupervised Multi-Document Summarization
Hanoz Bhathena, Jeff Chen, William Locke
[email protected] [email protected] [email protected]

Abstract

Amazon now has an overwhelming number of reviews for popular products, and shoppers routinely spend a lot of time reading through hundreds or thousands of reviews while making mental notes of important themes. We propose a system for unsupervised extractive summarization using deep learning feature extractors combined with several different models: k-means, affinity propagation, DBSCAN, and PageRank. We also propose a system for unsupervised abstractive summarization using a deep learning model. These summarization models capture the major themes across all the reviews of a product and optionally report the number of users who mentioned each theme, giving shoppers a sense of salience. Since summarization in an unsupervised setting is notoriously difficult to evaluate, we propose a suite of metrics: ROUGE-1, semantic similarity, sentiment accuracy, attribute match, and content preservation, as ways to measure the quality of a summary. We found that extractive summarization is able to capture important themes, though sometimes the exemplar sentence is not well chosen. We found that abstractive summarization is able to generate human-readable text, though content preservation is challenging. Our baseline of k-means clustering scored 3.48 and 3.44 for attribute match and content preservation, respectively, and we achieved 3.72 and 3.93, respectively, with PageRank.

1 Introduction

The number of reviews for products on Amazon is overwhelming, as it is common to see products with more than 1,000 reviews. The shopper who just wants to know the common themes is left to read pages upon pages of reviews and take mental notes of recurring topics. We would like a system that outputs themes when given a set of product reviews on Amazon. Since labeled datasets mapping Amazon reviews to summaries do not exist and are very costly to produce, we chose to take on the problem of generating summaries using unsupervised learning, with no ground truth summaries available to us.

We first propose an unsupervised extractive summarization algorithm using sentence embeddings along with clustering to automatically extract the most common themes among a set of reviews. Concretely, given a set of reviews for a particular product, our system will output a list of sentences with the most salient themes in the review set. We also propose an unsupervised abstractive summarization model that will directly write a summary when given a subset of reviews for the product.

We face several distinct challenges in achieving this. First, since the number of salient themes found in a given set of reviews might differ between products, our system will need a way of deciding upon the appropriate number of themes to assign to a given product. Second, the summarization should express a given theme in a way that is concise yet comprehensive. Third, since our problem is unsupervised, we need to establish an appropriate set of metrics that can act as proxies in place of ground truth comparisons.
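To make the extractive approach concrete, the following is a minimal sketch of the embed-cluster-extract pipeline described above, using k-means as in our baseline. The function and variable names are our own illustration, and the sentence encoder that produces the embeddings is assumed to be supplied separately.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_distances

def extract_themes(sentences, sentence_embeddings, n_themes=5):
    """Cluster review sentences and return one exemplar sentence per theme.

    sentences: list of review sentences for one product.
    sentence_embeddings: (n_sentences, dim) array from any sentence encoder.
    Returns (exemplar sentence, cluster size) pairs; the size is the salience count.
    """
    km = KMeans(n_clusters=n_themes, random_state=0).fit(sentence_embeddings)
    themes = []
    for c in range(n_themes):
        members = np.where(km.labels_ == c)[0]
        # Exemplar: the member sentence closest to the cluster centroid.
        dists = cosine_distances(sentence_embeddings[members],
                                 km.cluster_centers_[c].reshape(1, -1)).ravel()
        exemplar = members[dists.argmin()]
        themes.append((sentences[exemplar], len(members)))
    return themes
```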
2 Related Work

A recent paper demonstrates the use of centroid-based summarization through word embeddings [1]. Rossiello et al. determine the central topic of an article by summing the embeddings of the important words, as dictated by term frequency-inverse document frequency (TF-IDF), encode each sentence by aggregating its word vectors, and finally rank sentences by the cosine distance between each sentence vector and the centroid. This method performed better than centroid-based summarization using a Bag of Words model [2], where sentence vectors are determined by word frequencies. Yatani et al., in their analysis of adjective-noun word pair extraction methods for online review summarization, demonstrate that showing the importance of summaries helps users build an impression of the product [3]. The ROUGE-N score has been used as a way to evaluate summarization algorithms; Chu et al. [4] demonstrate that it can be used as a proxy in unsupervised settings as well. They also demonstrate the possibility of using a trained text classifier as an indicator of the accuracy of the summary. The same work describes a way to automatically generate abstractive summaries in an unsupervised learning setting by making use of sequence-to-sequence autoencoders and sequence similarity. We discuss more of the details behind this model, including our implementation, in the abstractive summarization sections of this paper. Finally, there has been further work on abstractive summarization using neural nets [5].

3 Dataset

Our dataset is sampled from a large dataset consisting of 142.8 million product reviews (along with metadata) collected from Amazon for the period May 1996 to July 2014 [6]. This larger dataset is divided into 24 product categories (e.g., books, electronics, video games). We base our work on a subset of reviews made for products in the electronics category, which contains around 7.8 million reviews for 476k products. Each item in this set of reviews also includes a number of meta-attributes such as a reviewer name and ID, an Amazon Standard Identification Number (ASIN), the ratio of users that marked the review helpful to those that found it unhelpful, the overall star rating (out of 5.0), and the date and time the review was made. 90% of the reviews have fewer than 200 words, as seen in Figure 1. Additionally, 20% of (electronics) products have 10 or more reviews. An example of a review is shown in Figure 2.

Figure 1: Histogram of number of words per review.

Figure 2: Sample review.

4 Methods: Models and Evaluation

4.1 Machine Evaluation

Currently, the most widely used metric to evaluate summarization is the ROUGE-N score, which measures the overlap of n-grams between the reference summaries and those generated by an automatic summarization system. However, ROUGE has its limitations when it comes to correlation with human judgment of the quality of a summary, since having overlapping words does not necessarily indicate the meanings are the same. Furthermore, ROUGE becomes a particularly tricky metric for unsupervised text summarization, since we do not have real ground truth summaries. We therefore adopt heuristics such as taking the entire review set that needs to be summarized as the ground truth. We also need additional or alternative metrics, and we choose two such metrics to evaluate our unsupervised summarizations in an automated way.

Review sentiment classifier: A review generally has a rating between 1 and 5. Using a trained review rating classifier, we can score both the documents we are trying to summarize and the summaries we generate. If the difference between the average predicted rating for the summaries and that for the raw product reviews is small, we have a good summarizer. This metric inherently measures the consistency of sentiment or tonality between the original reviews and their summaries, which should be as high as possible.
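A minimal sketch of this sentiment-consistency check follows; the rating classifier and its predict interface are hypothetical stand-ins for whatever trained review-rating model is available.

```python
import numpy as np

def sentiment_consistency_gap(review_texts, summary_texts, rating_classifier):
    """Gap between the mean predicted star rating of the original reviews and
    that of the generated summaries; a smaller gap means sentiment is better preserved.

    `rating_classifier.predict` is assumed (hypothetically) to map a list of
    texts to predicted ratings in the range 1-5.
    """
    review_ratings = np.asarray(rating_classifier.predict(review_texts))
    summary_ratings = np.asarray(rating_classifier.predict(summary_texts))
    return abs(review_ratings.mean() - summary_ratings.mean())
```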
Semantic similarity: In addition to word overlap and sentiment consistency, we need an automatic measure of overall content preservation. We discuss this in more detail in the human evaluation section below, but replicating the human process for measuring content preservation is a hard problem for a machine. We therefore rely on semantic similarity as a proxy. After encoding every sentence into a fixed-dimensional vector representation, we calculate its cosine similarity with the original reviews. The sentence embeddings, and by extension the word embeddings which are the inputs to the model that creates them, will be more similar if the summaries contain words whose meanings are similar to those in the original reviews. However, these words do not need to be exact matches, as they do for ROUGE. For example, "customer" and "client" would have a high semantic score but a zero ROUGE score.

4.2 Human Evaluation

While automatic evaluation metrics such as ROUGE, semantic similarity, and sentiment accuracy provide a good indication of the performance of our models, we found it helpful to include a number of human evaluation metrics to either corroborate these results or provide additional insight into the performance of each method. To this end, in our clustering-based approaches we primarily used human evaluation for two purposes: a content preservation metric that measures how well exemplar sentence summaries preserve the theme content of their respective clusters, and an attribute match metric that measures how well the sentiment applied to a given feature reflects the general sentiment applied to that same feature in the cluster. We applied human evaluation by taking a random sample of summary sentences and scoring them on a standard Likert scale (between 1 and 5). For content preservation, we asked: "Does the content of the summary represent the most commonly described features in the cluster review sentences?" For attribute match, we asked: "Does the summary represent how this feature is generally described in the cluster review sentences?" We applied the same two human evaluations to abstractive summarization and PageRank, but instead of comparing the summary to a review cluster we compared it to the entire review set.

4.3 Sentence Embeddings

Before we perform any summarization, we first need to encode review sentences using a sentence encoder. We experimented with the pre-trained text encoders available through TensorFlow Hub [7]. We utilized two types of text encoders: a neural network language model [7] based on Bengio et al. [8] and the Word2Vec model of Mikolov et al. [9]. The neural network language model is an n-gram language model in which a window of the previous n-1 words is used to predict the current word.
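As an illustration of how such an encoder plugs into the semantic-similarity metric from Section 4.1, the sketch below loads a publicly available TF-Hub NNLM text embedding module and computes the cosine similarity between two sentence vectors. The specific module handle is only an example and may differ from the modules used in our experiments.

```python
import numpy as np
import tensorflow_hub as hub

# Example TF-Hub text encoder (128-dimensional NNLM embeddings); the exact
# module handle used in our experiments may differ.
encoder = hub.load("https://tfhub.dev/google/nnlm-en-dim128/2")

review_sentence = ["The customer service was excellent."]
summary_sentence = ["Great client support."]

a = encoder(review_sentence).numpy()[0]
b = encoder(summary_sentence).numpy()[0]

# Cosine similarity: high even without exact word overlap (zero ROUGE overlap here).
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"semantic similarity: {cosine:.3f}")
```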