Hey Siri. Ok Google. Alexa: A topic modeling of user reviews for smart speakers
Hanh Nguyen, Bocconi University ([email protected])
Dirk Hovy, Bocconi University ([email protected])

Proceedings of the 2019 EMNLP Workshop W-NUT: The 5th Workshop on Noisy User-generated Text, pages 76-83, Hong Kong, Nov 4, 2019.

Abstract

User reviews provide a significant source of information for companies to understand their market and audience. In order to discover broad trends in this source, researchers have typically used topic models such as Latent Dirichlet Allocation (LDA). However, while there are metrics to choose the "best" number of topics, it is not clear whether the resulting topics can also provide in-depth, actionable product analysis. Our paper examines this issue by analyzing user reviews from the Best Buy US website for smart speakers. Using coherence scores to choose topics, we test whether the results help us to understand user interests and concerns. We find that while coherence scores are a good starting point to identify a number of topics, manual adaptation based on domain knowledge is still required to provide market insights. We show that the resulting dimensions capture brand performance and differences, and differentiate the market into two distinct groups with different properties.

1 Introduction

The Internet has provided a platform for people to express their opinions on a wide range of issues, including reviews for products they buy. Listening to what users say is critical to understanding product usage, helpfulness, and opportunities for further product development to deliver a better user experience. User reviews - despite some potentially inherent biases [1] - have quickly become an invaluable (and cheap) form of information for product managers and analysts (Dellarocas, 2006). However, the speed, amount, and varying format of user feedback also create a need to effectively extract the most important insights.

Topic models, especially LDA (Blei et al., 2003), are one of the most widely used tools for these purposes. However, due to their stochastic nature, they can present a challenge for interpretability (McAuliffe and Blei, 2008; Chang et al., 2009). This is less problematic when the analysis is exploratory, but proves difficult if it is to result in actionable changes, for example product development. The main dimension of freedom in LDA is the number of topics: while there are metrics to assess the optimal number according to a criterion, it is unclear whether the resulting topics provide us with a useful discrimination for product and market analysis. The question is: "Can we derive market-relevant information from topic modeling of reviews?"

We use smart speakers as a test case to study LDA topic models for both high-level and in-depth analyses. We are interested in answering the following research questions:

- What are the main dimensions of concern when people talk about smart speakers?
- Can the LDA topic mixtures be used to directly compare smart speakers by Amazon, Google, Apple, and Sonos?

Smart speakers are a type of wireless speaker that provides a voice interface, letting people use spoken input to control household devices and appliances. While still relatively new, smart speakers are rapidly growing in popularity. As the Economist (2017) put it: "voice has the power to transform computing, by providing a natural means of interaction." We use a dataset of smart speaker reviews and coherence scores as a metric to choose the number of topics, and evaluate the resulting model both in terms of human judgement and in its ability to meaningfully discriminate brands in the market.

[1] People tend to over-report negative experiences, while some positive reviews are bought (Hovy, 2016).
(c) 2019 Association for Computational Linguistics

Contributions: We show that LDA can be a valuable tool for user insights: 1) basic user concerns can be distinguished with LDA by using coherence scores (Röder et al., 2015) to determine the best number of topics, but an additional step is still needed for consolidation; 2) human judgement correlates strongly with the model findings; 3) the extracted topic mixture distributions accurately reflect the qualitative dimensions to compare products and distinguish brands.

2 Dataset

2.1 Data collection

From the Best Buy US website, we collect a dataset of 53,273 reviews for nine products from four brands: Amazon (Echo, Echo Dot, Echo Spot), Google (Home, Home Mini, Home Max), Apple (HomePod), and Sonos (One, Beam). Each review includes a review text and the brand associated with it. Our collection took place in November 2018. Due to their later market entries and significantly smaller market sizes, the number of available Apple and Sonos reviews is limited. Amazon, Google, Apple, and Sonos reviews account for 53.9%, 41.1%, 3.5%, and 1.5% of the dataset, respectively.

2.2 Review text pre-processing

We pre-process the review text as follows: First, we convert all text to lowercase and tokenize it. We then remove punctuation and stop words. We build bigrams and remove any remaining words with 2 or fewer characters. Finally, we lemmatize the data. The statistics of the resulting bag-of-words representation are described in Table 1.

Table 1: Summary of dataset.
# reviews: 53,273
# words: 1,724,842 (raw) -> 529,035 (after pre-processing)
# unique words: 25,007 (raw) -> 10,102 (after pre-processing)

3 Methodology

3.1 Topic extraction

The main issue in LDA is choosing the optimal number of topics. To address this issue, we use the coherence score (Röder et al., 2015) of the resulting topics. This metric is more useful for interpretability than choosing the number of topics by held-out data likelihood, which is a proxy and can still result in semantically meaningless topics (Chang et al., 2009).

The question is: what is coherent? A set of topic descriptors is said to be coherent if the descriptors support each other and refer to the same topic or concept. For example, "music, subscription, streaming, spotify, pandora" is more coherent than "music, machine, nlp, yelp, love." While this difference is obvious to human observers, we need a way to quantify it algorithmically.

Coherence scores are a way to do this. Several versions exist, but the one used here has the highest correlation with human ratings (Röder et al., 2015). It takes the topic descriptors and combines four measures of them that capture different aspects of "coherence": 1) a segmentation S_set^one, 2) a boolean sliding window P_sw(110), 3) the indirect cosine measure with normalized pointwise mutual information (NPMI) m̃_cos(nlr), and 4) the arithmetic mean of the cosine similarities σ_a.

The input to the scoring function is a set W of the N top words describing a topic, derived from the fitted model. The first step is their segmentation S_set^one. It measures how strongly W* supports W' by quantifying the similarity of W* and W' in relation to all the words in W:

    S_set^one = { (W', W*) | W' = {w_i}; w_i ∈ W; W* = W }

In order to do so, W' and W* are represented as context vectors v(W') and v(W*) by pairing them with all words in W:

    v(W') = { Σ_{w_i ∈ W'} NPMI(w_i, w_j)^γ }_{j = 1, ..., |W|}

The same applies to v(W*). In addition:

    NPMI(w_i, w_j)^γ = ( log[ (P(w_i, w_j) + ε) / (P(w_i) · P(w_j)) ] / (−log(P(w_i, w_j) + ε)) )^γ

An increase of γ gives higher NPMI values more weight. ε is set to a small value to prevent a logarithm of zero. We choose γ = 1 and ε = 10^−12.

Second, the probability P_sw(110) captures proximity between word tokens. It is the boolean sliding-window probability, i.e., the number of documents in which the word occurs, divided by the number of sliding windows of size s = 110.

Third, given context vectors u = v(W') and w = v(W*) for the word sets of a pair S_i = (W', W*), the similarity of W' and W* is the cosine similarity between the context vectors:

    s_cos(u, w) = ( Σ_{i=1}^{|W|} u_i · w_i ) / ( ‖u‖_2 · ‖w‖_2 )

Finally, the cosine similarity measures are averaged, giving us a single coherence score for each model (each model has a different number of topics).

Figure 1: Example of word intrusion task in the survey.

In the word intrusion task, each question contains the words with the highest probabilities in that topic, and one random word with low probability in that topic but high probability (top 5 most frequent words) in another topic. The word that does not belong to the topic is called the true intruder word. The hypothesis of word intrusion is that if the topics are interpretable, they are coherent, and subjects will consistently choose the true intruder words.

For topic k, let w_k be the true intruder word and i_{k,s} the intruder selected by subject s. S is the number of subjects. The model precision for topic k is defined as the fraction of subjects agreeing with the model:

    MP_k = Σ_s 1(i_{k,s} = w_k) / S

The model precision ranges from 0 to 1, with higher values indicating a better model.

Figure 2: Example of topic intrusion task in the survey.

For the topic intrusion task, each survey subject is shown a short review text and is asked to choose the group of words which they think does not describe the review (Fig. 2). Each group of words represents a topic. Each question is comprised of the 3 topics with the highest probabilities LDA assigned to that review, and 1 random topic with low probability.
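To make these definitions concrete, the per-topic coherence computation (segmentation S_set^one, NPMI context vectors, cosine similarity, arithmetic mean) and the model precision MP_k can be sketched in plain NumPy. This is a simplified illustration under our own assumptions - toy data, each short document treated as one sliding window, and no attempt to reproduce Röder et al.'s exact probability estimation - not the authors' implementation:

```python
import numpy as np

def sliding_windows(tokens, size=110):
    # Boolean sliding windows over one document (whole doc if shorter than size).
    if len(tokens) <= size:
        return [set(tokens)]
    return [set(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

def cv_coherence(top_words, docs, size=110, gamma=1.0, eps=1e-12):
    """Simplified C_V-style coherence for one topic's top words W."""
    wins = [w for d in docs for w in sliding_windows(d, size)]
    n = len(wins)
    # P_sw: boolean sliding-window probabilities (marginal and joint)
    p = {w: sum(w in win for win in wins) / n for w in top_words}
    pj = {(a, b): sum(a in win and b in win for win in wins) / n
          for a in top_words for b in top_words}
    def npmi(a, b):
        num = np.log((pj[a, b] + eps) / (p[a] * p[b]))
        return (num / -np.log(pj[a, b] + eps)) ** gamma
    def ctx(subset):
        # Context vector v(W'): one NPMI sum per word w_j in W
        return np.array([sum(npmi(a, b) for a in subset) for b in top_words])
    v_all = ctx(top_words)  # W* = W
    # Segmentation S_set^one: pair each single word {w_i} with the full set W
    sims = [float(u @ v_all / (np.linalg.norm(u) * np.linalg.norm(v_all)))
            for u in (ctx([w]) for w in top_words)]
    return float(np.mean(sims))  # sigma_a: arithmetic mean of cosine similarities

def model_precision(selected, true_intruder):
    """MP_k: fraction of subjects who picked the true intruder word."""
    return sum(s == true_intruder for s in selected) / len(selected)
```

For full-scale runs on real reviews, gensim's CoherenceModel with the "c_v" setting provides an optimized implementation of the same measure.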