
Leveraging Collective Intelligence of Human and Computer in Recommender Systems

A DISSERTATION SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY

Shuo Chang

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy

Advisor: Loren G. Terveen

August, 2016

© Shuo Chang 2016
ALL RIGHTS RESERVED

Acknowledgements

Through the five years of my graduate school, I have received tremendous support from my family, friends and colleagues. This thesis would not have been possible without them. First and foremost, I thank my advisor Dr. Loren Terveen for sharing his wisdom and helping me to grow. He gave me the freedom to pursue my research interests and challenged me to be better. During difficult times, he consoled me with his kindness and patience. I am forever grateful for what I have learned from him during these five years. I would also like to thank my committee members — Dr. Joseph Konstan, Dr. George Karypis and Dr. Yuqing Ren — for providing valuable feedback. I thank Ed Chi, Peng Dai and Elizabeth Churchill for giving me internship opportunities and being awesome mentors. During the internships, I was fortunate to receive guidance and support from Atish Das Sarma, Lichan Hong, Jilin Chen and Zhiyuan Cheng. I am very thankful to be a member of the GroupLens research lab, which will always have a special place in my heart. I'd like to thank Dr. John Riedl for welcoming me to the GroupLens family. I truly enjoyed working and being friends with Max Harper, Vikas Kumar, Daniel Kluver, Tien Nguyen, Hannah Miller, Jacob Thebault-Spieker, Aditya Pal, Isaac Johnson, Qian Zhao, Dr. Brent Hecht and the other members of GroupLens. Last but not least, I would like to thank my parents for giving me the inspiration to pursue a PhD, and Steve Gartland and Merlajean Gartland for making me feel at home in a foreign country. Finally, I am thankful to my wife, Jingnan Zhang, for her encouragement, patience and love. I am lucky to have her to share all the laughter and tears while pursuing my dream.

Dedication

To John Riedl, a greatly missed mentor and role model.

Abstract

Recommender systems, since their introduction 20 years ago, have been widely deployed in web services to alleviate user information overload. Driven by the objectives of their applications, the focus of recommender systems has shifted from accurately modeling and predicting user preferences to offering a good personalized user experience. The latter is difficult because there are many factors, e.g., the tenure of a user, the context of a recommendation and the transparency of the recommender system, that affect users' perception of recommendations. Many of these factors are subjective and not easily quantifiable, posing challenges to recommender algorithms. When pure algorithmic solutions are at their limits in providing a good user experience in recommender systems, we turn to the collective intelligence of human and computer. Computers and humans are complementary to each other: computers are fast at computation and data processing and have accurate memory; humans are capable of complex reasoning, being creative and relating to other humans. In fact, such close collaboration between human and computer has precedent: after chess master Garry Kasparov lost to the IBM computer "Deep Blue", he invented a new form of chess — advanced chess — in which a human player and a computer program team up to play against other such pairs. In this thesis, we leverage the collective intelligence of human and computer to tackle several challenges in recommender systems and demonstrate designs of such hybrid systems. We make contributions to the following aspects of recommender systems: providing a better new user experience, enhancing the topic modeling component for items, composing better recommendation sets and generating personalized natural language explanations. These four applications demonstrate different ways of designing systems with collective intelligence, applicable to domains other than recommender systems. We believe the collective intelligence of human and computer can power more intelligent, user friendly and creative systems, worthy of continued research effort in the future.

Contents

Acknowledgements

Dedication

Abstract

List of Tables

List of Figures

1 Introduction
  1.1 What is Collective Intelligence?
  1.2 Challenges in Recommender System
  1.3 Leverage Collective Intelligence in Recommender Systems
  1.4 Research Platforms
    1.4.1 MovieLens
    1.4.2 Google+
  1.5 Thesis Overview

2 Interactive New User Onboarding Process for Recommender Systems
  2.1 Introduction
  2.2 Related work
  2.3 Design Space Analysis
    2.3.1 Design space
    2.3.2 Design challenges

  2.4 Feasibility study
    2.4.1 Data
    2.4.2 Method
    2.4.3 Baseline
    2.4.4 Metrics
    2.4.5 Results
  2.5 User Experiment
    2.5.1 Overview of the cluster-picking interface
    2.5.2 Method
    2.5.3 Results
  2.6 Discussion

3 Automatic Topic Labeling for Multimedia Social Media Posts
  3.1 Introduction
  3.2 Related Work
    3.2.1
    3.2.2 Topic Extractors and Annotators
    3.2.3 Topic Labeling for Twitter
  3.3 Our Approach
    3.3.1 Single-Source Annotators
    3.3.2 Crowdsourcing Training Labels
    3.3.3 Supervised Ensemble Model
  3.4 Evaluation
    3.4.1 Evaluation Setup
    3.4.2 Binary Classification for Main Topic Labels
    3.4.3 Multiclass Classification of Topic Labels
  3.5 Discussion

4 CrowdLens: Recommendation Powered by Crowd
  4.1 Introduction
  4.2 Related work
  4.3 The CrowdLens Framework
    4.3.1 Recommendation Context: Movie Groups

    4.3.2 Crowd Recommendation Workflow
    4.3.3 User Interface
  4.4 Experiment
    4.4.1 Obtaining Human Quality Judgments
  4.5 Study: Recommendations
    4.5.1 Different pipelines yield different recommendations
    4.5.2 Measured Quality: Algorithm slightly better
    4.5.3 Judged Quality: Crowdsourcing pipelines slightly preferred
    4.5.4 Diversity: A trend for crowdsourcing
    4.5.5 Crowdsourcing may give less common recommendations
    4.5.6 Recency: Little difference
    4.5.7 Discussion
  4.6 Study: Explanations
    4.6.1 Explanation quality
    4.6.2 Features of Good Explanations
    4.6.3 Discussion
  4.7 Summary Discussion

5 Personalized Natural Language Recommendation Explanations Powered by Crowd
  5.1 Introduction
  5.2 Design Space and Related Work
  5.3 System Overview
  5.4 Experiment Platform
  5.5 Overview of System Processes
    5.5.1 Select Key Topical Dimensions of Items
    5.5.2 Generate Natural Language Explanations for The Key Topical Dimensions
    5.5.3 Model Users' Preferences and Present Explanations in a Personalized Fashion
  5.6 User Evaluation
    5.6.1 Survey Design

    5.6.2 Results
  5.7 Discussion

6 Conclusion
  6.1 Summary of Contributions
  6.2 Implications
  6.3 Future Work

References

Appendix A. Glossary and Acronyms
  A.1 Additional Figures

List of Tables

2.1 Example of weighted aggregation of ranked item lists. List 1 has 2 points and list 2 has 1 point. We assign scores to items in the lists based on ranking. We re-rank items based on the weighted average of item scores in two lists and take top 3.
2.2 User-expressed familiarity with recommended movies and prediction accuracy. Familiarity is represented by the average number of movies (out of 10) that users had heard of or seen. Prediction accuracy is measured by average RMSE.
3.1 Comparison of two answer statistics between with and without verifiable questions ("VQ" and "No VQ"). Workers spend longer time on tasks and have higher chance to reach agreement when there are verifiable questions.
3.2 Unbalanced class distribution in gold standard data set.
3.3 Comparison of the best performing ensemble model and single-source annotators, listed above. "All annotator" predicts the union of topic labels from single-source annotators as positive. The best performing algorithm is GBC trained on all five features, having the best overall F1-score here.
3.4 The F1-scores of the best-performing model of every classification algorithm for multiclass classification along with their feature combinations. GBC performs the best. All ensemble learning algorithms outperform the baseline algorithm, which consistently predicts the most popular category, "Main or Important".
4.1 Final recommendation lists from the five pipelines for the movie group shown in Figure 4.1.

4.2 Average number of overlapping movies recommended from any pair of pipelines across 6 movie groups. Each pipeline generates 5 movies as final recommendations. Standard deviations are included in parentheses.
4.3 Measured and human-judged quality of recommended movies. For a movie group in a pipeline, measured quality is computed as the average of ratings (on a 5 star scale) on recommended movies from MovieLens users who indicated a preference for the movie group. User judgments from the online evaluation range from -2 (very inappropriate) to 2 (very appropriate). Both columns show the average for all pipelines across six movie groups.
4.4 Extracted features of recommendation explanations, along with their effect and statistical significance in a logistic regression model to predict whether an explanation is evaluated to be good or bad by MovieLens users.
5.1 Summary of questions asked in the 3-stage user survey. We asked users to respond on 5-point Likert scale for the above statements. The "Stage" column shows the stage the statements were shown. The "Metric" column shows how we processed responses to measure the corresponding rubrics.

List of Figures

1.1 The homepage of MovieLens, featuring movie recommendations.
1.2 Homepage of Google+, featuring a feed of social media posts from friends and recommender algorithm.
2.1 Screen shot of the interface for 14 movie groups, 4 of which are visible, and 6 preference points to allocate.
2.2 An overview of the design challenges in our group-based preference elicitation process.
2.3 Results of simulated group-based preference elicitation compared to a baseline. "Baseline" shows the performance of the rate 15 movies process. The two simulated cluster-picking strategies are "simulation", where users pick the cluster with movies they've rated the highest, and "optimal", where users pick the cluster that results in the best predictions. Lower RMSE scores are better and higher F-measure scores are better. The chart on the left shows that prediction accuracy is slightly worse than the baseline in most cases while the chart on the right shows that recommendation accuracy is significantly improved compared with the baseline.
2.4 Cluster quality measured by Silhouette Coefficient vs. number of clusters. For each cluster number, we run the clustering algorithm 10 times and plot the standard deviation with error bars. This figure shows two local maxima around 5 and 14 clusters.
2.5 Interface used for evaluating top-10 recommended movies. Participants are asked to answer a list of survey questions regarding top-10 recommendations shown to the left.

2.6 Time to finish preference elicitation process. Time is measured in minutes. The group-based approach takes less than half as long as the baseline, on average.
2.7 Survey results about the easiness and fun of our group-picking process. We compare picking one group with distributing 3 or 6 points across groups. The percentages summarize the responses after combining disagree/strongly disagree, and agree/strongly agree. Respondents think point allocation is easier and more fun than picking a single favorite.
2.8 Survey results about recommendation quality, comparing our group-based process with the baseline rate 15 process. The percentages summarize the responses after combining disagree/strongly disagree, and agree/strongly agree. Respondents answer more positively concerning the group-based recommendations.
2.9 Survey response regarding whether the groups are comprehensible. The percentages summarize the responses after combining disagree/strongly disagree, and agree/strongly agree. The movie groups presented to users were scrutable; both the example movies and the example tags contributed to this outcome.
3.1 An example of a Google+ post with a Youtube video attached. The text of the post does not match with the attached video, which makes topic labeling difficult.
3.2 Approach overview. We crowdsource judgment of topic labels (described in Section 3.3.2). Then we train a supervised ensemble model with human evaluated topic labels (described in Section 3.3.3).
3.3 Google+ allows users to attach multiple types of attachment (i.e., image, video and link to web page) to a post.
3.4 Task template used on MTurk to evaluate relevance of topic labels to a Google+ post. The post, left out from the screen shot, is embedded in the task in similar way as Figure 3.1. Workers can play videos and click onto web page in the post.

3.5 Comparison of best models of each classification algorithm on binary classification task. The four best models are trained on all five features after exhaustively searching feature combinations. Y axis shows the F1-score. GBC has the highest F1-score.
3.6 Feature analysis for GBC model on binary classification task. We show best performing models trained on 1 to 5 features. The x axis shows the F1-scores of models.
3.7 Comparison of best models of each prediction algorithm for multi-class classification of topic label relevance. The features used by best models are summarized in Table 3.4. Y axis shows the F1-score. GBC has the highest F1-score.
3.8 Feature analysis for multi-class classification of topic label relevance. The x axis shows F1-score of GBC algorithm trained on different sets of features. Feature of topicality scores topic has the most predictive power. Somewhat surprisingly, prob, hashtag, type bring little improvement.
4.1 An example movie group. New users in MovieLens express their taste on movie groups.
4.2 CrowdLens user interface. The four core components are task instructions, example recommendations, multi-category search, and the worker-generated recommendation list. The interface for generating examples is the same except without component 2 - examples.
4.3 Five pipelines for producing recommendations. The human-only and algorithm-assisted pipelines were instantiated with both MovieLens volunteers and paid crowdworkers from Mechanical Turk. The MovieLens recommendation algorithm served as a baseline and was the source of input examples for the algorithm-assisted pipelines.
4.4 Interface for collecting judgments of recommendations and explanations.
4.5 Diversity of recommendations. Y axis shows average pairwise similarities between tag genome vectors (lower values are more "diverse").
4.6 Popularity of recommended movies. Y axis shows number of ratings from all MovieLens users on a natural log scale.

4.7 Recency of recommended movies. Y axis shows number of days since release for movies on natural log scale (lower values are more "recent").
5.1 Example natural language explanations for the movie "Gravity". Depending on our model of a user's interest, our system selects one of the three explanations for the user.
5.2 System overview. There are two inputs to the system - topic labels and review data. We show three procedures of the system in bold text. The first two procedures use mixed computation, with crowd and algorithm, while the last one only uses algorithm.
5.3 Interface shown to crowd workers for refining clustering result.
5.4 An example of crowd-refined tag clusters for the movie "Goodfellas". The clustering algorithm separated the tags into the four groups shown. The crowdworkers curated the tags: a strikethrough indicates bad cluster fit, a crossout indicates inappropriateness for explanations, and red/bold indicates selection as the representative tag for the cluster.
5.5 Interfaces used for the Mapper phase of MapReduce work flow in the "generate natural language explanations for the key topical dimensions" step.
5.6 Interfaces used for the Reduce phase of MapReduce work flow in the "generate natural language explanations for the key topical dimensions" step.
5.7 Survey responses to questions regarding efficiency. Natural language explanations - labeled with "crowd" - contained a more appropriate amount of information (a) and helped subjects more with decision-making (b).
5.8 Difference in interest before and after watching the trailer to measure effectiveness; no statistically significant difference.
5.9 Survey responses to questions regarding trust. Participants trusted natural language explanations more than tag explanations.
5.10 Survey responses to questions regarding satisfaction. Across the three questions, participants gave more positive responses for natural language explanations as compared with tag explanations.

A.1 Interface for first step in three-step user survey that evaluates recommendation explanations.
A.2 Interface for second step in three-step user survey that evaluates recommendation explanations.
A.3 Interface for third step in three-step user survey that evaluates recommendation explanations.

Chapter 1

Introduction

Information overload, a term popularized by Alvin Toffler in Future Shock as far back as 1970, has become an increasingly common experience for people living in the current information age. Boosted by the rapid adoption of Internet-connected devices, approximately 39%1 of the global population now has access to the Internet. With increased bandwidth and development of technology, Internet users are generating a tremendous amount of content, i.e., posts, pictures and videos, on websites such as Facebook, Twitter and YouTube. People are constantly interrupted by emails at work, overwhelmed by messages on Facebook or Twitter and, even for entertainment, faced with too many choices on YouTube or Netflix. Recommender systems, together with the closely related technology of search engines, are widely applied to alleviate information overload. Recommender systems are computational systems that suggest items of interest to people [1], deployed in various online services. For example, on Amazon, the recommender system models users' interest based on purchases and browsing behavior and then recommends items to buy. On Pandora, an online radio service, the recommender system selects songs to play based on the likes and dislikes users have indicated on songs. Designing a recommender system that offers good user experience is a long standing challenge, despite advances in collaborative filtering - the underlying technology. Since the inception of recommender systems [2], the past 20 years have witnessed increasingly accurate predictions of user preferences through continuous research and engineering effort.

1 KPCB Internet trends report: http://www.kpcb.com/internet-trends

However, apart from accurate models of user preferences, there are many factors contributing to the final user experience with recommender systems [3]. For example, new users of a recommender system may want more familiar recommendations to establish trust with the system, while long term users may be interested in novel recommendations to discover new items. Compared with the well defined task of predicting user preferences on items, serving recommendations that result in good user experience is a more difficult task. In this thesis, we propose to leverage the collective intelligence of computer and human to improve user experience with recommender systems. We identify the opportunity of combining the strengths of computer and human and demonstrate how to design systems that leverage such collective intelligence.

1.1 What is Collective Intelligence?

Collective intelligence, according to Malone et al. [4], is defined as "groups of individuals acting collectively in ways that seem intelligent". The term, which first appeared in the 1800s, is used to describe activities observed in economics, biology, sociology and other fields, from the invisible hand (Smith, 1887) that controls the allocation of resources in human society to the coordination of ant colonies. Coming from the perspective of computer science, we focus on interpreting the "individuals" in the definition of "collective intelligence" as either humans or computers. This focus is determined by our expertise and knowledge. Therefore, in the context of this thesis, "collective intelligence" refers to the combination of artificial intelligence and human intelligence. With the wide adoption of computing devices such as PCs and smartphones, the collective intelligence of human and computer is becoming more ubiquitous, creating wonders not possible before. For example, the volunteer-contributed online encyclopedia — Wikipedia — is the largest knowledge base humans have ever created. In fact, English Wikipedia alone, with 2.9 billion words, is over 60 times larger than the next largest English encyclopedia — Encyclopædia Britannica2. The wiki software not only serves as a tool that supports collaboration among over 100K volunteers but also takes an active role in protection against vandalism and management of article quality3.

2 https://en.wikipedia.org/wiki/Wikipedia:Size_comparisons

As another example, the Google search engine ranks the importance of 100,000,000 gigabytes4 of web pages by mining patterns from the web links Internet users around the globe have created, answering 40K search queries every second. Computer and human are complementary to and enhance each other, creating a symbiotic relationship. On one hand, humans guide and teach computers to do tasks such as recognizing images, understanding natural language and identifying vandal edits to Wikipedia articles. On the other hand, computers automate these tasks, increasing efficiency and reliability. For example, software deployed on Wikipedia catches 40% of vandalism with a false positive rate of 0.1%, freeing Wikipedia contributors' valuable time from fighting against vandalism.

1.2 Challenges in Recommender System

By offering personalized experiences, recommender systems improve user satisfaction and consequently increase consumption or sales, well aligned with the goals of online businesses: Netflix wants to show users movies they would like to watch and make them watch more movies; Facebook wants to present posts users like to read and grab users' attention; Amazon suggests items users are likely to buy to increase sales. The field of recommender system research has shifted its focus from accurately predicting user preferences to improving overall user experience [3], driven by wide commercial applications of recommender systems. The early research interest in recommender systems was sparked by the collaborative filtering algorithm [2], a variant of the k-nearest-neighbor algorithm that predicts users' ratings on items. As a result, subsequent research in recommender systems focused on algorithms [5] that improve the accuracy of rating predictions, as demonstrated by the Netflix prize5. Practitioners of recommender systems in business settings, however, showed less interest in accuracy of prediction than in metrics such as Click Through Rate (measuring the percentage of presented recommendations with clicks) and overall user satisfaction. These practical concerns led researchers [3, 6, 7] to take a human-centered computing approach to designing both the algorithms and the interfaces of recommender systems.

3 http://www.wired.com/2015/12/wikipedia-is-using-ai-to-expand-the-ranks-of-human-editors/
4 https://www.google.com/search/about/insidesearch/howsearchworks/crawling-indexing.html
5 http://www.netflixprize.com/

Designing recommender systems that offer good user experience, however, is a challenging task due to our limited understanding of human perception of recommendations. There are numerous factors, e.g., the tenure of a user, the context of a recommendation and the transparency of the system, that affect users' experience with recommender systems. For example, some users of a music recommender system may enjoy listening to music they are familiar with, while other users may be interested in exploring new music. Even the same users may want different recommendations at different times: they sometimes are in the mood for exploring unfamiliar movies, while at other times they want to quickly find a relevant movie to watch. In addition, research on user perceptions of recommender systems is challenging due to researchers' limited access to real recommender systems in natural settings. Unlike algorithmic research, which can be done with a dataset of user ratings, user experience related research requires field studies with real users of recommender systems, which are scarce in research communities. Even with a good understanding of the factors related to user perception of recommendations, algorithms still struggle to jointly optimize for all these factors. Past studies [7, 8, 9] have shown that relevance, popularity, familiarity, topic diversity and serendipity are all important factors for a good experience with a collection of recommendations, e.g., lists of movies on Netflix and rows of items on Amazon. Algorithms can optimize for well defined metrics such as accuracy and relevance, but struggle with subjective factors, such as familiarity and serendipity, that lack clear measurement. Therefore, even for the common task of recommending a collection of items, recommender algorithms still face great challenges.

1.3 Leverage Collective Intelligence in Recommender Systems

When algorithms are at their limits, we can leverage the collective intelligence of algorithm and human to improve the user experience of recommender systems. Recommender systems, in fact, inherently use this collective intelligence: recommender algorithms make predictions of users' preferences for movies based on other users' ratings or clicks, which encode human wisdom. In practice, we have various design choices for workflows that combine the strengths of human and computer. One way, which has received the most research interest in the past, is designing algorithms that extract information from existing human generated data. Alternatively, we can actively elicit human work and incorporate human effort into the computation process. In this thesis, we demonstrate various designs of collective intelligence systems through the following four applications in recommender systems:

• A1: improving new user experience with a more efficient user bootstrapping process;

• A2: enhancing content modeling of items to increase the relevance of recommendations;

• A3: making more satisfying recommendations with crowd participation in the recommendation process;

• A4: creating personalized natural language explanations for recommendations.

1.4 Research Platforms

In this thesis, we conduct research on two platforms - MovieLens and Google+. The former is a research-oriented movie recommender system with a small community and the latter is a popular commercial social network service.

1.4.1 MovieLens

MovieLens is a movie recommendation website with thousands of monthly active users, used extensively as a research platform. First launched in the Fall of 1997, MovieLens was among the first to use a Collaborative Filtering recommender algorithm. Through word-of-mouth and unsolicited press coverage, including in The New Yorker, the community on MovieLens has grown steadily over the past 20 years, with 20-30 new user sign-ups every day. Currently, over 5000 users log onto MovieLens each month. The main feature of MovieLens is to provide movie recommendations (shown in Figure 1.1) based on user provided ratings and tags.

Figure 1.1: The homepage of MovieLens, featuring movie recommendations.

MovieLens users express their movie preferences by rating movies. Recommender algorithms model users' interest through these ratings and help users explore potentially interesting movies, prominently shown on the homepage. This constitutes a feedback loop: users provide ratings that improve recommendations, and algorithms present better recommendations that result in more ratings. In addition, users can apply tags to movies, describing their fine-grained opinions about movies. MovieLens has been a popular platform for research. MovieLens datasets are the most widely used datasets in the recommender systems research community (140,000+ downloads in 2014 and 7,500+ references in publications in Google Scholar). Apart from recommender system research, there have been studies on user engagement and user interactions in MovieLens by GroupLens Research. Users on MovieLens are aware of the research purpose of the platform and are willing to volunteer, making MovieLens a valuable resource for researchers.

Figure 1.2: Homepage of Google+, featuring a feed of social media posts from friends and recommender algorithm.

1.4.2 Google+

Google+ is a popular social network that focuses on discovery and interest6. Similar to other social networks, users on Google+ connect with each other, join communities and share posts, images and videos. A recent redesign of the website puts more focus on helping users share and explore information they are interested in. Therefore, users not only see updates from connected users but also posts recommended by an algorithm that models users' interests based on their activities. Following the growing popularity of images and videos, an increasing proportion of posts on Google+ are multimedia, consisting of text, images and videos. In this thesis, we study the recommender system for Google+ to show how collective intelligence can be applied to a commercial recommender system. Access to Google+ data and infrastructure was kindly supported by Google during the author's internship at Google Research.

6 https://googleblog.blogspot.com/2015/11/introducing-new-google.html

1.5 Thesis Overview

The core contribution of this thesis is the approach of applying the collective intelligence of computer and human to tackle existing challenges and enable new applications in the domain of recommender systems. The following chapters demonstrate several ways to design systems that leverage collective intelligence, each focusing on a problem in recommender systems:

• Chapter 2 focuses on new user experience with recommender systems. We revisit the long-standing "cold start" problem of recommender algorithms, which require sufficient user preference information to offer good personalization. We design an interactive preference elicitation process that asks new users to express preferences on stereotypical tastes, which are extracted from user ratings. This chapter demonstrates how to computationally mine information from existing human generated data and use it to improve user interactions.

• Chapter 3 looks at the problem of modeling the topics of social media posts to increase the relevance of social recommendations. We aim to improve the accuracy of automatic topic labelers on multimedia posts through crowdsourced human judgments. This chapter demonstrates how to leverage human wisdom on demand at scale to improve computational processes, and offers practical guidelines for quality control of such collective intelligence systems.

• Chapter 4 and Chapter 5 show two studies on human and computer hybrid recommender systems. Different from the previous two chapters, we experiment with directly incorporating crowd effort into the recommendation process. In the first study, we ask crowd workers to make recommendations for the stereotypical tastes introduced in Chapter 2 and explain their recommendations. Compared with algorithm-generated recommendations, we find several positive properties of crowd recommendations and benefits of algorithm support in the recommendation process. In the second study, we focus on the clear strength of human recommendation – being able to explain recommendations in natural language. With a hybrid process of computation and crowd effort, we can generate personalized natural language explanations for recommendations at scale, enabling a new feature for recommender systems.

Finally, the concluding Chapter 6 summarizes the contributions of these studies, discusses implications of this thesis and points to future directions of research.

Chapter 2

Interactive New User Onboarding Process for Recommender Systems

2.1 Introduction

In this chapter, we tackle a long standing challenge for recommender systems – onboarding new users. Similar to Collaborative Filtering, we use algorithms to extract human wisdom from users' ratings: we computationally extract item groups that represent stereotypical user preferences and elicit new users' preferences with these item groups. In this chapter, we demonstrate a classical way of leveraging collective intelligence to address an issue in recommender systems. New user experience is crucial to all online communities. Communities will inevitably die without a constant supply of new users [10], because of turnover of existing members. Most newcomers to a website decide whether to stay or come back during their first visits. New users vote with their feet if they do not find websites to be interesting or beneficial to them. Recruiting new users to websites that offer personalized experiences is especially challenging. Recommender systems that power personalization can only generate high quality recommendations with enough data about new users, who are not committed and are likely to leave when given work to do.

Therefore, there are two seemingly contradictory goals – providing an efficient and easy mechanism for eliciting preferences, and delivering high quality recommendations based on relatively little data. For recommender systems, research communities refer to this challenge as the "cold start problem". In addition to the user cold start problem described above, there is also the item cold start problem, which occurs when new items are added to a recommender system without user preference information about these items. The combination of the two — user cold start and item cold start — is the system cold start problem, which occurs when a system is first released [11]. In practice, there are two common approaches to address this problem. In the first approach, users are admitted directly into a system with no required work, leading to a non-personalized initial experience. In the second approach, users are asked to complete a preference elicitation process before gaining access to a personalized view of the site. In this chapter, we are primarily concerned with the second approach, as we focus our attention on systems where the core features are built around personalization. Designing a good preference elicitation process needs to balance the required user effort and the resulting quality of personalization. Systems with difficult or slow preference elicitation processes may waste user time and effort, and may see decreased activity. For example, in the previous version of MovieLens1, new users were required to provide ratings on at least 15 movies as part of the sign-up process. Our analysis of MovieLens data shows that users take an average of 6.8 minutes to complete this process, and 12.6% of these users fail to complete the process and never even get to the front page! But on the other hand, systems that do not learn user preferences will deliver poorer recommendations, may not gain users' trust, and therefore may receive less use. A series of algorithm focused studies [11, 12, 13, 14, 15, 16] show diminishing improvements in personalization, while a few studies [17] that experiment with new user interaction in preference elicitation show promising results. Algorithmic studies focus on choosing an initial set of items for users to rate so that the information about user preferences is maximized with fixed user effort. McNee et al. [17] showed that giving users more control lowers perceived effort even though users spend more time in the process. Though further optimizing the selection of items for rating remains an open problem, we posit that larger advances may be possible by re-thinking the user interaction model.

1www..org 12 we posit that larger advances may be possible by re-thinking the user interaction model. With the goal of improving new user experience, we computationally mine user pref- erences from users’ ratings and design an interactive preference elicitation process. We are motivated by a simple intuition that runs contrary to the assumptions in prior stud- ies: perhaps the process of rating individual items to bootstrap user profiles is inherently inefficient. An obvious alternative to rating individual items is to rate groups of items that represent stereotypical preferences. We hypothesize that, similar to how Collabora- tive Filtering algorithms compute user preference from ratings, we can computationally derive groups of items that are understandable, recognizable and representative of pref- erence space, and consequently improve the classical methods of preference elicitation. Such a system would ask relatively few questions to minimize the time required, but would be able to turn the answers to those questions into a valuable source of informa- tion for bootstrapping personalization. In this paper, we evaluate a new user preference elicitation process that combines automatic clustering algorithms with interfaces for capturing preferences about groups of items. We hypothesize that this type of system will allow users to accurately express their preferences, and quickly give them high quality personalization. Specifically, we evaluate our group-based process with two research questions:

RQ1-Minimal user effort. Is group-based preference elicitation efficient, flexible, and easy for users to understand?

RQ2-High quality recommendation. Does group-based preference elicitation lead to an accurate model of user preferences and therefore high-quality personalized recommendations?

We study these research questions using multiple methods. To determine the theoretical feasibility of our algorithmic approach, we run an offline simulation analysis based on data from MovieLens, our research platform. To better evaluate the real-world characteristics of our system, we follow this with a user experiment with MovieLens users where we compare our process to a baseline process that has been in production for over ten years. The rest of this chapter is structured as follows. First, we survey related work and situate our contribution. Then, we describe the design challenges related to generating high-quality groups of items, eliciting preferences for those items, and turning those preferences into recommendations. We follow this by describing the methods and results of two related experiments: a simulation-based feasibility study, and an online user experiment. We conclude with a discussion of the lessons learned from these experiments and the advantages and trade-offs we perceive in using a group-oriented preference elicitation process.

[Figure 2.1 callouts: representative tags, representative movies, points assigned to a cluster]

Figure 2.1: Screen shot of the interface for 14 movie groups, 4 of which are visible, and 6 preference points to allocate.

2.2 Related work

Most existing work on preference elicitation for new users in recommender systems focuses on algorithmic solutions that maximize the information gained about new users. However, recent work has called attention to the need for research on user experience that includes both interaction design and algorithm development to address this problem [3]. Past algorithm-focused research presented various methods [11, 12, 13, 14, 15, 16] for selecting the items that are presented to new users for rating. Because Collaborative Filtering (CF), the most widely used family of recommender algorithms, bases recommendations on user ratings, the quality of recommendation is directly affected by which items users rate: CF struggles to model a user's preferences from less known items due to a lack of ratings from other users. Rashid et al. [12] framed the preference elicitation problem as finding the optimal item sequence for new users to rate so that users spend the least amount of time and receive accurate rating predictions from the recommender system. Rashid et al., and Elahi et al. later, proposed heuristics that generate sequences of items based on popularity, entropy of ratings [12] and predicted ratings [13]. Another line of research [11, 16] uses algorithms to select sequences of items that are representative of the entire preference space. Under the same problem definition, other research has recently developed decision tree-based preference elicitation methods [14, 15, 18], whereby the recommender system repeatedly asks new users to pick their preference from a pair of movies. According to the user's previous choices, the system adaptively generates movie pairs for comparison. In the end, the system assigns users to a user group with similar taste and then uses an average of the group's taste for generating recommendations. Commercial systems have also introduced a variety of user experiences for new user preference elicitation. Systems that use CF algorithms to power personalization elicit user preferences through either explicit or implicit user feedback. For example, Netflix asks users to step through a variety of pages where they rate genres and movies. Other systems use implicit feedback to infer users' preferences for recommendation: Youtube and Netflix use viewing history, Amazon uses clicking and purchasing history, and Facebook uses user activities on news feed items. However, implicit feedback is not as useful for systems that wish to allow users to control the personalization. Content-based recommender systems can directly elicit preferences on content features. For example, both Facebook Paper and Pinterest ask new users about the categories they are interested in before recommending content. However, similar content features are not necessarily available in other application domains, and these features may not offer the flexibility to control the granularity of users' stated preferences. The commercial systems, together with research on interactive recommenders, inspire us to focus on overall user experience rather than the simplified goals in algorithmic research. We aim to give new users, who care about more than time spent and accuracy of rating prediction, a good experience in the preference elicitation process. For example, McNee et al. [17] demonstrated through an experiment on MovieLens that, when given more control over the preference elicitation process, users perceive less effort even though they spend more time compared with those who go through an uncontrollable process.
Other works [19, 20, 21] focusing on user interaction have also demonstrated multiple factors that are relevant to user experience. SmallWorlds [20] provides an interactive graph visualization for social recommendation on Facebook. Through a user study, the authors found that users think the system is transparent and are satisfied with it. TasteWeights [21], an interactive music recommender, allows users to control their recommendations by choosing their preferred artists or by leveraging the preferences found in their social network. The work most directly relevant to ours [19] studied a choice-based preference elicitation process. Similar to decision tree-based methods, the system iteratively asks users to compare two groups of movies before making any recommendations. In contrast with decision tree-based methods, movies are picked to represent two opposite values of a latent factor, which is computed by a matrix factorization collaborative filtering algorithm. In picking movies to present, they consider both the algorithmic latent space and the transparency to users. Through a user study, they showed that the system has advantages on several metrics over both a manual system where users search for items and an automatic recommender system with no interaction. Similar to [19], we propose an interactive preference elicitation process that strikes a balance between user effort and quality of recommendation by asking users to rate groups of items. Compared to [19], we provide a different presentation of movie tastes with tag-labeled movie clusters. Moreover, our approach is generalizable to any collaborative filtering recommender system, while [19] is constrained to matrix factorization-based systems. And, similar to decision tree-based methods [14, 15, 18], we represent users' tastes as the average preference of users who are similar.

2.3 Design Space Analysis

In this section, we describe the design space surrounding our preference elicitation process, guided by our two research questions. As part of this analysis, we articulate several design challenges related to the creation and display of groups of items, ways of allowing users to express preferences for those items, and how we might turn these preferences into item recommendations. Specifically, we target our analysis on the design of a new user process for MovieLens.

2.3.1 Design space

User effort and quality of personalization are two correlated objectives in designing a preference elicitation process. On the one hand, a system may require a demanding process that asks users for lots of information - such as ratings or survey responses - in exchange for high-quality personalization. On the other hand, a system might admit users directly, making the process fast and easy at the cost of foregoing the opportunity for personalized recommendations. In this research, we are interested in developing a part of the design space that is both low-effort and high-personalization. Standard preference elicitation processes for Collaborative Filtering that ask users to rate individual items are inherently inefficient, because some ratings contain highly overlapping information. For example, if a user has rated "Toy Story", it may be the case that rating "The Lion King" does not add much additional information for the purposes of personalization, since they both reflect a preference for children's animation. Research on active learning in recommender systems [22] explores a similar idea: that ratings on similar items represent correlated information about user preference. Therefore, we believe that eliciting preferences on groups of similar items is more efficient. One obvious way to group items in the movie domain is to use an existing categorization such as genres. However, such a categorization is not necessarily available in other item domains, and an existing categorization offers no flexibility to control the granularity of groupings. There is a trade-off in the granularity of groupings: preferences on groups of finer granularity represent more nuances in user tastes, but may require more user effort to express. To have the flexibility to explore this trade-off, we propose to use an automatic clustering algorithm to generate item groups (or item clusters, for the rest of this thesis), which is generalizable to other rating-based recommender systems.

[Figure 2.2 diagram: dense rating matrix → DC1 (generate groups) → item clusters → DC2 (describe groups) → movie group interface → DC3 (elicit preferences) → user's choices of movie groups → DC4 (recommend + filter) → recommended movie list]

Figure 2.2: An overview of the design challenges in our group-based preference elicitation process.

2.3.2 Design challenges

To accomplish our goal of building a new user preference elicitation process that requires minimal effort and leads to high quality recommendations, we must address several design challenges. In this section, we detail the following challenges (shown in Figure 2.2), along with our proposed solutions:

• DC1: Generating groups,

• DC2: Describing groups,

• DC3: Eliciting preferences,

• DC4: Recommending based on preferences.

DC1: Generating groups

We use an unsupervised clustering method to identify non-overlapping groups of items, which are movies in MovieLens. We aim to generate high quality clusters that place similar movies in the same cluster and dissimilar movies in different clusters, and that cover the whole preference space, accurately capturing user preferences about movies. We generate movie clusters from high quality existing ratings, which are available in all CF-based recommender systems. The intuition is to group movies that receive consensus on ratings into the same cluster. Computationally, consensus is established when the same users have rated two movies and the values of the ratings on the two movies are similar. However, like any online system, users' activities in a recommender system follow a long-tail pattern: the majority of users rate only a few commonly known movies, while a few users rate many diverse movies. As a result, when given all ratings in the system, a clustering algorithm will tend to group commonly known movies together even when these movies reflect different tastes. To avoid such undesired clustering results, we therefore extract ratings from prolific users on a subset of representative movies: we pick the 200 most frequently-rated movies in MovieLens, and then include ratings from users who have rated more than 75% of those movies. These numbers are chosen according to the size of the dataset and the sparsity of the ratings. Our intuition is confirmed by manual inspection of clustering results using the dense rating matrix versus the full matrix. We further experimented with data transformation techniques such as normalizing ratings and dimension reduction on the rating matrix. Based on extensive offline simulation, we chose to use mean-subtracting normalization as our technique [23]. After selecting the data, we compute pairwise cosine similarities [2] between all movie vectors, and run the Spectral Clustering algorithm [24] to partition movies into clusters. Spectral Clustering is a standard graph-based clustering technique commonly applied to similar domains such as document clustering and community detection [25, 26]. Again, we evaluate the performance of Spectral Clustering by manual inspection of the clustering results and find that it generates higher quality clusters in comparison to Affinity Propagation [27], another popular graph-based clustering algorithm.
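To make the clustering step concrete, the following is a minimal Python sketch of the pipeline described above — mean-subtracting normalization, pairwise cosine similarity between movie vectors, and Spectral Clustering via scikit-learn. The function name, the dense-matrix input format, and the shift of similarities into a non-negative range before clustering are our own illustrative choices, not details specified in the thesis.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_movies(ratings, n_clusters=14):
    """Group movies by rating consensus (illustrative sketch).

    ratings: dense (num_users x num_movies) matrix of ratings from prolific
    users on frequently rated movies; 0 marks a missing rating.
    Returns an array assigning each movie (column) to a cluster.
    """
    # Mean-subtracting normalization per user, applied only to observed ratings.
    rated = ratings != 0
    user_means = ratings.sum(axis=1) / np.maximum(rated.sum(axis=1), 1)
    centered = np.where(rated, ratings - user_means[:, None], 0.0)

    # Pairwise cosine similarity between movie (column) vectors.
    movie_vecs = centered.T
    norms = np.linalg.norm(movie_vecs, axis=1, keepdims=True)
    unit = movie_vecs / np.maximum(norms, 1e-12)
    similarity = unit @ unit.T

    # Spectral clustering over a precomputed affinity matrix; similarities are
    # shifted into [0, 2] so the affinity is non-negative (an assumption here).
    model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed")
    return model.fit_predict(similarity + 1.0)
```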

DC2: Describing groups

To help users better understand the preference elicitation process, we need to present the movie groups in a comprehensible way. In our implementation, we use both movies and tags to represent movie groups in the user interface (see Figure 2.1). Tags, which are created and applied to movies by users in MovieLens, provide high quality descriptions for the movie groups. However, they can be replaced by manual labeling for a system that does not have tags. For picking representative movies and tags, there are two options: (1) select representative movies then find the most descriptive tags, and (2) pick representative tags then find the most relevant movies. We decided to use the latter because it results in more comprehensible representations of groups in our manual evaluation. For each movie group, we first pick the top three tags that both uniquely describe and are highly relevant to the group. We measure the relevance of a tag and a movie with the Tag Genome [28], which is more accurate than, but replaceable with, more general tag application data. With tag relevance metrics, we define the uniqueness and relevance of a tag to a movie group in Equation 2.1 and Equation 2.2 respectively below. Then, we pick the three tags with the highest overall uniqueness and relevance, measured by the product of the two metrics.

unique(t, c) = \frac{rel(t, c)}{\sum_{c_i \in C} rel(t, c_i)}    (2.1)

relevance(t, c) = \frac{rel(t, c)}{\sum_{t_i \in T_c} rel(t_i, c)}    (2.2)

where t denotes a tag in the set of tags T_c that appear in cluster c, and C denotes all the clusters. Note that rel(t, c) is the aggregated relevance of tag t to all movies in cluster c. In our implementation, we use the relevance between a tag and a movie generated from the Tag Genome [28], but other systems could replace this data with a count of applications of the tag to the movie. After selecting three tags, we pick the three movies that are most relevant to all three tags of a movie group. To avoid redundancy, we do not pick more than one movie from the same franchise, e.g., Star Wars 1 and Star Wars 2.

DC3: Eliciting preferences

There are many possible user interfaces that support the design goal of eliciting user preferences for movie groups. For example, users might rank the movie groups, or they might provide 5-star ratings for each group. In this work, our goal is to minimize effort while achieving high personalization. Therefore, we start with a very simple interface: users choose one favorite group. Alternatively, we can give users options to express their movie tastes with several movie groups. Intuitively, one movie group may not be sufficient to represent a movie taste, e.g., a user might enjoy both horror movies and Sci-Fi movies, which are likely to be in different groups. To give users more control over how they express their preferences for movie groups, we further propose experimenting with an interaction technique that asks users to allocate a fixed number of points across one or more movie groups.

Algorithm 1: Algorithm for the recommendation process
Step 1: Compute pseudo rating vectors.
  foreach movie group M_k do
    1. Find the set of users U_k whose average rating on movies in M_k is the highest;
    2. Compute the average ratings of users U_k on the top movies that went into the clustering process to get the pseudo rating vector R_k;
  end
Step 2: Make recommendations.
  foreach pseudo rating vector R_k do
    Rank all movies based on predicted ratings, computed with R_k as input, using a standard Collaborative Filtering algorithm;
  end
Step 3: Aggregate recommendations based on user preferences on movie groups when users express interest in multiple movie groups.
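A small Python sketch of Step 1 of Algorithm 1 — building one pseudo rating vector per movie group — may help make the procedure concrete. The thesis does not state how many top-rated users are selected for each group, so the `top_fraction` cutoff below is an assumption; function and variable names are illustrative.

```python
import numpy as np

def pseudo_rating_vectors(ratings, group_movies, top_fraction=0.1):
    """Algorithm 1, Step 1 (sketch): one pseudo rating vector per movie group.

    ratings: (num_users x num_movies) array, 0 for missing ratings.
    group_movies: dict {group_id: [movie indices in the group]}.
    top_fraction: fraction of users, ranked by their average rating on the
    group's movies, treated as the group's "fans" (an assumed cutoff).
    """
    vectors = {}
    rated = ratings != 0
    for group, movies in group_movies.items():
        # Each user's average rating on this group's movies (no ratings -> 0).
        counts = rated[:, movies].sum(axis=1)
        sums = ratings[:, movies].sum(axis=1)
        avg = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)

        # Take the users with the highest average rating on the group.
        n_fans = max(1, int(len(avg) * top_fraction))
        fans = np.argsort(avg)[-n_fans:]

        # Pseudo rating vector: the fans' average rating per movie.
        fan_counts = rated[fans].sum(axis=0)
        fan_sums = ratings[fans].sum(axis=0)
        vectors[group] = np.where(fan_counts > 0,
                                  fan_sums / np.maximum(fan_counts, 1), 0.0)
    return vectors
```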

We use a data-driven approach to decide on the appropriate number of points to give users to allocate on movie groups. Again, we analyze the selected high quality ratings, computing how many movie groups users’ top-rated movies2 fall into. As a result of this analysis, for any given number of groups, we are able to pick the number of points that covers all the top-rated movies for 80% of the users. The results of this analysis inform the design of our user experiment (described in the user experiment section below) where we examine the impact of point allocation versus an interface where users simply pick their favorite movie group.

DC4: Recommending based on preferences

The next step, given a user's chosen preference for one or more movie groups, is to generate personalized recommendations from the full item space. One important feature of our recommendation process is that it can be generalized to any standard collaborative filtering algorithm. The intuition behind the recommendation process is to find like-minded existing users for a new user based on the new user's preferences for movie groups, and then make recommendations based on those existing users' tastes. This process, detailed in Algorithm 1, works as follows: 1) map each movie group to a pseudo rating vector based on ratings from existing users; 2) make movie recommendations for each rating vector from the entire movie space; 3) aggregate the recommendations for all movie groups based on the new user's preferences on the groups.

2 Movies rated ≥ 4 on a 5-star rating scale.

Rank | List 1 (2 points)   | List 2 (1 point)    | Combined list
     | movie     score     | movie     score     | movie     score
1    | A         3         | D         3         | A         2.33
2    | B         2         | B         2         | B         2
3    | C         1         | A         1         | D         1

Table 2.1: Example of weighted aggregation of ranked item lists. List 1 has 2 points and list 2 has 1 point. We assign scores to items in the lists based on ranking. We re-rank items based on the weighted average of item scores in two lists and take top 3.

When users allocate points across multiple movie groups, we aggregate recommendations - both predicted ratings and top-N recommendation lists. For predicted ratings, we compute a simple weighted average of the multiple predicted ratings based on the different pseudo rating profiles. For generating a top-N list, as shown in the example in Table 2.1, we aggregate multiple ranked item lists by assigning scores (N − rank + 1) to items based on their rank in each list, then computing weighted averages of the scores, and finally ranking items accordingly.
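The aggregation rule can be written down directly: assign each item a score of N − rank + 1 within each list, weight each list by its allocated points, and re-rank by the weighted average. The sketch below is illustrative (it treats items missing from a list as contributing a score of 0, an assumption consistent with the numbers in Table 2.1) and reproduces the combined list from the table.

```python
def aggregate_ranked_lists(weighted_lists, top_n=3):
    """Weighted aggregation of ranked item lists (as in Table 2.1).

    weighted_lists: list of (weight, ranked_items) pairs, best item first.
    Items absent from a list contribute a score of 0 for that list.
    """
    n = max(len(items) for _, items in weighted_lists)
    total_weight = sum(weight for weight, _ in weighted_lists)

    combined = {}
    for weight, items in weighted_lists:
        for rank, item in enumerate(items, start=1):
            score = n - rank + 1                      # score by rank: N - rank + 1
            combined[item] = combined.get(item, 0.0) + weight * score

    ranked = sorted(combined, key=lambda item: combined[item] / total_weight,
                    reverse=True)
    return [(item, round(combined[item] / total_weight, 2)) for item in ranked[:top_n]]

# Reproduces the combined list in Table 2.1: [('A', 2.33), ('B', 2.0), ('D', 1.0)]
print(aggregate_ranked_lists([(2, ["A", "B", "C"]), (1, ["D", "B", "A"])]))
```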

2.4 Feasibility study

To evaluate the effectiveness of group-based preference elicitation, we conducted an offline simulation study. We focused on an evaluation of RQ2-High quality recommendation using prediction accuracy and top-N recommendation accuracy [29], and left RQ1-Minimal user effort for the user experiment. We also evaluated the trade-offs inherent in increasing the number of clusters through an analysis of resulting recommendation quality and of clustering quality metrics.

2.4.1 Data

We constructed a data set of 2.2M ratings from 5,018 users on 22,115 movies from the MovieLens database. We selected users who have provided at least 50 ratings and who have rated at least one movie since June 1, 2013. We chose users based on their ratings behavior because we can more accurately simulate user preferences and evaluate quality-based metrics for users with many ratings.

2.4.2 Method

Our simulation study has two parts: first, we built item groups and simulated user group choices; second, we used those simulated group choices to evaluate the quality of the resulting recommendations. To fairly evaluate the results, we conducted a 5-fold cross validation on the data set, where 80% of the users (and their ratings) are used to build the item groups, and the remaining 20% are used to simulate group choices and evaluate recommendation quality. Further, for each user in the test set, we randomly selected 80% of their ratings to guide the simulated group choices, while the remaining 20% are left for evaluating recommendation quality. We simulated users picking one favorite movie group, assuming that users will choose the movie group for which they have the highest average rating on items in that group. Because this assumption is a crude approximation of actual user behavior, we also included an oracle strategy — assuming users will always pick the movie group which results in best prediction accuracy — to approximate the theoretically optimal performance.
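A minimal sketch of the two simulated choice strategies, assuming a per-user dictionary of ratings and a hypothetical prediction_rmse helper that runs the recommender for a candidate group and returns its error on the held-out ratings:

```python
import numpy as np

def simulated_pick(user_ratings, movie_groups):
    """Pick the group with the highest average rating among the user's rated movies."""
    means = []
    for group in movie_groups:
        rated = [user_ratings[m] for m in group if m in user_ratings]
        means.append(np.mean(rated) if rated else -np.inf)
    return int(np.argmax(means))

def oracle_pick(user_heldout, movie_groups, prediction_rmse):
    """Oracle strategy: pick the group whose pseudo profile minimizes held-out RMSE."""
    errors = [prediction_rmse(k, user_heldout) for k in range(len(movie_groups))]
    return int(np.argmin(errors))
```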

2.4.3 Baseline

We compared group-based preference elicitation to the standard method used in MovieLens for the last 10 years: asking users to rate 15 movies from a long list of movies generated by an algorithm. We analyzed our data set to compare various movie sorting algorithms [11] as well as several state-of-the-art CF algorithms, item-item [23] and FunkSVD [30], with respect to their resulting recommendation quality after 15 ratings. Item-item CF has the best recommendation quality in combination with popularity-based sorting (all evaluations are carried out in LensKit [31]). Therefore, we used item-item CF both to serve as our rate-15 baseline recommender and to recommend movies based on users' choices of movie groups.

2.4.4 Metrics

For prediction accuracy, we measured the Root Mean Squared Error (RMSE) [32] of predictions, defined in Equation 2.3, on ratings from the set of test users. Lower RMSE scores are better. For top-N recommendation accuracy, following the methodology of [33], we measured the precision and recall of including 5-star movies in top-N recommendations. For easy interpretation, we report F-measure scores, defined in Equation 2.4, to combine precision and recall. Higher F-measure scores are better.

\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{m=1}^{N} (\hat{r}_m - r_m)^2} \qquad (2.3)

where \hat{r}_m is the predicted rating and r_m is the actual user rating.

F\text{-}measure = \frac{2 \times Precision \times Recall}{Precision + Recall} \qquad (2.4)
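A brief sketch of the two metrics as defined in Equations 2.3 and 2.4 (function names are illustrative):

```python
import numpy as np

def rmse(predicted, actual):
    """Root Mean Squared Error (Equation 2.3)."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return np.sqrt(np.mean((predicted - actual) ** 2))

def f_measure(precision, recall):
    """Harmonic mean of precision and recall (Equation 2.4)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```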

2.4.5 Results

RQ2-Quality of recommendation.

As shown in Figure 2.3, group-based preference elicitation has better average top-N recommendation accuracy (in aggregate, F > 0.19 for both the optimal and simulation conditions) than the baseline (in aggregate, F = 0.15). In other words, the top-N recommendation list generated from our group-based process contains more movies that users may be interested in. We found that the optimal prediction accuracy of group-based preference elicitation varies with the number of clusters, but in general is close to the baseline (in aggregate, RMSE = 0.804 for the baseline condition, and RMSE = 0.801 for the group-based process when the number of clusters is 12). However, the actual prediction accuracy is likely to be worse than the baseline; e.g., the RMSE is 0.856 when simulating 12 clusters. This is as expected, because 15 ratings provide information about user rating patterns, such as the tendency to rate higher or lower on average than other users; cluster-based preference elicitation does not provide such information. Our simulation gives us evidence that the group-based method will provide better top-N recommendations but slightly worse prediction accuracy, establishing the feasibility of the proposed method. This is further supported by prior research [7, 34] that argues prediction accuracy is less relevant to the end-user experience than recommendation accuracy.


Figure 2.3: Results of simulated group-based preference elicitation compared to a baseline. "Baseline" shows the performance of the rate-15-movies process. The two simulated cluster-picking strategies are "simulation", where users pick the cluster with movies they've rated the highest, and "optimal", where users pick the cluster that results in the best predictions. Lower RMSE scores are better and higher F-measure scores are better. The chart on the left shows that prediction accuracy is slightly worse than the baseline in most cases, while the chart on the right shows that recommendation accuracy is significantly improved compared with the baseline.

Number of clusters.

One goal of this simulation is to inform our design choices concerning the best number of groups to display to users. As shown in Figure 2.3, the theoretical bound for prediction accuracy improves with the number of clusters. This is intuitive because as the number of clusters increases, the groups of items become smaller and more homogeneous; these smaller groups may better capture user preferences. However, we see two potential downsides to increasing the number of clusters. First, this improvement is likely gained at the cost of increasing user effort: the act of picking one movie group from sixteen choices is intuitively harder than picking one movie group from four choices.


Figure 2.4: Cluster quality measured by the Silhouette Coefficient vs. the number of clusters. For each cluster number, we run the clustering algorithm 10 times and plot the standard deviation with error bars. This figure shows two local maxima around 5 and 14 clusters.

Second, when we simulate cluster-picking behavior using highest average ratings rather than the optimal method, we find that RMSE increases slightly with the number of clusters. This indicates the possibility that, in a real-world system, increasing the number of choices will decrease the chances of users picking the best cluster in terms of the resulting prediction accuracy.

We also measured the quality of clusters using the Silhouette coefficient [35]. Silhouette coefficient values range from -1 for poor clustering to 1 for good clustering. We plotted the Silhouette coefficient for a varying number of clusters (K) in Figure 2.4. This figure shows a general trend of increasing cluster quality as K increases, and the presence of two local maxima around 5 and 14 clusters.

Combining these results, we found that there is a trade-off in increasing the number of groups for our preference elicitation process: we can have high-quality groups and the best prediction accuracy with many groups, or we can minimize user effort with few groups. We investigated this trade-off in our user experiment below.

2.5 User Experiment

We conducted an online user experiment to evaluate both RQ1-Minimal user effort and RQ2-Quality of recommendation with real users. We invited MovieLens users to participate in this experiment, comparing the performance of our group-based process with a baseline process that is currently used in MovieLens. In addition to the offline simulation work discussed above, this user experiment can shed light on user effort and subjective perception of recommendation quality.

2.5.1 Overview of the cluster-picking interface

We implemented a prototype cluster-based preference elicitation interface in MovieLens, shown in Figure 2.1. We presented participants with N (6 or 14) groups of movies and asked them to express their preferences on these groups either by picking their favorite group or by distributing points across multiple groups. Each movie group is described by three user-created tags in MovieLens and three movies. When participants finish the task, the system generates a personalized recommendation list for them.

2.5.2 Method

We invited recently active MovieLens users to participate, using the same selection criteria as the feasibility study (see above). We sent emails to 2.8K users and received 342 responses between May 18 and May 23, 2014. This group of participants is already familiar with the MovieLens recommender system. We asked participants to complete two tasks, using an experimental interface designed for this evaluation. First, participants picked one or more groups (see experimental conditions below) from our prototype interface, shown in Figure 2.1. Participants then answered survey questions about the process of picking among the groups, and evaluated a top-10 list of recommended movies, with the interface shown in Figure 2.5, that was generated based on their group choices. Second, participants evaluated a top-10 movie list, using the same interface shown in Figure 2.5, generated based on their historical first 15 ratings after signing up for MovieLens. The order of the above two tasks was randomized. Participants were not aware of how the recommendation lists were generated.

Figure 2.5: Interface used for evaluating top-10 recommended movies. Participants are asked to answer a list of survey questions regarding top-10 recommendations shown to the left.

We evaluated RQ1-User effort and RQ2-Quality of recommendation based on survey responses and activity logs. We measured the time participants spent picking movie groups and combined it with survey responses to evaluate RQ1-User effort. We analyzed survey responses to evaluate RQ2-Quality of recommendation. For the online survey, we used a 2 × 2 × 3 between-subjects design with the following three factors:

1. Order. The order in which the user evaluates the two top-10 movie lists. We include this condition and randomize it to make sure that there is no order effect.

2. Number of groups. The number of movie groups to display to the user. We experiment with 6 groups and 14 groups, based on our findings and open questions from the quantitative analysis discussed above. Further, considering UI design, we pick an even number of groups to better organize them in the interface.


Figure 2.6: Time to finish preference elicitation process. Time is measured in minutes. The group-based approach takes less than half as long as the baseline, on average.


3. Group-picking interface. The interface given to users for picking groups. We experiment with three conditions: picking one favorite movie group, or distributing either 3 or 6 points across movie groups (multiple points may be given to a single group). These values are chosen based on the design space analysis discussed above.

Participants were randomly assigned to one of the twelve experimental conditions. All survey questions have Likert-scale answers ranging from strongly disagree to strongly agree.


Figure 2.7: Survey results about the easiness and fun of our group-picking process. We compare picking one group with distributing 3 or 6 points across groups. The percentages summarize the responses after combining disagree/strongly disagree, and agree/strongly agree. Respondents think point allocation is easier and more fun than picking a single favorite.

2.5.3 Results

RQ1 - Minimal user effort

Time. We observed that the group-based process requires less time than the rate-15-movies baseline process. As an objective metric for user effort, we measured the time participants spent picking movie groups in the online surveys. Through an analysis of 27,226 new users who signed up to MovieLens between Jan. 1, 2008 and Dec. 31, 2010, we compared this time with the time spent in the rate-15-movies baseline, as shown in Figure 2.6. Participants spent more time as the interface became more expressive (more groups or more points). However, even for the most time-consuming task, distributing 6 points across 14 groups, the median time is only 1.5 minutes. This is less than half of the median time (4 minutes) that users have historically spent in the rate-15 baseline process!

[Figure 2.8 data (disagree / neutral / agree): "This list would help me find movies I would enjoy watching." (Satisfaction): Rate 15 movies 17% / 19% / 64%; Cluster based 8% / 13% / 78%. "The items in the list accurately represent my preferences." (Personalization): Rate 15 movies 24% / 22% / 54%; Cluster based 16% / 18% / 66%.]

Figure 2.8: Survey results about recommendation quality, comparing our group-based process with the baseline rate-15 process. The percentages summarize the responses after combining disagree/strongly disagree, and agree/strongly agree. Respondents answer more positively concerning the group-based recommendations.

Subjective feedback. To evaluate whether the group-picking interface is easy and fun, we asked users two Likert-scale questions about the process and received the results shown in Figure 2.7. We did not find statistically significant differences in user responses when comparing users in the 6 and 14 group conditions. However, we did find that varying the group-picking interface had an effect on perceived ease of use. Interestingly, although picking one group is the task that takes the least time to finish (Figure 2.6), users think that it is the hardest task (p-value < 0.001, Wilcoxon test [36]). In fact, we received an email from one user speaking to this point: "I'm finding that my movie preferences cannot be accurately described by selecting just one group of three movies."

Table 2.2: User-expressed familiarity with recommended movies and prediction accuracy. Familiarity is represented by the average number of movies (out of 10) that users had heard of or seen. Prediction accuracy is measured by average RMSE.

Method                    Heard of   Seen   RMSE
Rate 15 movies baseline   8.5        6.7    0.784
Group-based               9.0        7.4    0.797

RQ2 - Quality of recommendation

Top-N recommendation accuracy. To evaluate the resulting recommendation accuracy of the group-based process, we asked participants to evaluate two lists of recommendations in randomized order: one generated based on their choice(s) of movie groups and the other based on their historical initial 15 ratings in MovieLens. Based on these data, we can perform a within-subjects comparison of user perceptions of recommendation quality. Figure 2.8 shows the responses to two questions related to this goal, one about general satisfaction and another about personalization. Participants rated both factors higher for the group-based method (p values ∼ 0, paired Wilcoxon test). Specifically, users felt that recommendations from the group-based recommender would better help them "find movies I would enjoy watching" and "accurately represent my preferences". These results echo the earlier findings from our offline simulation of top-N recommendation accuracy. Note that satisfaction and personalization are both rated higher on average by users in the point allocation conditions (as compared with the pick-one condition), but these differences are not statistically significant. We also asked users to count the number of movies that they had heard of or seen from the list of recommended movies. Results are shown in Table 2.2. The group-based method recommends movies that users are more familiar with.

Prediction accuracy. We also evaluated the prediction accuracy of the group-based method. For each user, we made rating predictions on all movies except their first 15 rated movies. These predictions were either based on their group choices or their historical first 15 ratings. The average RMSEs are summarized in Table 2.2. Consistent with findings from the offline simulation study, the group-based method has slightly worse prediction accuracy (0.797 vs. 0.784). It is, however, unclear whether this small difference in accuracy would be perceptible to users.

To summarize, the group-based method reduces user effort and improves top-N recommendation quality, while potentially losing some prediction accuracy. Most importantly, users are more satisfied and get better personalization from the resulting top-N recommendations, which is the fundamental goal of most recommender systems.

Evaluation on Movie Groups

The user experience in our process depends on whether the movie groups are sensible, which we measured through survey questions. The responses, summarized in Figure 2.9, show that 80% of subjects understood the types of movies in each group. Additionally, most subjects agreed that displaying three movies (89%) and three tags (74%) helped them to understand the movie groups. In other questions, we learned that subjects felt that three tags was about the right number to display, while showing one or two additional movies for each group might be better.

2.6 Discussion

In this research, we developed a new process for preference elicitation in recommender systems based on the idea that recommender systems may bootstrap new users more effectively by having them express preferences for groups of items rather than individual items. We evaluated this process with the dual goals of minimal user effort and high-quality recommendation. A user experiment showed that our method succeeded. Compared to a baseline condition where users rate 15 movies, mirroring a process that has been in production on movielens.org for over a decade, the new process is much faster and leads to more highly evaluated recommendation lists.

We leverage the collective intelligence of computer and human by mining existing user-generated data, namely ratings. We cluster movies into groups based on ratings, representing stereotypical tastes in the system. Our recommendation process again leverages human wisdom by representing new users' preferences with existing users' ratings. This way of using collective intelligence is becoming increasingly popular as the Internet becomes more deeply involved in people's everyday lives.

[Figure 2.9 data (disagree / neutral / agree): "I understand the types of movies that each group represents.": 7% / 6% / 80%. "The movies shown with each group help me to understand the group.": 5% / 13% / 89%. "The words shown with each group help me to understand the group.": 9% / 17% / 74%.]

Figure 2.9: Survey responses regarding whether the groups are comprehensible. The percentages summarize the responses after combining disagree/strongly disagree, and agree/strongly agree. The movie groups presented to users were scrutable; both the example movies and the example tags contributed to this outcome.

Computationally mining human wisdom encoded in large-scale data enables applications that were not possible before. For example, Netflix draws insights about the features of movies enjoyed by different audience groups and produces movies tailored to their tastes. However, humans can take more active roles in such collaboration with software than unknowingly generating data. As we will show in the following chapters, such closer human participation breaks the limits of current algorithms and enables even more innovative applications.

Chapter 3

Automatic Topic Labeling for Multimedia Social Media Posts

3.1 Introduction

We can build computational systems not only on existing user-generated data, as demonstrated in the previous chapter, but also on data generated from on-demand crowdsourcing. In this chapter, we apply the latter methodology to build an automatic topic labeling system for multimedia social media posts.

The role of recommender systems has become increasingly important on social media sites as user-generated content overwhelms average users. Facebook intelligently selects posts that show up on the home feed to suit users' interests; Google+ recommends popular posts that may be interesting to users. Recommender systems enhance user experience by presenting interesting and relevant content from a rapidly growing information repository.

One key component of these recommender systems is automatic topic labeling, which associates topic labels with social media posts. Unlike movies and music, social media posts are very dynamic: on average, 6,000 tweets are posted every second. At the time of recommendation, topics are often the only features available to recommender systems, because user reactions, such as ratings, likes and clicks, take time to accumulate. Modeling topics of social media posts is challenging: apart from being short and noisy, posts contain multimedia content, i.e., text, images, videos and external links.


Figure 3.1: An example of a Google+ post with a YouTube video attached. The text of the post does not match the attached video, which makes topic labeling difficult.

For example, a typical post on Google+, shown in Figure 3.1, contains text and video. With advances in technology, we have reasonably accurate annotators for text, images and even video. Combining results from these annotators in a meaningful way, however, remains a challenge. For example, labels from images tend to be less reliable than labels from text. In addition, the perception of label relevance is subjective and complex. For instance, for a G+ post from TechCrunch (a tech media company headquartered in Silicon Valley) talking about the history of real estate development in San Francisco, an annotator based on the author name may label it "tech news", an annotator based on the text content may label it "real estate", and an annotator based on the picture may label it "city street". It is difficult for an algorithm to decide which label(s) best describe the post, even though each individual annotator is somewhat accurate.

Interpreting the topics of social media posts is a task that humans are better at than machines; therefore, we propose to use a machine learning model to mimic how crowd workers judge the relevance of topics to posts. We first utilize crowdsourcing to collect a ground truth data set on the relevance of topic labels for randomly sampled posts, so as to quantify the reliability of each topic annotator. Using the ground truth data from crowdsourced labels, we then utilize a supervised ensemble model that combines the outputs of the various topic annotators to further filter and classify topic labels based on their degree of relevance. In an evaluation, we demonstrate that the ensemble model improves over a baseline that naively aggregates topic labels from all annotators. Moreover, the ensemble model is also capable of accurately classifying topic labels into more fine-grained relevance categories.

The rest of this chapter is structured as follows. We first survey related work. Then, we describe our approach: a crowdsourcing process that generates relevance judgments and an ensemble learning model that predicts the relevance of topic labels. We follow this with an evaluation of our automatic topic labeling system on two practical tasks. We conclude with a discussion of the results and implications for collective intelligence.

3.2 Related Work

This work draws from two broad areas of prior research: crowdsourcing research and research on topic extractors and annotators (for text, image and video). Here we introduce these two areas, and end with a detailed introduction of [37], which described a Twitter-based topic labeling problem that is closely related to our work.

3.2.1 Crowdsourcing

Crowdsourcing has been a widely-applied approach for human evaluation [38] and for obtaining large-scale training data for machine learning systems. We present a brief survey of some recent work here.

Ensuring the quality of crowdsourced work is an important issue, because individual crowd workers are not always reliable. For instance, Kittur and Chi [38] showed that some crowd workers are not reliable because they optimize for maximum profit with minimum work. They improved the quality of crowd work by adding verifiable questions to the HIT template on Amazon Mechanical Turk (MTurk for short). Kazai et al. [39] studied various factors that affect the quality of relevance judgment tasks for web search. These factors include conditions of pay, required effort, and selection of workers based on proven reliability. In addition, they found that intrinsic factors of workers, e.g., motivation and expertise, also relate to work quality.

A number of prior works have in particular studied using crowdsourcing for annotating social media content, such as text, images and video. Finin et al. [40] studied how to efficiently annotate Named Entities in large volumes of Tweets at low cost using MTurk and CrowdFlower. They found both MTurk and CrowdFlower easy to use, cost effective and capable of producing qualified data. On the other hand, Alonso et al. [41] used crowdsourcing to annotate the level of interestingness of Tweets and found the task very challenging, because agreement among workers was low. Our work builds upon these works by following their insights and best practices in utilizing crowdsourcing for training data collection.

3.2.2 Topic Extractors and Annotators

The research on topic extractors and annotators aims to assign meaningful topic labels to various types of content, including text, images, and video. Traditionally, these research efforts are separated by content type, each of which requires entirely different underlying algorithms and approaches. We briefly survey research on these annotators here.

For text, topic extraction has a long history, particularly in semantic web research. In social media, prior work has mostly relied on text-based approaches, where the primary challenge is that the text is short and noisy. Bontcheva and Rout [42] gave a detailed summary of prior work. To name a few, Ramage et al. [43] used a semi-supervised labeled Latent Dirichlet Allocation (LDA) model to map Tweets into topic dimensions. Ritter et al. [44] used labeled LDA in natural language processing tasks, i.e., Part of Speech Tagging, Named Entity Recognition (NER) and taxonomy of entities, on Tweets. More recently, Gattani et al. [45] introduced a system that extracts entities from Tweets, links entities to concepts in Wikipedia, and classifies Tweets into 23 predefined categories.

Image annotation is closely related to the larger area of image recognition. There are several pieces of work on annotating images with topic labels [46, 47, 48]. In particular, Weston et al. [48] proposed a scalable and efficient method that annotates images by learning a joint embedding space for images and topic labels.

Annotating videos with topic labels is extremely challenging [49, 50]. In one example of research that relates more directly to our work, Aradhye et al. [50] used both metadata, e.g., video title, description and tags, and audiovisual features to find topic labels for YouTube videos.

Our work does not attempt to improve these specialized topic extractors and annotators; instead, we utilize these annotators and focus on how we can intelligently integrate them for topic labeling of social media posts.

3.2.3 Topic Labeling for Twitter

Perhaps the most relevant work to ours is recent research by Yang et al. [37], which proposed a topic labeling system deployed at Twitter. This system has several components: non-topical tweet detection, automatic labeled data acquisition, evaluation with human computation, diagnostic and corrective learning, and topic inference. As part of this system, an integrative model aggregates signals from different parts of the tweets, i.e., text, web page, author, hashtag and user interest. The model assigns weights to the different sources, where the weights come from human-labeled data.

Similar to their work, we train an ensemble model to aggregate topic labels that are generated by different annotators. Our work differs from theirs in one fundamental aspect: our work deals with versatile multimedia posts in Google+ as opposed to short textual tweets. With recent social media system redesigns, multimedia content is becoming more and more common in posts. Therefore, our work is timely in investigating this challenging problem.

In the next section, we describe the specific challenges of topic labeling for multimedia posts and the rationale behind our solution to these challenges. In particular, we explain how the multimedia nature of posts complicates the relevance judgment of topic labels.

3.3 Our Approach

In this section, we introduce our approach to labeling multimedia posts on Google+. The central piece of our system integrates different annotators for the various parts of a post, i.e., author name, body text, attached image, attached video, etc., and merges the varying signals from these different annotators by using an ensemble learner trained on ground truth topic labels that are evaluated by crowdsourced workers.


Figure 3.2: Approach overview. We crowdsource relevance judgments of topic labels (described in Section 3.3.2). Then we train a supervised ensemble model with the human-evaluated topic labels (described in Section 3.3.3).

Because the judgment of the relevance of topic labels for posts can be highly subjective, we employ many crowd workers to evaluate the relevance of topic labels so as to aggregate a variety of opinions. This, however, raises the challenge of ensuring the quality of the work on subjective tasks, as observed in early work on crowdsourcing [38]. We describe our crowdsourcing process, which carefully addresses this issue, in Section 3.3.2.

To harness all topic annotators with varying accuracy, we build a supervised ensemble model to filter topic labels from each annotator based on its accuracy and other features from posts. We train the supervised learning model on data from the crowdsourcing process. We cover the details of the ensemble model in Section 3.3.3.

The work flow is summarized in Figure 3.2: we first leverage the crowd to evaluate the relevance of topic labels from different annotators on randomly sampled posts, then train the supervised learning model with the data from crowdsourcing, and finally use the ensemble model to classify topic labels from different annotators on unseen Google+ posts.

3.3.1 Single-Source Annotators

A post on G+ contains an author name, body text, comments and an optional multimedia attachment (an image, a video or a link to a web page). Figure 3.3 shows the interface for creating a new post.

Figure 3.3: Google+ allows users to attach multiple types of attachment (i.e., image, video and link to web page) to a post.

Intuitively, we can use single-source annotators to annotate each part of a post and combine the labels from these annotators. To analyze each of the parts of a post, entity/topic annotators map media content onto a particular set of topic keywords. In this work we rely on the Freebase knowledge base system (https://www.freebase.com/), which provides a shared collection of topic keywords that all the individual text annotators, image annotators, and video annotators use [37, 48, 50]. Regardless of the underlying implementation of these single-source annotators, the output of the annotators can be represented as pairs of label and relevance, as shown in Equation 3.1:

\{ \langle l_1 : T_{p,l_1} \rangle, \ldots, \langle l_i : T_{p,l_i} \rangle, \ldots \} \qquad (3.1)

where l_i is a label and T_{p,l_i} is the topical relevance of label l_i to post p; we call T_{p,l_i} the topicality score.

There are two major challenges that complicate the aggregation of these topic labels. First, the single-source annotators have varying levels of reliability, because they are optimized for a single source of input (text, image or video). For instance, recognizing a cat in an image is technically more difficult than extracting the word "cat" from text. Moreover, annotators provide inconsistent topicality scores due to contextual differences in how they are applied. For instance, textual entity extractors generally work better with longer pieces of text than social media posts, which tend to be quite short.

For another example, an annotator applied to the author-name field value of "Big Cat Rescue" is likely to return the "cat" label, even though the account really promotes the protection of tigers.

Second, humans may perceive the topic of a post differently based on their background knowledge and understanding of the context. When judging whether a topic label is relevant, humans consider a post as an integral object and weigh the text, image, and video in the post differently. For example, Figure 3.1 shows a Google+ post with a YouTube video: "Cat" is a label for the body text, while both the author name and the video show that the topic is about "Fails". When presented to people, "Cat" is more likely to be perceived as irrelevant, since the video attracts more attention than the text.

An ensemble model combines the topic labels from different annotators by applying an informed decision-making process based on the reliability of the annotators and various features of the post, such as the attachment type. For this reason, we choose to train and apply an ensemble model to aggregate topic labels from multiple annotators. To do this, we must first obtain ground truth labels for the supervised machine learning algorithms.

3.3.2 Crowdsourcing Training Labels

We crowdsourced relevance judgments of topic labels, generated from multiple single-source annotators, for randomly sampled posts. These human judgments served as training data for the ensemble model.

Task UI

We used Amazon’s Mechanical Turk (MTurk), a popular crowdsourcing platform, to collect crowdsourced topic label evaluations. We paid workers 15 cents for each task, which takes 1 to 2 minutes to complete. In each task, we asked workers to evaluate, with a task interface shown in Figure 3.4, all entities extracted by all single-source annotators for one Google+ post. Workers had four options “Main or Important”, “Related”, “Off-topic” or “Don’t know”, each with detailed definitions. We iterated on the design of the task interface to ensure quality of workers’ answers. We included toggleable instructions (partial text below) and example responses to help 42

Figure 3.4: Task template used on MTurk to evaluate the relevance of topic labels to a Google+ post. The post, omitted from the screenshot, is embedded in the task in a similar way as in Figure 3.1. Workers can play videos and click through to web pages in the post.

To gain context, a worker can optionally hover her mouse over a topic label to read its English description, shown in the black text box in Figure 3.4.

A Google+ post may contain text, images, videos or links to external web pages. Please consider the entire content of the post when answering the questions in this task. If the post contains a link or a video, please click through to get a better understanding of the post.

Quality Control for Crowdsourcing

Crowdsourcing pipelines are subject to spamming and other low-quality work, and the task of relevance judgments for topic labels is no exception. Irresponsible workers, aiming to maximize their profit, game the system by filling the forms with random answers. This is especially difficult to detect, since we do not have any ground truth data set to compare to. To control the quality of crowd work, we therefore deter spam in the following two ways.

First, a single MTurk worker can perform work on at most 5% of the labeling tasks. Spammers, a minority [51] on MTurk, usually complete many more tasks than honest workers because they do not put in effort. We set up a gateway server to keep track of the work histories of workers and disable their access to new tasks once they reach our preset limit.

Second, we followed the approach introduced by Kittur et al. [38] and added verifiable questions to each task. As shown in Figure 3.4, we asked workers which part of a post each topic label is relevant to. We informed workers that we would check the correctness of their answers with this instruction:

You are expected to spend on average 1 to 2 minutes on each HIT. If you don’t spend enough time, your HIT will be rejected. Some of the questions have correct answers. If your answers don’t match with correct answers, your HIT will be rejected.

To verify the effectiveness of verifiable questions, we conducted a quick A/B experiment. We randomly sampled 300 Google+ posts and created identical tasks (same payment and description) for these posts with two UIs, one with verifiable questions and the other without. We had three independent workers judge each post.

                                 No VQ     VQ
Median time per post             1.3 min   1.6 min
Chances of unanimous agreement   34.9%     35.5%
Chances of majority agreement    88.0%     90.1%

Table 3.1: Comparison of answer statistics with and without verifiable questions ("VQ" and "No VQ"). Workers spend more time on tasks and are more likely to reach agreement when verifiable questions are present.

The results, as summarized in Table 3.1, indeed show the benefit of adding verifiable questions. The median time workers spent on tasks was longer and agreement was higher when verifiable questions were present. Without ground truth answers, better agreement makes us more confident of answer quality. The verifiable questions helped encourage workers to carefully evaluate each label as part of their relevance decision process, and increased the cost of spamming. Together with a clear warning about rejection, verifiable questions discourage spam.

The output of this crowdsourcing process helps us prepare a human-labeled ground truth data set for the ensemble model in the next section.

3.3.3 Supervised Ensemble Model

With human-evaluated topic labels from the crowdsourcing process, we trained an ensemble model to combine topic labels from different topic annotators. In this section, we describe the details of this ensemble model. For each G+ post, the ensemble model filters the topic labels generated by the various single-source annotators to keep the relevant topic labels. We have single-source annotators for the author name, body text, comments, image, video and link to web page, respectively. The entire process can be modeled as a classification task: predicting the relevance class of the candidate topic labels for a post. Depending on the application of the ensemble model, the classification task can be configured to produce either a two-class or a multi-class classification of the topic labels.

Binary Classification for Only "Important" Labels: We want to be able to select a topic label only when it is central and important to the post: from all topic labels generated by the various annotators, select only the "Main or Important" topic labels. In other words, we predict a topic label as positive when it is "Main or Important" and as negative when it falls into any of the "Relevant", "Off-topic", or "Don't Know" categories. These topic labels most accurately describe the content of posts, which is helpful in a number of applications. For example, a search engine may index posts with these topic labels to produce highly relevant search results.

Multiclass Classification into All Categories: In this case, we want to categorize topic labels into all four categories: "Main or Important", "Relevant", "Off-topic" and "Don't Know", corresponding to the choices in the crowdsourcing template. This can be naturally modeled as a multiclass classification problem. Such categorization provides useful information about the quality of topic labels, allowing applications to selectively use topic labels based on their needs. For example, for accuracy-critical tasks like search we may only use "Main or Important" topic labels for search indexing, while for recommendation tasks we may also include "Relevant" topic labels so as to cover a broader range of related topics. Posts with "Don't Know" labels should be excluded from being shown to end users.
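A minimal sketch of how the crowdsourced categories might be mapped onto the two classification targets; the category strings follow the evaluation tables, while the encoding itself is an assumption.

```python
CATEGORIES = ["Main or Important", "Relevant", "Off-topic", "Don't Know"]

def binary_target(category):
    """Binary setup: 1 for 'Main or Important', 0 for every other category."""
    return int(category == "Main or Important")

def multiclass_target(category):
    """Multiclass setup: integer class index over all four categories."""
    return CATEGORIES.index(category)
```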

Training Features in the Model

Our goal is to learn a general model that is based on the features of a post and its topic labels, essentially the same information used by humans to judge relevance. We extracted features that are readily available and contain helpful information for judging the relevance of topic labels. We do not utilize low-level features such as word tokens and image pixels, because these features are already utilized by the single-source annotators. These features, though not comprehensive, are common to most topic labeling systems for social media posts and serve the purpose of validating our proposed system.

• Topicality scores from single-source annotators (denoted as topic). Single-source annotators generate topic labels with a topicality score (between 0 and 1) for each topic, as shown in Equation 3.1. For instance, for a short post like "Go Wolves!", the text annotator is not confident that "Wolves" refers to the NBA basketball team, the Timberwolves, due to lack of context. Therefore, the topic label "Timberwolves" may have a low topicality score. Furthermore, different annotators assign topicality scores according to different standards: some annotators may be conservative while others may be more optimistic. With human evaluation, the supervised ensemble model can learn how much confidence to place in the topicality scores from the various annotators. In our system, we use six single-source annotators, resulting in the topicality score feature vector in Equation 3.2. A topicality score of 0 means the topic label is not generated by a particular annotator.

\langle T_{\mathrm{author}}, T_{\mathrm{comment}}, T_{\mathrm{photo}}, T_{\mathrm{text}}, T_{\mathrm{video}}, T_{\mathrm{webpage}} \rangle \qquad (3.2)

• Conditional probability with other topic labels on a post (denoted as prob). Because one post mostly talks about one topic, it is unlikely that very different topic labels co-appear on one post. For example, "NBA" is very unlikely to appear with "Knitting" on the same post. More formally, assuming we already have a topic label L_1 for a post, the conditional probability of L_2 co-appearing with L_1 is computed as in Equation 3.3. With more topic labels, we approximate the combined probability using the geometric mean (see the sketch after this feature list).

P(L_2 \mid L_1) = \frac{Freq(L_1, L_2)}{Freq(L_1)} \qquad (3.3)

• Length of text in post (denoted as length). We also extract the length of the text in a post, with the intuition that single-source annotators tend to be more accurate on longer posts. We apply a log transformation and normalize this feature to the range 0 to 1 using the maximum post length.

• Type of attachment (denoted as type). The type of attachment in posts also affects how people perceive the topic of the post. For instance, users tend to pay more attention to images and videos when they are present in a post. This is a categorical feature.

• Whether the topic label is a user-provided hashtag (denoted as hashtag). There is a good chance that humans would perceive a topic label as relevant if it is one of the author-provided hashtags. This feature is a binary variable.

For the rest of this chapter, we will refer to features by their shorthand notations. For example, topic, prob represents a set of features consisting of the topicality score vector and the conditional probability with other labels.
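A sketch of how the five feature groups could be assembled for one (post, label) pair, assuming hypothetical inputs: per-annotator topicality scores, a label co-occurrence table, label frequencies and the maximum post length. It illustrates the features described above, not the production pipeline.

```python
import math

ANNOTATORS = ["author", "comment", "photo", "text", "video", "webpage"]

def label_features(post, label, annotator_scores, cooccur, label_freq, max_len):
    """Assemble the topic / prob / length / type / hashtag features for one candidate label."""
    # topic: topicality scores from the six single-source annotators (0 if absent), Eq. 3.2.
    topic = [annotator_scores.get((a, label), 0.0) for a in ANNOTATORS]

    # prob: geometric mean of P(label | other label) over the post's other labels, Eq. 3.3.
    others = [l for l in post["labels"] if l != label]
    probs = [cooccur.get((l, label), 0) / max(label_freq.get(l, 1), 1) for l in others]
    if probs and all(p > 0 for p in probs):
        prob = math.exp(sum(math.log(p) for p in probs) / len(probs))
    else:
        prob = 0.0

    # length: log-transformed text length, normalized to [0, 1] by the maximum post length.
    length = math.log1p(len(post["text"])) / math.log1p(max_len)

    return {
        "topic": topic,
        "prob": prob,
        "length": length,
        "type": post["attachment_type"],            # categorical; one-hot encode before training
        "hashtag": int(label in post["hashtags"]),  # binary
    }
```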

Classification Algorithm

We wanted to understand the performance of the ensemble model under different classification algorithms. We picked several popular classification algorithms, implemented in the popular machine learning library scikit-learn [52], for both the binary and the multiclass classification problem. The algorithms we experimented with are:

• Random Forest (RF). RF is an ensemble learning model that computes the average decision of many decision trees (200 in our case) trained on random samples of the data set. RF, like basic decision trees, can handle multiclass classification.

• Gradient Boosting Classifier (GBC). GBC is an additive model that iteratively adds decision trees (200 total trees in our case) using boosting. Like RF, GBC can classify multiple classes.

• Logistic Regression. We use logistic regression with L2 regularization. For multiclass classification, we use the one-vs-rest strategy. This strategy fits a binary classifier for each class, with that class as positive and the other classes as negative, and decides the class for a data point as the majority decision of all binary classifiers.

• Support Vector Machine (SVM). We use a linear SVM with L2 regularization. As with Logistic Regression, we apply SVM to multiclass classification using the one-vs-rest strategy.

In the evaluation section below, we picked the best performing algorithm from the above list to compare with the baseline method.
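The following sketch shows how the four classifiers might be configured in scikit-learn and exhaustively searched over the 31 non-empty feature subsets used in the evaluation below; the build_matrices helper and any hyperparameters beyond the 200 trees mentioned above are assumptions.

```python
from itertools import combinations

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

FEATURES = ["topic", "prob", "length", "type", "hashtag"]

CLASSIFIERS = {
    "RF": RandomForestClassifier(n_estimators=200),
    "GBC": GradientBoostingClassifier(n_estimators=200),
    "Logistic": OneVsRestClassifier(LogisticRegression()),  # L2 regularization by default
    "SVM": OneVsRestClassifier(LinearSVC()),                # linear SVM, L2 by default
}

def best_model(build_matrices, y_train, y_test, average="binary"):
    """Try every classifier on every non-empty feature subset and keep the best F1.

    build_matrices(subset) is a hypothetical helper returning (X_train, X_test) for
    the chosen feature subset; pass average="weighted" for the multiclass task.
    """
    best = (None, None, -1.0)
    for k in range(1, len(FEATURES) + 1):
        for subset in combinations(FEATURES, k):         # 2^5 - 1 = 31 subsets
            X_train, X_test = build_matrices(subset)
            for name, clf in CLASSIFIERS.items():
                clf.fit(X_train, y_train)
                score = f1_score(y_test, clf.predict(X_test), average=average)
                if score > best[2]:
                    best = (name, subset, score)
    return best
```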

3.4 Evaluation

In this section, we evaluate the ensemble model on ground truth constructed from a gold standard data set. In our evaluation, we try to answer the following questions:

• How does the ensemble model compare with naive baseline methods? For binary classification, the baseline is a union predictor, i.e., predicting topic labels by aggregating all annotator outputs. For multiclass classification, the baseline is predicting the most common category for a label.

• How do the different classification algorithms perform, and which is the best performing ensembling technique?

• How do different features contribute to the ensemble model?

3.4.1 Evaluation Setup

Data. Using the crowdsourcing process described in Section 3.3.2, we created a training data set for the ensemble model, as well as a gold standard data set to test the ensemble model. We first group recent G+ posts (from August 2014) by the types of annotators used to generate topic labels. Then we sample uniformly at random from each group and form a stratified G+ post sample, representative of all annotators. Next, we create crowdsourcing tasks from each sampled post and have N distinct workers answer each task. Finally, we aggregate the answers from the N workers by taking the majority vote for each task. For the training data set, we have 5104 topic labels on 2550 posts, with N = 10, for a total of 51040 judgments. For the gold standard data set, we have 592 topic labels on 300 posts, with N = 20 for more reliability, for a total of 11840 judgments. By increasing the number of independent workers on each task, we get more reliable judgments. In a pilot study, we found that the quality of work done by 7 MTurk workers is comparable to the quality of work done by 3 trained corporate experts. Therefore, we are confident that judgments from 20 workers on MTurk can serve as the gold standard data set.
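A one-line sketch of the majority-vote aggregation (the input is a list of the N workers' category answers for one task):

```python
from collections import Counter

def majority_vote(worker_answers):
    """Aggregate the N workers' categories for one (post, label) pair by majority vote."""
    return Counter(worker_answers).most_common(1)[0][0]

# e.g. majority_vote(["Main or Important", "Related", "Main or Important"])
# -> "Main or Important"
```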

Experiment. Our evaluation is carried out in several steps:

1. We first train the four classification algorithms (described in Section 3.3.3) on all 31 (2^5 − 1) combinations of the 5 features (described in Section 3.3.3).

2. Then, we find the best model, i.e., the classification algorithm trained on the one feature combination that yields the best performance, and compare it to the baseline method.

3. Next, we compare the best performing model of each classification algorithm to the baseline method.

4. Finally, for the best classification algorithm, we compare the varying performance of different combinations of features.

3.4.2 Binary Classification for Main Topic Labels

We use standard precision/recall metrics to evaluate the performance of binary classification. We train supervised learning models and make predictions on the gold standard data: 592 pairs of topic label and post. The category distribution of the gold standard data is summarized in Table 3.2. For the positive class, i.e., the topic label being "Main or Important", we compute precision, recall and F1-score (defined in Equation 3.4).

Best Ensemble Model. For each of the four classification algorithms, we trained 31 models on various combinations of features.

Main or Important   Relevant   Off-topic   Don't Know
47.7%               29.5%      18.8%       3.9%

Table 3.2: Unbalanced class distribution in the gold standard data set.

The best performing model, as measured by F1-score, is the Gradient Boosting Classifier (GBC) trained on all five features, i.e., topic, prob, length, hashtag, and type. To understand the relative performance of the ensemble models in aggregating the various single-source annotators, we compare the performance of the ensemble models to the single-source annotators and to a baseline method that classifies topic labels from all annotators as "Main or Important", which we refer to as the All Annotator Baseline.

Annotator                F1      Precision   Recall
Author Name              0.007   0.111       0.003
Comment                  0.007   0.250       0.003
Photo                    0.049   0.242       0.027
Post Text                0.402   0.497       0.337
Video                    0.630   0.708       0.568
Web Link                 0.293   0.390       0.235
All Annotator Baseline   0.664   0.497       1.000
Ensemble Model           0.717   0.691       0.745

Table 3.3: Comparison of the best performing ensemble model and the single-source annotators, listed above. The All Annotator Baseline predicts the union of topic labels from the single-source annotators as positive. The best performing algorithm is GBC trained on all five features, which has the best overall F1-score here.

Table 3.3 summarizes the results of this comparison, which shows that the best ensemble model has the highest F1-score. There are three findings from this table:

• Our best ensemble model has the best overall performance (0.717 F1-score) in comparison with the baseline method and all single-source annotators. The ensemble model is significantly more precise than the baseline method (0.691 compared to 0.497), close to the most precise video annotator (0.708). Though having worse recall than the baseline, the ensemble model has the best recall compared to any single-source annotator.

• The first six rows of the table show the varying reliability of the different annotators. The annotator that extracts topic labels from the author name has precision as low as 0.111, while the most accurate video annotator has precision of 0.708. A big portion of Google+ posts, however, does not contain videos, resulting in the video annotator's poor recall of 0.568.


Figure 3.5: Comparison of the best models of each classification algorithm on the binary classification task. The four best models are trained on all five features after exhaustively searching feature combinations. The y axis shows the F1-score. GBC has the highest F1-score.

• The baseline method that blindly takes the union of topic labels from all annotators, though having perfect recall, has unsatisfying precision (0.497). This can be attributed to the fact that the baseline method also includes inaccurate topics from unreliable annotators.

In summary, the ensemble model can aggregate topic labels from unreliable annotators and identify relevant labels based on features of the post and topic labels.

Classification Algorithm. After exhaustively training on all combinations of features, we obtain the best models for the four algorithms when trained on the set of all 5 features. We find that the best models of the four classification algorithms all outperform the baseline method, as shown in Figure 3.5. The four models consistently have higher F1-scores than the baseline, with GBC having the highest F1-score. In other words, regardless of the implementation of the ensemble model, we can improve over the naive baseline.

[Figure 3.6 data (F1-scores): All Annotator Baseline 0.664; topic 0.687; topic, hashtag 0.708; topic, hashtag, length 0.715; topic, prob, length, type 0.718; topic, prob, length, hashtag, type 0.719.]

Figure 3.6: Feature analysis for the GBC model on the binary classification task. We show the best performing models trained on 1 to 5 features. The x axis shows the F1-scores of the models.

Feature Analysis. Both features about the topic labels and features about the post provide useful information. Due to space constraints, we only show the comparison results of models trained on the 31 feature combinations for the best performing GBC algorithm. Figure 3.6 depicts the GBC models trained on combinations of one to five features. We find topicality scores to be the most powerful single feature, resulting in a 0.689 F1-score. Adding the hashtag post feature (hashtag) and the type of attachment (type) further improved performance. Conditional probability consistency with other labels (prob) and length of text (length) did not have significant effects.

Combining these results, we find that our ensemble model, which aggregates single-source annotators using supervised machine learning, is better at classifying topic labels that are central to posts than any single-source annotator and the naive baseline method.


Figure 3.7: Comparison of the best models of each prediction algorithm for multi-class classification of topic label relevance. The features used by the best models are summarized in Table 3.4. The y axis shows the F1-score. GBC has the highest F1-score.

3.4.3 Multiclass Classification of Topic Labels

For this part of the evaluation, we extend our evaluation metrics to handle multiclass classification of all classes (i.e., "Main or Important", "Relevant", "Off-topic" and "Don't Know"). As the distribution of topic labels is unbalanced across classes (Table 3.2), we compute precision, recall and F1-score for each of the four classes and take a weighted average to get averaged precision, recall and F1-score, where the weights are the frequencies of the four classes (as shown in Equation 3.4).

F1 = \frac{n_1}{n} F1_1 + \frac{n_2}{n} F1_2 + \frac{n_3}{n} F1_3 + \frac{n_4}{n} F1_4 \qquad (3.4)

In the equation, F1_i denotes the F1-score for class i, n_i denotes the number of test data points in class i, and n denotes the total number of test data points. Under this setup, we introduce a baseline method that always predicts the most common label, "Main or Important", for all input data, which we refer to as the Common Label Baseline.
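Equation 3.4 is a support-weighted average of per-class F1 scores; the explicit loop below and scikit-learn's weighted average should agree (a sketch, with hypothetical inputs):

```python
import numpy as np
from sklearn.metrics import f1_score

def weighted_f1(y_true, y_pred, classes):
    """Equation 3.4: per-class F1 scores weighted by class frequency in the test data."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    total = 0.0
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        precision = tp / max(np.sum(y_pred == c), 1)
        recall = tp / max(np.sum(y_true == c), 1)
        f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
        total += (np.sum(y_true == c) / len(y_true)) * f1
    return total

# Should agree with scikit-learn's support-weighted average:
# f1_score(y_true, y_pred, average="weighted")
```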

Best Ensemble Model. As in the evaluation for binary classification, we train the four classification algorithms on the 31 combinations of the five features. Compared to the binary classification case, the best ensemble model shows a more significant improvement over the baseline method. The best model, GBC trained on topic, hashtag, prob as shown in Table 3.4, has the highest F1-score of 0.547, in comparison to the 0.308 F1-score of the baseline.

Algorithm   Features                       F1
Baseline    N/A                            0.308
Logistic    topic, hashtag, length         0.498
SVM         topic, hashtag                 0.475
RF          topic, hashtag, prob, length   0.526
GBC         topic, hashtag, prob           0.547

Table 3.4: The F1-scores of the best-performing model of every classification algorithm for multiclass classification, along with their feature combinations. GBC performs the best. All ensemble learning algorithms outperform the baseline algorithm, which consistently predicts the most popular category, "Main or Important".

Classification Algorithm. The best models of the four classification algorithms all show significant improvement over the baseline method; more details are shown in Figure 3.7. We achieve the best performance of each classification algorithm on different combinations of features, as shown in Table 3.4. This is different from binary classification, where all classification algorithms perform best on the full set of features. Overall, ensemble models outperform the baseline for all classification algorithms.

Feature Analysis. Consistent with the results for binary classification, we observe the strong predictive power of the topic label feature topicality scores (0.529 F1-score as a single feature) and the post feature length of post (0.542 F1-score with topic, length), as shown in Figure 3.8. However, we notice that hashtag, prob and type do not contribute to the classification. One possible explanation is that the topicality score is most informative about the degree of relevance, which is the basis for the multiclass classification task; the other features do not contain much information about the degree of relevance.

In summary, our ensemble model is significantly better than the naive baseline method at classifying topic labels. Trained with labeled data from the crowdsourcing process, the ensemble model is capable of classifying the relevance of unseen topic labels on new posts with decent performance.

[Figure 3.8 bar values: All Annotator Baseline 0.308; topic 0.529; topic, length 0.542; topic, hashtag, prob 0.547; topic, hashtag, prob, type 0.547; topic, prob, length, hashtag, type 0.544.]

Figure 3.8: Feature analysis for multi-class classification of topic label relevance. The x axis shows the F1-score of the GBC algorithm trained on different sets of features. The topicality score feature, topic, has the most predictive power. Somewhat surprisingly, prob, hashtag, and type bring little improvement.

3.5 Discussion

The collective intelligence of machine learning algorithms and the crowd enables scalable topic modeling of social media posts. Understanding the topics of social media posts is a task that humans can do easily. A machine learning algorithm learns from a limited amount of human-generated data and automates topic modeling for millions of social media posts, which would be impractical to label manually. Our experiment shows that a machine learning model trained on human judgments can improve the accuracy of automatically generated topic labels. Evaluating on a gold standard data set, we find that the ensemble model outperforms a baseline method that naively combines topic labels from all annotators in classifying topic labels that are Main or Important topics. The ensemble model also significantly outperforms a baseline method in multiclass classification of topic labels into relevance categories.

This study shows that we can automate and mimic human intelligence with a machine learning algorithm by learning from crowdsourced human judgments. Crowdsourcing enables a scalable and efficient way of leveraging human collective intelligence. A similar approach was used to construct the ImageNet [53] data set, which is used to train image recognition algorithms. Such crowdsourced data sets enable machine learning systems to automate and mimic human judgment.

The results show that our ensemble model significantly outperforms baseline methods for both binary and multiclass classification of topic labels. In other words, by integrating topic annotators for different parts of the posts, we have substantially improved topic label quality for Google+ posts. The collective intelligence of machine learning algorithms and the crowd helps us tackle the challenge of aggregating multiple single-source topic annotators. As the example in Figure 3.1 shows, even if we had perfect annotators for all parts of a post, the topic labels from these annotators may not be relevant to the post. Humans judge topic relevance by combining information from various parts of a post. By crowdsourcing topic label judgments, we obtain a data set and train a machine learning algorithm to mimic human judgment, scaling up topic labeling to millions of social media posts.

Instead of designing an algorithm around existing data, as in the previous chapter, we actively construct a data set through crowdsourcing and train an algorithm to mimic human wisdom. Such an approach requires understanding the limitations of the algorithm and the opportunities for improvement with human-generated data. The flexibility in constructing the data set also brings the challenge of ensuring the quality of crowdsourced data. As we demonstrated in this work, we need to carefully design the process for eliciting human judgments, including the interface of the online survey, the types of answers, and mechanisms to reduce spam. The quality of the data is crucial to the final performance of the machine learning algorithm. Asking humans to judge the relevance of topic labels is one basic application of human wisdom; in the following chapter we proceed to explore more complex tasks that can be achieved with collective intelligence.

Chapter 4

CrowdLens: Recommendation Powered by Crowd

4.1 Introduction

Humans can generate data that computational processes are built on, as shown in the previous chapter; computational processes can also provide assistance to humans. Such a symbiotic collaboration between humans and computers enables new applications. In this chapter, we leverage the collective intelligence of algorithms and humans to improve the user experience of receiving recommendations: both the recommended items and the explanations for recommendations.

Despite their popularity, recommender algorithms face challenges in offering a good user experience. First, while algorithms excel at predicting whether a user will like a single item, they struggle at composing sets of recommendations, which requires balancing factors such as diversity, popularity, and recency of the items in the sets [7]. Second, users like to know why items are being recommended. Current recommender systems, however, only provide template-based and simplistic explanations [54].

When algorithms are at their limit, human wisdom may shine. People may not be as good as algorithms at predicting how much someone will like an item [55]. But "accuracy is not enough" [7]. We conjecture that people's creativity, ability to consider and balance multiple criteria, and ability to explain recommendations make them well suited to take on a role in the recommendation process. Indeed, many sites rely on users to produce recommendations and explain the reasons for their recommendations.

For example, user-curated book lists on Goodreads1 are a popular way for users to find new books to read.

In this work, we experiment with methods to leverage collective intelligence in recommending and explaining items. The following research questions guide our study:

RQ1 - Organizing crowd recommendation. What roles should people play in the recommendation process? How can they complement recommender algorithms? Prior work has shown that crowdsourcing benefits from an iterative workflow (e.g., [56, 57, 58]). In our context, we might iteratively generate recommendations by having one group generate examples and another group synthesize recommendations, incorporating those examples along with their own fresh ideas. Alternatively, we might substitute a recommendation algorithm as the source of the examples. We are interested in the effectiveness of these designs compared with an algorithmic baseline. We speculate that this design decision will have a measurable impact on the quality, diversity, popularity, and recency of the resulting recommendations.

RQ2 - Explanations. Can crowds produce useful explanations for recommendations? What makes a good explanation? Explanations are common in recommender systems: they help people evaluate a recommendation and decide whether the recommended item is worth their attention. But systems generate only template-based explanations, such as "we recommend X because it is similar to Y". People, on the other hand, have the potential to explain recommendations in varied and creative ways. It is not clear, however, whether a crowdsourcing process will produce explanations that realize this potential. We will explore the feasibility of crowdsourcing explanations, identify characteristics of high-quality explanations, and develop mechanisms to identify and select the best explanations to present to users.

RQ3 - Volunteers vs. crowdworkers. How do recommendations and explanations produced by volunteers compare to those produced by paid crowdworkers? Compared to paid crowdworkers, users of a site generally have more domain knowledge, some commitment to the site, and may choose to volunteer their time without compensation.

1 www.goodreads.com

However, for all but the largest sites, recruiting paid crowdworkers (e.g., from Amazon Mechanical Turk) results in many contributions much more quickly. We thus compare result quality and timeliness for the two groups.

To address these questions, we built CrowdLens, a crowdsourcing framework and system to produce movie recommendations and explanations. We implemented CrowdLens on top of MovieLens2, a movie recommendation web site with thousands of active monthly users.

The rest of this chapter is structured as follows. We first survey related work. Then, we describe the CrowdLens framework for recommendation. We follow this with an experiment on MovieLens and a discussion of the research results for the recommendation and explanation tasks separately. We conclude with a summary of our experiment.

4.2 Related Work

Foundational crowdsourcing work showed that tasks could be decomposed into independent microtasks and that people's outputs on microtasks could be aggregated to produce high-quality results. After the original explorations, researchers began to study the use of crowdsourcing for more intellectually complex and creative tasks. Bernstein et al. [57] created Soylent, a Microsoft Word plugin that uses the crowd to help edit documents. Lasecki et al. [59] used crowdworkers to collaboratively act as conversational assistants. Nebeling et al. [60] proposed CrowdAdapt, a system that allowed the crowd to design and evaluate adaptive website interfaces. When applying crowdsourcing to these more complicated tasks, various issues arose.

Organizing the workflow. More complicated tasks tend to require more complex workflows, i.e., ways to organize crowd effort to ensure efficiency and good results. Early examples include: Little et al. [56] proposed an iterative crowdsourcing process with solicitation, improvement, and voting; and Kittur et al. [58] applied the MapReduce pattern to organize crowd work. For Soylent, Bernstein et al. [57] developed the find-fix-verify pattern: some workers would find issues, others would propose fixes, and still others would verify the fixes. Similar to this research, we developed an iterative workflow for incorporating crowds into recommendation: generate examples, then synthesize recommendations.

2 http://movielens.org

Types of crowdworkers. Depending on the difficulty and knowledge requirements of a task, different types of crowdworkers may be more appropriate. Zhang et al. [61] used paid crowdworkers from Amazon Mechanical Turk to collaboratively plan itineraries for tourists, finding that Turkers were able to perform well. Xu et al. [62, 63] compared structured feedback from Turkers on graphic designs with free-form feedback from design experts. They found that Turkers could reach consensus with experts on design guidelines. On the other hand, when Retelny et al. [64] explored complex and interdependent applications in engineering, they used a "flash team" of paid experts. Using a system called Foundry to coordinate, they reduced the time required for filming animation by half compared to using traditional self-managed teams. Mindful of these results, we compared the performance of paid crowdworkers to "experts" (members of the MovieLens film recommendation site).

Crowdsourcing for personalization. Personalization – customizing the presentation of information to meet an individual's preferences – requires domain knowledge as well as knowledge of individual preferences. This complex task has received relatively little attention from a crowdsourcing perspective. We know of two efforts that focused on a related but distinct task: predicting user ratings of items. Krishnan et al. [55] compared human predictions to those of a collaborative filtering recommender algorithm. They found that most humans are worse than the algorithm while a few were more accurate, and that people relied more on item content and demographic information to make predictions. Organisciak et al. [65] proposed a crowdsourcing system to predict ratings on items for requesters. They compared a collaborative filtering approach (predicting based on ratings from crowdworkers who share similar preferences) with a crowd prediction approach (crowdworkers predicting ratings based on the requester's past ratings). They found that the former scales up to many workers and generates a reusable dataset, while the latter works in areas where the tastes of requesters can be easily communicated, with fewer workers.

Generating sets of recommendations is more difficult than predicting ratings: as we noted, good recommendation sets balance factors such as diversity, popularity, familiarity, and recency. We know of two efforts that used crowdsourcing to generate recommendations: Felfernig et al. deployed a crowd-based recommender system prototype [66], finding evidence that users were satisfied with the interface. The StitchFix3 commercial website combines algorithmic recommendations with crowd wisdom to provide its members with a personalized clothing style guide. Our research also explores incorporating people into the recommendation process. We experimented with several roles for people and evaluated their performance using a range of recommendation quality measures.

4.3 The CrowdLens Framework

Though individuals are not as accurate as algorithms in predicting ratings [55], crowdsourcing gains power by aggregating inputs from multiple people. However, many options are possible when designing a system for crowdsourced recommendation and explanation. In this section, we describe several key aspects of CrowdLens, including the intended use of the recommendations, the recommendation workflow, and the user interface.

4.3.1 Recommendation Context: Movie Groups

Human recommendations do not scale like algorithmic recommendations. It is simply too much work to ask users to come up with long lists of personalized content for all users of a system. Therefore, human-built recommendation lists are typically non-personalized or semi-personalized: a web site lists the top games of the year; a DJ plays music suited for listeners who like a particular kind of music. In this research, we explore the idea of using humans to generate recommendations for users who have expressed a preference for a particular type of movie.

MovieLens uses "movie groups" during the sign-up process: after users register, they are asked to distribute three "like" points across six movie groups, where each movie group is summarized by three representative movies and three descriptive terms. See Figure 4.1 for an example. MovieLens then uses a semi-personalized modification of collaborative filtering to recommend for these new users, as described in Chapter 2. This algorithm will serve as the baseline of comparison for our crowd recommendation approaches.

3 www.stitchfix.com

Figure 4.1: An example movie group. New users in MovieLens express their tastes by allocating points across movie groups.

We therefore design CrowdLens to collect recommendations and explanations for each of the six movie groups offered to new users of MovieLens.

4.3.2 Crowd Recommendation Workflow

There are many possible designs for generating recommendations from a crowd of workers. Recent research provides evidence that providing good "example" responses can improve the quality [67] and diversity [68] of responses in crowdsourced creative work; however, providing examples may also lead to conformity and reduced diversity [69]. Therefore, we thought it was both promising and necessary to experiment with a recommendation workflow that used examples.

CrowdLens organizes recommendation into a two-step "pipeline": the first step generates a candidate set of recommendations, and the second step synthesizes the final set of recommendations, either drawing from the generated candidates or adding new content. This process may enable both creativity – the first group will come up with more diverse and surprising recommendations – and quality – the second group will gravitate toward and be guided by the best recommendations from the first. This workflow is similar to workflows that have been used successfully elsewhere for crowdsourcing subjective and creative tasks [56, 57, 58]. The first step (generate candidates) in the CrowdLens pipeline can be fulfilled either by a recommendation algorithm or by crowdworkers. This allows us to experiment with different configurations of human-only and algorithm-assisted workflows.

Figure 4.2: CrowdLens user interface. The four core components are (1) task instructions, (2) example recommendations, (3) the worker-generated recommendation list with explanations, and (4) multi-category search. The interface for generating examples is the same except without component 2 (examples).

4.3.3 User Interface

See Figure 4.2 for a screenshot of the CrowdLens recommendation interface. The recommendation task is framed by asking workers to produce a list of 5 movies that "would be enjoyed by members who pick [a specific movie group]". As detailed in the figure, the interface has four major parts:

1. Instructions for the crowdworker and a description of the target movie group.

2. A set of example recommendations. Crowdworkers may add recommendations from this list by clicking the “+” sign next to each example movie. This component is visible only for workers in the second (synthesize) step, not the first (generate).

3. The list of movies the crowdworker recommends. Each recommendation is accompanied by a text box where the crowdworker must write an explanation for why they are recommending the movie.

4. An auto-complete search interface, which crowdworkers can use to find movies to recommend. Users can search for movies by title, actor, director, or tags.

4.4 Experiment

We conducted an online experiment of crowd recommendation on MovieLens. We recruited participants from MovieLens and Amazon Mechanical Turk to generate and synthesize recommendations and to produce explanations. We evaluated the resulting recommendations and explanations using both offline data analysis and human judgments, which we gathered from another set of MovieLens users.

Participants

We recruited 90 MovieLens participants via email invitations between Dec 16, 2014 and Jan 30, 2015. The qualification criterion for MovieLens users was logging in at least once after November 1, 2014. We recruited 90 Amazon Mechanical Turk workers on Jan 30, 2015. Each Turker was allowed to complete one HIT; this ensured that no single Turker was assigned to multiple experimental conditions. We recruited Turkers from the US and Canada with an approval rate of over 95%, paying them $0.75 per HIT (which took 5-8 minutes on average).

Experimental Design

We have a 2 × 2 between-subjects design with the following factors:

1. Type of worker: volunteer (MovieLens user) or paid crowdworker (from Amazon Mechanical Turk).

2. Source of examples: are example recommendations generated by an algorithm or aggregated from another set of workers?

In discussing our results, we refer to the four resulting experimental conditions as ML, ML_AI, TK, and TK_AI (see Figure 4.3). These pipelines generate both recommendations and explanations. We also include a baseline condition: the semi-personalized collaborative filtering algorithm of Chang et al. [70] described in Chapter 2. We refer to this baseline as AI. The baseline algorithm only generates recommendations; therefore there is no baseline condition for studying explanations.

Figure 4.3: Five pipelines for producing recommendations. The human-only and algorithm-assisted pipelines were instantiated with both MovieLens volunteers and paid crowdworkers from Mechanical Turk. The MovieLens recommendation algorithm served as a baseline and was the source of input examples for the algorithm-assisted pipelines.

In each of the four experimental pipelines, five participants independently produced a set of 5 recommended movies for each of the 6 MovieLens movie groups. In the human-only pipelines (TK and ML), 5 participants generated the initial (example) recommendations. This accounted for the total of 180 experimental participants:

• 90 MovieLens volunteers: (10 (ML) + 5 (ML_AI)) × 6 (MovieLens movie groups), plus

• 90 Turkers: (10 (TK) + 5 (TK_AI)) × 6 (MovieLens movie groups)

We clarify the recommendation pipelines by walking through an example: recommending for the movie group shown in Figure 4.2 using the ML pipeline.

1. 5 MovieLens volunteers independently recommend movies for the movie group, with no examples provided (using an interface similar to Figure 4.2 but with component 2 removed). They also write explanations for their recommendations.

2. To prepare for the synthesis step, we find the N unique movies among the 5 × 5 movies recommended in step 1.

3. Each of the next 5 MovieLens volunteers uses the interface in Figure 4.2 to synthesize recommendations, with the N movies shown as examples in random order.

4. We aggregate the recommended movies from step 3 to get the top 5 most frequent recommendations (breaking ties randomly), together with the explanations for these 5 movies from steps 1-3.

For the algorithm-assisted pipelines (e.g., ML_AI), we replace steps 1-2 with the baseline recommendation algorithm, which generates the same number N of examples as its human counterpart.
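The aggregation in step 4 is simple frequency counting with random tie-breaking. A minimal sketch, with hypothetical input structure (one movie list per synthesis worker), is shown below.

```python
# Sketch of step 4: keep the 5 most frequently recommended movies across the
# five synthesis workers, breaking ties at random.
import random
from collections import Counter

def aggregate_recommendations(worker_lists, k=5, seed=None):
    """worker_lists: list of per-worker movie lists (each worker recommends 5)."""
    rng = random.Random(seed)
    counts = Counter(movie for lst in worker_lists for movie in lst)
    movies = list(counts)
    rng.shuffle(movies)                      # randomize before a stable sort
    movies.sort(key=lambda m: counts[m], reverse=True)
    return movies[:k]
```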

4.4.1 Obtaining Human Quality Judgments

We recruited MovieLens users to provide judgments of the recommendations and explanations produced in the recommendation pipelines. We emailed users who had logged in at least once after November 1, 2014 and who had not been invited to participate in the crowd recommendation experiment. 223 MovieLens users responded to the survey between March 4 and March 29, 2015.

Participants used the interface shown in Figure 4.4 to judge recommendations and explanations. We randomly assigned judges to 1 of the 6 movie groups to evaluate the corresponding recommendations and explanations from the 5 pipelines. For each movie group, we aggregated the five recommendation lists from the five pipelines to get a set of unique movies (15-18 in the experiment). Similarly, for each movie, we collected all explanations for that movie from the four crowd recommendation pipelines. To avoid order effects, judges rated the set of unique movies, and the explanations shown for each, in random order. To avoid the effect of explanations on movie quality judgments, judges could only rate the corresponding explanations after rating a movie. Figure 4.4 shows the instructions for judges.

Figure 4.4: Interface for collecting judgments of recommendations and explanations.

Judges rated recommended movies on a five-point scale from "Very inappropriate" to "Very appropriate" (mapped to -2 to 2), with the option to say "Not Sure", and rated explanations on a five-point scale from "Not helpful" to "Very helpful" (mapped to -2 to 2). While explanations could be evaluated on many dimensions [71], we asked judges to focus on how helpful they would be to a user who was considering a recommendation.

4.5 Study: Recommendations

We first ask a simple question: do the different recommendation pipelines actually differ? Do they produce different movies? We then evaluate recommendations produced by the five pipelines using the standard criteria we discussed earlier – quality, diversity, popularity, and recency – as well as human judgments. We conclude this section by discussing our results in the context of our research questions.

AI:     Schindler's List; Saving Private Ryan; The Silence of the Lambs; The Pianist; The Matrix
ML:     Fight Club; Good Will Hunting; The Shawshank Redemption; 12 Angry Men; Glengarry Glen Ross
ML_AI:  The Pianist; Apollo 13; Saving Private Ryan; Schindler's List; Philadelphia
TK:     Saving Private Ryan; The Green Mile; Cast Away; The Silence of the Lambs; Apollo 13
TK_AI:  Saving Private Ryan; Gladiator; 12 Angry Men; Schindler's List; Dead Men Walking

Table 4.1: Final recommendation lists from the five pipelines for the movie group shown in Figure 4.1.

4.5.1 Different pipelines yield different recommendations

Table 4.2 shows that there is little overlap between the different pipelines. The table shows the average (across the 6 movie groups) number of common movies between each pair of pipelines (each of which generated 5 movies). Recommendations from the two human-only pipelines, ML and TK, have little in common with recommendations from the baseline algorithm: 0.8 common movies for ML and 1 common movie for TK. The overlap increases for the algorithm-assisted pipelines, with 2.5 common movies between ML_AI and AI and 2.2 common movies between TK_AI and AI. In other words, the algorithm-assisted pipelines exhibited a stronger fixation effect; participants were more likely to select examples that were produced by the algorithm than by other participants. As an illustration, Table 4.1 shows the five recommendation lists for the movie group shown in Figure 4.1.

         ML_AI         ML            TK_AI         TK
ML       0.5 (0.96)    -             -             -
TK_AI    1.8 (0.90)    1.3 (0.75)    -             -
TK       0.5 (0.76)    0.5 (0.76)    1 (1)         -
AI       2.5 (0.69)    0.8 (0.69)    2.2 (0.37)    1 (0.82)

Table 4.2: Average number of overlapping movies recommended by each pair of pipelines, averaged across the 6 movie groups. Each pipeline generates 5 movies as final recommendations. Standard deviations are shown in parentheses.
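The overlap statistic in Table 4.2 is a per-group set intersection averaged over the six movie groups; a small sketch, assuming one 5-movie list per movie group for each pipeline, is shown below.

```python
# Sketch of the pairwise overlap measure (mean and sample standard deviation
# of shared movies per movie group between two pipelines).
from statistics import mean, stdev

def average_overlap(lists_a, lists_b):
    """lists_a, lists_b: per-movie-group 5-movie lists for two pipelines."""
    overlaps = [len(set(a) & set(b)) for a, b in zip(lists_a, lists_b)]
    return mean(overlaps), stdev(overlaps)
```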

4.5.2 Measured Quality: Algorithm slightly better

Our intuition for evaluating quality was: if a pipeline recommends a movie for a specific movie group, how highly would users who like that group rate this movie? We used existing MovieLens data to formalize this intuition: for each movie group, we find the set of MovieLens users who assigned at least one point to the group, and then compute these users' evaluation of the recommendations as the average of their past ratings on the recommended movies.4

Pipeline   Average Measured Quality (SD)   Average Judged Quality (SD)
AI         4.19 (0.65)                     0.93 (0.63)
ML         4.12 (0.63)                     1.26 (0.31)
ML_AI      4.10 (0.66)                     1.23 (0.35)
TK         4.00 (0.68)                     1.07 (0.68)
TK_AI      4.12 (0.64)                     1.26 (0.31)

Table 4.3: Measured and human-judged quality of recommended movies. For a movie group in a pipeline, measured quality is computed as the average of ratings (on a 5-star scale) on recommended movies from MovieLens users who indicated a preference for the movie group. User judgments from the online evaluation range from -2 (very inappropriate) to 2 (very appropriate). Both columns show the average for all pipelines across six movie groups.

We compared the quality of recommendations from the five pipelines as follows. We used a mixed-effect linear model with pipeline as a fixed effect and user as a random effect, accounting for variation in how users use the 5-star rating scale. Then, we performed pairwise comparisons of the least-squares means of the five pipelines' average ratings. As we expected, recommendations from the baseline algorithm have the highest average rating (p < 0.001 using least-squares means comparison), as shown in Table 4.3, because the algorithm is designed to recommend movies with the highest average ratings for people who like a movie group. Average ratings of crowd-generated recommendations, however, are only slightly worse, by less than 0.2 stars on a 5-star scale. TK is the worst of the crowd pipelines.
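One way to fit a model of this kind in Python (an assumption of tooling; the text does not specify which statistics package was used) is statsmodels' MixedLM, with pipeline as a fixed effect and user as the grouping (random) effect. The column names below are illustrative, and the pairwise least-squares-means comparisons reported above are not reproduced here.

```python
# Minimal sketch: rating ~ pipeline with a per-user random intercept.
import pandas as pd
import statsmodels.formula.api as smf

def fit_pipeline_model(df: pd.DataFrame):
    """df: long-format data with columns 'rating', 'pipeline', and 'user'."""
    model = smf.mixedlm("rating ~ C(pipeline)", data=df, groups=df["user"])
    result = model.fit()
    print(result.summary())
    return result
```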

4.5.3 Judged Quality: Crowdsourcing pipelines slightly preferred

The human judgments of recommendations gave us another perspective on quality. We analyzed the judgments as follows. We first removed "Not Sure" responses. Then, for each judge, we computed their rating for each pipeline by averaging their ratings for the five movies generated by that pipeline. We analyzed the ratings using a mixed-effect model similar to that described in the previous section.

4 There were a median of 485.5 such users per movie group; min 180, max 780.

Figure 4.5: Diversity of recommendations. Y axis shows average pairwise similarities between tag genome vectors (lower values are more "diverse").

Human judges preferred recommendations from all crowd pipelines over the baseline algorithm (p < 0.01 using least-squares means comparison), as shown in Table 4.3. (Note that there was larger variance in judgments of recommendations from the algorithm.) There also was an interaction effect between the two experimental factors. Recommendations from algorithm-assisted Turkers (TK_AI) were better than those from Turkers only (TK). However, there was no difference between the two pipelines involving MovieLens users.

4.5.4 Diversity: A trend for crowdsourcing

Diversity is an attribute of a set of items: how "different" they are from each other. More diverse recommendations help users explore more broadly in a space of items, and prior work showed that recommender system users like a certain amount of diversity [8, 72].

We computed the diversity of the 5 movies from each recommendation pipeline using the tag genome [28]. The tag genome consists of vectors for several thousand movies measuring their relevance to several thousand tags. This lets us compute the similarity of any two movies by computing the similarity of their vectors, for example using cosine similarity. We compute the topic diversity of a set of recommendations as the average pairwise cosine similarity between the tag genome vectors of the recommended movies, a common way to quantify the diversity of recommendations [8]. The higher the value, the less diverse the recommendations. We compared the diversity, popularity, and recency (described in the following two sections) of recommendations using an ANOVA analysis with Tukey's HSD test.

As we had conjectured, crowdsourcing pipelines tend to result in more diverse recommendations (see Figure 4.5). However, the only statistically significant difference was between Turkers in the human-only process (TK) and the baseline algorithm (p < 0.05). Note that the statistical power of this analysis was reduced because we have fewer data points, just one value per set of recommendations (since diversity is a property of a set of items, not a single item). Thus, we are confident that if we had more data, we would find that people generally recommend more diverse movies than the algorithm.

Figure 4.6: Popularity of recommended movies. Y axis shows the number of ratings from all MovieLens users on a natural log scale.
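The diversity measure itself is straightforward to compute; a minimal sketch, assuming the tag genome vectors of a recommendation set are available as rows of a NumPy array, is shown below.

```python
# Sketch of the diversity measure: average pairwise cosine similarity between
# tag genome vectors of a recommendation set (higher = less diverse).
from itertools import combinations
import numpy as np

def avg_pairwise_similarity(tag_genome_vectors: np.ndarray) -> float:
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    pairs = list(combinations(range(len(tag_genome_vectors)), 2))
    return sum(cosine(tag_genome_vectors[i], tag_genome_vectors[j])
               for i, j in pairs) / len(pairs)
```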

4.5.5 Crowdsourcing may give less common recommendations

Different users may want different levels of popularity in the recommendations they receive. Prior research [72] showed an item's popularity – how frequently it has been rated by users – to be positively correlated with new-user satisfaction, by earning trust. Long-term users, however, may want to discover and explore new items with the help of recommender systems [73]. We measured movie popularity as the log transform of its number of ratings in MovieLens (this followed a normal distribution after the transformation). Note that this definition of "popularity" does not mean an item is "liked", just that many users expressed an opinion of it.

Crowdsourcing pipelines tended to result in less popular – thus potentially more novel – movies than the algorithm (see Figure 4.6). Turkers in the human-only pipeline (TK) recommended significantly less popular movies (p < 0.05); for MovieLens volunteers, there was a similar trend, but it was not statistically significant. Recall that we showed that workers were more likely to include examples from the algorithm in their recommendation sets than examples from previous workers. That result is reflected in this analysis: for example, Turkers who saw examples from the algorithm (TK_AI) were more likely to include them than Turkers who saw examples from previous Turkers (TK); this made their final results more similar to the algorithmic baseline results; in particular, they included more popular items.

4.5.6 Recency: Little difference

Prior work showed that different users prefer more or less recent movies in their recommendations [74]. We measured movie recency as the log transform of the number of days since the movie's release date. The lower the value, the more recent the movie. Figure 4.7 suggests that there is a trend for crowdsourcing pipelines to result in more recent movies; however, these differences were not significant.
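The two item-level measures used in this and the previous section reduce to simple log transforms; a small sketch, with illustrative field names, follows.

```python
# Sketch: popularity as ln(rating count); recency as ln(days since release),
# where lower recency values indicate more recent movies.
import math
from datetime import date

def popularity(num_ratings: int) -> float:
    return math.log(num_ratings)

def recency(release_date: date, today: date) -> float:
    return math.log((today - release_date).days)
```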

Figure 4.7: Recency of recommended movies. Y axis shows the number of days since release on a natural log scale (lower values are more "recent").

4.5.7 Discussion

RQ1 - Organizing crowd recommendation. We find evidence that the design of the crowdsourcing workflow matters: different pipeline structures lead to different recommendation sets, as measured by overlapping movies. Compared with the algorithmic baseline, the four human pipelines generated more diverse lists that contain less popular movies. We find differences in evaluated quality between the human pipelines and the algorithmic baseline. Looking further into these differences, we find an interesting contradiction: the algorithm-only pipeline generates recommendations with the highest average rating from target users, while the human pipelines are judged to be more appropriate in an online evaluation. We conjecture that the judges are considering user experience factors (e.g., topic match to the movie group) that also can impact recommendation quality [7]: humans recommend movies that are a better fit for the task, but have lower historical ratings.

We also observe an interesting fixation effect: showing algorithm-generated examples led the human pipelines to generate lists that overlap more with the algorithm-only pipeline. One positive impact of this fixation effect is that algorithm-assisted Turkers recommended higher quality movies. On the other hand, this effect dampens diversity and novelty, the core strengths of the human pipelines, for both Turkers and MovieLens volunteers. To summarize the trade-off, using algorithm-generated examples instead of human-generated examples reduced the time required to produce recommendations, but leads to less organic recommendation sets.

RQ3 - Volunteers vs. crowdworkers. Perhaps most interesting, we find only small differences in our outcome measures between MovieLens volunteers and Turkers. Since MovieLens members are self-selected as being interested in movies, while Turkers are not, one might conjecture that Turkers are at an enormous disadvantage: movie lovers will recognize more of the movies shown, and will more easily call related high-quality movies to mind. Possibly we do not find large differences because our recommendation task skews towards mainstream movies (we do not show obscure movies in the movie groups) or because movie watching is such a common pastime. Whether or not this is the case, this is a positive early result for systems wishing to use paid crowdworkers to recommend content.

As mentioned above, we observe an interesting interaction effect where Turkers appear to benefit from algorithmic suggestions more than MovieLens users. It is interesting to speculate why this might be the case. The fixation effect of algorithm-generated examples improves the quality of movies recommended by paid crowdworkers at the cost of popularity and diversity, but does not bring benefit to MovieLens volunteers. We speculate that this effect for Turkers reveals a relative weakness for recall tasks, along with a strength for synthesis tasks.

4.6 Study: Explanations

In this section, we turn our attention to the other output of the CrowdLens process: recommendation explanations. We first analyze the overall quality of explanations, focusing on the differences between paid crowdworkers and MovieLens volunteers (RQ3). We follow with an analysis of language features that are predictive of high-quality explanations (RQ2).

4.6.1 Explanation quality

The users recruited from MovieLens used our evaluation UI (Figure 4.4) to enter 15,084 ratings (on a scale of -2 to 2) of the quality of 336 crowdsourced explanations. We compare the quality of explanations using a mixed-effect linear model, in order to eliminate per-user scoring biases. The model has two independent variables, the type of worker (volunteer or crowdworker) and the source of examples, and one dependent variable, the average rating of the explanations.

The average rating for explanations from both types of workers is slightly above 0 on a scale of -2 ("Not helpful") to 2 ("Very helpful"). MovieLens users' explanations, on average, received higher ratings than explanations from Turkers (means: 0.14 vs. 0.03, p < 0.0001). There is no significant difference between workers in the human-only and algorithm-assisted pipelines. This is as expected, since showing example movies is not directly related to the task of writing explanations.

4.6.2 Features of Good Explanations

We expect that the quality of explanations provided by crowdworkers will vary substantially; it is natural in crowd work that some contributions will be of higher quality than others. In this analysis, we explore the features of explanations that are predictive of high evaluation scores. If we can extract key features of good explanations, we can use this knowledge to provide guidelines for workers as they write explanations.

For this analysis, we labelled explanations with an average rating ≥ 0.25 as "good" (145 explanations are in this class) and those with an average rating ≤ −0.25 as "bad" (102 explanations are in this class). We extracted language features using "Pattern"5, a popular computational linguistics library. We treated explanations as a bag of words, normalizing words to lowercase and removing common stop words. Based on these features (summarized in Table 4.4), we classify explanations as "good" or "bad" using a logistic regression model.

5 http://www.clips.ua.ac.be/pattern

Feature                                                          Effect   P-value
log(# words)                                                     1.55     ~ 0
% words that appear in tags on movie                             2.58     < 0.0005
% adjectives                                                     4.55     < 0.005
% words that appear in genres of movie                           5.97     < 0.01
Modality (-1 to 1 value computed by Pattern,                     0.98     < 0.01
  uncertain to certain tone)
Subjectivity (0 to 1 value computed by Pattern)                  1.26     0.06
# typos (given by Google spell checker)                          -1.61    0.07
% words that are directors' names                                3.97     0.07
Polarity (-1 to 1 value computed by Pattern,                     0.54     n.s.
  negative to positive attitude)
% nouns                                                          1.64     n.s.
% verbs                                                          -1.62    n.s.
% words that appear in movie title                               -0.02    n.s.
% words that are actor names                                     -2.88    n.s.
% words that are three tags used to describe the movie           1.75     n.s.
  group the movie is recommended for
% words that appear in plot summary of the movie                 -0.38    n.s.

Table 4.4: Extracted features of recommendation explanations, along with their effect and statistical significance in a logistic regression model that predicts whether an explanation is evaluated to be good or bad by MovieLens users.

Qualitatively, we observe substantial variance in the quality of explanations. To inform our discussion of features that are predictive of high and low quality evaluations, let us look at several sample explanations from the study (evaluated "good" or "bad" as described above). Below are two examples of "good" explanations:

For “Apollo 13”: Dramatic survival on a damaged space module. Great acting by Tom Hanks, Bill Paxton and Kevin Bacon.

For “Sleepless in Seattle”: Cute movie of cute kid matchmaking in Seattle and true love upto a the Empire State Building across the country in New York City - so romantic!

We notice that these (and other good explanations) contain some specific details about the movie (e.g., actors, setting, and plot) and the reasons why the movie is good. Below are two examples of "bad" explanations:

For "Apollo 13": Because is almost exclusively dramatic, good acting and intense.

For “The Avengers”: It’s good vs evil

We notice that these (and other bad explanations) are too short, overly derivative of the tags shown to describe the movie groups, and not as detailed as the good explanations. Qualitative insights such as these informed the development of the features that we subsequently extracted for our regression analysis.

As described above, we use logistic regression to predict "good" explanations, using a broad set of extracted features as input. To evaluate the accuracy of the logistic regression model, we ran a 5-fold cross validation and obtained a high average F-measure6 of 0.78. Given this accuracy, we are confident that the extracted features are indicative of the quality of explanations. Table 4.4 summarizes the model's features and effects. The model reveals that longer explanations, and explanations containing tags, genres, or other adjectives, are more likely to be highly evaluated.
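An illustrative sketch of this analysis follows. It computes only a few of the features from Table 4.4 (word count and overlap with a movie's tags and genres) and evaluates a logistic regression with 5-fold cross-validated F-measure; the Pattern-based part-of-speech, modality, and subjectivity features are omitted here, and the input structure is an assumption.

```python
# Sketch: simplified feature extraction plus 5-fold cross-validated F1 for a
# good-vs-bad explanation classifier.
import math
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def features(text: str, movie_tags: set, movie_genres: set) -> list:
    words = [w.strip(".,!?").lower() for w in text.split()]
    n = max(len(words), 1)
    return [
        math.log(n),                               # log(# words)
        sum(w in movie_tags for w in words) / n,   # % words in tags
        sum(w in movie_genres for w in words) / n, # % words in genres
    ]

def evaluate(explanations, labels):
    """explanations: list of (text, tags, genres); labels: 1 = good, 0 = bad."""
    X = np.array([features(t, tg, gn) for t, tg, gn in explanations])
    y = np.array(labels)
    return cross_val_score(LogisticRegression(), X, y, cv=5, scoring="f1").mean()
```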

4.6.3 Discussion

RQ2 - Explanations. Our evaluation revealed that the crowdsourcing pipelines resulted in explanations of neutral to acceptable quality on average, and many that were judged very good. The challenge, therefore, is in selecting and displaying only those explanations with the highest quality. Machine learning methods can predict this measured quality with good accuracy using easily extracted features. We find that the highest-quality explanations tend to be longer, and tend to contain a higher percentage of words that are tags, genres, or other adjectives.

RQ3 - Volunteers vs. crowdworkers. In our experiment, MovieLens volunteers provided better explanations than paid Turkers. Likely, this is because the MovieLens volunteers are more familiar with movie recommendations. Also, because they volunteered to participate in the experiment, they may be more motivated or informed than average users.

6 F-measure = (2 · precision · recall) / (precision + recall)

4.7 Summary Discussion

Through these studies, we see evidence that the collective intelligence of humans and computers has advantages over the wisdom of either one alone. Volunteers, though they generated the best recommendations in the human-only pipeline, are much more expensive in terms of time and money. With support from the algorithm, paid crowdworkers can recommend highly rated and diverse movies. In the mixed human and algorithm pipelines, the algorithm generates examples that are later synthesized by human workers, reducing human effort and improving the quality of recommendations from paid crowdworkers. Such a mixed approach exploits the strengths of both algorithms (for example, producing accurate recommendations) and people (for example, producing appropriately diverse sets of recommendations).

We also showed a potential opportunity for algorithm support in explanation writing tasks: judging the quality of explanations and providing writing suggestions for improvement. For example, a suggestion might be: "Your explanation has a 60% chance to be evaluated positively; to improve it, consider writing a longer explanation that describes more attributes of the movie that would make someone interested in watching it". Though not experimentally verified, we believe such algorithm support can result in higher quality explanations.

The real strength of humans in the recommendation process is being able to explain recommendations. Explanations help users decide whether to take a recommendation, especially when they have little prior knowledge. Therefore, in the following chapter we continue research on ways to combine the wisdom of algorithms and that of humans to generate high quality explanations at scale.

Chapter 5

Personalized Natural Language Recommendation Explanations Powered by Crowd

5.1 Introduction

In this chapter, we further explore structured ways to combine human and computer intelligence and build new applications in recommender systems. As we showed in the previous chapter, crowdworkers can participate in the recommendation process and explain recommendations. However, crowdworkers write explanations of mixed quality when given no specific instructions. Therefore, we incorporate algorithm support into the explanation writing process, leveraging collective intelligence to generate personalized natural language explanations. In this chapter, we present this process and evaluate such explanations through a user experiment on MovieLens.

Recommendation explanations are critical to the user experience of recommender systems. Both practitioners in industry and researchers [54, 71, 75] have shown persistent interest in explanations. Past research has shown several positive effects of recommendation explanations, including increasing user trust in recommender systems [76] and helping users make good decisions on recommendations [71]. Users also want explanations for their recommendations: presenting explanations is the second most requested feature on the MovieLens movie recommender site.1


• From your MovieLens profile it seems that you prefer movies tagged as space, this movie takes you in space and it feels claustrophobic to be there. It keeps you on the edge of your seat the whole time.

• From your MovieLens profile it seems that you prefer movies tagged as visual, Gravity is unlike what we have seen on a cinema screen before and arguably it has one of the best uses of 3D in a movie.

• From your MovieLens profile it seems that you prefer movies tagged as intense, the movie a pretty intense ninety minutes, with Bullock's character constantly battling one catastrophe after another, and all of it is amazing to see.

Figure 5.1: Example natural language explanations for the movie "Gravity". Depending on our model of a user's interest, our system selects one of the three explanations for the user.

Recommendation explanations on popular websites, however, are mostly generated by algorithms, and hence are syntactic and formulaic. For example, "Customers who bought this item also bought ..." on Amazon and "Based on your watching history." on YouTube.

In comparison, people can provide effective explanations for everyday recommendations. For example, when librarians recommend books, they may summarize and highlight a book's content and help people figure out whether it is a good fit for them. Though human effort is generally expensive, crowdsourcing enables computational processes to integrate human inputs at scale and on demand, extending the capability of current computational systems. For example, Cheng et al. [77] leverage crowd wisdom to train a machine learning system that can detect whether people in a video are lying.

Pure crowdsourcing approaches to generating natural language recommendation explanations will not succeed because most crowdworkers are not domain experts. For example, we cannot expect a crowdworker who has not seen the movie "Gravity" to write an effective explanation for a movie fan who likes intense movies.

1 https://movielens.uservoice.com/forums/238501-general

Therefore, to generate natural language explanations at scale, we combine several techniques. First, we build on existing algorithmic processes that generate personalized content-based explanations [75]. Second, we combine this content model with user review text, which is popular across many recommendation systems, including IMDb and Yelp. To convert these two inputs into a set of coherent explanations that can be selected for display on a per-user basis, we rely on a combination of several crowdsourcing and algorithmic steps.

In this chapter, we present a system that leverages collective intelligence, incorporating algorithm support into crowd effort, to generate personalized natural language explanations at scale. See Figure 5.1 for three example explanations generated by our process for the movie "Gravity"; in application, we would select for display the one explanation that best matches the current user's interests. Our system can be adapted to any recommendation application where user reviews and topic labels (e.g., categories, genres, or tags) for items are available. We deployed the system on MovieLens and conducted a controlled experiment to evaluate the resulting explanations. Using established rubrics from past research [71], we compared the natural language explanations with tag-based explanations generated by a state-of-the-art algorithm [75].

The rest of this chapter is structured as follows. We first describe the design space and related work. Then, we present our system, which consists of three processes. We follow this with an experiment on MovieLens and conclude with a discussion of the experiment results.

5.2 Design Space and Related Work

Explanation interfaces in recommender systems communicate why a user might like (or dislike) a particular item. In this section we describe several key dimensions, the research and industrial work that has explored these dimensions, and where our experimental system fits.

When designing recommendation explanations, there are several "styles" [78] of explanations to choose from: case-based, collaborative-based [54], content-based [75], conversational [79], demographic-based, and knowledge-based [80]. For example, Amazon's "Customers Who Bought This Item Also Bought ..." is collaborative-based, while Pandora's "Based on what you've told us so far, we're playing this track because it features ..." is content-based.

Apart from styles of explanations, explanations can be presented in different ways, e.g., using natural language, template-based language, or a graphical representation [21, 54]. Because template-based language can be generated by algorithms, it is most common on popular websites. Tintarev et al. [71] also experimented with filling a template with movie metadata: "Unfortunately, this movie belongs to at least one genre you do not want to see: . It also belongs to the genre(s): . This movie stars .". As for graphical explanations, Herlocker et al. [54], for example, experimented with histograms of ratings as explanations, finding them to be persuasive. Natural language explanations, however, cannot be easily generated by an algorithm, and have received little research attention.

Recommendation explanations can be personalized or non-personalized, depending on whether or not different users see different explanations for the same item. Personalization has been shown to have positive and negative effects [71], depending on the design. Herlocker et al. [54] found that non-personalized histograms of ratings are more persuasive than personalized histograms of ratings from users' neighbors in preference space. Vig et al. [75] also observed a similar effect with tag-based explanations: personalization resulted in a better user experience in one of their designs, but had a negative effect on the other.

Inspired by word-of-mouth recommendation explanations, we are interested in designing personalized natural language explanations about the content of recommended items. Our intuition is that by showing what users might like about recommended items (in a personalized fashion), we can convince them to explore or even consume unfamiliar items [7, 81]. Our goal is to help users decide whether or not to take a recommendation. Therefore, the text should be reasonably short to avoid overwhelming users as they casually browse recommendations. We hypothesize that users have a better experience with natural language than with template-based language.

In generating natural language explanations with human effort, we can either recruit professionals like librarians or paid crowdworkers.

Figure 5.2: System overview. There are two inputs to the system: topic labels and review data. The three procedures of the system are: (1) select key topical dimensions, (2) crowd synthesize explanations from related phrases in reviews, and (3) model users' preferences and present personalized explanations to users. The first two procedures use mixed computation, with crowd and algorithm, while the last one only uses the algorithm.

While professionals may generate higher quality content – though prior work has shown counter-examples [82] – crowdsourcing has the advantages of being scalable, on-demand, and relatively inexpensive. Crowdsourcing overcomes the disadvantages of the former (e.g., being expensive and not scalable) and enables integration with computational processes.

To reduce the difficulty of writing explanations, we can show crowdworkers relevant text from user reviews, found by simple text search or by more advanced data mining methods [83, 84, 85]. Finding crowdworkers who are familiar with items and capable of writing explanations is challenging. Showing relevant information from user reviews can address this issue: crowdworkers who have not watched a movie can still synthesize and edit already-written sentences. Current data mining methods focus on extracting keywords and key phrases from text. For example, Liu et al. [85] introduced an algorithm that extracts salient phrases from review data, e.g., "ice cream parlor" from Yelp's review data. These keywords offer limited help for crowdworkers. Therefore, we use simple text search to find sentences that may be relevant for explanations.

5.3 System Overview

In this section, we give an overview of the system, showing how we compose three sub-processes to generate personalized natural language explanations.

The system takes topic labels and reviews for items as input, and generates natural language explanations, which are presented to users in a personalized fashion based on their past activities, e.g., ratings, clicks, or purchases. Both topic labels and reviews are easily accessible in contemporary recommender systems. Topic labels for items can be either tags provided by users or topic entities (e.g., Google's knowledge graph entities) generated through a computational process [37, 86]. User reviews are a popular feature in e-commerce sites such as Amazon and the Apple App Store, and in specialized sites such as IMDB for movies and Yelp for restaurants.

The system consists of three processes, as shown in Figure 5.2 and described in more detail in Section 5.5:

1. Select key topical dimensions from item topic labels. We first select topical dimensions with an unsupervised learning approach, and then leverage crowd wisdom to refine the output.

2. Generate natural language explanations for the key topical dimensions. We data mine relevant quotes for the selected dimensions, and then ask crowdworkers to synthesize the quotes into explanations.

3. Model users' preferences and present explanations in a personalized fashion. We compute users' interest in the topical dimensions of an item based on their activities, such as ratings, clicks, and browsing, and then present to users the explanations that are most interesting to them (a minimal sketch of this selection follows the list).
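As referenced in process 3 above, the final selection can be sketched as choosing the explanation attached to the topical dimension with the highest user-interest score. This is a hedged illustration only: the interest scores and data structures below are hypothetical placeholders, not the deployed user model, which is described later in this chapter.

```python
# Sketch of process 3: pick the explanation whose topical dimension the user
# appears most interested in, given precomputed interest scores.
def select_explanation(user_tag_interest: dict, explanations_by_dimension: dict) -> str:
    """user_tag_interest: tag -> interest score inferred from past activity.
    explanations_by_dimension: tag -> crowd-written explanation for this movie."""
    best_tag = max(explanations_by_dimension,
                   key=lambda tag: user_tag_interest.get(tag, 0.0))
    return explanations_by_dimension[best_tag]
```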

5.4 Experiment Platform

We deploy our explanation system on MovieLens, which currently does not provide explanations for its recommendations. We recruit crowdworkers from Amazon Mechanical Turk ("Mturk" for short) for the human computational tasks required in our system. To ensure quality results, we only recruited US crowdworkers who had finished more than 2000 HITs (or micro tasks) and maintained a higher than 98% approval rate.

5.5 Overview of System Processes

We now describe three steps that generate explanations of the following form: "From your MovieLens profile it seems that you prefer movies tagged as [Topical Dimension], [Natural Language Explanation]". These steps, in particular, generate two natural language components: "[Topical Dimension]", a personalized topic label, and "[Natural Language Explanation]", a crowd-synthesized explanation based on review text. For each step, we describe separately the generalizable process and the experimental implementation details.

5.5.1 Select Key Topical Dimensions of Items

Description

We aim to find semantically diverse topical dimensions of an item from its associated topic labels to fill the "[Topical Dimension]" part of the explanation. Our intuition is that to personalize an explanation for a user, we should highlight an aspect of the recommended item for which the user has previously expressed a preference. For example, some movie fans may be drawn to the movie "Gravity" because of its space travel aspects, while others will be drawn to it for its high quality 3D effects. Therefore, we want to show different explanations (shown in Figure 5.1) to these two groups of users.

Using a clustering algorithm is one way to select diverse topic labels that represent a larger set of topic labels. Clustering, based on semantic similarities, groups similar topic labels together. For example, a clustering algorithm may group "3D", "visual" and "CG" together because they are semantically similar.

We leverage crowd wisdom to refine the topic label clusters. Clustering, as an unsupervised learning method, is sensitive to parameter choices and notoriously difficult to evaluate. In addition, we need to find one tag to fill in "[Topical Dimension]", a highly subjective task. Human wisdom has been shown to be effective in improving clustering results [87]. Therefore, we decided to ask crowdworkers to refine the algorithmically generated clusters and pick labels for the clusters.

Figure 5.3: Interface shown to crowdworkers for refining the clustering result.

Implementation

In our experimental system, we cluster the tags most relevant to each movie. We selected the top 20 most relevant tags for each movie using an automatic tagging system for movies, the tag genome [28]; with manual inspection, we found that these 20 tags describe most key attributes of a movie. For systems without the tag genome, a simple alternative is to take the top 20 most frequently applied tags.

Then, for each movie, we cluster the most relevant tags based on semantic similarities computed from movie reviews. To compute semantic similarities between tags, we first trained word2vec, a neural network embedding model [88], on a random sample of IMDB reviews (total size of 300MB), and then computed cosine similarities between the tags’ latent vectors from the embedding model. In training the word2vec model, we chose a dimension of 1000 for the embedding, after inspecting outputs from models with varying dimensions (ranging from 50 to 1500). With the semantic similarities between tags, we constructed a similarity graph of the top 20 tags for each movie and ran an affinity propagation clustering algorithm [89] on the graph. After tuning the affinity propagation parameters, we generated between 4 and 6 tag clusters for each movie, which seemed appropriate upon manual inspection.

After we generated the tag clusters, we recruited three MTurk crowdworkers to independently refine the tag clusters and identify labels, paying $0.15 for each task. We then aggregated their responses with majority voting. Using the interface shown in Figure 5.3, we instructed crowdworkers to 1) remove tags that did not belong with the other tags in a cluster, 2) remove tags that were inappropriate (or offensive) to appear in the explanation template “From your MovieLens profile it seems that you prefer movies tagged as [Topical Dimension]”, and 3) pick one tag as the label for each cluster. We asked crowdworkers to do these three tasks in sequence: they filter tags first and then pick a label from the remaining tags. The example in Figure 5.4 shows the result of this step for the movie “Goodfellas”: the automatically generated tag clusters have reasonably good overall quality but also some obvious defects, and crowdworkers can improve the clusters, making them more sensible.

mafia, gangster, gangsters, mob, crime, mentor

violent, narrated, violence, stylish, visceral, stylized, bloody, brutality

masterpiece, storytelling, drama, dialogue

interesting, original

Figure 5.4: An example of crowd-refined tag clusters for the movie “Goodfellas”. The clustering algorithm separated the tags into the four groups shown. The crowdworkers curated the tags: a strikethrough indicates bad cluster fit, a crossout indicates inappropriateness for explanations, and red/bold indicates selection as the representative tag for the cluster.
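A minimal sketch of this tag-clustering step is shown below. It assumes gensim (4.x) for word2vec and scikit-learn for affinity propagation; the corpus path, the handling of multi-word tags, and the parameter values are illustrative assumptions rather than the exact settings used.

```python
import numpy as np
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from sklearn.cluster import AffinityPropagation

# Train a word2vec embedding on review text (one review per line in a plain-text file).
# "imdb_reviews.txt" is a placeholder path; multi-word tags would need phrase handling,
# which is omitted here for brevity.
with open("imdb_reviews.txt", encoding="utf-8") as f:
    sentences = [simple_preprocess(line) for line in f]
model = Word2Vec(sentences, vector_size=1000, window=5, min_count=5, workers=4)

def cluster_tags(tags):
    """Cluster a movie's top tags by embedding similarity using affinity propagation."""
    tags = [t for t in tags if t in model.wv]        # keep tags covered by the embedding
    sim = np.array([[model.wv.similarity(a, b) for b in tags] for a in tags], dtype=float)
    ap = AffinityPropagation(affinity="precomputed", damping=0.9, random_state=0).fit(sim)
    clusters = {}
    for tag, label in zip(tags, ap.labels_):
        clusters.setdefault(label, []).append(tag)
    return list(clusters.values())

print(cluster_tags(["mafia", "gangster", "mob", "violent", "stylish", "masterpiece", "dialogue"]))
```

Note that affinity propagation does not take the number of clusters as an input; in practice the `preference` and `damping` parameters are what one would tune to land in the 4 to 6 clusters per movie reported above.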

5.5.2 Generate Natural Language Explanations for the Key Topical Dimensions

Description

Having obtained key topical dimensions for each movie in the previous process, we again use collective intelligence to generate the “[Natural Language Explanation]” for each of the topical dimensions. As discussed in Section 5.2, our explanations should be short, descriptive of a certain topical dimension, and well written. Given these constraints, only a human can write these explanations. However, writing explanations for a movie can be a challenging task for crowdworkers, who may not have watched the movie. Though an algorithm struggles to write explanations, it can extract relevant information about a topical dimension of a movie from user reviews, assisting crowdworkers in writing. In addition to algorithmic support, we use a two-stage MapReduce [58] workflow to ensure the quality of crowd-synthesized explanations. Showing algorithm-extracted text from user reviews simplifies the task of writing explanations; however, there is not a good computational method for measuring the quality of these explanations. Therefore, we had multiple independent crowdworkers synthesize explanations from selected review text in the Map phase, and then had another group of workers vote on the best explanation in the Reduce phase.

Implementation

Figure 5.5: Interfaces used for the Mapper phase of the MapReduce workflow in the “generate natural language explanations for the key topical dimensions” step.

To assist crowdworkers, we located and presented quotes describing a topical dimension of a movie. First, we found highly voted positive IMDB reviews for each movie. Then, for each topical dimension of the movie, we searched the indexed review text (using Whoosh – https://pypi.python.org/pypi/Whoosh/) for sentences containing any tag in the corresponding tag cluster. Finally, we selected the top 6 quotes, as ranked by the number of votes and ratings on the reviews, keeping the workload manageable for crowdworkers.
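A minimal sketch of this indexing-and-retrieval step is shown below. The schema, field names and single example document are illustrative assumptions; the real pipeline indexed sentences from highly voted positive IMDB reviews together with their vote counts.

```python
import os
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID, NUMERIC
from whoosh.qparser import QueryParser, OrGroup
from whoosh.query import Term

# One document per review sentence; "votes" carries the parent review's helpfulness votes.
schema = Schema(movie_id=ID(stored=True),
                sentence=TEXT(stored=True),
                votes=NUMERIC(stored=True, sortable=True))
os.makedirs("review_index", exist_ok=True)
ix = create_in("review_index", schema)

writer = ix.writer()
writer.add_document(movie_id="goodfellas", sentence="The dialogue is incredible.", votes=152)
writer.commit()

def top_quotes(movie_id, cluster_tags, limit=6):
    """Return up to `limit` sentences mentioning any tag in the cluster, ranked by votes."""
    with ix.searcher() as searcher:
        parser = QueryParser("sentence", schema=ix.schema, group=OrGroup)
        query = parser.parse(" OR ".join(cluster_tags))
        hits = searcher.search(query, limit=limit,
                               filter=Term("movie_id", movie_id),
                               sortedby="votes", reverse=True)
        return [hit["sentence"] for hit in hits]

print(top_quotes("goodfellas", ["drama", "masterpiece", "storytelling", "dialogue"]))
```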

For example, here are quotes for the aspect “drama, masterpiece, storytelling, dialogue” about the movie “Goodfellas”:

As much as the true events of Henry’s life have more than likely been dramatised and glamourised to a certain extent, the essence of this film IMO is that it is still a brilliantly damning portrayal of the characters and lifestyle of mobsters. The consistently fine acting by the large ensemble cast (both known and unknown), the cinematography, editing, dialogue, brilliant use of music, it’s all breathtaking.

The dialogue is incredible.

Storytelling with impeccable pacing, this is what it’s like when a master composer conducts his masterpiece.

If ever the word ‘masterpiece’ was meant to be used, it was for this film. ‘Goodfellas’ is a masterpiece, pure and simple.

Figure 5.6: Interfaces used for the Reduce phase of the MapReduce workflow in the “generate natural language explanations for the key topical dimensions” step.

In the mapper phase, we recruited three workers to synthesize explanations for each movie, paying $0.75 to each worker. As shown in the interface in Figure 5.5, we gave crowdworkers the following instructions: 1) pick the one quote, from the 6 shown, that best describes a topical dimension (represented as tags) of a movie; 2) rewrite the selected quote into an explanation that follows the template “From your MovieLens profile it seems that you prefer movies tagged as [Automatically Filled in Topical Dimension], ”, limited to 50 words. In the reducer phase, we recruited three workers to vote on the best explanations from the mapper phase, paying $0.40 to each worker. As shown in the interface in Figure 5.6, we asked crowdworkers to vote on the best explanation for a topical dimension (represented as tags) of a given movie. For example, the resulting explanation for “Goodfellas”, based on the majority votes for “drama, masterpiece, storytelling, dialogue”, is as follows:

From your MovieLens profile it seems that you prefer movies tagged as drama, and storytelling with impeccable pacing? Well, Goodfellas is the cinematic equivalent of a master composer conducting his masterpiece.

5.5.3 Model Users’ Preferences and Present Explanations in a Personalized Fashion

Description

Now that we have natural language explanations for multiple topical dimensions of a movie, we present users with explanations that best match with their own topic-based preferences. First, we model a user’s preferences on topic labels from her activities in the recommender system, e.g., ratings and clicks. Second, we choose a natural language explanation based on the user’s favorite topical dimension (represented as a topic label) for a given movie.

Implementation

In MovieLens, we modeled users’ preferences for tags based on their movie ratings, as shown in the equation below. More formally, the tag preference of user u on tag t, denoted as pref_{u,t}, was computed from the set of movies M_u that user u has rated.

\[
\mathit{pref}_{u,t} = \frac{\sum_{m \in M_u} \mathit{rating}_{u,m} \cdot \mathit{rel}_{m,t} + K \cdot \mathit{avg}_u}{\sum_{m \in M_u} \mathit{rel}_{m,t} + K} \tag{5.1}
\]

where m denotes a movie in M_u, rating_{u,m} denotes the user’s rating on movie m, and rel_{m,t} denotes the relevance score of tag t to movie m given by the algorithmic tagging system, the tag genome [28]. Note that we include a Bayesian prior avg_u, the user’s average preference over tags. With the choice of K = 20, pref_{u,t} is skewed towards the prior when the user has few ratings, because a few ratings cannot accurately represent the user’s preferences on tags. When recommending a movie to a user, we picked the topical dimension (each represented by a tag) with the highest value of pref_{u,t} and presented the matching natural language explanation.
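A minimal sketch of Equation 5.1 in code, together with the final dimension-selection step, is shown below. The data structures (plain dictionaries keyed by movie and tag) are illustrative assumptions; in MovieLens the ratings and tag-genome relevance scores would come from the production database.

```python
def tag_preference(user_ratings, tag_relevance, avg_pref, K=20.0):
    """Equation 5.1: Bayesian-smoothed, relevance-weighted average of the user's ratings.

    user_ratings:  dict movie_id -> the user's star rating of that movie
    tag_relevance: dict movie_id -> tag-genome relevance of tag t for that movie
    avg_pref:      avg_u, the user's average preference over tags (the prior)
    """
    num = sum(r * tag_relevance.get(m, 0.0) for m, r in user_ratings.items()) + K * avg_pref
    den = sum(tag_relevance.get(m, 0.0) for m in user_ratings) + K
    return num / den

def pick_dimension(user_ratings, relevance_by_tag, avg_pref, dimension_labels):
    """Pick the topical dimension (represented by its label tag) with the highest pref_{u,t}."""
    return max(
        dimension_labels,
        key=lambda t: tag_preference(user_ratings, relevance_by_tag[t], avg_pref),
    )

# Toy usage: choose between two candidate dimensions for a recommended movie.
ratings = {"goodfellas": 4.5, "gravity": 3.0}
relevance = {"storytelling": {"goodfellas": 0.9, "gravity": 0.4},
             "3d": {"goodfellas": 0.1, "gravity": 0.95}}
print(pick_dimension(ratings, relevance, avg_pref=3.5, dimension_labels=["storytelling", "3d"]))
```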

5.6 User Evaluation

We evaluated our natural language explanations against a baseline of personalized tag explanations, using a within-subjects user experiment in MovieLens. We picked the tag explanation designed by Vig et al. [75] because, similar to Pandora’s explanations, it represents state-of-the-art content-based explanation. Each participant answered survey questions for both the experimental and the baseline condition. We invited MovieLens users to participate in an online survey via email invitations. From January 29, 2016, we sent emails to 4,000 recently active users who had more than 15 total ratings and had logged in after November 1, 2015. 220 users finished our survey (we required users to submit at least one pair of responses for inclusion) and gave 711 responses (we allowed users to continue submitting more responses after their first). We implemented Vig et al. [75]’s tag explanations in MovieLens as follows. For a movie recommended to a user, we ranked highly relevant tags – those with a relevance score from the tag genome higher than 4 on a 1–5 scale – based on the user’s preferences, and then picked the top 5 tags in the ranking to fill in the template “We recommend the movie because you like the following features: [tag1, ..., tag5]” as an explanation.

Rubric | Statement | Stage | Metric
Efficiency | The explanation contains right amount of information. | 2 | Raw response
Efficiency | I know enough about this movie to decide whether to watch it. | 1, 2, 3 | Change of responses before and after seeing explanation (stages 1 and 2)
Efficiency | N/A | N/A | Time spent reading explanation
Effectiveness | I am interested in watching this movie. | 1, 2, 3 | Change of responses before and after watching trailer (stages 2 and 3)
Trust | I trust the explanation. | 2 | Raw response
Trust | The explanation reflects my preferences about this movie. | 2 | Raw response
Satisfaction | The explanation is easy to understand. | 2 | Raw response
Satisfaction | The explanation is useful. | 2 | Raw response
Satisfaction | I wish MovieLens included explanations like this. | 2 | Raw response

Table 5.1: Summary of questions asked in the 3-stage user survey. We asked users to respond to the above statements on a 5-point Likert scale. The “Stage” column shows the stage(s) in which each statement was shown. The “Metric” column shows how we processed responses to measure the corresponding rubric.

5.6.1 Survey Design

In the survey, participants evaluated both natural language explanations and the baseline tag explanations for movies randomly selected from a set of 100 movies with high average ratings that each participant had not rated. This set of 100 movies all had an average rating of at least 3.8 on a 5-star scale; 50 were drawn from the top 500 and 50 were drawn from the 500 to 1000 most frequently rated movies. We generated natural language explanations for the 100 movies, at a cost of $3.90 per movie, using the previously described system. For each participant, we randomly picked an unrated movie from the set of 100 and showed it together with a baseline or experimental explanation. We asked the participant to finish at least one pair of evaluations – one baseline, one experimental – and gave them the option to evaluate more pairs. We evaluated explanations according to established rubrics from past research [78]. We selected the four rubrics that are most appropriate for our natural language explanations: trust, effectiveness, efficiency and satisfaction. Two other goals from this framework – transparency and scrutability – are not a good fit for evaluating natural language explanations (we do not seek to show how the underlying recommender works), while the last rubric – persuasiveness – is not relevant to the goals of MovieLens, which does not seek to persuade users to watch more movies. A short definition of the four chosen rubrics follows:

• Efficiency: making users acquainted with items efficiently;

• Effectiveness: helping users make good decisions;

• Trust: earning trust from users;

• Satisfaction: offering positive user experience and overall satisfaction.

We designed a three-stage survey, with interfaces shown in the Appendix (Figure A.1, Figure A.2 and Figure A.3), to measure these four qualities of recommendation explanations, asking questions when participants 1) have only seen the title, year, genre and poster of a movie, 2) have also seen the explanation, and 3) have also watched a trailer for the movie (a preview/advertisement for the movie, typically 2–3 minutes long). In each stage, we asked participants to respond to several statements on a 5-point Likert scale, from “Strongly Disagree” to “Strongly Agree”. In Table 5.1, we summarize the questions from all three stages. Note that we asked participants to respond to two statements repeatedly in all three stages to measure the effectiveness and efficiency of explanations. More specifically, we measured effectiveness as the change of a participant’s response to “I am interested in watching this movie” before and after watching the movie trailer. The change should be small for effective explanations, which help users accurately gauge their interest in recommendations before consumption [78]. As asking participants to watch a full movie is unrealistic in this experimental context, we approximate this action with watching a trailer. Similarly, we measured efficiency with the time spent reading explanations and with changes in participants’ responses to “I know enough about this movie to decide whether to watch it.” before and after seeing explanations.

5.6.2 Results

Through analysis of survey responses, we describe findings on how personalized natural language explanations compare with the personalized-tag-explanation baseline on the four rubrics. We treat raw responses to survey questions as ordinal data and use non-parametric statistical models for analysis. For effectiveness and efficiency, where we measure the difference between two responses, we first map 5-point Likert responses to values 1–5, calculate the difference in values (ranging from -4 to 4) and treat the result as ordinal data. Using Cumulative Link Mixed Models (CLMM, as implemented in the ‘ordinal’ package of R), we model the fixed effect of the explanation type (baseline vs. experimental) on ordinal responses, with user and movie as random effects. We include the two random effects to capture the variance of responses across users and the variance caused by the different movies shown in the survey.

Efficiency

As compared with our baseline explanations, participants perceived the natural language explanations to have more information and a more appropriate amount of information; on the other hand, they took longer to read. Participants in 45% of cases agreed with the statement “The explanation contains right amount of information.” for natural language explanations, compared to only 19% for tag explanations (CLMM, N = 711, p ∼ 0), as shown in Figure 5.7a. More specifically, natural language explanations contain more information than tag explanations, as evidenced by changes of responses to “I know enough about this movie to decide whether to watch it.”: we observe greater changes in the positive direction when showing natural language explanations than when showing tag explanations (CLMM, N = 711, p < 0.01), as shown in Figure 5.7b. This is as expected, because natural language explanations are longer and more descriptive than tag explanations. As a result, participants spent more time reading natural language explanations than tag explanations: a difference of 0.38 in log-transformed seconds spent reading.

Figure 5.7: Survey responses to questions regarding efficiency. Natural language explanations – labeled “CROWD” – contained a more appropriate amount of information (a) and helped subjects more with decision-making (b).

(a) “The explanation contains right amount of information.” (percent disagree / neutral / agree): TAG 55% / 26% / 19%; CROWD 34% / 21% / 45%.

(b) Changes in response regarding knowledge about a movie (percent of responses with negative / no / positive change): TAG 6% / 39% / 55%; CROWD 4% / 32% / 63%.

Effectiveness

We observe that natural language explanations are no different from tag explanations in terms of effectiveness, which is measured as the change of responses to “I am interested in watching this movie.” between seeing explanations and having watched trailers. As shown in Figure 5.8, a participant’s interest level, in fact, is slightly more likely to decrease for natural language explanations than for tag explanations (28% vs. 26%). This may be because natural language explanations are more persuasive, so participants may have become overly excited after reading the explanations. Note that in both conditions, about one third of the users reported the same interest level after watching trailers; given that trailers are usually rich with information that assists decision-making, this result is encouraging regarding the overall effectiveness of both types of explanations.

Figure 5.8: Difference in interest before and after watching the trailer to measure effectiveness; no statistically significant difference (interest decreased in 26% of TAG cases and 28% of CROWD cases; about 34% showed no change in each condition).

Trust

We find that participants trusted personalized natural language explanations more than the baseline tag explanations, as shown in Figure 5.9. For the statement “I trust the explanation.”, participants were significantly (CLMM, N = 711 responses from 220 participants, who had the option to evaluate more than one pair; p < 0.05) more likely to agree when seeing natural language explanations than when seeing tag explanations (66% vs. 57% agree). We find more agreement with the statement “The explanation reflects my preferences about this movie.” for natural language explanations, but this effect is small and not statistically significant.

Satisfaction

Participants rated natural language explanations more highly on all three satisfaction questions, shown in Figure 5.10 (CLMM, N = 711, p < 0.001 for all three): 68% of responses agreed that natural language explanations are “useful” compared to 51% of responses for tag explanations. Though natural language explanations are longer and contain more information, 76% of the responses agreed that they are “easy to understand” compared to 61% of the responses for tag explanations. Overall, 68% of responses wished to include natural language explanations in MovieLens compared to 51% for tag explanations.

Figure 5.9: Survey responses to questions regarding trust. Participants trusted natural language explanations more than tag explanations (percent disagree / neutral / agree):
“I trust the explanation.”: TAG 17% / 26% / 57%; CROWD 15% / 19% / 66%.
“The explanation reflects my preferences about this movie.”: TAG 22% / 36% / 42%; CROWD 24% / 28% / 48%.

Figure 5.10: Survey responses to questions regarding satisfaction. Across the three questions, participants gave more positive responses for natural language explanations as compared with tag explanations (percent disagree / neutral / agree):
“I wish MovieLens included explanations like this.”: TAG 25% / 25% / 51%; CROWD 15% / 19% / 66%.
“The explanation is easy to understand.”: TAG 22% / 17% / 61%; CROWD 13% / 12% / 76%.
“The explanation is useful.”: TAG 26% / 23% / 51%; CROWD 19% / 13% / 68%.

5.7 Discussion

The user experiment shows that natural language explanations generated by our process are better received by users than the personalized-tag-explanation baseline, which represents state-of-the-art content-based explanation. Compared with the baseline, users perceived that natural language explanations are more trustworthy, contain a more appropriate amount of information and offer a better user experience. To our surprise, though, natural language explanations are comparable in effectiveness for helping users decide whether to take recommendations. Considering that users spend only a small amount of effort reading explanations, both content-based explanations are fairly effective in helping users make good decisions. However, apart from the content of items, many other factors may affect users’ decisions on whether to take recommendations, such as their trust in the system, social influence and past experience [81]. This is perhaps the reason why users perceive that natural language explanations contain richer information but are not significantly more effective in decision-making support.

Though natural language explanations written by humans are superior to automatically generated explanations, human labor is expensive and difficult to scale up. Our system addresses this challenge through a mixed computation model that combines intelligent algorithms with crowdsourced input. Crowdsourcing allows our system to quickly recruit large numbers of workers as needed. Our algorithmic processes model the content of items and extract quotes from existing user reviews, two steps that effectively reduce the human effort required per movie. In our experiment, we generated natural language explanations for 100 movies at a cost of $3.90 per movie in a short amount of time. For organizations that can afford this per-item cost, we believe our approach could be scaled to a much larger item space.

Our system underscores the potential of the collective intelligence of humans and computers. Recent research in machine learning, especially deep neural networks, has advanced the limit of computers in image/voice recognition tasks and natural language conversations – advances made possible by human-generated training data. For tasks that remain too challenging for a purely algorithmic solution (e.g., writing recommendation explanations), crowdsourcing may provide the necessary input to create scalable solutions. In the long run, crowd-generated data can be fed into machine learning processes to enable the automation of human work, pushing the limit of algorithms.

Chapter 6

Conclusion

This thesis presented applications of human and computer collective intelligence in the recommender system domain. In this concluding chapter, we first review the contributions of the thesis and then discuss potential implications.

6.1 Summary of Contributions

We make contributions chiefly in two ways: improving various aspects of recommender systems and demonstrating methodologies for combining human and algorithmic intelligence. In summary, we leverage collective intelligence to onboard new users more efficiently and effectively (Chapter 2), model the topics of multimedia posts (Chapter 3), recommend appropriate groups of items (Chapter 4) and explain recommendations (Chapter 5).

In Chapter 2, we proposed a more efficient interactive preference elicitation process for new users of recommender systems. Through a user evaluation on MovieLens, we found that, compared with the prior process that asks users to rate movies, users spend less time completing the process and perceive the resulting recommendations to be more satisfactory. In addition, our process is more transparent and intuitive to users.

In this work, we improved a classic way of leveraging collective intelligence: computationally mining human wisdom from existing human behavior data. We took a user-centered approach to redesign user interaction in preference elicitation. Contrary to prior research that focuses on algorithmically finding sequences of items for users to rate, we instead computationally constructed representative item groups and asked new users to interact with those groups. We also demonstrated ways, e.g., selecting data and tuning the algorithm, to ensure the performance of the central technology – clustering – in the recommender system domain.

In Chapter 3, we improved a key component of social recommender systems – topic labeling of user posts. We built a system that labels the topics of multimedia social media posts by intelligently combining the outputs of single-source topic labelers for text, image and video. Evaluated on Google+ posts, this system showed improvement on topic relevance judgment tasks at two granularities.

In this work, we demonstrated a more active way of combining computer and human intelligence: actively eliciting data from crowdworkers and training a machine learning algorithm on the data. We saw the limitation of algorithms, and the strength of humans, in understanding topics of social media posts composed of text, image and video. By building an ensemble learning algorithm on crowdsourced human judgments, we combine the strengths of humans and algorithms and automate topic labeling for multimedia posts.

To truly leverage the collective intelligence of computers and humans, we need to create a symbiotic working relationship between the two: humans provide data for computers, and computers support and direct humans in generating data. In Chapter 4, we incorporated crowd effort into the algorithmic recommendation process to compose sets of recommendations and provide explanations, both of which pose challenges for current algorithms. Through an evaluation on MovieLens, we found that humans, compared with algorithms, can produce more diverse and less common sets of recommendations that are highly evaluated by users. In addition, crowdworkers are capable of writing convincing explanations.

In this work, we explored the design space of collective intelligence in the recommendation process, including how to organize the workflow and which crowdworkers to recruit. We compared different design options through user evaluation on MovieLens. One workflow, which shows algorithm-generated recommendations to crowdworkers as examples, can improve the performance of paid crowdworkers to match that of highly motivated volunteers and reduces the cost of pure human workflows.

In Chapter 5, we again leveraged highly collaborative human and computer effort in the explanation writing task and built a process to generate natural language explanations at scale. We compared the resulting explanations with state-of-the-art algorithmically generated explanations in a user experiment on MovieLens, finding that personalized natural language explanations contain a more appropriate amount of information, earn more trust from users and make users more satisfied.

In this work, we demonstrated a multi-stage process that creates a mutually assisting relationship between human and computer, combining the data processing power of computers with the text understanding and writing skills of humans. Algorithmic support to crowdworkers reduces cost and ensures quality of work. In turn, crowdworkers refine the output of the algorithmic process and generate data that can later be used to improve the algorithm.

6.2 Implications

Though our work focuses on leveraging collective intelligence in the recommender system domain, the broader research community has shown growing interest in such collective intelligence, driven by advances in machine learning and crowdsourcing research. Researchers have built systems that combine the strengths of the two: Flock [77] detects people lying in video footage; JellyBean [90] counts objects in any picture; Zensors [91] offers powerful intelligent sensing of environmental changes such as lighting, furniture and seasons; Frenzy [92] provides support for complex conference scheduling tasks. Database researchers incorporate crowd effort into database operations to break current technology limitations [93, 94, 95]. Robotics researchers leverage crowd wisdom to train robots’ actions [96, 97].

Why do we combine the intelligence of computers and humans? Because computers and humans have reciprocal benefits, systems that combine the strengths of both can break through current technological limits and create innovative applications. Computers automate and scale up computation, offering speed, accuracy and memory. Humans are capable of complex reasoning and subjective judgment, offering flexibility and adaptivity. For example, as shown in Chapter 5, we combine the strength of humans in writing natural language text with the power of computers in topic modeling and search to provide personalized explanations. With such close collaboration between humans and computers, computers are “learning” from human behavior and pushing their limits, as evidenced by recent advances in machine image recognition and voice recognition. In turn, humans are liberated from mundane tasks that can be automated by machines and can focus on tasks that are more complex and challenging.

How do we design systems that leverage such collective intelligence? As demonstrated in this thesis, there are various ways to combine human and computer intelligence depending on the application.

• Perhaps the most widely applied method is leveraging human wisdom passively with computation [98], which processes human behavior data and extracts human wisdom from those data. These systems, including our system in Chapter 2, do not directly elicit human input but rather mine patterns from natural human behavior. For example, search engines rank the importance of web pages based on the hyperlinks people create on the Internet [99, 100], with the observation that more authoritative web pages are linked to more frequently. The key to the success of this type of system lies in designing a good algorithmic process, which includes data preprocessing and the internals of the algorithm.

Because users generate data naturally and possibly for completely different purposes, inferring patterns from the data requires careful consideration. In Chapter 2, for instance, rating data contains not only preference information but also reflects the popularity of movies. Therefore, to extract various movie tastes, we selected only high quality rating data from experienced users on frequently rated movies. In many cases, the peculiarity and noise in human generated data is not known a priori, calling for a trial-and-error design process.

• Another approach is to actively elicit human generated data, which feeds into a computational process. This approach is widely applied in areas where machine learning algorithms still struggle but are making good progress towards human-level performance. The end goal of these computational processes is often to automate and replace human labor. For example, crowdsourcing has been used to annotate natural language [101, 102], generate speech corpora for voice recognition [102, 103] and annotate objects in images [53].

In practice, we need to carefully consider quality control and cost trade-offs. Because crowdworkers are poorly motivated for these mechanical tasks, requesters of crowd work have to consciously ensure the quality of human generated data, as demonstrated in Chapter 3. Regardless of quality control effort, there is inevitable noise in crowd-generated data. System designers, therefore, need to deal with a trade-off between cost and quality. To make things even more complicated, when optimizing for the quality of the resulting machine learning process, there is also a decision on whether to gather more but less accurate labels or fewer but more accurate labels [104].

• A more symbiotic relationship between human and computer is mutual assistance in fulfilling a common goal. This collaboration is required for complex tasks that are beyond current technology, such as writing poetry and discovering beauty. Recommending movies and explaining recommendations, as presented in Chapter 4 and Chapter 5, are such subjective tasks. For these tasks, we use computers to process and mine patterns from data and ask crowdworkers to engage in the subjective and creative parts. This mutual assistance, as demonstrated in Chapter 4, results in better performance than either one alone. In industry, the online clothing styling service StitchFix (www.stitchfix.com) recruits stylists who refine algorithmic clothing recommendations and add personal style advice for clients.

Designing a workflow that efficiently uses expensive human labor is a key challenge. For example, in Chapter 5, we showed crowdworkers who write explanations relevant quotes from movie reviews to reduce the difficulty of the task. Such algorithmic support helps human workers focus their attention on creative and subjective tasks, which remain insurmountable challenges for algorithms.

6.3 Future Work

This thesis explores a new research area that combines human and computer intelligence, opening up many opportunities for future research. One direction is to come up with design guidelines for such hybrid systems. Current research focuses on specific applications and designs tailored workflows to combine computer and human wisdom. Some common design principles surface from these works: for instance, aggregating multiple crowdworkers’ input through a “MapReduce” paradigm, and batching slow human work offline to reduce latency. It would be helpful for the whole research community and for practitioners to summarize existing work and distill such guidelines.

Another strength of combining human and computer intelligence, not covered by this thesis, is the iterative improvement of a system over time. As we discussed before, human work can provide training labels for computational processes, which gradually replace mechanical human effort. As a result, such hybrid systems should evolve over time as human workers are freed from prior tasks. Many fear that this trend results in a loss of jobs for human workers. In fact, human workers always have comparative advantages over computers and can take on more creative tasks that are not replaceable by computers. The overall efficiency of such hybrid systems will increase over time.

We have demonstrated in this thesis that the collective intelligence of humans and computers can improve various aspects of recommender systems. We believe the power of such symbiosis between human and computer, however, is not limited to recommender systems. This tight collaboration between human and computer will be a powerful model for social computing systems.

References

[1] Francesco Ricci, Lior Rokach, and Bracha Shapira, editors. Recommender Systems Handbook. Springer US, Boston, MA, 2015.

[2] Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl. GroupLens: an open architecture for collaborative filtering of netnews. In CSCW, 1994.

[3] Joseph A Konstan and John Riedl. Recommender systems: from algorithms to user experience. User Modeling and User-Adapted Interaction, (1-2):101–123, 2012.

[4] Thomas W Malone, Robert Laubacher, and Chrysanthos Dellarocas. Harnessing crowds: Mapping the genome of collective intelligence. 2009.

[5] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, (8):30–37, 2009.

[6] Pearl Pu, Li Chen, and Rong Hu. Evaluating recommender systems from the users perspective: survey of the state of the art. User Modeling and User-Adapted Interaction, 22(4-5):317–355, mar 2012.

[7] Sean M. McNee, John Riedl, and Joseph A. Konstan. Being accurate is not enough. In CHI EA, 2006.

[8] Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, and Georg Lausen. Im- proving recommendation lists through topic diversification. In WWW, 2005.

[9] Jonathan L Herlocker, Joseph A Konstan, Loren G Terveen, and John T Riedl. Evaluating Collaborative Filtering Recommender Systems. ACM Trans. Inf. Syst., 2004.

[10] Robert Kraut, Moira Burke, and John Riedl. Dealing with newcomers. Evidence-based Social Design: Mining the Social Sciences to Build Online Communities, 2010.

[11] AM Rashid, G Karypis, and J Riedl. Learning preferences of new users in recommender systems: an information theoretic approach. ACM SIGKDD Explorations Newsletter, 2008.

[12] Al Mamunur Rashid, Istvan Albert, Dan Cosley, Shyong K. Lam, Sean M. McNee, Joseph A. Konstan, and John Riedl. Getting to know you: learning new user preferences in recommender systems. IUI, 2002.

[13] Mehdi Elahi, Francesco Ricci, and Neil Rubens. Active learning strategies for rating elicitation in collaborative filtering. ACM Transactions on Intelligent Systems and Technology, 5(1):1–33, 2013.

[14] Nadav Golbandi, Yehuda Koren, and Ronny Lempel. On bootstrapping recommender systems. In CIKM, New York, New York, USA, 2010.

[15] N Golbandi, Y Koren, and R Lempel. Adaptive bootstrapping of recommender systems using decision trees. WSDM, 2011.

[16] Nathan N Liu, Xiangrui Meng, Chao Liu, and Qiang Yang. Wisdom of the better few: cold start recommendation via representative based rating elicitation. In Proceedings of the fifth ACM conference on Recommender systems, pages 37–44. ACM, 2011.

[17] Sean M. McNee, Shyong K. Lam, Joseph A. Konstan, and John Riedl. Interfaces for eliciting new user preferences in recommender systems. pages 178–187, 2003.

[18] M Sun, F Li, J Lee, and K Zhou. Learning multiple-question decision trees for cold-start recommendation. WSDM, 2013.

[19] Benedikt Loepp, Tim Hussein, and Jürgen Ziegler. Choice-based preference elicitation for collaborative filtering recommender systems. In CHI, pages 3085–3094, New York, New York, USA, 2014.

[20] Brynjar Gretarsson, John O’Donovan, Svetlin Bostandjiev, Christopher Hall, and Tobias Höllerer. SmallWorlds: Visualizing Social Recommendations. Computer Graphics Forum, 29(3):833–842, 2010.

[21] Svetlin Bostandjiev, John O’Donovan, and Tobias Höllerer. TasteWeights: a visual interactive hybrid recommender system. In RecSys, 2012.

[22] Neil Rubens, Dain Kaplan, and Masashi Sugiyama. Active learning in recommender systems. In Francesco Ricci, Lior Rokach, Bracha Shapira, and Paul B. Kantor, editors, Recommender Systems Handbook, pages 735–767. Springer US, 2011.

[23] Christian Desrosiers and George Karypis. A comprehensive survey of neighborhood-based recommendation methods. In Francesco Ricci, Lior Rokach, Bracha Shapira, and Paul B. Kantor, editors, Recommender Systems Handbook, pages 107–144. Springer US, 2011.

[24] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, pages 849–856. MIT Press, 2001.

[25] Shuo Chang and Aditya Pal. Routing questions for collaborative answering in community question answering. ASONAM, Niagara, Ontario, Canada, 2013. ACM.

[26] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3–5):75–174, 2010.

[27] Brendan J. Frey and Delbert Dueck. Clustering by passing messages between data points. Science, 315(5814):972–976, 2007.

[28] Jesse Vig, Shilad Sen, and John Riedl. The Tag Genome. ACM Transactions on Interactive Intelligent Systems, 2(3):1–44, September 2012.

[29] Guy Shani and Asela Gunawardana. Evaluating recommendation systems. In Francesco Ricci, Lior Rokach, Bracha Shapira, and Paul B. Kantor, editors, Recommender Systems Handbook, pages 257–297. Springer US, 2011.

[30] S. Funk. Netflix update: Try this at home. http://sifter.org/~simon/journal/20061211.html, 2006.

[31] Michael D. Ekstrand, Michael Ludwig, Joseph A. Konstan, and John T. Riedl. Rethinking the recommender research ecosystem: Reproducibility, openness, and lenskit. RecSys, Chicago, Illinois, USA, 2011. ACM.

[32] Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen, and John T. Riedl. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems, 22:5–53, 2004.

[33] Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. Performance of recommender algorithms on top-n recommendation tasks. In RecSys, Barcelona, Spain, 2010. ACM.

[34] Asela Gunawardana and Guy Shani. A survey of accuracy evaluation metrics of recommendation tasks. J. Mach. Learn. Res., 10:2935–2962, December 2009.

[35] Peter J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.

[36] Frank Wilcoxon. Individual comparisons by ranking methods. In Samuel Kotz and Norman L. Johnson, editors, Breakthroughs in Statistics, Springer Series in Statistics, pages 196–202. Springer New York, 1992.

[37] Shuang-Hong Yang, Alek Kolcz, Andy Schlaikjer, and Pankaj Gupta. Large-scale high-precision topic modeling on twitter. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’14, pages 1907–1916, New York, New York, USA, August 2014. ACM Press.

[38] Aniket Kittur, Ed H. Chi, and Bongwon Suh. Crowdsourcing user studies with mechanical turk. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’08, pages 453–456, New York, NY, USA, 2008. ACM.

[39] Gabriella Kazai, Jaap Kamps, and Natasa Milic-Frayling. An analysis of human factors and label accuracy in crowdsourcing relevance judgments. Information Retrieval, 16(2):138–178, 2013.

[40] Tim Finin, Will Murnane, Anand Karandikar, Nicholas Keller, Justin Martineau, and Mark Dredze. Annotating named entities in Twitter data with crowdsourcing. pages 80–88, 2010.

[41] Omar Alonso, Catherine C. Marshall, and Marc Najork. Are Some Tweets More Interesting Than Others? #HardQuestion. In Proceedings of the Symposium on Human-Computer Interaction and Information Retrieval - HCIR ’13, pages 1–10, New York, New York, USA, October 2013. ACM Press.

[42] Kalina Bontcheva and Dominic Rout. Making sense of social media streams through semantics: a survey. Semantic Web.

[43] D. Ramage, S. Dumais, and D. Liebling. Characterizing microblogs with topic models. 2010.

[44] Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. Named entity recognition in tweets: An experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’11, pages 1524–1534, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.

[45] Abhishek Gattani, Digvijay S. Lamba, Nikesh Garera, Mitul Tiwari, Xiaoyong Chai, Sanjib Das, Sri Subramaniam, Anand Rajaraman, Venky Harinarayan, and AnHai Doan. Entity extraction, linking, classification, and tagging for social media: A wikipedia-based approach. Proc. VLDB Endow., 6(11):1126–1137, August 2013.

[46] Jiwoon Jeon, Victor Lavrenko, and Raghavan Manmatha. Automatic image annotation and retrieval using cross-media relevance models. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 119–126. ACM, 2003.

[48] Jason Weston, Samy Bengio, and Nicolas Usunier. Wsabie: Scaling up to large vocabulary image annotation. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Volume Three, IJCAI’11, pages 2764–2770. AAAI Press, 2011.

[49] SL Feng, Raghavan Manmatha, and Victor Lavrenko. Multiple Bernoulli relevance models for image and video annotation. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–1002. IEEE, 2004.

[50] Hrishikesh Aradhye, George Toderici, and Jay Yagnik. Video2text: Learning to annotate video content. In Proceedings of the 2009 IEEE International Conference on Data Mining Workshops, ICDMW ’09, pages 144–151, Washington, DC, USA, 2009. IEEE Computer Society.

[51] Jeroen Vuurens, Arjen P de Vries, and Carsten Eickhoff. How much spam can you take? an analysis of crowdsourcing results to increase accuracy. In Proc. ACM SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR11), pages 21–26, 2011.

[52] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[53] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.

[54] Jonathan L. Herlocker, Joseph A. Konstan, and John Riedl. Explaining collaborative filtering recommendations. In CSCW, 2000.

[55] Vinod Krishnan, Pradeep Kumar Narayanashetty, Mukesh Nathan, Richard T. Davies, and Joseph A. Konstan. Who predicts better? In RecSys, 2008.

[56] Greg Little, Lydia B. Chilton, Max Goldman, and Robert C. Miller. TurKit. In HCOMP, 2009.

[57] Michael S. Bernstein, Greg Little, Robert C. Miller, Björn Hartmann, Mark S. Ackerman, David R. Karger, David Crowell, and Katrina Panovich. Soylent. In UIST, 2010.

[58] Aniket Kittur, Boris Smus, Susheel Khamkar, and Robert E. Kraut. CrowdForge. In UIST, 2011.

[59] Walter S Lasecki, Rachel Wesley, Jeffrey Nichols, Anand Kulkarni, James F Allen, and Jeffrey P Bigham. Chorus: A Crowd-powered Conversational Assistant. In UIST, 2013.

[60] Michael Nebeling, Maximilian Speicher, and Moira C Norrie. CrowdAdapt: Enabling Crowdsourced Web Page Adaptation for Individual Viewing Conditions and Preferences. In EICS, 2013.

[61] Haoqi Zhang, Edith Law, Rob Miller, Krzysztof Gajos, David Parkes, and Eric Horvitz. Human computation tasks with global constraints. In CHI, 2012.

[62] Anbang Xu, Shih-Wen Huang, and Brian Bailey. Voyant: Generating Structured Feedback on Visual Designs Using a Crowd of Non-experts. In CSCW, 2014.

[63] Anbang Xu, Huaming Rao, Steven P Dow, and Brian P Bailey. A Classroom Study of Using Crowd Feedback in the Iterative Design Process. In CSCW, 2015.

[64] Daniela Retelny, Sébastien Robaszkiewicz, Alexandra To, Walter S Lasecki, Jay Patel, Negar Rahmati, Tulsee Doshi, Melissa Valentine, and Michael S Bernstein. Expert Crowdsourcing with Flash Teams. In UIST, 2014.

[65] Peter Organisciak, Jaime Teevan, Susan Dumais, Robert C. Miller, and Adam Tauman Kalai. A Crowd of Your Own: Crowdsourcing for On-Demand Personalization. In HCOMP, 2014.

[66] Alexander Felfernig, Sarah Haas, Gerald Ninaus, Michael Schwarz, Thomas Ulz, Martin Stettinger, Klaus Isak, Michael Jeran, and Stefan Reiterer. RecTurk: Constraint-based Recommendation based on Human Computation. In Recsys 2014 - CrowdRec Workshop, 2014.

[67] Chinmay Kulkarni, Steven P. Dow, and Scott R. Klemmer. Early and repeated exposure to examples improves creative work. pages 49–62, 2014.

[68] Pao Siangliulue, Kenneth C Arnold, Krzysztof Z Gajos, and Steven P Dow. Toward collaborative ideation at scale: Leveraging ideas from others to generate more creative and diverse ideas. In CSCW, 2015.

[69] Steven M. Smith, Thomas B. Ward, and Jay S. Schumacher. Constraining effects of examples in a creative generation task. Memory and Cognition, 21(6):837–845, 1993.

[70] Shuo Chang, F. Maxwell Harper, and Loren Terveen. Using Groups of Items to Bootstrap New Users in Recommender Systems. In CSCW, 2015.

[71] Nava Tintarev and Judith Masthoff. Evaluating the effectiveness of explanations for recommender systems. User Modeling and User-Adapted Interaction, 22:399–439, 2012.

[72] Michael D. Ekstrand, F. Maxwell Harper, Martijn C. Willemsen, and Joseph A. Konstan. User perception of differences in recommender algorithms. In Recsys, 2014.

[73] Saúl Vargas and Pablo Castells. Rank and relevance in novelty and diversity metrics for recommender systems. Recsys, 2011.

[74] F. Maxwell Harper, Funing Xu, Harmanpreet Kaur, Kyle Condiff, Shuo Chang, and Loren Terveen. Putting Users in Control of their Recommendations. In RecSys, 2015.

[75] Jesse Vig, Shilad Sen, and John Riedl. Tagsplanations. In IUI, 2008.

[76] Rashmi Sinha and Kirsten Swearingen. The role of transparency in recommender systems. In CHI EA, 2002.

[77] Justin Cheng and Michael S. Bernstein. Flock: Hybrid crowd-machine learning classifiers. In CSCW, 2015.

[78] Nava Tintarev and Judith Masthoff. Recommender Systems Handbook, chapter Explaining Recommendations: Design and Evaluation, pages 353–382. Springer US, Boston, MA, 2015.

[79] Cynthia A. Thompson, Mehmet H. Göker, and Pat Langley. A personalized system for conversational recommendations. Journal of Artificial Intelligence Research, 21(1):393–428, March 2004.

[80] Weiquan Wang and Izak Benbasat. Recommendation agents for electronic commerce: Effects of explanation facilities on trusting beliefs. J. Manage. Inf. Syst., 23(4):217–246, May 2007.

[81] Anthony Jameson, Martijn C. Willemsen, Alexander Felfernig, Marco Gemmis, Pasquale Lops, Giovanni Semeraro, and Li Chen. Recommender Systems Handbook, chapter Human Decision Making and Recommender Systems, pages 611–648. Springer US, Boston, MA, 2015.

[82] F. Maxwell Harper, Daphne Raban, Sheizaf Rafaeli, and Joseph A. Konstan. Predictors of answer quality in online q&a sites. In CHI, 2008.

[83] Srikanta Bedathur, Klaus Berberich, Jens Dittrich, Nikos Mamoulis, and Gerhard Weikum. Interesting-phrase mining for ad-hoc text analytics. Proceedings of the VLDB Endowment, 3(1-2):1348–1357, 2010.

[84] Ahmed El-Kishky, Yanglei Song, Chi Wang, Clare R Voss, and Jiawei Han. Scalable topical phrase mining from text corpora. Proceedings of the VLDB Endowment, 8(3):305–316, 2014.

[85] Jialu Liu, Jingbo Shang, Chi Wang, Xiang Ren, and Jiawei Han. Mining quality phrases from massive text corpora. 2015.

[86] Shuo Chang, Peng Dai, Jilin Chen, and Ed H Chi. Got Many Labels?: Deriving Topic Labels from Multiple Sources for Social Media Posts using Crowdsourcing and Ensemble Learning. In WWW, 2015.

[87] Shuo Chang, Peng Dai, Lichan Hong, Sheng Cheng, Tianjiao Zhang, and Ed Chi. AppGrouper: Knowledge-based Interactive Clustering Tool for App Search Results. In IUI, 2016.

[88] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.

[89] Brendan J. Frey and Delbert Dueck. Clustering by passing messages between data points. Science, 315:972–976, 2007.

[90] Akash Das Sarma, Ayush Jain, Arnab Nandi, Aditya Parameswaran, and Jennifer Widom. Surpassing humans and computers with jellybean: Crowd-vision-hybrid counting algorithms. In Third AAAI Conference on Human Computation and Crowdsourcing, 2015.

[91] Gierad Laput, Walter S Lasecki, Jason Wiese, Robert Xiao, Jeffrey P Bigham, and Chris Harrison. Zensors: Adaptive, rapidly deployable, human-intelligent sensor feeds. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 1935–1944. ACM, 2015.

[92] Lydia B Chilton, Juho Kim, Paul André, Felicia Cordeiro, James A Landay, Daniel S Weld, Steven P Dow, Robert C Miller, and Haoqi Zhang. Frenzy: collaborative data organization for creating conference sessions. In Proceedings of the 32nd annual ACM conference on Human factors in computing systems, pages 1255–1264. ACM, 2014.

[93] Michael J Franklin, Donald Kossmann, Tim Kraska, Sukriti Ramesh, and Reynold Xin. Crowddb: answering queries with crowdsourcing. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pages 61–72. ACM, 2011.

[94] Aditya G Parameswaran, Hector Garcia-Molina, Hyunjung Park, Neoklis Polyzotis, Aditya Ramesh, and Jennifer Widom. Crowdscreen: Algorithms for filtering data with humans. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 361–372. ACM, 2012.

[95] Adam Marcus, Eugene Wu, David Karger, Samuel Madden, and Robert Miller. Human-powered sorts and joins. Proceedings of the VLDB Endowment, 5(1):13–24, 2011.

[96] Maxwell Forbes, Michael Jae-Yoon Chung, Maya Cakmak, and Rajesh PN Rao. Robot programming by demonstration with crowdsourced action fixes. In Second AAAI Conference on Human Computation and Crowdsourcing, 2014.

[97] Alexander Sorokin, Dmitry Berenson, Siddhartha S Srinivasa, and Martial Hebert. People helping robots helping people: Crowdsourcing for grasping novel objects. In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, pages 2117–2122. IEEE, 2010.

[98] Jeffrey P Bigham, Michael S Bernstein, and Eytan Adar. Human-computer interaction and collective intelligence. Handbook of Collective Intelligence, 57, 2015.

[99] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: bringing order to the web. 1999.

[100] Jon M Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5):604–632, 1999.

[101] Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Y Ng. Cheap and Fast—but is It Good?: Evaluating Non-expert Annotations for Natural Language Tasks. In EMNLP, 2008.

[102] Chris Callison-Burch and Mark Dredze. Creating speech and language data with amazon’s mechanical turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, pages 1–12. Association for Computational Linguistics, 2010.

[103] Ian McGraw, Chia-ying Lee, I Lee Hetherington, Stephanie Seneff, and James R Glass. Collecting voices from the cloud. In LREC, pages 1576–1583, 2010.

[104] CH Lin and DS Weld. To Re(label), or Not To Re(label). 2014.

Appendix A

Glossary and Acronyms

A.1 Additional Figures


Figure A.1: Interface for first step in three-step user survey that evaluates recommendation explanations.

Figure A.2: Interface for second step in three-step user survey that evaluates recommendation explanations.

Figure A.3: Interface for third step in three-step user survey that evaluates recommendation explanations.