Supervised and Active Learning for Recommender Systems

by

Laurent Charlin

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy, Graduate Department of Computer Science, University of Toronto

© Copyright 2014 by Laurent Charlin

Abstract

Supervised and Active Learning for Recommender Systems

Laurent Charlin
Doctor of Philosophy
Graduate Department of Computer Science
University of Toronto
2014

Traditional approaches to recommender systems have often focused on the collaborative filtering problem: using users’ past preferences in order to predict their future preferences. Although essential, rating prediction is only one of the components of a successful recommender system. One important problem is how to translate predicted ratings into actual recommendations. Furthermore, considering additional information either about users or items may offer substantial gains in performance while allowing the system to provide good recommendations to new users.

We develop machine learning methods in response to some of the limitations of current recommender-systems research. Specifically, we propose a three-stage framework to model recommender systems.

We first propose an elicitation step which serves as a way to collect user information beneficial to the recommendation task. In this thesis we frame the elicitation process as one of active learning. We develop several active elicitation methods which, unlike previous approaches that focus exclusively on improving the learning model, directly aim at improving the recommendation objective.

The second stage of our framework uses the elicited user information to inform models that predict user-item preferences. We focus on user-preference prediction for a document recommendation problem, for which we introduce a novel model over the space of user side-information, item (document) contents, and user-item preferences. Our model is able to smoothly trade off its usage of side information and of user-item preferences to make good document recommendations in both cold-start and non-cold-start data regimes.

The final step of our framework consists of the recommendation procedure. In particular, we focus on a matching instantiation and explore different natural matching objectives and constraints for the paper-to-reviewer matching problem. Further, we explore and analyze the synergy between the recommendation objective and the learning objective.

In all stages of our work we experimentally validate our models on a variety of datasets from different domains. Of particular interest are several datasets containing reviewer preferences about papers submitted to conferences. These datasets were collected using the Toronto Paper Matching System, a system we built to help conference organizers in the task of matching reviewers to submitted papers.

Acknowledgements

I am most indebted to my supervisors Richard Zemel and Craig Boutilier. Without their advice, their continued encouragement and their help and support this work would not have been possible. I am glad we made this co-supervision work. I am especially grateful to Rich, who, as the NIPS'10 program chair, provided the initial motivation and momentum behind this thesis. Furthermore, Rich's ideas, presence and experience were instrumental in our joint creation of the Toronto paper matching system. Throughout these projects I found a great mentor and I have been privileged to work closely with Rich. Rich has taught me a lot about how to pick and approach research problems as well as about how to model them. I am also very grateful to have been able to work with Craig. I have learned a lot from Craig. His vision, his curiosity and his scientific rigour are qualities that I strive for. Craig's ideas were also the ones that initially helped foster this research and his insights and ideas throughout have provided great balance to my work. Our interactions through COGS have further widened my research interests.

I would also like to thank my first mentor, Pascal Poupart, who showed me how exciting research could be and gave me some of the tools to succeed at it. My thanks also go to the members of my thesis committee, Sheila McIlraith and Geoffrey Hinton, for their precise comments and questions throughout my PhD. Geoff's enthusiasm and presence in the lab were also very motivating to me. I would also like to thank my external advisor, Andrew McCallum, for his apt comments regarding my work and for pointing out important and beneficial next steps. Finally, I am thankful to Ruslan Salakhutdinov and Anna Goldenberg for reading and commenting on the final copy of my thesis.

The constant support and love of Anne were also instrumental in undertaking and successfully finishing this PhD. Her reassuring words have helped me on many occasions. I am especially thankful for her ideas and her outlook on life, which she selflessly shares with me and from which I have learned so much. Further, I want to dedicate this thesis to Viviane, the next big project in our lives.

Although their involvement was more indirect, I learned a lot from postdocs who spent time in Toronto; specifically I want to thank Iain, Ryan, Marc'Aurelio and, of course, Hugo, who has become a good friend and collaborator. Finally, the machine learning group at Toronto was an extremely stimulating and pleasant place to work thanks to collaborators, close colleagues and friends: Kevin R., Jasper, Danny, Ilya, Kevin S., Jen, Fernando, Eric, Maks, Darius, Bowen, Charlie, Tijmen, Graham, John, Vlad, Deep, Nitish, George, Andriy, Tyler, Justin, Chris, Niail, Phil, and Geneviève. Special thanks to Kevin R., Jasper, Danny and Ilya for many interesting discussions about everything throughout our graduate years.

Contents

1 Introduction
   1.1 Recommender Systems
      1.1.1 Constrained Recommender Systems
   1.2 Contributions
   1.3 Outline

2 Background
   2.1 Preliminaries and Conventions
      2.1.1 Learning
   2.2 Preference Modelling and Predictions
      2.2.1 Collaborative Filtering
      2.2.2 CF for Recommendations
   2.3 Active Preference Collection
      2.3.1 Uncertainty Sampling
      2.3.2 Query by Committee
      2.3.3 Expected Model Change
      2.3.4 Expected Error Reduction
      2.3.5 Batch Queries
      2.3.6 Stopping Criteria
   2.4 Matching

3 Paper-to-Reviewer Matching
   3.1 Paper Matching System
      3.1.1 Overview of the System Framework
      3.1.2 Active Expertise Elicitation
      3.1.3 Software Architecture
   3.2 Learning and Testing the Model
      3.2.1 Initial Score Models
      3.2.2 Supervised Score-Prediction Models
      3.2.3 Evaluation
   3.3 Related Work
      3.3.1 Expertise Retrieval and Modelling
   3.4 Other Possible Applications
   3.5 Conclusion and Future Opportunities

4 Collaborative Filtering with Textual Side-Information
   4.1 Side Information in Collaborative Filtering
   4.2 Problem Definition
   4.3 Background
      4.3.1 Variational Inference in Topic Models
   4.4 Collaborative Score Topic Model (CSTM)
      4.4.1 The Relationship Between CSTM and Standard Models
      4.4.2 Learning and Inference
   4.5 Related Work
   4.6 Experiments
      4.6.1 Datasets
      4.6.2 Competing Models
      4.6.3 Results
   4.7 Conclusion and Future Opportunities

5 Learning and Matching in the Constrained Recommendation Framework
   5.1 Learning and Recommendations
   5.2 Matching Instantiation
      5.2.1 Matching Objectives
   5.3 Related Work on Matching Expert Users to Items
   5.4 Empirical Results
      5.4.1 Data
      5.4.2 Suitability Prediction Experimental Methodology
      5.4.3 Match Quality
      5.4.4 Transformed Matching and Learning
   5.5 Conclusion and Future Opportunities

6 Task-Directed Active Learning
   6.1 Related Work
   6.2 Active Learning for Match-Constrained Recommendation Problems
      6.2.1 Probabilistic Matching
      6.2.2 Matching as Inference in an Undirected Graphical Model
   6.3 Active Querying for Matching
   6.4 Experiments
      6.4.1 Data Sets
      6.4.2 Experimental Procedures
      6.4.3 Results
   6.5 Conclusion and Future Opportunities

7 Conclusion
   7.1 Summary
   7.2 Future Research Directions

Bibliography

List of Tables

3.1 Comparing the top ranks of word-LM and topic-LM

4.1 Modelling capabilities of the different score prediction models
4.2 Comparisons between CSTM and competitors for cold-start users
4.3 Test performance of CSTM and competitors on the unmodified ICML-12 dataset
4.4 Comparisons between CSTM and two variations

5.1 Overview of the matching/evaluation process
5.2 Comparison of the matching objective versus within-reviewer variance

List of Figures

1.1 Constrained recommendation framework

2.1 Graphical model representation of PMF and BPMF
2.2 Graphical model representation of a mixture model for collaborative filtering

3.1 A conference's typical workflow
3.2 High-level software architecture of the system
3.3 Histograms of score values for NIPS-10 and ICML-12
3.4 Comparison of top word-LM scores with top topic-LM scores

4.1 Preference prediction in the constrained recommendation framework
4.2 Graphical model representation of collaborative filtering with side information models
4.3 Graphical model representations of LDA and CTM
4.4 Graphical model representation of CSTM
4.5 Graphical model representation of CTR
4.6 Score histograms for NIPS-10, ICML-12, and Kobo
4.7 Test performance comparing CSTM and competitors across datasets
4.8 Learned parameters for NIPS-10 and ICML-12
4.9 Test performance on unmodified ICML-12 dataset
4.10 Test performance comparing CSTM and CTR on new users

5.1 Constrained recommendation framework
5.2 Match-constrained recommendation framework
5.3 Score histograms for NIPS-10 and NIPS-09
5.4 Performance on the matching task on the NIPS-10 dataset
5.5 Histogram of assignments by score value for the NIPS-10 dataset
5.6 Comparison of score assignment distributions
5.7 Histograms of number of papers matched per reviewer under soft constraints
5.8 Performance on the transformed matching objective on NIPS-10

6.1 Elicitation in the match-constrained recommendation framework
6.2 Comparison of matching methods on a "toy" example
6.3 Histograms of score values for Jokes and Dating datasets
6.4 Matching performance comparing different active learning strategies
6.5 Usage frequency of the fall-back query strategy of our active learning methods
6.6 Additional matching experiments comparing different active learning strategies

Chapter 1

Introduction

Easy access to large digital storage has enabled us to record data of interest, from documents of great scientific importance to videos of cats and hippos. The ease with which digital content can now be shared renders this content nearly instantly accessible to interested (online) users worldwide. Without some form of organization, for example through search capabilities, these data would rapidly lose the interest of users overwhelmed by this deluge of data. Hence, the tasks of organizing and analyzing immense quantities of data are of critical importance. On-line search engines, by enabling their users to search on-line records for relevant items, have been the original tools of choice for finding relevant data. Although a good search engine can distinguish relevant from irrelevant items given a specific query (for example, what movies are playing this weekend?), search engines lack the ability to determine the level of user interest within the set of relevant documents. In the last few years, recommender systems have become indispensable. Recommender systems enable their users to filter items based on individual preferences (for example, which weekend movies would I enjoy?). Recommender systems add so much value to search engines that major search engines now include a recommendation engine to personalize their results. For similar reasons, recommender systems have become an important topic of academic and corporate research.

1.1 Recommender Systems

Recommender systems are software systems that can recommend items, for example, scientific papers or movies, of interest to their users, for example, scientists or movie lovers. Although specifics will differ across real-world applications, recommender systems aim to help users in decision-making tasks [88]. Hence, recommender systems are useful in domains where it is infeasible or impractical for users to experience or provide input on every available item. For example, a user may require a recommender system to help in deciding which movie to attend at a film festival. To fulfil its duties, a recommender system must correctly model information about its users and items. A good recommender system should represent user preferences, but should also use other user information which may affect that user's immediate preferences, such as a user's current state of mind or his or her decision mechanism. In this respect, a recommender system behaves similarly to a user's friend who suggests an item of interest. Furthermore, a good recommender system should also analyze information about items which will be useful in determining user interests. A good recommender system for a user should therefore combine the insights of that user's friends with the knowledge of a domain expert [87].

Goldberg et al. [40] introduced the first practical recommender system at the end of the 1990s. Incidentally, the same authors also coined the term collaborative filtering to describe a system which uses the collective knowledge of all its users to more accurately infer the preferences of individual users. From that point, interest in both academia and the corporate world grew rapidly. Resnick and Varian [87] formalized some of the design parameters in recommender systems. These parameters included the type of user feedback (user evaluations of items), the various sources of preferences, and the aggregation of user feedback, as well as how to communicate recommendations to the users. The authors noted that "one of the richest areas for exploration is how to aggregate evaluations". The future proved them right because a plethora of work has since been carried out on this topic [88]. Furthermore, this central question also constitutes the core of this thesis. Creating a recommender system is a task which spans several research areas. As pointed out by Ricci et al. [88] in their recommender systems handbook: "Development of recommender systems is a multidisciplinary effort which involves experts from various fields such as artificial intelligence, human-computer interaction, information technology, data mining, adaptive user interfaces, decision-support systems, marketing, and consumer behaviour". This thesis considers the determination of user-item preferences, using information about users and items, as the central task faced by recommender systems. Machine learning and specifically prediction models and techniques offer a natural way of representing available information for predicting user-item preferences. Hence, machine learning models constitute the central component of recommender systems. Other aspects of recommender systems, for example, the interface between the system and its users, are auxiliary components which are built around the prediction mechanisms.

1.1.1 Constrained Recommender Systems

User preferences represent the main thrust behind a recommender system's suggestions. Therefore, optimizing the prediction of preferences is the natural goal of a recommender system. However, the system's designer may have additional objectives that he wishes to optimize, as well as other constraints that he may be required to satisfy. For example, an on-line movie store may want to consider its stock of each movie to ensure that only available movies are recommended. Depending on the exact constraints, such recommendations may be similar to or very different from the initial, constraint-free recommendations. The same on-line retailer may also wish to maximize some other function such as user happiness, or perhaps more realistically, his or her own long-term profit. Such objectives lead to a trade-off between an individual's preferences, the preferences of other users, and the objectives of the designers. Another common objective is that a recommender system may be inclined to take into account the diversity of its suggestions [78] (for example, as a way to hedge its bets against non-optimal recommendations). A recommender system may also use its recommendations as a means to further refine its user model, for example, by selecting a subset of recommended items using an "exploration strategy". In this context, it may be useful for the system to recommend items that will be most useful in refining its user model, somewhat independently of a system's (ultimate) objective to provide the best possible recommendations. The study of constrained recommender systems, and specifically the design of learning methods which are tailored to the interaction between the preference prediction objective and the final objective, is a primary focus of this thesis.

1.2 Contributions

The aim of this thesis is to develop machine learning methods tailored to recommender systems. Furthermore, we are interested in how such machine learning methods may lead to better user and item representations and, therefore, ultimately lead to improved recommendations. We propose to model the recommendation problem using a three-stage process which is elaborated in this thesis:

Preference Collection: When initially engaging with a recommender system, a new user must provide information related to his preferences. This information will be used by the recommender system to build a user model which paves the way to personalized recommendations. The preference collection phase represents an opportunity for the system to elicit user information actively. We frame the problem of selecting what to ask a user as an active learning problem. Such elicitation, by asking informative queries which quickly lead to a good user model, has the potential to mitigate the cold-start problem, a problem which refers to the difficulty a system faces in learning good models of users with few known preferences, such as new users. We propose several methods for performing active learning of user preferences over items for (match-constrained) recommender systems. Our main contribution, which leads to the empirical success we demonstrate, is a set of novel active learning methods that are sensitive to the matching objective, the ultimate objective of match-constrained recommender systems.

Predict missing preferences: The central task of a recommender system is to use elicited user information to predict user-item preferences. This means that the system must use available information to learn a predictive model of user-item preferences. Here, we propose several models to predict these missing user preferences. We focus on developing models which, in addition to user-item preferences, also model and leverage side-information, that is, features of users or items other than user preferences. We demonstrate empirically that this side-information can be beneficial in both cold-start and non-cold-start data regimes.

Preferences to recommendations: Using predicted preferences to suggest items to users is the goal of the recommender system. Going from preferences to suggestions can be as simple as selecting a subset of the most preferred items or providing users with a preference-ordered list of items. However, it can also involve solving a more complex optimization problem such as that imposed by a constrained recommender system. This stage therefore involves using user-item preferences, either user-elicited or predicted, as inputs to a recommendation procedure, for example a combinatorial optimization problem, which considers the final objective and constraints and determines the recommendations. In this thesis, we introduce and instantiate match-constrained recommender systems which involve matching users to items under constraints which globally restrict the set of possible matches. We explore several matching objectives and constraints and show, for certain objectives, a synergy between the final matching objective and the preference-prediction objective (the learning loss).
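To illustrate the difference between unconstrained and match-constrained recommendations, the short sketch below assigns items to users from a matrix of predicted preferences under a simple one-item-per-user, one-user-per-item constraint. The use of a generic linear assignment solver here is only an illustration of the general idea, not the formulation developed later in the thesis, and the numbers are made up.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Predicted preferences: rows are users, columns are items.
S_hat = np.array([[5.0, 4.0, 1.0],
                  [4.5, 4.0, 2.0],
                  [1.0, 2.0, 3.0]])

# Unconstrained recommendation: each user simply gets their top-scoring item
# (here users 0 and 1 both receive item 0).
print("unconstrained:", np.argmax(S_hat, axis=1))

# Match-constrained recommendation: each item may be assigned to at most one
# user, so the system maximizes the total score of a one-to-one assignment.
rows, cols = linear_sum_assignment(S_hat, maximize=True)
print("matched:", dict(zip(rows.tolist(), cols.tolist())))
```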

A flow chart depicting the three stages described above is shown in Figure 1.1. Although the first two stages directly benefit from machine-learning techniques, the third, which arises from practical considerations of recommender systems, provides an ultimate objective to be used to guide the development of appropriate machine-learning techniques. In other words, there is a synergy between the first two stages and the last stage. Specifically, the first stage's objective is to collect useful information from users.


Figure 1.1: Flow chart depicting the different components of our research. The first part represents elicitation, followed by missing-preference prediction and finally the recommendation procedure, F (). The recommendation procedure represents the ultimate objective of the system (for example, ranking, matching, per-user diversity, social-welfare maximization).

The usefulness of the information will be determined by the recommendation objective. The active learning strategies can therefore be guided by the recommendation objective. Furthermore, when learning models for missing-preference prediction, a model sensitive to the recommendation objective may focus on correctly predicting preferences that are more likely to be part of the recommender system's suggestions. The exact form of inter-stage interactions will be detailed in the appropriate chapters.

1.3 Outline

We will begin in Chapter 2 by defining the problem more formally, explaining some of the foundational principles behind our work and reviewing relevant previous research in the areas of preference modelling and prediction, active learning, and matching. Chapter 3 introduces the paper-to-reviewer matching problem as a practical problem that will be used to motivate and illustrate the various contributions throughout this thesis. This chapter also introduces and describes in detail the on-line software system that we have implemented and released to help conference organizers assign submitted papers to reviewers. Furthermore, this platform serves as a testbed for the developments in other chapters. Chapter 4 describes our work on the problem of preference prediction with textual side-information, which is the first research contribution of this thesis. We introduce a novel graphical model for personal recommendations of textual documents. The model's chief novelty lies in its learned model of individual libraries, or sets of documents, associated with each user. Overall, our model is a joint directed probabilistic model of user-item scores (ratings), and the textual side information in the user libraries and the items. Creating a generative description of scores and the text allows our model to perform well in a wide variety of data regimes, smoothly combining the side information with observed ratings as the number of ratings available for a given user ranges from none to many. We compare the model's performance on preference prediction with a variety of other models, including two methods used in our paper-to-reviewer matching system. Overall, our method compares favourably to the competing models, especially in cold-start data settings where user side-information is essential. We further show the benefits of modelling user side-information in an application for personal recommendations of posters to view at a conference.

In Chapter 5, we formally introduce our framework for optimizing constrained recommender systems and instantiate a match-constrained recommender system from it. We frame the matching or assignment problem as an integer program and propose several variations tailored to the paper-to-reviewer matching domain. Experiments on two data sets of recent conferences examine the performance of several learning methods as well as the effectiveness of the matching formulations. We show that we can obtain high-quality matches using our proposed framework. Finally, we explore how preference prediction and matching interact. Experimentally we show that matching can benefit from interacting with the learning objective when matching utility is a non-linear function of user preferences. Active learning methods to optimize match-constrained recommender systems are proposed in Chapter 6. Specifically, we develop several new active learning strategies that are sensitive to the specific matching objective. Further, we introduce a novel method for determining probabilistic matchings that accounts for the uncertainty of predicted preferences. This is important because active learning strategies are often guided by, or make use of, model uncertainty (for example, a common strategy is to pick queries that will most reduce the model's uncertainty). Experiments with real-world data sets spanning diverse domains compare our proposed methods to standard techniques based on the quality of the resulting matches as a function of the number of elicited user preferences. We demonstrate that match-sensitive active learning leads to higher-quality matches more quickly compared to standard active learning techniques. Finally, we provide concluding remarks and discuss opportunities for future work in Chapter 7. We also provide a brief discussion of the main future issues to be addressed by the field of recommender systems as a whole.

Chapter 2

Background

In this chapter, we review the literature pertaining to the three stages of recommendation systems studied in this thesis. We begin by introducing the notation that will be used throughout this thesis and reviewing some of the foundational concepts of machine learning. Then we review some of the previous research in preference prediction, focusing especially on side-information based and collaborative filtering methods. We also introduce the concept of active learning as a way to collect user preferences and review some of the relevant literature. Finally, we introduce and discuss the matching literature as an interesting example of a recommendation objective. Note that throughout this chapter, we survey some of the work that relates to this thesis as a whole. Work specifically related to individual chapters will be surveyed independently as part of the relevant chapter.

2.1 Preliminaries and Conventions

In our study of recommendation systems, we take user-item preferences to be the atomic and quintessential pieces of information used to represent a user's interest in a particular item. Our convention is to use the term preferences to denote a user's interest in an item, but we will also use it to denote a user's expertise with respect to an item. A user's expertise denotes his competence, or qualifications, with respect to a particular item and as such may differ from his preferences (for example, conference organizers may wish to assign submitted papers to the most expert reviewers regardless of the reviewer's actual interest in the paper). This distinction will usually be clear from context, although we will typically use the more specialized term score to denote expertise and rating to denote interest. Note that all models and methods developed in this thesis can readily deal with ratings or scores. We only consider cases where users explicitly express their preferences for items individually. For example, we do not consider preferences that could be expressed by providing a ranking of a group of items. We assume that user-item preferences are expressed using numeric (typically integer) values. Furthermore, we equate higher ratings (or scores) with a stronger preference (that is, users would prefer an item rated 5 over an item rated 1). Interestingly, the semantic meaning of the preferences is usually not specified in data sets, and therefore the value of a preference is usually taken to represent its utility. As a result, most of the work we survey considers the rating-prediction problem as a metric regression problem rather than an ordinal regression problem, and we have followed their lead.


We denote an individual user as $r$, the set of all users as $\mathcal{R}$, and the number of users as $R$ (that is, $R \equiv |\mathcal{R}|$). Similarly, individual items are denoted as $p$, the set of all items as $\mathcal{P}$, and the number of items as $P$. In our work, we will use the term users to designate the set of entities to which recommendations will be provided, regardless of whether they represent a person, a group of people, or other entities of interest. Similarly, items are entities to be recommended, regardless of their actual representation. Typical items in this thesis will be documents, jokes, and other humans. We denote the preference of user $r$ toward item $p$ as $s_{rp}$ (a score). It will also be useful to think of a preference as an $(r,p,s)$-triplet. In general, we denote matrices using uppercase letters (for example, $U$, $V$), vectors using bold-font lower-case letters (for example, $\mathbf{a}$, $\boldsymbol{\gamma}$), and scalars using lower-case letters (for example, $a$, $b$). Unless this introduces ambiguity, we also denote the size of sets using uppercase letters (for example, $N$ and $R$).

2.1.1 Learning

The central machine-learning task in this thesis is to model user interest in items to predict user-item preferences. We will assume that a subset of user-item preferences, possibly with side information, is always available to the system. Ways of obtaining such information will be discussed in Section 2.3. Our goal is therefore to predict the preferences of users for items that they have not yet rated. In other words, our goal is to use observed user-item preferences to learn a model that can predict missing user-item preferences. This prediction problem is framed as a supervised machine-learning problem. Accordingly, our aim is to minimize, for a suitable error or loss function, the expected loss of our preference prediction model. The expectation is taken with respect to a fixed but unknown data-generating distribution, where the generating distribution is a joint distribution over preferences and user-item pairs. Because in practice the true expected loss cannot be evaluated, it is customary to evaluate and report instead the empirical loss [124]. To evaluate the performance of a model, we will then use user-item preferences collected from users, otherwise known as ground-truth preferences. The empirical loss is the result of evaluating the loss function on the predicted preferences against the ground-truth preferences. For the purposes of empirical comparison, the available ground-truth preferences are divided into two disjoint sets: the training set, which the machine learning model is allowed to leverage, and the test set, which the method is not given access to and which can therefore be used to evaluate model performance. In addition, it can be useful to reserve additional data from the training set to constitute a validation set. The validation set can be used to evaluate model performance during the training stage. In particular, validation sets are often used to determine the value of hyper-parameters, which are parameters given as input to a learning model, and to prevent overfitting. A model is said to overfit if its training loss is much smaller than its test loss.
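To make these conventions concrete, the following sketch (illustrative only; the function names and the 70/10/20 split are assumptions, not taken from the thesis) splits a set of observed (r, p, s) triplets into training, validation, and test sets and reports the empirical squared loss of a prediction function on the held-out triplets.

```python
import numpy as np

def split_triplets(triplets, val_frac=0.1, test_frac=0.2, seed=0):
    """Split observed (r, p, s) triplets into disjoint train/validation/test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(triplets))
    n_test, n_val = int(test_frac * len(triplets)), int(val_frac * len(triplets))
    test = [triplets[i] for i in idx[:n_test]]
    val = [triplets[i] for i in idx[n_test:n_test + n_val]]
    train = [triplets[i] for i in idx[n_test + n_val:]]
    return train, val, test

def empirical_loss(triplets, predict):
    """Empirical squared loss of predict(r, p) against the ground-truth scores."""
    return float(np.mean([(s - predict(r, p)) ** 2 for r, p, s in triplets]))

# Toy usage: evaluate a constant predictor on the held-out test triplets.
triplets = [(0, 0, 4.0), (0, 1, 2.0), (1, 0, 5.0), (1, 2, 3.0), (2, 1, 1.0)]
train, val, test = split_triplets(triplets)
print(empirical_loss(test, lambda r, p: 3.0))
```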

2.2 Preference Modelling and Predictions

In the context of recommendation systems, two broad classes of preference prediction systems have been explored in the machine-learning literature: content-based systems and collaborative filtering systems. Both types of systems make use of user preferences for items. Content-based systems assume access to the content of the items and model content features, features derived from the content. As an example, in a book domain, the content of items includes the words of the books, and the corresponding word counts would be content features. On the other hand, collaborative filtering leverages only the preference similarities between users and/or items [40]. Instead of content-based, we prefer the more general term side information, which denotes all user, item, or user-item features except the user-item preferences themselves.

2.2.1 Collaborative Filtering

A simple example of collaborative filtering (CF) can be described as follows: imagine two users that have similar ratings for certain items. CF makes the assumption that these similarities extend across items rated by only one of the two users. Therefore, to learn a user's model, CF methods try to glean preference information from similar users (that is, users whose preferences for items rated in common are similar). Using this natural idea, researchers have built a wealth of models that have performed very well on several real-life data sets [72, 62]. These methods usually exploit CF by enabling the learning models to share parameters either across users or across items, or both. We view CF as having certain advantages over side-information-based methods:

1. CF leverages the very natural and powerful idea that a user’s preferences can be predicted using the preferences of similar users. Using other users’ preferences is especially attractive in today’s connected world where it has become easier to collect preferences for large numbers of users.

2. Because CF uses user preferences only, it is largely domain-independent and can also be used to model user preferences across different item domains. In contrast, side-information-based systems leverage the similarities contained in the features of users or items. Selecting features that are discriminative of user preferences may not be straightforward in certain domains. For example, in the movie domain, meta-data, such as a movie's genre, has not typically led to significant performance gains (see for example [66]).

CF also has certain limitations:

1. Because it leverages only the information contained in user preferences, CF is problematic when the preferences of a specific user or for a specific item are not available, a problem commonly known as the cold-start problem because it generally applies when users or items first enter the system. Generally, CF is effective in domains where collections of preferences from users for items are accessible, in particular when each user can rate multiple items and each item is rated by multiple users. For example, typical domains in which CF has been successful relate to everyday decisions, such as movie or restaurant recommendations, which entail easy access to large numbers of users and their preferences. By design, CF would typically perform worst in domains where obtaining many preferences from each user or for each item is difficult. Example domains include recommending houses, cars, or universities to which a high-school student should apply. In such domains, CF may still be useful once side information is available [19].

2. The fact that CF does not use domain-specific information also means that it does not make use of potentially useful domain information when it is available.

We find that the domain-independence of CF methods, as well as the apparent difficulty of leveraging side information in systems where many preferences are available, and their performance in practice [7] tip the balance in favour of using CF methods for recommendation systems. Hybrid systems that combine the advantages of both CF and side-information techniques offer interesting opportunities. Overall, the field of side-information-CF models has only begun to be explored by researchers. Because Chapter 4 will present several novel hybrid approaches, we will defer our discussion of previous research in this field until then.

Original Collaborative Filtering Models

The term collaborative filtering was first coined by researchers who used this technique to help users filter their emails by leveraging their colleagues' preferences [40]. This was soon followed by similar work using a neighbourhood, or memory-based, method aimed at filtering articles from netnews [86]. In model-free, or neighbourhood, approaches to CF, a prediction for a user-item pair is the weighted combination of the ratings given to that item by the user's neighbours. Alternatively, one could use the weighted combination of the ratings given by that user to neighbouring items. The weight of each neighbour is typically taken to be some similarity measure between the two users. Breese et al. [20] compare different similarity measures such as cosine similarity and the Pearson correlation coefficient, as well as several variations of these. In their experiments, they show that modifications to the latter yield the best results. The other main class of CF models is called model-based CF [20]. In this approach, parameters of a (probabilistic) model are learned using known ratings, and the model is then used to predict missing ratings. In terms of ratings prediction performance, the neighbourhood approaches are typically not as good as the best model-based approaches on real-world CF data sets. However, the fact that they are very computationally efficient renders them attractive for on-line recommendation applications (for example, those that need to update their recommendations according to newly acquired ratings in real time). Model-based approaches, in addition to providing stronger performance, are generally more principled. Among model-based approaches, there has been a push toward using probabilistic models (see the next two sections). Probabilistic models generally offer a wealth of advantages, such as uncertainty modelling, higher robustness to noise in the data, and a more natural way to encode additional information such as prior information available to the system or other features of items and users. By using probabilistic models, it is also possible to benefit from recent developments in these models, including advances in learning and inference techniques. In this document, we will focus mostly on model-based approaches, which have been favoured in most recent machine-learning work.
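To make the neighbourhood (memory-based) approach concrete, here is a minimal user-based sketch. The choice of Pearson correlation as the similarity measure and the mean-centred weighting are common choices in the literature, the helper names are our own, and this is not the exact formulation used in any of the cited works.

```python
import numpy as np

def pearson_sim(a, b):
    """Pearson correlation between two users, computed on co-rated items only."""
    mask = ~np.isnan(a) & ~np.isnan(b)
    if mask.sum() < 2:
        return 0.0
    x, y = a[mask] - a[mask].mean(), b[mask] - b[mask].mean()
    denom = np.sqrt((x ** 2).sum() * (y ** 2).sum())
    return float((x * y).sum() / denom) if denom > 0 else 0.0

def predict(S, r, p):
    """Predict user r's rating of item p as a similarity-weighted combination
    of the (mean-centred) ratings other users gave to item p."""
    num, den = 0.0, 0.0
    for other in range(S.shape[0]):
        if other == r or np.isnan(S[other, p]):
            continue
        w = pearson_sim(S[r], S[other])
        num += w * (S[other, p] - np.nanmean(S[other]))
        den += abs(w)
    base = np.nanmean(S[r])  # fall back to the user's mean rating
    return (base + num / den) if den > 0 else base

# Toy score matrix with missing entries encoded as NaN.
S = np.array([[5, 4, np.nan], [4, 5, 2], [1, 2, 5]], dtype=float)
print(predict(S, 0, 2))
```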

Matrix Factorization Models

User-item-score triplets of the form $(r,p,s)$ can be seen as the entries of a matrix, where the pair $(r,p)$ represents an index of the matrix and $s$ corresponds to the value at that index. We will denote this score matrix by $S$ and the triplet $(r,p,s)$ by $s_{rp} = s$. The dimension of the matrix is equal to the number of users $R$ by the number of items $P$ ($S \in \mathbb{R}^{R \times P}$). The unobserved triplets are encoded as missing entries in this matrix. The resulting matrix containing observed and unobserved triplets is therefore a sparse matrix. The goal of collaborative filtering is then to fill the matrix by estimating the missing entries of the matrix ($S^m$) given its observed entries ($S^o$). One natural way of doing so is to find a factorization of $S$, $S \approx U^T V$, with $U \in \mathbb{R}^{k \times R}$ and $V \in \mathbb{R}^{k \times P}$, with $k$ effectively determining the rank of $U^T V$ [111]. Then a missing value, $(r,p)$, can be recovered by multiplying the $r$-th column of $U$ with the $p$-th column of $V$: $u_r^T v_p$. When $S$ encodes user-item preferences, then $U$ could be seen as encoding user attitudes toward the item features encoded by $V$. Unless this introduces ambiguity, we simplify the notation by denoting user $r$'s factors, that is, $r$'s column of $U$, by $u_r$, and similarly item $p$'s factors, $p$'s column of $V$, by $v_p$.

For matrices without missing entries, a factorization can be recovered by finding the singular value decomposition (SVD) of the score matrix, which is a convex optimization problem. However, SVD cannot be used for matrices with missing entries [112]. Furthermore, CF applications typically involve predicting preferences for a large number of users and items given a small number of observed ratings (i.e., $S$ is sparse, $|S^o| \ll RP$). For that reason, estimating $(R^2 + RP + P^2)$ parameters would not be wise. A variety of methods have attempted to deal with these difficulties for CF. Srebro and Jaakkola [112] proposed to regularize the factorization by finding a low-rank approximation: $S \approx \hat{S}$, where $\mathrm{rank}(\hat{S}) = k \ll \min(R,P)$. Specifically, Srebro and Jaakkola [112] searched for $U$ and $V$ that would minimize the squared loss between the reconstruction and the observed ratings:

$$\sum_i \sum_j I_{ij}\,\bigl(s_{ij} - u_i^T v_j\bigr)^2. \qquad (2.1)$$

where $I_{ij} = 1$ if score $s_{ij}$ is observed and 0 otherwise. They also considered more elaborate cases where $I$ could be a real number corresponding to a weight, "for example in response to some estimate of the noise variance" [112]. They proposed both an iterative procedure, based on the fact that the objective becomes convex when conditioning on either $U$ or $V$, and an expectation-maximization (EM) procedure. Salakhutdinov and Mnih [99] further regularized the objective using the Frobenius norm of $U$ and $V$:

$$\sum_i \sum_j I_{ij}\,\bigl(s_{ij} - u_i^T v_j\bigr)^2 + \lambda_1 \|U\|_{\mathrm{Fro}} + \lambda_2 \|V\|_{\mathrm{Fro}}. \qquad (2.2)$$

$\|\cdot\|_{\mathrm{Fro}}$ denotes the Frobenius norm of a matrix and corresponds to the square root of the sum of squares of all elements of the matrix (it is the equivalent, for matrices, of the vector Euclidean norm). The minimum of this objective corresponds to the MAP solution of a probabilistic model with a Gaussian likelihood:

$$\Pr(S \mid U, V, \sigma^2) = \prod_i \prod_j \mathcal{N}(s_{ij} \mid u_i^T v_j, \sigma^2) \qquad (2.3)$$

and isotropic Gaussian priors over $U$ and $V$. This model is called probabilistic matrix factorization (PMF), and its graphical representation, by way of a Bayesian network, is shown in Figure 2.1(a). The authors show that although the problem is non-convex, performing gradient descent jointly on $U$ and $V$ yields very good performance on a large real-life data set such as the Netflix challenge data set (100 million ratings, over 480,000 users, and 17,000 items). Instead of approximating the posterior with the independent mode of $U$ and $V$, Lim and Teh [68] proposed to use a mean-field variational approximation which assumes an independent posterior over $U$ and $V$ ($\Pr(U, V \mid S^o) = \Pr(U \mid S^o)\Pr(V \mid S^o)$). Salakhutdinov and Mnih [98] further proposed Bayesian PMF, a fuller Bayesian extension of PMF, which involves setting hyper-priors over the parameters of the priors of $U$ and $V$ (see Figure 2.1(b)). The posterior can be approximated using a sampling approach. Specifically, they developed a Gibbs sampling approach that scales well enough to be applied to large data sets and which outperforms ordinary PMF. Instead of performing regularization by constraining the rank of $U$ and $V$, which would render the problem non-convex, Srebro et al. [113] proposed regularizing (only) the norm of the factors, $\|U\|_{\mathrm{Fro}}$ and $\|V\|_{\mathrm{Fro}}$, or equivalently the trace norm of $U^T V$ (that is, the sum of its singular values, denoted by $\|U^T V\|_\Sigma$). With certain loss functions (for example, the hinge loss), finding the optimal $U^T V$ with the trace-norm constraints is a max-margin learning problem, and hence the name of this model is max-margin matrix factorization (MMMF).


Figure 2.1: Graphical model representation of PMF (a) and BPMF (b). These figures show that Bayesian PMF is a PMF model with additional priors over U and V. Both figures are borrowed from [98].

To illustrate this, suppose, for ease of exposition, that $S$ is binary. Then maximizing $S(U^T V)$ with fixed $\|U^T V\|_\Sigma$ will find the solution, $(U^T V)$, that yields the largest distance between the hyperplane defined by the normal vector $(u_i^T v_j)$ and the ratings $\{s_{ij}\}$. Although this approach leads to a convex optimization problem, Srebro et al. [113] framed the optimization as a semi-definite program (SDP), and given current solvers, the optimization problem cannot be solved for more than a few thousand variables [129]. In their formulation, the number of variables is equal to the number of observed entries, meaning that this solution technique severely limits the scale of the applicable data sets. To combat this problem, Rennie and Srebro [85] proposed a faster variant of MMMF, in which they use a second-order gradient descent method, and bound the ranks of $U$ and $V$ (effectively solving the non-convex problem). Another way to understand this work is as follows. Start from Equation 2.2; then a max-margin approach reveals itself if the squared error term is replaced by a loss function that maximizes the margin between the hyperplanes defined by $u_i^T v_j$ and $s_{ij}\ \forall i,j$. In fact, with binary ratings, PMF can also be seen as a least-squares SVM [116]. As previously noted in matrix factorization techniques, the factors $U$ and $V$ can be thought of as user features and item features respectively. Of note is the fact that factorization models are symmetric with respect to users and items: transposing the observed score matrix does not alter the optimal solution.
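The regularized squared-loss objective of Equation 2.2 (with squared Frobenius norms, as in PMF) can be minimized by simple gradient descent on U and V. The sketch below is only an illustration with arbitrary hyper-parameter values; it is not the implementation used in any of the cited works.

```python
import numpy as np

def factorize(S, I, k=5, lam=0.1, lr=0.01, epochs=500, seed=0):
    """Gradient descent on sum_ij I_ij (s_ij - u_i^T v_j)^2 + lam (||U||_F^2 + ||V||_F^2)."""
    rng = np.random.default_rng(seed)
    R, P = S.shape
    U = 0.1 * rng.standard_normal((k, R))  # user factors (columns are u_r)
    V = 0.1 * rng.standard_normal((k, P))  # item factors (columns are v_p)
    for _ in range(epochs):
        E = I * (S - U.T @ V)              # residuals on observed entries only
        U += lr * (V @ E.T - lam * U)      # (scaled) gradient step for U
        V += lr * (U @ E - lam * V)        # (scaled) gradient step for V
    return U, V

# Toy example: a 3x4 score matrix; I marks the observed entries.
S = np.array([[5, 4, 0, 1], [4, 0, 0, 1], [1, 1, 0, 5]], dtype=float)
I = np.array([[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 0, 1]], dtype=float)
U, V = factorize(S, I)
print(np.round(U.T @ V, 2))  # predicted scores, including the missing entries
```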

Other Probabilistic Models

There has also been significant work on probabilistic models that are not based on factorizing the score matrix. Such models typically learn different sets of parameters for users and for items. The principle underlying most of these probabilistic models is to cluster users into a set of global user profiles. Profiles can be seen as user attitudes toward items. Once a user’s profile has been determined, that profile alone interacts, in some specific way, with an item’s representation to produce a rating. Some of this work originates in applying mixture models to the problem of CF [49]. The generative model is as follows:

$$\Pr(s_{ij} \mid u_i, v_j) = \sum_{z \in Z} \Pr(z_i = z \mid u_i)\,\Pr(S_{zj} = s;\ \mu_{zj}, \sigma_{zj}^2), \qquad (2.4)$$


Figure 2.2: Graphical representation of a mixture model proposed by Hofmann [49] to predict ratings.

where $Z$ is a latent random variable representing the different user profiles. $\Pr(S_{zj} = s;\ \mu_{zj}, \sigma_{zj}^2)$ is a Gaussian distribution with mean $\mu_{zj}$ and variance $\sigma_{zj}^2$. This mixture model's Bayesian-network representation is presented in Figure 2.2. Informally, the corresponding generative model first associates user $u_i$ with a profile $z$. Then that profile and the item of interest $v_j$ combine to set the parameters of a Gaussian, which in turn determines the value of the score for $(u_i, v_j)$.

The parameters to learn are the mixture weights for each user (that is, the distribution over the profiles) in addition to the mean and variance of a Gaussian distribution for every item and every profile. Several authors have proposed to use priors over possible user profiles [71, 48]. In addition to representing users as a distribution over a small set of profiles, Ross and Zemel [93] proposed to also further cluster items (MCVQ). Marlin and Zemel [73] model users as a set of binary factors (binary vectors), and factors are viewed as user attitudes. Each factor is associated with a multinomial distribution over ratings for each item. The main distinguishing feature of this approach is that the influences of the different factors, expressed through their multinomials over ratings, are combined multiplicatively to predict a rating for a particular item. This combination implies that factors can express no opinion about certain items (by having a uniform distribution over the item's ratings). Furthermore, the multiplicative combination also means that the distribution over a rating can be sharper than any of the factor distributions, something that is impossible in mixture models where distributions are averaged. Lawrence and Urtasun [66] proposed the use of a Gaussian process recovered by marginalizing out $U$ from the PMF formulation, Equation 2.3, assuming an isotropic Gaussian prior over $U$. The likelihood over ratings is then a zero-mean Gaussian with a special covariance structure:

$$P(S \mid V, \sigma^2, \alpha_w) = \prod_{j=1}^{N} \mathcal{N}\bigl(s_j \mid \mathbf{0},\ \alpha_w^{-1} V^T V + \sigma^2 I\bigr). \qquad (2.5)$$

One then optimizes for the parameters $V$ and the scalar parameters $\alpha_w$ and $\sigma$ using (stochastic) gradient descent. Here the parameters, $V$, form a covariance matrix that encodes similarities between items. To obtain a prediction for a user given the above model, one simply has to condition on the observed ratings of the user. Given the Gaussian nature of this model, a user's unobserved ratings will then be predicted by a weighted combination of the user's observed ratings. The weights of the combination are then a function of the variance between observed and unobserved items.
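Prediction in this model reduces to conditioning a multivariate Gaussian: with covariance $K = \alpha_w^{-1}V^TV + \sigma^2 I$ over a user's ratings, the unobserved entries are predicted from the observed ones by the usual conditional-mean formula. The sketch below uses made-up values for $V$, $\alpha_w$, and $\sigma^2$ purely to illustrate the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
k, P = 3, 6                          # latent dimension and number of items
V = rng.standard_normal((k, P))      # item factors (assumed given, not learned here)
alpha_w, sigma2 = 1.0, 0.1
K = (1.0 / alpha_w) * V.T @ V + sigma2 * np.eye(P)  # covariance over one user's ratings

obs = [0, 2, 3]                      # items the user has rated
miss = [1, 4, 5]                     # items whose ratings we predict
s_obs = np.array([4.0, 1.0, 3.0])    # the user's observed ratings

# Conditional mean of a zero-mean Gaussian: K_mo K_oo^{-1} s_obs,
# i.e., a weighted combination of the user's observed ratings.
K_oo = K[np.ix_(obs, obs)]
K_mo = K[np.ix_(miss, obs)]
s_pred = K_mo @ np.linalg.solve(K_oo, s_obs)
print(np.round(s_pred, 2))
```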

Salakhutdinov et al. [100] proposed to apply restricted Boltzmann machines (RBMs) to the collaborative filtering task. RBMs are a class of undirected graphical models, general enough to be applied, with few modifications, to problems in many different domains. For CF, Salakhutdinov et al. [100] showed that using RBMs enables efficient learning and inference that can scale to large problems (such as the Netflix data set) and can yield near state-of-the-art performance. Furthermore, RBM distributed representations combine active features multiplicatively, in a similar way and with the same advantages over mixture models as the model of Marlin and Zemel [73]. Recently, Georgiev and Nakov [38] have proposed a novel RBM-based model for CF. Without direct comparative experiments, it is difficult to differentiate among the methods described above in terms of performance. Matrix factorization techniques, which are typically very fast to train (and support effective inference), have been the most widely used in recent years. One lesson drawn from the large competition run by the on-line movie rental company Netflix is that ensemble methods, which are combinations of several different methods, are particularly well-suited for collaborative filtering problems. With this in mind, combinations of linear models such as matrix factorization techniques have been shown to be very effective [62, 7, 64].

Classical Evaluation Metrics

We now turn our attention to the performance metrics typically used to evaluate CF models. Two metrics are commonly used to compare the performance of CF methods. Both are decomposable over each single predicted rating. The first is the mean absolute error (MAE):

$$\mathrm{MAE} := \frac{\sum_{(r,p)\in S^o} |s_{rp} - \hat{s}_{rp}|}{|S^o|}, \qquad (2.6)$$

where $\hat{s}_{rp}$ is a learning method's estimation of $s_{rp}$ and $S^o$ is the set of user-item index pairs of interest (e.g., the pairs contained in a test set). Researchers have also used a normalized version of this method (NMAE), where MAE is normalized so that random guessing would yield an error value of one. Normalization enables comparison of errors across data sets that have different rating scales. The second metric is the mean squared error (MSE):

$$\mathrm{MSE} := \frac{\sum_{(r,p)\in S^o} (s_{rp} - \hat{s}_{rp})^2}{|S^o|}, \qquad (2.7)$$

and its square-root version $\mathrm{RMSE} := \sqrt{\mathrm{MSE}}$. The probabilistic methods introduced in the previous section do not directly optimize for these measures. Instead, they learn a maximum likelihood (ML) estimate of the parameters, a maximum a posteriori (MAP) estimate, or a full distribution over their parameters in the case of Bayesian methods such as Bayesian PMF. As outlined above, finding a MAP estimate is, under certain assumptions, equivalent to minimizing the squared loss of individual predictions (and hence the MSE). Once learning is done, probabilistic models can be used to infer a distribution over ratings. Given that distribution, different evaluation metrics call for different prediction procedures: the MSE is minimized by taking the expectation over possible rating values, while the MAE is optimized by predicting the median of the ratings distribution. One exception is in the case of MMMF, which does not have a direct probabilistic interpretation, but Srebro et al. [113] proposed to minimize a hinge loss which generalizes to a multi-class objective like MAE.
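The metrics of Equations 2.6 and 2.7 are straightforward to compute; a minimal sketch over aligned arrays of ground-truth and predicted scores follows (the NMAE normalization constant is left out because it depends on the rating scale of the particular data set).

```python
import numpy as np

def mae(s_true, s_pred):
    """Mean absolute error, Equation 2.6."""
    return float(np.mean(np.abs(s_true - s_pred)))

def mse(s_true, s_pred):
    """Mean squared error, Equation 2.7."""
    return float(np.mean((s_true - s_pred) ** 2))

def rmse(s_true, s_pred):
    """Root mean squared error."""
    return float(np.sqrt(mse(s_true, s_pred)))

# Toy evaluation on four held-out ratings.
s_true = np.array([5.0, 3.0, 1.0, 4.0])
s_pred = np.array([4.5, 3.5, 2.0, 4.0])
print(mae(s_true, s_pred), rmse(s_true, s_pred))
```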

The Missing At-Random Assumption

There is a general issue with the evaluation framework used in most CF research which may be problematic when deploying these models in the real world. Because the empirical loss is taken as a surrogate for the expected loss, one assumes that the data-generating distribution of both is the same; however, in practice, it usually is not. In other words, users generally do not rate items at random, and in fact, it has been shown empirically that users tend to bias their ratings toward items in which they are interested [75]. Positively biased ratings are likely due to users wanting to experience items that they believe they will enjoy a priori. Hence, learning models will be affected unless this bias is explicitly taken into account. Technically speaking, most models assume that data are missing at random, which is often incorrect. Marlin et al. [75] and Marlin and Zemel [74] proposed a series of models which explicitly account for data not missing at random. Using such models is preferred if data about items considered but not explicitly rated are available. In practice, such data are seldom available (although on-line merchants may have access to them). Other solutions try to reconcile the (possibly biased) data with the need to minimize the (unbiased) expected loss, but they also have certain drawbacks [126]. Although of great interest, such questions are somewhat tangential to our goals and therefore will not be directly addressed here.

2.2.2 CF for Recommendations

The astute reader will have noticed that although we claim that CF techniques belong to the realm of recommendation systems, neither the outlined CF methods nor their evaluation metrics offer a clear way of making item recommendations to users. In fact, many authors (see, for example, [72]) see the prediction of preferences as the first of two steps leading to personalized recommendations. The second step consists of sorting the (predicted) ratings. The top recommendation for a user is then the item with the highest predicted score. This two-step view of a recommendation system using collaborative filtering has two important shortcomings:

1. When the goal is simply to recommend the top items to users, the solution to the ratings estimation problem is also a solution to the ranking problem. However, Weimer et al. [128] pointed out that score-estimation models must learn to calibrate scores, thus making the score-estimation problem actually harder than estimating rankings.

2. As we have argued in Chapter 1, defining item recommendations by suggesting to users their predicted favourite items is only one example of a possible recommendation task. In practice other conditions and constraints may need to be satisfied:

(a) Certain domains may warrant the consideration of factors such as confidence in predictions. For example, in a mobile application where there are costs associated with recommendations, a system might prefer to use a more conservative strategy by, say, recommending an item for which it has a higher certainty of success.

(b) The availability of side information may introduce additional constraints. For example, if several items are to be recommended at once, criteria such as diversity might have to be taken into account (see, e.g., [78]).

(c) In addition to constraints affecting single users or items, there are possibly global constraints on users, items, or both. For example, certain items may be available only in limited quantities, and therefore these items may not be recommended to more than a certain number of users. More complex relationships may also exist between users and items, such as those that arise in the domain of paper-to-reviewer matching for scientific conferences (see Chapter 3 for a complete discussion).

In the cases described above, modifying the second step according to the recommendation task may not be optimal because a learning method trained for ratings prediction may not capture the constraints and aims of the recommendation system.

Instead of considering CF for recommendations as the product of two separate steps, where the first is unaware of the objectives of the second, we argue that the tasks should be more integrated in that the objective of collaborative filtering should be sensitive to the final objective. In fact, this line of reasoning will be present throughout this thesis. We review some existing methods on the topic below and defer a full discussion of our approaches to Chapter 5.

Collaborative Filtering in the Service of Another Task

In previous sections, we have reviewed various collaborative filtering techniques that all optimize rating predictions. We have also established that to offer recommendations, a system should be aware of the final recommendation task from the beginning. We mean by this that the learning procedure, at training time, should consider a loss reflective of the ultimate objective. One may object to this by pointing out that regardless of the final task, the learning method will be accurate if it is able to estimate perfectly the unobserved ratings. However, this is almost never the case in practice. Instead, integrating the final objective into the learning objective can be seen as indicating to the learning algorithm where to focus its attention (capacity) to maximize performance on the final task. One may wonder why this seemingly simple principle has not been more widely applied. First, as will be made clear in this section, the ultimate task objective is often harder to optimize than the ratings estimation task. Both RMSE and MAE conveniently decompose over single predictions and are continuous. Furthermore, although possibly sub-optimal, it may be more appealing to design a general CF method that can easily be applied to multiple different tasks rather than one with a specific task in mind. In this section, we outline some methods that have looked at integrating the preference estimation step and the recommendation step into one. We also present a few tasks for which collaborative filtering has been used and which may benefit from an integrated approach.

Most of the existing work on CF in the service of another task has proposed solutions for specific recommendation tasks as opposed to providing frameworks that can easily be adapted to different tasks. Jambor and Wang [56] stand out because they do not focus on a specific application, but rather propose a general framework that can be tailored to the requirements of various applications. Although the solution proposed in this paper does not constitute a definitive solution, it touches upon some of the ideas that we have put forward in the introduction to this section and that have been developed by others looking at specific applications. The main technical idea developed by Jambor and Wang [56] is that contrary to what RMSE and MAE do, not all errors are "created equal", and different types of errors should be weighted differently based on the objectives of the recommendation system. Accordingly, the authors use a loss function weighted by a function of the ground-truth rating and the prediction:

$$
\sum_{(r,p)\in S^o} w(\hat{s}_{rp}, s_{rp})\,(s_{rp} - \hat{s}_{rp})^2 \tag{2.8}
$$

These weights can either be fixed a priori or learned to minimize the system's loss function (for example, a ranking function). Let us look at two simple examples to demonstrate how the w's may be set for different objectives. For simplicity, we will assume that the ratings are binary, and we write the weights as w(prediction, ground truth), matching the argument order in Equation 2.8. In the first example, imagine that the objective is to maximize the system's precision, a reasonable objective when the goal is to recommend top items to users. In that case, predicting a zero for a rating with a ground-truth (GT) value of one will hurt the system less than predicting a one with a GT of zero. It is then sensible to set w(1, 0) > w(0, 1) in the above formula. In general, given that our objective is precision, underestimating a high rating is less costly than overestimating a low rating. Now imagine that instead of maximizing precision, we would like to maximize recall. Then it would be sensible for the learning method to incur a greater cost when incorrectly classifying a one versus incorrectly classifying a zero, and therefore w(0, 1) > w(1, 0).
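To make the weighting concrete, here is a minimal sketch of the weighted squared loss of Equation 2.8 for binary ratings; the function name, the 0.5 discretization threshold, and the weight value of 2.0 are illustrative choices, not specifics from Jambor and Wang [56].

```python
import numpy as np

def weighted_squared_loss(s_true, s_pred, w):
    """Weighted squared loss of Equation 2.8 over the observed ratings.

    s_true : array of ground-truth binary ratings.
    s_pred : array of predicted ratings (same length).
    w      : function w(prediction, ground_truth) -> error weight.
    """
    s_hat = (np.asarray(s_pred) >= 0.5).astype(int)   # discretize predictions for weighting
    weights = np.array([w(p, t) for p, t in zip(s_hat, s_true)])
    return float(np.sum(weights * (np.asarray(s_true) - np.asarray(s_pred)) ** 2))

# Precision-oriented weighting: overestimating a low rating (predict 1, truth 0)
# is costlier than underestimating a high one, so w(1, 0) > w(0, 1).
precision_w = lambda pred, truth: 2.0 if (pred, truth) == (1, 0) else 1.0

s_true = np.array([1, 0, 1, 0])
s_pred = np.array([0.2, 0.9, 0.8, 0.1])
print(weighted_squared_loss(s_true, s_pred, precision_w))
```

Swapping in a recall-oriented weight function simply reverses which error type receives the larger weight.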

Collaborative Ranking

The most natural recommendation task is to provide every user with a list of items ranked in order of (estimated) preferences or ratings. This list may not contain all items, but is rather a truncated list with only the top items for a given user. Naturally, then, optimizing a collaborative filtering system with a ranking objective has received the most attention from the community. Although many different ranking objectives exist, researchers in the learning-to-rank literature have recently focused mostly on the normalized discounted cumulative gain (NDCG) [57]:

$$
\mathrm{NDCG@}T(S, \pi) := \frac{1}{N} \sum_{i=1}^{T} \frac{2^{S_{\pi_i}} - 1}{\log(i + 2)} \tag{2.9}
$$

where S are the ground-truth scores and π denotes a permutation; that is, $S_{\pi_i}$ is the value of the rating at position $\pi_i$. π reflects the ordering of the predictions of the underlying learning model. The truncation parameter T corresponds to the number of items to be recommended to each user, and N is a normalization factor such that the possible values of NDCG lie between zero and one. Note that the denominator has the effect of weighting the higher-ranked items more strongly than the lower-ranked ones. Having T and the denominator in the NDCG objective represents two major differences compared to traditional collaborative filtering metrics and highlights the fact that, for top-T recommendations, learning with NDCG@T might be beneficial compared to learning with a traditional loss (e.g., RMSE). Moreover, NDCG as defined above considers the ranking of only one user, whereas in a multi-user setting, it is customary to report the performance of a method using the average NDCG across all users. A major challenge in this line of work is that NDCG is not a continuous function, but in fact is piecewise constant [128] because all ratings with the same label can be ranked in a number of different consistent orders without affecting the value of NDCG. For this reason, it is challenging to perform gradient-based optimization of NDCG. Weimer et al. [128] derived a convex lower bound on NDCG and maximized that instead. The underlying model predicting the ratings has the same form as PMF. Consequently, their approach can be understood as training PMF using the NDCG loss and is called CoFiRank.

The resulting optimization is still non-trivial because the evaluation of the lower bound is expensive, but using a bundle method [109] with an LP in its inner loop, Weimer et al. [128] were able to report results on large data sets such as the Netflix data set. In experiments on standard CF data sets, they compared their approach to MMMF, as well as their approach using an ordinal loss and a simple regression loss (RMSE). They showed that for most settings, training using a ranking loss (NDCG or ordinal regression) results in a significant performance gain over MMMF. In a subsequent paper, Weimer et al. [129] showed less convincing results; on a CF ranking task, their method trained using an ordinal loss is outperformed by the same method trained using a regression loss. The data sets used in the two papers differed, which may explain these differences. Balakrishnan and Chopra [4] proposed a two-stage procedure resembling the optimization of NDCG. First, much as in regular CF, a PMF model is learned. In the second stage, they use the latent variables inferred in PMF as features in a regression model using a hand-crafted loss which is meant to capture the main characteristics of NDCG. They show that jointly learning these two steps easily outperforms CoFiRank while providing a simpler optimization approach. Volkovs and Zemel [125] used a similar two-stage approach. They first extract user-item features using neighbour preferences. They then use these features inside a linear model trained using LambdaRank, a standard learning-to-rank approach [24]. This model delivered state-of-the-art performance and requires only 17 parameters, making learning and inference very fast.
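To make the metric concrete, here is a minimal sketch of NDCG@T as defined in Equation 2.9 for a single user, with the normalizer N computed from the ideal ordering of the ground-truth scores; ties in the predicted scores are broken arbitrarily, averaging over users is omitted, and the function names and toy numbers are illustrative.

```python
import numpy as np

def dcg(scores_in_rank_order):
    """sum_i (2^s_i - 1) / log(i + 2), with positions i = 1..T as in Eq. 2.9."""
    i = np.arange(1, len(scores_in_rank_order) + 1)
    return np.sum((2.0 ** scores_in_rank_order - 1.0) / np.log(i + 2.0))

def ndcg_at_t(s_true, s_pred, T):
    s_true = np.asarray(s_true, dtype=float)
    pred_order = np.argsort(-np.asarray(s_pred))[:T]   # permutation pi induced by the model
    ideal_order = np.argsort(-s_true)[:T]              # best achievable ordering (normalizer N)
    norm = dcg(s_true[ideal_order])
    return float(dcg(s_true[pred_order]) / norm) if norm > 0 else 0.0

# Example: 5 items, truncated at T = 3.
print(ndcg_at_t(s_true=[5, 3, 4, 1, 2], s_pred=[2.1, 0.3, 1.7, 0.2, 0.9], T=3))
```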

Shi et al. [107] define a simple user-specific probability model over item rankings. The distribution encodes the probability that a specific item p is ranked first for user r [25]:

$$
P(s_{rp}) = \frac{\exp(s_{rp})}{\sum_{p'} \exp(s_{rp'})}. \tag{2.10}
$$

A cross-entropy loss function is used to learn to match the ground-truth distribution (the distribution yielded by the above equation using the ground-truth ratings) with the model (PMF) distribution:

$$
\mathrm{Loss} := -\sum_{p \in S^o(r)} \frac{\exp(s_{rp})}{\sum_{p' \in S^o(r)} \exp(s_{rp'})} \log \frac{\exp\!\big(g(u_r^\top v_p)\big)}{\sum_{p' \in S^o(r)} \exp\!\big(g(u_r^\top v_{p'})\big)} \tag{2.11}
$$

where g(·) is a logistic function and $S^o(r)$ are the items rated by user r. They optimize the above function using gradient descent, alternately fixing U and optimizing V and vice versa. In experiments, this model's performance was generally significantly higher than CoFiRank's performance.
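A minimal sketch of the loss in Equation 2.11 for a single user, with g taken to be the logistic function; the alternating gradient-descent optimization over U and V is omitted, and the factor dimensions and toy values are arbitrary.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def listwise_ce_loss(s_obs, u_r, V_obs):
    """Cross-entropy between the ground-truth softmax over a user's rated items
    (Eq. 2.10 applied to the true ratings) and the model's softmax (Eq. 2.11).

    s_obs : observed ratings of user r on the items in V_obs.
    u_r   : latent factor vector of user r.
    V_obs : latent factors of the items rated by r (one row per item).
    """
    target = softmax(np.asarray(s_obs, dtype=float))
    logits = 1.0 / (1.0 + np.exp(-(V_obs @ u_r)))   # g() taken to be the logistic function
    model = softmax(logits)
    return float(-np.sum(target * np.log(model)))

# Toy example with 2-dimensional factors and 3 rated items.
u_r = np.array([0.5, -0.2])
V_obs = np.array([[1.0, 0.3], [-0.4, 0.8], [0.2, 0.1]])
print(listwise_ce_loss([5, 2, 3], u_r, V_obs))
```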

Finally, Liu and Yang [69] bypassed the difficulty of training a ranking loss by using a model-free CF approach. Here the similarity measure between users is the Kendall rank correlation coefficient, which measures similarity between the ranks of items rated by both users. They model ratings using pairwise preferences. A user's preference function over two items p and p', $\Psi_r(p, p')$, is positive if the user prefers p to p' ($s_{rp} > s_{rp'}$), negative in the contrary case, and zero if the user does not prefer one to the other ($s_{rp} = s_{rp'}$). Formally,

$$
\Psi_r(p, p') = \frac{\sum_{r' \in N_r^{p,p'}} \mathrm{sim}(r, r')\,(s_{r'p} - s_{r'p'})}{\sum_{r' \in N_r^{p,p'}} \mathrm{sim}(r, r')} \tag{2.12}
$$

where $N_r^{p,p'}$ is the set of neighbours of r that have rated both p and p', and $\mathrm{sim}(r, r')$ is the similarity between users r and r'. The optimal ranking for each user is then the one that maximizes the sum of the properly ordered pairwise preferences:

$$
\max_{\pi} \sum_{p,p' : \pi(p) > \pi(p')} \Psi_r(p, p'). \tag{2.13}
$$

Although the decision variant of this optimization problem is NP-complete [28], Liu and Yang [69] proposed two heuristics that are less computationally costly and that perform well in practice.

Various authors have also looked at how side information about items may affect recommendations. Jambor and Wang [55] studied ways to incorporate item-resource constraints. For example, a company might prefer not to recommend aggressively products for which its stock is low, or alternatively, it might want to recommend products with higher profit margins. They proposed to deal with such additional constraints once the ratings had been estimated by allowing the system to re-rank the output of the original CF method based on the additional constraints. They formulate this re-ranking as a convex optimization problem, where the corresponding constraints or an additional term in the objective were used to capture the requirements described above. They did not detail the exact technique used to solve this optimization problem, but depending on requirements, their problem can be cast either as a linear program or as a quadratic program, and both can be solved with off-the-shelf solvers (for example, CPLEX). Overall, relatively little work has been done on using CF in more complex recommendation applications. The handful of papers that have considered CF with a true recommendation perspective will be discussed in Chapter 5.
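A minimal sketch of the pairwise preference function of Equation 2.12, together with one simple greedy heuristic for the ranking objective in Equation 2.13 (not necessarily either of the two heuristics of Liu and Yang [69]); the data-structure choices, such as dicts keyed by (user, item) pairs, are illustrative.

```python
def psi(r, p, q, ratings, sim, neighbours):
    """Pairwise preference of user r for item p over item q (Equation 2.12).

    ratings    : dict mapping (user, item) -> observed rating.
    sim        : function sim(r, r') -> similarity between two users
                 (e.g., the Kendall rank correlation of their common items).
    neighbours : iterable of candidate neighbours of r.
    """
    num = den = 0.0
    for n in neighbours:
        if (n, p) in ratings and (n, q) in ratings:     # neighbour rated both items
            w = sim(r, n)
            num += w * (ratings[(n, p)] - ratings[(n, q)])
            den += w
    return num / den if den else 0.0

def greedy_rank(r, items, ratings, sim, neighbours):
    """Greedy heuristic for Eq. 2.13: repeatedly place the item with the largest
    total preference over the remaining items (exact optimization is NP-hard)."""
    remaining, ranking = list(items), []
    while remaining:
        best = max(remaining,
                   key=lambda p: sum(psi(r, p, q, ratings, sim, neighbours)
                                     for q in remaining if q != p))
        ranking.append(best)
        remaining.remove(best)
    return ranking
```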

2.3 Active Preference Collection

In the previous section, we assumed a typical supervised learning setting, where a set of labelled instances (ratings) is provided and we are looking for the set of parameters within a class of models that minimizes a certain loss function. Hence, we are assuming that the set of ratings is fixed, which is unnatural in many recommendation system domains. For example: a) any new system will first have to gather item preferences from users; b) typically, new users continually enter the system; and c) if the system is on-line, it is probable that existing users will be providing new ratings for previously unrated items. These events provide an opportunity for the system to gather extra preference data, or extra information indicative of user preferences, and thus to improve its performance. Moreover, if the system were able to specifically target valuable user preferences over items, these could help the system achieve greater increases in performance more quickly. The process of selecting which labels or ratings to query falls into the category of active learning. This is a learning paradigm which is often motivated in contexts where the cost of acquiring labelled data is high [11]. The aim of active learning is to select the queries that will improve the performance of the learning model most quickly. Active learning methods are often evaluated by comparing their performance, after each new query or after a set of queries, to that of passive methods, which select queries at random.

Considering that relatively little work has been done on active learning directly aimed at recommendation systems, this section will give a brief overview of active learning methods in general. Whenever work related to recommendation systems does exist, we will highlight it, and we will also emphasize the merits of the various techniques with respect to their applicability and possible effectiveness for recommendation applications. We will focus on four main query strategy frameworks: 1) uncertainty sampling, 2) query-by-committee, 3) expected model change, and 4) expected error reduction [102]. Apart from papers related to recommendation systems, most of the material presented in this section originates from the excellent survey of Settles [102] and references therein. In what follows, we will denote the set of available ratings as $S^o$ and the set of unobserved, or missing, ratings as $S^u$. The goal will then be to query ratings from $S^u$ using a particular query strategy. Furthermore, because our focus is on the recommendation domain, we will assume that the set of possible queries, which consists of unobserved user-item pairs, is known a priori. At each time step, the goal is then to pick the best possible query, according to a specific query strategy, among all possible queries. Once a user has responded to a query, his response can be used to re-train the learning model. After that, we can proceed to query selection for the next time step.

2.3.1 Uncertainty Sampling

In uncertainty sampling, the goal is to pick the query that most reduces the uncertainty of the model's predictions [67]. The querying method can be seen as sampling examples, hence its name. It is therefore usually assumed that the model's posterior probability over the elements of $S^u$ can be evaluated. Given a posterior distribution over ratings, one can use the uncertainty of the model in several different ways. For example, one may pick the query about which the model is least confident, where confidence is, for example, the probability of the mode of the posterior distribution:

$$
\max_{(r,p)\in S^u} \big(1 - \Pr(\hat{s}_{rp} \mid S^o, \theta)\big) \tag{2.14}
$$

where $\hat{s}_{rp}$ denotes the mode of that posterior distribution (i.e., $\hat{s}_{rp} = \arg\max_s \Pr(s_{rp} = s \mid S^o, \theta)$). Perhaps a more pleasing criterion would be to look at the whole distribution through its entropy rather than its mode:

$$
\max_{(r,p)\in S^u} \; -\sum_s \Pr(s_{rp} = s \mid S^o, \theta) \log \Pr(s_{rp} = s \mid S^o, \theta) \tag{2.15}
$$

The intuition is that entropy-based uncertainty sampling should lead to a better model of the full posterior distribution over ratings, whereas the previous approach (or variants that consider the margin between the top predictions) might be better at discriminating between the most likely class and the others [102]. In practice, the performance of the different methods seems to be application-dependent [101, 104]. For regression problems, the same intuitions as above can be applied, where the uncertainty of the predictive distribution is simply the variance of its predictions [102].

Rubens and Sugiyama [96] proposed a related approach where, instead of querying the example with the most uncertainty, they query the example that most reduces the expected uncertainty over all items. They applied this idea to CF and showed that on a particular subset of a CF data set (one of the MovieLens data sets), this approach outperforms both a variance-based and an entropy-based strategy, with both of the latter performing equally to or worse than a random strategy. The authors compared the performance of the different methods when very little information is known about each user (between 1 and 10 items). Other authors have also applied these ideas to CF. Jin and Si [58] used the relative entropy between the posterior distribution over parameters and the updated posterior given a possible query and its response to guide their query selection. Using the same model, Harpale and Yang [45] observed that in many domains, it is unreasonable to ask a user to rate an arbitrary item because it is unlikely that a user can easily access all items (for example, users might not be willing to watch any movie selected by the system). The authors proposed that the system's decision should be based on how likely an item is to be rated, approximated by a quantity similar to the item's popularity among users with profiles similar to the user under elicitation.

Authors have also used uncertainty minimization strategies with non-probabilistic models. For example, in CF, Rish and Tesauro [90] looked at using MMMF and defined uncertainty to be proportional to the distance to the decision boundary, or in other words, the distance in feature space between the point and the hyperplane in a max-margin model. The next query is the one that is closest to the separating hyperplane. Experimentally, the authors showed that this approach outperformed other heuristics, namely random and maximum-uncertainty query selection. One theoretical problem with this approach is that it suffers from sampling bias [30]. In other words, because the queries are chosen with respect to the uncertainty of the current classifier, it is possible that some regions of the input space will be completely ignored, even though all regions have some weight under the true data distribution, leading to poor generalization performance. Sampling bias is a problem that affects most active learning methods unless special precautions are taken [30]. In general, uncertainty sampling approaches are computationally efficient even though they require the evaluation of all $O(RP)$ possible queries.
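A minimal sketch of the entropy criterion of Equation 2.15: among the candidate user-item pairs, query the one whose predictive distribution has the highest entropy. The posterior is assumed to be supplied by the underlying CF model; the toy distributions below are illustrative.

```python
import numpy as np

def entropy_query(posterior, unobserved):
    """Entropy-based uncertainty sampling: pick the unobserved user-item pair
    whose predictive distribution has maximum entropy.

    posterior  : function (r, p) -> array of probabilities over rating values.
    unobserved : iterable of candidate (r, p) pairs.
    """
    def entropy(rp):
        probs = np.asarray(posterior(*rp))
        probs = probs[probs > 0]                       # avoid log(0)
        return -np.sum(probs * np.log(probs))
    return max(unobserved, key=entropy)

# Toy posteriors over 5 rating values for three candidate queries.
post = {(0, 1): [0.7, 0.1, 0.1, 0.05, 0.05],
        (0, 2): [0.2, 0.2, 0.2, 0.2, 0.2],
        (1, 0): [0.5, 0.3, 0.1, 0.05, 0.05]}
print(entropy_query(lambda r, p: post[(r, p)], post.keys()))   # -> (0, 2), the most uncertain
```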

2.3.2 Query by Committee

The idea of the query-by-committee framework is to keep a committee [105], that is, a set of hypotheses from one or multiple models, trained on and consistent with $S^o$. In classification tasks, such a set of hypotheses is called a version space. The instance to be queried is then the one on which these various hypotheses disagree the most. Several definitions of disagreement are possible; for example, one may use the average Kullback-Leibler divergence between the distribution over ratings of each hypothesis and that of the committee as a whole. One obvious limitation of these approaches is that one must be able to generate multiple hypotheses that are all consistent with the training set. There have been a number of recent developments in this field that try to circumvent this limitation. In fact, these developments have led their proponents to declare "theoretical victory" over active learning.1 Query-by-committee can also be extended to regression problems, which are usually the domain of CF approaches, although the notion of version space does not apply any more [102]. Instead, one can measure the variance between the predictions of the various committee members.

Compared to uncertainty sampling, query-by-committee attempts to choose queries that reduce uncertainty over both model predictions and model parameters. The computational efficiency of this strategy will depend on the procedure for obtaining the set of hypotheses forming the committee. Although we are not aware of any work that has applied the query-by-committee framework in either CF or recommendation systems, one could imagine training a set of Bayesian PMF models with different initializations (or different hyperparameters), or alternatively a set of PMF models with uncertainty defined by how close a predicted rating is to the threshold between different rating values. Overall, as noted in Section 2.2.1, ensemble methods (combinations of different methods, where each method ends up learning a specific aspect of users and items) have been shown to do extremely well in collaborative filtering problems [7]. It remains to be seen whether this methodology can also be useful for active learning.

1 http://hunch.net/?p=1800
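A minimal sketch of the variance-based disagreement measure mentioned above for a regression-style committee; the committee here is a set of arbitrary toy predictors, standing in, for example, for differently-initialized PMF models.

```python
import numpy as np

def qbc_query(committee, unobserved):
    """Query-by-committee for a regression-style CF setting: query the user-item
    pair on which the committee's predictions have the largest variance.

    committee  : list of models, each a function (r, p) -> predicted rating.
    unobserved : iterable of candidate (r, p) pairs.
    """
    def disagreement(rp):
        preds = np.array([m(*rp) for m in committee])
        return preds.var()
    return max(unobserved, key=disagreement)

# Toy committee of three predictors that differ in a single weight.
committee = [lambda r, p, w=w: w * r - p for w in (0.5, 1.0, 1.5)]
candidates = [(0, 1), (2, 1), (4, 3)]
print(qbc_query(committee, candidates))   # pair with the most spread-out predictions
```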

2.3.3 Expected Model Change

The goal in expected-model-change approaches is to find the query that will maximize the (expected) model change. A simple criterion for monitoring model change is to evaluate the norm of the difference of the parameter vectors (e.g., $\|\theta - \theta^{s_{rp}}\|$), where θ is the current model's parameter vector and $\theta^{s_{rp}}$ is the updated model's parameter vector after querying $s_{rp} = s$. Settles et al. [103] proposed to exploit this idea using the expected gradient of the loss with respect to the model parameters (θ):

$$
\sum_s P(s_{rp} = s \mid S^o, \theta)\, \big\|\nabla L(S^o \cup s_{rp} = s)\big\| \tag{2.16}
$$

where $\nabla L(S^o \cup s_{rp} = s)$ is the gradient of the loss function L with respect to the parameter vector, evaluated on the observed scores together with the proposed query $s_{rp}$. Because the learning method is typically re-trained after each query, the gradient can be approximated by $\nabla L(s_{rp} = s)$. The chosen query is then the one that maximizes Equation 2.16. This query strategy is called expected gradient length (EGL). It is important to note that in contrast to other approaches, EGL does not directly reduce either model uncertainty or generalization error, but rather picks the query with the "greatest impact on the [model's] parameters" [102]. This strategy can be applied only to models that can be trained using a gradient method (which is the case for PMF, but not for all other probabilistic methods). Compared to other approaches, the efficiency of this method will be largely dependent on the cost of evaluating the model gradients. Furthermore, because outliers will greatly affect the model parameters, this strategy would be particularly prone to failure when using an uninformative model of uncertainty.
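A minimal sketch of the EGL criterion of Equation 2.16; the model-specific pieces (the predictive probability p_response and the per-example gradient grad_loss) are assumed to be supplied by the underlying learner, so the function below is only the query-selection shell.

```python
import numpy as np

def egl_query(p_response, grad_loss, rating_values, unobserved):
    """Expected gradient length (Eq. 2.16): pick the query whose expected
    gradient norm, under the model's predictive distribution, is largest.

    p_response    : function (r, p, s) -> model probability that s_rp = s.
    grad_loss     : function (r, p, s) -> gradient of the loss w.r.t. the
                    parameters if the (hypothetical) example s_rp = s were added.
    rating_values : possible rating values.
    unobserved    : iterable of candidate (r, p) pairs.
    """
    def expected_norm(rp):
        r, p = rp
        return sum(p_response(r, p, s) * np.linalg.norm(grad_loss(r, p, s))
                   for s in rating_values)
    return max(unobserved, key=expected_norm)
```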

2.3.4 Expected Error Reduction

Perhaps the most natural strategy is to choose the query that will most reduce, in expectation, the error on the final objective. Recall that the usual method for evaluating the effectiveness of a querying strategy is to compare the model learned using active learning with one learned without it (by so-called passive learning). Therefore, it makes sense to optimize for the true objective. One approach described by Settles [102] minimizes the expected 0/1 loss:

$$
\sum_s \Pr(s_{rp} = s \mid S^o, \theta) \left[ \sum_{(r',p') \in S^u \setminus s_{rp}} \Big( 1 - \max_{s'} \Pr_{\theta^{s_{rp}}}\!\big(\hat{s}_{r'p'} = s' \mid S^o \cup s_{rp} = s\big) \Big) \right] \tag{2.17}
$$

where $\Pr_{\theta^{s_{rp}}}(\cdot)$ denotes the distribution over unlabelled ratings when the model with parameter vector θ receives an additional training example $s_{rp} = s$. What is surprising about the above formulation is that it is not actually calculating the error, but rather the change in confidence over the unlabelled data. In a sense, it is using the unlabelled data as a validation set. Instead, it would seem to make sense to look at the expected decrease in error on a labelled validation set (that is, a labelled set disjoint from the training set, with $S^o = \{S^{\mathrm{train}} \cup S^{\mathrm{validation}}\}$). The obvious problem with using a validation set is that in an active learning setting, labelled data are often scarce (at least at the beginning of the elicitation), and therefore there might not be enough data to create a meaningful validation set.

Importantly, this framework can be adapted to use any objective (loss) function. The strategy can therefore be leveraged both in collaborative filtering with the usual loss functions and when using CF for another task. One limitation is that because the framework requires retraining the model for each possible value of each possible query, it can be very computationally expensive. Some work has been done using this strategy for ranking applications. Arens [3] used a ranking SVM and addressed the problem of ranking relevant papers for users. The queries consist of asking a user whether the queried document is "definitely relevant", "possibly relevant", or "not relevant". They experimented with two querying strategies. The first queries the highest (predicted) ranked document that is unlabelled. The second strategy queries the most uncertain documents, defined as those in the middle of the ranking; the argument is that the learner will rank the worst and best documents with confidence and is therefore more uncertain about documents in the middle. Their results indicate that the first strategy, querying the highest-ranked documents, outperforms the second strategy (and random querying).

EVOI

The decision-theoretically optimal way of evaluating queries is to use the expected value of information (EVOI) [50]. At the heart of EVOI is a utility function, a problem-dependent objective which assigns a number, a utility, to possible outcomes (for example, to recommendations). The optimal query according to EVOI is the one which maximizes expected utility. The expectation is taken with respect to the distribution over query responses (for example, possible score values). In practice, it is common to use the learning model's distribution over possible responses (for example, the CF model's). We can better understand EVOI by looking at an example. Boutilier et al. [18] studied a setting where the goal is to recommend one item at a time to a user. The utility of making such a recommendation is defined to be the rating value of the recommended item (or its expectation according to the model of unobserved ratings). To select a query, one evaluates each possible query and selects the one with maximum myopic EVOI, that is, the query that yields the maximum difference in (expected) value over all items:

$$
EVOI(s_{rp}, \theta^{S^o}) = \sum_s \Pr(s_{rp} = s \mid S^o, \theta^{S^o})\, V\!\big(S^o \cup s_{rp} = s, \theta^{S^o \cup s_{rp}}\big) - V\!\big(S^o, \theta^{S^o}\big).
$$

$$
V\!\big(S^o, \theta^{S^o}\big) = \max_{(r,p)\in S^u} \sum_s \Pr(s_{rp} = s \mid S^o, \theta^{S^o})\, s, \tag{2.18}
$$

$$
V\!\big(S^o \cup s_{r'p'} = s', \theta^{S^o \cup s_{r'p'}}\big) = \max_{(r,p)\in S^u \setminus (r',p')} \sum_s \Pr\!\big(s_{rp} = s \mid S^o \cup s_{r'p'} = s', \theta^{S^o \cup s_{r'p'}}\big)\, s, \tag{2.19}
$$

where $\Pr(\cdot \mid S^o, \theta^{S^o})$ is the model's posterior over scores for user-item pair (r, p) when $S^o$ is observed. It is important to note that EVOI can be used with any model that can infer a distribution over unobserved ratings. In Boutilier et al. [18], the MCVQ model is used (see Section 2.2.1).

Calculating myopic EVOI involves two sources of costly computations. First, given a user, the computation of the posteriors over every possible rating value must be performed for all queries (that is, the set of unobserved items). Equations 2.18 and 2.19 say that the value of a belief is equal to the maximum expected unobserved rating given that belief. Boutilier et al. [18] note that given the current item with highest (mean) score p* and the previous expected scores over all other items (Equation 2.18), one can calculate which subset of items might have their mean score increased enough to become the item with maximal value according to Equation 2.19. Accordingly, they bound the impact that a query about a particular item can have on the mean predicted score of all other items. This restricts the number of item posteriors which must be re-computed at each step. Furthermore, these bounds are calculated in a user-independent way and off-line. This same procedure can be applied to any learning model, although the exact form of the bound is specific to MCVQ. Second, the potentially large number of possible queries is also computationally problematic. Boutilier et al. [18] suggest the use of prototype queries: a small set of queries such that any query is within some ε, in terms of EVOI value, of a prototype query. Experimentally, the authors showed that even when using less than 40% of all queries, the system outperforms passive learning and is close to the optimal (myopic-)EVOI performance. Calculations of non-myopic EVOI, which involve computing the optimal querying sequence, are even more costly because the effect of each query with respect to all possible future queries must be evaluated.

Other work of note has proposed a similar framework for a memory-based collaborative filtering model [132]. Notably, they propose that the active learning procedure only query items that the user is likely to have already consumed (for example, movies that the user is likely to have already watched), based on what similar users have consumed. The rationale is that such items will be easier to score. This method has the additional benefit of pruning the item search space.
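A minimal sketch of the myopic EVOI computation of Equations 2.18 and 2.19, evaluating every candidate query by brute force (i.e., without the pruning bounds or prototype queries of Boutilier et al. [18]); the posterior and updated_posterior functions are assumed to come from the underlying CF model.

```python
def myopic_evoi(query, posterior, updated_posterior, unobserved, rating_values):
    """Myopic EVOI of a single candidate query, following Eqs. 2.18-2.19.

    posterior         : function (r, p) pair -> dict {rating value: probability} under S^o.
    updated_posterior : function ((r, p), query, s) -> same, after observing query = s.
    """
    def value(post, candidates):
        # Value of a belief state: maximum expected rating over the given pairs.
        return max(sum(prob * s for s, prob in post(rp).items()) for rp in candidates)

    remaining = [rp for rp in unobserved if rp != query]
    current_value = value(posterior, unobserved)
    expected_value = 0.0
    for s in rating_values:
        prob_s = posterior(query).get(s, 0.0)
        expected_value += prob_s * value(lambda rp: updated_posterior(rp, query, s), remaining)
    return expected_value - current_value

def best_query(posterior, updated_posterior, unobserved, rating_values):
    # Brute-force selection of the query with maximum myopic EVOI.
    return max(unobserved, key=lambda q: myopic_evoi(q, posterior, updated_posterior,
                                                     unobserved, rating_values))
```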

2.3.5 Batch Queries

Until now, we have assumed that we are picking queries one at a time in a greedy manner, considering only the immediate myopic effect of the query on the model. This might not be optimal because it is possible that a query's value will reveal itself only after several more rounds of elicitation. A related idea is that of batch queries. Instead of selecting and obtaining the result of one query at a time, we might consider finding optimal sets of queries. Batch querying may be necessary when a system simply does not have the computational resources to find optimal queries on-line, and therefore a set of queries must be calculated off-line [102]. Alternatively, this approach can be motivated in recommendation systems as a more natural mode of interaction for users. For example, imagine a domain where rating a product involves performing an action off-line; it may then be easier for users to rate multiple products at a time. Another motivation for batch active learning is when parallel labellers are available [44]. A simple strategy for selecting a set of N queries would simply be to pick the top-N queries greedily according to one of the active-learning strategies already discussed. Such a strategy will typically not be optimal because the selection process does not consider the information gained by earlier queries in the batch. Using EVOI, an optimal strategy for batch querying would be to consider the value of all possible sets of queries (of a given length). The combinatorial nature of such calculations makes this approach impractical for most real-world problems. Several authors have looked at constructing optimal sets of queries according to some less general criteria. In particular, several authors have proposed considering diversity as a reasonable criterion for batch-query selection [21, 131]. Recently, Guo [43] proposed the selection of the instances which maximize the mutual information between the labelled and unlabelled sets. The intuition underlying this is that, for a learning method to generalize, the labelled set should be representative of the unlabelled set. We are unaware of previously published work in CF that has looked at asking batches of queries. We develop a greedy approach to this problem in Chapter 6.
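A minimal sketch of the naive greedy top-N selection described above, together with a simple diversity-penalized variant in the spirit of the diversity criterion of [21, 131]; the trade-off parameter lam and the min-dissimilarity bonus are illustrative choices, not the specific methods of those papers.

```python
def greedy_batch(score, unobserved, n):
    """Naive batch selection: take the top-n queries according to a single-query
    criterion (e.g., entropy or EVOI), ignoring interactions within the batch."""
    return sorted(unobserved, key=score, reverse=True)[:n]

def diverse_batch(score, dissimilarity, unobserved, n, lam=0.5):
    """Greedy variant that trades off query score against diversity: each new
    query receives a bonus for being dissimilar to queries already in the batch."""
    batch, remaining = [], list(unobserved)
    while remaining and len(batch) < n:
        best = max(remaining, key=lambda q: score(q) +
                   lam * min((dissimilarity(q, b) for b in batch), default=0.0))
        batch.append(best)
        remaining.remove(best)
    return batch
```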

2.3.6 Stopping Criteria

A critical aspect of any active querying method is the criterion used to stop eliciting new preferences from users. There is a trade-off between the cost of reducing the model's error by acquiring extra information and the cost incurred by making recommendations with the current model. For example, imagine that each time a user is queried, there is some probability that he will become annoyed and leave the system; in addition, if recommendations are not satisfactory, the same user may also leave the system. This is a good example of the trade-off between further elicitation and exploitation of the user's current ratings. Under a decision-theoretic approach, querying should stop once the utility of every query is negative. This assumes a utility function which accounts for all user benefits and costs; such a function may be difficult to model accurately. In practice, Bloodgood and Vijay-Shanker [17] proposed that if one could define a separate validation set, active learning could simply be stopped once the error on the validation set stabilized. However, as previously noted, setting aside a labelled set is impractical in the typical active-learning setting where few labelled instances exist. Several authors [17, 81] have agreed that one must determine whether or not the model has stabilized, meaning that its predictions are not likely to change given more data. Having said this, we are not aware of a formal, yet practical, way to decide on a stopping criterion that is applicable to a wide array of models and data sets and is not overly conservative [17]. In fact, in his survey, Settles [102] implicitly defended the decision-theoretic framework: "the real stopping criterion [...] is based on economic or other external factors, which likely come well before an intrinsic learner-decided threshold." In other words, the cost of reducing the model's error dominates the cost of the model's error before the model has reached its best possible performance.

2.4 Matching

It is often the case that designers and users of a recommendation system have requirements, in addition to users' intrinsic item preferences, which must be considered before making recommendations. In this work, we will be specifically interested in recommendations which require an assignment, or a matching, between users and items. An example of a matching recommendation, which we will discuss at length, is matching reviewers to conference submissions. We will refer to recommendation systems which provide this type of matching as match-constrained recommendation systems. Match-constrained recommendations are also prevalent in other domains involving the selection of experts, such as grant reviewing, assignment marking, and hospital-to-resident matching [94], as well as in completely different areas such as on-line dating [46], where users must be matched to other users according to their romantic interest, house matching [52], or even recommending taxis to customers.

The field of matching has a long and distinguished history dating back to the classical work on the stable marriage problem [36]. The importance of this work was recognized when one of its authors received the 2012 Bank of Sweden prize in economic science.2 The stable marriage problem consists

of finding a stable matching between single women and single men. Stability implies that in the resulting match, no woman-man pair prefers being matched to one another over their assigned partners in the matching. For evident reasons, stability is a property often sought by practitioners. Gale and Shapley [36] proposed a simple iterative algorithm which is guaranteed to output a stable matching. Several other domains have caught the interest of economics and theoretical computer science researchers, among them college admission matching [36] and resident matching (of residency candidates to hospitals, also similar to the roommate-matching problem) [94], which generalize the stable marriage problem to many-to-one matches. In many-to-one matching, members of one set of entities need to be matched to not one but multiple entities of the other set (matching polygamy, in a sense). Such problems can be solved with a generalization of the stable marriage algorithm. Further constraints on these problems, such as couples wanting to be in the same hospital, have also been considered, but are typically harder to solve ("couples matching" is NP-complete) [92].

An important dichotomy is that between two-sided and single-sided matching domains. In single-sided matching, the entities to be recommended do not express preferences, as in matching in housing markets [52]. Note that in single-sided domains, the notion of stability does not apply, although similar properties are captured by the notion of the matching core.

A further aspect of great importance in this line of work has been the truthful elicitation of user preferences. Indeed, to ensure the integrity of a matching system, it is often of great importance that users have incentives to be truthful. This research, under the name mechanism design, seeks protocols, including matching protocols, which ensure that a user cannot obtain a preferred match by not reporting his preferences truthfully [95]. In our work, we focus exclusively on one-sided matching problems. Furthermore, we do not consider strategic issues, such as the ones studied in mechanism design. We initiate our discussion of match-constrained recommendations by introducing a running example in Chapter 3. We further study matching formulations and constraints in Chapter 5.

2 It is often referred to as the Nobel Prize in Economics.
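As a concrete illustration of the deferred-acceptance algorithm mentioned above, here is a minimal sketch of Gale-Shapley for the one-to-one stable marriage problem; the dictionary-based interface and the toy preference lists are illustrative.

```python
def gale_shapley(men_prefs, women_prefs):
    """Gale-Shapley deferred-acceptance algorithm for the stable marriage problem.

    men_prefs, women_prefs : dicts mapping each person to a preference-ordered
    list of members of the other side. Returns a stable matching {man: woman}.
    """
    # rank[w][m] = position of m in w's preference list (lower is better)
    rank = {w: {m: i for i, m in enumerate(prefs)} for w, prefs in women_prefs.items()}
    free_men = list(men_prefs)
    next_proposal = {m: 0 for m in men_prefs}      # index of the next woman to propose to
    engaged_to = {}                                 # woman -> man

    while free_men:
        m = free_men.pop(0)
        w = men_prefs[m][next_proposal[m]]
        next_proposal[m] += 1
        if w not in engaged_to:                     # w is free: accept
            engaged_to[w] = m
        elif rank[w][m] < rank[w][engaged_to[w]]:   # w prefers m: swap partners
            free_men.append(engaged_to[w])
            engaged_to[w] = m
        else:                                       # w rejects m
            free_men.append(m)
    return {m: w for w, m in engaged_to.items()}

men = {'a': ['x', 'y'], 'b': ['y', 'x']}
women = {'x': ['b', 'a'], 'y': ['a', 'b']}
print(gale_shapley(men, women))   # {'a': 'x', 'b': 'y'}, a stable matching
```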

Chapter 3

Paper-to-Reviewer Matching

Before detailing the research contributions of this thesis, we will discuss a real-life example of a match-constrained recommendation system. This example will be used as an illustration throughout our work. Furthermore, this particular example provided some of the initial motivation behind the work that has led to this thesis.

We introduce the paper-to-reviewer matching problem. This problem must routinely be solved by conference organizers to determine a conference program. The way typical conferences operate is that once the paper-submission deadline has passed, conference organizers initiate the reviewing process for the submissions. Concretely, organizers have to assign each submission to a set of reviewers. Reviewers are typically chosen from a preselected pool. Once assigned their papers, reviewers have a fixed amount of time to provide paper reviews, which are informed opinions from domain experts describing the merits of each submission. Because reviewers' time is limited, each reviewer has only enough resources to review at most a fixed number of papers. The paper-to-reviewer assignment process aims to find the most expert reviewers for each submission. Obtaining high-quality reviews is of great importance to the quality and reputation of a conference and, in a certain sense, helps shape the direction of a field. This assignment process can be seen as a recommendation system with reviewer expertise substituting for the (more typical) factor of user preference. Furthermore, constraints on the number of submissions that each reviewer may process and on the number of reviewers per submission give rise to a match-constrained recommendation problem.

We can frame the paper-to-reviewer matching problem in terms of our three-stage framework presented in Figure 1.1. First, expertise about submissions, along with possibly other expertise-related information, is elicited from reviewers. Second, a learning model can be used to predict the missing paper-reviewer expertise data. Finally, given the specific constraints of each conference, the optimization procedure consists of a matching procedure which assigns papers to reviewers according to their stated and predicted expertise. Many research questions related to how learning models can be optimized for this framework emerge when thinking of the paper-matching problem in this context. The rest of this thesis, and specifically Chapter 5, discusses some of these questions. We have built a software system called the Toronto Paper Matching System (TPMS), which provides automated assistance to conference organizers in the process of assigning their submissions to their reviewers. In this chapter, we discuss the intricacies of the practical paper-to-reviewer assignment problem through a description of TPMS.


3.1 Paper Matching System

Assigning papers to reviewers is not an easy task. Conference organizers typically need to assign reviewers within a couple of days of the conference submission deadline. Furthermore, conferences in many fields now routinely receive more than one thousand papers, which have to be assigned to reviewers from a pool which often consists of hundreds of reviewers. The assignment of each paper to a set of suitable reviewers requires knowledge about both the topics studied in the paper and reviewers' expertise. For a typical conference, it will therefore be beyond the ability of a single person, for example the program chair, to assign all submissions to reviewers. Decentralized mechanisms are also problematic because global constraints, such as reviewer load, absence of conflicts of interest, and the need for every paper to be reviewed by a certain number of reviewers, must be satisfied. The main motivation for automating the reviewer-assignment process is to reduce the time required to assign submitted papers (manually) to reviewers.

A second motivation for an automated reviewer-assignment system concerns the ability to find suitable reviewers for papers, to expand the reviewer pool, and to overcome research cliques. Particularly in rapidly expanding fields such as machine learning, it is of increasing importance to include new reviewers in the review process. Automated systems offer the ability to learn about new reviewers as well as the latest research topics. In practice, conferences often adopt a hybrid approach in which a reviewer's interest with respect to a paper is first independently assessed, either by allowing reviewers to bid on submissions or, for example, by letting members of the senior program committee provide their assessments of reviewer expertise. Using either of these assessments, the problem of assigning reviewers to submissions can then be framed and solved as an optimization problem. Such a solution still has important limitations. Reviewer bidding requires reviewers to assess their preferences over the list of all papers. Failing to do so, for example if reviewers only examine papers that contain specific terms matching their interests, is likely to decrease the quality of the final assignments. On the other hand, asking the senior program committee to select reviewers still imposes a major time burden.

Faced with these limitations, when Richard Zemel was the co-program chair of NIPS 2010, he decided to build a more automated way of assigning reviewers to submissions. The resulting system that we have developed aims to provide a proper evaluation of reviewer expertise to yield good reviewer assignments while minimizing the time burden on conference program committees (reviewers, area chairs, and program chairs). Since then, the system has gained adoption for both machine learning and computer vision conferences and has now been used (repeatedly) by NIPS, ICML, UAI, AISTATS, CVPR, ICCV, ECCV, ECML/PKDD, ACML, and ICVGIP.

3.1.1 Overview of the System Framework

In this section, we first describe the functional architecture of the system, including how several conferences have used it. We then briefly describe the system's software architecture. Our aim is to determine reviewers' expertise. Specifically, we are interested in evaluating the expertise of every reviewer with respect to each submission. Given these assessments, it is then straightforward to compute optimal assignments (see Chapter 5 for a detailed discussion of matching procedures). We emphasize that it is reviewer expertise, rather than reviewer interest, that we aim to evaluate. This is in contrast with the more typical approaches that assess reviewer interest, for example through bidding.

[Figure 3.1 (workflow diagram): submitted papers and reviewer publications yield initial scores; initial scores guide elicitation; elicited and initial scores produce final scores, which feed the matching procedure (reviewer assignments) and sorted, ranked lists.]

Figure 3.1: A conference's typical workflow.

The workflow of the system works in synergy with the conference submission procedures. Specifically, for conference organizers, the busiest time is typically right after the paper-submission deadline, because at this time the organizers are responsible for all submissions, and several different tasks, including the assignment to reviewers, must be completed within tight time constraints. For TPMS to be maximally helpful, assessments of reviewer expertise should be computed ahead of the submission deadline. With this in mind, we note that an academic's expertise is naturally reflected through his or her work and is most easily assessed by examining his or her published papers. Hence, we use a set of published papers for each reviewer participating in a conference. Throughout our work, we have used the raw text of these papers. It stands to reason that other features of a paper could be modelled: for example, one could use citation or co-authorship graphs built from each paper's bibliography and co-authors, respectively.

Reviewers' published papers have proven to be very useful in assessing expertise. However, we have found that we can further boost performance using another source of data: each reviewer's self-assessed expertise about the submissions. We will refer to such assessments as scores. We differentiate scores from more traditional bids: scores represent expertise rather than interest. We use assessed scores to predict missing scores and then use the full reviewer-paper score matrix to determine assignments. Hence, a reviewer may be assigned to a paper for which he did not provide a score.

To summarize, although each conference has its own specific workflow, it usually involves the sequence of steps shown in Figure 3.1. First, we collect reviewers' previous publications (note that this can be done before the conference's paper-submission deadline). Using these publications, we build reviewer profiles which can be used to estimate each reviewer's expertise. These initial scores can then be used to produce paper-reviewer assignments or to refine our assessment of expertise by guiding a score-elicitation procedure (e.g., using active learning to query scores from reviewers). Elicited scores, in combination with our initial unsupervised expertise assessments, are then used to predict the final scores. Final scores can then be used in various ways by the conference organizers (for example, to create per-paper reviewer rankings that will be vetted by the senior program committee, or directly in the matching procedure). Below, we describe the high-level workflow of several conferences that have used TPMS.

NIPS 2010: For this conference, the focus was mostly on modelling the expertise of the area chairs, the members of the senior program committee. We were able to evaluate the area chairs' expertise initially using their previously published papers. There were 32 papers per area chair on average. We then used these initial scores to perform elicitation. The exact process by which we picked which reviewer-paper pairs to elicit is described in the next section. We performed the elicitation in two rounds. In the first round, we kept about two-thirds of the papers, selected as those about which our system was most confident (estimated as the inverse entropy of the score distribution across area chairs for each paper). Using these elicited scores, we were then able to run a supervised learning model and proceed to elicit information about the remaining one-third of the papers. We then re-trained a supervised learning method using all elicited scores. The results were used to assign a set of papers to each area chair. For the reviewers, we also calculated initial scores from their previously published papers and used those initial scores to perform elicitation. Each reviewer was shown a list of approximately eight papers on which they could express their expertise. The initial and elicited scores were then used to evaluate the suitability of reviewers for papers. Each area chair was then provided with a ranked list of (suggested) reviewers for each of his assigned papers.

ICCV 2013: ICCV used author suggestions, by which each author could suggest up to five area chairs who could review their paper, to restrict area-chair score elicitation. The elicited scores were used to assign area chairs. Area chairs then suggested reviewers for each of their papers. TPMS initial scores, calculated from reviewers' previously published papers, were used to present a ranked list of candidate reviewers to each area chair.

ICML 2012: Both area chairs and reviewers could assess their expertise for all papers. To help in this task, TPMS initial scores, calculated again from reviewers' and area chairs' previous publications, were used to generate a personalized ranked list of candidate papers which area chairs and reviewers could use to quickly identify relevant papers. TPMS then used the recorded scores for both reviewers and area chairs in a supervised learning model. Predicted scores were then used to assign area chairs and one reviewer per paper (area chairs were able to assign the other two reviewers).1

1 The full ICML 2012 process has been detailed by the conference program chairs: http://hunch.net/?p=2407

3.1.2 Active Expertise Elicitation

As mentioned in the previous section, initial scores can be used to guide active elicitation of reviewer expertise. The direction that we have taken is to run the matching program using the initial scores. In other words, we use the initial scores to find an (optimal) assignment of papers to reviewers. Each reviewer is then queried about his expertise for all the papers assigned to him. Intuitively, these queries are informative because, according to our current scores, reviewers are queried about papers that they would have to review (a strong negative assessment of a paper is therefore very informative). By adapting the matching constraints, conference organizers can tailor the number of scores elicited per user (in practice, it can be useful to query reviewers about more papers than is warranted by the final assignment). We formally explore these ideas in Chapter 5. Note that our elicited scores will necessarily be strongly biased by the matching constraints used in the elicitation procedure. To relate to our discussion in Section 2.2.1: scores are not missing at random. In practice, this does not appear to be a problem for this application (that is, assigning papers to a small number of expert reviewers). There are also empirical procedures, such as pooled relevance judgements, which combine the scores of different models while considering the variance between models, and which have been used to reduce elicitation bias [79]. It is possible that such a method could be adapted to be effective with our matching procedure.
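A minimal sketch of the elicitation-by-matching idea described above, under the simplifying assumptions that each reviewer is queried about a fixed number of papers and that each paper appears in at most one query; the real system allows richer per-paper and per-reviewer constraints (Chapter 5). The use of scipy's assignment solver and the slot-replication trick are implementation choices for this sketch, not a description of TPMS internals.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def elicitation_queries(initial_scores, per_reviewer):
    """Pick (reviewer, paper) pairs to query by solving an assignment problem
    on the initial scores.

    initial_scores : (num_reviewers x num_papers) array of initial expertise scores.
    per_reviewer   : number of papers to query from each reviewer.
    """
    # Replicate each reviewer's row once per query slot, then solve a
    # maximum-weight matching between slots and papers.
    expanded = np.repeat(initial_scores, per_reviewer, axis=0)
    rows, cols = linear_sum_assignment(-expanded)          # minimize negative score
    return [(row // per_reviewer, paper) for row, paper in zip(rows, cols)]

scores = np.random.rand(4, 10)          # 4 reviewers, 10 submissions
print(elicitation_queries(scores, per_reviewer=2))
```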

3.1.3 Software Architecture

For the NIPS 2010 conference, the system was initially made up of a set of MATLAB routines that would operate on conference data. The data were exported (and re-imported) from the conference Web site hosted on Microsoft's Conference Management Toolkit (CMT).2 This solution had limitations because it imposed a high cost on conference organizers that wanted to use it. Since then, and encouraged by the ICML 2012 organizers, we have developed an on-line version of the system which interfaces with CMT and can be used by conference organizers through CMT (see Figure 3.2).

The system has two primary software features. One is to act as an archive by storing reviewers' previously published papers. We refer to these papers as a reviewer's archive or library. To populate their archives, reviewers can register with and log in to the system through a Web interface. Reviewers can then provide URLs pointing to their publications. The system automatically crawls the URLs to find reviewers' publications in PDF format. There is also functionality which enables reviewers to upload papers from their local computers. Conference organizers can also populate reviewers' archives on their behalf. Another option enables our system to crawl a reviewer's profile.3 The ubiquity of the PDF format has made it the system's accepted format. On the programming side, the interface is entirely built using the Python-based Django web framework4 (except for the crawler, which is written in PHP and relies heavily on the wget utility5).

The second main software feature is one that permits communication with Microsoft's CMT. Its main purpose is to enable our system to access some of the CMT data as well as to enable organizers to call our system's functions through CMT. The basic workflow proceeds as follows: organizers, through CMT, send TPMS the conference submissions; then they can send us a score request which queries TPMS for reviewer-paper scores for a specified set of reviewers and papers. This request contains the paper and reviewer identification for all the scores that should be returned. In addition, the request can contain elicited scores (bids in CMT terminology). After receiving these requests, our system processes the data, which may include PDF submissions and reviewer publications, and computes scores according to a particular model. TPMS scores can then be retrieved through CMT by the conference organizers.

Technically speaking, our system can be seen as a paper repository where submissions and metadata, both originating from CMT, can be deposited. Accordingly, the communication protocol used is SWORD, based on the Atom Publishing Protocol (APP) version 1.0.6 SWORD defines a format to be used on top of HTTP. The exact messages enable CMT to: a) deposit documents (submissions) to the system; b) send information to the system about reviewers, such as their names and publication URLs; c) send reviewers' CMT bids to the system. On our side, the SWORD API was developed in Python and is based on a simple SWORD server implementation.7 The SWORD API interfaces with a computations module written in a mixture of Python and MATLAB (we also use Vowpal Wabbit8 for training some of the learning models).

Note that although we interface with CMT, TPMS runs completely independently (and communicates with CMT through the network); therefore, other conference management frameworks could easily interact with TPMS. Furthermore, CMT has its own matching system which can be used to determine reviewer assignments from scores.
CMT's matching program can combine several pieces of information, such as TPMS scores, reviewer suggestions, and subject-area scores. Hence, we typically return scores to CMT, and conference organizers then run CMT's matching system to obtain a set of (final) assignments.

2 http://cmt.research.microsoft.com/cmt/
3 http://scholar.google.com/
4 https://www.djangoproject.com/
5 http://www.gnu.org/software/wget/
6 The same protocol, with similar messages, is used by http://arxiv.org to enable users to make programmatic submissions of papers.
7 https://github.com/swordapp/Simple-Sword-Server
8 http://hunch.net/~vw/

[Figure 3.2 (diagram): conference organizers interact with CMT; CMT exchanges submissions and scores with the TPMS score models; reviewers populate their archives through the paper-collection web interface.]

Figure 3.2: High-level software architecture of the system.

3.2 Learning and Testing the Model

As mentioned in Section 3.1.1, at different stages in a conference workflow, we may have access to different types of data. We use models tailored to the specifics of each situation. We first describe models that can be used to evaluate reviewer expertise using the reviewers' archives and the submitted papers. Then we describe supervised models that have access to ground-truth expertise scores. We remind the reader about some of our notation, which now takes on specific meaning for the problem of paper-to-reviewer matching. An individual submission is denoted as p, while $\mathcal{P}$ is the set of all submitted papers. Similarly, single reviewers are denoted as r, and the set of all reviewers is denoted as $\mathcal{R}$. We introduce a reviewer's archive (a reviewer's previously published papers), encoded using a bag-of-words representation and denoted by the vector $w_r^a$. Note that we will assume that a reviewer's papers are concatenated into a single document to create that reviewer's archive. Similarly, a submission's content is denoted as $w_p^d$. Finally, $f(\cdot)$ and $g(\cdot)$ represent functions which map papers, submitted or archived respectively, to a set of features. Features can be the word counts associated with a bag-of-words representation, in which case $f(w_p^d)$ and $g(w_r^a)$ are identity functions, or possibly higher-level features such as those learned from a topic model [16].

3.2.1 Initial Score Models

Two different models were used to predict initial scores. Language Model (LM): This model predicts a reviewer's score as the dot product between a reviewer's archive representation and a submission's representation:

$$s_{rp} = g(w_r^a)^T f(w_p^d) \qquad (3.1)$$

There are various possible incarnations of this model. The one that we have routinely used is due to Mimno and McCallum [79] and consists of using the word-count representation of the submissions (that is, each submission is encoded as a vector in which the value of an entry corresponds to the number of times that the word associated with that entry appears in the submission). For the archive, we use the normalized word count for each word appearing in the reviewer's published work. By assuming conditional independence between words given a reviewer and working in the log domain, the above is equivalent to:

$$s_{rp} = \sum_{i \in p} \log f(w_{ri}^a) \qquad (3.2)$$

In practice, we Dirichlet smooth [133] the reviewer's normalized word counts to better deal with rare words:

$$f(w_{ri}^a) = \frac{N_{w_r^a}}{N_{w_r^a} + \mu}\left(\frac{w_{ri}^a}{N_{w_r^a}}\right) + \frac{\mu}{N_{w_r^a} + \mu}\left(\frac{N_{w_i}}{N}\right) \qquad (3.3)$$

where $N_{w_r^a}$ is the total number of words in reviewer $r$'s archive, $N$ is the total number of words in the corpus, $w_{ri}^a$ and $N_{w_i}$ are the number of occurrences of word $i$ in $r$'s archive and in the corpus respectively, and $\mu$ is a smoothing parameter.

Because papers have different lengths, scores will be uncalibrated: shorter papers will receive higher scores than longer papers. Depending on how scores are used, this may not be problematic. For example, it will not matter if one wishes to obtain ranked lists of reviewers for each paper. We have obtained good matching results with such a model. However, normalizing each score by the length of its paper has also turned out to be an acceptable solution. Finally, in the language model described above, the dot product of the archive and submission representations is used to measure similarity; other metrics could also be used, such as the KL-divergence.

Latent Dirichlet Allocation (LDA) [16]: LDA is an unsupervised probabilistic method used to model documents. Specifically, we can use the topic proportions found by LDA to represent documents. Equation 3.1 can then be used naturally to calculate expertise scores from the LDA representations of archives and submissions. When using the LM with LDA representations we refer to it as topic-LM, in contrast with word-LM when using the bag-of-words representation.
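The word-LM score of Equations 3.2 and 3.3 amounts to scoring a submission under a Dirichlet-smoothed unigram model of the reviewer's archive. The small Python sketch below illustrates the computation; the variable names are ours, and the handling of out-of-vocabulary words (which would make the logarithm undefined) is left out for simplicity.

    from collections import Counter
    import math

    def word_lm_score(submission_tokens, archive_tokens, corpus_counts, mu=1000.0):
        """Log-probability of a submission under a Dirichlet-smoothed
        unigram model of a reviewer's archive (Equations 3.2 and 3.3)."""
        archive_counts = Counter(archive_tokens)      # w_ri^a
        n_archive = sum(archive_counts.values())      # N_{w_r^a}
        n_corpus = sum(corpus_counts.values())        # N
        lam = n_archive / (n_archive + mu)            # archive mixing weight
        score = 0.0
        for word in submission_tokens:
            p_archive = archive_counts[word] / n_archive if n_archive else 0.0
            p_corpus = corpus_counts[word] / n_corpus
            score += math.log(lam * p_archive + (1.0 - lam) * p_corpus)
        return score

Dividing the result by the number of submission tokens gives the length-normalized variant mentioned above.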

3.2.2 Supervised Score-Prediction Models

Once elicited scores are available, supervised regression methods can be used. The problem can be seen as one of collaborative filtering. Furthermore, because both reviewer (user) and paper (item) content exist, the problem can also be modelled using a hybrid approach.

Probabilistic Matrix Factorization: PMF was discussed in Section 2.2.1. For easier comparison with other models in this section, we can express the score generative model as $s_{rp} = \gamma_r^T \omega_p$ (where $\gamma_r$ and $\omega_p$ correspond to $u_r$ and $v_p$, respectively, in our earlier notation). Because PMF does not use any information about either papers or reviewers, its performance suffers in the cold-start regime. Nonetheless, it remains an interesting baseline for comparison.

Linear Regression (LR): On the other end of the spectrum are models which use document content, but do not share any parameters across users and items. The simplest regression model learns a separate model for each reviewer using submissions as features:

$$s_{rp} = \gamma_r^T f(w_p^d) \qquad (3.4)$$

where $\gamma_r$ denotes user-specific parameters. This method has been shown to work well in practice, particularly if many scores have been elicited from each reviewer.

A hybrid model can be of particular value in this domain. One issue with the conference domain is that it is typical for some reviewers to have very few or even no observed scores. It may then be beneficial to enable parameter sharing between users and papers. Furthermore, re-using information from each reviewer's archive may also be beneficial. One method of sharing parameters in a regression model was proposed by John Langford, co-program chair of the ICML 2012 conference:

$$s_{rp} = b + b_r + (\gamma + \gamma_r)^T f(w_p^d) + (\omega + \omega_r)^T g(w_r^a) \qquad (3.5)$$

where $b$ is a global bias and $\gamma$ and $\omega$ are parameters shared across reviewers and papers, which encode weights over features of submissions and archives respectively. The shared parameter $\omega$ enables the model to make predictions even for users that have no observed scores. In other words, the shared $\omega$ enables the model to calibrate a reviewer's archive using information from the other reviewers' scores. $b_r$, $\gamma_r$ and $\omega_r$ are parameters specific to a reviewer. For ICML-12, $f(w_p^d)$ was paper $p$'s word counts, while $g(w_r^a)$ was the normalized word count of reviewer $r$'s archive (similarly to the LM). In practice, for each reviewer-paper instance $g(w_r^a)$ is multiplied by the paper's word occurrence vector. For that conference and afterwards, that model was trained in an on-line fashion using Vowpal Wabbit with a squared loss and L2 regularization.

In practice, because certain reviewers have few or no observed scores, one has to be careful to weight the regularizers of the various parameters properly so that the shared parameters are learned at the expense of the individual parameters. To determine a good setting of the hyper-parameters we typically use a validation set (and search in hyper-parameter space using, for example, grid search). To ensure the good performance of the model across all reviewers, each reviewer contributes equally to the validation set.
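As a concrete illustration of Equation 3.5, the sketch below computes the predicted score from bag-of-words vectors with NumPy. It is a simplified re-implementation, not the Vowpal Wabbit setup used in production: feature hashing, the on-line optimization, and the per-parameter regularization weights discussed above are omitted, and the "word occurrence vector" is interpreted here as a binary indicator of which words appear in the paper (an assumption on our part).

    import numpy as np

    def hybrid_score(f_p, g_r, b, b_r, gamma0, gamma_r, omega0, omega_r):
        """Predicted score of Equation 3.5 for one reviewer-paper pair.

        f_p     : word-count vector of the submission, f(w_p^d)
        g_r     : normalized word-count vector of the reviewer's archive, g(w_r^a)
        b, b_r  : global and reviewer-specific biases
        gamma0, gamma_r : shared and reviewer-specific weights on submission features
        omega0, omega_r : shared and reviewer-specific weights on archive features
        """
        return (b + b_r
                + (gamma0 + gamma_r) @ f_p
                + (omega0 + omega_r) @ (g_r * (f_p > 0)))  # archive features gated by
                                                           # the paper's word occurrences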

3.2.3 Evaluation

A machine-learning model is typically assessed by evaluating its performance on the task at hand. For example, we can evaluate how well a model performs in the score-prediction task by comparing the predicted scores to the ground-truth scores. The task of most interest here is that of finding good paper-reviewer assignments. Ideally, we would be able to compare the quality of our assignments to some gold-standard assignment. Such an assignment could then be used to test both different score-prediction models and different matching formulations. Unfortunately, ground-truth assignments are unavailable. Moreover, even humans would have difficulty finding optimal assignments, and hence we cannot count on their future availability. We must then explore different metrics to test the performance of the overall system, including methods for comparing score-prediction models as well as matching performance.

In practice, in-vivo experiments can provide a good way to measure the quality of TPMS scores and ultimately the usefulness of the system. ICML 2012's program chairs experimented with different initial scoring methods using a special interface which showed reviewers one of three groups of ranked candidate papers.9 The experiment had some biases: the papers of the three groups were ranked using TPMS scores. The poll, which asked after the fact whether reviewers had found the ranked-list interface useful, showed that reviewers who had used the list based on word-LM were slightly more likely to have preferred the list to the regular CMT interface (the differences were likely not statistically significant).

We will defer comprehensive comparisons of the different score-prediction methods to Chapter 4, where we will introduce a novel model and make comparisons with it. Likewise, in Chapter 5, we will introduce different matching objectives and constraints and discuss relevant matching experiments. Below, we offer some qualitative comparisons of initial score models.

Datasets

Through its operation, the system has gathered interesting datasets containing reviewers' preferences over submissions as well as reviewers' archive papers and the submitted papers. We have assembled a few datasets from these

9 http://hunch.net/?p=2407

Figure 3.3: Histograms of score values for (a) NIPS-10 and (b) ICML-12 (y-axis: number of scores).

collected data. Below, we introduce certain datasets that are used throughout this thesis for empirical evaluations. The datasets bear the name of the conference from which their data were assembled.

NIPS-10: This dataset consists of 1251 papers submitted to the NIPS 2010 conference. The set of reviewers consists of the conference's 48 area chairs. The submission and archive vocabulary consists of 22,535 words. User-item preferences were integer scores in the 0 to 3 range. The histogram of score values is given in Figure 3.3. Suitabilities on a subset of papers were elicited from reviewers using a rather involved two-stage process. This process utilized the language model (LM) to estimate the suitability of each reviewer for each paper, and then queried each reviewer on the papers on which his estimated suitability was maximal. The output of the first round was fine-tuned using a combination of a hybrid discriminative/generative RBM [65] with replicated softmax input units [97], trained on the initial scores, and the LM, which then determined the second round of queries. In total, each reviewer provided scores on an average of 143 queried papers (excluding one extreme outlier), and each paper received an average of 3.3 suitability assessments (with a std. dev. of 1.3). The mean suitability score was 1.1376 (std. dev. 1.1). With regard to our earlier discussion of data missing at random (Section 2.2.1), we note that since the querying process was biased towards asking about pairs with high predicted suitability, the unobserved scores are not missing at random, but rather tend toward pairs with low suitability. We do not distinguish the data acquired in the two phases of elicitation; both took place within a short time frame, so we assume suitabilities for any one reviewer are stable.

ICML-12: This dataset consists of 857 papers and 431 reviewers from the ICML 2012 conference. The submission and archive vocabulary consists of 21,409 words. User-item preferences were integer scores in the 0 to 3 range. The histogram of score values is given in Figure 3.3. The elicitation process was less constrained than for NIPS-10. Specifically, reviewers could assess their expertise for any paper, although they were also shown a selection of suggested papers produced either by the LM or based on subject-area similarity.

The original vocabulary of these datasets was very slightly pre-processed. Specifically, we removed stop words from a predefined list. Moreover, we only kept words that appeared in at least five documents, including two different submissions. We performed further experiments using word stems, which did not make a significant difference. We therefore hypothesize that words that are useful in determining a user's expertise are typically technical words which do not appear under different stems.

            NIPS-10   ICML-12
  NDCG@5     0.926     0.867
  NDCG@10    0.936     0.884

Table 3.1: Evaluating the similarity of the top-ranked reviewers for word-LM versus topic-LM on the NIPS-10 and ICML-12 datasets.

Initial score quality

We examined the quality of the initial scores, those estimated solely by comparing the archive and the submissions, without access to elicited scores. We will compare the performance of a model which uses the archive and submission representations in word space to one which uses these representations in topic space. The method that operates in word space is the language model as described by Equation 3.2. For purposes of comparison, we further normalized these scores using the length of each submission. We refer to this method as word-LM. To learn topics, we used LDA to learn 30 topics using the content of both the archives and the submissions. For the archives, we learned topics for each of a reviewer's papers and then averaged a reviewer's papers in topic space. This version of the language model is denoted as topic-LM.

We first compared the two methods with each other by comparing the top-ranked reviewers for each paper according to word-LM and topic-LM. Table 3.1 reports the average similarity of the top-5 and top-10 reviewers using NDCG, where the word-LM ranking was used as the ground truth against which topic-LM's predictions were evaluated. The high NDCG values indicate that both methods generally agree about the most expert reviewers.

We can obtain a better appreciation of the scores of each model by plotting the model's (top) scores for each paper. Each data point in Figure 3.4 shows the score of one of the top-40 reviewers, on the y-axis, for a particular paper, on the x-axis. For visualization purposes, points corresponding to the same reviewer ranking across papers are connected.10 One can see that word-LM scores all fall within a small range. In other words, given a paper, word-LM evaluates all reviewers to have very similar expertise. We have usually not found this to be a problem when using word-LM scores to assign reviewers to papers. In one exceptional case, we did get a poor assignment when the program committee of a conference was very small (9 members) and the breadth of the submissions was large. From Figure 3.4, we see that topic-LM does not have this problem as it better separates the reviewers, and often finds a few top reviewers to have a certain (expertise) margin over the others, which seems sensible.

One possible explanation for these discrepancies is that working in topic space removes some of the noise present in word space. It has been suggested that using a topic model "fuse[s] weak cues from each individual document word into strong cues for the document as a whole".11 Specifically, elements like the specific words, and to a certain extent the writing style, used by individual authors may be abstracted away by moving to topic space. Topic-LM may therefore provide a better evaluation of reviewer expertise than can word-LM.

We would also like to compare word-LM and topic-LM on matching results. However, such results would be biased toward word-LM because, in our datasets, this method was used to produce initial scores which guided the elicitation of scores from reviewers (we have validated experimentally that word-LM slightly outperforms topic-LM using this experimental procedure). Using the same experimental procedure, word-LM also outperforms matching based on CMT subject areas.
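For reference, the comparison behind Table 3.1 can be reproduced with a few lines of Python: treat the word-LM ranking of reviewers for a paper as the reference, use its scores as relevance values for the reviewers ranked by topic-LM, and compute NDCG@k, averaging over papers. The helper below is a generic NDCG implementation; exactly how relevance values were derived from word-LM scores is our assumption about the evaluation, not a detail given in the text.

    import math

    def ndcg_at_k(ranked_relevances, k):
        """NDCG@k for a list of relevance values given in predicted-rank order."""
        def dcg(rels):
            return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
        ideal = dcg(sorted(ranked_relevances, reverse=True))
        return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

    # Hypothetical usage for one paper: the relevance of each reviewer is taken to be
    # that reviewer's word-LM score, and reviewers are listed in topic-LM order.
    # word_lm = {reviewer: word-LM score}; topic_lm_order = reviewers sorted by topic-LM
    # ndcg_at_k([word_lm[r] for r in topic_lm_order], k=10)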

10 This methodology was suggested and first experimented with by Bill Triggs, the program chair for the ICVGIP 2012 conference.
11 From personal communication with Bill Triggs.

Figure 3.4: Score of the top-40 reviewers (y-axis) for 20 randomly selected submitted papers (x-axis). Panels: (a) NIPS-10 using topic-LM; (b) ICML-12 using topic-LM; (c) NIPS-10 using word-LM; (d) ICML-12 using word-LM.

3.3 Related Work

We are aware that other conferences, such as SIGGRAPH, KDD, and EMNLP, have previously used a certain level of automation for the task of assigning papers to reviewers. The only conference management system that we are aware of that has explored machine learning techniques for paper-reviewer assignments is MyReview,12 and some of their efforts are detailed in [89].

The reviewer-to-paper matching problem has also received attention in the academic literature. Conry et al. [29] propose combining several sources of information, including reviewer co-authorship information, reviewer and submission self-selected subject areas, as well as submission representations using the submission's abstract. Their proposed collaborative filtering model linearly combines regressors on these different sources of information in a similar way as we do in Equation 3.5. Using elicited and predicted scores, matching is then performed, as a second step, using a standard matching formulation [122]. We note that in this work no differentiation is made between reviewer expertise and reviewer interest.

In TPMS we build representations of submissions using the full text of submissions. Other authors

have exploited the information contained in the submissions' references instead. Specifically, Rodriguez and Bollen [91] use the identity of the authors of the referenced papers and a co-authorship graph, built offline, to identify potential reviewers. Although building a useful co-authorship graph and identifying the authors of referenced papers in submissions presents a technical challenge, this method has the advantage of not requiring any input from reviewers. Benferhat and Lang [8], Goldsmith and Sloan [42], and Garg et al. [37] assume that reviewer-to-paper suitability scores are available and focus on the matching problem and various desirable constraints. We provide a more complete review of these studies in Section 5.3.

12 http://myreview.lri.fr/

3.3.1 Expertise Retrieval and Modelling

The reviewer-matching problem can be seen as an instantiation of the more general expertise retrieval problem [6]. The key problem of expertise retrieval, a sub-field of information retrieval, is to find the right expert given a query. A basic query is a set of terms indicating the topics for which one wishes to find an expert. In our case, queries are the papers submitted to conferences, and expertise is modelled using previously authored papers. Because reviewer expertise and queries live in the same space, we never have to model expertise explicitly in terms of high-level, or human-designed, topics. Overall, the field of expertise retrieval has studied models similar to ours, including probabilistic and specifically language models [6].

3.4 Other Possible Applications

An automated paper matching system has been particularly useful in a context, such as a conference, where many items have to be assigned to many users under short time constraints. Nonetheless, the system could be used in other similar contexts where an individual's level of expertise can be learned from that individual's (textual) work. One example is the grant-review matching process, where grant applications must be matched to competent referees. Academic journals also often have to deal with the problem of finding qualified reviewers for their papers. As more evaluation processes migrate to on-line frameworks, it is also likely that other problems fitting within this system's capabilities will emerge. Other applications of more general match-constrained recommendation systems are discussed in Chapter 5.

3.5 Conclusion and Future Opportunities

There are a variety of system enhancements, improved functionality and research directions we are currently developing, and others we would like to explore in the near future. On the software side, we intend to automate the system further to reduce the per-conference cost (both to us and to conference organizers) of using the system. This implies providing automated debugging and explanatory tools for conference organizers. We have also identified a few more directions that will require our attention in the future:

1. Re-using reviewer scores from conference to conference: Currently, reviewers' archives are the only piece of information that is re-used between conferences. It is possible that elicited scores could also be used as part of a particular reviewer's profile that can be shared across conferences.

2. Score elicitation before the submission deadline: Conferences often have to adhere to strict and short deadlines when assigning papers to reviewers after the submission deadline. Hence, collecting additional information about reviewers before the deadline may save time. One possibility would be to elicit scores from reviewers about a set of representative papers for the conference (for example, a set of papers published in the conference's previous edition).

3. Releasing the data: The data that we gathered through TPMS have opened various research opportunities. We are hoping that some of these data can be properly anonymized and released for use by other researchers.

4. Better integration with conference management software: Running outside of CMT (or other conference organization packages) has provided advantages, but the relatively weak coupling between the two systems also has disadvantages for conference organizers. As more conferences use our system, we will be in a better position to develop further links between TPMS and CMT.

5. Leveraging other sources of side information: Other researchers have been able to leverage other types of information apart from a bag-of-words representation of the submissions' main text (see Section 3.3). The model presented in Chapter 4 may be a gateway to modelling some of this other information.

6. As mentioned in Section 3.3.1 many models have been proposed for expertise retrieval. It would be worthwhile to compare the performance of our current models with the ones developed in that community.

Finally, we are actively exploring ways to evaluate the accuracy, usefulness, and impact of TPMS more effectively. We can currently evaluate how good our assignments are in terms of how qualified reviewers are for the papers to which they are assigned (Chapter 5 details assignment procedures and evaluations). Further, anecdotal evidence from program organizers, the number of conferences that have expressed an interest in using the system, and the several experiments that we have run all suggest that our system provides value in proposing good reviewer assignments and in saving conference organizers and senior program committee members time and cognitive effort. However, we do not have data to evaluate the intrinsic impact that the system may have on conferences and, more generally, on the field as a whole. We can divide the search for answers to such questions into two stages. First, how much does reviewer expertise affect reviewing quality? For example, is it the case that finding good expert reviewers leads to better reviews? Further, are there synergies between reviewers (for example, what is the benefit of having both a senior researcher and a graduate student reviewer assigned to the same paper)? Second, how much impact do reviews have on the quality of a conference and ultimately of the field? For example, do good reviews lead to: a) more accurate accept and reject decisions; and b) higher-quality papers? One way to answer these questions would be to run controlled experiments with conference assignments, for example, by using two different assignment procedures. Subsequently we could evaluate the performance of the procedures by surveying reviewers, authors, and conference attendees. While we may obtain conclusive answers about some of these questions, evaluating the quality of the field as a result of reviews seems like a major challenge. We have only begun exploring such issues but already, in practice, we have found it a challenge to incentivize conference organizers to carry out evaluation procedures which could help in assessing these impacts.

Chapter 4

Collaborative Filtering with Textual Side-Information

Good user-item preference predictions are essential to the goal of providing users with good recommendations. The task of preference prediction, which is the focus of this chapter, is therefore at the heart of recommender systems. Preference prediction corresponds to the second stage of our recommendation framework, depicted for the purposes of this chapter in Figure 4.1. We discuss this stage first because of the importance of preference prediction in the framework, as well as the fact that the other stages of the framework will assume access to a preference prediction method.

There are various flavours of preference prediction problems, which are differentiated by the type of input information available to the preference prediction model. The simplest situation is when only a subset of user-item preferences is observed. The system must then learn, using a collaborative filtering model, similarities between users that will enable it to accurately predict unobserved preferences. We are interested in studying situations where side information (features of users, items, or both) is available along with the user-item preferences. Specifically, we focus on hybrid systems: systems that use collaborative filtering models while also modelling the side information.

We first provide a high-level description of the current state-of-the-art in the field of hybrid recommender systems. We then focus on the specific problem of leveraging user and item textual side information for user-preference predictions. We propose the novel collaborative score topic model (CSTM) for this problem and present supporting empirical evidence of its good performance across data regimes. Before introducing our model we also provide a brief review of topic models and of the variational inference methods used for learning the parameters of those topic models.

4.1 Side Information in Collaborative Filtering

We define side information as any information about users or items distinct from user-item preferences. The immediate goal of modelling side information is to glean information that will be helpful in learning better models of users and items and (consequently) provide more accurate preference predictions. Overall, side information has been particularly useful to combat the cold-start problem which plagues collaborative-filtering methods. Figure 4.2 provides a sketch of the representations of the major classes of models (used in the literature) that combine side information with collaborative filtering. There are two


Figure 4.1: Flow chart depicting the framework developed in this thesis with a particular focus on the second stage of the framework, the preference prediction stage. (Labelled elements include: users and entities, stated preferences, side information from users and items, prediction of missing preferences, and recommendations.)

predominant classes. The first, depicted in Figure 4.2(b), is to use the side information to build a prior over user representations, for example by enforcing that users (and items) with similar side information ($f_u$ and $f_v$) have similar latent representations ($u$ and $v$). Such models are therefore of particular use in the cold-start data regime. The second methodology, which has received a lot of attention, is to use the side information as covariates (or features) in a regression model (see Figure 4.2(c)). The parameters over such features are often shared across users and items. Finally, a third methodology is to model the side information together with the scores, as shown in Figure 4.2(d). The resulting user and item latent factors of this generative model should capture elements of both, which can be useful for preference prediction and for side-information prediction.

Side information has been used, with varying levels of success, in many different domains. To better understand the domains in which side information has been useful, we discuss user side information and item side information separately. User side information has been shown to be useful in various forms. Examples of user side information include frequently available user demographics, such as a user's gender, age group or geographical location, which have often been used and have been shown to produce at least mild performance gains [1, 2, 26]. Concretely, it is likely that such side information is weakly indicative of preferences (for example, two average teens may have similar movie tastes). More descriptive features such as a user's social network have shown promise both as a way to select a user's neighbourhood in model-free approaches [53, 76, 70] (neighbourhood models were introduced in Section 2.2.1), as well as in collaborative filtering approaches as a way to regularize user representations by assuming that neighbours in a social network have similar preferences [26, 54]. Features that could be described as relating to user behaviours, such as browsing patterns, search queries, or purchases, have also shown their utility, again as a way to regularize user representations [61, 99].

On the item side, gains have mostly come from domains where the content of items can easily be analyzed using machine learning techniques, for example, when the content is text, as it is for scientific papers [126] or newspaper articles [27]. For books, Agarwal and Chen [2] have proposed using book reviews as representative of a book's content (in Section 4.6 we show that using the content of books is challenging). They use a probabilistic model of content and user preferences which combines regressors over side information in a conceptually similar way as is done in Conry et al. [29], which we reviewed in Section 3.3.

Figure 4.2: Graphical model representations of the three common classes of models for using side information within collaborative filtering models. (a) A collaborative filtering model such as probabilistic matrix factorization [99]. (b) The user side information $f_u$ and the item side information $f_v$ regularize the user and item representations respectively. Here, the graphical representation illustrates that each user's and item's latent factors ($u$ and $v$) can be regularized by all other users' side information or all other items' side information (this could be the case for users when using the structure of a social network as side information). Regularization of user and item representations is used by many researchers [54, 39, 126, 66, 99]. (c) Side information is used as features (covariates) to learn a regressor [1, 2, 26, 61, 27, 29, 130]. (d) A generative model of the side information is learned, in which the user and item latent factors have to predict both the scores and the side information. Such a model can be seen as another way of regularizing the user and item latent factorizations. A few researchers have used such models [106, 70]. Note that we depict the pure collaborative filtering model as one which includes, but is not restricted to, the popular probabilistic matrix factorization. We emphasize that the above figures are sketches meant as a guide to the different model classes (using a similar graphical representation); they are not intended to reproduce the exact parameters of the proposed models.

For other tasks, such as predicting interactions between drugs and proteins, side information consisting of drug similarities and protein similarities has also shown promise as a way to regularize the representations of a matrix factorization model [39]. However, in other popular domains such as movie preference prediction, where content features present modelling difficulties, using item side information, for example a movie's genre, its actors or its cast, has been less useful (see, for example, [66, 106]). In the movie domain, modest gains have been achieved when using related-content information such as user-proposed movie tags [2, 1]. Music recommendation offers an interesting testbed, falling between the simpler representations of text and the more complex representations of movies, as we now have tools that allow the accurate analysis, including determination of genre, of music tracks (see, for example, the MIREX competition1). However, results in the music domain mostly point to the effectiveness of pure collaborative filtering. In a recent contest [77], content-based methods still did not rival pure collaborative filtering systems [Section 6.2.3, 9]. In a similar task, Weston et al. [130] show that using MFCC audio features as regression features does not provide a gain over pure collaborative filtering methods. These negative results have made some researchers question our ability to extract useful content features from such complex domains as music, movies and images [108].

We have provided a fairly high-level overview of the work that has combined side information with collaborative filtering models. In this chapter we will present a novel model to do so which is applicable to textual side information (or, more generally, to the case when side information can be modelled with a topic model), and we will compare it, experimentally, to competing models in this domain and in particular to [126].

4.2 Problem Definition

We do not aim to solve the general problem of hybrid recommender systems in this chapter. Instead we focus on pushing the state-of-the-art by studying the task of document recommendation using both user-item scores and side information. The primary novelty of our work lies in leveraging a particular form of side information: the content of documents associated with users, which we call user libraries. A typical scenario that can be modelled in this way is scientific-paper recommendation for researchers; for example, Google Scholar recommends papers based on an individual's profile. A second scenario is paper-reviewer assignment, where each reviewer's previously published papers can be used to assess the match between their expertise and each submitted paper. Another relevant application domain is book recommendation, as online book merchants typically enable users to collect items in a virtual container akin to a personal library.2 In each case a user's library, or side information, consists of documents which are not necessarily explicitly rated but nonetheless may contain information about a user's preferences.

To model user-item scores as well as user and item content we introduce a novel directed graphical model. This model uses twin topic models, with shared topics, to model the side information. User and item topic proportions are then used as features to predict user-item scores with a collaborative filtering model. The collaborative filtering component enables the model to effectively make use of the side information with varying numbers of observed scores. We demonstrate empirically that the model outperforms several other methods on three datasets in both cold and warm-start data regimes. We

1 http://www.music-ir.org/mirex/
2 For example, Amazon's Kindle and Kobo's tablets have an option for users to populate their libraries, while Barnes and Noble's Nook gives users an active shelf.

Figure 4.3: The graphical models of two topic models: (a) LDA and (b) CTM.

further show that the model automatically learns to gradually trade off the use of side information in favour of information learned from user-item scores as the amount of user preference-data increases.

4.3 Background

In Chapter 3 we used a topic model as a tool to learn representations for user libraries and paper submissions. In this chapter we propose a graphical model which jointly models user and item topics and user-item scores. To better understand this model and our proposed inference procedures it will be useful to further introduce LDA as well as introduce a second topic model called the correlated topic model (CTM). Topic models are a class of directed graphical models. They were initially proposed as a way to model collections of textual documents [16]. Specifically, each document in the collection is modelled as a mixture over topics. A topic is a distribution over words. With that in mind the aim of topic models is then to learn a (word-level) representation of a set of documents. This representation involves two key components: a) distributions over words or topics; b) for each document in the collection a distribution over topics. Over the years topic models have been extended to model many different practical situations, including: modelling how documents change over time [12], modelling hierarchies of topics [15], and modelling document contents and labels [14, 84, 63]. For our needs we will specifically focus on LDA and its CTM variant. LDA’s graphical model is depicted in Figure 4.3. The generative model of LDA over a set of D documents, an N-size word vocabulary, and K topics is:

• For each document, d = 1 ... D:
  - Draw d's topic proportions: $\eta_d \sim \mathrm{Dirichlet}(\alpha)$
  - For each word in the document, n = 1 ... N:
    - Draw a topic: $z_{dn} \sim \mathrm{Multinomial}(\eta_d)$
    - Draw a word: $w_{dn} \sim \mathrm{Multinomial}(\beta_{z_{dn}})$

$\eta_d$ represents the topic proportions, or a mixture over topics, of document $d$. For each word in a document LDA first samples a single topic, $z_{dn}$. A word is then sampled from the corresponding topic distribution. It is useful to encode $\beta$ as a matrix of size $K \times N$, where $K$ is the number of topics. Then the $z_{dn}$'th row of that matrix, $\beta_{z_{dn}}$, is the distribution (over words) corresponding to topic $z_{dn}$. Formally, for each document, LDA learns to represent a distribution over words given model parameters and latent variables: $\Pr(w_d \mid z_d, \eta_d, \beta, \alpha)$. Across all documents LDA then represents the space of joint distributions over document words.

The correlated topic model is very similar to LDA except that mixtures over topics are modelled using a logistic normal distribution instead of a multinomial. Accordingly, instead of sampling a document's distribution over topics from a Dirichlet prior, it is sampled from a normal distribution with mean parameters $\mu$ and covariance parameters $\Sigma$. CTM's graphical model is shown in Figure 4.3 and its generative model is given by:

• For each document, d = 1 ... D:
  - Draw d's topic proportions: $\eta_d \sim \mathcal{N}(\mu, \Sigma)$
  - For each word in the document, n = 1 ... N:
    - Draw a topic: $z_{dn} \sim \mathrm{Multinomial}(\mathrm{softmax}(\eta_d))$
    - Draw a word: $w_{dn} \sim \mathrm{Multinomial}(\beta_{z_{dn}})$

where $\mathrm{softmax}(v)_k = \frac{\exp(v_k)}{\sum_{k'} \exp(v_{k'})}$. In CTM the main advantage of using a normal distribution is that correlations between topics (e.g., topic 1 often co-occurs with topic 5) can be represented. This has been shown experimentally to provide better models of document collections (and accessorily provides potentially interesting visualizations of the discovered topic correlations) [13]. The main disadvantage brought on by using a normal distribution is that the normal is not conjugate to the multinomial (whereas the Dirichlet is), and this change thus complicates the resulting inference procedures [13, 127].
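The two generative processes differ only in how the topic proportions are drawn. The short NumPy sketch below samples a toy document from each process; $\beta$, $\mu$ and $\Sigma$ are assumed to be given (in practice they are learned), and the vocabulary and topic sizes are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(v):
        e = np.exp(v - v.max())
        return e / e.sum()

    def sample_document(beta, n_words, alpha=None, mu=None, Sigma=None):
        """Sample one bag of words from LDA (if alpha is given) or CTM (if mu, Sigma are given).
        beta: K x V matrix whose rows are topic distributions over the vocabulary."""
        K, V = beta.shape
        if alpha is not None:                       # LDA: Dirichlet topic proportions
            proportions = rng.dirichlet(alpha)
        else:                                       # CTM: logistic-normal topic proportions
            proportions = softmax(rng.multivariate_normal(mu, Sigma))
        words = []
        for _ in range(n_words):
            z = rng.choice(K, p=proportions)        # draw a topic
            words.append(rng.choice(V, p=beta[z]))  # draw a word from that topic
        return words

    # Toy usage with K = 3 topics and V = 10 vocabulary words:
    beta = rng.dirichlet(np.ones(10), size=3)
    lda_doc = sample_document(beta, 20, alpha=np.ones(3))
    ctm_doc = sample_document(beta, 20, mu=np.zeros(3), Sigma=np.eye(3))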

4.3.1 Variational Inference in Topic Models

The key inference problem of graphical models is to evaluate the posterior distributions over the latent variables $\eta_d$ and $z_d$ given the observed variables $w_d$ and the model parameters $\alpha$ and $\beta$. Posteriors can then be used to make predictions using standard Bayesian inference. In the case of LDA the per-document posterior, following Bayes' rule, is given by:

$$\Pr(z_d, \eta_d \mid w_d, \beta, \alpha) = \frac{\Pr(w_d \mid z_d, \beta)\,\Pr(z_d \mid \eta_d)\,\Pr(\eta_d \mid \alpha)}{\int \sum_{z} \Pr(w_d \mid z_d, \beta)\,\Pr(z_d \mid \eta_d)\,\Pr(\eta_d \mid \alpha)\, \partial\eta} \qquad (4.1)$$

However, the denominator is intractable due to the coupling of $\eta$ and $\beta$ [16]. Hence we must resort to an approximate inference approach; specifically, we will use variational inference. In variational inference the idea is to replace the intractable posterior by a tractable distribution over the latent variables: $Q(\{z\}, \{\eta\})$. The choice of $Q$ is model dependent; however, for computational simplicity, it is often given a parametric form. For simplicity, let us group all latent variables, across documents, into $Z = \{\{z_d\}, \{\eta_d\}\}_{d \in D}$ and the visible variables into $X = \{w_d\}_{d \in D}$.3 Then the approach is to optimize this tractable distribution such that it is as close as possible to the true posterior. The measure of closeness between the two distributions is taken to be the KL-divergence:

$$\mathrm{KL}(Q \,\|\, P) = \int Q(Z) \ln\left(\frac{Q(Z)}{P(Z \mid X, \alpha, \beta)}\right) \partial Z \qquad (4.2)$$
$$= \int \Big( Q(Z)\ln(Q(Z)) - Q(Z)\ln(P(Z \mid X, \alpha, \beta)) \Big)\, \partial Z \qquad (4.3)$$
$$= -H(Q(Z)) - \mathrm{E}_Q[\ln P(Z \mid X, \alpha, \beta)] \qquad (4.4)$$

where $H(\cdot)$ denotes the entropy of its argument.

Minimizing the KL-divergence is still intractable because it requires the evaluation of the posterior $P(Z \mid X, \alpha, \beta)$ (Equation 4.4). However, by subtracting $\ln P(X)$, the log-marginal probability of the data, which is constant with respect to $Z$, we obtain:

$$\mathrm{KL}(Q \,\|\, P) - \ln P(X) = \int Q(Z)\ln(Q(Z))\, \partial Z - \int Q(Z)\ln P(Z \mid X)\, \partial Z - \int Q(Z)\ln P(X)\, \partial Z \qquad (4.5)$$
$$= \int Q(Z)\ln\left(\frac{Q(Z)}{P(Z \mid X, \alpha, \beta)\,P(X \mid \alpha, \beta)}\right) \partial Z \qquad (4.6)$$
$$= \int Q(Z)\ln\left(\frac{Q(Z)}{P(Z, X \mid \alpha, \beta)}\right) \partial Z \qquad (4.7)$$
$$= \int \Big( Q(Z)\ln Q(Z) - Q(Z)\ln P(Z, X \mid \alpha, \beta) \Big)\, \partial Z \qquad (4.8)$$
$$= -H(Q(Z)) - \mathrm{E}_Q[\ln P(Z, X \mid \alpha, \beta)] \qquad (4.9)$$

The resulting expression, which consists of an entropy term and an expectation over the complete-data likelihood $\Pr(Z, X \mid \alpha, \beta)$, can now be evaluated.

The variational expectation-maximization (EM) [32, 80] algorithm (approximately) minimizes the above, also known as the free energy, by alternating the following two optimization steps. In the E-step the algorithm minimizes the free energy with respect to the variational posterior parameters, the parameters of $Q$. Then, in the M-step, the free energy is minimized with respect to the model parameters (that is, $\alpha, \beta$ in LDA and $\mu, \Sigma, \beta$ in CTM). The M-step is also often described as maximizing the complete-data log-likelihood. The EM algorithm is a classic algorithm in the machine learning literature. The particular derivation that we have shown is similar to the one in Frey and Jojic [35].

There are several common choices of variational distributions. For example, we recover the maximum a posteriori (MAP) configuration of the latent variables by choosing the variational distribution to be a Dirac delta function with its mode equal to the variational parameter $\hat{h}$: $Q(h) = \delta_{\hat{h}}(h)$, where $\delta_{\hat{h}}(h)$ evaluates to one when $h = \hat{h}$ and zero otherwise. Another common technique is to use a mean-field approach where the intractable posterior distribution is approximated using decoupled distributions. For example, in the case of LDA, $Q(\eta_d, \{z_d\} \mid \gamma_d, \{\phi_d\}) = Q(\eta_d \mid \gamma_d)\, Q(\{z_d\} \mid \{\phi_d\})$, where $\gamma_d$ and $\{\phi_d\}$ are variational parameters.

3 Equivalently, we will write $\{w_d\}$ when the domain of the index is clear from context.

4.4 Collaborative Score Topic Model (CSTM)

Our approach to document recommendation relies on having: a) a set of observed user-item preferences ($\{s_{rp}\}$); b) the contents of the items ($\{w_p^d\}$); and c) the contents of the user libraries ($\{w_r^a\}$). The model's aim is to utilize this content in its user-item score predictions (which can then be used to recommend items to users).

To do so we combine a topic model over the side information and a collaborative filtering model of user-item preferences. Specifically, we propose a generative model over the joint space of user preferences, user side information (user libraries) and item side information (item contents). This model can be cast in the light of the previous work on incorporating side information along with collaborative filtering which we described earlier in this chapter (Section 4.1). Specifically, we combine a generative model of user and item side information as well as of user-item preferences, similarly to Figure 4.2(d). The side information, which consists of textual documents, is modelled using a topic model. The user and item topic proportions are then used within a linear regression model over user-item preferences. In that respect our model also shares commonalities with the class of models depicted in Figure 4.2(c). The resulting model is called the collaborative score topic model (CSTM).

We now describe the components of our model in more detail. Our content-based model is mediated by topics: we learn a shared topic model from the words of the item documents and the user libraries. We represent topic proportions with a normal distribution and realize topics $z^a$ and $z^d$, in the same way as CTM, using the logistic normal [13]. User and item topic proportions offer a compact representation of user and item side information. We favor CTM over LDA because of its continuous representation of topic proportions, which can be useful for the regression model. We use these representations as covariates in a regression model to predict user-item preferences. The regression has two sets of parameters. The first are user-specific parameters on the item-topic covariates. The second are compatibility parameters, which are shared across users and items, and are based on the compatibility between the item topics and the topics of the user library.

We now introduce the complete graphical model of the CSTM. A graphical representation of the model is given in Figure 4.4. The associated generative model is:

• Draw compatibility parameters: $\theta \sim \mathcal{N}(0, \lambda_\theta^2 I)$
• Draw shared-user parameters: $\gamma_0 \sim \mathcal{N}(0, \lambda_{\gamma_0}^2 I)$
• For each user r = 1 ... R:
  - Draw individual-user parameters: $\gamma_r \sim \mathcal{N}(0, \lambda_\gamma^2 I)$
  - Draw user-topic proportions: $a_r \sim \mathcal{N}(0, \lambda_a^2 I)$
• For each document p = 1 ... P:
  - Draw document-topic proportions: $d_p \sim \mathcal{N}(0, \lambda_d^2 I)$
• For all of user r's library words, n = 1 ... N:4
  - Draw $z_{rn}^a \sim \mathrm{Multinomial}(\mathrm{softmax}(a_r))$
  - Draw $w_{rn}^a \sim \mathrm{Multinomial}(\beta_{z_{rn}^a})$
• Repeat the above for all of document p's M words.

4 For simplicity, we will assume in the notation that all user libraries contain N words and all item documents contain M words.

Figure 4.4: Graphical model representation of CSTM.

• For each user-document pair $(r, p)$, draw scores:

$$s_{rp} \sim \mathcal{N}\big((a_r \otimes d_p)^T \theta + d_p^T(\gamma_0 + \gamma_r),\; \sigma_s^2\big) \qquad (4.10)$$

where $\mathcal{N}(\mu, \sigma^2)$ represents a normal distribution with mean $\mu$ and variance $\sigma^2$, $\otimes$ stands for the Hadamard product, $\mathrm{softmax}(v)_k = \frac{\exp(v_k)}{\sum_{k'} \exp(v_{k'})}$, and $I$ is the identity matrix.

The specific parametrization of the preference regression shown in Equation 4.10 is important. Our model is designed to perform well in both cold-start and warm-start data regimes. In cold-start settings the model needs the user's side information to predict user-item preferences. When the number of observed preferences increases, the model can gradually leverage that information, smoothly combining it with information gleaned from the side information to refine its model of missing preferences. To accomplish this, the regression model (Equation 4.10) is separated in two: one component that exploits user side information ($(a_r \otimes d_p)^T \theta$) and another that does not include the user side information but does include user-specific parameters ($d_p^T(\gamma_0 + \gamma_r)$). Item side information is incorporated by modulating the user information through an element-wise product. The weights $\theta$ then serve several purposes: 1) they can act to amplify or reduce the effect of certain topics (for example, diminish the influence of topics bearing little preference information); 2) they enable the model to more easily calibrate its output to the range of observed preference values; and 3) changing the magnitude of $\theta$ enables the model to control how much it uses the side information for preference prediction.

When user-item preferences are more abundant, the model can use them to learn a user-specific model, $\gamma_r$, over item features. Note that these user-specific parameters are combined with a shared

set of parameters, $\gamma_0$, which allows for some transfer across users. An individual's $\gamma_r$ can be used to increase that user's reliance on user-item preferences at the possible expense of item side information, as the joint magnitude of the $\gamma$'s defines the weights associated with this part of the model.

Our model learns a single set of topics to model user and item content. Sharing topics ensures that the user and item representations ($a_r$ and $d_p$, for all $r$ and $p$) are aligned and renders their element-wise product meaningful.
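The two-component score of Equation 4.10 is straightforward to compute once the topic proportions and regression weights are known. The sketch below evaluates the mean prediction for a single user-item pair; it is only the forward computation, under the assumption that estimates of $a_r$, $d_p$, $\theta$, $\gamma_0$ and $\gamma_r$ have already been learned, and the random values in the toy usage stand in for those estimates.

    import numpy as np

    def cstm_predict(a_r, d_p, theta, gamma0, gamma_r):
        """Mean of Equation 4.10 for one (user, document) pair.

        a_r, d_p        : K-dimensional user and document topic proportions
        theta           : shared compatibility weights (cold-start component)
        gamma0, gamma_r : shared and user-specific regression weights (warm-start component)
        """
        side_info_term = (a_r * d_p) @ theta        # (a_r (x) d_p)^T theta
        preference_term = d_p @ (gamma0 + gamma_r)  # d_p^T (gamma_0 + gamma_r)
        return side_info_term + preference_term

    # For a brand-new user, gamma_r sits at its zero prior mean and the prediction
    # is driven by the side-information term; as scores accumulate, gamma_r moves
    # away from zero and the second term carries more weight.
    K = 30
    rng = np.random.default_rng(1)
    score = cstm_predict(rng.random(K), rng.random(K), rng.random(K),
                         np.zeros(K), np.zeros(K))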

4.4.1 The Relationship Between CSTM and Standard Models

Simplifying the proposed CSTM model in various ways produces other models that have been used for similar tasks. First, setting $\gamma_0$ and $\gamma_r$, for every user $r$, to zero and $\theta$ to a vector of ones, we obtain the language model introduced in Section 3.2.1. To be precise, because we are using user and item topic representations, this degenerate case corresponds to topic-LM (see Section 3.2.3). Mimno and McCallum [79] used the related word-LM in a preference prediction task and found its performance particularly strong in low-data regimes.

Further, setting $\theta$ and $\gamma_0$ to zero, we obtain the individual user regression model LR introduced in Section 3.2.2. As we will demonstrate, LR typically outperforms purely collaborative filtering models in our task. By modelling preferences as a combination of user features and item features, our model can also be seen as an instance of collaborative filtering [99, 98, 7]. Finally, we have opted to represent topic proportions using a logistic normal distribution as in CTM [13]. In our case we utilize the logistic normal due to its representational form, and not as a means of learning topic correlations.5 Compared to a multinomial, the normal distribution adds a level of flexibility that may be useful to better calibrate CSTM's preference predictions; the drawback is additional complexity in model inference.

Our model also shares several similarities with the model of Equation 3.5. First, both models are composed of two separate components which, respectively, learn to regress over user and item side information. While the regression over item content is the same, CSTM, in its other component, combines the user and item side-information representations in order to maximize its performance on cold-start users. Early experiments with CSTM showed that having both global and user-specific compatibility parameters (which would be closer to the parameters $\omega + \omega_r$ of Equation 3.5) did not yield empirical improvements. One important difference between the two models is that the model of Equation 3.5 uses representations learned offline, whereas CSTM learns these representations jointly with the regression parameters.

4.4.2 Learning and Inference

For learning we use a version of the EM algorithm where we alternate between updates of the user-item specific variables ($\mathcal{H} = \{\{\gamma_r\}, \{a_r\}, \{d_p\}, \{z^a\}, \{z^d\}\}$) in the E-step and updates of the parameters, or shared variables, ($\Theta = \{\gamma_0, \theta, \beta\}$) in the M-step. The inference and learning procedures are similar to those proposed for nonconjugate LDA models in Wang and Blei [126]. The general EM algorithm is shown in Algorithm 1.

5 Because learning topic correlations has been found to improve on standard LDA, it is possible that learning the topic correlations could also improve our model.

E-Step

Inference in this model is intractable, so we must rely on approximations when manipulating the posterior over the user-item specific variables. The log-posterior over user-item variables, given the fixed model parameters and the data, is

$$\begin{aligned}
\mathcal{L} := &-\frac{1}{2\lambda_a}\sum_r^R a_r^T a_r - \frac{1}{2\lambda_d}\sum_p^P d_p^T d_p - \frac{1}{2\lambda_\gamma}\sum_r^R \gamma_r^T \gamma_r \\
&- \frac{1}{2\sigma_s^2}\sum_{(r,p)\in S^o}\Big(s_{rp} - \big((a_r \otimes d_p)^T\theta + (\gamma_0 + \gamma_r)^T d_p\big)\Big)^2 \\
&+ \sum_{r,n}^{R,N}\log\frac{\exp(a_{r z_{rn}^a})}{\sum_j \exp(a_{rj})} + \sum_{p,m}^{P,M}\log\frac{\exp(d_{p z_{pm}^d})}{\sum_j \exp(d_{pj})} \\
&+ \sum_{r,n}^{R,N}\log\beta_{z_{rn}^a, w_{rn}^a} + \sum_{p,m}^{P,M}\log\beta_{z_{pm}^d, w_{pm}^d} - \log Z(\Theta) \qquad (4.11)
\end{aligned}$$

where $S^o$ stands for the set of observed preferences and $Z(\Theta)$ is the normalizing term of the posterior, intractable in part because $a_r$ and $d_p$ cannot be analytically integrated out, as they are not conjugate to the distribution over topic assignments [13].

We address this computational issue by employing variational approximate inference. For each of the topic-proportion and regression variables $\{a_r\}, \{d_p\}, \{\gamma_r\}$, we use a Dirac delta posterior parameterized by its mode, $\{\hat{a}_r\}, \{\hat{d}_p\}, \{\hat{\gamma}_r\}$. For the topic-assignment variables $\{z^a\}, \{z^d\}$, we instead utilize a mean-field posterior. The full approximate posterior is therefore:

$$q(\{a_r\}, \{d_p\}, \{\gamma_r\}, \{z_r^a\}, \{z_p^d\} \mid \{\hat{a}_r\}, \{\hat{\gamma}_r\}, \{\hat{d}_p\}, \{\phi_r^a\}, \{\phi_p^d\}) = \prod_r^R \delta_{\hat{\gamma}_r}(\gamma_r)\; \prod_r^R \Big(\delta_{\hat{a}_r}(a_r) \prod_n^N \phi^a_{rn z_{rn}^a}\Big)\; \prod_p^P \Big(\delta_{\hat{d}_p}(d_p) \prod_m^M \phi^d_{pm z_{pm}^d}\Big)$$

where $\delta_\mu(x)$ is the delta function with mode $\mu$ and $\{\phi_r^a\}, \{\phi_p^d\}$ are the mean-field parameters (e.g., $\phi_r^a$ is a matrix whose entries $\phi^a_{rnj}$ are the probabilities that the $n$th word in user $r$'s library belongs to topic $j$).

Approximate inference entails finding the variational parameters $\{\hat{a}_r\}, \{\hat{d}_p\}, \{\hat{\gamma}_r\}, \{\phi_r^a\}, \{\phi_p^d\}$ that minimize the KL-divergence with the true posterior:

$$\begin{aligned}
\mathrm{KL} := &-\mathrm{E}_q[\mathcal{L}] - H(q) \\
= &\;\frac{1}{2\lambda_a}\sum_r^R \hat{a}_r^T \hat{a}_r + \frac{1}{2\lambda_d}\sum_p^P \hat{d}_p^T \hat{d}_p + \frac{1}{2\lambda_\gamma}\sum_r^R \hat{\gamma}_r^T \hat{\gamma}_r \\
&+ \frac{1}{2\sigma_s^2}\sum_{(r,p)\in S^o}\Big(s_{rp} - \big((\hat{a}_r \otimes \hat{d}_p)^T\theta + (\gamma_0 + \hat{\gamma}_r)^T \hat{d}_p\big)\Big)^2 \\
&- \sum_{r,n,k}^{R,N,K}\phi^a_{rnk}\left(\log\frac{\exp(\hat{a}_{rk})}{\sum_j \exp(\hat{a}_{rj})} + \log\beta_{k,w_{rn}^a} - \log\phi^a_{rnk}\right) \\
&- \sum_{p,m,k}^{P,M,K}\phi^d_{pmk}\left(\log\frac{\exp(\hat{d}_{pk})}{\sum_j \exp(\hat{d}_{pj})} + \log\beta_{k,w_{pm}^d} - \log\phi^d_{pmk}\right) \\
&+ \text{constant} \qquad (4.12)
\end{aligned}$$

Our strategy is to perform one pass of coordinate descent, optimizing each set of variational parameters given the others.6 For $\hat{\gamma}_r$, we obtain a closed-form update by differentiating the above equation and setting the result to 0:

$$\hat{\gamma}_r = \left(\sum_{p\in S^o(r)}\frac{\hat{d}_p\hat{d}_p^T}{\sigma_s^2} + \frac{1}{2\lambda_\gamma}I\right)^{-1} \frac{1}{\sigma_s^2}\sum_{p\in S^o(r)}\Big(s_{rp} - (\hat{d}_p \otimes \hat{a}_r)^T\theta - \hat{d}_p^T\gamma_0\Big)\,\hat{d}_p$$

where $S^o(r)$ is the set of indices for documents that user $r$ has rated. The $\{\hat{a}_r\}, \{\hat{d}_p\}$ parameters do not have closed-form solutions, hence we use conjugate gradient descent for their optimization. The derivatives with respect to the posterior KL are:

$$\frac{\partial \mathrm{KL}}{\partial \hat{a}_r} = \frac{\hat{a}_r}{\lambda_a} - \frac{1}{\sigma_s^2}\sum_{p\in S^o(r)}(s_{rp} - \hat{s}_{rp})(\hat{d}_p \otimes \theta) + N\,\frac{\exp(\hat{a}_r)}{\sum_j \exp(\hat{a}_{rj})} - \sum_n \phi^a_{rn}$$

$$\frac{\partial \mathrm{KL}}{\partial \hat{d}_p} = \frac{\hat{d}_p}{\lambda_d} - \frac{1}{\sigma_s^2}\sum_{r\in S^o(p)}(s_{rp} - \hat{s}_{rp})(\hat{a}_r \otimes \theta + \gamma_0 + \hat{\gamma}_r) + M\,\frac{\exp(\hat{d}_p)}{\sum_j \exp(\hat{d}_{pj})} - \sum_n \phi^d_{pn}$$

where $\hat{s}_{rp} = (\hat{a}_r \otimes \hat{d}_p)^T\theta + \hat{d}_p^T(\gamma_0 + \hat{\gamma}_r)$.

In practice, when optimizing for $\hat{a}_r$ and $\hat{d}_p$, we use the approach of Blei and Lafferty [13] and introduce additional variational parameters, $\{\zeta^a, \zeta^d\}$, which are used to bound the respective softmax denominators. For each user, $\zeta^a_r$ is updated using $\zeta^a_r = \sum_k^K \exp(\hat{a}_{rk})$, and similarly for the $\zeta^d_p$'s.

For the mean-field parameters $\{\phi^a_r\}, \{\phi^d_p\}$, minimizing the KL while enforcing normalization leads to the following solution:

$$\phi^a_{rnk} = \frac{\beta_{k,w_{rn}^a}\exp(\hat{a}_{rk})}{\sum_j \beta_{j,w_{rn}^a}\exp(\hat{a}_{rj})} \qquad\qquad \phi^d_{pmk} = \frac{\beta_{k,w_{pm}^d}\exp(\hat{d}_{pk})}{\sum_j \beta_{j,w_{pm}^d}\exp(\hat{d}_{pj})}$$

6 While we could cycle through all variational parameters until convergence before beginning the M-step, we have found a single pass of updates per E-step to work well in practice.

Algorithm 1 EM for the CSTM
Input: $\{w_r^a\}$, $\{w_p^d\}$, $\{s_{rp}\}_{(r,p)\in S^o}$.
while convergence criteria not met do
    # E-Step
    for all p = 1 ... P do
        Update $\hat{d}_p$, $\phi_p^d$
    end for
    for all r = 1 ... R do
        Update $\hat{a}_r$, $\hat{\gamma}_r$, $\phi_r^a$
    end for
    # M-Step
    Update $\theta$, $\gamma_0$, $\beta$
end while

We update the variational parameters of all users and subsequently of all documents.
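As an illustration of the E-step, the sketch below implements its two closed-form pieces, the $\hat{\gamma}_r$ update and the mean-field update for $\phi^a_r$, with NumPy (the conjugate-gradient updates of $\hat{a}_r$ and $\hat{d}_p$ are left out). Shapes and variable names are ours, and the sketch assumes a single user with a dense vector of observed scores.

    import numpy as np

    def update_gamma_r(D_hat, scores, a_hat, theta, gamma0, sigma_s2, lambda_gamma):
        """Closed-form update of the user-specific regression weights gamma_r.

        D_hat  : (n_rated, K) rows are d_hat_p for the papers user r has rated
        scores : (n_rated,) observed scores s_rp for those papers
        a_hat  : (K,) user r's topic proportions (variational mode)
        """
        residual = scores - (D_hat * a_hat) @ theta - D_hat @ gamma0
        A = D_hat.T @ D_hat / sigma_s2 + np.eye(len(theta)) / (2.0 * lambda_gamma)
        b = (residual @ D_hat) / sigma_s2
        return np.linalg.solve(A, b)

    def update_phi_a(beta, words, a_hat):
        """Mean-field update for the topic assignments of user r's library words.

        beta  : (K, V) topic-word distributions
        words : (N,) integer word indices of the user's library
        """
        logits = np.log(beta[:, words]) + a_hat[:, None]   # (K, N)
        logits -= logits.max(axis=0)                       # numerical stability
        phi = np.exp(logits)
        return (phi / phi.sum(axis=0)).T                   # (N, K), rows sum to one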

M-Step

The M-step aims to maximize the expectation of the complete likelihood under the variational posterior

(taking into account the prior over the parameters γ0, θ):

$$\begin{aligned}
\mathrm{E}_q[\mathcal{L}] + \log p(\gamma_0) + \log p(\theta) = &-\frac{1}{2\sigma_s^2}\sum_{(r,p)\in S^o}\Big(s_{rp} - \big((\hat{a}_r \otimes \hat{d}_p)^T\theta + (\gamma_0 + \hat{\gamma}_r)^T\hat{d}_p\big)\Big)^2 \\
&+ \sum_{r,n,k}^{R,N,K}\phi^a_{rnk}\log\beta_{k,w_{rn}^a} + \sum_{p,m,k}^{P,M,K}\phi^d_{pmk}\log\beta_{k,w_{pm}^d} \\
&- \frac{1}{2\lambda_{\gamma_0}}\gamma_0^T\gamma_0 - \frac{1}{2\lambda_\theta}\theta^T\theta + \text{constant}.
\end{aligned}$$

Setting the derivatives to zero (and satisfying the βjw parameters’ normalization constraints), we obtain the following updates:

$$\theta = \left(\sum_{(r,p)\in S^o}\frac{(\hat{d}_p \otimes \hat{a}_r)(\hat{d}_p \otimes \hat{a}_r)^T}{\sigma_s^2} + \frac{1}{\lambda_\theta}I\right)^{-1} \frac{1}{\sigma_s^2}\sum_{(r,p)\in S^o}\Big(s_{rp} - \hat{d}_p^T(\gamma_0 + \hat{\gamma}_r)\Big)(\hat{d}_p \otimes \hat{a}_r)$$

$$\gamma_0 = \left(\sum_{(r,p)\in S^o}\frac{\hat{d}_p\hat{d}_p^T}{\sigma_s^2} + \frac{1}{\lambda_{\gamma_0}}I\right)^{-1} \frac{1}{\sigma_s^2}\sum_{(r,p)\in S^o}\Big(s_{rp} - (\hat{d}_p \otimes \hat{a}_r)^T\theta - \hat{\gamma}_r^T\hat{d}_p\Big)\,\hat{d}_p$$

$$\beta_{jk} = \frac{\sum_{r,n}\phi^a_{rnj}\,\mathbb{1}\{w_{rn}^a = k\} + \sum_{p,m}\phi^d_{pmj}\,\mathbb{1}\{w_{pm}^d = k\}}{\sum_{k'}\left(\sum_{r,n}\phi^a_{rnj}\,\mathbb{1}\{w_{rn}^a = k'\} + \sum_{p,m}\phi^d_{pmj}\,\mathbb{1}\{w_{pm}^d = k'\}\right)}$$
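The $\beta$ update is simply a normalized count of expected word-topic assignments, pooled over libraries and documents. A minimal NumPy version is sketched below, assuming the mean-field matrices are stored per user and per document alongside integer word indices; no smoothing or zero-count guard is included.

    import numpy as np

    def update_beta(phi_a, words_a, phi_d, words_d, K, V):
        """Expected-count update of the K x V topic-word matrix beta.

        phi_a, phi_d     : lists of (N_i, K) mean-field matrices (users, documents)
        words_a, words_d : lists of matching (N_i,) integer word-index arrays
        """
        counts = np.zeros((K, V))
        for phi, words in list(zip(phi_a, words_a)) + list(zip(phi_d, words_d)):
            # scatter-add each word's topic responsibilities into its vocabulary column
            np.add.at(counts, (slice(None), words), phi.T)
        return counts / counts.sum(axis=1, keepdims=True)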

At test time, prediction of missing preferences is made using $\hat{s}_{rp}$, which is readily available.

Figure 4.5: A graphical representation of the CTR model (adapted from [126]).

4.5 Related Work

Previous work on hybrid collaborative filtering includes a few models that have combined item-only topic and regression models for user-item preference prediction. We are not aware of any earlier work that develops a text-based model of a user, nor one that combines user and item side information as in CSTM.

Agarwal and Chen [2] model several sources of side information, including item textual side information, using LDA. The topic-assignment proportions of documents ($\sum_m z_{dm}/M$ for all $d \in D$ documents) are used as item features and combined multiplicatively with user-specific features. The results are linearly combined with user demographic information to generate preferences.

Wang and Blei [126] also combine LDA with a regression model for the task of recommending scientific articles. Here the item topic proportions are used as a prior mean on normally-distributed item (regression) latent variables. User latent variables are also normally distributed from a zero-mean prior. A specific user-item score is then generated as the inner product of the item and user latent variables: $s_{rp} = a_r^T(d_p + \epsilon_p)$, where $\epsilon_p$ is drawn from a zero-mean normal distribution. The preference prediction model is the same as the one used in probabilistic matrix factorization [99]. Wang and Blei also report that a modified version of their model, analogous to the model of Agarwal and Chen [2], performed worse on their data. Shan and Banerjee [106] proposed a similar model without the bias term $\epsilon_p$ but used CTM [13].

The fact that we model an additional type of information (user textual side information) makes it difficult to directly compare our model to the ones above. In addition, the parametrization we use to predict preferences is very different from previous models. We initially experimented with a parametrization similar to Wang and Blei [126], albeit modified to also model user side information, and found it did not perform as well as CSTM (see the next section for experimental comparisons). Finally, Agarwal and Chen [1] propose a collaborative filtering model with side information. Although the form of the side information is not amenable to using topic models, the authors utilize a combination of linear models to obtain good performance in both cold and warm-start data regimes. They reported improved results in all the regimes they tried compared to pure collaborative filtering methods.

CSTM is also useful for predicting reviewer-paper affinities. Other researchers have looked at modelling reviewer expertise using collaborative filtering or topic models. We note the work of Conry et al. [29], which uses a collaborative filtering method along with side information about both papers and reviewers to predict reviewer-paper scores. Mimno and McCallum [79] developed a novel topic model to help predict reviewer expertise. Finally, Balog et al. [5] utilize a language model to evaluate the suitability of experts for various tasks.

4.6 Experiments

We first describe the three datasets used for our experiments. We then introduce the set of methods against which we perform empirical comparisons, ranging from pure CF methods to pure side-information methods. We report three separate sets of experiments. In the first we focus on the cold-start problem for new users and examine the effect of including user libraries. In the second we study how the methods perform on users with varying amounts of observed scores. Finally, we design a synthetic paper recommendation experiment and simulate the arrival of new cold-start users in order to test the value of using both the user library and the user-provided item scores.

4.6.1 Datasets

We evaluate the models using three datasets. NIPS-10 and ICML-12 were both introduced in Section 3.2.3. For NIPS-10, each user's library consists of his own previously published papers. Users have an average of 31 documents (std. 20). After some basic preprocessing (we removed words that either appear in fewer than 3 submissions or appear in more than 90% of all submissions), the joint vocabulary contains slightly over 18,000 words. For ICML-12, users have an average of 25 documents (std. 29) each and the joint vocabulary contains 16,201 words (identical pre-processing as for NIPS-10 above). Kobo: The third dataset is from Kobo, a large North American-based ebook retailer.7 The dataset contains 316 users and 2,601 documents (books). Users average 81 documents (std. 100). We removed very-infrequent and very-frequent words (those appearing in less than 1% or more than 95% of all documents). The resulting vocabulary contains 6,440 words. Users have a minimum of 15 expressed scores (mean 22, std. 6). The score distributions of each of the three datasets are shown in Figure 4.6.
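As an illustration of the preprocessing described above, the sketch below prunes a vocabulary by document frequency. The function name and the data layout are assumptions; the default thresholds mirror the 3-document / 90% rule used for the conference datasets (Kobo uses 1% / 95% instead).

```python
from collections import Counter

def prune_vocabulary(doc_tokens, min_docs=3, max_frac=0.90):
    """Keep words appearing in at least `min_docs` documents and in at most
    `max_frac` of all documents. `doc_tokens` is a list of token lists,
    one list per document (e.g., per submission)."""
    n_docs = len(doc_tokens)
    df = Counter(w for toks in doc_tokens for w in set(toks))   # document frequency
    return {w for w, c in df.items() if c >= min_docs and c <= max_frac * n_docs}
```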

4.6.2 Competing Models

We use empirical comparisons against other models to evaluate the performance of CSTM. For the competing models we re-use some of the general models previously introduced and introduce a few other

7http://www.kobo.com

(a) NIPS-10   (b) ICML-12   (c) Kobo

Figure 4.6: Score histograms. Compared to Figure 3.3, the histograms for NIPS-10 and ICML-12 are altered according to the user categorization described in Section 4.6.3.

Model  | User side information | Document side information | Shared Parameters
SLM-I  | X | X |
SLM-II | X | X |
LR     |   | X |
PMF    |   |   | X
CTR    |   | X | X
CSTM   | X | X | X

Table 4.1: A comparison of the modelling capabilities of each model. “Shared Params” stands for models that share information between users and/or items (in other words those which use some form of CF).

ad-hoc models for the task. Specifically, each competing model has particular characteristics (Table 4.1) which will help in understanding CSTM's performance. Note that we use topic representations of documents for competing models that use side information. Such representations were learned offline using a correlated topic model [13]. We re-use some of our previous notation to describe these models. Namely, a_u and d_d are K-length vectors which designate a user's and a document's (topic) representation respectively.

Constant: The constant model predicts the average observed score for all missing preferences. Comparison to this baseline is useful to evaluate the value of learning.

Supervised language model I (SLM-I): This model is a supervised version of the LM: ŝ_rp := (a_u^T θ_A)(d_d^T θ_D)^T, where the parameters θ_A and θ_D are K × F matrices. F is a hyper-parameter determined using a validation set (it ranges from 5 to 30 in our experiments).

Supervised language model II (SLM-II): This model uses isotonic regression (see, for example, [10]) to calibrate the LM. The idea is to learn a regression model that satisfies the implicit ranking established by the LM:

\min_{\hat{s}_{rp},\,\forall r,p} \sum_{(r,p)\in S^o} (\hat{s}_{rp} - s_{rp})^2 \quad \text{subject to} \quad \hat{s}_{rp} \le \hat{s}_{r(p+1)},\ \forall p,

where the constraints enforce a user-specific document ordering specified by the output of the LM. Once learned, the set of regression parameters {ŝ} is used as the model's predictions. To obtain a prediction for an unobserved document we have found that taking the average of the (predicted) scores of the two (observed) documents ranked, according to the LM, directly above and below the new document works well. The regression is user-specific and therefore cannot be used for users with no observed preferences. For such users we simply re-use the learned parameters of their closest user. We leave further research into a more principled approach, for example a collaborative one, for future work.

LR: The linear regression model was introduced in Section 3.2.2. To recapitulate, it is a user-specific regression model where predictions are given by s_rp = γ_u^T d_d.

PMF [99]: PMF was introduced in Section 2.2.1. We remind the reader that it is a state-of-the-art collaborative filtering approach and that it does not model the side information. The size of the latent space is determined using a validation set (it ranges from 1 to 30).

Collaborative topic regression (CTR): CTR is the matrix factorization model with a document-content model introduced by Wang and Blei [126]. CTR was briefly reviewed in Section 4.5. We use a slightly different version than the one introduced by its authors. Namely, we have replaced LDA by CTM. Also, since in our application all user-item scores are given, we use a single variance value over scores (σ_s).

For SLM-I, LR, and PMF, learning is performed using a variational approximation with a Gaussian likelihood model and zero-mean Gaussian priors over the model's parameters. The prior variances are determined using a validation set.

Finally, we investigated a few other models. Of note: instead of modelling user libraries as side information we used the documents of user libraries as observed highly-scored items. We experimented with various scoring schemes but none led to consistent improvements over the baselines described above. We also experimented with replacing directed topic models with a supervised extension of an undirected topic model [97]. However these methods did not perform well and are not discussed further.
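The per-user calibration used by SLM-II can be sketched with an off-the-shelf isotonic regression. The snippet below uses scikit-learn's IsotonicRegression; fitting the predictions as a monotone function of the LM score minimizes the squared error subject to the LM-induced ordering, as in the program above. The helper name is an assumption.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def slm2_calibrate(lm_scores, observed_scores):
    """Per-user isotonic calibration of LM scores (a sketch of SLM-II).

    lm_scores       : (n,) LM suitabilities for this user's observed documents
    observed_scores : (n,) the user's observed scores for those documents
    """
    iso = IsotonicRegression(increasing=True, out_of_bounds="clip")
    iso.fit(lm_scores, observed_scores)   # monotone least-squares fit
    return iso

# For an unseen document, iso.predict([lm_score]) interpolates between the
# calibrated scores of neighbouring observed documents, which is close to the
# above-and-below averaging heuristic described in the text.
```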

4.6.3 Results

To run CSTM on the above datasets we first concatenated each user's library (for example a reviewer's previously published papers) into a single document. The content of the resulting document can then be used as that user's side information (w_r) in CSTM. To get user and item topic proportions we learned a CTM topic model [13] using the content of the items and then projected user documents into that space to obtain user topic proportions. We directly used these topics in SLM-I, SLM-II and LR. We also used these topics as initialization in those models which jointly learn topics and scores (CSTM and CTR). In all experiments we use 30 topics. For training we create 5 folds from the available scores. Each fold is split into 80 percent observed and 20 percent test data. We used the first fold to determine the hyper-parameters of the model. We report the average results over the five folds as well as the variance of this estimator.

We want to evaluate the performance of CSTM in settings where some users have no observed scores. The cold-start setting is of particular practical importance and one that should enable a good model to leverage the user's side information. Accordingly, in our datasets we randomly selected one fourth of all users and removed all of their observed scores for training but kept their test scores (their side information remains available at training). Further, for NIPS-10 and Kobo, whose users have a more uniform number of ratings, we binned the remaining users (three quarters) uniformly into three categories. For NIPS-10, users in each category had 15, 30 and 55 observed scores respectively. In each of the three categories 5 ratings per user were kept for validation. For Kobo, users in the first two categories had 8 and 10 scores while the scores of users in the last category were left untouched

         | NIPS-10         | ICML-12         | Kobo
Constant | 0.4378±2×10−3   | 0.6386±4×10−5   | 0.6882±5×10−4
SLM-I    | 0.4684±2×10−3   | 0.7903±4×10−5   | 0.6873±1×10−3
SLM-II   | 0.4696±3×10−4   | 0.7752±1×10−4   | 0.6926±6×10−4
CSTM     | 0.4846±1×10−3   | 0.8096±1×10−4   | 0.7243±2×10−4

Table 4.2: Comparisons between CSTM and competitors for cold-start users using NDCG@5. We report the mean NDCG@5 value and the variance over the five training folds.

(5 scores per user were kept for validation). For ICML-12, since users are already naturally distributed into categories, we split the observed data into 25 percent validation and 75 percent train. For the next two experiments, for each dataset, we train each model on all of the data but we divide our discussion into two parts. First we discuss cold-start users, and then we examine the (other) user categories.

Cold-Start Data Regime

We first report results for the completely cold-start data regime. As a reminder, this simulates new users entering the system with their side information: user and item side information is available but scores are not. For cold-start users it is difficult to calibrate the output of the model to the correct score range since only the users' side information is available. The expectation is that the models can use the side information to get a better understanding of users' preferences and discriminate between items of interest. Accordingly, we report results using Normalized DCG (NDCG), a well-established ranking measure (see Equation 2.9), where a value of 1 indicates a perfect ranking and 0 a reverse-ordered perfect ranking [57]. NDCG@T considers exclusively the top T items. Table 4.2 reports results for the three datasets using NDCG@5 (other values of T gave similar results). We can only report results for the methods that are able to predict scores for cold-start users: PMF, LR, and CTR do not use any user side information and hence do not have that ability. In this challenging setting CSTM significantly outperforms the other methods. Further, we see that methods using side information typically outperform the constant baseline. This demonstrates that useful information about user preferences can be extracted from the user libraries, and the good performance of CSTM in this setting shows that the model is able to leverage that information.
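For reference, a common implementation of NDCG@T is sketched below. The exact definition used in the thesis is its Equation 2.9, which may differ in the gain and discount details, so this is only an illustrative variant and the function name is an assumption.

```python
import numpy as np

def ndcg_at_t(pred, truth, T=5):
    """NDCG@T for one user (exponential-gain variant).

    pred  : (n,) predicted scores used to rank the user's items
    truth : (n,) ground-truth suitability scores (0-3 here)
    """
    top = np.argsort(-pred)[:T]                         # top-T items by prediction
    discounts = 1.0 / np.log2(np.arange(2, T + 2))
    dcg = np.sum((2.0 ** truth[top] - 1.0) * discounts[:len(top)])
    ideal = np.sort(truth)[::-1][:T]                    # best possible ordering
    idcg = np.sum((2.0 ** ideal - 1.0) * discounts[:len(ideal)])
    return dcg / idcg if idcg > 0 else 0.0
```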

Warm-start data regimes

The goal of CSTM is to perform well across different data regimes. In the previous section we examined the performance of several methods on cold-start users; we now focus on users with observed scores. For each dataset we report the performance of the various methods for each user category. For ICML-12 we separated users into roughly equal-sized bins according to their number of observed scores. Results for the three datasets are provided in Figure 4.7. First we note that as the number of observed scores increases, the performance of the different methods also increases. CSTM outperforms all other methods in lower data regimes. On users with more observed scores CSTM is competitive with both CTR and LR. We note that overall in this task, and even when many observed preferences are available, PMF is not competitive with most of the methods that have access to the side information.

Legend: Constant, LM-I, LM-II, PMF, LR, CTR, CSTM. Panels: (a) NIPS-10 data set (user categories: 15, 30, 55 observed scores), (b) ICML-12 data set (>0,<=20; >20,<=30; Others), (c) Kobo data set (>0,<=8; >8,<=10; Others).

Figure 4.7: Test RMSE of the different methods across the different datasets. For each dataset, we report results for the three subsets of users with different numbers of observed scores (categories). Figures are better seen in color (the ordering in the legend corresponds to the ordering of the bars in each group).


Figure 4.8: Averaged relative norm of the parameters for users with varying numbers of scores (left: NIPS-10, right: ICML-12).

This highlights the value of content side information on both the user and item sides. This is further made clear by the relatively strong performance of both SLM-I and SLM-II. Overall, user libraries do not seem to help as much on the Kobo dataset. There are several explanations for this. First, in Kobo the distribution over scores is very skewed toward high scores, so a constant baseline does quite well. Further, bag-of-words representations are particularly well suited for academic papers, where the presence (or absence) of specific words is a very good indication of a document's field and hence of its targeted audience. In (non-technical) books, however, user preferences also rely on other aspects, such as the document's prose, which are harder to capture in a bag-of-words topic model.

Tradeoff of side information with observed scores

In Section 4.4 we motivated the specific parametrization of CSTM by its ability to trade off the influence of the user-library side information versus that of the user-item scores. Here we show that learning in our model performs as expected. Figure 4.8 reports the relative norm of the compatibility parameters θ versus the (shared and individual) user parameters (γ_0 + γ) as a function of the number of observed scores: |θ| / |γ_0 + γ|. As hypothesized, as the number of observed scores increases, the relative weight of the user-library side information decreases.

Results on original ICML-12 dataset

For completeness we also report results on the original, unmodified version of ICML-12 (that is, a version without the artificial 25 percent cold-start users; see Section 4.6.3). Accordingly, users are now binned differently. In Figure 4.9 and Table 4.3 we provide comparisons of the different methods on the original version of ICML-12. These results further display the good performance of CSTM in all data regimes. As in the previous experiments with ICML-12, CSTM is only outperformed by CTR for reviewers with many training scores.

User categories on the x-axis: <=16, >16,<=27, >27,<=40, Others.

Figure 4.9: RMSE results on the unmodified ICML-12 dataset.

         | ICML-12
Constant | 0.8950±5×10−4
SLM-I    | 0.9301±2×10−4
SLM-II   | 0.9308±4×10−4
CSTM     | 0.9409±3×10−4

Table 4.3: For the unmodified ICML-12 dataset, comparisons between CSTM and competitors for cold-start users using NDCG@5.

Variations of CSTM

We also experimented with variations of CSTM to better understand the roles played by the different aspects of the model and its training.

CSTM fixed topics (CSTM-FT): This model uses the exact preference regression model used by CSTM but with fixed user-topic and document-topic representations; that is, it predicts preferences with r_ud = (a_u ⊗ d_d)^T θ + d_d^T (γ_0 + γ), where a_u and d_d are learned offline beforehand.

CSTM no user side information (CSTM-NUSI): To evaluate the gain of using user side information we experimented with a version of our model that does not model user side information (i.e., as if a user did not have any documents). Specifically, in this model a_r ≡ 0 for all users.

We provide some results comparing CSTM with its variations in Table 4.4. We notice that it is the synergy of the user side information and the joint training of the model that explains CSTM's performance.

          | NIPS-10         | ICML-12         | ICML-12 (original) | Kobo
CSTM-NUSI | 0.4941±4×10−4   | 0.7765±9×10−5   | 0.8048±9×10−4      | 0.7997±8×10−5
CSTM-FT   | 0.4984±2×10−4   | 0.8036±5×10−5   | 0.8066±8×10−5      | 0.8026±8×10−5
CSTM      | 0.5016±2×10−4   | 0.8217±2×10−5   | 0.8322±2×10−4      | 0.8037±2×10−5

Table 4.4: Comparisons between CSTM and two variations. Results report NDCG@5 over all users.

Poster Recommendations

We explore a different scenario which is meant to simulate what would happen when a model is deployed in a complete recommender system, for example to guide users to posters of interest at an academic conference. Specifically, we evaluate the performance of CTR and CSTM as new users arrive in the system and gradually provide information about themselves. We postulate that users first provide the system with their library. Then users gradually express their preferences for certain (user-chosen) items. We trained CTR and CSTM on all but 50 randomly-chosen ICML-12 users, restricting our attention to users with at least 15 observed scores. We then simulated these users entering the system and evaluated their individual impact. In this experiment we want to recommend a few top-ranked items to each user. In terms of our recommendation framework in Figure 4.1, the recommendation of the personalized-best items is our ultimate objective F(·). Therefore we evaluate the system's performance using NDCG (on held-out data).

Figure 4.10 presents the performance of CSTM and CTR as a function of the amount of data available in the system. When a user first enters the system no data is available about him (indicated by "0" in the figure). The methods then revert to using a constant predictor which predicts the mean of the previously observed scores across all users. Once a user provides a library (Lib.), we see that CSTM's performance increases very significantly; CTR cannot leverage that side information. Then, once users provide scores, the performance of both methods increases and the performance of CTR eventually reaches that of CSTM.

Figure 4.10 demonstrates the advantage of having access to user side information, namely that the system can quickly give good recommendations to new users. Further, in absolute terms the system performs relatively well without having access to any scores. It is also interesting to note that, in this experiment, as far as NDCG goes, the performance of CSTM only modestly improves as the number of observed scores increases. This may be a consequence of our fairly primitive online learning procedure. As far as modelling goes, this experiment also demonstrates that our model of user libraries is effective at extracting features (a_r for all users) indicative of preferences, and that the regression model (Equation 4.10) then successfully combines the user and item side information.

4.7 Conclusion and Future Opportunities

We have introduced a novel graphical model to leverage user libraries for preference prediction tasks. We showed experimentally that CSTM overall outperforms competing methods and can leverage the information of other users and of user libraries to perform particularly well in cold-start regimes. We also explored a paper recommendation task and demonstrated the positive impact of having access to user libraries on recommendation quality. Overall, we showed that using the content of items outperforms state-of-the-art (content-less) collaborative filtering methods in both cold and warm start regimes. Furthermore, we showed that using user-item side information is also a win in many cases. Finally, user-item side information is essential to quickly provide good recommendations to new users.

Future work offers both immediate and longer-term possibilities. In the near term, we could refine the inference procedure used in training our model, for example by using a fully variational approach and by leveraging the latest inference procedures for non-conjugate models such as CTM [127].

An important question is what is missing from CSTM before it can be used in real life. Of practical importance are a model's scalability properties.


Figure 4.10: Comparison of CSTM and CTR’s NDCG@10 performance on new users as a function of the amount of data provided by users. The x-axis denotes what user data is available (Lib. stands for user libraries while integer values denote the number of available user scores). Without any user data (labelled 0 on the x-axis) both methods revert to a constant predictor.

Currently, with respect to the datasets used in this chapter, CSTM could be trained on larger datasets but is still far from being able to learn from datasets that may be common in industry. To give an indication of training time, the current implementation of our model, written in Matlab, can be fitted to our datasets within a few hours on a modern computer. Currently, our inference procedure must, at each iteration, iterate through all users and all items. One interesting avenue to improve scaling (in terms of the number of users and items) is to use stochastic variational inference [47, 23]. The basic intuition behind the stochastic optimization methods used in machine learning, for example stochastic gradient descent, is to allow multiple parameter updates per training pass. This stands in contrast to Algorithm 1, where we update the model parameters after having examined all observed preferences. Stochastic optimization techniques are very commonly used to train (collaborative-filtering) models on large datasets (for example, [99]). For our purposes, we could imagine sampling n users, along with the documents that these users have rated, per iteration and using the sufficient statistics from these n users to update the global parameters. The training time per iteration would then be reduced by at least an order of R/n (where R is the total number of users). It is still difficult to evaluate the resulting size of datasets that CSTM could then be fitted to, since the number of iterations can only be determined empirically.

Related to the above is the question of using CSTM as the main model inside TPMS (for example in place of the LM or of the model depicted in Equation 3.5). CSTM is designed for this exact domain and, based on the experiments of this chapter, it could improve the performance of the system. Before changing the model used in TPMS we will need to evaluate CSTM against the model of Equation 3.5. Such comparisons, since CSTM is novel, do not yet exist. Furthermore, there are practical aspects which do not favour the inclusion of CSTM. First and foremost, many conferences rely only on user and item content and never elicit expertise from reviewers. Furthermore, conference organizers are often working under stringent time constraints which may favour the use of simpler models. For example, the LM is fast to fit and we have found settings of its hyper-parameters that work well across all conferences. On the other hand, both CSTM and the model of Equation 3.5 require tuning of their hyper-parameters, which can be time-consuming. One further aspect to consider is how much the empirical gains of CSTM translate to perceivable differences in reviewer scores and ultimately in reviewer assignments. As an indication, the next chapter shows that, on our datasets, there is a strong correlation between prediction performance and matching performance. Thus, given our current experiments, when time permits, CSTM seems like a very good candidate for inclusion in TPMS.

A second aspect of practical importance is that once we move to online recommendation, models must also be able to adapt to new data, including novel items and users, updates to user libraries, and new user-item scores. In the poster recommendation experiment we have seen that a simple conditional inference method works relatively well for novel users. However, one would also like to use the information from novel users to learn better representations of all users.
In other words, we would need a mechanism which updates the model parameters once a sufficient amount of new data is available. Furthermore, we could refine such a method, inter alia, to allow the system to adapt to the evolving preferences of users over time. For example, Agarwal and Chen [1] propose a decaying mechanism to emphasize more recent scores over older ones. A similar mechanism could be used to weight the different documents in a user's library (for example based on date of publication for research papers or purchase date for books).

There is also the question of other potential applications for which CSTM could be useful. In addition to modelling text, topic models have also been shown to model images [33]. CSTM could then be used as an image recommendation tool (for example for photographers). In that case, much like for the books of the Kobo dataset, it remains to be seen whether topic models can capture features of images which are indicative of preferences. Another application for CSTM is modelling the interests of legislators, who, similarly to academic reviewers, write and express their preferences about proposed laws. There are also novel aspects in this domain, such as the effects of party lines which, for certain bills, may create voting correlations that have to be taken into account to correctly determine legislators' true preferences.

Finally, the approach behind CSTM, which allows it to smoothly interpolate between side information and preferences, could be refined. In its current form, if some user documents are not useful to predict preferences, the model must either: a) adjust the global compatibility parameters θ, thereby changing the predictions of all users; b) increase the user's individual parameters γ_r while keeping the calibration of predicted preferences to observed ones; c) change the representation of user libraries, thereby changing the topic model; or d) use a combination of the above. A more pleasing solution would allow the model to independently adjust the importance of user libraries, and perhaps even of individual documents within a library. Overall, interpolating from side-information-based recommendations to preference-based ones requires additional attention.

Chapter 5

Learning and Matching in the Constrained Recommendation Framework

In the majority of recommender systems the predicted preferences enable the recommendation stage. The recommendation step, the ultimate step in our constrained-recommendation framework, can assume various forms. In particular it may include several different constraints and objectives in addition to the initial preference-prediction objective (the loss function associated with the learning method). This chapter is devoted to studying this last step and the interactions between the learning and recommendation stages. Motivated by the reviewer-paper assignment problem, we also develop and explore an approach to constrained recommendation consisting of a matching between reviewers and papers. We begin by detailing the design choices of our proposed recommendation framework. We then look at an instance of the framework for matching reviewers to papers. We especially focus on the development of the matching stage and present several experimental results on matching problems. The experiments aim to compare different learning models on the matching task, and also explore different matching formulations motivated by real-life requirements. Finally, we demonstrate synergies between matching and learning.

5.1 Learning and Recommendations

For readability purposes we reproduce the constrained-recommendation framework in Figure 5.1. We remind the reader of the three stages of the framework: 1) elicitation of user information; 2) preference prediction using the previously elicited preferences and side information; and 3) determination of recommendations using domain-specific considerations (objective and constraints). Separating the preference prediction stage and the recommendation stage implies the use of two separate objective functions. Ideally we would rather use a single objective which encompasses both the ideals of the learning objective (e.g., accurate preference predictions) and the ideals of the recommendation objective (e.g., learn a model which will give good recommendations). In practice this is usually impossible for two reasons:



Figure 5.1: Flow chart depicting a constrained recommender system.

1) The optimization stage can involve solving a complex, for example non-continuous, optimization problem with constraints which we cannot express as a learning objective; 2) learning the parameters of a predictive model is likely the most expensive step of the three stages, and once learned it is useful if its predictions can be reused in different optimization problems. The reviewer-to-paper matching system of Chapter 3 is a good example of the latter, as conference organizers will often use TPMS scores multiple times while exploring their preferences over the various matching constraints. Further, in machine learning it is sometimes empirically advantageous to learn a model using a simpler loss function than the one required by the task. A good example, which we reviewed in Section 2.2.2, occurs in the field of learning to rank for recommender systems, a recommendation task requiring a straightforward sorting procedure as its last stage. Certain state-of-the-art methods often split the learning into two different steps (akin to the two stages of our framework): in the first step they learn a standard preference prediction model, the output of which is then used to train a model using a (domain-specific) ranking loss [4]. The separation of the prediction and optimization stages does not imply that the two stages cannot work in synergy when possible. Specifically, defining a loss function that is sensitive to the final objective may provide performance gains. In fact we explore some of these synergies for the matching problem in Section 5.4.4.1

5.2 Matching Instantiation

Motivated by the reviewer to paper matching problem we explore different formulations of an assignment or matching problem to optimally assign papers to reviewers given some constraints. In other words, as shown in Figure 5.2, the third stage of our constrained recommendation framework is a matching problem. Concretely, we frame the assignment problem as an integer program [122], and explore several variations that reflect different desiderata, and how these interact with various learning methods. We test our framework on two data sets collected from a large AI conference, measuring predictive accuracy with respect to both reviewer suitability score and matching performance, and exploring several different matching objectives and how they can be traded off against one another.

1Synergies between the elicitation and the optimization steps, which we show to be even more important experimentally, will be explored in Chapter 6.


Figure 5.2: Match-constrained recommendation framework.

Although we focus on reviewer matching, our methods are applicable to any constrained matching domain where: (a) preferences can be used to improve matching quality; (b) it is infeasible or undesirable for users to express preferences over all items; and (c) capacity or other constraints limit the min/max number of users per item (or vice versa). Examples include facility location, school/college admissions, certain forms of scheduling and time-tabling, and many others. We reuse the notation established in previous chapters. As a reminder, we denote the matrix of observed scores, or reviewer-paper suitabilities, by S^o. Further, we denote the observed scores for a particular reviewer r and paper p by S^o_r and S^o_p, respectively. S^u, S^u_r, and S^u_p are the analogous collections of unobserved scores. Given this information, our goal is to find a "good" matching of papers to reviewers in the presence of incomplete information about reviewer suitabilities, possibly exploiting the side information available.

5.2.1 Matching Objectives

We articulate several different criteria that may influence the definition of a "good" matching and explore different formulations of the optimization problem that can be used to accommodate these criteria. We also discuss how these criteria may interact with our learning methods. Naturally, one would like to assign submitted papers to their most suitable reviewers; of course, this is almost never possible since some reviewers will be well suited to far more papers than other reviewers. In general, load balancing is enforced by placing an upper limit or maximum on the number of papers per reviewer. Similarly, we may impose a minimum to ensure reasonable load equity or load fairness across reviewers. However, limiting the paper load increases the probability that certain papers will be assigned to very unsuitable reviewers. This suggests only making assignments involving pairs with score s_rp above some minimum score threshold. This ensures that every paper is reviewed by a minimally suitable reviewer, but may sacrifice load equity (indeed, it may sacrifice feasibility). One may also desire suitability fairness across reviewers; that is, reviewers should have similar score distributions over their assigned papers (so that on average no reviewer is assigned significantly more papers for which he is poorly suited than any other reviewer). Finally, when multiple reviewers are assigned to a paper, it may be desirable to assign complementary reviewers so as to cover the range of topics spanned by a submission. Related is the desire to ensure each paper is reviewed by at least one "well-suited" reviewer. The intricacies of different conferences prevent us from establishing an exhaustive list of matching desiderata (see [8, 42, 37] for further discussion). We now explore matching mechanisms that account for several of these criteria: we frame the matching procedure as an optimization problem and show how several properties can be formulated as constraints or modifications of the objective function. We formulate the basic matching problem as an integer program (IP), where each paper is assigned to its best-suited reviewers given the constraints [122]:

\text{maximize} \quad J^{basic}(Y,S) = \sum_r \sum_p s_{rp} y_{rp} \qquad (5.1)

\text{subject to} \quad y_{rp} \in \{0, 1\}, \ \forall r, p \qquad (5.2)

\sum_r y_{rp} = R_{target}, \ \forall p \qquad (5.3)

The binary variable y_rp encodes the matching of item p to user r; a match is an instantiation of these variables. Y is the match matrix Y = {y_rp}_{r∈R, p∈P} and, similarly, S is the score matrix S = {s_rp}_{r∈R, p∈P}. J^basic(Y, S) denotes the value of the objective of the IP with match matrix Y and score matrix S. R_target is the desired number of reviewers per paper. Minimum and maximum reviewer load, P_min and P_max respectively, can be incorporated as constraints [122]:

\sum_p y_{rp} \ge P_{min}, \quad \sum_p y_{rp} \le P_{max}, \quad \forall r. \qquad (5.4)

This IP, including constraints (5.4), is our basic formulation (Basic IP). Its solution, the optimal match, maximizes total reviewer suitability given the constraints. Although IPs can be computationally difficult, our constraint matrix is totally unimodular, so the linear program (LP) relaxation (allowing y_rp ∈ [0, 1]) does not affect the integrality of the optimal solution; hence the problem can be solved as an LP. This can be understood as follows. The constraints define a feasible set Ay ≤ b. Each constraint is linear, thus the feasible set is a polyhedron. Since the constraint matrix A is totally unimodular [122] and b is an integer vector, the vertices of that polyhedron have integer coordinates. An LP's objective function is also linear and therefore its optimum corresponds to one, or more, of the polyhedron's vertices. Although not mentioned above, it is essential for the matching to prevent assignments of reviewers to submitted papers for which they have conflicts of interest (COI). The above formulation can easily enforce known COI by directly constraining the conflicting assignments' y_rp variables to be 0; alternatively, we can set the relevant scores s_rp to −∞.
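As an illustration, the LP relaxation discussed above can be solved with an off-the-shelf solver. The sketch below uses scipy.optimize.linprog; the function name, variable layout, and default load limits are illustrative assumptions rather than the system's actual implementation.

```python
import numpy as np
from scipy.optimize import linprog

def basic_match(S, r_target=1, p_min=20, p_max=30):
    """Solve the LP relaxation of the Basic IP (Eqs. 5.1-5.4); by total
    unimodularity the optimum is integral. S is an (R, P) score matrix and
    y_rp lives at flat index r * P + p. COI pairs can be handled by fixing
    their score to a large negative value or their bounds to (0, 0)."""
    R, P = S.shape
    c = -S.ravel()                                   # linprog minimizes, so negate
    A_eq = np.zeros((P, R * P))                      # sum_r y_rp = r_target, for all p
    for p in range(P):
        A_eq[p, p::P] = 1.0
    b_eq = np.full(P, float(r_target))
    A_load = np.zeros((R, R * P))                    # sum_p y_rp, one row per reviewer
    for r in range(R):
        A_load[r, r * P:(r + 1) * P] = 1.0
    A_ub = np.vstack([A_load, -A_load])              # <= p_max and >= p_min
    b_ub = np.concatenate([np.full(R, float(p_max)), np.full(R, -float(p_min))])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, 1), method="highs")
    return np.rint(res.x).reshape(R, P).astype(int)
```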

To capture additional matching desiderata, we can modify the objective or the constraints of this IP. Load balancing can be controlled by manipulating P_min and P_max: a small range ensures each reviewer is assigned roughly the same number of papers at the expense of match quality, while a larger range does the converse. We can instead enforce load equity by making the tradeoff explicit in the objective with "soft constraints" on load:

J^{balance}(Y,S) = \sum_r \sum_p s_{rp} y_{rp} - \lambda \sum_r f\Big(\sum_p y_{rp} - \bar{y}\Big) \qquad (5.5)

where ȳ is the average number of papers per reviewer (M/N) and f is a penalty function (for example, f(x) = |x| or f(x) = x²). The parameter λ controls the tradeoff between load equity and match quality. The J^balance objective (Equation 5.5) along with the constraints expressed in Equation 5.2 comprise our Balance IP. Note that if f(x) is nonlinear, then the Balance IP becomes a nonlinear optimization problem.
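Since this objective is also used later to evaluate matches, a direct implementation of Equation 5.5 is useful; the following is a minimal sketch in which the function name and the way ȳ is computed are assumptions.

```python
import numpy as np

def j_balance(Y, S, lam, f=np.abs):
    """Evaluate J^balance (Eq. 5.5) for a given match.

    Y, S : (R, P) match matrix and score matrix
    lam  : load-equity tradeoff parameter lambda
    f    : penalty function on load deviation (np.abs or np.square)
    """
    y_bar = Y.sum() / Y.shape[0]          # average papers per reviewer
    load_penalty = f(Y.sum(axis=1) - y_bar).sum()
    return float((S * Y).sum() - lam * load_penalty)
```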

The J^basic objective (Equation 5.1) maximizes the overall suitability of the assignments, equating "utility" with suitability. However, the utility of a specific match y_rp may not be linear in the suitability s_rp. For example, utility may be more "binary": as long as a paper is assigned to a reviewer whose suitability is above a certain threshold, the assignment is good, otherwise it is not. This can be realized by applying some non-linear transformation g to the scores in the matching objective (for example, a matched pair with score s_rp ∈ {2, 3} may be greatly preferred to one with s_rp ∈ {0, 1}):

J^{transformed}(Y,S) = \sum_r \sum_p g(s_{rp})\, y_{rp}. \qquad (5.6)

In this transformed objective J^transformed, if g is a logistic function then scores are softly "binarized." Note that g(s_rp) can be evaluated offline and therefore J^transformed can still be used as the objective of an IP. Finally, we note that some of these matching objectives can also be incorporated into the suitability prediction model. For example, the nonlinear transformation g can be directly used in the learning (training) objective (for example, instead of vanilla RMSE):

C_{LR\text{-}TFM}(S^o) = \frac{1}{|S^o|} \sum_{s_{rp} \in S^o} \big(\hat{s}_{rp} - g(s_{rp})\big)^2. \qquad (5.7)
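A minimal sketch of this transformed loss for a per-user linear regressor (the LR-TFM variant used later in the experiments) is shown below; the variable names and shapes are assumptions.

```python
import numpy as np

def lr_tfm_loss(gamma_u, D_obs, s_obs, g):
    """Transformed squared loss of Eq. 5.7 for one user's linear model.

    gamma_u : (K,) regression weights for user u
    D_obs   : (n, K) topic features of the user's observed documents
    s_obs   : (n,) observed suitability scores
    g       : the nonlinear score transformation used in the matching objective
    """
    s_hat = D_obs @ gamma_u
    return np.mean((s_hat - g(s_obs)) ** 2)
```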

5.3 Related Work on Matching Expert Users to Items

We briefly reviewed some of the connections between our work and expertise retrieval in Section 3.3.1. Here we broaden our horizons and discuss research that deals explicitly with matching expert users with suitable items, including research from the expertise retrieval field.

Stern et al. [115] propose to use a hybrid recommender system to assign experts to specific tasks. The tasks are (hard) combinatorial problems and the set of experts comprises different algorithms that may be able to solve these tasks. The novelty of this approach is that new unseen tasks appear all the time, so side information (features of the tasks represented by certain properties of the combinatorial problems) must be used to relate a new task to other tasks and their experts. A Bayesian model is used to combine collaborative filtering and content-based recommendations [114].

Matching reviewers to papers in scientific conferences has also received some attention [8, 42, 37]. Recently, Conry et al. [29] showed how this domain could benefit from using a collaborative filtering approach. This approach is conceptually similar to the one used by TPMS in Chapter 3. As a reminder, imagine that reviewers have assessed their expertise (or preferences) for a subset of papers. CF can be used to fill in the missing scores (or ratings). Once all scores are known, the objective is then to assign the best reviewers to each paper. Doing so without further constraints could create large imbalances between reviewers with broader expertise, say more senior reviewers, and those with less expertise. One solution is to post constraints on the maximum number of papers reviewers can be assigned to. Likewise, papers must be reviewed by a minimum number of reviewers. The final objective function that must be optimized has constraints across users (reviewers) and items (papers). Ideally, the collaborative filtering would concentrate on accurately predicting scores corresponding to reviewer-paper pairs that will end up being matched. This is not easy as the matching objective is typically not continuous. Conry et al. [29] have studied this problem without integrating both steps. Rodriguez and Bollen [91] have built co-authorship graphs using the references within submissions in order to suggest initial reviewers. In Karimzadehgan et al. [60], the authors argue that reviewers assigned to a submission should cover all the different aspects of the submission. They introduce several methods, including ones based on modelling papers and reviewers using topic models, to attain good (topic) coverage. Our CSTM model (Chapter 4) is similar in the sense that it models reviewers and papers using a topic model. However, in CSTM reviewer preferences are determined according to the global suitability of the reviewer, and not explicitly according to his coverage with respect to a paper's topics. In follow-up work, Karimzadehgan and Zhai [59] and Tang et al. [118] explore similar coverage ideas and show how they can be incorporated as part of a matching optimization problem (akin to the one we present in Section 5.2.1).

A second body of work focuses exclusively on the matching problem itself. Benferhat and Lang [8], Goldsmith and Sloan [42], and Garg et al. [37] discuss various optimization criteria and some of the practices used by program chairs and existing conference management software. Taylor [122] shows how these criteria can be formulated as an IP. Tang et al.
[117] propose several extensions to the IP. This work assumes reviewer suitability for each paper is known, and deals exclusively with specific matching criteria.

5.4 Empirical Results

We start by describing the data sets used in our experiments. The rest of the section is divided into three parts. The first considers score predictions with the different learning models. The second turns to matching quality and explores the soft constraints on the number of papers matched per reviewer. Finally, the third part evaluates a transformation of the matching objective and shows how using a transformed learning objective can enhance performance on the transformed matching problem.

5.4.1 Data

Experiments are run using the NIPS-10 data, described in previous chapters, and the NIPS-09 dataset, from the 2009 edition of the NIPS conference.2 As before, side information for each reviewer comprises a self-selected set of papers representative of his or her areas of expertise; these were summarized as word count vectors w_r^a. Side information about submitted papers consisted of document word counts w_p^d for each p. The total vocabulary used by submissions (across both sets) contained over 21,000 words; here we used only the top 1000 words, as ranked using TF-IDF, for our experiments (|w_p| = |w_r| = 1000). Reviewer suitability scores range from 0 to 3: 0 means "paper lies outside my expertise;" 1 means "can review if necessary;" 2 means "qualified to review;" and 3 means "very qualified to review." As discussed above, these scores are intended to reflect reviewer expertise, not desire. We focus on the area chair (or meta-reviewer) assignment problem, where the matching task is to assign a single area chair to each paper. We use the term reviewer below to refer to such area chairs. NIPS-09 comprises 1079 submitted papers and 30 area chairs. Contrary to the procedure followed by NIPS-10 (and ICML-12), reviewer scores were not elicited, but instead provided by the conference program chairs for every reviewer-submission pair. The mean suitability score is 0.19 (std. dev. 0.57). A histogram of the scores for each dataset is shown in Figure 5.3.

2See http://nips.cc

(a) NIPS-10   (b) NIPS-09   (y-axis: number of scores)

Figure 5.3: Observed scores for the two datasets. Figure for NIPS-10 is reproduced here (from Section 3.3) in order to highlight differences in score distribution with NIPS-09.

5.4.2 Suitability Prediction Experimental Methodology

We first describe the methodology used to train and test the score prediction models. We do not report suitability prediction results since Chapter 4 has already highlighted this stage of the framework.

Suitability methods: We reuse some of the models compared in Chapter 4. Namely, we use a method which only uses side information, (word-)LM (see LM in Section 3.2.1 for details); a pure collaborative filtering method, BPMF, a Bayesian extension of PMF already reviewed in Section 2.2.1; and LR, a reviewer-specific linear regression model (see Section 3.2.2). The goal here is not to compare the merits of the different approaches on a score prediction task but rather to compare their matching performance.

For learning, we are given a set of training instances, S^tr ≡ S^o. We split this set into a training and a validation set. The trained model predicts all unobserved scores S^u. Since we do not have true suitability values for all unobserved scores, we distinguish S^u as being the union of test instances S^te (for which we have scores in the data set) and missing instances S^m. LR is trained using a regularized squared loss (or, equivalently, by assuming a Gaussian likelihood model and Gaussian priors over parameters). We denote a model's estimates of the test instances as Ŝ^te. We use 5 different splits of the data in all experiments. In each split, the data is divided into training, validation and test sets in 60/20/20 proportions. There is no overlap in the test sets across the 5 splits. Training LR is naturally slightly faster than training BPMF3, for which we used 330 MCMC samples including 30 burn-in samples, but both methods can be trained in a few minutes on both of our data sets.
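The splitting protocol can be sketched as follows. The exact shuffling and fold construction used in the thesis are not specified, so this is only an illustrative implementation of the 60/20/20 proportions with disjoint test sets; all names are assumptions.

```python
import numpy as np

def make_splits(n_scores, n_splits=5, train_frac=0.6, seed=0):
    """Build n_splits (train, validation, test) index splits with disjoint
    20% test blocks, mirroring the protocol described in the text."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_scores)
    test_size = n_scores // n_splits
    splits = []
    for i in range(n_splits):
        test = perm[i * test_size:(i + 1) * test_size]
        rest = np.setdiff1d(perm, test, assume_unique=True)
        n_train = int(round(train_frac * n_scores))
        splits.append((rest[:n_train], rest[n_train:], test))
    return splits
```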

5.4.3 Match Quality

We now turn our attention to the matching framework. We first elaborate on how we perform the matching. We then evaluate the performance of the different learning methods on the matching objective. Finally, we introduce soft constraints into the matching objective and analyze the trade-offs they introduce.

3We use an implementation of BPMF provided by its authors and currently available at http://www.cs.toronto.edu/~rsalakhu/BPMF.html

           | train/validation | test  | missing
Matching   | S^tr             | Ŝ^te  | S^m = τ
Evaluation | S^tr             | S^te  | S^m = τ

Table 5.1: Overview of the matching/evaluation process.

Matching: Experimental Procedures

The matching IPs discussed above assume access to fully known (or predicted) suitability scores. Since we learn estimates of the unknown scores, we denote a model's estimates of the test instances as Ŝ^te, and impute a value for all suitability values that are missing, using a constant imputation of τ ∈ R. Since missing scores are likely to reflect, on average, lower suitability than their observed counterparts, we use τ = 1 in all experiments (NIPS-10's mean is 1.1376 and NIPS-09 has no missing scores). Given the estimate Ŝ^te computed by one of our learning methods, we perform a matching with S = S^tr ∪ Ŝ^te ∪ (S^m = τ). Note that this permits missing values to be matched, which is important in the regime where few suitability scores are known. Table 5.1 summarizes this procedure. For data set NIPS-10 we set P_min and P_max to 20 and 30, respectively, while the range is 30–40 for data set NIPS-09.4

Baseline: We adopt a baseline method that provides an absolute comparison across methods. The baseline has access to S^tr and imputes τ for any element of S^te. To allow meaningful comparison to other methods, it employs the same imputation for missing scores, S^m = τ.

A note on LM: Although the output of LM can be directly used for matching, it does not exploit observed suitabilities in its usual formulation. However, LM can make use of some of the training data S^tr by incorporating submitted papers assessed as "suitable" by some reviewer r into his or her word vector w_r^a. Specifically, we include in w_r^a all papers for which r offered a score of 3 (only if this score is in S^tr).

For all methods, once an optimal match Y^* is found, we evaluate it using all observed and unobserved scores, with the same constant imputation for the missing scores, where match quality is measured using J^basic (see Equation 5.1):

\sum_r \sum_p y^*_{rp}\, \big(S^{tr} \cup S^{te} \cup (S^m = \tau)\big)_{rp} \qquad (5.8)

Matching Performance using Basic IP

We now report on the quality of the matchings that result from using the predictions of the different methods. Similarly to the preference prediction experiments in Chapter 4, we consider dynamic matching performance as the amount of training data per user increases. Note that the optimal match value is 3053 for NIPS-10 and 2172 for NIPS-09, which occurs when Ŝ^te = S^te. Figure 5.4 shows how matching quality varies as the amount of training data per user increases in NIPS-10. Since training scores are also observed at matching time (Equation 5.8), all methods benefit from having a larger training set. Figure 5.4 leads to the following three observations.

4These represent typical ranges for members of the senior program committee.

(Axes: matching objective vs. training set size per user. Methods: LR, BPMF, LM, Baseline.)

Figure 5.4: Performance on the matching task on the NIPS-10 dataset.

Firstly, when no observed data is available (i.e., when using only the archive), LM does very well, with a matching score of 2247 ± 32, nearly identical to the quality of LR and BPMF with 10 suitabilities per user, and much better than the match quality of 1262 obtained using constant scores ((S^te ∪ S^m) = τ). Secondly, when very few scores are available, LR and LM perform best (and do equally well). As mentioned above, LM is able to exploit observed suitabilities by adding relevant papers to the user corpus, but this attenuates the impact of elicited scores: we see LM is outperformed by all other methods when sufficient data is available. Thirdly, LR outperforms all other methods as data is added. We also see that as the number of observed scores increases, unsurprisingly, the gain in matching performance (value of information) from additional scores decreases. It is also interesting to note that a total matching score of over 2500 implies that, on average, each reviewer is assigned papers on which her average preference is greater than 2 (out of 3). LR reaches this level of performance with fewer than 30 observed scores per user, while other methods need 30% more data per user to reach the same level of performance.

Further insight into the matching quality on NIPS-10 induced by the different learning methods can be gained by examining the distribution of scores associated with matched papers (Figure 5.5) or under different sizes of the training set (Figure 5.6). Figure 5.5 displays the number of scores of each value (0–3) that get assigned with a training set size of 40. Not surprisingly, LR and BPMF assign significantly more 2s and 3s combined than all other methods. LM is very good at picking the top scores, which reinforces the fact that word-level features, from reviewer and submitted papers, contain useful information for matching reviewers. Similar results were obtained on NIPS-09, and thus LM's performance is not simply a consequence of the data collection method used for NIPS-10. In addition, Baseline assigns few zeros, since all missing and test scores are imputed to be τ = 1. Figure 5.6 provides another perspective on assignment quality. Here we plot results for the best performing method, LR, on both NIPS-10 and NIPS-09, for 3 different training set sizes.


Figure 5.5: Assignments for NIPS-10 by score value when using 40 training examples per user.

We first note that the extreme imbalance in the distribution over scores of NIPS-09 leads LR to assign many zeros even with 80 training scores per user. Overall, both data sets show that as the number of training scores increases, more 2s and 3s, and fewer 0s and 1s, are assigned. Our remaining results deal exclusively with NIPS-10 since experimental results with NIPS-09 were similar.

Load Balancing: Balance IP

The experiments above all constrain the number of papers per reviewer to be within a specific range

(P_min, P_max). There is no good indication as to how to set these two limits. Instead we now use the Balance IP, both for matching and evaluation (see Equation 5.8), setting f to be the absolute value function. The resulting problem cannot be expressed directly as an LP. However, we can use a standard procedure which involves adding auxiliary variables. Specifically, the Balance IP with f the absolute value function can be solved as the following optimization problem (for clarity we omit the constraints on y_rp):

\text{maximize} \quad J^{balance}(Y,S) = \sum_r \sum_p s_{rp} y_{rp} - \lambda \sum_r t_r

\text{subject to} \quad \Big(\sum_p y_{rp} - \bar{y}\Big) \le t_r, \ \forall r \qquad (5.9)

\qquad \qquad -\Big(\sum_p y_{rp} - \bar{y}\Big) \le t_r, \ \forall r \qquad (5.10)

where the auxiliary variables are denoted t_r. We have added two sets of constraints (Eqs. 5.9 and 5.10), only one of which will be active at a time for each reviewer r; they ensure that t_r is lower bounded by the absolute value of Σ_p y_rp − ȳ. Since t_r can only decrease the value of the objective, t_r will then be exactly equal to either the left-hand side of Equation 5.9 or the left-hand side of Equation 5.10.
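The auxiliary-variable formulation can be passed to a generic LP solver. The sketch below mirrors Equations 5.9-5.10 using scipy.optimize.linprog, omitting the reviewer-load box constraints as in the text; the function name and defaults are illustrative assumptions, and in general an MILP solver would be needed to guarantee an integral solution.

```python
import numpy as np
from scipy.optimize import linprog

def balance_match(S, r_target=1, lam=0.5):
    """Balance IP with f = |.| via auxiliary variables t_r (Eqs. 5.9-5.10).
    Variables are laid out as [y_11..y_RP, t_1..t_R]; S is an (R, P) score matrix."""
    R, P = S.shape
    n = R * P
    y_bar = P * r_target / R                            # average papers per reviewer
    c = np.concatenate([-S.ravel(), lam * np.ones(R)])  # maximize -> minimize
    A_eq = np.zeros((P, n + R))                         # each paper gets r_target reviewers
    for p in range(P):
        A_eq[p, p:n:P] = 1.0
    b_eq = np.full(P, float(r_target))
    A_ub = np.zeros((2 * R, n + R))                     # |sum_p y_rp - y_bar| <= t_r
    b_ub = np.zeros(2 * R)
    for r in range(R):
        A_ub[r, r * P:(r + 1) * P] = 1.0                #  sum_p y_rp - t_r <=  y_bar
        A_ub[r, n + r] = -1.0
        b_ub[r] = y_bar
        A_ub[R + r, r * P:(r + 1) * P] = -1.0           # -sum_p y_rp - t_r <= -y_bar
        A_ub[R + r, n + r] = -1.0
        b_ub[R + r] = -y_bar
    bounds = [(0, 1)] * n + [(0, None)] * R
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[:n].reshape(R, P)
```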

(Panels (a)-(c): NIPS-10 with 10, 40, and 86 examples per user. Panels (d)-(f): NIPS-09 with 10, 40, and 80 examples per user. Y-axis: average number of assignments for each score value 0-3.)

Figure 5.6: Comparison of the distribution of assignments by score value given different numbers of observed scores.

λ        | 0    | 0.1  | 0.25 | 0.5  | 0.75 | 1
J^basic  | 2625 | 2615 | 2600 | 2573 | 2569 | 2569
Variance | 4.62 | 3.28 | 2.61 | 0.89 | 0.37 | 0.33

Table 5.2: Comparison of the matching objective versus within-reviewer variance of the number of assigned papers as a function of λ.

Figure 5.7 shows the histogram of the number of papers assigned per reviewer given by the optimal solution to the IP for different λ ∈ {0, 0.1, 1}. Experimentally, when λ = 0 load equity is ignored, and almost all reviewers get assigned either the minimum (P_min) or the maximum (P_max) number of papers (this is due to certain reviewers having high expertise for more papers than others); the variance of the reviewer loads, computed as the average of (Σ_p y_rp − ȳ)² across reviewers, is extremely high. When a "soft constraint" on load equity is introduced, assignments become more balanced as λ increases (i.e., the balance constraint becomes "harder"). Table 5.2 reports the matching objective versus this variance, averaged across users, for different values of λ with a training set size of 40 (other training sizes yielded similar results). Not surprisingly, larger penalties λ for deviating from the mean reviewer load give rise to greater load balance (lower load variance) and worse matching performance: Table 5.2 shows that the best matching in this experiment had a value of 2625 and a load variance of 4.62, while the worst matching had a value of 2569 and a load variance of 0.37. Generally, an appropriate λ, one that nicely trades off performance versus load balance across reviewers, will be chosen by the conference organizers (here, perhaps around λ = 0.5). In practice it is possible that the organizers will have to further examine the assignments (beyond looking only at the matching objective) before selecting an appropriate value for λ.

Figure 5.7: Histograms of the number of papers matched per reviewer for λ = 0, λ = 0.1, and λ = 1. The leftmost plot shows results with only hard constraints on reviewer loads (λ = 0); the others also include a soft constraint minimizing load variance. The corresponding matching values for the different λ values are reported in Table 5.2.

5.4.4 Transformed Matching and Learning

We now consider a non-linear transformation of the scores, reflecting the view that it is much better to assign reviewer-paper pairs with suitabilities of 2 and 3 than pairs with 0 and 1; as discussed above, this can be accomplished by allowing the "utility" $y_{rp}$ to be non-linear in the suitability score $s_{rp}$. We adopt the following sigmoid function to effect this non-linear transformation: $\sigma(s) = 1/(1+\exp(-(s-1.5)\beta))$; here 1.5 is the middle of the scores' range. We set $\beta = 4.5$, which gives $\sigma(0) = 0.001$, $\sigma(1) = 0.095$, $\sigma(2) = 0.90$, and $\sigma(3) = 1.0$. We first show how this transformation impacts matching performance without learning; then we discuss how one can incorporate the transformation into the learning objective itself.
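The transformation is easy to reproduce; the short sketch below (plain Python, no thesis code assumed) evaluates the sigmoid at the four score values and recovers the quoted constants up to rounding.

```python
# Sketch: the sigmoidal utility transformation sigma(s) = 1 / (1 + exp(-(s - 1.5) * beta)).
import math

def sigma(s, beta=4.5, midpoint=1.5):
    return 1.0 / (1.0 + math.exp(-(s - midpoint) * beta))

for s in range(4):
    print(s, round(sigma(s), 3))   # prints approximately 0.001, 0.095, 0.905, 0.999
```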

We first test how matching using the transformed objective affects results without using learning to infer the missing scores (consequently, $S^u = \tau$), by examining the difference in matching performance as we vary the percentage of observed scores. Figure 5.8(a) shows the difference when matching with the transformed objective ($J^{\text{transformed}}$) versus the basic objective ($J^{\text{basic}}$). In both cases the resulting matches are evaluated using $J^{\text{transformed}}$. Although a minor gain is observed when most of the known data is observed, there is, overall, very little difference in performance when matching with either objective. Recall that the mean number of scores per paper is less than 4. Hence, when matching using a small fraction of the data, the matching procedure has very little flexibility to assign high-scoring pairs unless learning is used to predict the unobserved scores.

We can modify the learning objective to take into account the nonlinearity introduced in the matching objective. We do this by transforming all labels using the same sigmoidal transformation as in the matching objective (Equation 5.7). This allows learning to better predict the transformed scores by explicitly training on them.
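A minimal way to realize this is sketched below, with scikit-learn's ridge-regularized linear regression standing in for the thesis's LR-TFM model and random placeholder features and labels: the same sigmoid is simply applied to the training labels before fitting.

```python
# Sketch: training a regressor on sigmoid-transformed suitability labels (LR-TFM-style).
import numpy as np
from sklearn.linear_model import Ridge

def sigma(s, beta=4.5, midpoint=1.5):
    return 1.0 / (1.0 + np.exp(-(s - midpoint) * beta))

X_train = np.random.rand(200, 10)          # placeholder reviewer-paper features
s_train = np.random.randint(0, 4, 200)     # observed suitability scores in {0, 1, 2, 3}

model = Ridge(alpha=1.0)
model.fit(X_train, sigma(s_train))         # regress directly onto the transformed scores
pred_utilities = model.predict(X_train)    # predictions already live in "utility" space
```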

Figure 5.8(b) shows the transformed matching performance of both LR on the non-transformed data and LR-TFM, a linear regression model trained using the transformed learning objective. Not surprisingly, LR-TFM outperforms LR across all training set sizes, since it is trained for the modified objective $J^{\text{transformed}}$. The difference is especially pronounced with smaller training sets: when enough data is available, both methods will naturally assign many 2s and 3s. (We also verified that LR-TFM outperforms BPMF trained on the transformed objective).

[Figure 5.8 panels: (a) performance of the original ($J^{\text{basic}}$) and transformed ($J^{\text{transformed}}$) matching objectives without learning, as a function of the fraction of known scores; (b) performance of the original LR and of LR-TFM under the transformed matching objective, as a function of the training set size per user.]

Figure 5.8: Performance on the transformed matching objective on NIPS-10.

5.5 Conclusion and Future Opportunities

We have instantiated the recommendation stage of our framework as an assignment procedure between papers and reviewers. We showed that, even when only a small subset of reviewer scores is elicited, inferring the unobserved scores using one of several learning methods allows us to determine high-quality matchings. We explored the trade-off between matching quality and paper load balancing, which helps one avoid the need to manually set limits on the reviewer load. Finally, under the realistic assumption that utility is non-linear in the suitability score, we showed that using the same non-linear transformation in the learning objective leads to better matches.

Given how matching benefits from an interaction with learning, a next step would be to develop ways to strengthen this interaction by making the learning methods sensitive to the final matching objective. We discuss such interactions for active learning, where the system chooses which reviewer scores to query, in Chapter 6. Another possible path for future research is to explore optimization models for different recommendation problems and to more formally define the circumstances in which adapting the learning loss to the final objective provides an advantage. Finally, as we outlined in Section 5.3, several researchers have looked at possible matching constraints that could be useful for the reviewer-to-paper matching task, using matching formulations similar to ours. Based on our experience with TPMS, we are also interested in exploring more expressive matching constraints, which would give conference organizers more flexibility in expressing their preferences over assignments but which may not fit within our proposed LP framework. For example, in practice, assigning complementary reviewers (one reviewer from pool A and one reviewer from pool B) to a submission is often sought by conference organizers. For such cases, mapping the matching optimization to inference in an undirected graphical model (see Section 6.2.2) and making use of high-order potentials (see, for example, [119]) could offer interesting opportunities.

In our work we have used the value of the matching objective, which is based on scores elicited prior to assignment, to evaluate assignment quality. A possibly stronger method of evaluation would be to re-evaluate the expertise of reviewers after they have reviewed their assigned papers. We could elicit reviewer expertise from the reviewers themselves (for example, by using the confidence of their reviews). Alternatively, senior program committee members could be asked to evaluate the expertise of reviewers based on their reviews. Another advantage of performing post-hoc evaluation is that it would enable comparisons between our system and other ways of assigning papers to reviewers (either manually or using other, more automated procedures). Exploring different evaluations also relates to our discussion of the evaluation of TPMS in Section 3.5.

Chapter 6

Task-Directed Active Learning

Chapter 5 demonstrated the importance of effective learning of user preferences in matching problems. Equally important is the question of query selection, which has the potential to further reduce the amount of preference information users must provide and hence limit the impact of the cold-start problem. Eliciting preferences (e.g., in the form of item ratings) imposes significant time and cognitive costs on users. In domains such as paper matching, product recommendation, or online dating, users will have limited patience for specifying preferences. While learning techniques can be used to limit the amount of required information in match-constrained recommendation, the intelligent selection of preference queries will be critical in reducing user burden. It is this problem we address in this chapter. We frame the problem as one of active learning: our aim is to determine those preference queries with the greatest potential to improve the quality of the matching. This is a departure from most work in active learning, and, specifically, approaches tailored to recommender systems (as we discuss below), where queries are selected to improve the overall quality of ratings prediction. We develop techniques that focus on queries whose responses will impact—possibly indirectly by changing predictions—the matching quality itself. We also propose a new probabilistic matching technique that accounts for uncertainty in predicted preferences when constructing a matching. Finally, we test our methods on several real-life data sets comprised of preferences for online dating, conference reviewing, and jokes. Our results show that active learning methods that are tuned to the matching task significantly outperform a standard active learning method. Furthermore, we show that our probabilistic methods can be successfully leveraged in active learning.

6.1 Related Work

Active learning is a rich field which we surveyed in the context of recommender systems in Section 2.3. With respect to previous research into active learning directed at other tasks, we are aware of one relevant study which explored active learning in matching domains. Rigaux [89] considers an iterative elicitation method for paper matching using neighbourhood CF, but requires an initial partitioning of reviewers, and elicits scores for the same papers from all reviewers in a partition (with the aim only of improving score prediction quality). In our work, we need not partition users, and we focus on optimizing matching quality rather than prediction accuracy. Our approach is thus conceptually similar to CF methods trained for specific recommendation tasks (Section 2.2.2 surveyed this work).



Figure 6.1: Elicitation in the match-constrained recommendation framework. As shown by the arrow from the matching to the elicitation, the elicitation can be sensitive to the matching objective.

We also note that Bayesian optimization has been used for active learning recently [22]; however, these methods assume a continuous query space and some similarity metric over item space, hence are not readily adaptable to our match-constrained problems. As such, we will not explore the use of Bayesian optimization in this thesis.

6.2 Active Learning for Match-Constrained Recommendation Problems

Our framework is depicted, specifically for the purposes of this chapter, in Figure 6.1. Preference elicitation corresponds to the first stage of our framework. Designing an elicitation process that exploits feedback from the matching stage is the focus of this chapter. The active learning methods that we develop are in line with some of the previous work on active learning (see Section 2.3), especially the work guided by prediction uncertainty (see Section 2.3.1). Once elicited, preferences can be used to predict missing preferences using one of the models described in the previous chapters. We use Bayesian PMF (described in Section 2.2.1); the reason for choosing this model will become clear in the next section. For matching, we reuse the IP presented in the previous chapter, with the $J^{\text{basic}}$ objective and constraints on the number of papers per reviewer and on the number of reviewers per paper:

$$\text{maximize} \quad J(Y,S) \;=\; \sum_r \sum_p s_{rp}\, y_{rp} \qquad (6.1)$$

$$\text{subject to} \quad y_{rp} \in \{0, 1\}, \quad \forall r, p \qquad (6.2)$$

$$\sum_r y_{rp} \ge R_{\min}, \quad \sum_r y_{rp} \le R_{\max}, \quad \forall p \qquad (6.3)$$

$$\sum_p y_{rp} \ge P_{\min}, \quad \sum_p y_{rp} \le P_{\max}, \quad \forall r. \qquad (6.4)$$

We also reuse some of our previously defined notation: we denote the set of observed suitabilities by $S^o$, and the observed scores for a particular user $r$ and item $p$ by $S^o_r$ and $S^o_p$, respectively. $S^u$, $S^u_r$, and $S^u_p$ are the analogous collections of unobserved scores.
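For concreteness, a direct encoding of Equations 6.1–6.4 can be sketched with the open-source PuLP modelling library; the score matrix and the reviewer and paper limits below are illustrative values only, not those used in our experiments.

```python
# Sketch: the basic matching IP of Eqs. 6.1-6.4 on a small synthetic score matrix.
import pulp

S = [[3, 1, 0, 2, 1, 0],
     [2, 2, 1, 0, 3, 1],
     [0, 3, 2, 1, 0, 2]]                         # s_rp for 3 reviewers and 6 papers (toy values)
R, P = len(S), len(S[0])
R_min, R_max = 1, 1                              # reviewers per paper
P_min, P_max = 1, 2                              # papers per reviewer

ip = pulp.LpProblem("basic_matching", pulp.LpMaximize)
y = [[pulp.LpVariable(f"y_{r}_{p}", cat="Binary") for p in range(P)] for r in range(R)]

ip += pulp.lpSum(S[r][p] * y[r][p] for r in range(R) for p in range(P))      # Eq. 6.1
for p in range(P):                                                           # Eq. 6.3
    ip += pulp.lpSum(y[r][p] for r in range(R)) >= R_min
    ip += pulp.lpSum(y[r][p] for r in range(R)) <= R_max
for r in range(R):                                                           # Eq. 6.4
    ip += pulp.lpSum(y[r][p] for p in range(P)) >= P_min
    ip += pulp.lpSum(y[r][p] for p in range(P)) <= P_max

ip.solve(pulp.PULP_CBC_CMD(msg=False))
print([[int(v.value()) for v in row] for row in y])                          # the match matrix Y
```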

6.2.1 Probabilistic Matching

Model uncertainty is at the heart of several active learning techniques (see Section 2.3). For example, successful querying strategies such as uncertainty sampling (Section 2.3.1) directly aim at reducing model uncertainty by querying the scores over which the model is most uncertain. Our aim is to extend such strategies to matching problems. In this section we introduce a novel method for determining probabilistic matchings, which can use score uncertainty to determine the probability of a user-item pair being matched.

While the IP optimization is straightforward, and provides optimal solutions when all scores are observed, it has potential drawbacks when used with predicted scores, and specifically when used in conjunction with active learning. First, the IP does not consider potentially useful information contained in the uncertainty of the (predicted) suitabilities. Namely, because of this uncertainty, we are uncertain of the quality of the IP's assignment; evaluating how score uncertainty affects the IP's assignment may allow us to reduce this quality uncertainty. Second, the IP does not express the range of possible matches that might optimize total suitability (given the constraints). While optimal matching given true scores can be viewed as a deterministic process, score prediction is inherently uncertain, and we can exploit this if our prediction model outputs a distribution over the unobserved scores $S^u$ rather than a point estimate. Different score values supported by the uncertainty model may lead to different assignments, and the number of such assignments in which a particular user-item pair appears is indicative of matching uncertainty (resulting from the score uncertainty). Given inputs consisting of observed scores $S^o$ and possibly additional side information $X$, we can express uncertainty over a match $Y'$ as:

$$\Pr(Y = Y' \mid S^o, X, \theta) \;=\; \int \delta_{Y'}\big(Y^*(S^o, S^u)\big)\,\Pr(S^u \mid S^o, X, \theta)\, dS^u,$$

where $\Pr(S^u \mid S^o, X, \theta)$ is our score prediction model (with model parameters $\theta$), $Y^*(\cdot)$ (see Equation 6.1) is the optimal match matrix given a fixed set of scores, and $\delta_{Y'}$ is the Dirac delta function with mode $Y'$. Using a similar idea, we can formulate a probabilistic model over individual marginals:

$$\Pr(y_{rp} = 1 \mid S^o, X, \theta) \;=\; \int \delta_{1}\big(y^*_{rp}(S^o, S^u)\big)\,\Pr(S^u \mid S^o, X, \theta)\, dS^u, \qquad (6.5)$$

where $y^*_{rp}$ is the user-item $rp$ entry of $Y^*$. With this in hand, we overcome the limitations of pure IP-based optimization by developing a sampling method for determining "soft" or probabilistic matchings that reflect the range of optimal matchings given uncertainty in the predicted suitabilities. While Equation 6.5 expresses the induced distribution over marginals, the integral is intractable, as it requires solving a large number of matching problems (in this case, IPs). Instead we take a sampling approach: we independently sample each score from the posterior $\Pr(S^u \mid S^o, X, \theta)$ to build a complete score matrix, then solve the matching optimization (IP) using this sampled matrix. Repeating this process $T$ times provides an estimated distribution over optimal matchings. We can then average the resulting match matrices, obtaining $\bar{Y} = \frac{1}{T}\sum_{t=1}^{T} Y^{(t)}$, where $Y^{(t)}$ is the $t$'th matching. Each entry $\bar{Y}_{rp}$ is the (estimated) marginal probability that user-item pair $rp$ is matched, and the probability of this match depends, as desired, on the distribution $\Pr(s_{rp} \mid S^o, X, \theta)$. Figure 6.2 illustrates $\bar{Y}$, comparing it to the IP solution, on a randomly generated "toy" problem with 3 reviewers and 6 papers (the match is constrained to exactly 1 reviewer per paper and 2 papers per reviewer).

Figure 6.2: A "toy" example with a synthetic score matrix $S$ with 3 reviewers and 6 papers. Each reviewer must be matched to exactly two papers, while each paper must be matched to a single reviewer. Shown are the matching results using the IP (bottom left), the $\bar{Y}$ approximation with two different variance matrices, and the loopy belief propagation matching formulation ($Z$, bottom right).

Assuming a fixed predicted score matrix $S$, two versions of $\bar{Y}$ are shown: one when all estimated variances are low ($\bar{Y}_{\text{low}}$), the other when they are higher ($\bar{Y}_{\text{high}}$); the variances are sampled uniformly at random, whereas in a real problem they would be given by the prediction model. Note that the $\bar{Y}$ matrices respect the matching constraints by design (for visualization purposes we round the matching probabilities).

$\bar{Y}_{\text{low}}$ agrees with the IP, but for $\bar{Y}_{\text{high}}$ we observe the inherent uncertainty in the optimal matching; e.g., column one shows all three match probabilities to be reasonably high. In addition, the last column shows that even though the second and third users have scores that differ by 2 on the sixth paper, the high variance in their scores gives both users a reasonable probability of being matched to that paper.
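The sampling scheme itself is straightforward to sketch. For the toy constraints above (each paper to exactly one reviewer, each reviewer to exactly two papers) the inner matching reduces to an assignment problem that can be solved by duplicating each reviewer row, which keeps the sketch self-contained; in general the IP of Equations 6.1–6.4 would be solved instead. All scores, variances, and sizes below are illustrative.

```python
# Sketch: estimating the soft match matrix Y-bar by sampling scores and re-solving the matching.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
mean = np.array([[3., 1., 0., 2., 1., 0.],      # predicted score means (3 reviewers, 6 papers)
                 [2., 2., 1., 0., 3., 1.],
                 [0., 3., 2., 1., 0., 2.]])
std = np.full_like(mean, 1.0)                   # predictive std; a real model (e.g. BPMF) supplies this
R, P, slots = mean.shape[0], mean.shape[1], 2   # each reviewer has capacity for exactly 2 papers

def solve_match(S):
    # Duplicate each reviewer row 'slots' times so the capacitated matching becomes a
    # one-to-one assignment, solvable exactly with the Hungarian algorithm.
    expanded = np.repeat(S, slots, axis=0)                   # (R * slots) x P
    rows, cols = linear_sum_assignment(expanded, maximize=True)
    Y = np.zeros((R, P))
    Y[rows // slots, cols] = 1.0
    return Y

T = 50                                          # number of sampled score matrices
Y_bar = np.mean([solve_match(rng.normal(mean, std)) for _ in range(T)], axis=0)
print(np.round(Y_bar, 2))                       # estimated marginal match probabilities
```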

6.2.2 Matching as Inference in an Undirected Graphical Model

We have introduced a method for propagating score uncertainty into matching uncertainty. There is, however, another source of uncertainty which we have not yet discussed: the inherent uncertainty over the range of possible matchings. Given a probabilistic model over possible matchings, the matching IP returns the single most likely (binary) solution (that is, the optimal solution). There are, independent of score uncertainty, possibly other good matches which, while being dominated by the optimal one, may provide useful information for active learning. Tarlow et al. [121] model the matching problem using an undirected graph whose (binary) nodes correspond to the assignment variables $y_{rp}$ (in other words, there is a one-to-one mapping between nodes and paper-reviewer pairs). Nodes have singleton potentials that encode their corresponding scores ($s_{rp}$), and high-order cardinality potentials enforce the reviewer and paper constraints (Equations 6.3 and 6.4). Tarlow et al. [121] present an efficient approximate-inference algorithm (based on loopy belief propagation [83]) for computing the marginal probabilities ($\Pr(y_{rp} = 1)$) of such models. The probability of a marginal represents the weighted (approximate) number of times that the particular entry is part of a match over all valid matches, where each match is weighted by its quality (the objective of the IP, Equation 6.1). In the rest of this chapter we refer to the matching marginal obtained using this method as the loopy-BP

matching, and denote the resulting matrix of match marginals as $Z$ (compared to $Y$ for the matching variables from the IP). Figure 6.2 presents the results of applying loopy-BP matching to the "toy" problem. We note that on this small problem the solution exploiting matching uncertainty, $Z$, is similar to $\bar{Y}_{\text{high}}$, the solution exploiting score uncertainty. This result is reasonable, since different sampled score instantiations may end up exploring a range of matches that have high weight under loopy-BP matching. In other words, small perturbations around the mean of the predicted scores may result in IP assignments that are close to one another and coincide with the assignments that are given high probability by loopy-BP.
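For intuition, on a problem as small as the toy example one can compute such match marginals exactly by enumerating every valid matching and weighting it by (the exponential of) its objective value. The brute-force sketch below is only meant to illustrate the quantity that loopy-BP with cardinality potentials approximates at scale; it does not reproduce the algorithm of Tarlow et al., and the scores and the exponential weighting (whose scale, or temperature, matters) are assumptions of the sketch.

```python
# Sketch: exact match marginals Z for a tiny problem, by enumerating all valid matchings
# and weighting each one by exp(matching objective). Toy scores only.
import itertools
import numpy as np

S = np.array([[3., 1., 0., 2., 1., 0.],
              [2., 2., 1., 0., 3., 1.],
              [0., 3., 2., 1., 0., 2.]])
R, P, per_reviewer = S.shape[0], S.shape[1], 2

Z = np.zeros((R, P))
total_weight = 0.0
for choice in itertools.product(range(R), repeat=P):        # which reviewer gets each paper
    loads = np.bincount(choice, minlength=R)
    if not np.all(loads == per_reviewer):                    # keep only valid matchings
        continue
    Y = np.zeros((R, P))
    Y[list(choice), np.arange(P)] = 1.0
    weight = np.exp(np.sum(S * Y))                           # match quality determines the weight
    Z += weight * Y
    total_weight += weight
Z /= total_weight                                            # marginal probability each pair is matched
print(np.round(Z, 2))
```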

6.3 Active Querying for Matching

Little work has considered strategies for actively querying the "most informative" preferences from users. In combination with supervised learning, active querying can further reduce the elicitation burden on users. Random selection of user-item pairs for assessment will generally be sub-optimal, since query selection is then uninformed by the learned model, the objective function, or any previous data. By contrast, an active approach, in which queries are tailored to both the current preference model and the current best matching, will typically give rise to better matchings with fewer queries. (In our settings one can elicit a rating or suitability score from a user for any item, e.g., a paper, a date, or a joke, so the full set $S^u_r$ serves as the pool of potential queries for user $r$.) In this section we describe several distinct strategies for query selection: we review a standard active learning technique and introduce several novel methods that are sensitive to the matching objective. Our methods can be broadly categorized based on two properties (which we use to label the different methods): whether they select queries by evaluating their impact in score space ($S$) or in matching space ($Y$ or $Z$), and whether they select queries with the maximal value (M) or maximal entropy (E).

S-Entropy (SE)

Uncertainty sampling is a common approach in active learning, which greedily selects queries involving (unobserved) user-item pairs for which the model is most uncertain [102]. In our context, this corresponds to selecting the user-item pair with maximum score entropy w.r.t. the score distribution produced by the learned model. The rationale is clear: uncertainty in score predictions may lead to poor estimates of match quality. Of course, this approach fails to explicitly account for the matching objective (the term $Y^*(S^o, S^u)$ in Equation 6.5), instead focusing (myopically) on entropy reduction in the predictive model (the term $\Pr(S^u \mid S^o, X, \theta)$). Queries that reduce prediction entropy may have no influence on the resulting matching. For example, if $s^u_{rp}$ has high entropy but a much lower mean than some "competing" $s^u_{r'p}$, user $r'$ may remain matched to $p$ with high probability regardless of the response to query $rp$.
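Under a Gaussian predictive model the score entropy is monotone in the predictive variance, so uncertainty sampling reduces to querying the unobserved pair with the largest predictive standard deviation. The sketch below (with arbitrary illustrative arrays) selects one maximum-entropy query per user.

```python
# Sketch: S-Entropy (uncertainty sampling) query selection from a Gaussian predictive model.
import numpy as np

rng = np.random.default_rng(1)
pred_std = rng.uniform(0.1, 2.0, size=(4, 8))              # predictive std for each user-item pair
observed = np.zeros((4, 8), dtype=bool)
observed[:, :2] = True                                     # pretend the first two items are rated

entropy = 0.5 * np.log(2 * np.pi * np.e * pred_std ** 2)   # Gaussian entropy, monotone in the std
entropy[observed] = -np.inf                                # never re-query an observed pair
queries = entropy.argmax(axis=1)                           # one maximum-entropy query per user
print(queries)
```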

S-Max (SM)

An alternate, yet still simple, strategy is to select queries involving user-item pairs with highest predicted score w.r.t. MAP score estimates given our predictions of unobserved scores:

$$\hat{S}^u \;\equiv\; \arg\max_{S^u} \Pr\big(S^u = s^{\max} \mid S^o, X, \theta\big),$$

with $s^{\max}$ being the highest possible score value (for example, in a particular data set). This may be especially advantageous for matching problems where, all else being equal, high scores are more likely

to be assigned (see Equation 6.1). SM's insensitivity to the matching objective means that it shares the same obvious shortcoming as SE. One remedy is to use the expected value of information (EVOI) to measure the improvement in matching quality given the response to a query (taking an expectation over predicted responses). This approach, which we reviewed in Section 2.3.4, has been used effectively in (non-constrained) CF [18], but EVOI is notoriously hard to evaluate. In our context, we would (in principle) have to consider each possible query $rp$, estimate the impact of each possible response $s^o_{rp}$ on the learned model (the term $\Pr(S^u \mid S^o, X, \theta)$ in Equation 6.5), and re-solve the estimated matching (the term $Y^*(S^o, S^u)$ in Equation 6.5). Instead, we consider several more tractable strategies that embody some of the same intuitions.

Y-Max (YM)

A simple way to select queries in a match-sensitive fashion is to consider the solution returned by the IP w.r.t. the observed scores, $S^o$, and the MAP solution of the unobserved scores, $\hat{S}^u$. We query the unknown pair $rp$ that contributes the most to the value of the objective:

$$\arg\max_{(rp) \in S^u} \; y_{rp}\, \hat{s}_{rp},$$

where $y_{rp} \in Y(S^o, \hat{S}^u)$ is the binary match value for user $r$ and item $p$, and $\hat{s}_{rp}$ is the corresponding MAP score value. In other words, we query the unobserved pair, among those actually matched, with the highest predicted score. We refer to this strategy as Y-Max (YM). It reflects the intuition that we should either confirm or refute scores for matched pairs, i.e., those pairs that, under the current model, directly determine the value of the matching objective. However, notice that YM is insensitive to score uncertainty. So, for example, it may query unobserved scores whose predictions have very high confidence, despite the fact that such queries are highly unlikely to provide valuable information.

$\bar{Y}$-Max ($\bar{Y}$M)

As a remedy to YM's problem, $\bar{Y}$M exploits our probabilistic matching model to select queries. As with YM, $\bar{Y}$M queries the unobserved pair $rp$ that contributes the most to the objective value:

$$\arg\max_{(rp) \in S^u} \; \bar{Y}_{rp}\, \hat{s}_{rp}.$$

The difference is that we use the probabilistic match, exploiting prediction uncertainty in query selection.

$\bar{Y}$-Entropy ($\bar{Y}$E)

This method also exploits the probabilistic match $\bar{Y}$, but unlike $\bar{Y}$M, $\bar{Y}$E queries the unknown pairs whose entropy in the match distribution is greatest. Specifically, we view each $Y_{rp}$ as a Bernoulli random variable with (estimated) success probability $\bar{Y}_{rp}$. We then query the pair with maximum match entropy:

$$\arg\max_{(rp) \in S^u} \; \Big[ -\bar{Y}_{rp} \log \bar{Y}_{rp} \;-\; \big(1 - \bar{Y}_{rp}\big) \log\big(1 - \bar{Y}_{rp}\big) \Big].$$

Z-Max (ZM)

In this method we exploit the matching marginals of Z with maximal-value query selection:

$$\arg\max_{(rp) \in S^u} \; z_{rp}\, \hat{s}_{rp},$$

where $z_{rp} \in Z(S^o, \hat{S}^u)$ is the probabilistic match value for user $r$ and item $p$ according to the loopy-BP matching introduced in Section 6.2.2. We also experimented with a strategy that considered both score uncertainty and matching uncertainty ($\bar{Z}$M), but initial results were not better than those of ZM. We hypothesize that solving the IP using different sets of sampled scores is an alternative means of exploring the range of good matches. One important point to note is that the match-sensitive strategies YM, $\bar{Y}$M, and $\bar{Y}$E all attempt to query unobserved pairs that occur (possibly stochastically) in the optimal match. When the IP does not match any unobserved pairs, a fall-back strategy is needed. All three strategies resort to random querying as a fall-back, selecting a random unobserved item score for the given user as the query. For $\bar{Y}$M and $\bar{Y}$E, we further consider all queries that correspond to a user-item pair with less than a 1% chance of being matched to be "random" queries.
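Given the soft match matrix $\bar{Y}$ (estimated, for example, as in the earlier sampling sketch) and MAP score predictions, the match-sensitive selection rules are essentially one line each. The sketch below uses arbitrary illustrative arrays and one reasonable reading of the 1% fall-back rule; none of the names come from an existing code base.

```python
# Sketch: Y-bar-Max and Y-bar-Entropy query selection with the random fall-back rule.
import numpy as np

rng = np.random.default_rng(2)
Y_bar = rng.dirichlet(np.ones(6), size=3)        # soft match marginals (3 users, 6 items), toy values
S_hat = rng.uniform(0, 3, size=(3, 6))           # MAP score predictions
unobserved = np.ones((3, 6), dtype=bool)
unobserved[:, 0] = False                         # pretend item 0 has already been rated by everyone

def bernoulli_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def select(score_per_pair, threshold=0.01):
    queries = []
    for r in range(Y_bar.shape[0]):
        candidates = np.where(unobserved[r] & (Y_bar[r] >= threshold))[0]
        if candidates.size == 0:                 # fall back to a random unobserved item
            queries.append(rng.choice(np.where(unobserved[r])[0]))
        else:
            queries.append(candidates[np.argmax(score_per_pair[r, candidates])])
    return queries

print("Y-bar-Max:    ", select(Y_bar * S_hat))             # value-weighted selection
print("Y-bar-Entropy:", select(bernoulli_entropy(Y_bar)))  # match-entropy selection
```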

6.4 Experiments

We test the active learning approaches described above on three data sets, each with very different characteristics. We begin with a brief description of the data sets and matching tasks, then describe our experimental setup, before proceeding to a discussion of our results.

6.4.1 Data Sets

We first describe our three data sets and define the corresponding matching tasks.

Jokes data set: The Jester data set [41] is a standard CF data set in which over 73,000 users have each rated a subset of 100 jokes on a scale of -10 to 10. It has a dense subset in which all users rate ten common jokes. Our experiments use a data set consisting of these ten jokes and 300 randomly selected users (this data set is derived from the 18,000 subset of Dataset 1 presented in Goldberg et al. [41]; documentation for the data sets is available online, currently at http://eigentaste.berkeley.edu/dataset/). We convert this into a matching problem by requiring the assignment of a single joke to each user (for example, to be told at a convention or conference), and requiring that each joke be matched to between 25 and 35 users (to ensure "jocular diversity" at the convention). Figure 6.3(a) provides a histogram of the suitabilities for the Jester sub-data set.

Conference data set: This is the data set derived from the NIPS 2010 conference that we have been using across the different experimental sections of this thesis. As previously reported, the suitabilities for a subset of papers were elicited in two rounds. In the first round, scores were elicited for about 80 papers per reviewer, with queries selected using the YM procedure described above (where the initial scores were estimated using a word-LM, see Section 3.2.1, applied to reviewers' published papers).

Dating data set: The third data set comes from an online dating website (see http://www.occamslab.com/petricek/data/). It contains over 17 million ratings from roughly 135,000 users of 168,000 items (other users). We use a denser subset of 32,000 ratings from 250 users (each with at least 59 ratings) over 250 items (other users); see Figure 6.3(b) for


Figure 6.3: Histograms of known suitabilities for two of our data sets: (a) the Jokes data set and (b) the Dating data set.

the score histogram. Since items are users with preferences over their matches, dating is generally treated as a two-sided problem. While two-sided matching can fit within our general framework, the focus of our current work is on one-sided matching. As such, we only consider user preferences for “items” and not vice versa. Each user is assigned 25–35 items (and vice versa since “items” are users).

6.4.2 Experimental Procedures

Our experiments simulate the typical interaction of a recommendation or matching engine with its users. All experiments start with a few observed preferences for each user (for example, preferences provided by users upon first entering the system) and then go through several rounds of querying. At each round, a querying strategy selects queries to ask one or more users. Note that in practice we restrict the strategies to only query (unobserved) scores that are actually available in our data sets; since we simulate the process of actively querying scores, we can only simulate the responses that are available in the underlying data set. Once all users have responded, the system re-trains the learning model with the newly and previously observed preferences, then proceeds to select the next batch of queries. This is a somewhat simplified model that assumes semi-synchronous user communication, and thus our strategies cannot take advantage of users' most recently elicited preferences until the end of each round. We also assume for simplicity that the same fixed number of queries per user is asked in each round. The initial goal is simply to assess the relative performance of each method; we do relax some of these assumptions in Section 6.4.3.

There are a variety of reasonable interaction modes for eliciting user preferences. For example, in paper-reviewer matching, posing a single query per round is undesirable, since a reviewer, after assessing a single paper, must wait for other reviewers' responses, and for the system to re-train, before being asked a subsequent query. Reviewers generally prefer to assess their expertise off-line w.r.t. a collection of papers. Consequently, batch interaction is most appropriate: users are asked to assess K items. While batch frameworks for active learning have received recent attention (e.g., [44]), here we are interested in comparing different query strategies. Hence we use a very simple greedy batch approach in which we elicit the "top" K preferences from a user, ranked by the specific active strategy under consideration. The appropriate choice of K is application dependent: smaller values of K may lead to better recommendations with fewer queries, but require more frequent user interaction and user delay. We test different values of K below.

• Jokes: D = 1, α = 0.1, β0u = 0.1, β0v = 10

• Conference: D = 15, α = 2, β0u = β0v = 0.1

• Dating: D = 2, α = 2, β0u = β0v = 0.1

Each observed score is assigned a fixed small uncertainty value of $10^{-3}$. The exact value is unimportant as long as it conveys near-certainty compared to the model's uncertainty over predicted scores (we verified that this was the case in our experiments). For the $\bar{Y}$-based methods, which require sampling, we use 50 samples in all experiments. The ZM method, unlike the IP, is sensitive to the scale of the scores (the temperature of the system): a large scale will lead to more deterministic matches, while a low scale has the opposite effect. Visual inspection of the results showed that, for each data set, scaling the scores such that the maximum score has a value of 10 leads to an acceptable level of uncertainty. The elaboration of a more formal validation procedure is left for future work.

We compare query selection methods w.r.t. their matching performance, i.e., the matching objective value of Equation 6.1, using the match matrix given by the IP on the estimated scores and known scores $S^o$, evaluated on the full set of available scores. We use a random querying strategy, which selects unobserved items uniformly at random for each user, as a baseline. All figures show the number of queries per user on the x-axis. The y-axis indicates the difference in the matching objective value between a specific querying strategy and the baseline; positive differences indicate better performance relative to the baseline. The magnitude of this difference can best be understood relative to the number of users in the data set. For example, a difference of 300 in objective value for the 300 users in the Jokes data set means that users are matched to jokes that are better by one "score unit" on average (as a reminder, the scores of the jokes range from -10 to 10). Note that as we increase the number of queries, even random queries will eventually find good matches; in the limit, where all scores are observed, the matching performance of all methods will be identical (hence the bell-shaped curves and asymptotic convergence in our results).

We do not focus on running time in our experiments, since query determination can often be done off-line (depending on batch sizes). Having said that, even the most computationally intensive querying techniques are fast and can support online interaction: (a) in all 3 data sets, solving the IP takes a fraction of a second; (b) BPMF can be trained in a matter of a few minutes at most, and can be run asynchronously with query selection (for example, we could choose to re-run BPMF as soon as we have received a new, pre-defined number of scores, with query selection then using the most up-to-date learned model available); and (c) sampling scores is very fast, as the posterior distribution is Gaussian. Furthermore, given the above, our methods should scale to larger data sets, although the training time of BPMF may preclude fully online interaction.
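For orientation, the overall simulation can be condensed into a short, self-contained sketch. A trivial item-mean predictor stands in for BPMF, the capacitated matching is solved by the reviewer-row-duplication trick from the earlier sketch, and the greedy batch rule simply queries the highest predicted unobserved scores (an SM-style strategy); the sizes and data are synthetic, and the match-sensitive strategies of Section 6.3 would slot into the query-selection step.

```python
# Sketch: simplified simulation of batched active querying for matching on synthetic data.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(3)
R, P, slots, K, rounds = 6, 12, 2, 2, 4              # users, items, items per user, batch size, rounds
S_true = rng.integers(0, 4, size=(R, P)).astype(float)
observed = np.zeros((R, P), dtype=bool)
observed[np.arange(R)[:, None], rng.integers(0, P, size=(R, 3))] = True   # a few initial scores

def predict(S_true, observed):
    # Impute each unobserved entry with the observed item (column) mean.
    masked = np.where(observed, S_true, np.nan)
    item_mean = np.nan_to_num(np.nanmean(masked, axis=0), nan=np.nanmean(masked))
    return np.where(observed, S_true, item_mean)

def match_value(S_hat, S_eval):
    # Match on predicted scores, then evaluate the match on the full score matrix.
    expanded = np.repeat(S_hat, slots, axis=0)
    rows, cols = linear_sum_assignment(expanded, maximize=True)
    return S_eval[rows // slots, cols].sum()

for t in range(rounds):
    S_hat = predict(S_true, observed)
    for r in range(R):                               # greedy batch: top-K predicted unobserved scores
        cand = np.where(~observed[r])[0]
        observed[r, cand[np.argsort(S_hat[r, cand])[-K:]]] = True
    value = match_value(predict(S_true, observed), S_true)
    print(f"round {t}: matching objective on true scores = {value:.1f}")
```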

Figure 6.4: Matching performance for the active learning experiments (qbu: queries per batch per user) on (a) the Jokes data set (10 qbu), (b) the Conference data set (20 qbu), and (c) the Dating data set (20 qbu). Standard error is also shown. The triangle plot (using the right vertical axis) shows the absolute matching value of the random strategy.

Figure 6.5: Usage frequency of the fall-back query strategy for YM, $\bar{Y}$M, and $\bar{Y}$E on (a) the Jokes, (b) the Conference, and (c) the Dating data sets.

6.4.3 Results

We first investigate the performance of the different querying strategies on our three data sets using default batch sizes; these values of K were deemed natural given the domains (other values of K are discussed below). Figure 6.4(a) shows results for Jokes using batches of 10 queries per user per round (K = 10). Figures 6.4(b) and 6.4(c) show the Conference and Dating results, respectively, both with a batch size of 20. All users start with 20 observed scores: 15 are used for training and 5 for validation. We also experimented with a more realistic setting in which some users have few observed scores (e.g., new users); the results are qualitatively very similar and are not discussed here.

The relative performance of the active methods exhibits a fairly consistent pattern across all three domains, which permits us to draw some reasonably strong conclusions. (We do not report the performance of SM, which is consistently outperformed by the baseline in all experiments; SM typically selects all of its queries from among only a few items, namely those with high predicted average scores, and hence acquires no information about the vast majority of items.) First, we see that all methods except SE outperform the baseline in all domains. Recall that SE is essentially uncertainty sampling, a classic (match-insensitive) active learning method often used as a general baseline for active learning. It outperforms the random baseline only occasionally, most significantly after the first round of elicitation on Dating. Second, all of our proposed match-sensitive techniques outperform SE consistently on all data sets. Third, the match-sensitive approaches that leverage uncertainty over scores, namely $\bar{Y}$M and $\bar{Y}$E, typically outperform YM, especially after the initial rounds of elicitation. This difference in performance is most pronounced in the Conference domain.

We gain further insight into these results by examining the inner workings of these strategies. (We did not perform further experiments with ZM, since it appears that considering the inherent matching uncertainty does not lead to a win compared to the Y- and $\bar{Y}$-based methods.) Figure 6.5 shows the number of random (or fall-back) queries used, on average, by each of YM, $\bar{Y}$M, and $\bar{Y}$E. On all data sets, YM resorts to the fall-back strategy significantly earlier than the others, explaining YM's fall-off in performance and indicating that the diversity of potential matches identified by our probabilistic matching technique plays a vital role in match-sensitive active learning.

Sequential Querying

We employed a semi-synchronous querying procedure above, where all users are queried in parallel at each round. We now consider a different mode of interaction where, at each round, users are queried sequentially in round robin fashion. This allows the responses of earlier users within a round to influence the queries asked to later users—potentially reducing the total number of queries at the expense of


Figure 6.6: Matching performance for active learning using non-default parameters: (a) the Jokes data set (10 queries per batch per user, sequential); (b) the Conference data set (10 queries per batch, parallel); (c) the Conference data set (40 queries per batch, parallel); (d) the Conference data set with larger user and item constraints.

increased synchronization (and delay) among users. Figure 6.6(a) shows that our methods are robust to this modification of the querying procedure. Specifically, SE is quickly outperformed by all of the methods that are sensitive to the matching. Furthermore, both $\bar{Y}$M and $\bar{Y}$E, which use matching uncertainty, outperform YM overall.

Batch Sizes

The choice of the number of queries K per batch affects both the frequency with which the user interacts with the system and the overall match performance. For example, high values of K reduce the number of user "interactions" needed to reach a specific level of performance, at the expense of query efficiency (improvement in the matching objective per query). The "optimal" value of K depends on the actual recommendation application. Figures 6.6(b) and (c) show results with different values of K on Conference, using 10 and 40 queries per round, respectively. The relative performance of the active methods remains almost identical. As expected, absolute performance w.r.t. query efficiency is better with smaller values of K. The match-sensitive strategies clearly outperform the score-based techniques. Results are similar across all data sets.

Matching Constraints

Our results are also robust to the use of different matching constraints, specifically, bounds on the number of items per user and vice versa (i.e., $R_{\min}$, $R_{\max}$, $P_{\min}$, $P_{\max}$). Using the Conference data set, we increase the number of reviewers assigned to each paper from one to two. Figure 6.6(d) shows that the behaviour of the methods changes little, with both $\bar{Y}$-methods still outperforming all other methods. The other domains (not shown) exhibit similar results.

6.5 Conclusion and Future Opportunities

We investigated the problem of active learning for match-constrained recommender systems. We explored several different approaches to generating queries that are guided by the matching objective, and introduced a novel method for probabilistic matching that accounts for uncertainty in predicted scores. Experiments demonstrate the effectiveness of our methods in determining high-quality matches with significantly less elicitation of user preferences than that required by uncertainty sampling, a standard active learning method. Our results highlight the importance of choosing queries in a manner that is sensitive to both the matching objective and the uncertainty over predicted scores.

One effect of choosing which user preferences to label based on some objective is that the model will learn from (possibly) biased data. That is, the sampled data may provide the model with a biased view of the true underlying data distribution. This is a general problem of active learning known as sampling bias (for a formal description see, for example, [31]). By performing elicitation based on the matching objective we add a further source of bias: informally, the system is more likely to query scores that are expected to be high, since such pairs are more likely to be matched. In our data sets we have not found this to be a problem. However, in general, to reduce this added bias one may want to elicit scores based on both the matching and learning objectives. Another practical avenue is to query all users about a (small) common, fixed set of items. This would ensure a certain level of score diversity; in particular, it would ensure that the system has access to low scores for all users.

On a practical note, it has been very challenging to obtain informative uncertainty models using the preference data sets that we experimented with. We have shown that we are still able to leverage the model uncertainty to obtain performance gains. However, in general, uninformative uncertainty models may fundamentally limit the usefulness of active learning methods that (even indirectly) rely on model uncertainty.

There are many promising avenues of future research in match-constrained recommendation. We could explore different matching objectives, for example two-sided matching with stability constraints (as would be appropriate in online dating, as well as in paper-reviewer matching, where papers require "sufficient" expertise). We could also explore methods for eliciting side information from users in a way that is guided by the recommendation objective. Furthermore, higher-level, abstract queries (such as preferences over item categories or features) may significantly boost "gain per query" performance. In fact, one could use CSTM (Chapter 4) to model higher-level features such as item genres (or the subject areas of submissions and reviewers); initial experiments with such data showed some promise, which could pave the way to using CSTM as the underlying active-learning model. Another possibility would be to allow the active learning to choose between different types of queries (for example, queries about the subject areas of a reviewer or queries about user-item preferences); EVOI could be used as a principled way of selecting among queries of these different types. Modelling this extra level of user preferences would also be useful for cold-start items. For example, in the reviewer-to-paper matching domain, imagine that reviewer profiles are kept across conferences.
Then user preferences over paper subject areas could be used to obtain better score estimates for the submitted papers, which in turn would help the active learning. Without side information, the initial active learning queries would be no better than those of a random baseline. Finally, in this chapter, after new data was obtained we always re-trained the system using all available data. This approach is unlikely to scale, and online learning techniques, which could learn using only the newly available data, would be of interest.

Overall, explicit preference elicitation requires the voluntary participation of users. Conference reviewers may oblige, since they will reap immediate benefits. However, in other applications it may be harder to convince users of the advantages of engaging in this elicitation mechanism. In such cases, recommender systems may either have to be more subtle about the elicitation they adopt, for example by including an exploration policy within their recommendation objective, or even deduce preferences from user behaviour, for example by examining the list of pages a user browsed, the search queries they issued, or the information they communicated through their participation in online social networking sites.

Chapter 7

Conclusion

Throughout this thesis we have shown how we can both tailor existing machine learning models and methods and develop new ones to increase the performance and expand the capabilities of recommender systems. We now provide a summary of our work and highlight opportunities for future work.

7.1 Summary

Firstly, we established preference prediction as the core problem of interest in recommender systems. Current supervised learning methods are well suited to this problem. Specifically, supervised methods tailored to the specifics of recommender systems, such as collaborative filtering methods, have been shown to be excellent at predicting missing preferences in typical recommendation domains when user preference data is plentiful. However, in cold-start scenarios we must resort to leveraging side information, possibly including content information, about users and items. We showed how we can model textual user and item side information, using topic models, in a document-recommendation domain and obtain superior performance compared to state-of-the-art methods that use only users' preference scores.

Secondly, we introduced a simple framework which decomposes the recommendation problem into three interacting stages: (a) preference elicitation; (b) preference prediction; and (c) determination of recommendations. We showed how we can cast match-constrained problems, such as the paper-to-reviewer matching problem, into this framework. For match-constrained recommendations we experimentally demonstrated, using two conference datasets, the strong correlation between the methods' preference-prediction performance and their matching performance. Further, we exploited the synergy between the learning and matching objectives when using a non-linear mapping between utility and suitability.

Finally, using active learning, we explored the interaction between preference elicitation and matching in a match-constrained recommender system. The active querying methods we developed focused on improving the recommendation objective instead of the learning objective. To reach our goals we also developed a probabilistic matching procedure to account for the uncertainty in predicted preferences. Our methods, including those that use probabilistic matching, proved useful in querying user preferences for the matching problem. Overall, using our conference datasets, a dating dataset, and a jokes dataset, we showed that, together with preference prediction methods, active learning methods greatly reduce the elicitation burden on users and thus help alleviate the cold-start problem.


Interestingly, the non-research contribution of this thesis, the Toronto Paper Matching System, is the component that has seemingly had the most immediate impact on the community. Further, it has both revealed research opportunities, for example by allowing us to collect interesting data sets that were essential to the development of our work, and proven to be a good test bed for our research ideas.

7.2 Future Research Directions

As recommender systems become ubiquitous, several research opportunities will naturally present themselves. We have outlined some of these directions in the preceding chapters. We now outline a few additional directions that are closest to the work presented in this thesis.

1. A wide variety of sources of side information may be indicative of user preferences (for example, the different aspects of users' online behaviour, such as their search terms, the sites they visited and the frequency and length of the visits, their purchased items, and others). Learning simultaneously from all such sources has the potential to refine recommendations and, more generally, personalization models. Therefore, it is worth developing and analyzing models which can learn from these potentially heterogeneous sources of (side) information. Learning from combinations of heterogeneous data is a general challenge of machine learning and one of particular interest to recommender systems. Further, these additional sources of side information will often contain more expressive forms of user preferences. For example, deriving preferences from text (e.g., from reviews), where extracting user preferences cannot be done with a bag-of-words model, is still a challenge in machine learning (although the field has been progressing, especially when allowed to learn from large collections of labelled data [110]). One possible avenue for accomplishing this task is, similar to our approach using CSTM, to learn general higher-level representations of this side information and then use these representations as features in a user preference model. In general, this direction of research may require the exploration of interesting combinations of existing machine learning content models (for example, models of text or images) with preference prediction models.

2. In many recommendation domains it may be unreasonable to require the explicit elicitation of many preferences from users. Therefore, as we discussed in the last section of Chapter 6, it will be essential to effectively learn from weaker sources of user preferences such as users’ online behaviour. As a first step, there has already been some work using implicit user feedback [51] for recommendations. While extending this work is promising, methods that can aggregate weak forms of preference data into definitive user preferences will be of importance.

3. Current recommender systems typically work in a single item domain. For example, a system recommends either books or restaurants but not both. This is partly due to the mechanisms involved in creating user-item preference data sets. However, there are necessarily correlations between users' preferences across different domains, which can be exploited by a multi-domain recommender system. Further, a multi-domain recommender system has the potential to learn much finer-grained representations of user preferences, leading to better user personalization. Therefore, methods that work across multiple domains, and that eventually lead to more general models of user preferences, will be beneficial even beyond recommender systems.

To be useful across many domains, a recommender system will often be required to make recommendations in domains for which it has very little user preference information. For example, an online system will constantly need to adapt to novel items and users, as well as to users' interests as they evolve over time. Again, the use of side information, including content information, will be a necessity to quickly identify exploitable correlations between the different recommendation domains.

Ideally, recommender systems will not only recommend single items but will also be able to recommend structured lists of items. For example, a system could recommend full trip itineraries, including places to stay, sightseeing activities, and tickets to shows, or recommend a reading curriculum composed of a list of scientific or news articles for (gradually) learning about specific topics, or sets of ingredients to create harmonious recipes and meals. Such problems could be modelled using our current framework, by first having a preference learning objective followed by a combinatorial optimization step. In cases where structured label data exists we could then turn our attention to the work on structured output learning (for example, [123]) or, more appropriately, to work that can deal explicitly with the ultimate combinatorial optimization, such as Perturb-and-MAP [82] and others [120].

4. Finally, there are also other particularities of practical recommender systems which have often been ignored in the academic literature. Examples of such particularities include:

• research that treats missing preferences, in commonly available datasets, as missing at random

• research that elides the temporal and spatial contexts around recommendations

Such considerations will likely become of crucial importance for real-life recommender systems and will become more accessible to academics once datasets containing information relevant to these considerations become available.

We have presented recommender systems as a major beneficiary of advances in machine learning research, and specifically of supervised and active learning methods. It is likely that recommender systems, especially at their intersection with human behaviour modelling, will take on even more importance for machine learning techniques. This will especially be the case as larger user preference data sets become available, particularly if they contain extra features, such as more expressive information indicative of preferences.

Bibliography

[1] Deepak Agarwal and Bee-Chung Chen. Regression-based latent factor models. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, pages 19–28, New York, NY, USA, 2009. ACM. 40, 41, 42, 53, 62

[2] Deepak Agarwal and Bee-Chung Chen. fLDA: Matrix factorization through latent Dirichlet allocation. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM '10, pages 91–100, New York, NY, USA, 2010. ACM. 40, 41, 42, 52

[3] Robert Arens. Learning SVM ranking functions from user feedback using document metadata and active learning in the biomedical domain. In Johannes Fürnkranz and Eyke Hüllermeier, editors, Preference Learning, pages 363–383. Springer-Verlag, 2010. 22

[4] Suhrid Balakrishnan and Sumit Chopra. Collaborative ranking. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, WSDM ’12, pages 143–152, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-0747-5. 17, 64

[5] Krisztian Balog, Leif Azzopardi, and Maarten de Rijke. Formal models for expert finding in enterprise corpora. In Efthimis N. Efthimiadis, Susan T. Dumais, David Hawking, and Kalervo Järvelin, editors, Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR-06), pages 43–50, Seattle, Washington, USA, 2006. ACM. ISBN 1-59593-369-7. 53

[6] Krisztian Balog, Yi Fang, Maarten de Rijke, Pavel Serdyukov, and Luo Si. Expertise retrieval. Foundations and Trends in Information Retrieval, 6(2):127–256, February 2012. ISSN 1554-0669. 37

[7] Robert M. Bell and Yehuda Koren. Lessons from the Netflix prize challenge. SIGKDD Exploration Newsletter, 9:75–79, December 2007. ISSN 1931-0145. 8, 13, 20, 48

[8] Salem Benferhat and Jérôme Lang. Conference paper assignment. International Journal of Intelligent Systems, 16(10):1183–1192, 2001. 37, 66, 67, 68

[9] T. Bertin-Mahieux. Large-Scale Pattern Discovery in Music. PhD thesis, Columbia University, February 2013. 42

[10] Michael J. Best and Nilotpal Chakravarti. Active set algorithms for isotonic regression; a unifying framework. Math. Program., 47:425–439, 1990. 54


[11] Alina Beygelzimer, Daniel Hsu, John Langford, and Zhang Tong. Agnostic active learning without constraints. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R.S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 199–207. 2010. 18

[12] David M. Blei and John D. Lafferty. Dynamic topic models. In Proceedings of the Twenty-third International Conference of Machine Learning (ICML), 2006. 43

[13] David M. Blei and John D. Lafferty. A correlated topic model of science. Annals of Applied Statistics, 1(1):17–35, 2007. 44, 46, 48, 49, 50, 52, 54, 55

[14] David M. Blei and Jon D. McAuliffe. Supervised topic models. In Advances in Neural Information Processing Systems (NIPS), 2007. 43

[15] David M. Blei, Thomas L. Griffiths, Michael I. Jordan, and Joshua B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In Advances in Neural Information Processing Systems (NIPS), 2003. 43

[16] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003. 31, 32, 43, 44

[17] Michael Bloodgood and K. Vijay-Shanker. A method for stopping active learning based on stabilizing predictions and the need for user-adjustable stopping. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), 2009. 24

[18] Craig Boutilier, Richard S. Zemel, and Benjamin Marlin. Active collaborative filtering. In UAI, pages 98–106, Acapulco, 2003. 22, 23, 82

[19] Darius Braziunas and Craig Boutilier. Assessing regret-based preference elicitation with the UTPREF recommendation system. In Proceedings of the Eleventh ACM Conference on Electronic Commerce (EC-10), pages 219–228, Cambridge, MA, 2010. 8

[20] John S. Breese, David Heckerman, and Carl Myers Kadie. Empirical analysis of predictive algo- rithms for collaborative filtering. In G. F. Cooper and S. Moral, editors, Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pages 43–52, 1998. 9

[21] Klaus Brinker. Incorporating diversity in active learning with support vector machines. In Fawcett and Mishra [34], pages 59–66. ISBN 1-57735-189-4. 23

[22] Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical Report TR-2009-23, Department of Computer Science, University of British Columbia, November 2009. 78

[23] Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. Streaming variational Bayes, 2013. arXiv:1307.6769. 61

[24] Christopher J.C. Burges, Robert Ragno, and Quoc Viet Le. Learning to rank with nonsmooth cost functions. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 193–200. MIT Press, Cambridge, MA, 2007. 17

[25] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: From pairwise approach to listwise approach. Technical Report MSR-TR-2007-40, Microsoft Research, April 2007. 17

[26] Tianqi Chen, Hang Li, Qiang Yang, and Yong Yu. General functional matrix factorization using gradient boosting. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 436–444. JMLR Workshop and Conference Proceedings, 2013. 40, 41

[27] M. Claypool, A. Gokhale, T. Miranda, P. Murnikov, D. Netes, and M. Sartin. Combining content-based and collaborative filters in an online newspaper. In Proceedings of the ACM SIGIR ’99 Workshop on Recommender Systems: Algorithms and Evaluation, Berkeley, California, 1999. ACM. 40, 41

[28] William W. Cohen, Robert E. Schapire, and Yoram Singer. Learning to order things. Journal of Artificial Intelligence Research (JAIR), 10:243–270, May 1999. ISSN 1076-9757. 18

[29] Don Conry, Yehuda Koren, and Naren Ramakrishnan. Recommender systems for the conference paper assignment problem. In Proceedings of the Third ACM Conference on Recommender Systems, RecSys ’09, pages 357–360, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-435-5. 36, 40, 41, 53, 67

[30] Sanjoy Dasgupta and Daniel Hsu. Hierarchical sampling for active learning. In Proceedings of the Twenty-Fifth International Conference (ICML 2008), pages 208–215, 2008. 20

[31] Sanjoy Dasgupta and Daniel Hsu. Hierarchical sampling for active learning. In Proceedings of the Twenty-fifth International Conference on Machine learning (ICML), pages 208–215, 2008. 89

[32] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977. 45

[33] E. Bart, M. Welling, and P. Perona. Unsupervised organization of image collections: Taxonomies and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011. 62

[34] Tom Fawcett and Nina Mishra, editors. Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, 2003. AAAI Press. ISBN 1-57735-189-4. 95, 102

[35] Brendan J. Frey and Nebojsa Jojic. A comparison of algorithms for inference and learning in probabilistic graphical models. IEEE Trans. Pattern Anal. Mach. Intell., 27(9):1392–1416, September 2005. ISSN 0162-8828. 45

[36] David Gale and Lloyd S. Shapley. College admissions and the stability of marriage. American Mathematical Monthly, 69(1):9–15, 1962. ISSN 0002-9890. 24, 25

[37] Naveen Garg, Telikepalli Kavitha, Amit Kumar, Kurt Mehlhorn, and Julian Mestre. Assigning papers to referees. Algorithmica, 58(1):119–136, 2010. 37, 66, 67, 68

[38] Kostadin Georgiev and Preslav Nakov. A non-IID framework for collaborative filtering with restricted Boltzmann machines. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 1148–1156. JMLR Workshop and Conference Proceedings, May 2013. 13

[39] Mehmet Gönen, Suleiman Khan, and Samuel Kaski. Kernelized Bayesian matrix factorization. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 864–872. JMLR Workshop and Conference Proceedings, May 2013. 41, 42

[40] David Goldberg, David Nichols, Brian M. Oki, and Douglas Terry. Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35:61–70, December 1992. ISSN 0001-0782. 2, 7, 9

[41] Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval, 4(2):133–151, July 2001. ISSN 1386-4564. 83

[42] Judy Goldsmith and Robert H. Sloan. The AI conference paper assignment problem. In AAAI-07 Workshop on Preference Handling in AI, pages 53–57, Vancouver, 2007. 37, 66, 67, 68

[43] Yuhong Guo. Active instance sampling via matrix partition. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R.S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 802–810. 2010. 23

[44] Yuhong Guo and Dale Schuurmans. Discriminative batch mode active learning. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 593–600. MIT Press, Cambridge, MA, 2008. 23, 84

[45] Abhay Harpale and Yiming Yang. Personalized active learning for collaborative filtering. In SIGIR, pages 91–98, 2008. 19

[46] Guenter Hitsch and Ali Hortacsu. What makes you click? An empirical analysis of online dating. 2005 Meeting Papers 207, Society for Economic Dynamics, 2005. 24

[47] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. J. Mach. Learn. Res., 14(1):1303–1347, May 2013. ISSN 1532-4435. 61

[48] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’99, pages 50–57, New York, NY, USA, 1999. ACM. ISBN 1-58113-096-1. 12

[49] Thomas Hofmann. Latent semantic models for collaborative filtering. ACM Trans. Inf. Syst., 22 (1):89–115, 2004. 11, 12

[50] R.A. Howard. Information value theory. Systems Science and Cybernetics, IEEE Transactions on, 2(1):22–26, 1966. ISSN 0536-1567. doi: 10.1109/TSSC.1966.300074. 22

[51] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM ’08, pages 263–272, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-0-7695-3502-9. 92

[52] Aanund Hylland and Richard J. Zeckhauser. The efficient allocation of individuals to positions. Journal of Political Economy, 87(2):293–314, 1979. ISSN 0022-3808. 24, 25

[53] Mohsen Jamali and Martin Ester. TrustWalker: a random walk model for combining trust-based and item-based recommendation. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’09, pages 397–406, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-495-9. 40

[54] Mohsen Jamali and Martin Ester. A matrix factorization technique with trust propagation for recommendation in social networks. In Proceedings of the fourth ACM conference on Recommender systems, RecSys ’10, pages 135–142, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-906-0. 40, 41

[55] Tamas Jambor and Jun Wang. Optimizing multiple objectives in collaborative filtering. In ACM Recommender Systems, 2010. 18

[56] Tamas Jambor and Jun Wang. Goal-driven collaborative filtering: A directional error based approach. In Proc. of European Conference on Information Retrieval (ECIR), 2010. 15

[57] Kalervo Järvelin and Jaana Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In Proceedings of the 23rd annual international ACM SIGIR conference on Research and Development in Information Retrieval, SIGIR ’00, pages 41–48, New York, NY, USA, 2000. ACM. 16, 56

[58] Rong Jin and Luo Si. A Bayesian approach toward active learning for collaborative filtering. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI-04), pages 278–285. AUAI Press, 2004. 19

[59] Maryam Karimzadehgan and ChengXiang Zhai. Integer linear programming for constrained multi-aspect committee review assignment. Inf. Process. Manage., 48(4):725–740, July 2012. ISSN 0306-4573. doi: 10.1016/j.ipm.2011.09.004. URL http://dx.doi.org/10.1016/j.ipm.2011.09.004. 68

[60] Maryam Karimzadehgan, ChengXiang Zhai, and Geneva Belford. Multi-aspect expertise matching for review assignment. In Proceedings of the 17th ACM conference on Information and knowledge management, CIKM ’08, pages 1113–1122, New York, NY, USA, 2008. ACM. ISBN 978-1-59593-991-3. 67

[61] Yehuda Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’08, pages 426–434, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-193-4. 40, 41

[62] Yehuda Koren, Robert M. Bell, and Chris Volinsky. Matrix factorization techniques for recom- mender systems. IEEE Computer, 42(8):30–37, 2009. 8, 13

[63] Simon Lacoste-Julien, Fei Sha, and Michael I. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. In NIPS, 2008. 43

[64] Helge Langseth and Thomas Dyhre Nielsen. A latent model for collaborative filtering. Int. J. Approx. Reasoning, 53(4):447–466, June 2012. ISSN 0888-613X. 13

[65] Hugo Larochelle and Yoshua Bengio. Classification using discriminative restricted Boltzmann machines. In Proceedings of the Twenty-fifth International Conference on Machine Learning, ICML ’08, pages 536–543, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-205-4. 34

[66] Neil D. Lawrence and Raquel Urtasun. Non-linear matrix factorization with Gaussian processes. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 601–608, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-516-1. 8, 12, 41, 42

[67] David D. Lewis. A sequential algorithm for training text classifiers: corrigendum and additional data. SIGIR Forum, 29:13–19, September 1995. ISSN 0163-5840. 19

[68] Y. J. Lim and Y. W. Teh. Variational Bayesian approach to movie rating prediction. In Proceedings of KDD Cup and Workshop, 2007. 10

[69] Nathan N. Liu and Qiang Yang. EigenRank: a ranking-oriented approach to collaborative filtering. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’08, pages 83–90, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-164-4. 17, 18

[70] Hao Ma, Haixuan Yang, Michael R. Lyu, and Irwin King. SoRec: social recommendation using probabilistic matrix factorization. In Proceedings of the 17th ACM conference on Information and knowledge management, CIKM ’08, pages 931–940, New York, NY, USA, 2008. ACM. ISBN 978-1-59593-991-3. 40, 41

[71] Benjamin Marlin. Modeling user rating profiles for collaborative filtering. In Advances in Neural Information Processing Systems 16 (NIPS), 2003. 12

[72] Benjamin Marlin. Collaborative filtering: A machine learning perspective. Technical report, University of Toronto, 2004. 8, 14

[73] Benjamin M. Marlin and Richard S. Zemel. The multiple multiplicative factor model for collaborative filtering. In Carla E. Brodley, editor, ICML, volume 69 of ACM International Conference Proceeding Series. ACM, 2004. 12, 13

[74] Benjamin M. Marlin and Richard S. Zemel. Collaborative prediction and ranking with non-random missing data. In Proceedings of the third ACM conference on Recommender systems, RecSys ’09, pages 5–12, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-435-5. 14

[75] Benjamin M. Marlin, Richard S. Zemel, Sam T. Roweis, and Malcolm Slaney. Collaborative filtering and the missing at random assumption. In Ronald Parr and Linda C. van der Gaag, editors, UAI, pages 267–275. AUAI Press, 2007. ISBN 0-9749039-3-0. 14

[76] Paolo Massa and Paolo Avesani. Trust-aware recommender systems. In Proceedings of the 2007 ACM conference on Recommender systems, RecSys ’07, pages 17–24, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-730-8. 40

[77] B. McFee, T. Bertin-Mahieux, D. Ellis, and G. Lanckriet. The million song dataset challenge. In Proc. of the 4th International Workshop on Advances in Music Information Research (AdMIRe ’12), April 2012. 42

[78] Lorraine McGinty and Barry Smyth. On the role of diversity in conversational recommender systems. In Proceedings of the 5th international conference on Case-based reasoning: Research and Development, ICCBR’03, pages 276–290, Berlin, Heidelberg, 2003. Springer-Verlag. ISBN 3-540-40433-3. 2, 14

[79] David M. Mimno and Andrew McCallum. Expertise modeling for matching papers with reviewers. In Pavel Berkhin, Rich Caruana, and Xindong Wu, editors, Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 500–509, San Jose, California, 2007. ACM. ISBN 978-1-59593-609-7. 29, 31, 48, 53

[80] Radford M. Neal and Geoffrey E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355–368. MIT Press, Cambridge, MA, USA, 1999. ISBN 0-262-60032-3. 45

[81] Fredrik Olsson and Katrin Tomanek. An intrinsic stopping criterion for committee-based active learning. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, CoNLL ’09, pages 138–146, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics. ISBN 978-1-932432-29-9. 24

[82] G. Papandreou and A. Yuille. Perturb-and-map random fields: Using discrete optimization to learn and sample from energy models. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 193–200, Barcelona, Spain, November 2011. 93

[83] Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988. ISBN 0-934613-73-7. 80

[84] Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D. Manning. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In EMNLP, 2009. 43

[85] Jason D. M. Rennie and Nathan Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Luc De Raedt and Stefan Wrobel, editors, ICML, volume 119 of ACM International Conference Proceeding Series, pages 713–719. ACM, 2005. ISBN 1-59593-180-5. 11

[86] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. GroupLens: An open architecture for collaborative filtering of netnews. In Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work (CSCW), pages 175–186, Chapel Hill, NC, October 1994. Association for Computing Machinery. 9

[87] Paul Resnick and Hal R. Varian. Recommender systems. Communications of the ACM, 40(3): 56–58, March 1997. ISSN 0001-0782. 2

[88] Francesco Ricci, Lior Rokach, Bracha Shapira, and Paul B. Kantor, editors. Recommender Systems Handbook. Springer, 2011. ISBN 978-0-387-85819-7. 1, 2

[89] Philippe Rigaux. An iterative rating method: application to web-based conference management. In SAC, pages 1682–1687, 2004. 36, 77

[90] Irina Rish and Gerald Tesauro. Active collaborative prediction with maximum margin matrix factorization. In ISAIM 2008, 2008. 20

[91] Marko A. Rodriguez and Johan Bollen. An algorithm to determine peer-reviewers. In Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM-08), pages 319–328, Napa Valley, California, USA, 2008. ACM. ISBN 978-1-59593-991-3. 37, 67

[92] Eytan Ronn. NP-complete stable matching problems. J. Algorithms, 11(2):285–304, May 1990. ISSN 0196-6774. 25

[93] David A. Ross and Richard S. Zemel. Multiple cause vector quantization. In Advances in Neural Information Processing Systems 15 (NIPS), pages 1017–1024, 2002. 12

[94] Alvin E. Roth. The evolution of the labor market for medical interns and residents: A case study in game theory. Journal of Political Economy, 92(6):991–1016, 1984. 24, 25

[95] Alvin E. Roth and Elliott Peranson. The redesign of the matching market for American physicians: Some engineering aspects of economic design. Working Paper 6963, National Bureau of Economic Research, February 1999. 25

[96] Neil Rubens and Masashi Sugiyama. Influence-based collaborative active learning. In Proceedings of the 2007 ACM conference on Recommender systems, RecSys ’07, pages 145–148, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-730-8. 19

[97] Ruslan Salakhutdinov and Geoffrey Hinton. Replicated softmax: an undirected topic model. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22 (NIPS), pages 1607–1614. 2009. 34, 55

[98] Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the International Conference on Machine Learning, volume 25, 2008. 10, 11, 48, 85

[99] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems (NIPS), volume 20, 2008. 10, 40, 41, 48, 52, 55, 61

[100] Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the International Conference on Machine Learning, volume 24, pages 791–798, 2007. 13

[101] Andrew Ian Schein. Active learning for logistic regression. PhD thesis, University of Pennsylvania, Philadelphia, PA, USA, 2005. AAI3197737. 19

[102] B. Settles. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009. 19, 20, 21, 23, 24, 81

[103] B. Settles, M. Craven, and S. Ray. Multiple-instance active learning. In Advances in Neural Information Processing Systems (NIPS), volume 20, pages 1289–1296. MIT Press, 2008. 21

[104] Burr Settles and Mark Craven. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’08, pages 1070–1079, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics. 19

[105] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT), 1992. 20

[106] Hanhuai Shan and Arindam Banerjee. Generalized probabilistic matrix factorizations for collaborative filtering. In Proceedings of the 2010 IEEE International Conference on Data Mining, ICDM ’10, pages 1025–1030, Washington, DC, USA, 2010. IEEE Computer Society. ISBN 978-0-7695-4256-0. 41, 42, 52

[107] Yue Shi, Martha Larson, and Alan Hanjalic. List-wise learning to rank with matrix factorization for collaborative filtering. In Proceedings of the fourth ACM conference on Recommender systems, RecSys ’10, pages 269–272, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-906-0. 17

[108] Malcolm Slaney. Web-scale multimedia analysis: Does content matter? IEEE MultiMedia, 18(2): 12–15, April 2011. ISSN 1070-986X. 42

[109] Alex J. Smola, S. V. N. Vishwanathan, and Quoc V. Le. Bundle methods for machine learning. In Advances in Neural Information Processing Systems 20 (NIPS), 2007. 17

[110] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Stroudsburg, PA, October 2013. Association for Computational Linguistics. 92

[111] Nathan Srebro. Learning with matrix factorizations. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2004. AAI0807530. 9

[112] Nathan Srebro and Tommi Jaakkola. Weighted low-rank approximations. In Fawcett and Mishra [34], pages 720–727. ISBN 1-57735-189-4. 10

[113] Nathan Srebro, Jason D. M. Rennie, and Tommi Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems (NIPS), 2004. 10, 11, 13

[114] David H. Stern, Ralf Herbrich, and Thore Graepel. Matchbox: large scale online Bayesian recommendations. In Juan Quemada, Gonzalo León, Yoëlle S. Maarek, and Wolfgang Nejdl, editors, WWW, pages 111–120. ACM, 2009. ISBN 978-1-60558-487-4. 67

[115] David H. Stern, Horst Samulowitz, Ralf Herbrich, Thore Graepel, Luca Pulina, and Armando Tacchella. Collaborative expert portfolio management. In Maria Fox and David Poole, editors, AAAI. AAAI Press, 2010. 67

[116] Johan A. K. Suykens and Joos Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293–300, 1999. 11

[117] Wenbin Tang, Jie Tang, and Chenhao Tan. Expertise matching via constraint-based optimization. In IEEE/WIC/ACM International Conference on Web Intelligence (WI-10) and Intelligent Agent Technology (IAT-10), volume 1, pages 34–41, Toronto, Canada, 2010. IEEE Computer Society. ISBN 978-0-7695-4191-4. 68

[118] Wenbin Tang, Jie Tang, Tao Lei, Chenhao Tan, Bo Gao, and Tian Li. On optimization of expertise matching with various constraints. Neurocomputing, 76(1):71–83, January 2012. ISSN 0925-2312. 68

[119] Daniel Tarlow. Efficient Machine Learning with High Order and Combinatorial Structures. PhD thesis, University of Toronto, February 2013. 75

[120] Daniel Tarlow, Ryan Prescott Adams, and Richard S Zemel. Randomized optimum models for structured prediction. In Proceedings of the 15th Conference on Artificial Intelligence and Statistics, pages 21–23, 2012. 93

[121] Daniel Tarlow, Kevin Swersky, Richard S Zemel, Ryan P Adams, and Brendan J Frey. Fast exact inference for recursive cardinality models. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence, 2012. 80

[122] Camillo J. Taylor. On the optimal assignment of conference papers to reviewers. Technical Report MS-CIS-08-30, University of Pennsylvania, 2008. 36, 64, 66, 68

[123] Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. Support vector machine learning for interdependent and structured output spaces. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML’04, pages 104–, New York, NY, USA, 2004. ACM. ISBN 1-58113-838-5. 93

[124] Vladimir N. Vapnik. The nature of statistical learning theory. Springer-Verlag New York, Inc., New York, NY, USA, 1995. ISBN 0-387-94559-8. 7

[125] Maksims Volkovs and Rich Zemel. Collaborative ranking with 17 parameters. In P. Bartlett, F.C.N. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2303–2311. 2012. 17

[126] Chong Wang and David M. Blei. Collaborative topic modeling for recommending scientific articles. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’11, pages 448–456, New York, NY, USA, 2011. ACM. 14, 40, 41, 42, 48, 52, 55

[127] Chong Wang and David M. Blei. Variational inference in nonconjugate models. Journal of Machine Learning Research, 14(1):1005–1031, April 2013. 44, 60

[128] Markus Weimer, Alexandros Karatzoglou, Quoc V. Le, and Alex J. Smola. CofiRank: maximum margin matrix factorization for collaborative ranking. In NIPS, 2007. 14, 16, 17

[129] Markus Weimer, Alexandros Karatzoglou, and Alex Smola. Adaptive collaborative filtering. In Proceedings of the 2008 ACM conference on Recommender systems, RecSys ’08, pages 275–282, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-093-7. 11, 17

[130] Jason Weston, Chong Wang, Ron Weiss, and Adam Berenzweig. Latent collaborative retrieval. In ICML. icml.cc / Omnipress, 2012. 41, 42

[131] Zuobing Xu, Ram Akella, and Yi Zhang. Incorporating diversity and density in active learning for relevance feedback. In Giambattista Amati, Claudio Carpineto, and Giovanni Romano, editors, ECIR, volume 4425 of Lecture Notes in Computer Science, pages 246–257. Springer, 2007. ISBN 978-3-540-71494-1. 23

[132] Kai Yu, Anton Schwaighofer, Volker Tresp, Xiaowei Xu, and Hans-Peter Kriegel. Probabilistic memory-based collaborative filtering. IEEE Trans. on Knowl. and Data Eng., 16(1):56–69, January 2004. ISSN 1041-4347. doi: 10.1109/TKDE.2004.1264822. URL http://dx.doi.org/10.1109/TKDE.2004.1264822. 23

[133] Chengxiang Zhai and John Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst., 22(2):179–214, April 2004. ISSN 1046-8188. 31