Collaborative Topic Modeling for Recommending GitHub Repositories
Naoki Orii
School of Computer Science, Carnegie Mellon University
Pittsburgh, PA 15213, USA
[email protected]

ABSTRACT

The rise of distributed version control systems has led to a significant increase in the number of open source projects available online. As a consequence, finding relevant projects has become more difficult for programmers. Item recommendation provides a way to solve this problem. In this paper, we utilize a recently proposed algorithm that combines traditional collaborative filtering and probabilistic topic modeling. We study a large dataset from GitHub, a social networking and open source hosting site for programmers, and compare the method against traditional methods. We also provide interpretations of the latent structure for users and repositories.

1. INTRODUCTION

While an increase in the number of available open source software projects certainly benefits the open source ecosystem, it has become more difficult for programmers to find projects of interest. In 2009, GitHub hosted a recommendation contest to recommend repositories to users. The training dataset was based on 440,000 user-watches-repository relationships given by over 56,000 users to nearly 121,000 repositories. The test dataset consisted of 4,800 users (all of whom appear in the training data), and the goal was to recommend up to 10 repositories to each test user.

The problem of item recommendation has been studied extensively, most notably in the context of the Netflix Prize. However, there is a distinct difference between the Netflix Prize and GitHub's recommendation contest: while both contests provide a user-item matrix, the GitHub contest additionally allows us to consider source code data.
Since the mid-2000s, there has been increased adoption of distributed version control systems among software developers. Several mature systems such as Bazaar, Mercurial, and Git have appeared, along with hosting sites such as Launchpad, Bitbucket, and GitHub. Combined with these hosting sites, the distributed version control paradigm has led to the adoption of collaborative software development, with a significant increase in the number of open source software projects. In particular, GitHub (https://github.com/) has attracted the largest user base of these websites, with 2.7 million users and over 4.5 million hosted projects¹. The site hosts a wide array of projects, ranging from the Linux kernel to popular web application frameworks such as Ruby on Rails, as well as non-software projects such as the German Federal Law.

In addition to its role as a hosting site, GitHub also functions as a social network for programmers. Projects on GitHub, also known as repositories, have profile pages. Users can download, fork², and commit to any repository. A key feature of GitHub is that it allows users to watch repositories that they find interesting and want to keep track of³.

The source code of a software library can tell us rich information about the content of the repository, and by exploiting it we expect to improve recommendation results. A recent paper formulated this idea, combining traditional collaborative filtering on the user-item matrix with probabilistic topic models on text corpora. In this paper, we apply this method to a large dataset from GitHub.

¹ As of December 2012.
² Forking is a feature on GitHub that allows a user to copy another user's project, either to contribute to it or to use it as a starting point for his/her own project.
³ Note that the semantics of watching repositories on GitHub changed in August 2012 (https://github.com/blog/1204-notifications). GitHub has introduced stars, which function similarly to watched repositories in the previous sense. Under the new watch functionality, once a user watches a repository, he/she receives notifications for updates in discussions (project issues, pull requests, and comments).

2. PROBLEM DEFINITION

We assume there are I users and J items. Let R = {r_ij} denote the I × J user-item matrix, where each element r_ij ∈ {0, 1} represents whether or not user i "favorited" item j. While r_ij = 1 indicates that user i is interested in item j, note that r_ij = 0 does not necessarily mean that the user is not interested in the item: it can also be the case that user i does not know about item j.

In this paper, we consider two tasks: (i) in-matrix prediction and (ii) out-of-matrix prediction. For in-matrix prediction, the task is to estimate the missing values in R based on the known values. For out-of-matrix prediction, the task is to predict user interest in items that do not appear in R; in this paper, this amounts to recommending new repositories that have never been watched by a single user.

3. METHODS

We first give brief explanations of probabilistic matrix factorization and probabilistic topic models. We then describe Collaborative Topic Regression, which combines the two methods.

3.1 Probabilistic Matrix Factorization

The basic idea behind latent factor models is that user preference is determined by a small number of unobserved, latent factors. The goal is to uncover the latent user and item features that explain the observed ratings R. User i is represented by a latent vector u_i ∈ R^K, and item j by a latent vector v_j ∈ R^K, where K is typically chosen such that K ≪ I, J. The predicted rating r̂_ij is given by the inner product of the two latent vectors:

    r̂_ij = u_i^T v_j    (1)

Thus, given R, the problem is to compute the latent feature vectors u and v. We commonly do this by minimizing the following regularized square error:

    min_{U,V} Σ_{i,j} (r_ij − u_i^T v_j)² + λ_u Σ_i ||u_i||² + λ_v Σ_j ||v_j||²    (2)

where λ_u and λ_v are regularization parameters.

It is also possible to adopt a probabilistic approach to matrix factorization [9]. We can imagine a simple generative model using a probabilistic linear model with Gaussian observation noise:

1. For each user i, draw the user latent vector u_i ∼ N(0, λ_u⁻¹ I_K)
2. For each item j, draw the item latent vector v_j ∼ N(0, λ_v⁻¹ I_K)
3. For each user-item pair (i, j), draw the response

    r_ij ∼ N(u_i^T v_j, c_ij⁻¹)    (3)

where I_K is the K-dimensional identity matrix and c_ij measures our confidence in observing r_ij. As discussed earlier, we are confident that user i is interested in item j when r_ij = 1, but we are less confident that i is not interested in j when r_ij = 0. Accordingly, we use different values of c_ij depending on the value of r_ij:

    c_ij = a if r_ij = 1;  b if r_ij = 0    (4)

where a > b > 0.

3.2 Probabilistic Topic Models

Topic modeling algorithms can be used to automatically extract topics from a corpus of text documents. Each topic is a distribution over terms that gives high probability to a group of tightly co-occurring words, and a document can be represented by a small set of topics.

Source code in software projects can also be treated as a collection of text documents. Programmers typically give meaningful names to variables, types, and functions; unless they purposely try to obfuscate, minify, or compress the code, they adhere to naming conventions that improve its readability. In this paper, we use the topics discovered in this way to improve recommendation, as described in the following section.

The simplest form of a probabilistic topic model is Latent Dirichlet Allocation (LDA) [3]. The generative process for LDA is formulated as follows:

1. Draw topic proportions θ_j ∼ Dirichlet(α) for document w_j
2. For each term n in w_j,
   (a) Draw topic assignment z_jn ∼ Mult(θ_j)
   (b) Draw word w_jn ∼ Mult(β_{z_jn})

3.3 Collaborative Topic Regression

The collaborative topic regression (CTR) model combines traditional collaborative filtering with topic modeling [10]. Note that the term "item" used in collaborative filtering and the term "document" used in LDA refer to the same thing; unless otherwise noted, we will use the two terms interchangeably from now on.

As in LDA, in CTR each item j is assigned a topic proportion θ_j that is used to generate its words. A naïve approach would be to use θ_j directly as the item latent vector in equation 3:

    r_ij ∼ N(u_i^T θ_j, c_ij⁻¹)    (5)

Instead of taking this approach, CTR exploits the user data to obtain an "adjusted" item latent vector v_j. The generative process for CTR is formulated as follows:

1. For each user i, draw the user latent vector u_i ∼ N(0, λ_u⁻¹ I_K)
2. For each item j,
   (a) Draw topic proportions θ_j ∼ Dirichlet(α)
   (b) Draw the item latent offset ε_j ∼ N(0, λ_v⁻¹ I_K) and set the item latent vector to v_j = ε_j + θ_j
   (c) For each word w_jn,
       i. Draw topic assignment z_jn ∼ Mult(θ_j)
       ii. Draw word w_jn ∼ Mult(β_{z_jn})
3. For each user-item pair (i, j), draw the rating

    r_ij ∼ N(u_i^T v_j, c_ij⁻¹)    (6)

Note that in Step 2(b), the offset ε_j is added to item j's topic proportion θ_j in order to obtain the adjusted item latent vector v_j.

The graphical representation of the model is given in Figure 1: the top part of the model, involving α, θ, z, w, and β over N words and K topics, represents the repository content and is essentially an LDA model; the bottom half deals with the user-repository data.

[Figure 1: Graphical representation of the CTR model.]