A Decision Tree Based Recommender System
Amir Gershman, Amnon Meisels, Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel, amirger,[email protected]
Karl-Heinz Lüke, Deutsche Telekom AG, Laboratories, Innovation Development, Ernst-Reuter-Platz 7, D-10587 Berlin, Germany, [email protected]
Lior Rokach, Alon Schclar, Arnon Sturm, Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel, liorrk, schclar,[email protected]
Abstract: A new method for decision-tree-based recommender systems is proposed. The proposed method includes two major innovations. First, the decision tree produces lists of recommended items at its leaf nodes, instead of single items. This reduces the amount of search needed when using the tree to compile a recommendation list for a user, and consequently enables the recommender system to scale. The second major contribution of the paper is the splitting method for constructing the decision tree. Splitting is based on a new criterion: the least probable intersection size. The new criterion computes, for each potential split, the probability of obtaining its intersection size in a random split, and selects the split whose intersection size is least probable. The proposed decision-tree-based recommender system was evaluated on a large sample of the MovieLens dataset and is shown to outperform the quality of recommendations produced by the well-known information gain splitting criterion.
1 Introduction
Recommender Systems (RS) propose useful and interesting items to users in order to increase both the seller's profit and the buyer's satisfaction. They contribute to the commercial success of many on-line ventures such as Amazon.com or NetFlix [Net] and are a very active research area. Examples of recommended items include movies, web pages, books, news items and more. Often an RS attempts to predict the rating a user will give to items
based on her past ratings and the ratings of other (similar) users. Decision trees have previously been used as a model-based approach for recommender systems. The use of decision trees for building recommendation models offers several benefits, such as efficiency and interpretability [ZI02], and flexibility in handling a variety of input data types (ratings, demographic, contextual, etc.). The decision tree forms a predictive model which maps the input to a predicted value based on the input's attributes. Each interior node in the tree corresponds to an attribute, and each arc from a parent to a child node represents a possible value or a set of values of that attribute. The construction of the tree begins with a root node and the input set. An attribute is assigned to the root, and arcs and sub-nodes for each set of values are created. The input set is then split by the values, so that each child node receives only the part of the input set which matches the attribute value specified by the arc to that child node. The process repeats recursively for each child until splitting is no longer feasible: either a single classification (predicted value) can be applied to each element in the divided set, or some other stopping threshold is reached.

A major weakness of using decision trees as a prediction model in RS is the need to build a huge number of trees (either one for each item or one for each user). Moreover, the model can only compute the expected rating of a single item at a time. To provide recommendations to the user, we must traverse the tree(s) from root to leaf once for each item in order to compute its predicted rating. Only after computing the predicted ratings of all items can the RS provide the recommendations (the items with the highest predicted ratings). Thus decision trees in RS do not scale well with respect to the number of items. We propose a modification to the decision tree model to make it of practical use for larger-scale RS.
Instead of predicting the rating of an item, the decision tree returns a weighted list of recommended items. Thus, with just a single traversal of the tree, recommendations can be constructed and provided to the user. This variation of decision-tree-based RS is described in Section 2. The second contribution of this paper is the introduction of a new heuristic criterion for building the decision tree. Instead of picking as the split attribute the one which produces the largest information gain ratio, the proposed heuristic looks at the number of shared items between the divided sets. The split which has the lowest probability of producing its number of shared items, when compared to a random split, is picked. This heuristic is described in further detail in Section 3. We evaluate our new heuristic and compare it to the information gain heuristic used by the popular C4.5 algorithm ([Qui93]) in Section 4.
2 RS-Adapted Decision Tree
In recommender systems the input set for building the decision tree is composed of Ratings. Ratings can be described as a relation over users, items and the ratings users assign to items. Attributes can describe the users, such as the user's age, gender and occupation. Attributes can also describe the items, for example their weight, price and dimensions. Rating is the target attribute which the decision tree classifies by. Based on the training set, the system attempts to predict the Rating of items the user does not have a Rating for, and recommends to the user the items with the highest predicted Rating.

The construction of a decision tree is performed by a recursive process. The process starts at the root node with an input set (training set). At each node an item attribute is picked as the split attribute. For each possible value (or set of values), child nodes are created and the parent's set is split between the child nodes, so that each child node receives as input set all items that have the appropriate value(s) corresponding to that child node. Picking the split attribute is done heuristically, since we cannot know which split will produce the best tree (the tree that produces the best results for future input); for example, the popular C4.5 algorithm ([Qui93]) uses a heuristic that picks the split that produces the largest information gain out of all possible splits. One of the attributes is pre-defined as the target attribute. The recursive process continues until all the items in the node's set share the same target attribute value or the number of items reaches a certain threshold. Each leaf node is assigned a label (classifying its set of items); this label is the shared target attribute value, or the most common value in case there is more than one such value. Decision trees can be used for different recommender system approaches:
• Collaborative Filtering - Breese et al. [BHK98] used decision trees for building a collaborative filtering system. Each instance in the training set refers to a single customer. The training set attributes refer to the feedback provided by the customer for each item in the system. In this case a dedicated decision tree is built for each item. For this purpose, the feedback provided for the targeted item (for instance like/dislike) is considered to be the decision that needs to be predicted, while the feedback provided for all other items is used as the input attributes (decision nodes). Figure 1 (left) illustrates an example of such a tree, for movies.

• Content-Based Approach - Li and Yamada [LY04] and Bouza et al. [BRBG08] propose to use content features to build a decision tree. A separate decision tree is built for each user and is used as a user profile. The features of each of the items are used to build a model that explains the user's preferences. The information gain of every feature is used as the splitting criterion. Figure 1 (right) illustrates Bob's profile. It should be noted that although this approach is interesting from a theoretical perspective, the precision that was reported for this system is worse than that of recommending the average rating.

• Hybrid Approach - A hybrid decision tree can also be constructed. Only a single tree is constructed in this approach. The tree is similar to the collaborative approach in that it takes user attributes as attributes to split by (such as her liking/disliking of a certain movie), but the attributes it uses are general attributes that represent the user's preferences in the general case, based on the content of the items. The attributes are constructed based on the user's past ratings and the content of the items. For example, a user who rated all comedy movies negatively is assigned a low value in a "degree of liking comedy movies" attribute. Similarly to the collaborative approach,
the tree constructed is applicable to all users. However, it is now also applicable to all items, since the new attributes represent the user's preferences for all items and not just a single given item. Figure 2 illustrates such a hybrid tree.
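As a sketch of how such a content-derived user attribute might be computed (the function name, the data layout and the averaging scheme are our assumptions; the paper only gives the comedy example):

```python
def degree_of_liking(genre, user_ratings, genres_of, max_rating=5.0):
    """Average normalized rating the user gave to items of `genre`.
    user_ratings: {item: rating}; genres_of: {item: set of genres}.
    Returns a value in [0, 1], or None if the user rated no such items."""
    rated = [r for item, r in user_ratings.items() if genre in genres_of[item]]
    if not rated:
        return None  # no evidence; the attribute is undefined for this user
    return sum(rated) / (len(rated) * max_rating)

# A user who rated the comedy "Airplane!" poorly gets a low comedy score:
genres_of = {"Airplane!": {"Comedy"}, "Alien": {"Horror"}}
ratings = {"Airplane!": 1, "Alien": 5}
print(degree_of_liking("Comedy", ratings, genres_of))  # → 0.2
```

Such derived attributes can then be split on with thresholds (e.g. "degree of liking comedy > 0.5"), exactly like any other numeric user attribute.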
Consider a general case with a data set containing n users, m items, and an average decision tree height of h. The collaborative filtering based RS requires m trees to be constructed, one for each item. When a user wants to receive a recommendation on what movie to watch, the system traverses all trees, from root to leaf, until it finds an item the user would like to view. The time complexity in this case is therefore O(h · m). This might be too slow for a large system that needs to provide fast, on-demand recommendations to users. The content-based approach requires n trees to be constructed, one for each user. When a user wants to receive a recommendation, the system needs to traverse the user's tree from root to leaf once for each item, until it finds an item the user would like to view. The time complexity in this case is therefore O(h · m). Similarly, in the hybrid approach, the tree needs to be traversed once for each item, and the time complexity is also O(h · m).
Figure 1: Left: A CF decision tree for whether users like the movie "The Usual Suspects" based on their preferences for other movies such as The Godfather, Pulp Fiction, etc. A leaf labeled "L" or "D" correspondingly indicates that the user likes/dislikes the movie "The Usual Suspects". Right: A CB decision tree for Bob.
In systems which require fast computation of recommendations and have many possible items to recommend, all of the above decision-tree-based RS would be impractical. Therefore we propose a modification of the decision tree to better fit RS and to provide recommendations to users faster.

Our proposed algorithm is similar to the ID3 algorithm [Qui86] and uses the hybrid approach. Because we use the hybrid approach, only a single tree is needed, and the attributes to split by are only attributes that describe users. These attributes can be computed based on the user's past ratings and the content of the items, as shown in the example in Figure 2, but they can also include user profile attributes such as age, gender, etc. The major variation from the ID3 algorithm is in the leaf nodes of the tree. Instead of creating leaf nodes with a label that predicts the target attribute value (such as a rating), we propose to construct a recommendation list out of the leaf's input set and save this list at the leaf node as its
label. When a user wishes to receive recommendations, the tree is traversed based on the user's attributes until a leaf node is reached. The node contains a pre-computed recommendation list, and this list is returned to the user. Thus the time complexity is reduced to only O(h).

Figure 2: An example CF-CB hybrid decision tree.

Building the recommendation list at the leaf node can be done in various ways. A simple solution, selected here, is to compute the weighted average of the ratings over all tuples in the leaf's Ratings set and to recommend the items with the highest weighted average first. Consider the rightmost leaf in the example tree in Figure 2. For any tuple (i, u, r) in the Ratings set at this leaf node (i, u and r denote an item, a user and a rating, respectively), we know that u has a degree of liking the genre Comedy of more than 0.5 and a degree of liking movies with the actor George Clooney of more than 0.7. All the items rated by users such as u appear in this Ratings set, along with the ratings each user submitted. This leaf therefore contains the ratings of users similar to u. We assume that if we now pick the items which were rated the highest by the users similar to u, these would form a good recommendation for this user. Therefore, we order all items based on a weighted average (since items can appear more than once, when more than one user rated them) and set this list as the recommendation list of this leaf node. The algorithm is presented in detail in Algorithm 1.
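The leaf-list construction described above can be sketched as follows (a minimal sketch, assuming Ratings tuples of the form (item, user, rating) and a simple mean with popularity as tie-breaker, since the paper does not pin down the exact weighting scheme):

```python
from collections import defaultdict

def recommendation_list(ratings):
    """Build a leaf's recommendation list: average the rating of each item
    over all tuples in the leaf's Ratings set, then order the items by that
    average (ties broken by how often the item was rated -- an assumption)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for item, _user, r in ratings:
        totals[item] += r
        counts[item] += 1
    return sorted(totals,
                  key=lambda i: (totals[i] / counts[i], counts[i]),
                  reverse=True)

leaf = [("Fight Club", "u1", 5), ("Se7en", "u2", 4), ("Fight Club", "u3", 4)]
print(recommendation_list(leaf))  # → ['Fight Club', 'Se7en']
```

The list is computed once when the tree is built, so answering a user query only requires the O(h) traversal plus a constant-time lookup of the stored list.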
Algorithm 1 RS-Adapted-Decision-Tree(Ratings)
  create root node
  if Ratings.size < threshold then
    root.recommendations ← recommendation_list(Ratings)
    return root
  else
    let A be the user attribute that best classifies the input
    for each possible value v of A do
      add a branch b below root, labeled (A = v)
      let Ratings_v be the subset of Ratings that have the value v for A
      add RS-Adapted-Decision-Tree(Ratings_v) below this new branch b
    return root

3 Least Probable Intersections

Decision trees seek to provide good results for new, unseen cases, based on the model constructed from the training set. To accomplish this, the construction algorithm strives for a small tree (in number of nodes) that performs well on the training set. It is believed that a small tree generalizes better, avoids overfitting, and forms a simpler representation for humans to understand [Qui86]. C4.5 [Qui93] is a popular algorithm for constructing such decision trees. It uses the criterion of normalized information gain [Mit97] to pick the attribute by which to split each node. The attribute with the largest normalized information gain is picked, as it provides the largest reduction in entropy. Since this is a heuristic, it does not guarantee the smallest possible tree.

In recommender systems the input set for building the decision tree is composed of Ratings. Ratings can be described as a relation, for which we use the following notation:
• items(S) = π_ItemID(S), where π is the projection operation in relational algebra: the set of all ItemIDs that appear in a set S.

• o_i(S) = |σ_ItemID=i(S)|, where σ is the selection operation in relational algebra: the number of occurrences of the ItemID i in the set S.
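In Python terms these two operators are just a set projection and a multiset count (a sketch; we assume Ratings tuples of the form (item, user, rating)):

```python
from collections import Counter

def items(S):
    """items(S) = pi_ItemID(S): the set of all ItemIDs appearing in S."""
    return {item for item, _user, _rating in S}

def occurrences(S):
    """o_i(S) = |sigma_{ItemID=i}(S)| for every i, as a Counter."""
    return Counter(item for item, _user, _rating in S)

S = [("a", "u1", 5), ("a", "u2", 3), ("b", "u1", 4)]
print(items(S))               # → {'a', 'b'}
print(occurrences(S)["a"])    # → 2
```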
Let S_q (q denotes the number of ratings) be a random binary partition of the tuples in Ratings into two sub-relations A and B consisting of k and q−k tuples, respectively. We are interested in the probability distribution of S_q ≡ |items(A) ∩ items(B)|.

First, let us find the probability of an item belonging to the intersection. The probability that all o_i occurrences of an item i fall in the set A is (k/q)^{o_i}. Similarly, the probability that all o_i occurrences of item i fall in the set B is ((q−k)/q)^{o_i}. In all other cases item i appears in the intersection; thus the probability P_i that item i belongs to items(A) ∩ items(B) is:

P_i = 1 − (k/q)^{o_i} − ((q−k)/q)^{o_i}    (1)
Next, we construct a random variable x_i which takes the value 1 when item i belongs to the intersection of A and B, and the value 0 otherwise. Using the above equation, the variable x_i is distributed according to a Bernoulli distribution with success probability P_i. Thus, S_q is distributed as the sum of |items(Ratings)| non-identically distributed Bernoulli random variables, which can be approximated by a Poisson distribution [Cam60]:
Pr(S_q = j) = (λ^j · e^{−λ}) / j!    (2)

where

λ = Σ_{i ∈ items(Ratings)} P_i    (3)
The cumulative distribution function (CDF) is therefore given by:
Pr(S_q ≤ j) = Γ(⌊j+1⌋, λ) / ⌊j⌋!    (4)

where Γ(x, y) is the upper incomplete gamma function and ⌊·⌋ is the floor function. To summarize: given a binary split of Ratings into two sub-relations A and B, of sizes k and q−k respectively, our proposed heuristic first computes the size of the item intersection, |items(A) ∩ items(B)| = j. Next, we compute the probability Pr(S_q ≤ j) of obtaining an intersection of at most this size in a random split of the same sizes. Out of all possible splits of Ratings, our heuristic picks the one with the lowest probability Pr(S_q ≤ j) to be the next split in the tree.
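The split-scoring procedure above can be sketched as follows (a non-authoritative sketch: we again assume Ratings tuples of the form (item, user, rating), and we evaluate the Poisson CDF by direct summation rather than the incomplete-gamma form of Eq. 4, which is mathematically equivalent):

```python
import math
from collections import Counter

def poisson_cdf(j, lam):
    # Pr(S_q <= j) for a Poisson(lam) variable: e^{-lam} * sum_{i<=j} lam^i / i!
    return math.exp(-lam) * sum(lam ** i / math.factorial(i)
                                for i in range(math.floor(j) + 1))

def split_score(A, B):
    """Score a candidate binary split (A, B) of the Ratings tuples.
    A lower score means the observed intersection size is less probable
    under a random split of the same sizes, i.e. a better split."""
    k, q = len(A), len(A) + len(B)
    o = Counter(item for item, _user, _rating in A + B)  # o_i over all tuples
    # lambda = sum of P_i, with P_i = 1 - (k/q)^{o_i} - ((q-k)/q)^{o_i}  (Eqs. 1, 3)
    lam = sum(1 - (k / q) ** oi - ((q - k) / q) ** oi for oi in o.values())
    # j = |items(A) ∩ items(B)|
    j = len({i for i, _u, _r in A} & {i for i, _u, _r in B})
    return poisson_cdf(j, lam)  # Eq. 4, computed by summation

# Among candidate splits, the tree builder would pick the minimum, e.g.:
# best = min(candidate_splits, key=lambda ab: split_score(*ab))
```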
4 Experimental Evaluation
4.1 Evaluation Measures
To assess the quality of the resulting recommendation lists, we evaluated the half-life utility metric ([BHK98]) on the movies that were ranked by the user but were not used in the profile construction. This metric assumes that successive items in the list are less likely to be viewed, with an exponentially decreasing rate. The utility is defined as the difference between the user's rating for an item and the "default rating" for an item. The grade is then divided by the maximal utopian grade. Specifically, the grade of the recommendation list for user a is computed as follows: