A Decision Tree Based Recommender System

Amir Gershman, Amnon Meisels
Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel
amirger,[email protected]

Karl-Heinz Lüke
Deutsche Telekom AG, Laboratories, Innovation Development, Ernst-Reuter-Platz 7, D-10587 Berlin, Germany
[email protected]

Lior Rokach, Alon Schclar, Arnon Sturm
Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel
liorrk, schclar,[email protected]

Abstract: A new method for decision-tree-based recommender systems is proposed. The proposed method includes two major innovations. First, the decision tree produces lists of recommended items at its leaf nodes, instead of single items. This reduces the amount of search needed when using the tree to compile a recommendation list for a user, and consequently enables the recommendation system to scale. The second major contribution of the paper is the splitting method for constructing the decision tree. Splitting is based on a new criterion: the least probable intersection size. The new criterion computes, for each potential split, the probability of obtaining the observed intersection size under a random split, and selects the split whose intersection size is the least probable. The proposed decision-tree-based recommendation system was evaluated on a large sample of the MovieLens dataset and is shown to outperform the quality of recommendations produced by the well-known information gain splitting criterion.

1 Introduction

Recommender Systems (RS) propose useful and interesting items to users in order to increase both the seller's profit and the buyer's satisfaction. They contribute to the commercial success of many on-line ventures such as NetFlix [Net] and are a very active research area. Examples of recommended items include movies, web pages, books, news items and more. Often an RS attempts to predict the rating a user will give to items based on her past ratings and the ratings of other (similar) users.

Decision Trees have been previously used as a model-based approach for recommender systems. The use of decision trees for building recommendation models offers several benefits, such as efficiency and interpretability [ZI02] and flexibility in handling a variety of input data types (ratings, demographic, contextual, etc.). The decision tree forms a predictive model which maps the input to a predicted value based on the input's attributes. Each interior node in the tree corresponds to an attribute, and each arc from a parent to a child node represents a possible value or a set of values of that attribute. The construction of the tree begins with a root node and the input set. An attribute is assigned to the root, and arcs and sub-nodes for each set of values are created. The input set is then split by the values so that each child node receives only the part of the input set which matches the attribute value specified by the arc to the child node. The process then repeats itself recursively for each child until splitting is no longer feasible, either because a single classification (predicted value) can be applied to each element in the divided set or because some other threshold is reached.

A major weakness in using decision trees as a prediction model in RS is the need to build a huge number of trees (either for each item or for each user). Moreover, the model can only compute the expected rating of a single item at a time. To provide recommendations to the user, we must traverse the tree(s) from root to leaf once for each item in order to compute its predicted rating. Only after computing the predicted rating of all items can the RS provide the recommendations (the items with the highest predicted ratings). Thus decision trees in RS do not scale well with respect to the number of items.

We propose a modification to the decision tree model, to make it of practical use for larger scale RS. Instead of predicting the rating of an item, the decision tree returns a weighted list of recommended items. Thus with just a single traversal of the tree, recommendations can be constructed and provided to the user. This variation of decision tree based RS is described in Section 2.

The second contribution of this paper is the introduction of a new heuristic criterion for building the decision tree. Instead of picking the split attribute to be the attribute which produces the largest information gain ratio, the proposed heuristic looks at the number of shared items between the divided sets. The split attribute with the lowest probability of producing its number of shared items, when compared to a random split, is picked as the split attribute. This heuristic is described in further detail in Section 3. We evaluate the new heuristic and compare it to the information gain heuristic used by the popular C4.5 algorithm ([Qui93]) in Section 4.

2 RS-Adapted Decision Tree

In recommender systems the input set for building the decision tree is composed of Ratings. Ratings can be described as a relation over users, items and ratings (in which the user-item pair is assumed to be a primary key). The attributes can describe the users, such as the user's age, gender and occupation. Attributes can also describe the items, for example their weight, price and dimensions. Rating is the target attribute which the decision tree classifies by. Based on the training set, the system attempts to predict the Rating of items the user does not have a Rating for, and recommends to the user the items with the highest predicted Rating.

The construction of a decision tree is performed by a recursive process. The process starts at the root node with an input set (training set). At each node an attribute is picked as the split attribute. For each possible value (or set of values) child-nodes are created, and the parent's set is split between the child-nodes so that each child-node receives as input-set all items that have the appropriate value(s) corresponding to this child-node. Picking the split-attribute is done heuristically, since we cannot know which split will produce the best tree (the tree that produces the best results for future input); for example, the popular C4.5 algorithm ([Qui93]) uses a heuristic that picks the split that produces the largest information gain out of all possible splits. One of the attributes is pre-defined as the target attribute. The recursive process continues until all the items in the node's set share the same target attribute value or the number of items reaches a certain threshold. Each leaf node is assigned a label (classifying its set of items); this label is the shared target attribute value, or the most common value in case there is more than one such value.

Decision trees can be used for different recommender systems approaches:

• Collaborative Filtering Approach - Breese et al. [BHK98] used decision trees for building a collaborative filtering system. Each instance in the training set refers to a single customer. The training set attributes refer to the feedback provided by the customer for each item in the system. In this case a dedicated decision tree is built for each item. For this purpose the feedback provided for the targeted item (for instance like/dislike) is considered to be the decision that needs to be predicted, while the feedback provided for all other items is used as the input attributes (decision nodes). Figure 1 (left) illustrates an example of such a tree, for movies.

• Content-Based Approach - Li and Yamada [LY04] and Bouza et al. [BRBG08] propose to use content features to build a decision tree. A separate decision tree is built for each user and is used as a user profile. The features of each of the items are used to build a model that explains the user's preferences. The information gain of every feature is used as the splitting criterion. Figure 1 (right) illustrates Bob's profile. It should be noted that although this approach is interesting from a theoretical perspective, the precision that was reported for this system is worse than that of recommending the average rating.

• Hybrid Approach - A hybrid decision tree can also be constructed. Only a single tree is constructed in this approach. The tree is similar to the collaborative approach in that it takes a user's attributes as attributes to split by (such as her liking/disliking of a certain movie), but the attributes it uses are general attributes that represent the user's preferences for the general case, based on the content of the items. The attributes are constructed based on the user's past ratings and the content of the items. For example, a user who rated negatively all movies of genre comedy is assigned a low value in a "degree of liking comedy movies" attribute (a sketch of this attribute construction follows the list). Similarly to the collaborative approach, the tree constructed is applicable to all users. However, it is now also applicable to all items, since the new attributes represent the user's preferences for all items and not just a single given item. Figure 2 illustrates such a hybrid tree.
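
As an illustration of this attribute construction (our sketch, not the authors' code), the following function derives a per-genre "degree of liking" value in [0, 1] from a user's past ratings; the function name, the 1-5 rating scale and the normalization are assumptions made for the example.

```python
from collections import defaultdict

def degree_of_liking(user_ratings, item_genres, min_rating=1, max_rating=5):
    """Map each genre to a value in [0, 1] reflecting how highly the user
    rated items of that genre (hypothetical attribute construction)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for item_id, rating in user_ratings.items():
        for genre in item_genres.get(item_id, []):
            sums[genre] += (rating - min_rating) / (max_rating - min_rating)
            counts[genre] += 1
    return {genre: sums[genre] / counts[genre] for genre in sums}

# A user who rated all comedies with 1 gets a low "degree of liking Comedy"
# value, as described in the text.
ratings = {"m1": 1, "m2": 1, "m3": 5}
genres = {"m1": ["Comedy"], "m2": ["Comedy"], "m3": ["Drama"]}
print(degree_of_liking(ratings, genres))   # {'Comedy': 0.0, 'Drama': 1.0}
```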

Consider a general case with a data set containing n users, m items, and an average decision tree of height h. The collaborative filtering based RS requires m trees to be constructed, one for each item. When a user would like to receive a recommendation on what movie to watch, the system traverses all trees, from root to leaf, until it finds an item the user would like to view. The time complexity in this case is therefore O(h · m). This might be too slow for a large system that needs to provide fast, on-demand recommendations to users. The content based approach requires n trees to be constructed, one for each user. When a user would like to receive a recommendation, the system needs to traverse the user's tree from root to leaf once for each item, until it finds an item the user would like to view. The time complexity in this case is therefore O(h · m). Similarly, in the hybrid approach, the tree needs to be traversed once for each item, and the time complexity is also O(h · m).


Figure 1: Left: A CF decision tree for whether users like the movie "The Usual Suspects", based on their preferences for other movies such as The Godfather, Pulp Fiction, etc. A leaf labeled "L" or "D" correspondingly indicates that the user likes/dislikes the movie "The Usual Suspects". Right: A CB decision tree for Bob.

In systems which require fast computation of recommendations and with many possible items to recommend, all the above decision tree based RS would be impractical. Therefore we propose a modification of the decision tree to better fit RS, and provide recommendations faster to users.

Our proposed algorithm is similar to the ID3 algorithm [Qui86] and uses the hybrid approach. Because we use the hybrid approach, only a single tree is needed, and the attributes to split by are only attributes that describe users. These attributes can be computed based on the user's past ratings and the content of the items, as shown in the example in Figure 2, but they can also include user profile attributes such as age, gender, etc. The major variation from the ID3 algorithm is in the leaf nodes of the tree. Instead of creating leaf nodes with a label that predicts the target attribute value (such as rating), we propose to construct a recommendation list out of the leaf's input set and save this list at the leaf node as its label.
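
A minimal sketch of the adapted tree structure (ours, not the authors' implementation): interior nodes test a user attribute, each leaf stores its pre-computed recommendation list, and serving a user is a single root-to-leaf walk of length O(h). Discrete attribute values are assumed for simplicity.

```python
class Node:
    def __init__(self, attribute=None, children=None, recommendations=None):
        self.attribute = attribute               # user attribute tested at this node
        self.children = children or {}           # attribute value -> child Node
        self.recommendations = recommendations   # ranked item list (leaf nodes only)

def recommend(root, user_attributes):
    """Traverse from the root to a leaf using the user's attribute values and
    return the recommendation list stored at that leaf (O(h) per user)."""
    node = root
    while node.recommendations is None:
        value = user_attributes[node.attribute]
        node = node.children[value]
    return node.recommendations
```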

Figure 2: An example CF-CB hybrid decision tree. [Figure: inner nodes test user attributes such as gender and the user's preference (Dislike/Like) for movies such as Pulp Fiction and Star Wars; each leaf holds a ranked list of recommended movies.]

When a user wishes to receive recommendations, the tree is traversed based on the user's attributes until a leaf node is reached. The node contains a pre-computed recommendation list, and this list is returned to the user. Thus the time complexity is reduced to only O(h).

Building the recommendation list at the leaf node can be done in various ways. A simple solution, selected here, is to compute the weighted average of the ratings in all tuples in the leaf's Ratings set and to recommend the items with the highest weighted average first. Consider the rightmost leaf in the example tree in Figure 2. For any tuple in the Ratings set (i, u, r denote an item, a user and a rating, respectively) at this leaf node we know that u has a degree of liking the genre Comedy of more than 0.5 and a degree of liking movies with the actor George Clooney of more than 0.7. All the items rated by users such as u appear in this Ratings set along with the ratings each user submitted. This leaf therefore contains the ratings of users similar to u. We assume that if we now pick the items which were rated the highest by the users similar to u, these would form a good recommendation for this user. Therefore, we order all items based on a weighted average (since items can appear more than once, when more than one user rated them) and set this list as the recommendation list of this leaf node. The algorithm is presented in detail in Algorithm 1.
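
The leaf-list construction can be sketched as follows (one possible reading of the weighted average described above, not necessarily the authors' exact weighting): average each item's ratings over the tuples at the leaf and order the items by that average, highest first.

```python
from collections import defaultdict

def recommendation_list(ratings):
    """Build a leaf's recommendation list from its Ratings tuples
    (item_id, user_id, rating): average the ratings of each item and
    return the items ordered by that average, highest first."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for item_id, _user_id, rating in ratings:
        totals[item_id] += rating
        counts[item_id] += 1
    averages = {i: totals[i] / counts[i] for i in totals}
    return sorted(averages, key=averages.get, reverse=True)
```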

3 Least Probable Intersections

Decision trees seek to provide good results for new, unseen cases, based on the model constructed with the training set. To accomplish this, the construction algorithm strives for a small tree (in number of nodes) that performs well on the training set. It is believed that a small tree generalizes better, avoids overfitting, and forms a simpler representation for humans to understand [Qui86]. C4.5 [Qui93] is a popular algorithm for constructing such decision trees.

Algorithm 1 RS-Adapted-Decision-Tree(Ratings)
  create root node
  if (Ratings.size < threshold)
    root.recommendations ← recommendation_list(Ratings)
    return root
  else
    Let A be the user attribute that best classifies the input
    For each possible value v of A
      Add a branch b below root, labeled (A = v)
      Let Ratings_v be the subset of Ratings that have the value v for A
      Add RS-Adapted-Decision-Tree(Ratings_v) below this new branch b
    return root

It uses the criterion of normalized information gain [Mit97] to pick the attribute by which to split each node. The attribute with the largest normalized information gain is picked, as it provides the largest reduction in entropy. Since this is a heuristic, it does not guarantee the smallest possible tree.

In recommender systems the input set for building the decision tree is composed of Ratings. Ratings can be described as a relation over users, items and ratings (in which the user-item pair is assumed to be a primary key). The intuition behind the heuristic proposed in the present paper is as follows. Consider a split into two subsets, A and B. The fewer ItemIDs are shared between the sets A and B, the better the split is, since it forms a better distinction between the two groups. However, different splits may result in different group sizes. Comparing the size of the intersection between splits of different sub-relation sizes would not be sound, since an even split, for example (half the tuples in group A and half in group B), would probably have a larger intersection than a very uneven split (such as one tuple in group A and all the rest in group B). Instead, we look at the probability of our split's item-intersection size compared to a random (uniform) split with similar group sizes. A split which is very likely to occur even in a random split is considered a bad split (less preferred), since it is similar to a random split and probably does not distinguish the groups well from each other. A split that is the least probable to occur is assumed to be a better distinction of the two subgroups and is the split selected by the heuristic. More formally, let us denote:

• $items(S) = \pi_{ItemID}(S)$, where $\pi$ is the projection operation in relational algebra: the set of all ItemIDs that appear in a set S.

• $o_i(S) = |\sigma_{ItemID=i}(S)|$, where $\sigma$ is the selection operation in relational algebra: the number of occurrences of ItemID i in the set S.

Let $S_q$ (q denotes the number of ratings) be a random binary partition of the tuples in Ratings into two sub-relations A and B consisting of k and q−k tuples, respectively. We are interested in the probability distribution of $S_q \equiv |items(A) \cap items(B)|$. First, let us find the probability of an item belonging to the intersection. The probability that all $o_i$ occurrences of any item i fall in the set A is $(k/q)^{o_i}$. Similarly, the probability that all $o_i$ occurrences of item i fall in the set B is $((q-k)/q)^{o_i}$. In all other cases item i will appear in the intersection, thus the probability $P_i$ that item i belongs to $items(A) \cap items(B)$ is:

$$P_i = 1 - \left(\frac{k}{q}\right)^{o_i} - \left(\frac{q-k}{q}\right)^{o_i} \qquad (1)$$
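
For intuition, consider an even split ($k = q/2$) and an item with $o_i = 2$ occurrences: equation (1) gives $P_i = 1 - (1/2)^2 - (1/2)^2 = 0.5$. An item that occurs only once ($o_i = 1$) can never be shared between A and B, and indeed $P_i = 1 - k/q - (q-k)/q = 0$.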

Next, we can construct a random variable $x_i$ which takes the value 1 when item i belongs to the intersection of A and B, and the value 0 otherwise. Using the above equation, the variable $x_i$ is distributed according to a Bernoulli distribution with success probability $P_i$. Thus, $S_q$ is distributed as the sum of $|items(Ratings)|$ non-identically distributed Bernoulli random variables, which can be approximated by a Poisson distribution [Cam60]:

$$Pr(S_q = j) = \frac{\lambda^j \cdot e^{-\lambda}}{j!} \qquad (2)$$

where

$$\lambda = \sum_{i \in items(Ratings)} P_i \qquad (3)$$

The cumulative distribution function (CDF) is therefore given by:

$$Pr(S_q \le j) = \frac{\Gamma(\lfloor j \rfloor + 1, \lambda)}{\lfloor j \rfloor!} \qquad (4)$$

where $\Gamma(x, y)$ is the (upper) incomplete gamma function and $\lfloor \cdot \rfloor$ denotes the floor function. To summarize, given a binary split of Ratings into two sub-relations A and B of sizes k and q−k respectively, our proposed heuristic first computes the size of the item-intersection, $|items(A) \cap items(B)| = j$. Next, we compute the probability of obtaining such an intersection size in a similar-size random split using the probability $Pr(S_q \le j)$. Out of all possible splits of Ratings, our heuristic picks the one with the lowest probability $Pr(S_q \le j)$ to be the next split in the tree.
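
To make the scoring step concrete, here is a minimal sketch (our illustration, not the authors' code). It assumes a candidate split is given as two lists A and B of (item_id, user_id, rating) tuples, evaluates λ from equations (1) and (3), and computes the Poisson CDF of equation (4) directly; the heuristic would pick the candidate split with the smallest returned value.

```python
import math
from collections import Counter

def split_probability(A, B):
    """Score a candidate binary split (A, B) of a node's Ratings tuples
    (item_id, user_id, rating): the Poisson-approximated probability
    Pr(S_q <= j) of an item-intersection no larger than the observed one
    under a random split of the same sizes (equations 1-4)."""
    k, q = len(A), len(A) + len(B)
    occurrences = Counter(item for item, _u, _r in A + B)   # o_i over the node's set
    # lambda = sum over items of P_i = 1 - (k/q)^o_i - ((q-k)/q)^o_i  (eqs. 1 and 3)
    lam = sum(1.0 - (k / q) ** o - ((q - k) / q) ** o for o in occurrences.values())
    # observed intersection size j = |items(A) ∩ items(B)|
    j = len({item for item, _u, _r in A} & {item for item, _u, _r in B})
    if lam == 0.0:
        # happens only when every item occurs once; the intersection is empty
        return 1.0
    # Poisson CDF evaluated in log space for numerical stability
    log_terms = [n * math.log(lam) - lam - math.lgamma(n + 1) for n in range(j + 1)]
    return sum(math.exp(t) for t in log_terms)

# The heuristic computes split_probability for every candidate split of the
# node and picks the split with the lowest value as the next split in the tree.
```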

4 Experimental Evaluation

4.1 Evaluation Measures

To assess the quality of the resulting recommendation list, we evaluated the half-life utility metric ([BHK98]) on the movies that were ranked by the user but were not used in the profile construction. This metric assumes that successive items in the list are less likely to be viewed, with an exponentially decreasing rate. The utility is defined as the difference between the user's rating for an item and the "default rating" for an item. The grade is then divided by the maximal, utopian grade. Specifically, the grade of the recommendation list for user a is computed as follows:

$$R_a = \sum_j \frac{\max(r_{a,j} - d, 0)}{2^{(j-1)/(\alpha-1)}} \qquad (5)$$

where $r_{a,j}$ represents the rating of user a on item j of the ranked list, d is the default rating, and $\alpha$ is the viewing half-life, which in this experiment was set to 10. The overall score for a dataset across all users (R) is

$$R = 100 \cdot \frac{\sum_a R_a}{\sum_a R_a^{max}} \qquad (6)$$

where $R_a^{max}$ is the maximum achievable utility if the system ranked the items in the exact order that the user ranked them.
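
A small sketch of the metric (our illustration under the definitions above, not the authors' evaluation code): given one user's ranked list and her true ratings it computes $R_a$ per equation (5), and the corpus-level score follows equation (6). The default rating d is left as a parameter since its value is not specified here.

```python
def half_life_utility(ranked_items, true_ratings, d, alpha=10):
    """Half-life utility R_a of one user's ranked list (equation 5): the j-th
    item contributes max(r_aj - d, 0) / 2^((j-1)/(alpha-1)); unrated items
    contribute nothing."""
    return sum(
        max(true_ratings.get(item, d) - d, 0.0) / 2 ** ((j - 1) / (alpha - 1))
        for j, item in enumerate(ranked_items, start=1)
    )

def max_half_life_utility(true_ratings, d, alpha=10):
    """R_a^max: the utility obtained when the items are ordered exactly by the
    user's own ratings, highest first."""
    ideal_order = sorted(true_ratings, key=true_ratings.get, reverse=True)
    return half_life_utility(ideal_order, true_ratings, d, alpha)

def overall_score(per_user_utility, per_user_max_utility):
    """Corpus-level score R (equation 6): 100 * sum_a R_a / sum_a R_a^max."""
    return 100.0 * sum(per_user_utility) / sum(per_user_max_utility)
```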

4.2 Experimental Setup

For our experimental evaluation we used the MovieLens ([RL10]) data set, which holds 1 million ratings of 6040 users on 3900 distinct movies. For the training set, about 10% of the movies were selected, together with all ratings associated with them. These ratings accumulate to roughly 50,000 ratings for half the user set (3063 users). The users' attributes consisted of age, occupation and gender. The same attribute can serve as a split attribute at multiple junctions in the tree, as we confine the tree to binary splits. The RS is asked to provide recommendations for all remaining users (2977 users), and the half-life utility metric is used to evaluate the results. In order to evaluate the new heuristic, we compare two recommendation systems that are identical in all details except for the split heuristic used. One system uses the standard information gain heuristic, and the other uses the proposed least probable intersection size heuristic. The same training set is given as input to both systems.

4.3 Results

Figure 3 compares the performance of the new intersection size criterion with the well-known information gain. The horizontal axis of this plot indicates the depth of the decision tree, as the tree is being constructed. The vertical axis indicates the average utility of the recommendations produced by the tree. The solid line shows the utility of the decision tree using the new criterion, whereas the broken line shows the utility obtained by the information gain criterion. Both lines follow the well-known over-fitting pattern, in which the utility first increases and then decreases. Note that the new criterion dominates the information gain criterion over all depths. Specifically, the best performances of the two criteria are 75.88% and 72.71% respectively, which implies a 4.35% relative improvement. To examine the effects of the tree's depth and of the splitting criterion, a two-way analysis of variance (ANOVA) with repeated measures was performed. The dependent variable was the mean utility. The results of the ANOVA showed that the main effects of the tree's depth (F = 34.2, p < 0.001) and the splitting criterion (F = 7.24, p < 0.001) were both significant. A post-hoc Duncan test was conducted in order to examine at which depths the proposed criterion outperforms the information gain criterion. With α = 0.05, from trees of four levels up to trees of 21 levels, the proposed method is significantly better than information gain.

Figure 3: Comparing the average utility of the two splitting criteria for different tree sizes.

5 Conclusions

A new decision tree based recommender technique was presented. The new technique requires only a single traversal of the tree for producing the recommendation list, by holding lists of recommended items in the tree's leaf nodes. A new splitting criterion for guiding the induction of the decision tree was also proposed. The experimental study shows that the new criterion outperforms the well-known information gain criterion. Additional issues to be further investigated include: (a) adding a pruning phase that follows the growing phase and helps to avoid over-fitting; (b) comparing the proposed technique to other decision tree based techniques; and (c) examining the proposed method on other benchmark datasets.

Acknowledgements

The project was funded and managed by Deutsche Telekom Laboratories as part of the Context-Aware Service Offering and Usage (CASOU) project.

References

[BHK98] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Fourteenth Conference on Uncertainty in Artificial Intelligence, 1998.

[BRBG08] A. Bouza, G. Reif, A. Bernstein, and H. Gall. SemTree: ontology-based decision tree algorithm for recommender systems. In International Semantic Web Conference, 2008.

[Cam60] L. Le Cam. An Approximation Theorem for the Poisson Binomial Distribution. Pacific Journal of Mathematics, volume 10, pages 1181–1197, 1960.

[LY04] P. Li and S. Yamada. A Movie Recommender System Based on Inductive Learning. In IEEE Conference on Cybernetics and Intelligent Systems, 2004.

[Mit97] T. M. Mitchell. Machine Learning. The McGraw-Hill Companies, Inc., 1997.

[Net] NetFlix. The Netflix Prize, www.netflixprize.com.

[Qui86] J. R. Quinlan. Induction of Decision Trees. Machine Learning, pages 81–106, 1986.

[Qui93] J. R. Quinlan. C4.5: Programs for Machine Learning (Morgan Kaufmann Series in Machine Learning). Morgan Kaufmann, January 1993.

[RL10] GroupLens Research Lab. "The MovieLens Data Set", http://www.grouplens.org/node/73, February 2010.

[ZI02] T. Zhang and V. S. Iyengar. Recommender Systems Using Linear Classifiers. Journal of Machine Learning Research, volume 2, pages 313–334, 2002.
