
Hindawi Publishing Corporation Mathematical Problems in Engineering Volume 2014, Article ID 162521, 8 pages http://dx.doi.org/10.1155/2014/162521

Research Article A Robust Approach Based on User Relationships for Recommendation Systems

Min Gao,1,2 Bin Ling,3 Quan Yuan,1 Qingyu Xiong,1,2 and Linda Yang3

1 School of Software Engineering, Chongqing University, Chongqing 400044, China
2 Key Laboratory of Dependable Service Computing in Cyber Physical Society, Ministry of Education, Chongqing 400044, China
3 School of Engineering, University of Portsmouth, Portsmouth PO1 3AH, UK

Correspondence should be addressed to Min Gao; [email protected]

Received 12 August 2013; Revised 10 December 2013; Accepted 30 December 2013; Published 19 February 2014

Academic Editor: Xing-Gang Yan

Copyright © 2014 Min Gao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Personalized recommendation systems have been widely used as an effective way to deal with information overload. The common approach in these systems, item-based collaborative filtering (CF), has been identified as vulnerable to “Shilling” attacks. To improve the robustness of item-based CF, the authors propose a novel CF approach based on the most commonly used relationships between users. First, the three most commonly used relationships between users are analyzed and applied to construct several user models. DBSCAN clustering is then utilized to select the valid user model according to how well each model supports the detection of spam users. The selected model is used to detect the spam user group. Finally, a detection-based CF method is proposed for the calculation of item-item similarities and rating prediction, which sets different weights for suspicious spam users and normal users. The experimental results demonstrate that the proposed approach is more robust than the typical item-based kNN (k Nearest Neighbor) CF approach.

1. Introduction

Nowadays, personalized recommendation systems are widely used as an effective way to help people cope with information overload [1, 2]. Such a system automatically adjusts, restructures, and presents tailored information for individuals by analyzing user information, creating one-to-one relationships, or understanding user needs in different contexts [3–6]. Until now, CF has been the most popular approach used in personalized recommendation systems. Approaches for CF recommendation can be grouped into two general classes [7–11]: user-based and item-based.

Both the typical user-based and item-based CF approaches, however, suffer from “Shilling” attacks [12] because users of online systems can multiply their profiles and identities nearly indefinitely. Thus, systems that depend on such profiles are subject to control by an attacker bent on making the system recommend as he or she desires [12–17].

It is common knowledge that some users’ ratings in recommendation systems are more valuable than those of others. If an approach can make the credit ratings (or ranks, weights) of spam users [15] made up by an attacker lower than those of normal users, the antiattack ability of recommendation systems would be improved.

Several kinds of relationships between users are commonly used in item-based CF, such as similarities and correlations. In this paper, an approach based on these relationships is proposed to calculate the relative weights of users and to further improve the attack resistance of typical item-based CF approaches. The proposed approach consists of the following four steps: three kinds of relationships between users are selected to construct user models; a density-based clustering algorithm is then used to select the best user model; the model is then applied to detect spam users; the detection results are incorporated into an approach for the calculation of item-item similarities and rating prediction. Finally, the experimental results illustrate that the proposed approach provides better robustness (the stability of prediction and hit ratio) than (1) a commonly used item-based kNN CF (similarity-based CF) recommendation approach and (2) other robust recommendation approaches.

[Figure 1: The general form of an attack profile. The profile is partitioned into four sets: $I^F$ (items $i^F_1 \cdots i^F_k$ with ratings $\sigma(i^F_1) \cdots \sigma(i^F_k)$), $I^0$ (items $i^0_1 \cdots i^0_l$ with Null ratings), $I^S$ (items $i^S_1 \cdots i^S_m$ with ratings $\delta(i^S_1) \cdots \delta(i^S_m)$), and $I^T$ (items $i^T_1 \cdots i^T_n$ with ratings $\gamma(i^T_1) \cdots \gamma(i^T_n)$).]

The rest of this paper is organized as follows: Section 2 presents the background of item-based CF approaches and their related problems. Section 3 presents the proposed methods for selecting user models, detecting and marking suspicious spam users and normal users, and calculating item-item similarities and predictions according to the detection. Section 4 presents experimental results of the proposed approach on the MovieLens dataset and analyzes whether the approach is effective in comparison with the typical item-based CF approach and other robust recommendation approaches. Section 5 draws conclusions.

2. Background and Associated Problem of Item-Based CF Approaches

CF is the most widely used and most successful recommendation technique to date [18–20]. The traditional CF, user-based CF, predicts the rating of an item for a target user based on the ratings of other like-minded users. It was remarkably successful in the past, but some challenges have arisen [21], such as scalability: the computational complexity grows rapidly with the number of users. Item-based CF has been shown to solve this problem [9]. Both the user-based and item-based CF approaches, however, suffer from “Shilling” attacks.

2.1. Shilling Attack Problem. An attack on a recommendation system arranges for a group of users, named shills [20] or spam users [14], to enter the system and vouch for the items in question. Their ratings are intended to mislead other users. The attacks are, therefore, called shilling attacks (or profile injection attacks [12]).

An attack consists of a set of attack profiles (also named attack ratings). An attack model is an approach to construct attack profiles. The general form of an attack profile is shown in Figure 1 [14].

Suppose that there are $p$ items in total in a recommendation system; an attack profile consists of a $p$-dimensional vector of ratings. The $p$-dimensional vector can be divided into 4 sets: $I^F$, $I^0$, $I^S$, and $I^T$. Here, $p = |I^F| + |I^0| + |I^S| + |I^T| = k + l + m + n$. $I^F$ ($i^F_1 \sim i^F_k$) is a set of randomly selected filler items. $I^0$ ($i^0_1 \sim i^0_l$) is a set of unrated items. $I^S$ ($i^S_1 \sim i^S_m$) is a set of selected items which have some relationship with the target items. $I^T$ ($i^T_1 \sim i^T_n$) is a set of target items. Several attack models have been identified, such as random attack and average attack [13], and the newer models, bandwagon and segment model [22].

The bandwagon attack model gives high ratings to the most popular items [14] and has the following characteristics: (1) $I^S$: all items in $I^S$ are the most popular items and are assigned $r_{\max}$ ($\delta(i^S_k) = r_{\max}$); (2) $I^F$: all items in $I^F$ are assigned random values that follow a normal distribution ($\sigma(i^F_k) = \text{random values}$); (3) $I^T$: all items in $I^T$ are assigned $r_{\max}$ ($\gamma(i^T_k) = r_{\max}$).

The segment attack model is designed to push an item to a targeted group (segment) of users with known or easily predicted preferences [22]. It has the following characteristics: (1) $I^S$: all items in $I^S$ are assigned $r_{\max}$ ($\delta(i^S_k) = r_{\max}$); (2) $I^F$: all items in $I^F$ are assigned $r_{\min}$ ($\sigma(i^F_k) = r_{\min}$); (3) $I^T$: all items in $I^T$ are assigned $r_{\max}$ ($\gamma(i^T_k) = r_{\max}$).

Research in the area of shilling attacks has made significant advances in recent years. User-based CF makes recommendations by finding peers with similar preference profiles; consequently, profiles with biased data may easily result in biased recommendations. Item-based CF looks for items with similar profiles and makes predictions based on a user’s own ratings of the peer items; therefore, item-based CF also suffers from the attacks.

Random attack and average attack models are successful against user-based CF algorithms; however, they fall short of having a significant impact against item-based CF algorithms [13]. The newer models, bandwagon and segment model, are quite successful against item-based CF algorithms [22]. Among these attack models, random and bandwagon attacks are low knowledge attacks [13], which need minimal knowledge of recommendation systems and user profiles. For experimental purposes, the bandwagon attack is adopted in the paper since it is a low knowledge attack and quite successful against item-based CF.

2.2. Shilling Attack Resistant CF. A number of recent studies have focused on robust CF because recommendation systems are vulnerable and easily attacked. O’Donovan and Smyth [23] proposed that the trustworthiness of users should be taken into consideration in recommendation systems. Their trust models can improve the predictive accuracy. Massa and Avesani [24] proposed a robust CF approach, also called trust-based CF, based on a “web of trust.” The approach increases the coverage of recommendation systems while preserving the quality of predictions, especially for new users. However, the predictive accuracy and the coverage of recommendation systems are not the essential metrics for robust recommendation systems [25]. Zhang [26] proposed a trust-aware CF based on users’ multiple interests. He proposed a topic-level trust model and a CF approach based on the model. The approach improves the robustness of the recommendations. However, all three levels of the trust model are based on the number of user ratings.

The relationships and weights among users are essential to a recommendation system. Yu et al. [27] proposed a reputation-based approach for decoding information from noisy, redundant, and intentionally distorted sources. Zhou et al. [28] proposed a correlation-based reputation algorithm to solve the ranking problem of rating systems. Shang et al. [29] showed that relevance information can outperform the commonly used Pearson correlation coefficient under the

standard collaborative filtering framework, especially for sparse data sets. These studies provide valuable input to our approach.

[Figure 2: The architecture of the proposed CF approach. Input: rating matrix. Modules: user model(s), user weighting, item-item similarity computing, rating predicting. Output: predicted ratings.]

In the paper, the user models are formed from the relationships of users, in which not only the numbers of user ratings but also the ratings themselves are taken into account. First, three kinds of commonly used relationships between users are selected to construct user models. The best user models are then experimentally selected for detecting and weighting users. Finally, the rating weights for the users are incorporated into a typical item-based CF. The proposed models and the approach can further improve the robustness of recommendation approaches. They are discussed in detail in Section 3.

3. User Relationships-Based Robust CF

To achieve a robust collaborative recommendation approach, the spam users are detected based on users’ relationships, and the detection results, represented by weights, are incorporated into item similarity computing (see Figure 2). The paper adopts the definition of robustness for collaborative recommendation as the ability to make recommendations despite noisy product ratings [23]. The approach takes the rating matrix as input and produces predicted ratings as output. In the data modeling module, three kinds of user relationships are taken into consideration: interest similarity, rating similarity, and rating linear dependence. In the user weighting module, clustering-based detection results are applied to produce the weights of users. The weights are then incorporated into item-item similarity calculations and further predictions.

3.1. The Analysis of User Relationships. There are different relationships between users in a recommendation system, just as there are various relationships in any social group. The relationships are exploited to construct user models for the detection of spam users.

Traditionally, rating similarity is the most used relationship between users in recommendation systems [18]. The rating similarity, named R_Sim for short, measures how similar two users’ ratings are to each other.

The rating similarity, however, is only one aspect of the user relationship. There are other relationships behind the ratings [29]. For example, many items may be rated by both user $u$ and user $v$ while their ratings are extremely different. In this case, their rating similarity is very low. Nevertheless, the similarity between them should be high since their sets of rated items are similar. In particular, if the dataset is very sparse, rating the same items is more important than giving the same ratings [29]. In the paper, this relationship is called interest similarity, named In_Sim for short, which represents how similarly two users are interested in the items of a recommendation system.

In addition to those relationships, Gao and Wu [21] pointed out that the covariance between ratings is an important measure because it represents the linear dependence between the ratings of users. In practice, however, the correlation coefficient (Corr_coef) is usually used instead of covariance to measure the linear dependence between two variables because it gives a value between −1 and 1 inclusive. The linear dependence is also commonly used as user similarity in recommendation systems. Thus, in the paper, the linear dependence (L_depd) is considered as the third relationship in the model; it describes how the ratings of two users change together.

Therefore, these three kinds of relationships, interest similarity, rating similarity, and linear dependence, are taken into consideration in the research.

The interest similarity of users $u$ and $v$ can be calculated by (1). The more items have been rated by both users $u$ and $v$, the closer the users are [19]. We define $I(u)$ as the set of items rated by the user $u$; $I(v)$ is defined likewise. $I(u, v)$ is the set of items rated by both users $u$ and $v$. Consider

$$\mathrm{In\_Sim}(u, v) = \frac{|I(u) \cap I(v)|}{|I(u) \cup I(v)|}. \tag{1}$$

The rating similarity of users $u$ and $v$ can be calculated by Cosine, the most used measure for the calculation of similarities among users (see (2)). Here, the rating $r_{u,i}$ indicates how much the user $u$ prefers the item $i$. The rating $r_{v,i}$ is defined likewise. $I$ is the set of items. Consider

$$\mathrm{R\_Sim}(u, v) = \frac{\sum_{i \in I} r_{u,i}\, r_{v,i}}{\sqrt{\sum_{i \in I} (r_{u,i})^2}\, \sqrt{\sum_{i \in I} (r_{v,i})^2}}. \tag{2}$$

The linear dependence between the ratings of user $u$ and those of user $v$ can be calculated by the Pearson Corr_coef (see (3)). The Corr_coef is defined as the covariance of the variables divided by the product of their standard deviations. Consider

$$\mathrm{L\_depd}(u, v) = \frac{\sum_{i \in I(u) \cap I(v)} (r_{u,i} - \bar{r}_u)(r_{v,i} - \bar{r}_v)}{\sqrt{\sum_{i \in I(u) \cap I(v)} (r_{u,i} - \bar{r}_u)^2}\, \sqrt{\sum_{i \in I(u) \cap I(v)} (r_{v,i} - \bar{r}_v)^2}}. \tag{3}$$

Here, $\bar{r}_u$ is the average of $u$’s ratings on the items in $I(u) \cap I(v)$, $\bar{r}_u = \sum_{i \in I(u) \cap I(v)} r_{u,i} / |I(u) \cap I(v)|$; $\bar{r}_v$ is defined likewise.

So far, the three relationships form three matrixes: R_Sim, In_Sim, and L_depd. Table 1 shows the three pairwise correlations between the R_Sim, In_Sim, and L_depd matrices before and after bandwagon attacks with 10% attack size and 10% filler size.

3.2. Construction of User Models. The combinations of the matrixes, In_Sim, R_Sim, and L_depd, can form seven different

user models, such as (In_Sim, R_Sim, L_depd) and (In_Sim, R_Sim). Please note that the user model constructed by (R_Sim, In_Sim) is the same as the model constructed by (In_Sim, R_Sim).

Table 1: The pair correlations between the R_Sim, In_Sim, and L_depd matrices.

Correlations        Before attacks   After attacks
(In_Sim, R_Sim)     0.627            0.540
(In_Sim, L_depd)    0.097            0.119
(R_Sim, L_depd)     0.103            0.103

All three matrixes are $(|U| \times |U|)$ dimensional matrixes, where $|U|$ is the cardinality of the set of users. A vector from the combinations of the three matrixes can be used to represent a user, which is high dimensional data. To decrease the dimension, the matrixes are experimentally analyzed. It is found that the In_Sim and R_Sim values can each be divided into 10 slots (0 to 1, 0.1 intervals); the L_depd values for every user can be divided into 20 slots (−1 to 1, 0.1 intervals).

Figure 3 is the distribution chart of Slotted In_Sim, Slotted R_Sim, and Slotted L_depd.

[Figure 3: The distribution charts of Slotted In_Sim, Slotted R_Sim_Csn, and Slotted L_depd. Horizontal axis: slots from [−1 ∼ −0.9] to (0.9 ∼ 1]; series: In_Sim, R_Sim, L_depd.]

Slotted In_Sim is a $(|U| \times 10)$ matrix that records the distribution of the interest similarities for all users. It is formed by ten attributes that are the slots from 0 to 1 at 0.1 intervals. The values of the attributes are in [0, 1].

Slotted R_Sim is a $(|U| \times 10)$ matrix that records the distribution of the rating similarities for all users. The definitions and values of the attributes of Slotted R_Sim are similar to those of Slotted In_Sim.

Slotted L_depd is a $(|U| \times 20)$ matrix that records the distribution of linear dependence for all users. The twenty attributes of Slotted L_depd are the slots from −1 to 1 at 0.1 intervals. The values of the attributes are in [0, 1].

Thus, the seven user models formed by the combinations can be simplified to the combinations of Slotted In_Sim, Slotted R_Sim, and Slotted L_depd. In those user models, each user can be represented by ten to forty attributes.

Attacks make the similarities among spam users greater than the similarities among normal users. Therefore, the weighting problem can be seen as a clustering-related problem. The density-based clustering algorithm DBSCAN [30, 31] is chosen to group users in the research because it can discover arbitrarily shaped clusters and has good efficiency on large databases. DBSCAN groups users who are densely located and can be connected into a single cluster. DBSCAN is applied to all those user models to find which one is most helpful for detecting the group of spam users.

In the DBSCAN algorithm, a user is the core of a group when the number of his/her neighbors is equal to or more than $k$. Two users are neighbors when the distance between their attributes is less than 0.05. The bandwagon attack is used to analyze how beneficial the attributes are to the clustering. The attack size and filler size are 5% and 5%, and 10% and 10%; the number of attacked items is 1. The attacks can be push attacks or nuke attacks, depending on whether they aim to raise the predicted rating of a target item. A push attack raises the rating; otherwise it is a nuke attack. Push attacks are taken into account in this paper.

[Figure 4: The distributions of the slotted L_depd under bandwagon attacks. Horizontal axis: slots from [−1 ∼ −0.9] to (0.9 ∼ 1]; series: normal users; attack size = 10%, filler size = 10%; attack size = 5%, filler size = 5%; attack size = 20%, filler size = 20%.]

Figures 4 and 5 represent the distributions of Slotted L_depd and Slotted R_Sim values of normal users and spam users. The attack sizes and filler sizes are 5%, 5%; 10%, 10%; and 20%, 20%, respectively, in Figure 4. Those are 5%, 5%; 10%, 10%; and 15%, 5%, respectively, in Figure 5. In these figures, the distributions of spam users become more obviously different from those of normal users as the attack size and filler size increase.

As seen from Table 2, (Slotted In_Sim, Slotted R_Sim) is the best combination among them. Consequently, the attributes from Slotted In_Sim and Slotted R_Sim are chosen to detect spam users. The precisions of other user models

unlisted in the table are no more than 20%. Most of those models cannot even find any spam user. As the attack size, filler size, and number of attack items increase, most of the user models produce remarkable results. That is because the characteristics of attack users become much more obvious.

Table 2: Detection results based on user models (precision and recall under bandwagon attacks).

User models                          Attack size = 5%, filler size = 5%    Attack size = 10%, filler size = 10%
                                     Precision   Recall                    Precision   Recall
(Slotted In_Sim, Slotted L_depd)     0.85        0.62                      0.99        0.79
(Slotted R_Sim, Slotted L_depd)      0.96        0.72                      1           0.82
(Slotted R_Sim, Slotted In_Sim)      0.95        0.98                      1           0.95

[Figure 5: The distributions of the slotted R_Sim under bandwagon attacks. Horizontal axis: slots from (−0.1 ∼ 0] to (0.9 ∼ 1]; series: normal users; attack size = 10%, filler size = 10%; attack size = 5%, filler size = 5%; attack size = 15%, filler size = 15%.]

3.3. Detection-Based Item Similarity Calculation and Rating Prediction. As discussed previously, item-based CF computes the similarities between items and then chooses the most similar items for prediction [18]. The underlying idea is to compare items based on the pattern of ratings across users.

In the research, the rating weights of users are incorporated into one of the similarity-based algorithms [1], named item-based kNN collaborative filtering (shortened to IKCF). As mentioned in Section 3.2, the sets of suspicious users are obtained when the DBSCAN algorithm is applied to the twenty attributes of Slotted R_Sim and Slotted In_Sim.

The new algorithm we propose is a weighted item-based kNN collaborative filtering approach (named WIKCF). If users are in the spam user group, their weights should be extremely small; otherwise, their weights should be large. In the research, the weight $w_u$ of user $u$ is simply set to 1 when he/she is not in the suspicious spam group or 0 when he/she is in the suspicious group.

There are several algorithms for computing item-item similarities, such as cosine, correlation, and adjusted cosine-based similarity [18]. Adjusted cosine is the most commonly used algorithm to calculate the similarities between items because it is reasonably accurate, widely used, and easily analyzed [25]. Thus, in WIKCF, adjusted cosine is utilized to calculate item similarities:

$$\mathit{Sim}_{i,j} = \Bigg( \sum_{u \in U(i) \cap U(j)} (r_{u,i} - \bar{r}_u) \times (r_{u,j} - \bar{r}_u) \times w_u^2 \Bigg) \times \Bigg( \sqrt{\sum_{u \in U(i) \cap U(j)} (r_{u,i} - \bar{r}_u)^2 \times w_u^2} \times \sqrt{\sum_{u \in U(i) \cap U(j)} (r_{u,j} - \bar{r}_u)^2 \times w_u^2} \Bigg)^{-1}. \tag{4}$$

Here, $U(i)$ is the set of users who have rated item $i$. Formally, $U(i) = \{u \mid r_{u,i} \neq 0\}$. $\bar{r}_u$ is the average of user $u$’s ratings, and $w_u$ is the weight of user $u$.

In order to estimate a rating, the commonly used weighted sum is applied to predict ratings for users, which is the crucial step in a CF recommendation system. Consider

$$p_{u,i} = \frac{\sum_{j \in I(u)} \mathit{sim}_{i,j} \times r_{u,j}}{\sum_{j \in I(u)} |\mathit{sim}_{i,j}|}, \tag{5}$$

where $I(u)$ is the set of items rated by user $u$.

4. Experimental Evaluations

4.1. Dataset. The widely used MovieLens dataset is utilized to evaluate the proposed approach. MovieLens [32] is a free service provided by GroupLens Research at the University of Minnesota (http://www.movielens.org). The site had over 43,000 users who had rated more than 3,500 different movies. There are two datasets in the MovieLens project. One includes 1,000,209 anonymous ratings (1–5) of approximately 3,900 movies made by 6,040 users who joined MovieLens in 2000. The other dataset consists of 100,000 ratings from 943 users on 1,682 movies. Each user has rated at least 20 movies. The latter dataset has been used in the experiments. The dataset was randomly divided into a training set (80,000 ratings) and a test set (20,000 ratings) 50 times. The training and test sets are named $U_i\mathrm{base}$ and $U_i\mathrm{test}$ ($i = 1, \ldots, 50$).

4.2. Evaluation Metric. Three metrics are used to evaluate the algorithms: mean absolute error (MAE [19]), prediction shift [18], and hit ratio [14] shift. MAE is a broadly used metric for the deviation of predictions from their true values. Prediction shift and hit ratio shift are commonly used metrics for measuring the robustness of recommendation systems.

For all predictions $\{p_1, p_2, \ldots, p_N\}$ and corresponding real ratings $\{r_1, r_2, \ldots, r_N\}$,

$$\mathrm{MAE} = \frac{\sum_{i=1}^{N} |p_i - r_i|}{N} \tag{6}$$

is the average of absolute error between all $\{p_i, r_i\}$ pairs. The lower the MAE is, the better the proposed approach is.

Prediction shift models the difference between the average predicted ratings of all the ratings in the test set, after and before the attacks [18]:

$$\text{Prediction shift} = \frac{\sum_{i \in I} \sum_{u \in U} \operatorname{abs}(p'_{u,i} - p_{u,i})}{|U||I|}. \tag{7}$$

In the formula, $p'_{u,i}$ and $p_{u,i}$ are the predicted ratings after and before the attacks, $U$ is the set of users and $I$ is the set of items in the test set, and the abs function indicates the absolute value of $p'_{u,i} - p_{u,i}$.

In a recommendation system, users are usually interested in the first $n$ items in the recommendation list. Changes in predicted values may not trigger a change in the recommendation list. Hit ratio is the average number of hits across all the users in the test set [14]. In the paper, the hit ratio is the proportion of the first $n$ items in the recommendation list that hit the first $n$ items in the test set. Hit ratio shift models the difference between the average hit ratios of all users, after and before the attacks:

$$\text{Hit ratio shift} = \frac{\sum_{u \in U} \operatorname{abs}(H'_u - H_u)}{|U|}. \tag{8}$$

Here, $H'_u$ and $H_u$ are the hit ratios of the users in the test set, after and before the attacks.

4.3. Experimental Methodology. In the experiments, 10, 15, and 20 items are randomly selected as the target items, respectively. The two metrics of prediction shift and hit ratio shift are used to measure the relative robustness of the algorithms. The values of these metrics are plotted against the size of the attacks, reported as the number of spam profiles as a percentage of the total number of users in the system. The $k$ for the kNN of items was set to 20. The users in the segment had similar ratings on 10 randomly selected items.

To test the robustness of the recommendation algorithms, the applied attack models, attack size, and filler size are listed below.

(i) Attack model is bandwagon attack.
(ii) Attack size is the percentage of attack profiles, valued 5%, 10%, 15%, and 20%, respectively.
(iii) Filler size is the percentage of the filler ratings ($I^F$) in the attacks, valued 5% and 10%, respectively.

The settings of the attack profiles are as follows:

(i) $I^F$: the randomly selected filler items were assigned random values with mean $\mu = 3.6$ and variance $\sigma^2 = 1.1$;
(ii) $I^S$: the selected items were the first $n$ items rated by most users, $n = 20$; the selected items were assigned $r_{\max}$ ($r_{\max} = 5$);
(iii) $I^T$: the target items were assigned $r_{\max}$.

[Figure 6: Prediction shift values comparison of IKCF and WIKCF. Horizontal axis: attack size and filler size (5%, 5%; 10%, 5%; 15%, 5%; 20%, 5%; 5%, 10%; 10%, 10%; 15%, 10%; 20%, 10%); vertical axis: prediction shift; series: IKCF with 20 target items, IKCF with 10 target items, WIKCF with 20 target items, WIKCF with 10 target items.]

The experimental procedure included the following steps:

(1) to get R_Sim_Csn, R_AdjSim_Csn, and In_Sim of users,
(2) to calculate their SRSC, SRSA, and SIS,
(3) to compute the rating weights of users applying the DBSCAN algorithm,
(4) to predict ratings in $U_i\mathrm{test}$ using WIKCF and compare the predicted ratings with the real ratings in $U_i\mathrm{test}$ to get the values of MAE, prediction shift, and hit ratio shift,
(5) to predict ratings in $U_i\mathrm{test}$ applying IKCF and calculate the values of MAE, prediction shift, and hit ratio shift,
(6) to fill attacks into the rating matrix ($U_i\mathrm{base}$) with different attack sizes and filler sizes and then repeat steps 1–5 several times (see the above settings).

4.4. The Experimental Results and Analysis

4.4.1. Comparisons of Prediction Shift Values. The values of prediction shift are shown in Figure 6, in which the impact of the attack is compared between IKCF and WIKCF. The $x$-axis depicts the different attack sizes and filler sizes: the former are 5%, 10%, 15%, and 20%; the latter are 5% and 10%. The $y$-axis indicates the prediction shift values.

In Figure 6, the light and dark gray bars are the results of IKCF; the light and dark blue bars are the results of WIKCF. The bars indicate the prediction shifts when the system suffered from the attacks. In the attacks, the numbers of the target items are 10 and 20. The figure illustrates that the predicted ratings of the adjusted cosine algorithm changed considerably when the system suffered from the attacks with different attack sizes and filler sizes. The greater the attack sizes and filler sizes, the greater the change. Compared with IKCF, the predicted ratings of WIKCF change only a little at any attack size and filler size.
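The metrics in (6)–(8), which underlie the comparison above, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the list-of-lists layout for predictions (rows are users, columns are items), and the representation of a recommendation list as an ordered list of item ids are all hypothetical choices made here.

```python
def mae(predictions, real_ratings):
    """Mean absolute error between paired predictions and real ratings, as in (6)."""
    pairs = list(zip(predictions, real_ratings))
    return sum(abs(p - r) for p, r in pairs) / len(pairs)


def prediction_shift(before, after):
    """Average absolute change of predicted ratings, as in (7).

    Assumption: `before` and `after` are |U| x |I| nested lists holding the
    predictions made before and after the attack for the same (user, item) cells.
    """
    total, count = 0.0, 0
    for row_before, row_after in zip(before, after):
        for p_before, p_after in zip(row_before, row_after):
            total += abs(p_after - p_before)
            count += 1
    return total / count


def hit_ratio(recommended, test_items, n):
    """Share of the first n recommended items found among the first n test items."""
    top_test = set(list(test_items)[:n])
    hits = sum(1 for item in list(recommended)[:n] if item in top_test)
    return hits / n


def hit_ratio_shift(hit_before, hit_after):
    """Average absolute change of per-user hit ratios, as in (8)."""
    pairs = list(zip(hit_before, hit_after))
    return sum(abs(h_after - h_before) for h_before, h_after in pairs) / len(pairs)
```

With these definitions, a robust algorithm is one whose `prediction_shift` and `hit_ratio_shift` stay close to zero as attack profiles are injected, while its `mae` on clean data stays comparable to the baseline.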

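The attack profile settings listed in Section 4.3 can be illustrated with a short sketch of a bandwagon push profile. Several details are assumptions made here rather than facts from the paper: the function name `bandwagon_profile`, the use of 0 for unrated items (in line with $U(i) = \{u \mid r_{u,i} \neq 0\}$), the reading of filler size as a fraction of all items, and the rounding and clipping of the normally distributed filler ratings to the 1–5 scale.

```python
import random


def bandwagon_profile(num_items, popular_items, target_items, filler_size,
                      r_min=1, r_max=5, mean=3.6, variance=1.1, rng=None):
    """Build one bandwagon push-attack profile as a list of ratings.

    Index i of the returned list is the rating for item i; 0 means unrated (I^0).
    """
    rng = rng or random.Random()
    profile = [0] * num_items

    # I^S: the most popular items receive the maximum rating.
    for item in popular_items:
        profile[item] = r_max
    # I^T: the pushed target items also receive the maximum rating.
    for item in target_items:
        profile[item] = r_max

    # I^F: randomly chosen filler items get normally distributed ratings.
    reserved = set(popular_items) | set(target_items)
    candidates = [i for i in range(num_items) if i not in reserved]
    num_fillers = min(len(candidates), round(filler_size * num_items))
    std = variance ** 0.5
    for item in rng.sample(candidates, num_fillers):
        value = round(rng.gauss(mean, std))
        profile[item] = max(r_min, min(r_max, value))  # clip to the rating scale

    return profile


# Usage: a 50-item catalogue, 3 popular items, 1 target item, 20% filler size.
example = bandwagon_profile(50, [0, 1, 2], [10], 0.2, rng=random.Random(7))
```

An attack of a given attack size would then inject a number of such profiles equal to that percentage of the existing users into the training rating matrix.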
Table 3: MAE values. taken into consideration in the baseline approaches; in other words, the weights of spam users and normal users are the IKCF WIKCF same. 0.837 0.829

Figure 7: Hit ratio shift values comparison of IKCF and WIKCF. [Bar chart; x-axis: attack size and filler size (from 5%, 5% to 20%, 10%); y-axis: hit ratio shift (0.0 to 0.8); series: IKCF top 10, IKCF top 20, WIKCF top 10, WIKCF top 20.]

4.4.2. Comparisons of the Values of Hit Ratio Shift. The hit ratio shifts are shown in Figure 7, in which the impact of the attack is compared between the IKCF and WIKCF algorithms. Similar to Figure 6, the x-axis depicts the different attack sizes and filler sizes: the former are 5%, 10%, 15%, and 20%; the latter are 5% and 10%. The y-axis indicates the values of the hit ratio shifts.

In Figure 7, the light and dark gray bars are the results of IKCF and the light and dark blue bars are the results of WIKCF; they indicate the hit ratio shifts under the attacks. The number of target items in the attacks is 10. The hit ratios were computed according to the top 10 and top 20 items in the recommendation list and Uitest. The figure shows that the hit ratio of IKCF changed considerably when the system suffered from attacks with different attack sizes and filler sizes. The greater the attack sizes and filler sizes, the greater the change of IKCF. Compared with IKCF, the hit ratio values of WIKCF change little at any attack size and filler size.

4.4.3. Comparison of MAE Values. As illustrated in Table 3, the MAE values of the two algorithms are almost the same.

4.4.4. Experimental Analysis. It can be seen from Table 3 and Figures 6 and 7 that WIKCF is more robust than IKCF, while its MAE values are comparable to those of IKCF. The robustness is demonstrated by the following: (1) the prediction shift and hit ratio shift of WIKCF are smaller than those of IKCF and (2) as the attack size and filler size increase, the impact of the attack on IKCF grows, whereas the impact of the attack on WIKCF remains stable. A possible reason is that the rating weights of the users are not

4.5. The Comparisons with Related Works. Zhang [26] proposed a trust-aware CF approach based on users' multiple interests to provide robust recommendations and tested it against the MovieLens dataset. He applied random and average attack models to test his user-based CF algorithm. Similar results for user-based CF can be found in Mehta and Nejdl [33], in which a matrix factorization strategy (VarSelectSVD) is used under 5% average attacks and 7% filler. As mentioned before, those attack models are effective against user-based CF rather than item-based CF algorithms, whereas other models, such as the bandwagon and segment models, are quite effective against item-based CF algorithms. Therefore, in this research, the bandwagon models are applied against the proposed item-based CF algorithm. Mobasher et al. [13] applied kNN supervised classification for user-based and item-based CF on the MovieLens 100 K dataset by using 15 detection attributes that include six generic attributes, six attributes of the average attack model, and three attributes of the group attack model.

Despite the weak comparability, the experimental results are given for reference: the prediction shifts of Zhang's research [14] are in the range of 0.2∼0.5, whereas the prediction shifts in this research are less than 0.1, and the hit ratio shifts of his work are similar to the experimental results of this research. The prediction shifts from Hurley [34] are about 0.1∼0.3 under bandwagon attacks, but the results in this research are less than 0.1.

5. Conclusions

In this paper, three commonly used user relationships and the construction of user models have been analyzed first. Then the best user models have been selected with a clustering method according to the results of spam user detection. Finally, a detection-based approach has been proposed for the calculation of item similarities and rating prediction.

The experimental results in this research demonstrate that the most used relationships, interesting similarity and rating similarity, are important for detecting spam users; that the density-based clustering algorithm is effective in detecting spam users; and that the detection-based filtering approach improves the robustness of the typical item-based kNN CF recommendation approach.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research is supported by the National Natural Science Foundation of China (71102065), the National Key Basic Research Program of China (973) (2013CB328903), and the China Postdoctoral Science Foundation (2012M521680).
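As an illustration of the robustness metrics compared in Section 4.4 (prediction shift, hit ratio shift, and MAE), the following minimal Python sketch computes them on toy data. It assumes the commonly used definitions: the prediction shift is the mean change in predicted ratings of the target items after an attack, and the hit ratio is the fraction of user and target-item pairs for which the target item appears in the user's top-N list. These may differ in detail from the definitions used in this paper. All function names and data below are hypothetical and are not taken from the authors' implementation.

```python
from typing import Dict, List, Sequence


def prediction_shift(before: Sequence[float], after: Sequence[float]) -> float:
    """Mean change in predicted rating over the same (user, target item) pairs."""
    if len(before) != len(after) or len(before) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(a - b for b, a in zip(before, after)) / len(before)


def hit_ratio(top_lists: Dict[str, List[str]], targets: Sequence[str], n: int) -> float:
    """Fraction of (user, target) pairs whose target is in the user's top-n list."""
    target_set = set(targets)
    if not top_lists or not target_set:
        raise ValueError("need at least one user and one target item")
    hits = sum(len(target_set & set(items[:n])) for items in top_lists.values())
    return hits / (len(top_lists) * len(target_set))


def hit_ratio_shift(
    before: Dict[str, List[str]],
    after: Dict[str, List[str]],
    targets: Sequence[str],
    n: int,
) -> float:
    """Hit ratio after the attack minus hit ratio before the attack."""
    return hit_ratio(after, targets, n) - hit_ratio(before, targets, n)


def mae(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean absolute error between predicted and actual ratings."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Toy data (hypothetical): item "t" is the target item pushed by the attack.
lists_before = {"u1": ["a", "b", "c"], "u2": ["b", "c", "d"]}
lists_after = {"u1": ["t", "a", "b"], "u2": ["b", "t", "c"]}

if __name__ == "__main__":
    print("prediction shift:", prediction_shift([3.0, 2.0], [4.0, 2.5]))
    print("hit ratio shift (top 2):", hit_ratio_shift(lists_before, lists_after, ["t"], 2))
    print("MAE:", mae([4.0, 2.0], [3.0, 4.0]))
```

Under these assumed definitions, a smaller prediction shift and a smaller hit ratio shift at a comparable MAE indicate a more robust algorithm, which is the sense in which WIKCF is compared with IKCF above.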

References

[1] L. Lü, M. Medo, C. H. Yeung, Y.-C. Zhang, Z.-K. Zhang, and T. Zhou, "Recommender systems," Physics Reports, vol. 519, no. 1, pp. 1–49, 2012.
[2] X. Luo, Y. Xia, and Q. Zhu, "Incremental collaborative filtering recommender based on regularized matrix factorization," Knowledge-Based Systems, vol. 27, pp. 271–280, 2012.
[3] E. Frias-Martinez, G. Magoulas, S. Chen, and R. Macredie, "Automated user modeling for personalized digital libraries," International Journal of Information Management, vol. 26, no. 3, pp. 234–248, 2006.
[4] Q. Liu, E. Chen, H. Xiong, C. H. Q. Ding, and J. Chen, "Enhancing collaborative filtering by user interest expansion via personalized ranking," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 42, no. 1, pp. 218–233, 2012.
[5] M. Gao, Z. Wu, and F. Jiang, "Userrank for item-based collaborative filtering recommendation," Information Processing Letters, vol. 111, no. 9, pp. 440–446, 2011.
[6] A. Said, B. J. Jain, and S. Albayrak, "Analyzing weighting schemes in collaborative filtering: cold start, post cold start and power users," in Proceedings of the 27th Annual ACM Symposium on Applied Computing, pp. 2035–2040, ACM, 2012.
[7] G. Linden, B. Smith, and J. York, "Amazon.com recommendations: item-to-item collaborative filtering," IEEE Internet Computing, vol. 7, no. 1, pp. 76–80, 2003.
[8] T.-P. Liang, Y.-F. Yang, D.-N. Chen, and Y.-C. Ku, "A semantic-expansion approach to personalized knowledge recommendation," Decision Support Systems, vol. 45, no. 3, pp. 401–412, 2008.
[9] G. Adomavicius and A. Tuzhilin, "Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 6, pp. 734–749, 2005.
[10] F. Cacheda, V. Carneiro, D. Fernández, and V. Formoso, "Comparison of collaborative filtering algorithms: limitations of current techniques and proposals for scalable, high-performance recommender systems," ACM Transactions on the Web, vol. 5, no. 1, article 2, 2011.
[11] A. S. Das, M. Datar, A. Garg, and S. Rajaram, "Google news personalization: scalable online collaborative filtering," in Proceedings of the 16th International World Wide Web Conference (WWW '07), pp. 271–280, Alberta, Canada, May 2007.
[12] B. Mobasher, R. Burke, C. Williams, and R. Bhaumik, "Analysis and detection of segment-focused attacks against collaborative recommendation," Lecture Notes in Computer Science, vol. 4198, pp. 96–118, 2006.
[13] B. Mobasher, R. Burke, R. Bhaumik, and C. Williams, "Toward trustworthy recommender systems: an analysis of attack models and algorithm robustness," ACM Transactions on Internet Technology, vol. 7, no. 4, article 23, pp. 2301–2338, 2007.
[14] B. Mehta, T. Hofmann, and P. Fankhauser, "Lies and propaganda: detecting spam users in collaborative filtering," in Proceedings of the 12th International Conference on Intelligent User Interfaces (IUI '07), pp. 14–21, January 2007.
[15] B. Mehta, T. Hofmann, and W. Nejdl, "Robust collaborative filtering," in Proceedings of the 1st ACM Conference on Recommender Systems (RecSys '07), pp. 49–56, October 2007.
[16] P. Massa and P. Avesani, "Trust-aware collaborative filtering for recommender systems," Lecture Notes in Computer Science, vol. 3290, pp. 492–508, 2004.
[17] J.-S. Lee and D. Zhu, "Shilling attack detection - a new approach for a trustworthy recommender system," INFORMS Journal on Computing, vol. 24, no. 1, pp. 117–131, 2012.
[18] B. Sarwar, G. Karypis, J. Konstan, and J. Reidl, "Item-based collaborative filtering recommendation algorithms," in Proceedings of the 10th International Conference on World Wide Web, pp. 285–295, Hong Kong, 2001.
[19] M. O'Mahony, N. Hurley, N. Kushmerick, and G. Silvestre, "Collaborative recommendation: a robustness analysis," ACM Transactions on Internet Technology, vol. 4, no. 4, pp. 344–377, 2004.
[20] D. Lemire and A. Maclachlan, "Slope one predictors for online rating-based collaborative filtering," Society for Industrial Mathematics, vol. 5, pp. 471–480, 2005.
[21] M. Gao and Z. Wu, "Personalized context-aware collaborative filtering based on neural network and slope one," Lecture Notes in Computer Science, vol. 5738, pp. 109–116, 2009.
[22] B. Mobasher, R. Burke, R. Bhaumik, and C. Williams, "Effective attack models for shilling item-based collaborative filtering systems," in Proceedings of the WebKDD Workshop, pp. 13–23, Citeseer, Chicago, Ill, USA, 2005.
[23] J. O'Donovan and B. Smyth, Trust in Recommender Systems, IUI, Association for Computing Machinery, New York, NY, USA, 2005.
[24] P. Massa and P. Avesani, "Trust-aware recommender systems," in Proceedings of the 1st ACM Conference on Recommender Systems (RecSys '07), pp. 17–24, October 2007.
[25] M. O'Mahony, N. Hurley, and G. Silvestre, "Promoting recommendations: an attack on collaborative filtering," Lecture Notes in Computer Science, vol. 2453, pp. 213–241, 2002.
[26] F. Zhang, "Research on trust based collaborative filtering algorithm for user's multiple interests," Journal of Chinese Computer Systems, vol. 29, pp. 1415–1419, 2008.
[27] Y.-K. Yu, Y.-C. Zhang, P. Laureti, and L. Moret, "Decoding information from noisy, redundant, and intentionally distorted sources," Physica A, vol. 371, no. 2, pp. 732–744, 2006.
[28] Y.-B. Zhou, T. Lei, and T. Zhou, "A robust ranking algorithm to spamming," EPL, vol. 94, no. 4, Article ID 48002, 2011.
[29] M.-S. Shang, L. Lü, W. Zeng, Y.-C. Zhang, and T. Zhou, "Relevance is more significant than correlation: information filtering on sparse data," EPL, vol. 88, no. 6, Article ID 68008, 2009.
[30] H.-P. Kriegel, P. Kröger, J. Sander, and A. Zimek, "Density-based clustering," WIREs Data Mining and Knowledge Discovery, vol. 3, pp. 231–240, 2011.
[31] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD '96), pp. 226–231, AAAI Press, 1996.
[32] M. Gori, A. Pucci, V. Roma, and I. Siena, "Itemrank: a random-walk based scoring algorithm for recommender engines," in Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI '07), pp. 778–781, Hyderabad, India, 2007.
[33] B. Mehta and W. Nejdl, "Attack resistant collaborative filtering," in Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM SIGIR '08), pp. 75–82, New York, NY, USA, July 2008.
[34] N. J. Hurley, "Tutorial on robustness of recommender systems," in Proceedings of the 5th ACM Conference on Recommender Systems (RecSys '11), pp. 9–10, October 2011.
