TRANSACTIONS ON NETWORK SCIENCE AND ENGINEERING 1 Trustworthy Website Detection Based on Social Hyperlink Network Analysis
Xiaofei Niu, Guangchi Liu, Student Member, and Qing Yang, Senior Member
Abstract—Trustworthy website detection plays an important role in providing users with meaningful web pages from a search engine. Current solutions to this problem, however, mainly focus on detecting spam websites instead of promoting more trustworthy ones. In this paper, we propose the enhanced OpinionWalk (EOW) algorithm to compute the trustworthiness of all websites and identify trustworthy websites with higher trust values. The proposed EOW algorithm treats the hyperlink structure of websites as a social network and applies social trust analysis to calculate the trustworthiness of individual websites. To combine social trust analysis with trustworthy website detection, we model the trustworthiness of a website based on the quantity and quality of the websites it points to. We further design a mechanism in EOW to record which websites' trustworthiness needs to be updated while the algorithm "walks" through the network. As a result, the execution time of EOW is reduced by 27.1%, compared to the OpinionWalk algorithm. Using the public dataset WEBSPAM-UK2006, we validate the EOW algorithm and analyze the impacts of seed selection, size of the seed set, maximum searching depth, and starting nodes on the algorithm. Experimental results indicate that the EOW algorithm identifies 5.35% to 16.5% more trustworthy websites, compared to TrustRank.
Index Terms—Trust model, social trust network, trustworthy website detection, social hyperlink network.
1 INTRODUCTION
SEARCH engines have become more and more important for our daily lives, due to their ability to provide relevant information or web pages to users. Although a search engine typically returns thousands of web pages to answer a query, users usually read only a few at the top of the list of recommended pages [1]. The advantage of a company's website being ranked at the top of the list can be converted to an increase in sales, revenue and profits. As a result, several techniques have been created to clandestinely increase a web page's ranking position, to achieve an undeserved high click-through rate (CTR). These deceptive actions produce untrustworthy websites that are generally referred to as spam websites [2]. It is shown that 22.08% of English websites/hosts are classified as spam [3]. Similarly, about 15% of Chinese web pages are spam. With spam websites ranked at the top of the search results, users waste their time processing useless information, leading to a deteriorated quality of experience (QoE). Therefore, it is critical to design a mechanism to promote more trustworthy websites and eliminate spam in the search results provided to users.

1.1 Limitations of Prior Art
Existing solutions to trustworthy website detection focus mainly on identifying spam websites, i.e., as spam websites are removed from the search results, more trustworthy websites are promoted. Web spam can be broadly classified into four groups: content spam, link spam, cloaking and redirection, and click spam [4]. Content spam refers to deliberate changes in the HTML fields of a web page so that the spam page becomes more relevant to certain queries. For example, keywords relevant to popular query terms can be inserted into a spam page. Link spam allows a web page to be highly ranked by manipulating the page's connections to other pages, which confuses hyperlink structure analysis algorithms, e.g., PageRank [5] and HITS [6]. Cloaking is a technique that provides different versions of a page to users, based on the information contained in user queries. Redirection sends users to malicious pages by executing JavaScript code. Click spam is used to generate fraudulent clicks, with the intention of increasing a spam page's ranking position.

Although humans can easily recognize spam websites, it is unrealistic to mark all spam websites manually. Therefore, numerous anti-spam techniques have been proposed, including solutions based on genetic algorithm [7] and genetic programming [8], [9], [10], [11], [12], [13], artificial immune system [14], swarm intelligence [15], particle swarm optimization [16] and ant colony optimization [17]. It is very difficult, however, to detect all types of web spam, because new spam techniques are created almost instantly once a particular type of spam is identified and banned within the Internet. Instead of classifying and detecting spam websites, PageRank [5] and TrustRank [18] explore the possibility of promoting more trustworthy websites to users in the search results. These solutions first rank all web pages or websites by their trust scores, in descending order. Then, only websites ranked at the top of the list are provided to users. As such, the number of spam websites that a user may encounter is significantly reduced.

Unfortunately, existing trust-ranking based algorithms do not accurately model the trustworthiness of web pages, and thus often mistakenly identify spam as trustworthy websites. To improve the performance of trustworthy website detection, we approach this problem by studying the trust relations among websites, leveraging social network analysis techniques.

1.2 Proposed Solution
Trust has been intensively studied in online social networks [19], [20], and the knowledge obtained from this field can be applied to analyze the hyperlink network, consisting of websites, to understand website trustworthiness. Considering a website as an individual user, and the hyperlinks connecting websites as the social relations among them, we can model the network of websites as a social hyperlink network.

According to the three-valued subjective logic model (3VSL) [21], the trust relation between two websites can be modeled as a trust opinion. Based on the opinion operations defined in 3VSL, the trustworthiness of every website can be computed using the OpinionWalk algorithm [22]. The algorithm starts from a seed node and searches the network, in a breadth-first search manner, to compute the trustworthiness of all other nodes, from the seed node's perspective. If multiple seed nodes are chosen, the algorithm will compute several different trust opinions of the same node. These opinions will then be combined to obtain the trustworthiness of the node. As such, the websites with higher trust values can be ranked at the top of the list provided to users.

To apply the OpinionWalk algorithm in trustworthy website detection, however, we need to address two challenges. First, 3VSL models trust as an opinion vector containing three values $b$ (belief), $d$ (distrust), and $n$ (uncertainty). It unfortunately does not specify how the three values of an opinion are obtained. To initialize the trust opinion between two linked websites, we need to understand which factors affect the trustworthiness of a website. From previous studies, we find that trustworthy websites rarely point to spam websites and that websites linking to spam are likely to be spam themselves [18], [23]. Therefore, by checking how many trustworthy (or spam) websites a website links to, we can possibly determine the website's trustworthiness, i.e., the values of $b$, $d$ and $n$ in the corresponding trust opinion. The second challenge lies in the large execution time of the algorithm. As OpinionWalk searches a network level by level, the trustworthiness of all nodes is updated in each search iteration, which yields frequent trust computations and updates that are often unnecessary. To address this challenge, we design a mechanism to record which websites' trustworthiness needs to be updated and only change those values when the algorithm "walks" through the network. As a result, we are able to detect more trustworthy websites within a relatively shorter period of time, compared to the state-of-the-art solutions [5], [18], [22].

1.3 Contributions
In this paper, we discuss how to identify more trustworthy websites by proposing the enhanced OpinionWalk (EOW) algorithm. The key contributions of this paper are as follows.

For the first time, the hyperlinks between websites are viewed as the "social" connections between websites. Leveraging the trust model designed for social networks, the trustworthiness of websites can be quantified. Due to the accuracy of 3VSL in modelling trustworthiness, an individual website's trustworthiness can be precisely calculated using the proposed EOW algorithm.

Based on previous research results, the trustworthiness of a website is mainly determined by the numbers of trustworthy and spam websites it links to. As only the trustworthiness of labeled websites is known, we treat all other websites as uncertain/undecided. Therefore, by counting the numbers of trustworthy, spam, and uncertain websites a website points to, the website's trustworthiness opinion can be formed.

We enhance the OpinionWalk algorithm by identifying which opinions need to be updated while the algorithm is searching within a social hyperlink network. Specifically, a Boolean vector is used to keep track of the websites that are linked from the websites whose trustworthiness values have just been updated. When the EOW algorithm searches the next level in the network, only the trustworthiness of these websites is changed accordingly.

The proposed EOW algorithm is validated by experiments using the WEBSPAM-UK2006 dataset, which contains both trustworthy and spam websites crawled within the .uk domain. Experimental results indicate that the EOW algorithm identifies 16.5%, 12.65%, 8.77%, and 5.35% more trustworthy websites in the top 1000, 2000, 3000 and 4000 websites, respectively, compared to TrustRank (the state-of-the-art solution), when the number of trustworthy seeds is 200. In addition, EOW saves (on average) about 27.1% of the execution time, compared to the OpinionWalk algorithm, in computing the trustworthiness of all websites.

The rest of this paper is organized as follows. In Section 2, we introduce the proposed EOW algorithm, followed by an example illustrating how it works. In Section 3, we describe how the EOW algorithm performs in detecting trustworthy websites from a real-world dataset. Then, we summarize the related work on trustworthy website detection in Section 4. Finally, we conclude our work and point out future research directions in Section 5.

• X. Niu is with the School of Computer Science and Technology, Shandong Jianzhu University, Jinan, Shandong 250101, China. Email: niuxi-[email protected].
• G. Liu is with Stratifyd Inc., Charlotte, NC 28207, USA. Email: [email protected].
• Q. Yang is with the Department of Computer Science and Engineering, University of North Texas, Denton, Texas 76203, USA. Email: [email protected].
• Q. Yang is the corresponding author of this article.
Manuscript received January 31, 2018; revised July 16, 2018.

2 DETECTING TRUSTWORTHY WEBSITES USING ENHANCED OPINIONWALK ALGORITHM
Considering websites as users, a hyperlink network can be modeled as a "social network" that reflects the "social connections" among different websites. The connection between two linked websites can be assigned a weight to indicate the trustworthiness between them, which results in a weighted trust social network. Within the network, we can leverage trust propagation and trust combination to compute the trustworthiness of individual websites. As such, trustworthy websites can be identified from all websites available.

OpinionWalk was designed to solve the massive trust assessment problem in social networks [22]; however, a few challenges need to be addressed before it can be applied to trustworthy website detection. The first challenge is to design a mechanism to assign weights on connections/links
so that the trustworthiness between websites can be reflected by the weights. The second challenge is to enhance the OpinionWalk algorithm to make it more efficient, as the current OpinionWalk algorithm is very slow. We propose the enhanced OpinionWalk (EOW) algorithm that starts from a trustworthy node and searches the hyperlink network to detect more trustworthy websites.

2.1 Social Hyperlink Network Model
The hyperlink graph of websites can be modeled as a directed graph $G = (V, E, W)$, where vertex $i \in V$ denotes a website and edge $e(i, j) \in E$ represents the hyperlink from website $i$ to website $j$. We call website $j$ an adjacent node of $i$, and the weight $w(i, j) \in W$ of the edge $e(i, j)$ indicates how much $i$ trusts $j$. We further define the indegree of a website as the number of websites pointing to it. The outdegree of a website is defined as the number of websites that it points to. The weight on each edge in graph $G$ is usually modeled as a real number [19]; however, we find that a real number cannot accurately reflect the trust between two nodes [21]. To conduct precise trust computation, the OpinionWalk algorithm defines trust as an opinion vector that contains three numbers reflecting how likely a user is to be trustful, not trustful, and uncertain, respectively. We adopt this definition and propose a mechanism to assign the three values of an opinion in the trustworthy website detection problem.

2.2 Edge Weight Assignment
By analyzing the structure of existing hyperlink graphs, we find that a trustworthy site rarely links to a spam site [18]. Besides, a website pointing to many spam sites is very likely to be a spam site [23]. By looking at how many trustworthy websites a website points to, and how trustworthy these websites are, we are able to determine the trustworthiness of the website. In other words, the quality and quantity of the pointed-to websites should be considered in modeling the trustworthiness of a website.

Fig. 1: Illustration of weight assignment in a hyperlink structure graph. (Websites $i$ and $t$ point to website $j$; legend: trustworthy, spam, uncertain.)

For trustworthy website detection, there is usually a group of websites that are labeled as either normal or spam by humans. This group of websites is commonly referred to as a labeled set. We further divide the labeled set into two groups: the seed set and the testing set. While the former is used to initialize trust relations between websites, the latter is used for evaluation purposes. As shown in Fig. 1, we now focus on the nodes that website $j$ points to. These nodes could be trustworthy, not trustworthy, or uncertain, depending on the nature of the corresponding websites. As such, we consider a labeled normal website as a trustworthy one, and a labeled spam website as an untrustworthy one. Websites that have not been analyzed by humans, or that are not labeled, are called undecided or uncertain websites.

We use $g_j$, $s_j$ and $u_j$ to denote the numbers of labeled good/normal websites, labeled spam websites, and undecided websites that $j$ points to. Let $b_j$, $d_j$, $n_j$ denote the probabilities that website $j$ is trustworthy, not trustworthy, and uncertain, respectively. According to the three-valued subjective logic [21] that is used to model trust in OpinionWalk [22], these probabilities can be computed as follows.

$$
\begin{cases}
b_j = \dfrac{g_j}{g_j + s_j + u_j + 3} \\[2mm]
d_j = \dfrac{s_j}{g_j + s_j + u_j + 3} \\[2mm]
n_j = \dfrac{u_j}{g_j + s_j + u_j + 3} \\[2mm]
e_j = \dfrac{3}{g_j + s_j + u_j + 3}
\end{cases}
\tag{1}
$$

In the above equations, $e_j$ denotes the prior uncertainty existing in website $j$. As we can see, if website $j$ points to no other website, the values of $b_j$, $d_j$ and $n_j$ are zero and $e_j = 1$, indicating that the trustworthiness of website $j$ is unknown or fully uncertain. In this case, we assume website $j$ points to 3 virtual websites: a normal one, a spam one, and an uncertain one. This is why $e_j$ is called the prior uncertainty of website $j$. The assumption is reasonable because Eq. 1 considers both prior ($e_j$) and posterior ($n_j$) uncertainties and still works even if website $j$ does not point to any other website. More details about the difference between prior and posterior uncertainties can be found in [21].

Based on the above analysis, we model the trustworthiness between two websites $i$ and $j$ as an opinion vector $\omega_{ij} = (b_{ij}, d_{ij}, n_{ij}, e_{ij}) = (b_j, d_j, n_j, e_j)$. $\omega_{ij}$ indicates how trustworthy website $j$ is, from website $i$'s perspective. As shown in Fig. 1, if there is another website $t$ also pointing to $j$, we have $\omega_{tj} = \omega_{ij} = (b_j, d_j, n_j, e_j)$. Note that $\omega_{ij} = \omega_{tj}$, which is different from OpinionWalk, where different users are assumed to have different opinions on the same user.

If there is no hyperlink from $i$ to $j$, i.e., $e(i, j) \notin E$, then we define $i$'s opinion on $j$ as the uncertain opinion $O = (0, 0, 0, 1)$, which indicates that a website is totally uncertain about whether another website is trustworthy. For $\omega_{ii}$, a certain opinion $I$ is defined as $I = (1, 0, 0, 0)$, which indicates that a website absolutely trusts itself.

2.3 Opinion Matrix Initialization
Given a hyperlink graph $G = (V, E, W)$ containing $n$ nodes and a subset $S \subset V$ of labeled nodes, we can obtain the opinion matrix $M$, as it is defined in [22]. In the matrix, $\omega_{ij} = (b_j, d_j, n_j, e_j)$ is calculated from Eq. 1 if $e(i, j) \in E$; otherwise, $\omega_{ij} = O$, except for $\omega_{ii} = I$.

$$
M = \begin{bmatrix}
I & \omega_{12} & \cdots & \omega_{1n} \\
\omega_{21} & I & \cdots & \cdots \\
\cdots & \cdots & \cdots & \cdots \\
\omega_{n1} & \cdots & \cdots & I
\end{bmatrix}.
$$

The opinion matrix records all the trust relations among websites that are connected to each other. This matrix will then be used to compute the trustworthiness of all websites, which will be introduced in later sections.

Algorithm 1 shows how to initialize the opinion matrix, based on a directed graph $G$ and a labeled seed set $S$. Please note that $S$ is composed of websites that are selected from the labeled set with the seed selection method described in
Section 3.1.1. Lines 1-9 initialize the opinion matrix $M$ with $I$ or $O$. Lines 10-28 update $M[i][j]$ based on the numbers of normal, spam and unlabelled websites that $j$ points to. Finally, line 29 returns the opinion matrix $M$.

Algorithm 1 GetOpinionMatrix($G$, $S$)
Require: Directed graph $G$, labeled seed set $S$.
Ensure: Opinion matrix $M$.
 1: for all node $i$ do
 2:   for all node $j$ do
 3:     if $i = j$ then
 4:       $M[i][j] = I$
 5:     else
 6:       $M[i][j] = O$
 7:     end if
 8:   end for
 9: end for
10: for all node $i$ do
11:   for all nodes $j$ s.t. $e(i, j) \in E$ do
12:     $g_j \leftarrow 0$, $s_j \leftarrow 0$, $u_j \leftarrow 0$;
13:     for all nodes $p$ s.t. $e(j, p) \in E$ do
14:       if $p \in S$ and $p$ is normal then
15:         $g_j \leftarrow g_j + 1$
16:       else
17:         if $p \in S$ and $p$ is spam then
18:           $s_j \leftarrow s_j + 1$
19:         else
20:           $u_j \leftarrow u_j + 1$
21:         end if
22:       end if
23:     end for
24:     $m \leftarrow g_j + s_j + u_j + 3$;
25:     $b_j \leftarrow g_j/m$, $d_j \leftarrow s_j/m$, $n_j \leftarrow u_j/m$, $e_j \leftarrow 3/m$;
26:     $M[i][j] \leftarrow (b_j, d_j, n_j, e_j)$;
27:   end for
28: end for
29: return $M$

2.4 Trust Propagation and Combination
Given two edges $e(i, s)$ and $e(s, j) \in E$ with the weights $\omega_{is} = (b_{is}, d_{is}, n_{is}, e_{is})$ and $\omega_{sj} = (b_{sj}, d_{sj}, n_{sj}, e_{sj})$, we are able to compute $\omega_{ij} = \Delta(\omega_{is}, \omega_{sj})$. In other words, $s$'s opinion on the trustworthiness of website $j$ can be propagated to website $i$ so that $i$ derives its own opinion about $j$'s trustworthiness. The above-mentioned process is called trust propagation in social networks. To distinguish them from direct opinions, we use $\Omega_{ij}$ to denote $i$'s indirect opinion on $j$. In this way, $\Omega_{ij}$ can be computed from the following equations [21].

$$
\begin{cases}
b_{ij} = b_{is} b_{sj} \\
d_{ij} = b_{is} d_{sj} \\
n_{ij} = 1 - b_{ij} - d_{ij} - e_{sj} \\
e_{ij} = e
\end{cases}
\tag{2}
$$

where $e = e_{sj}$ if $e_{is} \neq 1$; otherwise, $e = 1$. That implies the prior uncertainty in $i$'s opinion on $j$ is determined by that of $\omega_{sj}$ if $\omega_{sj}$ is not uncertain; otherwise, $\omega_{ij}$ will be uncertain. Note that $\omega_{is}$ and $\omega_{sj}$ can be replaced by indirect opinions $\Omega_{is}$ and $\Omega_{sj}$.

If there are several different paths from website $i$ to website $j$, we can combine these opinions. Let $\Omega_{ij}^{1} = (b_{ij}^{1}, d_{ij}^{1}, n_{ij}^{1}, e_{ij}^{1})$ and $\Omega_{ij}^{2} = (b_{ij}^{2}, d_{ij}^{2}, n_{ij}^{2}, e_{ij}^{2})$ be two different opinions derived from two parallel paths from $i$ to $j$. Then, a new opinion $\Omega_{ij} = (b_{ij}, d_{ij}, n_{ij}, e_{ij})$ can be generated by the combining operation $\Theta(\Omega_{ij}^{1}, \Omega_{ij}^{2})$ as follows [21]:

$$
\begin{cases}
b_{ij} = \dfrac{e_{ij}^{2} b_{ij}^{1} + e_{ij}^{1} b_{ij}^{2}}{e_{ij}^{1} + e_{ij}^{2} - e_{ij}^{1} e_{ij}^{2}} \\[3mm]
d_{ij} = \dfrac{e_{ij}^{2} d_{ij}^{1} + e_{ij}^{1} d_{ij}^{2}}{e_{ij}^{1} + e_{ij}^{2} - e_{ij}^{1} e_{ij}^{2}} \\[3mm]
n_{ij} = \dfrac{e_{ij}^{2} n_{ij}^{1} + e_{ij}^{1} n_{ij}^{2}}{e_{ij}^{1} + e_{ij}^{2} - e_{ij}^{1} e_{ij}^{2}} \\[3mm]
e_{ij} = \dfrac{e_{ij}^{1} e_{ij}^{2}}{e_{ij}^{1} + e_{ij}^{2} - e_{ij}^{1} e_{ij}^{2}}
\end{cases}
\tag{3}
$$

A computed opinion $\Omega_{ij} = (b_{ij}, d_{ij}, n_{ij}, e_{ij})$ contains four values and cannot be directly used to sort websites. We need to convert it to a single trust value. Eq. 4 is used to calculate the probability that $j$ is a trustworthy website, where $x$ and $y$ are the coefficients of the posterior uncertainty and the prior uncertainty, indicating how much of the posterior uncertainty and the prior uncertainty is credible, respectively.

$$
E(\Omega_{ij}) = b_{ij} + x \times n_{ij} + y \times e_{ij} \tag{4}
$$

2.5 Enhanced OpinionWalk Algorithm
Based on the previous discussion, opinions are used to quantify the trust relations between individual websites. Specifically, the opinions are derived from the labeled websites, based on Eq. 1. With these trust opinions, the opinion matrix can be initialized to reflect the social connections among websites and the strength of these connections. The OpinionWalk algorithm starts from a seed node and searches the network level by level, and the trustworthiness of all websites is iteratively obtained during the searching process. The trust computation is realized by carrying out propagating and combining operations on opinions. However, the OpinionWalk algorithm updates the trustworthiness of all websites iteratively, and every opinion needs to be recalculated in each iteration. Let's assume the algorithm starts from website $i$ and aims at computing the trustworthiness of all other websites. The trustworthiness values are recorded in

$$
Y_i^{(k)} = [\Omega_{i1}^{(k)}, \Omega_{i2}^{(k)}, \cdots, \Omega_{ij}^{(k)}, \cdots, \Omega_{in}^{(k)}]^T,
$$

where $\Omega_{ij}^{(k)}$ denotes $i$'s opinion about the trustworthiness of website $j$, after the algorithm searches $k$ levels in the network. Because every opinion is recalculated in each iteration, the execution time of OpinionWalk is not favorable for quick trustworthy website detection. To address this issue, in this section, we introduce the Enhanced OpinionWalk (EOW) algorithm that updates fewer opinions in each iteration, and thus results in a shorter running time.

When the algorithm starts from website $i$, the opinion vector

$$
Y_i^{(1)} = [\omega_{i1}, \omega_{i2}, \cdots, \omega_{ij}, \cdots, \omega_{in}]^T
$$

is initialized, based on the direct links among websites. Next, we show how to obtain $Y_i^{(k+1)}$ from $Y_i^{(k)}$, which occurs when the algorithm moves from the $k$-th level to the $(k+1)$-th level in the network. For a trust opinion in $Y_i^{(k)}$,
e.g., $\Omega_{is}^{(k)}$, if it is not an uncertain opinion $O$ and $\omega_{sj} \neq O$, where $\omega_{sj} \in M$ and $s \neq i \neq j$, we can compute a new opinion $\Delta(\Omega_{is}^{(k)}, \omega_{sj})$, based on the trust propagation operation. It denotes $i$'s opinion on the trustworthiness of $j$, based on the "recommendation" of its $k$-hop "friend" website $s$. As shown in Fig. 2, if there exist $m$ nodes that can make this type of recommendation, labeled as $s_1, s_2, \cdots, s_m$, we can combine the $m$ newly obtained opinions to get a new opinion $\Theta(\Delta(\Omega_{is_1}^{(k)}, \omega_{s_1 j}), \Delta(\Omega_{is_2}^{(k)}, \omega_{s_2 j}), \cdots, \Delta(\Omega_{is_m}^{(k)}, \omega_{s_m j}))$. As this opinion expresses $i$'s most current opinion on $j$'s trustworthiness, we denote it as $\Omega_{ij}^{(k+1)}$. In the same way, we can update all elements in $Y_i^{(k)}$ to form $Y_i^{(k+1)}$.

Fig. 2: Illustration of trustworthiness update from the $k$-th level to the $(k+1)$-th level in the network. (Node $i$ reaches node $j$ through the intermediate nodes $s_1, s_2, \cdots, s_m$.)

From the above description, we can see that $Y_i^{(k+1)}$ is solely determined by $Y_i^{(k)}$ and $M$, and not all opinions are changed in the updating process. In fact, if $\Omega_{ij}^{(k)}$ changes when the algorithm is processing the $k$-th level of the network, then only $i$'s opinions on $j$'s adjacent nodes need to be recalculated in the next iteration. We propose a mechanism to keep track of which elements in the opinion vector $Y_i^{(k)}$ need to be updated and only update those opinions in $Y_i^{(k+1)}$. In the enhanced OpinionWalk (EOW) algorithm, there exists a Boolean vector $F^{(k)} = [f_1, f_2, \cdots, f_j, \cdots, f_n]^T$ that indicates whether $i$'s opinion on $j$ needs to be updated in the $(k+1)$-th iteration. If $f_j$ equals 1, $\Omega_{ij}^{(k+1)}$ needs to be recalculated; otherwise, it remains unchanged. With this subtle modification of the OpinionWalk algorithm, the (average) execution time of EOW is only 72.9% of OpinionWalk's. We will use the example shown in Fig. 3 to illustrate how the EOW algorithm works.

Fig. 3: An example illustrating the Enhanced OpinionWalk algorithm. (A network of four nodes: 1, 2, 3 and 4.)

With the given network, we derive the opinion matrix as follows.

$$
M = \begin{bmatrix}
I & \omega_{12} & \omega_{13} & O \\
O & I & \omega_{23} & \omega_{24} \\
O & O & I & \omega_{34} \\
O & O & O & I
\end{bmatrix}.
$$

Let's assume EOW starts from node 1; then we have

$$
Y_1^{(1)} = [I, \omega_{12}, \omega_{13}, O]^T,
$$

and $F^{(1)} = [0, 0, 1, 1]^T$, which means $\Omega_{13}^{(1)}$ and $\Omega_{14}^{(1)}$ will be updated next but $\Omega_{11}^{(1)}$ and $\Omega_{12}^{(1)}$ remain the same when EOW searches the second level of the network. For node 3, because there is an opinion $\omega_{12} \in Y_1^{(1)}$ and an opinion $\omega_{23} \in M$, we can update node 1's opinion on node 3 via node 2 to $\Delta(\omega_{12}, \omega_{23})$. Similarly, because there is an opinion $\omega_{11} \in Y_1^{(1)}$ and $\omega_{13} \in M$, node 1 gets a new opinion on node 3 as follows.

$$
\Delta(\omega_{11}, \omega_{13}) = \Delta(I, \omega_{13}) = \omega_{13}
$$

These two newly obtained opinions will be combined to form node 1's most current opinion on node 3's trustworthiness.

$$
\Omega_{13}^{(2)} = \Theta(\omega_{13}, \Delta(\omega_{12}, \omega_{23}))
$$

For node 4, we have $\omega_{12} \in Y_1^{(1)}$ and $\omega_{24} \in M$, so node 1's opinion on node 4 is updated to $\Delta(\omega_{12}, \omega_{24})$. With $\omega_{13} \in Y_1^{(1)}$ and $\omega_{34} \in M$, node 1 gets another opinion on node 4, i.e., $\Delta(\omega_{13}, \omega_{34})$. Combining these two opinions yields

$$
\Omega_{14}^{(2)} = \Theta(\Delta(\omega_{12}, \omega_{24}), \Delta(\omega_{13}, \omega_{34})).
$$

Therefore, after the EOW algorithm finishes searching the second level, the opinion vector is updated to $Y_1^{(2)}$:

$$
[I, \omega_{12}, \Theta(\omega_{13}, \Delta(\omega_{12}, \omega_{23})), \Theta(\Delta(\omega_{12}, \omega_{24}), \Delta(\omega_{13}, \omega_{34}))]^T
$$

After this iteration, because $\Omega_{13}$ and $\Omega_{14}$ change, we need to update the Boolean vector to $F^{(2)} = [0, 0, 0, 1]^T$, indicating that node 1 will update its opinion on node 4 but keep its opinions on other nodes unchanged in the next round. This is because node 4 is the adjacent neighbor of node 3. When EOW searches the third level, because there exist $\omega_{12} \in Y_1^{(2)}$ and $\omega_{24} \in M$, we update $\Omega_{14}$ to $\Delta(\omega_{12}, \omega_{24})$. With $\Omega_{13}^{(2)} \in Y_1^{(2)}$ and $\omega_{34} \in M$, node 1 has a new opinion on node 4, $\Delta(\Omega_{13}^{(2)}, \omega_{34})$. Combining these two opinions, node 1 derives a new opinion on node 4 as follows.

$$
\Omega_{14}^{(3)} = \Theta(\Delta(\omega_{12}, \omega_{24}), \Delta(\Omega_{13}^{(2)}, \omega_{34})).
$$

As such, the opinion vector is updated to

$$
Y_1^{(3)} = [I, \omega_{12}, \Omega_{13}^{(2)}, \Omega_{14}^{(3)}]^T = [I, \omega_{12}, \Theta(\omega_{13}, \Delta(\omega_{12}, \omega_{23})), \Theta(\Delta(\omega_{12}, \omega_{24}), \Delta(\Theta(\omega_{13}, \Delta(\omega_{12}, \omega_{23})), \omega_{34}))]^T
$$

After this iteration, because node 4 has no adjacent node, we have $F^{(3)} = [0, 0, 0, 0]^T$. That also means the EOW algorithm stops and node 1's opinions on the trustworthiness of all other nodes are obtained.

Algorithm 2 describes how to get the trustworthiness of all other websites, from website $i$'s perspective. Line 1 calls Algorithm 1 to obtain the opinion matrix $M$. Lines 2-5 initialize $F$ and $Y_i^{(1)}$, based on $M$. Lines 6-12 update the bit corresponding to $i$'s adjacent nodes to 1. Line 13 initializes the searching level $k$ to 1. Line 14 controls how many levels the EOW algorithm will search on the graph $G$. Lines 15-29 compute $Y_i^{(k+1)}$ based on $Y_i^{(k)}$ and $M$, and update the Boolean vector $F$ accordingly. Line 15 copies all opinions from $Y_i^{(k)}$ to $Y_i^{(k+1)}$. Lines 16-21 recalculate node $i$'s opinions on the websites whose trustworthiness needs to be updated. Lines 18-20 combine all opinions derived from $\omega_{sj} \neq O$. Lines 22-29 update $F[j]$ based on which elements $Y_i^{(k+1)}[j]$ differ from $Y_i^{(k)}[j]$. Finally, the vector $Y_i^{(k)}$ will contain node $i$'s opinions on all other nodes, after EOW searches $H$ levels within the graph $G$. Increasing the value of $H$ is expected to yield more accurate trust computation; however, it also increases the execution time of the EOW algorithm. In practice, the value of $H$ is set to a number ranging from 3 to 6 to achieve a good trade-off between accuracy and performance.

Generally, more labeled data lead to more detailed evaluation, so we adopt the WEBSPAM-UK2006 dataset in evaluating the proposed EOW algorithm. The WEBSPAM-UK2006 dataset was collected by a web crawler "crawling" the .uk domain. It contains 77.9 million web pages, corresponding to 11402 websites, of which 7473 websites/hosts are labelled as either normal/trustworthy or spam.

3.1.1 Seed Selection
Because both TrustRank and EOW start from a set of seed nodes/websites to search for more trustworthy websites, in this section, we first introduce how seed nodes are usually chosen. The commonly used seed selection methods for trustworthy website detection are high PageRank and inverse PageRank [18]. By applying the PageRank algorithm to a website hyperlink network, every website is assigned a PageRank score. If the websites are sorted by their scores, in descending order, then those located at the top of the list are considered trustworthy. Among these websites, the labelled trustworthy hosts are chosen as the seed nodes, which is why this algorithm is called high PageRank. On the other hand, the inverse PageRank algorithm first inverts the links in the network, and then performs PageRank on the inverted graph. The labeled websites with large PageRank scores are then treated as the seed nodes.

3.1.2 Evaluation
The proposed EOW algorithm will rank all websites based on their trust scores (derived from trust opinions), and

Algorithm 2 EOW($G$, $S$, $i$, $H$)
Require: A directed graph $G$, labeled sample set $S$, starting node $i$, maximum searching depth $H$.
Ensure: $i$'s opinion on $j$ where $j \neq i$.
 1: $M$ = GetOpinionMatrix($G$, $S$);
 2: for all node $j$ do
 3:   $Y_i^{(1)}[j] \leftarrow M[i][j]$
 4:   $F[j] \leftarrow 0$
 5: end for
 6: for all node $j$ do
 7:   if $Y_i^{(1)}[j] \neq O$ then
 8:     for all node $q$ s.t. $M[j][q] \neq O$ do
 9:       $F[q] \leftarrow 1$
10:     end for
11:   end if
12: end for
13: $k \leftarrow 1$
14: while k
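As an illustration of the opinion operations defined in Section 2 (Eqs. 1 to 4), a minimal Python sketch is given below. It is not the authors' implementation: the names `Opinion`, `opinion_from_counts`, `discount`, `combine` and `expected_trust` are hypothetical, the link counts in the usage part are made up, and the coefficient values passed for $x$ and $y$ are arbitrary placeholders. The sketch also assumes that the $e_{is} = 1$ branch of Eq. 2 returns the uncertain opinion $O$, as the accompanying text states, and that Eq. 3 is undefined when both prior uncertainties are zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    """A 3VSL-style opinion (b, d, n, e); the four values sum to 1."""
    b: float  # belief
    d: float  # distrust
    n: float  # posterior uncertainty
    e: float  # prior uncertainty


UNCERTAIN = Opinion(0.0, 0.0, 0.0, 1.0)  # the opinion O
CERTAIN = Opinion(1.0, 0.0, 0.0, 0.0)    # the opinion I


def opinion_from_counts(g: int, s: int, u: int) -> Opinion:
    """Eq. 1: opinion on a website pointing to g good, s spam, u undecided sites."""
    m = g + s + u + 3
    return Opinion(g / m, s / m, u / m, 3 / m)


def discount(w_is: Opinion, w_sj: Opinion) -> Opinion:
    """Eq. 2: propagate s's opinion on j to i."""
    if w_is.e == 1.0:
        # Assumption: the e = 1 branch of Eq. 2 yields the uncertain opinion O.
        return UNCERTAIN
    b = w_is.b * w_sj.b
    d = w_is.b * w_sj.d
    e = w_sj.e
    return Opinion(b, d, 1.0 - b - d - e, e)


def combine(o1: Opinion, o2: Opinion) -> Opinion:
    """Eq. 3: merge two opinions obtained over parallel paths."""
    denom = o1.e + o2.e - o1.e * o2.e
    if denom == 0.0:
        # Assumption: Eq. 3 is undefined when both prior uncertainties are 0.
        raise ValueError("cannot combine two opinions with zero prior uncertainty")
    return Opinion(
        (o2.e * o1.b + o1.e * o2.b) / denom,
        (o2.e * o1.d + o1.e * o2.d) / denom,
        (o2.e * o1.n + o1.e * o2.n) / denom,
        (o1.e * o2.e) / denom,
    )


def expected_trust(o: Opinion, x: float, y: float) -> float:
    """Eq. 4: collapse an opinion into one score; x and y weight n and e."""
    return o.b + x * o.n + y * o.e


if __name__ == "__main__":
    # Hypothetical labels for the four-node example: node 4 is a labeled
    # normal seed, node 3 is unlabeled.
    w_node2 = opinion_from_counts(1, 0, 1)  # node 2 points to nodes 3 and 4
    w_node3 = opinion_from_counts(1, 0, 0)  # node 3 points to node 4
    # In the paper's model, omega_13 and omega_23 are both the opinion on node 3.
    omega13 = combine(w_node3, discount(w_node2, w_node3))
    print(omega13, expected_trust(omega13, x=0.5, y=0.5))
```

The usage part mirrors the second-level update of node 1's opinion on node 3 in the four-node example, $\Theta(\omega_{13}, \Delta(\omega_{12}, \omega_{23}))$. Note that discounting through the certain opinion $I$ returns the second opinion unchanged, which matches the identity $\Delta(I, \omega_{13}) = \omega_{13}$ used in that example.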