Improving Website Hyperlink Structure Using Server Logs

Ashwin Paranjape* (Stanford University, [email protected]) · Robert West* (Stanford University, [email protected]) · Leila Zia (Wikimedia Foundation, [email protected]) · Jure Leskovec (Stanford University, [email protected])

*AP and RW contributed equally, RW as a Wikimedia Research Fellow.

ABSTRACT

Good websites should be easy to navigate via hyperlinks, yet maintaining a high-quality link structure is difficult. Identifying pairs of pages that should be linked may be hard for human editors, especially if the site is large and changes frequently. Further, given a set of useful link candidates, the task of incorporating them into the site can be expensive, since it typically involves humans editing pages. In the light of these challenges, it is desirable to develop data-driven methods for automating the link placement task. Here we develop an approach for automatically finding useful hyperlinks to add to a website. We show that passively collected server logs, beyond telling us which existing links are useful, also contain implicit signals indicating which nonexistent links would be useful if they were to be introduced. We leverage these signals to model the future usefulness of yet nonexistent links. Based on our model, we define the problem of link placement under budget constraints and propose an efficient algorithm for solving it. We demonstrate the effectiveness of our approach by evaluating it on Wikipedia, a large website for which we have access to both server logs (used for finding useful new links) and the complete revision history (containing a ground truth of new links). As our method is based exclusively on standard server logs, it may also be applied to any other website, as we show with the example of the biomedical research site Simtk.

1. INTRODUCTION

Websites are networks of interlinked pages, and a good website makes it easy and intuitive for users to navigate its content. One way of making navigation intuitive and content discoverable is to provide carefully placed hyperlinks. Consider for instance Wikipedia, the free encyclopedia that anyone can edit. The links that connect articles are essential for presenting concepts in their appropriate context, for letting users find the information they are looking for, and for providing an element of serendipity by giving users the chance to explore new topics they happen to come across without intentionally searching for them.

Unfortunately, maintaining website navigability can be difficult and cumbersome, especially if the site is large and changes frequently [7, 21]. For example, about 7,000 new pages are created on Wikipedia every day [35]. As a result, there is an urgent need to keep the site's link structure up to date, which in turn requires much manual effort and does not scale as the website grows, even with thousands of volunteer editors as in the case of Wikipedia. Nor is the problem limited to Wikipedia; on the contrary, it may be even more pressing on other sites, which are often maintained by a single webmaster. Thus, it is important to provide automatic methods to help maintain and improve website hyperlink structure.

Considering the importance of links, it is not surprising that the problems of finding missing links and predicting which links will form in the future have been studied previously [17, 19, 22, 32]. Prior methods typically cast improving website hyperlink structure simply as an instance of a link prediction problem in a given network; i.e., they extrapolate information from existing links in order to predict the missing ones. These approaches are often based on the intuition that two pages should be connected if they have many neighbors in common, a notion that may be quantified by the plain number of common neighbors, the Jaccard coefficient of sets of neighbors, and other variations [1, 17].

However, the mere existence of a link is not all that matters. Networks in general, and hyperlink graphs in particular, often have traffic flowing over their links, and a link is of little use if it is never traversed. For example, of all the 800,000 links added to the English Wikipedia in February 2015, the majority (66%) were not clicked even a single time in March 2015, and among the rest, most links were clicked only very rarely. For instance, only 1% of links added in February were used more than 100 times in all of March (Fig. 1). Consequently, even if we could perfectly mimic Wikipedia editors (who would a priori seem to provide a reasonable gold standard for links to add), the suggested links would likely have little impact on website navigation.

Figure 1: Complementary cumulative distribution function of the number of clicks received in March 2015 by links introduced in the English Wikipedia in February 2015 (no clicks: 66%; at least one click: 34%).

Furthermore, missing-link prediction systems can suggest a large number of links to be inserted into a website. But before links can be incorporated, they typically need to be examined by a human, for several reasons. First, not all sites provide an interface that lets content be changed programmatically. Second, even when such an interface is available, determining where in the source page to place a suggested link may be hard, since it requires finding, or introducing, a phrase that may serve as anchor text [19]. And third, even when identifying anchor text automatically is possible, human verification and approval might still be required in order to ensure that the recommended link is indeed correct and appropriate. Since the addition of links by humans is a slow and expensive process, presenting the complete ranking of all suggested links to editors is unlikely to be useful. Instead, a practical system should choose a well-composed subset of candidate links of manageable size.

Thus, what is needed is an automatic approach to improving website hyperlink structure that accounts for website navigation patterns and suggests a limited number of highly relevant links based on the way they will impact the website's navigability.

Present work. Here we develop an automatic approach for suggesting missing but potentially highly used hyperlinks for a given website. Our approach is language- and website-independent and is based on access logs as collected by practically all web servers. Such logs record users' natural browsing behavior, and we show that they contain strong signals not only about which existing links are presently used, but also about how heavily currently nonexistent links would be used if they were to be added to the site. We build models of human browsing behavior, anchored in empirical evidence from log data, that allow us to predict the usage of any potentially introduced hyperlink (s,t) in terms of its clickthrough rate, i.e., the probability of being chosen by users visiting page s. The models are based on the following intuition: consider a page s not linked to another page t. If a large fraction of users visiting s later on reach t via an indirect path, this suggests that they might have taken a direct shortcut from s to t had the shortcut existed. Even though one could hope to detect such events by parsing the logs, our approach is more general still: one of our models is able to estimate the fraction of people who would take a shortcut (s,t) based only on the pairwise page transition matrix. That is, we are able to estimate navigation probabilities over longer paths by combining information about individual link transitions.

The second important aspect of our solution is that we suggest only a small number of the top K most useful links, where K is a 'link budget' constraint. Consider a naive approach that first ranks all candidate links by their probability of being used and then simply returns the top K. An extreme outcome would be for all K links to be placed in a single page s. This is problematic because it seems a priori more desirable to connect a large number of pages reasonably well with other pages than to connect a single page extremely well. This intuition is confirmed by the empirical observation that the expected number of clicks users take from a given page s is limited and not very sensitive to the addition of new links out of s. Thus we conclude that inserting yet another out-link into s after some good ones have already been added to it achieves less benefit than adding an out-link to some other page s′ first. This diminishing-returns property, besides accurately mirroring human behavior, leads to an optimization procedure that does not over-optimize on single source pages but instead spreads good links across many pages.

We demonstrate and evaluate our approach on two very different websites: the full English Wikipedia and Simtk, a small community of biomedical researchers. For both websites we utilize complete web server logs collected over a period of several months and demonstrate that our models of human browsing behavior are accurate and lead to the discovery of highly relevant and useful hyperlinks. Having full access to the logs of Wikipedia, one of the highest-traffic websites in the world, is particularly beneficial, as it allows us to perform a unique analysis of human web-browsing behavior. Our results on Simtk show that our method is general and applies well beyond Wikipedia.

The following are the main contributions of this paper:

• We introduce the link placement problem under budget constraints, and design objective functions anchored in empirically observed human behavior and exposing an intuitive diminishing-returns property. We also provide an efficient algorithm for optimizing these objectives (Sec. 2).
• We identify signals in the logs that can indicate the usage of a future link before it is introduced, and we operationalize these signals in simple but effective methods for finding promising new links and estimating their future clickthrough rates (Sec. 3).
• We characterize the usage and impact of newly added links by analyzing human navigation traces mined from Wikipedia's server logs (Sec. 5).

2. THE LINK PLACEMENT PROBLEM

First we introduce the link placement problem under budget constraints. In this context, the budget refers to a maximum allowable number of links that may be suggested by an algorithm. As mentioned in the introduction, realistic systems frequently require a human in the loop for verifying that the links proposed by the algorithm are indeed of value and for inserting them. As we need to avoid overwhelming the human editors with more suggestions than they can respond to, we need to select our suggestions carefully.

Formally, the task is to select a set A of suggested links of a given maximum size K, where the quality of the chosen set A is determined by an objective function f:

maximize f(A)    (1)
subject to |A| ≤ K.    (2)

In principle, any set function f: 2^V → R may be plugged in (where V is the set of pages on the site, and 2^V is the set of all subsets of V), but the practical usefulness of our system relies vitally on a sensible definition of f. In particular, f should be based on a reasonable model of human browsing behavior and should score link sets for which the model predicts many clicks above link sets for which it predicts fewer clicks.

In the following we introduce two simplified models of human browsing behavior and propose three objective functions f based on these models. Then, we show that two of these objectives have the useful property of diminishing returns and discuss the implications thereof. Finally, we show that all three objectives can be maximized exactly by a simple and efficient greedy algorithm.

2.1 Browsing models and objective functions

We first define notation, then formulate two web-browsing models, and finally design three objective functions based on the models.

Notation. We model hyperlink networks as directed graphs G = (V,E), with pages as nodes and hyperlinks as edges. We refer to the endpoints of a link (s,t) as source and target, respectively. Unlinked pairs of nodes are considered potential link candidates. We denote the overall number of views of a page s by N_s. When users are in s, they have a choice to either stop or follow out-links of s. To simplify notation, we model stopping as a click to a special sink page ∅ ∉ V that is taken to be linked from every other page. Further, N_st counts the transitions from s to t via direct clicks of the link (s,t).
Finally, a central quantity in this research is the clickthrough rate p_st of a link (s,t). It measures the fraction of times users click through to page t, given that they are currently on page s:

p_st = N_st / N_s.    (3)

The stopping probability is given by p_s∅. Note that, since in reality users may click several links from the same source page, p_st is generally not a distribution over t given s.
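To make the notation concrete, the following minimal Python sketch computes clickthrough rates and stopping probabilities from raw view and transition counts as in Eq. 3. The dictionary-based layout and the toy numbers are illustrative assumptions, not part of the paper's pipeline.

```python
from collections import defaultdict

def clickthrough_rates(page_views, transition_counts):
    """Estimate p_st = N_st / N_s for every observed link (s, t).

    page_views:        dict mapping page s -> N_s (total views of s)
    transition_counts: dict mapping (s, t) -> N_st (direct clicks of link (s, t));
                       the special target '∅' records stops.
    """
    p = defaultdict(float)
    for (s, t), n_st in transition_counts.items():
        if page_views.get(s, 0) > 0:
            p[(s, t)] = n_st / page_views[s]
    return p

# Toy example: page 'A' was viewed 100 times; 30 clicks went to 'B',
# 10 to 'C', and 60 visits stopped at 'A'.
views = {'A': 100}
counts = {('A', 'B'): 30, ('A', 'C'): 10, ('A', '∅'): 60}
p = clickthrough_rates(views, counts)
# p[('A', 'B')] == 0.3 and p[('A', '∅')] == 0.6; in general the p_st of one
# source need not sum to 1, since a user may follow several links (multi-tab).
```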

Next we define two models of human web-browsing behavior.

Multi-tab browsing model. This model captures a scenario where the user may follow several links from the same page s, e.g., by opening them in multiple browser tabs. We make the simplifying assumption that, given s, the user decides for each link (s,t) independently (with probability p_st) whether to follow it.

Single-tab browsing model. This is a Markov chain model in which the user is assumed to follow exactly one link from each visited page s. This corresponds to a user who never opens more than a single tab, as might be common on mobile devices such as phones. The probability of choosing the link (s,t) is proportional to p_st.

Based on these browsing models, we now define objective functions that model the value provided by the newly inserted links A. Note that the objectives reason about the clickthrough rates p_st of links A that do not exist yet, so in practice p_st is not directly available but must be estimated from data. We address this task in Sec. 3.

Link-centric multi-tab objective. To capture the value of the newly inserted links A, this first objective assumes the multi-tab browsing model. It computes the expected number of new-link clicks from page s as Σ_{t:(s,t)∈A} p_st and aggregates this quantity across all pages s:

f_1(A) = Σ_s w_s Σ_{t:(s,t)∈A} p_st.    (4)

If we choose w_s = N_s, then f_1(A) measures the expected number of clicks received by all new links A jointly when s is weighted according to its empirical page-view count. In practice, however, page-view counts tend to follow heavy-tailed distributions such as power laws [2], so a few top pages would attract a disproportionate amount of weight. We mitigate this problem by using w_s = log N_s, i.e., by weighting s by the order of magnitude of its count.

Page-centric multi-tab objective. Our second objective also assumes the multi-tab browsing model. However, instead of measuring the total expected number of clicks on all new links A (as done by f_1), this objective measures the expected number of source pages on which at least one new link is clicked (hence we term this objective page-centric, whereas f_1 is link-centric):

f_2(A) = Σ_s w_s [ 1 − Π_{t:(s,t)∈A} (1 − p_st) ].    (5)

Here, the product specifies the probability of no new link being clicked from s, and one minus that quantity yields the probability of clicking at least one new link from s. As before, we use w_s = log N_s.

Single-tab objective. Unlike f_1 and f_2, our third objective is based on the single-tab browsing model. It captures the number of page views upon which the one chosen link is one of the new links A:

f_3(A) = Σ_s w_s · ( Σ_{t:(s,t)∈A} p_st ) / ( Σ_{t:(s,t)∈A} p_st + Σ_{t:(s,t)∈E} p_st ).    (6)

The purpose of the denominator is to renormalize the independent probabilities of all links—old and new—to sum to 1, so that p_st becomes a distribution over t given s. Again, we use w_s = log N_s.

2.2 Diminishing returns

Next, we note that all of the above objective functions are monotone, and that f_2 and f_3 have a useful diminishing-returns property.

Monotonicity means that adding one more new link to the solution A will never decrease the objective value. It holds for all three objectives (we omit the proof for brevity):

f(A ∪ {(s,t)}) − f(A) ≥ 0.    (7)

The difference f(A ∪ {(s,t)}) − f(A) is called the marginal gain of (s,t) with respect to A.

The next observation is that, under the link-centric multi-tab objective f_1, marginal gains never change: regardless of which links are already present in the network, we will always have

f_1(A ∪ {(s,t)}) − f_1(A) = w_s p_st.    (8)

This means that f_1 assumes that the solution quality can be increased indefinitely by adding more and more links to the same source page s. However, we will see later (Sec. 5.2) that this may not be the case in practice: adding a large number of links to the same source page s does not automatically have a large effect on the total number of clicks made from s.

This discrepancy with reality is mitigated by the more complex objectives f_2 and f_3. In the page-centric multi-tab objective f_2, further increasing the probability of at least one new link being taken becomes ever more difficult as more links are added. This way, f_2 discourages adding too many links to the same source page. The same is true for the single-tab objective f_3: since the click probabilities of all links from s—old and new—are required to sum to 1 here, it becomes ever harder for a link to obtain a fixed portion of the total probability mass of 1 as the page becomes increasingly crowded with strong competitors.

More precisely, all three objective functions f have the submodularity property (proof omitted for brevity):

f(B ∪ {(s,t)}) − f(B) ≤ f(A ∪ {(s,t)}) − f(A),  for A ⊆ B,    (9)

but while it holds with equality for f_1 (Eq. 8), this is not true for f_2 and f_3. When marginal gains decrease as the solution grows, we speak of diminishing returns.

To see the practical benefits of diminishing returns, consider two source pages s and u, with s being vastly more popular than u, i.e., w_s ≫ w_u. Without diminishing returns (under objective f_1), the marginal gains of candidates from s will nearly always dominate those of candidates from u (i.e., w_s p_st ≫ w_u p_uv for most (s,t) and (u,v)), so u would not even be considered before s has been fully saturated with links. With diminishing returns, on the contrary, s would gradually become a less preferred source for new links. In other words, there is a trade-off between the popularity of s on the one hand (if s is frequently visited, then links that are added to it get more exposure and are therefore more frequently used) and its 'crowdedness' with good links on the other.

As a side note, we observe that, when measuring 'crowdedness', f_2 only considers the newly added links A, whereas f_3 also takes the pre-existing links E into account. In our evaluation (Sec. 6.1.3), we will see that the latter amplifies the effect of diminishing returns.
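The three objectives and their marginal-gain behavior can be illustrated with a short Python sketch. This is not the authors' implementation; the data structures (per-source candidate lists, a precomputed sum of old-link clickthrough rates) and the toy values are assumptions made for the example.

```python
import math
from collections import defaultdict

def objectives(A, p_new, p_old_sum, weights):
    """Evaluate the three objectives of Eqs. 4-6 for a candidate link set A.

    A:         iterable of candidate links (s, t)
    p_new:     dict (s, t) -> estimated clickthrough rate p_st of the candidate
    p_old_sum: dict s -> sum of p_st over s's pre-existing out-links (for f3)
    weights:   dict s -> w_s, e.g. math.log(N_s)
    """
    per_source = defaultdict(list)
    for (s, t) in A:
        per_source[s].append(p_new[(s, t)])

    f1 = f2 = f3 = 0.0
    for s, ps in per_source.items():
        w = weights[s]
        new_sum = sum(ps)
        none_clicked = 1.0
        for p in ps:
            none_clicked *= (1.0 - p)
        f1 += w * new_sum                                        # Eq. 4
        f2 += w * (1.0 - none_clicked)                           # Eq. 5
        f3 += w * new_sum / (new_sum + p_old_sum.get(s, 0.0))    # Eq. 6
    return f1, f2, f3

w = {'A': math.log(1000)}
p_new = {('A', 'B'): 0.05, ('A', 'C'): 0.04}
p_old = {'A': 0.50}
print(objectives([('A', 'B')], p_new, p_old, w))
print(objectives([('A', 'B'), ('A', 'C')], p_new, p_old, w))
# The gain in f2 and f3 from adding the second candidate is smaller than the
# gain from the first one (diminishing returns on a single source page),
# whereas f1's marginal gain is always w_s * p_st (Eq. 8).
```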
Many websites provide a search box on every page, and 13: Insert (s,t) into Q with the marginal gain C′ −C as value thus one way of reaching t from s without taking a direct click is to 14: C ← C′ ‘teleport’ into t by using the search box. Therefore, if many users 15: end for search for t from s we may interpret this as signaling the need to go 16: end for from s to t, a need that would also be met by a direct link. Based 17: A ← top K elements of Q on this intuition, we estimate pst as the search proportion of (s,t), defined as the number of times t was reached from s via search, divided by the total number Ns of visits to s. objectives and then describe an algorithm that builds on these ob- Method 2: Path proportion. Besides search, another way of servations to find a globally optimal solution: reaching t from s without a direct link is by navigating on an in- direct path. Thus, we may estimate pst as the path proportion of 1. For a single source page s, the optimal solution is given by ∗ ∗ (s,t), defined as Nst /Ns, where Nst is the observed number of navi- the top K pairs (s,t) with respect to pst . gation paths from s to t. In other words, path proportion measures 2. Sources are ‘independent’: the contribution of links from s to the average number of paths to t, conditioned on first seeing s. the objective does not change when adding links to sources Method 3: Path-and-search proportion. Last, we may also com- other than s. This follows from Eq. 4–6, by noting that only bine the above two metrics. We may interpret search queries as a links from s appear in the expressions inside the outer sums. special type of page view, and the paths from s to t via search may These observations imply that the following simple, greedy algo- then be considered indirect paths, too. Summing search proportion rithm always produces an optimal solution1 (pseudocode is listed in and path proportion, then, gives rise to yet another predictor of pst . Algorithm 1). The algorithm processes the data source by source. We refer to this joint measure as the path-and-search proportion. For a given source s, we first obtain the optimal solution for s alone Method 4: Random-walk model. The measures introduced above by sorting all (s,t) with respect to pst (line 7; cf. observation 1). require access to rather granular data: in order to mine keyword Next, we iterate over the sorted list and compute the marginal gains searches and indirect navigation paths, one needs access to com- of all candidates (s,t) from s (lines 7–15). As marginal gains are plete server logs. Processing large log datasets may, however, be computed, they are stored in a global priority queue (line 13); once computationally difficult. Further, complete logs often contain per- computed, they need not be updated any more (cf. observation 2). sonally identifiable information, and privacy concerns may thus Finally, we return the top K from the priority queue (line 17). make it difficult for researchers and analysts to obtain unrestricted log access. For these reasons, it is desirable to develop clickthrough 3. ESTIMATING CLICKTHROUGH RATES rate estimation methods that manage to make reasonable predic- tions based on more restricted data. For instance, although the In our above exposition of the link placement problem, we assumed Wikipedia log data used in this paper is not publicly available, a we are given a clickthrough rate p for each link candidate (s,t). 
3. ESTIMATING CLICKTHROUGH RATES

In our above exposition of the link placement problem, we assumed we are given a clickthrough rate p_st for each link candidate (s,t). In practice, however, these values are undefined before the link is introduced. Therefore, we need to estimate what the clickthrough rate of each nonexistent link would be in the hypothetical case that the link were to be inserted into the site.

Here we propose four ways in which historical web server logs can be used for estimating p_st for a nonexistent link (s,t). Later (Sec. 6.1.1) we evaluate the resulting predictors empirically.

Method 1: Search proportion. What we require are indicators of users' need to transition from s to t before the direct link (s,t) exists. Many websites provide a search box on every page, and thus one way of reaching t from s without taking a direct click is to 'teleport' into t by using the search box. Therefore, if many users search for t from s, we may interpret this as signaling the need to go from s to t, a need that would also be met by a direct link. Based on this intuition, we estimate p_st as the search proportion of (s,t), defined as the number of times t was reached from s via search, divided by the total number N_s of visits to s.

Method 2: Path proportion. Besides search, another way of reaching t from s without a direct link is by navigating along an indirect path. Thus, we may estimate p_st as the path proportion of (s,t), defined as N*_st / N_s, where N*_st is the observed number of navigation paths from s to t. In other words, the path proportion measures the average number of paths to t, conditioned on first seeing s.

Method 3: Path-and-search proportion. Last, we may also combine the above two metrics. We may interpret search queries as a special type of page view, and the paths from s to t via search may then be considered indirect paths, too. Summing search proportion and path proportion then gives rise to yet another predictor of p_st. We refer to this joint measure as the path-and-search proportion.

Method 4: Random-walk model. The measures introduced above require access to rather granular data: in order to mine keyword searches and indirect navigation paths, one needs access to complete server logs. Processing large log datasets may, however, be computationally difficult. Further, complete logs often contain personally identifiable information, and privacy concerns may thus make it difficult for researchers and analysts to obtain unrestricted log access. For these reasons, it is desirable to develop clickthrough-rate estimation methods that manage to make reasonable predictions based on more restricted data. For instance, although the Wikipedia log data used in this paper is not publicly available, a powerful dataset derived from it, the matrix of pairwise transition counts between all English Wikipedia articles, was recently published by the Wikimedia Foundation [34]. We now describe an algorithm for estimating path proportions based solely on the pairwise transition counts for those links that already exist.

As defined above, the path proportion measures the expected number of paths to t on a navigation trace starting from s; we denote this quantity as Q_st here. For a random walker navigating the network according to the empirically measured transition probabilities, the following recursive equation holds:

Q_st = 1,  if s = t;
Q_st = Σ_u p_su Q_ut,  otherwise.    (10)

The base case states that the expected number of paths to t from t itself is 1 (i.e., the random walker is assumed to terminate a path as soon as t is reached). The recursive case states that, if the random walker has not reached t yet, he might still do so later on, when continuing the walk according to the empirical clickthrough rates.

Solving the random-walk equation. The random-walk equation (Eq. 10) can be solved by power iteration; pseudocode is listed as Algorithm 2. Let Q_t be the vector containing as its elements the estimates Q_st for all pages s and the fixed target t. Initially, Q_t has entries of zero for all pages, with the exception of Q_tt = 1, as per the base case of Eq. 10 (lines 1–2). We then keep multiplying Q_t with the matrix of pairwise transition probabilities, as per the recursive case of Eq. 10 (line 6), resetting Q_tt to 1 after every step (line 7). Since this reset step depends on the target t, the algorithm needs to be run separately for each t we are interested in.

Algorithm 2: Power iteration for estimating path proportions from pairwise transition probabilities
Input: target node t; pairwise transition matrix P with P_st = p_st
Output: vector Q_t of estimated path proportions for all source pages when the target page is t
 1: Q_t ← (0, …, 0)^T
 2: Q_tt ← 1
 3: Q′_t ← (0, …, 0)^T  // used for storing the previous value of Q_t
 4: while ‖Q_t − Q′_t‖ > ε do
 5:   Q′_t ← Q_t
 6:   Q_t ← P Q_t  // recursive case of Eq. 10
 7:   Q_tt ← 1  // base case of Eq. 10
 8: end while

The algorithm is guaranteed to converge to a fixed point under the condition that Σ_s p_st < 1 for all t (proof omitted for conciseness), which is empirically the case in our data.
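A compact NumPy version of the power iteration of Algorithm 2 might look as follows. The toy transition matrix is an assumption made for illustration; its rows deliberately sum to less than 1, so the convergence condition stated above holds.

```python
import numpy as np

def path_proportions(P, t, eps=1e-8, max_iter=1000):
    """Power-iteration sketch of Algorithm 2 / Eq. 10.

    P: (n x n) matrix of pairwise transition probabilities, P[s, u] = p_su
    t: index of the fixed target page
    Returns a vector q with q[s] ~ expected number of paths from s to t.
    """
    n = P.shape[0]
    q = np.zeros(n)
    q[t] = 1.0                      # base case of Eq. 10
    for _ in range(max_iter):
        q_prev = q.copy()
        q = P @ q                   # recursive case of Eq. 10
        q[t] = 1.0                  # re-impose the base case after each step
        if np.linalg.norm(q - q_prev) <= eps:
            break
    return q

# Toy transition matrix for pages [A, B, C]; rows need not sum to 1,
# because stopping is not represented explicitly here.
P = np.array([[0.0, 0.3, 0.1],
              [0.0, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
print(path_proportions(P, t=2))   # estimated path proportions towards page C
```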
4. DATASETS AND LOG PROCESSING

In this section we describe the structure and processing of web server logs. We also discuss the properties of the two websites we work with, Wikipedia and Simtk.

4.1 From logs to trees

Log format. Web server log files contain one entry per HTTP request, specifying inter alia: timestamp, requested URL, referer URL, HTTP response status, user agent information, client IP address, and proxy server information via the HTTP X-Forwarded-For header. Since users are not uniquely identifiable by IP addresses (e.g., several clients might reside behind the same proxy server, whose IP address would then be logged), we create an approximate user ID by computing an MD5 digest of the concatenation of IP address, proxy information, and user agent. Common bots and crawlers are discarded on the basis of the user-agent string.

From logs to trees. Our goal is to analyze the traces users take on the hyperlink network of a given website. However, the logs do not represent these traces explicitly, so we first need to reconstruct them from the raw sequence of page requests.

We start by grouping requests by user ID and sorting them by time. If a page t was requested by clicking a link on another page s, the URL of s is recorded in the referer field of the request for t. The idea, then, is to reassemble the original traces from the sequence of page requests by joining requests on the URL and referer fields while preserving the temporal order. Since several clicks can be made from the same page, e.g., by opening multiple browser tabs, the navigation traces thus extracted are generally trees.

While straightforward in principle, this method is unfortunately unable to reconstruct the original trees perfectly. Ambiguities arise when the page listed in the referer field of t was visited several times previously by the same user. In this situation, linking t to all its potential predecessors results in a directed acyclic graph (DAG) rather than a tree. Transforming such a DAG into a tree requires a heuristic approach. We proceed by attaching a request for t with referer s to the most recent request for s by the same user. If the referer field is empty or contains a URL from an external website, the request for t becomes the root of a new tree.

Not only is this greedy strategy intuitive, since it seems more likely that the user continued from the most recent event rather than resumed an older one; it also comes with a global optimality guarantee. More precisely, we observe that in a DAG G with edge weights w, attaching each internal node v to its minimum-weight parent argmin_u w_uv yields a minimum spanning tree of G. We follow this approach with time differences as weights. Hence, the trees produced by our heuristic are the best possible trees under the objective of minimizing the sum (or mean) of all time differences spanned by the referer edges.

Mining search events. When counting keyword searches, we consider both site-internal search (e.g., from the Wikipedia search box) and site-external search from general engines such as Google, Yahoo, and Bing. Mining internal search is straightforward, since all search actions are usually fully represented in the logs. An external search from s for t is defined to have occurred if t has a search engine as referer, if s was the temporally closest previous page view by the user, and if s occurred at most 5 minutes before t.
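The greedy referer-matching heuristic of Sec. 4.1 can be sketched as follows. The field names (user, time, url, referer) are assumed for illustration; real server logs, user-ID hashing, and bot filtering are more involved than this minimal example.

```python
from collections import defaultdict
from hashlib import md5

def approximate_user_id(ip, proxy_info, user_agent):
    """Approximate user ID as described in Sec. 4.1 (illustrative)."""
    return md5((ip + proxy_info + user_agent).encode("utf-8")).hexdigest()

def build_trees(requests):
    """Group requests into navigation trees via greedy referer matching.

    requests: list of dicts with keys 'user', 'time', 'url', 'referer'
              (assumed field names). Returns, per user, a dict mapping the
              index of each request (in that user's time-sorted list) to the
              index of its parent request; roots get parent None.
    """
    by_user = defaultdict(list)
    for r in requests:
        by_user[r["user"]].append(r)

    forests = []
    for user, reqs in by_user.items():
        reqs.sort(key=lambda r: r["time"])
        last_seen = {}   # url -> index of the most recent request for that url
        parent = {}
        for i, r in enumerate(reqs):
            ref = r.get("referer")
            if ref in last_seen:
                parent[i] = last_seen[ref]   # attach to the most recent match
            else:
                parent[i] = None             # empty/external referer: new root
            last_seen[r["url"]] = i
        forests.append(parent)
    return forests

logs = [
    {"user": "u1", "time": 1, "url": "/A", "referer": None},
    {"user": "u1", "time": 2, "url": "/B", "referer": "/A"},
    {"user": "u1", "time": 3, "url": "/C", "referer": "/A"},
]
print(build_trees(logs))   # [{0: None, 1: 0, 2: 0}] -- one tree rooted at /A
```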
following formulat- problem the new when placement adding made link we of assumptions the effects the ing the of investigate some to assess is and section links this of goal The LINKS NEW OF EVALUATION: EFFECTS 5. tree. same the in path a on than rather session, ae( stp sbto) oe i onaison boundaries bin Lower bottom). is source 1 of top; markup is wiki (0 in page link out- of source-page position (a) relative (b) of and function degree as rate Clickthrough 3: Figure Sc )b aligu o often how consecu- up between proportions tallying path [13] by compute hour 3) we one consistent, (Sec. be than to more Further, events. no tive of extracting times of idle instead with Therefore, logs other the Wikipedia. extract we from the for trees, mined on than traces rich ours; navigation less as the are such that much means method is also a there it by that hand, means made this be hand, to dense one improvement less the significantly On also Wikipedia’s. is than structure hyperlink Simtk’s views. rcsaegnrlytes o his(e.41.Wiei chains, in While 4.1). (Sec. chains not trees, generally links are between Traces Competition bottom. the at appearing and 5.2 links center, of the that in as appearing high links as of times that 3.7 rate as about clickthrough high a as achieving links times page with 1.6 the [16], of about page top source the to the close in appearing position its with correlated also trend. negative strong a shows and out-degree sntncsayi re,weesvrlnx ae a eopened be may pages next page several same where the from trees, in necessary not is i.3a,wihpostecikhog rate clickthrough by the confirmed plots is which finding 3(a), This Fig. pages. higher source with lower-degree clickthrough, 1% between for over values achieve out-degree, links new source-page of 13% on and 3% depending that, observe We p p pst s st st , hl i.1posaslt ubr,Fg ()sostecom- the shows 2(b) Fig. numbers, absolute plots 1 Fig. While oase hs usin,w osdrteaon 0,0 links 800,000 around the consider we questions, these answer To ial,Fg ()cnrstefc httepplrt faln is link a of popularity the that fact the confirms 3(b) Fig. Finally, t =

) 0.001 0.004 0.007 ol omadsrbto over distribution a form would de oteEgihWkpdai eray21 n study and 2015 February in Wikipedia English the to added 1), 1 Source−page out−degree P 17 t 26 p p (a) st st 35 taie yteotdge ftesuc page source the of out-degree the by stratified , a qa h u-ereof out-degree the equal may sessions, 46 s s 61 uhta,i h oteteecs we all (when case extreme most the in that, such , ntemnh fJnayadMrh2015. March and January of months the in 83 114 endhr ssqecso aeviews page of sequences as here defined 175 310

pst s t curdbefore occurred

given 0.002 0.004

0 p s 0.109 , Relative linkposition st s i.e.

. 0.227 gis source-page against (b)

, 0.354 P 0.479 t t

p 0.596 ntesame the in st

x 0.7 = -axis. 0.798 ,this 1, 0.887 0.954 s . of ob h ubro pages of number the be to oelnst olw(ein10 o esta 0out-links 10 than less for 1.00 (median follow to links more o ae iha es 8 u-ik.Adtoal,gvnta users that in given 76% Additionally, stop and out-links. not out-links, 288 do 10 least than at less with stopping pages with less median for pages a is for with 89% stopping degree, of that structural probability larger shows of 2015, pages January on likely in articles 300,000 for 1 h rbblt fsopn at stopping of probability the (1) tpin stop ieyta oelnsaetkn r(i tutrldge a be ‘topical topic may or complex degree ‘interestingness’ more structural as a (ii) such more complexity’: or it factors makes latent taken; also with are page correlated a links to more links that more likely adding simply (i) that ble 4(b)). Fig. out-links; more or 288 for ere(i.4d)vr nysihl,ee o rsi hne in changes drastic for even slightly, navigational only grows and vary degree 4(c)) 4(d)) structural (Fig. (Fig. probability as degree stopping above: both from shrinks, ii or hypothesis to support ubro rniin sr aeotof out make users transitions of number ( ntesopn rbblt n h aiainldge of degree from navigational links the removing and or probability stopping adding the of on effect the captures now values aevle o ahqatt,oefrJnayadoefrMarch. for one and January page for fixed sepa- one a two quantity, For obtaining each snapshots, for these values in rate present links the exclu- on based sively probabilities, these stopping and as take structural well of We as compute degrees, months we navigational the Then, 2015. respectively. in 1, March, Wikipedia and of March January states from approximate other the the as Jan- snapshots and from 2015, one Wikipedia, 1, of uary snapshots two consider we particular, s neetpoete of properties inherent eae ihcagsi aiainldge o fixed cor- for are degree degree navigational in structural changes in with changes related whether observing and time inherent of because of but topic present the are of options properties link more because ply nohrwrs h aiainldge of degree navigational the words, other In nraetenme flnsagvnue hoe oflo from follow to chooses user given a links of number the increase s in(e.21,w ol e ocmeiinbtenlns the links; between competition no of see out-degree would the larger we 2.1), (Sec. tion beta digmn odlnst page a to links good many adding that its able with it click to choosing eval- probability and page, separately, entire option the link scanning every reader a uating imply would it since true, be from iliprac.Nx eivsiaeti usinmr carefully. more question cru- this of investigate is we links Next between limited importance. interaction a cial of only links. question propose the between to links, allowed interaction, of is number or method competition, placement link of our sign As first may a which as rates, clickthrough serve individual lower have to tend degree ubro rniin u of out transitions of number i.e. hswudla omr lcsfrom clicks more to lead would this ; a ena nitra oeo tree, a of node internal an as seen was et eeaieterlto ftesrcua ereof degree structural the of relation the examine we Next, i.4a,wihhsbe optdfo h rniincounts transition the from computed been has which 4(a), Fig. hs fet ol eepandb w yohss ti possi- is It hypotheses. two by explained be could effects These i.4c n ()so htti feti iucl,tu lending thus minuscule, is effect this that show 4(d) and 4(c) Fig. 
odcd hc yohssi re ene ocnrlfrthese for control to need we true, is hypothesis which decide To ntemlitbbosn oe ihisidpnec assump- independence its with model browsing multi-tab the In is edfiethe define we First out- higher of pages from links that saw already we 5.1 Sec. In s cosalreaddvresto pages of set diverse and large a across , ik)t te oista ih erlvn o understanding for relevant be might that topics other to links) , s o fixed for s eas en the define also We . p st nals xrm om oee,i swl conceiv- well is it however, form, extreme less a In . s hymk oecik naeaewhen average on clicks more make they , p s st h ifrnebtenteMrhadJanuary and March the between difference the , nispr om hsmdlsesulkl to unlikely seems model this form, pure its In . s aiainldge d degree navigational ed ob rcigtesame the tracking by so do We . s d h agrteepce ubro clicks of number expected the larger the , s . s s = ik to. links s iie ytettlnme ftimes of number total the by divided , tutrldegree, structural P N s t s − 6= n 2 h aiainldegree navigational the (2) and ∅ N s N s s ∅ ilhv oeconnections more have will st otoetpc,btntsim- not but topics, those to i.e. . s ihu tpigthere: stopping without , ie htte onot do they that given , s s . ssml h average the simply is s s of ih significantly might or s out-degree, ob h total the be to s swl.In well. as s s through vs. s . s offers with 1.38 (11) of s s s . foesbde ntesm orepg,sneti ilntin- high- not pages. spread will source should this different many one since across rather, links page, clickthrough indefinitely; source volume much same click too the crease spending on avoid budget should one’s budget-constrained one of for algorithms: hence important placement and has link modeling observation This user for attention. compete user implications links for Instead, other page. each a from with taken clicks of number overall 4(d)) (Fig. (removed). ranges added interquartile are links of indicated as shift as (downward) values, upward typical, the than by rather extreme, from stems 0.1%. effect only is decrease relative median the when deleted, and 0.3%, are only links still 100 is degree navigational in increase relative ikstructure; link outliers. without range full the show whiskers tiles; iso h betv functions objective the of ties nisonbfr hyaeitoue.Teeoe ete check then we Therefore, introduced. are they links before practically such own a identify fact, to its ability the on additional after eval- the only have the known must Since system is useful links suited links. added new are of of 3 set parts. rates Sec. uation three from clickthrough methods into the estimation predicting divided the for is that Wikipedia show for we First, results of analysis The Wikipedia 6.1 Simtk. and Wikipedia websites, different very evaluating two by on approach our it of universality the demonstrate we PLACEMENT Next EVALUATION: LINK 6. xsi c n d r eaiewt epc otevle from values the to on respect boundaries with bin relative Lower are (d) January. and (c) in axes n()sopn rbblt n d aiainldge.The degree. navigational (d) effect and small the probability a stopping fixing only (c) has on When degree structural however, pos- degree. 
page, navigational (b) source and with probability, correlated stopping itively (a) with is degree correlated structural negatively pages, source different Across 4: Figure Stopping−probability difference Stopping probability −0.2 −0.1 0.4 0.5 0.6 0.7 0.8 0.9 1.0 0.0 0.1 0.2 hs nig justify, findings These the increase not does links more adding simply nutshell, a In this but all, at effect no has links adding that say to not is This

Structural−degree differenceStructural−degree 1 −Inf 10

−100 degree Structural 17 (c) (a) e.g. −10 26 −1 38 hn10lnso oeaeadd h median the added, are more or links 100 when , 0 54 79 1 115 othoc, post 10 175 100 288 f 2

and Navigational−degree difference Navigational degree −0.2 −0.1 1.0 1.2 1.4 1.6 1.8 2.0 2.2 h iiihn-eun proper- diminishing-returns the 0.0 0.1 0.2 f 3 x Sc 2.2). (Sec. ae.Bxsso quar- show Boxes -axes.

Structural−degree differenceStructural−degree 1 −Inf 10

−100 degree Structural 17 (d) (b) −10 26 −1 38 0 54 79 1 115 10 175 100 288 y - hrfr osdrallnsaddi eraywt tlat1 in- set 10 this call least January; at in with searches February or in we paths counts, added direct search links and all path consider on mainly therefore based are methods our Since lsfrwihormtoscnptnilypeitahigh a predict exam- potentially positive can methods of our number which for non-negligible ples a require We examples. February. in Wikipedia to our links added whether to were correspond evaluating that data by January case the the using predictions actually high-ranking is this whether test we ml that imply hog aeo l u-ik of out-links all of rate through clseai htalink a hypothet- the that about scenario reason ical prediction clickthrough for models Our 5(b). and the 5(a) Fig. when to for refer between we section relation values, the next ground-truth on and the perspective predicted graphical see a (but For all page fails). for baseline value same same the the from predicts links it though even method) performing lyadd u,a ttdaoe hssti o nw hnthe when known not actu- is were set that this links practice. above, of in stated deployed subset are as the predictors But, for well added. fairly future ally link the a predict to of manage usage models the that showed we subsection arietasto arx sntlgigbhn ymc.Also, much. by behind that 3(a) lagging Fig. not from recall is the matrix, requires to only transition encouraging which pairwise is predictor, path- random-walk–based it and the Further, proportion that see path best. metrics perform all proportion across and-search that shows correlation. table rank The Spearman and correlation, Pearson error, absolute h enbsln ae h aepeito o l candidates all page for source prediction same same the in the originating makes baseline mean The 3): (Sec. methods different five compare we particular, In added. ma boueerro 0.72%, of well error quite performs absolute baseline mean (mean the why is This out-degree. page 2015, January of data log from 2015 March o elw a rdc h lctruhrates clickthrough the predict evaluate can were we we Here that well 2015. links how February 800,000 in Wikipedia identified English have the we to added 5, Sec. in described As [24]. button a of click simple the with suggestions them our decline shown or are accept users and Wikipedia: to links missing add to al- our of constraints. behavior budget the under investigate placement link we for Last, gorithm addition. of worthy is hte ag predicted large a whether intervals. confidence 95% methods bootstrapped estimation with rate Wikipedia, clickthrough on of Comparison 1: Table .. rdcigln addition link Predicting 6.1.2 .. siaigcikhog rates clickthrough Estimating 6.1.1 enbsl.007 ( 0.0072 baseln. Mean erhpo.007 ( 0.0070 prop. Search o hsts,w edagon-rt e fpstv n negative and positive of set ground-truth a need we task, this For nutvl,alrepeitdcikhog rate clickthrough predicted large a Intuitively, al vlae h eut sn he ifrn erc:mean metrics: different three using results the evaluates 1 Table eas eeoe rpia sritraeta ae teasy it makes that interface user graphical a developed also We ad ak .00( 0.0060 walks Rand. .ama baseline. mean a 5. and walks, random 4. proportion, path-and-search combined the 3. proportion, path 2. proportion, search 1. & prop. P&S ahprop. Path ( s , t ) enaslt r.Pasncr.Sera corr. Spearman corr. Pearson err. absolute Mean saueu ikadsol hsb de.Here added. 
be thus should and link useful a is 0.0057 0.0057 ( ( ± ± ± ± ± ( p s p st .03 .8( 0.58 0.0003) .04 .3( 0.53 0.0004) .04 .9( 0.49 0.0004) .04 .0( 0.20 0.0004) 0.0003) , st t st ag atdtrie ysource- by determined part large a to is ) au nedas en httelink the that means also indeed value saddt h ie nteprevious the In site. the to added is s htarayexist. already that vs. 0.61 .7 civdb h best- the by achieved 0.57% s aeyteaeaeclick- average the namely , ( ± ± ± ± ± .3 .9( 0.59 0.13) 0.12) 0.13) .2 .7( 0.17 0.22) .6 .4( 0.34 0.06) i.e. eoeteln was link the before , p L st ic nreality in Since . fteelnsin links these of p st 0.64 0.64 hudalso should ( ( p ± ± ± ± ± st 0.01) 0.01) 0.02) 0.02) 0.02) value.

Table 2: Top 10 link suggestions of Algorithm 1 using objectives f2 and f3. Clickthrough rates p_st estimated via path proportion (Sec. 3). Objective f1 (not shown) yields the same result as f2 but includes an additional link for source page Baby, It's OK!, demonstrating the effect of diminishing returns on f2. Asterisks mark links added by editors after prediction time, independently of our predictions.

(a) Page-centric multi-tab objective f2 (source → target):
* ITunes Originals – Red Hot Chili Peppers → Road Trippin' Through Time
* Internat. Handball Federation → 2015 World Men's Handball Champ.
  Baby, It's OK! → Olivia (singer)
* Nordend → Category:Mountains of the Alps
* Gran Hotel → Grand Hotel (TV series)
  Blame It on the Night → Blame (Calvin Harris song)
* Jacob Aaron Estes → The Details (film)
* The Choice (novel) → Benjamin Walker (actor)
  Payback (2014) → Royal Rumble (2015)
* Just Dave Van Ronk → No Dirty Names

(b) Single-tab objective f3 (source → target):
* ITunes Originals – Red Hot Chili Peppers → Road Trippin' Through Time
* Gran Hotel → Grand Hotel (TV series)
* Tithe: A Modern Faerie Tale → Ironside: A Modern Faery's Tale
* Nordend → Category:Mountains of the Alps
* Jacob Aaron Estes → The Details (film)
* Edmund Ironside → Edward the Elder
* Internat. Handball Federation → 2015 World Men's Handball Champ.
  Confed. of African Football → 2015 Africa Cup of Nations
  Live Rare Remix Box → Road Trippin' Through Time
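The comparison of predicted and observed clickthrough rates summarized in Table 1 can be sketched with standard tools as follows. This is an illustrative snippet with toy values, not the authors' evaluation code; the bootstrapped confidence intervals mentioned in the Table 1 caption would additionally require resampling the set of links.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def compare_predictions(predicted, observed):
    """Table-1-style metrics: mean absolute error, Pearson, Spearman."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    mae = np.mean(np.abs(predicted - observed))
    pearson, _ = pearsonr(predicted, observed)
    spearman, _ = spearmanr(predicted, observed)
    return mae, pearson, spearman

# Toy values standing in for estimated vs. realized p_st of added links.
pred = [0.010, 0.002, 0.030, 0.001, 0.015]
obs = [0.012, 0.001, 0.020, 0.000, 0.018]
print(compare_predictions(pred, obs))
```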

A further route forward would be to extend our method to other types of networks that people navigate. For instance, citation networks, knowledge graphs, and some online social networks could all be amenable to our approach. Finally, it would be worthwhile to explore how our methods could be extended to identify not only links inside a given website but also links between different sites.

In summary, our paper makes contributions to the rich line of work on improving the connectivity of the Web. We hope that future work will draw on our insights to increase the usability of websites as well as the Web as a whole.

Acknowledgments. This research has been supported in part by NSF IIS-1016909, IIS-1149837, IIS-1159679, CNS-1010921, NIH R01GM107340, Boeing, Facebook, Volkswagen, Yahoo, SDSI, and Wikimedia Foundation. Robert West acknowledges support by a Stanford Graduate Fellowship.

8. REFERENCES

[1] L. Adamic and E. Adar. Friends and neighbors on the Web. Social Networks, 25(3):211–230, 2003.
[2] E. Adar, J. Teevan, and S. T. Dumais. Large scale analysis of Web revisitation patterns. In CHI, 2008.
[3] M. Bilenko and R. White. Mining the search trails of surfing crowds: Identifying relevant websites from user activity. In WWW, 2008.
[4] E. H. Chi, P. Pirolli, K. Chen, and J. Pitkow. Using information scent to model user information needs and actions on the Web. In CHI, 2001.
[5] F. Chierichetti, R. Kumar, P. Raghavan, and T. Sarlos. Are Web users really Markovian? In WWW, 2012.
[6] B. D. Davison. Learning Web request patterns. In Web Dynamics. Springer, 2004.
[7] P. Devanbu, Y.-F. Chen, E. Gansner, H. Müller, and J. Martin. Chime: Customizable hyperlink insertion and maintenance engine for software engineering environments. In ICSE, 1999.
[8] D. Downey, S. Dumais, and E. Horvitz. Models of searching and browsing: Languages, studies, and application. In IJCAI, 2007.
[9] D. Downey, S. Dumais, D. Liebling, and E. Horvitz. Understanding the relationship between searchers' queries and information goals. In CIKM, 2008.
[10] J. Edmonds. Matroids and the greedy algorithm. Mathematical Programming, 1(1):127–136, 1971.
[11] S. Fissaha Adafre and M. de Rijke. Discovering missing links in Wikipedia. In LinkKDD, 2005.
[12] M. Grecu. Navigability in information networks. Master's thesis, ETH Zürich, 2014.
[13] A. Halfaker, O. Keyes, D. Kluver, J. Thebault-Spieker, T. Nguyen, K. Shores, A. Uduwage, and M. Warncke-Wang. User session identification based on strong regularities in inter-activity time. In WWW, 2015.
[14] D. Helic, M. Strohmaier, M. Granitzer, and R. Scherer. Models of human navigation in information networks based on decentralized search. In HT, 2013.
[15] J. Kleinberg. The small-world phenomenon: An algorithmic perspective. In STOC, 2000.
[16] D. Lamprecht, D. Helic, and M. Strohmaier. Quo vadis? On the effects of Wikipedia's policies on navigation. In Wiki-ICWSM, 2015.
[17] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. JASIST, 58(7):1019–1031, 2007.
[18] G. Marchionini. Exploratory search: From finding to understanding. Communications of the ACM, 49(4):41–46, 2006.
[19] D. Milne and I. H. Witten. Learning to link with Wikipedia. In CIKM, 2008.
[20] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions – I. Mathematical Programming, 14(1):265–294, 1978.
[21] C. Nentwich, L. Capra, W. Emmerich, and A. Finkelstein. Xlinkit: A consistency checking and smart link generation service. TOIT, 2(2):151–185, 2002.
[22] T. Noraset, C. Bhagavatula, and D. Downey. Adding high-precision links to Wikipedia. In EMNLP, 2014.
[23] C. Olston and E. H. Chi. ScentTrails: Integrating browsing and searching on the Web. TCHI, 10(3):177–197, 2003.
[24] A. Paranjape, R. West, and L. Zia. Project website, 2015. https://meta.wikimedia.org/wiki/Research:Improving_link_coverage (accessed Dec. 10, 2015).
[25] P. Pirolli. Information foraging theory: Adaptive interaction with information. Oxford University Press, 2007.
[26] R. R. Sarukkai. Link prediction and path analysis using Markov chains. Computer Networks, 33(1):377–386, 2000.
[27] P. Singer, D. Helic, A. Hotho, and M. Strohmaier. HypTrails: A Bayesian approach for comparing hypotheses about human trails on the Web. In WWW, 2015.
[28] J. Teevan, C. Alvarado, M. S. Ackerman, and D. R. Karger. The perfect search engine is not enough: A study of orienteering behavior in directed search. In CHI, 2004.
[29] C. Trattner, D. Helic, P. Singer, and M. Strohmaier. Exploring the differences and similarities between hierarchical decentralized search and human navigation in information networks. In i-KNOW, 2012.
[30] R. West and J. Leskovec. Human wayfinding in information networks. In WWW, 2012.
[31] R. West, A. Paranjape, and J. Leskovec. Mining missing hyperlinks from human navigation traces: A case study of Wikipedia. In WWW, 2015.
[32] R. West, D. Precup, and J. Pineau. Completing Wikipedia's hyperlink structure through dimensionality reduction. In CIKM, 2009.
[33] R. White and J. Huang. Assessing the scenic route: Measuring the value of search trails in Web logs. In SIGIR, 2010.
[34] E. Wulczyn and D. Taraborelli. Wikipedia Clickstream. Website, 2015. http://dx.doi.org/10.6084/m9.figshare.1305770 (accessed July 16, 2015).
[35] E. Zachte. Wikipedia statistics. Website, 2015. https://stats.wikimedia.org/EN/TablesWikipediaZZ.htm (accessed July 16, 2015).