Improving Website Hyperlink Structure Using Server Logs

Ashwin Paranjape∗ Robert West∗ Leila Zia Jure Leskovec Stanford University Stanford University Wikimedia Foundation Stanford University [email protected] [email protected] [email protected] [email protected]

ABSTRACT quently [7, 21]. For example, about 7,000 new pages are created Good websites should be easy to navigate via hyperlinks, yet main- on Wikipedia every day [35]. As a result, there is an urgent need taining a high-quality link structure is difficult. Identifying pairs to keep the site’s link structure up to date, which in return requires of pages that should be linked may be hard for human editors, es- much manual effort and does not scale as the website grows, even pecially if the site is large and changes frequently. Further, given with thousands of volunteer editors as in the case of Wikipedia. a set of useful link candidates, the task of incorporating them into Neither is the problem limited to Wikipedia; on the contrary, it may the site can be expensive, since it typically involves humans editing be even more pressing on other sites, which are often maintained pages. In the light of these challenges, it is desirable to develop by a single webmaster. Thus, it is important to provide automatic data-driven methods for automating the link placement task. Here methods to help maintain and improve website hyperlink structure. we develop an approach for automatically finding useful hyperlinks Considering the importance of links, it is not surprising that the to add to a website. We show that passively collected server logs, problems of finding missing links and predicting which links will beyond telling us which existing links are useful, also contain im- form in the future have been studied previously [17, 19, 22, 32]. plicit signals indicating which nonexistent links would be useful if Prior methods typically cast improving website hyperlink structure they were to be introduced. We leverage these signals to model the simply as an instance of a link prediction problem in a given net- future usefulness of yet nonexistent links. Based on our model, we work; i.e., they extrapolate information from existing links in order define the problem of link placement under budget constraints and to predict the missing ones. These approaches are often based on propose an efficient algorithm for solving it. We demonstrate the the intuition that two pages should be connected if they have many effectiveness of our approach by evaluating it on Wikipedia, a large neighbors in common, a notion that may be quantified by the plain website for which we have access to both server logs (used for find- number of common neighbors, the Jaccard coefficient of sets of ing useful new links) and the complete revision history (containing neighbors, and other variations [1, 17]. a ground truth of new links). As our method is based exclusively on However, simply considering the existence of a link on a website standard server logs, it may also be applied to any other website, as is not all that matters. Networks in general, and hyperlink graphs we show with the example of the biomedical research site Simtk. in particular, often have traffic flowing over their links, and a link is of little use if it is never traversed. For example, in the English Wikipedia, of all the 800,000 links added to the site in February 1. INTRODUCTION 2015, the majority (66%) were not clicked even a single time in Websites are networks of interlinked pages, and a good website March 2015, and among the rest, most links were clicked only very makes it easy and intuitive for users to navigate its content. One rarely. 
For instance, only 1% of links added in February were used way of making navigation intuitive and content discoverable is to more than 100 times in all of March (Fig. 1). Consequently, even provide carefully placed hyperlinks. Consider for instance Wiki- if we could perfectly mimic Wikipedia editors (who would a priori pedia, the free encyclopedia that anyone can edit. The links that seem to provide a reasonable gold standard for links to add), the connect articles are essential for presenting concepts in their appro- suggested links would likely have little impact on website naviga- priate context, for letting users find the information they are looking tion. arXiv:1512.07258v3 [cs.SI] 27 Feb 2018 for, and for providing an element of serendipity by giving users the Furthermore, missing-link prediction systems can suggest a large chance to explore new topics they happen to come across without number of links to be inserted into a website. But before links can intentionally searching for them. be incorporated, they typically need to be examined by a human, Unfortunately, maintaining website navigability can be difficult for several reasons. First, not all sites provide an interface that lets and cumbersome, especially if the site is large and changes fre- content be changed programmatically. Second, even when such *AP and RW contributed equally, RW as a Wikimedia Research Fellow. an interface is available, determining where in the source page to place a suggested link may be hard, since it requires finding, or introducing, a phrase that may serve as anchor text [19]. And third, Permission to make digital or hard copies of part or all of this work for personal or even when identifying anchor text automatically is possible, human classroom use is granted without fee provided that copies are not made or distributed verification and approval might still be required in order to ensure for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. that the recommended link is indeed correct and appropriate. Since For all other uses, contact the owner/author(s). the addition of links by humans is a slow and expensive process, WSDM’16, February 22–25, 2016, San Francisco, CA, USA. presenting the complete ranking of all suggested links to editors is © 2016 Copyright held by the owner/author(s). unlikely to be useful. Instead, a practical system should choose a ACM ISBN 978-1-4503-3716-8/16/02. . . $15.00 well composed subset of candidate links of manageable size. DOI: http://dx.doi.org/10.1145/2835776.2835832 No clicks: 66% and demonstrate that our models of human browsing behavior are 1e+00 At least one click: 34% accurate and lead to the discovery of highly relevant and useful hy- perlinks. Having full access to the logs of Wikipedia, one of the highest-traffic websites in the world, is particularly beneficial, as 1e−02 it allows us to perform a unique analysis of human web-browsing behavior. Our results on Simtk show that our method is general and 1e−04

Fraction of new links of new Fraction applies well beyond Wikipedia. The following are the main contributions of this paper: 1e−06 1 10 100 1000 10000 • We introduce the link placement problem under budget con- Mininum number of clicks straints, and design objective functions anchored in empir- ically observed human behavior and exposing an intuitive Figure 1: Complementary cumulative distribution function of diminishing-returns property. We also provide an efficient the number of clicks received in March 2015 by links intro- algorithm for optimizing these objectives (Sec. 2). duced in the English Wikipedia in February 2015. • We identify signals in the logs that can indicate the usage of a future link before it is introduced, and we operational- Thus, what is needed is an automatic approach to improving ize these signals in simple but effective methods for finding website hyperlink structure that accounts for website navigation promising new links and estimating their future clickthrough patterns and suggests a limited number of highly relevant links rates (Sec. 3). based on the way they will impact the website’s navigability. • We characterize the usage and impact of newly added links by analyzing human navigation traces mined from Wikipe- Present work. Here we develop an automatic approach for sug- dia’s server logs (Sec. 5). gesting missing but potentially highly used hyperlinks for a given website. Our approach is language- and website-independent and is based on using access logs as collected by practically all web 2. THE LINK PLACEMENT PROBLEM servers. Such logs record users’ natural browsing behavior, and we First we introduce the link placement problem under budget con- show that server logs contain strong signals not only about which straints. In this context, the budget refers to a maximum allow- existing links are presently used, but also about how heavily cur- able number of links that may be suggested by an algorithm. As rently nonexistent links would be used if they were to be added to mentioned in the introduction, realistic systems frequently require the site. We build models of human browsing behavior, anchored in a human in the loop for verifying that the links proposed by the al- empirical evidence from log data, that allow us to predict the usage gorithm are indeed of value and for inserting them. As we need to of any potentially introduced hyperlink (s,t) in terms of its click- avoid overwhelming the human editors with more suggestions than through rate, i.e., the probability of being chosen by users visiting they can respond to, we need to carefully select our suggestions. page s. The models are based on the following intuition: Consider Formally, the task is to select a set A of suggested links of a a page s not linked to another page t. A large fraction of users given maximum size K, where the quality of the chosen set A is visiting s and later on, via an indirect path, t, suggests that they determined by an objective function f : might have taken a direct shortcut from s to t had the shortcut ex- isted. Even though one could hope to detect such events by parsing maximize f (A) (1) the logs, our approach is even more general. One of our models is subject to |A| ≤ K. (2) able to estimate the fraction of people who would take a shortcut In principle, any set function f : 2V×V → may be plugged in (s,t) only based on the pairwise page transition matrix. 
That is, we R (where V is the set of pages on the site, and 2V×V is the set of are able to estimate navigation probabilities over longer paths by all sets of page pairs), but the practical usefulness of our system combining information about individual link transitions. relies vitally on a sensible definition of f . In particular, f should The second important aspect of our solution is that we only sug- be based on a reasonable model of human browsing behavior and gest a small number of the top K most useful links, where K is a should score link sets for which the model predicts many clicks ‘link budget’ constraint. Consider a naïve approach that first ranks above link sets for which it predicts fewer clicks. all candidate links by their probability of being used and then sim- In the following we introduce two simplified models of human ply returns the top K. An extreme outcome would be for all K links browsing behavior and propose three objective functions f based to be placed in a single page s. This is problematic because it seems on these models. Then, we show that two of these objectives have a priori more desirable to connect a large number of pages reason- the useful property of diminishing returns and discuss the impli- ably well with other pages than to connect a single page extremely cations thereof. Finally, we show that all three objectives can be well. This intuition is confirmed by the empirical observation that maximized exactly by a simple and efficient greedy algorithm. the expected number of clicks users take from a given page s is limited and not very sensitive to the addition of new links out of s. 2.1 Browsing models and objective functions Thus we conclude that inserting yet another out-link to s after some We first define notation, then formulate two web-browsing models, good ones have already been added to it achieves less benefit than and finally design three objective functions based on the models. adding an out-link to some other page s0 first. This diminishing-re- turns property, besides accurately mirroring human behavior, leads Notation. We model hyperlink networks as directed graphs G = to an optimization procedure that does not over-optimize on single (V,E), with pages as nodes and hyperlinks as edges. We refer to source pages but instead spreads good links across many pages. the endpoints of a link (s,t) as source and target, respectively. Un- We demonstrate and evaluate our approach on two very differ- linked pairs of nodes are considered potential link candidates. We ent websites: the full English Wikipedia and Simtk, a small com- denote the overall number of views of a page s by Ns. When users munity of biomedical researchers. For both websites we utilize are in s, they have a choice to either stop or follow out-links of s. complete web server logs collected over a period of several months To simplify notation, we model stopping as a link click to a special sink page ∅ ∈/ E that is taken to be linked from every other page. The purpose of the denominator is to renormalize the independent Further, Nst counts the transitions from s to t via direct clicks of the probabilities of all links—old and new—to sum to 1, so pst be- link (s,t). Finally, a central quantity in this research is the click- comes a distribution over t given s. Again, we use ws = logNs. through rate pst of a link (s,t). 
It measures the fraction of times users click to page t, given that they are currently on page s: 2.2 Diminishing returns Next, we note that all of the above objective functions are mono- pst = Nst /Ns. (3) tone, and that f2 and f3 have a useful diminishing-returns property. The stopping probability is given by ps∅. Note that, since in reality Monotonicity means that adding one more new link to the solu- users may click several links from the same source page, pst is tion A will never decrease the objective value. It holds for all three generally not a distribution over t given s. objectives (we omit the proof for brevity): Next we define two models of human web-browsing behavior. f (A ∪ {(s,t)}) − f (A) ≥ 0. (7) Multi-tab browsing model. This model captures a scenario where the user may follow several links from the same page s, e.g., by The difference f (A∪{(s,t)})− f (A) is called the marginal gain of opening them in multiple browser tabs. We make the simplifying (s,t) with respect to A. assumption that, given s, the user decides for each link (s,t) inde- The next observation is that, under the link-centric multi-tab ob- jective f1, marginal gains never change: regardless of which links pendently (with probability pst ) if he wants to follow it. are already present in the network, we will always have Single-tab browsing model. This is a Markov chain model where the user is assumed to follow exactly one link from each visited f1(A ∪ {(s,t)}) − f1(A) = ws pst . (8) page s. This corresponds to a user who never opens more than a This means that f assumes that the solution quality can be in- single tab, as might be common on mobile devices such as phones. 1 creased indefinitely by adding more and more links to the same The probability of choosing the link (s,t) is proportional to p . st source page s. However, we will see later (Sec. 5.2) that this may Based on these browsing models, we now define objective func- not be the case in practice: adding a large number of links to the tions that model the value provided by the newly inserted links A. same source page s does not automatically have a large effect on Note that the objectives reason about the clickthrough rates pst of the total number of clicks made from s. links A that do not exist yet, so in practice pst is not directly avail- This discrepancy with reality is mitigated by the more complex able but must be estimated from data. We address this task in Sec. 3. objectives f2 and f3. In the page-centric multi-tab objective f2, Link-centric multi-tab objective. To capture the value of the further increasing the probability of at least one new link being newly inserted links A, this first objective assumes the multi-tab taken becomes ever more difficult as more links are added. This browsing model. The objective computes the expected number of way, f2 discourages adding too many links to the same source page. P The same is true for the single-tab objective f3: since the click new-link clicks from page s as t:(s,t)∈A pst and aggregates this quantity across all pages s: probabilities of all links from s—old and new—are required to sum to 1 here, it becomes ever harder for a link to obtain a fixed portion X X f1(A) = ws pst . (4) of the total probability mass of 1 as the page is becoming increas- s t:(s,t)∈A ingly crowded with strong competitors. 
More precisely, all three objective functions f have the submod- If we choose ws = Ns then f1(A) measures the expected number ularity property (proof omitted for brevity): of clicks received by all new links A jointly when s is weighted according to its empirical page-view count. In practice, however, f (B ∪ {(s,t)}) − f (B) ≤ f (A ∪ {(s,t)}) − f (A), for A ⊆ B, (9) page-view counts tend to follow heavy-tailed distributions such as power laws [2], so a few top pages would attract a disproportional but while it holds with equality for f1 (Eq. 8), this is not true for amount of weight. We mitigate this problem by using ws = logNs, f2 and f3. When marginal gains decrease as the solution grows, we i.e., by weighting s by the order of magnitude of its raw count. speak of diminishing returns. To see the practical benefits of diminishing returns, consider two Page-centric multi-tab objective. Our second objective also as- source pages s and u, with s being vastly more popular than u, i.e., sumes the multi-tab browsing model. However, instead of mea- ws  wu. Without diminishing returns (under objective f1), the suring the total expected number of clicks on all new links A (as marginal gains of candidates from s will nearly always dominate done by f ), this objective measures the expected number of source 1 those of candidates from u (i.e., ws pst  wu puv for most (s,t) and pages on which at least one new link is clicked (hence, we term this (u,v)), so u would not even be considered before s has been fully objective page-centric, whereas f1 is link-centric): saturated with links. With diminishing returns, on the contrary, s   would gradually become a less preferred source for new links. In X Y other words, there is a trade-off between the popularity of s on the f2(A) = ws 1 − 1 − pst . (5) s t:(s,t)∈A one hand (if s is frequently visited then links that are added to it get more exposure and are therefore more frequently used), and its Here, the product specifies the probability of no new link being ‘crowdedness’ with good links, on the other. clicked from s, and one minus that quantity yields the probability of As a side note we observe that, when measuring ‘crowdedness’, s w = N clicking at least one new link from . As before, we use s log s. f2 only considers the newly added links A, whereas f3 also takes the pre-existing links E into account. In our evaluation (Sec. 6.1.3), we Single-tab objective. Unlike f1 and f2, our third objective is based on the single-tab browsing model. It captures the number of page will see that the latter amplifies the effect of diminishing returns. views upon which the one chosen link is one of the new links A: P 2.3 Algorithm for maximizing the objectives X t:(s,t)∈A pst f (A) = w . (6) We now show how to efficiently maximize the three objective func- 3 s P p + P p s t:(s,t)∈A st t:(s,t)∈E st tions. 
We first state two key observations that hold for all three Algorithm 1 Greedy marginal-gain link placement Algorithm 2 Power iteration for estimating path proportions from Input: Hyperlink graph G = (V,E); source-page weights ws; pairwise transitions probabilities clickthrough rates pst ; budget K; objective f Input: Target node t, pairwise transition matrix P with Pst = pst Output: Set A of links that maximizes f (A) subject to |A| ≤ K Output: Vector Qt of estimated path proportions for all source 1: Q ← new priority queue pages when the target page is t T 2: for s ∈ V do 1: Qt ← (0,...,0) P 3: ΣE ← (s,t)∈E pst // sum of pst values of old links from s 2: Qtt ← 1 0 T 4: Σ ← 0 // sum of new pst values seen so far for s 3: Qt ← (0,...,0) // used for storing the previous value of Qt 0 5: Π ← 1 // product of new (1 − pst ) values seen so far for s 4: while kQt − Qt k > ε do 0 6: C ← 0 // current objective value for s 5: Qt ← Qt 7: for t ∈ V in decreasing order of pst do 6: Qt ← PQt // recursive case of Eq. 10 8: Σ ← Σ + pst 7: Qtt ← 1 // base case of Eq. 10 9: Π ← Π · (1 − pst ) 8: end while 0 10: if f = f1 then C ← ws · Σ 0 11: else if f = f2 then C ← ws · (1 − Π) 0 12: else if f = f3 then C ← ws · Σ/(Σ + ΣE ) exists. Many websites provide a search box on every page, and 13: Insert (s,t) into Q with the marginal gain C0 −C as value thus one way of reaching t from s without taking a direct click is to 14: C ← C0 ‘teleport’ into t by using the search box. Therefore, if many users 15: end for search for t from s we may interpret this as signaling the need to go 16: end for from s to t, a need that would also be met by a direct link. Based 17: A ← top K elements of Q on this intuition, we estimate pst as the search proportion of (s,t), defined as the number of times t was reached from s via search, divided by the total number Ns of visits to s. objectives and then describe an algorithm that builds on these ob- Method 2: Path proportion. Besides search, another way of servations to find a globally optimal solution: reaching t from s without a direct link is by navigating on an in- direct path. Thus, we may estimate pst as the path proportion of 1. For a single source page s, the optimal solution is given by ∗ ∗ (s,t), defined as Nst /Ns, where Nst is the observed number of navi- the top K pairs (s,t) with respect to pst . gation paths from s to t. In other words, path proportion measures 2. Sources are ‘independent’: the contribution of links from s to the average number of paths to t, conditioned on first seeing s. the objective does not change when adding links to sources Method 3: Path-and-search proportion. Last, we may also com- other than s. This follows from Eq. 4–6, by noting that only bine the above two metrics. We may interpret search queries as a links from s appear in the expressions inside the outer sums. special type of page view, and the paths from s to t via search may These observations imply that the following simple, greedy algo- then be considered indirect paths, too. Summing search proportion rithm always produces an optimal solution1 (pseudocode is listed in and path proportion, then, gives rise to yet another predictor of pst . Algorithm 1). The algorithm processes the data source by source. We refer to this joint measure as the path-and-search proportion. For a given source s, we first obtain the optimal solution for s alone Method 4: Random-walk model. The measures introduced above by sorting all (s,t) with respect to pst (line 7; cf. observation 1). 
require access to rather granular data: in order to mine keyword Next, we iterate over the sorted list and compute the marginal gains searches and indirect navigation paths, one needs access to com- of all candidates (s,t) from s (lines 7–15). As marginal gains are plete server logs. Processing large log datasets may, however, be computed, they are stored in a global priority queue (line 13); once computationally difficult. Further, complete logs often contain per- computed, they need not be updated any more (cf. observation 2). sonally identifiable information, and privacy concerns may thus Finally, we return the top K from the priority queue (line 17). make it difficult for researchers and analysts to obtain unrestricted log access. For these reasons, it is desirable to develop clickthrough 3. ESTIMATING CLICKTHROUGH RATES rate estimation methods that manage to make reasonable predic- tions based on more restricted data. For instance, although the In our above exposition of the link placement problem, we assumed Wikipedia log data used in this paper is not publicly available, a we are given a clickthrough rate p for each link candidate (s,t). st powerful dataset derived from it, the matrix of pairwise transition However, in practice these values are undefined before the link is counts between all English Wikipedia articles, was recently pub- introduced. Therefore, we need to estimate what the clickthrough lished by the Wikimedia Foundation [34]. We now describe an rate for each nonexistent link would be in the hypothetical case that algorithm for estimating path proportions solely based on the pair- the link were to be inserted into the site. wise transition counts for those links that already exist. Here we propose four ways in which historical web server logs As defined above, the path proportion measures the expected can be used for estimating p for a nonexistent link (s,t). Later st number of paths to t on a navigation trace starting from s; we de- (Sec. 6.1.1) we evaluate the resulting predictors empirically. note this quantity as Qst here. For a random walker navigating the Method 1: Search proportion. What we require are indicators network according to the empirically measured transitions proba- of users’ need to transition from s to t before the direct link (s,t) bilities, the following recursive equation holds: 1 ( For maximizing general submodular functions (Eq. 9), a greedy 1 if s = t, algorithm gives a (1 − 1/e)-approximation [20]. But in our case, Qst = P (10) observation 2 makes greedy optimal: our problem is equivalent to u psuQut otherwise. finding a maximum-weight basis of a uniform matroid of rank K with marginal gains as weights; the greedy algorithm is guaranteed The base case states that the expected number of paths to t from t to find an optimal solution in this setting [10]. itself is 1 (i.e., the random walker is assumed to terminate a path as soon as t is reached). The recursive case defines that, if the X = tree size random walker has not reached t yet, he might do so later on, when X = average degree 1e+00 continuing the walk according to the empirical clickthrough rates. (num. children) 1e−01 ) n 1e−02 Solving the random-walk equation. The random-walk equation ≥ X (Eq. 10) can be solved by power iteration. Pseudocode is listed as ( Out−deg. [1, 30) Pr 1e−03 Out−deg. [30, 61) Algorithm 2. Let Q be the vector containing as its elements the 1e−04 t Out−deg. [61, 140) Fraction of new links of new Fraction estimates Qst for all pages s and the fixed target t. 
Initially, Qt has Out−deg. [140, 5601)

entries of zero for all pages with the exception of Qtt = 1, as per the 1e−05 1e−06 base case of Eq. 10 (lines 1–2). We then keep multiplying Qt with 1 2 5 10 50 200 0.001 0.010 0.100 1.000 the matrix of pairwise transition probabilities, as per the recursive n Mininum pst case of Eq. 10 (line 6), resetting Qtt to 1 after every step (line 7). (a) Tree properties (b) New-link usage Since this reset step depends on the target t, the algorithm needs to be run separately for each t we are interested in. Figure 2: Wikipedia dataset statistics. (a) CCDF of size and The algorithm is guaranteed to converge to a fixed point under average degree of navigation trees. (b) CCDF of clickthrough P rate for various source-page out-degrees. (Log–log scales.) the condition that s pst < 1 for all t (proof omitted for concise- ness), which is empirically the case in our data. low this approach with time differences as weights. Hence, the trees produced by our heuristic are the best possible trees under 4. DATASETS AND LOG PROCESSING the objective of minimizing the sum (or mean) of all time differ- In this section we describe the structure and processing of web ences spanned by the referer edges. server logs. We also discuss the properties of the two websites we work with, Wikipedia and Simtk. Mining search events. When counting keyword searches, we con- sider both site-internal search (e.g., from the Wikipedia search box) 4.1 From logs to trees and site-external search from general engines such as Google, Ya- hoo, and Bing. Mining internal search is straightforward, since all Log format. Web server log files contain one entry per HTTP search actions are usually fully represented in the logs. An external request, specifying inter alia: timestamp, requested URL, referer search from s for t is defined to have occurred if t has a search en- URL, HTTP response status, user agent information, client IP ad- gine as referer, if s was the temporally closest previous page view dress, and proxy server information via the HTTP X-Forwarded- by the user, and if s occurred at most 5 minutes before t. For header. Since users are not uniquely identifiable by IP ad- dresses (e.g., several clients might reside behind the same proxy 4.2 Wikipedia data server, whose IP address would then be logged), we create an ap- proximate user ID by computing an MD5 digest of the concate- Link graph. The Wikipedia link graph is defined by nodes repre- nation of IP address, proxy information, and user agent. Common senting articles in the main namespace, and edges representing the bots and crawlers are discarded on the basis of the user agent string. hyperlinks used in the bodies of articles. Furthermore, we use the English Wikipedia’s publicly available full revision history (span- From logs to trees. Our goal is to analyze the traces users take ning all undeleted revisions from 2001 through April 3, 2015) to on the hyperlink network of a given website. However, the logs do determine when links were added or removed. not represent these traces explicitly, so we first need to reconstruct them from the raw sequence of page requests. Server logs. We have access to Wikimedia’s full server logs, con- We start by grouping requests by user ID and sorting them by taining all HTTP requests to Wikimedia projects. We consider only time. If a page t was requested by clicking a link on another page s, requests made to the desktop version of the English Wikipedia. the URL of s is recorded in the referer field of the request for t. 
The The log files we analyze comprise 3 months of data, from January idea, then, is to reassemble the original traces from the sequence through March 2015. For each month we extracted around 3 billion of page requests by joining requests on the URL and referer fields navigation trees. Fig. 2(a) summarizes structural characteristics of while preserving the temporal order. Since several clicks can be trees by plotting the complementary cumulative distribution func- made from the same page, e.g., by opening multiple browser tabs, tions (CCDF) of the tree size (number of page views) and of the the navigation traces thus extracted are generally trees. average degree (number of children) of non-leaf page views per While straightforward in principle, this method is unfortunately tree. We observe that trees tend to be small: 77% of trees consist of unable to reconstruct the original trees perfectly. Ambiguities arise a single page view; 13%, of two page views; and 10%, of three or when the page listed in the referer field of t was visited several more page views. Average degrees also follow a heavy-tailed dis- times previously by the same user. In this situation, linking t to all tribution. Most trees are linear chains of all degree-one nodes, but its potential predecessors results in a directed acyclic graph (DAG) page views with larger numbers of children are still quite frequent; rather than a tree. Transforming such a DAG into a tree requires e.g., 6% (44 million per month) of all trees with at least two page a heuristic approach. We proceed by attaching a request for t with views have an average degree of 3 or more. referer s to the most recent request for s by the same user. If the referer field is empty or contains a URL from an external website, 4.3 Simtk data the request for t becomes the root of a new tree. Our second log dataset stems from Simtk.org, a website where Not only is this greedy strategy intuitive, since it seems more biomedical researchers share code and information about their proj- likely that the user continued from the most recent event, rather ects. We analyze logs from June 2013 through February 2015. than resumed an older one; it also comes with a global optimality Contrasting Simtk and Wikipedia, we note that the Simtk dataset guarantee. More precisely, we observe that in a DAG G with edge is orders of magnitude smaller than the Wikipedia dataset, in terms weights w, attaching each internal node v to its minimum-weight of both pages and page views, at hundreds, rather than millions, parent argminu wuv yields a minimum spanning tree of G. We fol- of pages, and tens of millions, rather than tens of billions, of page sntncsayi re,weesvrlnx ae a eopened be may pages next page several same where the from trees, in necessary not is clicks? ( for source compete same they do the or in other, placed each of links independent new page several answer are links new we Second, do particular, volume click receive? In much how 2). First, questions: (Sec. following formulat- problem the new when placement adding made link we of assumptions the effects the ing the of investigate some to assess is and section links this of goal The LINKS NEW OF EVALUATION: EFFECTS 5. tree. same the in path a on often than how consecu- rather up session, between proportions tallying path [13] by compute hour 3) we one consistent, (Sec. be than to more Further, events. no tive of extracting times of idle instead with Therefore, logs other the Wikipedia. 
extract we from the for trees, mined on than traces rich ours; navigation less as the are such that much means method is also a there it by that hand, means made this be hand, to dense one improvement less the significantly On also Wikipedia’s. is than structure hyperlink Simtk’s on boundaries views. bin Lower bottom). is source 1 of top; markup is wiki (0 in page link out- of source-page position (a) relative (b) of and function degree as rate Clickthrough 3: Figure rcsaegnrlytes o his(e.41.Wiei chains, in While 4.1). (Sec. chains not trees, generally links are between Traces Competition bottom. the at appearing and 5.2 links center, of the that in as appearing high links as of times that 3.7 rate as about clickthrough high a as achieving links times page with 1.6 the [16], of about page top source the to the close in appearing position its with correlated also trend. negative strong a shows rate and clickthrough out-degree by the confirmed plots is which finding 3(a), This Fig. pages. higher source with lower-degree clickthrough, 1% between for over values achieve out-degree, links new source-page of 13% on and 3% depending that, observe We click- the of rate (CCDF) through function distribution links cumulative the plementary 1). and (Fig. of March, times 66% in 100 time than that more single extent used a new were even the that 1% used to only not fact were used, the February rarely to in be added alluded to already tend have links we links introduction new the of In Usage 5.1 involving traces all p p pst s st st , oase hs usin,w osdrteaon 0,0 links 800,000 around the consider we questions, these answer To ial,Fg ()cnrstefc httepplrt faln is link a of popularity the that fact the confirms 3(b) Fig. Finally, com- the shows 2(b) Fig. numbers, absolute plots 1 Fig. While t

= 0.001 0.004 0.007 ) ol omadsrbto over distribution a form would de oteEgihWkpdai eray21 n study and 2015 February in Wikipedia English the to added 1), 1 Source−page out−degree P 17 t 26 p p (a) st st 35 taie yteotdge ftesuc page source the of out-degree the by stratified , a qa h u-ereof out-degree the equal may sessions, 46 s s 61 uhta,i h oteteecs we all (when case extreme most the in that, such , ntemnh fJnayadMrh2015. March and January of months the in 83 114 endhr ssqecso aeviews page of sequences as here defined 175 310

pst s t

curdbefore occurred 0.002 0.004 given

0 p s 0.109 , Relative linkposition st s i.e.

. 0.227 gis source-page against (b)

, 0.354 P 0.479 t t

p 0.596 ntesame the in st

x 0.7 = -axis. 0.798 ,this 1, 0.887 0.954 s . s of out transitions of number carefully. more question cru- this of investigate is we links Next between limited importance. interaction a cial of only links. question propose the between to links, allowed interaction, of is number or method competition, placement link of our sign As first may a which as rates, clickthrough serve individual lower have to from tend follow degree to chooses user page given a a links to of links number the good increase many adding that its able with it click to choosing eval- probability and page, separately, entire option the link scanning every reader a uating imply would it since the true, be links; between competition from no of see out-degree would the larger we 2.1), (Sec. tion ieyta oelnsaetkn r(i tutrldge a be ‘topical topic may or complex degree ‘interestingness’ more structural as a (ii) such more complexity’: or it factors makes latent taken; also with are page correlated a links to more links that more likely adding simply (i) that ble 4(b)). Fig. out-links; out-links more 10 or than less 288 for for 1.00 (median follow to links users more that in given 76% Additionally, stop and out-links. not out-links, 288 do 10 least than at less with stopping pages with less median for pages a is for with 89% stopping degree, of that structural probability larger shows of 2015, pages January on likely in articles 300,000 for ere(i.4d)vr nysihl,ee o rsi hne in changes drastic for even slightly, navigational only grows and vary degree 4(c)) 4(d)) structural (Fig. (Fig. probability as degree stopping above: both from shrinks, ii or hypothesis to support of degree from navigational links the removing and or probability stopping adding the of on March. effect the for captures one now and values January page for fixed sepa- one a two quantity, For obtaining each snapshots, for these values in rate present links the exclu- on based sively probabilities, these stopping and as take structural well of We as compute degrees, months we navigational the Then, 2015. respectively. in 1, March, Wikipedia and of March January states from approximate other the the as Jan- snapshots and from 2015, one Wikipedia, 1, of uary snapshots fixed two cor- for consider are degree we degree navigational particular, in structural changes in with changes related whether observing and time of properties inherent inherent of because of but topic present the are of options properties link more because ply s ( at of stopping of probability the (1) pages of number the be to of out in make stop users transitions of degree of navigational number the words, other In i.e. hswudla omr lcsfrom clicks more to lead would this ; a ena nitra oeo tree, a of node internal an as seen was is edfiethe define we First out- higher of pages from links that saw already we 5.1 Sec. In assump- independence its with model browsing multi-tab the In hs fet ol eepandb w yohss ti possi- is It hypotheses. two by explained be could effects These counts transition the from computed been has which 4(a), Fig. i.4c n ()so htti feti iucl,tu lending thus minuscule, is effect this that show 4(d) and 4(c) Fig. these for control to need we true, is hypothesis which decide To of degree structural the of relation the examine we Next, s cosalreaddvresto pages of set diverse and large a across , ik)t te oista ih erlvn o understanding for relevant be might that topics other to links) , s o fixed for s eas en the define also We . 
p st nals xrm om oee,i swl conceiv- well is it however, form, extreme less a In . s hymk oecik naeaewhen average on clicks more make they , p s st h ifrnebtenteMrhadJanuary and March the between difference the , nispr om hsmdlsesulkl to unlikely seems model this form, pure its In . s aiainldge d degree navigational ed ob rcigtesame the tracking by so do We . s d h agrteepce ubro clicks of number expected the larger the , s . s s = ik to. links s iie ytettlnme ftimes of number total the by divided , tutrldegree, structural P N s t s − 6= n 2 h aiainldegree navigational the (2) and ∅ N s N s s ∅ ilhv oeconnections more have will st otoetpc,btntsim- not but topics, those to i.e. . s ihu tpigthere: stopping without , ie htte onot do they that given , s s . ssml h average the simply is s s of ih significantly might or s out-degree, ob h total the be to s swl.In well. as s s through vs. s . s offers with 1.38 (11) of s s s . nisonbfr hyaeitoue.Teeoe ete check then we Therefore, introduced. are they links before practically such own a identify fact, to its ability the on additional after eval- the only have the known must Since system is useful links suited links. added new are of of 3 set parts. rates Sec. uation three from clickthrough methods into the estimation predicting divided the for is that Wikipedia show for we First, results of analysis The Wikipedia 6.1 Simtk. and Wikipedia websites, different very evaluating two by on approach our it of universality the demonstrate we PLACEMENT Next EVALUATION: LINK 6. functions objective the of ties in- high- not pages. spread will source should this different many one since across rather, links page, clickthrough indefinitely; source volume much same click too the crease spending on avoid budget should one’s budget-constrained one of for algorithms: hence important placement and has link modeling observation This user for attention. compete user implications links for Instead, other page. each a from with taken clicks of number overall 4(d)) (Fig. (removed). ranges added interquartile are links of indicated as shift as (downward) values, upward typical, the than by rather extreme, from stems 0.1%. effect only is decrease relative median the when deleted, and 0.3%, are only links still 100 is degree navigational in increase relative structure; link outliers. without from range full values the the show to on whiskers respect tiles; boundaries with bin relative Lower are (d) January. The and degree. (c) navigational in (d) axes effect and small the probability a stopping fixing only (c) has on When degree structural however, pos- degree. page, navigational (b) source and with probability, correlated stopping itively (a) with is degree correlated structural negatively pages, source different Across 4: Figure Stopping−probability difference Stopping probability −0.2 −0.1 0.4 0.5 0.6 0.7 0.8 0.9 1.0 0.0 0.1 0.2 hs nig justify, findings These the increase not does links more adding simply nutshell, a In this but all, at effect no has links adding that say to not is This

Structural−degree differenceStructural−degree 1 −Inf 10

−100 degree Structural 17 (c) (a) e.g. −10 26 −1 38 hn10lnso oeaeadd h median the added, are more or links 100 when , 0 54 79 1 115 othoc, post 10 175 100 288 f 2

and Navigational−degree difference Navigational degree −0.2 −0.1 1.0 1.2 1.4 1.6 1.8 2.0 2.2 h iiihn-eun proper- diminishing-returns the 0.0 0.1 0.2 f 3 x Sc 2.2). (Sec. ae.Bxsso quar- show Boxes -axes.

Structural−degree differenceStructural−degree 1 −Inf 10

−100 degree Structural 17 (d) (b) −10 26 −1 38 0 54 79 1 115 10 175 100 288 y - de.I atclr ecmaefiedfeetmtos(e.3): (Sec. methods different five compare we particular, 2015, In January added. of data log rates from clickthrough 2015 the March predict evaluate can were we we Here that well 2015. links how February 800,000 in Wikipedia identified English have the we to added 5, Sec. in described As [24]. button a of click simple the with suggestions them our decline shown or are accept users and Wikipedia: to links missing add to al- our of constraints. behavior budget the under investigate placement link we for Last, gorithm addition. of worthy is predicted large a whether intervals. confidence 95% methods bootstrapped estimation with rate Wikipedia, clickthrough on of Comparison 1: Table hrfr osdrallnsaddi eraywt tlat1 in- set 10 this call least January; at in with searches February or in we paths counts, added direct search links and all path consider on mainly therefore based high are a methods predict exam- our potentially Since positive can methods of our number which for non-negligible ples a require We examples. February. in Wikipedia to our links added whether to were correspond evaluating that data by January case the the using predictions actually high-ranking is this whether we the when that known imply not actu- is were set that this links practice. above, of in stated deployed subset are as the predictors But, for well added. fairly future ally link the a predict to of manage usage models the that showed we link subsection a hypothet- the that about scenario reason ical prediction clickthrough for models Our 5(b). and the 5(a) Fig. when to for refer between we section relation values, the next ground-truth on and the perspective predicted graphical see a (but For all page fails). for baseline value same same the the from predicts links it though even 0.72%, method) performing of well error quite performs absolute baseline mean (mean the Also, why is This much. out-degree. by page behind that 3(a) lagging Fig. not from recall is the matrix, requires to only transition encouraging which pairwise is predictor, path- random-walk–based it and the Further, proportion that see path best. metrics perform all proportion across and-search that shows correlation. table rank The Spearman and correlation, Pearson error, absolute of candidates out-links all all page of for source rate through prediction same same the in the originating makes baseline mean The .. siaigcikhog rates clickthrough Estimating 6.1.1 .. rdcigln addition link Predicting 6.1.2 enbsl.007 ( 0.0072 baseln. Mean erhpo.007 ( 0.0070 prop. Search eas eeoe rpia sritraeta ae teasy it makes that interface user graphical a developed also We o hsts,w edagon-rt e fpstv n negative and positive of set ground-truth a need we task, this For rate clickthrough predicted large a Intuitively, mean metrics: different three using results the evaluates 1 Table ad ak .00( 0.0060 walks Rand. ama baseline. mean a 5. and walks, random 4. proportion, path-and-search combined the 3. proportion, path 2. proportion, search 1. & prop. P&S ahprop. Path ( s , t ) enaslt r.Pasncr.Sera corr. Spearman corr. Pearson err. absolute Mean saueu ikadsol hsb de.Here added. 
be thus should and link useful a is 0.0057 0.0057 ( ( ± ± ± ± ± ( p s p st .04 .0( 0.20 ( 0.53 0.0004) 0.0004) .04 .9( 0.49 0.0003) ( 0.58 0.0004) 0.0003) , st t st ag atdtrie ysource- by determined part large a to is ) au nedas en httelink the that means also indeed value saddt h ie nteprevious the In site. the to added is s htarayexist. already that vs. 0.61 .7 civdb h best- the by achieved 0.57% s aeyteaeaeclick- average the namely , ( ± ± ± ± ± .6 .4( 0.34 ( 0.59 0.06) 0.13) .2 .7( 0.17 0.13) 0.22) 0.12) i.e. eoeteln was link the before , p L st ic nreality in Since . fteelnsin links these of p st 0.64 0.64 hudalso should ( ( p ± ± ± ± ± st 0.02) 0.02) 0.01) 0.02) 0.01) value. t t s s

p p Solution overlap 1.0 1.0 1e−01 1e−01 0.8 0.8 k 0.6 0.6 1e−03 1e−03 0.4 0.4

Precision@ Path proportion 0.2 Jaccard coefficient Jaccard Search proportion 0.2 f 1 vs. f 2 f vs. f 1e−05 1e−05 Random walks 1 3 0.0 0.0 Ground−truth clickthrough rate Ground−truth clickthrough rate 1e−05 1e−03 1e−01 1e−05 1e−03 1e−01 1 10 100 1000 10000 0 2000 6000 10000 Path proportion Search proportion Rank k Size K of solution A (a) (b) (c) (d)

Prior cumulative clickthrough Solution concentration Total click volume Average click volume A

0.8 f 1 f 1 3.0

f 2 100 f 2

0.7 f 3 f 3 2.5 250000 f 1 80

0.6 f 2

f 3 2.0 60

0.5 f 1 1.5 100000 f 2 Total number of clicks number Total 40 0.4 f 3 of clicks number Average 0 Targets per source page in Targets 1.0

Mean prior cumulative clickthrough Mean prior cumulative 0 2000 6000 10000 0 2000 6000 10000 0 2000 6000 10000 0 2000 6000 10000 Size K of solution A Size K of solution A Size K of solution A Size K of solution A (e) (f) (g) (h) Figure 5: Wikipedia results. (a) Path proportion vs. clickthrough rate. (b) Search proportion vs. clickthrough rate (log–log; black lines correspond to gray dots kernel-smoothed on linear scales). (c) Precision at k on link prediction task (Sec. 6.1.2; logarithmic x-axis); path-and-search proportion nearly identical to path proportion and thus not shown. (d–h) Results of budget-constrained link placement for objectives of Eq. 4–6 (Sec. 6.1.3); ‘prior cumulative clickthrough’ refers to second term in denominator of Eq. 6. most page pairs are never linked, we also need to add a large num- First, we consider how similar the solutions produced by the dif- ber of negative examples to the test set. We do so by including all ferent objective functions are. For this purpose, Fig. 5(d)) plots the out-links of the sources, and in-links of the targets, appearing in L. Jaccard coefficients between the solutions of f1 and f2 (orange) and Using this approach, we obtain a set of about 38 million link can- between the solutions of f1 and f3 (blue) as functions of the solu- didates, of which 9,000 are positive examples. The aforementioned tion size K. We observe that f1 and f2 produce nearly identical so- threshold criterion is met by 7,000 positive, and 104,000 negative, lutions. The reason is that clickthrough rates pst are generally very examples. Therefore, while there is a fair number of positive exam- small, which entails that Eq. 4 and 5 become very similar (formally, ples with many indirect paths and searches, they are vastly outnum- this can be shown by a Taylor expansion around 0). Objective f3, bered by negative examples with that same property; this ensures on the contrary, yields rather different solutions; it tends to favor that retrieving the positive examples remains challenging for our source pages with fewer pre-existing high-clickthrough links (due methods (e.g., returning all links meeting the threshold criterion in to the second term in the denominator of Eq. 6; Fig. 5(e)) and at- random order yields a precision of only 6%). tempts to spread links more evenly over all source pages (Fig. 5(f)). Given this labeled dataset, we evaluate the precision at k of the In other words, f3 offers a stronger diminishing-returns effect: the candidate ranking induced by the predicted pst values. The results, marginal value of more links decays faster when some good links plotted in Fig. 5(c), indicate that our methods are well suited for have already been added to the same source page. predicting link addition. Precision stays high for a large set of top Next, we aim to assess the impact of our suggestions on real Wi- predictions: of the top 10,000 predictions induced by the path-pro- kipedia users. Recall that we suggest links based on server logs portion and random-walk measures, about 25% are positive exam- from January 2015. We quantify the impact of our suggestions in ples; among the top 2,000 predictions, as many as 50% are positive. terms of the total number of clicks they received in March 2015 As before, random walks perform nearly as well as the more data- (contributed by links added by editors after January). Fig. 5(g), intensive path proportion, and better than search proportion. 
which plots the total March click volume as a function of the so- Note that the mean baseline, which performed surprisingly well lution size K, shows that (according to f3) the top 1,000 links re- on the clickthrough rate prediction task (Sec. 6.1.1), does not ap- ceived 95,000 clicks in total in March 2015, and the top 10,000 re- pear in the plot. The reason is that it fails entirely on this task, pro- ceived 336,000 clicks. Recalling that two-thirds of all links added ducing only two relevant links among its top 10,000 predictions. by humans receive no click at all (Fig. 1), this clearly demonstrates We conclude that it is easy to give a good pst estimate when it is the importance of the links we suggest. Fig. 5(h) displays the av- known which links will be added. However, predicting whether a erage number of clicks per suggested link as a function of K. The link should be added in the first place is much harder. decreasing shape of the curves implies that higher-ranked sugges- tions received more clicks, as desired. Finally, comparing the three 6.1.3 Link placement under budget constraints objectives, we observe that f3 fares best on the Wikipedia dataset: The above evaluations concerned the quality of predicted pst val- its suggestions attract most clicks (Fig. 5(g) and 5(h)), while also ues. Now we rely on those values being accurate and combine them being spread more evenly across source pages (Fig. 5(f)). in our budget-constrained link placement algorithm (Sec. 2.3). We These results show that the links we suggest fill a real user need. use the pst values from the path-proportion method. Our system recommends useful links before they would normally Mean absolute err. Pearson corr. Spearman corr. term and the numerator; with the first link added to s, the ratio in Path prop. 0.020 (±0.003) 0.41 (±0.10) 0.50 (±0.09) Eq. 6 becomes close to 1, and the contribution of s to f close to Mean baseln. 0.013 (±0.003) −0.01 (±0.11) −0.27 (±0.10) 3 its possible maximum ws. Thereafter, the objective experiences ex- Table 2: Performance of path-proportion clickthrough estima- treme diminishing returns for s. Fig. 6(b) confirms that f3 starts by tion on Simtk, with bootstrapped 95% confidence intervals. inserting one link into each of the around 750 pages. We conclude that with few pre-existing links, the page-centric t s p Solution concentration multi-tab objective f2 might be better suited than the single-tab ob- A 5 f 1 jective f3, as it provides a more balanced trade-off between the 2e−01 f 2

4 number of links per source page and predicted clickthrough rates. f 3 2e−02 3 7. DISCUSSION AND RELATED WORK 2

2e−03 There is a rich literature on the task of link prediction and rec- ommendation, so rather than trying to be exhaustive, we refer the 1 Targets per source page in Targets reader to the excellent survey by Liben-Nowell and Kleinberg [17] 2e−04 Ground−truth clickthrough rate 2e−04 2e−03 2e−02 2e−01 0 500 1000 1500 2000 and to some prior methods for the specific case of Wikipedia [11, Path proportion Size K of solution A 19, 22, 32], as it is also our main evaluation dataset. The important (a) (b) point to make is that most prior methods base their predictions on Figure 6: Simtk results. (a) Path proportion vs. clickthrough the static structure of the network, whereas we focus on how users rate (black line corresponds to gray dots kernel-smoothed on interact with this static structure by browsing and searching on it. linear scales). (b) Solution concentration of budget-constrained Among the few link recommendation methods that take usage link placement; dotted vertical line: number of distinct pages. data into account are those by Grecu [12] and West et al. [31]. They leverage data collected through a human-computation game in which users navigate from a given start to a given target page be added by editors, and it recommends additional links that we on Wikipedia, thereby producing traces that may be used to recom- may assume would also be clicked frequently if they were added. mend new ‘shortcut’ links from pages along the path to the target Table 3 illustrates our results by listing the top link suggestions. page. Although these approaches are simple and work well, they are limited by the requirement that the user’s navigation target be 6.2 Simtk explicitly known. Eliciting this information is practically difficult It is important to show that our method is general, as it relies on no and, worse, may even be fundamentally impossible: often users do features specific to Wikipedia. Therefore, we conclude our evalu- not have any target in mind, e.g., during exploratory search [18] or ation with a brief discussion of the results obtained on the second, when browsing the Web for fun or distraction. Our approach allevi- smaller dataset from Simtk. ates this problem, as it does not rely on a known navigation target, For evaluating our approach, we need to have a ground-truth set but rather works on passively collected server logs. of new links alongside their addition dates. Unlike for Wikipe- Web server logs capture user needs and behavior in real time, so dia, no complete revision history is available for Simtk, but we can another clear advantage of our log-based approach is the timeliness nonetheless assemble a ground truth by exploiting a specific event of the results it produces. This is illustrated by Table 3, where most in the site’s history: after 6 months of log data, a sidebar with links suggestions refer to current events, films, music, and books. to related pages was added to all pages. These links were deter- Our random-walk model (Sec. 3) builds on a long tradition of mined by a recommender system once and did not change for 6 Markovian web navigation models [6, 8, 27]. Among other things, months. Our task is to predict these links’ clickthrough rates pst Markov chains have been used to predict users’ next clicks [6, 26] for the 6 months after, based on log data from the 6 months before. as well as their ultimate navigation targets [30]. 
Last, we analyze the solutions of our budget-constrained link placement algorithm. Our first observation is that the two multi-tab objectives f1 and f2 differ much more on Simtk than on Wikipedia (with Jaccard coefficients between 40% and 60%, compared to over 90%), which shows that the page-centric multi-tab objective f2 is not redundant but adds a distinct option to the repertoire.
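The Jaccard coefficients reported here compare the sets of (source, target) pairs selected by two objectives at the same budget; a minimal illustrative helper (not the authors' code):

    def jaccard(solution_a, solution_b):
        """Jaccard coefficient between two sets of suggested (source, target) links."""
        a, b = set(solution_a), set(solution_b)
        return len(a & b) / len(a | b) if a | b else 1.0

    # e.g., comparing top-K solutions produced under two different objectives:
    # overlap = jaccard(solution_under_f1, solution_under_f2)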
We also observe that the value of the single-tab objective f3 saturates after inserting around as many links as there are pages (750). The reason is that only very few links were available and clicked before the related-link sidebar was added, so the second term in the denominator of Eq. 6 is typically much smaller than the first term and the numerator; with the first link added to s, the ratio in Eq. 6 becomes close to 1, and the contribution of s to f3 close to its possible maximum ws. Thereafter, the objective experiences extreme diminishing returns for s. Fig. 6(b) confirms that f3 starts by inserting one link into each of the around 750 pages.

We conclude that with few pre-existing links, the page-centric multi-tab objective f2 might be better suited than the single-tab objective f3, as it provides a more balanced trade-off between the number of links per source page and predicted clickthrough rates.
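To make the saturation argument concrete, here is a small numerical illustration. Since Eq. 6 is not reproduced in this section, it assumes a per-page contribution of the hypothetical form ws * new/(new + existing), consistent with the qualitative description above (numerator and first denominator term: predicted clicks on newly inserted links of s; second denominator term: clicks on s's pre-existing links).

    # Illustration only; the assumed ratio form is a stand-in for Eq. 6.
    w_s = 1.0
    existing = 0.001            # Simtk pages had almost no clicked links before the sidebar
    p_new = [0.20, 0.10, 0.05]  # hypothetical predicted clickthrough rates of candidate links

    new, prev = 0.0, 0.0
    for i, p in enumerate(p_new, 1):
        new += p
        contrib = w_s * new / (new + existing)
        print(f"after link {i}: contribution = {contrib:.3f} (gain {contrib - prev:.3f})")
        prev = contrib
    # after link 1: contribution = 0.995 (gain 0.995)
    # after link 2: contribution = 0.997 (gain 0.002)
    # after link 3: contribution = 0.997 (gain 0.000)

With essentially no pre-existing clicks, the very first link already captures almost all of ws, which is why f3 spreads its budget across as many distinct source pages as possible before adding second links.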

(a) Page-centric multi-tab objective f2 (Source → Target):
  * ITunes Originals – Red Hot Chili Peppers → Road Trippin’ Through Time
  * Internat. Handball Federation → 2015 World Men’s Handball Champ.
    Baby, It’s OK! → Olivia (singer)
    Payback (2014) → Royal Rumble (2015)
  * Nordend → Category:Mountains of the Alps
  * Gran Hotel → Grand Hotel (TV series)
  * Edmund Ironside → Edward the Elder
  * Jacob Aaron Estes → The Details (film)
    Confed. of African Football → 2015 Africa Cup of Nations
    Live Rare Remix Box → Road Trippin’ Through Time

(b) Single-tab objective f3 (Source → Target):
  * ITunes Originals – Red Hot Chili Peppers → Road Trippin’ Through Time
  * Gran Hotel → Grand Hotel (TV series)
  * Tithe: A Modern Faerie Tale → Ironside: A Modern Faery’s Tale
  * Nordend → Category:Mountains of the Alps
  * Jacob Aaron Estes → The Details (film)
    Blame It on the Night → Blame (Calvin Harris song)
  * Internat. Handball Federation → 2015 World Men’s Handball Champ.
  * The Choice (novel) → Benjamin Walker (actor)
    Payback (2014) → Royal Rumble (2015)
  * Just Dave Van Ronk → No Dirty Names

Table 3: Top 10 link suggestions of Algorithm 1 using objectives f2 and f3. Clickthrough rates pst estimated via path proportion (Sec. 3). Objective f1 (not shown) yields same result as f2 but includes an additional link for source page Baby, It’s OK!, demonstrating effect of diminishing returns on f2. Asterisks mark links added by editors after prediction time, independently of our predictions.

7. DISCUSSION AND RELATED WORK

There is a rich literature on the task of link prediction and recommendation, so rather than trying to be exhaustive, we refer the reader to the excellent survey by Liben-Nowell and Kleinberg [17] and to some prior methods for the specific case of Wikipedia [11, 19, 22, 32], as it is also our main evaluation dataset. The important point to make is that most prior methods base their predictions on the static structure of the network, whereas we focus on how users interact with this static structure by browsing and searching on it.

Among the few link recommendation methods that take usage data into account are those by Grecu [12] and West et al. [31]. They leverage data collected through a human-computation game in which users navigate from a given start to a given target page on Wikipedia, thereby producing traces that may be used to recommend new ‘shortcut’ links from pages along the path to the target page. Although these approaches are simple and work well, they are limited by the requirement that the user’s navigation target be explicitly known. Eliciting this information is practically difficult and, worse, may even be fundamentally impossible: often users do not have any target in mind, e.g., during exploratory search [18] or when browsing the Web for fun or distraction. Our approach alleviates this problem, as it does not rely on a known navigation target, but rather works on passively collected server logs.

Web server logs capture user needs and behavior in real time, so another clear advantage of our log-based approach is the timeliness of the results it produces. This is illustrated by Table 3, where most suggestions refer to current events, films, music, and books.

Our random-walk model (Sec. 3) builds on a long tradition of Markovian web navigation models [6, 8, 27]. Among other things, Markov chains have been used to predict users’ next clicks [6, 26] as well as their ultimate navigation targets [30]. Our model might be improved by using a higher-order Markov model, based on work by Chierichetti et al. [5], who found that human browsing is around 10% more predictable when considering longer contexts.
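For illustration only, a higher-order model of the kind alluded to here could be estimated by conditioning next-click counts on the two preceding pages; the session format and the absence of smoothing are simplifying assumptions, not the paper's implementation.

    from collections import defaultdict

    def fit_second_order(sessions):
        """Estimate P(next page | two preceding pages) from click sessions
        (each session is a list of page identifiers)."""
        counts = defaultdict(lambda: defaultdict(int))
        for pages in sessions:
            for a, b, c in zip(pages, pages[1:], pages[2:]):
                counts[(a, b)][c] += 1
        return {ctx: {nxt: n / sum(nxts.values()) for nxt, n in nxts.items()}
                for ctx, nxts in counts.items()}

    # model = fit_second_order([["Home", "Docs", "API", "Docs"], ...])
    # model[("Home", "Docs")] -> {"API": 1.0}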
Further literature on human navigation includes information foraging [4, 23, 25, 30], decentralized network search [14, 15, 29], and the analysis of click trails from search result pages [3, 9, 33].

In the present work we assume that frequent indirect paths between two pages indicate that a direct link would be useful. While this is confirmed by the data (Fig. 5(a)), research has shown that users sometimes deliberately choose indirect routes [28], which may offer additional value [33]. Detecting automatically in the logs when this is the case would be an interesting research problem.

Our method is universal in the sense that it uses only data logged by any commodity web server software, without requiring access to other data, such as the content of pages in the network. We demonstrate this generality by applying our method on two rather different websites without modification. While this is a strength of our approach, it would nonetheless be interesting to explore if the accuracy of our link clickthrough model could be improved by using machine learning to combine our current estimators, based on path counts, search counts, and random walks, with domain-specific elements such as the textual or hypertextual content of pages.

A further route forward would be to extend our method to other types of networks that people navigate. For instance, citation networks, knowledge graphs, and some online social networks could all be amenable to our approach. Finally, it would be worthwhile to explore how our methods could be extended to identify not only links inside a given website but also links between different sites.

In summary, our paper makes contributions to the rich line of work on improving the connectivity of the Web. We hope that future work will draw on our insights to increase the usability of websites as well as the Web as a whole.

Acknowledgments. This research has been supported in part by NSF IIS-1016909, IIS-1149837, IIS-1159679, CNS-1010921, NIH R01GM107340, Boeing, Facebook, Volkswagen, Yahoo, SDSI, and Wikimedia Foundation. Robert West acknowledges support by a Stanford Graduate Fellowship.

8. REFERENCES

[1] L. Adamic and E. Adar. Friends and neighbors on the Web. Social Networks, 25(3):211–230, 2003.
[2] E. Adar, J. Teevan, and S. T. Dumais. Large scale analysis of Web revisitation patterns. In CHI, 2008.
[3] M. Bilenko and R. White. Mining the search trails of surfing crowds: Identifying relevant websites from user activity. In WWW, 2008.
[4] E. H. Chi, P. Pirolli, K. Chen, and J. Pitkow. Using information scent to model user information needs and actions and the Web. In CHI, 2001.
[5] F. Chierichetti, R. Kumar, P. Raghavan, and T. Sarlos. Are Web users really Markovian? In WWW, 2012.
[6] B. D. Davison. Learning Web request patterns. In Web Dynamics. Springer, 2004.
[7] P. Devanbu, Y.-F. Chen, E. Gansner, H. Müller, and J. Martin. Chime: Customizable hyperlink insertion and maintenance engine for software engineering environments. In ICSE, 1999.
[8] D. Downey, S. Dumais, and E. Horvitz. Models of searching and browsing: Languages, studies, and application. In IJCAI, 2007.
[9] D. Downey, S. Dumais, D. Liebling, and E. Horvitz. Understanding the relationship between searchers’ queries and information goals. In CIKM, 2008.
[10] J. Edmonds. Matroids and the greedy algorithm. Mathematical Programming, 1(1):127–136, 1971.
[11] S. Fissaha Adafre and M. de Rijke. Discovering missing links in Wikipedia. In LinkKDD, 2005.
[12] M. Grecu. Navigability in information networks. Master’s thesis, ETH Zürich, 2014.
[13] A. Halfaker, O. Keyes, D. Kluver, J. Thebault-Spieker, T. Nguyen, K. Shores, A. Uduwage, and M. Warncke-Wang. User session identification based on strong regularities in inter-activity time. In WWW, 2015.
[14] D. Helic, M. Strohmaier, M. Granitzer, and R. Scherer. Models of human navigation in information networks based on decentralized search. In HT, 2013.
[15] J. Kleinberg. The small-world phenomenon: An algorithmic perspective. In STOC, 2000.
[16] D. Lamprecht, D. Helic, and M. Strohmaier. Quo vadis? On the effects of Wikipedia’s policies on navigation. In Wiki-ICWSM, 2015.
[17] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. JASIST, 58(7):1019–1031, 2007.
[18] G. Marchionini. Exploratory search: From finding to understanding. Communications of the ACM, 49(4):41–46, 2006.
[19] D. Milne and I. H. Witten. Learning to link with Wikipedia. In CIKM, 2008.
[20] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions – I. Mathematical Programming, 14(1):265–294, 1978.
[21] C. Nentwich, L. Capra, W. Emmerich, and A. Finkelstein. Xlinkit: A consistency checking and smart link generation service. TOIT, 2(2):151–185, 2002.
[22] T. Noraset, C. Bhagavatula, and D. Downey. Adding high-precision links to Wikipedia. In EMNLP, 2014.
[23] C. Olston and E. H. Chi. ScentTrails: Integrating browsing and searching on the Web. TCHI, 10(3):177–197, 2003.
[24] A. Paranjape, R. West, and L. Zia. Project website, 2015. https://meta.wikimedia.org/wiki/Research:Improving_link_coverage (accessed Dec. 10, 2015).
[25] P. Pirolli. Information foraging theory: Adaptive interaction with information. Oxford University Press, 2007.
[26] R. R. Sarukkai. Link prediction and path analysis using Markov chains. Computer Networks, 33(1):377–386, 2000.
[27] P. Singer, D. Helic, A. Hotho, and M. Strohmaier. HypTrails: A Bayesian approach for comparing hypotheses about human trails on the Web. In WWW, 2015.
[28] J. Teevan, C. Alvarado, M. S. Ackerman, and D. R. Karger. The perfect search engine is not enough: A study of orienteering behavior in directed search. In CHI, 2004.
[29] C. Trattner, D. Helic, P. Singer, and M. Strohmaier. Exploring the differences and similarities between hierarchical decentralized search and human navigation in information networks. In i-KNOW, 2012.
[30] R. West and J. Leskovec. Human wayfinding in information networks. In WWW, 2012.
[31] R. West, A. Paranjape, and J. Leskovec. Mining missing hyperlinks from human navigation traces: A case study of Wikipedia. In WWW, 2015.
[32] R. West, D. Precup, and J. Pineau. Completing Wikipedia’s hyperlink structure through dimensionality reduction. In CIKM, 2009.
[33] R. White and J. Huang. Assessing the scenic route: Measuring the value of search trails in Web logs. In SIGIR, 2010.
[34] E. Wulczyn and D. Taraborelli. Wikipedia Clickstream. Website, 2015. http://dx.doi.org/10.6084/m9.figshare.1305770 (accessed July 16, 2015).
[35] E. Zachte. Wikipedia statistics. Website, 2015. https://stats.wikimedia.org/EN/TablesWikipediaZZ.htm (accessed July 16, 2015).