
Real Time Personalized Search on Social Networks

Yuchen Li†, Zhifeng Bao∗, Guoliang Li‡, Kian-Lee Tan†
†National University of Singapore   ∗University of Tasmania   ‡Tsinghua University
{liyuchen, tankl}@comp.nus.edu.sg   [email protected]   [email protected]

Abstract—Internet users are shifting from searching on traditional media to social network platforms (SNPs) to retrieve up-to-date and valuable information. SNPs have two unique characteristics: frequent content update and the small world phenomenon. However, existing works are not able to support these two features simultaneously. To address this problem, we develop a general framework to enable real time personalized top-k queries. Our framework is based on a general ranking function that incorporates time freshness, social relevance and textual similarity. To ensure efficient update and query processing, there are two key challenges. The first is to design an index structure that is update-friendly while supporting instant query processing. The second is to efficiently compute the social relevance in a complex graph. To address these challenges, we first design a novel 3D cube inverted index to support efficient pruning on the three dimensions simultaneously. Then we devise a cube based threshold algorithm to retrieve the top-k results, and propose several pruning techniques to optimize the social distance computation, whose cost dominates the query processing. Furthermore, we optimize the 3D index via a hierarchical partition method to enhance our pruning on the social dimension. Extensive experimental results on two real world large datasets demonstrate the efficiency and the robustness of our proposed solution.

I. INTRODUCTION

With the rise of online social networks, Internet users are shifting from searching on traditional media to social network platforms (SNPs) to retrieve up-to-date and valuable information [1]. For example, one may search on Twitter to see the latest news and comments on Malaysia Airlines flight MH17. However, searching SNP data is rather challenging due to its unique properties: not only containing user-generated contents from time to time, but also having a complex graph structure. In particular, there are two distinguishing characteristics:

• High Update Rate: Twitter has over 288M active users with 400 million posts per day in 2013¹.
• Small World Phenomenon: People in the social network are not very far away from each other. For Facebook, the average degrees of separation between users are 4.74². Moreover, unlike traditional road networks where each vertex has a small degree, the vertex degree of a social network follows the power law distribution [2].

Top-k keyword search is an important tool for users to consume Web data. However, existing works on keyword search over social networks [3], [4], [5], [6], [7] usually ignore one or many of the above characteristics of SNPs and lead to several drawbacks, e.g. returning the user meaningless or outdated search results, low query performance, or providing global major trends but not personalized search results, which may easily result in biased views [8]. Thus, it calls for a new and general framework to provide real time personalized search over social network data that leverages all the unique characteristics of SNPs.

As a preliminary effort to support the above two characteristics, we focus on the three most important dimensions on SNPs: Time Freshness, Social Relevance and Text Similarity. Time Freshness is essential, as outdated information means nothing to a user in a highly dynamic SNP. Social Relevance must be adopted in the search framework as well: leveraging social relationships to rank results will greatly improve the search experience, as people tend to trust those who are "closer", and will also enable more user interactions. E.g., a basketball fan is more likely to respond to a post on "NBA Finals" posted by a friend than by unfamiliar strangers. Lastly, Text Similarity is the fundamental dimension where keywords are used to distinguish the results from available records.

There are many challenges to support the search over these three dimensions. First, it is challenging to design an index structure that is update-friendly while supporting powerful pruning for instant query processing. Specifically, when a new record is posted, it must be made available immediately in the search index rather than being periodically loaded. The second challenge lies in the query evaluation based on the index; in particular, how to enable an efficient computation along the social dimension, whose cost dominates its counterparts on the other two dimensions. The social distance is usually modeled as the shortest distance on the social graph [9], [10], [6], [5], [7]. Existing solutions either compute the distance on-the-fly [7] or pre-compute all-pairs distances [5]. The first is extremely inefficient for large networks, which renders it unacceptable for real time response, while the latter requires prohibitively large storage. A natural way is to develop query processing techniques based on an index of reasonable size. However, existing distance indices are unable to handle massive social networks because they neglect at least one of the unique characteristics of SNPs.

To address these challenges, we present a novel solution to support real time personalized top-k queries on SNPs. The contributions of this paper are summarized as follows:

• We present a 3D cube inverted index to support efficient pruning on the three dimensions (time, social, textual) simultaneously. Such an index is update-efficient, while flexible in size w.r.t. different system preferences.
• We design a general ranking function that caters to various user preferences on the three dimensions, and devise a cube threshold algorithm (CubeTA) to retrieve the top-k results in the 3D index by this ranking function.
• We propose several pruning techniques to accelerate the social distance computation; a single distance query can be answered in less than 1µs on average for a graph with 10M vertices and over 230M edges.
• We optimize the 3D index via a hierarchical partition method to enhance the pruning on the social dimension. A deeper partition tree leads to better query performance, and the flexibility of the index allows the space occupied to be the same as the basic 3D index.
• We conduct extensive experimental studies on two real-world large datasets: Twitter and Memetracker. Our proposed solution outperforms the two baselines with average 4-8x speedups for most of the experiments.

¹http://www.statisticbrain.com/twitter-statistics/
²http://www.facebook.com/DegreesApp/

Our framework has a general design of ranking function, index and search algorithm. It enables opportunities to support personalized-only (regardless of results' freshness) and real-time-only (regardless of social locality) search. It is also a good complement to the global search provided by existing SNPs. Our vision is: search results are not the end of the story; instead they will be fed into data analytic modules, e.g. sentiment analysis, to ultimately support real-time analytics.

We present related work in Sec. II and the problem definition in Sec. III. We propose the 3D inverted index in Sec. IV and CubeTA in Sec. V. In Sec. VI we present several pruning methods to speed up the distance query evaluation. We propose a hierarchical partition scheme for our 3D inverted index to optimize CubeTA in Sec. VII, and report experimental results in Sec. VIII. Finally we conclude in Sec. IX.

II. RELATED WORK

Search on Social Network Platforms. Facebook developed Unicorn [11] to handle large-scale query processing. While it supports socially related search, the feature is only available for predefined entities rather than for arbitrary documents. Moreover, Unicorn is built on list operators like LISTAND and LISTOR to merge search results. This can be very costly, especially for real time search, because the search space is huge. Twitter's real time query engine, Earlybird [3], has also been reported to offer high throughput query evaluation for a fast rate of incoming tweets. Unfortunately, it fails to consider social relationships. Therefore, our proposed method can complement existing engines by efficiently handling real time search with social relevance.

Search on Social Networks. Several research works have been proposed for real time search indices over SNPs. Chen et al. introduced a partial index named TI to enable instant keyword search for Twitter [4]. Yao et al. devised an index to search for microblog messages ranked by their provenance in the network [12]. However, none of them offers customized search for the query user. Although indices on social tagging networks offer the social relevance feature [5], [6], [7], they rely on static inverted lists that sort documents by keyword frequencies to perform efficient pruning. These indices cannot handle a high rate of document ingestion, because maintaining sorted lists w.r.t. keyword frequencies requires a huge number of random I/Os. Besides, [6] only considers the documents posted by a user's direct friends, while we consider all documents in SNPs. Thus, there is a need to design novel indices that are update-efficient and support efficient search with the social relevance feature.

Distance Indices on Social Networks. Road network distance queries have been well studied in [13], [14]. However, these techniques do not work on large social networks because the vertex degree in road networks is generally constant, whereas in social networks it is dynamic due to the power law property. Existing distance indices for social networks [15], [16], [17], [18] cannot be applied to our scenario for several reasons. First, the schemes in [15], [16], [17] assume un-weighted graphs and are not able to handle weighted graphs. Second, the mechanisms in [15], [16] are only efficient for social graphs with the low tree-width property. Unfortunately, as reported in [19], the low tree-width property does not hold in real world social graphs; we also made this observation in our real datasets. Lastly, personalized top-k queries require one-to-many distance queries, whereas the methods in [17], [16] are applicable only to one-to-one queries and the solution in [18] only supports one-to-all queries. It is hard to extend these schemes to fit our case. This motivates us to design pruning methods to overcome the distance query problem on large social networks.

III. PROBLEM DEFINITION

Data Model. Let G = (V, E) be an undirected social graph, where V is the set of vertices representing users in the network and E is the set of edges in G denoting social links. For each vertex v, R_v is a set of records attached to v (e.g., microblogs published by the user). A record r ∈ R_v is formulated as r = ⟨r.v, r.W, r.t⟩ where:

• r.v shows which vertex this record belongs to.
• r.W is the set of keywords contained in r.
• r.t denotes the time when the record is published by r.v.

Each edge is associated with a weight, which quantifies the social distance between the two vertices. In this paper, we adopt the Jaccard Distance, which has been widely used [20], [21], [2]; however, our method can be easily extended to other functions. The weight between user v and its neighbor v', denoted as D(v, v'), is the Jaccard Distance between their neighbor sets N(v) and N(v'), where N(x) = {n | (x, n) ∈ E}:

$$D(v, v') = 1 - \frac{|N(v) \cap N(v')|}{|N(v) \cup N(v')|} \quad \text{s.t. } (v, v') \in E \qquad (1)$$

Query Model. A top-k query q on the social graph G is represented as a vector q = ⟨q.v, q.W, q.t, q.k⟩ where:

• q.v is the query user.
• q.W is the set of query keywords.
• q.t is the time when the query is submitted.
• q.k is the number of desired output records.

Ranking Model. Given a query q, our objective is to find a set of q.k records with the highest relevance. To quantify the relevance between each record and the query, we consider the aspects described below.

Fig. 1: A social network example including the records posted. Records are ordered by time from old to recent (used throughout this paper). The record table of Fig. 1 is reproduced below:

| rid:user | TS  | Keywords                    |
|----------|-----|-----------------------------|
| r0:u10   | 0.1 | ("icde",0.8), ("nus",0.1)   |
| r1:u1    | 0.1 | ("icde",0.9)                |
| r2:u7    | 0.1 | ("icde",0.1), ("nus",0.5)   |
| r3:u2    | 0.2 | ("icde",0.6), ("nus",0.2)   |
| r4:u3    | 0.3 | ("icde",0.7), ("nus",0.2)   |
| r5:u5    | 0.4 | ("icde",0.4)                |
| r6:u4    | 0.6 | ("icde",0.8)                |
| r7:u7    | 0.7 | ("icde",0.8), ("nus",0.1)   |
| r8:u2    | 0.7 | ("icde",0.7)                |
| r9:u5    | 0.8 | ("icde",0.2), ("nus",0.2)   |
| r10:u9   | 0.9 | ("icde",0.1), ("nus",0.4)   |
| r11:u11  | 1.0 | ("icde",0.7)                |

Fig. 2: 3D inverted list for "icde".
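To make the edge-weight definition concrete, here is a small Python sketch of Equation 1. The adjacency-set representation and the function name are illustrative assumptions, not the paper's implementation:

```python
def jaccard_distance(neighbors, v, v_prime):
    """Edge weight D(v, v') from Equation 1: the Jaccard distance
    between the neighbor sets N(v) and N(v')."""
    nv, nv2 = neighbors[v], neighbors[v_prime]
    return 1.0 - len(nv & nv2) / len(nv | nv2)

# Toy usage: an undirected graph stored as adjacency sets.
neighbors = {
    "u1": {"u2", "u3"},
    "u2": {"u1", "u3", "u4"},
    "u3": {"u1", "u2"},
    "u4": {"u2"},
}
print(jaccard_distance(neighbors, "u1", "u2"))  # 1 - 1/4 = 0.75
```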

(1) Social Relevance: The social distance between two vertices v and v' is adopted as the shortest distance on the social graph [9], [10], [6], [5]:

$$SD(v, v') = \min_{path\; v=v_0 \ldots v_k=v'} \; \sum_{i=0}^{k-1} D(v_i, v_{i+1}) \, / \, maxDist \qquad (2)$$

The social relevance is computed as SR = max(0, 1 − SD), where maxDist is the user-tolerable maximal distance.

(2) Textual Relevance: We adopt the well-known tf-idf based approach [22]. Let tf_{w,r} denote the frequency of keyword w in r, and let idf_w be the inverse frequency of w in the entire document collection. We represent textual relevance as a cosine similarity between q.W and r.W:

$$TS(q.W, r.W) = \sum_{w \in q.W} tf_{w,r} \cdot idf_w \qquad (3)$$

Specifically, $tf_{w,r} = z_{w,r} / (\sum_{w \in r.W} z_{w,r}^2)^{\frac{1}{2}}$, where $z_{w,r}$ is the number of occurrences of w in r.W; and $idf_w = z_w / (\sum_{w \in q.W} z_w^2)^{\frac{1}{2}}$ with $z_w = \ln(1 + |R| / df_w)$, where |R| is the total number of records posted and df_w gives the number of records that contain w.

(3) Time Relevance: The time freshness score TF is the normalized time difference between q.t and r.t. In particular, let t_min be the pre-defined oldest system time; then

$$TF(q.t, r.t) = \frac{r.t - t_{min}}{q.t - t_{min}} \qquad (4)$$

Overall Ranking Function. Now, with the social, textual and time relevances normalized to [0, 1], our overall ranking function is a linear combination of these three components:

$$\mathcal{R}(q, r) = \alpha \, TS(q.W, r.W) + \beta \, SR(q.v, r.v) + \gamma \, TF(q.t, r.t) \qquad (5)$$

where α, β, γ ∈ [0, 1] are user preference parameters for general weighting functions. Tuning the parameters is a well-studied problem in information retrieval, and a widely-adopted solution is to use user click-throughs [23]. In this paper we focus on devising efficient algorithms for query processing and leave the parameter tuning as future work.

Example 1. Fig. 1 is an example social network where all the records posted are listed. For each record r, its user id (UID), time score (TS), keywords and their frequencies are included. Suppose α = β = γ = 1 and u1 expects to get the top-1 record that contains "icde". By Equation 5, r11 is the desired result, as $\mathcal{R}(q_{u_1}, r_{11}) = 1.0 + (1.0 - 0.4) + 0.7 = 2.3$ is the maximum value among all records.
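As a concrete reading of Equations 3-5, the sketch below scores one record against a query in plain Python. It is an illustration only: the record/query layout, the global statistics (num_records, doc_freq) and all names are assumptions, not the paper's implementation.

```python
import math

T_MIN = 0.0  # pre-defined oldest system time (illustrative)

def text_score(query_words, rec_words, num_records, doc_freq):
    """Cosine-style tf-idf similarity from Equation 3.
    rec_words maps keyword -> raw count z_{w,r}."""
    tf_norm = math.sqrt(sum(z * z for z in rec_words.values()))
    zw = {w: math.log(1 + num_records / doc_freq[w]) for w in query_words}
    idf_norm = math.sqrt(sum(z * z for z in zw.values()))
    return sum((rec_words.get(w, 0) / tf_norm) * (zw[w] / idf_norm)
               for w in query_words)

def overall_score(q, r, alpha, beta, gamma, sd, num_records, doc_freq):
    """Equation 5: weighted sum of textual, social and time relevance,
    given a pre-computed social distance sd = SD(q.v, r.v)."""
    ts = text_score(q["W"], r["W"], num_records, doc_freq)
    sr = max(0.0, 1.0 - sd)                   # social relevance
    tf = (r["t"] - T_MIN) / (q["t"] - T_MIN)  # Equation 4
    return alpha * ts + beta * sr + gamma * tf
```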
IV. 3D INDEX

To support the high update rate of SNPs, we propose a memory based 3D inverted list. Time is the primary dimension; the textual and social dimensions are also included to support efficient pruning. Before introducing the 3D list design, we first present how to handle the social dimension, as it is not straightforward to order the records according to the social relevance without knowing who the query user is.

Graph Partition. Given a social graph G like Fig. 1, we first divide the vertices into c partitions using k-way partition [24] with minimum cut utility. For each partition P_i, a pivot vertex pn_i is selected among the vertices in P_i, and the shortest distance from pn_i to each vertex in G is pre-computed. In addition, for any two partitions P_i, P_j, we pre-compute SD(P_i, P_j) = min_{x∈P_i, y∈P_j} SD(x, y). This index is used for estimating the social distance between the query user q.v and a partition. As the partitions along the social dimension do not have an ordering independent of the query, we rank P_i by SD(P_{q.v}, P_i) at query time, where P_{q.v} is the partition containing q.v. The space complexity of the index for social distance is O(|V| + c²).

3D Inverted List. Each keyword w is associated with a 3D list of the records that contain w. From Example 1, Fig. 2 shows the 3D list for keyword "icde". The primary axis of the 3D list is time, from latest to oldest, which is divided into slices. Each time slice contains a 2D grid consisting of the social and textual dimensions. The social dimension corresponds to the c partitions constructed on G. The textual dimension is discretized into m intervals w.r.t. the keyword frequencies tf by using equi-depth partition on historical data [25]. This results in a cube structure of the 3D list, where each cube may contain zero or many records. Note that in Fig. 2, the time and textual dimensions are sorted offline, whereas the social partitions will only be sorted against user u1 when u1 issues a query at runtime in Example 1.

To support efficient retrieval of records from the 3D list (as discussed later in Sec. V), we define the neighbours of a cube:

Definition 1. We assign each cube a triplet (x, y, z), where x, y, z refer to the time, social and textual dimensions. The three neighbours of cube cb(x, y, z) are cb(x−1, y, z), cb(x, y−1, z) and cb(x, y, z−1) respectively.

For example, in Fig. 2 there are 18 cubes and the dimension indices (x, y, z) are listed in the first row of each cube. The cube cb(1, 1, 1), which contains r11, has neighbours cb(0, 1, 1), cb(1, 0, 1) and cb(1, 1, 0) along the time, social and textual dimensions respectively. To avoid storing empty cubes in the 3D list, we deploy a B+-tree for each time slice as shown in Fig. 2. In each B+-tree, the value of a leaf node represents a cube's linearized dimension index in the 2D grid: suppose a cube has a grid index entry (y, z), where y and z represent the entries of the social and textual dimensions respectively; then a leaf node with value y·m + z is added to the B+-tree. To facilitate random access, each B+-tree keeps a reference to the leaf node with index (y, max(z)) for each partition y.

Index Update. When a new record r arrives, it is inserted into w's 3D list ∀w ∈ r.W. r is always allocated to the latest time slice and mapped to the corresponding cube. When the number of records in the latest time slice exceeds a system determined threshold, a new time slice is created. Let us take r6 as an example and assume the time slice threshold is 6. When r6 arrives, there are already 6 records in time slice 0 (see Fig. 2), so slice 1 is created, into which r6 is inserted. In the social dimension, since r6 belongs to user u4 (see Fig. 1), r6 goes to partition 1. In the textual dimension, "icde" has a keyword frequency of 0.8 in r6, which falls in the frequency interval [1, 0.7). Finally, r6 is inserted into the cube cb(1, 0, 2), which maps to leaf node 2 after linearization. The B+-tree is updated accordingly.

Complexity. Since the record is inserted into a B+-tree and each B+-tree has at most c·m leaf nodes, the complexity of inserting a new record is just O(|r.W| · log(c·m)).

Forward Index. To enable efficient random access, a forward list is built for each record r. Each list stores the keywords in r and their keyword frequencies; the user who posts r and the time posted are stored as well. The table in Fig. 1 is an example of the forward lists for the social network.

V. CUBETA ALGORITHM

Given the 3D list design, we are now ready to present our basic query evaluation scheme, CubeTA (Algorithm 1). CubeTA extends the famous TA algorithm [26] by introducing a two-level pruning upon the 3D list to further speed up the query processing, i.e. at the record level and at the cube level. We maintain two data structures in CubeTA: (1) the cube queue CQ, which ranks the cubes by their estimated relevance scores (computed by the EstimateBestScore function in line 15); (2) the min heap H, which maintains the current top-k candidates. The main workflow is as follows: in each iteration, we first access the 3D list of each keyword and get the cube cb with the best estimated score among all unseen cubes in CQ (line 4). Next we evaluate all the records stored in cb (lines 8-12), then we keep expanding the search to the three neighbours of cb (lines 13-16), until the current top-k records are more relevant than the next best unseen cube in CQ.

Algorithm 1: CubeTA
  Input: Query q = ⟨q.v, q.W, q.t, q.k⟩; CubeQueue CQ, initialized by inserting the first cube of each of q.W's inverted lists.
  Output: Top q.k records that match the query q.
  1   MinHeap H                      /* q.k best records */
  2   ε ← 0                          /* q.k-th record's score */
  3   while !CQ.empty() do
  4       Cube cb ← CQ.pop()
  5       if EstimateBestScore(q, cb) < ε then
  6           return H
  7       else
  8           foreach record r in cb do
  9               if r has not been seen before then
  10                  if GetActualScore(q, r, ε) > ε then
  11                      H.pop() and push r to H
  12                      ε ← H.top()'s score w.r.t. q
  13          foreach of the three neighbour cubes nc of cb do
  14              if nc has not been seen before then
  15                  if EstimateBestScore(q, nc) > ε then
  16                      Push nc to CQ

Following Equation 5 for the score of a record, Equation 6 illustrates how EstimateBestScore estimates the score of a cube cb:

$$\mathcal{R}(q, cb) = |q.W| \cdot \Big( \alpha \, TS_{cb} + \frac{\beta \, SR_{cb} + \gamma \, TF_{cb}}{|q.W|} \Big) \qquad (6)$$

The social score of cb is SR_cb = 1 − SD(P_{q.v}, P_cb), where P_{q.v} is the partition containing the query user and P_cb is the partition containing the cube cb. The time freshness TF_cb and the text similarity TS_cb are the maximum values of cb's time interval and frequency interval. It is easy to see that the estimated score ℛ(q, cb) is an upper bound for all the unseen records in the cube, so if it is still smaller than the current k-th best record's score ε, we can simply terminate the search and conclude that the top-k results are found (lines 5-6). This stopping condition is presented in Theorem 1.

Theorem 1. Let cb be the next cube popped from CQ. The score estimated by Equation 6 is an upper bound on the score of any unseen record in the 3D lists of all query keywords q.W.

Proof: Let r_x be any record that exists in any of the 3D lists and whose score has not been computed. Given a query q, let δ = β·SR(q.v, r_x.v) + γ·TF(q.t, r_x.t) and δ_w = α · tf_{w,r_x} · idf_w for w ∈ q.W. The overall score ℛ(q, r_x) of r_x w.r.t. q is:

$$\mathcal{R}(q, r_x) = \delta + \sum_{w \in q.W} \delta_w = \sum_{w \in q.W} \Big( \delta_w + \frac{\delta}{|q.W|} \Big) = \sum_{w \in q.W \cap r_x.W} \Big( \delta_w + \frac{\delta}{|q.W|} \Big) + \frac{|q.W \setminus r_x.W|}{|q.W|} \, \delta \qquad (7)$$

But note that r_x must exist in one of the 3D lists, say the list of w*. It then follows from Equation 7 that

$$\mathcal{R}(q, r_x) \le |q.W| \cdot \Big( \delta_{w^*} + \frac{\delta}{|q.W|} \Big) \le |q.W| \cdot \max_{cb} \Big( \alpha \, TS_{cb} + \frac{\beta \, SR_{cb} + \gamma \, TF_{cb}}{|q.W|} \Big) \qquad \blacksquare$$

GetActualScore (Algorithm 2) computes the exact relevance of a given record. With the forward list mentioned in Sec. IV, we can compute the exact text similarity and time freshness. Since we have the k-th best score ε among the evaluated records, a lower bound for the social relevance (i.e. an upper bound for the social distance) can be computed for the current record r before evaluating the distance query (line 3). This bound enables efficient pruning; how to compute the exact social relevance score is discussed in Section VI. Example 2 shows a running example of how CubeTA works.

Algorithm 2: GetActualScore
  Input: Query q, record r and threshold ε
  Output: ℛ(q, r) if ℛ(q, r) > ε, and −1 otherwise
  1   Compute TF and TS using the forward list of r
  2   Compute the social relevance lower bound min SR_r = (ε − α·TS − γ·TF) / β
  3   SR_r ← 1 − ShortestDistance(q.v, r.v, min SR_r)
  4   return α·TS + β·SR_r + γ·TF
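To illustrate the cube-level bookkeeping, here is a minimal Python sketch of the cube linearization (leaf value y·m + z) and the Equation 6 upper bound. The object layout and field names are assumptions for illustration; the paper stores these values inside per-slice B+-trees.

```python
from types import SimpleNamespace

def leaf_value(y, z, m):
    """Linearize a (social, textual) grid entry into a B+-tree leaf key."""
    return y * m + z

def estimate_best_score(q_num_words, alpha, beta, gamma, cube):
    """Equation 6: upper bound for every unseen record in the cube.
    cube.ts / cube.tf are the max values of its frequency/time intervals;
    cube.sr = 1 - SD(P_{q.v}, P_cube) is filled in at query time."""
    return q_num_words * (alpha * cube.ts
                          + (beta * cube.sr + gamma * cube.tf) / q_num_words)

cb = SimpleNamespace(ts=0.8, tf=1.0, sr=0.6)
print(estimate_best_score(1, 1.0, 1.0, 1.0, cb))  # 2.4 for this toy cube
```

CubeTA pops whichever cube maximizes this bound and stops as soon as the bound drops below the current k-th best score ε (Theorem 1).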

Example 2. By Example 1, the social partitions are sorted by their distances to u1 when u1 issues the query; Fig. 2 shows the reordered 3D list of keyword "icde". The detailed steps of CubeTA are illustrated in Fig. 3. In iteration 1, cube cb(1, 2, 2) is evaluated first. Since no record is in cb(1, 2, 2), the three neighbours of cb(1, 2, 2) are inserted into the cube queue. In iteration 2, cb(1, 1, 2) is popped from the cube queue and, before expanding its neighbour cubes into the queue, we evaluate r7 and insert it into the candidate heap. This procedure terminates at iteration 5, because r11 is found to have a score equal to the best cube score in the cube queue.

Fig. 3: Example of CubeTA (the highlighted records are the ones evaluated in their current iteration). Its iteration table is reproduced below:

| Iteration | Cube popped | Candidate evaluated (score)     | minSR | maxSD | Best cube in CQ: score | Top-1 |
|-----------|-------------|---------------------------------|-------|-------|------------------------|-------|
| 1         | (1,2,2)     | —                               | —     | —     | (1,1,2): 2.8           | —     |
| 2         | (1,1,2)     | r7: 0.7+(1−0.6)+0.8 = 1.9       | 0     | 1     | (1,0,2): 2.6           | r7    |
| 3         | (1,0,2)     | r6: 0.6+(1−0.7)+0.8 = 1.7       | 0.5   | 0.5   | (1,2,1): 2.6           | r7    |
| 4         | (1,2,1)     | r8: 0.7+(1−0.2)+0.7 = 2.2       | 0.4   | 0.6   | (1,1,1): 2.5           | r8    |
| 5         | (1,1,1)     | r11: 1+(1−0.4)+0.7 = 2.3        | 0.2   | 0.8   | (0,2,2): 2.3           | r11   |

Efficient Cube Traversal. As discussed in Sec. IV, traversing the 3D list may encounter many empty cubes. To speed up the traversal in CubeTA (line 13), we define the boosting neighbours for a cube cb(x, y, z):

Definition 2. Suppose there are c social partitions and m frequency intervals. The boosting neighbours of cb(x, y, z) are:
• cb(x−1, y, z) if y = c ∧ z = m;
• cb(x, y−1, z) if z = m;
• cb(x, y, max{z' | z' < z ∧ cb(x, y, z') ≠ ∅}) otherwise.

Theorem 2. CubeTA returns the correct top-k results when traversing the 3D list using the boosting neighbours.

Proof: Since the first cube to access in the 3D list is cb(l, c, m), we need to prove that any cb(x, y, z) is reachable from cb(l, c, m). Along the textual dimension, cb(x, y, z) is reachable from cb(x, y, m); cb(x, y, m) is reachable from cb(x, c, m) along the social dimension; lastly, cb(x, c, m) is reachable from cb(l, c, m) along the time dimension. ∎

VI. SOCIAL-AWARE PRUNING

In CubeTA, to process a query issued by a user v, a large number of distance queries need to be evaluated between v and the users who post records containing the query keywords in GetActualScore. In order to accelerate such one-to-many distance query evaluation, we develop a series of optimizations that operate in three categories: (1) quick computation of the social relevance scores for records that cannot be pruned; (2) immediate pruning of records that are not among the final top-k records; (3) fast construction of the pruning bound in the initial rounds of the graph traversal, to facilitate pruning as early as possible.

A. Observation

First, we highlight an observation on real SNP datasets that motivates our pruning idea. Fig. 4 shows the distance distribution, i.e. the percentage of vertices that have the corresponding social distances to a randomly sampled vertex in the social network. The distribution is constructed by uniformly sampling 1000 vertices from the Twitter dataset, which is extensively studied in our experiments. We observe that there exist layers of vertices with different distances from a random vertex. Due to the small world property, the number of layers is usually small. This finding has also been verified in the news dataset that we have studied. Moreover, this is consistent with the real life context, where close friends are rare and the rest just scatter around, which means our defined social distance measure may enable more pruning power on real world SNPs. If we avoid computing the distances between the source vertex and the vertices that are positioned at these layers, the performance will be greatly improved.

Fig. 4: The averaged distance distribution of the Twitter dataset (a histogram of node distances, fitted by a mixture of Gaussians).

B. Baseline Solution: Direct Pruning

The baseline extends Dijkstra's algorithm (DA), which maintains a priority queue PQ of vertices ordered by their current distance estimates SD*(v, ·) and a set S of vertices whose exact distances have been determined. For a record r_u evaluated in GetActualScore, the bound of Algorithm 2 (line 2) translates into max SD_{r_u} = 1 − min SR_{r_u}, the largest social distance under which r_u can still enter the current top-k results (see the minSR/maxSD columns in Fig. 3). Direct pruning stops the traversal for r_u as soon as mdist, the distance of the last vertex popped from PQ, reaches max SD_{r_u}: the author of r_u is then socially far away from v, and the record should be pruned from the current top-k result set.

C. In-circle Pruning

Since DA traverses the graph in a closest-vertex-first fashion, we must avoid traversing to any vertex that has a large social distance from the query user v, so that the layers in Fig. 4 are never reached. Otherwise, evaluating just one distance query may require traversing the entire social graph, and the performance can be as bad as the original DA.

1) Early-Determination: Let min_{u'} SD(u, u') denote the distance from the target vertex u to its nearest neighbour u'. Since any path from v to u that could still improve the estimate SD*(v, u) must pass through a vertex remaining in PQ and end with an edge into u, its length is at least mdist + min_{u'} SD(u, u'). This yields:

Theorem 3. If mdist + min_{u'} SD(u, u') ≥ SD*(v, u), then SD(v, u) = SD*(v, u).

Early-determination thus lets the traversal stop and return SD(v, u) = SD*(v, u) as soon as the condition holds, instead of waiting for u itself to be popped from PQ.

Example 4. Continuing Example 2, consider computing SD(u1, u7) when r7 is evaluated. The traversal pops vertices in increasing distance from u1 and stops at u11: at that point mdist plus the nearest-neighbour distance of u7 already reaches SD*(u1, u7), so SD(u1, u7) = 0.6 is determined after visiting six vertices.

2) Early-Pruning: For a record r_u, early-determination only computes SD(v, u) without exploiting max SD_{r_u} for pruning r_u. In Fig. 4, suppose we want to evaluate r_u where SD(v, u) = 0.8 and the nearest neighbour of u has a distance of 0.1 from u; then the processing by early-determination is slow, because SD(v, u) cannot be determined until we pop a vertex from PQ that has a distance of 0.7 from v. This motivates us to propose the second pruning method, namely early-pruning, as outlined in category (2), to complement the early-determination method. Like early-determination, which estimates a lower bound that SD*(v, u) could be updated to by vertices in PQ, early-pruning prunes the record r_u by exploiting max SD_{r_u}, as described in Theorem 4.

Theorem 4. If mdist + min_{u'} SD(u, u') ≥ max SD_{r_u} and SD*(v, u) ≥ max SD_{r_u}, then SD(v, u) ≥ max SD_{r_u}.

Proof: If mdist + min_{u'} SD(u, u') ≥ max SD_{r_u} then, similar to Theorem 3, SD*(v, u) will not be updated to a distance smaller than max SD_{r_u} by any remaining vertices in PQ. Thus SD(v, u) < max SD_{r_u} is possible only if SD*(v, u) < max SD_{r_u}. So by ensuring SD*(v, u) ≥ max SD_{r_u}, we can determine SD(v, u) ≥ max SD_{r_u}. ∎

Example 5. From Example 4, the traversal stops at u11 after evaluating r7. The next record to compute is r6, posted by u4. SD*(u1, u4) = 0.7, as u2 ∈ S and u2 is a neighbour of u4. We also know max SD_{r6} = 0.5 from Fig. 3. However, we cannot early-determine SD(u1, u4) at u11, because SD(u1, u11) + min_{u'} SD(u4, u') = 0.6 < SD*(u1, u4) = 0.7. Instead, we can use early-pruning to eliminate r6, because SD(u1, u11) + min_{u'} SD(u4, u') > max SD_{r6} and SD*(u1, u4) > max SD_{r6}.

3) Cold Start: This refers to scenarios where max SD is not small enough for efficient pruning at the early stage of query processing. Suppose the first record to evaluate is posted by the furthest vertex in the social graph w.r.t. v; early-pruning cannot help, as the pruning bound is trivial. Although early-determination can reduce the query time to some degree, it is still highly likely to visit almost every vertex.

Thus, we propose the third pruning method, called warm-up queue, which aligns with the optimization of category (3). The warm-up queue WQ is meant to evaluate the records that are nearer to v first, so as to obtain a decent bound for further pruning. WQ is constructed as follows: we push a number λ of records to WQ before computing any social distance. WQ ranks the records by an estimated distance, computed by exploiting the pivot vertices as transit vertices between v and the authors of the records. When the size of WQ exceeds λ, all records in WQ are popped out and their exact scores are computed, followed by the original CubeTA.

A key problem is to determine λ. We wish that, among the λ records in WQ, there are at least q.k records whose social distances to v are smaller than the leftmost layer in the vertex-distance distribution. Based on the observation from Fig. 4, we model the vertex-distance distribution as a mixture of Gaussians, i.e. a weighted sum of M normal distributions $p(x) = \sum_{i=1}^{M} w_i \, g(x \mid \mu_i, \sigma_i)$, where $g(x \mid \mu_i, \sigma_i)$ is the probability density of a normal distribution with mean µ_i and variance σ_i². In the context of a social network, the number of layers in the vertex-distance distribution is small due to the small world property, which keeps the training complexity low. The model is fit by the standard expectation maximization method. Given the mixture model and a random record r_u, the probability p_opt that SD(q, u) < µ_min, where µ_min is the mean of the leftmost normal component (i.e. µ_min ≤ µ_i, ∀i = 1..M), is:

$$p_{opt} = \int_{-\infty}^{\mu_{min}} \sum_{i=1}^{M} w_i \, g(x \mid \mu_i, \sigma_i) \, dx \qquad (8)$$

Assuming the record authors are independent and identically distributed random variables w.r.t. their distance to v in the social graph, the probability of having at least q.k records among the λ whose social distances to v are smaller than µ_min follows the binomial distribution:

$$p(\lambda) = 1 - \sum_{i=0}^{q.k} \binom{\lambda}{i} \, (1 - p_{opt})^{\lambda - i} \, p_{opt}^{\,i} \qquad (9)$$

In this work we aim to ensure p(λ) > 99.9%, so that in most cases the first q.k records in WQ have social distances less than µ_min.
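Under the stated i.i.d. assumption, λ can be found numerically from Equations 8-9. The following sketch (SciPy-based; the mixture weights and parameters are made-up toy values, not fitted from real data) searches for the smallest λ with p(λ) > 99.9%:

```python
from math import comb
from scipy.stats import norm

def p_opt(weights, means, stds, mu_min):
    """Equation 8: mass of the fitted Gaussian mixture below mu_min."""
    return sum(w * norm.cdf(mu_min, mu, s)
               for w, mu, s in zip(weights, means, stds))

def smallest_lambda(qk, popt, target=0.999, limit=100000):
    """Equation 9: smallest queue size giving >= qk 'close' records w.h.p."""
    for lam in range(qk, limit):
        p_fail = sum(comb(lam, i) * popt**i * (1 - popt)**(lam - i)
                     for i in range(qk + 1))
        if 1 - p_fail > target:
            return lam
    return limit

# Toy two-component mixture (illustrative numbers only).
w, mu, sd = [0.2, 0.8], [0.3, 0.7], [0.05, 0.08]
print(smallest_lambda(qk=5, popt=p_opt(w, mu, sd, mu[0])))
```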

D. Out-of-circle Pruning

In Theorems 3 and 4, the nearest neighbour u' of the target vertex u and its distance to u play a critical role in quickly determining SD(v, u). However, when u' has a very short distance to u, none of the aforementioned pruning techniques is effective. So, by exploiting the nearest 2-hop neighbour of u in the above pruning methods, in particular early-determination and early-pruning, the pruning power can be enhanced compared to the in-circle counterparts. This is because the nearest 2-hop neighbour has a longer distance from u, which results in an earlier determination of SD(v, u). We refer to this approach as out-of-circle pruning, and demonstrate its merit over in-circle pruning in Example 6.

Example 6. By Examples 4 & 5, in-circle early-determination needs to traverse 6 vertices to evaluate SD(u1, u7), while with out-of-circle early-determination we only need to traverse 3 nodes. The nearest 2-hop neighbour of u7 is u3, and SD(u7, u3) = 0.5. Before any traversal, we use u8, which we assume to be the pivot vertex of partition 3, to estimate SD*(u1, u7) = SD(u1, u8) + SD(u8, u7) = 0.6. Then the out-of-circle early-determination takes effect when we reach u2: now SD(u1, u2) + SD(u7, u3) = 0.7 > SD*(u1, u7) = 0.6, so by Theorem 3 we guarantee SD(u1, u7) = 0.6. Furthermore, we can use out-of-circle early-pruning to eliminate r6. Since u4 posted r6, we first identify the nearest 2-hop distance of u4 to be SD(u4, u9) = 0.4. In addition, we know that SD*(u1, u4) = 0.7, since u2 was reached when evaluating r7. Then, by out-of-circle early-pruning, we are sure r6 is not the top-1 candidate at u2, because SD(u1, u2) + SD(u4, u9) > max SD_{r6} and SD*(u1, u4) > max SD_{r6}.

As a result, by enabling the more powerful out-of-circle pruning upon the DA traversal, we present the complete solution for the social distance computation in Algorithm 3. In particular, lines 9-12 extend the idea of Theorems 3 and 4 by replacing the nearest-neighbour distance with the nearest 2-hop distance; lines 4-7 and 20-23 are meant to guarantee the correctness of the pruning: if n' is a neighbour of u, SD*(v, u) gets updated via a path that contains n'. The rest is in line with the original DA: lines 1-2 return the social distance if SD(v, u) has already been determined in S, and lines 13-19 follow the graph traversal and distance estimation of DA.

Algorithm 3: Optimized Distance Query Computation
  Input: Query user v, record r_u, set S, priority queue PQ, max SD_{r_u} and mdist
  Output: SD(v, u) if SD(v, u) < max SD_{r_u}, and −1 otherwise
  1   if u ∈ S then
  2       return SD(v, u) < max SD_{r_u} ? SD(v, u) : −1
  3   u'' ← nearest 2-hop neighbour of u
  4   for (u, u') ∈ E do
  5       if SD*(v, u') + SD(u', u) < SD*(v, u) then
  6           SD*(v, u) ← SD*(v, u') + SD(u', u)
  7           Update SD*(v, u) of u in PQ
  8   while PQ is not empty do
  9       if mdist + SD(u, u'') ≥ SD*(v, u) then
  10          return SD*(v, u) < max SD_{r_u} ? SD*(v, u) : −1
  11      if mdist + SD(u, u'') ≥ max SD_{r_u} ∧ SD*(v, u) ≥ max SD_{r_u} then
  12          return −1
  13      Vertex n ← PQ.pop()
  14      mdist ← SD*(v, n)
  15      S ← S ∪ {n}
  16      for (n', n) ∈ E and n' ∉ S do
  17          if SD*(v, n) + SD(n, n') < SD*(v, n') then
  18              SD*(v, n') ← SD*(v, n) + SD(n, n')
  19              Update SD*(v, n') of n' in PQ
  20          if (n', u) ∈ E then
  21              if SD*(v, n') + SD(n', u) < SD*(v, u) then
  22                  SD*(v, u) ← SD*(v, n') + SD(n', u)
  23                  Update SD*(v, u) of u in PQ

Example 7. In Example 6, we visit u1, u3, u2 in order after evaluating r7 and r6, while r8 and r11 remain to be evaluated. The total score of r8 is 2.2, and we continue to evaluate r11, posted by u11. Adopting the same assumption from Example 6, we use u8 as the pivot vertex of partition 3 and obtain SD*(u1, u11) = SD(u1, u8) + SD(u8, u11) = 0.8. According to lines 4-7, we must update SD*(u1, u11) = SD*(u1, u10) + SD(u10, u11) = 0.4 when we traverse to u3, as u10 is a neighbour of both u11 and u3. If we did not perform lines 4-7, SD*(u1, u11) would remain 0.8 instead of 0.4. Since r8 is the current top-1 candidate, max SD_{r11} = 0.5; as the traversal stops at u2 and the nearest 2-hop neighbour of u11 is u3, the out-of-circle early-pruning would then eliminate r11, because SD(u1, u2) + SD(u3, u11) = 0.5 ≥ max SD_{r11} and SD*(u1, u11) > max SD_{r11}. But r11 is the ultimate top-1 result, which must not be pruned. Lines 20-23 in Algorithm 3 exist for a similar reason, to guarantee the correctness of the out-of-circle pruning.
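For readers who prefer running code, below is a compact Python rendering of Algorithm 3's main loop under simplifying assumptions: dist2hop[u] is the pre-computed nearest 2-hop distance, sd_est plays the role of SD*(v, ·), and vertex ids are strings. It is a sketch of the early-determination and early-pruning checks, not the authors' implementation.

```python
import heapq

def bounded_distance(graph, sd_est, visited, pq, u, max_sd, dist2hop):
    """Dijkstra from v (state carried in sd_est/visited/pq) that stops as
    soon as SD(v, u) is determined (Thm 3) or u is prunable (Thm 4).
    Returns the distance, or None when the record can be pruned."""
    INF = float("inf")
    if u in visited:
        return sd_est[u] if sd_est[u] < max_sd else None
    # Lines 4-7 of Algorithm 3: refresh u's estimate via already-settled
    # neighbours left over from earlier distance queries, then re-queue it.
    for n2, w in graph[u]:
        if n2 in visited and sd_est[n2] + w < sd_est.get(u, INF):
            sd_est[u] = sd_est[n2] + w
            heapq.heappush(pq, (sd_est[u], u))
    while pq:
        mdist = pq[0][0]                    # distance of the next pop
        du = sd_est.get(u, INF)
        if mdist + dist2hop[u] >= du:       # early-determination (Thm 3)
            return du if du < max_sd else None
        if mdist + dist2hop[u] >= max_sd and du >= max_sd:
            return None                     # early-pruning (Thm 4)
        d, n = heapq.heappop(pq)
        if n in visited or d > sd_est.get(n, INF):
            continue                        # stale heap entry
        visited.add(n)
        for n2, w in graph[n]:              # relax edges as in plain DA
            if d + w < sd_est.get(n2, INF):
                sd_est[n2] = d + w
                heapq.heappush(pq, (d + w, n2))
    du = sd_est.get(u, INF)
    return du if du < max_sd else None
```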

Out-of-circle pruning requires the pre-computation of the nearest 2-hop distance for each vertex, which is retrieved using DA. The worst-case time complexity is O(|V| + |E|), and the process can easily be parallelized. The space complexity is O(|V|), which brings almost no overhead for out-of-circle pruning compared to in-circle pruning. One may wonder whether the pruning can be further extended using the 3-hop nearest distance and beyond. The answer is that it brings more complexity in both time and space. If we use the 3-hop nearest distance, one has to ensure SD*(v, u) is updated via a path that contains n' whenever n' is 2-hop away from u. However, to check whether n' is a 2-hop neighbour of u, we must either store all 2-hop neighbours of each vertex or validate the 2-hop relationship on the fly. Storing all 2-hop neighbours is not realistic for large graphs, whereas computing them on the fly would trivially slow down the query processing. Therefore, it can be concluded that out-of-circle pruning achieves the maximum pruning along the social dimension.
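The nearest 2-hop distance used above can be pre-computed independently per vertex. A naive but self-contained sketch follows, assuming the nearest 2-hop neighbour is the closest vertex reachable through a neighbour of u that is neither u nor a direct neighbour (this reading, the adjacency layout and the 2-edge-path approximation are assumptions, not the paper's parallel implementation):

```python
def nearest_two_hop(graph, u):
    """Smallest two-edge path weight from u to a vertex x that is
    neither u itself nor one of u's direct neighbours."""
    direct = {n for n, _ in graph[u]}
    best = float("inf")
    for n, w1 in graph[u]:
        for x, w2 in graph[n]:
            if x != u and x not in direct:
                best = min(best, w1 + w2)
    return best
```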
VII. 3D INDEX OPTIMIZATION

In the 3D index, the social dimension is statically partitioned. Intuitively, the larger the number of partitions, the more accurate the estimate of the social score will be, and this translates into more effective pruning. However, a large number of partitions on the social dimension severely taxes the system resources and is not scalable for large social networks. Moreover, the static partition strategy does not capture the nature of online social activities: within a given period of time, people who are socially related are likely to publish similar information on the online network. A fine-grained yet time-aware social partition is required for more efficient pruning. Therefore, we develop a dynamic division strategy on the social dimension using hierarchical graph partition.

A. Hierarchical Graph Partition Index

We improve the static social division scheme by a hierarchical graph partition using a binary tree, denoted as pTree. Fig. 5 gives an example of a partition tree for the social graph in Fig. 1. The root of pTree contains all vertices in G; each child node in the tree represents a sub partition of G. A sub partition is represented as G[h,idx], where h denotes the level in pTree and idx is the position of G[h,idx] at level h. For the 3D list, we still keep c partitions as described in Sec. IV. The improvement is that the c partitions are formed by nodes in pTree instead of uniform graph partitions.

Fig. 5: Tree partition of the social graph in Fig. 1 (the score under each tree node is the social distance estimated from u1).

The 3D list dynamically updates the social dimension when new records are inserted. The update procedure maintains c partitions within a time slice. When u_r posts a new record r, within each time slice into which r is inserted we try to find the sub partition G[h,idx] to store r. Traversing from the root of pTree, if G[h',idx'] already contains some records or G[h',idx'] is a leaf node of pTree, we insert r into G[h',idx']; otherwise we traverse to the child partition of G[h',idx'] that contains u_r. For any two nodes G[x,idx_x] and G[y,idx_y] in pTree, we denote their Lowest Common Ancestor by LCA(G[x,idx_x], G[y,idx_y]). After an insertion, if we find c + 1 non-empty sub partitions, we merge two sub partitions G_left and G_right to form G[h*,idx*] = LCA(G_left, G_right), such that G[h*,idx*] has the lowest height h* among all possible merges. If there is a tie in h*, the merge that involves the least number of records is executed.

Recall Example 2: Fig. 6 demonstrates how to divide the social partition dynamically using a hierarchical partition for the 3D list of the keyword "icde" in Fig. 2. Since we still build three partitions within a time slice, r6-r10 are inserted into the leaves of the partition tree as they arrive chronologically. When r11 arrives, we first identify that it should be inserted into G[2,1]. However, we can only keep three partitions, so G[2,0] and G[2,1] are merged to form G[1,0], as shown in Fig. 6. Note that in a time slice we only store sub partitions that contain at least one record; the tree structure is just a virtual view.

Fig. 6: Example of building time slice 1 of the 3D list for "icde" on the partition tree in Fig. 5.

The advantage of our hierarchical partition is that we obtain fine-grained social partitions which improve the estimation of a cube's relevance score. This enables more opportunities for powerful pruning along the social dimension. At the same time, we still have c partitions, so the resource requirement does not increase. Again in Fig. 5, the score under each tree node represents the social distance estimated from u1. The static partition scheme estimates the distance from u1 to u7 as 0.2, whereas the hierarchical partition scheme gives 0.5, which is closer to the true value (0.6).

B. CubeTA on Hierarchical Graph Partition Index

CubeTA has to be extended to incorporate the hierarchical partition to improve the pruning efficiency. Since the partition scheme is improved from a single layer to multiple layers, we need to change the method used to estimate social distances. In the pre-processing step, the distances between all pairs of leaf nodes of the partition tree pTree are computed. When a user u submits a query, we first identify the leaf node G_u to which this user belongs. Then the distance SD(u, G[h,idx]) from u to any partition G[h,idx] is estimated as the minimum of SD(u, G[h',idx']) over the leaf nodes G[h',idx'] of which G[h,idx] is an ancestor in pTree. Suppose user u1 submits a query; the social distances from u1 to the other sub partitions are estimated in Fig. 5. The distances from G_{u1} to all leaf nodes are first retrieved from the pre-computed index (the values are shown below the nodes). Then the distances from u1 to G[1,0] and G[1,1] are estimated as 0.1 and 0.4 respectively, by taking the minimum value over their leaf nodes.

After reordering the social dimension w.r.t. user u1, we can visualize the 3D list of "icde" as in Fig. 7. For each social partition, the estimated social distances are listed in the cell. The partitions may vary across different time slices, but the number of partitions remains the same, so CubeTA can be applied directly on the hierarchical index. This also shows the flexibility of the 3D list and CubeTA, both of which easily incorporate the hierarchical index for efficient query processing.

Fig. 7: Reordered inverted list for keyword "icde" using the hierarchical partition w.r.t. user u1.
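A small sketch of the query-time estimate in Sec. VII-B: distances from the query user's leaf to every leaf are precomputed, and an internal pTree node is scored by the minimum over the leaves it covers. The tree encoding (a dict mapping nodes to covered leaf ids) is an assumption for illustration; the toy values below are taken from Fig. 5.

```python
def partition_distance(leaf_dist, leaves, node):
    """SD(u, G[h,idx]) estimated as the min precomputed leaf-to-leaf
    distance over all leaves under the pTree node (Sec. VII-B)."""
    return min(leaf_dist[leaf] for leaf in leaves(node))

# Toy pTree: internal node -> covered leaves; leaf_dist holds SD(u1, leaf).
tree = {"G[1,0]": ["G[2,0]", "G[2,1]"], "G[1,1]": ["G[2,2]", "G[2,3]"]}
leaf_dist = {"G[2,0]": 0.1, "G[2,1]": 0.1, "G[2,2]": 0.4, "G[2,3]": 0.5}
for node in tree:
    print(node, partition_distance(leaf_dist, lambda n: tree[n], node))
# Prints G[1,0] 0.1 and G[1,1] 0.4, matching the estimates in Fig. 5.
```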
VIII. EXPERIMENT RESULTS

We implemented the proposed solution on a CentOS server (Intel i7-3820 3.6GHz CPU with 60GB RAM) and compared it with the baseline solutions on two large yet representative real world datasets: Twitter and Memetracker from SNAP³. The original Twitter dataset contains 17M users and 476M tweets. Memetracker is an online news dataset which contains 9M media sources and 96M records. Twitter encapsulates a large underlying social graph but short text information (the average number of distinct non-stopword keywords in a tweet is 7); Memetracker has a smaller social graph but is rich in text information (the average number of distinct non-stopword keywords in a news record is 30). The datasets with these different features are used to test our proposed solution. Since both raw social graphs have many isolated components, we sampled the users that form a connected component to demonstrate the effectiveness of our solution; accordingly, we filtered the documents/tweets based on the sampled users, resulting in the datasets used in our experiments. Table II contains all the parameters used in our experiments; those highlighted in bold denote the default settings unless specified otherwise. For the scalability tests, we make three samples of the social graph and three samples of the text records for each dataset. For query keywords, we randomly sampled keywords of length 1, 2 and 3. We are aware of four factors that may impact the overall performance: keyword frequency, query user, top-k, and the weights of the different dimensions in the ranking function (Equation 5).

TABLE II: All parameter settings used in the experiments.

| Parameters               | Pool of Values                                                         |
|--------------------------|------------------------------------------------------------------------|
| datasets                 | Twitter; News                                                          |
| # of users               | Twitter: 1M, 5M, 10M; News: 0.1M, 0.5M, 1M                             |
| max vertex degree        | Twitter: 0.7M, 0.7M, 0.7M; News: 16K, 29K, 30K                         |
| average vertex degree    | Twitter: 81.6, 82.5, 46.1; News: 9.2, 6.8, 7.0                         |
| # of records             | Twitter: 10M, 15M, 20M; News: 0.5M, 1M, 5M                             |
| keyword frequency        | low, medium, high                                                      |
| degree of query user     | low, medium, high                                                      |
| top-k                    | 1, 5, 10, 15, ..., 50                                                  |
| dimension weight (α,β,γ) | (0.1,0.1,0.1), (0.1,0.3,0.5), (0.1,0.5,0.3), (0.3,0.1,0.5), (0.3,0.5,0.1), (0.5,0.1,0.3), (0.5,0.3,0.1) |

As no existing work supports searching along all three dimensions, we compare the proposed solution with several baseline methods to demonstrate its effectiveness:

• Time Pruning (TP): The state of the art in real time social keyword search [3], [12], [4] sorts the inverted list in reverse chronological order, in order to return the latest results. Thereby, we implement a TA approach that takes advantage of this order to retrieve the top-k results.
• Frequency Pruning (FP): Without considering efficient updates in the real time social network, the traditional approach for personalized keyword search sorts the inverted list by keyword frequencies [5], [6]. FP is the implementation of TA on this type of inverted list.
• 3D Index (3D): The complete solution proposed in this paper, i.e. CubeTA (Algorithm 1) with our optimized distance query computation (Algorithm 3).

All methods consist of two parts of computation for retrieving the top-k results. We refer to the time spent evaluating all candidates' social relevance scores as social, and the rest is denoted as text. These two notions are adopted throughout all figures in this section.

³http://snap.stanford.edu/

Fig. 8: Performance for various settings. Panels: (a) vary frequencies: twitter; (b) vary user degree: twitter; (c) vary top-k: twitter; (d) vary parameters: twitter; (e) vary frequencies: news; (f) vary user degree: news; (g) vary top-k: news; (h) vary parameters: news. Each panel reports execution time (ms), split into text and social components, for 3D, TP and FP.

result. Thereby, we implement a TA approach that can We also study the impact of hierarchical partition (proposed take advantage of such order to retrieve the top-k results. in Sec. VII) on query processing. Three factors will impact on : Without considering efficient the performance: (1) Number of documents within a time slice, • Frequency Pruning (FP) updates in the real time social network, traditional ap- (2) Height of the partition tree pT ree, (3) Number of partitions proach for personalized keyword search sorts the inverted to keep in the 3D list. Since the memory is usually constrained list by keyword frequencies [5], [6]. FP is the implemen- by a maximum limit, we cannot keep too many partitions. tation of TA on this type of inverted list. Therefore, we fix the number of partitions as the optimal : It is the complete solution that we setting just mentioned. Fig. 10 shows the performance results • 3D Index (3D) proposed in this paper, i.e. CubeTA (Algorithm 1) with on twitter dataset when we vary the number of documents our optimized distance query computation (Algorithm 3). within a time slice and also the height of the partition tree. All methods consist of two parts of computation for retrieving We find: (1) more fine-grained social partition leads to better the top-k results. We refer the time to evaluate all candidates’ query performance but it will slow down the index update as social relevance scores as social and the rest is denoted as text. allocating records to the 3D list requires tree traversal; (2) an Two notions are adopted throughout all figures in this section. optimal setup exists for number of documents to be allocated in a time slice (10000 in this case). Twitter data News data 3D tree_height:6 tree_height:7 200 18 tree_height:8 tree_height:9 tree_height:10 130 180 17 The hierarchical partition achieves better performance than 125 160 16 140 120 the static partition, but it will involve many discussions on 15 120 115 14 parameter settings. Due to the space constraint, the static 100 110 13 80 105 12 partition is used in 3D to compare with the two baselines: TP 60 100 40 11 95 and FP. We have seen that the time slice size and the number 20 10 (ms) Time Execution 90 Execution time-News (ms) time-News Execution Execution time-Twitter (ms) time-Twitter Execution 0 9 85 of social partitions have corresponding optimal setups and we 0100200300400500 01000020000300004000050000 Number of Social Partitions # of documents in a time slice will adopt this setting throughout the experiment. Besides, we Fig. 9: Effect of increasing so- Fig. 10: Hierarchial tree partition use 10 intervals for the text dimension. cial partitions for both datasets vs. static partition Text Social Text Social 2000 120 100 A. Discussion on Social Partitions 1500 80 1000 60 Before presenting the results to compare all the methods, 40 500 20 Execution Time (ms) Time Execution we investigate the query performance with different social Time (ms) Execution 0 0 partitions in the 3D list. Although more social partitions in 3D TP FP 3D TP FP 3D TP FP 3D TP FP 3D TP FP 3D TP FP direct pruning In-circle out-of-circle direct pruning in-circle out-of-circle the 3D list bring better accuracy in distance estimates, such (a) twitter data (b) news data improved accuracy has slowed down due to the small world Fig. 11: Effect of distance pruning phenomenon and the high clustering property of the social network. Moreover, the query processor needs to visit more B. 
Efficiency Study cubes due to the additional social partitions. Thus, as shown 1) Evaluating Our Pruning Techniques: Fig. 11 shows the in Fig. 9 there is an optimal setup for the number of social experimental results for both datasets with default settings. partitions for both datasets. For twitter dataset, the optimal In general, 3D outperforms the other by a wide margin. In number of social partitions is 32 while that for news dataset addition, prunings based on the time dimension (TP, 3D) is 64. Even though twitter dataset has a larger social graph than have better performance than the pruning based on the textual news dataset, twitter has a higher average degree resulting in dimension (FP) because searching along the text dimension is a higher degree of clustering. This brings more difficulties in yet another multi-dimensional search for multiple keywords, distinguishing the vertices in terms of their social distances. and the well-known curse of dimensionality problem reduces Text Social Text Social Text Social Text Social 2500 200 1200 1000 1000 2000 150 800 800 1500 600 100 600 1000 400 400 50 500 200 200 Execution Time (ms) Time Execution Execution Time (ms) Time Execution Execution Time (ms) Time Execution Execution Time (ms) Time Execution 0 0 0 0 3D TP FP 3D TP FP 3D TP FP 3D TP FP 3D TP FP 3D TP FP 3D TP FP 3D TP FP 3D TP FP 3D TP FP 3D TP FP 3D TP FP graph:1M graph:5M graph:10M graph:0.1M graph:0.5M graph:1M tweets:10M tweets:15M tweets:20M news:0.5M news:1M news:5M (a) twitter data (b) news data (a) twitter data (b) news data Fig. 12: Scalability: Effect of increasing social graph size Fig. 13: Scalability: Effect of increasing text information the pruning effect along the text dimension. In contrast, the for smaller top-k values, which also explains why we set time dimension is just a scalar so it is more efficient to prune. the default top-k as 5. 3D retrieves the results extremely fast To see the effect of the distance optimizations proposed: compared to other methods and scales against even larger top- direct, in-circle and out-of-circle pruning, we apply them to k value. For TP, the performance is very poor for news dataset TP, FP and 3D. We find that, with better distance pruning against larger top-k values as news dataset contains more text methods, the time for distance queries is greatly reduced for all information, so time pruning becomes less effective. Lastly for methods. Moreover, we have confirmed that 3D works better FP, it almost exhausts all possible documents in the inverted with the optimized distance query computation and enables lists to get the top-k results which is almost a linear scan. more efficient pruning compared to the other two. When all 5) Varying Dimension Weights: Lastly, for each dimension methods employ the optimized distance query computation we consider, i.e. time, social and text, we assign different (equipped with all pruning techniques in Sec. VI), 3D achieves weights to them to test the flexibility of our scheme. As shown 4x speedups against TP and 8x speedups against FP. in Fig. 8d and 8h, 3D remains superior over the other methods. In the rest of this section, we investigate the query process- Even when the weight of time dimension is small, i.e. 0.1, ing time (of CubeTA + complete optimized distance compu- 3D is still better than TP where both methods use the time tation) by varying the keywords frequencies, the query user, dimension as the primary axis. 
3D does not perform well when the choice of top-k and the dimension weights respectively. the weight of the social dimension is the largest among all 2) Varying Keyword Frequency: Fig. 8a and 8e show the dimension weights because the small world property makes it query processing time over different ranges of keyword fre- hard to differentiate users effectively in term of social distance. quency. Keywords with high and medium frequencies are the C. Scalability top 100 and top 1000 popular keywords respectively, whereas the low frequency keywords are the rest which appear at In order to test the scalability of our proposed solution, we least 1000 times. In both datasets, we have the following decide to scale along the social graph size and the number of user-posted documents respectively. observations: (1) Among all approaches, 3D dominates the rest for all frequency ranges. (2) As query keywords become First, we test the scalability when the social graph grows more popular (i.e. with higher frequencies), the performance while the number of records remains the same. It naturally follows the empirical scenario that users issue real time queries speedup by 3D against other methods becomes larger. Intu- itively, there are more candidate documents for more popular for the most recent posts in a very big social network. In Fig. 12, we find both and spend much longer time query keywords. 3D effectively trims down candidates and TP FP retrieves the results as early as possible. in the social distance computation. In contrast, 3D limits the 3) Varying Query User: We further study the performance time for distance query to minimal due to the simultaneous 3- w.r.t. users with degrees from high to low. The high, medium dimension pruning upon the 3D list. Therefore, 3D is capable and low degree users denote the upper, mid and lower third of handling increased volume of the social graph. We also in the social graph respectively. We randomly sample users tested what happens if one of the three pruning techniques, from each category to form queries. The results are reported i.e. early-determination, early-pruning and warm-up queue, is missing. The results are shown in Table III where we in Fig. 8b and 8f. We find: (1) 3D achieves a constant speedup compared to the rest of the methods regardless of the query observe that early-determination is the most powerful pruning. Nevertheless, all prunings must work together to ensure an user’s degree. (2) The social relevance computation for TP efficient distance computation with the increasing graph size. and FP takes longer than 3D, even though TP and FP have Second, we test each approach w.r.t. varying number of the same distance pruning technique as 3D. This is because records posted while fixing the social graph sizes to the default 3D prunes more aggressively on social dimension using time and text dimension whereas the other two methods only have values. As we can see from Fig. 13, 3D remains to be the best one dimension to prune. As illustrated later in Sec. VIII-C, against the rest and is able to maintain the performance to near such advantage is magnified when the graph goes larger. constant. Therefore, it verifies that our proposed method is also 4) Varying Top-k: We vary the top-k value from 1 to 50. As scalable against high text volume. shown in Fig. 8c and 8g, the performance slowly drops with D. Handling Fast Update more required results. 
3) Varying Query User: We further study the performance w.r.t. users with degrees from high to low. The high, medium and low degree users denote the upper, middle and lower thirds of the social graph respectively, and we randomly sample users from each category to form queries. The results are reported in Fig. 8b and 8f. We find: (1) 3D achieves a constant speedup over the other methods regardless of the query user's degree. (2) The social relevance computation for TP and FP takes longer than for 3D, even though TP and FP use the same distance pruning technique as 3D. This is because 3D prunes the social dimension more aggressively using the time and text dimensions, whereas the other two methods have only one dimension to prune on. As illustrated later in Sec. VIII-C, this advantage is magnified as the graph grows larger.

4) Varying Top-k: We vary the top-k value from 1 to 50. As shown in Fig. 8c and 8g, the performance slowly degrades as more results are required. Since we are performing a high dimensional search, the scores of candidate results tend to be close to each other, so it is more meaningful to identify results quickly for smaller top-k values; this also explains why we set the default top-k to 5. 3D retrieves the results extremely fast compared to the other methods and scales well even for larger top-k values. For TP, the performance on the news dataset is very poor for larger top-k values because the news dataset contains more text information, so time pruning becomes less effective. Lastly, FP almost exhausts all possible documents in the inverted lists to obtain the top-k results, which amounts to nearly a linear scan.

5) Varying Dimension Weights: Lastly, we assign different weights to the three dimensions we consider, i.e. time, social and text, to test the flexibility of our scheme. As shown in Fig. 8d and 8h, 3D remains superior to the other methods. Even when the weight of the time dimension is small, i.e. 0.1, 3D is still better than TP, although both methods use the time dimension as the primary axis. 3D does not perform as well when the weight of the social dimension is the largest among all dimension weights, because the small world property makes it hard to differentiate users effectively in terms of social distance.
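The weighted combination evaluated in this experiment has the simple form below. The per-dimension scores are illustrative stand-ins (exponential time decay, linearly normalized social distance, Jaccard text overlap) assumed to lie in [0, 1]; our actual component definitions are given in the earlier sections.

    import math

    def time_freshness(age_seconds, decay=1e-5):
        # Newer records score higher; exponential decay is a stand-in.
        return math.exp(-decay * age_seconds)

    def social_relevance(distance, max_distance=6.0):
        # Closer users score higher; linear normalization is a stand-in.
        return max(0.0, 1.0 - distance / max_distance)

    def text_similarity(query_terms, doc_terms):
        # Jaccard overlap as a stand-in for the textual score.
        q, d = set(query_terms), set(doc_terms)
        return len(q & d) / len(q | d) if q or d else 0.0

    def rank_score(freshness, relevance, similarity, w=(0.4, 0.3, 0.3)):
        # Weighted sum over the three dimensions; the weights sum to 1
        # and are the quantities varied in Fig. 8d and 8h.
        return w[0] * freshness + w[1] * relevance + w[2] * similarity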

C. Scalability

To test the scalability of our proposed solution, we scale along the social graph size and the number of user-posted documents respectively.

Fig. 12: Scalability: effect of increasing the social graph size; execution time (ms), split into text and social components, for 3D/TP/FP on (a) twitter data (graphs of 1M, 5M, 10M) and (b) news data (graphs of 0.1M, 0.5M, 1M).
Fig. 13: Scalability: effect of increasing text information; execution time (ms) for 3D/TP/FP on (a) twitter data (10M, 15M, 20M tweets) and (b) news data (0.5M, 1M, 5M news records).

First, we test the scalability when the social graph grows while the number of records remains unchanged. This naturally matches the empirical scenario where users issue real time queries for the most recent posts in a very large social network. In Fig. 12, we find that both TP and FP spend much longer on the social distance computation. In contrast, 3D keeps the distance query time minimal thanks to the simultaneous three-dimensional pruning on the 3D list, and is therefore capable of handling an increasing social graph volume. We also tested what happens if one of the three pruning techniques, i.e. early-determination, early-pruning and the warm-up queue, is disabled. The results are shown in Table III, where we observe that early-determination is the most powerful pruning. Nevertheless, all prunings must work together to ensure an efficient distance computation as the graph size grows.

TABLE III: Performance for evaluating social distances with each pruning technique disabled in turn. The units are milliseconds.

Twitter graph            1M      5M      10M
No warm-up queue         266.1   257.4   1987
No early-determination   1341    2377    6018
No early-pruning         271.9   531.3   2407
All prunings enabled     31.75   124.8   69.34

News graph               0.1M    0.5M    1M
No warm-up queue         2.56    38.3    38.7
No early-determination   17.5    67.6    129
No early-pruning         1.34    55.9    52.0
All prunings enabled     0.80    11.2    12.8
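As a point of reference for these ablations, the sketch below shows a bounded Dijkstra-style distance query: it stops as soon as the target is settled, or as soon as the frontier distance exceeds a cutoff derived from the other two dimensions. This only illustrates the general idea of terminating a distance search early; the actual early-determination, early-pruning and warm-up queue techniques are defined in Sec. VI.

    import heapq

    def bounded_distance(adj, source, target, cutoff=float("inf")):
        # adj maps a vertex to an iterable of (neighbor, edge_weight).
        # Returns the shortest distance to target, or None once it is
        # provably larger than cutoff.
        dist = {source: 0.0}
        heap = [(0.0, source)]
        while heap:
            d, v = heapq.heappop(heap)
            if d > dist.get(v, float("inf")):
                continue        # stale entry left behind by a relaxation
            if v == target:
                return d        # target settled: its distance is final
            if d > cutoff:
                return None     # frontier beyond the cutoff: give up
            for u, w in adj.get(v, ()):
                nd = d + w
                if nd < dist.get(u, float("inf")):
                    dist[u] = nd
                    heapq.heappush(heap, (nd, u))
        return None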

Second, we test each approach w.r.t. a varying number of posted records while fixing the social graph size to the default values. As we can see from Fig. 13, 3D remains the best among all methods and maintains near-constant performance, which verifies that our proposed method is also scalable against a high text volume.

D. Handling Fast Update

Lastly, we study how our framework deals with fast and continuously arriving new data. The experiment is conducted as follows: we measure the query time, the online index update rate and the index size while new tweets are ingested continuously, growing the collection from 10M to 20M tweets; the results are presented in Fig. 14, 15 and 16 respectively. The results for the news dataset are similar to those for Twitter and are omitted due to the space limit.

Fig. 14: Query time (ms) for 3D/FP/TP as records are ingested (10 to 20 million incoming records).
Fig. 15: Online index update rate (s/million records) for 3D/FP/TP with incoming records.
Fig. 16: Index size (GB) for 3D/FP/TP with incoming records.

We find: (1) For query time, the time based pruning methods achieve fairly stable performance, while text pruning degrades as tweets are ingested; 3D outperforms TP by a 5-7x speedup and is even more stable under index updates. (2) For online index update time, TP is the most efficient method owing to its simple index structure. Nevertheless, 3D also demonstrates its fast-update capability: it clearly outperforms FP, and its margin behind TP is quite small. (3) For index size, the space occupied by 3D is close to that of TP because we store the 3D list as time slices of trees (Sec. IV) to avoid empty cubes. FP requires more space because a large tree structure is used to maintain a sorted list w.r.t. keyword frequencies.
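Finding (3) is a direct consequence of the slice-based layout: an incoming record touches only the slice covering its timestamp (normally the newest one), and (slice, keyword) combinations with no postings are never materialized. The sketch below conveys this organization with hypothetical names, using sorted lists in place of the tree structures of Sec. IV.

    from bisect import insort
    from collections import defaultdict

    class TimeSlicedIndex:
        # Inverted index organized as time slices; a simplified sketch.
        def __init__(self, slice_span=3600):
            self.slice_span = slice_span  # slice width in seconds
            self.slices = defaultdict(lambda: defaultdict(list))

        def insert(self, record_id, timestamp, keywords):
            # Only the slice covering the timestamp is touched, so the
            # ingestion cost does not grow with the index size.
            sid = int(timestamp // self.slice_span)
            for kw in keywords:
                insort(self.slices[sid][kw], (timestamp, record_id))

        def postings(self, keyword):
            # Visit slices from newest to oldest, matching a query plan
            # whose primary axis is time freshness.
            for sid in sorted(self.slices, reverse=True):
                plist = self.slices[sid].get(keyword)
                if plist:
                    yield from reversed(plist)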
In summary, equipped with its fast update and efficient space management features, 3D has the advantage over the other methods in handling real time personalized queries.

IX. CONCLUSION

In this work, we presented a general framework to support real time personalized keyword search on social networks by leveraging the unique characteristics of SNPs. We first proposed a general ranking function that combines the three most important dimensions (time, social and textual) to cater to various user preferences. We then designed an update-efficient 3D cube index, upon which we devised an efficient threshold algorithm called CubeTA, and further proposed several pruning methods for the social distance query computation. Extensive experiments on real world social network data have verified the efficiency and scalability of our framework. In the future, we would like to study how to design an effective ranking function that provides high-quality personalized results.

Acknowledgement. This work is funded by the NExT Search Centre (grant R-252-300-001-490), supported by the Singapore National Research Foundation under its International Research Centre @ Singapore Funding Initiative and administered by the IDM Programme Office. Guoliang Li is partly supported by the 973 Program of China (2015CB358700) and NSF of China (61373024 and 61422205).