arXiv:1708.02062v1 [cs.IR] 7 Aug 2017

Fishing in the Stream: Similarity Search over Endless Data

Naama Kraus, Viterbi EE Department, Technion, Haifa, Israel
David Carmel, Yahoo Research
Idit Keidar, Viterbi EE Department, Technion, Haifa, Israel, and Yahoo Research

ABSTRACT
Similarity search is the task of retrieving data items that are similar to a given query. In this paper, we introduce the time-sensitive notion of similarity search over endless data-streams (SSDS), which takes into account data quality and temporal characteristics in addition to similarity. SSDS is challenging as it needs to process unbounded data, while computation resources are bounded. We propose Stream-LSH, a randomized SSDS algorithm that bounds the index size by retaining items according to their freshness, quality, and dynamic popularity attributes. We analytically show that Stream-LSH increases the probability to find similar items compared to alternative approaches using the same space capacity. We further conduct an empirical study using real world stream datasets, which confirms our theoretical results.

Keywords
Similarity search, Stream-LSH, Retention policy

1. INTRODUCTION
Users today are exposed to massive volumes of information arriving in endless data streams: hundreds of millions of content items are generated daily by billions of users of widespread social media platforms [4, 27]; fresh news headlines from different sources around the world are spread and aggregated by online news services [18, 19, 30]. In this era of information explosion, it has become crucial to 'fish in the stream', namely, to identify content in the stream that will be of interest to a given user. Indeed, search and recommendation services that find such content are ubiquitously offered by major content providers [17, 40].

A fundamental building block for search and recommendation applications is similarity search [15, 18, 19, 30, 39], an algorithmic primitive for finding content similar to a queried item. For example, a user reading a news item is offered similar news items or blog posts to enrich his reading experience. In the context of content streams, many works have observed that applications ought to take into account temporal metrics in addition to similarity [19, 23, 25, 29, 30, 31]. Nevertheless, the similarity search primitive has not been extended to handle endless data-streams. To this end, we introduce in Section 2 the problem of similarity search over data-streams (SSDS).

In order to efficiently retrieve such content at runtime, an SSDS algorithm needs to maintain an index of streamed data. The challenge, however, is that the stream is unbounded, whereas physical space capacity cannot grow without bound; this limitation is particularly acute when the index resides in RAM for fast retrieval [12, 39]. A key aspect of an SSDS algorithm is therefore its retention policy, which continuously determines which items to index and which to forget. The goal is to retain items that best satisfy the needs of the users of stream-based applications.

A straightforward approach for bounding the index size is to focus on the freshest items in the stream. Thus, one can bound the index size by eliminating the oldest items from the index once its size exceeds a certain threshold. We refer to this retention policy as Threshold. Although such an approach has been effectively used for detecting new stories [36, 39], it is less ideal for search and recommendations, where old items are known to be valuable to users [11].

In Section 3 we present Stream-LSH, an SSDS algorithm based on Locality Sensitive Hashing (LSH) [24, 22], a widely used randomized similarity search technique for massive high-dimensional datasets. LSH builds a hash-based index with some redundancy in order to increase recall. Stream-LSH further takes into account an item's quality, age, and dynamic popularity in determining its level of redundancy. We suggest instead Smooth – a randomized retention policy that gradually eliminates index entries over time. Since there is redundancy in the index, items do not disappear from the index at once. Instead, an item's representation in the index gradually decreases with its age.

Figure 1 illustrates the probability to find a similar item using the same space capacity for the two retention policies. In this example, the index size suffices for Threshold to retain items for 20 days. We see that Threshold is likely to find fresh similar items, but fails to find similar items older than 20 days. Using the same space capacity, Smooth finds very fresh items with a lower probability than Threshold; this comes at the cost of finding older similar items for a longer time period, with gradually decaying probability. We further show that Smooth exploits a given space capacity more efficiently, so that the average recall is larger than with Threshold.

[Figure 1: Probability of successful retrieval of similar items as a function of their age with Threshold and Smooth retention policies in example settings. The y-axis is the probability to find an item; the x-axis is the age of the item in days.]

We extend Stream-LSH to consider additional data characteristics beyond age. First, our Stream-LSH algorithm considers items' query-independent quality, and adjusts an item's redundancy in the index based on its quality. This is in contrast to the standard LSH, which indexes the same number of copies for all items regardless of their quality. Second, we present the DynaPop extension to Stream-LSH, which considers items' dynamic popularity. DynaPop gets as input a stream of user interests in items, such as retweets or clickthrough information, and re-indexes in Stream-LSH items of interest; thus, it has Stream-LSH dynamically adjust items' redundancy to reflect their popularity.

To analyze Stream-LSH with different retention policies, we formulate in Section 4 the theoretical success probability (SP) metric of an SSDS algorithm when seeking items within given similarity, age, quality, and popularity radii. Our results show that Smooth increases the probability to find similar and high quality items compared to Threshold, when using the same space capacity. We show that our quality-sensitive approach is appealing for similarity search applications that handle large amounts of low quality data, such as user-generated social data [7, 10, 12], since it increases the probability to find high-quality items. Finally, we show that using DynaPop, Stream-LSH is likely to find popular items that are similar to the query, while also retrieving similar items that are not highly popular, albeit with lower probability. Retrieving similar items from the tail of the popularity distribution in addition to the most popular ones is beneficial for applications such as query auto-completion [9] and product recommendation [46]. In Section 5 we validate our theoretical results empirically on several real-world stream datasets using the recall metric.

In summary, we make the following contributions:

• We formulate SSDS – a time-sensitive similarity search primitive for unbounded data streams (Section 2).

• We propose Stream-LSH, a bounded-capacity SSDS algorithm with randomized insertion and retention policies (Section 3).

• We show both analytically and empirically that Stream-LSH successfully finds similar items according to age, quality, and dynamic popularity attributes (Sections 4 and 5).

Related work is discussed in Section 6, and Section 7 concludes the paper.

2. SIMILARITY SEARCH OVER DATA-STREAMS
We extend the problem of similarity search and define similarity search over unbounded data-streams. Our stream similarity search is time-sensitive and quality-aware, and so we also define a recall metric that takes these aspects into account.

2.1 Background: Similarity Search
Similarity search is based on a similarity function, which measures the similarity between two vectors, u, v ∈ V, where V = (R+)^d is some high d-dimensional vector space [16]:

Definition 2.1 (similarity function). A similarity function sim : V × V → [0, 1] is a function such that ∀u, v ∈ V, sim(u, v) = sim(v, u) and sim(v, v) = 1.

The similarity function returns a similarity value within the range [0, 1], where 1 denotes perfect similarity, and 0 denotes no similarity. We say that u is s-similar to v if sim(u, v) = s. A commonly used similarity function for textual data is angular similarity [39, 36], which is closely related to cosine similarity [13, 16]. The angular similarity between two vectors u, v ∈ V is defined as:

    sim(u, v) = 1 − θ(u, v)/π,    (1)

where θ(u, v) = arccos(u·v / (‖u‖·‖v‖)) is the angle between u and v.

Given a (finite) subset of vectors, U ⊆ V, similarity search is the task of finding vectors that are similar to some query vector. More formally, an exact similarity search algorithm accepts as input a query vector q ∈ V and a similarity radius R ∈ [0, 1], and returns Ideal(q, R), a unique ideal result set of all vectors v ∈ U satisfying sim(q, v) ≥ R.

The time complexity of exact similarity search has been shown to be linear in the number of items searched, for high-dimensional spaces [43]. Approximate similarity search improves search time complexity by trading off efficiency for accuracy [37, 8]. Given a query q, it returns Appx(q, R), an approximate result set of vectors, which is a subset of q's R-ideal result set. We refer to approximate similarity search simply as similarity search in this paper.
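For concreteness, the following is a minimal Python sketch of Equation (1) for dense vectors. It is ours, for illustration only; the paper's experiments implement angular similarity inside Lucene (see Section 5.1).

```python
import math

def angular_similarity(u, v):
    """Angular similarity of Equation (1): sim(u, v) = 1 - theta(u, v) / pi."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    # Clamp to [-1, 1] to guard against floating-point drift before arccos.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    theta = math.acos(cosine)
    return 1.0 - theta / math.pi

# Identical vectors are 1-similar; orthogonal vectors are 0.5-similar.
assert abs(angular_similarity([1, 0], [1, 0]) - 1.0) < 1e-9
assert abs(angular_similarity([1, 0], [0, 1]) - 0.5) < 1e-9
```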
2.2 SSDS
SSDS considers an unbounded item stream U ⊆ V arriving over an infinite time period, divided into discrete ticks. The (finite) time unit represented by a tick is specified by the application, e.g., 30 minutes or 1 day. On every time tick, 0 or more new items arrive in the stream, and the age of a stream item is the number of time units that elapsed since its arrival. Note that each item in U appears only once, at the time it is created. Each item is associated with a query-independent quality score, which is specified by a given weighting function quality : V → [0, 1].

Similarity search over data-streams. An SSDS algorithm's input consists of a query vector q ∈ V and a three-dimensional radius, (Rsim, Rage, Rquality), of similarity, age, and quality radii, respectively. An exact SSDS algorithm returns a unique ideal result set

    Ideal(q, Rsim, Rage, Rquality) ≜
        {v ∈ U | sim(q, v) ≥ Rsim ∧ age(v) ≤ Rage ∧ quality(v) ≥ Rquality}.

An (approximate) SSDS algorithm A returns a subset Appx(A, q, Rsim, Rage, Rquality) of q's ideal result set.

Recall. Given a query set Q, the recall of A for a query q ∈ Q is

    Recall(A, q, Rsim, Rage, Rquality) ≜
        |Appx(A, q, Rsim, Rage, Rquality)| / |Ideal(q, Rsim, Rage, Rquality)|.

The recall at radius, Recall(A, Rsim, Rage, Rquality), of A is the mean recall over the query set Q.

Dynamic popularity. We consider a second unbounded stream I, which consists of items from the item stream U and arrives in parallel to U. We call I the interest stream. The arrival of an item at some time tick in I signals interest in the item at that point in time. Note that an item may appear multiple times in the interest stream.

We capture an item's dynamic popularity by a weighted aggregation of the number of times it appears in the interest stream, where weights decay exponentially with time [28]: Let t0, ..., tn denote the time ticks since the starting time t0 until the current time tn. The indicator ai(x) is 1 if item x appears in the interest stream at time ti and is 0 otherwise. A parameter 0 < α < 1 denotes the interest decay, which controls the weight of the interest history and is common to all items.

Definition 2.3 (item popularity). The function pop : U → [0, 1] assigns a popularity score pop(x) to an item x ∈ U:

    pop(x) ≜ (1 − α) Σ_{i=0}^{n} ai(x) α^(n−i).

Given an assignment of popularity scores to items, we are interested in the retrieval of items within a popularity radius Rpop ∈ [0, 1], i.e., with a popularity score that is not lower than Rpop. We define recall in a similar manner to the previous definitions.
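Definition 2.3 can be maintained incrementally in O(1) per tick, since pop_n = α·pop_{n−1} + (1 − α)·a_n unrolls to the sum above. A small sketch of ours; the α value below is borrowed from the paper's Section 5.4 setting.

```python
def update_popularity(prev_pop, appeared, alpha=0.95):
    """One tick of Definition 2.3 in recurrence form:
    pop_n = alpha * pop_{n-1} + (1 - alpha) * a_n,
    which unrolls to (1 - alpha) * sum_i a_i * alpha^(n - i)."""
    return alpha * prev_pop + (1.0 - alpha) * (1.0 if appeared else 0.0)

# An item that appears at every tick converges to popularity 1.
pop = 0.0
for _ in range(1000):
    pop = update_popularity(pop, appeared=True)
assert pop > 0.99
```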
3. STREAM-LSH
Stream-LSH is an extension of Locality Sensitive Hashing (overviewed in Section 3.1) for unbounded data-streams, augmented with age, quality, and dynamic popularity dimensions. Stream-LSH consists of a retention policy that defines which items are retained in the index and which are eliminated as new items arrive.

3.1 Background: Locality Sensitive Hashing
Locality Sensitive Hashing (LSH) [24, 22] is a widely used approximate similarity search algorithm for high-dimensional spaces, with sub-linear search time complexity. LSH limits the search to vectors that are likely to be similar to the query vector instead of linearly searching over all the vectors. This reduces the search time complexity at the cost of missing similar vectors with some probability.

LSH uses hash functions that map a vector in the high-dimensional input space (R+_0)^d into a representation in a lower dimension k << d. An LSH family H with similarity function sim is a distribution on a family of hash functions on a collection of vectors, h : V → {0, 1}, such that for two vectors u, v,

    Pr_{h∈H}[h(u) = h(v)] = sim(u, v).    (2)

We use here a hash family H for angular similarity [13]. In order to increase the probability that similar vectors are mapped to the same bucket, the algorithm defines a family G of hash functions, where each g(v) ∈ G is a concatenation of k functions chosen randomly and independently from H. In the case of angular similarity, g : V → {0, 1}^k, i.e., g hashes v into a binary sketch vector, which encodes v in a lower dimension k. For two vectors u, v, Pr_{g∈G}[g(u) = g(v)] = (sim(u, v))^k, for any randomly selected g ∈ G. The larger k is, the higher the precision.

In order to mitigate the probability to miss similar items, the LSH algorithm selects L functions randomly and independently from G. The item vectors are now replicated in L hash tables Hi, 1 ≤ i ≤ L. Upon query, search is performed in L buckets. This increases the recall at the cost of additional storage and processing.

3.2 Stream-LSH
Stream-LSH, presented in Algorithm 1, extends LSH's indexing procedure to operate on an unbounded data-stream. Every time tick, Stream-LSH accepts the set of newly arriving items in the item stream U and indexes each item into its LSH buckets. Stream-LSH selects an item's initial redundancy according to its quality: it indexes the item into each bucket with a probability that equals its quality, independently of other buckets. In addition, in order to bound the index size, in each time tick, Stream-LSH eliminates items from the index according to the retention policy it uses. Note that the two operations – indexing new items and eliminating old ones – are independent, and indexing and elimination work independently in each bucket.

Algorithm 1 Stream-LSH
1: On every time tick t do:
2:   foreach Hi ∈ HashTables do
3:     foreach item ∈ items(t) do
4:       ⊲ Hash to bucket Bi
5:       Bi ← gi(item)
6:       ⊲ Quality-based indexing
7:       with probability quality(item), Bi.add(item)
8:     ⊲ Elimination by retention policy
9:     Hi.Eliminate()
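The following Python sketch illustrates Algorithm 1's indexing path, assuming the random-hyperplane family of [13] as the LSH family for angular similarity. It is a simplified illustration of ours, not the authors' implementation; elimination (Section 3.3) is shown separately below.

```python
import random
from collections import defaultdict

class StreamLSHIndex:
    """Sketch of Algorithm 1: L tables of k-bit random-hyperplane sketches,
    with quality-based probabilistic insertion into each table."""

    def __init__(self, dim, k=10, L=15, seed=0):
        rng = random.Random(seed)
        # g_i is a concatenation of k random hyperplanes drawn from H.
        self.planes = [[[rng.gauss(0.0, 1.0) for _ in range(dim)]
                        for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]
        self.rng = rng

    def _bucket(self, i, v):
        # One bit per hyperplane: which side of the hyperplane v falls on.
        return tuple(int(sum(p_j * v_j for p_j, v_j in zip(plane, v)) >= 0)
                     for plane in self.planes[i])

    def index(self, item_id, v, quality):
        # Quality-based indexing: insert into each table independently
        # with probability quality(item).
        for i, table in enumerate(self.tables):
            if self.rng.random() < quality:
                table[self._bucket(i, v)].append(item_id)

    def query(self, q):
        # Search only the L buckets that q hashes to.
        candidates = set()
        for i, table in enumerate(self.tables):
            candidates.update(table.get(self._bucket(i, q), ()))
        return candidates
```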

3.3 Retention Policies
We first describe the Threshold [39, 34] and Bucket [36] policies, which have been used by prior art in other contexts. Then, we describe our randomized Smooth policy, which gradually eliminates an item's copies from the index as a function of its age.

3.3.1 Threshold and Bucket
The Threshold retention policy [39, 34], presented in Algorithm 2, sets a limit Tsize on table size, and eliminates the oldest items in the table once its size limit is exceeded.

Algorithm 2 Threshold retention policy
1: function H.Eliminate
2:   remove |H| − Tsize oldest items in table
3: end function

The Bucket retention policy [36], given in Algorithm 3, sets a limit Bsize on bucket size (rather than on table size), and eliminates the oldest items in each bucket once its size limit is exceeded. Note that with Bucket, the number of copies of an item in the index varies with age, since each bucket is maintained independently. The probability of an item to be eliminated from a bucket depends on the data distribution, i.e., on the probability that newly arriving items will be mapped to that item's bucket.

Algorithm 3 Bucket retention policy
1: function H.Eliminate
2:   foreach Bi ∈ H do
3:     remove |Bi| − Bsize oldest items in bucket
4: end function

3.3.2 Smooth
In Algorithm 4 we present Smooth, our randomized retention policy that gradually eliminates item copies from the index as a function of their age. Smooth accepts as a parameter a retention factor p, 0 < p < 1: on every time tick, each indexed item copy survives elimination with probability p, independently of other copies and of previous ticks. An item's expected redundancy thus decays exponentially with its age.

Algorithm 4 Smooth retention policy
1: function H.Eliminate
2:   foreach item in H do
3:     with probability 1 − p, remove item from H
4: end function

DynaPop. DynaPop extends Stream-LSH's indexing procedure to dynamically re-index items based on signals of user interest, as reflected by the interest stream I. Here, an item's redundancy increases as the interest in it increases. At each time tick, DynaPop re-indexes an item x that arrives in I into each of its buckets with probability quality(x)·u, independently of the other buckets; the insertion factor, 0 < u ≤ 1, is a parameter of the algorithm that scales the re-indexing probability.
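A compact sketch of the three elimination procedures, under the assumption that each hash table is held as a list of entries kept in arrival order (our representation choice, for illustration):

```python
import random

def eliminate_threshold(table, t_size):
    """Algorithm 2 sketch: keep only the T_size youngest entries in the table.
    'table' is a list of (arrival_tick, item_id) in arrival order."""
    del table[:max(0, len(table) - t_size)]

def eliminate_bucket(buckets, b_size):
    """Algorithm 3 sketch: enforce the size limit independently per bucket."""
    for bucket in buckets.values():
        del bucket[:max(0, len(bucket) - b_size)]

def eliminate_smooth(table, p, rng=random):
    """Algorithm 4 sketch: each copy survives the tick with probability p."""
    table[:] = [entry for entry in table if rng.random() < p]
```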

4. ANALYSIS
In this section, we analyze Stream-LSH's index size and prove that it maintains a bounded index in expectation. We additionally analyze the probability that Stream-LSH finds items within similarity, age, quality, and popularity radii.

4.1 Index Size and Number of Retained Copies
Index size. We analyze the index size under Stream-LSH's retention policies. For the sake of the analysis, we assume that a constant number µ of new items arrive at each time unit, and that their mean quality is φ. By definition, Threshold and Bucket guarantee a bounded index size. We analyze the index size while using Smooth with a retention factor p. Consider one specific hash table, and denote time ticks as t0, ..., tn. At time t0, Smooth stores µφ items in the hash table in expectation. This set of µφ items is scanned independently n times until time tn, and thus the expected number of items that arrive at t0 and survive elimination until tn is p^n µφ. It follows that the expected number of items in the table when processing an infinite stream is Σ_{i=0}^{∞} p^i µφ = µφ/(1−p). The retention process is performed independently in each of the L hash tables, therefore,

Proposition 1. Assuming µ new items with mean quality φ arrive at each time unit, the expected size of an index with L hash tables when using Smooth with retention factor p is (µφ/(1−p))·L.

[Figure 2: Expected number of copies of an item in the index as a function of its age, for items with quality values of 1 and 0.5.]
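Proposition 1 is easy to sanity-check numerically. The sketch below (ours; parameter values are illustrative) compares the closed form against a simulation of one hash table:

```python
import random

def expected_index_size(mu, phi, p, L):
    """Closed form of Proposition 1: (mu * phi / (1 - p)) * L."""
    return mu * phi / (1.0 - p) * L

def simulate_table_size(mu, phi, p, ticks=2000, seed=1):
    """Simulate one table: each tick, Smooth keeps each surviving copy with
    probability p, then each of mu arrivals is inserted with probability phi."""
    rng = random.Random(seed)
    size = 0
    for _ in range(ticks):
        size = sum(1 for _ in range(size) if rng.random() < p)   # Smooth
        size += sum(1 for _ in range(mu) if rng.random() < phi)  # arrivals
    return size

mu, phi, p, L = 100, 0.5, 0.95, 15
print(expected_index_size(mu, phi, p, L))   # 15000.0
print(simulate_table_size(mu, phi, p) * L)  # fluctuates around 15000
```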

4.2 Success Probability
Success probability quantifies the probability of an algorithm to find some item x given a query q [32]. For an LSH algorithm A with parameters k and L, we denote this probability by SP(A(k, L)). We sometimes omit k, L where obvious from the context. According to LSH theory for angular similarity [13], the probability of an LSH(k, L) algorithm to find x that is s-similar to q in bucket gi(q) equals s^k, under a random selection of gi ∈ G. LSH searches the L buckets g1(q), ..., gL(q) independently; thus, LSH(k, L) finds x with probability 1 − (1 − s^k)^L.

4.2.1 Retention Policies
We analyze Stream-LSH's success probability when using Threshold and Smooth. We do not analyze Bucket, as its behavior depends on the data distribution, which we do not model.

Success probability. We denote by SP(A(k, L), s, a, z) the probability of algorithm A(k, L) to find an item x for query q, s.t. sim(q, x) = s, age(x) = a, and quality(x) = z. Stream-LSH indexes a newly arriving item x into each bucket gi(x) independently with probability z. For a constant arrival rate µ, a mean quality φ, and a size limit Tsize, Threshold eliminates items that reach age Tage = Tsize/(µφ). Thus,

    SP(Threshold(k, L), s, a, z) = 1 − (1 − s^k z)^L  if a < Tage, and 0 otherwise.    (3)

With Smooth, a copy of an item of age a survives elimination with probability p^a, hence

    SP(Smooth(k, L), s, a, z) = 1 − (1 − p^a s^k z)^L.    (4)

Numerical illustration. Figure 3 depicts the success probability of Threshold(10, 15) and Smooth(10, 15) as a function of an item's similarity s to the query and its age a, for a common index size (Tsize = 20µ for Threshold, p = 0.95 for Smooth, and quality fixed to 1). The more similar an item is to the query (s is closer to 1), the higher the success probability (color tends towards red). With Threshold, fixing an s value, the success probability remains constant as a increases, since the number of buckets an item is indexed into remains constant; items older than Tage = 20 are not found at all. With Smooth, on the other hand, for a fixed s value, the success probability gradually decays as a increases. Smooth retains items for a longer time period than Threshold, and thus the success probability is non-zero for items older than 20.

[Figure 3: Success probability to find an item according to similarity and age for a common index size. (a) SP Threshold(10, 15); (b) SP Smooth(10, 15).]
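Equations (3) and (4) translate directly to code. The sketch below (ours) uses the illustration's configuration, k = 10, L = 15, Tage = 20, and p = 0.95:

```python
def sp_threshold(s, a, z, k=10, L=15, t_age=20):
    """Equation (3): success probability under Threshold."""
    return 1.0 - (1.0 - (s ** k) * z) ** L if a < t_age else 0.0

def sp_smooth(s, a, z, k=10, L=15, p=0.95):
    """Equation (4): success probability under Smooth."""
    return 1.0 - (1.0 - (p ** a) * (s ** k) * z) ** L

# Reproduce the qualitative shape of Figure 3 at s = 0.9, z = 1:
for age in (0, 10, 20, 40, 60):
    print(age,
          round(sp_threshold(0.9, age, 1.0), 3),
          round(sp_smooth(0.9, age, 1.0), 3))
```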

Cumulative success probability. We next formulate the cumulative success probability (CSP) over similarity, age, and quality radii. Given a query q, a similarity radius Rsim, an age radius Rage, and a quality radius Rquality, CSP quantifies an algorithm's probability to find an item x for which sim(q, x) ∈ [Rsim, 1], age(x) ∈ [0, Rage], and quality(x) ∈ [Rquality, 1]. CSP is the expected SP over all choices of s ∈ [Rsim, 1], a ∈ [0, Rage], and z ∈ [Rquality, 1]. Denoting by f(s, a, z) the joint probability density of similarity, age, and quality,

    ψ(Rsim, Rage, Rquality) = ∫_{z=Rquality}^{1} ∫_{a=0}^{Rage} ∫_{s=Rsim}^{1} f(s, a, z) ds da dz

is a normalization factor. Threshold keeps items up to age 1/(1−p); from (3) and (4):

    CSP(Threshold(k, L), Rsim, Rage, Rquality) =
        ∫_{z=Rquality}^{1} ∫_{a=0}^{1/(1−p)} ∫_{s=Rsim}^{1} f(s, a, z) (1 − (1 − s^k z)^L) / ψ(Rsim, Rage, Rquality) ds da dz

and

    CSP(Smooth(k, L), Rsim, Rage, Rquality) =
        ∫_{z=Rquality}^{1} ∫_{a=0}^{Rage} ∫_{s=Rsim}^{1} f(s, a, z) (1 − (1 − p^a s^k z)^L) / ψ(Rsim, Rage, Rquality) ds da dz.

Numerical illustration. We compare CSP(A, Rsim, Rage, Rquality) of Stream-LSH with Threshold and Smooth. We pose the following assumptions: we focus on the effect of the retention policy, and hence we assume a constant quality function, ∀x quality(x) = 1. In general, the items' similarity distribution is data-dependent; here, we assume a uniform distribution. We consider a discrete age distribution because time is partitioned into discrete time ticks. We assume a constant number of items arriving at each time unit, hence items' age is distributed uniformly. Last, we assume that similarity and age are independent. Under these assumptions we get:

    CSP(Threshold(k, L), Rsim, Rage) =
        (1 / (Rage(1 − Rsim))) Σ_{a=0}^{1/(1−p)} ∫_{s=Rsim}^{1} (1 − (1 − s^k)^L) ds,

and

    CSP(Smooth(k, L), Rsim, Rage) =
        (1 / (Rage(1 − Rsim))) Σ_{a=0}^{Rage} ∫_{s=Rsim}^{1} (1 − (1 − p^a s^k)^L) ds.

For the purpose of the illustration, we select the same configuration as above: k = 10, L = 15, Tsize = 20µ, and p = 0.95. Figures 4(a) and 4(b) depict CSP for fixed Rsim values 0.8 and 0.9, respectively, and a varying age radius Rage. The graphs show a freshness-similarity tradeoff between Threshold and Smooth. Smooth has a better CSP for high age radii (Rage exceeds 20), for any Rsim value. This comes at the cost of a decreased CSP for similarity radius 0.8 when Rage ≤ 20.

[Figure 4: Cumulative success probability of Threshold and Smooth by age radius for a common index size. (a) Rsim = 0.8; (b) Rsim = 0.9.]

4.2.2 Quality-Sensitivity
Some applications, most notably social media ones, commonly handle content of varying quality, and in particular, large amounts of low quality user-generated content [7, 10, 12]. We show that for such applications, quality-sensitive indexing is expected to be attractive, as it can better utilize space for improving the CSP of high quality items. We compare Stream-LSH's quality-sensitive indexing, which indexes an item with redundancy that is proportional to its quality, to a quality-insensitive Stream-LSH, which indexes L copies of each item regardless of its quality. We use the Smooth retention policy and the same index size for both variants.

Assuming an average quality of 0.5, the expected number of newly indexed items per table is µ according to quality-insensitive indexing, and 0.5µ according to quality-sensitive indexing. To obtain a common index size of 10µL, we use p = 0.9 in the quality-insensitive algorithm and p = 0.95 in the quality-sensitive one (Proposition 1). Figure 5 illustrates the two Stream-LSH variants for Rsim = 0.8 and varying Rage radii. In Figure 5(a), we examine items above the mean quality (Rquality = 0.5), and in Figure 5(b) we examine high-quality items (Rquality = 0.9). Both figures show that quality-sensitive indexing increases the CSP compared to the quality-insensitive variant. The improvement is even more marked for high-quality items (Rquality = 0.9).

[Figure 5: Cumulative success probability comparison of quality-sensitive and quality-insensitive Stream-LSH by age radius for a common index size. (a) Rquality = 0.5; (b) Rquality = 0.9.]

4.2.3 Dynamic Popularity
We next analyze an item's success probability according to Stream-LSH when using DynaPop and the Smooth retention policy. We first formulate the bucket probability (SB) of finding an item in a bucket it is hashed to. We then use SB to formulate an item's success probability.

For the sake of the analysis, we assume that the interest in an item does not vary over time: at each time ti, x is included in the interest stream with some probability ρx, which we call the item's interest probability. According to Definition 2.3 and due to the linearity of expectation:

    E(pop(x)) = (1 − α) Σ_{i=0}^{n} E(ai(x)) α^(n−i),

which is a geometrical series that converges to E(ai(x)) when n → ∞. As our stream is infinite and E(ai(x)) = ρx:

    E(pop(x)) = ρx.    (5)

When clear from the context, we omit x and write ρ for brevity.

Bucket probability. We denote by SB(p, u, ρ, z, n) the probability that x is stored in its bucket at time tn, where u and p are the insertion and retention factors, respectively, quality(x) = z, and ρ is x's interest probability. We denote by Ei, 0 ≤ i ≤ n, the event that x is inserted to its bucket at time ti, survives elimination until time tn, and is not selected for insertion to its bucket at any subsequent time tj, i < j ≤ n. Then

    Pr(Ei) = zuρ · p^(n−i) · (1 − zuρ)^(n−i) = zuρ [p(1 − zuρ)]^(n−i).

Since SB(p, u, ρ, z, n) = Pr(∪_{i=0}^{n} Ei) and ∪_{i=0}^{n} Ei is a union of pairwise disjoint events, it follows that

    SB(p, u, ρ, z, n) = Σ_{i=0}^{n} zuρ [p(1 − zuρ)]^(n−i).

SB(p, u, ρ, z, n) is a geometric series that converges to zuρ / (1 − p(1 − zuρ)) when n → ∞. Our interest stream is infinite, thus:

Proposition 2. Given an item x, the probability SB(p, u, ρ, z) to find item x in its bucket when using Stream-LSH with DynaPop and the Smooth retention policy is zuρ / (1 − p(1 − zuρ)), where u and p are the algorithm's insertion and retention factors, respectively, quality(x) = z, and ρ is x's interest probability.
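Proposition 2 admits a simple Monte Carlo sanity check; the sketch below (ours, with illustrative parameters) simulates one item and one bucket per trial:

```python
import random

def sb_closed_form(p, u, rho, z):
    """Proposition 2: steady-state probability that the item is in its bucket."""
    q = z * u * rho
    return q / (1.0 - p * (1.0 - q))

def sb_monte_carlo(p, u, rho, z, ticks=300, trials=10000, seed=7):
    """Each tick, an indexed copy survives with probability p (Smooth),
    and the item is re-inserted with probability z*u*rho (DynaPop)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        indexed = False
        for _ in range(ticks):
            indexed = indexed and (rng.random() < p)  # Smooth elimination
            if rng.random() < z * u * rho:            # DynaPop re-indexing
                indexed = True
        hits += indexed
    return hits / trials

print(sb_closed_form(0.95, 1.0, 0.1, 1.0))   # ~0.690
print(sb_monte_carlo(0.95, 1.0, 0.1, 1.0))   # close to the closed form
```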

Numerical illustration. We illustrate SB for an interest probability that follows a Zipf distribution (typical in social phenomena [14]), which implies a small number of very popular items and a long tail of rare ones. We consider a Zipf distribution where the r-th ranked item x has an interest probability ρx = 1/r. We set the quality z to 1. Figure 6 depicts SB according to interest probability rank, for different values of u and p. In Figure 6(a) we fix p = 0.95 and examine the effect of the insertion probability u. The graphs illustrate that increasing u increases SB, most notably for popular items. In Figure 6(b) we fix u = 1 and examine the effect of the retention probability p. The graphs illustrate that when p increases, DynaPop retains additional items of lower popularity.

[Figure 6: Effect of parameters on the probability to find an item in its bucket according to DynaPop. (a) Insertion factor u ∈ {1, 0.8, 0.6}; (b) Retention factor p ∈ {0.98, 0.95, 0.9}. The x-axis is the interest probability rank (log scale).]

Success probability. We denote by SP(DynaPop(k, L), s, w, z) the probability of Stream-LSH with DynaPop and the Smooth retention policy to find an item x for query q, s.t. sim(q, x) = s, E(pop(x)) = w, and quality(x) = z. By applying Proposition 2 and as w = ρ (Equation 5):

    SP(DynaPop(k, L), s, w, z) = 1 − (1 − s^k · zuw / (1 − p(1 − zuw)))^L.    (6)

The cumulative success probability is computed similarly to the cumulative success probability analysis in Section 4.2.1, and in particular depends on the distribution of w.

Numerical illustration. Figure 7 depicts SP(DynaPop(k, L), s, w, z) as a function of w's rank. We illustrate SP for three s values: 0.7, 0.8, and 0.9. We fix k = 10, L = 15, z = 1, p = 0.95, and u = 1. As the graphs show, an item's success probability increases as its similarity to the query increases. Additionally, an item's success probability increases as its expected popularity increases.

[Figure 7: Success probability to find an item as a function of its expected popularity rank (log scale) according to DynaPop.]
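Equation (6) in code, under Figure 7's setting (k = 10, L = 15, z = 1, p = 0.95, u = 1, and Zipf interest ρ = 1/r); a sketch of ours for illustration:

```python
def sp_dynapop(s, w, z=1.0, k=10, L=15, p=0.95, u=1.0):
    """Equation (6): SP of Stream-LSH with DynaPop and Smooth,
    where w = E(pop(x)) equals the interest probability rho."""
    sb = z * u * w / (1.0 - p * (1.0 - z * u * w))
    return 1.0 - (1.0 - (s ** k) * sb) ** L

# Zipf interest: the rank-r item has w = 1/r.
for rank in (1, 10, 100, 1000, 10000):
    w = 1.0 / rank
    print(rank, [round(sp_dynapop(s, w), 3) for s in (0.7, 0.8, 0.9)])
```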

5. EMPIRICAL STUDY
We conduct an empirical study of Stream-LSH using real world stream datasets and evaluate its effectiveness using the recall metric.

5.1 Methodology
External libraries. We use the Apache Lucene 4.3.0 [1] search library for the indexing and retrieval infrastructure. For retrieval, we override Lucene's default similarity function by implementing angular similarity according to Equation 1. For the LSH family of functions, we use TarsosLSH [3].

Datasets. We use the Reuters RCV1 [2] news dataset and a Twitter [45, 34] social dataset. In both datasets, each item is associated with a timestamp denoting its arrival time. The Reuters dataset consists of news items from August 1996 to August 1997, and the Twitter dataset consists of Tweets collected in June 2009. These datasets do not contain quality information, and so we assume quality(x) = 1 for all items. In order to evaluate quality-sensitivity, we use a smaller Twitter dataset [5], denoted TwitterNas, consisting of a stream of Nasdaq related Tweets spanning 97 days from March 10th to June 15th, 2016. TwitterNas contains the number of followers of each Tweet's author, which we use for assigning quality scores to Tweets (see Section 5.3). In all datasets, we represent an item as a (sparse) vector whose dimension is the number of unique terms in the entire dataset; each vector entry corresponds to a unique term, weighted according to Lucene's TF-IDF formula.

Train and test. We partition each dataset into (disjoint) train and test sets. The train set is the prefix of the item stream up to a tick that we consider to be the current time. The test set is the remainder of the dataset, which was not previously seen by the Stream-LSH algorithm. We randomly sample an evaluation set Q of 3,000 items from the test set and compute recall over Q according to the given radii. Table 1 summarizes the train and test statistics.
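A sketch of the recall computation used throughout this section, following the Section 2.2 definitions. The search interfaces of `algorithm` and `exact_index` are hypothetical names we introduce for illustration; they are not part of the paper's codebase.

```python
def recall_at_radius(appx_ids, ideal_ids):
    """Recall for one query (Section 2.2): |Appx| / |Ideal|, with Appx a subset
    of Ideal. An empty ideal set is counted as perfect recall."""
    return len(set(appx_ids) & set(ideal_ids)) / len(ideal_ids) if ideal_ids else 1.0

def mean_recall(algorithm, exact_index, queries, r_sim, r_age, r_quality):
    """Mean recall over the evaluation set Q. Both arguments are assumed to
    expose search(q, r_sim, r_age, r_quality) returning matching item ids."""
    total = 0.0
    for q in queries:
        ideal = exact_index.search(q, r_sim, r_age, r_quality)
        appx = algorithm.search(q, r_sim, r_age, r_quality)
        total += recall_at_radius(appx, ideal)
    return total / len(queries)
```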

Table 1: Train and test statistics.

                          Train                        Test
             Time unit    Num. items    Num. ticks    Num. items   Num. ticks
  Reuters    Day          756,927       343           22,986       10
  Twitter    10 Minutes   18,224,293    2,705         42,296       10
  TwitterNas Day          275,946       92            18,831       5

5.2 Retention Policies
We evaluate the recall of the three retention policies for the Reuters and Twitter datasets as a function of age. As the retention aspect of our algorithm is orthogonal to the quality-sensitive indexing aspect, we assume here that quality(x) = 1 for all items. In order to achieve a fair comparison, we use k = 10 and L = 15 for the three retention policies, and configure them to use approximately the same index size: We set Tsize = 45,000 and Bsize = 45 in Reuters; Tsize = 180,000 and Bsize = 177 in Twitter; and p = 0.95 in both datasets.

Figure 8 depicts our recall results for Reuters in the top row, and Twitter in the bottom row. Our goal is to retrieve items that are similar to the query, hence we focus on Rsim values 0.8 and 0.9. As we are also interested in the retrieval of items that are not highly fresh, we evaluate recall over varying age radii values.

When considering Rsim = 0.8 (leftmost column), there is a tradeoff between Threshold and Smooth: when focusing on the highly fresh items (Rage < 20), Threshold's recall is slightly larger than Smooth's. Indeed, Threshold is effective when only the retrieval of the highly fresh items is desired. However, Smooth outperforms Threshold when the age radius increases to include also less fresh items. For example, in Reuters, Smooth achieves a recall of 0.69 for items that are at least 0.8-similar to the query and are not older than age 50, whereas Threshold achieves a lower recall of 0.42. Bucket's recall is higher than Threshold's for ages that exceed 20, as unlike Threshold, Bucket does not eliminate items at once. Yet, Smooth outperforms Bucket when increasing the age radius, due to applying an explicit gradual elimination over all items. When increasing the similarity radius to Rsim = 0.9 (rightmost column), the advantage of Smooth over Threshold becomes pronounced. For example, in Twitter, Smooth achieves a recall of 0.97 for items that are not older than age 50, whereas Threshold only achieves a recall of 0.7.

[Figure 8: Recall comparison by age radius of the three retention policies using approximately the same index size. (a) Reuters: Rsim = 0.8; (b) Reuters: Rsim = 0.9; (c) Twitter: Rsim = 0.8; (d) Twitter: Rsim = 0.9.]

5.3 Quality-Sensitivity
We move on to evaluating Stream-LSH's quality-sensitive approach. We experiment with the TwitterNas dataset, which contains for each Tweet x the number of followers of its author, denoted Tf(x), representing the author's authority. We define the following quality scoring function:

    quality(x) = log2(1 + min(1, Tf(x)/Nf)),

where Nf is a configurable normalization factor. In our experiments, we set Nf = 5,000 (15% of the authors have more than 5,000 followers). Applying quality(x) to TwitterNas entails an average quality score of 0.33.

We experiment with quality-sensitive and quality-insensitive variants of Smooth, with k = 10 and L = 15. In order to conduct a fair comparison, we set retention factors that entail approximately the same index size for both variants. More specifically, we set p = 0.9 for the quality-insensitive variant, which results in an index size of 636,290 items in our experiment, and p = 0.97 for the quality-sensitive variant, which results in an index size of 590,818 items in our experiment. Recall that quality-sensitive indexing is more compact, which enables a slower removal of item copies; this is reflected by the larger p value in the quality-sensitive case.

We experiment with the same radii as in our analysis: we fix Rsim = 0.8, and experiment with Rquality = 0.5 and Rquality = 0.9 over varying age values. Figure 9 depicts the recall achieved by the two Smooth variants as a function of the age radius.

The graphs demonstrate that for both Rquality values, the quality-sensitive approach significantly outperforms the quality-insensitive approach when searching for similar items (Rsim = 0.8) over all age radii values that we examined. This is since the quality-sensitive approach better exploits the space resources for high quality items. The advantage of quality-sensitive indexing increases with the age of high-quality items, which is an advantage when the retrieval of items that are not necessarily the most fresh ones is desired. For example, considering Rquality = 0.5 (Figure 9(a)) and Rage = 30, the recall achieved by quality-insensitive Smooth is 0.7, whereas the recall achieved by quality-sensitive Smooth is 0.88. When considering Rage = 90, the recall of quality-insensitive Smooth is 0.45, whereas the recall of quality-sensitive Smooth is 0.76. A similar trend is observed for Rquality = 0.9 (Figure 9(b)). Note that the graphs of Rquality = 0.9 and Rquality = 0.5 differ due to the different distributions of similarity, age, and quality when considering different radii values in real data.

As we noted in our analysis, the advantage of the quality-sensitive approach is most pronounced when there exists a large amount of low quality items in the dataset. Indeed, in our setting, 73% of the items are assigned a quality value below 0.5. In such cases, using quality-sensitive Stream-LSH is expected to be appealing for similarity-search stream applications.

[Figure 9: Recall comparison of quality-insensitive and quality-sensitive Smooth using approximately the same index size. (a) Rquality = 0.5; (b) Rquality = 0.9.]
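The quality scoring function in code (a direct transcription of the formula above; the example follower counts are ours):

```python
import math

def quality_score(num_followers, n_f=5000):
    """The paper's quality function for TwitterNas:
    quality(x) = log2(1 + min(1, Tf(x) / Nf))."""
    return math.log2(1.0 + min(1.0, num_followers / n_f))

# The score saturates at 1 for authors with at least Nf followers.
print(quality_score(0))      # 0.0
print(quality_score(2500))   # log2(1.5) ~= 0.585
print(quality_score(50000))  # 1.0
```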

5.4 Dynamic Popularity
We wrap up by studying Stream-LSH when using DynaPop and the Smooth retention policy. We experiment with u = 0.95 and p = 0.95. As our datasets do not contain temporal interest information, we simulate an interest stream I by considering query results as signals of interest in items [14], as follows: We use the first 75% of the items in the train set as the item stream U. We construct a query set Q* by randomly sampling each item from the remaining 25% of the train set with probability 0.1. For each query q ∈ Q*, we retrieve its top 10 most similar items in U and include them in the interest stream I at q's timestamp tq, as well as at their original arrival times in U. Table 2 summarizes the item and interest stream statistics. We compute popularity scores at the current time according to Definition 2.3 with α = 0.95.

Table 2: DynaPop item and interest streams statistics.

             Item stream                 Interest stream
             Num. items    Num. ticks    Num. items   Num. ticks
  Reuters    540,882       252           226,890      95
  Twitter    13,124,853    2,000         4,267,518    1,500

Figure 10 depicts recall as a function of Rpop for similarity radii 0.8 and 0.9. For both datasets and similarity radii, the recall increases as the popularity radius increases. DynaPop provides high recall when searching for the most popular items in the dataset: For example, in the Reuters dataset (Figure 10(a)), for Rsim = 0.8 (blue curve) and Rpop = 0.05 (capturing the 3.5% most popular items in the data set), the recall is 0.86. When increasing the similarity radius to Rsim = 0.9 (green curve), the recall increases to 0.97. DynaPop's recall is lower when we also search for less popular items: In the Reuters dataset, for Rsim = 0.8 and Rpop = 0.01 (capturing the 24% most popular items in the data set), the recall is 0.72; when increasing the similarity radius to Rsim = 0.9, the recall is 0.9. Overall, DynaPop achieves good recall for popular items that are similar to the query, while also retrieving similar items that are less popular, albeit with lower recall; the latter is beneficial for applications such as query auto-completion [9] and product recommendation [46].

[Figure 10: Recall by popularity radius of Stream-LSH when using DynaPop with the Smooth retention policy. (a) Reuters; (b) Twitter.]

6. RELATED WORK
Previous work on recommendation over streamed content [20, 18, 30, 26, 29, 31, 12] focused on using temporal information for increasing the freshness of recommended items. Stream recommendation algorithms extend techniques originally designed for static data, such as content-based and collaborative filtering [6], and apply them to streamed data by taking into account the temporal characteristics of stream generation and consumption within the algorithm internals. In the context of search, many works extend ranking methods to consider temporal aspects of the data (see [25] for a survey), as well as quality features such as a social post's length or the author's influence [40]. However, these search and recommendation works do not tackle the challenge of bounding the capacity of their underlying indexing data-structures. Rather, they assume an index of the entire stream with temporal information is given. Our work is thus complementary to these efforts in the sense that we offer a retention policy that may be used within their similarity search building block.

TI [14] and LSII [44] improve realtime indexing of stream data using a policy that determines which items to index online and which to defer to a later batch indexing stage. Both assume unbounded storage and are thus complementary to our work. In addition, TI focuses on highly popular queries, whereas we also address the tail of the popularity distribution. LSII addresses the tail; however, it assumes exact search, while we focus on approximate search, which is the common approach in similarity search [22].

A few previous works have addressed bounding the underlying index size in the context of stream processing [36, 39, 34, 33]. Two papers [36, 39] have focused on first story detection, which detects new stories that were not previously seen. Both use LSH as we do. Petrović et al. [36] maintain buckets of similar stories, which are used in realtime for detecting new stories using similarity search. In order to bound the index, they define a limit on the number of stories kept within a bucket, and eliminate the oldest stories when the limit is reached. We call this retention policy Bucket. Sundaram et al.'s [39] primary goal is to parallelize LSH in order to support high-throughput data streaming. They bound the index size using a retention policy we call Threshold, by eliminating the oldest items when the entire index exceeds a given space limit. Both papers focus on the first story detection application, while our work focuses on the similarity search primitive. Their retention policies are well-suited for first story detection, where the index is searched in order to determine whether any recent matching result exists (indicating that the story is not the first), and are less adequate for similarity search, where multiple relevant results to an arbitrary query are targeted. We evaluate their retention policies in our Stream-LSH algorithm and find that our Smooth policy provides much better results in our context.

Morales and Gionis propose streaming similarity self-join (SSSJ) [34], a primitive that finds pairs of similar items within an unbounded data stream. Similarly to us, SSSJ needs to bound its underlying search index. Our work differs, however, in several aspects: First, we study a different search primitive, namely, similarity search, which searches for items similar to an arbitrary input query rather than retrieving pairs of similar items from the stream. Second, SSSJ only retrieves items that are not older than a given age limit. It thus bounds the index using a variant of Threshold. In contrast, we do not assume that an age limit on all queries is known a priori. In this context, we propose Smooth, which better fits our setting, as we show in our evaluation. Third, we tackle approximate similarity search, whereas SSSJ searches for an exact set of similar pairs.

Magdy et al. [33] propose a search solution over stream data with bounded storage, which increases the recall of tail queries. Their work differs from ours in the retrieval model; more specifically, they assume the ranking function is static and query-independent, e.g., ranking items by their age. Each item's score is then known a priori for all queries, and can be used to decide at indexing time which items to retain in the index. This approach is less suitable to similarity search, where scores are query-dependent and only known at runtime. In addition, we note that the aforementioned works on bounded-index stream processing [36, 39, 34, 33] do not take into account quality and dynamic popularity as we do.

Several papers have focused on improving the space complexity of LSH via alternative search algorithms [21, 41, 38], via decreasing the number of tables used at the cost of executing more queries [35], or by searching more buckets [32]. Unlike Stream-LSH, these works consider static (finite) data rather than a stream.

7. CONCLUSIONS AND FUTURE WORK
We introduced the problem of similarity search over endless data-streams, which faces the challenge of indexing unbounded data. We proposed Stream-LSH, an SSDS algorithm that uses a retention policy to bound the index size. We showed that our Smooth retention policy increases the recall of similar items compared to methods proposed by prior art. In addition, our Stream-LSH indexing procedure is quality-sensitive, and is extensible to dynamically retain items according to their popularity. While our work focuses on similarity search, our approach may prove useful in future work for addressing space constraints in other stream-based search and recommendation primitives.

8. REFERENCES
[1] Lucene. http://lucene.apache.org/core/.
[2] Reuters RCV1. http://www.daviddlewis.com/resources/testcollections/rcv1/.
[3] Tarsos-LSH. https://github.com/JorenSix/TarsosLSH.
[4] The top 20 valuable facebook statistics. https://zephoria.com/top-15-valuable-facebook-statistics/.
[5] Twitter nasdaq. http://followthehashtag.com/datasets/nasdaq-100-companies-free-twitter-dataset/.
[6] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749, 2005.
[7] E. Agichtein, C. Castillo, D. Donato, A. Gionis, and G. Mishne. Finding high-quality content in social media. WSDM '08, pages 183–194, 2008.
[8] A. Andoni. Nearest Neighbor Search: the Old, the New, and the Impossible. PhD thesis, Massachusetts Institute of Technology, 2009.
[9] Z. Bar-Yossef and N. Kraus. Context-sensitive query auto-completion. WWW '11, pages 107–116, 2011.
[10] H. Becker, M. Naaman, and L. Gravano. Selecting quality twitter content for events. ICWSM '11, 2011.
[11] F. R. Bentley, J. J. Kaye, D. A. Shamma, and J. A. Guerra-Gomez. The 32 days of christmas: Understanding temporal intent in image search queries. CHI '16, pages 5710–5714, 2016.
[12] M. Busch, K. Gade, B. Larson, P. Lok, S. Luckenbill, and J. Lin. Earlybird: Real-time search at twitter. ICDE '12, pages 1360–1369, 2012.
[13] M. S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC '02, pages 380–388, 2002.
[14] C. Chen, F. Li, B. C. Ooi, and S. Wu. TI: An efficient indexing mechanism for real-time search on tweets. SIGMOD '11, pages 649–660, 2011.
[15] J. Chen, R. Nairn, and E. Chi. Speak little and well: Recommending conversations in online social streams. CHI '11, pages 217–226, 2011.
[16] F. Chierichetti and R. Kumar. LSH-preserving functions and their applications. In SODA '12, pages 1078–1094, 2012.
[17] M. Curtiss, I. Becker, T. Bosman, S. Doroshenko, L. Grijincu, T. Jackson, S. Kunnatur, S. Lassen, P. Pronin, S. Sankar, G. Shen, G. Woss, C. Yang, and N. Zhang. Unicorn: A system for searching the social graph. Proc. VLDB Endow., pages 1150–1161, 2013.
[18] A. S. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: Scalable online collaborative filtering. WWW '07, pages 271–280, 2007.
[19] F. Diaz. Integration of news content into web results. WSDM '09, pages 182–191, 2009.
[20] E. Gabrilovich, S. Dumais, and E. Horvitz. Newsjunkie: Providing personalized newsfeeds via analysis of information novelty. WWW '04, pages 482–490, 2004.
[21] J. Gan, J. Feng, Q. Fang, and W. Ng. Locality-sensitive hashing scheme based on dynamic collision counting. SIGMOD '12, pages 541–552, 2012.
[22] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB '99, pages 518–529, 1999.
[23] I. Guy, T. Steier, M. Barnea, I. Ronen, and T. Daniel. Swimming against the streamz: Search and analytics over the enterprise activity stream. CIKM '12, pages 1587–1591, 2012.
[24] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. STOC '98, pages 604–613, 1998.
[25] N. Kanhabua, R. Blanco, and K. Nørvåg. Temporal information retrieval. Foundations and Trends in Information Retrieval, pages 91–208, 2015.
[26] M. Kompan and M. Bielikova. Content-based news recommendation. In E-Commerce and Web Technologies, pages 61–72, 2010.
[27] H. Kwak, C. Lee, H. Park, and S. Moon. What is twitter, a social network or a news media? WWW '10, pages 591–600, 2010.
[28] J. Leskovec, A. Rajaraman, and J. D. Ullman. Mining of Massive Datasets, 2nd Ed. Cambridge University Press, 2014.
[29] L. Li, D. Wang, T. Li, D. Knox, and B. Padmanabhan. Scene: A scalable two-stage personalized news recommendation system. SIGIR '11, pages 125–134, 2011.
[30] J. Liu, P. Dolan, and E. R. Pedersen. Personalized news recommendation based on click behavior. IUI '10, pages 31–40, 2010.
[31] M. Lu, Z. Qin, Y. Cao, Z. Liu, and M. Wang. Scalable news recommendation using multi-dimensional similarity and jaccard-kmeans clustering. J. Syst. Softw., pages 242–251, 2014.
[32] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-probe LSH: Efficient indexing for high-dimensional similarity search. In VLDB '07, pages 950–961, 2007.
[33] A. Magdy, R. Alghamdi, and M. F. Mokbel. On main-memory flushing in microblogs data management systems. ICDE '16, pages 445–456, 2016.
[34] G. D. F. Morales and A. Gionis. Streaming similarity self-join. PVLDB, 9(10):792–803, 2016.
[35] R. Panigrahy. Entropy based nearest neighbor search in high dimensions. In SODA '06, pages 1186–1195, 2006.
[36] S. Petrović, M. Osborne, and V. Lavrenko. Streaming first story detection with application to twitter. HLT '10, pages 181–189, 2010.
[37] M. Slaney and M. Casey. Locality-sensitive hashing for finding nearest neighbors. IEEE Signal Processing Magazine, pages 128–131, 2008.
[38] Y. Sun, W. Wang, J. Qin, Y. Zhang, and X. Lin. SRS: Solving c-approximate nearest neighbor queries in high dimensional euclidean space with a tiny index. Proc. VLDB Endow., pages 1–12, 2014.
[39] N. Sundaram, A. Turmukhametova, N. Satish, T. Mostak, P. Indyk, S. Madden, and P. Dubey. Streaming similarity search over one billion tweets using parallel locality-sensitive hashing. Proc. VLDB Endow., 6(14):1930–1941, Sept. 2013.
[40] K. Tao, F. Abel, C. Hauff, and G.-J. Houben. Twinder: A search engine for twitter streams. ICWE '12, pages 153–168, 2012.
[41] Y. Tao, K. Yi, C. Sheng, and P. Kalnis. Efficient and accurate nearest neighbor and closest pair search in high-dimensional space. ACM Trans. Database Syst., pages 20:1–20:46, 2010.
[42] J. Teevan, D. Ramage, and M. R. Morris. #twittersearch: A comparison of microblog search and web search. WSDM '11, pages 35–44, 2011.
[43] R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. VLDB '98, pages 194–205, 1998.
[44] L. Wu, W. Lin, X. Xiao, and Y. Xu. LSII: An indexing structure for exact real-time search on microblogs. ICDE '12, pages 482–493, 2013.
[45] J. Yang and J. Leskovec. Patterns of temporal variation in online media. WSDM '11, pages 177–186, 2011.
[46] H. Yin, B. Cui, J. Li, J. Yao, and C. Chen. Challenging the long tail recommendation. Proc. VLDB Endow., pages 896–907, 2012.