Subscription Indexes for Web Syndication Systems

Subscription Indexes for Web Syndication Systems Zeinab Hmedeh Harris Kourdounakis Vassilis Christophides CEDRIC Lab. - CNAM FORTH/ICS, Univ. of Crete FORTH/ICS, Univ. of Crete Paris, France Heraklion, Greece Heraklion, Greece [email protected] [email protected] [email protected] Cedric du Mouza Michel Scholl ∗ Nicolas Travers CEDRIC Lab. - CNAM CEDRIC Lab. - CNAM CEDRIC Lab. - CNAM Paris, France Paris, France Paris, France [email protected] [email protected] [email protected] ABSTRACT information collectors and content generators themselves. In The explosion of published information on the Web leads to this context, Web syndication formats such as RSS or Atom the emergence of a Web syndication paradigm, which trans- emerge as a popular mean for timely delivery of frequently forms the passive reader into an active information collec- updated Web content. According to these formats, informa- tor. Information consumers subscribe to RSS/Atom feeds tion publishers provide brief summaries (textual snippets) of and are notified whenever a piece of news (item) is pub- the content they deliver on the Web [8], called information lished. The success of this Web syndication now offered on items, while information consumers subscribe to a number Web sites, blogs, and social media, however raises scalability of RSS/Atom feeds (i.e., channels) and get informed about issues. There is a vital need for efficient real-time filtering newly published items. Today, almost every personal we- methods across feeds, to allow users to follow effectively per- blog, news portal, discussion forum, user group or social sonally interesting information. We investigate in this paper media (e.g., Facebook, Twitter, Flickr) on the Web employs three indexing techniques for users’ subscriptions based on RSS/Atom feeds. Given that the amount and diversity inverted lists or on an ordered trie. We present analytical of the information generated on a daily basis in Web 2.0 is models for memory requirements and matching time and we unprecedented, there is a vital need for efficient real-time fil- conduct a thorough experimental evaluation to exhibit the tering methods across feeds which allow users to effectively impact of critical workload parameters on these structures. follow personally interesting information. For these reasons, we advocate a content-based Publish/Subscribe paradigm for Web 2.0 syndication in which information consumers are de- Categories and Subject Descriptors coupled (in both space and time) from information providers H.3 [Information Systems]: Miscellaneous and they can express their interest to specific information items using content-based subscriptions. Keyword-based General Terms subscriptions will be matched on the fly against the content of incoming items originating from different feeds. Performance To efficiently check whether all keywords of a subscription also appear in an incoming item (i.e., broad match seman- Keywords tics) we need to index the subscriptions. Count-based (CI) and Tree-based (TI) are two main indexing schemes proposed Pub/sub, subscription indexing in the literature for counting explicitly vs implicitly the number of contained keywords. The majority of related data 1. INTRODUCTION structures [15, 7, 1] cannot be employed for conjunctions Web 2.0 technologies have transformed the Web from of keywords (rather than attribute-value pairs) due to the a publishing-only environment into a vibrant information space high-dimensionality. In this paper, we are interested place where yesterday’s passive readers have become active in efficient implementations of both indexing schemes using Inverted Lists (IL) [23] for CI and a variant for distinct terms ∗Michel Scholl passed away on Nov 15th, 2011, too early of Ordered Tries (OT) [11] for TI and study their behavior after a short battle with cancer. We would like to express for critical parameters of realistic Web syndication work- our gratitude for everything Michel offer us on a personal loads. Although these data structures have been employed and professional level during our long-lasting collaboration. to evaluate broad match queries in the context of selective information dissemination [21] and sponsored search [12] or for mining frequent Item sets [3, 14], their memory and match- Permission to make digital or hard copies of all or part of this work for ing time requirements appear to be quite different in our personal or classroom use is granted without fee provided that copies are setting. This is due to the peculiarities of Web syndication not made or distributed for profit or commercial advantage and that copies systems which are characterized [8] (a) by information items bear this notice and the full citation on the first page. To copy otherwise, to of average length (25-36 distinct terms) which are greater republish, to post on servers or to redistribute to lists, requires prior specific than advertisement bids (4-5 terms [12]) and smaller than permission and/or a fee. documents of Web collections (12K terms [21]) (b) by very EDBT 2012, March 26–30, 2012, Berlin, Germany. Copyright 2012 ACM 978-1-4503-0790-1/12/03 ...$10.00 large vocabularies of terms (up to 1.5M terms) Note also, 312 that due to broad match semantics Information Retrieval is worth noting that in reality, VS may diverge from VI sig- techniques for optimizing ILs (e.g., early pruning [23]) are nificantly. In this context, a match occurs if and only if all not suited in our setting. of the terms (keywords) of a subscription s are also present A detailed analysis of Trie structures has not been con- in a news item I (i.e., broad match semantics). ducted in the past while the Ordered Trie usage has been discouraged in Pub/Sub systems due to the prohibiting performance exhibited in other application areas studied in re- Table 1: Example of keyword based subscriptions lated work (e.g. document filtering in [22]). In our work we Subscription S1(t1 ∧ t2 ∧ t4) S2(t1 ∧ t3) S3(t1 ∧ t2 ∧ t5) are going one step forward in identifying real setting param- (Terms) S4(t2 ∧ t4) S5(t1 ∧ t3 ∧ t6) eters under which Trie structures became competitive. In a Consider the set of subscriptions S illustrated in Table 1. nutshell, the main contributions of this work are: Matching item I = {t2, t4} against S will result in the set of (1) In Section 2 we consider three index structures imple- matched subscriptions SM = {S4} since t2 and t4 of S4 are menting different counting techniques for pruning as early contained in I. A naive matching approach consists in test- as possible non matching subscriptions to an incoming item. ing whether the terms of every subscription are contained The first two implement the CI scheme and rely on an In- in the incoming news item. Clearly, this naive solution does verted List (IL) of terms to the subscriptions that contain not scale to millions of subscriptions. For this reason, in- them. The Count-based Inverted List (CIL) variant stores dex structures have to be found which allow to prune as each subscription into the ILs of all the terms it contains early as possible non matching subscriptions. A widely used while the Ranked-key Inverted List (RIL) only once in the structure is the inverted list (IL) which maintains an inverse IL of its least frequent term. We finally consider an Ordered mapping from terms tj to the subscriptions s that contain Trie (OT) of distinct terms implementing TI with factoriza- them. It essentially confines the original search space only tion of common subscription prefixes while and a variation to subscriptions containing at least a term present within that compacts paths of unary nodes called POT. the item being matched. In particular, two variants of IL (2) Section 3 provides a detailed probabilistic analysis of namely the Count-based Inverted List and the Ranked-key the size and number of nodes visited during matching of the Inverted List are studied in this paper. Additionally an Or- three indices which takes into account the term occurrence dered Trie structure is considered which exploits the term distribution and the distribution of subscription lengths. To subset relations between subscriptions. the best of our knowledge an analysis of OT matching time has not been previously reported in the literature. Count-based Inverted List (3) In Section 4 we conduct a thorough experimental eval- Count-based Inverted List (CIL) is essentially a mapping dic- 1 uation of CIL, RIL and POT using generated subscriptions tionary whose key is a term tj ∈ VS and value the corre- 2 of terms appearing in a real RSS/Atom testbed of items [8]. sponding posting list P ostings(tj ), i.e. the set of subscrip- To the best of our knowledge, this is the first study investi- tions that contain the term. Furthermore, to implement gating (a) how critical workload parameters, such as terms broad match semantics, an additional structure has to be distribution, size of the vocabulary and lengths of subscrip- maintained: a counter per subscription keeps track of the tions affect the morphology of OT, i.e., the level of achieved number of remaining terms to be matched for a given sub- factorization and (b) their scalability and performance in re- scription. The structure that maps every subscription s ∈ S alistic settings (e.g. for 100M of subscriptions with 100K of to the number of remaining terms to be tested before re- distinct terms and real distribution of terms in items). porting a matching is denoted by Counter. Related work is presented in Section 5 while a brief sum- Figure 1(a) depicts the CIL index for the example of Ta- mary along with plans for future work are given in Section 6. ble 1. The posting list associated with t2 is P ostings(t2) = {S1,S3,S4}.

Load more