The Architecture and Implementation of an Extensible Web Crawler

Jonathan M. Hsieh, Steven D. Gribble, and Henry M. Levy
Department of Computer Science & Engineering, University of Washington, Seattle, WA, USA 98195
{jmhsieh,gribble,levy}@cs.washington.edu

Abstract

Many Web services operate their own Web crawlers to discover data of interest, despite the fact that large-scale, timely crawling is complex, operationally intensive, and expensive. In this paper, we introduce the extensible crawler, a service that crawls the Web on behalf of its many client applications. Clients inject filters into the extensible crawler; the crawler evaluates all received filters against each Web page, notifying clients of matches. As a result, the act of crawling the Web is decoupled from determining whether a page is of interest, shielding client applications from the burden of crawling the Web themselves.

This paper describes the architecture, implementation, and evaluation of our prototype extensible crawler, and also relates early experience from several crawler applications we have built. We focus on the challenges and trade-offs in the system, such as the design of a filter language that is simultaneously expressive and efficient to execute, the use of filter indexing to cheaply match a page against millions of filters, and the use of document and filter partitioning to scale our prototype implementation to high document throughput and large numbers of filters. We argue that the low-latency, high selectivity, and scalable nature of our system makes it a promising platform for taking advantage of emerging real-time streams of data, such as Facebook or Twitter feeds.

1 Introduction

Over the past decade, an astronomical amount of information has been published on the Web. As well, Web services such as Twitter, Facebook, and Digg reflect a growing trend to provide people and applications with access to real-time streams of information updates. Together, these two characteristics imply that the Web has become an exceptionally potent repository of programmatically accessible data. Some of the most provocative recent Web applications are those that gather and process large-scale Web data, such as virtual tourism [33], knowledge extraction [15], Web site trust assessment [24], and emerging trend detection [6].

New Web services that want to take advantage of Web-scale data face a high barrier to entry. Finding and accessing data of interest requires crawling the Web, and if a service is sensitive to quick access to newly published data, its Web crawl must operate continuously and focus on the most relevant subset of the Web. Unfortunately, massive-scale, timely web crawling is complex, operationally intensive, and expensive. Worse, for services that are only interested in specific subsets of Web data, crawling is wasteful, as most pages retrieved will not match their criteria of interest.

In this paper, we introduce the extensible crawler, a utility service that crawls the Web on behalf of its many client applications. An extensible crawler lets clients specify filters that are evaluated over each crawled Web page; if a page matches one of the filters specified by a client, the client is notified of the match. As a result, the act of crawling the Web is decoupled from the application-specific logic of determining if a page is of interest, shielding Web-crawler applications from the burden of crawling the Web themselves.

We anticipate two deployment modes for an extensible crawler. First, it can run as a service accessible remotely across the wide-area Internet. In this scenario, filter sets must be very highly selective, since the bandwidth between the extensible crawler and a client application is scarce and expensive. Second, it can run as a utility service [17] within cloud computing infrastructure such as Amazon's EC2 or Google's AppEngine. Filters can be much less selective in this scenario, since bandwidth between the extensible crawler and its clients is abundant, and the clients can pay to scale up the computation processing selected documents.

This paper describes our experience with the design, implementation, and evaluation of an extensible crawler, focusing on the challenges and trade-offs inherent in this class of system. For example, an extensible crawler's filter language must be sufficiently expressive to support interesting applications, but simultaneously, filters must be efficient to execute. A naïve implementation of an extensible crawler would require computational resources proportional to the number of filters it supports multiplied by its crawl rate; instead, our extensible crawler prototype uses standard indexing techniques to vastly reduce the cost of executing a large number of filters. To scale, an extensible crawler must be distributed across a cluster. Accordingly, the system must balance load (both filters and pages) appropriately across machines, otherwise an overloaded machine will limit the rate at which the entire system can process crawled pages. Finally, there must be appropriate mechanisms in place to allow web-crawler applications to update their filter sets frequently and efficiently.

We demonstrate that XCrawler, our early prototype system, is scalable across several dimensions: it can efficiently process tens of millions of concurrent filters while processing thousands of Web pages per second. XCrawler is also flexible. By construction, we show that its filter specification language facilitates a wide range of interesting web-crawler applications, including keyword-based notification, Web malware detection and defense, and copyright violation detection.

An extensible crawler bears similarities to several other systems, including streaming and parallel databases [1, 10, 11, 13, 14, 19], publish-subscribe systems [2, 9, 16, 27, 31], search engines and web crawlers [8, 12, 20, 21], and packet filters [23, 25, 30, 32, 34]. Our design borrows techniques from each, but we argue that the substantial differences in the workload, scale, and application requirements of extensible crawlers mandate many different design choices and optimizations. We compare XCrawler to related systems in the related work section (Section 5), and we provide an in-depth comparison to search engines in Section 2.1.

2 Overview

To better motivate the goals and requirements of extensible crawlers, we now describe a set of web-crawler applications that we have experimented with using our prototype system. Table 1 gives some order-of-magnitude estimates of the workload that we expect these applications would place on an extensible crawler if deployed at scale, including the total number of filters each application category would create and the selectivity of a client's filter set.

                                          | keyword notification | Web malware | copyright violation | Web research
# clients                                 | ~10^6                | ~10^2       | ~10^2               | ~10^3
# filters per client                      | ~10^2                | ~10^6       | ~10^6               | ~10^2
fraction of pages that match for a client | ~10^-5               | ~10^-4      | ~10^-6              | ~10^-3
total # filters                           | ~10^8                | ~10^8       | ~10^8               | ~10^5

Table 1: Web-crawler application workloads. This table summarizes the approximate filter workloads we expect from four representative applications.

Keyword-based notification. Similar to Google Alerts, this application allows users to register keyword phrases of interest, and receive an event stream corresponding to Web pages containing those keywords. For example, users might upload a vanity filter ("Jonathan Hsieh"), or a filter to track a product or company ("palm pre"). This application must support a large number of users, each with a small and relatively slowly-changing filter set. Each filter should be highly selective, matching a very small fraction of Web pages.

Web malware detection. This application uses a database of regular-expression-based signatures to identify malicious executables, JavaScript, or Web content. New malware signatures are injected daily, and clients require prompt notification when new malicious pages are discovered. This application must support a small number of clients (e.g., McAfee, Google, and Symantec), each with a large and moderately quickly changing filter set. Each filter should be highly selective; in aggregate, roughly 1 in 1000 Web pages contain malicious content [26, 28].

Copyright violation detection. Similar to commercial offerings such as attributor.com, this application lets clients find Web pages containing their intellectual property. A client, such as a news provider, maintains a large database of highly selective filters, such as key sentences from their news articles. New filters are injected into the system by a client whenever new content is published. This application must support a moderate number of clients, each with a large, selective, and potentially quickly changing filter set.

Web measurement research. This application permits scientists to perform large-scale measurements of the Web to characterize its content and dynamics. Individual research projects would inject filters to randomly sample Web pages (e.g., sample 1 in 1000 random pages as representative of the overall Web) or to select Web pages with particular features and tags relevant to the study (e.g., select Ajax-related JavaScript keywords in a study investigating the prevalence of Ajax on the Web). This application would support a modest number of clients with a moderately sized, slowly changing filter set.

2.1 Comparison to a search engine

At first glance, one might consider implementing an extensible crawler as a layer on top of a conventional search engine. This strawman would periodically execute filters against the search engine, looking for new document matches and transmitting those to applications. On closer inspection, however, several fundamental differences between search engines and extensible crawlers, their workloads, and their performance requirements are evident, as summarized in Table 2.

                               | search engine                                                               | extensible crawler
clients                        | millions of people                                                          | thousands of applications
documents                      | ~trillion stored in a crawl database and periodically refreshed by crawler | arrive in a stream from crawler, processed on-the-fly and not stored
filters / queries              | arrive in a stream from users, processed on-the-fly and not stored         | ~billion stored in a filter database and periodically updated by applications
indexing                       | index documents                                                             | index queries
latency                        | query response time is crucial; document refresh latency is less important | document processing time is crucial; filter update latency is less important
selectivity and result ranking | queries might not be selective; result ranking is crucial for usability    | filters are assumed to be selective; all matches are sent to applications
caching                        | in-memory cache of popular query results and "important" index subset      | entire filter index is stored in memory; result caching is not relevant

Table 2: Search engines vs. extensible crawlers. This table summarizes key distinctions between the workload, performance, and scalability requirements of search engines and extensible crawlers.

Because of these differences, we argue that there is an opportunity to design an extensible crawler that will scale more efficiently and better suit the needs of its applications than a search-engine-based implementation.

In many regards, an extensible crawler is an inversion of a search engine. A search engine crawls the Web to periodically update its stored index of Web documents, and receives a stream of Web queries that it processes against the document index on-the-fly. In contrast, an extensible crawler periodically updates its stored index of filters, and receives a stream of Web documents that it processes against the filter index on-the-fly. For a search engine, though it is important to reduce the time in between document index updates, it is crucial to minimize query response time. For an extensible crawler, it is important to be responsive in receiving filter updates from clients, but for "real-time Web" applications, it is more important to process crawled documents with low latency.

There are also differences in scale between these two systems. A search engine must store and index hundreds of billions, if not trillions, of Web documents, containing kilobytes or megabytes of data. On the other hand, an extensible crawler must store and index hundreds of millions, or billions, of filters; our expectation is that filters are small, perhaps dozens or hundreds of bytes. As a result, an extensible crawler must store and index four or five orders of magnitude less data than a search engine, and it is more likely to be able to afford to keep its entire index resident in memory.

Finally, there are important differences in the performance and result accuracy requirements of the two systems. A given search engine query might match millions of Web pages. To be usable, the search engine must rely heavily on page ranking to present the top matches to users. Filters for an extensible crawler are assumed to be more selective than search engine queries, but even if they are not, filters are executed against documents as they are crawled rather than against the enormous Web corpus gathered by a search engine. All matching pages found by an extensible crawler are communicated to a web-crawler application; result ranking is not relevant.

Traditional search engines and extensible crawlers are in some ways complementary, and they can co-exist. Our work focuses on quickly matching freshly crawled documents against a set of filters; however, many applications can benefit from being able to issue queries against a full, existing Web index in addition to filtering newly discovered content.

2.2 Architectural goals

Our extensible crawler architecture has been guided by several principles and system goals:

High Selectivity. The primary role of an extensible crawler is to reduce the number of web pages a web-crawler application must process by a substantial amount, while preserving pages in which the application might have interest. An extensible crawler can be thought of as a highly selective, programmable matching filter executing as a pipeline stage between a stock Web crawler and a web-crawler application.

Indexability. When possible, an extensible crawler should trade off CPU for memory to reduce the computational cost of supporting a large number of filters. In practice, this implies constructing an index over filters to support the efficient matching of a document against all filters. One implication of this is that the index must be kept up-to-date as the set of filters defined by web-crawler applications is updated. If this update rate is low or the indexing technique used supports incremental updates, keeping the index up-to-date should be efficient.

Favor Efficiency over Precision. There is generally a tradeoff between the precision of a filter and its efficient execution, and in these cases, an extensible crawler should favor efficient execution. For example, a filter language that supports regular expressions can be more precise than a filter language that supports only conjuncts of substrings, but it is simpler to build an efficient index over the latter. As we will discuss in Section 3.2.2, our XCrawler prototype implementation exposes a rich filter language to web-crawler applications, but uses relaxation to convert precise filters into less-precise, indexable versions, increasing its scalability at the cost of exposing false positive matches to the applications.

Low Latency. To support crawler applications that depend on real-time Web content, an extensible crawler should be capable of processing Web pages with low latency. This goal suggests the extensible crawler should be architected as a stage in a dataflow pipeline, rather than as a batch or map-reduce style computation.

Scalability. An extensible crawler should scale up to support high Web page processing rates and a very large number of filters. One of our specific goals is to handle a linear increase in document processing rate with a corresponding linear increase in machine resources.

2.3 System architecture

Figure 1 shows the high-level architecture of an extensible crawler. A conventional Web crawler is used to fetch a high rate stream of documents from the Web. Depending on the needs of the extensible crawler's applications, this crawl can be broad, focused, or both. For example, to provide applications with real-time information, the crawler might focus on real-time sources such as Twitter, Facebook, and popular news sites.

Figure 1: Extensible crawler architecture. This figure depicts the high-level architecture of an extensible crawler, including the flow of documents from the Web through the system.

Web documents retrieved by the crawler are partitioned across pods for processing. A pod is a set of nodes that, in aggregate, contains all filters known to the system. Because documents are partitioned across pods, each document needs to be processed by a single pod; by increasing the number of pods within the system, the overall throughput of the system increases. Document set partitioning therefore facilitates the scaling up of the system's document processing rate.

Within each pod, the set of filters known to the extensible crawler is partitioned across the pod's nodes. Filter set partitioning is a form of sharding, and it is used to address the memory or CPU limitations of an individual node. As more filters are added to the extensible crawler, additional nodes may need to be added to each pod, and the partitioning of filters across nodes might need adjustment. Because filters are partitioned across pod nodes, each document arriving at a pod needs to be distributed to each pod node for processing. Thus, the throughput of the pod is limited by the slowest node within the pod; this implies that load balancing of filters across pod nodes is crucially important to the overall system throughput.

Each node within the pod contains a subset of the system's filters. A naïve approach to processing a document on a node would involve looping over each filter on that node serially. Though this approach would work correctly, it would scale poorly as the number of filters grows. Instead, as we will discuss in Section 3.2, we trade memory for computation by using filter indexing, relaxation, and staging techniques; this allows us to evaluate a document against a node's full filter set with much faster than linear processing time.

If a document matches any filters on a node, the node notifies a match collector process running within the pod. The collector gathers all filters that match a given document and distributes match notifications to the appropriate web-crawler application clients.

Applications interact with the extensible crawler through two interfaces. They upload, delete, or modify filters in their filter sets with the filter management API. As well, they receive a stream of notification events corresponding to documents that match at least one of their filters through the notification API. We have considered but not yet experimented with other interfaces, such as one for letting applications influence the pages that the web crawler visits.

3 XCrawler Design and Implementation

In this section, we describe the design and implementation of XCrawler, our prototype extensible crawler. XCrawler is implemented in Java and runs on a cluster of commodity multi-core x86 machines, connected by a gigabit switched network. Our primary optimization concern while building XCrawler was efficiently scaling to a large number of expressive filters.

In the rest of this section, we drill down into four aspects of XCrawler's design and implementation: the filter language it exposes to clients, how a node matches an incoming document against its filter set, how documents and filters are partitioned across pods and nodes, and how clients are notified about matches.
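Before drilling into those four aspects, the following minimal Java sketch summarizes the document flow of Section 2.3: each document is routed to exactly one pod, broadcast to every node in that pod, and matches are funneled to a per-pod collector. The types here (Document, Filter, Node, MatchCollector, Pod) are illustrative stand-ins, not XCrawler's actual classes.

    import java.util.*;
    import java.util.concurrent.*;

    // Sketch of the Figure 1 data path. All class names are illustrative.
    class Document { final String url, body; Document(String u, String b) { url = u; body = b; } }
    class Filter { final String clientId, expression; Filter(String c, String e) { clientId = c; expression = e; } }

    interface Node { List<Filter> match(Document doc); }           // holds one shard of the filter set
    interface MatchCollector { void notifyClients(Document doc, List<Filter> hits); }

    class Pod {
        private final List<Node> nodes;                            // filter set partitioned across these nodes
        private final MatchCollector collector;
        private final ExecutorService pool = Executors.newCachedThreadPool();

        Pod(List<Node> nodes, MatchCollector collector) { this.nodes = nodes; this.collector = collector; }

        // Every node must see the document, since each node holds different filters.
        void process(Document doc) throws InterruptedException, ExecutionException {
            List<Future<List<Filter>>> results = new ArrayList<>();
            for (Node n : nodes) results.add(pool.submit(() -> n.match(doc)));
            List<Filter> hits = new ArrayList<>();
            for (Future<List<Filter>> f : results) hits.addAll(f.get());   // pod speed bounded by slowest node
            if (!hits.isEmpty()) collector.notifyClients(doc, hits);
        }
    }

    class ExtensibleCrawlerFrontend {
        private final List<Pod> pods;                              // each pod holds a full replica of the filter set
        ExtensibleCrawlerFrontend(List<Pod> pods) { this.pods = pods; }

        // Document set partitioning: each crawled page goes to exactly one pod.
        void dispatch(Document doc) throws Exception {
            Pod pod = pods.get(Math.floorMod(doc.url.hashCode(), pods.size()));
            pod.process(doc);
        }
    }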
3.1 Filter language and document model

XCrawler's declarative filter language strikes a balance between expressiveness for the user and execution efficiency for the system. The filter language has four entities: attributes, operators, values, and expressions. There are two kinds of values: simple and composite. Simple values can be of several types, including byte sequences, strings, integers, and boolean values. Composite values are tuples of values.

A document is a tuple of attribute and value pairs. Attributes are named fields within a document; during crawling, each Web document is pre-processed to extract a static set of attributes and values. This set is passed to nodes and is referenced by filters during execution. Examples of a document's attribute-value pairs include its URL, the raw HTTP content retrieved by the crawler, certain HTTP headers like Content-Length or Content-Type, and, if appropriate, structured text extracted from the raw content. To support sampling, we also provide a random number attribute whose per-document value is chosen when other attributes are extracted.

A user-provided filter is a predicate expression; if the expression evaluates to true against a document, then the filter matches the document. A predicate expression is either a boolean operator over a single document attribute, or a conjunct of predicate expressions. A boolean operator expression is an (attribute, operator, value) triple, and is represented in the form:

    attribute.operator(value)

The filter language provides expensive operators such as substring and regular expression matching as well as simple operators like equalities and inequalities.

For example, a user could specify a search for the phrase "Barack Obama" in HTML files by specifying:

    mimetype.equals("text/html") & text.substring("Barack Obama")

Alternatively, the user could widen the set of acceptable documents by specifying a conjunction of multiple, less restrictive keyword substring filters:

    mimetype.equals("text/html") & text.substring("Barack") & text.substring("Obama")

Though simple, this language is rich enough to support the applications outlined previously in Section 2. For example, our prototype Web malware detection application is implemented as a set of regular expression filters derived from the ClamAV virus and malware signature database.
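The grammar above maps naturally onto a small expression tree. The Java sketch below is not XCrawler's implementation; it is a minimal rendering of (attribute, operator, value) triples and conjunctions as described in this section, with illustrative operator and class names.

    import java.util.*;

    // A document is a tuple of attribute/value pairs (Section 3.1).
    class Doc {
        final Map<String, String> attrs = new HashMap<>();
        Doc set(String name, String value) { attrs.put(name, value); return this; }
    }

    // A filter is a predicate expression: a single (attribute, operator, value)
    // triple or a conjunction of sub-expressions.
    interface Predicate { boolean matches(Doc d); }

    class AttrOp implements Predicate {
        enum Op { EQUALS, SUBSTRING, REGEX }
        final String attr; final Op op; final String value;
        AttrOp(String attr, Op op, String value) { this.attr = attr; this.op = op; this.value = value; }
        public boolean matches(Doc d) {
            String v = d.attrs.get(attr);
            if (v == null) return false;
            switch (op) {
                case EQUALS:    return v.equals(value);
                case SUBSTRING: return v.contains(value);
                case REGEX:     return java.util.regex.Pattern.compile(value).matcher(v).find();
                default:        return false;
            }
        }
    }

    class Conjunction implements Predicate {
        final List<Predicate> terms;
        Conjunction(Predicate... terms) { this.terms = Arrays.asList(terms); }
        public boolean matches(Doc d) { return terms.stream().allMatch(t -> t.matches(d)); }
    }

    class FilterExample {
        public static void main(String[] args) {
            // mimetype.equals("text/html") & text.substring("Barack Obama")
            Predicate filter = new Conjunction(
                new AttrOp("mimetype", AttrOp.Op.EQUALS, "text/html"),
                new AttrOp("text", AttrOp.Op.SUBSTRING, "Barack Obama"));
            Doc page = new Doc().set("mimetype", "text/html").set("text", "... Barack Obama visited ...");
            System.out.println(filter.matches(page));   // true
        }
    }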
3.2 Filter execution

When a newly crawled document is dispatched to a node, that node must match the document against its set of filters. As previously mentioned, a naïve approach to executing filters would be to iterate over them sequentially; unfortunately, the computational resources required for this approach would scale linearly with both the number of filters and the document crawl rate, which is severely limiting. Instead, we must find a way to optimize the execution of a set of filters.

To do this, we rely on three techniques. To maintain throughput while scaling up the number of filters on a node, we create memory-resident indexes for the attributes referenced by filters. Matching a document against an indexed filter set requires a small number of index lookups, rather than computation proportional to the number of filters. However, a high fidelity index might require too much memory, and constructing an efficient index over an attribute that supports a complex operator such as a regular expression might be intractable.

In either case, we use relaxation to convert a filter into a form that is simpler or cheaper to index. For example, we can relax a regular expression filter into one that uses a conjunction of substring operators.

A relaxed filter is less precise than the full filter from which it was derived, potentially causing false positives. If the false positive rate is too high, we can feed the tentative matches from the index lookups into a second stage that executes filters precisely but at higher cost. By staging the execution of some filters, we regain higher precision while still controlling overall execution cost. However, if the false positive rate resulting from a relaxed filter is acceptably low, staging is not necessary, and all matches (including false positives) are sent to the client. Whether a false positive rate is acceptable depends on many factors, including the execution cost of staging in the extensible crawler, the bandwidth overhead of transmitting false positives to the client, and the cost to the client of handling false positives.

3.2.1 Indexing

Indexed filter execution requires the construction of an index for each attribute that a filter set references, and for each style of operator that is used on those attributes. For example, if a filter set uses a substring operator over the document body attribute, we build an Aho-Corasick multistring search trie [3] over the values specified by filters referencing that attribute. As another example, if a filter set uses numeric inequality operators over the document size attribute, we construct a binary search tree over the values specified by filters referencing that attribute.

Executing a document against a filter set requires looking up the document's attributes against all indexes to find potentially matching filters. For filters that contain a conjunction of predicate expressions, we could insert each expression into its appropriate index. Instead, we identify and index only the most selective predicate expression; if the filter survives this initial index lookup, we can either notify the client immediately and risk false positives or use staging (discussed in Section 3.2.3) to evaluate potential matches more precisely.
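To make the indexing and staging steps concrete, here is a simplified Java sketch. It stands in a hash index over fixed-length fragments for the Aho-Corasick trie the system actually builds over variable-length substring values [3], so it only approximates the real index; the class and field names are illustrative.

    import java.util.*;

    // Sketch of indexed, staged filter execution (Sections 3.2-3.2.1). Each filter is
    // indexed under a single fixed-length key derived from its most selective substring
    // predicate; XCrawler itself uses an Aho-Corasick trie, which this hash-on-ngrams
    // stand-in only approximates.
    class StagedMatcher {
        static final int NGRAM = 32;                        // fixed key length, as in the paper's relaxations

        static class IndexedFilter {
            final String id;
            final String indexKey;                          // relaxed, indexable form
            final java.util.function.Predicate<String> fullPredicate;   // precise second stage
            IndexedFilter(String id, String key, java.util.function.Predicate<String> p) {
                this.id = id; this.indexKey = key; this.fullPredicate = p;
            }
        }

        private final Map<String, List<IndexedFilter>> index = new HashMap<>();

        void add(IndexedFilter f) {
            index.computeIfAbsent(f.indexKey, k -> new ArrayList<>()).add(f);
        }

        // Stage 1: slide a window over the document text and probe the index.
        // Stage 2: run the precise predicate only on candidates that survived stage 1.
        List<String> match(String text) {
            Set<IndexedFilter> candidates = new HashSet<>();
            for (int i = 0; i + NGRAM <= text.length(); i++) {
                List<IndexedFilter> hit = index.get(text.substring(i, i + NGRAM));
                if (hit != null) candidates.addAll(hit);
            }
            List<String> matches = new ArrayList<>();
            for (IndexedFilter f : candidates) {
                if (f.fullPredicate.test(text)) matches.add(f.id);   // staging removes false positives
            }
            return matches;
        }
    }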


Creating indexes lets us execute a large number of filters efficiently. Figure 2 compares the number of nodes that would be required in our XCrawler prototype to sustain a crawl rate of 100,000 documents per second, using either naïve filter execution or filter execution with indexing, relaxation, and staging enabled. The filter set used in this measurement consists of sentences extracted from Wikipedia articles; this emulates the workload of a copyright violation detection application. Our measurements were gathered on a small number of nodes, and projected upwards to larger numbers of nodes assuming linear scaleup in crawl rate with document set partitioning. Our prototype runs on 8 core, 2 GHz Intel processors.

Figure 2: Indexed filter execution. This graph compares the number of nodes (machines) required for a crawl rate of 100,000 documents per second when using naïve filter execution and when using indexed filter execution, including relaxation and staging.

When using indexing, relaxation, and staging, a node with 3GB of RAM is capable of storing approximately 400,000 filters of this workload, and can process documents at a rate of approximately 9,000 documents per second. To scale to 100,000 documents per second, we would need 12 pods, i.e., we must replicate the full filter set 12 times and partition incoming documents across these replicas. To scale to 9,200,000 filters, we would need roughly 23 nodes per pod with 3GB of RAM each. Thus, the final system configuration

Figure 3: Indexed and relaxed filter memory footprint. This graph compares the memory footprint of substring filters when using four different execution strategies. The average filter length in the filter set was 130 bytes, and relaxation used 32 byte ngrams. Note that the relaxed+indexed and the relaxed+indexed+staged lines overlap on the graph.

In this workload, filters are full sentences searched for as substrings within documents. Instead of searching for the full sentence, filters can be relaxed to search for an ngram extracted from the sentence (e.g., a 16 byte character fragment). This would significantly reduce the size of the in-memory index.

There are many possible ngram relaxations for a specific string; the ideal relaxation would be just as selective as the full sentence, returning no false positives. Intuitively, shorter ngrams will tend to be less selective but more memory efficient. Less intuitively, different fragments extracted from the same string might have different selectivity. Consider the string and two possible 8-byte relaxations
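The following Java sketch shows one way to derive a fixed-length relaxation from a sentence filter, under the assumption (borrowed from the evaluation) that 32-byte ngrams are used. The selectivity estimator is a hypothetical stand-in, since the selection heuristic for choosing among candidate fragments is not fully specified here.

    import java.util.function.ToDoubleFunction;

    // Relaxation (Section 3.2): replace an expensive filter (a full sentence or a
    // regular expression) with a cheap, indexable fragment. A relaxation can only
    // produce false positives, never false negatives -- staging restores precision.
    class Relaxer {
        static final int NGRAM = 32;

        // Simplest policy, used in the paper's experiments: the 32-byte prefix.
        static String prefixRelaxation(String sentenceFilter) {
            return sentenceFilter.length() <= NGRAM ? sentenceFilter : sentenceFilter.substring(0, NGRAM);
        }

        // Different fragments of the same string can have very different selectivity.
        // Given some estimator of how common a fragment is on the Web (hypothetical
        // here), pick the rarest candidate.
        static String mostSelectiveRelaxation(String sentenceFilter, ToDoubleFunction<String> estimatedDocFrequency) {
            if (sentenceFilter.length() <= NGRAM) return sentenceFilter;
            String best = null;
            double bestFreq = Double.MAX_VALUE;
            for (int i = 0; i + NGRAM <= sentenceFilter.length(); i++) {
                String candidate = sentenceFilter.substring(i, i + NGRAM);
                double freq = estimatedDocFrequency.applyAsDouble(candidate);
                if (freq < bestFreq) { bestFreq = freq; best = candidate; }
            }
            return best;
        }
    }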

Once a document has been sent to a node, it is evaluated against each filter on that node without requiring any cross-node interactions. Since a document must be evaluated against all filters known by the system, each document arriving at a pod must be transmitted to and evaluated by each node within the pod. Because of this, the document throughput that a pod can sustain is limited by the throughput of the slowest node within the pod.

Two issues substantially affect node and pod throughput. First, a filter partitioning policy that is aware of the indexing used by nodes can tune the placement of filters to drive up the efficiency and throughput of all nodes. Second, some filters are more expensive to process than others. Particularly expensive filters can induce load imbalances across nodes, driving down overall pod throughput.

Figure 5: Node throughput. This (sorted) graph shows the maximum throughput each node within a pod is capable of sustaining under two different policies: random filter placement, and alphabetic filter placement.

Figure 5 illustrates these effects. Using the same Wikipedia workload as before, this graph illustrates the maximum document throughput that each node within a pod of 24 machines is capable of sustaining, under two different filter set partitioning policies. The first policy, random, randomly places each filter on a node, while the second policy, alphabetic, sorts the substring filters alphabetically by their most selective ngram relaxation. By sorting alphabetically, the second policy causes ngrams that share prefixes to end up on the same node, improving both the memory and computation efficiency of the Aho-Corasick index. The random policy achieves good load balancing but suffers from lower average throughput than alphabetic. Alphabetic exhibits higher average throughput but suffers from load imbalance. From our measurements using the Wikipedia filter set, a 5 million filter index using random placement requires 13.9GB of memory, while a 5 million filter index using alphabetic placement requires 12.1GB, a reduction of 13%.

Figure 6: Staged evaluations vs. throughput. This graph shows the effect of increasing the partial hit rate of a filter set within a node on the maximum document throughput that node is capable of processing.

In Figure 6, we measure the relationship between the number of naïve evaluations that must be executed per document when using staged relaxation and the throughput a node can sustain. As the number of naïve executions increases, throughput begins to drop, until eventually the node spends most of its time performing these evaluations. In practice, the number of naïve evaluations can increase for two reasons. First, a given relaxation can be shared by many independent filters. If the relaxation matches, all associated filters must be executed fully. Second, a given relaxation might match a larger-than-usual fraction of incoming documents. In Section 4.2, we further quantify these two effects and propose strategies for mitigating them.

Given all of these issues, our implementation takes the following strategy to partition filters. Initially, an index-aware partitioning is chosen; for example, alphabetically-sorted prefix grouping is used for substring filters. Filters are packed onto nodes until memory is exhausted. Over time, the load of nodes within each pod is monitored. If load imbalances appear within a pod, then groups of filters are moved from slow nodes to faster nodes to rectify the imbalance. Note that moving filters from one node to another requires the indexes on both nodes to be recomputed. The expense of doing this bounds how often we can afford to rebalance.

A final consideration is that the filter set of an extensible crawler changes over time as new filters are added and existing filters are modified or removed. We currently take a simple approach to dealing with filter set changes: newly added or modified filters are accumulated in "overflow" nodes within each pod and are initially executed naïvely without the benefit of indexing. We then take a generational approach to re-indexing and re-partitioning filters: newly added and modified filters that appear to be stable are periodically incorporated into the non-overflow nodes.
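As a concrete illustration of the index-aware policy just described, the sketch below sorts filters alphabetically by their relaxation and packs them onto nodes under a per-node memory budget. The cost model and capacity figures are invented placeholders, not XCrawler's actual accounting.

    import java.util.*;

    // Index-aware filter placement (Section 3.3): sorting substring filters by their
    // most selective ngram relaxation and packing them in order keeps shared prefixes
    // on the same node, which shrinks the per-node Aho-Corasick index.
    class FilterPlacement {
        static class RelaxedFilter {
            final String id, relaxation;
            RelaxedFilter(String id, String relaxation) { this.id = id; this.relaxation = relaxation; }
        }

        // Returns, for each node, the list of filter ids assigned to it.
        static List<List<String>> alphabeticPack(List<RelaxedFilter> filters, long perNodeMemoryBudgetBytes) {
            List<RelaxedFilter> sorted = new ArrayList<>(filters);
            sorted.sort(Comparator.comparing(f -> f.relaxation));        // alphabetic by relaxation

            List<List<String>> nodes = new ArrayList<>();
            List<String> current = new ArrayList<>();
            long used = 0;
            for (RelaxedFilter f : sorted) {
                long cost = estimatedIndexCost(f);
                if (used + cost > perNodeMemoryBudgetBytes && !current.isEmpty()) {
                    nodes.add(current);                                   // node is full; start the next one
                    current = new ArrayList<>();
                    used = 0;
                }
                current.add(f.id);
                used += cost;
            }
            if (!current.isEmpty()) nodes.add(current);
            return nodes;
        }

        // Placeholder cost model: a few hundred bytes of index state per filter.
        static long estimatedIndexCost(RelaxedFilter f) {
            return 64L + 8L * f.relaxation.length();
        }
    }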
3.4 Additional implementation details

The data path of the extensible crawler starts as documents are dispatched from a web crawler into our system and ends as matching documents are collected from workers and transmitted to client crawler applications (see Figure 1). Our current prototype does not fully explore the design and implementation issues of either the dispatching or collection components.

In our experiments, we use the open source Nutch spider to crawl the web, but we modified it to store documents locally within each crawler node's filesystem rather than storing them within a distributed Hadoop filesystem. We implemented a parallel dispatcher that runs on each crawler node. Each dispatcher process partitions documents across pods, replicates documents across pod nodes, and uses backpressure from nodes to decide the rate at which documents are sent to each pod. Each pod node keeps local statistics about filter matching rates, annotates matching documents with a list of filters that matched, and forwards matching documents to one of a static set of collection nodes.

An interesting configuration problem concerns balancing the CPU, memory, and network capacities of nodes within the system. We ensure that all nodes within a pod are homogeneous. As well, we have provisioned each node to ensure that the network capacity of nodes is not a system bottleneck. Doing so required provisioning each filter processing node with two 1-gigabit NICs. To take advantage of multiple cores, our filter processing nodes use two threads per core to process documents against indexes concurrently. As well, we use one thread per NIC to pull documents from the network and place them in a queue to be dispatched to filter processing threads. We can add additional memory to each node until the cost of additional memory becomes prohibitive. Currently, our filter processing nodes have 3GB of RAM, allowing each of them to store approximately a half-million filters.

Within the extensible crawler itself, all data flows through memory; no disk operations are required. Most memory is dedicated to filter index structures, but some memory is used to queue documents for processing and to store temporary data generated when matching a document against an index or an individual filter.

We have not yet explored fault tolerance issues. Our prototype currently ignores individual node failures and does not attempt to detect or recover from network or switch failures. If a filter node fails in our current implementation, documents arriving at the associated pod will fail to be matched against filters that resided on that node. Note that our overall application semantics are best effort: we do not (yet) make any guarantees to client applications about when any specific web page is crawled. We anticipate that this will simplify fault tolerance issues, since it is difficult for clients to distinguish between failures in our system and the case that a page has not yet been crawled. Adding fault tolerance and strengthening our service guarantees is a potentially challenging future engineering topic, but we do not anticipate needing to invent fundamentally novel mechanisms.

3.5 Future considerations

There are several interesting design and implementation avenues for the extensible crawler. Though they are beyond the scope of this paper, it is worth briefly mentioning a few of them. Our system currently only indexes textual documents; in the future, it would be interesting to consider the impact of richer media types (such as images, videos, or flash content) on the design of the filter language and on our indexing and execution strategy. We currently consider the crawler itself to be a black box, but given that clients already specify content of interest to them, it might be beneficial to allow clients to focus the crawler on certain areas of the Web of particular interest. Finally, we could imagine integrating other streams of information into our system besides documents gathered from a Web crawler, such as real-time "firehoses" produced by systems such as Twitter.

4 Evaluation

In this section, we describe experiments that explore the performance of the extensible crawler and investigate the effect of different filter partitioning policies. As well, we demonstrate the need to identify and reject non-selective filters. Finally, we present early experience with three prototype Web crawler applications.

All of our experiments are run on a cluster of 8-core, 2GHz Intel Xeon machines with 3GB of RAM, dual gigabit NICs, and a 500 GB 7200-RPM Barracuda ES SATA hard drive. Our systems are configured to run a 32bit Linux kernel, version 2.6.22.9-91.fc7, and to use Sun's 32bit JVM, version 1.6.0_12, in server mode. Unless stated otherwise, the filter workload for our experiments consists of 9,204,600 unique sentences extracted from Wikipedia; experiments with relaxation and staging used 32 byte prefix ngrams extracted from the filter sentences.

For our performance oriented experiments, we gathered a 3,349,044 Web document crawl set on August 24th, 2008 using the Nutch crawler and pages from the DMOZ open directory project as our crawl seed. So that our experiments were repeatable, when testing the performance of the extensible crawler we used a custom tool to stream this document set at high throughput, rather than re-crawling the Web. Of the 3,349,044 documents in our crawl set, 2,682,590 contained textual content, including HTML and PDF files; the rest contain binary content, including images and executables. Our extensible crawler prototype does not yet notify wide-area clients about matching documents; instead, we gather statistics about document matches, but drop the matching documents instead of transmitting them.
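Before turning to the individual experiments, a brief sketch of the backpressure-driven dispatch loop described in Section 3.4: documents are pushed to pod nodes through bounded queues, so a slow node throttles the crawler-side sender rather than being overrun. The queue types and sizes are illustrative, not the prototype's actual dispatcher.

    import java.util.*;
    import java.util.concurrent.*;

    // Sketch of the Section 3.4 dispatcher with backpressure via bounded queues.
    class BackpressureDispatcher {
        private final List<BlockingQueue<String>> podNodeQueues;    // one bounded queue per node in the pod

        BackpressureDispatcher(int nodes, int capacityPerNode) {
            podNodeQueues = new ArrayList<>();
            for (int i = 0; i < nodes; i++) podNodeQueues.add(new ArrayBlockingQueue<>(capacityPerNode));
        }

        // Replicate the document to every node in the pod; blocks when a node falls
        // behind, which in turn slows the rate at which the crawler feeds this pod.
        void dispatchToPod(String document) throws InterruptedException {
            for (BlockingQueue<String> q : podNodeQueues) {
                q.put(document);                                     // blocking put == backpressure
            }
        }

        // Each filter-processing thread on a node drains its queue.
        String takeForNode(int node) throws InterruptedException {
            return podNodeQueues.get(node).take();
        }
    }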



4.1 Nutch vs. the extensible crawler

In Section 2.1, we described architectural, workload, and expected performance differences between the extensible crawler and an alternative implementation of the service based on a conventional search engine. To demonstrate these differences quantitatively, we ran a series of experiments directly comparing the performance of our prototype to an alternative implementation based on the Lucene search engine, version 2.1.0 [7].

The construction of a Lucene-based search index is typically performed as part of a Nutch map-reduce pipeline that crawls web pages, stores them in the HDFS distributed filesystem, builds and stores an index in HDFS, and then services queries by reading index entries from HDFS. To make the comparison of Lucene to our prototype more fair, we eliminated overheads introduced by HDFS and map-reduce by modifying the system to store crawled pages and indexes in nodes' local filesystems. Similarly, to eliminate variation introduced by the wide-area Internet, we spooled our pre-crawled Web page data set to Lucene's indexer or to the extensible crawler over the network.

The search engine implementation works by periodically constructing an index based on the N most recently crawled web pages; after constructing the index and partitioning it across nodes, each node evaluates the full filter set against its index fragment. The implementation uses one thread per core to evaluate filters. By increasing N, the implementation indexes less frequently, reducing overhead, but suffers from a larger latency between the downloading of a page by the crawler and the evaluation of the filter set against that page. In contrast, the extensible crawler implementation constructs an index over its filters once, and then continuously evaluates pages against that index.

In Figure 7, we compare the single node throughput of the two implementations, with 400,000 filters from the Wikipedia workload. Note that the x-axis corresponds to N, but this parameter only applies to the Lucene crawler. The extensible crawler implementation has nearly two orders of magnitude better performance than Lucene; this is primarily due to the fact that Lucene must service queries against its disk-based document index, while the extensible crawler's filter index is served out of memory. As well, the Lucene implementation is only able to achieve asymptotic performance if it indexes batches of N > 1,000,000 documents.

Figure 7: Lucene vs. extensible crawler. This graph compares the document processing rates of a single-node extensible crawler and a single-node Lucene search engine. The x-axis displays the number of documents crawled between reconstructions of the Lucene index. Note that the y-axis is logarithmically scaled.

Our head-to-head comparison is admittedly still unfair, since Lucene was not optimized for fast, incremental, memory-based indexing. Also, we could conceivably bridge the gap between the two implementations by using SSD drives instead of spinning platters to store and serve Lucene indexes. However, our comparison serves to demonstrate some of the design tensions between conventional search engines and extensible crawlers.

4.2 Filter partitioning and blacklisting

As mentioned in Section 3.3.2, two different aspects of a filter set contribute to load imbalances between otherwise identical machines: first, a specific relaxation might be shared by many different filters, causing a partial hit to result in commensurately many naïve filter executions, and second, a given relaxation might match a large number of documents, also causing a large number of naïve filter executions. We now quantify these effects.

We call a set of filters that share an identical relaxation a collision set. A collision set of size 1 implies the associated filter's relaxation is unique, while a collision set of size N implies that N filters share a specific relaxation. In Figure 8, we show the distribution of collision set sizes when using a 32-byte prefix relaxation of the Wikipedia filter set. The majority of filters (8,434,126 out of 9,204,600) have a unique relaxation, but some relaxations collide with many filters. For example, the largest collision set size was 35,585 filters. These filters all shared the prefix "The median income for a household in the"; this sentence is used in many Wikipedia articles describing income in cities, counties, and other population centers. If a document matches against this relaxation, the extensible crawler would need to naïvely execute all 35,585 filters.

Figure 8: Collision set size. This histogram shows the distribution of collision set sizes when using 32-byte ngrams over the Wikipedia filter set.

Along a similar vein, some filter relaxations will match a larger-than-usual fraction of documents. One notably egregious example from our Web malware detection application was a filter whose relaxation was a particular 32-character sequence. Unsurprisingly, a very large fraction of Web pages contain this substring!

To deal with these two sources of load imbalance, our implementation blacklists specific relaxations. If our implementation notices a collision set containing more than 100 filters, we blacklist the associated relaxation and compute alternate relaxations for those filters. In the case of our Wikipedia filter set, this required modifying the relaxation of only 145 filters. As well, if our implementation notices that a particular relaxation has an abnormally high document partial hit rate, that "hot" relaxation is blacklisted and new filter relaxations are chosen.

Blacklisting rectifies these two sources of load imbalance. Figure 9 revisits the experiment previously illustrated in Figure 5, but with a new line that shows the effect of blacklisting on the distribution of document processing throughputs across the nodes in our cluster. Without blacklisting, alphabetic filter placement demonstrates significant imbalance. With blacklisting, the majority of the load imbalance is removed, and the slowest node is only 17% slower than the fastest node.

Figure 9: Node throughput. This (sorted) graph shows the maximum throughput each node within a pod is capable of sustaining under three different policies: random filter placement, alphabetic filter placement, and alphabetic filter placement with blacklisting.
4.3 Experiences with Web crawler applications

To date, we have prototyped three Web crawler applications: vanity alerts that detect pages containing a user's name, a malware detection application that finds Web objects that match ClamAV's malware signature database, and a copyright violation detection service that looks for pages containing copies of Reuters or Wikipedia articles. Table 3 summarizes the high-level features of these applications and their filter workloads; we now discuss each in turn, relating additional details and anecdotes.

                                                 | Vanity alerts | Copyright violation | Malware detection
# filters                                        | 10,622        | 251,647             | 3,128
relaxation only: doc hit rate                    | 68.98%        | 0.664%              | 45.38%
relaxation only: false +ves per doc              | 15.76         | 0.0386              | 0.851
relaxation only: throughput (docs/s)             | 7,244         | 8,535               | 8,534
relaxation only: # machines needed               | 13.80         | 11.72               | 11.72
with relaxation and staging: doc hit rate        | 13.1%         | 0.016%              | 0.009%
with relaxation and staging: throughput (docs/s) | 592           | 8,229               | 6,354
with relaxation and staging: # machines needed   | 168.92        | 12.15               | 15.74

Table 3: Web crawler application features and performance. This table summarizes the high-level workload and performance features of our three prototype Web crawler applications.

4.3.1 Vanity alerts

For our vanity filter application, we authored 10,622 filters based on names of university faculty and students. Our filters were constructed as regular expressions of a first name followed by a last name, with the constraint of no more than 20 characters separating the two parts. The filters were first relaxed into a conjunct of substring 32-grams, and from there the longest substring conjunct was selected as the final relaxation of the filter.

This filter set matched against 13.1% of documents crawled. This application had a modest number of filters, but its filter set nonetheless matched against a large fraction of Web pages, violating our assumption that a crawler application should have highly selective filters. Moreover, when using relaxed filters without staging, there were many additional false partial hits (an average of 15.76 per document, and an overall document hit rate of 69%). Most false hits were due to names that are contained in commonly found words, such as Tran, Chang, or Park.

The lack of high selectivity of its filter set leads us to conclude that this application is not a good candidate for the extensible crawler. If millions of users were to use this service, most Web pages would likely match and need to be delivered to the crawler application.

4.3.2 Copyright violation detection

For our second application, we prototyped a copyright violation detection service. We evaluated this application by constructing a set of 251,647 filters based on 30,534 AP and Reuters news articles appearing between July and October of 2008. Each filter was a single sentence extracted from an article, but we extracted multiple filters from each article. We evaluated the resulting filter set against a crawl of 3.68 million pages.

Overall, 590 crawled documents (0.016%) matched against the AP/Reuters filter set, and 619 filters (0.028% of the filter set) were responsible for these matches. We manually determined that most matching documents were original news articles or pages that quoted sections of articles with attribution. We did find some sites that appeared to contain unauthorized copies of entire news stories, and some sites that plagiarized news stories by integrating the story body but replacing the author's byline.

If a document hit, it tended to hit against a single filter (50% of document hits were for a single filter). A smaller number of documents hit against many sentences (13% of documents matched against more than 6 filters). Documents that matched against many filters tended to contain full copies of the original news story, while documents that matched a single sentence tended to contain boilerplate prose, such as a specific author's byline, legal disclosures, or common phrases such as "The officials spoke on condition of anonymity because they weren't authorized to release the information."
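A sketch of how such a filter set could be constructed, with one substring filter per extracted sentence; the sentence splitter and the minimum-length cutoff are simple assumptions, since the extraction rules are not specified in the paper.

    import java.util.*;

    // Building a copyright-violation filter set (Section 4.3.2): each news article
    // contributes one substring filter per sufficiently long sentence.
    class CopyrightFilterBuilder {
        static final int MIN_SENTENCE_LENGTH = 40;   // skip fragments too short to be selective

        static List<String> sentenceFilters(String articleText) {
            List<String> filters = new ArrayList<>();
            for (String sentence : articleText.split("(?<=[.!?])\\s+")) {
                String s = sentence.trim();
                if (s.length() >= MIN_SENTENCE_LENGTH) filters.add(s);
            }
            return filters;
        }
    }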
4.3.3 Web malware detection

The final application we prototyped was Web malware detection. We extracted 3,128 text-centric regular expressions from the February 20, 2009 release of the ClamAV open-source malware signature database. Because many of these signatures were designed to match malicious JavaScript or HTML, some of their relaxations contain commonly occurring substrings, such as

5 Related work

The extensible crawler is related to several classes of systems: Web crawlers and search engines, publish-subscribe systems, packet filtering engines, parallel and streaming databases, and scalable Internet content syndication protocols. We discuss each in turn.

Web crawlers and search engines. The engineering issues of high-throughput Web crawlers are complex but well understood [21]. Modern Web crawlers can retrieve thousands of Web pages per second per machine. Our work leverages existing crawlers, treating them as a black box from which we obtain a high throughput document stream. The Mercator project explored the design of an extensible crawler [20], though Mercator's notion of extensibility is different than ours: Mercator has well-defined APIs that simplify the job of adding new modules that extend the crawler's set of network protocols or type-specific document processors. Our extensible crawler permits remote third parties to dynamically insert new filters into the crawling pipeline.

Modern Web search engines require complex engineering, but the basic architecture of a scalable search engine has been understood for more than a decade [8]. Our extensible crawler is similar to a search engine, but inverted, in that we index queries rather than documents. As well, we focus on in-memory indexing for throughput. Cho and Rajagopalan described a technique for supporting fast indexing of regular expressions by reducing them to ngrams [12]; our notion of filter relaxation is a generalization of their approach. Though the service is now discontinued, Ama-

6 Conclusions

The evaluation of XCrawler, our early prototype system, focused on scaling issues with respect to its number of filters and crawl rate. Using techniques from related work, we showed how we can support rich, expressive filters using relaxation and staging techniques. As well, we used microbenchmarks and experiments with application workloads to quantify the impact of load balancing policies and confirm the practicality of our ideas. Overall, we believe that the low-latency, high selectivity, and scalable nature of our system makes it a promising platform for many applications.

Acknowledgments

We thank Tanya Bragin for her early contributions, Raphael Hoffman for his help in gathering the Wikipedia filter set, and Paul Gauthier and David Richardson for their helpful feedback on drafts of this paper. We also thank our anonymous reviewers and our shepherd, Stefan Savage, for their guidance. This work was supported in part by an NDSEG Fellowship, by the National Science Foundation under grants CNS-0132817, CNS-0430477 and CNS-0627367, by the Torode Family Endowed Career Development Professorship, by the Wissna-Slivka Chair, and by gifts from Nortel Networks and Intel Corporation.

References

[1] D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik. Aurora: a new model and architecture for data stream management. The VLDB Journal, 12(2):120–139, 2003.
[2] M. K. Aguilera, R. E. Strom, D. C. Sturman, M. Astley, and T. D. Chandra. Matching events in a content-based subscription system. In Proceedings of the Eighteenth Annual ACM Symposium on Principles of Distributed Computing (PODC '99), Atlanta, GA, 1999.
[3] A. V. Aho and M. J. Corasick. Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6):333–340, 1975.
[4] M. Altinel and M. J. Franklin. Efficient filtering of XML documents for selective dissemination of information. In Proceedings of the 26th International Conference on Very Large Data Bases (VLDB '00), San Francisco, CA, 2000.
[7] Apache Lucene. http://lucene.apache.org/.
[8] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh International Conference on the World Wide Web (WWW7), Brisbane, Australia, 1998.
[9] A. Carzaniga, D. S. Rosenblum, and A. L. Wolf. Achieving scalability and expressiveness in an Internet-scale event notification service. In Proceedings of the Nineteenth Annual ACM Symposium on Principles of Distributed Computing (PODC '00), Portland, OR, 2000.
[10] J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable continuous query system for Internet databases. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX, May 2000.
[11] M. Cherniack, H. Balakrishnan, M. Balazinska, D. Carney, U. Çetintemel, Y. Xing, and S. Zdonik. Scalable distributed stream processing. In Proceedings of the First Biennial Conference on Innovative Data Systems Research (CIDR '03), Asilomar, CA, January 2003.
[12] J. Cho and S. Rajagopalan. A fast regular expression indexing engine. In Proceedings of the 18th International Conference on Data Engineering (ICDE '02), San Jose, CA, February 2002.
[13] G. Copeland, W. Alexander, E. Boughter, and T. Keller. Data placement in Bubba. ACM SIGMOD Record, 17(3):99–108, 1988.
[14] D. J. DeWitt, S. Ghandeharizadeh, D. Schneider, A. Bricker, H.-I. Hsiao, and R. Rasmussen. The Gamma database machine project. IEEE Transactions on Knowledge and Data Engineering, 2(1):44–62, March 1990.
[15] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th International Conference on the World Wide Web, New York, NY, May 2004.
[16] F. Fabret, A. Jacobsen, F. Llirbat, J. Pereira, K. Ross, and D. Shasha. Filtering algorithms and implementation for very fast publish/subscribe systems. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, Santa Barbara, CA, May 2001.
[17] R. Geambasu, S. D. Gribble, and H. M. Levy. CloudViews: Communal data sharing in public clouds. In Proceedings of the Workshop on Hot Topics in Cloud Computing (HotCloud '09), San Diego, CA, June 2009.
[18] Google. Google Alerts. http://www.google.com/alerts.
[19] E. N. Hanson, C. Carnes, L. Huang, M. Konyala, L. Noronha, S. Parthasarathy, J. Park, and A. Vernon. Scalable trigger processing. In Proceedings of the 15th International Conference on Data Engineering, Sydney, Australia, March 1999.
[20] A. Heydon and M. Najork. Mercator: A scalable, extensible web crawler. World Wide Web, 2(4):219–229, 1999.
[21] H.-T. Lee, D. Leonard, X. Wang, and D. Loguinov. IRLbot: Scaling to 6 billion pages and beyond. In Proceedings of the 17th International World Wide Web Conference, Beijing, China, April 2008.
[22] Y. Liu and B. Plale. Survey of publish subscribe event systems. Technical Report TR574, Indiana University, May 2003.
[23] A. Majumder, R. Rastogi, and S. Vanama. Scalable regular expression matching on data streams. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Vancouver, Canada, June 2008.
[24] McAfee, Inc. McAfee SiteAdvisor. http://www.siteadvisor.com/.
[25] S. McCanne and V. Jacobson. The BSD packet filter: a new architecture for user-level packet capture. In Proceedings of the USENIX Winter 1993 Conference, San Diego, CA, January 1993.
[26] A. Moshchuk, T. Bragin, S. Gribble, and H. Levy. A crawler-based study of spyware on the Web. In Proceedings of the 13th Annual Network and Distributed Systems Security Symposium (NDSS '06), San Diego, CA, February 2006.
[27] J. Pereira, F. Fabret, F. Llirbat, and D. Shasha. Efficient matching for web-based publish/subscribe systems. In Proceedings of the 7th International Conference on Cooperative Information Systems (CoopIS 2000), Eilat, Israel, September 2000.
[28] N. Provos, P. Mavrommatis, M. A. Rajab, and F. Monrose. All your iFRAMEs point to us. In Proceedings of the 17th USENIX Security Symposium, San Jose, CA, July 2008.
[29] I. Rose, R. Murty, P. Pietzuch, J. Ledlie, M. Roussopoulos, and M. Welsh. Cobra: Content-based filtering and aggregation of blogs and RSS feeds. In Proceedings of the 4th USENIX/ACM Symposium on Networked Systems Design and Implementation (NSDI 2007), Cambridge, MA, April 2007.
[30] S. Rubin, S. Jha, and B. P. Miller. Protomatching network traffic for high throughput network intrusion detection. In Proceedings of the 13th ACM Conference on Computer and Communications Security (CCS '06), October 2006.
[31] B. Segall, D. Arnold, J. Boot, M. Henderson, and T. Phelps. Content based routing with Elvin4. In Proceedings of the 2000 AUUG Annual Conference (AUUG2K), Canberra, Australia, June 2000.
[32] R. Smith, C. Estan, and S. Jha. XFA: Faster signature matching with extended automata. In Proceedings of the 2008 IEEE Symposium on Security and Privacy (Oakland '08), Oakland, CA, May 2008.
[33] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: Exploring photo collections in 3D. In Proceedings of the 33rd International Conference and Exhibition on Computer Graphics and Interactive Techniques (SIGGRAPH '06), Boston, MA, July 2006.
[34] R. Sommer and V. Paxson. Enhancing byte-level network intrusion detection signatures with context. In Proceedings of the 10th ACM Conference on Computer and Communications Security (CCS '03), October 2003.
[35] T. W. Yan and H. Garcia-Molina. Index structures for selective dissemination of information under the boolean model. ACM Transactions on Database Systems, 19(2):332–364, June 1994.