Distributed Complex Event Processing with Query Rewriting
Total Page:16
File Type:pdf, Size:1020Kb
Distributed Complex Event Processing with Query Rewriting Nicholas Poul Matteo Migliavacca Peter Pietzuch Schultz-Møller Department of Computing Department of Computing Department of Computing Imperial College London Imperial College London Imperial College London United Kingdom United Kingdom United Kingdom [email protected] [email protected] [email protected] ABSTRACT is first stored and indexed and only queried afterwards. The nature of data in enterprises and on the Internet is In contrast, complex event processing (CEP) systems re- changing. Data used to be stored in a database first and gard new data (or updates to existing data) as a stream of queried later. Today timely processing of new data, repre- events [19]. Users can specify queries as high-level patterns sented as events, is increasingly valuable. In many domains, of events. The CEP system detects complex events matching complex event processing (CEP) systems detect patterns of these queries and notifies clients of matches in soft real-time. events for decision making. Examples include processing of For example, a credit card processing company may monitor environmental sensor data, trades in financial markets and fraudulent credit card use by specifying patterns that detect RSS web feeds. Unlike conventional database systems, most concurrent transactions caused by the same credit card at current CEP systems pay little attention to query optimi- geographically distant locations. sation. They do not rewrite queries to more efficient rep- Detecting complex patterns in high-rate event streams re- resentations or make decisions about operator distribution, quires substantial CPU resources. To make a CEP system limiting their overall scalability. scale in the number of concurrently detectable patterns, it needs additional computational resources by distributing de- This paper describes the Next CEP system that was espe- cially designed for query rewriting and distribution. Event tection across a set of machines or use the existing resources patterns are specified in a high-level query language and, be- more efficiently through query rewriting. However, the ma- fore being translated into event automata, are rewritten in jority of current CEP systems only support execution on a a more efficient form. Automata are then distributed across single machine and the semantics of their query languages a cluster of machines for detection scalability. We present make automated query rewriting challenging. algorithms for query rewriting and distributed placement. In this paper, we describe the design, implementation and Our experiments on the Emulab test-bed show a significant evaluation of the Next CEP system that focusses on dis- improvement in system scalability due to rewriting and dis- tributed event detection with query rewriting. The system tribution. uses an expressive automata-based approach for complex event detection. Users specify event patterns in a high-level, SQL-like language with six operators that are translated to 1. INTRODUCTION low-level event automata. The language facilitates rewriting Many application domains need timely processing of new of expressions into equivalent ones and the automata model data [8]. For example, a financial application may moni- eases deployment of detection operators across multiple ma- tor a continuous stream of credit card transactions for fraud chines. The system makes optimisation and placement deci- patterns. In environmental sciences, sensor networks output sions according to cost functions derived from the resource infinite sequences of measurements that may be processed consumption of event automata. Guided by the cost model, to respond quickly to physical events, such as an increase in our system rewrites queries containing two commonly used monitored pollution levels. On the Internet, RSS feeds from operators, \next" and \union", to reduce their CPU con- weblogs generate vast amounts of real-time information that sumption. The system also greedily selects distributed de- may be analysed to infer opinions of many people. In gen- ployment plans to perform event processing on a cluster of eral, the requirement of instant processing of new data is at machines while reusing already-existing operators. odds with the conventional database model, as implemented Our evaluation on the Emulab networks test-bed shows by database management systems (DBMSs), in which data that query rewriting increases the supported number of queries with the \next" operator by up to 100% and with the \union" operator of up to 20%. We also demonstrate how distribu- tion enables the system to scale linearly. Permission to make digital or hard copies of all or part of this work for In summary, the main contributions of this paper are: personal or classroom use is granted without fee provided that copies are 1.A high-level event pattern language based on a new ex- not made or distributed for profit or commercial advantage and that copies pressive event automata model that is suitable for query bear this notice and the full citation on the first page. To copy otherwise, to rewriting and distribution (x3); republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. 2.A cost model for queries that allows automated optimi- DEBS’09, July 6-9, Nashville, TN, USA. sation (x4.2); Copyright 2009 ACM X-XXXXX-XX-X/XX/XX ...$5.00. 3. Algorithms for query rewriting into more efficient rep- Case for distributed detection. As estimated above, the resentations (x4.3) and for distributed query deployment data rate from a single source is relatively small. However, with operator reuse (x4.4); if a query correlates transaction events from several differ- 4. The design and implementation of a prototype CEP sys- ent sources, then hundreds of thousands of events have to tem in Erlang illustrating the above techniques (x5) and be processed per second. In this case, the CPU resources of an evaluation on the Emulab test-bed (x6). a single machine may become a bottleneck and limit system scalability. When the system runs out of CPU resources, it 1.1 Detection of Credit Card Fraud has to buffer events or discard them. Buffering events only We describe an application scenario | the detection of provides temporary relief and is only effective under tran- credit card fraud | that illustrates the scalability chal- sient overload conditions. A distributed CEP system with lenges in CEP and shows how efficient distribution can pro- query optimisation can distribute detection queries across vide relief. Credit card fraud has been committed for many machines and reduce the cost of individual queries by rewrit- years and remains a major problem. Just in the US alone, ing them into more efficient forms. 9:91 million people were victims of credit card fraud in 2007 with a total loss of US$ 52:6 billion [18]. Fraud typically 2. RELATED WORK happens when the credit card owner uses the card in a shop. During the payment, an employee can easily run the card A range of systems have been used to detect events us- through a card reader without the owner's knowledge. Later ing different detection approaches. This section provides an the copied credit card information is sold to other criminals overview of previous work, particularly focusing on propos- who manufacture a new card and commit the actual fraud. als for query optimisation in event processing. There are several patterns indicating such behaviour in Complex event processing systems. The goal of CEP credit card transaction logs, which can be detected by a systems is the fast detection of event patterns in streams. CEP system. A fast detection of suspicious patterns en- Specification languages for event patterns are frequently in- ables banks to contact customers early in the fraud or even spired by regular languages and therefore have automata- automatically alert users about suspicious credit card trans- based semantics [21]. In addition to commercial CEP sys- actions by sending text messages to their mobile phones. A tems, such as ruleCore CEP Server, Coral8 Engine and Es- typical fraud pattern is that the criminal starts by testing per, several open research prototypes exist. the credit card with a few small purchases followed by larger Cayuga [14, 13] is a high performance, single server CEP purchases. In our high-level query language described in x3, system developed at Cornell University. Event streams are this can be expressed as follows: infinite sequences of relational tuples with interval-based timestamps. Its event algebra has six operators: projec- SELECT * FROM ( t S1 ; t S2 ; t L ) tion, selection, renaming, union, conditional sequence and WHERE FILTER(S1.acc = S2.acc), FILTER(S2.acc = L.acc), iteration. Event algebra expressions are detected by non- S1.amount < 100 AND S2.amount < 100 AND deterministic finite automata, which can detect unbounded L.amount > 250 AND (S1, L) OCCURS WITHIN 12 hours sequences. To achieve high performance, Cayuga uses cus- Properties of data sources. We make the following as- tom heap management, indexing of operator predicates and sumptions about the workload of a CEP system for credit reuse of shared automata instances. However, Cayuga does card fraud detection. The system processes transactions not support automated query rewriting and distributed de- from multiple sources. The sources are credit card process- tection. The distribution of Cayuga automata is compli- ing companies responsible for a given credit card brand in a cated by the fact that it merges event algebra expressions given region, such as Northeastern US. into a single automaton, which would have to be partitioned We take credit card transactions made with VisaTM in the across nodes for distribution. US as an example. According to the Visa website [26], they The DistCED system [24] is a Java-based CEP frame- processed 27:612 billion transactions in 2007. This means work that uses extended, non-deterministic finite state au- an average of 875 credit transactions per second based on tomata.