Rete + DBMS = Efficient Continuous Profile Matching on Large-Volume

ARGUS: Rete + DBMS = E±cient Continuous Pro¯le Matching on Large-Volume Data Streams Chun Jin, Jaime Carbonell Language Technologies Institute School of Computer Science Carnegie Mellon University 5000 Forbes Ave. Pittsburgh, PA 15213 fcjin, [email protected] July 6, 2004 Abstract Most existing data stream management projects focus on applications in which answers are generated at a fairly high rate. We identify an important sub-class of data stream applications, Stream Anomaly Monitoring Systems (SAMS), in which answers are generated very infrequently (i.e. the underlying SQL queries have extremely high selectivity), but may generate very high-urgency alerts when they do match. We exploit this property to produce signi¯cant optimization via an extension of the Rete algorithm, minimizing intermediate join sizes. A SAMS monitors transactions or other structured data streams and generates alerts when anomalies or potential hazards are detected. SAMS applications include but are not limited to potential ¯nancial crime monitoring, epidemiology tracking for ¯rst signs of new outbreaks, and potential fraud analysis on stock market transaction streams. An important secondary objective is to build the SAMS on top of mature DBMS technology, to preserve full DBMS functionality and to permit full modularization of functions, guaranteeing cross-DBMS portability. The di±cult part is to accomplish this objective while maximizing performance. But, the very-high-selectivity nature of SAMS applications permits us to achieve both objectives: e±ciency and full DBMS compatibility. We introduce the ARGUS prototype SAMS, which exploits very-high-selectivity via an extension of the Rete algorithm on a DBMS. The primary Rete extensions in our implementation are User-De¯ned Join Priority and Transitivity Inference. Preliminary experiments show the e®ectiveness and e±ciency of the ARGUS approach over a standard Oracle DBMS. We further investigate several new techniques for Rete optimization: computation sharing, incremental aggregation, time window detection, and index selection. Keywords: Stream Data, Continuous Query, Rete, Incremental Query Evaluation, Transitivity Inference, Database. Acknowledgements: This work was funded by the Advanced Research and Development Activity as part of the Novel Intelligence from Massive Data program with contract NMA401-02-C-0033. The views and conclu- sions are those of the authors, not of the U.S. government or its agencies. We would like to thank Chris Olston for his helpful suggestions and comments, and Phil Hayes, Bob Frederking, Eugene Fink, Cenk Gazen, Dwight Dietrich, Ganesh Mani, Aaron Goldstein, and Johny Mathew for helpful discussions. 1 Contents 1 Introduction 3 2 Overview of Stream Anomaly Monitoring Systems 3 2.1 SAMS Characteristics . 4 2.2 SAMS Query Examples . 6 3 Designs of Stream Anomaly Monitoring Systems 9 3.1 Alternative Approaches to SAMS . 9 3.2 Adapted Rete Algorithm . 10 4 ARGUS Pro¯le System Design 13 4.1 Database Design . 13 4.2 Restrictions on the SQL . 15 4.3 Translating SQL Queries into Rete Networks . 17 4.3.1 Rete Topology Construction . 17 4.3.2 Aggregation and Union . 17 5 Improvements on Rete Network 20 5.1 User-De¯ned Join Priority . 20 5.2 Transitivity Inference . 20 6 Experimental Results 22 6.1 Experiment Setting . 22 6.1.1 Data Sets . 22 6.1.2 Queries . 22 6.2 Results Interpretation . 23 6.2.1 Aggregation . 23 6.2.2 Transitivity Inference . 24 6.2.3 Partial Rete Generation . 24 6.2.4 Using Indexes . 26 7 Related Work 27 8 Conclusion and Future Work 29 8.1 Rete Network Optimization . 29 8.2 Computation Sharing . 29 8.3 Incremental Aggregation . 29 8.4 Using Time Windows . 29 8.5 Enhancing Transitivity Inference . 29 8.6 Index Selection . 30 A A Sample Rete Network Procedure for Example 4 30 B Sample Rete Network DDLs for Example 4 34 C A Sample Rete Network Procedure for Example 5 35 D A Sample Rete Network Procedure for Example 3 35 2 1 Introduction As data processing and network infrastructure continue to grow rapidly, data stream processing quickly becomes possible, demanding, and prevalent. A Data Stream Management System (DSMS) [58] is designed to process continuous queries over data streams. Existing data stream management projects focus on applications in which answers are generated at a fairly high rate. We identify an important sub-class of data stream applications, Stream Anomaly Monitoring Systems (SAMS), in which answers are generated very infrequently (i.e. the underlying SQL queries have extremely high selectivity). We exploit this property to produce signi¯cant optimization via an extension of the Rete algorithm, minimizing intermediate join sizes. A SAMS monitors transaction data streams or other structured data streams and alert when anomalies or potential hazards are detected. The conditions of anomalies or potential hazards are formulated as continuous queries over data streams composed by experienced analysts. In a SAMS, data streams are composed of homogeneous records that record information of transactions or events. For example, a stream could be a stream of money transfer transaction records, stock trading transaction records, or in-hospital patient admission records. A SAMS is expected to periodically process thousands of continuous queries over rapidly growing streams at the daily volume of millions of records in a timely manner. Examples motivating a SAMS can be found in many domains including banking, medicine, and stock trading. For instance, given a data stream of FedWire money transfers, an analyst may want to ¯nd linkages among big money transfer transactions connected to suspected people or organizations that may invite further investigation. Given data streams from all the hospitals in a region, a SAMS may help with early alerting of potential diseases or bio-terrorist events. In a stock trading domain, connections among suspiciously high pro¯t trading transactions may draw an analyst's attention for further check whether insider information is illegally used. Comparing to traditional On-Line Transaction Processing (OLTP), a SAMS usually runs complex queries with joins and/or aggregations involving multiple facts and data records. Processing data streams dynamically and incrementally, a SAMS also distinguishes itself from On-Line Analytic Processing (OLAP). It has been well recognized that many traditional DBMS techniques are useful for stream data processing. However, traditional DBMS's by themselves are not optimal for stream applications because of performance concerns. While a DSMS bene¯ts from many well-understood DBMS techniques, such as operator implementation methods, indexing techniques, query optimization techniques, and bu®er management strategies, a traditional DBMS is not designed for stream processing. A traditional DBMS is assumed to deal with rather stable data relations and volatile queries. In a stream application, queries are rather stable, persistent or residential, and data streams are rapidly changing. It is usual to expect a stream system to cope with thousands of continuous queries and high data rates. Therefore, while utilizing traditional DBMS techniques, many general-purpose stream projects, such as STREAM, TelegraphCQ, Aurora, and NiagaraCQ, are developing stream systems from scratch. However, for speci¯c applications, such as SAMS's, it turns out that we can get away with an implementation on top of a DBMS. ARGUS is a prototype SAMS that exploits the very-high-selectivity property with the adapted Rete algorithm, and is built upon the platform of a full-ﬂedged traditional DBMS. For illustration purpose in this paper, we assume ARGUS runs in the FedWire Money Transfer domain. However, ARGUS is designed to be general enough to run with any streams. In ARGUS, a continuous query is translated into a procedural network of operators: selections, binary joins, and aggregations, which operates on streams and relations. Derived (intermediate) streams or relations are conditionally materialized as DBMS tables with old data truncated to maintain reasonable table sizes. A node of the network, or the operator, is represented as one or more simple SQL queries that perform the incremental evaluation. The whole network is wrapped as a DBMS stored procedure. Registering a continuous query in ARGUS involves two steps: create and initialize the intermediate tables, and install and compile the stored procedure. In this paper, we ¯rst identify the unique characteristics of SAMS vs. general stream processing, and compare the streams with other types of streams, such as network tra±c data, sensor data, and Internet data. Keeping the above observations and comparisons in mind, we then discuss alternative approaches to SAMS including those explored in other stream projects, and justify our choice of the ARGUS system design. We then present a detailed description of our Rete-based ARGUS design. We conclude with performance results and motivate future work. 2 Overview of Stream Anomaly Monitoring Systems Figure 1 shows the SAMS dataﬂow. Data records in streams arrive continuously. They are matched against continuous queries registered in the system, and are stored in the database. Old stream data are truncated to keep the database manageable. Analysts selectively formulate the conditions of anomalies or potential hazards as continuous 3 queries, and register them with the SAMS. Registered queries are scheduled periodical executions over the data streams, and return new results (e.g. alerts), if any. Data Streams s e ri e u Stream Q Stream

Rete + DBMS = Efficient Continuous Profile Matching on Large-Volume

Drools Expert User Guide

Reactive Rules on the Web

Production Systems and Rete Algorithm Formalisation Horatiu Cirstea, Claude Kirchner, Michael Moossen, Pierre-Etienne Moreau

Fuzzy Dynamic Discrimination Algorithms for Distributed Knowledge Management Systems

Using Cloud Computing with RETE Algorithms in a Platform As a Service (Paas) for Business Systems Development

Evaluation and Implementation of Match Algorithms for Rule-Based Multi-Agent Systems Using the Example of Jadex

Eclipse: a C-Based Inference Engine Em6edded in a Lisp Interpreter Brian Beach; Randolph Splitter Software and Systems Laboratory HPL-90-213 December, 1990

Memory Efficient Stream Reasoning on Resource-Limited Devices

292 Chapter 9 Inference in First-Order Logic

Research of the Rule Engine Based On

Rete Algorithm

Chapter 12 –Business Rules and DROOLS