From Complex Event Processing to Simple Event Processing
Sylvain Hallé
Laboratoire d'informatique formelle, Université du Québec à Chicoutimi, Canada

Abstract

Many problems in Computer Science can be framed as the computation of queries over sequences, or “streams”, of data units called events. The field of Complex Event Processing (CEP) relates to the techniques and tools developed to efficiently process these queries. However, most CEP systems developed so far have concentrated on relatively narrow types of queries, consisting of sliding windows, aggregation functions, and simple sequential patterns computed over events that have a fixed tuple structure. Many of them boast high throughput, but in return are difficult to set up and cumbersome to extend with user-defined elements. This paper describes a variety of use cases taken from real-world scenarios that present features seldom considered in classical CEP problems. It also provides a broad review of current solutions, including tools and techniques that go beyond those covered in typical surveys on CEP. From a critical analysis of these solutions, design principles for a new type of event stream processing system are presented. The paper proposes a simple, generic and extensible framework for the processing of event streams of diverse types; it describes in detail a stream processing engine, called BeepBeep, that implements these principles. BeepBeep's modular architecture, which borrows concepts from many other systems, is complemented with an extensible query language, called eSQL. The end result is an open, versatile, and reasonably efficient query engine that can be used in situations that go beyond the capabilities of existing systems.

Keywords: event processing, software testing, query languages, runtime verification

1. Introduction

Event streams have become an important part of the mass of data produced by computing systems. They can be generated by a myriad of sources such as sensors [1–3], business process logs [4], instrumented software [5–7], financial transactions [8], healthcare systems [9], and network packet captures [10]. The ability to collect and process these event streams can be put to good use in fields as diverse as software testing, data mining, and compliance auditing.

Event stream processing typically involves computations that go beyond the evaluation of simple functions on individual events. Of prime importance is the possibility to perform correlations between events, either at multiple moments in time within a single stream, or even between events taken from different event streams. The term Complex Event Processing (CEP) has been coined to refer to computations of this nature. One of the goals of CEP is to create aggregated (i.e. “complex”) events using data fetched from one or more lower-level events [11]. This computation can be executed in cascade, with the output streams of one process becoming the input streams of the next, leading to events of increasingly higher levels of abstraction. Section 2 starts this paper by presenting a wide range of examples taken from domains as varied as bug detection in video games and network intrusion detection.

This potent concept has spawned an impressive amount of work in the past twenty years. As we will see in Section 3, there exist literally dozens of competing systems claiming the CEP label, ranging from academic proofs-of-concept to commercial cloud frameworks such as Apache Spark or Microsoft Azure. These systems are based on a commensurate number of research papers, technical reports and buzzword-laden whitepapers introducing a plethora of incompatible formalizations of the problem. This reveals that CEP has never been a single problem, but rather a family of related problems sharing a relatively blurry common ground. Many of these systems and frameworks, however, have in common the fact that they deserve the epithet “complex”. They often rely on an intricate definition of seemingly simple concepts; some of them do not even state these concepts formally, making their available implementation a de facto specification. Many of their built-in query languages are quirky outgrowths of SQL, whose syntax seldom preserves backward compatibility for operations identical to those found in relational databases. Almost none of them provides comprehensive means for extending their language syntax with user-defined constructs. Finally, some suffer from high setup costs, requiring hours if not days of arcane configuration editing and boilerplate code to run even the smallest example.

While it is obvious that convincing use cases motivate the existence of these systems, the current state of things leaves a potential user between two uncomfortable extremes: embrace a complex Event Processing system, with all its aforementioned shortcomings, or do without and fall back to low-level scripting languages, such as Perl or Python, to write menial trace-crunching tasks. What seems to be missing is a “Simple Event Processing” engine, in the same way that a spreadsheet application like Microsoft Excel is often a satisfactory middle ground between a pocket calculator and a full-blown accounting system. Such a system should provide higher abstraction than hand-written scripts, an easy-to-understand computational model, zero-configuration operation, and reasonable performance for light- to medium-duty tasks.

This paper presents a detailed description of such a Simple Event Processing engine, called BeepBeep. In Section 4, we first describe the fundamental design principles behind the development of this system. Some of these principles purposefully distance themselves from trends followed by current CEP solutions, and are aimed at making the intended system both simpler and more versatile. Section 5 then formally describes BeepBeep's computational model. This simple formalization completely describes the system's semantics, making it possible for alternate implementations to be independently developed.

One of the key features of BeepBeep is its associated query language, called eSQL, which is described in Section 6. Substantial effort has been put into making eSQL simple and coherent; in accordance with BeepBeep's design principles, it strives towards relational transparency, meaning that queries performing computations similar to relational database operations are written in a syntax that is backwards-compatible with SQL. Section 7 then describes the various means of extending BeepBeep's built-in functionalities. A user can easily develop new event processing units in a handful of lines of code and, most importantly, define arbitrary grammatical extensions to eSQL to use these custom elements inside queries. Extensions can be bundled in dynamically-loaded packages called palettes; we describe a few of the available palettes, allowing BeepBeep to manipulate network captures, tuples, plots, and temporal logic operators, among others.

Equipped with these constructs, Section 8 then proceeds to showcase BeepBeep's functionalities. An experimental comparison of BeepBeep's performance with respect to a selection of other CEP engines is detailed in Section 9. To the best of our knowledge, this is the first published account of such an empirical benchmark of CEP engines on the same input data.

2. Use Cases for Event Stream Processing

Complex Event Processing (CEP) can loosely be defined as the task of analyzing and aggregating data produced by event-driven information systems [11]. A key feature of CEP is the possibility to correlate events from multiple sources, occurring at multiple moments in time. Information extracted from these events can be processed, and lead to the creation of new, “complex” events made of that computed data. This stream of complex events can itself be used as the source of another process, and be aggregated and correlated with other events.

Event processing distinguishes between two modes of operation. In online (or “streaming”) mode, input events are consumed by the system as they are produced, and output events are progressively computed and made available. It is generally assumed that the output stream is monotonic: once an output event is produced, it cannot be “taken back” at a later time. In contrast, in offline (or “batch”) mode, the contents of the input streams are completely known in advance (for example, by being stored on disk or in a database). Whether a system operates online or offline sometimes matters: for example, offline computation may take advantage of the fact that events from the input streams may be indexed, rewound or fast-forwarded on demand. Recently, the hybrid concept of “micro-batching” has been introduced in systems like Apache Spark Streaming (cf. Section 3.1.7); it is a special case of batch processing with very small batch sizes.

Guarantees on the delivery of events in a CEP system can also vary. “At most once” delivery entails that every event is sent to its intended recipient at most one time, but may also be lost. “At least once” delivery ensures reception of the event, but at the potential cost of duplication, which must then be handled by the receiver. In between is perfect event delivery, where reception of each event is guaranteed without duplication. These concepts generally matter only for distributed event processing systems, where communication links between nodes may involve loss and latency.

In the following, we proceed to describe a few scenarios where event streams are produced and processed.

2.1. Stock Ticker

A recurring scenario used in CEP to illustrate the performance of various tools is taken from the stock market [12]. One considers a stream of stock quotes, where each event contains attributes such as a stock symbol, the price of the stock at various moments (such as its minimum price and closing price), as well as a timestamp. A typical stream of events of this nature is shown in Figure 1. This figure shows that events are structured as tuples, with a fixed set of attributes, each of which takes a scalar value.
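To make this tuple structure concrete, the following minimal Java sketch models a ticker event as a plain record and evaluates a trivial per-event query over a hand-made stream. The type name (StockTick), the attribute names, and the data values are invented for illustration; they are not taken from Figure 1 and do not use BeepBeep's or eSQL's actual API.

    import java.util.List;

    public class TickerExample {

        // One event of the stream: a fixed set of attributes, each holding a scalar value
        record StockTick(long timestamp, String symbol, double minPrice, double closingPrice) {}

        public static void main(String[] args) {
            // A short, invented stream of ticker events (illustrative values only)
            List<StockTick> stream = List.of(
                new StockTick(1, "MSFT", 21.30, 21.75),
                new StockTick(1, "GOGL", 50.10, 50.25),
                new StockTick(2, "MSFT", 21.70, 21.95));

            // A stateless query: print the timestamp and symbol of every event whose
            // closing price exceeds its minimum price by more than 2%
            stream.stream()
                  .filter(t -> t.closingPrice() > 1.02 * t.minPrice())
                  .forEach(t -> System.out.println(t.timestamp() + " " + t.symbol()));
        }
    }

Note that this per-event filtering represents the easy end of the spectrum: the CEP queries discussed in the remainder of the paper correlate multiple events across time (and across streams), which a stateless filter of this kind cannot express.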