Towards an Efficient Online Causal-Event-Pattern-Matching Framework

Sukanta Pramanik, David Taylor, Bernard Wong
David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada
Email: {spramanik, dtaylor, bernard}@uwaterloo.ca

Abstract—Event monitoring and logging, that is, recording communication events between processes, is a critical component in many highly reliable distributed systems. The event logs enable the identification of certain safety-condition violations, such as race conditions and mutual-exclusion violations, as safety is generally contingent on processes communicating in a specific causally ordered pattern. Previous efforts at finding such patterns have often focused on offline techniques, which are unable to identify operational problems as they occur. Online monitoring tools exist, but they are often restricted to identifying a specific violation condition, such as a deadlock or a race condition, using dedicated data structures. We address the more general problem of detecting causally related event patterns that can be used to identify various undesired behaviours in the system.

The main challenge for online pattern matching is the need to store the partial matches to the pattern, as they may combine with future events to form a complete match. Unlike pattern matching in most other domains, causally ordered patterns can span a potentially unbounded number of events, and efficiently searching through this large collection poses a significant challenge.

In this paper, we introduce OCEP, an efficient online causal-event-pattern-matching framework that bounds the number of partial matches it stores by reporting only a representative subset of pattern matches. We define a subset of matches as representative if it has at least one occurrence of each event in the pattern on each process, which is applicable for a large class of distributed applications. With this definition, OCEP introduces a backtracking algorithm to efficiently find a representative subset from the history of events. An evaluation of the framework shows that OCEP is capable of handling several frequently occurring violation patterns at the event rates of some representative distributed applications.

Index Terms—Causal Ordering, Event-Based System, Distributed System, Distributed System Monitoring.

I. Introduction

Distributed application development has, in the past few years, seen a significant resurgence due to the popularity of large-scale Web services that rely on distributed application backends. These distributed backends are designed to scale horizontally to utilize the vast quantity of resources available in modern datacenters. Supporting such large-scale deployments, however, introduces additional uncertainty and complexity to these distributed applications, which already have complicated communication patterns to support sophisticated and demanding Web services. These factors make it incredibly difficult to reason about the correctness of a modern distributed application.

One common approach to help with understanding the runtime behaviour of a distributed application is to track its execution over time in order to capture its runtime-state information. The collected information is then used to detect whether a property is satisfied or violated in the global state. This approach of global-predicate detection is a well-studied problem and is based on building a lattice of global states [12], which is known to be NP-complete [29].

The semantics intended by the programmer for an application is also closely related to a linearized history of the system [18]. A linearized history is a finite sequence of events that consists of either invocations of operations or responses to operations. System behaviour can then be analyzed by monitoring for a pattern of events that represents some undesired behaviour, such as bugs, misuse, or intrusions. For example, in a traffic-light system, a correctness condition is that lights in only one direction may be green in the global state. Alternatively, this problem can be modeled as a sequence of events between the lights. An event-matching-based approach monitors the events e_i that denote light i has turned green and then searches for a pattern that represents two events e_i and e_j happening concurrently. A match to this pattern signifies that the system is in an unsafe state. Similar patterns can also be used to identify mutual-exclusion violations in a distributed application even if the actual global state observed by the system is correct [36].

In this paper, we introduce OCEP, an efficient online framework for matching causal event-patterns. We define an event as a state transition in the system, often the result of receiving or sending a message, and a causal event-pattern as a distributed sequence of causally ordered events representing a complex interaction in the system. Matching a causal event-pattern allows us to reason about the correctness of many distributed systems without requiring the global state.

Although the use of causal event-pattern matching is significantly less expensive than tracking the global state, it still requires storing a substantial amount of intermediate information, representing partial pattern matches, to match arbitrary causal relationships. This affects the runtime performance in two different ways. Firstly, when monitoring a set of arbitrarily long-running processes, the amount of memory required to store the partial matches may also become arbitrarily large. Secondly, the monitoring algorithm needs to traverse this large collection of partial matches on each event in order to extend them to a total match.

For these reasons, previous attempts [31, 41] at causal event-pattern matching have targeted post-mortem analysis. Although post-mortem analysis is a valuable tool for identifying system problems, it does not help service providers resolve operational problems as they occur. The ability to diagnose problems in a distributed application in real time can significantly help in reducing the impact of problems on the service providers.

One possible approach to providing online causal event matching is to maintain a time-based sliding window and discard the partial matches that lie outside the window [3, 15]. While this is a simple and effective solution in some cases, there are applications, such as intrusion and anomaly detection, that require event-patterns that span a long interval.

An alternative approach is to restrict the pattern-matching queries to limit the search space. For example, there are existing tools that focus on Select-Project-Join (SPJ) queries, also known as conjunctive queries with arithmetic comparisons [5], which are sufficient for applications that only require reporting the existence of matches. These queries are not sufficiently expressive to determine the participating processes in the query results. Therefore, for applications, such as intrusion-detection systems, that require such information, an additional post-mortem tool is necessary to determine the participating processes.

In contrast, OCEP does not restrict the matching time interval or the event-pattern, but instead provides a representative subset of matches for a given causal event-pattern. A representative subset includes each event in the pattern that has occurred anywhere in the system, if such an event exists and is part of an event-pattern match. By limiting the number of matches, OCEP can perform online event-pattern matching while retaining only a bounded number of partial matches. More importantly, finding a representative subset is sufficient for a large class of applications. We illustrate this using some of the key concurrency-bug patterns in distributed systems such as deadlock, race condition, atomicity violation, and ordering bug. We show through simulations that OCEP can efficiently perform causal event-pattern matching on these sample patterns; OCEP requires less than 1 ms to detect a safety violation in most of the test cases.

The rest of this paper is organized as follows. We begin in Section II by reviewing related work. Section III provides our framework for causality in partial-order event data in distributed systems. Section IV details our causal-pattern-matching algorithm. Section V reports our experimental results, and we conclude in Section VI.

II. Related Work

A distributed application can be modeled as a set of unbounded data streams that are constantly generating new events. While a typical query searches for a set of correlated data that have equivalence between values in one or more columns, in event-based pattern search the correlation is based on precedence relationships between events or sets of events. The conventional wisdom for stream query processing has been to restrict the pattern-matching queries so as to handle only conjunctive queries with arithmetic comparisons [5]. Another approach is to constrain the search space by using a sliding window and to report only the matches that fall within it [3]. We address the problem of finding the constituent events that match a pattern while limiting only the number of matches that are reported.

Dynamic analysis of program behaviour through instrumentation is also a well-studied field. Often it is focused on identifying specific concurrency errors, e.g., to detect deadlocks [2], data races [35], message races [32], or atomicity violations [40]. We address the problem of detecting causally related event patterns that are more generic in nature and can be used to match various undesired behaviours in a system.

There is a large collection of literature that focuses on post-mortem analysis by parsing the logs of instrumented events [7, 31, 34, 41] or passively monitored communication [4]. We see our work as a complementary tool that can be used alongside post-mortem analysis tools like log-based or replay-based analysis. A user may identify a runtime safety violation using our tool and then restrict offline analysis, for in-depth checking, to the particular traces that are involved.

There also exist replay-based tools for checking runtime properties by replaying the instances of a distributed system [17, 24]. The benefit of these approaches is their ability to deterministically replay the previous execution and reproduce a bug. Unfortunately, saving and replaying an entire execution of a large system is prohibitively expensive. Also, replay-based debuggers are not constrained to react in a timely manner, as an online monitor needs to.

D3S [25] and P2 [37] allow developers to specify predicates on distributed properties of a deployed system. Both of them use temporal causality, which may sometimes indicate a potential ordering relationship when none is present. Our work is built using vector clocks that accurately encode potential causality between events [10].

III. Event-Based Monitoring

We model our distributed system as a finite set of n sequential processes P1, P2, ..., Pn communicating only by message passing. The processes do not share memory and, furthermore, there is no global clock or perfectly synchronized local clocks. Each process executes a local algorithm that controls its behaviour by changing its local state and sending messages to the other processes. The occurrences of these actions performed by the local algorithm are called events. The coordinated execution of all these local algorithms running concurrently is what we call a distributed computation.

An online monitoring system is a set of processes that dynamically collects information about an application and analyzes it in some meaningful way as that application executes. For a proper understanding of the application, it is essential to determine the causal relationship between the events that occur during its execution.

Monitoring a distributed computation often involves checking whether a certain property holds at a certain instant. This requires detecting the global state of the system, which is defined as a collection of the local states of all processes at that particular instant.

As each process Pi is sequential, the events that occur on a single process are totally ordered. Since there is no shared clock, we can only observe a partial order between events that occur on different processes [22]. Distributed systems thus have an inherent nondeterminism, because a set of concurrent or causally independent events may occur in any order, possibly yielding different results in each case [36]. Detecting a global predicate thus involves maintaining an n-dimensional state lattice [12] or progressively tracking consistent global checkpoints [6].

In contrast, we focus on the events which represent transitions of states. If the correct order in which the events should occur within a system can be specified by a predicate, then detecting the violation of this relative causal order would also signify that the resulting system state is incorrect. Our work focuses on developing an online search facility that, given a causal pattern of events, returns information to the user when an interaction is found that matches the specified pattern.

A. Specifying an Event

An event is an activity of interest in the target application. It is the smallest building block and as such is also called a primitive event. We specify a class of events in our pattern language as a 3-tuple:

class-id := [process, type, text]

Classes are assigned ids which are later used to form compound events. The attributes can be specified by providing the process on which the event occurs, the type of the event, and a text field. These attributes can be specified for an exact match, left empty as a wild-card, or used as a variable to enforce an equality comparison in an operator.

Our algorithm is built on top of an existing tool, POET (Section V-A), which monitors instrumented events from a target system and can send them to a client as a linearization of the partial order. POET stores the events grouped by trace, where a trace is equivalent to any relevant entity with sequential behaviour, such as a process or a thread, but may also include passive entities such as an object or a communication channel.

In order to determine the causal relationship between two events we use vector timestamps [14, 28]. Given two events a (on Pi) and b (on Pj) and their timestamps Va and Vb, we can check whether a happens before b or b happens before a with at most two integer comparisons:

a → b ⟺ Va[i] < Vb[i]

If neither a → b nor b → a holds, two more integer comparisons between process numbers and event numbers are needed to distinguish between equality and concurrency.

B. Specifying a Pattern

Event classes are used with operators and connectors to build a pattern representing a compound event, which is a non-empty set of causally related primitive events. In the rest of the paper we use uppercase letters for classes of events specified in the pattern and lowercase letters for specific occurrences of matches to an event class. A compound event is represented by uppercase letters in boldface. For example, a compound event M can be written as a pattern A → B and can be used to find pairs a, b where a matches the specification of event class A, b matches the specification of event class B, and a → b. Figure 1 lists all the operators in our pattern language. In the figure, a and a′ match event class A and b matches event class B.

Operator      Meaning
A → B         Event a happens before event b
A ‖ B         Event a is concurrent with event b
A · B         a and b are partner events in a point-to-point communication
A →lim B      a happens before b and there is no a′ such that a → a′ ∧ a′ → b

Fig. 1: Causality Operators for Patterns.

The causality relationship between compound events is defined by the causal relations between their constituent primitive events. This leads to the definition of strong and weak precedence of compound events, first defined by Lamport [23]. For any compound events A and B, strong precedence is defined as

A ⇒ B ⟺ ∀a ∈ A, ∀b ∈ B : a → b

A causality framework with strong precedence and concurrency leaves out a large number of pairs of compound events which have some primitive events in one preceding some primitive events in the other [8]. This observation inspires another definition of causality among compound events. For any compound events A and B, weak precedence is defined as

A → B ⟺ ∃a ∈ A, ∃b ∈ B : a → b

Weak precedence allows the natural definition of concurrency: two compound events are concurrent if and only if they are unrelated by precedence. Unfortunately, weak precedence contradicts the partial-order properties because it is possible that, when it is used, a compound event happens simultaneously before and after another primitive or compound event. Nichols argued that the causality framework needs to be extended to fully classify all possible pairs of compound events [31]. For any compound events A and B,

A overlaps B ⟺ A ∩ B ≠ ∅
A is disjoint from B ⟺ A ∩ B = ∅
A crosses B ⟺ (∃a0, a1 ∈ A, ∃b0, b1 ∈ B : a0 → b0 ∧ b1 → a1) ∧ (A is disjoint from B)

These three definitions can be used to define a new operator (↔) to recognize entanglement of two compound events and to modify the definitions of precedence and concurrence:

A ↔ B ⟺ A crosses B ∨ A overlaps B    (1)
A → B ⟺ (∃a ∈ A, ∃b ∈ B : a → b) ∧ ¬(A ↔ B)    (2)
A ‖ B ⟺ ∀a ∈ A, ∀b ∈ B : a ‖ b    (3)

With the inclusion of entanglement (↔), given any two event sets A and B, their relationship can be described by exactly one of four relationships: A → B, B → A, A ‖ B, or A ↔ B. We use the framework proposed by Nichols, and in this work precedence and concurrence are defined by (2) and (3) above.
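To make the causality tests of Sections III-A and III-B concrete, the sketch below implements the happens-before check with vector timestamps together with the compound-event relations of equations (1)-(3). It is our own illustrative code, not part of OCEP or POET: we assume Fidge/Mattern-style clocks in which an event's entry for its own trace counts the event itself, so the primitive test uses "less than or equal" rather than the strict inequality above, whose exact form depends on the monitor's timestamping convention.

```python
from dataclasses import dataclass
from typing import Set

@dataclass(frozen=True)
class Ev:
    trace: int       # index of the trace (process) on which the event occurred
    vclock: tuple    # vector timestamp; vclock[trace] counts the event itself

def happens_before(a: Ev, b: Ev) -> bool:
    """a -> b iff b already knows about a on a's own trace."""
    return a != b and a.vclock[a.trace] <= b.vclock[a.trace]

def concurrent(a: Ev, b: Ev) -> bool:
    return a != b and not happens_before(a, b) and not happens_before(b, a)

# Compound events are non-empty sets of primitive events.
def strong_precedes(A: Set[Ev], B: Set[Ev]) -> bool:
    return all(happens_before(a, b) for a in A for b in B)

def weak_precedes(A: Set[Ev], B: Set[Ev]) -> bool:
    return any(happens_before(a, b) for a in A for b in B)

def overlaps(A: Set[Ev], B: Set[Ev]) -> bool:
    return bool(A & B)

def crosses(A: Set[Ev], B: Set[Ev]) -> bool:
    return (not overlaps(A, B)
            and weak_precedes(A, B)
            and weak_precedes(B, A))

def entangled(A: Set[Ev], B: Set[Ev]) -> bool:        # equation (1)
    return crosses(A, B) or overlaps(A, B)

def precedes(A: Set[Ev], B: Set[Ev]) -> bool:          # equation (2)
    return weak_precedes(A, B) and not entangled(A, B)

def concurrent_sets(A: Set[Ev], B: Set[Ev]) -> bool:   # equation (3)
    return all(concurrent(a, b) for a in A for b in B)
```

With these helpers, exactly one of precedes(A, B), precedes(B, A), concurrent_sets(A, B), and entangled(A, B) holds for any pair of compound events, mirroring the four-way classification above.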

C. Variable Binding

If we build a pattern such as A → B ∧ A → C, the pattern does not specify that the two occurrences of event class A need to match the same event a. Thus it can happen that the matcher returns two events a1 and a2 that separately match the two constraints. We can use variables in the pattern to specify that, once a matched event is bound to a variable, the same matched event must be used at all occurrences of that variable in the pattern.

class-A $A;
pattern := $A → B ∧ $A → C;

For the above pattern, $A defines a variable of class-A and, once bound to a matching event a, it remains the same for the rest of the pattern.

D. Motivating Example

An ordering bug refers to the situation where the desired ordering between groups of events is violated. This violation of order is known to be the one common concurrency bug that is not addressed by existing debuggers [27]. One example of this type of bug is bug #962 [1] of ZooKeeper [19]. ZooKeeper is a coordination service for distributed processes that achieves high availability through replication. It uses an active-replication technique where a follower sends synchronization requests to the leader for a snapshot of the system. When a restarting follower sent a synch request to the leader, the leader was not blocked from making an update after it took a snapshot of the system. Thus a restarting follower could occasionally receive inconsistent service data from the leader.

A pattern definition consists of class definitions, declarations of variables (if any), and then the pattern itself.

Synch := [$1, Synch_Leader, $2];
Snapshot := [$2, Take_Snapshot, ''];
Update := [$2, Make_Update, ''];
Forward := [$2, Take_Snapshot, $1];
Snapshot $Diff;
Update $Write;
pattern := (Synch → $Diff) ∧ ($Diff → $Write) ∧ ($Write → Forward);

The pattern above tries to detect the situation where a snapshot taken on a synch request is followed by an update before it gets forwarded to the follower. The variable binding for the events ensures the proper causal order. It also shows how variables can be used inside the class definition for better precision. The text field can be used for various purposes, and this particular pattern uses it to encode the corresponding trace for a particular Synch/Forward pair.

IV. Online Event-Pattern Monitoring

We use a tree-based mechanism because a tree closely matches the causality structure specified in a pattern.

A. Pattern Tree

The specified pattern is first parsed to create a pattern tree as shown in Figure 2. The leaf nodes represent the primitive events in the pattern and the internal nodes represent the compound-event expressions.

Fig. 2: The Structure of the Pattern Tree. (Each leaf node carries the event attributes Type, Order, and History; the example tree shown represents the pattern P := (A → B) ‖ (C → D).)

Each leaf node has three main attributes:
• Type specifies the event class for the primitive event.
• Order defines the order of evaluation.
• History is the list of matched primitive events grouped by traces.

Every time POET reports an event that matches a leaf node of the pattern tree, it is added to the corresponding leaf node's history of events. This history is grouped by traces and is totally ordered for each individual trace. The OCEP algorithm is then triggered, which tries to build a complete match at the root in a bottom-up fashion. It uses backtracking, and the iterative stages for each leaf node are stored in order. Thus the runtime of the matching algorithm is only affected by the events that are actually in the pattern, not by all the events that are being monitored.

B. Specifying a Representative Subset

Event-pattern search executes on a set of potentially unbounded processes, and thus the amount of memory required to report all possible matches may also grow without bound. We also have to report these matches in a timely manner to provide an effective online monitor.

In Section II we discussed existing approaches that tackle this by limiting the pattern's expressiveness or its event domain. A restricted pattern with aggregate operators (e.g., average, count, maximum) can provide valuable information, but it only gives a summarized view of the system. If a system requirement can be specified as a pattern of events, finding the specific set of events that violates this pattern, and where they occur, provides further insight into the behaviour of the system. On the other hand, the sliding-window-based approaches that restrict the event domain are susceptible to omission problems: when no matches are found, it may also be because a match spans multiple windows.
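The bookkeeping that Section IV-A attaches to each leaf node can be pictured with a small sketch. This is illustrative Python of our own; the names PatternLeaf and add_event are hypothetical, and a real leaf would also evaluate the process and text attributes, wild-cards, and the variable bindings of Section III-C.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Event:
    trace: int        # trace (process, thread, or other sequential entity)
    index: int        # position of the event within its trace
    etype: str        # event type, e.g. "Take_Snapshot"
    text: str         # free-form text attribute
    vclock: tuple     # vector timestamp assigned by the monitor

@dataclass
class PatternLeaf:
    etype: str        # Type: the event class this leaf matches
    order: int        # Order: evaluation order within the pattern
    # History: matched events grouped by trace; each per-trace list stays
    # totally ordered because events arrive in a linearization of the partial order.
    history: dict = field(default_factory=lambda: defaultdict(list))

    def matches(self, ev: Event) -> bool:
        return ev.etype == self.etype

    def add_event(self, ev: Event) -> bool:
        """Record a matching event in the per-trace history; return True if it matched."""
        if not self.matches(ev):
            return False
        self.history[ev.trace].append(ev)
        return True
```

On each event delivered by POET, the monitor appends the event to every leaf it matches and, when the event can complete the pattern, starts the backtracking search described next in Section IV-C.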

Our approach is to report a representative subset of all matches that spans the entire execution time. A good representative subset should provide the most information about the matches to the pattern. Each event in a pattern is an instrumented activity of interest occurring on a process, so it is plausible to assume that it is important for the user to know whether such an event has occurred anywhere in the system. A representative subset should also be low in redundancy, so we can exclude multiple occurrences of similar events on the same process. Considering these two objectives, we propose a subset that has at least one occurrence per process for each primitive event class.

Fig. 3: Choosing a Representative Subset for A → B. (A process-time diagram with three traces: P1 carries c11, d12, a13, a14, a15, c17; P2 carries a21, d22, e23, b25; P3 carries d31, e32, a33, a34. The dotted vertical line is the current cut. On arrival of b25 the match sets are:
All: a13 b25, a14 b25, a15 b25, a21 b25;
Window: a13 b25, a14 b25, a15 b25;
Desired: a15 b25, a21 b25 (not unique).)

Figure 3 shows a simple process-time diagram where the dotted vertical line is the current cut. On arrival of the new event b, there are four matches for the pattern A → B. Figure 3 also shows a sliding window and the set of matches that it will report. We chose the window to have n² events, where n is the number of processes. If we look at the reported matches, the window fails to return a match that involves an event a on P2 (a21 b25). Thus the returned subset is not a proper representative of the set of all matches. Considering the pattern as a safety condition, the action that we take based on the returned subset will be incomplete, as it misses the matches that span beyond the maintained window.

Our representative subset will report whether any of the constituent events in the pattern has occurred on any of the processes and is part of a complete match. Thus if there is only one match, it will definitely be reported. If there are many matches, it can be proved that there exists a subset, according to our definition, that has cardinality of at most kn, where k is the number of events in the pattern and n is the number of processes.

C. OCEP Algorithm

OCEP uses backtracking to find the subset of matches to the pattern, as shown in Algorithm 1. We start with the newly matched event as our initial assignment (e1). At every stage of backtracking, the algorithm tries to extend the partial match by finding a match to the event ei (goForward) so that it is causally related to the existing assignment (e1, e2, ..., ei−1) as specified by the pattern. If a complete match is found with an event ei on trace t, or there is no unexplored match on it, the algorithm continues with matches for ei on trace t + 1. If there is no unexplored match on any of the traces, the algorithm backtracks (goBackward). The variable consistent controls the direction of backtracking by keeping track of the return values from the two traversal functions.

Algorithm 1 OCEP Algorithm
Precondition: M is a partial match of length one: {e1}
1: level ← 2            ▷ level is updated inside the called functions
2: consistent ← true
3: while level ≠ 1 do
4:    if consistent then
5:        consistent ← goForward(M, level)
6:    else
7:        consistent ← goBackward(M, level)
8:    if M is a complete match then
9:        updateSubset(M)

A very basic implementation of goForward could use chronological backtracking, which starts with the latest match on a trace and chronologically goes back in time. That is not very efficient in practice, as it explores the entire search space until a solution is found or a conflict is reached.

When dealing with a causal pattern, given a partial match (e1, e2, ..., ei−1), it is possible to use the causality relationships of the instantiated events to restrict the domain of event ei on any trace. The greatest predecessor (GP) of an event a on a trace t is the most recent event e on that trace that happens before a. An interesting property of GP(a, t) is that it shows in which part of the process t an event b can occur so as to happen before a. The least successor (LS) of an event a on a trace t is the least recent event e on that trace that happens after a. LS(a, t) shows in which portion of the process t an event b can occur so as to happen after a. When used together, GP(a, t) and LS(a, t) can also identify the portions with which a is concurrent.

Suppose we are trying to instantiate the event ei on trace l such that it is concurrent with event e on trace m, which is already instantiated. The earliest event on trace l that can be concurrent with e is the event that happens immediately after its GP on trace l. The latest concurrent event is the event that occurs immediately before its LS on trace l. Thus the domain of ei on l that can extend the partial match with respect to e is given by (GP(e, l), LS(e, l)). The open interval denotes that the GP and LS themselves are not included in the domain. We summarize how we restrict the domain for all relations in Figure 4.

Causality    Domain
e ‖ ei       (GP(e, l), LS(e, l))
e → ei       [LS(e, l), ∞)
ei → e       (−∞, GP(e, l)]

Fig. 4: Restricting the Domain of ei with Respect to e.
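The domain restriction in Figure 4 can be sketched directly from per-trace event lists and vector timestamps. The code below is illustrative only: the helper names (greatest_predecessor, least_successor, restrict_domain) are ours, the linear scans are for clarity, and OCEP itself derives the same bounds from the timestamps of the instantiated events.

```python
from typing import List, Optional

def happens_before(a, b) -> bool:
    """a -> b under Fidge/Mattern clocks; events carry .trace and .vclock
    as in the earlier sketches."""
    return a != b and a.vclock[a.trace] <= b.vclock[a.trace]

def greatest_predecessor(e, trace_events: List) -> Optional[int]:
    """Index in trace_events (events of one trace l, in trace order) of GP(e, l):
    the latest event on l that happens before e, or None if there is none."""
    gp = None
    for k, c in enumerate(trace_events):
        if happens_before(c, e):
            gp = k
    return gp

def least_successor(e, trace_events: List) -> Optional[int]:
    """Index of LS(e, l): the earliest event on l that e happens before."""
    for k, c in enumerate(trace_events):
        if happens_before(e, c):
            return k
    return None

def restrict_domain(relation: str, e, trace_events: List) -> List:
    """Figure 4: the candidates on trace l that may instantiate e_i,
    given the required causal relation between e and e_i."""
    gp = greatest_predecessor(e, trace_events)
    ls = least_successor(e, trace_events)
    lo = 0 if gp is None else gp + 1                  # first event after GP(e, l)
    hi = len(trace_events) if ls is None else ls      # first event at LS(e, l)
    if relation == "concurrent":                      # e || e_i : (GP(e,l), LS(e,l))
        return trace_events[lo:hi]
    if relation == "e_before_ei":                     # e -> e_i : [LS(e,l), inf)
        return [] if ls is None else trace_events[ls:]
    if relation == "ei_before_e":                     # e_i -> e : (-inf, GP(e,l)]
        return [] if gp is None else trace_events[:gp + 1]
    raise ValueError(f"unknown relation: {relation}")
```

An empty result corresponds to the conflict case in Algorithm 2 below, where the timestamp of the conflicting event is recorded for the backjump performed by goBackward.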

This technique is similar to the forward-checking algorithm, which uses the instantiated variables to restrict the domains of all future variables [33]. In contrast, we restrict only the domain of the variable that is currently being instantiated.

Algorithm 2 Forward Algorithm
Precondition: M is a partial match of length i − 1, i is the backtracking level, and n is the total number of traces.
1: function goForward(M, i)
2:    if Di is not initialized then
3:        while tracei < n do
4:            for prev ← 1 ... i − 1 do
5:                Di ← restrictDomain(Mprev, tracei)
6:                if Di = ∅ then
7:                    bt[prev][i] ← getTS(Mprev, tracei)
8:                    break
9:            tracei ← tracei + 1
10:   if Di ≠ ∅ then
11:       Mi ← nextMatch(Di, tracei)
12:       i ← i + 1
13:       return true
14:   else
15:       return false

The pseudo-code for the function goForward, which instantiates an event at each backtracking level, is given in Algorithm 2. The variable i is the passed backtracking level. The variables tracei and Di store the trace that is currently being traversed and the domain of matched events on it. The function starts by initializing the domain Di, restricting it using the already-instantiated events (a summary of restrictDomain is given in Figure 4). If a conflict is found, i.e., a previously instantiated event constrains Di to be empty, getTS determines the desired timestamp for the event at the previous level that can possibly resolve this conflict.

After all the traces are tried, control returns to goBackward, which uses the recorded timestamps in bt to backtrack to the closest event that can resolve one of the conflicts. A simple backtrack to the previous event would have caused repeated failure from the same conflicting event.

Suppose ei has an empty domain on trace m with respect to the instantiated event e2 on trace l. The vector timestamps of the conflicting events can then be used to update the domain at the earlier level during backtracking. This is done in the function getTS, and we illustrate its operation in Figure 5 for each of the causal relations. For simplicity we show a truncated process-time diagram and fragmented vector timestamps that only contain entries for the relevant traces.

Fig. 5: Using Causality to Update the Domain when Backtracking: a) Precedence, b) Follow, c) Concurrency. (Each panel shows the two traces l and m together with the fragmented vector timestamps e2[..., t2^l, ..., t2^m, ...] and ei[..., ti^l, ..., ti^m, ...].)

If we are looking for e2 → ei, then the conflict happens because LS(e2, m), if it exists, happens after the latest ei on m, Figure 5(a). The l-th entry in the vector timestamp of ei (ti^l) indicates GP(ei, l). Any matched e2 that happens after GP(ei, l) will again lead us to the same conflict. So ti^l is stored for later use and, during backtracking, updateDomain uses it to restrict e2's domain to (−∞, GP(ei, l)].

If the conflict is for ei → e2, then GP(e2, m) happens before the earliest ei on m, Figure 5(b). Thus we can prune all the matches for e2 on l, as none of them will be able to create a complete match.

When we have a conflict for e2 ‖ ei, Figure 5(c), all the matches for ei on m either happen after e2 or happen before it. So e2 cannot create a complete match with an ei on trace m, but an earlier e2 on trace l may still create one. Then the domain of ei on m that can be of possible interest is (−∞, GP(e2, m)). We can use t2^m to find GP(e2, m), say ei′. The backtracked level then needs to restrict e2's domain to (−∞, LS(ei′, l)), and we can use ti′^m to find LS(ei′, l).

Algorithm 3 Backward Algorithm
Precondition: M is a partial match of length i − 1 and i is the backtracking level
1: function goBackward(M, i)
2:    jumpTS ← getClosest(i, bt[i])
3:    if jumpTS exists then
4:        Di ← updateDomain(i, jumpTS)
5:    else
6:        i ← i − 1
7:    return true

The pseudo-code for the function goBackward is given in Algorithm 3. It first tries to determine whether any conflict with the uninstantiated events is recorded for the instantiated event at this level. If any exists, it finds the one that will make it backtrack to the latest match in the current domain. It extracts the required timestamp and uses it to restrict its domain. If no conflict was found, it simply backtracks to the previous level.
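A compact Python rendering of the control flow in Algorithm 1 ties the pieces together. It is a sketch only: go_forward, go_backward, is_complete, and update_subset stand in for the paper's goForward, goBackward, the completeness test at the pattern root, and updateSubset, and their real signatures inside OCEP may differ (in particular, the paper's routines update the level in place, while here the helpers return it).

```python
def ocep_search(e1, pattern_len, go_forward, go_backward, is_complete, update_subset):
    """Backtracking driver in the spirit of Algorithm 1.

    M is the partial match under construction; M[0] is the newly arrived
    terminating event e1, and level counts the pattern event currently
    being instantiated.  The loop ends when the search falls back to
    level 1, i.e. all alternatives for extending e1 have been explored.
    """
    M = [e1]                 # partial match of length one
    level = 2
    consistent = True
    while level != 1:
        if consistent:
            consistent, level = go_forward(M, level)    # try to extend the match
        else:
            consistent, level = go_backward(M, level)   # jump back past a conflict
        if is_complete(M, pattern_len):
            update_subset(M)  # keep a representative match; retaining at most one
                              # per event class and trace bounds the subset by k*n
    return M
```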

V. Performance Evaluation

A. Partial-Order Event Tracer

This work is built on top of the various techniques and algorithms in an existing tool, the Partial-Order Event Tracer (POET) [21]. POET itself is a distributed application that allows a user to instrument a distributed application and collect information about it. It is target-system independent, as it can collect data from a wide variety of execution (target) environments with a minimum amount of effort [39].

The core information stored by POET is a set of events grouped by traces and the partial-order relationships among those events. The occurrence of a POET event indicates that one of a predefined set of important actions has occurred in the monitored application. This set of important actions is entirely dependent on the target environment being used.

A client can connect to the POET server in a way that it receives the arriving events in a linearization of the partial order of the events. A linearization of a partial order → on a set X is a sequence containing each element of X once such that any x occurs before x′ whenever x → x′. Our monitor application connects to POET as one such client and tries to detect the specified pattern as the events appear.

B. Evaluation Methodology

We evaluate the effectiveness of OCEP using some representative bug patterns that we have chosen from the existing literature [13, 27]. These test patterns were run on event data collected from μC++ and MPI environments. μC++ [11] extends C++ with additional constructs to incorporate concurrent programming. MPI [38] is a widely used library for writing parallel applications.

Each test case is executed until the number of events generated exceeds one million. We used the dump feature in POET to save the collected trace-event data in a file. The reload feature in POET allows us to reuse this file, with the saved events passed to POET via the same interface used to collect events from a running application. OCEP connects to POET as a client in order to receive the events in a linearization of the partial order.

All measurements are performed on a workstation with an Intel Core 2 Duo 2 GHz CPU and 2 GB memory running Linux kernel 3.0.0. OCEP is executed with each set of trace-event data five times and the average is used for the evaluation.

The main performance metric that we used for our evaluation is the execution time, which is measured as the wall-clock time taken by the monitor to find the set of matches on arrival of an event. The events can be divided into three categories depending on a given pattern: i) events that do not match the pattern, ii) events that match the pattern but will not generate a complete match, and iii) events that can possibly generate a complete match, which we call terminating events. For the pattern A → B, only a newly arrived b is a terminating event, as it can generate a complete match if a causally related a is already found. In contrast, for the pattern A ‖ B, any newly arrived a or b is a terminating event.

C. Case Studies

We use boxplots to show the measured execution time taken for each of our case studies in Figures 6-9. The centre rectangle spans the interquartile range (IQR), which is the likely range of variation, with the inner segment representing the median. The whisker marks are placed 1.5 × IQR above the third quartile and below the first quartile, while the crosses mark the outliers.

1) Deadlock: In MPI, when a process executes a blocking point-to-point communication, it does not return until the operation is complete and the buffer can be reused. So an application can deadlock if there exists a send-receive cycle due to incorrect usage of the communication routines. A commonly used method for detecting such a deadlock is to build a dependency graph and check for cycles [2]. Event patterns are not able to detect a generic cycle, as opposed to a dependency graph, but they can be used to identify a deadlock of a specific length.

We simulate deadlock using a parallel algorithm for random walk, which has many applications in statistical and scientific computation. It divides a domain among the parallel processes, and each process has a number of walkers traversing a contiguous sub-domain. The processes communicate among themselves to exchange the walkers that move across process boundaries. We deliberately leave a deadlock in the code for this point-to-point communication. Interestingly enough, this deadlock is rarely visible because MPI_Send, although a blocking operation, only blocks when the network cannot buffer the message completely [38].

Fig. 6: Execution Time for Deadlock. (Boxplots of per-event execution time in microseconds for 10, 20, and 50 traces.)

Figure 6 shows that our algorithm is still exponential in terms of the length of the pattern, which is expected since we are using a backtracking algorithm. The efficiency of our backtracking algorithm depends on its ability to prune the domain for an event on a trace based on its causality relation with the other events. As we explained in Figure 5, we used vector timestamps of the causally related events to constrain the searched domain. In contrast, building and maintaining a dependency graph is costly, which is apparent from the runtime of 35 seconds to detect a cycle of length 30 [2].
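The send-receive cycle behind this case study is easy to reproduce in a few lines. The mpi4py sketch below is not the paper's random-walk benchmark; it is a hypothetical minimal example of the same bug class, in which every process sends to its right neighbour before posting the matching receive. With synchronous sends (ssend) the ring always deadlocks; with plain send it usually completes because small messages are buffered, which is exactly why the injected bug is rarely visible.

```python
# Hypothetical reproduction; run e.g. with: mpiexec -n 4 python ring_deadlock.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

right = (rank + 1) % size      # neighbour we hand walkers to
left = (rank - 1) % size       # neighbour we expect walkers from

payload = {"walkers": [rank] * 8}

# Every process sends first and receives second, creating a cyclic wait.
# ssend completes only once the receiver has posted its receive, so all
# processes block on the next line and the ring deadlocks.
comm.ssend(payload, dest=right, tag=0)
incoming = comm.recv(source=left, tag=0)

print(f"rank {rank} received {len(incoming['walkers'])} walkers")
```

Replacing ssend with send typically lets the program finish because the runtime buffers the message, matching the observation above that the deadlock rarely manifests even though the unsafe communication pattern is present.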

2) Message Race: When two or more concurrent messages are sent to the same process, they may arrive in a random order, causing nondeterministic execution of a parallel program. This may lead to sporadically occurring errors that are difficult to reproduce. Even when a message race happens by design, it is critical to detect it for debugging in order to ensure that all possible executions of a program are examined [20].

A common method for detecting message races is to keep track of the receive events on a trace and compare their vector timestamps for causality [30]. If any two incoming messages to a process are concurrent, then the two messages race. A causal event-pattern can express this relationship, which we use to express message races.

We use a benchmark program in which all processes but one concurrently send messages to the remaining process, while the latter accepts them using a blocking receive with the MPI_ANY_SOURCE wild-card.

Fig. 7: Execution Time for Message Races. (Boxplots of per-event execution time in microseconds for 10, 20, and 50 traces.)

Existing methods for detecting message races pass the vector timestamp among the processes by attaching it to messages [32]. OCEP receives a vector timestamp constructed in POET, not in the application, so there is minimal extra overhead on the application itself.

3) Atomicity Violation: An atomic code segment is protected in the sense that when a process executes the first event in the segment, no other process can start executing the segment before the first process executes the last event. Approaches for detecting an atomicity violation rely on finding unserializable patterns of operations by searching the events that are related to shared-variable accesses and synchronization primitives [40]. Our approach is similar in the sense that we search for a causal pattern among these same events, but we also monitor the synchronization primitives as separate traces, which allows us to represent an atomicity violation as a causal pattern.

Fig. 8: Execution Time for Atomicity Violation. (Boxplots of per-event execution time in microseconds for 10, 20, and 50 traces.)

We demonstrate this with a μC++ program that has a method protected by a semaphore so that there is never more than one thread executing it. There is an intentional bug for which, when a thread attempts to execute the method, the semaphore will not be acquired properly with 1% probability. Existing approaches often build a conflict graph with the monitored events and synchronization primitives, which has previously taken 0.4-40 seconds to detect a similar violation [40].

We chose the μC++ environment because the POET plugin for μC++ already adds semaphores as separate traces. It is possible to add a different semaphore implementation, such as a pthreads semaphore, but this would require additional plugins.

4) Undesired Order: Studies have found that most non-deadlock concurrency bugs in real-world applications belong to one of two categories: atomicity violation and order violation [27]. Existing concurrency-bug detection tools do not address bugs that are manifested by the violation of an order intended by the developer.

There are various ways to analyze the expected behaviour of a system. A developer can represent the violation of intended ordering with a causally related pattern of events. Large software systems often use software classification tools to identify repetitive patterns of events from program execution traces [26]. Over the last decade, design patterns have become one of the most powerful tools in designing large software systems [16]. Design patterns are reusable software modules for recurring problems, and both formal [9] and informal [16] specifications exist defining precise semantics.

Fig. 9: Execution Time for Ordering Bug. (Boxplots of per-event execution time in microseconds for 50, 100, and 500 traces.)

We use the ordering-bug pattern shown in Section III-D, which has a leader-and-follower coherence issue. We simulate a distributed application that handles a replicated service. We left a deliberate bug in the synchronization procedure so that there is a 1% chance that a leader may make an update after it takes a snapshot of the system, and then it will forward this stale snapshot to the synchronizing follower.

D. Results and Discussion

The strength of our event monitor lies in its ability to detect occurrences of a pattern that include each trace with a constituent event matching the pattern. This makes it an excellent tool for identifying safety conditions in a system. The outliers in the boxplots for the test cases show that, in order to identify such a subset, we occasionally have to traverse a large section of the process-time diagram. We limited the number of outliers shown in Figures 6-9 so that the IQR and whisker marks are clearly shown. We summarize the quartiles, top whisker, and the maximum time taken by an event in Figure 10.

Test Case    Q1     Med    Q3     Top Whisker    Max
Deadlock     1712   1805   1888   2153           14931
Races        49     69     76     117            10830
Atomicity    42     45     51     65             6819
Ordering     119    121    124    132            7668

Fig. 10: Detailed Runtime for Test Cases (microseconds).

The second performance metric that we used is completeness. Since we used known violation cases in simulated applications, we also knew all such occurrences. Our OCEP algorithm is complete, as it correctly reported all violations for the test cases. OCEP also did not report any false positives for any of the test cases.

The major question that we wanted to answer in our simulation was whether our algorithm is efficient enough to provide fast detection of violation patterns with low overhead in realistic application settings.

We demonstrated this by choosing some of the prevalent concurrency-bug patterns in real-world applications. We find that OCEP is able to find a match to most of the tested violation patterns in less than one millisecond. The only test case where the whisker marks are above this limit is when we are searching for a long deadlock cycle. We could not evaluate the existing graph-based approaches, as their implementations are not publicly available. As shown in Section V-C, OCEP runs orders of magnitude faster than previously published results in detecting similar violations. Additionally, we support generic patterns which are able to detect various undesired behaviours, as opposed to detecting a specific violation.

Survey results often show that most concurrency bugs involve only a small number of processes [27]. We use variable binding for both an event and its attributes inside the class definition to clearly specify the relationship among the constituent events in the pattern. For example, a complete match of the ordering bug in Section III-D only involves the leader and the follower. Figure 9 shows an almost linear increase in runtime with the number of traces. This signifies that our algorithm was effectively able to isolate the relevant traces from the pattern specification. So OCEP can provide fast detection for a range of violation patterns when the pattern involves a moderate number of traces.

That leaves the question of overhead. In this work we used a very simple approach of discarding multiple occurrences of the same event on a trace which have no send or receive events between them. As we are dealing with potential causality, how an event is causally related to the other events is only affected by the messages. So two events which do not have any send or receive events between them have the same causal relation with events on other traces. This approach is computationally simple, O(1), but does not guarantee a minimal subset size. In the worst case it will store all the events since start-up. So OCEP is not scalable to an arbitrarily large number of long-running processes. The main challenge in maintaining a minimal overhead is that there are cases in which it is always possible to extend a partial match with future events. Comparison of timestamps can help, but each such operation is O(n) and hence prohibitively expensive. We think a technique that exploits the causality present in the pattern can work better in this situation than a general solution. We leave this for future work.

Finally, the detection of a violation with OCEP requires knowledge of the causality structure among the monitored events that represents an undesired behaviour. An unexpected behaviour with a different causality structure may remain undetected although it has the same effect on the global property.

VI. Conclusion

Online detection of causal event-patterns is a complex problem, particularly for long-running applications. In this paper, we introduced OCEP, an efficient online causal-event-pattern-matching framework that can be used to identify safety-condition violations in distributed applications. On arrival of each event, OCEP returns a representative subset of matches that spans the entire execution time. It uses vector timestamps and the causality relation among the events in the pattern to efficiently prune the search space.

We evaluated our approach with some of the most frequently found concurrency-bug patterns. OCEP successfully finds all occurrences of the pattern in each application within a millisecond in almost all cases. This is the first available online tool that detects safety violations using generic causally related sets of events that represent event-patterns.

Our work can be extended by exploring the possibility of pruning events from the event history that can no longer generate new matches for the subset. A future plugin for POET, which will allow a client to retrieve the timestamp values of any previously matched event in constant time, can also be used to restrict the size of the event history maintained at the monitor. Also, at each backtracking level, the traces are traversed sequentially. Each of these traces represents a subtree in the total search space. This parallelism can be exploited to improve the performance of the algorithm.

References

[1] "ZooKeeper bug #962," https://issues.apache.org/jira/browse/ZOOKEEPER-962, accessed: 10/10/2012.
[2] R. Agarwal, L. Wang, and S. D. Stoller, "Detecting potential deadlocks with static analysis and run-time monitoring," in HVC, Haifa, Israel, Nov 2005, pp. 191-207.
[3] J. Agrawal, Y. Diao, D. Gyllstrom, and N. Immerman, "Efficient pattern matching over event streams," in SIGMOD, Vancouver, BC, Jun 2008, pp. 147-160.
[4] M. K. Aguilera, J. C. Mogul et al., "Performance debugging for distributed systems of black boxes," in SOSP, Bolton Landing, NY, Oct 2003, pp. 74-89.
[5] A. Arasu, B. Babcock et al., "Characterizing memory requirements for queries over continuous data streams," ACM T. Database Syst., vol. 29, no. 1, pp. 162-194, 2004.

[6] R. Baldoni, J.-M. Hélary, and M. Raynal, "Rollback-dependency trackability: A minimal characterization and its protocol," Inf. Comput., vol. 165, no. 2, pp. 144-173, 2001.
[7] P. Barham, A. Donnelly, R. Isaacs, and R. Mortier, "Using Magpie for request extraction and workload modelling," in OSDI, Seattle, WA, Dec 2004, pp. 259-272.
[8] T. Basten, T. Kunz et al., "Vector time and causality among abstract events in distributed computations," Distrib. Comput., vol. 11, pp. 21-39, 1997.
[9] I. Bayley and H. Zhu, "Formal specification of the variants and behavioural features of design patterns," J. Syst. Software, vol. 83, no. 2, pp. 209-221, 2010.
[10] K. P. Birman, "Overcoming failures in a distributed system," in Reliable Distributed Systems. Springer New York, 2005, pp. 247-275.
[11] P. A. Buhr, G. Ditchfield et al., "μC++: Concurrency in the object-oriented language C++," Softw. Pract. Exper., vol. 22, no. 2, pp. 137-172, 1992.
[12] R. Cooper and K. Marzullo, "Consistent detection of global predicates," in PADD, May 1991, pp. 167-174.
[13] E. Farchi, Y. Nir, and S. Ur, "Concurrent bug patterns and how to test them," in IPDPS, Apr 2003.
[14] C. Fidge, "Logical time in distributed computing systems," Computer, vol. 24, no. 8, pp. 28-33, Aug 1991.
[15] M. Fox, "Event-predicate detection in the monitoring of distributed applications," Master's thesis, University of Waterloo, Waterloo, Ontario, 1998.
[16] E. Gamma, R. Helm, R. Johnson, and J. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software, 1st ed. Addison Wesley, 1995.
[17] D. Geels, G. Altekar et al., "Friday: Global comprehension for distributed replay," in NSDI, Cambridge, MA, Apr 2007, pp. 285-298.
[18] M. P. Herlihy and J. M. Wing, "Linearizability: A correctness condition for concurrent objects," ACM T. Progr. Lang. Sys., vol. 12, pp. 463-492, 1990.
[19] P. Hunt, M. Konar, F. Junqueira, and B. Reed, "ZooKeeper: Wait-free coordination for Internet-scale systems," in USENIX ATC, Boston, MA, Jun 2010.
[20] D. Kranzlmüller and M. Schulz, "Notes on nondeterminism in message passing programs," in 9th European PVM/MPI Users' Group Meeting, Linz, Austria, Sep/Oct 2002, pp. 357-367.
[21] T. Kunz, J. P. Black, D. Taylor, and T. Basten, "POET: Target-system independent visualizations of complex distributed application executions," Comput. J., vol. 40, no. 8, pp. 499-512, 1997.
[22] L. Lamport, "Time, clocks and the ordering of events in a distributed system," Commun. ACM, vol. 21, no. 7, pp. 558-565, Jul 1978.
[23] L. Lamport, "On interprocess communication, part I: Basic formalism, part II: Algorithms," Distrib. Comput., vol. 1, pp. 77-101, 1986.
[24] K. H. Lee, N. Sumner, X. Zhang, and P. Eugster, "Unified debugging of distributed systems with Recon," in ICDSN, Hong Kong, China, Jun 2011, pp. 85-96.
[25] X. Liu, Z. Guo et al., "D3S: Debugging deployed distributed systems," in NSDI, San Francisco, CA, Apr 2008, pp. 423-437.
[26] D. Lo, H. Cheng et al., "Classification of software behaviors for failure detection: A discriminative pattern mining approach," in SIGKDD, Paris, France, Jun/Jul 2009, pp. 557-566.
[27] S. Lu, S. Park, E. Seo, and Y. Zhou, "Learning from mistakes: A comprehensive study on real world concurrency bug characteristics," in ASPLOS XIII, Seattle, WA, Mar 2008, pp. 329-339.
[28] F. Mattern, "Virtual time and global states of distributed systems," in Int. Workshop on Parallel and Distributed Algorithms, North-Holland/Elsevier, Dec 1988, pp. 215-226.
[29] N. Mittal and V. K. Garg, "On detecting global predicates in distributed computations," in ICDCS, Phoenix, AZ, Apr 2001.
[30] R. H. B. Netzer and B. P. Miller, "Optimal tracing and replay for debugging message-passing parallel programs," in Supercomputing, Minneapolis, MN, Nov 1992, pp. 502-511.
[31] M. Nichols, "Efficient pattern search in large, partial-order data sets," Ph.D. dissertation, University of Waterloo, Waterloo, Ontario, 2008.
[32] M.-Y. Park, S. J. Shim, Y.-K. Jun, and H.-R. Park, "MPIRace-Check: Detection of message races in MPI programs," in GPC, May 2007, pp. 322-333.
[33] P. Prosser, "Hybrid algorithms for the constraint satisfaction problem," Computat. Intell., vol. 9, no. 3, pp. 268-299, 1993.
[34] P. Reynolds, C. Killian et al., "Pip: Detecting the unexpected in distributed systems," in NSDI, San Jose, CA, May 2006, pp. 115-128.
[35] S. Savage, M. Burrows et al., "Eraser: A dynamic data race detector for multithreaded programs," ACM Trans. Comput. Syst., vol. 15, no. 4, pp. 391-411, Nov 1997.
[36] R. Schwarz and F. Mattern, "Detecting causal relationships in distributed computations: In search of the holy grail," Distrib. Comput., vol. 7, no. 3, pp. 149-174, Mar 1994.
[37] A. Singh, P. Maniatis, T. Roscoe, and P. Druschel, "Using queries for distributed monitoring and forensics," in EuroSys, Leuven, Belgium, Apr 2006, pp. 389-402.
[38] M. Snir, J. Dongarra et al., MPI: The Complete Reference, 2nd ed. The MIT Press, 1998.
[39] D. Taylor, T. Kunz, and J. P. Black, "A tool for debugging OSF DCE applications," in COMPSAC, Seoul, Korea, Aug 1996, pp. 440-446.
[40] L. Wang and S. D. Stoller, "Accurate and efficient runtime detection of atomicity errors in concurrent programs," in PPoPP, New York, NY, Mar 2006, pp. 137-146.
[41] P. Xie, "Convex-event based offline event-predicate detection," Master's thesis, University of Waterloo, Waterloo, Ontario, 2003.
