The Complex Event Recognition Group

Elias Alevizos, Alexander Artikis, Nikos Katzouris, Evangelos Michelioudakis, Georgios Paliouras
Institute of Informatics & Telecommunications, National Centre for Scientific Research (NCSR) "Demokritos", Athens, Greece
{alevizos.elias, a.artikis, nkatz, vagmcs, paliourg}@iit.demokritos.gr

arXiv:1802.04086v1 [cs.AI] 12 Feb 2018

ABSTRACT

The Complex Event Recognition (CER) group is a research team, affiliated with the National Centre of Scientific Research "Demokritos" in Greece. The CER group works towards advanced and efficient methods for the recognition of complex events in a multitude of large, heterogeneous and interdependent data streams. Its research covers multiple aspects of complex event recognition, from the efficient detection of patterns on event streams to the handling of uncertainty and noise in streams, and machine learning techniques for inferring interesting patterns. Lately, it has expanded to methods for forecasting the occurrence of events. It was founded in 2009 and currently hosts 3 senior researchers and 5 PhD students, and works regularly with undergraduate students.

1. INTRODUCTION

The proliferation of devices that work in real-time, constantly producing data streams, has led to a paradigm shift with respect to what is expected from a system working with massive amounts of data. The dominant model for processing large-scale data was one that assumed a relatively fixed database/knowledge base, i.e., it assumed that the operations of updating existing records/facts and inserting new ones were infrequent. The user of such a system would then pose queries to the database, without very strict requirements in terms of latency. While this model is far from being rendered obsolete (on the contrary), a system aiming to extract actionable knowledge from continuously evolving streams of data has to address a new set of challenges and satisfy a new set of requirements.

The basic idea behind such a system is that it is not always possible, or even desirable, to store every bit of the incoming data so that it can be processed later. Rather, the goal is to make sense of these streams of data without having to store them. This is done by defining a set of queries/patterns that are continuously applied to the data streams. Each such pattern includes a set of temporal constraints and, possibly, a set of spatial constraints, expressing a composite or complex event of special significance for a given application. The system must then be efficient enough so that instances of pattern satisfaction can be reported to a user with minimal latency. Such systems are called Complex Event Recognition (CER) systems [6, 7, 2].

CER systems are widely adopted in contemporary applications. Such applications include the recognition of attacks on computer network nodes, human activities on video content, emerging stories and trends on the Social Web, traffic and transport incidents in smart cities, fraud in electronic marketplaces, cardiac arrhythmias and epidemic spread. Moreover, Big Data frameworks, such as Apache Storm, Spark Streaming and Flink, have been extending their stream processing functionality by including implementations for CER.

Multiple issues arise for a CER system. As already mentioned, one issue is the requirement for minimal latency. A CER system therefore has to employ highly efficient reasoning mechanisms, scalable to high-velocity streams. Moreover, pre-processing steps, like data cleaning, have to be equally efficient; otherwise they constitute a "luxury" that a CER system cannot afford, in which case the system must be able to handle noise. Noise handling may be a requirement even if perfectly clean input data is assumed, since domain knowledge is often insufficient or incomplete. Hence, the patterns defined by users may themselves carry a certain degree of uncertainty. Moreover, it is quite often the case that such patterns cannot be provided at all, even by domain experts. This poses the further challenge of applying machine learning techniques in order to extract patterns from streams before a CER system can actually run with them. Standard machine learning techniques are not always directly applicable, due to the size and variability of the training set. As a result, machine learning techniques must work in an online fashion. Finally, one often needs to move beyond detecting instances of pattern satisfaction into forecasting when a pattern is likely to be satisfied in the future.
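To make the pattern-matching loop just described concrete, here is a minimal, hypothetical sketch (the pattern, names and code are ours for illustration, and are not taken from any of the systems described in this article): a single pattern with one temporal constraint, applied continuously to a time-ordered stream, with matches reported as soon as the closing event arrives.

```python
from collections import deque

# Hypothetical illustration of continuous pattern matching: the pattern
# "an event of type 'a' followed by an event of type 'b' within 10 time
# units" is applied to a stream of (type, timestamp) events, and each
# match is reported as soon as it completes (minimal latency).

def recognise(stream, window=10):
    """Yield (start, end) intervals where an 'a' is followed by a 'b' in time."""
    pending_a = deque()          # timestamps of 'a' events awaiting a 'b'
    for etype, t in stream:      # events arrive in temporal order
        # discard 'a' events that have fallen outside the temporal window
        while pending_a and t - pending_a[0] > window:
            pending_a.popleft()
        if etype == "a":
            pending_a.append(t)
        elif etype == "b" and pending_a:
            yield (pending_a.popleft(), t)  # report the complex event

stream = [("a", 1), ("c", 3), ("b", 5), ("a", 20), ("b", 35)]
print(list(recognise(stream)))  # → [(1, 5)]; the second 'a' expires
```

Note that the stream is consumed once and never stored: only a bounded amount of state (the pending timestamps) is kept, which is the essential property of the streaming model described above.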

Our CER group at the National Centre for Scientific Research (NCSR) "Demokritos", in Athens, Greece, has been conducting research on CER for the past decade, and has developed a number of novel algorithms and publicly available software tools.^1 In what follows, we sketch the approaches that we have proposed and present some indicative results.

^1 http://cer.iit.demokritos.gr/

2. COMPLEX EVENT RECOGNITION

Numerous CER systems have been proposed in the literature [6, 7]. Recognition systems with a logic-based representation of complex event (CE) patterns, in particular, have been attracting attention, since they exhibit a formal, declarative semantics [2]. We have been developing an efficient dialect of the Event Calculus, called 'Event Calculus for Run-Time reasoning' (RTEC) [4]. The Event Calculus is a formalism for representing and reasoning about events and their effects [14]. CE patterns in RTEC identify the conditions in which a CE is initiated and terminated. Then, according to the law of inertia, a CE holds at a time-point T if it has been initiated at some time-point earlier than T, and has not been terminated in the meantime.

RTEC has been optimised for CER, in order to be scalable to high-velocity data streams. A form of caching stores the results of sub-computations in memory, to avoid unnecessary re-computations. A set of interval manipulation constructs simplifies CE patterns and improves reasoning efficiency. A simple indexing mechanism makes RTEC robust to events that are irrelevant to the patterns we want to match, so RTEC can operate without data filtering modules. Finally, a 'windowing' mechanism supports real-time CER. One main motivation for RTEC is that it should remain efficient and scalable in applications where events arrive with a (variable) delay from, or are revised by, the underlying sensors: RTEC can update the intervals of already recognised CEs, and recognise new CEs, when data arrive with a delay or following revision.

RTEC has been analysed theoretically, through a complexity analysis, and assessed experimentally in several application domains, including city transport and traffic management [5], activity recognition on video feeds [4], and maritime monitoring [18]. In all of these applications, RTEC has proven capable of performing real-time CER, scaling to large data streams and highly complex event patterns.

Figure 1: CE estimation in the Event Calculus. The solid line concerns a probabilistic Event Calculus, such as MLN-EC, while the dashed line corresponds to a crisp (non-probabilistic) version of the Event Calculus. Due to the law of inertia, the CE probability remains constant in the absence of input data. Each time the initiation conditions are satisfied (e.g., at time-points 3 and 10), the CE probability increases. Conversely, when the termination conditions are satisfied (e.g., at time-point 20), the CE probability decreases.
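The initiation/termination scheme and the law of inertia can be sketched as follows. This is an illustrative reconstruction in Python, not RTEC itself (which is an optimised logic-programming system), and `holds_intervals` is a hypothetical helper name of ours.

```python
# Illustrative sketch of the law of inertia (not RTEC): given the
# time-points at which a CE's initiation and termination conditions are
# satisfied, derive the maximal intervals during which the CE holds.

def holds_intervals(initiations, terminations, horizon):
    """A CE holds at T if initiated before T and not terminated in between."""
    events = sorted([(t, "init") for t in initiations] +
                    [(t, "term") for t in terminations])
    intervals, start = [], None
    for t, kind in events:
        if kind == "init" and start is None:
            start = t                     # CE becomes initiated
        elif kind == "term" and start is not None:
            intervals.append((start, t))  # inertia ends at termination
            start = None
    if start is not None:                 # CE still holds at the horizon
        intervals.append((start, horizon))
    return intervals

# e.g. initiations at 3 and 10 (the second is redundant), termination at 20:
print(holds_intervals([3, 10], [20], horizon=25))  # → [(3, 20)]
```

Note how the second initiation at time-point 10 has no effect on the crisp interval, whereas in a probabilistic Event Calculus it would further increase the CE probability, as discussed in the next section.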
3. UNCERTAINTY HANDLING

CER applications exhibit various types of uncertainty, ranging from incomplete and erroneous data streams to imperfect CE patterns [2]. We have been developing techniques for handling uncertainty in CER by extending the Event Calculus with probabilistic reasoning.

Prob-EC [21] is a logic programming implementation of the Event Calculus using the ProbLog engine [13], which incorporates probabilistic semantics into logic programming. Prob-EC is the first Event Calculus dialect able to deal with uncertainty in the input data streams. For example, Prob-EC is more resilient to spurious data than the standard (crisp) Event Calculus.

MLN-EC [22] is an Event Calculus implementation based on Markov Logic Networks (MLNs) [20], a framework that combines first-order logic with graphical models in order to enable probabilistic inference and learning. CE patterns may be associated with weight values, indicating our confidence in them. Inference can then be performed regarding the time intervals during which CEs of interest hold. Like Prob-EC, MLN-EC increases the probability of a CE every time its initiating conditions are satisfied, and decreases this probability whenever its terminating conditions are satisfied, as shown in Figure 1. Moreover, in MLN-EC the domain-independent Event Calculus rules, expressing the law of inertia, may also be associated with weight values, introducing probabilistic inertia. This way, the model is highly customisable: by tuning the weight values appropriately with the use of machine learning techniques, it achieves high predictive accuracy in a wide range of applications.

The use of background knowledge about the task and the domain, in terms of logic (the Event Calculus), can make MLN-EC more robust to variations in the data. Such variations are very common in practice, particularly in dynamic environments, such as the ones encountered in CER. The common assumption made in machine learning, that the training and test data share the same statistical properties, is often violated in these situations. Figure 2, for example, compares the performance of MLN-EC against linear-chain Conditional Random Fields on a benchmark activity recognition dataset, where evidence is incomplete in the test set as compared to the training set.

Figure 2: CER under uncertainty. F1-score of MLN-EC and linear-chain CRFs for different CE acceptance thresholds.
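The qualitative behaviour of Figure 1 can be mimicked with a toy update rule. The following is purely illustrative: the update fractions are invented numbers of ours and the code performs no MLN or ProbLog inference; it only reproduces the rise/fall/inertia shape of the CE probability curve.

```python
# Toy sketch of Figure 1's qualitative behaviour (NOT MLN-EC inference):
# the CE probability rises when initiation conditions hold, falls when
# termination conditions hold, and, by inertia, stays constant otherwise.

def ce_probability(timeline, init_points, term_points,
                   p_up=0.4, p_down=0.6):
    """Invented update rule: move a fraction towards 1 on initiation,
    towards 0 on termination; inertia keeps the value constant otherwise."""
    p, curve = 0.0, []
    for t in timeline:
        if t in init_points:
            p = p + p_up * (1.0 - p)    # probability increases
        elif t in term_points:
            p = p * (1.0 - p_down)      # probability decreases
        curve.append((t, round(p, 3)))
    return curve

# Initiations at 3 and 10, termination at 20, as in Figure 1: the
# probability steps up twice, stays flat in between, then drops.
print(ce_probability(range(0, 25), {3, 10}, {20}))
```

In contrast with the crisp interval computation sketched earlier, here the second initiation at time-point 10 does matter: it pushes the probability further up, which is exactly the difference Figure 1 illustrates.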
4. EVENT PATTERN LEARNING

The manual authoring of CE patterns is a tedious and error-prone process. Consequently, the automated construction of such patterns from data is highly desirable. We have been developing supervised, online learning tools for constructing logical representations of CE patterns from a single pass over a relational data stream.

OSLα [16] is such a learner for Markov Logic Networks (MLNs), formulating CE patterns in the form of MLN-EC theories. OSLα extends OSL [9] by exploiting background knowledge in order to significantly constrain the search for patterns. In each step t of the online procedure, a set of training examples Dt arrives, containing input data along with CE annotation. Dt is used together with the already learnt hypothesis, if any, to predict the truth values of the CEs of interest. This is achieved by MAP (maximum a posteriori) inference. Given Dt, OSLα constructs a hypergraph that represents the space of possible structures as graph paths. Then, for all incorrectly predicted CEs, the hypergraph is searched, using relational pathfinding, for clauses supporting the recognition of these CEs. The paths discovered during the search are generalised into first-order clauses. Subsequently, the weights of the clauses that pass the evaluation stage are optimised using off-the-shelf online weight learners. Then, the weighted clauses are appended to the hypothesis and the procedure is repeated for the next set of training examples Dt+1.

OLED [11] is an Inductive Logic Programming (ILP) system that learns CE patterns, in the form of Event Calculus theories, in a supervised fashion and in a single pass over a data stream. OLED constructs patterns by first encoding a positive example from the input stream into a so-called bottom rule, i.e., a most-specific rule of the form α ← δ1 ∧ ... ∧ δn, where α is an initiation or termination atom and δ1, ..., δn are relational features, constructed on-the-fly using prescriptions from a predefined language bias. These features express anything "interesting", as defined by the language bias, that is true within the positive example at hand. A bottom rule is typically too restrictive to be useful. To learn a useful rule, OLED searches within the space of rules that θ-subsume the bottom rule, i.e., rules that involve some of the δi's only. To that end, OLED starts from the most-general rule (a rule with an empty body) and gradually specialises that rule by adding δi's to its body, using a rule evaluation function to assess the quality of each generated specialisation. OLED's single-pass strategy is based on the Hoeffding bound [8], a statistical tool that allows the quality of a rule on the entire input to be approximated using only a subset of the data. In particular, given a rule r and some of its specialisations r1, ..., rk, the Hoeffding bound allows the best among them to be identified, with probability 1−δ and within an error margin ε, using only N = O((1/ε²)·ln(1/δ)) training examples from the input stream.

We have evaluated OLED and OSLα on real datasets concerning activity recognition, maritime monitoring, credit card fraud detection, and traffic management in smart cities [11, 16, 3, 15, 12]. We have also compared OLED and OSLα to OSL [9], to XHAIL, a 'batch' structure learner requiring many passes over the data [19], and to hand-curated Event Calculus patterns (with optimised weight values). The results suggest that both OLED and OSLα can match the predictive accuracy of batch learners, as well as that of hand-crafted patterns. Moreover, OLED and OSLα have proven significantly faster than both batch and online learners, making them more appropriate for large data streams.
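As a back-of-the-envelope illustration of the Hoeffding bound (our reading of [8], not OLED's actual code): for observations bounded in [0, 1], the empirical mean of N samples deviates from the true mean by more than ε with probability at most δ once N ≥ ln(1/δ) / (2ε²). Solving for N gives the number of examples needed before committing to the apparently best rule specialisation.

```python
import math

# Hypothetical helpers (not part of OLED): the Hoeffding bound for values
# in [0, 1] states that the empirical mean of N samples deviates from the
# true mean by more than epsilon with probability at most delta, once
# N >= ln(1/delta) / (2 * epsilon**2).

def hoeffding_sample_size(epsilon, delta):
    """Samples needed to estimate a rule's quality within epsilon, w.p. 1-delta."""
    return math.ceil(math.log(1.0 / delta) / (2.0 * epsilon ** 2))

def hoeffding_margin(n, delta):
    """Error margin epsilon achievable with n samples, w.p. 1-delta."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n))

# e.g. distinguishing two specialisations whose scores differ by 0.05,
# with 95% confidence:
print(hoeffding_sample_size(0.05, 0.05))  # → 600
```

The appeal for online learning is that this sample size is independent of the stream length: once roughly 600 examples have been seen, the learner can commit to the best specialisation and discard the data, which is what makes a single-pass strategy possible.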
5. EVENT FORECASTING

Forecasting over time-evolving data streams is a task that can be defined in multiple ways. There is a conceptual difference between forecasting and prediction, as the latter term is understood in machine learning, where the main goal is to "predict" the output of a function on previously unseen input data, even if there is no temporal dimension. In forecasting, time is a crucial component: the goal is to predict the temporally future output of some function, or the occurrence of an event. Time-series forecasting is an example of the former case and is a field with a significant history of contributions. However, its methods cannot be directly transferred to CER, since it handles streams of (mostly) real-valued variables and focuses on forecasting relatively simple patterns. In CER, on the contrary, we are also interested in categorical values, related through complex patterns and involving multiple variables. Our group has developed a method where automata and Markov chains are employed in order to provide (future) time intervals during which a match is expected with a probability above a confidence threshold [1].

We start with a given pattern of interest, defining relations between events in the form of a regular expression, i.e., using operators for sequence, disjunction and iteration. Our goal, besides detecting occurrences of this pattern, is also to estimate, at each new event arrival, the number of future events that we will need to wait for until the expression is satisfied, and thus a match is detected. A pattern in the form of a regular expression is first converted to a deterministic finite automaton (DFA) through standard conversion algorithms. We then construct a Markov chain that provides a probabilistic description of the DFA's run-time behaviour, by employing Pattern Markov Chains (PMCs) [17]. The resulting PMC depends both on the initial pattern and on the assumptions made about the statistical properties of the input stream, namely the order m of the assumed Markov process.

After constructing a PMC, we can use it to calculate the so-called waiting-time distributions, which give us the probability of reaching a final state of the DFA in k transitions from now. To estimate the final forecasts, another step is required, since our aim is not to provide the single future point with the highest probability, but an interval of the form I = (start, end). The meaning of such an interval is that the DFA is expected to reach a final state sometime in the future between start and end, with probability at least some constant threshold θfc (provided by the user). An example is shown in Figure 3, where the DFA in Figure 3a is in state 1, the waiting-time distributions for all of its non-final states are shown in Figure 3b, and the distribution, along with the forecast interval, for state 1 is shown in green.

Figure 3: Event forecasting. (a) A deterministic finite automaton (states 0-4) for an event pattern requiring that one event of type a is followed by three events of type b; (b) the waiting-time distributions of its non-final states, with θfc = 0.5; the forecast interval for state 1 is (3, 8). For illustration, the x axis stops at 12 future events.

Figure 4 shows the results of our implementation on two real-world datasets, from the financial and the maritime domains. In the former case, the goal was to forecast a specific case of credit card fraud, whereas in the latter it was to forecast a specific vessel manoeuvre. Figures 4a and 4d show precision results (the percentage of forecasts that were accurate), where the y axes correspond to different values of the threshold θfc, and the x axes correspond to states of the PMC (more "advanced" states are to the right of the axis); i.e., we measure the precision of the forecasts produced by each individual state. Similarly, Figures 4b and 4e are per-state plots of spread (the length of the forecast interval), and Figures 4c and 4f are per-state plots of distance (the temporal distance between the time a forecast is produced and the start of the forecast interval). As expected, more "advanced" states produce forecasts with higher precision and smaller spread and distance. However, there are cases where we can get both high precision and low spread scores early (see Figures 4d and 4e). This may happen when there exist strong probabilistic dependencies in the stream, e.g., when one event type is very likely (or very unlikely) to appear, given that the last event(s) is of a different event type. Our system can take advantage of such cases in order to produce high-quality forecasts early.
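The waiting-time computation can be sketched as follows. This is an illustrative reconstruction under simplifying assumptions (a first-order Markov process and a hand-built chain with the final state made absorbing), not the implementation of [1]; `waiting_time` and `forecast_interval` are hypothetical names of ours.

```python
# Illustrative sketch (not the system of [1]): given a Markov chain over
# DFA states in which final states are absorbing, the waiting-time
# distribution of a state is the probability of FIRST reaching a final
# state in exactly k transitions; a forecast interval is the smallest
# interval whose cumulative probability reaches the threshold theta_fc.

def matvec(row, matrix):
    n = len(matrix)
    return [sum(row[i] * matrix[i][j] for i in range(n)) for j in range(n)]

def waiting_time(transition, start, final, k_max):
    """P(first hit of a final state in exactly k steps), for k = 1..k_max."""
    dist, absorbed = [], 0.0
    row = [1.0 if s == start else 0.0 for s in range(len(transition))]
    for _ in range(k_max):
        row = matvec(row, transition)
        mass = sum(row[s] for s in final)   # total absorbed mass so far
        dist.append(mass - absorbed)        # newly absorbed at this step
        absorbed = mass
    return dist

def forecast_interval(dist, theta_fc):
    """Smallest interval (start, end) of future steps with P >= theta_fc."""
    best = None
    for i in range(len(dist)):
        total = 0.0
        for j in range(i, len(dist)):
            total += dist[j]
            if total >= theta_fc:
                if best is None or j - i < best[1] - best[0]:
                    best = (i + 1, j + 1)   # steps are 1-based
                break
    return best

# Toy pattern "a followed by b" over i.i.d. events with P(a) = P(b) = 0.5:
T = [[0.5, 0.5, 0.0],   # state 0: 'a' advances, 'b' self-loops
     [0.0, 0.5, 0.5],   # state 1: 'b' completes the match
     [0.0, 0.0, 1.0]]   # state 2: final, absorbing
dist = waiting_time(T, start=1, final={2}, k_max=4)
print(dist)                          # → [0.5, 0.25, 0.125, 0.0625]
print(forecast_interval(dist, 0.5))  # → (1, 1)
```

Raising the threshold widens the interval: with θfc = 0.7 the same distribution yields the interval (1, 2), which matches the precision/spread trade-off discussed above.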

Figure 4: Event forecasting for credit card fraud management (top) and maritime monitoring (bottom): per-state plots of (a, d) precision, (b, e) spread and (c, f) distance. The y axes correspond to different values of the threshold θfc; the x axes correspond to states of the PMC.

6. PARTICIPATION IN RESEARCH & INNOVATION PROJECTS

The CER group has been participating in several research and innovation projects, contributing to the development of intelligent systems in challenging domains. SPEEDD^2 (Scalable Proactive Event-Driven Decision Making) was an FP7 EU-funded project, coordinated by the CER group, that developed tools for proactive analytics in Big Data applications. In SPEEDD, the CER group worked on credit card fraud detection and traffic management [3, 15], developing formal tools for highly scalable CER [4] and pattern learning [10, 16].

REVEAL^3 (REVEALing hidden concepts in social media) was an FP7 EU project that developed techniques for the real-time extraction of knowledge from social media, including influence and reputation assessment. In REVEAL, the CER group developed a technique for online (single-pass) learning of event patterns under uncertainty [11].

datACRON^4 (Big Data Analytics for Time Critical Mobility Forecasting) is an H2020 EU project that introduces novel methods for detecting threats and abnormal activity in very large fleets of moving entities, such as vessels and aircraft. Similarly, AMINESS^5 (Analysis of Marine Information for Environmentally Safe Shipping) was a national project that developed a computational framework for environmental safety and cost reduction in the maritime domain. In these projects, the CER group has been working on maritime and aviation surveillance, developing algorithms for, among others, highly efficient spatio-temporal pattern matching [18], complex event forecasting [1], and the parallel online learning of complex event patterns [12].

Track & Know (Big Data for Mobility & Tracking Knowledge Extraction in Urban Areas) is an H2020 EU-funded project that will research, develop and exploit a new software framework increasing the efficiency of Big Data applications in the transport, mobility, motor insurance and health sectors. The CER team is responsible for the complex event recognition and forecasting technology that will be developed in Track & Know.

^2 http://speedd-project.eu/
^3 http://revealproject.eu/
^4 http://www.datacron-project.eu/
^5 http://aminess.eu/

7. CONTRIBUTIONS TO THE COMMUNITY

The CER group supports the research community at different levels; notably, by making the proposed research methods available as open-source solutions. The RTEC CER engine (see Section 2) is available as a monolithic implementation^6 and as a parallel Scala implementation^7. The OLED system for online learning of event patterns (see Section 4) is also available as an open-source solution^8, both for single-core and parallel learning. OLED is implemented in Scala; both OLED and RTEC use the Akka actors library for parallel processing.

The OSLα online learner (see Section 4), along with MAP inference based on integer linear programming and various weight optimisation algorithms (Max-Margin, CDA and AdaGrad), is contributed to LoMRF^9, an open-source implementation of Markov Logic Networks. LoMRF provides predicate completion, clausal form transformation, and function elimination. Moreover, it provides a parallel grounding algorithm which efficiently constructs the minimal Markov Random Field.

^6 https://github.com/aartikis/RTEC
^7 https://github.com/kontopoulos/ScaRTEC
^8 https://github.com/nkatzz/OLED
^9 https://github.com/anskarl/LoMRF

8. REFERENCES

[1] E. Alevizos, A. Artikis, and G. Paliouras. Event forecasting with pattern Markov chains. In DEBS, pages 146-157. ACM, 2017.
[2] E. Alevizos, A. Skarlatidis, A. Artikis, and G. Paliouras. Probabilistic complex event recognition: A survey. ACM Comput. Surv., 50(5):71:1-71:31, 2017.
[3] A. Artikis, N. Katzouris, I. Correia, C. Baber, N. Morar, I. Skarbovsky, F. Fournier, and G. Paliouras. A prototype for credit card fraud management: Industry paper. In DEBS, pages 249-260, 2017.
[4] A. Artikis, M. J. Sergot, and G. Paliouras. An event calculus for event recognition. IEEE TKDE, 27(4):895-908, 2015.
[5] A. Artikis, M. Weidlich, F. Schnitzler, I. Boutsis, T. Liebig, N. Piatkowski, C. Bockermann, K. Morik, V. Kalogeraki, J. Marecek, A. Gal, S. Mannor, D. Gunopulos, and D. Kinane. Heterogeneous stream processing and crowdsourcing for urban traffic management. In EDBT, pages 712-723, 2014.
[6] G. Cugola and A. Margara. Processing flows of information: From data stream to complex event processing. ACM Comput. Surv., 44(3):15:1-15:62, 2012.
[7] N. Giatrakos, A. Artikis, A. Deligiannakis, and M. N. Garofalakis. Complex event recognition in the big data era. PVLDB, 10(12):1996-1999, 2017.
[8] W. Hoeffding. Probability inequalities for sums of bounded random variables. JASA, 58(301):13-30, 1963.
[9] T. N. Huynh and R. J. Mooney. Online structure learning for Markov logic networks. In ECML, pages 81-96, 2011.
[10] N. Katzouris, A. Artikis, and G. Paliouras. Incremental learning of event definitions with inductive logic programming. Machine Learning, 100(2-3):555-585, 2015.
[11] N. Katzouris, A. Artikis, and G. Paliouras. Online learning of event definitions. TPLP, 16(5-6):817-833, 2016.
[12] N. Katzouris, A. Artikis, and G. Paliouras. Parallel online learning of complex event definitions. In ILP. Springer, 2017.
[13] A. Kimmig, B. Demoen, L. D. Raedt, V. Costa, and R. Rocha. On the implementation of the probabilistic logic programming language ProbLog. TPLP, 11(2-3):235-262, 2011.
[14] R. A. Kowalski and M. J. Sergot. A logic-based calculus of events. New Generation Comput., 4(1):67-95, 1986.
[15] E. Michelioudakis, A. Artikis, and G. Paliouras. Online structure learning for traffic management. In ILP. Springer, 2016.
[16] E. Michelioudakis, A. Skarlatidis, G. Paliouras, and A. Artikis. Online structure learning using background knowledge axiomatization. In ECML, 2016.
[17] G. Nuel. Pattern Markov chains: Optimal Markov chain embedding through deterministic finite automata. Journal of Applied Probability, 2008.
[18] K. Patroumpas, E. Alevizos, A. Artikis, M. Vodas, N. Pelekis, and Y. Theodoridis. Online event recognition from moving vessel trajectories. GeoInformatica, 21(2), 2017.
[19] O. Ray. Nonmonotonic abductive inductive learning. JAL, 7(3):329-340, 2009.
[20] M. Richardson and P. M. Domingos. Markov logic networks. Machine Learning, 62(1-2):107-136, 2006.
[21] A. Skarlatidis, A. Artikis, J. Filipou, and G. Paliouras. A probabilistic logic programming event calculus. TPLP, 15(2):213-245, 2015.
[22] A. Skarlatidis, G. Paliouras, A. Artikis, and G. A. Vouros. Probabilistic event calculus for event recognition. ACM Transactions on Computational Logic, 16(2):11:1-11:37, 2015.