Spatiotemporal Pattern Queries

Dissertation approved by the Faculty of Mathematics and Computer Science of the FernUniversität in Hagen

for the award of the academic degree of Doktor der Naturwissenschaften (Dr. rer. nat.)

by Mahmoud Attia Sakr
Place of birth: El-Sharkia, Egypt

Hagen, 2012

Dedicated to the martyrs of the Egyptian revolution of January 25th, 2011. To those who gave their lives to make Egypt a better place for everyone.

Acknowledgement

[Drawing: two sketches of a path from "start" to "goal".]

When people see my work, they think I have reached these results like this. But actually it was like this!

I guess this drawing holds for all research projects. When I saw it some years ago on Facebook, I thought this is really expressive! It was fun going the whole way to finishing this thesis. Without the support of my work group, family and friends, it would not have been possible to reach this end. I gratefully thank all of them and wish them all the best! All thanks and praise to God who has planned all this for me. My adviser Prof. Ralf Hartmut Güting has been actively involved at all stages of this work from writing the proposal to finishing the thesis. He was always available for discussing research issues. I admire his ability to establish such a good relationship with foreign students. Very shortly after I joined his group, he introduced me to his family and friends. Finally, he is one of the few people I know who could beat me at table tennis! Thank you, Hartmut, for all your support, for the friendship, and for being a great teacher! I thank my colleagues in the work group for their patience during the long hours they spent with me helping with my research. I surely needed all the goodwill they have shown towards me. Thanks to Prof. Michael Gertz who accepted to review this thesis, and did the review in a short time. This was essential to me, as I was committed to a limited stay in Germany. I also need to express all my love to my family and to my son. They have carried on my responsibilities in Egypt during my stay in Germany. Despite the distance, they managed to always let me know they were continuously supporting me.


This work was funded in the first two years by DAAD, with supplementary funds from the Egyptian study mission. The remaining 21 months were funded by the FernUniversität in Hagen, Lehrgebiet Datenbanksysteme für neue Anwendungen, Prof. Ralf Hartmut Güting.

Contents


Acknowledgement

List of Tables

List of Algorithms

List of Figures

Abstract

1 Introduction
  1.1 Problem Definition
  1.2 Thesis Contributions
      1.2.1 Individual STP Queries
      1.2.2 Group STP Queries
      1.2.3 Contributions
  1.3 Thesis Organization

2 Related Work
  2.1 Individual Spatiotemporal Pattern Queries
  2.2 Group Spatiotemporal Pattern Queries

3 The Moving Objects Data Model (Review)
  3.1 The Type System
  3.2 SECONDO
  3.3 The Query
  3.4 Time-Dependent Predicates

4 Individual Spatiotemporal Patterns
  4.1 The Spatiotemporal Pattern Predicate
  4.2 Evaluating The STP Predicate
  4.3 Extending The Definition Of The STP Predicate
  4.4 Optimizing The STP Predicate
      4.4.1 Basic Assumptions
      4.4.2 Query Optimization
  4.5 The Implementation In Secondo
      4.5.1 Extending The Kernel


      4.5.2 Extending The Optimizer
  4.6 Experimental Evaluation
      4.6.1 Experiment 1: The Overhead Of Evaluating The STP Predicate
      4.6.2 Experiment 2: Run-Time And Optimization Gain
      4.6.3 Experiment 3: Scalability
  4.7 Conclusions

5 Group Spatiotemporal Patterns
  5.1 Extending The Type System
  5.2 Group Spatiotemporal Pattern Operators
      5.2.1 The reportpattern Operator
      5.2.2 The gpattern Operator
      5.2.3 The crosspattern Operator
  5.3 The crosspattern Evaluation Algorithm
      5.3.1 Related Work (REVIEW)
      5.3.2 The Proposed Algorithm
  5.4 Examples
  5.5 Optimization
      5.5.1 Optimizing The gpattern Operator
      5.5.2 Optimizing The crosspattern Operator
      5.5.3 Optimizing The reportpattern Operator
      5.5.4 Notes On The Implementation
  5.6 Experimental Evaluation
      5.6.1 Evaluating The gpattern Operator
      5.6.2 Evaluating The crosspattern Operator
      5.6.3 A Remark
  5.7 Conclusions

6 Temporal Reasoning
  6.1 Temporal Reasoning (REVIEW)
  6.2 The Reasoning Task
  6.3 Integration With The Individual STP Operators
  6.4 Experimental Evaluation
      6.4.1 Experiment 1: The Overhead Of Temporal Reasoning
      6.4.2 Experiment 2: The Performance Gain Of Temporal Reasoning
  6.5 Discussion And Conclusions

7 Application Examples
  7.1 Air Traffic Control
      7.1.1 Missed Approach
      7.1.2 Stepwise-Descent Landing
  7.2 Ghent City Game

8 Conclusions and Future Work

Appendices

A Experimental Repeatability
  A.1 Repeating The Individual STP Experiments
      A.1.1 Repeating the First Experiment
      A.1.2 Repeating the Second Experiment

List of Tables

2.1 Examples of the Group STPs

3.1 The type system

4.1 Individual spatiotemporal pattern operators
4.2 A language for expressing interval relationships
4.3 Mapping time-dependent predicates into standard predicates
4.4 The database relations used in the scalability experiment

5.1 The extended type system
5.2 Group spatiotemporal pattern operators
5.3 Composing component states
5.4 Component states occurrence frequencies
5.5 The datasets used for evaluating the gpattern operator
5.6 Description of the datasets of the crosspattern experiment

6.1 Individual STP operators with temporal reasoning

A.1 The schema of the STPQExpr1Result relation
A.2 The schemas of the Expr2StatsDO.csv and Expr2StatsEO.csv files

List of Algorithms

1  The MatchPattern Algorithm
2  The Extend Helper Function
3  The gpattern Algorithm
4  The mbool2mset Helper Function
5  The DeleteUnitsBelowSize Helper Function
6  The DeleteElemsBelowDuration Helper Function
7  The crosspattern Algorithm
8  The FindLargeConnectedComponents Helper Function
9  The FindConnectedComponents Helper Function
10 The IA-To-PA Helper Function
11 MatchPattern With Temporal Reasoning
12 The GetConsistentPeriods Helper Function

List of Figures

3.1 The sliced representation of a moving point
3.2 Time-dependent predicates

4.1 Evaluating the STP predicate
4.2 The overhead of evaluating the STP predicate
4.3 The run-time of the STP queries on the Trains20 relation
4.4 Scalability results

5.1 State NewlyAdded followed by state RemoveNow
5.2 State NewlyAdded followed by state Split
5.3 State Merged followed by state Split
5.4 Helper indexes for representing the nodes of the pattern graph
5.5 The run-time of the gpattern operator
5.6 The run-time of the crosspattern operator

6.1 Continuous and discontinuous endpoint relationships
6.2 The Decompose vector
6.3 The | operator
6.4 The overhead of temporal reasoning
6.5 The performance gain of temporal reasoning
6.6 Run-time statistics

7.1 Integrating SECONDO with V-Analytics
7.2 Missed approach query results
7.3 Stepwise-descent query results

Abstract

Capturing moving objects data is now possible and becoming cheaper with the advances in positioning and sensor technologies. The increasing amount of such data and the various fields of applications call for intensive research work for building spatiotemporal database management systems. This involves several tasks such as modeling the moving objects data, designing query methods, query optimization, etc. This PhD project goes in this direction. Our goal is to design and implement query operators for the Spatiotemporal Pattern Queries (STP queries). This includes the design of expressive query operators, the integration with the query optimizers for efficient evaluation, and the implementation within the context of a spatiotemporal DBMS.

In this thesis we distinguish between two kinds of spatiotemporal patterns: individual and group, according to the number of objects composing the pattern. Accordingly, two sets of query operators are proposed. An individual STP is a certain movement profile shown by the moving object during its observation period. An example of this is the pattern of a drifting ship. When a ship has engine problems, and the crew has no more control over it, it starts drifting. It moves according to the wind and current. This movement pattern is characterized by low speed and a high drifting angle (the difference between the heading and the course over ground). Another example is the holding pattern of aircraft. When the runway is busy and the landing of an aircraft has to be delayed, it is directed to join the holding pattern. The aircraft flies in circles over a pre-defined area in the airspace of the destination airport until the runway is ready.

The first part of this work proposes query operators for expressing such patterns and matching them on trajectory databases. They allow one to specify temporal order constraints on the fulfillment of several predicates. This way of expressing patterns is novel. The aircraft holding pattern, for instance, would be expressed using three predicates: the aircraft is close to its destination, the aircraft flies in circles, and the aircraft lands. The three predicates are required to be fulfilled in the temporal order: while the aircraft is close to its destination, it flies in circles and then it lands.

Our proposed solution for individual STP queries builds on the well-established concept of time-dependent predicates. These are predicates that yield time-dependent Booleans (functions from time to Boolean). They are generic, and neither restricted to certain types of moving objects nor to certain types of operations. Hence, the proposed operators can express arbitrarily complex patterns that involve various types of spatiotemporal operations such as range, metric, topological, set operations, aggregations, distance, direction, and boolean operations.

They are also applicable to various types of moving objects, such as moving point objects, moving regions, and time series. This expressive power is unmatched in previous approaches. This work covers the language integration, the evaluation algorithms of the operators, and the integration with the query optimizer. We provide a complete implementation in C++ and Prolog in the context of the Secondo system. The implementation is made publicly available as part of the Secondo system with hands-on tutorials and scripts to repeat the experiments in this thesis.

The second part of this PhD project focuses on group spatiotemporal patterns. These are the movement patterns, in space and time, formed by groups of moving objects, such as flocks, concurrence, encounter, etc. In order to match such patterns, one needs to simultaneously analyze the movement of several moving objects, and their spatiotemporal relationships to one another. We propose query operators that can consistently express and match a wide range of group STPs. These operators also allow for expressing composite patterns. For example, one can express in a query that a flock of type-A objects meets a convoy of type-B objects. Our approach defines generic operators that allow the user to define the pattern by varying the values of the parameters. This is in contrast to the common approach in the literature, which pre-defines certain patterns (e.g., flock) and provides algorithms for matching them. We formally define these operators, illustrate the evaluation algorithms, and discuss the optimization methods. Several examples are given to showcase the expressive power.

The operators that we propose for individual and group STP queries share common concepts. They are based on the same moving objects model. This means that they can be combined together in user queries. One is able to express an individual STP on the result of a group STP and vice versa. Thus, new queries can be expressed that were not possible before. This expressiveness is achieved without compromising the efficiency.

Chapter 1

Introduction

Moving objects are objects that change their location and/or extent with time. These could be humans, cars, animals, clouds, parcels, planets, oil spills, etc. They could be moving in the free space, such as birds or animals in the forest, or restricted to some network, like cars on a street network. The advances and the widespread usage of positioning and sensor technologies have made the tracking of moving objects feasible in terms of technology and implementation cost. Aircraft are being tracked by radar systems, sea vessels by the Automatic Identification System, GPS receivers are being integrated into new generation cars and mobile devices, cell phone service providers are mandated to implement mobile tracking as a part of the Enhanced 911 emergency service, RFID tags are more and more used on transportation tickets, library books, etc. As a result, very large amounts of moving objects data (also called spatiotemporal data) are now available in digital formats.

In the world of computer applications, there is a wide variety of applications that need to deal with moving objects. Air traffic control (ATC) systems, for instance, are used to track the movement of aircraft, analyze these tracks, and build safety models to maximize both safety and capacity. These systems need answers to questions like: how many flights had to wait for a free runway to land?, and what is the average waiting time over a given airport? It is also interesting to do a complex analysis to estimate the amount of noise and pollution around airports. Researchers in the field of animal behavioral studies are interested in tracking individuals and groups of animals, and studying their migration patterns, home zones, mating, response to weather changes, etc. Cell-phone service providers are interested in analyzing how people move and how this affects the loads on their base stations.

Currently such applications are being developed in an ad-hoc fashion. At best, they try to extend the functionality of a geographical information system (GIS). Clearly, moving object database management systems are needed as a foundation for such applications. The research in moving object databases (MOD) started around 1996, and became active in the early 2000s. Since then it has been attracting the attention of the database community. There has been research work in modeling the moving objects data, spatiotemporal query operations, query optimization, spatiotemporal indexes, spatiotemporal data privacy, spatiotemporal data mining, etc.


Few prototype MOD systems have been developed, namely Secondo [Secd], HERMES [PTVP06], and DEDALE [GRS98] (the DEDALE project is currently stopped, and the system is not available anymore). These research results and systems are still not widely used in real-world applications. Only recently have projects been carried out in order to bring together research and applications of MOD (e.g., the European project MOVE (COST Action IC0903)).

There are two flavors of moving object databases. The first deals with the current movement and the predicted near future (e.g., [WXCJ98]). The second deals with the trajectories or the history of the movement (e.g., [GBE+00]). This thesis focuses on the second class of models, the trajectory databases. This thesis is part of the continuous work to represent and query moving objects [GBE+00] [FGNS00] [CLFG+03]. The result of this work so far is a MOD system (called Secondo) incorporating various data representations, large collections of query operations, indexes, two levels of query languages: executable and SQL-like, a query optimizer, and a graphical user interface (GUI). This thesis aims at proposing generic query operators to support the complex class of queries called spatiotemporal pattern queries.

1.1 Problem Definition

Spatiotemporal pattern queries provide a complex query framework for moving objects. In particular, they specify temporal order constraints between a set of predicates. For example, suppose that the predicates P, Q, and R may hold over one or more time intervals and/or instants. We would like to be able to express conditions like the following:

• P then (later) Q then R.

• P ending before 8:30 then Q for no more than 1 hour.

• (Q then R) during P .

The predicates P, Q, and R might be of the form:

• Vehicle X on road W .

• Train X is inside a snow storm Y .

• The extent of the storm area Y is larger than 4 square kms.

• The speed of airplane Z is between 400 and 500 km/h.

For such conditions to hold, there must exist a time interval/instant for each of the predicates, during which it is fulfilled, and this set of time intervals/instants must fulfill the specified temporal order. The spatiotemporal patterns described by such conditions cannot be expressed in traditional queries. That is, a traditional query has a single predicate (possibly a conjunction). The kind of queries above consists of several predicates and temporal constraints between them. This is why one would rather need spatiotemporal pattern queries (STP queries).

Shortly after this PhD project was started, we could distinguish between two kinds of STP queries. We call the ones described above individual spatiotemporal pattern queries. That is, the pattern can be matched against a single trajectory, e.g.:

Find all trains that encountered a delay of more than half an hour after passing through a snow storm.

The example shows an individual STP that is a sequence of two predicates: train crosses a snow storm, then delay of train >= 30 min. The pattern can be completely evaluated for every train, hence the name individual.

An individual spatiotemporal pattern query reports the moving objects that fulfill a set of predicates in a certain temporal order.
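Stated as a sketch of the intended semantics (the STP predicate and its evaluation are defined formally in Chapter 4), a moving object fulfills a condition such as "P then (later) Q then R" from the list above exactly when suitable fulfillment times exist:

\[
\exists\, i_P,\, i_Q,\, i_R:\;
P \text{ holds throughout } i_P \;\wedge\;
Q \text{ holds throughout } i_Q \;\wedge\;
R \text{ holds throughout } i_R \;\wedge\;
i_P \text{ ends before } i_Q \text{ starts} \;\wedge\;
i_Q \text{ ends before } i_R \text{ starts},
\]

where each of $i_P$, $i_Q$, $i_R$ is a time interval or instant. The other example conditions are obtained by replacing the last two conjuncts with the corresponding constraints, e.g., absolute bounds such as "ending before 8:30", duration bounds, or containment for "during".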

The second kind we call group spatiotemporal pattern queries. A group STP query reports the patterns that involve a collective movement of several moving objects, e.g.:

Find a flock of geese migrating in the northern direction.

The query looks for a group of geese traveling close to one another, in a flock formation, in the northern direction. Answering such a query requires analyzing the trajectories of several moving objects, and their spatiotemporal relationships to one another. This is more related to spatiotemporal data mining.

A group spatiotemporal pattern query reports the moving objects that collectively show a certain spatiotemporal formation.

The objectives of this PhD project are to provide generic query operators for expressing the two kinds of spatiotemporal pattern queries, based on the MOD model in [GBE+00] [FGNS00] [CLFG+03], and to implement them within the Secondo system [Secd] where a big part of this MOD model is realized. By generic we mean that the operators are able to express and evaluate a wide range of STP queries. In the following, we start directly with a rough illustration of the operators that we have proposed.

1.2 Thesis Contributions

Although the technical development has not yet started at this point, we discuss the thesis contributions here. We think it is better to give a rough big picture of the whole work before going into the details. This is intended to help the reader easily follow the technical development in the rest of the thesis. It is also a good introduction before reviewing the related work (in

Chapter 2). This discussion does not cover the details, however. It is fine if, after reading this section, the reader still has questions about the data representation or about the exact function of the proposed operators. We start with casual descriptions and examples of the proposed operators for individual STP queries (Section 1.2.1) and group STP queries (Section 1.2.2). Finally, we list the thesis contributions in Section 1.2.3.

1.2.1 Individual STP Queries

The Spatiotemporal Pattern Predicate (STP predicate) is the tool/operator that we propose for expressing individual STP queries. It describes the pattern as a set of time-dependent predicates and another set of binary temporal constraints. Consider the following example:

Example 1.1 (Bank robbers). A query for possible bank robbers may look for the cars which entered a gas station, kept close to the bank for a while, then drove away fast.

The query describes an individual STP consisting of three time-dependent predicates: car inside gas station, car close to the bank, and speed of car ≥ 80 km/h. We call these predicates time-dependent, because their values change with time. Take, for instance, the predicate car inside gas station. An individual car might be driving close to the gas station (i.e., the predicate is unfulfilled), it enters the gas station (i.e., fulfilled), then it leaves (i.e., unfulfilled again). It might even enter the gas station several times. Therefore the result of such a predicate is a time-dependent boolean, as will be explained in Chapter 3. In this example, the three predicates are required to be fulfilled in a sequential temporal order.

Now we illustrate how this query is expressed in Secondo, where we have implemented the STP predicate. Secondo accepts queries at two language levels: executable and SQL-like. The executable queries are processed directly by the Secondo kernel. The SQL-like queries are accepted by the Secondo optimizer, which translates them into optimized executable queries, and passes them to the kernel for execution. The two language levels are accessible to the end user. The bank robbers query is written in the SQL-like language as follows:

SELECT c.licencenumber
FROM cars c, landmark l
WHERE l.type = "gas station" and
      pattern([c.trip inside l.region AS gas,
               distance(c.trip, bank) < 50.0 AS bnk,
               speed(c.trip) > 80000 AS leaving],
              [stconstraint(gas, bnk, vec("aabb")),
               stconstraint(bnk, leaving, vec("abab", "aa.bb", "aabb"))])

where c.trip is a moving point object representing the car trajectory. The STP predicate, denoted pattern in the SQL-like syntax, includes the three time-dependent predicates:

c.trip inside l.region,
distance(c.trip, bank) < 50.0,
speed(c.trip) > 80000

They have the aliases gas, bnk, and leaving respectively. The STP predicate requires that every time-dependent predicate be assigned a unique alias, so that these aliases can be used as references in the temporal constraints. This is analogous to the aliases given to attributes and tables in standard SQL. The STP predicate in this example has two temporal constraints, denoted stconstraint. Each of them states a temporal relationship between a pair of the time-dependent predicates (i.e., binary temporal constraints). The operator vec(.) states the temporal order between the fulfillments of the two time-dependent predicates.

Each of the terms/arguments of the vec operator specifies a relation between two time intervals. The start and the end time instants of the first interval are denoted aa, and those of the second interval are denoted bb. The order of the symbols describes the interval relationship visually. The dot symbol denotes equality. For example, the relation aa.bb between the intervals i1, i2 denotes the order ((i1.t1 < i1.t2) ∧ (i1.t2 = i2.t1) ∧ (i2.t1 < i2.t2)). The temporal relation expressed by the vec operator is the disjunction of its components. A temporal constraint is fulfilled iff there is a time interval/instant i1 during which the first time-dependent predicate is fulfilled, and another interval/instant i2 during which the second time-dependent predicate is fulfilled, and i1, i2 fulfill the relation within the vec operator.

Back to the example above, the first temporal constraint states that the car came close to the bank after it had left the gas station. The second constraint is a bit more tricky. We wish to say that the car left the bank area quickly. This means that the car started fast, or may have started normally and then sped up after a while. The three arguments of the vec(.) operator state these possibilities, which are treated disjunctively (the full disjunction is spelled out at the end of this subsection). The same query would be written in the Secondo executable language as follows:

query cars feed {c}
      landmark feed {l} filter[.type_l = "gas station"]
      product
      filter[. stpattern[gas: .trip_c inside .region_l,
                         bnk: distance(.trip_c, bank) < 50.0,
                         leaving: speed(.trip_c) > 80000;
                         stconstraint("gas", "bnk", vec("aabb")),
                         stconstraint("bnk", "leaving", vec("abab", "aa.bb", "aabb"))]]
      consume;

The Secondo executable is a procedural language, in which one specifies step-by-step operators for achieving the result. The feed is a postfix operator that scans a relation sequentially and loads it from disk into a memory stream of tuples. The curly brackets denote the rename operator.

It renames all the attributes in the input stream of tuples by appending an underscore character plus the given character sequence at the end of every attribute name. The two relations cars and landmark are fed into memory and renamed. The landmark stream is filtered so that only the gas stations remain. The product operator performs a cross product between the two streams. Then the resulting tuple stream is filtered using the STP predicate stpattern, so that only the tuples that fulfill the pattern remain. The pattern description in the executable language is similar to that in the SQL-like language, except for some syntax differences. Finally, the consume operator writes the result tuple stream to a disk relation, and displays the result in the terminal. Please note that our contribution here is the pattern operator (or the stpattern operator in the executable language). The rest of the operators are already part of Secondo.
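For concreteness, here is the second constraint of the bank robbers query spelled out, reading the vec terms in the visual way described above and using the same i.t1/i.t2 notation as the aa.bb example (this is a reformulation for illustration, not a quotation from the formal definitions):

\[
\begin{aligned}
\texttt{abab}:\;& (i_1.t_1 < i_2.t_1) \wedge (i_2.t_1 < i_1.t_2) \wedge (i_1.t_2 < i_2.t_2)\\
\texttt{aa.bb}:\;& (i_1.t_1 < i_1.t_2) \wedge (i_1.t_2 = i_2.t_1) \wedge (i_2.t_1 < i_2.t_2)\\
\texttt{aabb}:\;& (i_1.t_1 < i_1.t_2) \wedge (i_1.t_2 < i_2.t_1) \wedge (i_2.t_1 < i_2.t_2)
\end{aligned}
\]

The constraint stconstraint(bnk, leaving, vec("abab", "aa.bb", "aabb")) is fulfilled iff there are fulfillment times i1 of bnk and i2 of leaving that satisfy at least one of the three lines, i.e., the high-speed interval overlaps, immediately follows, or strictly follows the interval of being close to the bank.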

1.2.2 Group STP Queries

This section, similar to the previous one, roughly illustrates our group STP query operators by example. The details of these operators are explained in Chapter 5. Assume we are given a dataset of gazelles moving in a forest. The dataset has two attributes: Id, representing the gazelle identifiers, and Trip, representing the gazelle trajectories as moving point objects (details about the moving point object representation follow in Chapter 3):

Example 1.2 (Concurrence). Find a group of at least 20 gazelles that concurrently move with a speed of more than 15 km/h, for at least 10 minutes.

Example 1.3 (Convergence). Find a group of at least 20 gazelles moving simultaneously for 10 minutes towards some meeting place.

Example 1.4 (Composite). Find a group of at least 20 gazelles which moved simultaneously for 10 minutes towards some meeting place, then moved together with a speed of at least 15 km/h.

Example 1.2 is a typical concurrence pattern. That is, a group of n moving objects shows the same value for some motion attribute (here it is the speed) for a certain time duration. Notice that in order to evaluate this query, we need to compute the time-dependent speed of each gazelle in the input, and then search for such a group. The pattern, hence, is expressed and evaluated based on a motion attribute that is evaluated for each object in the input independently of the others. For this kind of pattern we propose the gpattern operator. The query is as follows:

let ten_minutes = create_duration(0, 600000);

SELECT * FROM gpattern(Gazelles, Id, speed(Trip) > 15.0,
                       ten_minutes, 20, "atleast");

The gpattern operator is implemented as a table expression (i.e., an expression that yields a table). Therefore, it is used in the FROM clause. It accepts six arguments: the input relation, the identifier attribute, a time-dependent predicate, the minimum duration of the pattern, the number of objects in the group, and a quantifier for the number of objects, either atleast or exactly. The gpattern operator evaluates the time-dependent predicate speed(Trip) > 15.0 for every gazelle in the input relation. Using the evaluated time-dependent booleans, it finds the groups with a cardinality of at least 20 gazelles that concurrently fulfill the time-dependent predicate for at least 10 minutes (a simplified sketch of this idea is shown at the end of this subsection). There might be several such groups. Therefore, we expect a stream in the result. Every element in the result stream is a time-dependent set of identifiers (i.e., a set whose elements change over time) representing one such group. This is because more gazelles than the minimum group size might be joining and leaving the group.

The group STP in Example 1.3 is of another kind that cannot be expressed by the gpattern operator. It is a convergence pattern, in which a group of n objects converge to one another so that they would meet in a circular region if they kept their speed and heading. This pattern is to be expressed in terms of the mutual distance between the gazelles. That is, we are looking for a group of at least 20 gazelles that are close enough to one another and whose distances to one another keep decreasing with time. And the whole thing must last for at least 10 minutes. The motion attribute upon which this STP is expressed is not computed for every gazelle independently of the others, hence it is not expressible by the gpattern operator.

For this kind of pattern, we propose the crosspattern operator. It expresses a group STP in terms of dual interactions between the moving objects. The query of Example 1.3 looks as follows:

SELECT * FROM crosspattern(
    (SELECT * FROM Gazelles g1, Gazelles g2 WHERE g1.Id < g2.Id),
    g1.Id, g2.Id,
    (distance(g1.Trip, g2.Trip) < 1000.0) and
    (isdecreasing(distance(g1.Trip, g2.Trip))),
    ten_minutes, 20, "clique");

Similar to gpattern, the crosspattern operator is also a table expression. It requires pairs of gazelles in the input stream, hence the Gazelles relation is first self-joined. Besides the input stream, the crosspattern operator accepts six other arguments: the identifier of the first gazelle, the identifier of the second gazelle, a time-dependent predicate, the minimum duration of the pattern, the minimum group size, and a string constant. The time-dependent predicate is evaluated for every input tuple (i.e., every pair of gazelles). Conceptually, this computes a kind of time-dependent join. That is, two tuples from the original Gazelles relation join together whenever the time-dependent predicate holds. Internally, within the crosspattern operator, this is represented as a time-dependent graph, as will be defined in Chapter 5. The nodes of this graph are the gazelles, and the edges are the values of the time-dependent predicates. That is, an edge exists between two nodes whenever the time-dependent predicate holds (i.e., the two gazelles are getting closer to one another).

The last three arguments of the crosspattern operator define search criteria for a special sub-graph within this time-dependent graph. In this example, we look for a clique containing a minimum of 20 nodes that lasts for a duration of at least ten minutes. A clique in this case means that each gazelle in the group is continuously getting closer to all the other gazelles in the same group. One can specify other sub-graph types according to the pattern one is describing (e.g., a connected component). The self-join of the Gazelles relation in this example is a brute-force solution to obtain pairs of gazelles. For an efficient execution, one would use an index to retrieve only the pairs of gazelles that have a chance of fulfilling the time-dependent predicate.

The two operators gpattern and crosspattern are able to express most of the group patterns studied so far in the literature, as will be discussed in Chapter 5. Both of them are built on top of the time-dependent predicates. To allow for composite group STP queries, we add another operator on top of them, the reportpattern operator. So, now there are three levels: the time-dependent predicates at the bottom, above them the gpattern and the crosspattern operators, and the reportpattern operator at the top.

We call the gpattern and the crosspattern operators patternoid operators. That is, they express simple patterns, such as concurrence, flock, convergence, divergence, convoy, etc. The reportpattern operator allows for expressing composite group STPs consisting of several such patternoids. It allows one to specify constraints (e.g., temporal and spatial) between them. Example 1.4 describes such a composite pattern, which consists of the two patternoids in the previous examples. They are composed based on three constraints: (1) a temporal constraint restricting the concurrence patternoid to occur after the convergence patternoid, (2) a spatial constraint that the concurrence patternoid starts from the final location of the convergence patternoid (i.e., the place of meeting), and (3) an attribute constraint that the same group of moving objects matches the two patternoids. The query looks as follows:

1  let then = vec("abab", "aa.bb", "aabb");
2  SELECT [converge, concurrence]
3  FROM (
4    SELECT [converge, concurrence,
5      val(final(converge)) AS convFinSet,
6      val(initial(concurrence)) AS concInitSet,
7      mset2mreg(Gazelles, Id, Trip, concurrence) AS concMReg,
8      mset2mreg(Gazelles, Id, Trip, converge) AS convMReg,
9      val(final(convMReg)) AS convFinReg,
10     val(initial(concMReg)) AS concInitReg]
11   FROM
12     reportpattern([
13       crosspattern(
14         ( SELECT * FROM Gazelles g1, Gazelles g2
15           WHERE g1.Id < g2.Id ),

16         g1.Id, g2.Id,
17         ( distance(g1.Trip, g2.Trip) < 1000.0 ) and
18         ( isdecreasing(distance(g1.Trip, g2.Trip)) ),
19         ten_minutes, 20, "clique") AS converge,
20       gpattern(Gazelles, Id, speed(Trip) > 15.0, ten_minutes, 20, "atleast") AS concurrence],
21     [stconstraint(converge, concurrence, then)] ))
22 WHERE
23   [ no_components(convFinSet intersection concInitSet) >= 20 AND
24     convFinReg intersects concInitReg AND
25     not(sometimes(area(concMReg) > 8000.0))]

The reportpattern operator in this query (lines 12-21) contains two patternoid operators (explained in the previous two examples), and one temporal constraint. The two patternoids are given the aliases concurrence and converge. The temporal constraint is expressed by the stconstraint operator. It states that converge happens first, then concurrence, where then is defined by the let command in line 1. Each of the two patternoid operators yields a stream of time-dependent sets, where each set represents a group of gazelles fulfilling the patternoid as explained in Examples 1.2 and 1.3 above. The reportpattern operator computes the product of these two streams, and filters it using the temporal constraints. In this example, a pair of time-dependent sets fulfills the temporal constraint if their definition times fulfill the then relationship (i.e., by fulfilling any of its three components).

The reportpattern operator yields a tuple stream having two attributes: converge and concurrence, where each one is a time-dependent set. The spatial and the attribute constraints between the two patternoids are applied to this tuple stream. First, we extend this tuple stream by the following computed attributes (lines 5-10):

• convFinSet is computed as the set of gazelle identifiers contained in the time-dependent set converge at its final time instant.

• concInitSet is computed as the set of gazelle identifiers contained in the time-dependent set concurrence at its initial time instant.

• concMReg is a moving region object that represents the time-dependent convex hull of the gazelles that belong to the time-dependent set concurrence. The operator mset2mreg does this interpolation. It is explained later in more detail.

• convMReg is similar to concMReg, but for the converge attribute.

• convFinReg is the final region of convMReg.

• concInitReg is the initial region of concMReg.

Now this extended tuple stream is passed to the outer SELECT statement to apply the filter conditions. The attribute constraint is expressed in line 23. It enforces that the set of gazelles at the final time instant of converge and the set of gazelles at the initial time instant of concurrence have at least 20 gazelles in common. The spatial constraint is expressed in line 24. It enforces that the region representing converge at its final time instant and the region representing concurrence at its initial time instant intersect. Finally, the last condition, in line 25, states that the gazelles in concurrence must maintain spatial proximity to one another. That is, the area of their convex hull must never exceed 8000 m².
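To make the notion of a group that concurrently fulfills a time-dependent predicate more concrete, the following is a small, self-contained Python sketch written for this illustration (it is neither the thesis' algorithm nor Secondo code). Given, per object, the time intervals in which the predicate holds, it sweeps over the interval endpoints and reports the periods during which at least n objects fulfill the predicate simultaneously for at least a duration d. The actual gpattern operator, defined in Chapter 5, works on the sliced representation and additionally reports the time-dependent membership of each group.

def concurrence_periods(intervals_per_object, n, min_duration):
    """Report maximal periods during which at least n objects fulfill a
    time-dependent predicate simultaneously, keeping only periods that
    last at least min_duration. intervals_per_object maps an object id
    to a list of (start, end) pairs (numbers) in which the predicate
    holds for that object. Deliberately simplified: it returns only the
    periods, not the time-dependent sets of member identifiers."""
    events = []
    for intervals in intervals_per_object.values():
        for start, end in intervals:
            events.append((start, +1))   # object starts fulfilling
            events.append((end, -1))     # object stops fulfilling
    # At equal time stamps, process "stop" before "start", i.e., treat
    # fulfillment intervals as half-open [start, end).
    events.sort(key=lambda event: (event[0], event[1]))
    periods = []
    active = 0           # number of objects currently fulfilling
    period_start = None  # start of the current period with active >= n
    for time, delta in events:
        active += delta
        if active >= n and period_start is None:
            period_start = time
        elif active < n and period_start is not None:
            if time - period_start >= min_duration:
                periods.append((period_start, time))
            period_start = None
    return periods

# Example: three objects, group size 3, minimum duration 10 time units.
# concurrence_periods({1: [(0, 30)], 2: [(5, 40)], 3: [(10, 25)]}, 3, 10)
# yields [(10, 25)].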

1.2.3 Contributions

In this thesis, we propose two sets of operators: the first is for expressing individual STP queries, and the second is for expressing the group STP queries. In the first part related to individual STP queries, our contributions are the following:

• We propose query operators that are both generic and efficient for expressing individual STP queries. Our novel approach of using the time-dependent predicates as a basis for expressing patterns allows one to express patterns that involve various types of spatiotemporal operations such as range, metric, topological, set operations, aggregations, distance, direction, etc.

• The proposed approach is not limited to certain types of moving objects. It can express patterns involving moving point objects, moving region objects, moving scalars (i.e., time series), etc.

• The proposed approach allows for relative, as well as absolute temporal constraints. It is possible to express temporal constraints such as: P later Q, P between 9:00 and 13:00, or P no more than 10 minutes after Q, where P,Q are predicates.

• In contrast to previous work we are able to make individual STP queries accessible in an SQL-like syntax, and to actually integrate them into the query optimizer. Obviously for an efficient execution of pattern queries on large databases the use of indexes is mandatory. In Chapter 4 we explain how STP queries can be mapped by the query optimizer to efficient index accesses.

• We propose a simple language for describing the relationship between two time intervals (e.g., Allen’s operators [All83]). The language makes it easier, from the users’ point of view, to express interval relations without the need to memorize their names.

• We have implemented the individual STP query operators in the context of the Secondo platform [Secd], including the integration with the Secondo query optimizer. The implementation is publicly available as a Secondo Plugin and can be downloaded from the Plugins web site [Seca]¹. We have also written a user manual describing how to install and run the Plugin within a Secondo system.

• There are automatic scripts for repeating the experiments in Chapter 4. They are installed during the installation of the Plugin. The scripts, together with the well-documented source provided in the Plugin, allow the readers to explore our approach, further elaborate on it, and compare it with other approaches.

Our contributions related to group STP queries are the following:

• Group STPs are typically expressed based on some motion attributes (e.g., direction, area, distance between objects, etc.). These attributes can be the raw coordinates (e.g., find cars meeting in down town), a derivative (e.g., find cattle moving with high speed), or a second derivative (e.g., find oil spills increasing their area). Similar to our approach for individual STP queries, the proposed operators for group STP queries are generic. They are neither restricted to certain types of spatiotemporal operations, nor to certain types of moving objects.

• The proposed approach is able to express group STPs based on individually computed motion attributes, or based on the dual interactions between the moving objects. It can express simple, as well as composite group STPs.

• We have provided the theoretical foundation of integrating the group STP operators with the query optimizer, and making them accessible in SQL.

• The proposed operators are implemented in the Secondo system, and are accessible in the Secondo executable language. The implementation is available in open source.

We have also developed an optimization technique that is based on temporal reasoning, in order to increase the efficiency of both individual and group STP operators. The idea is to use the temporal constraints that are explicitly listed in the user query to infer the implicit constraints. The result of this inference is used by our evaluation algorithms for restricting the input trajectories only to their interesting parts. This leads to more efficient pattern matching. The experimental evaluation showed, however, that the gain in run-time is not significant.

1.3 Thesis Organization

The rest of this thesis is organized as follows: Chapter 2 reviews the related work. Chapter 3 reviews the moving objects data model on which this work is based.

Chapter 4 explains our approach to individual STP queries. Chapter 5 explains our approach to group STP queries. Chapter 6 proposes a temporal reasoning approach for optimizing the evaluation of the individual and the group STP operators. In Chapter 7, we demonstrate the expressive power of the proposed operators by several application examples. Finally, we conclude in Chapter 8.

¹Starting from Secondo version 3.2.0 (09/08/2011), the spatiotemporal pattern algebra is already included in the Secondo source. Starting from this version, you will need the Plugin only if you plan to repeat the experiments, because the Plugin contains the scripts for setting up and running them.

Chapter 2

Related Work

2.1 Individual Spatiotemporal Pattern Queries

This section reviews the work related to individual STP queries. A theory and a design for this kind of queries, although important, is not yet well established. Only a few proposals exist. In [dMR05] and [dMRS05], a model that relies on a discrete representation of the spatiotemporal space is presented. The spatial space is divided into a finite set of user-defined partitions, called zones, each of which is assigned a particular label. The time domain is partitioned into constant-sized intervals. A trajectory is represented as a string of labels, encoding the zones traversed by the moving object. For example, the trajectory part rzzzh represents a moving object that stayed in zone r for one time unit, moved to zone z and stayed there for three time units, then moved to zone h for one time unit. The user query is composed as a formal expression, which is then evaluated using efficient string matching techniques.

This approach lacks generality because the space and time have to be partitioned. The partitioning depends on the intended application and has to be done in advance. Moreover, only patterns that describe the change of the location can be expressed. The approach leaves out all other kinds of predicates (e.g., topological, metric comparisons, ...). Finally, this approach assumes that a moving object is located in only one zone at a given time instant. While this is acceptable for moving point objects, it is not true for other kinds of moving objects such as moving regions. A similar approach is adopted in [VBT10] and [VFMB+10]. These works build on the same idea of space-time partitioning, and propose efficient query processing techniques and indexes.

In [HKBT05], an index structure and efficient algorithms to evaluate individual STP queries that consist of spatial and neighborhood predicates are presented. The work addresses the problem of conjoint neighborhood queries (e.g., find all objects that were as close as possible to A at time T1, then were as close as possible to B at time T2). The two NN conditions in this query have to be evaluated conjointly. In other words, an object which minimizes the sum of the two distances at the two time points is the answer to this query.


Again this approach is limited to moving point objects, and to a limited set of predicates. It tightly couples the evaluation of the predicates with the evaluation of the pattern query itself. On the one hand, this allows for efficient evaluation of the pattern query. On the other hand, it is very specific to this set of predicates. In the context of database systems, a modular design that decouples the predicate evaluation from the pattern query evaluation would allow for extensibility.

The series of publications [ES99], [ES02], [Erw04], and [Sch05] provides a concrete formalism for spatiotemporal developments. A spatiotemporal development is an alternating sequence of spatiotemporal and spatial predicates, and is itself a spatiotemporal predicate. It describes the change, with respect to time, in the topological relationship between two moving objects. Consider, for example, a moving point object MP and a moving region object MR. The development MP Crosses MR is defined as:

Crosses = Disjoint meet Inside meet Disjoint

where meet is a spatial predicate that yields true when its two arguments touch each other, and Disjoint is a spatiotemporal predicate that yields true when its two arguments are spatially disjoint. The spatiotemporal predicates (denoted by being capitalized) differ from the spatial predicates in that the former hold over time intervals while the latter hold at instants¹. Spatiotemporal developments consider two spatiotemporal objects and precisely describe the change in their topological relationship.

Complex spatiotemporal developments can be defined by means of regular expressions over other spatial and spatiotemporal predicates [ES99], [ES02]. To evaluate a development for a pair of moving objects, one has to find a connected alternating sequence of time intervals and instants in which the constituting sequence of spatiotemporal and spatial predicates holds. If such a partitioning of time cannot be found, then the trajectory does not fulfill the development. The spatiotemporal developments in their definition are not equivalent to individual spatiotemporal patterns, as they can only describe the change in the topological relationship between two objects. The individual STPs, on the other hand, involve several interactions between one trajectory and many other objects in the spatiotemporal space, as well as the motion attributes of the trajectory (e.g., speed, direction, etc.).

Besides being limited to certain types of moving objects and to certain types of predicates, all the related works discussed above share two limitations. First, they do not address issues of system integration (e.g., SQL queries). Second, only sequential patterns are allowed. That is, the predicates describing the pattern occur in a temporal sequence. Temporal relationships such as during are not expressible. The approach that we propose in this thesis overcomes these limitations. Mainly, it is designed with expressiveness, system integration, and extensibility in mind.

¹Note that this definition of spatial and spatiotemporal predicates is the one given by the authors in their work. Please do not confuse it with other meanings of the same terms that we use throughout this thesis.

2.2 Group Spatiotemporal Pattern Queries

This section reviews the work related to group STP queries. In their work, Dodge et al. [DWL08] presented a systematic taxonomy of movement patterns, along with an extensive survey of this area of research. They proposed a set of dimensions to classify the movement patterns, underlining the commonalities between them. Their work suggests that a generic toolbox for pattern search can be designed based on such a classification.

Andrienko and Andrienko have several publications that focus on solutions based on visual analytics, for example [AAW07]. In particular, various visualization techniques, interactive tools, and computational methods for analyzing spatial, temporal, and spatiotemporal data are made available to enable a human analyst to discover interesting patterns in moving objects data. Among them are time-aware maps, space-time cubes, and other types of graphs and diagrams; interactive dynamic filtering of data according to their spatial, temporal, and thematic (attributive) components; and computational procedures oriented to movement data, such as clustering of trajectories [AAR+09], generalization and summarization [AA11], extracting various types of events from movement data, etc. Clearly the outcome depends on the ability of the human analyst to combine these tools to make sense of the data. It is also necessary that the size of the data be reasonably small for visualization. It is best if one could combine query processing techniques, capable of processing complex queries on large data volumes, with visual analytics techniques, to explore the spatiotemporal patterns in the data. We illustrate such an integration scheme in Chapter 7.

The algorithmic solutions to the group STP query problem tend to address very specific patterns. These works typically focus on matching a single pattern. The pattern is given a restricted definition, then an efficient pattern matching algorithm is given based on this definition. These algorithms share too little to be integrated within one system context. Moreover, there is no wide agreement on the definitions of the patterns. Table 2.1 lists some of these works. The table cites two of the different flock definitions. The first is proposed in [LKI04] and [GvKS04]. The pattern matching is done using the REMO approach, which we discuss below in detail. The second definition appears in [BGHW08]. The trajectories are represented as points in a high dimensional space, indexed by a skip quad tree. A flock query is then modeled as a count query to the skip quad tree. Several variants of the flock pattern were defined under different names, and efficient matching algorithms were proposed, e.g., the convoy pattern in [JSZ08] and the moving cluster pattern in [KMB05]. Clearly the drawback of this approach is the lack of generality.

The MoveMine program [LJL+10] implements several such algorithms for group STPs, and for trajectory clustering and classification. The application offers a GUI to set the parameters of every algorithm and to visualize the results. MoveMine did not show, however, efforts towards integrating these pattern matching algorithms in a unified computational framework.

Within the European GeoPKDD project (Geographic Privacy-aware Knowledge Discovery and Delivery) [Geo] [GNP+09], tools for mobility data

Table 2.1: Examples of the Group STPs

1  Concurrence [LIW05]: n moving point objects showing the same motion attribute value (e.g., speed, azimuth) at time t.
2  Trend-setter [LIW05]: One trend-setting moving point object anticipates the motion of m others, so that they show the same motion attribute value.
3  (m, r) Flock [LKI04], [GvKS04]: At least m moving point objects are within a circular region of radius r and they move in the same direction.
4  (m, r, d) Leadership [LKI04], [GvKS04]: At least m moving point objects are within a circular region of radius r, and they move in the same direction. At least one of them was already moving in this direction for at least d time steps.
5  (m, r) Convergence [LKI04], [GvKS04]: At least m moving point objects will pass through the same circular region of radius r, assuming they keep their direction.
6  (m, r) Encounter [LKI04], [GvKS04]: At least m moving point objects will be simultaneously inside the same circular region of radius r (assuming they keep their speed and direction).
7  (m, r, d) Convoy [JSZ08]: At least d consecutive timesteps, during which an (m, r) Density-based Cluster is defined.
8  (m, r, d) Flock [BGHW08]: At least d consecutive timesteps, such that for every timestep there is a disk of radius r that contains all the m moving point objects.

mining have been created on top of the Hermes MOD system [PTVP06], including pattern mining tools. The whole system is called the GeoPKDD system. In [ORP+08], the so-called two worlds model was introduced. It stores both the moving objects data and the discovered patterns in the database. The data is stored in a subset of the database logically marked as the data world, while the patterns are stored in another subset logically marked as the model world. The Hermes system is used for storage. On top of Hermes, some pattern mining algorithms are implemented, and made accessible through a data mining query language [Tra10] [TGN+11]. The implemented algorithms are the T-Pattern algorithm in [GNPP07], the flock algorithm [BGHW08], and two algorithms for trajectory clustering. The GeoPKDD system, similar to the MoveMine program, provides the user with a menu of pattern mining operators. While this menu is extensible by new algorithms, the user is always restricted to the very specific definition of the patterns imposed by the algorithms.

Our approach in this thesis uses the fact that many of the group STPs studied in the literature are similar. Therefore, we propose generic operators, each of them being able to express a range of patterns. The user can thus describe the pattern more freely. We will illustrate in Chapter 5 that the operators we propose are able to express all the patterns in Table 2.1, and match them in the same way, only varying the values of the parameters.

The work by Laube et al. [LIW05] proposes the REMO (RElative MOtion) model.

• The size of the REMO matrix is proportional to the database size. One would even need to maintain several matrices for several motion attributes.

• The REMO matrix does not inherently support the analysis based on the object locations (e.g., spatial proximity constraints between the moving objects). The model is extended in [LKI04] to support such analysis functions as second class citizens. That is, first the analysis is done on the REMO matrix, then the analysis part that requires object locations is done on the results. For example, according to the REMO model a flock pattern is a concurrence pattern plus spatial proxim- ity constraints. This would require that the flock members match the concurrence pattern (i.e., concurrently share similar values for the mo- tion attributes). Such a definition is clearly restrictive (e.g., cows in a flock may move around, without leaving their flock, with different speed and/or azimuth).

• The model cannot express the patterns which are described based on object interactions. The motion attributes in the REMO matrix de- scribe every object independently of the other objects. Patterns which are described based on the mutual relationships between the moving objects (e.g., north of, closer than) cannot be expressed.

• The REMO matrix handles time discretely. It does not directly support continuous trajectories. Trajectory sampling is known to incur inaccuracies in the data representation.

To our knowledge, this work by Laube et al. is the only one that proposes a generic language for group STP queries. The above analysis calls for a new approach that overcomes these problems.

Chapter 3

The Moving Objects Data Model (Review)

This chapter reviews the moving objects data model proposed in [GBE+00] and [FGNS00]. This is the model on which this work is based. The review covers only the parts of the model that are necessary for understanding the rest of this thesis. We use the Second-Order Signature (SOS) [Gu93] for the formal definitions. It is a tool for specifying data models, query processing, and optimization rules. It lets the user first define the type system (the first signature), then define polymorphic operations on the types (the second signature). A signature consists of sorts and operators, and generates a set of terms. In the first signature, the one defining the type system, sorts are called kinds, operators are called type constructors, and terms are called types. That is, type constructors operate on kinds to generate types. The second signature defines polymorphic operations over the types, defined as the terms of the first signature. In terms of database systems, the first signature defines the data model, and the second signature defines the query operations.

3.1 The Type System

The upper part of Table 3.1 shows the type system of the moving objects model in [FGNS00]. The left and middle columns display the argument and the result kinds, and the right column displays the type constructors. Kinds are sets of types. The kind DATA, for instance, contains four types corresponding to the four type constructors (int, bool, real, string). These are constant type constructors, because they accept no arguments. In this case, we denote the type in the same notation as its type constructor (e.g., the type int is generated by calling the type constructor int). A type constructor that accepts no arguments is a constant, and yields a single type (e.g., int, point). A type constructor that accepts arguments generates a set, possibly infinite, of types (e.g., range(int), ..., range(instant)). The upper part of Table 3.1 defines abstract data types (ADT) for moving object representation, as was proposed in [FGNS00]. The lower part describes a relational data model, where ATTR = DATA ∪ ... ∪ MAPPING.


Table 3.1: The type system

                      → IDENT      ident
                      → DATA       int, bool, real, string
                      → DISCRETE   int, bool, string
                      → SPATIAL    point, region, line
                      → TIME       instant
DATA ∪ TIME           → RANGE      range
DATA ∪ SPATIAL        → TEMPORAL   intime
DISCRETE              → UNIT       constunit
                      → UNIT       ureal, upoint, uline, uregion
UNIT                  → MAPPING    mapping

(ident × ATTR)+       → TUPLE      tuple
TUPLE                 → REL        rel
TUPLE                 → STREAM     stream

We restrict ourselves in this thesis to the relational model. It is however possible to introduce similar solutions for other database models (e.g., object relational, XML, etc.). Table 3.1 defines the syntax of the type system. The semantics is defined by assigning a domain to every type. The base types (e.g., int, string, etc.) have similar domains as in programming languages, except that their domains here are extended by the value undefined, e.g.:

Dreal = R ∪ {undefined},

where Dtype denotes the domain of type. This type system defines two kinds for moving objects: UNIT and MAPPING. Together, the two kinds define the so-called sliced representation of moving objects [FGNS00]. That is, the complete movement of a moving object during a certain observation period is decomposed into slices, each of which describes the movement during a smaller time interval. An object of kind UNIT represents a single slice. An object of kind MAPPING is a set of UNITs/slices, describing the complete movement. Let Dinstant denote the domain of time instants instant, isomorphic to R. Let Interval be the set of time intervals defined as:

Interval = {(t1, t2, lc, rc) | t1, t2 ∈ Dinstant, lc, rc ∈ {false, true},
            t1 ≤ t2, (t1 = t2) ⇒ (lc = rc = true)}

That is, a time interval can be left-closed and/or right-closed as indicated by the values of lc and rc respectively. It is also possible that an interval collapses into a single time instant. A type in the UNIT kind describes a pair consisting of a time interval and a temporal function. The temporal function describes the evolution of the value of the moving object during the associated time interval. For the types that have discrete domains (i.e., DISCRETE), the temporal function is a constant, and the type constructor constunit constructs their unit types. Let Dσ be the domain of a type σ ∈ DISCRETE. The domain of the corresponding unit type is:

Dconstunit(σ) = Interval × Dσ

For example, the domain Dconstunit(bool) = Interval × {false, true, undefined}. For the types that have continuous domains (e.g., real, point), the domains of their unit types are defined individually. For example, the domain of the unit point is:

Dupoint = Interval × (R² × R²)

The unit point describes a linearly moving point in the form:

(I, ((x1, y1), (x2, y2))). The position of the point at time t ∈ I is:

(x1 + (x2 − x1)(t − I.t1) / (I.t2 − I.t1), y1 + (y2 − y1)(t − I.t1) / (I.t2 − I.t1))

A type in the MAPPING kind describes the complete movement of a moving object during some observation periods. It is therefore represented as a set of UNITs.
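As a small illustration of the unit point just defined, the following C++ sketch (purely illustrative; the type and member names are assumptions of this sketch, not those of the Secondo implementation) evaluates a upoint at an instant inside its time interval using the interpolation formula above.

```cpp
#include <optional>
#include <utility>

// Illustrative sketch only; not Secondo's actual classes.
struct Interval {
    double t1, t2;   // start and end instants (isomorphic to R)
    bool lc, rc;     // left-closed / right-closed flags
    bool contains(double t) const {
        return (t > t1 || (lc && t == t1)) && (t < t2 || (rc && t == t2));
    }
};

struct UPoint {
    Interval iv;
    double x1, y1, x2, y2;   // start and end positions of the linear movement

    // Position at instant t, following the interpolation formula above.
    std::optional<std::pair<double, double>> at(double t) const {
        if (!iv.contains(t)) return std::nullopt;          // outside the unit
        double f = (iv.t2 == iv.t1) ? 0.0                  // degenerate interval
                                    : (t - iv.t1) / (iv.t2 - iv.t1);
        return std::make_pair(x1 + (x2 - x1) * f, y1 + (y2 - y1) * f);
    }
};
```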

Definition 3.1. ∀ unit in UNIT, let Dunit denote its domain. The domain of the type mapping(unit) is:

Dmapping(unit) = {U ⊂ Dunit | ∀(i1, f1) ∈ U, (i2, f2) ∈ U :
    (i)  i1 = i2 ⇒ f1 = f2
    (ii) i1 ≠ i2 ⇒ (i1 ∩ i2 = ∅) ∧
         (i1 adjacent i2 ⇒ ¬((i1, f1) mergeable (i2, f2)))}



where i1 adjacent i2 :⇔ (i1.t2 = i2.t1) ∨ (i1.t1 = i2.t2), and mergeable yields true if the two units can be merged together. It is defined according to the unit type. For constant units (constunit):

((i1, f1) mergeable (i2, f2)) = (f1 = f2).

For upoint, for instance, mergeable yields true if both units have the same direction and speed, and share one end point. This last condition ensures that a moving object has a unique representation, the one with the minimum number of units. Note that the definition allows for temporal gaps (i.e., time intervals in which the moving object is undefined). In the rest of this thesis, the types in MAPPING are denoted by a preceding m (e.g., mint, mpoint, mbool), and the types in UNIT are denoted by a preceding u (e.g., upoint, ubool). Figure 3.1 illustrates an mpoint object.

Figure 3.1: The sliced representation of an mpoint

    [2003-11-20-06:06, 2003-11-20-06:06:08.692[,            (16229.0 1252.0), (16673.0 1387.0)
    [2003-11-20-06:06:08.692, 2003-11-20-06:06:24.776[,     (16673.0 1387.0), (16266.0 1672.0)
    [2003-11-20-06:06:24.776, 2003-11-20-06:06:32.264[,     (16266.0 1672.0), (16444.0 1818.0)
    [2003-11-20-06:06:32.264, 2003-11-20-06:06:39.139],     (16444.0 1818.0), (16144.0 2227.0)

(The figure plots this unit list as a trajectory in (x, y, t) space; the plot itself is omitted here.)
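Building on the illustrative Interval and UPoint sketch from Section 3.1 (again an assumption of this text, not Secondo's actual classes), an mpoint in the sliced representation can be modelled as a sequence of units with pairwise disjoint time intervals; evaluating it at an instant amounts to locating the unit whose interval contains that instant.

```cpp
#include <vector>
#include <optional>
#include <utility>

// Reuses the Interval and UPoint sketches given earlier in this chapter.
class MPoint {
    std::vector<UPoint> units;   // kept ordered by start instant, intervals disjoint

public:
    void addUnit(const UPoint& u) { units.push_back(u); }  // caller keeps the invariant

    // Value at instant t, or nullopt if t falls into a temporal gap.
    std::optional<std::pair<double, double>> atInstant(double t) const {
        for (const auto& u : units)             // a real system would use binary search
            if (auto p = u.at(t)) return p;
        return std::nullopt;                    // undefined at t
    }
};
```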

3.2 SECONDO

A large part of this moving objects model is implemented in the Secondo system [Secd], [GAA+05], [GBA+04]. It is an extensible DBMS platform that does not presume a specific database model. It is rather open for new database model implementations. For example, it should be possible to implement relational, object-oriented, spatial, temporal, or XML models. Secondo consists of three loosely coupled modules: the kernel, the GUI and the query optimizer. The kernel includes the command manager, query processor, algebra manager and storage manager. The kernel can be extended by algebra modules. In an algebra module one can define new data types and/or new operations. The integration of the new types and/or operations into the query language is then achieved by adding syntax rules to the command manager. The Secondo kernel accepts queries in a special syntax called the Secondo executable language. The SQL-like syntax is provided by the optimizer. For more information about the Secondo modules see [Secd] and [Secc]. For more information about extending Secondo see the documentation on [Secb]. If a new data type needs a special graphical user interface (GUI) for display, the Secondo GUI module is also extensible by adding viewer modules. Several viewers exist that can display different data types. Moving objects, for example, are animated in the Hoese viewer with a time slider to navigate forwards and backwards.

3.3 The Query Language

A large number of operations are defined for the type system described above, see [FGNS00] and [GBE+00]. They fall into three classes:

1. Static operations defined on the non-temporal types (e.g., topological predicates, set operations, aggregations).

2. Spatiotemporal operations defined on the temporal types (e.g., the trajectory of an mpoint, the temporal function (snapshot) of an mregion for a given instant of time).

3. Lifted operations offered for combinations of moving and non-moving types. Basically they are time-dependent versions of the static operations (e.g., area of an mregion, addition of a real and an mint).

The Secondo executable language is a procedural language, in which one specifies step-by-step methods for achieving the results. It consists of the three classes of operations above, and the query processing operations of the relational model, usually applied in a stream processing mode. It is essentially a precisely defined notation for query plans. While other database systems represent query plans as operator trees generated by the optimizer, Secondo to our knowledge is unique in offering a complete syntax for query plans and a corresponding language level that is accessible to the user. In other words, the user can type query plans directly; these are parsed, checked for correct composition of operations, and then executed. The Secondo optimizer offers an SQL-like language like other database systems. The three classes of operations mentioned above are integrated into this level as well. The optimizer uses cost based optimization to map SQL queries into query plans of the Secondo executable language. Any plan that the optimizer generates could as well be typed directly by a user. The operators for spatiotemporal pattern queries that we describe in this thesis require extensions to a database system both at the level of query processing and at the level of query optimization. The first step is to extend query processing by new operators. The second step is to extend the optimizer to make use of the new query processing operators and to apply further optimizations. In the sequel, we will first embed our proposed language operators into the Secondo executable language. Since this language level is accessible to the user, we will explain the concepts at this level and also formulate example queries at this level, making use of Secondo query processing operations as needed. This allows us to explain our operators in terms of precisely defined query processing operations and their related algorithms, without being bothered by the additional level of complexity resulting from embedding into SQL and query optimization. After that, we explain the second step of integrating the language operations into the SQL language on top of the optimizer. Here we briefly introduce the executable language level. Let PhoneBook be a relation of the type:

rel(tuple(<(Name, string), (Phone, string)>)).

A query that finds the entries with the name Ali Mahmoud is:

query PhoneBook feed filter[.Name contains "Ali Mahmoud"] consume;

The feed operator reads a relation from disk and converts it into a memory tuple stream. The consume operator does the opposite. The filter operator evaluates a boolean expression for every input tuple, and yields only the tuples that fulfill it. For such stream processing operations usually a postfix syntax is defined, so that one can conveniently write query operators in the order in which they are applied. The signatures of these operators are:

rel(tuple) → stream(tuple)                          feed       _ #
stream(tuple) → rel(tuple)                          consume    _ #
stream(tuple) × (tuple → bool) → stream(tuple)      filter     _ #[ _ ]
string × string → bool                              contains   _ # _

where tuple is a type variable that can be instantiated with any type in TUPLE. The last column in the signature shows the operator syntax, where # denotes the operator and _ denotes an argument. We will describe more operations in the sequel.

3.4 Time-Dependent Predicates

As mentioned in Chapter 1, both individual and group STPs are expressed based on some motion attributes that can be the raw coordinates, a derivative, or a second derivative. Time-dependent predicates are able to express conditions on such attributes. They are also defined for all kinds of moving objects (e.g., mpoint, mregion, mreal). We express both individual and group STP queries on top of them, thus allowing for complex analysis on all kinds of motion attributes and all types of moving objects. Formulating the STP predicate on top of the time-dependent predicates easily leverages a considerable part of the available infrastructure. This section focuses on illustrating these predicates since they are the foundation of our work. Time-dependent predicates are all the operations that yield mbool. Practically they are equivalent to the lifted predicates introduced in [GBE+00]. Lifted predicates define time-dependent versions of static predicates. That is, a static predicate is a function with the signature

T1 × ... × Tn → bool

where Ti is a type variable that can be instantiated by any static/non-temporal data type (e.g., int, point, region). Example: BrandenburgGate inside Berlin. A lifted predicate, on the other hand, is a function with the signature

T1 × ... × Tk × mapping(Tk+1) × ... × mapping(Tn) → mbool

A lifted predicate is thus obtained by allowing one or more of the parameters of a static predicate to be of a moving type. Consequently, the return type is a time-dependent boolean mbool. Example: Train RE1206 inside Berlin. Note that inside in this example is a lifted predicate because the object Train RE1206 is an mpoint. It is therefore the lifted version of the standard inside predicate in the previous example. Figure 3.2 illustrates the lifted inside predicate. To the left, the Trip is an mpoint object that crosses the region object R between the time instants 378.3 and 397.7. The figure to the right shows the evaluation of the time-dependent predicate Trip inside R. It is a time-dependent boolean mbool having the value true between these two time instants and false otherwise. A time-dependent predicate, hence, is an operation that yields an mbool. All the static predicates defined for non-moving types can be consistently lifted into time-dependent predicates. In the sequel, we use the term time-dependent rather than lifted, because it is more general. A lifted predicate, by definition, must have a static counterpart, while a time-dependent predicate is any function yielding an mbool. One might, for example, define a function that yields a random mbool, as we do in the experiments in Chapter 4.
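To give a concrete feel for time-dependent predicates, here is a small illustrative C++ sketch (names and structure are assumptions of this sketch, not the Secondo implementation) of an mbool as a sequence of constant boolean units, together with a check whether it is ever true, which is exactly the role played by the sometimes(.) predicate used later in Section 4.4.

```cpp
#include <vector>

// Illustrative sketch of an mbool: constant boolean units over disjoint intervals.
struct BoolUnit {
    double t1, t2;   // time interval of the unit
    bool value;      // constant value during the interval
};

struct MBool {
    std::vector<BoolUnit> units;   // sliced representation, intervals disjoint

    // True if the time-dependent boolean is ever true during its lifetime.
    bool sometimes() const {
        for (const auto& u : units)
            if (u.value) return true;
        return false;
    }
};
```

A lifted predicate such as Trip inside R would produce such an object; sometimes() then tells whether the predicate holds at least once.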


Figure 3.2: Time-dependent predicates

Chapter 4

Individual Spatiotemporal Patterns

This chapter explains in detail the spatiotemporal pattern predicate that we propose for expressing individual STP queries. We start in Section 4.1 by defining the basic version of the STP predicate, and illustrate the evaluation algorithm in Section 4.2. In Section 4.3, the basic STP predicate is extended into a more expressive version. In the same section, we also illustrate other variants of the STP predicate that are found useful in real-world applications. In Section 4.4 we propose a generic approach to integrate the STP predicate with query optimizers, and propose the SQL-like syntax of the predicate. Section 4.5 is dedicated to the technical aspects of the implementation in the Secondo framework. The experimental evaluation is shown in Section 4.6, and we conclude in Section 4.7. To get the most out of this chapter, we advise the reader to download and install Secondo [Secd], and to download and install our Secondo Plugin for spatiotemporal patterns [Seca]1, which are both available as open source. Appendix A illustrates how to repeat the experiments in this chapter, using automatic scripts that we have prepared for this purpose. Finally, in Chapter 7, we demonstrate several application examples that emphasize the expressive power of the STP predicate.

1Starting from Secondo version 3.2.0 (09/08/2011), the spatiotemporal pattern algebra is already included in the Secondo source. Starting from this version, you will need the Plugin only if you plan to repeat the experiments of this chapter, because the Plugin contains the scripts for setting up and running the experiments.

Table 4.1: Individual spatiotemporal pattern operators

(The table lists the operators proposed in this chapter together with their signatures and syntax: vec, stconstraint, stpattern, stpatternex, start, end, stpatternextend, stpatternextendstream, stpatternexextend, and stpatternexextendstream. The signatures are described in the following sections.)

Table 4.1 summarizes all the operators that we propose in this chapter. The vec operator creates an object of type stvector, which is used to represent a temporal relationship. The stconstraint operator is used to express a temporal constraint within the STP predicate. The operator stpattern is our simple version of the STP predicate. The extended STP predicate that will be explained in Section 4.3 is the stpatternex operator. The remaining operators are variants of the stpattern and stpatternex operators. They are stream operators, rather than predicates. They extend the tuples with the time periods during which the STP is fulfilled. Tuples that do not fulfill the pattern are extended with nulls. The start and end operators are used within the stpatternex predicate and its variants to express absolute time conditions. The following sections describe these operators formally.

4.1 The Spatiotemporal Pattern Predicate

Now we start the formal definition of the STP predicate. First, we define a language for temporal relationships between pairs of time intervals. It will be the base of the temporal constraints within the STP predicate. The traditional way adopted in the literature is to assign names to such relationships (e.g., before, after, etc., as in [All83]). Here we alternatively propose a language instead of names. This is because in our case 26 such relationships are possible, which makes it difficult for a user to memorize the names. Table 4.2 shows the 26 terms of this language, and a graphical illustration of each. In the terms, the letters aa denote the begin and end time instants of the first interval. Similarly bb are the begin and end of the second interval. The order of the letters describes the temporal relationship, that is, a sequence ab means a < b. The dot symbol denotes the equality constraint, hence, the sequence a.b means a = b, and a.a means that the start and the end of the first interval are the same (i.e., the interval degenerates into a time instant). Formally, let IR be the set of interval relationships of Table 4.2, that is

IR = {aabb, abba, ..., a.a.b.b}

Let i1, i2 ∈ Interval, ir = s1s2...sk ∈ IR (note that 4 ≤ k ≤ 7, that is, the shortest term includes two a's and two b's, and the longest term includes additionally three dots). Let

    rep(si) = i1.t1   if si is the first a in ir
              i1.t2   if si is the second a in ir
              i2.t1   if si is the first b in ir
              i2.t2   if si is the second b in ir
              .       if si = .

i1 and i2 fulfill s1s2...sk :⇔ ∀j ∈ {1, ..., k − 1} :

    (i)  sj ≠ . ≠ sj+1 ⇒ rep(sj) < rep(sj+1)
    (ii) sj+1 = . ⇒ rep(sj) = rep(sj+2)

Table 4.2: A language for expressing interval relationships (the graphical illustrations of the original table are omitted here)

Both arguments are intervals (Allen's operators):
    aabb, abba, bbaa, a.bab, aa.bb, a.bba, bb.aa, baa.b, abab, aba.b, baba, a.ba.b, baab
The first argument is an instant:
    a.abb, bb.a.a, a.a.bb, bba.a, ba.ab
The second argument is an instant:
    b.baa, aa.b.b, b.b.aa, aab.b, ab.ba
Both arguments are instants:
    a.ab.b, b.ba.a, a.a.b.b

Two time intervals i1, i2 ∈ Interval fulfill a set of interval relationships if they fulfill any of them, that is:

i1 and i2 fulfill SI ⊆ IR :⇔ ∃ ir ∈ SI : i1 and i2 fulfill ir
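The following C++ sketch (an illustration for this text, not the Secondo code) shows how a single term of this language can be checked against two intervals, following the rep(.) mapping and conditions (i) and (ii) above, and how a set SI of terms is fulfilled if any of its members is.

```cpp
#include <string>
#include <vector>
#include <stdexcept>

struct Interval { double t1, t2; };   // begin and end instants, t1 <= t2

// Maps a non-dot symbol of a term to the instant it represents (the rep function).
static double rep(char s, bool second, const Interval& i1, const Interval& i2) {
    if (s == 'a') return second ? i1.t2 : i1.t1;
    if (s == 'b') return second ? i2.t2 : i2.t1;
    throw std::invalid_argument("term symbol must be 'a', 'b' or '.'");
}

// Checks whether i1 and i2 fulfill a single term such as "aabb" or "a.bab".
bool fulfillsTerm(const Interval& i1, const Interval& i2, const std::string& term) {
    bool secondA = false, secondB = false;
    double prev = 0.0;
    bool havePrev = false, equalNext = false;
    for (char s : term) {
        if (s == '.') { equalNext = true; continue; }      // next instant must equal prev
        bool& second = (s == 'a') ? secondA : secondB;
        double cur = rep(s, second, i1, i2);
        second = true;                                     // the next a/b is the second one
        if (havePrev) {
            if (equalNext ? (cur != prev) : (cur <= prev)) return false;
        }
        prev = cur; havePrev = true; equalNext = false;
    }
    return true;
}

// A set of terms (an SI set) is fulfilled if any of its terms is fulfilled.
bool fulfillsAny(const Interval& i1, const Interval& i2, const std::vector<std::string>& si) {
    for (const auto& term : si)
        if (fulfillsTerm(i1, i2, term)) return true;
    return false;
}
```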

The vec(.) operator in Table 4.1 allows for composing such SI subsets as objects of type stvector. An stvector is a bitvector, where every temporal relationship in Table 4.2 corresponds to one bit. It is a compact representation of SI sets. For syntactic elegance in both the SQL-like and the Secondo executable language, one can assign names to them using the let command, e.g.:

let then = vec("abab", "aa.bb", "aabb"); let later = vec("aabb");

For ease of presentation, in the following we define the STP predicate within the relational data model. The definitions can however be adapted to fit within other database models (e.g., object oriented), thanks to the ADT modeling of the moving objects, which does not depend on a particular database model. Let tuple be a type variable that can be instantiated by any valid tuple type in the sense of the relational data model. Let Dtuple denote the domain

of the tuples conforming to this type. Let Dmbool denote the domain of the type mbool. A time-dependent predicate is a function with the signature

    tuple → mbool

hence it is a function

f : Dtuple → Dmbool

We denote a predicate with this signature as ptuple, and a set of such predicates as Ptuple when the tuple type is relevant. Let Ptuple = {p1, ..., pn} be a set of time-dependent predicates. A temporal constraint on Ptuple is an element of the set:

TC(Ptuple) = {1..n} × {1..n} × P(IR)

Hence it is a binary constraint that assigns to a pair of predicates in Ptuple a set of interval relationships. The operator stconstraint in Table 4.1 expresses a temporal constraint. It accepts three arguments: two aliases of time-dependent predicates, and a set of interval relationships composed by the vec(.) operator. Based on the above definitions, a spatiotemporal pattern predicate is defined as follows:

Definition 4.1. A spatiotemporal pattern predicate (STP predicate) is a pair (Ptuple, C), where C ⊆ TC(Ptuple). □

The operator stpattern in Table 4.1 denotes the STP predicate. It accepts a tuple, a non-empty set of time-dependent predicates mapping this tuple into mbools (Ptuple), and a non-empty set of temporal constraints C, and it yields a bool. For an STP predicate to hold, all the temporal constraints in C must be fulfilled. Formally it is as follows: Let t ∈ Dtuple be a tuple and Ptuple = {p1, ..., pn}. We denote by pk(t) the evaluation of pk ∈ Ptuple on t. Hence pk(t) ∈ Dmbool. We also define the set of candidate assignments CA(Ptuple, t) as:

CA(Ptuple, t) = p1(t)^true × ... × pn(t)^true

where pk(t)^true = {i | (i, true) ∈ pk(t)}. In other words, pk(t)^true is the set of time intervals during which pk is fulfilled with respect to the tuple t. That is, CA(Ptuple, t) is simply the Cartesian product of the sets of time intervals during which the time-dependent predicates in Ptuple are fulfilled with respect to the tuple t. Let ca = (i1, ..., in) ∈ CA(Ptuple, t) and let c = (j, k, SI) ∈ TC(Ptuple), where 1 ≤ j, k ≤ n, be a temporal constraint

ca fulfills c :⇔ ij and ik fulfill SI

Let C ⊆ TC(Ptuple) be a set of temporal constraints, and let t ∈ Dtuple be a tuple. The set of supported assignments of C is defined as:

SA(Ptuple, C, t) = {ca ∈ CA(Ptuple, t) | ∀ c ∈ C : ca fulfills c}

That is, for a candidate assignment to be a supported assignment, it must fulfill all the constraints in C. An STP predicate is fulfilled for a given tuple if and only if such a supported assignment is found.

Definition 4.2. A spatiotemporal pattern predicate is a function with the signature tuple → bool. Given a tuple t of type tuple, its evaluation is defined as:

eval((Ptuple, C), t) = (SA(Ptuple, C, t) ≠ ∅) □

Here we display again the bank robbers query of Section 1.2.1 for illustration:

SELECT c.licencenumber
FROM cars c, landmark l
WHERE l.type = "gas station" and
  pattern([ c.trip inside l.region AS gas,
            distance(c.trip, bank) < 50.0 AS bnk,
            speed(c.trip) > 80000 AS leaving ],
          [ stconstraint(gas, bnk, vec("aabb")),
            stconstraint(bnk, leaving, vec("abab", "aa.bb", "aabb")) ])

Figure 4.1 illustrates one tuple that fulfills this pattern. It shows the three mbool results of the time-dependent predicates. In this example there are four candidate assignments. Only one of them is supported, the hatched one.

Figure 4.1: Evaluating the STP predicate
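To make Definitions 4.1 and 4.2 concrete, the following C++ sketch (purely illustrative; the data structures are assumptions of this sketch, and fulfillsAny is taken from the interval-relationship sketch in Section 4.1) evaluates an STP predicate naively: it enumerates the candidate assignments as the Cartesian product of the true-interval sets and accepts as soon as one of them fulfills every temporal constraint.

```cpp
#include <vector>
#include <string>
#include <functional>

struct Interval { double t1, t2; };

// A temporal constraint (j, k, SI) with 0-based predicate indices: the intervals
// assigned to predicates j and k must fulfill at least one term of SI.
struct Constraint {
    int j, k;
    std::vector<std::string> si;
};

// Assumed to be available from the interval-relationship sketch given earlier.
bool fulfillsAny(const Interval&, const Interval&, const std::vector<std::string>&);

// trueIntervals[k] plays the role of pk(t)^true, the intervals where predicate k holds.
// Returns true iff SA(Ptuple, C, t) is non-empty (Definition 4.2).
bool evalSTP(const std::vector<std::vector<Interval>>& trueIntervals,
             const std::vector<Constraint>& constraints) {
    const size_t n = trueIntervals.size();
    std::vector<Interval> ca(n);
    std::function<bool(size_t)> enumerate = [&](size_t k) -> bool {
        if (k == n) {                                  // a complete candidate assignment
            for (const auto& c : constraints)
                if (!fulfillsAny(ca[c.j], ca[c.k], c.si)) return false;
            return true;                               // it is a supported assignment
        }
        for (const auto& iv : trueIntervals[k]) {      // enumerate the Cartesian product
            ca[k] = iv;
            if (enumerate(k + 1)) return true;
        }
        return false;
    };
    return enumerate(0);
}
```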

4.2 Evaluating The Spatiotemporal Pattern Predicate

The formalism of the STP predicate in the previous section maps naturally to the well-known Constraint Satisfaction Problem (CSP). This section illustrates this mapping and the algorithms used to evaluate the STP predicate.

Definition 4.3. Formally, a constraint satisfaction problem is defined as a triple ⟨X, D, C⟩, where X is a set of variables, D is a set of initial domains, and C is a set of constraints. Each variable xi ∈ X has a non-empty domain

di ∈ D. Each constraint involves a subset of variables and specifies the allowable combinations of values for this subset. An assignment for a subset of variables is supported if it satisfies all constraints. A solution to the CSP is in turn a supported assignment of all variables. CSP algorithms iteratively remove values from the domains once they discover that these values cannot be part of a solution. □

Given a tuple t and an STP predicate (Ptuple, C), we construct a corresponding constraint satisfaction problem having as many variables as the number of time-dependent predicates in Ptuple. The domain of a variable xi ∈ X that maps to the time-dependent predicate pi ∈ Ptuple is di = pi(t)^true. That is, di is the set of time intervals during which pi is fulfilled with respect to the tuple t. The set of constraints C is the same in both the CSP and the STP predicate. Right now, one would simply apply any of the known CSP algorithms to evaluate the STP predicate with respect to t. Unfortunately, this straightforward mapping leads to an inefficient evaluation of the STP predicate. It requires that the domains of the variables be known in advance. In other words, it requires that all the time-dependent predicates in Ptuple are evaluated first. But what if we find after evaluating the first time-dependent predicate with respect to t that it is never fulfilled? More complex, what if we find after evaluating a subset of Ptuple that some constraints cannot be fulfilled? In such cases it does not make sense to evaluate the remaining predicates in Ptuple. That is, regardless of their results, the STP predicate cannot be fulfilled. Therefore we propose an algorithm that solves the CSP incrementally. It does not require that all domains are known in advance. Rather, it computes the variable domains as required (i.e., by triggering the evaluation of the corresponding time-dependent predicates). In the following we give a brief review of existing CSP algorithms, then illustrate our algorithm.

A CSP having only binary constraints is called a binary CSP and can be represented graphically as a constraint graph. The nodes of the graph are the variables and the edges are the binary constraints. Two nodes are connected with an edge if they share a constraint. CSPs are typically solved using variants of the traditional backtracking algorithm. The algorithm is a depth-first tree search that starts with an empty list of assigned variables and recursively tries to find a solution. In every call, backtracking adds a new variable to its list and tries all the possible assignments. If an assignment is supported, a new recursive call is made. Otherwise the algorithm backtracks to the last assigned variable. The algorithm runs in exponential time and space. Constraint propagation methods [Bes06] (also called local consistency methods) can reduce the domains before backtracking to improve the performance. Examples are the Arc Consistency and the Neighborhood Inverse Consistency (NIC) algorithms. They detect and remove some values from the variable domains that cannot be part of a solution. Local consistency algorithms do not guarantee backtrack-free search. To have the nice property of backtrack-free search one would need to enforce n-consistency (equivalent to global consistency), which is again exponential in time and space.

Algorithm 1: MatchPattern
input : variables, constraints
result: whether the CSP is consistent or not

1 Clear SA, Agenda and ConstraintGraph;
2 Add all variables to Agenda;
3 Add all constraints to the ConstraintGraph;
4 while Agenda is not empty do
5     Pick a variable Xi from the Agenda;
6     Compute the variable domain Di;  // (i.e., evaluate the corresponding time-dependent predicate)
7     Extend SA with Di;
8     if SA is empty then return Inconsistent;
9 return Consistent;

Our proposed algorithm MatchPattern (Algorithm 1) tries to solve the sub-CSP of k − 1 variables (CSPk−1) first and then to extend it to CSPk. Therefore, an early stop is possible if a solution to the CSPk−1 cannot be found. This means that, in case no solution is found, the evaluation is stopped as soon as this is realized, without the unnecessary evaluation of the remaining time-dependent predicates. The MatchPattern algorithm uses three data structures, which are made globally accessible for all its helper functions: the SA list (for Supported Assignments), the Agenda and the ConstraintGraph. The Agenda keeps a list of variables that are not yet consumed by the algorithm. One variable from the Agenda is consumed in every iteration. Consuming a variable is equivalent to computing its domain (i.e., evaluating the corresponding time-dependent predicate). Every supported assignment in the SA list is a solution for the sub-CSP consisting of the variables that have been consumed so far. In iteration k there are k − 1 previously consumed variables and one newly consumed variable (Xk with domain Dk). Every entry in SA at this iteration is a solution for the CSPk−1. To extend the SA, the Cartesian product of SA and Dk is computed. Then only the entries that constitute a solution for CSPk are kept in SA. CSPk is constructed using the consumed variables and their corresponding constraints in the constraint graph.

The methodology of picking the variables from the Agenda has a big effect on the run-time. The best method will choose the variables so that inconsistencies can be detected early. For example, suppose an STP predicate having four predicates with aliases u, v, w, and x. The constraints are: stconstraint(u, x, vec(aabb)), stconstraint(v, x, vec(aabb)), and stconstraint(w, x, vec(aabb)). If the variables are picked in the order u, v, w, then x, the space and time costs are the maximum. Since u, v, and w are not connected by any constraints, the SA is populated by the Cartesian product of their domains in the first three iterations. The actual filtering of SA starts in the fourth iteration after x is picked. The heuristic is thus to pick first the variables which lead to the evaluation of the largest number of constraints.

Algorithm 2: Extend

input : i, Di; the index and the domain of the newly consumed variable

1 if SA is empty then
2     foreach I: Interval in Di do
3         insert a new row sa in SA having sa[i] = I and undefined for all other variables;
4 else
5     SA = the Cartesian product SA × Di;
6     foreach sa in SA do
7         if sa is not supported then remove sa from SA;

The function which picks the variables from the Agenda (Line 5 in Algorithm 1) chooses the variables according to their degree in the ConstraintGraph. An edge that connects the variable/node to another variable in the Agenda is counted as 0.5, while an edge that connects the variable to a non-Agenda variable is counted as 1. This is done in the sense that consuming a variable contributes 50% to evaluating the constraint if the other variable is still in the Agenda (i.e., not consumed). If the other variable is a non-Agenda variable, we count it as 1 because the constraint is ready for evaluation. Back to the example: in the first iteration, all the variables u, v, and w have a degree of 0.5, whereas x has a degree of 1.5. Therefore, x is picked in the first iteration. In the second iteration u, v, and w have a degree of 1, so the algorithm picks any of them, and so on. This variable picking methodology tries to maximize the number of evaluated constraints in every iteration, in the hope that they filter the SA list and detect inconsistencies as soon as possible. The time cost of the MatchPattern algorithm is

    Σ_{i=1..n} ( Π_{k=1..i} d_k ) · e_i

where n is the number of variables, d_k is the number of values in the domain of the k-th variable, and e_i is the number of constraints in CSP_i. The storage cost is

    Σ_{i=1..n} Π_{k=1..i} d_k

The algorithm runs in O(e · d^n) time and takes O(d^n) space. We are not worried about the exponential time and space costs. That is, in normal situations an individual STP query would be expressed using few time-dependent predicates (i.e., a small value of n). In the real-world applications that we illustrate in Chapter 7, we needed no more than five predicates to express any of the patterns. In the experimental evaluation in Section 4.6.1, we show that the overhead of matching the pattern (Algorithm 1) is still negligible with 8 time-dependent predicates.

The MatchPattern algorithm achieves its efficiency by avoiding the unnecessary evaluation of time-dependent predicates in the case the input tuple does not fulfill the pattern. The cost of evaluating such predicates varies, but it is expected to be expensive because usually they require retrieving and processing the complete trajectory of the moving object. The complexity of several time-dependent predicates is analyzed in [CLFG+03].
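The following C++ sketch mirrors the incremental idea of Algorithms 1 and 2 under simplifying assumptions (it is not the Secondo implementation, it reuses fulfillsAny from the earlier sketch, and it omits the degree-based variable-picking heuristic by consuming variables in the given order): domains are computed lazily, the supported-assignment list is extended one variable at a time, and evaluation stops as soon as it becomes empty.

```cpp
#include <vector>
#include <string>
#include <functional>

struct Interval { double t1, t2; };

// Binary temporal constraint (j, k, SI) with 0-based predicate indices.
struct Constraint {
    int j, k;
    std::vector<std::string> si;
};

// Assumed to be available from the interval-relationship sketch given earlier.
bool fulfillsAny(const Interval&, const Interval&, const std::vector<std::string>&);

// predicates[i]() lazily evaluates the i-th time-dependent predicate and returns its
// true-intervals; it is called only when variable i is consumed (Line 6 of Algorithm 1).
bool matchPattern(const std::vector<std::function<std::vector<Interval>()>>& predicates,
                  const std::vector<Constraint>& constraints) {
    const int n = static_cast<int>(predicates.size());
    std::vector<std::vector<Interval>> sa;        // partial supported assignments
    std::vector<bool> consumed(n, false);

    for (int i = 0; i < n; ++i) {                 // consume variables (heuristic omitted)
        const std::vector<Interval> di = predicates[i]();      // compute the domain lazily
        const std::vector<std::vector<Interval>> base =
            (i == 0) ? std::vector<std::vector<Interval>>{{}} : sa;
        std::vector<std::vector<Interval>> extended;
        for (const auto& partial : base) {
            for (const auto& iv : di) {           // Cartesian product SA x Di (Extend)
                std::vector<Interval> cand = partial;
                cand.resize(n);                   // unconsumed slots stay undefined
                cand[i] = iv;
                bool supported = true;            // check only constraints whose two
                for (const auto& c : constraints) // variables have been consumed
                    if ((c.j == i && consumed[c.k]) || (c.k == i && consumed[c.j]))
                        supported = supported && fulfillsAny(cand[c.j], cand[c.k], c.si);
                if (supported) extended.push_back(std::move(cand));
            }
        }
        sa = std::move(extended);
        consumed[i] = true;
        if (sa.empty()) return false;             // early stop: pattern cannot be fulfilled
    }
    return true;                                  // at least one supported assignment exists
}
```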

4.3 Extending The Definition Of The STP Predicate

Back to the example of the bank robbers, a sharp-eyed reader will notice that the provided SQL query might yield undesired tuples. Suppose that long enough trajectories are kept in the database. A car that entered a gas station on one day, passed close to the bank on the next day, and oversped on a third day will be part of the result. To avoid this, we would like to constrain the period between leaving the gas station and overspeeding to be at most 1 hour. So far the design of the STP predicate that we have explained does not support such quantitative constraints. It supports only qualitative constraints in the form of before, after, during, etc. Hence, an extension is required. Indeed the proposed design of the STP predicate is flexible, so that such an extension is easy to integrate. The idea is that after the STP predicate is evaluated, the SA data structure contains all the supported assignments. As illustrated before, a supported assignment assigns to every time-dependent predicate an interval during which it is fulfilled, and this set of intervals fulfills all the temporal constraints. Now that we know the time intervals, we can impose more constraints on them. For example, we state that the period between leaving the gas station (first predicate) and overspeeding (third predicate) must be below 1 hour. In the following we formally describe an extended version of the STP predicate that allows for such additional constraints. Let Ptuple = {p1, ..., pn} be a set of time-dependent predicates, and let C ⊆ TC(Ptuple) be a set of temporal constraints. Let g be a function:

g : (Interval)^n × Dtuple → Dbool

That is, g is a predicate that accepts a set of n time intervals and a tuple, and yields a bool. An extended STP predicate is defined as follows:

Definition 4.4. An extended spatiotemporal pattern predicate is a triple (Ptuple, C, g). Given a tuple t of type tuple its evaluation is defined as:

eval((Ptuple, C, g), t) = ({sa ∈ SA(Ptuple, C, t) | g(sa, t) = true} ≠ ∅)

□

That is, the boolean predicate g is applied to the supported assignments in SA and to the input tuple t. For the extended STP predicate to be fulfilled, g must be fulfilled at least once. The evaluation of the extended STP predicate is thus done in two parts, both of which must succeed. The first solves the STP predicate (Ptuple, C) for the given tuple t, and the second part, which is processed only after the success of the first part, evaluates the boolean predicate g for every supported assignment. Hence, conditions on the list of supported assignments SA are possible. Syntactically, the user is provided with two functions start(.) and end(.) that yield the start and end time instants of the intervals in an SA element. The two functions are in the form:

f : (Interval)^n × {1, ..., n} → Instant

Given a supported assignment sa ∈ SA and an index, the two functions yield the start and the end time instants of the time interval at this index in sa. Formally let sa = {i1, ..., in} ∈ SA(Ptuple, C, t).

start(sa, k) = ik.t1, and

end(sa, k) = ik.t2

where 1 ≤ k ≤ n. To implement this extension, Line 9 of the MatchPattern algorithm (Algorithm 1) is changed into return SA;. The predicate g is then iteratively evaluated for the elements of SA. The extended STP predicate is denoted stpatternex in the Secondo executable language, and patternex in the SQL-like syntax. The bank robbers query is rewritten using it as follows:

let ONE_HOUR = create_duration(0, 3600000);
let later = vec("aabb");
let then = vec("abab", "aa.bb", "aabb");

SELECT c.licencenumber
FROM cars c, landmark l
WHERE l.type = "gas station" and
  patternex([c.trip inside l.region as gas,
             distance(c.trip, bank) < 50.0 as bnk,
             speed(c.trip) > 100000 as leaving],
            [stconstraint(gas, bnk, later),
             stconstraint(bnk, leaving, then)],
            start(leaving) - end(gas) < ONE_HOUR)

where the additional condition start(leaving) - end(gas) < ONE_HOUR ensures that the time period between the car getting out of the gas station and it starting to leave the bank area is less than one hour. Note that the start and end operators accept the predicate aliases, rather than their indexes as in the definition. More complex conditions can be expressed. The time intervals can be used, for example, to retrieve parts of the moving object trajectory to express additional spatial conditions. For example, the query for possible bank robbers might more specifically look for the cars which entered a gas

station, made a round or more surrounding the bank, then drove away fast. To check that the car made a round surrounding the bank, a possible solution is to check the part of the car trajectory close to the bank for self intersection. The query is then written as follows:

SELECT c.licencenumber
FROM cars c, landmark l
WHERE l.type = "gas station" and
  patternex([c.trip inside l.region as gas,
             distance(c.trip, bank) < 50.0 as bnk,
             speed(c.trip) > 100000 as leaving],
            [stconstraint(gas, bnk, later),
             stconstraint(bnk, leaving, then)],
            isSelfIntersecting(c.trip, start(bnk), end(bnk)) and
            (start(leaving) - end(bnk)) < ONE_HOUR)

where we assume that isSelfIntersecting is a user-defined function² that checks a part of an mpoint object for self intersection.

In real-world applications, we found that it is often required to know the parts of the moving object's movement that fulfill the pattern. In the bank robbers example, these would be the three time intervals during which the three time-dependent predicates are fulfilled. Formally this is the first sa entry that fulfills the temporal constraints. For this reason, we introduce the two operators stpatternextend and stpatternexextend in Table 4.1. Moreover, one might be interested in retrieving all such sa instances, rather than retrieving only the first. This is implemented in the two operators stpatternextendstream and stpatternexextendstream. We illustrate the usage of these operators in Chapter 7.
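As a small illustration of how the extended predicate can be evaluated once the supported assignments are known (a sketch under the same assumptions as the earlier ones, not the actual stpatternex code), the additional predicate g is simply tested against every supported assignment, with start and end reading the interval bounds:

```cpp
#include <vector>
#include <functional>

struct Interval { double t1, t2; };

// A supported assignment: one interval per time-dependent predicate.
using SupportedAssignment = std::vector<Interval>;

double start(const SupportedAssignment& sa, int k) { return sa[k].t1; }
double end(const SupportedAssignment& sa, int k)   { return sa[k].t2; }

// Extended STP predicate: fulfilled if g holds for at least one supported assignment.
bool evalExtended(const std::vector<SupportedAssignment>& sa,
                  const std::function<bool(const SupportedAssignment&)>& g) {
    for (const auto& a : sa)
        if (g(a)) return true;
    return false;
}

// Example of an additional condition in the spirit of the bank robbers query:
// the gap between the end of the first interval and the start of the third one
// must stay below one hour (the indices and the time unit are assumptions).
bool withinOneHour(const SupportedAssignment& a) {
    const double ONE_HOUR = 3600.0;                // seconds, for this sketch
    return start(a, 2) - end(a, 0) < ONE_HOUR;
}
```

Calling evalExtended(sa, withinOneHour) then mirrors the two-part evaluation described above.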

4.4 Optimizing The STP Predicate

In this section we show how to integrate the STP predicate with the query optimizer. First, we make it available in SQL, then we extend the optimizer with rules for suggesting efficient query plans that invoke indexes when possible. We do not assume a specific optimizer or optimization technique. The optimizer is however required to have some basic features, as we discuss in the following subsection.

4.4.1 Basic Assumptions

A typical query optimizer contains two basic modules: the re-writer and the planner [Ioa96]. The re-writer uses some heuristics to transform a query into another equivalent query that is, hopefully, more efficient or easier to handle in further optimization phases. The planner creates for the user query (or the re-written version) the set of possible execution plans, possibly restricted to some classes of plans. Finally it applies a selection methodology (e.g., cost based) to select the best plan. We assume that the query optimizer contains both the re-writer and the planner modules. We also assume that it supports the data types and operations on moving objects in [GBE+00] and [FGNS00].

²In Secondo, users can define functions and store them as function objects in the database. These functions can be used in user queries. This is much like stored procedures in ORACLE, for example.

4.4.2 Query Optimization For The STP Predicate

One observation that we wish to make clear is that the STP predicate itself does not process database objects directly. The first step in the processing is the evaluation of the time-dependent predicates that compose the STP predicate. If at least one of these time-dependent predicates is never fulfilled by the tuple, the STP predicate cannot be fulfilled. The idea, hence, is to design a generic framework for optimizing the time-dependent predicates. The optimizer should be able to invoke available indexes to filter out tuples that have no chance of fulfilling a given time-dependent predicate. Normally, query optimizers are able to do so for some kinds of standard predicates (e.g., a query optimizer would invoke an R-tree index for optimizing spatial range predicates). To our knowledge, this has never been investigated for time-dependent predicates. We propose a generic framework that utilizes the common index structures, and the available optimization framework for standard predicates, in order to optimize time-dependent predicates. The idea is to add each of the time-dependent predicates, in a modified form, as an extra standard predicate to the query, that is, a predicate returning a boolean value. The standard predicate is chosen according to the time-dependent predicate, so that the fulfillment of the standard predicate implies that the time-dependent predicate is fulfilled at least once. This is done during query rewriting. The additional standard predicates in the rewritten query trigger the planner to use the available indexes. To illustrate the idea, the following query shows how the bank robbers query is supposed to be re-written by the query optimizer:

SELECT c.licencenumber
FROM cars c, landmark l
WHERE l.type = "gas station" and
  pattern([c.trip inside l.region as gas,
           distance(c.trip, bank) < 50.0 as bnk,
           speed(c.trip) > 100000 as leaving],
          [stconstraint(gas, bnk, later),
           stconstraint(bnk, leaving, then)]) and
  c.trip passes l.region and
  sometimes(distance(c.trip, bank) < 50.0) and
  sometimes(speed(c.trip) > 100000)

The three time-dependent predicates:
- c.trip inside l.region,
- distance(c.trip, bank) < 50.0, and

- speed(c.trip) > 100000
are mapped into the three standard predicates:
- c.trip passes l.region,
- sometimes(distance(c.trip, bank) < 50.0), and
- sometimes(speed(c.trip) > 100000)
respectively. The predicate sometimes(.) accepts an mbool and yields true if the argument is ever true during its lifetime, otherwise false. Each of the standard predicates ensures that the corresponding time-dependent predicate is fulfilled at least once, a necessary but not sufficient condition for the STP predicate to be fulfilled. Clearly, the rewritten query is equivalent to the original query. The choice of the standard predicate depends on the type of the time-dependent predicate and the types of the arguments. For example, the lifted spatial range predicates (i.e., those whose spatial projection can be described by a box) are mapped into the passes standard predicate. In this example, it is fulfilled if the car c.trip ever passed the gas station l.region. If passes fails, then we know that inside will never be true and that pattern will also fail. The planner is expected to already have some optimization rule available for the added passes predicate (e.g., invoking a spatial R-tree index if available).

To generalize this solution, we define a table of mappings between the time-dependent predicates (or groups of them) and the standard predicates. Clearly, this mapping is extensible for the time-dependent predicates that may be introduced in the future. The mapping for the set of time-dependent predicates proposed in [GBE+00] is shown in Table 4.3. The lifted spatial range predicates map into passes. The lifted predicate distance(x, y) < z is conceptually equivalent to a lifted spatial range predicate, where the spatial range is the bounding box of the static argument extended by z on every side. The lifted comparisons of mint, mbool, mstring, and mreal are classified, similar to the standard comparisons, into equality, range, left range, and right range. All of them are mapped into sometimes(.). Other types of time-dependent predicates are mapped into sometimes(.) as well. Note that we can alternatively rewrite all time-dependent predicates into sometimes(.), and provide translation rules accordingly. This gives us the extra benefit of optimizing the sometimes(.) predicate, which may also appear directly in user queries. Actually we adopt this approach in our implementation in Secondo. Now we need to extend the planner with translation rules that translate sometimes(.) into index lookups in the query plan. For every type of time-dependent predicate, one such translation rule is required. For example, sometimes(Pred), where Pred is a lifted left range predicate, searches for a B-tree defined on the units of the moving object, and performs a left range search in the B-tree. We discuss in detail the implementation of this extension to the Secondo optimizer in Section 4.5.2. After this mapping, the optimizer should be able to generate a plan like the following for the bank robbers query:

Table 4.3: Mapping time-dependent predicates into standard predicates.

Lifted spatial predicates → σ passes α:
    σ = α          : mpoint × point → mbool, mregion × region → mbool
    σ inside α     : mpoint × region → mbool, mpoint × points → mbool, mpoint × line → mbool,
                     mregion × region → mbool, mregion × points → mbool, mregion × line → mbool
    σ intersects α : mregion × points → mbool, mregion × region → mbool, mregion × line → mbool

Lifted equality → sometimes(σ = α):
    σ = α          : mint × int → mbool, mbool × bool → mbool, mstring × string → mbool, mreal × real → mbool

Lifted left range → sometimes(σ <= α), sometimes(σ < α):
    σ <= α, σ < α  : mint × int → mbool, mbool × bool → mbool, mstring × string → mbool, mreal × real → mbool

Lifted right range → sometimes(σ >= α), sometimes(σ > α):
    σ >= α, σ > α  : mint × int → mbool, mbool × bool → mbool, mstring × string → mbool, mreal × real → mbool

Lifted spatial range → σ passes enlargeRect(bbox(α), threshold, threshold):
    distance(σ, α) < threshold : mpoint × region → mreal, mpoint × point → mreal,
                                 mregion × point → mreal, mregion × region → mreal

Other time-dependent predicates P → sometimes(P)

query cars_trip_sptuni windowintersectsS[enlargeRect(bbox(bank), 50.0, 50.0)]
  sort rdup cars gettuples
  filter[sometimes(distance(.trip, bank) < 50.0)] {c}
  landmark feed {l}
  filter[.type_l = "gas station"]
  product
  filter[. stpattern[gas: .trip_c inside .region_l,
                     bnk: distance(.trip_c, bank) < 50.0,
                     leaving: speed(.trip_c) > 100000;
                     stconstraint("gas", "bnk", vec("aabb")),
                     stconstraint("bnk", "leaving", vec("abab", "aa.bb", "aabb"))]]
  consume;

This query plan differs from the one in Section 1.2.1 in that the straightforward cars feed c is replaced by the first four lines in this plan. These four lines invoke the spatial R-tree index cars_trip_sptuni, which we assume already exists in the database, and perform a window query on it with a window centered at bank and extended 50 meters in every direction. Only the cars that intersect this window are retrieved. Note that only these cars have the opportunity to fulfill the standard predicate sometimes(distance(c.trip, bank) < 50.0) added by the re-writer, and hence, only these cars have the opportunity to fulfill the time-dependent predicate distance(c.trip, bank) < 50.0 in the original query. Invoking this index in the plan was triggered by the predicate sometimes(distance(c.trip, bank) < 50.0) that was added during the re-writing.
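The bounding-box reasoning behind this pre-filtering can be illustrated with a small standalone C++ sketch (the Rect type and the function names are assumptions of this sketch, not Secondo's API): a trajectory whose bounding box does not intersect the query window enlarged by the distance threshold can never satisfy distance(trip, bank) < threshold, so it can safely be filtered out early.

```cpp
// Illustrative 2D bounding-box pre-filter, mirroring the idea of
// "σ passes enlargeRect(bbox(α), threshold, threshold)" from Table 4.3.
struct Rect {
    double minX, minY, maxX, maxY;
};

Rect enlargeRect(const Rect& r, double dx, double dy) {
    return {r.minX - dx, r.minY - dy, r.maxX + dx, r.maxY + dy};
}

bool intersects(const Rect& a, const Rect& b) {
    return a.minX <= b.maxX && b.minX <= a.maxX &&
           a.minY <= b.maxY && b.minY <= a.maxY;
}

// A moving point whose bounding box misses the enlarged window can never come
// closer than 'threshold' to the static object, so it can be filtered out early.
bool mayComeCloserThan(const Rect& trajectoryBBox, const Rect& staticBBox, double threshold) {
    return intersects(trajectoryBBox, enlargeRect(staticBBox, threshold, threshold));
}
```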

4.5 The Implementation In Secondo

This section describes the implementation of the STP predicate, and its variants in the context of the Secondo system. We first extend the kernel by the proposed operators, so that they will be accessible to the user through the Secondo executable language. Then we extend the Secondo optimizer, so that they will be accessible to the user in the SQL-like language, and so that the optimizer will be able to generate optimized query plans for such user queries.

4.5.1 Extending The Kernel

We have implemented the STP predicate and its variants in the Secondo kernel in a new algebra module called STPatternAlgebra. A Secondo algebra consists of types and operations on these types. The STPatternAlgebra contains:

1. The stvector type used to represent the interval relationships as defined in Section 4.1.

2. All the operators listed in Table 4.1.

Note that the signatures of the operators, as shown in Table 4.1, differ in some cases from their formal definitions. For instance, the start and end operators are formally described in Section 4.3 as functions:

f : (Interval)^n × {1, ..., n} → Instant

According to this formal definition, one would expect that the operators accept a tuple of time intervals and an integer, and yield an instant. However, in our implementation, they accept a string (denoting a predicate alias) and yield an instant.

The reason for such a difference is that the Secondo query languages (both executable and SQL) are strongly typed, in the sense of programming languages. That is, the query processor and the optimizer perform complete type checking before executing the query. Since the tuple of time intervals required for the start and end operators is generated at run-time (this tuple represents one supported assignment), we are not able to use it in the user query. Therefore, we chose to implement these operators slightly differently from their formal definitions. Similar considerations apply to the operator stconstraint.

4.5.2 Extending The Optimizer

The Secondo optimizer is written in Prolog. It accepts user queries in an SQL-like query language and translates them into optimized query plans in the Secondo executable language. The Secondo optimizer includes a separate rewriting module that can be switched on and off by setting the optimizer options. The planner implements a novel cost based optimization algorithm which is based on shortest path search in a predicate order graph. The predicate order graph (POG) is a weighted graph whose nodes represent sets of evaluated predicates and whose edges represent predicates, containing all possible orders of predicates. For each predicate edge from node x to node y, so-called plan edges are added to represent possible evaluation methods for this predicate. Every complete path via plan edges in the POG from the bottom-most node (i.e., zero evaluated predicates) to the top-most node (i.e., all predicates evaluated) represents a different execution plan. Different paths/execution plans represent different orderings of the predicates and different evaluation methods. The plan edges of the graph are weighted by their estimated costs, which in turn are based on given selectivities. The selectivities of predicates are either retrieved from prerecorded values, or estimated by sending selection or join queries on small samples of the involved relations to the Secondo kernel and reading the cardinality of the results. The algorithm is described in more detail in [GBA+04] as well as in the Secondo programmer's guide [Secb]. Our extension to the optimizer has three major parts: query rewriting, operator description, and translation rules. In the query rewriting, we choose to rewrite all the time-dependent predicates into sometimes(.), as explained in Section 4.4.2. The following are the Prolog rules that do the rewriting:

inferPatternPredicates([], []).

inferPatternPredicates([Pred|Preds], [sometimes(Pred)|Preds2]) :-
    assert(removefilter(sometimes(Pred))),
    inferPatternPredicates(Preds, Preds2).

where inferPatternPredicates accepts the list of the time-dependent predicates within the STP predicate as a first argument, and yields the list of rewritten predicates in the second argument. This rule straightforwardly surrounds all the time-dependent predicates by the predicate sometimes, and

adds them as standard predicates to the user query. The additional sometimes(.) predicates are kept in the fact base removefilter(.), so that we will be able to exclude them from the executable plan afterward.

In the Secondo planner, in the operator descriptions, we annotate the time-dependent predicates by their types as in Table 4.3 (e.g., lifted left range). Then we provide translation rules for sometimes(.) for every type of time-dependent predicate. The following is an example of such a rule:

indexselectLifted(arg(N), Pred) =>
    gettuples(rdup(sort(windowintersectsS(dbobject(IndexName), BBox))),
              rel(Name, *)) :-
    Pred =.. [Op, Arg1, Arg2],
    ( (Arg1 = attr(_, _, _), Attr = Arg1)
    ; (Arg2 = attr(_, _, _), Attr = Arg2) ),
    argument(N, rel(Name, *)),
    getTypeTree(Arg1, _, [_, _, T1]),
    getTypeTree(Arg2, _, [_, _, T2]),
    isLiftedSpatialRangePred(Op, [T1, T2]),
    ( (memberchk(T1, [rect, rect2, region, point, line, points, sline]),
       BBox = bbox(Arg1))
    ; (memberchk(T2, [rect, rect2, region, point, line, points, sline]),
       BBox = bbox(Arg2)) ),
    hasIndex(rel(Name, _), Attr, DCindex, spatial(rtree, unit)),
    dcName2externalName(DCindex, IndexName).

where this rule translates the lifted spatial range predicates into an R-tree window query, as indicated in the rule head. The => Prolog operator reads as translates into. It means that the expression on the right is the translation of the expression on the left, if the conditions in the rule body hold. The body of the rule starts by inferring the types of the arguments of the time-dependent predicate within the sometimes(.). Then it uses them to make sure that the predicate is of the type lifted spatial range. Finally, it checks whether a spatial R-tree index on the involved relation and attribute is available in the database catalog. It tries to find a spatial R-tree built on the units of the moving object. Similar translation rules are provided for other types of indexes. The optimizer plan for the bank robbers query, which we have shown in Section 4.4.2, results from this translation rule.

4.6 Experimental Evaluation

We proceed in this section with an experimental evaluation of the STP predicate. The intention is to give an insight into the performance. Clearly the run-time of an STP predicate depends on the number and types of the time-dependent predicates. Therefore, we show three experiments. The first measures only the overhead of evaluating the STP predicate. That is, we set the time of evaluating the time-dependent predicates to negligible values, in order to evaluate Algorithm 1.

In the second experiment, we generate random STP queries with varying numbers of time-dependent predicates and temporal constraints and measure the run-time of the queries. The experiment also evaluates the optimization of STP predicates. Every query is run twice: once without invoking the optimizer, and once with the optimizer invoked, and the run-times are compared.

The third experiment is dedicated to evaluating the scalability. Mainly it evaluates the proposed optimization approach in large databases. A random set of queries is generated and evaluated against relations of cardinality 50,000, 100,000, 200,000, and 300,000 tuples, where the trajectories are indexed using the traditional R-tree index.

The first two experiments use the berlintest database, which is packed inside the distribution of Secondo. The last experiment uses the BerlinMOD benchmark [DBG09] to generate the four relations. The BerlinMOD benchmark is available for download on [Secd]. The three experiments are run on a Secondo platform installed on a Linux 32-bit machine. The machine has a Pentium 4 dual-core 3.0 GHz processor and 2 GB of main memory.

4.6.1 Experiment 1: The Overhead Of Evaluating The STP Predicate

To perform the first experiment, we have added two operators to Secondo: randommbool and passmbool. The operator randommbool accepts an instant and creates an mbool object whose definition time starts at the given time instant and which consists of a random number of units. The operator passmbool mimics a time-dependent predicate. It accepts an mbool database object and directly yields it. Clearly the cost of passmbool is negligible, a property that we require for this experiment. More details are given below.

Preparing The Data

This section describes how the test data for the first experiment are created. The randommbool operator is used to create a set of 30 random mbool instances and store them as database objects, named mb1...mb30. The operator creates mbool objects with a random number of units varying between 1 and 20. The first unit starts at the time instant provided in the argument. Every unit has a random duration between 2 and 50000 milliseconds. The value of the first unit is randomly set to true or false. Each subsequent unit carries the negation of the value of its preceding unit. These mbool objects are created by calling randommbool(now()) 30 consecutive times, to guarantee that the definition times of the objects overlap.
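For illustration, the following Python sketch mimics this generation scheme outside of Secondo. The unit representation (a list of (start, end, value) triples) and the helper name random_mbool are hypothetical and not part of the Secondo implementation.

import random
from datetime import datetime, timedelta

def random_mbool(start: datetime):
    """Sketch of the randommbool scheme: 1-20 units, unit durations of
    2-50000 ms, alternating boolean values, temporally adjacent units."""
    units = []
    t = start
    value = random.choice([True, False])   # value of the first unit
    for _ in range(random.randint(1, 20)):
        end = t + timedelta(milliseconds=random.randint(2, 50000))
        units.append((t, end, value))      # half-open interval [t, end)
        t = end                            # next unit starts where this ends
        value = not value                  # each unit negates its predecessor
    return units

# 30 objects created back to back, so their definition times overlap
mbs = [random_mbool(datetime.now()) for _ in range(30)]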

Generating The Queries

The queries of the first experiment are selection queries consisting of one filter condition in the form of an STP predicate. The queries are generated with different experimental settings, that is, different numbers of time-dependent predicates and constraints in the STP predicate. The number of time-dependent predicates varies between 2 and 8. The number of constraints varies between 1 and 16. The queries are not generated for every combination. For example, it does not make sense to generate STP predicates with 2 time-dependent predicates and 10 constraints. For N time-dependent predicates, the number of constraints varies between N − 1 and 2N. The rationale for this is that, if the number of constraints is less than N − 1, then the constraint network cannot be connected (i.e., some predicates are not referenced within constraints). Also, having more than 2N constraints increases the probability of encountering contradicting constraints. For every experimental setting, 100 random queries are evaluated and the average run-time is recorded. A sample query with 3 time-dependent predicates and 2 constraints looks as follows:

query thousand feed filter[.
  stpattern[pred1: passmbool(mb5),
            pred2: passmbool(mb13),
            pred3: passmbool(mb3);
    stconstraint("pred2", "pred1", later),
    stconstraint("pred2", "pred3", vec("abab"))]]
count

where query thousand feed streams the thousand relation, which contains 1000 tuples. For every tuple, the STP predicate stpattern is evaluated. Note that the predicate does not depend on the tuples. That is, the same predicate is executed 1000 times in the query. This is to minimize the effect of the time taken by Secondo to prepare for query execution. The time-dependent predicates are all of the form passmbool(X), where X is one of the 30 stored random mbool objects.

The constraints are generated so that the constraint graph will be connected. We start by initializing a set called connected with one randomly selected alias. For every constraint, the two aliases are randomly chosen from the set of aliases in the query, so that at least one of them belongs to the set connected. The other alias is then inserted into the set connected. After the required number of constraints is generated, we check the connectivity of the graph. If it is not connected, the query is excluded from the experiments, and another query is generated. The temporal relationship for every temporal constraint is randomly chosen from 31 pre-prepared relationships, namely the 26 simple relationships in Table 4.2 and 5 vectors (later, follows, immediately, meanwhile, and then) defined using the vec(.) operator.

Before running the queries, we query the 30 mbool objects so that they will be loaded into the database buffer. Hence, the measured run-times should show only the overhead of evaluating the STP predicates in Secondo, because other costs are made negligible.
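To make this procedure concrete, the following Python sketch mirrors the described generation and the connectivity check; the function names and the pair representation of a constraint (omitting the temporal relationship) are ours and hypothetical.

import random

def generate_constraints(aliases, m):
    """Sketch of the constraint generation of Experiment 1: every new
    constraint touches at least one alias already in the growing
    'connected' set, which favors a connected constraint graph."""
    connected = {random.choice(aliases)}
    constraints = []
    for _ in range(m):
        a = random.choice(sorted(connected))
        b = random.choice([x for x in aliases if x != a])
        constraints.append((a, b))
        connected.update((a, b))
    return constraints

def is_connected(aliases, constraints):
    """Reachability check over the undirected constraint graph, used to
    discard queries whose graph is still not connected."""
    adj = {x: set() for x in aliases}
    for a, b in constraints:
        adj[a].add(b)
        adj[b].add(a)
    seen, stack = set(), [aliases[0]]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(adj[node] - seen)
    return seen == set(aliases)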

Results

The results are shown in Figure 4.2. The number of time-dependent predicates is denoted as N. Increasing the number of time-dependent predicates and constraints in the STP predicate does not have a great effect on the run time. This is a direct result of the early pruning strategy in the MatchPattern algorithm. The results show that the overhead of the algorithm is very small.

[Figure: overhead in seconds per 1000 tuples vs. number of constraints (0-16), one curve per N = 2, ..., 8]

Figure 4.2: The overhead of evaluating the STP predicate

4.6.2 Experiment 2: Run-Time And Optimization Gain

The second experiment is intended to evaluate the run-time of STP queries. It also evaluates the effect of the proposed optimization. We generate 10 random queries for every experimental setting and record the average run-time. Every query is run twice: once without being optimized, and once after optimization.

Preparing The Data

The queries use the Trains20 relation. It is generated by replicating the tuples of the Trains relation in the berlintest database 20 times. The Trains relation was created by simulating the underground trains of the city of Berlin. The simulation is based on the real train schedules and the real underground network of Berlin. The simulated period is about 4 hours in one day. The schema of Trains20 is similar to Trains with the additional attribute Serial:

Trains20[Serial: int, Id: int, Line: int, Up: bool, Trip: mpoint]

where Trip is an mpoint representing the trajectory of the train. The relation contains 11240 tuples and has a disk size of 158 MB. To evaluate the optimizer, a spatial R-tree index called Trains20_Trip_sptuni is built on the units of the Trip attribute. A set of 300 points is also created to be used in the queries, named point1...point300. The points represent geometries of the top 300 tuples in the Restaurants relation in the berlintest database.

Generating The Queries

The queries are generated in the same way as in the first experiment. In this experiment, however, we use actual time-dependent predicates instead of passmbool. Every time-dependent predicate in the STP predicate is randomly chosen from:

1. distance(trip, randomPoint) < randomDistance.

2. speed(trip) > randomSpeed.

where randomPoint is a point object selected randomly from the 300 restaurant points, randomDistance ranges from 0 to 50, and randomSpeed ranges between 0 and 30. The distance(., .) < . predicate is an example of a time-dependent predicate that can be mapped into an index access, so that we can evaluate the optimizer. In this experiment, the generated queries are in Secondo SQL. Here is one example:

SELECT count(*)
FROM trains20
WHERE pattern(
  [ distance(trip, point170) < 18.0 as pred1,
    speed(trip) > 11.0 as pred2],
  [ stconstraint("pred1", "pred2", vec("b.ba.a")) ])

The rewritten version of this query as generated by the rewriting module of the Secondo optimizer is:

SELECT count(*)
FROM trains20
WHERE [
  pattern(
    [ distance(trip, point170) < 18.0 as pred1,
      speed(trip) > 11.0 as pred2],
    [ stconstraint("pred1", "pred2", vec("b.ba.a"))]),
  sometimes(distance(trip, point170) < 18.0),
  sometimes(speed(trip) > 11.0)]

Finally, the optimal execution plan, as yielded by the Secondo optimizer, is:

query Trains20_Trip_sptuni windowintersectsS[
    enlargeRect(bbox(point170), 18.0, 18.0)]
  sort rdup Trains20 gettuples
  filter[sometimes((distance(.Trip, point170) < 18.0))] {0.00480288, 1.69712}
  project[Trip]
  filter[. stpattern[
    pred1: (distance(.Trip, point170) < 18.0),

    pred2: (speed(.Trip) > 11.0);
    stconstraint("pred1", "pred2", vec("b.ba.a"))]] {0.00480288, 1.49038}
  filter[sometimes((speed(.Trip) > 11.0))] {0.883731, 1.48077}
  count

where Trains20_Trip_sptuni is an R-tree spatial index. The pairs of numbers between the curly brackets do not affect the semantics of the query. They are estimated predicate selectivities and run-time statistics used to help estimate the query execution progress.

Results

In Figure 4.3, the chart on the left shows the average run-times of the non-optimized STP queries. The chart on the right shows the average run-times of their optimized counterparts. N is again the number of time-dependent predicates. The run-times of the optimized STP predicates are very promising.

[Figure: two charts (Non-Optimized Queries, Optimized Queries) plotting seconds vs. number of constraints (0-16), one curve per N = 2, ..., 8]

Figure 4.3: The run-time of the STP queries on the Trains20 relation

We have investigated the reason for the high peak in the right chart at N = 2 and number of constraints = 2. It happened because five of the ten generated queries contain only speed(.) > . predicates. Since the sometimes(speed(.) > .) predicate does not map into an index access, the average run-time for this experimental setting is close to that of the non-optimized version.

4.6.3 Experiment 3: Scalability

This experiment evaluates the performance of the STP predicate when used with large datasets. As shown in Section 4.4.2, the optimization of the STP predicate is carried out without special index structures, which is practically preferred in the context of database systems. It remains questionable, however, how effective such traditional indexes (e.g., R-trees) are for this type of query. This experiment tries to answer this question.

Obviously, if none of the time-dependent predicates within the STP predicate in a given query can be translated by the optimizer into index accesses, then one is out of luck, and the STP predicate will be evaluated for every tuple in the input stream. Therefore, in this experiment, we compose the STP predicates of time-dependent predicates that are supported by index structures available in Secondo.

Generating The Data

The data for this experiment is generated using the BerlinMOD benchmark [DBG09]. It simulates an arbitrary number of cars moving in the city of Berlin. The scenarios of the trips are quite realistic, simulating the trips to and from the workplace, and the leisure trips. The benchmark is downloadable from the Secondo web site [Secd]. The trajectory data is generated by running Secondo scripts. It is possible to control the number of cars and the number of observation days by editing a configuration file.

For this experiment, we have generated the four relations described in Table 4.4. The table shows for every relation the number of cars/trajectories, the number of simulation days, the number of units of all trajectories, and the storage space of the relation. The number of units is analogous to the total number of observations of all cars, in the discrete sense. Note that in this moving objects model, the trajectories are continuous. The generation of the four relations using the BerlinMOD benchmark took about 5 days on the machine described in Section 4.6.

Table 4.4: The database relations used in the scalability experiment

Relation Name   Number of Cars   Duration   Number of Units   Size
datascar50          50,000        1 day       64,331,426      9.1 GB
datascar100        100,000        1 day      128,437,840     18.2 GB
datascar200        200,000        1 day      256,373,737     36.3 GB
datascar300        300,000        1 day      384,923,972     54.5 GB

For each of the four relations, a spatial R-tree index is built on the Trip attribute. The R-tree contains the bounding boxes of the units of the Trip attribute, which are of type upoint.
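As an illustration of what such a unit index stores, the following Python sketch computes the spatial bounding box of every unit of a linearly interpolated trajectory; the tuple-based unit representation is our own assumption and not the Secondo upoint implementation.

def unit_bboxes(units):
    """units: list of (t_start, t_end, (x1, y1), (x2, y2)) describing linear
    movement within each unit. Returns one (xmin, ymin, xmax, ymax) box per
    unit, i.e., the entries a unit R-tree would be built from."""
    boxes = []
    for _, _, (x1, y1), (x2, y2) in units:
        boxes.append((min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2)))
    return boxes

# Example: two units of a trajectory moving east, then north
trip = [(0, 10, (0.0, 0.0), (5.0, 0.0)),
        (10, 20, (5.0, 0.0), (5.0, 7.0))]
print(unit_bboxes(trip))   # [(0.0, 0.0, 5.0, 0.0), (5.0, 0.0, 5.0, 7.0)]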

Generating The Queries

The BerlinMOD benchmark generates for every car up to five trips in a working day. Two of them drive to and from the workplace, and the other three trips are leisure trips in the afternoon/evening. The leisure destinations are randomly chosen from the neighborhood of the car's home location with a probability of 80%, and from the whole map with a probability of 20%. We use this information to design the experiment queries.

For each of the four relations in this experiment, a set of 10 queries is randomly generated. Each of the queries randomly picks a car, and retrieves its home location and three locations from its neighborhood, call them atmmachine, supermarket, and bakery for example. The query looks for the cars that made a leisure trip starting from the location home, and passing by the locations atmmachine, supermarket, and bakery respectively. Since the locations are chosen from the neighborhood of an existing car, there is some probability that the cars will fulfill the pattern. A sample query looks as follows:

SELECT count(*)
FROM datascar300 c
WHERE [
  pattern(
    [ c.trip = home as pred1,
      c.trip = atmmachine as pred2,
      c.trip = supermarket as pred3,
      c.trip = bakery as pred4],
    [ stconstraint("pred1", "pred2", later),
      stconstraint("pred2", "pred3", later),
      stconstraint("pred3", "pred4", later)])]

where home is the home location of the car, and the time-dependent predicate is fulfilled in the time instants/intervals in which its two arguments have the same spatial coordinates. Ten such queries are randomly generated for every relation. The next subsection shows the average run-times.

Results

In this experiment, we switch on the optimizer. Since the = time-dependent predicates in the queries belong to the lifted spatial range predicates, as shown in Table 4.3, the optimizer generates execution plans that use the R-tree indexes which were built during the data generation. Figure 4.4 shows the average run-times.

[Figure: average run time in seconds (0-35) vs. number of trajectories x 1000 (50-350)]

Figure 4.4: Scalability results

Two conclusions can be drawn from these results:

• Taking into consideration the large relation sizes shown in Table 4.4, and the moderate machine specifications described in Section 4.6, the average run-times are low for such a complex query type. For comparison, we measured the average run-time of an optimized spatiotemporal range query on the 300,000-car relation; it is 20 seconds, compared to an average of 28.6 seconds for the STP query. This confirms that the proposed optimization approach works fine without the need for specialized index structures.

• The run-time seems to scale linearly with the relation size. This is expected, since the STP predicate is applied to every tuple in the input (i.e., the tuples retrieved after the index access). Note that the BerlinMOD benchmark generates all the trips within the limited spatial extent of the city of Berlin. A larger number of cars in the simulation implies that the window queries on the R-tree index yield more candidates.

To sum up, the scalability of the STP predicate is affected by four parameters:

1. The number of time-dependent predicates in the STP predicate.

2. The number of the temporal constraints in the STP predicate.

3. The number of input tuples/trajectories.

4. The length of the trajectories in terms of number of units.

The scalability in terms of the first three parameters has already been evaluated in the three experiments in this chapter. The last parameter, the length of the trajectories, affects the evaluation time of the STP predicate indirectly, as it affects the evaluation time of the time-dependent predicates. This is because the time-dependent predicates are evaluated for the complete trajectory. When the trajectories are long (e.g., several weeks of observation time), the cost of evaluating the time-dependent predicates increases accordingly. The majority of them scale linearly with the number of units in the trajectory. More about the time-dependent predicate evaluation algorithms can be found in [CLFG+03].

In the STP predicate, the temporal constraints impose a certain temporal order between the time-dependent predicates. While evaluating the STP predicate, one gets temporal information from the time-dependent predicates evaluated so far. A proper analysis of this information can identify parts of the trajectory that can be safely ignored while evaluating other time-dependent predicates, and thus reduce the cost of processing long trajectories. In Chapter 6 we illustrate a temporal reasoning approach that utilizes this information in order to restrict the trajectories.

4.7 Conclusions

This chapter explained our approach for individual spatiotemporal pattern queries. It combines efficiency, expressiveness and a clean concept. It builds on existing moving objects database concepts. Therefore, it is convenient in the context of spatiotemporal DBMSs. Unlike previous approaches, it is integrated with query optimizers. In Chapter 7 we demonstrate a real-world application example to emphasize the expressive power of our approach. This work is completely implemented in the Secondo platform. The implementation and the scripts for experimental repeatability are available on the Web. Appendix A explains how to repeat the experiments in this chapter. The experimental evaluation shows that the proposed operators perform well even with large datasets. Chapter 6 describes an extension, based on temporal reasoning, to efficiently match patterns in long trajectories.

Chapter 5

Group Spatiotemporal Patterns

In Chapter 1, we explained group spatiotemporal patterns and gave a couple of illustrative examples. This chapter provides the formal definitions and the algorithms of the group STP operators that we propose.

5.1 Extending The Type System

In Table 3.1, we have summarized the type system on which our STP operators are built. We need to extend this type system in order to support group STP operators. This extension is shown in Table 5.1. The extensions are highlighted in bold. Mainly we introduce the kind COLL, and the type constructor set, which represents a set of elements from the DATA or SPATIAL kinds. The goal of this extension is to introduce the moving set type mset (i.e., mset denotes mapping(constunit(set))). It represents a set whose elements change over time. The domain Dmset is defined as follows: Let elem be a type variable that can be instantiated by any type in DATA ∪ SPATIAL, and let Delem be its domain. The domain of the type set is:

Dset = P(Delem), and the domain of the type uset (i.e., uset = constunit(set)) is

Duset = Interval × Dset .

Finally, Dmset is defined as Dmapping(uset) according to Definition 3.1. We will use the type mset to represent a group of moving objects that matches a group STP query. We also define the set MSetPart, which is required for the operator definitions, as follows:

Definition 5.1. Let MSetPart be a subset of Dmset defined as:

MSetPart = {{(i1, f1), ..., (in, fn)} ∈ Dmset | ∃ i ∈ Interval : i = (i1 ∪ ... ∪ in)}




That is, an element of MSetPart is an mset instance that has no definition gaps within its definition time. Its definition time can be represented as a single time interval. As an illustration, consider the following example of an mset instance:

[2011-04-07-10:51:37.453, 2011-04-07-10:52:10.31[, {16, 14}
[2011-04-07-10:52:10.31, 2011-04-07-10:53:58.9[, {16, 14, 113}
[2011-04-07-10:53:58.9, 2011-04-07-11], {16}

It consists of 3 usets, and the elements are of type int. It also belongs to MSetPart, because the time intervals of the three units are adjacent (i.e., there are no temporal gaps in between).
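The following Python sketch illustrates this representation: an mset is held as a chronologically ordered list of (start, end, set-of-ids) units, and a small helper checks the MSetPart property, i.e., that consecutive units leave no temporal gaps. The representation and helper names are ours, for illustration only.

# An mset as an ordered list of units (start, end, frozenset of element ids)
mset_example = [
    (1.0, 2.5, frozenset({16, 14})),
    (2.5, 4.0, frozenset({16, 14, 113})),
    (4.0, 5.0, frozenset({16})),
]

def is_mset_part(units, eps=0.0):
    """True if the units are temporally adjacent, i.e., the definition
    time forms a single interval without gaps (the MSetPart property)."""
    for (s1, e1, _), (s2, e2, _) in zip(units, units[1:]):
        if s2 - e1 > eps:          # a gap between consecutive units
            return False
    return bool(units)

print(is_mset_part(mset_example))   # True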

Table 5.1: The extended type system

→ IDENT                              ident
→ DATA                               int, bool, real, string
→ DISCRETE                           int, bool, string
→ SPATIAL                            point, region, line
→ TIME                               instant
DATA ∪ TIME → RANGE                  range
DATA ∪ SPATIAL → COLL                set
DATA ∪ SPATIAL ∪ COLL → TEMPORAL     intime
DISCRETE ∪ COLL → UNIT               constunit
→ UNIT                               ureal, upoint, uline, uregion
UNIT → MAPPING                       mapping
(ident × ATTR)+ → TUPLE              tuple
TUPLE → REL                          rel
TUPLE → STREAM                       stream

5.2 Group Spatiotemporal Pattern Operators

Table 5.2 lists the signatures of the proposed group STP operators. The three operators reportpattern, gpattern, and crosspattern are the main pattern matching operators. They accept the pattern description and match it against the trajectories in the database. Their details are discussed in the following subsections. These operators represent their results as msets, where every mset object represents the identifiers of the moving objects constituting one group that matches the pattern.

While the mset is computationally convenient for representing the results, it does not make much sense for the user. This is why we introduce the boundingregion and the members operators, which convert this representation into more meaningful ones. The boundingregion operator computes the convex hull of the mpoint objects in a given group. Consider its first overload in Table 5.2, which accepts an mset argument. It accepts the stream of tuples containing the original mpoints, the mset object representing the group, and the sampling rate. It retrieves the mpoint objects that are members of the mset object from the input stream (by matching the identifiers in the mset instance with the first attribute in the stream), and samples their positions with the given sampling interval. For every sample, which is represented as a points object (i.e., a set of point), it computes the convex hull as a region object. Finally the computed convex hulls of all samples are interpolated into a single mregion object. This operator, thus, computes an mregion representation of a group of mpoints. The first two overloads of this operator accept an mset or a uset as the group representation, and yield an mregion. The third overload yields a region, because the group is represented by an intime(set), which contains a single time instant.

The members operator retrieves the mpoint objects that constitute the mset object from the input stream, and yields them as a stream. If the boolean argument is set to true, the returned mpoint objects are restricted to the time periods in which they belong to the mset. If this argument is set to false, these mpoints are returned wholly in the result, without temporal restriction.

[Table 5.2: Group spatiotemporal pattern operators; signatures of reportpattern, gpattern, crosspattern, boundingregion (three overloads), and members (two overloads)]

5.2.1 The reportpattern Operator

In this section, we formally define the reportpattern operator. It is very similar to the individual STP predicate in Section 4.1. It accepts two sets: a set of patternoid operators, each of which has an alias, and a set of temporal constraints. Patternoid operators are the gpattern and the crosspattern operators. Syntactically, reportpattern requires that a patternoid operator yield a stream(mset), and that a temporal constraint be a bool expression. Semantically, the result of a patternoid operator is required to be a subset of MSetPart. This is required in order to be able to evaluate the temporal constraints, as will be explained below in this section. We start by reminding the reader of the set of temporal relationships IR that was defined in Section 4.1. That is:

IR = {aabb, abba, bbaa, a.bab, aa.bb, a.bba, bb.aa, baa.b, abab, aba.b, baba, a.ba.b, baab, a.abb, bb.a.a, a.a.bb, bba.a, ba.ab, b.baa, aa.b.b, b.b.aa, aab.b, ab.ba, a.ab.b, b.ba.a, a.a.b.b}

We recall also that two time intervals i1, i2 ∈ Interval fulfill a set of these relationships if they fulfill any of them. Let P = {p1, ..., pn} be a set of patternoid operators. A temporal constraint on P is an element of the set:

TC(P ) = {1..n} × {1..n} × P(IR)

It is a binary constraint that assigns to a pair of patternoid operators in P a set of interval relationships. Syntactically, it is expressed by the stconstraint operator that was explained in Section 4.1. Let P = {p1, ..., pn} be a set of patternoid operators. Let eval(pi) denote the evaluation of the patternoid operator pi ∈ P, that is, eval(pi) ⊂ MSetPart. We define the set of candidate assignments CA(P) as:

CA(P ) = eval(p1) × ... × eval(pn)

That is, CA(P) is simply the Cartesian product of the result streams of the patternoid operators. Let ca = (v1, ..., vn) ∈ CA(P) and let c = (j, k, SI) ∈ TC(P) be a temporal constraint.

ca fulfills c :⇔ deftime(vj) and deftime(vk) fulfill SI

where deftime(vi) is the time interval in which the mset vi is defined. Since the reportpattern operator requires that a result of a patternoid operator belong to MSetPart, it is guaranteed that deftime(vi) yields a single time interval. Let C ⊆ TC(P) be a set of temporal constraints. The set of supported assignments is defined as:

SA(P, C) = {ca ∈ CA(P) | ∀ c ∈ C : ca fulfills c}

That is, for a candidate assignment to be a supported assignment, it must fulfill all the constraints in C. A supported assignment, thus, contains a single mset instance for every patternoid, and fulfills all the temporal constraints. The result of the reportpattern operator is the set of supported assignments, that is:

Definition 5.2. The reportpattern operator is a pair (P, C), where P = {p1, ..., pn} is a set of patternoid operators, and C ⊆ TC(P) is a set of temporal constraints. Its evaluation is defined as:

eval((P,C)) = SA(P,C)

The reportpattern operator is evaluated in a way similar to the individual STP predicate. That is, it is modeled as a constraint satisfaction problem (CSP), and the evaluation algorithm is very similar to the MatchPattern algorithm (Algorithm 1).
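As an illustration of Definition 5.2, the following Python sketch enumerates the candidate assignments and keeps the supported ones; it assumes each patternoid result carries a single definition interval (the MSetPart requirement) and uses a placeholder fulfills_relationship check instead of the full set of 26 interval relationships.

from itertools import product

def fulfills_relationship(i1, i2, rels):
    """Placeholder: in the thesis, rels is a subset of IR; here we only
    illustrate one of them, 'aabb' (i1 strictly before i2)."""
    (a1, a2), (b1, b2) = i1, i2
    return "aabb" in rels and a2 < b1

def report_pattern(patternoid_results, constraints):
    """patternoid_results: list of lists of (interval, mset) candidates.
    constraints: list of (j, k, rels) with 0-based patternoid indices.
    Returns all supported assignments in the sense of Definition 5.2."""
    supported = []
    for ca in product(*patternoid_results):
        if all(fulfills_relationship(ca[j][0], ca[k][0], rels)
               for (j, k, rels) in constraints):
            supported.append(ca)
    return supported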

5.2.2 The gpattern Operator

This section formally defines the gpattern operator. This operator reports groups of moving objects that simultaneously fulfill a time-dependent predicate. Examples are: find a group of at least 5 stocks whose prices were simultaneously increasing or stable for a whole week during Feb 2011, find large groups (more than 400) of air passengers that were simultaneously heading to the passport control counters on a certain day, find groups of geese migrating in northern direction. The syntax of the gpattern operator, as shown in Table 5.2, is:

stream(tuple) × (tuple → int) × (tuple → mbool) × duration × int × string → stream(mset)
gpattern    #[ , , , , ]

The last argument has only two allowed values: "exactly" and "atleast". Hence, the operator has the following two forms:

• S gpattern[id, α, d, n, "exactly"].
• S gpattern[id, α, d, n, "atleast"].

where S is a stream of tuples representing the set of moving objects. The id argument is a function that maps a tuple in S into the identifier of the moving object represented in this tuple. Note that in the base data model, moving objects have no identifiers. For this reason, we require that the tuples in S include two attributes: the moving object attribute, and an identifier attribute (for simplicity, we require it to be of type int). The argument α is the time-dependent predicate that is used to express the pattern. For instance, to find the groups of geese migrating in northern direction, α would be something like 45° < direction(Trip) < 135°, where Trip is an mpoint object representing the trajectory of the geese. The argument d is the minimum duration (e.g., in milliseconds) of the patternoid. Finally, the n argument is the minimum size of the group. In the following, we provide two definitions for the two forms of gpattern. In the definitions, we deal with S as a set of moving objects, simplifying the technical detail of its representation as a stream of tuples. The evaluation of the first form of the gpattern operator is:

eval( S gpattern[ id, α, d, n, "exactly" ] ) =

{{(I, V)} | V ⊆ S, |V| = n, I ∈ Interval, length(I) ≥ d,
 ∀t ∈ I, ∀e ∈ V : (α(e))(t),
 ∀I′ ∈ Interval with I ⊂ I′ : ∃e ∈ V, ∃t ∈ I′ : ¬(α(e))(t)}

where length(I) = I.t2 − I.t1, and (α(e))(t) ∈ {false, true, undefined} is the evaluation of the time-dependent predicate α for the moving object e at the time instant t. The operator yields groups of exactly n moving objects. Therefore, every mset in the result contains only one unit (I, V). The last condition guarantees that only the longest-duration patternoids are reported, to avoid an infinite number of results.

The definition allows a moving object to belong to several groups in the result. This is required to report all possible combinations of exactly n objects. This is helpful in expressing the trend-setter pattern, for instance. That is, a group of exactly k moving objects matches some patternoid description, followed by another group of at least j moving objects that matches the same description. To define the semantics of the second form of the gpattern operator, we first need to define the following auxiliary operations:

mbool → bool                 always      #( )
mbool → bool                 sometimes   #( )
mset × set → mbool           ⊂, ⊆        #
data × mset → mbool          in          #
mapping → range(instant)     deftime     #( )
mbool × bool → mbool         at          #

where always yields true if its argument has the value true whenever it is defined. sometimes yields true if its argument is ever true. The time-dependent versions of the set predicates ⊂, ⊆, and in yield true at the time instants/intervals during which their corresponding standard predicates hold¹. The deftime operation yields the set of time intervals and/or instants during which a moving object is defined. Finally, the at operation restricts the definition time of a moving object to the time intervals and/or instants during which its value is equal to the second argument. Given a gpattern operator in the form S gpattern[ id, α, d, n, "atleast" ],

¹The set operations ⊂, ⊆ read as ispropersubset, issubset. We intentionally avoid denoting the in operation using the symbol ∈ to differentiate between uset ∈ mset and the lifted data in mset.

we first define the set M as:

M = {X | X ∈ MSetPart, always(X ⊆ S), always(|X| ≥ n),
 (∀e such that sometimes(e in X) : P = deftime((e in X) at true) ⇒
   (i) ∀I ∈ P : length(I) ≥ d
   (ii) ∀t ∈ P : (α(e))(t))}

The set M contains all possible matches of the patternoid. In contrast to the previous form of gpattern, an mset instance in the result of this form might contain several units (i.e., usets). This means that moving objects might be joining and/or leaving the group during its definition time. The last four lines ensure that each time an object joins the group, it stays for a duration of at least d before leaving it. We think this is a required feature, in order to exclude the commuters who do not really belong to the group. In the following definition, we restrict the set M to the maximal patternoids in terms of the duration and the group count. This is to avoid an infinite number of results. The evaluation of the second form of gpattern is defined as:

eval( S gpattern[ id, α, d, n, "atleast" ] ) =

{X | X ∈ M,
 (∄Y ∈ M such that deftime(X) ⊂ deftime(Y)),
 (∄Y ∈ M such that (deftime(X) = deftime(Y) ∧ sometimes(X ⊂ Y)))}
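To make the auxiliary operations used in these definitions concrete, here is a small Python sketch over an mbool represented as a list of (start, end, value) units; the representation and function names are ours and only approximate the lifted operations of the data model.

def always(mbool_units):
    """true whenever defined, i.e., all unit values are True."""
    return all(v for _, _, v in mbool_units)

def sometimes(mbool_units):
    """ever true, i.e., at least one unit value is True."""
    return any(v for _, _, v in mbool_units)

def at(mbool_units, value):
    """restrict the definition time to the units carrying `value`."""
    return [(s, e, v) for s, e, v in mbool_units if v == value]

def deftime(mbool_units):
    """the time intervals during which the object is defined."""
    return [(s, e) for s, e, _ in mbool_units]

# Example: an object fulfills a predicate in [0, 5) and [8, 10)
mb = [(0, 5, True), (5, 8, False), (8, 10, True)]
print(sometimes(mb), always(mb))        # True False
print(deftime(at(mb, True)))            # [(0, 5), (8, 10)]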

The evaluation of the gpattern operator is achieved by Algorithm 3. It uses the auxiliary functions shown in Algorithms 4 - 6. It also uses the time-dependent union operation of msets, which is defined as follows:

(o1 ∪ o2)(t) =
  o1(t) ∪ o2(t),  if isdef(o1(t)) ∧ isdef(o2(t))
  o1(t),          if isdef(o1(t)) ∧ ¬isdef(o2(t))
  o2(t),          if ¬isdef(o1(t)) ∧ isdef(o2(t))
  undefined,      if ¬isdef(o1(t)) ∧ ¬isdef(o2(t))

where o1, o2 are mset instances, t is a time instant, and isdef yields true iff its argument is defined.

Algorithm 3 accumulates the evaluations of the time-dependent predicate, for all the moving objects in the input stream, in the mset instance Accumulator. A moving object identifier appears in the Accumulator in the time intervals/instants during which it fulfills the time-dependent predicate. The size and duration thresholds n, d are then applied to the Accumulator. Finally, the last while loop iterates over the Accumulator to generate the result stream. Every iteration of this loop reads the Accumulator units till a definition gap is met (i.e., a time interval/instant during which the Accumulator is undefined). The read part is called the head, and it belongs to the set MSetPart. Therefore, head is already one result for gpattern in the case of "atleast". In the case of "exactly", the ExactSubsets function computes all the maximal-duration subsets of head that have a cardinality of n, and they are added to the result stream.
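A minimal Python sketch of this time-dependent union, again over the hypothetical unit-list representation introduced earlier, could look as follows; it sweeps over the unit endpoints of both arguments and unions the member sets wherever both are defined.

def mset_union(a, b):
    """Pointwise union of two msets given as ordered (start, end, set) units.
    In each elementary interval the result is the union of whichever
    arguments are defined there; elsewhere it stays undefined."""
    def value_at(units, t):
        for s, e, members in units:
            if s <= t < e:
                return members
        return None                       # undefined at t

    cuts = sorted({t for s, e, _ in a + b for t in (s, e)})
    result = []
    for s, e in zip(cuts, cuts[1:]):
        va, vb = value_at(a, s), value_at(b, s)
        if va is None and vb is None:
            continue                      # result stays undefined here
        result.append((s, e, (va or frozenset()) | (vb or frozenset())))
    return result

print(mset_union([(0, 2, frozenset({1}))], [(1, 3, frozenset({2}))]))
# [(0, 1, frozenset({1})), (1, 2, frozenset({1, 2})), (2, 3, frozenset({2}))]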

Algorithm 3: gpattern(S: stream(tuple), id: (tuple → int), α: (tuple → mbool), n: int, d: duration, q: enum{"exactly", "atleast"}) → stream(mset)

1  let R be a stream(mset), initially empty;
2  let Accumulator be an mset, initially empty;
3  foreach s: tuple in S do
4    Accumulator := Accumulator ∪ mbool2mset(α(s), id(s));
5  let Changed := true;
6  while Changed do
7    Changed := DeleteUnitsBelowSize(Accumulator, n);
8    Changed := Changed ∨ DeleteElemsBelowDuration(Accumulator, d);
9  while Accumulator has more units do
10   let head := Head(Accumulator);
11   Accumulator := Rest(Accumulator);
12   if q = "atleast" then R.Add(head);
13   else R.Add(ExactSubsets(head, n, d));
14 return R;

Algorithm 4: mbool2mset(mb: mbool, id: int) → mset

1  let ms be an mset, initially empty;
2  foreach (I, true): ubool ∈ mb do
3    ms.AddUnit(I, {id});
4  return ms;

Algorithm 5: DeleteUnitsBelowSize(Accumulator: mset, n: int) → bool

1  let Changed be a bool, initially false;
2  foreach (I, s): uset ∈ Accumulator do
3    if |s| < n then
4      Accumulator.DeleteUnit((I, s));
5      Changed := true;
6  return Changed;

Algorithm 6: DeleteElemsBelowDuration(Accumulator: mset, d: duration) → bool

1  let Changed be a bool, initially false;
2  let s be a set, initialized with all the moving object identifiers in all the units of Accumulator;
3  foreach elem: int in s do
4    let p := deftime(elem in Accumulator);
5    foreach I: interval in p do
6      if length(I) < d then
7        DeleteElemPart(Accumulator, elem, I);
8        Changed := true;
9  return Changed;

5.2.3 The crosspattern Operator

The crosspattern operator expresses a patternoid in terms of a time-dependent predicate evaluated for pairs of moving objects. In contrast to the gpattern operator, it expresses patternoids based on the mutual spatiotemporal relationships between objects. Many patternoids can be expressed in this way (e.g., flock, convoy, convergence, etc.). A flock, for instance, can be expressed as a group of moving objects where the spatial distance between every object and all other flock members is below some threshold.

Formally, let S be a set of moving objects. Let α be a time-dependent predicate that can be applied to pairs of moving objects. We define the so-called pattern graph as follows:

PG(S, α) = {(n1, n2, p) | n1, n2 ∈ S, p ∈ Drange(instant), ∀t ∈ p : (α(n1, n2))(t)}.

That is, a pattern graph is a set of time-dependent edges of the form (n1, n2, p). Such an edge exists during the time periods p, and disappears otherwise. In other words, it connects its two nodes n1, n2 whenever the time-dependent predicate α is fulfilled for the moving objects n1, n2. The set of graph nodes is not explicitly represented in the pattern graph. Rather, a node exists in the graph as long as at least one edge connects to it. Therefore, the set of nodes is also time-dependent. The pattern graph is a kind of analytical space in which patternoids can be searched for. The connectivity of the graph tells about the interaction between the moving objects based on the α function.

A temporal scan of a pattern graph, starting at its earliest time instant and continuously increasing the time, yields a fully dynamic graph (i.e., a standard graph that undergoes a sequence of edge/node additions and/or deletions). There already exist algorithms in graph theory for answering connectivity queries on dynamic graphs (e.g., finding connected components). These algorithms efficiently update the solution after a change occurs in the graph, rather than re-evaluating the query from scratch. A review of such techniques can be found in [EGI99]. We can safely assume that all types of connectivity queries that are supported for standard graphs are also supported for dynamic graphs (e.g., Walk, Clique, Connected Component). That is, if we lack algorithms that efficiently search for certain subgraph types in a dynamic graph, the search can still be done inefficiently by re-evaluating the query after every change to the graph. Hence, the crosspattern operator is able to search for various subgraph types within the pattern graph (e.g., Walk, Clique, Knot) using such techniques.

Given a pattern graph PG(S, α), the temporal function PG(S, α)(t), where t is a time instant, yields the standard graph (N, E), where E is the set of edges in PG(S, α) that are defined at time t, and N is the set of nodes connected to at least one edge in E. Now we use the pattern graph to define the crosspattern operator. In the definition, we use the mset type to represent the results of the crosspattern operator. The elements of the mset are the moving objects (i.e., the nodes of the pattern graph) that fulfill the patternoid.

Let U ⊆ S × S be a set of candidate pairs² of elements from S, and let S′ = π1(U) ∪ π2(U), where for a set of tuples V, πi(V) denotes the projection on the i-th component. So S is reduced to S′, the elements mentioned in U. Given an application of the crosspattern operator in the form U crosspattern[id1, id2, α, d, n, "clique"], we first define the set Q as:

Q = {X ∈ MSetPart | always(X ⊆ S′) ∧
 (∀(I, V) ∈ X, ∀t ∈ I : V is a maximal clique in PG(S′, α)(t) ∧ |V| ≥ n) ∧
 (∀e such that sometimes(e in X) : P = deftime((e in X) at true) ⇒ ∀I ∈ P : length(I) ≥ d)}

An element X ∈ Q is an mset instance representing one group of moving objects that matches the patternoid. The maximal clique condition guarantees that the number of moving objects in a result is maximal. We mention the clique in this definition as an example. Clearly it can be replaced by other subgraph types. Now we define the evaluation of the crosspattern operator as:

eval( U crosspattern[id1, id2, α, d, n, "clique"] ) =

{X ∈ Q | ∄Y ∈ Q such that (deftime(X) ⊂ deftime(Y) ∧ (∀t ∈ deftime(X) : X(t) = Y(t)))}

where X(t), Y(t) are the set snapshots of X, Y at time t. This definition adds to the definition of Q the condition that the definition time of a result is maximal, thus avoiding an infinite number of results. The evaluation algorithm of the crosspattern operator is illustrated in the following section.
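As a small illustration of these definitions, the following Python sketch takes a pattern graph stored as a dictionary from edges to lists of time intervals, builds the snapshot PG(S, α)(t), and lists its maximal cliques with at least n nodes using networkx; this is only a pointwise illustration of the clique condition, not the incremental algorithm described in Section 5.3.

import networkx as nx

def snapshot(pattern_graph, t):
    """pattern_graph: {(n1, n2): [(start, end), ...]}. Returns the standard
    graph PG(S, alpha)(t) containing the edges defined at time t."""
    g = nx.Graph()
    for (n1, n2), periods in pattern_graph.items():
        if any(s <= t < e for s, e in periods):
            g.add_edge(n1, n2)
    return g

def large_cliques_at(pattern_graph, t, n):
    """Maximal cliques with at least n nodes in the snapshot at time t."""
    g = snapshot(pattern_graph, t)
    return [set(c) for c in nx.find_cliques(g) if len(c) >= n]

pg = {("a", "b"): [(0, 10)], ("b", "c"): [(2, 8)], ("a", "c"): [(3, 6)]}
print(large_cliques_at(pg, 4, 3))   # one maximal clique containing a, b, c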

5.3 The crosspattern Evaluation Algorithm

We dedicate a separate section to the evaluation algorithm of the crosspattern operator, because of the many details it incorporates. Abstractly speaking, the function of the crosspattern operator is to search for large connected components within a time-dependent graph (i.e., the pattern graph), which is constructed from the evaluations of the time-dependent predicate on pairs

²We use U as input to the crosspattern operator instead of S to be able to do some prefiltering. For example, interesting pairs of candidates from S may be those coming close to each other during their lifetime, and they can possibly be determined efficiently using indexes. Evaluating instead all pairs from S may be prohibitively expensive.

of moving objects. As discussed in the previous section, the general form of the crosspattern operator is:

U crosspattern[id1, id2, α, d, n, subgraph-type].

The function of the crosspattern operator can be divided into three sub-functions:

1. Constructing the pattern graph PG(S′, α).

2. Searching for the large connected components of type subgraph-type within the pattern graph. Large means that they must fulfill the minimum duration d constraint, and the minimum group cardinality n constraint.

3. Representing the found connected components, and yielding them.

5.3.1 Related Work (REVIEW)

The problem of the crosspattern operator is indeed a generic one, and it has applications in other fields. Generally, it helps analyzing the temporal behavior of evolving networks. Social network users and the evolution of their relationships can be modeled by pattern graphs. The α predicate in such a case might represent whether a pair of users are friends, or whether they contact one another on a daily basis. Other applications are the modeling of the link availability in communication networks, especially wireless and ad-hoc networks, and the link analysis in the evolving web. Such topics are in the focus of several recent workshops, such as the Workshop on Dynamic Network Analysis (DNA-SDM 2012), the Temporal Web Analytics Workshop (TempWeb 2012), and the Workshop on Mining Social Network Dynamics (MSND 2012). To the best of our knowledge, this problem has not yet been approached in the context of moving object databases. It seems natural, however, to have a type for time-dependent graphs (moving graphs) and relevant operations defined on top of it.

In the context of communication networks, there are works on the so-called Evolving Graphs [BXFJ03] to model network dynamics. An evolving graph is a sequence of snapshots, each of them representing the graph at a single time instant. A recent work [RLK+11] proposed a compression technique, so that the space requirements of evolving graphs are affordable. We see two features of evolving graphs that make them unsuitable for representing pattern graphs:

1. The pattern graph is a continuous mapping from time to graph. The evolving graph is a discrete representation. Clearly the two models are not equivalent.

2. An evolving graph requires O(n) storage space, even after applying the compression technique in [RLK+11], where n is the number of changes occurring in the graph. In the case of the crosspattern operator, the number of changes is normally in the range of 4 to 6 orders of magnitude. This would easily exceed several GB of memory space.

We are also not aware of works on finding connected components in evolving graphs. Connectivity queries are studied in the context of dynamic graphs. A dynamic graph is a standard graph that undergoes a sequence of edge additions or deletions. If both additions and deletions are allowed, it is called a fully dynamic graph. Dynamic graph techniques focus on online graph updates and queries. Only the most recent graph is represented, and no history is kept. A dynamic graph is different from a pattern graph, because the latter needs to represent the history. On the other hand, a dynamic graph can be derived from a pattern graph by means of a temporal scan.

There are smart algorithms for maintaining the minimum spanning forest in fully dynamic graphs [EGI99]. That is, instead of re-evaluating the minimum spanning forest after every graph update (i.e., edge addition or deletion), the algorithm does this evaluation only once in the beginning, and tries to efficiently update the forest after every graph update. Such algorithms can easily be modified to maintain connected components instead of the minimum spanning forest. There are two main differences that make the problem of the crosspattern operator more complex than the problem of maintaining connected components in fully dynamic graphs:

1. The crosspattern operator searches for large connected components in terms of the two constraints d, n as we explained before.

2. The crosspattern operator is required to store/represent the result (i.e., the large connected components) as time-dependent graphs. The algorithms for maintaining minimum spanning forests allow only for online queries, and keep no history.

To the best of our knowledge, the problem of finding large connected components in time-dependent graphs is a novel one. We have developed an algorithm (explained in the following section) that is able to do the task correctly and efficiently.

5.3.2 The crosspattern Evaluation Algorithm

Algorithm 7 illustrates the evaluation of crosspattern. The first argument is a stream of tuples that corresponds to the set U, defined in Section 5.2.3. Every tuple in U has a pair of moving objects, whose identifiers can be queried using the two mappings IDFun1, IDFun2. The first part of the algorithm (Lines 3 - 8) constructs the pattern graph PG(S′, α), which is represented by the mset instance Accumulator. The Accumulator contains edge identifiers, such that an edge identifier belongs to the Accumulator whenever its corresponding edge exists in the pattern graph. For each tuple in U, the identifiers of the two contained moving objects are used to compute a unique edge identifier by the function NodesToEdge. This function has an inverse EdgeToNodes. The goal is to be able to compute the edge identifier given its two nodes' identifiers, and vice versa, in constant time. The edge itself is computed by evaluating the time-dependent predicate α for the current tuple. Note that

the definition time of the edge corresponds to the time periods in which α(u) is true. Finally the edge is constructed using the mbool2mset function (see Algorithm 4), and inserted into the Accumulator.

Algorithm 7: crosspattern( U: a stream of tuples, IDFun1, IDFun2: (tuple → int), α:(tuple → mbool), n: int, d: duration, q: enum{cc, clique, ...}) → a stream of mset

1  let R be a stream(mset), initially empty;
2  let Accumulator be an mset, initially empty;
3  foreach u: tuple in U do
4    NodeID1 ← IDFun1(u);
5    NodeID2 ← IDFun2(u);
6    EdgeID ← NodesToEdge(NodeID1, NodeID2);
7    Edge ← α(u);
8    Accumulator ← Accumulator ∪ mbool2mset(Edge, EdgeID);
9  AccumulatorParts ← split the Accumulator at the locations where it has definition gaps;
10 foreach part: mset in AccumulatorParts do
11   largeCCs ← FindLargeConnectedComponents(part, n, d);
12   subGraphs ← FindSubGraphs(largeCCs, n, d, q);
13   Append subGraphs to R;
14 return R;

In the experiments, we have seen that the number of units in the Accumulator, and the number of edge identifiers in every unit, are very large. The straightforward implementation of the mset type in Secondo stores the edge identifiers of every unit. For a medium input size (e.g., U has on the order of 10^4 tuples), the number of units in the Accumulator is on the order of 10^5, and every unit has on the order of 10^2-10^3 edge identifiers. Under such conditions, the Accumulator would require several GB of disk space³. We have implemented a second variant of the mset that stores in every unit only the changes from the previous unit (i.e., edges that are added or removed). This variant considerably reduces the space requirement, because most of the time the change is adding or removing a single edge. We have also implemented the union operation in Line 8 in a different way that works efficiently with this mset implementation. Inside the loop (Lines 3 - 8), we use a sorted list to buffer events of the form at time t1 the edge ej starts to be contained in the Accumulator, and at time t2 the edge ej ends to be contained in the Accumulator. There are two such events for every time interval in which α is fulfilled: one event for joining, and the other for leaving the Accumulator. These events are sorted inside the list by increasing time. Buffering an event is O(log(n)) in the buffer size. After the loop ends, the Accumulator is constructed from the buffer in a single linear scan.
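A minimal Python sketch of this event-buffer construction (our own simplified representation, not the Secondo FLOB-based one) could look as follows; each edge interval contributes a join and a leave event, and one chronological scan over the events produces the per-unit deltas of the compressed mset.

from collections import defaultdict

def build_delta_mset(edge_intervals):
    """edge_intervals: {edge_id: [(start, end), ...]} from evaluating alpha.
    Returns a list of (time, added_edges, removed_edges) deltas, i.e., the
    compressed representation storing only changes between units."""
    events = defaultdict(lambda: (set(), set()))   # time -> (joins, leaves)
    for edge, periods in edge_intervals.items():
        for s, e in periods:
            events[s][0].add(edge)                 # edge starts to exist
            events[e][1].add(edge)                 # edge ceases to exist
    deltas = []
    for t in sorted(events):                       # single chronological scan
        added, removed = events[t]
        deltas.append((t, added, removed))
    return deltas

print(build_delta_mset({7: [(0, 5)], 9: [(2, 5)]}))
# [(0, {7}, set()), (2, {9}, set()), (5, set(), {7, 9})]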

³Secondo uses FLOBs (Faked Large OBjects) to store moving objects. FLOBs may be stored in memory or on disk as decided by the FLOB manager. The decision is made based on FLOB size. Large FLOBs are written to disk.

We expect that the most expensive part in constructing the pattern graph will be the evaluation of α(u) (Line 7). The cost varies according to the associated time-dependent predicate α. The predicate distance(. , .) < . between two mpoint objects, for instance, is O(n + m), where n and m are the numbers of units of the two arguments. More about time-dependent predicates and their evaluation algorithms can be found in [CLFG+03]. Keeping in mind that α is evaluated for every tuple in U, and that U represents pairs of moving objects, so that U is quadratic in the number of moving objects, we find the use of indexes necessary. The tuples that never fulfill α can safely be ignored, and not included in U. We discuss such an optimization technique in Section 5.5.2.

The rest of the crosspattern algorithm finds the subgraphs that fulfill the user criteria n, d, q. First the Accumulator is split at the locations where it has definition gaps, if any. The result is a list of mset instances that belong to MSetPart. The search proceeds considering each of these parts separately as a pattern graph, finding the subgraphs in each, and concatenating the results into a single output stream R. We have invested considerable time in developing a fast algorithm for finding large connected components in the pattern graph. Since the possible values of q are all special kinds of connected components (e.g., clique), we start by finding the large connected components, then search for subgraphs of kind q within them. At the time of writing this thesis, only the algorithm for finding connected components is implemented. So queries looking for clique, walk, etc. are not yet available in Secondo.

Algorithm 8: FindLargeConnectedComponents(PatternGraph: mset, n: int, d: duration)→ stream(mset)

1  let R be a stream(mset), initially empty;
2  components ← FindConnectedComponents(PatternGraph, n, d);
3  foreach component: mset in components do
4    largeComponents ← ApplyThresholds(component, n, d);
5    Append largeComponents to R;
6  return R;

Finding large connected components is done in two steps, as shown in Algorithm 8. The first step finds the connected components, which need not yet be large. The second step filters the nodes and edges of each found component according to the thresholds n, d. Note that applying the thresholds can only result in edges being removed from the connected component. It is not possible that edges that do not belong to a connected component appear in the large connected component. Thus the function ApplyThresholds does not need to know about the pattern graph. Rather, it is a function of the connected component and the two thresholds n and d, computing the large connected components. While applying the thresholds, a connected component might split into several, or it might disappear completely if too many nodes and edges are filtered out. Thus, the result of ApplyThresholds is a stream of large components, which is possibly empty.

Algorithm 9: FindConnectedComponents(PatternGraph: mset, n: int, d: duration)→ stream(mset)

1  let g be a standard graph, initially empty;
2  let Components be a list of Component, initially empty;
3  let R, R′ be stream(mset), initially empty;
4  foreach graphUpdate: uset in PatternGraph do
5    let addedEdges: set be the set of added edges in graphUpdate;
6    let removedEdges: set be the set of removed edges in graphUpdate;
7    if addedEdges is not empty then
8      Insert addedEdges into the graph g;
9      Find in g the new connected components having more than n nodes, assign labels to these components, mark them as NewlyAdded, and insert them into Components;
10     Find in Components those that got additional edges, and mark them as AddedEdges;
11     Find in Components whether two or more existing components merge together, and mark them as Merged;
12     Update in g the component labels of the nodes, so that every node knows its component;
13   if removedEdges is not empty then
14     foreach edge in removedEdges do
15       Find in g whether edge belongs to a component. If so, insert this component into the set affectedComponents;
16       Remove edge from the graph g;
17     Find in affectedComponents the components whose number of nodes became < n, and mark them RemoveNow;
18     Find in affectedComponents those that split into several smaller components. Mark each small component as either RemoveNow (if its node count < n) or as Split otherwise;
19     Mark the rest of affectedComponents as RemovedEdges;
20     Update in g the component labels of the nodes that don't belong anymore to existing components, and set them to null;
21   Finalize(Components, R, R′);
22 return R;

The function FindConnectedComponents is described in Algorithm 9. It iterates over the units of the mset representing the pattern graph. This iteration is equivalent to a temporal scan of the pattern graph. The result of such a scan is a dynamic graph, as mentioned in Section 5.3.1. Every uset in the pattern graph contains two sets that describe the change to the graph edges: addedEdges and removedEdges. The algorithm starts with an empty standard graph g (the dynamic graph), and an empty list of connected components Components. While iterating over the units, the algorithm tries to efficiently maintain g and Components, so that they are up to date with the last visited unit. The Components list holds only the connected components in g whose number of nodes is ≥ n. The nodes of g know the labels of their components if they belong to any, otherwise null. At the beginning of an iteration, g is a snapshot of the pattern graph at the end instant of the unit in the previous iteration, and Components holds the connected components in g. During the iteration, the goal is to update g and Components to reflect the recent changes to the pattern graph, represented by the two sets addedEdges and removedEdges. Most of the time, an update to the pattern graph is an addition or a removal of a single edge. It occurs less often that an update involves multiple edge additions and/or removals. In the experiment that we will show in Section 5.6.2, 27.3% of the updates involved multiple edges, while 72.7% were single edge updates. Algorithm 9 is designed to handle multiple edge updates, because it is the general case.

Newly added edges are first inserted into g. Then the algorithm inspects their effect on the existing components, which can be any of the following:

1. Some of the new edges might connect to one another and to other edges in g forming a new component having n nodes or more. This new component is then added to the list Components and marked as NewlyAdded (Line 9 of the algorithm).

2. Some of the new edges might connect to one of the existing components, causing its number of edges (and possibly nodes) to grow. Such a component is marked as AddedEdges (Line 10).

3. Some of the new edges might connect several existing components together. These components are marked as Merged, as a preparation to merge them together into a single larger component. This large component will consist of the Merged components and the newly added edges.

4. The rest of the edges stay in g, and have no effect on the Components list.

These four cases are checked in the same way. A breadth first search that begins from the newly added edges is started to find their entire connected components in g. A search path is terminated if it reaches a node that already belongs to a component in Components. This search is done in linear time. Every newly added edge contributes to exactly one of the above cases. After this inspection, every component in the list Components has one of the states: NotChanged, NewlyAdded, AddedEdges, or Merged.

The algorithm proceeds to handle edge removals. It iterates over the edges to be removed and finds in g whether they belong to components. Looking up the component of an edge is O(1) because we implement g as an adjacency list using a hash table. The removal of edges might affect the existing components in one of the following ways:

1. Edges that belong to no component are removed from g, and have no further effects.

2. Edges that belong to components are removed from them. When a component loses some of its edges, it might disappear or split.

The loop in Lines 14 - 16 collects all the components that are affected by edge removals into the set affectedComponents. The connectivity of each of the affected components is then checked, to see whether it is still connected. This is done in O(nodes + edges) of the component. Now we need to update the states of the affected components. Note that these components might already have states assigned to them in the part handling added edges (this is rather rare). Table 5.3 illustrates the state composition. Column headers list the possible states after handling added edges, and the row headers list the possible states after handling removed edges. The cells of the table show the result of the composition. In the experiment that will be shown in Section 5.6.2, we have counted the occurrences of each of these component states. The percentages are listed in Table 5.4. Note that the ReDistribute state was never observed in our experiment, and this is why we list its percentage as exactly zero, and not approximately zero (∼ 0.0%) like AddRemoveMix, which occurred only a few times.

Table 5.3: Composing component states

                 NotChanged     AddedEdges     NewlyAdded   Merged
RemoveNow        RemoveNow      RemoveNow      RemoveNow    RemoveNow
RemovedEdges     RemovedEdges   AddRemoveMix   NewlyAdded   Merged
Split            Split          Split          NewlyAdded   ReDistribute
NotChanged       NotChanged     AddedEdges     NewlyAdded   Merged

Table 5.4: Component states occurrence frequencies

State           Percentage     State           Percentage
NotChanged      ∼ 52.42%       AddedEdges      ∼ 22.86%
RemovedEdges    ∼ 22.61%       AddRemoveMix    ∼ 0.0%
NewlyAdded      ∼ 0.98%        RemoveNow       ∼ 0.98%
Merged          ∼ 0.04%        Split           ∼ 0.09%
ReDistribute    0.0%
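The composition in Table 5.3 is a plain lookup. The following Python sketch simply encodes the table as a dictionary; it is an illustration of the table, not the actual Secondo implementation, and the state names are taken verbatim from above.

# Composition of component states (Table 5.3).
# Key: (state after handling added edges, state after handling removed edges).
COMPOSE = {
    ("NotChanged", "RemoveNow"):    "RemoveNow",
    ("AddedEdges", "RemoveNow"):    "RemoveNow",
    ("NewlyAdded", "RemoveNow"):    "RemoveNow",
    ("Merged",     "RemoveNow"):    "RemoveNow",
    ("NotChanged", "RemovedEdges"): "RemovedEdges",
    ("AddedEdges", "RemovedEdges"): "AddRemoveMix",
    ("NewlyAdded", "RemovedEdges"): "NewlyAdded",
    ("Merged",     "RemovedEdges"): "Merged",
    ("NotChanged", "Split"):        "Split",
    ("AddedEdges", "Split"):        "Split",
    ("NewlyAdded", "Split"):        "NewlyAdded",
    ("Merged",     "Split"):        "ReDistribute",
    ("NotChanged", "NotChanged"):   "NotChanged",
    ("AddedEdges", "NotChanged"):   "AddedEdges",
    ("NewlyAdded", "NotChanged"):   "NewlyAdded",
    ("Merged",     "NotChanged"):   "Merged",
}

def compose_state(after_additions, after_removals):
    """Look up the composed component state, e.g. NewlyAdded followed by
    RemoveNow yields RemoveNow (the case of Figure 5.1)."""
    return COMPOSE[(after_additions, after_removals)]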

Figure 5.1: State NewlyAdded followed by state RemoveNow

Figure 5.2: State NewlyAdded followed by state Split

Figure 5.3: State Merged followed by state Split

Figure 5.1 illustrates an example of a graph update that causes the state NewlyAdded followed by RemoveNow. The minimum component size in this example is n = 5. Figure 5.1 (A) displays the graph before the update. The update adds the edges {(3,1), (6,9)}, and removes the edges {(2,5), (2,6)}. The algorithm handles the added edges first, and the result is a NewlyAdded component having 6 nodes (Figure 5.1 (B)). Then the removed edges are handled, and the result is the removal of this component (RemoveNow), because it is split into two small components. Composing NewlyAdded with RemoveNow results in the state RemoveNow for the two small components in Figure 5.1 (C). Similarly, Figure 5.2 illustrates NewlyAdded followed by Split, and Figure 5.3 illustrates Merged followed by Split.

Clearly, situations like these examples hardly occur in real world applications. Consider for example the application of finding moving clusters within a group of animals. The example in Figure 5.1 happens when the two pairs of animals (3,1), (6,9) come close to one another (i.e., below the distance threshold), and the pairs (2,5), (2,6) come apart from one another at exactly the same time instant. It is unlikely that the movement is synchronized in this way.

Now that the algorithm knows the updates that occurred to every component, it propagates these updates to the result stream R′. Every mset element in R′ represents the history of a connected component in the given pattern graph. At the end of every iteration, the function Finalize appends a unit (uset) to every element in R′ representing the change that happened to this component in this iteration. We do not list here the algorithm of Finalize, but we briefly describe it in the following. It maintains two lists of mset instances R, R′. The list R holds the connected components that do not receive updates anymore. These are connected components that used to exist in the past, and are currently missing. The list R′ holds the active components that still receive updates. Elements of R′ are linked to the elements of the Components list that propagate changes to them. Finalize receives the Components list and iterates over it. According to the state of every component it does one of the following:

1. A NewlyAdded component results in creating and adding a new mset instance to R′. This reflects that a new connected component started to appear in the pattern graph, and that Finalize has to track it. This new mset instance has a single uset having the same time interval as the unit of the pattern graph that is currently being processed, and the

set of edges of the new component.

2. A component whose state is one of AddedEdges, RemovedEdges, or AddRemoveMix results in extending the associated active component in R′ by an additional uset having the same time interval as the unit of the pattern graph that is currently being processed, and the set of edges in the component after the update.

3. A non-changed component (i.e., state is NotChanged) results in extending the associated active component in R′ by an additional uset having the same time interval as the unit of the pattern graph that is currently being processed, and the same set of edges as its last uset instance.⁴

4. If the state of the component is RemoveNow, the associated active component is moved from R′ to R, provided that the duration of this active component is ≥ the minimum duration threshold d; otherwise it is discarded, that is, removed from R′ and not put into R.

5. If several components merge together (i.e., their state is Merged), this means that they have been separate in the past, but starting from now they share the same set of edges. Finalize extends each of the associated active components by an additional uset having the same time interval as the unit of the pattern graph that is currently being processed, and the set of edges that consists of all the edges in the merged components, plus the newly added edges that caused them to merge.

6. If a component is marked as Split, this means that an active component in R′ splits at this time instant into several smaller components. Finalize makes as many copies of this active component as the number of the split components, and extends every copy with an uset whose time interval is the same as the unit of the pattern graph that is currently being processed, and whose set of edges is the set of edges of one of the split components.⁵

7. If a component is marked as ReDistribute, Finalize cannot directly tell which active components should be updated, or how. This state happens only if a component is marked as Merged and then as Split in the same iteration, as illustrated in Figure 5.3. In such a case, the links between this component and the associated active components might be invalid. Finalize needs to re-investigate the update. It collects, on the one hand, all the components that are marked as ReDistribute and, on the other hand, all their associated active components, and matches the two sets. A component matches an active component if they have common graph nodes. After the matching is done, Finalize is able to assign to every component one of the states above (e.g., Split, AddedEdges), and process it accordingly.

⁴ To guarantee a unique minimal representation of msets, which is a required feature in the base moving objects model, adjacent units having the same set of elements are merged together into a single unit, whose time interval is the union of their time intervals.

⁵ Copying the whole history of the active component is expensive. Some of these copies might not appear in the result if their components disappear from the pattern graph early, before their duration reaches d. In our implementation, we represent R′ by two lists: a list that holds parts of the component history, and another list that holds indexes into the first list. Using these two lists, the parts of the history that are shared by several active components are stored only once in the first list, and referenced several times in the second list. At the time of moving an active component from R′ to R, the parts that constitute this active component are concatenated together.
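To summarize the dispatch, the following Python sketch covers cases 1 to 4 above; the Merged, Split and ReDistribute cases, as well as the index-based representation of R′ described in footnote 5, are left out for brevity. The data types and names are hypothetical and do not correspond to the C++ implementation in Secondo.

from dataclasses import dataclass, field

@dataclass
class USet:
    interval: tuple          # (start instant, end instant), numeric here
    edges: frozenset         # graph edges valid during the interval

@dataclass
class ActiveComponent:
    history: list = field(default_factory=list)   # list of USet

def finalize_step(components, active, finished, interval, min_duration):
    """One simplified Finalize pass. 'components' maps a component label to
    (state, edge set); 'active' maps labels to ActiveComponent (the list R');
    'finished' collects components that no longer receive updates (the list R)."""
    for label, (state, edges) in components.items():
        if state == "NewlyAdded":
            active[label] = ActiveComponent([USet(interval, frozenset(edges))])
        elif state in ("AddedEdges", "RemovedEdges", "AddRemoveMix"):
            active[label].history.append(USet(interval, frozenset(edges)))
        elif state == "NotChanged":
            # in the real representation, adjacent equal units would be merged
            last = active[label].history[-1]
            active[label].history.append(USet(interval, last.edges))
        elif state == "RemoveNow":
            comp = active.pop(label, None)         # may never have been tracked
            if comp is not None:
                start = comp.history[0].interval[0]
                end = comp.history[-1].interval[1]
                if end - start >= min_duration:    # keep only long-lived components
                    finished.append(comp)
        # reset the state for the next iteration (RemoveNow components would
        # also be dropped from Components in the full algorithm)
        components[label] = ("NotChanged", edges)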

At the end of Finalize, it resets the state of all components in Components to NotChanged, preparing for the next iteration.

Back to Algorithm 8, the next step after finding the connected components is to apply the thresholds n, d to them, in order to yield the large connected components. Note that these thresholds apply to the nodes of the connected components, while the connected components are represented as mset instances that store the edges. Hence, the ApplyThresholds function needs to transform the component it receives from a representation of graph edges into a representation of graph nodes. Using a linear scan of the received component, it builds a hash table in the form shown in Figure 5.4 (A), where every node has one entry. An interval i entry represents one of the intervals on which this NodeID belongs to the connected component. It also stores pointers to the units of the connected component that fall within this interval. ApplyThresholds also builds a map in the form shown in Figure 5.4 (B), where Unit is a pointer to one of the units (uset) of the connected component. This map stores the number of nodes in every unit, so that the threshold n can be quickly applied.

[Figure: (A) the hash table for indexing the pattern graph nodes, with one entry per NodeID listing its membership intervals; (B) the map for indexing the node count in pattern graph units, with one entry per unit storing its number of nodes.]

Figure 5.4: Helper indexes for representing the nodes of the pattern graph
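To make the interplay of the two helper structures concrete, the following Python sketch applies the two thresholds iteratively, assuming a component is given as a list of (interval, node set) units with numeric instants. It recomputes the node presence information in every pass instead of synchronizing the hash table and the map incrementally, so it is only an illustration of the idea behind ApplyThresholds (described next), not the actual implementation, and it ignores the edge representation.

from collections import defaultdict

def apply_thresholds(component, n, d):
    """Iteratively apply the node-count threshold n and the duration threshold d,
    then split the surviving units at definition gaps."""
    units = [(iv, set(nodes)) for iv, nodes in component]
    changed = True
    while changed:
        changed = False
        # d threshold: drop membership intervals shorter than d
        presence = defaultdict(list)                 # node -> unit indexes
        for idx, (iv, nodes) in enumerate(units):
            for node in nodes:
                presence[node].append(idx)
        for node, idxs in presence.items():
            for run in _consecutive_runs(idxs):
                if units[run[-1]][0][1] - units[run[0]][0][0] < d:
                    for idx in run:
                        units[idx][1].discard(node)
                        changed = True
        # n threshold: drop units that no longer have enough nodes
        kept = [(iv, nodes) for iv, nodes in units if len(nodes) >= n]
        if len(kept) != len(units):
            changed = True
        units = kept
    return _split_at_gaps(units)

def _consecutive_runs(indexes):
    runs, run = [], [indexes[0]]
    for i in indexes[1:]:
        if i == run[-1] + 1:
            run.append(i)
        else:
            runs.append(run); run = [i]
    runs.append(run)
    return runs

def _split_at_gaps(units):
    pieces, piece = [], []
    for iv, nodes in units:
        if piece and iv[0] != piece[-1][0][1]:       # definition gap
            pieces.append(piece); piece = []
        piece.append((iv, nodes))
    if piece:
        pieces.append(piece)
    return pieces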

The hash table and the map are linked to the units of the mset representing the connected component, in such a way that the three of them can be quickly synchronized if any of them is changed. ApplyThresholds makes sure that the three structures are always synchronized. To apply the d threshold, it scans the hash table and filters out all the intervals whose duration is less than d. Such a change affects the node count of the units, and it is reflected in the map by means of synchronization. To apply the n threshold, ApplyThresholds scans the map and removes the units that have less than n nodes. Such a change affects the hash table, and it introduces definition gaps within the connected component. ApplyThresholds keeps applying the two thresholds iteratively until no more changes occur. It splits the connected component at the locations where it has definition gaps. The pieces that come out of this split are all large connected components, so they are appended to the result stream.

Back to Algorithm 7 Line 12, the function FindSubGraphs receives this stream of large connected components, and searches for the sub graphs of type q within them. Up to the time of writing this thesis, this function has not been implemented. Naively, one would search every unit of the large connected components for sub graphs of type q, and concatenate the results to construct the whole history of the sub graph. Smarter algorithms are still missing, and require further research.

Algorithm 7 has one more missing part. It deals only with undirected pattern graphs. This means that the time-dependent predicate α must be commutative. So far, the group patterns that we have seen in the literature can be expressed using commutative predicates. Most of the patterns are expressed based on the distance between objects, or on a derivative of this distance. Still, the crosspattern operator would be more expressive if it allowed for non-commutative predicates as well. This completes our illustration of the crosspattern algorithm.

5.4 Examples

This section illustrates the expressive power of the proposed operators. The goal is to illustrate that the three operators reportpattern, gpattern, and crosspattern are able to express a wide variety of patterns. We focus on expressing the patterns in Table 2.1. More application examples are given in Chapter 7.

Example 5.1. Find a flock of at least 20 gazelles moving within a circle of radius 30 m for a duration of 10 minutes.

This is a (20, 30m, 10min) flock, similar to number 8 in Table 2.1. The difference is that the notion of time here is continuous. The query looks as follows:

query Gazelles feed {a} Gazelles feed {b}
  symmjoin[ .Id_a < ..Id_b ]
  crosspattern[ .Id_a, .Id_b, distance(.Trip_a, .Trip_b) < 30.0,
    ten minutes, 20, "clique" ]
  transformstream
  consume;

Example 5.2. Find a convoy of at least 20 gazelles moving within a circle of radius 30 m for a duration of 10 minutes.

The convoy patternoid (number 7 in Table 2.1) is a variant of the above flock patternoid. Unlike a flock, the members of a convoy may change. This can be expressed as follows:

query Gazelles feed {a} Gazelles feed {b}
  symmjoin[ .Id_a < ..Id_b ]
  crosspattern[ .Id_a, .Id_b, distance(.Trip_a, .Trip_b) < 30.0,
    one minute, 20, "clique" ]
  transformstream
  filter[ inst(final(.elem)) - inst(initial(.elem)) >= ten minutes ]
  consume;

That is, every object in the patternoid stays at least one minute, while the patternoid itself exists for at least ten minutes, as enforced by the filter operator. The effect of this is that the objects of the patternoid may change.

Example 5.3. Find a leadership pattern that consists of 3 gazelles heading in a northern direction, followed by a moving cluster of 20 gazelles.

A moving cluster is another variant of a flock. Every member is required to keep a small distance to some (not necessarily all) other members.

let follow = vec("baba", "bab.a", "baab");

query reportpattern[
    leader: Gazelles feed
      gpattern[.Id, mdirection(.Trip) between [ 45.0, 135.0 ],
        three minutes, 3, "exactly" ],
    cluster: Gazelles feed {g1} Gazelles feed {g2}
      symmjoin[ .Id_g1 < ..Id_g2 ]
      crosspattern[.Id_g1, .Id_g2, distance(.Trip_g1, .Trip_g2) < 30.0,
        ten minutes, 20, "cc" ];
    stconstraint("cluster", "leader", follow) ]
  extend[ clusterMReg: fun(t: TUPLE)
    Gazelles feed boundingregion(.Id, .Trip, attr(t, cluster)) ]
  extend[ leaderFinSet: val(final(.leader)),
    clusterInitSet: val(initial(.cluster)) ]
  filter[ always(mdirection(rough_center(.clusterMReg)) between [ 45.0, 135.0 ])
    and (.leaderFinSet issubset .clusterInitSet) ]
  consume;

where mdirection computes the time-dependent angle, in degrees, between the moving point's motion vector and the x-axis, and rough_center computes the center of a moving region as a moving point. The query finds a leader group, followed by a moving cluster, where both are heading between north-east and north-west. The moving cluster patternoid is expressed as a connected component (cc) in the pattern graph, rather than a clique as in the case of the flock patternoid in Example 5.1.

Example 5.4. The encounter patternoid is a variant of the convergence patternoid in Example 1.3. It is a convergence that ends with the moving objects meeting together. We express it as follows:

query Gazelles feed {a} Gazelles feed {b}
  symmjoin[ .Id_a < ..Id_b ]
  crosspattern[ .Id_a, .Id_b, isdecreasing(distance(.Trip_a, .Trip_b)),
    ten minutes, 20, "clique" ]
  transformstream
  extend[ convMReg: fun(t2: TUPLE)
    Gazelles feed boundingregion(.Id, .Trip, attr(t2, elem)) ]
  filter[ area(val(final(.convMReg))) <= const_pi() * 30.0 * 30.0 ]
  consume;

that is, the gazelles at the end instant of the patternoid are required to be within a circular area with a radius of 30 m.

5.5 Optimization

This section discusses the integration of the proposed group STP operators with query optimizers. The techniques that we propose in this section extend a MOD optimizer, so that it will be able to generate optimized execution plans for group STP queries. In order to make this discussion concrete, we describe the proposed optimization techniques in the context of the Secondo optimizer [GBA+04]. These techniques are however generic, and it should be possible to apply them in other optimizer frameworks.

The three operators reportpattern, gpattern, and crosspattern are stream operators, because they yield stream(mset). Hence, they are defined in the SQL-like syntax as table expressions (i.e., expressions that compute tables). That is, they can be used in the FROM clause, for instance. The following query illustrates the SQL-like syntax for Example 5.3:

SELECT leader, cluster
FROM ( SELECT leader, cluster,
         val(final(leader)) as leaderFinSet,
         val(initial(cluster)) as clusterInitSet,
         boundingregion(Gazelles, Id, Trip, cluster) as clusterMReg
       FROM reportpattern([
              gpattern(Gazelles, id, mdirection(Trip) between [ 45.0, 135.0 ],
                three minutes, 3, "exactly") as leader,
              crosspattern(
                ( SELECT * FROM Gazelles g1, Gazelles g2 WHERE g1.Id < g2.Id ),
                g1.Id, g2.Id, distance(g1.Trip, g2.Trip) < 30,
                ten minutes, 20, "scc") as cluster ],
            [ stconstraint(cluster, leader, follow) ] ))
WHERE [ always(mdirection(rough_center(clusterMReg)) between [ 45.0, 135.0 ]),
        leaderFinSet issubset clusterInitSet ]

The reportpattern operator in this example yields a table having two attributes leader and cluster of type mset. The number and the names of the attributes are the same as the number and the aliases of the patternoids. Similarly the gpattern and the crosspattern operators are defined as table expressions. Each of them can be directly used in the FROM clause. The table that they return has a single attribute whose name is elem, and whose type is mset.

The next step is to define translation rules for the three operators. The straightforward translation rule of the reportpattern operator is as follows:

reportpattern([p1 as a1, ..., pn as an], [tc1, ..., tcm]) →
    reportpattern[a1 : p1, ..., an : pn; tc1, ..., tcm]

where pi is a patternoid, ai is an alias, and tci is a temporal constraint. A translation rule defines one way of generating the execution plan for an SQL-like expression. It has the structure:

SQL-like expr → Executable expr [:- conditions]

It reads: the SQL-like expr is translated into the Executable expr under the conditions. The optional conditions at the end of the rule specify the circumstances under which the rule is applicable. There can be several valid translations for the same SQL-like expression. Optimizers apply different techniques for selecting the most efficient translation (e.g., cost-based optimization). The above translation rule of the reportpattern operator does only a translation of syntax. It does not propose efficient execution plans. The following subsections propose more sophisticated translation rules that invoke indexes and other optimization mechanisms for the three operators gpattern, crosspattern, and reportpattern.

5.5.1 Optimizing The gpattern Operator

The gpattern operator evaluates the time-dependent predicate for every tuple in the input stream. Obviously, a tuple whose evaluated mbool is always false cannot be part of a result. We would like to use indexes to remove such tuples from the input stream before evaluating the gpattern operator. Hence, the translation rule should analyze the time-dependent predicates and translate them into index accesses, whenever suitable indexes are available. In other words, we need to extend the optimizer with translation rules for the time-dependent predicates. This has already been done, and described in the optimization of the stpattern operator in Section 4.4. This section briefly describes how to apply the same optimization technique to the gpattern operator.

We assume that the underlying optimizer framework already has translation rules defined for the standard (i.e., non-temporal) selection predicates. That is, given a query that consists of a table expression in_stream and a standard selection predicate f, it is possible to invoke a function optimize(in_stream, f), which yields an optimal execution plan for this query. If in_stream is a database relation, for instance, optimize might yield an index scan. If it is not possible to use indexes, it yields the straightforward plan in_stream filter[f].

We define the translation rules of the time-dependent predicates on top of the translation rules of the standard predicates. The same approach was described in Section 4.4. There we have defined a mapping that translates every time-dependent predicate α into a standard predicate f such that f ⇔ sometimes(α). For example:

inside: mpoint × region → mbool

is mapped into the standard predicate:

passes: mpoint × region → bool

where it is true that

passes(mpoint1, region1 ) ⇔ sometimes(inside(mpoint1, region1 )).

The complete mapping is listed in Table 4.3. Let map(α) be the mapping of the time-dependent predicate α into a standard predicate f as defined in the table. The translation rule of the gpattern operator is as follows:

gpattern(S, id, α, d, n, q) → optimize(S, map(α)) gpattern[id, α, d, n, q]

It consists of two steps. Firstly, the time-dependent predicate α is rewritten into the standard predicate map(α). Secondly, the already existing translation rules of map(α) are invoked using optimize(S, map(α)). The following example illustrates this translation rule. Suppose that we wish to express a group STP query, called drinking, that reports groups of gazelles that gather to drink from some lake. The SQL-like query would be as follows:

SELECT boundingregion(Gazelles, Id, Trip, elem) as drinking
FROM gpattern( Gazelles, Id, Trip inside buffer(theLake, 50),
               half hour, 20, "atleast" );

The gpattern operator in this query yields a relation that contains one mset attribute with the default name elem. The buffer operator creates a larger region around the region theLake, whose boundary is 50 m away from the boundary of theLake. The query finds groups of gazelles that concurrently stay within a distance of 50 m from theLake, for at least half an hour. According to this translation rule, and assuming that a suitable index exists in the database, the optimizer generates the following execution plan:

query Gazelles_Trip_sptuni windowintersectsS[bbox(buffer(theLake, 50))]
  sort rdup Gazelles gettuples
  filter[.Trip passes buffer(theLake, 50)]
  gpattern[.Id, .Trip inside buffer(theLake, 50), half hour, 20, "atleast"]

  transformstream
  extend[drinking: fun(t: TUPLE)
    Gazelles feed boundingregion(.Id, .Trip, attr(t, elem))]
  consume;

In order for the optimizer to generate this execution plan, it first rewrites the time-dependent predicate Trip inside buffer(theLake, 50) into the standard predicate Trip passes buffer(theLake, 50). This is the mapping from α into map(α). The rest of the work is done by optimize(S, map(α)). It invokes the translation rules of the standard passes predicate, which already exist in the optimizer. In this example, the predicate passes is translated into a window query on a spatial R-tree index, Gazelles_Trip_sptuni. The query window is the bounding box of the buffer region around the lake. The result of the window query is refined using the passes predicate. Thus, only the tuples that fulfill passes, and consequently fulfill sometimes(inside), are passed to the gpattern operator.
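Schematically, the translation rule can be pictured as a small function that looks up map(α) and delegates to the existing machinery. The following Python sketch is purely illustrative: the names MAP_ALPHA and translate_gpattern, the string-based plan construction, and the prefix notation for predicates are assumptions made for the example, and the produced text does not claim to be exact Secondo plan syntax.

# Excerpt of the mapping alpha -> map(alpha) with map(alpha) <=> sometimes(alpha),
# cf. Table 4.3; only one entry is reproduced here.
MAP_ALPHA = {
    "inside": "passes",
}

def translate_gpattern(S, id_attr, alpha, d, n, q, optimize):
    """Sketch of the rule
       gpattern(S, id, alpha, d, n, q) -> optimize(S, map(alpha)) gpattern[id, alpha, d, n, q].
    'alpha' is a pair (operator, arguments); 'optimize' stands for the already
    existing translation rules of standard selection predicates and is assumed
    to return a plan as a string here."""
    op, args = alpha
    standard_predicate = (MAP_ALPHA[op], args)      # map(alpha)
    prefix = optimize(S, standard_predicate)        # e.g. an index scan plus refinement
    return prefix + f" gpattern[{id_attr}, {op}({', '.join(args)}), {d}, {n}, {q}]"

# Toy use: pretend no index is available, so 'optimize' falls back to a filter.
plan = translate_gpattern(
    "Gazelles feed", "Id", ("inside", ["Trip", "buffer(theLake, 50)"]),
    "half hour", 20, '"atleast"',
    optimize=lambda S, p: f"{S} filter[{p[0]}({', '.join(p[1])})]")
print(plan)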

5.5.2 Optimizing The crosspattern Operator

The optimization of the crosspattern operator is similar to that of the gpattern operator. The only difference is that the time-dependent predicate in the crosspattern operator is applied to pairs of moving objects. So, it is handled as a join predicate rather than a selection predicate as in the gpattern operator. Nevertheless, the strategy is the same. The time-dependent predicate α in the crosspattern operator is mapped into a standard join predicate by map(α). We assume that the optimizer defines a function optimize(S1, S2, f) that accepts two table expressions S1, S2 and a standard join predicate f and yields an optimized execution plan. Hence, the translation rule is:

crosspattern(S, id1, id2, α, d, n, q) → optimize(S, S, map(α)) crosspattern[id1, id2, α, d, n, q]

5.5.3 Optimizing The reportpattern Operator

The reportpattern operator is evaluated using a constraint satisfaction algorithm that incrementally solves the CSPk−1 first, then extends it to CSPk. During the evaluation, the algorithm knows temporal information from the patternoid operators evaluated so far, and from the temporal constraints. This temporal information can be used to restrict the definition time of the input trajectories before the evaluation of the remaining patternoid operators. This is the basic idea that we use for generating efficient execution plans for the reportpattern operator. We perform temporal reasoning in order to extract and use this temporal information. This will be explained in detail in Chapter 6.

5.5.4 Notes On The Implementation

Unfortunately we were not able, in practice, to integrate the gpattern, crosspattern, and reportpattern operators with the Secondo optimizer. The Secondo optimizer is designed so that it is easy to extend it with new operators and predicates that are to be invoked within the FROM and the WHERE clauses. We have already shown in Section 4.5.2 the optimizer extension for the stpattern predicate. It is, however, not currently extensible by table expressions that are to be invoked within the FROM clause. Since the three operators are table expressions, we were not able to implement this optimizer extension.

Still, one is able to write optimized execution plans directly in the Secondo executable language, as we have done all over this chapter, and to use the available indexes in these queries. It is actually a nice feature of Secondo that execution plans can be written directly by the user in a precisely defined query language.

5.6 Experimental Evaluation

This section experiments with the gpattern and the crosspattern operators. The reportpattern operator is similar to the stpattern operator, which has already been evaluated in Section 4.6. We do not evaluate it here again.

5.6.1 Evaluating The gpattern Operator

This experiment evaluates the gpattern operator. We use datasets of different sizes to evaluate the scalability. We start with the Trains relation from the berlintest database that comes with the Secondo distribution. This relation is replicated n times to scale up the dataset size. The replication is done carefully so that the resulting relation will have patterns that can be extracted by the gpattern operator. Then we issue group STP queries using the gpattern operator to find these patterns and to evaluate the operator's correctness. The scalability is evaluated by repeating the experiment for different values of n.

Preparing The Data

The Trains relation has the following schema:

Trains[Id: int, Line: int, Up: bool, Trip: mpoint]

It represents underground trains in Berlin. It was generated by simulating the train trajectories based on the real train schedules and the underground train network of Berlin. This relation is used to generate several relations with increasing sizes, where each is an n-fold replica of Trains. The replication is done by making n copies of every train, and adding random position shifts to the units of the copied trains. These shifts are bounded by 500 meters in each dimension x, y. So these disturbed copies are at most ∼707 meters away from the original train at any given time instant. The query that generates these replicas is as follows:

let Trains200 = Trains feed
  intstream(1, 199) namedtransformstream[ No ]
  product
  projectextend[ No, Id, Line;
    Trip: randomshiftdelay(.Trip, create_duration(0, 0), 500.0, 500.0) ]
  Trains feed extend[ No: 0 ] project[ No, Id, Line, Trip ]
  concat
  extend[ tupID: (.No * 100000) + (.Line * 1000) + .Id ]
  consume;

The replica is computed by cross joining the Trains relation with a stream of the integers 1 to (n − 1). For every tuple in the cross product, the randomshiftdelay operator adds random (x, y) position shifts to the units of the Trip attribute, where the shifts in x and the shifts in y are within the range [-500, 500] meters. This cross product is then concatenated with the original Trains relation, and a tuple identifier tupID is computed for all the tuples in this new relation. The original Trains relation has 562 tuples, so the n-fold replica has n × 562 tuples (e.g., Trains5 has 5 × 562 = 2810 tuples). Table 5.5 lists the relations that were generated for this experiment.

Relation Name   Number of Tuples   Total Number of Observations   Size On Disk
Trains5         2,810              257,720                        41.5 MB
Trains10        5,620              515,440                        83 MB
Trains15        8,430              773,160                        124.5 MB
Trains30        16,860             1,546,320                      249 MB
Trains50        28,100             2,577,200                      414 MB
Trains70        39,340             3,608,080                      580 MB
Trains85        47,770             4,381,240                      673 MB
Trains100       56,200             5,154,400                      828 MB
Trains120       67,440             6,185,280                      993 MB
Trains150       84,300             7,731,600                      1.2 GB
Trains200       112,400            10,308,800                     1.6 GB
Trains300       168,600            15,463,200                     2.4 GB

Table 5.5: The datasets used for evaluating the gpattern operator

Generating The Queries

The intention behind generating the data in this way is to have groups of trains that move close to the original train that they were replicated from.

Given one of the original trains, the gpattern operator can be used to retrieve the whole group of the replicated trains. The query looks like this:

query Trains100 feed
  gpattern[ .tupID, distance(.Trip, train7) < 800.0,
    ThirtyMinutes, 80, atleast ]
  transformstream
  consume

It finds the groups of at least 80 trains that concurrently stayed within a distance of 800 meters from train7 for at least 30 minutes. In the experiment, we have randomly picked ten query objects instead of train7. They are selected by the query:

let QueryObjects = Trains sample[10, 0.0000001] consume;

The sample operator selects a random sample from the input relation having the size indicated in the arguments. The main query in this experiment is:

query QueryObjects feedproject[ Id, Trip ] {q}
  extendstream[ grp: fun(t: TUPLE)
    Trains100 feed
      gpattern[ .tupID, distance(.Trip, attr(t, Trip_q)) < 800.0,
        ThirtyMinutes, 80, atleast ] ]
  remove[ Trip_q ]
  consume;

It basically iterates over the ten query objects, and invokes for each of them the gpattern operator to find the corresponding group of replicated trains. Note that the run-time of this query includes the time of evaluating the time-dependent predicate distance(.Trip, attr(t, Trip_q)) < 800.0. As mentioned before, we expect that evaluating the time-dependent predicate will be the most expensive part in the evaluation algorithms of the gpattern and crosspattern operators. We evaluate this statement experimentally here. The query above is therefore separated into two queries. The first query evaluates the time-dependent predicate and stores the mbool results into a Secondo relation. The second query invokes the gpattern operator to match the pattern, and it uses the pre-evaluated mbools instead of the time-dependent predicate. The two queries are as follows:

let Trains100Distance = QueryObjects feedproject[ Id, Trip ] {q}
  loopjoin[ fun(t: TUPLE)
    Trains100 feed
      projectextend[ tupID;
        PredicateResult: distance(.Trip, attr(t, Trip_q)) < 800.0 ] ]
  remove[ Trip_q ]
  consume;

query Trains100Distance feed
  groupby[ Id_q; grp: group feed
    gpattern[ .tupID, .PredicateResult, ThirtyMinutes, 50, atleast ]
    transformstream
    extract[ elem ] ]
  consume;

Results

These three queries are run once for each of the relations in Table 5.5, after changing the relevant parameters. The response time of every run is divided by ten to compute the average run-time per query object. Figure 5.5 plots this average run-time in minutes against the relation size in 1000 tuples. It shows two run-time curves. The upper one includes the time of evaluating the time-dependent predicate, while the lower curve excludes it. The difference between the two curves is the time taken for evaluating the time-dependent predicates.

[Figure: average run-time in minutes (y-axis) against the number of tuples x 1000 (x-axis); two curves, one including and one excluding the time of evaluating the time-dependent predicate.]

Figure 5.5: The run-time of the gpattern operator

In light of these results, we conclude that evaluating the time-dependent predicate has the dominant cost in the evaluation of gpattern (Algorithm 3). The run-time scales linearly with the dataset size, and it stays within a few minutes for large inputs. This experiment was done on an Ubuntu Linux 64 bit machine with 8 GB RAM and an AMD processor with 6 cores, 800 MHz each.

5.6.2 Evaluating The crosspattern Operator

This section experiments with the crosspattern operator. We combine two datasets. The first contains simulated flocks, moving in the Cartesian space. The second is a background dataset that is used to hide the flocks. It is generated using the BerlinMOD data generator [DBG09]. The queries of this experiment invoke the crosspattern operator to extract these hidden flocks.

Preparing The Data

We used the C++ Boids data generator⁶ by Christopher Kline to generate flocks. This data generator belongs to the research area of artificial life. It simulates a decentralized emergent behavior that mimics groups of birds. Boids are bird-like moving objects. The simulation is based on the idea that every bird decides its route based on its local neighborhood. The local neighborhood of a bird consists of the other birds visible to it. Every individual Boid continuously corrects its route by applying a list of simple steering rules, with decreasing priorities (a toy sketch of the core rules follows the list):

1. Avoid collision with neighbors and with obstacles in the space.

2. Keep close to the center of the local neighborhood.

3. Maintain safe cruising distances to neighbors.

4. Match the velocity to the velocity of neighbors.

5. Wander.

6. Maintain a stable altitude (frequent ascents and descents are exhausting for birds).
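As announced above, here is a toy 2D sketch of the core steering idea, covering simplified versions of rules 2 to 4 (cohesion, separation and velocity matching). It is not the Kline generator: obstacles, the priority scheme, wandering and level flight are ignored, positions and velocities are plain tuples, and the weights are made-up values.

import math

def step_boid(boid, neighbors, dt=0.1,
              sep_dist=5.0, w_sep=1.5, w_coh=0.5, w_ali=0.8):
    """One update of a single boid (a dict with 'pos' and 'vel' 2D tuples)."""
    if neighbors:
        # cohesion: steer towards the centre of the local neighborhood
        cx = sum(n["pos"][0] for n in neighbors) / len(neighbors)
        cy = sum(n["pos"][1] for n in neighbors) / len(neighbors)
        coh = (cx - boid["pos"][0], cy - boid["pos"][1])
        # alignment: match the average velocity of the neighbors
        ax = sum(n["vel"][0] for n in neighbors) / len(neighbors)
        ay = sum(n["vel"][1] for n in neighbors) / len(neighbors)
        ali = (ax - boid["vel"][0], ay - boid["vel"][1])
        # separation: move away from neighbors that are too close
        sx = sy = 0.0
        for n in neighbors:
            dx = boid["pos"][0] - n["pos"][0]
            dy = boid["pos"][1] - n["pos"][1]
            dist = math.hypot(dx, dy)
            if 0 < dist < sep_dist:
                sx += dx / dist
                sy += dy / dist
        vx = boid["vel"][0] + dt * (w_coh * coh[0] + w_ali * ali[0] + w_sep * sx)
        vy = boid["vel"][1] + dt * (w_coh * coh[1] + w_ali * ali[1] + w_sep * sy)
        boid["vel"] = (vx, vy)
    boid["pos"] = (boid["pos"][0] + dt * boid["vel"][0],
                   boid["pos"][1] + dt * boid["vel"][1])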

The data generator allows for defining several groups of Boids, and assigning every group a different feather color. Individual Boids flock only with Boids having their own feather color. We have modified this data generator for the purpose of this experiment. The modifications are as follows:

1. The original data generator outputs 3D coordinates. For this experiment we require 2D coordinates. We have modified the functions that apply the steering rules, so that they work in the 2D space, and made the generator output 2D coordinates.

2. We have added a special Boids group whose members freely fly in the background without flocking. This group responds only to the steering rules: avoid collision, wander, and level flight. The intention was to use this group to hide the flocks. However, we decided to use the BerlinMOD data instead. We found that these background Boids more or less perform random walks, and that the BerlinMOD data is more realistic.

⁶ http://www.behaviorworks.com/people/ckline/cornellwww/boid/boids.html

3. The original generator was modified, so that it accepts a world as a parameter. Boids are restricted in their movement by the boundaries of this world. They never cross it. This is necessary to overlay the Boids dataset with the BerlinMOD dataset.

We have integrated the modified C++ Boids data generator into Secondo, and it can now be invoked using the generateboids operator. It has the signature:

vector(int) × vector(real) × instant × duration →
  stream(tuple< (BoidID, int), (T, instant), (X, real), (Y, real) >)

where the first argument is a vector that determines the number of groups, and the number of Boids in every group. The first entry in this vector is the size of the background group. In this experiment we set it to zero, because we use the BerlinMOD data in the background. The second argument defines the obstacles that the Boids have to avoid. The world, in which the Boids move, should be given as the first entry of this vector. The Boids start their trips inside this world, and they avoid hitting or crossing its borders while they fly. For simplicity, the operator assumes that obstacles are circles. It expects triples in the form of (center x, center y, radius). The third and the fourth arguments are the start time and the duration of the simulation. The operator yields a stream of observations, each of which is a time-stamped 2D position of a Boid. These observations can be interpolated using the Secondo approximate operator to construct the trajectories. The queries are as follows:

let xyt = generateboids(
    create_vector(0, GROUP_COUNT, GROUP_COUNT, GROUP_COUNT),
    create_vector(getx(WORLD_CENTER), gety(WORLD_CENTER), WORLD_RADIUS),
    create_instant(P_STARTDAY + randint(P_NUMDAYS), randint(64800000)),
    ONE_HOUR)
  consume;

let Boids = xyt feed
  projectextend[ BoidID, T; Position: makepoint(.X, .Y) ]
  sortby[ BoidID, T ]
  groupby[ BoidID; Trip: group feed approximate[T, Position, OneHour] ]
  consume;

where xyt is the relation generated by calling the generateboids operator. GROUP_COUNT is set to 30 in this experiment. The first query generates observations for three flocks, each having 30 Boids. The world obstacle is defined in terms of WORLD_CENTER and WORLD_RADIUS, which are derived from the BerlinMOD background data, so that the flocks spatially overlap it. Similarly the simulation start time and duration temporally overlap the BerlinMOD data. The second query interpolates the observations into trajectories. Firstly, the point attribute Position is computed from the X and Y attributes. Then the tuples are sorted and grouped by BoidID.

Finally, the mpoint Trip attribute is interpolated from the observations using the approximate operator. After generating the flocks, we compute an estimate of the average distance between the Boids that belong to the same flock. This will be used as a distance threshold in the experiment queries.

The BerlinMOD benchmark [DBG09] is used to generate a background dataset in which we hide the Boid flocks. It simulates car trips given a street network (the default is the network of Berlin, unless otherwise provided) and a distribution of home and work places in the different zones of the city. It simulates home-work trips, work-home trips, and leisure trips, considering many factors that make the simulation realistic. The data generator accepts a real number (called the scale factor) that controls the dataset size. Using this scale factor, the data generation script computes the two parameters:

1. number of cars = round(2000 × √(scale factor))
2. simulation duration in days = round(28 × √(scale factor))
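These two formulas are all that is needed to predict the dataset size for a given scale factor; the following snippet merely evaluates them.

import math

def berlinmod_size(scale_factor):
    """Number of cars and simulation days for a given BerlinMOD scale factor."""
    cars = round(2000 * math.sqrt(scale_factor))
    days = round(28 * math.sqrt(scale_factor))
    return cars, days

print(berlinmod_size(0.005))   # -> (141, 2), the default scale factor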

The default scale factor is 0.005, which generates trips for 141 cars over approximately 2 days. A scale factor of 1 generates trips for 2000 cars over 28 days. The script generates one trajectory per car. One can optionally instruct it to divide this trajectory into separate trips. In this experiment we do so.

We concatenate the Boids relation with the BerlinMOD relation containing the trips (called dataMtrip). The result relation has the schema:

AllData[Id: int, Trip: mpoint]

The identifier Id is sequentially assigned after the concatenation.

The crosspattern operator expects a stream of tuples, each having a pair of moving objects. Hence, we compute a self-join on this relation. The join is computed based on the distance, because the query will be looking for flocks. That is, a pair of trips join if their spatial distance is sometimes less than some threshold. This threshold is estimated by sampling the pairwise distance between the mpoints in the Boids relation. Computing the spatiotemporal join on such large datasets is expensive (several hours). We used a cluster of 6 machines, each hosting two logical machines, to compute these joins. This is done using the HadoopParallel Algebra in Secondo. It provides operators that interact with both Hadoop and Secondo in order to run spatiotemporal join queries on a cluster. This is unpublished preliminary work of our colleague Jiamin Lu. Joining the datasets that we used in this experiment took from approximately 1.5 to approximately 24 minutes on the cluster. The relation resulting from this join has the schema:

AllDataJoined[Id_r1: int, Id_r2: int, Trip_r1: mpoint, Trip_r2: mpoint]

We expect the run-time of the STP query that finds the flocks to be affected by: (1) the number of flocks, and (2) the size of AllDataJoined. The higher the number of flocks, the higher the number of components that are maintained by the crosspattern algorithm during its iterations. So we expect that the run-time will increase when the number of flocks is increased. We also expect that the run-time will be proportional to the size of AllDataJoined. This is because the time-dependent predicate will be evaluated for every tuple, and also because the number of graph edges and the number of graph updates will increase. Therefore we have generated the six datasets shown in Table 5.6 by varying these two parameters.

Table 5.6: Description of the datasets of the crosspattern experiment

Flocks   BerlinMOD      Tuples in   Tuples in       Units in        Disk size of
         scale factor   AllData     AllDataJoined   AllDataJoined   AllDataJoined
3        0.01           1,855       2,579           2,804,947       402 MB
10       0.01           2,065       13,462          14,173,430      2 GB
3        0.1            15,253      17,002          20,715,492      2.9 GB
10       0.1            15,463      119,956         133,228,643     19 GB
3        0.25           36,436      54,062          67,140,914      9 GB
10       0.25           36,646      367,131         409,777,381     57 GB

Generating The Queries

Using each of these six datasets, we issue the query:

query AllDataJoined feed
  crosspattern[ .Id_r1, .Id_r2,
    smooth(distance(.Trip_r1, .Trip_r2) < DistThreshold, FIVE_MINUTES),
    FOURTY_MINUTES, 25, cc ]
  namedtransformstream[ Flock ]
  addcounter[ Cnt, 1 ]
  consume;

This query should retrieve all the flocks that we have hidden in the data. The distance threshold DistThreshold is computed during the data generation by sampling the pairwise distance of the Boids belonging to the same flock. The smooth operator negates all the false units in the given mbool instance that are shorter than 5 minutes, so that they become true. We have added this operator because we noticed that pairs of Boids which belong to the same flock often come apart from one another for short time periods. While in reality this is normal wandering behavior, it affects the results of the crosspattern operator. Some flocks might not appear in the result because they split into smaller flocks for a few seconds. The smooth operator bridges such short intervals of predicate unfulfillment to avoid this problem.
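Based on this description, the effect of smooth on an mbool can be sketched as follows. The unit representation ((start, end), value) with numeric instants and the exact border handling are assumptions made for the example; the Secondo operator may differ in detail.

def smooth(mbool_units, min_false_duration):
    """Flip every false unit shorter than min_false_duration to true, then
    merge adjacent units with equal value. Units are assumed to be ordered."""
    flipped = []
    for (start, end), value in mbool_units:
        if not value and (end - start) < min_false_duration:
            value = True
        flipped.append(((start, end), value))
    merged = []
    for (start, end), value in flipped:
        if merged and merged[-1][1] == value and merged[-1][0][1] == start:
            merged[-1] = ((merged[-1][0][0], end), value)
        else:
            merged.append(((start, end), value))
    return merged

# Example: a 3-minute dip below the distance threshold is bridged.
units = [((0, 20), True), ((20, 23), False), ((23, 40), True)]
print(smooth(units, 5))   # -> [((0, 40), True)]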

Results

The query above was able to correctly find all the flocks that we have hidden in the data. It also reported no other flocks from the background data. This is because in the query we have set the minimum group size to 25, and the minimum duration to 40 minutes. These conditions made it hard for the BerlinMOD trajectories in the background to accidentally fulfill the pattern. Figure 5.6 shows the run time of this query. The x-axis shows the input size (i.e., the number of tuples in AllDataJoined), and the y-axis shows the time in minutes.

[Figure: run-time in minutes (y-axis) against the number of tuples x 10,000 (x-axis); two curves, one including and one excluding the time of evaluating the time-dependent predicate.]

Figure 5.6: The run-time of the crosspattern operator

The figure shows two curves. The larger curve shows the response time of the query. The smaller curve shows the response time after excluding the time taken for evaluating the time-dependent predicate. That is, the smaller curve shows the time taken to construct the pattern graph and to find the large connected components. Clearly the response time is dominated by the time of evaluating the time-dependent predicates. This suggests that it is not worthwhile to further improve our algorithm for finding large connected components (Algorithm 9), because the time it takes is already negligible compared to the total response time. This experiment was done on an Ubuntu Linux 64 bit machine with 8 GB RAM and an AMD processor with 6 cores, 800 MHz each.

5.6.3 A Remark

In this experiment, we have noticed that the crosspattern operator is very sensitive to its parameters. Both the results and the run time are affected by slight changes to the parameter values. If the DistThreshold is too small, for example, the components will split more often, and might disappear from the final results. If it is too large, the number of components that the evaluation algorithm maintains during its iterations will increase, because objects from the background data will satisfy this distance threshold and the minimum group count. These components will be maintained during the evaluation, thus increasing the run-time, but they will probably not appear in the results, because they will not fulfill the minimum duration threshold. We have noticed that the crosspattern operator becomes very slow when the number of maintained components approaches 1000 (this number depends on the machine specifications). This is because we use Secondo FLOBs (Faked Large OBjects) to represent the components. Secondo writes a FLOB to disk if it exceeds a certain size threshold. Concurrently maintaining such a large number of FLOBs involves a lot of disk access, and thus the run-time increases significantly. A good practice is to start with a strict description of the pattern, in order to avoid a large number of intermediately maintained components. In this experiment, for example, a strict description would be a combination of a small DistThreshold, a large group size threshold, and a long pattern duration threshold. If the results are not satisfactory (i.e., some expected groups are missing), one should relax the pattern description a little and try again, and so on.

5.7 Conclusions

This chapter has explained our approach for group spatiotemporal pattern queries. We have proposed three generic query operators that can be combined together to express complex group STPs, such as concurrence, flocks, moving clusters, convoys, etc. The three operators are called gpattern, crosspattern, and reportpattern. The formal definitions and the evaluation algorithms have been given and explained in this chapter. These operators are implemented in Secondo, and are available for download.

Our approach is both expressive and extensible. Many examples were given in this chapter to demonstrate that these operators are able to express most of the group STPs that exist in the literature, and more. In Chapter 7 more application examples will be demonstrated. Essentially, patterns are composed of patternoids, and patternoids are expressed on top of time-dependent predicates. Each of these three layers is both expressive and extensible. The overall expressive power of the language is a kind of multiplication of their expressive powers.

We have defined two patternoid operators, namely gpattern and crosspattern. They are able to express patternoids in terms of the independent movement, and the dual interaction between moving objects. An opportunity for future work is to propose a patternoid operator for team interactions. Such patternoids occur in soccer games for instance (e.g., the offside trap). Mainly they describe groups of moving objects, where every individual shows some individual movement pattern, and these patterns relate together and show some group STP.

The representation of the pattern results as msets allows for nesting group STP queries into more complex queries. It is possible, for instance, to build individual STP queries on the results of a group STP query. That is, one can create moving region representations of the mset results and further process these moving regions. One can also build group STP queries on the results of other group STP queries (e.g., a meeting pattern of several flocks).

We have theoretically developed generic methods for integrating the gpattern and the crosspattern operators with query optimizers. They do not require specialized index structures. Rather, they are built on top of the existing optimization framework. These methods were however not implemented, because the Secondo optimizer lacks the required support.

The experimental evaluation suggests that it is not worthwhile to further improve the evaluation algorithms that we have proposed for the gpattern and the crosspattern operators. Approximately 98% of the response time is taken by evaluating the time-dependent predicates, which are a kind of external tool used by the two operators.

Chapter 6

Temporal Reasoning

In Chapters 4 and 5 we have described the evaluation of the stpattern and reportpattern operators, respectively. Both operators are similarly modeled as a Constraint Satisfaction Problem (CSP) that consists of binary temporal constraints. They are evaluated incrementally by solving the CSPk−1 (having k − 1 variables) first, then extending it to CSPk. Consider the stpattern operator for example. During its evaluation, the evaluation algorithm knows temporal information from the time-dependent predicates evaluated so far, and from the temporal constraints. This temporal information can be used to restrict the definition times of the input trajectories before the evaluation of the remaining time-dependent predicates. To illustrate this, consider for example an stpattern predicate in the form:

stpattern[pred1: p1, pred2: p2; stconstraint("pred1", "pred2", vec("aabb"))]

where p1 and p2 are time-dependent predicates. Assume that the algorithm evaluating the stpattern operator (Algorithm 1) decides to evaluate p1 in the first iteration. Assume also that it knows, after evaluating p1 for a given tuple, that the earliest fulfillment of p1 ends at time t1. Now the stpattern algorithm has enough knowledge to conclude that a supported assignment of p2 can only start after t1, because of the temporal relation “aabb”. We can safely restrict the definition time of the input trajectory to t > t1 before evaluating p2 (a toy sketch of this restriction follows the list below). Thus, p2 receives a shorter trajectory, and its evaluation time decreases. The same idea applies to the reportpattern operator. In this chapter, we modify the algorithms for evaluating the stpattern and the reportpattern operators so that they make use of this idea, and hopefully run faster. Abstractly, there are two tasks to be done:

1. Given the set of the temporal constraints, perform temporal reasoning to infer the relationship between every pair of variables in the CSP.

2. Change the operator syntax and evaluation algorithms so that these inferred relationships can be used to restrict the trajectories.
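The restriction mentioned above is conceptually simple once a trajectory is viewed as a sequence of temporal units. The following toy sketch assumes units of the form ((start, end), value) with numeric instants and simply clips the trajectory at t1; it is an illustration of task 2 only, not the way trajectories are represented or restricted in Secondo.

def restrict_after(units, t1):
    """Keep only the parts of a trajectory defined after instant t1.
    Interpolation inside a unit is ignored; the unit is simply clipped."""
    restricted = []
    for (start, end), value in units:
        if end <= t1:
            continue                       # completely before t1: drop
        restricted.append(((max(start, t1), end), value))
    return restricted

trip = [((0, 10), "u1"), ((10, 25), "u2"), ((25, 40), "u3")]
print(restrict_after(trip, 12))   # -> [((12, 25), 'u2'), ((25, 40), 'u3')]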

Section 6.1 reviews some works on temporal reasoning. Section 6.2 explains how we do the temporal reasoning task. Section 6.3 explains the integration of the temporal reasoning with the stpattern operator and its variants. The integration with the reportpattern operator is similar to that of the stpattern operator, so we do not discuss it in this chapter to avoid repetition. Finally, Section 6.4 presents an experimental evaluation.

6.1 Temporal Reasoning (REVIEW)

The reasoning task that is required here is well known in the field of artificial intelligence. It is called minimal labeling in [vBC90], and closure in [VKvB90]. It can be defined as follows: given a set C of constraints relating a set I of time intervals, find the strongest relationship (minimal constraint) between every pair i1, i2 ∈ I.

One of the best known works in this field is the Interval Algebra (IA) by Allen [All83]. The algebra reasons over all the 13 possible interval relationships (e.g., before, meets, overlaps, etc.), and vectors of them. A vector represents the disjunction of its components. Computing the closure over Allen's Interval Algebra is proven to be NP-complete [VKvB90]. Note that the set of interval relationships IR (described in Section 4.1), which we use to express the temporal constraints, consists of the 13 Allen relationships and variants of them that allow each of the two intervals to collapse into an instant.

On the other hand, computing the closure over the Point Algebra (PA) (i.e., when the constraints relate time points instead of time intervals) is tractable. The Point Algebra reasons over time points and the set of relationships {<, >, =, ≤, ≥, ≠, ?}, where ? is the universal/don't-care relationship. One way to make our reasoning task tractable is to perform it over the Point Algebra. That is, instead of reasoning over the relationships between time intervals, we reason over the relationships between the endpoints of these time intervals. For example, the temporal relationship vec(“abab”, “a.bab”) between two time intervals i1, i2 can be expressed by the conjunction ((i1.t1 ≤ i2.t1) ∧ (i2.t1 < i1.t2) ∧ (i1.t2 < i2.t2)) between endpoints. The problem is that only a subset of the relationships of the IA can be represented by the PA. For example, the interval relationship vec(“aabb”, “bbaa”) translates into the time point relationship ((i1.t1 ≠ i2.t1) ∧ (i1.t1 ≠ i2.t2) ∧ (i1.t2 ≠ i2.t1) ∧ (i1.t2 ≠ i2.t2)). The latter point relationship does not correctly represent the interval relationship, because it accepts, for instance, the interval relationship “abba”. Vilain et al. [VKvB90] define the subset of IA that can be represented by PA as the subset that can be entirely encoded into PA relationships without any instance of the ≠ relationship (i.e., using only the relationships <, >, =, ≤, ≥, ?). They call this subset of IA the continuous endpoint algebra.

Mainly, any single one of the 13 possible interval relationships is representable by the PA. The problem starts when vectors of such relationships are composed. We compose vectors only when we are not sure about the exact order of events. For example, if we say Ali went to school after the summer holidays, we know that the temporal relationship between holidays and school is “aa.bb”. But when we say Ali drank his choco milk after he started eating the sandwich, we are not sure about the exact relationship between sandwich and choco milk. It could be any of {“aabb”, “aa.bb”, “abab”, “aba.b”, “abba”}. Therefore we use a vector. Vilain et al. call this a representational ambiguity (“Such ambiguity arises as a result of one's inability to resolve the exact duration of each event”). While IA is able to represent all kinds of representational ambiguities, PA represents only the continuous ones.


Figure 6.1: Continuous and discontinuous endpoint relationships

To make this clearer, Figure 6.1 illustrates four interval relationships, including the aforementioned two examples. Representational ambiguities are marked in the figure by dashed rectangles. In the holidays - school example, there is no ambiguity at all. In the sandwich - choco milk example, there is a continuous endpoint ambiguity. That is, the range of values that can be taken by the endpoints of the choco milk interval is continuous. The third example describes the scenario Ali went shopping either before or after his training. This is a truly disjunctive relationship. The range of values for the Shopping endpoints is disconnected. It is, therefore, not representable by the PA. Similar to this is the fourth example. It describes the scenario Ali is going to meet his friend at the party if he attends, otherwise at home after the party. Only the first two of these four examples belong to the continuous endpoint algebra, and can be represented by the PA. We think that the continuous endpoint algebra is sufficient for expressing the temporal constraints in STP queries. We have found it sufficient for expressing all the examples in this thesis, although the examples were not intentionally designed for this. So we choose to live with this subset, and use the PA for reasoning. If the user query contains discontinuous relationships, we fall back to the original algorithm that does not invoke temporal reasoning. The closure of a PA network having n variables/time points can be computed using the path consistency algorithm in O(n³) time [VKvB90].
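To make the closure computation concrete, the following C++ sketch shows one possible implementation of path consistency over a PA network, when every relationship between two time points is encoded as a subset of {<, =, >} (so ≤ is {<, =}, ≠ is {<, >}, and ? is the full set). All names and data structures are illustrative assumptions; this is not the actual Secondo implementation, and it assumes that R[i][i] = {=} and that R[j][i] is always the converse of R[i][j].

// Minimal sketch of PA closure via path consistency. Relations are
// bitmasks over the basic point relations LT=1, EQ=2, GT=4.
#include <vector>

enum { LT = 1, EQ = 2, GT = 4 };
const int ALL = LT | EQ | GT;              // the '?' relationship

// Composition of two basic relations, e.g. (<,<) -> < and (<,>) -> ?.
int composeBasic(int a, int b) {
  if (a == EQ) return b;
  if (b == EQ) return a;
  if (a == b)  return a;                   // (<,<) -> <, (>,>) -> >
  return ALL;                              // (<,>) and (>,<) -> ?
}

// Composition of two (possibly disjunctive) PA relations.
int compose(int r1, int r2) {
  int result = 0;
  for (int a = 1; a <= 4; a <<= 1)
    for (int b = 1; b <= 4; b <<= 1)
      if ((r1 & a) && (r2 & b)) result |= composeBasic(a, b);
  return result;
}

// Path consistency: repeatedly intersect R[i][j] with R[i][k] o R[k][j].
// Returns false if some relation becomes empty (inconsistent network).
bool closure(std::vector<std::vector<int> >& R) {
  const int n = static_cast<int>(R.size());
  bool changed = true;
  while (changed) {
    changed = false;
    for (int i = 0; i < n; ++i)
      for (int k = 0; k < n; ++k)
        for (int j = 0; j < n; ++j) {
          int refined = R[i][j] & compose(R[i][k], R[k][j]);
          if (refined == 0) return false;  // empty relation: inconsistent
          if (refined != R[i][j]) { R[i][j] = refined; changed = true; }
        }
  }
  return true;
}

A production implementation would use the queue-based path consistency algorithm of [VKvB90] to obtain the O(n³) bound; the fixpoint loop above is only meant to show the refinement step.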

6.2 The Reasoning Task

This section illustrates how the reasoning task is carried out. That is, we illustrate how a PA network is constructed from a given STP query. Once such a network is constructed, the closure is computed by the path consistency algorithm in [VKvB90]. Let C be a set of temporal constraints, part of the spatiotemporal predicate (Section 4.1) or of the reportpattern operator (Section 5.2.1). An element c ∈ C is of the form:

stconstraint(alias1, alias2, SI)

where SI ⊆ IR as described in Section 4.1, and alias1 and alias2 are aliases of time-dependent predicates or patternoids. The components of SI are interval relationships. They need to be mapped into endpoint relationships in order to build the corresponding PA network. Algorithm 10 illustrates the generation of the PA network, given such a set C.

Algorithm 10: IA-To-PA
  input : C, the set of temporal constraints
  result: PANetwork, the corresponding PA network

1  let PANetwork be a graph, initially empty;
2  foreach stconstraint(alias1, alias2, SI) in C do
3    let V: vector[12] be initially empty;
4    foreach ir in SI do
5      let v: vector[12] = Decompose(ir);
6      V = V | v;
7    PANetwork.Add(alias1, alias2, V);
8  return PANetwork;

The Decompose function decomposes a simple interval relationship from Table 4.2 into a vector of 12 endpoint relationships. Every entry in this vector represents the relationship between a pair of endpoints. Hence, there is a total of 12 relationships as illustrated in Figure 6.2. For instance, the vector entry at index 10 represents the relationship between i2.t2 as the first operand and i1.t2 as the second operand. The interval relationship “aa.bb”, for instance, is decomposed into the vector:

(< , < , < , > , = , < , > , = , < , > , > , >)

The decomposition of each of the 26 interval relationships in the IR set into such a vector is hard coded inside the Decompose function.

1st \ 2nd   i1.t1   i1.t2   i2.t1   i2.t2
i1.t1         -       0       1       2
i1.t2         3       -       4       5
i2.t1         6       7       -       8
i2.t2         9      10      11       -

Figure 6.2: The Decompose vector
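For illustration, such a hard-coded decomposition could be a simple lookup from the relationship name to its 12-entry vector, filled according to the index layout of Figure 6.2. The C++ sketch below uses an assumed bitmask encoding of the PA relationships and illustrative names; it shows only two of the 26 relationships and is not the actual Secondo code.

// Sketch of a hard-coded Decompose lookup. Every entry is a bitmask over
// the basic point relations (LT=1, EQ=2, GT=4); the twelve positions
// follow the index layout of Figure 6.2.
#include <map>
#include <string>
#include <vector>

std::vector<int> Decompose(const std::string& ir) {
  const int LT = 1, EQ = 2, GT = 4;
  static const std::map<std::string, std::vector<int> > table = {
    // "aabb"  (i1 before i2):  i1.t1 < i1.t2 < i2.t1 < i2.t2
    { "aabb",  { LT, LT, LT, GT, LT, LT, GT, GT, LT, GT, GT, GT } },
    // "aa.bb" (i1 meets i2):   i1.t1 < i1.t2 = i2.t1 < i2.t2
    { "aa.bb", { LT, LT, LT, GT, EQ, LT, GT, EQ, LT, GT, GT, GT } }
    // ... the remaining 24 relationships of IR
  };
  return table.at(ir);
}

The second entry reproduces the vector (<, <, <, >, =, <, >, =, <, >, >, >) given above for "aa.bb".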

The | operator (read "or") is used to aggregate the PA representations of all the simple interval relationships in SI (Lines 4 - 6 of the algorithm). It processes every entry in the first vector with the corresponding entry in the second vector according to the function defined in Figure 6.3. It is a symmetric operator.

 |        <      ≤      >      ≥      =      ≠      ?     empty
 <        <      ≤      ≠      ?      ≤      ≠      ?      <
 ≤        ≤      ≤      ?      ?      ≤      ?      ?      ≤
 >        ≠      ?      >      ≥      ≥      ≠      ?      >
 ≥        ?      ?      ≥      ≥      ≥      ?      ?      ≥
 =        ≤      ≤      ≥      ≥      =      ?      ?      =
 ≠        ≠      ?      ≠      ?      ?      ≠      ?      ≠
 ?        ?      ?      ?      ?      ?      ?      ?      ?
 empty    <      ≤      >      ≥      =      ≠      ?     empty

Figure 6.3: The | operator

Finally, the PA vector representing SI is added to the PA network using the Add function of the path consistency algorithm, listed in [VKvB90]. After all the constraints in C are inserted into the PA network, it is returned by the IA-To-PA algorithm, and it is now ready for computing the closure using the path consistency algorithm in [VKvB90].
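With the bitmask encoding assumed in the earlier sketches, the whole table of Figure 6.3 collapses into a bitwise OR, because every PA relationship is just a set of the basic relations <, =, >. The following sketch (illustrative names only, not the actual Secondo code) shows the aggregation of two 12-entry vectors:

// Sketch of the | aggregation over two endpoint vectors, assuming each
// entry is a bitmask over {<, =, >} (LT=1, EQ=2, GT=4; 0 is 'empty' and
// 7 is '?'). With this encoding the table of Figure 6.3 is set union,
// e.g. LE | NE = {<,=} union {<,>} = {<,=,>} = '?'.
#include <cstddef>
#include <vector>

std::vector<int> Or(const std::vector<int>& a, const std::vector<int>& b) {
  std::vector<int> result(a.size());
  for (std::size_t i = 0; i < a.size(); ++i)
    result[i] = a[i] | b[i];            // union of the two relation sets
  return result;
}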

6.3 Integration With The Individual STP Operators

Integrating the temporal reasoning with the stpattern operator and with the reportpattern operator is done in the same way. We illustrate in this section the integration with the stpattern operator, and we do not repeat the discussion for the reportpattern operator. Hence, this section illustrates the integration of temporal reasoning with the STP predicate and its variants (Table 4.1). We start by illustrating the idea on the bank robbers example (Section 1.2.1). The temporal constraints in this example are:

stconstraint(gas, bnk, vec("aabb")),
stconstraint(bnk, leaving, vec("abab", "aa.bb", "aabb"))

Recall from Section 4.1 that a supported assignment of an STP predicate assigns to every time-dependent predicate an interval during which it is fulfilled, and this set of intervals fulfills all the temporal constraints. At the end of an iteration k of the MatchPattern algorithm (Algorithm 1), we obtain partial supported assignments that have assignments for k variables only. In the next iteration k+1, MatchPattern picks a new Agenda variable and tries to find supported assignments for it. In the bank robbers example, MatchPattern picks the bnk variable in the first iteration, because it has the highest degree in the Constraint Graph. Assume that the set of supported assignments SA at the end of this iteration looks as follows:

gas           bnk                leaving
unassigned    [10:30 - 10:35]    unassigned
unassigned    [15:11 - 15:13]    unassigned
unassigned    [22:03 - 23:00]    unassigned

Assume also that MatchPattern picks next the gas variable. Now, as we have already computed the closure of the PA temporal network, we know the relationships between the endpoints of gas and bnk. The most restrictive of them is gas.t2 < bnk.t1. From the three supported assignments above, we obtain three consistent intervals for gas: [begin of time, 10:30), [begin of time, 15:11), and [begin of time, 22:03). The union of the three intervals, [begin of time, 22:03), is used to restrict the trajectory of the car's trip before evaluating the gas time-dependent predicate.¹
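The inference just described can be sketched as follows (C++, with illustrative names and a simplified time representation; it is not the actual Secondo code). Given the bnk intervals already assigned and the closure relationship gas.t2 < bnk.t1, the consistent period for gas is the union of one interval per supported assignment, which here collapses into a single interval ending at the latest bnk start:

// Sketch of inferring the consistent period for the newly picked
// variable gas under the closure relationship gas.t2 < bnk.t1.
#include <algorithm>
#include <limits>
#include <vector>

struct Interval { double t1, t2; };   // left-closed, right-open [t1, t2)

std::vector<Interval> consistentPeriodForGas(
    const std::vector<Interval>& bnkAssignments) {
  const double beginOfTime = -std::numeric_limits<double>::infinity();
  double latestEnd = beginOfTime;
  for (const Interval& bnk : bnkAssignments)
    latestEnd = std::max(latestEnd, bnk.t1);   // gas must end before bnk.t1
  // The union of [begin of time, bnk.t1) over all assignments is the
  // single interval ending at the latest bnk start (22:03 in the example).
  return { Interval{ beginOfTime, latestEnd } };
}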

Algorithm 11: MatchPattern With Temporal Reasoning
  input : variables, constraints
  result: whether the CSP is consistent or not

 1  Clear SA, Agenda and ConstraintGraph;
 2  Add all variables to Agenda;
 3  Add all constraints to the ConstraintGraph;
 4  Let PANetwork = IA-To-PA(constraints);
 5  Compute the closure of PANetwork using the path consistency
    algorithm in [VKvB90];
 6  if PANetwork is inconsistent then return Inconsistent;
 7  while Agenda is not empty do
 8    Pick a variable Xi from the Agenda;
 9    Let P: periods = GetConsistentPeriods(Xi, PANetwork);
10    SetConsistentPeriods(Xi, P);
11    Compute the variable domain Di;
12    Extend SA with Di;
13    if SA is empty then return Inconsistent;
14  else return Consistent;

Algorithm 11 illustrates how the MatchPattern algorithm is modified to integrate temporal reasoning. It starts by constructing the PA network and computing the closure. If the PA network is inconsistent (i.e., the temporal constraints entered by the user are inconsistent), it follows that the pattern cannot be matched. The function GetConsistentPeriods (Algorithm 12) in the main loop infers the set of time intervals that can be used to restrict the input trajectory.² Now we need to restrict the trajectory to these time intervals.

¹ Actually we expand this interval a little bit at the right end to make sure that gas does not overlap bnk.
² Whenever it is not possible to infer such a set of time intervals (e.g., in the first loop when no variables are consumed yet), GetConsistentPeriods yields the time interval [begin of time, end of time], which results in no restriction to the trajectory. We omit this detail from the algorithm for simplicity.

Algorithm 12: GetConsistentPeriods(Xi, PANetwork) → periods

1  Let lbounds = Search PANetwork for all the lower bounds of Xi.t1;
   // (i.e., the variables in PANetwork having the relationships
   //  {<, =, ≤} to Xi.t1)
2  Let ubounds = Search PANetwork for all the upper bounds of Xi.t2;
3  Let P: periods be initially empty;
4  foreach sa in SA do
5    Substitute the values of sa in lbounds, ubounds;
6    Let t1 = Largest(lbounds);
7    Let t2 = Smallest(ubounds);
8    P = P ∪ [t1, t2];
9  return P;

Note that the STP predicate does not directly process the trajectory. It receives the mbool values resulting from the evaluation of the time-dependent predicates, and checks the temporal constraints on them. The time-dependent predicates are the ones that process the trajectories. Hence, we have no way to restrict the trajectories automatically, because we cannot predict which time-dependent predicates will appear in the user query, or what the types and order of their arguments are. We also cannot know whether the time-dependent predicates receive the raw trajectories, or whether other intermediate operators are applied to them first, so that the trajectories would have to be restricted before evaluating these intermediate operators. At this point, it is necessary that the user state in the query how to restrict the trajectories. We change the signature of the stpattern operator from:

tuple × (tuple → mbool)+ × (bool)+ → bool

into:

tuple × ((tuple × periods) → mbool)+ × (bool)+ → bool

That is, the time-dependent predicate receives a second parameter of type periods. The type periods = range(instant) represents a set of time intervals that are disjoint and non-adjacent. In our implementation we have left the original operators unchanged, and introduced copies having the new signature and a 2 in their names (e.g., stpattern2, stpatternex2). The signatures of these new operators are listed in Table 6.1. The value of the newly added periods parameter is assigned at run time by the function SetConsistentPeriods (Algorithm 11, Line 10). The value of this parameter is the periods value that should be used to restrict the trajectory. It is the responsibility of the user to decide how this restriction is done. For example, the bank robbers query should be rewritten, using the stpattern2 operator, as:

query cars feed {c}
  landmark feed {l}
  filter[ .type_l = "gas station" ]
  product
  filter[ . stpattern2[
    gas: fun(t1: STREAMELEM, p1: periods)
      (attr(t1, trip_c) atperiods p1) inside .region_l,
    bnk: fun(t2: STREAMELEM, p2: periods)
      distance((attr(t2, trip_c) atperiods p2), bank) < 50.0,
    leaving: fun(t3: STREAMELEM, p3: periods)
      speed(attr(t3, trip_c) atperiods p3) > 100000;
    stconstraint("gas", "bnk", vec("aabb")),
    stconstraint("bnk", "leaving", vec("abab", "aa.bb", "aabb")) ] ]
  consume;

Now we see clearly in the query that the second argument of the stpattern2 operator is a set of functions that accept two arguments: a STREAMELEM (here equivalent to a tuple coming from the input stream), and a periods object. In this query, it is used to restrict the trip_c attribute using the atperiods operator. One could think of a smart way to make the optimizer able to generate such a plan, so that the user does not have to write such a complicated syntax, and simply writes the SQL query that was shown before in Section 1.2.1. The biggest challenge is to let the optimizer decide what to do with the periods argument. Note that the time-dependent predicate might accept a long list of arguments including static and moving types. The optimizer must decide which of these arguments should be restricted by the periods parameter. This has to do with the semantics of the query and the operators used in it. Unfortunately we could not come up with such an idea. In our implementation, these operators are accessible only at the Secondo executable level.

Table 6.1: Individual STP operators with temporal reasoning (the signatures and syntax templates of stpattern2, stpatternex2, stpatternextend2, stpatternextendstream2, stpatternexextend2, and stpatternexextendstream2).

6.4 Experimental Evaluation

This section experimentally evaluates the performance gain of the temporal reasoning. On the one hand, constructing the temporal network, computing the closure, and the effort of restricting the trajectories entail a performance overhead. On the other hand, the restricted trajectories are processed faster than the complete ones. Therefore, we perform two experiments in this section. The first one measures the overhead of the temporal reasoning compared to the original STP predicate. The second one evaluates the performance gain.

6.4.1 Experiment 1: The Overhead Of Temporal Reasoning

Basically we repeat the experiment of Section 4.6.1 that we used to evaluate the overhead of the stpattern operator. The queries of that experiment are written once more using the stpattern2 operator. The two sets of queries are run, and the run-times are compared.

Generating The Queries

In the same way as in Section 4.6.1, the randommbool operator is used to create a set of 30 random mbool instances and store them as database objects, named mb1...mb30. The same set of 4900 queries is replicated twice: once using the stpattern operator, and once using the stpattern2 operator (i.e., using temporal reasoning). The objective is to evaluate the performance of the STP predicate extended with temporal reasoning in comparison with the original STP predicate. Since the time-dependent predicates in this experiment have negligible time cost, the results reflect only the overhead of the two operators stpattern and stpattern2. The difference between the two overheads is the additional overhead due to temporal reasoning.

Results

The results are shown in Figure 6.4. On average, the stpattern2 operator responds in 1.5 times the response time of the stpattern operator. Still, the overhead is promisingly low: about half a second per 1000 tuples. Note that the closure is computed only once per query. Thus the overhead of computing the closure (which is the biggest part of the whole overhead) is not affected by the number of tuples in the input stream. But the atperiods operator is called for every input tuple, and this adds to the overhead.

6.4.2 Experiment 2: The Performance Gain Of Temporal Reasoning

This experiment evaluates the gain in performance due to temporal reasoning. Random queries are generated, where every query has two copies: one invoking the stpattern operator, and the other invoking the stpattern2 operator.

[Figure 6.4 consists of six panels, for 3 to 8 predicates, each plotting seconds per 1000 tuples against the number of constraints (0 - 17) for the stpattern and stpattern2 operators.]

Figure 6.4: The overhead of temporal reasoning

Both queries consist of the same set of time-dependent predicates and temporal constraints. The goals of the experiment are (1) to compare the run-time of the two sets of queries in order to evaluate the gain of temporal reasoning, and (2) to extract several run-time statistics that help us interpret the results.

Preparing the Data

The performance gain of temporal reasoning can be remarkable only if the trajectories are long. That is, it does not cause a remarkable performance improvement if already short trajectories are restricted into shorter ones. Therefore, we first prepare long trajectories (i.e., having a large number of units) for the purpose of this experiment. We start with the Trains relation of the berlintest database that comes with the Secondo distribution, downloadable from [Secd]. This relation has 562 trajectories (mpoint), each having 90 units on average. We duplicate the trajectories and randomly disturb the movement of the copies in X and Y, to obtain a larger dataset having 1124 trajectories. To increase the number of units, the positions of the trajectories are sampled with an interval of 3 seconds. These sampled positions are randomly displaced using small displacement values. Finally, they are used to interpolate new trajectories. As a result, the interpolated trajectories are similar to the original ones, but contain a bigger number of units. We call this final relation ScaledTrains. The trajectories in this relation have 670 units on average. The disk size of ScaledTrains is 77.2 MB. It has a total of 750,788 units.
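The densify-and-disturb step can be sketched roughly as follows (C++, with illustrative structs, names, and magnitudes; this is not the actual preparation script, which works on Secondo mpoint values and assumes strictly increasing timestamps).

// Rough sketch: resample a trajectory every 3 seconds, displace each
// sample slightly, and use the displaced samples for re-interpolation.
#include <cstddef>
#include <cstdlib>
#include <vector>

struct Sample { double t, x, y; };         // one position at time t

// Linear interpolation between two consecutive samples.
Sample interpolate(const Sample& a, const Sample& b, double t) {
  double f = (t - a.t) / (b.t - a.t);
  return Sample{ t, a.x + f * (b.x - a.x), a.y + f * (b.y - a.y) };
}

std::vector<Sample> disturb(const std::vector<Sample>& traj,
                            double step = 3.0, double maxShift = 5.0) {
  std::vector<Sample> result;
  std::size_t j = 0;                       // index of the current segment
  for (double t = traj.front().t; t <= traj.back().t; t += step) {
    while (j + 2 < traj.size() && traj[j + 1].t <= t) ++j;
    Sample s = interpolate(traj[j], traj[j + 1], t);
    s.x += maxShift * (2.0 * std::rand() / RAND_MAX - 1.0);   // small jitter
    s.y += maxShift * (2.0 * std::rand() / RAND_MAX - 1.0);
    result.push_back(s);
  }
  return result;
}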

Generating the Queries

We wrote a little C++ function that accepts two integers representing the number of time-dependent predicates and the number of temporal constraints, and yields two Secondo queries. The first invokes the stpattern operator, and the second invokes the stpattern2 operator. The two queries have the same set of time-dependent predicates and temporal constraints. The time-dependent predicates are randomly chosen from:

• distance(.Trip, random point) < random distance threshold
• speed(.Trip) > random speed threshold
• .Trip inside random region

The random points of the distance predicate are the top 30 restaurant locations in the Restaurants relation in the berlintest database. The random regions of the inside predicate are circular regions with a radius of 3500 meters, centered at cinemas in the Kinos relation. We randomly chose 30 cinemas in the city center, because most trains pass through the city center and therefore the probability that the predicate is fulfilled is high. The temporal constraints are randomly generated in a similar way as in Section 4.6.1. Yet, we exclude the temporal relationships that are not representable by the Point Algebra. In a C++ program, this function is iteratively called to generate such pairs of queries, ranging the number of time-dependent predicates from 2 to 8, and the number of constraints from 1 to 16. A total of 2450 such pairs of queries are generated.

Since the constraints are randomly generated, some queries might have inconsistent constraints. It would be favorable to the stpattern2 operator if these queries were kept in the experiment: the stpattern2 operator would discover these inconsistencies by means of temporal reasoning, and would immediately return without evaluating any of the time-dependent predicates, whereas the stpattern operator needs to evaluate some of the time-dependent predicates (possibly all of them) in order to conclude. Therefore, we exclude such queries from the experiment. This is done with the help of the computeclosure Secondo operator, which was implemented specially for the purpose of this experiment. This operator accepts the same set of arguments as stpattern2, and yields an integer value 0-2. Zero denotes an inconsistent set of constraints, one means a consistent set, and two means that the set of constraints cannot be represented by the Point Algebra. Only the consistent queries (i.e., value 1) are used in the experiment.

Results

The average run-time for every experimental setting (i.e., number of time-dependent predicates and number of constraints) is shown in Figure 6.5. Unexpectedly, the performance gain of temporal reasoning is not high: a 6.2% reduction in the run-time on average. For the user of the MOD system, this is not a significant improvement. In the following, we try to give an interpretation of these results.

[Figure 6.5 consists of six panels, for 3 to 8 predicates, each plotting the run-time in seconds against the number of constraints (0 - 17) for the stpattern and stpattern2 operators.]

Figure 6.5: The performance gain of temporal reasoning

As mentioned before, temporal reasoning achieves a performance gain only when the input trajectory is restricted/shortened before evaluating the time-dependent predicates. This restriction is possible only starting from the secondly evaluated time-dependent predicate. Hence, trajectories that fail to fulfill the firstly evaluated time-dependent predicate do not contribute to the gain. To illustrate this, consider that a given trajectory t fulfills a given pattern. If this pattern is expressed using the stpattern operator, the trajectory t is completely processed by all the time-dependent predicates. In contrast, if this pattern is expressed by the stpattern2 operator, the firstly evaluated time-dependent predicate processes the whole trajectory, the secondly evaluated time-dependent predicate (probably) processes a part of the trajectory, the thirdly evaluated time-dependent predicate processes a shorter part, and so on. Thus, the more time-dependent predicates are fulfilled by the trajectory, the higher the probable performance gain is.

Total number of calls to the stpattern/stpattern2 operator              777,808
Calls that fulfilled the firstly evaluated time-dependent predicate     566,073
Calls that fulfilled the secondly evaluated time-dependent predicate    147,495
Calls that fulfilled the thirdly evaluated time-dependent predicate      22,531
Calls that fulfilled the fourthly evaluated time-dependent predicate      3,792
Calls that fulfilled the fifthly evaluated time-dependent predicate         775
Calls that fulfilled the pattern                                              42

Number of calls to the distance operator                                520,185
Number of units processed by the distance operator,
when called by stpattern                                            359,188,765
Number of units processed by the distance operator,
when called by stpattern2                                           314,189,722

Number of calls to the speed operator                                   512,993
Number of units processed by the speed operator,
when called by stpattern                                            354,734,103
Number of units processed by the speed operator,
when called by stpattern2                                           310,028,892

Number of calls to the inside operator                                  485,296
Number of units processed by the inside operator,
when called by stpattern                                            335,603,010
Number of units processed by the inside operator,
when called by stpattern2                                           292,232,959

Number of calls to the atperiods operator                             1,519,084

Figure 6.6: Run-time statistics

Figure 6.6 shows run-time statistics that were collected in this experiment. Each of the two operators stpattern and stpattern2 was invoked 777,808 times; these are the 692 queries times the 1124 tuples of the ScaledTrains relation. Around 27% of these calls did not fulfill the firstly evaluated time-dependent predicate, and had no positive contribution to the gain. The number of tuples decreased quickly after evaluating the second, third, ... predicates. This means that the number of tuples contributing positive gain decreased quickly with every iteration. There were only 42 calls that fulfilled the pattern.

We have also counted the calls to the three operators distance, speed, and inside, that are used within the experiment queries. The figure shows the number of calls to these operators that were initiated by each of the stpattern and stpattern2 operators. When invoked by the stpattern operator, the three operators processed a total number of units (upoint) of 359,188,765, 354,734,103, and 335,603,010 respectively. When invoked by the stpattern2 operator, the number of processed units was around 13% less, due to the temporal reasoning and the trajectory restriction. This 13% reduction positively contributes to the gain. On the other hand, the atperiods operator was called 1,519,084 times, contributing negatively to the gain (i.e., adding to the overhead). This explains the low gain. To sum up this discussion, the experiment showed that the temporal reasoning and the trajectory restriction achieved a performance gain by reducing the number of processed units by 13%. A part of this gain was eaten by the overhead of computing the closure and of evaluating the atperiods operator. This explains the low performance gain of 6.2% that we got.

6.5 Discussion And Conclusions

This chapter discussed a technique that uses temporal reasoning to speed up the evaluation of our pattern matching operators. The concept applies to both individual and group pattern operators. The idea of this optimization came up while we were designing the stpattern operator. We became more motivated to pursue it after we got the referee comments on our paper [SG11]. We cite here this comment: "... Since each step that extends the list of supported assignments requires to fully evaluate the predicates and apparently the evaluation does not take advantage of discovered constraints (e.g., if the gas station is visited only once at a given hour of the day, and the pattern duration is very limited, there is no need to compute all time intervals of predicates "bnk" and "leaving")...". In contrast to the promising expectations from such an optimization, the experimental evaluation shows only slight performance improvements. We think that it is not a remarkable benefit to the user that a query that takes a minute to finish is sped up by 4 seconds. Moreover, the syntax of the operators that apply temporal reasoning is more complex than that of the original operators. In the second experiment, we have noticed that the majority of the queries are 1-5 seconds more expensive when temporal reasoning is used. In these queries the speed-up of the trajectory restriction was overtaken by the overhead of temporal reasoning and of invoking the atperiods operator for such long trajectories. Only a few queries showed a performance gain of more than 50% (up to one minute in some cases). Nevertheless, the average gain is not appealing. To conclude, the experimental evaluation shows that the overhead of temporal reasoning is negligible, and that the performance gain is small on average but very high in a few cases. From the performance point of view, it is beneficial to apply temporal reasoning. From the syntax point of view, it is less desirable to use the more complex syntax of the temporal reasoning operators. We conclude that unless, in some future work, one finds a way to let the query optimizer automatically generate execution plans invoking temporal reasoning, so that the syntax overhead is carried by the optimizer, there is not enough justification for using temporal reasoning.

Chapter 7

Application Examples

This chapter illustrates two real-world application examples of the individual and the group spatiotemporal pattern queries. The goal is to demonstrate the practical impact of the proposed query operators. The scenarios of these applications helped us greatly in refining our design to meet practical needs. In other words, the features of the operators proposed in this thesis were not all clear from the beginning. They were, rather, iteratively refined with the help of these application examples.

7.1 Air Traffic Control

This section demonstrates individual STP queries using ATC data (Air Traffic Control systems, used for tracking the movement of aircraft). The goal of ATC systems is to expedite the air traffic flow without compromising safety. They are ground stations run by trained staff. They receive radar recordings of aircraft within their airspace, and interact with the pilots in order to separate the traffic. Offline, they try to spot hazardous and less preferable situations in their archived data, analyze them, and build safety models. The idea of this application example came through a joint work with Gennady Andrienko and Natalia Andrienko. The work aimed at integrating the MOD system (Secondo) with the visual analytics system (V-Analytics [AAW07]) to better explore the spatiotemporal patterns in the data. Christophe Hurter from ENAC (École Nationale de l'Aviation Civile - a French civil aeronautics academy) contributed an ATC dataset, and played the role of the domain expert, describing the interesting patterns that ATC practitioners would like to explore in the data. This work has been demonstrated in [SGB+11]. Meanwhile, Christophe Hurter has provided two other ATC datasets. The three datasets are:

1. One day (Feb 22, 2008) radar recording of aircraft positions (Lon, Lat, Alt) over France and neighboring territories, consisting of 17,851 trajectories and 429,027 recorded positions. The average sampling interval per flight ranges from about 0.8 to about 5 minutes. The Secondo relation storing these flight tracks occupies approximately 172 MB of disk space on a Linux 32-bit machine.


2. One day (Sep 9, 2009) radar recording of aircraft positions (Lon, Lat, Alt) over France and neighboring territories, consisting of 17,275 trajectories and 410,020 recorded positions. The average sampling interval per flight ranges from about 0.8 to about 5 minutes. The Secondo relation storing these flight tracks occupies approximately 166 MB of disk space on a Linux 32-bit machine.

3. One week (Apr 6-12, 2008) radar recording of aircraft positions (Lon, Lat, Alt) over France and neighboring territories, consisting of 53,224 trajectories and 1,076,070 recorded positions. The average sampling interval per flight ranges from about 1.3 to about 6 minutes. The Secondo relation storing these flight tracks occupies approximately 444 MB of disk space on a Linux 32-bit machine.

The datasets have been anonymized by removing aircraft identities and airline information. In this data, we search for the following landing patterns:

1. Missed Approach: while the aircraft is in its final approach (i.e., the landing phase), the pilot decides to climb again, probably because a safe landing cannot be accomplished. In order to improve safety, it is worthwhile to analyze such hazardous situations. Such analysis can highlight recurring situations that can be resolved by changing airways, approach procedures, etc.

2. Stepwise Descent Landing: a landing pattern in which the aircraft alternates between descent and cruise until it touches the ground. This is a less desirable landing pattern compared to the Diving pattern, where the aircraft constantly descends until it lands. The latter is preferable because of lower fuel consumption and lower pollution and noise.

There are other interesting patterns that one can explore in ATC data; we take these two as examples. We perform these exploration tasks by integrating the visual analytics system V-Analytics [AAW07] with Secondo. Secondo brings an extensive set of query operations accessible through a query language, to let the user express arbitrarily complex queries, including our STP queries, and efficiently evaluate them on large moving objects databases. On the other hand, V-Analytics enables the user to pre-process the data, select an interesting subset for analysis, tune the parameters of the sophisticated queries in Secondo, and interpret their results. The motivation behind such an integrated approach is to provide the user with an interactive tool to help set the parameter values of the STP predicate that best describe the intended pattern. Domain expertise is required to formulate and tune queries. The user has to decide the set of predicates, their arguments, and the temporal constraints that best describe the pattern. In fact, STP queries are not trivial. They require fine tuning by checking the sensitivity of their results to the parameters. Moreover, real data is often incomplete or erroneous, the sampling rate might be low, and the user might lack a good description of the pattern in terms of time-dependent predicates. The integration with a visual analytics system allows for such fine tuning through user interaction. This integrated approach for exploring STPs is, to the best of our knowledge, novel. We have demonstrated it in [SGB+11].

Figure 7.1: Integrating SECONDO with V-Analytics

Figure 7.1 illustrates the integration between Secondo and V-Analytics. The moving object trajectories are initially loaded into the databases of the two systems. The data exchange between the two systems is possible in both directions. One can start by visualizing the data in V-Analytics, select a set of candidate trajectories, and/or interesting time periods within their life times, then send the trajectory identifiers and periods to Secondo for further processing. In the opposite direction, one can perform sophisticated query operations in Secondo (e.g., STP queries) and compute derived at- tributes, then transfer the query results to V-Analytics for visualization and further analysis. This process of transferring data between the two systems allows the user to explore the patterns in the data effectively.

7.1.1 Missed Approach

In this exploration task, we started in V-Analytics and selected the trajectories that landed at the end of their observed trips, to filter out noisy and incomplete data and to eliminate transit flights. In Secondo, we received the identifiers of this subset, and issued a preliminary query that roughly describes the Missed Approach pattern. After several cycles of adjusting the parameters between Secondo and V-Analytics, the final Secondo query that we have reached is:

1  let MissedApproach = xyt feed
2    filter[ get_duration(deftime(.Pos)) > ThirtyMinutes ]
3    extend[ Last30:
4      theRange(inst(final(.Pos)) - ThirtyMinutes,
5               inst(final(.Pos)), TRUE, TRUE) ]
6    projectextend[
7      Id; Pos: .Pos atperiods .Last30,
8      Alt: .Alt atperiods .Last30 ]
9    extend[ Destination: val(final(.Pos)), Der: derivative(.Alt) ]
10   stpatternexextend[
11     Close: distance(gk(.Pos), gk(.Destination)) < 15000.0,
12     Down1: ((.Der < 0.0) and (.Alt < 600.0)),
13     Up: .Der > 0.0,
14     Down2: .Der < 0.0
15     ; stconstraint("Close", "Down1", vec("abba", "a.bba", "baba")),
16       stconstraint("Close", "Up", vec("abba", "aba.b", "abab")),
17       stconstraint("Down1", "Up", vec("aabb", "aa.bb")),
18       stconstraint("Up", "Down2", vec("aabb", "aa.bb"))
19     ; ((start("Up") - end("Down1")) < TwoMinutes) and
20       is_within_total_turn_range(
21         gk(.Pos atperiods
22            theRange(end("Up"), inst(final(.Pos)), TRUE, FALSE)),
23         150.0, 720.0) ]
24   filter[ isdefined(.Close) ]
25   filter[ ( diagonal_length(gk(.Pos)) / length(gk(.Pos)) ) > 0.5 ]
26   extend[ MissedApproachPeriod: .Down1 union .Up union .Down2 ]
27   extend[ TrajClose: trajectory(.Pos atperiods .Close),
28           TrajDown1: trajectory(.Pos atperiods .Down1),
29           TrajUp: trajectory(.Pos atperiods .Up),
30           TrajDown2: trajectory(.Pos atperiods .Down2) ]
31   extend[ AltClose: .Alt atperiods .Close,
32           AltDown1: .Alt atperiods .Down1,
33           AltUp: .Alt atperiods .Up,
34           AltDown2: .Alt atperiods .Down2 ]
35   consume;

where the xyt relation has the schema xyt[Id: int, Pos: mpoint, Alt: mreal]. Pos is the 2D position (Lon, Lat) of the aircraft, and Alt represents the altitude. Lines 2 - 8 restrict the life times of Pos and Alt to their last 30 minutes. This threshold was decided with the help of V-Analytics: by means of visualization, we have noticed that the aircraft's approach/landing happens during the last 30 minutes of the trip. We restrict the trajectories so that the query will run faster. Line 9 computes the destination airport of the aircraft as its final position, and the derivative of the altitude, which is needed to express the pattern. The pattern is expressed using the four time-dependent predicates Close, Down1, Up, and Down2. The gk operator converts the geographical coordinates into Gauss-Krüger coordinates, so that the distance is measured in meters. The Missed Approach is described as a scenario where the aircraft descends below 600 m, climbs up, then descends again and lands. All this happens while the aircraft is close to its destination. Between the climbing event and the final landing, the aircraft performs a total turn of 150° - 720° (the is_within_total_turn_range Secondo function, lines 20 - 24). Again, these parameters were set with the help of V-Analytics. Line 25 additionally filters out training aircraft that keep circling in a small area. Basically, it compares the diagonal of the bounding box of the aircraft's track with the total length of the trip: such a training aircraft has a long track length and a short bounding box diagonal. For the datasets described above, the run-time of this query ranges from a few seconds to a minute depending on the number of selected trajectories. Figure 7.2 illustrates several visualizations of the results. The Trajectory Globe View in (a) is a tool developed by NASA, and integrated into V-Analytics.

It visualizes the earth in 3D, and allows for flexible navigation. The Trajectory Time Graph in (b) shows the projection of the trajectory lifetimes, and highlights in red the parts where the Missed Approach occurs. The Trajectory Wall in (c) is another 3D visualization tool from V-Analytics. The wall height corresponds to the trajectory lifetime. Finally, the Secondo view in (d) shows the 2D trajectory projection of one result, and the altitude curve during the Missed Approach occurrence.

Figure 7.2: Missed approach query results (panels (a)-(d))

7.1.2 Stepwise-Descent Landing

The Stepwise-Descent Landing pattern is explored using a similar exploration procedure as in the previous section. The pattern is expressed using the stpatternexextendstream operator as follows:

1  let StepDown= xyt feed
2    filter[ get_duration(deftime(.Pos)) > ThirtyMin ]
3    extend[ Last20:
4      theRange(inst(final(.Pos)) - TwentyMin,
5               inst(final(.Pos)), TRUE, TRUE) ]
6    projectextend[
7      Id; Pos: .Pos atperiods .Last20, Alt: .Alt atperiods .Last20 ]
8    filter[ total_abs_turns(.Pos) < 50.0 ]
9    extend[ Destination: val(final(.Pos)) ]
10   extend[ Der: derivative(.Alt) ]
11   projectextend[ Id, Pos, Alt, Destination
12     ; Der: const2linear(.Der) ]
13   extend[ SecondDer: derivative(.Der) ]
14   stpatternexextend[
15     Dive1: .SecondDer < 0.0,
16     Lift : .SecondDer >= 0.0,
17     Dive2: .SecondDer < 0.0
18     ; stconstraint("Dive1", "Lift", vec("aa.bb")),
19       stconstraint("Lift", "Dive2", vec("aa.bb"))
20     ; ((end("Lift") - start("Lift")) > OneMinute) and
21       ((inst(final(.Pos)) - end("Dive2")) < HalfMinute ) ]
22   filter[ isdefined(.Dive1) and
23     total_abs_turns(
24       .Pos atperiods (.Dive1 union .Lift union .Dive2)) < 30.0 ]
25   filter[ minimum(.Alt atperiods .Dive2) < 2000.0 ]
26   extend[ StepDownPeriod: .Dive1 union .Lift union .Dive2 ]
27   extend[ PosDive1: .Pos atperiods .Dive1,
28           PosLift: .Pos atperiods .Lift,
29           PosDive2: .Pos atperiods .Dive2 ]
30   extend[ AltDive1: .Alt atperiods .Dive1,
31           AltLift: .Alt atperiods .Lift,
32           AltDive2: .Alt atperiods .Dive2 ]
33   extend[ Traj: trajectory(.Pos),
34           TrajDive1: trajectory(.PosDive1),
35           TrajLift: trajectory(.PosLift),
36           TrajDive2: trajectory(.PosDive2) ]
37   consume;

The pattern is expressed as a sequence of increasing (Dive1), decreasing (Lift), then again increasing rate of descent (Dive2). The Lift event is required to last for more than one minute, and the aircraft lands no later than half a minute after the end of the Dive2 event. In the beginning of the query, the trajectories are restricted to their last 20 minutes. To distinguish the Stepwise-Descent pattern from the Missed Approach pattern, line 8 ensures that the aircraft flies almost straight without big turns. This condition is repeated with a tighter threshold in line 23. The total_abs_turns Secondo function object computes the sum of the absolute angle of turn for all turns performed by the aircraft. Again, the run-time of this query is within one minute for the datasets above. Figure 7.3 illustrates several visualizations of the results. The sequence dive, lift, dive can be seen clearly for two different trajectories in the Trajectory Globe Views in (a), (b). The map in (c) shows an interesting observation: such landings occur frequently for flights to Paris from SW, W and NW, and sometimes from SE.

Figure 7.3: Stepwise-descent query results (panels (a)-(d))

The space-time cube in (d) shows that such patterns occur in the morning and late afternoon around Paris, and during the day in Lyon and Nice.¹

7.2 Ghent City Game

The City Game is played by three or more teams. One of the teams plays the role of the thief. This team has to commit a set of crimes at different locations in the city. To commit a crime, the thief might be required to collect some items (e.g., weapons) from other locations in the city. The remaining teams, playing police, try to prevent the thief from committing crimes and put him in jail. All teams are given GPS devices. To make the game challenging, police teams receive partial location information about the thief, and vice versa. The city game was played by four teams in the city of Ghent on April 7th, 2011.

¹ The screen shots in Figure 7.3 (c), (d), and the conclusions drawn from them are contributed by Gennady and Natalia Andrienko. They are part of our demo paper [SGB+11].

As a preparation for the Summer School and Young Researchers Forum on Moving Objects and Knowledge Discovery, August 22-26, 2011², the organizers suggested that the presenters use this data to demonstrate their MOD systems. Ralf Hartmut Güting, the adviser of this PhD project, was a presenter in this event. We were given permission to use this dataset. The task was to demonstrate the Secondo system using it.

² http://cartogis.ugent.be/Summerschool/index.php

The dataset has four trajectories, representing the four teams. It has a total of 3855 observations, with an average rate of approximately one observation every six seconds. We use the group STP operators to answer the following questions:

1. Find the thief among the four teams.

2. Find how many rounds the game was played.

3. Find the winning police team/teams that seized the thief on every round.

We have imported the four trajectories into a Secondo relation called Players, having the following schema:

Players[PlayerId: int, Trip: mpoint]

To find the thief, we use the game rule that says that in the beginning of every round, all teams meet inside the police station; the thief is allowed to leave the police station some minutes before the police teams are allowed to go after him. Our goal thus is to find a group of four trajectories which are located inside the police station, followed by a group of three trajectories inside the police station. The difference between the two groups is the thief. To express this in terms of Secondo queries, we use the crosspattern operator to report all groups having two trajectories or more that stayed close to one another for a duration of two minutes or more:

let TWO_MINUTES = create_duration(0, 120000); let DIST_THRESHOLD = 65.0;

let PlayersJoined=
  Players feed {a}
  Players feed {b}
  symmjoin[ .PlayerId_a < ..PlayerId_b ]
  extend[ Dist: distance( gk(.Trip_a), gk(.Trip_b) ) ]
  consume;

let Groups=
  PlayersJoined feed
  crosspattern[ .PlayerId_a, .PlayerId_b,
    .Dist < DIST_THRESHOLD, TWO_MINUTES, 2, cc ]
  namedtransformstream[ Group ]
  addcounter[ GroupID, 1 ]
  consume;

These queries first compute a self join of the Players relation, and use the result as input to the crosspattern operator. The output of the last query is

stored in the Groups relation, which has the schema:

Groups[Group: mset, GroupID: int]

A sample tuple in this relation looks as follows:

GroupID  Group
1        [10:51:37.453, 10:52:10.313[  {167411, 167412}
         [10:52:10.313, 10:53:58.983[  {167409, 167411, 167412}
         [10:53:58.983, 10:59:27.758[  {167409, 167410, 167411, 167412}
         [10:59:27.758, 11:05:29.297[  {167409, 167410, 167411}
         [11:05:29.297, 11:05:38.348[  {167410, 167411}

This group has five units (uset). Comparing the third and fourth units, we can conclude that team number 167412 is the thief, under the condition that during the time interval of the third unit, the four teams are inside the police station. The following queries express this:

let DummyRect= rectangle2(0.0, 0.0, 0.0, 0.0);

let GroupUnits=
  Groups feed
  loopjoin[ units(.Group)
    namedtransformstream[ UGroup ]
    addcounter[ UID, 1 ] ]
  project[ GroupID, UID, UGroup ]
  extend[ GroupMembers: val(initial(.UGroup)) ]
  extend[ GroupBBox: fun(t:TUPLE)
    members(Players feed project[PlayerId, Trip], attr(t, UGroup), TRUE)
    transformstream
    projectextend[ ; bbx: bbox2d(.elem) ]
    aggregateB[ bbx; fun( r1: rect, r2: rect ) r1 union r2; DummyRect ] ]
  sortby[ GroupID asc, UID asc ]
  consume;

let thief=
  GroupUnits feed
  extend_last[ ThiefID: ifthenelse(
      (.GroupID = ..GroupID) and
      (..GroupBBox intersects PoliceStation) and
      (cardinality(.GroupMembers) = 3) and
      (cardinality(..GroupMembers) = 4),
      ..GroupMembers - .GroupMembers,
      [const intset value (-1)] )
    :: [const intset value (-1)] ]
  filter[ .ThiefID # [const intset value (-1)] ]
  project[ ThiefID ]
  sort rdup
  extract[ ThiefID ];

First, we store the units (uset) of the computed groups in the relation GroupUnits. Every unit is extended by two attributes: GroupMembers, which is the set of team identifiers in this unit, and GroupBBox, which is the bounding box of the group members during this unit. The members operator, as described in Section 5.2, yields the moving point objects within this unit, restricted to the unit's definition time. The units are sorted so that units coming from the same mset object appear after one another in temporal order. The query for finding the thief compares every two consecutive units in the GroupUnits relation (using the extend_last operator), and tries to find a unit having 3 members preceded by a unit having 4 members whose bounding box intersects the police station. The extend_last operator computes new attributes from the current tuple and the previous one in the stream. The attributes of the current tuple are accessed with the dot symbol, and those of the previous tuple are accessed with the double dots. When two such units are found, the second one is extended by an attribute holding the difference of the two sets (i.e., the thief identifier). All other tuples have the value -1 in this attribute. Using these queries, we were able to correctly identify the thief (team 167412).

To find the teams that seized the thief, we distinguish between three cases: (1) one team seized the thief, a second joined, then a third, (2) two teams seized the thief, then the third joined, or (3) the three police teams were together when they all seized the thief. The following query retrieves the first case. The two other cases are similar:

let winnersolo=
  GroupUnits feed
  extend_last[ prevGroup: ifthenelse( (.GroupID = ..GroupID),
      ..GroupMembers, [const intset value (-1)] )
    :: [const intset value (-1)] ]
  extend_last[ beforePrevGroup: ifthenelse( (.GroupID = ..GroupID),
      ..prevGroup, [const intset value (-1)] )
    :: [const intset value (-1)] ]
  filter[ (.GroupBBox intersects PoliceStation) and
    (cardinality(.GroupMembers) = 4) and
    (cardinality(.prevGroup) = 3) and
    (cardinality(.beforePrevGroup) = 2) and
    (thief issubset .beforePrevGroup) and
    (thief issubset .prevGroup) ]
  addcounter[ RoundNumber, 1 ]
  projectextend[ GroupID, RoundNumber;
    Winner1: .beforePrevGroup - thief,
    Winner2: .prevGroup - .beforePrevGroup,
    Winner3: .GroupMembers - .prevGroup,
    RoundEnd: deftime(.UGroup) ]
  sortby[ RoundEnd asc ]
  krdup[ RoundEnd ]
  consume;

This query extends every tuple with the group members of its preceding tuple and of the tuple before that, under the condition that they belong to the same mset object. It then tries to find a unit that (1) has 4 members, (2) has a bounding box that intersects the police station, (3) whose preceding unit has three members including the thief, and (4) whose unit before that has two members including the thief. Taking the differences between the three sets (this one, the previous, and the one before the previous), we are able to decide the winning team and the teams that followed (the projectextend operator). Finally, repeated results are removed.

To query how many rounds the game was played, we simply need to count how many times the game was won. That is, we sum the counts of the three cases of seizing the thief listed above. We omit this query here as it is trivial.

Chapter 8

Conclusions and Future Work

In this thesis, we have approached the problem of spatiotemporal pattern queries from the side of the query language. We have focused on proposing generic query operators that are made accessible to users through a query language. The issues of system integration have been at the heart of this work. This work is to be seen as an extension to the MOD model in [GBE+00] and [FGNS00]. Therefore, it has been implemented in Secondo [Secd], where a big part of this MOD model is already implemented. In the beginning of this PhD project, the term spatiotemporal pattern (STP) was not well defined. Previous research had failed to give generic definitions. Instead, it had focused on developing different pattern matching algorithms, each of them being able to match a single pattern, or a small subset. This PhD project has proposed generic definitions of STPs and corresponding query operators. An extensive survey of previous work has led us to distinguish two types of STPs: individual and group. Clear definitions have been given, and generic query operators have been proposed for the two types. The contributions of this project can be divided into three parts: (1) individual STP query operators, (2) group STP query operators, and (3) temporal reasoning. In the following sections, we review the contributions of every part.

Individual STP Query Operators

The contributions in this part are: (1) individual STPs have been defined, (2) expressive query operators have been proposed, (3) these operators have been implemented in Secondo, and have been fully integrated into the query optimizer, and finally (4) they have been extensively tested with synthetic datasets as well as in real-world scenarios. We have had the chance to try these operators in two real-world scenarios. The first was the Maritime Transportation Showcase¹, among the activities of the MOVE-COST European project. In this showcase, the individual STP operators have been applied to a dataset of ship trajectories to find the situations of near-miss and drifting-ship. The second scenario was the air traffic control scenario described in Section 7.1.

¹ http://www.move-cost.info/index.php?option=com_content&view=category&layout=blog&id=38&Itemid=57

These applications have helped us enrich the operator design and propose extensions. Our impression is that these operators are at the level of being useful in real-world applications.

Group STP Query Operators

The contributions in this part are: (1) group STPs have been defined, (2) expressive query operators have been proposed, (3) these operators have been implemented in Secondo, (4) the integration with the query optimizer has been theoretically developed, but not implemented, and finally (5) they have been extensively tested with synthetic datasets. The Ghent city game scenario in Section 7.2 has been used to illustrate the group STP operators. It is, however, a small-scale application example. We believe that several improvement opportunities for these operators will show up if they are used in large-scale real-world applications. Unfortunately, we did not have such a chance during the time of this PhD project. Hopefully, such applications will be available in the near future.

Temporal Reasoning

Although temporal reasoning is an extension to the individual and group STP query operators, it is considered a separate part of this project. This is because another area of research (artificial intelligence) had to be surveyed. The goal has been to extend the evaluation algorithms with temporal reasoning capabilities, in order to make them more efficient. The contributions in this part are: (1) the techniques of temporal reasoning have been adapted so that they can be integrated with our individual and group STP operators, (2) these techniques have been implemented and extensively tested with individual STP query operators, and (3) the empirical evaluation showed that the performance gain is 6.2% on average. It is not an appealing gain, but it would have been cogent to modify our STP query operators and use temporal reasoning if this gain came at no cost. Unfortunately, this is not the case, because the executable query syntax is more complex when temporal reasoning is used. A future work may be to extend the query optimizer to generate such execution plans, so that the syntax overhead is taken off the user. In such a case, it will be plausible to use temporal reasoning.

Appendices


Appendix A

Experimental Repeatability

A.1 Repeating The Individual STP Experiments

To repeat the experiment in Section 4.6, please download the Spatiotemporal Pattern Queries Plugin from the Secondo Plugins web site [Seca]. The User Manual (also available on the Plugin web site) describes how to install and run the Plugin. We have made available the scripts for running the first and the second experiments in this section, and one illustration example called Berlintest, so that the results are repeatable. There are no scripts for the third experiment. If you are interested in repeating it, please refer to the BerlinMOD benchmark [DBG09] to generate the test data, then use the queries as described in Section 4.6.3. Before running the scripts of the experiments, you need to install:

1. The Secondo system version 2.9.1¹. A brief installation guide is given in the Plugin User Manual on [Seca], and a detailed guide is given in the Secondo User Manual [Secc].

2. The Spatiotemporal Pattern Queries Plugin (STPatterns) as described in [Seca].

A.1.1 Repeating the First Experiment

During the installation of the STPattern Plugin, two files are copied to the Secondo bin directory $SECONDO_BUILD_DIR/bin. These two files, Expr1Script.sec and STPQExpr1Query.csv, automate the repeatability of the first experiment. The experiment can then be run as follows:

1. Run SecondoTTYNT (i.e., go to $SECONDO_BUILD_DIR/bin and write SecondoTTYNT using the bash console).

2. Make sure that the berlintest database is restored (i.e., at the Secondo prompt, write list databases and make sure that the berlintest database is in the list). Otherwise, restore it by writing



restore database berlintest from berlintest

at the Secondo prompt (press return twice).

3. Execute the script by writing @Expr1Script.sec at the Secondo prompt. The script creates the required database objects and executes the experiment queries. This may take half an hour, depending on your machine.

Executing the script creates a Secondo relation STPQExpr1Result in the berlintest database, which stores the experimental results. Its schema is shown in Table A.1.

Table A.1: The schema of the STPQExpr1Result relation

Attribute        | Meaning                                                    | Example
no               | A serial number for the query.                             | 0
queryText        | The query text.                                            | thousand feed filter [.stpattern[ a:passmbool(mb10), b:passmbool(mb30); stconstraint("a", "b", vec("aa.b.b"))]] count
numPreds         | The number of the lifted predicates in the STP predicate.  | 2
numConstraints   | The number of the constraints in the STP predicate.        | 1
ElapsedTimeReal  | The measured response time, in seconds, for this query.    | 0.171932
ElapsedTimeCPU   | The measured CPU time, in seconds, for this query.         | 0.16

The experimental results are also saved to a comma separated file STPQExpr1Result.csv in the Secondo bin directory. The file has a similar structure to the relation STPQExpr1Result.
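For a quick look at the measurements, the following Python sketch can be used to summarize the CSV file. It is not part of the Plugin; it assumes that STPQExpr1Result.csv is written with a header row carrying the attribute names of Table A.1 (otherwise the field names have to be passed to the reader explicitly).

import csv
from collections import defaultdict

# Group the measured response times by (numPreds, numConstraints),
# assuming the column names of Table A.1 appear as a header row.
groups = defaultdict(list)
with open("STPQExpr1Result.csv", newline="") as f:
    for row in csv.DictReader(f):
        key = (int(row["numPreds"]), int(row["numConstraints"]))
        groups[key].append(float(row["ElapsedTimeReal"]))

print("numPreds  numConstraints  queries  avg response time [s]")
for (preds, cons), times in sorted(groups.items()):
    print(f"{preds:8d}  {cons:14d}  {len(times):7d}  {sum(times) / len(times):21.4f}")

The same grouping can of course be computed directly on the relation STPQExpr1Result inside Secondo; the script is only a convenience for offline analysis.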

A.1.2 Repeating the Second Experiment

Repeating the second experiment is also automated by script files that are copied to the Secondo directories during the installation of the STPattern Plugin. For the second experiment, two script files are used: (1) the file $SECONDO_BUILD_DIR/bin/Expr2Script.sec creates the necessary database objects, and (2) the file $SECONDO_BUILD_DIR/Optimizer/expr2Queries.pl executes the queries. The Expr2Script.sec file is described in Appendix B, and expr2Queries.pl in Appendix C. The experiment is repeated as follows:

1. Run SecondoTTYNT.

2. Make sure that the berlintest database is restored; otherwise, restore it.

3. Execute the Expr2Script.sec by writing @Expr2Script.sec at the Secondo prompt. This creates the necessary database objects.

4. Quit SecondoTTYNT (i.e., write quit at the Secondo prompt), go to the Secondo optimizer folder $SECONDO_BUILD_DIR/Optimizer and write SecondoPL. This starts the Secondo optimizer user interface in the single user mode.

5. Write consult(expr2Queries). to let Prolog interpret the script file expr2Queries.pl.

6. Open the berlintest database (i.e. write open database berlintest.).

7. Write runSTPQExpr2DisableOptimization. to run the queries without enabling the optimization of the STP predicate, or runSTPQExpr2EnableOptimization. to run the queries with the optimization of the STP predicate being enabled. This can take more than an hour.

The results are saved to the comma separated files Expr2StatsDO.csv and Expr2QueriesDO.csv in the Secondo optimizer folder if the STP predicate optimization is disabled. If it is enabled, the results are saved to the files Expr2StatsEO.csv and Expr2QueriesEO.csv. The files Expr2StatsDO.csv and Expr2StatsEO.csv show the run times. They include the columns described in Table A.2.

Table A.2: The schemas of the Expr2StatsDO.csv and Expr2StatsEO.csv files

Attribute            | Meaning                                                                                           | Example
NumberOfPredicates   | The number of the lifted predicates in the STP predicate.                                         | 2
NumberOfConstraints  | The number of the constraints in the STP predicate.                                               | 1
Serial               | A serial for the query in the range [0,9]. The serial is repeated with every experimental setup.  | 1
ExecTime             | The measured response time, in milliseconds, for this query.                                      | 443
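The optimization gain reported in Section 4.6.2 can be recomputed from these two files. The following Python sketch is not part of the Plugin; it assumes that both CSV files carry a header row with the column names of Table A.2, and that a query is identified by the triple (NumberOfPredicates, NumberOfConstraints, Serial).

import csv

def load_times(path):
    # Read ExecTime (milliseconds) per query, keyed by the triple that
    # identifies a query according to Table A.2.
    times = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            key = (row["NumberOfPredicates"], row["NumberOfConstraints"], row["Serial"])
            times[key] = float(row["ExecTime"])
    return times

disabled = load_times("Expr2StatsDO.csv")  # STP predicate optimization disabled
enabled = load_times("Expr2StatsEO.csv")   # STP predicate optimization enabled

common = [k for k in disabled if k in enabled]
gains = [(disabled[k] - enabled[k]) / disabled[k] for k in common]
print(f"queries compared: {len(common)}")
print(f"average gain    : {100.0 * sum(gains) / len(gains):.1f}%")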

The files Expr2QueriesDO.csv and Expr2QueriesEO.csv have a similar structure. They exclude the ExecTime attribute and have two more attributes: the SQL attribute, which stores the SQL-like query, and the ExecutablePlan attribute, which stores the execution plan generated by the optimizer.

Bibliography

[AA11] Natalia Andrienko and Gennady Andrienko. Spatial generalization and aggregation of massive movement data. IEEE Transactions on Visualization and Computer Graphics, 17:205–219, February 2011.

[AAR+09] Gennady Andrienko, Natalia Andrienko, Salvatore Rinzivillo, Mirco Nanni, Dino Pedreschi, and Fosca Giannotti. Interactive visual clustering of large collections of trajectories. In IEEE VAST, pages 3–10, 2009.

[AAW07] Gennady Andrienko, Natalia Andrienko, and Stefan Wrobel. Visual analytics tools for analysis of movement data. SIGKDD Explor. Newsl., 9:38–46, December 2007.

[All83] James F. Allen. Maintaining knowledge about temporal intervals. Commun. ACM, 26(11):832–843, 1983.

[Bes06] Christian Bessiere. Handbook of Constraint Programming, chapter 3. Elsevier, 2006.

[BGHW08] Marc Benkert, Joachim Gudmundsson, Florian Hübner, and Thomas Wolle. Reporting flock patterns. Comput. Geom. Theory Appl., 41(3):111–125, 2008.

[BXFJ03] Binh-Minh Bui-Xuan, Afonso Ferreira, and Aubin Jarry. Computing shortest, fastest, and foremost journeys in dynamic networks. International Journal of Foundations of Computer Science, 14(2):267–285, 2003.

[CLFG+03] José Antonio Cotelo Lema, Luca Forlizzi, Ralf Hartmut Güting, Enrico Nardelli, and Markus Schneider. Algorithms for moving objects databases. Comput. J., 46(6):680–712, 2003.

[DBG09] Christian Düntgen, Thomas Behr, and Ralf Hartmut Güting. BerlinMOD: a benchmark for moving object databases. The VLDB Journal, 18(6):1335–1368, 2009.

[dMR05] Cédric du Mouza and Philippe Rigaux. Mobility patterns. Geoinformatica, 9(4):297–319, 2005.


[dMRS05] Cédric du Mouza, Philippe Rigaux, and Michel Scholl. Efficient evaluation of parameterized pattern queries. In Proceedings of the 14th ACM international conference on Information and knowledge management, CIKM '05, pages 728–735, New York, NY, USA, 2005. ACM.

[DWL08] Somayeh Dodge, Robert Weibel, and Anna-Katharina Lautenschütz. Towards a taxonomy of movement patterns. Information Visualization, 7(3):240–252, 2008.

[EGI99] David Eppstein, Zvi Galil, and Giuseppe F. Italiano. Dynamic graph algorithms. In Mikhail J. Atallah, editor, Algorithms and Theory of Computation Handbook, chapter 8. CRC Press, 1999.

[Erw04] Martin Erwig. Spatio-Temporal Databases (ed. De Caluwe), chapter 2, pages 29–54. Springer-Verlag New York, Inc., 2004.

[ES99] Martin Erwig and Markus Schneider. Developments in spatio-temporal query languages. In DEXA '99: Proceedings of the 10th International Workshop on Database and Expert Systems Applications, page 441, Washington, DC, USA, 1999. IEEE Computer Society.

[ES02] Martin Erwig and Markus Schneider. Spatio-temporal predicates. IEEE Trans. on Knowl. and Data Eng., 14(4):881–901, 2002.

[FGNS00] Luca Forlizzi, Ralf Hartmut Güting, Enrico Nardelli, and Markus Schneider. A data model and data structures for moving objects databases. In SIGMOD '00: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 319–330, New York, NY, USA, 2000. ACM.

[GAA+05] Ralf Hartmut Güting, Victor Almeida, Dirk Ansorge, Thomas Behr, Zhiming Ding, Thomas Höse, Frank Hoffmann, Markus Spiekermann, and Ulrich Telle. Secondo: An extensible DBMS platform for research prototyping and teaching. In ICDE '05: Proceedings of the 21st International Conference on Data Engineering, pages 1115–1116, Washington, DC, USA, 2005. IEEE Computer Society.

[GBA+04] Ralf Hartmut Güting, Thomas Behr, Victor Almeida, Zhiming Ding, Frank Hoffmann, and Markus Spiekermann. Secondo: An extensible DBMS architecture and prototype. Technical Report Informatik-Report 313, FernUniversität Hagen, March 2004.

[GBE+00] Ralf Hartmut Güting, Michael H. Böhlen, Martin Erwig, Christian S. Jensen, Nikos A. Lorentzos, Markus Schneider, and Michalis Vazirgiannis. A foundation for representing and querying moving objects. ACM Trans. Database Syst., 25(1):1–42, 2000.

[Geo] GeoPKDD website: Geographic privacy-aware knowledge discovery and delivery. http://www.geopkdd.eu.

[GNP+09] Fosca Giannotti, Mirco Nanni, Dino Pedreschi, Chiara Renso, Salvatore Rinzivillo, and Roberto Trasarti. GeoPKDD – geographic privacy-aware knowledge discovery. In The European Future Technologies Conference (FET 2009), 2009.

[GNPP07] Fosca Giannotti, Mirco Nanni, Fabio Pinelli, and Dino Pedreschi. Trajectory pattern mining. In KDD'07, pages 330–339, 2007.

[GRS98] Stéphane Grumbach, Philippe Rigaux, and Luc Segoufin. Spatio-temporal data handling with constraints. In Proceedings of the 6th ACM international symposium on Advances in geographic information systems, GIS '98, pages 106–111, New York, NY, USA, 1998. ACM.

[Gu93] Ralf Hartmut Güting. Second-order signature: a tool for specifying data models, query processing, and optimization. SIGMOD Rec., 22(2):277–286, 1993.

[GvKS04] Joachim Gudmundsson, Marc van Kreveld, and Bettina Speckmann. Efficient detection of motion patterns in spatio-temporal data sets. In GIS '04: Proceedings of the 12th annual ACM International Workshop on Geographic Information Systems, pages 250–257, New York, NY, USA, 2004. ACM.

[HKBT05] Marios Hadjieleftheriou, George Kollios, Petko Bakalov, and Vassilis J. Tsotras. Complex spatio-temporal pattern queries. In VLDB ’05: Proceedings of the 31st International Conference on Very Large Data Bases, pages 877–888. VLDB Endowment, 2005.

[Ioa96] Yannis E. Ioannidis. Query optimization. ACM Comput. Surv., 28(1):121–123, 1996.

[JSZ08] Hoyoung Jeung, Heng Tao Shen, and Xiaofang Zhou. Convoy queries in spatio-temporal databases. In ICDE '08: Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, pages 1457–1459, Washington, DC, USA, 2008. IEEE Computer Society.

[KMB05] Panos Kalnis, Nikos Mamoulis, and Spiridon Bakiras. On discovering moving clusters in spatio-temporal data. In SSTD, pages 364–381, 2005.

[LIW05] Patrick Laube, Stephan Imfeld, and Robert Weibel. Discovering relative motion patterns in groups of moving point objects. International Journal of Geographical Information Science, 19(6):639–668, 2005.

[LJL+10] Zhenhui Li, Ming Ji, Jae-Gil Lee, Lu-An Tang, Yintao Yu, Jiawei Han, and Roland Kays. MoveMine: mining moving object databases. In SIGMOD '10: Proceedings of the 2010 international conference on Management of data, pages 1203–1206, New York, NY, USA, 2010. ACM.

[LKI04] Patrick Laube, Marc van Kreveld, and Stephan Imfeld. Finding REMO - detecting relative motion patterns in geospatial lifelines. In Developments in Spatial Data Handling: Proceedings of the 11th International Symposium on Spatial Data Handling, pages 201–215, Berlin Heidelberg, 2004. Springer.

[ORP+08] Riccardo Ortale, Ettore Ritacco, Nikos Pelekis, Roberto Trasarti, Gianni Costa, Fosca Giannotti, Giuseppe Manco, Chiara Renso, and Yannis Theodoridis. The daedalus framework: progressive querying and mining of movement data. In GIS, page 52, 2008.

[PTVP06] Nikos Pelekis, Yannis Theodoridis, Spyros Vosinakis, and Themis Panayiotopoulos. HERMES - a framework for location-based data management. In Proceedings of EDBT 2006, 2006.

[RLK+11] Chenghui Ren, Eric Lo, Ben Kao, Xinjie Zhu, and Reynold Cheng. On querying historical evolving graph sequences. PVLDB, 4(11):726–737, 2011.

[Sch05] Markus Schneider. Evaluation of spatio-temporal predicates on moving objects. In ICDE '05: Proceedings of the 21st International Conference on Data Engineering, pages 516–517, Washington, DC, USA, 2005. IEEE Computer Society.

[Seca] Secondo plugins. http://dna.fernuni-hagen.de/secondo.html/start_content_plugins.html.

[Secb] Secondo programmer's guide. http://dna.fernuni-hagen.de/secondo.html/files/programmersguide.pdf.

[Secc] Secondo user manual. http://dna.fernuni-hagen.de/secondo.html/files/secondomanual.pdf.

[Secd] Secondo web site. http://dna.fernuni-hagen.de/secondo.html/.

[SG11] Mahmoud Sakr and Ralf Hartmut Güting. Spatiotemporal pattern queries. GeoInformatica, 15:497–540, 2011.

[SGB+11] Mahmoud Sakr, Ralf Hartmut Güting, Thomas Behr, Gennady Andrienko, Natalia Andrienko, and Christophe Hurter. Exploring spatiotemporal patterns by integrating visual analytics with a moving objects database system. In Proceedings of the 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, GIS '11, pages 505–508, New York, NY, USA, 2011. ACM.

[TGN+11] Roberto Trasarti, Fosca Giannotti, Mirco Nanni, Dino Pedreschi, and Chiara Renso. A query language for mobility data mining. IJDWM, 7(1):24–45, 2011.

[Tra10] Roberto Trasarti. Mastering the Spatio-Temporal Knowledge Discovery Process. PhD thesis, University of Pisa, Department of Computer Science, Italy, 2010.

[vBC90] Peter van Beek and Robin Cohen. Exact and approximate reasoning about temporal relations. Comput. Intell., 6:132–147, July 1990.

[VBT10] Marcos R. Vieira, Petko Bakalov, and Vassilis J. Tsotras. Querying trajectories using flexible patterns. In Proceedings of the 13th International Conference on Extending Database Technology, EDBT '10, pages 406–417, New York, NY, USA, 2010. ACM.

[VFMB+10] Marcos R. Vieira, Enrique Frías-Martínez, Petko Bakalov, Vanessa Frías-Martínez, and Vassilis J. Tsotras. Querying spatio-temporal patterns in mobile phone-call databases. In Proceedings of the 2010 Eleventh International Conference on Mobile Data Management, MDM '10, pages 239–248, Washington, DC, USA, 2010. IEEE Computer Society.

[VKvB90] Marc Vilain, Henry Kautz, and Peter van Beek. Constraint propagation algorithms for temporal reasoning: a revised report, pages 373–381. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1990.

[WXCJ98] Ouri Wolfson, Bo Xu, Sam Chamberlain, and Liqin Jiang. Moving objects databases: Issues and solutions. In SSDBM'98: 10th International Conference on Scientific and Statistical Database Management, pages 111–122, 1998.

Resume

Personal Information:
18.6.1980       Born in El-Sharkia, Egypt.

Education:
5.2001          B.Sc. in Computer and Information Systems, Ain Shams University, Cairo, Egypt.
12.2006         M.Sc. in Computer and Information Systems, Ain Shams University, Cairo, Egypt.

Work Experience:
2002 - Present  Teaching assistant, Information Systems Department, Ain Shams University, Cairo, Egypt. On leave since June 2008 to pursue the PhD degree in Germany.

Publications:
Mahmoud Sakr and Ralf Hartmut Güting: Group Spatiotemporal Pattern Queries. FernUniversität in Hagen, Informatics-Report 358 - 4/2011.
Mahmoud Sakr, Thomas Behr, Ralf Hartmut Güting, Gennady Andrienko, Natalia Andrienko, and Christophe Hurter: Exploring spatiotemporal patterns by integrating visual analytics with a moving objects database system. ACM SIGSPATIAL 2011: 505-508. (Demo paper).
Mahmoud Sakr and Ralf Hartmut Güting: Spatiotemporal pattern queries. GeoInformatica 15(3): 497-540 (2011).
Mahmoud Sakr and Ralf Hartmut Güting: A New Approach for Spatiotemporal Pattern Queries in Trajectory Databases. Mobile Data Management 2010: 270-272. (Demo paper).
Mahmoud Sakr: Spatiotemporal Pattern Queries. VLDB PhD Workshop 2010.
Mahmoud Sakr and Ralf Hartmut Güting: Spatiotemporal Pattern Queries in Secondo. SSTD 2009: 422-426. (Demo paper).
