Czech Technical University in Prague Faculty of Electrical Engineering

DIPLOMA THESIS

Vojtěch Křížek

Intelligent Visualization of data from Multi-agent Simulations

Supervisor: Ing. Ondřej Vaněk, Department of Cybernetics

Prague, 2012

Prohlášení

Prohlašuji, že jsem předloženou práci vypracoval samostatně a že jsem uvedl veškeré použité informační zdroje v souladu s Metodickým pokynem o dodržování etických principů při přípravě vysokoškolských závěrečných prací.

V Praze dne 9. 5. 2012 . . . . . . podpis

Acknowledgements

I would like to thank Ing. Ondřej Vaněk and Ing. Michal Jakob, Ph.D. for enabling me to work in the field of web applications. I would also like to thank Bc. Ondřej Hrstka for helping me with the AgentC platform setup. Finally, thanks to Ing. Jan Faigl for providing me with the LaTeX template.

Abstract

The thesis deals with the design and implementation of a web-based application for visualizing large-scale data from multi-agent simulations. An overview of related web-based visualization applications and of available technologies is presented. Based on this overview, a system architecture is proposed, consisting of a modular filter framework on the server side and utilizing the jQuery framework on the client side. The filter framework uses the K-means, agglomerative hierarchical clustering, line clustering, and kernel density estimation algorithms, which enable fast-response visualization in a web browser. Both static and dynamic visualizations are supported, so a whole simulation can be replayed. The application is tested on different data sets to evaluate performance and accuracy.

Abstrakt

Tato diplomová práce se zabývá návrhem a implementací webové aplikace, která vizualizuje velké objemy dat z multiagentních simulací. Je vypracován přehled podobných aplikací pro webovou vizualizaci a dostupných technologií. Na základě tohoto přehledu je vytvořen návrh systémové architektury, která se skládá z modulárního filtrovacího frameworku na serverové straně a jQuery frameworku na klientské straně. Filtrovací framework používá algoritmy K-středů, aglomerativní hierarchické seskupování, čárové seskupování a odhad jádrové hustoty, které umožňují webovou vizualizaci velkého množství dat s rychlou odezvou. Je podporována statická i dynamická vizualizace, a proto je možné znovu přehrát uloženou simulaci. Výkonnost a přesnost aplikace je otestována pomocí různých datových souborů.

Contents

1 Introduction
2 Problem
   2.1 Specification
      2.1.1 Assigned Use Cases
   2.2 Requirements
   2.3 Related Work
      2.3.1 General Concept
      2.3.2 Implemented Systems and Frameworks
      2.3.3 Technologies
3 Analysis
   3.1 Multi-agent Simulation Overview
      3.1.1 AgentC and AgentPolis
   3.2 Evaluation of Available Technologies and Tools
      3.2.1 Data Formats Comparison
      3.2.2 Databases Comparison
      3.2.3 Javascript Frameworks Comparison
      3.2.4 Map Frameworks Comparison
      3.2.5 Web Technologies
4 Solution
   4.1 Selected Technologies and Libraries
   4.2 System Architecture
      4.2.1 Database Schema
      4.2.2 Input Format
      4.2.3 Server Architecture
      4.2.4 Data Processing
      4.2.5 Client Structure
   4.3 Implemented Algorithms
      4.3.1 Map Projection System
      4.3.2 K-means
      4.3.3 Line Clustering Algorithm
      4.3.4 Agglomerative Hierarchical Clustering
      4.3.5 Heat Map Generation
5 Evaluation
   5.1 Performance
      5.1.1 Random Data
      5.1.2 Real Data
   5.2 Line Clustering Algorithm Evaluation
   5.3 Technical Problems and Limitations
6 Conclusion
Appendix A Code Listing
Appendix B User Guide
Appendix C Contents of the CD

List of Tables

3.1 Fields of Every Event
3.2 Fields of Some Events
3.3 Comparison of Data Formats
3.4 Comparison of Databases
3.5 Comparison of Javascript Frameworks
3.6 Comparison of Map Frameworks
4.1 Specified Input Format Fields
4.2 Web Server Structure
5.1 Size Comparison of Testing Data Files
5.2 Speed Comparison of Points of Testing Data Files
5.3 Speed Comparison of Traces of Testing Data Files
5.4 Size Comparison of Real Data Files
5.5 Speed Comparison of Points of Real Data Files
5.6 Speed Comparison of Traces of Real Data Files
5.7 Comparison of Line Clustering Algorithm – Points
5.8 Comparison of Line Clustering Algorithm – Lines
5.9 Comparison of Line Clustering Algorithm – Distance
C.1 CD Path Structure

List of Figures

2.1 Google Fusion Tables – Map Visualization of Imported Data
4.1 System Architecture Diagram
4.2 Database Schema – E-R Diagram
4.3 Filter Architecture – Object Model and Data Flow
4.4 Screenshot – Static Data Visualization – Map Layer with Traces, the Heat Map and Events (left) and Chart of Events by Days of the Week (right)
4.5 Screenshot – Dynamic Data Visualization – Vessel Traces and Events
4.6 Diagram of Page Rendering in a Web Browser
4.7 Diagram of Data Flow After User's Action
4.8 Mercator Projection – Illustration of Cylindric Projection of Earth (based on [49])
4.9 Screenshot – Mercator Projection Example
4.10 Trace Interpolation Diagram
4.11 Centroids Initialization Points – Map Tile in Mercator Projection
4.12 Screenshot – Generated Heat Map Visualization
5.1 Visualization of Speed Comparison of Points Data
5.2 Visualization of Speed Comparison of Traces Data

List of Algorithms

2.1 Syslog Example (Apache HTTP Server) [46]
4.1 Web Page Initialization Pseudocode
4.2 Line Clustering Algorithm Pseudocode
4.3 Agglomerative Hierarchical Clustering Pseudocode
A.1 Input Data Format Demonstration Code in JSON Formatting
A.2 SQL Database Tables Generation Script

List of Abbreviations

AJAX   Asynchronous JavaScript and XML
CSS    Cascading Style Sheets
CSV    Comma-Separated Values
GIS    Geographical Information System
GML    Geography Markup Language
GUI    Graphical User Interface
HTML   HyperText Markup Language
HTTP   HyperText Transfer Protocol
ISO    International Organization for Standardization
JDBC   Java Database Connectivity
JSON   JavaScript Object Notation
KML    Keyhole Markup Language
PHP    PHP: Hypertext Preprocessor (recursive acronym)
RDF    Resource Description Framework
REST   Representational State Transfer
RSS    RDF Site Summary
SDK    Software Development Kit
SVG    Scalable Vector Graphics
URL    Uniform Resource Locator
WKT    Well-Known Text
WMS    Web Map Service
XML    Extensible Markup Language

Chapter 1

Introduction

Multi-agent simulations produce large amounts of data which mostly contain a geographical position or a trace. Data are usually visualized directly from a simulation, but there are no reasonable general visualization platforms which can compare simulation runs against each other or even reconstruct the simulation flow.

Web technologies have undergone a huge technological shift in recent years. It is now possible to play 3D games, render PDF documents, or run offline applications directly in a web browser without any extensions. The application market is also changing rapidly nowadays, which is noticeable in the growing segment of online applications that used to run on the desktop, such as e-mail clients, office suites, or games.

The huge performance improvement of web browsers and the multi-platform availability of web technologies are the main motivations to build a web-based application which can process and visualize recorded multi-agent data. Such an application is easy to use, thanks to the well-known web interface, and it can be run on any Java-compatible operating system.

The specific problem of this work is to process and visualize large amounts of data containing geo-spatial information. Data need to be filtered and aggregated before visualization to produce relevant results and to reduce the amount of input data. The technical limits of web browsers for visualizing these data need to be investigated as well. It is also necessary to find a way to export data visualized in a web browser for future use, because web browsers currently do not support export of vector graphics.

The result of the work is a technology overview, algorithms for intelligent data processing, and the implementation of a web-based application that performs static and dynamic visualization of entities on a map as well as chart visualization of aggregated entities. The application also supports export of the visualized map or chart. The main challenge was to visualize a huge number of line segments; therefore a line clustering algorithm is proposed which significantly compacts vessel traces into a limited number of lines that can be reasonably visualized in a web browser.

Chapter 2 describes the work specification and assigns use cases. Chapter 3 analyzes the problem, related work, and available technologies. Chapter 4 describes the used technologies, the system architecture, and the implemented algorithms. Chapter 5 evaluates the work and measures the performance of the application. Chapter 6 concludes the work.

Chapter 2

Problem

This chapter describes the work specification (Section 2.1) and requirements related to the multi-agent systems which generate input data for the application.

The work must support importing of data from multi-agent systems, static and dynamic visualization of large amounts of geo-spatial data, and chart visualization of filtered and aggregated data. It must also support exporting of visualizations and generation of a heat map.

Static visualization of geo-spatial data should support displaying of clustered elements such as points or lines. Dynamic visualization should be able to reconstruct a recorded simulation, including the events that occurred and vessel movements. Both visualizations should be able to restrict the visualized data by a time range or by an exclusion pattern.

Pie charts, bar charts, and line charts should be supported as chart visualizations. It should also be possible to restrict the visualized data and to display data by a selected cluster of events. It should be possible to display chart data in a table or in a list as well. Finally, visualization of risk along a vessel trajectory should be implemented.

2.1 Specification

The work aims to develop an application for the visualization of data from multi-agent systems. It consists of several parts:

• studying existing solutions for visualization of geo-spatial data,

• familiarizing with the multi-agent platforms AgentC1 and AgentPolis2,

• defining a unified data format for data import from multi-agent platforms,

• proposing intelligent data aggregation and filtering for large amounts of data,

• designing an application architecture,

• implementing the application,

• visualizing and validating the application on a test data set from a multi-agent platform.

2.1.1 Assigned Use Cases

Particular use cases, derived from the problem description with respect to the AgentC1 platform, are assigned to make the work specification precise. The application must support the following use cases:

• show risk along a trajectory for a vessel,

• show the number of merchant vessels in April in the Gulf of Aden,

• show average speed and highlight a trajectory of a vessel,

• render a chart of ratios of events for each month for a given year,

• render a bar chart of events grouped by areas,

• display a heat map generated from events and show its legend,

• show events related to a vessel,

• ability to interpolate a captured trajectory,

1AgentC, http://agents.felk.cvut.cz/projects/agentc/
2AgentPolis, http://agents.felk.cvut.cz/projects/agentpolis/

• reconstruct the simulation flow and display it in a web browser,

• restrict processed data with a time window.
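The time-window restriction from the last use case amounts to a simple predicate over event timestamps. A minimal Javascript sketch follows; the event shape and function name are illustrative assumptions, not the application's actual API:

```javascript
// Illustrative sketch: restrict a list of simulation events to a time window.
// The field names (simulationTime, eventType, eventSource) are assumptions.
function filterByTimeWindow(events, fromMs, toMs) {
  return events.filter(function (e) {
    return e.simulationTime >= fromMs && e.simulationTime <= toMs;
  });
}

var events = [
  { simulationTime: 1000, eventType: "Anchored", eventSource: "Ship 3" },
  { simulationTime: 5000, eventType: "Hijack",   eventSource: "Pirate 1" },
  { simulationTime: 9000, eventType: "Released", eventSource: "Ship 3" }
];

// keep only events between 2 and 9.5 simulated seconds
var window_ = filterByTimeWindow(events, 2000, 9500);
```

In the application itself such a restriction would run on the server before the data reach the browser, so that the client never receives more than it can display.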

The application should support a filtering and aggregation mechanism to produce the data described by the assigned use cases. It should also support visualization of different graphical entities such as points, lines, or raster images (for a heat map). Interactivity should be provided via Javascript handlers on the entities displayed in a web browser.

2.2 Requirements

This section describes the criteria for selecting technologies and software tools. Comparisons between technologies and tools are made according to the specified requirements as well. All tools are evaluated using the following general criteria:

• speed performance – computational overhead of a tool,

• documentation – availability of documentation, tutorials and user support,

• customization – possibility to extend evaluated tool itself or via plug-ins,

• Java support – whether a tool can be run by or within a Java application,

• library or code size – evaluates code size and tool effectiveness.

Some tools are also evaluated using specific criteria such as:

• embeddable – whether a tool can be embedded into a Java application,

• size variability – how much a data format's size varies with data sparseness,

• custom shapes – drawing support of custom objects for a map visualization,

• speed of debugging – how quickly the produced Javascript program can be inspected during development,

• software development kit (SDK) – whether a Javascript framework is provided with an SDK.

2.3 Related Work

This section describes available related applications and tools that deal with geo-spatial data. The term Geographical Information Systems (GIS) is used for these applications. GIS overviews (Section 2.3.1) and related GIS implementations (Section 2.3.2) are described.

The GIS field is well studied, so many publications and articles dealing with geo-data related problems exist. Especially in geo-visualization we can see a technological shift from pre-rendered maps to maps generated dynamically in a user's web browser. The shift can be illustrated by articles [50] (Section 2.3.2) and [27] (Section 2.3.1).

2.3.1 General Concept

The articles below introduce problems and issues related to GIS. They also analyze and describe available technologies and techniques.

MacEachren and Kraak [27] describe characteristics and issues of geo-data and geo-visualization. They mention that around 80% of produced data are geo-data (including geo-referenced data like addresses or zip codes). The article introduces research topics in four areas: representation, visualization, interfaces, and cognitive and usability issues. Because the article was published in 2001, many topics have been solved since then (e.g. 3D complex data visualization, user-centric tools for geo-visualization, system architecture for geo-spatial data visualization). It nicely illustrates the progress of GIS.

Schnabel and Hurni [37] give a GIS overview. It contains a comparison of web browsers supporting SVG (Scalable Vector Graphics), geo-formats (GML – Geography Markup Language, KML – Keyhole Markup Language, GeoJSON – Geo-JavaScript Object Notation, GeoRSS – Geo RDF3 Site Summary, WKT – Well-Known Text), rich internet application formats, and map frameworks. It also mentions 3D map visualization tools. It concludes that mobile devices with multitouch support will provide a better user experience for map visualizations, which is becoming true nowadays.

Andrienko and Gatalsky [1] present a review of spatio-temporal visualization. It describes techniques used for data visualization such as querying, animation, and linked views. It also analyzes common GIS tasks with respect to time. It describes two types of data analysis tasks: (a) when → what + where, (b) what +

3RDF – Resource Description Framework

where → when. The first task (a) identifies the behavior of an analyzed object in a defined time interval; it can be illustrated by a map animation. The second task (b) searches for a time interval when a behavior occurs; it can be illustrated by a comparison of time-series graphs of two entities.

2.3.2 Implemented Systems and Frameworks

The articles listed below describe architectures of proposed tools and define problems related to GIS.

Fang and Feng [12] propose a web-based GIS framework for data sharing. It consists of four levels – an application layer, a service layer, a function layer, and a storage layer. The application layer is mainly based on the OpenLayers project4, the service layer uses the Google Web Map Service (WMS)5 or Yahoo WMS6, the function layer uses TileCache7, GeoServer8 and GeoTools9, and the storage layer is based on PostgreSQL10 with the PostGIS extension11. The framework uses the Java language and, thanks to GeoTools, it conforms to the OGC specifications12. Data between layers are passed in the XML format according to the OGC specifications.

Chan et al. [6] introduce Vispedia – a web-based system for geo-data visualization and exploration. It uses Wikipedia's13 data, which are visualized and integrated dynamically. A user can then interactively explore the visualized data, using semantic graphs for relations between queried data.

Delort [10] describes a technique for visualizing large geo-data sets in a web browser. It is based on building a hierarchical clustering tree that is cut and visualized with respect to the map zoom level. It also incorporates constraints to avoid overlapping points. The evaluation shows that processing 5,000 points takes 2,182 seconds on a dual-core CPU with 2 GB RAM. Because of this bad scalability, it proposes pre-clustering the data with the K-means algorithm.

Shahabi et al. [41] describe the GeoDec framework for geo-data visualization and querying. It is based on three tiers – presentation, query-interface, and data. It

4OpenLayers, http://openlayers.org/
5Google WMS, http://developers.google.com/maps/
6Yahoo WMS, http://developer.yahoo.com/maps/
7TileCache, http://tilecache.org/
8GeoServer, http://geoserver.org/
9GeoTools, http://www.geotools.org/
10PostgreSQL, http://www.postgresql.org/
11PostGIS extension, http://postgis.refractions.net/
12OGC – Open Geospatial Consortium, http://www.opengeospatial.org/standards
13Wikipedia, http://www.wikipedia.org/

adds the query-interface tier in comparison to an ordinary GIS. This layer helps to customize user queries by adding temporal and spatial bounds to database queries (data tier). Queries are created visually in the graphical user interface (GUI).

Dörk et al. [11] describe VisGets – a web-based tool for geo-data visualization. It combines temporal, spatial, and topical data filters to visualize news items from RSS14 feeds. It consists of a client part (visualization and interaction) and a server part (data processing and filtering). The client part is built on top of the HTML (HyperText Markup Language), CSS (Cascading Style Sheets), and JavaScript languages, and it uses the jQuery framework15 and the Google Maps API16. The client part interacts with the server by HTTP (HyperText Transfer Protocol) POST requests. Results are obtained in the JSON (JavaScript Object Notation) format from the server part, which is built on top of the PHP (PHP: Hypertext Preprocessor) language and uses the CakePHP framework17 and the MySQL database server18.

Boulos et al. [5] describe the visualization of real data from the WHO (World Health Organization) in the Microsoft Live Labs Pivot technology19. It also shows a Google Public Data Explorer20 visualization of the same data. The article also shows different types of visualizations and combines multiple data sources.

Qu et al. [35] describe a spatial web service client. It is based on Microsoft WMS (Bing Maps)21. The client is used for visualization of data from the NASA Spatial Web Portal22. It supports various domains and can visualize data in two modes: static and dynamic.

Granell et al. [22] describe a web-based mashup for water resource applications. It uses the Google Maps API16 for visualizations, and data are obtained from the server in the KML23 file format. The system architecture modules follow the OGC specifications12.

Long [26] describes a web-based trajectory visualization using open-source libraries. It aims to visualize iceberg movements around Antarctica.
It uses PHP and PostgreSQL10 on the server side and the timemap.js24 dynamic map

14RSS – RDF (Resource Description Framework) Site Summary
15jQuery, http://jquery.com/
16Google Maps API, http://developers.google.com/maps/
17CakePHP, http://cakephp.org/
18MySQL, http://www.mysql.com/
19Microsoft PivotViewer, http://www.microsoft.com/silverlight/pivotviewer/
20Google Public Data Explorer, http://www.google.com/publicdata/directory
21Bing Maps API, http://www.microsoft.com/maps/developers/
22NASA Spatial Web Portal, http://wms.gmu.edu/SWPortalBing/
23KML – Keyhole Markup Language
24timemap.js, http://code.google.com/p/timemap/

visualization library on the client side. Data from the server to the client are transferred in the KML23 file format.

Wiesel et al. [50] published an article in 1996 which describes map visualization using HTML 2.0 and a dynamic map. It also describes a problem with map interactivity. Despite the article's age, the proposed system architecture is very similar to current GIS.

2.3.3 Technologies

Google Fusion Tables

Google Fusion Tables (GFTs) are described in articles [18] and [19] and are accessible at https://www.google.com/fusiontables/. GFTs work out-of-the-box; they have visualization templates and a simple-to-use interface. They provide table, map, line, bar, and timeline data visualizations. Data can be imported from a CSV (Comma-Separated Values) file, a text file, or a KML23 file, and can be aggregated or filtered. They do not support dynamic visualizations, clustering of data points, or more sophisticated chart visualizations. Figure 2.1 shows a map visualization of a data sample. A user cannot add a new custom tile layer or generate a custom heat map layer. According to Section 2.1.1, GFTs cannot be used as a visualization platform for data from multi-agent systems.

The Syslog Protocol

The Syslog Protocol [17] has been analyzed because of the similar structure and processing of data. A typical work-flow is that data are generated by some source, then written to a file in the Syslog format, and finally processed automatically from the file. The protocol is mainly used in Unix operating systems (Linux, Mac OS X) for logging application events, such as user log-ins, web page visits, or external device connections, to the log file. Every application uses a different data or message format, so it is problematic to parse the logged data. An example of a web image request event in the Syslog format is shown in Algorithm 2.1.

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

Algorithm 2.1: Syslog Example (Apache HTTP Server) [46]

From the analysis of this protocol we can identify useful fields like Priority, Timestamp (ISO formatted [14]), Hostname, App name, Proc ID (a unique number of the running process where an event occurred), Structured data (optional additional data in a structured format – RFC 5234 [7]), and Msg (description text).
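The parsing problem mentioned above can be illustrated on the Apache log line from Algorithm 2.1. The following Javascript sketch is illustrative only; the regular expression covers just the fields used here and is not a complete grammar for Apache's combined log format:

```javascript
// Illustrative parser for the beginning of an Apache access-log line.
// Fields: host, ident, user, [timestamp], "request", status, bytes.
function parseApacheLine(line) {
  var re = /^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+|-)/;
  var m = re.exec(line);
  if (!m) return null; // line does not match the expected shape
  return {
    host: m[1],
    user: m[3],
    timestamp: m[4],
    request: m[5],
    status: parseInt(m[6], 10),
    bytes: m[7] === "-" ? 0 : parseInt(m[7], 10)
  };
}

var line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] ' +
           '"GET /apache_pb.gif HTTP/1.0" 200 2326';
var parsed = parseApacheLine(line);
```

A different application's log format would need a different expression, which is exactly why a unified input format (Section 2.1) is preferable to parsing free-form logs.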

Figure 2.1: Google Fusion Tables – Map Visualization of Imported Data

Chapter 3

Analysis

This chapter describes multi-agent simulations, analyzes available technologies, and selects among them according to defined criteria.

3.1 Multi-agent Simulation Overview

First we need to define what an agent and a multi-agent system are. Russell and Norvig [36] describe an agent as ". . . anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators". Assume that we have an environment (system) with a set of agents whose settings and tasks are defined by ourselves. With respect to Shoham and Leyton-Brown [42]: ". . . the agent will need to embody your knowledge of other similar agents with which it will interact (e.g., agents who might compete with it in an auction, or agents representing store owners)—including their own preferences and knowledge. A collection of such agents forms a multi-agent system."

According to Davidsson [9], "almost every simulation model can be seen as a specification of a system in terms of states and events". Simulations are divided into time driven (the simulation time advances in constant steps) and event driven (the simulation time step is based on the next event occurrence).

Events are the most important data carriers of every simulation. A part of a simulation where no event occurs is generally not important for an observer. Because of that,

events are visualized, logged, and used for post-simulation analysis. In the maritime domain, an event can be, for example, a vessel movement, a weather change, or the hijacking of a vessel.
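The difference between the two engine types can be sketched as two scheduling loops. This is illustrative pseudocode in Javascript, not taken from either platform:

```javascript
// Time-driven engine: simulation time advances in constant steps and every
// agent is polled each step, whether anything happened or not.
function runTimeDriven(agents, stepMs, endMs) {
  var ticks = 0;
  for (var t = 0; t < endMs; t += stepMs) {
    agents.forEach(function (a) { a.update(t); });
    ticks++;
  }
  return ticks;
}

// Event-driven engine: simulation time jumps directly to the next scheduled
// event, so idle periods cost nothing. The queue is assumed to be sorted by
// event time (a priority queue in practice).
function runEventDriven(queue) {
  var processed = 0;
  while (queue.length > 0) {
    var ev = queue.shift();
    ev.fire(ev.time);
    processed++;
  }
  return processed;
}
```

In both cases the fired events are what gets logged, which is why the event record described next is the natural unit of the visualization input.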

3.1.1 AgentC and AgentPolis

The AgentC platform has a time-driven simulation engine [24] and the AgentPolis platform has an event-driven simulation engine [28]. Both platforms log events and traces of agents (vessels, cars, pedestrians, etc.). Every event contains the data described in Table 3.1. Some events contain additional data, which are described in Table 3.2.

Field            Type       Description
simulation time  timestamp  time in a simulation
event type       string     description of event type (e.g. Hijack, Anchored)
event source     string     description of event source (e.g. Pirate 1, Ship 3)

Table 3.1: Fields of Every Event

Field         Type           Description
GPS location  GPS            latitude, longitude and optionally elevation
GPS trace     list of GPSes  list of GPS locations
description   string         description of an event

Table 3.2: Fields of Some Events

On each platform, events additionally contain custom data fields with text or numerical data. Every simulation run is initialized with a unique random seed number that helps to reproduce the simulation result.
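Combining Tables 3.1 and 3.2, a single logged event might look like the following object. The field names and values are illustrative assumptions; the actual input format is specified later in the work:

```javascript
// Illustrative event record: mandatory fields (Table 3.1) plus the optional
// GPS and description fields (Table 3.2). All names here are assumptions.
var event = {
  simulationTime: "2012-04-01T12:30:00Z", // timestamp within the simulation
  eventType: "Hijack",
  eventSource: "Pirate 1",
  location: { lat: 12.5, lon: 47.8 },     // optional GPS location
  trace: [                                // optional list of GPS locations
    { lat: 12.4, lon: 47.6 },
    { lat: 12.5, lon: 47.8 }
  ],
  description: "Merchant vessel boarded", // optional free-text description
  customData: { vesselSpeedKn: 14 }       // platform-specific custom fields
};

var serialized = JSON.stringify(event);
```

Serializing such records as JSON is what the format comparison in the next section evaluates.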

3.2 Evaluation of Available Technologies and Tools

Available technologies and tools are analyzed and compared in this section. In the first part, tools which can be used for the implementation are compared. Criteria are selected with respect to the requirements (Section 2.2), and the tools are compared relative to each other. In the second part, other technologies used in the implementation are described.

3.2.1 Data Formats Comparison

This section analyzes and compares data formats for exchanging data between multi-agent platforms and the application. The Comma-Separated Values (CSV) [40], Extensible Markup Language (XML) [52], and JavaScript Object Notation (JSON) [8] data formats are compared in Table 3.3.

Data Format          CSV   XML   JSON
Speed Performance    ●●●   ●     ●●
Documentation        ●●●   ●●●   ●●●
Object Expressivity  ●     ●●●   ●●
Java Libraries       ●●●   ●●●   ●●●
Data Size Overhead   ●●●   ●     ●●
Size Variability     ●●    ●●●   ●●●

● – bad/no, ●● – average, ●●● – good/yes

Table 3.3: Comparison of Data Formats

Speed performance is best for CSV, JSON has good performance, and the XML file format has poor performance. Because all the formats have been standardized for a long time, very good documentation is available for them, and many Java libraries exist for all of them. The XML file format has the best expressivity of the analyzed formats, which also causes the biggest size overhead. The JSON file format can in most cases be rewritten to XML and vice versa, but it has smaller expressivity than XML and therefore a smaller size overhead. A CSV file is just a matrix representation, so it cannot be used efficiently for object storage, but it has the smallest size overhead.

The JSON file format is selected because of its reasonable performance and data size. For the purposes of this work it has the same expressivity as the XML file format, but a much smaller size overhead.
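The size-overhead ranking can be checked on a single record by encoding the same event by hand in all three formats. These encodings are illustrative, not the application's actual schemas:

```javascript
// The same event encoded three ways: CSV carries no field names, JSON
// repeats each key once per record, XML repeats every tag name twice.
var csv  = '2012-04-01T12:30:00Z,Hijack,Pirate 1';
var json = '{"time":"2012-04-01T12:30:00Z","type":"Hijack","source":"Pirate 1"}';
var xml  = '<event><time>2012-04-01T12:30:00Z</time><type>Hijack</type>' +
           '<source>Pirate 1</source></event>';
```

Over millions of logged events this per-record difference dominates the file size, which is why the CSV < JSON < XML ordering in Table 3.3 matters in practice.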

3.2.2 Databases Comparison

This section analyzes the databases compared in Table 3.4: PostgreSQL [34], MySQL [32], SQLite [48], and H2 [23].

Database           PostgreSQL  MySQL      SQLite     H2
Speed Performance  ●●          ●●         ●          ●●●
Documentation      ●●●         ●●●        ●●●        ●●●
GIS Extension      ●●●         ●●●        ●●●        ●●
Java Support       ●● (JDBC)   ●● (JDBC)  ●● (JDBC)  ●●● (native)
Program Size       ●●          ●          ●●●        ●●●
Embeddable         ●           ●          ●●●        ●●●

● – bad/no, ●● – average, ●●● – good/yes

Table 3.4: Comparison of Databases

Database engine performance has been analyzed in [29]. The fastest database is H2, PostgreSQL is second, MySQL third, and SQLite last. All the databases have very good documentation. GIS extensions are available for all of them from 3rd-party providers; the H2 database with a GIS extension is provided only as a part of the OrbisGIS project1. All the databases can be connected to a Java application by a JDBC connector. H2 is a pure Java-based database, so its connector does not need any data conversion. The embeddable databases SQLite and H2 are also the smallest, whereas the MySQL database has a huge installed size (approx. 500 MB).

The H2 database is selected as the data storage because of its native Java support and very good speed performance.

3.2.3 Javascript Frameworks Comparison

Javascript frameworks are described in this section. Other technologies with similar functionality exist, such as Adobe Flash2 or Microsoft Silverlight3, but they are proprietary and do not provide free software development kits (SDKs).

1H2 spatial, http://trac.orbisgis.org/t/wiki/H2spatial/Download
2Adobe Flash, http://www.adobe.com/products/flash.html
3Microsoft Silverlight, http://www.microsoft.com/silverlight/

The Google Web Toolkit (GWT) framework [21] and the jQuery framework [47] are selected for comparison. The GWT framework enables a developer to write application code in Java and XML; the written code is then compiled into the Javascript code which the client runs. JQuery is a pure Javascript framework that gives the developer easy access to low-level functions. Other pure Javascript frameworks exist, such as Prototype4, Dojo5 or the YUI Library6, but they provide similar functionality to jQuery, so they are not analyzed or considered against GWT.

In Table 3.5, GWT and jQuery are compared. The speed performance of GWT is very good because of code optimization during the compilation phase. JQuery is slower in general, but this also depends on the developer's programming style. The documentation of both tools is very good. GWT is more problematic to customize because of the code compilation, which also affects the usage of some plugins with it. The size of the produced code is bigger with GWT than with jQuery. GWT is provided with an Eclipse SDK; jQuery, as a native Javascript library, has no similar SDK. The main problem of GWT is the speed of debugging, because the Java code is compiled to Javascript on each request during development.

Framework           Google Web Toolkit  jQuery
Speed Performance   ●●●                 ●●
Documentation       ●●●                 ●●●
Code Customization  ●●                  ●●●
Plugins             ●●                  ●●●
Size                ●●                  ●●●
SDK                 ●●●                 ●
Speed of Debugging  ●                   ●●●

● – bad/no, ●● – average, ●●● – good/yes

Table 3.5: Comparison of Javascript Frameworks

The jQuery Javascript framework is selected because of its very good speed of debugging and great customization possibilities. It also works with the selected map framework without any problems.

4Prototype, http://www.prototypejs.org/
5Dojo, http://dojotoolkit.org/
6YUI Library, http://developer.yahoo.com/yui/

3.2.4 Map Frameworks Comparison

The Google Maps API [20], OpenLayers [31], and Polymaps [43] map frameworks are analyzed in this section. The comparison is shown in Table 3.6.

Framework           Google Maps API  OpenLayers  Polymaps
Speed Performance   ●●               ●●●         ●●●
Documentation       ●●●              ●●●         ●●
Code Customization  ●                ●●          ●●●
Custom Tiles        ●●●              ●●●         ●●●
Custom Shapes       ●●               ●●●         ●●●
Size                ●●               ●           ●●●

● – bad/no, ●● – average, ●●● – good/yes

Table 3.6: Comparison of Map Frameworks

The speed performance of the Google Maps API is lower because it needs to load the Javascript library directly from Google's web site on every page refresh. The documentation of Polymaps is very brief in comparison with the other frameworks. The OpenLayers framework has been developed since 2006; it therefore has the biggest code size, and such a big code base is also hard to customize. The Google Maps API cannot be customized because its source code is proprietary, which also limits the usage of customized shapes visualized on a map. The Polymaps framework is the most modern and also the smallest. It gives the developer direct access to SVG objects (like tiles or points), which helps to customize them in any way.

The Polymaps framework is selected because of its light-weight, customizable, and clear architecture. It also works with other Javascript frameworks and tools without any problems.

3.2.5 Web Technologies

This section describes the specifics of the technologies used in the application. The main focus is on pointing out the differences between the generally known possibilities of web technologies and the latest technology updates.

HTML

The Hypertext Markup Language (HTML) is a generally well-known standard for coding web pages. The latest version, HTML5 [53], is still a draft, but it is already supported by web browsers such as Google Chrome or Mozilla Firefox. The standard extends HTML4 by adding new form input types (date, number, etc.), multimedia elements (video, audio), a customized drawing element (canvas), off-line support, etc. This work uses especially the new form input types, custom element data attributes and file drag & drop support.

CSS

The Cascading Style Sheets (CSS) standard [55] is as well known as HTML. The draft version of CSS3 introduces two important features: color effects and selectors. Color effects include object shadows, background color transitions, semi-transparent colors and animated style transitions. Selectors make it possible to select more precisely specified objects, such as every second table row, all direct children of an element, or all elements of a given class except those with a given other class. This reduces the amount of code needed to achieve the same functionality as with CSS2. This work uses selectors and background color transitions.

SVG

Using the Scalable Vector Graphics (SVG) standard [54], it is possible to draw vector objects in a web browser. The standard supports typical shapes (circle, text, rectangle, line) and also custom objects, which are drawn using polygons or polylines. It also supports gradients and other color effects. The standard is mostly used by map visualization frameworks and for the visualization of charts. In this work, SVG is used for chart and map visualization; its objects are built via Javascript in a web browser.

AJAX

AJAX (Asynchronous JavaScript and XML7) [16] is neither a standard nor a technology, but a model of how to communicate and transmit data between a web browser and the server. The main difference is that a traditional dynamically generated page is built on the server and transferred to the client as a whole file, whereas with the AJAX model a static web page is transferred in the beginning and then only the requested data are transferred. AJAX communication between the client and the server is used in this work. This results in a fast response to user interaction and also enables the visualization of dynamic data such as vessel traces.

REST

Representational state transfer (REST) [13] was introduced in 2000 as a concept of communication between the client and the server. REST utilizes the HTTP8 protocol and the URL9 address schema to achieve high functionality with low effort. REST is currently used in connection with AJAX requests and JSON10 or XML7 responses.

7XML – Extensible Markup Language
8HTTP – HyperText Transfer Protocol
9URL – Uniform Resource Locator
10JSON – JavaScript Object Notation

Chapter 4

Solution

This chapter describes the implemented system architecture and the used algorithms, which are realized in the Filter module.

4.1 Selected Technologies and Libraries

The application consists of a server part and a client part. The technologies used for each part are selected according to the requirements (Section 2.2). The server part is written in the Java programming language. It uses Jetty as a web server engine and H2 as a database engine. The server part also uses essential libraries such as Batik1 for SVG to PNG/PDF conversion, Apache Commons2 for extending Java functionality and Joda Time3 for better date and time handling. The client part is written in HTML5, CSS3 and Javascript. It uses the jQuery Javascript framework and Polymaps as a map framework. The client part also uses the D3.js4 framework, a data-driven Javascript library, for chart generation and for great circle arc rendering.

1Batik, http://xmlgraphics.apache.org/batik/
2Apache Commons, http://commons.apache.org/
3Joda Time, http://joda-time.sourceforge.net/
4D3.js, http://mbostock.github.com/d3/

4.2 System Architecture

The high-level diagram of the system architecture is shown in Figure 4.1. As mentioned in the previous section, the system consists of a server part and a client part. The server's main component is the System Core, a module transferring data between the storage and the system interfaces. The storage is divided into two parts – the Database Engine, a persistent data storage, and the Memory Cache, which speeds up server responses to repeated requests. The Filter Computer and the Filter Convertor and Validator are very important server components, because they manage filters, build SQL queries and convert the resulting data into the JSON format. Filters are further described in Section 4.2.3. The server also contains the Map Tile Caching Proxy, the SVG to PDF/PNG Converter and the Input Parser and Converter, which imports data into the database storage. The client consists of modules that act as an interface to the server. The Map Visualizer module fetches layers of a map with a varying level of detail. These layers are visualized through the Data Visualizer module. The visualized data are selected according to a filter chosen in the Filter Editor module. Data can be imported by the Data Uploader module, and maps or charts can be exported by the PDF/PNG Exporter module.

4.2.1 Database Schema

The H2 embedded database engine is used for storing data. The diagram of the database tables is shown in Figure 4.2; the SQL table generation script is in Appendix A, Algorithm A.2. The database consists of tables for data entries and their optional values, a heat map and filters. It also contains tables for storing user-selected areas (polygons), chart colors, computed data caches and configurations. SQL operations are realized using the JDBC technology. SQL requests are built for each call or created from prepared statements. The Java Persistence API is not used because of serious performance issues [25].

Figure 4.1: System Architecture Diagram

Figure 4.2: Database Schema – E-R Diagram

4.2.2 Input Format

The input format for exchanging data between a multi-agent platform and the application is inspired by the Syslog protocol (Section 2.3.3) and is created with respect to multi-agent platforms (Section 3.1.1). It uses the JSON file format (Section 3.2.1), which contains the fields defined in Table 4.1.

Field      M*  Type       Description
time       ✓   string     real time in the ISO 8601 date and time format [14]
host       ✓   string     IPv4 address, IPv6 address or hostname
app        ✓   string     application name
uid        ✓   string     unique application run ID
type       ✓   string     data entry type or event type
source     ✓   string     data entry source or event source
sTime      ✓   string     simulation time, same format as the time field
center     ✓   doubles    center point of an event, format: [lat, lon, elev] or [lat, lon]
waypoints  ✓   string, doubles   waypoints (trace) of an event, formatted as an array of vectors, each consisting of a timestamp and a point: [[timestamp, lat, lon, elev], [timestamp, lat, lon], ...] (the timestamp is like the time field, the rest like the center field)
text       ✓   string     HTML formatted information text
data       ✓   –          data container for formatted optional data
↪ key      ✓   string     name of a data field, expected characters: [a-zA-Z0-9 ]+
↪ value    ✓   boolean, double or string   value of a data field

*) M = Mandatory Field
Table 4.1: Specified Input Format Fields

An example of a file with exported demonstration data is in Appendix A, Algorithm A.1. A light-weight library, derived from the application code and written in Java, is used for exporting data from existing projects.
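As an illustration of the format, a minimal exporter for a few mandatory fields of Table 4.1 might look as follows in Java; the class and method names are hypothetical and do not come from the actual export library:

```java
// Hypothetical sketch of a log-entry serializer for the input format in
// Table 4.1. The real export library shipped with the application may differ.
public class EntryExporter {

    // Builds one JSON object with the mandatory fields of the input format.
    public static String toJson(String time, String host, String app,
                                String uid, String type, String source) {
        return String.format(
            "{\"time\":\"%s\",\"host\":\"%s\",\"app\":\"%s\","
          + "\"uid\":\"%s\",\"type\":\"%s\",\"source\":\"%s\"}",
            time, host, app, uid, type, source);
    }
}
```

A call such as `EntryExporter.toJson("2012-05-09T12:00:00Z", "10.0.0.1", "agentc", "run-1", "event", "vessel")` then yields one entry of the exchange file; optional fields like center or waypoints would be appended the same way.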

4.2.3 Server Architecture

This section describes the detailed server structure. It also describes the processes that compute and convert raw data into a visualizable format.

Filter Architecture

The Filter module is the main part of the application. It has a modular architecture which supports building user-defined filters with pre-defined visualization types, e.g. a chart of the number of events clustered by the month of each event. The chart data result is defined via a FilterType object, the clustering operation is defined by an Operation object, the number of events by the first Value object argument, and the month of each event by the second Value object argument. To avoid mixing the data types of input and output values, each object has defined accepted and/or produced formats (arrays of Value or UniParent objects), similar to strongly typed languages like Java, except that it produces an array of formatted values instead of just one. All object instances are defined in the server part of the application and written in Java. A user can simply connect matching objects together and create a filter with a defined operation and custom arguments. Each stored filter, according to its diagram in Figure 4.3, consists of:

• the FilterType object instance, which describes the produced data format (i.e. chart, map) and the supported input values. It also contains a name, a type (e.g. ID), the Filter Computer (which describes the conversion of data to the JSON file format) and the Operation.

• the Operation object instance, which contains a name, an operation ID, a help text and the Operation Computer (which implements an algorithm that processes data according to the Operation input arguments). It also has output values, which are matched against the FilterType input values, and supported input values (arguments).

• the Value and Operation objects, which extend the functionality of the UniParent abstract object. The Operation object is described above. The Value object instance keeps a format, a sub-format and an assigned value (e.g. an events count operation, a center point field definition, the integer number 5).

Figure 4.3: Filter Architecture – Object Model and Data Flow

The UniParent object instance is assigned in place of the corresponding initial Operation input value. UniParent instances can be assigned recursively, until an instance with a supported output value format is added. This creates a tree structure whose root is an Operation and whose leaves are Values and Operations with no arguments. The WebApi Service processes requests, selects the corresponding FilterType object instance and creates the FilterEnvironment object instance, which holds values such as the time range window, visible bounds, etc. In the next step, the Filter Computer object instance is called with the FilterEnvironment object as its argument. The object is immediately passed on to the Operation Computer object instance. This continues until the FilterEnvironment object reaches a Value object instance, which returns a real value (a number, a vector of strings, etc.). When the Operation Computer has gathered all values, it computes a result. The result is then passed to its superior Operation Computer. When the superior is the Filter Computer, the result is converted to a JSON-ready object instance, which the WebApi Service converts into JSON. The JSON result is finally returned as the server HTTP response.
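The recursive evaluation described above can be sketched as a tiny tree of operations and values. The class names below (Node, Value, SumOperation) are simplified, hypothetical stand-ins for the thesis' UniParent/Value/Operation objects, not the actual application code:

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the recursive filter evaluation: a tree whose
// leaves are constant Values and whose inner nodes are Operations that
// compute a result from their children. All names are hypothetical.
public class FilterTree {

    interface Node { double compute(); }

    static class Value implements Node {          // leaf: holds a real value
        final double v;
        Value(double v) { this.v = v; }
        public double compute() { return v; }
    }

    static class SumOperation implements Node {   // inner node: an operation
        final List<Node> args;
        SumOperation(Node... args) { this.args = Arrays.asList(args); }
        public double compute() {                 // gather child values, then combine
            double s = 0;
            for (Node n : args) s += n.compute();
            return s;
        }
    }

    public static double evaluate(Node root) { return root.compute(); }
}
```

Evaluating `new SumOperation(new Value(2), new SumOperation(new Value(1), new Value(4)))` walks the tree exactly as described: leaves return real values, and each operation combines its gathered arguments before handing the result to its superior.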

Web Server Structure

The web server consists of data providers, which are described in Table 4.2. Client files (HTML, CSS, Javascript) are on the path /cli/*, which is mapped directly to a specified filesystem folder. The AJAX proxy and the Map tile caching proxy are bound to the /cli path. The most important data providers are mapped to the /web_api path. They are implemented with respect to the REST API (Section 3.2.5). The described data providers are mentioned in the following sections.

Path              Description
/status           Status information – CPU utilization and memory consumption
/kml_api          Google Earth events visualization
/cli              Client (GUI)
  /ajax_proxy     Request proxy – avoids cross-domain requests
  /bing_proxy     Map tile caching proxy
  /*              Other static client files (HTML, CSS, Javascript)
/web_api          REST API container
  /chart_data     Chart layer
  /data_api       Settings, filters, configurations, etc.
  /dynamic_api    Dynamic page
  /geo_data       Points layer
  /geo_heatmap    Heat map layer
  /lines_api      Lines layer
  /polygon_api    Polygon layer
  /svg_converter  SVG to PDF/PNG converter
  /upload         File uploader
  /viewer         Data viewer
  /*              Other static and JSP helper files

Table 4.2: Web Server Structure

4.2.4 Data Processing

This section describes the flow of data processing for each data type. A static visualization example of points, lines, a heat map and a bar chart is shown in Figure 4.4. A dynamic visualization example of interpolated traces is shown in Figure 4.5.

Points Data Processing

Clustered points are produced by a selected filter algorithm; filters are described in Section 4.2.3. Requests are split into parts which correspond to map tiles. The splitting helps improve response performance thanks to multi-threading. Results are not cached.

Line Processing

A result of clustered lines is produced by the following steps:

1. the layer is requested by the client,

2. the cache is checked for a saved record; if found, go to 6,

3. the vessel traces are fetched from the database,

4. the traces are clustered using Algorithm 4.2 from Section 4.3.3,

5. the result is saved to the cache,

6. the result is sent to the client.

Figure 4.4: Screenshot – Static Data Visualization – Map Layer with traces, the heat map and events (left) and Chart of events by days of the week (right)

Heat Map Processing

The heat map is generated from the Setup web page, where the generation parameters are defined. These parameters are then passed to the bivariate kernel density estimation algorithm from Section 4.3.5.

An overlay tile image is produced by the following steps:

1. a tile is requested,

2. the cache is checked for a saved tile image; if found, go to 6,

3. the tile image is generated from the heat map matrix (Section 4.3.5),

4. the tile image is converted to the PNG file format,

5. the PNG file is saved to the cache,

6. the PNG file is returned.

Dynamic Data Processing and Waypoints Interpolation

Data generated for the Dynamic web page are not cached. The data are produced according to the following steps:

1. a request with a defined time range arrives,

2. events within the given time range are found,

3. the samples5 closest to the time range are found,

4. the trace is interpolated to the corresponding time range (Section 4.3.1, equations 4.3 and 4.4),

5. the data are packed into the JSON format.

Chart Data Processing

Chart data are processed similarly to points data, using filters (Section 4.2.3). Requests can be without any arguments, or they can contain information about selected points (when a user clicks on a cluster). Results are not cached, except for the Spatial filter, whose computation takes a long time.

Tile Caching Proxy

The caching proxy of tiles from map providers is implemented to speed up the performance of map visualizations. It also helps to avoid reaching the daily quotas of map providers that limit the usage of publicly provided map tiles. The caching proxy is visualized in Figure 4.1.

5A sample is a point of a vessel trace with time information
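A minimal sketch of such a tile cache, assuming an in-memory LRU store keyed by zoom/x/y (the real proxy in the application may persist tiles differently):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of an in-memory tile cache such as the one behind the
// /bing_proxy path: tiles are keyed by zoom/x/y, and the least recently used
// entry is evicted once the capacity limit is exceeded.
public class TileCache {

    private final int capacity;
    private final Map<String, byte[]> cache;

    public TileCache(int capacity) {
        this.capacity = capacity;
        // an access-ordered LinkedHashMap gives LRU eviction almost for free
        this.cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > TileCache.this.capacity;
            }
        };
    }

    private static String key(int z, int x, int y) { return z + "/" + x + "/" + y; }

    public byte[] get(int z, int x, int y) { return cache.get(key(z, x, y)); }

    public void put(int z, int x, int y, byte[] png) { cache.put(key(z, x, y), png); }

    public int size() { return cache.size(); }
}
```

On a cache miss the proxy would fetch the tile from the map provider, store it with put, and serve later requests locally, which is what keeps the daily quotas from being reached.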

Figure 4.5: Screenshot – Dynamic Data Visualization – Vessel traces and Events

PDF/PNG File Generation

The PDF/PNG file generation is used for exporting visualized maps or charts. Thanks to the SVG standard, it is possible to export any SVG element. The exporting process is mainly done on the client part of the application; the final conversion into PDF or PNG is done on the server part. The process is described in the following steps:

1. SVG element extraction,

2. color fixing (because the CSS3 color value rgba() is not supported),

3. decoration of the element by adding view bounds and pixel size,

4. sending to the server,

5. conversion to PDF or PNG via the Batik toolkit6,

6. the produced file is offered for download in a web browser.

6Batik, http://xmlgraphics.apache.org/batik/

4.2.5 Client Structure

The structure of the client part of the application consists of a selection part and a web pages part. The detailed architecture is described in the following subsection, and the user interaction flow is described in the second subsection.

Architecture

The architecture of the client part is visualized in Figure 4.6. We can see that after a web browser loads all files, it executes the Parse URL process. When parsing is done, the requested web page is selected and initialized (the Select & Initialize page). Finally, the code is updated in the web browser. Web pages are used as modules which can be extended or removed. Each web page consists of HTML code, CSS styling and the module's logic in Javascript. Before a web page is returned to the web browser, an initialization function is called. The function is described in Algorithm 4.1.

if page is loaded then   // avoid duplicated page load
    updateDataFromServer()
else
    loadAndExecutePageLogicCode()
    loadDataFromServer()
    registerHandlers()   // Javascript handlers (onClick, etc.)
end if

Algorithm 4.1: Web Page Initialization Pseudocode

The implemented web pages (Figure 4.6) use connections to the server, which are described in Figure 4.1. Map Tiles, PDF/PNG Export and Visualized Data are used on the Home, Visio and Dynamic web pages. The Filter Editor is used on the Filters page, the Web-based Setup on the Settings page and the Uploaded Data File on the Upload page. The Data page just renders data from the system database using data-to-JSON conversion.

31 * .  $ ()*

iue46 iga fPg edrn naWbBrowser Web a in Rendering Page of Diagram 4.6: Figure $ %  &'$      % 

#$ % 

    

32                                                               +,-*

  %

       *                       !   "                

        User Interaction Flow

To support user interaction, the application contains many handlers which capture events created by a user. An example of the data flow after a user's interaction is shown in Figure 4.7. We can see from the figure that the user's action is processed and transformed into a request, which is realized by an AJAX call to the server. The web page in the web browser waits for a response. When the server produces the response (usually JSON-formatted data), it is converted to SVG elements for visualization. Finally, the corresponding chart is updated with the data requested by the user.

Figure 4.7: Diagram of Data Flow After User's Action

4.3 Implemented Algorithms

This section describes the algorithms used in the Filter module. We focus on a high-level description of their function and on implementation issues.

4.3.1 Map Projection System

The spherical Mercator map projection [33] is used for web-based map visualization by map tile providers (such as OpenStreetMap, Google Maps, Bing Maps). The Mercator projection is specific in its non-linear scaling: areas farther from the equator appear bigger than areas closer to it. As a result, Greenland appears the same size as Africa. This is the effect of projecting the Earth sphere onto a square via a cylinder (see Figure 4.8).

Figure 4.8: Mercator Projection – Illustration of the cylindrical projection of the Earth (based on [49])

The whole world (zoom level 1) contains four tiles of size 256×256 pixels, so the whole map has a size of 512×512 pixels [39]. This zoom level is captured in Figure 4.9. Zooming to the next level (zoom level 2) doubles the map size to 1024×1024 pixels, and the map contains sixteen tiles. The maximal zoom level is 23, with a resolution of 18.7 millimeters per pixel, which gives the map a resolution of over four gigapixels. Higher zoom levels are good for web-based visualization, because places are comparably scaled and the projection does not have high inaccuracy.

Figure 4.9: Screenshot – Mercator Projection Example

Conversion from a GPS location (latitude, longitude) = (ϕ, λ) to a pixel position (x, y) at an assigned zoom level z is defined by Equation 4.1 [39]. Note that latitude is ϕ ∈ ⟨−90°, 90°⟩ and longitude is λ ∈ ⟨−180°, 180°⟩. In the Mercator projection, latitude ϕ is approximately in the interval ⟨−85°, 85°⟩ due to value divergence around the poles.

x_norm = λ / 180°

y_norm = log tan(45° + ϕ/2) / π

n = 512 · 2^(z−1)   ← map pixel size

x = min(n − 1, max(0, x_norm · n/2 + n/2))

y = min(n − 1, max(0, n/2 − y_norm · n/2))   (4.1)
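Equation 4.1 can be sketched in Java as follows, assuming the map pixel size n = 512 · 2^(z−1) (i.e. a 512×512 pixel map at zoom level 1):

```java
// Sketch of the GPS-to-pixel conversion from Equation 4.1; the map pixel
// size n = 512 * 2^(z-1) is an assumption consistent with the 512x512 px
// world map at zoom level 1 described in the text.
public class Mercator {

    // latitude phi and longitude lambda in degrees, zoom level z >= 1;
    // returns {x, y} pixel coordinates on the world map
    public static double[] toPixels(double phi, double lambda, int z) {
        double xNorm = lambda / 180.0;
        double yNorm = Math.log(Math.tan(Math.toRadians(45.0 + phi / 2.0))) / Math.PI;
        double n = 512.0 * Math.pow(2, z - 1);   // map pixel size
        double x = Math.min(n - 1, Math.max(0, xNorm * n / 2 + n / 2));
        double y = Math.min(n - 1, Math.max(0, n / 2 - yNorm * n / 2));
        return new double[]{x, y};
    }
}
```

The point (0°, 0°) at zoom level 1 lands in the center of the 512-pixel map, and the min/max clamping keeps extreme coordinates inside the pixel range.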

Distance calculation is done by the haversine formula [44]. The formula is in Equation 4.2, where R is the radius of the Earth and (ϕ1, λ1), (ϕ2, λ2) are the latitude and longitude coordinates of the first and the second point.

d = 2R · arcsin √( sin²((ϕ2 − ϕ1)/2) + cos ϕ1 · cos ϕ2 · sin²((λ2 − λ1)/2) )   (4.2)
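A direct Java transcription of the haversine formula (Equation 4.2), assuming an Earth radius of R = 6371 km:

```java
// Sketch of the haversine distance from Equation 4.2; coordinates are given
// in degrees and the result is in kilometers (R = 6371 km assumed).
public class Haversine {

    public static double distanceKm(double phi1, double lambda1,
                                    double phi2, double lambda2) {
        double r = 6371.0;
        double dPhi = Math.toRadians(phi2 - phi1);
        double dLambda = Math.toRadians(lambda2 - lambda1);
        double h = Math.pow(Math.sin(dPhi / 2), 2)
                 + Math.cos(Math.toRadians(phi1)) * Math.cos(Math.toRadians(phi2))
                 * Math.pow(Math.sin(dLambda / 2), 2);
        return 2 * r * Math.asin(Math.sqrt(h));
    }
}
```

For example, a quarter of the equator (from (0°, 0°) to (0°, 90°)) evaluates to πR/2.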

An intermediate point between two points on a great circle with a given fraction [51] is computed using the following equation:

A = sin((1 − f) · d) / sin d

B = sin(f · d) / sin d

x = A · cos ϕ1 · cos λ1 + B · cos ϕ2 · cos λ2

y = A · cos ϕ1 · sin λ1 + B · cos ϕ2 · sin λ2

z = A · sin ϕ1 + B · sin ϕ2

ϕ = arctan( z / √(x² + y²) )   ← latitude

λ = arctan( y / x )   ← longitude ,   (4.3)

where d is the distance between the two points (Equation 4.2), f is the given fraction on the interval ⟨0, 1⟩ (0 is closest to the first point), (ϕ1, λ1), (ϕ2, λ2) are the latitude and longitude coordinates of the first and the second point, and (ϕ, λ) is the intermediate point. The intermediate point calculation for a trace and a defined time range uses the intermediate point equation (Equation 4.3) with the following input value:

f = (tx − t1) / (t2 − t1) ,   (4.4)

where tx is the time of an interpolated sample7 (the start of the time range without prime, the end of the time range with prime), t1 is the time of the closest sample to the left of tx and t2 is the time of the closest sample to the right of tx (see Figure 4.10 for an explanation).

7A sample is a point of a vessel trace with time information

Figure 4.10: Trace Interpolation Diagram
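The interpolation of Equations 4.3 and 4.4 can be sketched in Java as follows; angles are kept in radians, and atan2 is used as a quadrant-safe form of the arctangent:

```java
// Sketch of the great-circle intermediate point from Equations 4.3 and 4.4:
// given two trace samples and their timestamps, it interpolates the position
// at time tx. Angles are in radians to keep the formulas close to the text.
public class TraceInterpolation {

    // returns {phi, lambda} in radians for a fraction f in <0, 1>
    public static double[] intermediate(double phi1, double lambda1,
                                        double phi2, double lambda2, double f) {
        // angular distance d between the points (haversine, Equation 4.2, R = 1)
        double d = 2 * Math.asin(Math.sqrt(
            Math.pow(Math.sin((phi2 - phi1) / 2), 2)
          + Math.cos(phi1) * Math.cos(phi2)
          * Math.pow(Math.sin((lambda2 - lambda1) / 2), 2)));
        double a = Math.sin((1 - f) * d) / Math.sin(d);
        double b = Math.sin(f * d) / Math.sin(d);
        double x = a * Math.cos(phi1) * Math.cos(lambda1) + b * Math.cos(phi2) * Math.cos(lambda2);
        double y = a * Math.cos(phi1) * Math.sin(lambda1) + b * Math.cos(phi2) * Math.sin(lambda2);
        double z = a * Math.sin(phi1) + b * Math.sin(phi2);
        double phi = Math.atan2(z, Math.sqrt(x * x + y * y));   // latitude
        double lambda = Math.atan2(y, x);                       // longitude
        return new double[]{phi, lambda};
    }

    // Equation 4.4: fraction of the time range between the two samples
    public static double fraction(long tx, long t1, long t2) {
        return (double) (tx - t1) / (t2 - t1);
    }
}
```

The midpoint (f = 0.5) between (0°, 0°) and (0°, 90°) on the equator comes out at (0°, 45°), as expected.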

4.3.2 K-means

The K-means clustering algorithm is reasonably fast and produces clusters that are centroids of the processed elements [36]. The algorithm works as follows:

1. Initialize k centroids randomly (usually selected from the input data set).

2. Assign all data entries to their closest centroid.

3. Recompute the positions (means) of the k centroids according to the assigned entries.

4. If the positions of the k centroids have changed, go to 2; otherwise finish.

Because the general K-means algorithm initializes centroids randomly, it produces different results on the same input data set. This behavior causes the clusters to move on the map when tiles are refreshed. To avoid this movement, the first four centroids are initialized according to Figure 4.11; from the 5th centroid on, the initialization is random.

Figure 4.11: Centroids Initialization Points – Map Tile in Mercator Projection

The customized K-means algorithm is executed on data that are bounded by the tile dimensions. That helps to improve performance and also lowers the number of loops needed to converge.
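A compact sketch of the K-means steps above; for determinism this sketch seeds the centroids with the first k input points, whereas the application seeds the first four centroids at the fixed tile positions of Figure 4.11:

```java
import java.util.Arrays;

// Compact K-means sketch in the spirit of Section 4.3.2. The seeding with
// the first k points is an assumption of this sketch, not the application's
// fixed-tile-position initialization.
public class KMeans {

    // points: array of {x, y}; returns k centroids
    public static double[][] cluster(double[][] points, int k, int maxIter) {
        double[][] centroids = Arrays.copyOfRange(points, 0, k);
        int[] assignment = new int[points.length];
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // step 2: assign each point to its closest centroid
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist2(points[p], centroids[c]) < dist2(points[p], centroids[best]))
                        best = c;
                if (assignment[p] != best) { assignment[p] = best; changed = true; }
            }
            // step 3: recompute centroid positions as means of assigned points
            double[][] sums = new double[k][2];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                sums[assignment[p]][0] += points[p][0];
                sums[assignment[p]][1] += points[p][1];
                counts[assignment[p]]++;
            }
            for (int c = 0; c < k; c++)
                if (counts[c] > 0)
                    centroids[c] = new double[]{sums[c][0] / counts[c], sums[c][1] / counts[c]};
            if (!changed) break;   // step 4: converged, no assignment changed
        }
        return centroids;
    }

    private static double dist2(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }
}
```

Running this on the points of one map tile only, as the application does, keeps both the input size and the number of convergence loops small.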

4.3.3 Line Clustering Algorithm

The reduction of vessel traces by clustering is inspired by Schroedl et al. [38], who describe how to extract road lanes from a large amount of GPS traces affected by noise. Their method splits traces into segments which connect clusters under a maximum distance constraint; these segments are then intersected and refined to follow road lanes. Direct usage of the proposed algorithm is not possible, because it is designed for a different task: it finds lanes in the noisy data, whereas we need to reduce the number of lanes with respect to their intensity. The pseudocode of the line clustering algorithm is in Algorithm 4.2. Pre-clustering is done in the first step, K-means clustering in the second step, and matching of lines to the new points in the last step. The Euclidean distance function is used for performance reasons. The first point of a cluster is also used to improve performance.

dist = 0.1   // defines minimum distance between points
k = 1000     // defines number of centroids
lines = getAllLines()            // fetch from the database
points = extractPoints(lines)    // extract all points from lines

clusters = []
for each point from points do
    add [point] to clusters      // convert points to clusters
end for

clusters2 = []
while not empty clusters do
    // remove the first cluster and search for the nearest clusters
    c = remove first from clusters
    for each cluster from clusters do
        if euclideanDistance(c[0], cluster[0]) < dist then
            remove cluster from clusters
            append cluster to c  // append the nearest cluster to the removed cluster
        end if
    end for
    add c to clusters2
end while

// distance is measured by euclideanDistance(c1[0], c2[0])
centroids = Kmeans(clusters2, k) // reduces clusters count to k

// new points lookup operation for lines
clusteredLines = []
for each line from lines do
    startPoint = search in centroids for line[0]
    endPoint = search in centroids for line[1]
    if startPoint != endPoint then
        add [startPoint, endPoint] to clusteredLines
    end if
end for

// array of vectors [startPoint, endPoint, intensity]
producedLines = convert(clusteredLines)

Algorithm 4.2: Line Clustering Algorithm Pseudocode
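The pre-clustering step of Algorithm 4.2 (greedily grouping points that lie closer than dist to the first point of a group) can be sketched in Java as:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the pre-clustering step of Algorithm 4.2: points closer than
// `dist` to the first point of a group are greedily merged into that group,
// mirroring the euclideanDistance(c[0], cluster[0]) test in the pseudocode.
public class LinePreClustering {

    // points: array of {x, y}; returns groups of mutually close points
    public static List<List<double[]>> preCluster(double[][] points, double dist) {
        List<double[]> remaining = new ArrayList<>();
        for (double[] p : points) remaining.add(p);
        List<List<double[]>> groups = new ArrayList<>();
        while (!remaining.isEmpty()) {
            double[] first = remaining.remove(0);
            List<double[]> group = new ArrayList<>();
            group.add(first);
            // collect every remaining point near the group's first point
            for (int i = remaining.size() - 1; i >= 0; i--) {
                if (euclidean(first, remaining.get(i)) < dist) {
                    group.add(remaining.remove(i));
                }
            }
            groups.add(group);
        }
        return groups;
    }

    private static double euclidean(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }
}
```

Comparing only against the first point of a group is the same performance shortcut the thesis mentions: it avoids computing distances between all pairs of group members.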

4.3.4 Agglomerative Hierarchical Clustering

Agglomerative hierarchical clustering builds a tree of clusters from the leaves to the root (bottom-up) [15]. A distance function measuring the gap between two clusters is usually defined as a single linkage:

d_sl(C1, C2) = min_{i∈C1, j∈C2} d_ij ,   (4.5)

a complete linkage:

d_cl(C1, C2) = max_{i∈C1, j∈C2} d_ij ,   (4.6)

or an average linkage:

d_al(C1, C2) = (1 / (|C1| · |C2|)) · Σ_{i∈C1} Σ_{j∈C2} d_ij ,   (4.7)

where d_ij is a distance function (usually Euclidean) between points i and j. The pseudocode of the program is in Algorithm 4.3. Performance is increased by building the full distance matrix before the algorithm starts. After two clusters are merged, the distance matrix is only updated on the corresponding rows and columns.

minDistance = ...   // defines minimum distance between points
k = ...             // defines number of final clusters
points = getAllPoints()   // select all points from the database

clusters = []
for each point from points do
    add [point] to clusters   // convert points to clusters
end for

// build a triangular distance matrix
distanceMatrix = [][]
for i from 0 to |clusters| − 1 do
    for j from 0 to |clusters| − i − 1 do
        distanceMatrix[i][j] = distance(clusters[i], clusters[j])   // distance function d(C1, C2)
    end for
end for

// stop condition − number of non-empty clusters
while |not empty clusters| > k do
    [i, j] = indexesOfMinValue(distanceMatrix)   // search for the minimal distance and return its column and row indexes

    // found minimal distance is bigger than minDistance
    if distance(clusters[i], clusters[j]) > minDistance then
        break while   // interrupt loop
    end if

    clusters[i] = mergeClusters(clusters[i], clusters[j])   // merge clusters into a new cluster
    clusters[j] = []   // set to empty value

    updateDistances(distanceMatrix, clusters, i, j)   // update corresponding rows and cols just for updated clusters
end while
return not empty clusters

Algorithm 4.3: Agglomerative Hierarchical Clustering Pseudocode
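A compact Java sketch of Algorithm 4.3 with the single-linkage distance (Equation 4.5); for brevity it recomputes distances on the fly instead of caching and updating the distance matrix as the server implementation does:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of Algorithm 4.3 with single-linkage distance (Equation 4.5).
// Unlike the server implementation, no distance matrix is cached here,
// which keeps the sketch short at the cost of performance.
public class AgglomerativeClustering {

    public static List<List<double[]>> cluster(double[][] points, int k, double minDistance) {
        List<List<double[]>> clusters = new ArrayList<>();
        for (double[] p : points) {
            List<double[]> c = new ArrayList<>();
            c.add(p);
            clusters.add(c);   // every point starts as its own cluster
        }
        while (clusters.size() > k) {
            // find the pair of clusters with the smallest single-linkage distance
            int bi = -1, bj = -1;
            double best = Double.POSITIVE_INFINITY;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = singleLinkage(clusters.get(i), clusters.get(j));
                    if (d < best) { best = d; bi = i; bj = j; }
                }
            if (best > minDistance) break;   // distance stop condition
            clusters.get(bi).addAll(clusters.remove(bj));   // merge the pair
        }
        return clusters;
    }

    static double singleLinkage(List<double[]> c1, List<double[]> c2) {
        double min = Double.POSITIVE_INFINITY;
        for (double[] a : c1)
            for (double[] b : c2)
                min = Math.min(min, Math.hypot(a[0] - b[0], a[1] - b[1]));
        return min;
    }
}
```

Calling it with minDistance = +∞ reproduces the clustering-by-number-of-clusters mode, and k = 1 with a finite minDistance reproduces the clustering-by-distance mode described below.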

Points Clustering by Number of Clusters

The agglomerative hierarchical clustering algorithm (Algorithm 4.3) is stopped when the number of non-empty clusters equals the desired value. To achieve that, the algorithm is initialized with the values minDistance = +∞ and k ∈ {1, 2, 3, ...}.

Points Clustering by Distance

Overlapping of visualized clusters on a map can be avoided by the agglomerative hierarchical clustering algorithm (Algorithm 4.3) with a stop condition based on the distance between clusters. The algorithm produces such a result when initialized with the values minDistance ∈ R>0 and k = 1.

4.3.5 Heat Map Generation

Bivariate kernel density estimation [45] is an algorithm for estimating areas with a high probability of element occurrence. It is based on summing up points that are transformed using a kernel function. The Gaussian kernel function is selected for generating heat maps in the application. An example of a produced chart is shown in Figure 4.12. The used density estimation function f̂_norm(x, y) is defined by the following equation:

f̂(x, y) = Σ_{i=1..n} exp( −((x − xi)² + (y − yi)²) / (2σ²) )

F̂ = ⎡ f̂(1,1)  f̂(1,2)  ⋯  f̂(1,m) ⎤
    ⎢ f̂(2,1)  f̂(2,2)  ⋯  f̂(2,m) ⎥
    ⎢   ⋮        ⋮     ⋱    ⋮    ⎥
    ⎣ f̂(m,1)  f̂(m,2)  ⋯  f̂(m,m) ⎦

f̂_norm(x, y) = f̂_xy / max F̂ ,   (4.8)

where n is the size of the vectors of point coordinates (n = |X| = |Y|); σ is the blob spread; m is the map array size, i.e. the size of the matrix F̂; x, y are coordinates converted from latitude and longitude (4.1); xi, yi are the i-th elements of the vectors of point coordinates X, Y; f̂(x, y) is the derived Gaussian kernel density estimator excluding the 1/n coefficient [45]; and F̂ is the matrix of computed f̂(x, y) values. A value obtained from the function f̂_norm(x, y) is in the range ⟨0, 1⟩. The value is linearly transformed into a corresponding color from the Jet colormap8. The size of the matrix F̂ can be bigger than the available system memory; therefore the matrix is represented by a table in the database, which has row and column indexes as primary keys (the table Heatmap Points in Figure 4.2).
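The estimate of Equation 4.8 can be sketched in Java as follows; the points are assumed to be already converted to grid coordinates, and the whole m×m matrix is kept in memory rather than in a database table:

```java
// Sketch of the Gaussian kernel density estimate from Equation 4.8 on an
// m x m grid, normalized by the matrix maximum to the range <0, 1>. The
// in-memory matrix is a simplification of the database-backed one.
public class HeatMap {

    // xs, ys: point coordinates already converted to grid units
    public static double[][] density(double[] xs, double[] ys, int m, double sigma) {
        double[][] f = new double[m][m];
        double max = 0;
        for (int x = 0; x < m; x++)
            for (int y = 0; y < m; y++) {
                double sum = 0;
                // sum of Gaussian kernels centered at the input points
                for (int i = 0; i < xs.length; i++) {
                    double dx = x - xs[i], dy = y - ys[i];
                    sum += Math.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma));
                }
                f[x][y] = sum;
                max = Math.max(max, sum);
            }
        // normalize by the maximum so values fall into <0, 1>
        for (int x = 0; x < m; x++)
            for (int y = 0; y < m; y++)
                f[x][y] /= max;
        return f;
    }
}
```

Each normalized cell value would then be mapped linearly to a Jet colormap color when the tile image is rendered.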

Figure 4.12: Screenshot – Generated Heat Map Visualization

8Jet colormap, http://www.mathworks.com/help/techdoc/ref/colormap.html

Chapter 5

Evaluation

This chapter evaluates the implemented application with respect to the specifications (Section 2.1). Section 5.1 analyzes the application performance, Section 5.2 analyzes the information loss and Section 5.3 describes the technical limitations of the application.

5.1 Performance

Application performance is an important criterion, especially in web-based applications. The performance is analyzed on testing data sets (Section 5.1.1) and on real data sets (Section 5.1.2). The first test suite compares the sizes of data files containing a defined number of elements. A file size depends on the amount of data in the optional fields and in the text field. Both tests do not use custom data values (see Section 4.2.2). The second test suite compares the speed of operations on point-related elements, which contain center point information and no trace information. The File Upload & Parse test measures the time needed to parse and store an input data file. The Heat Map Generation test generates a heat map with the following values: spread σ = 1.5, zoom level z = 9 and matrix size 9 × 9 (the resolution of the Gaussian). The K-means clustering test measures the time needed to compute k = 4 clusters. The first Hierarchical clustering test computes k = 4 clusters with the single-linkage distance function. The second Hierarchical clustering test uses the single-linkage distance function with a distance stop condition of d = 2. The last test suite compares the speed of trace-related operations. The File Upload & Parse test is the same as for point-related elements but uses data which contain

trace information and no center point. The Lines clustering test measures the speed of the Line Clustering algorithm. The Spatial filter test computes data for four polygon areas which approximately cover the Indian Ocean, the North Atlantic Ocean, the Philippine Sea and the Caspian Sea. All tests are done on an Intel Core 2 Duo 2.26 GHz dual-core processor with 8 GB DDR3 RAM and a 250 GB 2.5" 5,400 rpm HDD, on a 64-bit Unix-based operating system with a 64-bit Java Runtime Environment.

5.1.1 Random Data

Random data sets for testing purposes are generated with a defined number of elements. Points are randomly generated with latitude ϕ ∈ ⟨−85°, 85°⟩ and longitude λ ∈ ⟨−180°, 180°⟩ from a uniform distribution. The time field is sequentially increased in 5-minute intervals. Traces are generated from 10,000 randomly generated points. That amount of points is selected for performance reasons. Each path contains 10,000 points, which can repeat, and they are selected from the generated data set according to the following steps:

1. randomly split the data set into 10 points used as nodes and 9,990 points used as terminals (note that the terminals and nodes data sets are the same for all paths),

2. randomly select two points from the terminals data set,

3. for each selected point, find the closest point from the nodes data set,

4. append to the path the first random point, its closest node, the node closest to the second random point, and the second random point,

5. repeat steps 2–4 until the path contains 10,000 points.

The process for generating traces is designed to simulate repeated routes for the lines clustering algorithm; otherwise, traces with only unique random lines would be generated. Such completely random traces are unrealistic for multi-agent domains, because agents move along pre-specified paths.
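The generation steps above can be sketched as follows. This is a simplified illustration rather than the actual generator used in the thesis, and all class and method names are invented:

```java
import java.util.*;

public class TraceGenerator {
    record Pt(double lat, double lon) {}

    // Squared Euclidean distance; sufficient for a nearest-node lookup.
    static double dist2(Pt a, Pt b) {
        double dLat = a.lat() - b.lat(), dLon = a.lon() - b.lon();
        return dLat * dLat + dLon * dLon;
    }

    static Pt nearest(Pt p, List<Pt> nodes) {
        Pt best = nodes.get(0);
        for (Pt n : nodes) if (dist2(p, n) < dist2(p, best)) best = n;
        return best;
    }

    /** Builds one path of the requested length from shared nodes/terminals. */
    static List<Pt> buildPath(List<Pt> nodes, List<Pt> terminals, int length, Random rnd) {
        List<Pt> path = new ArrayList<>();
        while (path.size() < length) {
            Pt a = terminals.get(rnd.nextInt(terminals.size()));  // step 2: two random terminals
            Pt b = terminals.get(rnd.nextInt(terminals.size()));
            path.add(a);                                          // step 4: first terminal,
            path.add(nearest(a, nodes));                          // its closest node,
            path.add(nearest(b, nodes));                          // the other terminal's node,
            path.add(b);                                          // and the second terminal
        }
        return path;
    }
}
```

Because the 10 nodes are shared by all paths, different paths repeatedly pass through the same node points, which produces the repeated route segments the lines clustering algorithm is meant to exploit.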

Table 5.1 compares the sizes of data sets for a defined number of elements. The relation between file size and number of elements is linear.

Number of Elements    1 k      10 k    100 k    1 M
Points (MiB)          0.246    2.4     24       243
Traces (MiB)          11       110     1,103    11,030 (estimate)

Table 5.1: Size Comparison of Testing Data Files

Table 5.2 and Figure 5.1 compare the speed of points-related tests for a defined number of elements. Estimated values are approximations based on the first 500 cycles of the hierarchical clustering algorithm, extrapolated from the tendencies of the measured values.

Number of Elements           1 k      10 k     100 k    1 M
File Upload & Parse (s)      0.356    2.03     15.8     363
Heat Map Generation (s)      5.38     20.4     34.5     80.5
K-means cl., k = 4 (s)       0.032    0.051    0.717    11.6
Hier. cl., min., k = 4 (s)   0.014    3.29     1,140    300,000 (estimate)
Hier. cl., min., d = 2 (s)   0.088    6.21     1,284    400,000 (estimate)

Table 5.2: Speed Comparison of Points of Testing Data Files

          






Figure 5.1: Visualization of Speed Comparison of Points Data

The file upload speed decreases with the number of elements, which is probably caused by database engine performance. The heat map generation speed per element increases, because many points hit the same matrix cell; that is caused by the selected level of detail and by the uniform distribution of the points. The K-means performance is better than the hierarchical clustering performance, because the K-means algorithm usually converges in fewer steps than the hierarchical clustering algorithm, which always runs a fixed number of steps (n − k for a defined k).

Table 5.3 and Figure 5.2 compare the speed of traces-related tests for a defined number of elements. The estimate for the Spatial filter is based on the value tendency.

Number of Elements        1 k     10 k    100 k
File Upload & Parse (s)   4.82    31.9    410
Lines cl. (s)             5.32    10.5    56.6
Spatial filter (s)        1.21    110     10,000 (estimate)

Table 5.3: Speed Comparison of Traces of Testing Data Files

Figure 5.2: Visualization of Speed Comparison of Traces Data

The upload speed also decreases, as in the points-related tests. The Lines clustering test speed decreases as well, which is probably caused by the use of the K-means algorithm and by the large amount of memory consumed, which triggers many garbage collector calls in the Java Virtual Machine. The Spatial filter test speed depends on the simulated time range, because the filter needs to reconstruct each vessel's movements in 5-minute steps to compute the spatial information.
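The 5-minute time-driven reconstruction mentioned above amounts to linearly interpolating each vessel's position between two stored waypoints at fixed time steps. A minimal sketch of the interpolation step; the names are illustrative and not the application's actual API:

```java
public class PositionInterpolator {
    /** Linearly interpolates a position between two timed waypoints.
     *  t, t0, t1 are epoch milliseconds; returns {lat, lon}.
     *  (Linear interpolation in lat/lon is a reasonable approximation
     *  for the short 5-minute steps used by the filter.) */
    static double[] positionAt(long t, long t0, double lat0, double lon0,
                               long t1, double lat1, double lon1) {
        if (t <= t0) return new double[] {lat0, lon0};   // clamp before the segment
        if (t >= t1) return new double[] {lat1, lon1};   // clamp after the segment
        double f = (double) (t - t0) / (t1 - t0);        // fraction of the segment elapsed
        return new double[] {lat0 + f * (lat1 - lat0), lon0 + f * (lon1 - lon0)};
    }
}
```

Evaluating this for every vessel at every 5-minute tick explains why the cost grows with the time range rather than with the number of traces.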

5.1.2 Real Data

Real data are generated by the AgentC platform. They are produced by two simulation runs with thousands of agents. The data are tested to measure real performance and to compare the difference between a month-long and a year-long simulation. The month data set contains 259 points (events) and 4,003 traces. The year data set contains 2,640 points (events) and 4,003 traces, which are approximately 12 times longer than in the month data set. The size comparison is in Table 5.4. The file size again relates to the number of points and traces, as for the testing data sets.

Data Set        Month (259 points)    Year (2,640 points)
Points (MiB)    0.093                 0.956
Traces (MiB)    35                    383

Table 5.4: Size Comparison of Real Data Files

Table 5.5 compares the speed of points-related tests for each data set. All tests have reasonable time consumption except the hierarchical clustering tests, which are over 260 times slower than the K-means clustering; the hierarchical clustering is therefore mostly usable only for static visualizations.

Test                         Month (259 points)    Year (2,640 points)
File Upload & Parse (s)      0.647                 2.87
Heat Map Generation (s)      0.57                  1.16
K-means cl., k = 4 (s)       0.015                 0.101
Hier. cl., min., k = 4 (s)   0.243                 26.3
Hier. cl., min., d = 2 (s)   0.264                 53.3

Table 5.5: Speed Comparison of Points of Real Data Files

Table 5.6 compares the speed of traces-related tests for each data set. The file upload speed does not limit instant usage of the application, because a file is uploaded in less than two minutes. The Line clustering test takes more than a minute for the year data set, but the produced result is cached, so the second request takes just a few seconds. The Spatial filter test result is also cached, but for larger data sets it will be infeasible to compute the result in a reasonable time interval.

Test                      Month (259 points)    Year (2,640 points)
File Upload & Parse (s)   16.4                  101
Lines cl. (s)             11.5                  64.1
Spatial filter (s)        28.2                  345

Table 5.6: Speed Comparison of Traces of Real Data Files

5.2 Line Clustering Algorithm Evaluation

This section evaluates the line clustering algorithm (Section 4.3.3), because it is a custom rather than a general-purpose algorithm. The algorithm is tested on the random data set (from Section 5.1.1) and on the real data set (the year data set from Section 5.1.2). Table 5.7 lists the input number of points stored in the database, the percentage of distinct points (non-repeated points in a data set), the number of pre-clustered points (before the K-means call, see Algorithm 4.2), and the number of clustered points.

Data Set                   Random                   Real
Values                     Absolute      Relative   Absolute     Relative
Points (-)                 10,100,000    100%       3,575,727    100%
Unique Points (-)          10,000        0.099%     1,875,462    52.45%
Pre-clustered Points (-)   9,984         0.099%     5,113        0.143%
Clustered Points (-)       1,996         0.020%     1,022        0.029%

Table 5.7: Comparison of Line Clustering Algorithm – Points

The random data set has 10,000 unique points, and the pre-clustering removes just 16 of them. The K-means algorithm then reduces the points to 1,996. The real data set contains over 52% unique points, and the pre-clustering process reduces the number of points to 0.143% of the input amount. K-means reduces this amount a further five times, to 1,022 points.

Table 5.8 shows the absolute numbers and percentages of lines processed by the line clustering algorithm. The random data set contains 1.1% unique lines; after the points clustering, 4.3% of lines are removed. The set contains over 95% lines duplicated two or more times, and just over 0.2% of the cluster lines are unique. The real data set contains nearly 60% unique lines; after the clustering process, over 80% of lines are removed. Lines duplicated two or more times make up over 19% of the input data set, and less than 0.1% of lines are unique after the clustering.

Data Set                   Random                   Real
Values                     Absolute      Relative   Absolute     Relative
Lines (-)                  10,000,000    100%       3,571,724    100%
Unique Lines (-)           110,063       1.1%       2,140,559    59.9%
Removed Lines (-)          433,384       4.3%       2,885,052    80.8%
Cluster Multi-Lines (-)    9,546,399     95.5%      683,963      19.1%
Unique Cluster Lines (-)   20,217        0.202%     2,709        0.076%

Table 5.8: Comparison of Line Clustering Algorithm – Lines

A comparison of the sum of all distances and of the differences is in Table 5.9. The sum of all distances after the clustering process is almost the same as before it.

Data Set                  Random                     Real
Values                    Absolute       Relative    Absolute      Relative
Original Distance (km)    7.906 ⋅ 10¹⁰   100%        7.69 ⋅ 10⁸    100%
Clustered Distance (km)   7.915 ⋅ 10¹⁰   100.11%     7.54 ⋅ 10⁸    98.11%
Line Difference (km)      3.186 ⋅ 10⁹    4.03%       1.73 ⋅ 10⁸    22.57%

Table 5.9: Comparison of Line Clustering Algorithm – Distance

The main difference is in the Line Difference value, which describes the similarity of a clustered line to the original line. The value is calculated using the following equation:

d_diff(L, f) = Σ_{(l_a, l_b) ∈ L} ( ‖ l_a − f(l_a) ‖ + ‖ l_b − f(l_b) ‖ ),    (5.1)

where L = {l₁, l₂, …} is the set of input lines, f : L → L′ is a mapping that transforms the data set to the clustered data set L′, l_x = ⟨(ϕ₁, λ₁), (ϕ₂, λ₂)⟩ is a line given by two GPS coordinates, and ‖ l₁ − l₂ ‖ is the haversine distance (Equation 4.2).

The random data set has a distance difference of about 4% of the original distance value. The real data set has a nearly 23% distance difference, which is caused by the 81% of removed lines. These lines are mostly very short and are produced by an entity turning on the map.
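Equation 5.1 can be computed directly by summing, for every input line, the haversine distances between each endpoint and its clustered image. A sketch of that computation follows; the haversine method implements the standard formula, and the class and method names are illustrative:

```java
public class LineDifference {
    static final double EARTH_RADIUS_KM = 6371.0;

    /** Haversine great-circle distance between two GPS points, in km. */
    static double haversine(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    /** d_diff over lines given as {latA, lonA, latB, lonB};
     *  clustered[i] is the image f(lines[i]) under the clustering. */
    static double lineDifference(double[][] lines, double[][] clustered) {
        double sum = 0.0;
        for (int i = 0; i < lines.length; i++) {
            double[] l = lines[i], c = clustered[i];
            sum += haversine(l[0], l[1], c[0], c[1])   // endpoint a vs f(a)
                 + haversine(l[2], l[3], c[2], c[3]);  // endpoint b vs f(b)
        }
        return sum;
    }
}
```

A perfect clustering maps every line onto itself, giving a line difference of zero; Table 5.9 shows how far the actual clustering deviates from that.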

5.3 Technical Problems and Limitations

This section describes the problems and limits reached by the application. The first paragraph describes tiles-related problems, the second web browser limits, the third application-related problems, and the last database-related problems.

The usage of tiles for clustering sometimes causes overlapping of the visualized clusters. This behavior happens "by design"; it could be removed by computing the whole screen view at one time, which would hurt performance because more points would enter the clustering process, or by adding constraints that avoid the overlapping.

If too many elements are uploaded, the visualization becomes very slow. This can be observed especially on the Dynamic page, which dynamically replays a simulation. About 5,000 SVG elements make a web browser slow to respond, and increasing the number of elements escalates to unresponsiveness. That happens especially in Internet Explorer 9 because of its slow JavaScript engine in comparison to Google Chrome 18 or Mozilla Firefox 11.

Calculation of the Spatial filter takes a long time for wider time ranges, even if only a few traces and events are present. This is caused by the time-driven reconstruction of the simulation. For larger time ranges it would be better to use event-driven reconstruction, which would probably speed up the whole process.

Every visualized cluster is associated with the unique IDs of its elements. If a cluster contains many IDs, the database engine is overloaded, which completely freezes the application. That can probably be resolved by caching processed elements or by splitting the SQL requests.
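The suggested splitting of SQL requests can be sketched as partitioning a cluster's ID list into fixed-size chunks, so that each IN (...) clause stays small. This is an illustrative sketch, not the application's actual code:

```java
import java.util.*;

public class IdBatcher {
    /** Splits a long ID list into chunks so that each SQL IN (...) clause
     *  issued per chunk stays small enough for the database engine. */
    static List<List<Integer>> batches(List<Integer> ids, int batchSize) {
        List<List<Integer>> out = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += batchSize)
            out.add(ids.subList(i, Math.min(i + batchSize, ids.size())));
        return out;
    }
}
```

Each chunk is then queried separately (e.g. `... WHERE id IN (?, ?, ?)`), trading one huge statement for several bounded ones.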

Chapter 6

Conclusion

Processing and web browser visualization of large amounts of geo-spatial data are problematic due to web browser performance limitations. Filtering and aggregation algorithms have been implemented to reduce the amount of input data and produce visualizable results for a web browser. The implemented web-based application can visualize geo-spatial data on a map and on a chart; it also supports dynamic and static visualization, filtering of data, and exporting of the visualized graphics.

The web-based application has been implemented and evaluated against testing data sets. The application supports static and dynamic visualization of entities on a map and chart visualization of aggregated entities. The work uses an architecture of filters, which overcomes the limitations of today's web browsers. The filters support filtering, processing, and clustering of stored data. They have been designed to be modular, easily extensible, and modifiable; creation of custom filters is also very straightforward and can be done in a web browser.

Several technical limitations of the application have been discovered, such as web browser performance, database speed, or low algorithm speed on complex input data. Some limits can be pushed by using better computer hardware, but the performance of some algorithms cannot be reasonably improved, so better solutions are needed, e.g. data reduction before processing. On the other hand, web browsers are being rapidly developed nowadays, so in a few years their performance may improve significantly.

The work can be extended by implementing other filters, which could produce better visualizations, or by improving the application's performance on large data sets. Additionally, a direct link to a multi-agent system and real-time visualization of events can be implemented. Finally, the visualization of clusters on a map can be improved by adding constraints that avoid overlapping.


Appendix A

Code Listing

[ { "time": "2011-10-16T15:29:38.342+02:00", "host": "127.0.0.1", "app": "AutoGenerator", "uid": "14057139880130948715", "type": "Move", "source": "VehicleNu151", "sTime": "2011-10-15T20:17:38.334+02:00", "center": [30.1 50.2 234.1], "waypoints": [[ "2011-10-16T15:29:38.342+02:00", ⤦ Ç 30.1, 50.2, 234.1], [ "2011-10-16T15 ⤦ Ç :29:38.342+02:00", 30.2, 50.3, 234.2]], "text": "\u003cb\u003eVehicle\u003c/b\u003e moved \ ⤦ Ç u003ci\u003eone step\u003c/i\u003e.", " data ": { "blackWheels": false, "redRoof": true, "speed": 49.3, "plate": "9A0 1215", "serialNu": "FNU148193-341F" } }, { "time": "2011-10-16T15:29:43.413+02:00", "host": "127.0.0.1", "app": "AutoGenerator", "uid": "14057139880130948715", "type": "Paused", "source": "Environment"

59 }, { "time": "2011-10-16T15:38:43.413+02:00", "host": "127.0.0.1", "app": "AutoGenerator", "uid": "14057139880130948715", "type": "Started", "source": "Environment", " data ": { "temperature": 26.2 } } ] Algorithm A.1: Input Data Format demonstration code in JSON formatting

CREATE TABLE `db_data_entries` (
  `id` INT AUTO_INCREMENT,
  `time` DATETIME NOT NULL,
  `host` CHAR(64) NOT NULL,
  `application` VARCHAR(64) NOT NULL,
  `unique_id` VARCHAR(64) NOT NULL,
  `data_type` VARCHAR(250) NOT NULL,
  `data_source` VARCHAR(250) NOT NULL,
  `simulation_time` DATETIME,
  `lat` DOUBLE,
  `lon` DOUBLE,
  `elev` DOUBLE,
  `text` TEXT,
  `waypoints` BLOB,
  PRIMARY KEY (`id`)
);
CREATE INDEX `data_idx` ON `db_data_entries` (`id` ASC);
CREATE INDEX `points_lat` ON `db_data_entries` (`lat` ASC);
CREATE INDEX `points_lon` ON `db_data_entries` (`lon` ASC);
CREATE INDEX `time_index_de` ON `db_data_entries` (`time` ASC);
CREATE INDEX `uid_index_de` ON `db_data_entries` (`unique_id` ASC);
CREATE INDEX `data_type_index_de` ON `db_data_entries` (`data_type` ASC);
CREATE INDEX `data_source_index_de` ON `db_data_entries` (`data_source` ASC);

CREATE TABLE `db_map_booleans` (
  `mb_id` INT AUTO_INCREMENT,
  `data_entry_id` INT NOT NULL,
  `mb_name` VARCHAR(64) NOT NULL,
  `mb_value` BOOLEAN NOT NULL,
  PRIMARY KEY (`mb_id`),
  CONSTRAINT `fk_mb_data_entry`
    FOREIGN KEY (`data_entry_id`)
    REFERENCES `db_data_entries` (`id`)
    ON DELETE NO ACTION
    ON UPDATE NO ACTION
);
CREATE INDEX `de_index_mb` ON `db_map_booleans` (`data_entry_id` ASC);
CREATE INDEX `de_index_bname` ON `db_map_booleans` (`mb_name` ASC);
CREATE INDEX `de_index_bvalue` ON `db_map_booleans` (`mb_value` ASC);

CREATE TABLE `db_map_doubles` (
  `md_id` INT AUTO_INCREMENT,
  `data_entry_id` INT NOT NULL,
  `md_name` VARCHAR(64) NOT NULL,
  `md_value` DOUBLE NOT NULL,
  PRIMARY KEY (`md_id`),
  CONSTRAINT `fk_md_data_entry`
    FOREIGN KEY (`data_entry_id`)
    REFERENCES `db_data_entries` (`id`)
    ON DELETE NO ACTION
    ON UPDATE NO ACTION
);
CREATE INDEX `de_index_md` ON `db_map_doubles` (`data_entry_id` ASC);
CREATE INDEX `de_index_dname` ON `db_map_doubles` (`md_name` ASC);
CREATE INDEX `de_index_dvalue` ON `db_map_doubles` (`md_value` ASC);

CREATE TABLE `db_map_strings` (
  `ms_id` INT AUTO_INCREMENT,
  `data_entry_id` INT NOT NULL,
  `ms_name` VARCHAR(64) NOT NULL,
  `ms_value` TEXT NOT NULL,
  PRIMARY KEY (`ms_id`),
  CONSTRAINT `fk_ms_data_entry`
    FOREIGN KEY (`data_entry_id`)
    REFERENCES `db_data_entries` (`id`)
    ON DELETE NO ACTION
    ON UPDATE NO ACTION
);
CREATE INDEX `de_index_ms` ON `db_map_strings` (`data_entry_id` ASC);
CREATE INDEX `de_index_sname` ON `db_map_strings` (`ms_name` ASC);

CREATE TABLE `db_caches` (
  `key_id` CHAR(32),
  `data` BLOB NOT NULL,
  PRIMARY KEY (`key_id`)
);

CREATE TABLE `db_heatmaps` (
  `hm_id` INT AUTO_INCREMENT,
  `hm_name` VARCHAR(100) NOT NULL,
  `hm_zoom` INT NOT NULL,
  `hm_sigma` DOUBLE NOT NULL,
  `hm_gauss_size` INT NOT NULL,
  `hm_type` INT NOT NULL,
  PRIMARY KEY (`hm_id`)
);

CREATE TABLE `db_heatmap_caches` (
  `hm_ref_id` INT NOT NULL,
  `hm_x` INT NOT NULL,
  `hm_y` INT NOT NULL,
  `hm_zoom` INT NOT NULL,
  `data` BLOB NOT NULL,
  PRIMARY KEY (`hm_ref_id`, `hm_x`, `hm_y`, `hm_zoom`),
  CONSTRAINT `hmc_ref_id_pts`
    FOREIGN KEY (`hm_ref_id`)
    REFERENCES `db_heatmaps` (`hm_id`)
    ON DELETE NO ACTION
    ON UPDATE NO ACTION
);

CREATE TABLE `db_heatmap_pts` (
  `hm_ref_id` INT NOT NULL,
  `hm_x` INT NOT NULL,
  `hm_y` INT NOT NULL,
  `hm_count` INT NOT NULL,
  `hm_value` DOUBLE NOT NULL,
  PRIMARY KEY (`hm_ref_id`, `hm_x`, `hm_y`),
  CONSTRAINT `hm_ref_id_pts`
    FOREIGN KEY (`hm_ref_id`)
    REFERENCES `db_heatmaps` (`hm_id`)
    ON DELETE NO ACTION
    ON UPDATE NO ACTION
);

CREATE TABLE `db_polygons` (
  `polygon_id` INT AUTO_INCREMENT,
  `polygon_name` VARCHAR(100) NOT NULL,
  `polygon_value` BLOB,
  PRIMARY KEY (`polygon_id`)
);

CREATE TABLE `db_filters` (
  `filter_id` INT AUTO_INCREMENT,
  `filter_name` VARCHAR(100) NOT NULL,
  `filter_value` TEXT,
  PRIMARY KEY (`filter_id`)
);

CREATE TABLE `db_set_filters` (
  `set_id` CHAR(10),
  `filter_fid` INT,
  PRIMARY KEY (`set_id`),
  CONSTRAINT `sf_ref_id`
    FOREIGN KEY (`filter_fid`)
    REFERENCES `db_filters` (`filter_id`)
    ON DELETE NO ACTION
    ON UPDATE NO ACTION
);

CREATE TABLE `db_config` (
  `config_id` CHAR(10),
  `data` TEXT,
  PRIMARY KEY (`config_id`)
);

CREATE TABLE `db_colors` (
  `field_type` VARCHAR(10) NOT NULL,
  `field_name` VARCHAR(100) NOT NULL,
  `field_format` VARCHAR(20) NOT NULL,
  `data` TEXT,
  PRIMARY KEY (`field_type`, `field_name`, `field_format`)
);

Algorithm A.2: SQL Database Tables Generation Script

Appendix B

User Guide

Brief Documentation

Vojtěch Křížek

May 8, 2012

Contents

1 Data Upload 2

2 First Configuration 2

3 Use Cases 4 3.1 On Vessel Click, Risk ...... 4 3.2 Spatial Filter ...... 6 3.3 Events Display ...... 7 3.4 Heat Map Display ...... 8

4 Other Functionality 9 4.1 New Filter Creation ...... 9 4.2 Data Viewer ...... 10 4.3 Selected Vessel Trace ...... 12 4.4 Other Visualizations ...... 12

5 Technical Requirements 15 5.1 Hardware ...... 15 5.2 Software ...... 15

List of Figures 16

Geo Visio 1 Brief Documentation

1 Data Upload

1) Drag & drop a JSON file into the web browser (see the Figure 1). 2) Wait until the data are uploaded and processed. 3) Refresh the web page (because of the visualization caches).

Figure 1: Drag&Drop JSON file upload

2 First Configuration

1) Go to the Settings page (see the Figure 2). 2) Select the field in the Colors section for group-based visualizations (1), choose desired colors (2) and Save (3). 3) Select the field in the Functions section for a visualization of Dynamic data (4), selected field is used for a visualization of matching field data. 4) Select desired Map Tile provider (5). 5) Select Heat map parameters (6) and click on the Build/Rebuild link (7). 6) Go to the Filters page (see the Figure 3). 7) Click on the Init filters link to initialize default set of filters (1). 8) Select desired filters in the View settings section (2). 9) Go to the Home page (see the Figure 4). 10) You can see default data overview.

11) Go to the Visio page (see the Figure 9). 12) You can see the data visualization based on the selected filters. 13) Go to the Dynamic page (see the Figure 6). 14) After pressing Start, you can see the trace reconstruction with event visualization based on the selected data.

Figure 2: Settings page

Figure 3: Filters page

Figure 4: Home page

3 Use Cases

3.1 On Vessel Click, Risk

Risk along trajectory for a vessel

Figure 5: Dynamic page – Vessel risk chart and vessel info

If you know vessel ID 1) Go to the Dynamic page (see the Figure 6). 2) Select an ID from Select value list (1).

3) A pop-up window with vessel info and a graph from the generated heat map appears (see the Figure 5). 4) Click on the Show Events link to see the trajectory highlight (1).

X Tip For seeing only specified time range (window) use From and To field or the time slider and then select an ID again.

If you do not know the vessel ID 1) Go to the Dynamic page (see the Figure 6). 2) Start the simulation (2). 3) Pause or stop the simulation when you see a vessel whose data you want to see (3). 4) Click on the vessel trace or icon (4). 5) Follow step 3 of the "If you know vessel ID" enumeration.

Figure 6: Dynamic page

3.2 Spatial Filter

Number of merchant vessels in April in the Gulf of Aden

1) Go to the Visio page (see the Figure 7). 2) Create new areas (polygons) via the User select checkbox. An area must be closed with Shift + click on the red area. 3) Go to the Filters page (see the Figure 3). 4) Select Spatial info filter from the Chart list (2) (if not present, click on the Init filters link). 5) Go to the Visio page and wait a couple of seconds (see the Figure 8). 6) From the Info box list, select Table (1).

X Tip See page bottom (status bar) for CPU utilization and memory consumption. X Tip For seeing only specified time range (window) use From and To field or time slider and then click on the Filter link.

Figure 7: Create new polygon area – Visio page

3.3 Events Display

Ratios of events for each month (for given year/grouped by year) Each cluster (displayed) show as a group in the bar chart

1) Go to the Filters page (see the Figure 3). 2) Select from the Map layer list (2) Points by K-means (if not present click on the Init filters link). 3) Select from the Chart list (2) Events by month (if not present click on the Init filters link). 4) Go to the Visio page (see the Figure 9). 5) Use the mouse wheel to zoom in/out, click on a pie chart to see detailed chart (1).

X Tip On the Visio page click on the link All (on right under the main menu) to see a chart for all points.

X Tip Convert a chart or a map to PNG image/PDF document using links on page left bottom.

Figure 8: Spatial filter – Visio page

3.4 Heat Map Display

Display heat map + legend (number for a color) 1) Go to the Settings page (see the Figure 2). 2) Set desired heat map values for generation (6). 3) Set a date range to restrict used events for generation (7). 4) You can restrict some events by setting the Value filter (2) on the Filters page (Figure 3). 5) Click on the Build/Rebuild link (8). 6) Go to the Visio/Dynamic page (see the Figure 9). 7) Check the Show Heat Map checkbox (3). 8) Click somewhere on a heat map layer to see values in a bubble (2). 9) Click on the Show Heat Map Legend link to see heat map color legend (4).

Figure 9: Visio page

4 Other Functionality

4.1 New Filter Creation

Create New Filter 1) Go to the Filters page (see the Figure 10). 2) Select filter type (1). 3) Fill in filter name (4). 4) Click on the Set link (Input data). 5) Dialog window with the list of filters will appear. 6) Select a desired operation by clicking on the Add as argument link (2). 7) Add an argument by clicking on the Add link (3). This can be done at the dialog window or at the Input data section. (see the Figure 11). 8) Dialog window with available arguments will appear. 9) Fill in desired value (2) and add it to the operation by clicking on the Add as argument link (1). 10) Save the filter by clicking on the Add filter link (3).

Figure 10: Filters page – New Filter Creation – step 1

Figure 11: Filters page – New Filter Creation – step 2

Create Filter from Copy 1) Go to the Filters page (see the Figure 12). 2) Double click on a filter name. 3) Dialog window with filter details will appear. 4) Click on the Copy as new link (1) to copy values to the Add filter area. 5) Click on the Print link (2) to see processed data in Javascript console. The value debugConsole must be set to true in the config.js file.

4.2 Data Viewer

Go to the Data page (Figure 13) to see imported data in the database.

Figure 12: Filters page – New Filter Creation – From Existing Filter

Figure 13: Data page

4.3 Selected Vessel Trace

1) Go to the Dynamic page (see the Figure 14). 2) If you click on the Show Events link (Figure 5), the related events and the trace for the selected vessel will appear. 3) The visible time range can be changed by adjusting the time slider or the from and to date fields.

Figure 14: Dynamic page – Selected Vessel Trace

4.4 Other Visualizations

Clustered traces
1) Go to the Visio page (see Figure 15).
2) Select Lines from the Features list.
3) Wait several seconds.

Heat map and clustered events
1) Go to the Visio page (see Figure 16).
2) Select Points from the Features list and check the Heat map checkbox.

Areas and chart
1) Go to the Visio page (see Figure 17).
2) Select Polygons from the Features list and select the corresponding filter.

Bar chart
1) Go to the Visio page (see Figure 18).
2) Select Info box Page, Chart Info Box, and Bar Chart.
3) You can switch between the Stack and Group views.

Figure 15: Visio page – Clustered traces (all vessels)

Figure 16: Visio page – Heat map and clustered events

Figure 17: Visio page – Created areas and chart of events split by type and area

Figure 18: Visio page – Chart of events split by type and day of week

5 Technical Requirements

5.1 Hardware
∙ CPU: dual-core processor, 64-bit architecture recommended
∙ Memory: 4 GB RAM, 8 GB RAM recommended
∙ HDD: at least 2 GB of free space, SSD drive recommended
∙ Graphics card: GPU with good 2D performance
∙ Display: resolution 1280 x 800 or higher, 24-bit colors

5.2 Software
∙ Any operating system with Java support; a 64-bit OS and 64-bit JRE are recommended
∙ A web browser compatible with HTML5, SVG, and JavaScript (e.g. Mozilla Firefox, Google Chrome; Opera and Internet Explorer work only partially), Google Chrome recommended

List of Figures

1  Drag&Drop JSON file upload
2  Settings page
3  Filters page
4  Home page
5  Dynamic page – Vessel risk chart and vessel info
6  Dynamic page
7  Create new polygon area – Visio page
8  Spatial filter – Visio page
9  Visio page
10  Filters page – New Filter Creation – step 1
11  Filters page – New Filter Creation – step 2
12  Filters page – New Filter Creation – From Existing Filter
13  Data page
14  Dynamic page – Selected Vessel Trace
15  Visio page – Clustered traces (all vessels)
16  Visio page – Heat map and clustered events
17  Visio page – Created areas and chart of events split by type and area
18  Visio page – Chart of events split by type and day of week

Appendix C

Contents of the CD

The attached CD contains the source code of the application and the thesis text in PDF format. The CD structure is described in the following table.

Path         Description
src/         source code of the application (NetBeans1 project, incl. 3rd party libraries)
thesis.pdf   thesis text

Table C.1: CD Path Structure

1NetBeans, http://netbeans.org/
