DEPARTMENT OF COMPUTER ENGINEERING AND INFORMATICS

UNIVERSITY OF PATRAS

DOCTORAL DISSERTATION

EFFICIENT ALGORITHMS FOR BIG DATA MANAGEMENT

ELIAS DRITSAS

SUPERVISOR: SPYROS SIOUTAS, PROFESSOR

PATRAS - AUGUST 2020

DEPARTMENT OF COMPUTER ENGINEERING AND INFORMATICS

UNIVERSITY OF PATRAS

DOCTORAL DISSERTATION

EFFICIENT ALGORITHMS FOR BIG DATA MANAGEMENT

ELIAS DRITSAS

DISSERTATION COMMITTEE:

SPYROS SIOUTAS, PROFESSOR (SUPERVISOR)

CHRISTOS MAKRIS, ASSOCIATE PROFESSOR (COMMITTEE MEMBER)

KONSTANTINOS TSICHLAS, ASSISTANT PROFESSOR (COMMITTEE MEMBER)

GEORGE ALEXIOU, PROFESSOR (COMMITTEE MEMBER)

DIMITRIS TSOLIS, ASSISTANT PROFESSOR (COMMITTEE MEMBER)

IOANNIS TZIMAS, ASSOCIATE PROFESSOR (COMMITTEE MEMBER)

PHOIVOS MYLONAS, ASSOCIATE PROFESSOR (COMMITTEE MEMBER)

ELIAS DRITSAS, AUGUST 2020

ABSTRACT

In the context of the doctoral research, I dealt with data management problems by developing methods and techniques that, on the one hand, maintain or improve the privacy and anonymity of users and, on the other hand, are efficient in terms of time and storage space for large databases. The research results of this work focus on the following:

• Evaluating the performance of queries on a large database, with and without the Bloom filter structure.

• Evaluating job execution time, memory and disk usage for the Privacy Preserving Record Linkage (PPRL) problem in the Hadoop MapReduce framework.

• Methods for answering nearest-neighbor queries on spatio-temporal data (moving users' trajectories) in order to preserve anonymity, where the queries are applied to clustered or non-clustered data.

• The k-anonymity method was used, where the anonymity set with which each moving object of the spatio-temporal database is camouflaged consists of its k nearest neighbors. The robustness of the method was quantified with a probability of 1/k, and the effect of the dimensionality and correlation of the data on the preservation of anonymity and privacy was studied.

• The above method was improved in terms of efficient storage of spatio-temporal data by applying nearest-neighbor queries to Hough-transformed nonlinear trajectories of moving objects. The application of secure k-NN queries was evaluated in the GeoSpark environment.

• Sentiment analysis on Twitter data and tourism demand forecasting in Apache Spark.

Keywords: Bloom Filters, Privacy Preserving, k-NN Queries, k-anonymity, Spatio-temporal Databases, Sentiment Analysis, Twitter, Apache Spark, GeoSpark.

ACKNOWLEDGEMENT

This dissertation signifies the end of an academic journey as a PhD student. At this point, I would like to sincerely thank all those who have supported me all these years.

First and foremost, with immense pleasure and a deep sense of gratitude, I wish to express my sincere thanks to my supervisor Dr. Spyros Sioutas, Professor, University of Patras; without his motivation and continuous encouragement, this research would not have been successfully completed. I would also like to express my sincere gratitude to Prof. Sioutas for his enthusiasm in supervising this work and for his constant support, encouragement and critical suggestions during the writing of this thesis. I am also grateful to my initial supervisor, Prof. Athanasios Tsakalidis, who supervised me during the largest part of my Ph.D. studies and gave me the opportunity to undertake this research thesis. Also, I warmly thank Associate Prof. Christos Makris, who willingly accepted to supervise me during their final part. I express my sincere thanks to Dr. Andreas Kanavos for his kind support and encouragement in several ways throughout my research work.

I wish to extend my profound sense of gratitude to my parents for all the sacrifices they made during my research and for providing me with moral support and encouragement whenever required. Last but not least, I would like to thank my wife Maria Trigka for her constant encouragement and moral support, along with her patience and understanding.

Finally, I would like to acknowledge the support and funding of the current PhD thesis by the General Secretariat for Research and Technology (GSRT) and the Hellenic Foundation for Research and Innovation (HFRI).

Place: Patra

Date: 31/08/2020 Elias Dritsas

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENT
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS

1 Introduction

I Research on Methods and Algorithms for Secure Queries Processing

2 Bloom Filters for Efficient Coupling between Tables of a Database
  2.1 Introduction
  2.2 Bloom Filters Background
    2.2.1 Bloom Filter Elements
    2.2.2 Space-Time Advantages and Constraints
  2.3 Bloom Filters and RDBMS
    2.3.1 Relational Database Management Systems
    2.3.2 Queries Language-SQL
    2.3.3 Indexes Table
  2.4 Experimental Evaluation in SQL Server
  2.5 Conclusions
    2.5.1 Research Conclusions
    2.5.2 Research Constraints
    2.5.3 Future Extensions

3 MapReduce Implementations for Privacy Preserving Record Linkage
  3.1 Introduction
  3.2 Related Work
    3.2.1 PPRL encoding techniques
    3.2.2 Private Indexing
  3.3 MapReduce Framework
  3.4 Performance Evaluation
  3.5 Conclusions

4 Security and Privacy Solutions associated with NoSQL Data Stores
  4.1 Introduction
  4.2 Related Work
  4.3 Comparison of Relational and NoSQL Databases
    4.3.1 Reliability of Transactions
    4.3.2 Scalability Issues and Cloud Support
    4.3.3 Complexity and Big Data Management
    4.3.4 Data Model
    4.3.5 Data Warehouse and Crash Recovery
    4.3.6 Privacy and Security
  4.4 Proposed Security and Privacy Solutions
    4.4.1 Pseudonyms-based Communication Network
    4.4.2 Monitoring, Filtering and Blocking
  4.5 Conclusions

5 Trajectory Clustering and k-NN for Robust Privacy Preserving Spatio-Temporal Databases
  5.1 Introduction
  5.2 Materials and Methods
    5.2.1 Clustering
    5.2.2 Classification
    5.2.3 Useful Definitions
    5.2.4 System Architecture
    5.2.5 Problem Definition
    5.2.6 System Model
    5.2.7 Privacy Preserving Analysis
    5.2.8 Experiments Data and Environment
  5.3 Discussion
  5.4 Results
    5.4.1 Experiments Results
    5.4.2 Experiments Conclusions

6 Storage Efficient Trajectory Clustering and k-NN for Robust Privacy Preserving Databases
  6.1 Introduction
  6.2 Related Work
  6.3 Materials and Methods
    6.3.1 Dual Transform for Moving Objects
    6.3.2 kNN Classification and Clustering in Dual Space
    6.3.3 Problem Definition
    6.3.4 Problem Formulation
    6.3.5 System Model
    6.3.6 Vulnerability and Storage Efficiency
    6.3.7 Privacy Preservation Analysis
    6.3.8 Experimental Data and Environment
  6.4 Results
    6.4.1 Vulnerability Evaluation in Hough Space
    6.4.2 Vulnerability Evaluation in Hybrid Space
  6.5 Discussion
  6.6 Conclusions

7 Trajectory Clustering and k-NN for Robust Privacy Preserving k-NN Query Processing in GeoSpark
  7.1 Introduction
  7.2 Related Work
    7.2.1 Distributed Frameworks for Spatio-Temporal Data Queries Processing
    7.2.2 Efficient Privacy Preserving k-NN Queries
  7.3 Materials and Methods
    7.3.1 Operations on Spatial Data
    7.3.2 The k-NN Classifier from Big Spatial Data Perspective
    7.3.3 Problem Definition
    7.3.4 Problem Formulation
    7.3.5 System Model
    7.3.6 GeoSpark System Overview
  7.4 Results
    7.4.1 Environment and Dataset
    7.4.2 Time Performance of k-Anonymity Set
    7.4.3 Vulnerability Evaluation
  7.5 Discussion
    7.5.1 Performance Issues
    7.5.2 Vulnerability
  7.6 Conclusions and Future Work

II Sentiment Analysis and Tourism Forecasting

8 An Apache Spark Implementation for Graph-based Hashtag Sentiment Classification on Twitter
  8.1 Introduction
  8.2 Related Work
    8.2.1 Sentiment Analysis and Classification Models
    8.2.2 Cloud Computing Preliminaries
  8.3 Sentiment Classification on Twitter
    8.3.1 Tweet-Level Sentiment Classification
    8.3.2 Hashtag-Level Sentiment Classification
  8.4 Spark Implementation
  8.5 Results and Evaluation
  8.6 Conclusions

9 An Efficient Preprocessing Tool for Supervised Sentiment Analysis on Twitter Data
  9.1 Introduction
  9.2 Related Work
  9.3 Tools and Environment
    9.3.1 Twitter
    9.3.2 Publications Mining Tools
    9.3.3 Pre-processing Scheme
    9.3.4 Features
    9.3.5 Topic Modeling
  9.4 Evaluation
  9.5 Conclusions and Future Work

10 An Apache Spark Methodology for Forecasting Tourism Demand in Greece
  10.1 Introduction
  10.2 Related Work
  10.3 Preliminaries
    10.3.1 Forecasting Tourism Methods
    10.3.2 Apache Spark
    10.3.3 Machine Learning Algorithm
  10.4 Implementation
    10.4.1 Methodology
    10.4.2 Dataset Description
  10.5 Experiments - Evaluation
  10.6 Conclusions and Future Work

REFERENCES
LIST OF PUBLICATIONS

Appendices

Appendix A Matlab Code

Appendix B GeoSpark Code

LIST OF FIGURES

2.1 Bloom Filter Overview
2.2 B-Tree overview
2.3 Queries Execution Time vs Records Size
3.1 HLSH under FPS
3.2 PPRL evaluation
3.3 PPRL evaluation
5.1 Data flow diagram
5.2 API Request Diagram
5.3 A Matlab overview of mobile users trajectories' points
5.4 Both clustering and k-NN: (a) x and (b) (x,y) for N=400 trajectories, L=100 time-stamps and k=5
5.5 Both clustering and k-NN: (a) (x,y,θ) and (b) (x,y,θ,v) for N=400 trajectories, L=100 time-stamps and k=5
5.6 Both clustering and k-NN: (a) x and (b) (x,y) for N=2000 trajectories and L=100 time-stamps
5.7 Both clustering and k-NN: (a) (x,y,θ) and (b) (x,y,θ,v) for N=2000 trajectories and L=100 time-stamps
5.8 Clustering (x,y,θ,v) and k-NN x for N=400 trajectories, L=100 time-stamps for (a) k=5 and (b) k=15
5.9 Clustering with (x,y,θ,v) and k-NN in x: (a) concerns k=15 while (b) k=30 for N=2000 trajectories and L=100 time-stamps
6.1 An overview of trajectory segmentation and Hough-X transformation for a linear trajectory segment (TS), which consists of M points. The dual points of the M points in TS are the same, for example a1 = ... = aM, u1 = ... = uM, where the left graph shows the y(t) line and the right graph shows the Hough-X points
6.2 A raw trajectory approximation with a discrete number of R linear sub-trajectories. In the dual dimensional space, each one is represented as a dual point; for example, the linear sub-trajectory [l(t0), l(t1)] is represented as a dual point dp1, and the linear sub-trajectory [l(t1), l(t2)] is represented as a dual point dp2
6.3 Theoretical curve of compression ratio for M = [10 100 1000 10000 100000]
6.4 Both clustering and k-NN: (a) (Ux, ax) and (b) (Ux, ax, Uy, ay) for N = 1000 trajectories, L = 10 time-stamps; (c) (Ux, ax) and (d) (Ux, ax, Uy, ay) for N = 995 trajectories, L = 100 time-stamps
6.5 Clustering with (Ux, ax, Uy, ay) and suppressing k-NN: (a) (Ux, ax, *, *) and (b) (*, *, Uy, ay) for N = 1000 trajectories, L = 10 time-stamps; (c) (Ux, ax, *, *) and (d) (*, *, Uy, ay) for N = 995 trajectories, L = 100 time-stamps
6.6 Clustering with (Ux, ax) and k-NN with (Ux, ax): (a) Mobile User 10 and (b) Mobile User 100 for N = 995 trajectories, L = 50 time-stamps; (c) Vulnerability measure in dual Hough-X and native dimensional space of (x, y)
6.7 (a) Initial points per trajectory and (b) compression ratio for N = 87 trajectories, L = 100 time-stamps. Clustering with (x, y) and k-NN: (c) x and (Ux, ax) and (d) (x, y) and (Ux, ax, Uy, ay) for N = 87 trajectories, L = 100 time-stamps
6.8 Trajectory partition, grouping, and representatives
7.1 An Overview of Continuous Trajectory Point k Nearest Neighbor (CTPkNN) Query
7.2 An Overview of Spatio-Temporal Data Partitioning and Indexing
7.3 An Overview of GeoSpark Layers
7.4 An Overview of 40 Trajectories through Zeppelin
7.5 Time Cost for k-Anonymity Set Computation with or without Indexing for N = 80, 500, 2000 Mobile Objects
7.6 Time Cost for k-Anonymity Set Computation with or without Indexing for 3 Cases of Total Input Data in Executor
7.7 Time Cost for Mobile Objects N = {500, 2000, 8000, 32.000} without Indexing for k = 8
7.8 Spatial PointRDD Data Distribution for 4 Spatial Partition Techniques for 2000 Mobile Objects
7.9 (a) Euclidean Space and (b) Polar Space for N = 500 Trajectories, L = 100 Timestamps
7.10 Hough-X Space of x for (a) k = 5 and (b) k = 10 for N = 500 Trajectories, L = 100 Timestamps
7.11 Hough-X Space of y for (a) k = 5 and (b) k = 10 for N = 500 Trajectories, L = 100 Timestamps
7.12 Vulnerability Performance Comparison in Euclidean and Hough-X Space
8.1 An example of Hashtag Graph Model [169]
9.1 KDD process for knowledge mining from data [94]
10.1 Tourist Arrivals in Greece (2006 - 2015)
10.2 Predictions (2014 - 2018)

LIST OF TABLES

2.1 SQL Queries Execution Time Results vs Data Size
2.2 SQL Queries Execution Time Results vs Data Size
5.1 An example of spatio-temporal database for d = 4
5.2 k-anonymity sets for N mobile users in L = 5 time-stamps
5.3 Parameters for the 1st experiment of N=400 trajectories
5.4 Parameters for the 2nd experiment of N=2000 trajectories
6.1 An overview of the transformed spatio-temporal database
6.2 Parameters for the experiment of using only Hough-X of x and Hough-X of x, y, for N = 1000 trajectories, L = 10 time-stamps (Figure 6.4a,b) and N = 995 trajectories, L = 100 time-stamps (Figure 6.4c,d)
6.3 Parameters for the experiment using Hough-X of x and suppressing Hough-X of y (Exp1) for N = 1000 trajectories, L = 10 time-stamps (Figure 6.5a,b) and using Hough-X of y and suppressing Hough-X of x (Exp2) for N = 995 trajectories, L = 100 time-stamps (Figure 6.5c,d)
6.4 Parameters for the experiment using (x, y) for clustering and Hough-X for k-NN for N = 87 trajectories and L = 100 time-stamps (Figure 6.7c,d)
7.1 The Different Types of Point Resilient Distributed Dataset (RDDs) According to Selected Features
7.2 Trajectory Representation in a PointRDD
7.3 Simulation Parameters
7.4 Time for 80 Mobile Objects without Indexing and with R-Tree Indexing
7.5 Time for 500 Mobile Objects without Indexing and with R-Tree Indexing
7.6 Time for 2000 Mobile Objects without Indexing and with R-Tree Indexing
7.7 Impact of Spatial Partitioning in Time Performance
7.8 Parameters when using N = 500 Trajectories, L = 100 Timestamps
8.1 Performance of Tweet-Level Classifiers
9.1 Datasets Details
9.2 RapidMiner Results - Accuracy
9.3 Accuracy for different Training - Test Set ratios
9.4 10-Fold Cross-Validation
10.1 Data Sources

LIST OF ABBREVIATIONS

SQL Structured Query Language

NoSQL No Structured Query Language

RDBMS Relational Database Management Systems

RSA Rivest–Shamir–Adleman

RL Record Linkage

PPRL Privacy Preserving Record Linkage

HLSH Hamming Locality Sensitive Hashing

CLK Cryptographically Long Keys

k-NN k Nearest Neighbors

MWCL Method Without CLustering

MCL Method CLustering

DUST DUal-based Spatio-temporal Trajectory

kDUST DUal-based k anonymity

DP Dual Point

CR Compression Ratio

DukNN Dual-based k Nearest Neighbor

DuCLkNN Dual-based Clustering k Nearest Neighbor

SkNN Snapshot k Nearest Neighbors

CkNN Continuous k Nearest Neighbors

STPkNN Snapshot Trajectory Point k Nearest Neighbors

CTPkNN Continuous Trajectory Point k Nearest Neighbors

CHAPTER 1

Introduction

The rapid growth and evolution of technology, such as social networks and smart mobile devices, coupled with the large number of users, is leading to an increase in the volume of data on the Internet. In addition, there are other important sources of large amounts of data, such as scientific, telecommunications, banking and business data, whose analysis and management are important for critical decision making. However, the large amount of data, known as Big Data [119]-[112], has caused several problems concerning how to store, process and retrieve it. There are many systems [121], [91], [95] with different architectures that address this challenge, namely the management of massive data.

The aim of this thesis is to develop and implement efficient management and processing algorithms for large scale data. In particular, emphasis is placed on multi-dimensional data classification algorithms. These algorithms are widely used in location-related services/applications (e.g., GPS) to query a database and retrieve the desired information. A popular algorithm in the literature is k nearest neighbors, or k-NN [119]-[122], [112], which uses distance metrics to "answer" user queries to a database. Over the last decade, the vast explosion of data has fueled the development of Big Data management systems and technologies. The most popular solutions have been proposed in centralized environments, whose efficiency is limited for large amounts of data, so searching for distributed solutions is imperative.

Nowadays, digital data are the most valuable asset of almost every organization. Database management systems are considered as storage systems for the efficient retrieval and processing of digital data. However, the effective operation of a relational database, in terms of data access speed, is limited as its size increases significantly [96]. The Bloom filter is a special data structure with finite storage requirements that allows rapid checking of an object's membership in a dataset. It is worth mentioning that the Bloom filter structure has

been proposed with a view to constructively increasing data access speed in relational databases. Since the characteristics of a Bloom filter are consistent with the requirements of a fast data access structure, we examine the possibility of using it in order to increase the SQL query execution speed in a database. In the context of this research, in chapter 2 a database that includes big data tables is implemented in the RDBMS SQL Server, and subsequently the performance enhancement obtained by using Bloom filters, in terms of execution time on different categories of SQL queries, is examined. I experimentally demonstrated the time effectiveness of the Bloom filter structure in relational databases when dealing with large scale data. This investigation was initiated by the study of optimizing query performance [96], [125] in a large database with or without the Bloom filter structure [27], which requires finite storage and allows rapid checking of the existence of an object in a data set. Given the aforementioned Bloom filter features, consideration was given to using it to increase the speed of running SQL queries in a large database.

Furthermore, the Privacy Preserving Record Linkage problem based on Bloom filter encoding techniques is described in chapter 3; these techniques both maintain users' security and permit similarity checking. The research study focused on the problem of Privacy Preserving Record Linkage, known as PPRL [162], [26], due to its wide resonance in the research community for protecting the identity and characteristics of entities associated with records in different databases. This topic was studied in the MapReduce framework in Hadoop, using the Locality Sensitive Hashing indexing technique in the Hamming space (HLSH). The encoding technique based on Bloom filters [144], known as CLK (Cryptographically Long Keys) [22], was utilized due to its security against intruders. Moreover, our study extended to the HLSH/FPS private indexing technique, and we briefly describe four implementations in the MapReduce distributed framework, which is capable of processing large scale data. I also conducted an experimental evaluation of these four versions in terms of job execution time, memory and disk usage.

In addition, in chapter 4 security and privacy issues for NoSQL databases are studied, where security mechanisms and privacy solutions are thoroughly examined. The adoption of Cloud computing and big data management technologies has created an urgent need for specific databases to safely store extensive data along with high availability. Specifically, a growing number of companies have adopted various types of

non-relational databases, commonly referred to as NoSQL databases. These databases provide a robust mechanism for the storage and retrieval of massive data without using a predefined schema. NoSQL platforms are superior to RDBMS, especially in cases when we are dealing with big data and parallel processing, and in particular when there is no need to use relational modeling.

Let us recall that the main objective of the research is the development of the basic knowledge mining algorithms (namely, K-means clustering and k-NN classification) for the processing of high volume spatial data on the basis of enhancing security and protecting privacy. Hence, in the context of this research, in chapter 5 the problem of Privacy Preserving on Spatio-Temporal Databases is studied. In particular, the k-anonymity of mobile users based on real trajectory data is used to quantify privacy. The k-anonymity set consists of the k nearest neighbors. A motion vector of the form (x,y,θ,v) is constructed, where x, y are the spatial coordinates, θ the direction angle and v the velocity of mobile users, and the problem is studied in four-dimensional space. Two approaches are followed. The former applies only the k-Nearest Neighbor (k-NN) algorithm on the whole data set, while the latter combines trajectory clustering, based on K-Means, with k-NN. Unlike previous works, such as [150], [178], which deal with trajectory clustering, the latter approach applies k-NN inside a cluster of mobile users with a similar motion pattern (θ,v). We define a metric, called Vulnerability, that measures the rate at which the k-NNs are varying. This metric varies from 1/k (high robustness) to 1 (low robustness) and represents the probability of the real identity of a mobile user being discovered by a potential attacker. The aim of this work is to prove that, with high probability, the above rate tends to a number very close to 1/k in the clustering method, which means that k-anonymity is highly preserved. Through experiments on real spatial data sets, the anonymity robustness of the proposed method, the so-called Vulnerability, is evaluated.

Bearing in mind the "curse of dimensionality" and its effect on clustering and classification, its impact on the maintenance of privacy has been studied. That is why we have evaluated the impact of the number and the correlation of dimensions on privacy protection for the two approaches, defining a 'Vulnerability' metric, which measures the rate at which the k nearest neighbors of a set of moving users change. The study on real spatial data evaluated the performance of both methods in terms of privacy

preservation for different combinations of the characteristics (x, y, θ, v). Regardless of the method used, if the ids of the k nearest neighbors remain the same or do not change often in time, it is difficult for an adversary to discover a moving user based on historical data.

The need to store massive volumes of spatio-temporal data has become a difficult task, as GPS capabilities and wireless communication technologies have become prevalent in modern mobile devices. As a result, massive trajectory data are produced, incurring expensive costs for storage, transmission, as well as query processing. A number of algorithms for compressing trajectory data have been proposed in order to overcome these difficulties. These algorithms try to reduce the size of trajectory data while preserving the quality of the information. In chapter 6, I focus on both the privacy preservation and the storage of spatio-temporal databases. To alleviate this issue, I focused on the storage-compression problem [157] of spatio-temporal databases. An effective method for spatio-temporal data compression called Dual-based Spatio-temporal Trajectory (DUST) is proposed here, whereby an initial raw trajectory is divided into a number of linear sub-trajectories under the Hough transformation [157], [153], [53], which forms the representatives of each linear component of the initial trajectory; the trajectory is therefore compressed. The Hough transformation breaks down the k-NN query into two one-dimensional queries and allows it to be applied in a smaller space. This helps to bring compression into the data and to enhance the safety of the queries. In particular, even if an intruder has access to the representatives of the trajectory data and tries to reproduce the points of the initial track, the identity of the mobile object remains safe with high probability.

The anonymity set now consists of mobile users who have the same motion pattern based on the Hough-X/Y transformation. This approach is differentiated from previous approaches described in [196]-[128]. A theoretical limit for measuring the sensitivity of the Hough-based methods in the two projections of the x, y dimensions, as well as the overall sensitivity, is computed. In addition, a model of attacks on spatio-temporal databases, and on Hough-transformed ones, which store the trajectory data of a set of moving objects, is studied. The use of a digital pseudonyms protocol with an Identity Provider, known as the Brand protocol, is also recommended, as it enhances the protection of the identity of the database objects against a malicious user. This reinforces the firmness of the k-anonymity method that we

already studied in the previous chapter. To our knowledge, we are the first to study and address k-NN queries on nonlinear moving object trajectories that are represented in dual dimensional space. Additionally, the proposed approach is expected to reinforce the privacy protection of such data. Specifically, even in the case that an intruder has access to the dual points of the trajectory data and tries to reproduce the native points that fit a specific component of the initial trajectory, the identity of the mobile object will remain secure with high probability. In this way, the privacy of the k-anonymity method recommended in [39] is reinforced. Through experiments on real spatial datasets, we evaluate the robustness of the new approach and compare it with the one studied in our previous work.

Privacy Preserving and Anonymity have gained significant concern from the big data perspective. We take the view that forthcoming frameworks and theories will establish several solutions for privacy protection. The k-anonymity is considered a key solution that has been widely employed to prevent data re-identification, and it concerns us in the context of this work. Data modeling has also gained significant attention from the big data perspective. It is believed that the advancing distributed environments will provide users with several solutions for efficient spatio-temporal data management. GeoSpark is utilized in the current work as it is a key solution that has been widely employed for spatial data. Specifically, it works on top of Apache Spark, the main framework leveraged by the research community and organizations for big data transformation, processing and visualization. To this end, we focused on trajectory data representation so as to be applicable to the GeoSpark environment, and a GeoSpark-based approach is designed for the efficient management of real spatio-temporal data. The next step is to gain a deeper understanding of the data through the application of k nearest neighbor (k-NN) queries, either using indexing methods or otherwise. The k-anonymity set computation, which is the main component of privacy preservation evaluation and the main issue of our previous works, is evaluated in the GeoSpark environment. More to the point, the focus here is on the time cost of the k-anonymity set computation along with vulnerability measurement. The extracted results are presented in tables and figures for visual inspection.

The importance of, and general research interest in, methods for processing safe spatial k-NN queries have increased. In this respect, and given the rapid increase in the volume

of spatial data (Big Spatio-temporal Data), it is necessary to assess the time cost of creating the anonymity set of each moving object in the Spark environment. This will help to assess the practical interest in the implementation (and development) of such methods in real-time systems. The GeoSpark environment has been set up to this end. In particular, the configuration of the anonymity set is approached as a whole by Snapshot Trajectory Point kNN (STPkNN) queries, based on the selected descriptors of each trajectory point of a set of moving objects in the respective timestamp. The performance evaluation was selected to be applied in the Apache Spark based GeoSpark because it is designed for processing spatial data.

Although traditional privacy solutions have been designed in Euclidean space, our framework also studies the concept of anonymity in Hough space. Due to the constantly changing location information of moving objects, it is necessary to evaluate a large number of nearest-neighbor queries for a large number of moving objects per timestamp. The k-NN spatio-temporal queries are issued in order to configure the set of moving object ids based on the trajectory points of all objects in each timestamp. Specifically, a k-NN query, which we call the Snapshot Trajectory Point k-NN (STPkNN), is calculated by considering the selected attribute information (e.g., Euclidean coordinates, angle, velocity, dual points) of all objects. Assuming a high sampling rate, we can consider that the process is similar to a Continuous Trajectory Point k-NN. An important feature of the continuous k-NN query in Hough space is that the nearest neighbors between two consecutive space-time points remain the same. Based on this feature, the problem of executing the classical Continuous k-NN queries [37]-[189] can be significantly reduced to the specific space-time points where the velocity of the moving object changes, indicating a new linear sub-trajectory of the original nonlinear track.

In the second part of the research, as included in chapters 8-10, the performance of various classifiers on the Sentiment Analysis problem [142], [97], applied to Twitter data in the Apache Spark environment, was investigated. In addition, a Python text and language pre-processing tool [65] was developed to remove erroneous values and noise in an optimal and efficient manner. A notable feature is the use of emojis and emoticons in the field of sentiment analysis. Supervised machine learning techniques were used to analyze user views. The performance of the classifiers (Naive

Bayes and SVM) was experimentally evaluated under specific parameters, such as the size of the training data and the feature selection methods used (unigrams, bigrams and trigrams), using the k-fold cross validation technique. Finally, the use of a data mining technique based on Decision Trees was studied in Apache Spark, with the aim of forecasting tourism demand [30], taking into account the contribution of the explanatory variables to it. The data set was constructed from public sources, and the predicted (target) variable is the tourist arrivals in Greece for the years 2006 to 2015.
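To make the Vulnerability metric concrete, the following minimal Python sketch computes it for synthetic trajectories. It is only a sketch under stated assumptions: the brute-force k-NN, the random data, and the reading of Vulnerability as the reciprocal of the size of the anonymity set that persists over time are illustrative choices, not the exact implementation of the following chapters.

import numpy as np

def knn_ids(points, target, k):
    # Indices of the k nearest neighbors of `target` (itself excluded).
    dist = np.linalg.norm(points - points[target], axis=1)
    dist[target] = np.inf
    return set(np.argsort(dist)[:k])

def vulnerability(trajs, target, k):
    # trajs has shape (N, L, d): N users, L timestamps, d features.
    # 1/k when the k-NN ids never change; 1 when no neighbor persists.
    persistent = knn_ids(trajs[:, 0, :], target, k)
    for t in range(1, trajs.shape[1]):
        persistent &= knn_ids(trajs[:, t, :], target, k)
    return 1.0 / max(len(persistent), 1)

rng = np.random.default_rng(0)
trajs = rng.random((400, 100, 4))            # toy (x, y, theta, v) trajectories
print(vulnerability(trajs, target=0, k=5))   # between 1/5 and 1

With purely random motion the persistent set empties quickly and the value tends to 1; clustered, correlated motion keeps it near 1/k, which is exactly the behavior the clustering method aims for.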

Part I

Research on Methods and Algorithms for Secure Queries Processing

CHAPTER 2

Bloom Filters for Efficient Coupling between Tables of a Database

2.1 Introduction

Business data, associated with all business activities, are typically stored in relational databases so that they can be managed using the SQL language, and more specifically queried through SQL. Relational databases are particularly effective in their operation. However, their efficiency is limited if they store "big data" with complex correlations [167]. An SQL query can be very expensive in execution cost, and concretely in time and access to resources, if the execution plan is not optimized. Possible delays in the completion of SQL queries may impact the performance of applications using relational databases, thus reducing business performance. The main way to improve the performance of an SQL query is to reduce the number of required operations/calculations that should be performed during the execution of the corresponding query. However, further reduction of the required commands in an SQL query is not always possible, and additional techniques for SQL query performance optimization in a database are required [61]. In [125], the authors investigate this specific problem and recommend the use of IN, EXISTS, EQUAL and OPERATOR-TOP along with indexes. Moreover, the Bloom filter structure is used in databases such as Google Bigtable or Apache HBase in order to reduce the (disk) searching for non-existent records, optimizing in this way the performance of executed SQL queries [24].

Traditional database systems store data in the form of tables with records. Each record corresponds to a different entity object that holds information in a relational table. This relational organization of databases is effective when queries are performed on tables with a small number of records. However, as the number of records increases, e.g. to hundreds of thousands or millions of records, SQL queries usually search in a much larger number of records in order to locate and access a small number of

records or fields [103]. The best way to improve the execution speed of SQL queries in a database is the definition of indexes on fields that are part of the search criteria of an SQL query. When indexes are not set in a database, the database management system operates like a reader trying to find a word in a book by reading the entire book. With an index at the back of the book, the reader can complete the procedure much more quickly.

The benefit of using indexes when searching records in a table becomes greater as the number of table entries increases.^1 The role of indexes in a database is to direct access to records according to the search criteria of the SQL query. However, when a table in a database contains millions of records, despite the use of indexes, the identification of records that meet the search criteria requires accessing thousands of records of the relational table.^2 Therefore, in order to improve the execution speed of relational SQL queries, the advance exclusion of a significant number of records that do not meet the search criteria would be particularly useful. To this purpose, the implementation of the Bloom filter structure is suggested; this structure is built from the records of the tables and is further used for the exclusion of records that do not meet the criteria of the relevant SQL queries.

The purpose of this research is to examine to what extent the Bloom filter structure over tables in relational databases can affect the performance of data access queries for data tables with millions of records. To achieve the aim of this survey, our contributions are the following: (i) implementation of a Bloom filter in a relational database, (ii) experimental evaluation of queries with and without the support of the Bloom filter and tabular recording of the execution time of the queries, and (iii) graphical visualization of the results to show the Bloom filter's effectiveness (in terms of execution time) in executing SQL queries on tables with millions of records.

The rest of the chapter is organized as follows: in Section 2.2 the properties and basic components of Bloom filters are introduced. In Section 2.3, relational databases and the SQL framework are presented. Moreover, Section 2.4 presents the evaluation experiments conducted and the results gathered. Ultimately, Section 2.5 presents conclusions and constraints and draws directions for future work.

^1 http://odetocode.com/articles/237.aspx
^2 http://dataidol.com/tonyrogerson/2013/05/09/reducing-sql-server-io-and-access-times-using-Bloom-filters-part-2-basics-of-the-method-in-sql-server

2.2 Bloom Filters Background

2.2.1 Bloom Filter Elements

The Bloom filter structure, devised by Burton Howard Bloom in 1970, is used to rapidly check whether an element is present in a data set or not [18]. In particular, it permits checking whether an item certainly does not belong to the set. Although Bloom filters allow false positive responses, the space savings they offer outweigh this downside [99]. A Bloom filter is composed of two parts: a set of k hash functions and a bit vector. The number of hash functions and the length of the bit vector are chosen according to the expected number of keys to be added to the Bloom filter and the acceptable error rate per case.^3 A number of important components need to be properly defined in order for a Bloom filter to operate correctly. These parameters are briefly and comprehensively described in the following paragraphs.

2.2.1.1 Hash Functions

A hash function takes as input data of any length and returns as output an ID smaller in length and fixed in size, which can be employed to identify elements.^4 The main features that a hash function should have are the following:

• Return the same value at each iteration with the same data input.

• Quick execution.

• Generate output that is uniformly distributed over the range it can produce.

Some of the most popular algorithms for implementing hash functions are SHA1 and MD5. These functions differ in safety level and hash value calculation speed. Also, some algorithms distribute the generated values homogeneously, but they are impractical. In each case, the selected hash function should satisfy the application requirements. As for the number of hash functions, the larger this number is, the slower the hash values are generated and the faster the binary vector fills up. However, this

^3 https://www.perl.com/pub/2004/04/08/bloom_filters.html
^4 https://blog.medium.com/what-are-bloom-filters-1ec2a50c68ff

Fig. 2.1 Bloom Filter Overview

decision increases the incorrect predictions about the existence of an object in a dataset.^5 The optimal number of hash functions derives from the following formula [99]:

k = (m/n) ln(2)    (2.1)

where m is the length of the binary vector and n is the number of keys inserted in the Bloom filter. When selecting the number of hash functions to be used, we also calculate the probability of false positive predictions. The previous step is repeated until we get an accepted value for the false positive probability [27].
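As a quick illustration of Equation (2.1), the Python snippet below computes the optimal k for a few vector lengths, together with the standard false-positive estimate p = (1 - e^(-kn/m))^k from the Bloom filter literature; the concrete values of m and n are assumptions for the example only.

from math import exp, log

def optimal_hashes(m_bits, n_keys):
    # Equation (2.1): k = (m/n) * ln 2, rounded to an integer >= 1.
    return max(1, round((m_bits / n_keys) * log(2)))

def false_positive_rate(m_bits, n_keys, k):
    # Standard approximation p = (1 - e^{-kn/m})^k.
    return (1 - exp(-k * n_keys / m_bits)) ** k

n = 1_000_000                            # expected number of inserted keys
for m in (4 * n, 8 * n, 16 * n):         # bits spent per key: 4, 8, 16
    k = optimal_hashes(m, n)
    print(f"m/n={m // n:2d}  k={k:2d}  p~{false_positive_rate(m, n, k):.5f}")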

2.2.1.2 Binary Vectors Length

The length of the bit vector of a Bloom filter affects the filter's false positive rate. The greater the length of the binary vector, the lower the probability of false positive responses. Conversely, as the length of the vector is shrunk, this probability increases. Generally, a Bloom filter is considered complete when 50% of the bits in the array are equal to 1. Beyond this point, further addition of objects will result in an increased rate of false positive responses [110].
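The same standard approximations make the 50% observation concrete: the expected fraction of set bits after n insertions is 1 - e^(-kn/m), and the sketch below (with illustrative parameters) shows the false positive rate degrading as insertions approach and then exceed the design capacity.

from math import exp

m, k = 8_000_000, 6                      # vector length in bits, hash count

def fill_ratio(n):
    # Expected fraction of bits set to 1 after n insertions.
    return 1 - exp(-k * n / m)

for n in (500_000, 1_000_000, 2_000_000):
    # At n = 1_000_000 the filter is near the ~50% fill rule of thumb.
    print(f"n={n:>9,}  bits set ~{fill_ratio(n):.0%}  "
          f"false positives ~{fill_ratio(n) ** k:.3%}")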

2.2.1.3 Key Insertion

We initialize a Bloom filter by setting all values of the binary vector equal to 0. To insert a key into the Bloom filter, the k hash functions are first applied, and the positions of the binary vector that correspond to the hash values change from 0 to 1.

^5 https://llimllib.github.io/bloomfilter-tutorial

If the relevant bit is already set to 1, then its value does not alter further.^6 Each bit of the vector can simultaneously encode multiple keys, which makes the Bloom filter compact, as shown in Figure 2.1 [21]. These overlapping values do not permit removing a key from the filter, since it is not known whether the relevant bits were also activated by other keys. The only way to remove a key from a Bloom filter is to rebuild the filter from scratch, without incorporating the key to be removed. To check whether a key may be present in the Bloom filter, the following procedure is applied. Initially, the hash functions are applied to the search key, and then we check whether all the bits indicated by the hash functions are activated. Concretely, if at least one of the bits is disabled, it is certain that the corresponding key is not included in the filter. If all bits are turned on, then we know that, with high probability, the key has been introduced.
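The following self-contained Python sketch mirrors the insertion and membership-check procedure just described. Deriving the k positions from two halves of one MD5 digest (double hashing) is an illustrative choice, not the only valid one.

import hashlib

class BloomFilter:
    def __init__(self, m_bits, k_hashes):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8 + 1)   # all bits start at 0

    def _positions(self, key):
        # k positions derived from two 64-bit halves of one MD5 digest.
        d = hashlib.md5(key.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):                          # set the k relevant bits to 1
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):                 # all k bits on => probably in
        return all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(key))

bf = BloomFilter(m_bits=10_000, k_hashes=6)
bf.add("alice@example.com")
print("alice@example.com" in bf)   # True
print("bob@example.com" in bf)     # almost surely False; never a false negative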

2.2.2 Space-Time Advantages and Constraints

The implementation of a Bloom filter is relatively simple in comparison with other relevant search structures. In addition, the use of a Bloom filter ensures fast membership checking of a value and, moreover, absolute reliability regarding the non-existence of an object in it (no false negatives) [140]. The time required to add a new item or to check whether an item belongs to the data set is independent of the number of elements in the filter.^7 More to the point, a strong advantage of Bloom filters is the storage space saving in comparison with other data structures such as sets, hash tables, or binary search trees. The insertion of an element into a Bloom filter is an irreversible process.^8 The size of the data in a Bloom filter must be known in advance for determining the vector length and the number of hash functions. However, the number of objects that will be imported into a Bloom filter is not always known in advance. It is theoretically possible to define an arbitrarily large size, but this would be wasteful in terms of space and would overturn the main advantage of the Bloom filter, which is storage economy. Alternatively, a dynamic Bloom filter structure could be adopted, which, however, is

^6 https://www.perl.com/pub/2004/04/08/bloom_filters.html
^7 https://prakhar.me/articles/bloom-filters-for-dummies
^8 http://bugra.github.io/work/notes/2016-06-05/a-gentle-introduction-to-bloom-filter

not always possible. There is a variant of the Bloom filter, called the Scalable Bloom filter, which dynamically adjusts its size for a varying number of objects. The use of such a variant could alleviate some of these shortcomings. A Bloom filter cannot produce the list of the items inserted; it can only check whether an item has been introduced into a dataset. Finally, the Bloom filter cannot be used for answering questions about the properties of the objects.

2.3 Bloom Filters and RDBMS

2.3.1 Relational Database Management Systems

Relational database management systems have been a common choice for storing information in databases since the 1980s, used for a wide range of data such as financial and logistics information, personal data, and other forms of information. Relational databases have replaced other forms, such as hierarchical or network databases, as they are easier to understand and more convenient to use. The main advantage of the relational data model is that it allows the user to issue a data access query without the need to define access paths to the stored data or other additional details [98]. Furthermore, relational databases keep their data in the form of tables. Each table consists of records, called tuples, and each record is uniquely identified by a field, i.e. the primary key, which has a unique value. Each table is usually connected to at least one other database table through a relationship of the form: (i) one-to-one, (ii) one-to-many, or (iii) many-to-many.

These relationships grant users unlimited ways of accessing data and dynamically combining it from different tables. Nowadays, the market provides more than one hundred RDBMS systems, and the most popular of them are the following: (i) Oracle, (ii) MySQL, (iii) Microsoft SQL Server, (iv) PostgreSQL, (v) DB2 and (vi) Microsoft Access (DB-Engines 2016), etc.^9

The SQL language is used for user communication with a relational database [137]. An SQL query demands no knowledge of the internal operation of the database or the relevant data storage system [172]. According to the ANSI (American National Standards Institute) standards, SQL is a standard language for relational database management systems. Moreover, the SQL language is used in order to query a database for the

^9 http://db-engines.com/en/ranking/relational+dbms

management of such data, and also for data updates or retrieval from a database. Some examples of relational databases that use SQL are: Oracle, Sybase, Microsoft SQL Server, Access and Ingres. The most important commands of the SQL query language are:^10 SELECT, UPDATE, DELETE, INSERT INTO, CREATE DATABASE, ALTER DATABASE, CREATE TABLE, ALTER TABLE, DROP TABLE, CREATE INDEX, DROP INDEX. The SQL commands are classified into the following basic types:

• Query language, with key command SELECT, for accessing information from the database tables.

• Data manipulation language, with key commands: (i) INSERT, for introducing new records, (ii) UPDATE, for modifying records, and (iii) DELETE, for deleting records.

• Data object definition, with key commands: (i) CREATE TABLE, and (ii) ALTER TABLE.

• Security control of the database, with key commands: (i) GRANT, REVOKE for managing user rights on database objects, and (ii) COMMIT, ROLLBACK for transaction management.

2.3.2 Queries Language-SQL

2.3.2.1 Membership Queries

The SQL IN operator checks whether an expression matches any value from a list of values. Furthermore, it is used in order to avoid multiple uses of the OR operator in SELECT, INSERT, UPDATE or DELETE queries.^11 Besides checking whether an expression belongs to a set of values listed directly in the SQL query, it can also check whether an expression is part of a set of values drawn from other tables.
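As a small runnable illustration, the sketch below shows both forms of IN. SQLite is used only to keep the example self-contained (the experiments of this chapter were run on SQL Server), and the table and column names are invented.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers(id INTEGER PRIMARY KEY, country TEXT);
    INSERT INTO customers VALUES (1,'GR'), (2,'DE'), (3,'GR'), (4,'FR');
    CREATE TABLE eu_focus(country TEXT);
    INSERT INTO eu_focus VALUES ('GR'), ('DE');
""")
# IN with a literal value list, replacing a chain of ORs:
print(con.execute(
    "SELECT id FROM customers WHERE country IN ('GR','FR')").fetchall())
# IN with a set of values taken from another table:
print(con.execute(
    "SELECT id FROM customers "
    "WHERE country IN (SELECT country FROM eu_focus)").fetchall())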

2.3.2.2 Join Queries

Join queries, which combine values from two or more data tables based on a JOIN criterion, usually concern relationships between related tables. More to the point, JOIN queries are distinguished into four categories, illustrated with a short runnable sketch after the list:

^10 http://www.w3schools.com/sql/sql_syntax.asp
^11 https://www.techonthenet.com/sql/in.php

1. Inner Join: returns the values from Table A and Table B that satisfy the joining criteria.

2. Left Join: returns all the values from Table A, together with the values of Table B that meet the joining criteria.

3. Right Join: returns all the values from Table B, together with the values of Table A that meet the joining criteria.

4. Outer Join: returns all the values from Table A and Table B, regardless of whether they satisfy the joining criteria.
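The sketch below demonstrates the first three categories on two toy tables (SQLite syntax for self-containment; older SQLite versions lack RIGHT and FULL OUTER JOIN, so the right join is emulated by swapping the table order).

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a(id INTEGER, name TEXT);
    CREATE TABLE b(id INTEGER, score INTEGER);
    INSERT INTO a VALUES (1,'x'), (2,'y'), (3,'z');
    INSERT INTO b VALUES (2,10), (3,20), (4,30);
""")
# Inner join: only the ids present in both tables.
print(con.execute(
    "SELECT a.id, name, score FROM a JOIN b ON a.id = b.id").fetchall())
# Left join: every row of a; NULL (None) where b has no match.
print(con.execute(
    "SELECT a.id, name, score FROM a LEFT JOIN b ON a.id = b.id").fetchall())
# Right join emulated as b LEFT JOIN a: every row of b.
print(con.execute(
    "SELECT b.id, name, score FROM b LEFT JOIN a ON a.id = b.id").fetchall())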

2.3.2.3 Exist Queries

The existence control queries are used in conjunction with a subquery. The control condition is considered satisfied when the subquery returns at least one relevant record. This verification can be used in the following queries: SELECT, INSERT, UPDATE or DELETE.^12
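A minimal EXISTS example in the same spirit (SQLite, invented schema): the correlated subquery qualifies an outer row as soon as it returns at least one record.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders(customer_id INTEGER);
    INSERT INTO customers VALUES (1,'x'), (2,'y');
    INSERT INTO orders VALUES (1);
""")
print(con.execute("""
    SELECT name FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
""").fetchall())   # [('x',)] -- only the customer that has an order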

2.3.2.4 Top Queries

The TOP command limits the number of records that a query returns, either to a specified number of rows or, from the 2016 version of SQL Server onward, to a specified percentage of records.^13 When the TOP command is used in combination with the ORDER BY command, the first N records are returned according to the sorting arrangement provided by the ORDER BY command. Otherwise, N unsorted records are returned. In addition, the TOP command specifies the number of records returned by a SELECT statement or affected by other statements, such as INSERT, UPDATE, JOIN, or DELETE. The TOP SELECT command can be particularly useful in large tables with thousands of records, since accessing and selecting a large number of records can adversely affect query execution performance.
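TOP itself is SQL Server syntax; the runnable sketch below uses SQLite's equivalent LIMIT clause and shows the SQL Server form as a comment. The table and its values are invented for the example.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales(amount INTEGER)")
con.executemany("INSERT INTO sales VALUES (?)",
                [(v,) for v in (5, 9, 1, 7, 3)])
# SQL Server:  SELECT TOP 3 amount FROM sales ORDER BY amount DESC;
print(con.execute(
    "SELECT amount FROM sales ORDER BY amount DESC LIMIT 3").fetchall())
# [(9,), (7,), (5,)] -- without ORDER BY, 3 arbitrary rows would come back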

^12 https://www.techonthenet.com/sql/exists.php
^13 https://docs.microsoft.com/en-us/sql/t-sql/queries/top-transact-sql

Fig. 2.2 B-Tree overview

2.3.3 Indexes Table

Indexes are auxiliary structures in a relational database management system, with the aim of increasing data access performance in the database. These auxiliary structures are created on one or more fields (columns) of a table of a database. Moreover, an index provides a quick way to search data based on the values of the specific fields that are part of the index. For example, if an index is created on the primary key of a table and a series of data is then searched based on the values of the corresponding fields, SQL Server first finds the value in the index field and subsequently uses the relevant index to quickly locate the whole of the relevant table entries. Without the index field, a scan of the entire table, row by row, would be required, directly influencing the performance of the relevant query execution.^14

Furthermore, an index consists of a set of pages that are organized into a B-tree data structure. This structure is hierarchical, comprising a root node at the top of the tree and the leaf nodes at the lowest level, as illustrated in Figure 2.2. When a query that includes a search criterion is executed, the query starts delving into

^14 https://www.simple-talk.com/sql/learn-sql-server/sql-server-index-basics

the relevant records from the root node and navigates through the intermediate nodes down to the leaf nodes of the B-tree structure. After locating the relevant leaf node, the query accesses the corresponding record either directly (in the case of a clustered index) or through a pointer to the relevant data record (in the case of a non-clustered index). A table in an SQL Server database can have at most one clustered index and more than one non-clustered index, depending on the version of SQL Server that is used.
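The effect of an index can be observed directly through a query planner. The SQLite sketch below is illustrative only (SQL Server exposes the analogous information through its execution plans, and the exact plan text varies with the SQLite version); it shows the same query switching from a full table scan to an index search once the index exists.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(id INTEGER, payload TEXT)")
con.executemany("INSERT INTO t VALUES (?, 'x')", ((i,) for i in range(100_000)))

q = "SELECT payload FROM t WHERE id = 4242"
print(con.execute("EXPLAIN QUERY PLAN " + q).fetchall())  # ... SCAN t ...
con.execute("CREATE INDEX idx_t_id ON t(id)")
print(con.execute("EXPLAIN QUERY PLAN " + q).fetchall())  # ... USING INDEX ...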

2.4 Experimental Evaluation in SQL Server

In this section, the results of the experiments conducted in the context of this research in order to evaluate the use of Bloom filters are presented. We perform a series of common SQL database queries with and without the support of the Bloom filter and graphically present the resulting time performance of the executed SQL queries. The SQL queries utilized are the following: In, Inner Join, Left Join, Right Join, Exists and Top. Tables 2.1 and 2.2, as well as Figure 2.3, show the execution times of the queries described previously. In the corresponding tables, the results are shown with and without the use of the Bloom filter (the label BF denotes its use) as the number of records changes.

Table 2.1 SQL Queries Execution Time Results vs Data Size

Execution Time in seconds

Data Size     In    In BF   Inner Join   Inner Join BF   Left Join   Left Join BF
10.000.000    44    24      44           24              1           1
9.000.000     38    24      41           26              1           1
8.000.000     26    21      26           24              1           1
7.000.000     19    21      19           20              0           1
6.000.000     19    20      19           20              0           1
5.000.000     18    19      19           19              0           1
4.000.000     13    14      13           12              0           1
3.000.000     11    12      12           13              0           1
2.000.000     10    12      11           12              0           1
1.000.000     3     3       3            3               0           1

For all SQL commands, and in particular for small numbers of records, we observed that the adoption of the Bloom filter structure overloaded the system, and thus the execution of the queries without the use of the Bloom filter was much faster.

Table 2.2 SQL Queries Execution Time Results vs Data Size

Execution Time in seconds

Data Size     Right Join   Right Join BF   Exists   Exists BF   Top   Top BF
10.000.000    44           42              43       23          26    6
9.000.000     37           27              38       24          18    6
8.000.000     26           26              25       25          7     6
7.000.000     19           20              18       20          7     5
6.000.000     19           20              19       20          7     5
5.000.000     19           19              19       19          5     5
4.000.000     13           12              12       12          5     5
3.000.000     12           13              11       12          5     5
2.000.000     11           12              12       12          3     3
1.000.000     3            3               2        3           1     2

As the number of table records increases, and especially for values greater than or equal to 8.000.000, the performance advantage offered by employing the Bloom filter structure increases significantly, and the difference in query execution speed becomes evident and grows rapidly. It is important to note that, during the repeated execution of the same queries, we observed the same runtime, but sometimes there was a gap of about two seconds between results. In these cases, we took the average values.

2.5 Conclusions

2.5.1 Research Conclusions

The large response times of SQL queries in relational databases affect not only the users, but also other applications that may run on the same computer, as well as the network itself hosting the relevant database. The Bloom filter is, capacity-wise, an effective solution, and it has been used in numerous applications in the past, especially when immediate checking of an object's membership was required.

The relevant experiments suggest that the inclusion of the Bloom filter structure in an SQL Server database with a large number of records (10.000.000 records) may increase its data access performance. The optimization of query execution time in a database, using the Bloom structure, allows users to quickly extract the needed information

Fig. 2.3 Queries Execution Time vs Records Size

and increase the efficiency of the relevant database. The Bloom structure in a relational database acts as a filter that removes from join, membership or existence control queries the need to access and process records that do not meet the criteria of the relevant queries. The potential profit from this restriction of the records involved in accessing and searching through an SQL query highly depends on the number of false positive records that the Bloom filter check returns. This number is reduced as the length of the binary vector of the Bloom filter is increased.

In this respect, an acceptable execution speed, as well as balanced storage requirements, should be chosen according to the requirements of each database instance and the user requirements concerning the access speed of the relevant database. Especially in cases such as a database containing historical data records with no probability of further record updates, the adoption of the Bloom filter for faster access among numerous relevant tables can be considered a solution that could lead to increased efficiency.

As can be seen from the execution times of the SQL queries (Tables 2.1, 2.2 and Figure 2.3), the benefit of the advance restriction of the records involved in an SQL query is greater than including all the records of the data tables and the indexes used for direct

access to them. It should be noted that, as the experimental measurements show, the application of the Bloom filter structure in a database deserves to be selected only when the number of entries in the relevant tables is very large. Otherwise, the use of the Bloom filter may have the opposite effect, i.e. increase the query runtime.
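To make the pre-filtering idea concrete, the following Python sketch shows an application-side variant of the approach: a Bloom filter built once over the keys of a table discards definite misses before any SQL is executed. It is a sketch only; the experiments of this chapter integrated the filter with SQL Server, whereas the SQLite database, the toy hashing, and the schema here are illustrative assumptions.

import sqlite3

def make_bloom(keys, m=1 << 20, k=4):
    bits = bytearray(m // 8)
    def positions(key):
        return [hash((key, i)) % m for i in range(k)]
    for key in keys:
        for p in positions(key):
            bits[p // 8] |= 1 << (p % 8)
    # Returns "maybe present" (True) or "definitely absent" (False).
    return lambda key: all(bits[p // 8] >> (p % 8) & 1 for p in positions(key))

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE big(id INTEGER PRIMARY KEY, v TEXT)")
con.executemany("INSERT INTO big VALUES (?, 'v')",
                ((i,) for i in range(0, 10_000, 2)))   # even ids only
maybe_in_big = make_bloom(range(0, 10_000, 2))

lookups = [3, 4, 7, 1000, 1001]
survivors = [x for x in lookups if maybe_in_big(x)]    # misses pruned early
if survivors:                                          # only survivors hit SQL
    marks = ",".join("?" * len(survivors))
    print(con.execute(f"SELECT id FROM big WHERE id IN ({marks})",
                      survivors).fetchall())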

2.5.2 Research Constraints

In the evaluation of the Bloom filter, we did not take into account possible delays caused by the maintenance and regular updates of the Bloom filter structure during record updates in the relevant tables. These possible delays could also affect the execution of other SQL queries. Although all experiments were performed on the same machine, those with the Bloom filter that were performed at different times may have been affected (in performance) by possible processes running in the background. These deviations do not directly affect the performance comparison between the same queries with or without the use of the Bloom filter, but mostly the comparison between the different SQL commands used.

2.5.3 Future Extensions

A promising and useful next step would be to investigate the applicability of Bloom filters in other relational database management systems (such as Oracle, Sybase, MySQL) with the aim of generalizing the conclusions drawn from the experimentation on the SQL Server relational database management system. Also, reviewing the actual performance of database operations over the millions of records used to store application data would allow more reliable conclusions about the use of the Bloom filter structure in relational databases, since possible delays of the system during application operation could then be taken into account.

CHAPTER 3

MapReduce Implementations for Privacy Preserving Record Linkage

3.1 Introduction

The rapid evolution of technology and the Internet has created huge volumes of data at very high rates, deriving from commercial transactions, social networks and scientific research. The mining and analysis of this volume of data may be beneficial for humans in crucial areas such as health, economy and national security, leading to more qualitative results. A common problem in data analysis is the record linkage (RL) process, which finds records in a dataset that refer to the same entity across different data sources (e.g., data files, books, websites, and databases) [11],[26],[47],[145]. The purpose of RL is to categorize all possible combinations of records from different databases as similar or dissimilar by using attributes that are not necessarily identifying fields. The RL model requires at least two members that provide their data in the form of tables; a table row corresponds to an entity that is described by the columns. Often, the RL model is simplified to exactly two members that provide the data to be combined (Alice and Bob), with or without the presence of a third member (Carol). The third member undertakes the linkage process and communicates its results to the participating members. Privacy-preserving policies often prevent research into personal data, as organizations are legally and ethically constrained in exchanging sensitive personal data; this leads to datasets that are either free of sensitive personal data or encrypted so as to greatly enhance privacy protection. The privacy requirement during the RL process paved the way to Privacy-Preserving Record Linkage (PPRL) [28],[162],[165]. As in the case of RL, the PPRL process finds pairs of records that refer to the same entity from multiple data sources, where the classification as similar or dissimilar is conducted on encoded data, so as to avoid disclosure of confidential information about the entities involved in the problem. An efficient blocking scheme for PPRL is HLSH/FPS, when combined with Bloom-filter-based encoding. It applies Locality Sensitive Hashing in the Hamming space of the Bloom-filter-encoded records and keeps track of frequently colliding pairs, in order to reduce the number of pairs on which the more rigid similarity comparison is performed. PPRL blocking techniques fall into the batch processing category, and in the Big-Data world one of the most widely used systems for batch processing applications is MapReduce, which is distributed and fault-tolerant. In this chapter, we evaluate the performance of four MapReduce work-flows of the HLSH/FPS blocking scheme for the PPRL framework. The rest of this chapter is organized as follows: Section 3.2 analyzes background knowledge on the encoding techniques based on Bloom filters and on private indexing. Section 3.3 briefly describes the MapReduce framework of the HLSH/FPS implementations. Finally, Section 3.4 presents the experimental evaluation and Section 3.5 presents the conclusions.

3.2 Related Work

3.2.1 PPRL encoding techniques

In this section we describe some Bloom-filter-based encoding techniques that are necessary for the PPRL process.

3.2.1.1 String encoding

The basic idea of this approach is the hashing of the q-grams of a record's string fields into Bloom filters [144]. A Bloom filter [17] is a bit vector of size S combined with a set of K hash functions, each of which maps an element to the position of a bit in the vector. Its objective is to give a quick answer about the membership of an element in a set by checking K positions in the bit vector. The K hash functions H_i(x) may be computed from two independent hash functions as:

H_i(x) = (h_1(x) + i · h_2(x)) mod S    (3.1)

For the h_1(x) and h_2(x) functions, we choose the keyed cryptographic hash functions HMAC-SHA1 and HMAC-MD5 respectively, due to their widespread and efficient implementations on cryptographic platforms.
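The following minimal sketch shows the q-gram string encoding with the double-hashing scheme of Eq. (3.1), using HMAC-SHA1 and HMAC-MD5 as h_1 and h_2. The values of S, K and the HMAC keys are illustrative assumptions; the chapter leaves them as deployment choices shared secretly between the linkage parties.

import hmac
import hashlib

S = 1000                  # bit-vector length (illustrative assumption)
K = 15                    # number of hash functions (illustrative assumption)
KEY1 = b"secret-key-1"    # HMAC keys shared by the linkage parties (assumed)
KEY2 = b"secret-key-2"

def qgrams(text, q=2):
    """Split a string field into overlapping q-grams, padded at the edges."""
    padded = "_" * (q - 1) + text.lower() + "_" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def encode_string(text, q=2):
    """Bloom-filter string encoding via Eq. (3.1):
    H_i(x) = (h1(x) + i * h2(x)) mod S, with h1 = HMAC-SHA1, h2 = HMAC-MD5."""
    bits = [0] * S
    for gram in qgrams(text, q):
        h1 = int.from_bytes(hmac.new(KEY1, gram.encode(), hashlib.sha1).digest(), "big")
        h2 = int.from_bytes(hmac.new(KEY2, gram.encode(), hashlib.md5).digest(), "big")
        for i in range(K):
            bits[(h1 + i * h2) % S] = 1
    return bits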

3.2.1.2 Record encoding

This method encodes entire records rather than single strings, for which the previous method is used. Each record consists of fields such as name, username, age, address, etc. As the PPRL process aims at protecting such data, it is necessary to encode the values of the selected fields of all the table's records. To this end, we suggest an encoding method based on a Bloom filter, for which values must be pre-selected for the involved elements and Bloom filters, such as the average number of q-grams. Three different approaches for record encoding using Bloom filters are described below. FBF (Field-level Bloom Filters) encoding [42], [41], [144] is the simplest extension of the string encoding with Bloom filters: the field values of a record are encoded in separate Bloom filters, which are then composed into a larger Bloom filter that encodes the entire record. In brief, the encoding steps are:

1. For a selected Q value (the q-gram length), calculate the average number of q-grams g of each record over the fields that will participate in the PPRL process, and from it the appropriate Bloom filter size S_FBF.

2. From each string field, extract the q-grams for the selected Q value.

3. Encode the extracted q-grams in S_FBF-sized Bloom filters using fragmentation components.

4. Combine the Bloom filters produced for each field into a larger one, with a predetermined order of concatenation.

FBF encoding comes in two variants, FBF/static and FBF/dynamic. The first requires the definition of Q, K and S_FBF in order to encode the records, while the second, for a given Q value, imposes an initial preconditioning step that calculates the average number of q-grams g, from which the appropriate Bloom filter size S_FBF is derived. The basic idea of CLK encoding [145] is the use of a single large Bloom filter of size S to encode all fields of a record, using the q-grams produced from each field for a selected Q value and K hash functions. The encoding steps are:

1. Extract the q-grams from each field for the selected Q value.

2. Take the union of all produced q-gram sets of the values to be encoded.

3. Place the unioned q-grams in an S-sized Bloom filter using K hash functions.

Since CLK encoding places common q-grams from different fields at the same K locations in the Bloom filter, it is difficult for an attacker to recover either the encoding parameters or the original field values. On the other hand, this peculiarity of CLK encoding, i.e. q-grams shared between fields, can lead to incorrect results in the similarity check. For example, the names “James Johnson” and “John Jameson”, while dissimilar, may be judged similar by a similarity check over their CLK encodings. The RBF (Record-level Bloom Filter) encoding is based on FBF and attempts to enhance privacy protection in the PPRL process by introducing additional parameters and information into the encoding steps [42], [41]. Initially, it encodes the values of the fields in separate Bloom filters and subsequently draws a random set of bits from each one so as to compose a larger Bloom filter. Finally, it applies a random bit permutation to the larger Bloom filter, with the RBF encoding being the result of this rearrangement. With regard to the number of bits to be selected for the encoding of each field, we consider two ways of calculating it, uniform and weighted. Uniform selection draws an equal, or approximately equal, number of bits S_f from the FBF encoding of each field. The weighted way selects more or fewer bits for some of the fields; more specifically, in [42], [41], the weighted choice is based on the importance of each field in the linkage process. In order to discover the significance of the fields, the probabilities m and u of the Fellegi-Sunter probability model are used: the agreement and disagreement weights as well as their range are calculated, and the normalized percentage of the range of each field is derived. In this way, each field contributes a percentage w_i to the final Bloom filter. The size of the final Bloom filter, S_RBF, is derived from the w_i percentage that maximizes that size.
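As an illustration of the CLK steps above, the sketch below unions the q-grams of all fields into a single S-bit Bloom filter. It is self-contained, with illustrative parameter values assumed; it is a sketch of the technique, not the exact implementation used in the experiments.

import hmac
import hashlib

def clk_encode(record_fields, q=2, size_s=4096, k=15,
               key1=b"secret-key-1", key2=b"secret-key-2"):
    """CLK sketch: hash the union of all fields' q-grams into a single
    S-bit Bloom filter with K double-hashed positions per q-gram."""
    grams = set()
    for value in record_fields:                 # steps 1-2: extract and union
        padded = "_" * (q - 1) + value.lower() + "_" * (q - 1)
        grams |= {padded[i:i + q] for i in range(len(padded) - q + 1)}
    bits = [0] * size_s                         # step 3: one S-sized filter
    for gram in grams:
        h1 = int.from_bytes(hmac.new(key1, gram.encode(), hashlib.sha1).digest(), "big")
        h2 = int.from_bytes(hmac.new(key2, gram.encode(), hashlib.md5).digest(), "big")
        for i in range(k):
            bits[(h1 + i * h2) % size_s] = 1
    return bits

# "James Johnson" and "John Jameson" share most of their q-grams, so their
# CLK encodings overlap heavily -- the false-similarity case noted above.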

3.2.2 Private Indexing

The goal of indexing in PPRL is to substantially reduce the pairs of encoded records to be tested through similarity checks. In this setting, the third member (Carol) has little information about the data encoding of Alice and Bob. In this direction, we discuss HLSH indexing. HLSH (Hamming Locality Sensitive Hashing) indexing [42],[92] is used for partitioning private records encoded in binary form of length S. Let T_l, l = 1, ..., L, be a set of L independent hash tables consisting of dynamic sets of key-values. Each hash table T_l uses a set of K hash functions h_l^k, each of which returns the value of a randomly selected bit of the binary encoded rows in the table. The values of the K functions form a key for the encoded records and are gathered from Alice's and Bob's encoded sets A′ and B′. Entries from the two sets are stacked under the same key of a hash table T_l, thus suggesting a possibly identical pair if they match in the K bits. The values of the Id fields of the encoded records can subsequently be used to finally form the possibly identical pairs. Let the encoded records r_A ∈ A′ and r_B ∈ B′ consist of an Id and the fields Bf_A and Bf_B respectively. In addition, let θ be a selected value serving as a limit for the Hamming metric, calculated as d_H = |Bf_A ⊕ Bf_B|. We consider a family H of hash functions having the following property:

if d_H ≤ θ then Pr[h_l^k(Bf_A) = h_l^k(Bf_B)] ≥ p_θ    (3.2)

k = 1, 2, ..., K,  l = 1, 2, ..., L,  p_θ = 1 − θ/S    (3.3)

A suitable value for the number of hash functions K can be computed empirically,

as the accuracy of the method is mainly based on the number of tables, L_opt. Generally, this value should create enough buckets so that the number of linked lists for the pairs of records stays low; for larger values, more identical entries appear among the pairs of records. The formation of a pair of identifiers or encoded entries {r_A, r_B} during HLSH in one of the T_l tables is called a collision. The method is redundant, so a pair can occur in C = 1, ..., L_opt hash tables. A pair {r_A, r_B} with C = L_opt collisions is with high probability similar, and intuitively one can argue that as the number of collisions increases, it becomes more likely that the records are the same.

Fig. 3.1 HLSH under FPS
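A compact single-machine sketch of HLSH bucketing with frequent-pair counting may clarify the scheme. The record encodings are assumed to be Python dicts mapping record ids to bit lists (Alice's A′ and Bob's B′), and the parameter defaults are illustrative.

import random
from collections import defaultdict
from itertools import product

def hlsh_candidate_pairs(enc_a, enc_b, s, k=30, l=10, c_f=4, seed=1):
    """HLSH sketch: L hash tables, each keyed by K randomly chosen bit
    positions of the S-bit encodings; pairs colliding in >= c_f tables
    become candidates for the expensive similarity comparison (FPS)."""
    rng = random.Random(seed)
    collisions = defaultdict(int)              # (id_a, id_b) -> #collisions C
    for _ in range(l):
        positions = rng.sample(range(s), k)    # the K functions h_l^k
        table = defaultdict(lambda: ([], []))
        for rid, bits in enc_a.items():
            table[tuple(bits[p] for p in positions)][0].append(rid)
        for rid, bits in enc_b.items():
            table[tuple(bits[p] for p in positions)][1].append(rid)
        for ids_a, ids_b in table.values():    # same bucket => collision
            for pair in product(ids_a, ids_b):
                collisions[pair] += 1
    return [pair for pair, c in collisions.items() if c >= c_f]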

3.3 MapReduce Framework

HLSH combined with the Frequent Pair Scheme (FPS) [93] can lead to fast and efficient record linkage by checking the similarity only of frequently colliding pairs. We present four MapReduce implementations of the HLSH/FPS method for different sizes of the encoded record sets of Alice (A′) and Bob (B′), of the hash tables T_l, and of the set of candidate IDs R_Ids. We consider that set B′ is smaller than set A′, so it is chosen for the initial creation of the hash tables T_l (Figure 3.1). The use of HLSH/FPS allows the implementation of an effective system with a relatively low memory footprint. Our investigation focuses on memory saving and suggests four different versions of the HLSH methodology, namely v0, v1, v2 and v3. Each version is based on assumptions about the sizes of the problem's data structures and progressively “transfers” these structures from the slow disk to the faster memory of the Mapper/Reducer. We assume that every Mapper or Reducer in a MapReduce task has a fixed memory limit m_task that can be committed by YARN. Each of the four versions consists of 2 or

3 different MapReduce jobs, each of which in turn consists of a number of tasks (Mappers or Reducers) depending on the HDFS size of the problem and on user settings. Version v0 is characterized by memory saving when performing the job, but is very expensive in disk use, and is especially suitable for Apache YARN environments with low memory availability for the tasks of a MapReduce job. It assumes that Alice's and Bob's encoded records and the tables T_l are so large that they cannot be kept in the limited task memory. All pairs of identifiers from the HLSH process are formed and, subsequently, the ones that appear at least C_f times are stored again in HDFS, to be loaded into the memory of the last job, which undertakes the linkage of the proposed IDs based on the identifiers. This approach, in addition to requiring multiple MapReduce tasks, can be considered the most naive, as it materializes all pairs of identifiers that can be derived from the HLSH process. On the other hand, it is the version that uses the least memory in Mapper/Reducer tasks according to the experimental results. Version v1 allows more relaxed conditions for the committed memory of the tasks to be performed. We assume that the set of T_l tables fits into the memory of each task in a MapReduce job. With this important information in mind, we can perform HLSH/FPS by storing exclusively the frequently colliding pairs in HDFS. In the last two versions, we also assume that the records of the smaller set B′ can be stored as a whole in the m_task memory of the Mappers and Reducers. In both versions, the first job handles the creation and storage in HDFS of the hash tables T_l of that record set. In the second job, the two versions differ in the use or non-use of the Reduce phase.
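The v0-versus-FPS distinction can be sketched as plain map and reduce functions. This is a single-machine analogue of the Hadoop jobs, not actual MapReduce API code; the bucket contents are assumed to arrive as (Alice-ids, Bob-ids) pairs.

from collections import defaultdict

def map_collisions(bucket):
    """Map phase (v0 style): emit ((id_a, id_b), 1) for every Alice x Bob
    co-occurrence in an HLSH bucket; v0 writes all of these to HDFS."""
    ids_a, ids_b = bucket
    for a in ids_a:
        for b in ids_b:
            yield (a, b), 1

def reduce_frequent_pairs(emitted, c_f):
    """Reduce phase: sum the collision counts per pair and keep only the
    frequent ones (the FPS criterion C >= C_f)."""
    counts = defaultdict(int)
    for pair, one in emitted:
        counts[pair] += one
    return {pair for pair, c in counts.items() if c >= c_f}

# Single-machine analogue of the shuffle: buckets from all L tables feed
# the reducer keyed by pair.
buckets = [(["a1", "a2"], ["b1"]), (["a1"], ["b1", "b2"]), (["a1"], ["b1"])]
emitted = (kv for b in buckets for kv in map_collisions(b))
print(reduce_frequent_pairs(emitted, c_f=3))    # {('a1', 'b1')}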

3.4 Performance Evaluation

The evaluation of the four schemes is conducted by considering CLK encoding for the PPRL process with S = 4096, under the following settings for the parameters δ, C_f, L_Cf and K. Here, δ is the confidence parameter defining the likelihood that pairs which are actually the same are not matched in the tables; this value is usually low, indicatively δ = 0.01.

Fig. 3.2 PPRL evaluation

δ         C_f   L_Cf   K
0.001     4     52     30
0.0001    6     74     30
0.00001   7     91     30
0.000001  9     114    30

In the first panel of Figure 3.2, the simulation results of the four versions of HLSH/FPS are shown. The fastest in all cases is v3, while v0 has the largest footprint on disk, since it writes to HDFS all pairs of identifiers derived from HLSH. We also observe that for the highest value of δ the total memory footprint of the jobs is also large; but as δ decreases, and the HLSH parameters change accordingly, memory consumption decreases even as the number of candidate records for comparison increases. Regarding the other versions, as the philosophy of the FPS strategy is utilized, the execution times are only slightly affected by the change. The disk footprint of v1 is slightly affected, and the same holds for v2 and v3.

Fig. 3.3 PPRL evaluation

However, it is evident that as L_Cf increases, while the number of candidate records to be compared is reduced, the memory footprint of all versions except v0 grows faster. As the number of records of B′ remains constant, this increase corresponds to the growth of the hash tables T_l. We then repeat the same procedure for the two largest record sets, without the v0 version. In this case, we show measurements for δ = 0.0001, C_f = 6, L_Cf = 74 and K = 30. The second panel of Figure 3.3 shows the superiority of versions v3 and v2 over v1 on all metrics.

3.5 Conclusions

The four versions presented give the Carol member the capability to choose between slow but memory-economical and fast but memory-demanding MapReduce executions. They also show the need for techniques that help users decide (in terms of resource use, relative costs and problem size) which of the four versions is appropriate. The experimental evaluation shows that versions v1, v2 and v3, which make progressively smarter use of memory, have the advantage of quick execution of HLSH/FPS compared to v0; but as the number of records grows, the size of the data that must fit into memory also increases. With the prospect of a slow but complete HLSH/FPS process, v0 may be the best proposal for Hadoop environments with limited memory resources.

CHAPTER 4

Security and Privacy Solutions associated with NoSQL Data Stores

4.1 Introduction

The advances in cloud computing technology and distributed web applications, along with the ever-increasing volume of data to be stored and processed, have rendered necessary the adoption of non-relational databases, known as NoSQL or “Not only SQL” [124]. It is widely known that traditional SQL databases are not able to cope with Big Data [147], and NoSQL systems are nowadays experiencing an increase in popularity [114]. In recent years, many NoSQL databases have made their appearance; Cassandra and MongoDB are two popular ones, to name a few. Some useful features of NoSQL databases are high availability, scalability and better performance, as well as the ability to store and process large-scale semi-structured and/or unstructured data faster than traditional RDBMSs [114], [147]. However, due to the ever-increasing use of NoSQL databases, a significant amount of sensitive data is exposed to a number of security vulnerabilities, threats and risks. Lack of encryption support and poor authentication between servers and clients are some of the leading security issues in NoSQL databases. It should also be noted that only simple authorization is provided, without support for role-based access control (RBAC), and there is no protection against injections and denial-of-service attacks [62]. Brewer in [20] made a conjecture about the trade-offs in the development of a distributed database system, thus introducing the CAP (Consistency, Availability, Partition tolerance) properties. A formal version of Brewer's conjecture was officially published as the CAP theorem in [55]. Specifically, the CAP theorem states that no shared data system can provide more than two of the three properties, namely consistency, availability and partition tolerance, at the same time. Regarding organizations, Amazon developed the Dynamo technology [35], whereas

Google produced the distributed storage system Bigtable [24]. These particular technologies have inspired many NoSQL deployments in companies like Facebook or Twitter. Modern companies deal with data that are not relational and need databases superior to the traditional ones, which encounter scalability and availability problems because of the data size. There are already several authorization models in relational databases, where views are usually utilized; in this way, SQL queries are used to display a specific state of a specified part of the database [13]. Some NoSQL databases managed as Big Data stores use new authorization models, specifically designed for their structure, speed and huge amounts of data. These models include key-value, wide-column and document-oriented authorization. In addition, the storage and retrieval of records is achieved through a unique key for each record, providing a swift search [29]. In this chapter, we present security and privacy issues in NoSQL databases and further examine them in order to propose the most efficient security mechanisms and privacy solutions. More to the point, data protection and access control are some of the key security issues in NoSQL, while several security threats for NoSQL databases are considered, such as the distributed environment, authentication, fine-grained authorization and the protection of data at rest and in motion. The remainder of this chapter is organized as follows. Section 4.2 presents a survey of existing related work concerning mechanisms to overcome security issues. Section 4.3 provides a comparative study between relational and NoSQL databases, while in Section 4.4 our security and privacy-preserving mechanisms are proposed. Finally, Section 4.5 summarizes the chapter.

4.2 Related Work

Many early papers addressing the relationship between relational and NoSQL databases gave an overview of NoSQL databases, their types and their characteristics; they were enthusiastic about NoSQL and how it challenged the dominance of SQL [126], [148]. In [14], structured and non-structured databases were discussed, and it was explained how the use of NoSQL databases such as Cassandra improved system performance; in addition, such a database can scale the network without changing any hardware or altering the server infrastructure, improving network scalability with low-cost commodity hardware. In [80], a survey of relational databases is presented along with NoSQL features and shortcomings. These shortcomings and issues of NoSQL databases have also been mentioned in [104], where complexity, consistency and limited ecosystems are considered serious concerns. In [116], the authors state that the demand for relational databases will not go away anytime soon, as they will keep serving the line of applications that support business operations, whereas NoSQL databases will serve the large, public and content-centric applications. Another similar work is presented in [124], where an extensive analysis of security issues in NoSQL databases, such as Cassandra and MongoDB, is provided. Several solutions have also been proposed to improve privacy preservation in NoSQL databases. More specifically, in Arx, a proxy is employed in order to rewrite NoSQL queries at the trusted premises, while a back-end component, deployed at the untrusted premises, performs the computation over encrypted data [134]. In the BigSecret system, standard encryption is used for the protection of the stored data, while the indexes are encoded using special techniques to allow comparisons (pseudo-random functions) and range queries (order-preserving partitioning) [132]. The authors in [186] employ searchable-encryption algorithms to build a privacy-preserving key-value store on top of the Redis database; in this approach, the values are protected with symmetric encryption, while the keys are secured with pseudo-random functions. In another solution, SafeRegions combines secret sharing and multiparty computation to perform secure NoSQL queries on three independent and untrusted HBase clusters, thus simultaneously providing secure computation over the stored values and security guarantees similar to standard encryption [135].

4.3 Comparison of Relational and NoSQL Databases

During the last decades, relational databases, sub-divided into groups known as tables, have been used with the aim of storing structured data. The units of data in each table are known as columns, and each unit of the group is known as a row; the columns of a relational database also have relationships amongst them. This situation has been changing over the last years due to the rise of large web applications, which output huge amounts of data that traditional relational databases cannot handle any more [34]. NoSQL databases are sometimes referred to as “Not only SQL” so as to emphasize that they may also support SQL-like query languages. Nowadays, NoSQL databases have more to offer than just solutions to scaling problems, providing many important advantages [34] like the following:

• The data representation is schema-less, and there is no need to define a certain structure from the beginning, since new fields can be added at run-time.

• Speed, as data can be processed in milliseconds instead of hundreds of milliseconds.

• The elasticity of the applications due to the scalability features that NoSQL databases offer.

• Reduced development time, as developers do not have to deal with complex SQL queries and difficult joins in order to collate the data from different tables into a new view.

Some of the differences between relational and NoSQL databases are listed in the following paragraphs.

4.3.1 Reliability of Transactions

The ACID (atomicity, consistency, isolation, durability) model is fully supported by the design of relational databases, providing high reliability of transactions, unlike NoSQL databases.

4.3.2 Scalability Issues and Cloud Support

The primary purpose of cloud technology is to provide services to end-users. NoSQL databases are fully compatible with cloud environment requirements, as they can analyze not only raw structured data but also semi-structured or unstructured data from different sources, since they are not bound to the ACID model. On the other hand, relational databases do not provide search over full content, and their characteristics are not designed for cloud use. The need for scalability is arguably one of the most significant problems of relational databases, as they rely on vertical scalability to upgrade performance. More specifically, this upgrade method requires the purchase of expensive equipment such as RAM, processors, SSD hard drives, etc., and in some cases this is not easily achieved due to each system's constraints. Moreover, relational databases do not support horizontal scaling through the addition of extra nodes and therefore cannot support demanding online applications with many users and distributed data. NoSQL databases, by contrast, support horizontal scaling, since they do not deal with relational data.

4.3.3 Complexity and Big Data Management

The complexity of NoSQL databases is lower than that of relational databases, as it is not necessary to create tables to record the data; instead, the data can be modeled around the intended query patterns. Also, developing a database structure on a relational database is always considered a complicated task compared to the abstract model of a NoSQL database, where data can be stored regardless of whether they are structured, unstructured or semi-structured. NoSQL databases have a valuable role in Big Data management, since they are well-suited for storing and retrieving data at high speed across distributed nodes, thus taking advantage of multi-core and GPU architectures. In relational databases, where accuracy is more important than speed, the data should be stored in tables' rows and columns, and scalability is always considered a big issue. For conventional applications with small datasets, they are the most reasonable choice, but splitting the data across different servers increases the difficulty, requiring complex SQL queries to join the data again.

4.3.4 Data Model

Sets in mathematics are the driving force behind relational databases; all the data are represented as mathematical n-ary relations, where an n-ary relation is a subset of the Cartesian product of n domains. The data are represented as tuples inside the database and are further grouped into relations. A relation (represented by a table) contains a set of tuples (represented by rows); the columns of the relation table correspond to a sequence of attributes, and the type of an attribute is identified by its domain, i.e., the set of values that have a common meaning. This data model is very specific and well organized, and the columns and rows are described by a well-defined schema. NoSQL databases can employ many modelling techniques, such as graphs, key-value stores and document data models. In terms of classification, NoSQL databases are named after their data model, but in some cases a NoSQL database system can be identified by two or more of the data models that represent its data. The NoSQL data model does not use the table as the storage structure of the data, and this is considered the main feature that distinguishes NoSQL from relational databases. Furthermore, it is schema-less and, as a result, can handle unstructured data like word, pdf, image and video files in a very efficient manner.

4.3.5 Data Warehouse and Crash Recovery

Regarding data warehousing, relational databases gather data from many sources, and the sheer size of the stored data results in big data problems; to name a few, performance degrades when utilizing OLAP (Online Analytical Processing), statistical processing or data mining. On the other hand, NoSQL databases are not designed with data warehouse applications in mind, because their designers focus on scalability, availability and high performance. Crash recovery is implemented in relational databases via the recovery manager, which is responsible for ensuring durability and transaction atomicity by using log files and the ARIES algorithm. Crash recovery in NoSQL databases instead depends on replication to recover from a crash.

4.3.6 Privacy and Security

Most relational databases do not provide any feature for embedding security in the database itself; as a result, developers have to impose security mechanisms directly in the middleware. Classic cryptographic mechanisms and encryption protocols, such as asymmetric key encryption schemes, digital signature schemes, zero-knowledge Proofs of Knowledge, as well as commitment schemes, which are based on Strong RSA (SRSA), bilinear maps [8], the discrete logarithm, and homomorphic encryption, fully homomorphic or not [1], have been widely considered for securing communication and ensuring data confidentiality in relational databases. Nonetheless, one of the most serious shortcomings of NoSQL databases is considered to be the fact that data files are not encrypted by default; such a process takes place in the application layer before sending data to the database server. Although there are solutions that provide encryption services, these lack the horizontal scaling and transparency required in the NoSQL environment. Furthermore, only a few NoSQL databases provide encryption mechanisms to protect user-related sensitive data. By default, in NoSQL databases the inter-node communication is not encrypted, and SSL (Secure Sockets Layer) client-node communication is not supported (as it is in relational databases), breaking network security [147]. Also, there is no integration of authentication or authorization mechanisms. Distributed environments increase the attack surface across several distributed nodes, and enforcing integrity constraints is much more complex in NoSQL databases. In general, only a few categories of NoSQL databases provide mechanisms that employ encryption techniques to protect data at rest.
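As a minimal sketch of such application-layer protection, values can be encrypted before they ever reach the database server. The sketch uses the Fernet recipe of the Python cryptography package; the kv_store dict merely stands in for a NoSQL client handle (an assumption for illustration), and real deployments would also need proper key management and, per [186], pseudo-random functions for the keys themselves.

from cryptography.fernet import Fernet

# Key management is the hard part in practice; here the key simply lives
# with the (trusted) application, never with the database server.
key = Fernet.generate_key()
cipher = Fernet(key)

kv_store = {}    # stand-in for a NoSQL client, e.g. a key-value store handle

def put(record_id: str, value: str):
    # Values are encrypted in the application layer, so the server and its
    # data files only ever see ciphertext.
    kv_store[record_id] = cipher.encrypt(value.encode())

def get(record_id: str) -> str:
    return cipher.decrypt(kv_store[record_id]).decode()

put("user:42", "sensitive address")
assert get("user:42") == "sensitive address"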

4.4 Proposed Security and Privacy Solutions

Below are our proposed security and privacy solutions for NoSQL data stores.

4.4.1 Pseudonyms-based Communication Network

In the context of this system, users can have access to multiple services by inserting their credentials only once, namely when they initially connect to the system. Such a system is called anonymous because users can be known only through their pseudonyms, and transactions carried out by the same user cannot be linked, as their identity is never disclosed. For this reason, it is considered the best means of user protection. Furthermore, it is based on two vital protocols, RSA and Diffie-Hellman. Its structure and operations extend Brands' credential system [19], and it consists of four parties: the users U, a central Identity Provider denoted as IP, the Service Providers SPs, and the organization for issuing and validating credentials. Users are entities that receive credentials and are known to the Service Providers only through their pseudonyms. The central Identity Provider creates its own public and secret key pair, denoted as (P, S) respectively, and uses its secret key to digitally sign its sensitive data. Each credential encodes m + 1 attributes, denoted as y_1, y_2, ..., y_m, t (where t is the credential issuing time). The IP decides on G_q, a finite cyclic group of prime order q, to which the random generators g_1, g_2, ..., g_m, g_{m+1}, h_0, involved in key generation, belong. Specifically,

S = (y_1, y_2, ..., y_m, t, s)    (4.1)

P = g_1^{y_1} g_2^{y_2} · · · g_m^{y_m} g_{m+1}^{t} h_0^{s},    (4.2)

where s ∈ Z_q is kept secret.

Under the Discrete Logarithm assumption in G_q, these keys are unique. The IP is responsible for the distribution of the digital pseudonyms p_1, p_2, ..., p_m to any user. An organization can issue a credential to a pseudonym, and the corresponding user can prove its ownership to another organization (which knows them by a different pseudonym) by just demonstrating ownership of the credential. Additionally, the Credential Authority CA prevents the sharing of credentials or pseudonyms and guarantees that users who enter the system have a public and secret key that makes them unique in the system. Another entity in the system is the Verifier V, whose role is to certify the validity of the user credentials and to communicate with either the Issuing Authority or the Credential Authority in order to report that a user is not the owner of the credential they are presenting. In terms of a digital credential, a user transmits the public key and the CA's digital signature derived from a Proof of Knowledge, through which they prove that they know the secret key and the attributes of the digital credential that satisfy the particular attribute property they are revealing. Each pseudonym and credential belongs to a well-defined user. In more detail, it is impossible for different users to collaborate and show some of their credentials to a Service Provider, or to obtain a credential that a single user could not obtain on their own (coherent credentials). As organizations are autonomous and separable entities, they can select their public and secret keys independently of the other entities, so as to ensure the security of these keys and facilitate the key management system. The pseudonym system can protect user privacy and provide security, as in such a system an organization cannot find out anything about a user other than the ownership of a set of credentials. Specifically, two pseudonyms that belong to the same user cannot be linked (unlinkability) or identified, as in Brands' system, except under specific conditions. In order to be efficient, any communication in the system involves as few entities as possible along with the minimum amount of information. If a user holds a credential, it can be shown multiple times without the need to reissue (and consequently re-sign) it. When a user accesses a service, they are validated by proving that they know the secret key of their pseudonym without revealing it, thus preventing pseudonym repetition. Also, for each pseudonym that a Service Provider associates with a user, it requires the user to unveil a different encoded random number of their pseudonym each time, thus ensuring the unconditional unlinkability of their pseudonyms. Although the Identity Provider blindly encodes the random numbers in all of a user's pseudonyms, which are uniquely related to them, if a user abuses the service, the SP can blacklist and reveal those numbers; it can subsequently revoke their pseudonyms globally and abolish their access to any of the services they previously had. Finally, under the Discrete Logarithm assumption, users can conclusively prove that their encoded numbers do not belong to the SP's blacklist, using the blacklist as input to a zero-knowledge proof and without revealing any information about their identity. Hence, this technique does not impact users' privacy and does not give additional power to the SP and the IP.
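A toy sketch of the key construction of Eqs. (4.1)-(4.2) may be helpful. The group parameters below are illustratively small and are an assumption (p = 2q + 1 with q prime, all bases in the order-q subgroup of quadratic residues mod p); real deployments use cryptographically large primes.

import random

p, q = 23, 11     # toy modulus and prime subgroup order (assumption)
g = [4, 9]        # attribute bases g_1, ..., g_m (here m = 2)
g_t = 3           # base g_{m+1} for the issuing time t
h0 = 2            # base h_0 for the blinding exponent

def issue_keys(attrs, t):
    """Secret key S = (y_1, ..., y_m, t, s) and public key
    P = g_1^{y_1} ... g_m^{y_m} g_{m+1}^t h_0^s mod p (Eqs. 4.1-4.2)."""
    s = random.randrange(1, q)            # blinding exponent, kept secret
    P = pow(g_t, t, p) * pow(h0, s, p) % p
    for base, y in zip(g, attrs):
        P = P * pow(base, y, p) % p
    return (list(attrs) + [t, s]), P

S_key, P_key = issue_keys(attrs=[5, 7], t=3)

The random s makes two keys over the same attributes unlinkable, which is the property the pseudonym mechanisms above rely on.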

4.4.2 Monitoring, Filtering and Blocking

As mentioned above, the available applications designed to monitor NoSQL databases cannot detect and then disable malicious jobs and queries. The Kerberos central authentication system can be easily bypassed via advanced scripts and, in general, the level of monitoring is limited to data processing, mainly at the API level [83]. In a cloud environment, no information regarding the communication of the nodes in the cluster, user connection details or data-altering operations (even editing or deleting) is recorded. In general, since there are no log files, identifying incidents of data breach or malicious data loss in the cluster is a challenging problem [79]. Real-time security mechanisms do exist in big data technologies, resulting in high-speed data analysis; thus, anomaly detection can be implemented in real time and the recorded security analytics can be frequently updated [62]. Some monitoring tools are available, but they are limited to controlling user requests at the API level. In general, neither are the characteristics of a malicious query in big data technologies defined, nor do complete monitoring tools to disable such malicious queries exist. One technique could be an initial authentication via Kerberos followed by a second-level authentication for accessing MapReduce [109].

4.5 Conclusions

In this chapter, we have discussed major security concerns regarding NoSQL databases. Data protection and access control can be considered some of the key security issues in NoSQL technology. The reasons for security threats in various NoSQL databases have also been thoroughly discussed in the current work, such as the privacy of user data, the distributed environment, authentication, fine-grained authorization and access control, safeguarding integrity, and the protection of data at rest and in motion. In NoSQL databases, Kerberos is used to authenticate the clients and data nodes, and in order to ensure fine-grained authorization, data are grouped according to their security level. Cassandra uses the TDE (Transparent Data Encryption) technique to protect data at rest, whereas in MongoDB administrators must implement controls ensuring that applications and users have access only to the data they need in order to maintain a secure deployment. Various techniques for mitigating attacks on NoSQL databases have also been discussed, along with the proposed security and privacy solutions for NoSQL databases.

CHAPTER 5

Trajectory Clustering and k-NN for Robust Privacy Preserving Spatio-Temporal Databases

5.1 Introduction

Nowadays, the rapid development of Internet-of-Things and Radio-Frequency Identification sensor systems [174], in combination with the evolution of satellite and wireless communication technologies, enables the tracking of moving objects such as vehicles, animals and people [157]. Through mobility tracking we collect a huge amount of data, which gives us considerable knowledge. Moving objects follow a continuous trajectory; however, this is described by a set of discrete points acquired by sampling at a specific rate with time-stamps over a time period [187]. A simple description of a trajectory is as a finite sequence of pairs of locations with time-stamps, which suits a GIS (Geographical Information System) database [157]. In the real world, people's activities constitute spatio-temporal trajectories, recorded either passively or actively, which can be used in human behavior analysis. Examples of active recording are the check-ins of a user in a location-based social network, or a series of tagged photos in Flickr, as each photo has a location tag (where) and a time-stamp (when). An example of passive recording is a credit card's transactions, as each transaction corresponds to a time-stamp and the id of its location. Modern vehicles such as taxis, buses, vessels and aircraft are equipped with a GPS (Global Positioning System) device, which enables the reporting of time-stamped locations [157]. Such real-life examples have made moving-object and trajectory data mining important. Animal scientists and biologists can study moving animals' trajectories in order to understand their migratory traces, behavior and/or living conditions. Meteorologists, environmentalists, climatologists and oceanographers collect the trajectory data of natural phenomena, as these capture environmental and climate changes, in order to forecast weather, manage natural disasters (hurricanes) and protect the environment. Trajectory data can also be used in law enforcement (e.g., video surveillance) and in traffic analysis to improve transportation networks. More to the point, the evolution of technology in the domain of mobile devices, in combination with positioning capabilities (e.g., GPS), paved the way to location-based applications such as Facebook and Twitter. Indeed, social media networking has thoroughly changed people's habits in every aspect of their life, from personal and social to professional. A GPS sensor allows users to periodically transmit their location to a Location-Based Service provider (active recording) in order to retrieve information about proximate points. However, queries based on location may reveal sensitive information about an individual [118]. Therefore, the privacy and anonymity preservation problem for mobile objects remains an important issue, which concerns us in the context of this work. The above issues, along with [75], motivated the following research. Specifically, we apply k-NN queries on the trajectory data points of mobile users, with or without clustering. In both methods, mobile users are camouflaged by their k nearest neighbors, which constitute their k-anonymity set. In the case of clustering, the trajectory points of all users at each time-stamp are grouped with K-Means (on-line clustering), and k-NN queries are applied to find the indexes of the k nearest neighbors of each user inside the cluster it belongs to. Irrespective of the method used, if the k nearest neighbors' indexes remain the same, or vary at a low rate over time, it is difficult for an adversary to discover a mobile user based on historical data. We experiment on how this set changes, with and without clustering, for different combinations of the dimensions (x, y, θ, v), which is the main contribution of this work, and we provide an analysis of the effect of the dimensions on the k-anonymity method. We conclude that when a data set contains a large number of attributes which are open to inference attacks, we are faced with a choice of either completely suppressing most of the data or losing the desired level of anonymity. The rest of this chapter is organized as follows: Section 5.2 describes in detail (a) the clustering and classification problems along with the algorithms used, (b) the system architecture, (c) the problem definition, (d) the system model and the adopted methods, (e) the k-anonymity privacy preservation and, finally, the experimental environment and the source of the data sets. Section 5.3 presents previous related work in relation to our approach and records future directions of this work. Finally, Section 5.4 presents the graphical results gathered from the experiments and the conclusions of their evaluation in terms of the studied problem.

5.2 Materials and Methods

5.2.1 Clustering

Clustering is an iterative procedure which forms groups of similar data; it is primarily concerned with distance measures and is a fundamental method in data mining. Clustering methods are classified as partition-based, hierarchy-based, density-based, grid-based and model-based. Moving-object activity pattern analysis (i.e., similar motion patterns) and activity prediction are typical application scenarios of trajectory clustering [187]. In our case, clustering is used to organize moving objects into groups so that the members of a group are similar, with high compactness, according to a similarity criterion based on spatio-temporal data. Specifically, for a group of mobile objects, we cluster the attributes of their current trajectory points (spatial coordinates, angle, velocity) at specific time-stamps; in other words, we apply on-line clustering.

Algorithm 1: K-Means
1: Input: number of clusters K and training data P
2: Output: a set of K clusters
3: Method: arbitrarily choose K objects from P as the initial cluster centers
4: repeat
5:   assign each object to the cluster to which it is most similar, based on the mean value of the objects in the cluster
6:   update the cluster means, i.e., calculate the mean value of the objects of each cluster
7: until no change

K-Means (Algorithm 1 [194]), which is used in this work, belongs to the partitioning clustering methods and is popular due to its simplicity. It is based on squared-error minimization, and its main advantage is that, in each iteration, only the distances between each point and the K cluster centers are computed. Its time complexity is O(NKt), where N, K and t are the number of data objects, clusters and iterations respectively. However, K-Means clustering has some weak points: the number of clusters K must be known in advance, and its computational cost grows with the number of data observations, clusters and iterations. K-Means and other clustering algorithms use the compactness criterion to assign clusters, which is what concerns us, in contrast with the spectral clustering of [70], which makes use of the spectrum (or eigenvalues) of the similarity matrix of the data and examines the connectivity of the data. K-Means is expected to be a good option for exclusive clustering (which concerns our study), as opposed to Fuzzy C-Means, which assigns each mobile object to different clusters with varying degrees of membership and may therefore give good results for overlapping clusters; the latter also has a much higher time complexity than K-Means [23].
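A minimal NumPy sketch of Algorithm 1, as applied here to the per-time-stamp trajectory attributes, could look as follows; the convergence test and the seeding are illustrative choices, not prescribed by the text.

import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Algorithm 1: Lloyd-style K-Means on an (N, d) array of trajectory
    attributes (e.g. x, y, angle, velocity at one time-stamp)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # assignment step: nearest center per point (O(N*K) distances)
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: recompute each cluster mean (empty clusters kept as-is)
        new_centers = np.array([points[labels == c].mean(axis=0)
                                if np.any(labels == c) else centers[c]
                                for c in range(k)])
        if np.allclose(new_centers, centers):
            break                        # "until no change"
        centers = new_centers
    return labels, centers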

Algorithm 2: k-Nearest Neighbor
1: Input: X: training data set, Cl_X: class labels of X, p: testing point to classify
2: Method:
3: Compute the distances d(X_i, p) to every training point X_i and keep the indexes I of the k smallest distances
4: Select the k labels Cl_X(I)
5: return I and the majority class cl_b in Cl_X(I)

5.2.2 Classification

Classification is a supervised machine learning approach and concerns the assignment of class labels to a new sample on the basis of a training data set whose samples' class membership is known. The principle behind the nearest neighbor method is to find a number of training samples closest in distance to the new point and to predict its class label from these. The number k of nearest neighbors can be a user-defined constant, or vary based on the local density of points (radius-based neighbor learning); the most common distance measure is the Euclidean one. k-Nearest Neighbor (Algorithm 2 [159]) is a non-parametric method and by far the simplest of all machine learning algorithms. Used on its own, k-NN has a high computational cost: classifying a new data point requires calculating the distances between it and all the points of the training data set in order to choose the k nearest neighbors. To overcome this issue, we combine it with a clustering method, namely K-Means, which efficiently reduces the size of the training sets and hence the computation time of k-NN as well. It is worth mentioning that applying k-NN inside a cluster makes no sense if the cluster size is smaller than the number k of nearest neighbors we are looking for inside it; hence, the appropriate combination of the parameters K (since K influences the clusters' size) and k is crucial. Despite the aforementioned advantages, k-NN gives each labeled sample the same importance during classification, in contrast with what a fuzzy classifier considers [64]. Finally, in a recent work [48], the authors describe and suggest an efficient method in which kernel fuzzy clustering is combined with the harmony search algorithm for scheme classification.
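A corresponding sketch of Algorithm 2 follows, restricted to the query's K-Means cluster as described above; the points array and labels vector follow the conventions of the previous listing.

import numpy as np

def knn_indices(points, query_idx, k):
    """Algorithm 2 core: the indexes I of the k nearest neighbours of a point."""
    d = np.linalg.norm(points - points[query_idx], axis=1)
    d[query_idx] = np.inf               # a user is not its own neighbour
    return np.argsort(d)[:k]

def knn_in_cluster(points, labels, query_idx, k):
    """k-NN restricted to the query's K-Means cluster; only meaningful when
    the cluster holds more than k members, as noted in the text above."""
    members = np.flatnonzero(labels == labels[query_idx])
    local = np.flatnonzero(members == query_idx)[0]
    d = np.linalg.norm(points[members] - points[query_idx], axis=1)
    d[local] = np.inf
    return members[np.argsort(d)[:k]]

Restricting the distance computations to the cluster members is exactly where the combination saves work compared to plain k-NN over all N points.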

5.2.3 Useful Definitions

We consider points in a d-dimensional space D. Given two points a and b, we define as dist(a, b) the distance between a and b in D. In this chapter, we utilize the Euclidean distance metric, which is defined as

dist(a, b) = ( Σ_{i=1}^{d} ( a[i] − b[i] )² )^{1/2}

where a[i], b[i] denote the values of a and b along dimension i in D.

Definition 5.2.1. k-NN: Given a point b, a data set X and an integer k, the k nearest neighbors of b from X, denoted as k-NN(b, X), is a set of k points from X such that ∀p ∈ k-NN(b, X) and ∀q ∈ X \ k-NN(b, X), dist(p, b) < dist(q, b).

Definition 5.2.2. k-NN Classification: Given a point b, a training data set X and a set of classes Cl_X to which the points of X belong, the classification process produces a pair (b, cl_b), where cl_b is the majority class to which b belongs.

Definition 5.2.3. Clustering: Given a finite data set P = {a_1, a_2, ..., a_N} in R^d and a number of clusters K, the clustering procedure produces K partitions of P such that, among all partitions (clusters) C_1, C_2, ..., C_K of P, it finds the one that minimizes

arg min_{C_1,...,C_K} Σ_{c=1}^{K} Σ_{a ∈ C_c} ‖ a − (1/|C_c|) Σ_{a_j ∈ C_c} a_j ‖²

where |C_c| is the number of points in cluster C_c.

5.2.4 System Architecture

In this section we briefly describe how the system used to extract the desired data sets operates and produces the trajectory data points. The system architecture is based on the SMaRT (Spatiotemporal Mysql ReTrieval) framework [53] and works as an expansion sub-system that produces new data sets out of sample points stored in a relational database. It exploits the Google Maps API in order to define trajectories, between two randomly chosen points, that follow road paths over a geographical area of interest. To support this functionality, a class was created by extending the existing framework and providing the corresponding user interface. The data flow of this sub-system, as shown in Fig. 5.1, follows a three-stage path. Before the beginning of the process there is an initialization phase, during which the database is populated with manually pre-created trajectories from a comma-separated (csv) file through the importing class implemented in the SMaRT framework. The trajectory objects T_i take the form of sequential spatio-temporal points P_n and are stored in the database as tuples of latitude, longitude, time-stamp and objid, where objid is a unique identification integer of the moving object that participates in our analysis and takes values from 1 (the first mobile object) to N (the N-th mobile object). Moreover, the latitude and longitude dimensions are transformed to the equivalent Cartesian coordinates (x, y) through the Mercator transformation, in order to efficiently calculate measures such as the distance between points and the velocity and angle vectors. After the initialization phase is over, a repeated procedure takes place, as described below.

Fig. 5.1 Data flow diagram

At first, two points with a significant Euclidean distance between them are randomly chosen from the relational database. Because of the Mercator transformation, every distance is measured in meters; thus the distance between the two points is given by √(Δx² + Δy²) and should be over a threshold of 10 m. This threshold was chosen heuristically and eliminates the problem of zero velocity and angle calculations. Then an API call (see Fig. 5.2) is issued against the Google Maps Directions API service, given the points along with route settings information. This information may specify the way in which the target object moves (pedestrian, car or bicycle), whether it should follow toll roads or not and, most helpfully, whether the response should contain alternative routing paths besides the first proposed one. Through this last attribute, the trajectories can be multiplied by obtaining not only the proposed routing path but also the alternative ones. This method also mitigates the restriction that the Google service imposes, which limits the service calls to 10 per second, by introducing a slight delay in the insertion process. Secondly, the API response, which contains the routing paths following the road network that connects the two reference points, provides the information to the class that constructs the trajectory object. For each point of the trajectory, several calculations take place, such as acquiring the time-stamp t_n of the moving object at this specific point, based on the user-defined velocity scale v_n and the Euclidean distance d_n from the previous point. Also, the angle vector g_n is measured using the coordinates (x_n, y_n) of the specific point and the next point (x_{n+1}, y_{n+1}) of the trajectory. Because of the randomness of the chosen reference points of the API request, there is a difference between the number of points that the user provided as the trajectory multitude and the number of points in the API response routing path. In case some routing path R_n has more points than the user-defined upper limit, it is truncated to that limit and the endpoint of the final trajectory object T_n becomes the last point of the array. In the opposite situation, the trajectory is discarded and the current thread resets its state in order to make a new request. This procedure ends with the final trajectory object T_n being stored in the database. For the storage procedure, a class from the SMaRT framework is used, which checks all the rules that the database instance implies. When the user-defined number of trajectory objects is reached, the whole process stops with a termination message containing the number of trajectories that were stored in the database, the time spent on the procedure, the number of tuples inserted in the database and the space used for those trajectory objects, including the overhead of indexing the appropriate fields in the database.

Fig. 5.2 API Request Diagram

In the event of a failure at any of the stages described above, all trajectory information is discarded and the corresponding process is restarted. Thus, in order to overcome subsequent failures, a parallel methodology is deployed, where a user-defined number of simultaneous calls is instantiated at the beginning of each process.

After each call response, a new trajectory is stored in the database and given a new object ID, denoted as objid in the following; the calls that failed are restarted with a different set of points. Finally, in terms of space consumption, each point of a trajectory object costs 17 bytes (2 × 4 bytes for the representation of the coordinates (x_n, y_n), 5 bytes for the time-stamp and 4 bytes for the objid) when it is stored as a simple point with the three dimensions stored separately as (x_n, y_n, t_n) plus the objid. When it is stored as a single spatial point with extra fields for the time-stamp and the objid, the cost goes up to 34 bytes (25 bytes for the spatial representation of (x_n, y_n), 5 bytes for the time-stamp and 4 bytes for the objid).

5.2.5 Problem Definition

In the context of this work we study the problem of Privacy Preserving considering spatio-temporal databases of N records with d attributes each one. The spatio-temporal data is the location data of a number of mobile users along with the time-stamp of each position as shown in table 5.1. Through SMaRT system we have in our disposal tra- jectory data which give us information about angle direction and velocity amplitude, as well. Therefore, for each record, i.e. mobile user, we know the values of four attributes. We employ a popular anonymization approach, called k-anonymity, such that any orga- nization or adversary can deduce information, about the mobile user identity, observing its location attributes. Within k-anonymity attributes can be suppressed, namely their values can be replaced by ”*”, or generalized [176] until each row is identical with at least k − 1 other rows. At this point the database is said to be k-anonymous and thus, prevents database linkages. In our case, we select to anonymize location data attributes by employing a classification method which enable us to construct the k- anonymity set of each user per time-stamp. The rationale behind anonymity preserving lies in the preserving of the k nearest neighbors from one position to another. To this end, we investigate the problem considering two approaches. In the first approach, the anonymization is handled as a clustering problem, in which the d-dimensional space of attributes is partitioned into homogeneous groups so that each group contains at least k records, namely the minimum number of records in a cluster, to satisfy k-anonymity. To achieve it, as a first approach which will be elaborated in the future, we adopt the K-Means clustering method. The k-anonymity set of each user is formed based on the

In the second approach, the anonymity set is formed again by the k nearest neighbors indexes, but without considering a partitioning of the d-dimensional attribute space. The maximum number of clusters is K = ⌊N/k⌋, where N is the total number of records in the data set and k ≪ N is the anonymity parameter for k-anonymization.

Table 5.1 An example of spatio-temporal database for d = 4

objid  time-stamp           timeToNextPoint  x      y      angle  velocity
1      2013-03-09 10:00:01  0                21082  56436  1.23   0
1      2013-03-09 10:00:04  3                21099  56432  1.16   4.5
1      2013-03-09 10:00:11  7                21221  56484  1.51   14.6
1      2013-03-09 10:00:19  8                21331  56524  1.95   11.3
1      2013-03-09 10:00:21  2                21402  56495  0      29.5
2      2013-03-09 10:00:03  0                35587  59829  -2.76  0
2      2013-03-09 10:00:08  5                35568  59782  2.94   7.8
2      2013-03-09 10:00:16  8                35580  59723  -2.07  5.8
2      2013-03-09 10:00:25  9                35530  59668  -1.52  6.4
2      2013-03-09 10:00:34  9                35476  59671  -2.85  4.6

For L trajectory points, which correspond to L time-stamps, we compute for each mobile user i its k nearest neighbors indexes and record them in a vector of the form

knns_{it} = [id_{it1}, id_{it2}, ..., id_{itk}]  for t = 1, 2, ..., L.

An example of such sets for N mobile users is shown in Table 5.2. For each user we measure how many of the k nearest neighbors remained the same from one position to another.

Table 5.2 k-anonymity sets for N mobile users in L = 5 time-stamps

objid Time Instant knns indexes

1 1 [id111, id112, . . . , id11k]

1 2 [id121, id122, . . . , id12k]

1 3 [id131, id132, . . . , id13k]

1 4 [id141, id142, . . . , id14k]

1 5 [id151, id152, . . . , id15k]

2 1 [id211, id212, . . . , id21k]

2 2 [id221, id222, . . . , id22k]

2 3 [id231, id232, . . . , id23k]

2 4 [id241, id242, . . . , id24k]

2 5 [id251, id252, . . . , id25k]

... ... ...

N 1 [idN11, idN12, . . . , idN1k]

N 2 [idN21, idN22, . . . , idN2k]

N 3 [idN31, idN32, . . . , idN3k]

N 4 [idN41, idN42, . . . , idN4k]

N 5 [idN51, idN52, . . . , idN5k]

Definition 5.2.4. (k-anonymity). A spatio-temporal database is k-anonymous w.r.t. a set of attributes d if at most one of the k nearest neighbors has changed from one time-stamp to another, so that each mobile user is not distinguishable from its k − 1 neighbors.

According to the authors in [156], k-anonymity is able to prevent the unveiling of mobile users' identities. This means that a user can be re-identified among its k neighbors only with probability 1/k. Nevertheless, k-anonymity may not protect users against attribute disclosure. Motivated by this argument, we evaluate the robustness of both approaches by computing how many of the k nearest neighbors remained the same, together with the aforementioned probability, per time-stamp.

5.2.6 System Model

We consider N mobile users in R^2. The configuration space, namely the environment in which the objects move, may be free space or a road network (constrained or unconstrained) [187]. In our case, we consider an unconstrained road network, as described in System Architecture, where mobile users are densely distributed and do not develop high speeds [53]. We exclude national or international road networks, since there we cannot assume that users move with linear velocity.

Fig. 5.3 A Matlab overview of mobile users' trajectory points

Suppose a collection of trajectories T = {T^1, ..., T^N} (Trajectories Database) of equal length L. Each trajectory consists of a sequence of time-ordered positions that a mobile user goes through as it moves from a start point to a specific destination. It is a vector of the form

T^j = {(x^j_1, y^j_1, t^j_1), (x^j_2, y^j_2, t^j_2), ..., (x^j_L, y^j_L, t^j_L)}.

Each (x^j_i, y^j_i) represents the position (Cartesian coordinates) of the mobile user j at time-stamp t^j_i, or point i of its trajectory j [187]. For each point i in trajectory j we define in 4-dimensional space a vector D^j_i = (x^j_i, y^j_i, θ^j_i, v^j_i), i = 1, 2, ..., L, which is described by the location coordinates (x, y) and the motion pattern (θ, v), respectively. For the first point of each trajectory, the direction and velocity are defined with respect to the point (0, 0).

Algorithm 3: MWCL
1: Input: the number k of nearest neighbors, the number of mobile users N, the vectors D^j_i of the N users in L time-stamps
2: Output: the k nearest neighbors indexes of the N users in L time-stamps
3: for i = 1 : L do
4:   for j = 1 : N do
5:     Compute the vector D^j_i of user j in time instant i
6:     Apply k-NN between the vector D^j_i and the vectors {D^j_i}_{j=1..N} of all users to find the set of k-NN indexes, I^j_i, of user j in time-stamp i
7:   end for
8: end for
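For concreteness, the following is a minimal Python sketch of Algorithm 3. The use of scikit-learn's NearestNeighbors and all variable names are illustrative assumptions (the thesis experiments were run in Matlab).

```python
# Illustrative MWCL sketch (not the original Matlab implementation).
# D has shape (L, N, d): L time-stamps, N users, d attributes per user.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mwcl(D, k):
    """Return I[i][j]: the k-NN index set of user j at time-stamp i."""
    L, N, _ = D.shape
    I = np.empty((L, N, k), dtype=int)
    for i in range(L):
        # Fit k-NN on all users at time-stamp i; ask for k+1 neighbors
        # and drop the first one, which is the query point itself.
        nn = NearestNeighbors(n_neighbors=k + 1).fit(D[i])
        _, idx = nn.kneighbors(D[i])
        I[i] = idx[:, 1:]
    return I
```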

We approach the problem of Privacy Preserving on spatio-temporal databases with two methods. The first one is called Method-Without-Clustering (MWCL) and the second one is called Method-with-Clustering (MCL). In the latter, we apply on-line clustering, i.e., in each time-stamp t we group the mobile users into clusters based on their vector values D^j_i at that moment. This vector is formulated depending on which of the attributes x, y, θ, v we choose for applying the K-Means and k-NN algorithms. We define a location data security metric, which we have already called Vulnerability; it quantifies and measures the robustness of each method. Specifically, it expresses the rate at which the k nearest neighbors of each mobile user change. The Vulnerability of each method is computed as the mean value of the Vulnerabilities of all users in L time-stamps. The fewer neighbors' indexes (i.e., objid values) change, the lower the Vulnerability.

Algorithm 4: MCL
1: Input: the number k of nearest neighbors, the number of mobile users N, the vectors D^j_i of the N users in L time-stamps
2: Output: the k-NN indexes of the N users in L time-stamps
3: for i = 1 : L do
4:   for j = 1 : N do
5:     Compute the vector D^j_i of user j in time instant i
6:     Apply the k-NN method between the vector D^j_i and the vectors {D^j_i}_{j=1..N} inside the cluster C^j_i of user j in time-stamp i, and find the set of k-NN indexes, I^j_i
7:   end for
8: end for

Definition 5.2.5. Vulnerability: Given a mobile user j and a set I^j_i with the k nearest neighbors indexes in time-stamp i, V^j_i is defined as

V^j_i = 1 / |I^j_i ∩ I^j_{i−1}|

where 0 ≤ V^j_i ≤ 1 and |I^j_i ∩ I^j_{i−1}| is the number of the k indexes that remained the same.
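Similarly, a minimal Python sketch of Algorithm 4 (MCL) above, under the same illustrative assumptions; clusters are assumed to contain more than k members, as required for k-anonymity.

```python
# Illustrative MCL sketch: on-line K-Means per time-stamp, then k-NN
# restricted to each user's cluster (all names are illustrative).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def mcl(D, k, cs):
    """D: (L, N, d) array; k: neighbors; cs: number of clusters."""
    L, N, _ = D.shape
    I = np.empty((L, N, k), dtype=int)
    for i in range(L):
        labels = KMeans(n_clusters=cs, n_init=10).fit_predict(D[i])
        for c in range(cs):
            members = np.where(labels == c)[0]           # global user ids
            # Assumes each cluster holds more than k members.
            nn = NearestNeighbors(n_neighbors=k + 1).fit(D[i][members])
            _, idx = nn.kneighbors(D[i][members])
            I[i][members] = members[idx[:, 1:]]          # map back to ids
    return I
```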

Algorithm 5: Vulnerability
1: Input: the sets I^j_i with the k nearest neighbors indexes, the number of mobile users N and the time period L
2: Initialization: V^j_1 = 1/k
3: Output: the Vulnerability values of the N mobile users in L time-stamps
4: for i = 2 : L do
5:   for j = 1 : N do
6:     V^j_i = 1 / |I^j_i ∩ I^j_{i−1}|
7:   end for
8: end for
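The metric of Algorithm 5 can then be computed from the k-NN index sets returned by either method. The handling of an empty intersection below is an assumption of this sketch, since Definition 5.2.5 leaves that case undefined.

```python
# Illustrative computation of the Vulnerability metric of Algorithm 5,
# given the k-NN index sets I of shape (L, N, k) from MWCL or MCL.
import numpy as np

def vulnerability(I, k):
    L, N, _ = I.shape
    V = np.empty((L, N))
    V[0, :] = 1.0 / k                      # initialization V_1 = 1/k
    for i in range(1, L):
        for j in range(N):
            common = np.intersect1d(I[i, j], I[i - 1, j]).size
            # Per Definition 5.2.5; common = 0 means all neighbors
            # changed, which we treat here as maximal vulnerability.
            V[i, j] = 1.0 / common if common > 0 else 1.0
    return V.mean()                        # mean over users and time-stamps
```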

5.2.7 Privacy Preserving Analysis

Recent advances in video-tracking technology have attracted scientific attention to the understanding of the social behavior of swarming animals. The video-tracking method can automatically measure an individual's motion states using videos from different camera views [36]. Swarming behavior is connected with collective behavior, which usually occurs in large groups of animals such as bird flocks, mosquitoes and insects [4]. Researchers employ mathematical models to simulate and understand swarm behavior. The simplest mathematical models generally consider that individual animals move in the same direction as their neighbors, remain close to them (thus the neighbors remain constant) and avoid collisions with them. In our study, the mobile users constitute the "swarming animals" who exhibit collective motion behavior and are either organized into groups or not, as described in the previous section. It is worth performing a privacy preserving analysis in the case that mobile objects move randomly (thus independently of each other) from one position to another. In both approaches (whether we apply on-line clustering or not), we observe that the mobile users' nearest neighbors indexes change and thus the Vulnerability of both methods increases. In the case of random motion, the mobile users' behavior is similar to that of a swarm of bees or flies. Their direction (angle) is piecewise linear and their velocity amplitude is approximately constant. Users in a road network have similar motion characteristics, but they do not move randomly. Generally, a mobile object in a dense road network moves on a piecewise linear random path with constant speed.

Let X_{t_i} be a sequence of independent random variables related to the number of neighbors of a mobile user i that changed within a time interval t_i. To keep the Vulnerability low, much less than 1 and close to 0, the users' motion behavior should not change considerably, so that their k nearest neighbors remain the same from one time-stamp to another. This results in the following theorem:

Theorem 5.2.1. If objects move randomly, the probability that the nearest neighbors do not remain the same within a time interval (0, t_i], with at least l of them changed so that users become distinguishable by an adversary, tends to 1:

lim_{l→k} P(X_{t_i} ≥ l) = 1

Proof. We conduct a Poisson experiment. We consider the number of successes resulting from the Poisson experiment to be a Poisson random variable with average number of successes λ, and k a positive integer related to the k-anonymity level.

1. The outcomes of the experiment are discrete. Specifically, they concern the nearest neighbors' sustainability and are classified as either success (the number of neighbors that remained the same is at least k − 1) or failure (the number of neighbors that remained the same is at most k − 2, i.e., less than k − 1).

2. λ is the average or expected number of successes within a time interval t_i for a mobile user i, E(X_{t_i}), and is assumed to be known and constant throughout the experiment.

3. Poisson describes the distribution of events. Each event is independent of the other events.

4. The probability for a mobile user i to have at least l different neighbors in a time interval t_i (the occurrence of a failure) is written as

P(X_{t_i} ≥ l) = 1 − P(X_{t_i} < l) ≈ 1 − (λ^l e^{−λ}) / l!

so that

lim_{l→k} P(X_{t_i} ≥ l) = lim_{l→k} (1 − (λ^l e^{−λ}) / l!).

Due to the random-movement assumption, it is highly certain that the indexes of all k nearest neighbors will change. Hence, the above probability will be very close to 1. If k is large enough, the previous limit will tend to 1.
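For intuition, the tail probability used in the proof can be evaluated numerically. The following sketch uses the plain Poisson PMF; the values of λ and l are arbitrary examples.

```python
# Numerically evaluate P(X >= l) for a Poisson variable with mean lam,
# illustrating the tail behavior used in the proof (values are examples).
from math import exp, factorial

def poisson_pmf(i, lam):
    return lam ** i * exp(-lam) / factorial(i)

def tail_prob(l, lam):
    """P(X >= l) = 1 - sum_{i < l} P(X = i)."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(l))

lam = 10.0                       # expected number of changed neighbors
for l in (1, 5, 10, 15):
    print(l, round(tail_prob(l, lam), 4))
# As lam grows relative to l, P(X >= l) approaches 1, matching the claim
# that under random motion almost all k neighbors change.
```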

5.2.8 Experiments Data and Environment

The experimental data used in this paper come from the SMaRT Database GIS Tool at http://www.bikerides.gr/thesis2/. We experiment on two trajectory data sets of 400 and 2000 bike riders, as shown in Fig. 5.3, in the area of Corfu, with 100 trajectory points each. For each trajectory point, the values of the four dimensions are available for 100 time-stamps, i.e., the Cartesian coordinates (x, y) and the angle and velocity (θ, v), respectively. The environment in which the experiments were carried out has the following characteristics: Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00 GHz, 16 GB memory, Windows 10 Education, 64-bit Operating System, x64-based processor and Matlab 2018a.

5.3 Discussion

Over the last decade, the problem of privacy preservation of location data has been of particular concern to researchers. Hence, many research works have been conducted to reinforce the security level that earlier techniques, such as the association of location data with pseudonyms, provide.

In a recent work [111], the authors recommend an asymmetric Private Equality Testing (PET) protocol which allows two users to communicate with each other safely and without the involvement of a Third Party. PET requires two public-key exponentiations per user and needs three rounds to complete. Both users compute their private input through hash functions and send it to each other. The output of the protocol at the end of the third round indicates whether the two keys are equal and, as a result, whether the users can communicate. This protocol can be used in Location-Based Social Services to find the location of users who are in the same region or within a specific radius, depending on the preference of the user. The security of the private inputs of the participants lies in the use of the Computational Diffie-Hellman and Discrete Logarithm problems. Also, the asymmetry does not reveal whether the computation and the equality test of the inputs were successful, which prevents observers from identifying whether the connection was established.

More to the point, the k-anonymity method has been used in order to reinforce and quantify location data privacy. If k-anonymity is achieved, a person cannot be distinguished from k − 1 other people [118]. In the context of k-anonymity, the authors in [118] propose an enhanced Dummy-Location Selection (DLS) algorithm for users in LBS. From a different perspective, the authors in [156] aim at the utility improvement of differentially private published data sets. More specifically, they show that the amount of noise required to fulfill ε-differential privacy can be reduced if the noise is added to a k-anonymous version of the data set, where k-anonymity is achieved through a specially designed microaggregation of all attributes. As a result of the noise reduction, the general analytical utility of the anonymized output is increased. Moreover, the authors in [154] attempt to protect users' private locations in location-based services by adopting the spatial cloaking technique, which organizes users' exact locations into cloaked regions satisfying the k-anonymity requirement within the cloaked region. They propose a cloaking system model, called "anonymity of motion vectors" (AMV), that

provides anonymity for spatial queries by minimizing the cloaked region of a mobile user using motion vectors. The AMV creates a search area that includes the nearest neighbor objects to the querier who issued a cloaked-region-based query. In addition, in [117], the authors suggest a clustering-based k-anonymity algorithm and optimize it with parallelization. The experimental evaluation of the proposed approach shows that the algorithm performs better in terms of information loss due to anonymization, and its performance is compared with existing algorithms such as KACA and Incognito.

In our case, we approach k-anonymity preservation as follows. Specifically, we investigate the impact of the used attributes (x, y, θ, v) on the robustness of the proposed methods, MCL and MWCL. Since, from time-stamp to time-stamp, the number of preserved nearest neighbors may not remain the same and may be less than k, the robustness is decreased; namely, the probability that one mobile user is identified from its anonymity set may be higher than the optimal value 1/k. The proposed MCL method adopts a simple K-Means clustering micro-aggregation technique that maintains k-anonymity, which is the aim of this work. However, the proposed approach has some limitations and drawbacks. Firstly, it works for numeric or continuous location data, but does not work for categorical data. Secondly, although there may exist natural relations among attributes, such as between the angle and velocity and the Cartesian coordinates, the proposed algorithm cannot incorporate such information to find more desirable solutions. Moreover, the focus is placed heavily on preserving the k nearest neighbors to guarantee the mobile users' anonymity. As a result, the algorithm lacks the ability to address other issues (e.g., l-diversity, t-closeness) which previous works address to find more desirable solutions. In addition, the proposed MCL algorithm tries to minimize the within-cluster sum of squares and maximize the between-cluster sum of squares, so that the number of records in each partition is greater than k. Using K-Means, there is no guarantee that the optimum is found; thus, the quality of the resulting anonymized data cannot be guaranteed. Like all greedy algorithms, the K-Means algorithm will reach a local, but not necessarily a global, minimum. Hence, the information loss is not minimized, even though the clusters are formed such that they contain at least k similar objects. In addition, the MCL algorithm assigns mobile users to the nearest cluster by squared Euclidean distance; using a different distance function may stop the algorithm from converging. Hence, as a first approach, we focused on this version of K-Means. In future work, various modifications of K-Means or other clustering algorithms will be used to investigate their impact on k-anonymity.

Finally, our aim is to extend this work to large-scale trajectory data and transfer the whole processing to a distributed computing environment based on Hadoop. Apache Spark is the most promising environment, with high performance in parallel computing, designed to efficiently deal with supervised machine learning algorithms, e.g., k-NN [138]. A future direction is to implement the above methods and experiments in Big Data environments [81], [175], [31] and to investigate their scalability and running-time performance under different data sets and combinations of the parameter k.

5.4 Results

5.4.1 Experiments Results

In this section, the results of the experiments conducted are presented in Figs. 5.4 to 5.9. We relied on real data to evaluate the performance, in terms of Vulnerability, of MWCL and MCL. We experiment on two data sets of size N = 400 and N = 2000, respectively. The parameters of the two experiments and their values are presented in Tables 5.3 and 5.4, respectively. Parameter cs refers to the number of clusters, and k to the number of nearest neighbors in terms of Euclidean distance.

Table 5.3 Parameters for the 1st experiment of N=400 trajectories

cs      k  Clustering attributes  k-NN attributes
2,5     5  x                      x
2,5     5  x, y                   x, y
2,5     5  x, y, θ                x, y, θ
2,5     5  x, y, θ, v             x, y, θ, v
2,5,10  5  x, y, θ, v             x, ∗, ∗, ∗

Concerning the MCL, we consider the attribute combinations shown in both tables. For a fair comparison, in MWCL the k-NN is applied to the same attribute combinations as well. We investigate k-anonymity gradually, by adding one new attribute each time. In the first four cases of both experiments and approaches (see Figs. 5.4, 5.5, 5.6, 5.7), we observe that the information of attribute x alone is not sufficient to make either method robust enough in terms of nearest neighbors' index changes.

Table 5.4 Parameters for the 2nd experiment of N=2000 trajectories

cs       k      Clustering attributes  k-NN attributes
5,10     15     x                      x
5,10     15     x, y                   x, y
5,10     15     x, y, θ                x, y, θ
5,10     15     x, y, θ, v             x, y, θ, v
5,10,15  15,30  x, y, θ, v             x, ∗, ∗, ∗

More to the point, in the combination (x, y), although Vulnerability dropped significantly, the usage of the attributes (θ, v) did not enhance it. In real data sets, many dimensions contain high levels of inter-attribute correlation. In this work, by definition, as described in the subsection "System Architecture", the attributes (θ, v) and (x, y) are correlated. This stems from the fact that the non-linear trajectories of the mobile users are approximated as linear ones between time-stamps. Specifically, the velocity is computed as

v_x = (x_{n+1} − x_n) / (t_{n+1} − t_n),   v_y = (y_{n+1} − y_n) / (t_{n+1} − t_n),   v = sqrt(v_x^2 + v_y^2),

while the angle is θ = tan^{−1}((y_{n+1} − y_n) / (x_{n+1} − x_n)).

The curse of dimensionality has remained a challenge for a wide variety of algorithms in data mining, clustering, classification and privacy, and seems to affect the performance of both methods in terms of Vulnerability. The experimental results suggest that the dimensionality curse is an obstacle to privacy preservation. It has been shown that increasing dimensionality makes the data resistant to effective privacy and to achieving the lower bound of k-anonymity, i.e., 1/k. However, in practice, we show that some of the attributes of real data can be leveraged in order to greatly ameliorate the negative effects of the curse of dimensionality on privacy. To obtain an even more accurate classification, we considered two more attributes in the computations, the angle and the velocity. However, it is doubtful whether a perfect classification can be obtained by carefully defining a few such features. In fact, after a certain point, which in our case is the (x, y) attributes, increasing the dimensions of the problem by adding new features degraded the performance of the k-NN classifier. As shown in the following figures, as the dimensionality increases, the Vulnerability performance improves until the optimal number of features is reached, i.e., 2. Further increasing the dimensionality does not ameliorate the Vulnerability performance.
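A small sketch of the velocity and angle computation between consecutive trajectory points, as given above; np.arctan2 is used instead of a bare tan^{−1} so that the angle sign matches the values in Table 5.1 (an implementation choice of this sketch, not necessarily the thesis' own).

```python
# Compute per-segment velocity amplitude and angle from consecutive
# trajectory points, following the formulas above (illustrative helper).
import numpy as np

def motion_pattern(x, y, t):
    """x, y, t: 1-D arrays of length L; returns (v, theta) of length L-1."""
    dx, dy, dt = np.diff(x), np.diff(y), np.diff(t)
    vx, vy = dx / dt, dy / dt
    v = np.hypot(vx, vy)          # velocity amplitude
    theta = np.arctan2(dy, dx)    # angle in (-pi, pi]
    return v, theta

# Example with the first trajectory points of Table 5.1:
x = np.array([21082.0, 21099.0, 21221.0])
y = np.array([56436.0, 56432.0, 56484.0])
t = np.array([1.0, 4.0, 11.0])    # seconds
print(motion_pattern(x, y, t))
```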

In the following figures we repeat the same procedure as previously, but we consider a larger data set of N = 2000 trajectories. From the cluster-number perspective, the number of clusters cs obviously affects only MCL, where we see that Vulnerability becomes a little worse as the clusters increase. This relates to the fact that not only is the average cluster size reduced, but also the composition of the clusters changes; thus, MCL becomes more sensitive to the change of the nearest neighbors of the mobile users inside each cluster. We focus on the last combination of attributes of both experiments (see Figs. 5.8 and 5.9). We observe that attribute suppression in the computation of the k nearest neighbors indexes makes MCL superior in terms of Vulnerability (that is, it attains lower values), which is the main issue in k-anonymity and thus in Privacy Preserving. Moreover, it is the combination that highlights MCL as the number of clusters increases. The fact that the clustering is based on all available attributes strengthens cluster homogeneity and better reflects real-world communities. More to the point, when the mobile users are camouflaged by their k nearest neighbors based on only one of the attributes, in this case x, less information is disclosed about them and their neighbors. Therefore, it is more difficult to break the security even if an intruder monitors historical data and tries to link the k − 1 public records of the nearest neighbors. This case keeps Vulnerability at a relatively high level for low values of k. It is obvious that the more nearest neighbors are used to camouflage the mobile users inside the cluster, the lower the Vulnerability becomes. Also, in terms of computational cost, k-NN is more time-effective when applied to lower-volume (inside a cluster) and lower-dimensional data. Not to mention that we avoid the dimensionality effect in classification and thus in k-NN performance. Finally, in Figs. 5.8 and 5.9 we also demonstrate the impact of the parameter k on Vulnerability. We observe that increasing k benefits both methods. This shows that the security of a mobile user is more vulnerable when protected by a low number of nearest neighbors.

Fig. 5.4 Both clustering and k-NN: (a) x and (b) (x,y) for N=400 trajectories, L=100 time-stamps and k = 5.

Fig. 5.5 Both clustering and k-NN: (a) (x,y,θ) and (b) (x,y,θ,v) for N=400 trajectories, L=100 time-stamps and k = 5.


Fig. 5.6 Both clustering and k-NN: (a) x and (b) (x,y) for N=2000 trajectories and L=100 time-stamps.

Fig. 5.7 Both clustering and k-NN: (a) (x,y,θ) and (b) (x,y,θ,v) for N=2000 trajectories and L=100 time-stamps.

Fig. 5.8 Clustering (x,y,θ,v) and k-NN x for N=400 trajectories, L=100 time-stamps for (a) k = 5 and (b) k = 15.

Fig. 5.9 Clustering with (x,y,θ,v) and k-NN in x. Figure (a) concerns k=15 while (b) k=30, for N=2000 trajectories and L=100 time-stamps.

5.4.2 Experiments Conclusions

In conclusion, in the context of this chapter we carried out research on Privacy Preserving based on real spatio-temporal data. This work proposes a k-anonymity model based on motion vectors that provides anonymity for spatial queries. Specifically, we investigated the problem of k-anonymity from a dimensionality perspective and how the combination of dimensions affects the Vulnerability of both methods. We observed that inter-attribute combinations, as well as attribute suppression within a record, have a powerful effect as the dimensionality increases. We demonstrated the effectiveness and efficiency of MCL, under a specific combination of dimensions, through intensive experiments. Finally, anonymization using clustering (based on all attributes), combined with attribute suppression in the computation of the k-anonymity set, is a viable solution for privacy preserving.


CHAPTER 6

Storage Efficient Trajectory Clustering and k-NN for Robust Privacy Preserving Databases

6.1 Introduction

The research area of moving object databases has become an emerging technological discipline and has consequently gained a lot of interest during the last decade, due to the development of ubiquitous location-aware devices, such as PDAs, mobile phones, GPS-enabled mobile devices, RFID, and road-side sensors. The technological achievements and advances in sensing and communication/networking, along with the innovative design features (thin and light) of computing devices and the development of embedded systems, have enabled the recording of a large volume of spatio-temporal data. Mobile object trajectories are among the wide variety of spatio-temporal data that are especially important to scientists. Indeed, they help in discovering movement patterns (individual or group) and knowledge which, in the recent literature, have been established as trajectory or mobility mining [100]. Also, database technology is evolving to support the querying and representation of the trajectories of moving objects (e.g., humans, animals, vehicles, natural phenomena). Hence, the main parts of trajectory data-mining include pre-processing, data management, query processing, trajectory data-mining tasks, and privacy protection [49].

Real-life applications, such as the analysis of traffic congestion, intelligent transportation, animal migration habits analysis, cellular communications, military applications, structural and environmental monitoring, disaster/rescue management and remediation, Geographic Information Systems (GIS), Location-Based Services (LBS), and other domains, have increased the interest in the area of trajectory data-mining and the efficient management of spatio-temporal data.

It should be noted that the explosive growth of social media has produced large-scale mobility datasets whose publication puts people's personal lives at severe risk. Indeed, users are used to sharing their most-visited or potentially sensitive locations, such as their home, workplace, and holiday locations, which are easy to obtain through social media. Nowadays, the amount of spatio-temporal data has been growing exponentially. Therefore, there is an urgent need to develop efficient methods for storing and managing this large amount of information. A plethora of studies have been conducted for handling mobile objects' trajectory data. More precisely, several of them attempt to reduce the storage size [60, 68, 155], while others investigate the privacy preservation of trajectory data [69, 133]. Nowadays, not only are storage-efficient spatio-temporal transformation schemes needed, but also secure querying on large-scale spatio-temporal data [183]. An accurate capture of a moving object's trajectory usually needs a high sampling rate to collect its location data. Thus, massive trajectory data will be generated, which is difficult to fit into memory for utilizing data-mining algorithms. A common idea is to compress the trajectory data to reduce the storage requirements while maintaining the utility of the trajectory. In the context of this work, we present the storage efficiency of dual methods and experiment on data from the SMaRT system, through which the data of moving object trajectories are generated and used as input to our methods in order to evaluate the security level they offer. More specifically, we summarize the main contributions of our paper as follows:

1. We compare the proposed methods on addressing k-NN queries on moving ob- jects’ trajectories data, which are stored both in dual and native dimensional space. Our implementation shows that the innovative method of Dual Trans- formation constitutes a practical solution that can provide secure k-NN queries.

2. We conduct an extensive experimental evaluation that studies various scenarios that can affect the vulnerability of the k-NN queries and proceed to a comparative analysis of the underlying methods. We prove the efficiency of our solution using real data drawn from SMaRT.

3. We recall two protocols for Pseudonym Recovery and Registration with the aim of reinforcing the individuals' privacy in the released data. An individual cannot be re-linked to specific users with a high degree of certainty, as described in Section 6.3.7.

The rest of the chapter is organized as follows: In Section 6.2, previous related works are presented in relation to our approach. Section 6.3 describes: (a) the dual transformation methods used; (b) the problem definition; (c) the problem formulation; (d) the privacy-preserving analysis; and (e) the experimental environment and the source of the datasets. Section 6.4 presents the graphical outcomes gathered from the experiments, while Section 6.5 evaluates the experimental results in relation to the pros and cons of the proposed methods. Finally, Section 6.6 records the conclusions on the studied problem and future directions of this work.

6.2 Related Work

In this section, we review existing related works in the domain of secure querying on spatio-temporal databases. Our discussion includes privacy-preserving approaches for trajectory-based queries. In recent years, trajectory databases have constituted an important research area that has received a lot of interest. Most researchers have focused on the querying of moving objects and their trajectories. The so-called trajectory-based queries are also gaining much interest. Queries based on trajectory data require knowledge of the whole, or at least a part, of the mobile objects' trajectory in order to be processed. Such queries may provide useful information about an object's average speed, travelled distance, and so forth. In [183], three common mechanisms in privacy-preserving trajectory publishing are described. Generalization and suppression are the most common ones used to implement k-anonymity. However, the main drawback of these mechanisms is that they suffer from a high possibility of information loss; thus, perturbation techniques based on randomization (e.g., adding noise) may be utilized as an alternative. Actually, the problem of secure querying on spatio-temporal data in combination with k-anonymity has gained much attention among researchers. Indeed, the authors in [166] describe historical k-anonymity, based on each mobile user's trajectory data history, known as Personal History Locations (PHL). According to PHL anonymity, a user U is camouflaged by k − 1 users whose PHLs have a common part with its own, rendering him/her indistinguishable among them. Privacy preservation is enforced as the generalization method has been applied. More specifically, by trying to preserve historical k-anonymity, the authors increased the uncertainty related to the user's real location data

at the time of the query by modifying the spatio-temporal information of the query. More precisely, in [136], by employing the kl-anonymity privacy model, the authors ensure that an intruder who has knowledge of any sub-trajectory TS of size l of a user's trajectory T^j cannot distinguish it among the k − 1 trajectories that protect it; based on TS, this is possible with probability at most 1/k.

In a more recent work [39], the authors investigated the privacy-preserving problem based on real spatio-temporal data. That paper employed the k-anonymity method and formed the anonymity set based on motion vectors, with the aim of executing secure spatial k-NN queries. More specifically, the problem of k-anonymity from a dimensionality perspective, and the impact of the used dimensions on the vulnerability of the suggested methods, was investigated. The experiments showed the effectiveness of the proposed method, such as clustering under a particular combination of attributes, and observed that it benefited from attribute suppression during the k-anonymity set computation. The authors in [53] suggested a novel spatio-temporal MySQL ReTrieval framework based on the MySQL and PostgreSQL database management systems. In the context of that work, the authors employed the Hough-X transformation so as to evaluate the efficiency of range queries on non-linear two-dimensional trajectories of mobile objects. Indeed, they demonstrated that the Hough-X dual approach, in combination with the range-tree variant, was quite efficient.

Generally, the trajectory of a mobile user is non-linear. However, it can be approximated by a discrete number of linear sub-trajectories with the use of a trajectory segmentation application. Each partition is represented by a line segment between two consecutive partition points and is expected to provide an effective and efficient way to obtain insights into the motion characteristics and behavioral preferences of mobile objects. Our approach performs low-rate sampling and considers linear interpolation between successive sampled points, where each line segment represents the continuous movement of the object between sampled points. The duality transformation of line segments operates as a pre-processing step and aims at increasing the security level and reinforcing the privacy of k-NN queries, which is the main subject of this work. Also, we have at our disposal the linear components of the initial trajectory, as well as the storage of the first and last spatial points in order to represent each line along with its dual representative, that is, the Hough-X (and/or Hough-Y) dual points. Lastly, this step will turn out to be useful from a storage perspective in Big Data applications, and will render the proposed methods a strong candidate for efficient querying on massive data, in combination with the appropriate indexing method.

6.3 Materials and Methods

6.3.1 Dual Transform for Moving Objects

In general, the geometric dual transform maps a hyper-plane h from Rm to a point in Rm, and vice versa. In this section, we briefly present how the duality transformation operates in a one-dimensional case. A line from the plane (t, y) or (t, x) is mapped to a point on the dual plane (see Figure 6.1).

1. Hough-X: The equation y(t) = ut + a is mapped to a point (u, a), where axes u, a represent the slope (that is, velocity) and intercept of an object’s trajectory, respectively. Thus, we get the dual point (u, a), the so-called Hough-X transform.

2. Hough-Y: The equation y(t) = ut + a can be rewritten as t = (1/u)y − a/u, a different dual representation, the so-called Hough-Y transform. The point in the dual plane is represented as (b, c), where b = −a/u (the intersection with the line y = 0) and c = 1/u.

It is worth mentioning that the Hough-X transform cannot represent vertical lines, while horizontal lines cannot be represented using the Hough-Y transform. Nonetheless, both transforms are valid since, in our setting, the velocity is bounded by [u_min, u_max], and thus the lines have a minimum and maximum slope.
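A minimal sketch of the two dual mappings above, assuming a 1-D linear segment x(t) = u·t + a estimated from two sampled points; the helper name and the example values are illustrative.

```python
# Dual (Hough-X / Hough-Y) points of a 1-D linear segment x(t) = u*t + a,
# with (u, a) estimated from two sampled points (illustrative helper).
def hough_duals(t0, x0, t1, x1):
    u = (x1 - x0) / (t1 - t0)     # slope = velocity; assumes t1 != t0
    a = x0 - u * t0               # intercept
    hough_x = (u, a)              # Hough-X dual point
    hough_y = (-a / u, 1.0 / u)   # Hough-Y dual point (b, c); assumes u != 0
    return hough_x, hough_y

# Example: object at x=21082 at t=1 s and x=21099 at t=4 s (cf. Table 5.1):
print(hough_duals(1.0, 21082.0, 4.0, 21099.0))
```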


Fig. 6.1 An overview of trajectory segmentation and Hough-X transformation for a linear trajectory segment (TS), which consists of M points. The dual points of M points in TS are the same, for example, a1 = ... = aM , u1 = ... = uM , where the left graph shows the y(t) line and the right graph shows the Hough-X points.

6.3.2 kNN Classification and Clustering in Dual Space

Here, we consider points in a dual space P. Given two dual points dp_1 and dp_2, we define as dist(dp_1, dp_2) the distance between dp_1 and dp_2 in P. In the context of this work, we utilize the Euclidean distance metric, which is defined as

dist(dp_1, dp_2) = sqrt( Σ_{i=1}^{p} (dp_1[i] − dp_2[i])^2 ),

where dp_1[i], dp_2[i] denote the values of dp_1, dp_2 along the i-th dimension in P. For example, in Hough-X space, the distance between the dual points dp_1 = (u_1, a_1) and dp_2 = (u_2, a_2) is computed as dist(dp_1, dp_2) = sqrt((u_1 − u_2)^2 + (a_1 − a_2)^2).

Definition 6.3.6. DukNN: Given a dual point dp, a data-set of dual points Y and an integer k, the k nearest neighbors of dp from Y , denoted as DukNN(dp, Y ), is a set of k points from Y such that ∀l ∈ DukNN(dp, Y ) and ∀q ∈ {Y − DukNN(dp, Y )}, dist(l, dp) < dist(q, dp).

Definition 6.3.7. DukNN Classification: Given a dual point dp, a training dual points data-set Y , and a set of classes ClY where the dual points of Y belong, the classification process produces a pair (dp,cldp), where cldp is the majority class to which dp belongs.

Definition 6.3.8. Clustering: Given a finite data-set of dual points DP = {dp_1, dp_2, ..., dp_N} in R^p and a number of clusters K, the clustering procedure produces K partitions of DP such that, among all K partitions (clusters) C_1, C_2, ..., C_K, it finds the one that minimizes

arg min_{C_1,...,C_K} Σ_{c=1}^{K} Σ_{dp ∈ C_c} || dp − (1/|C_c|) Σ_{dp_j ∈ C_c} dp_j ||^2,

where |C_c| is the number of dual points in cluster C_c.

Note that the aforementioned dual methods act as a feature extraction technique. More specifically, they extract the dual point of each of the x, y coordinates of a mobile user trajectory. The k nearest neighbors algorithm is then applied to the dual point features and returns the dual points whose distance from the query dual point is smaller than the distance of the rest of the training dual points. Considering the Hough-X transformation of attribute x or y, the search area is a circle centered at the query point with a radius such that k nearest neighbors exist. If we assume the Hough-X of both (x, y) attributes, the k nearest neighbor search area is four-dimensional (u_x, a_x, u_y, a_y), with a complex hypercube geometry.

6.3.3 Problem Definition

Here, we consider a database that records the location information of mobile objects in two-dimensional space over a finite area. Also, we assume that objects move with small velocities that lie in the range [u_min, u_max], starting from a specific location at a specific time-stamp, and move along a non-linear trajectory. In order to store and handle queries in an efficient way, a mobile object's trajectory is approximated by a series of linear ones, as depicted in Figure 6.2.

Fig. 6.2 A raw trajectory approximation with a discrete number of R linear sub-trajectories. In the dual space, each one is represented as a dual point; for example, the linear sub-trajectory [l(t_0), l(t_1)] is represented as the dual point dp_1, and [l(t_1), l(t_2)] as the dual point dp_2.

Definition 6.3.9. A linear trajectory is a straight line that an object keeps track of, starting from a location l(t_0) = [x_0, y_0] at time t_0. Then, its location for t > t_0 will be l(t) = [x(t), y(t)], or l(t) = [x_0 + u_x(t − t_0), y_0 + u_y(t − t_0)], where u = (u_x, u_y) is the object's velocity in each plane [53].

Definition 6.3.10. A trajectory partition or sub-trajectory segment is a line segment L_i L_j, where for i < j both points belong to the same trajectory and are connected in order to form a partition denoted by TS_i [113].

Definition 6.3.11. Characteristic points are the points where the trajectory changes rapidly.

Definition 6.3.12. The dual points array constitutes a set containing points of a trajec- tory that are represented in the dual space.

Definition 6.3.13. A compressed trajectory path is a subset of the trajectory’s points that indicate a significant change in the motion characteristics, that is, the speed or direction of a moving object.

Definition 6.3.14. Given a trajectory T of size |T| and a compressed trajectory T_c of T with size |T_c|, the Compression Ratio (CR) is |T| / |T_c|.

Authors in [158] claim that the compression ratio constitutes a common metric for evaluating the effectiveness of compression algorithms that can accurately reflect the change of a trajectory’s data size. It is influenced by the original signal’s data-sampling rate, as well as the quantization accuracy.

6.3.4 Problem Formulation

In the context of this study, the problem of privacy preservation when dealing with spatio-temporal databases goes one step further and builds on the work in [39]. The spatio-temporal data are the location data of a number of mobile users, along with the time-stamp of each position, as shown in Table 5.1. Through the SMaRT system, we have at our disposal offline trajectory data that also give us the Hough-X and Hough-Y transforms of the spatial data (x, y). Hence, for each database record per time-stamp, that is, each mobile user trajectory point, we can consider the values of four attributes

Table 6.1 An overview of the transformed spatio-temporal database.

ObjId  Timestamp            Ux     ax            Uy    ay           bx      wx              by    wy
1      2013-03-09 10:00:01  4.37   22,242,219.9  1.03  4,800,692.9  0.23    −5,093,637.76   0.97  −4,645,833.3
1      2013-03-09 10:00:04  13.4   22,242,156.2  5.83  4,800,651.2  0.075   −1,659,862.40   0.17  −823,641.2
1      2013-03-09 10:00:11  10.58  22,242,287.4  3.79  4,800,713.7  0.0946  −2,103,289.59   0.26  −1,267,515.2
1      2013-03-09 10:00:19  27.3   22,242,427.4  11    4,800,762    0.04    −814,740.93     0.09  −436,432.91
1      2013-03-09 10:00:21  27.3   22,242,427.4  11    4,800,762    0.04    814,740.9       0.09  −436,432.91
2      2013-03-09 10:00:03  2.92   22,256,723.4  7.32  4,804,052.4  0.3425  −7,622,165.55   0.14  −656,291.3
2      2013-03-09 10:00:08  1.15   22,256,709.8  5.75  4,803,996    0.87    −19,353,660.69  0.17  −835,477.56
2      2013-03-09 10:00:16  4.27   22,256,692.6  4.64  4,803,941.2  0.23    −5,216,411.92   0.22  −1,034,341.51
2      2013-03-09 10:00:25  4.6    22,256,639.5  0.23  4,803,925.9  0.22    −4,826,741.21   4.29  −20,588,283.27
2      2013-03-09 10:00:34  1.5    22,256,625.5  5.2   4,803,925.8  0.67    −14,837,750.3   0.19  −923,831.89

(x, y, θ, u) (as in Table 5.1), along with the values of an additional eight attributes (Ux, ax, Uy, ay, bx, wx, by, wy) (as in Table 6.1). So, we have chosen to anonymize the dual point attributes by employing the k-NN method, which enables us to form the k-anonymity set of each mobile object per time-stamp, as depicted in Table 5.2. The data anonymization is handled both as a clustering and as a no-clustering problem. In both approaches, the anonymity set is formed again by the k nearest neighbors' ids. For each mobile user i and per time-stamp l, we compute its k nearest neighbors' ids and keep them in a vector of the form knns_{il} = [id_{il1}, id_{il2}, ..., id_{ilk}] for l = 1, 2, ..., L. In Table 5.2, an example of such sets for N mobile users' dual points is presented. For each user, we measure the number of the k nearest neighbors' dual points that remained the same from one time-stamp to another. By employing the dual transformation methods described in Section 6.3.1, the k-anonymity set of mobile users is formulated based on their dual points. Hence, an alternative definition for k-anonymity is as follows:

Definition 6.3.15. (k_DUST-anonymity). A transformed database record is k-anonymous with respect to the Hough-X dual points, that is, the velocity and intercept attributes (Ux, ax) or (Uy, ay), if at least k − 1 discrete records at the same specific time-stamp τ have the same dual point attributes, so that no record among the k is distinguishable from its k − 1 neighboring records.

Remark 6.3.1. As we already mentioned in [39], k-anonymization intuitively hides each individual among k − 1 others. This means that linking cannot be performed with confidence greater than 1/k. Nevertheless, k-anonymity may not protect users against the unveiling of the dual point attributes.

6.3.5 System Model

Here, we consider a spatio-temporal database with N records, that is, N moving objects in the xy plane. Each record (x^j_i, y^j_i) represents the spatial coordinates of the mobile user j at time-stamp t^j_i, or point i of its trajectory j [186]. From the location coordinates (x, y), we can extract the corresponding dual points by employing the methods described in Section 6.3.1. Suppose a trajectories database T = {T^1, ..., T^N} of equal length L, in which each trajectory is represented via a sequence of L triples, that is, T^j = {(x^j_1, y^j_1, t^j_1), (x^j_2, y^j_2, t^j_2), ..., (x^j_L, y^j_L, t^j_L)}.

For each point i in trajectory j, we define in four-dimensional space a vector DP^j_i = (Ux_{ij}, ax_{ij}, Uy_{ij}, ay_{ij}), which denotes the dual points array. Hence, we can redefine and store the trajectory j as T^j = {DP^j_1, DP^j_2, DP^j_3, ..., DP^j_L}. The privacy preservation of k-NN queries in trajectory databases is addressed with the use of two different methods. The first one is entitled dual-based k-NN (DukNN), which applies k-NN directly onto the dual points, while the second one is called dual-based clustering k-NN (DuCLkNN). The main difference between these two methods lies in the fact that the latter is applied to clustered dual point data. The operations involved in addressing a k-NN query are thoroughly described in Algorithms 6 and 7, respectively.

Algorithm 6: DukNN
1: input: the number k of nearest neighbors
2: input: the number of mobile users N
3: input: the dual points array of the N users in L time-stamps
4: output: the k nearest neighbors indexes of the N users in L time-stamps
5: for i = 1 to L do
6:   for j = 1 to N do
7:     Apply k-NN to the dual points of all users in order to identify the set of k-NN indexes I^j_i of user j in time-stamp i
8:   end for
9: end for

Algorithm 7: DuCLkNN
1: input: the number k of nearest neighbors
2: input: the number of mobile users N
3: input: the dual points array of the N users in L time-stamps
4: output: the k-NN indexes of the N users in L time-stamps
5: Apply K-Means to the dual points (Ux, ax) of the N users for the L time-stamps
6: for i = 1 : L do
7:   for j = 1 : N do
8:     Apply the k-NN method between the dual point of user j and the dual points of the users inside the cluster C^j_i of user j in time-stamp i, and find the set of k-NN indexes I^j_i
9:   end for
10: end for
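Operationally, DukNN and DuCLkNN mirror MWCL and MCL of Chapter 5 but run on dual points; a compact Python sketch under the same illustrative conventions follows.

```python
# DukNN / DuCLkNN sketch on dual points (illustrative, not the original code).
# DP has shape (L, N, 4): one (Ux, ax, Uy, ay) vector per user and time-stamp.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def duknn(DP, k):
    """k-NN directly on the dual points, per time-stamp (Algorithm 6)."""
    L, N, _ = DP.shape
    I = np.empty((L, N, k), dtype=int)
    for i in range(L):
        nn = NearestNeighbors(n_neighbors=k + 1).fit(DP[i])
        I[i] = nn.kneighbors(DP[i])[1][:, 1:]    # drop the self-match
    return I

def duclknn(DP, k, K):
    """K-Means on (Ux, ax), then k-NN inside each cluster (Algorithm 7)."""
    L, N, _ = DP.shape
    I = np.empty((L, N, k), dtype=int)
    for i in range(L):
        labels = KMeans(n_clusters=K, n_init=10).fit_predict(DP[i][:, :2])
        for c in range(K):
            members = np.where(labels == c)[0]   # assumes > k members
            nn = NearestNeighbors(n_neighbors=k + 1).fit(DP[i][members])
            idx = nn.kneighbors(DP[i][members])[1][:, 1:]
            I[i][members] = members[idx]
    return I
```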

In the case of employing Algorithm 6 to run a k-NN query, we must focus on a specific time period during which we have at our disposal the dual points of all users' locations. Given that each user stays on the same sub-trajectory during the study period, privacy is preserved on that segment, since the k nearest neighbors remain unchanged. On the other hand, in the case of employing Algorithm 7, the clustering step comes first; we can again claim that the cluster composition remains the same, since the clustering method is applied in dual space and the mobile users keep the same dual point. As a result, the k nearest neighbors inside the cluster will remain the same. Hence, without loss of generality, in both cases privacy is piecewise preserved, except for the points of discontinuity (known as characteristic points), where the motion characteristics may change.

6.3.6 Vulnerability and Storage Efficiency

In this paper, we assume the mobile users’ trajectory on a real map with small velocities; thus, we use the Hough-X transform, since an object’s motion is mapped to the (U, a) dual point. To answer a k-NN query, the following steps are performed:

1. Decompose the k-NN query into 1D queries for the (t, x) and (t, y) projection.

2. For each projection, get the dual k-NN query by using a Hough-X transform.

3. Return the anonymity set, which contains the trajectory ids that satisfy the dual k-NN query in each projection.

In the following, the analysis focuses on the robustness estimation of the proposed approach based on Hough-X. Specifically, the ensuing steps are followed:

1. Split the initial trajectory into a number of linear sub-trajectories, each of which consists of the same number of M spatial points.

2. Apply Hough-X in each part.

Suppose that M is the number of points of the 1D trajectory that a dual point represents, and D is the number of dual points that describe the 1D trajectory projection (t, x) or (t, y) in dual space. Therefore, the whole trajectory has a length equal to DM spatial points, for which M ≫ D should hold. In the following, we camouflage a mobile user, who keeps track of a linear trajectory x(t) or y(t) or its corresponding dual point, with the k nearest neighboring dual points, which are very likely to remain the same in the next time-stamp. Actually, while users move on the linear sub-trajectory that corresponds to the same dual point, the k-NN set remains intact. Therefore, for as long as this happens, we can claim that the k-anonymity holds. Indeed, the privacy preservation is reinforced by a factor M, which lowers the so-called vulnerability level to 1/(kM).

We recall the spatial data security metric that we have already defined in [39] for the quantification and measurement of the robustness of our methods. In dual point space, the vulnerability remains equal to 1/k. Nonetheless, the vulnerability in the initial dataset is measured as follows. Since the points inside a sub-trajectory are protected by the same dual point, their vulnerability is considerably reduced to 1/(Mk); this entails that an intruder can distinguish the identity of a mobile user with probability equal to 1/(Mk). The same holds for all sub-trajectories. Hence, the vulnerability in each projection is defined as:

V_x = 1/(Mk),   V_y = 1/(Mk)   (6.1)

where V_x and V_y are the vulnerability measures based on Hough-X in the projections (t, x) and (t, y), accordingly. Next, the vulnerabilities of the two projections are combined, and the total vulnerability is written as in the following equation:

V_total = V_x · V_y · (M choose 2) = (1/(Mk)^2) · (M choose 2)   (6.2)

where (M choose 2) represents all combinations of the M points that correspond to 2 dual points of the initial trajectory.

Several trajectory compression approaches have been proposed, aiming at reducing the trajectory's size. An initial discrimination classifies the compression methods either as offline (after trajectory generation) or online (instantly, as objects move). Data compression constitutes a method that decreases the size of the data in order to limit the memory space and improve the efficiency of storage, processing, and/or transmission without loss of information. Various trajectory compression algorithms exist in the literature that try to balance the tradeoff between accuracy and storage size. We refer to some major ones, namely distance-based, velocity-based, semantic, similarity-based, and priority queue [68]. The proposed Hough-X based approach achieves trajectory compression suitable for either a single trajectory or a set of multiple trajectories. Without loss of information, Hough-X maps the spatial points of each linear sub-trajectory to their representative dual point. Compression can be achieved by applying the dimensionality transformation to increase the storage efficiency of the data. Suppose we reduce the three-dimensional data (x, y, t) to the Hough-X space of (t, x), that is, (Ux, ax). Storage space-saving is achieved through the number of available dual points D being less than the number of points M in the corresponding linear sub-trajectory; hence, for the whole trajectory, CR = MD/D = M, or CR = M : 1, where, for example, M spatial points correspond to one dual point, as shown in Figure 6.1. This conserves space and achieves more compression, as depicted in Figure 6.3, and is thus expected to have a greater impact on large-scale spatio-temporal databases.
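A small numeric sketch of Equations (6.1) and (6.2) and of the compression ratio; the parameter values below are arbitrary examples.

```python
# Evaluate the projection and total vulnerability (Eqs. 6.1, 6.2) and the
# compression ratio CR = M:1 for example values of M and k (illustrative).
from math import comb

def vulnerabilities(M, k):
    v_proj = 1.0 / (M * k)                    # Eq. (6.1), per projection
    v_total = v_proj ** 2 * comb(M, 2)        # Eq. (6.2)
    return v_proj, v_total

M, k = 100, 5                                 # points per dual point, k of k-NN
v_proj, v_total = vulnerabilities(M, k)
print(v_proj, v_total, f"CR = {M}:1")
```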

Potentially, by employing a dual method based on Hough-X, we could generate a trajectory codebook by applying the Hough-X transformation to all linear sub-trajectories of a given set of trajectories in a map region. In the training step, dual points that stem from the same linear part are similar and must be grouped into the same cluster; also, each cluster is assigned a single representative vector, called a dual code-vector. Hence, each trajectory inside the codebook is represented by its dual points.


Fig. 6.3 Theoretical curve of compression ratio for M = [10 100 1000 10000 100000].

At this point, we should note that the Hough method acts as a clustering one. Actually, K-Means is a popular method for both clustering and codebook design. In the coding step, each input dual-point vector is compressed to the nearest dual code-vector, referenced by a simple index. The index of the matched code-vector in the codebook is then transmitted to the decoder over a channel and is used by the decoder in order to retrieve the similar trajectory dual points from an identical codebook. The key operation is that only the index of the dual code-vector is stored and transmitted, rather than the entire code-vector. As a result, the recommended schema is space-compressed because of the duality, and is also more robust in comparison with the methods suggested in previous works [39, 53].
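The codebook scheme described above is essentially vector quantization of dual points; a minimal sketch with K-Means follows, where the codebook size and the synthetic data are illustrative assumptions.

```python
# Vector-quantization sketch of the dual-point codebook idea (illustrative).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
dual_points = rng.normal(size=(1000, 2))         # e.g., (Ux, ax) pairs

# Training: cluster dual points; centroids act as dual code-vectors.
codebook = KMeans(n_clusters=16, n_init=10).fit(dual_points)

# Coding: each dual point is replaced by the index of its code-vector.
indices = codebook.predict(dual_points)          # small integers to transmit

# Decoding: look the indices up in an identical codebook.
reconstructed = codebook.cluster_centers_[indices]
```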

6.3.7 Privacy Preservation Analysis

Privacy relates to individual data protection and the human right to be able to determine the information about oneself that is to be hidden. Privacy-preserving data management includes k-anonymity, a noted method for data anonymization before publication, which has also been studied in the context of trajectory data. The authors in [10] claim that, given a set of trajectories, the objective of data publication is to transform them into some k-anonymized form before release, so as not to put at risk the privacy of the individuals related to the data. In addition, they mention that an intruder, who knows a sub-trajectory of the original trajectory of an individual, may utilize it with the aim of extracting the whole trajectory of that person from the published data. Finally, they recognize an upper bound for the re-identification probability of the whole trajectory within the released data, namely 1/k, where the parameter k reflects the expected level of privacy.

Our solution transforms the original spatial point into a dual point using a bijective mapping, such as Hough. This technique allows for a k-NN search directly on the transformed points, thus providing stronger location privacy. Assuming an insecure Transformed Database Management System (TDBMS), possibly located at a third party (e.g., a service provider in the cloud), an attacker sees its environment. In particular, the attacker has access to the transformed database, to the queries upon the transformed data, as well as to the results. Also, we suppose that the attacker is aware of the dual transformation scheme and aims to retrieve the original database by executing the Hough-X and/or Hough-Y algorithms with respect to the size of the database. Nonetheless, in this work, we aim to prevent an attacker from obtaining the original database, as they may possess extra knowledge about it. To better evaluate the power of the transformation scheme, we taxonomize the attacks into different levels based on the possessed knowledge.

1. Level 1: The attacker only observes the transformed database.

2. Level 2: Besides the transformed database, the attacker knows a set of plain tuples of the original database, but does not know the corresponding encoded values of those tuples in the transformed database.

3. Level 3: Apart from the transformed database, the attacker observes a set of tuples in the original database, and thus knows the corresponding encoded values of those tuples.

A few cryptography-based approaches, such as homomorphic encryption (HE), verifiable computation (VC), and secure multi-party computation (MPC), have been designed in order to provide secure big-data processing in the Cloud [177]. However, other approaches, such as Asymmetric Key Cryptography and trusted Public Key Infrastructure, have been developed over the years in order to support privacy preservation in the spatio-temporal domain. The basic idea behind these techniques is to encrypt the identity of the user prior to sending it to the service provider. In this way, the service provider does not have any knowledge about the real identity of the individual who initiated the k-NN query. To prevent an external adversary from linking queries to the same mobile object, its pseudonym has to be secure. For this reason, we are concerned with pseudonym recovery and registration protocols, consisting of three entities, namely Users (U), an Identity Provider (IP), and Service Providers (SP). Recall that they are based on Brands' credentials and have been suggested by Brands in the context of "The New System", with the aim of making the communication more reliable and secure. We believe that the adoption of these protocols will reinforce the identity privacy of mobile objects and of spatio-temporal databases at large. For the sake of completeness, the main steps of these protocols, along with the privacy preservation properties they offer, are presented. The mobile user U performs the following protocol in order to retrieve a set of pseudonyms from the identity provider (IP):

Initially, user U chooses random values r_{1,1}, r_{1,2}, \ldots, r_{1,m}, e \in Z_q, where e is known only to user U, then computes the quantity t_1 = g_1^{r_{1,1}} g_2^{r_{1,2}} \cdots g_m^{r_{1,m}} g_{m+1}^{e} \in G_q and finally sends it to the IP (g_1, g_2, \ldots, g_{m+1} \in G_q).

Secondly, the IP receives the quantity t_1, selects random quantities r_{2,1}, r_{2,2}, \ldots, r_{2,m} and computes the product t = t_1 \cdot t_2, where t_2 = g_1^{r_{2,1}} g_2^{r_{2,2}} \cdots g_m^{r_{2,m}}.

Thirdly, user U creates the r_i according to the equation r_i = r_{1,i} + r_{2,i} for i = 1, 2, \ldots, m and computes the quantity t = g_1^{r_1} g_2^{r_2} \cdots g_m^{r_m}. Hence, the user creates m pseudonyms (P_i, sign(P_i)) and values s_i \in Z_q, such that P_i = (t f_0)^{s_i} for i = 1, 2, \ldots, m.
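As a toy illustration of the commitment arithmetic above, the following sketch works in a subgroup of prime order q of Z_p^* with p = 2q + 1. The tiny parameters and all names are assumptions chosen for readability, not secure values and not the thesis implementation.

    import scala.util.Random

    object PseudonymSketch extends App {
      val p = BigInt(23); val q = BigInt(11)       // toy primes, p = 2q + 1
      val g = Seq(BigInt(4), BigInt(9), BigInt(3)) // toy generators, m = 3
      val gM1 = BigInt(13); val f0 = BigInt(2)     // g_{m+1} and f_0 (toy values)
      val rnd = new Random(7)
      def rand(): BigInt = BigInt(q.bitLength, rnd) % q

      // commitment: product of g_i^{e_i} mod p
      def commit(gs: Seq[BigInt], es: Seq[BigInt]): BigInt =
        gs.zip(es).foldLeft(BigInt(1)) { case (acc, (gi, ei)) => acc * gi.modPow(ei, p) % p }

      val r1 = Seq.fill(3)(rand()); val e = rand() // user's randomness; e stays secret
      val t1 = commit(g :+ gM1, r1 :+ e)           // sent to the IP
      val r2 = Seq.fill(3)(rand())                 // IP's randomness
      val r  = r1.zip(r2).map { case (a, b) => (a + b) % q } // r_i = r_{1,i} + r_{2,i}
      val t  = commit(g, r)
      val si = rand()
      val Pi = (t * f0 % p).modPow(si, p)          // pseudonym base P_i = (t f_0)^{s_i}
      println(s"t1=$t1 t=$t Pi=$Pi")
    }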

A mobile user U registers a pseudonym (P_i, sign(P_i)) with a service provider SP_i by presenting the pseudonym (P_i, sign(P_i)) and disclosing the value r_i encoded in P_i. The user performs the following proof of knowledge with the service provider SP_i, provided that P_i ≠ 1, so that the tuple (P_i, sign(P_i)) is a valid one.

PK\{(\delta_1, \ldots, \delta_{i-1}, \delta_{i+1}, \ldots, \delta_m, \varsigma) : (\delta_1, \ldots, \delta_{i-1}, r_i, \delta_{i+1}, \ldots, \delta_m, \varsigma) = rep_{(g_1, \ldots, g_m, g_{m+1})}(P_i f_0^{-1})\}   (6.3)

Then, the service provider SP_i stores the tuple (P_i, r_i) and associates it with either a new or an existing user account. Through this protocol, the user demonstrates ownership of the pseudonym and proves that the disclosed value r_i is actually the value encoded in P_i. The privacy preservation lies in the following facts:

1. The service provider cannot find out any additional information about the quantities encoded in P_i, except for the disclosed value r_i.

2. The random set (r_1, r_2, \ldots, r_m, e) is created so that nobody (neither the user nor the identity provider) can control its final value.

3. The value e is randomly selected by the user so that it remains unknown to the IP. The user also computes a secret key (r_1 s_i, r_2 s_i, \ldots, r_m s_i, e s_i), one for each pseudonym (P_i, sign(P_i)); the user can prove knowledge of it without unveiling it.

4. The Discrete Logarithm assumption in a group G_q of prime order q, along with the values r_i \in Z_q (i = 1, 2, \ldots, m), ensures that any malicious user MU, irrespective of the level of knowledge they possess about the original and transformed database, even if they engage in a pseudonym recovery protocol with the IP and obtain a valid pseudonym (P, sign(P)), has only negligible probability of learning the values encoded in the public key P.

Suppose a mobile user has initiated a discrete number of k-NN queries, with a different pseudonym for each one. The unlinkability that the aforementioned protocols provide relates to the service provider's inability to link the different pseudonyms to that mobile user, or to validate, together with the IP, that they belong to the same user. Thus, the privacy of that user's identity is preserved.

6.3.8 Experimental Data and Environment

The experimental data used in this chapter were obtained from the SMaRT Database GIS Tool (http://www.bikerides.gr/thesis2/). The experiments were based on trajectory datasets of bike riders in the area of Corfu, Greece. For each trajectory point, the Hough-X and Hough-Y dual points, that is, the values of (Ux, ax), (Uy, ay), (bx, wx), and (by, wy), were available for L time-stamps. The environment where the experiments were carried out had the following characteristics: Intel(R) Core(TM) 2 Duo CPU E8400 @ 3.00 GHz, 16 GB of memory, 64-bit operating system, x64-based processor, and Matlab 2018a.

6.4 Results

In this section, we present several experiments based on a real dataset, with the parameters and relevant values given in Tables 6.2–6.4 and the results in Figures 6.4–6.7. Our aim is to evaluate the performance of Algorithms 6 and 7 in terms of vulnerability. We experimented on datasets of size N ∈ {87, 995, 1000}.

Table 6.2 Parameters for the experiments using only Hough-X of x, and Hough-X of both x and y, for N = 1000 trajectories, L = 10 time-stamps (Figure 6.4a,b) and N = 995 trajectories, L = 100 time-stamps (Figure 6.4c,d).

K    k    Clustering Attributes   k-NN Attributes   Clustering Attributes   k-NN Attributes
5    10   (Ux, ax)                (Ux, ax)          (Ux, ax, Uy, ay)        (Ux, ax, Uy, ay)
5    20   (Ux, ax)                (Ux, ax)          (Ux, ax, Uy, ay)        (Ux, ax, Uy, ay)
5    30   (Ux, ax)                (Ux, ax)          (Ux, ax, Uy, ay)        (Ux, ax, Uy, ay)
10   20   (Ux, ax)                (Ux, ax)          (Ux, ax, Uy, ay)        (Ux, ax, Uy, ay)

Table 6.3 Parameters for the experiment using Hough-X of x and suppressing Hough-X of y (Exp1) for N = 1000 trajectories, L = 10 time-stamps (Figure 6.5a,b) and using Hough-X of y and suppressing Hough-X of x (Exp2) for N = 995 trajectories, L = 100 time-stamps (Figure 6.5c,d).

K    k    Clustering Attributes   k-NN Attributes (Exp1)   k-NN Attributes (Exp2)
5    10   (Ux, ax, Uy, ay)        (Ux, ax, ∗, ∗)           (∗, ∗, Uy, ay)
5    20   (Ux, ax, Uy, ay)        (Ux, ax, ∗, ∗)           (∗, ∗, Uy, ay)
5    30   (Ux, ax, Uy, ay)        (Ux, ax, ∗, ∗)           (∗, ∗, Uy, ay)
10   20   (Ux, ax, Uy, ay)        (Ux, ax, ∗, ∗)           (∗, ∗, Uy, ay)

Table 6.4 Parameters for the experiment using (x, y) for clustering and Hough-X for k-NN for N = 87 trajectories and L = 100 time-stamps (Figure 6.7c,d).

K    k    Clustering Attributes   k-NN Attributes
3    8    (x, y)                  x
3    8    (x, y)                  (x, y)
3    8    (x, y)                  (Ux, ax)
3    8    (x, y)                  (Ux, ax, Uy, ay)


Fig. 6.4 Both clustering and k-NN: (a) (Ux, ax) and (b) (Ux, ax,Uy, ay) for N = 1000 trajectories, L = 10 time-stamps, (c) (Ux, ax) and (d) (Ux, ax,Uy, ay) for N = 995 trajectories, L = 100 time-stamps.

6.4.1 Vulnerability Evaluation in Hough Space

In the context of the proposed work, we focused on k-anonymity from a different perspective, as we employed the Hough-X transformation of the (x, y) spatial data to formulate the anonymity set. The number of clusters is denoted by K, while k refers to the number of nearest neighbors in terms of Euclidean distance. In Figures 6.4 and 6.5, both approaches achieve similar performance, which improves as k increases. Although for low values of k the vulnerability remains relatively high, the more nearest neighbors are utilized to form the anonymity set, the lower the vulnerability becomes. Actually, Figure 6.4b,d depicts that the use of the Hough-X transformation of the y attribute considerably improved the performance in both cases. To ameliorate the classification accuracy, we considered the Hough-X attributes of y as well.


Fig. 6.5 Clustering with (Ux, ax,Uy, ay) and suppressing k-NN: (a) (Ux, ax, ∗, ∗) and (b) (∗, ∗,Uy, ay) for N = 1000 trajectories, L = 10 time-stamps, (c) (Ux, ax, ∗, ∗) and (d) (∗, ∗,Uy, ay) for N = 995 trajectories, L = 100 time-stamps.

We should note that the information (Ux, ax) in combination with (Uy, ay) increased the robustness of both methods for the same number of nearest neighbors. This entails that the performance of the k-NN classifier improved, and thus the anonymity set shows less variation over time from one time-stamp to another. In the following, we employ the suppressing k-anonymity method for the composition of the k-anonymity set. In particular, we applied K-Means clustering that takes advantage of (Ux, ax, Uy, ay), while k-NN is applied either on (Ux, ax) or (Uy, ay). Here, the clustering method presents much better performance than the non-clustering one. However, for the same number of k, the performance of both methods is worse than in the first case, where we based it on the attributes (Ux, ax, Uy, ay) for both clustering and k-NN computation. The experimental results in Figure 6.5 present the performance of attribute suppression in terms of k-anonymity set computation.

Subsequently, a scenario with synthetic data derived from the real trajectory dataset was considered. More specifically, for each dual point, a number of copies M were generated, reflecting that these dual points correspond to the same linear sub-trajectory of the trajectory. Figure 6.6a,b shows that the DukNN (non-clustering) and DuCLkNN (clustering) methods have identical performance for M = 5, k = 10 and K = 5, and we verify that vulnerability is piece-wise preserved, except for the characteristic points. In Figure 6.6c, we compare the vulnerability in the Hough-X space of the x and y attributes with the one in the native dimensional space of x and y. We observe that the results in native space are better by almost 5% than the ones in Hough-X space. This may relate to the linear dependency of the dual space on the native one.


Fig. 6.6 Clustering with (Ux, ax) and k-NN with (Ux, ax):(a) Mobile User 10 and (b) Mobile User 100 for N = 995 trajectories, L = 50 time-stamps, (c) Vulnerability measure in dual Hough-X and native dimensional space of (x, y).

6.4.2 Vulnerability Evaluation in Hybrid Space

In this subsection, we consider the case where clustering takes place in the spatial coordinate space while the k-NN query is issued in Hough space. Here, the dataset concerns the compressed version of mobile users' trajectories as derived from SMaRT. Figure 6.7a,b depicts information about the initial trajectory length and the selected points, as well as the compression ratio per trajectory ID. Note that the compression ratio is computed as CR = (1 − SelectedPoints/InitialPoints) × 100%, where the number of selected points is 100. From the dataset, we exclude 13 trajectories whose length is much less than 100, the average length of the compressed trajectories. Another observation is introduced in Figure 6.7c,d, where the vulnerability in hybrid space has similar behavior and performance to the one in spatial coordinate space. This relates to the fact that Hough-X constitutes a linear transformation of the spatial coordinates. Again, the employment of the suppressing method, as shown in Table 7.4, further reduces the vulnerability of the k-NN query with the clustering method.
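The compression ratio above reduces to a one-line function; this sketch, with assumed names, reproduces the computation per trajectory.

    // CR = (1 - SelectedPoints / InitialPoints) * 100%
    def compressionRatio(initialPoints: Int, selectedPoints: Int): Double =
      (1.0 - selectedPoints.toDouble / initialPoints) * 100.0

    // e.g., a raw trajectory of 800 points compressed to 100 selected points:
    // compressionRatio(800, 100) == 87.5 (percent)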


Fig. 6.7 (a) Initial points per trajectory and (b) compression ratio for N = 87 trajecto- ries, L = 100 time-stamps. Clustering with (x, y) and k-NN: (c) x and (Ux, ax) and (d) (x, y) and (Ux, ax,Uy, ay) for N = 87 trajectories, L = 100 time-stamps.

6.5 Discussion

The Hough Transform is a robust method used in Image Analysis and Computer Vision. The core idea is to map data onto the dual parameter space and then interpret it through classification and clustering. The major role of the Hough Transform is to detect straight lines and compute their representatives, that is, dual points. The term "representative" is strictly connected with Clustering Using Representatives (CURE), an efficient data-clustering algorithm for large-scale databases. Compared with K-Means clustering, which has already been addressed in our previous work, CURE is more robust to outliers and able to identify clusters with non-spherical shapes and size variances. Trajectory Clustering (TRACLUS) [105, 128] is a characteristic algorithm which has been designed for partitioning trajectories, applying clustering among the partitions of different trajectories, and finally finding the representative sub-trajectory for each cluster, as presented in Figure 6.8. Several representative spatio-temporal clustering methods have been reviewed in a more recent work [150]. Nevertheless, in our work, we use this term with a different meaning, defining the notion of the representative point of a trajectory's sample points: the representative point of a number of spatial points that belong to the same line segment is their dual transformation point. Our framework can be combined with any existing clustering algorithm. As a preliminary approach, we chose K-Means, which applied on-line clustering on dual-point data for a number of mobile objects' trajectories at specific time-stamps.


Fig. 6.8 Trajectory partition, grouping, and representatives.

This work investigated the impact of Hough-X, which has already been applied to range queries, on the robustness of the methods proposed in [39] for addressing secure k-NN queries. The experimentation with the number of clusters K, which should be known in advance, obviously only affected the method utilizing clustering; there, vulnerability behaves slightly worse as the number of clusters increases. Indeed, when adding features, the data cluster density decreases and the model becomes sparser, and hence the clustering task becomes even more difficult. A usual phenomenon and important part of Machine Learning is the reduction of a higher-dimensional space into a lower-dimensional one in order to avoid the Curse of Dimensionality. An important property of Hough is its robustness to low-quality or uncertain data (either due to non-uniform sampling or noise) [106]. Therefore, even if a trajectory is represented by different sample points in 2D Euclidean space, in Hough space it may have the same dual points. Under this condition, Hough space reflects mobility patterns better than the original trajectory spatial data (x, y), leading to more homogeneous clusters and improving the k-NN performance. As the experimental results in Figure 6.6a,b verify, the above properties have a positive impact on vulnerability, which is piece-wise preserved, showing that the clusters exhibit within-cluster spatio-temporal similarity. The authors in [196] provide an efficient scheme for representative clustering on uncertain data. Finally, assuming feature suppression, the method with clustering demonstrates higher robustness, or lower vulnerability, which is the main issue in k-anonymity and thus in privacy preservation. This case shows the superiority of the method with clustering in terms of vulnerability. Indeed, when mobile users are protected by k nearest neighbors based on lower-dimensionality data than the ones used in clustering, it is more difficult for an attacker who has access to history data to link the public information of the k − 1 nearest neighbors (that is, unlinkability holds).

6.6 Conclusions

In conclusion, we carried out research on privacy preservation based on real spatio-temporal data, through which we demonstrated the impact of the parameters k and K on the vulnerability of the proposed methods. We observed that increasing k benefits both methods, verifying that the security of a mobile user is more robust when the latter is protected by a high number of nearest neighbors. This chapter proposes the application of k-NN queries based on dual points of the Hough-X projection, with the aim of reinforcing the anonymity of k-NN queries and decreasing storage requirements. More specifically, we investigated the problem from the perspective of dual-point attributes. The experimental results indicate that although the outcomes of the Hough-X-based vulnerability are not optimal in comparison with the spatial coordinate space, the difference between them is less than 5%, which still makes Hough-X an appropriate choice for storage-efficient privacy preservation.

The SMaRT framework approximates users' non-linear trajectories with linear ones from time-stamp to time-stamp, and the current results are concerned with the low data-sampling rate. A challenging and open issue is experimentation on the impact of the data sampling rate (low and high) on the described procedures and transformations. Also, we plan to extend and/or enhance the proposed methods to be applicable to 3D (x, y, z) trajectories, in order to represent real situations such as, for example, tracing the GPS trajectories of birds observed with devices or drones. In such a case, the dual methods can be applied to the z projection in the same way as to the x, y ones. Additionally, we intend to evaluate the efficiency and scalability of the suggested approaches on big spatio-temporal databases in a distributed environment, that is, in the cloud, and compare their performance with appropriate indexing methods. Our aim is to make SMaRT suitable for supporting k-NN queries based on the proposed methods.

Ultimately, it will be useful to evaluate the time of some transactions (e.g., a roll-back, where the end user gets lost and decides to return the same way to a certain point, or looks for another way), that is, how long the end user will take to receive the answer from the Database Management System (DBMS) with the aforementioned implemented procedures, compared to those used nowadays.

CHAPTER 7

Trajectory Clustering and k-NN for Robust Privacy Preserving k-NN Query Processing in GeoSpark

7.1 Introduction

There is no doubt that we live in the era of Big Data. Over the last decade, thanks to technological advances, information systems have favored automatic and effective data gathering, resulting in a considerable increase in the amount of available data. A wide range of data is produced daily: scientific, financial and health data, as well as data from social media, are just some examples of sources. However, this data is useless without the extraction of the underlying knowledge, a major challenge for researchers, as classical machine learning methods cannot deal with the volume, value, veracity and variety that big data brings [81]. Therefore, existing machine learning techniques, which must deal with the 4 Vs [44], have been or need to be redefined for efficiently processing and managing such data, as well as for obtaining valuable information that can benefit not only scientists but also businesses and organizations. Actually, the recent advances in distributed technologies can be utilized to enable scientists to rapidly discover hidden or unknown patterns in the 4 Vs [178]. Nonetheless, most of the existing methods fail to directly tackle the increased number of attributes and records of databases, due to their computational complexity. Hence, data mining techniques should be able to handle data scalability, dimensionality, uncertain data and/or data preprocessing. Data preprocessing constitutes an important step before the data mining process and, as a result, data mining algorithms are designed to accept the specific data formats that are best suited to them.

With the advances in wireless and mobile technologies, moving objects are equipped with location-positioning sensors and wireless communication capabilities; thus, large-scale spatio-temporal data is being generated, and an urgent need for efficient query processing approaches, dealing with both the spatial and temporal attributes, has arisen [149]. In particular, with the advent of mobile and ubiquitous computing, spatio-temporal query processing on moving objects databases has become a necessity for many applications, such as traffic control systems, geographical information systems, and location-aware advertisement. Hence, time-dependent versions of k nearest neighbor (k-NN) queries need to be studied. According to [63], k-NN queries can be distinguished into four categories: (i) both query and data objects are static, (ii) moving query but static data objects, (iii) static query but moving data objects and (iv) both query and data objects are moving. When a mobile object's location data changes, a Snapshot k-NN (SkNN) query is issued at each location. However, in highly dynamic spatio-temporal applications, where the moving objects' data varies frequently over time, a fundamental query is the so-called Continuous k-NN (CkNN) [76]. A CkNN query belongs to the fourth category, as it instantly retrieves the k nearest neighbor objects of a moving query object at each time instant within a given time interval. Practically, the CkNN spatio-temporal query is evaluated on a high number of consecutive timestamps and can be considered as a series of frequent SkNN queries. To the best of our knowledge, none of the existing approaches, such as [76, 45, 193] to name a few, addresses the problem of CkNN query processing on non-linear trajectories under the support of the Hough transformation. It is worth mentioning that, in previous works, the processing of CkNN queries has been considered in road networks with moving objects following linear trajectories.

In this chapter, our efforts are devoted to processing a CkNN query assuming that each object follows a non-linear trajectory. Such a trajectory is approximated as piece-wise linear among the timestamps where an object's velocity changes slightly; a storage-efficient strategy adopted in our previous work [38]. Some previous works investigate the problem of CkNN for moving objects with fixed velocity, while others consider uncertain velocity in road networks [77, 46]. Nonetheless, once an object's velocity changes, the CkNN query has to be re-executed. Such a case is quite frequent in highly dynamic environments; thus, the performance of these techniques would be significantly degraded, resulting in increased query cost (because of re-evaluation). The focus of this work is on k nearest neighbor queries, which are considered under both a spatio-temporal and a Hough-transformed database. The Hough method can mitigate such issues since it is piece-wise constant [38]; thus, the CkNN query is not evaluated as frequently as in the Euclidean space with non-linear trajectories. Actually, the application of k-NN queries in Hough space facilitates the maintenance and consistency of continuous query results [71], even when the moving objects' location data are updated. The aforementioned issue relates to the fact that the query result needs to be updated only upon specific changes, namely, whenever the moving angle direction and/or velocity vary dramatically. Nevertheless, while an object (including the query object) is moving on a linear part of its trajectory (maintenance phase), the query result will not change, meaning that the query results of all the objects will not need to be revised until the objects transit to the next linear parts of their trajectories, which indicates that the maintenance phase expires.

Moreover, with the extensive adoption of the Cloud, research on privacy preservation has appealed to many researchers [173, 180]. Let us recall that the ultimate purpose of the processing is to formulate the k-anonymity set to protect users' privacy in the cloud computing environment. Especially, in the context of this work, we are challenged with the necessity to apply temporally continuous spatial k-NN queries in a distributed architecture, with the ultimate goal of forming the anonymity set for a group of moving objects in a fast and efficient manner. In summary, the main contributions of this chapter are as follows:

• A continuous query processing algorithm is considered that efficiently answers the spatial k-NN query with the aid of an indexing method or otherwise. From the query result, in each timestamp and for all moving objects, the desired anonymity set is formulated.

• The adopted method is designed with the aid of GeoSpark spatial data operations to compute the k-anonymity set which is used to measure the possibility of each object identity being unveiled as it moves from one location to another. To be more specific, vulnerability evaluation is conducted in two-dimensional space (both Euclidean and Hough space), considering different pairs of features, as it will be demonstrated in Section 7.4.

• A comprehensive set of experiments is conducted to evaluate the time performance of the proposed method in the GeoSpark environment under different data sizes.

The rest of this chapter is organized as follows. In Section 7.2, previous related works are presented in relation to our approach. In Section 7.3, the following are described: (a) the problem definition, (b) the system model and (c) the proposed GeoSpark framework for the k-anonymity set computation. Section 7.4 presents the experiments conducted for the evaluation of the studied problem, while Section 7.5 provides a discussion of the results in relation to previous works. Finally, in Section 7.6, conclusions and future directions of this work are recorded.

7.2 Related Work

The domain of the efficient management of spatio-temporal data has different aspects and extensions which are worth studying, from storage and indexing to the time-efficient and robust issuing of spatio-temporal queries. It is pointed out here that, in the context of this work, we focus on time-efficient privacy-preserving spatio-temporal k-NN queries.

7.2.1 Distributed Frameworks for Spatio-Temporal Data Query Processing

Due to the explosive growth of spatio-temporal data, the domain of distributed execution of spatial queries has gained considerable attention. In [66], a novel framework, known as STARK, is recommended for spatio-temporal data management, which includes spatial partitioners, different modes for indexing, and filter, join, and clustering operators. In contrast to existing solutions, STARK is integrated into Spark programs and provides more flexible and comprehensive operators. Several application scenarios of this framework are presented in [67]. Moreover, the authors in [191] introduce a new abstraction called IndexTRDD to manage trajectory segments, which exploits a global and local indexing mechanism to accelerate trajectory queries. Also, they adaptively update the partition structure based on the change of data distribution to alleviate the partitioning overhead.

In [2], a scalable system for massive trajectory data management is elaborated, which modifies the three core layers of ST-Hadoop. The authors in [54] evaluate the distributed execution of spatial SQL queries in the GeoSpark and STARK systems. Moreover, in order to address the challenges of high-velocity location data, the authors of [131] propose a distributed in-memory spatio-temporal data processing system, which includes a distributed in-memory index and storage infrastructure built on a distributed in-memory programming paradigm. The location records are distributed across a cluster of nodes using the producer–consumer model.

7.2.2 Efficient Privacy Preserving k-NN Queries

In spatial databases, the processing of k-NN queries over stationary objects has been extensively studied. Over the last decade, thanks to technological advances, real-time spatio-temporal data of moving objects can be monitored and subsequently processed. Hence, continuous k-NN querying in real-time and dynamic environments has attracted the attention of researchers [179]. In [74], the authors attempt to develop an efficient algorithm to process k-NN queries on uncertain locations of objects. A probability model is designed to quantify the possibility of each object being one of the k nearest neighbors. The uncertainty of an object's location lies in the fact that its position is monitored by a sensor-based tracking system instead of GPS.

In a recent work, the authors study the problem of CkNN queries on moving objects to retrieve the k-NNs of all points along a query trajectory in spatial road networks [37]. A novel direction-aware CkNN algorithm, the so-called DACKNN, is recommended. To ensure an efficient query, the algorithm excludes from the analysis moving objects that are far away from the query point. In [169], a fast continuous-query privacy-preserving framework in road networks is recommended, based on the concepts of both k-anonymity and l-diversity.

The authors in [155] propose a method to make location cloaking less vulnerable to query tracking attacks. The proposed method is applied on road networks, such as subways, railways, and highways, where the road network is known and fixed, except for the trajectories. It is called adaptive-fixed k-anonymization and generates smaller cloaking regions without compromising the privacy of the query issuer's location.

Furthermore, the authors in [189] study the problem of location disclosure adopting the k-anonymity method in a centralized architecture based on a single trusted anonymizer. However, this strategy may compromise user privacy involving continuous LBSs. A dual-K mechanism (DKM) is suggested to protect the users' trajectory privacy for continuous LBSs. The proposed method first inserts multiple anonymizers between the user and the location service provider (LSP). The k-anonymization is achieved by sending k query locations to different anonymizers. To improve user trajectory privacy, the dynamic pseudonym mechanism is combined with the location selection one. Hence, the user trajectory (spatio-temporal points) can be obtained by neither the LSP nor the anonymizer.

Note that in our previous work [38], a pseudonym system is recommended to ensure and/or reinforce the privacy of mobile objects. Actually, a mobile user can initiate a discrete number of k-NN queries at each spatio-temporal point, with a different pseudonym for each one. The recommended protocols provide unlinkability; thus, the service provider cannot collude with the IP and validate that the pseudonyms belong to the same user. The main characteristic of this approach is that the k-NN queries can be issued not only in Euclidean space but also in Hough-X/Y space. As the analysis and results show, Hough space is an appropriate solution, as it preserves a user's privacy and provides storage efficiency as well, assuming that the initial non-linear trajectory of a moving object is split into a set of linear sub-trajectories.

In this chapter, the problem of k-anonymity for privacy preservation in spatio-temporal databases is evaluated in a distributed environment. Although traditional privacy-preserving solutions have been designed in Euclidean space, our framework considers the concept of k-anonymity in Hough space as well. Due to the constant evolution of the mobile objects' location information in time, it is required to evaluate massive single-query-point k-NN queries for massive numbers of mobile objects per timestamp. The spatio-temporal k-NN queries are issued with the aim of formulating the k-anonymity set of moving objects. This set is computed online based on all objects' trajectory points in each timestamp and consists of the ids of the k nearest objects. Specifically, in each timestamp, a k-NN query, called Snapshot Trajectory Point k-NN (STkNN), is issued taking into consideration the selected features (e.g., Euclidean coordinates, angle and velocity, dual points) of all the objects. Assuming a high time sampling rate, we can claim that the process is similar to a Continuous Trajectory Point k-NN (CTPkNN) query. As we have already mentioned, an important characteristic of a continuous k-NN query in Hough space is that the k-NNs between two consecutive spatio-temporal points remain the same. Based on this characteristic, the problem of performing repetitive queries can be considerably reduced to finding the k-NNs in the specific spatio-temporal points where a mobile object's velocity varies, indicating a new linear sub-trajectory of the initial non-linear trajectory. Unlike our and other previous works, here, the key idea is to evaluate the proposed method for the k-anonymity set computation in an environment suitable for efficient k-NN queries on spatial or dual-point data.

7.3 Materials and Methods

This section provides the necessary background knowledge for the remainder of the chapter. Initially, the k-NN algorithm, a core component of the adopted privacy-preserving methodology, along with its weaknesses in tackling big data problems, is presented. In the following, useful definitions and notations are recorded under the problem definition, with the most characteristic being the spatial indexing and partitioning methods. Moreover, the GeoSpark components for CkNN query processing, with the aim of formulating the k-anonymity set, are described in detail.

7.3.1 Operations on Spatial Data

Querying spatial data is an operation that is usually coupled with indexing methods. Several indexing methods have been considered in the literature, as they are crucial for the performance of spatial data query processing algorithms, since they are used to reduce the query run time. The most representative are the ones based on the R-tree and the Quad-tree. Further, to support efficient query processing on moving objects, grid-based space partitioning methods can be adopted [43]. Moving objects' data are indexed in the grid cells they belong to, facilitating the queries and avoiding checking all the objects.

The aim of a spatial partitioning technique is to improve the query time as well as to keep all the partitions balanced in terms of memory and computations, which is known as load balancing [185]. Equal-grid partitioning uniformly divides the whole region, thus providing good data locality but not load balancing. Also, the Quad-tree is another data structure, based on the divide-and-conquer principle, that recursively divides two-dimensional space into four quadrants and needs a merging operation to construct the specified number of partitions. On the other hand, the R-tree provides an efficient data partitioning strategy to index spatial data; it is a balanced search tree that improves both search speed and storage utilization. Another space partitioning strategy is based on the KDB-tree, a balanced binary tree, which has been used for load balancing in spatial databases for fast querying. It is worth mentioning that data locality and load balancing are important for speeding up query performance [51].

7.3.2 The k-NN Classifier from Big Spatial Data Perspective

The k-NN algorithm is a popular non-parametric method that can be used for both classification and regression tasks. In the following, we discuss the k-NN classification problem from the big spatial data viewpoint. The main components of the k-NN classifier are:

• TR, a training mobile objects dataset of size N,

• TS, a testing mobile objects dataset of size M,

• o_n, a mobile object represented as a tuple of the form (f_{n1}, f_{n2}, \ldots, f_{np}, cl), where f_{np} is the value of the p-th feature of the n-th object and cl is the class it belongs to, denoted as o_n^{cl}, and

• cl, which is only known for the TR dataset.

In the classification process, for each test object t ∈ TS, the k-NN algorithm searches for the k closest objects in the TR set, computing the distances (specifically, the Euclidean distance) between the test mobile object and all the mobile objects in TR. The distances from all training objects are ranked in ascending order, and then the k nearest objects (knn_1, knn_2, \ldots, knn_k) are kept to find the dominant class cl. Despite its remarkable performance in real-world applications, k-NN lacks the scalability to manage large-scale datasets. The time complexity to find the k nearest neighbors for a single test mobile object is O(Np), with an extra O(N log N) for the distance sorting process; here, N is the size of the training dataset and p is the number of object features. The classification process needs to be repeated for all the test mobile objects. Additionally, the k-NN model requires the training data to be stored in memory in order to achieve fast computation of the distances. However, large-scale TR and TS sets may not fit in RAM.
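A minimal single-machine sketch of this classification step, assuming in-memory datasets and illustrative names, makes the O(Np) distance pass and the O(N log N) ranking explicit.

    case class LabeledObject(features: Array[Double], cl: String)

    // Classify one test object against the training set TR by majority vote
    // among its k nearest neighbors (Euclidean distance).
    def classify(tr: Seq[LabeledObject], test: Array[Double], k: Int): String = {
      def dist(o: LabeledObject): Double =
        math.sqrt(o.features.zip(test).map { case (a, b) => (a - b) * (a - b) }.sum)
      tr.sortBy(dist)                  // O(N log N) ranking of all training objects
        .take(k)                       // keep the k nearest neighbors
        .groupBy(_.cl)                 // group them by class label
        .maxBy(_._2.size)._1           // return the dominant class
    }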

7.3.3 Problem Definition

A spatio-temporal database, whose records are moving objects with geolocation attributes in a two-dimensional space D, is assumed. In real-world examples, objects move arbitrarily, which opposes the fixed-velocity assumption that some works adopt. Hence, it is considered that the objects' velocity is low, with values between a minimum and a maximum, i.e., u ∈ [u_min, u_max]. In addition, we consider a low data sampling rate, since with a high data sampling rate it would be difficult to apply linear interpolation between sampled points, and the adopted dual methods would have to be redesigned with appropriate curve fitting methods; in that case, the velocity calculated from two consecutive way points could not represent the real velocity well, and the Euclidean distance would not work. In the following, an overview of the studied problem's definitions, along with notation, is presented.

Definition 7.3.16. The trajectory of a moving object is assumed to be a continuous piece-wise linear function, which maps the temporal dimension to the two-dimensional Euclidean space, connecting a sequence of points (x_1, y_1, t_1), (x_2, y_2, t_2), \ldots, (x_L, y_L, t_L) for t_1 < t_2 < \ldots < t_L.

Such a representation entails that the n-th object's position at time t_i is pos_n^{t_i} = (x_i, y_i) for n = 1, 2, \ldots, N, and that during each time interval (t_i, t_{i+1}), the object moves along a straight line from (x_i, y_i) to (x_{i+1}, y_{i+1}) with a constant velocity (amplitude and direction). To ensure efficient storage and handling of the queries, the expected position of the object at any time t ∈ (t_i, t_{i+1}), where 1 ≤ i ≤ L − 1, is obtained by a linear interpolation between (x_i, y_i) and (x_{i+1}, y_{i+1}). In this way, a number of additional features can be extracted, such as the velocity u, the angle direction θ, and the Hough-X and/or Hough-Y transformation of (x, y) [38].

Definition 7.3.17. A moving object's trajectory (spatio-temporal) snapshot is defined as the location data of that object at a specific timestamp; thus, a single trajectory is stored as a collection of location snapshots denoted as \{pos_n^{t_i}\}_{i=1}^{L}.

Definition 7.3.18. Snapshot Distance: Given two objects o_1 and o_2 with location snapshots pos_1^{t_i} and pos_2^{t_i}, respectively, the l_2-norm (Euclidean) distance between o_1 and o_2 in D at timestamp t_i is computed as

dist_{o_1,o_2}^{t_i} = \sqrt{\sum_{j=1}^{D} \left( pos_1^{t_i}[j] - pos_2^{t_i}[j] \right)^2 }.   (7.1)

(Without loss of generality, other distance measures can also be considered, such as the Manhattan distance (l_1-norm), as well as the maximum distance (l_∞-norm).) The distance between two objects in our model is the Euclidean distance.

Spatial k-NN queries are among the most common search problems and will be employed in our study. Generally, given a spatial region, a range query on spatial data identifies all the spatial points that lie inside this region. For spatio-temporal queries, a time interval is also given, and the timestamps of the resulting trajectories in that region need to also fall in that time interval. A spatial k-NN query takes as input a query center point along with a set of spatial objects' location data, in order to find the k nearest neighbors around the center point.
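The snapshot distances above translate directly into code. This sketch, with illustrative names, implements the l_2-norm of Eq. (7.1), together with the l_1 and l_∞ variants mentioned in the definition.

    type Snapshot = Array[Double]   // pos^{t_i} as a D-dimensional vector

    def l2(p1: Snapshot, p2: Snapshot): Double =
      math.sqrt(p1.zip(p2).map { case (a, b) => (a - b) * (a - b) }.sum)  // Eq. (7.1)

    def l1(p1: Snapshot, p2: Snapshot): Double =
      p1.zip(p2).map { case (a, b) => math.abs(a - b) }.sum               // Manhattan

    def lInf(p1: Snapshot, p2: Snapshot): Double =
      p1.zip(p2).map { case (a, b) => math.abs(a - b) }.max               // maximum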

Definition 7.3.19. k-NN: Given a moving object mo, a dataset O and an integer k, the k nearest neighbors of mo from O, denoted as kNN(mo, O), form a set of k objects from O such that ∀o ∈ kNN(mo, O), ∀s ∈ O − kNN(mo, O), dist_{o,mo} ≤ dist_{s,mo}.

Definition 7.3.20. Snapshot Trajectory Point k-NN (STPkNN) query: Given a query center point q at timestamp t_i, an integer k and a dataset of trajectory (spatio-temporal) points denoted as S^{t_i} = \{traj_1^{t_i}, traj_2^{t_i}, traj_3^{t_i}, \ldots, traj_N^{t_i}\}, a k-NN query, denoted as STPkNN(q, S^{t_i}, k), asks for the k spatial points of S^{t_i} whose l_2 distance from the query point q is less than that of the rest of the points of S^{t_i}.

Definition 7.3.21. Continuous Trajectory Point k-NN (CTPkNN) query: Given a query center point q at timestamp t_i, an integer k and a dataset of trajectory (spatio-temporal) points denoted as S^{t_i} = \{traj_1^{t_i}, traj_2^{t_i}, traj_3^{t_i}, \ldots, traj_N^{t_i}\} over a time interval t = \{t_i\}_{i=1}^{L} for L → ∞ (practically, large enough), a continuous k-NN query, denoted as CTPkNN(q, S^{t}, k), asks for the k spatial points of S^{t_i} whose distance from the query point q is less than that of the rest of the points of S^{t_i}, for all consecutive t_i ∈ t.

Note that traj_1^{t_i} ≡ pos_1^{t_i}. Considering the previous definitions, the problem formulation for the application of robust CTPkNN queries (see Figure 7.1) will be presented in the following subsection. In the context of the proposed approach, all points q ∈ S^{t_i} are selected as query objects in order to acquire the k nearest neighbors of all mobile objects in each timestamp t_i. If the process is repeated for a high number of consecutive timestamps in a selected time period, then the collection of STPkNN queries constitutes a CTPkNN query, which is similar to our case.


Fig. 7.1 An Overview of the Continuous Trajectory Point k Nearest Neighbor (CTPkNN) Query.

7.3.4 Problem Formulation

The problem of robust spatio-temporal databases is addressed in an environment suitable for Big Spatial Data Management. The k-anonymization approach adopted in [40] for preventing the revelation of mobile objects' identities is related to the k-NN algorithm described in Section 7.3.2. Specifically, the k-anonymity set is formulated from the unique object identifiers, denoted as id, of the k nearest neighbors, exploiting as a result the spatio-temporal data of a set of mobile objects.

Through the SMaRT system, a set of mobile users' trajectory data is recorded per timestamp, that is, the mobile user's trajectory id and the values of longitude and latitude are recorded. From these location features, the four attributes (x, y, θ, u) (as presented in Table 5.1), along with the Hough-X and Hough-Y values of (x, y) [38], (Ux, ax, Uy, ay, bx, wx, by, wy) (as presented in Table 7.1), are computed. Employing the k-NN method on different pairs of features thus enables us to form the k-anonymity set of each mobile object per timestamp, as depicted in Table 5.2.

For each mobile user i and per timestamp l, the ids of the k nearest neighbors are computed and kept in vector form knns_{il} = [id_{il1}, id_{il2}, \ldots, id_{ilk}] for l = 1, 2, \ldots, L, as presented in Table 5.2. For each user, the number k_s of the k nearest neighbors that remain the same from one timestamp to the next is computed in order to estimate the vulnerability ratio 1/k_s. Hence, a higher k_s is associated with a lower probability (i.e., lower vulnerability) of a moving object's identity being unveiled.
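The sketch below, with assumed names, computes k_s and the vulnerability ratio from two consecutive neighbor vectors knns_{il} and knns_{i,l+1}.

    // k_s = number of neighbors kept between consecutive timestamps;
    // vulnerability ratio = 1 / k_s (taken as 1 when no neighbor is kept).
    def vulnerability(knnsPrev: Seq[Long], knnsCurr: Seq[Long]): Double = {
      val ks = knnsPrev.toSet.intersect(knnsCurr.toSet).size
      if (ks == 0) 1.0 else 1.0 / ks
    }

    // e.g., vulnerability(Seq(1L, 2L, 3L, 4L), Seq(2L, 3L, 5L, 6L)) == 0.5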

Definition 7.3.22. k-anonymous database: A database is k-anonymous, i.e., its records are k-anonymous with respect to the selected features, if, at the same specific timestamp τ, k − 1 discrete records have at least the same nearest neighbors, so that no record is distinguished from its k − 1 neighboring records.

7.3.5 System Model

A spatio-temporal database is considered with N records, that is, N moving objects in the xy plane. Each record (x_i^j, y_i^j) represents the spatial coordinates of a mobile user at timestamp t_i^j, that is, point i of its trajectory j [186]. From the location coordinates (x, y), we can extract the corresponding velocity and angle-direction features (u, θ) and the dual points (Ux, ax), (Uy, ay), (bx, wx), (by, wy) by employing the dual methods described in [38]. Let us assume a trajectory database T = \{T^1, \ldots, T^N\} of equal length L, in which each trajectory is represented via a sequence of L triples. For each point i in trajectory j, we define a two-dimensional feature vector F_i^j, which captures the selected features' data. Hence, we can define and store the trajectory as T^j = \{F_1^j, F_2^j, F_3^j, \ldots, F_L^j\}.

In the context of this work, the time performance of the k-anonymity set formulation, with or without the aid of an indexing method, for spatial k-NN queries on trajectory data employing the Snapshot k-NN query on trajectory points, as depicted in Figure 7.2, is evaluated. Given a set of trajectories represented as sequences of spatio-temporal points, along with a query point, the STPkNN algorithm finds the query point's k nearest spatial points from the set of trajectory points at the corresponding timestamps. The STPkNN query is issued over a set of moving objects, executing the classical k-NN over a time period, and updates the k-anonymity set from timestamp to timestamp for all objects.
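One possible in-memory layout for the feature vectors F_i^j described above is sketched next; the field names are assumptions for illustration only.

    // A per-point record: trajectory id, timestamp index, and the two
    // selected features (e.g., (x, y) or (Ux, ax)).
    case class FeaturePoint(id: Long, i: Int, f1: Double, f2: Double)

    // T^j = {F_1^j, ..., F_L^j}: a trajectory as a fixed-length sequence.
    type Trajectory = Vector[FeaturePoint]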


Fig. 7.2 An Overview of Spatio-Temporal Data Partitioning and Indexing.

However, some objects may not move with the same velocity, leading to changes in the query result, i.e., in the k-anonymity set. As a result, for privacy preservation, it is important for the k-anonymity set to remain the same or vary slowly, as otherwise the vulnerability measure of the moving objects is affected. Note that, for a given STPkNN query q, the k-anonymity set, denoted as A, should always satisfy the following conditions:

1. The first condition |A| = k ensures that the anonymity set contains the ids of k objects.

2. The second condition, ∀a′ ∈ (S − A), dist(q, a′) ≥ max\{dist(q, a) | a ∈ A\}, ensures that the ids in A are those of the k nearest objects to q (a small checker sketch follows the list).
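A small checker for these two conditions, under assumed names and using the Euclidean snapshot distance, could look as follows.

    def dist(p1: Vector[Double], p2: Vector[Double]): Double =
      math.sqrt(p1.zip(p2).map { case (a, b) => (a - b) * (a - b) }.sum)

    // A is valid iff it has exactly k members and no outside point is
    // strictly closer to q than the farthest member of A.
    def isValidAnonymitySet(q: Vector[Double], a: Set[Vector[Double]],
                            s: Set[Vector[Double]], k: Int): Boolean =
      a.size == k && {
        val radius = a.map(dist(q, _)).max
        (s diff a).forall(dist(q, _) >= radius)
      }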

Assuming a mobile object issues a k-NN query on the selected features at a specific timestamp, spatial k-anonymity ensures that an attacker, acting as the query issuer, cannot identify the mobile object with probability larger than 1/k, where k is a user-defined anonymity parameter. Moreover, a spatio-temporal database is expected to handle a high number of moving objects' location data, as well as a large number of consecutive k-NN queries. Hence, an efficient consecutive STPkNN processing algorithm is very important. To this end, the STPkNN query is investigated in the GeoSpark framework, and experimental evaluations are conducted using a realistic dataset to demonstrate the time performance of the k-anonymity set computation under the STPkNN query. Ultimately, the vulnerability behavior for different pairs of features is investigated.

7.3.6 GeoSpark System Overview

In this subsection, the necessary structures of the GeoSpark-based approach are presented. All parts were implemented in Apache Scala due to its compatibility with the Apache Spark framework. The core components, described below, are implemented on GeoSpark Resilient Distributed Datasets (RDDs), and the interaction with GeoSpark is carried out through the Apache Zeppelin interface.

7.3.6.1 GeoSpark Architecture

GeoSpark [78] is an extension of the Apache Spark core and provides additional tools to manage geospatial data, e.g., geospatial datatypes, indexes and operations. The ar- chitecture of GeoSpark, as introduced in Figure 7.3, consists of the following three layers:

1. The Apache Spark Layer consists of all the components present in Spark and performs data loading and querying.

2. The Geospatial RDD Layer extends Spark and supports three types of RDD, i.e., Point, Rectangle and Polygon RDD. In addition, it contains a geometrical opera- tions library for every RDD.

3. The Geospatial Query Processing Layer is used to perform different types of geospatial queries.

Fig. 7.3 An Overview of GeoSpark Layers.

GeoSpark provides Spatial RDDs that allow efficient loading, transformation, partitioning, in-memory storing as well as indexing of complex spatial data from different input data sources, such as CSV, WKT, GeoJSON and Shapefile, which, through the GeoSpark Spatial SQL interface, are compatible with Spark. It also provides eight types of spatial objects, namely Point, Multi-Point, Polygon, Multi-Polygon, Line String, Multi-Line String, GeometryCollection, and Circle. Hence, spatial objects in a Spatial RDD can consist of different geometry types. Furthermore, it supports three types of spatial objects, namely Point, Rectangle and Polygon, for which the corresponding spatial RDDs can be defined. These structures can be used as input to several spatial queries, such as spatial k-NN, spatial range and spatial join. In the following, taking advantage of the GeoSpark architecture, the main parts of our approach are presented.

7.3.6.2 Spatial RDD Structures Preparation

Initially, the trajectory data of all mobile users are stored in a csv file. Specifically, these data consist of trajectory (spatio-temporal) points for a number of mobile objects, i.e., bike riders. For each bike trajectory (spatio-temporal) point, we have at our disposal the bike rider id, the spatial coordinates (x, y), the polar coordinates (velocity, direction) (u, θ), the Hough-X/Y attributes, as well as the timestamp, with the aim of computing the k nearest neighbors by selecting two features among them at each time. Practically, all bike riders follow non-linear trajectories. However, in our case, each user's trajectory is approximated by linear sub-trajectories, as in our previous work [38]; thus, the values of the attributes are not randomized, so that groups of bikes have similar behavior. The user cannot control attribute values, but they may ask to simulate a specific timestamp. In GeoSpark, first the corresponding SRDD is created and then the temporal partitioning is implemented. Information about the road network that describes the situation of the region is not taken into consideration. In Apache Zeppelin, the input data are loaded from the csv file as a Dataframe, and then SQL queries can be executed on the created Dataframe in order to recover the spatial data of moving objects at specific timestamps. Then, these data are stored and transformed into a PointRDD. In this way, several PointRDDs are created, which from now on will be called BikesRDD, as they concern bike riders' trajectories (spatio-temporal points). GeoSpark's operator permits us to transform a set of raw data into a PointRDD, selecting the columns that correspond to the desired features at a specific timestamp, as depicted in Table 7.1 (a small sketch of this loading step follows the table).

Table 7.1 The Different Types of Point Resilient Distributed Datasets (RDDs) According to Selected Features.

Features                    Type of PointRDD
(x, y)                      Spatial Points PointRDD
(u, θ)                      Polar Points PointRDD
Hough-X of x: (ax, Ux)      PointRDD
Hough-X of y: (ay, Uy)      PointRDD
Hough-Y of x: (bx, wx)      PointRDD
Hough-Y of y: (by, wy)      PointRDD
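As a sketch of the loading step described before Table 7.1, the snippet below reads the csv file into a Dataframe and recovers the spatial data of one timestamp via Spark SQL; the path, column names and the SparkSession `spark` are assumptions for illustration.

    // Load the raw trajectory file and expose it to SQL queries.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/path/to/bike_trajectories.csv")
    df.createOrReplaceTempView("bikes")

    // Recover the selected features of all objects at one timestamp;
    // the result is then written out and turned into a PointRDD.
    val snapshot = spark.sql("SELECT id, x, y FROM bikes WHERE timestamp = 3")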

7.3.6.3 Spatio-Temporal Partitioning

The proposed approach considers a naive temporal partitioning, from which a collection of timestamps is obtained. These timestamps are related to the sampling period during which the system recorded the spatio-temporal data. As a result, the size of the data loaded in a BikesRDD may differ between timestamps, since some mobile objects may not have a spatio-temporal footprint for all recorded timestamps. Here, in our simulations, the sampling period is assumed to be the same for all bike riders' trajectory data. From the trajectory (i.e., spatio-temporal) points of a number of bike rider objects at different timestamps, a collection of BikesRDDs is constructed. Following the previous aspects, GeoSpark can easily utilize spatial partitioning over the temporal partitions of the data of these bikes. In each temporal partition, a bike rider may not necessarily have moved, because random delays may occur. Hence, in a specific time partition, the spatial data of all bikes do not necessarily participate.

A BikesRDD constitutes a specialized GeoSpark PointRDD, which consists of individual bike riders' records. The spatial partitioning of the BikesRDD concerns the bike riders' selected features, such as the location data. The BikesRDDs of the several temporal partitions are simulated one by one. More to the point, GeoSpark offers low-overhead spatial partitioning approaches that take into consideration the spatial distribution of the data and repartition a loaded Spatial RDD in an automatic way. In each Spatial RDD, spatially proximal objects are grouped into the same partition; as a result, the partitioning layer in GeoSpark partitions the workload in a spatial way and periodically repartitions this workload in order to keep the partitions balanced. Also, it supports a variety of grid-type partitioning methods, such as equal grids, R-tree, Quad-tree and KDB-tree, to name a few [184].

7.3.6.4 k-Anonymity Set in GeoSpark

Algorithm 8 presents the steps of a spatial k-NN query in GeoSpark. It takes as input a non-indexed or indexed SRDD, a query center point, as well as a parameter k, which indicates the number of nearest neighbors. The algorithm is separated into two phases, namely the selection and the sorting phase [184], and will be utilized in the simulations.

Algorithm 8 Spatial k-NN Query
1: input The number k of nearest neighbors
2: input A query center object A
3: input A Spatial RDD B
4: output A list of k spatial objects
5: Step 1: Selection phase
6: for all partitions ∈ SRDD B do
7:   if an index exists then
8:     Return the k neighbors of A by querying the index of this partition
9:   else
10:    for all objects ∈ this partition do
11:      Compute the distance between object A and each object of Spatial RDD B
12:    end for
13:  end if
14: end for
15: Maintain a priority queue C that stores the top k nearest neighbors
16: Step 2: Sorting phase
17: Sort the spatial objects in the intermediate Spatial RDD based on their distances to A
18: Return the top k objects in C

Our aim is to exploit the above-mentioned structures and Algorithm 8 so as to formulate the k-anonymity set for a number of moving objects based on their trajectory (spatio-temporal) data. From the trajectory points of a number of mobile objects at different timestamps, a collection of spatial BikesRDDs is constructed. The goal of this chapter is to provide an efficient and scalable framework for robust continuous k-NN querying of spatial objects in GeoSpark. A Scala RDD API call for creating the desired PointRDD from a csv file is introduced in Algorithm 9 below.

Algorithm 9 Spatial PointRDD Creation
1: Define the csv file location in pointRDDInputLocation
2: Define the attributes' start column, equal to 0, in pointRDDOffset
3: Define pointRDDSplitter as FileDataSplitter.CSV
4: Define the carrying of the remaining attributes: carryOtherAttributes = true
5: Create the PointRDD: PointRDD = new PointRDD(sc, pointRDDInputLocation, pointRDDOffset, pointRDDSplitter, carryOtherAttributes)

An iterative application of the typical spatial k-NN query presented in Algorithm 8 is introduced, using the Scala RDD API of GeoSpark, in Algorithm 10 below.

Algorithm 10 Iterative Spatial k-NN Query
1: input k, usingIndex, number of timestamps L
2: for timestamp t = 1 to L do
3:     Create the DataFrame df with the selected features at t
4:     Save df as csv file f
5:     Create a PointRDD from the csv file
6:     if usingIndex = true then
7:         Build an R-Tree index on the PointRDD
8:     end if
9:     for all lines ∈ csv file f do
10:        Read the selected features (f1, f2)
11:        Create the query point:
12:        val fact = new GeometryFactory()
13:        val querypoint = fact.createPoint(new Coordinate(f1, f2))
14:        Run the spatial k-NN query to return k geometry points:
15:        val queryResultList = KNNQuery.SpatialKnnQuery(PointRDD, querypoint, k, usingIndex)
16:    end for
17: end for
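To make the loop body concrete, a minimal Scala sketch under the GeoSpark 1.2 RDD API is given below; the helper name and the decision to build the index inside the helper are illustrative assumptions, not the dissertation's exact code, and f1, f2 stand for the selected coordinates read from the csv file.

```scala
import com.vividsolutions.jts.geom.{Coordinate, GeometryFactory, Point}
import org.datasyslab.geospark.enums.IndexType
import org.datasyslab.geospark.spatialOperator.KNNQuery
import org.datasyslab.geospark.spatialRDD.PointRDD

// Hedged sketch of one iteration: optionally index the PointRDD, then
// query the k nearest neighbors of a single trajectory point.
def kNearest(pointRDD: PointRDD, f1: Double, f2: Double,
             k: Int, usingIndex: Boolean): java.util.List[Point] = {
  if (usingIndex) {
    // Build an R-tree on the (non-partitioned) PointRDD, as in Algorithm 10.
    pointRDD.buildIndex(IndexType.RTREE, false)
  }
  val fact = new GeometryFactory()
  val querypoint = fact.createPoint(new Coordinate(f1, f2))
  KNNQuery.SpatialKnnQuery(pointRDD, querypoint, k, usingIndex)
}
```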

It is worth mentioning that the output of the spatial k-NN query (indexed or not) is a list of geometries, as in Table 7.2, where the id of each point geometry is related to the corresponding mobile object id. In addition, the list holds the top-k geometry objects. In the following, the above methodology and algorithms are used for the performance evaluation of the proposed k-anonymity method based on spatial k-NN queries.

Table 7.2 Trajectory Representation in a PointRDD.

Trajectory Points

Point (−88.331492, 32.324142)

Point (−88.175933, 32.360763)

Point (−88.388954, 32.357073)

Point (−88.221102, 32.35078)

7.4 Results

7.4.1 Environment and Dataset

A local machine was used for the experimental evaluation. The experiments were executed using 1 VM with 4 CPU cores, 3.8 GB RAM and 488 GB storage capacity, running Ubuntu 18.04.4 LTS, Apache Spark 2.1.0, GeoSpark 1.2.0 and Apache Zeppelin 0.8.2. We conducted the experiments on one master node and assigned 384.1 MB of memory to the Spark driver program that is executed on the master local machine. The experimental data were extracted from the SMaRT Database GIS Tool (http://www.bikerides.gr/thesis2/). The experiments were based on trajectory datasets of bike riders in the area of Corfu, Greece; the case study area is located in the non-urban part of Corfu.

7.4.2 Time Performance of k-Anonymity Set

This subsection presents a comprehensive experimental evaluation of GeoSpark on the studied problem. Specifically, we measure the GeoSpark spatial k-NN query performance, which determines the computation of the k-anonymity set of 80, 500 and 2000 mobile objects over 10 consecutive timestamps, as shown in Table 7.3. The spatial k-NN queries are tested with k values of 8 and 16 on the Bike Riders Trajectory (spatio-temporal points) dataset. The performance measure of the system for the k-anonymity set computation is the total run time that the system needs to execute the jobs. We compare the following spatial data processing approaches:

1. Without Partitioning–Without Indexing: GeoSpark approach without a spatial index on a non-partitioned PointRDD.

2. Without Partitioning–With Indexing: GeoSpark approach with a spatial index on a non-partitioned PointRDD.

3. With Partitioning–Without Indexing: GeoSpark approach without a spatial index, where the PointRDD is partitioned according to grids using (i) KDB-Tree and (ii) R-Tree.

Table 7.3 Simulation Parameters.

Parameter        Parameter Range
Mobile Objects   80, 500, 2000
Timestamps       10
Time Step        uniform

The above cases are taken into consideration in order to evaluate the GeoSpark performance for the computation of the k-anonymity set for a number of mobile users. To form the desired k-anonymity set, we apply spatial k-NN queries using the embedded function "KNNQuery.SpatialKnnQuery". Only the R-tree spatial index is supported for spatial k-NN queries. We consider the following results assuming that the trajectories are represented as a collection of spatio-temporal points in two-dimensional space. The experiments were conducted in Apache Zeppelin, an open web-based notebook that enables interactive data analytics. In Figure 7.4, an SQL query is employed, which returns the trajectory data (spatio-temporal points) of 40 mobile objects. Specifically, a notebook created in Zeppelin is utilized, through which the methods described in the previous section were executed under the GeoSpark framework. It should be noted that the experimentation shows that, when mobile objects do not move at the same timestamps, it is difficult for the k-anonymity set to be formed, as the desired set of mobile objects has very low cardinality.

Also, we observed that, from timestamp to timestamp, different objects participated in the formed set, meaning that the k nearest neighbors are time-varying. Such a scenario shows that the randomness in the objects' movements does not benefit or promote the formulation of a slowly time-varying anonymity set, as we have proved in our previous work [40].

Fig. 7.4 An Overview of 40 Trajectories through Zeppelin.

7.4.2.1 Impact of Data Size

In this subsection, the impact of data size, in terms of the number of mobile objects N in Figure 7.5 and storage size in MB in Figure 7.6, with or without the aid of R-Tree indexing, on the computation time of the k-anonymity set is investigated. According to the results in Tables 7.4–7.6, we observe that the use of indexing does not impact the total execution time of the involved spatial k-NN queries. This happens because GeoSpark caches the spatial RDD along with the corresponding indexes in each timestamp. Hence, for the upcoming mobile objects, it directly reads the spatial PointRDD and its index from the cache, thus saving time. It should also be noted that, in each timestamp, all objects use the same (partitioned or not) PointRDD, which changes in the upcoming timestamps for all the objects.

Fig. 7.5 Time Cost for k-Anonymity Set Computation with or without Indexing for N = 80, 500, 2000 Mobile Objects.

Fig. 7.6 Time Cost for k-Anonymity Set Computation with or without Indexing for 3 Cases of Total Input Data in Executor.

Table 7.4 Time for 80 Mobile Objects without Indexing and with R-Tree Indexing.

Time Results in Minutes
k-NNs   No Indexing   Indexing   Number of Completed Jobs   Number of Queries
8       1.5           1.6        730                        6400
16      1.5           1.6        730                        6400

Table 7.5 Time for 500 Mobile Objects without Indexing and with R-Tree Indexing.

Time Results in Minutes
k-NNs   No Indexing   Indexing   Number of Completed Jobs   Number of Queries
8       3.5           3.6        4510                       25 × 10^5
16      3.5           3.6        4510                       25 × 10^5

Table 7.6 Time for 2000 Mobile Objects without Indexing and with R-Tree Indexing.

Time Results in Minutes
k-NNs   No Indexing   Indexing   Number of Completed Jobs   Number of Queries
8       11            12         18,010                     4 × 10^7
16      11            12         18,010                     4 × 10^7

In terms of total execution time, as shown in Figure 7.5 and including the time for index building on the PointRDD, GeoSpark without an index shows approximately the same search time performance as the indexed version. Also, the execution time of a k-NN query in GeoSpark remains constant for the different values of k in each case. This is mainly because the value of k is very small with respect to the input data and thus most of the time is spent on input data processing. The time cost of loading the data from csv to a dataframe is 42 s and is included in the total execution time. Figure 7.6 illustrates the total execution time relative to the total input data size in the executor in MB, which appears to increase in a non-linear way. Ultimately, in Figure 7.7, we focus only on higher numbers of mobile objects, in order to understand the impact of data scalability on the time performance of the proposed approach for the k-anonymity set computation, assuming a 2-dimensional (2D) points RDD for 10 consecutive timestamps. Indeed, the growth appears close to exponential, which will be even more apparent for actual Big Data.

Fig. 7.7 Time Cost (in minutes) versus the Number of Mobile Objects N = {500, 2000, 8000, 32,000} without Indexing for k = 8.

7.4.2.2 Impact of Spatial Partitioning

In this subsection, we experiment on the impact of the embedded spatial partitioning methods on the computation time of the k-anonymity set, despite the fact that in GeoSpark the spatial partitioning methods are usually used to optimize spatial joins. Initially, in Figure 7.8, the total number of records in the PointRDD partitions for four partitioning methods in a specific timestamp is demonstrated. Obviously, the partition sizes of KDB-Tree partitioning are more balanced than those of Quad-Tree. Also, we observe that, regarding R-Tree partitioning, an overflow occurs in one data partition (e.g., partition 2), which is much larger than the other partitions, because the R-Tree does not consider the whole space [184].

Fig. 7.8 Spatial PointRDD Data Distribution for 4 Spatial Partition Techniques for 2000 Mobile Objects.

Note that, for the third dataset, with size N = 2000, the data distribution varies from one timestamp to another and thus is not recorded in Table 7.7. In Table 7.7, we focus only on the impact of R-Tree and KDB-Tree spatial partitioning on the k-anonymity set computation time without PointRDD indexing, as GeoSpark does not support building local indexes on spatially partitioned data for spatial k-NN queries. We choose to evaluate, for a specific k equal to 16, the impact of spatial data partitioning on the time needed to formulate the k-anonymity set for 10 consecutive timestamps, as aforementioned. As the results show, the time cost of the k-anonymity method did not benefit from data partitioning. In addition, the time for the k-anonymity set computation is slightly lower when using the R-Tree for the smaller datasets. However, for a larger dataset, the difference between the two types of partitioning evens out. This means that most of the processing time, on the same machine, is spent on the k-NN query computation.

7.4.3 Vulnerability Evaluation

In this subsection, some additional experiments are conducted, based on a real dataset with the parameters presented in Table 7.8 and Figures 7.9–7.11. The k-anonymity set computation is carried out with the aid of R-Tree indexing. As GeoSpark works with spatial data in two-dimensional space, our focus is on three different cases for the k-anonymity set computation, in order to evaluate the vulnerability of the formulated anonymity set in each space:

Table 7.7 Impact of Spatial Partitioning on Time Performance.

Time Results (Minutes) for Different Spatial Partitioning Methods
Mobile Objects   R-Tree   KDB-Tree   Completed Jobs
N = 80           1.4      1.5        766
N = 500          3.6      3.7        4546
N = 2000         12       12         18,046

Data Distribution (records per partition) for 2 Different Spatial Partitioning Methods
Mobile Objects   R-Tree           KDB-Tree
N = 80           40, 40, 0        30, 25, 25
N = 500          236, 132, 132    250, 250, 0

1. Euclidean coordinates (x, y)

2. Polar coordinates (u, θ)

3. Hough-X coordinates of (x, y) denoted as (Ux, ax) and (Uy, ay)

Table 7.8 Parameters when using N = 500 Trajectories, L = 100 Timestamps.

Anonymity Set Size k   Anonymity Set Attributes
5, 10                  (x, 0)
5, 10                  (0, y)
5, 10                  (x, y)
5, 10                  (Ux, ax)
5, 10                  (Uy, ay)

Fig. 7.9 Vulnerability versus Timestamps in (a) the Euclidean Space (features x, y and xy for k = 5, 10) and (b) the Polar Space (features u, θ for k = 5, 10), for N = 500 Trajectories, L = 100 Timestamps.

Fig. 7.10 Vulnerability versus Timestamps in the Hough-X Space of x (features x and (Ux, ax)) for (a) k = 5 and (b) k = 10, for N = 500 Trajectories, L = 100 Timestamps.

Fig. 7.11 Vulnerability versus Timestamps in the Hough-X Space of y (features y and (Uy, ay)) for (a) k = 5 and (b) k = 10, for N = 500 Trajectories, L = 100 Timestamps.

The experimentation has been carried out on a dataset of size N = 500 Trajectories (or moving objects).

Actually, Figure 7.9a depicts that, assuming only the x feature (or the y one) for the anonymity set formulation, the vulnerability in both cases is about 2 times higher than when both attributes (x, y) are considered. Similarly, in Figure 7.9b, the vulnerability is measured in the polar space, where the performance is worse than in the Euclidean one. This is attributed to the fact that the polar coordinates are linearly dependent on the spatial data (x, y), revealing the curse of dimensionality. This figure also reveals that the anonymity set varies from timestamp to timestamp, meaning that, for a given object's angle direction and velocity, more than one of the k nearest neighbors change. Finally, Figures 7.10 and 7.11 show the vulnerability performance in the Hough-X space, compared with the corresponding one in the Euclidean space. Obviously, the results are better in the Euclidean space, as the Hough-X space is linearly dependent on the former. Finally, focusing on the case where k = 10 in Figure 7.12, the vulnerability varies slowly around 0.4 and 0.3 (red and orange dashed lines, respectively) in the Hough-X space, while in the Euclidean space it achieves values very close to 0.1 (purple solid lines).


Fig. 7.12 Vulnerability Performance Comparison in the Euclidean ((x, y)) and Hough-X ((Ux, ax), (Uy, ay)) Spaces for k = 10.

7.5 Discussion

In this section, two main issues are discussed: first, the time performance of the studied problem and, in the following, the robustness of the suggested anonymity method. The analysis in Section 7.4 gives some useful insights and presents some key aspects of the results obtained from the evaluation of the proposed k-anonymity method in GeoSpark.

7.5.1 Performance Issues

Here, we should note that the adopted partitioning method is simple. Hence, to improve the performance of the suggested privacy-preserving continuous k-NN queries, we could employ the approach proposed in [120] and extend Algorithm 8 by applying its smart partitioning technique. In particular, we could compute AkNN queries based on kdANN or kdANN+ (where d stands for the dimensionality) instead of simple k-NN queries, and study the effect of d on the performance of such queries and, thus, on the k-anonymity set computation performance, which is the main issue in the context of this research work. At this point, it should be noted that that approach exploits a space decomposition technique suitable for Big Data and processes the classification in a parallel and distributed manner, considering multidimensional objects as well. Through an extensive experimental evaluation, it has been proved that the solution is efficient, robust and scalable in processing k nearest neighbor queries. Although the trajectory (spatio-temporal) data utilized here are 2-dimensional (2D), the methodology and techniques are general and applicable to higher dimensions. Hence, observing Figures 7.5 and 7.6 in Section 7.4, the cost of the proposed approach, which is currently investigated using 2D trajectory (spatio-temporal) point datasets, is expected to scale up exponentially with respect to both the data size, as shown in Figure 7.7, and the number of dimensions, when d > 2, following the same trend as in [120]. This results from the curse of dimensionality: as the number of dimensions increases, so do the number of distance computations and the number of searches. For further experimentation with the dimensionality parameter d > 2 in the context of GeoSpark, the functions for data preprocessing that prepare the Points RDD structure need to be redefined; let us recall that GeoSpark RDD operations and k-NN query processing work with 2D points. Finally, as the findings in [120] show, it is expected that increasing the number of computing nodes (VMs) will improve scalability and time performance in support of both snapshot and continuous spatial queries over moving objects, having a great effect when the data follow a uniform distribution, due to better load balancing. Indeed, increasing the number of computing nodes decreases the number of distance calculations and update steps on the k-NN lists that each computing node undertakes. We expect to see a similar behavior in our case if the processing is distributed over multiple nodes in the GeoSpark environment.

7.5.2 Vulnerability

Vulnerability is considered a major issue arising in spatio-temporal data management as well; it simply measures how vulnerable various mobile objects are to revealing their identity to potential adversaries when moving from one position to another. Moreover, reducing vulnerability and enhancing resilience in the face of adversity are considered essential steps. To date, less attention has been given to the vulnerability resulting from k-NN querying on non-linear trajectory data. This study investigates the vulnerability of mobile objects with respect to k-NN queries based on their geolocation data attributes. The proposed anonymity technique, as well as previous findings, are confirmed in the GeoSpark environment and, as a result, vulnerability is estimated for different risk conditions (i.e., for different numbers of nearest neighbors k and pairs of features, as presented in Tables 7.1 and 7.8). Furthermore, it is observed that, for low values of k, the vulnerability remains at a medium level, and the more nearest neighbors are utilized to form the anonymity set, the lower the vulnerability becomes. In other words, the results of the present study indicate a low level of vulnerability for a high number of nearest neighbors and uncorrelated feature data. At this point, it should be noted that the results in Section 7.4 highlight the superiority, that is, the lower vulnerability, of the Euclidean space in terms of privacy preservation, which is the main issue in k-anonymity and, thus, in applying robust k-NN queries. However, let us recall that, as already investigated in [38], dual representative features, e.g., Hough-X of x and/or y, are considered an appropriate choice to form the k-anonymity set from both the privacy preservation and the storage efficiency perspective. So, priority should be given to the use of low-dimensional, independent feature data and, in that case, to the utilization of the dual space over the Euclidean one.

7.6 Conclusions and Future Work

In conclusion, this work aimed at testing the hardware and software performance in terms of the number of operations performed and the time needed to obtain satisfactory anonymity. The real-world example concerning moving bicycle riders yielded satisfactory results using different ancillary methods of space partitioning and

indexing. Also, a GeoSpark-based approach for evaluating the vulnerability of spatial k-anonymity in large-scale spatio-temporal data is provided. The experiments show that GeoSpark constitutes an appropriate Spark-based system for the evaluation of robust k-NN queries. The GeoSpark spatial RDDs are exploited to store the trajectory data of the mobile objects as PointRDDs, in order to acquire the anonymity sets of all query mobile objects by issuing iterative spatial k-NN queries for all mobile objects in consecutive timestamps. It should be noted that the experiments were conducted on one Virtual Machine (VM) in local mode, with the specific capabilities described in the experimental evaluation. As future work, our aim is to apply the proposed anonymity method in a fully distributed environment with many VMs, so as to verify the expected results from previous findings about the execution time; in this case, the input data will be partitioned and then distributed to different VMs. In this way, several parallel tasks will be executed and thus the execution time for the anonymity set computation is expected to be much lower than in the current single-VM case when considering large-scale databases. Data locality and load balancing are important when applying efficient k-NN queries; GeoSpark also provides a grid-type partitioning based on the KDB-tree that can ensure workload balancing. In conclusion, it should be pointed out that the current experimentation gave us insight into the design issues and possible constraints of the proposed approach, so as to optimize its performance in terms of execution time cost and memory utilization for real-time scenarios, where the security of the mobile objects' identities, when issuing k-NN queries, is of high importance.

Part II

Sentiment Analysis and Tourism Forecasting

CHAPTER 8

An Apache Spark Implementation for Graph-based Hashtag Sentiment Classification on Twitter

8.1 Introduction

Nowadays, the vast evolution of the Internet has radically changed the ways of communication, information exchange and interaction between people. Internet users are no longer passive information receivers; on the contrary, they participate in social networks and have the opportunity to discuss with others by exchanging views and ideas. Specifically, users tend to disseminate information through short 140-character messages called "tweets", or follow other users in order to receive their status updates. Twitter constitutes a widespread instant messaging platform which people use in order to get informed about world news, technological advancements, etc. Inevitably, a variety of opinion clusters that contain rich sentiment information is formed. The rapid growth of the Internet has significantly increased the data volume and made processing with traditional methods very difficult. Therefore, there is an increasing need to move to cloud computing technologies, since they provide tools and infrastructure for creating highly scalable solutions and managing the input data in a distributed way among multiple servers. The public opinion around various issues is reflected through questionnaires and surveys. As more users post reviews for the products or services they use, the social networking platforms are becoming important sources of information, which is for the first time recorded directly in electronic form. The need to analyze and retrieve the produced volume of information in an automated way paves the way to Sentiment Analysis [97]. According to [107], Sentiment Analysis is one of the simplest problems in Natural Language Processing; the computing system does not need to completely perceive the semantics of each sentence, but only to detect the overall attitude of the author and

then classify it according to its polarity. However, the polarity detection problem seems to be difficult even for humans. Sentiment Analysis can be applied at various levels, depending on the text size and the desired resolution of the extracted information:

1. Document Level: it is assumed that each document expresses an opinion on a particular subject or topic.

2. Sentence Level: it splits each document into sentences, assuming that each sentence expresses a single view.

3. Attribute Level: it splits each document or sentence into phrases that refer to an entity as a whole, or separately to each of its features, with a different sentiment.

In the context of this work, we utilize hashtags and emoticons as sentiment labels to perform classification of diverse sentiment types. Hashtags are a convention for adding additional context and metadata and are extensively utilized in tweets [169]. They are used to categorize a message and/or highlight a topic, and they facilitate the search for tweets that refer to a common subject. Both hashtags and emoticons provide fine-grained sentiment information at tweet level, which makes them suitable to be leveraged for opinion mining. Previous works regarding emotional content are the ones in [88], [89] and [90]; they presented various approaches for the automatic analysis of tweets and the recognition of the emotional content of each tweet based on the Ekman emotion model, where the existence of one or more of the six basic human emotions (Anger, Disgust, Fear, Joy, Sadness and Surprise) is specified. In this chapter, Sentiment Analysis on Twitter data based on [169] is performed. Our proposal consists of three classifiers, namely Naive Bayes, Logistic Regression and Decision Trees, which we implemented in Apache Spark, a prominent distributed environment appropriate for processing large-scale data. The algorithm uses a set of categorized tweets to train the classifiers and then classifies tweets into three categories, namely positive, negative or neutral. The rest of the chapter is organized as follows: Section 8.2 discusses related work and presents the Spark framework, while Section 8.3 introduces the Sentiment Analysis classification framework. Section 8.4 presents the datasets used for

validating our framework. Moreover, Section 8.5 presents the evaluation experiments conducted and the results gathered. Ultimately, Section 8.6 presents conclusions and draws directions for future work.

8.2 Related Work

8.2.1 Sentiment Analysis and Classification Models

In the last decade, there has been an increasing interest in Sentiment Analysis [129] as well as in emotional models. This is mainly due to the recent growth of the data available on the World Wide Web, especially those which reflect people's opinions, experiences and feelings [130]. Early opinion mining studies focus on document-level sentiment analysis concerning movie or product reviews [72], [195] and posts published on web pages or blogs [190]. In recent years, thanks to the development of social media, the interest of the academic community and industry has turned to the processing and analysis of massive data. This has resulted in important research studies which solely focus on data from social networks. Below is an overview of the main studies that have been conducted on Twitter data, divided into supervised [161] and non-supervised machine learning techniques, where the former achieve better precision than the latter.

8.2.1.1 Supervised Machine Learning Approaches

Study [56] is among the first ones conducted on Sentiment Analysis of Twitter data. The authors adopt a binary sentiment classification problem by describing tweets as positive or negative. They apply a distant supervision technique in order to train a supervised machine learning classifier and then compare the algorithms of Naive Bayes [97], [142], Maximum Entropy and Support Vector Machines. The final training set includes tweets for subjectivity detection and positive or negative tweets for polarity classification. Another interesting approach is described in [12] for evaluating the effect of the small length of tweets on the usual supervised machine learning techniques. The authors collected tweets from the ten most popular topics in five categories (entertainment, products and services, sports, news and companies), creating a manually categorized gold-standard set. During the preprocessing step, they replace usernames, hashtags

and hyperlinks with pre-built keywords. As text representation attributes, unigrams, bigrams, trigrams, POS tags as well as POS n-grams are used. A two-phase classifier is proposed in [9]; in the first phase, the tweets are categorized as subjective (having emotion) or objective (neutral), and the subjective ones are then distinguished into positive or negative. For the construction of the training set, noisy emoticons are used as labels and three emotion detection tools (Twendz, Twitter Sentiment and TweetFeel) are put to use. The authors in [33] automatically classify the dataset of [123] by using categorization indexes (noisy labels) consisting of 50 hashtags and 15 emoticons. The set of features includes words, character n-grams (2–5), the length of each tweet, the amount of punctuation, exclamation points, question marks, capital letters and words, along with the existence of high-frequency words. In addition, a three-stage technique in [82] considers emotion analysis on specific target-dependent topics. Similarly to [9], the authors initially classify tweets as subjective or objective and then, in a second phase, the subjective tweets are classified as positive or negative using two separate SVM classifiers with a linear kernel function. They claim that conventional techniques, such as the ones proposed in [56] and [170], are not sufficient, as all their features are target-independent. In the third phase, they propose a graph-based method in order to increase the accuracy. They observe that their approach leads to better performance than the one proposed in [9], as their method is mainly based on lexical features instead of more abstract ones. A method for corpus collection and subsequently for building a sentiment classifier that is able to determine positive, negative and neutral sentiments is proposed in [127]. The classifier is based on the multinomial Naive Bayes classifier that uses n-grams and POS tags as features, considering conditional independence of the n-gram features and POS information. The experimental evaluation on a set of real microblogging posts proves that this technique is efficient and performs better than previously proposed methods.

8.2.1.2 Non-Supervised Machine Learning Works

The connection of polls with Sentiment Analysis on tweets referring to US President Barack Obama is considered in [123]. Specifically, through the Twitter API, 1 billion tweets posted during the period from 2008 to 2009 are collected, without checking the demographic characteristics of the authors or the writing language. The authors classify

each tweet by measuring whether it contains more positive or negative words, looking up the polarity of each word in the MPQA emotion dictionary [170]. Furthermore, the authors in [52] consider the predictability of the final result by applying popular techniques to tweets regarding the 2010 elections for US Representatives. These techniques are based on the previous method in [123], using the MPQA dictionary [170], and introduce some changes so that the overall model suits the nature and characteristics of each electoral system. Taking into account tweets that contain the names of rival candidates, they do not allow a tweet to have two opposite polarities simultaneously. Another non-supervised hybrid method is the one proposed in [101], which also deals with tweets, experimenting with large text corpora and dictionaries in order to determine the semantic orientation of the text terms. The proposed method of calculating the emotion value takes into account both the polarities of the dictionary and the number of emoticons, repeated letters, exclamation marks and capital letters, which are features commonly used by supervised techniques. Moreover, an entity-level model for data collected from Twitter is proposed in [190]. This model applies pre-processing to the corresponding dataset by deleting duplicates, removing usernames and hyperlinks, replacing abbreviations with their normal form and finally recognizing the grammatical terms of individual messages (POS tagging). Then, it calculates the emotional value of each term based on its similarity with emotions from dictionary words, and resolves simple references by assigning pronouns to the nearest entity in the text. The final step trains a binary SVM classifier, with labels resulting from the above non-supervised procedure, which classifies tweets into their final classes. In this work, we adopted the approach introduced in [169], which belongs to the supervised machine learning techniques, and implemented it in a distributed environment, ideal for big data management. More specifically, the emotion analysis is implemented in two phases, namely at tweet and at hashtag level. At tweet level, a two-phase SVM classifier, as in [9], is used: the former detects whether a tweet is emotional or not, and the tweets that carry emotion are driven to the input of the latter, which classifies them as positive or negative. Then, graph-based hashtag-level analysis of the tweets that carry emotion is applied, in order to achieve higher precision.

8.2.2 Cloud Computing Preliminaries

Apache Spark is an open-source framework designed specifically for cluster computing with the Scala programming language1. It is made to support general-purpose distributed applications, based (generally) on the processing of large-volume data, with a high degree of efficiency and speed. Regarding speed, Spark extends the popular MapReduce model in order to support more types of processing, such as interactive queries and stream processing. One of the main features of Spark is the ability to run computations in memory: Spark can run an application on a Hadoop cluster up to one hundred times faster using memory, and ten times faster using only the disk. For ease of use, Spark offers simple APIs in Python, Java, Scala and SQL, as well as rich embedded libraries. It can also be easily combined with other big data tools; for instance, it can run on a Hadoop cluster while having access to all Hadoop data. Finally, with regard to sophisticated analysis techniques, Spark supports SQL queries, data stream processing, and complex analytics such as machine learning algorithms out of the box. It also gives users the ability to combine all of these in a single program. The elements forming the Spark ecosystem are the following. Spark Core is the core of Spark and contains its basic functions: it includes the necessary elements for scheduling, memory management, fault recovery, interaction with the storage system, and others. Spark SQL provides support for SQL queries as well as for the SQL-like language created by Apache Hive, called Hive Query Language (HiveQL); apart from providing the SQL interface for Spark, Spark SQL enables developers to embed SQL queries in Spark programs. Spark Streaming is an element of Spark that allows data stream processing in real time. MLlib is a library of machine learning functions; it provides various machine learning algorithms, including binary classification, regression and collaborative filtering. GraphX is a library that was added in Spark 0.9 (February 2014) and provides an API for manipulating graphs and performing parallel graph computations; an example is processing a graph representing users' friendships in a social network.

1http://www.scala-lang.org/
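As a small, hedged illustration of the Spark SQL element described above, the following Scala sketch loads a hypothetical tweets.json file and runs an SQL query over it; the file name and its user/text schema are assumptions for illustration, and Spark 2.x is assumed.

```scala
import org.apache.spark.sql.SparkSession

// Hedged sketch: register a DataFrame as a temporary view and query it
// with SQL. The tweets.json path and schema are placeholders.
val spark = SparkSession.builder()
  .appName("SparkSqlSketch").master("local[4]").getOrCreate()

val tweets = spark.read.json("tweets.json") // assumed fields: user, text
tweets.createOrReplaceTempView("tweets")

spark.sql("SELECT user, COUNT(*) AS n FROM tweets GROUP BY user ORDER BY n DESC")
  .show(10)
```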

8.3 Sentiment Classification on Twitter

As aforementioned, a different approach for emotion analysis, using a graph consisting of hashtags, is proposed in [169]. This model is called the Hashtag Graph Model and examines the polarity of the graph nodes. Sentiment analysis is performed first at tweet level and then at hashtag level; these two steps are presented below.

8.3.1 Tweet-Level Sentiment Classification

To categorize tweets, we use a two-phase SVM classifier in order to estimate the sentiment deriving from each of them. In the first phase, a tweet is categorized as neutral or subjective, while in the second phase a subjective tweet is further categorized as positive or negative. Both SVMs use the same features, e.g., unigram words, punctuation and emoticons, as well as the sentiment lexicon.
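A minimal Scala sketch of such a two-phase scheme, using Spark MLlib's SVMWithSGD, is shown below; the RDD names, label encodings and iteration count are illustrative assumptions rather than the exact configuration used in this work.

```scala
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hedged sketch of a two-phase SVM: the first model separates neutral (0)
// from subjective (1) tweets; the second separates negative (0) from
// positive (1) among the subjective ones.
def trainTwoPhase(subjectivityData: RDD[LabeledPoint],
                  polarityData: RDD[LabeledPoint]): (SVMModel, SVMModel) = {
  val numIterations = 100 // assumed value
  val subjectivityModel = SVMWithSGD.train(subjectivityData, numIterations)
  val polarityModel = SVMWithSGD.train(polarityData, numIterations)
  (subjectivityModel, polarityModel)
}
```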

Fig. 8.1 An Example of the Hashtag Graph Model [169].

8.3.2 Hashtag-Level Sentiment Classification

Given a graph model, the aim is to solve the assignment inference problem by proposing efficient algorithms. To this point, three approaches have been introduced, namely Loopy Belief Propagation (LBP), Relaxation Labeling (RL) and the Iterative Classification Algorithm (ICA). Hashtags can be categorized either as positive or negative. Suppose a set of hashtags H = {h1, h2, ..., hm}, where each one is associated with a set of tweets T = {t1, t2, ..., tn}. The aim is to create a set y, which contains the polarity of each hashtag throughout H. We consider that hashtags which appear in the same tweet are more likely to have the same polarity. This drives us to introduce the graph for the analysis to come. That is, a graph HG over the hashtags is considered, entitled the Hashtag Graph Model, as presented in Figure 8.1. The graph is defined as HG = {H, E}, where H is the set of hashtags and E a set of edges, where each edge connects two hashtags that coexist in a tweet. Our primary goal is to categorize each hashtag as positive or negative. We therefore use the following equation:

\[
\log\left(\Pr(y \mid HG)\right) = \sum_{h_i \in H} \log\left(\phi_i(y_i \mid h_i)\right) + \sum_{(h_j, h_k) \in E} \log\left(\psi_{j,k}(y_j, y_k \mid h_j, h_k)\right) - \log(Z)
\]

wherein the first summation represents the tweet coefficient, while the second one the hashtag coefficient; Z is a normalization factor. Through this formulation, adjacent hashtags are allowed to affect the classification result. The final result derives from the maximization in Equation (8.1):

\[
\hat{y} = \arg\max_{y} \log\left(\Pr(y \mid HG)\right) \tag{8.1}
\]

The three algorithms are presented below, along with their corresponding pseudocodes.

8.3.2.1 Loopy Belief Propagation

LBP is an iterative algorithm that tries to classify each node in a graph through belief message passing. It was originally proposed for tree-like networks as a Bayes likelihood-ratio updating rule. Although it does not guarantee convergence to a fixed point after any number of iterations, LBP shows surprisingly good performance in practice. In fact, the propagation process tries to reach the stationary points of the Bethe approximation of the free energy for a factor graph. Initially, the algorithm initializes all edges of the graph with one message for each of the two labels (positive, negative). The next step is to refresh the messages through the multiplication of the functions φ and ψ with the messages of the neighboring nodes; finally, the label ŷ is calculated for each node.

Loopy Belief Propagation
1: Input: Hashtag Graph HG
2: Output: Sentiment label for each hashtag h
3: begin
4: for all (hi, hj) ∈ E do
5:     for all y ∈ {pos, neg} do
6:         mi→j(y) ← 1
7:         mj→i(y) ← 1
8:     end for
9: end for
10: repeat
11:     for all hi ∈ H do
12:         for all hj ∈ N(hi) do
13:             for all yj ∈ {pos, neg} do
14:                 mi→j(yj) ← α Σyi ψi,j(yi, yj) φi(yi) Πhk∈N(hi)\hj mk→i(yi)
15:             end for
16:         end for
17:     end for
18: until all mi→j(yi) stop changing
19: for all hi ∈ H do
20:     ŷi ← arg maxy∈{pos,neg} α φi(y) Πhj∈N(hi) mj→i(y)
21:     return ŷi
22: end for


8.3.2.2 Relaxation Labeling

RL is another algorithm for classification on the graph. Specifically, the value of di,j is used in order to capture the "significance" of node j with respect to node i. In addition, r(yi, yj) is used for estimating the compatibility between the labels yi and yj. One can also consider the probability bi(yi) of hashtag hi having the label yi.

8.3.2.3 Iterative Classification Algorithm

The ICA algorithm starts with the categorization of each hashtag by using a probability function of the tweet that contains it. After each iteration, it recomputes the probability of each node with respect to the probabilities of its neighbors and estimates the sentiment of the top-k hashtags; k is increased linearly at each iteration.

Relaxation Labeling
1: Input: Hashtag Graph HG
2: Output: Sentiment label for each hashtag h
3: begin
4: for all hi ∈ H do
5:     for all yi ∈ {pos, neg} do
6:         bi(yi) ← Στ∈Ti Pryi(τ) / Στ∈Ti Σy Pry(τ)
7:     end for
8: end for
9: repeat
10:     for all hi ∈ H do
11:         for all yi ∈ {pos, neg} do
12:             qi(yi) ← Σhj∈N(hi) di,j [Σyj r(yi, yj) bj(yj)]
13:             ai ← Σy bi(y) [1 + qi(y)]
14:             bi(yi) ← bi(yi) [1 + qi(yi)] / ai
15:         end for
16:     end for
17: until all bi(yi) stabilize
18: for all hi ∈ H do
19:     ŷi ← arg maxy∈{pos,neg} bi(y)
20: end for
21: return ŷi

8.4 Spark Implementation

In this section, the basic parts of the proposed algorithms are outlined. It has to be stated that our work is the first where these algorithms are implemented in the Scala programming language, which is the most appropriate one for the creation of the graph model. Twitter Stream: Initially, the algorithm takes as input a word (or even a set of words) from the command line and then retrieves the tweets that contain that specific word. Our experiments are based on Twitter and use the Twitter API to collect tweets; specifically, we implemented our Twitter connection using the Twitter4j2 library. One hurdle we had to overcome was that the Twitter API has a number of limitations, as it does not provide access to a large number of tweets.

2http://twitter4j.org/en/index.html

Iterative Classification
1: Input: Hashtag Graph HG
2: Output: Sentiment label for each hashtag h
3: begin
4: for all hi ∈ H do
5:     yi ← arg maxy∈{neg,pos} φ(y|hi)
6: end for
7: for t = 1 → M do
8:     for all hi ∈ H do
9:         compute pi(yi|HG, y)
10:        store pi ← maxy pi(y|HG, y)
11:        store yi ← arg maxy pi(y|HG, y)
12:    end for
13:    k ← (t/M) |H|
14:    Update the hashtag labels with the top-k pi
15: end for
16: return yi

Tweet Classification–MLlib: To categorize tweets, three classifiers are built with the help of MLlib3. MLlib is a Spark library which supports machine learning algorithms. During the training stage of the classifiers, we used a dataset which consists of 1,600,000 categorized tweets4. We processed the data with Spark SQL and used the Databricks5 library in order to further parse the corresponding files, the final target being sentiment prediction. Hashtag Graph Model–GraphX: The creation of the graph is considered an important step of our framework. In order to achieve this, we need three mutable lists, mutable meaning that the content of the list can be altered; in our case, though, the creation of a Resilient Distributed Dataset (RDD) does not allow such alteration. The lists hold the TCID links, the edges and the number of hashtags. Some operations need to be run in order to bring the hashtags into the form required by GraphX6. The disadvantage that GraphX does not support the insertion of additional nodes into a specific graph is surpassed due to the fact that the algorithm runs in real time: as new tweets arrive, the new hashtags are added to the list and the graph is created from the beginning.
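Since GraphX requires numeric vertex ids and immutable RDDs, a minimal Scala sketch of the hashtag co-occurrence graph construction described above is given below; the toy input, variable names and edge attribute are assumptions for illustration, not the dissertation's exact code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Hedged sketch: build a hashtag co-occurrence graph from (tweetId, hashtags).
val conf = new SparkConf().setAppName("HashtagGraph").setMaster("local[4]")
val sc = new SparkContext(conf)

// Assumed toy input: each tweet carries the hashtags it contains.
val tweets = sc.parallelize(Seq(
  (1L, Seq("#spark", "#bigdata")),
  (2L, Seq("#spark", "#scala", "#bigdata"))
))

// Assign a numeric vertex id to every distinct hashtag.
val vertices = tweets.flatMap(_._2).distinct().zipWithIndex()
val idOf = vertices.collectAsMap()

// One edge per pair of hashtags that co-occur in the same tweet.
val edges = tweets.flatMap { case (_, tags) =>
  tags.combinations(2).map { case Seq(a, b) => Edge(idOf(a), idOf(b), 1) }
}

val hashtagGraph = Graph(vertices.map(_.swap), edges)
println(s"vertices = ${hashtagGraph.numVertices}, edges = ${hashtagGraph.numEdges}")
```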

3http://spark.apache.org/mllib/ 4http://www.sentiment140.com/ 5https://databricks.com/ 6http://spark.apache.org/graphx/


8.5 Results and Evaluation

We have verified the results presented in [169] in terms of the Accuracy, Precision, Recall and F1 evaluation metrics for the Apache Spark implementation. The dataset used is the same as in [169]: 15,000 pre-classified tweets from a set of 600,000, all of which were categorized as positive, negative or neutral. The classifier used is the two-stage tweet-level SVM, and K-fold cross-validation (K = 5) is performed. Apart from verifying the results, we have also compared the SVM with the Naive Bayes, Logistic Regression and Decision Trees classifiers. The accuracy results are presented in Table 8.1, where the classifiers achieve comparable performance.

Table 8.1 Performance of Tweet-Level Classifiers

Classifier            Accuracy
2-phase SVM           84.13%
Naive Bayes           81.75%
Logistic Regression   72.45%
Decision Trees        76.23%

Since our motivation stems from the fact that we are interested in identifying a high number of correctly classified tweets in all classes of each classifier, Table 8.1 verifies our goal, meaning that the accuracy of the four classifiers remains high. The combination of the subjectivity and polarity classifiers shows the same high performance in terms of precision and F1, as shown in [169]. We can observe that the SVM outperforms the other three algorithms, achieving higher accuracy by a margin of about 3% to 12%. Specifically, as for the recall metric in [169], expressing the fraction of tweets in each class which are correctly classified, the results show that the subjectivity classifier has difficulty in correctly classifying the tweets of the subjective class. On the contrary, the polarity classifier achieves approximately 2x better performance in terms of the recall metric.

8.6 Conclusions

In our work, we have developed four classifiers in Spark, namely SVM, Naive Bayes, Logistic Regression and Decision Trees. The classifiers were trained with the use of a large dataset of 1,600,000 pre-classified tweets, where tweets were categorized as positive, negative or neutral. The achieved accuracy of the classifiers proves the classification efficiency and the stability of the results. All the classification algorithms are implemented in the Apache Spark cloud framework using Apache Spark's machine learning library, entitled MLlib. As future work, we plan to further investigate the effect of different data features, such as bigrams, trigrams, unigrams or POS tags, as introduced in our previous work [6]. Moreover, collecting data from other sources, such as posts from Facebook and Instagram, YouTube comments or even reviews from Foursquare, is a further area to be examined. Furthermore, we aim at experimenting with different clusters and evaluating Spark's performance with regard to time and scalability, as presented in our previous works in the Spark environment [87], [151].

CHAPTER 9

An Efficient Preprocessing Tool for Supervised Sentiment Analysis on Twitter Data

9.1 Introduction

The rapid development of modern computing systems, along with Internet access and high communication capabilities, has turned these systems into an integral part of human everyday life. Nowadays, users can express their personal opinion on any matter whenever they wish, as well as share their thoughts and feelings. It is no coincidence that most websites encourage their users to review their services or products, while social media accounts have significantly increased. In particular, through websites a user can be informed, express personal views on a variety of topics and simultaneously interact with other users. Hence, this kind of interaction produces a large amount of data that is of particular interest for further processing and analysis. Companies have started to poll these microblogs to get a sense of and understand the general sentiment towards their products. Often, these companies study user replies and then respond on the corresponding microblogs. The challenge is to build tools that can detect and summarize the overall sentiment, so that valuable conclusions and information on various kinds of issues can be drawn. For example, one can consider demographic and social conclusions, information of an economic nature, such as the prevailing view of a product or service, or even results of political content. Sentiment analysis constitutes a subtle field of information mining. It is considered the computational analysis and categorization of opinions, feelings and attitudes that are expressed in text format. Natural Language Processing techniques are employed to determine polarity, which can be characterized by many different classes. Specifically, the positive and negative terms, which correspond to the positive or negative view that a user holds, are utilized with respect to a specific event or topic.

Fig. 9.1 KDD process for knowledge mining from data [94]

It is widely noted that emotional analysis has many applications, as an individual may have a view on a huge range of issues of a different nature, such as economic, political, religious, and so on. For this reason, the positive and negative classes are not the only ones used, as aforementioned. It constitutes a basic way of studying and analyzing such data. Indeed, in recent years, the great volume of raw information generated by Internet users has increased the interest in processing such data. In other words, it is the process of mining and categorizing subjective data, so as to extract information about the polarity and the overall emotion expressed in natural language in text form [108]. Data mining is considered an important part of data analysis [192]. It largely consists of the collection, processing and modeling of data, and is aimed at the objectives shown in Figure 9.1. Its characteristics are the extraction of information from a dataset, its subsequent transformation into a comprehensible structure with the use of various techniques (machine learning, statistics, etc.) which facilitate the analysis, and finally the extraction of conclusions as well as decision-making. In the present work, the main contributions concern the following aspects. Concerning data pre-processing techniques, tweets pose several new challenges due to the typically short length and irregular structure of such content. Hence, data pre-processing constitutes a crucial step in the field of sentiment analysis since, by selecting the appropriate pre-processing methods, the number of correctly classified instances can be increased [65]. To properly implement the data analysis process, it is necessary to process the raw collected data in a variety of ways. Initially, it should be decided which data will be used, depending on the purpose of the analysis. Then, it is necessary to eliminate the

abnormalities and deal with incorrect values and/or incomplete inputs. Subsequently, the processed data are transformed into a form suitable for mining. Therefore, we focus on designing an efficient pre-processing tool which facilitates the sentiment analysis conducted with supervised machine learning algorithms. Another contribution is the application of a Latent Dirichlet Allocation (LDA) based probabilistic system to discover latent topics from the conversations connected to the event we take into account. The proposed research is organized as follows: Section 9.2 describes the background topics and the challenges faced, while Section 9.3 explains the process of information retrieval from the Twitter platform. The same section analyzes the tool for data pre-processing, which is the main contribution of this work. Furthermore, Section 9.4 presents the experimental results and the main conclusions extracted from the study and the analysis. Finally, Section 9.5 presents conclusions and draws directions for future work.

9.2 Related Work

Sentiment Analysis can be considered a field of data mining which intends to determine the polarity of a subjective text through the use of various algorithms. In recent years, this particular branch of science has started to gain increasing interest from both an academic and an industrial perspective; users have started to study the emotions around a subject for a variety of reasons. This interest is boosted by the rapid growth of users participating in social media, which is a modern world phenomenon. Some studies that have played a key role in the evolution and importance of emotional analysis are presented below. Pang and Lee [129] introduce the techniques and challenges in the application of emotional analysis, along with a plethora of widely used datasets. Also, the authors show how both the frequency of the terms and the n-gram feature selection affect the result. In addition, within the framework of the developed system, the frequency of individual terms and then the splitting of posts into n-grams were taken into account. Of particular interest is the application of emotional analysis at different hierarchical levels. Pang et al. [130] study the effectiveness of document-level analysis for a large number

of critically acclaimed films from the popular website IMDB. Turney et al. [164] examine the polarity of a document through its sentences. Phrases that include adjectives and/or adverbs, or other features and parts of speech that are highly likely to express the author's emotion, are selected. Concretely, user reviews on a variety of topics, such as movies, travel destinations, banks and cars, were utilized. Wilson et al. [171] analyzed the views of the MPQA corpus along with a dataset containing journal articles from different sources that have been rated in terms of their emotion. An important part of their work was the separation of phrases into polar and neutral (objective) ones and then the polarity analysis of the subjective phrases to extract the overall feeling of the text. Furthermore, emotional analysis can also be applied in the field of economic interest, to sets of journalistic content and film reviews [72, 183, 190], as well as to political aspects [163, 169]. Initially, the authors in [190] studied reviews of mobile applications and then extracted features important for their nature. Then, in [72], product reviews were analyzed to identify consumer sentiment in terms of certain characteristics and the products themselves. The authors in [183] constructed a model for emotional analysis based on travelers' reviews about the destinations they visited. In a similar way, a system that analyzes the sentiment of publications of real-time Twitter users to predict the results of the 2012 presidential election in the United States of America was created [169]. Finally, in [163], researchers used messages (not real-time data) that included references to German political parties in 2009. Previous works regarding emotional content are the ones presented in [84, 85, 86], in which the authors presented various approaches for the automatic analysis of tweets and the recognition of the emotional content of each tweet based on the Ekman emotion model, where the existence of one or more of the six basic human emotions (Anger, Disgust, Fear, Joy, Sadness and Surprise) is specified. Finally, a novel distributed framework, implemented both in Hadoop and in Spark, for exploiting the hashtags and emoticons inside a tweet is introduced in [87]. Moreover, Bloom filters are also utilized to increase the performance of the proposed algorithm.

9.3 Tools and Environment

The pre-processing methods evaluated by the current research are three different data representations, namely unigrams, bigrams and trigrams. Two well-known machine

learning algorithms were selected for classification, namely Naive Bayes and Support Vector Machine (SVM), as shown in the following section.

9.3.1 Twitter

The platform studied in this work is Twitter. This is a platform for publishing posts, exchanging messages between users, and modifying their private profiles according to their needs. There is the possibility of attaching links, images and audiovisual material to the posts. Twitter has gained considerable interest on a global scale due to the services it provides its users with. A special feature that makes emotional analysis quite difficult in its context is the small length of posts that it allows its users. It is therefore understood that studying the polarity of users' publications, beyond the general challenges and difficulties faced, is even more difficult due to their limited length.

9.3.2 Publications Mining Tools

The mining of posts was done using the Tweepy library which, through the Twitter interface, allows managing a user's profile, collecting data by optionally using certain search keywords, and finally creating and studying a stream of posts over a specific time interval. In this work, posts were stored in a CSV file, where the rows contain the posts that were extracted, while the columns contain the values of the different attributes of each post (e.g., date, text, username, etc.). Useful tools in the context of this work were the following:

1. The Natural Language Toolkit (NLTK) is a natural language processing library, which offers classification, parsing, tagging and stemming capabilities.

2. Scikit-Learn is a library that addresses the implementation and development of machine learning techniques and text processing tools. This library interacts with the NumPy and SciPy computational and scientific Python libraries, enhancing its efficiency and speed significantly.

3. Pickle is a library that converts objects into a form understandable only by the Python language, in order to limit the space they occupy in memory; this allows certain structures (e.g., in JSON format) to be stored and reused whenever necessary. In this work, this library was used to allow the reuse of the objects returned after the classifier training, without retraining, whenever necessary.

9.3.3 Pre-processing Scheme

In order to facilitate the mining process of the collected data, it is necessary to apply several pre-processing steps [50]. The main parts of this process are the following.

Pandas library: It was utilized to facilitate the management of the input files containing the components of the publications. A dataframe was then created containing only the components most important for our analysis; that is, records with incorrect values or without a rating were removed.

Regular expressions: They were utilized to remove URLs and references to other users' usernames. They also made it possible to find and replace or remove alphanumeric characters matching a predefined search pattern and to remove unnecessary spaces. The repetition of suffix characters used for emphasis, as well as numeric characters, which do not facilitate Sentiment Analysis, were removed too.

Emoticons: Emoticons are characters, such as punctuation marks and parentheses, which together form representations of expressions of the human face, e.g. a cheerful person {:-)}, as well as other representations that play an important role in analyzing the sentiment of each publication. Emoticons are widely used in social media, especially on Twitter, to express feelings and impressions in a short way. Therefore, a set of regular expressions containing a large part of these representations was created. Additionally, a set of widely used unofficial abbreviations was compiled in order to replace words that users make up; for example, the expression lol was replaced by its equivalent full form, namely laugh out loud.

Autocorrect library: This library uses a list of words found in recognized dictionaries and, given an input word, compares its similarity to the words on that list. If the input word is correctly spelled, it is returned as is. If it is not correctly spelled, its similarity to the words in the list is checked; if the similarity is greater than a certain threshold, the input word is replaced by the word in the list. Otherwise, it is returned as is, without any changes.

Pycontractions library: It detects a set of successive characters containing an apostrophe and replaces it with the full form of the expression. Expressions with the {'s} special character complicate sentiment analysis, since they express two or more words, which makes tokenization, as well as normalization, particularly difficult. For effective mining, contractions are written in their original form no matter how complicated that is. If the set has a single possible replacement, the expression is transformed into its original full form. Using a grammatical checker and Word Mover's Distance [102], a metric of the distance between the original text and the texts produced, called the "compatibility" metric, is derived; the substitution applied is the one with the highest value.

Emoji library: Emojis are Unicode characters in the form of icons representing facial expressions and many kinds of objects. This part differentiates our work from others, as in most other studies emoticons and emojis are not taken into account, even though they appear in most publications subject to opinion mining. This library uses a mapping list, as created by the Unicode Consortium. Unicode characters included in a text are reviewed and, if they appear in this list, they are replaced with the text form of their representation.

Part-of-Speech tagging: Each word of the text is tagged according to the part of speech that it constitutes (e.g., adverb, verb, noun). This process uses the context of the text being analyzed, as well as a set of aggregated elements (corpus), to evaluate and attribute the part of speech to the particular term being studied.

Lemmatization: It is the process in which lexical and morphological analyses of words are taken into account in order to remove complex suffixes and retrieve the lexical form of the term. It is applied after POS tagging and facilitates sentiment analysis through the application of machine learning algorithms. In the context of this work, the POS tagging labels follow the Penn Treebank format.

Tokenization: It is the separation of sentences into a list of symbol terms that can be used to reconstruct the original sentence. Both emoticons and emojis, which have by now been converted to text characters, are taken as tokens, without being divided into individual characters or punctuation marks. The tokenization process is applied to all sentences, and their terms are stored in the same token list; essentially, a token list is created for each post. Once the publication's details are in a list in the order they appear, some more conversions are made to optimize the process.

Punctuation: Punctuation marks are also tokens of the list. Generally, they do not attach any emotional significance to the publication and thus they are removed.

Stopwords: These are words that appear very often without expressing some form of feeling. Such words are removed because the whole attempt is to examine meaningful words in order to determine the overall emotion expressed in a publication.
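A condensed sketch of the regular-expression part of this pipeline is given below. The emoticon and abbreviation maps are toy fragments of the full sets described above, and the exact patterns are illustrative, not the ones used in this work; the NLTK 'stopwords' data is assumed to be installed.

import re
from nltk.corpus import stopwords

EMOTICONS = {":-)": "EMO_POS", ":(": "EMO_NEG"}   # toy fragment of the emoticon map
ABBREVIATIONS = {"lol": "laugh out loud"}         # toy fragment of the abbreviation map
STOP = set(stopwords.words("english"))

def preprocess(text):
    text = re.sub(r"https?://\S+", "", text)       # remove URLs
    text = re.sub(r"@\w+", "", text)               # remove references to other usernames
    for emoticon, token in EMOTICONS.items():      # map emoticons to sentiment tokens
        text = text.replace(emoticon, " " + token + " ")
    text = re.sub(r"(\w)\1{2,}", r"\1\1", text)    # collapse characters repeated for emphasis
    text = re.sub(r"\d+", " ", text)               # drop numeric characters
    tokens = [t for t in text.lower().split() if t not in STOP]
    return [ABBREVIATIONS.get(t, t) for t in tokens]  # expand unofficial abbreviations

print(preprocess("lol cooool flight :-) @airline https://example.com 123"))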

9.3.4 Features

N-grams are one of the most common structures used in the text mining and natural language processing fields. They constitute a set of co-occurring words within a given window. In addition, a Markov assumption is the assumption that the probability of a word depends only on the previous word; Markov models are probabilistic models whose use allows predicting the probability of a future aspect without looking too far into the past. The most popular such models are the bigrams, which look one word into the past, the trigrams, which look two words into the past, and, in general, the n-grams, which look n − 1 words into the past.

The "bag-of-words" approach is considered a very simple and flexible representation of text that describes the occurrence of words within a document. It involves a vocabulary of known words as well as a metric of the presence of these known words. The model considers only the existence of the known words in the document, and not the exact place where they are found in the document. The intuition is that documents are similar if they have similar content. The Summed Term Frequency constitutes the sum of all the term frequencies in the documents. In this work, it is utilized as

\mathrm{SummedTermFrequency} = \sum_{d \in D} TF_{n\text{-}gram}    (9.1)

In addition, the "Apply Features" method has been taken into consideration in order to obtain a feature-value representation of the documents. Concretely, this method is used in order to apply a "positive" or a "negative" label to each feature of the training data.
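As a small illustration, the following sketch computes n-gram counts and the Summed Term Frequency of Equation 9.1 with NLTK; the tokenized posts are toy stand-ins for the real documents d in D.

from collections import Counter
from nltk.util import ngrams  # n-gram generation over a token list

docs = [["great", "flight", "great", "crew"], ["late", "flight", "great", "service"]]

def summed_term_frequency(documents, n):
    """Sum the n-gram term frequencies over all documents, as in Equation 9.1."""
    total = Counter()
    for d in documents:
        total.update(ngrams(d, n))
    return total

print(summed_term_frequency(docs, 1))  # unigrams
print(summed_term_frequency(docs, 2))  # bigrams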

9.3.5 Topic Modeling

One other aspect we want to take into consideration in our proposed work is the verification of whether all the posts discuss the specific topic. Topic modeling considers a document as a "bag-of-topics" representation, and its purpose is to cluster each term in each post into a relevant topic. Variations of different probabilistic topic models [15], [115] have been proposed, and LDA [16] is considered to be a well-known method. Concretely, the LDA model extracts the most common topics discussed, represented by the words most frequently used, by simply taking as input a group of documents. The input is a term-document matrix, and the output is composed of two distributions, namely the document-topic distribution θ and the topic-word distribution φ. The EM [59] and Gibbs Sampling [58] algorithms were proposed to derive the distributions θ and φ. In this work, we use the Gibbs Sampling based LDA. In this approach, one of the most significant steps is updating the topic assignment individually for each term in every document, according to the probabilities calculated using Equation 9.2.

P(z_i = k \mid z_{-i}, w, \alpha, \beta) \propto \frac{\left(n^{-i}_{(k,m,\cdot)} + \alpha\right)\left(n^{-i}_{(k,\cdot,w_i)} + \beta\right)}{n^{-i}_{(k,\cdot,\cdot)} + V\beta}    (9.2)

where z_i = k denotes that the i-th term in a document is assigned to topic k, z_{-i} signifies all topic assignments except that of the i-th term, n^{-i}_{(k,m,\cdot)} is the number of times that document m contains topic k, n^{-i}_{(k,\cdot,w_i)} is the number of times that term w_i is assigned to topic k (both counts excluding the i-th term), V represents the size of the vocabulary, and α and β are hyperparameters for the document-topic and topic-word distributions, respectively. N Gibbs sampling iterations are performed over every term in the corpus; afterwards, the document-topic distribution θ and the topic-word distribution φ are estimated using Equations 9.3 and 9.4, respectively.

\hat{\theta}_{m,k} = \frac{n_{(k,m,\cdot)} + \alpha}{\sum_{k=1}^{K} n_{(k,m,\cdot)} + K\alpha}    (9.3)

\hat{\phi}_{k,v} = \frac{n_{(k,\cdot,v)} + \beta}{\sum_{v=1}^{V} n_{(k,\cdot,v)} + V\beta}    (9.4)
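To make the update concrete, the following is a minimal collapsed Gibbs sampler for LDA in Python, implementing Equations 9.2-9.4 over toy word-id documents. It is a sketch under simplified assumptions, not the exact implementation used in this work.

import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    M = len(docs)
    n_km = np.zeros((K, M))   # n_{(k,m,.)}: topic counts per document
    n_kv = np.zeros((K, V))   # n_{(k,.,v)}: word counts per topic
    n_k = np.zeros(K)         # n_{(k,.,.)}: total counts per topic
    z = [rng.integers(K, size=len(d)) for d in docs]
    for m, d in enumerate(docs):
        for i, w in enumerate(d):
            k = z[m][i]
            n_km[k, m] += 1; n_kv[k, w] += 1; n_k[k] += 1
    for _ in range(iterations):
        for m, d in enumerate(docs):
            for i, w in enumerate(d):
                k = z[m][i]   # remove the i-th assignment from the counts
                n_km[k, m] -= 1; n_kv[k, w] -= 1; n_k[k] -= 1
                # Equation 9.2: resample the topic of the i-th term
                p = (n_km[:, m] + alpha) * (n_kv[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[m][i] = k
                n_km[k, m] += 1; n_kv[k, w] += 1; n_k[k] += 1
    theta = ((n_km + alpha) / (n_km.sum(axis=0) + K * alpha)).T          # Equation 9.3
    phi = (n_kv + beta) / (n_kv.sum(axis=1, keepdims=True) + V * beta)   # Equation 9.4
    return theta, phi

theta, phi = gibbs_lda([[0, 1, 2], [2, 3, 3]], V=4, K=2)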

9.4 Evaluation

In Table 9.1, the two datasets studied, as well as their characteristics, are presented. There are 5 different sentiment categories; the first dataset contains tweets for all of them, while the second dataset contains tweets for 3 of them. The total number of tweets studied for each dataset is also reported. The first dataset consists of tweets about self-driving cars¹; the sentiment is categorized into 5 categories, ranging from very positive to very negative. The second dataset consists of the feelings that travelers expressed in February 2015 towards the problems of each major U.S. airline²; the sentiment in the tweets of this dataset is categorized as positive, neutral or negative for six US airlines.

Table 9.1 Datasets Details

Sentiment                 Selfdriving Cars Dataset   Airlines Dataset
Positive                  1262                       2363
Slightly Positive         1452                       –
Neutral                   4245                       3099
Slightly Negative         1498                       –
Negative                  1076                       9178
Total Number of Tweets    9533                       14640

The results of our work are presented in the following Tables 9.2 to 9.4. The accuracy, in terms of percentage, is used as the evaluation metric for the two different algorithms (Naive Bayes and SVM) and the three different setups (Unigrams, Bigrams and Trigrams). Also, the ratio of training to test set is taken into account when considering the two datasets. In Table 9.2, the results of the RapidMiner platform³ are presented. We have used RapidMiner as a baseline to emphasize the improvement of our proposed methodology; furthermore, in RapidMiner we cannot include features that are utilized in our work, such as emojis and emoticons. We observe that SVM performs better than Naive Bayes for the three different setups and for both datasets. Secondly, in both datasets, Unigrams and Bigrams achieve better accuracy than Trigrams; this is expected, as tweets are usually short due to the character limit, and thus

¹ https://www.kaggle.com/c/twitter-sentiment-analysis-self-driving-cars
² https://www.kaggle.com/crowdflower/twitter-airline-sentiment
³ https://rapidminer.com/

Trigrams cannot be considered as a qualitative metric.

Table 9.2 RapidMiner Results - Accuracy

Setup                    Selfdriving Cars Dataset   Airlines Dataset
Naive Bayes (Unigrams)   62.13                      64.95
Naive Bayes (Bigrams)    64.37                      62.40
Naive Bayes (Trigrams)   58.89                      50.66
SVM (Unigrams)           69.56                      75.50
SVM (Bigrams)            71.78                      73.20
SVM (Trigrams)           71.49                      70.95

In the following, Table 9.3 presents the results for different ratios of training versus test set. We utilized three different cases, with the training set being equal to 70%, 75% and 80% and the test set being equal to 30%, 25% and 20%, respectively. It is worth noting that our proposed methodology outperforms the results from RapidMiner: regarding the Selfdriving Cars dataset, for both classifiers and the three different setups, the accuracy ranges from a low of 72% to a high of 80%. On the other hand, for the Airlines dataset, we observe large fluctuations between the highest and lowest percentages. Concretely, Naive Bayes (Trigrams) achieves the lowest accuracy with almost 60%, and Naive Bayes (Unigrams) achieves the highest accuracy with 85% (for a training-test ratio equal to 80-20). Furthermore, we notice that in all cases, as the percentage of the training set increases, so does the accuracy; this is expected, as larger training sets improve the classifier's results.

Table 9.3 Accuracy for different Training - Test Set ratios

                         Selfdriving Cars dataset      Airlines dataset
Setup                    70-30    75-25    80-20       70-30    75-25    80-20
Naive Bayes (Unigrams)   71.99    72.73    74.04       82.60    83.72    85.18
Naive Bayes (Bigrams)    77.83    79.49    80.76       77.78    78.55    80.94
Naive Bayes (Trigrams)   76.57    77.93    79.44       59.45    59.70    60.72
SVM (Unigrams)           73.50    74.45    74.15       77.45    77.57    78.44
SVM (Bigrams)            76.43    77.77    78.76       68.90    69.43    69.56
SVM (Trigrams)           75.70    77.27    78.67       64.94    65.28    65.82

Finally, Table 9.4 presents the accuracy results when splitting with 10-fold cross-validation. The motivation for using this technique is that, with a simple split, important information may be left out of the training set. In addition, this method is simple to understand and generally results in a less biased and less optimistic estimate of the model's skill than other methods, such as a simple train/test split. As in Table 9.3, Naive Bayes outperforms SVM, achieving the highest percentage with a value equal to 85%. What is more, for both datasets, Naive Bayes has values close to 79%, whereas SVM has different values for the three setups.

Table 9.4 10-Fold Cross-Validation

Setup                    Selfdriving Cars Dataset   Airlines Dataset
Naive Bayes (Unigrams)   78.66                      85.38
Naive Bayes (Bigrams)    79.92                      79.99
Naive Bayes (Trigrams)   79.29                      74.26
SVM (Unigrams)           73.04                      77.26
SVM (Bigrams)            76.59                      69.32
SVM (Trigrams)           75.93                      65.25
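For reference, a minimal sketch of such a 10-fold evaluation with Scikit-Learn is given below; the synthetic features stand in for the vectorized tweets, so the numbers it prints are illustrative only.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in for the vectorized tweet features and sentiment labels.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

for name, clf in [("Naive Bayes", GaussianNB()), ("SVM", SVC())]:
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(name, scores.mean())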

9.5 Conclusions and Future Work

In this chapter, we proposed a pre-processing framework for Twitter sentiment classification. We chose Twitter because of the tweets' short length and the diversity of their content. We used supervised machine learning techniques for the analysis of the raw data in the user posts and incorporated emojis and emoticons in order to enrich our features. Furthermore, we applied a probabilistic system based on Latent Dirichlet Allocation (LDA) to discover latent topics from the conversations. Two popular classifiers (Naive Bayes and SVM) were used with three different data representations (unigrams, bigrams and trigrams) in order to perform our experiments on two datasets.

In the near future, we plan to extend and improve our framework by exploring more traits that may be added to the feature vector and increase the classification performance. Moreover, we plan to compare the classification performance of our solution with other classification methods. Another future consideration is the adoption of other heuristics for handling complex semantic issues, such as the irony that is typical of messages on Twitter.

CHAPTER 10

An Apache Spark Methodology for Forecasting Tourism Demand in Greece

10.1 Introduction

Tourism is a factor of high importance for a country's economy, as it contributes to its growth not only directly, by generating cash flows, but also by creating new jobs. It is therefore of pivotal importance for countries like Greece to forecast tourism demand. However, the real-world tourism industry is so dynamic that the process of Knowledge Discovery in Databases (KDD) becomes necessary. This represents the interdisciplinary field, generally known as "Data Mining", of which machine learning methods are a structural part. Our research is vital for Greece, as we propose a model suitable for forecasting tourism demand using Data Mining and specific machine learning methods on data originating from public sources. The demand variable to be forecasted is the quarterly tourist arrivals in Greece.

In the present chapter, an essential methodology using machine learning techniques on Apache Spark, a cluster computing system, along with a robust machine learning library, is introduced. The current work aims at building an accurate multivariate predicting model which could be integrated with public information systems, giving up-to-date analytics and a boost to the tourism sector along with the economy in general. The proposed methodology can be easily adapted to provide valuable data for other countries or continents, due to the unique features of the analytics platform, which provide scalability and robustness.

The rest of the chapter is organized as follows. Section 10.2 presents related work. Section 10.3 focuses on the methods for forecasting tourism as well as the Machine Learning algorithm utilized in our proposed system. In addition, our proposed model is properly introduced and further analyzed in Section 10.4. Furthermore, in Section 10.5, the evaluation of the experiments, as well as the results obtained, are presented. Ultimately, Section 10.6 depicts conclusions and draws directions for future work.

10.2 Related Work

Tourism forecasting is a field with growing academic interest, as evidenced by the extensive and growing available literature. Traditional methods in forecasting tourism use statistical and econometric models relying on historical data. However, these methods lack accuracy, as they focus on long-term horizons. A solution to this problem could be to utilize data on a monthly, weekly or daily basis, which enhances short-term forecasting [181]. Initially, authors in [3] propose the FSS, a forecasting support system which consists of "a set of procedures (typically computer-based) that supports forecasting". Also, the Baidu Index is used in [73] to propose a novel technique for predicting tourist flows; Baidu is a search engine whose index stores the history of the use of different keywords in online searching. Furthermore, the work introduced in [146] portrays a novel hybrid intelligent model called the Modular Genetic-Fuzzy Forecasting System. Other previous works proposing cloud-based architectures based on Apache Spark are [7] and [152]. Related studies on forecasting tourism demand are those of Mozambique and Portugal conducted in [30] and [160], respectively. In [168], a NoSQL database approach for modeling heterogeneous and semi-structured information by integrating Apache Spark with Apache Cassandra was presented; the authors focus on a model capable of predicting the relationship between tourist arrivals and nights spent in Greece.

10.3 Preliminaries

10.3.1 Forecasting Tourism Methods

Tourism demand forecasting methods can be broadly categorized into two groups, namely qualitative and quantitative methods. Qualitative methods usually depend on intuition, experience and insight into a specific tourism market. On the other hand, quantitative methods are statistical methods with a mathematical base. As presented in [155], quantitative methods can be distinguished into time-series models, econometric models, and Artificial Intelligence (AI) models. Time-series and econometric models are well-adopted as far as tourism forecasting is concerned. AI models correspond to machine learning methods appropriate for forecasting tourism demand. A rough set approach was selected there in order to improve the comprehensibility of the tourism demand model.

10.3.2 Apache Spark

Apache Spark [188] was developed at UC Berkeley's AMPLab¹ and is commonly used for big data processing. This distributed open-source processing system is a fast, optimized engine that offers APIs in Java, Scala, Python and R. It can run standalone or over Hadoop or Mesos, and it can access data sources like HDFS, Cassandra and HBase. The Apache Spark architecture includes Spark SQL and DataFrame operations. Spark's distributed data-sharing abstraction is called Resilient Distributed Datasets (RDDs). RDDs are fault-tolerant collections of objects partitioned across a cluster that can be queried in parallel and used in a variety of workload types. They provide a flexible interface and, unlike the existing DataFrame APIs in R and Python, DataFrame operations in Spark SQL go through an extensible relational optimizer called Catalyst. Finally, Spark is built from the ground up for performance and reliability and takes advantage of the operational and debugging tools developed for the Java stack [141].
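As an illustration of these abstractions (the listings later in this chapter use the Scala API; the following sketch uses the equivalent PySpark API on toy data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-dataframe-demo").getOrCreate()

# RDDs: fault-tolerant, partitioned collections processed in parallel.
rdd = spark.sparkContext.parallelize(range(10), numSlices=4)
print(rdd.map(lambda x: x * 2).reduce(lambda a, b: a + b))

# DataFrame operations are planned through the Catalyst optimizer.
df = spark.createDataFrame([(2014, 5000), (2015, 6500)], ["year", "arrivals"])
df.filter(df.year > 2014).show()

spark.stop()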

10.3.3 Machine Learning Algorithm

The core system of Spark consists of different libraries and components that provide a rich set of higher-level tools, including MLlib² for Machine Learning. The Decision Tree (DT) algorithm is a supervised method that aims to discover any relationship between input and target attributes. This relationship is represented in a structure known as a model. The input attributes of the model (independent variables) can be either categorical or continuous, determining whether the DT model is a Classification or a Regression Tree, respectively.

¹ https://spark.apache.org/
² https://spark.apache.org/mllib/

Optimal DT algorithms are feasible only for small problems. The solution is the use of heuristic methods, which are divided into two groups, top-down and bottom-up, where the former are preferred according to the literature. ID3, C4.5 and CART belong to this group of heuristic methods [139]. The core algorithm takes an input with three basic parameters. Initially, the data parameter consists of the initial set of tuples and the relevant values of the target variables. Then, the attributes parameter, which refers to the total input variables, is considered. Finally, the default category is the majority value of data in each recursive run of the main algorithm. The selection of the choose-attribute procedure is vital, as the success of the results depends on its outcome. The procedure is differentiated according to the type and form of the attributes; for categorical variables, such as gender, the choose-attribute procedure implements evaluation algorithms such as the Gini Index, Entropy, and Chi-Square (χ²). On the other hand, in the case of a continuous variable, such as height, the evaluation algorithm that must be used is Reduction in Variance.

Algorithm 14 Decision Tree (DT) algorithm
1: input data, attributes, default category
2: output Decision Tree
3: if data is empty then
4:     return default category
5: else if each data belongs to the same category then
6:     return category
7: else if attributes is empty then
8:     return majority-class(data)
9: else
10:    best = choose-attribute(attributes, data)
11: end if
12: tree = new Decision Tree checking the root via best attribute
13: m = majority-value(data)
14: for all Vi ∈ best do
15:    datai = data where Vi = best
16: end for
17: subtree = Decision Tree(datai, attributes without best, m)
18: add subtree as leaf to the tree labeled by Vi
19: return tree

The categorical-variable metrics that are taken into consideration are the following:

• Gini index (or population diversity) is a measure of impurity which represents the accuracy of the random-guess classifier.

I_G(P) = 1 - \sum_{i=1}^{N} P_i^2    (10.1)

• Entropy (or information gain) is a measure of impurity that comes from information theory and represents the uncertainty when predicting in which subset a data item falls.

I_E(P) = \sum_{i=1}^{N} P_i \log\left(\frac{1}{P_i}\right) = -\sum_{i=1}^{N} P_i \log(P_i)    (10.2)

• Chi-square test is used for testing the independence between the occurrences of two particular concepts.

\chi^2(x) = \sqrt{\frac{(x - \mathrm{expected}(x))^2}{\mathrm{expected}(x)}}    (10.3)

On the other hand, as a non-categorical-variable metric, we considered the following:

• Reduction in Variance is a method to determine the optimal fit by measuring whether or not a split will result in a reduction of variance within the data.
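A minimal sketch of the impurity computations of Equations 10.1-10.3 in Python is given below; the probability vectors and the observed/expected counts are toy inputs.

import math

def gini_index(probabilities):
    """Gini impurity, Equation 10.1: 1 - sum_i P_i^2."""
    return 1.0 - sum(p * p for p in probabilities)

def entropy(probabilities):
    """Entropy, Equation 10.2: -sum_i P_i log(P_i)."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

def chi_square(observed, expected):
    """Chi-square statistic for one cell, as in Equation 10.3."""
    return math.sqrt((observed - expected) ** 2 / expected)

print(gini_index([0.5, 0.5]), entropy([0.5, 0.5]), chi_square(30, 25))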

10.4 Implementation

10.4.1 Methodology

This work uses statistical and machine-learning techniques on large volumes of unstructured and/or structured data in a distributed computing environment, with the aim of identifying correlations, causal relationships, patterns and anomalies, predicting events, as well as inferring probabilities, interest and sentiment [32]. The KDD process utilizes Data Mining methods to extract what is deemed knowledge, according to the specification of measures and thresholds, using a database along with any required preprocessing, subsampling and transformation of the database [5]. In 2016, the KDDS process was introduced in order to address big data problems while providing further integration with management processes. KDDS defines four distinct phases [57]:

• Assess: The first phase is a planning, analysis of alternatives and rough order of magnitude estimation process.

• Architect: The second phase consists of translating the requirements into a solution for a new system, or into a way of addressing the gaps from the current state to the final state that will satisfy the requirements for an enhanced system.

• Build: The third phase consists of the development, test and deployment of the technical solution.

• Improve: The final phase consists of the operation and management of the system, as well as an analysis of innovative ways in which the system performance could be improved.

Our proposed method follows the KDDS process as a guideline methodology for tourism forecasting in Greece. Since our motivation stems from our interest in a model supporting big data technologies, our method can be extended to include more countries. When executing a data science project, the challenge does not merely lie in choosing the best algorithm and the best application [143].

10.4.2 Dataset Description

The collection and analysis of statistical data from tourism are of great importance. In Europe, and especially in Greece, the need for systematically monitoring the status of European tourism has led to Regulation 692/2011, which develops a common framework for the processing and exchange of European statistics on tourist supply and demand. Also, the relationship between tourism demand and various macroeconomic variables, such as GDP, income and Net Disposable Income (NDI), is established in the international literature [182]. So, these variables, shown in Table 10.1 and publicly available online, are used as explanatory variables in our research. The final dataset includes variables of monthly frequency, as well as rarely used variables such as Google Trends, Currency Exchange Rates and the Stock Market Index.

Table 10.1 Data Sources

Organization                                          URL
Hellenic Statistical Authority (EL.STAT.)             http://www.statistics.gr
Eurostat                                              https://ec.europa.eu/commission
United Nations World Tourism Organization (UNWTO)     http://www2.unwto.org
Google Trends                                         https://trends.google.com/trends

In order to conduct the experiments of a specific case study, data for the period 2006 to 2015 were used (Figure 10.1); tourist arrivals up to the year 2018 were to be predicted from a total of 20 variables. The heterogeneity among the different data sources required an initialization process so that the data would be usable for our analysis.

Fig. 10.1 Tourist Arrivals in Greece (2006 - 2015)

10.5 Experiments - Evaluation

Initially, the dataset is randomly split into two subsets: 90% forms the training set and the remaining 10% the test set.

val Array(trainData, testData) = data.randomSplit(Array(0.9, 0.1))
trainData.cache()
testData.cache()

Listing 10.1: Dataset Split

In order for the input data to be used with any classifier in Spark MLlib, the data must be collected in one column, in contrast to the input DataFrame, which has one column for each feature. The solution is VectorAssembler, a Spark core class which outputs a feature vector where the target column is excluded.

import org.apache.spark.ml.feature.VectorAssembler

val inputCols = trainData.columns.filter(_ != "Tourist Arrivals")
val assembler = new VectorAssembler().
  setInputCols(inputCols).setOutputCol("featureVector")
val assembledTrainData = assembler.transform(trainData)
assembledTrainData.select("featureVector").show(truncate = false)

Listing 10.2: Vector Assembler

In the following, the output gives the Decision Tree constructed by using the DecisionTreeClassifier from Spark MLlib.

import org.apache.spark.ml.classification.DecisionTreeClassifier
import scala.util.Random

val classifier = new DecisionTreeClassifier().
  setSeed(Random.nextLong()).
  setLabelCol("Tourist Arrivals").
  setFeaturesCol("featureVector").
  setPredictionCol("Predictions")
val model = classifier.fit(assembledTrainData)

Listing 10.3: Decision Tree Classifier

The predictions show a remarkable deviation from the real values, as shown in Figure 10.2 for the years 2014 and 2015. This is because the DecisionTreeClassifier implementation has several hyperparameters for which a value must be chosen; in our experiments, the default values were used.

val predictions = model.transform(assembledTrainData)
predictions.select("Predictions").show(truncate = false)

Listing 10.4: Data Prediction

The MulticlassClassificationEvaluator is used to compute the accuracy and other metrics of the predictions.

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val evaluator = new MulticlassClassificationEvaluator().
  setLabelCol("Tourist Arrivals").setPredictionCol("Predictions")
evaluator.setMetricName("accuracy").evaluate(predictions)
evaluator.setMetricName("f1").evaluate(predictions)

Listing 10.5: Multiclass Classification Evaluator

Our method achieves an accuracy of 75%, which means that it can be improved by using appropriate hyperparameters. Hyperparameter selection constitutes a major task that can be performed in future research. The goal is not merely to build a classifier but to make appropriate predictions [141]; so, finding the "best model" is only the beginning. The model consists of a group of operations that transform the input into the appropriate DataFrame and then make predictions. In our research, the predictions for the future are illustrated in Figure 10.2.
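For illustration, a hedged PySpark counterpart of Listing 10.3 with these hyperparameters set explicitly is sketched below; the specific values are hypothetical, not the ones tuned in this work.

from pyspark.ml.classification import DecisionTreeClassifier

# Spark MLlib exposes the hyperparameters discussed above; here they are set
# explicitly (illustrative values) instead of relying on the defaults.
tuned = DecisionTreeClassifier(
    labelCol="Tourist Arrivals",
    featuresCol="featureVector",
    predictionCol="Predictions",
    maxDepth=10,         # maximum depth of the tree
    maxBins=64,          # maximum number of bins when discretizing features
    impurity="entropy",  # impurity measure: "gini" or "entropy"
    minInfoGain=0.01)    # minimum information gain required for a split
# model = tuned.fit(assembledTrainData)  # assembledTrainData as in Listing 10.2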

Fig. 10.2 Predictions (2014 - 2018)

10.6 Conclusions and Future Work

In this chapter, a methodology for forecasting tourism demand by modeling unstructured data using Apache Spark and Spark MLlib was proposed. The dataset was constructed from publicly available websites such as the Hellenic Statistical Authority, Eurostat, and Google Trends. The model used to support the proposed methodology is a Decision Tree with the default values for the hyperparameters. The results of the methodology's application were quite satisfactory and can be further improved by providing the appropriate tuning parameters. Moreover, the usage of the appropriate metrics according to the type of the variables used can be tested in combination with better tuning as far as the hyperparameters are concerned. The hyperparameters control how the tree's decisions are chosen and can be quite different taking into account the maximum depth, the maximum number of bins, the impurity measure, as well as the minimum information gain.

Regarding future directions, different machine learning techniques can be utilized to improve the forecasting accuracy for tourism demand. A different approach, such as Support Vector Machines (SVMs), introduced in [25], can be utilized. Also, the use of more relevant explanatory variables could help improve the forecasting accuracy as well as the performance of the model. In this task, the contribution of tourism stakeholders is necessary, as they have more information and can propose variables such as marketing campaigns and useful social media data. For further enrichment of our data, social media analytics can also be included in the constructed dataset. Finally, another issue concerns the development of a clustered system where big-data-related techniques will be utilized to further forecast tourist demand in even more countries.

REFERENCES

[1] Mohammad Ahmadian, Frank Plochan, Zak Roessler, and Dan C. Marinescu. Securenosql: An approach for secure search of encrypted nosql databases in the public cloud. International Journal of Information Management, 37(2):63–74, 2017.

[2] Louai Alarabi. Summit: A scalable system for massive trajectory data management. In 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL), pages 612–613, 2018.

[3] Jon Scott Armstrong. Principles of Forecasting: A Handbook for Researchers and Practitioners, volume 30. Springer Science and Business Media, 2001.

[4] Alessandro Attanasi, Andrea Cavagna, Lorenzo Del Castello, Irene Giardina, Stefania Melillo, Leonardo Parisi, Oliver Pohl, Bruno Rossaro, Edward Shen, Edmondo Silvestri, et al. Collective behaviour without collective order in wild swarms of midges. PLoS computational biology, 10(7):e1003697, 2014.

[5] Ana Azevedo and Manuel Filipe Santos. KDD, SEMMA and CRISP-DM: a parallel overview. In IADIS European Conference on Data Mining, pages 182–185, 2008.

[6] Alexandros Baltas, Andreas Kanavos, and Athanasios Tsakalidis. An apache spark implementation for sentiment analysis on twitter data. In International Workshop on Algorithmic Aspects of Cloud Computing (ALGOCLOUD), pages 15–25, 2016.

[7] Alexandros Baltas, Andreas Kanavos, and Athanasios Tsakalidis. An apache spark implementation for sentiment analysis on twitter data. In 1st International Workshop on Algorithmic Aspects of Cloud Computing (ALGOCLOUD), pages 15–25, 2016.

[8] Endre Bangerter, Jan Camenisch, and Anna Lysyanskaya. A cryptographic framework for the controlled release of certified data. In 12th International Workshop on Security Protocols, volume 3957, pages 20–42, 2004.

[9] Luciano Barbosa and Junlan Feng. Robust sentiment detection on twitter from biased and noisy data. In International Conference on Computational Linguistics: Posters (COLING), pages 36–44, 2010.

[10] Anirban Basu, Anna Monreale, Juan Camilo Corena, Fosca Giannotti, Dino Pedreschi, Shinsaku Kiyomoto, Yutaka Miyake, Tadashi Yanagihara, and Roberto Trasarti. A privacy risk model for trajectory data. In 8th IFIP International Conference on Trust Management, pages 125–140, 2014.

[11] Rohan Baxter, Peter Christen, Tim Churches, et al. A comparison of fast blocking methods for record linkage. In ACM SIGKDD, volume 3, pages 25–27, 2003.

[12] Adam Bermingham and Alan F. Smeaton. Classifying sentiment in microblogs: is brevity an advantage? In ACM Conference on Information and Knowledge Management (CIKM), pages 1833–1836, 2010.

[13] Elisa Bertino and Ravi S. Sandhu. Database security - concepts, approaches, and challenges. IEEE Transactions on Dependable and Secure Computing, 2(1):2–19, 2005.

[14] Ankita Bhatewara and Kalyani Waghmare. Improving network scalability using nosql database. International Journal of Advanced Computer Research, 2(4):488, 2012.

[15] David M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.

[16] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[17] Burton H Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, 1970.

[18] James Blustein and Amal El-Maazawi. Bloom filters: A tutorial, analysis and survey, 2002.

[19] Stefan Brands, Liesje Demuynck, and Bart De Decker. A practical system for globally revoking the unlinkable pseudonyms of unknown users. In 12th Australasian Conference on Information Security and Privacy (ACISP), volume 4586, pages 400–415, 2007.

[20] Eric A. Brewer. Towards robust distributed systems. In 19th Annual ACM Symposium on Principles of Distributed Computing, page 7, 2000.

[21] Andrei Z. Broder and Michael Mitzenmacher. Network applications of bloom filters: A survey. Internet Mathematics, 1(4):485–509, 2003.

[22] Adrian P Brown, Christian Borgs, Sean M Randall, and Rainer Schnell. Evaluating privacy-preserving record linkage using cryptographic long-term keys and multibit trees on large medical datasets. BMC Medical Informatics and Decision Making, 17(1):83, 2017.

[23] Zeynel Cebeci and Figen Yildiz. Comparison of k-means and fuzzy c-means algorithms on different cluster structures. Agrárinformatika / Journal of Agricultural Informatics, 6(3):13–23, 2015.

[24] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26(2):4:1–4:26, 2008.

[25] Kuan-Yu Chen and Cheng-Hua Wang. Support vector regression with genetic algorithms in forecasting tourism demand. Tourism Management, 28(1):215–226, 2007.

[26] Peter Christen. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer Science and Business Media, 2012.

[27] Kenneth J. Christensen, Allen Roginsky, and Miguel Jimeno. A new analysis of the false-positive rate of a bloom filter. Information Processing Letters, 110(21):944–949, 2010.

[28] Chris Clifton, Murat Kantarcioglu, AnHai Doan, Gunther Schadow, Jaideep Vaidya, Ahmed K. Elmagarmid, and Dan Suciu. Privacy-preserving data integration and sharing. In SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD), pages 19–26, 2004.

[29] Pietro Colombo and Elena Ferrari. Fine-grained access control within nosql document-oriented datastores. Data Science and Engineering, 1(3):127–138, 2016.

[30] HA Constantino, Paula O Fernandes, and João Paulo Teixeira. Tourism demand modelling and forecasting with artificial neural network models: the Mozambique case study. Tékhne, 14(2):113–124, 2016.

[31] Alfredo Cuzzocrea. Algorithms for managing, querying and processing big data in cloud environments, 2016.

[32] Manirupa Das, Renhao Cui, David R. Campbell, Gagan Agrawal, and Rajiv Ramnath. Towards methods for systematic research on big data. In IEEE International Conference on Big Data, pages 2072–2081, 2015.

[33] Dmitry Davidov, Oren Tsur, and Ari Rappoport. Enhanced sentiment learning using twitter hashtags and smileys. In International Conference on Computational Linguistics: Posters (COLING), pages 241–249, 2010.

[34] Ali Davoudian, Liu Chen, and Mengchi Liu. A survey on nosql stores. ACM Computing Surveys, 51(2):40:1–40:43, 2018.

[35] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's highly available key-value store. In 21st ACM Symposium on Operating Systems Principles (SOSP), pages 205–220, 2007.

[36] Anthony I Dell, John A Bender, Kristin Branson, Iain D Couzin, Gonzalo G de Polavieja, Lucas PJJ Noldus, Alfonso Pérez-Escudero, Pietro Perona, Andrew D Straw, Martin Wikelski, et al. Automated image-based tracking and its application in ecology. Trends in Ecology and Evolution, 29(7):417–428, 2014.

[37] Tianyang Dong, Yuan Lulu, Yuehui Shang, Yang Ye, and Ling Zhang. Direction-aware continuous moving k-nearest-neighbor query in road networks. ISPRS International Journal of Geo-Information, 8(9):379, 2019.

[38] Elias Dritsas, Andreas Kanavos, Maria Trigka, Spyros Sioutas, and Athanasios K. Tsakalidis. Storage efficient trajectory clustering and k-nn for robust privacy preserving spatio-temporal databases. Algorithms, 12(12):266, 2019.

[39] Elias Dritsas, Maria Trigka, Panagiotis Gerolymatos, and Spyros Sioutas. Trajectory clustering and k-nn for robust privacy preserving spatiotemporal databases. Algorithms, 11(12):207, 2018.

[40] Elias Dritsas, Maria Trigka, Panagiotis Gerolymatos, and Spyros Sioutas. Trajectory clustering and k-nn for robust privacy preserving spatiotemporal databases. Algorithms, 11(12):207, 2018.

[41] Elizabeth A Durham, Murat Kantarcioglu, Yuan Xue, Csaba Toth, Mehmet Kuzu, and Bradley Malin. Composite bloom filters for secure record linkage. IEEE Transactions on Knowledge and Data Engineering, 26(12):2956–2968, 2014.

[42] Elizabeth Ashley Durham. A framework for accurate, efficient private record linkage. PhD thesis, University of Texas at Dallas, 2012.

[43] Ahmed Eldawy, Louai Alarabi, and Mohamed F. Mokbel. Spatial partitioning techniques in spatialhadoop. Proceedings of the VLDB Endowment, 8(12):1602–1605, 2015.

[44] Cheikh Kacfah Emani, Nadine Cullot, and Christophe Nicolle. Understandable big data: A survey. Computer Science Review, 17:70–81, 2015.

[45] Ping Fan, Guohui Li, and Ling Yuan. Continuous k-nearest neighbor processing based on speed and direction of moving objects in a road network. Telecommunication Systems, 55(3):403–419, 2014.

[46] Ping Fan, Guohui Li, Ling Yuan, and Yanhong Li. Vague continuous k-nearest neighbor queries over moving objects with uncertain velocity in road networks. Information Systems, 37(1):13–32, 2012.

[47] Ivan P Fellegi. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, 1969.

[48] Yu Feng, Jianzhong Zhou, and Muhammad Tayyab. Kernel clustering with a differential harmony search algorithm for scheme classification. Algorithms, 10(1):14, 2017.

[49] Zhenni Feng and Yanmin Zhu. A survey on trajectory data mining: Techniques and applications. IEEE Access, 4:2056–2067, 2016.

[50] Salvador García, Julián Luengo, and Francisco Herrera. Data Preprocessing in Data Mining, volume 72 of Intelligent Systems Reference Library. Springer, 2015.

[51] Francisco García-García, Antonio Corral, Luis Iribarne, and Michael Vassilakopoulos. Improving distance-join query processing with voronoi-diagram based partitioning in spatialhadoop. Future Generation Computer Systems, 2019.

[52] Daniel Gayo-Avello, Panagiotis Takis Metaxas, and Eni Mustafaraj. Limits of electoral predictions using twitter. In International Conference on Weblogs and Social Media (ICWSM), 2011.

[53] Panagiotis Gerolymatos, Spyros Sioutas, Nikolaos Nodarakis, Alexandros Panaretos, and Konstantinos Tsakalidis. Smart: A novel framework for addressing range queries over nonlinear trajectories. Journal of Systems and Software (JSS), 105:79–90, 2015.

[54] Konstantinos Giannousis, Konstantina Bereta, Nikolaos Karalis, and Manolis Koubarakis. Distributed execution of spatial SQL queries. In IEEE International Conference on Big Data, pages 528–533, 2018.

[55] Seth Gilbert and Nancy A. Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News, 33(2):51–59, 2002.

[56] Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, pages 1–6, 2009.

[57] Nancy W. Grady. KDD meets big data. In IEEE International Conference on Big Data, pages 1603–1608, 2016.

[58] Thomas L. Griffiths. Gibbs sampling in the generative model of latent dirichlet allocation. 2002.

[59] Thomas L. Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235, 2004.

[60] Joachim Gudmundsson, Jyrki Katajainen, Damian Merrick, Cahya Ong, and Thomas Wolle. Compressing spatio-temporal trajectories. Computational Geometry, 42(9):825–841, 2009.

[61] Manoj Kumar Gupta and Pravin Chandra. An empirical evaluation of like operator in oracle. BVICAM's International Journal of Information Technology, 3(2), 2011.

[62] Neha Gupta and Rashmi Agrawal. Chapter four - nosql security. Advances in Computers, 109:101–132, 2018.

[63] Ralf Hartmut Güting, Thomas Behr, and Jianqiu Xu. Efficient k-nearest neighbor search on moving object trajectories. The VLDB Journal, 19(5):687–714, 2010.

[64] Juan Carlos Guzman, Patricia Melin, and German Prado-Arechiga. Design of an optimized fuzzy classifier for the diagnosis of blood pressure with a new computational method for expert rule optimization. Algorithms, 10(3):79, 2017.

[65] Emma Haddi, Xiaohui Liu, and Yong Shi. The role of text pre-processing in sentiment analysis. In 1st International Conference on Information Technology and Quantitative Management (ITQM), pages 26–32, 2013.

[66] Stefan Hagedorn, Philipp Götze, and Kai-Uwe Sattler. The STARK framework for spatio-temporal data analytics on spark. In 17th Conference on Database Systems for Business, Technology, and Web (BTW), volume P-265, pages 123–142, 2017.

[67] Stefan Hagedorn and Timo Räth. Efficient spatio-temporal event processing with STARK. In 20th International Conference on Extending Database Technology (EDBT), pages 570–573, 2017.

[68] Jiawei Han, Jian Pei, and Micheline Kamber. Data mining: concepts and techniques. 2011.

[69] A. S. M. Touhidul Hasan, Qiang Qu, Chengming Li, Lifei Chen, and Qingshan Jiang. An effective privacy architecture to preserve user trajectories in reward-based LBS applications. ISPRS International Journal of Geo-Information.

[70] Xiaoqi He, Sheng Zhang, and Yangguang Liu. An adaptive spectral clustering algorithm based on the importance of shared nearest neighbors. Algorithms, 8(2):177–189, 2015.

[71] Lasanthi Heendaliya, Dan Lin, and Ali R. Hurson. Continuous predictive line queries for on-the-go traffic estimation. Transactions on Large-Scale Data- and Knowledge-Centered Systems, 18:80–114, 2015.

[72] Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177, 2004.

[73] Xiankai Huang, Lifeng Zhang, and Yusi Ding. The baidu index: Uses in predicting tourism flows - a case study of the forbidden city. Tourism Management, 58:301–306, 2017.

[74] Yuan-Ko Huang. Processing knn queries in grid-based sensor networks. Algorithms, 7(4):582–596, 2014.

[75] Yuan-Ko Huang. Processing knn queries in grid-based sensor networks. Algorithms, 7(4):582–596, 2014.

[76] Yuan-Ko Huang, Zhi-Wei Chen, and Chiang Lee. Continuous k-nearest neighbor query over moving objects in road networks. In Joint International Conferences on Advances in Data and Web Management (APWeb/WAIM), pages 27–38, 2009.

[77] Yuan-Ko Huang and Chiang Lee. Efficient evaluation of continuous spatio-temporal queries on moving objects with uncertain velocity. GeoInformatica, 14(2):163–200, 2010.

[78] Zhou Huang, Yiran Chen, Lin Wan, and Xia Peng. Geospark SQL: an effective framework enabling spatial queries on spark. ISPRS International Journal of Geo-Information, 6(9):285, 2017.

[79] Venkata Narasimha Inukollu, Sailaja Arsi, and Srinivasa Rao Ravuri. Security issues associated with big data in cloud computing. International Journal of Network Security and Its Applications (IJNSA), 6(3):45, 2014.

[80] Nishtha Jatana, Sahil Puri, Mehak Ahuja, Ishita Kathuria, and Dishant Gosain. A survey and comparison of relational and non-relational database. International Journal of Engineering Research and Technology (IJERT), 1(6):1–5, 2012.

[81] Fan Jiang and Carson Kai-Sang Leung. A data analytic algorithm for managing, querying, and processing uncertain big data in cloud environments. Algorithms, 8(4):1175–1194, 2015.

[82] Long Jiang, Mo Yu, Ming Zhou, Xiaohua Liu, and Tiejun Zhao. Target-dependent twitter sentiment classification. In Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 151–160, 2011.

[83] Prudence Kadebu and Innocent Mapanga. A security requirements perspective towards a secured nosql database environment. In International Conference of Advance Research and Innovation (ICARI), 2014.

[84] A. Kanavos, I. Perikos, I. Hatzilygeroudis, and A. Tsakalidis. Integrating user’s emotional behavior for community detection in social networks. In International Conference on Web Information Systems and Technologies (WEBIST), pages 355–362, 2016.

[85] A. Kanavos, I. Perikos, I. Hatzilygeroudis, and A. Tsakalidis. Emotional community detection in social networks. Computers and Electrical Engineering, 65:449–460, 2018.

[86] A. Kanavos, I. Perikos, P. Vikatos, I. Hatzilygeroudis, C. Makris, and A. Tsakalidis. Conversation emotional modeling in social networks. In IEEE International Conference on Tools with Artificial Intelligence (ICTAI), pages 478–484, 2014.

[87] Andreas Kanavos, Nikolaos Nodarakis, Spyros Sioutas, Athanasios Tsakalidis, Dimitrios Tsolis, and Giannis Tzimas. Large scale implementations for twitter sentiment classification. Algorithms, 10(1):33, 2017.

[88] Andreas Kanavos, Isidoros Perikos, Ioannis Hatzilygeroudis, and Athanasios Tsakalidis. Integrating user's emotional behavior for community detection in social networks. In International Conference on Web Information Systems and Technologies (WEBIST), pages 355–362, 2016.

[89] Andreas Kanavos, Isidoros Perikos, Ioannis Hatzilygeroudis, and Athanasios Tsakalidis. Emotional community detection in social networks. Computers and Electrical Engineering, 65:449–460, 2018.

[90] Andreas Kanavos, Isidoros Perikos, Pantelis Vikatos, Ioannis Hatzilygeroudis, Christos Makris, and Athanasios Tsakalidis. Conversation emotional modeling in social networks. In IEEE International Conference on Tools with Artificial Intelligence (ICTAI), pages 478–484, 2014.

[91] Athanasios Kaplanis, Marios Kendea, Spyros Sioutas, Christos Makris, and Giannis Tzimas. Hb+ tree: use hadoop and hbase even your data isn't that big. In Proceedings of the 30th Annual ACM Symposium on Applied Computing, pages 973–980, 2015.

[92] Dimitrios Karapiperis and Vassilios S Verykios. An lsh-based blocking approach with a homomorphic matching technique for privacy-preserving record linkage. IEEE Transactions on Knowledge and Data Engineering, 27(4):909–921, 2015.

[93] Dimitrios Karapiperis and Vassilios S Verykios. A fast and efficient hamming lsh-based scheme for accurate linkage. Knowledge and Information Systems, 49(3):861–884, 2016.

[94] Ioannis Kavakiotis, Olga Tsave, Athanasios Salifoglou, Nicos Maglaveras, Ioannis Vlahavas, and Ioanna Chouvarda. Machine learning and data mining methods in diabetes research. Computational and Structural Biotechnology Journal, 15:104–116, 2017.

[95] Marios Kendea, Vassiliki Gkantouna, Angeliki Rapti, Spyros Sioutas, Giannis Tzimas, and Dimitrios Tsolis. Graph dbs vs. column-oriented stores: A pure performance comparison. In International Workshop on Algorithmic Aspects of Cloud Computing, pages 62–74. Springer, 2015.

[96] Majid Khan and M. N. A. Khan. Exploring query optimization techniques in relational databases. International Journal of Database Theory and Application, 6(3):11–20, 2013.

[97] Vishal A. Kharde and Sheetal Sonawane. Sentiment analysis of twitter data: A survey of techniques. International Journal of Computer Applications, 139(11), 2016.

[98] Won Kim. On optimizing an sql-like nested query. ACM Transactions on Database Systems (TODS), 7(3):443–469, 1982.

[99] Adam Kirsch and Michael Mitzenmacher. Less hashing, same performance: Building a better bloom filter. In Annual European Symposium on Algorithms (ESA), pages 456–467, 2006.

[100] Christine Körner, Michael May, and Stefan Wrobel. Spatiotemporal modeling and analysis - introduction and overview. Künstliche Intelligenz (KI), 26(3):215–221, 2012.

[101] Akshi Kumar and Teeja Sebastian. Sentiment analysis on twitter. IJCSI International Journal of Computer Science Issues, 9(3):372–378, 2012.

[102] Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger. From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 957–966, 2015.

[103] Per-Åke Larson, Cipri Clinciu, Eric N. Hanson, Artem Oks, Susan L. Price, Srikumar Rangarajan, Aleksandras Surna, and Qingqing Zhou. SQL server column store indexes. In ACM SIGMOD International Conference on Management of Data, pages 1177–1184, 2011.

[104] Neal Leavitt. Will nosql databases live up to their promise? IEEE Computer, 43(2):12–14, 2010.

[105] Jae-Gil Lee, Jiawei Han, and Kyu-Young Whang. Trajectory clustering: A partition-and-group framework. In ACM SIGMOD International Conference on Management of Data, pages 593–604, 2007.

[106] Xiucheng Li, Kaiqi Zhao, Gao Cong, Christian S. Jensen, and Wei Wei. Deep representation learning for trajectory similarity computation. In 34th IEEE International Conference on Data Engineering (ICDE), pages 617–628, 2018.

[107] Bing Liu. Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies. Morgan and Claypool Publishers, 2012.

[108] Bing Liu and Lei Zhang. A survey of opinion mining and sentiment analysis. In Mining Text Data, pages 415–463, 2012.

[109] Securosis LLC. Securing big data: Security recommendations for hadoop and nosql environments. 2012.

[110] Michael J. Lyons and David M. Brooks. The design of a bloom filter hardware accelerator for ultra low power systems. In International Symposium on Low Power Electronics and Design, pages 371–376, 2009.

[111] Emmanouil Magkos, Panayiotis Kotzanikolaou, Marios Magioladitis, Spyros Sioutas, and Vassilios S Verykios. Towards secure and practical location privacy through private equality testing. In International Conference on Privacy in Statistical Databases, pages 312–325. Springer, 2014.

[112] Jesús Maillo, Isaac Triguero, and Francisco Herrera. A mapreduce-based k-nearest neighbor approach for big data classification. In 2015 IEEE Trustcom/BigDataSE/ISPA, volume 2, pages 167–172. IEEE, 2015.

[113] Yingchi Mao, Haishi Zhong, Hai Qi, Ping Ping, and Xiaofang Li. An adaptive trajectory clustering method based on grid and density in mobile pattern analysis. Sensors, 17(9):2013, 2017.

[114] Mohamed Mohamed, Obay G. Altrafi, and Owais Ismail. Relational vs. nosql databases: A survey. International Journal of Computer and Information Technology, 3(3):598–601, 2014.

[115] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. Adaptive Computation and Machine Learning series. MIT Press, 2012.

[116] Cory Nance, Travis Losser, Reenu Iype, and Gary Harmon. Nosql vs rdbms - why there is room for both. In Southern Association for Information Systems Conference, 2013.

[117] Sang Ni, Mengbo Xie, and Quan Qian. Clustering based k-anonymity algorithm for privacy preservation. IJ Network Security, 19(6):1062–1071, 2017.

[118] Ben Niu, Qinghua Li, Xiaoyan Zhu, Guohong Cao, and Hui Li. Achieving k-anonymity in privacy-aware location-based services. In INFOCOM, 2014 Proceedings IEEE, pages 754–762. IEEE, 2014.

[119] Nikolaos Nodarakis, Evaggelia Pitoura, Spyros Sioutas, Athanasios Tsakalidis, Dimitrios Tsoumakos, and Giannis Tzimas. kdann+: A rapid aknn classifier for big data. In Transactions on Large-Scale Data- and Knowledge-Centered Systems XXIV, pages 139–168. Springer, 2016.

[120] Nikolaos Nodarakis, Evaggelia Pitoura, Spyros Sioutas, Athanasios K. Tsakalidis, Dimitrios Tsoumakos, and Giannis Tzimas. kdann+: A rapid aknn classifier for big data. Transactions on Large-Scale Data- and Knowledge-Centered Systems, 24:139–168, 2016.

[121] Nikolaos Nodarakis, Spyros Sioutas, Athanasios Tsakalidis, and Giannis Tzimas. Using hadoop for large scale analysis on twitter: A technical report. arXiv preprint arXiv:1602.01248, 2016.

[122] Nikolaos Nodarakis, Spyros Sioutas, Athanasios K Tsakalidis, and Giannis Tzimas. Large scale sentiment analysis on twitter with spark. In EDBT/ICDT Workshops, pages 1–8, 2016.

[123] Brendan O’Connor, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith. From tweets to polls: Linking text sentiment to public opinion time series. In International Conference on Weblogs and Social Media (ICWSM), pages 122–129, 2010.

[124] Lior Okman, Nurit Gal-Oz, Yaron Gonen, Ehud Gudes, and Jenny Abramov. Security issues in nosql databases. In IEEE 10th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), pages 541–547, 2011.

[125] Tanty Oktavia and Surya Sujarwo. Evaluation of sub query performance in sql server. EPJ Web of Conferences, 68, 2014.

[126] Rabi Prasad Padhy, Manas Ranjan Patra, and Suresh Chandra Satapathy. Rdbms to nosql: Reviewing some next-generation non-relational database's. International Journal of Advances in Engineering, Science and Technology (IJAEST), 11(1):15–30, 2011.

[127] Alexander Pak and Patrick Paroubek. Twitter as a corpus for sentiment analysis and opinion mining. In International Conference on Language Resources and Evaluation (LREC), pages 1320–1326, 2010.

[128] Costas Panagiotakis, Nikos Pelekis, Ioannis Kopanakis, Emmanuel Ramasso, and Yannis Theodoridis. Segmentation and sampling of moving object trajectories based on representativeness. IEEE Transactions on Knowledge and Data Engineering (TKDE), 24(7):1328–1343, 2012.

[129] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135, 2008.

[130] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up? sentiment classification using machine learning techniques. In ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 79–86, 2002.

[131] Maria Patrou, Md Mahbub Alam, Puya Memarzia, Suprio Ray, Virendra C. Bhavsar, Kenneth B. Kent, and Gerhard W. Dueck. DISTIL: a distributed in-memory data processing system for location-based services. In 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 496–499, 2018.

[132] Erman Pattuk, Murat Kantarcioglu, Vaibhav Khadilkar, Huseyin Ulusoy, and Sharad Mehrotra. Bigsecret: A secure data management framework for key-value stores. In 6th IEEE International Conference on Cloud Computing, pages 147–154, 2013.

[133] Tao Peng, Qin Liu, Dacheng Meng, and Guojun Wang. Collaborative trajectory privacy preserving scheme in location-based services. Information Sciences, 387:165–179, 2017.

[134] Rishabh Poddar, Tobias Boelter, and Raluca Ada Popa. Arx: A strongly encrypted database system. IACR Cryptology ePrint Archive, 2016:591, 2016.

[135] Rogério Pontes, Francisco Maia, João Paulo, and Ricardo Manuel Pereira Vilaça. Saferegions: Performance evaluation of multi-party protocols on hbase. In 35th IEEE Symposium on Reliable Distributed Systems Workshops (SRDS), pages 31–36, 2016.

[136] Giorgos Poulis, Spiros Skiadopoulos, Grigorios Loukides, and Aris Gkoulalas-Divanis. Distance-based k^m-anonymization of trajectory data. In 14th IEEE International Conference on Mobile Data Management (MDM), pages 57–62, 2013.

[137] Raghu Ramakrishnan, Donko Donjerkovic, Arvind Ranganathan, Kevin S. Beyer, and Muralidhar Krishnaprasad. SRQL: sorted relational query language. In International Conference on Scientific and Statistical Database Management (SSDBM), pages 84–95, 1998.

[138] Jorge L Reyes-Ortiz, Luca Oneto, and Davide Anguita. Big data analytics in the cloud: Spark on hadoop vs mpi/openmp on beowulf. Procedia Computer Science, 53:121–130, 2015.

[139] Lior Rokach and Oded Maimon. Data Mining With Decision Trees: Theory and Applications. World Scientific Publishing Co. Pte. Ltd., 2015.

[140] Jelle Roozenburg. A literature survey on bloom filters. Research Assignment in Computer Science, 2005.

[141] Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills. Advanced Analytics with Spark: Patterns for Learning from Data at Scale. O'Reilly Media, Inc., 2017.

[142] Varsha Sahayak, Vijaya Shete, and Apashabi Pathan. Sentiment analysis on twitter data. International Journal of Innovative Research in Advanced Engineering (IJIRAE), 2, 2015.

[143] Jeffrey S. Saltz, Ivan Shamshurin, and Colin Connors. Predicting data science sociotechnical execution challenges by categorizing data science projects. Journal of the Association for Information Science and Technology (JASIST), 68(12):2720–2728, 2017.

[144] Rainer Schnell, Tobias Bachteler, and Jörg Reiher. Privacy-preserving record linkage using bloom filters. BMC medical informatics and decision making, 9(1):41, 2009.

[145] Rainer Schnell, Tobias Bachteler, and Jörg Reiher. A novel error-tolerant anonymous linking code. German Record Linkage Center, 2011.

[146] Jamal Shahrabi, Esmaeil Hadavandi, and Shahrokh Asadi. Developing a hybrid intelligent model for forecasting problems: Case study of tourism demand time series. Knowledge-Based Systems, 43:112–122, 2013.

[147] Hossain Shahriar and Hisham M. Haddad. Security vulnerabilities of nosql and sql databases for mooc applications. International Journal of Digital Society (IJDS), 8(1), 2017.

[148] Vatika Sharma and Meenu Dave. Sql and nosql databases. International Journal of Advanced Research in Computer Science and Software Engineering, 2(8), 2012.

[149] Shashi Shekhar, Zhe Jiang, Reem Y. Ali, Emre Eftelioglu, Xun Tang, Venkata M. V. Gunturi, and Xun Zhou. Spatiotemporal data mining: A computational perspective. ISPRS International Journal of Geo-Information, 4(4):2306–2338, 2015.

[150] Zhicheng Shi and Lilian S. C. Pun-Cheng. Spatiotemporal data clustering: A survey of methods. ISPRS International Journal of Geo-Information, 8(3):112, 2019.

[151] Spyros Sioutas, Phivos Mylonas, Alexandros Panaretos, Panagiotis Gerolymatos, Dimitrios Vogiatzis, Eleftherios Karavaras, Thomas Spitieris, and Andreas Kanavos. Survey of machine learning algorithms on spark over dht-based structures. In International Workshop on Algorithmic Aspects of Cloud Computing (ALGOCLOUD), pages 146–156, 2016.

[153] Spyros Sioutas, Konstantinos Tsakalidis, Kostas Tsichlas, Christos Makris, and Yannis Manolopoulos. A new approach on indexing mobile objects on the plane. Data and Knowledge Engineering, 67(3):362–380, 2008.

[154] Doohee Song, Jongwon Sim, Kwangjin Park, and Moonbae Song. A privacy-preserving continuous location monitoring system for location-based services. International Journal of Distributed Sensor Networks, 11(8):815613, 2015.

[155] Renchu Song, Weiwei Sun, Baihua Zheng, and Yu Zheng. PRESS: A novel framework of trajectory compression in road networks. PVLDB, 7(9):661–672, 2014.

[156] Jordi Soria-Comas, Josep Domingo-Ferrer, David Sánchez, and Sergio Martínez. Enhancing data utility in differential privacy via microaggregation-based k-anonymity. The VLDB Journal, 23(5):771–794, 2014.

[157] Penghui Sun, Shixiong Xia, Guan Yuan, and Daxing Li. An overview of moving object trajectory compression algorithms. Mathematical Problems in Engineering, 2016, 2016.

[159] Bunheang Tay, Jung Keun Hyun, and Sejong Oh. A machine learning approach for specification of spinal cord injuries using fractional anisotropy values obtained from diffusion tensor images. Computational and mathematical methods in medicine, 2014, 2014.

[160] João Paulo Teixeira and Paula Odete Fernandes. Tourism time series forecast - different ANN architectures with time index input. Procedia Technology, 5:445–454, 2012.

[161] Mike Thelwall, Kevan Buckley, and Georgios Paltoglou. Sentiment strength detection for the social web. Journal of the American Society for Information Science and Technology (JASIST), 63(1):163–173, 2012.

[162] Stanley Trepetin. Privacy-preserving string comparisons in record linkage systems: a review. Information Security Journal: A Global Perspective, 17(5-6):253–266, 2008.

[163] Andranik Tumasjan, Timm Oliver Sprenger, Philipp G. Sandner, and Isabell M. Welpe. Predicting elections with twitter: What 140 characters reveal about political sentiment. In Proceedings of the Fourth International Conference on Weblogs and Social Media (ICWSM), 2010.

[164] Peter D. Turney. Thumbs up or thumbs down? semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 417–424, 2002.

[165] Dinusha Vatsalan, Peter Christen, and Vassilios S. Verykios. A taxonomy of privacy-preserving record linkage techniques. Information Systems, 38(6):946–969, 2013.

[166] Vassilios S. Verykios, Maria Luisa Damiani, and Aris Gkoulalas-Divanis. Privacy and security in spatiotemporal data and trajectories. In Mobility, Data Mining and Privacy, pages 213–240. Springer, 2008.

[167] Chad Vicknair, Michael Macias, Zhendong Zhao, Xiaofei Nan, Yixin Chen, and Dawn Wilkins. A comparison of a graph database and a relational database: A data provenance perspective. In ACM Southeast Regional Conference, page 42, 2010.

[168] Gerasimos Vonitsanos, Andreas Kanavos, Phivos Mylonas, and Spyros Sioutas. A nosql database approach for modeling heterogeneous and semi-structured information. In 9th International Conference on Information, Intelligence, Systems and Applications (IISA), pages 1–8, 2018.

[169] Xiaolong Wang, Furu Wei, Xiaohua Liu, Ming Zhou, and Ming Zhang. Topic sentiment analysis in twitter: A graph-based hashtag sentiment classification approach. In ACM International Conference on Information and Knowledge Management (CIKM), pages 1031–1040, 2011.

[170] Janyce Wiebe, Theresa Wilson, and Claire Cardie. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2-3):165–210, 2005.

[171] Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. Recognizing contextual polarity in phrase-level sentiment analysis. In Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 347–354, 2005.

[172] Markus Winand. SQL Performance Explained: Everything Developers Need to Know about SQL Performance. M. Winand, 2012.

[173] Wei Wu, Udaya Parampalli, Jian Liu, and Ming Xian. Privacy preserving k-nearest neighbor classification over encrypted database in outsourced cloud environments. World Wide Web, 22(1):101–123, 2019.

[174] Yanbo Wu, Hong Shen, and Quan Z Sheng. A cloud-friendly rfid trajectory clustering algorithm in uncertain environments. IEEE Transactions on Parallel and Distributed Systems, 26(8):2075–2088, 2015.

[175] Yuqin Xie and Mingchun Zheng. A differentiated anonymity algorithm for social network privacy preservation. Algorithms, 9(4):85, 2016.

[176] Yang Xu, Tinghuai Ma, Meili Tang, and Wei Tian. A survey of privacy preserving data publishing using generalization and suppression. Applied Mathematics and Information Sciences, 8(3):1103, 2014.

[177] Sophia Yakoubov, Vijay Gadepally, Nabil Schear, Emily Shen, and Arkady Yerukhimovich. A survey of cryptographic approaches to securing big-data analytics in the cloud. In IEEE High Performance Extreme Computing Conference (HPEC), pages 1–6, 2014.

[178] Chaowei Yang, Manzhu Yu, Fei Hu, Yongyao Jiang, and Yun Li. Utilizing cloud computing to address big geospatial data challenges. Computers, Environment and Urban Systems, 61:120–128, 2017.

[179] Chong Yang, Xiaohui Yu, and Yang Liu. Continuous KNN join processing for real-time recommendation. In IEEE International Conference on Data Mining (ICDM), pages 640–649, 2014.

[180] Shumei Yang, Shaohua Tang, and Xiao Zhang. Privacy-preserving k-nearest neighbor query with authentication on road networks. Journal of Parallel and Distributed Computing, 134:25–36, 2019.

[181] Xin Yang, Bing Pan, James A Evans, and Benfu Lv. Forecasting chinese tourist volume with search engine data. Tourism Management, 46:386–397, 2015.

[182] Soheila Khoshnevis Yazdi and Bahman Khanalizadeh. Tourism demand: A panel data approach. Current Issues in Tourism, 20(8):787–800, 2017.

[183] Haina Ye, Xinzhou Cheng, Mingqiang Yuan, Lexi Xu, Jie Gao, and Chen Cheng. A survey of security and privacy in big data. In 16th International Symposium on Communications and Information Technologies (ISCIT), pages 268–272, 2016.

[184] Jia Yu, Zongsi Zhang, and Mohamed Sarwat. Spatial data management in apache spark: The geospark perspective and beyond. GeoInformatica, 23(1):37–78, 2019.

[185] Ziqiang Yu, Yang Liu, Xiaohui Yu, and Ken Q. Pu. Scalable distributed processing of K nearest neighbor queries over moving objects. IEEE Transactions on Knowledge and Data Engineering, 27(5):1383–1396, 2015.

[186] Guan Yuan, Penghui Sun, Jie Zhao, Daxing Li, and Canwei Wang. A review of moving object trajectory clustering algorithms. Artificial Intelligence Review, 47(1):123–144, 2017.

[188] Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. Apache spark: A unified engine for big data processing. Communications of the ACM, 59(11):56–65, 2016.

[189] Shaobo Zhang, Xinjun Mao, Kim-Kwang Raymond Choo, Tao Peng, and Guojun Wang. A trajectory privacy-preserving scheme based on dual-k mechanism for continuous location-based services. Information Sciences, 527:406–419, 2020.

[190] Wei Zhang, Clement Yu, and Weiyi Meng. Opinion retrieval from blogs. In ACM Conference on Information and Knowledge Management (CIKM), pages 831–840, 2007.

[191] Zhigang Zhang, Cheqing Jin, Jiali Mao, Xiaolin Yang, and Aoying Zhou. TrajSpark: A scalable and efficient in-memory management system for big trajectory data. In 1st International Joint Conference on Web and Big Data (APWeb-WAIM), pages 11–26, 2017.

[192] Jun Zhao, Wei Wang, and Chunyang Sheng. Data-Driven Prediction for Industrial Processes and Their Applications. Springer, 2018.

[193] Bolong Zheng, Kai Zheng, Xiaokui Xiao, Han Su, Hongzhi Yin, Xiaofang Zhou, and Guohui Li. Keyword-aware continuous knn query on road networks. In 32nd IEEE International Conference on Data Engineering (ICDE), pages 871–882, 2016.

[194] Pei-Yuan Zhou and Keith CC Chan. A model-based multivariate time series clustering algorithm. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pages 805–817. Springer, 2014.

[195] Li Zhuang, Feng Jing, and Zhu Xiao-Yan. Movie review mining and summarization. In ACM International Conference on Information and Knowledge Management (CIKM), pages 43–50, 2006.

[196] Andreas Züfle, Tobias Emrich, Klaus Arthur Schmid, Nikos Mamoulis, Arthur Zimek, and Matthias Renz. Representative clustering of uncertain data. In 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 243–252, 2014.

LIST OF PUBLICATIONS

International Scientific Conferences

1. Chioti, E., Dritsas, E., Kanavos, A., Liapakis, X., Sioutas, S. and Tsakalidis, A., 2017, August. Bloom Filters for Efficient Coupling Between Tables of a Database. In International Conference on Engineering Applications of Neural Networks (pp. 596-608). Springer, Cham.

2. Boussis, D., Dritsas, E., Kanavos, A., Sioutas, S., Tzimas, G. and Verykios, V.S., 2018, July. MapReduce Implementations for Privacy Preserving Record Linkage. In Proceedings of the 10th Hellenic Conference on Artificial Intelligence, pp. 1-4.

3. Vonitsanos, G., Kanavos, A., Dritsas, E., Mylonas, P. and Sioutas, S., 2020. Security and Privacy Solutions Associated with NoSQL Data Stores. In International Workshop on Semantic and Social Media Adaptation and Personalization (SMAP).

International Scientific Journals

1. Dritsas, E., Trigka, M., Gerolymatos, P. and Sioutas, S., 2018. Trajectory Clustering and k-NN for Robust Privacy Preserving Spatiotemporal Databases. Algorithms, 11(12), p.207.

2. Dritsas, E., Kanavos, A., Trigka, M., Sioutas, S. and Tsakalidis, A., 2019. Storage Efficient Trajectory Clustering and k-NN for Robust Privacy Preserving Spatio-Temporal Databases. Algorithms, 12(12), p.266.

3. Dritsas, E., Kanavos, A., Trigka, M., Vonitsanos, G., Sioutas, S. and Tsakalidis, A., 2020. Trajectory Clustering and k-NN for Robust Privacy Preserving k-NN Query Processing in GeoSpark. Algorithms, 13(8), p.182.

Additional International Scientific Conferences

1. Dritsas, E., Livieris, I.E., Giotopoulos, K. and Theodorakopoulos, L., 2018, November. An Apache Spark implementation for graph-based hashtag sentiment classification on Twitter. In Proceedings of the 22nd Pan-Hellenic Conference on Informatics, pp. 255-260.

2. Ntaliakouras, N., Vonitsanos, G., Kanavos, A. and Dritsas, E., 2019, July. An Apache Spark Methodology for Forecasting Tourism Demand in Greece. In 2019 10th International Conference on Information, Intelligence, Systems and Applications (IISA) (pp. 1-5). IEEE.

3. Dritsas, E., Vonitsanos, G., Livieris, I.E., Kanavos, A., Ilias, A., Makris, C. and Tsakalidis, A., 2019, May. Pre-processing Framework for Twitter Sentiment Classification. In IFIP International Conference on Artificial Intelligence Applications and Innovations (pp. 138-149). Springer, Cham.

Appendices

Appendix A

Matlab Code

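The first listing supports the experiments of Chapters 5 and 6. It expands the raw trajectory data loaded from matlab.mat (the variables data1 and data2 are assumed to hold one row per user and time instant) into a 3-D matrix that keeps, for each of the N = 400 mobile users, L = 100 trajectory points of the form (X, Y, U, TH). It then forms each user's anonymity set in two ways: a plain k-NN search over all users (MWCL) and a k-NN search restricted to the user's k-means cluster (MCL). The vulnerability of user m at time t is the reciprocal of the number of neighbor slots whose id is unchanged between consecutive instants, V_m(t) = 1/|{j : NN_t(m,j) = NN_{t-1}(m,j)}|, initialised to 1/nns; instants where no neighbor survives produce Inf and are zeroed out before the mean vulnerability of the two methods is plotted over time.
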
% Chapter 5-6
clear all; clc;

load matlab.mat;

N = 400; L = 100; %e = 5;

%% EXPAND DATASET
a = repmat(1:N,[L 1]);
a = a(:);
b = data1;
b(:,1) = a;
%PP = [data7(1:N*L,:) data8(1:N*L,:)];
PP = [b data2];

%% MAKE A 3-D MATRIX WHICH KEEPS THE L TRAJECTORY POINTS
%% OF THE FORM (X,Y,U,TH) OF e*N MOBILE USERS
Positions = zeros(L,4,N);
for k = 1:N
    [c1,c2] = find(PP(:,1) == k);
    %if (any(c1)==1)
    Positions(:,:,k) = PP(c1,2:5);
    %scatter(Positions(:,1,k),Positions(:,2,k)), hold on
end

nns = input('Give the number of nearest neighbors:');
%m = input('Define the mobile users:');
cs = input('Give the number of desired clusters:');
%flag = 0 -> no clustering, flag = 1 -> on-line clustering
%fl = input('Give the flag number:');

% Cluster the user positions at every time instant with k-means
C = zeros(N,L);
opts = statset('Display','final');
for t = 1:L
    mydat = squeeze(Positions(t,:,:));
    % apply k-means clustering
    C(:,t) = kmeans(mydat',cs,'Distance','sqeuclidean','Replicates',10,'Options',opts);
end

Vuln1 = zeros(N,L); % no clustering
Vuln2 = zeros(N,L); % k-means clustering
Vuln1(:,1) = 1/nns;
Vuln2(:,1) = 1/nns;

% NN1/NN2 keep the neighbor indices of each mobile node
% for each time instant t = 1:L
NN1 = zeros(N,nns,L); % flag = 0
NN2 = zeros(N,nns,L); % flag = 1
for t = 1:L
    mydat  = squeeze(Positions(t,:,:));
    mydat1 = mydat(3:4,:);
    mydat2 = mydat(1,:);
    for mm = 1:N
        % keeps the position of mobile object mm at time t
        pos  = Positions(t,1,mm);
        pos2 = Positions(t,3:4,mm);
        %% Method without clustering (MWCL)
        kk = knnsearch(mydat2',pos,'dist','euclidean','k',nns);
        NN1(mm,:,t) = kk;
        %% Method with clustering (MCL)
        % c keeps the cluster number of node mm
        c = C(mm,t); %C(mm,mc);
        % cc keeps the ids of the objects that belong to the same cluster as mm
        cc = find(C(:,t) == c);
        % d keeps the positions of the objects in cluster c
        d = mydat2(:,cc);
        % find the nns nearest neighbors of pos within d
        kkk = knnsearch(d',pos,'dist','euclidean','k',nns);
        NN2(mm,:,t) = cc(kkk)';
    end
end

%% PLOT VULNERABILITIES
for mm = 1:N
    for tt = 2:L
        % how many neighbors remained the same
        Vuln1(mm,tt) = 1/length(find(NN1(mm,:,tt) == NN1(mm,:,tt-1)));
        Vuln2(mm,tt) = 1/length(find(NN2(mm,:,tt) == NN2(mm,:,tt-1)));
    end
    %plot(1:L,Vuln1(mm,:),'*-r',1:L,Vuln2(mm,:),'*-b'), hold on
end

Vuln1(isinf(Vuln1)) = 0; [i1,j1] = find(Vuln1);
Vuln2(isinf(Vuln2)) = 0; [i2,j2] = find(Vuln2);
plot(1:L,mean(Vuln1(i1,:)),'-*r',1:L,mean(Vuln2(i2,:)),'-b')
ylabel('Vulnerability')
xlabel('Time')
legend('No clustering','Clustering')
%plot(1:L,mean(Vuln1),'*r',1:L,mean(Vuln2),'-b')

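The second listing supports Chapter 7. Here the neighbor tables are assumed to have been precomputed and stored in data.mat (variable y10), one row per object, with the k neighbor ids in the first k columns and the time instant in the last column. Because the number of recorded objects may change over time, the script splits the timeline into segments of constant population (via unique over the per-instant row counts), rebuilds the per-instant k-NN table inside each segment, computes the same reciprocal-overlap vulnerability, and carries the last value of one segment into the next (variable I) before plotting the mean vulnerability over time.
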
% Chapter 7
clear all; %close all;
clc;
load data.mat
dat = y10;
k = size(dat,2) - 1;
L = length(unique(dat(:,k+1)));
% count the number of recorded objects per time instant
for jj = 1:L
    x(jj) = length(find(dat(:,k+1) == jj-1));
end

[u1,u2] = unique(x);
u1 = flip(u1);
u2 = sort(u2);
%u2 = [u2;u2(end)+1];
init = 1/k;
VV = [];
%1 13 64 95
for i = 1:length(u1)
    % aa is the first instant of the segment, l its length
    if u2(i) == u2(end)
        aa = u2(end);
        l = length(u2(end):L);
    else
        aa = u2(i);
        l = u2(i+1) - u2(i);
    end

    % rebuild the k-NN table of each instant in the segment
    KNNs = zeros(u1(i),k,l);
    bb = aa + l - 1;
    u = aa:bb;
    for j = 1:l
        KNNs(:,:,j) = dat(dat(:,k+1) == u(j)-1,1:k);
    end

    V = zeros(u1(i),l);
    if (u2(i) == 1)
        V(:,1) = init;
    else
        V(:,1) = I;
    end
    for ii = 1:u1(i)
        for jj = 2:l
            V(ii,jj) = 1./length(find(KNNs(ii,:,jj) == KNNs(ii,:,jj-1)));
        end
    end

    % carry the last vulnerability value into the next segment
    I = V(u1(i),l);
    V(isinf(V)) = 0;
    [i1,j1] = find(V);
    VV = [VV mean(V(i1,:))];
    i1 = [];
end
figure;
plot(VV,'r'), hold on
ylim([0 1])
ylabel('Vulnerability')
xlabel('Time')

Appendix B

GeoSpark Code

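The Scala listing below ran inside an Apache Zeppelin notebook on top of GeoSpark, as its comments note; the paths under /home/vag/data are specific to the experimental machine. For each of the first 100 time instants found in column _c6 of the input CSV, the code materialises the positions of that instant as a single CSV part file, loads it into a GeoSpark PointRDD, applies quadtree spatial partitioning and a quadtree index, prints the per-partition record counts, and finally issues a 2-nearest-neighbor KNNQuery for every point of the instant, appending the answers to file.out.
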
import org.datasyslab.geosparksql.utils._
import org.datasyslab.geosparkviz.sql.utils.GeoSparkVizRegistrator
import org.datasyslab.geospark.spatialRDD._
import org.datasyslab.geospark.enums.{FileDataSplitter, GridType, IndexType}
import org.datasyslab.geospark.formatMapper._
import org.datasyslab.geospark.spatialOperator._
import com.vividsolutions.jts.geom._
import com.vividsolutions.jts.index.quadtree.Quadtree
import com.vividsolutions.jts.index.SpatialIndex
import java.io.{BufferedWriter, FileWriter, File}
import java.util.List
import scala.io.Source
import scala.collection.JavaConversions._
import scala.collection.JavaConverters._

// Zeppelin creates and injects sc (SparkContext) and sqlContext
// (HiveContext or SQLContext), so there is no need to create them manually
import spark.implicits._

// Register the GeoSpark SQL functions and visualization extensions
GeoSparkSQLRegistrator.registerAll(spark)
GeoSparkVizRegistrator.registerAll(spark)

// Load the (long, lat) trajectory points from a CSV file
val pointdf = spark.read.format("csv")
  .option("delimiter", ",")
  .option("header", "false")
  .load("/home/vag/data/bigdata.csv")
pointdf.createOrReplaceTempView("trajectories")

// Collect the distinct time instants (column _c6)
val xronos = pointdf.select("_c6")
  .distinct()
  .collect.map(_(0))
  .toArray

for (i <- 0 until 100) {
  val t = xronos(i)
  println(t)

  // Keep the positions (columns _c2, _c3) of time instant t
  val df = pointdf.select($"_c2", $"_c3").filter($"_c6" === t)
  df.coalesce(1).write.mode("overwrite")
    .csv("/home/vag/data/mydata2.csv")
  val dir = new File("/home/vag/data/mydata2.csv")
  val l = dir.listFiles.filter(_.toPath.toString.endsWith(".csv"))(0).toString
  println(l)

  // Build a spatially partitioned, quadtree-indexed PointRDD
  var pointrdd1 = new PointRDD(sc, l, 0, FileDataSplitter.CSV, true)
  val buildOnSpatialPartitionedRDD = false // Set to TRUE only if run join query
  val numPartitions = 4
  pointrdd1.analyze()
  pointrdd1.spatialPartitioning(GridType.QUADTREE, numPartitions)
  pointrdd1.buildIndex(IndexType.QUADTREE, buildOnSpatialPartitionedRDD)

  // Report how many records landed in each spatial partition
  pointrdd1.spatialPartitionedRDD.rdd
    .mapPartitionsWithIndex { case (idx, rows) => Iterator((idx, rows.size)) }
    .toDF("partition_number", "number_of_records")
    .show()

  // Run a 2-NN query for every point of the instant and append the answers
  val bw = new BufferedWriter(
    new FileWriter(new File("/home/vag/data/file.out"), true))
  for (line <- Source.fromFile(l).getLines) {
    val fact = new GeometryFactory()
    val chicago = fact.createPoint(new Coordinate(
      line.split(",")(0).toDouble,
      line.split(",")(1).toDouble))
    val r = KNNQuery.SpatialKnnQuery(pointrdd1, chicago, 2, false)
    bw.write(r + "\n")
  }
  bw.close()
}
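Two design points in the listing are worth noting. Quadtree partitioning places nearby points in the same partition, so each point-at-a-time k-NN probe needs to inspect only a small part of the RDD, and the per-partition counts printed by mapPartitionsWithIndex expose any load imbalance across the four partitions. The flag buildOnSpatialPartitionedRDD is left false because, as the inline comment says, an index on the spatially partitioned RDD is only required when running spatial join queries, not for the standalone KNNQuery calls issued here.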
