DEPARTMENT OF COMPUTER ENGINEERING AND INFORMATICS
UNIVERSITY OF PATRAS
DOCTORAL DISSERTATION
EFFICIENT ALGORITHMS FOR BIG DATA MANAGEMENT
ELIAS DRITSAS
SUPERVISOR: SPYROS SIOUTAS, PROFESSOR
PATRAS - AUGUST 2020

DEPARTMENT OF COMPUTER ENGINEERING AND INFORMATICS
UNIVERSITY OF PATRAS
DOCTORAL DISSERTATION
EFFICIENT ALGORITHMS FOR BIG DATA MANAGEMENT

ELIAS DRITSAS
DISSERTATION COMMITTEE:
SPYROS SIOUTAS, PROFESSOR (SUPERVISOR)
CHRISTOS MAKRIS, ASSOCIATE PROFESSOR (COMMITTEE MEMBER)
KONSTANTINOS TSICHLAS, ASSISTANT PROFESSOR (COMMITTEE MEMBER)
GEORGE ALEXIOU, PROFESSOR (COMMITTEE MEMBER)
DIMITRIS TSOLIS, ASSISTANT PROFESSOR (COMMITTEE MEMBER)
IOANNIS TZIMAS, ASSOCIATE PROFESSOR (COMMITTEE MEMBER)
PHOIVOS MYLONAS, ASSOCIATE PROFESSOR (COMMITTEE MEMBER)
ELIAS DRITSAS, AUGUST 2020
ABSTRACT
In the context of my doctoral research, I dealt with data management problems by developing methods and techniques that, on the one hand, maintain or improve the privacy and anonymity of users and, on the other hand, are efficient in terms of time and storage space for large volumes of data. The research results of the work focus on the following:
• Evaluation of query performance in a large database with and without the Bloom filter structure.
• Evaluation of the workload time, memory and disk usage of the Privacy Preserving Record Linkage (PPRL) problem in the Hadoop MapReduce framework.
• Methods of answering nearest-neighbor queries on spatio-temporal data (moving users' trajectories) in order to preserve anonymity, where queries are applied to clustered or non-clustered data.
• The k-anonymity method was used, where the anonymity set with which each moving object of the spatio-temporal database is camouflaged consists of its k nearest neighbors. The robustness of the method was quantified with a probability of 1/k, and the effect of the dimensionality and correlation of the data on the preservation of anonymity and privacy was studied.
• The above method was improved in terms of efficient storage of spatio-temporal data by applying nearest-neighbor queries to Hough-transformed nonlinear trajectories of moving objects. The application of secure k-NN queries was evaluated in the GeoSpark environment.
• Sentiment analysis on Twitter data and tourism forecasting in Apache Spark.
Keywords: Bloom Filters, Privacy Preserving, k-NN Queries, k-anonymity, Spatio-temporal Databases, Sentiment Analysis, Twitter, Apache Spark, GeoSpark.
ACKNOWLEDGEMENT
This dissertation signifies the end of an academic journey as a PhD student. At this point, I would like to sincerely thank all those who have supported me throughout these years.

First and foremost, with immense pleasure and a deep sense of gratitude, I wish to express my sincere thanks to my supervisor, Dr. Spyros Sioutas, Professor, University of Patras; without his motivation and continuous encouragement, this research would not have been successfully completed. I am also grateful to him for his enthusiasm in supervising this work and for his constant support, encouragement and critical suggestions during the writing of this thesis. I am likewise grateful to my initial supervisor, Prof. Athanasios Tsakalidis, who supervised me during the greater part of my Ph.D. studies and gave me the opportunity to undertake this research. I also warmly thank Associate Prof. Christos Makris, who willingly accepted to supervise me during their final part. I express my sincere thanks to Dr. Andreas Kanavos for his kind support and encouragement in several ways throughout my research work.

I wish to extend my profound gratitude to my parents for all the sacrifices they made during my research and for providing me with moral support and encouragement whenever required. Last but not least, I would like to thank my wife, Maria Trigka, for her constant encouragement and moral support, along with her patience and understanding.

Finally, I would like to acknowledge the support and funding of this PhD thesis by the General Secretariat for Research and Technology (GSRT) and the Hellenic Foundation for Research and Innovation (HFRI).
Place: Patras
Date: 31/08/2020

Elias Dritsas
TABLE OF CONTENTS
ABSTRACT ...... i
ACKNOWLEDGEMENT ...... ii
LIST OF FIGURES ...... viii
LIST OF TABLES ...... xii
LIST OF ABBREVIATIONS ...... xiii
1 Introduction 1
I Research on Methods and Algorithms for Secure Queries Processing 9
2 Bloom Filters for Efficient Coupling between Tables of a Database 11
2.1 Introduction ...... 11
2.2 Bloom Filters Background ...... 13
2.2.1 Bloom Filter Elements ...... 13
2.2.2 Space-Time Advantages and Constraints ...... 15
2.3 Bloom Filters and RDBMS ...... 16
2.3.1 Relational Database Management Systems ...... 16
2.3.2 Queries Language-SQL ...... 17
2.3.3 Indexes Table ...... 19
2.4 Experimental Evaluation in SQL Server ...... 20
2.5 Conclusions ...... 21
2.5.1 Research Conclusions ...... 21
2.5.2 Research Constraints ...... 23
2.5.3 Future Extensions ...... 23
3 MapReduce Implementations for Privacy Preserving Record Linkage 24
3.1 Introduction ...... 24
3.2 Related Work ...... 25
3.2.1 PPRL encoding techniques ...... 25
3.2.2 Private Indexing ...... 28
3.3 MapReduce Framework ...... 29
3.4 Performance Evaluation ...... 30
3.5 Conclusions ...... 33
4 Security and Privacy Solutions associated with NoSQL Data Stores 34
4.1 Introduction ...... 34
4.2 Related Work ...... 35
4.3 Comparison of Relational and NoSQL Databases ...... 36
4.3.1 Reliability of Transactions ...... 37
4.3.2 Scalability Issues and Cloud Support ...... 37
4.3.3 Complexity and Big Data Management ...... 38
4.3.4 Data Model ...... 38
4.3.5 Data Warehouse and Crash Recovery ...... 39
4.3.6 Privacy and Security ...... 39
4.4 Proposed Security and Privacy Solutions ...... 40
4.4.1 Pseudonyms-based Communication Network ...... 40
4.4.2 Monitoring, Filtering and Blocking ...... 42
4.5 Conclusions ...... 43
5 Trajectory Clustering and k-NN for Robust Privacy Preserving Spatio-Temporal Databases 44
5.1 Introduction ...... 44
5.2 Materials and Methods ...... 46
5.2.1 Clustering ...... 46
5.2.2 Classification ...... 47
5.2.3 Useful Definitions ...... 48
5.2.4 System Architecture ...... 49
5.2.5 Problem Definition ...... 52
5.2.6 System Model ...... 55
5.2.7 Privacy Preserving Analysis ...... 57
5.2.8 Experiments Data and Environment ...... 59
5.3 Discussion ...... 60
5.4 Results ...... 62
5.4.1 Experiments Results ...... 62
5.4.2 Experiments Conclusions ...... 65
6 Storage Efficient Trajectory Clustering and k-NN for Robust Privacy Preserving Databases 68
6.1 Introduction ...... 68
6.2 Related Work ...... 70
6.3 Materials and Methods ...... 72
6.3.1 Dual Transform for Moving Objects ...... 72
6.3.2 kNN Classification and Clustering in Dual Space ...... 73
6.3.3 Problem Definition ...... 74
6.3.4 Problem Formulation ...... 75
6.3.5 System Model ...... 77
6.3.6 Vulnerability and Storage Efficiency ...... 78
6.3.7 Privacy Preservation Analysis ...... 81
6.3.8 Experimental Data and Environment ...... 85
6.4 Results ...... 85
6.4.1 Vulnerability Evaluation in Hough Space ...... 87
6.4.2 Vulnerability Evaluation in Hybrid Space ...... 89
6.5 Discussion ...... 91
6.6 Conclusions ...... 92
7 Trajectory Clustering and k-NN for Robust Privacy Preserving k-NN Query Processing in GeoSpark 95
7.1 Introduction ...... 95
7.2 Related Work ...... 98
7.2.1 Distributed Frameworks for Spatio-Temporal data Queries Processing ...... 98
7.2.2 Efficient Privacy Preserving k-NN Queries ...... 99
7.3 Materials and Methods ...... 101
7.3.1 Operations on Spatial Data ...... 101
7.3.2 The k-NN Classifier from Big Spatial Data Perspective ...... 102
7.3.3 Problem Definition ...... 103
7.3.4 Problem Formulation ...... 105
7.3.5 System Model ...... 106
7.3.6 GeoSpark System Overview ...... 108
7.4 Results ...... 115
7.4.1 Environment and Dataset ...... 115
7.4.2 Time Performance of k-Anonymity Set ...... 115
7.4.3 Vulnerability Evaluation ...... 121
7.5 Discussion ...... 124
7.5.1 Performance Issues ...... 125
7.5.2 Vulnerability ...... 126
7.6 Conclusions and Future Work ...... 126
II Sentiment Analysis and Tourism Forecasting 129
8 An Apache Spark Implementation for Graph-based Hashtag Sentiment Classification on Twitter 131
8.1 Introduction ...... 131
8.2 Related Work ...... 133
8.2.1 Sentiment Analysis and Classification Models ...... 133
8.2.2 Cloud Computing Preliminaries ...... 136
8.3 Sentiment Classification on Twitter ...... 137
8.3.1 Tweet-Level Sentiment Classification ...... 137
8.3.2 Hashtag-Level Sentiment Classification ...... 137
8.4 Spark Implementation ...... 140
8.5 Results and Evaluation ...... 142
8.6 Conclusions ...... 143
9 An Efficient Preprocessing Tool for Supervised Sentiment Analysis on Twitter Data 144
9.1 Introduction ...... 144
9.2 Related Work ...... 146
9.3 Tools and Environment ...... 147
9.3.1 Twitter ...... 148
9.3.2 Publications Mining Tools ...... 148
9.3.3 Pre-processing Scheme ...... 149
9.3.4 Features ...... 151
9.3.5 Topic Modeling ...... 152
9.4 Evaluation ...... 153
9.5 Conclusions and Future Work ...... 155
10 An Apache Spark Methodology for Forecasting Tourism Demand in Greece 156
10.1 Introduction ...... 156
10.2 Related Work ...... 157
10.3 Preliminaries ...... 157
10.3.1 Forecasting Tourism Methods ...... 157
10.3.2 Apache Spark ...... 158
10.3.3 Machine Learning Algorithm ...... 158
10.4 Implementation ...... 160
10.4.1 Methodology ...... 160
10.4.2 Dataset Description ...... 161
10.5 Experiments - Evaluation ...... 162
10.6 Conclusions and Future Work ...... 164
REFERENCES ...... 164
LIST OF PUBLICATIONS ...... 184
Appendices
Appendix A Matlab Code 187
Appendix B GeoSpark Code 192
LIST OF FIGURES
2.1 Bloom Filter Overview ...... 14
2.2 B-Tree overview ...... 19
2.3 Queries Execution Time vs Records Size ...... 22
3.1 HLSH under FPS ...... 29
3.2 PPRL evaluation ...... 31
3.3 PPRL evaluation ...... 32
5.1 Data flow diagram ...... 49
5.2 API Request Diagram ...... 51
5.3 A Matlab overview of mobile users' trajectories' points ...... 55
5.4 Both clustering and k-NN: (a) x and (b) (x,y) for N=400 trajectories, L=100 time-stamps and k=5 ...... 64
5.5 Both clustering and k-NN: (a) (x,y,θ) and (b) (x,y,θ,v) for N=400 trajectories, L=100 time-stamps and k=5 ...... 65
5.8 Clustering (x,y,θ,v) and k-NN x for N=400 trajectories, L=100 time-stamps for (a) k=5 and (b) k=15 ...... 65
5.6 Both clustering and k-NN: (a) x and (b) (x,y) for N=2000 trajectories and L=100 time-stamps ...... 66
5.7 Both clustering and k-NN: (a) (x,y,θ) and (b) (x,y,θ,v) for N=2000 trajectories and L=100 time-stamps ...... 66
5.9 Clustering with (x,y,θ,v) and k-NN in x. Figure (a) concerns k=15 while (b) k=30 for N=2000 trajectories and L=100 time-stamps ...... 67
6.1 An overview of trajectory segmentation and Hough-X transformation for a linear trajectory segment (TS), which consists of M points. The dual points of the M points in TS are the same, for example, a1 = ... = aM, u1 = ... = uM, where the left graph shows the y(t) line and the right graph shows the Hough-X points ...... 73
6.2 A raw trajectory approximation with a discrete number of R linear sub-trajectories. In the dual dimensional space, each one is represented as a dual point, for example, the linear sub-trajectory [l(t0), l(t1)] is represented as a dual point dp1, and the linear sub-trajectory [l(t1), l(t2)] is represented as a dual point dp2 ...... 74
6.3 Theoretical curve of compression ratio for M = [10 100 1000 10000 100000] ...... 81
6.4 Both clustering and k-NN: (a) (Ux, ax) and (b) (Ux, ax, Uy, ay) for N = 1000 trajectories, L = 10 time-stamps, (c) (Ux, ax) and (d) (Ux, ax, Uy, ay) for N = 995 trajectories, L = 100 time-stamps ...... 87
6.5 Clustering with (Ux, ax, Uy, ay) and suppressing k-NN: (a) (Ux, ax, ∗, ∗) and (b) (∗, ∗, Uy, ay) for N = 1000 trajectories, L = 10 time-stamps, (c) (Ux, ax, ∗, ∗) and (d) (∗, ∗, Uy, ay) for N = 995 trajectories, L = 100 time-stamps ...... 88
6.6 Clustering with (Ux, ax) and k-NN with (Ux, ax): (a) Mobile User 10 and (b) Mobile User 100 for N = 995 trajectories, L = 50 time-stamps, (c) Vulnerability measure in dual Hough-X and native dimensional space of (x, y) ...... 89
6.7 (a) Initial points per trajectory and (b) compression ratio for N = 87 trajectories, L = 100 time-stamps. Clustering with (x, y) and k-NN: (c) x and (Ux, ax) and (d) (x, y) and (Ux, ax, Uy, ay) for N = 87 trajectories, L = 100 time-stamps ...... 90
6.8 Trajectory partition, grouping, and representatives ...... 91
7.1 An Overview of Continuous Trajectory Point k Nearest Neighbor (CTPkNN) Query ...... 105
7.2 An Overview of Spatio-Temporal Data Partitioning and Indexing ...... 107
7.3 An Overview of GeoSpark Layers ...... 109
7.4 An Overview of 40 Trajectories through Zeppelin ...... 117
7.5 Time Cost for k-Anonymity Set Computation with or without Indexing for N = 80, 500, 2000 Mobile Objects ...... 118
7.6 Time Cost for k-Anonymity Set Computation with or without Indexing for 3 Cases of Total Input Data in Executor ...... 118
7.7 Time Cost for Mobile Objects N = {500, 2000, 8000, 32000} without Indexing for k = 8 ...... 120
7.8 Spatial PointRDD Data Distribution for 4 Spatial Partition Techniques for 2000 Mobile Objects ...... 121
7.9 (a) Euclidean Space and (b) Polar Space for N = 500 Trajectories, L = 100 Timestamps ...... 123
7.10 Hough-X Space of x for (a) k = 5 and (b) k = 10 for N = 500 Trajectories, L = 100 Timestamps ...... 123
7.11 Hough-X Space of y for (a) k = 5 and (b) k = 10 for N = 500 Trajectories, L = 100 Timestamps ...... 123
7.12 Vulnerability Performance Comparison in Euclidean and Hough-X Space ...... 124
8.1 An example of Hashtag Graph Model [169] ...... 137
9.1 KDD process for knowledge mining from data [94] ...... 145
10.1 Tourist Arrivals in Greece (2006 - 2015) ...... 162
10.2 Predictions (2014 - 2018) ...... 164
LIST OF TABLES
2.1 SQL Queries Execution Time Results vs Data Size ...... 20
2.2 SQL Queries Execution Time Results vs Data Size ...... 21
5.1 An example of spatio-temporal database for d = 4 ...... 53
5.2 k-anonymity sets for N mobile users in L = 5 time-stamps ...... 54
5.3 Parameters for the 1st experiment of N=400 trajectories ...... 62
5.4 Parameters for the 2nd experiment of N=2000 trajectories ...... 63
6.1 An overview of the transformed spatio-temporal database ...... 76
6.2 Parameters for the experiment of using only Hough-X of x and Hough-X of x, y, for N = 1000 trajectories, L = 10 time-stamps (Figure 6.4a,b) and N = 995 trajectories, L = 100 time-stamps (Figure 6.4c,d) ...... 85
6.3 Parameters for the experiment using Hough-X of x and suppressing Hough-X of y (Exp1) for N = 1000 trajectories, L = 10 time-stamps (Figure 6.5a,b) and using Hough-X of y and suppressing Hough-X of x (Exp2) for N = 995 trajectories, L = 100 time-stamps (Figure 6.5c,d) ...... 86
6.4 Parameters for the experiment using (x, y) for clustering and Hough-X for k-NN for N = 87 trajectories and L = 100 time-stamps (Figure 6.7c,d) ...... 86
7.1 The Different Types of Point Resilient Distributed Dataset (RDDs) According to Selected Features ...... 111
7.2 Trajectory Representation in a PointRDD ...... 115
7.3 Simulation Parameters ...... 116
7.4 Time for 80 Mobile Objects without Indexing and with R-Tree Indexing ...... 118
7.5 Time for 500 Mobile Objects without Indexing and with R-Tree Indexing ...... 119
7.6 Time for 2000 Mobile Objects without Indexing and with R-Tree Indexing ...... 119
7.7 Impact of Spatial Partitioning in Time Performance ...... 122
7.8 Parameters when using N = 500 Trajectories, L = 100 Timestamps ...... 122
8.1 Performance of Tweet-Level Classifiers ...... 142
9.1 Datasets Details ...... 153
9.2 RapidMiner Results - Accuracy ...... 154
9.3 Accuracy for different Training - Test Set ratios ...... 154
9.4 10-Fold Cross-Validation ...... 155
10.1 Data Sources ...... 162
LIST OF ABBREVIATIONS
SQL Structured Query Language
NoSQL No Structured Query Language
RDBMS Relational Database Management Systems
RSA Rivest–Shamir–Adleman
RL Record Linkage
PPRL Privacy Preserving Record Linkage
HLSH Hamming Locality Sensitive Hashing
CLK Cryptographically Long Keys
k-NN k Nearest Neighbors
MWCL Method WithOut CLustering
MCL Method CLustering
DUST DUal-based Spatio-temporal
kDUST DUal-based k anonymity
DP Dual Point
CR Compression Ratio
DukNN Dual-based k Nearest Neighbor
DuCLkNN Dual-based Clustering k Nearest Neighbor
SkNN Snapshot k Nearest Neighbors
CkNN Continuous k Nearest Neighbors
STPkNN Snapshot Trajectory Point k Nearest Neighbors
CTPkNN Continuous Trajectory Point k Nearest Neighbors
CHAPTER 1
Introduction
The rapid growth and evolution of technology, such as social networks and smart mobile devices, coupled with the large number of users, is leading to an increase in the volume of data on the Internet. In addition, there are other important sources of large amounts of data, such as scientific, telecommunications, banking and business data, whose analysis and management are important for critical decision making. However, the large amount of data, known as Big Data [119]-[112], has caused several problems concerning how to store, process and retrieve it. There are many systems [121], [91], [95] with different architectures that address this challenge, namely the management of massive data.

The aim of this thesis is to develop and implement efficient management and processing algorithms for large-scale data. In particular, emphasis will be placed on multi-dimensional data classification algorithms. These algorithms are widely used in location-related services/applications (e.g. GPS) to query a database and retrieve the desired information. A popular algorithm in the literature is the k nearest neighbors, or k-NN [119]-[122], [112], which uses distance metrics to "answer" user queries to a database. Over the last decade, the vast explosion of data has fueled the development of Big Data management systems and technologies. The most popular solutions have been proposed in centralized environments, whose efficiency is limited for large amounts of data, so searching for distributed solutions is imperative.

Nowadays, digital data are the most valuable asset of almost every organization. Database management systems are considered as storing systems for the efficient retrieval and processing of digital data. However, the effectiveness of a relational database, in terms of data access speed, is limited as its size increases significantly [96]. The Bloom filter is a special data structure with finite storage requirements that supports rapid membership testing of an object against a dataset.
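As a concrete illustration of the membership-testing behaviour just described, here is a minimal Bloom filter sketch in plain Python. It is an illustrative toy, not the structure used later in the thesis; the bit-array size m, the number of hash functions h, and the SHA-256 salting scheme are all arbitrary assumptions.

```python
import hashlib

class BloomFilter:
    """Minimal illustrative Bloom filter: an m-bit array set by h hash functions."""
    def __init__(self, m=1024, h=3):
        self.m, self.h, self.bits = m, h, bytearray(m)

    def _positions(self, item):
        # Derive h bit positions from salted SHA-256 digests of the item.
        for i in range(self.h):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        # False means "definitely not stored"; True means "stored, with a small
        # false-positive probability" -- a Bloom filter has no false negatives.
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
bf.add("record-42")
print("record-42" in bf)  # True
```

The asymmetry in `__contains__` is what makes the structure useful for databases: a negative answer is certain and can skip an expensive disk lookup entirely.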
It is worth mentioning that the Bloom filter structure has been proposed with a view to constructively increasing data access speed in relational databases. Since the characteristics of a Bloom filter are consistent with the requirements of a fast data access structure, we examine the possibility of using it in order to increase the SQL query execution speed in a database. In the context of this research, in chapter 2, a database that includes big data tables is implemented in the RDBMS SQL Server, and subsequently the performance enhancement from using Bloom filters, in terms of execution time for different categories of SQL queries, is examined. I experimentally demonstrated the time effectiveness of the Bloom filter structure in relational databases when dealing with large-scale data. This investigation was initiated by the study of optimizing query performance [96], [125] in a large database with or without the Bloom Filter structure [27], which requires finite storage and allows rapid control of the existence of an object in a data set. Given the aforementioned Bloom filter features, consideration was given to using it to increase the speed of running SQL queries in a large database.

Furthermore, the Privacy Preserving Record Linkage problem based on Bloom filter encoding techniques is described in chapter 3; these techniques both maintain users' security and permit similarity control. The research study focused on the Privacy Preserving Record Linkage problem, known as PPRL [162], [26], due to its wide resonance in the research community for protecting the identity and characteristics of entities associated with records in different databases. This topic was studied in the MapReduce framework in Hadoop using the Locality Sensitive Hashing indexing technique in the Hamming space (HLSH). The encoding technique based on Bloom filters [144], known as CLK (Cryptographically Long Keys) [22], was utilized due to its security against intruders.
Moreover, our study extended to the HLSH/FPS private indexing technique, and we briefly describe four implementations in the MapReduce distributed framework, which is capable of processing large-scale data. I also conducted an experimental evaluation of these four versions in order to assess them in terms of job execution time, memory and disk usage. In addition, in chapter 4, security and privacy issues for NoSQL databases are studied, where security mechanisms and privacy solutions are thoroughly examined. The adoption of Cloud computing and big data management technologies has created an urgent need for specific databases to safely store extensive data along with high availability. Specifically, a growing number of companies have adopted various types of
non-relational databases, commonly referred to as NoSQL databases. These databases provide a robust mechanism for the storage and retrieval of massive data without using a predefined schema. NoSQL platforms are superior to RDBMS, especially in cases where we are dealing with big data and parallel processing, and in particular when there is no need to use relational modeling.

Let us recall that the main objective of the research is the development of the basic knowledge mining algorithms (namely, K-means clustering and k-NN classification) for the processing of high-volume spatial data on the basis of enhancing security and protecting privacy. Hence, in the context of this research, in chapter 5 the problem of Privacy Preserving on Spatio-Temporal Databases is studied. In particular, the k-anonymity of mobile users based on real trajectory data is used to quantify privacy. The k-anonymity set consists of the k nearest neighbors. A motion vector of the form (x,y,θ,v) is constructed, where x,y are the spatial coordinates, θ the angle direction and v the velocity of mobile users, and the problem is studied in four-dimensional space. Two approaches are followed. The former applies only the k-Nearest Neighbor (k-NN) algorithm on the whole data set, while the latter combines trajectory clustering, based on K-Means, with k-NN. Unlike previous works, such as [150], [178], which deal with trajectory clustering, the latter approach applies k-NN inside a cluster of mobile users with a similar motion pattern (θ,v). We define a metric, called Vulnerability, that measures the rate at which the k-NNs are varying. This metric varies from 1/k (high robustness) to 1 (low robustness) and represents the probability of the real identity of a mobile user being discovered by a potential attacker. The aim of this work is to prove that, with high probability, the above rate tends to a number very close to 1/k in the clustering method, which means that the k-anonymity is highly preserved.
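One plausible way to operationalize the Vulnerability metric (the exact estimator used in chapter 5 may differ) is to compute the target's k-anonymity set at each timestamp with a snapshot k-NN query and intersect the sets over time: a stable set leaves an attacker with k candidates (probability 1/k), while a fully varying set lets the metric approach 1. A toy sketch on hypothetical trajectories:

```python
import math

def knn_ids(points, target, k):
    """Ids of the k objects nearest to `target` at one timestamp (target excluded)."""
    others = [(math.dist(p, points[target]), oid)
              for oid, p in points.items() if oid != target]
    return {oid for _, oid in sorted(others)[:k]}

def vulnerability(snapshots, target, k):
    """Hypothetical estimator: intersect the target's per-timestamp
    k-anonymity sets; the attacker's success probability is taken as
    1 over the number of ids that persist across all timestamps."""
    sets = [knn_ids(points, target, k) for points in snapshots]
    stable = set.intersection(*sets)
    return 1 / max(len(stable), 1)

# Toy snapshots: object id -> (x, y) position per timestamp; id 0 is the target.
stable_case = [
    {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (9, 9)},
    {0: (0, 1), 1: (1, 1), 2: (0, 2), 3: (9, 8)},
]
varying_case = [
    {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (9, 9)},
    {0: (0, 1), 1: (9, 9), 2: (0, 2), 3: (0, 0)},
]
print(vulnerability(stable_case, target=0, k=2))   # 0.5 = 1/k: neighbors {1, 2} persist
print(vulnerability(varying_case, target=0, k=2))  # 1.0: the anonymity set breaks down
```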
Through experiments on real spatial data sets, the anonymity robustness, the so-called Vulnerability, of the proposed method is evaluated.

Bearing in mind the "curse of dimensionality" and its effect on clustering and classification, its impact on the maintenance of privacy has been studied. That is why we have evaluated the impact of the number and the correlation of dimensions on privacy protection for the two approaches, defining the Vulnerability metric, which measures the rate at which the k nearest neighbors of a set of moving users change. The study on real spatial data evaluated the performance of both methods in terms of privacy preserving for different combinations of characteristics (x, y, θ, v). Regardless of the method used, if the k nearest neighbors' ids remain the same or do not change often in time, it is difficult for an opponent to discover a moving user based on historical data.

The need to store massive volumes of spatio-temporal data has become a difficult task, as GPS capabilities and wireless communication technologies have become prevalent in modern mobile devices. As a result, massive trajectory data are produced, incurring expensive costs for storage, transmission, as well as query processing. A number of algorithms for compressing trajectory data have been proposed in order to overcome these difficulties. These algorithms try to reduce the size of trajectory data while preserving the quality of the information. In the following, in chapter 6, I focus on both the privacy preservation and the storage of spatio-temporal databases. To alleviate this issue, I focused on the storage-compression problem [157] of spatio-temporal databases. An effective method for spatio-temporal data compression called Dual-based Spatio-temporal Trajectory (DUST) is proposed here, whereby an initial raw trajectory is divided into a number of linear sub-trajectories under the Hough transformation [157], [153], [53], which forms the representatives of each linear component of the initial trajectory; the trajectory is therefore compressed. The Hough transformation breaks down the k-NN query into two one-dimensional queries and allows it to be applied in a smaller space. This brings compression to the data and enhances the safety of the queries. In particular, even if an intruder has access to the representatives of the trajectory data and tries to reproduce the points of the initial track, the identity of the mobile object remains safe with high probability. The anonymity set now consists of mobile users who have the same motion pattern based on the Hough-X/Y transformation.
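A minimal sketch of the Hough-X idea, under the assumption (consistent with the notation of the thesis's Figure 6.1) that a linear segment x(t) = u·t + a is mapped to the single dual point (u, a), so that all M sampled points of the segment collapse to one representative:

```python
def hough_x(segment):
    """Map a linear trajectory segment [(t, x), ...] to its Hough-X dual
    point (u, a), where x(t) = u*t + a (u = velocity, a = intercept)."""
    (t0, x0), (t1, x1) = segment[0], segment[-1]
    u = (x1 - x0) / (t1 - t0)
    return (u, x0 - u * t0)

# A linear sub-trajectory sampled at M = 5 timestamps: x(t) = 2t + 1.
segment = [(t, 2 * t + 1) for t in range(5)]
dual_point = hough_x(segment)
print(dual_point)        # (2.0, 1.0): all M points share this one dual point
print(len(segment))      # M = 5 native points compressed to a single dual point
```

For a raw nonlinear trajectory split into R linear sub-trajectories, the stored representation shrinks from the full point list to R dual points, which is the source of the compression ratio discussed in chapter 6.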
This approach differs from previous approaches described in [196]-[128]. A theoretical limit for measuring the sensitivity of Hough-based methods in the two projections of the x, y dimensions, as well as the overall sensitivity, is computed. In addition, a model of attacks on spatio-temporal databases, and on Hough-transformed ones, which store the trajectory data of a set of moving objects, is studied. It is also recommended to use a version of a digital-pseudonyms protocol with an Identity Provider, known as the Brand protocol, which protects the identity of database objects against a malicious user. This reinforces the firmness of the k-anonymity method that we already studied in the previous chapter. To our knowledge, we are the first to study and address k-NN queries on nonlinear moving object trajectories that are represented in dual dimensional space. Additionally, the proposed approach is expected to reinforce the privacy protection of such data. Specifically, even in the case that an intruder has access to the dual points of trajectory data and tries to reproduce the native points that fit a specific component of the initial trajectory, the identity of the mobile object will remain secure with high probability. In this way, the privacy of the k-anonymity method recommended in [39] is reinforced. Through experiments on real spatial datasets, we evaluate the robustness of the new approach and compare it with the one studied in our previous work.

Privacy Preserving and Anonymity have gained significant concern from the big data perspective. We take the view that forthcoming frameworks and theories will establish several solutions for privacy protection. The k-anonymity is considered a key solution that has been widely employed to prevent data re-identification and concerns us in the context of this work. Data modeling has also gained significant attention from the big data perspective. It is believed that the advancing distributed environments will provide users with several solutions for efficient spatio-temporal data management. GeoSpark is utilized in the current work, as it is a key solution that has been widely employed for spatial data. Specifically, it works on top of Apache Spark, the main framework leveraged by the research community and organizations for big data transformation, processing and visualization. To this end, we focused on trajectory data representation so as to be applicable to the GeoSpark environment, and a GeoSpark-based approach is designed for the efficient management of real spatio-temporal data.
The next step is to gain a deeper understanding of the data through the application of k nearest neighbor (k-NN) queries, either using indexing methods or otherwise. The k-anonymity set computation, which is the main component of privacy preservation evaluation and the main issue of our previous works, is evaluated in the GeoSpark environment. More to the point, the focus here is on the time cost of the k-anonymity set computation along with the vulnerability measurement. The extracted results are presented in tables and figures for visual inspection.

The importance of, and general research interest in, methods for processing safe spatial k-NN queries have increased. In this respect, and given the rapid increase in the volume of spatial data (Big Spatio-temporal Data), it is necessary to assess the cost (time) of creating the anonymity set of each moving object in the Spark environment. This will help to assess the practical interest in the implementation (and development) of such methods in real-time systems. The GeoSpark environment has been set up to this end. In particular, the configuration of the anonymity set is approached as a whole by Snapshot Trajectory Point kNN (STPkNN) queries, based on the selected descriptors of each trajectory point of a set of moving objects at the respective timestamp. The performance evaluation was selected to be applied at the Apache Spark based GeoSpark because it is designed for processing spatial data.

Although traditional privacy solutions have been designed in Euclidean space, our framework also studies the concept of anonymity in Hough space. Due to the constantly changing location information of moving objects, it is necessary to evaluate a large number of nearest-neighbor queries for a large number of moving objects per time footprint. The k-NN spatio-temporal queries are issued in order to configure the set of moving object ids based on the trajectory points of all objects in each temporal imprint. Specifically, a k-NN query, which we call the Snapshot Trajectory Point k-NN (STPkNN), is calculated by considering the selected attribute information (e.g. Euclidean coordinates, angle, velocity, dual points) of all objects. Assuming a high sampling rate, we can consider that the process is similar to a Continuous Trajectory Point k-NN. An important feature of the continuous k-NN query in Hough space is that the nearest neighbors between two consecutive space-time points remain the same.
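The following toy sketch (a hypothetical helper, plain Python) locates exactly those recomputation points: the timestamps where the slope between consecutive samples changes, i.e. where a new linear sub-trajectory begins and the dual point, and hence the k-NN answer in Hough space, can change.

```python
def velocity_change_points(traj, eps=1e-9):
    """Indices in a 1-D track [(t, x), ...] where the velocity (slope between
    consecutive samples) changes; in dual space the k-NN result only needs
    recomputation at these points, not at every timestamp."""
    changes = [0]
    prev_v = None
    for i in range(1, len(traj)):
        (t0, x0), (t1, x1) = traj[i - 1], traj[i]
        v = (x1 - x0) / (t1 - t0)
        if prev_v is not None and abs(v - prev_v) > eps:
            changes.append(i - 1)  # a new linear sub-trajectory starts here
        prev_v = v
    return changes

# Piecewise-linear track: slope 1 on t = 0..2, slope 3 on t = 2..4.
traj = [(0, 0), (1, 1), (2, 2), (3, 5), (4, 8)]
print(velocity_change_points(traj))  # [0, 2]: only 2 recomputations for 5 samples
```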
Based on this feature, the problem of executing the classical Continuous k-NN queries [37]-[189] can be significantly reduced at specific space-time points where the veloc- ity of the moving object changes indicating a new linear sub-trajectory of the original nonlinear track. In the second part of the research, as included in chapters 8-10, the performance of various classifiers in the Sentiment Analysis problem [142], [97] using classifiers on Twitter data in the Apache Spark environment, was investigated. In addition, a Python text and language pre-processing tool [65] has been developed to remove erroneous values and noise in an optimal and efficient manner. A notable feature is the use of emojis and emoticons in the field of emotion analysis. Supervised machine learning techniques were used to analyze user views. The performance of the classifiers (Naive
Bayes and SVM) was experimentally evaluated under specific parameters, such as the size of the training data and the feature-selection methods used (unigrams, bigrams and trigrams), using the k-fold cross-validation technique. Finally, the use of a data mining technique based on Decision Trees was studied on Apache Spark, with the aim of forecasting tourism demand [30], taking into account the contribution of explanatory variables. The data set was constructed from public sources and the predicted (target) variable is tourist arrivals in Greece for the years 2006 to 2015.
Part I
Research on Methods and Algorithms for Secure Queries Processing
CHAPTER 2
Bloom Filters for Efficient Coupling between Tables of a Database
2.1 Introduction
Business data, associated with all business activities, are typically stored in relational databases so that they can be managed using the SQL language, and more specifically queried via SQL. Relational databases are particularly effective in their operation. However, their efficiency is limited when they store "big data" with complex correlations [167]. An SQL query can be very expensive in execution cost, concretely in time and resource access, if the execution plan is not optimized. Possible delays in the completion of SQL queries may impact the performance of applications using relational databases, thus reducing business performance. The main way to improve the performance of an SQL query is to reduce the number of operations/calculations that must be performed during its execution. However, further reduction of the commands in an SQL query is not always possible, and additional techniques for SQL query performance optimization in a database are also required [61]. In [125], the authors investigate this specific problem and recommend the use of IN, EXISTS, EQUAL and OPERATOR-TOP along with indexes. Moreover, the Bloom filter structure is used in databases such as Google Bigtable or Apache HBase in order to reduce disk searches for non-existent records, optimizing in this way the performance of executed SQL queries [24]. Traditional database systems store data in the form of a table with records. Each record corresponds to a different entity object that holds information in a relational table. This organization of databases is effective when queries are performed on tables with a small number of records. However, as the number of records increases, e.g. to hundreds of thousands or millions of records, SQL queries usually search a much larger number of records in order to locate and access a small number of
records or fields [103]. The best way to improve the execution speed of SQL queries in a database is to define indexes on fields that are part of the search criteria of an SQL query. When indexes are not defined in a database, the database management system operates like a reader trying to find a word in a book by reading the entire book. By consulting the index at the back of the book, the reader can complete the task much more quickly. The benefit of using indexes when searching records in a table grows as the number of table entries increases 1. The role of indexes in a database is to direct access to records according to the search criteria of the SQL query. However, when a table in a database contains millions of records, then despite the use of indexes, the identification of records that meet the search criteria requires accessing thousands of records of the relational table 2. Therefore, in order to improve the execution speed of relational SQL queries, excluding in advance a significant number of records that do not meet the search criteria would be particularly useful. To this purpose, the implementation of a Bloom filter structure is suggested; this structure is built from the records of the tables and is then used to exclude records that do not meet the criteria of the relevant SQL queries. The purpose of this research is to examine to what extent Bloom filter structures over tables in relational databases can affect the performance of data access queries on tables with millions of records.
To achieve the aim of this survey, our contributions are the following: (i) implementation of a Bloom filter over a relational database, (ii) experimental evaluation of queries with and without the support of the Bloom filter and tabulation of the query execution times, and (iii) graphical visualization of the results to show the Bloom filter's effectiveness (in terms of execution time) in running SQL queries on tables with millions of records. The rest of the chapter is organized as follows: in Section 2.2 the properties and basic components of Bloom filters are introduced. In Section 2.3, relational databases and the SQL framework are presented. Section 2.4 presents the evaluation experiments conducted and the results gathered. Finally, Section 2.5 presents conclusions and constraints, and draws directions for future work. 1http://odetocode.com/articles/237.aspx 2http://dataidol.com/tonyrogerson/2013/05/09/reducing-sql-server-io-and-access-times-using-Bloom-filters-part-2-basics-of-the-method-in-sql-server
2.2 Bloom Filters Background
2.2.1 Bloom Filter Elements
The Bloom filter structure, devised by Burton Howard Bloom in 1970, is used to rapidly check whether an element is present in a data set [18]. In particular, it permits verifying that an item certainly does not belong to the set. Although Bloom filters allow false positive responses, the space savings they offer outweigh this downside [99]. A Bloom filter is composed of two parts: a set of k hash functions and a bit vector. The number of hash functions and the length of the bit vector are chosen according to the expected number of keys to be added to the Bloom filter and the acceptable error rate per case 3. A number of important components need to be properly defined in order for a Bloom filter to operate correctly. These parameters are briefly described in the following paragraphs.
2.2.1.1 Hash Functions
A hash function takes as input data of any length and returns as output an identifier of smaller, fixed length, which can be employed to identify elements 4. The main features that a hash function should have are the following:
• Return the same value at each iteration with the same data input.
• Quick execution.
• Generate output with uniform distribution in the potential range it produces.
Some of the most popular algorithms for implementing hash functions are SHA1 and MD5. These functions differ in their security level and in the speed of hash value computation. Also, some algorithms distribute the generated values homogeneously but are impractical. In each case, the selected hash function should satisfy the application requirements. As for the number of hash functions, the larger this number is, the slower the hash values are generated and the faster the binary vector fills up. However, this
3https://www.perl.com/pub/2004/04/08/bloom_filters.html 4https://blog.medium.com/what-are-bloom-filters-1ec2a50c68ff
Fig. 2.1 Bloom Filter Overview
decision increases the rate of incorrect predictions about the existence of an object in a dataset 5. The optimal number of hash functions derives from the following formula in [99]:
k = (m / n) · ln(2) (2.1)

where m is the binary vector length and n the number of keys inserted in the Bloom filter. When selecting the number of hash functions to be used, we also calculate the probability of false positive predictions. The previous step is repeated until we obtain an acceptable value for the false positive probability [27].
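Both the optimal k of Equation (2.1) and the resulting false positive probability can be computed directly. The sketch below uses the standard approximation p = (1 − e^(−kn/m))^k for the false positive rate; the helper names are ours, for illustration only:

```python
import math

def optimal_hash_count(m: int, n: int) -> int:
    """Optimal number of hash functions: k = (m / n) * ln(2), per Eq. (2.1)."""
    return max(1, round((m / n) * math.log(2)))

def false_positive_rate(m: int, n: int, k: int) -> float:
    """Standard approximation p = (1 - e^(-k*n/m))^k of the false positive probability."""
    return (1.0 - math.exp(-k * n / m)) ** k

# Example: a vector of 9600 bits for 1000 keys gives k = 7
# and a false positive rate of roughly 1%.
k = optimal_hash_count(m=9600, n=1000)
p = false_positive_rate(m=9600, n=1000, k=k)
```

Iterating over candidate (m, k) pairs until the computed p is acceptable corresponds to the repeated step described above.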
2.2.1.2 Binary Vectors Length
The length of the binary vector of a Bloom filter affects its false positive rate. The greater the length of the binary vector, the lower the probability of false positive responses. Conversely, as the length of the vector shrinks, this probability increases. Generally, a Bloom filter is considered full when 50% of the bits in the array are equal to 1. At this point, further addition of objects will increase the false positive rate [110].
2.2.1.3 Key Insertion
We initialize a Bloom filter by setting all values of the binary vector to 0. To insert a key into a Bloom filter, the k hash functions are first applied, and the positions of the binary vector that correspond to the hash values change from 0 to 1.
5https://llimllib.github.io/bloomfilter-tutorial
If the relevant bit is already set to 1, its value does not change further 6. Each bit of the vector can simultaneously encode multiple keys, which makes the Bloom filter compact, as shown in Figure 2.1 [21]. The overlapping values do not permit removing a key from the filter, since it is not known whether the relevant bits were also activated by other key values. The only way to remove a key from a Bloom filter is to rebuild the filter from scratch, not incorporating the key to be removed. To check whether a key may be present in the Bloom filter, the following procedure is applied. First, the hash functions are applied to the search key, and then we check whether all the bits indicated by the hash functions are activated. If at least one of the bits is disabled, it is certain that the corresponding key is not included in the filter. If all bits are set, then we know that, with high probability, the key has been inserted.
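The initialization, key-insertion and membership-check steps just described can be sketched in Python. The class name and the double-hashing construction over MD5/SHA-1 digests are our illustrative choices, not part of the chapter:

```python
import hashlib

class BloomFilter:
    def __init__(self, size: int, num_hashes: int):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size          # binary vector, initialized to all zeros

    def _positions(self, key: str):
        # Double hashing: position_i = (h1(key) + i * h2(key)) mod size
        h1 = int(hashlib.md5(key.encode()).hexdigest(), 16)
        h2 = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1          # bits already set to 1 stay 1

    def might_contain(self, key: str) -> bool:
        # If any indicated bit is 0, the key was certainly never inserted;
        # if all are 1, the key is present with high probability.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter(size=1024, num_hashes=7)
bf.add("alice")
assert bf.might_contain("alice")        # no false negatives
```

Note that `add` is irreversible, which matches the observation below that a key can only be removed by rebuilding the filter.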
2.2.2 Space-Time Advantages and Constraints
The implementation of a Bloom filter is relatively simple in comparison with other relevant search structures. In addition, the use of a Bloom filter ensures fast membership checking of a value and, consequently, absolute reliability regarding the non-existence of an object in it (no false negatives) [140]. The time required to add a new item or to check whether an item belongs to the data set is independent of the number of elements in the filter 7. Moreover, a strong advantage of Bloom filters is the storage space saving in comparison with other data structures such as sets, hash tables, or binary search trees. The insertion of an element into a Bloom filter is an irreversible process 8. The size of the data in a Bloom filter must be known in advance in order to determine the vector length and the number of hash functions. However, the number of objects that will be inserted into a Bloom filter is not always known in advance. It is theoretically possible to define an arbitrarily large size, but this would be wasteful in terms of space and would negate the main advantage of the Bloom filter, which is storage economy. Alternatively, a Dynamic Bloom filter structure could be adopted, which, however, is
6https://www.perl.com/pub/2004/04/08/bloom_filters.html 7https://prakhar.me/articles/bloom-filters-for-dummies 8http://bugra.github.io/work/notes/2016-06-05/ a-gentle-introduction-to-bloom-filter
not always possible. There is a variant of the Bloom filter, called the Scalable Bloom filter, which dynamically adjusts its size for different numbers of objects. The use of such a Bloom filter could alleviate some of these shortcomings. A Bloom filter cannot produce the list of items inserted; it can only check whether an item has been inserted into a dataset. Finally, a Bloom filter cannot be used to answer questions about the properties of the objects.
2.3 Bloom Filters and RDBMS
2.3.1 Relational Database Management Systems
Relational database management systems have been a common choice for storing information since 1980, used for a wide range of data such as financial and logistics information, personal data, and other forms of information. Relational databases have replaced other forms such as hierarchical or network databases, as they are easier to understand and more convenient to use. The main advantage of the relational data model is that it allows the user to issue data access queries without the need to define access paths to the stored data or other additional details [98]. Furthermore, relational databases keep their data in the form of tables. Each table consists of records, called tuples, and each record is uniquely identified by a field, i.e. the primary key, which has a unique value. Each table is usually connected to at least one other database table through a relationship of the form: (i) one-to-one, (ii) one-to-many, or (iii) many-to-many. These relationships grant users unlimited ways of accessing data and dynamically combining it from different tables. Nowadays, the market provides more than one hundred RDBMS systems, the most popular of which are the following: (i) Oracle, (ii) MySQL, (iii) Microsoft SQL Server, (iv) PostgreSQL, (v) DB2 and (vi) Microsoft Access (DB-Engines 2016) 9. The SQL language is used for user communication with a relational database [137]. An SQL query demands no knowledge of the internal operation of the database or the underlying data storage system [172]. According to ANSI (American National Standards Institute) standards, SQL is a standard language for relational database management systems. Moreover, the SQL language is used to query a database for the management of such data and also for updating or retrieving data from a database. Some examples of relational databases that use SQL are: Oracle, Sybase, Microsoft SQL Server, Access and Ingres. The most important commands of the SQL query language are 10: SELECT, UPDATE, DELETE, INSERT INTO, CREATE DATABASE, ALTER DATABASE, CREATE TABLE, ALTER TABLE, DROP TABLE, CREATE INDEX, DROP INDEX. SQL commands are classified into the following basic types:
9http://db-engines.com/en/ranking/relational+dbms
• Query Language with key command: where the Select command for accessing information from the database tables is used.
• Data Manipulation Language with key commands: (i) Insert (introduction of new records), (ii) Update (modification of records), and (iii) Delete (removal of records).
• Data Objects Definition with key commands: (i) Create Table, and (ii) Alter Table.
• Safety Control of Database with key commands: (i) Grant, Revoke for user rights management on database objects, and (ii) Commit, Rollback for transaction management.
2.3.2 Queries Language-SQL
2.3.2.1 Membership Queries
The SQL IN command checks whether an expression matches any value from a list of values. Furthermore, it is used in order to avoid multiple uses of the OR command in SELECT, INSERT, UPDATE or DELETE queries 11. Besides checking whether an expression belongs to a set of values listed directly in the SQL query, it may also check whether an expression is part of a set of values from other tables.
2.3.2.2 Join Queries
Join queries, which combine values from two or more data tables based on a JOIN criterion, usually concern relationships between related tables. JOIN queries are distinguished into four categories: 10http://www.w3schools.com/sql/sql_syntax.asp 11https://www.techonthenet.com/sql/in.php
1. Inner Join: returns the values from Table A and Table B that satisfy the joining criteria.
2. Left Join: returns all the values from Table A and the values of Table B that meet the joining criteria.
3. Right Join: returns all the values from Table B and the values of the Table A that meet the joining criteria.
4. Outer Join: returns all the values from Table A and Table B, regardless of whether they satisfy the joining criteria.
2.3.2.3 Exist Queries
Existence-check queries are used in conjunction with a subquery. The condition is considered satisfied when the subquery returns at least one record. The check can be used within the following queries: SELECT, INSERT, UPDATE or DELETE 12.
2.3.2.4 Top Queries
The TOP command limits the number of records that a query returns, either to a specified number of rows or, as of the 2016 version of SQL Server, to a specified percentage of records 13. When the TOP command is used in combination with the ORDER BY command, the first N records are returned according to the sort order provided by ORDER BY; otherwise, N unsorted records are returned. In addition, the TOP command specifies the number of records returned by a SELECT statement or affected by statements such as INSERT, UPDATE, JOIN, or DELETE. The TOP SELECT command can be particularly useful in large tables with thousands of records, since accessing and selecting a large number of records can adversely affect the execution performance of a query.
12https://www.techonthenet.com/sql/exists.php 13https://docs.microsoft.com/en-us/sql/t-sql/queries/ top-transact-sql
Fig. 2.2 B-Tree overview
2.3.3 Indexes Table
Indexes are auxiliary structures in a relational database management system whose aim is to increase data access performance. These helper structures are created on one or more fields (columns) of a table. Moreover, an index provides a quick way to search data based on the values of the specific fields that are part of the index. For example, if an index is created on the primary key of a table and records are then sought based on the values of the corresponding field, the SQL Server first finds the value in the index field and then uses the index to quickly locate the full table entries. Without the index field, a scan of the entire table, row by row, would be required, directly affecting the performance of the query execution 14. Furthermore, an index consists of a set of pages organized into a B-tree data structure. This structure is hierarchical, comprising a root node at the top of the tree and leaf nodes at the lowest level, as illustrated in Figure 2.2. When a query including a search criterion is executed, the search starts from the
14https://www.simple-talk.com/sql/learn-sql-server/ sql-server-index-basics
root node and navigates through the intermediate nodes down to the leaf nodes of the B-tree structure. After locating the relevant leaf node, the query accesses the corresponding record either directly (in the case of a clustered index) or through a pointer to the relevant data record (in the case of a non-clustered index). A table in an SQL Server database can have at most one clustered index and multiple non-clustered indexes, depending on the version of SQL Server used.
2.4 Experimental Evaluation in SQL Server
In this section, we present the results of the experiments conducted in the context of this research to evaluate the use of Bloom filters. We perform a series of common SQL database queries with and without the support of the Bloom filter and graphically present the resulting time performance of the executed SQL queries. The SQL queries utilized are the following: In, Inner Join, Left Join, Right Join, Exists and Top. Tables 2.1 and 2.2 as well as Figure 2.3 show the execution times of the queries described previously. In the corresponding tables, the results with and without the use of the Bloom filter (indicated by the label BF) are shown as the number of records changes. Table 2.1 SQL Queries Execution Time Results vs Data Size
Execution Time in seconds

Data Size     In    In BF   Inner Join   Inner Join BF   Left Join   Left Join BF
10.000.000    44    24      44           24              1           1
9.000.000     38    24      41           26              1           1
8.000.000     26    21      26           24              1           1
7.000.000     19    21      19           20              0           1
6.000.000     19    20      19           20              0           1
5.000.000     18    19      19           19              0           1
4.000.000     13    14      13           12              0           1
3.000.000     11    12      12           13              0           1
2.000.000     10    12      11           12              0           1
1.000.000     3     3       3            3               0           1
For all SQL commands, and in particular for small numbers of records, we observed that the adoption of the Bloom filter structure overloaded the system, and thus the execution of the queries without the use of the Bloom filter was faster.
Table 2.2 SQL Queries Execution Time Results vs Data Size
Execution Time in seconds

Data Size     Right Join   Right Join BF   Exists   Exists BF   Top   Top BF
10.000.000    44           42              43       23          26    6
9.000.000     37           27              38       24          18    6
8.000.000     26           26              25       25          7     6
7.000.000     19           20              18       20          7     5
6.000.000     19           20              19       20          7     5
5.000.000     19           19              19       19          5     5
4.000.000     13           12              12       12          5     5
3.000.000     12           13              11       12          5     5
2.000.000     11           12              12       12          3     3
1.000.000     3            3               2        3           1     2
As the number of table records increases, and especially for values greater than or equal to 8,000,000, the performance advantage offered by the Bloom filter structure increases significantly, and the difference in query execution speed becomes evident, growing with the data size. It is worth noting that during repeated executions of the same queries we usually observed the same runtime, but occasionally there was a gap of about two seconds between results. In these cases, we took the average values.
2.5 Conclusions
2.5.1 Research Conclusions
The large response times of SQL queries in relational databases affect not only the users, but also other applications that may run on the same computer or on the network hosting the database. The Bloom filter is, in terms of space, an effective solution and has been used in numerous applications in the past, especially when immediate membership checking of an object was required. The experiments suggest that including a Bloom filter structure in an SQL Server database with a large number of records (10,000,000 records) may increase its data access performance. The optimization of query execution time in a database using the Bloom structure allows users to quickly extract the needed information
Fig. 2.3 Queries Execution Time vs Records Size
and increase the efficiency of the relevant database. The Bloom structure in a relational database acts as a filter that removes from join, membership or existence-check queries the need to access and process records that do not meet the criteria of the relevant queries. The potential benefit from this restriction of the records accessed and searched by an SQL query highly depends on the number of false positives that the Bloom filter check returns. This number is reduced as the length of the Bloom filter's binary vector is increased. On this point, an acceptable execution speed as well as balanced storage requirements should be chosen, according to the requirements of each database instance and the user requirements concerning the access speed of the relevant database. Especially in cases such as a database containing historical data records with no probability of further record updates, adopting a Bloom filter for faster access among numerous related tables can be considered a solution that could lead to increased efficiency. As can be seen from the execution times of the SQL queries (Tables 2.1, 2.2 and Figure 2.3), the benefit of restricting in advance the records involved in an SQL query is greater than including all records of the data tables and using the indexes for direct
access to them. It should be noted that, as the experimental measurements show, the application of the Bloom filter structure in a database is worthwhile only when the number of entries in the relevant tables is very large. Otherwise, the use of the Bloom filter may have the opposite effect, i.e. increase the query runtime.
2.5.2 Research Constraints
In the evaluation of the Bloom filter, we did not take into account possible delays caused by maintenance and regular updates of the Bloom filter structure during record updates in the relevant tables. Such delays could also affect the execution of other SQL queries. Although all experiments were performed on the same machine, those with the Bloom filter that were performed at different times may have been affected (in performance) by processes running in the background. These deviations do not directly affect the performance comparison between the same queries with and without the use of the Bloom filter, but mostly the comparison between the different SQL commands used.
2.5.3 Future Extensions
A promising and useful step would be to investigate the applicability of Bloom filters in other relational database management systems (such as Oracle, Sybase, MySQL) with the aim of generalizing the conclusions drawn from experimentation on the SQL Server relational database management system. Also, reviewing the actual performance of database operations with millions of records used to store application data would allow more reliable conclusions about the use of the Bloom filter structure in relational databases. In this way, possible system delays during application operation could be taken into account.
CHAPTER 3
MapReduce Implementations for Privacy Preserving Record Linkage
3.1 Introduction
The rapid evolution of technology and the Internet has created huge volumes of data at a very high rate, deriving from commercial transactions, social networks and scientific research. The mining and analysis of this volume of data may be beneficial in crucial areas such as health, the economy and national security, leading to more qualitative results. A common problem in data analysis is the record linkage (RL) process, which finds records that refer to the same entity across different data sources (e.g., data files, books, websites, and databases) [11],[26],[47],[145]. The purpose of RL is to categorize all possible combinations of records from different databases as similar or dissimilar by using attributes that are not necessarily identifying fields. The RL model requires at least two members that provide their data in the form of tables. A table row corresponds to an entity that is described by the columns. Often, the RL model is simplified to only two members that provide the data to be combined (Alice and Bob), with or without the presence of a third member (Carol). The third member undertakes the linkage process and communicates its results to the participating members. Privacy-preserving policies often prevent research into personal data. Thus, organizations are legally and ethically constrained in exchanging sensitive personal data, leading to datasets that are either free of sensitive personal data or encrypted to greatly enhance privacy protection. The privacy requirement during the RL process paved the way to Privacy-Preserving Record Linkage (PPRL) [28],[162],[165]. As in the case of RL, the PPRL process finds pairs of records referring to the same entity from multiple data
sources, where the classification as similar or dissimilar is conducted based on encoded data, to avoid disclosure of confidential data about the entities involved in the problem. An efficient blocking scheme for PPRL is HLSH/FPS when combined with Bloom filter based encoding. It applies Locality Sensitive Hashing and frequent-collision tables to the Hamming distances between Bloom filter encoded pairs of records, in order to reduce the number of pairs on which a more rigorous similarity comparison is performed. PPRL blocking techniques fall into the batch processing category, and in the Big Data world one of the most widely used systems for batch processing applications is MapReduce, which is distributed and fault-tolerant. In this chapter, we evaluate the performance of four MapReduce work-flows of the HLSH/FPS blocking scheme for the PPRL framework. The rest of this chapter is organized as follows: Section 3.2 analyzes background knowledge on encoding techniques based on Bloom filters and on private indexing. Section 3.3 briefly describes the MapReduce framework of the HLSH/FPS implementations. Finally, Section 3.4 presents the experimental evaluation and Section 3.5 presents conclusions.
3.2 Related Work
3.2.1 PPRL encoding techniques
In this section we describe some Bloom filter based encoding techniques that are nec- essary for the PPRL process.
3.2.1.1 String encoding
The basic idea of this approach is the hashing of the q-grams of string fields of records into Bloom filters [144]. Bloom filters [17] use a set of K hash functions whose results give the positions of bits in a bit vector of size S. Their objective is to give a quick answer about the membership of an element in a set by checking K positions in the bit vector. The computation of the K hash functions Hi(x) may be done through two independent hash functions as:
Hi(x) = (h1(x) + i · h2(x)) mod S (3.1)
For the h1(x) and h2(x) functions, we choose the cryptographic methods HMAC-SHA1 and HMAC-MD5 respectively, due to their widespread and efficient implementations on cryptographic platforms.
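Equation (3.1) with HMAC-based h1 and h2 can be sketched as below. The shared secret key is a placeholder assumption (in practice it would be agreed privately by the data owners), and the function names are ours:

```python
import hashlib
import hmac

SECRET = b"shared-linkage-key"  # placeholder: a key agreed privately by the data owners

def h1(x: str) -> int:
    """HMAC-SHA1 digest of x, interpreted as an integer."""
    return int(hmac.new(SECRET, x.encode(), hashlib.sha1).hexdigest(), 16)

def h2(x: str) -> int:
    """HMAC-MD5 digest of x, interpreted as an integer."""
    return int(hmac.new(SECRET, x.encode(), hashlib.md5).hexdigest(), 16)

def bit_positions(x: str, K: int, S: int) -> list:
    """K bit positions via Eq. (3.1): Hi(x) = (h1(x) + i * h2(x)) mod S."""
    a, b = h1(x), h2(x)
    return [(a + i * b) % S for i in range(K)]

# The same q-gram always maps to the same K positions, as required of a hash scheme.
pos = bit_positions("jo", K=4, S=1000)
```

Using two HMACs with one secret keeps the mapping deterministic for both parties while remaining opaque to anyone without the key.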
3.2.1.2 Record encoding
This method is used to encode entire records rather than single strings, as the previous method does. Each record consists of fields such as name, username, age and address. As the PPRL process aims at protecting such data, it is necessary to encode the values of the selected fields of all the table's records. To this end, we suggest an encoding method based on a Bloom filter, for which it is necessary to pre-select values for the involved parameters, such as the average number of q-grams. Three different approaches to record encoding using Bloom filters are described below. FBF (Field-level Bloom Filters) encoding [42], [41], [144] is the simplest extension of string encoding with Bloom filters. The field values of a record are encoded in separate Bloom filters, which are then composed into a larger Bloom filter that encodes the entire record. In brief, the encoding steps are:
1. For a selected Q value (the q-gram length), we calculate the average number of q-grams g per record over the fields that will participate in the PPRL process, and from it the appropriate Bloom filter size SFBF.
2. From each string field, q-grams are extracted for a selected Q value.
3. The extracted q-grams are encoded into Bloom filters of size SFBF using hash functions.
4. The Bloom filters produced for each field are concatenated into a larger one in a predetermined order.
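The four FBF steps can be sketched as follows. This is a simplified illustration: the padding, Q = 2, the double-hashing scheme and all function names are our assumptions, not the chapter's parameters:

```python
import hashlib

def qgrams(value: str, q: int = 2) -> set:
    """Step 2: extract q-grams from a padded, lower-cased string field."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def encode_field(value: str, size: int, k: int) -> list:
    """Step 3: hash each q-gram into a field-level Bloom filter of size SFBF."""
    bits = [0] * size
    for g in qgrams(value):
        a = int(hashlib.md5(g.encode()).hexdigest(), 16)
        b = int(hashlib.sha1(g.encode()).hexdigest(), 16)
        for i in range(k):
            bits[(a + i * b) % size] = 1
    return bits

def fbf_encode(record: list, size: int = 64, k: int = 3) -> list:
    """Step 4: concatenate the field-level filters in a predetermined order."""
    encoded = []
    for field in record:
        encoded.extend(encode_field(field, size, k))
    return encoded

record_filter = fbf_encode(["john", "smith"])  # two fields -> a 128-bit record filter
```

In the FBF/Static variant, size and k would be fixed in advance; FBF/Dynamic would first scan the data to estimate the average number of q-grams before choosing the size.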
The FBF encoding is distinguished into FBF/Static and FBF/Dynamic. The first requires Q, K and SFBF to be specified in order to encode the records, while the second, for given Q values, performs an initial preprocessing step to calculate the average number of q-grams g, from which the appropriate Bloom filter size SFBF is derived. The basic idea of CLK encoding [145] is the use of a single large Bloom filter of size S to encode all fields of the record, using the q-grams produced from each field for a selected Q value and K hash functions. The encoding steps are:
1. Extract q-grams from each field for a selected Q value.
2. Take the union of all produced q-gram sets of the values to be encoded.
3. The extracted q-grams are placed in an S-size Bloom filter using K hash functions.
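The CLK steps differ from FBF in that all fields share one filter. A sketch under the same illustrative assumptions (Q = 2, padding, double hashing over MD5/SHA-1):

```python
import hashlib

def qgrams(value: str, q: int = 2) -> set:
    """Step 1: extract q-grams from a padded, lower-cased field value."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def clk_encode(record: list, S: int = 256, K: int = 4) -> list:
    """Steps 2-3: union the q-grams of all fields and place them
    in a single S-bit Bloom filter using K hash functions."""
    grams = set()
    for field in record:
        grams |= qgrams(field)            # step 2: union across fields
    bits = [0] * S
    for g in grams:
        a = int(hashlib.md5(g.encode()).hexdigest(), 16)
        b = int(hashlib.sha1(g.encode()).hexdigest(), 16)
        for i in range(K):
            bits[(a + i * b) % S] = 1     # step 3: K positions per q-gram
    return bits

# Because fields share q-grams, these dissimilar names produce overlapping encodings.
a = clk_encode(["james", "johnson"])
b = clk_encode(["john", "jameson"])
```

Running the last two lines shows many common set bits between the two encodings, which is exactly the similarity-check pitfall discussed below.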
Since CLK encoding places common q-grams from different fields at the same K locations in the Bloom filter, it is difficult for an attacker to infer either the encoding parameters or the original field values. However, this same property of CLK encoding, i.e. common q-grams between fields, can lead to incorrect results in the similarity check. For example, for the names "James Johnson" and "John Jameson", which are dissimilar, a similarity check over their CLK encodings may decide that the entries are similar. The RBF (Record-level Bloom Filter) encoding is based on FBF and attempts to enhance privacy protection in the PPRL process by introducing additional parameters and information into the encoding steps [42], [41]. Initially, it encodes the values of the fields in separate Bloom filters and then draws a random set of bits from each one so as to compose a larger Bloom filter. Finally, it applies a random bit permutation to the larger Bloom filter, the RBF encoding being the result of this rearrangement. With regard to the number of bits to be selected for the encoding of each field, we consider two ways of calculating it: uniform and weighted. The first uses uniform bit selection from the FBF encoding of the record, selecting an equal or approximately equal number of bits Sf for each field. The weighted way uses a weighted selection of the field encoding bits, which leads to selecting more or fewer bits for some of the fields. In [42], [41], it is mentioned that the weighted choice is based on the importance of each field in the linkage process. In order to determine the significance of the fields, the probabilities m and u of the Fellegi-Sunter probability model are used.
The agreement and disagreement weights, as well as the range between these two weights, are calculated, and the normalized percentage of the range of each field is then computed. In this way, each field contributes a percentage wi to the final Bloom Filter. The size SRBF of the final Bloom Filter is derived from the wi percentage that maximizes that size.
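The weighted allocation described above can be sketched as follows; the per-field (m, u) probabilities, the helper name and the rounding rule are illustrative assumptions, not the dissertation's exact procedure:

```python
from math import log2

def rbf_bit_allocation(fields, S_rbf):
    """fields: {name: (m, u)} Fellegi-Sunter probabilities per field.
    Returns how many bits to sample from each field's FBF for the
    record-level filter, proportional to each field's weight range."""
    ranges = {}
    for name, (m, u) in fields.items():
        w_agree = log2(m / u)                   # agreement weight
        w_disagree = log2((1 - m) / (1 - u))    # disagreement weight
        ranges[name] = w_agree - w_disagree     # range of the two weights
    total = sum(ranges.values())
    # Normalized percentage wi of each field, scaled to the final size.
    return {name: round(S_rbf * r / total) for name, r in ranges.items()}
```

A field that discriminates matches well (high m, low u) gets a wider weight range and therefore contributes more bits to the final filter.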
3.2.2 Private Indexing
The goal of indexing in PPRL is to substantially reduce the pairs of encoded records to be tested through similarity control. In this case, the third party (Carol) has little information about the data encoding of Alice and Bob. In this direction, we are going to discuss HLSH indexing. The HLSH (Hamming Locality Sensitive Hashing) Indexing [42], [92] is used for
partitioning private records encoded in binary form of length S. Let Tl, l = 1, . . . , L, be a set of independent hash tables consisting of dynamic sets of key-value pairs. Each hash table Tl uses a set of K hash functions h_k^l that return the value of a randomly selected bit from the binary encoded records in the table. The values of the K functions form a key for the encoded records, which can be gathered from Alice's and Bob's encoded sets A′ and B′. Entries from the two sets are stacked under the same key in a hash table Tl, thus suggesting a possibly identical pair if they match in K bits. The Id values of the encoded records can subsequently be used in order to finally form the possibly identical pairs.
Let encoded records rA ∈ A′ and rB ∈ B′ consist of an Id and the Bloom Filters BfA and BfB respectively. In addition, let θ be a selected threshold for the Hamming distance, calculated as dH = |BfA ⊕ BfB|. We consider a family H of hash functions having the following property:

if dH ≤ θ then Pr[h_k^l(BfA) = h_k^l(BfB)] ≥ p_θ,    (3.2)

where k = 1, 2, . . . , K, l = 1, 2, . . . , L and

p_θ = 1 − θ/S.    (3.3)

The suitable value for the number of hash functions K can be computed empirically,
as the accuracy of the method is mainly determined by the number of tables, Lopt. Generally, this value should form enough buckets so that the number of linked lists for the pairs of records stays low; for larger values, more identical entries appear in pairs of records. The formation of a pair of identifiers or encoded entries {rA, rB} during
HLSH in one of the Tl tables is called a collision. The method is redundant, so a pair can occur in C = 1, . . . , Lopt hash tables. The pair {rA, rB} with collisions C = Lopt is similar with high probability and, intuitively, one can argue that as the number of conflicts increases, it becomes more likely that the records are the same.

Fig. 3.1 HLSH under FPS
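The table construction described in this section can be sketched as follows (a simplified in-memory illustration; the parameter values and the seeded random bit selection are assumptions, not the dissertation's implementation):

```python
import random

def build_hlsh_tables(encoded, S=1024, K=30, L=5, seed=7):
    """Build L hash tables; each keys records by K randomly chosen bit
    positions of their S-bit Bloom filter encodings.

    encoded: {record_id: list of S bits (0/1)}."""
    rng = random.Random(seed)
    # For each table Tl, draw K random bit positions (its K hash functions).
    bit_choices = [rng.sample(range(S), K) for _ in range(L)]
    tables = [{} for _ in range(L)]
    for rec_id, bf in encoded.items():
        for l, bits in enumerate(bit_choices):
            key = tuple(bf[b] for b in bits)  # K selected bits form the bucket key
            tables[l].setdefault(key, []).append(rec_id)
    return tables, bit_choices

def hamming(bf_a, bf_b):
    """dH = |BfA xor BfB|: number of differing bits."""
    return sum(a != b for a, b in zip(bf_a, bf_b))
```

Records that agree on all K selected bits of a table collide in that table's bucket; similar filters (small dH) collide in many of the L tables.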
3.3 MapReduce Framework
HLSH along with the use of the Frequent Pair Scheme (FPS) [93] can lead to fast and efficient record linkage by checking similarity only for frequent collision pairs. We present four MapReduce implementations of the HLSH/FPS method for different sizes of the encoded records of Alice A′ and Bob B′, the hash tables Tl, as well as the set of candidate IDs RIds. We consider that set B′ is smaller than set A′, so it is chosen for the initial creation of the hash tables Tl (Figure 3.1). The use of HLSH/FPS allows the implementation of an effective system with a relatively low memory footprint. Our investigation focuses on memory saving and suggests four different versions of the HLSH methodology, namely v0, v1, v2 and v3. Each version makes assumptions about the sizes involved in the problem and progressively "transfers" these structures from the slow disk to the faster memory of the Mapper/Reducer. We assume that every Mapper or Reducer in a MapReduce task has a fixed memory limit mtask that can be committed by YARN. Each of the four versions consists of 2 or
3 different MapReduce Jobs, which in turn consist of a number of tasks (Mappers or Reducers) depending on the HDFS size of the problem and user settings.
Version v0 is characterized by memory saving when performing the job, but is very expensive in disk use and is especially suitable for Apache YARN environments with low memory availability for the tasks of a MapReduce job. It assumes that Alice's and Bob's encoded records and Tl are so large that they cannot fit in the limited task memory. All pairs of identifiers from the HLSH process are formed and, subsequently, the ones that appear at least Cf times are stored again in HDFS, to be loaded into the memory of the last job, which undertakes the linkage of the candidate records based on the identifiers. This approach, in addition to requiring multiple MapReduce tasks, can be considered the most naive, as it forms all pairs of identifiers that can be derived from the HLSH process. On the contrary, according to the experimental results, it is the version that uses the least memory in Mapper/Reducer tasks.
Version v1 allows more relaxed conditions for the committed memory of the tasks
to be performed. We assume that the set of Tl tables is able to fit into the memory of each task in a MapReduce job. With this important information in mind, we can perform HLSH/FPS by exclusively storing the frequently colliding pairs in HDFS. In the last two versions, we also assume that the records of the smaller set B′ can be stored as a whole in the mtask memory of Mappers and Reducers. In both versions, the first job involves the creation and storage in HDFS of the hash tables Tl of that set of records. In the second job, the two versions are differentiated in terms of use or non-use of the Reduce phase.
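The FPS filtering step that all four versions share can be sketched in memory as follows (a hypothetical single-machine stand-in for the corresponding MapReduce jobs, not the actual Hadoop implementation):

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(tables, Cf):
    """Count collisions of each candidate pair across the L hash tables and
    keep only pairs colliding at least Cf times (the FPS criterion).

    tables: list of dicts, one per HLSH table Tl, mapping bucket key -> ids."""
    collisions = Counter()
    for table in tables:
        for bucket in table.values():       # ids sharing the same K-bit key
            for a, b in combinations(sorted(bucket), 2):
                collisions[(a, b)] += 1     # one collision per shared bucket
    return {pair for pair, c in collisions.items() if c >= Cf}
```

In the distributed versions, this counting is what the Mappers emit and the Reducers aggregate; only the surviving pairs proceed to similarity control.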
3.4 Performance Evaluation
The evaluation of the four schemes is conducted by considering CLK encoding for the PPRL process, with S = 4096, under the settings listed below for the parameters δ, Cf, LCf and K. Here, δ is the confidence parameter defining the likelihood that pairs which are actually the same are not matched in the tables; this value is usually low, indicatively δ = 0.01.
Fig. 3.2 PPRL evaluation
δ          Cf   LCf   K
0.001      4    52    30
0.0001     6    74    30
0.00001    7    91    30
0.000001   9    114   30
In the first screen of Figure 3.2, the simulation results of the four versions of HLSH/FPS are shown. The fastest in all cases is v3, while v0 has the largest footprint on disk, since it writes to HDFS all pairs of identifiers derived from HLSH. We also observe that for the highest value of δ, the total memory footprint of the jobs is also large; however, as δ decreases and the HLSH parameters change accordingly, memory consumption decreases as the number of candidate records for comparison increases. Regarding the other versions, as the FPS strategy is utilized, execution times are only slightly affected by this change. The disk footprint for v1 is slightly affected; the same holds for v2 and v3.

Fig. 3.3 PPRL evaluation
However, it is evident that as LCf increases, while the number of candidate records to be compared is reduced, the memory footprint of all jobs, except for v0, grows faster. As the number of records of B′ remains constant, this increase corresponds to the growth of the hash tables Tl.
We then conduct the same procedure for the two largest sets of records, without the
v0 version. In this case, we show measurements for δ = 0.0001, as well as Cf = 6, LCf = 74 and K = 30. The second screen of Figure 3.3 presents the prevalence of versions v3 and v2 over v1 on all metrics.
3.5 Conclusions
The four versions that were presented give the Carol member the capability to choose between the slow, but memory-economical, and the fast, but memory-demanding, MapReduce executions. This also shows the need for techniques that help users decide (in terms of resource use, relative costs and problem size) which of the four versions is appropriate. The experimental evaluation shows that versions v1, v2 and v3, which make progressively smarter use of memory, have the advantage of quick execution of HLSH/FPS compared to v0. But as the number of records grows, the size that must fit into task memory also increases. With the prospect of a slow but complete HLSH/FPS process, v0 may be the best proposal for Hadoop environments with limited memory resources.
CHAPTER 4
Security and Privacy Solutions associated with NoSQL Data Stores
4.1 Introduction
The advances in cloud computing technology and distributed web applications, along with the ever-increasing volume of data for storage and further processing, have rendered necessary the adoption of non-relational databases, known as NoSQL or "Not only SQL" [124]. It is widely known that traditional SQL databases are not able to cope with Big Data [147], and NoSQL systems are nowadays experiencing an increase in popularity [114]. In recent years, many NoSQL databases have made their appearance; Cassandra and MongoDB are two popular ones, to name a few. Some useful features of NoSQL databases are high availability, scalability and better performance, as well as the ability to store and process large-scale semi-structured and/or unstructured data faster than traditional RDBMS [114], [147]. However, due to the ever-increasing use of NoSQL databases, a significant amount of sensitive data is exposed to a number of security vulnerabilities, threats and risks. Lack of encryption support and poor authentication between servers and clients are some of the leading security issues in NoSQL databases. Also, it should be noted that only simple authorization is provided, without support for role-based access control (RBAC), and so there is no protection against injections and denial-of-service attacks [62].
Brewer in [20] made a conjecture about the trade-offs in the development of a distributed database system, thus introducing the CAP (Consistency, Availability, Partition Tolerance) properties. A formal version of Brewer's conjecture was officially published as the CAP theorem in [55]. Specifically, the CAP theorem indicates that no shared data system can provide, at the same time, more than two out of the three properties of consistency, availability and partition tolerance.
Regarding organizations, Amazon developed the Dynamo technology [35], whereas
Google produced the distributed storage system Bigtable [24]. These particular technologies have inspired many NoSQL applications deployed in companies like Facebook or Twitter. Modern companies deal with data that are not relational and need databases superior to the traditional ones, which encounter scalability and availability problems because of the data size. There are already several authorization models in relational databases, where views are usually utilized; in this way, SQL queries are used to display a specific state of a specified part of the database [13]. Some NoSQL databases used for Big Data employ new authorization models, which are specifically designed for the structure, speed and huge amount of the data. These models include key-value, wide-column and document-oriented authorization. In addition, the storage and retrieval of records are achieved through a unique key for each record, which provides swift search [29].
In this chapter, we present security and privacy issues in NoSQL databases and further examine them in order to propose the most efficient security mechanisms and privacy solutions. More to the point, data protection and access control are some of the key issues of security in NoSQL, while several security threats for NoSQL databases are considered, such as the distributed environment, authentication, fine-grained authorization and the protection of data at rest and in motion.
The remainder of this chapter is organized as follows. Section 4.2 presents a survey of existing related works concerning mechanisms to overcome security issues. Moreover, Section 4.3 provides a comparative study between relational and NoSQL databases, while in Section 4.4 our security and privacy-preserving mechanisms are proposed. Finally, Section 4.5 summarizes the chapter.
4.2 Related Work
Many early papers that addressed the relationship between relational and NoSQL databases gave an overview of NoSQL databases, as well as their types and characteristics. They were enthusiastic about NoSQL and how it challenged the dominance of SQL [126], [148]. In [14], however, structured and non-structured databases were discussed; there, it was also explained how the use of NoSQL databases, like Cassandra, improved the performance of the system. In addition, such a database can scale the network without changing any hardware or altering the server infrastructure, which results in improved network scalability with low-cost commodity hardware.
In [80], a survey regarding relational databases is introduced, along with NoSQL features and shortcomings. These shortcomings and issues of NoSQL databases have also been mentioned in [104]; complexity, consistency as well as limited ecosystems are considered serious concerns. Also, in [116] the authors state that the demand for relational databases will not go away anytime soon, and that they will exclusively serve lines of applications that support business operations, whereas NoSQL databases will serve the large, public and content-centric applications. Another similar work is the one presented in [124], where an extensive analysis of security issues in NoSQL databases, like Cassandra and MongoDB, is considered.
Several solutions have also been proposed to improve privacy preservation in NoSQL databases. More specifically, in Arx, a proxy is employed in order to rewrite NoSQL queries at the trusted premises, while a back-end component, deployed at the untrusted premises, performs computation over encrypted data [134]. In the BigSecret system, standard encryption is used for the protection of the stored data, while the indexes are encoded using special techniques that allow comparisons (pseudo-random functions) and range queries (order-preserving partitioning) [132]. The authors in [186] employ searchable encryption algorithms to build a privacy-preserving key-value store on top of the Redis database; in this approach, the values are protected with symmetric encryption, while the keys are secured with pseudo-random functions. In another solution, SafeRegions combines secret sharing and multiparty computation to perform secure NoSQL queries on three independent and untrusted HBase clusters, thus simultaneously providing secure computation over the stored values and security guarantees similar to standard encryption [135].
4.3 Comparison of Relational and NoSQL Databases
During the last decades, relational databases, sub-divided into groups known as tables, have been used with the aim of storing structured data. The units of data in each table are known as columns, and each unit of the group is known as a row. Also, the columns in a relational database have relationships amongst them. This situation has tended to change over the last years due to the rise of large web applications, which output a huge amount of data that traditional relational databases cannot handle any more [34].
NoSQL databases are sometimes referred to as "Not only SQL" so as to emphasize the fact that they may also support query languages that are SQL-like. Nowadays, it is stated that NoSQL databases have more to offer than just presenting solutions to scaling problems, providing many important advantages [34] like the following:
• The data representation is schema-less, and there is no need to define a certain structure from the beginning since new fields at run-time can be added.
• The speed: data can be processed in milliseconds instead of hundreds of milliseconds.
• The elasticity of the applications due to the scalability features that NoSQL databases offer.
• Reduced development time, as developers do not have to deal with complex SQL queries and difficult joins so as to collate the data from different tables into a new view.
Some of the differences between relational and NoSQL databases are listed in the following paragraphs.
4.3.1 Reliability of Transactions
The ACID (atomicity, consistency, isolation, durability) model is fully supported by the design of relational databases, providing high reliability in transactions unlike the NoSQL databases.
4.3.2 Scalability Issues and Cloud Support
The primary purpose of cloud technology is to provide services to end-users. NoSQL databases are fully compatible with cloud environment requirements, as they can analyze not only raw structured data, but also semi-structured or unstructured data from different sources, since they are not bound by the ACID model. On the other hand, relational databases do not provide search over full content, and their characteristics are not designed for cloud use. The need for scalability is possibly one of the most significant problems of relational databases, as they rely on vertical scalability to upgrade performance.
More specifically, this upgrade method requires the purchase of expensive equipment such as RAM, processors, SSD hard drives, etc., and in some cases this is not easily achieved due to each system's constraints. Also, horizontal scaling through the addition of extra nodes is not supported and, therefore, relational databases cannot support demanding online applications with many users and distributed data. NoSQL databases, however, support horizontal scaling, since they do not deal with relational data.
4.3.3 Complexity and Big Data Management
The complexity of NoSQL databases is less than that of relational databases, as it is not necessary to create tables to record the data; instead, modeling driven by the query method can be used. Also, the development of a database structure on a relational database is always considered a complicated task compared to the abstract model of a NoSQL database, where data can be stored regardless of whether they are structured, unstructured or semi-structured.
NoSQL databases have a valuable role in Big Data management, since they are well-suited for storing and retrieving data at high speed across distributed nodes, thus taking advantage of multi-core and GPU architectures. In relational databases, where accuracy is more important than speed, the data should be stored in tables' rows and columns, while scalability is always considered a big issue. In the case of conventional applications with small datasets, they are the most reasonable choice, but splitting the data across different servers increases the difficulty, requiring complex SQL queries to join the data again.
4.3.4 Data Model
Sets in mathematics are the driving force behind relational databases; all the data are represented as mathematical n-ary relations, where an n-ary relation is a subset of the Cartesian product of N domains. The data are represented as tuples inside the database and are further grouped into relations. A relation (represented by a table) contains a set of tuples (represented by rows); the columns of the relation table follow a sequence of attributes, and the type of an attribute is identified by its domain, which is the set of values that have a common meaning. This data model is very specific and well organized, while the columns and the rows are described by a well-defined schema.
NoSQL databases can employ many modelling techniques like graphs, key-value stores and the document data model. In terms of classification, NoSQL databases are named after their data model, but in some cases a NoSQL database system can be identified by using two or more of the data models that represent its data. The NoSQL data model does not use the table as the storage structure of the data, and this is considered the main feature that distinguishes NoSQL from relational databases. Furthermore, it is schema-less and, as a result, can handle unstructured data like word, pdf, image, as well as video files, in a very efficient manner.
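To contrast the two models, the same records can be held as relational tuples under one declared schema or as schema-less documents (a toy illustration; the table, keys and field names are made up):

```python
# Relational view: every row conforms to one declared schema (a table of tuples).
schema = ("id", "name", "city")
rows = [(1, "Alice", "Patras"), (2, "Bob", "Athens")]

# Key-value/document view: each value is a self-describing, schema-less
# document reached through a unique key; new fields can be added at run time
# without migrating any other record.
store = {
    "user:1": {"name": "Alice", "city": "Patras"},
    "user:2": {"name": "Bob", "tags": ["gps", "taxi"]},  # extra field, no migration
}
store["user:2"]["city"] = "Athens"  # a field added on the fly
```

The relational rows cannot gain a column without altering the schema for every row, whereas each document evolves independently; retrieval in the second model is a single lookup by its unique key.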
4.3.5 Data Warehouse and Crash Recovery
Regarding data warehousing, relational databases gather data from many sources, and the oversized volume of stored data results in big data problems; to name a few, performance degrades when utilizing OLAP (Online Analytical Processing), statistical processing or Data Mining. On the other hand, NoSQL databases are not designed with data warehouse applications in mind, because their designers focus on scalability, availability and high performance.
Crash recovery is implemented in relational databases via the recovery manager, which is responsible for ensuring durability and transaction atomicity by using log files and the ARIES algorithm. Crash recovery in NoSQL databases depends on replication to recover from a crash.
4.3.6 Privacy and Security
Most relational databases do not provide any feature for embedding security in the database itself. As a result, developers have to impose security directly in the middleware. Classic cryptographic mechanisms and encryption protocols, such as asymmetric key encryption schemes, digital signature schemes, zero-knowledge Proofs of Knowledge, as well as commitment schemes, which are based on SRSA (Strong RSA), bilinear maps [8], the discrete logarithm, and homomorphic encryption, fully or not [1], have been widely considered for securing communication and ensuring data confidentiality in relational databases.
Nonetheless, one of the most serious shortcomings of NoSQL databases is considered to be the fact that data files are not encrypted by default; such a process takes place in
the application layer, before sending data to the database server. Although there are solutions that provide encryption services, these lack the horizontal scaling and transparency required in the NoSQL environment.
Furthermore, only a few NoSQL databases provide encryption mechanisms to protect user-related sensitive data. By default in NoSQL databases, the inter-node communication is not encrypted, and SSL (Secure Sockets Layer) client-node communication is not supported (as it is in relational databases), breaking network security [147]. Also, there is no integration of authentication or authorization mechanisms. The distributed environment increases the attack surface across several distributed nodes, and enforcing integrity constraints is much more complex in NoSQL databases. In general, only a few categories of NoSQL databases provide mechanisms that employ encryption techniques to protect data at rest.
4.4 Proposed Security and Privacy Solutions
Below are our proposed security and privacy solutions for NoSQL data stores.
4.4.1 Pseudonyms-based Communication Network
In the context of this system, users can have access to multiple services by inserting their credentials only once, that is, when they initially connect to the system. Such a system is called anonymous because users can be known only through their pseudonyms, and the transactions carried out by the same user cannot be linked, as their identity is not disclosed. For this reason, it is considered the best means in terms of user protection. Furthermore, it is based on two vital protocols, RSA and Diffie-Hellman. Its structure and operations extend Brands' credential system [19], and it consists of four parties: the users U, a central Identity Provider denoted as IP, the Service Providers SPs, and the organization for issuing and validating credentials. Users are entities that receive credentials and are known to Service Providers only through their pseudonyms.
The central Identity Provider creates its own public and secret key, denoted as (P, S) respectively, and uses its secret key to digitally sign its sensitive data. Each credential is encoded with m + 1 attributes, denoted as y1, y2, . . . , ym, t (where t is the credential issuing time). The IP decides on Gq, a finite cyclic group of prime order q, to which
the random generators g1, g2, . . . , gm, gm+1, h0, involved in key generation, belong. Specifically,
S = (y1, y2, . . . , ym, t, s)    (4.1)

P = g1^y1 · g2^y2 · · · gm^ym · gm+1^t · h0^s,    (4.2)

where s ∈ Zq is kept secret.
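A toy sketch of the key construction in (4.1)-(4.2), using deliberately small parameters for illustration (real deployments use moduli of at least 2048 bits, and the attribute values here are arbitrary):

```python
import random

# Toy parameters: safe prime p = 2q + 1, so the squares mod p form a
# subgroup Gq of prime order q.
p, q = 2039, 1019
rng = random.Random(0)

def gen():
    """Pick a random generator of the order-q subgroup of squares mod p."""
    while True:
        g = pow(rng.randrange(2, p - 1), 2, p)
        if g != 1:
            return g

m = 3                                  # number of credential attributes
gens = [gen() for _ in range(m + 1)]   # g1, ..., gm and g_{m+1} for the time t
h0 = gen()

ys = [rng.randrange(1, q) for _ in range(m)]  # attributes y1, ..., ym in Z_q
t = 2020                                      # credential issuing time
s = rng.randrange(1, q)                       # the secret s, kept in Z_q

# Public key P = g1^y1 * ... * gm^ym * g_{m+1}^t * h0^s  (mod p), as in (4.2)
P = 1
for g, e in zip(gens, ys + [t]):
    P = P * pow(g, e, p) % p
P = P * pow(h0, s, p) % p
```

Since every factor lies in the order-q subgroup, so does P; under the Discrete Logarithm assumption, recovering (y1, . . . , ym, t, s) from P is infeasible at real parameter sizes.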
Under the Discrete Logarithm assumption in Gq, these keys are unique. The IP is responsible for the distribution of the digital pseudonyms p1, p2, . . . , pm to any user. An organization can issue a credential to a pseudonym, and the corresponding user can prove its ownership to another organization (which knows them by a different pseudonym), by just revealing the ownership of the credential.
Additionally, the Credential Authority CA prevents the sharing of credentials or pseudonyms and guarantees that users who enter the system have a public and secret key that makes them unique to the system. Another entity in the system is the Verifier V, whose role is to certify the validity of the user credentials and to communicate with either the Issuing Authority or the Credential Authority to report that a user is not the owner of the credential they are presenting. A user, in terms of a digital credential, transmits the public key and the CA's digital signature derived from a Proof of Knowledge, through which they prove that they know the secret key and the attributes in the digital credential that satisfy the particular attribute property they are revealing.
Each pseudonym and credential belongs to a well-defined user. More in detail, it is impossible for different users to collaborate and show some of their credentials to a Service Provider, as well as to jointly obtain a credential that none of them could obtain alone (coherent credentials). As organizations are autonomous and separable entities, they can select their public and secret keys independently of the other entities, so as to ensure the security of these keys and facilitate the key management system.
The pseudonyms system can protect user privacy and provide security, as in such a system an organization cannot find out anything about a user other than the ownership of a set of credentials.
Specifically, two pseudonyms that belong to the same user cannot be linked (unlinkability) or identified, as in Brands' system, except for
specific conditions. In order to be efficient, any communication in the system involves as few entities as possible, along with the minimum amount of information. If a user holds a credential, it can be shown multiple times without the need to reissue (and consequently re-sign) it. When a user accesses a service, they are validated by proving that they know the secret key of their pseudonym, without revealing it, thus preventing pseudonym replay. Also, for each pseudonym that a Service Provider associates with a user, it requires the user to unveil a different encoded random number of their pseudonym each time, thus ensuring the unconditional unlinkability of their pseudonyms. Although the Identity Provider blindly encodes the random numbers in all of a user's pseudonyms, which are uniquely related to them, if a user abuses the service, the SP can blacklist and reveal these numbers. Subsequently, it is able to globally revoke their pseudonyms and abolish their access to any of the services they previously had. Finally, under the Discrete Logarithm assumption, users can conclusively prove that their encoded numbers do not belong to the SP's blacklist, using the blacklist as input to a zero-knowledge proof and without revealing any information about their identity. Hence, this technique does not impact users' privacy and does not strengthen the SP and IP.
4.4.2 Monitoring, Filtering and Blocking
As mentioned above, the available applications designed to monitor NoSQL databases cannot detect and then disable malicious jobs and queries. The Kerberos central authentication system can be easily bypassed via advanced scripts and, in general, the level of monitoring is limited to data processing, mainly at the API level [83]. In a cloud environment, no information regarding the communication of nodes in the cluster, user connection details, or data-altering actions (even editing or deleting) is recorded. In general, since there are no log files, identifying incidents of data breach or malicious data loss in the cluster is a challenging problem [79].
Real-time security mechanisms exist in big data technologies, enabling high-speed data analysis. Therefore, the detection of anomalies is implemented in real time and the recording of security analytics can be frequently updated [62]. Some monitoring tools are available, but they are limited to controlling user requests at the API level. In general, neither are the characteristics of a malicious query in big data technologies defined, nor do complete monitoring tools exist to disable such malicious queries. One technique could be an initial authentication via Kerberos and, subsequently, a second-level authentication for accessing MapReduce [109].
4.5 Conclusions
In this chapter, we have discussed major security concerns regarding NoSQL databases. Data protection and access control can be considered some of the key issues of security in NoSQL technology. Reasons for security threats in various NoSQL databases have also been thoroughly discussed in the current work, such as the privacy of user data, the distributed environment, authentication, fine-grained authorization and access control, safeguarding integrity, as well as the protection of data at rest and in motion. In NoSQL databases, Kerberos is used to authenticate the clients and data nodes; specifically, in order to ensure fine-grained authorization, data are grouped according to their security level. On the other hand, Cassandra uses the TDE technique to protect data at rest, whereas, to maintain a secure MongoDB deployment, administrators must implement controls ensuring that applications and users only have access to the data they need. Various techniques for mitigating attacks on NoSQL databases have also been discussed, along with the proposed security and privacy solutions for NoSQL databases.
CHAPTER 5
Trajectory Clustering and k-NN for Robust Privacy Preserving Spatio-Temporal Databases
5.1 Introduction
Nowadays, the rapid development of Internet-of-Things and Radio-Frequency Identification sensor systems [174], in combination with the evolution of satellite and wireless communication technologies, enables the tracking of moving objects such as vehicles, animals and people [157]. Through mobility tracking we collect a huge amount of data which give us considerable knowledge. Moving objects follow a continuous trajectory; however, this is described by a set of discrete points acquired by sampling at a specific rate, with time-stamps over a time period [187]. A simple description of a trajectory is as a finite sequence of pairs of locations with time-stamps, which fits a GIS (Geographical Information System) database [157]. In the real world, people's activities form spatio-temporal trajectories which are recorded either passively or actively, and can be used in human behavior analysis. Some examples of active recording are the check-ins of a user in a location-based social network or a series of tagged photos in Flickr, as each photo has a location tag (where) and a time-stamp (when). Moreover, an example of passive recording is a credit card's transactions, as each transaction corresponds to a time-stamp and the id of its location. Modern vehicles such as taxis, buses, vessels and aircraft are equipped with a GPS (Global Positioning System) device which enables the reporting of time-stamped locations [157]. Therefore, real-life examples have made moving object and trajectory data mining important. Animal scientists and biologists can study the trajectories of moving animals, for instance to understand their migratory traces, behavior and/or living conditions. Meteorologists, environmentalists, climatologists and oceanographers collect the trajectory data of natural phenomena, as these capture the environmental and climate changes, to forecast weather, manage natural disasters
(hurricanes) and protect the environment. Also, trajectory data can be used in law enforcement (e.g., video surveillance) and in traffic analysis to improve transportation networks. More to the point, the evolution of technology in the domain of mobile devices, in combination with positioning capabilities (e.g., GPS), paved the way for location-based applications such as Facebook and Twitter. Indeed, social media networking has thoroughly changed people's habits in every aspect of their life, from personal and social to professional. A GPS sensor allows users to periodically transmit their location to a Location-based Service provider (active recording) in order to retrieve information about proximate points. However, queries based on location may conceal sensitive information about an individual [118]. Therefore, the privacy and anonymity preserving problem for mobile objects remains an important issue, which concerns us in the context of this work.
The above issues, along with [75], motivated the following research. Specifically, we apply k-NN queries on the trajectory data points of mobile users, with and without clustering. In both methods, mobile users are camouflaged by their k nearest neighbors, which constitute their k-anonymity set. In the clustering case, the trajectory points of all users at each time-stamp are grouped based on K-Means (on-line clustering), and k-NN queries are applied to find the indexes of the k nearest neighbors of each user inside the cluster they belong to. Irrespective of the method used, if the k nearest neighbors' indexes remain the same or vary at a low rate in time, it is difficult for an adversary to discover a mobile user based on history data. We experiment on how this set changes with and without clustering, for different combinations of dimensions (x, y, θ, v), which is the main contribution of this work. We provide an analysis of the effect of the dimensions on the k-anonymity method.
We conclude that when a data set contains a large number of attributes that are open to inference attacks, we face a choice between completely suppressing most of the data and losing the desired level of anonymity.
The rest of this chapter is organized as follows. Section 5.2 describes in detail (a) the clustering and classification problems along with the algorithms used, (b) the system architecture, (c) the problem definition, (d) the system model and adopted methods, (e) the k-anonymity privacy-preserving approach and, finally, the experimental environment and data set sources. Section 5.3 presents previous related works in relation to our approach and records future directions of this work. Finally, Section 5.4 presents the graphical results gathered from the experiments and the conclusions of their evaluation in terms of the studied problem.
5.2 Materials and Methods
5.2.1 Clustering
Clustering is an iterative procedure which groups similar data, is primarily concerned with distance measures and is a fundamental method in data mining. Clustering methods are classified as partition-based, hierarchy-based, density-based, grid-based and model-based. Moving-object activity pattern analysis (i.e., similar motion patterns) and activity prediction are typical application scenarios of trajectory clustering [187]. In our case, clustering is used to organize moving objects into groups so that the members of a group are similar, with great compactness, according to a similarity criterion based on spatio-temporal data. Specifically, for a group of mobile objects we cluster the attributes of their current trajectory points (spatial coordinates, angle, velocity) at specific time-stamps. In other words, we apply on-line clustering.
Algorithm 1: K-Means
1: Input: number of clusters K and training data P
2: Output: a set of K clusters
3: Method: arbitrarily choose K objects from P as the initial cluster centers
4: repeat
5:   assign each object to the cluster to which it is most similar, based on the mean value of the objects in the cluster
6:   update the cluster means, i.e., calculate the mean value of the objects in each cluster
7: until no change
K-Means (Algorithm 1 [194]), which is used in this work, belongs to the partitioning clustering methods and is popular due to its simplicity. It is based on squared-error minimization, and its main advantage is that, in each iteration, only the distances between each point and the K cluster centers are computed. Its time complexity is O(NKt), where N, K and t are the number of data objects, clusters and iterations, respectively. However, K-Means has weak points: the number of clusters K must be known in advance, and its computational cost grows with the number of data observations, clusters and iterations. K-Means and other clustering algorithms use the compactness criterion to assign clusters, which is what concerns us, in contrast with the spectral clustering of [70], which makes use of the spectrum (eigenvalues) of the similarity matrix of the data and examines the connectivity of the data. K-Means is expected to be a good option for exclusive clustering (which is what our study requires), as opposed to Fuzzy C-Means, which assigns each mobile object to different clusters with varying degrees of membership; the latter may give good results for overlapping clusters, but it has much higher time complexity than K-Means [23].
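Algorithm 1 can be sketched as a minimal NumPy implementation. This is an illustrative version, not the exact code used in the experiments; the function name and the convergence test are our own choices.

```python
import numpy as np

def kmeans(P, K, max_iter=100, seed=0):
    """Minimal K-Means (Algorithm 1): returns cluster labels and centers."""
    rng = np.random.default_rng(seed)
    # arbitrarily choose K objects from P as the initial cluster centers
    centers = P[rng.choice(len(P), size=K, replace=False)]
    for _ in range(max_iter):
        # assign each object to the nearest cluster center
        d = np.linalg.norm(P[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update each center as the mean of its members (keep empty clusters fixed)
        new_centers = np.array([P[labels == c].mean(axis=0) if np.any(labels == c)
                                else centers[c] for c in range(K)])
        if np.allclose(new_centers, centers):  # "until no change"
            break
        centers = new_centers
    return labels, centers
```

The O(NKt) cost is visible in the loop: each of the t iterations computes N × K distances.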
Algorithm 2: k-Nearest Neighbor
1: Input: X: training data set, Cl_X: class labels of X, p: testing point to classify
2: Method:
3:   compute the distances d(X_i, p) to every training point X_i and keep the indexes I of the k smallest distances
4:   select the k labels Cl_X(I)
5: return I and the majority class cl_b in Cl_X(I)
5.2.2 Classification
Classification is a supervised machine learning approach and concerns assigning a class label to a new sample on the basis of a training data set whose samples' class membership is known. The principle behind the nearest-neighbor method is to find a number of training samples closest in distance to the new point and predict the class label from these. The number k of nearest neighbors can be a user-defined constant, or vary based on the local density of points (radius-based neighbor learning). The most common distance measure is the Euclidean one. k-Nearest Neighbor (Algorithm 2 [159]) is a non-parametric method and by far the simplest of all machine learning algorithms. Used on its own, k-NN has great computational complexity: classifying a new data point requires calculating the distances between it and all points of the training data set in order to choose the k nearest neighbors. To overcome this issue, we combine it with a clustering method, namely K-Means, which reduces the size of the training sets efficiently, and hence the computational time of k-NN as well. It is worth mentioning that applying k-NN inside a cluster makes no sense if the cluster size is less than the number k of nearest neighbors we are looking for inside it. Hence, the appropriate combination of the parameters K (since K influences cluster size) and k is crucial. Despite the aforementioned advantages, k-NN gives each labeled sample the same importance in classification, in contrast with what a fuzzy classifier considers [64]. Finally, in a recent work [48] the authors describe an efficient method in which kernel fuzzy clustering is combined with the harmony search algorithm for scheme classification.
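Algorithm 2 can be sketched as follows (an illustrative NumPy version; the function name is our own, and ties in the distance ranking are broken by index order):

```python
import numpy as np

def knn_classify(X, cls, p, k):
    """Algorithm 2: return the indexes I of the k nearest training points
    and the majority class cl_b among their labels Cl_X(I)."""
    d = np.linalg.norm(X - p, axis=1)   # distances d(X_i, p) to every training point
    I = np.argsort(d)[:k]               # indexes of the k smallest distances
    labels, counts = np.unique(cls[I], return_counts=True)
    return I, labels[counts.argmax()]   # majority class cl_b in Cl_X(I)
```

The full distance scan over X is exactly the cost the text refers to; restricting X to one K-Means cluster shrinks that scan.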
5.2.3 Useful Definitions
We consider points in a d-dimensional space D. Given two points a and b we define dist(a, b) as the distance between a and b in D. In this work we use the Euclidean distance metric, defined as

dist(a, b) = √( Σ_{i=1}^{d} (a[i] − b[i])² )

where a[i], b[i] denote the values of a, b along dimension i of D.
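The metric above can be written directly as a small helper (illustrative; the name `dist` follows the text's notation):

```python
import math

def dist(a, b):
    """Euclidean distance between two d-dimensional points (the metric used throughout)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
```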
Definition 5.2.1. (k-NN): Given a point b, a data set X and an integer k, the k nearest neighbors of b in X, denoted k-NN(b, X), form a set of k points from X such that ∀p ∈ k-NN(b, X) and ∀q ∈ X \ k-NN(b, X), dist(p, b) < dist(q, b).
Definition 5.2.2. (k-NN Classification): Given a point b, a training data set X and a set of classes Cl_X to which the points of X belong, the classification process produces a pair (b, cl_b), where cl_b is the majority class among the k nearest neighbors of b.
Definition 5.2.3. (Clustering): Given a finite data set P = {a_1, a_2, ..., a_N} in R^d and a number of clusters K, the clustering procedure produces K partitions (clusters) C_1, C_2, ..., C_K of P that minimize

arg min_{C_1,...,C_K} Σ_{c=1}^{K} Σ_{a∈C_c} ‖ a − (1/|C_c|) Σ_{a_j∈C_c} a_j ‖²

where |C_c| is the number of points in cluster C_c.
5.2.4 System Architecture
In this section we briefly describe how the system used to extract the desired data sets operates and produces the trajectory data points. The system architecture is based on the SMaRT (Spatiotemporal Mysql ReTrieval) framework [53] and works as an expansion sub-system that produces new data sets out of sample points stored in a relational database. It exploits the Google Maps API in order to define trajectories between two randomly chosen points that follow road paths over a geographical area of interest. To support this functionality, a class was created by extending the existing framework and providing the corresponding user interface. The data flow of this sub-system, shown in fig. 5.1, follows a three-stage path. Before the process begins there is an initialization phase, during which the database is populated with manually pre-created trajectories from a comma-separated (csv) file through the importing class implemented in the SMaRT framework. The trajectory objects T_i take the form of sequential spatio-temporal points P_n and are stored in the database as tuples of latitude, longitude, time-stamp and objid, where objid is a unique integer identifier of the moving object participating in our analysis, taking values from 1 (the first mobile object) to N (the N-th mobile object). Moreover, latitude and longitude are transformed to the equivalent Cartesian coordinates (x, y) through the Mercator transformation, in order to efficiently calculate measures such as the distance between points and the velocity and angle vectors. After the initialization phase is over, a repeated procedure takes place, as described below.
Fig. 5.1 Data flow diagram
First, two points with a significant Euclidean distance between them are randomly chosen from the relational database. Because of the transformation, every coordinate is measured in meters; the distance between the two points is given by √((x₂ − x₁)² + (y₂ − y₁)²) and should be over a threshold of 10 m. This threshold was chosen heuristically and eliminates the problem of zero velocity and angle calculations. Then an API call (see fig. 5.2) is made to the Google Maps Directions API service, given the points along with route-settings information. This information may specify the way in which the target object moves (pedestrian, car or bicycle), whether it should follow toll roads or not and, most helpfully, whether the response should contain alternative routing paths besides the first proposed one. Through this last attribute, trajectories can be multiplied by obtaining not only the proposed routing path but also the alternative ones. This also mitigates the restriction of 10 service calls per second that the Google service imposes, by introducing a slight delay in the insertion process. Second, the API response, which contains the routing paths following the road network that connects the two reference points, provides the information to the class that constructs the trajectory object. For each point of the trajectory, several calculations take place, such as acquiring the time-stamp t_n of the moving object at this specific point based on the user-defined velocity scale v_n and the Euclidean distance d_n from the previous point. Also, the angle g_n is measured using the coordinates (x_n, y_n) of the specific point and the next point (x_{n+1}, y_{n+1}) of the trajectory. Because of the randomness of the chosen reference points of the API request, the number of points the user specified as the trajectory size differs from that of the API response routing path. If a routing path R_n has more points than the user-defined upper limit, it is shortened accordingly and the endpoint of the final trajectory object T_n becomes the last point in the array. In the opposite situation, the trajectory is discarded and the current thread resets its state in order to make a new request. This procedure ends with the final trajectory object T_n being stored in the database. For the storage procedure, a class from the SMaRT framework is used, which checks all the rules that the database instance implies. When the user-defined number of trajectory objects is reached, the whole process stops with a termination message containing the number of trajectories stored in the database, the time spent on the procedure, the number of tuples inserted and the space used for those trajectory objects, including the overhead of indexing the appropriate fields in the database.
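The per-point calculations (distance d_n from the previous point, time-stamp t_n from the user-defined speed, angle g_n toward the next point) can be sketched as follows. This is an illustrative re-implementation, not the SMaRT class itself; treating the angle of a degenerate single-point trajectory as 0 is our assumption.

```python
import math

def annotate_trajectory(points, speed, t0=0.0):
    """Annotate a list of (x, y) points with a time-stamp derived from a
    user-defined speed and the angle toward the next point."""
    out = []
    t = t0
    angle = 0.0  # assumption: default angle when there is no next point
    for n, (x, y) in enumerate(points):
        if n > 0:
            px, py = points[n - 1]
            d = math.hypot(x - px, y - py)      # Euclidean distance d_n (meters)
            t += d / speed                      # time-stamp t_n from the speed scale
        if n < len(points) - 1:
            nx, ny = points[n + 1]
            angle = math.atan2(ny - y, nx - x)  # angle g_n toward the next point
        out.append((x, y, t, angle))
    return out
```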
Fig. 5.2 API Request Diagram
In the event of failure at any of these stages, all trajectory information is discarded and the process is restarted. Thus, in order to overcome subsequent failures, a parallel methodology is deployed, where a user-defined number of simultaneous calls is instantiated at the beginning of each process.
After each call response a new trajectory is stored in the database and given a new object ID, denoted objid in the following. Calls that failed are restarted with a different set of points. Finally, in terms of space consumption, each point of a trajectory object costs 17 bytes (2 × 4 bytes for the representation of the coordinates (x_n, y_n), 5 bytes for the time-stamp and 4 bytes for the objid) when it is stored as a simple point with the three dimensions stored separately as (x_n, y_n, t_n) plus the objid. When it is stored as a single spatial point with extra fields for the time-stamp and the objid, the cost goes up to 34 bytes (25 bytes for the spatial representation of (x_n, y_n), 5 bytes for the time-stamp and 4 bytes for the objid).
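The byte costs above can be checked and turned into a rough size estimate (illustrative helper; index overhead is deliberately excluded, as the text notes it separately):

```python
def point_cost_bytes(spatial=False):
    """Per-point storage cost in the two layouts described in the text."""
    if spatial:
        return 25 + 5 + 4   # spatial (x, y) representation + time-stamp + objid
    return 2 * 4 + 5 + 4    # two 4-byte coordinates + 5-byte time-stamp + 4-byte objid

def data_size(n_trajectories, points_per_trajectory, spatial=False):
    """Raw data size in bytes, excluding indexing overhead."""
    return n_trajectories * points_per_trajectory * point_cost_bytes(spatial)
```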
5.2.5 Problem Definition
In the context of this work we study the privacy-preserving problem for spatio-temporal databases of N records with d attributes each. The spatio-temporal data are the location data of a number of mobile users along with the time-stamp of each position, as shown in table 5.1. Through the SMaRT system we have at our disposal trajectory data which also give us information about the angle direction and velocity amplitude. Therefore, for each record, i.e., mobile user, we know the values of four attributes. We employ a popular anonymization approach, called k-anonymity, such that no organization or adversary can deduce information about a mobile user's identity by observing its location attributes. Within k-anonymity, attributes can be suppressed, i.e., their values replaced by "*", or generalized [176], until each row is identical with at least k − 1 other rows. At this point the database is said to be k-anonymous and thus prevents database linkages. In our case, we choose to anonymize the location data attributes by employing a classification method which enables us to construct the k-anonymity set of each user per time-stamp. The rationale behind anonymity preservation lies in preserving the k nearest neighbors from one position to another. To this end, we investigate the problem with two approaches. In the first, anonymization is handled as a clustering problem, in which the d-dimensional attribute space is partitioned into homogeneous groups so that each group contains at least k records, the minimum number of records in a cluster needed to satisfy k-anonymity. To achieve this, as a first approach that will be elaborated in the future, we adopt the K-Means clustering method; the k-anonymity set of each user is formed based on the cluster it belongs to. In the second approach, the anonymity set is again formed by the k nearest neighbor indexes, but without partitioning the d-attribute space. The maximum number of clusters is K = ⌊N/k⌋, where N is the total number of records in the data set and k ≪ N is the anonymity parameter for k-anonymization.
Table 5.1 An example of spatio-temporal database for d = 4
objid  time-stamp           timeToNextPoint  x      y      angle  velocity
1      2013-03-09 10:00:01  0                21082  56436   1.23   0
1      2013-03-09 10:00:04  3                21099  56432   1.16   4.5
1      2013-03-09 10:00:11  7                21221  56484   1.51  14.6
1      2013-03-09 10:00:19  8                21331  56524   1.95  11.3
1      2013-03-09 10:00:21  2                21402  56495   0     29.5
2      2013-03-09 10:00:03  0                35587  59829  -2.76   0
2      2013-03-09 10:00:08  5                35568  59782   2.94   7.8
2      2013-03-09 10:00:16  8                35580  59723  -2.07   5.8
2      2013-03-09 10:00:25  9                35530  59668  -1.52   6.4
2      2013-03-09 10:00:34  9                35476  59671  -2.85   4.6
For L trajectory points, corresponding to L time-stamps, we compute for each mobile user i its k nearest neighbor indexes and record them in a vector of the form

knns_{it} = [id_{it1}, id_{it2}, ..., id_{itk}], for t = 1, 2, ..., L.

An example of such sets for N mobile users is shown in table 5.2. For each user we measure how many of the k nearest neighbors remained the same from one position to another.
Table 5.2 k-anonymity sets for N mobile users in L = 5 time-stamps
objid Time Instant knns indexes
1 1 [id111, id112, . . . , id11k]
1 2 [id121, id122, . . . , id12k]
1 3 [id131, id132, . . . , id13k]
1 4 [id141, id142, . . . , id14k]
1 5 [id151, id152, . . . , id15k]
2 1 [id211, id212, . . . , id21k]
2 2 [id221, id222, . . . , id22k]
2 3 [id231, id232, . . . , id23k]
2 4 [id241, id242, . . . , id24k]
2 5 [id251, id252, . . . , id25k] ......
N 1 [idN11, idN12, . . . , idN1k]
N 2 [idN21, idN22, . . . , idN2k]
N 3 [idN31, idN32, . . . , idN3k]
N 4 [idN41, idN42, . . . , idN4k]
N 5 [idN51, idN52, . . . , idN5k]
Definition 5.2.4. (k-anonymity): A spatio-temporal database is k-anonymous w.r.t. a set of d attributes if at most one of the k nearest neighbors has changed from one time-stamp to another, so that each mobile user is not distinguishable from its k − 1 neighbors.
According to the authors of [156], k-anonymity is able to prevent the unveiling of mobile users' identities: the probability of re-identifying a user among its k neighbors is only 1/k. Nevertheless, k-anonymity may not protect users against attribute disclosure. Motivated by this argument, we evaluate the robustness of both approaches by computing how many of the k nearest neighbors remained the same, and the above probability, per time-stamp.
5.2.6 System Model
We consider N mobile users in R². The configuration space, namely the environment in which the objects move, may be free space or a (constrained or unconstrained) road network [187]. In our case, we consider an unconstrained road network, as described in the System Architecture section, where mobile users are densely distributed and do not develop high speeds [53]. We exclude national or international road networks, since there we cannot assume that users move with linear velocity.

Fig. 5.3 A Matlab overview of mobile users' trajectory points

Suppose a collection of trajectories T = {T¹, ..., Tᴺ} (Trajectories Database) of equal length L. Each trajectory consists of a sequence of time-ordered positions that a mobile user goes through as it moves from a start point to a specific destination. It is a vector of the form

T^j = {(x^j_1, y^j_1, t^j_1), (x^j_2, y^j_2, t^j_2), ..., (x^j_L, y^j_L, t^j_L)}.
Each (x^j_i, y^j_i) represents the position (Cartesian coordinates) of mobile user j at time-stamp t^j_i, i.e., point i of its trajectory j [187]. For each point i of trajectory j we define in the 4-dimensional space a vector D^j_i = (x^j_i, y^j_i, θ^j_i, v^j_i), i = 1, 2, ..., L, described by the location coordinates (x, y) and the motion pattern (θ, v), respectively. For the first point of each trajectory, direction and velocity are defined with respect to the point (0, 0).
Algorithm 3: MWCL
1: Input: the number k of nearest neighbors, the number N of mobile users, the vectors D^j_i of the N users in L time-stamps
2: Output: the k nearest neighbor indexes of the N users in L time-stamps
3: for i = 1 : L do
4:   for j = 1 : N do
5:     compute the vector D^j_i of user j at time instant i
6:     apply k-NN between the vector D^j_i and the vectors {D^j_i}_{j=1}^N of all users to find the set I^j_i of k-NN indexes of user j in time-stamp i
7:   end for
8: end for
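A minimal NumPy sketch of Algorithm 3 follows. It is illustrative: the array layout is our own, and excluding a user from its own neighbor set is an assumption the pseudocode leaves implicit. The MCL variant would run the same k-NN step restricted to the rows of the user's K-Means cluster at each time-stamp.

```python
import numpy as np

def mwcl(D, k):
    """Algorithm 3 (MWCL) sketch: D has shape (L, N, d), holding the attribute
    vectors D_i^j of N users over L time-stamps. Returns an (L, N, k) array
    of k nearest neighbor indexes per user and time-stamp."""
    L, N, _ = D.shape
    I = np.empty((L, N, k), dtype=int)
    for i in range(L):
        # pairwise Euclidean distances between all users at time-stamp i
        dist = np.linalg.norm(D[i][:, None, :] - D[i][None, :, :], axis=2)
        np.fill_diagonal(dist, np.inf)          # a user is not its own neighbor
        I[i] = np.argsort(dist, axis=1)[:, :k]  # k nearest neighbors of each user
    return I
```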
We approach the privacy-preserving problem on spatio-temporal databases with two methods. The first is called Method-Without-Clustering (MWCL) and the second Method-with-Clustering (MCL). In the latter, we apply on-line clustering, i.e., at each time-stamp t we group the mobile users into clusters based on their vector values D^j_i at that moment. This vector is formulated depending on which of the attributes x, y, θ, v we choose when applying the K-Means and k-NN algorithms. We define a location-data security metric, already referred to as Vulnerability, which quantifies the robustness of each method. Specifically, it expresses the rate at which the k nearest neighbors of each mobile user change. The Vulnerability of each method is computed as the mean of the Vulnerabilities of all users over the L time-stamps. The less the neighbors' indexes (e.g., objid) change, the lower the values Vulnerability takes.
Algorithm 4: MCL
1: Input: the number k of nearest neighbors, the number N of mobile users, the vectors D^j_i of the N users in L time-stamps
2: Output: the k-NN indexes of the N users in L time-stamps
3: for i = 1 : L do
4:   for j = 1 : N do
5:     compute the vector D^j_i of user j at time instant i
6:     apply k-NN between the vector D^j_i and the vectors {D^j_i}_{j=1}^N inside the cluster C^j_i of user j in time-stamp i, and find the set I^j_i of k-NN indexes
7:   end for
8: end for

Definition 5.2.5. (Vulnerability): Given a mobile user j and the set I^j_i of its k nearest neighbor indexes in time-stamp i, V^j_i is defined as

V^j_i = 1 / |I^j_i ∩ I^j_{i−1}|

where 0 ≤ V^j_i ≤ 1 and |I^j_i ∩ I^j_{i−1}| is the number of the k indexes that remained the same.
Algorithm 5: Vulnerability
1: Input: the sets I^j_i of k nearest neighbor indexes, the number N of mobile users and the time period L
2: Initialization: V^j_1 = 1/k
3: Output: the Vulnerability values of the N mobile users in L time-stamps
4: for i = 2 : L do
5:   for j = 1 : N do
6:     V^j_i = 1 / |I^j_i ∩ I^j_{i−1}|
7:   end for
8: end for
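The Vulnerability computation can be sketched as below. One detail is our assumption: when no neighbor persists at all, the formula would divide by zero, so the sketch reports the worst value 1 in that case (the chapter leaves this case implicit).

```python
def vulnerability(knn_sets, k):
    """Definition 5.2.5 / Algorithm 5 sketch: knn_sets[i][j] is the list of
    k-NN indexes of user j at time-stamp i. Returns V with
    V[i][j] = 1 / |I_i^j ∩ I_{i-1}^j|, initialized to V_1^j = 1/k."""
    L, N = len(knn_sets), len(knn_sets[0])
    V = [[1.0 / k] * N]                  # initialization: V_1^j = 1/k
    for i in range(1, L):
        row = []
        for j in range(N):
            common = len(set(knn_sets[i][j]) & set(knn_sets[i - 1][j]))
            # assumption: if no neighbor persists, report the worst value 1
            row.append(1.0 / common if common else 1.0)
        V.append(row)
    return V
```

The per-method Vulnerability of the text is then the mean of these values over all users and time-stamps.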
5.2.7 Privacy Preserving Analysis
The rising advances in video tracking technology have attracted scientific attention to understanding the social behavior of swarming animals. Video tracking can automatically measure an individual's motion state using videos from different camera views [36]. Swarming behavior is connected with collective behavior, which usually occurs in large groups of animals such as bird flocks, mosquitoes and other insects [4]. Researchers employ mathematical models to simulate and understand swarm behavior. The simplest mathematical models generally assume that individual animals move in the same direction as their neighbors, remain close to them (thus the neighbors remain constant) and avoid collisions with them. In our study, the mobile users constitute the "swarming animals" who exhibit collective motion behavior, whether organized into groups or not, as described in the previous section. It is worth analyzing privacy preservation in the case where mobile objects move randomly (thus independently of each other) from one position to another. In both approaches (whether we apply on-line clustering or not), we observe that the mobile users' nearest neighbor indexes change and thus the Vulnerability of both methods increases. In the case of random motion, the mobile users' behavior is similar to that of a swarm of bees or flies: their direction (angle) is linear and their velocity amplitude is approximately constant. Users in a road network have similar motion characteristics, but they do not move randomly. Generally, a mobile object in a dense road network moves on a piece-wise linear random path with a constant speed.
Let X_{t_i} be a sequence of independent random variables related to the neighbors of a mobile user i that changed within a time interval t_i. To keep Vulnerability at low levels, much less than 1 and close to 0, the users' motion behavior should not change considerably, so that their k nearest neighbors remain the same from one time-stamp to another. This leads to the following theorem.
Theorem 5.2.1. If objects move randomly, the probability that the nearest neighbors do not remain the same within a time interval (0, t_i] and that at least l of them changed, so that users become distinguishable by an adversary, tends to 1:

lim_{l→k} P(X_{t_i} ≥ l) = 1
Proof. We conduct a Poisson experiment. We consider the number of successes resulting from the experiment to be a Poisson random variable with an average number of successes λ, and k a positive integer related to the k-anonymity level.
1. The outcomes of the experiment are discrete. Specifically, they concern the sustainability of the nearest neighbors and are classified as either success (the number of neighbors that remained the same is at least k − 1) or failure (the number of neighbors that remained the same is at most k − 2, i.e., less than k − 1).
2. λ is the average or expected number of successes within a time interval t_i for a mobile user i, E(X_{t_i}), and is assumed to be known and constant throughout the experiment.
3. Poisson describes the distribution of events. Each event is independent of the other events.
4. The probability that a mobile user i has at least l different neighbors within a time interval t_i (the occurrence of a failure) is written as

P(X_{t_i} ≥ l) = 1 − P(X_{t_i} < l), hence

lim_{l→k} P(X_{t_i} ≥ l) = lim_{l→k} ( 1 − λ^l e^{−λ} / l! )
Due to the random-motion assumption, it is highly likely that the indexes of all k nearest neighbors will change. Hence, the above probability will be very close to 1 and, if k is large enough, the limit tends to 1.
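The intuition can be checked numerically with the Poisson tail probability P(X_{t_i} ≥ l) (an illustrative helper; λ values below are arbitrary examples, not measured quantities):

```python
import math

def poisson_tail(lam, l):
    """P(X >= l) for a Poisson variable X with mean lam:
    1 - sum_{j=0}^{l-1} e^{-lam} * lam^j / j!"""
    return 1.0 - sum(math.exp(-lam) * lam ** j / math.factorial(j)
                     for j in range(l))
```

When the expected number of changed neighbors λ is large relative to l, as under random motion, `poisson_tail(lam, l)` stays close to 1, matching the claim of Theorem 5.2.1.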
5.2.8 Experiments Data and Environment
The experimental data used in this chapter come from the SMaRT Database GIS Tool at http://www.bikerides.gr/thesis2/. We experiment on two trajectory data sets of 400 and 2000 bike riders, as shown in fig. 5.3, in the area of Corfu, with 100 trajectory points each. For each trajectory point we have the values of the four dimensions for 100 time-stamps, i.e., the Cartesian coordinates (x, y) and the angle and velocity (θ, v), respectively. The experiments were carried out in an environment with the following characteristics: Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00 GHz, 16 GB memory, Windows 10 Education, 64-bit operating system, x64-based processor and Matlab 2018a.
5.3 Discussion
Over the last decade the problem of privacy preservation for location data has been of particular concern to researchers, and many research works have been conducted to reinforce the security level that earlier approaches provide, such as the association of location data with pseudonyms.
In a recent work [111] the authors propose an asymmetric Private Equality Testing (PET) protocol which allows two users to communicate with each other safely and without the involvement of a third party. PET requires two public-key exponentiations per user and needs three rounds to complete. Both users compute their private input through hash functions and send it to each other. The output of the protocol at the end of the third round indicates whether the two keys are equal and, as a result, whether the users can communicate. This protocol can be used in location-based social services to find the location of users who are in the same region or within a specific radius, depending on the user's preference. The security of the involved parties' private inputs rests on the Computational Diffie-Hellman and Discrete Log problems. Also, the asymmetry does not reveal whether the computation and the equality of the inputs were successful, which prevents observers from identifying whether the connection was established.
More to the point, the k-anonymity method has been used to reinforce and quantify location-data privacy. If k-anonymity is achieved, a person cannot be distinguished from k − 1 other people [118]. In the context of k-anonymity, the authors of [118] propose an enhanced Dummy-Location Selection (DLS) algorithm for users in LBS. From a different perspective, the authors of [156] aim at the utility improvement of differentially private published data sets. More specifically, they show that the amount of noise required to fulfill ε-differential privacy can be reduced if the noise is added to a k-anonymous version of the data set, where k-anonymity is achieved through a specially designed microaggregation of all attributes.
As a result of the noise reduction, the general analytical utility of the anonymized output is increased. Moreover, the authors of [154] attempt to protect users' private locations in location-based services by adopting the spatial cloaking technique, which organizes users' exact locations into cloaked regions satisfying the k-anonymity requirement. They propose a cloaking system model called "anonymity of motion vectors" (AMV) that provides anonymity for spatial queries by minimizing the cloaked region of a mobile user using motion vectors. The AMV creates a search area that includes the objects nearest to the querier who issued a cloaked-region-based query. In addition, in [117] the authors suggest a clustering-based k-anonymity algorithm and optimize it with parallelization. The experimental evaluation shows that the algorithm performs better in terms of information loss due to anonymization; its performance is compared with existing algorithms such as KACA and Incognito.
In our case, we approach k-anonymity preservation as follows. We investigate the impact of the used attributes (x, y, θ, v) on the robustness of the proposed methods, MCL and MWCL. Since, from time-stamp to time-stamp, the number of preserved nearest neighbors may drop below k, the robustness decreases; namely, the probability that a mobile user is identified within its anonymity set may be higher than the optimal value 1/k. The proposed MCL method adopts a simple K-Means clustering microaggregation technique that maintains k-anonymity, which is the aim of this work. However, the proposed approach has some limitations and drawbacks. Firstly, it works for numeric or continuous location data but not for categorical data. Secondly, although there may exist natural relations among attributes, such as between angle and velocity and the Cartesian coordinates, the proposed algorithm cannot incorporate such information to find more desirable solutions. Moreover, the focus is placed heavily on preserving the k nearest neighbors to guarantee mobile users' anonymity. As a result, the algorithm does not address other notions (e.g., l-diversity, t-closeness) which previous works address in order to find more desirable solutions.
In addition, the proposed MCL algorithm tries to minimize the within-cluster sum of squares and maximize the between-cluster sum of squares, so that the number of records in each partition is greater than k. With K-Means there is no guarantee that the optimum is found, so the quality of the resulting anonymized data cannot be guaranteed. Like all greedy algorithms, K-Means reaches a local, but not necessarily a global, minimum. Hence, the information loss is not minimized, although clusters are formed so as to contain at least k similar objects. Furthermore, the MCL algorithm assigns mobile users to the nearest cluster by squared Euclidean distance; using a different distance function may stop the algorithm from converging. Hence, as a first approach, we focused on this version of K-Means. In future work, various modifications of K-Means or other clustering algorithms will be used to investigate their impact on k-anonymity.
Finally, we aim to extend this work to large-scale trajectory data and transfer the whole processing to a distributed computing environment based on Hadoop. Apache Spark is the most promising environment, with high performance in parallel computing, designed to deal efficiently with supervised machine learning algorithms such as k-NN [138]. A future direction is to port the above methods and experiments to Big Data environments [81], [175], [31] and investigate their scalability and running-time performance under different data sets and combinations of the parameter k.
5.4 Results
5.4.1 Experiments Results
In this section, the results of the experiments are presented in figs. 5.4 to 5.9. We relied on real data to evaluate the performance, in terms of Vulnerability, of MWCL and MCL. We experiment on two data sets of sizes N = {400, 2000}, respectively. The parameters of the two experiments and the relevant values are presented in tables 5.3 and 5.4, respectively. The parameter cs refers to the number of clusters and k to
Table 5.3 Parameters for the 1st experiment of N=400 trajectories
cs       k   Clustering attributes   k-NN attributes
2,5      5   x                       x
2,5      5   x, y                    x, y
2,5      5   x, y, θ                 x, y, θ
2,5      5   x, y, θ, v              x, y, θ, v
2,5,10   5   x, y, θ, v              x, ∗, ∗, ∗
the number of nearest neighbors in terms of Euclidean distance. For MCL we consider the attribute combinations shown in both tables; for a fair comparison, in MWCL the k-NN is applied to the same attribute combinations as well. We investigate k-anonymity gradually, adding one new attribute each time. In the first four cases of both experiments and approaches (see figs. 5.4, 5.5, 5.6, 5.7), we observe that the information of attribute x alone is not sufficient to make either method robust enough in terms of nearest neighbor index changes. More to the point, in the
Table 5.4 Parameters for the 2nd experiment of N=2000 trajectories
cs       k      Clustering attributes   k-NN attributes
5,10     15     x                       x
5,10     15     x, y                    x, y
5,10     15     x, y, θ                 x, y, θ
5,10     15     x, y, θ, v              x, y, θ, v
5,10,15  15,30  x, y, θ, v              x, ∗, ∗, ∗
combination (x,y), although the Vulnerability dropped significantly, the usage of attributes (θ,v) did not enhance it further. In real data sets, many dimensions contain high levels of inter-attribute correlation. In this work, by definition, as described in the subsection "System Architecture", the attributes (θ,v) and (x,y) are correlated. This stems from the fact that the non-linear trajectories of the mobile users are approximated as linear ones between
successive time-stamps. Specifically, the velocity components are computed as v_x = (x_{n+1} − x_n)/(t_{n+1} − t_n) and v_y = (y_{n+1} − y_n)/(t_{n+1} − t_n), with magnitude v = \sqrt{v_x^2 + v_y^2}, while the angle is θ = \tan^{-1}((y_{n+1} − y_n)/(x_{n+1} − x_n)). The curse of dimensionality has remained a challenge for a wide variety of algorithms in data mining, clustering, classification, and privacy, and seems to affect the performance of both methods in terms of Vulnerability. The experimental results suggest that the dimensionality curse is an obstacle to privacy preservation. It has been shown that increasing dimensionality makes the data resistant to effective privacy and achieves the lower bound of k-anonymity, i.e., 1/k. However, in practice, we show that some of the attributes of real data can be leveraged in order to greatly ameliorate the negative effects of the curse of dimensionality on privacy. To obtain an even more accurate classification, we considered two more attributes in the computations, the angle and the velocity. However, it is doubtful whether a perfect classification can be obtained by carefully defining a few of these features. In fact, after a certain point, which in our case is the (x,y) attributes, increasing the dimensionality of the problem by adding new features degraded the performance of the k-NN classifier. As shown in the following figures, as the dimensionality increases, the Vulnerability performance improves until the optimal number of features is reached, i.e., 2. Further increasing the dimensionality does not ameliorate the Vulnerability performance. In the following figures, we repeat the same procedure as previously, but we consider a larger data set of N = 2000 trajectories. From the cluster number perspective, the number of clusters cs obviously affects only MCL, where we see that Vulnerability becomes a little worse
Fig. 5.4 Both clustering and k-NN: (a) x and (b) (x,y) for N=400 trajectories, L=100 time-stamps and k = 5.

as the clusters increase. This relates to the fact that not only is the average cluster size reduced, but their composition also changes; thus, MCL becomes more sensitive to the change of the nearest neighbors of mobile users inside the cluster. We focus on the last combination of attributes of both experiments (see Figs. 5.8, 5.9). We observe that attribute suppression, in terms of the k nearest neighbors indexes computation, makes MCL superior in terms of Vulnerability (that is, lower values), which is the main issue in k-anonymity and thus in Privacy Preserving. Moreover, it is the combination which highlights MCL as the number of clusters increases. The fact that clustering is based on all available attributes empowers cluster homogeneity and better reflects real-world communities. More to the point, when the mobile users are camouflaged by k nearest neighbors based on only one of the attributes, in this case x, less information is disclosed about them and their neighbors. Therefore, it is more difficult to break security even if an intruder monitors history data and tries to link the k − 1 public records of nearest neighbors. This case keeps Vulnerability at a relatively high level for low values of k. It is obvious that the more nearest neighbors are used to camouflage the mobile users inside the cluster, the lower the Vulnerability becomes. Also, in terms of computation cost, k-NN is more time-effective when applied to lower-volume (inside-cluster) and lower-dimensional data. Not to mention that we avoid the dimensionality effect in classification and, thus, in k-NN performance. Finally, in Figs. 5.8 and 5.9, we also demonstrate the impact of parameter k on Vulnerability. We observe that the increase of k benefits
Fig. 5.5 Both clustering and k-NN: (a) (x,y,θ) and (b) (x,y,θ,v) for N=400 trajectories, L=100 time-stamps and k = 5.

both methods. This shows that the security of a mobile user is more vulnerable when they are protected by a low number of nearest neighbors.
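The experimental pipeline above can be sketched as follows: derive the (x, y, θ, v) attribute vector of each user, compute each user's k-anonymity set (its k-NN ids) under a chosen attribute subset, and track how often those sets change between time-stamps. This is a hedged proxy for the Vulnerability curves of Figs. 5.4 to 5.9, not the thesis' exact metric; the function names and the change-fraction definition are our assumptions.

```python
import math

def features(prev_pos, pos):
    """(x, y, θ, v) of a user: position plus angle/velocity derived from the
    previous sample (atan2 is the quadrant-safe form of the tan^-1 in the text)."""
    dx, dy = pos[0] - prev_pos[0], pos[1] - prev_pos[1]
    return (pos[0], pos[1], math.atan2(dy, dx), math.hypot(dx, dy))

def knn_set(vecs, j, k, dims):
    """Ids of the k users nearest to user j, using only the attribute
    indexes in `dims` (e.g. (0,) for x, (0, 1) for (x, y))."""
    def d2(u):
        return sum((vecs[u][i] - vecs[j][i]) ** 2 for i in dims)
    return frozenset(sorted((u for u in vecs if u != j), key=d2)[:k])

def vulnerability_curve(snapshots, k, dims):
    """Fraction of users whose k-anonymity set changed at each time-stamp."""
    curve, prev = [], None
    for snap in snapshots:
        cur = {j: knn_set(snap, j, k, dims) for j in snap}
        if prev is not None:
            curve.append(sum(cur[j] != prev[j] for j in cur) / len(cur))
        prev = cur
    return curve
```

Restricting `dims` reproduces the attribute-suppression scenarios of Tables 5.3 and 5.4, e.g. clustering on all attributes but computing the k-NN set on x only.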
Fig. 5.8 Clustering (x,y,θ,v) and k-NN x for N=400 trajectories, L=100 time-stamps for (a) k = 5 and (b) k = 15.
5.4.2 Experiments Conclusions
In conclusion, in the context of this chapter, we carried out research on Privacy Preserving based on real spatio-temporal data. This research work proposes a k-anonymity model based on motion vectors that provides anonymity for spatial queries. Specifically, we investigated the problem of k-anonymity from a dimensionality perspective and how the combination of dimensions affects the Vulnerability of both methods. We observed that inter-attribute combinations, or suppression within a record, have a powerful revealing effect as the dimensionality increases. We demonstrated the effectiveness and efficacy of MCL, under specific combinations of dimensions, through intensive experiments. Finally, anonymization using clustering (based on all attributes), together with attribute suppression in the k-anonymity set computation, is a solution for privacy preserving.

Fig. 5.6 Both clustering and k-NN: (a) x and (b) (x,y) for N=2000 trajectories and L=100 time-stamps.

Fig. 5.7 Both clustering and k-NN: (a) (x,y,θ) and (b) (x,y,θ,v) for N=2000 trajectories and L=100 time-stamps.
Fig. 5.9 Clustering with (x,y,θ,v) and k-NN in x. Figure (a) concerns k=15 while (b) k=30 for N=2000 trajectories and L = 100 time-stamps.
CHAPTER 6
Storage Efficient Trajectory Clustering and k-NN for Robust Privacy Preserving Databases
6.1 Introduction
The research area of moving object databases has become an emerging technological discipline, and has consequently gained a lot of interest during the last decade due to the development of ubiquitous location-aware devices, such as PDAs, mobile phones, GPS-enabled mobile devices, RFID, and road-side sensors. The technological achievements and advances in sensing and communication/networking, along with the innovative design features (thin and light) of computing devices and the development of embedded systems, have enabled the recording of a large volume of spatio-temporal data. Mobile object trajectories are among the wide variety of spatio-temporal data that are especially important to scientists. Actually, they help them in discovering movement patterns (individual or group) and knowledge which, in the recent literature, has been established as trajectory or mobility mining [100]. Also, database technology is evolving to support the querying and representation of the trajectories of moving objects (e.g., humans, animals, vehicles, natural phenomena). Hence, the main parts of trajectory data-mining include pre-processing, data management, query processing, trajectory data-mining tasks, and privacy protection [49]. Real-life applications, such as the analysis of traffic congestion, intelligent transportation, animal migration habits analysis, cellular communications, military applications, structural and environmental monitoring, disaster/rescue management and remediation, Geographic Information Systems (GIS), Location-Based Services (LBS), and other domains have increased the interest in the area of trajectory data-mining and efficient management of spatio-temporal data.

It should be noted that the explosive growth of social media has produced large-scale mobility datasets whose publication puts people's personal lives at severe risk. Indeed, users have become used to sharing their most-visited or potentially sensitive locations, such as their home, workplace, and holiday locations, which are easy to obtain through social media. Nowadays, the amount of spatio-temporal data has been growing exponentially. Therefore, there is an urgent need to develop efficient methods for storing and managing this large amount of information. A plethora of studies have been conducted for handling mobile objects' trajectory data. More precisely, several of them attempt to reduce the storage size [60, 68, 155], while others investigate the privacy preservation of trajectory data [69, 133]. Nowadays, not only are storage-efficient spatio-temporal transformation schemes needed, but also secure querying on large-scale spatio-temporal data [183]. An accurate capture of a moving object's trajectory usually needs a high sampling rate to collect its location data. Thus, massive trajectory data will be generated, which is difficult to fit into memory for utilizing data-mining algorithms. A common idea is to compress the trajectory data to reduce the storage requirements while maintaining the utility of the trajectory. In the context of this work, we present the storage efficiency of the dual methods and experiment on data from the SMaRT system, through which the data of moving object trajectories are generated and used as input to our methods in order to evaluate the security level they offer. More specifically, we summarize the main contributions of this chapter as follows:
1. We compare the proposed methods on addressing k-NN queries on moving objects' trajectory data, which are stored both in the dual and in the native dimensional space. Our implementation shows that the innovative method of Dual Transformation constitutes a practical solution that can provide secure k-NN queries.
2. We conduct an extensive experimental evaluation that studies various scenarios that can affect the vulnerability of the k-NN queries and proceed to a comparative analysis of the underlying methods. We prove the efficiency of our solution using real data drawn from SMaRT.
3. We recall two protocols for Pseudonyms Recovery and Registration with the aim of reinforcing the individuals' privacy in the released data. An individual cannot be re-linked to specific users with a high degree of certainty, as described in Section 6.3.7.
The rest of the chapter is organized as follows: In Section 6.2, previous related works are presented in relation to our approach. The following are described in Section 6.3: (a) the dual transformation methods used; (b) the problem definition; (c) the problem formulation; (d) the privacy-preserving analysis; and (e) the experimental environment and source of the datasets. Section 6.4 presents the graphical outcomes gathered from the experiments, while Section 6.5 evaluates the experimental results in relation to the pros and cons of the proposed methods. Finally, Section 6.6 records the conclusions in terms of the studied problem and future directions of this work.
6.2 Related Work
In this section, we review existing related works in the domain of secure querying on spatio-temporal databases. Our discussion includes privacy-preserving approaches for trajectory-based queries.

In recent years, trajectory databases have constituted an important research area that has received a lot of interest. Most researchers have focused on the querying of moving objects and their trajectories. The so-called trajectory-based queries are also gaining much interest. Queries based on trajectory data require knowledge of the whole, or at least a part, of the mobile objects' trajectory to be processed. Such queries may provide useful information about an object's average speed, travelled distance, and so forth. In [183], three common mechanisms in privacy-preserving trajectory publishing are described. Generalization and suppression are the most common ones used to implement k-anonymity. However, the main drawback of these mechanisms is that they suffer from a high possibility of information loss; thus, perturbation techniques based on randomization (e.g., adding noise) may be utilized as an alternative. Actually, the problem of secure querying on spatio-temporal data in combination with k-anonymity has gained much attention among researchers. Indeed, the authors in [166] describe historical k-anonymity based on each mobile user's trajectory data history, known as Personal History Locations (PHL). According to PHL anonymity, a user U is camouflaged by k − 1 users whose PHLs have a common part with their own, rendering them indistinguishable among them. Privacy preservation is enforced as the generalization method has been applied. More specifically, by trying to preserve historical k-anonymity, the authors increased the uncertainty related to the user's real location data
at the time of the query by modifying the spatio-temporal information of the query. More precisely, in [136], by employing the kl-anonymity privacy model, the authors ensure that an intruder who has knowledge of any sub-trajectory TS of size l of a user's trajectory Tj cannot distinguish it among the k − 1 trajectories that protect it; based on TS, the linking probability is at most 1/k.

In a more recent work [39], the authors investigated the privacy-preserving problem based on real spatio-temporal data. That paper employed the k-anonymity method and formed the anonymity set based on motion vectors with the aim of executing secure spatial k-NN queries. More specifically, the problem of k-anonymity from a dimensionality perspective and the impact of the used dimensions on the vulnerability of the suggested methods was investigated. The experiments presented the effectiveness of the proposed method, such as clustering under particular attribute combinations, and observed that it benefited from attribute suppression during the k-anonymity set computation. The authors in [53] suggested a novel spatio-temporal Mysql ReTrieval framework based on the MySQL and PostgreSQL database management systems. In the context of that work, the authors employed the Hough-X transformation so as to evaluate the efficiency of range queries on nonlinear two-dimensional trajectories of mobile objects. Indeed, they demonstrated that the Hough-X dual approach, in combination with the range-tree variant, was quite efficient.

Generally, the trajectory of a mobile user is non-linear. However, it can be approximated by a discrete number of linear sub-trajectories with the use of a trajectory segmentation application. Each partition is represented by a line segment between two consecutive partition points, and is expected to provide an effective and efficient way to obtain insights into the motion characteristics and behavioral preferences of mobile objects.
Our approach performs low-rate sampling and considers linear interpolation between successive sampled points, where each line segment represents the continuous movement of the object between sampled points. The duality transformation of line segments operates as a pre-processing step and aims at increasing the security level and reinforcing the privacy of k-NN queries, which is the main subject of this work. Also, we have at our disposal the linear components of the initial trajectory, as well as the storage of the first and last spatial points in order to represent each line along with its dual representative, that is, the Hough-X (and/or Hough-Y) dual points. Lastly, this step will turn out to be
useful from a storage perspective in Big Data applications, and will render the proposed methods a strong candidate for efficient querying on massive data, in combination with the appropriate indexing method.
6.3 Materials and Methods
6.3.1 Dual Transform for Moving Objects
In general, the geometric dual transform maps a hyper-plane h from R^m to a point in R^m, and vice versa. In this section, we briefly present how the duality transformation operates in the one-dimensional case. A line from the plane (t, y) or (t, x) is mapped to a point on the dual plane (see Figure 6.1).
1. Hough-X: The equation y(t) = ut + a is mapped to a point (u, a), where axes u, a represent the slope (that is, velocity) and intercept of an object’s trajectory, respectively. Thus, we get the dual point (u, a), the so-called Hough-X transform.
2. Hough-Y: The equation y(t) = ut + a can be rewritten as t = (1/u)y − a/u, a different dual representation, the so-called Hough-Y transform. The point in the dual plane is represented as (b, c), where b = −a/u (the intersection with the line y = 0) and c = 1/u.
It is worth mentioning that the Hough-X transform cannot represent vertical lines, while horizontal lines cannot be represented using the Hough-Y transform. Nonetheless, both transforms are valid since, in our setting, velocity is bounded by [u_min, u_max], and thus lines have a minimum and maximum slope.
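The two transforms above can be sketched directly from their definitions. This is a minimal illustration; the function names and the (t, x) segment input are our own conventions.

```python
def segment_to_line(p1, p2):
    """Slope/intercept (u, a) of the 1-D motion x(t) = u*t + a between two
    samples given as (t, x) pairs."""
    (t1, x1), (t2, x2) = p1, p2
    u = (x2 - x1) / (t2 - t1)   # velocity (slope)
    a = x1 - u * t1             # intercept
    return u, a

def hough_x(u, a):
    """Hough-X: the line y(t) = u*t + a maps to the dual point (u, a)."""
    return (u, a)

def hough_y(u, a):
    """Hough-Y: rewriting t = (1/u)*y - a/u gives the dual point
    (b, c) = (-a/u, 1/u); undefined for horizontal lines (u = 0),
    mirroring the remark in the text."""
    if u == 0:
        raise ValueError("Hough-Y cannot represent horizontal lines")
    return (-a / u, 1.0 / u)
```

Because velocity is bounded away from zero and infinity in our setting, either transform yields a well-defined dual point for every admissible segment.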
Fig. 6.1 An overview of trajectory segmentation and Hough-X transformation for a linear trajectory segment (TS), which consists of M points. The dual points of M points in TS are the same, for example, a1 = ... = aM , u1 = ... = uM , where the left graph shows the y(t) line and the right graph shows the Hough-X points.
6.3.2 kNN Classification and Clustering in Dual Space
Here, we consider points in a dual space P. Given two dual points dp1 and dp2, we define dist(dp1, dp2) as the distance between dp1 and dp2 in P. In the context of this work, we utilize the Euclidean distance metric, which is defined as
dist(dp1, dp2) = \sqrt{\sum_{i=1}^{p} (dp1[i] − dp2[i])^2},

where dp1[i], dp2[i] denote the values of dp1, dp2 along the i-th dimension in P. For example, in Hough-X space, the distance between the dual points dp1 = (u1, a1) and dp2 = (u2, a2) is computed as dist(dp1, dp2) = \sqrt{(u1 − u2)^2 + (a1 − a2)^2}.
Definition 6.3.6. DukNN: Given a dual point dp, a data-set of dual points Y and an integer k, the k nearest neighbors of dp from Y , denoted as DukNN(dp, Y ), is a set of k points from Y such that ∀l ∈ DukNN(dp, Y ) and ∀q ∈ {Y − DukNN(dp, Y )}, dist(l, dp) < dist(q, dp).
Definition 6.3.7. DukNN Classification: Given a dual point dp, a training dual points data-set Y , and a set of classes ClY where the dual points of Y belong, the classification process produces a pair (dp,cldp), where cldp is the majority class to which dp belongs.
Definition 6.3.8. Clustering: Given a finite data-set of dual points DP = {dp1, dp2, ..., dpN} in R^p and a number of clusters K, the clustering procedure produces K partitions of DP such that, among all K partitions (clusters) C1, C2, ..., CK, it finds the one that minimizes

\arg\min_{C_1, C_2, ..., C_K} \sum_{c=1}^{K} \sum_{dp \in C_c} \left\| dp − \frac{1}{|C_c|} \sum_{dp_j \in C_c} dp_j \right\|^2,

where |Cc| is the number of dual points in cluster Cc.
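The objective of Definition 6.3.8 can be evaluated for any given partition as follows; a small sketch (the function name is ours) that is useful for comparing candidate clusterings of dual points.

```python
def wcss(clusters):
    """K-Means objective of Definition 6.3.8: the sum, over all clusters, of
    the squared Euclidean distances of each dual point to its cluster mean."""
    total = 0.0
    for cl in clusters:
        dims = len(cl[0])
        mean = [sum(dp[i] for dp in cl) / len(cl) for i in range(dims)]
        total += sum(sum((dp[i] - mean[i]) ** 2 for i in range(dims))
                     for dp in cl)
    return total
```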
Note that the aforementioned dual methods act as a feature extraction technique. More specifically, they extract the dual point of each of the x, y coordinates of a mobile user's trajectory. The k nearest neighbors algorithm is then applied on the dual point features and returns the dual points whose distance from the query dual point is less than the distance of the rest of the training dual points. Considering the Hough-X transformation of attribute x or y, the search area is a circle centered at the query point, with a radius such that k nearest neighbors exist. If we assume Hough-X of the (x, y) attributes, the k nearest neighbor search area is four-dimensional (ux, ax, uy, ay), with a complex hypercube geometry.
6.3.3 Problem Definition
Here, we consider a database that records the location information of mobile objects in two-dimensional space over a finite area. Also, we assume that objects move with small velocities that lie in the range [umin, umax], starting from a specific location at a specific time-stamp, and that they move along a non-linear trajectory. In order to store and handle queries in an efficient way, a mobile object's trajectory is approximated by a series of linear ones, as depicted in Figure 6.2.
Fig. 6.2 A raw trajectory approximation with a discrete number of R linear sub-trajectories. In the dual space, each one is represented as a dual point; for example, the linear sub-trajectory [l(t0), l(t1)] is represented as the dual point dp1, and the linear sub-trajectory [l(t1), l(t2)] as the dual point dp2.
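A simple way to obtain such a piecewise-linear approximation is a greedy split: extend the current segment while every intermediate sample stays within a tolerance of the chord, and start a new segment otherwise. This is a hedged sketch of the idea, not the thesis' exact segmentation procedure; the function name and the tolerance rule are our assumptions.

```python
def segment_trajectory(points, eps):
    """Greedy piecewise-linear approximation of a 2-D trajectory.
    Returns the indexes of the retained (characteristic) points; every
    dropped point lies within `eps` of the chord of its segment."""
    def dev(p, a, b):
        # perpendicular distance from point p to the line through a and b
        (ax, ay), (bx, by), (px, py) = a, b, p
        num = abs((by - ay) * (px - ax) - (bx - ax) * (py - ay))
        den = ((bx - ax) ** 2 + (by - ay) ** 2) ** 0.5
        return num / den if den else ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    keep = [0]
    start = 0
    for end in range(2, len(points)):
        if any(dev(points[m], points[start], points[end]) > eps
               for m in range(start + 1, end)):
            keep.append(end - 1)   # close the segment at the last valid point
            start = end - 1
    keep.append(len(points) - 1)
    return keep
```

On a perfectly linear run the whole trajectory collapses to its two endpoints, while a sharp turn introduces a characteristic point, matching Definition 6.3.11.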
Definition 6.3.9. A linear trajectory is a straight line that an object keeps track of, starting from a location l(t0) = [x0, y0] at time t0. Then, its location for t > t0 will be l(t) = [x(t), y(t)], or l(t) = [x0 + ux(t − t0), y0 + uy(t − t0)], where u = (ux, uy) is the object's velocity in each plane [53].
Definition 6.3.10. A trajectory partition or sub-trajectory segment is a line segment LiLj, where, for i < j, both points belong to the same trajectory and are connected in order to form a partition denoted by TSi [113].
Definition 6.3.11. Characteristic points are the points where the trajectory changes rapidly.
Definition 6.3.12. The dual points array constitutes a set containing points of a trajec- tory that are represented in the dual space.
Definition 6.3.13. A compressed trajectory path is a subset of the trajectory’s points that indicate a significant change in the motion characteristics, that is, the speed or direction of a moving object.
Definition 6.3.14. Given a trajectory T of size |T| and a compressed trajectory Tc of T with size |Tc|, the Compression Ratio (CR) is |T|/|Tc|.
The authors in [158] claim that the compression ratio constitutes a common metric for evaluating the effectiveness of compression algorithms, as it can accurately reflect the change of a trajectory's data size. It is influenced by the original signal's data-sampling rate, as well as by the quantization accuracy.
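Definition 6.3.14 amounts to a one-line computation; the worked numbers below are illustrative assumptions of ours (not figures from the thesis), showing that storing one dual point per sub-trajectory of M samples yields CR = M.

```python
def compression_ratio(t_len, tc_len):
    """CR = |T| / |Tc| from Definition 6.3.14."""
    if tc_len == 0:
        raise ValueError("compressed trajectory cannot be empty")
    return t_len / tc_len

# Illustrative (assumed) numbers: a 1-D projection of D = 20 linear
# sub-trajectories with M = 50 samples each, stored as one Hough-X
# dual point per sub-trajectory.
D, M = 20, 50
cr = compression_ratio(D * M, D)   # equals M
```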
6.3.4 Problem Formulation
In the context of this study, the problem of privacy preservation when dealing with spatio-temporal databases goes one step further, and is related to the work in [39]. The spatio-temporal data are the location data of a number of mobile users, along with the time-stamp of each position, as shown in Table 5.1. Through the SMaRT system, we have at our disposal offline trajectory data that give us information about the Hough-X, as well as the Hough-Y, transforms of the spatial data (x, y). Hence, for each database record per time-stamp, that is, each mobile user trajectory point, we can consider the values of four attributes
Table 6.1 An overview of the transformed spatio-temporal database.
ObjId  Timestamp            Ux     ax             Uy    ay           bx      wx              by    wy
1      2013-03-09 10:00:01  4.37   22,242,219.9   1.03  4,800,692.9  0.23    −5,093,637.76   0.97  −4,645,833.3
1      2013-03-09 10:00:04  13.4   22,242,156.2   5.83  4,800,651.2  0.075   −1,659,862.40   0.17  −823,641.2
1      2013-03-09 10:00:11  10.58  22,242,287.4   3.79  4,800,713.7  0.0946  −2,103,289.59   0.26  −1,267,515.2
1      2013-03-09 10:00:19  27.3   22,242,427.4   11    4,800,762    0.04    −814,740.93     0.09  −436,432.91
1      2013-03-09 10:00:21  27.3   22,242,427.4   11    4,800,762    0.04    814,740.9       0.09  −436,432.91
2      2013-03-09 10:00:03  2.92   22,256,723.4   7.32  4,804,052.4  0.3425  −7,622,165.55   0.14  −656,291.3
2      2013-03-09 10:00:08  1.15   22,256,709.8   5.75  4,803,996    0.87    −19,353,660.69  0.17  −835,477.56
2      2013-03-09 10:00:16  4.27   22,256,692.6   4.64  4,803,941.2  0.23    −5,216,411.92   0.22  −1,034,341.51
2      2013-03-09 10:00:25  4.6    22,256,639.5   0.23  4,803,925.9  0.22    −4,826,741.21   4.29  −20,588,283.27
2      2013-03-09 10:00:34  1.5    22,256,625.5   5.2   4,803,925.8  0.67    −14,837,750.3   0.19  −923,831.89
(x, y, θ, u) (as in Table 5.1), along with the values of an additional eight attributes (Ux, ax, Uy, ay, bx, wx, by, wy) (as in Table 6.1).

So, we have chosen to anonymize the dual point attributes by employing the k-NN method, which enables us to form the k-anonymity set of each mobile object per time-stamp, as depicted in Table 5.2. The data anonymization is handled both as a clustering and as a no-clustering problem. In both approaches, the anonymity set is again formed by the k nearest neighbors' ids. For each mobile user i and per time-stamp l, we compute its k nearest neighbors' ids and keep them in a vector of the form knns_il = [id_il1, id_il2, ..., id_ilk], for l = 1, 2, ..., L. In Table 5.2, an example of such sets for the dual points of N mobile users is presented. For each user, we measure the number of the k nearest neighbors' dual points that remained the same from one time-stamp to another. By employing the dual transformation methods described in Section 6.3.1, the k-anonymity set of mobile users is formulated based on their dual points. Hence, an alternative definition of k-anonymity is as follows:
Definition 6.3.15. (k_DUST-anonymity). A transformed database record is k-anonymous with respect to the Hough-X dual points, that is, the velocity and intersection attributes (Ux, ax) or (Uy, ay), if at least k − 1 distinct records at the same time-stamp τ have the same dual point attributes, so that no record among the k is distinguished from its k − 1 neighboring records.
Remark 6.3.1. As we already mentioned in [39], k-anonymization intuitively hides each individual among k − 1 others. This means that linking cannot be performed with confidence greater than 1/k. Nevertheless, k-anonymity may not protect users against the unveiling of the dual point attributes.
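Definition 6.3.15 can be checked mechanically: group the transformed records per time-stamp by their Hough-X dual point and require every group to contain at least k records. A minimal sketch under our own assumptions (the function name, the record layout, and the rounding used to absorb floating-point noise are not from the thesis).

```python
from collections import defaultdict

def is_k_dust_anonymous(records, k, nd=2):
    """Check k_DUST-anonymity: at every time-stamp, each Hough-X dual point
    (Ux, ax), rounded to `nd` decimals, must be shared by at least k records.
    `records` is a list of (obj_id, timestamp, Ux, ax) tuples."""
    groups = defaultdict(int)
    for _, ts, ux, ax in records:
        groups[(ts, round(ux, nd), round(ax, nd))] += 1
    return all(count >= k for count in groups.values())
```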
6.3.5 System Model
Here, we consider a spatio-temporal database with N records, that is, N moving objects in the xy plane. Each record (x_i^j, y_i^j) represents the spatial coordinates of the mobile user j at time-stamp t_i^j, or point i of its trajectory j [186]. From the location coordinates (x, y), we can extract the corresponding dual points by employing the methods described in Section 6.3.1. Suppose a trajectories database T = {T^1, ..., T^N} of equal length L, in which each trajectory is represented via a sequence of L triples, that is, T^j = {(x_1^j, y_1^j, t_1^j), (x_2^j, y_2^j, t_2^j), ..., (x_L^j, y_L^j, t_L^j)}.

For each point i in trajectory j, we define in the four-dimensional space a vector DP_i^j = (U_{x_ij}, a_{x_ij}, U_{y_ij}, a_{y_ij}), which denotes the dual points array. Hence, we can redefine and store the trajectory j as T^j = {DP_1^j, DP_2^j, DP_3^j, ..., DP_L^j}.

The privacy preservation of k-NN queries in trajectory databases is addressed with the use of two different methods. The first one is entitled dual-based k-NN (DukNN), which applies k-NN directly onto the dual points, while the second one is called dual-based clustering k-NN (DuCLkNN). The main difference between these two methods lies in the fact that the latter is applied to clustered dual point data. In addition, the operations involved in addressing a k-NN query are thoroughly described in Algorithms 6 and 7, respectively.
Algorithm 6 DukNN
1: input The number of k nearest neighbors
2: input The number of mobile users N
3: input The dual points array of N users in L time-stamps
4: output k nearest neighbors indexes of N users in L time-stamps
5: for i = 1 to L do
6:   for j = 1 to N do
7:     Apply k-NN for the dual points of all users in order to identify the set of k-NN indexes I_i^j of user j in time-stamp i
8:   end for
9: end for
Algorithm 7 DuCLkNN
1: input The number of k nearest neighbors
2: input The number of mobile users N
3: input The dual points array of N users in L time-stamps
4: output k-NN indexes of N users in L time-stamps
5: Apply K-Means to the dual points (Ux, ax) of the N users for the L time-stamps
6: for i = 1 to L do
7:   for j = 1 to N do
8:     Apply the k-NN method between the dual point of user j and the dual points of the users inside the cluster C_i^j of user j in time-stamp i, and find the set of k-NN indexes I_i^j
9:   end for
10: end for
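Algorithms 6 and 7 can be sketched in a few lines of Python. This is a direct, simplified transcription of the pseudocode, not the thesis' implementation; for DuCLkNN, the K-Means labelling of step 5 is assumed to be supplied as an input (`labels`), and the helper names are ours.

```python
def knn_indexes(duals, j, k):
    """Indexes of the k dual points nearest to user j (Euclidean in dual space)."""
    d = duals[j]
    order = sorted((u for u in range(len(duals)) if u != j),
                   key=lambda u: sum((duals[u][i] - d[i]) ** 2 for i in range(len(d))))
    return order[:k]

def dual_knn(dual_points, k):
    """Algorithm 6 (DukNN): k-NN over all users' dual points, per time-stamp.
    dual_points[l][j] is the dual point of user j at time-stamp l."""
    return [[knn_indexes(snapshot, j, k) for j in range(len(snapshot))]
            for snapshot in dual_points]

def dual_cluster_knn(dual_points, labels, k):
    """Algorithm 7 (DuCLkNN): k-NN restricted to each user's cluster.
    labels[j] is the (precomputed) K-Means cluster of user j on (Ux, ax)."""
    result = []
    for snapshot in dual_points:
        snap_res = []
        for j in range(len(snapshot)):
            members = [u for u in range(len(snapshot))
                       if labels[u] == labels[j] and u != j]
            d = snapshot[j]
            members.sort(key=lambda u: sum((snapshot[u][i] - d[i]) ** 2
                                           for i in range(len(d))))
            snap_res.append(members[:k])
        result.append(snap_res)
    return result
```

The clustered variant searches only within-cluster, which is the source of its lower k-NN cost on large N discussed below.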
In the case of employing Algorithm 6 in order to run a k-NN query, we must focus on a specific time-period during which we have at our disposal the dual points of all users' locations. Given that each user stays on the same sub-trajectory during the study period, privacy is preserved in that segment, since the k nearest neighbors remain unchanged. On the other hand, in the case of employing Algorithm 7, the clustering step comes first; we can again claim that the clusters' composition remains the same, since the clustering method is applied in dual space and the mobile users keep the same dual points. As a result, the k nearest neighbors inside each cluster will remain the same. Hence, without loss of generality, in both cases privacy is piecewise preserved, except for the points of discontinuity (known as characteristic points), where the motion characteristics may change.
6.3.6 Vulnerability and Storage Efficiency
In this chapter, we assume that the mobile users' trajectories lie on a real map with small velocities; thus, we use the Hough-X transform, since an object's motion is mapped to the (U, a) dual point. To answer a k-NN query, the following steps are performed:
1. Decompose the k-NN query into 1D queries for the (t, x) and (t, y) projection.
2. For each projection, get the dual k-NN query by using a Hough-X transform.
3. Return the anonymity set, which contains the trajectory ids that satisfy the dual k-NN query in each projection.
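The three steps above can be sketched as follows. We assume each projection's data is already in Hough-X dual form, and we interpret "satisfy the dual k-NN query in each projection" as the intersection of the two per-projection answers; both the interpretation and the function names are our assumptions, not the thesis' exact query engine.

```python
def knn_ids_1d(query_dual, duals_by_id, k):
    """Ids of the k dual points nearest to `query_dual` in one projection,
    where duals_by_id maps a trajectory id to its (u, a) Hough-X point."""
    order = sorted(duals_by_id,
                   key=lambda i: (duals_by_id[i][0] - query_dual[0]) ** 2
                               + (duals_by_id[i][1] - query_dual[1]) ** 2)
    return set(order[:k])

def anonymity_set(query_x, query_y, duals_x, duals_y, k):
    """Steps 1-3: run the dual k-NN in each 1-D projection (t, x) and (t, y),
    then keep the trajectory ids that satisfy the query in both."""
    return knn_ids_1d(query_x, duals_x, k) & knn_ids_1d(query_y, duals_y, k)
```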
In the following, the analysis focuses on the robustness estimation of the proposed approach based on Hough-X. Specifically, the ensuing steps are followed:
1. Split the initial trajectory into a number of linear sub-trajectories, each of which consists of the same number of M spatial points.
2. Apply Hough-X in each part.
Suppose that M is the number of points of the 1D trajectory that a dual point represents, and D is the number of dual points that describe the 1D trajectory projection (t, x) or (t, y) in dual space. Therefore, the whole trajectory has a length equal to DM spatial points, for which M ≫ D should hold. In the following, we camouflage a mobile user who keeps track of a linear trajectory x(t) or y(t), or its corresponding dual point, with the k nearest neighboring dual points, which are very likely to remain the same in the next time-stamp. Actually, while users move on the linear sub-trajectory that relates to the same dual point, the k-NN set will remain intact. Therefore, for as long as this happens, we can claim that k-anonymity holds. Indeed, the privacy preservation is reinforced by a factor M, which brings the so-called vulnerability level to 1/(kM). We recall the spatial data security metric that we have already defined in [39] for the quantification and measurement of the robustness of our methods. The vulnerability remains equal to 1/k in the dual point space. Nonetheless, the vulnerability in the initial dataset is measured as follows. Since the points inside a sub-trajectory are protected by the same dual point, it is obvious that their vulnerability is considerably reduced, to 1/(Mk); this entails that, with probability equal to 1/(Mk), an intruder can distinguish the identity of a mobile user. The same holds for all sub-trajectories. Hence, the vulnerability in each projection is defined as:
V_x = 1/(Mk),
V_y = 1/(Mk),        (6.1)

where V_x and V_y are the vulnerability measures based on Hough-X in the projections (t, x) and (t, y), respectively. Next, the vulnerabilities of the two projections are combined, and the total vulnerability is written as:

V_total = V_x V_y = 1/(Mk)^2,        (6.2)
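As a quick numeric check of Equations (6.1) and (6.2) (with the total vulnerability taken as the product of the two per-projection measures, and the values M = 10, k = 5 chosen purely for illustration):

```python
def vulnerabilities(M, k):
    """Per-projection and total vulnerability, Equations (6.1) and (6.2):
    V_x = V_y = 1/(Mk) and V_total = V_x * V_y = 1/(Mk)^2."""
    v_x = 1.0 / (M * k)
    v_y = 1.0 / (M * k)
    return v_x, v_y, v_x * v_y

# e.g. M = 10 points per dual point and k = 5 neighbors (illustrative values)
vx, vy, vtot = vulnerabilities(10, 5)
```

With these values, each projection yields a vulnerability of 1/50, and the combined level drops to 1/2500, illustrating how the segment length M amplifies the protection offered by k neighbors.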