DEPARTMENT OF COMPUTER ENGINEERING AND INFORMATICS

UNIVERSITY OF PATRAS

DOCTORAL DISSERTATION

EFFICIENT ALGORITHMS FOR BIG DATA MANAGEMENT

ELIAS DRITSAS

SUPERVISOR: SPYROS SIOUTAS, PROFESSOR

PATRAS - AUGUST 2020

DEPARTMENT OF COMPUTER ENGINEERING AND INFORMATICS

UNIVERSITY OF PATRAS

DOCTORAL DISSERTATION

EFFICIENT ALGORITHMS FOR BIG DATA MANAGEMENT

ELIAS DRITSAS

DISSERTATION COMMITTEE:

SPYROS SIOUTAS, PROFESSOR (SUPERVISOR)

CHRISTOS MAKRIS, ASSOCIATE PROFESSOR (COMMITTEE MEMBER)

KONSTANTINOS TSICHLAS, ASSISTANT PROFESSOR (COMMITTEE MEMBER)

GEORGE ALEXIOU, PROFESSOR (COMMITTEE MEMBER)

DIMITRIS TSOLIS, ASSISTANT PROFESSOR (COMMITTEE MEMBER)

IOANNIS TZIMAS, ASSOCIATE PROFESSOR (COMMITTEE MEMBER)

PHOIVOS MYLONAS, ASSOCIATE PROFESSOR (COMMITTEE MEMBER)

ELIAS DRITSAS, AUGUST 2020

ABSTRACT

In the context of the doctoral research, I dealt with data management problems by developing methods and techniques that, on the one hand, maintain or improve the privacy and anonymity of users and, on the other hand, are efficient in terms of time and storage space for large databases. The research results of this work focus on the following:

• Evaluating the performance of queries on a large database, with and without the Bloom filter structure.

• Evaluating job execution time, memory and disk usage for the Privacy Preserving Record Linkage (PPRL) problem in the Hadoop MapReduce framework.

• Methods for answering nearest-neighbor queries on spatio-temporal data (moving users' trajectories) in order to preserve anonymity, where the queries are applied to clustered or non-clustered data.

• The k-anonymity method was used, where the anonymity set with which each moving object of the spatio-temporal database is camouflaged consists of its k nearest neighbors. The robustness of the method was quantified with a probability of 1/k, and the effect of the dimensionality and correlation of the data on the preservation of anonymity and privacy was studied.

• The above method was improved in terms of efficient storage of spatio-temporal data by applying nearest-neighbor queries to Hough-transformed nonlinear trajectories of moving objects. The application of secure k-NN queries was evaluated in the GeoSpark environment.

• Sentiment analysis on Twitter data and tourism demand forecasting in Apache Spark.

Keywords: Bloom Filters, Privacy Preserving, k-NN Queries, k-anonymity, Spatio-temporal Databases, Sentiment Analysis, Twitter, Apache Spark, GeoSpark.

ACKNOWLEDGEMENT

This dissertation signifies the end of an academic journey as a PhD student. At this point, I would like to sincerely thank all those who have supported me all these years.

First and foremost, with immense pleasure and a deep sense of gratitude, I wish to express my sincere thanks to my supervisor Dr. Spyros Sioutas, Professor, University of Patras; without his motivation and continuous encouragement, this research would not have been successfully completed. I would also like to express my sincere gratitude to Prof. Sioutas for his enthusiasm in supervising this work and for his constant support, encouragement and critical suggestions during the writing of this thesis. I am also grateful to my initial supervisor, Prof. Athanasios Tsakalidis, who supervised me during the largest part of my Ph.D. studies and gave me the opportunity to undertake this research thesis. Also, I warmly thank Associate Prof. Christos Makris, who willingly accepted to supervise me during their final part. I express my sincere thanks to Dr. Andreas Kanavos for his kind support and encouragement in several ways throughout my research work.

I wish to extend my profound sense of gratitude to my parents for all the sacrifices they made during my research and for providing me with moral support and encouragement whenever required. Last but not least, I would like to thank my wife Maria Trigka for her constant encouragement and moral support, along with her patience and understanding.

Finally, I would like to acknowledge the support and funding of the current PhD thesis by the General Secretariat for Research and Technology (GSRT) and the Hellenic Foundation for Research and Innovation (HFRI).

Place: Patra

Date: 31/08/2020 Elias Dritsas

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENT
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS

1 Introduction

I Research on Methods and Algorithms for Secure Queries Processing

2 Bloom Filters for Efficient Coupling between Tables of a Database
  2.1 Introduction
  2.2 Bloom Filters Background
    2.2.1 Bloom Filter Elements
    2.2.2 Space-Time Advantages and Constraints
  2.3 Bloom Filters and RDBMS
    2.3.1 Relational Database Management Systems
    2.3.2 Queries Language-SQL
    2.3.3 Indexes Table
  2.4 Experimental Evaluation in SQL Server
  2.5 Conclusions
    2.5.1 Research Conclusions
    2.5.2 Research Constraints
    2.5.3 Future Extensions

3 MapReduce Implementations for Privacy Preserving Record Linkage
  3.1 Introduction
  3.2 Related Work
    3.2.1 PPRL encoding techniques
    3.2.2 Private Indexing
  3.3 MapReduce Framework
  3.4 Performance Evaluation
  3.5 Conclusions

4 Security and Privacy Solutions associated with NoSQL Data Stores
  4.1 Introduction
  4.2 Related Work
  4.3 Comparison of Relational and NoSQL Databases
    4.3.1 Reliability of Transactions
    4.3.2 Scalability Issues and Cloud Support
    4.3.3 Complexity and Big Data Management
    4.3.4 Data Model
    4.3.5 Data Warehouse and Crash Recovery
    4.3.6 Privacy and Security
  4.4 Proposed Security and Privacy Solutions
    4.4.1 Pseudonyms-based Communication Network
    4.4.2 Monitoring, Filtering and Blocking
  4.5 Conclusions

5 Trajectory Clustering and k-NN for Robust Privacy Preserving Spatio-Temporal Databases
  5.1 Introduction
  5.2 Materials and Methods
    5.2.1 Clustering
    5.2.2 Classification
    5.2.3 Useful Definitions
    5.2.4 System Architecture
    5.2.5 Problem Definition
    5.2.6 System Model
    5.2.7 Privacy Preserving Analysis
    5.2.8 Experiments Data and Environment
  5.3 Discussion
  5.4 Results
    5.4.1 Experiments Results
    5.4.2 Experiments Conclusions

6 Storage Efficient Trajectory Clustering and k-NN for Robust Privacy Preserving Databases
  6.1 Introduction
  6.2 Related Work
  6.3 Materials and Methods
    6.3.1 Dual Transform for Moving Objects
    6.3.2 kNN Classification and Clustering in Dual Space
    6.3.3 Problem Definition
    6.3.4 Problem Formulation
    6.3.5 System Model
    6.3.6 Vulnerability and Storage Efficiency
    6.3.7 Privacy Preservation Analysis
    6.3.8 Experimental Data and Environment
  6.4 Results
    6.4.1 Vulnerability Evaluation in Hough Space
    6.4.2 Vulnerability Evaluation in Hybrid Space
  6.5 Discussion
  6.6 Conclusions

7 Trajectory Clustering and k-NN for Robust Privacy Preserving k-NN Query Processing in GeoSpark
  7.1 Introduction
  7.2 Related Work
    7.2.1 Distributed Frameworks for Spatio-Temporal Data Queries Processing
    7.2.2 Efficient Privacy Preserving k-NN Queries
  7.3 Materials and Methods
    7.3.1 Operations on Spatial Data
    7.3.2 The k-NN Classifier from Big Spatial Data Perspective
    7.3.3 Problem Definition
    7.3.4 Problem Formulation
    7.3.5 System Model
    7.3.6 GeoSpark System Overview
  7.4 Results
    7.4.1 Environment and Dataset
    7.4.2 Time Performance of k-Anonymity Set
    7.4.3 Vulnerability Evaluation
  7.5 Discussion
    7.5.1 Performance Issues
    7.5.2 Vulnerability
  7.6 Conclusions and Future Work

II Sentiment Analysis and Tourism Forecasting

8 An Apache Spark Implementation for Graph-based Hashtag Sentiment Classification on Twitter
  8.1 Introduction
  8.2 Related Work
    8.2.1 Sentiment Analysis and Classification Models
    8.2.2 Cloud Computing Preliminaries
  8.3 Sentiment Classification on Twitter
    8.3.1 Tweet-Level Sentiment Classification
    8.3.2 Hashtag-Level Sentiment Classification
  8.4 Spark Implementation
  8.5 Results and Evaluation
  8.6 Conclusions

9 An Efficient Preprocessing Tool for Supervised Sentiment Analysis on Twitter Data
  9.1 Introduction
  9.2 Related Work
  9.3 Tools and Environment
    9.3.1 Twitter
    9.3.2 Publications Mining Tools
    9.3.3 Pre-processing Scheme
    9.3.4 Features
    9.3.5 Topic Modeling
  9.4 Evaluation
  9.5 Conclusions and Future Work

10 An Apache Spark Methodology for Forecasting Tourism Demand in Greece
  10.1 Introduction
  10.2 Related Work
  10.3 Preliminaries
    10.3.1 Forecasting Tourism Methods
    10.3.2 Apache Spark
    10.3.3 Machine Learning Algorithm
  10.4 Implementation
    10.4.1 Methodology
    10.4.2 Dataset Description
  10.5 Experiments - Evaluation
  10.6 Conclusions and Future Work

REFERENCES
LIST OF PUBLICATIONS

Appendices

Appendix A Matlab Code

Appendix B GeoSpark Code

LIST OF FIGURES

2.1 Bloom Filter Overview
2.2 B-Tree overview
2.3 Queries Execution Time vs Records Size
3.1 HLSH under FPS
3.2 PPRL evaluation
3.3 PPRL evaluation
5.1 Data flow diagram
5.2 API Request Diagram
5.3 A Matlab overview of mobile users trajectories' points
5.4 Both clustering and k-NN: (a) x and (b) (x,y) for N=400 trajectories, L=100 time-stamps and k=5
5.5 Both clustering and k-NN: (a) (x,y,θ) and (b) (x,y,θ,v) for N=400 trajectories, L=100 time-stamps and k=5
5.6 Both clustering and k-NN: (a) x and (b) (x,y) for N=2000 trajectories and L=100 time-stamps
5.7 Both clustering and k-NN: (a) (x,y,θ) and (b) (x,y,θ,v) for N=2000 trajectories and L=100 time-stamps
5.8 Clustering (x,y,θ,v) and k-NN x for N=400 trajectories, L=100 time-stamps for (a) k=5 and (b) k=15
5.9 Clustering with (x,y,θ,v) and k-NN in x: (a) concerns k=15 while (b) k=30 for N=2000 trajectories and L=100 time-stamps
6.1 An overview of trajectory segmentation and Hough-X transformation for a linear trajectory segment (TS), which consists of M points. The dual points of the M points in TS are the same, for example a1 = ... = aM, u1 = ... = uM, where the left graph shows the y(t) line and the right graph shows the Hough-X points
6.2 A raw trajectory approximation with a discrete number of R linear sub-trajectories. In the dual dimensional space, each one is represented as a dual point; for example, the linear sub-trajectory [l(t0), l(t1)] is represented as a dual point dp1, and the linear sub-trajectory [l(t1), l(t2)] is represented as a dual point dp2
6.3 Theoretical curve of compression ratio for M = [10 100 1000 10000 100000]
6.4 Both clustering and k-NN: (a) (Ux, ax) and (b) (Ux, ax, Uy, ay) for N = 1000 trajectories, L = 10 time-stamps; (c) (Ux, ax) and (d) (Ux, ax, Uy, ay) for N = 995 trajectories, L = 100 time-stamps
6.5 Clustering with (Ux, ax, Uy, ay) and suppressing k-NN: (a) (Ux, ax, *, *) and (b) (*, *, Uy, ay) for N = 1000 trajectories, L = 10 time-stamps; (c) (Ux, ax, *, *) and (d) (*, *, Uy, ay) for N = 995 trajectories, L = 100 time-stamps
6.6 Clustering with (Ux, ax) and k-NN with (Ux, ax): (a) Mobile User 10 and (b) Mobile User 100 for N = 995 trajectories, L = 50 time-stamps; (c) Vulnerability measure in dual Hough-X and native dimensional space of (x, y)
6.7 (a) Initial points per trajectory and (b) compression ratio for N = 87 trajectories, L = 100 time-stamps. Clustering with (x, y) and k-NN: (c) x and (Ux, ax) and (d) (x, y) and (Ux, ax, Uy, ay) for N = 87 trajectories, L = 100 time-stamps
6.8 Trajectory partition, grouping, and representatives
7.1 An Overview of Continuous Trajectory Point k Nearest Neighbor (CTPkNN) Query
7.2 An Overview of Spatio-Temporal Data Partitioning and Indexing
7.3 An Overview of GeoSpark Layers
7.4 An Overview of 40 Trajectories through Zeppelin
7.5 Time Cost for k-Anonymity Set Computation with or without Indexing for N = 80, 500, 2000 Mobile Objects
7.6 Time Cost for k-Anonymity Set Computation with or without Indexing for 3 Cases of Total Input Data in Executor
7.7 Time Cost for Mobile Objects N = {500, 2000, 8000, 32.000} without Indexing for k = 8
7.8 Spatial PointRDD Data Distribution for 4 Spatial Partition Techniques for 2000 Mobile Objects
7.9 (a) Euclidean Space and (b) Polar Space for N = 500 Trajectories, L = 100 Timestamps
7.10 Hough-X Space of x for (a) k = 5 and (b) k = 10 for N = 500 Trajectories, L = 100 Timestamps
7.11 Hough-X Space of y for (a) k = 5 and (b) k = 10 for N = 500 Trajectories, L = 100 Timestamps
7.12 Vulnerability Performance Comparison in Euclidean and Hough-X Space
8.1 An example of Hashtag Graph Model [169]
9.1 KDD process for knowledge mining from data [94]
10.1 Tourist Arrivals in Greece (2006 - 2015)
10.2 Predictions (2014 - 2018)

LIST OF TABLES

2.1 SQL Queries Execution Time Results vs Data Size
2.2 SQL Queries Execution Time Results vs Data Size
5.1 An example of spatio-temporal database for d = 4
5.2 k-anonymity sets for N mobile users in L = 5 time-stamps
5.3 Parameters for the 1st experiment of N=400 trajectories
5.4 Parameters for the 2nd experiment of N=2000 trajectories
6.1 An overview of the transformed spatio-temporal database
6.2 Parameters for the experiment of using only Hough-X of x and Hough-X of x, y, for N = 1000 trajectories, L = 10 time-stamps (Figure 6.4a,b) and N = 995 trajectories, L = 100 time-stamps (Figure 6.4c,d)
6.3 Parameters for the experiment using Hough-X of x and suppressing Hough-X of y (Exp1) for N = 1000 trajectories, L = 10 time-stamps (Figure 6.5a,b) and using Hough-X of y and suppressing Hough-X of x (Exp2) for N = 995 trajectories, L = 100 time-stamps (Figure 6.5c,d)
6.4 Parameters for the experiment using (x, y) for clustering and Hough-X for k-NN for N = 87 trajectories and L = 100 time-stamps (Figure 6.7c,d)
7.1 The Different Types of Point Resilient Distributed Dataset (RDDs) According to Selected Features
7.2 Trajectory Representation in a PointRDD
7.3 Simulation Parameters
7.4 Time for 80 Mobile Objects without Indexing and with R-Tree Indexing
7.5 Time for 500 Mobile Objects without Indexing and with R-Tree Indexing
7.6 Time for 2000 Mobile Objects without Indexing and with R-Tree Indexing
7.7 Impact of Spatial Partitioning in Time Performance
7.8 Parameters when using N = 500 Trajectories, L = 100 Timestamps
8.1 Performance of Tweet-Level Classifiers
9.1 Datasets Details
9.2 RapidMiner Results - Accuracy
9.3 Accuracy for different Training - Test Set ratios
9.4 10-Fold Cross-Validation
10.1 Data Sources

LIST OF ABBREVIATIONS

SQL Structured Query Language

NoSQL No Structured Query Language

RDBMS Relational Database Management Systems

RSA Rivest–Shamir–Adleman

RL Record Linkage

PPRL Privacy Preserving Record Linkage

HLSH Hamming Locality Sensitive Hashing

CLK Cryptographically Long Keys

k-NN k Nearest Neighbors

MWCL Method Without CLustering

MCL Method CLustering

DUST DUal-based Spatio-temporal Trajectory

kDUST DUal-based k anonymity

DP Dual Point

CR Compression Ratio

DukNN Dual-based k Nearest Neighbor

DuCLkNN Dual-based Clustering k Nearest Neighbor

SkNN Snapshot k Nearest Neighbors

CkNN Continuous k Nearest Neighbors

STPkNN Snapshot Trajectory Point k Nearest Neighbors

CTPkNN Continuous Trajectory Point k Nearest Neighbors

CHAPTER 1

Introduction

The rapid growth and evolution of technology, such as social networks and smart mobile devices, coupled with the large number of users, is leading to an increase in the volume of data on the Internet. In addition, there are other important sources of large amounts of data, such as scientific, telecommunications, banking and business data, whose analysis and management are important for critical decision making. However, the large amount of data, known as Big Data [119]-[112], has caused several problems concerning how to store, process and retrieve it. There are many systems [121], [91], [95] with different architectures that address this challenge, namely the management of massive data.

The aim of this thesis is to develop and implement efficient management and processing algorithms for large scale data. In particular, emphasis is placed on multi-dimensional data classification algorithms. These algorithms are widely used in location-related services/applications (e.g., GPS) to query a database and retrieve the desired information. A popular algorithm in the literature is k nearest neighbors, or k-NN [119]-[122], [112], which uses distance metrics to "answer" user queries to a database. Over the last decade, the vast explosion of data has fueled the development of Big Data management systems and technologies. The most popular solutions have been proposed in centralized environments, whose efficiency is limited for large amounts of data, so searching for distributed solutions is imperative.

Nowadays, digital data are the most valuable asset of almost every organization. Database management systems are considered as storage systems for the efficient retrieval and processing of digital data. However, the effective operation of a relational database, in terms of data access speed, is limited as its size increases significantly [96]. The Bloom filter is a special data structure with finite storage requirements that allows rapid checking of an object's membership in a dataset. It is worth mentioning that the Bloom filter structure has

been proposed with a view to constructively increasing data access speed in relational databases. Since the characteristics of a Bloom filter are consistent with the requirements of a fast data access structure, we examine the possibility of using it in order to increase the SQL query execution speed in a database. In the context of this research, in chapter 2 a database that includes big data tables is implemented in the RDBMS SQL Server, and subsequently the performance enhancement obtained by using Bloom filters, in terms of execution time on different categories of SQL queries, is examined. I experimentally demonstrated the time effectiveness of the Bloom filter structure in relational databases when dealing with large scale data. This investigation was initiated by the study of optimizing query performance [96], [125] in a large database with or without the Bloom filter structure [27], which requires finite storage and allows rapid checking of the existence of an object in a data set. Given the aforementioned Bloom filter features, consideration was given to using it to increase the speed of running SQL queries in a large database.

Furthermore, the Privacy Preserving Record Linkage problem based on Bloom filter encoding techniques is described in chapter 3; these techniques both maintain users' security and permit similarity checking. The research study focused on the problem of Privacy Preserving Record Linkage, known as PPRL [162], [26], due to its wide resonance in the research community for protecting the identity and characteristics of entities associated with records in different databases. This topic was studied in the MapReduce framework in Hadoop, using the Locality Sensitive Hashing indexing technique in the Hamming space (HLSH). The encoding technique based on Bloom filters [144], known as CLK (Cryptographically Long Keys) [22], was utilized due to its security against intruders. Moreover, our study extended to the HLSH/FPS private indexing technique, and we briefly describe four implementations in the MapReduce distributed framework, which is capable of processing large scale data. I also conducted an experimental evaluation of these four versions in terms of job execution time, memory and disk usage.

In addition, in chapter 4 security and privacy issues for NoSQL databases are studied, where security mechanisms and privacy solutions are thoroughly examined. The adoption of Cloud computing and big data management technologies has created an urgent need for specific databases to safely store extensive data along with high availability. Specifically, a growing number of companies have adopted various types of

non-relational databases, commonly referred to as NoSQL databases. These databases provide a robust mechanism for the storage and retrieval of massive data without using a predefined schema. NoSQL platforms are superior to RDBMS, especially in cases when we are dealing with big data and parallel processing, and in particular when there is no need to use relational modeling.

Let us recall that the main objective of the research is the development of the basic knowledge mining algorithms (namely, K-means clustering and k-NN classification) for the processing of high volume spatial data on the basis of enhancing security and protecting privacy. Hence, in the context of this research, in chapter 5 the problem of Privacy Preserving on Spatio-Temporal Databases is studied. In particular, the k-anonymity of mobile users based on real trajectory data is used to quantify privacy. The k-anonymity set consists of the k nearest neighbors. A motion vector of the form (x,y,θ,v) is constructed, where x, y are the spatial coordinates, θ the direction angle and v the velocity of mobile users, and the problem is studied in four-dimensional space. Two approaches are followed. The former applies only the k-Nearest Neighbor (k-NN) algorithm on the whole data set, while the latter combines trajectory clustering, based on K-Means, with k-NN. Unlike previous works, such as [150], [178], which deal with trajectory clustering, the latter approach applies k-NN inside a cluster of mobile users with a similar motion pattern (θ,v). We define a metric, called Vulnerability, that measures the rate at which the k-NNs are varying. This metric varies from 1/k (high robustness) to 1 (low robustness) and represents the probability of the real identity of a mobile user being discovered by a potential attacker. The aim of this work is to prove that, with high probability, the above rate tends to a number very close to 1/k in the clustering method, which means that k-anonymity is highly preserved. Through experiments on real spatial data sets, the anonymity robustness of the proposed method, the so-called Vulnerability, is evaluated.

Bearing in mind the "curse of dimensionality" and its effect on clustering and classification, its impact on the maintenance of privacy has been studied. That is why we have evaluated the impact of the number and the correlation of dimensions on privacy protection for the two approaches, defining a 'Vulnerability' metric, which measures the rate at which the k nearest neighbors of a set of moving users change. The study on real spatial data evaluated the performance of both methods in terms of privacy

preservation for different combinations of the characteristics (x, y, θ, v). Regardless of the method used, if the ids of the k nearest neighbors remain the same or do not change often in time, it is difficult for an adversary to discover a moving user based on historical data.

The need to store massive volumes of spatio-temporal data has become a difficult task, as GPS capabilities and wireless communication technologies have become prevalent in modern mobile devices. As a result, massive trajectory data are produced, incurring expensive costs for storage, transmission, as well as query processing. A number of algorithms for compressing trajectory data have been proposed in order to overcome these difficulties. These algorithms try to reduce the size of trajectory data while preserving the quality of the information. In chapter 6, I focus on both the privacy preservation and the storage of spatio-temporal databases. To alleviate this issue, I focused on the storage-compression problem [157] of spatio-temporal databases. An effective method for spatio-temporal data compression called Dual-based Spatio-temporal Trajectory (DUST) is proposed here, whereby an initial raw trajectory is divided into a number of linear sub-trajectories under the Hough transformation [157], [153], [53], which forms the representatives of each linear component of the initial trajectory; the trajectory is therefore compressed. The Hough transformation breaks down the k-NN query into two one-dimensional queries and allows it to be applied in a smaller space. This helps to bring compression into the data and to enhance the safety of the queries. In particular, even if an intruder has access to the representatives of the trajectory data and tries to reproduce the points of the initial track, the identity of the mobile object remains safe with high probability.

The anonymity set now consists of mobile users who have the same motion pattern based on the Hough-X/Y transformation. This approach is differentiated from previous approaches described in [196]-[128]. A theoretical limit for measuring the sensitivity of the Hough-based methods in the two projections of the x, y dimensions, as well as the overall sensitivity, is computed. In addition, a model of attacks on spatio-temporal databases, and on Hough-transformed ones, which store the trajectory data of a set of moving objects, is studied. The use of a digital pseudonyms protocol with an Identity Provider, known as the Brand protocol, is also recommended, as it enhances the protection of the identity of the database objects against a malicious user. This reinforces the firmness of the k-anonymity method that we

already studied in the previous chapter. To our knowledge, we are the first to study and address k-NN queries on nonlinear moving object trajectories that are represented in dual dimensional space. Additionally, the proposed approach is expected to reinforce the privacy protection of such data. Specifically, even in the case that an intruder has access to the dual points of the trajectory data and tries to reproduce the native points that fit a specific component of the initial trajectory, the identity of the mobile object will remain secure with high probability. In this way, the privacy of the k-anonymity method recommended in [39] is reinforced. Through experiments on real spatial datasets, we evaluate the robustness of the new approach and compare it with the one studied in our previous work.

Privacy Preserving and Anonymity have gained significant concern from the big data perspective. We take the view that forthcoming frameworks and theories will establish several solutions for privacy protection. The k-anonymity is considered a key solution that has been widely employed to prevent data re-identification, and it concerns us in the context of this work. Data modeling has also gained significant attention from the big data perspective. It is believed that the advancing distributed environments will provide users with several solutions for efficient spatio-temporal data management. GeoSpark is utilized in the current work as it is a key solution that has been widely employed for spatial data. Specifically, it works on top of Apache Spark, the main framework leveraged by the research community and organizations for big data transformation, processing and visualization. To this end, we focused on trajectory data representation so as to be applicable to the GeoSpark environment, and a GeoSpark-based approach is designed for the efficient management of real spatio-temporal data. The next step is to gain a deeper understanding of the data through the application of k nearest neighbor (k-NN) queries, either using indexing methods or otherwise. The k-anonymity set computation, which is the main component of privacy preservation evaluation and the main issue of our previous works, is evaluated in the GeoSpark environment. More to the point, the focus here is on the time cost of the k-anonymity set computation along with vulnerability measurement. The extracted results are presented in tables and figures for visual inspection.

The importance of, and general research interest in, methods for processing safe spatial k-NN queries have increased. In this respect, and given the rapid increase in the volume

of spatial data (Big Spatio-temporal Data), it is necessary to assess the time cost of creating the anonymity set of each moving object in the Spark environment. This will help to assess the practical interest in the implementation (and development) of such methods in real-time systems. The GeoSpark environment has been set up to this end. In particular, the configuration of the anonymity set is approached as a whole by Snapshot Trajectory Point kNN (STPkNN) queries, based on the selected descriptors of each trajectory point of a set of moving objects in the respective timestamp. The performance evaluation was selected to be applied in the Apache Spark based GeoSpark because it is designed for processing spatial data.

Although traditional privacy solutions have been designed in Euclidean space, our framework also studies the concept of anonymity in Hough space. Due to the constantly changing location information of moving objects, it is necessary to evaluate a large number of nearest-neighbor queries for a large number of moving objects per timestamp. The k-NN spatio-temporal queries are issued in order to configure the set of moving object ids based on the trajectory points of all objects in each timestamp. Specifically, a k-NN query, which we call the Snapshot Trajectory Point k-NN (STPkNN), is calculated by considering the selected attribute information (e.g., Euclidean coordinates, angle, velocity, dual points) of all objects. Assuming a high sampling rate, we can consider that the process is similar to a Continuous Trajectory Point k-NN. An important feature of the continuous k-NN query in Hough space is that the nearest neighbors between two consecutive space-time points remain the same. Based on this feature, the problem of executing the classical Continuous k-NN queries [37]-[189] can be significantly reduced to the specific space-time points where the velocity of the moving object changes, indicating a new linear sub-trajectory of the original nonlinear track.

In the second part of the research, as included in chapters 8-10, the performance of various classifiers on the Sentiment Analysis problem [142], [97], applied to Twitter data in the Apache Spark environment, was investigated. In addition, a Python text and language pre-processing tool [65] was developed to remove erroneous values and noise in an optimal and efficient manner. A notable feature is the use of emojis and emoticons in the field of sentiment analysis. Supervised machine learning techniques were used to analyze user views. The performance of the classifiers (Naive

Bayes and SVM) was experimentally evaluated under specific parameters, such as the size of the training data and the feature selection methods used (unigrams, bigrams and trigrams), using the k-fold cross validation technique. Finally, the use of a data mining technique based on Decision Trees was studied in Apache Spark, with the aim of forecasting tourism demand [30], taking into account the contribution of the explanatory variables to it. The data set was constructed from public sources, and the predicted (target) variable is the tourist arrivals in Greece for the years 2006 to 2015.
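To make the Vulnerability metric concrete, the following minimal Python sketch computes it for synthetic trajectories. It is only a sketch under stated assumptions: the brute-force k-NN, the random data, and the reading of Vulnerability as the reciprocal of the size of the anonymity set that persists over time are illustrative choices, not the exact implementation of the following chapters.

import numpy as np

def knn_ids(points, target, k):
    # Indices of the k nearest neighbors of `target` (itself excluded).
    dist = np.linalg.norm(points - points[target], axis=1)
    dist[target] = np.inf
    return set(np.argsort(dist)[:k])

def vulnerability(trajs, target, k):
    # trajs has shape (N, L, d): N users, L timestamps, d features.
    # 1/k when the k-NN ids never change; 1 when no neighbor persists.
    persistent = knn_ids(trajs[:, 0, :], target, k)
    for t in range(1, trajs.shape[1]):
        persistent &= knn_ids(trajs[:, t, :], target, k)
    return 1.0 / max(len(persistent), 1)

rng = np.random.default_rng(0)
trajs = rng.random((400, 100, 4))            # toy (x, y, theta, v) trajectories
print(vulnerability(trajs, target=0, k=5))   # between 1/5 and 1

With purely random motion the persistent set empties quickly and the value tends to 1; clustered, correlated motion keeps it near 1/k, which is exactly the behavior the clustering method aims for.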

Part I

Research on Methods and Algorithms for Secure Queries Processing

CHAPTER 2

Bloom Filters for Efficient Coupling between Tables of a Database

2.1 Introduction

Business data, associated with all business activities, are typically stored in relational databases so that they can be managed using the SQL language, and more specifically queried through SQL. Relational databases are particularly effective in their operation. However, their efficiency is limited if they store "big data" with complex correlations [167]. An SQL query can be very expensive in execution cost, and concretely in time and access to resources, if the execution plan is not optimized. Possible delays in the completion of SQL queries may impact the performance of applications using relational databases, thus reducing business performance. The main way to improve the performance of an SQL query is to reduce the number of required operations/calculations that should be performed during the execution of the corresponding query. However, further reduction of the required commands in an SQL query is not always possible, and additional techniques for SQL query performance optimization in a database are required [61]. In [125], the authors investigate this specific problem and recommend the use of IN, EXISTS, EQUAL and OPERATOR-TOP along with indexes. Moreover, the Bloom filter structure is used in databases such as Google Bigtable or Apache HBase in order to reduce the (disk) searching for non-existent records, optimizing in this way the performance of executed SQL queries [24].

Traditional database systems store data in the form of tables with records. Each record corresponds to a different entity object that holds information in a relational table. This relational organization of databases is effective when queries are performed on tables with a small number of records. However, as the number of records increases, e.g. to hundreds of thousands or millions of records, SQL queries usually search in a much larger number of records in order to locate and access a small number of

records or fields [103]. The best way to improve the execution speed of SQL queries in a database is the definition of indexes on fields that are part of the search criteria of an SQL query. When indexes are not set in a database, the database management system operates like a reader trying to find a word in a book by reading the entire book. With an index at the back of the book, the reader can complete the procedure much more quickly.

The benefit of using indexes when searching records in a table becomes greater as the number of table entries increases.^1 The role of indexes in a database is to direct access to records according to the search criteria of the SQL query. However, when a table in a database contains millions of records, despite the use of indexes, the identification of records that meet the search criteria requires accessing thousands of records of the relational table.^2 Therefore, in order to improve the execution speed of relational SQL queries, the advance exclusion of a significant number of records that do not meet the search criteria would be particularly useful. To this purpose, the implementation of the Bloom filter structure is suggested; this structure is built from the records of the tables and is further used for the exclusion of records that do not meet the criteria of the relevant SQL queries.

The purpose of this research is to examine to what extent the Bloom filter structure over tables in relational databases can affect the performance of data access queries for data tables with millions of records. To achieve the aim of this survey, our contributions are the following: (i) implementation of a Bloom filter in a relational database, (ii) experimental evaluation of queries with and without the support of the Bloom filter and tabular recording of the execution time of the queries, and (iii) graphical visualization of the results to show the Bloom filter's effectiveness (in terms of execution time) in executing SQL queries on tables with millions of records.

The rest of the chapter is organized as follows: in Section 2.2 the properties and basic components of Bloom filters are introduced. In Section 2.3, relational databases and the SQL framework are presented. Moreover, Section 2.4 presents the evaluation experiments conducted and the results gathered. Ultimately, Section 2.5 presents conclusions and constraints and draws directions for future work.

^1 http://odetocode.com/articles/237.aspx
^2 http://dataidol.com/tonyrogerson/2013/05/09/reducing-sql-server-io-and-access-times-using-Bloom-filters-part-2-basics-of-the-method-in-sql-server

2.2 Bloom Filters Background

2.2.1 Bloom Filter Elements

The Bloom filter structure, devised by Burton Howard Bloom in 1970, is used to rapidly check whether an element is present in a data set or not [18]. In particular, it permits checking whether an item certainly does not belong to the set. Although Bloom filters allow false positive responses, the space savings they offer outweigh this downside [99]. A Bloom filter is composed of two parts: a set of k hash functions and a bit vector. The number of hash functions and the length of the bit vector are chosen according to the expected number of keys to be added to the Bloom filter and the acceptable error rate per case.^3 A number of important components need to be properly defined in order for a Bloom filter to operate correctly. These parameters are briefly and comprehensively described in the following paragraphs.

2.2.1.1 Hash Functions

A hash function takes as input data of any length and returns as output an ID smaller in length and fixed in size, which can be employed to identify elements.^4 The main features that a hash function should have are the following:

• Return the same value at each iteration with the same data input.

• Quick execution.

• Generate output that is uniformly distributed over the range it can produce.

Some of the most popular algorithms for implementing hash functions are SHA1 and MD5. These functions differ in safety level and hash value calculation speed. Also, some algorithms distribute the generated values homogeneously, but they are impractical. In each case, the selected hash function should satisfy the application requirements. As for the number of hash functions, the larger this number is, the slower the hash values are generated and the faster the binary vector fills up. However, this

^3 https://www.perl.com/pub/2004/04/08/bloom_filters.html
^4 https://blog.medium.com/what-are-bloom-filters-1ec2a50c68ff

Fig. 2.1 Bloom Filter Overview

decision increases the incorrect predictions about the existence of an object in a dataset.^5 The optimal number of hash functions derives from the following formula [99]:

k = (m/n) ln(2)    (2.1)

where m is the length of the binary vector and n is the number of keys inserted in the Bloom filter. When selecting the number of hash functions to be used, we also calculate the probability of false positive predictions. The previous step is repeated until we get an accepted value for the false positive probability [27].
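As a quick illustration of Equation (2.1), the Python snippet below computes the optimal k for a few vector lengths, together with the standard false-positive estimate p = (1 - e^(-kn/m))^k from the Bloom filter literature; the concrete values of m and n are assumptions for the example only.

from math import exp, log

def optimal_hashes(m_bits, n_keys):
    # Equation (2.1): k = (m/n) * ln 2, rounded to an integer >= 1.
    return max(1, round((m_bits / n_keys) * log(2)))

def false_positive_rate(m_bits, n_keys, k):
    # Standard approximation p = (1 - e^{-kn/m})^k.
    return (1 - exp(-k * n_keys / m_bits)) ** k

n = 1_000_000                            # expected number of inserted keys
for m in (4 * n, 8 * n, 16 * n):         # bits spent per key: 4, 8, 16
    k = optimal_hashes(m, n)
    print(f"m/n={m // n:2d}  k={k:2d}  p~{false_positive_rate(m, n, k):.5f}")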

2.2.1.2 Binary Vectors Length

The length of the bit vector of a Bloom filter affects the filter's false positive rate. The greater the length of the binary vector, the lower the probability of false positive responses. Conversely, as the length of the vector is shrunk, this probability increases. Generally, a Bloom filter is considered complete when 50% of the bits in the array are equal to 1. Beyond this point, further addition of objects will result in an increased rate of false positive responses [110].
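The same standard approximations make the 50% observation concrete: the expected fraction of set bits after n insertions is 1 - e^(-kn/m), and the sketch below (with illustrative parameters) shows the false positive rate degrading as insertions approach and then exceed the design capacity.

from math import exp

m, k = 8_000_000, 6                      # vector length in bits, hash count

def fill_ratio(n):
    # Expected fraction of bits set to 1 after n insertions.
    return 1 - exp(-k * n / m)

for n in (500_000, 1_000_000, 2_000_000):
    # At n = 1_000_000 the filter is near the ~50% fill rule of thumb.
    print(f"n={n:>9,}  bits set ~{fill_ratio(n):.0%}  "
          f"false positives ~{fill_ratio(n) ** k:.3%}")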

2.2.1.3 Key Insertion

We initialize a Bloom filter by setting all values of the binary vector equal to 0. To insert a key into the Bloom filter, the k hash functions are first applied, and the positions of the binary vector that correspond to the hash values change from 0 to 1.

^5 https://llimllib.github.io/bloomfilter-tutorial

If the relevant bit is already set to 1, then its value does not alter further.^6 Each bit of the vector can simultaneously encode multiple keys, which makes the Bloom filter compact, as shown in Figure 2.1 [21]. These overlapping values do not permit removing a key from the filter, since it is not known whether the relevant bits were also activated by other keys. The only way to remove a key from a Bloom filter is to rebuild the filter from scratch, without incorporating the key to be removed. To check whether a key may be present in the Bloom filter, the following procedure is applied. Initially, the hash functions are applied to the search key, and then we check whether all the bits indicated by the hash functions are activated. Concretely, if at least one of the bits is disabled, it is certain that the corresponding key is not included in the filter. If all bits are turned on, then we know that, with high probability, the key has been introduced.
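The following self-contained Python sketch mirrors the insertion and membership-check procedure just described. Deriving the k positions from two halves of one MD5 digest (double hashing) is an illustrative choice, not the only valid one.

import hashlib

class BloomFilter:
    def __init__(self, m_bits, k_hashes):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8 + 1)   # all bits start at 0

    def _positions(self, key):
        # k positions derived from two 64-bit halves of one MD5 digest.
        d = hashlib.md5(key.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):                          # set the k relevant bits to 1
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):                 # all k bits on => probably in
        return all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(key))

bf = BloomFilter(m_bits=10_000, k_hashes=6)
bf.add("alice@example.com")
print("alice@example.com" in bf)   # True
print("bob@example.com" in bf)     # almost surely False; never a false negative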

2.2.2 Space-Time Advantages and Constraints

The implementation of a Bloom filter is relatively simple in comparison with other relevant search structures. In addition, the use of a Bloom filter ensures fast membership checking of a value and, moreover, absolute reliability regarding the non-existence of an object in it (no false negatives) [140]. The time required to add a new item or to check whether an item belongs to the data set is independent of the number of elements in the filter.^7 More to the point, a strong advantage of Bloom filters is the storage space saving in comparison with other data structures such as sets, hash tables, or binary search trees. The insertion of an element into a Bloom filter is an irreversible process.^8 The size of the data in a Bloom filter must be known in advance for determining the vector length and the number of hash functions. However, the number of objects that will be imported into a Bloom filter is not always known in advance. It is theoretically possible to define an arbitrarily large size, but this would be wasteful in terms of space and would overturn the main advantage of the Bloom filter, which is storage economy. Alternatively, a dynamic Bloom filter structure could be adopted, which, however, is

^6 https://www.perl.com/pub/2004/04/08/bloom_filters.html
^7 https://prakhar.me/articles/bloom-filters-for-dummies
^8 http://bugra.github.io/work/notes/2016-06-05/a-gentle-introduction-to-bloom-filter

not always possible. There is a variant of the Bloom filter, called the Scalable Bloom filter, which dynamically adjusts its size for a varying number of objects. The use of such a variant could alleviate some of these shortcomings. A Bloom filter cannot produce the list of the items inserted; it can only check whether an item has been introduced into a dataset. Finally, the Bloom filter cannot be used for answering questions about the properties of the objects.

2.3 Bloom Filters and RDBMS

2.3.1 Relational Database Management Systems

Relational database management systems have been a common choice for storing information in databases since the 1980s, used for a wide range of data such as financial and logistics information, personal data, and other forms of information. Relational databases have replaced other forms, such as hierarchical or network databases, as they are easier to understand and more convenient to use. The main advantage of the relational data model is that it allows the user to issue a data access query without the need to define access paths to the stored data or other additional details [98]. Furthermore, relational databases keep their data in the form of tables. Each table consists of records, called tuples, and each record is uniquely identified by a field, i.e. the primary key, which has a unique value. Each table is usually connected to at least one other database table through a relationship of the form: (i) one-to-one, (ii) one-to-many, or (iii) many-to-many.

These relationships grant users unlimited ways of accessing data and dynamically combining it from different tables. Nowadays, the market provides more than one hundred RDBMS systems, and the most popular of them are the following: (i) Oracle, (ii) MySQL, (iii) Microsoft SQL Server, (iv) PostgreSQL, (v) DB2 and (vi) Microsoft Access (DB-Engines 2016), etc.^9

The SQL language is used for user communication with a relational database [137]. An SQL query demands no knowledge of the internal operation of the database or the relevant data storage system [172]. According to the ANSI (American National Standards Institute) standards, SQL is a standard language for relational database management systems. Moreover, the SQL language is used in order to query a database for the

^9 http://db-engines.com/en/ranking/relational+dbms

management of such data, and also for data updates or retrieval from a database. Some examples of relational databases that use SQL are: Oracle, Sybase, Microsoft SQL Server, Access and Ingres. The most important commands of the SQL query language are:^10 SELECT, UPDATE, DELETE, INSERT INTO, CREATE DATABASE, ALTER DATABASE, CREATE TABLE, ALTER TABLE, DROP TABLE, CREATE INDEX, DROP INDEX. The SQL commands are classified into the following basic types:

• Query language, with key command SELECT, for accessing information from the database tables.

• Data manipulation language, with key commands: (i) INSERT, for introducing new records, (ii) UPDATE, for modifying records, and (iii) DELETE, for deleting records.

• Data object definition, with key commands: (i) CREATE TABLE, and (ii) ALTER TABLE.

• Security control of the database, with key commands: (i) GRANT, REVOKE for managing user rights on database objects, and (ii) COMMIT, ROLLBACK for transaction management.

2.3.2 Queries Language-SQL

2.3.2.1 Membership Queries

The SQL IN operator checks whether an expression matches any value from a list of values. Furthermore, it is used in order to avoid multiple uses of the OR operator in SELECT, INSERT, UPDATE or DELETE queries.^11 Besides checking whether an expression belongs to a set of values listed directly in the SQL query, it can also check whether an expression is part of a set of values drawn from other tables.
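As a small runnable illustration, the sketch below shows both forms of IN. SQLite is used only to keep the example self-contained (the experiments of this chapter were run on SQL Server), and the table and column names are invented.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers(id INTEGER PRIMARY KEY, country TEXT);
    INSERT INTO customers VALUES (1,'GR'), (2,'DE'), (3,'GR'), (4,'FR');
    CREATE TABLE eu_focus(country TEXT);
    INSERT INTO eu_focus VALUES ('GR'), ('DE');
""")
# IN with a literal value list, replacing a chain of ORs:
print(con.execute(
    "SELECT id FROM customers WHERE country IN ('GR','FR')").fetchall())
# IN with a set of values taken from another table:
print(con.execute(
    "SELECT id FROM customers "
    "WHERE country IN (SELECT country FROM eu_focus)").fetchall())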

2.3.2.2 Join Queries

Join queries, which combine values from two or more data tables based on a JOIN criterion, usually concern relationships between related tables. More to the point, JOIN queries are distinguished into four categories, illustrated with a short runnable sketch after the list:

^10 http://www.w3schools.com/sql/sql_syntax.asp
^11 https://www.techonthenet.com/sql/in.php

1. Inner Join: returns the values from Table A and Table B that satisfy the joining criteria.

2. Left Join: returns all the values from Table A, together with the values of Table B that meet the joining criteria.

3. Right Join: returns all the values from Table B, together with the values of Table A that meet the joining criteria.

4. Outer Join: returns all the values from Table A and Table B, regardless of whether they satisfy the joining criteria.
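The sketch below demonstrates the first three categories on two toy tables (SQLite syntax for self-containment; older SQLite versions lack RIGHT and FULL OUTER JOIN, so the right join is emulated by swapping the table order).

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a(id INTEGER, name TEXT);
    CREATE TABLE b(id INTEGER, score INTEGER);
    INSERT INTO a VALUES (1,'x'), (2,'y'), (3,'z');
    INSERT INTO b VALUES (2,10), (3,20), (4,30);
""")
# Inner join: only the ids present in both tables.
print(con.execute(
    "SELECT a.id, name, score FROM a JOIN b ON a.id = b.id").fetchall())
# Left join: every row of a; NULL (None) where b has no match.
print(con.execute(
    "SELECT a.id, name, score FROM a LEFT JOIN b ON a.id = b.id").fetchall())
# Right join emulated as b LEFT JOIN a: every row of b.
print(con.execute(
    "SELECT b.id, name, score FROM b LEFT JOIN a ON a.id = b.id").fetchall())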

2.3.2.3 Exist Queries

The existence control queries are used in conjunction with a subquery. The control condition is considered satisfied when the subquery returns at least one relevant record. This verification can be used in the following queries: SELECT, INSERT, UPDATE or DELETE.^12
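A minimal EXISTS example in the same spirit (SQLite, invented schema): the correlated subquery qualifies an outer row as soon as it returns at least one record.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders(customer_id INTEGER);
    INSERT INTO customers VALUES (1,'x'), (2,'y');
    INSERT INTO orders VALUES (1);
""")
print(con.execute("""
    SELECT name FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
""").fetchall())   # [('x',)] -- only the customer that has an order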

2.3.2.4 Top Queries

The TOP command limits the number of records that a query returns, either to a specified number of rows or, from the 2016 version of SQL Server onward, to a specified percentage of records.^13 When the TOP command is used in combination with the ORDER BY command, the first N records are returned according to the sorting arrangement provided by the ORDER BY command. Otherwise, N unsorted records are returned. In addition, the TOP command specifies the number of records returned by a SELECT statement or affected by other statements, such as INSERT, UPDATE, JOIN, or DELETE. The TOP SELECT command can be particularly useful in large tables with thousands of records, since accessing and selecting a large number of records can adversely affect query execution performance.
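TOP itself is SQL Server syntax; the runnable sketch below uses SQLite's equivalent LIMIT clause and shows the SQL Server form as a comment. The table and its values are invented for the example.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales(amount INTEGER)")
con.executemany("INSERT INTO sales VALUES (?)",
                [(v,) for v in (5, 9, 1, 7, 3)])
# SQL Server:  SELECT TOP 3 amount FROM sales ORDER BY amount DESC;
print(con.execute(
    "SELECT amount FROM sales ORDER BY amount DESC LIMIT 3").fetchall())
# [(9,), (7,), (5,)] -- without ORDER BY, 3 arbitrary rows would come back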

^12 https://www.techonthenet.com/sql/exists.php
^13 https://docs.microsoft.com/en-us/sql/t-sql/queries/top-transact-sql

Fig. 2.2 B-Tree overview

2.3.3 Indexes Table

Indexes are auxiliary structures in a relational database management system, with the aim of increasing data access performance in the database. These auxiliary structures are created on one or more fields (columns) of a table of a database. Moreover, an index provides a quick way to search data based on the values of the specific fields that are part of the index. For example, if an index is created on the primary key of a table and a series of data is then searched based on the values of the corresponding fields, SQL Server first finds the value in the index field and subsequently uses the relevant index to quickly locate the whole of the relevant table entries. Without the index field, a scan of the entire table, row by row, would be required, directly influencing the performance of the relevant query execution.^14

Furthermore, an index consists of a set of pages that are organized into a B-tree data structure. This structure is hierarchical, comprising a root node at the top of the tree and the leaf nodes at the lowest level, as illustrated in Figure 2.2. When a query that includes a search criterion is executed, the query starts delving into

^14 https://www.simple-talk.com/sql/learn-sql-server/sql-server-index-basics

the relevant records from the root node and navigates through the intermediate nodes down to the leaf nodes of the B-tree structure. After locating the relevant leaf node, the query accesses the corresponding record either directly (in the case of a clustered index) or through a pointer to the relevant data record (in the case of a non-clustered index). A table in an SQL Server database can have at most one clustered index and more than one non-clustered index, depending on the version of SQL Server that is used.
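The effect of an index can be observed directly through a query planner. The SQLite sketch below is illustrative only (SQL Server exposes the analogous information through its execution plans, and the exact plan text varies with the SQLite version); it shows the same query switching from a full table scan to an index search once the index exists.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(id INTEGER, payload TEXT)")
con.executemany("INSERT INTO t VALUES (?, 'x')", ((i,) for i in range(100_000)))

q = "SELECT payload FROM t WHERE id = 4242"
print(con.execute("EXPLAIN QUERY PLAN " + q).fetchall())  # ... SCAN t ...
con.execute("CREATE INDEX idx_t_id ON t(id)")
print(con.execute("EXPLAIN QUERY PLAN " + q).fetchall())  # ... USING INDEX ...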

2.4 Experimental Evaluation in SQL Server

In this section, the results of the experiments conducted in the context of this research in order to evaluate the use of Bloom filters are presented. We perform a series of common SQL database queries with and without the support of the Bloom filter and graphically present the resulting time performance of the executed SQL queries. The SQL queries utilized are the following: In, Inner Join, Left Join, Right Join, Exists and Top. Tables 2.1 and 2.2, as well as Figure 2.3, show the execution times of the queries described previously. In the corresponding tables, the results are shown with and without the use of the Bloom filter (the label BF denotes its use) as the number of records changes.

Table 2.1 SQL Queries Execution Time Results vs Data Size

Execution Time in seconds

Data Size     In    In BF   Inner Join   Inner Join BF   Left Join   Left Join BF
10.000.000    44    24      44           24              1           1
9.000.000     38    24      41           26              1           1
8.000.000     26    21      26           24              1           1
7.000.000     19    21      19           20              0           1
6.000.000     19    20      19           20              0           1
5.000.000     18    19      19           19              0           1
4.000.000     13    14      13           12              0           1
3.000.000     11    12      12           13              0           1
2.000.000     10    12      11           12              0           1
1.000.000     3     3       3            3               0           1

For all SQL commands, and in particular for small numbers of records, we observed that the adoption of the Bloom filter structure overloaded the system, and thus the execution of the queries without the use of the Bloom filter was much faster.

Table 2.2 SQL Queries Execution Time Results vs Data Size

Execution Time in seconds

Data Size     Right Join   Right Join BF   Exists   Exists BF   Top   Top BF
10.000.000    44           42              43       23          26    6
9.000.000     37           27              38       24          18    6
8.000.000     26           26              25       25          7     6
7.000.000     19           20              18       20          7     5
6.000.000     19           20              19       20          7     5
5.000.000     19           19              19       19          5     5
4.000.000     13           12              12       12          5     5
3.000.000     12           13              11       12          5     5
2.000.000     11           12              12       12          3     3
1.000.000     3            3               2        3           1     2

As the number of table records increases, and especially for values greater than or equal to 8.000.000, the performance advantage offered by employing the Bloom filter structure increases significantly, and the difference in query execution speed becomes evident and grows rapidly. It is important to note that, during the repeated execution of the same queries, we observed the same runtime, but sometimes there was a gap of about two seconds between results. In these cases, we took the average values.

2.5 Conclusions

2.5.1 Research Conclusions

The large response times of SQL queries in relational databases affect not only the users, but also other applications that may run on the same computer, as well as the network itself hosting the relevant database. The Bloom filter is, capacity-wise, an effective solution, and it has been used in numerous applications in the past, especially when immediate checking of an object's membership was required.

The relevant experiments suggest that the inclusion of the Bloom filter structure in an SQL Server database with a large number of records (10.000.000 records) may increase its data access performance. The optimization of query execution time in a database, using the Bloom structure, allows users to quickly extract the needed information

Fig. 2.3 Queries Execution Time vs Records Size

and increase the efficiency of the relevant database. The Bloom structure in a relational database acts as a filter that removes from join, membership or existence control queries the need to access and process records that do not meet the criteria of the relevant queries. The potential profit from this restriction of the records involved in accessing and searching through an SQL query highly depends on the number of false positive records that the Bloom filter check returns. This number is reduced as the length of the binary vector of the Bloom filter is increased.

In this respect, an acceptable execution speed, as well as balanced storage requirements, should be chosen according to the requirements of each database instance and the user requirements concerning the access speed of the relevant database. Especially in cases such as a database containing historical data records with no probability of further record updates, the adoption of the Bloom filter for faster access among numerous relevant tables can be considered a solution that could lead to increased efficiency.

As can be seen from the execution times of the SQL queries (Tables 2.1, 2.2 and Figure 2.3), the benefit of the advance restriction of the records involved in an SQL query is greater than including all the records of the data tables and the indexes used for direct

access to them. It should be noted that, as the experimental measurements show, the application of the Bloom filter structure in a database deserves to be selected only when the number of entries in the relevant tables is very large. Otherwise, the use of the Bloom filter may have the opposite effect, i.e. increase the query runtime.
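To make the pre-filtering idea concrete, the following Python sketch shows an application-side variant of the approach: a Bloom filter built once over the keys of a table discards definite misses before any SQL is executed. It is a sketch only; the experiments of this chapter integrated the filter with SQL Server, whereas the SQLite database, the toy hashing, and the schema here are illustrative assumptions.

import sqlite3

def make_bloom(keys, m=1 << 20, k=4):
    bits = bytearray(m // 8)
    def positions(key):
        return [hash((key, i)) % m for i in range(k)]
    for key in keys:
        for p in positions(key):
            bits[p // 8] |= 1 << (p % 8)
    # Returns "maybe present" (True) or "definitely absent" (False).
    return lambda key: all(bits[p // 8] >> (p % 8) & 1 for p in positions(key))

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE big(id INTEGER PRIMARY KEY, v TEXT)")
con.executemany("INSERT INTO big VALUES (?, 'v')",
                ((i,) for i in range(0, 10_000, 2)))   # even ids only
maybe_in_big = make_bloom(range(0, 10_000, 2))

lookups = [3, 4, 7, 1000, 1001]
survivors = [x for x in lookups if maybe_in_big(x)]    # misses pruned early
if survivors:                                          # only survivors hit SQL
    marks = ",".join("?" * len(survivors))
    print(con.execute(f"SELECT id FROM big WHERE id IN ({marks})",
                      survivors).fetchall())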

2.5.2 Research Constraints

In the evaluation of the Bloom filter, we did not take into account possible delays caused by the maintenance and regular updates of the Bloom filter structure during record updates in the relevant tables. These possible delays could also affect the execution of other SQL queries. Although all experiments were performed on the same machine, those with the Bloom filter that were performed at different times may have been affected (in performance) by possible processes running in the background. These deviations do not directly affect the performance comparison between the same queries with or without the use of the Bloom filter, but mostly the comparison between the different SQL commands used.

2.5.3 Future Extensions

A promising and useful next step would be to investigate the applicability of Bloom filters in other relational database management systems (such as Oracle, Sybase, MySQL) with the aim of generalizing the conclusions drawn from the experimentation on the SQL Server relational database management system. Also, reviewing the actual performance of database operations over the millions of records used to store application data would allow more reliable conclusions about the use of the Bloom filter structure in relational databases, since possible delays of the system during application operation could then be taken into account.

CHAPTER 3

MapReduce Implementations for Privacy Preserving Record Linkage

3.1 Introduction

The rapid evolution of technology and the Internet has created huge volumes of data at very high rates, deriving from commercial transactions, social networks and scientific research. The mining and analysis of this volume of data may be beneficial for humans in crucial areas such as health, economy and national security, leading to more qualitative results. A common problem in data analysis is the record linkage (RL) process, which finds records in a dataset that refer to the same entity across different data sources (e.g., data files, books, websites, and databases) [11],[26],[47],[145]. The purpose of RL is to categorize all possible combinations of records from different databases as similar or dissimilar by using attributes that are not necessarily identifying fields. The RL model requires at least two members that provide their data in the form of tables; a table row corresponds to an entity that is described by the columns. Often, the RL model is simplified to exactly two members that provide the data to be combined (Alice and Bob), with or without the presence of a third member (Carol). The third member undertakes the linkage process and communicates its results to the participating members. Privacy-preserving policies often prevent research into personal data, as organizations are legally and ethically constrained in exchanging sensitive personal data; this leads to datasets that are either free of sensitive personal data or encrypted so as to greatly enhance privacy protection. The privacy requirement during the RL process paved the way to Privacy-Preserving Record Linkage (PPRL) [28],[162],[165]. As in the case of RL, the PPRL process finds pairs of records that refer to the same entity from multiple data sources, where the classification as similar or dissimilar is conducted on encoded data, so as to avoid disclosure of confidential information about the entities involved in the problem. An efficient blocking scheme for PPRL is HLSH/FPS, when combined with Bloom-filter-based encoding. It applies Locality Sensitive Hashing in the Hamming space of the Bloom-filter-encoded records and keeps track of frequently colliding pairs, in order to reduce the number of pairs on which the more rigid similarity comparison is performed. PPRL blocking techniques fall into the batch processing category, and in the Big-Data world one of the most widely used systems for batch processing applications is MapReduce, which is distributed and fault-tolerant. In this chapter, we evaluate the performance of four MapReduce work-flows of the HLSH/FPS blocking scheme for the PPRL framework. The rest of this chapter is organized as follows: Section 3.2 analyzes background knowledge on the encoding techniques based on Bloom filters and on private indexing. Section 3.3 briefly describes the MapReduce framework of the HLSH/FPS implementations. Finally, Section 3.4 presents the experimental evaluation and Section 3.5 presents the conclusions.

3.2 Related Work

3.2.1 PPRL encoding techniques

In this section we describe some Bloom-filter-based encoding techniques that are necessary for the PPRL process.

3.2.1.1 String encoding

The basic idea of this approach is the hashing of the q-grams of a record's string fields into Bloom filters [144]. A Bloom filter [17] is a bit vector of size S combined with a set of K hash functions, each of which maps an element to the position of a bit in the vector. Its objective is to give a quick answer about the membership of an element in a set by checking K positions in the bit vector. The K hash functions H_i(x) may be computed from two independent hash functions as:

H_i(x) = (h_1(x) + i · h_2(x)) mod S    (3.1)

For the h_1(x) and h_2(x) functions, we choose the keyed cryptographic hash functions HMAC-SHA1 and HMAC-MD5 respectively, due to their widespread and efficient implementations on cryptographic platforms.
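The following minimal sketch shows the q-gram string encoding with the double-hashing scheme of Eq. (3.1), using HMAC-SHA1 and HMAC-MD5 as h_1 and h_2. The values of S, K and the HMAC keys are illustrative assumptions; the chapter leaves them as deployment choices shared secretly between the linkage parties.

import hmac
import hashlib

S = 1000                  # bit-vector length (illustrative assumption)
K = 15                    # number of hash functions (illustrative assumption)
KEY1 = b"secret-key-1"    # HMAC keys shared by the linkage parties (assumed)
KEY2 = b"secret-key-2"

def qgrams(text, q=2):
    """Split a string field into overlapping q-grams, padded at the edges."""
    padded = "_" * (q - 1) + text.lower() + "_" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def encode_string(text, q=2):
    """Bloom-filter string encoding via Eq. (3.1):
    H_i(x) = (h1(x) + i * h2(x)) mod S, with h1 = HMAC-SHA1, h2 = HMAC-MD5."""
    bits = [0] * S
    for gram in qgrams(text, q):
        h1 = int.from_bytes(hmac.new(KEY1, gram.encode(), hashlib.sha1).digest(), "big")
        h2 = int.from_bytes(hmac.new(KEY2, gram.encode(), hashlib.md5).digest(), "big")
        for i in range(K):
            bits[(h1 + i * h2) % S] = 1
    return bits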

3.2.1.2 Record encoding

This method encodes entire records rather than single strings, for which the previous method is used. Each record consists of fields such as name, username, age, address, etc. As the PPRL process aims at protecting such data, it is necessary to encode the values of the selected fields of all the table's records. To this end, we suggest an encoding method based on a Bloom filter, for which values must be pre-selected for the involved elements and Bloom filters, such as the average number of q-grams. Three different approaches for record encoding using Bloom filters are described below. FBF (Field-level Bloom Filters) encoding [42], [41], [144] is the simplest extension of the string encoding with Bloom filters: the field values of a record are encoded in separate Bloom filters, which are then composed into a larger Bloom filter that encodes the entire record. In brief, the encoding steps are:

1. For a selected Q value (the q-gram length), calculate the average number of q-grams g of each record over the fields that will participate in the PPRL process, and from it the appropriate Bloom filter size S_FBF.

2. From each string field, extract the q-grams for the selected Q value.

3. Encode the extracted q-grams in S_FBF-sized Bloom filters using fragmentation components.

4. Combine the Bloom filters produced for each field into a larger one, with a predetermined order of concatenation.

FBF encoding comes in two variants, FBF/static and FBF/dynamic. The first requires the definition of Q, K and S_FBF in order to encode the records, while the second, for a given Q value, imposes an initial preconditioning step that calculates the average number of q-grams g, from which the appropriate Bloom filter size S_FBF is derived. The basic idea of CLK encoding [145] is the use of a single large Bloom filter of size S to encode all fields of a record, using the q-grams produced from each field for a selected Q value and K hash functions. The encoding steps are:

1. Extract the q-grams from each field for the selected Q value.

2. Take the union of all produced q-gram sets of the values to be encoded.

3. Place the unioned q-grams in an S-sized Bloom filter using K hash functions.

Since CLK encoding places common q-grams from different fields at the same K locations in the Bloom filter, it is difficult for an attacker to recover either the encoding parameters or the original field values. On the other hand, this peculiarity of CLK encoding, i.e. q-grams shared between fields, can lead to incorrect results in the similarity check. For example, the names “James Johnson” and “John Jameson”, while dissimilar, may be judged similar by a similarity check over their CLK encodings. The RBF (Record-level Bloom Filter) encoding is based on FBF and attempts to enhance privacy protection in the PPRL process by introducing additional parameters and information into the encoding steps [42], [41]. Initially, it encodes the values of the fields in separate Bloom filters and subsequently draws a random set of bits from each one so as to compose a larger Bloom filter. Finally, it applies a random bit permutation to the larger Bloom filter, with the RBF encoding being the result of this rearrangement. With regard to the number of bits to be selected for the encoding of each field, we consider two ways of calculating it, uniform and weighted. Uniform selection draws an equal, or approximately equal, number of bits S_f from the FBF encoding of each field. The weighted way selects more or fewer bits for some of the fields; more specifically, in [42], [41], the weighted choice is based on the importance of each field in the linkage process. In order to discover the significance of the fields, the probabilities m and u of the Fellegi-Sunter probability model are used: the agreement and disagreement weights as well as their range are calculated, and the normalized percentage of the range of each field is derived. In this way, each field contributes a percentage w_i to the final Bloom filter. The size of the final Bloom filter, S_RBF, is derived from the w_i percentage that maximizes that size.
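As an illustration of the CLK steps above, the sketch below unions the q-grams of all fields into a single S-bit Bloom filter. It is self-contained, with illustrative parameter values assumed; it is a sketch of the technique, not the exact implementation used in the experiments.

import hmac
import hashlib

def clk_encode(record_fields, q=2, size_s=4096, k=15,
               key1=b"secret-key-1", key2=b"secret-key-2"):
    """CLK sketch: hash the union of all fields' q-grams into a single
    S-bit Bloom filter with K double-hashed positions per q-gram."""
    grams = set()
    for value in record_fields:                 # steps 1-2: extract and union
        padded = "_" * (q - 1) + value.lower() + "_" * (q - 1)
        grams |= {padded[i:i + q] for i in range(len(padded) - q + 1)}
    bits = [0] * size_s                         # step 3: one S-sized filter
    for gram in grams:
        h1 = int.from_bytes(hmac.new(key1, gram.encode(), hashlib.sha1).digest(), "big")
        h2 = int.from_bytes(hmac.new(key2, gram.encode(), hashlib.md5).digest(), "big")
        for i in range(k):
            bits[(h1 + i * h2) % size_s] = 1
    return bits

# "James Johnson" and "John Jameson" share most of their q-grams, so their
# CLK encodings overlap heavily -- the false-similarity case noted above.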

3.2.2 Private Indexing

The goal of indexing in PPRL is to substantially reduce the pairs of encoded records to be tested through similarity checks. In this setting, the third member (Carol) has little information about the data encoding of Alice and Bob. In this direction, we discuss HLSH indexing. HLSH (Hamming Locality Sensitive Hashing) indexing [42],[92] is used for partitioning private records encoded in binary form of length S. Let T_l, l = 1, ..., L, be a set of L independent hash tables consisting of dynamic sets of key-values. Each hash table T_l uses a set of K hash functions h_l^k, each of which returns the value of a randomly selected bit of the binary encoded rows in the table. The values of the K functions form a key for the encoded records and are gathered from Alice's and Bob's encoded sets A′ and B′. Entries from the two sets are stacked under the same key of a hash table T_l, thus suggesting a possibly identical pair if they match in the K bits. The values of the Id fields of the encoded records can subsequently be used to finally form the possibly identical pairs. Let the encoded records r_A ∈ A′ and r_B ∈ B′ consist of an Id and the fields Bf_A and Bf_B respectively. In addition, let θ be a selected value serving as a limit for the Hamming metric, calculated as d_H = |Bf_A ⊕ Bf_B|. We consider a family H of hash functions having the following property:

if d_H ≤ θ then Pr[h_l^k(Bf_A) = h_l^k(Bf_B)] ≥ p_θ    (3.2)

k = 1, 2, ..., K,  l = 1, 2, ..., L,  p_θ = 1 − θ/S    (3.3)

A suitable value for the number of hash functions K can be computed empirically,

as the accuracy of the method is mainly based on the number of tables, L_opt. Generally, this value should create enough buckets so that the number of linked lists for the pairs of records stays low; for larger values, more identical entries appear among the pairs of records. The formation of a pair of identifiers or encoded entries {r_A, r_B} during HLSH in one of the T_l tables is called a collision. The method is redundant, so a pair can occur in C = 1, ..., L_opt hash tables. A pair {r_A, r_B} with C = L_opt collisions is with high probability similar, and intuitively one can argue that as the number of collisions increases, it becomes more likely that the records are the same.

Fig. 3.1 HLSH under FPS
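A compact single-machine sketch of HLSH bucketing with frequent-pair counting may clarify the scheme. The record encodings are assumed to be Python dicts mapping record ids to bit lists (Alice's A′ and Bob's B′), and the parameter defaults are illustrative.

import random
from collections import defaultdict
from itertools import product

def hlsh_candidate_pairs(enc_a, enc_b, s, k=30, l=10, c_f=4, seed=1):
    """HLSH sketch: L hash tables, each keyed by K randomly chosen bit
    positions of the S-bit encodings; pairs colliding in >= c_f tables
    become candidates for the expensive similarity comparison (FPS)."""
    rng = random.Random(seed)
    collisions = defaultdict(int)              # (id_a, id_b) -> #collisions C
    for _ in range(l):
        positions = rng.sample(range(s), k)    # the K functions h_l^k
        table = defaultdict(lambda: ([], []))
        for rid, bits in enc_a.items():
            table[tuple(bits[p] for p in positions)][0].append(rid)
        for rid, bits in enc_b.items():
            table[tuple(bits[p] for p in positions)][1].append(rid)
        for ids_a, ids_b in table.values():    # same bucket => collision
            for pair in product(ids_a, ids_b):
                collisions[pair] += 1
    return [pair for pair, c in collisions.items() if c >= c_f]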

3.3 MapReduce Framework

HLSH combined with the Frequent Pair Scheme (FPS) [93] can lead to fast and efficient record linkage by checking the similarity only of frequently colliding pairs. We present four MapReduce implementations of the HLSH/FPS method for different sizes of the encoded record sets of Alice (A′) and Bob (B′), of the hash tables T_l, and of the set of candidate IDs R_Ids. We consider that set B′ is smaller than set A′, so it is chosen for the initial creation of the hash tables T_l (Figure 3.1). The use of HLSH/FPS allows the implementation of an effective system with a relatively low memory footprint. Our investigation focuses on memory saving and suggests four different versions of the HLSH methodology, namely v0, v1, v2 and v3. Each version is based on assumptions about the sizes of the problem's data structures and progressively “transfers” these structures from the slow disk to the faster memory of the Mapper/Reducer. We assume that every Mapper or Reducer in a MapReduce task has a fixed memory limit m_task that can be committed by YARN. Each of the four versions consists of 2 or

3 different MapReduce jobs, each of which in turn consists of a number of tasks (Mappers or Reducers) depending on the HDFS size of the problem and on user settings. Version v0 is characterized by memory saving when performing the job, but is very expensive in disk use, and is especially suitable for Apache YARN environments with low memory availability for the tasks of a MapReduce job. It assumes that Alice's and Bob's encoded records and the tables T_l are so large that they cannot be kept in the limited task memory. All pairs of identifiers from the HLSH process are formed and, subsequently, the ones that appear at least C_f times are stored again in HDFS, to be loaded into the memory of the last job, which undertakes the linkage of the proposed IDs based on the identifiers. This approach, in addition to requiring multiple MapReduce tasks, can be considered the most naive, as it materializes all pairs of identifiers that can be derived from the HLSH process. On the other hand, it is the version that uses the least memory in Mapper/Reducer tasks according to the experimental results. Version v1 allows more relaxed conditions for the committed memory of the tasks to be performed. We assume that the set of T_l tables fits into the memory of each task in a MapReduce job. With this important information in mind, we can perform HLSH/FPS by storing exclusively the frequently colliding pairs in HDFS. In the last two versions, we also assume that the records of the smaller set B′ can be stored as a whole in the m_task memory of the Mappers and Reducers. In both versions, the first job handles the creation and storage in HDFS of the hash tables T_l of that record set. In the second job, the two versions differ in the use or non-use of the Reduce phase.
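The v0-versus-FPS distinction can be sketched as plain map and reduce functions. This is a single-machine analogue of the Hadoop jobs, not actual MapReduce API code; the bucket contents are assumed to arrive as (Alice-ids, Bob-ids) pairs.

from collections import defaultdict

def map_collisions(bucket):
    """Map phase (v0 style): emit ((id_a, id_b), 1) for every Alice x Bob
    co-occurrence in an HLSH bucket; v0 writes all of these to HDFS."""
    ids_a, ids_b = bucket
    for a in ids_a:
        for b in ids_b:
            yield (a, b), 1

def reduce_frequent_pairs(emitted, c_f):
    """Reduce phase: sum the collision counts per pair and keep only the
    frequent ones (the FPS criterion C >= C_f)."""
    counts = defaultdict(int)
    for pair, one in emitted:
        counts[pair] += one
    return {pair for pair, c in counts.items() if c >= c_f}

# Single-machine analogue of the shuffle: buckets from all L tables feed
# the reducer keyed by pair.
buckets = [(["a1", "a2"], ["b1"]), (["a1"], ["b1", "b2"]), (["a1"], ["b1"])]
emitted = (kv for b in buckets for kv in map_collisions(b))
print(reduce_frequent_pairs(emitted, c_f=3))    # {('a1', 'b1')}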

3.4 Performance Evaluation

The evaluation of the four schemes is conducted by considering CLK encoding for the PPRL process with S = 4096, under the following settings for the parameters δ, C_f, L_Cf and K. Here, δ is the confidence parameter defining the likelihood that pairs which are actually the same are not matched in the tables; this value is usually low, indicatively δ = 0.01.

Fig. 3.2 PPRL evaluation

δ         C_f   L_Cf   K
0.001     4     52     30
0.0001    6     74     30
0.00001   7     91     30
0.000001  9     114    30

In the first panel of Figure 3.2, the simulation results of the four versions of HLSH/FPS are shown. The fastest in all cases is v3, while v0 has the largest footprint on disk, since it writes to HDFS all pairs of identifiers derived from HLSH. We also observe that for the highest value of δ the total memory footprint of the jobs is also large; but as δ decreases, and the HLSH parameters change accordingly, memory consumption decreases even as the number of candidate records for comparison increases. Regarding the other versions, as the philosophy of the FPS strategy is utilized, the execution times are only slightly affected by the change. The disk footprint of v1 is slightly affected, and the same holds for v2 and v3.

Fig. 3.3 PPRL evaluation

However, it is evident that as L_Cf increases, while the number of candidate records to be compared is reduced, the memory footprint of all versions except v0 grows faster. As the number of records of B′ remains constant, this increase corresponds to the growth of the hash tables T_l. We then repeat the same procedure for the two largest record sets, without the v0 version. In this case, we show measurements for δ = 0.0001, C_f = 6, L_Cf = 74 and K = 30. The second panel of Figure 3.3 shows the superiority of versions v3 and v2 over v1 on all metrics.

3.5 Conclusions

The four versions presented give the Carol member the capability to choose between slow but memory-economical and fast but memory-demanding MapReduce executions. They also show the need for techniques that help users decide (in terms of resource use, relative costs and problem size) which of the four versions is appropriate. The experimental evaluation shows that versions v1, v2 and v3, which make progressively smarter use of memory, have the advantage of quick execution of HLSH/FPS compared to v0; but as the number of records grows, the size of the data that must fit into memory also increases. With the prospect of a slow but complete HLSH/FPS process, v0 may be the best proposal for Hadoop environments with limited memory resources.

CHAPTER 4

Security and Privacy Solutions associated with NoSQL Data Stores

4.1 Introduction

The advances in cloud computing technology and distributed web applications, along with the ever-increasing volume of data to be stored and processed, have rendered necessary the adoption of non-relational databases, known as NoSQL or “Not only SQL” [124]. It is widely known that traditional SQL databases are not able to cope with Big Data [147], and NoSQL systems are nowadays experiencing an increase in popularity [114]. In recent years, many NoSQL databases have made their appearance; Cassandra and MongoDB are two popular ones, to name a few. Some useful features of NoSQL databases are high availability, scalability and better performance, as well as the ability to store and process large-scale semi-structured and/or unstructured data faster than traditional RDBMSs [114], [147]. However, due to the ever-increasing use of NoSQL databases, a significant amount of sensitive data is exposed to a number of security vulnerabilities, threats and risks. Lack of encryption support and poor authentication between servers and clients are some of the leading security issues in NoSQL databases. It should also be noted that only simple authorization is provided, without support for role-based access control (RBAC), and there is no protection against injections and denial-of-service attacks [62]. Brewer in [20] made a conjecture about the trade-offs in the development of a distributed database system, thus introducing the CAP (Consistency, Availability, Partition tolerance) properties. A formal version of Brewer's conjecture was officially published as the CAP theorem in [55]. Specifically, the CAP theorem states that no shared data system can provide more than two of the three properties, namely consistency, availability and partition tolerance, at the same time. Regarding organizations, Amazon developed the Dynamo technology [35], whereas

Google produced the distributed storage system Bigtable [24]. These particular technologies have inspired many NoSQL deployments in companies like Facebook or Twitter. Modern companies deal with data that are not relational and need databases superior to the traditional ones, which encounter scalability and availability problems because of the data size. There are already several authorization models in relational databases, where views are usually utilized; in this way, SQL queries are used to display a specific state of a specified part of the database [13]. Some NoSQL databases managed as Big Data stores use new authorization models, specifically designed for their structure, speed and huge amounts of data. These models include key-value, wide-column and document-oriented authorization. In addition, the storage and retrieval of records is achieved through a unique key for each record, providing a swift search [29]. In this chapter, we present security and privacy issues in NoSQL databases and further examine them in order to propose the most efficient security mechanisms and privacy solutions. More to the point, data protection and access control are some of the key security issues in NoSQL, while several security threats for NoSQL databases are considered, such as the distributed environment, authentication, fine-grained authorization and the protection of data at rest and in motion. The remainder of this chapter is organized as follows. Section 4.2 presents a survey of existing related work concerning mechanisms to overcome security issues. Section 4.3 provides a comparative study between relational and NoSQL databases, while in Section 4.4 our security and privacy-preserving mechanisms are proposed. Finally, Section 4.5 summarizes the chapter.

4.2 Related Work

Many early papers addressing the relationship between relational and NoSQL databases gave an overview of NoSQL databases, their types and their characteristics; they were enthusiastic about NoSQL and how it challenged the dominance of SQL [126], [148]. In [14], structured and non-structured databases were discussed, and it was explained how the use of NoSQL databases such as Cassandra improved system performance; in addition, such a database can scale the network without changing any hardware or altering the server infrastructure, improving network scalability with low-cost commodity hardware. In [80], a survey of relational databases is presented along with NoSQL features and shortcomings. These shortcomings and issues of NoSQL databases have also been mentioned in [104], where complexity, consistency and limited ecosystems are considered serious concerns. In [116], the authors state that the demand for relational databases will not go away anytime soon, as they will keep serving the line of applications that support business operations, whereas NoSQL databases will serve the large, public and content-centric applications. Another similar work is presented in [124], where an extensive analysis of security issues in NoSQL databases, such as Cassandra and MongoDB, is provided. Several solutions have also been proposed to improve privacy preservation in NoSQL databases. More specifically, in Arx, a proxy is employed in order to rewrite NoSQL queries at the trusted premises, while a back-end component, deployed at the untrusted premises, performs the computation over encrypted data [134]. In the BigSecret system, standard encryption is used for the protection of the stored data, while the indexes are encoded using special techniques to allow comparisons (pseudo-random functions) and range queries (order-preserving partitioning) [132]. The authors in [186] employ searchable-encryption algorithms to build a privacy-preserving key-value store on top of the Redis database; in this approach, the values are protected with symmetric encryption, while the keys are secured with pseudo-random functions. In another solution, SafeRegions combines secret sharing and multiparty computation to perform secure NoSQL queries on three independent and untrusted HBase clusters, thus simultaneously providing secure computation over the stored values and security guarantees similar to standard encryption [135].

4.3 Comparison of Relational and NoSQL Databases

During the last decades, relational databases, sub-divided into groups known as tables, have been used with the aim of storing structured data. The units of data in each table are known as columns, and each unit of the group is known as a row; the columns of a relational database also have relationships amongst them. This situation has been changing over the last years due to the rise of large web applications, which output huge amounts of data that traditional relational databases cannot handle any more [34]. NoSQL databases are sometimes referred to as “Not only SQL” so as to emphasize that they may also support SQL-like query languages. Nowadays, NoSQL databases have more to offer than just solutions to scaling problems, providing many important advantages [34] like the following:

• The data representation is schema-less, and there is no need to define a certain structure from the beginning, since new fields can be added at run-time.

• Speed, as data can be processed in milliseconds instead of hundreds of milliseconds.

• The elasticity of the applications due to the scalability features that NoSQL databases offer.

• Reduced development time, as developers do not have to deal with complex SQL queries and difficult joins in order to collate the data from different tables into a new view.

Some of the differences between relational and NoSQL databases are listed in the following paragraphs.

4.3.1 Reliability of Transactions

The ACID (atomicity, consistency, isolation, durability) model is fully supported by the design of relational databases, providing high reliability of transactions, unlike NoSQL databases.

4.3.2 Scalability Issues and Cloud Support

The primary purpose of cloud technology is to provide services to end-users. NoSQL databases are fully compatible with cloud environment requirements, as they can analyze not only raw structured data but also semi-structured or unstructured data from different sources, since they are not bound to the ACID model. On the other hand, relational databases do not provide search over full content, and their characteristics are not designed for cloud use. The need for scalability is arguably one of the most significant problems of relational databases, as they rely on vertical scalability to upgrade performance. More specifically, this upgrade method requires the purchase of expensive equipment such as RAM, processors, SSD hard drives, etc., and in some cases this is not easily achieved due to each system's constraints. Moreover, relational databases do not support horizontal scaling through the addition of extra nodes and therefore cannot support demanding online applications with many users and distributed data. NoSQL databases, by contrast, support horizontal scaling, since they do not deal with relational data.

4.3.3 Complexity and Big Data Management

The complexity of NoSQL databases is lower than that of relational databases, as it is not necessary to create tables to record the data; instead, the data can be modeled around the intended query patterns. Also, developing a database structure on a relational database is always considered a complicated task compared to the abstract model of a NoSQL database, where data can be stored regardless of whether they are structured, unstructured or semi-structured. NoSQL databases have a valuable role in Big Data management, since they are well-suited for storing and retrieving data at high speed across distributed nodes, thus taking advantage of multi-core and GPU architectures. In relational databases, where accuracy is more important than speed, the data should be stored in tables' rows and columns, and scalability is always considered a big issue. For conventional applications with small datasets, they are the most reasonable choice, but splitting the data across different servers increases the difficulty, requiring complex SQL queries to join the data again.

4.3.4 Data Model

Sets in mathematics are the driving force behind relational databases; all the data are represented as mathematical n-ary relations, where an n-ary relation is a subset of the Cartesian product of n domains. The data are represented as tuples inside the database and are further grouped into relations. A relation (represented by a table) contains a set of tuples (represented by rows); the columns of the relation table correspond to a sequence of attributes, and the type of an attribute is identified by its domain, i.e., the set of values that have a common meaning. This data model is very specific and well organized, and the columns and rows are described by a well-defined schema. NoSQL databases can employ many modelling techniques, such as graphs, key-value stores and document data models. In terms of classification, NoSQL databases are named after their data model, but in some cases a NoSQL database system can be identified by two or more of the data models that represent its data. The NoSQL data model does not use the table as the storage structure of the data, and this is considered the main feature that distinguishes NoSQL from relational databases. Furthermore, it is schema-less and, as a result, can handle unstructured data like word, pdf, image and video files in a very efficient manner.

4.3.5 Data Warehouse and Crash Recovery

Regarding data warehousing, relational databases gather data from many sources, and the sheer size of the stored data results in big data problems; to name a few, performance degrades when utilizing OLAP (Online Analytical Processing), statistical processing or data mining. On the other hand, NoSQL databases are not designed with data warehouse applications in mind, because their designers focus on scalability, availability and high performance. Crash recovery is implemented in relational databases via the recovery manager, which is responsible for ensuring durability and transaction atomicity by using log files and the ARIES algorithm. Crash recovery in NoSQL databases instead depends on replication to recover from a crash.

4.3.6 Privacy and Security

Most relational databases do not provide any feature for embedding security in the database itself; as a result, developers have to impose security mechanisms directly in the middleware. Classic cryptographic mechanisms and encryption protocols, such as asymmetric key encryption schemes, digital signature schemes, zero-knowledge Proofs of Knowledge, as well as commitment schemes, which are based on Strong RSA (SRSA), bilinear maps [8], the discrete logarithm, and homomorphic encryption, fully homomorphic or not [1], have been widely considered for securing communication and ensuring data confidentiality in relational databases. Nonetheless, one of the most serious shortcomings of NoSQL databases is considered to be the fact that data files are not encrypted by default; such a process takes place in the application layer before sending data to the database server. Although there are solutions that provide encryption services, these lack the horizontal scaling and transparency required in the NoSQL environment. Furthermore, only a few NoSQL databases provide encryption mechanisms to protect user-related sensitive data. By default, in NoSQL databases the inter-node communication is not encrypted, and SSL (Secure Sockets Layer) client-node communication is not supported (as it is in relational databases), breaking network security [147]. Also, there is no integration of authentication or authorization mechanisms. Distributed environments increase the attack surface across several distributed nodes, and enforcing integrity constraints is much more complex in NoSQL databases. In general, only a few categories of NoSQL databases provide mechanisms that employ encryption techniques to protect data at rest.
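As a minimal sketch of such application-layer protection, values can be encrypted before they ever reach the database server. The sketch uses the Fernet recipe of the Python cryptography package; the kv_store dict merely stands in for a NoSQL client handle (an assumption for illustration), and real deployments would also need proper key management and, per [186], pseudo-random functions for the keys themselves.

from cryptography.fernet import Fernet

# Key management is the hard part in practice; here the key simply lives
# with the (trusted) application, never with the database server.
key = Fernet.generate_key()
cipher = Fernet(key)

kv_store = {}    # stand-in for a NoSQL client, e.g. a key-value store handle

def put(record_id: str, value: str):
    # Values are encrypted in the application layer, so the server and its
    # data files only ever see ciphertext.
    kv_store[record_id] = cipher.encrypt(value.encode())

def get(record_id: str) -> str:
    return cipher.decrypt(kv_store[record_id]).decode()

put("user:42", "sensitive address")
assert get("user:42") == "sensitive address"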

4.4 Proposed Security and Privacy Solutions

Below are our proposed security and privacy solutions for NoSQL data stores.

4.4.1 Pseudonyms-based Communication Network

In the context of this system, users can have access to multiple services by inserting their credentials only once, namely when they initially connect to the system. Such a system is called anonymous because users can be known only through their pseudonyms, and transactions carried out by the same user cannot be linked, as their identity is never disclosed. For this reason, it is considered the best means of user protection. Furthermore, it is based on two vital protocols, RSA and Diffie-Hellman. Its structure and operations extend Brands' credential system [19], and it consists of four parties: the users U, a central Identity Provider denoted as IP, the Service Providers SPs, and the organization for issuing and validating credentials. Users are entities that receive credentials and are known to the Service Providers only through their pseudonyms. The central Identity Provider creates its own public and secret key pair, denoted as (P, S) respectively, and uses its secret key to digitally sign its sensitive data. Each credential encodes m + 1 attributes, denoted as y_1, y_2, ..., y_m, t (where t is the credential issuing time). The IP decides on G_q, a finite cyclic group of prime order q, to which the random generators g_1, g_2, ..., g_m, g_{m+1}, h_0, involved in key generation, belong. Specifically,

S = (y_1, y_2, ..., y_m, t, s)    (4.1)

P = g_1^{y_1} g_2^{y_2} · · · g_m^{y_m} g_{m+1}^{t} h_0^{s},    (4.2)

where s ∈ Z_q is kept secret.

Under the Discrete Logarithm assumption in G_q, these keys are unique. The IP is responsible for the distribution of the digital pseudonyms p_1, p_2, ..., p_m to any user. An organization can issue a credential to a pseudonym, and the corresponding user can prove its ownership to another organization (which knows them by a different pseudonym) by just demonstrating ownership of the credential. Additionally, the Credential Authority CA prevents the sharing of credentials or pseudonyms and guarantees that users who enter the system have a public and secret key that makes them unique in the system. Another entity in the system is the Verifier V, whose role is to certify the validity of the user credentials and to communicate with either the Issuing Authority or the Credential Authority in order to report that a user is not the owner of the credential they are presenting. In terms of a digital credential, a user transmits the public key and the CA's digital signature derived from a Proof of Knowledge, through which they prove that they know the secret key and the attributes of the digital credential that satisfy the particular attribute property they are revealing. Each pseudonym and credential belongs to a well-defined user. In more detail, it is impossible for different users to collaborate and show some of their credentials to a Service Provider, or to obtain a credential that a single user could not obtain on their own (coherent credentials). As organizations are autonomous and separable entities, they can select their public and secret keys independently of the other entities, so as to ensure the security of these keys and facilitate the key management system. The pseudonym system can protect user privacy and provide security, as in such a system an organization cannot find out anything about a user other than the ownership of a set of credentials. Specifically, two pseudonyms that belong to the same user cannot be linked (unlinkability) or identified, as in Brands' system, except under specific conditions. In order to be efficient, any communication in the system involves as few entities as possible along with the minimum amount of information. If a user holds a credential, it can be shown multiple times without the need to reissue (and consequently re-sign) it. When a user accesses a service, they are validated by proving that they know the secret key of their pseudonym without revealing it, thus preventing pseudonym repetition. Also, for each pseudonym that a Service Provider associates with a user, it requires the user to unveil a different encoded random number of their pseudonym each time, thus ensuring the unconditional unlinkability of their pseudonyms. Although the Identity Provider blindly encodes the random numbers in all of a user's pseudonyms, which are uniquely related to them, if a user abuses the service, the SP can blacklist and reveal those numbers; it can subsequently revoke their pseudonyms globally and abolish their access to any of the services they previously had. Finally, under the Discrete Logarithm assumption, users can conclusively prove that their encoded numbers do not belong to the SP's blacklist, using the blacklist as input to a zero-knowledge proof and without revealing any information about their identity. Hence, this technique does not impact users' privacy and does not give additional power to the SP and the IP.
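A toy sketch of the key construction of Eqs. (4.1)-(4.2) may be helpful. The group parameters below are illustratively small and are an assumption (p = 2q + 1 with q prime, all bases in the order-q subgroup of quadratic residues mod p); real deployments use cryptographically large primes.

import random

p, q = 23, 11     # toy modulus and prime subgroup order (assumption)
g = [4, 9]        # attribute bases g_1, ..., g_m (here m = 2)
g_t = 3           # base g_{m+1} for the issuing time t
h0 = 2            # base h_0 for the blinding exponent

def issue_keys(attrs, t):
    """Secret key S = (y_1, ..., y_m, t, s) and public key
    P = g_1^{y_1} ... g_m^{y_m} g_{m+1}^t h_0^s mod p (Eqs. 4.1-4.2)."""
    s = random.randrange(1, q)            # blinding exponent, kept secret
    P = pow(g_t, t, p) * pow(h0, s, p) % p
    for base, y in zip(g, attrs):
        P = P * pow(base, y, p) % p
    return (list(attrs) + [t, s]), P

S_key, P_key = issue_keys(attrs=[5, 7], t=3)

The random s makes two keys over the same attributes unlinkable, which is the property the pseudonym mechanisms above rely on.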

4.4.2 Monitoring, Filtering and Blocking

As mentioned above, the available applications designed to monitor NoSQL databases cannot detect and then disable malicious jobs and queries. The Kerberos central authentication system can be easily bypassed via advanced scripts and, in general, the level of monitoring is limited to data processing, mainly at the API level [83]. In a cloud environment, no information regarding the communication of the nodes in the cluster, user connection details or data-altering operations (even editing or deleting) is recorded. In general, since there are no log files, identifying incidents of data breach or malicious data loss in the cluster is a challenging problem [79]. Real-time security mechanisms do exist in big data technologies, resulting in high-speed data analysis; thus, anomaly detection can be implemented in real time and the recorded security analytics can be frequently updated [62]. Some monitoring tools are available, but they are limited to controlling user requests at the API level. In general, neither are the characteristics of a malicious query in big data technologies defined, nor do complete monitoring tools to disable such malicious queries exist. One technique could be an initial authentication via Kerberos followed by a second-level authentication for accessing MapReduce [109].

4.5 Conclusions

In this chapter, we have discussed major security concerns regarding NoSQL databases. Data protection and access control can be considered some of the key security issues in NoSQL technology. The reasons for security threats in various NoSQL databases have also been thoroughly discussed in the current work, such as the privacy of user data, the distributed environment, authentication, fine-grained authorization and access control, safeguarding integrity, and the protection of data at rest and in motion. In NoSQL databases, Kerberos is used to authenticate the clients and data nodes, and in order to ensure fine-grained authorization, data are grouped according to their security level. Cassandra uses the TDE (Transparent Data Encryption) technique to protect data at rest, whereas in MongoDB administrators must implement controls ensuring that applications and users have access only to the data they need in order to maintain a secure deployment. Various techniques for mitigating attacks on NoSQL databases have also been discussed, along with the proposed security and privacy solutions for NoSQL databases.

CHAPTER 5

Trajectory Clustering and k-NN for Robust Privacy Preserving Spatio-Temporal Databases

5.1 Introduction

Nowadays, the rapid development of Internet-of-Things and Radio-Frequency Identification sensor systems [174], in combination with the evolution of satellite and wireless communication technologies, enables the tracking of moving objects such as vehicles, animals and people [157]. Through mobility tracking we collect a huge amount of data, which gives us considerable knowledge. Moving objects follow a continuous trajectory; however, this is described by a set of discrete points acquired by sampling at a specific rate with time-stamps over a time period [187]. A simple description of a trajectory is as a finite sequence of pairs of locations with time-stamps, which suits a GIS (Geographical Information System) database [157]. In the real world, people's activities constitute spatio-temporal trajectories, recorded either passively or actively, which can be used in human behavior analysis. Examples of active recording are the check-ins of a user in a location-based social network, or a series of tagged photos in Flickr, as each photo has a location tag (where) and a time-stamp (when). An example of passive recording is a credit card's transactions, as each transaction corresponds to a time-stamp and the id of its location. Modern vehicles such as taxis, buses, vessels and aircraft are equipped with a GPS (Global Positioning System) device, which enables the reporting of time-stamped locations [157]. Such real-life examples have made moving-object and trajectory data mining important. Animal scientists and biologists can study moving animals' trajectories in order to understand their migratory traces, behavior and/or living conditions. Meteorologists, environmentalists, climatologists and oceanographers collect the trajectory data of natural phenomena, as these capture environmental and climate changes, in order to forecast weather, manage natural disasters (hurricanes) and protect the environment. Trajectory data can also be used in law enforcement (e.g., video surveillance) and in traffic analysis to improve transportation networks. More to the point, the evolution of technology in the domain of mobile devices, in combination with positioning capabilities (e.g., GPS), paved the way to location-based applications such as Facebook and Twitter. Indeed, social media networking has thoroughly changed people's habits in every aspect of their life, from personal and social to professional. A GPS sensor allows users to periodically transmit their location to a Location-Based Service provider (active recording) in order to retrieve information about proximate points. However, queries based on location may reveal sensitive information about an individual [118]. Therefore, the privacy and anonymity preservation problem for mobile objects remains an important issue, which concerns us in the context of this work. The above issues, along with [75], motivated the following research. Specifically, we apply k-NN queries on the trajectory data points of mobile users, with or without clustering. In both methods, mobile users are camouflaged by their k nearest neighbors, which constitute their k-anonymity set. In the case of clustering, the trajectory points of all users at each time-stamp are grouped with K-Means (on-line clustering), and k-NN queries are applied to find the indexes of the k nearest neighbors of each user inside the cluster it belongs to. Irrespective of the method used, if the k nearest neighbors' indexes remain the same, or vary at a low rate over time, it is difficult for an adversary to discover a mobile user based on historical data. We experiment on how this set changes, with and without clustering, for different combinations of the dimensions (x, y, θ, v), which is the main contribution of this work, and we provide an analysis of the effect of the dimensions on the k-anonymity method. We conclude that when a data set contains a large number of attributes which are open to inference attacks, we are faced with a choice of either completely suppressing most of the data or losing the desired level of anonymity. The rest of this chapter is organized as follows: Section 5.2 describes in detail (a) the clustering and classification problems along with the algorithms used, (b) the system architecture, (c) the problem definition, (d) the system model and the adopted methods, (e) the k-anonymity privacy preservation and, finally, the experimental environment and the source of the data sets. Section 5.3 presents previous related work in relation to our approach and records future directions of this work. Finally, Section 5.4 presents the graphical results gathered from the experiments and the conclusions of their evaluation in terms of the studied problem.

5.2 Materials and Methods

5.2.1 Clustering

Clustering is an iterative procedure which forms groups of similar data; it is primarily concerned with distance measures and is a fundamental method in data mining. Clustering methods are classified as partition-based, hierarchy-based, density-based, grid-based and model-based. Moving-object activity pattern analysis (i.e., similar motion patterns) and activity prediction are typical application scenarios of trajectory clustering [187]. In our case, clustering is used to organize moving objects into groups so that the members of a group are similar, with high compactness, according to a similarity criterion based on spatio-temporal data. Specifically, for a group of mobile objects, we cluster the attributes of their current trajectory points (spatial coordinates, angle, velocity) at specific time-stamps; in other words, we apply on-line clustering.

Algorithm 1: K-Means
1: Input: number of clusters K and training data P
2: Output: a set of K clusters
3: Method: arbitrarily choose K objects from P as the initial cluster centers
4: repeat
5:   assign each object to the cluster to which it is most similar, based on the mean value of the objects in the cluster
6:   update the cluster means, i.e., calculate the mean value of the objects of each cluster
7: until no change

K-Means (Algorithm 1 [194]), which is used in this work, belongs to the partitioning clustering methods and is popular due to its simplicity. It is based on squared-error minimization, and its main advantage is that, in each iteration, only the distances between each point and the K cluster centers are computed. Its time complexity is O(NKt), where N, K and t are the number of data objects, clusters and iterations respectively. However, K-Means clustering has some weak points: the number of clusters K must be known in advance, and its computational cost grows with the number of data observations, clusters and iterations. K-Means and other clustering algorithms use the compactness criterion to assign clusters, which is what concerns us, in contrast with the spectral clustering of [70], which makes use of the spectrum (or eigenvalues) of the similarity matrix of the data and examines the connectivity of the data. K-Means is expected to be a good option for exclusive clustering (which concerns our study), as opposed to Fuzzy C-Means, which assigns each mobile object to different clusters with varying degrees of membership and may therefore give good results for overlapping clusters; the latter also has a much higher time complexity than K-Means [23].
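A minimal NumPy sketch of Algorithm 1, as applied here to the per-time-stamp trajectory attributes, could look as follows; the convergence test and the seeding are illustrative choices, not prescribed by the text.

import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Algorithm 1: Lloyd-style K-Means on an (N, d) array of trajectory
    attributes (e.g. x, y, angle, velocity at one time-stamp)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # assignment step: nearest center per point (O(N*K) distances)
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: recompute each cluster mean (empty clusters kept as-is)
        new_centers = np.array([points[labels == c].mean(axis=0)
                                if np.any(labels == c) else centers[c]
                                for c in range(k)])
        if np.allclose(new_centers, centers):
            break                        # "until no change"
        centers = new_centers
    return labels, centers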

Algorithm 2: k-Nearest Neighbor
1: Input: X: training data set, Cl_X: class labels of X, p: testing point to classify
2: Method:
3: Compute the distances d(X_i, p) to every training point X_i and keep the indexes I of the k smallest distances
4: Select the k labels Cl_X(I)
5: return I and the majority class cl_b in Cl_X(I)

5.2.2 Classification

Classification is a supervised machine learning approach and concerns the assignment of class labels to a new sample on the basis of a training data set whose samples' class membership is known. The principle behind the nearest neighbor method is to find a number of training samples closest in distance to the new point and to predict its class label from these. The number k of nearest neighbors can be a user-defined constant, or vary based on the local density of points (radius-based neighbor learning); the most common distance measure is the Euclidean one. k-Nearest Neighbor (Algorithm 2 [159]) is a non-parametric method and by far the simplest of all machine learning algorithms. Used on its own, k-NN has a high computational cost: classifying a new data point requires calculating the distances between it and all the points of the training data set in order to choose the k nearest neighbors. To overcome this issue, we combine it with a clustering method, namely K-Means, which efficiently reduces the size of the training sets and hence the computation time of k-NN as well. It is worth mentioning that applying k-NN inside a cluster makes no sense if the cluster size is smaller than the number k of nearest neighbors we are looking for inside it; hence, the appropriate combination of the parameters K (since K influences the clusters' size) and k is crucial. Despite the aforementioned advantages, k-NN gives each labeled sample the same importance during classification, in contrast with what a fuzzy classifier considers [64]. Finally, in a recent work [48], the authors describe and suggest an efficient method in which kernel fuzzy clustering is combined with the harmony search algorithm for scheme classification.
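A corresponding sketch of Algorithm 2 follows, restricted to the query's K-Means cluster as described above; the points array and labels vector follow the conventions of the previous listing.

import numpy as np

def knn_indices(points, query_idx, k):
    """Algorithm 2 core: the indexes I of the k nearest neighbours of a point."""
    d = np.linalg.norm(points - points[query_idx], axis=1)
    d[query_idx] = np.inf               # a user is not its own neighbour
    return np.argsort(d)[:k]

def knn_in_cluster(points, labels, query_idx, k):
    """k-NN restricted to the query's K-Means cluster; only meaningful when
    the cluster holds more than k members, as noted in the text above."""
    members = np.flatnonzero(labels == labels[query_idx])
    local = np.flatnonzero(members == query_idx)[0]
    d = np.linalg.norm(points[members] - points[query_idx], axis=1)
    d[local] = np.inf
    return members[np.argsort(d)[:k]]

Restricting the distance computations to the cluster members is exactly where the combination saves work compared to plain k-NN over all N points.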

5.2.3 Useful Definitions

We consider points in a d-dimensional space D. Given two points a and b, we define as dist(a, b) the distance between a and b in D. In this chapter, we utilize the Euclidean distance metric, which is defined as

dist(a, b) = ( Σ_{i=1}^{d} ( a[i] − b[i] )² )^{1/2}

where a[i], b[i] denote the values of a and b along dimension i in D.

Definition 5.2.1. k-NN: Given a point b, a data set X and an integer k, the k nearest neighbors of b from X, denoted as k-NN(b, X), is a set of k points from X such that ∀p ∈ k-NN(b, X) and ∀q ∈ X \ k-NN(b, X), dist(p, b) < dist(q, b).

Definition 5.2.2. k-NN Classification: Given a point b, a training data set X and a set of classes Cl_X to which the points of X belong, the classification process produces a pair (b, cl_b), where cl_b is the majority class to which b belongs.

Definition 5.2.3. Clustering: Given a finite data set P = {a_1, a_2, ..., a_N} in R^d and a number of clusters K, the clustering procedure produces K partitions of P such that, among all partitions (clusters) C_1, C_2, ..., C_K of P, it finds the one that minimizes

arg min_{C_1,...,C_K} Σ_{c=1}^{K} Σ_{a ∈ C_c} ‖ a − (1/|C_c|) Σ_{a_j ∈ C_c} a_j ‖²

where |C_c| is the number of points in cluster C_c.

5.2.4 System Architecture

In this section we briefly describe how the system used to extract the desired data sets operates and produces the trajectory data points. The system architecture is based on the SMaRT (Spatiotemporal Mysql ReTrieval) framework [53] and works as an expansion sub-system that produces new data sets out of sample points stored in a relational database. It exploits the Google Maps API in order to define trajectories, between two randomly chosen points, that follow road paths over a geographical area of interest. To support this functionality, a class was created by extending the existing framework and providing the corresponding user interface. The data flow of this sub-system, as shown in Fig. 5.1, follows a three-stage path. Before the beginning of the process there is an initialization phase, during which the database is populated with manually pre-created trajectories from a comma-separated (csv) file through the importing class implemented in the SMaRT framework. The trajectory objects T_i take the form of sequential spatio-temporal points P_n and are stored in the database as tuples of latitude, longitude, time-stamp and objid, where objid is a unique identification integer of the moving object that participates in our analysis and takes values from 1 (the first mobile object) to N (the N-th mobile object). Moreover, the latitude and longitude dimensions are transformed to the equivalent Cartesian coordinates (x, y) through the Mercator transformation, in order to efficiently calculate measures such as the distance between points and the velocity and angle vectors. After the initialization phase is over, a repeated procedure takes place, as described below.

Fig. 5.1 Data flow diagram

At first, two points with a significant Euclidean distance between them are randomly chosen from the relational database. Because of the Mercator transformation, every distance is measured in meters; thus the distance between the two points is given by √(Δx² + Δy²) and should be over a threshold of 10 m. This threshold was chosen heuristically and eliminates the problem of zero velocity and angle calculations. Then an API call (see Fig. 5.2) is issued against the Google Maps Directions API service, given the points along with route settings information. This information may specify the way in which the target object moves (pedestrian, car or bicycle), whether it should follow toll roads or not and, most helpfully, whether the response should contain alternative routing paths besides the first proposed one. Through this last attribute, the trajectories can be multiplied by obtaining not only the proposed routing path but also the alternative ones. This method also mitigates the restriction that the Google service imposes, which limits the service calls to 10 per second, by introducing a slight delay in the insertion process. Secondly, the API response, which contains the routing paths following the road network that connects the two reference points, provides the information to the class that constructs the trajectory object. For each point of the trajectory, several calculations take place, such as acquiring the time-stamp t_n of the moving object at this specific point, based on the user-defined velocity scale v_n and the Euclidean distance d_n from the previous point. Also, the angle vector g_n is measured using the coordinates (x_n, y_n) of the specific point and the next point (x_{n+1}, y_{n+1}) of the trajectory. Because of the randomness of the chosen reference points of the API request, there is a difference between the number of points that the user provided as the trajectory multitude and the number of points in the API response routing path. In case some routing path R_n has more points than the user-defined upper limit, it is truncated to that limit and the endpoint of the final trajectory object T_n becomes the last point of the array. In the opposite situation, the trajectory is discarded and the current thread resets its state in order to make a new request. This procedure ends with the final trajectory object T_n being stored in the database. For the storage procedure, a class from the SMaRT framework is used, which checks all the rules that the database instance implies. When the user-defined number of trajectory objects is reached, the whole process stops with a termination message containing the number of trajectories that were stored in the database, the time spent on the procedure, the number of tuples inserted in the database and the space used for those trajectory objects, including the overhead of indexing the appropriate fields in the database.

Fig. 5.2 API Request Diagram

In the event of a failure at any of the stages described above, all trajectory information is discarded and the corresponding process is restarted. Thus, in order to overcome subsequent failures, a parallel methodology is deployed, where a user-defined number of simultaneous calls is instantiated at the beginning of each process.

After each call response, a new trajectory is stored in the database and given a new object ID, denoted as objid in the following; the calls that failed are restarted with a different set of points. Finally, in terms of space consumption, each point of a trajectory object costs 17 bytes (2 × 4 bytes for the representation of the coordinates (x_n, y_n), 5 bytes for the time-stamp and 4 bytes for the objid) when it is stored as a simple point with the three dimensions stored separately as (x_n, y_n, t_n) plus the objid. When it is stored as a single spatial point with extra fields for the time-stamp and the objid, the cost goes up to 34 bytes (25 bytes for the spatial representation of (x_n, y_n), 5 bytes for the time-stamp and 4 bytes for the objid).

5.2.5 Problem Definition

In the context of this work we study the problem of Privacy Preserving considering spatio-temporal databases of N records with d attributes each one. The spatio-temporal data is the location data of a number of mobile users along with the time-stamp of each position as shown in table 5.1. Through SMaRT system we have in our disposal tra- jectory data which give us information about angle direction and velocity amplitude, as well. Therefore, for each record, i.e. mobile user, we know the values of four attributes. We employ a popular anonymization approach, called k-anonymity, such that any orga- nization or adversary can deduce information, about the mobile user identity, observing its location attributes. Within k-anonymity attributes can be suppressed, namely their values can be replaced by ”*”, or generalized [176] until each row is identical with at least k − 1 other rows. At this point the database is said to be k-anonymous and thus, prevents database linkages. In our case, we select to anonymize location data attributes by employing a classification method which enable us to construct the k- anonymity set of each user per time-stamp. The rationale behind anonymity preserving lies in the preserving of the k nearest neighbors from one position to another. To this end, we investigate the problem considering two approaches. In the first approach, the anonymization is handled as a clustering problem, in which the d-dimensional space of attributes is partitioned into homogeneous groups so that each group contains at least k records, namely the minimum number of records in a cluster, to satisfy k-anonymity. To achieve it, as a first approach which will be elaborated in the future, we adopt the K-Means clustering method. The k-anonymity set of each user is formed based on the

In the second approach, the anonymity set is formed again by the k nearest neighbors indexes, but without considering a partitioning of the d-dimensional attribute space. The maximum number of clusters is K = ⌊N/k⌋, where N is the total number of records in the data set and k ≪ N is the anonymity parameter for k-anonymization.

Table 5.1 An example of spatio-temporal database for d = 4

objid  time-stamp           timeToNextPoint  x      y      angle  velocity
1      2013-03-09 10:00:01  0                21082  56436  1.23   0
1      2013-03-09 10:00:04  3                21099  56432  1.16   4.5
1      2013-03-09 10:00:11  7                21221  56484  1.51   14.6
1      2013-03-09 10:00:19  8                21331  56524  1.95   11.3
1      2013-03-09 10:00:21  2                21402  56495  0      29.5
2      2013-03-09 10:00:03  0                35587  59829  -2.76  0
2      2013-03-09 10:00:08  5                35568  59782  2.94   7.8
2      2013-03-09 10:00:16  8                35580  59723  -2.07  5.8
2      2013-03-09 10:00:25  9                35530  59668  -1.52  6.4
2      2013-03-09 10:00:34  9                35476  59671  -2.85  4.6

For L trajectory points, which correspond to L time-stamps, we compute for each mobile user i its k nearest neighbors indexes and record them in a vector of the form

knns_{it} = [id_{it1}, id_{it2}, ..., id_{itk}]  for t = 1, 2, ..., L.

An example of such sets for N mobile users is shown in Table 5.2. For each user we measure how many of the k nearest neighbors remained the same from one position to another.

Table 5.2 k-anonymity sets for N mobile users in L = 5 time-stamps

objid Time Instant knns indexes

1 1 [id111, id112, . . . , id11k]

1 2 [id121, id122, . . . , id12k]

1 3 [id131, id132, . . . , id13k]

1 4 [id141, id142, . . . , id14k]

1 5 [id151, id152, . . . , id15k]

2 1 [id211, id212, . . . , id21k]

2 2 [id221, id222, . . . , id22k]

2 3 [id231, id232, . . . , id23k]

2 4 [id241, id242, . . . , id24k]

2 5 [id251, id252, . . . , id25k]

... ... ...

N 1 [idN11, idN12, . . . , idN1k]

N 2 [idN21, idN22, . . . , idN2k]

N 3 [idN31, idN32, . . . , idN3k]

N 4 [idN41, idN42, . . . , idN4k]

N 5 [idN51, idN52, . . . , idN5k]

Definition 5.2.4. (k-anonymity). A spatio-temporal database is k-anonymous w.r.t. a set of attributes d if at most one of the k nearest neighbors has changed from one time-stamp to another, so that each mobile user is not distinguishable from its k − 1 neighbors.

According to the authors in [156], k-anonymity is able to prevent the unveiling of mobile users' identities. This means that a user can be re-identified among its k neighbors only with probability 1/k. Nevertheless, k-anonymity may not protect users against attribute disclosure. Motivated by this argument, we evaluate the robustness of both approaches by computing how many of the k nearest neighbors remained the same, together with the aforementioned probability, per time-stamp.

5.2.6 System Model

We consider N mobile users in R^2. The configuration space, namely the environment in which the objects move, may be free space or a road network (constrained or unconstrained) [187]. In our case, we consider an unconstrained road network, as described in System Architecture, where mobile users are densely distributed and do not develop high speeds [53]. We exclude national or international road networks, since there we cannot assume that users move with linear velocity.

Fig. 5.3 A Matlab overview of mobile users' trajectory points

Suppose a collection of trajectories T = {T^1, ..., T^N} (Trajectories Database) of equal length L. Each trajectory consists of a sequence of time-ordered positions that a mobile user goes through as it moves from a start point to a specific destination. It is a vector of the form

T^j = {(x^j_1, y^j_1, t^j_1), (x^j_2, y^j_2, t^j_2), ..., (x^j_L, y^j_L, t^j_L)}.

Each (x^j_i, y^j_i) represents the position (Cartesian coordinates) of the mobile user j at time-stamp t^j_i, or point i of its trajectory j [187]. For each point i in trajectory j we define in 4-dimensional space a vector D^j_i = (x^j_i, y^j_i, θ^j_i, v^j_i), i = 1, 2, ..., L, which is described by the location coordinates (x, y) and the motion pattern (θ, v), respectively. For the first point of each trajectory, the direction and velocity are defined with respect to the point (0, 0).

Algorithm 3: MWCL
1: Input: the number k of nearest neighbors, the number of mobile users N, the vectors D^j_i of the N users in L time-stamps
2: Output: the k nearest neighbors indexes of the N users in L time-stamps
3: for i = 1 : L do
4:   for j = 1 : N do
5:     Compute the vector D^j_i of user j in time instant i
6:     Apply k-NN between the vector D^j_i and the vectors {D^j_i}_{j=1..N} of all users to find the set of k-NN indexes, I^j_i, of user j in time-stamp i
7:   end for
8: end for
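For concreteness, the following is a minimal Python sketch of Algorithm 3. The use of scikit-learn's NearestNeighbors and all variable names are illustrative assumptions (the thesis experiments were run in Matlab).

```python
# Illustrative MWCL sketch (not the original Matlab implementation).
# D has shape (L, N, d): L time-stamps, N users, d attributes per user.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mwcl(D, k):
    """Return I[i][j]: the k-NN index set of user j at time-stamp i."""
    L, N, _ = D.shape
    I = np.empty((L, N, k), dtype=int)
    for i in range(L):
        # Fit k-NN on all users at time-stamp i; ask for k+1 neighbors
        # and drop the first one, which is the query point itself.
        nn = NearestNeighbors(n_neighbors=k + 1).fit(D[i])
        _, idx = nn.kneighbors(D[i])
        I[i] = idx[:, 1:]
    return I
```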

We approach the problem of Privacy Preserving on spatio-temporal databases with two methods. The first one is called Method-Without-Clustering (MWCL) and the second one is called Method-with-Clustering (MCL). In the latter, we apply on-line clustering, i.e., in each time-stamp t we group the mobile users into clusters based on their vector values D^j_i at that moment. This vector is formulated depending on which of the attributes x, y, θ, v we choose for applying the K-Means and k-NN algorithms. We define a location data security metric, which we have already called Vulnerability; it quantifies and measures the robustness of each method. Specifically, it expresses the rate at which the k nearest neighbors of each mobile user change. The Vulnerability of each method is computed as the mean value of the Vulnerabilities of all users in L time-stamps. The fewer neighbors' indexes (i.e., objid values) change, the lower the Vulnerability.

Algorithm 4: MCL
1: Input: the number k of nearest neighbors, the number of mobile users N, the vectors D^j_i of the N users in L time-stamps
2: Output: the k-NN indexes of the N users in L time-stamps
3: for i = 1 : L do
4:   for j = 1 : N do
5:     Compute the vector D^j_i of user j in time instant i
6:     Apply the k-NN method between the vector D^j_i and the vectors {D^j_i}_{j=1..N} inside the cluster C^j_i of user j in time-stamp i, and find the set of k-NN indexes, I^j_i
7:   end for
8: end for

Definition 5.2.5. Vulnerability: Given a mobile user j and a set I^j_i with the k nearest neighbors indexes in time-stamp i, V^j_i is defined as

V^j_i = 1 / |I^j_i ∩ I^j_{i−1}|

where 0 ≤ V^j_i ≤ 1 and |I^j_i ∩ I^j_{i−1}| is the number of the k indexes that remained the same.
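Similarly, a minimal Python sketch of Algorithm 4 (MCL) above, under the same illustrative assumptions; clusters are assumed to contain more than k members, as required for k-anonymity.

```python
# Illustrative MCL sketch: on-line K-Means per time-stamp, then k-NN
# restricted to each user's cluster (all names are illustrative).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def mcl(D, k, cs):
    """D: (L, N, d) array; k: neighbors; cs: number of clusters."""
    L, N, _ = D.shape
    I = np.empty((L, N, k), dtype=int)
    for i in range(L):
        labels = KMeans(n_clusters=cs, n_init=10).fit_predict(D[i])
        for c in range(cs):
            members = np.where(labels == c)[0]           # global user ids
            # Assumes each cluster holds more than k members.
            nn = NearestNeighbors(n_neighbors=k + 1).fit(D[i][members])
            _, idx = nn.kneighbors(D[i][members])
            I[i][members] = members[idx[:, 1:]]          # map back to ids
    return I
```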

Algorithm 5: Vulnerability
1: Input: the sets I^j_i with the k nearest neighbors indexes, the number of mobile users N and the time period L
2: Initialization: V^j_1 = 1/k
3: Output: the Vulnerability values of the N mobile users in L time-stamps
4: for i = 2 : L do
5:   for j = 1 : N do
6:     V^j_i = 1 / |I^j_i ∩ I^j_{i−1}|
7:   end for
8: end for
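The metric of Algorithm 5 can then be computed from the k-NN index sets returned by either method. The handling of an empty intersection below is an assumption of this sketch, since Definition 5.2.5 leaves that case undefined.

```python
# Illustrative computation of the Vulnerability metric of Algorithm 5,
# given the k-NN index sets I of shape (L, N, k) from MWCL or MCL.
import numpy as np

def vulnerability(I, k):
    L, N, _ = I.shape
    V = np.empty((L, N))
    V[0, :] = 1.0 / k                      # initialization V_1 = 1/k
    for i in range(1, L):
        for j in range(N):
            common = np.intersect1d(I[i, j], I[i - 1, j]).size
            # Per Definition 5.2.5; common = 0 means all neighbors
            # changed, which we treat here as maximal vulnerability.
            V[i, j] = 1.0 / common if common > 0 else 1.0
    return V.mean()                        # mean over users and time-stamps
```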

5.2.7 Privacy Preserving Analysis

Recent advances in video-tracking technology have attracted scientific attention to the understanding of the social behavior of swarming animals. The video-tracking method can automatically measure an individual's motion states using videos from different camera views [36]. Swarming behavior is connected with collective behavior, which usually occurs in large groups of animals such as bird flocks, mosquitoes and insects [4]. Researchers employ mathematical models to simulate and understand swarm behavior. The simplest mathematical models generally consider that individual animals move in the same direction as their neighbors, remain close to them (thus the neighbors remain constant) and avoid collisions with them. In our study, the mobile users constitute the "swarming animals" who exhibit collective motion behavior and are either organized into groups or not, as described in the previous section. It is worth performing a privacy preserving analysis in the case that mobile objects move randomly (thus independently of each other) from one position to another. In both approaches (whether we apply on-line clustering or not), we observe that the mobile users' nearest neighbors indexes change and thus the Vulnerability of both methods increases. In the case of random motion, the mobile users' behavior is similar to that of a swarm of bees or flies. Their direction (angle) is piecewise linear and their velocity amplitude is approximately constant. Users in a road network have similar motion characteristics, but they do not move randomly. Generally, a mobile object in a dense road network moves on a piecewise linear random path with constant speed.

Let X_{t_i} be a sequence of independent random variables related to the number of neighbors of a mobile user i that changed within a time interval t_i. To keep the Vulnerability low, much less than 1 and close to 0, the users' motion behavior should not change considerably, so that their k nearest neighbors remain the same from one time-stamp to another. This results in the following theorem:

Theorem 5.2.1. If objects move randomly, the probability that the nearest neighbors do not remain the same within a time interval (0, t_i], with at least l of them changed so that users become distinguishable by an adversary, tends to 1:

lim_{l→k} P(X_{t_i} ≥ l) = 1

Proof. We conduct a Poisson experiment. We consider the number of successes resulting from the Poisson experiment to be a Poisson random variable with average number of successes λ, and k a positive integer related to the k-anonymity level.

1. The outcomes of the experiment are discrete. Specifically, they concern the nearest neighbors' sustainability and are classified as either success (the number of neighbors that remained the same is at least k − 1) or failure (the number of neighbors that remained the same is at most k − 2, i.e., less than k − 1).

2. λ is the average or expected number of successes within a time interval t_i for a mobile user i, E(X_{t_i}), and is assumed to be known and constant throughout the experiment.

3. Poisson describes the distribution of events. Each event is independent of the other events.

4. The probability for a mobile user i to have at least l different neighbors in a time interval t_i (the occurrence of a failure) is written as

P(X_{t_i} ≥ l) = 1 − P(X_{t_i} < l) ≈ 1 − (λ^l e^{−λ}) / l!

so that

lim_{l→k} P(X_{t_i} ≥ l) = lim_{l→k} (1 − (λ^l e^{−λ}) / l!).

Due to the random-movement assumption, it is highly certain that the indexes of all k nearest neighbors will change. Hence, the above probability will be very close to 1. If k is large enough, the previous limit will tend to 1.
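For intuition, the tail probability used in the proof can be evaluated numerically. The following sketch uses the plain Poisson PMF; the values of λ and l are arbitrary examples.

```python
# Numerically evaluate P(X >= l) for a Poisson variable with mean lam,
# illustrating the tail behavior used in the proof (values are examples).
from math import exp, factorial

def poisson_pmf(i, lam):
    return lam ** i * exp(-lam) / factorial(i)

def tail_prob(l, lam):
    """P(X >= l) = 1 - sum_{i < l} P(X = i)."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(l))

lam = 10.0                       # expected number of changed neighbors
for l in (1, 5, 10, 15):
    print(l, round(tail_prob(l, lam), 4))
# As lam grows relative to l, P(X >= l) approaches 1, matching the claim
# that under random motion almost all k neighbors change.
```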

5.2.8 Experiments Data and Environment

The experimental data used in this paper come from the SMaRT Database GIS Tool at http://www.bikerides.gr/thesis2/. We experiment on two trajectory data sets of 400 and 2000 bike riders, as shown in Fig. 5.3, in the area of Corfu, with 100 trajectory points each. For each trajectory point, the values of the four dimensions are available for 100 time-stamps, i.e., the Cartesian coordinates (x, y) and the angle and velocity (θ, v), respectively. The environment in which the experiments were carried out has the following characteristics: Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00 GHz, 16 GB memory, Windows 10 Education, 64-bit Operating System, x64-based processor and Matlab 2018a.

5.3 Discussion

Over the last decade, the problem of privacy preservation of location data has been of particular concern to researchers. Hence, many research works have been conducted to reinforce the security level that earlier techniques, such as the association of location data with pseudonyms, provide.

In a recent work [111], the authors recommend an asymmetric Private Equality Testing (PET) protocol which allows two users to communicate with each other safely and without the involvement of a Third Party. PET requires two public-key exponentiations per user and needs three rounds to complete. Both users compute their private input through hash functions and send it to each other. The output of the protocol at the end of the third round indicates whether the two keys are equal and, as a result, whether the users can communicate. This protocol can be used in Location-Based Social Services to find the location of users who are in the same region or within a specific radius, depending on the preference of the user. The security of the private inputs of the participants lies in the use of the Computational Diffie-Hellman and Discrete Logarithm problems. Also, the asymmetry does not reveal whether the computation and the equality test of the inputs were successful, which prevents observers from identifying whether the connection was established.

More to the point, the k-anonymity method has been used in order to reinforce and quantify location data privacy. If k-anonymity is achieved, a person cannot be distinguished from k − 1 other people [118]. In the context of k-anonymity, the authors in [118] propose an enhanced Dummy-Location Selection (DLS) algorithm for users in LBS. From a different perspective, the authors in [156] aim at the utility improvement of differentially private published data sets. More specifically, they show that the amount of noise required to fulfill ε-differential privacy can be reduced if the noise is added to a k-anonymous version of the data set, where k-anonymity is achieved through a specially designed microaggregation of all attributes. As a result of the noise reduction, the general analytical utility of the anonymized output is increased. Moreover, the authors in [154] attempt to protect users' private locations in location-based services by adopting the spatial cloaking technique, which organizes users' exact locations into cloaked regions satisfying the k-anonymity requirement within the cloaked region. They propose a cloaking system model, called "anonymity of motion vectors" (AMV), that

provides anonymity for spatial queries by minimizing the cloaked region of a mobile user using motion vectors. The AMV creates a search area that includes the nearest neighbor objects to the querier who issued a cloaked-region-based query. In addition, in [117], the authors suggest a clustering-based k-anonymity algorithm and optimize it with parallelization. The experimental evaluation of the proposed approach shows that the algorithm performs better in terms of information loss due to anonymization, and its performance is compared with existing algorithms such as KACA and Incognito.

In our case, we approach k-anonymity preservation as follows. Specifically, we investigate the impact of the used attributes (x, y, θ, v) on the robustness of the proposed methods, MCL and MWCL. Since, from time-stamp to time-stamp, the number of preserved nearest neighbors may not remain the same and may be less than k, the robustness is decreased; namely, the probability that one mobile user is identified from its anonymity set may be higher than the optimal value 1/k. The proposed MCL method adopts a simple K-Means clustering micro-aggregation technique that maintains k-anonymity, which is the aim of this work. However, the proposed approach has some limitations and drawbacks. Firstly, it works for numeric or continuous location data, but does not work for categorical data. Secondly, although there may exist natural relations among attributes, such as between the angle and velocity and the Cartesian coordinates, the proposed algorithm cannot incorporate such information to find more desirable solutions. Moreover, the focus is placed heavily on preserving the k nearest neighbors to guarantee the mobile users' anonymity. As a result, the algorithm lacks the ability to address other issues (e.g., l-diversity, t-closeness) which previous works address to find more desirable solutions. In addition, the proposed MCL algorithm tries to minimize the within-cluster sum of squares and maximize the between-cluster sum of squares, so that the number of records in each partition is greater than k. Using K-Means, there is no guarantee that the optimum is found; thus, the quality of the resulting anonymized data cannot be guaranteed. Like all greedy algorithms, the K-Means algorithm will reach a local, but not necessarily a global, minimum. Hence, the information loss is not minimized, even though the clusters are formed such that they contain at least k similar objects. In addition, the MCL algorithm assigns mobile users to the nearest cluster by squared Euclidean distance; using a different distance function may stop the algorithm from converging. Hence, as a first approach, we focused on this version of K-Means. In future work, various modifications of K-Means or other clustering algorithms will be used to investigate their impact on k-anonymity.

Finally, our aim is to extend this work to large-scale trajectory data and transfer the whole processing to a distributed computing environment based on Hadoop. Apache Spark is the most promising environment, with high performance in parallel computing, designed to efficiently deal with supervised machine learning algorithms, e.g., k-NN [138]. A future direction is to implement the above methods and experiments in Big Data environments [81], [175], [31] and to investigate their scalability and running-time performance under different data sets and combinations of the parameter k.

5.4 Results

5.4.1 Experiments Results

In this section, the results of the experiments conducted are presented in Figs. 5.4 to 5.9. We relied on real data to evaluate the performance, in terms of Vulnerability, of MWCL and MCL. We experiment on two data sets of size N = 400 and N = 2000, respectively. The parameters of the two experiments and their values are presented in Tables 5.3 and 5.4, respectively. Parameter cs refers to the number of clusters, and k to the number of nearest neighbors in terms of Euclidean distance.

Table 5.3 Parameters for the 1st experiment of N=400 trajectories

cs      k  Clustering attributes  k-NN attributes
2,5     5  x                      x
2,5     5  x, y                   x, y
2,5     5  x, y, θ                x, y, θ
2,5     5  x, y, θ, v             x, y, θ, v
2,5,10  5  x, y, θ, v             x, ∗, ∗, ∗

Concerning the MCL, we consider the attribute combinations shown in both tables. For a fair comparison, in MWCL the k-NN is applied to the same attribute combinations as well. We investigate k-anonymity gradually, by adding one new attribute each time. In the first four cases of both experiments and approaches (see Figs. 5.4, 5.5, 5.6, 5.7), we observe that the information of attribute x alone is not sufficient to make either method robust enough in terms of nearest neighbors' index changes.

Table 5.4 Parameters for the 2nd experiment of N=2000 trajectories

cs       k      Clustering attributes  k-NN attributes
5,10     15     x                      x
5,10     15     x, y                   x, y
5,10     15     x, y, θ                x, y, θ
5,10     15     x, y, θ, v             x, y, θ, v
5,10,15  15,30  x, y, θ, v             x, ∗, ∗, ∗

More to the point, in the combination (x, y), although Vulnerability dropped significantly, the usage of the attributes (θ, v) did not enhance it. In real data sets, many dimensions contain high levels of inter-attribute correlation. In this work, by definition, as described in the subsection "System Architecture", the attributes (θ, v) and (x, y) are correlated. This stems from the fact that the non-linear trajectories of the mobile users are approximated as linear ones between time-stamps. Specifically, the velocity is computed as

v_x = (x_{n+1} − x_n) / (t_{n+1} − t_n),   v_y = (y_{n+1} − y_n) / (t_{n+1} − t_n),   v = sqrt(v_x^2 + v_y^2),

while the angle is θ = tan^{−1}((y_{n+1} − y_n) / (x_{n+1} − x_n)).

The curse of dimensionality has remained a challenge for a wide variety of algorithms in data mining, clustering, classification and privacy, and seems to affect the performance of both methods in terms of Vulnerability. The experimental results suggest that the dimensionality curse is an obstacle to privacy preservation. It has been shown that increasing dimensionality makes the data resistant to effective privacy and to achieving the lower bound of k-anonymity, i.e., 1/k. However, in practice, we show that some of the attributes of real data can be leveraged in order to greatly ameliorate the negative effects of the curse of dimensionality on privacy. To obtain an even more accurate classification, we considered two more attributes in the computations, the angle and the velocity. However, it is doubtful whether a perfect classification can be obtained by carefully defining a few such features. In fact, after a certain point, which in our case is the (x, y) attributes, increasing the dimensions of the problem by adding new features degraded the performance of the k-NN classifier. As shown in the following figures, as the dimensionality increases, the Vulnerability performance improves until the optimal number of features is reached, i.e., 2. Further increasing the dimensionality does not ameliorate the Vulnerability performance.
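A small sketch of the velocity and angle computation between consecutive trajectory points, as given above; np.arctan2 is used instead of a bare tan^{−1} so that the angle sign matches the values in Table 5.1 (an implementation choice of this sketch, not necessarily the thesis' own).

```python
# Compute per-segment velocity amplitude and angle from consecutive
# trajectory points, following the formulas above (illustrative helper).
import numpy as np

def motion_pattern(x, y, t):
    """x, y, t: 1-D arrays of length L; returns (v, theta) of length L-1."""
    dx, dy, dt = np.diff(x), np.diff(y), np.diff(t)
    vx, vy = dx / dt, dy / dt
    v = np.hypot(vx, vy)          # velocity amplitude
    theta = np.arctan2(dy, dx)    # angle in (-pi, pi]
    return v, theta

# Example with the first trajectory points of Table 5.1:
x = np.array([21082.0, 21099.0, 21221.0])
y = np.array([56436.0, 56432.0, 56484.0])
t = np.array([1.0, 4.0, 11.0])    # seconds
print(motion_pattern(x, y, t))
```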

In the following figures we repeat the same procedure as previously, but we consider a larger data set of N = 2000 trajectories. From the cluster-number perspective, the number of clusters cs obviously affects only MCL, where we see that Vulnerability becomes a little worse as the clusters increase. This relates to the fact that not only is the average cluster size reduced, but also the composition of the clusters changes; thus, MCL becomes more sensitive to the change of the nearest neighbors of the mobile users inside each cluster. We focus on the last combination of attributes of both experiments (see Figs. 5.8 and 5.9). We observe that attribute suppression in the computation of the k nearest neighbors indexes makes MCL superior in terms of Vulnerability (that is, it attains lower values), which is the main issue in k-anonymity and thus in Privacy Preserving. Moreover, it is the combination that highlights MCL as the number of clusters increases. The fact that the clustering is based on all available attributes strengthens cluster homogeneity and better reflects real-world communities. More to the point, when the mobile users are camouflaged by their k nearest neighbors based on only one of the attributes, in this case x, less information is disclosed about them and their neighbors. Therefore, it is more difficult to break the security even if an intruder monitors historical data and tries to link the k − 1 public records of the nearest neighbors. This case keeps Vulnerability at a relatively high level for low values of k. It is obvious that the more nearest neighbors are used to camouflage the mobile users inside the cluster, the lower the Vulnerability becomes. Also, in terms of computational cost, k-NN is more time-effective when applied to lower-volume (inside a cluster) and lower-dimensional data. Not to mention that we avoid the dimensionality effect in classification and thus in k-NN performance. Finally, in Figs. 5.8 and 5.9 we also demonstrate the impact of the parameter k on Vulnerability. We observe that increasing k benefits both methods. This shows that the security of a mobile user is more vulnerable when protected by a low number of nearest neighbors.

Fig. 5.4 Both clustering and k-NN: (a) x and (b) (x,y) for N=400 trajectories, L=100 time-stamps and k = 5.

Fig. 5.5 Both clustering and k-NN: (a) (x,y,θ) and (b) (x,y,θ,v) for N=400 trajectories, L=100 time-stamps and k = 5.


Fig. 5.6 Both clustering and k-NN: (a) x and (b) (x,y) for N=2000 trajectories and L=100 time-stamps.

Fig. 5.7 Both clustering and k-NN: (a) (x,y,θ) and (b) (x,y,θ,v) for N=2000 trajectories and L=100 time-stamps.

Fig. 5.8 Clustering (x,y,θ,v) and k-NN x for N=400 trajectories, L=100 time-stamps for (a) k = 5 and (b) k = 15.

Fig. 5.9 Clustering with (x,y,θ,v) and k-NN in x. Figure (a) concerns k=15 while (b) k=30, for N=2000 trajectories and L=100 time-stamps.

5.4.2 Experiments Conclusions

In conclusion, in the context of this chapter we carried out research on Privacy Preserving based on real spatio-temporal data. This work proposes a k-anonymity model based on motion vectors that provides anonymity for spatial queries. Specifically, we investigated the problem of k-anonymity from a dimensionality perspective and how the combination of dimensions affects the Vulnerability of both methods. We observed that inter-attribute combinations, as well as attribute suppression within a record, have a powerful effect as the dimensionality increases. We demonstrated the effectiveness and efficiency of MCL, under a specific combination of dimensions, through intensive experiments. Finally, anonymization using clustering (based on all attributes), combined with attribute suppression in the computation of the k-anonymity set, is a viable solution for privacy preserving.


CHAPTER 6

Storage Efficient Trajectory Clustering and k-NN for Robust Privacy Preserving Databases

6.1 Introduction

The research area of moving object databases has become an emerging technological discipline and has consequently gained a lot of interest during the last decade, due to the development of ubiquitous location-aware devices, such as PDAs, mobile phones, GPS-enabled mobile devices, RFID, and road-side sensors. The technological achievements and advances in sensing and communication/networking, along with the innovative design features (thin and light) of computing devices and the development of embedded systems, have enabled the recording of a large volume of spatio-temporal data. Mobile object trajectories are among the wide variety of spatio-temporal data that are especially important to scientists. Indeed, they help in discovering movement patterns (individual or group) and knowledge which, in the recent literature, have been established as trajectory or mobility mining [100]. Also, database technology is evolving to support the querying and representation of the trajectories of moving objects (e.g., humans, animals, vehicles, natural phenomena). Hence, the main parts of trajectory data-mining include pre-processing, data management, query processing, trajectory data-mining tasks, and privacy protection [49].

Real-life applications, such as the analysis of traffic congestion, intelligent transportation, animal migration habits analysis, cellular communications, military applications, structural and environmental monitoring, disaster/rescue management and remediation, Geographic Information Systems (GIS), Location-Based Services (LBS), and other domains, have increased the interest in the area of trajectory data-mining and the efficient management of spatio-temporal data.

It should be noted that the explosive growth of social media has produced large-scale mobility datasets whose publication puts people's personal lives at severe risk. Indeed, users are used to sharing their most-visited or potentially sensitive locations, such as their home, workplace, and holiday locations, which are easy to obtain through social media. Nowadays, the amount of spatio-temporal data has been growing exponentially. Therefore, there is an urgent need to develop efficient methods for storing and managing this large amount of information. A plethora of studies have been conducted for handling mobile objects' trajectory data. More precisely, several of them attempt to reduce the storage size [60, 68, 155], while others investigate the privacy preservation of trajectory data [69, 133]. Nowadays, not only are storage-efficient spatio-temporal transformation schemes needed, but also secure querying on large-scale spatio-temporal data [183]. An accurate capture of a moving object's trajectory usually needs a high sampling rate to collect its location data. Thus, massive trajectory data will be generated, which is difficult to fit into memory for utilizing data-mining algorithms. A common idea is to compress the trajectory data to reduce the storage requirements while maintaining the utility of the trajectory. In the context of this work, we present the storage efficiency of dual methods and experiment on data from the SMaRT system, through which the data of moving object trajectories are generated and used as input to our methods in order to evaluate the security level they offer. More specifically, we summarize the main contributions of our paper as follows:

1. We compare the proposed methods on addressing k-NN queries on moving ob- jects’ trajectories data, which are stored both in dual and native dimensional space. Our implementation shows that the innovative method of Dual Trans- formation constitutes a practical solution that can provide secure k-NN queries.

2. We conduct an extensive experimental evaluation that studies various scenarios that can affect the vulnerability of the k-NN queries and proceed to a comparative analysis of the underlying methods. We prove the efficiency of our solution using real data drawn from SMaRT.

3. We recall two protocols for Pseudonym Recovery and Registration with the aim of reinforcing the individuals' privacy in the released data. An individual cannot be re-linked to specific users with a high degree of certainty, as described in Section 6.3.7.

The rest of the chapter is organized as follows: In Section 6.2, previous related works are presented in relation to our approach. Section 6.3 describes: (a) the dual transformation methods used; (b) the problem definition; (c) the problem formulation; (d) the privacy-preserving analysis; and (e) the experimental environment and the source of the datasets. Section 6.4 presents the graphical outcomes gathered from the experiments, while Section 6.5 evaluates the experimental results in relation to the pros and cons of the proposed methods. Finally, Section 6.6 records the conclusions on the studied problem and future directions of this work.

6.2 Related Work

In this section, we review existing related works in the domain of secure querying on spatio-temporal databases. Our discussion includes privacy-preserving approaches for trajectory-based queries. In recent years, trajectory databases have constituted an important research area that has received a lot of interest. Most researchers have focused on the querying of moving objects and their trajectories. The so-called trajectory-based queries are also gaining much interest. Queries based on trajectory data require knowledge of the whole, or at least a part, of the mobile objects' trajectory in order to be processed. Such queries may provide useful information about an object's average speed, travelled distance, and so forth. In [183], three common mechanisms in privacy-preserving trajectory publishing are described. Generalization and suppression are the most common ones used to implement k-anonymity. However, the main drawback of these mechanisms is that they suffer from a high possibility of information loss; thus, perturbation techniques based on randomization (e.g., adding noise) may be utilized as an alternative. Actually, the problem of secure querying on spatio-temporal data in combination with k-anonymity has gained much attention among researchers. Indeed, the authors in [166] describe historical k-anonymity, based on each mobile user's trajectory data history, known as Personal History Locations (PHL). According to PHL anonymity, a user U is camouflaged by k − 1 users whose PHLs have a common part with its own, rendering him/her indistinguishable among them. Privacy preservation is enforced as the generalization method has been applied. More specifically, by trying to preserve historical k-anonymity, the authors increased the uncertainty related to the user's real location data

at the time of the query by modifying the spatio-temporal information of the query. More precisely, in [136], by employing the kl-anonymity privacy model, the authors ensure that an intruder who has knowledge of any sub-trajectory TS of size l of a user's trajectory T^j cannot distinguish it among the k − 1 trajectories that protect it; based on TS, this is possible with probability at most 1/k.

In a more recent work [39], the authors investigated the privacy-preserving problem based on real spatio-temporal data. That paper employed the k-anonymity method and formed the anonymity set based on motion vectors, with the aim of executing secure spatial k-NN queries. More specifically, the problem of k-anonymity from a dimensionality perspective, and the impact of the used dimensions on the vulnerability of the suggested methods, was investigated. The experiments showed the effectiveness of the proposed method, such as clustering under a particular combination of attributes, and observed that it benefited from attribute suppression during the k-anonymity set computation. The authors in [53] suggested a novel spatio-temporal MySQL ReTrieval framework based on the MySQL and PostgreSQL database management systems. In the context of that work, the authors employed the Hough-X transformation so as to evaluate the efficiency of range queries on non-linear two-dimensional trajectories of mobile objects. Indeed, they demonstrated that the Hough-X dual approach, in combination with the range-tree variant, was quite efficient.

Generally, the trajectory of a mobile user is non-linear. However, it can be approximated by a discrete number of linear sub-trajectories with the use of a trajectory segmentation application. Each partition is represented by a line segment between two consecutive partition points and is expected to provide an effective and efficient way to obtain insights into the motion characteristics and behavioral preferences of mobile objects. Our approach performs low-rate sampling and considers linear interpolation between successive sampled points, where each line segment represents the continuous movement of the object between sampled points. The duality transformation of line segments operates as a pre-processing step and aims at increasing the security level and reinforcing the privacy of k-NN queries, which is the main subject of this work. Also, we have at our disposal the linear components of the initial trajectory, as well as the storage of the first and last spatial points in order to represent each line along with its dual representative, that is, the Hough-X (and/or Hough-Y) dual points. Lastly, this step will turn out to be useful from a storage perspective in Big Data applications, and will render the proposed methods a strong candidate for efficient querying on massive data, in combination with the appropriate indexing method.

6.3 Materials and Methods

6.3.1 Dual Transform for Moving Objects

In general, the geometric dual transform maps a hyper-plane h from Rm to a point in Rm, and vice versa. In this section, we briefly present how the duality transformation operates in a one-dimensional case. A line from the plane (t, y) or (t, x) is mapped to a point on the dual plane (see Figure 6.1).

1. Hough-X: The equation y(t) = ut + a is mapped to a point (u, a), where axes u, a represent the slope (that is, velocity) and intercept of an object’s trajectory, respectively. Thus, we get the dual point (u, a), the so-called Hough-X transform.

2. Hough-Y: The equation y(t) = ut + a can be rewritten as t = (1/u)y − a/u, a different dual representation, the so-called Hough-Y transform. The point in the dual plane is represented as (b, c), where b = −a/u (the intersection with the line y = 0) and c = 1/u.

It is worth mentioning that the Hough-X transform cannot represent vertical lines, while horizontal lines cannot be represented using the Hough-Y transform. Nonetheless, both transforms are valid since, in our setting, the velocity is bounded by [u_min, u_max], and thus the lines have a minimum and maximum slope.
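A minimal sketch of the two dual mappings above, assuming a 1-D linear segment x(t) = u·t + a estimated from two sampled points; the helper name and the example values are illustrative.

```python
# Dual (Hough-X / Hough-Y) points of a 1-D linear segment x(t) = u*t + a,
# with (u, a) estimated from two sampled points (illustrative helper).
def hough_duals(t0, x0, t1, x1):
    u = (x1 - x0) / (t1 - t0)     # slope = velocity; assumes t1 != t0
    a = x0 - u * t0               # intercept
    hough_x = (u, a)              # Hough-X dual point
    hough_y = (-a / u, 1.0 / u)   # Hough-Y dual point (b, c); assumes u != 0
    return hough_x, hough_y

# Example: object at x=21082 at t=1 s and x=21099 at t=4 s (cf. Table 5.1):
print(hough_duals(1.0, 21082.0, 4.0, 21099.0))
```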


Fig. 6.1 An overview of trajectory segmentation and Hough-X transformation for a linear trajectory segment (TS), which consists of M points. The dual points of M points in TS are the same, for example, a1 = ... = aM , u1 = ... = uM , where the left graph shows the y(t) line and the right graph shows the Hough-X points.

6.3.2 kNN Classification and Clustering in Dual Space

Here, we consider points in a dual space P. Given two dual points dp_1 and dp_2, we define as dist(dp_1, dp_2) the distance between dp_1 and dp_2 in P. In the context of this work, we utilize the Euclidean distance metric, which is defined as

dist(dp_1, dp_2) = sqrt( Σ_{i=1}^{p} (dp_1[i] − dp_2[i])^2 ),

where dp_1[i], dp_2[i] denote the values of dp_1, dp_2 along the i-th dimension in P. For example, in Hough-X space, the distance between the dual points dp_1 = (u_1, a_1) and dp_2 = (u_2, a_2) is computed as dist(dp_1, dp_2) = sqrt((u_1 − u_2)^2 + (a_1 − a_2)^2).

Definition 6.3.6. DukNN: Given a dual point dp, a data-set of dual points Y and an integer k, the k nearest neighbors of dp from Y , denoted as DukNN(dp, Y ), is a set of k points from Y such that ∀l ∈ DukNN(dp, Y ) and ∀q ∈ {Y − DukNN(dp, Y )}, dist(l, dp) < dist(q, dp).

Definition 6.3.7. DukNN Classification: Given a dual point dp, a training dual points data-set Y , and a set of classes ClY where the dual points of Y belong, the classification process produces a pair (dp,cldp), where cldp is the majority class to which dp belongs.

Definition 6.3.8. Clustering: Given a finite data-set of dual points DP = {dp_1, dp_2, ..., dp_N} in R^p and a number of clusters K, the clustering procedure produces K partitions of DP such that, among all K partitions (clusters) C_1, C_2, ..., C_K, it finds the one that minimizes

arg min_{C_1,...,C_K} Σ_{c=1}^{K} Σ_{dp ∈ C_c} || dp − (1/|C_c|) Σ_{dp_j ∈ C_c} dp_j ||^2,

where |C_c| is the number of dual points in cluster C_c.

Note that the aforementioned dual methods act as a feature extraction technique. More specifically, they extract the dual point of each of the x, y coordinates of a mobile user trajectory. The k nearest neighbors algorithm is then applied to the dual point features and returns the dual points whose distance from the query dual point is smaller than the distance of the rest of the training dual points. Considering the Hough-X transformation of attribute x or y, the search area is a circle centered at the query point with a radius such that k nearest neighbors exist. If we assume the Hough-X of both (x, y) attributes, the k nearest neighbor search area is four-dimensional (u_x, a_x, u_y, a_y), with a complex hypercube geometry.

6.3.3 Problem Definition

Here, we consider a database that records the location information of mobile objects in two-dimensional space over a finite area. Also, we assume that objects move with small velocities that lie in the range [u_min, u_max], starting from a specific location at a specific time-stamp, and move along a non-linear trajectory. In order to store and handle queries in an efficient way, a mobile object's trajectory is approximated by a series of linear ones, as depicted in Figure 6.2.

Fig. 6.2 A raw trajectory approximation with a discrete number of R linear sub-trajectories. In the dual space, each one is represented as a dual point; for example, the linear sub-trajectory [l(t_0), l(t_1)] is represented as the dual point dp_1, and [l(t_1), l(t_2)] as the dual point dp_2.

Definition 6.3.9. A linear trajectory is a straight line that an object keeps track of, starting from a location l(t_0) = [x_0, y_0] at time t_0. Then, its location for t > t_0 will be l(t) = [x(t), y(t)], or l(t) = [x_0 + u_x(t − t_0), y_0 + u_y(t − t_0)], where u = (u_x, u_y) is the object's velocity in each plane [53].

Definition 6.3.10. A trajectory partition or sub-trajectory segment is a line segment L_i L_j, where for i < j both points belong to the same trajectory and are connected in order to form a partition denoted by TS_i [113].

Definition 6.3.11. Characteristic points are the points where the trajectory changes rapidly.

Definition 6.3.12. The dual points array constitutes a set containing points of a trajec- tory that are represented in the dual space.

Definition 6.3.13. A compressed trajectory path is a subset of the trajectory’s points that indicate a significant change in the motion characteristics, that is, the speed or direction of a moving object.

Definition 6.3.14. Given a trajectory T of size |T| and a compressed trajectory T_c of T with size |T_c|, the Compression Ratio (CR) is |T| / |T_c|.

Authors in [158] claim that the compression ratio constitutes a common metric for evaluating the effectiveness of compression algorithms that can accurately reflect the change of a trajectory’s data size. It is influenced by the original signal’s data-sampling rate, as well as the quantization accuracy.

6.3.4 Problem Formulation

In the context of this study, the problem of privacy preservation when dealing with spatio-temporal databases goes one step further and builds on the work in [39]. The spatio-temporal data are the location data of a number of mobile users, along with the time-stamp of each position, as shown in Table 5.1. Through the SMaRT system, we have at our disposal offline trajectory data that also give us the Hough-X and Hough-Y transforms of the spatial data (x, y). Hence, for each database record per time-stamp, that is, each mobile user trajectory point, we can consider the values of four attributes

Table 6.1 An overview of the transformed spatio-temporal database.

ObjId  Timestamp            Ux     ax            Uy    ay           bx      wx              by    wy
1      2013-03-09 10:00:01  4.37   22,242,219.9  1.03  4,800,692.9  0.23    −5,093,637.76   0.97  −4,645,833.3
1      2013-03-09 10:00:04  13.4   22,242,156.2  5.83  4,800,651.2  0.075   −1,659,862.40   0.17  −823,641.2
1      2013-03-09 10:00:11  10.58  22,242,287.4  3.79  4,800,713.7  0.0946  −2,103,289.59   0.26  −1,267,515.2
1      2013-03-09 10:00:19  27.3   22,242,427.4  11    4,800,762    0.04    −814,740.93     0.09  −436,432.91
1      2013-03-09 10:00:21  27.3   22,242,427.4  11    4,800,762    0.04    814,740.9       0.09  −436,432.91
2      2013-03-09 10:00:03  2.92   22,256,723.4  7.32  4,804,052.4  0.3425  −7,622,165.55   0.14  −656,291.3
2      2013-03-09 10:00:08  1.15   22,256,709.8  5.75  4,803,996    0.87    −19,353,660.69  0.17  −835,477.56
2      2013-03-09 10:00:16  4.27   22,256,692.6  4.64  4,803,941.2  0.23    −5,216,411.92   0.22  −1,034,341.51
2      2013-03-09 10:00:25  4.6    22,256,639.5  0.23  4,803,925.9  0.22    −4,826,741.21   4.29  −20,588,283.27
2      2013-03-09 10:00:34  1.5    22,256,625.5  5.2   4,803,925.8  0.67    −14,837,750.3   0.19  −923,831.89

(x, y, θ, u) (as in Table 5.1), along with the values of an additional eight attributes (Ux, ax, Uy, ay, bx, wx, by, wy) (as in Table 6.1). So, we have chosen to anonymize the dual point attributes by employing the k-NN method, which enables us to form the k-anonymity set of each mobile object per time-stamp, as depicted in Table 5.2. The data anonymization is handled both as a clustering and as a no-clustering problem. In both approaches, the anonymity set is formed again by the k nearest neighbors' ids. For each mobile user i and per time-stamp l, we compute its k nearest neighbors' ids and keep them in a vector of the form knns_{il} = [id_{il1}, id_{il2}, ..., id_{ilk}] for l = 1, 2, ..., L. In Table 5.2, an example of such sets for N mobile users' dual points is presented. For each user, we measure the number of the k nearest neighbors' dual points that remained the same from one time-stamp to another. By employing the dual transformation methods described in Section 6.3.1, the k-anonymity set of mobile users is formulated based on their dual points. Hence, an alternative definition for k-anonymity is as follows:

Definition 6.3.15. (k_DUST-anonymity). A transformed database record is k-anonymous with respect to the Hough-X dual points, that is, the velocity and intercept attributes (Ux, ax) or (Uy, ay), if at least k − 1 discrete records at the same specific time-stamp τ have the same dual point attributes, so that no record among the k is distinguishable from its k − 1 neighboring records.

Remark 6.3.1. As we already mentioned in [39], k-anonymization intuitively hides each individual among k − 1 others. This means that linking cannot be performed with confidence greater than 1/k. Nevertheless, k-anonymity may not protect users against the unveiling of the dual point attributes.

6.3.5 System Model

Here, we consider a spatio-temporal database with N records, that is, N moving objects in the xy plane. Each record (x^j_i, y^j_i) represents the spatial coordinates of the mobile user j at time-stamp t^j_i, or point i of its trajectory j [186]. From the location coordinates (x, y), we can extract the corresponding dual points by employing the methods described in Section 6.3.1. Suppose a trajectories database T = {T^1, ..., T^N} of equal length L, in which each trajectory is represented via a sequence of L triples, that is, T^j = {(x^j_1, y^j_1, t^j_1), (x^j_2, y^j_2, t^j_2), ..., (x^j_L, y^j_L, t^j_L)}.

For each point i in trajectory j, we define in four-dimensional space a vector DP^j_i = (Ux_{ij}, ax_{ij}, Uy_{ij}, ay_{ij}), which denotes the dual points array. Hence, we can redefine and store the trajectory j as T^j = {DP^j_1, DP^j_2, DP^j_3, ..., DP^j_L}. The privacy preservation of k-NN queries in trajectory databases is addressed with the use of two different methods. The first one is entitled dual-based k-NN (DukNN), which applies k-NN directly onto the dual points, while the second one is called dual-based clustering k-NN (DuCLkNN). The main difference between these two methods lies in the fact that the latter is applied to clustered dual point data. The operations involved in addressing a k-NN query are thoroughly described in Algorithms 6 and 7, respectively.

Algorithm 6: DukNN
1: input: the number k of nearest neighbors
2: input: the number of mobile users N
3: input: the dual points array of the N users in L time-stamps
4: output: the k nearest neighbors indexes of the N users in L time-stamps
5: for i = 1 to L do
6:   for j = 1 to N do
7:     Apply k-NN to the dual points of all users in order to identify the set of k-NN indexes I^j_i of user j in time-stamp i
8:   end for
9: end for

Algorithm 7: DuCLkNN
1: input: the number k of nearest neighbors
2: input: the number of mobile users N
3: input: the dual points array of the N users in L time-stamps
4: output: the k-NN indexes of the N users in L time-stamps
5: Apply K-Means to the dual points (Ux, ax) of the N users for the L time-stamps
6: for i = 1 : L do
7:   for j = 1 : N do
8:     Apply the k-NN method between the dual point of user j and the dual points of the users inside the cluster C^j_i of user j in time-stamp i, and find the set of k-NN indexes I^j_i
9:   end for
10: end for
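Operationally, DukNN and DuCLkNN mirror MWCL and MCL of Chapter 5 but run on dual points; a compact Python sketch under the same illustrative conventions follows.

```python
# DukNN / DuCLkNN sketch on dual points (illustrative, not the original code).
# DP has shape (L, N, 4): one (Ux, ax, Uy, ay) vector per user and time-stamp.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def duknn(DP, k):
    """k-NN directly on the dual points, per time-stamp (Algorithm 6)."""
    L, N, _ = DP.shape
    I = np.empty((L, N, k), dtype=int)
    for i in range(L):
        nn = NearestNeighbors(n_neighbors=k + 1).fit(DP[i])
        I[i] = nn.kneighbors(DP[i])[1][:, 1:]    # drop the self-match
    return I

def duclknn(DP, k, K):
    """K-Means on (Ux, ax), then k-NN inside each cluster (Algorithm 7)."""
    L, N, _ = DP.shape
    I = np.empty((L, N, k), dtype=int)
    for i in range(L):
        labels = KMeans(n_clusters=K, n_init=10).fit_predict(DP[i][:, :2])
        for c in range(K):
            members = np.where(labels == c)[0]   # assumes > k members
            nn = NearestNeighbors(n_neighbors=k + 1).fit(DP[i][members])
            idx = nn.kneighbors(DP[i][members])[1][:, 1:]
            I[i][members] = members[idx]
    return I
```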

In the case of employing Algorithm 6 to run a k-NN query, we must focus on a specific time period during which we have at our disposal the dual points of all users' locations. Given that each user stays on the same sub-trajectory during the study period, privacy is preserved on that segment, since the k nearest neighbors remain unchanged. On the other hand, in the case of employing Algorithm 7, the clustering step comes first; we can again claim that the cluster composition remains the same, since the clustering method is applied in dual space and the mobile users keep the same dual point. As a result, the k nearest neighbors inside the cluster will remain the same. Hence, without loss of generality, in both cases privacy is piecewise preserved, except for the points of discontinuity (known as characteristic points), where the motion characteristics may change.

6.3.6 Vulnerability and Storage Efficiency

In this paper, we assume the mobile users’ trajectory on a real map with small velocities; thus, we use the Hough-X transform, since an object’s motion is mapped to the (U, a) dual point. To answer a k-NN query, the following steps are performed:

1. Decompose the k-NN query into 1D queries for the (t, x) and (t, y) projection.

2. For each projection, get the dual k-NN query by using a Hough-X transform.

3. Return the anonymity set, which contains the trajectory ids that satisfy the dual k-NN query in each projection.

In the following, the analysis focuses on the robustness estimation of the proposed approach based on Hough-X. Specifically, the ensuing steps are followed:

1. Split the initial trajectory into a number of linear sub-trajectories, each of which consists of the same number of M spatial points.

2. Apply Hough-X in each part.

Suppose that M is the number of points of the 1D trajectory that a dual point represents, and D is the number of dual points that describe the 1D trajectory projection (t, x) or (t, y) in dual space. Therefore, the whole trajectory has a length equal to DM spatial points, for which M ≫ D should hold. In the following, we camouflage a mobile user, who keeps track of a linear trajectory x(t) or y(t) or its corresponding dual point, with the k nearest neighboring dual points, which are very likely to remain the same in the next time-stamp. Actually, while users move on the linear sub-trajectory that corresponds to the same dual point, the k-NN set remains intact. Therefore, for as long as this happens, we can claim that the k-anonymity holds. Indeed, the privacy preservation is reinforced by a factor M, which lowers the so-called vulnerability level to 1/(kM).

We recall the spatial data security metric that we have already defined in [39] for the quantification and measurement of the robustness of our methods. In dual point space, the vulnerability remains equal to 1/k. Nonetheless, the vulnerability in the initial dataset is measured as follows. Since the points inside a sub-trajectory are protected by the same dual point, their vulnerability is considerably reduced to 1/(Mk); this entails that an intruder can distinguish the identity of a mobile user with probability equal to 1/(Mk). The same holds for all sub-trajectories. Hence, the vulnerability in each projection is defined as:

V_x = 1/(Mk),   V_y = 1/(Mk)   (6.1)

where V_x and V_y are the vulnerability measures based on Hough-X in the projections (t, x) and (t, y), accordingly. Next, the vulnerabilities of the two projections are combined, and the total vulnerability is written as in the following equation:

V_total = V_x · V_y · (M choose 2) = (1/(Mk)^2) · (M choose 2)   (6.2)

where (M choose 2) represents all combinations of the M points that correspond to 2 dual points of the initial trajectory.

Several trajectory compression approaches have been proposed, aiming at reducing the trajectory's size. An initial discrimination classifies the compression methods either as offline (after trajectory generation) or online (instantly, as objects move). Data compression constitutes a method that decreases the size of the data in order to limit the memory space and improve the efficiency of storage, processing, and/or transmission without loss of information. Various trajectory compression algorithms exist in the literature that try to balance the tradeoff between accuracy and storage size. We refer to some major ones, namely distance-based, velocity-based, semantic, similarity-based, and priority queue [68]. The proposed Hough-X based approach achieves trajectory compression suitable for either a single trajectory or a set of multiple trajectories. Without loss of information, Hough-X maps the spatial points of each linear sub-trajectory to their representative dual point. Compression can be achieved by applying the dimensionality transformation to increase the storage efficiency of the data. Suppose we reduce the three-dimensional data (x, y, t) to the Hough-X space of (t, x), that is, (Ux, ax). Storage space-saving is achieved through the number of available dual points D being less than the number of points M in the corresponding linear sub-trajectory; hence, for the whole trajectory, CR = MD/D = M, or CR = M : 1, where, for example, M spatial points correspond to one dual point, as shown in Figure 6.1. This conserves space and achieves more compression, as depicted in Figure 6.3, and is thus expected to have a greater impact on large-scale spatio-temporal databases.
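A small numeric sketch of Equations (6.1) and (6.2) and of the compression ratio; the parameter values below are arbitrary examples.

```python
# Evaluate the projection and total vulnerability (Eqs. 6.1, 6.2) and the
# compression ratio CR = M:1 for example values of M and k (illustrative).
from math import comb

def vulnerabilities(M, k):
    v_proj = 1.0 / (M * k)                    # Eq. (6.1), per projection
    v_total = v_proj ** 2 * comb(M, 2)        # Eq. (6.2)
    return v_proj, v_total

M, k = 100, 5                                 # points per dual point, k of k-NN
v_proj, v_total = vulnerabilities(M, k)
print(v_proj, v_total, f"CR = {M}:1")
```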

Potentially, by employing a dual method based on Hough-X, we could generate a trajectory codebook by applying the Hough-X transformation to all linear sub-trajectories of a given set of trajectories in a map region. In the training step, dual points that stem from the same linear part are similar and must be grouped into the same cluster; also, each cluster is assigned a single representative vector, called a dual code-vector. Hence, each trajectory inside the codebook is represented by its dual points.


Fig. 6.3 Theoretical curve of compression ratio for M = [10 100 1000 10000 100000].

At this point, we should note that the Hough method acts as a clustering one. Actually, K-Means is a popular method for both clustering and codebook design. In the coding step, each input dual-point vector is compressed to the nearest dual code-vector, referenced by a simple index. The index of the matched code-vector in the codebook is then transmitted to the decoder over a channel and is used by the decoder in order to retrieve the similar trajectory dual points from an identical codebook. The key operation is that only the index of the dual code-vector is stored and transmitted, rather than the entire code-vector. As a result, the recommended schema is space-compressed because of the duality, and is also more robust in comparison with the methods suggested in previous works [39, 53].
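The codebook scheme described above is essentially vector quantization of dual points; a minimal sketch with K-Means follows, where the codebook size and the synthetic data are illustrative assumptions.

```python
# Vector-quantization sketch of the dual-point codebook idea (illustrative).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
dual_points = rng.normal(size=(1000, 2))         # e.g., (Ux, ax) pairs

# Training: cluster dual points; centroids act as dual code-vectors.
codebook = KMeans(n_clusters=16, n_init=10).fit(dual_points)

# Coding: each dual point is replaced by the index of its code-vector.
indices = codebook.predict(dual_points)          # small integers to transmit

# Decoding: look the indices up in an identical codebook.
reconstructed = codebook.cluster_centers_[indices]
```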

6.3.7 Privacy Preservation Analysis

Privacy relates to individual data protection and the human right to be able to determine the information about oneself that is to be hidden. Privacy-preserving data management includes k-anonymity, a noted method for data anonymization before publication, which has also been studied in the context of trajectory data. The authors in [10] claim that, given a set of trajectories, the objective of data publication is to transform them into some k-anonymized form before release, so as not to put at risk the privacy of the individuals related to the data. In addition, they mention that an intruder, who knows a sub-trajectory of the original trajectory of an individual, may utilize it with the aim of extracting the whole trajectory of that person from the published data. Finally, they recognize an upper bound for the re-identification probability of the whole trajectory within the released data, namely 1/k, where the parameter k reflects the expected level of privacy.

Our solution transforms the original spatial point into a dual point using a bijective mapping, such as Hough. This technique allows for a k-NN search directly on the transformed points, thus providing stronger location privacy. Assuming an insecure Transformed Database Management System (TDBMS), possibly located at a third party (e.g., a service provider in the cloud), an attacker sees its environment. In particular, the attacker has access to the transformed database, to the queries upon the transformed data, as well as to the results. Also, we suppose that the attacker is aware of the dual transformation scheme and aims to retrieve the original database by executing the Hough-X and/or Hough-Y algorithms with respect to the size of the database. Nonetheless, in this work, we aim to prevent an attacker from obtaining the original database, as they may possess extra knowledge about it. To better evaluate the power of the transformation scheme, we taxonomize the attacks into different levels based on the possessed knowledge.

1. Level 1: The attacker only observes the transformed database.

2. Level 2: Besides the transformed database, the attacker knows a set of plain tuples of the original database, but does not know the corresponding encoded values of those tuples in the transformed database.

3. Level 3: Apart from the transformed database, the attacker observes a set of tuples in the original database, and thus knows the corresponding encoded values of those tuples.

A few cryptography-based approaches, such as homomorphic encryption (HE), verifiable computation (VC), and secure multi-party computation (MPC), have been designed in order to provide secure big-data processing in the Cloud [177]. However, other approaches, such as Asymmetric Key Cryptography and trusted Public Key Infrastructure, have been developed over the years in order to support privacy preservation in the spatio-temporal domain. The basic idea behind these techniques is to encrypt the identity of the user prior to sending it to the service provider. In this way, the service provider does not have any knowledge about the real identity of the individual who initiated the k-NN query. To prevent an external adversary from linking queries to the same mobile object, its pseudonym has to be secure. For this reason, we are concerned with pseudonym recovery and registration protocols, consisting of three entities, namely Users (U), an Identity Provider (IP), and Service Providers (SP). Recall that they are based on Brands' credentials and have been suggested by Brands in the context of "The New System", with the aim of making the communication more reliable and secure. We believe that the adoption of these protocols will reinforce the identity privacy of mobile objects and of spatio-temporal databases at large. For the sake of completeness, the main steps of these protocols, along with the privacy preservation properties they offer, are presented. The mobile user U performs the following protocol in order to retrieve a set of pseudonyms from the identity provider (IP):

Initially, user U chooses random values r_{1,1}, r_{1,2}, \ldots, r_{1,m}, e \in Z_q, where e is known only to user U, then computes the quantity t_1 = g_1^{r_{1,1}} g_2^{r_{1,2}} \cdots g_m^{r_{1,m}} g_{m+1}^{e} \in G_q and finally sends it to the IP (g_1, g_2, \ldots, g_{m+1} \in G_q).

Secondly, the IP receives the quantity t_1, selects random quantities r_{2,1}, r_{2,2}, \ldots, r_{2,m} and computes the product t = t_1 \cdot t_2, where t_2 = g_1^{r_{2,1}} g_2^{r_{2,2}} \cdots g_m^{r_{2,m}}.

Thirdly, user U creates the r_i according to the equation r_i = r_{1,i} + r_{2,i} for i = 1, 2, \ldots, m and computes the quantity t = g_1^{r_1} g_2^{r_2} \cdots g_m^{r_m}. Hence, the user creates m pseudonyms (P_i, sign(P_i)) and values s_i \in Z_q, such that P_i = (t f_0)^{s_i} for i = 1, 2, \ldots, m.
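As a toy illustration of the commitment arithmetic above, the following sketch works in a subgroup of prime order q of Z_p^* with p = 2q + 1. The tiny parameters and all names are assumptions chosen for readability, not secure values and not the thesis implementation.

    import scala.util.Random

    object PseudonymSketch extends App {
      val p = BigInt(23); val q = BigInt(11)       // toy primes, p = 2q + 1
      val g = Seq(BigInt(4), BigInt(9), BigInt(3)) // toy generators, m = 3
      val gM1 = BigInt(13); val f0 = BigInt(2)     // g_{m+1} and f_0 (toy values)
      val rnd = new Random(7)
      def rand(): BigInt = BigInt(q.bitLength, rnd) % q

      // commitment: product of g_i^{e_i} mod p
      def commit(gs: Seq[BigInt], es: Seq[BigInt]): BigInt =
        gs.zip(es).foldLeft(BigInt(1)) { case (acc, (gi, ei)) => acc * gi.modPow(ei, p) % p }

      val r1 = Seq.fill(3)(rand()); val e = rand() // user's randomness; e stays secret
      val t1 = commit(g :+ gM1, r1 :+ e)           // sent to the IP
      val r2 = Seq.fill(3)(rand())                 // IP's randomness
      val r  = r1.zip(r2).map { case (a, b) => (a + b) % q } // r_i = r_{1,i} + r_{2,i}
      val t  = commit(g, r)
      val si = rand()
      val Pi = (t * f0 % p).modPow(si, p)          // pseudonym base P_i = (t f_0)^{s_i}
      println(s"t1=$t1 t=$t Pi=$Pi")
    }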

A mobile user U registers a pseudonym (P_i, sign(P_i)) with a service provider SP_i by presenting the pseudonym (P_i, sign(P_i)) and disclosing the value r_i encoded in P_i. The user performs the following proof of knowledge with the service provider SP_i, provided that P_i ≠ 1, so that the tuple (P_i, sign(P_i)) is a valid one.

PK\{(\delta_1, \ldots, \delta_{i-1}, \delta_{i+1}, \ldots, \delta_m, \varsigma) : (\delta_1, \ldots, \delta_{i-1}, r_i, \delta_{i+1}, \ldots, \delta_m, \varsigma) = rep_{(g_1, \ldots, g_m, g_{m+1})}(P_i f_0^{-1})\}   (6.3)

Then, the service provider SP_i stores the tuple (P_i, r_i) and associates it with either a new or an existing user account. Through this protocol, the user demonstrates ownership of the pseudonym and proves that the disclosed value r_i is actually the value encoded in P_i. The privacy preservation lies in the following facts:

1. The service provider cannot find out any additional information about the quantities encoded in P_i, except for the disclosed value r_i.

2. The random set (r_1, r_2, \ldots, r_m, e) is created so that nobody (neither the user nor the identity provider) can control its final value.

3. The value e is randomly selected by the user so that it remains unknown to the IP. The user also computes a secret key (r_1 s_i, r_2 s_i, \ldots, r_m s_i, e s_i), one for each pseudonym (P_i, sign(P_i)); the user can prove knowledge of it without unveiling it.

4. The Discrete Logarithm assumption in a group G_q of prime order q, along with the values r_i \in Z_q (i = 1, 2, \ldots, m), ensures that any malicious user MU, irrespective of the level of knowledge they possess about the original and transformed database, even if they engage in a pseudonym recovery protocol with the IP and obtain a valid pseudonym (P, sign(P)), has only negligible probability of learning the values encoded in the public key P.

Suppose a mobile user has initiated a discrete number of k-NN queries, with a different pseudonym for each one. The unlinkability that the aforementioned protocols provide relates to the service provider's inability to link the different pseudonyms to that mobile user, or to validate, together with the IP, that they belong to the same user. Thus, the privacy of that user's identity is preserved.

6.3.8 Experimental Data and Environment

The experimental data used in this chapter were obtained from the SMaRT Database GIS Tool (http://www.bikerides.gr/thesis2/). The experiments were based on trajectory datasets of bike riders in the area of Corfu, Greece. For each trajectory point, the Hough-X and Hough-Y dual points, that is, the values of (Ux, ax), (Uy, ay), (bx, wx), and (by, wy), were available for L time-stamps. The environment where the experiments were carried out had the following characteristics: Intel(R) Core(TM) 2 Duo CPU E8400 @ 3.00 GHz, 16 GB of memory, 64-bit operating system, x64-based processor, and Matlab 2018a.

6.4 Results

In this section, we present several experiments based on a real dataset, with the parameters and relevant values given in Tables 6.2–6.4 and the results in Figures 6.4–6.7. Our aim is to evaluate the performance of Algorithms 6 and 7 in terms of vulnerability. We experimented on datasets of size N ∈ {87, 995, 1000}.

Table 6.2 Parameters for the experiments using only Hough-X of x, and Hough-X of both x and y, for N = 1000 trajectories, L = 10 time-stamps (Figure 6.4a,b) and N = 995 trajectories, L = 100 time-stamps (Figure 6.4c,d).

K    k    Clustering Attributes   k-NN Attributes   Clustering Attributes   k-NN Attributes
5    10   (Ux, ax)                (Ux, ax)          (Ux, ax, Uy, ay)        (Ux, ax, Uy, ay)
5    20   (Ux, ax)                (Ux, ax)          (Ux, ax, Uy, ay)        (Ux, ax, Uy, ay)
5    30   (Ux, ax)                (Ux, ax)          (Ux, ax, Uy, ay)        (Ux, ax, Uy, ay)
10   20   (Ux, ax)                (Ux, ax)          (Ux, ax, Uy, ay)        (Ux, ax, Uy, ay)

Table 6.3 Parameters for the experiment using Hough-X of x and suppressing Hough-X of y (Exp1) for N = 1000 trajectories, L = 10 time-stamps (Figure 6.5a,b) and using Hough-X of y and suppressing Hough-X of x (Exp2) for N = 995 trajectories, L = 100 time-stamps (Figure 6.5c,d).

K    k    Clustering Attributes   k-NN Attributes (Exp1)   k-NN Attributes (Exp2)
5    10   (Ux, ax, Uy, ay)        (Ux, ax, ∗, ∗)           (∗, ∗, Uy, ay)
5    20   (Ux, ax, Uy, ay)        (Ux, ax, ∗, ∗)           (∗, ∗, Uy, ay)
5    30   (Ux, ax, Uy, ay)        (Ux, ax, ∗, ∗)           (∗, ∗, Uy, ay)
10   20   (Ux, ax, Uy, ay)        (Ux, ax, ∗, ∗)           (∗, ∗, Uy, ay)

Table 6.4 Parameters for the experiment using (x, y) for clustering and Hough-X for k-NN for N = 87 trajectories and L = 100 time-stamps (Figure 6.7c,d).

K    k    Clustering Attributes   k-NN Attributes
3    8    (x, y)                  x
3    8    (x, y)                  (x, y)
3    8    (x, y)                  (Ux, ax)
3    8    (x, y)                  (Ux, ax, Uy, ay)


Fig. 6.4 Both clustering and k-NN: (a) (Ux, ax) and (b) (Ux, ax,Uy, ay) for N = 1000 trajectories, L = 10 time-stamps, (c) (Ux, ax) and (d) (Ux, ax,Uy, ay) for N = 995 trajectories, L = 100 time-stamps.

6.4.1 Vulnerability Evaluation in Hough Space

In the context of the proposed work, we focused on k-anonymity from a different perspective, as we employed the Hough-X transformation of the (x, y) spatial data to formulate the anonymity set. The number of clusters is denoted by K, while k refers to the number of nearest neighbors in terms of Euclidean distance. In Figures 6.4 and 6.5, both approaches achieve similar performance, which improves as k increases. Although for low values of k the vulnerability remains relatively high, the more nearest neighbors are utilized to form the anonymity set, the lower the vulnerability becomes. Actually, Figure 6.4b,d depicts that the use of the Hough-X transformation of the y attribute considerably improved the performance in both cases. To ameliorate the classification accuracy, we considered the Hough-X attributes of y as well.


Fig. 6.5 Clustering with (Ux, ax,Uy, ay) and suppressing k-NN: (a) (Ux, ax, ∗, ∗) and (b) (∗, ∗,Uy, ay) for N = 1000 trajectories, L = 10 time-stamps, (c) (Ux, ax, ∗, ∗) and (d) (∗, ∗,Uy, ay) for N = 995 trajectories, L = 100 time-stamps.

We should note that the information (Ux, ax) in combination with (Uy, ay) increased the robustness of both methods for the same number of nearest neighbors. This entails that the performance of the k-NN classifier improved, and thus the anonymity set shows less variation over time from one time-stamp to another. In the following, we employ the suppressing k-anonymity method for the composition of the k-anonymity set. In particular, we applied K-Means clustering that takes advantage of (Ux, ax, Uy, ay), while k-NN is applied either on (Ux, ax) or (Uy, ay). Here, the clustering method presents much better performance than the non-clustering one. However, for the same number of k, the performance of both methods is worse than in the first case, where we based it on the attributes (Ux, ax, Uy, ay) for both clustering and k-NN computation. The experimental results in Figure 6.5 present the performance of attribute suppression in terms of k-anonymity set computation.

Subsequently, a scenario with synthetic data derived from the real trajectory dataset was considered. More specifically, for each dual point, a number of copies M were generated, reflecting that these dual points correspond to the same linear sub-trajectory of the trajectory. Figure 6.6a,b shows that the DukNN (non-clustering) and DuCLkNN (clustering) methods have identical performance for M = 5, k = 10 and K = 5, and we verify that vulnerability is piece-wise preserved, except for the characteristic points. In Figure 6.6c, we compare the vulnerability in the Hough-X space of the x and y attributes with the one in the native dimensional space of x and y. We observe that the results in native space are better by almost 5% than the ones in Hough-X space. This may relate to the linear dependency of the dual space on the native one.


Fig. 6.6 Clustering with (Ux, ax) and k-NN with (Ux, ax):(a) Mobile User 10 and (b) Mobile User 100 for N = 995 trajectories, L = 50 time-stamps, (c) Vulnerability measure in dual Hough-X and native dimensional space of (x, y).

6.4.2 Vulnerability Evaluation in Hybrid Space

In this subsection, we consider the case where clustering takes place in the spatial coordinate space while the k-NN query is issued in Hough space. Here, the dataset concerns the compressed version of mobile users' trajectories as derived from SMaRT. Figure 6.7a,b depicts information about the initial trajectory length and the selected points, as well as the compression ratio per trajectory ID. Note that the compression ratio is computed as CR = (1 − SelectedPoints/InitialPoints) × 100%, where the number of selected points is 100. From the dataset, we exclude 13 trajectories whose length is much less than 100, the average length of the compressed trajectories. Another observation is introduced in Figure 6.7c,d, where the vulnerability in hybrid space has similar behavior and performance to the one in spatial coordinate space. This relates to the fact that Hough-X constitutes a linear transformation of the spatial coordinates. Again, the employment of the suppressing method, as shown in Table 7.4, further reduces the vulnerability of the k-NN query with the clustering method.
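The compression ratio above reduces to a one-line function; this sketch, with assumed names, reproduces the computation per trajectory.

    // CR = (1 - SelectedPoints / InitialPoints) * 100%
    def compressionRatio(initialPoints: Int, selectedPoints: Int): Double =
      (1.0 - selectedPoints.toDouble / initialPoints) * 100.0

    // e.g., a raw trajectory of 800 points compressed to 100 selected points:
    // compressionRatio(800, 100) == 87.5 (percent)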


Fig. 6.7 (a) Initial points per trajectory and (b) compression ratio for N = 87 trajecto- ries, L = 100 time-stamps. Clustering with (x, y) and k-NN: (c) x and (Ux, ax) and (d) (x, y) and (Ux, ax,Uy, ay) for N = 87 trajectories, L = 100 time-stamps.

6.5 Discussion

The Hough Transform is a robust method used in Image Analysis and Computer Vision. The core idea is to map data onto the dual parameter space and then interpret it through classification and clustering. The major role of the Hough Transform is to detect straight lines and compute their representatives, that is, dual points. The term "representative" is strictly connected with Clustering Using Representatives (CURE), an efficient data-clustering algorithm for large-scale databases. Compared with K-Means clustering, which has already been addressed in our previous work, CURE is more robust to outliers and able to identify clusters with non-spherical shapes and size variances. Trajectory Clustering (TRACLUS) [105, 128] is a characteristic algorithm which has been designed for partitioning trajectories, applying clustering among the partitions of different trajectories, and finally finding the representative sub-trajectory for each cluster, as presented in Figure 6.8. Several representative spatio-temporal clustering methods have been reviewed in a more recent work [150]. Nevertheless, in our work, we use this term with a different meaning, defining the notion of the representative point of a trajectory's sample points: the representative point of a number of spatial points that belong to the same line segment is their dual transformation point. Our framework can be combined with any existing clustering algorithm. As a preliminary approach, we chose K-Means, which applied on-line clustering on dual-point data for a number of mobile objects' trajectories at specific time-stamps.


Fig. 6.8 Trajectory partition, grouping, and representatives.

This work investigated the impact of Hough-X, which has already been applied to range queries, on the robustness of the methods proposed in [39] for addressing secure k-NN queries. The experimentation with the number of clusters K, which should be known in advance, obviously only affected the method utilizing clustering; there, vulnerability behaves slightly worse as the number of clusters increases. Indeed, when adding features, the data cluster density decreases and the model becomes sparser, and hence the clustering task becomes even more difficult. A usual phenomenon and important part of Machine Learning is the reduction of a higher-dimensional space into a lower-dimensional one in order to avoid the Curse of Dimensionality. An important property of Hough is its robustness to low-quality or uncertain data (either due to non-uniform sampling or noise) [106]. Therefore, even if a trajectory is represented by different sample points in 2D Euclidean space, in Hough space it may have the same dual points. Under this condition, Hough space reflects mobility patterns better than the original trajectory spatial data (x, y), leading to more homogeneous clusters and improving the k-NN performance. As the experimental results in Figure 6.6a,b verify, the above properties have a positive impact on vulnerability, which is piece-wise preserved, showing that the clusters exhibit within-cluster spatio-temporal similarity. The authors in [196] provide an efficient scheme for representative clustering on uncertain data. Finally, assuming feature suppression, the method with clustering demonstrates higher robustness, or lower vulnerability, which is the main issue in k-anonymity and thus in privacy preservation. This case shows the superiority of the method with clustering in terms of vulnerability. Indeed, when mobile users are protected by k nearest neighbors based on lower-dimensionality data than the ones used in clustering, it is more difficult for an attacker who has access to history data to link the public information of the k − 1 nearest neighbors (that is, unlinkability holds).

6.6 Conclusions

In conclusion, we carried out research on privacy preservation based on real spatio-temporal data, through which we demonstrated the impact of the parameters k and K on the vulnerability of the proposed methods. We observed that increasing k benefits both methods, verifying that the security of a mobile user is more robust when the latter is protected by a high number of nearest neighbors. This chapter proposes the application of k-NN queries based on dual points of the Hough-X projection, with the aim of reinforcing the anonymity of k-NN queries and decreasing storage requirements. More specifically, we investigated the problem from the perspective of dual-point attributes. The experimental results indicate that although the outcomes of the Hough-X-based vulnerability are not optimal in comparison with the spatial coordinate space, the difference between them is less than 5%, which still makes Hough-X an appropriate choice for storage-efficient privacy preservation.

The SMaRT framework approximates users' non-linear trajectories with linear ones from time-stamp to time-stamp, and the current results are concerned with the low data-sampling rate. A challenging and open issue is experimentation on the impact of the data sampling rate (low and high) on the described procedures and transformations. Also, we plan to extend and/or enhance the proposed methods to be applicable to 3D (x, y, z) trajectories, in order to represent real situations such as, for example, tracing the GPS trajectories of birds observed with devices or drones. In such a case, the dual methods can be applied to the z projection in the same way as to the x, y ones. Additionally, we intend to evaluate the efficiency and scalability of the suggested approaches on big spatio-temporal databases in a distributed environment, that is, in the cloud, and compare their performance with appropriate indexing methods. Our aim is to make SMaRT suitable for supporting k-NN queries based on the proposed methods.

Ultimately, it will be useful to evaluate the time of some transactions (e.g., a roll-back, where the end user gets lost and decides to return the same way to a certain point, or looks for another way), that is, how long the end user will take to receive the answer from the Database Management System (DBMS) with the aforementioned implemented procedures, compared to those used nowadays.

CHAPTER 7

Trajectory Clustering and k-NN for Robust Privacy Preserving k-NN Query Processing in GeoSpark

7.1 Introduction

There is no doubt that we live in the era of Big Data. Over the last decade, thanks to technological advances, information systems have favored automatic and effective data gathering, resulting in a considerable increase in the amount of available data. A wide range of data is produced daily: scientific, financial and health data, as well as data from social media, are just some examples of sources. However, this data is useless without the extraction of the underlying knowledge, a major challenge for researchers, as classical machine learning methods cannot deal with the volume, value, veracity and variety that big data brings [81]. Therefore, existing machine learning techniques, which must deal with the 4 Vs [44], have been or need to be redefined for efficiently processing and managing such data, as well as for obtaining valuable information that can benefit not only scientists but also businesses and organizations. Actually, the recent advances in distributed technologies can be utilized to enable scientists to rapidly discover hidden or unknown patterns in the 4 Vs [178]. Nonetheless, most of the existing methods fail to directly tackle the increased number of attributes and records of databases, due to their computational complexity. Hence, data mining techniques should be able to handle data scalability, dimensionality, uncertain data and/or data preprocessing. Data preprocessing constitutes an important step before the data mining process and, as a result, data mining algorithms are designed to accept the specific data formats that are best suited to them.

With the advances in wireless and mobile technologies, moving objects are equipped with location-positioning sensors and wireless communication capabilities; thus, large-scale spatio-temporal data is being generated, and an urgent need for efficient query processing approaches, dealing with both the spatial and temporal attributes, has arisen [149]. In particular, with the advent of mobile and ubiquitous computing, spatio-temporal query processing on moving objects databases has become a necessity for many applications, such as traffic control systems, geographical information systems, and location-aware advertisement. Hence, time-dependent versions of k nearest neighbor (k-NN) queries need to be studied. According to [63], k-NN queries can be distinguished into four categories: (i) both query and data objects are static, (ii) moving query but static data objects, (iii) static query but moving data objects and (iv) both query and data objects are moving. When a mobile object's location data changes, a Snapshot k-NN (SkNN) query is issued at each location. However, in highly dynamic spatio-temporal applications, where the moving objects' data varies frequently over time, a fundamental query is the so-called Continuous k-NN (CkNN) [76]. A CkNN query belongs to the fourth category, as it instantly retrieves the k nearest neighbor objects of a moving query object at each time instant within a given time interval. Practically, the CkNN spatio-temporal query is evaluated on a high number of consecutive timestamps and can be considered as a series of frequent SkNN queries. To the best of our knowledge, none of the existing approaches, such as [76, 45, 193] to name a few, addresses the problem of CkNN query processing on non-linear trajectories under the support of the Hough transformation. It is worth mentioning that, in previous works, the processing of CkNN queries has been considered in road networks with moving objects following linear trajectories.

In this chapter, our efforts are devoted to processing a CkNN query assuming that each object follows a non-linear trajectory. Such a trajectory is approximated as piece-wise linear among the timestamps where an object's velocity changes slightly; a storage-efficient strategy adopted in our previous work [38]. Some previous works investigate the problem of CkNN for moving objects with fixed velocity, while others consider uncertain velocity in road networks [77, 46]. Nonetheless, once an object's velocity changes, the CkNN query has to be re-executed. Such a case is quite frequent in highly dynamic environments; thus, the performance of these techniques would be significantly degraded, resulting in increased query cost (because of re-evaluation). The focus of this work is on k nearest neighbor queries, which are considered under both a spatio-temporal and a Hough-transformed database. The Hough method can mitigate such issues since it is piece-wise constant [38]; thus, the CkNN query is not evaluated as frequently as in the Euclidean space with non-linear trajectories. Actually, the application of k-NN queries in Hough space facilitates the maintenance and consistency of continuous query results [71], even when the moving objects' location data are updated. The aforementioned issue relates to the fact that the query result needs to be updated only upon specific changes, namely, whenever the moving angle direction and/or velocity vary dramatically. Nevertheless, while an object (including the query object) is moving on a linear part of its trajectory (maintenance phase), the query result will not change, meaning that the query results of all the objects will not need to be revised until the objects transit to the next linear parts of their trajectories, which indicates that the maintenance phase expires.

Moreover, with the extensive adoption of the Cloud, research on privacy preservation has appealed to many researchers [173, 180]. Let us recall that the ultimate purpose of the processing is to formulate the k-anonymity set to protect users' privacy in the cloud computing environment. Especially, in the context of this work, we are challenged with the necessity to apply temporally continuous spatial k-NN queries in a distributed architecture, with the ultimate goal of forming the anonymity set for a group of moving objects in a fast and efficient manner. In summary, the main contributions of this chapter are as follows:

• A continuous query processing algorithm is considered that efficiently answers the spatial k-NN query with the aid of an indexing method or otherwise. From the query result, in each timestamp and for all moving objects, the desired anonymity set is formulated.

• The adopted method is designed with the aid of GeoSpark spatial data operations to compute the k-anonymity set which is used to measure the possibility of each object identity being unveiled as it moves from one location to another. To be more specific, vulnerability evaluation is conducted in two-dimensional space (both Euclidean and Hough space), considering different pairs of features, as it will be demonstrated in Section 7.4.

• A comprehensive set of experiments is conducted to evaluate the time performance of the proposed method in the GeoSpark environment under different data sizes.

The rest of this chapter is organized as follows. In Section 7.2, previous related works are presented in relation to our approach. In Section 7.3, the following are described: (a) the problem definition, (b) the system model and (c) the proposed GeoSpark framework for the k-anonymity set computation. Section 7.4 presents the experiments conducted for the evaluation of the studied problem, while Section 7.5 provides a discussion of the results in relation to previous works. Finally, in Section 7.6, conclusions and future directions of this work are recorded.

7.2 Related Work

The domain of the efficient management of spatio-temporal data has different aspects and extensions which are worth studying, from storage and indexing to the time-efficient and robust issuing of spatio-temporal queries. It is pointed out here that, in the context of this work, we focus on time-efficient privacy-preserving spatio-temporal k-NN queries.

7.2.1 Distributed Frameworks for Spatio-Temporal Data Query Processing

Due to the explosive growth of spatio-temporal data, the domain of distributed execution of spatial queries has gained considerable attention. In [66], a novel framework, known as STARK, is recommended for spatio-temporal data management, which includes spatial partitioners, different modes for indexing, and filter, join, and clustering operators. In contrast to existing solutions, STARK is integrated into Spark programs and provides more flexible and comprehensive operators. Several application scenarios of this framework are presented in [67]. Moreover, the authors in [191] introduce a new abstraction called IndexTRDD to manage trajectory segments, which exploits a global and local indexing mechanism to accelerate trajectory queries. Also, they adaptively update the partition structure based on the change of data distribution to alleviate the partitioning overhead.

In [2], a scalable system for massive trajectory data management is elaborated, which modifies the three core layers of ST-Hadoop. The authors in [54] evaluate the distributed execution of spatial SQL queries in the GeoSpark and STARK systems. Moreover, in order to address the challenges of high-velocity location data, the authors of [131] propose a distributed in-memory spatio-temporal data processing system, which includes a distributed in-memory index and storage infrastructure built on a distributed in-memory programming paradigm. The location records are distributed across a cluster of nodes using the producer–consumer model.

7.2.2 Efficient Privacy Preserving k-NN Queries

In spatial databases, the processing of k-NN queries over stationary objects has been extensively studied. Over the last decade, thanks to technological advances, real-time spatio-temporal data of moving objects can be monitored and subsequently processed. Hence, continuous k-NN querying in real-time and dynamic environments has attracted the attention of researchers [179]. In [74], the authors attempt to develop an efficient algorithm to process k-NN queries on uncertain locations of objects. A probability model is designed to quantify the possibility of each object being one of the k nearest neighbors. The uncertainty of an object's location lies in the fact that its position is monitored by a sensor-based tracking system instead of GPS.

In a recent work, the authors study the problem of CkNN queries on moving objects to retrieve the k-NNs of all points along a query trajectory in spatial road networks [37]. A novel direction-aware CkNN algorithm, the so-called DACKNN, is recommended. To ensure an efficient query, the algorithm excludes from the analysis moving objects that are far away from the query point. In [169], a fast continuous-query privacy-preserving framework in road networks is recommended, based on the concepts of both k-anonymity and l-diversity.

The authors in [155] propose a method to make location cloaking less vulnerable to query tracking attacks. The proposed method is applied on road networks, such as subways, railways, and highways, where the road network is known and fixed, except for the trajectories. It is called adaptive-fixed k-anonymization and generates smaller cloaking regions without compromising the privacy of the query issuer's location.

Furthermore, the authors in [189] study the problem of location disclosure adopting the k-anonymity method in a centralized architecture based on a single trusted anonymizer. However, this strategy may compromise user privacy involving continuous LBSs. A dual-K mechanism (DKM) is suggested to protect the users' trajectory privacy for continuous LBSs. The proposed method first inserts multiple anonymizers between the user and the location service provider (LSP). The k-anonymization is achieved by sending k query locations to different anonymizers. To improve user trajectory privacy, the dynamic pseudonym mechanism is combined with the location selection one. Hence, the user trajectory (spatio-temporal points) can be obtained by neither the LSP nor the anonymizer.

Note that in our previous work [38], a pseudonym system is recommended to ensure and/or reinforce the privacy of mobile objects. Actually, a mobile user can initiate a discrete number of k-NN queries at each spatio-temporal point, with a different pseudonym for each one. The recommended protocols provide unlinkability; thus, the service provider cannot collude with the IP and validate that the pseudonyms belong to the same user. The main characteristic of this approach is that the k-NN queries can be issued not only in Euclidean space but also in Hough-X/Y space. As the analysis and results show, Hough space is an appropriate solution, as it preserves a user's privacy and provides storage efficiency as well, assuming that the initial non-linear trajectory of a moving object is split into a set of linear sub-trajectories.

In this chapter, the problem of k-anonymity for privacy preservation in spatio-temporal databases is evaluated in a distributed environment. Although traditional privacy-preserving solutions have been designed in Euclidean space, our framework considers the concept of k-anonymity in Hough space as well. Due to the constant evolution of the mobile objects' location information in time, it is required to evaluate massive single-query-point k-NN queries for massive numbers of mobile objects per timestamp. The spatio-temporal k-NN queries are issued with the aim of formulating the k-anonymity set of moving objects. This set is computed online based on all objects' trajectory points in each timestamp and consists of the ids of the k nearest objects. Specifically, in each timestamp, a k-NN query, called Snapshot Trajectory Point k-NN (STkNN), is issued taking into consideration the selected features (e.g., Euclidean coordinates, angle and velocity, dual points) of all the objects. Assuming a high time sampling rate, we can claim that the process is similar to a Continuous Trajectory Point k-NN (CTPkNN) query. As we have already mentioned, an important characteristic of a continuous k-NN query in Hough space is that the k-NNs between two consecutive spatio-temporal points remain the same. Based on this characteristic, the problem of performing repetitive queries can be considerably reduced to finding the k-NNs in the specific spatio-temporal points where a mobile object's velocity varies, indicating a new linear sub-trajectory of the initial non-linear trajectory. Unlike our and other previous works, here, the key idea is to evaluate the proposed method for the k-anonymity set computation in an environment suitable for efficient k-NN queries on spatial or dual-point data.

7.3 Materials and Methods

This section provides the necessary background knowledge for the remainder of the chapter. Initially, the k-NN algorithm, a core component of the adopted privacy-preserving methodology, along with its weaknesses in tackling big data problems, is presented. In the following, useful definitions and notations are recorded under the problem definition, with the most characteristic being the spatial indexing and partitioning methods. Moreover, the GeoSpark components for CkNN query processing, with the aim of formulating the k-anonymity set, are described in detail.

7.3.1 Operations on Spatial Data

Querying spatial data is an operation that is usually coupled with indexing methods. Several indexing methods have been considered in the literature, as they are crucial for the performance of spatial data query processing algorithms, since they are used to reduce the query run time. The most representative are the ones based on the R-tree and the Quad-tree. Further, to support efficient query processing on moving objects, grid-based space partitioning methods can be adopted [43]. Moving objects' data are indexed in the grid cells they belong to, facilitating the queries and avoiding checking all the objects.

The aim of a spatial partitioning technique is to improve the query time as well as to keep all the partitions balanced in terms of memory and computations, which is known as load balancing [185]. Equal-grid partitioning uniformly divides the whole region, thus providing good data locality but not load balancing. Also, the Quad-tree is another data structure, based on the divide-and-conquer principle, that recursively divides two-dimensional space into four quadrants and needs a merging operation to construct the specified number of partitions. On the other hand, the R-tree provides an efficient data partitioning strategy to index spatial data; it is a balanced search tree that improves both search speed and storage utilization. Another space partitioning strategy is based on the KDB-tree, a balanced binary tree, which has been used for load balancing in spatial databases for fast querying. It is worth mentioning that data locality and load balancing are important for speeding up query performance [51].

7.3.2 The k-NN Classifier from Big Spatial Data Perspective

The k-NN algorithm is a popular non-parametric method that can be used for both classification and regression tasks. In the following, we discuss the k-NN classification problem from the big spatial data viewpoint. The main components of the k-NN classifier are:

• TR, a training mobile objects dataset of size N,

• TS, a testing mobile objects dataset of size M,

• o_n, a mobile object represented as a tuple of the form (f_{n1}, f_{n2}, \ldots, f_{np}, cl), where f_{np} is the value of the p-th feature of the n-th object and cl is the class it belongs to, denoted as o_n^{cl}, and

• cl, which is only known for the TR dataset.

In the classification process, for each test object t ∈ TS, the k-NN algorithm searches for the k closest objects in the TR set, computing the distances (specifically, the Euclidean distance) between the test mobile object and all the mobile objects in TR. The distances from all training objects are ranked in ascending order, and then the k nearest objects (knn_1, knn_2, \ldots, knn_k) are kept to find the dominant class cl. Despite its remarkable performance in real-world applications, k-NN lacks the scalability to manage large-scale datasets. The time complexity to find the k nearest neighbors for a single test mobile object is O(Np), with an extra O(N log N) for the distance sorting process; here, N is the size of the training dataset and p is the number of object features. The classification process needs to be repeated for all the test mobile objects. Additionally, the k-NN model requires the training data to be stored in memory in order to achieve fast computation of the distances. However, large-scale TR and TS sets may not fit in RAM.
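A minimal single-machine sketch of this classification step, assuming in-memory datasets and illustrative names, makes the O(Np) distance pass and the O(N log N) ranking explicit.

    case class LabeledObject(features: Array[Double], cl: String)

    // Classify one test object against the training set TR by majority vote
    // among its k nearest neighbors (Euclidean distance).
    def classify(tr: Seq[LabeledObject], test: Array[Double], k: Int): String = {
      def dist(o: LabeledObject): Double =
        math.sqrt(o.features.zip(test).map { case (a, b) => (a - b) * (a - b) }.sum)
      tr.sortBy(dist)                  // O(N log N) ranking of all training objects
        .take(k)                       // keep the k nearest neighbors
        .groupBy(_.cl)                 // group them by class label
        .maxBy(_._2.size)._1           // return the dominant class
    }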

7.3.3 Problem Definition

A spatio-temporal database, whose records are moving objects with geolocation attributes in a two-dimensional space D, is assumed. In real-world examples, objects move arbitrarily, which opposes the fixed-velocity assumption that some works adopt. Hence, it is considered that the objects' velocity is low, with values between a minimum and a maximum, i.e., u ∈ [u_min, u_max]. In addition, we consider a low data sampling rate, since with a high data sampling rate it would be difficult to apply linear interpolation between sampled points, and the adopted dual methods would have to be redesigned with appropriate curve fitting methods; in that case, the velocity calculated from two consecutive way points could not represent the real velocity well, and the Euclidean distance would not work. In the following, an overview of the studied problem's definitions, along with notation, is presented.

Definition 7.3.16. The trajectory of a moving object is assumed to be a continuous piece-wise linear function, which maps the temporal dimension to the two-dimensional Euclidean space, connecting a sequence of points (x_1, y_1, t_1), (x_2, y_2, t_2), \ldots, (x_L, y_L, t_L) for t_1 < t_2 < \ldots < t_L.

Such a representation entails that the n-th object's position at time t_i is pos_n^{t_i} = (x_i, y_i) for n = 1, 2, \ldots, N, and that during each time interval (t_i, t_{i+1}), the object moves along a straight line from (x_i, y_i) to (x_{i+1}, y_{i+1}) with a constant velocity (amplitude and direction). To ensure efficient storage and handling of the queries, the expected position of the object at any time t ∈ (t_i, t_{i+1}), where 1 ≤ i ≤ L − 1, is obtained by a linear interpolation between (x_i, y_i) and (x_{i+1}, y_{i+1}). In this way, a number of additional features can be extracted, such as the velocity u, the angle direction θ, and the Hough-X and/or Hough-Y transformation of (x, y) [38].

Definition 7.3.17. A moving object's trajectory (spatio-temporal) snapshot is defined as the location data of that object at a specific timestamp; thus, a single trajectory is stored as a collection of location snapshots denoted as \{pos_n^{t_i}\}_{i=1}^{L}.

Definition 7.3.18. Snapshot Distance: Given two objects o_1 and o_2 with location snapshots pos_1^{t_i} and pos_2^{t_i}, respectively, the l_2-norm (Euclidean) distance between o_1 and o_2 in D at timestamp t_i is computed as

dist_{o_1,o_2}^{t_i} = \sqrt{\sum_{j=1}^{D} \left( pos_1^{t_i}[j] - pos_2^{t_i}[j] \right)^2 }.   (7.1)

(Without loss of generality, other distance measures can also be considered, such as the Manhattan distance (l_1-norm), as well as the maximum distance (l_∞-norm).) The distance between two objects in our model is the Euclidean distance.

Spatial k-NN queries are among the most common search problems and will be employed in our study. Generally, given a spatial region, a range query on spatial data identifies all the spatial points that lie inside this region. For spatio-temporal queries, a time interval is also given, and the timestamps of the resulting trajectories in that region need to also fall in that time interval. A spatial k-NN query takes as input a query center point along with a set of spatial objects' location data, in order to find the k nearest neighbors around the center point.
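The snapshot distances above translate directly into code. This sketch, with illustrative names, implements the l_2-norm of Eq. (7.1), together with the l_1 and l_∞ variants mentioned in the definition.

    type Snapshot = Array[Double]   // pos^{t_i} as a D-dimensional vector

    def l2(p1: Snapshot, p2: Snapshot): Double =
      math.sqrt(p1.zip(p2).map { case (a, b) => (a - b) * (a - b) }.sum)  // Eq. (7.1)

    def l1(p1: Snapshot, p2: Snapshot): Double =
      p1.zip(p2).map { case (a, b) => math.abs(a - b) }.sum               // Manhattan

    def lInf(p1: Snapshot, p2: Snapshot): Double =
      p1.zip(p2).map { case (a, b) => math.abs(a - b) }.max               // maximum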

Definition 7.3.19. k-NN: Given a moving object mo, a dataset O and an integer k, the k nearest neighbors of mo from O, denoted as kNN(mo, O), form a set of k objects from O such that ∀o ∈ kNN(mo, O), ∀s ∈ O − kNN(mo, O), dist_{o,mo} ≤ dist_{s,mo}.

Definition 7.3.20. Snapshot Trajectory Point k-NN (STPkNN) query: Given a query center point q at timestamp t_i, an integer k and a dataset of trajectory (spatio-temporal) points denoted as S^{t_i} = \{traj_1^{t_i}, traj_2^{t_i}, traj_3^{t_i}, \ldots, traj_N^{t_i}\}, a k-NN query, denoted as STPkNN(q, S^{t_i}, k), asks for the k spatial points of S^{t_i} whose l_2 distance from the query point q is less than that of the rest of the points of S^{t_i}.

Definition 7.3.21. Continuous Trajectory Point k-NN (CTPkNN) query: Given a query center point q at timestamp t_i, an integer k and a dataset of trajectory (spatio-temporal) points denoted as S^{t_i} = \{traj_1^{t_i}, traj_2^{t_i}, traj_3^{t_i}, \ldots, traj_N^{t_i}\} over a time interval t = \{t_i\}_{i=1}^{L} for L → ∞ (practically, large enough), a continuous k-NN query, denoted as CTPkNN(q, S^{t}, k), asks for the k spatial points of S^{t_i} whose distance from the query point q is less than that of the rest of the points of S^{t_i}, for all consecutive t_i ∈ t.

Note that traj_1^{t_i} ≡ pos_1^{t_i}. Considering the previous definitions, the problem formulation for the application of robust CTPkNN queries (see Figure 7.1) will be presented in the following subsection. In the context of the proposed approach, all points q ∈ S^{t_i} are selected as query objects in order to acquire the k nearest neighbors of all mobile objects in each timestamp t_i. If the process is repeated for a high number of consecutive timestamps in a selected time period, then the collection of STPkNN queries constitutes a CTPkNN query, which is similar to our case.


Fig. 7.1 An Overview of the Continuous Trajectory Point k Nearest Neighbor (CTPkNN) Query.

7.3.4 Problem Formulation

The problem of robust spatio-temporal databases is addressed in an environment suitable for Big Spatial Data Management. The k-anonymization approach adopted in [40] for preventing the revelation of mobile objects' identities is related to the k-NN algorithm described in Section 7.3.2. Specifically, the k-anonymity set is formulated from the unique object identifiers, denoted as id, of the k nearest neighbors, exploiting as a result the spatio-temporal data of a set of mobile objects.

Through the SMaRT system, a set of mobile users' trajectory data is recorded per timestamp, that is, the mobile user's trajectory id and the values of longitude and latitude are recorded. From these location features, the four attributes (x, y, θ, u) (as presented in Table 5.1), along with the Hough-X and Hough-Y values of (x, y) [38], (Ux, ax, Uy, ay, bx, wx, by, wy) (as presented in Table 7.1), are computed. Employing the k-NN method on different pairs of features thus enables us to form the k-anonymity set of each mobile object per timestamp, as depicted in Table 5.2.

For each mobile user i and per timestamp l, the ids of the k nearest neighbors are computed and kept in vector form knns_{il} = [id_{il1}, id_{il2}, \ldots, id_{ilk}] for l = 1, 2, \ldots, L, as presented in Table 5.2. For each user, the number k_s of the k nearest neighbors that remain the same from one timestamp to the next is computed in order to estimate the vulnerability ratio 1/k_s. Hence, a higher k_s is associated with a lower probability (i.e., lower vulnerability) of a moving object's identity being unveiled.
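The sketch below, with assumed names, computes k_s and the vulnerability ratio from two consecutive neighbor vectors knns_{il} and knns_{i,l+1}.

    // k_s = number of neighbors kept between consecutive timestamps;
    // vulnerability ratio = 1 / k_s (taken as 1 when no neighbor is kept).
    def vulnerability(knnsPrev: Seq[Long], knnsCurr: Seq[Long]): Double = {
      val ks = knnsPrev.toSet.intersect(knnsCurr.toSet).size
      if (ks == 0) 1.0 else 1.0 / ks
    }

    // e.g., vulnerability(Seq(1L, 2L, 3L, 4L), Seq(2L, 3L, 5L, 6L)) == 0.5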

Definition 7.3.22. k-anonymous database: A database is k-anonymous, i.e., its records are k-anonymous with respect to the selected features, if, at the same specific timestamp τ, k − 1 discrete records have at least the same nearest neighbors, so that no record is distinguished from its k − 1 neighboring records.

7.3.5 System Model

A spatio-temporal database is considered with N records, that is, N moving objects in the xy plane. Each record (x_i^j, y_i^j) represents the spatial coordinates of a mobile user at timestamp t_i^j, that is, point i of its trajectory j [186]. From the location coordinates (x, y), we can extract the corresponding velocity and angle-direction features (u, θ) and the dual points (Ux, ax), (Uy, ay), (bx, wx), (by, wy) by employing the dual methods described in [38]. Let us assume a trajectory database T = \{T^1, \ldots, T^N\} of equal length L, in which each trajectory is represented via a sequence of L triples. For each point i in trajectory j, we define a two-dimensional feature vector F_i^j, which captures the selected features' data. Hence, we can define and store the trajectory as T^j = \{F_1^j, F_2^j, F_3^j, \ldots, F_L^j\}.

In the context of this work, the time performance of the k-anonymity set formulation, with or without the aid of an indexing method, for spatial k-NN queries on trajectory data employing the Snapshot k-NN query on trajectory points, as depicted in Figure 7.2, is evaluated. Given a set of trajectories represented as sequences of spatio-temporal points, along with a query point, the STPkNN algorithm finds the query point's k nearest spatial points from the set of trajectory points at the corresponding timestamps. The STPkNN query is issued over a set of moving objects, executing the classical k-NN over a time period, and updates the k-anonymity set from timestamp to timestamp for all objects.
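One possible in-memory layout for the feature vectors F_i^j described above is sketched next; the field names are assumptions for illustration only.

    // A per-point record: trajectory id, timestamp index, and the two
    // selected features (e.g., (x, y) or (Ux, ax)).
    case class FeaturePoint(id: Long, i: Int, f1: Double, f2: Double)

    // T^j = {F_1^j, ..., F_L^j}: a trajectory as a fixed-length sequence.
    type Trajectory = Vector[FeaturePoint]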


Fig. 7.2 An Overview of Spatio-Temporal Data Partitioning and Indexing.

However, some objects may not move with the same velocity, leading to changes in the query result, i.e., in the k-anonymity set. As a result, for privacy preservation, it is important for the k-anonymity set to remain the same or vary slowly, as otherwise the vulnerability measure of the moving objects is affected. Note that, for a given STPkNN query q, the k-anonymity set, denoted as A, should always satisfy the following conditions:

1. The first condition |A| = k ensures that the anonymity set contains the ids of k objects.

2. The second condition, ∀a′ ∈ (S − A), dist(q, a′) ≥ max\{dist(q, a) | a ∈ A\}, ensures that the ids in A are those of the k nearest objects to q (a small checker sketch follows the list).
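A small checker for these two conditions, under assumed names and using the Euclidean snapshot distance, could look as follows.

    def dist(p1: Vector[Double], p2: Vector[Double]): Double =
      math.sqrt(p1.zip(p2).map { case (a, b) => (a - b) * (a - b) }.sum)

    // A is valid iff it has exactly k members and no outside point is
    // strictly closer to q than the farthest member of A.
    def isValidAnonymitySet(q: Vector[Double], a: Set[Vector[Double]],
                            s: Set[Vector[Double]], k: Int): Boolean =
      a.size == k && {
        val radius = a.map(dist(q, _)).max
        (s diff a).forall(dist(q, _) >= radius)
      }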

Assuming a mobile object issues a k-NN query on the selected features at a specific timestamp, spatial k-anonymity ensures that an attacker, acting as the query issuer, cannot identify the mobile object with probability larger than 1/k, where k is a user-defined anonymity parameter. Moreover, a spatio-temporal database is expected to handle a high number of moving objects' location data, as well as a large number of consecutive k-NN queries. Hence, an efficient consecutive STPkNN processing algorithm is very important. To this end, the STPkNN query is investigated in the GeoSpark framework, and experimental evaluations are conducted using a realistic dataset to demonstrate the time performance of the k-anonymity set computation under the STPkNN query. Ultimately, the vulnerability behavior for different pairs of features is investigated.

7.3.6 GeoSpark System Overview

In this subsection, the necessary structures of the GeoSpark-based approach are presented. All parts were implemented in Apache Scala due to its compatibility with the Apache Spark framework. The core components, described below, are implemented on GeoSpark Resilient Distributed Datasets (RDDs), and the interaction with GeoSpark is carried out through the Apache Zeppelin interface.

7.3.6.1 GeoSpark Architecture

GeoSpark [78] is an extension of the Apache Spark core and provides additional tools to manage geospatial data, e.g., geospatial datatypes, indexes and operations. The ar- chitecture of GeoSpark, as introduced in Figure 7.3, consists of the following three layers:

1. The Apache Spark Layer consists of all the components present in Spark and performs data loading and querying.

2. The Geospatial RDD Layer extends Spark and supports three types of RDD, i.e., Point, Rectangle and Polygon RDD. In addition, it contains a geometrical opera- tions library for every RDD.

3. The Geospatial Query Processing Layer is used to perform different types of geospatial queries.

Fig. 7.3 An Overview of GeoSpark Layers.

GeoSpark provides Spatial RDDs that allow efficient loading, transformation, partitioning, in-memory storing as well as indexing of complex spatial data from different input data sources, such as CSV, WKT, GeoJSON and Shapefile, which, through the GeoSpark Spatial SQL interface, are compatible with Spark. It also provides eight types of spatial objects, namely Point, Multi-Point, Polygon, Multi-Polygon, Line String, Multi-Line String, GeometryCollection, and Circle. Hence, spatial objects in a Spatial RDD can consist of different geometry types. Furthermore, it supports three types of spatial objects, namely Point, Rectangle and Polygon, for which the corresponding spatial RDDs can be defined. These structures can be used as input to several spatial queries, such as spatial k-NN, spatial range and spatial join. In the following, taking advantage of the GeoSpark architecture, the main parts of our approach are presented.

7.3.6.2 Spatial RDD Structures Preparation

Initially, the trajectory data of all mobile users are stored in a csv file. Specifically, these data consist of trajectory (spatio-temporal) points for a number of mobile objects, i.e., bike riders. For each bike trajectory (spatio-temporal) point, we have at our disposal the bike rider id, the spatial coordinates (x, y), the polar coordinates (velocity, direction) (u, θ), the Hough-X/Y attributes, as well as the timestamp, with the aim of computing the k nearest neighbors by selecting two features among them at each time. Practically, all bike riders follow non-linear trajectories. However, in our case, each user's trajectory is approximated by linear sub-trajectories, as in our previous work [38]; thus, the values of the attributes are not randomized, so that groups of bikes have similar behavior. The user cannot control attribute values, but they may ask to simulate a specific timestamp. In GeoSpark, first the corresponding SRDD is created and then the temporal partitioning is implemented. Information about the road network that describes the situation of the region is not taken into consideration. In Apache Zeppelin, the input data are loaded from the csv file as a Dataframe, and then SQL queries can be executed on the created Dataframe in order to recover the spatial data of moving objects at specific timestamps. Then, these data are stored and transformed into a PointRDD. In this way, several PointRDDs are created, which from now on will be called BikesRDD, as they concern bike riders' trajectories (spatio-temporal points). GeoSpark's operator permits us to transform a set of raw data into a PointRDD, selecting the columns that correspond to the desired features at a specific timestamp, as depicted in Table 7.1 (a small sketch of this loading step follows the table).

Table 7.1 The Different Types of Point Resilient Distributed Datasets (RDDs) According to Selected Features.

Features                    Type of PointRDD
(x, y)                      Spatial Points PointRDD
(u, θ)                      Polar Points PointRDD
Hough-X of x: (ax, Ux)      PointRDD
Hough-X of y: (ay, Uy)      PointRDD
Hough-Y of x: (bx, wx)      PointRDD
Hough-Y of y: (by, wy)      PointRDD
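As a sketch of the loading step described before Table 7.1, the snippet below reads the csv file into a Dataframe and recovers the spatial data of one timestamp via Spark SQL; the path, column names and the SparkSession `spark` are assumptions for illustration.

    // Load the raw trajectory file and expose it to SQL queries.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/path/to/bike_trajectories.csv")
    df.createOrReplaceTempView("bikes")

    // Recover the selected features of all objects at one timestamp;
    // the result is then written out and turned into a PointRDD.
    val snapshot = spark.sql("SELECT id, x, y FROM bikes WHERE timestamp = 3")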

7.3.6.3 Spatio-Temporal Partitioning

The proposed approach considers a naive temporal partitioning, from which a collection of timestamps is obtained. These timestamps are related to the sampling period during which the system recorded the spatio-temporal data. As a result, the size of the data loaded in a BikesRDD may differ between timestamps, since some mobile objects may not have a spatio-temporal footprint for all recorded timestamps. Here, in our simulations, the sampling period is assumed to be the same for all bike riders' trajectory data. From the trajectory (i.e., spatio-temporal) points of a number of bike rider objects at different timestamps, a collection of BikesRDDs is constructed. Following the previous aspects, GeoSpark can easily utilize spatial partitioning over the temporal partitions of the data of these bikes. In each temporal partition, a bike rider may not necessarily have moved, because random delays may occur. Hence, in a specific time partition, the spatial data of all bikes do not necessarily participate.

A BikesRDD constitutes a specialized GeoSpark PointRDD, which consists of individual bike riders' records. The spatial partitioning of the BikesRDD concerns the bike riders' selected features, such as the location data. The BikesRDDs of the several temporal partitions are simulated one by one. More to the point, GeoSpark offers low-overhead spatial partitioning approaches that take into consideration the spatial distribution of the data and repartition a loaded Spatial RDD in an automatic way. In each Spatial RDD, spatially proximal objects are grouped into the same partition; as a result, the partitioning layer in GeoSpark partitions the workload in a spatial way and periodically repartitions this workload in order to keep the partitions balanced. Also, it supports a variety of grid-type partitioning methods, such as equal grids, R-tree, Quad-tree and KDB-tree, to name a few [184].

7.3.6.4 k-Anonymity Set in GeoSpark

Algorithm 8 presents the steps of a spatial k-NN query in GeoSpark. It takes as input a non-indexed or indexed SRDD, a query center point, as well as a parameter k, which indicates the number of nearest neighbors. The algorithm is separated into two phases, namely the selection and the sorting phase [184], and will be utilized in the simulations.

Algorithm 8 Spatial k-NN Query
1: input The number k of nearest neighbors
2: input A query center object A
3: input A Spatial RDD B
4: output A list of k spatial objects
5: Step 1: Selection phase
6: for all partitions ∈ SRDD B do
7:   if an index exists then
8:     Return the k neighbors of A by querying the index of this partition
9:   else
10:    for all objects ∈ this partition do
11:      Compute the distance between object A and each object of Spatial RDD B
12:    end for
13:  end if
14: end for
15: Maintain a priority queue C that stores the top k nearest neighbors
16: Step 2: Sorting phase
17: Sort the spatial objects in the intermediate Spatial RDD based on their distances to A
18: Return the top k objects in C

Our aim is to exploit the above-mentioned structures and Algorithm 8 so as to formulate the k-anonymity set for a number of moving objects based on their trajectory (spatio-temporal) data. From the trajectory points of a number of mobile objects at different timestamps, a collection of spatial BikesRDDs is constructed. The goal of this chapter is to provide an efficient and scalable framework for robust continuous k-NN querying of spatial objects in GeoSpark. A Scala RDD API call for creating the desired PointRDD from a csv file is introduced in Algorithm 9 below.

Algorithm 9 Spatial PointRDD Creation
1: Define the csv file location in pointRDDInputLocation
2: Define the attributes' start column, equal to 0, in pointRDDOffset
3: Define pointRDDSplitter as FileDataSplitter.CSV
4: Define the carrying of the remaining attributes: carryOtherAttributes = true
5: Create the PointRDD: PointRDD = new PointRDD(sc, pointRDDInputLocation, pointRDDOffset, pointRDDSplitter, carryOtherAttributes)

An iterative application of the typical spatial k-NN query presented in Algorithm 8 is introduced, using the Scala RDD API of GeoSpark, in Algorithm 10 below.

Algorithm 10 Iterative Spatial k-NN Query
1: input k, usingIndex, number of timestamps L
2: for timestamp t = 1 to L do
3:     Create the DataFrame df with the selected features at t
4:     Save df as csv file f
5:     Create a PointRDD from the csv file
6:     if usingIndex = true then
7:         Build an R-Tree index on the PointRDD
8:     end if
9:     for all lines ∈ csv file f do
10:        Read the selected features (f1, f2)
11:        Create the query point:
12:        val fact = new GeometryFactory()
13:        val querypoint = fact.createPoint(new Coordinate(f1, f2))
14:        Run the spatial k-NN query to return k geometry points:
15:        val queryResultList = KNNQuery.SpatialKnnQuery(PointRDD, querypoint, k, usingIndex)
16:    end for
17: end for
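To make the loop body concrete, a minimal Scala sketch under the GeoSpark 1.2 RDD API is given below; the helper name and the decision to build the index inside the helper are illustrative assumptions, not the dissertation's exact code, and f1, f2 stand for the selected coordinates read from the csv file.

```scala
import com.vividsolutions.jts.geom.{Coordinate, GeometryFactory, Point}
import org.datasyslab.geospark.enums.IndexType
import org.datasyslab.geospark.spatialOperator.KNNQuery
import org.datasyslab.geospark.spatialRDD.PointRDD

// Hedged sketch of one iteration: optionally index the PointRDD, then
// query the k nearest neighbors of a single trajectory point.
def kNearest(pointRDD: PointRDD, f1: Double, f2: Double,
             k: Int, usingIndex: Boolean): java.util.List[Point] = {
  if (usingIndex) {
    // Build an R-tree on the (non-partitioned) PointRDD, as in Algorithm 10.
    pointRDD.buildIndex(IndexType.RTREE, false)
  }
  val fact = new GeometryFactory()
  val querypoint = fact.createPoint(new Coordinate(f1, f2))
  KNNQuery.SpatialKnnQuery(pointRDD, querypoint, k, usingIndex)
}
```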

It is worth mentioning that the output of the spatial k-NN query (indexed or not) is a list of geometries, as in Table 7.2, where the id of each point geometry is related to the corresponding mobile object id. In addition, the list holds the top-k geometry objects. In the following, the above methodology and algorithms are used for the performance evaluation of the proposed k-anonymity method based on spatial k-NN queries.

Table 7.2 Trajectory Representation in a PointRDD.

Trajectory Points

Point (−88.331492, 32.324142)

Point (−88.175933, 32.360763)

Point (−88.388954, 32.357073)

Point (−88.221102, 32.35078)

7.4 Results

7.4.1 Environment and Dataset

A local machine was used for the experimental evaluation. The experiments were executed using 1 VM with 4 CPU cores, 3.8 GB RAM and 488 GB storage capacity, running Ubuntu 18.04.4 LTS, Apache Spark 2.1.0, GeoSpark 1.2.0 and Apache Zeppelin 0.8.2. We conducted the experiments on one master node and assigned 384.1 MB of memory to the Spark driver program that is executed on the master local machine. The experimental data were extracted from the SMaRT Database GIS Tool (http://www.bikerides.gr/thesis2/). The experiments were based on trajectory datasets of bike riders in the area of Corfu, Greece; the case study area is located in the non-urban part of Corfu.

7.4.2 Time Performance of k-Anonymity Set

This subsection presents a comprehensive experimental evaluation of GeoSpark on the studied problem. Specifically, we measure the GeoSpark spatial k-NN query performance, which determines the computation of the k-anonymity set of 80, 500 and 2000 mobile objects over 10 consecutive timestamps, as shown in Table 7.3. The spatial k-NN queries are tested with k values of 8 and 16 on the Bike Riders Trajectory (spatio-temporal points) dataset. The performance measure of the system for the k-anonymity set computation is the total run time that the system needs to execute the jobs. We compare the following spatial data processing approaches:

1. Without Partitioning–Without Indexing: GeoSpark approach without a spatial index on a non-partitioned PointRDD.

2. Without Partitioning–With Indexing: GeoSpark approach with a spatial index on a non-partitioned PointRDD.

3. With Partitioning–Without Indexing: GeoSpark approach without a spatial index, where the PointRDD is partitioned according to grids using (i) KDB-Tree and (ii) R-Tree.

Table 7.3 Simulation Parameters.

Parameter        Parameter Range
Mobile Objects   80, 500, 2000
Timestamps       10
Time Step        uniform

The above cases are taken into consideration in order to evaluate the GeoSpark performance for the computation of the k-anonymity set for a number of mobile users. To form the desired k-anonymity set, we apply spatial k-NN queries using the embedded function "KNNQuery.SpatialKnnQuery". Only the R-tree spatial index is supported for spatial k-NN queries. We consider the following results assuming that the trajectories are represented as a collection of spatio-temporal points in two-dimensional space. The experiments were conducted in Apache Zeppelin, an open web-based notebook that enables interactive data analytics. In Figure 7.4, an SQL query is employed, which returns the trajectory data (spatio-temporal points) of 40 mobile objects. Specifically, a notebook created in Zeppelin is utilized, through which the methods described in the previous section were executed under the GeoSpark framework. It should be noted that the experimentation shows that, when mobile objects do not move at the same timestamps, it is difficult for the k-anonymity set to be formed, as the desired set of mobile objects has very low cardinality.

Also, we observed that, from timestamp to timestamp, different objects participated in the formed set, meaning that the k nearest neighbors are time-varying. Such a scenario shows that the randomness in the objects' movements does not benefit or promote the formulation of a slowly time-varying anonymity set, as we have proved in our previous work [40].

Fig. 7.4 An Overview of 40 Trajectories through Zeppelin.

7.4.2.1 Impact of Data Size

In this subsection, the impact of data size, in terms of the number of mobile objects N in Figure 7.5 and storage size in MB in Figure 7.6, with or without the aid of R-Tree indexing, on the computation time of the k-anonymity set is investigated. According to the results in Tables 7.4–7.6, we observe that the use of indexing does not impact the total execution time of the involved spatial k-NN queries. This happens because GeoSpark caches the spatial RDD along with the corresponding indexes in each timestamp. Hence, for the upcoming mobile objects, it directly reads the spatial PointRDD and its index from the cache, thus saving time. It should also be noted that, in each timestamp, all objects use the same (partitioned or not) PointRDD, which changes in the upcoming timestamps for all the objects.

Fig. 7.5 Time Cost for k-Anonymity Set Computation with or without Indexing for N = 80, 500, 2000 Mobile Objects.

Fig. 7.6 Time Cost for k-Anonymity Set Computation with or without Indexing for 3 Cases of Total Input Data in Executor.

Table 7.4 Time for 80 Mobile Objects without Indexing and with R-Tree Indexing.

Time Results in Minutes
k-NNs   No Indexing   Indexing   Number of Completed Jobs   Number of Queries
8       1.5           1.6        730                        6400
16      1.5           1.6        730                        6400

Table 7.5 Time for 500 Mobile Objects without Indexing and with R-Tree Indexing.

Time Results in Minutes
k-NNs   No Indexing   Indexing   Number of Completed Jobs   Number of Queries
8       3.5           3.6        4510                       25 × 10^5
16      3.5           3.6        4510                       25 × 10^5

Table 7.6 Time for 2000 Mobile Objects without Indexing and with R-Tree Indexing.

Time Results in Minutes
k-NNs   No Indexing   Indexing   Number of Completed Jobs   Number of Queries
8       11            12         18,010                     4 × 10^7
16      11            12         18,010                     4 × 10^7

In terms of total execution time, as shown in Figure 7.5 and including the time for index building on the PointRDD, GeoSpark without an index shows approximately the same search time performance as the indexed version. Also, the execution time of a k-NN query in GeoSpark remains constant for the different values of k in each case. This is mainly because the value of k is very small with respect to the input data and thus most of the time is spent on input data processing. The time cost of loading the data from csv to a dataframe is 42 s and is included in the total execution time. Figure 7.6 illustrates the total execution time relative to the total input data size in the executor in MB, which appears to increase in a non-linear way. Ultimately, in Figure 7.7, we focus only on higher numbers of mobile objects, in order to understand the impact of data scalability on the time performance of the proposed approach for the k-anonymity set computation, assuming a 2-dimensional (2D) points RDD for 10 consecutive timestamps. Indeed, the growth appears close to exponential, which will be even more apparent for actual Big Data.

Fig. 7.7 Time Cost (in minutes) versus the Number of Mobile Objects N = {500, 2000, 8000, 32,000} without Indexing for k = 8.

7.4.2.2 Impact of Spatial Partitioning

In this subsection, we experiment on the impact of the embedded spatial partitioning methods on the computation time of the k-anonymity set, despite the fact that in GeoSpark the spatial partitioning methods are usually used to optimize spatial joins. Initially, in Figure 7.8, the total number of records in the PointRDD partitions for four partitioning methods in a specific timestamp is demonstrated. Obviously, the partition sizes of KDB-Tree partitioning are more balanced than those of Quad-Tree. Also, we observe that, regarding R-Tree partitioning, an overflow occurs in one data partition (e.g., partition 2), which is much larger than the other partitions, because the R-Tree does not consider the whole space [184].

Fig. 7.8 Spatial PointRDD Data Distribution for 4 Spatial Partition Techniques for 2000 Mobile Objects.

Note that, for the third dataset, with size N = 2000, the data distribution varies from one timestamp to another and thus is not recorded in Table 7.7. In Table 7.7, we focus only on the impact of R-Tree and KDB-Tree spatial partitioning on the k-anonymity set computation time without PointRDD indexing, as GeoSpark does not support building local indexes on spatially partitioned data for spatial k-NN queries. We choose to evaluate, for a specific k equal to 16, the impact of spatial data partitioning on the time needed to formulate the k-anonymity set for 10 consecutive timestamps, as aforementioned. As the results show, the time cost of the k-anonymity method did not benefit from data partitioning. In addition, the time for the k-anonymity set computation is slightly lower when using the R-Tree for the smaller datasets. However, for a larger dataset, the difference between the two types of partitioning evens out. This means that most of the processing time, on the same machine, is spent on the k-NN query computation.

7.4.3 Vulnerability Evaluation

In this subsection, some additional experiments are conducted, based on a real dataset with the parameters presented in Table 7.8 and Figures 7.9–7.11. The k-anonymity set computation is carried out with the aid of R-Tree indexing. As GeoSpark works with spatial data in two-dimensional space, our focus is on three different cases for the k-anonymity set computation, in order to evaluate the vulnerability of the formulated anonymity set in each space:

Table 7.7 Impact of Spatial Partitioning on Time Performance.

Time Results (Minutes) for Different Spatial Partitioning Methods
Mobile Objects   R-Tree   KDB-Tree   Completed Jobs
N = 80           1.4      1.5        766
N = 500          3.6      3.7        4546
N = 2000         12       12         18,046

Data Distribution (records per partition) for 2 Different Spatial Partitioning Methods
Mobile Objects   R-Tree           KDB-Tree
N = 80           40, 40, 0        30, 25, 25
N = 500          236, 132, 132    250, 250, 0

1. Euclidean coordinates (x, y)

2. Polar coordinates (u, θ)

3. Hough-X coordinates of (x, y) denoted as (Ux, ax) and (Uy, ay)

Table 7.8 Parameters when using N = 500 Trajectories, L = 100 Timestamps.

Anonymity Set Size k   Anonymity Set Attributes
5, 10                  (x, 0)
5, 10                  (0, y)
5, 10                  (x, y)
5, 10                  (Ux, ax)
5, 10                  (Uy, ay)

Fig. 7.9 Vulnerability versus Timestamps in (a) the Euclidean Space (features x, y and xy for k = 5, 10) and (b) the Polar Space (features u, θ for k = 5, 10), for N = 500 Trajectories, L = 100 Timestamps.

Fig. 7.10 Vulnerability versus Timestamps in the Hough-X Space of x (features x and (Ux, ax)) for (a) k = 5 and (b) k = 10, for N = 500 Trajectories, L = 100 Timestamps.

Fig. 7.11 Vulnerability versus Timestamps in the Hough-X Space of y (features y and (Uy, ay)) for (a) k = 5 and (b) k = 10, for N = 500 Trajectories, L = 100 Timestamps.

The experimentation has been carried out on a dataset of size N = 500 Trajectories (or moving objects).

Actually, Figure 7.9a depicts that, assuming only the x feature (or the y one) for the anonymity set formulation, the vulnerability in both cases is about 2 times higher than when both attributes (x, y) are considered. Similarly, in Figure 7.9b, the vulnerability is measured in the polar space, where the performance is worse than in the Euclidean one. This is attributed to the fact that the polar coordinates are linearly dependent on the spatial data (x, y), revealing the curse of dimensionality. This figure also reveals that the anonymity set varies from timestamp to timestamp, meaning that, for a given object's angle direction and velocity, more than one of the k nearest neighbors change. Finally, Figures 7.10 and 7.11 show the vulnerability performance in the Hough-X space, compared with the corresponding one in the Euclidean space. Obviously, the results are better in the Euclidean space, as the Hough-X space is linearly dependent on the former. Finally, focusing on the case where k = 10 in Figure 7.12, the vulnerability varies slowly around 0.4 and 0.3 (red and orange dashed lines, respectively) in the Hough-X space, while in the Euclidean space it achieves values very close to 0.1 (purple solid lines).


Fig. 7.12 Vulnerability Performance Comparison in the Euclidean ((x, y)) and Hough-X ((Ux, ax), (Uy, ay)) Spaces for k = 10.

7.5 Discussion

In this section, two main issues are discussed: first, the time performance of the studied problem and, in the following, the robustness of the suggested anonymity method. The analysis in Section 7.4 gives some useful insights and presents some key aspects of the results obtained from the evaluation of the proposed k-anonymity method in GeoSpark.

7.5.1 Performance Issues

Here, we should note that the adopted partitioning method is simple. Hence, to improve the performance of the suggested privacy-preserving continuous k-NN queries, we could employ the approach proposed in [120] and extend Algorithm 8 by applying its smart partitioning technique. In particular, we could compute AkNN queries based on kdANN or kdANN+ (where d stands for the dimensionality) instead of simple k-NN queries, and study the effect of d on the performance of such queries and, thus, on the k-anonymity set computation performance, which is the main issue in the context of this research work. At this point, it should be noted that that approach exploits a space decomposition technique suitable for Big Data and processes the classification in a parallel and distributed manner, considering multidimensional objects as well. Through an extensive experimental evaluation, it has been proved that the solution is efficient, robust and scalable in processing k nearest neighbor queries. Although the trajectory (spatio-temporal) data utilized here are 2-dimensional (2D), the methodology and techniques are general and applicable to higher dimensions. Hence, observing Figures 7.5 and 7.6 in Section 7.4, the cost of the proposed approach, which is currently investigated using 2D trajectory (spatio-temporal) point datasets, is expected to scale up exponentially with respect to both the data size, as shown in Figure 7.7, and the number of dimensions, when d > 2, following the same trend as in [120]. This results from the curse of dimensionality: as the number of dimensions increases, so do the number of distance computations and the number of searches. For further experimentation with the dimensionality parameter d > 2 in the context of GeoSpark, the functions for data preprocessing that prepare the Points RDD structure need to be redefined; let us recall that GeoSpark RDD operations and k-NN query processing work with 2D points. Finally, as the findings in [120] show, it is expected that increasing the number of computing nodes (VMs) will improve scalability and time performance in support of both snapshot and continuous spatial queries over moving objects, having a great effect when the data follow a uniform distribution, due to better load balancing. Indeed, increasing the number of computing nodes decreases the number of distance calculations and update steps on the k-NN lists that each computing node undertakes. We expect to see a similar behavior in our case if the processing is distributed over multiple nodes in the GeoSpark environment.

7.5.2 Vulnerability

Vulnerability is considered a major issue arising in spatio-temporal data management as well; it simply measures how vulnerable various mobile objects are to revealing their identity to potential adversaries when moving from one position to another. Moreover, reducing vulnerability and enhancing resilience in the face of adversity are considered essential steps. To date, less attention has been given to the vulnerability resulting from k-NN querying on non-linear trajectory data. This study investigates the vulnerability of mobile objects with respect to k-NN queries based on their geolocation data attributes. The proposed anonymity technique, as well as previous findings, are confirmed in the GeoSpark environment and, as a result, vulnerability is estimated for different risk conditions (i.e., for different numbers of nearest neighbors k and pairs of features, as presented in Tables 7.1 and 7.8). Furthermore, it is observed that, for low values of k, the vulnerability remains at a medium level, and the more nearest neighbors are utilized to form the anonymity set, the lower the vulnerability becomes. In other words, the results of the present study indicate a low level of vulnerability for a high number of nearest neighbors and uncorrelated feature data. At this point, it should be noted that the results in Section 7.4 highlight the superiority, that is, the lower vulnerability, of the Euclidean space in terms of privacy preservation, which is the main issue in k-anonymity and, thus, in applying robust k-NN queries. However, let us recall that, as already investigated in [38], dual representative features, e.g., Hough-X of x and/or y, are considered an appropriate choice to form the k-anonymity set from both the privacy preservation and the storage efficiency perspective. So, priority should be given to the use of low-dimensional, independent feature data and, in that case, to the utilization of the dual space over the Euclidean one.

7.6 Conclusions and Future Work

In conclusion, this work aimed at testing the hardware and software performance in terms of the number of operations performed and the time needed to obtain satisfactory anonymity. The real-world example concerning moving bicycle riders yielded satisfactory results using different ancillary methods of space partitioning and

indexing. Also, a GeoSpark-based approach for evaluating the vulnerability of spatial k-anonymity in large-scale spatio-temporal data is provided. The experiments show that GeoSpark constitutes an appropriate Spark-based system for the evaluation of robust k-NN queries. The GeoSpark spatial RDDs are exploited to store the trajectory data of the mobile objects as PointRDDs, in order to acquire the anonymity sets of all query mobile objects by issuing iterative spatial k-NN queries for all mobile objects in consecutive timestamps. It should be noted that the experiments were conducted on one Virtual Machine (VM) in local mode, with the specific capabilities described in the experimental evaluation. As future work, our aim is to apply the proposed anonymity method in a fully distributed environment with many VMs, so as to verify the expected results from previous findings about the execution time; in this case, the input data will be partitioned and then distributed to different VMs. In this way, several parallel tasks will be executed and thus the execution time for the anonymity set computation is expected to be much lower than in the current single-VM case when considering large-scale databases. Data locality and load balancing are important when applying efficient k-NN queries; GeoSpark also provides a grid-type partitioning based on the KDB-tree that can ensure workload balancing. In conclusion, it should be pointed out that the current experimentation gave us insight into the design issues and possible constraints of the proposed approach, so as to optimize its performance in terms of execution time cost and memory utilization for real-time scenarios, where the security of the mobile objects' identities, when issuing k-NN queries, is of high importance.

Part II

Sentiment Analysis and Tourism Forecasting

CHAPTER 8

An Apache Spark Implementation for Graph-based Hashtag Sentiment Classification on Twitter

8.1 Introduction

Nowadays, the vast evolution of the Internet has radically changed the ways of communication, information exchange and interaction between people. Internet users are no longer passive information receivers; on the contrary, they participate in social networks and have the opportunity to discuss with others by exchanging views and ideas. Specifically, users tend to disseminate information through short 140-character messages called "tweets", or follow other users in order to receive their status updates. Twitter constitutes a widespread instant messaging platform which people use in order to get informed about world news, technological advancements, etc. Inevitably, a variety of opinion clusters that contain rich sentiment information is formed. The rapid growth of the Internet has significantly increased the data volume and made processing with traditional methods very difficult. Therefore, there is an increasing need to move to cloud computing technologies, since they provide tools and infrastructure for creating highly scalable solutions and managing the input data in a distributed way among multiple servers. The public opinion around various issues is reflected through questionnaires and surveys. As more users post reviews for the products or services they use, the social networking platforms are becoming important sources of information, which is for the first time recorded directly in electronic form. The need to analyze and retrieve the produced volume of information in an automated way paves the way to Sentiment Analysis [97]. According to [107], Sentiment Analysis is one of the simplest problems in Natural Language Processing; the computing system does not need to completely perceive the semantics of each sentence, but only to detect the overall attitude of the author and

then classify it according to its polarity. However, the polarity detection problem seems to be difficult even for humans. Sentiment Analysis can be applied at various levels, depending on the text size and the desired resolution of the extracted information:

1. Document Level: it is assumed that each document expresses an opinion on a particular subject or topic.

2. Sentence Level: it splits each document into sentences, assuming that each sentence expresses a single view.

3. Attribute Level: it splits each document or sentence into phrases that refer to an entity as a whole, or separately to each of its features, with a different sentiment.

In the context of this work, we utilize hashtags and emoticons as sentiment labels to perform classification of diverse sentiment types. Hashtags are a convention for adding additional context and metadata and are extensively utilized in tweets [169]. They are used to categorize a message and/or highlight a topic, and they facilitate the search for tweets that refer to a common subject. Both hashtags and emoticons provide fine-grained sentiment information at tweet level, which makes them suitable to be leveraged for opinion mining. Previous works regarding emotional content are the ones in [88], [89] and [90]; they presented various approaches for the automatic analysis of tweets and the recognition of the emotional content of each tweet based on the Ekman emotion model, where the existence of one or more of the six basic human emotions (Anger, Disgust, Fear, Joy, Sadness and Surprise) is specified. In this chapter, Sentiment Analysis on Twitter data based on [169] is performed. Our proposal consists of three classifiers, namely Naive Bayes, Logistic Regression and Decision Trees, which we implemented in Apache Spark, a prominent distributed environment appropriate for processing large-scale data. The algorithm uses a set of categorized tweets to train the classifiers and then classifies tweets into three categories, namely positive, negative or neutral. The rest of the chapter is organized as follows: Section 8.2 discusses related work and presents the Spark framework, while Section 8.3 introduces the Sentiment Analysis classification framework. Section 8.4 presents the datasets used for

validating our framework. Moreover, Section 8.5 presents the evaluation experiments conducted and the results gathered. Ultimately, Section 8.6 presents conclusions and draws directions for future work.

8.2 Related Work

8.2.1 Sentiment Analysis and Classification Models

In the last decade, there has been an increasing interest in Sentiment Analysis [129] as well as in emotional models. This is mainly due to the recent growth of the data available on the World Wide Web, especially those which reflect people's opinions, experiences and feelings [130]. Early opinion mining studies focus on document-level sentiment analysis concerning movie or product reviews [72], [195] and posts published on web pages or blogs [190]. In recent years, thanks to the development of social media, the interest of the academic community and industry has turned to the processing and analysis of massive data. This has resulted in important research studies which solely focus on data from social networks. Below is an overview of the main studies that have been conducted on Twitter data, divided into supervised [161] and non-supervised machine learning techniques, where the former achieve better precision than the latter.

8.2.1.1 Supervised Machine Learning Approaches

Study [56] is among the first ones conducted on Sentiment Analysis of Twitter data. The authors adopt a binary sentiment classification problem by describing tweets as positive or negative. They apply a distant supervision technique in order to train a supervised machine learning classifier and then compare the algorithms of Naive Bayes [97], [142], Maximum Entropy and Support Vector Machines. The final training set includes tweets for subjectivity detection and positive or negative tweets for polarity classification. Another interesting approach is described in [12] for evaluating the effect of the small length of tweets on the usual supervised machine learning techniques. The authors collected tweets from the ten most popular topics in five categories (entertainment, products and services, sports, news and companies), creating a manually categorized gold-standard set. During the preprocessing step, they replace usernames, hashtags

and hyperlinks with pre-built keywords. As text representation attributes, unigrams, bigrams, trigrams, POS tags as well as POS n-grams are used. A two-phase classifier is proposed in [9]; in the first phase, the tweets are categorized as subjective (having emotion) or objective (neutral), and the subjective ones are then distinguished into positive or negative. For the construction of the training set, noisy emoticons are used as labels and three emotion detection tools (Twendz, Twitter Sentiment and TweetFeel) are put to use. The authors in [33] automatically classify the dataset of [123] by using categorization indexes (noisy labels) consisting of 50 hashtags and 15 emoticons. The set of features includes words, character n-grams (2–5), the length of each tweet, the amount of punctuation, exclamation points, question marks, capital letters and words, along with the existence of high-frequency words. In addition, a three-stage technique in [82] considers emotion analysis on specific target-dependent topics. Similarly to [9], the authors initially classify tweets as subjective or objective and then, in a second phase, the subjective tweets are classified as positive or negative using two separate SVM classifiers with a linear kernel function. They claim that conventional techniques, such as the ones proposed in [56] and [170], are not sufficient, as all their features are target-independent. In the third phase, they propose a graph-based method in order to increase the accuracy. They observe that their approach leads to better performance than the one proposed in [9], as their method is mainly based on lexical features instead of more abstract ones. A method for corpus collection and subsequently for building a sentiment classifier that is able to determine positive, negative and neutral sentiments is proposed in [127]. The classifier is based on the multinomial Naive Bayes classifier that uses n-grams and POS tags as features, considering conditional independence of the n-gram features and POS information. The experimental evaluation on a set of real microblogging posts proves that this technique is efficient and performs better than previously proposed methods.

8.2.1.2 Non-Supervised Machine Learning Works

The connection of polls with Sentiment Analysis on tweets referring to US President Barack Obama is considered in [123]. Specifically, through the Twitter API, 1 billion tweets posted during the period from 2008 to 2009 are collected, without checking the demographic characteristics of the authors or the writing language. The authors classify

each tweet by measuring whether it contains more positive or negative words, looking up the polarity of each word in the MPQA emotion dictionary [170]. Furthermore, the authors in [52] consider the predictability of the final result by applying popular techniques to tweets regarding the 2010 elections for US Representatives. These techniques are based on the previous method in [123], using the MPQA dictionary [170], and introduce some changes so that the overall model suits the nature and characteristics of each electoral system. Taking into account tweets that contain the names of rival candidates, they do not allow a tweet to have two opposite polarities simultaneously. Another non-supervised hybrid method is the one proposed in [101], which also deals with tweets, experimenting with large text corpora and dictionaries in order to determine the semantic orientation of the text terms. The proposed method of calculating the emotion value takes into account both the polarities of the dictionary and the number of emoticons, repeated letters, exclamation marks and capital letters, which are features commonly used by supervised techniques. Moreover, an entity-level model for data collected from Twitter is proposed in [190]. This model applies pre-processing to the corresponding dataset by deleting duplicates, removing usernames and hyperlinks, replacing abbreviations with their normal form and finally recognizing the grammatical terms of individual messages (POS tagging). Then, it calculates the emotional value of each term based on its similarity with emotions from dictionary words, and resolves simple references by assigning pronouns to the nearest entity in the text. The final step trains a binary SVM classifier, with labels resulting from the above non-supervised procedure, which classifies tweets into their final classes. In this work, we adopted the approach introduced in [169], which belongs to the supervised machine learning techniques, and implemented it in a distributed environment, ideal for big data management. More specifically, the emotion analysis is implemented in two phases, namely at tweet and at hashtag level. At tweet level, a two-phase SVM classifier, as in [9], is used: the former detects whether a tweet is emotional or not, and the tweets that carry emotion are driven to the input of the latter, which classifies them as positive or negative. Then, graph-based hashtag-level analysis of the tweets that carry emotion is applied, in order to achieve higher precision.

8.2.2 Cloud Computing Preliminaries

Apache Spark is an open-source framework designed specifically for cluster computing with the Scala programming language1. It is made to support general-purpose distributed applications, based (generally) on the processing of large-volume data, with a high degree of efficiency and speed. Regarding speed, Spark extends the popular MapReduce model in order to support more types of processing, such as interactive queries and stream processing. One of the main features of Spark is the ability to run computations in memory: Spark can run an application on a Hadoop cluster up to one hundred times faster using memory, and ten times faster using only the disk. For ease of use, Spark offers simple APIs in Python, Java, Scala and SQL, as well as rich embedded libraries. It can also be easily combined with other big data tools; for instance, it can run on a Hadoop cluster while having access to all Hadoop data. Finally, with regard to sophisticated analysis techniques, Spark supports SQL queries, data stream processing, and complex analytics such as machine learning algorithms out of the box. It also gives users the ability to combine all of these in a single program. The elements forming the Spark ecosystem are the following. Spark Core is the core of Spark and contains its basic functions: it includes the necessary elements for scheduling, memory management, fault recovery, interaction with the storage system, and others. Spark SQL provides support for SQL queries as well as for the SQL-like language created by Apache Hive, called Hive Query Language (HiveQL); apart from providing the SQL interface for Spark, Spark SQL enables developers to embed SQL queries in Spark programs. Spark Streaming is an element of Spark that allows data stream processing in real time. MLlib is a library of machine learning functions; it provides various machine learning algorithms, including binary classification, regression and collaborative filtering. GraphX is a library that was added in Spark 0.9 (February 2014) and provides an API for manipulating graphs and performing parallel graph computations; an example is processing a graph representing users' friendships in a social network.

1http://www.scala-lang.org/
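As a small, hedged illustration of the Spark SQL element described above, the following Scala sketch loads a hypothetical tweets.json file and runs an SQL query over it; the file name and its user/text schema are assumptions for illustration, and Spark 2.x is assumed.

```scala
import org.apache.spark.sql.SparkSession

// Hedged sketch: register a DataFrame as a temporary view and query it
// with SQL. The tweets.json path and schema are placeholders.
val spark = SparkSession.builder()
  .appName("SparkSqlSketch").master("local[4]").getOrCreate()

val tweets = spark.read.json("tweets.json") // assumed fields: user, text
tweets.createOrReplaceTempView("tweets")

spark.sql("SELECT user, COUNT(*) AS n FROM tweets GROUP BY user ORDER BY n DESC")
  .show(10)
```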

8.3 Sentiment Classification on Twitter

As aforementioned, a different approach for emotion analysis, using a graph consisting of hashtags, is proposed in [169]. This model is called the Hashtag Graph Model and examines the polarity of the graph nodes. Sentiment analysis is performed first at tweet level and then at hashtag level; these two steps are presented below.

8.3.1 Tweet-Level Sentiment Classification

To categorize tweets, we use a two-phase SVM classifier in order to estimate the sentiment deriving from each of them. In the first phase, a tweet is categorized as neutral or subjective, while in the second phase a subjective tweet is further categorized as positive or negative. Both SVMs use the same features, e.g., unigram words, punctuation and emoticons, as well as the sentiment lexicon.
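A minimal Scala sketch of such a two-phase scheme, using Spark MLlib's SVMWithSGD, is shown below; the RDD names, label encodings and iteration count are illustrative assumptions rather than the exact configuration used in this work.

```scala
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hedged sketch of a two-phase SVM: the first model separates neutral (0)
// from subjective (1) tweets; the second separates negative (0) from
// positive (1) among the subjective ones.
def trainTwoPhase(subjectivityData: RDD[LabeledPoint],
                  polarityData: RDD[LabeledPoint]): (SVMModel, SVMModel) = {
  val numIterations = 100 // assumed value
  val subjectivityModel = SVMWithSGD.train(subjectivityData, numIterations)
  val polarityModel = SVMWithSGD.train(polarityData, numIterations)
  (subjectivityModel, polarityModel)
}
```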

Fig. 8.1 An Example of the Hashtag Graph Model [169].

8.3.2 Hashtag-Level Sentiment Classification

Given a graph model, the aim is to solve the assignment inference problem by proposing efficient algorithms. To this point, three approaches have been introduced, namely Loopy Belief Propagation (LBP), Relaxation Labeling (RL) and the Iterative Classification Algorithm (ICA). Hashtags can be categorized either as positive or negative. Suppose a set of hashtags H = {h1, h2, ..., hm}, where each one is associated with a set of tweets T = {t1, t2, ..., tn}. The aim is to create a set y, which contains the polarity of each hashtag throughout H. We consider that hashtags which appear in the same tweet are more likely to have the same polarity. This drives us to introduce the graph for the analysis to come. That is, a graph HG over the hashtags is considered, entitled the Hashtag Graph Model, as presented in Figure 8.1. The graph is defined as HG = {H, E}, where H is the set of hashtags and E a set of edges, where each edge connects two hashtags that coexist in a tweet. Our primary goal is to categorize each hashtag as positive or negative. We therefore use the following equation:

\[
\log\left(\Pr(y \mid HG)\right) = \sum_{h_i \in H} \log\left(\phi_i(y_i \mid h_i)\right) + \sum_{(h_j, h_k) \in E} \log\left(\psi_{j,k}(y_j, y_k \mid h_j, h_k)\right) - \log(Z)
\]

wherein the first summation represents the tweet coefficient, while the second one the hashtag coefficient; Z is a normalization factor. Through this formulation, adjacent hashtags are allowed to affect the classification result. The final result derives from the maximization in Equation (8.1):

\[
\hat{y} = \arg\max_{y} \log\left(\Pr(y \mid HG)\right) \tag{8.1}
\]

The three algorithms are presented below, along with their corresponding pseudocodes.

8.3.2.1 Loopy Belief Propagation

LBP is an iterative algorithm that tries to classify each node in a graph through belief message passing. It was originally proposed for tree-like networks as a Bayes likelihood-ratio updating rule. Although it does not guarantee convergence to a fixed point after any number of iterations, LBP shows surprisingly good performance in practice. In fact, the propagation process tries to reach the stationary points of the Bethe approximation of the free energy for a factor graph. Initially, the algorithm initializes all edges of the graph with one message for each of the two labels (positive, negative). The next step is to refresh the messages through the multiplication of the functions φ and ψ with the messages of the neighboring nodes; finally, the label ŷ is calculated for each node.

Loopy Belief Propagation
1: Input: Hashtag Graph HG
2: Output: Sentiment label for each hashtag h
3: begin
4: for all (hi, hj) ∈ E do
5:     for all y ∈ {pos, neg} do
6:         mi→j(y) ← 1
7:         mj→i(y) ← 1
8:     end for
9: end for
10: repeat
11:     for all hi ∈ H do
12:         for all hj ∈ N(hi) do
13:             for all yj ∈ {pos, neg} do
14:                 mi→j(yj) ← α Σyi ψi,j(yi, yj) φi(yi) Πhk∈N(hi)\hj mk→i(yi)
15:             end for
16:         end for
17:     end for
18: until all mi→j(yi) stop changing
19: for all hi ∈ H do
20:     ŷi ← arg maxy∈{pos,neg} α φi(y) Πhj∈N(hi) mj→i(y)
21:     return ŷi
22: end for


8.3.2.2 Relaxation Labeling

RL is another algorithm for classification on the graph. Specifically, the value of di,j is used in order to capture the "significance" of node j with respect to node i. In addition, r(yi, yj) is used for estimating the compatibility between the labels yi and yj. One can also consider the probability bi(yi) of hashtag hi having the label yi.

8.3.2.3 Iterative Classification Algorithm

The ICA algorithm starts with the categorization of each hashtag by using a probability function of the tweet that contains it. After each iteration, it recomputes the probability of each node with respect to the probabilities of its neighbors and estimates the sentiment of the top-k hashtags; k is increased linearly at each iteration.

Relaxation Labeling
1: Input: Hashtag Graph HG
2: Output: Sentiment label for each hashtag h
3: begin
4: for all hi ∈ H do
5:     for all yi ∈ {pos, neg} do
6:         bi(yi) ← Στ∈Ti Pryi(τ) / Στ∈Ti Σy Pry(τ)
7:     end for
8: end for
9: repeat
10:     for all hi ∈ H do
11:         for all yi ∈ {pos, neg} do
12:             qi(yi) ← Σhj∈N(hi) di,j [Σyj r(yi, yj) bj(yj)]
13:             ai ← Σy bi(y) [1 + qi(y)]
14:             bi(yi) ← bi(yi) [1 + qi(yi)] / ai
15:         end for
16:     end for
17: until all bi(yi) stabilize
18: for all hi ∈ H do
19:     ŷi ← arg maxy∈{pos,neg} bi(y)
20: end for
21: return ŷi

8.4 Spark Implementation

In this section, the basic parts of the proposed algorithms are outlined. It has to be stated that our work is the first where these algorithms are implemented in the Scala programming language, which is the most appropriate one for the creation of the graph model. Twitter Stream: Initially, the algorithm takes as input a word (or even a set of words) from the command line and then retrieves the tweets that contain that specific word. Our experiments are based on Twitter and use the Twitter API to collect tweets; specifically, we implemented our Twitter connection using the Twitter4j2 library. One hurdle we had to overcome was that the Twitter API has a number of limitations, as it does not provide access to a large number of tweets.

2http://twitter4j.org/en/index.html

Iterative Classification
1: Input: Hashtag Graph HG
2: Output: Sentiment label for each hashtag h
3: begin
4: for all hi ∈ H do
5:     yi ← arg maxy∈{neg,pos} φ(y|hi)
6: end for
7: for t = 1 → M do
8:     for all hi ∈ H do
9:         compute pi(yi|HG, y)
10:        store pi ← maxy pi(y|HG, y)
11:        store yi ← arg maxy pi(y|HG, y)
12:    end for
13:    k ← (t/M) |H|
14:    Update the hashtag labels with the top-k pi
15: end for
16: return yi

Tweet Classification–MLlib: To categorize tweets, three classifiers are built with the help of MLlib3. MLlib is a Spark library which supports machine learning algorithms. During the training stage of the classifiers, we used a dataset which consists of 1,600,000 categorized tweets4. We processed the data with Spark SQL and used the Databricks5 library in order to further parse the corresponding files, the final target being sentiment prediction. Hashtag Graph Model–GraphX: The creation of the graph is considered an important step of our framework. In order to achieve this, we need three mutable lists, mutable meaning that the content of the list can be altered; in our case, though, the creation of a Resilient Distributed Dataset (RDD) does not allow such alteration. The lists hold the TCID links, the edges and the number of hashtags. Some operations need to be run in order to bring the hashtags into the form required by GraphX6. The disadvantage that GraphX does not support the insertion of additional nodes into a specific graph is surpassed due to the fact that the algorithm runs in real time: as new tweets arrive, the new hashtags are added to the list and the graph is created from the beginning.
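Since GraphX requires numeric vertex ids and immutable RDDs, a minimal Scala sketch of the hashtag co-occurrence graph construction described above is given below; the toy input, variable names and edge attribute are assumptions for illustration, not the dissertation's exact code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Hedged sketch: build a hashtag co-occurrence graph from (tweetId, hashtags).
val conf = new SparkConf().setAppName("HashtagGraph").setMaster("local[4]")
val sc = new SparkContext(conf)

// Assumed toy input: each tweet carries the hashtags it contains.
val tweets = sc.parallelize(Seq(
  (1L, Seq("#spark", "#bigdata")),
  (2L, Seq("#spark", "#scala", "#bigdata"))
))

// Assign a numeric vertex id to every distinct hashtag.
val vertices = tweets.flatMap(_._2).distinct().zipWithIndex()
val idOf = vertices.collectAsMap()

// One edge per pair of hashtags that co-occur in the same tweet.
val edges = tweets.flatMap { case (_, tags) =>
  tags.combinations(2).map { case Seq(a, b) => Edge(idOf(a), idOf(b), 1) }
}

val hashtagGraph = Graph(vertices.map(_.swap), edges)
println(s"vertices = ${hashtagGraph.numVertices}, edges = ${hashtagGraph.numEdges}")
```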

3http://spark.apache.org/mllib/ 4http://www.sentiment140.com/ 5https://databricks.com/ 6http://spark.apache.org/graphx/


8.5 Results and Evaluation

We have verified the results presented in [169] in terms of the Accuracy, Precision, Recall and F1 evaluation metrics for the Apache Spark implementation. The dataset used is the same as in [169]: 15,000 pre-classified tweets from a set of 600,000, all of which were categorized as positive, negative or neutral. The classifier used is the two-stage tweet-level SVM, and K-fold cross-validation (K = 5) is performed. Apart from verifying the results, we have also compared the SVM with the Naive Bayes, Logistic Regression and Decision Trees classifiers. The accuracy results are presented in Table 8.1, where the classifiers achieve comparable performance.

Table 8.1 Performance of Tweet-Level Classifiers

Classifier            Accuracy
2-phase SVM           84.13%
Naive Bayes           81.75%
Logistic Regression   72.45%
Decision Trees        76.23%

Since our motivation stems from the fact that we are interested in identifying a high number of correctly classified tweets in all classes of each classifier, Table 8.1 verifies our goal, meaning that the accuracy of the four classifiers remains high. The combination of the subjectivity and polarity classifiers shows the same high performance in terms of precision and F1, as shown in [169]. We can observe that the SVM outperforms the other three algorithms, achieving higher accuracy by a margin of about 3% to 12%. Specifically, as for the recall metric in [169], expressing the fraction of tweets in each class which are correctly classified, the results show that the subjectivity classifier has difficulty in correctly classifying the tweets of the subjective class. On the contrary, the polarity classifier achieves approximately 2x better performance in terms of the recall metric.

8.6 Conclusions

In our work, we have developed four classifiers in Spark, namely SVM, Naive Bayes, Logistic Regression and Decision Trees. The classifiers were trained with the use of a large dataset of 1,600,000 pre-classified tweets, where tweets were categorized as positive, negative or neutral. The achieved accuracy of the classifiers proves the classification efficiency and the stability of the results. All the classification algorithms are implemented in the Apache Spark cloud framework using Apache Spark's machine learning library, entitled MLlib. As future work, we plan to further investigate the effect of different data features, such as bigrams, trigrams, unigrams or POS tags, as introduced in our previous work [6]. Moreover, collecting data from other sources, such as posts from Facebook and Instagram, YouTube comments or even reviews from Foursquare, is a further area to be examined. Furthermore, we aim at experimenting with different clusters and evaluating Spark's performance with regard to time and scalability, as presented in our previous works in the Spark environment [87], [151].

CHAPTER 9

An Efficient Preprocessing Tool for Supervised Sentiment Analysis on Twitter Data

9.1 Introduction

The rapid development of modern computing systems, along with Internet access and high communication capabilities, has turned these systems into an integral part of human everyday life. Nowadays, users can express their personal opinion on any matter whenever they wish, as well as share their thoughts and feelings. It is no coincidence that most websites encourage their users to review their services or products, while social media accounts have significantly increased. In particular, through websites a user can be informed, express personal views on a variety of topics and simultaneously interact with other users. Hence, this kind of interaction produces a large amount of data that is of particular interest for further processing and analysis. Companies have started to poll these microblogs to get a sense of and understand the general sentiment towards their products. Often, these companies study user replies and then respond on the corresponding microblogs. The challenge is to build tools that can detect and summarize the overall sentiment, so that valuable conclusions and information on various kinds of issues can be drawn. For example, one can consider demographic and social conclusions, information of an economic nature, such as the prevailing view of a product or service, or even results of political content. Sentiment analysis constitutes a subtle field of information mining. It is considered the computational analysis and categorization of opinions, feelings and attitudes that are expressed in text format. Natural Language Processing techniques are employed to determine polarity, which can be characterized by many different classes. Specifically, the positive and negative terms, which correspond to the positive or negative view that a user holds, are utilized with respect to a specific event or topic.

Fig. 9.1 KDD process for knowledge mining from data [94]

It is widely noted that emotional analysis has many applications, as an individual may have a view on a huge range of issues of a different nature, such as economic, political, religious, and so on. For this reason, the positive and negative classes are not the only ones used, as aforementioned. It constitutes a basic way of studying and analyzing such data. Indeed, in recent years, the great volume of raw information generated by Internet users has increased the interest in processing such data. In other words, it is the process of mining and categorizing subjective data, so as to extract information about the polarity and the overall emotion expressed in natural language in text form [108]. Data mining is considered an important part of data analysis [192]. It largely consists of the collection, processing and modeling of data, and is aimed at the objectives shown in Figure 9.1. Its characteristics are the extraction of information from a dataset, its subsequent transformation into a comprehensible structure with the use of various techniques (machine learning, statistics, etc.) which facilitate the analysis, and finally the extraction of conclusions as well as decision-making. In the present work, the main contributions concern the following aspects. Concerning data pre-processing techniques, tweets pose several new challenges due to the typically short length and irregular structure of such content. Hence, data pre-processing constitutes a crucial step in the field of sentiment analysis since, by selecting the appropriate pre-processing methods, the number of correctly classified instances can be increased [65]. To properly implement the data analysis process, it is necessary to process the raw collected data in a variety of ways. Initially, it should be decided which data will be used, depending on the purpose of the analysis. Then, it is necessary to eliminate the

abnormalities and deal with incorrect values and/or incomplete inputs. Subsequently, the processed data are transformed into a form suitable for mining. Therefore, we focus on designing an efficient pre-processing tool which facilitates the sentiment analysis conducted with supervised machine learning algorithms. Another contribution is the application of a Latent Dirichlet Allocation (LDA) based probabilistic system to discover latent topics from the conversations connected to the event we take into account. The proposed research is organized as follows: Section 9.2 describes the background topics and the challenges faced, while Section 9.3 explains the process of information retrieval from the Twitter platform. The same section analyzes the tool for data pre-processing, which is the main contribution of this work. Furthermore, Section 9.4 presents the experimental results and the main conclusions extracted from the study and the analysis. Finally, Section 9.5 presents conclusions and draws directions for future work.

9.2 Related Work

Sentiment Analysis can be considered a field of data mining which intends to determine the polarity of a subjective text through the use of various algorithms. In recent years, this particular branch of science has started to gain increasing interest from both an academic and an industrial perspective; users have started to study the emotions around a subject for a variety of reasons. This interest is boosted by the rapid growth of users participating in social media, which is a modern world phenomenon. Some studies that have played a key role in the evolution and importance of emotional analysis are presented below. Pang and Lee [129] introduce the techniques and challenges in the application of emotional analysis, along with a plethora of widely used datasets. Also, the authors show how both the frequency of the terms and the n-gram feature selection affect the result. In addition, within the framework of the developed system, the frequency of individual terms and then the splitting of posts into n-grams were taken into account. Of particular interest is the application of emotional analysis at different hierarchical levels. Pang et al. [130] study the effectiveness of document-level analysis for a large number

of critically acclaimed films from the popular website IMDB. Turney et al. [164] examine the polarity of a document through its sentences. Phrases that include adjectives and/or adverbs, or other features and parts of speech that are highly likely to express the author's emotion, are selected. Concretely, user reviews on a variety of topics, such as movies, travel destinations, banks and cars, were utilized. Wilson et al. [171] analyzed the views of the MPQA corpus along with a dataset containing journal articles from different sources that have been rated in terms of their emotion. An important part of their work was the separation of phrases into polar and neutral (objective) ones and then the polarity analysis of the subjective phrases to extract the overall feeling of the text. Furthermore, emotional analysis can also be applied in the field of economic interest, to sets of journalistic content and film reviews [72, 183, 190], as well as to political aspects [163, 169]. Initially, the authors in [190] studied reviews of mobile applications and then extracted features important for their nature. Then, in [72], product reviews were analyzed to identify consumer sentiment in terms of certain characteristics and the products themselves. The authors in [183] constructed a model for emotional analysis based on travelers' reviews about the destinations they visited. In a similar way, a system that analyzes the sentiment of publications of real-time Twitter users to predict the results of the 2012 presidential election in the United States of America was created [169]. Finally, in [163], researchers used messages (not real-time data) that included references to German political parties in 2009. Previous works regarding emotional content are the ones presented in [84, 85, 86], in which the authors presented various approaches for the automatic analysis of tweets and the recognition of the emotional content of each tweet based on the Ekman emotion model, where the existence of one or more of the six basic human emotions (Anger, Disgust, Fear, Joy, Sadness and Surprise) is specified. Finally, a novel distributed framework, implemented both in Hadoop and in Spark, for exploiting the hashtags and emoticons inside a tweet is introduced in [87]. Moreover, Bloom filters are also utilized to increase the performance of the proposed algorithm.

9.3 Tools and Environment

The pre-processing methods evaluated by the current research are three different data representations, namely unigrams, bigrams and trigrams. Two well-known machine

learning algorithms were selected for classification, namely Naive Bayes and Support Vector Machine (SVM), as shown in the following section.

9.3.1 Twitter

The platform studied in this work is Twitter. This is a platform for publishing posts, exchanging messages between users, and modifying their private profiles according to their needs. There is the possibility of attaching links, images and audiovisual material to the posts. Twitter has gained considerable interest on a global scale due to the services it provides its users with. A special feature that makes emotional analysis quite difficult in its context is the small length of posts that it allows its users. It is therefore understood that studying the polarity of users' publications, beyond the general challenges and difficulties faced, is even more difficult due to their limited length.

9.3.2 Publications Mining Tools

The mining of posts was done using the Tweepy library which, through the Twitter interface, allows managing a user's profile, collecting data by optionally using certain search keywords, and finally creating and studying a stream of posts over a specific time interval. In this work, posts were stored in a CSV file, where the rows contain the posts that were extracted, while the columns contain the values of the different attributes of each post (e.g., date, text, username, etc.). Useful tools in the context of this work were the following:

1. The Natural Language Toolkit (NLTK) is a natural language processing library, which offers classification, parsing, tagging and stemming capabilities.

2. Scikit-Learn is a library that addresses the implementation and development of machine learning techniques and text processing tools. This library interacts with the NumPy and SciPy computational and scientific Python libraries, enhancing its efficiency and speed significantly.

3. Pickle is a library that converts objects into a form understandable only by the Python language, in order to limit the space they occupy in memory; this allows certain structures (e.g., in JSON format) to be stored and reused whenever necessary. In this work, this library was used to allow the reuse of the objects returned after the classifier training, without retraining, whenever necessary.

9.3.3 Pre-processing Scheme

In order to facilitate the mining process of the collected data, it is necessary to apply several pre-processing steps [50]. The main parts of this process are the following.

Pandas library: It was utilized to facilitate the management of the input files containing the components of the publications. A dataframe was then created containing only the components most important for our analysis; that is, records with incorrect values or without a rating were removed.

Regular expressions: They were utilized to remove URLs and references to other users' usernames. They also made it possible to find and replace or remove alphanumeric characters matching a predefined search pattern and to remove unnecessary spaces. The repetition of suffix characters used for emphasis, as well as numeric characters, which do not facilitate Sentiment Analysis, were removed too.

Emoticons: Emoticons are characters, such as punctuation marks and parentheses, which together form representations of expressions of the human face, e.g. a cheerful person {:-)}, as well as other representations that play an important role in analyzing the sentiment of each publication. Emoticons are widely used in social media, especially on Twitter, to express feelings and impressions in a short way. Therefore, a set of regular expressions containing a large part of these representations was created. Additionally, a set of widely used unofficial abbreviations was compiled in order to replace words that users make up; for example, the expression lol was replaced by its equivalent full form, namely laugh out loud.

Autocorrect library: This library uses a list of words found in recognized dictionaries and, given an input word, compares its similarity to the words on that list. If the input word is correctly spelled, it is returned as is. If it is not correctly spelled, its similarity to the words in the list is checked; if the similarity is greater than a certain threshold, the input word is replaced by the word in the list. Otherwise, it is returned as is, without any changes.

Pycontractions library: It detects a set of successive characters containing an apostrophe and replaces it with the full form of the expression. Expressions with the {'s} special character complicate sentiment analysis, since they express two or more words, which makes tokenization, as well as normalization, particularly difficult. For effective mining, contractions are written in their original form no matter how complicated that is. If the set has a single possible replacement, the expression is transformed into its original full form. Using a grammatical checker and Word Mover's Distance [102], a metric of the distance between the original text and the texts produced, called the "compatibility" metric, is derived; the substitution applied is the one with the highest value.

Emoji library: Emojis are Unicode characters in the form of icons representing facial expressions and many kinds of objects. This part differentiates our work from others, as in most other studies emoticons and emojis are not taken into account, even though they appear in most publications subject to opinion mining. This library uses a mapping list, as created by the Unicode Consortium. Unicode characters included in a text are reviewed and, if they appear in this list, they are replaced with the text form of their representation.

Part-of-Speech tagging: Each word of the text is tagged according to the part of speech that it constitutes (e.g., adverb, verb, noun). This process uses the context of the text being analyzed, as well as a set of aggregated elements (corpus), to evaluate and attribute the part of speech to the particular term being studied.

Lemmatization: It is the process in which lexical and morphological analyses of words are taken into account in order to remove complex suffixes and retrieve the lexical form of the term. It is applied after POS tagging and facilitates sentiment analysis through the application of machine learning algorithms. In the context of this work, the POS tagging labels follow the Penn Treebank format.

Tokenization: It is the separation of sentences into a list of symbol terms that can be used to reconstruct the original sentence. Both emoticons and emojis, which have by now been converted to text characters, are taken as tokens, without being divided into individual characters or punctuation marks. The tokenization process is applied to all sentences, and their terms are stored in the same token list; essentially, a token list is created for each post. Once the publication's details are in a list in the order they appear, some more conversions are made to optimize the process.

Punctuation: Punctuation marks are also tokens of the list. Generally, they do not attach any emotional significance to the publication and thus they are removed.

Stopwords: These are words that appear very often without expressing some form of feeling. Such words are removed because the whole attempt is to examine meaningful words in order to determine the overall emotion expressed in a publication.
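A condensed sketch of the regular-expression part of this pipeline is given below. The emoticon and abbreviation maps are toy fragments of the full sets described above, and the exact patterns are illustrative, not the ones used in this work; the NLTK 'stopwords' data is assumed to be installed.

import re
from nltk.corpus import stopwords

EMOTICONS = {":-)": "EMO_POS", ":(": "EMO_NEG"}   # toy fragment of the emoticon map
ABBREVIATIONS = {"lol": "laugh out loud"}         # toy fragment of the abbreviation map
STOP = set(stopwords.words("english"))

def preprocess(text):
    text = re.sub(r"https?://\S+", "", text)       # remove URLs
    text = re.sub(r"@\w+", "", text)               # remove references to other usernames
    for emoticon, token in EMOTICONS.items():      # map emoticons to sentiment tokens
        text = text.replace(emoticon, " " + token + " ")
    text = re.sub(r"(\w)\1{2,}", r"\1\1", text)    # collapse characters repeated for emphasis
    text = re.sub(r"\d+", " ", text)               # drop numeric characters
    tokens = [t for t in text.lower().split() if t not in STOP]
    return [ABBREVIATIONS.get(t, t) for t in tokens]  # expand unofficial abbreviations

print(preprocess("lol cooool flight :-) @airline https://example.com 123"))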

9.3.4 Features

N-grams are one of the most common structures used in the text mining and natural language processing fields. They constitute a set of co-occurring words within a given window. In addition, a Markov assumption is the assumption that the probability of a word depends only on the previous word; Markov models are probabilistic models whose use allows predicting the probability of a future aspect without looking too far into the past. The most popular such models are the bigrams, which look one word into the past, the trigrams, which look two words into the past, and, in general, the n-grams, which look n − 1 words into the past.

The "bag-of-words" approach is considered a very simple and flexible representation of text that describes the occurrence of words within a document. It involves a vocabulary of known words as well as a metric of the presence of these known words. The model considers only the existence of the known words in the document, and not the exact place where they are found in the document. The intuition is that documents are similar if they have similar content. The Summed Term Frequency constitutes the sum of all the term frequencies in the documents. In this work, it is utilized as

\mathrm{SummedTermFrequency} = \sum_{d \in D} TF_{n\text{-}gram}    (9.1)

In addition, the "Apply Features" method has been taken into consideration in order to obtain a feature-value representation of the documents. Concretely, this method is used in order to apply a "positive" or a "negative" label to each feature of the training data.
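As a small illustration, the following sketch computes n-gram counts and the Summed Term Frequency of Equation 9.1 with NLTK; the tokenized posts are toy stand-ins for the real documents d in D.

from collections import Counter
from nltk.util import ngrams  # n-gram generation over a token list

docs = [["great", "flight", "great", "crew"], ["late", "flight", "great", "service"]]

def summed_term_frequency(documents, n):
    """Sum the n-gram term frequencies over all documents, as in Equation 9.1."""
    total = Counter()
    for d in documents:
        total.update(ngrams(d, n))
    return total

print(summed_term_frequency(docs, 1))  # unigrams
print(summed_term_frequency(docs, 2))  # bigrams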

9.3.5 Topic Modeling

One other aspect we want to take into consideration in our proposed work is the verification of whether all the posts discuss the specific topic. Topic modeling considers a document as a "bag-of-topics" representation, and its purpose is to cluster each term in each post into a relevant topic. Variations of different probabilistic topic models [15], [115] have been proposed, and LDA [16] is considered to be a well-known method. Concretely, the LDA model extracts the most common topics discussed, represented by the words most frequently used, by simply taking as input a group of documents. The input is a term-document matrix, and the output is composed of two distributions, namely the document-topic distribution θ and the topic-word distribution φ. The EM [59] and Gibbs Sampling [58] algorithms were proposed to derive the distributions θ and φ. In this work, we use the Gibbs Sampling based LDA. In this approach, one of the most significant steps is updating the topic assignment individually for each term in every document, according to the probabilities calculated using Equation 9.2.

P(z_i = k \mid z_{-i}, w, \alpha, \beta) \propto \frac{\left(n^{-i}_{(k,m,\cdot)} + \alpha\right)\left(n^{-i}_{(k,\cdot,w_i)} + \beta\right)}{n^{-i}_{(k,\cdot,\cdot)} + V\beta}    (9.2)

where z_i = k denotes that the i-th term in a document is assigned to topic k, z_{-i} signifies all topic assignments except that of the i-th term, n^{-i}_{(k,m,\cdot)} is the number of times that document m contains topic k, n^{-i}_{(k,\cdot,w_i)} is the number of times that term w_i is assigned to topic k (both counts excluding the i-th term), V represents the size of the vocabulary, and α and β are hyperparameters for the document-topic and topic-word distributions, respectively. N Gibbs sampling iterations are performed over every term in the corpus; afterwards, the document-topic distribution θ and the topic-word distribution φ are estimated using Equations 9.3 and 9.4, respectively.

\hat{\theta}_{m,k} = \frac{n_{(k,m,\cdot)} + \alpha}{\sum_{k=1}^{K} n_{(k,m,\cdot)} + K\alpha}    (9.3)

\hat{\phi}_{k,v} = \frac{n_{(k,\cdot,v)} + \beta}{\sum_{v=1}^{V} n_{(k,\cdot,v)} + V\beta}    (9.4)
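To make the update concrete, the following is a minimal collapsed Gibbs sampler for LDA in Python, implementing Equations 9.2-9.4 over toy word-id documents. It is a sketch under simplified assumptions, not the exact implementation used in this work.

import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    M = len(docs)
    n_km = np.zeros((K, M))   # n_{(k,m,.)}: topic counts per document
    n_kv = np.zeros((K, V))   # n_{(k,.,v)}: word counts per topic
    n_k = np.zeros(K)         # n_{(k,.,.)}: total counts per topic
    z = [rng.integers(K, size=len(d)) for d in docs]
    for m, d in enumerate(docs):
        for i, w in enumerate(d):
            k = z[m][i]
            n_km[k, m] += 1; n_kv[k, w] += 1; n_k[k] += 1
    for _ in range(iterations):
        for m, d in enumerate(docs):
            for i, w in enumerate(d):
                k = z[m][i]   # remove the i-th assignment from the counts
                n_km[k, m] -= 1; n_kv[k, w] -= 1; n_k[k] -= 1
                # Equation 9.2: resample the topic of the i-th term
                p = (n_km[:, m] + alpha) * (n_kv[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[m][i] = k
                n_km[k, m] += 1; n_kv[k, w] += 1; n_k[k] += 1
    theta = ((n_km + alpha) / (n_km.sum(axis=0) + K * alpha)).T          # Equation 9.3
    phi = (n_kv + beta) / (n_kv.sum(axis=1, keepdims=True) + V * beta)   # Equation 9.4
    return theta, phi

theta, phi = gibbs_lda([[0, 1, 2], [2, 3, 3]], V=4, K=2)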

9.4 Evaluation

In Table 9.1, the two datasets studied, as well as their characteristics, are presented. There are 5 different sentiment categories; the first dataset contains tweets for all of them, while the second dataset contains tweets for 3 of them. The total number of tweets studied for each dataset is also reported. The first dataset consists of tweets about self-driving cars¹; the sentiment is categorized into 5 categories, ranging from very positive to very negative. The second dataset consists of the feelings that travelers expressed in February 2015 towards the problems of each major U.S. airline²; the sentiment in the tweets of this dataset is categorized as positive, neutral or negative for six US airlines.

Table 9.1 Datasets Details

Sentiment                 Selfdriving Cars Dataset   Airlines Dataset
Positive                  1262                       2363
Slightly Positive         1452                       –
Neutral                   4245                       3099
Slightly Negative         1498                       –
Negative                  1076                       9178
Total Number of Tweets    9533                       14640

The results of our work are presented in the following Tables 9.2 to 9.4. The accuracy, in terms of percentage, is used as the evaluation metric for the two different algorithms (Naive Bayes and SVM) and the three different setups (Unigrams, Bigrams and Trigrams). Also, the ratio of training to test set is taken into account when considering the two datasets. In Table 9.2, the results of the RapidMiner platform³ are presented. We have used RapidMiner as a baseline to emphasize the improvement of our proposed methodology; furthermore, in RapidMiner we cannot include features that are utilized in our work, such as emojis and emoticons. We observe that SVM performs better than Naive Bayes for the three different setups and for both datasets. Secondly, in both datasets, Unigrams and Bigrams achieve better accuracy than Trigrams; this is expected, as tweets are usually short due to the character limit, and thus

¹ https://www.kaggle.com/c/twitter-sentiment-analysis-self-driving-cars
² https://www.kaggle.com/crowdflower/twitter-airline-sentiment
³ https://rapidminer.com/

Trigrams cannot be considered as a qualitative metric.

Table 9.2 RapidMiner Results - Accuracy

Setup                    Selfdriving Cars Dataset   Airlines Dataset
Naive Bayes (Unigrams)   62.13                      64.95
Naive Bayes (Bigrams)    64.37                      62.40
Naive Bayes (Trigrams)   58.89                      50.66
SVM (Unigrams)           69.56                      75.50
SVM (Bigrams)            71.78                      73.20
SVM (Trigrams)           71.49                      70.95

In the following, Table 9.3 presents the results for different ratios of training versus test set. We utilized three different cases, with the training set being equal to 70%, 75% and 80% and the test set being equal to 30%, 25% and 20%, respectively. It is worth noting that our proposed methodology outperforms the results from RapidMiner: regarding the Selfdriving Cars dataset, for both classifiers and the three different setups, the accuracy ranges from a low of 72% to a high of 80%. On the other hand, for the Airlines dataset, we observe large fluctuations between the highest and lowest percentages. Concretely, Naive Bayes (Trigrams) achieves the lowest accuracy with almost 60%, and Naive Bayes (Unigrams) achieves the highest accuracy with 85% (for a training-test ratio equal to 80-20). Furthermore, we notice that in all cases, as the percentage of the training set increases, so does the accuracy; this is expected, as larger training sets improve the classifier's results.

Table 9.3 Accuracy for different Training - Test Set ratios

                         Selfdriving Cars dataset      Airlines dataset
Setup                    70-30    75-25    80-20       70-30    75-25    80-20
Naive Bayes (Unigrams)   71.99    72.73    74.04       82.60    83.72    85.18
Naive Bayes (Bigrams)    77.83    79.49    80.76       77.78    78.55    80.94
Naive Bayes (Trigrams)   76.57    77.93    79.44       59.45    59.70    60.72
SVM (Unigrams)           73.50    74.45    74.15       77.45    77.57    78.44
SVM (Bigrams)            76.43    77.77    78.76       68.90    69.43    69.56
SVM (Trigrams)           75.70    77.27    78.67       64.94    65.28    65.82

Finally, Table 9.4 presents the accuracy results when splitting with 10-fold cross-validation. The motivation for using this technique is that, with a simple split, important information may be left out of the training set. In addition, this method is simple to understand and generally results in a less biased and less optimistic estimate of the model's skill than other methods, such as a simple train/test split. As in Table 9.3, Naive Bayes outperforms SVM, achieving the highest percentage with a value equal to 85%. What is more, for both datasets, Naive Bayes has values close to 79%, whereas SVM has different values for the three setups.

Table 9.4 10-Fold Cross-Validation

Setup                    Selfdriving Cars Dataset   Airlines Dataset
Naive Bayes (Unigrams)   78.66                      85.38
Naive Bayes (Bigrams)    79.92                      79.99
Naive Bayes (Trigrams)   79.29                      74.26
SVM (Unigrams)           73.04                      77.26
SVM (Bigrams)            76.59                      69.32
SVM (Trigrams)           75.93                      65.25
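For reference, a minimal sketch of such a 10-fold evaluation with Scikit-Learn is given below; the synthetic features stand in for the vectorized tweets, so the numbers it prints are illustrative only.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in for the vectorized tweet features and sentiment labels.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

for name, clf in [("Naive Bayes", GaussianNB()), ("SVM", SVC())]:
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(name, scores.mean())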

9.5 Conclusions and Future Work

In this chapter, we proposed a pre-processing framework for Twitter sentiment classification. We chose Twitter because of the tweets' short length and the diversity of their content. We used supervised machine learning techniques for the analysis of the raw data in the user posts and incorporated emojis and emoticons in order to enrich our features. Furthermore, we applied a probabilistic system based on Latent Dirichlet Allocation (LDA) to discover latent topics from the conversations. Two popular classifiers (Naive Bayes and SVM) were used with three different data representations (unigrams, bigrams and trigrams) in order to perform our experiments on two datasets.

In the near future, we plan to extend and improve our framework by exploring more traits that may be added to the feature vector and increase the classification performance. Moreover, we plan to compare the classification performance of our solution with other classification methods. Another future consideration is the adoption of other heuristics for handling complex semantic issues, such as the irony that is typical of messages on Twitter.

CHAPTER 10

An Apache Spark Methodology for Forecasting Tourism Demand in Greece

10.1 Introduction

Tourism is a factor of high importance for a country's economy, as it contributes to its growth not only directly, by generating cash flows, but also by creating new jobs. It is therefore of pivotal importance for countries like Greece to forecast tourism demand. However, the real-world tourism industry is so dynamic that the process of Knowledge Discovery in Databases (KDD) becomes necessary. This represents the interdisciplinary field, generally known as "Data Mining", of which machine learning methods are a structural part. Our research is vital for Greece, as we propose a model suitable for forecasting tourism demand using Data Mining and specific machine learning methods on data originating from public sources. The demand variable to be forecasted is the quarterly tourist arrivals in Greece.

In the present chapter, an essential methodology using machine learning techniques on Apache Spark, a cluster computing system, along with a robust machine learning library, is introduced. The current work aims at building an accurate multivariate predicting model which could be integrated with public information systems, giving up-to-date analytics and a boost to the tourism sector along with the economy in general. The proposed methodology can be easily adapted to provide valuable data for other countries or continents, due to the unique features of the analytics platform, which provide scalability and robustness.

The rest of the chapter is organized as follows. Section 10.2 presents related work. Section 10.3 focuses on the methods for forecasting tourism as well as the Machine Learning algorithm utilized in our proposed system. In addition, our proposed model is properly introduced and further analyzed in Section 10.4. Furthermore, in Section 10.5, the evaluation of the experiments, as well as the results obtained, are presented. Ultimately, Section 10.6 depicts conclusions and draws directions for future work.

10.2 Related Work

Tourism forecasting is a field with growing academic interest, as evidenced by the extensive and growing available literature. Traditional methods in forecasting tourism use statistical and econometric models relying on historical data. However, these methods lack accuracy, as they focus on long-term horizons. A solution to this problem could be to utilize data on a monthly, weekly or daily basis, which enhances short-term forecasting [181]. Initially, authors in [3] propose the FSS, a forecasting support system which consists of "a set of procedures (typically computer-based) that supports forecasting". Also, the Baidu Index is used in [73] to propose a novel technique for predicting tourist flows; Baidu is a search engine whose index stores the history of the use of different keywords in online searching. Furthermore, the work introduced in [146] portrays a novel hybrid intelligent model called the Modular Genetic-Fuzzy Forecasting System. Other previous works proposing cloud-based architectures based on Apache Spark are [7] and [152]. Related studies on forecasting tourism demand are those of Mozambique and Portugal conducted in [30] and [160], respectively. In [168], a NoSQL database approach for modeling heterogeneous and semi-structured information by integrating Apache Spark with Apache Cassandra was presented; the authors focus on a model capable of predicting the relationship between tourist arrivals and nights spent in Greece.

10.3 Preliminaries

10.3.1 Forecasting Tourism Methods

Tourism demand forecasting methods can be broadly categorized into two groups, namely qualitative and quantitative methods. Qualitative methods usually depend on intuition, experience and insight into a specific tourism market. On the other hand, quantitative methods are statistical methods with a mathematical base. As presented in [155], quantitative methods can be distinguished into time-series models, econometric models, and Artificial Intelligence (AI) models. Time-series and econometric models are well-adopted as far as tourism forecasting is concerned. AI models correspond to machine learning methods appropriate for forecasting tourism demand. A rough set approach was selected there in order to improve the comprehensibility of the tourism demand model.

10.3.2 Apache Spark

Apache Spark [188] was developed at UC Berkeley's AMPLab¹ and is commonly used for big data processing. This distributed open-source processing system is a fast, optimized engine that offers APIs in Java, Scala, Python and R. It can run standalone or over Hadoop or Mesos, and it can access data sources like HDFS, Cassandra and HBase. The Apache Spark architecture includes Spark SQL and DataFrame operations. Spark's distributed data-sharing abstraction is called Resilient Distributed Datasets (RDDs). RDDs are fault-tolerant collections of objects partitioned across a cluster that can be queried in parallel and used in a variety of workload types. They provide a flexible interface and, unlike the existing DataFrame APIs in R and Python, DataFrame operations in Spark SQL go through an extensible relational optimizer called Catalyst. Finally, Spark is built from the ground up for performance and reliability and takes advantage of the operational and debugging tools developed for the Java stack [141].
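As an illustration of these abstractions (the listings later in this chapter use the Scala API; the following sketch uses the equivalent PySpark API on toy data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-dataframe-demo").getOrCreate()

# RDDs: fault-tolerant, partitioned collections processed in parallel.
rdd = spark.sparkContext.parallelize(range(10), numSlices=4)
print(rdd.map(lambda x: x * 2).reduce(lambda a, b: a + b))

# DataFrame operations are planned through the Catalyst optimizer.
df = spark.createDataFrame([(2014, 5000), (2015, 6500)], ["year", "arrivals"])
df.filter(df.year > 2014).show()

spark.stop()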

10.3.3 Machine Learning Algorithm

The core system of Spark consists of different libraries and components that provide a rich set of higher-level tools, including MLlib² for Machine Learning. The Decision Tree (DT) algorithm is a supervised method that aims to discover any relationship between input and target attributes. This relationship is represented in a structure known as a model. The input attributes of the model (independent variables) can be either categorical or continuous, determining whether the DT model is a Classification or a Regression Tree, respectively.

¹ https://spark.apache.org/
² https://spark.apache.org/mllib/

Optimal DT algorithms are feasible only for small problems. The solution is the use of heuristic methods, which are divided into two groups, top-down and bottom-up, where the former are preferred according to the literature. ID3, C4.5 and CART belong to this group of heuristic methods [139]. The core algorithm takes an input with three basic parameters. Initially, the data parameter consists of the initial set of tuples and the relevant values of the target variables. Then, the attributes parameter, which refers to the total input variables, is considered. Finally, the default category is the majority value of data in each recursive run of the main algorithm. The selection of the choose-attribute procedure is vital, as the success of the results depends on its outcome. The procedure is differentiated according to the type and form of the attributes; for categorical variables, such as gender, the choose-attribute procedure implements evaluation algorithms such as the Gini Index, Entropy, and Chi-Square (χ²). On the other hand, in the case of a continuous variable, such as height, the evaluation algorithm that must be used is Reduction in Variance.

Algorithm 14 Decision Tree (DT) algorithm
1: input data, attributes, default category
2: output Decision Tree
3: if data is empty then
4:     return default category
5: else if each data belongs to the same category then
6:     return category
7: else if attributes is empty then
8:     return majority-class(data)
9: else
10:    best = choose-attribute(attributes, data)
11: end if
12: tree = new Decision Tree checking the root via best attribute
13: m = majority-value(data)
14: for all Vi ∈ best do
15:    datai = data where Vi = best
16: end for
17: subtree = Decision Tree(datai, attributes without best, m)
18: add subtree as leaf to the tree labeled by Vi
19: return tree

The categorical-variable metrics that are taken into consideration are the following:

• Gini index (or population diversity) is a measure of impurity which represents the accuracy of the random-guess classifier.

I_G(P) = 1 - \sum_{i=1}^{N} P_i^2    (10.1)

• Entropy (or information gain) is a measure of impurity that comes from information theory and represents the uncertainty when predicting in which subset a data item falls.

I_E(P) = \sum_{i=1}^{N} P_i \log\left(\frac{1}{P_i}\right) = -\sum_{i=1}^{N} P_i \log(P_i)    (10.2)

• Chi-square test is used for testing the independence between the occurrences of two particular concepts.

\chi^2(x) = \sqrt{\frac{(x - \mathrm{expected}(x))^2}{\mathrm{expected}(x)}}    (10.3)

On the other hand, as a non-categorical-variable metric, we considered the following:

• Reduction in Variance is a method to determine the optimal fit by measuring whether or not a split will result in a reduction of variance within the data.
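A minimal sketch of the impurity computations of Equations 10.1-10.3 in Python is given below; the probability vectors and the observed/expected counts are toy inputs.

import math

def gini_index(probabilities):
    """Gini impurity, Equation 10.1: 1 - sum_i P_i^2."""
    return 1.0 - sum(p * p for p in probabilities)

def entropy(probabilities):
    """Entropy, Equation 10.2: -sum_i P_i log(P_i)."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

def chi_square(observed, expected):
    """Chi-square statistic for one cell, as in Equation 10.3."""
    return math.sqrt((observed - expected) ** 2 / expected)

print(gini_index([0.5, 0.5]), entropy([0.5, 0.5]), chi_square(30, 25))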

10.4 Implementation

10.4.1 Methodology

This work uses statistical and machine-learning techniques on large volumes of unstructured and/or structured data in a distributed computing environment, with the aim of identifying correlations, causal relationships, patterns and anomalies, predicting events, as well as inferring probabilities, interest and sentiment [32]. The KDD process utilizes Data Mining methods to extract what is deemed knowledge, according to the specification of measures and thresholds, using a database along with any required preprocessing, subsampling and transformation of the database [5]. In 2016, the KDDS process was introduced in order to address big data problems while providing further integration with management processes. KDDS defines four distinct phases [57]:

• Assess: The first phase is a planning, analysis of alternatives and rough order of magnitude estimation process.

• Architect: The second phase consists of translating the requirements into a solution for a new system, or into a way of addressing the gaps from the current state to the final state that will satisfy the requirements for an enhanced system.

• Build: The third phase consists of the development, test and deployment of the technical solution.

• Improve: The final phase consists of the operation and management of the system, as well as an analysis of innovative ways in which the system performance could be improved.

Our proposed method follows the KDDS process as a guideline methodology for tourism forecasting in Greece. Since our motivation stems from our interest in a model supporting big data technologies, our method can be extended to include more countries. When executing a data science project, the challenge does not merely lie in choosing the best algorithm and the best application [143].

10.4.2 Dataset Description

The collection and analysis of statistical data from tourism are of great importance. In Europe, and especially in Greece, the need for systematically monitoring the status of European tourism has led to Regulation 692/2011, which develops a common framework for the processing and exchange of European statistics on tourist supply and demand. Also, the relationship between tourism demand and various macroeconomic variables, such as GDP, income and Net Disposable Income (NDI), is established in the international literature [182]. So, these variables, shown in Table 10.1 and publicly available online, are used as explanatory variables in our research. The final dataset includes variables of monthly frequency, as well as rarely used variables such as Google Trends, Currency Exchange Rates and the Stock Market Index.

Table 10.1 Data Sources

Organization                                          URL
Hellenic Statistical Authority (EL.STAT.)             http://www.statistics.gr
Eurostat                                              https://ec.europa.eu/commission
United Nations World Tourism Organization (UNWTO)     http://www2.unwto.org
Google Trends                                         https://trends.google.com/trends

In order to conduct the experiments of a specific case study, data for the period 2006 to 2015 were used (Figure 10.1); tourist arrivals up to the year 2018 were to be predicted from a total of 20 variables. The heterogeneity among the different data sources required an initialization process so that the data would be usable for our analysis.

Fig. 10.1 Tourist Arrivals in Greece (2006 - 2015)

10.5 Experiments - Evaluation

Initially, the dataset is randomly split into two subsets: 90% forms the training set and the remaining 10% the test set.

val Array(trainData, testData) = data.randomSplit(Array(0.9, 0.1))
trainData.cache()
testData.cache()

Listing 10.1: Dataset Split

In order for the input data to be used with any classifier in Spark MLlib, the data must be collected in one column, in contrast to the input DataFrame, which has one column for each feature. The solution is VectorAssembler, a Spark core class which outputs a feature vector where the target column is excluded.

import org.apache.spark.ml.feature.VectorAssembler

val inputCols = trainData.columns.filter(_ != "Tourist Arrivals")
val assembler = new VectorAssembler().
  setInputCols(inputCols).setOutputCol("featureVector")
val assembledTrainData = assembler.transform(trainData)
assembledTrainData.select("featureVector").show(truncate = false)

Listing 10.2: Vector Assembler

In the following, the output gives the Decision Tree constructed by using the DecisionTreeClassifier from Spark MLlib.

import org.apache.spark.ml.classification.DecisionTreeClassifier
import scala.util.Random

val classifier = new DecisionTreeClassifier().
  setSeed(Random.nextLong()).
  setLabelCol("Tourist Arrivals").
  setFeaturesCol("featureVector").
  setPredictionCol("Predictions")
val model = classifier.fit(assembledTrainData)

Listing 10.3: Decision Tree Classifier

The predictions show a remarkable deviation from the real values, as shown in Figure 10.2 for the years 2014 and 2015. This is because the DecisionTreeClassifier implementation has several hyperparameters for which a value must be chosen; in our experiments, the default values were used.

val predictions = model.transform(assembledTrainData)
predictions.select("Predictions").show(truncate = false)

Listing 10.4: Data Prediction

The MulticlassClassificationEvaluator is used to compute the accuracy and other metrics of the predictions.

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val evaluator = new MulticlassClassificationEvaluator().
  setLabelCol("Tourist Arrivals").setPredictionCol("Predictions")
evaluator.setMetricName("accuracy").evaluate(predictions)
evaluator.setMetricName("f1").evaluate(predictions)

Listing 10.5: Multiclass Classification Evaluator

Our method achieves an accuracy of 75%, which means that it can be improved by using appropriate hyperparameters. Hyperparameter selection constitutes a major task that can be performed in future research. The goal is not merely to build a classifier but to make appropriate predictions [141]; so, finding the "best model" is only the beginning. The model consists of a group of operations that transform the input into the appropriate DataFrame and then make predictions. In our research, the predictions for the future are illustrated in Figure 10.2.
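For illustration, a hedged PySpark counterpart of Listing 10.3 with these hyperparameters set explicitly is sketched below; the specific values are hypothetical, not the ones tuned in this work.

from pyspark.ml.classification import DecisionTreeClassifier

# Spark MLlib exposes the hyperparameters discussed above; here they are set
# explicitly (illustrative values) instead of relying on the defaults.
tuned = DecisionTreeClassifier(
    labelCol="Tourist Arrivals",
    featuresCol="featureVector",
    predictionCol="Predictions",
    maxDepth=10,         # maximum depth of the tree
    maxBins=64,          # maximum number of bins when discretizing features
    impurity="entropy",  # impurity measure: "gini" or "entropy"
    minInfoGain=0.01)    # minimum information gain required for a split
# model = tuned.fit(assembledTrainData)  # assembledTrainData as in Listing 10.2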

Fig. 10.2 Predictions (2014 - 2018)

10.6 Conclusions and Future Work

In this chapter, a methodology for forecasting tourism demand by modeling unstructured data using Apache Spark and Spark MLlib was proposed. The dataset was constructed from publicly available websites such as the Hellenic Statistical Authority, Eurostat, and Google Trends. The model used to support the proposed methodology is a Decision Tree with the default values for the hyperparameters. The results of the methodology's application were quite satisfactory and can be further improved by providing the appropriate tuning parameters. Moreover, the usage of the appropriate metrics according to the type of the variables used can be tested in combination with better tuning as far as the hyperparameters are concerned. The hyperparameters control how the tree's decisions are chosen and can be quite different taking into account the maximum depth, the maximum number of bins, the impurity measure, as well as the minimum information gain.

Regarding future directions, different machine learning techniques can be utilized to improve the forecasting accuracy for tourism demand. A different approach, such as Support Vector Machines (SVMs), introduced in [25], can be utilized. Also, the use of more relevant explanatory variables could help improve the forecasting accuracy as well as the performance of the model. In this task, the contribution of tourism stakeholders is necessary, as they have more information and can propose variables such as marketing campaigns and useful social media data. For further enrichment of our data, social media analytics can also be included in the constructed dataset. Finally, another issue concerns the development of a clustered system where big-data-related techniques will be utilized to further forecast tourist demand in even more countries.

REFERENCES

[1] Mohammad Ahmadian, Frank Plochan, Zak Roessler, and Dan C. Marinescu. Securenosql: An approach for secure search of encrypted nosql databases in the public cloud. International Journal of Information Management, 37(2):63–74, 2017.

[2] Louai Alarabi. Summit: A scalable system for massive trajectory data management. In 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL), pages 612–613, 2018.

[3] Jon Scott Armstrong. Principles of Forecasting: A Handbook for Researchers and Practitioners, volume 30. Springer Science and Business Media, 2001.

[4] Alessandro Attanasi, Andrea Cavagna, Lorenzo Del Castello, Irene Giardina, Stefania Melillo, Leonardo Parisi, Oliver Pohl, Bruno Rossaro, Edward Shen, Edmondo Silvestri, et al. Collective behaviour without collective order in wild swarms of midges. PLoS computational biology, 10(7):e1003697, 2014.

[5] Ana Azevedo and Manuel Filipe Santos. KDD, SEMMA and CRISP-DM: a parallel overview. In IADIS European Conference on Data Mining, pages 182–185, 2008.

[6] Alexandros Baltas, Andreas Kanavos, and Athanasios Tsakalidis. An apache spark implementation for sentiment analysis on twitter data. In International Workshop on Algorithmic Aspects of Cloud Computing (ALGOCLOUD), pages 15–25, 2016.

[7] Alexandros Baltas, Andreas Kanavos, and Athanasios Tsakalidis. An apache spark implementation for sentiment analysis on twitter data. In 1st International Workshop on Algorithmic Aspects of Cloud Computing (ALGOCLOUD), pages 15–25, 2016.

[8] Endre Bangerter, Jan Camenisch, and Anna Lysyanskaya. A cryptographic framework for the controlled release of certified data. In 12th International Workshop on Security Protocols, volume 3957, pages 20–42, 2004.

[9] Luciano Barbosa and Junlan Feng. Robust sentiment detection on twitter from biased and noisy data. In International Conference on Computational Linguistics: Posters (COLING), pages 36–44, 2010.

[10] Anirban Basu, Anna Monreale, Juan Camilo Corena, Fosca Giannotti, Dino Pedreschi, Shinsaku Kiyomoto, Yutaka Miyake, Tadashi Yanagihara, and Roberto Trasarti. A privacy risk model for trajectory data. In 8th IFIP International Conference on Trust Management, pages 125–140, 2014.

[11] Rohan Baxter, Peter Christen, Tim Churches, et al. A comparison of fast blocking methods for record linkage. In ACM SIGKDD, volume 3, pages 25–27, 2003.

[12] Adam Bermingham and Alan F. Smeaton. Classifying sentiment in microblogs: is brevity an advantage? In ACM Conference on Information and Knowledge Management (CIKM), pages 1833–1836, 2010.

[13] Elisa Bertino and Ravi S. Sandhu. Database security - concepts, approaches, and challenges. IEEE Transactions on Dependable and Secure Computing, 2(1):2–19, 2005.

[14] Ankita Bhatewara and Kalyani Waghmare. Improving network scalability using nosql database. International Journal of Advanced Computer Research, 2(4):488, 2012.

[15] David M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.

[16] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[17] Burton H Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, 1970.

[18] James Blustein and Amal El-Maazawi. Bloom filters: A tutorial, analysis and survey, 2002.

[19] Stefan Brands, Liesje Demuynck, and Bart De Decker. A practical system for globally revoking the unlinkable pseudonyms of unknown users. In 12th Australasian Conference on Information Security and Privacy (ACISP), volume 4586, pages 400–415, 2007.

[20] Eric A. Brewer. Towards robust distributed systems. In 19th Annual ACM Symposium on Principles of Distributed Computing, page 7, 2000.

[21] Andrei Z. Broder and Michael Mitzenmacher. Network applications of bloom filters: A survey. Internet Mathematics, 1(4):485–509, 2003.

[22] Adrian P Brown, Christian Borgs, Sean M Randall, and Rainer Schnell. Evaluating privacy-preserving record linkage using cryptographic long-term keys and multibit trees on large medical datasets. BMC Medical Informatics and Decision Making, 17(1):83, 2017.

[23] Zeynel Cebeci and Figen Yildiz. Comparison of k-means and fuzzy c-means algorithms on different cluster structures. Agrárinformatika / Journal of Agricultural Informatics, 6(3):13–23, 2015.

[24] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26(2):4:1–4:26, 2008.

[25] Kuan-Yu Chen and Cheng-Hua Wang. Support vector regression with genetic algorithms in forecasting tourism demand. Tourism Management, 28(1):215–226, 2007.

[26] Peter Christen. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer Science and Business Media, 2012.

[27] Kenneth J. Christensen, Allen Roginsky, and Miguel Jimeno. A new analysis of the false-positive rate of a bloom filter. Information Processing Letters, 110(21):944–949, 2010.

[28] Chris Clifton, Murat Kantarcioglu, AnHai Doan, Gunther Schadow, Jaideep Vaidya, Ahmed K. Elmagarmid, and Dan Suciu. Privacy-preserving data integration and sharing. In SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD), pages 19–26, 2004.

[29] Pietro Colombo and Elena Ferrari. Fine-grained access control within nosql document-oriented datastores. Data Science and Engineering, 1(3):127–138, 2016.

[30] HA Constantino, Paula O Fernandes, and João Paulo Teixeira. Tourism demand modelling and forecasting with artificial neural network models: the Mozambique case study. Tékhne, 14(2):113–124, 2016.

[31] Alfredo Cuzzocrea. Algorithms for managing, querying and processing big data in cloud environments, 2016.

[32] Manirupa Das, Renhao Cui, David R. Campbell, Gagan Agrawal, and Rajiv Ramnath. Towards methods for systematic research on big data. In IEEE International Conference on Big Data, pages 2072–2081, 2015.

[33] Dmitry Davidov, Oren Tsur, and Ari Rappoport. Enhanced sentiment learning using twitter hashtags and smileys. In International Conference on Computational Linguistics: Posters (COLING), pages 241–249, 2010.

[34] Ali Davoudian, Liu Chen, and Mengchi Liu. A survey on nosql stores. ACM Computing Surveys, 51(2):40:1–40:43, 2018.

[35] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's highly available key-value store. In 21st ACM Symposium on Operating Systems Principles (SOSP), pages 205–220, 2007.

[36] Anthony I Dell, John A Bender, Kristin Branson, Iain D Couzin, Gonzalo G de Polavieja, Lucas PJJ Noldus, Alfonso Pérez-Escudero, Pietro Perona, Andrew D Straw, Martin Wikelski, et al. Automated image-based tracking and its application in ecology. Trends in Ecology and Evolution, 29(7):417–428, 2014.

[37] Tianyang Dong, Yuan Lulu, Yuehui Shang, Yang Ye, and Ling Zhang. Direction-aware continuous moving k-nearest-neighbor query in road networks. ISPRS International Journal of Geo-Information, 8(9):379, 2019.

[38] Elias Dritsas, Andreas Kanavos, Maria Trigka, Spyros Sioutas, and Athanasios K. Tsakalidis. Storage efficient trajectory clustering and k-nn for robust privacy preserving spatio-temporal databases. Algorithms, 12(12):266, 2019.

[39] Elias Dritsas, Maria Trigka, Panagiotis Gerolymatos, and Spyros Sioutas. Trajectory clustering and k-nn for robust privacy preserving spatiotemporal databases. Algorithms, 11(12):207, 2018.

[40] Elias Dritsas, Maria Trigka, Panagiotis Gerolymatos, and Spyros Sioutas. Trajectory clustering and k-nn for robust privacy preserving spatiotemporal databases. Algorithms, 11(12):207, 2018.

[41] Elizabeth A Durham, Murat Kantarcioglu, Yuan Xue, Csaba Toth, Mehmet Kuzu, and Bradley Malin. Composite bloom filters for secure record linkage. IEEE Transactions on Knowledge and Data Engineering, 26(12):2956–2968, 2014.

[42] Elizabeth Ashley Durham. A framework for accurate, efficient private record linkage. PhD thesis, University of Texas at Dallas, 2012.

[43] Ahmed Eldawy, Louai Alarabi, and Mohamed F. Mokbel. Spatial partitioning techniques in spatialhadoop. Proceedings of the VLDB Endowment, 8(12):1602–1605, 2015.

[44] Cheikh Kacfah Emani, Nadine Cullot, and Christophe Nicolle. Understandable big data: A survey. Computer Science Review, 17:70–81, 2015.

[45] Ping Fan, Guohui Li, and Ling Yuan. Continuous k-nearest neighbor processing based on speed and direction of moving objects in a road network. Telecommunication Systems, 55(3):403–419, 2014.

[46] Ping Fan, Guohui Li, Ling Yuan, and Yanhong Li. Vague continuous k-nearest neighbor queries over moving objects with uncertain velocity in road networks. Information Systems, 37(1):13–32, 2012.

[47] Ivan P Fellegi. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, 1969.

[48] Yu Feng, Jianzhong Zhou, and Muhammad Tayyab. Kernel clustering with a differential harmony search algorithm for scheme classification. Algorithms, 10(1):14, 2017.

[49] Zhenni Feng and Yanmin Zhu. A survey on trajectory data mining: Techniques and applications. IEEE Access, 4:2056–2067, 2016.

[50] Salvador García, Julián Luengo, and Francisco Herrera. Data Preprocessing in Data Mining, volume 72 of Intelligent Systems Reference Library. Springer, 2015.

[51] Francisco García-García, Antonio Corral, Luis Iribarne, and Michael Vassilakopoulos. Improving distance-join query processing with voronoi-diagram based partitioning in spatialhadoop. Future Generation Computer Systems, 2019.

[52] Daniel Gayo-Avello, Panagiotis Takis Metaxas, and Eni Mustafaraj. Limits of electoral predictions using twitter. In International Conference on Weblogs and Social Media (ICWSM), 2011.

[53] Panagiotis Gerolymatos, Spyros Sioutas, Nikolaos Nodarakis, Alexandros Panaretos, and Konstantinos Tsakalidis. Smart: A novel framework for addressing range queries over nonlinear trajectories. Journal of Systems and Software (JSS), 105:79–90, 2015.

[54] Konstantinos Giannousis, Konstantina Bereta, Nikolaos Karalis, and Manolis Koubarakis. Distributed execution of spatial SQL queries. In IEEE International Conference on Big Data, pages 528–533, 2018.

[55] Seth Gilbert and Nancy A. Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News, 33(2):51–59, 2002.

[56] Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, pages 1–6, 2009.

[57] Nancy W. Grady. KDD meets big data. In IEEE International Conference on Big Data, pages 1603–1608, 2016.

[58] Thomas L. Griffiths. Gibbs sampling in the generative model of latent dirichlet allocation. 2002.

[59] Thomas L. Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235, 2004.

[60] Joachim Gudmundsson, Jyrki Katajainen, Damian Merrick, Cahya Ong, and Thomas Wolle. Compressing spatio-temporal trajectories. Computational Geometry, 42(9):825–841, 2009.

[61] Manoj Kumar Gupta and Pravin Chandra. An empirical evaluation of like operator in oracle. BVICAM's International Journal of Information Technology, 3(2), 2011.

[62] Neha Gupta and Rashmi Agrawal. Chapter four - nosql security. Advances in Computers, 109:101–132, 2018.

[63] Ralf Hartmut Güting, Thomas Behr, and Jianqiu Xu. Efficient k-nearest neighbor search on moving object trajectories. The VLDB Journal, 19(5):687–714, 2010.

[64] Juan Carlos Guzman, Patricia Melin, and German Prado-Arechiga. Design of an optimized fuzzy classifier for the diagnosis of blood pressure with a new computational method for expert rule optimization. Algorithms, 10(3):79, 2017.

[65] Emma Haddi, Xiaohui Liu, and Yong Shi. The role of text pre-processing in sentiment analysis. In 1st International Conference on Information Technology and Quantitative Management (ITQM), pages 26–32, 2013.

[66] Stefan Hagedorn, Philipp Götze, and Kai-Uwe Sattler. The STARK framework for spatio-temporal data analytics on spark. In 17th Conference on Database Systems for Business, Technology, and Web (BTW), volume P-265, pages 123–142, 2017.

[67] Stefan Hagedorn and Timo Räth. Efficient spatio-temporal event processing with STARK. In 20th International Conference on Extending Database Technology (EDBT), pages 570–573, 2017.

[68] Jiawei Han, Jian Pei, and Micheline Kamber. Data mining: concepts and techniques. 2011.

[69] A. S. M. Touhidul Hasan, Qiang Qu, Chengming Li, Lifei Chen, and Qingshan Jiang. An effective privacy architecture to preserve user trajectories in reward-based LBS applications. ISPRS International Journal of Geo-Information.

[70] Xiaoqi He, Sheng Zhang, and Yangguang Liu. An adaptive spectral clustering algorithm based on the importance of shared nearest neighbors. Algorithms, 8(2):177–189, 2015.

[71] Lasanthi Heendaliya, Dan Lin, and Ali R. Hurson. Continuous predictive line queries for on-the-go traffic estimation. Transactions on Large-Scale Data- and Knowledge-Centered Systems, 18:80–114, 2015.

[72] Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177, 2004.

[73] Xiankai Huang, Lifeng Zhang, and Yusi Ding. The baidu index: Uses in predicting tourism flows - a case study of the forbidden city. Tourism Management, 58:301–306, 2017.

[74] Yuan-Ko Huang. Processing knn queries in grid-based sensor networks. Algorithms, 7(4):582–596, 2014.

[75] Yuan-Ko Huang. Processing knn queries in grid-based sensor networks. Algorithms, 7(4):582–596, 2014.

[76] Yuan-Ko Huang, Zhi-Wei Chen, and Chiang Lee. Continuous k-nearest neighbor query over moving objects in road networks. In Joint International Conferences on Advances in Data and Web Management (APWeb/WAIM), pages 27–38, 2009.

[77] Yuan-Ko Huang and Chiang Lee. Efficient evaluation of continuous spatio-temporal queries on moving objects with uncertain velocity. GeoInformatica, 14(2):163–200, 2010.

[78] Zhou Huang, Yiran Chen, Lin Wan, and Xia Peng. Geospark SQL: an effective framework enabling spatial queries on spark. ISPRS International Journal of Geo-Information, 6(9):285, 2017.

[79] Venkata Narasimha Inukollu, Sailaja Arsi, and Srinivasa Rao Ravuri. Security issues associated with big data in cloud computing. International Journal of Network Security and Its Applications (IJNSA), 6(3):45, 2014.

[80] Nishtha Jatana, Sahil Puri, Mehak Ahuja, Ishita Kathuria, and Dishant Gosain. A survey and comparison of relational and non-relational database. International Journal of Engineering Research and Technology (IJERT), 1(6):1–5, 2012.

[81] Fan Jiang and Carson Kai-Sang Leung. A data analytic algorithm for managing, querying, and processing uncertain big data in cloud environments. Algorithms, 8(4):1175–1194, 2015.

[82] Long Jiang, Mo Yu, Ming Zhou, Xiaohua Liu, and Tiejun Zhao. Target-dependent twitter sentiment classification. In Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 151–160, 2011.

[83] Prudence Kadebu and Innocent Mapanga. A security requirements perspective towards a secured nosql database environment. In International Conference of Advance Research and Innovation (ICARI), 2014.

[84] A. Kanavos, I. Perikos, I. Hatzilygeroudis, and A. Tsakalidis. Integrating user’s emotional behavior for community detection in social networks. In International Conference on Web Information Systems and Technologies (WEBIST), pages 355–362, 2016.

[85] A. Kanavos, I. Perikos, I. Hatzilygeroudis, and A. Tsakalidis. Emotional community detection in social networks. Computers and Electrical Engineering, 65:449–460, 2018.

[86] A. Kanavos, I. Perikos, P. Vikatos, I. Hatzilygeroudis, C. Makris, and A. Tsakalidis. Conversation emotional modeling in social networks. In IEEE International Conference on Tools with Artificial Intelligence (ICTAI), pages 478–484, 2014.

[87] Andreas Kanavos, Nikolaos Nodarakis, Spyros Sioutas, Athanasios Tsakalidis, Dimitrios Tsolis, and Giannis Tzimas. Large scale implementations for twitter sentiment classification. Algorithms, 10(1):33, 2017.

[88] Andreas Kanavos, Isidoros Perikos, Ioannis Hatzilygeroudis, and Athanasios Tsakalidis. Integrating user's emotional behavior for community detection in social networks. In International Conference on Web Information Systems and Technologies (WEBIST), pages 355–362, 2016.

[89] Andreas Kanavos, Isidoros Perikos, Ioannis Hatzilygeroudis, and Athanasios Tsakalidis. Emotional community detection in social networks. Computers and Electrical Engineering, 65:449–460, 2018.

[90] Andreas Kanavos, Isidoros Perikos, Pantelis Vikatos, Ioannis Hatzilygeroudis, Christos Makris, and Athanasios Tsakalidis. Conversation emotional modeling in social networks. In IEEE International Conference on Tools with Artificial Intelligence (ICTAI), pages 478–484, 2014.

[91] Athanasios Kaplanis, Marios Kendea, Spyros Sioutas, Christos Makris, and Giannis Tzimas. Hb+ tree: use hadoop and hbase even your data isn't that big. In Proceedings of the 30th Annual ACM Symposium on Applied Computing, pages 973–980, 2015.

[92] Dimitrios Karapiperis and Vassilios S Verykios. An lsh-based blocking approach with a homomorphic matching technique for privacy-preserving record linkage. IEEE Transactions on Knowledge and Data Engineering, 27(4):909–921, 2015.

[93] Dimitrios Karapiperis and Vassilios S Verykios. A fast and efficient hamming lsh-based scheme for accurate linkage. Knowledge and Information Systems, 49(3):861–884, 2016.

[94] Ioannis Kavakiotis, Olga Tsave, Athanasios Salifoglou, Nicos Maglaveras, Ioannis Vlahavas, and Ioanna Chouvarda. Machine learning and data mining methods in diabetes research. Computational and Structural Biotechnology Journal, 15:104–116, 2017.

[95] Marios Kendea, Vassiliki Gkantouna, Angeliki Rapti, Spyros Sioutas, Giannis Tzimas, and Dimitrios Tsolis. Graph dbs vs. column-oriented stores: A pure performance comparison. In International Workshop on Algorithmic Aspects of Cloud Computing, pages 62–74. Springer, 2015.

[96] Majid Khan and M. N. A. Khan. Exploring query optimization techniques in relational databases. International Journal of Database Theory and Application, 6(3):11–20, 2013.

[97] Vishal A. Kharde and Sheetal Sonawane. Sentiment analysis of twitter data: A survey of techniques. International Journal of Computer Applications, 139(11), 2016.

[98] Won Kim. On optimizing an sql-like nested query. ACM Transactions on Database Systems (TODS), 7(3):443–469, 1982.

[99] Adam Kirsch and Michael Mitzenmacher. Less hashing, same performance: Building a better bloom filter. In Annual European Symposium on Algorithms (ESA), pages 456–467, 2006.

[100] Christine Körner, Michael May, and Stefan Wrobel. Spatiotemporal modeling and analysis - introduction and overview. Künstliche Intelligenz (KI), 26(3):215–221, 2012.

[101] Akshi Kumar and Teeja Sebastian. Sentiment analysis on twitter. IJCSI International Journal of Computer Science Issues, 9(3):372–378, 2012.

[102] Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger. From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 957–966, 2015.

[103] Per-Åke Larson, Cipri Clinciu, Eric N. Hanson, Artem Oks, Susan L. Price, Srikumar Rangarajan, Aleksandras Surna, and Qingqing Zhou. SQL server column store indexes. In ACM SIGMOD International Conference on Management of Data, pages 1177–1184, 2011.

[104] Neal Leavitt. Will nosql databases live up to their promise? IEEE Computer, 43(2):12–14, 2010.

[105] Jae-Gil Lee, Jiawei Han, and Kyu-Young Whang. Trajectory clustering: A partition-and-group framework. In ACM SIGMOD International Conference on Management of Data, pages 593–604, 2007.

[106] Xiucheng Li, Kaiqi Zhao, Gao Cong, Christian S. Jensen, and Wei Wei. Deep representation learning for trajectory similarity computation. In 34th IEEE International Conference on Data Engineering (ICDE), pages 617–628, 2018.

[107] Bing Liu. Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies. Morgan and Claypool Publishers, 2012.

[108] Bing Liu and Lei Zhang. A survey of opinion mining and sentiment analysis. In Mining Text Data, pages 415–463, 2012.

[109] Securosis LLC. Securing big data: Security recommendations for hadoop and nosql environments. 2012.

[110] Michael J. Lyons and David M. Brooks. The design of a bloom filter hardware accelerator for ultra low power systems. In International Symposium on Low Power Electronics and Design, pages 371–376, 2009.

[111] Emmanouil Magkos, Panayiotis Kotzanikolaou, Marios Magioladitis, Spyros Sioutas, and Vassilios S Verykios. Towards secure and practical location privacy through private equality testing. In International Conference on Privacy in Statistical Databases, pages 312–325. Springer, 2014.

[112] Jesús Maillo, Isaac Triguero, and Francisco Herrera. A mapreduce-based k-nearest neighbor approach for big data classification. In 2015 IEEE Trustcom/BigDataSE/ISPA, volume 2, pages 167–172. IEEE, 2015.

[113] Yingchi Mao, Haishi Zhong, Hai Qi, Ping Ping, and Xiaofang Li. An adaptive trajectory clustering method based on grid and density in mobile pattern analysis. Sensors, 17(9):2013, 2017.

[114] Mohamed Mohamed, Obay G. Altrafi, and Owais Ismail. Relational vs. nosql databases: A survey. International Journal of Computer and Information Technology, 3(3):598–601, 2014.

[115] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. Adaptive Computation and Machine Learning series. MIT Press, 2012.

[116] Cory Nance, Travis Losser, Reenu Iype, and Gary Harmon. Nosql vs rdbms - why there is room for both. In Southern Association for Information Systems Conference, 2013.

[117] Sang Ni, Mengbo Xie, and Quan Qian. Clustering based k-anonymity algorithm for privacy preservation. IJ Network Security, 19(6):1062–1071, 2017.

[118] Ben Niu, Qinghua Li, Xiaoyan Zhu, Guohong Cao, and Hui Li. Achieving k-anonymity in privacy-aware location-based services. In INFOCOM, 2014 Proceedings IEEE, pages 754–762. IEEE, 2014.

[119] Nikolaos Nodarakis, Evaggelia Pitoura, Spyros Sioutas, Athanasios Tsakalidis, Dimitrios Tsoumakos, and Giannis Tzimas. kdann+: A rapid aknn classifier for big data. In Transactions on Large-Scale Data- and Knowledge-Centered Systems XXIV, pages 139–168. Springer, 2016.

[120] Nikolaos Nodarakis, Evaggelia Pitoura, Spyros Sioutas, Athanasios K. Tsakalidis, Dimitrios Tsoumakos, and Giannis Tzimas. kdann+: A rapid aknn classifier for big data. Transactions on Large-Scale Data- and Knowledge-Centered Systems, 24:139–168, 2016.

[121] Nikolaos Nodarakis, Spyros Sioutas, Athanasios Tsakalidis, and Giannis Tzimas. Using hadoop for large scale analysis on twitter: A technical report. arXiv preprint arXiv:1602.01248, 2016.

[122] Nikolaos Nodarakis, Spyros Sioutas, Athanasios K Tsakalidis, and Giannis Tzimas. Large scale sentiment analysis on twitter with spark. In EDBT/ICDT Workshops, pages 1–8, 2016.

[123] Brendan O’Connor, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith. From tweets to polls: Linking text sentiment to public opinion time series. In International Conference on Weblogs and Social Media (ICWSM), pages 122–129, 2010.

[124] Lior Okman, Nurit Gal-Oz, Yaron Gonen, Ehud Gudes, and Jenny Abramov. Security issues in nosql databases. In IEEE 10th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), pages 541–547, 2011.

[125] Tanty Oktavia and Surya Sujarwo. Evaluation of sub query performance in sql server. EPJ Web of Conferences, 68, 2014.

[126] Rabi Prasad Padhy, Manas Ranjan Patra, and Suresh Chandra Satapathy. Rdbms to nosql: Reviewing some next-generation non-relational database's. International Journal of Advances in Engineering, Science and Technology (IJAEST), 11(1):15–30, 2011.

[127] Alexander Pak and Patrick Paroubek. Twitter as a corpus for sentiment analysis and opinion mining. In International Conference on Language Resources and Evaluation (LREC), pages 1320–1326, 2010.

[128] Costas Panagiotakis, Nikos Pelekis, Ioannis Kopanakis, Emmanuel Ramasso, and Yannis Theodoridis. Segmentation and sampling of moving object trajectories based on representativeness. IEEE Transactions on Knowledge and Data Engineering (TKDE), 24(7):1328–1343, 2012.

[129] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135, 2008.

[130] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up? sentiment classification using machine learning techniques. In ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 79–86, 2002.

[131] Maria Patrou, Md Mahbub Alam, Puya Memarzia, Suprio Ray, Virendra C. Bhavsar, Kenneth B. Kent, and Gerhard W. Dueck. DISTIL: a distributed in-memory data processing system for location-based services. In 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 496–499, 2018.

[132] Erman Pattuk, Murat Kantarcioglu, Vaibhav Khadilkar, Huseyin Ulusoy, and Sharad Mehrotra. Bigsecret: A secure data management framework for key-value stores. In 6th IEEE International Conference on Cloud Computing, pages 147–154, 2013.

[133] Tao Peng, Qin Liu, Dacheng Meng, and Guojun Wang. Collaborative trajectory privacy preserving scheme in location-based services. Information Sciences, 387:165–179, 2017.

[134] Rishabh Poddar, Tobias Boelter, and Raluca Ada Popa. Arx: A strongly encrypted database system. IACR Cryptology ePrint Archive, 2016:591, 2016.

[135] Rogério Pontes, Francisco Maia, João Paulo, and Ricardo Manuel Pereira Vilaça. Saferegions: Performance evaluation of multi-party protocols on hbase. In 35th IEEE Symposium on Reliable Distributed Systems Workshops (SRDS), pages 31–36, 2016.

[136] Giorgos Poulis, Spiros Skiadopoulos, Grigorios Loukides, and Aris Gkoulalas-Divanis. Distance-based k^m-anonymization of trajectory data. In 14th IEEE International Conference on Mobile Data Management (MDM), pages 57–62, 2013.

[137] Raghu Ramakrishnan, Donko Donjerkovic, Arvind Ranganathan, Kevin S. Beyer, and Muralidhar Krishnaprasad. SRQL: sorted relational query language. In International Conference on Scientific and Statistical Database Management (SSDBM), pages 84–95, 1998.

[138] Jorge L Reyes-Ortiz, Luca Oneto, and Davide Anguita. Big data analytics in the cloud: Spark on hadoop vs mpi/openmp on beowulf. Procedia Computer Science, 53:121–130, 2015.

[139] Lior Rokach and Oded Maimon. Data Mining With Decision Trees: Theory and Applications. World Scientific Publishing Co. Pte. Ltd., 2015.

[140] Jelle Roozenburg. A literature survey on bloom filters. Research Assignment in Computer Science, 2005.

[141] Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills. Advanced Analytics with Spark: Patterns for Learning from Data at Scale. O'Reilly Media, Inc., 2017.

[142] Varsha Sahayak, Vijaya Shete, and Apashabi Pathan. Sentiment analysis on twitter data. International Journal of Innovative Research in Advanced Engineering (IJIRAE), 2, 2015.

[143] Jeffrey S. Saltz, Ivan Shamshurin, and Colin Connors. Predicting data science sociotechnical execution challenges by categorizing data science projects. Journal of the Association for Information Science and Technology (JASIST), 68(12):2720–2728, 2017.

[144] Rainer Schnell, Tobias Bachteler, and Jörg Reiher. Privacy-preserving record linkage using bloom filters. BMC medical informatics and decision making, 9(1):41, 2009.

[145] Rainer Schnell, Tobias Bachteler, and Jörg Reiher. A novel error-tolerant anonymous linking code. German Record Linkage Center, 2011.

[146] Jamal Shahrabi, Esmaeil Hadavandi, and Shahrokh Asadi. Developing a hybrid intelligent model for forecasting problems: Case study of tourism demand time series. Knowledge-Based Systems, 43:112–122, 2013.

[147] Hossain Shahriar and Hisham M. Haddad. Security vulnerabilities of nosql and sql databases for mooc applications. International Journal of Digital Society (IJDS), 8(1), 2017.

[148] Vatika Sharma and Meenu Dave. Sql and nosql databases. International Journal of Advanced Research in Computer Science and Software Engineering, 2(8), 2012.

[149] Shashi Shekhar, Zhe Jiang, Reem Y. Ali, Emre Eftelioglu, Xun Tang, Venkata M. V. Gunturi, and Xun Zhou. Spatiotemporal data mining: A computational perspective. ISPRS International Journal of Geo-Information, 4(4):2306–2338, 2015.

[150] Zhicheng Shi and Lilian S. C. Pun-Cheng. Spatiotemporal data clustering: A survey of methods. ISPRS International Journal of Geo-Information, 8(3):112, 2019.

[151] Spyros Sioutas, Phivos Mylonas, Alexandros Panaretos, Panagiotis Gerolymatos, Dimitrios Vogiatzis, Eleftherios Karavaras, Thomas Spitieris, and Andreas Kanavos. Survey of machine learning algorithms on spark over dht-based structures. In International Workshop on Algorithmic Aspects of Cloud Computing (ALGOCLOUD), pages 146–156, 2016.

[153] Spyros Sioutas, Konstantinos Tsakalidis, Kostas Tsichlas, Christos Makris, and Yannis Manolopoulos. A new approach on indexing mobile objects on the plane. Data and Knowledge Engineering, 67(3):362–380, 2008.

[154] Doohee Song, Jongwon Sim, Kwangjin Park, and Moonbae Song. A privacy-preserving continuous location monitoring system for location-based services. International Journal of Distributed Sensor Networks, 11(8):815613, 2015.

[155] Renchu Song, Weiwei Sun, Baihua Zheng, and Yu Zheng. PRESS: A novel framework of trajectory compression in road networks. PVLDB, 7(9):661–672, 2014.

[156] Jordi Soria-Comas, Josep Domingo-Ferrer, David Sánchez, and Sergio Martínez. Enhancing data utility in differential privacy via microaggregation-based k-anonymity. The VLDB Journal, 23(5):771–794, 2014.

[157] Penghui Sun, Shixiong Xia, Guan Yuan, and Daxing Li. An overview of moving object trajectory compression algorithms. Mathematical Problems in Engineering, 2016, 2016.

[159] Bunheang Tay, Jung Keun Hyun, and Sejong Oh. A machine learning approach for specification of spinal cord injuries using fractional anisotropy values obtained from diffusion tensor images. Computational and mathematical methods in medicine, 2014, 2014.

[160] João Paulo Teixeira and Paula Odete Fernandes. Tourism time series forecast - different ANN architectures with time index input. Procedia Technology, 5:445–454, 2012.

[161] Mike Thelwall, Kevan Buckley, and Georgios Paltoglou. Sentiment strength detection for the social web. Journal of the American Society for Information Science and Technology (JASIST), 63(1):163–173, 2012.

[162] Stanley Trepetin. Privacy-preserving string comparisons in record linkage systems: a review. Information Security Journal: A Global Perspective, 17(5-6):253–266, 2008.

[163] Andranik Tumasjan, Timm Oliver Sprenger, Philipp G. Sandner, and Isabell M. Welpe. Predicting elections with twitter: What 140 characters reveal about political sentiment. In Proceedings of the Fourth International Conference on Weblogs and Social Media (ICWSM), 2010.

[164] Peter D. Turney. Thumbs up or thumbs down? semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 417–424, 2002.

[165] Dinusha Vatsalan, Peter Christen, and Vassilios S. Verykios. A taxonomy of privacy-preserving record linkage techniques. Information Systems, 38(6):946–969, 2013.

[166] Vassilios S. Verykios, Maria Luisa Damiani, and Aris Gkoulalas-Divanis. Privacy and security in spatiotemporal data and trajectories. In Mobility, Data Mining and Privacy, pages 213–240. Springer, 2008.

[167] Chad Vicknair, Michael Macias, Zhendong Zhao, Xiaofei Nan, Yixin Chen, and Dawn Wilkins. A comparison of a graph database and a relational database: A data provenance perspective. In ACM Southeast Regional Conference, page 42, 2010.

[168] Gerasimos Vonitsanos, Andreas Kanavos, Phivos Mylonas, and Spyros Sioutas. A nosql database approach for modeling heterogeneous and semi-structured information. In 9th International Conference on Information, Intelligence, Systems and Applications (IISA), pages 1–8, 2018.

[169] Xiaolong Wang, Furu Wei, Xiaohua Liu, Ming Zhou, and Ming Zhang. Topic sentiment analysis in twitter: A graph-based hashtag sentiment classification approach. In ACM International Conference on Information and Knowledge Management (CIKM), pages 1031–1040, 2011.

[170] Janyce Wiebe, Theresa Wilson, and Claire Cardie. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2-3):165–210, 2005.

[171] Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. Recognizing contextual polarity in phrase-level sentiment analysis. In Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 347–354, 2005.

[172] Markus Winand. SQL Performance Explained: Everything Developers Need to Know about SQL Performance. M. Winand, 2012.

[173] Wei Wu, Udaya Parampalli, Jian Liu, and Ming Xian. Privacy preserving k-nearest neighbor classification over encrypted database in outsourced cloud environments. World Wide Web, 22(1):101–123, 2019.

[174] Yanbo Wu, Hong Shen, and Quan Z Sheng. A cloud-friendly rfid trajectory clustering algorithm in uncertain environments. IEEE Transactions on Parallel and Distributed Systems, 26(8):2075–2088, 2015.

[175] Yuqin Xie and Mingchun Zheng. A differentiated anonymity algorithm for social network privacy preservation. Algorithms, 9(4):85, 2016.

[176] Yang Xu, Tinghuai Ma, Meili Tang, and Wei Tian. A survey of privacy preserving data publishing using generalization and suppression. Applied Mathematics and Information Sciences, 8(3):1103, 2014.

[177] Sophia Yakoubov, Vijay Gadepally, Nabil Schear, Emily Shen, and Arkady Yerukhimovich. A survey of cryptographic approaches to securing big-data analytics in the cloud. In IEEE High Performance Extreme Computing Conference (HPEC), pages 1–6, 2014.

[178] Chaowei Yang, Manzhu Yu, Fei Hu, Yongyao Jiang, and Yun Li. Utilizing cloud computing to address big geospatial data challenges. Computers, Environment and Urban Systems, 61:120–128, 2017.

[179] Chong Yang, Xiaohui Yu, and Yang Liu. Continuous KNN join processing for real-time recommendation. In IEEE International Conference on Data Mining (ICDM), pages 640–649, 2014.

[180] Shumei Yang, Shaohua Tang, and Xiao Zhang. Privacy-preserving k-nearest neighbor query with authentication on road networks. Journal of Parallel and Distributed Computing, 134:25–36, 2019.

[181] Xin Yang, Bing Pan, James A Evans, and Benfu Lv. Forecasting chinese tourist volume with search engine data. Tourism Management, 46:386–397, 2015.

[182] Soheila Khoshnevis Yazdi and Bahman Khanalizadeh. Tourism demand: A panel data approach. Current Issues in Tourism, 20(8):787–800, 2017.

[183] Haina Ye, Xinzhou Cheng, Mingqiang Yuan, Lexi Xu, Jie Gao, and Chen Cheng. A survey of security and privacy in big data. In 16th International Symposium on Communications and Information Technologies (ISCIT), pages 268–272, 2016.

[184] Jia Yu, Zongsi Zhang, and Mohamed Sarwat. Spatial data management in apache spark: The geospark perspective and beyond. GeoInformatica, 23(1):37–78, 2019.

[185] Ziqiang Yu, Yang Liu, Xiaohui Yu, and Ken Q. Pu. Scalable distributed processing of K nearest neighbor queries over moving objects. IEEE Transactions on Knowledge and Data Engineering, 27(5):1383–1396, 2015.

[186] Guan Yuan, Penghui Sun, Jie Zhao, Daxing Li, and Canwei Wang. A review of moving object trajectory clustering algorithms. Artificial Intelligence Review, 47(1):123–144, 2017.

[188] Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. Apache spark: A unified engine for big data processing. Communications of the ACM, 59(11):56–65, 2016.

[189] Shaobo Zhang, Xinjun Mao, Kim-Kwang Raymond Choo, Tao Peng, and Guojun Wang. A trajectory privacy-preserving scheme based on dual-k mechanism for continuous location-based services. Information Sciences, 527:406–419, 2020.

[190] Wei Zhang, Clement Yu, and Weiyi Meng. Opinion retrieval from blogs. In ACM Conference on Information and Knowledge Management (CIKM), pages 831–840, 2007.

[191] Zhigang Zhang, Cheqing Jin, Jiali Mao, Xiaolin Yang, and Aoying Zhou. TrajSpark: A scalable and efficient in-memory management system for big trajectory data. In 1st International Joint Conference on Web and Big Data (APWeb-WAIM), pages 11–26, 2017.

[192] Jun Zhao, Wei Wang, and Chunyang Sheng. Data-Driven Prediction for Industrial Processes and Their Applications. Springer, 2018.

[193] Bolong Zheng, Kai Zheng, Xiaokui Xiao, Han Su, Hongzhi Yin, Xiaofang Zhou, and Guohui Li. Keyword-aware continuous knn query on road networks. In 32nd IEEE International Conference on Data Engineering (ICDE), pages 871–882, 2016.

[194] Pei-Yuan Zhou and Keith CC Chan. A model-based multivariate time series clustering algorithm. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pages 805–817. Springer, 2014.

[195] Li Zhuang, Feng Jing, and Zhu Xiao-Yan. Movie review mining and summarization. In ACM International Conference on Information and Knowledge Management (CIKM), pages 43–50, 2006.

[196] Andreas Züfle, Tobias Emrich, Klaus Arthur Schmid, Nikos Mamoulis, Arthur Zimek, and Matthias Renz. Representative clustering of uncertain data. In 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 243–252, 2014.

LIST OF PUBLICATIONS

International Scientific Conferences

1. Chioti, E., Dritsas, E., Kanavos, A., Liapakis, X., Sioutas, S. and Tsakalidis, A., 2017, August. Bloom Filters for Efficient Coupling Between Tables of a Database. In International Conference on Engineering Applications of Neural Networks (pp. 596-608). Springer, Cham.

2. Boussis, D., Dritsas, E., Kanavos, A., Sioutas, S., Tzimas, G. and Verykios, V.S., 2018, July. MapReduce Implementations for Privacy Preserving Record Linkage. In Proceedings of the 10th Hellenic Conference on Artificial Intelligence, pp. 1-4.

3. Vonitsanos, G., Kanavos, A., Dritsas, E., Mylonas, P. and Sioutas, S., 2020. Security and Privacy Solutions Associated with NoSQL Data Stores. In International Workshop on Semantic and Social Media Adaptation and Personalization (SMAP).

International Scientific Journals

1. Dritsas, E., Trigka, M., Gerolymatos, P. and Sioutas, S., 2018. Trajectory Clustering and k-NN for Robust Privacy Preserving Spatiotemporal Databases. Algorithms, 11(12), p.207.

2. Dritsas, E., Kanavos, A., Trigka, M., Sioutas, S. and Tsakalidis, A., 2019. Storage Efficient Trajectory Clustering and k-NN for Robust Privacy Preserving Spatio-Temporal Databases. Algorithms, 12(12), p.266.

3. Dritsas, E., Kanavos, A., Trigka, M., Vonitsanos, G., Sioutas, S. and Tsakalidis, A., 2020. Trajectory Clustering and k-NN for Robust Privacy Preserving k-NN Query Processing in GeoSpark. Algorithms, 13(8), p.182.

Additional International Scientific Conferences

1. Dritsas, E., Livieris, I.E., Giotopoulos, K. and Theodorakopoulos, L., 2018, November. An Apache Spark implementation for graph-based hashtag sentiment classification on Twitter. In Proceedings of the 22nd Pan-Hellenic Conference on Informatics, pp. 255-260.

2. Ntaliakouras, N., Vonitsanos, G., Kanavos, A. and Dritsas, E., 2019, July. An Apache Spark Methodology for Forecasting Tourism Demand in Greece. In 2019 10th International Conference on Information, Intelligence, Systems and Applications (IISA) (pp. 1-5). IEEE.

3. Dritsas, E., Vonitsanos, G., Livieris, I.E., Kanavos, A., Ilias, A., Makris, C. and Tsakalidis, A., 2019, May. Pre-processing Framework for Twitter Sentiment Classification. In IFIP International Conference on Artificial Intelligence Applications and Innovations (pp. 138-149). Springer, Cham.

Appendices

Appendix A

Matlab Code

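The first listing supports the experiments of Chapters 5 and 6. It expands the raw trajectory data loaded from matlab.mat (the variables data1 and data2 are assumed to hold one row per user and time instant) into a 3-D matrix that keeps, for each of the N = 400 mobile users, L = 100 trajectory points of the form (X, Y, U, TH). It then forms each user's anonymity set in two ways: a plain k-NN search over all users (MWCL) and a k-NN search restricted to the user's k-means cluster (MCL). The vulnerability of user m at time t is the reciprocal of the number of neighbor slots whose id is unchanged between consecutive instants, V_m(t) = 1/|{j : NN_t(m,j) = NN_{t-1}(m,j)}|, initialised to 1/nns; instants where no neighbor survives produce Inf and are zeroed out before the mean vulnerability of the two methods is plotted over time.
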
% Chapter 5-6
clear all; clc;

load matlab.mat;

N = 400; L = 100; %e = 5;

%% EXPAND DATASET
a = repmat(1:N,[L 1]);
a = a(:);
b = data1;
b(:,1) = a;
%PP = [data7(1:N*L,:) data8(1:N*L,:)];
PP = [b data2];

%% MAKE A 3-D MATRIX WHICH KEEPS THE L TRAJECTORY POINTS
%% OF THE FORM (X,Y,U,TH) OF e*N MOBILE USERS
Positions = zeros(L,4,N);
for k = 1:N
    [c1,c2] = find(PP(:,1) == k);
    %if (any(c1)==1)
    Positions(:,:,k) = PP(c1,2:5);
    %scatter(Positions(:,1,k),Positions(:,2,k)), hold on
end

nns = input('Give the number of nearest neighbors:');
%m = input('Define the mobile users:');
cs = input('Give the number of desired clusters:');
%flag = 0 -> no clustering, flag = 1 -> on-line clustering
%fl = input('Give the flag number:');

% Cluster the user positions at every time instant with k-means
C = zeros(N,L);
opts = statset('Display','final');
for t = 1:L
    mydat = squeeze(Positions(t,:,:));
    % apply k-means clustering
    C(:,t) = kmeans(mydat',cs,'Distance','sqeuclidean','Replicates',10,'Options',opts);
end

Vuln1 = zeros(N,L); % no clustering
Vuln2 = zeros(N,L); % k-means clustering
Vuln1(:,1) = 1/nns;
Vuln2(:,1) = 1/nns;

% NN1/NN2 keep the neighbor indices of each mobile node
% for each time instant t = 1:L
NN1 = zeros(N,nns,L); % flag = 0
NN2 = zeros(N,nns,L); % flag = 1
for t = 1:L
    mydat  = squeeze(Positions(t,:,:));
    mydat1 = mydat(3:4,:);
    mydat2 = mydat(1,:);
    for mm = 1:N
        % keeps the position of mobile object mm at time t
        pos  = Positions(t,1,mm);
        pos2 = Positions(t,3:4,mm);
        %% Method without clustering (MWCL)
        kk = knnsearch(mydat2',pos,'dist','euclidean','k',nns);
        NN1(mm,:,t) = kk;
        %% Method with clustering (MCL)
        % c keeps the cluster number of node mm
        c = C(mm,t); %C(mm,mc);
        % cc keeps the ids of the objects that belong to the same cluster as mm
        cc = find(C(:,t) == c);
        % d keeps the positions of the objects in cluster c
        d = mydat2(:,cc);
        % find the nns nearest neighbors of pos within d
        kkk = knnsearch(d',pos,'dist','euclidean','k',nns);
        NN2(mm,:,t) = cc(kkk)';
    end
end

%% PLOT VULNERABILITIES
for mm = 1:N
    for tt = 2:L
        % how many neighbors remained the same
        Vuln1(mm,tt) = 1/length(find(NN1(mm,:,tt) == NN1(mm,:,tt-1)));
        Vuln2(mm,tt) = 1/length(find(NN2(mm,:,tt) == NN2(mm,:,tt-1)));
    end
    %plot(1:L,Vuln1(mm,:),'*-r',1:L,Vuln2(mm,:),'*-b'), hold on
end

Vuln1(isinf(Vuln1)) = 0; [i1,j1] = find(Vuln1);
Vuln2(isinf(Vuln2)) = 0; [i2,j2] = find(Vuln2);
plot(1:L,mean(Vuln1(i1,:)),'-*r',1:L,mean(Vuln2(i2,:)),'-b')
ylabel('Vulnerability')
xlabel('Time')
legend('No clustering','Clustering')
%plot(1:L,mean(Vuln1),'*r',1:L,mean(Vuln2),'-b')

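The second listing supports Chapter 7. Here the neighbor tables are assumed to have been precomputed and stored in data.mat (variable y10), one row per object, with the k neighbor ids in the first k columns and the time instant in the last column. Because the number of recorded objects may change over time, the script splits the timeline into segments of constant population (via unique over the per-instant row counts), rebuilds the per-instant k-NN table inside each segment, computes the same reciprocal-overlap vulnerability, and carries the last value of one segment into the next (variable I) before plotting the mean vulnerability over time.
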
% Chapter 7
clear all; %close all;
clc;
load data.mat
dat = y10;
k = size(dat,2) - 1;
L = length(unique(dat(:,k+1)));
% count the number of recorded objects per time instant
for jj = 1:L
    x(jj) = length(find(dat(:,k+1) == jj-1));
end

[u1,u2] = unique(x);
u1 = flip(u1);
u2 = sort(u2);
%u2 = [u2;u2(end)+1];
init = 1/k;
VV = [];
%1 13 64 95
for i = 1:length(u1)
    % aa is the first instant of the segment, l its length
    if u2(i) == u2(end)
        aa = u2(end);
        l = length(u2(end):L);
    else
        aa = u2(i);
        l = u2(i+1) - u2(i);
    end

    % rebuild the k-NN table of each instant in the segment
    KNNs = zeros(u1(i),k,l);
    bb = aa + l - 1;
    u = aa:bb;
    for j = 1:l
        KNNs(:,:,j) = dat(dat(:,k+1) == u(j)-1,1:k);
    end

    V = zeros(u1(i),l);
    if (u2(i) == 1)
        V(:,1) = init;
    else
        V(:,1) = I;
    end
    for ii = 1:u1(i)
        for jj = 2:l
            V(ii,jj) = 1./length(find(KNNs(ii,:,jj) == KNNs(ii,:,jj-1)));
        end
    end

    % carry the last vulnerability value into the next segment
    I = V(u1(i),l);
    V(isinf(V)) = 0;
    [i1,j1] = find(V);
    VV = [VV mean(V(i1,:))];
    i1 = [];
end
figure;
plot(VV,'r'), hold on
ylim([0 1])
ylabel('Vulnerability')
xlabel('Time')

Appendix B

GeoSpark Code

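The Scala listing below ran inside an Apache Zeppelin notebook on top of GeoSpark, as its comments note; the paths under /home/vag/data are specific to the experimental machine. For each of the first 100 time instants found in column _c6 of the input CSV, the code materialises the positions of that instant as a single CSV part file, loads it into a GeoSpark PointRDD, applies quadtree spatial partitioning and a quadtree index, prints the per-partition record counts, and finally issues a 2-nearest-neighbor KNNQuery for every point of the instant, appending the answers to file.out.
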
import org.datasyslab.geosparksql.utils._
import org.datasyslab.geosparkviz.sql.utils.GeoSparkVizRegistrator
import org.datasyslab.geospark.spatialRDD._
import org.datasyslab.geospark.enums.{FileDataSplitter, GridType, IndexType}
import org.datasyslab.geospark.formatMapper._
import org.datasyslab.geospark.spatialOperator._
import com.vividsolutions.jts.geom._
import com.vividsolutions.jts.index.quadtree.Quadtree
import com.vividsolutions.jts.index.SpatialIndex
import java.io.{BufferedWriter, FileWriter, File}
import java.util.List
import scala.io.Source
import scala.collection.JavaConversions._
import scala.collection.JavaConverters._

// Zeppelin creates and injects sc (SparkContext) and sqlContext
// (HiveContext or SQLContext), so there is no need to create them manually
import spark.implicits._

// Register the GeoSpark SQL functions and visualization extensions
GeoSparkSQLRegistrator.registerAll(spark)
GeoSparkVizRegistrator.registerAll(spark)

// Load the (long, lat) trajectory points from a CSV file
val pointdf = spark.read.format("csv")
  .option("delimiter", ",")
  .option("header", "false")
  .load("/home/vag/data/bigdata.csv")
pointdf.createOrReplaceTempView("trajectories")

// Collect the distinct time instants (column _c6)
val xronos = pointdf.select("_c6")
  .distinct()
  .collect.map(_(0))
  .toArray

for (i <- 0 until 100) {
  val t = xronos(i)
  println(t)

  // Keep the positions (columns _c2, _c3) of time instant t
  val df = pointdf.select($"_c2", $"_c3").filter($"_c6" === t)
  df.coalesce(1).write.mode("overwrite")
    .csv("/home/vag/data/mydata2.csv")
  val dir = new File("/home/vag/data/mydata2.csv")
  val l = dir.listFiles.filter(_.toPath.toString.endsWith(".csv"))(0).toString
  println(l)

  // Build a spatially partitioned, quadtree-indexed PointRDD
  var pointrdd1 = new PointRDD(sc, l, 0, FileDataSplitter.CSV, true)
  val buildOnSpatialPartitionedRDD = false // Set to TRUE only if run join query
  val numPartitions = 4
  pointrdd1.analyze()
  pointrdd1.spatialPartitioning(GridType.QUADTREE, numPartitions)
  pointrdd1.buildIndex(IndexType.QUADTREE, buildOnSpatialPartitionedRDD)

  // Report how many records landed in each spatial partition
  pointrdd1.spatialPartitionedRDD.rdd
    .mapPartitionsWithIndex { case (idx, rows) => Iterator((idx, rows.size)) }
    .toDF("partition_number", "number_of_records")
    .show()

  // Run a 2-NN query for every point of the instant and append the answers
  val bw = new BufferedWriter(
    new FileWriter(new File("/home/vag/data/file.out"), true))
  for (line <- Source.fromFile(l).getLines) {
    val fact = new GeometryFactory()
    val chicago = fact.createPoint(new Coordinate(
      line.split(",")(0).toDouble,
      line.split(",")(1).toDouble))
    val r = KNNQuery.SpatialKnnQuery(pointrdd1, chicago, 2, false)
    bw.write(r + "\n")
  }
  bw.close()
}
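Two design points in the listing are worth noting. Quadtree partitioning places nearby points in the same partition, so each point-at-a-time k-NN probe needs to inspect only a small part of the RDD, and the per-partition counts printed by mapPartitionsWithIndex expose any load imbalance across the four partitions. The flag buildOnSpatialPartitionedRDD is left false because, as the inline comment says, an index on the spatially partitioned RDD is only required when running spatial join queries, not for the standalone KNNQuery calls issued here.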
