Mining Security Risks from Massive Datasets
Fang Liu
Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Computer Science & Application
Danfeng Yao, Chair
Wenjing Lou
Ali R. Butt
B. Aditya Prakash
Dongyan Xu
June 28, 2017 Blacksburg, Virginia
Keywords: Cyber Security, Big Data Security, Mobile Security, Data Leakage Detection
Copyright 2017, Fang Liu
(ABSTRACT)
Cyber security risk has been a problem ever since the appearance of telecommunication and electronic computers. Over the past 30 years, researchers have developed various tools to protect the confidentiality, integrity, and availability of data and programs.
However, new challenges are emerging as the amount of data grows rapidly in the big data era. On one hand, attacks are becoming stealthier by concealing their behaviors in massive datasets. On the other hand, it is becoming increasingly difficult for existing tools to handle massive datasets with various data types.
This thesis presents attempts to address these challenges and solve different security problems by mining security risks from massive datasets. The attempts are in three aspects: detecting security risks in the enterprise environment, prioritizing the security risks of mobile apps, and measuring the impact of security risks between websites and mobile apps. First, the thesis presents a framework to detect data leakage in very large content. The framework can be deployed on the cloud for enterprises and preserves the privacy of sensitive data. Second, the thesis prioritizes the inter-app communication risks in large-scale Android apps by designing a new distributed inter-app communication linking algorithm and performing nearest-neighbor risk analysis. Third, the thesis measures the impact of deep link hijacking risk, one type of inter-app communication risk, on 1 million websites and 160 thousand mobile apps. The measurement reveals the failure of Google's attempts to improve the security of deep links.
(GENERAL AUDIENCE ABSTRACT)
Cyber security risk has been a problem ever since the appearance of telecommunication and electronic computers. Over the past 30 years, researchers have developed various tools to prevent sensitive data from being accessed by unauthorized users, to protect programs and data from being changed by attackers, and to ensure that programs and data are available whenever needed.
However, new challenges are emerging as the amount of data grows rapidly in the big data era. On one hand, attacks are becoming stealthier by concealing their attack behaviors in massive datasets. On the other hand, it is becoming increasingly difficult for existing tools to handle massive datasets with various data types.
This thesis presents attempts to address these challenges and solve different security problems by mining security risks from massive datasets. The attempts are in three aspects: detecting security risks in enterprise environments where massive datasets are involved, prioritizing the security risks of mobile apps so that high-risk apps are analyzed first, and measuring the impact of security risks in the communication between websites and mobile apps. First, the thesis presents a framework to detect sensitive data leakage from very large content in enterprise environments. The framework can be deployed on the cloud for enterprises while preventing the sensitive data from being accessed by the semi-honest cloud. Second, the thesis prioritizes the inter-app communication risks in large-scale Android apps by designing a new distributed inter-app communication linking algorithm and performing nearest-neighbor risk analysis. The algorithm runs on a cluster to speed up the computation; the analysis leverages each app's communication context with all the other apps to prioritize the inter-app communication risks. Third, the thesis measures the impact of mobile deep link hijacking risk on 1 million websites and 160 thousand mobile apps. Mobile deep link hijacking happens when a user clicks a link that is supposed to be opened by one app but is instead hijacked by another, malicious app. Mobile deep link hijacking is one type of inter-app communication risk between the mobile browser and apps. The measurement reveals the failure of Google's attempts to improve the security of mobile deep links.

Acknowledgments
I would like to express my deepest gratitude to my mentor Dr. Danfeng (Daphne) Yao for providing me the opportunity to be part of the Yao group! She introduced me to cyber security research and has since guided me through the most interesting and exciting research topics over the past five years. I have learned so much from Dr. Yao's knowledge and expertise, and the knowledge and skills gained in this learning experience form the most solid foundation for my dissertation work. I also greatly appreciate Dr. Yao's kindness and generosity. She is always available to provide insights on my research when I am lost and to offer support when I am discouraged. I would not have survived the past five challenging years of graduate school without her encouragement and inspiration.
I would also like to express my sincere gratitude to Dr. Wenjing Lou, Dr. Ali R. Butt, Dr. B. Aditya Prakash, Dr. Dongyan Xu, and Dr. Barbara Ryder, who generously contributed much time and energy to help improve my dissertation. I have benefited greatly from their diverse perspectives and insightful comments, which helped both deepen and broaden my views of my research. I also want to thank Dr. Gang Wang for providing guidance on the mobile deep link project. It was a valuable and enjoyable learning experience despite its short duration.
I would like to thank the friends and peers with whom I worked over the past five years: Dr. Kui Xu, Dr. Xiaokui Shu, Dr. Hao Zhang, Dr. Karim Elish, Dr. Haipeng Cai, Daniel Barton, Ke Tian, Dr. Long Cheng, Sazzadur Rahaman, Stefan Nagy, Alex Kedrowitsch, Andres Pico, Bo Li, Yue Cheng, Kaixi Hou, Zheng Song, Tong Zhang, Qingrui Liu and Xinwei Fu. Their companionship has brought so much joy into the challenging and stressful graduate school life!
Finally, I would like to thank my parents, my sister, and my wife for always having faith in me and loving me. Their consistent support instills in me the strongest motivation to keep learning more, working harder, and getting better.
Contents

List of Figures

List of Tables

1 Introduction
  1.1 Detect Data Leakage in Large Datasets
  1.2 Prioritize Inter-communication Risks in Large-scale Android Apps
  1.3 Measure Hijacking Risks of Mobile Deep Links
  1.4 Document Organization

2 Review of Literature
  2.1 Data Leakage Detection
  2.2 Mobile Inter-app Communication

3 Detect Data Leakage
  3.1 Threat Model, Security and Computation Goals
    3.1.1 Threat Model and Security Goal
    3.1.2 Computation Goal
    3.1.3 Confidentiality of Sensitive Data
  3.2 Technical Requirements and Design Overview
    3.2.1 MapReduce
    3.2.2 MapReduce-Based Design and Challenges
    3.2.3 Workload Distribution
    3.2.4 Detection Workflow
  3.3 Collection Intersection in MapReduce
    3.3.1 Divider Algorithm
    3.3.2 Reassembler Algorithm
    3.3.3 Example of the Algorithms
    3.3.4 Complexity Analysis
  3.4 Security Analysis and Discussion
    3.4.1 Privacy Guarantee
    3.4.2 Collisions
    3.4.3 Discussion
  3.5 Implementation and Evaluation
    3.5.1 Performance of a Single Host
    3.5.2 Optimal Size of Content Segment
    3.5.3 Scalability
    3.5.4 Performance Impact of Sensitive Data
    3.5.5 Plain Text Leak Detection Accuracy
    3.5.6 Binary Leak Detection Accuracy

4 Prioritize Inter-app Communication Risks
  4.1 Models & Methodology
    4.1.1 Threat Model
    4.1.2 Security Insights of Large-Scale Inter-app Analysis
    4.1.3 Computational Goal
    4.1.4 The Workflow
  4.2 Distributed ICC Mapping
    4.2.1 Identify ICC Nodes
    4.2.2 Identify ICC Edges and Tests
    4.2.3 Multiple ICCs Per App Pair
    4.2.4 Workload Balance
    4.2.5 Complexity Analysis
  4.3 Neighbor-based Risk Analysis
    4.3.1 Features
    4.3.2 Hijacking/Spoofing Risk
    4.3.3 Collusion Risk
  4.4 Evaluation
    4.4.1 Q1: Results of Risk Assessment
    4.4.2 Q2: Manual Validation
    4.4.3 Q3: Attack Case Studies
    4.4.4 Q4: Runtime of MR-Droid
  4.5 Security Recommendations
  4.6 Discussion

5 Measure Mobile Deep Link Risks
  5.1 Background and Research Goals
    5.1.1 Mobile Deep Links
    5.1.2 Security Risks of Deep Linking
    5.1.3 Research Questions
  5.2 Datasets
  5.3 Deep Link Registration by Apps
    5.3.1 Extracting URI Registration Entries
    5.3.2 Scheme URL vs. App Link
  5.4 Security Analysis of App Links
    5.4.1 App Link Verification
    5.4.2 Over-Permission Vulnerability
    5.4.3 Summary of Vulnerable Apps
  5.5 Link Hijacking
    5.5.1 Characterizing Link Collision
    5.5.2 Detecting Malicious Hijacking
    5.5.3 Hijacking Results and Case Studies
  5.6 Mobile Deep Links on the Web
    5.6.1 Intent URL Usage
    5.6.2 Measuring Hijacking Risk on Web
  5.7 Discussion

6 Conclusions and Future Work

Bibliography
List of Figures

3.1 An example illustrating the difference between set intersection and collection intersection in handling duplicates for 3-grams.
3.2 Overview of the MapReduce execution process. Map takes each ⟨key, value⟩ pair as input and generates new ⟨key, value⟩ pairs. The output ⟨key, value⟩ pairs are redistributed according to the keys. Each reduce processes the list of values with the same key and writes results back to DFS.
3.3 Workload distribution of DLD provider and data owner.
3.4 An example illustrating the Divider and Reassembler algorithms, with four MapReduce nodes, two content collections C1 and C2, and two sensitive data collections S1 and S2. M, R, and Redi stand for map, reduce, and redistribution, respectively. The ⟨key, value⟩ of each operation is shown at the top.
3.5 DLD provider's throughput with different sizes of content segments. For each setup (line), the size of the content analyzed is 37 GB.
3.6 Data owner's pre-processing overhead on a workstation with different sizes of content segments. The total size of content analyzed is 37 GB.
3.7 Throughputs with different numbers of nodes on a local cluster and Amazon EC2.
3.8 Throughput with different amounts of content workload. Each content segment is 37 MB.
3.9 Runtime of the Divider and Reassembler algorithms. The Divider operation takes 85% to 98% of the total runtime. The Y-axis is in base-10 log scale.
3.10 Runtime with a varying size of sensitive data. The content volume is 37.5 GB for each setup. Each setup (line) has a different size for content segments.
3.11 Runtimes of Divider and Reassembler with different values of γ. The runtimes drop linearly as γ decreases. The result is consistent with the complexity analysis in Table 3.2.
3.12 Plain text leak detection accuracy with different thresholds. The detection rate is 1 with a very low false positive rate when the threshold ranges from 0.18 to 0.8. In Figure (b), the false positive rates of the two lines are very close.
3.13 Binary leak detection accuracy with different thresholds. The detection rate is 1 with a very low false positive rate when the threshold ranges from 0.35 to 0.9. In Figure (a), the detection rates of the four lines are very close.

4.1 Illustration of data flows, inter-app Intent matching, and the ICC graph. My MapReduce algorithms compute the complete ICC graph information for a set of apps. Internal ICCs are excluded.
4.2 The workflow for analyzing Android app pairs with MapReduce. The dashed vertical lines indicate redistribution processes in MapReduce. E, I, and S represent explicit edge, implicit edge, and sharedUserId edge, respectively.
4.3 Feature value distribution and classification.
4.4 The heat maps of the number of app pairs across app categories. The tick label indexes represent 24 categories. Detailed index-category mapping in Table 4.7.
4.5 Analysis time of the three phases in my approach.

5.1 Three types of mobile deep links: Scheme URL, App Link and Intent URL.
5.2 URI syntax for Scheme URLs and App Links.
5.3 App Link verification process.
5.4 # of new schemes and app link hosts per app between 2014 and 2016.
5.5 % of apps w/ deep links; apps are divided by download count.
5.6 # of collision apps per scheme.
5.7 # of collision apps per web host.
5.8 # of collision apps per scheme.
5.9 # of collision apps per host.
5.10 Deep link distribution among Alexa top 1 million websites. Website domains are sorted and divided into 20 even-sized bins (50K sites per bin). I report the % of websites that contain deep links in each bin.
5.11 Number of websites that host deep links for each app.
5.12 Different types of hijacked deep links in Alexa1M.
5.13 Webpages that contain hijacked deep links in Alexa1M.
List of Tables

3.1 Notations used in my MapReduce algorithms.
3.2 Worst-case computation and communication complexity of each phase in my MapReduce algorithm and that of the conventional hashtable-based approach. I denote the average size of a sensitive data collection by S, the average size of a content collection by C, the number of sensitive data collections by m, the number of content collections by n, and the average intersection rate by γ ∈ [0, 1].
3.3 Detection throughputs (Mbps) on a single host with different content sizes and sensitive data sizes. * indicates that the system crashes before the detection is finished. Note that, for a shingle size of 8, the size of input for the hashtable is 8 times the size of the content.
3.4 Setup of binary leak detection.

4.1 Nodes in the ICC graph and their attributes. type is either "Activity", "Service" or "Broadcast". pointType is either "Explicit", "Implicit" or "sharedUserId".
4.2 Tests completed at different MapReduce phases and the edges that the tests are applicable to.
4.3 Summary of how three types of edges in the ICC graph are obtained based on source and sink information and transformed into

5.1 Conditions for whether users will be prompted after clicking a deep link on Android. *App Links always have at least one matched app, the mobile browser.
5.2 Two snapshots of Android apps collected in 2014 and 2016. 115,399 apps appear in both datasets; 48,923 apps in App2014 are no longer listed on the market in 2016; App2016 has 49,564 new apps.
5.3 App Link verification statistics and common mistakes (App2016) based on data from January 2017 and May 2017. *One app can make multiple mistakes.
5.4 Association files for iOS and Android obtained after scanning 1,012,844 domains.
5.5 Top 10 schemes and app link hosts with link collisions in App2016. I manually label them into three types: F = Functional, P = Per-App, T = Third-party.
5.6 Filtering and classification results for schemes and App Link hosts (App2016).
5.7 Features used for scheme classification.
5.8 Number of deep links (and webpages that contain deep links) in Alexa top 1 million web domains.
5.9 Sensitive parameters in mobile deep links.
Chapter 1
Introduction
Cyber security has been a topic of study ever since the appearance of telecommunication and electronic computers. A vulnerability can lead to the compromise of confidentiality (e.g., data leaks), integrity (e.g., authentication bypass), and availability (e.g., denial of service) of data or programs. Because vulnerabilities are so prevalent, security risks are everywhere. For example, ComDroid [53] reported that over 97% of the studied apps are vulnerable to communication hijacking attacks. One direct consequence of these security risks is data leakage. In 2014 alone, over 1 billion data records were leaked, as reported by SafeNet [32], a data protection company. The leaked records include email addresses, credit card numbers, etc. A survey conducted by Kaspersky Lab on 4,438 companies indicates that about 83% of the companies experienced data leak incidents [84]. Although various tools have been developed over the past 30 years to detect and fix security risks, new challenges are emerging in the big data era. First, new attacks are becoming stealthier: they conceal their behaviors in large datasets, and it is almost impossible to detect their malicious behaviors without correlating data from multiple sources. Second, the size of the datasets for mining security risks is growing rapidly. Existing approaches running on a single host cannot process such huge datasets; new detection algorithms running on distributed systems are needed. Third, the exposure of large datasets threatens the confidentiality of private information. Attackers are more powerful and can correlate multiple data sources to compromise the confidentiality of sensitive data. What is worse, enterprises are moving their computation to the cloud, which requires extra privacy-preserving techniques.
This thesis presents attempts to address these challenges by detecting, prioritizing, and measuring security risks from massive datasets. Specifically, it focuses on the detection of data leakage in enterprise environments, the prioritization of inter-app communication risks, and the measurement of the impact of deep link hijacking on mobile apps and the web.

Data Leakage. The exposure of sensitive data is a serious threat to the confidentiality of
organizational and personal data. Reports showed that over 800 million sensitive records were exposed in 2013 through over 2,000 incidents [32]. Causes include compromised systems, lost devices, and unencrypted data storage or network transmission. While many data leak incidents are due to malicious attacks, a significant portion is caused by unintentional mistakes of employees or data owners. In this thesis, I focus on the detection of accidental/unintentional data leaks from very large datasets.

Inter-app Communication Risks. Inter-Component Communication (ICC) is a key mechanism for app-to-app communication in Android, where components of different apps are linked via messaging objects (Intents). While ICC contributes significantly to the development of collaborative applications, it has also become a predominant attack surface. In inter-app communication scenarios, individual apps often suffer from vulnerabilities such as Intent hijacking and spoofing, resulting in leaks of sensitive user data [53]. In addition, ICC allows two or more malicious apps to collude on stealthy attacks that none of them could accomplish alone [36, 100]. According to a recent report from McAfee Labs [11], app collusions are increasingly prevalent on mobile platforms. In this thesis, I focus on prioritizing the large number of ICC risks and detecting high-risk apps.

Deep Link Hijacking Risks. With the wide adoption of smartphones, mobile websites and native apps have become the two primary interfaces for accessing online content [12, 115]. Users can easily launch apps from websites with preloaded context, which is instrumental to many key user experiences. The key enabler of web-to-mobile communication is mobile deep links. Like web URLs, mobile deep links are universal resource identifiers (URIs) for content and functions within apps [131].
The most widely used deep link is the scheme URL, supported by both Android [7] and iOS [8] since 2008. Despite the convenience, researchers have identified serious security vulnerabilities in scheme URLs [42, 53, 138]. The most significant one is link hijacking, where one app registers another app's scheme and induces the mobile OS to open the wrong app. This allows malicious apps to perform phishing attacks (e.g., displaying a fake Facebook login box) or to steal sensitive data carried by the link (e.g., PII) [53]. To address the problem, Google has designed App Links and Intent URLs. This thesis measures whether the new mechanisms help address these problems and what the current impact of hijacking is on mobile phones and on the web. In the rest of this chapter, I briefly introduce the problems and my contributions in detecting data leakage, prioritizing inter-app communication risks, and measuring the security impact of deep link hijacking.
1.1 Detect Data Leakage in Large Datasets
Several approaches have been proposed to detect data exfiltration [24, 29, 30, 122], either on a host or on the network. The technique proposed by Shu and Yao [122] performs deep packet inspection to search for exposed outbound traffic that bears high similarity to sensitive data.
Set intersection is used as the similarity measure. This similarity-based detection is versatile, capable of analyzing both text and some binary-encoded content (e.g., Word or .pdf files). However, a naive implementation of the similarity-based approach has O(nm) complexity, where n and m are the sizes of the two sets A and B, respectively. If A and B are both very large (as in my data-leak detection scenario), it is not practical due to memory limitations and thrashing. Distributing the computation to multiple hosts is nontrivial to implement from scratch and has not been reported in the literature. In this thesis, I present a new MapReduce-based system to detect the occurrences of plaintext sensitive data in storage and transmission. The detection is distributed and parallel, capable of screening massive amounts of content for exposed information. In addition, I preserve the confidentiality of sensitive data for outsourced detection: in my privacy-preserving data-leak detection, MapReduce nodes scan content in data storage or network transmission for leaks without learning what the sensitive data is. I implement the algorithms using the open-source Hadoop framework. The implementation has very efficient intermediate data representations, which significantly reduce the disk and network I/O overhead. I achieved 225 Mbps throughput for privacy-preserving data leak detection when processing 74 GB of content.
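The core similarity computation can be sketched as follows. This is a minimal single-host illustration of collection (multiset) intersection over fingerprinted n-grams, not the MapReduce implementation: the function names and sample strings are mine, and truncated SHA-1 digests stand in for the Rabin fingerprints the system actually uses.

```python
from collections import Counter
import hashlib

def fingerprints(data: bytes, n: int = 8) -> Counter:
    """Fingerprint every n-gram (shingle) of the input; duplicates are
    counted, which is what distinguishes a collection from a set."""
    return Counter(
        hashlib.sha1(data[i:i + n]).digest()[:4]  # truncated stand-in digest
        for i in range(len(data) - n + 1)
    )

def collection_intersection_rate(sensitive: Counter, content: Counter) -> float:
    """|multiset intersection| / |sensitive collection|; Counter's `&`
    takes element-wise minimum counts, i.e. multiset intersection."""
    overlap = sum((sensitive & content).values())
    return overlap / max(sum(sensitive.values()), 1)

secret = b"credit card 4111-1111"
leaked = b"log: user sent credit card 4111-1111 to host"
clean  = b"completely unrelated traffic payload here"

print(collection_intersection_rate(fingerprints(secret), fingerprints(leaked)))
print(collection_intersection_rate(fingerprints(secret), fingerprints(clean)))
```

Because the leaked traffic contains the sensitive string verbatim, every fingerprint of `secret` is matched and the rate is 1.0; the unrelated traffic shares no 8-gram with it and scores 0.0. A detection threshold between the two separates leak from non-leak.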
1.2 Prioritize Inter-communication Risks in Large-scale Android Apps
To assess ICC vulnerabilities, various approaches have been proposed to analyze each individual app [19, 36, 53, 103, 136]. The advantage of analyzing each app separately is high scalability. However, it may produce overly conservative risk estimations [53, 60, 103], and manually investigating all the resulting alerts would be a daunting task for security analysts. Recently, researchers have started to co-analyze the ICCs of two or more apps simultaneously. This allows them to gain empirical context on the actual communications between apps and produce more relevant alerts [22, 90, 118]. However, existing solutions are largely limited in scale due to the high complexity of pair-wise component analyses (O(N²)); they were applied either to much smaller sets of apps or to small sets of inter-app links. In this thesis, I present MR-Droid, a MapReduce-based parallel analytics system for accurate and scalable ICC risk detection. I empirically evaluate ICC risks based on an app's inter-connections with other real-world apps, and detect high-risk pairs. To construct the communication context of each app, I build a large-scale ICC graph, where each node is an app component and each edge represents the corresponding inter-app ICCs. To gauge the risk level of ICC pairs (edge weights), I extract various features based on app flow analyses that indicate vulnerabilities. With the ICC graph, I then rank the risk
level of a given app by aggregating all its ICC edges connected to other apps. The system is implemented with a set of new MapReduce [57] algorithms atop the Hadoop framework. The high-level parallelization from MapReduce allows me to analyze millions of app pairs within hours using commodity servers. My evaluation of 11,996 apps demonstrates the scalability of MR-Droid: the entire process took less than 25 hours, an average of only 0.0012 seconds per app pair. My prioritization achieves over 90% true positives in detecting high-risk apps.
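The ranking step can be sketched as follows. This is a simplified, hypothetical illustration: the package names and scores are invented, and in MR-Droid the edges and their weights come from matching Intent senders to intent filters across app pairs and from flow-analysis features, not from a hand-written list.

```python
from collections import defaultdict

# Hypothetical ICC edges: (sender app, receiver app, risk score in [0, 1]).
icc_edges = [
    ("com.a.mail", "com.b.share", 0.9),   # e.g., exported component, sensitive flow
    ("com.a.mail", "com.c.view",  0.2),
    ("com.d.game", "com.b.share", 0.8),
]

def rank_apps_by_risk(edges):
    """Aggregate each app's inter-app edge weights; a higher total means
    the app sits on more (and riskier) communication channels."""
    risk = defaultdict(float)
    for src, dst, weight in edges:
        risk[src] += weight   # spoofing-style exposure on the sender side
        risk[dst] += weight   # hijacking-style exposure on the receiver side
    return sorted(risk.items(), key=lambda kv: -kv[1])

print(rank_apps_by_risk(icc_edges))
```

In this toy graph, `com.b.share` receives risky Intents from two different senders, so it aggregates the highest risk and would be prioritized for manual analysis first.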
1.3 Measure Hijacking Risks of Mobile Deep Links
Scheme URLs are known to be vulnerable to various attacks [42, 53, 138], the most significant being link hijacking. Recently, two new deep link mechanisms were proposed to address the security risks of scheme URLs: App Links and Intent URLs. 1) App Link [5, 10] was introduced to Android and iOS in 2015. It no longer allows developers to customize schemes, but exclusively uses the HTTP/HTTPS scheme. To prevent hijacking, App Links introduce a way to verify the app-to-link association. 2) Intent URL [6] is another solution, introduced in 2013, which only works on Android. Intent URLs define how deep links should be invoked by websites: the web developer can specify the app's package name to avoid hijacking. While the new mechanisms are secure in theory, little is known about how effective they are in practice. In this thesis, I conduct the first empirical measurement of various mobile deep links across apps and websites. My analysis is based on deep links extracted from two snapshots of 160,000+ top Android apps from Google Play (2014 and 2016), and 1 million webpages from Alexa top domains. I find that the new linking methods (particularly App Links) not only fail to deliver the security benefits as designed, but significantly worsen the situation. First, App Links apply link verification to prevent hijacking, yet only 194 apps (2.2% of the 8,878 apps with App Links) pass the verification, due to incorrect (or missing) implementations. Second, I identify a new vulnerability in App Links' preference setting, which allows a malicious app to intercept arbitrary HTTPS URLs in the browser without raising any alerts. Third, I identify more hijacking cases on App Links than on existing scheme URLs, among both apps and websites; many of them target popular sites such as online social networks. Finally, Intent URLs have little impact in mitigating hijacking risks due to their low adoption rate on the web.
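On Android, App Link verification rests on the Digital Asset Links file that a website serves at `https://<host>/.well-known/assetlinks.json`, which names the packages (and signing certificate fingerprints) allowed to handle the host's links. A minimal offline check of such a file might look like the sketch below; the field names follow the published Asset Links format, while the package names and fingerprint value are placeholders of my own.

```python
import json

HANDLE_ALL_URLS = "delegate_permission/common.handle_all_urls"

def app_link_verified(assetlinks_json: str, package: str, cert_fp: str) -> bool:
    """Return True iff the Digital Asset Links statement list authorizes
    `package` (signed with `cert_fp`) to handle this host's App Links."""
    try:
        statements = json.loads(assetlinks_json)
    except json.JSONDecodeError:
        return False  # malformed association files are treated as unverified
    if not isinstance(statements, list):
        return False
    for st in statements:
        target = st.get("target", {})
        if (HANDLE_ALL_URLS in st.get("relation", [])
                and target.get("namespace") == "android_app"
                and target.get("package_name") == package
                and cert_fp in target.get("sha256_cert_fingerprints", [])):
            return True
    return False

# A well-formed statement for a hypothetical news app.
good = json.dumps([{
    "relation": [HANDLE_ALL_URLS],
    "target": {"namespace": "android_app",
               "package_name": "com.example.news",
               "sha256_cert_fingerprints": ["AA:BB:CC:DD"]},
}])

print(app_link_verified(good, "com.example.news", "AA:BB:CC:DD"))  # authorized
print(app_link_verified(good, "com.attacker.app", "AA:BB:CC:DD"))  # rejected
```

The measurement in Chapter 5 found that most apps fail exactly this kind of check, because the association file is missing, malformed, or names the wrong package or certificate.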
1.4 Document Organization
In the rest of the thesis, Chapter 2 discusses the literature and related work. Chapter 3 presents my approach to detecting data leakage in enterprise environments. The prioritization of inter-app communication risks is presented in Chapter 4. I further extend the thesis with the measurement of mobile deep links in Chapter 5. Chapter 6 concludes the thesis and discusses future directions.

Chapter 2
Review of Literature
This chapter reviews the literature related to the thesis. In Section 2.1, I discuss work related to MapReduce, data leakage detection, and associated techniques. Section 2.2 reviews previous work on inter-app communication risks and detection & prevention techniques.
2.1 Data Leakage Detection
Existing data leakage detection/prevention techniques. Existing commercial products for data leak detection/prevention include GoCloudDLP [68], Symantec DLP [130], IdentityFinder [78] and GlobalVelocity [66]. However, most of them cannot be outsourced because they lack privacy-preserving features; GoCloudDLP [68] allows outsourced detection only with a fully honest DLD provider. Borders et al. [29] presented a network-analysis approach for estimating information leaks in outbound traffic. The method identifies anomalous and drastic increases in the amount of information carried by the traffic, but it was not designed to detect small-scale data leaks. In comparison, my technique is based on intersection-based pattern matching analysis; thus, my method is more sensitive to small and stealthy leaks. In addition, my analysis can also be applied to content in data storage (e.g., data centers), besides network traffic. Croft et al. [54] compared two logical copies of network traffic to control the movement of sensitive data. The work by Papadimitriou et al. [106] aims at finding the agents that leaked sensitive data. Shu et al. [123] presented privacy-preserving methods for protecting sensitive data in a non-MapReduce detection environment; they further proposed to accelerate the screening of transformed data leaks using GPUs [124]. Blanton et al. [26] proposed a solution for fast outsourcing of sequence edit distance and secure path computation, while preserving the confidentiality of the sequence.
MapReduce for similarity computation. The MapReduce framework has been used to solve problems in various areas, such as data mining [71], data management [87] and security [41, 109, 150]. Caruana et al. [41] designed parallel SVM algorithms for filtering spam. Zhang et al. [150] proposed using MapReduce to perform data anonymization. Provos et al. [109] leveraged MapReduce to parse URLs for web-based malware. My data leak detection problem is new; none of the existing MapReduce solutions addresses it, and my detection goal differs from those of the aforementioned solutions. MapReduce algorithms for computing document similarity (e.g., [23, 61]) and similarity joins (e.g., [14, 55, 58, 134]) involve pairwise similarity comparison, where the similarity measure may be edit distance, cosine distance, etc. Their approaches are not efficient for processing very large content when performing data leak detection and, most importantly, they do not preserve the confidentiality of sensitive data. There exist MapReduce algorithms for computing set intersection [25, 134]. They differ from my collection intersection algorithms, which are explained in Section 4.1.4: my collection intersection algorithm requires new intermediate data fields and processing for counting and recording duplicates in the intersection.
Privacy-preserving techniques. Private string search over encrypted data (e.g., [88, 97, 126]) and privacy-preserving record linkage [107, 121] also differ from my problem. For searching over encrypted data, the scale of the search queries is relatively small compared with the content in data leak detection. Most privacy-preserving record linkage approaches assume that the party computing the similarity is trustworthy, which differs from my threat model. The Bloom filters used in record linkage approaches [107, 121] are also not scalable to very large-scale content screening, as demonstrated in Section 3.5.1. Most importantly, traditional string matching or keyword search only applies when the size of the sensitive data is small or the patterns of all sensitive data are enumerable; it is not practical for the query string to cover all sensitive data segments in data leak detection. The problem addressed by DNA mapping techniques [43, 139] is similar to that of string search. For example, Chen et al. [43] proposed performing DNA mapping with a MapReduce cluster. To preserve the privacy of DNA sequences, they used SHA-1 to hash both the query sequences and the reference genome; MapReduce sorts the hashed reference genomes and matches the hashed results, and the alignment (edit distance) is calculated locally afterwards. However, the edit distance computation and hash function in their approach incur high computation overhead, especially on the local cluster, which makes it unsuitable for large-scale content screening with a large volume of queries. My approach is specialized for data leak detection, making it different in several aspects:
• I use collection intersection instead of edit distance to measure the similarity. Collection intersection is more efficient and suitable for large-scale data leak detection.
• I perform similarity calculation on the MapReduce cluster instead of the local nodes, which is preferred because the data owner's computation capacity is limited.
• I use rolling hash (Rabin fingerprint) instead of a traditional hash function (SHA-1). Rabin fingerprint helps reduce the data owner's overhead, especially when hashing large-scale n-grams.
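To illustrate why a rolling hash keeps the data owner's hashing cost low, the sketch below updates each n-gram fingerprint from the previous one in constant time, so fingerprinting all n-grams of a sequence costs time linear in the sequence length rather than linear in the sequence length times n. This is a simplified stand-in, not the exact scheme used in my prototype: Rabin's fingerprint is defined over GF(2) with an irreducible polynomial, whereas this sketch uses a polynomial hash modulo a prime, and the constants `BASE` and `MOD` are illustrative choices.

```python
# Illustrative rolling hash over byte n-grams. Each new fingerprint is
# derived from the previous one in O(1); hashing all n-grams of `data`
# therefore costs O(len(data)) instead of O(len(data) * n).
# NOTE: BASE and MOD are illustrative; Rabin's original fingerprint
# works over GF(2) with an irreducible polynomial.

BASE = 257            # multiplier, larger than the byte alphabet
MOD = (1 << 61) - 1   # large prime modulus

def rolling_fingerprints(data: bytes, n: int):
    """Yield a fingerprint for every n-gram (length-n window) of `data`."""
    if len(data) < n:
        return
    msb = pow(BASE, n - 1, MOD)       # weight of the byte leaving the window
    h = 0
    for b in data[:n]:                # hash the first n-gram directly
        h = (h * BASE + b) % MOD
    yield h
    for i in range(n, len(data)):     # slide the window one byte at a time
        h = ((h - data[i - n] * msb) * BASE + data[i]) % MOD
        yield h
```

Each window update performs one subtraction, one multiplication and one addition, which is what makes the approach attractive for large-scale n-gram fingerprinting.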
Shingles and Rabin fingerprints [110] were previously used for collaborative spam filtering [89], worm containment [38], etc. There are also general-purpose privacy-preserving techniques such as secure multi-party computation (SMC) [142]. SMC is a cryptographic mechanism that supports various functions such as private join operations [40], private information retrieval [143] and genomic computation [81]. However, the provable privacy guarantees provided by SMC come with high computational complexity for large-scale content screening.
Improvements over MapReduce Several techniques have been proposed to improve the integrity or privacy protection of the MapReduce framework. Such solutions typically assume that the cloud provider is trustworthy. Roy et al. [116] integrated mandatory access control with differential privacy in order to manage the use of sensitive data in MapReduce computations. Yoon and Squicciarini [144] detected malicious or cheating MapReduce nodes by correlating different nodes' system and Hadoop logs. Squicciarini et al. [127] presented techniques that prevent information leakage from the indexes of data in the cloud. In comparison, my work has a different security model. I assume that MapReduce algorithms are developed by trustworthy entities, yet the MapReduce provider may attempt to gain knowledge of the sensitive information. Blass et al. [27] proposed a protocol for counting the frequency of patterns expressed as Boolean formulas with MapReduce. The protocol helps preserve the privacy of outsourced data records on the cloud. The privacy-preserving user matching protocol presented by Almishari et al. [17] provides a method for two parties to compare their review datasets. Asghar et al. [20] proposed to enforce role-based access control policies to preserve the privacy of data in outsourced environments. Hybrid cloud (e.g., [145]) is another general approach that secures computation with mixed sensitive data. Different from SMC, the MapReduce system in a hybrid cloud is installed across a public cloud and a private cloud. It may require much more computation capacity from the data owner when compared with my specific architecture. As enterprises are moving their services to the cloud, advanced techniques were also proposed to enhance the quality of service experience of data-intensive applications [45, 46, 47, 49] and reduce the cost for both cloud service providers and tenants [48, 50].
Some other approaches were also proposed to exploit the parallelism within computing nodes by using compiler-assisted methods [72], utilizing underlying resources efficiently [73, 75] and leveraging GPUs [74, 76, 77].
2.2 Mobile Inter-app Communication
Inter-app Attacks Researchers have discovered various vulnerabilities in the inter-app communication mechanisms in Android [53] and iOS [135], which lead to potential hijacking and spoofing attacks. Chin et al. [53] analyzed the inter-app vulnerabilities in Android. They pointed out that the message passing system enables various inter-app attacks including broadcast theft, activity hijacking, etc. Davi et al. [56] conducted a permission escalation attack by invoking higher-privileged apps, which do not sufficiently protect their interfaces, from a non-privileged app. Ren et al. [113] demonstrated that intent hijacking can lead to UI spoofing, denial-of-service and user monitoring attacks. Soundcomber [120] is a malicious app that transmits sensitive data to the Internet through an overt/covert channel to a second app. Android browsers have also become the target of inter-app attacks via intent scheme URLs [132] and Cross-Application Scripting [69]. Inter-app attacks have become a serious security problem on smartphones. The fundamental issue is a lack of source and destination authentication [135]. Mobile deep links (e.g., scheme URLs) inherit some of these vulnerabilities when facilitating communications between websites and apps. My work on measuring deep links is complementary to existing work, since I focus on large-scale empirical measurements, providing new understanding of how existing risks are mitigated in practice.
Mobile Browser Security. In web-to-app communications, mobile browsers play an important role in bridging websites and apps, and they can also be the target of attacks. For example, malicious websites may attack the browser using XSS [69, 132] and origin-crossing [135]. The threat also applies to customized in-app browsers (called WebView) [51, 98, 102, 133]. In my work, I focus on hijacking threats to apps, a different threat model in which the browser is not the target.
Defense and Detection Techniques A significant number of solutions have been proposed to perform single-app analysis and detect ICC vulnerabilities. DroidSafe [67] and AppIntent [141] analyze sensitive data flow within an app and detect whether the sensitive information that flows out of the phone is user intended or not. ComDroid [52] and Epicc [103] identify Intent specifications in Android apps and detect ICC vulnerabilities. CHEX [96] discovers entry points and detects hijacking-enabling data flows in the app for component hijacking (Intent spoofing). Mutchler et al. [102] studied the security of mobile web apps. In particular, they detected inter-app URL leakage in mobile web apps. Zhang and Yin [149] proposed to automatically patch component hijacking vulnerabilities. Mulliner et al. [101] presented a scalable approach to apply third-party security patches against privilege escalation, capability leaks, etc. PermissionFlow [118] detects inter-application permission leaks using taint analysis. On-device policies [37, 83, 111] were also designed to detect inter-app vulnerabilities and collusion attacks. Inter-app analysis has also been proposed in the literature. One may combine multiple apps into one and perform single-app analysis on the combined app [90, 91]. These approaches
include flow analysis and obtain more detailed flow information from the combined app. However, they do not scale and would be extremely expensive if applied to a large set of apps. In addition, in straightforward implementations, expensive program analysis is performed repetitively, which is redundant. In comparison, I perform data-flow analysis of an app only once, independent of how many neighbors it has. COVERT [22] performs static analysis on multiple apps for detecting permission leakage. It extracts a model from each app, and a formal analyzer is used to verify the safety of a given combination of apps. FUSE [112] statically analyzes the binary of each individual app and merges the results to generate a multi-app information flow graph. The graph can be used for security analysis. Klieber et al. [85] performed static taint analysis on a set of apps with FlowDroid [19] and Epicc [103]. Jing et al. [82] proposed intent space analysis and a policy checking framework to identify the links among a small set of apps. The existing approaches mostly focus on precision and the analysis of in-depth flow information between apps. However, they are not designed to analyze apps at a large scale, and it is unclear how well they scale with real-world apps. In addition, some solutions (such as Epicc and IC3) identify ICCs but do not provide in-depth security classification as my work does. PRIMO [105] reported ICC analysis of a large pool of apps. Its analysis concerns the likelihood of communications between two apps, not security specifically, and it does not provide any security classification. Nevertheless, PRIMO could potentially serve as a pre-processing step for my analysis. For related dynamic analysis, IntentFuzzer [140] detects capability leaks by dynamically sending Intents to the exposed interfaces. INTENTDROID [70] performs dynamic testing on Android apps and detects the vulnerabilities caused by unsafe handling of incoming ICC messages.
SCanDroid [65] checks whether the data flows through the apps are consistent with the security specifications in their manifests. TaintDroid [62] tracks information flows with dynamic analysis and performs real-time privacy monitoring on Android. Similarly, FindDroid [151] associates each permission request with its application context, thus protecting sensitive resources from unauthorized use. XManDroid [35] was proposed to prevent privilege escalation attacks and collusion attacks by analyzing the communications among apps and ensuring that the communications comply with a desired system policy. Network-based approaches were also proposed for Android malware [146, 147, 148]. These dynamic analyses complement my static-analysis solution. However, their feasibility under a large number of app pairs is limited.
Generic Flow Analysis FlowDroid [19] is a popular static taint analysis tool for Android apps. The user can define the sources and sinks within one app and discover whether there are connections from the sources to the sinks. Amandroid [136] tracks the inter-component data and control flow of an app. The analysis can be leveraged to perform various types of security analysis. IC3 [104] focuses on inter-component communications and inferring ICC specifications. My work leverages IC3 [104] to extract the specifications of Intents. Although it is the most time-consuming step in my prototype, it is much more memory and computation efficient than other static analysis tools. In addition, my approach scales well and has the potential for market-scale security analysis.

Chapter 3

Detect Data Leakage
In this chapter, I propose to leverage MapReduce to perform data-leak detection [44, 93, 125]. MapReduce [57] is a programming model for distributed data-intensive applications. I present a new MapReduce-based system to detect the occurrences of plaintext sensitive data in storage and transmission. The detection is distributed and parallel, capable of screening massive amounts of content for exposed information. Several existing approaches have been proposed to detect data exfiltration. The work by Papadimitriou and Garcia-Molina [106] aims to find the agents that leaked the sensitive data from unauthorized storage. The analysis proposed by Borders et al. [29] detects changes in network traffic patterns by searching for unjustifiable increases in HTTP traffic flow volume, which indicate data exfiltration. The technique proposed by Shu et al. [123] performs deep packet inspection to search for exposed outbound traffic that bears high similarity to sensitive data. Set intersection is used for the similarity measure. The intersection is computed between the set of n-grams from the content and the set of n-grams from the sensitive data. Other detection methods include enforcing strict data-access policies on a host (e.g., storage capsule [30]), watermarking sensitive data sets and tracking data flow anomalies (e.g., DBMS-layer [24]). This similarity-based detection is versatile, capable of analyzing both text and some binary-encoded content (e.g., Word or .pdf files). A naive implementation requires O(nm) complexity, where n and m are the sizes of the two sets A and B, respectively. If the sets are relatively small, then a faster implementation is to use a hashtable or Bloom filter [28] to store set A and then test whether items in B exist in the hashtable or not, giving O(n + m) complexity. However, if A and B are both very large (as in my data-leak detection scenario), a naive hashtable may have hash collisions and slow down the computation.
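The complexity trade-off described above can be sketched in a few lines. This is an illustration of the generic hashtable approach only; my actual detection operates on fingerprinted n-gram collections rather than raw set items.

```python
def naive_intersection(A, B):
    """O(n*m) set intersection: compare every pair of items."""
    return {a for a in A for b in B if a == b}

def hashed_intersection(A, B):
    """O(n + m) expected time: build a hash set from A (O(n)),
    then probe it once per item of B (O(m))."""
    lookup = set(A)
    return {b for b in B if b in lookup}
```

The O(n + m) variant is attractive for moderately sized sets, but as the chapter notes, a single in-memory hashtable stops being practical when both inputs are very large.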
Although a Bloom filter is more efficient in memory footprint and access time when compared with the naive hashtable, it is still not capable of processing very large data sets. Increasing the size of the hashtable may not
be practical due to memory limitation and thrashing.1 One may attempt to distribute the dataset into multiple hashtables across several machines and coordinate the nodes to compute set intersections for leak scanning. However, such a system is nontrivial to implement from scratch and has not been reported in the literature. I design and implement new MapReduce algorithms for computing the similarity between sensitive data and large public content. The advantage of my data leak solution is its scalability. Because of the intrinsic ⟨key, value⟩ organization of items in MapReduce, the worst-case complexity of my algorithms is correlated with the size of the leak (specifically a γ ∈ [0, 1] factor denoting the size of the intersection between the content set and the sensitive data set). The complexity reduction brought by the γ factor is significant because the value is extremely low for normal content without leaks. In my algorithm, items not in the intersection (non-sensitive content) are quickly dropped without further processing. Therefore, the MapReduce-based algorithms have a lower computational complexity when compared to the traditional set-intersection implementation. I experimentally verified the runtime reduction brought by γ in Section 3.5.4. MapReduce can be deployed on nodes in the cloud (e.g., Amazon EC2) or in local computer clusters. It has been used to solve security problems such as spam filtering [41, 150] and malware detection [109]. However, the privacy of sensitive data is becoming an important issue that applications need to address when using MapReduce, especially when the sensitive data is outsourced to a third party or an untrusted cluster for analysis. The reason is that the MapReduce nodes may be compromised or owned by semi-honest adversaries, who may attempt to gain knowledge of the sensitive data. For example, Ristenpart et al.
[114] demonstrated the possibility of information leakage across VMs through side-channel attacks in third-party compute clouds (e.g., Amazon EC2). Although private multi-party data matching and data retrieval methods [39, 43, 64] and the more general-purpose secure multi-party computation (SMC) framework [142] exist, the high computational overhead is a concern for time-sensitive security applications such as data leak detection. For example, Chen et al. [43] proposed to perform DNA mapping using MapReduce while preserving the privacy of query sequences. However, the edit distance and hash function in their approach incur high computational overhead, especially on the local cluster. My collection-intersection-based detection requires new MapReduce algorithms and a privacy design, to which their edit-distance-based approach does not apply. I address the important data privacy requirement. In my privacy-preserving data-leak detection, MapReduce nodes scan content in data storage or transmission for leaks without learning what the sensitive data is. Specifically, the data privacy protection is realized with a fast one-way transformation. This transformation requires pre- and post-processing by the data owner for hiding and precisely identifying the matched items, respectively. Both the sensitive data and the content
need to be transformed and protected by the data owner before being given to the MapReduce nodes for the detection. In the meantime, such a transformation has to support the equality comparison required by the set intersection. To summarize, my contributions in detecting data leakage are as follows.

1 I experimentally validated this on a single host. The results are shown in Table 3.3 in Section 3.5.1.
• I present new MapReduce parallel algorithms for distributedly computing the sensitivity of content based on its similarity with sensitive data patterns. The similarity is based on collection intersection (a variant of set intersection that also counts duplicates). The MapReduce-based collection intersection algorithms are useful beyond the specific data leak detection problem.

• My detection provides a privacy enhancement to preserve the confidentiality of sensitive data during the outsourced detection. Because of this privacy enhancement, my MapReduce algorithms can be deployed in distributed environments where the operating nodes are owned by semi-honest service providers or untrusted clusters. Applications of my work include data leak detection in the cloud and outsourced data leak detection.

• I implement my algorithms using the open-source Hadoop framework. My prototype outputs the degree of sensitivity of the content, and pinpoints the occurrences of potential leaks in the content. I validate that my method achieves a high detection rate and a very low false positive rate for both plain text leaks and binary leaks.

• The pre-processing can be handled without significant overhead. My implementation also has very efficient intermediate data representations, which significantly minimizes the disk and network I/O overhead. I performed two sets of experimental evaluations, one on Amazon EC2 and one on a local computer cluster, using large-scale email data. I achieved 225 Mbps throughput for the privacy-preserving data leak detection when processing 74 GB of content.2
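The ⟨key, value⟩ organization that allows non-sensitive items to be dropped early can be illustrated with a single-process simulation of the two MapReduce phases. This is only a sketch: the real algorithms, their intermediate data fields for recording duplicates, and the Hadoop implementation are described later in this chapter and in Section 4.1.4, and the origin labels used here are illustrative.

```python
from collections import defaultdict

def map_phase(records):
    """Emit <ngram, (origin, count)> pairs.
    `records` is an iterable of (origin, ngram_list), where origin is
    'content' or 'sensitive' (illustrative labels)."""
    for origin, ngrams in records:
        counts = defaultdict(int)
        for g in ngrams:
            counts[g] += 1
        for g, c in counts.items():
            yield g, (origin, c)

def reduce_phase(pairs):
    """Group pairs by ngram key. A key seen on both sides contributes
    min(content count, sensitive count) to the collection intersection;
    keys seen on only one side (non-sensitive content) are dropped,
    which is what keeps the cost low for content without leaks."""
    grouped = defaultdict(dict)
    for g, (origin, c) in pairs:
        grouped[g][origin] = grouped[g].get(origin, 0) + c
    size = 0
    for sides in grouped.values():
        if "content" in sides and "sensitive" in sides:
            size += min(sides["content"], sides["sensitive"])
    return size
```

In a real Hadoop job the grouping by key is performed by the framework's shuffle stage rather than by an in-memory dictionary; the simulation only mirrors the data flow.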
This work demonstrates the feasibility of outsourced data leak detection. My architecture provides a complete solution for large-scale content screening for sensitive data leaks. The privacy guarantees, detection performance and detection accuracy of my solution ensure its robustness for data leak detection in the enterprise environment.
3.1 Threat Model, Security and Computation Goals
There are two types of input sequences in my data-leak detection model: content sequences and sensitive data sequences.
2 74 GB of content will be transformed into 74 × 8 GB of intermediate data as input for my MapReduce system.
• Content is the data to be inspected for any occurrences of sensitive data patterns. The content can be extracted from the file system or network traffic. The detection needs to partition the original content stream into content segments.
• Sensitive data contains the sensitive information that cannot be exposed to unauthorized parties, e.g., customers' records or proprietary documents. Sensitive data can also be partitioned into smaller sensitive data sequences.
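To make the two input types concrete, the sketch below partitions a content stream into segments and extracts each segment's n-grams. The segment length and the (n−1)-byte overlap between adjacent segments, which prevents n-grams spanning a boundary from being lost, are illustrative assumptions rather than the parameters of my evaluation.

```python
def partition(stream: bytes, seg_len: int, n: int):
    """Split a stream into segments of length seg_len that overlap by
    n-1 bytes, so no n-gram crossing a segment boundary is lost.
    Assumes seg_len > n - 1. (Illustrative partitioning policy.)"""
    step = seg_len - (n - 1)
    return [stream[i:i + seg_len]
            for i in range(0, max(len(stream) - (n - 1), 1), step)]

def ngrams(segment: bytes, n: int):
    """All length-n substrings of a segment, duplicates preserved."""
    return [segment[i:i + n] for i in range(len(segment) - n + 1)]
```

Because the segments overlap, the union of their n-grams equals the n-grams of the whole stream, so detection over segments misses nothing at the boundaries.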
3.1.1 Threat Model and Security Goal
In my model, two parties participate in the large-scale data leak detection system: data owner and data-leak detection (DLD) provider.
• Data owner owns the sensitive data and wants to know whether the sensitive data is leaked. It has full access to both the content and the sensitive data. However, it has only limited computation and storage capability and needs to authorize the DLD provider to help inspect the content for inadvertent data leaks.
• DLD provider provides the detection service and has much larger computation and storage power than the data owner. However, the DLD provider is honest-but-curious (aka semi-honest). That is, it follows the prescribed protocol but may attempt to gain knowledge of the sensitive data. The DLD provider is not given access to the plaintext content, but it may perform dictionary attacks on the digests of sensitive data records. Typical DLD providers include untrusted local clusters, public clouds (e.g., Amazon EC2) and organizations that provide detection services.
My goal is to offer the DLD provider solutions to scan massive content for sensitive data exposure while minimizing the possibility that the DLD provider learns about the sensitive information. Specifically, the detection should achieve the following goals:
• Scalability: the ability to process content at a variety of scales, e.g., megabytes to terabytes, enabling the DLD provider to offer on-demand content inspection.
• Privacy: the ability to keep the sensitive data confidential, not disclosed to the DLD provider or any attacker breaking into the detection system.
• Accuracy: the ability to identify all leaks and only real leaks in the content, which implies low false negative/positive rates for the detection.
My framework is not designed to detect intentional data exfiltration, during which the attacker may encrypt or transform the sensitive data.
3.1.2 Computation Goal
My detection is based on computing the similarity between content segments and sensitive data sequences, specifically the intersection of two collections of n-grams. One collection consists of n-grams obtained from the content segment and the other consists of n-grams from the sensitive sequence. n-grams capture local features of a sequence and have also been used in other sequence similarity methods (e.g., web search duplication [34]). Following the terminology proposed by Broder [34], an n-gram is also referred to as a shingle. Collection intersection differs from set intersection in that it also records duplicated items in the intersection, which is illustrated in Fig. 3.1. Recording the frequencies of intersected items achieves more fine-grained detection. Thus, collection intersection is preferred over set intersection for data leak analysis.
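The distinction illustrated in Fig. 3.1 is that of a multiset intersection, and it can be sketched as follows (an illustration of the definition, not my MapReduce implementation):

```python
from collections import Counter

def collection_intersection_size(content_ngrams, sensitive_ngrams):
    """Collection (multiset) intersection: an n-gram appearing c1 times
    in the content and c2 times in the sensitive data contributes
    min(c1, c2) occurrences to the intersection."""
    common = Counter(content_ngrams) & Counter(sensitive_ngrams)
    return sum(common.values())

def set_intersection_size(content_ngrams, sensitive_ngrams):
    """Plain set intersection ignores how often an n-gram repeats."""
    return len(set(content_ngrams) & set(sensitive_ngrams))
```

A content segment that repeats a sensitive n-gram several times therefore scores higher under collection intersection than under set intersection, which is exactly the fine-grained behavior described above.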