CONTENT SENSITIVITY BASED ACCESS CONTROL

MODEL FOR BIG DATA

By

ASHWIN KUMAR THANDAPANI KUMARASAMY

Bachelor of Technology in Information Technology, Anna University, Chennai, Tamil Nadu, India, 2010

Master of Science in Computer Science, Oklahoma State University, Stillwater, Oklahoma, 2013

Submitted to the Faculty of the Graduate College of Oklahoma State University in partial fulfillment of the requirements for the Degree of DOCTOR OF PHILOSOPHY, July 2017

CONTENT SENSITIVITY BASED ACCESS CONTROL

MODEL FOR BIG DATA

Dissertation Approved:

Dr. Johnson P Thomas

Dissertation Adviser

Dr. KM George

Dr. Blayne Mayfield

Dr. Ramesh Sharda


ACKNOWLEDGEMENTS

To my mom Meera and dad Kumarasamy: I am what I am because of you and I owe it all to you. Thanks for the never-ending encouragement, which helped me to finish this degree. I am grateful to my sibling, friends and family for providing moral and emotional support. Thanks for being there for me, Dakota.

I am grateful to my adviser Dr. Thomas. It was my pleasure to get to know you and work with you for the past several years. I learned a lot from you over these years and it has made me a better person. I am also thankful to my department head and co-adviser Dr. George for his guidance and support during my PhD. I also thank all my committee members for being very supportive.

I would like to express my gratitude to Liu Hong, Xiaofei Hou and Ashwin Kannan for being my research partners. It was great to work with you. I would like to thank all the graduate students that I got a chance to work with.

Acknowledgements reflect the views of the author and are not endorsed by committee members or Oklahoma State University.

Name: ASHWIN KUMAR THANDAPANI KUMARASAMY

Date of Degree: JULY, 2017

Title of Study: CONTENT SENSITIVITY BASED ACCESS CONTROL MODEL FOR BIG DATA

Major Field: Computer Science

Abstract: Big data technologies have seen tremendous growth in recent years. They are being widely used in both industry and academia. In spite of such exponential growth, these technologies lack adequate measures to protect the data from misuse or abuse. Corporations that collect data from multiple sources are at risk of liabilities due to exposure of sensitive information. In the current implementation of Hadoop, only file-level access control is feasible. Providing users the ability to access data based on attributes in a dataset or based on their role is complicated due to the sheer volume and multiple formats (structured, unstructured and semi-structured) of the data. In this dissertation, an access control framework that enforces access control policies dynamically based on the sensitivity of the data is proposed. This framework enforces access control policies by harnessing the data context, usage patterns and information sensitivity. Information sensitivity changes over time with the addition and removal of datasets, which can lead to modifications in access control decisions, and the proposed framework accommodates these changes. The proposed framework is automated to a large extent and requires minimal user intervention. The experimental results show that the proposed framework is capable of enforcing access control policies on non-multimedia datasets with minimal overhead.


TABLE OF CONTENTS

Chapter Page

I. INTRODUCTION ...... 1

Motivation ...... 1
Challenges ...... 3
Content Sensitivity Based Approach towards Access Control ...... 3
Contributions ...... 5
Structuring Non-Multimedia Dataset ...... 5
Generating Structural and Descriptive Metadata ...... 5
Tracking Usage Patterns ...... 6
Tracking Data Lineage ...... 6
Data-driven Approach to Estimate Data Sensitivity ...... 6
Enforcing Access Control Decisions Based on Data Sensitivity ...... 7
Re-estimating Data Sensitivity during Data Aggregation ...... 7
Thesis Organization ...... 7

II. REVIEW OF LITERATURE...... 8

Hadoop ...... 8
A Brief History of Hadoop ...... 8
Hadoop Distributed File System (HDFS) ...... 9
MapReduce Programming Model ...... 11
Overview of MapReduce Job Execution ...... 13
Drawbacks of MapReduce Programming Model ...... 14
Yet Another Resource Negotiator (YARN) ...... 15
Survey of Traditional Access Control Models ...... 17
Access Control Lists (ACL) ...... 18
Discretionary Access Control (DAC) ...... 19
Mandatory Access Control (MAC) ...... 21
Role Based Access Control (RBAC) ...... 22
Attribute Based Access Control (ABAC) ...... 26
Policy Based Access Control (PBAC) ...... 28
Risk Adaptable Access Control (RAdAC) ...... 29
Content Based Access Control Models ...... 31
Bitcoin ...... 37
Introduction ...... 37



Transactions ...... 38
Timestamp Server ...... 38
Proof-of-Work ...... 39
Network ...... 39
Verifying a Payment Using Block Chain ...... 40
Privacy ...... 40

III. STRUCTURING AND LINKING DATA ...... 42

Problem Statement ...... 42
Introduction ...... 42
Literature Review and Related Work ...... 43
Generating Metadata and Structuring Data in Hadoop ...... 43
Context Similarity Measures ...... 44
Usage Pattern Similarity Measures ...... 45
The Metadata Generator ...... 46
Introduction ...... 46
Data Context Analysis Module ...... 46
Data Usage Pattern Analysis Module ...... 48
Framework to Generate Metadata, Link and Track Data Items ...... 50
Introduction ...... 50
Roles and Responsibilities ...... 52
Enhanced Metadata Generator (EMG) ...... 52
Data Usage Tracker (DUT) ...... 55
Data Similarity Analyzer (DSA) ...... 55
Provenance Tracker (PT) ...... 58
Experimental Results ...... 58

IV. DETECTING SENSITIVE DATA ITEMS ...... 61

Problem Statement ...... 61
Introduction ...... 61
Literature Review and Related Work ...... 62
Shannon’s Information Entropy ...... 62
Information Gain ...... 62
Assessing Security Risk of a Dataset ...... 63
Information Value Model to Identify Sensitive Data Items ...... 70
Introduction ...... 70
Information Value Estimator (IVE) ...... 70
Data Sensitivity Estimator (DSE) ...... 71
Entropy Based Approach to Identify Sensitive Data Items ...... 72



Introduction ...... 72
Roles and Responsibilities ...... 72
Data Sensitivity Estimator (DSE) ...... 73

V. TRACKING DATA LINEAGE...... 81

Problem Statement ...... 81
Introduction ...... 81
Summary of Existing Lineage Tracking Frameworks for Big Data ...... 84
Block Chain Approach to Track Data Lineage ...... 90
Creating Genesis Block ...... 93
Adding a Block in a Block Chain with no Data Aggregation ...... 94
Adding a Block in a Block Chain with Data Aggregation ...... 95
Handling Deletion of Datasets ...... 98
Validating Provenance Data ...... 104

VI. PROPOSED CSBAC FRAMEWORK ...... 105

Problem Statement ...... 105
Introduction ...... 105
Related Work ...... 106
Content Sensitivity Based Access Control (CSBAC) Framework ...... 111
Access Control Enforcer (ACE) ...... 112
Experimental Results ...... 121
Data Used ...... 121
MapReduce Jobs Used ...... 121
Results ...... 122
Comparison with CBAC ...... 132

VII. CONCLUSION AND FUTURE WORK ...... 135

REFERENCES ...... 136


LIST OF TABLES

Table Page

1 An Example of an Access Matrix ...... 20
2 Metadata Generated for Customer Dataset ...... 59
3 Snapshot of Usage Tracked by Data Usage Tracker ...... 59
4 Results from Context Similarity Analyzer for Customer Dataset with Similarity Scores > 0.9 ...... 60
5 Semantically Related Data Items Identified by Markov’s Algorithm ...... 60


LIST OF FIGURES

Figure Page

1 HDFS Architecture [38] ...... 10
2 MapReduce Execution Overview [39] ...... 12
3 YARN Architecture [57] ...... 25
4 Various access control models [59] ...... 17
5 Example of an Access Control List (ACL) ...... 18
6 Relationship between RBAC96 models [115] ...... 23
7 RBAC96 model family [115] ...... 24
8 Attribute Based Access Control Model (ABAC) [133] ...... 27
9 Access Control mode of PBAC [135] ...... 29
10 RAdAC Model in action [138] ...... 30
11 Hierarchical Video database model [152] ...... 33
12 Example Hierarchical Semantic tree representing a News video [152] ...... 33
13 Extended Video database [151] ...... 34
14 Implementation of a Timestamp Server in Bitcoin [165] ...... 39
15 Verification of Payment made using Block Chain [165] ...... 40
16 Traditional Privacy model vs. Bitcoin Privacy Model [165] ...... 40
17 System Architecture [18] ...... 46
18 Markov’s Clustering Algorithm [71] ...... 47
19 Sensitive Data Detection (SDD) Framework [19] ...... 51
20 Pseudo code of Structural Metadata Generation Algorithm ...... 53
21 Pseudo code of Algorithm 2 ...... 53
22 Pseudo code of Algorithm 3 ...... 54
23 Pseudo code of Algorithm 4 ...... 55
24 Architecture of Data Similarity Analyzer (DSA) ...... 56
25 Value of information when shared [81] ...... 64
26 Value of information over usage [81] ...... 65
27 Value of information over time [81] ...... 66
28 Value of information over accuracy [81] ...... 67
29 Value of information over aggregation (or) integration [81] ...... 68
30 Information value over volume [81] ...... 69
31 Architecture of Data Sensitivity Estimator (DSE) [19] ...... 71
32 System Architecture – Identifying sensitive data items using Information Entropy ...... 73
33 Architecture of the Data Sensitivity Estimator (DSE) ...... 73
34 An Example of Information Sensitivity Graph ...... 75



35 Contents of a block in block chain ...... 91
36 Merkle Tree Calculations ...... 92
37 Algorithm to Create Genesis Block ...... 93
38 Algorithm to add a Block to the Block Chain ...... 94
39 Depiction of adding a Block to a Block Chain ...... 95
40 Algorithm to add a Block to the Block Chain when Datasets are aggregated ...... 95
41 Calculating Merkle root when multiple datasets are aggregated ...... 96
42 Scenario 1 – Adding a Block to a Block Chain when multiple datasets are aggregated ...... 97
43 Scenario 2 – Adding a Block to a Block Chain when multiple datasets are aggregated ...... 98
44 Algorithm to handle Deletion of data in HDFS ...... 99
45 Scenario 1 – Deleting a non-aggregated dataset – Before Deletion ...... 100
46 Scenario 1 – Deleting a non-aggregated dataset – After Deleting Dataset 2 ...... 101
47 Scenario 2 – Deleting a non-aggregated dataset – After Deleting Dataset 2 ...... 101
48 Scenario 3 – Deleting an aggregated dataset – Before deletion ...... 102
49 Scenario 3 – Deleting an aggregated dataset – After deleting Aggregation 1 ...... 103
50 Algorithm to validate Block Chain ...... 104
51 Architecture of CSBAC Framework ...... 112
52 Algorithm for reading a dataset via command line in Hadoop [10] ...... 114
53 Algorithm 1 ...... 115
54 Algorithm 2 ...... 115
55 Algorithm 3 – Modified Algorithm for reading a dataset via command line in ACE ...... 117
56 Input preprocessing in Hadoop [10, 14] ...... 118
57 Input preprocessing in ACE [10, 14] ...... 119
58 Re-estimation of Sensitivity ...... 120
59 Overhead posed by the CSBAC Framework when accessing Data via CLI ...... 122
60 Running time of MR jobs on the structured data ...... 123
61 Running time of MR jobs on the unstructured data ...... 123
62 Running time of MR jobs on the semi-structured data ...... 124
63 Overhead posed by CSBAC for Job 1 on Structured data ...... 126
64 Overhead posed by CSBAC for Job 2 on Structured data ...... 126
65 Overhead posed by CSBAC for Job 3 on Structured data ...... 127
66 Overhead posed by CSBAC for Job 1 on unstructured data ...... 127
67 Overhead posed by CSBAC for Job 2 on unstructured data ...... 128
68 Overhead posed by CSBAC for Job 3 on unstructured data ...... 128
69 Overhead posed by CSBAC for Job 1 on semi-structured data ...... 129
70 Overhead posed by CSBAC for Job 2 on semi-structured data ...... 129
71 Overhead posed by CSBAC for Job 3 on semi-structured data ...... 130
72 Overhead Analysis when multiple datasets are joined ...... 131
73 Comparison between CBAC and CSBAC with no base set ...... 132
74 Comparison between CBAC and CSBAC with a partial base set ...... 133


LIST OF PUBLICATIONS

1 Ashwin Kumar, T.K., Liu, H. and Thomas, J.P. (2014). Efficient structuring of data in big data. In International Conference on Data Science & Engineering (ICDSE), 2014, pp. 1-5. IEEE.
2 Liu, H., Ashwin Kumar, T.K., and Thomas, J.P. (2015). Cleaning framework for big data – object identification and linkage. In IEEE International Congress on Big Data (BigData Congress), 2015, pp. 215-221. IEEE.
3 Ashwin Kumar, T.K., Liu, H., Thomas, J.P. and Mylavarapu, G. (2015). Identifying Sensitive Data Items within Hadoop. The 1st IEEE International Conference on Big Data Security on Cloud (BigDataSecurity 2015), 2015, pp. 1308-1313.
4 Ashwin Kumar, T.K., Liu, H., Thomas, J.P. and Hou, X. (2016). An Information Entropy Based Approach to Identify Sensitive Information in Big Data. Accepted for publication by the 2nd IEEE International Conference on Big Data Security on Cloud (BigDataSecurity 2016).
5 Ashwin Kumar, T.K., Liu, H., Thomas, J.P. and Hou, X. (*). Content Sensitivity Based Access Control Framework For Hadoop, Journal of Digital and Communication Networks. [Under Review After Revision].


CHAPTER I

INTRODUCTION

Motivation

We have been generating unprecedented amounts of data over the past decade [1]. An estimate by E. Schmidt suggests that every two days we generate roughly five exabytes (10^18 bytes) of data, which is equivalent to all the data produced up until 2003 [1]. According to another survey [34], Gartner Inc. predicts that enterprise data will grow by 800 percent in five years, with 80 percent of it unstructured. This data comes in several formats and mostly comprises social media interactions, videos, sensor data, server logs, e-mail and so on. Big data technologies can handle huge volumes and a wide variety of data more efficiently than a traditional RDBMS. Due to this, the usage of big data applications is increasing in both industry and academia.

There are several big data applications that extract knowledge from massive amounts of data.

Sensitive information, which can reveal a person’s identity for example, is invariably present in these huge volumes of data [2]. Identifying such information is tedious and complicated due to the sheer size of the data and its variety. A study conducted by L. Sweeney suggested that using only the 5-digit ZIP code, date of birth and gender, 87% of the population in the U.S. could be identified [3]. Around half of the population can be identified from current city, date of birth and gender [3]. All the information sufficient to identify a person will not necessarily occur in the same dataset, but combining several data sources could reveal the identities of people. Hence there is a need to protect this sensitive information, and this is in itself a challenge.

For some time, data anonymization was viewed as a silver bullet to protect personal information [4]. But incidents such as the AOL data leak [5] (where search logs were used to reveal user identities) and the revelation of users’ identities based on Netflix movie recommendations [6] have proved that anonymization is not adequate to protect sensitive information. In addition, there are several de-anonymization techniques [6, 7, 8, 9] that render data anonymization ineffective.

A dataset might have a combination of sensitive and non-sensitive information. There is always a possibility that users who are given access to such data will misuse or abuse it. There was an incident involving a help-desk employee selling customer bank and credit card passwords to scam artists for a bounty [10, 11]. Incidents like these prove that there is a need for sophisticated mechanisms to prevent the misuse of information by authorized personnel.

In the current Hadoop implementation [12, 13], all non-sensitive information that co-occurs with sensitive information is often brought under a restricted category to ensure the protection of sensitive data items [14]. This is because Hadoop uses the POSIX model for providing access to files and folders stored in HDFS [12, 13], where a user is either given access to an entire dataset or not given access at all. This does not protect the data from being misused or abused by an authorized user, and discovering a misuse/abuse after it has occurred is often too little, too late.

Several laws, such as the Sarbanes-Oxley Act [15] (which protects corporate financial information) and the HIPAA Act [16] (which restricts the sharing of patients’ health records), impose restrictions on data sharing. Complying with these complex levels of data sharing is not feasible in the current Hadoop implementation. Therefore, fine-grained access control mechanisms are the need of the hour. They allow data to be shared precisely while not compromising sensitive information [14].

Challenges

Implementing a fine-grained access control mechanism for big data is challenging. Traditional methods like Access Control Lists (ACLs) and Role Based Access Control (RBAC) are not suitable for big data applications, as they grow manyfold in size and become cumbersome to manage [17]. R. Krishnan in [17] suggests that “Attribute Based Access Control (ABAC) will be able to provide necessary flexibility in dealing with large scale of data”. Attributes in a dataset can be easily identified if the data owner provides information about it. With the exponential growth in data, not all datasets come with the required information from the data owner, and identifying attributes from a dataset manually is tedious due to the volume and variety of these datasets. The research in [18, 19, 20] makes use of data context and usage patterns to identify attributes in any non-multimedia dataset. The identified attributes contribute to the metadata of the corresponding dataset.

A white paper by General Atomics [21] lists the significance of metadata. Metadata is also very helpful in enforcing attribute-based access control decisions, and [19] paves the way for implementing ABAC (Attribute Based Access Control) in Hadoop. Once all the attributes are identified, the next logical step is to identify which of these attributes are sensitive.

Content Sensitivity Based Approach towards Access Control

There have been many research works [22-31] that deal with providing access control for Hadoop. All of them except [29] and [31] depend solely on information provided by the data owners regarding data sharing to enforce access control decisions. In [29], the authors propose a content-based access control framework (CBAC) for Hadoop, which functions based on the data content itself.

The proposed Content Sensitivity Based Access Control (CSBAC) framework is somewhat related to [29]. The stark difference between [29] and the approach proposed in this dissertation is that the former compares the data content with a base dataset to make access control decisions, whereas the latter uses the data content alone, with no base dataset, to estimate its sensitivity and makes access control decisions based on that sensitivity. The advantages of the proposed approach are as follows: 1) elimination of the significant manual effort required to construct a base dataset; it is simply impossible to construct a base dataset covering all possible types and combinations of data; 2) factoring in the variation in data sensitivity when multiple datasets are aggregated and used. CBAC [29] fails to address this issue. If there is no base dataset, or if a new dataset is created as a result of join or aggregation operations, CBAC will not work as intended, which severely limits its effectiveness and applicability.

The scenarios given below provide empirical evidence that content sensitivity changes when datasets are aggregated. The proposed approach can handle such dynamic changes in data sensitivity without compromising users’ privacy.

Scenario 1: Consider a dataset maintained by the Human Resources department containing a list of all employees, and a budget dataset that contains salary forecasts and payments. These datasets are relatively less sensitive by themselves, but when they are aggregated, the resulting dataset, which contains the list of all employees and their salaries, is more sensitive [32].

Scenario 2: A single e-mail is relatively not very sensitive, but when a series of correspondences is aggregated, it provides detailed insights into a person or a project described in those e-mails. This aggregated dataset is highly sensitive [32].

Estimating data sensitivity from the data itself and harnessing it to make access control decisions is new in the context of big data, and this is the novelty of the proposed framework. In this dissertation, a Content Sensitivity Based Access Control (CSBAC) framework, which utilizes data sensitivity to enforce access policies, is proposed. The CSBAC framework is an automated framework requiring minimal user intervention, and it is an extension of the SDD (Sensitive Data Detection) framework proposed in [19].

Contributions

Contributions in building the CSBAC Framework are as follows.

Structuring Non-Multimedia Dataset

The framework proposed by us in [18] is capable of identifying data items in a non-multimedia dataset. The first step in protecting sensitive information is identifying the data items (attributes) present in a dataset. This process is very simple when the data owner provides information about what is in the dataset; with significant data democratization, however, there are many datasets without this information. When there is no owner-provided information, the data items must be identified manually, which is tedious due to the sheer volume of data and the many types of datasets that exist. The framework proposed in [18] makes use of the data context and usage patterns of datasets to extract data items and to identify how they are related to one another.

Generating Structural and Descriptive Metadata

Metadata can be of two types, namely structural and descriptive metadata. Structural metadata consists of information about what a dataset comprises, such as its data items, their data types and whether the values of a data item are unique. Descriptive metadata is a textual representation of the dataset itself. The current implementation of Hadoop [12] is not capable of generating this information. The current Hadoop implementation does, however, store block-level metadata [12]; this is useful at the block level to maintain integrity, but not for detecting sensitive data items. The Enhanced Metadata Generator (EMG) proposed in [19] generates these two kinds of metadata. Every dataset stored in HDFS will have at least one of these two types of metadata. Descriptive metadata is very helpful when the dataset is textual and cannot be coerced into any relational schema.

Tracking Usage Patterns

Data usage patterns shed light on how a dataset is being used. Even if two Hadoop clusters hold the same datasets, it cannot be guaranteed that the two clusters will have similar usage patterns. The Data Usage Tracker (DUT) proposed in [18, 19] records data usage patterns. The DUT tracks each user’s usage, which consists of the user identity, timestamp, datasets and data items accessed by the user, along with the job information. All the usage patterns generated are stored in HDFS, like the metadata. These usage patterns are used to identify relationships between data items and to estimate the sensitivity of data items.
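To make the shape of these records concrete, the sketch below shows the kind of usage entry the DUT could persist; the class name, fields and tab-separated log format are assumptions made for illustration, not the DUT’s actual schema.

import java.time.Instant;
import java.util.List;

// Hypothetical sketch of a single usage-pattern record tracked by the DUT.
public class UsageRecord {
    String userId;            // identity of the user issuing the request
    Instant timestamp;        // when the access occurred
    String datasetPath;       // HDFS path of the dataset touched
    List<String> dataItems;   // attributes (columns/fields) actually read
    String jobId;             // MapReduce job that performed the access

    // Serialized as one line so records can be appended to a log file in HDFS.
    public String toLogLine() {
        return String.join("\t", userId, timestamp.toString(),
                datasetPath, String.join(",", dataItems), jobId);
    }
}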

Tracking Data Lineage

Data lineage refers to keeping track of where the data is at any given time and at any stage of a process. Data lineage is important, as it can be used to identify where sensitive information is used in a process. The precursor to the Provenance Tracker proposed in [19] tracks data lineage. The proposed CSBAC framework modifies the Provenance Tracker by adding the block-chain capabilities proposed in Bitcoin [165]. With block-chain capabilities, the lineage generated by the Provenance Tracker is made tamper-evident and cannot be edited or modified by malicious users. This data lineage can also be used for audits to discover any violations of organization-specific data access policies.
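To illustrate the general idea of hash-chained lineage in the spirit of the Bitcoin-style blocks referenced above, a minimal sketch follows; the class and field names are hypothetical and this is not the Provenance Tracker’s actual implementation.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Minimal hash-chained lineage entry: tampering with any earlier block
// changes its hash and breaks the link to every later block.
public class LineageBlock {
    final String previousHash;   // hash of the preceding block in the chain
    final String datasetId;      // dataset the recorded operation touched
    final String operation;      // e.g. "CREATE", "READ", "AGGREGATE"
    final long timestamp;
    final String hash;           // hash over this block's contents

    public LineageBlock(String previousHash, String datasetId,
                        String operation, long timestamp) throws Exception {
        this.previousHash = previousHash;
        this.datasetId = datasetId;
        this.operation = operation;
        this.timestamp = timestamp;
        this.hash = sha256(previousHash + datasetId + operation + timestamp);
    }

    static String sha256(String input) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(input.getBytes(StandardCharsets.UTF_8))) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}

Validating the chain then amounts to recomputing each block’s hash and checking that it matches the previousHash stored in the following block.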

Data-Driven Approach to Estimate Data Sensitivity

The proposed CSBAC framework uses the metadata generated by the EMG and the usage tracked by the DUT to differentiate sensitive data items from non-sensitive ones. Two methodologies are presented to quantify the sensitivity of data items. The first methodology, presented in [19], quantifies information value by assessing the security risk of a dataset; the laws governing how the security risk of a dataset is assessed are given in [35]. The second methodology uses Shannon’s entropy [36] to estimate data sensitivity. Results from these two methodologies are compared and the better one is selected.
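As a simple illustration of the second methodology, the snippet below computes Shannon’s entropy over the empirical value distribution of a single data item (column); how the resulting value is mapped to a sensitivity score is defined in Chapter IV, and the method name here is illustrative only.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EntropySketch {
    // H(X) = - sum over distinct values v of p(v) * log2 p(v)
    static double shannonEntropy(List<String> columnValues) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : columnValues) {
            counts.merge(v, 1, Integer::sum);
        }
        double n = columnValues.size();
        double entropy = 0.0;
        for (int count : counts.values()) {
            double p = count / n;
            entropy -= p * (Math.log(p) / Math.log(2)); // log base 2
        }
        return entropy;
    }
}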


Enforcing Access Control Decisions Based On Data Sensitivity

Once the data sensitivity of the datasets stored in HDFS is evaluated, access control decisions are enforced based on the user role. The Access Control Enforcer (ACE) in CSBAC plays an important role in enforcing access control decisions. This process consists of the two-step approach shown below and sketched in code after the list.

1. File-level access check – The ACE initially checks whether the user has file-level access to the resource he/she is requesting. If so, the request enters the next step; otherwise the request from the user is discarded.

2. Data-item-level access check – Even if the user has the privilege to access a certain dataset, it does not mean that he/she can access all of its data items. Decisions about accessing these data items are made based on the user role and the sensitivity of the data items themselves.
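A minimal sketch of this two-step check, under the assumption that role clearance and item sensitivity can be compared on a common numeric scale, is given below; the class, method and parameter names are illustrative and are not the ACE’s actual interfaces.

import java.util.Map;
import java.util.Set;

public class AccessCheckSketch {
    // Step 1: file-level check, comparable to the HDFS permission check.
    static boolean hasFileAccess(String user, String path,
                                 Map<String, Set<String>> fileAcl) {
        return fileAcl.getOrDefault(path, Set.of()).contains(user);
    }

    // Step 2: data-item-level check based on role clearance vs. item sensitivity.
    static boolean canReadItem(double roleClearance, double itemSensitivity) {
        return roleClearance >= itemSensitivity;
    }

    static boolean enforce(String user, String path, String item,
                           Map<String, Set<String>> fileAcl,
                           Map<String, Double> itemSensitivity,
                           double roleClearance) {
        if (!hasFileAccess(user, path, fileAcl)) {
            return false;                       // request discarded at step 1
        }
        return canReadItem(roleClearance,
                itemSensitivity.getOrDefault(item, 0.0)); // step 2
    }
}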

Re-estimating Data Sensitivity during Data Aggregation

Whenever multiple datasets are aggregated, the sensitivity of data items may change, as in Scenarios 1 and 2. This dynamic change in data sensitivity is addressed by re-calculating the sensitivity of data items whenever a data aggregation operation is performed.

Thesis Organization

The rest of this dissertation is organized as follows. Chapter 2 consists of a summary of related work. In Chapter 3, the framework used to structure a dataset and identify data items is described in detail. In Chapter 4, a data-driven approach to identify sensitive data items is described. Chapter 5 describes the Provenance Tracker in detail, and Chapter 6 illustrates the proposed CSBAC framework and how access control decisions are enforced. Chapter 7 concludes this dissertation and provides guidelines for future work.

CHAPTER II

REVIEW OF LITERATURE

Hadoop

“Apache Hadoop is an open-source software for reliable, scalable and distributed computing” [37]. Hadoop can handle huge volumes and many different types of data. It is also capable of handling streaming data at relatively high speeds. Hadoop comprises two key components, namely HDFS (the Hadoop Distributed File System) and MapReduce [38]. Two important features that make Hadoop prominent are data locality and parallelism. Parallelism is achieved through the MapReduce programming model developed by Google. In this programming model, data and tasks are distributed across several low-cost commodity machines [39]. This programming model provides mass storage and parallel computation power [39].

A brief history of Hadoop

Doug Cutting created Hadoop [51]. Hadoop was initially a part of Apache Nutch [50, 52], an open-source and highly scalable web crawler [50, 51, 52]. Apache Nutch was part of a bigger text indexing and searching project called Apache Lucene [49]. The Apache Nutch project began in 2002 and started to index lots of web pages [51, 52]. The disadvantage of this project was that it was not able to scale to index the billions of web pages that were available. This problem was solved with the help of the Google File System (GFS) developed by S. Ghemawat et al. [47]. The Google File System was capable of providing distributed storage for the huge volumes of data produced during the indexing and crawling of web pages, and in addition it eliminated all the bookkeeping operations that were used in Apache Nutch to track the data stored in the storage nodes [51, 47].

In 2004, J. Dean and S. Ghemawat released the MapReduce framework [39]. This framework was capable of processing data distributed across several nodes. The Apache Nutch project developers were able to make use of both GFS [47] and MapReduce [39] and port them to Nutch. The ported GFS [47] was termed the Nutch Distributed File System (NDFS) [50, 51]. As NDFS and MapReduce were beyond the scope of Apache Nutch, they were moved to a separate subproject of Apache Lucene called Hadoop [51].

The first large Hadoop cluster was launched at Yahoo! in 2008 [48, 53]. This cluster processed roughly one trillion links between web pages and produced about 300 TB of compressed output [53]. Due to its large user community and contributions, Apache Hadoop was made a top-level project in 2008 [51]. The New York Times used a Hadoop cluster consisting of around 100 machines in Amazon’s EC2 cloud to process 11 million articles published between 1851 and 1922 [54]. It took less than 24 hours to convert these 4 TB of articles into PDFs for viewing on web pages.

Hadoop Distributed File System (HDFS)

The architecture of a typical Hadoop cluster is shown in Figure 1. HDFS provides the required storage capability for a Hadoop cluster. All the data in Hadoop is handled by HDFS, which distributes very large datasets across multiple commodity machines [38]. This makes HDFS a very scalable, highly distributed and reliable file system. HDFS is built around a write-once, read-many-times data access pattern [38].

Figure 1: HDFS Architecture [38]

A large dataset is broken down into smaller components called data blocks and stored in HDFS. Users cannot access individual data blocks; they access the dataset as a whole. The Hadoop administrator predetermines the size of these data blocks. The default block size is 64 MB (128 MB in more recent Hadoop releases), which is large compared with the block sizes of other file systems. One of the main reasons for the large block size is to reduce the cost of disk seeks [38, 55]; by doing so, the cost of transferring many data blocks is not dominated by the disk seek cost [55]. There are numerous advantages to storing data as blocks, such as increased reliability, fault tolerance, block-level abstraction and efficient use of disk storage.
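For reference, the block size and replication factor are ordinarily set by the administrator in hdfs-site.xml; the snippet below uses the Hadoop 2.x property names (dfs.blocksize and dfs.replication) with a 128 MB block size as an example value.

<!-- hdfs-site.xml: block size and replication (Hadoop 2.x property names) -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>   <!-- 128 MB, in bytes -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>           <!-- default replication factor -->
  </property>
</configuration>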

A dataset may well be larger than the disk storage capacity of a single node, and it can still be stored in HDFS because it is not stored on just one machine but distributed over multiple machines [55]. Even though datasets are split into several blocks, end users are not aware of this and still view the entire dataset as a single entity. To increase data availability, reliability and the ability to handle system failures, the data blocks are replicated across the Hadoop cluster [38]. The number of replicas is determined by the replication factor (the default replication factor is 3). If a data node holding a certain data block fails, the data can be accessed from another data node that has a replica of the block. The integrity of all data blocks is preserved with the help of CRC checksums. Hadoop generates metadata for individual data blocks, and this metadata is stored where the corresponding data blocks are stored.

A typical Hadoop cluster consists of a single name node and many data nodes. A master-slave relationship exists between the name node and the data nodes. All the data in a Hadoop cluster is stored on the data nodes only. The name node manages the HDFS namespace [39, 55]; a namespace tree stored in the main memory of the name node enables this. In order to recover from a name node failure, the most recent version of the HDFS namespace (the fsimage) and the changes made to this namespace (the edit logs) are stored persistently on disk. If the name node fails, the fsimage and edit logs are transferred to a secondary name node. The secondary name node applies all the changes in the edit logs to the fsimage to obtain an equivalent of the latest version of the HDFS namespace, and then functions as the master node.

All the data nodes send a heartbeat message to the name node at periodic intervals [39]. A data node is assumed by the name node to be dead when there is no heartbeat message from that node, and the data blocks that were on that data node are replicated to other data nodes that have sufficient disk space.

MapReduce Programming Model

There are four major entities in the MapReduce programming model: the client, the Job Tracker, the Task Trackers and HDFS [39]. The client is an end user who submits MapReduce jobs to the Hadoop cluster. The Job Tracker is responsible for coordinating MapReduce jobs. Every MapReduce job can be divided into a set of tasks; these tasks are categorized into map and reduce tasks and are monitored by the Task Tracker. All the resources related to a MapReduce job are stored in HDFS. A Hadoop cluster has only one instance of the Job Tracker running, whereas all the slave nodes run an instance of the Task Tracker. Map tasks in MapReduce jobs can be parallelized so that they run on a voluminous dataset stored across multiple commodity machines. In addition to significantly reducing running time, the MapReduce framework also increases fault tolerance and scalability. One should also note that not every serial program can be parallelized using the MapReduce framework.

Figure 2: MapReduce Execution Overview [39]

A MapReduce job processes a set of input key/value pairs and produces another set of key/value pairs as output [39]. The MapReduce programming model has two main functions, namely map and reduce, which are implemented in map and reduce tasks respectively [39]. In addition to these two main functions, the MapReduce model also allows users to implement partition and combiner functions. The implementation of these functions is user-defined and can change from one job to another.
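As a concrete illustration of the map and reduce functions described above, the classic word-count job is sketched below using the org.apache.hadoop.mapreduce API; it is the standard textbook example rather than one of the jobs evaluated later in this dissertation.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit (word, 1) for every word in an input line.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);            // intermediate key/value pair
        }
    }
}

// Reduce: sum the counts emitted for each word.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum)); // final key/value pair
    }
}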

Overview of MapReduce Job Execution

An overview of MapReduce job execution is shown in Figure 2. In Figure 2, the user program corresponds to a MapReduce job, the master corresponds to the Job Tracker, and a worker corresponds to a Task Tracker monitoring both map and reduce tasks. MapReduce job execution has three phases, namely job initialization, job execution and cleanup [40]. The dataset is split logically by the Job Tracker into chunks called “input splits” for the purpose of running a MapReduce job. The size of an input split can range from 16 to 64 MB [39]. If the input split size is not equal to the data block size, a split can span multiple data nodes.

Monitoring Job Progress and Status Updates

The running time of a MapReduce job can range anywhere between minutes and hours depending on the size of the data it processes [40]. Therefore, it is necessary for a user to be able to track the progress of these jobs. A MapReduce job and all of its tasks have an associated status [40]. This status comprises the state of the job (for example: running, failed, completed), the map and reduce progress, the values of the job’s counters and a user-defined status message [40]. The progress of a map task is the ratio of the input data it has consumed to the size of its input data. The progress of a reduce task is also the ratio of the input consumed to the size of the input, but the total progress of a reduce task is split into three equal parts covering the copy, sort and actual reduce phases. For example, if a reduce task has consumed half of its input, then its progress is 5/6, as it has completed the copy (1/3) and sort (1/3) phases and is halfway through the reduce phase (1/6) [40].


Fault Tolerance in MapReduce

The MapReduce programming model is highly scalable, and thus parallelized programs (MapReduce jobs) can be run simultaneously on many commodity machines. Since the execution of a MapReduce job involves multiple machines, it is only practical to expect some of these machines to fail, and the MapReduce framework should be able to handle such failures well [39]. Data replication helps greatly in the case of node failures.

If a Task Tracker stops sending heartbeat messages to the Job Tracker for a certain period of time, then that Task Tracker is considered dead. Any progress made by the tasks on the failed node is reset, and those tasks are rescheduled on other data nodes that have the same data blocks [39]. If a node that has completed a map task fails before or while the intermediate results are being copied to the reduce task, that map task has to be scheduled and executed again. The rescheduling information is passed along to the corresponding reducer node so that it can fetch the intermediate results from the correct node [39].

If the Job Tracker fails, it is restarted from its latest checkpoint. The Job Tracker periodically creates checkpoints, which contain information about the status of jobs that are running or waiting to be run. If no checkpointing is done by the Job Tracker, then all the MapReduce jobs being executed are aborted when the Job Tracker fails.

Drawbacks of MapReduce programming model

Although the MapReduce model has many advantages, it also comes with significant disadvantages. The MapReduce model ran into scalability bottlenecks because the Job Tracker was responsible for both scheduling MapReduce jobs and monitoring their progress. It became impossible for the Job Tracker to scale beyond about 4,000 nodes [56] due to its expensive bookkeeping operations.

Yet Another Resource Negotiator (YARN)

Figure 3 YARN Architecture [57]

YARN was created to address the shortcomings of the MapReduce programming model. In the YARN model, the functions of the Job Tracker are split into two separate entities, called the Resource Manager and the Application Master [57]. This reduces the bookkeeping workload of the Job Tracker. The Resource Manager (RM) is responsible for controlling resource usage in a Hadoop cluster, checking whether a node is alive, enforcing resource allocations and resolving contention when resources are shared between users [57]. The Application Master (AM) is responsible for allocating tasks and monitoring their progress. In addition, the AM is responsible for coordinating with the RM to allocate the resources required for a MapReduce job [57]. The architecture of YARN is shown in Figure 3.

The Resource Manager runs on a dedicated machine and manages the resources of the cluster centrally [57]. The Resource Manager allocates “containers” to tasks dynamically based on demand. A container is a collection of resources, a combination of CPU and RAM, belonging to a slave node [57]. To monitor these container assignments and prevent the overuse of containers, the RM communicates with the Node Manager (NM) [57]. The NM is responsible for monitoring the containers and managing their life cycle at the data node level [57]. In addition, the NM reports faults to the RM frequently; the NM communicates with the RM through heartbeat messages [57]. By putting together the heartbeat messages from the NMs running on all the data nodes, the RM gets a clear picture of the resource utilization of the entire cluster.

A job client submits a MapReduce job to the RM through a submission protocol accessible to all users. Jobs are initially scrutinized by an admission control phase, which checks the credentials of the user submitting the job and verifies that the user is authorized to access the requested resources. After the initial admission control phase, the scheduler schedules accepted jobs for execution; based on the available resources, the jobs are executed. When a MapReduce job is executed, a container for the Application Master (AM) is created on one of the nodes in the Hadoop cluster [57]. Accepted MapReduce jobs (or applications) are stored in persistent storage so that they can be recovered later in the event of a failure. The AM is responsible for running all the tasks in a job and for managing the life cycle, flow of execution, resource utilization and error handling of all the tasks involved in a MapReduce job [57]. YARN assumes most MapReduce programs will be written in a higher-level programming language like Java or Python [57]. To execute the tasks in a MapReduce job, the AM needs a set of containers on several nodes from the RM. To get hold of these containers, the AM first requests the required resources from the RM. This request comprises specific features of the requested containers and their locality preferences [57]. The RM accommodates such requests from all applications based on current availability and scheduling policies [57]. When the RM allocates a container to an AM, it creates a lease for the container, which is sent to the AM via a heartbeat message. The AM passes this lease information to the Node Manager (NM) on which the container resides to prove its authenticity [57]. After the NM verifies the authenticity of the AM, it grants the container to the AM. Once a container is available to the AM, it encodes a task (map/reduce) launch request with the lease information [57]. In the container, either a map or a reduce task is executed. The containers communicate with the AM periodically to report the job status, their health and their resource utilization [57].
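The container-lease handshake described above can be summarized in a simplified model; the classes and method names below are hypothetical stand-ins used only to show the order of the steps, and they are not YARN’s actual client API.

// Simplified model of the AM <-> RM <-> NM container handshake described above.
public class ContainerLeaseSketch {
    record ContainerRequest(int vcores, long memoryMb, String preferredNode) {}
    record Lease(String containerId, String nodeId, String token) {}

    interface ResourceManager { Lease allocate(ContainerRequest request); }
    interface NodeManager     { void launch(Lease lease, String taskSpec); }

    static void runTask(ResourceManager rm, NodeManager nm, String taskSpec) {
        // 1. AM asks the RM for a container with specific resources/locality.
        Lease lease = rm.allocate(new ContainerRequest(1, 1024, "node-17"));
        // 2. AM presents the lease (token) to the NM that owns the container,
        //    which verifies it and launches the map or reduce task.
        nm.launch(lease, taskSpec);
    }
}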

Survey of Traditional Access Control Models

Some of the information stored on computers may be personal or sensitive in nature. If unauthorized individuals access this information, it can result in dire consequences. Examples of these consequences include, but are not limited to, disclosure of sensitive information to others, holding sensitive information for financial compensation, and lawsuits against companies because they failed to keep sensitive information safe. There are a number of access control mechanisms that aid in keeping sensitive information out of the reach of unauthorized individuals. A detailed description of several of these access control models is provided in this section.

Figure 4 Various access control models [59]

A survey of several types of access control models was done in [59] by the National Institute of Standards and Technology (NIST). In [59], the authors explain the evolution of the granularity of access control models from Access Control Lists (ACL) to the Risk Adaptable Access Control model (RAdAC), as shown in Figure 4.

One of the primary applications of these access control models is in RDBMSs, where a user is given fine-grained authorization to access data based on his/her role or privileges. RDBMS access control models fall into one of the following categories.

• Mandatory Access Control Model (MAC) [93 – 104]

• Discretionary Access Control Model (DAC) [105 – 112]

• Role-based Access Control Model (RBAC) [113-120]

Access control models shown in Figure 4 and the models listed above are discussed briefly in the following sections.

Access Control Lists (ACL)

Figure 5 Example of an Access Control List (ACL)


ACLs were initially implemented in Linux operating systems, and they were increasingly used when multi-user operating systems were introduced, to prevent users from accessing each other’s files [59]. Every resource (file, folder, etc.) in a system is referred to as an object. These objects have a “list of mappings” [59] between users and the operations those users are allowed to perform on the objects. An example of an access control list with read (R), write (W) and execute (X) permissions is shown in Figure 5.

In Figure 5, the ACL has two resources (objects), Program 1 and Document 1. Each of these objects has its own data structure (a linked list) containing mappings between users and the operations they are allowed to perform on that resource. ACLs are used not only in operating systems; they are also used in several other applications such as network security [121], cloud security [122], and so on. In spite of their simplicity, ACLs have some disadvantages as well.

cloud security [122], and so on. In spite of its simplicity ACL’s have some disadvantages as well.

ACLs, which are stored in memory, cannot scale well when the number of users and resources grows significantly. In addition, before a user performs an operation on any resource, the corresponding access control list for that resource must be checked every time; this process can be time consuming if a large number of users are given access to the same resource [59]. ACLs can also be difficult to maintain in an enterprise, as users require multiple levels of access to various resources. Modifying an ACL for each and every resource that needs multiple levels of access is time consuming, complicated and error-prone [59].
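A minimal sketch of the “list of mappings” structure described above is given below; the type and method names are illustrative only.

import java.util.EnumSet;
import java.util.List;
import java.util.Map;

public class AclSketch {
    enum Permission { READ, WRITE, EXECUTE }

    // One ACL entry: a user and the operations that user may perform.
    record AclEntry(String user, EnumSet<Permission> allowed) {}

    // Each object (e.g. "Program 1", "Document 1") owns its list of entries.
    static boolean isAllowed(Map<String, List<AclEntry>> acls,
                             String object, String user, Permission op) {
        for (AclEntry entry : acls.getOrDefault(object, List.of())) {
            if (entry.user().equals(user) && entry.allowed().contains(op)) {
                return true;                 // mapping found for this user/operation
            }
        }
        return false;                        // no mapping: access denied
    }
}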

Discretionary Access Control (DAC)

DAC allows the owner of a resource to provide and manage access rights for other people (subjects) to use his/her resource [105]. D.D. Downs et al. in [125] state that the “basis for DAC is that an individual user or a program operating on the user’s behalf is allowed to specify explicitly the types of access to other users (or programs executing on their behalf).” In [126], B.W. Lampson added the notions of owners and the access matrix to DAC. P.P. Griffiths and B.W. Wade in [127] extended DAC to relational databases.

DAC is based on the fact that the “owners” of resources have complete discretion to grant or revoke access to their resources for other users [105, 128]. Ownership is attained by creating new resources [105, 129]. Access matrices can be used with either DAC or MAC to enforce the policies for a given user trying to access a resource. An example of such an access matrix is shown in Table 1.

TABLE 1: An Example of an Access Matrix

         File 1       Folder 1     Program 1       File 2   Folder 2
User 1   Read         Read, write  Execute, read   ----     ----
User 2   Read, write  ----         ----            Read     Read
User 3   ----         Read         Execute, write  ----     Write

The rows of the access matrix represent subjects (users) and the columns represent resources (files, folders, programs). Each cell in the access matrix contains the access rights of a user for a specific resource. When a user requests access to a resource, the access rights of the user are verified to ensure that the requested operation can be performed on the resource by the requesting user. If this condition is satisfied, the user’s request is honored; otherwise it is denied.

Access matrices are sparse, and for an organization with many users and resources, their size can become tremendously large. Searching through such large matrices each and every time to determine whether to accept or deny an access request is time consuming. Due to these inefficiencies of the access matrix, DAC splits the matrix into two kinds of lists, namely access control lists (ACLs) and capability lists [130].

A capability list (CL) focuses on users, unlike access control lists, which focus on resources. Capability lists are very useful when authorization is checked on a per-subject basis [130]. Even after bifurcating the access matrix into ACLs and CLs, for each request these lists must be searched to check whether the user is allowed to perform a certain operation on a resource. In an organization with numerous subjects and resources, the lookup time for DAC can be very significant, and updating both of these lists accurately can also be challenging.

Mandatory Access Control (MAC)

The Mandatory Access Control (MAC) model is also known as the Bell-LaPadula model [98, 123]. R.S. Sandhu in [96] proposed a minimalistic Bell-LaPadula model called the BLP model. The basis of the BLP model is to support the Discretionary Access Control (DAC) model in enforcing access control policies. For a user to access a resource, the BLP model proposed in [96] requires a discretionary access matrix D and mandatory access control policies. The discretionary access matrix is explained in detail in the previous section. Mandatory access control policies are labels attached to every user and resource in the system [96]. In the BLP model, labels associated with users are called security clearances and labels associated with resources are called security classifications [96]. Examples of these labels are: top secret, secret, confidential, classified and unclassified. R.S. Sandhu makes an assumption known as “tranquility”, according to which only a security officer can change the labels after they have been generated [96].


Role Based Access Control (RBAC)

Role Based Access Control (RBAC) models became increasingly relevant and useful in the early 1970s, when systems capable of accommodating several users and applications were introduced [115]. The guiding principle of RBAC is that every user in a system is assigned at least one role, which is created as per organizational policies or based on the job functions/designation of the user [115]. Each role allows the user to perform actions relevant to that role. Users can be moved across several roles, and roles can be granted permission to perform certain operations or have permissions revoked [115]. A study done by D.F. Ferraiolo et al. [131] for the U.S. National Institute of Standards and Technology (NIST) shows that RBAC effectively addresses privacy issues in an organization. An important feature that makes RBAC so prominent is that it ties access control decisions to the role of an individual [115]. In addition, as roles evolve, access control decisions can be modified in RBAC [115].

R.S. Sandhu in [115] classifies RBAC models into four conceptual models, as shown in Figure 6. These four models represent several dimensions of the RBAC model [115]. In Figure 6, RBAC0 represents the base model, which satisfies all the requirements to support RBAC. Both RBAC1 and RBAC2 add independent features to the base model. RBAC1 contains role hierarchies, which allow a user role to inherit access permissions from other roles [115]. RBAC2 contains constraints, which impose criteria that all components of an RBAC system must satisfy [115].


Figure 6 Relationship between RBAC96 models [115]

RBAC3 is called the consolidated model, as it includes the features of both RBAC1 and RBAC2. By transitivity, one can argue that the consolidated model also includes the features of the base model RBAC0 [115]. The consolidated model is often referred to as the RBAC96 family of models [115]. The relationships between the four conceptual RBAC models are shown in Figure 6, and the consolidated RBAC model is shown in Figure 7.


Figure 7 RBAC96 model family [115]

Base Model RBAC 0

The RBAC0 model consists of the following components, shown in Figure 7 [115]. These components are described in detail in the rest of this section.

• U, R, P and S (users, roles, permissions and sessions).

• user: S → U, a function mapping each session s_i to a single user user(s_i).

• roles: S → 2^R, a function mapping each session s_i to a set of roles satisfying Equations 1 and 2. In Equation 1, the set of active roles can change with time as roles evolve.

roles(s_i) ⊆ { r | (user(s_i), r) ∈ UA }    (1)

permissions(s_i) = ∪_{r ∈ roles(s_i)} { p | (p, r) ∈ PA }    (2)

• Permission Assignment (PA) – a many-to-many relation between roles and permissions. PA must be a subset of the Cartesian product of the sets P and R; this necessary condition is shown in Equation 3.

PA ⊆ P × R    (3)

• User Assignment (UA) – a many-to-many relation between users and roles. UA must be a subset of the Cartesian product of the sets U and R; this necessary condition is shown in Equation 4.

UA ⊆ U × R    (4)

A user in this model may represent a human being or any other entity that is capable of accessing data or performing a task, such as a computer program or software agent [115]. A role refers to the title or responsibility of a job/assignment in an organization [115]. A permission refers to an authorization or approval provided to a user role to access certain resources [115]. User Assignment (UA) and Permission Assignment (PA) are many-to-many relations between users and roles and between permissions and roles, respectively. A user can have more than one role assigned to him/her, and a role can have several permissions assigned to it [115]. The prominence of RBAC relies on PA and UA, as they make the role of a user an intermediary between the user and the permissions he/she wants to exercise [115]. A session (S) maps a user to one or more of the roles assigned to him/her; this is shown in Figure 7 in the form of double-headed arrows from sessions (S) to roles (R) [115]. A session is mapped to only one user throughout its existence, which is represented in Figure 7 in the form of single-headed arrows between sessions (S) and users (U) [115]. A user can have more than one active session at any given time, and these sessions can have different combinations of the roles assigned to the user active in them [115]. In RBAC0, which roles remain active in a session is entirely up to the user, as the users control them directly [115].


RBAC models are neither a solution to all access control issues nor are they capable of enforcing the security principles they satisfy [115]. The person in charge of setting up roles can set them up in violation of the principles mentioned above, and data abstraction depends entirely on the implementation [115].

Attribute Based Access Control (ABAC)

The attribute based access control model permits or denies access to resources by considering factors such as the subject, the object, the requested resources, environmental attributes and rules or relationships [133]. The introduction and growing prominence of the Service Oriented Architecture (SOA) led to the creation of a new specification by OASIS called the eXtensible Access Control Markup Language (XACML) [134]. This specification contains the building blocks for the ABAC model. XACML introduces several reference points for accepting or denying a request from a user trying to access a resource, such as Policy Decision Points (PDP), Policy Enforcement Points (PEP), Policy Administration Points (PAP) and Policy Information Points (PIP) [134]. In addition, XACML defines protocols for communication between these entities; they communicate using request-response protocols [134]. Although XACML specifies the components required to implement ABAC, it did not provide a formal guide for implementing them [134].

ABAC Model

All the components of the ABAC model are shown in Figure 8. The Access Control Mechanism (ACM) in the ABAC model receives a request from a subject to access an object. The ACM then analyzes the attributes of the subject and the object, any environmental conditions imposed and the access control policies, and decides whether the request can be honored.


Figure 8 Attribute Based Access Control Model (ABAC) [133]

Attributes define several features of subjects (users), resources (objects), environmental factors and the operations requested by the subject (user) [133]. Attributes are defined and assigned by a trusted authority. A subject is a physical or logical entity that can request access to information [133]; a subject can be a user, a device or a computer program capable of accessing resources [133]. An object is a resource that contains some amount of information; examples of objects include files, databases, tables, records and logs. An operation is the execution of a particular action by a subject on an object [133]; operations include, but are not limited to, read, write, execute, modify, delete and copy. A policy represents relationships and contains the collection of permissible operations that a subject may execute on an object under given environmental conditions [133].

The ABAC model primarily analyzes the attributes of the subject, the attributes of the object, the environmental conditions and the policies, which define all the operations that are permitted for a subject-object combination [133]. After analyzing the aforementioned features, ABAC either allows or denies the operation requested by the user. The ABAC ACM acts as both the Policy Decision Point (PDP) and the Policy Enforcement Point (PEP), as it prevents sensitive information from being misused or abused by users [133].
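A bare-bones sketch of the decision step just described is given below; the attribute maps and the policy interface are assumptions made for illustration and are not XACML components.

import java.util.List;
import java.util.Map;

public class AbacSketch {
    // A policy is a predicate over subject, object and environment attributes.
    interface Policy {
        boolean permits(Map<String, String> subject,
                        Map<String, String> object,
                        Map<String, String> environment,
                        String operation);
    }

    // The ACM permits the operation only if at least one policy allows it.
    static boolean decide(List<Policy> policies,
                          Map<String, String> subject,
                          Map<String, String> object,
                          Map<String, String> environment,
                          String operation) {
        for (Policy p : policies) {
            if (p.permits(subject, object, environment, operation)) {
                return true;
            }
        }
        return false;   // deny by default
    }
}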

The primary disadvantage of ABAC is generating all the relevant and necessary attributes for each subject and object. In a large organization, which stores a considerable amount of data and has a large number of users (subjects), the effort required to generate these attributes is very significant [130]. In addition, the ACM in the ABAC model has to choose the attributes of the subject and object before making a decision, and the process of choosing among a large number of attributes each time an operation is requested by the user is time consuming.

Policy Based Access Control (PBAC)

The Policy Based Access Control model extends the ABAC model to address ABAC's failure to "harmonize" access control policies across several entities in an organization [130].

The first and foremost challenge is that an organization must maintain and monitor a list of attributes spanning the entire organization, comprising many individuals, departments and objects (resources) [130]. The second challenge is to translate organizational access control principles into enforceable access control decisions [130]. This can be addressed by using XACML, a machine-readable access control specification language. In addition, PBAC clearly defines the guiding principles of a user session [135]. The functioning of PBAC is very similar to that of ABAC, as both models make access decisions based on policies governing subjects, resources and operations [135, 119, 136].


Policy Based Access Control (PBAC) Model

In addition to all the variables in the ABAC model, the PBAC model also includes a session [135]. The access control model of PBAC is shown in Figure 9.

Figure 9 Access Control Model of PBAC [135]

All the variables in the PBAC except the session are explained in the preceding sections. Session

contains information about which subject performed what action on which resource [135]. In

other words, it comprises the subject, the resource and the action. The session in PBAC is the Cartesian product of these three variables, as represented in Equation 5.

Session = Subject × Resource × Action    (5)

The major limitation of the PBAC model is the identification and maintenance process of several

attributes spanning across several subjects, resources and departments [130]. This process is not

only time-consuming but also very complicated and tedious.
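As a small illustration of the session construct above, the sketch below enumerates a toy session space as the Cartesian product of subjects, resources and actions (Equation 5); the subject, resource and action names are hypothetical.

from itertools import product

# Hypothetical subjects, resources and actions (illustrative only)
subjects = ["alice", "bob"]
resources = ["customer_db", "audit_log"]
actions = ["read", "write"]

# The full session space: Subject x Resource x Action (Equation 5)
session_space = list(product(subjects, resources, actions))

# A concrete session records which subject performed what action on which resource
session = ("alice", "customer_db", "read")
assert session in session_space
print(len(session_space))   # 2 * 2 * 2 = 8 possible (subject, resource, action) tuples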

Risk Adaptable Access Control (RAdAC)

RAdAC is intended to work on large-scale, scalable computing systems that gather, store, process, manage and allow users to access information [139, 140, 141]. The motivation for the RAdAC model was to create a "real-time, adaptable, risk-assessed access control facility for enterprises" [130]. To facilitate real-time and adaptable access control enforcement, the RAdAC model assesses each request to access information by considering the following factors: the priorities of the mission, the risk and cost of compromising the information, and the threat status of the system [139]. The RAdAC model grants or denies users access to resources by computing the security risk and the operational need [138].

Figure 10 RAdAC Model in action [138]

Figure 10 describes the functioning of the RAdAC model in detail [138]. The security risk involved in allowing a user to access a resource depends entirely on the type of access granted to the resource and the nature of the transaction itself [139]. For example, a user accessing banking information from a known computer poses little security risk, whereas accessing the same information from an unknown computer poses considerable risk. Operational need is usually referred to as the need-to-know basis in the literature [139]. Examples of operational need include an individual's membership in a community or his or her interest in a specific area [139]. Situational factors include external or environmental variables that are considered when making decisions [139]. The following factors are considered when determining the security risk and operational need used to evaluate each access control request [138]:

• Sensitivity of the information to be accessed

• User information (role and trust)

• History of access decisions

• How critical the information is for the operation

• How well the information can be protected

• Uncertainty

The RAdAC model can adapt its threshold dynamically when required. An example of this is changing the threshold "when operational need can trump security risk" [138]. The threshold in the RAdAC model is computed using organizational policies and relevant environmental or external variables.
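A minimal sketch of this style of decision is given below. The factor names, weights, scores and threshold rule are illustrative assumptions rather than part of the RAdAC specification; the sketch only shows how a weighted security-risk score might be compared against operational need under an adjustable threshold.

# Hypothetical RAdAC-style decision: compare operational need against security risk
# relative to a threshold that can be adjusted by policy and situational factors.
def radac_decide(risk_factors, operational_need, base_threshold, situational_adjust=0.0):
    """risk_factors: dict of factor name -> (risk score in [0,1], weight). Returns 'Permit'/'Deny'."""
    total_weight = sum(w for _, w in risk_factors.values())
    security_risk = sum(score * w for score, w in risk_factors.values()) / total_weight
    threshold = base_threshold + situational_adjust      # e.g. relaxed during an emergency
    return "Permit" if operational_need - security_risk >= threshold else "Deny"

risk_factors = {
    "information_sensitivity": (0.8, 3.0),
    "user_trust_deficit":      (0.3, 2.0),
    "access_history":          (0.2, 1.0),
}
print(radac_decide(risk_factors, operational_need=0.9, base_threshold=0.2))   # Permit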

The disadvantages of the RAdAC model are that its implementation is complicated and time consuming [130]. Building enough trust among several organizations to share mechanisms for standardizing risk evaluation is nearly impossible. Evaluating security risk requires extensive information about users, resources, external variables and other environmental factors, yet there are no standard mechanisms to gather such information [130]. There is also no standard format for the information needed to estimate the risk of an access request [130]; a standard format would ensure portability of the RAdAC model across environments.

Content Based Access Control Models

In [144] B. Gopal and U. Manber propose an access control mechanism called the Hierarchy and Content (HAC) model for traditional file systems. HAC combines the hierarchical access control model for file systems with content-based access control. Users can access resources by specifying the name and path of the resource explicitly in hierarchical file systems [144]. In [144] the authors combine the HAC paradigm with the semantic file system paradigm [147] through an optional feature called the Content-Based Access (CBA) option, which can be enabled or disabled by the user [144].

In [146], E. Bertino et al argue that mechanisms to protect data based on its content are the need of the hour and will be more effective in protecting the data. In [145] E. Bertino et al propose a content-based extension of the System R authorization model for relational database systems (RDBMS) [148]. L. Giuri and P. Iglio in [142] propose a modification to the RBAC model by adding content-based access control policies. In [142] content-based access control policies are implemented as "parameterized privileges", with "role templates" used to facilitate the parameterized privileges.

In [149], N.R. Adam et al propose a content-based authorization model for digital libraries. Digital libraries are global systems responsible for gathering and disseminating information among a variety of users such as content producers, librarians and end users [150]. The contents of these digital libraries are mostly images and videos [149]. A major challenge in digital libraries is keeping information out of reach of unauthorized users. Traditional authorization mechanisms used for an RDBMS do not suffice for digital libraries due to the large volume and variety of data [149]. The authorization model proposed in [149] makes access control decisions based on the characteristics of the user and the characteristics of a digital content item or a part of it.

In [143] S.K. Tzelepi et al propose an access control model based on content for database systems

that store multimedia medical information. The authors in [143] extend the RBAC model to

accommodate access control decisions made based on the content of medical images. The

drawback of this approach is that the administrators manually enter the annotations for the

medical images.

In [152] E. Bertino et al propose a hierarchical content-based access control model based on semantic trees. Representing videos in the form of domain-dependent semantic trees enables the model to provide fine-grained access control. In a semantic tree, a video is broken down into semantic clusters, scenes and shots as shown in Figure 11. An example of a news video represented in the model proposed in [152] is shown in Figure 12.

Figure 11 Hierarchical Video database model [152]

Figure 12 Example Hierarchical Semantic tree representing a News video [152]


The disadvantage of this approach is that the semantic tree cannot accommodate a large collection of videos, and the number of levels in these trees cannot be increased. In addition, this model depends on the concept hierarchy provided by a domain expert [152].

In [151] N.A.T. Tran and T.K. Dang propose a content-based access control model for video databases, which extends the semantic cluster tree proposed in [152]. The extended video database proposed in [151] is shown in Figure 13.

Figure 13 Extended Video database [151]

In [151] each stored video belongs to a video group and can be split into scenes, sequences, shots and segments. Each video can contain any number of annotations, which can range from captions and images to resources, subjects and events.

In [153], E. Bertino et al propose an access control mechanism for video database systems. This mechanism exploits both the content and the structure of the video. In [153] the "basic unit of authorization" is called the video element, and it can comprise a sequence of video frames or any object in a frame. Users of this authorization model send requests to either view or edit the resources (videos) [153]. Basic elements in a video are recognized explicitly by identifiers or implicitly by the semantic content of the video, and users are identified by their credentials [153]. Like the model proposed in [143], the major drawback of the models in [151] and [153] is their dependence on annotations for providing content-based access control. In addition, in all the preceding access control models the attributes of users and the access control policies are defined in advance.

In [156] S. Monte proposes a Content Based Access Control (CBAC) model for web-based social networks (WBSNs). Due to the proliferation of the Internet, the number of users in WBSNs has increased significantly, and these users are themselves responsible for selecting with whom they want to share their information [156]. By leaving this decision to users, their personal information sometimes ends up in the hands of users who are capable of misusing the personal data of others. Hence the CBAC model proposed in [156] allows users to access resources based on their contents. The contents of resources in CBAC are analyzed using computer vision and natural language processing techniques [156]. In [154, 155] M. Hart et al propose PLOG (Privacy/Policy-aware bLOGging engine), a language that enables users to use a content-based access control system. PLOG allows users to specify which part of their data they want to share and with whom. In [154, 155, 156] content from WBSNs is inferred based on tags and not the actual content within these tags, which is a major drawback.

In [158] I. Molloy et al propose a model that makes access control decisions under uncertainty. In [158] the authors use supervised learning algorithms such as SVMs (Support Vector Machines) to train the model with known decisions so that it can make access decisions when an unknown scenario is encountered. The access control decisions made by the model proposed in [158] carry a certain degree of uncertainty and risk, and access requests with too much uncertainty or high risk should not be allowed [158]. Q. Ni et al in [159] propose an automated provisioning model, which reassigns and modifies users' role assignments. Q. Ni et al use a variety of supervised machine learning algorithms trained on actual provisioning data. These algorithms are then evaluated in [159] for their performance when proposing modifications to provisions/role assignments.

Access control for the semantic web based on concepts was proposed by L. Qin and V. Atluri in [160]. The semantic web provides a means for computers to process information in the World Wide Web (WWW) [164]. Every web page contains annotations, which define the concepts in the web page, and these annotations also contain information about how a web page is related to others [160]. Several concepts are expressed together in an ontology. The model proposed in [160] makes use of these annotations in web pages, deciphers the concepts and their semantic relationships with other concepts, and makes access control decisions. The access control decisions also depend on the security policies put in place by the organization at the conceptual level, and these policies are written in OWL (the W3C Web Ontology Language) [160].

A. Toninelli et al in [161] argue that context also plays an important role in making access control decisions for the semantic web. In [161] the authors propose a context-aware access control framework for a highly dynamic web in which the availability of resources and users changes often. Context awareness as proposed in [161], however, is limited to the usage of the semantic web from a user's perspective. In [162], C. Pan et al propose a Semantic Access Control Enabler (SACE), which acts as middleware between legacy databases and users. The SACE makes access control decisions based on ontologies and their mapping rules; each mapping rule specifies how a concept in an ontology can be accessed. SACE extends the RBAC model and uses all the components of RBAC in addition to the ontologies and mapping rules. In [163] B. Fabian et al propose an access control model for semantic data federations. This model aids in sharing information between business partners, which may be separate organizations.


The access control model proposed in this dissertation is significantly different from the others.

The proposed CSBAC model enforces access control decisions based on the information

sensitivity. Information sensitivity is in turn derived automatically from the metadata, data usage

and information entropy.

Bitcoin

S. Nakamoto in [165] proposed a peer-to-peer electronic cash system in which transfers are not required to pass through financial institutions such as banks [165]. Hashing digital signatures and timestamps into a chain is a major advantage of Bitcoin [165]. These chains are called block chains and they are immutable.

Introduction

In e-commerce, banks generally serve as trusted third parties to process all electronic payments [165]. These trusted third parties function on a trust-based model, which has inherent weaknesses. Some transactions in e-commerce must be reversible to resolve disputes, and the cost of mediating these disputes increases the cost of the transactions themselves [165]. There is no existing means to perform an online transaction without a trusted third party [165]. S. Nakamoto in [165] proposes a cryptographic model instead of trust. The transactions generated by the cryptographic model are nearly impossible to reverse, which helps protect consumers as well as merchants [165]. The cryptographic model replaces the trust-based model, and its major advantage is that it solves the double-spending problem using a distributed peer-to-peer timestamp server [165]. The timestamp server generates proof that transactions occurred in chronological order. This distributed peer-to-peer system remains secure as long as there are more honest nodes than cooperating malicious nodes.


Transactions

"An electronic coin is defined as a chain of digital signatures" [165]. Each owner in the chain hands the coin to the next by signing the hash of the previous transaction and the public key of the next owner; this digital signature is appended to the end of the coin. The payee can go through the chain and verify these signatures to check whether they are real or falsified. However, the payee cannot verify that none of the previous owners double-spent the coin [165]. This could be avoided by using a central authority, but there is no difference between a central authority and a bank, so the drawbacks of the trust-based model hold for the central authority as well. To avoid a central authority, the model proposed in [165] broadcasts all transactions to all nodes. Since multiple nodes store the transactions, the payee needs the majority of nodes to agree that there has been no double spending.

Timestamp Server

The timestamp server proposed in [165] takes the hash value of a block that is to be timestamped and publicly announces it. The resulting timestamp shows that the data must have existed at that time for it to be included in the hash calculation [165]. Every timestamp includes the previous timestamp in its hash, thereby forming a chain as shown in Figure 14.


Figure 14 Implementation of a Timestamp Server in Bitcoin [165]

Proof-of-Work

The Proof-of-Work system is similar to Adam Back's Hashcash [178] and is used to implement the distributed timestamp server [165]. Proof-of-Work involves finding a value that, when hashed, produces a hash beginning with a required number of zero bits [165]. In Proof-of-Work, the nonce in a block is incremented until such a hash is found. Once the Proof-of-Work is computed, the block becomes effectively immutable: changing it would require redoing its Proof-of-Work and the Proof-of-Work of all the blocks following it [165]. The longest chain has the most Proof-of-Work invested in it, and decisions are made on a one-CPU-one-vote basis rather than one-IP-one-vote.
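A minimal sketch of the Proof-of-Work idea described above is given below. For simplicity the difficulty is expressed in leading zero hexadecimal digits rather than bits, and the block content is an arbitrary string, so the sketch is purely illustrative.

import hashlib

def proof_of_work(prev_hash, transactions, difficulty=4):
    """Increment a nonce until the block hash starts with `difficulty` zero hex digits."""
    nonce = 0
    while True:
        block = f"{prev_hash}|{transactions}|{nonce}".encode()
        digest = hashlib.sha256(block).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce, digest
        nonce += 1

nonce, digest = proof_of_work("0" * 64, "alice->bob:1", difficulty=4)
print(nonce, digest)   # the nonce that satisfies the difficulty, and the resulting block hash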

Network

In the Bitcoin network, all transactions are broadcast to all nodes [165]. Nodes gather new transactions into a block and try to compute the Proof-of-Work for that block [165]. Once the Proof-of-Work is computed, the block is broadcast to all nodes. Nodes accept the block only if all the transactions contained within it are valid [165]. If a node accepts the block, the block is added to the chain and the node works on the next block, using the hash of the accepted block as the previous hash [165].


Verifying a Payment Using Block Chain

Figure 15 Verification of Payment made using Block Chain [165]

To verify a payment, the user has to obtain the longest Proof-of-Work chain. Once the longest Proof-of-Work chain is obtained, the user can access the Merkle [179] branch, which connects the transaction to the block [165]. The user cannot check an individual transaction on his or her own, but can check whether the block containing it is valid [165]. When network nodes detect invalid blocks, the affected parts of the block chain are replaced and alerts allow the inconsistency of the related transactions to be asserted [165], as shown in Figure 15.

Privacy

Figure 16 Traditional Privacy model vs. Bitcoin Privacy Model [165]

The traditional privacy model maintains privacy by preventing the public from accessing the identities of users; identities are known only to the trusted third parties. In the Bitcoin privacy model, the transactions are made public, but the user identities are not linked to the transactions themselves, as shown in Figure 16.


CHAPTER III

STRUCTURING AND LINKING DATA

Problem Statement

The first step in protecting sensitive information is to identify all data items in a dataset and to determine the similarity between data items spanning multiple datasets. In the current Hadoop implementation, the scope of the metadata generated by Hadoop is limited to the block level and not the dataset level [12]. Identifying attributes (data items) in a dataset is straightforward for structured datasets, which adhere to a specific format, and when information is available from the data owner. Even if the data owners provide this information, it can be lost during transmission or become corrupted. Due to data democratization there are many datasets without any information provided by the data owner. Adding semi-structured and unstructured datasets to this mix complicates things further. When no information is available about a dataset, the dataset is analyzed manually to determine what data items (attributes) it contains. Whether or not information from the data owner is available, multiple datasets are often linked by intuition or manual analysis.

Introduction

The novelty of the proposed framework is that it makes use of only the dataset itself to generate

relevant metadata. To identify related data items spanning across multiple datasets the proposed

framework harnesses both context and usage patterns. Context patterns by themselves will

identify the data items that may be similar in both data sets by using their names, content and

other constraints such as data type. Contextual similarity can identify similar data items in

multiple datasets; but fails to capture semantic relationships [46]. To augment context similarity

and to identify related data items that were not identified by context similarity, the proposed

framework makes use of usage patterns. This is based on the assumption that semantically related

data items will be used together. Usage patterns correspond to how the data has been used over a period of time. These patterns can be used to reveal any privacy breach, user behavior and data

misuse or abuse. The components in the proposed framework, which are responsible for

generating metadata, tracking usage and linking data items in datasets, are Enhanced Metadata

Generator (EMG), Data Usage Tracker (DUT) and Data Similarity Analyzer (DSA) respectively.

Proposed EMG generates two types of metadata namely structural and descriptive metadata.

Structural metadata comprises information about the data items in a dataset, whereas descriptive metadata comprises a description or summary of a dataset. Structural metadata is generated for datasets, ranging from structured to unstructured, that can be coerced into a structured form. Some unstructured datasets, such as free-text datasets, cannot be coerced into a structure; for these datasets, the EMG generates descriptive metadata.

Literature Review and Related Work

Generating Metadata and Structuring Data in Hadoop

In the current Hadoop implementation [12], metadata is limited to data blocks rather than the dataset itself. Block-level metadata is not helpful in identifying data items within a given dataset.

The proposed framework is the first of its kind to generate metadata using the dataset itself.

An application developed by J. Shin et al called DeepDive [60] is used to convert "dark data" into structured data; large volumes of unstructured data constitute "dark data" [60]. DeepDive adds database and machine learning techniques to Knowledge Base Construction (KBC) techniques to facilitate the conversion of unstructured data to structured data. KBC as proposed in [60] is iterative, and the authors propose incremental techniques (based on sampling and variational techniques) to produce inference results for KBC systems. Although DeepDive can generate structured data from unstructured data, it is not implemented to work on top of Hadoop, i.e. the KBC process is not parallelized [60]. According to the authors of DeepDive, the scalability of feature extraction for the KBC process is a challenge given a large amount of data [60]. The proposed framework is highly scalable because it works on top of Hadoop, taking advantage of the parallelism Hadoop provides via MapReduce. The proposed framework uses the MapReduce programming model to generate relevant metadata for the datasets.

Context Similarity Measures

The process of defining relationships or logical mappings across multiple datasets is called schema matching [62]. There are many techniques that can perform schema matching, such as linguistic matching, structure-based (graph) matching, and constraint-based matching. In the proposed framework, all three techniques are used to identify similarity between data items across several datasets. Linguistic matching detects semantic similarities between concepts or elements from different data sources. This technique evaluates similarity between element names and descriptions by combining results from processes such as stemming, tokenization, string and substring matching, and information retrieval. These processes are commonly used as the matching conditions to evaluate the correspondences between entities [62]. Graph matching includes two algorithms: fixed-point computation on a similarity propagation graph [63] and probabilistic constraint satisfaction [64]. The first needs two or more input graphs (schemas or structures) to produce an output mapping that describes the relationships among the elements of the graphs [63]. The second takes a pair of ontologies and applies several practical similarity measures to implement the mappings [64]. Constraint-based matching is a technique that considers the properties of the elements, such as data types, value ranges, uniqueness, nullability and foreign keys [62]. W. Shen et al [65] and V. Le Clément et al [66] make use of constraint-based matching techniques for entities and graphs. The constraint-based entity matching proposed in [65] is a generative model that improves matching accuracy. The constraint-based graph matching proposed in [66] is a constraint-based modeling language that can support both Constraint Programming and Constraint-Based Local Search.

Usage Pattern Similarity Measures

Usage patterns of data have been used to find semantic relations between learning resources in [46] by K. Niemann et al, as well as in recommender systems. A recommender system provides a user with a set of items in which he or she might be interested, by comparing the user's behavior with that of other users. Recommender systems can be user based, item based [68], collaborative filtering [69], content-based filtering [70] or hybrid [67]. The proposed framework uses an item-based [68] collaborative filtering approach, with a variation in how the similarity between different data items is measured. The reason for factoring in usage pattern similarity is to identify linked data items in addition to those found by the aforementioned context similarity measures. If two words appear in very different contexts, they are semantically unrelated; a word that appears in many different contexts can be called polysemous. If two words occur in very similar contexts, they may or may not be semantically related to one another [46]. Relatedness can be used as a metric for semantic similarity if two words are co-hyponyms (they have a common higher-level concept, which is true for words with highly similar contexts) [46]. Thus, by comparing the contexts in which words are used, their semantic similarity can be determined. Extending this to data items, comparing their usage contexts will determine their semantic similarity.


The Metadata Generator

Introduction

A precursor to the EMG is proposed in [18] to generate metadata and to link relevant data items across datasets. Two modules created in [18] can be interfaced with the existing Hadoop implementation. The data context analysis module is responsible for generating metadata and identifying similarity between data items based on linguistic, structure-based and constraint-based matching. The data usage pattern analysis module tracks data usage patterns and identifies semantically related data items based on usage; it makes use of Markov's graph clustering algorithm [71]. Results from these modules are combined to identify related data items. The architecture of these modules is shown in Figure 17.

Figure 17 System Architecture proposed in [18]

Data Context Analysis Module

The data context analysis module has two major components, namely the metadata generator and the context matcher, as shown in Figure 17. These components are described in detail below.


Metadata Generator

The metadata generator is a Java program that analyzes blocks of data from a dataset stored in a Hadoop cluster. These blocks are chosen at random to be processed by the metadata generator. The data items and their corresponding data types are stored in the metadata log. When a new dataset is added to the Hadoop file system, the metadata generator generates a list of the data items that the dataset contains. The metadata is removed from HDFS when the dataset is removed from HDFS.

Context Matcher

The context matcher makes use of techniques such as linguistic matching, structure-based matching and constraint-based matching to estimate similarity between data items. A detailed description of these techniques is provided in this section. The similarity between datasets u and u′ is measured by Equation 6.

S(u, u′) = α · S_P(u, u′) + β · S_Q(u, u′) + γ · S_R(u, u′)    (6)

In Equation 6 the factors α, β and γ are parameters between 0 and 1, with α + β + γ = 1. S_P(u, u′), S_Q(u, u′) and S_R(u, u′) represent the linguistic, structural and constraint matching scores respectively.

Linguistic Matching

S_P(u, u′) = ρ(u, u′) if ρ(u, u′) ≥ φ; otherwise S_P(u, u′) = 0    (7)

Linguistic matching scores between two datasets can be estimated using Equation 7. In Equation 7, ρ(u, u′) represents the similarity degree of datasets u and u′, which can be searched from an auxiliary information file (a dictionary containing all relevant pairs of similar names). The factor φ is a threshold value of the similarity degree.


Structure Matching

S_Q(u, u′) = 1 if u = u′; otherwise S_Q(u, u′) = |f(u, d) ∩ f(u′, d)| / |f(u, d) ∪ f(u′, d)|    (8)

Similarity scores between two datasets based on their structure can be estimated using Equation 8. In Equation 8, f(u, d) represents the set of all elements related to u within d hops, and the cardinality of f(u, d), denoted |f(u, d)|, counts the total elements in f(u, d).

Constraint Matching

S_R(u, u′) = ω · g(u, u′) if g(u, u′) ≥ θ; otherwise S_R(u, u′) = 0    (9)

Similarity scores between two datasets based on user-defined constraints are estimated using Equation 9. In Equation 9, g(u, u′) is the similarity degree of u and u′ in the auxiliary information file, which contains all relevant pairs of similar names. The factor ω is a parameter between 0 and 1, and θ is a threshold value.
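A minimal sketch of how Equations 6 to 9 could be combined is given below. The auxiliary similarity dictionary, the neighbour sets and all parameter values are hypothetical; the sketch only illustrates the weighted combination of the three matching scores.

# Hypothetical context-similarity sketch combining Equations 6-9.
AUX_SIMILARITY = {("cust_phone", "phone_number"): 0.9}   # auxiliary file: known similar name pairs

def s_p(u, v, phi=0.7):
    rho = AUX_SIMILARITY.get((u, v)) or AUX_SIMILARITY.get((v, u)) or 0.0
    return rho if rho >= phi else 0.0                     # Equation 7

def s_q(neigh_u, neigh_v):
    if neigh_u == neigh_v:
        return 1.0                                        # Equation 8
    return len(neigh_u & neigh_v) / len(neigh_u | neigh_v)

def s_r(u, v, omega=0.5, theta=0.6):
    g = AUX_SIMILARITY.get((u, v), 0.0)
    return omega * g if g >= theta else 0.0               # Equation 9

def similarity(u, v, neigh_u, neigh_v, alpha=0.4, beta=0.3, gamma=0.3):
    return alpha * s_p(u, v) + beta * s_q(neigh_u, neigh_v) + gamma * s_r(u, v)   # Equation 6

print(similarity("cust_phone", "phone_number",
                 neigh_u={"customer", "billing"}, neigh_v={"customer", "call_center"}))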

Data Usage Pattern Analysis Module

The data usage pattern analysis module has two major components, namely the usage tracker and the usage pattern analyzer, as shown in Figure 17. These components are described in detail below.

Usage Tracker

The usage tracker tracks users' usage patterns. These usage patterns consist of the user information, timestamp, datasets and data items accessed by users. Jobs submitted by users and the results of these jobs pass through the data usage tracker, which records all the necessary information in a


log file. Usage tracker identifies data items with the help of the metadata generated by the

metadata generator. All the usage patterns generated are stored in HDFS like metadata.

Usage Pattern Analyzer

The usage pattern analyzer identifies the semantic relationships between data items based on their usage as tracked by the usage tracker. Based on the usage information, the usage pattern analyzer builds a weighted graph with data items as nodes, constructing edges between data items that were used together and using their usage frequencies as edge weights. To identify the semantic relationships, the data usage pattern analyzer implements Markov's algorithm [71], a graph-clustering algorithm. Since there is no assurance that Markov's algorithm will terminate [46], the data usage pattern analyzer also implements the Iterative Conductance Cutting (ICC) algorithm [46, 72] to prevent Markov's algorithm from running forever. ICC begins with a single cluster, splits it into two, and proceeds further until a minimum threshold is met.

Markov's algorithm is based on the fact that a random walk will not leave a dense cluster until most of its vertices have been visited [71]. A random walk is analogous to a finite Markov chain, as the future states do not depend on past states given the present state [71]. The Markov clustering algorithm implemented by the data usage pattern analyzer is shown in Figure 18.

Figure 18 Markov’s Clustering Algorithm [71]


In Figure 18, MCL represents Markov's clustering algorithm [71]; G represents an undirected, weighted graph; Δ represents the Kronecker delta; e represents the expansion parameter, with e_i ∈ ℕ, e_i > 1, i = 1…n; r represents the inflation parameter, with r_i ∈ ℝ, r_i > 0, i = 1…n; and T_i corresponds to the intermediate matrices. The inputs to the algorithm are the adjacency matrix of the undirected weighted graph constructed from the usage patterns, the expansion parameter, the inflation parameter and the Kronecker delta. Self-loops are added to the adjacency matrix before it is normalized. There are two important operations in Markov's algorithm, namely expansion and inflation. The inflation operation raises each entry of the matrix M to the power of the inflation parameter r, after which each column is normalized to sum to 1 [71]. The expansion operation is used to expand dense regions in the graph. Both operations are applied alternately in iterations, beginning with the adjacency matrix. Together they strengthen intra-cluster flow and minimize inter-cluster flow [71]. Initially the flow graph is smooth, and after some iterations the flow becomes heightened between tightly linked nodes. Contextual similarity is not a necessary condition for data items to be in the same cluster [46].
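The sketch below illustrates the expansion and inflation steps on a small co-usage adjacency matrix built from hypothetical usage logs; the parameter values, the fixed iteration count and the log contents are assumptions made for illustration, not the framework's actual implementation.

import numpy as np
from itertools import combinations

# Hypothetical usage log: each entry lists the data items accessed together in one job.
usage_log = [("first_name", "last_name", "phone"), ("first_name", "phone"), ("url", "timestamp")]
items = sorted({i for job in usage_log for i in job})
idx = {name: k for k, name in enumerate(items)}

# Weighted adjacency matrix: co-usage counts, plus self-loops as in MCL.
A = np.zeros((len(items), len(items)))
for job in usage_log:
    for a, b in combinations(job, 2):
        A[idx[a], idx[b]] += 1
        A[idx[b], idx[a]] += 1
A += np.eye(len(items))

def mcl(M, e=2, r=2.0, iterations=20):
    M = M / M.sum(axis=0)                       # column-normalize
    for _ in range(iterations):
        M = np.linalg.matrix_power(M, e)        # expansion
        M = M ** r                              # inflation
        M = M / M.sum(axis=0)
    return M

flow = mcl(A)
# Columns that share the same non-zero (attractor) row end up in the same cluster.
print(np.round(flow, 2))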

Framework to Generate Metadata, Link and Track Data Items

Introduction

In [18], a metadata generator that identifies data items and their data types for all types of non-multimedia datasets in Hadoop is proposed. However, the accuracy of the metadata generator proposed in [18] was limited for unstructured data. In the Sensitive Data Detection (SDD) framework, the shortcomings of the metadata generator proposed in [18] have been addressed using a trained neural network. This Enhanced Metadata Generator (EMG) aids in identifying data items, their data types and their uniqueness (whether the data item has unique values) for all types of non-multimedia datasets in the framework. Some unstructured datasets, which contain free text, cannot be structured. To represent these datasets, the EMG generates descriptive metadata by making use of the Automatic Text Summarization (ATS) techniques proposed in [73, 74]. These ATS techniques represent the free-text datasets using lexical chains. Lexical chains are chains of semantically related words.

In addition to improving the metadata generator, the data similarity analyzer's equations were changed so that similarity between data items across multiple datasets is identified, rather than comparing only two datasets at a time. The SDD framework also includes a Provenance Tracker (PT) to track the data items used by any process or user at any given time. The SDD framework is shown in Figure 19. The components responsible for generating metadata, linking and tracking data items are described in detail in the following sections.

Figure 19 Sensitive Data Detection (SDD) Framework [19]


Roles and Responsibilities

Roles and responsibilities of various entities in the SDD framework are described below in detail.

Administrator

The proposed framework provides the administrator with a list of potentially sensitive data items present in a dataset, and he or she takes appropriate actions to safeguard these data items from misuse by authorized personnel and from access by unauthorized users.

Domain Expert

The domain expert has an important role in determining the sensitivity of data items that have not been encountered before or whose information value is high. Once the domain expert determines the sensitivity of a data item, this decision is used to train a neural network so that the network can determine the sensitivity of similar data items. This alleviates the workload of the domain expert.

Enhanced Metadata Generator (EMG)

The EMG is capable of generating both structural and descriptive metadata. Structural metadata contains information about the data items in the dataset, whereas descriptive metadata provides a brief textual summary of the content of the dataset. Whenever a new dataset is stored in Hadoop, the EMG generates the relevant metadata, and on deletion of the dataset the corresponding metadata is also deleted. To make the generated metadata more reliable and available, it is stored in HDFS [12]. The EMG implements the algorithm shown in Figure 20 to generate structural metadata by identifying the individual data items, their data types and their uniqueness for every type of non-multimedia dataset.


ds: dataset; patt: frequently occurring pattern; dtype: dataset type; st: structured; unst: unstructured

Generate_Metadata() {
    patt  = Get_Freq_Occr_Patt(ds);
    dtype = Type_of_Dataset(ds, patt);      // Algorithm 2
    If (dtype == st || dtype == unst)
        get_item_name(ds, patt);            // Algorithm 3
        get_uniq_itype(ds, patt);
    Else
        get_item_name(ds);                  // Algorithm 4
        get_uniq_itype(ds, patt);
    End if
}

Figure 20 Pseudo code of the Structural Metadata Generation Algorithm

The metadata generation algorithm shown in Figure 20 identifies frequently occurring patterns in a non-multimedia dataset using natural language processing. Patterns that occur throughout the dataset are rated based on their frequency; the pattern with the highest frequency is considered the frequently occurring pattern. This pattern determines the type of the dataset, which is done using the algorithm

“Type_of_Dataset” (Algorithm 2), whose pseudo code is shown in Figure 21.

ds: dataset; patt: frequently occurring pattern

Type_of_Dataset(ds, patt) {
    c = count the occurrences of the frequently occurring pattern in ds
    If (c == length(ds))
        structured;
    Else
        parse dataset using a JSON/XML parser
        If no exception
            semi-structured
        Else
            unstructured
        End if
    End if
}

Figure 21 Pseudo code of Algorithm 2

If the frequency is equal to the length of the dataset, then the dataset is structured. If the dataset can be parsed by an XML or JSON parser, then it is semi-structured; otherwise the dataset is unstructured. After determining the type of the dataset, the structural metadata generation algorithm shown in Figure 20 calls either algorithm 3 (when the dataset is structured or unstructured) or algorithm 4 (when the dataset is semi-structured).
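A minimal Python sketch of the type-detection logic in Algorithm 2 (Figure 21) is given below; the candidate delimiter patterns and the use of the line count as the dataset "length" are simplifying assumptions made only for illustration.

import json
import xml.etree.ElementTree as ET
from collections import Counter

DELIMITERS = [",", "\t", "|", ";"]          # hypothetical candidate patterns

def type_of_dataset(lines):
    """Classify a dataset as 'structured', 'semi-structured' or 'unstructured'."""
    counts = Counter(d for line in lines for d in DELIMITERS if d in line)
    if counts:
        delim, freq = counts.most_common(1)[0]        # most frequent pattern
        if freq == len(lines):                        # pattern occurs on every record
            return "structured", delim
    text = "\n".join(lines)
    for parse in (json.loads, ET.fromstring):         # try JSON, then XML
        try:
            parse(text)
            return "semi-structured", None
        except Exception:
            pass
    return "unstructured", None

print(type_of_dataset(["John,Smith,Tulsa", "Jane,Doe,Stillwater"]))   # ('structured', ',')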

These two algorithms are used to identify the data items present within the datasets. Pseudo code of algorithm 3, which identifies data items in structured and unstructured datasets, is shown in

Figure 22.

For a structured/unstructured dataset

If (header is available)

Use header for data item names

Else

For each record in dataset

Split each row by using patt

Pass values to the trained neural network.

Get data items names from a trained neural network

End for

End if

Figure 22 Pseudo code of Algorithm 3

If a header is available in a dataset, then algorithm 3 makes use of the header. If the header is not available, then algorithm 3 makes use of a neural network trained with multiple training datasets [75, 76, 77]. These training datasets allow the neural network to identify names, cities, states and so on. If a dataset is a semi-structured dataset (JSON or XML), then the

EMG uses algorithm 4 to identify data items. Pseudo code of this algorithm is shown in Figure

23.


For semi-structured dataset

Identify names or key/value pairs

data_item_names = tag names

Figure 23 Pseudo code of Algorithm 4

To generate the descriptive metadata, the proposed EMG uses the Automatic Text Summarization techniques proposed in [73, 74]. The proposed EMG exploits the semantic relatedness between words to describe a dataset. The semantic relatedness is used to construct lexical chains by identifying and grouping related words [73]. Semantic relationships between English words are obtained from the WordNet lexical database [78]. Words can occur in multiple senses, and hence a word sense disambiguation technique is required; the EMG uses the word sense disambiguation technique proposed in [74]. A two-pass algorithm is proposed in [73] to compute lexical chains and to compute feature vectors from these lexical chains. The proposed EMG uses only the first pass of this two-pass algorithm. Once the lexical chains are computed, the chain with the greatest strength is selected to summarize the dataset. The strength of a lexical chain is given by the number of words it contains.
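As a rough illustration of grouping words into lexical chains via WordNet relatedness, the simplified sketch below links words that share a synset or a direct hypernym; it ignores word sense disambiguation and the chain-scoring details of [73, 74], and it assumes the NLTK WordNet corpus is installed.

from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

def related(word_a, word_b):
    """Two words are treated as related if they share a synset or a direct hypernym."""
    syn_a, syn_b = set(wn.synsets(word_a)), set(wn.synsets(word_b))
    if syn_a & syn_b:
        return True
    hyper_a = {h for s in syn_a for h in s.hypernyms()}
    hyper_b = {h for s in syn_b for h in s.hypernyms()}
    return bool(hyper_a & hyper_b) or bool(syn_a & hyper_b) or bool(syn_b & hyper_a)

def lexical_chains(words):
    chains = []
    for word in words:
        for chain in chains:
            if any(related(word, member) for member in chain):
                chain.append(word)
                break
        else:
            chains.append([word])
    # The strongest chain (most words) is used to summarize the dataset.
    return max(chains, key=len)

print(lexical_chains(["car", "automobile", "vehicle", "banana", "truck"]))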

Data Usage Tracker (DUT)

Data usage tracker remains unchanged from its precursor usage tracker in data usage pattern

analysis module [18]. Its functionality remains the same.

Data Similarity Analyzer (DSA)

DSA extends the data similarity analysis module proposed in [18]. DSA estimates similarity

measures between datasets, which are obtained by combining the context similarity and usage

similarity measures. Both these measures are necessary because the context similarity can capture

similar data items but it fails to identify semantic relatedness between them. The proposed DSA

uses metadata generated by EMG and usage patterns generated by DUT and produces a data


similarity index, which is in between 0 and 1. Disjoint data items have their similarity scores

close to zero whereas similar data items will have a higher score. The architecture of DSA is

shown in Figure 24. DSA has two major components namely data usage pattern analyzer and data

context similarity analyzer.

Figure 24 Architecture of Data Similarity Analyzer (DSA)

Data Context Similarity Analyzer

The data context similarity analyzer uses the structural metadata generated by the EMG to find relevant data items based on their names, data types and any other application-specific constraints. Similar data items are identified by combining the scores of three techniques, namely linguistic matching [62], structure matching [62] and constraint matching [66]. All three techniques are required because they provide estimates of similarity based on names, structure and other constraints. The similarity between datasets U_1, U_2, U_3, …, U_N is measured by Equation 10.

S(U_1, U_2, …, U_N) = α · S_P(U_1, U_2, …, U_N) + β · S_Q(U_1, U_2, …, U_N) + γ · S_R(U_1, U_2, …, U_N)    (10)

where the factors α, β and γ are weights whose values are between 0 and 1 and α + β + γ = 1. S_P(U_1, U_2, …, U_N), S_Q(U_1, U_2, …, U_N) and S_R(U_1, U_2, …, U_N) represent the similarity measures obtained from the linguistic, structural and constraint matching techniques respectively.


Linguistic matching attempts to match data items with similar linguistic features. Equation 11 estimates linguistic matching quantitatively, where ρ(U_1, U_2, …, U_N) represents the degree of similarity between datasets U_1, U_2, …, U_N, which can be obtained from an auxiliary information file [62]. The auxiliary information file contains all relevant pairs having the same name. The value of S_P(U_1, U_2, …, U_N) is between 0 and 1. When the datasets have similar linguistic features, the similarity measure will be as high as 0.8 to 1; otherwise the similarity measure will be less than 0.8 [62]. The factor φ is a threshold value of the similarity degree and its optimal value is 0.7 [62].

S_P(U_1, U_2, …, U_N) = ρ(U_1, U_2, …, U_N) if ρ(U_1, U_2, …, U_N) ≥ φ; otherwise S_P(U_1, U_2, …, U_N) = 0    (11)

Structure matching attempts to match data items with similar features such as names, data types,

etc. These are obtained from the metadata generated by EMG.

S_Q(U_1, U_2, …, U_N) = 1 if U_1 = U_2 = … = U_N; otherwise
S_Q(U_1, U_2, …, U_N) = |f(U_1, d) ∩ f(U_2, d) ∩ … ∩ f(U_N, d)| / |f(U_1, d) ∪ f(U_2, d) ∪ … ∪ f(U_N, d)|    (12)

Equation 12 estimates the structure similarity between datasets [62], where f(U_i, d) represents the set of all data items related to U_i within d hops and |f(U_i, d)| represents its cardinality, i.e. the number of elements in the set. Structure matching similarity values are between 0 and 1.

S_R(U_1, U_2, …, U_N) = ω · g(U_1, U_2, …, U_N) if g(U_1, U_2, …, U_N) ≥ θ; otherwise S_R(U_1, U_2, …, U_N) = 0    (13)

Constraint matching attempts to estimate similarity between data items based on constraints such as similar names. Constraint matching is calculated by Equation 13, where g(U_1, U_2, …, U_N) is the similarity degree of the datasets U_1, U_2, …, U_N, which is computed based on the relevant pairs of similar names. The factor ω is a parameter between 0 and 1, and θ is a threshold value [66].
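The multi-dataset form of the structure matching score in Equation 12 can be illustrated with a small sketch; the d-hop neighbour sets below are hypothetical, and the function simply generalizes the intersection-over-union ratio to N sets.

def structure_similarity(neighbour_sets):
    """Equation 12 sketch: N-way intersection over union of d-hop neighbour sets."""
    if all(s == neighbour_sets[0] for s in neighbour_sets):
        return 1.0
    intersection = set.intersection(*neighbour_sets)
    union = set.union(*neighbour_sets)
    return len(intersection) / len(union)

# Hypothetical d-hop neighbour sets f(U_i, d) for three datasets
f_u1 = {"first_name", "last_name", "phone"}
f_u2 = {"first_name", "phone", "plan_id"}
f_u3 = {"phone", "call_duration"}
print(structure_similarity([f_u1, f_u2, f_u3]))   # 1 common item / 5 in the union = 0.2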

Data Usage Pattern Analyzer

Data usage pattern analyzer remains unchanged from its precursor usage pattern analyzer in data usage pattern analysis module [18]. Its functionality remains the same. It makes use of Markov’s algorithm to identify semantic relationship between data items [71].

Provenance Tracker (PT)

Business or legal liabilities may arise when service level agreements (SLAs) are not honored. The owner of the data may restrict access to selected parts of the data, and these restrictions should be enforced throughout the lifetime of the SLA. In order to keep track of any restrictions put forth by the data owner, the proposed SDD framework uses a Provenance Tracker (PT). The PT also aids in maintaining data lineage. Data lineage refers to the process of keeping track of data at any given time and at any stage of a process. This is crucial because, with the help of data lineage, one can know where the data is at any given stage of a process.

Experimental Results

To evaluate the proposed EMG, a customer churn dataset for a company, provided by Teradata, is used. The dataset contains customer complaints from walk-in stores, online channels and call centers. Headers were removed from the datasets. To test the proposed SDD framework, it is assumed that there were no restrictions from the data owner on data sharing. Some sensitive information, such as NAI and customer names, was added to the company dataset.

Data Item        Data Type    Unique?
First Name       String       N
Last Name        String       N
Address          String       N
City             String       N
State            String       N
Zip              String       N
Phone Number     Number       Y
Customer ID      String       Y

TABLE 2 Metadata Generated for Customer Dataset

A snapshot of the metadata generated for the customer dataset from the telecom company data is shown in Table 2. Similarly, metadata is generated for the other datasets and stored in the Hadoop cluster.

Data usage pattern consists of user information, timestamp, dataset and data items being accessed.

A snapshot of data usage patterns for the Telecom Company is shown in Table 3. This usage

information is invaluable as it contains patterns of user behavior.

User    Timestamp                    Dataset        Data Items
User1   Sept 10, 2014; 10:23:00 AM   Customer       First Name, Plan ID, Age, Gender
                                     Click-Stream   Timestamp, URLs Visited
User2   Sept 1, 2014; 01:23:00 AM    Employee       First Name, Age, Gender
                                     Call-Center    Customer Phone #, Call Duration, Quality of Service

TABLE 3 Snapshot of Usage Tracked by Data Usage Tracker

Similarity between data items based on data context is calculated using linguistic, graph and constraint matching techniques. Table 4 shows which data items are similar between the customer dataset and the other datasets.

            Cust Data   Emp Data                          Walk-In Store   Call Center   Click Stream
Cust Data   ---         First Name, Last Name, Phone #,   Phone #         Phone #       ---
                        Address, City, Zip

TABLE 4 Results from Context Similarity Analyzer for Customer Dataset with Similarity Scores > 0.9

The usage pattern analyzer identifies similarity in usage patterns by constructing a weighted graph with data items as vertices and their co-usage frequencies as edge weights. After constructing the graph, Markov's algorithm [71] is used to identify data items from multiple datasets that are semantically related. Table 5 shows a snapshot of the clusters of semantically related data items from the company dataset. There are three clusters shown in Table 5, which vary widely based on how the data is being used by users. Even if two Hadoop clusters hold the same data, there is no assurance that they will have the same usage patterns. Results from both the data context similarity and the usage pattern similarity are combined to obtain related data items.

Cluster 1: Customer Dataset (First Name, Last Name, Phone #); Employee Dataset (Employee ID, First Name, Last Name); Walk-In Store (Phone #, date, time, Service Details, Employee ID); Call Center (Phone #, date, time, nature of complaint, Employee ID)

Cluster 2: Customer Dataset (Online User ID); Click Stream (User ID, time stamp, URLs)

Cluster 3: Employee Dataset (Employee ID, First Name, Last Name); Walk-In Store (Date, time, Customer Satisfaction, Employee ID); Call Center (Date, time, Customer Rating, Employee ID)

TABLE 5 Semantically Related Data Items Identified by Markov's Algorithm


CHAPTER IV

DETECTING SENSITIVE DATA ITEMS

Problem Statement

Once all the data items in a dataset are identified, the next step is to identify which of them are sensitive. In the current Hadoop implementation [12], there is no way to identify sensitive data items without prior knowledge of these data items and their sensitivity. Manually identifying sensitive data items without prior knowledge or without any information from the data owners is a tedious and time-consuming process due to the volume and variety of data stored in HDFS [38]. Identifying sensitive information is very useful, as it paves the way to protecting it from misuse or abuse.

Introduction

The proposed framework is entirely data-driven in identifying sensitive data items within a given dataset. To achieve this, the proposed framework makes use of an information sensitivity graph model. This model is implemented by the Data Sensitivity Estimator (DSE) component of the proposed framework. The information sensitivity graph is constructed by harnessing the metadata generated by the EMG, the data usage tracked by the DUT and the data similarity generated by the DSA. The nodes of the information sensitivity graph are users, datasets and data items within the datasets, and the edges are usage patterns. After the construction of this graph, Shannon's entropy [36] and information gain [36] are used to identify sensitive data items. The proposed information sensitivity graph treats the data stored in Hadoop as a communication system. The sensitivity of a data item is computed by observing the effect of removing the data from the communication system; the data is highly sensitive when the effect of removal is significant.

Literature Review and Related Work

Shannon’s Information Entropy

In [36], C.E. Shannon expresses information entropy as the mean amount of information contained in a message sent through a communication system. Shannon's entropy of a discrete random variable X can be computed from Equation 14.

H(X) = − Σ_{i=1}^{n} p(x_i) log p(x_i)    (14)

In Equation 14, p(x_i) represents the probability mass function of state x_i for a system with n different states. For every state x_i in a discrete information source there is an associated probability p(x_i) of producing the corresponding output symbol. Informally, entropy can also be regarded as a measure of impurity: the higher the entropy value, the greater the impurity.
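As a small worked example of Equation 14 (with values chosen only for illustration), consider a source with two equally likely states, p(x_1) = p(x_2) = 0.5: H(X) = −(0.5 log_2 0.5 + 0.5 log_2 0.5) = 1 bit. If instead p(x_1) = 0.9 and p(x_2) = 0.1, then H(X) = −(0.9 log_2 0.9 + 0.1 log_2 0.1) ≈ 0.47 bits, reflecting the lower impurity of the skewed distribution.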

Information Gain

Information gain is useful in identifying which attribute is important in a given feature vector. The change in entropy between the original state and the modified state can be quantitatively represented using information gain. The expected information gain can be calculated using Equation 15.

IG(X, a) = H(X) − H(X|a)    (15)

In Equation 15, IG refers to the information gain, H(X) refers to the entropy of the system and H(X|a) refers to the entropy of the system after removal of node 'a'. In the proposed model, when the value of information gain decreases, the sensitivity increases. This assumption is validated in the following sections.
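A minimal sketch of how entropy and information gain could be computed over usage counts is shown below; the usage distribution, the removal semantics and the eventual mapping from information gain to sensitivity are simplifying assumptions made only for illustration.

import math

def entropy(counts):
    """Shannon entropy (Equation 14) of a discrete distribution given as raw counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)

def information_gain(usage_counts, item):
    """Equation 15 sketch: entropy of the system minus entropy after removing one item."""
    remaining = {k: v for k, v in usage_counts.items() if k != item}
    return entropy(usage_counts) - entropy(remaining)

# Hypothetical usage counts of data items in the information sensitivity graph
usage_counts = {"phone": 50, "first_name": 30, "zip": 15, "plan_id": 5}
for item in usage_counts:
    print(item, round(information_gain(usage_counts, item), 3))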

Assessing security risk of a dataset

In [41], A. Harel et al propose a dynamic sensitivity based access control (DSBAC) framework for traditional databases. The DSBAC framework is an extension of Mandatory Access Control (MAC), and it makes use of a misuseability score (m-score) to compute an access class on demand for tuples. The m-score is a quantitative measure of the extent of damage a user can cause when exposed to sensitive data unknowingly or by mistake [80]. In [41], the misuseability score is calculated with the help of information quality, information quantity and other distinguishing factors. Although this measure is computed on demand, it is primarily applicable to traditional RDBMSs; it is not applicable to big data, and the m-score depends on a sensitivity function defined by a domain expert [79]. The sensitivity score function will vary based on the domain knowledge of the expert.

In [35], K. Sajko et al state that the information value obtained from a dataset can be used to assess its security risk. Assessing the security risk of each dataset plays a vital role in putting security mechanisms in place to protect the dataset. D.L. Moody and P. Walsh in [81] have proposed quantitative measures to estimate the value of information. In [81], it is stated that, like an asset, information also has a cost and a corresponding value, but information does not follow the laws of economics. R. Glazer in [82] suggests that the unique value of information must be used to compute its value.


Seven laws governing information value [81]

In [81] D.L. Moody and P. Walsh proposed seven laws that can be used to quantitatively estimate information value. These laws, described in detail below, are used in the proposed framework to evaluate the security risk of a dataset.

“Information is infinitely shareable”

Information can be shared with multiple parties without loss of its value, unlike other assets, which only one party can possess and claim ownership of [82]. Other assets lose their value when shared between multiple persons, organizations or entities. This can be observed in Figure 25.

Figure 25 Value of information when shared [81]

An example of this is the World Wide Web (WWW), where information is disseminated to a large number of people. Sharing information is much easier and more cost-effective than duplicating and maintaining several copies of the same data.


“Value of information increases with use”

The value of conventional assets decreases with increased use; a suitable example is the resale value of a car decreasing with its mileage. The value of information, unlike conventional assets, increases the more it is used, so it is fair to say that it has increasing returns [81]. The cost of obtaining and maintaining information is initially high, and this is paid for by the usage of the information itself [81]. Unused information that sits idle in an organization is wasteful, as it is not being used to its fullest potential and ends up being a liability [81]. Figure 26 compares the value of information and the value of a conventional asset over usage.

Figure 26 Value of information over usage [81]

“Information is perishable”

Information like every conventional asset loses its value over time [81]. The rate of depreciation of this value depends on the type of the information itself. In [81], the authors propose three

“lives” or stages to information. These stages are “operational shelf live, decision support shelf life and statutory shelf life” respectively [81]. These three stages are depicted in Figure 27.


Figure 27 Value of information over time [81]

In Figure 27 “Operational shelf life” refers to the period of time up to which the corresponding information is likely to be valid. This period of time is relatively short [81]. “Decision support shelf life” refers to a period of time up to which the information can be used for identifying trends and making decisions based on these trends [81]. “Statutory shelf life” corresponds to the period of time up to which the information should be stored for legal purposes [81].

“Value of information increases with accuracy”

When the available information is accurate, it will be more valuable for decision-making and knowledge/pattern extraction [81]. Inaccurate information will be expensive for an organization, as it will result in incorrect decisions [83]. Type of information decides how accurate the information should be [81].


Figure 28 Value of information over accuracy [81]

From Figure 28 it can be observed that the value of information increases with increased accuracy. However, beyond a certain threshold, further increases in accuracy are not matched by increases in information value. This can be explained by the fact that most organizations do not need 100% accurate data to make decisions [81]. If the accuracy of the data falls below a certain threshold, it is called "misinformation" [81]. Knowing the accuracy of information helps decision makers avoid making erroneous decisions based on incorrect facts [84].

“Value of information increases when combined with other information”

The value of a dataset increases drastically when it can be combined or aggregated with multiple other datasets. This is because the dataset by itself may be of little or no value for making certain decisions, but when combined with other datasets the resulting information can be far more insightful [81].


Figure 29 Value of information over aggregation (or) integration

From Figure 29, it is evident that information value increases significantly after integrating or combining it with other sources. D.L. Moody and P. Walsh in [81] suggest that 80% of the benefits in an organization can be reaped by integrating just 20% of its data. A study conducted by D.L. Goodhue et al. showed that integrating data beyond 20% will be counterproductive and could hamper the benefits attained through integration [85].

“More is not necessarily better”

In the case of a conventional asset, the more one possesses, the better off one is [81]. With the proliferation of technology these days, information is not scarce anymore. Mankind produces more information than ever before. The challenge facing many organizations in recent times is how to get useful insights from voluminous information. As humans, we have a limited capability to process information [86, 87]. Being exposed to more information than we can handle is called information overload. As a result of information overload, the comprehension and decision-making capabilities of a person are affected [88, 89, 90, 91].


Figure 30 Information value over volume [81]

From Figure 30 one can see that the value of information increases with volume up to a certain point and then decreases due to information overload. Previous studies such as [89, 90, 91] have shown that the availability of more information to people increases their confidence and satisfaction in making a decision. In spite of these findings, the same studies have also shown that if the availability of information exceeds processing ability (information overload), then it certainly hampers comprehension.

“Information is not depletable”

When a conventional asset is used more, it gets depleted faster. This is not true for information. According to R. Glazer [82], information is "self-generating": with increased use, there is always plenty of information left. This is because information, unlike a conventional asset, can be summarized, analyzed, aggregated, or joined with other sources, thereby generating new information. Another example is data mining techniques, which can create new information based on existing data [81].


Information Value Model to Identify Sensitive Data Items

Introduction

A precursor to the DSE was proposed in [19] to identify sensitive data items. Identifying data items in [19] required the following components: the Information Value Estimator (IVE) and the Data Sensitivity Estimator (DSE). The Information Value Estimator (IVE) is used to identify the information value of a data item. Information value is considered a measure for assessing security risk based on the findings in [81]. Information value scores are low for data items that are highly valuable, and vice versa. Based on the similarity index produced by the DSA, the metadata produced by the EMG, and the information value estimate produced by the IVE, the Data Sensitivity Estimator (DSE) determines the sensitivity of data items. These data items are vulnerable and can lead to compromising users' privacy.

IVE and DSE in the Sensitive Data Detection Framework are shown in Figure 19. A detailed

description of these components will be provided in the following sections.

Information Value Estimator (IVE)

The information value of a dataset is predominantly used to assess its security risk [35]. Security risk assessment is essential in identifying proper security measures [35]. In the current Hadoop implementation [12], information value assessment is often based on information provided by the data owners. The proposed Information Value Estimator (IVE) quantitatively estimates information value, when there is no information from the data owners, from the metadata, usage patterns, and similarity index. Information value is between 0 and 1; a lower value indicates a higher security risk. Information value is calculated using Equation 16.

IV(D_i) = \alpha \cdot usage(D_i) + \beta \cdot relatedness(D_i) + \gamma \cdot quality(D_i) - \delta \cdot lifetime(D_i)    (16)

where D_i represents the i-th data item in a dataset D; usage, relatedness, quality and lifetime are the corresponding factor values for D_i; and \alpha, \beta, \gamma and \delta are weights whose values are between 0 and 1, with \alpha + \beta + \gamma + \delta = 1.

Equation 16 is based on the seven laws governing information value proposed in [81]. Some of these laws, which were used to derive Equation 16, are: 1) information value increases with data usage; 2) information value increases with increases in data quality; 3) information value increases when a dataset can be combined with other datasets; 4) information value decreases when the dataset is outdated [81]. Data usage is obtained from the data usage patterns, relatedness is obtained from the similarity index, and data quality is measured based on the completeness and accuracy of the dataset. Lifetime is the difference between the current time and the time at which the dataset was originally created. If there is information from the data owners regarding data sharing, then the proposed IVE uses this information solely instead of Equation 16.
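To make the calculation concrete, a minimal sketch of the IVE computation in Equation 16 is given below. The function name and the assumption that all four factors are normalized to the range 0 to 1 before weighting are illustrative choices, not part of the proposed framework.

# Sketch of the Information Value Estimator (IVE) in Equation 16.
# The normalization of each factor to [0, 1] is an assumption made for
# illustration; only the weighted combination follows Equation 16.

def information_value(usage, relatedness, quality, lifetime,
                      alpha=0.25, beta=0.25, gamma=0.25, delta=0.25):
    """Estimate the information value of a data item (Equation 16).

    All four inputs are assumed to be normalized to [0, 1], and the
    weights must satisfy alpha + beta + gamma + delta = 1.
    """
    assert abs(alpha + beta + gamma + delta - 1.0) < 1e-9
    iv = alpha * usage + beta * relatedness + gamma * quality - delta * lifetime
    # Clamp to [0, 1] so a very old data item cannot produce a negative score.
    return max(0.0, min(1.0, iv))

# Example: a heavily used, well-connected, high-quality but fairly old data item.
print(information_value(usage=0.9, relatedness=0.8, quality=0.95, lifetime=0.6))

In the proposed framework, the usage, relatedness and quality inputs would come from the usage patterns, the similarity index and the metadata respectively, and the lifetime from the dataset creation time.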

Data Sensitivity Estimator (DSE)

The proposed data sensitivity estimator (DSE) identifies sensitive data items from datasets. The architecture of DSE is shown in Figure 31.

Figure 31 Architecture of Data Sensitivity Estimator (DSE) [19]

The DSE uses the provenance information, metadata and information value generated by the PT, EMG and IVE respectively. If the information value is high, or if the data item has not been encountered before, then the sensitivity is determined by a domain expert. The sensitivity estimated by the domain expert is used to train a neural network. If the information value is low, then the trained neural network determines its sensitivity. The sensitivity report is then sent to the administrator to put forth sufficient security measures to protect the identified sensitive information.

Entropy-Based Approach to Identify Sensitive Data Items

Introduction

Although the precursor to the entropy-based approach identified sensitive items from datasets, it depended on finding optimal values for the constants α, β, γ and δ in Equation 16. These weights varied across domains, and deriving them required a significant amount of time. To avoid this intensive computation and to identify

sensitive information across all domains, the entropy-based approach is proposed. A detailed

description of this model is given in the following sections. The system architecture shown in

Figure 32 is similar to the SDD framework in [19], without the Information Value Estimator

(IVE) component.

Roles and Responsibilities

The roles and responsibilities of the entities in the proposed framework shown in Figure 32 are described below in detail.

Administrator

The proposed model provides the administrator with a list of potentially sensitive data items that

are present in each dataset.


Domain Expert

The domain expert estimates the sensitivity of data items that are identified as potentially sensitive by the proposed model.

Figure 32 System Architecture – Identifying sensitive data items using Information Entropy

Data Sensitivity Estimator (DSE)

Figure 33 Architecture of the Data Sensitivity Estimator (DSE)


The system architecture of the framework to identify sensitive data items using information entropy is shown in Figure 32. The proposed Data Sensitivity Estimator (DSE), shown in Figure 33, implements an entropy-based model to estimate information sensitivity. The proposed model selects data items whose removal will have a significant effect on the system. Data items that have no effect on the system can also be potentially sensitive, and identifying them as sensitive will increase the accuracy of the proposed model. Hence the proposed model uses the domain expert to estimate the sensitivity of the data items that do not have a significant effect on the system. The decisions of the domain expert are used to train a neural network. If similar data items are encountered in the future, the trained neural network determines their sensitivity.

Model for Data Sensitivity

In the proposed entropy model, we consider three factors contributing to data sensitivity. These three factors were identified from the seven laws for quantitatively estimating information value proposed in [81]. Shannon's entropy [36] is adopted to estimate data sensitivity based on three factors, namely data usage, data interconnectedness (similarity) and data quality. Shannon's entropy of a discrete random variable X can be computed from Equation 14. In the proposed model, the entropy change for each node is calculated by removing that node from the network and recording the change in entropy caused by the removal. This gives a measure of the sensitivity of the removed node. If the removal of a node disconnects the network during entropy calculation, then the largest connected sub-graph is used to calculate entropy. Information gain represents the change in entropy between the original state and the modified state (after removal of the node). Expected information gain can be calculated using Equation 15.
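A minimal sketch of this remove-and-recompute idea is shown below. It uses a degree-based probability distribution over a toy undirected graph, which is an assumption made only for illustration; in the proposed model the probabilities are derived from data usage, similarity and quality as defined later, and Equations 14 and 15 are assumed to be Shannon's entropy and the expected information gain.

import math

# Illustration of the remove-a-node-and-recompute-entropy idea.
# The probability mass function here is based on node degree, which is an
# assumption for illustration; the framework's actual probabilities come
# from data usage, similarity, and quality.

def shannon_entropy(probabilities):
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def degree_entropy(graph):
    """graph: dict mapping node -> set of neighbour nodes."""
    degrees = {n: len(neigh) for n, neigh in graph.items()}
    total = sum(degrees.values())
    if total == 0:
        return 0.0
    return shannon_entropy([d / total for d in degrees.values()])

def entropy_change_on_removal(graph, node):
    """Information gain obtained by removing `node` from the graph."""
    reduced = {n: neigh - {node} for n, neigh in graph.items() if n != node}
    # If removal disconnects the graph, the largest connected component
    # should be used; that step is omitted here for brevity.
    return degree_entropy(graph) - degree_entropy(reduced)

graph = {"Patient": {"Doctor", "Request1"},
         "Doctor": {"Patient"},
         "Request1": {"Patient"}}
print(entropy_change_on_removal(graph, "Patient"))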

The factors contributing to the information sensitivity are as listed below.


Data Usage

The more a data item is used, the more sensitive it is because the probability of the data item being abused or misused increases with its usage.

Data Similarity

When there are many data items across multiple datasets that are similar to a particular data item, the probability of these items being sensitive increases, because as the number of similar data items increases, the number of datasets that can be aggregated with the dataset also increases.

Data Quality

The higher the data quality of a dataset in terms of missing, corrupted, or erroneous data, the higher its sensitivity.

Figure 34 An Example of Information Sensitivity Graph


Information Sensitivity Graph

A sample data sensitivity graph is shown in Figure 34. Based on the information from EMG,

DUT, and DSA we model the datasets stored in HDFS as a graph as shown in Figure 34.

In Figure 34 nodes represent the following:

• U_i represents a user i

• D_i represents a request i sent by a user

• R_i represents a dataset i (in Figure 34, Patient and Doctor are examples of datasets)

• R_{i,j} is a specific data item j in a dataset R_i.

In Figure 34 edges represent the following:

• An edge e_{U_i, D_j} is weighted by the number of times a user U_i has made the request D_j.

• An edge e_{D_i, R_{j,k}} connects a request D_i to a specific data item k in a dataset R_j.

• An edge e_{R_{i,j}, R_{m,n}} connects a specific data item j in a dataset i to a specific data item n in a dataset m.

Proposition 1: In the data sensitivity graph, whenever datasets are aggregated and used, there exists a path that starts at the node representing the dataset D_i and ends at the node representing the dataset D_j.

Proof: For every request involving the aggregation and usage of multiple datasets (D_1, ..., D_N), there exist edges that connect each dataset being accessed (D_i) to the data items being accessed in that dataset. In addition to these, there are edges that connect similar data items across multiple datasets based on the similarity index. Thus a cyclic path can be traced from D_i back to itself when multiple datasets are aggregated.


Proposition 1 indicates that the data items or attributes on such a path are accessible by a user. For example, in Figure 34, a patient's name and ID are available to the user because the corresponding nodes are on the path. The goal of our work is to restrict the arcs that connect the nodes and thus maintain privacy.

Calculating Information Gain

The change in entropy (sensitivity) of a dataset based on the usage, data similarity, data quality and data vulnerability is calculated as shown below.

The probability mass function p_u(D_j) for data usage in the data sensitivity graph is calculated using Equation 17.

p_u(D_j) = \frac{\sum_{i} e_{U_i, D_j}}{\sum_{j} \sum_{i} e_{U_i, D_j}}    (17)

In Equation 17, \sum_{i} e_{U_i, D_j} represents the number of times a request D_j has been made by all users, and \sum_{j} \sum_{i} e_{U_i, D_j} represents the total number of requests made by all users. The usage entropy is computed for all the requests in the graph. Data items in a dataset can be accessed by several requests.

The probability mass function p_c(R_i) for data connectivity is calculated using Equation 18.

p_c(R_i) = \frac{\sum_{j,m,n} e_{R_{i,j}, R_{m,n}}}{\sum_{i} \sum_{j,m,n} e_{R_{i,j}, R_{m,n}}}    (18)

A dataset may serve as a connection point between other datasets. The dataset in Figure 34 (with attributes ID and Name) contains data items that are similar to data items in a number of other datasets. Such nodes are very sensitive, as they connect different datasets. Data items that are similar to data items in other datasets are sensitive, as they provide access to data items that


are potentially sensitive in other datasets. In Equation 18, \sum_{j,m,n} e_{R_{i,j}, R_{m,n}} represents the number of arcs or paths incident to dataset R_i, and \sum_{i} \sum_{j,m,n} e_{R_{i,j}, R_{m,n}} is the sum of all arcs or paths incident to all datasets.

The probability mass function based on data quality can be computed using Equation 19. In the proposed model, we represent data quality in terms of missing data and erroneous/corrupted data. The higher the data quality, the higher the data sensitivity.

p_q(R_i) = \frac{co(R_i)/s_i}{\sum_{k} \left( co(R_k)/s_k \right)}    (19)

In Equation 19, co(R_i) represents the number of correct entries for all data items in R_i and s_i represents the total number of entries for all data items in R_i; co(R_i)/s_i therefore represents the proportion of correct data in a single dataset, and the denominator aggregates the correct-data proportions over all datasets.

The combined entropy measure is the product of the three entropy measures computed using Equations 17, 18 and 19. The combined entropy measure is represented in the following equation:

H(D_i) = H(u_r) \cdot H(d_r) \cdot H(N_r)    (20)

In Equation 20, H(D_i) denotes the combined entropy measure for a dataset D_i, H(u_r) denotes the entropy measure calculated from data usage and is calculated as shown in Equation 21, H(d_r) denotes the entropy measure calculated from data similarity and is calculated as shown in Equation 22, and H(N_r) denotes the entropy measure calculated from data quality and is calculated as shown in Equation 23.

H(u_r) = -p_u(D_j) \log p_u(D_j)    (21)

H(d_r) = -p_c(R_i) \log p_c(R_i)    (22)

H(N_r) = -p_q(R_i) \log p_q(R_i)    (23)

The sensitivity score of a dataset is determined by the effect of removing the dataset from the data sensitivity graph. It is the difference between the sum of the combined entropy measures of all datasets and the entropy of the dataset.

C(D_i) = \sum_{k=1}^{n} H(D_k) - H(D_i)    (24)

In Equation 24, C(D_i) denotes the sensitivity of a dataset i, H(D_i) denotes the entropy score of a dataset i, and \sum_{k=1}^{n} H(D_k) denotes the sum of the entropies of the n datasets. The adjusted sensitivity measure is calculated as shown in Equation 25.

S(D_i) = \alpha \cdot C(D_i)    (25)

In Equation 25, \alpha represents a score assigned by the domain expert. This is a weight given to the dataset indicating its sensitivity. A dataset with a high C(D_i) score that is not deemed to be very sensitive by the domain expert will receive a low \alpha value. This score ranges between 0 and 1. The characteristics of the dataset that led the domain expert to assign the score are used to train a neural network, which in turn will determine the score when a similar dataset is encountered in the future.
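A minimal sketch of the overall sensitivity calculation, under the reconstruction of Equations 17 through 25 given above, is shown below. The per-dataset usage, connectivity and quality inputs are illustrative stand-ins for the values produced by the DUT, DSA and EMG; only the structure of the calculation (probability mass functions, per-dataset entropy contributions, the combined measure, the removal difference and the expert weight) follows the equations.

import math

# Sketch of the entropy-based sensitivity score (Equations 17-25).
# The input dictionaries are assumed per-dataset aggregates; the framework
# obtains the underlying counts from the DUT, DSA and EMG components.

def pmf(counts):
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}

def contribution(p):
    """Entropy contribution -p log p of a single dataset (Equations 21-23)."""
    return -p * math.log2(p) if p > 0 else 0.0

def sensitivity_scores(usage, connectivity, quality, expert_weights):
    p_u, p_c, p_q = pmf(usage), pmf(connectivity), pmf(quality)
    # Combined entropy measure per dataset (Equation 20).
    combined = {d: contribution(p_u.get(d, 0)) *
                   contribution(p_c.get(d, 0)) *
                   contribution(p_q.get(d, 0)) for d in usage}
    total = sum(combined.values())
    scores = {}
    for d in usage:
        c = total - combined[d]                      # Equation 24
        scores[d] = expert_weights.get(d, 1.0) * c   # Equation 25
    return scores

usage = {"Patient": 40, "Doctor": 10, "Billing": 5}
connectivity = {"Patient": 6, "Doctor": 2, "Billing": 1}
quality = {"Patient": 0.98, "Doctor": 0.90, "Billing": 0.75}
print(sensitivity_scores(usage, connectivity, quality, {"Billing": 0.4}))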

It is important to note that the proposed framework does not have to wait until the Usage Tracker has accumulated a large amount of usage information. The framework works even when there is no usage information, as the sensitivity of the data is initially determined based on its quality and data connectivity. As usage information becomes available, the sensitivity is dynamically updated to include the usage data as well. When there is no usage information available, H(u_r) in Equation 20 is replaced by the constant 1 (one). So there is no overhead in the system, because it does not have to wait for the usage data from the Usage Tracker. The same process is repeated whenever one of these three data sensitivity measures is not present.


CHAPTER V

TRACKING DATA LINEAGE

Problem Statement

Big data platform tools like Hadoop allow users to store huge volumes of different varieties of data. Data stored in Hadoop can go through several transformations after being consumed by different MapReduce jobs executed by several users. Tracking provenance information can be helpful in detecting data misuse. Data provenance deals with tracking the data lineage, i.e., information about the data origin, ownership and the transformations it goes through. In the native Hadoop implementation there are no mechanisms to track the transformations a dataset goes through. In the current Hadoop implementation [12], data lineage can only be tracked manually, which is a very time-consuming and tedious process.

Introduction

Provenance information consists of data about several entities, processes and users that are associated in producing a piece of information [251]. Provenance data is useful to assess the quality and reliability of data [251]. Provenance data is also useful to know about the origin of data and the transformations it went through. With the help of provenance information, one can identify which users transformed what datasets. In [252], the authors propose an Open

Provenance Model (OPM), which specifies how provenance data can be gathered and exchanged in a system consisting of multiple layers. In [253] the authors state reasons why one should use provenance data and also describe the advantages of using provenance information.

Data provenance is a well-studied field, and there is research that uses data provenance for scientific workflows [197, 198, 199, 200, 201, 214, 215, 235], data-driven

workflows [234], e-science [202], bioinformatics [203], storage systems [204], sharing structured

data [236], cloud [196], data warehouses [205, 207], databases [208, 212, 213, 227, 241], access

control models [209], web browsers [210], preventing forgeries [211], Resource Description

Framework (RDF) stores [242] and reproducing computational results [226]. While data provenance has been used to identify data quality, the authors in [245] identify certain dimensions and measures to assess the quality of the provenance data itself. They argue that the quality of provenance data is essential, as it can affect the outcome [245].

Provenance data is usually represented as graphs and is traditionally stored in relational databases. In [246] the authors use the Earth System Science Server (ES3) to capture provenance information automatically by tracking the interaction of tasks with the execution environment. Provenance data stored in ES3 is later assembled into provenance graphs, and ES3 provides a report on what actually happened during execution rather than what was requested [246]. In [247] the authors suggest that compressing provenance graphs and using dictionary encoding allows provenance data to be stored and queried effectively. In [248], the authors suggest decreasing provenance data size by using factorization and inheritance. In [250] the authors identify challenges in the automatic collection of provenance data at the operating system level. They identify granularity, versioning and cycles in provenance data as issues in the automatic collection of provenance data [250].

In [190] B. Galvic coins the term "Big Provenance", which refers to the provenance information obtained from big data. B. Galvic introduces two types of provenance for big data, namely transformation provenance and data provenance [190]. Provenance information can be used for


debugging data, security and other purposes [190]. In [254], R. Agarwal suggests that provenance

in big data can be used for “validating, debugging, auditing, evaluating quality of data and

determining reliability of data" [254].

J. Wang et al. in [188] identify the following four major challenges in collecting provenance for big data:

• Size of provenance data collected

• Overhead of provenance collection

• Storing and aggregating provenance data

• Reproducing executions from provenance

The Provenance Tracker proposed in the SDD framework discussed in Chapter III has a drawback: the tracked provenance information can be modified by a malicious user, so that the damage caused by that user remains unknown. This drawback is addressed in a provenance tracker based on the block chain used in Bitcoin [165]. In a block chain, once a block is created it becomes immutable; if the data in the block is changed, it will contradict the hash value in the next block. This principle is applied in the block chain based model to track computation and data provenance in Hadoop [12]. This model poses no additional overhead to capture provenance data, as it is fed data from the Data Usage Tracker, and the provenance information is stored in HDFS to increase availability and fault tolerance. In addition, users are able to query the provenance information they are authorized to access using the Apache Hive interface [243]. To impose additional security, the block chain based model encrypts all the raw data stored in a block, and only the administrator has the key to decrypt and view the data. In the proposed block chain based model it is assumed that the administrator is a trusted entity who keeps the encryption key safe. How the block chains are constructed, maintained and monitored is explained in detail in the rest of this chapter.


Summary of Existing Lineage Tracking Frameworks for Big Data

In [175] W. Zhou et al. propose a distributed network provenance tracking system called the Extensible Provenance Aware Networked Systems (ExSPAN). The ExSPAN framework is developed based on RapidNet, a declarative networking engine [176, 177]. Network provenance data is stored as relational tables in ExSPAN [175]. The provenance data in ExSPAN is distributed based on two approaches, namely value-based and reference-based [175]. In the value-based approach the data is piggybacked onto the general network communication, whereas in the reference-based approach references to resources are created, which can be traced back to the source when querying for that information [175].

In [174] W. Zhou et al. propose a platform called NetTrails, which is used for querying and maintaining provenance data in a distributed network. The NetTrails provenance engine combines features from the RapidNet declarative networking engine [176, 177] and the ExSPAN network provenance engine [175]. The network provenance engine proposed in [174] stores provenance graphs as tables that are distributed over the network. Network provenance in [174] is modeled as an acyclic graph whose vertices represent either a base tuple or the result of an operation, and whose edges represent the rules applied on the input data (tuples) to get the desired output. The other major advantage of the framework proposed in [174] is that it supports queries to access the distributed provenance data.

In [166], R. Agarwal et al. propose a layer-based architecture to capture provenance data in big data. The authors use MongoDB, a NoSQL database [167], to track and store provenance information. The three layers in this architecture are the storage, application and access layers [166]. The authors state that the major advantage of using this layer-based framework is that changes in one layer will not affect the others. The provenance information is stored in BSON format (a binary format of JSON) in MongoDB. In [166], GridFS is used to store data in MongoDB, as it has provisions to store both the data and the metadata. Large data objects are split into chunks and stored in the chunk collection, whereas the metadata is stored in the file collection [166]. The authors in [166] restrict users from accessing each other's provenance information. Although MongoDB can store a considerable amount of data, it cannot store as much information as Hadoop.

In [168], D. Ghoshal proposes a mechanism to extract provenance information from huge amounts of log files. Provenance information is captured by adding hooks to applications that are part of a workflow system [168, 169, 170]. These hooks are known as "program instrumentation" and require all the source code to use these hooks. Capturing different types of logs and extracting provenance information from them is complicated, as the logs can be structured, semi-structured or unstructured. In [168] the authors propose a rule-based framework to identify provenance information from log files. The framework consists of two phases, namely event capture and provenance derivation [168]. The authors use an XML-based rule language for extracting provenance information [168]. Although the proposed framework identifies, links and remaps provenance information from log files, provenance information cannot be identified when there are no log files, and the rules must change whenever the structure of the log files changes.

In [171], D. Crawl et al. track the provenance information of MapReduce workflows using the Kepler framework [169, 172]. The framework proposed in [171] consists of a data model, which can capture provenance information from individual MapReduce jobs or entire workflows.

Provenance information is stored in MySQL database and users can query this information [171].

The Task Tracker running on the data nodes in the Hadoop cluster adds data to the MySQL database whenever a MapReduce job executes. The Name Node (master nodes) in the Hadoop cluster runs the MySQL manager that monitors all the MySQL servers running on the data nodes.

Although the framework stores and tracks provenance information, it requires an instance of a MySQL server running on all the slave nodes (data nodes) in a Hadoop cluster [171].

In [173] R. Ikeda et al, propose a Reduce and Map Provenance (RAMP) framework, which is an

extension of the Generalized Map And Reduce Workflows (GMRWs). The RAMP framework provides a wrapper function for the default Mapper and Reducer functions in the native Hadoop

implementation [12]. In addition to the Mappers and Reducers the RAMP framework implements

wrapper functions around the RecordReader and RecordWriter Classes in the native Hadoop

implementation [12]. RecordReader class fetches record from the input split and sends it to the

mapper for consumption, whereas the RecordWriter class writes the output from the Reducer to

the HDFS. The RAMP framework is used for both forward tracing and backward tracing. Forward tracing is used to identify which elements contributed to the output, and backward tracing allows users to identify the elements that were responsible for the creation of an output element. Although the RAMP framework implements both forward and backward tracing, they have not been implemented efficiently [173].

In [185], Y.W. Cheah et al. propose a lightweight framework for tracking provenance in big data called "Milieu". Like the framework proposed in [166], "Milieu" also stores the provenance information in MongoDB [167]. "Milieu" facilitates the collection of semi-structured provenance information by collecting provenance data for storage and analysis separately. Provenance data collected in "Milieu" comprises three levels. Level 1 data consists of basic information such as job submission details, user information and the results, whereas level 2 data also includes the resources required for computation, and level 3 data is even more detailed as it comprises detailed I/O information [185]. Appropriate users access these different levels of data. Provenance data is stored based on job IDs or location IDs [185]. Job IDs are used to track MapReduce jobs, whereas location IDs are used to track command line

access to the datasets stored in HDFS.

In [186], C. Olston and A.D. Sharma propose a provenance-tracking framework for Big Data

called "Ibis". The provenance information is stored in a relational database called SQLite. "Ibis" also defines an Ibis Query Language (IQL), which is similar to the Structured Query Language (SQL) [186]. IQL is used to query the provenance data stored in the SQLite database. A major advantage of "Ibis" is that it allows users to specify granularity sets called "gsets". These granularity sets allow users to determine the granularity of the provenance data captured [186]. Granularity sets can be applied to both the data and the MapReduce jobs.

HadoopProv, a provenance tracking system, is built by modifying the Hadoop implementation by S. Akoush et al. in [187]. This framework imposes less than 10% overhead on the running time of MapReduce jobs. At the end of the map task in a MapReduce job, HadoopProv maps the intermediate keys to the input splits, and when the combine task finishes its execution, HadoopProv aggregates all the positions in the input split for records with the same keys [187]. At the end of the reduce task, HadoopProv stores the provenance data generated at the end of the map task and the locations of the records in the result of the MapReduce job [187]. HadoopProv stores "record level" provenance information and thereby takes more space, because the size of the provenance data depends on the number of key-value pairs in the entire dataset [187].

In [189] Y. Amsterdamer et al. propose a provenance framework that aggregates both database-style and workflow-style provenance information. Database-style provenance information has fine-grained dependencies, whereas workflow-style provenance information has coarse-grained dependencies [188, 189]. In [189] the provenance information is represented as compact graphs and is used by workflow analysis queries [188, 189].

V. Korolev and A. Joshi in [191] propose a tool called "PROB" to track provenance and to aid in reproducing job executions for big data. The "PROB" tool uses the following software to function:

• Git

• Git2Prov [192]

• Git-Annex [193].

Git refers to a version control repository that tracks changes. Git-Annex [193] extends Git by

allowing the software to track very large files without adding them to the repository. Git-Annex

only stores the hash of the datasets in the repository and stores the datasets in a different path

[191, 193]. Git2Prov [192] is used to represent the information in a Git repository as provenance

data by converting it to the PROV W3C standard [191]. The PROB tool is developed to work only with Apache Pig [194], an open-source data munging and manipulation tool that works on Hadoop

[12].

In [195] the authors propose a Distributed Time-aware Provenance (DTaP) model, which is used to track provenance data about state changes. The authors in [195] provide an implementation of the DTaP model called "DistTape", which is responsible for the distributed storage of provenance data and allows users to query time-aware provenance data [195]. The authors also show how DistTape can be used in the context of MapReduce and discuss the overhead and computational cost.

In [228], the authors collect data on how tasks run in a high-performance environment using a provenance management system and store this data as structured data. The authors in [228] state that data on how the tasks behave when they are executed is helpful in identifying resource usage patterns and how the output data is obtained. In [228], the authors improve and use the Swift parallel scripting system [229, 230, 231] to track data provenance. Although the Swift parallel scripting system is capable of tracking provenance information, there is a mismatch between its data structures and the data stored in the relational model [228].

In [232], the authors demonstrate how the traditional provenance tracking systems will not scale

in a completely distributed environment where the provenance information generated by

individual nodes has to be verified for its authenticity and when some nodes in these distributed

systems fail. In order to resolve these issues, the authors in [232] propose a lineage authentication scheme called "Bonsai". This scheme is completely decentralized, introduces failure tolerance and reduces latency [232]. The provenance data collected by "Bonsai" is authentic and complete, and only the operations and data required to reconstruct the data lineage are recorded [232]. Rather

than transmitting metadata along with the resulting data in “Bonsai” only the data is transferred to

the node which requests it and the metadata is stored on the node which performs the

computations [232]. The cryptographic information required to access and verify the metadata is

sent to the node requesting to access the data so the node requesting the data can verify the

authenticity of the data if required [232]. This approach drastically reduces the number of times

the metadata is transmitted and it can save network bandwidth when the metadata is large [232].

In [233], the authors propose a framework called “PReServ” to track lineage for services.

"PReServ" records all the services that were responsible for transforming the data, and it is helpful in reproducing experimental results. The provenance data tracked in "PReServ" is stored

in a database [233].

In [237], W. Zhou et al. propose a Time Aware Provenance (TAP) framework for distributed systems in order to identify discrepancies in a system's behavior, identify intrusions, detect performance bottlenecks and diagnose issues related to the configuration of protocols. This will aid

the system administrators to identify any issues and rectify them [237]. TAP also allows users to

query the provenance data securely.

In [238], the authors propose a general-purpose framework to track data provenance by using

dynamic instrumentation. Users can avoid modification in their code by using dynamic

instrumentation and the framework identifies the points in the user’s code where the dynamic

instrumentation has to be added [238]. The authors build the framework proposed in [238] on

DTrace [239] and test it on HTTP requests from browsers, transactions on databases and

operations on file-systems.


In [240], T. Malik et al propose a framework to track provenance sketches on distributed

applications and this decentralized framework is able to handle queries on provenance data from

users effectively. The dependencies in workflows are tracked both within a host and across

multiple hosts in [240]. The provenance data tracked in [240] is used to construct a provenance

graph with resources (e.g. files) and operations (e.g. sum, difference, etc.) as vertices and

direction of data flow as edges. Gathered provenance data will be stored in a RDBMS in [240].

The advantages of the proposed Provenance Tracker (PT) that runs on the Hadoop implementation [12] over these frameworks are as follows:

• There is no additional overhead to capture provenance data as it uses the data collected

by the Data Usage Tracker (DUT).

• Provenance data is fault-tolerant as it is stored on the HDFS itself.

• Provenance data is also secure since all the data except the hash values are encrypted and

only the administrator can decrypt and view the raw provenance data.

• Provenance data can be queried using Apache Hive [243] and the queries should be in

compliance with HiveQL [244] specifications.

• Provenance data can be checked periodically to identify violations in users accessing

data.

Block Chain Approach to Track Data Lineage

The block chain based model to track data lineage creates a block whenever a new dataset is created or stored in HDFS [12]. This first block in the block chain is called the Genesis block. Whenever a dataset is accessed via MapReduce jobs (MR) or the Command Line Interface (CLI), a new block is added to the block chain corresponding to the dataset. When multiple datasets are aggregated and accessed via MapReduce jobs, a block is created from the block chains corresponding to the datasets that are aggregated. Every block in the block chain consists of a header and the data associated with the block. The contents of a block in the proposed block chain based approach are depicted in Figure 35.

Header: Version Number; Hash value of the root of the Merkle Tree; Hash value of the previous block in the chain

Data: Dataset name; Data items being accessed; Type of access (MR/CLI); User Information

Figure 35 Contents of a block in the block chain

To enhance security and to prevent users from tampering with the provenance information, the data in every block, such as the dataset name, data items being accessed, type of access and user information, is encrypted using an asymmetric encryption algorithm like RSA [124] before being stored in HDFS.

The header of a block consists of the hash value of the root of the Merkle tree, the hash value of the previous block in the block chain, and the little endian representation of the following, after they have been converted to hexadecimal representation:

• Version number

• Timestamp

• Size of data block in bytes


In the little endian representation the byte order is reversed, so that the least significant byte is stored first. The Merkle tree is constructed using the name of the dataset, the data items accessed, the type of access and the user information, as shown in Figure 36.

Figure 36 Merkle Tree Calculations

H in Figure 36 denotes a hashing algorithm such as SHA-256. To calculate the root of the Merkle tree, the hash values of the dataset name and the data items accessed are calculated, and these are then hashed with the hash value obtained by hashing the user information and the type of access. Instead of storing the entire Merkle tree, the proposed model stores only the hash value of the root, thereby saving space in HDFS.
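A minimal sketch of the Merkle root calculation in Figure 36 is given below, assuming SHA-256 as the hashing algorithm H and simple concatenation of hexadecimal digests before re-hashing; the exact serialization is not prescribed by the figure.

import hashlib

# Sketch of the Merkle root calculation from Figure 36 (SHA-256 assumed).
# Child hashes are concatenated as hex strings before re-hashing, which is
# an assumption for illustration.

def sha256(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def merkle_root(dataset_name, data_items, user_info, access_type):
    # Leaf hashes.
    h_name, h_items = sha256(dataset_name), sha256(",".join(data_items))
    h_user, h_access = sha256(user_info), sha256(access_type)
    # Intermediate hashes, then the root; only the root is stored in HDFS.
    left = sha256(h_name + h_items)
    right = sha256(h_user + h_access)
    return sha256(left + right)

print(merkle_root("Patient", ["ID", "Name"], "user=alice", "MR"))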


Creating Genesis Block

The genesis block is created only when a new dataset is stored in or written to HDFS. The genesis block lacks information such as the user information, type of access, hash value of the previous block and data items being accessed. Thus it is created using the algorithm shown in Figure 37.

Input : Dataset Name

Output : Genesis Block

Initialize the following constants

Version_Num ← 0

Nonce ← rand()

Timestamp ← now()

Header ← Hash(Version_Num + Timestamp + Nonce + Dataset Name)

Block_Data ← Encrypted(Dataset Name), Version_Num, Timestamp

Return Header, Block_Data

Figure 37 Algorithm to Create Genesis Block

During the creation of the genesis block, a random number (nonce) is used due to the lack of a previous hash value. The header of the genesis block consists of the hash value of the concatenation of the version number, timestamp, nonce and name of the dataset. To create the genesis block, the source code of the HDFS shell commands [132] that create a new dataset (touchz), transfer files to HDFS (put, copyFromLocal, moveFromLocal, getmerge) and modify files (appendToFile) is modified.
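A minimal sketch of the genesis block algorithm in Figure 37 is given below. The encryption step is stubbed out and the concatenation format for the header hash is assumed; in the proposed model the data portion would be encrypted with RSA [124] before being stored in HDFS.

import hashlib, random, time

# Sketch of the genesis block algorithm in Figure 37.
# encrypt() is a stand-in for the RSA encryption used in the proposed model.

def sha256(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def encrypt(plaintext):
    return plaintext  # placeholder; the real model encrypts with RSA [124]

def create_genesis_block(dataset_name):
    version_num = 0
    nonce = random.getrandbits(32)        # no previous hash exists yet
    timestamp = int(time.time())
    header = sha256(f"{version_num}{timestamp}{nonce}{dataset_name}")
    block_data = (encrypt(dataset_name), version_num, timestamp)
    return header, block_data

header, data = create_genesis_block("Patient")
print(header, data)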


Adding a Block in a Block Chain with no Data Aggregation

Input : Dataset name, data items, user information, type of access

Output : None

Prev_hash ← Header of the last block in block chain

Root_Merkle_Tree ← Compute Merkle Tree using dataset name, data items, user information and type of access and return root of tree

Version_Num ← Little Endian(Hex(Version_Num of last block + 1))

timestamp ← Little Endian(Hex(timestamp))

data_size ← Little Endian(Hex(data size in bytes))

Block_Header ← Hash(Version_Num + timestamp + data_size + Prev_hash + Root_Merkle_Tree)

Block_Data ← Encrypted(Dataset name, data items, user information, type of access), Version_Num, timestamp, data_size, Root_Merkle_Tree

Add Block with Block_Header as Header and Block_Data as Data for new block

Figure 38 Algorithm to add a Block to the Block Chain

Whenever a dataset is accessed by itself, either by a MapReduce job or via the Command Line Interface (CLI), the algorithm shown in Figure 38 is invoked; an example is shown in Figure 39. When inserting a block into a block chain, the first task is to identify which block chain to insert the data into; after identifying the block chain, the algorithm fetches the header of the last block in that chain. Next, the Merkle tree is calculated following the example shown in Figure 36. After the Merkle tree is calculated, the root of the tree is returned, and the header of the block to be inserted is computed by calculating the hash value of the concatenation of the version number, timestamp, size of data, previous header, and root of the Merkle tree. The data of the block to be inserted is computed by encrypting the user information, dataset name, data items accessed and type of access of the dataset. Once the block header and data are calculated, the block is added to the block chain.


Figure 39 Depiction of adding a Block to a Block Chain

Adding a Block in a Block Chain with Data Aggregation

Input : Dataset names (1…. n), data items, user information, type of access

Output : None

If aggregation is performed already then

Prev_hash ← Fetch the header of last block in block chain corresponding to the agg.

Else

Prev_hash ← Compute hash of headers of the genesis blocks in block chain for all datasets (1… n)

End If

Root_Merkle_Tree ← Compute Merkle Tree using dataset names (1… n), data items, user information and type of access and return root of tree

Version_Num ← Little Endian(Hex(Version_Num of last block + 1))

timestamp ← Little Endian(Hex(timestamp))

data_size ← Little Endian(Hex(data size in bytes))

Block_Header ← Hash(Version_Num + timestamp + data_size + Prev_hash + Root_Merkle_Tree)

Block_Data ← Encrypted(Dataset name, data items, user information, type of access), Version_Num, timestamp, data_size, Root_Merkle_Tree

Add Block with Block_Header as Header and Block_Data as Data for new block

Figure 40 Algorithm to add a Block to the Block Chain when Datasets are aggregated


When a user submits a MapReduce job that aggregates n datasets, the algorithm shown in Figure 40 is invoked. Whenever multiple datasets are aggregated, one has to check whether this aggregation has been performed before or is being performed for the first time. If the aggregation is being performed for the first time, then the previous hash value is the hash of all the genesis block headers of the datasets that are being aggregated. If the aggregation has been performed before, then the previous hash value is the header of the last block in the block chain representing that aggregation. Once the previous hash value is determined, the root of the Merkle tree is computed as shown in Figure 41.

Figure 41 Calculating Merkle root when multiple datasets are aggregated

Computing the Merkle tree when datasets are aggregated is similar to how the Merkle tree is computed when datasets are accessed without aggregation, as shown in Figure 36. The major difference is that during the aggregation of datasets the dataset names are hashed until there is one hash value representing all the dataset names. Calculating this for every aggregation can be tedious. It can be optimized by making the blocks in the block chain store this hash value of all dataset names as part of the data in the blocks, so that when datasets are aggregated this information can be fetched from the previous block in the block chain rather than being calculated again.
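A minimal sketch of how the previous hash value could be selected for an aggregated access (the first branch of the algorithm in Figure 40) is given below. Concatenating the genesis headers in a fixed (sorted) order before hashing is an assumption made for illustration.

import hashlib

# Sketch of selecting the previous hash for an aggregated access
# (first branch of the algorithm in Figure 40). Concatenating the genesis
# headers in dataset-name order before hashing is an assumption.

def sha256(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def previous_hash_for_aggregation(aggregation_chain, genesis_headers):
    """aggregation_chain: list of blocks for this aggregation (may be empty).
    genesis_headers: dict mapping dataset name -> genesis block header."""
    if aggregation_chain:
        # The aggregation has been seen before: chain off its last block.
        return aggregation_chain[-1]["header"]
    # First time this aggregation occurs: hash all genesis headers together.
    ordered = [genesis_headers[name] for name in sorted(genesis_headers)]
    return sha256("".join(ordered))

genesis = {"Patient": sha256("Patient-genesis"), "Doctor": sha256("Doctor-genesis")}
print(previous_hash_for_aggregation([], genesis))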

Figure 42 Scenario 1 - Adding a Block to a Block Chain when multiple datasets are aggregated

Figure 42 depicts an example of how a new block is inserted into the block chain when an aggregation of two datasets is performed. Whenever two datasets are aggregated and the aggregation has not been performed before, the previous hash value is computed by hashing all the headers of the genesis blocks representing the datasets being aggregated. Then the same procedure used for the non-aggregated case is followed to calculate the header of the new block.

Figure 43 Scenario 2 - Adding a Block to a Block Chain when multiple datasets are aggregated

Figure 43 depicts an example of how a new block is inserted into the block chain when an aggregation of two datasets is performed and this aggregation has been performed before. Whenever two datasets are aggregated and the aggregation has been performed before, the previous hash value is the header of the last block in the block chain representing that data aggregation. Then the same procedure used for the non-aggregated case is followed to calculate the header of the new block.

Handling Deletion of Datasets

In HDFS datasets can be added and removed. When a dataset is removed from HDFS, the change should be reflected in the block chain as well. The algorithm shown in Figure 44 is invoked whenever a dataset is deleted in HDFS. To handle deletion, the algorithm shown in Figure 44 checks whether the dataset is an actual dataset or whether it was obtained as a result of aggregating multiple datasets.

Input : Dataset name

Output : None

If the dataset is not produced as a result of aggregation of multiple datasets then

Remove the entire block chain representing the dataset

If the dataset is involved in any aggregations then

Recalculate header for all the blocks in the block chain representing the aggregation without including the dataset deleted

End If

Else

Remove the block in chain corresponding to the aggregated dataset being removed

Recalculate header for all the blocks in the block chain (start from block after the deleted block)

Prev_Hash of the block after the block being deleted is initialized to the header of the block before the block being deleted

End If

Figure 44 Algorithm to handle Deletion of data in HDFS

When the dataset is an actual dataset and not formed by aggregation, the algorithm checks whether the dataset is involved in any aggregation operations. If the dataset is not involved in any aggregation operations, then all the blocks in the block chain starting from the genesis block that represents the dataset are removed. If the dataset is involved in an aggregation, then the previous hash value of the block representing the data aggregation is recomputed by hashing all the genesis block headers of the remaining datasets. Once the previous hash value is set, the headers of the rest of the block chain are recomputed one block at a time. If the dataset represents an aggregated dataset, then the block representing the dataset being deleted is identified and removed from the block chain. Let the deleted block be k. The previous hash value of the (k+1)-th block is initialized to the header value of the (k-1)-th block. To understand the functioning of the algorithm when datasets are deleted from HDFS, three scenarios are explained as shown in Figures 45 through 49.

Figure 45 Scenario 1 – Deleting a non-aggregated dataset – Before Deletion

Scenario 1

Consider two datasets in HDFS namely dataset 1 and dataset 2 and let us also assume in this

scenario these datasets are not aggregated. Figure 45 depicts the block chains of these two

datasets 1 and 2, which are used one time each. Now let us assume that dataset 2 is removed from

HDFS. In that case all the blocks in the block chain representing dataset 2 will be removed since

it is not involved in any aggregation. This type of removal is easy, as it does not require any further changes in the block chain. The resulting block chain after the removal of dataset 2 is shown in Figure 46.

Figure 46 Scenario 1 – Deleting a non-aggregated dataset – After Deleting Dataset 2

Figure 47 Scenario 2 – Deleting a non-aggregated dataset – After Deleting Dataset 2


Scenario 2

Consider two datasets in HDFS, namely dataset 1 and dataset 2. Figure 47 depicts the block chains of these two datasets, which are used one time each, and these datasets are aggregated once as well. Now let us assume that dataset 2 is removed from HDFS. In that case all the blocks in the block chain representing dataset 2 will be removed. In this scenario dataset 2 is involved in an aggregation. Let k be the block created as a result of the aggregation of these two datasets. After the removal of dataset 2, the previous hash value of block k must be a hash of all the genesis blocks of the datasets that still exist in HDFS. In this case there is only one other dataset, and hence the previous hash value of block k is set to the header of the genesis block of dataset 1, as shown in

Figure 48.

Figure 48 Scenario 3 – Deleting an aggregated dataset – Before deletion

Scenario 3

Consider two datasets in HDFS, namely dataset 1 and dataset 2. Figure 48 depicts the block chains of these two datasets, which are used one time each, and these datasets are aggregated twice. These aggregations are called aggregations 1 and 2. Now let us assume that aggregation 1 is removed from HDFS. Let k be the block in the block chain that comes after the block to be deleted. The previous hash value of block k should be set to the previous hash value of block k-1. The block chain after removing aggregation 1 is shown in Figure 49.

Figure 49 Scenario 3 – Deleting an aggregated dataset – After deleting Aggregation 1


Validating Provenance Data

The provenance data captured and stored in HDFS must be guaranteed to be correct at all times. Assuming that a malicious user is capable of modifying the contents of a block in the provenance block chain, there must exist a mechanism to identify any violations. These violations must be reported to the administrator, and actions will be taken by the administrator as they deem fit. Violations in the block chain can be detected by comparing the block header with the hash value of the previous block header, version number, timestamp, data size and root of the Merkle tree. The violation detection process can be scheduled to run at specific time intervals. The algorithm shown in Figure 50 is invoked whenever the violation detection process is executed.

For every block chain in the provenance data

For every block in the block chain

If the block is not a genesis block then

If Hash (Prev_hash + Root_Merkle_Tree + data_size + timestamp + Version_Num) == Block_Header then

Move to next block

Else

Alert the administrator

End If

End If

End Loop

End Loop

Figure 50 Algorithm to validate Block Chain

The validation process is scheduled to run at specific intervals rather than running constantly, in order to avoid this process consuming too many of the resources of the Hadoop cluster, so that users are still able to submit jobs and use the cluster productively.
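A minimal sketch of the validation pass in Figure 50 is given below, assuming the simplified block layout used in the earlier sketches; in the proposed framework this check would run over the provenance block chains stored in HDFS and any mismatch would be reported to the administrator.

import hashlib

# Sketch of the validation algorithm in Figure 50, using the block layout
# from the earlier sketches (an assumption). Returns the indices of blocks
# whose recomputed header does not match the stored header.

def sha256(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def find_violations(chain):
    violations = []
    for i, block in enumerate(chain):
        if i == 0:
            continue  # the genesis block has no previous hash to verify
        expected = sha256(f"{block['version']}{block['timestamp']}"
                          f"{block['data_size']}{chain[i - 1]['header']}"
                          f"{block['merkle_root']}")
        if expected != block["header"]:
            violations.append(i)  # report this block to the administrator
    return violations

# Any returned index corresponds to a block whose contents were tampered with.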


CHAPTER VI

PROPOSED CSBAC FRAMEWORK

Problem Statement

Once the data items in the datasets are identified and their sensitivities are estimated, the next step is to provide fine-grained access control based on users' privileges. The challenge in providing such fine-grained access control is making dynamic access control decisions about whether to allow a user to access a dataset or a part of the dataset. If these decisions were user based, then the ACL (Access Control List) would tend to get larger. In a very large organization, maintaining a very large ACL becomes cumbersome, and since the ACLs are large, decisions cannot be made with low overhead. In addition, one must also consider changes in information sensitivity when multiple datasets are aggregated. Example scenarios provided in [32] show that some information may not be very sensitive by itself, but may become sensitive when combined with multiple datasets. A good access control mechanism should address these changes in data sensitivity when datasets are aggregated.

Introduction

The proposed CSBAC framework is an extension of the frameworks proposed in Chapters III and IV. In CSBAC, access control decisions are made based on users' roles. There can be a large number of users in an organization, but these users can be grouped into a handful of roles.

Note that users can have multiple roles; in such cases the proposed CSBAC framework considers only the role of the user that can access the sensitive data items. In the CSBAC framework, the sensitivity of data items is re-estimated whenever multiple datasets are combined and used together. For correct functioning, the CSBAC framework needs the organization-specific Access Policy Document (APD), which specifies the roles of users and the corresponding sensitivity scores of the data items that each role can access. The administrator can amend the APD if needed. The sensitivity score of a data item is expressed as a numerical value.

Like Vigiles [24], CSBAC filters out sensitive information in the results of the MapReduce job in

Hadoop. It also prevents the MapReduce program from using unauthorized data items. A user depending on his or her access rights may not be authorized to access the sensitive data.

The roles and responsibilities of an administrator are as follows: the administrator provides the Access Policy Document (APD) to the framework. This document is important and mandatory for the functioning of the proposed framework.

Related Work

In [2], A. Cavoukian et al. describe how privacy issues are major concerns in big data, and they investigate the role of Attribute-Based Access Control (ABAC) technology in protecting sensitive information from inadvertent or deliberate misuse or abuse of data. A. Cavoukian et al. argue that in ABAC, instead of just comparing the role of the user, many other attributes can be used to grant access to a particular resource, thus providing an effective restriction against the misuse or abuse of resources. They specify how ABAC technology can benefit several sectors like healthcare, insurance, airlines and telecommunications. Their work provided the necessary motivation and direction to use an attribute-based approach for the proposed CSBAC. In the proposed CSBAC framework, user roles and attributes are considered to enforce access control decisions.


Access control for sensitive data in HDFS is proposed by Y.B. Reddy in [22]. This work highlights the fact that the user should be granted permission to access sensitive information only for a limited period of time. In [22] Y.B. Reddy proposes an access control model and compares it with the security schemes in federated systems. The access control model proposed in [22] is solely dependent on the guidelines created by the data owner regarding data sharing.

“Airavat” proposed in [23], is a novel MapReduce based system that combines Mandatory Access

Control and Differential Privacy. Like [22], this system adheres to the security policy specified by the data providers. In addition to this “Airavat” will not be able to guarantee privacy for

MapReduce jobs with malicious mappers generating keys.

H. Ulusoy et al. proposed one of the first systems to enforce fine-grained access control in Hadoop in [24]. The system proposed in [24] is known as "Vigiles", and it does not impose any modifications on the underlying MapReduce system's source code. "Vigiles" acts as a middleware layer between the users and the Hadoop cluster. Reference monitors in the cloud filter the output before sending it to the users to prevent unauthorized access to data. The reference monitors have to adhere to the policies specified for a dataset. These policies are enforced by predicates specified by the administrator. A user can access data only if the corresponding predicate is satisfied.

"Efficient access control mechanism with dynamic policy updating for big data in the cloud" is proposed in [25]. K. Yang et al. propose an efficient access control scheme that allows data owners to change data access policies without the need for re-encrypting and re-transmitting the data. This avoids a heavy communication and computation burden.

Apache Accumulo [42] is a distributed and parallel processing database that supports data ranging from structured to unstructured. Apache Accumulo [42] also offers fine-grained access control and user authentication. The administrator can restrict access down to the cell level in Apache Accumulo. Apache Accumulo [42] is part of the big data ecosystem and works on top of Hadoop. The administrator usually places the access control restrictions in Apache Accumulo [42].

V.C. Hu et al. propose an access control scheme for big data processing in [26]. They propose an authorization component for big data to avoid the misconfiguration of access control policies. The proposed authorization component protects data from insider attacks. The data provider should provide a security class agreement for this authorization component to function.

H. Chen et al. propose a scalable access control model for big data based on multi-labels in [27]. They use multi-labels to protect PHRs (Patient Health Records). Parts of PHRs are sensitive, and there are strict restrictions put forth by laws such as HIPAA on sharing PHRs. In [27], the data owner has to decide on the labels initially, and the administrator can modify them later as necessary.

In [28], Q. Yuan et al. propose a fine-grained access control mechanism for big data in the cloud based on Ciphertext-Policy Attribute-Based Encryption (CP-ABE). The authors claim that this scheme provides fine-grained access control and effectively implements changes made by the data owner to the data sharing policies. This approach eliminates the need for trusted third parties, and the responsibility of authorizing consumers to access data falls solely on the data owners.

In [30], R. Nasim and S. Buchegger propose an eXtensible Access Control Markup Language (XACML) based access control model for data from Online Social Networks (OSNs). OSN data is one type of big data, containing information about users and their behavior. The model proposed in [30] makes use of XACML together with the Security Assertion Markup Language (SAML) and secret-key authentication. Every request must be authorized by the data owner at the time it arises, which preserves the privacy of users and their OSN data.


C. Rong et al. [43] combined the idea of virtual pieces from BitTorrent [44] and secure sharing over the cloud [45] to implement an access control scheme for data stored in Hadoop. The scheme proposed in [43] encrypts the block-level metadata generated by Hadoop and distributes it as a torrent file. Any consumer wanting to access the data must download the torrent file and decrypt the metadata using a shared key in order to access the data blocks.

All these research works implement some form of access control mechanism to protect sensitive information. Nonetheless, in all of these methodologies except [43], the administrator or data owner must explicitly specify how the data is to be shared. In other words, the administrator and/or the data provider determine the data sensitivity. In contrast, the proposed CSBAC framework uses the data itself to determine its sensitivity.

Content Based Access Control (CBAC) was proposed by W. Zeng et al. in [29]. The CBAC framework uses the data content to make access control decisions. However, it requires a base set against which sensitivity is compared. The base set consists of a number of datasets, and a number of users are given access to these datasets. When a new dataset (D2) similar to a dataset (D1) in the base set is stored in HDFS, all the users who are able to access D1 are also able to access D2. Data similarity is determined by a top-k similarity measure. CBAC captures semantic relatedness through content similarity; however, semantic relationships cannot be completely identified using content similarity alone [46]. Some semantic relationships can also be identified from data usage. The proposed CSBAC framework addresses this by using both data content and usage to identify similar content.

In [31], the authors of the CBAC framework use it on top of existing Role-Based Access Control (RBAC) and Multi-Level Security (MLS) schemes. They explain how CBAC can work in tandem with these existing security schemes to provide access control decisions based on data content. The similarity between structured and semi-structured datasets is computed by Equation 26 [19, 31].

$sim(r_1, r_2) = \sum_{x} w_x \cdot sim_{a_x}(r_1.a_x,\ r_2.a_x)$    (26) [31]

In Equation 26, $sim_{a_x}$ represents a normalized similarity function defined on the specific domain of attribute $a_x$, and $w_x$ is the weight of the attribute. For unstructured data, the authors in [19, 31] compute the similarity measure using Equation 27 [31].

$sim(d_i, d_j) = \dfrac{d_i \cdot d_j}{\lVert d_i \rVert \, \lVert d_j \rVert}$    (27) [31]

In Equation 27, $d_i$ represents a record in a dataset, which can be expressed as an array of TF-IDF weights of terms as shown in Equation 28 [31].

$d_i = [\,w_{1,i},\ w_{2,i},\ w_{3,i},\ \ldots,\ w_{t,i}\,]$    (28) [31]

In Equation 28, $w_{t,i}$ represents the TF-IDF weight of a term t in record i; $w_{t,i}$ can be computed using Equation 29.

$w_{t,i} = tf_{t,i} \cdot \log \dfrac{N}{df_t}$    (29) [31]

In Equation 29, $tf_{t,i}$ represents the term frequency of a term t in record i (the number of times the term occurs in the record), $df_t$ is the number of records in the dataset that contain the term t, and N is the total number of records in the dataset [31]. The work in [29] and [31] has two major drawbacks. First, it depends on a base dataset: steps to identify a good base dataset are not discussed, the scheme fails if there is no base dataset, and defining and implementing base datasets to cover all possible types and combinations of data is simply impossible. Second, these works do not factor in the change in data sensitivity when multiple datasets are combined; the CBAC framework therefore fails to address problems similar to Scenarios 1 and 2 discussed in Section I. To be universally applicable, access control must be determined from the data itself rather than by comparison with a base dataset (if one exists). The proposed framework is the first to use the data itself to estimate its sensitivity and to make access control decisions based on it. The framework is therefore applicable to any dataset, aggregated or not, and there is no need to define a base dataset. The goal of the proposed framework is to keep sensitive information out of reach of unauthorized users.
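Before turning to the proposed framework, the content-similarity measure of Equations 27 through 29 can be made concrete with a short Python sketch; the toy records and tokenization below are illustrative assumptions and are not taken from [31].

import math
from collections import Counter

def tfidf_vectors(records):
    # Equation 29: w_{t,i} = tf_{t,i} * log(N / df_t) for every term t in record i.
    n = len(records)
    df = Counter()
    for rec in records:
        df.update(set(rec))
    return [{t: tf * math.log(float(n) / df[t]) for t, tf in Counter(rec).items()}
            for rec in records]

def cosine_similarity(di, dj):
    # Equation 27: cosine of the angle between two TF-IDF weight vectors (Equation 28).
    dot = sum(w * dj.get(t, 0.0) for t, w in di.items())
    norm = math.sqrt(sum(w * w for w in di.values())) * math.sqrt(sum(w * w for w in dj.values()))
    return dot / norm if norm else 0.0

# Three toy "records" represented as token lists.
vectors = tfidf_vectors([["patient", "name", "diagnosis"],
                         ["patient", "visit", "date"],
                         ["tweet", "text", "date"]])
print(cosine_similarity(vectors[0], vectors[1]))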

Content Sensitivity Based Access Control (CSBAC) Framework

The CSBAC Framework is an extension of the SDD (Sensitive Data items Detection) framework proposed in [21]. In [21], the SDD framework identifies whether individual data items in a dataset are sensitive or not. The sensitive data items are constantly reported to the administrator, and the onus is on the administrator to ensure the protection of sensitive data. In addition, SDD does not account for variations in data sensitivity when multiple datasets are combined. The CSBAC framework uses some of the components proposed in the SDD framework along with the information gain model described in Section IV to identify sensitive items. The sensitivity of data items is re-estimated whenever multiple datasets are combined and used together. To function correctly, the CSBAC framework needs the organization-specific Access Policy Document (APD), which specifies the role of a user and the corresponding sensitivity score of the data items that the user can access. The administrator can amend the APD if needed. The sensitivity score of a data item is expressed as a numerical value. Like Vigiles [14], CSBAC filters out sensitive information in the results of the MapReduce stage in Hadoop; depending on his or her access rights, a user may not be authorized to access the sensitive data. The roles and responsibilities of the administrator are as follows: the administrator provides the Access Policy Document (APD) and the default values for the various data types to the framework. This document is mandatory for the functioning of the proposed framework. The architecture of the proposed framework is shown in Figure 51.
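As an illustration of the APD described above, a minimal sketch is given below; the role names, sensitivity scores and per-type default values are hypothetical and merely indicate the kind of information the administrator supplies to the framework.

# Hypothetical Access Policy Document (APD): each role is mapped to the sensitivity
# score Sr of the data items that role may access, and each data type is mapped to
# the default value used when a data item has to be masked.
APD = {
    "roles": {
        "administrator": 0.0,
        "physician": 2.5,
        "billing_clerk": 6.0,
    },
    "defaults": {
        "string": "N/A",
        "integer": -1,
        "date": "0000-00-00",
    },
}

def role_sensitivity(role, apd=APD):
    # Look up the sensitivity score Sr associated with a user role.
    return apd["roles"][role]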


Figure 51 Architecture of CSBAC Framework

Access Control Enforcer (ACE)

The Access Control Enforcer (ACE) is the central component of the proposed CSBAC Framework. The ACE receives user requests and returns the results to the users; the results contain only the data that the user is authorized to view or access. The ACE can handle user requests that are either a MapReduce job submitted for execution or a shell command for accessing a dataset.

An Access Control Rule denotes whether or not a user is entitled to access a piece of information. A simple Access Control Rule [19, 31] is denoted as shown in Equation 30.

$\langle user,\ data,\ action,\ decision \rangle$    (30)

In Equation 30, action denotes the operation that the user wants to perform on the data, and decision represents a Boolean value. The decision is true if the user is allowed to perform the action on the data and false otherwise. Equation 30, presented in [19, 31], is used to represent the access control rules in CSBAC. Since the native Hadoop implementation [10] only allows users to create, delete, read and append datasets, the parameter 'action' in Equation 30 is limited to read, append and delete. In CSBAC, only the owner of a dataset or the administrator is allowed to delete a dataset.

A user u can access all the data items with a sensitivity score greater than or equal to the sensitivity score Sr corresponding to the role of user u, as described in the APD. Access control decisions (ACDs) are made dynamically based on data sensitivity, as shown in Equation 31.

$ACD(U_r, i) = \begin{cases} true, & \text{if } S_i \ge S_r \\ false, & \text{if } S_i < S_r \end{cases}$    (31)

In Equation 31, Ur represents a user U with role r, Si represents the sensitivity of a data item i, and Sr represents the sensitivity score of the data items that the user is authorized to access. A user is allowed to access a dataset only if the ACD for the action is true. Enforcement of the ACDs by the proposed ACE occurs at two levels, namely the MapReduce level and the command-line level used to access HDFS.
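As a minimal sketch, the decision rule of Equation 31 reduces to a single comparison; the numeric scores below are assumed to come from the DSE and the APD.

def access_control_decision(s_i, s_r):
    # Equation 31: a user whose role has sensitivity score s_r may access a data
    # item of sensitivity s_i only when s_i >= s_r.
    return s_i >= s_r

# Example: an item with sensitivity 7.2 is visible to a role with s_r = 6.0
# but not to a role with s_r = 8.0.
print(access_control_decision(7.2, 6.0))   # True
print(access_control_decision(7.2, 8.0))   # False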

Enforcement at Command Line Level

The algorithm shown in Figure 52 shows the sequence of events whenever a dataset stored in HDFS is read via the command line (FS Shell). When a user tries to read a dataset using FS Shell, an InputStream object is created for the input file paths. The InputStream object cannot be created when the file path does not exist or when the user does not have permission to access the file. If the verify checksum flag is set, a checksum is computed for the file path and compared against the checksum stored in HDFS. If the checksums match, the file has not been altered and the contents are printed on the standard output device. To prevent unauthorized access to data items or datasets via the command line, the proposed ACE implements the algorithms shown in Figures 53, 54 and 55. Algorithm 3, shown in Figure 55, is a modification of the algorithm in Figure 52. Algorithm 3 allows users to view a subset of a dataset, whereas in a traditional Hadoop implementation the user gets to view either the entire dataset or nothing at all [10].

Input : Input File Path, verifyChecksum

Output : Contents of a dataset

try:

InputStream InputStream (InputFilePath)

If verifyChecksum == true then

Check if the checksum of the file in input path is valid

If checksum is invalid then

Throw an exception & return

Else

Print(InputStream, std.out)

End If

Else

Print(InputStream, std.out)

End If

catch:

Print(Exception, std.err)

Figure 52 Algorithm for reading a dataset via command line in Hadoop [10]

Algorithm 1, shown in Figure 53, accepts the minimum sensitivity of a user role given in the APD and the sensitivity of a dataset computed by the DSE, and returns either true or false. If the minimum sensitivity Sr of the user role is less than the sensitivity of the dataset Sd, the algorithm returns true, which implies that the user can access the required dataset. In other words, a user role with sensitivity Sr is allowed to access all datasets whose sensitivity is greater than Sr. If the condition mentioned above is not satisfied, the user is not able to access the dataset.

Input: User Sensitivity (Sr), Dataset Sensitivity (Sd)

Output: True/False

If Sr < Sd then

return true

Else

return false

End If

Figure 53 Algorithm 1

Input : User Sensitivity (Sr), FSDataInputStream (f), Dataset (d)

Output : Sanitized Records

sanitizedRecords ← ""

for record in f:

sanitizedRecord ← ""

for attribute in record:

Sattr ← getDataItemSensitivity(attribute, d)

if Sr < Sattr then

sanitizedRecord += attribute

else

sanitizedRecord += defaultValue(attribute_data_type)

end if

end for

sanitizedRecords += sanitizedRecord

end for

return sanitizedRecords

Figure 54 Algorithm 2


Algorithm 2 in Figure 54 is responsible for displaying only the portion of a dataset that a user is authorized to view. The algorithm accepts the FSDataInputStream, the minimum sensitivity of the user role given in the APD and the dataset name, and returns a set of sanitized records. The sanitized records contain only the data attributes (data items/columns) whose sensitivity Sattr is greater than the sensitivity of the user role Sr; the data attributes whose sensitivity does not exceed Sr are replaced with the default value for that data type specified by the administrator. Data is read from the original dataset using the FSDataInputStream object and, before being displayed to the user, the attributes whose sensitivity value estimated by the DSE does not exceed the minimum sensitivity of the user role are replaced with the default values specified by the administrator. Data items that do not satisfy the aforementioned condition are returned "as-is".
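A minimal Python sketch of this sanitization step is shown below; it mirrors Algorithm 2, but the comma-separated record layout, the per-attribute sensitivity table and the default values are illustrative assumptions standing in for the DSE output and the APD.

def sanitize_records(lines, s_r, item_sensitivity, defaults):
    # Keep attributes whose sensitivity exceeds the role sensitivity s_r and mask
    # the rest with the default value for their data type (Algorithm 2, Figure 54).
    sanitized = []
    for line in lines:
        out = []
        for position, attribute in enumerate(line.rstrip("\n").split(",")):
            s_attr, data_type = item_sensitivity[position]
            if s_r < s_attr:
                out.append(attribute)                     # user may see this item
            else:
                out.append(str(defaults[data_type]))      # replace with the default
        sanitized.append(",".join(out))
    return sanitized

# Hypothetical example: column 0 (name) has sensitivity 3.0, column 1 (age) has 8.0.
records = ["alice,34", "bob,57"]
sensitivity = {0: (3.0, "string"), 1: (8.0, "integer")}
print(sanitize_records(records, 6.0, sensitivity, {"string": "N/A", "integer": -1}))
# ['N/A,34', 'N/A,57']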

Algorithm 3 in Figure 55 is called whenever a user wants to read a dataset via the command line. The algorithm determines the minimum sensitivity for the user role using the APD and the sensitivity of the dataset based on the results of the DSE. Algorithm 3 invokes Algorithm 1 to determine whether the user is allowed to access the dataset and, if so, Algorithm 2 to retrieve the sanitized data. The Data Usage Tracker is notified with all the relevant information about the user and the dataset after the data is displayed to the user.


Input : Input File Path, verifyChecksum

Output : Contents of a dataset

try:

InputStream ← InputStream(InputFilePath)

sr ← getUserRoleSensitivity(determineUser(), APD)

sd ← getDatasetSensitivity(InputFilePath)

If verifyChecksum == true then

Check if the checksum of the file in input path is valid

If checksum is invalid then

Throw an exception & return

Else

If Algorithm_1(sr, sd) then

Print(Algorithm_2(sr, InputStream, InputFilePath), std.out)

End If

End If

Else

If Algorithm_1(sr, sd) then

Print(Algorithm_2(sr, InputStream, InputFilePath), std.out)

End If

End If

catch:

Print(Exception, std.err)

Figure 55 Algorithm 3 - Modified Algorithm for reading a dataset via command line in ACE
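The command-line read path of Algorithm 3 can be summarized in a few lines of Python; the helper names (open_stream, notify_usage_tracker and the sensitivity parameters) are hypothetical stand-ins for the FSDataInputStream wrapper, the Data Usage Tracker, the APD and the DSE, and sanitize_records refers to the Algorithm 2 sketch given earlier.

def read_dataset(path, open_stream, s_r, s_d, item_sensitivity, defaults,
                 notify_usage_tracker):
    # Algorithm 3: gate the read with the dataset-level check of Algorithm 1,
    # sanitize the stream with Algorithm 2, then notify the Data Usage Tracker.
    if not (s_r < s_d):                  # Algorithm 1 returns false: no access
        return []
    with open_stream(path) as stream:
        visible = sanitize_records(stream, s_r, item_sensitivity, defaults)
    notify_usage_tracker(path, s_r)      # record who accessed which dataset
    return visible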

Enforcement at MapReduce Level

When a MapReduce job is submitted to the Resource Manager (RM), the RM checks whether the user has sufficient permissions to access the input dataset(s). After determining that the user is authorized to access the dataset, the input dataset is divided into a number of input splits based on the split size. Each input split is then transformed into key/value pairs by the RecordReader class before being passed to the map function, as shown in Figure 56. initialize() is called once per input split to initialize a RecordReader object. The method nextKeyValue() is called to read each record in the input split and create a key/value pair; it returns false when there are no records left to process in the input split. getCurrentKey() and getCurrentValue() are called to get the key/value pair for the record being processed, and these key/value pairs are sent to the mapper using the sendToMap() method. This process is depicted in Figure 56.

Input : Data Split

Output : <key, value> pairs for mapper

initialize()

while nextKeyValue() == true

key ← getCurrentKey()

value ← getCurrentValue()

sendToMap(key,value)

end while

Figure 56 Input preprocessing in Hadoop [10, 14]

In the proposed ACE, whether the user has sufficient permissions to access a dataset before running a MapReduce job is verified using Algorithm 1 in Figure 53. Once the ACE establishes that the user has sufficient permissions to access the dataset, the pre-processing of the data is implemented based on the algorithm shown in Figure 57. During the preprocessing of the input data, the proposed ACE determines the minimum sensitivity for the user role based on the APD and identifies the dataset from the input file path (obtained from the JobConf). For every record processed by the RecordReader class, before the key/value pair is sent to the mapper, the sensitivities of the key and the value (SKey and SValue) are checked against the sensitivity of the user role (Sr). If the sensitivity of the key SKey is greater than the sensitivity of the user role Sr, the key is sent to the mapper "as-is"; otherwise the key is substituted with the default value specified by the administrator for the corresponding data type and then sent to the mapper by the sendToMap() method. The same process is repeated for the value before it is sent to the mapper.

Input : Data Split, configuration

Output : <key, value> pairs for mapper

initialize()

srgetUserRoleSensitivity(determineUser(), APD)

d getDatasetInfo(InputFilePath)

while nextKeyValue() == true

key getCurrentKey()

value getCurrentValue()

skey getDataItemSensitivity(key, d)

svalue getDataItemSensitivity(value, d)

if sr >= skey then

key ← defaultValue(key_data_type)

end if

if sr >= svalue then

value ← defaultValue(value_data_type)

end if

sendToMap(key,value)

end while

Figure 57 Input preprocessing in ACE [10, 14]
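Outside of the RecordReader class, the masking loop of Figure 57 can be sketched as a plain Python generator; get_item_sensitivity and the per-type defaults are hypothetical placeholders for the DSE lookup and the APD defaults.

def masked_key_values(pairs, s_r, get_item_sensitivity, defaults):
    # Mirror Figure 57: a key or value whose sensitivity does not exceed the role
    # sensitivity s_r is replaced by the default value for its data type.
    for key, value in pairs:
        if s_r >= get_item_sensitivity(key):
            key = defaults[type(key).__name__]
        if s_r >= get_item_sensitivity(value):
            value = defaults[type(value).__name__]
        yield key, value

# Illustrative call with a constant item sensitivity of 8.0: since 8.0 exceeds the
# role sensitivity 6.0, every key and value passes through unchanged.
pairs = [("alice", 34), ("bob", 57)]
print(list(masked_key_values(pairs, 6.0, lambda item: 8.0, {"str": "N/A", "int": -1})))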


Input : Datasets D1 … Dn

D ← NULL

for each dataset Di from the input

D ← D ∪ {Data Items in Di}

end for

If D has occurred before

Use the sensitivity estimated before for each data item

Else

Estimate the effect of the set ‘D’ on the system

Store the sensitivity estimated by DSE for future use

Figure 58 Re-estimation of Sensitivity

In the case of scenarios like 1 and 2 presented in Section I, the sensitivities of data items vary when datasets are joined. Multiple datasets are joined when they are passed as input paths to a MapReduce job; in this case the ACE implements the algorithm shown in Figure 58. The ACE can handle multiple requests from several users and makes use of the Fair Scheduler [37].
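A minimal sketch of the re-estimation step in Figure 58 is given below; estimate_sensitivity stands in for the DSE, and a frozenset of data item names is used as the cache key so that a previously seen combination of datasets is not re-estimated.

_sensitivity_cache = {}

def joined_sensitivity(datasets, estimate_sensitivity):
    # Union the data items of the joined datasets (the set D in Figure 58); reuse a
    # cached estimate if this combination has occurred before, otherwise ask the
    # DSE and store the result for future use.
    combined = frozenset(item for d in datasets for item in d)
    if combined not in _sensitivity_cache:
        _sensitivity_cache[combined] = estimate_sensitivity(combined)
    return _sensitivity_cache[combined]

# Illustrative call: data items are attribute names, the DSE is a stub here.
print(joined_sensitivity([{"name", "dob"}, {"name", "diagnosis"}],
                         lambda items: {i: 5.0 for i in items}))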


Experimental Results

Data Used

To evaluate the correctness and overhead of the proposed framework, three types of datasets (relational/structured, semi-structured and unstructured) are used in the experiments. The synthetic patient data generator "Synthea" proposed in [180] is used to generate the structured data. Synthea generates high-quality, realistic patient data modeled on the top 10 leading causes of years of life lost from the United States Institute for Health Metrics and Evaluation (IHME) [181]. The dataset generated by Synthea is relational (structured). A medical history is associated with each patient based on the relational data, and this dataset is unstructured. Data is also collected from Twitter using the real-time streaming API [182]; the Twitter data obtained is in JSON format and constitutes the semi-structured data. All three datasets are divided into sizes of 20GB, 40GB, 60GB, 80GB, 100GB, 120GB, 140GB, 160GB, 180GB, 200GB and 220GB.

MapReduce Jobs Used

The datasets are accessed via the command line and also by three MapReduce jobs. The first MapReduce job filters all the data corresponding to a patient (structured and unstructured datasets) or a Twitter handle (semi-structured dataset). The second MapReduce job counts the number of records for each patient (structured and unstructured datasets) or Twitter handle (semi-structured dataset). The third MapReduce job calculates the average elapsed time between hospital visits for every patient (structured and unstructured datasets) or the average elapsed time between tweets for every user (semi-structured dataset). The mappers and reducers for these jobs were implemented in Python (version 2.7) and executed in Hadoop using the Streaming API [183].
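For illustration, a Hadoop Streaming mapper for the second job (counting records per patient) might look like the sketch below; the tab-delimited layout and the position of the patient identifier are assumptions, not the exact scripts used in the experiments. The matching reducer simply sums the emitted counts per key.

# mapper.py - emit (patient_id, 1) for every input record read from standard input
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if fields and fields[0]:
        print("%s\t1" % fields[0])   # assume the patient id is the first field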


Results

Figure 59 Overhead posed by the CSBAC Framework when accessing Data via CLI

The overhead imposed by the proposed CSBAC Framework when accessing data via the command-line interface (CLI) is shown in Figure 59. The overhead shown in Figure 59 is the time taken by the proposed framework to display 100 records/lines on the standard output device.


Figure 60 Running time of MR jobs on the structured data

Figure 61 Running time of MR jobs on the unstructured data


Figure 62 Running time of MR jobs on the semi-structured data

In Figures 60, 61 and 62, the keys M1, M2 and M3 stand for the three MapReduce jobs discussed above. Figure 59 depicts the overhead imposed by the framework when accessing datasets via the command line. From Figures 60, 61 and 62 it can be observed that the overhead for accessing structured data is the lowest and the overhead for accessing unstructured data is the highest. Since a structured dataset follows a specific format, it is sufficient to identify the data items (attributes) and fetch their sensitivity once in order to determine which data items the user is authorized to access; once the data items the user is authorized to view have been identified for a single row, the result is unchanged for the rest of the dataset. In an unstructured dataset the structure is non-uniform and varies from record to record. Hence the data items have to be identified for each record, and their sensitivity must then be compared with the user's sensitivity to check whether the user is authorized to access each data item. Since this process is repeated for every record, the overhead is high. The overhead for accessing semi-structured datasets via the command line falls between the overheads for the structured and unstructured datasets. For every record in a semi-structured dataset, the tags (in XML files) or keys (in JSON files) are identified and their sensitivity is checked against the user's sensitivity. Even though the data items are identified for every record in the semi-structured dataset, as in the unstructured dataset, the identification process itself is simpler because the tags/keys correspond to the data item names in the semi-structured dataset. Thus the overhead for the semi-structured dataset is lower than for the unstructured dataset. In Figure 61, the running time of the jobs on semi-structured data increases at a slower rate with increasing data size than on structured data. The processing time increases at a faster rate for structured data because the entire record in structured data has to be processed to obtain the desired results, whereas in semi-structured data the required information can be selected by simply performing a tag search. Thus jobs running on a large structured dataset take longer to complete. The mappers used in our MapReduce jobs use the "json" library that is shipped with Python to parse JSON data. A detailed benchmarking analysis of various Python libraries for parsing JSON data is presented in [184], in which A. Krylysov identifies that this library takes a relatively long time to parse small JSON data; this is the reason for the higher running times of jobs processing the smaller semi-structured datasets. From Figure 60, it can be seen that it takes more time to run job 3 than job 1 on the structured dataset, as MapReduce job 3 has to compute the average whereas MapReduce job 1 just prints the records matching a patient name. The running time of MapReduce job 1 on the unstructured dataset is higher because it emits more data to the reducer than the other two jobs, as can be seen in Figure 61. The running time of the MapReduce jobs on the semi-structured dataset is shown in Figure 62 and follows the same pattern as the structured dataset.


Figure 63 Overhead posed by CSBAC for Job 1 on Structured data

Figure 64 Overhead posed by CSBAC for Job 2 on Structured data


Figure 65 Overhead posed by CSBAC for Job 3 on Structured data

Figure 66 Overhead posed by CSBAC for Job 1 on unstructured data


Figure 67 Overhead posed by CSBAC for Job 2 on unstructured data

Figure 68 Overhead posed by CSBAC for Job 3 on unstructured data


Figure 69 Overhead posed by CSBAC for Job 1 on semi-structured data

Figure 70 Overhead posed by CSBAC for Job 2 on semi-structured data


Figure 71 Overhead posed by CSBAC for Job 3 on semi-structured data

Figures 63 through 71 compare the running times of MapReduce jobs on the native Hadoop implementation and on the proposed CSBAC framework for the different types of textual datasets. On average, the proposed framework imposes an overhead of about 6.5%; this is the tradeoff for protecting vulnerable data items from misuse or abuse. Figures 63, 64 and 65 compare the running times of MapReduce jobs 1, 2 and 3 on the structured dataset. Figures 66, 67 and 68 compare the running times of MapReduce jobs 1, 2 and 3 on the unstructured dataset. Figures 69, 70 and 71 compare the running times of MapReduce jobs 1, 2 and 3 on the semi-structured dataset.


Figure 72 Overhead Analysis when multiple datasets are joined

Figure 72 analyzes the additional overhead imposed by the proposed CSBAC framework when multiple datasets are combined. To illustrate this scenario, we run a MapReduce job that counts the number of entries in the diagnosis history of each patient; this is achieved by joining the relational dataset and the unstructured dataset. From our experiments and results we find that the CSBAC framework imposes an additional 5-6% overhead when datasets are joined. The combined sensitivity estimate is stored by the DSE and re-used whenever these datasets are joined again in MapReduce jobs.


Comparison with CBAC

Figure 73 Comparison between CBAC and CSBAC with no base set

The performance of the proposed framework is compared with that of the CBAC framework in identifying sensitive data items in a dataset. The CBAC framework proposed in [19, 31] uses a top-k similarity algorithm and requires a base set to function. To compare the proposed CSBAC framework with the CBAC framework proposed in [19, 31], two scenarios are considered, namely 1) no base set and 2) a partial base set. Since constructing a complete base set covering all scenarios is impossible, that scenario is not considered. When there is no base set, CBAC performs poorly and does not identify any potentially sensitive data items because there is no dataset for the scheme to compare against. This is evident in Figure 73, where CBAC identified no sensitive data items. The proposed CSBAC framework functions well and identifies sensitive data items even without a base set: it identifies about 67%, 40% and 67% of the sensitive data items in the patient, diagnosis and Twitter datasets respectively. These percentages are derived by comparing the results of the proposed framework and CBAC with a manual identification process.


Figure 74 Comparison between CBAC and CSBAC with a partial base set.

When there is a partial base set, CBAC performs relatively well but not as well as the proposed framework, and it does not identify all the potentially sensitive data items. In Figure 74, the x-axis labels (a) through (d) refer to four simulation scenarios, as listed below. Label a denotes the number of sensitive data items identified when the partial base set contains the patient dataset and the sensitivity is estimated for a join of the patient and Twitter datasets. Since the two datasets do not contain any similar items, the top-k similarity algorithm proposed in [19, 31] (CBAC) does not identify any sensitive data items, whereas the proposed CSBAC algorithm identifies 5 sensitive data items. Label b denotes the number of sensitive data items identified when the partial base set contains the patient dataset and the sensitivity is estimated for a join of the patient and diagnosis datasets. Since both datasets contain the patient name, the top-k similarity algorithm proposed in [19, 31] (CBAC) identifies 2 sensitive data items, whereas the proposed CSBAC algorithm identifies 7 sensitive data items. Label c denotes the number of sensitive data items identified when the partial base set contains the patient dataset and the sensitivity is estimated for a join of the patient and Twitter datasets. Since the two datasets do not contain any similar data items, the top-k similarity algorithm proposed in [19, 31] (CBAC) identifies no sensitive data items, whereas the proposed CSBAC algorithm identifies 2 sensitive data items. Label d denotes the number of sensitive data items identified when the partial base set contains the patient dataset and the sensitivity is estimated for a join of the patient, Twitter and diagnosis datasets. Since the patient and diagnosis datasets contain the patient name, the top-k similarity algorithm proposed in [19, 31] (CBAC) identifies 2 sensitive data items, whereas the proposed CSBAC algorithm identifies 9 sensitive data items. For scenarios a and c, the proposed CSBAC framework identifies 55% and 22% of the sensitive data items, whereas the CBAC framework does not identify any sensitive data items at all. For scenarios b and d, the proposed CSBAC framework identifies 63% and 64% of the sensitive data items, whereas the CBAC framework identifies 18% and 14% of the sensitive data items respectively. These percentages are derived by comparing the results of each framework with a manual identification process. One should note that the number of data items identified by the proposed CSBAC framework in Figure 74 takes into account the variation in data sensitivity when multiple datasets are joined; the number of sensitive data items identified by the proposed framework will change as users themselves access the datasets. The CBAC framework proposed in [19, 31] did not take this variation in data sensitivity into account. Furthermore, effort is expended in creating the base dataset for CBAC, and there is no way to measure the 'goodness' or 'effectiveness' of the base dataset. This is not the case for the proposed scheme, as no base set is needed.


CHAPTER VII

CONCLUSION AND FUTURE WORK

In this dissertation a Content Sensitivity Based Access Control (CSBAC) Framework for Hadoop is proposed. This framework makes use of the data itself to make access control decisions. To the best of our knowledge, using the data to estimate its sensitivity and enforcing access control policies on that basis is novel for Hadoop. In addition, the CSBAC Framework captures changes in sensitivity when multiple datasets are joined. CSBAC is an automated framework requiring minimal user intervention, which saves much of the effort otherwise expended by the data owner and the administrator of the Hadoop cluster. The CSBAC Framework adheres to all seven principles of Privacy by Design (PbD) [32]. CSBAC is a hybrid access control framework, as it uses both attributes (to estimate sensitivity) and the user role to make and enforce access control decisions. There is an overhead imposed by the proposed framework, but it is a tradeoff for keeping sensitive information out of reach of unauthorized users; by doing so, data misuse or abuse can be prevented before it happens. In future work, different methods such as Bayesian networks and cognitive computing will be used in place of information gain to estimate data sensitivity, and the results of these methods will be analyzed.


REFERENCES

[1] Schmidt, E., (2014). Tech-crunch article on data production, http://techcrunch.com/2010/08/04/schmidt-data/, [Accessed 09-22-2014].

[2] Cavoukian, A., Chibba, M., Williamson, G., Ferguson, A., (2015). The Importance of ABAC: Attribute-Based Access Control to Big Data: Privacy and Context, http://www.ryerson.ca/content/dam/pbdi/Resources/The%20Importance%20of%20ABAC %20to%20Big%20Data%2005-2015.pdf [Accessed 08-25-2015]

[3] Sweeney, L., (2000). Simple demographics often identify people uniquely." Carnegie Mellon University, Data Privacy Working Paper 3, Pittsburgh (2000), pp: 1-34.

[4] Tene, O., Polonetsky, J., (2012). Big data for all: Privacy and user control in the age of analytics. 11 Nw. J. Tech. & Intell. Prop. Vol: 11, No: 5 (2012), pp: 239-273

[5] Nakashima, E., (2006).AOL Takes Down Site With Users' Search Data, Washington Post, http://www.washingtonpost.com/wp- dyn/content/article/2006/08/07/AR2006080701150.html, [Accessed 09-23-2014].

[6] Narayanan, A., Vitaly, S., (2008). Robust de-anonymization of large sparse datasets. In IEEE Symposium on Security and Privacy, pp. 111-125. IEEE, 2008.

[7] Aggarwal, C., C., (2005). On k-anonymity and the curse of dimensionality. In Proceedings of the 31st international conference on Very large databases, August 2005, pp: 901-909. VLDB Endowment, 2005.

[8] Chawla, S., Dwork, C., McSherry, F., Smith, A., and Wee, H., (2005). Toward privacy in public databases. In Theory of Cryptography, pp. 363-385. Springer Berlin Heidelberg, 2005.

[9] Sweeney, L., (2002). Achieving k-anonymity privacy protection using generalization and suppression. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, Vol: 10, No. 05 (2002), pp: 571-588.

[10] Simons, B., Spafford, E., H., (2003). Letter from ACM to Senate Committee on Armed Services, The Electronic Frontier Foundation, https://w2.eff.org/Privacy/TIA/acm- letter.php, [Accessed 09-06-2015].


[11] Heffner, K., Popkin, R., Reem, A., Kannan, A., Report on Data Aggregation, The Electronic Frontier Foundation, http://www.eecs.harvard.edu/cs199r/bd- aggregation/kheffner_etal.pdf, [Accessed 09-06-2015]

[12] Hadoop Implementation, http://Hadoop.apache.org/docs/ [Accessed 09-01-2015].

[13] HDFS Permissions Guide, http://Hadoop.apache.org/docs/stable/Hadoop-project- dist/Hadoop-hdfs/HdfsPermissionsGuide.html [09-01-2015].

[14] Cloud security Alliance, (2013). Top Ten Big Data Security and Privacy Challenges, https://downloads.cloudsecurityalliance.org/initiatives/bdwg/Expanded_Top_Ten_Big_Dat a_Security_and_Privacy_Challenges.pdf [Accessed 09-06-2015]

[15] The Sarbanes-Oxley Act, http://www.soxlaw.com/ [Accessed 09-06-2015]

[16] The HIPAA Act, http://www.hhs.gov/ocr/privacy/ [Accessed 09-06-2015]

[17] Krishnan, R., Access Control and Privacy Policy Challenges in big data, Position Paper, NSF Workshop on Big Data Security and Privacy, http://csi.utdallas.edu/events/NSF/papers/paper10.pdf [Accessed 09-06-2015]

[18] Ashwin Kumar, T. K., Liu, H., Thomas, J. P., (2014). Efficient structuring of data in big data. In International Conference on Data Science & Engineering 2014, pp. 1-5.

[19] Ashwin Kumar, T. K., Liu, H., Thomas, J. P., Mylavarapu, G., (2015). Identifying Sensitive Data Items within Hadoop. In 1st IEEE International Conference on Big Data Security on Cloud (BigDataSecurity 2015), 2015, pp. 1308-1313.

[20] Liu, H., Ashwin Kumar, T. K., Thomas, J. P., (2015). Cleaning Framework for Big Data - Object Identification and Linkage. In IEEE International Congress on Big Data (BigData Congress), 2015, pp. 215-221.

[21] General Atomics, (2015). Tackling the Big Data Deluge in Science With Metadata, White Paper, http://whitepapers.insidehpc.com/?option=com_categoryreport&task=viewabstract&pathw ay=no&title=44242 [Accessed 09-06-2015]

[22] Reddy, Y. B., (2013). Access control for sensitive data in hadoop distributed file systems. In Third International Conference on Advanced Communications and Computation, 2013, pp. 17-22.

[23] Roy, I., Setty, S. T., Kilzer, A., Shmatikov, V., Witchel, E. (2010). Airavat: Security and Privacy for MapReduce. In NSDI vol. 10, 2010, pp. 297-312.

[24] Ulusoy, H., Kantarcioglu, M., Pattuk, E., Hamlen, K. (2014). Vigiles: Fine-grained access control for MapReduce systems. In IEEE International Congress on Big Data (BigData Congress), 2014, pp. 40-47.


[25] Yang, K., Jia, X., Ren, K., Xie, R., & Huang, L. (2014). Enabling efficient access control with dynamic policy updating for big data in the cloud. In Proceedings of INFOCOM, 2014, pp. 2013-2021. IEEE.

[26] Hu, V. C., Grance, T., Ferraiolo, D. F., Kuhn, D. R. (2014). An Access Control scheme for Big Data processing. In International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom), 2014, pp. 1-7. IEEE.

[27] Chen, H., Bhargava, B., & Zhongchuan, F. (2014). Multilabels-Based Scalable Access Control for Big Data Applications. Cloud Computing, IEEE, Vol: 1, no. 3, 2014, pp. 65-71.

[28] Yuan, Q., Ma, C., & Lin, J. (2015). Fine-Grained Access Control for Big Data Based on CP-ABE in Cloud Computing. In Intelligent Computation in Big Data Era, 2015, pp. 344- 352. Springer Berlin Heidelberg.

[29] Zeng, W., Yang, Y., Luo, B. (2013). Access control for Big Data using data content. In International Conference on Big Data, 2013, pp. 45-47. IEEE.

[30] Nasim, R., Buchegger, S. (2014). XACML-Based Access Control for Decentralized Online Social Networks. In Proceedings of the 2014 IEEE/ACM 7th International Conference on Utility and Cloud Computing, 2014, pp. 671-676. IEEE Computer Society.

[31] Zeng, W., Yang, Y., Luo, B. (2014). Content-Based Access Control: Use data content to assist access control for large-scale content-centric databases. In IEEE International Conference on Big Data (Big Data), 2014, pp. 701-710. IEEE.

[32] Cobb, M., (2011). How to create a data aggregation risk mitigation plan, ComputerWeekly.com, January 2011, http://www.computerweekly.com/tip/How-to-create- a-data-aggregation-risk-mitigation-plan, [Accessed 09-06-2015]

[33] Cavoukian, A., (2009). Privacy by design: The 7 foundational principles. Information and Privacy Commissioner of Ontario, Canada (2009), https://www.iab.org/wp-content/IAB- uploads/2011/03/fred_carter.pdf, [Accessed 09-06-2015]

[34] Gartner, Inc. http://www.gartner.com/ [Accessed 09-06-2015]

[35] Sajko, M., Rabuzin, K., & Bača, M. (2006). How to calculate information value for effective security risk assessment. Journal of Information and Organizational Sciences, Vol. 30, No. 2, 2006, pp. 263-278.

[36] Shannon, C., E., (2001). A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications, Review 5, No. 1, 2001, pp 3-55.

[37] Apache Software Foundation. http://hadoop.apache.org/ [Accessed 09-06-2015]

[38] Borthakur, D., (2013). HDFS design Guide. Apache Software Foundation. https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html [Accessed 09-06-2015]


[39] Dean, J., Ghemawat, S., (2004). Mapreduce : Simplified Data Processing on Large clusters, In the proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2004, pp. 137-150

[40] White, T., (2012). How MapReduce works, In “Hadoop: The definitive guide”, O'Reilly Media, Inc., 2012. pp. 187-195.

[41] Harel, A., Shabtai, A., Rokach, L., Elovici, Y., (2011). Dynamic Sensitivity-Based Access Control. In ISI, 2011, pp. 201-203.

[42] Apache Accumulo, https://accumulo.apache.org/index.html [Accessed 09-06-2015]

[43] Rong, C., Quan, Z., Chakravorty, A. (2013). On Access Control Schemes for Hadoop Data Storage. In 2013 International Conference on Cloud Computing and Big Data (CloudCom-Asia), 2013, pp. 641-645. IEEE.

[44] Somani, M., Swamiraj, M., Rengarajan, S., Shankar, H. (2012). BitTorrent for large package distribution in the enterprise environment. In 2012 International Conference on Recent Advances in Computing and Software Systems (RACSS), 2012, pp. 281-286. IEEE.

[45] Zhao, G., Rong, C., Li, J., Zhang, F., Tang, Y. (2010). Trusted data sharing over untrusted cloud storage providers. In 2010 IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom), 2010, pp. 97-103. IEEE.

[46] Niemann, K., Schmitz, H. C., Scheffel, M., Wolpers, M. (2011). Usage contexts for object similarity: exploratory investigations. In Proceedings of the 1st International Conference on Learning Analytics and Knowledge, 2011, pp. 81-85. ACM.

[47] Ghemawat, S., Gobioff, H., Leung, S, T., (2003). The Google File System, In ACM SIGOPS operating systems review, vol. 37, no. 5, 2003, pp. 29-43. ACM.

[48] O’Malley, O., (2008). Hadoop at Yahoo!, http://www.cs.umd.edu/class/spring2014/cmsc433-0201/Lectures/hadoopyahoo.pdf, [Accessed 09-06-2015]

[49] Apache Lucene, Apache Software Foundation, https://lucene.apache.org , [Accessed 09- 06-2015]

[50] Apache Nutch, Apache Software Foundation, http://nutch.apache.org/, [Accessed 09-06- 2015]

[51] White, T., (2012). Meet Hadoop, In Hadoop: The definitive guide, O'Reilly Media, Inc., 2012. pp. 9-12.

[52] Cafarella, M., Cutting, D., (2004). Open Source Search, ACM Queue, 2004,pp. 54-61.

[53] Baldeschwieler, E., (2008). Yahoo! Launches World’s Largest Hadoop Production Application, Yahoo Developer Network,


https://developer.yahoo.com/blogs/hadoop/yahoo-launches-world-largest-hadoop- production-application-398.html , [Accessed 09-06-2015]

[54] Gottfrid, D., (2007). Self-Service, Prorated Supercomputing Fun! , The New York Times Blog, http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing- fun/ , [Accessed 09-06-2015]

[55] White, T., (2012). The Hadoop File System, In Hadoop: The definitive guide, O'Reilly Media, Inc., 2012. pp. 45-54.

[56] Murthy, A. C., (2011). The Next Generation of Apache Hadoop MapReduce, Yahoo Developer Network, Hadoop Blog. https://developer.yahoo.com/blogs/hadoop/next- generation-apache-hadoop-mapreduce-3061.html [Accessed 01-22-2016].

[57] Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S. and Saha, B., (2013). Apache hadoop yarn: Yet another resource negotiator. In the Proceedings of 4th ACM Annual Symposium on Cloud Computing, 2013, Article-5.

[58] Sahafizadeh, E. and Parsa, S., (2010). Survey on access control models. In 2010 2nd International Conference on Future Computer and Communication (ICFCC), 2010, pp. 1-3, IEEE.

[59] NIST (2009). A Survey of Access Control Models, A Working Draft, http://csrc.nist.gov/news_events/privilege-management-workshop/PvM-Model-Survey- Aug26-2009.pdf [Accessed 03-28-2016]

[60] Shin, J., Wu, S., Wang, F., De Sa, C., Zhang, C. and Ré, C., (2015). Incremental knowledge base construction using deepdive. In the Proceedings of the VLDB Endowment, 8(11), 2015, pp.1310-1321.

[62] Bernstein, P.A., Madhavan, J. and Rahm, E., (2011). Generic schema matching, ten years later. In the proceedings of the VLDB Endowment, 4(11), 2011, pp.695-701.

[63] Melnik, S., Garcia-Molina, H. and Rahm, E., (2002). Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In the proceedings of the 18th International Conference on Data Engineering, 2002, pp. 117-128. IEEE.

[64] Doan, A., Madhavan, J., Domingos, P. and Halevy, A., (2002). Learning to map between ontologies on the semantic web. In Proceedings of the 11th international conference on World Wide Web, 2002, pp. 662-673. ACM.

[65] Shen, W., Li, X. and Doan, A., (2005). Constraint-based entity matching. In AAAI, 2005, pp. 862-867.


[66] Le Clément, V., Deville, Y. and Solnon, C., (2009). Constraint-based graph matching. In Principles and Practice of Constraint Programming-CP, 2009, pp. 274-288. Springer Berlin Heidelberg.

[67] Candillier, L., Jack, K., Fessant, F. and Meyer, F., (2009). Collaborative and Social Information Retrieval and Access: Techniques for Improved User Modeling, Chapter State- of-the-Art Recommender Systems.

[68] Sarwar, B., Karypis, G., Konstan, J. and Riedl, J., (2001). Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web, 2001, pp. 285-295. ACM.

[69] Schafer, J.B., Frankowski, D., Herlocker, J. and Sen, S., (2007). Collaborative filtering recommender systems. In the adaptive web, 2007, pp. 291-324. Springer Berlin Heidelberg.

[70] Pazzani, M.J. and Billsus, D., (2007). Content-based recommendation systems. In the adaptive web, 2007, pp. 325-341. Springer Berlin Heidelberg.

[71] Dongen, S.V., (2000). Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, 2000.

[72] Kannan, R., Vempala, S. and Vetta, A., (2004). On clusterings: Good, bad and spectral. Journal of the ACM (JACM), 51(3), 2004, pp.497-515.

[73] Jayarajan, D., Deodhare, D. and Ravindran, B., (2008). Lexical chains as document features.Proceedings of the Third International Joint Conference on Natural Language Processing, Volume-I, http://aclweb.org/anthology/I08-1015 [01-15-2015].

[74] Berker, M., (2011). Using genetic algorithms with lexical chains for automatic text summarization, Doctoral dissertation, Bogaziçi University, 2011.

[75] List of First Names - Social security dataset, http://www.ssa.gov/oact/babynames/limits.html [09-22-2014].

[76] Genealogy Data - U.S. Census Bureau, https://www.census.gov/genealogy/www/data/1990surnames/names_files.html [09-22- 2014].

[77] List of cities, states & counties - U.S. Small Business Administration, http://www.sba.gov/ [09-22-2014].

[78] Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D. and Miller, K.J., (1990). Introduction to wordnet: An on-line lexical database. International journal of lexicography, 3(4), 1990, pp.235-244.

[79] Harel, A., Shabtai, A., Rokach, L. and Elovici, Y., (2011). Eliciting domain expert misuseability conceptions. In Proceedings of the sixth international conference on Knowledge capture, pp. 193-194, 2011. ACM.

[80] Harel, A., Shabtai, A., Rokach, L. and Elovici, Y., (2010). M-score: estimating the potential damage of data leakage incident by assigning misuseability weight. In Proceedings of the 2010 ACM workshop on Insider threats, pp. 13-20, 2010. ACM.

[81] Moody, D.L. and Walsh, P., (1999). Measuring the Value Of Information-An Asset Valuation Approach. In Seventh European Conference on Information Systems (ECIS), pp. 496-512, 1999.

[82] Glazer, R., (1993). Measuring the value of information: The information-intensive organization. IBM Systems Journal, 32(1), pp.99-110, 1993.

[83] Wang, R.Y. and Strong, D.M., (1996). Beyond accuracy: What data quality means to data consumers. Journal of management information systems, 12(4), pp.5-33, 1996.

[84] Haebich, W., (1997). A Quantitative Model to Support Data Quality Improvement. Proceedings of the Conference onInformation Quality MIT, Boston, pp. 194-208, 1997.

[85] Goodhue, D.L., Wybo, M.D. and Kirsch, L.J., (1992). The impact of data integration on the costs and benefits of information systems. MIS Quarterly, pp.293-311, 1992.

[86] Miller, G.A., (1956). The magical number seven, plus or minus two: some limits on our capacity for processing information. Psychological review, 63(2), p.81, 1956.

[87] Newell, A. and Simon, H.A., (1972). Human problem solving, Vol. 104, No. 9. Englewood Cliffs, NJ: Prentice-Hall, 1972.

[88] Lipowski, Z.J., (1975). Sensory and information inputs overload: Behavioral effects. Comprehensive Psychiatry, 16(3), pp.199-221, 1975.

[89] O'Reilly, C.A., (1980). Individuals and information overload in organizations: is more necessarily better? Academy of management journal, 23(4), pp.684-696, 1980.

[90] Driver, M.J. and Mock, T.J., (1975). Human information processing, decision style theory, and accounting information systems. The Accounting Review, 50(3), pp.490-508, 1975.

[91] Jacoby, J., Speller, D.E. and Kohn, C.A., (1974). Brand choice behavior as a function of information load. Journal of Marketing Research, pp.63-69.

[92] Ashwin Kumar, T.K., George, K.M., and Thomas, J.P., (2015). An Empirical Approach to Detection of Topic Bubbles in Tweets. In 2015 IEEE/ACM 2nd International Symposium on Big Data Computing (BDC), 2015, pp. 31-40. IEEE

[93] Sandhu, R. and Chen, F., (1998). The multilevel relational (MLR) data model. ACM Transactions on Information and System Security (TISSEC), 1(1), 1998, pp.93-132.

[94] Jajodia, S. and Sandhu, R., (1991). Toward a multilevel secure relational data model. ACM SIGMOD Record, 20(2), 1191, pp.50-59.


[95] Denning, D.E. and Lunt, T.F., (1987). A multilevel relational data model. In IEEE Symposium on Security and Privacy, 1987, pp. 220-220. IEEE.

[96] Sandhu, R.S., (1993). Lattice-based access control models. Computer, 26(11), 1993, pp.9- 19.

[97] Benantar, M., (2006). Mandatory-Access-Control Model. Access Control Systems: Security, Identity Management and Trust Models, pp.129-146.

[98] Osborn, S., (1997). Mandatory access control and role-based access control revisited. In Proceedings of the second ACM workshop on Role-based access control, 1997, pp. 31-40. ACM.

[99] Winslett, M., Smith, K. and Qian, X., (1994). Formal query languages for secure relational databases. ACM Transactions on Database Systems (TODS), 19(4), 1994, pp.626-662.

[100] Li, L., He, Y.Z. and Feng, D.G., (2004). A fine-grained mandatory access control model for xml documents. Journal of software, 15(10), 2004, pp.1528-1537.

[101] Lindqvist, H., (2006). Mandatory access control. Master's Thesis in Computing Science, Umea University, Department of Computing Science, SE-901, 87.

[102] Thuraisingham, B., (2009). Mandatory access control. In Encyclopedia of Database Systems, 2009, pp. 1684-1685. Springer US.

[103] Upadhyaya, S., (2011). Mandatory access control. In Encyclopedia of Cryptography and Security, 2011, pp. 756-758. Springer US.

[104] McCune, J.M., Jaeger, T., Berger, S., Caceres, R. and Sailer, R., (2006). Shamon: A system for distributed mandatory access control. In Computer Security Applications Conference, 2006, pp. 23-32. IEEE.

[105] Ahn, G.J., (2009). Discretionary Access Control. In Encyclopedia of Database Systems 2009, pp. 864-866. Springer US.

[106] Li, N., (2011). Discretionary access control. In Encyclopedia of Cryptography and Security, 2011, pp. 353-356. Springer US.

[107] Dittrich, K.R., Härtig, M. and Pfefferle, H., (1988). Discretionary Access Control in Structurally Object-Oriented Database Systems. In DBSec, 1988, pp. 105-121.

[108] Moffett, J., Sloman, M. and Twidle, K., (1990). Specifying discretionary access control policy for distributed systems. Computer Communications, 13(9), 1990, pp.571-580.

[109] Bertino, E., Samarati, P. and Jajodia, S., (1993). High assurance discretionary access control for object bases. In Proceedings of the 1st ACM conference on Computer and communications security, 1993, pp. 140-150. ACM.


[110] Thomas, R.K. and Sandhu, R.S., (1993). Discretionary access control in object-oriented databases: Issues and research directions. In Proceedings of the 16th National Computer Security Conference, 1993, pp. 63-74.

[111] Bugliesi, M., Colazzo, D. and Crafa, S., (2004). Type based discretionary access control. In CONCUR 2004-Concurrency Theory, 2004, pp. 225-239. Springer Berlin Heidelberg.

[112] McCollum, C.J., Messing, J.R. and Notargiacomo, L., (1990). Beyond the pale of MAC and DAC-defining new forms of access control. In Proceedings of IEEE Computer Society Symposium on Research in Security and Privacy, 1990, pp. 190-200. IEEE.

[113] Ferraiolo, D., Kuhn, D.R. and Chandramouli, R., (2003). Role-based access control. Artech House.

[114] Sandhu, R.S., Coyne, E.J., Feinstein, H.L. and Youman, C.E., (1996). Role-based access control models. Computer, (2), 1996, pp.38-47.

[115] Sandhu, R.S., (1998). Role-based access control. Advances in computers, 46, 1998, pp.237- 286.

[116] Ferraiolo, D.F., Sandhu, R., Gavrila, S., Kuhn, D.R. and Chandramouli, R., (2001). Proposed NIST standard for role-based access control. ACM Transactions on Information and System Security (TISSEC), 4(3), 2001, pp.224-274.

[117] Ferraiolo, D., Cugini, J. and Kuhn, D.R., (1995). Role-based access control (RBAC): Features and motivations. In Proceedings of 11th annual computer security application conference, 1995, pp. 241-48.

[118] Ferraiolo, D.F. and Kuhn, D.R., (2009). Role-based access controls. arXiv preprint arXiv:0903.2171. 2009.

[119] Moyer, M.J. and Abamad, M., 2001, April. Generalized role-based access control. In 21st International Conference on Distributed Computing Systems, 2001, pp. 391-398. IEEE.

[120] Ni, Q., Bertino, E., Lobo, J., Brodie, C., Karat, C.M., Karat, J. and Trombeta, A., (2010). Privacy-aware role-based access control. ACM Transactions on Information and System Security (TISSEC), 13(3), 2010, p.24.

[121] Qian, J., Hinrichs, S. and Nahrstedt, K., (2001). ACLA: A framework for access control list (ACL) analysis and optimization. In Communications and Multimedia Security Issues of the New Century, 2001, pp. 197-211. Springer US.

[122] Yu, S., Wang, C., Ren, K. and Lou, W., (2010). Achieving secure, scalable, and fine- grained data access control in cloud computing. In proceedings of IEEE Infocom, 2010, pp. 1-9. IEEE.

[123] Bell, D.E. and LaPadula, L.J., (1973). Secure computer systems: Mathematical foundations (No. MTR-2547-VOL-1). MITRE CORP BEDFORD MA.

[124] Zhou, X. and Tang, X., (2011). Research and implementation of RSA algorithm for encryption and decryption. In 6th International Forum on Strategic Technology (IFOST), 2011, Vol. 2, pp. 1118-1121. IEEE.

[125] Downs, D.D., Rub, J.R., Kung, K.C. and Jordan, C.S., (1985). Issues in discretionary access control. In IEEE Symposium on Security and Privacy, 1985, pp. 208-208. IEEE.

[125] Graham, G.S. and Denning, P.J., (1972). Protection: principles and practice. In Proceedings of the spring joint computer conference, 1972, pp. 417-429. ACM.

[126] Lampson, B.W., (1974). Protection. ACM SIGOPS Operating Systems Review, 8(1), 1974, pp.18-24.

[127] Griffiths, P.P. and Wade, B.W., (1976). An authorization mechanism for a relational database system. ACM Transactions on Database Systems (TODS), 1(3), 1976, pp.242- 255.

[128] Bishop, M., (2012). Computer security: art and science (Vol. 200). Addison-Wesley.

[129] Castano, S., Fugini, M.G., Martella, G. and Samarati, P., (1995). Database security. ACM Press Books, Wokingham, England: Addison-Wesley, 1995.

[130] Zeng, W., (2015). Content-Based Access Control, Doctoral dissertation, University of Kansas, 2015.

[131] Ferraiolo, D. F., Gilbert, D. M. and Lynch, N., (1993). An examination of federal and commercial access control policy needs. In the proceedings of National Information Systems Security and National Computer Security Conference (NIST-NCSC), 1993, pp. 107-116. DIANE Publishing.

[132] HDFS Shell Commands, https://hadoop.apache.org/docs/r2.7.2/hadoop-project- dist/hadoop-common/FileSystemShell.html , [Accessed 06-28-2017].

[133] Hu, V.C., Ferraiolo, D., Kuhn, R., Friedman, A.R., Lang, A.J., Cogdell, M.M., Schnitzer, A., Sandlin, K., Miller, R. and Scarfone, K., (2013). Guide to attribute based access control (ABAC) definition and considerations (draft). NIST Special Publication, 800-162, 2013.

[134] OASIS (2013). eXtensible Access Control Markup Language (XACML) Version 3.0, http://docs.oasis-open.org/xacml/3.0/xacml-3.0-core-spec-os-en.pdf , [10-04-2016].

[135] Zhi, L., Jing, W., Xiao-su, C. and Lian-xing, J., (2009). Research on policy-based access control model. In International Conference on Networks Security, Wireless Communications and Trusted Computing, 2009. Vol. 2, pp. 164-167. IEEE.

[136] Sousa, M.S. and Melo, A.C., (2006). Packageblast: an adaptive multi-policy grid service for biological sequence comparison. In Proceedings of the 2006 ACM symposium on Applied computing, 2006, pp. 156-160. ACM.


[137] IETF Policy Working Group. Policy Framework, https://datatracker.ietf.org/wg/policy/charter/ , [Accessed 10-16-2016]

[138] McGraw, R., (2009). Risk-adaptable access control (RAdAC). In Privilege (Access) Management Workshop. NIST–National Institute of Standards and Technology– Information Technology Laboratory, 2009, http://csrc.nist.gov/news_events/privilege- management-workshop/presentations/Bob_McGraw.pdf , [Accessed 10-16-2016]

[139] Kandala, S., Sandhu, R. and Bhamidipati, V., (2011). An attribute based framework for risk-adaptive access control models. In Sixth International Conference on Availability, Reliability and Security (ARES), 2011, pp. 236-241. IEEE.

[140] Department of Defense.Directive 8000.01, (2009). http://www.dtic.mil/whs/directives/corres/pdf/800001p.pdf , [Accessed 10-16-2016]

[141] Department of Defense. Global information grid architectural vision.Vision for a Net- Centric, Service-Oriented DoD Enterprise. (2009). http://cio- nii.defense.gov/docs/GIGArchVision.pdf, [Accessed 10-16-2016]

[142] Giuri, L. and Iglio, P., (1997). Role templates for content-based access control. In Proceedings of the second ACM workshop on Role-based access control, 1997, pp. 153-159. ACM.

[143] Tzelepi, S.K., Koukopoulos, D.K. and Pangalos, G., (2001). A flexible content and context-based access control model for multimedia medical image database systems. In Proceedings of the 2001 workshop on Multimedia and security: new challenges, 2001, pp. 52-55. ACM.

[144] Gopal, B. and Manber, U., (1999). Integrating content-based access mechanisms with hierarchical file systems. In Operating Systems Design and Implementation (OSDI), Vol. 99, 1999, pp. 265-278.

[145] Bertino, E., Samarati, P. and Jajodia, S., (1997). An extended authorization model for relational databases. IEEE Transactions on Knowledge and Data Engineering, Vol. 9, No. 1, 1997, pp.85-101.

[146] Bertino, E., Ghinita, G. and Kamra, A., (2011). Access control for databases: concepts and systems. Foundations and Trends® in Databases: Vol. 3, No. 1–2, 2011, pp. 1-148. http://dx.doi.org/10.1561/1900000014

[147] Gifford, D.K., Jouvelot, P. and Sheldon, M.A., (1991). Semantic file systems. In ACM SIGOPS operating systems review, Vol. 25, No. 5, 1991, pp. 16-25. ACM.

[148] Astrahan, M.M., Blasgen, M.W., Chamberlin, D.D., Eswaran, K.P., Gray, J.N., Griffiths, P.P., King, W.F., Lorie, R.A., McJones, P.R., Mehl, J.W. and Putzolu, G.R., (1976). System R: relational approach to database management. ACM Transactions on Database Systems (TODS), Vol. 1, No. 2, 1976, pp.97-137.

[149] Adam, N.R., Atluri, V., Bertino, E. and Ferrari, E., (2002). A content-based authorization model for digital libraries. IEEE Transactions on knowledge and data engineering, Vol. 14, No. 2, 2002, pp.296-315.

[150] Adam, N. and Yesha, Y., (1996). Strategic directions in electronic commerce and digital libraries: towards a digital agora. ACM Computing Surveys (CSUR), Vol. 28, No. 4, 1996, pp.818-835.

[151] Tran, N.A.T. and Dang, T.K., (2007). A novel approach to fine-grained content-based access control for video databases. In 18th International Workshop on Database and Expert Systems Applications, 2007, pp. 334-338. IEEE.

[152] Bertino, E., Fan, J., Ferrari, E., Hacid, M.S., Elmagarmid, A.K. and Zhu, X., (2003). A hierarchical access control model for video database systems. ACM Transactions on Information Systems (TOIS), Vol. 21, No. 2, 2003, pp.155-191.

[153] Bertino, E., Hammad, M.A., Aref, W.G. and Elmagarmid, A.K., (2000). An access control model for video database systems. In Proceedings of the ninth international conference on Information and knowledge management, 2000, pp. 336-343. ACM.

[154] Hart, M., Johnson, R. and Stent, A., (2007). More content - less control: Access control in the Web 2.0. In IEEE Web 2.0 Security and Privacy Workshop (W2SP), 2007.

[155] Hart, M., (2012). Content-based access control. PhD Dissertation, Stony Brook University.

[156] Monte, S. (2010). Access control based on content, TKK Technical Reports in Computer Science and Engineering, B. TKK-CSE-B10. http://www.cse.tkk.fi/en/publications/B/10/papers/Monte final.pdf, [Accessed 10-16-2016]

[157] Carminati, B., Ferrari, E. and Perego, A., (2009). Enforcing access control in web-based social networks. ACM Transactions on Information and System Security (TISSEC), Vol. 13, No. 1, 2009, p.6.

[158] Molloy, I., Dickens, L., Morisset, C., Cheng, P.C., Lobo, J. and Russo, A., (2012). Risk-based security decisions under uncertainty. In Proceedings of the second ACM conference on Data and Application Security and Privacy, 2012, pp. 157-168. ACM.

[159] Ni, Q., Lobo, J., Calo, S., Rohatgi, P. and Bertino, E., (2009). Automating role-based provisioning by learning from examples. In Proceedings of the 14th ACM symposium on Access control models and technologies, 2009, pp. 75-84. ACM.

[160] Qin, L. and Atluri, V., (2003). Concept-level access control for the semantic web. In Proceedings of the 2003 ACM workshop on XML security, 2003, pp. 94-103. ACM.

[161] Toninelli, A., Montanari, R., Kagal, L. and Lassila, O., (2006). A semantic context-aware access control framework for secure collaborations in pervasive computing environments. In International Semantic Web Conference, 2006, pp. 473-486. Springer, Berlin, Heidelberg.

[162] Pan, C.C., Mitra, P. and Liu, P., (2006). Semantic access control for information interoperation. In Proceedings of the eleventh ACM symposium on Access control models and technologies, 2006, pp. 237-246. ACM.

[163] Fabian, B., Kunz, S., Konnegen, M., Müller, S. and Günther, O., (2012). Access control for semantic data federations in industrial product-lifecycle management. Computers in Industry, Vol. 63, No. 9, 2012, pp.930-940.

[164] Berners-Lee, T., Hendler, J. and Lassila, O., (2001). The semantic web. Scientific American, Vol. 284, No. 5, 2001, pp. 28-37.

[165] Nakamoto, S., (2008). Bitcoin: A peer-to-peer electronic cash system. https://bitcoin.org/bitcoin.pdf, [Accessed 06-26-2017]

[166] Agrawal, R., Imran, A., Seay, C. and Walker, J., (2014). A layer based architecture for provenance in big data. In IEEE International Conference on Big Data (Big Data), 2014, pp. 1-7. IEEE.

[167] MongoDB, https://www.mongodb.com/ , [Accessed 06-28-2017]

[168] Ghoshal, D. and Plale, B., (2013). Provenance from log files: a BigData problem. In Proceedings of the Joint Workshops on EDBT/ICDT, 2013, pp. 290-297. ACM.

[169] Ludäscher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger, E., Jones, M., Lee, E.A., Tao, J. and Zhao, Y., (2006). Scientific workflow management and the kepler system. Concurrency and Computation: Practice and Experience, 2006, 18(10), pp.1039-1065.

[170] Oinn, T., Greenwood, M., Addis, M., Alpdemir, M.N., Ferris, J., Glover, K., Goble, C., Goderis, A., Hull, D., Marvin, D. and Li, P., (2006). Taverna: lessons in creating a workflow environment for the life sciences. Concurrency and Computation: Practice and Experience, 2006, 18(10), pp.1067-1100.

[171] Crawl, D., Wang, J. and Altintas, I., (2011). Provenance for mapreduce-based data- intensive workflows. In Proceedings of the 6th workshop on Workflows in support of large-scale science, 2011, pp. 21-30. ACM.

[172] Altintas, I., Barney, O. and Jaeger-Frank, E., (2006). Provenance collection support in the kepler scientific workflow system. In International Provenance and Annotation Workshop, 2006, pp. 118-132. Springer, Berlin, Heidelberg.

[173] Ikeda, R., Park, H. and Widom, J., (2011). Provenance for generalized map and reduce workflows. In Proceedings of the 5th Biennial Conference on Innovative Data Systems Research (CIDR '11), 2011.

[174] Zhou, W., Fei, Q., Sun, S., Tao, T., Haeberlen, A., Ives, Z., Loo, B.T. and Sherr, M., (2011). NetTrails: a declarative platform for maintaining and querying provenance in distributed systems. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, 2011, pp. 1323-1326. ACM.

[175] Zhou, W., Sherr, M., Tao, T., Li, X., Loo, B.T. and Mao, Y., (2010). Efficient querying and maintenance of network provenance at internet-scale. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, 2010, pp. 615-626, ACM.

[176] Muthukumar, S.C., Li, X., Liu, C., Kopena, J.B., Oprea, M. and Loo, B.T., (2009). Declarative toolkit for rapid network protocol simulation and experimentation. In ACM SIGCOMM (demo), 2009.

[177] RapidNet, http://netdb.cis.upenn.edu/rapidnet/ , [Accessed 06-28-2017]

[178] Back, A. (2002). Hashcash: A Denial of Service Counter-Measure, http://www.hashcash.org/papers/hashcash.pdf, [Accessed 06-28-2017]

[179] Merkle, R.C., (1980). Protocols for public key cryptosystems. In IEEE Symposium on Security and Privacy, 1980, pp. 122-122. IEEE.

[180] Synthetic patient dataset generator, https://github.com/synthetichealth/synthea, [Accessed 10-06-2016]

[181] Institute for Health Metrics and Evaluation, http://www.healthdata.org/united-states, [Accessed 10-06-2016]

[182] Twitter Streaming API, https://dev.twitter.com/overview/api, [Accessed 10-06-2016]

[183] Hadoop streaming API, https://hadoop.apache.org/docs/r1.2.1/streaming.html, [Accessed 10-06-2016]

[184] Krylysov, A., (2015). Benchmark of Python JSON libraries, http://artem.krylysov.com/blog/2015/09/29/benchmark-python-json-libraries/, [Accessed 10-06-2016]

[185] Cheah, Y.W., Canon, R., Plale, B. and Ramakrishnan, L., (2013). Milieu: Lightweight and configurable big data provenance for science. In IEEE International Congress on Big Data (BigData Congress), 2013, pp. 46-53. IEEE.

[186] Olston, C. and Sarma, A.D., (2011). Ibis: A Provenance Manager for Multi-Layer Systems. In CIDR, 2011, pp. 152-159.

[187] Akoush, S., Sohan, R. and Hopper, A., (2013). HadoopProv: Towards provenance as a first class citizen in MapReduce. In Proceedings of the fifth USENIX conference on Theory and Practice of Provenance (TaPP '13), 2013, pp. 1-4.

[188] Wang, J., Crawl, D., Purawat, S., Nguyen, M. and Altintas, I., (2015). Big data provenance: Challenges, state of the art and opportunities. In IEEE International Conference on Big Data (Big Data), 2015, pp. 2509-2516. IEEE.

[189] Amsterdamer, Y., Davidson, S.B., Deutch, D., Milo, T., Stoyanovich, J. and Tannen, V., (2011). Putting lipstick on pig: Enabling database-style workflow provenance. Proceedings of the VLDB Endowment, 2011, 5(4), pp.346-357.

[190] Glavic, B., (2014). Big data provenance: Challenges and implications for benchmarking. In specifying big data benchmarks, 2014, pp. 72-80. Springer, Berlin, Heidelberg.

[191] Korolev, V., Joshi, A. and Grasso, M.A., (2014). PROB: A tool for tracking provenance and reproducibility of big data experiments. In Proceedings of the Workshop on Reproducible Research Methodologies (REPRODUCE'14), 2014, vol. 11, pp. 264-286.

[192] De Nies, T., Magliacane, S., Verborgh, R., Coppens, S., Groth, P., Mannens, E. and Van de Walle, R., (2013). Git2PROV: exposing version control system content as W3C PROV. In Proceedings of the ISWC 2013 Posters & Demonstrations Track, CEUR Workshop Proceedings, Vol. 1035, 2013, pp. 125-128. CEUR-WS.org.

[193] Git-Annex, http://git-annex.branchable.com/ , [Accessed 06-28-2017]

[194] Apache Pig, https://pig.apache.org/ , [Accessed 06-28-2017]

[195] Zhou, W., Mapara, S., Ren, Y., Li, Y., Haeberlen, A., Ives, Z., Loo, B.T. and Sherr, M., (2012). Distributed time-aware provenance. In Proceedings of the VLDB Endowment, 2012, Vol. 6, No. 2, pp. 49-60. VLDB Endowment.

[196] Muniswamy-Reddy, K.K., Macko, P. and Seltzer, M.I., (2010). Provenance for the Cloud. In Proceedings of the 8th USENIX conference on File and storage technologies (FAST), 2010, Vol. 10, pp. 15-14.

[197] Davidson, S.B. and Freire, J., (2008). Provenance and scientific workflows: challenges and opportunities. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, 2008, pp. 1345-1350. ACM.

[198] Simmhan, Y.L., Plale, B. and Gannon, D., (2006). A framework for collecting provenance in data-centric scientific workflows. In International Conference on Web Services. (ICWS'06), 2006, pp. 427-436. IEEE.

[199] Heinis, T. and Alonso, G., (2008). Efficient lineage tracking for scientific workflows. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, 2008, pp. 1007-1018. ACM.

[200] Crawl, D. and Altintas, I., (2008). A provenance-based fault tolerance mechanism for scientific workflows. Provenance and Annotation of Data and Processes, 2008, pp.152-159.

[201] Bose, R. and Frew, J., (2005). Lineage retrieval for scientific data processing: a survey. ACM Computing Surveys (CSUR), 2005, 37(1), pp.1-28.

[202] Simmhan, Y.L., Plale, B. and Gannon, D., (2005). A survey of data provenance in e-science. ACM Sigmod Record, 2005, 34(3), pp. 31-36.

[203] Goble, C., (2002). Position statement: Musings on provenance, workflow and semantic web annotations for bioinformatics. In Workshop on Data Derivation and Provenance, Chicago, 2002, Vol. 3.

[204] Muniswamy-Reddy, K.K., Holland, D.A., Braun, U. and Seltzer, M.I., (2006). Provenance- aware storage systems. In USENIX Annual Technical Conference, General Track, 2006, pp. 43-56.

[205] Cui, Y. and Widom, J., (2000). Practical lineage tracing in data warehouses. In Proceedings of the 16th International Conference on Data Engineering, 2000, pp. 367-378. IEEE.

[207] Cui, Y. and Widom, J., (2003). Lineage tracing for general data warehouse transformations. The VLDB Journal—The International Journal on Very Large Data Bases, 2003, 12(1), pp.41-58.

[208] Ré, C. and Suciu, D., (2008). Approximate lineage for probabilistic databases. Proceedings of the VLDB Endowment, 2008, 1(1), pp.797-808.

[209] Park, J., Nguyen, D. and Sandhu, R., (2012). A provenance-based access control model. In Tenth Annual International Conference on Privacy, Security and Trust (PST), 2012, pp. 137-144. IEEE.

[210] Seltzer, M.I. and Margo, D.W., (2009). The Case for Browser Provenance, First workshop on Theory and practice of provenance, 2009, pp.1-5.

[211] Hasan, R., Sion, R. and Winslett, M., (2009). The Case of the Fake Picasso: Preventing History Forgery with Secure Provenance. In FAST, 2009, Vol. 9, pp. 1-14.

[212] Tan, W.C., (2007). Provenance in databases: past, current, and future. IEEE Data Eng. Bull, 2007, 30(4), pp.3-12.

[213] Buneman, P., Chapman, A. and Cheney, J., (2006). Provenance management in curated databases. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data, 2006, pp. 539-550. ACM.

[214] Kim, J., Deelman, E., Gil, Y., Mehta, G. and Ratnakar, V., (2008). Provenance trails in the wings/pegasus system. Concurrency and Computation: Practice and Experience, 2008, 20(5), pp.587-597.

[215] Biton, O., Cohen-Boulakia, S., Davidson, S.B. and Hara, C.S., (2008). Querying and managing provenance through user views in scientific workflows. In IEEE 24th International Conference on Data Engineering (ICDE 2008), 2008, pp. 1072-1081. IEEE.

[226] Chirigati, F.S., Shasha, D.E. and Freire, J., (2013). ReproZip: Using Provenance to Support Computational Reproducibility. Proceedings of the 5th USENIX Workshop on the Theory and Practice of Provenance, 2013.

[227] Benjelloun, O., Sarma, A.D., Halevy, A. and Widom, J., (2006). ULDBs: Databases with uncertainty and lineage. In Proceedings of the 32nd international conference on Very large data bases, 2006, pp. 953-964. VLDB Endowment.

[228] Gadelha Junior, L.M.R., Wilde, M., Mattoso, M. and Foster, I., (2011). Exploring provenance in high performance scientific computing. In Proceedings of the first annual workshop on High performance computing meets databases, 2011, pp. 17-20. ACM.

[229] Wilde, M., Foster, I., Iskra, K., Beckman, P., Zhang, Z., Espinosa, A., Hategan, M., Clifford, B. and Raicu, I., (2009). Parallel scripting for applications at the petascale and beyond. Computer, 2009, 42(11).

[230] Wilde, M., Hategan, M., Wozniak, J.M., Clifford, B., Katz, D.S. and Foster, I., (2011). Swift: A language for distributed parallel scripting. Parallel Computing, 2011, 37(9), pp.633-652.

[231] Zhao, Y., Hategan, M., Clifford, B., Foster, I., Von Laszewski, G., Nefedova, V., Raicu, I., Stef-Praun, T. and Wilde, M., (2007). Swift: Fast, reliable, loosely coupled parallel computation. In IEEE Congress on Services, 2007, pp. 199-206. IEEE.

[232] Gehani, A. and Lindqvist, U., (2007). Bonsai: Balanced lineage authentication. In Twenty- Third Annual Conference on Computer Security Applications (ACSAC), 2007, pp. 363- 373. IEEE.

[233] Groth, P., Miles, S. and Moreau, L., (2005). Preserv: Provenance recording for services. In Proceedings of the UK OST e-Science Second All Hands Meeting 2005 (AHM'05), http://www.ecs.soton.ac.uk/~lavm/papers/Groth-AHM05.pdf , [Accessed 06-28-2017]

[234] Simmhan, Y.L., Plale, B. and Gannon, D., (2008). Karma2: Provenance management for data-driven workflows. Int'l J. Web Services Research, vol. 5, no. 1, 2008.

[235] Anand, M.K., Bowers, S., McPhillips, T. and Ludäscher, B., (2009). Efficient provenance storage over nested data collections. In Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, 2009, pp. 958-969. ACM.

[236] Green, T.J., Karvounarakis, G., Ives, Z.G., Tannen, V., (2010). Provenance in orchestra. IEEE Data Eng. Bull. 2010, 33(3), pp. 9–16. IEEE.

[237] Zhou, W., Ding, L., Haeberlen, A., Ives, Z.G. and Loo, B.T., (2011). TAP: Time-aware Provenance for Distributed Systems. Proceedings of the 3rd USENIX Workshop on the Theory and Practice of Provenance, 2011.

[238] Gessiou, E., Pappas, V., Athanasopoulos, E., Keromytis, A. and Ioannidis, S., (2012). Towards a universal data provenance framework using dynamic instrumentation. Information Security and Privacy Research, 2012, pp.103-114.

[239] Coekaerts, W., (2011). Trying out DTrace, Oracle blog, https://blogs.oracle.com/wim/trying-out-dtrace ,[Accessed 06-28-2017]

[240] Malik, T., Nistor, L. and Gehani, A., (2010). Tracking and sketching distributed data provenance. In IEEE Sixth International Conference on e-Science (e-Science), 2010, pp. 190-197. IEEE.

[241] Widom, J., (2004). Trio: A system for integrated management of data, accuracy, and lineage. In Proceedings of CIDR, 2004.

[242] Wylot, M., Cudre-Mauroux, P. and Groth, P., (2014). TripleProv: Efficient processing of lineage queries in a native RDF store. In Proceedings of the 23rd international conference on World Wide Web, 2014, pp. 455-466. ACM.

[243] Apache Hive, https://hive.apache.org/ ,[Accessed 06-28-2017]

[244] Language Manual – Apache Hive, (2017), https://cwiki.apache.org/confluence/display/Hive/LanguageManual , [Accessed 06-28-2017]

[245] Cheah, Y.W. and Plale, B., (2012). Provenance analysis: Towards quality provenance. In IEEE 8th International Conference on E-Science (e-Science), 2012, pp. 1-8. IEEE.

[246] Frew, J., Metzger, D. and Slaughter, P., (2008). Automatic capture and reconstruction of computational provenance. Concurrency and Computation: Practice and Experience, 2008, 20(5), pp.485-496.

[247] Xie, Y., Feng, D., Tan, Z., Chen, L., Muniswamy-Reddy, K.K., Li, Y. and Long, D.D., (2012). A hybrid approach for efficient provenance storage. In Proceedings of the 21st ACM international conference on Information and knowledge management, 2012, pp. 1752-1756. ACM.

[248] Chapman, A.P., Jagadish, H.V. and Ramanan, P., (2008). Efficient provenance storage. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, 2008, pp. 993-1006. ACM.

[249] Braun, U., Garfinkel, S., Holland, D.A., Muniswamy-Reddy, K.K. and Seltzer, M.I., (2006). Issues in automatic provenance collection. International Provenance and Annotation Workshop (IPAW), 2006, 6, pp.171-183.

[250] Scheidegger, C., Koop, D., Santos, E., Vo, H., Callahan, S., Freire, J. and Silva, C., (2008). Tackling the provenance challenge one layer at a time. Concurrency and Computation: Practice and Experience, 2008, 20(5), pp. 473-483.

[251] Belhajjame, K., B'Far, R., Cheney, J., Coppens, S., Cresswell, S., Gil, Y., Groth, P., Klyne, G., Lebo, T., McCusker, J. and Miles, S., (2013). Prov-dm: The prov data model, http://www.w3.org/TR/prov-dm/ , [Accessed 06-28-2017]

[252] Moreau, L., Clifford, B., Freire, J., Futrelle, J., Gil, Y., Groth, P., Kwasnikowska, N., Miles, S., Missier, P., Myers, J. and Plale, B., (2011). The open provenance model core specification (v1.1). Future Generation Computer Systems, 2011, 27(6), pp. 743-756.

[253] Buneman, P., Khanna, S. and Tan, W.C., (2001). Why and where: A characterization of data provenance. In ICDT, 2001, Vol. 1, pp. 316-330.

[254] Agarwal, R., (2014). Provenance in Big Data, NIST Big Data Working Group, https://bigdatawg.nist.gov/2014_IEEE/03_04_NIST_BDWS_ProvenancePresentation.pdf , [Accessed 06-28-2017].

VITA

Ashwin Kumar Thandapani Kumarasamy

Candidate for the Degree of

Doctor of Philosophy

Thesis: CONTENT SENSITIVITY BASED ACCESS CONTROL MODEL FOR BIG DATA

Major Field: Computer Science

Biographical:

Education:

Completed the requirements for the Doctor of Philosophy in Computer Science at Oklahoma State University, Stillwater, Oklahoma in 2017.

Completed the requirements for the Master of Science in Computer Science at Oklahoma State University, Stillwater, Oklahoma in 2013.

Completed the requirements for the Bachelor of Technology in Information Technology at Anna University, Chennai, India in 2010.

Experience:

Graduate Teaching Associate/Assistant (Aug 2014 – May 2017), Oklahoma State University, Stillwater, OK

Graduate Research Assistant (Jan 2012 – May 2014), Oklahoma State University, Stillwater, OK

Professional Memberships:

IEEE Student Member
ACM Student Member