Hadoop Security

Like what you hear? Tweet it using: #Sec360

HADOOP SECURITY

About Robert:
- School: UW Madison, U St. Thomas
- Programming: 15 years, C, C++, Java
- Security Work:
  - Surescripts, Minneapolis (present)
  - Big Retail Company, Minneapolis
  - Big Healthcare Company, Minnetonka
- OWASP Local Volunteer
- CISSP, CISM, CISA, CHPS
- Email: [email protected]
- Twitter: @msp_sullivan

HADOOP SECURITY
- History
- What is new?
- Common Applications
- Threats
- Security Architecture
- Secure Baseline and Testing
- Policy Impact

HADOOP HISTORY
• 2002: Doug Cutting & Mike Cafarella: Nutch
  • Crawl and index hundreds of millions of pages
• 2003: Google File System paper released
• 2004: Google MapReduce paper released
• 2006: Yahoo formed Hadoop; 5 to 20 nodes
• 2008: Yahoo, Hadoop “behind every click”
• 2008: Cloudera founded; 2,000 Hadoop nodes
• 2008: Facebook open sourced Hive for Hadoop
• 2011: Yahoo spins out Hortonworks
  • Hortonworks Hadoop 42,000 nodes, hundreds of petabytes
Source: Derrick Harris, “The History of Hadoop from 4 nodes to the future of data”, gigaom.com

HADOOP IS
The Apache Hadoop software library is a framework that allows for the distributed processing of large …
- Software Framework
- Distributed Processing
- Large Data Sets
- Clusters of Computers
- High Availability
- Scale to Thousands of Machines
Link: https://developer.yahoo.com/hadoop/tutorial

MAPREDUCE IS NEW
Diagram: MAP, REDUCE

HADOOP COMMON APPLICATIONS
1. Web Search
2. Advertising & recommendations
3. Security Threat Identification
4. Fraud Detection
5. Patient Record Search
Source: Yahoo: https://developer.yahoo.com/blogs/ydn/hadoop-yahoo-more-ever-54421.html

PATIENT MATCHING AT SURESCRIPTS
- Surescripts provides a Patient Matching service
- 230 Million Patients
- Over 1 Billion matches last year
- Requirements:
  - Reliability and performance
  - Data Protection at rest is required
  - Data Protection in transit is required
  - Comprehensive security logging is needed
  - ISO 27001 & EHNAC Audit Accreditation status must be maintained

NOW WHAT?

SECURE THE BEES

HADOOP THREAT MODEL
1) Unauthorized data access (protected health information access)
2) Unauthorized data change
3) Unauthorized job submission, deletion or change
4) A task may access other tasks or access local data
5) Rogue DataNode, NameNode or Job Tracker
6) User spoofing to submit a workflow as another user
From: “Adding Security to Apache Hadoop”, Das, O’Malley, Radia, Zhang, 2011, http://hortonworks.com/wp-content/uploads/2011/10/security-design_withCover-1.pdf

HADOOP SECURITY
Diagram: Data Nodes, Management Nodes, Admins, Applications, Application Users; Enterprise Identity, Logging, Encryption, Key Management
- Network Security
- Authentication
- Authorization
- Auditing
- Data Protection

DATA PROTECTION
- Encryption at rest: volume, file
- Encryption in transit: HTTPS

SECURITY AUDITING
- Failed/successful authentication
- System changes
- Access to PHI
- Application logs: HDFS, YARN, MapReduce…

AUTHORIZATION
- Limit user access to functions
- Limit user access to objects
- Manage delegation of access

AUTHENTICATION
- All users, all applications, all access paths
- Apache Knox Gateway

NETWORK SECURITY

HADOOP SECURE MODE
Apache Hadoop Secure Mode: 2.6.0 (March ’14)
- Authentication
  - Covers HDFS, YARN, MapReduce & Web Console
  - Uses central LDAP Server or Active Directory
  - Requires Kerberos keytabs for each application
- Authorization
  - Each Hadoop service has a list of users and groups
  - Group permissions on HDFS filesystem components
- Audit
  - Hadoop log, YARN log, other logs
- Data Protection
  - Encryption in transit between Hadoop services & clients
  - Encryption in transit between DataNodes
  - Encryption in transit between web console & clients (HTTPS)
  - Encryption at rest for HDFS columns

HADOOP SECURE MODE
Apache Hadoop Secure Mode: 2.6.0 (March ’14)
Matrix of threats against controls (cell contents not preserved):
- Threats: Data Access, Data Change, Job Submission, Task Access, Rogue Node, User Spoofing
- Controls: Network Security, Authentication, Authorization, Audit, Data Protection

APACHE KNOX
The Apache Knox Gateway is a REST API Gateway for interacting with Hadoop clusters. The Knox Gateway provides a single access point for all REST interactions with Hadoop clusters.
Knox can provide:
• Authentication (LDAP and Active Directory Authentication Provider)
• Federation/SSO (HTTP Header Based Identity Federation)
• Authorization (Service Level Authorization)
• Auditing
Integrations:
- WebHDFS (HDFS), Templeton (HCatalog), Stargate (HBase), Oozie, Hive/JDBC
Status: Incubating

APACHE RANGER
A centralized security framework to manage fine-grained access control.
Status: Incubating
Authentication
• Kerberos in native Apache Hadoop
• Secured by the Apache Knox Gateway via the HTTP/REST API
Authorization
• on the folder and file level, via HDFS
• on the database, table and column level, via Hive
• on the table, column family and column level, via HBase
Audit
• User access auditing in HDFS, Hive and HBase: IP address, resource/resource type, timestamp, access granted or denied
Data Protection
• Wire, volume and file/column encryption
• HDFS Transparent Encryption (TDE)
• Third-Party Partners (Hortonworks)
Administration
• Policy management, administration and delegation
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.2.0/Ranger_U_Guide_v22/index.html#Item1.1

HADOOP SECURITY POLICY
Authentication of processes:
- May go into existing application security policy
Security Logging requirements:
- Which applications must be logged?
- Add node identifier to standard log records
De-anonymization Issues:
- Sparse data can be de-anonymized through matching to public sources
- Could 200 days of tweets be matched to any of my de-identified data?
Key Management & Business Continuity

BUILD A SECURITY BASELINE
- Start with your vendor’s distribution
- Add your company’s sauce
- Review the Hadoop Security Benchmark project at the Center for Internet Security:
  - Apache Hadoop 2.6.0 Benchmark
  - Community Discussion
  - Editors and members get free access to validation tools
  - Everyone gets free access to baselines
  - Registration is moderated. That means human registrants are approved and receive a welcome email.
- Link: http://tinyurl.com/HadoopSecurityBenchmark

HADOOP SECURITY REVIEW
1. Start with the threats
2. Choose your diagram
3. Ask the standard security questions:
   - Network Security
   - Authentication
   - Authorization
   - Security Audit
   - Data Protection
4. Update your policy
5. Build a Security Baseline

HADOOP SECURITY RESOURCES
1. Apache: “Hadoop in Secure Mode” http://tinyurl.com/hadoopSecureMode
2. Yahoo Hadoop Tutorial https://developer.yahoo.com/hadoop/tutorial
3. Securosis: “Securing Big Data: Security Recommendations for Hadoop and NoSQL Environments”, 10/12/2012, Adrian Lane https://securosis.com/assets/library/reports/SecuringBigData_FINAL.pdf
4. Cloudera: “Introduction to Hadoop Security” http://tinyurl.com/cloudera50security
5. Hortonworks: “Security for Enterprise Hadoop” http://hortonworks.com/innovation/security/
6. Center for Internet Security: Hadoop Security Baseline http://tinyurl.com/HadoopSecurityBenchmark

QUESTIONS?
Updates at http://www.confidentialsoftware.com
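
BASELINE CHECK (SKETCH)
Step 5 of the review, building a security baseline, lends itself to a small automated check. The sketch below is one possible way to compare a Hadoop `*-site.xml` file against a handful of Secure Mode settings. The property names come from the Apache Hadoop Secure Mode documentation rather than from these slides; the expected values, the `BASELINE` table and the helper names `parse_site_xml` and `check_baseline` are illustrative assumptions and are not the CIS benchmark.

```python
import xml.etree.ElementTree as ET

# Assumed baseline: property names follow the Apache Hadoop Secure Mode
# documentation; the expected values are illustrative, not a CIS benchmark.
BASELINE = {
    "hadoop.security.authentication": "kerberos",  # authentication
    "hadoop.security.authorization": "true",       # service-level authorization
    "hadoop.rpc.protection": "privacy",            # encryption in transit (RPC)
    "dfs.encrypt.data.transfer": "true",           # encryption between DataNodes
    "dfs.http.policy": "HTTPS_ONLY",               # web console over HTTPS
}


def parse_site_xml(xml_text):
    """Return {name: value} parsed from the text of a Hadoop *-site.xml file."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def check_baseline(props, baseline=BASELINE):
    """Return (name, expected, actual) for every setting that deviates."""
    findings = []
    for name, expected in sorted(baseline.items()):
        actual = props.get(name)  # None means the property is not set
        if actual is None or actual.lower() != expected.lower():
            findings.append((name, expected, actual))
    return findings


# Hypothetical sample configuration used only to exercise the check.
SAMPLE = """<configuration>
  <property><name>hadoop.security.authentication</name><value>kerberos</value></property>
  <property><name>hadoop.security.authorization</name><value>true</value></property>
  <property><name>hadoop.rpc.protection</name><value>authentication</value></property>
</configuration>"""

if __name__ == "__main__":
    for name, expected, actual in check_baseline(parse_site_xml(SAMPLE)):
        print(f"{name}: expected {expected}, found {actual}")
```

In a real cluster the `hadoop.*` settings normally live in core-site.xml and the `dfs.*` settings in hdfs-site.xml, so the dictionaries parsed from both files would be merged before calling `check_baseline`. A fuller baseline would also cover the audit and authorization items from the earlier slides.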
