Cybersecurity Data Science (CSDS) Practitioner Methods and Best Practices PART 1: FRAME Scott Allen Mongeau Cybersecurity Data Scientist SAS Institute [email protected]

Copyright © SAS Institute Inc. All rights reserved. Cybersecurity Data Science (CSDS) TOPIC 1. FRAME 2. DATA 3. DISCOVER 4. DETECT 5. DEPLOY

2 Introductions, Expectations, Agenda

Copyright © SAS Institute Inc. All rights reserved. Scott Allen Mongeau Cybersecurity Data Scientist Cybersecurity Data Science (CSDS) MA GD MA MBA PhD (ABD)

+31 (0)68 370 3097

[email protected]

Scott Allen Mongeau

• SAS (2015 – present) SARK7 • F&SI Cyber – Global (EU/Netherlands based) • Fraud & Financial Crime Analytics (London) • SARK7 (2008 – present) • Data analytics consulting (Leiden) • Deloitte (2013 – 15) • Fraud, Financial Crime, Cyber (Amsterdam) • Genentech Inc. (2000 – 2008) • Data analytics (San Francisco, CA) Scott Mongeau • Defining key CSDS challenges

Cybersecurity • Outline best practices Data Science: Perspectives on an • Guidance to managers and Emerging Profession practitioners in implementation of CSDS programs

5 Key Learning Objectives

6 Copyright © SAS Institute Inc. All rights reserved. Cybersecurity Services and Solutions: Overview of Major Areas

7 NIST Cybersecurity Framework

8 9 Cybersecurity Data Science as a Process

Data Engineering Advanced Analytics Triage / Validate Remediate

Diagnostics & Establishing Predictive Anomaly Behavioral ? patterns baselines modelling detection insights

DATA DataENGINEER Manager Scientist Cyber INVESTIGATOR Data Scientist CASE MGMT Infosec Response Investigator

10 Cybersecurity Analytics Maturity Curve Data-aware Anomaly Detection Predictive Detection Risk Optimization Investigations

• Big data management • Flags, rules, and alerts ______• Multivariate statistics, inferenceCHASING & unsupervisedPHANTOM machinePATTERNS learning • Groups extracted as baselines

11 1. FRAME C S D S 2. DATA P r o c e s s

DEPLOY

4. DETECT 3. DISCOVER

12 Cybersecurity Data Science (CSDS) Lifecycle

Monitor Frame Data Manager InfoSec Response

DECIDE Deploy Explore DETECTION DISCOVERY

Validate Engineer

Cyber Data Scientist Model Investigator

Copyright © SAS Institute Inc. All rights reserved. Security Operations Center (SOC)

14 Emerging SOC Operational Drivers

Big & fast streaming data Limitations of traditional Integrated situational Need to build and Automation of manual needs to be stitched into signature and rules- awareness of network, validate efficacious investigation and ‘smart data’ based approaches, device, and user behavior machine learning remediation processes requiring probabilistic while reducing false alerts models and risk-focused models

Copyright © SAS Institute Inc. All rights reserved. 1. FRAME Cybersecurity, Data Science, Logs

Copyright © SAS Institute Inc. All rights reserved. Cybersecurity Data Science (CSDS) TOPIC 1. FRAME 2. DATA 3. DISCOVER 4. DETECT 5. DEPLOY

17 Learning Objectives

18 Copyright © SAS Institute Inc. All rights reserved. Objectives of Context Setting Establishing a foundation

• Data challenges (opportunities) in cybersecurity • Data science foundations • Demonstration / exercises • Insights into the ‘data deluge’ (network analytics) • Log file analytics – from unstructured to structured

19 CSDS Process Unified Orchestration

Copyright © SAS Institute Inc. All rights reserved. Cybersecurity Context

Copyright © SAS Institute Inc. All rights reserved. Evolving Threats

Internal Threats State Actors

Automated Fraud-Cyber Attacks Hybrids

Social Ransomware Engineering & Cryptojacking

Copyright © SAS Institute Inc. All rights reserved. Hidden threats want to remain hidden (in the data)

23 24 24 Source: Recorded Future via Fortune Magazine ‘A Hacker’s Tool Kit’ http://fortune.com/2017/10/25/cybercrime-spyware-marketplace/ 25 FUD Fear, Uncertainty, Doubt Increasing FREQUENCY, sophistication, and speed of attacks

Example of hacking tool ‘ecosystem’: Kali Linux

• 51,000 attackers were blocked within days of site launch • More than 1 million VPN-based attacks were blocked

26 Distinction Information bias Reactive devaluation Anchoring or focalism Dread aversion Insensitivity to sample size Recency illusion Anthropocentric thinking Dunning–Kruger effect Interoceptive bias Regressive bias Anthropomorphism or personification Duration neglect Irrational escalation Law of the instrument Rhyme as reason effect Risk compensation / Peltzman Endowment effect Less-is-better effect effect

Availability heuristic Exaggerated expectation Look-elsewhere effect

Availability cascade Experimenter's or expectation bias Loss aversion Selective perception Backfire effect Focusing effect Mere exposure effect Semmelweis reflex

Bandwagon effect Forer effect or Barnum effect Money illusion Social comparison bias or Base rate neglect Form function Moral credential effect Social desirability bias LIST OF Framing effect or Negativity effect Frequency illusion or Baader– Ben Franklin effect Meinhof effect Neglect of probability Stereotyping Berkson's paradox Functional fixedness Subadditivity effect blind spot Gambler's fallacy Not invented here Subjective validation Choice-supportive bias Groupthink Observer-expectancy effect Hard–easy effect Time-saving bias Ostrich effect Third-person effect Hot-hand fallacy Parkinson's law of triviality

Conservatism (belief revision) Hyperbolic discounting Overconfidence effect Unit bias

Continued influence effect Identifiable victim effect Pareidolia Weber–Fechner law https://en.wikipedia.org/wiki Contrast effect IKEA effect Pessimism bias Well travelled road effect Courtesy bias Illicit transference Planning fallacy Zero-risk bias /List_of_cognitive_biases Post-purchase rationalization Declinism Illusion of validity Decoy effect Pro-innovation bias Default effect Illusory27 truth effect Projection bias Denomination effect Pseudocertainty effect Disposition effect Implicit association Reactance Lack of statistical and analytics approaches to establish alert validity

High volumes of Too many alerts combined disconnected and with slow and resource disparate data sources constrained investigations

Proliferation of point Disconnected log analytics solutions sources of variable impedes holistic risk- quality based approach Difficulty Actioning ! Indicators

Company Confidential – For Internal Use Only Copyright © SAS Institute Inc. All rights reserved. ‘Drowning’ in Data Lakes

29 Network Traffic Tracking network traffic on a single device

30 Copyright © SAS Institute Inc. All rights reserved. Exponential Growth in Network Traffic

https://www.neustadt.fr/essays/against-a-user-hostile-web/

Using cookies and other tracking techniques, many “normal” web sites aggressively access and store our:

• Device details • Buying behavior • IP address • Personal finances • Present geolocation • Religious beliefs • Changing physical locations • Political affiliations • Browser history • Online purchases • Health concerns and problems • Profile information • Camera / photos / microphone 31 Marketing Analytics Ecosystem (>4000 companies)

32 Demonstration / Exercise Network Traffic from Web Browsing 1. Open ‘Wireshark’

2. Click ‘Ethernet’

3. Observe results

33 Demonstration / Exercise Network Traffic from Simple Web Browsing 4. Open ‘Internet Explorer’ (next to Wireshark) 5. Go to a popular news website

6. Monitor Wireshark – what do you see? 7. Sort on destination 7. View ‘red’ & ‘black’ events

See codes View => Coloring Rules Note ‘Destination’ IP 8. Go to http://ip-lookup.net Where and what are these IPs?

34 Wireshark Coloring Rules (default)

35 36 Data Science Foundations

Copyright © SAS Institute Inc. All rights reserved. Level of difficulty in reducing false alerts*

* Survey of 621 global IT security practitioners https://www.sas.com/en_us/whitepapers/ponemon-how-security- analytics-improves-cybersecurity-defenses-108679.html 38 Statistics Cognitive Pattern science Identification Data Science Data AI Mining Machine Learning Data Engineering

As adapted from: Hall, P., et al. An Overview of Machine Learning with SAS Enterprise Miner. SAS Institute Inc.

39 39 DATA ANALYTICS What to do? What are PRESCRIPTIVE trends? PREDICTIVE What happened?

DESCRIPTIVE SOPHISTICATION

VALUE

40 DATA ANALYTICS Operations Management Econometrics PRESCRIPTIVE Forecasting Machine Learning PREDICTIVE Business Intelligence (BI)

DESCRIPTIVE SOPHISTICATION

VALUE

41 DATA ANALYTICS How can we optimize remediation from focused alerts? PRESCRIPTIVE What is the ‘normal’ behaviour of particular users or devices? PREDICTIVE What are trends and patterns in network and device usage?

DESCRIPTIVE SOPHISTICATION

VALUE

42 43 Calvin.Andrus (2012) Depicts a mash-up of disciplines from which Data Science is derived http://en.wikipedia.org/wiki/File:DataScienceDisciplines.png Engineering Physics / ‘Hard Computer Sciences’ Science

Software Economics / Engineering Finance

Social Sciences

44 Calvin.Andrus (2012) Depicts a mash-up of disciplines from which Data Science is derived http://en.wikipedia.org/wiki/File:DataScienceDisciplines.png 45 Scientific test…

46 VALIDATE WITH DIAGNOSTICS

Descriptive Predictive Prescriptive

DIAGNOSTICS i.e. statistical tests, causal and explanatory models, experiments, validation, model performance, etc.

47 Data Science Methods What are the How can we Optimizing SEMANTIC underlying Data support experts visualization Systems Understanding human factors? with Patterns PRESCRIPTIVE visualizations? Understanding PREDICTIVE Context & Meaning Validating DIAGNOSTICS Factors & Causes Forecasting &

Probabilities SOPHISTICATION DESCRIPTIVE How do we substantiate our assumptions and hypothesizes? DATA QUALITY Is data quality reliable enough? Business Intelligence What are limitations?

VALUE 48 CSDS: Cybersecurity Data Science

Replacing rules with Moving to real time Automation of manual machine learning to reduce detection and processes and routine decisions false alerts decisioning

Data engineering to structured Investigation tools that visualize complexity and integrate distributed big to improve investigator efficiency and data into ‘smart data’ decision making Copyright © SAS Institute Inc. All rights reserved. Cybersecurity Data Science: Typical Sequence

Identify a security problem

Identify data associated Operationalize solution with problem

Test and validate a solution Clean and prepare the data

Select a set of methods Experiment with methods

50 Data Foundations: Log File Analytics

Copyright © SAS Institute Inc. All rights reserved. The devil is in the data

010001011010001000101100000001101010111100010000110001 100000110100101000101100100111011100101001000001100100 000011011011001100110111001100101011110000100001000011 010101101000100010010001000100101011011101101010011101 111000010011001100100111100011010111100001001100100001 111011010010100111000100101000110100110010011001010010 010111110100100011101000100011010100111011000101100111 110001001011001000100110100011010101101000010000100001 010001011010001000010000000001101010101100001010010010 101010110100010001001111000110101011010100100001000001 111001001011001000110000100011010101111001010000111001 011100011010001110000101111110000111001100001110101001 101001011000001000110000100011010101111000111100100010 010101101011101000101000100010010101011100001000000011 011101000001100100100011001001010110111011010100110101

Copyright © SAS Institute Inc. All rights reserved. Challenges preventing successful use of cybersecurity analytics*

https://www.sas.com/en_us/whitepapers/ponemon-how-security- * Survey of 621 global IT security practitioners analytics-improves-cybersecurity-defenses-108679.html 53 Log files are the most common source of cybersecurity data…

SOURCE Security Brief Magazine. (2016). “Analyze This! Who’s Implementing Security Analytics Now?” Available at https://www.sas.com/en_th/whitepapers/analyze-this-108217.html 54 Security Log Data Sources Operating systems • Web server • Windows & registry events • IIS W3C Logs, Apache • Unix logs Data access • HTTP Error Logs, URL Scans • File system • Resource access and performance • ODBC • Physical security access • RMDB (i.e. Oracle, SQL Server) Authentication and Authorization Reports • Malware and endpoint activity • Active Directory Services • Failure and critical error reports • Kerberos • Specialized • Proxy logs Network logs • Cloud applications (i.e. SaaS) • Firewall logs • Specialized systems / applications (i.e. ERP, thin client, specialized or home- • Router grown) • Traffic flows / packet captures

SOURCE: Marty, Raffael. Applied Security55 Visualization. Pearson Education. Use Cases: Security Log File Analytics

• User-entity behavioral analytics • Data / asset protection (UEBA)* • Preventive insights • Performance insights • Identify / attribute attacks • Optimization of resources • Incident response • Asset tracking • Incident root-cause analysis • Refining focused alerts • Refining risk indicator metrics • Compliance / risk reporting • Enriching SIEM or other repositories • CISO dashboard

* Global Behavioral Analytics Market $3.5 billion projected market value by 2024 56 Diving into Cybersecurity Log Data

This exercise illustrates an example of examining cybersecurity log data with a variety of tools.

Copyright © SAS Institute Inc. All rights reserved. Diving Right In! EXERCISE: Assessing Raw Log Data

58 Demonstration / Exercise EXERCISE: Assessing Raw Log Data 1. Open ‘Notepad’

2. File => Open

3. Select ‘All Files (*.*)’ (lower right)

3. D: => @CYBER => 1.FRAME => B.Log_Files => log_files => auth.log

59 60 Demonstration / Exercise EXERCISE: Assessing Semi-Structured Log Data 1. Open ‘Sublime Text’

2. File => Open File

3. D: => @CYBER => 1.FRAME => B.Log_Files => log_files => auth.log

61 62 Processing Raw Cybersecurity Data – Example EXERCISE: Assessing Semi-Structured Log Data 1. What are we looking at – what is this? • What system? http://honeynet.org/challenges/2010_5_log_mysteries • What types of logs / format? https://help.ubuntu.com/community/LinuxLogFiles • How ‘verbose’? Inherent structure (or lack thereof)? • How might we build context (structure more granularly)? • Schema availability? Data dictionary? Guidance from experts? • For example, https://help.ubuntu.com/community/LinuxLogFiles • Codes and indicators - descriptive versus opaque?

63 Cyber Forensics: Log File Analysis – HoneyNet Challenge Virtual Server Log File Analysis (Virtual Ubuntu Linux Server) 1. Was the system compromised and when? How do you know that for sure? 2. If the system was compromised, what was the method used? 3. Can you locate how many attackers failed? If some succeeded, how many were they? 4. How many stopped attacking after the first success? 5. What happened after the brute force attack? 6. Locate the authentication logs. Was a brute force attack performed? if yes, how many? 7. What is the timeline of significant events? How certain are you of the timing? 8. Anything else that looks suspicious in the logs? Any misconfigurations? Other issues? 9. Was an automatic tool used to perform the attack? if yes, which one? 10. What can you say about the attacker's goals and methods? http://honeynet.org/challenges/2010_5_log_mysteries

64 Processing Raw Cybersecurity Data – Example EXERCISE: Assessing Semi-Structured Log Data 2. How can we reduce and structure? • Parsing and extracting – many options • Commercial tools (e.g. Splunk, Excel) and free tools (e.g. MS Log Parser) • Scripting/programmatic (e.g. Perl, R, Python, SAS DS2, UNIX GREP, PowerShell) 3. Where do we store and transform subsequently? • Flat files • Database • SIEM or other specialized repository • ‘Data Lake’ (note: also can dump the raw logfile there too) • Analytics platform 65 Demonstration / Exercise EXERCISE: Assessing Semi-Structured Log Data

1. Open ‘Log Parser 2.2’ More fun with LogParser! 2. Examples of structured queries to try: https://mlichtenberg.wordpress.com/2011/02/03/log ALL FAILED AUTHENTICATION EVENTS: -parser-rocks-more-than-50-examples/ logparser -i:TEXTLINE -o:csv "SELECT * INTO 'D:\@CYBER\1.FRAME\B.Log_Files\results\log_extract_failed.txt' FROM 'D:\@CYBER\1.FRAME\B.Log_Files\log_files\auth.log' WHERE Text LIKE '%Failed password%'" ALL FAILED ROOT ACCESS AUTHENTICATION EVENTS – STRUCTURED: logparser -i:TEXTLINE -o:csv "SELECT EXTRACT_TOKEN (Text, 8, ' ') AS UserName, EXTRACT_TOKEN (Text, 10, ' ') AS IPSource, EXTRACT_TOKEN (Text, 12, ' ') AS Port, EXTRACT_TOKEN (Text, 10, ' ') AS Protocol INTO 'D:\@CYBER\1.FRAME\B.Log_Files\results\log_extract_failed_root.txt' FROM 'D:\@CYBER\1.FRAME\B.Log_Files\log_files\auth.log' WHERE Text LIKE '%Failed password for root%'" RESULTS D:\@CYBER\1.FRAME\B.Log_Files\results\ 66 Exercise Review

67 Copyright © SAS Institute Inc. All rights reserved. Conclusion

68 Copyright © SAS Institute Inc. All rights reserved. Section Review

69 Copyright © SAS Institute Inc. All rights reserved. Cybersecurity Data Science as a Process

Data Engineering Advanced Analytics Triage / Validate Remediate

Diagnostics & Establishing Predictive Anomaly Behavioral ? patterns baselines modelling detection insights

DATA DataENGINEER Manager Scientist Cyber INVESTIGATOR Data Scientist CASE MGMT Infosec Response Investigator

70 Cybersecurity Analytics Maturity

Data-aware Risk Awareness / Anomaly Detection Predictive Detection Investigations Resource Optimization

• Big data management • Flags, rules, and alerts ______• Structured data • CountsCHASING of key measuresPHANTOM • FoundationPATTERNS for comparisons across entities and over time

71 Cybersecurity Analytics Maturity

Data-aware Risk Awareness / Anomaly Detection Predictive Detection Investigations Resource Optimization

• Big data management • Flags, rules, and alerts ______• Structured data • Counts of key measures • Foundation for comparisons across entities and over time

72 Cybersecurity Data Science (CSDS) Lifecycle

Monitor Frame

DECIDE Deploy Explore DETECTION DISCOVERY

Validate Engineer

Model

Copyright © SAS Institute Inc. All rights reserved. Cybersecurity Data Science (CSDS) TOPIC 1. FRAME 2. DATA 3. DISCOVER 4. DETECT 5. DEPLOY

74