Adversarial Analytics 101

Strata Conference & Hadoop World October 28, 2013

Robert L. Grossman
Open Data Group

Examples of Adversarial Analytics

• Compete for resources
  – Low latency trading
  – Auctions
• Steal someone else’s resources
  – rings
  – Insurance fraud rings
• Hide your activity
  – Click spam
  – Exfiltration of information

Conventional vs Adversarial Analytics

Attribute | Conventional Analytics | Adversarial Analytics
Type of game | Gain / Gain | Gain / Loss
Subject | Individual | Group / ring / organization
Behavior | Drifts over time | Sudden changes
Transparency | Behavior usually open | Behavior usually hidden
Automation | Manual (an individual) | Sometimes another model

1. Points of Compromise

Source: Wall Street Journal, May 4, 2007

Case Study: Points of Compromise

• TJX compromise
• Wireless point-of-sale devices compromised
• Personal information on 451,000 individuals taken
• Information used months later for fraudulent purchases
• WSJ reports that 45.7 million credit card accounts are at risk
• “TJX's breach-related bill could surpass $1 billion over five years”

Source: Wall Street Journal, May 4, 2007

Learning Tradecraft

The adversary is often a group, criminal ring, etc.

It’s Not an Individual…

Source: Justice Department, Times.

Gonzalez Tradecraft

1. Exploit vulnerabilities in wireless networks outside of retail stores.
2. Exploit vulnerabilities in databases of retail organizations.
3. Gain unauthorized access to networks that process and store credit card transactions.
4. Exfiltrate 40 million track 2 records (data on a credit card’s magnetic strip).

Gonzalez Tradecraft (cont’d)

5. Sell track 2 data in Eastern Europe, USA, and other places.
6. Create counterfeit ATM and debit cards using track 2 data.
7. Conceal and launder the illegal proceeds obtained through web currencies in the US and .

Source: US District Court, District of Massachusetts, Indictment of Albert Gonzalez, August 5, 2008.

Data Is Large

• TJX compromise data is too large to fit into memory
• Data is difficult to fit into a database
• Millions of possible points of compromise
• Data must be kept for months to years

Source: Wall Street Journal, May 4, 2007

2. Introduction to Adversarial Analytics

Conventional Analytics
• Example: predictive model to select online ads.
• Example: predictive model to classify sentiment of a customer service interaction.

Adversarial Analytics
• Example: commercial competitor trying to maximize trading gains.
• Example: criminal gang attacking a system and hiding its behavior.

What’s Different About the Models?

Attribute | Conventional Analytics | Adversarial Analytics
Data | Labeled | Unlabeled
Data size | Small to large | Small to very large
Model | Classification / Regression | Clustering, change detection
Frequency of update | Monthly, quarterly, yearly | Hourly, daily, weekly, …

Types of Adversaries

Level | Home Team | Adversary
$10,000,000 | Use specialized teams and approaches. | Change the game.
$100,000 | Build custom analytic models. | Create new analytic techniques.
$1,000 | Use products. | Use existing analytic techniques.

Source: This approach is based in part on the threat hierarchy in the Defense Science Board (DSB) Report on Resilient Military Systems and the Cyber Threat, January 2013

What is Different About the Adversary?

Attribute | Conventional Analytics | Adversarial Analytics
Entity | Person browsing | Gang attacking a system
What is modeled? | Events, entities | Events, entities, rings
Behavior | Component of natural behavior | Waiting for opportunities
Evolution | Drift | Active change in behavior
Obfuscation | Ignore | Hide

Updating Models

Question | Conventional Analytics | Adversarial Analytics
When do you update model? | When there is a major change in behavior | Frequently, to gain an advantage.
Process for updating models | Automation and analytic infrastructure helpful. | Automation and analytic infrastructure essential.

3. More Examples

Click Fraud?

• Is a click on an ad legitimate?
• Is it done by a computer or a human?
• If done by a human, is it an adversary?

Low Latency Trading

[Figure: trading diagram with the labels Broker/Dealer, ECN, Trades, Market data, Low latency traders, Adversarial trader, Algorithm, Back testing]

• 2+ billion bids, asks, executes / day
• > 90% are machine generated
• Most are cancelled
• Decisions are made in ms

Connecting Chicago & NYC

Technology | Vendor | Round Trip Time (ms)
Microwave | Windy Apple | 9.0
Dark fiber, shorter path | Spread Networks | 13.1
Standard fiber, ISP | Various | 14.5+

Source: lowlatency.com, June 27, 2012; Wired Magazine, August 3, 2012

Example: Conficker

[Figure: command and control center and infected PCs]

Source: Conficker Working Group: Lessons Learned, June 2010 (Published January 2011), www.confickerworkinggroup.org

4. Building Models for Adversarial Analytics

Building Models to Identify Fraudulent Transactions

Building the model:
1. Get a dataset of labeled data (i.e. fraud is labeled).
2. Build a candidate predictive model that scores each transaction with likelihood of fraud.
3. Compare lift of candidate model to current champion model on hold-out dataset.
4. Deploy new model if performance is better.

Scoring transactions:
1. Get next event (transaction).
2. Update one or more associated feature vectors (account-based).
3. Use model to process updated feature vectors to compute scores.
4. Post-process scores using rules to compute alerts.

Building Models to Identify Fraud Rings

[Figure: network linking drivers, repair shops, and clinics]

• Look for suspicious patterns and relationships among claims.
• Look for hidden relationships through common phone numbers, nearby addresses, etc.

Method 1: Look for Testing Behavior

• Adversaries often test an approach first.
• Build models to detect the testing.

Method 2: Identify Common Behavior

• Common clustering algorithms (e.g. k-means) require specifying the number of clusters.
• For many applications, we keep the number of clusters k relatively small.
• With microclustering, we produce many clusters, even if some of them are quite similar.
• Microclusters can sometimes be labeled.

Microcluster Based Scoring
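A minimal sketch of microcluster-based scoring, under stated assumptions: the slides do not give a scoring formula, so the ratio used in `bad_score`, the majority-vote labeling in `label_clusters`, the evenly spaced initialization, and the toy data are all illustrative choices rather than the author's method. The idea is to fit more clusters than there are natural groups, label each microcluster from its known members, and score a new feature vector by how close it lies to the nearest bad microcluster relative to the nearest good one.

```python
import math

def nearest(point, centers):
    """Return (index, distance) of the center closest to point."""
    return min(((i, math.dist(point, c)) for i, c in enumerate(centers)),
               key=lambda pair: pair[1])

def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm with a deterministic, evenly spaced start."""
    n = len(points)
    centers = [points[i * n // k] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: index of the nearest center for every point.
        assignment = [nearest(p, centers)[0] for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    assignment = [nearest(p, centers)[0] for p in points]
    return centers, assignment

def label_clusters(assignment, labels, k):
    """Label each microcluster by majority vote of its known members.

    Clusters with no known members stay unlabeled (None).
    """
    cluster_labels = []
    for c in range(k):
        known = [l for a, l in zip(assignment, labels) if a == c and l is not None]
        if not known:
            cluster_labels.append(None)
        else:
            bad_share = known.count("bad") / len(known)
            cluster_labels.append("bad" if bad_share > 0.5 else "good")
    return cluster_labels

def bad_score(x, centers, cluster_labels):
    """Score in [0, 1]; higher means closer to a bad microcluster.

    Assumed formula: d_good / (d_good + d_bad). Raises ValueError if there
    is no good-labeled or no bad-labeled microcluster.
    """
    d_good = min(math.dist(x, c) for c, l in zip(centers, cluster_labels) if l == "good")
    d_bad = min(math.dist(x, c) for c, l in zip(centers, cluster_labels) if l == "bad")
    return d_good / (d_good + d_bad)

# Toy data: two well-separated groups, deliberately split into 4 microclusters.
good = [(0.5 * x, 0.5 * y) for x in range(3) for y in range(3)]
bad = [(10 + 0.5 * x, 10 + 0.5 * y) for x in range(3) for y in range(3)]
points = good + bad
labels = ["good"] * len(good) + ["bad"] * len(bad)

centers, assignment = kmeans(points, k=4)
cluster_labels = label_clusters(assignment, labels, k=4)

score_near_good = bad_score((0.2, 0.3), centers, cluster_labels)
score_near_bad = bad_score((10.4, 10.6), centers, cluster_labels)
print(score_near_good, score_near_bad)
```

In practice k would be much larger and only some microclusters would carry a label; unlabeled microclusters are simply ignored by `bad_score` in this sketch.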

[Figure: a good cluster and a bad cluster]

• The closer a feature vector is to a good cluster, the more likely it is to be good.
• The closer it is to a bad cluster, the more likely it is to be bad.

Method 3: Scaling Baseline Algorithms

Drive-by exploits

Points of compromise (Source: Wall Street Journal)

What are the Common Elements?

• Time stamps
• Sites – e.g. Web sites, vendors, computers, network devices
• Entities – e.g. visitors, users, flows

• Log files fill disks; many, many disks
• Behavior occurs at all scales
• Want to identify phenomena at all scales
• Need to group “similar behavior”

Examples of Site-Entity Data

Example | Sites | Entities
Drive-by exploits | Web sites | Computers
Compromised user accounts | Compromised computers | User accounts
Compromised payment systems | Merchants | Cardholders

Source: Collin Bennett, Robert L. Grossman, David Locke, Jonathan Seidman and Steve Vejcik, MalStone: Towards a Benchmark for Analytics on Large Data Clouds, The 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2010), ACM, 2010.

[Figure: Exposure Window and Monitor Window on a time axis marked d_{k-1} and d_k; sites and entities]

The Mark Model

• Some sites are marked (the percent marked is a parameter, and the type of sites marked is a draw from a distribution).
• Some entities become marked after visiting a marked site (this is a draw from a distribution).
• There is a delay between the visit and when the entity becomes marked (this is a draw from a distribution).
• There is a background process that marks some entities independent of any visit (this adds noise to the problem).

Subsequent Proportion of Marks

• Fix a site s[j].
• Let A[j] be the entities that transact at s[j] during the exposure window (ExpWin); if an entity is marked, its visit must occur before the mark.
• Let B[j] be all entities in A[j] that become marked sometime during the monitor window (MonWin).
• The subsequent proportion of marks is r[j] = | B[j] | / | A[j] |.
• Easily computed with MapReduce.
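The definition above can be sketched as a map step and a reduce step in plain Python. This is an illustration only, not the MalStone reference code: the record layout `(site, entity, visit_time, mark_time)`, the half-open windows, and the function names are assumptions made for the example, and a dictionary stands in for the shuffle phase of a real MapReduce job.

```python
from collections import defaultdict

def map_record(record, exp_win, mon_win):
    """Map step: emit (site, (entity, marked_in_mon_win)) for records in A[j].

    Assumed record layout: (site, entity, visit_time, mark_time), where
    mark_time is None for an entity that is never marked. Windows are
    assumed to be half-open intervals (start, end).
    """
    site, entity, visit_time, mark_time = record
    in_exp_win = exp_win[0] <= visit_time < exp_win[1]
    visit_before_mark = mark_time is None or visit_time < mark_time
    if in_exp_win and visit_before_mark:
        marked_in_mon_win = (mark_time is not None
                             and mon_win[0] <= mark_time < mon_win[1])
        yield site, (entity, marked_in_mon_win)

def reduce_site(values):
    """Reduce step: r[j] = |B[j]| / |A[j]| over distinct entities."""
    a = {entity for entity, _ in values}
    b = {entity for entity, marked in values if marked}
    return len(b) / len(a)

def subsequent_proportions(records, exp_win, mon_win):
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for record in records:
        for site, value in map_record(record, exp_win, mon_win):
            grouped[site].append(value)
    return {site: reduce_site(values) for site, values in grouped.items()}

records = [
    ("site_a", "e1", 1, 12),     # visit in ExpWin, marked in MonWin: in A and B
    ("site_a", "e2", 2, None),   # never marked: in A only
    ("site_a", "e3", 5, 3),      # marked before the visit: excluded
    ("site_a", "e4", 3, 25),     # marked after MonWin: in A only
    ("site_b", "e1", 4, 12),     # in A and B
    ("site_b", "e5", 11, None),  # visit outside ExpWin: excluded
]

ratios = subsequent_proportions(records, exp_win=(0, 10), mon_win=(10, 20))
print(ratios)  # site_a: 1 of 3 entities marked; site_b: 1 of 1
```

Sites with no qualifying visits do not appear in the result, which avoids dividing by an empty A[j].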

Computing Site Entity Statistics with MapReduce

Implementation | MalStone B
Hadoop MapReduce v0.18.3 | 799 min
Hadoop Python Streams v0.18.3 | 142 min
Sector UDFs v1.19 | 44 min

# Nodes | 20 nodes
# Records | 10 billion
Size of dataset | 1 TB

Source: Collin Bennett, Robert L. Grossman, David Locke, Jonathan Seidman and Steve Vejcik, MalStone: Towards a Benchmark for Analytics on Large Data Clouds, The 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2010), ACM, 2010.

5. Summary

Some Rules

1. Understand your adversary's tradecraft; he or she will go after the easiest or weakest link.
2. Your adversary will be creating new opportunities; you must create new models and analytic approaches.
3. Risk is usually viewed as the cost of doing business, but every once in a while the cost may be 100x – 1,000x greater.
4. The quicker you can update your strategy and models, the greater your advantage. Use automation (e.g. PMML-based scoring engines) to deploy models more quickly.

Questions?

About Robert L. Grossman

Robert Grossman (@bobgrossman) is the Founder and a Partner of Open Data Group, which specializes in building predictive models over big data. He is also a Senior Fellow at the Computation Institute and Institute for Genomics and Systems Biology at the University of Chicago and a Professor in the Biological Sciences Division. He has led the development of new open source software tools for analyzing big data, cloud computing, and high performance networking. Prior to starting Open Data Group, he founded Magnify, Inc., which provides data mining solutions to the insurance industry. Grossman was Magnify’s CEO until 2001 and its Chairman until it was sold to ChoicePoint in 2005. He blogs occasionally about big data, data science, and data engineering at rgrossman.com.

For More Information

www.opendatagroup.com [email protected]