Business white paper

Finding fraud in large and diverse data sets

Applying real-time, next-generation analytics to fraud detection and prevention using the HP Vertica Analytics Platform

A solution for real-time fraud detection

Fraud saps hundreds of billions of dollars each year from the bottom line of industries such as banking, insurance, retail, healthcare, government social services, and service providers. In the United States alone, estimates range as high as $994 billion in annual fraud-related losses.[1] A practice of pattern recognition that combines the analysis of current transactional data with known fraudulent activities, in addition to other statistical techniques, can yield predictive models for real-time prevention. This paper examines how the HP Vertica Analytics Platform is ideal for real-time fraud detection, and provides a real-world example of a credit card fraud scenario called skimming.

In the effort to identify and deter fraud, conventional wisdom still applies: Follow the money. That simple adage notwithstanding, the task of tracking fraud and its perpetrators continues to vex both private and public organizations. Advancements in information technology have made it possible to capture transaction data at the most granular level. For instance, in the retail trade alone, transmissions of up to 500 megabytes daily between individual point-of-sale sites and their data centers are typical.[2] Logically, such detail should result in greater transparency and greater capacity to fight fraud. Yet the sheer volume of data that organizations now maintain, pulled from so many sources and stored across a range of locations, has made the same organizations more vulnerable.[3] More points of entry amount to more opportunities for fraud. In its annual Global Fraud Report, The Economist found that 50% of all businesses surveyed acknowledged they were vulnerable to fraud; 35% of North American companies specifically cited IT complexity for increasing their exposure to risk.

Accordingly, the application of data mining as a security measure has become increasingly germane to modern fraud detection. Historically, data mining as a means of identifying trends from raw statistics can be traced back to the 1700s with the introduction of Bayes’ theorem. Since then, statistical analysis has evolved as a means for institutions to measure outcomes, identify customer behavioral trends, and make forecasts to support management decisions. Most recently, the development of advanced algorithms has made it possible to rapidly discern patterns, even from vast stores of disparate data. As a result, data mining can identify patterns within transaction records to shed light on potentially fraudulent actions by vendors, customers, or employees. The fraudulent pattern can then become a valuable point of reference; business parties that conduct the same type of transactions can then be scored based on their likelihood of fraud.

There are three relatively recent developments in statistical algorithms that support pattern matching for fraud detection. They include:

• Front-end outlier detection in multivariate data streams[4]
• Neural networks[5]
• Social network analysis[6]

Taking a toll: the figures on fraud

In all its forms, fraud hits business hard. The Association of Certified Fraud Examiners (ACFE) estimates that companies worldwide lose 5% of revenues to fraud, or about $3.5 trillion.[7] U.S. organizations alone lose about $994 billion per year. And roughly half of the organizations surveyed by the ACFE fail to recover any of their fraud-related losses. In its annual Global Fraud Report,[8] The Economist found that 75% of all organizations were victims of fraud at some level, including 66% in North America.

Meanwhile, in the U.S., the volume of consumer fraud complaints has escalated dramatically over the past decade. The Consumer Sentinel Network, a service of the U.S. Federal Trade Commission, identified 990,242 fraud-related complaints submitted to various authorities during 2011,[9] up from 137,306 in 2001. Cybercrime, which is reflected in the trend of consumer complaints, also continues to grow in severity. The Internet Crime Complaint Center (IC3) reported 314,246 complaints in 2011, up 3.4% over 2010, representing $485.3 million in losses. On average, IC3, a partnership between the FBI, the National White Collar Crime Center, and the Bureau of Justice Assistance, fields 26,000 complaints per month.[10]

External fraud vs. internal fraud

Fraud’s impact is felt across a vast spectrum of industries, and is manifested in a range of 30 categories of victimization, including bank credit cards, mail, loans, and utilities. Complaints are registered by government institutions, including legal and criminal justice organizations, as well as businesses of all sizes and individual citizens. The targets of these complaints are both external (consumers and organized criminals) and internal (employees of the victimized organizations). Historically, internal fraud generates the greatest attention, notably due to the big-ticket crimes carried out by rogue securities traders or corporate procurement staff. Among the recent cases involving a trader, Kweku Adoboli with Switzerland’s UBS bank was accused of stealing $2.03 billion through false accounting practices. Other forms of internal fraud that subject institutions to large losses include the theft of trade secrets and technology and asset misappropriation by employees.

While big-ticket fraud grabs the headlines, it’s a relatively small slice of the overall casebook. In fact, less than 5% of all fraud complaints involve more than $5,000 in losses to the victims. Rather, it’s small-scale external fraud that represents the significant majority of losses. These crimes come in the form of credit card fraud, medical reimbursement claims, social welfare benefits, check scams, or false invoicing. Of these external fraud crimes, credit card fraud is a study in scope and severity. Globally, fraudulent use of payment cards (including general purpose and private label credit cards, debit cards, and prepaid payment cards) generated $7.6 billion in losses in 2010,[11] up 10.2% from the previous year. The United States sustained a disproportionate share of those losses; while the U.S. registered 27% of worldwide payment card business in 2010, it reported nearly half (47%) of all losses, or $3.56 billion.

Internal fraud and external fraud are linked in numerous cases, and internal fraud can set the stage for external fraud. This paper nonetheless treats the two separately: it focuses on external fraud, and on how HP Vertica Analytics Platform provides a solution to help detect and prevent such incidents.[12]

Proactive vs. reactive: the HP Vertica Analytics Platform

Analysis of transaction data can provide a retroactive means of detecting fraud, but real-time use of transaction data can proactively step in to stop fraud. Faced with the complexities of Big Data, traditional relational systems can be ill equipped to turn data analysis into quick decisions, as the cost of designing, building, and deploying can be prohibitive.[13] The HP Vertica Analytics Platform, based on a grid-enabled, column-oriented system, is specifically designed to provide real-time intelligence from data warehouse operations. HP Vertica Analytics Platform can run queries 50 to 1,000 times faster than traditional row-oriented RDBMS.

In the following scenario, HP Vertica Analytics Platform will be applied to one of the most insidious forms of crime described above: credit card fraud, and more specifically, credit card skimming.

The scoop on skimming

Skimming is responsible for roughly 15% of all credit card fraud, and bankrate.com reports that skimming results in about $1 billion in annual losses. Skimming is the term for a fraudulent charge made on a credit card number by someone other than the card holder. A skimming incident is commonly recognized after a customer makes a series of transactions with a credit card, only to report at a later date that one or more of the charges on the bill was fraudulent. Typically, the complaint reflects that someone copied the victim’s credit card information and used the information to make the unauthorized transaction. Skimming crimes are commonly traced to unscrupulous merchants or their employees who have the opportunity to collect card information during a legitimate purchase. Consequently, a pattern matching approach to data analysis can determine which merchants establish a track record for being associated with fraudulent transactions.

While skimming can be the work of isolated operators, it’s also known to be run by sophisticated scam artists on an international scale. In one recent case, 32 Japanese men and women were charged with skimming from more than 250 credit cards, racking up charges across the globe. Investigators determined that fake cards were cloned after thieves captured card data by secretly skimming from transactions at hotels, restaurants, and stores in Japan. One group of five suspects was picked up after buying 4 million yen (about $51,000) worth of goods in Hawaii.

The truth is in the transactions

Identifying likely credit card skimmers starts with a comprehensive set of transaction data combined with a list of known fraud events. In this example, we’ll present two methods, one approximate and one more precise, to use in conjunction with transaction records and thus yield the probability that a particular transaction source (a store, online merchant, etc.) is skimming. Either approach requires a list of shopping transactions (who shopped at which merchants) and a list of which shoppers reported fraudulent transactions. Such data can be culled from basic transaction history that bears a date stamp. From this historical data, it’s then possible to determine the typical fraud rate for each merchant and apply that rate as a score, or risk factor.

On that basis, let’s launch this example based on a data set of transactions from purchases made at three merchants:

Merchant 1. Honest Abe’s, which has no skimming record
Merchant 2. Sketchy’s, which represents a 50% chance of skimming
Merchant 3. McFraudster’s, which always skims

Note: Because our data is synthetic, we already know who is skimming. Our challenge in this exercise is to see if the results match up with that knowledge.

Pattern matching for fraud detection

Intuitively, one would think the detection of skimming would prompt us to simply count the number of shoppers at each store who reported fraud. However, this method fails to account for popularity. For instance, a large retailer may have a comparatively low percentage of fraudulent charges, but due to its size and volume of business, its records will show a high raw number of complaints. In order to balance the scales, it’s necessary to calculate the percentage of customers who complain on a per-store basis. HP Vertica Analytics Platform enables such a calculation, and when that process is applied to our synthetic data set, it yields these results:

Store             Per-store rate of fraud complaints
Honest Abe’s      45.45%
McFraudster’s     100%
Sketchy’s         68.75%

As we would expect, McFraudster’s lives down to its name, and Sketchy’s is bad more often than not, both red flags for the bank’s risk management department. But surprisingly, Honest Abe’s also appears to be less than trustworthy, even though our prior knowledge tells us Honest Abe’s doesn’t skim. That’s because these results are complicated by the fact that Honest Abe’s shoppers also shop at McFraudster’s and Sketchy’s, which a simple percentage calculation doesn’t take into account. Additionally, to conduct the simple calculation, we had to discard the date stamps on the purchases, which are essential to determine the relative frequency with which each credit card customer visits each store. The answer lies in a method that accounts for both merchant mix and relative frequency, which leads us to take a second approach.
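The per-store percentage calculation can be illustrated with a minimal Python sketch. It is not the paper's HP Vertica query: the customer IDs, the toy transaction list, and the function name per_store_complaint_rate are all hypothetical, chosen only to show how the rate is computed and why an honest store can still score high when its shoppers also visit skimming merchants.

```python
from collections import defaultdict

# Hypothetical toy data (not the paper's synthetic data set):
# one (customer, store) pair per purchase.
transactions = [
    ("c1", "Honest Abe's"), ("c1", "McFraudster's"),
    ("c2", "Honest Abe's"),
    ("c3", "Sketchy's"), ("c3", "Honest Abe's"),
    ("c4", "Sketchy's"),
]
# Customers who later reported a fraudulent charge (also hypothetical).
fraud_reporters = {"c1", "c3"}


def per_store_complaint_rate(transactions, fraud_reporters):
    """Return {store: percentage of its distinct shoppers who reported fraud}."""
    shoppers = defaultdict(set)
    for customer, store in transactions:
        shoppers[store].add(customer)
    return {
        store: 100.0 * len(customers & fraud_reporters) / len(customers)
        for store, customers in shoppers.items()
    }


rates = per_store_complaint_rate(transactions, fraud_reporters)
for store in sorted(rates):
    print(f"{store}: {rates[store]:.2f}%")
```

In this toy data, Honest Abe's scores about 67% even though both of its fraud-reporting shoppers also bought from one of the other two merchants, which is the same distortion the per-store table shows.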

Building a better method of analysis

We start with the same two lists: 1) which customers made purchases at which stores, and when; and 2) which customers reported fraudulent transactions. Next, we transform list 1 into a set of continuous predictor variables.[14] This allows us to provide an integer answer to the question, “How many times does the customer conduct a transaction at this store?” List 2 becomes a dichotomous dependent variable, providing a yes/no answer to the question, “Did the customer eventually report a fraudulent transaction?” As a result we create a set of linear equations, one per customer i, in the following form:

$$\beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_n x_{in} + \varepsilon_i = y_i$$

Here x_ij is the number of transactions customer i conducted at store j, y_i is 1 if customer i reported fraud and 0 otherwise, β_j is the coefficient for store j, and ε_i is the error term.

At this stage, since we know the values of X and Y, we can solve for the βs by using linear regression. However, this equation requires multivariate regression. And because HP Vertica Analytics Platform is capable only of single-variable linear regression, we’ll employ the support of a third-party instrument. Among many options available to us, we select R[12] since HP Vertica Enterprise Edition runs R scripts natively.

Before we proceed, there’s one more hurdle to clear. The first step in performing linear regression is to place all the x_in values into a matrix. But a real data set may contain tens of thousands of merchants and millions of account numbers, requiring billions or trillions of cells for the matrix. And a single server running R can’t crunch a trillion-cell matrix. Fortunately, since many customers won’t have shopped at most stores, many of the x_in values will be zero. As a result, we can use sparse matrix techniques, specifically the SparseM package offered by R, to reduce the necessary computing power for the linear regression.

Setting up a solution environment requires the following resources:

• One database in HP Vertica Analytics Platform
• One ODBC DSN for the database in HP Vertica Analytics Platform
• The R base system, which is available via download for a variety of platforms at cran.r-project.org/bin

With these tools in place, we can move through the steps to install SparseM and establish a foundation for performing the linear regression and populating our matrix. Now we can go forward to solve the equations for β.

Cutting through the complexity

To the business or management reader without a background in statistics, or for those of us willing to admit we’ve probably forgotten much of what we’ve learned over the years, this formula may seem complex, and it may lead you to believe its computational representation in code and the time necessary to build that code would be onerous. With traditional RDBMS and separate analytics platforms it is, but with HP Vertica Analytics Platform nothing could be further from the truth. In reading our technical white paper, your IT staff working on fraud detection will find that it takes no more than 20 lines of code to implement on the HP Vertica Analytics Platform.[15]
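To make the regression step concrete, here is a minimal sketch in Python rather than the R and SparseM setup the paper uses. It stores only the non-zero visit counts per customer, accumulates the normal equations from those sparse rows, and solves them by Gaussian elimination. The toy data, the function names, and the choice of the normal-equations method are all assumptions made for illustration; the sketch also assumes the resulting system is non-singular.

```python
from collections import Counter

# Hypothetical toy data: one (customer, store) entry per transaction.
transactions = [
    ("c1", "Honest Abe's"), ("c1", "Honest Abe's"),
    ("c2", "Honest Abe's"), ("c2", "McFraudster's"),
    ("c3", "Sketchy's"),
    ("c4", "Sketchy's"),
    ("c5", "Honest Abe's"), ("c5", "Sketchy's"),
    ("c6", "McFraudster's"),
]
fraud_reporters = {"c2", "c3", "c6"}  # hypothetical y_i = 1 customers


def build_sparse_rows(transactions):
    """Map each customer to {store: visit count}; zero counts are never stored."""
    rows = {}
    for customer, store in transactions:
        rows.setdefault(customer, Counter())[store] += 1
    return rows


def solve_betas(rows, fraud_reporters):
    """Least-squares fit of y_i = sum_j beta_j * x_ij (no intercept term)."""
    stores = sorted({s for row in rows.values() for s in row})
    idx = {s: k for k, s in enumerate(stores)}
    n = len(stores)
    xtx = [[0.0] * n for _ in range(n)]  # X^T X, only n x n
    xty = [0.0] * n                      # X^T y
    for customer, row in rows.items():
        y = 1.0 if customer in fraud_reporters else 0.0
        for s1, v1 in row.items():       # touch non-zero cells only
            xty[idx[s1]] += v1 * y
            for s2, v2 in row.items():
                xtx[idx[s1]][idx[s2]] += v1 * v2
    # Gauss-Jordan elimination with partial pivoting on [X^T X | X^T y].
    aug = [xtx[r] + [xty[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return {s: aug[idx[s]][n] / aug[idx[s]][idx[s]] for s in stores}


rows = build_sparse_rows(transactions)
betas = solve_betas(rows, fraud_reporters)
for store, beta in sorted(betas.items(), key=lambda kv: -kv[1]):
    print(f"{store}: {beta:.4f}")
```

The point of the sparse representation is that the work scales with the number of purchases actually made, not with customers times merchants. In this toy data the coefficient for Honest Abe's comes out slightly negative, a reminder that a β value is a relative score and not a percentage.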

Interpreting the results

With the application of HP Vertica Analytics Platform and R[12], linear regression has helped us determine a β value for each of the three stores in our analysis. The higher the β value, the higher the probability the merchant is skimming. Be aware that β does not constitute a percentage, nor does it correspond to the frequency with which a particular merchant is engaged in skimming. However, it gives us a relative measure among the sample set of merchants, and allows us to focus on the most likely perpetrators.

Store             Beta value
Honest Abe’s      0.015820698747528
McFraudster’s     0.829927488464074
Sketchy’s         0.301252471984179

In the end, the β values reflect that most of the reported fraud is linked to transactions at McFraudster’s. Sketchy’s is worthy of attention, although it still has a far lower risk profile than McFraudster’s, while the likelihood of fraud at Honest Abe’s is much slimmer than we may have suspected from the initial percentage calculations. Remember, in that calculation, the rate of fraud at Honest Abe’s registered at 45.5%.

Still, we want to know if the data can also help us determine when the fraud occurred, not just where. To that end, let’s take a closer look at Sketchy’s to see if we can identify the dates. By repeating the percentage analysis that we used earlier, and focusing only on transactions at Sketchy’s, HP Vertica Analytics Platform yields the following results:

Transaction date    % of shoppers reporting fraud
3/2/12              100
3/1/12              57.14
3/3/12              57.14

Next, we apply the R script to perform a linear regression for the Sketchy’s transactions on these dates, providing the following set of β values:

Transaction date    Beta value
3/1/12              0.2545
3/2/12              0.8545
3/3/12              0.2545

The resulting figures provide high confidence that the credit card skimming occurred on March 2, 2012.

Conclusions

In the above example, we started our data analysis with two lists: a transaction history for a credit card issuer (a bank), and a list of credit card customers who reported fraudulent purchases against their cards. Next, we applied two methods of analysis to look for statistical correlations that connect merchants and transaction dates with reports of fraudulent charges. The first approach makes use solely of the HP Vertica Analytics Platform, while the second approach applies HP Vertica Analytics Platform in conjunction with R, a software environment for statistical computing and graphics.

Our analysis turned up not only which merchants were likely to be engaged in credit card skimming, but even when the skimming activity was most likely to have occurred. The resulting information can prove highly valuable to a bank or payment processor, allowing them to clear their system of potential sources of fraud. Additionally, the outcomes provide a prospective use for the bank: once a set of merchants and dates has been identified, it is simple to convert these conditions into WHERE clauses and run them against an entire transaction history. Furthermore, if this process is executed in a timely fashion, it can also disclose potential skimming victims before they even report fraudulent charges. It positions a bank to move proactively and minimize exposure by freezing credit cards or warning individual victims.
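The prospective use described in the conclusions, running the identified merchant and date conditions against the transaction history to find cardholders who have not yet complained, can be sketched in a few lines of Python. This stands in for the SQL WHERE clause the paper mentions; the history records, customer IDs, and the function name potential_victims are hypothetical.

```python
from datetime import date

# Hypothetical transaction history: (customer, store, purchase date).
history = [
    ("c1", "Sketchy's", date(2012, 3, 1)),
    ("c2", "Sketchy's", date(2012, 3, 2)),
    ("c3", "Honest Abe's", date(2012, 3, 2)),
    ("c4", "Sketchy's", date(2012, 3, 2)),
    ("c4", "Honest Abe's", date(2012, 3, 3)),
]
# Customers who have already reported a fraudulent charge (hypothetical).
already_reported = {"c2"}
# (merchant, date) pairs flagged by the analysis; the filter below plays the
# role of a clause like: WHERE store = 'Sketchy''s' AND day = '2012-03-02'.
suspect_conditions = {("Sketchy's", date(2012, 3, 2))}


def potential_victims(history, suspect_conditions, already_reported):
    """Return customers exposed to a flagged merchant/date who have not yet reported fraud."""
    exposed = {
        customer
        for customer, store, day in history
        if (store, day) in suspect_conditions
    }
    return sorted(exposed - already_reported)


at_risk = potential_victims(history, suspect_conditions, already_reported)
print(at_risk)
```

The returned list is the set of cards a bank might freeze or the customers it might warn before any fraudulent charge is reported.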

Why HP Vertica Analytics Platform?

The basic analytic techniques described in the example above are not complex or novel. In fact, they have been practiced in various forms for well over 200 years.[16] But the application of these techniques to unravel credit card fraud has emerged only recently. That’s due to the inability of analytic methods to digest and assess the amount of data associated with credit card transactions. Just consider the transaction records collected by even a modest-sized banking operation. If the bank had 1 million retail customers, and each of those customers used their credit cards only once per day, that adds up to 365 million distinct transactions over a year. If the bank expands its customer base, or customers increase their daily use, annual transactions could surpass 1 billion. Apply these ratios to a larger bank system, which may have up to 200 million customers,[17] and you realize one institution can easily be handling tens of billions of transaction records. The availability of data processing systems that can store that much transaction volume, let alone make it actionable data for the bank, is limited.

The HP Vertica Analytics Platform represents a new level of capability in this space. With MPP architecture that features linear scalability, the database in HP Vertica Analytics Platform can comfortably store trillions of records; real-world deployments of HP Vertica Analytics Platform hold more than 10 trillion rows. It’s comparatively simple to implement, and most importantly, HP Vertica Analytics Platform processes data far faster than traditional RDBMS, with a smaller hardware footprint in the data center, and at a lower total cost of ownership than competing systems.[18]

Globally, the rate of fraud has taken a promising turn downward since an all-time high in 2008. However, the sophistication of criminals and the opportunity for fraud evolve along with the advancements in technology. To protect themselves from these attacks, businesses and organizations will have to harness increasing amounts of data in real time, with in-line processing algorithms. The HP Vertica Analytics Platform provides the best line of defense to keep the would-be perpetrators at bay.

[1] The Association of Certified Fraud Examiners, “2012 Report to the Nations”
[2] Impact of Technology in Retail Industry, MBA Knowledge Base, mbaknol.com/retail-management/impact-of-technology-in-retail-industry/
[3] Anuj Sharma and Prabin Kumar Panigrahi, “A Review of Financial Accounting Fraud Detection based on Data Mining Techniques,” International Journal of Computer Applications 39(1):37-47, February 2012. Published by Foundation of Computer Science, New York, USA, ijcaonline.org/archives/volume39/number1/4787-7016
[4] Ben-Gal I., Outlier detection, In: Maimon O. and Rokach L. (Eds.), Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers, Kluwer Academic Publishers, 2005
[5] MA Bihina Bella, MS Olivier and JHP Eloff, “A fraud detection model for Next-Generation Networks,” in D Browne (ed), Southern African Telecommunication Networks and Applications Conference 2005 (SATNAC 2005) Proceedings, Vol 1, 321-326, Champagne Castle, South Africa, September 2005
[6] Implementing Social Network Analysis for Fraud Prevention, CGI, 2011
[7] The Association of Certified Fraud Examiners, “2012 Report to the Nations”
[8] Global Fraud Report, Economist Intelligence Unit Survey Results, Annual Edition 2011/12, managementthinking.eiu.com/sites/default/files/downloads/KRL_FraudReport2011-12_US.PDF
[9] Consumer Sentinel Network Data Book, January - December 2011, US Federal Trade Commission, February 2012, ftc.gov/sentinel/reports/sentinel-annual-reports/sentinel-cy2011.pdf, p. 5
[10] Internet Computer Crime Center, 2010 Internet Crime Report, 2011, ic3.gov/default.aspx
[11] “Global Credit Card Fraud Losses Increased 10.2% over 2009”, The Nilson Report, November, 2011. Source: The Nilson Report, September 2011
[12] Responsible Information Management: Ensuring Data Privacy in the Enterprise, Andrew Joiner, Division CEO, Autonomy, and Adam Ecker, Sr. Technologist, Autonomy. To learn more about HP Vertica solutions for internal fraud, see a separate report on Autonomy ICE solutions: publications.autonomy.com/docs/Responsible%20Information%20Management
[13] 5 Signs You Might Be Outgrowing Your MySQL Data Warehouse: And Why Vertica May Be The Right Fit, HP Vertica White Paper, May 2012
[14] dtreg.com/vartype.htm
[15] Using Credit Card Transaction Records to Detect Skimming, HP Technical White Paper, August 2012
[16] en.wikipedia.org/wiki/Regression_analysis#History
[17] newyorkfed.org/research/epr/07v13n3/0712hirt.pdf
[18] “5 Signs You Might Be Outgrowing Your MySQL Data Warehouse,” Vertica White Paper, vertica.com/wp-content/uploads/2012/05/Vertica_MySQL_WhitePaper.pdf


© Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.

4AA1-3292ENW, Created August 2012