Unsupervised Anomaly Detection in Receipt Data
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

ANDREAS FORSTÉN
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Master in Computer Science
Date: September 17, 2017
Supervisor: Professor Örjan Ekeberg
Examiner: Associate Professor Mårten Björkman
Swedish title: Oövervakad anomalidetektion i kvittodata

Abstract

With the progress of data handling methods and computing power comes the possibility of automating tasks that are not necessarily handled by humans. This study was done in cooperation with a company that digitalizes receipts for companies. We investigate the possibility of automating the task of finding anomalous receipt data, which could automate the work of receipt auditors. We study both anomalous user behaviour and individual receipts. The results indicate that automation is possible, which may reduce the necessity of human inspection of receipts.

Keywords: Anomaly detection, receipt, receipt digitalization, automatization

Sammanfattning (Summary)

With the advances made in data handling and computing power comes the possibility of automating tasks that are not necessarily performed by humans. This study was done in cooperation with a company that digitalizes companies' receipts. We investigate the possibility of automating the search for anomalous receipt data, which can relieve auditors. We study both anomalous user behaviour and individual receipts. The results indicate that automation is possible, which may reduce the need for human inspection of receipts.

Keywords: Anomaly detection, receipt, receipt digitalization, automation

Contents

1 Introduction
  1.1 Problem description
  1.2 Ethical considerations
  1.3 Related Work
2 Background
  2.1 Anomaly detection
    2.1.1 Evaluation
    2.1.2 Summary
  2.2 Methods for anomaly detection
    2.2.1 Temporal anomaly detection
    2.2.2 Global anomaly detection
    2.2.3 Local anomaly detection
3 Method
  3.1 Data
    3.1.1 Data selection
    3.1.2 Data description
  3.2 Automatic auditing
    3.2.1 User characterization
    3.2.2 Temporal analysis
  3.3 Evaluation
  3.4 Implementation
4 Results
  4.1 User characterization
  4.2 Time series analysis
5 Discussion
  5.1 User characterization
  5.2 Time series analysis
  5.3 Further work

Chapter 1
Introduction

Anomaly detection, also commonly referred to as outlier detection, is the process of finding data points that deviate from some measure of normality. The problem has been intensely studied, and the resulting methods are used in everything from credit card fraud detection to monitoring of patient medical data (Chandola, Banerjee, and Kumar 2009).

The origin of anomalies naturally varies with the field of study. They may arise from e.g. fraud attempts, sensor faults or user mistakes in a data input interface. A thorough understanding of the ’black box’ between the mechanism generating the data and the data itself is therefore usually very helpful, or even necessary, for finding interesting outliers. This is one of the reasons why domain-specialized studies are common in the field of anomaly detection. This thesis is such a domain-specialized study, with a focus on receipt data.

1.1 Problem description

This work is done in cooperation with a company that provides a service for digitally handling receipts. Typical customers are companies in need of a service that lets employees easily structure and report expenses made on behalf of the company in order to be refunded.
The user fills in a form for each receipt, where he or she may choose either to manually enter fields such as total cost, date of expense, VAT cost etc., or to let the system suggest completed fields based on an OCR reading of the receipt.

After completing the form, the user sends the data to the person or persons responsible for handling receipts at the company (henceforth referred to as auditors). The user’s request is either accepted or rejected, where a reason for rejection may be e.g. that the auditor considers the expense to be of a personal nature.

Figure 1.1: Receipt digitalization system.

The expenses reported through this system have been stored in a dataset that is henceforth referred to as the Expense dataset. All data is anonymized to avoid identification of individuals.

The original Expense dataset consists of approximately one million data points. In this study we use a subset of the dataset which only contains data from paying users with at least 500 receipts in the dataset. This leaves about 155,000 data points from 197 users. The reduction was done to 1) remove all data from users that merely test the system, and 2) be able to draw conclusions about user behaviour, which is difficult if too few data points are available.

The features of the dataset are:

• Transaction ID. Works as a primary key; it does not provide any information.
• User ID. Used to identify the user that the receipt belongs to.
• Company ID. Used to identify the company which the user belongs to.
• The date of the expense, that is, when the purchase was made.
• The exchange rate to SEK at the date of purchase.
• The Value Added Tax (VAT) cost.
• Total cost of the expense (that is, including VAT).
• The country of the expense.
• The currency of the expense. This may be different from the currency of the country feature, e.g. if someone buys something with euros in a country outside the eurozone.
• Number of attendees.
• An image URL, which provides a link to the image of the receipt.
• An ID that identifies a category of the transaction.
• Zero or more class tags of the transaction.

Most features are self-explanatory, but the last four warrant further explanation.

The number of attendees is an integer that is used in restaurant receipts where several people were present. This allows us to normalize the total cost of the restaurant visit by simply dividing by this number, as visits with a large number of attendees would otherwise appear as very large expenses.

The image URL provides a link to the receipt which the reported expense is based upon. This provides us with a ground truth for the expense report, and based on inspection of the image we may e.g. determine whether an error in the data input has been made.

For purposes of bookkeeping, receipts are stored in different categories, such as software or transport, which are identified in the dataset by the category ID. The categories used may vary with the company, since every company has the possibility of defining its own. This posed a problem when a classification mechanism based on the OCR reading was added to the system.

To solve this, a set of 26 class tags, such as advertising, aviation, restaurant or car maintenance, was introduced for the classification algorithm to use. The tags are designed to be non-overlapping and to encompass almost every reported receipt. Companies can then define their own categories and associate class tags with every category. For example, a company may have transport as a category and associate the tags taxi, aviation, bus, boat, train, parking and rental car with it. A data point classified with one of these tags would then be categorized as transport.

It would be useful if the web service offered a way for subscribing companies to automate the task of auditing receipts.
This study investigates this possibility in the context of the reduced Expense dataset. Thus, the following questions are evaluated in this study:

• How may we construct an automatic auditor that flags individual receipts and users for inspection in such a way that the flagging agrees well with human conceptions?
• How may such an automatic auditor’s performance be evaluated?
• Which existing anomaly detection methods may be used for this task?

1.2 Ethical considerations

The obvious ethical issue of this study is the one common to all big data studies: privacy invasion. It quickly becomes easy to monitor and study the behaviour of users, which makes surveillance of employees much easier for companies and for those in possession of the data. While the dataset we worked with was anonymized in terms of the user ID etc., it was still sometimes possible to identify individuals by name in the receipt images. A leak of such data could make it easy for unwanted individuals to study the patterns of persons or organizations.

Another potential problem with receipt digitalization could be the appearance of new types of forgeries. Currently, receipts must be kept in physical form, but if this demand is removed and perhaps only an image is necessary, then image manipulation could become a real problem.

1.3 Related Work

To the author’s knowledge, no previous study has been done on receipt data. The field that has been extensively studied and which is probably in some sense closest to this problem is credit card fraud detection. However, as the data generating mechanism is still different, in that private consumers have a different spending pattern from employees, insights from that field cannot be directly applied without modifications.

Credit card fraud detection was one of the early applications of data mining and machine learning techniques to outlier detection, and many designs have been conceived.
One approach is the construction of user profiles made of credit card transaction features, where new data points may be tested against this profile of normality.
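
To make the profile idea concrete, the following is a minimal Python sketch, not the method used in this thesis or in the credit card literature. It builds a per-user profile from the total cost normalized by the number of attendees (as described in Section 1.1) and tests a new receipt against it. The field names (user_id, total_cost, attendees), the choice of mean and standard deviation as the profile, the treatment of a zero attendee count as one, and the threshold of 3 standard deviations are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev


def per_attendee_cost(total_cost, attendees):
    """Normalize the total cost by the number of attendees.

    Assumption: a missing or zero attendee count is treated as 1.
    """
    return total_cost / max(attendees, 1)


def build_profiles(receipts):
    """Return {user_id: (mean, std)} of per-attendee cost for each user.

    `receipts` is a list of dicts with the hypothetical keys
    'user_id', 'total_cost' and 'attendees'.
    """
    costs = defaultdict(list)
    for r in receipts:
        costs[r["user_id"]].append(
            per_attendee_cost(r["total_cost"], r["attendees"])
        )
    # The sample standard deviation needs at least two data points,
    # so users with fewer receipts get no profile.
    return {u: (mean(c), stdev(c)) for u, c in costs.items() if len(c) >= 2}


def is_anomalous(profile, receipt, threshold=3.0):
    """Flag a receipt whose per-attendee cost deviates strongly from the profile."""
    mu, sigma = profile
    cost = per_attendee_cost(receipt["total_cost"], receipt["attendees"])
    if sigma == 0:
        # All historical costs were identical; any deviation is flagged.
        return cost != mu
    return abs(cost - mu) / sigma > threshold


# Illustrative, made-up history for a single user.
history = [
    {"user_id": "u1", "total_cost": c, "attendees": 1}
    for c in (100, 110, 90, 105, 95)
]
profiles = build_profiles(history)

shared_dinner = {"user_id": "u1", "total_cost": 400, "attendees": 4}
solo_dinner = {"user_id": "u1", "total_cost": 400, "attendees": 1}
```

With this made-up history, the dinner shared by four attendees falls inside the user's profile once normalized, whereas the same total cost for a single attendee is flagged. A single-feature profile is the simplest possible choice; a realistic profile would likely combine several features of the dataset.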