Late Payment Prediction of Invoices Through Graph Features
Total Page:16
File Type:pdf, Size:1020Kb
Late PAYMENT PREDICTION OF INVOICES THROUGH GRAPH FEATURES Master Thesis By: Arthur HoVANESYAN Late-payment prediction of invoices through graph features Arthur Hovanesyan to obtain the degree of Master of Science, Computer Science, Data Science Track at the Delft University of Technology, to be defended publicly on Wednesday December 18, 2019 at 12:30 PM. Student number: 4322711 Thesis committee: Associate Prof. dr. Pablo Cesar, TU Delft, (Chair) Assistant Prof. dr. Huijuan Wang, TU Delft (Supervisor) Assistant Prof. dr. Christoph Lofi, TU Delft Head Data Science. dr. Judith Redi, Exact (Supervisor) Preface This report concludes the thesis project and with it the two-year journey towards receiving my master’s de- gree in Computer Science. Thank you to Exact and especially to everyone in the Customer Intelligence/Data Science team for giving me the opportunity to work on such an awesome project. This was was not possible without the supervision of Dr. Judith Redi, Dr. Huijuan Wang, Dr. Bikash Joshi, Ir. Xiuxiu Zhan and of course everyone else from the team that chipped in and kept the morale high. Thank you for your guidance during the process. There are too many of you now to list. Thank you to my family and friends for your support throughout the journey. Without you all, this would not have been possible. Arthur Hovanesyan Delft, December 2019 iii Disclaimer The information made available by Exact for this research is provided for use of this research only and under strict confidentiality. v Abstract Keeping a steady cash flow is one of the biggest if not the biggest problem that Small to Medium Enterprises (SMEs) deal with daily. Within the different types of cash flow, Accounts Receivable (AR) classifies the balance of money that needs to be paid by the company’s customers. In the most typical case, after receiving goods or services, the customer receives an invoice with the amount that is owed to the supplier. However, this often does not happen before the aforementioned date, meaning that the invoice is often paid late. Intervention requires resources and over-intervention could cause unwanted customer dissatisfaction. Knowing whether an invoice is going to be paid late can be vital information. Current methods of late payment prediction focus only on the history between the seller and the buyer and are unusable when this history is not present. Intu- itively, one’s business depends on the relationships and transactions that it has with its neighbors. Suggesting that neighbor behavior could be useful when predicting the cash flow of a company. Unfortunately, this type of information is not always given and needs to be data mining from non-relational data. This work presents a method for building a relational network of SMEs using entity resolution and improving the current state of the art of late payment prediction using features extracted from the graph. vii Contents List OF Figures XI List OF TABLES XIII 1 Introduction 1 1.1 Currently available solutions...................................1 1.2 Exact...............................................2 1.3 Problem Definition and Research Questions...........................2 1.4 Contributions...........................................2 1.5 Report structure..........................................3 2 Related WORK 5 2.1 Late payment prediction.....................................5 2.2 Financial transactions......................................7 2.3 Graph analysis and feature engineering..............................8 2.3.1 Feature engineering.................................... 10 2.4 Entity Resolution......................................... 12 3 Entity Resolution 13 3.1 Division-Accounts........................................ 13 3.2 Problem of record matching................................... 13 3.2.1 Generic ERA pipeline................................... 14 3.2.2 Data definition and preprocessing............................. 15 3.2.3 Classifier.......................................... 17 3.2.4 Custom method for Entity Resolution........................... 18 4 InVOICING 23 4.1 EDA................................................ 24 4.1.1 Distribution........................................ 26 4.1.2 Timeline.......................................... 26 5 Method 29 5.1 Overview of late payment experiments.............................. 29 5.2 Featuring............................................. 30 5.2.1 Modeling.......................................... 32 5.2.2 Evaluation......................................... 33 5.3 Baseline.............................................. 34 5.3.1 Results........................................... 34 6 Experiments 37 6.1 Custom features.......................................... 37 6.1.1 Experiment 1: Node level profiling............................. 37 6.1.2 Experiment 2: Time sensitive customer profiling...................... 38 6.1.3 Experiment 3: Alternatives to profile features....................... 38 6.1.4 Experiment 4: Neighborhood level profiling........................ 39 ix x Contents 6.2 Learned features......................................... 40 6.2.1 Experiment 5: Business Graph Embedding......................... 40 6.2.2 Experiment 6: Invoice Graph Embedding......................... 42 6.2.3 Split graph tuning..................................... 44 7 Discussion 47 7.1 Entity Resolution......................................... 47 7.2 Invoice prediction......................................... 47 8 Conclusion 49 9 FUTURE WORK 51 9.1 Entity Resolution......................................... 51 9.1.1 Matching based on context information:.......................... 51 9.1.2 Recognition of private entities:............................... 51 9.1.3 Testing on public datasets:................................. 51 9.2 Late-Payment prediction..................................... 52 9.2.1 Modeling of cashflow and node interactions........................ 52 9.2.2 GNN models........................................ 52 A Experiment protocol: InVESTIGATING THE SUBJECTIVE QUALITY OF GENERATED entities. 53 A.1 Motivation............................................ 53 A.2 Research Questions........................................ 53 A.3 Independent Variables / Setup.................................. 53 A.4 Brand level vs Franchise level................................... 54 A.5 Validation............................................. 54 B Experiment BriefiNG 55 BibliogrAPHY 57 List of Figures 2.1 t-SNE representation of the features before the classification layer................... 10 2.2 Taxonomy for the feature extraction techniques and feature learning methods for link predic- tion studies................................................... 11 2.3 List of heuristic features used by the SEAL model for link prediction.................. 11 3.1 Generic pipeline of ERA........................................... 14 3.2 Overview of how many fields of the given accounts have been supplied with information..... 16 3.3 Overlap between KvK and VAT........................................ 16 3.4 Example of the matching dataset used to train the matching classifier................ 18 3.5 Scatter plot of matching and non-matching records, with on the x-axis the cosine similarity between the two strings and on the y-axis the partial similarity..................... 19 3.6 Distribution of similarity between the mode of a cluster and the inner records. Clusters smaller than 5 have been taken filtered out..................................... 20 3.7 Fraction of the clusters that have a certain amount of possible modes. Clusters smaller than 5 have been taken filtered out......................................... 21 3.9 Contingency matrix of the experiment.................................. 22 4.1 Interactions between the Division (in the illustration the Invoice Sender) and one of their ac- counts (Invoice Receiver in the illustration) in no specific order.................... 24 4.2 Barplot shows the most frequent sequences of dates in the system. The sequences have been coded for readability (Invoice Date = I, Payment Date = P, System Invoice Date = Is, System Payment Date = Ps,). The sequence next to a bar represents the order of the dates. Leftmost symbol is the oldest date, rightmost the most recent. Two symbols in brackets are equal, in other words these dates represent the same day............................. 25 4.3 Distribution of invoices sent and received by the divisions. Only shows divisions with counts from 1 to 25 invoices.............................................. 25 4.4 Overview of invoices over the duration of time. Figure (a) shows the complete view of all avail- able data. Plots (b) and (c) show the segments during the end of the year in 2017 and 2018.... 26 4.5 Balances of late payment in the complete dataset, on specific weekdays and the magnitude of the delay..................................................... 27 4.6 Returning customers and how often they have been seen in the dataset............... 27 5.1 Overview of experiments and type of the generated feature sets.................... 30 5.2 Visualization of an expanding window featuring approach. The example shows the calcula- tion of the late-payment ratio feature. A rolling function iterates over the sequence of invoices (late in red, non-late in blue) for every invoice, every preceding available invoices are taken and passed to the feature functions........................................ 32 5.3 Pipeline from the data source to the prediction of late payments.................... 32 5.4