Feature Engineering and Methodologies using Customer Debit and Credit Transactions for Predicting Savings Account and Home Equity Line of Credit Acquisition

By

Susanne Pyda

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science

Department of Chemical Engineering & Applied Chemistry University of Toronto

Supervisor: Professor J.C. Paradi

© Copyright Susanne Pyda 2019

Feature Engineering and Machine Learning Methodologies using Customer Debit and Credit Transactions for Predicting Savings Account and Home Equity Line of Credit Acquisition

Susanne Pyda

Master of Applied Science

Department of Chemical Engineering & Applied Chemistry

University of Toronto

2019

Abstract

The objective of this thesis is to investigate feature engineering methodologies for customer credit and debit transaction data and evaluate their utility in developing machine learning models on the Apache Spark framework for two common marketing targets: savings account acquisition and home equity line of credit acquisition.

The utility of the new features (predictor variables) is evaluated in comparison to the Bank’s previous marketing models, and in combination with an existing library of predictor variables aggregated from various databases.

The results show that there is potential to both automate the process of creating features from transaction data using word embeddings, and to use them in combination with gradient boosted decision trees to increase model performance. However, the parameters, methods, and applications tested in this thesis are not exhaustive, and should serve as an example for further exploration.

Acknowledgements

I would like to express sincere gratitude to the following persons who have made this thesis possible:

My thesis supervisor Professor Joseph C. Paradi, who gave me the opportunity to learn and pursue my interests, and provided ongoing advice, patience, and support.

My parents, who encouraged me to continue my studies and whose support made it possible.

My fellow CMTE candidates, who provided encouragement and companionship throughout the entire process.

Everyone at the Bank who provided invaluable suggestions, advice, expertise, and mentorship.


Table of Contents

Abstract
Acknowledgements
List of Tables
List of Figures
Glossary
1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Scope
2 Literature Review
2.1 Banking Product Models
2.1.1 Consumers using HELOC
2.1.2 Consumers Acquiring Savings Accounts
2.2 Feature Engineering for Transaction Data
2.2.1 Transaction Aggregation
2.2.2 Vector Representation
2.2.3 Sentence Embedding
2.3 Algorithms in Feature Engineering
2.3.0 Aggregation by Category Binning
2.3.1 Word2Vec
2.3.2 Simple Baseline proposed by Arora et al.
2.3.3 K-Means
2.4 Supervised Learning
2.4.1 Logistic Regression
2.4.2 Tree Models
3 Data
3.0 Customer Universe
3.1 Targets
3.1.1 Savings Account
3.1.2 Home Equity Line of Credit (HELOC)
3.2 Transaction Data
3.3 General Banking Features
3.4 Out of Time Customers
4 Technology
4.1 Cloudera
4.2 Apache Spark
4.2.1 Spark Machine Learning Libraries
4.3 Apache Hive
5 Methodology and Implementation
5.1 Data Cleaning
5.1.2 General Banking Features
5.1.3 Transaction Feature
5.3 Transaction Features
5.3.1 Binning
5.3.2 Word2Vec
5.3.3 Customer Vectors
5.3.4 K-Means Clustering
5.3.5 Summary of Tables
5.4 Models
5.4.1 Logistic Regression
5.4.2 Gradient Boosted Decision Tree
5.4.3 XGBoost
5.5 Model Tuning and Testing
5.5.1 Model Testing
5.5.2 Parameter Tuning with Grid Search
5.5.3 Evaluation Metrics
6 Results and Analysis
6.1 Results
6.1.1 Benchmark
6.1.2 Modelling Results
6.1.3 Testing Results
6.1.4 Average Results for Best Model
6.1.5 Additional Parameter Testing
6.1.6 Results for Additional Parameters
6.2 Analysis
6.2.1 HELOC
6.2.2 SA
7 Conclusions
7.1 Future Work
References
Appendices
Appendix 1: Additional Results

List of Tables

Table 1: Aggregating Transactions by Category
Table 2: Appending Percentile Bin to Category Sums
Table 3: Example of a List of Transactions with 4 Transaction Elements
Table 4: Example of One-Hot Encoding
Table 5: Example of Raw Credit Card Transaction Attributes (Bahnsen et al., 2016)
Table 6: Data Input Format for Word2Vec
Table 7: Sample Output of K-Means Clustering for a Customer Sample
Table 8: Modelling Feature Data Tables
Table 9: Raw Target Data Distribution for Modelling Data
Table 10: Logistic Regression Parameters
Table 11: Gradient Boosted Decision Tree Parameters
Table 12: XGBoost Parameters
Table 13: Hyperparameter Tuning Results
Table 14: HELOC XGBoost Modelling Results
Table 15: SA XGBoost Modelling Results
Table 16: HELOC Modelling Results on Testing Cohort
Table 17: SA XGBoost Modelling Results on Testing Cohort
Table 18: Repeated Training on Model No. 3
Table 19: Additional XGBoost Hyperparameters Tested
Table 20: Results for Additional Hyperparameter Testing

List of Figures

Figure 1: One-Hot Vectors Plotted in 3D Space
Figure 2: Mikolov et al. Skip-Gram Architecture, as published in "Distributed Representations of Words and Phrases and their Compositionality"
Figure 3: Example of Decision Tree and Prediction at Output Node
Figure 4: One Round of Boosting for GBDT
Figure 5: Observation and Outcome Window
Figure 6: Observation and Outcome Window for Out of Time Customer Universe
Figure 7: Spark Components (Apache Spark, Cluster Mode Overview)
Figure 8: Example of Observation and Outcome Window for Benchmark Models


Glossary

GBDT gradient boosted decision tree

HELOC Home Equity Line of Credit

SA Savings Account

MCC merchant category code

SIF smooth inverse frequency

SVD singular value decomposition


1 Introduction

1.1 Motivation

Financial institutions such as the Bank1 have always collected and stored large amounts of data. However, unprecedented growth in the volume of data, computing capabilities, and data centralization, commonly referred to as the "Big Data Revolution", not only facilitates advanced analytics but necessitates it to maintain a competitive advantage (1).

Banks have traditionally utilized their data to make decisions, including decisions related to marketing product offerings such as savings accounts, mortgages, personal loans, and lines of credit. In a competitive market, it is crucial for a Bank to foresee and meet a customer’s financial needs before a competitor does.

Predictors for these models come from several different sources – credit bureaus, personal customer data, financial holdings, branch interactions, as well as from online activity tracked through cookies. Features from debit and credit transactions are typically aggregations and aggregate statistics (e.g. spending sum per time period, standard deviation over time periods), or rule-based features (e.g. travel using tropical airlines in winter months, transactions occurring within a list of selected “luxury” retailers).

Within the volume of debit and credit transaction data the Bank holds, there exists the opportunity to draw more than heuristic insights, and to use machine learning to automate the feature extraction and engineering processes (2).

“Big Data” allows an organization to generate features (model predictors) beyond what can be manually or explicitly determined. The premise is that with enough data and advanced algorithms, the model can learn important attributes of the data on its own (2).

1 The Bank is a major Canadian bank, and the provider of the data used in this thesis. It is referred to as "the Bank" in this paper for confidentiality.

1.2 Objectives

Through the collection and storage of transaction data, the Bank has unique insight into its customer’s spending habits, lifestyle, and potential future financial decisions. The Bank seeks to optimize its use of this data to drive business value and better serve customers.

To the present day, the Bank has focused largely on banking data and rule-based transaction features. The Bank has also prototyped embedding customer transactions, adopting word2vec, a popular algorithm from Natural Language Processing (3). The objective of this thesis is to further develop feature engineering methodologies that use customer transaction patterns to predict financial and banking decisions. The utility will be tested on the prediction of new savings account and home equity line of credit (HELOC) acquisition.

The objectives of this thesis are as follows:

1) Review existing academic literature and industry practices related to:

▪ Converting financial transactions to model features.
▪ Predicting savings account and HELOC acquisition, in order to prepare an appropriate benchmark and testing environment for new features.

2) Develop and implement automated feature extraction methodologies using algorithms and techniques such as binning, clustering, and vector embedding to develop model features from customer credit and debit transactions.

3) Utilize the Apache Spark framework to construct machine learning models for the classification tasks and provide evaluation metrics to test the usefulness of the developed features and models.

1.3 Scope

The objective of this thesis is to investigate automated feature engineering methodologies for customer credit and debit transaction data and evaluate their utility in developing machine learning models for two common marketing targets: savings account acquisition and home equity line of credit acquisition.


The utility of the new features (predictor variables) is evaluated in comparison to the Bank’s previous marketing models, and in combination with the Bank’s existing library of predictor variables aggregated from various databases.

The transaction data used consists solely of the data provided by standardized transaction databases at the Bank. The data consists of a single cohort of bank activity divided into training and validation sets (training cohort). The models are further evaluated on a new cohort of bank activities set one year forward after the training cohort (testing cohort).

The observation window for transactions occurs over 6 months - Jan 1, 2015 to June 30, 2015. Other customer banking data is acquired as of June 30, 2015. The model outcome window occurs over 4 months – June 30, 2015 to October 1, 2015. This means that if the customer made the bank product acquisition within this window, the target or outcome value was positive. The testing cohort has the same structure except in 2016.

The work of this thesis was completed within the Bank’s Apache Hadoop environment, utilizing SQL in Apache Hive and Apache Spark through Python and Scala APIs. Modelling was done using Apache Spark’s native Machine Learning (ML and MLlib) libraries.

Although several of Spark ML’s native machine learning models were preliminarily tested, the open-source library XGBoost provided the most functionality and ability to tune the model, resulting in the best performance. XGBoost is an implementation of the gradient boosted decision tree algorithm (GBDT).


2 Literature Review

This chapter reviews predictors for HELOC and SA in literature, as well as feature engineering methodologies which are used for banking data. The algorithms used are outlined, as well as supervised machine learning techniques for classification.

2.1 Banking Product Models

2.1.1 Consumers using HELOC

Although details are often kept proprietary by financial institutions, variables for predicting and understanding customers' financial decisions have been studied in the literature. In 2006, Siman, Finke, and Corlija conducted a review of studies on HELOC customers, and noted that households owning more expensive homes with a high portion of equity in the home were most likely to obtain a HELOC (4). Propensity was found to be largely dependent on socioeconomic status, including factors such as income level, net worth, age and education. It is difficult, however, to use these factors to predict the exact time a customer will make the decision to obtain a HELOC. Other significant factors included the economic climate and market conditions, and a customer's sensitivity to changes in the market (4).

2.1.2 Consumers Acquiring Savings Accounts

The Bank has developed its own models for specific customer behaviours such as the acquisition of a new savings account. These models include expert opinion and an amalgamation of customer characteristics from credit bureaus, customer personal data, and recent activity with the Bank.

The objective of these models is to understand customer needs and engage effectively and efficiently in marketing related activities. Customer relationship management using machine learning has seen an increase in popularity. For example, features used to predict campaign output for a long-term deposit application include age, yearly balance, education status, salary, credit history and communication history with the bank (5). Transaction data has also been used to analyze bank customers, for example by assessing credit card repayment behaviour (6).


2.2 Feature Engineering for Transaction Data

2.2.1 Transaction Aggregation

Transaction data offers valuable information for financial institutions, and has been well studied, most significantly in the context of credit card fraud detection, motivated by the billions of dollars lost annually to credit card fraud.

Whitrow et al. observe that (in a fraud detection context) using an entire series of transactions is not practical due to high dimensionality as well as heterogeneity of the transactions, and present a case for aggregation strategies (7). Transaction-level features as described by Whitrow et al. refer to rule-based indicators of fraud, such as a large purchase from a certain merchant category (7). Customer behaviour can be better represented by aggregating some patterns over time. By aggregating transactions over some of the features shown in Table 5, and calculating statistics, Whitrow et al. were able to show that new aggregate transactions could be useful in many situations (7). Building on this concept, Bahnsen et al. analyze periodic behaviour by aggregating using the von Mises distribution (8). They increase the feature space by aggregating on different factors (e.g. country and merchant group). The length of time to aggregate on has been discussed by both Whitrow et al. and Bahnsen et al., emphasizing that the marginal value of an additional transaction added to the aggregation diminishes with a larger time window (7). Bahnsen et al. recommend a 7-day aggregation period (8). However, it should be emphasized that feature engineering for fraud detection is a unique application, because any derived features must allow for immediate classification of a transaction, rather than for the long-term changes which are studied in this thesis.

Aggregation and rule-based raw transaction level methods for extracting useful features both require significant manual effort and expert knowledge, making the process time consuming, difficult to update and prone to human error (2). Dayioglugil and Akgul suggest that “automatically discovering hidden patterns in observed customer data is essential for several financial tasks such as fraud detection, new product offers, and customer behaviour analysis” and raw customer transaction data lends itself to this task.


2.2.2 Vector Representation

Baldassini and Serrano aim to automate the process of representing customer behaviour using marginalized stacked denoising autoencoders on current account transaction data. Through this process, clients are embedded by vectors which can be used for segmentation, profiling and targeting. To accomplish this task, techniques introduced for Natural Language Processing (NLP) tasks have been applied. Continuous vector representation of words in the field of NLP has had a significant impact on unsupervised capturing of syntactic and semantic relationships between words, phrases, and documents (9). This algorithm, known as word2vec, is further described in 2.3.1.

Transactions in this approach are analogous to words – although they hold individual definition, their order, context and combination are imperative to meaning. Clients are treated as a sequence of words, or a document (2). Baldassini and Serrano first aggregate each client's transactions into a vector representing the transaction sum in each expense category. Credit cards provide a standardized merchant category code list and this is the code provided in transaction data (10).

This approach is described stepwise below, based on the description and figures provided by Baldassini and Serrano in “client2vec: Towards Systematic Baselines for Banking Applications” (2018).

Step 1: For each client, sum the transaction amount over some time period (e.g. 1 year) in each spending category. A fabricated example is shown below.

Table 1: Aggregating Transactions by Category

         | Category 1 (Grocery Stores) | Category 2 (Automobile Supply Stores) | Category 3 (Drug Stores and Pharmacies) | ... | Category N
Client 1 | 34020.37                    | 66.22                                 | 8891.54                                 | ... |
Client 2 | 500.30                      | 66587.44                              | 687.24                                  | ... |

Step 2: To convert the sums in Step 1 to word-like entities, the number is first converted to a percentile range within that category. Therefore, if Client 2's spending in Category 1 falls within the 10-20th percentile in that category compared to all of the customers, 500.30 is changed to 10_20. To make this unique to Category 1, "CAT1" is appended to this entity. Client 2's spending in Category 1 (500.30) is converted to CAT1:10_20.

Step 3: Replace the transactions in the raw dataset from Step 1 with “the label of the bin they fall inside (…) yielding a finite set of repeating words that depend on the nature of the transaction (9).”

Table 2 shows what Client 1’s transactions may look like (bins here are arbitrary for demonstration purposes):

Table 2: Appending Percentile Bin to Category Sums

         | Category 1 | Category 2 | Category 3 | ... | Category N
Client 1 | CAT1:50_60 | CAT2:0_10  | CAT3:70_80 | ... | CATN:80_90

The sentence representation of each customer would be the combination of these entities such as Client 1: CAT1:50_60, CAT2:0_10 […] CATN:80_90.
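The three steps above can be sketched in a few lines of pandas. The DataFrame, bin width, and label format below are illustrative assumptions for this example, not the authors' implementation.

```python
import pandas as pd

# Hypothetical category sums per client (Step 1); values echo Table 1.
sums = pd.DataFrame(
    {"CAT1": [34020.37, 500.30], "CAT2": [66.22, 66587.44], "CAT3": [8891.54, 687.24]},
    index=["Client 1", "Client 2"],
)

def to_bin_labels(col):
    # Percentile rank of each client within the category, mapped to a 10-point bin label.
    pct = col.rank(pct=True) * 100
    lo = (pct // 10 * 10).clip(upper=90).astype(int)
    return col.name + ":" + lo.astype(str) + "_" + (lo + 10).astype(str)

tokens = sums.apply(to_bin_labels)       # Steps 2-3: one "word" per category
sentences = tokens.apply(list, axis=1)   # one "sentence" of bin labels per client
print(sentences["Client 1"])
```

The resulting lists of repeating bin labels are what make encoding methods from NLP applicable to the transaction data.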

This vocabulary allowed for feature engineering using encoding methods such as word2vec and autoencoders, demonstrating effectiveness at embedding customer information (9).

Continuous embedding of bank transactions was explored by Dayioglugil and Akgul at the 2017 International Symposium on Methodologies for Intelligent Systems. The objective of this paper was to predict customer behaviour without any manual intervention (2). As one of the first studies to overcome the high dimensionality of transaction data using continuous word embedding methodologies developed in NLP, the results encourage further study.

Dayioglugil and Akgul recognize the resemblance of a customer's chronological transaction history to a sentence. In their work, each transaction is defined by 10 categorized elements, such as age, business segment, education level, marital status, and transaction process bank group. These elements serve as words, and form a grouping comparable to a sentence in natural language. Word2vec is used to create element embeddings, which are agglomerated into customer vectors by concatenating all of the transaction element vectors (9).

An example of what a chronological list of transactions (all customers grouped together) may look like is shown in Table 3.


Table 3: Example of a List of Transactions with 4 Transaction Elements

Age | Amount ($)      | Gender | Transaction Process Group Code
19  | Amnt1 (<100)    | M      | 123
57  | Amnt2 (200-400) | F      | 456

Notes: Here 4 transaction elements are shown, whereas Dayioglugil and Akgul used 10.

The elements are treated as words, and the transactions consisting of 10 elements as sentences. The example in Table 3 would be concatenated into a document as [19 Amnt1 M 123. 57 Amnt2 F 456.]. Each "word" could then be given a vector representation using the word2vec algorithm described in 2.3.1 (2).

2.2.3 Sentence Embedding

The ability to effectively embed words into dense vectors has consequently sparked interest in embedding groups of words - from sentences to paragraphs and entire documents – while retaining some form of meaning (11). To further study transaction embedding, the grouping of transaction vectors into customer vectors must be considered.

Concatenation, as performed by Dayioglugil and Akgul, is an option for a sentence size of 10; however, for longer sentence or document representations the resulting dimensionality would be high. Baldassini and Serrano discuss methods such as max-pooling, mean-pooling, and Vector of Locally Aggregated Descriptors (VLAD). VLAD encoding involves computing a cluster centroid for word vectors and concatenating the distance between each word vector and its centroid for all words in a sentence (8). Other studies have considered methods from simple vector addition to neural networks – such methods typically being tailored to the domain (11). Wieting et al. propose a universally applicable sentence embedding methodology using neural networks trained on phrase pairs from the Paraphrase Database (11).

Acknowledging the objectives of Wieting et al., Arora, Liang, and Ma propose "A Simple But Tough-To-Beat Baseline for Sentence Embeddings," which involves taking a weighted average of the word vectors and modifying it using principal components. This methodology is discussed further in 2.3.2.


2.3 Algorithms in Feature Engineering

This section provides information on the methodologies and algorithms discussed in 2.2 which have been implemented in this thesis.

2.3.0 Aggregation by Category Binning

Aggregation by binning, or some form of summation, are well-researched methods for extracting features from transaction data as discussed in 2.2.1. Because the data contains a merchant category code, an intuitive approach is to sum the total spending per customer, per time period in each of the categories. Although there are almost 1000 category codes, some categories offer little granularity – categories include "grocery stores, supermarkets" and “men's and women's clothing stores". Nevertheless, total spending per category can offer insights into a customer's general profile and this simple method can effectively serve as a benchmark for more complex methodologies.

Another binning approach which may have been considered is binning by the full transaction text. This is the merchant text descriptor determined at the transaction point. This transaction text is not standardized and therefore the number of distinct transaction texts is very high. It was determined that even if the text was truncated or cleaned, this representation would be too sparse to be meaningful.

2.3.1 Word2Vec

As discussed in 2.2.2, the word2vec algorithm has had a significant impact on Natural Language Processing. Its effectiveness for other applications has been recognized by drawing parallels between unstructured text data and data sources such as transaction data.

The algorithm was popularized by Mikolov et al., in the publications "Distributed Representations of Words and Phrases and their Compositionality" (3) and “Efficient Estimation of Word Representations in Vector Space” (12).

The algorithm addresses the NLP challenge of creating a useful representation of words, which can be used in machine learning models. Most preceding techniques, such as the N-gram model, create predictor variables by extracting the count of words or n-grams (sequences of words). This is equivalent to binning (or aggregating by count) the document by unique n-gram. The representation of a single word is a one-hot vector having the size of the vocabulary, making all vectors orthogonal to each other – the cosine similarity between words is therefore zero and word similarity is not captured (12).

One-hot values can only take the values of 0 and 1. For example, in the document [one two three one] there are 3 distinct words in the vocabulary [one two three]. The size of the vector is equal to the size of the vocabulary, 3. The value is 1 if the given word is equal to the word at that index in the vocabulary (order is arbitrary) or 0 otherwise. Each word in the vocabulary could be one-hot encoded as:

Table 4: Example of One-Hot Encoding

vocabulary word | one | two | three | vector
one             | 1   | 0   | 0     | [1,0,0]
two             | 0   | 1   | 0     | [0,1,0]
three           | 0   | 0   | 1     | [0,0,1]
one             | 1   | 0   | 0     | [1,0,0]
SUM             | 2   | 1   | 1     | [2,1,1]

These vectors can be plotted in 3D Space:

Figure 1: One-Hot Vectors Plotted in 3D Space (plotted using "3d-vector-plotter" by Academo.org (13))


It can be seen that the vectors are orthogonal – perpendicular with a zero dot product. The resulting vector representation of the document using the n-gram model would be the summation of the vectors as shown in Table 4.

Mikolov et al. further emphasize that the treatment of words as distinct units does not extract information related to syntactic and semantic similarity (12). Although there are several variations of the word2vec algorithm, the skip-gram method (used within Apache Spark's MLlib library) will be discussed here. The skip-gram model inputs each word into a log-linear classifier, with the target being words within a window of the input word.

For example, in Figure 2, the input w(t) is a word at position t. In a document consisting of words [one two three four five], a window size of one can be selected. The words within one word of “three” are “two” and “four”. Therefore, one training example would be “three” as w(t) and “four” as w(t+1). However, words cannot be used as data points so they are one-hot encoded as shown in Table 4.

Figure 2: Mikolov et al. Skip-Gram Architecture, as published in “Distributed Representations of Words and Phrases and their Compositionality”
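The construction of training pairs and one-hot inputs for this example can be illustrated with a minimal Python sketch (Spark's Word2Vec builds these internally; the code below is purely for illustration).

```python
# Document ["one", "two", "three", "four", "five"], window size k = 1.
doc = ["one", "two", "three", "four", "five"]
k = 1

pairs = []
for t, w in enumerate(doc):
    for j in range(-k, k + 1):
        if j != 0 and 0 <= t + j < len(doc):
            pairs.append((w, doc[t + j]))      # (w(t), w(t+j)) training example
# pairs contains ("three", "two") and ("three", "four"), as described in the text.

vocab = sorted(set(doc))                        # arbitrary but fixed ordering
one_hot = {w: [1 if w == v else 0 for v in vocab] for w in vocab}
print(pairs)
print(one_hot["three"])
```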

Per the Apache Spark documentation, given a sequence of training words (w1,w2 … wT), the objective of the skip-gram model is to maximize the average log-likelihood (where k is the window size) (14).


$$\frac{1}{T}\sum_{t=1}^{T}\sum_{j=-k}^{k}\log p\left(w_{t+j}\mid w_{t}\right)$$

Objective Function Word2Vec (14)

The projection layer creates a dense embedding for each word, and has proven to capture context and similarity – embeddings for words with contextual similarity have a smaller cosine distance (12).

As discussed in 2.2.3, aggregating the word vectors to represent sentences is challenging. Arora et al. propose a "Simple but Tough-to-Beat Baseline" to improve sentence embeddings.

2.3.2 Simple Baseline proposed by Arora et al.

The methodology proposed by Arora et al. has several advantages (15).

1) It is unsupervised and does not require a labelled dataset such as the Paraphrase Database used by Wieting et al. (11).
2) It is relatively simple compared to sophisticated models used by neural networks.
3) It shows improvements of 10-30% in textual similarity tasks.
4) It offers theoretical justification for its reweighting method (SIF).

The methodology is comprised of two steps – taking the weighted average of the word vectors in the sentence, followed by removing the projection of the average vectors on their first principal component (15).

The weighted average, here referred to as the smooth inverse frequency (SIF), is calculated using the weight

$$\frac{a}{a + p(w)}$$

where a is a parameter and p(w) is the word frequency. It is noted that SIF is closely related to TF-IDF (Term Frequency-Inverse Document Frequency), a common reweighting method in Information Retrieval. TF-IDF weighs a word with inverse proportionality to its count in the document – accounting for the fact that very commonly used words may add less additional information.


The second step proposed is removing the first (or first few) principal components, as calculated using Singular Value Decomposition (SVD). This method is often referred to as "common component removal" (15).

The singular value decomposition of a matrix A is the factorization of A into the product of three matrices, $A = UDV^T$, where U and V are orthogonal matrices whose columns are orthonormal (orthogonal unit) singular vectors, and the matrix D is diagonal with positive real entries (16).

2.3.3 K-Means

The K-Means algorithm is an unsupervised algorithm used to identify a predetermined number of groups (clusters) in a multidimensional space. Each data point is assigned to the cluster whose centroid has the minimum Euclidean distance to that point. First, cluster centers are randomly assigned, and all data points are assigned to the closest cluster. Next, each cluster center is recalculated as the centroid of the data points assigned to it. The process is repeated until the cluster centers no longer change.

In this section, various methods for creating features from transactions have been discussed. Once the features are developed, their utility is tested in supervised machine learning algorithms.

2.4 Supervised Learning

To evaluate the usefulness and predictive value of the features developed, they will be tested to predict the two targets discussed: HELOC and SA. The supervised machine learning classification algorithms used are logistic regression and gradient boosted decision trees.

2.4.1 Logistic Regression

Logistic regression is a traditional classification model, with many applications at the Bank. Built upon concepts from linear regression, it applies the sigmoid nonlinearity to the output of a linear regression.

The sigmoid function is defined as:

$$\sigma(y) = \frac{1}{1 + e^{-y}}$$


The general form of logistic regression is therefore:

$$y(x) = \sigma\left(w^{T}x + b\right)$$

The sigmoid function smooths the output to a value between 0 and 1, representing the probability, P(y=1).
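As a minimal illustration of these two equations (with made-up weights, not the Bank's model):

```python
import numpy as np

def sigmoid(y):
    # Smooths any real value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-y))

def predict_proba(x, w, b):
    # General form of logistic regression: P(y = 1 | x) = sigma(w^T x + b).
    return sigmoid(np.dot(w, x) + b)

# Illustrative weights: a point on the decision boundary gives probability 0.5.
print(predict_proba(np.array([1.0, 2.0]), np.array([0.5, -0.25]), 0.0))   # 0.5
```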

2.4.2 Tree Models

A classification model can also be created using a decision tree. A decision tree model uses a tree structure, where at each node a given data point is tested on an attribute, and moves on to the subsequent branch depending on the outcome. The final leaf node decides the class. When the tree is being built, the algorithm decides which attribute to split on using information gain, as in the ID3 algorithm (17).

Decision Tree

A decision tree branches data at the root incrementally based on internal decision nodes. At the node, the data from the previous branch is tested against an attribute – and split into a branch depending on the outcome. After the final split, the terminal (or leaf) node decides on the outcome.

The algorithm behind decision trees, ID3 (Iterative Dichotomiser), was developed by J. Ross Quinlan (17). The algorithm splits on the attribute with the highest information gain, minimizing the number of splits required to classify the remaining data. Entropy represents the amount of information required to classify a point, and information gain is the difference in entropy before and after a split (17).

Decision trees are easy to interpret because of the attribute tests at each split. They inherently select features which provide the most information and can model nonlinearity.

Random Forest

A random forest model trains a collection ("forest") of decision trees, each on a random sample of attributes and data points from the original data set. The trained trees collectively vote on the classification (17). They have the benefit of preventing the overfitting that may occur when training a single tree.

Gradient Boosted Decision Tree

Gradient Boosted Decision Trees apply boosting to an ensemble of decision trees. Freund and Schapire are considered the first to develop Adaptive Boosting (AdaBoost) – based on creating an accurate prediction by combining a set of weak predictors. In 1999, Friedman improved upon the algorithm with gradient boosting, the concept of fitting the subsequent weak learner on the errors (residuals) of the previous learner at each stage (18).

The algorithm has been adapted and scaled in modern machine learning libraries. In this thesis, Spark's GBTClassifier implementation was tested. It supports both continuous and categorical data, and does not require feature scaling. Tree boosting has been shown to produce state-of-the-art results in many machine learning tasks (19), and was therefore chosen as a benchmark for initial comparisons of results, and in the final modelling.

An overview of the implementation of gradient boosted decision trees for binary classification (labelled 1 and 0) is described below (20).

1. Begin with an initial model F_0. For the binary classification task, this is initialized by finding the log odds of the positive class in the training data. This is converted to a probability using the sigmoid function. The initial predicted probability (e.g. of class 1) for each data point is this probability.

This allows for the calculation of residuals:

Residual (R) = true class (0 or 1) – predicted probability (probability that y = 1)

For example, if a data point belonged to class 1 and the predicted probability of class 1 was 0.6, the residual would be 1 – 0.6 = 0.4. The intermediate trees for boosting are regression trees fit to the residuals.

2. An initial weak learner (h_m) is trained on the data to make predictions for the residuals. A weak learner here refers to a simple model that likely performs poorly on its own. In a standard decision tree regressor, the predicted value at a node is the average of the data points in the node. However, for gradient boosted trees the result must be in terms of log odds so that it can be converted to a probability (P). The calculation for the predicted residual (output) at each node (R_{i+1}) is shown in Figure 3. The rectangular nodes represent the output nodes, at which predictions are made.

Figure 3: Example of Decision Tree and Prediction at Output Node

4. Next, the model is updated with the first regression tree. The new model $F_{m+1}$ is the sum of the initial model and the first weak learner scaled by a learning rate $\gamma_{m+1}$:

$$F_{m+1}(x) = F_{m}(x) + \gamma_{m+1}\,h_{m+1}(x)$$

Figure 4: One Round of Boosting for GBDT

5. The prediction $F_{m+1}(x)$ is now in terms of log odds. It can again be converted to a probability using the sigmoid function, the residuals can be calculated, and another weak learner can be built.


6. This process is continued until the predetermined number of weak learners is reached. The final prediction is the summation of all the weak learners:

$$F(x) = \sum_{i}\gamma_{i}\,h_{i}(x)$$

This is converted to a probability using the sigmoid function, and a prediction can be made based on the predetermined threshold (e.g. if P(y=1) > 0.5, predict 1, else 0).
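To make the boosting loop concrete, a minimal sketch is given below, using numpy and a scikit-learn regression tree as the weak learner. It is purely illustrative and is not the Spark GBTClassifier or XGBoost implementation used in this thesis.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_binary(X, y, n_trees=10, lr=0.1, max_depth=2):
    """Minimal gradient boosting sketch for binary labels y in {0, 1};
    returns fitted probabilities on the training data X."""
    p0 = y.mean()
    F = np.full(len(y), np.log(p0 / (1.0 - p0)))          # step 1: initial model = log odds
    for _ in range(n_trees):
        p = 1.0 / (1.0 + np.exp(-F))                      # convert log odds to probabilities
        residual = y - p                                  # pseudo-residuals
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        leaf = tree.apply(X)                              # output node of each sample
        # leaf output in log-odds units: sum(residuals) / sum(p * (1 - p))
        gamma = {l: residual[leaf == l].sum() / (p[leaf == l] * (1 - p[leaf == l])).sum()
                 for l in np.unique(leaf)}
        F = F + lr * np.array([gamma[l] for l in leaf])   # update the model
    return 1.0 / (1.0 + np.exp(-F))                       # final predicted probabilities
```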


3 Data

3.0 Customer Universe

Before conducting any specific data selection, a “customer universe” was defined from the Bank’s existing customer base. This is defined as all of the customers from which a modelling cohort may be reasonably selected. This was done using customer universe queries provided by the Bank. Some of the main customers which were removed included:

▪ commercial customers
▪ deceased customers
▪ customers with joint accounts

Because this study has focused on transaction data, only customers actively using a debit or credit card are considered. The threshold for active use was set at a total of 50 transactions over 6 months. Although this does not capture the entire customer base, it covers a large portion of the customer universe and therefore affects overall performance.

3.1 Targets

Predicting the behaviour of customers individually and in aggregate is fundamental to the Bank’s operations and competitive advantage. Almost every line of business within the Bank could benefit from predicting future activity.

In order to evaluate and quantify the predictive value of features (predictor variables), two target variables (dependent variables) have been identified within the marketing group at the Bank. These targets are: acquisition of a new savings account and acquisition of a home equity line of credit.

3.1.1 Savings Account

The opening of a new savings account is a long-established model, applicable for testing the usefulness of new features from transaction data. A savings account allows a customer to easily access and earn interest on funds. Properties of the account vary across institutions, including interest rate, number of allowable or free transactions, fees, and minimum balance, among other features. Savings accounts are not typically used for regular expenses like a checking account, but offer the liquidity to cover any larger purchases.

Although the Bank offers a selection of savings accounts, a specific account type was selected for the model, offering basic features. Details of the account are proprietary to the Bank.

The modelling universe consists of customers who do not currently hold any savings accounts with the Bank, but hold an active credit or debit account with the Bank, and have at least 6 months of transaction history.

The objective is to determine if features engineered from credit and debit transactions can add predictive value, additional to general banking features, in determining if a customer will open a new savings account within the given observation window.

3.1.2 Home Equity Line of Credit (HELOC)

A HELOC is a secured form of credit, secured against the equity of a person’s home. The equity in the home is used as a form of collateral. This typically allows for a lower interest rate and higher credit limit than other forms of credit.

A customer may obtain a HELOC at the time of purchasing a home or at another point in time when they would require a loan. A HELOC may be used for any purpose, such as investment, consolidating debt, or managing expenses (21). The target group in this paper is customers who currently have a mortgage with the Bank (or have been recorded to have a mortgage within the past 10 years), and do not already have a HELOC with the Bank. This is meant to identify all potential home-owners or mortgage holders who are eligible for a HELOC.

A HELOC may be acquired due to a large unforeseen expense, or a significant and planned life event requiring a large amount of funds – such as a home improvement, education, or a wedding.

The objective is to determine if a customer’s spending history is indicative of their propensity to acquire a HELOC with the bank.


3.2 Transaction Data

Credit and debit transaction data, as provided through customers' use of debit and credit cards, provides an abundant but challenging source of information. Data collection is highly regulated by International Financial Reporting Standards (IFRS). An overview of standard data collection by a credit card is shown in Table 5 below (8).

Table 5: Example of Raw Credit Card Transaction Attributes (Bahnsen et al., 2016)

Attribute Name   | Description
Transaction ID   | Transaction identification number
Time             | Date and time of transaction
Account number   | Identification number of the customer
Card number      | Identification number of the credit card
Transaction type | e.g. Internet, ATM, POS
Entry mode       | e.g. Chip and pin, magnetic strip
Amount           | Amount of the transaction
Merchant code    | Identification of the merchant type
Merchant group   | Merchant group identification
Country          | Country of transaction
Country 2        | Country of residence
Type of card     | e.g. Visa debit, Mastercard, American Express
Gender           | Gender of the card holder
Age              | Card holder age
Bank             | Issuer bank of the card

Adapted from Bahnsen et al., "Feature engineering strategies for credit card fraud detection" (8)

The Bank stores all credit and debit transactions (including interest payments and bank payments) in an account transaction table. In addition to the attributes shown in Table 5, the table contains a transaction text column. This alphanumeric column contains an identifier that the merchant determines upon implementation of the payment method. This column text is therefore not unique or standardized. Different points of sale for a single organization may have the same transaction text, or they may alter the name (e.g. truncate, include hyphens and other punctuation) or append a numeric store number or other identifier. Although this makes it difficult to accurately identify the same organizations, it adds information in terms of uniqueness of the transaction point.


The merchant category code is a standardized attribute which introduces a consistent identifier of the nature of the transaction – although the predetermined category groupings are limiting and not as granular as transaction text. Merchant categories include “clothing retail” and “groceries” for example.

The observation window for transactions occurs over 6 months, from January 1st, 2015 to June 30th, 2015. Other customer banking data is pulled as of June 30th, 2015. The model outcome window occurs over 4 months, from June 30th 2015 to October 1st 2015. This means that if the customer made the bank product acquisition within this window, the target or outcome value was positive. The testing cohort has the same structure except in 2016.

An observation window of 6 months is chosen as it captures enough individual transactions to train the word2vec algorithm with reasonable accuracy, and within the resource and storage limitations at the Bank (more data could potentially provide improved word embeddings). The observation window gives the same weight to transactions occurring 6 months before the model outcome window as transactions occurring immediately before. Additional latency was not introduced as recent transactions may hold significant information.

In reality, a latency (typically a month) should be introduced to allow for the Bank to act on new information (for example send a promotional campaign to a selected group of likely to respond customers). Various observation windows should be tested and compared depending on the target in question. However, for the purpose of creating automated and general features which could be used for several different targets, only a 6-month window is considered in this study.

Figure 5: Observation and Outcome Window (observation window: Jan 2015 – June 2015; outcome window: July 2015 – Oct 2015)

3.3 General Banking Features

The Bank collects a variety of information about customers throughout its operations. The information is stored in different databases across the Bank; however, it is regularly aggregated into a central table which can serve as a repository of features. Banking features include aggregated metrics and statistics (e.g. total spending over 6 months, change in account balance, and standard deviation of account balance). These features can be categorized as follows:

▪ Customer Personal Information: personal information about the customer such as age, gender, tenure at the bank, branch number, location region
▪ Product and Account Information: information on customers' ownership of bank products (account type, mortgage, line of credit) and detailed account information (account balances, loan amounts, monthly transaction count)
▪ Channel Information: information on a customer's usage of the Bank's service channels such as branch visits, online banking, telephone banking (frequency of branch visits, whether online banking is used)

After data cleaning and preparation, 540 of these features have been used for modelling. The features within this set have served as some of the primary predictors in the Bank’s models.

Two approaches were taken in using the general banking features. For the HELOC model, the most significant (by contribution to model lift) 16 features were selected or recreated from the general banking features. Some of these features were available “as-is” (such as customer age) while others required some minor calculation/aggregation (such as loan-to-value ratio). The specific features used are proprietary to the Bank.

The SA model utilized features that involved more complex pre-processing such as utilizing data from other bank sources and calculating cumulative statistics (such as average change in account balance). To create a simple benchmark and to test the ability of machine learning models to “learn” feature importance and combinations of features, all of the general banking features were used “as-is”.

3.4 Out of Time Customers

In order to correctly evaluate a predictive model it is not sufficient to test on the validation data removed from the training cohort. If the model is meant to serve a predictive purpose it must be applicable to future customers and should not over-fit to the training timeframe. To prevent any inconsistency due to seasonality, the "out of time" or testing customer universe is taken one year forward (2016). All models are applied to customer data in the observation window and tested on the outcome window. This time period is unseen by the original model, and will evaluate whether the model over-fits to one specific time window. Customer spending and banking characteristics can change across years, so it is important that a model is able to extract and generalize patterns effectively.

Figure 6: Observation and Outcome Window for Out of Time Customer Universe (observation window: Jan 2016 – June 2016; outcome window: July 2016 – Oct 2016)


4 Technology

The Bank utilizes Apache Hadoop for its big data ecosystem. It is an open source software library for reliable, scalable, distributed computing (22). The framework "allows for the distributed processing of large data sets across clusters of computers using simple programming models" (22). The main tools within the framework used for this research include Apache Spark and Apache Hive.

4.1 Cloudera

The Bank utilizes Cloudera services to facilitate its Enterprise Data Warehouse. Cloudera assists in centralizing various sources of data across the Bank, and integrating with Apache Hadoop libraries. It provides support with data security, governance, and availability (23).

4.2 Apache Spark

Apache Spark is “a fast and general-purpose cluster computing system” (24). Spark’s Scala and Python APIs were used in this research, although APIs in Java and R are also provided. Structured data processing was made possible through the Spark SQL module (24).

Cluster computing frameworks have been created and adopted to facilitate big data operations, allowing parallel process execution of data blocks on separate nodes, followed by collection to a single output (25). The framework allows users to write programs that are automatically parallelized and executed on a cluster of machines (25).

Spark and the Resilient Distributed Datasets (RDDs) it utilizes were created to address the fact that MapReduce was not effective at using distributed memory – in particular, storing intermediate results in memory for reuse in tasks such as iterative machine learning models (26). RDDs provide map, filter, and join operations, which function by storing the transformations used to build a dataset instead of the data itself (26).
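A minimal PySpark sketch (with toy data) of the lazy transformation/action model described above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Transformations (filter/reduceByKey) are only recorded; the collect() actions
# trigger distributed execution and return results to the driver.
rdd = spark.sparkContext.parallelize([("cust1", 25.0), ("cust2", 110.0), ("cust1", 40.0)])
large = rdd.filter(lambda kv: kv[1] > 30.0)
totals = rdd.reduceByKey(lambda a, b: a + b)
print(large.collect())
print(totals.collect())    # e.g. [('cust1', 65.0), ('cust2', 110.0)]
```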

Spark applications coordinate cluster processes on the driver program (running the main function) using SparkContext, and a cluster manager program is used to allocate resources. This allows Spark to distribute data amongst worker nodes, execute tasks, and collect the result on the driver (27).

A schematic is shown below (27):

Figure 7: Spark Components (Apache Spark, Cluster Mode Overview)

4.2.1 Spark Machine Learning Libraries

Several native Spark libraries were utilized, most significantly Spark ML/MLlib, Spark’s scalable machine learning library, which provides tools for machine learning algorithms, featurization, pipelines, persistence and utilities for data handling (28).

MLlib is Spark’s original RDD-based machine learning library. The library referred to as ML is Spark’s updated DataFrame-based iteration of machine learning tools, currently under development (28).

4.3 Apache Hive

All data used is accessed using Apache Hive. Apache Hive is a data warehouse software that facilitates reading, writing and managing large datasets residing in distributed storage using Structured Query Language (SQL) (29).

Data selection and pre-processing is done directly through Apache Hive or through Spark using the Spark SQL module, which can read and write tables from databases stored in Hive.


5 Methodology and Implementation

Chapter 5 discusses the implementation of the methodologies in Chapter 2, using the data and technology discussed in Chapters 3 and 4 respectively. All work was conducted using the Hadoop framework.

5.1 Data Cleaning

The general banking features stored in Hadoop's data warehouse, Hive, comprise terabytes of data with heterogeneous origins. This makes them susceptible to missing, noisy, and inconsistent data (17). As described in Chapter 4, transaction data is stored in a raw format, requiring pre-processing before it is usable. Data cleaning is essential for any modelling or data mining task, and it is well understood that low quality data will yield low quality results (17).

Both the transaction data and general banking features contain inconsistencies, missing data, and ambiguities which could hinder model performance, or prevent Spark’s native libraries from executing (e.g. in the case of NULL values).

The data cleaning has been processed separately for general banking features and transaction data.

5.1.2 General Banking Features

As discussed in Chapter 3, the data available on the Hadoop platform was pulled from several sources across the Bank, and across different lines of business. To develop a comprehensive cleaning process, the methodology and logic of collecting and joining this data would have to be well understood; however, that is outside the scope of this research. Additionally, as there are over 600 general banking features, analyzing all inconsistencies and quality issues is out of scope for this work, and would require deep domain knowledge. An additional objective was to automate the feature preparation process and limit the required expert input. The data cleaning methodology was based on the work of previous research on the data (30).

An overview of the data cleaning process is provided below:


1) Features related to specific product and account type are joined from several different databases to the customer in the general banking features table. If the customer does not exist in the database being joined, a missing value occurs. The missing values are imputed with 0 – indicating that the customer does not hold that product or account.

2) Features with over 80% total missing customer data points are removed.

3) Categorical personal customer information is imputed with the mode, and continuous information with the mean. This method assumes that there is no pattern to the distribution of null data, and prevents the loss of data which would occur if the data points were dropped.
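A hedged PySpark sketch of these three cleaning rules is given below; the table and column names (general_banking_features, holds_heloc, holds_savings) are hypothetical stand-ins for the Bank's proprietary schema.

```python
import pyspark.sql.functions as F
from pyspark.ml.feature import Imputer
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("general_banking_features")

# 1) Missing product/account joins imply the customer does not hold the product.
df = df.fillna(0, subset=["holds_heloc", "holds_savings"])

# 2) Drop features with more than 80% missing values.
missing = df.select([(F.count(F.when(F.col(c).isNull(), c)) / F.count(F.lit(1))).alias(c)
                     for c in df.columns]).first().asDict()
df = df.select([c for c, frac in missing.items() if frac <= 0.80])

# 3) Impute continuous features with the mean (mode imputation for categorical
#    columns would be handled separately).
numeric = [c for c, t in df.dtypes if t in ("float", "double")]
imputer = Imputer(strategy="mean", inputCols=numeric,
                  outputCols=[c + "_imputed" for c in numeric])
df = imputer.fit(df).transform(df)
```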

5.1.3 Transaction Feature

Two years of transaction data were made available for this research. Transaction history is stored in a large table, each row containing a single transaction, the associated account identifier, and other available transaction information (stored in separate columns) as outlined in 3.2.

Accounts that are shared amongst two or more customers have also been removed. This is due to the potential bias it could induce in training, and because certain attributes must apply to a single person – for example, age.

As discussed in 3.2, the transaction text is not standardized, and required pre-processing before it could be used. There are approximately 1000 unique merchant category codes, and millions of distinct transaction texts. The degree of specificity that could be informative to study is not known, so different approaches have been evaluated. Any bank transactions were removed (e.g. interest charge, transfer).

The methods tested include:

1) No revision to original transaction text.

2) Truncating transaction text to 4-6 characters and appending the merchant category code.

3) Utilizing the regular expression library in Spark to execute the following steps:


▪ remove any digits
▪ remove underscores and punctuation
▪ remove whitespace

Method 3) provided the best results and has been used in the models described in this paper.
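A sketch of method 3 using Spark's regexp_replace is shown below; the table and column names (transactions, txn_text) are assumptions for illustration.

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
txn = spark.table("transactions")            # assumed Hive table of raw transactions

cleaned = (txn
           .withColumn("txn_text", F.regexp_replace("txn_text", r"\d", ""))      # remove digits
           .withColumn("txn_text", F.regexp_replace("txn_text", r"[_\W]", ""))   # underscores and punctuation
           .withColumn("txn_text", F.regexp_replace("txn_text", r"\s", "")))     # whitespace
```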

5.3 Transaction Features

After the data had been cleaned, several methodologies were used to create features. These methodologies are discussed in the subsequent sections.

5.3.1 Binning

Binning was the simplest methodology implemented, and refers to grouping the transactions into “bins” by the merchant category code. For each customer, the total transaction amount in each merchant category code over the 6-month observation window was added, creating a vector with a length of over 1000.
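A sketch of this binning step using Spark's groupBy/pivot is shown below; the table and column names (transactions, customer_id, mcc, amount) are assumptions.

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
txn = spark.table("transactions")

binned = (txn
          .groupBy("customer_id")
          .pivot("mcc")                   # one column per merchant category code (~1000)
          .agg(F.sum("amount"))           # total spend per category over the observation window
          .fillna(0.0))                   # no spending in a category -> 0
```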

5.3.2 Word2Vec

Application of the word2vec algorithm was previously implemented in proof-of-concept models by the Bank, and in the work of prior research students (30). In this section, methods to improve its usefulness are explored.

The word2vec algorithm has several parameters which are set by the modeller (any parameters not set remain at Spark’s defaults). The parameters set in this study, as well as their description per the Spark documentation, are described below (14).

MinCount is the minimum number of times a token must appear to be included in the word2vec model’s training vocabulary. This was set to 500 because the transaction text was already cleaned as outlined in 5.1.3. The cleaning process should have decreased the number of unique transaction texts (e.g. by removing digits GROCER123 and GROCER456 would be grouped together). Therefore, considering that there are millions of transactions, 500 was a reasonable threshold value.


NumPartitions is the number of partitions (a smaller number increases accuracy). This value partitions the data to be trained separately and then merged. This value was increased from 1 to 5 because the platform could not support running on one partition.

Vector Size is the size of the output embedded vector for each word, and was set to 50. The default value is 100, but the ideal vector size depends on the data used. A small vector may decrease accuracy by embedding the data in a low-dimensional space; however a large vector requires more storage space and more resources for training and modelling (31).

Window Size is the number of words before and after a given input word to build training examples from (as explained in 2.3.1). This value was set to 5 (the default). The ideal size depends on the data used. Too large of a window could add irrelevant data points, while too small of a window could leave out important data points.

The input required for Spark’s word2vec algorithm is a RDD of string sequences (14). To obtain this format, the cleaned transaction texts for each customer were grouped into chronological lists, with a single row per customer. The entirety of customer lists then acts as a group of “documents” for input for word2vec training. The product is a word embedding for each transaction text.

An example of the input is shown in Table 6:

Table 6: Data Input Format for Word2Vec

Client# | All Transactions (ordered by timestamp) in 6-month observation window
1       | MerchantA, MerchantB, MerchantA, MerchantD, …
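A sketch of training Spark MLlib's Word2Vec with the parameters listed above is given below. The source table and column names (customer_transaction_lists, transactions) and the example token "ARTGALLERIES" are assumptions, not the Bank's actual schema.

```python
from pyspark.sql import SparkSession
from pyspark.mllib.feature import Word2Vec

spark = SparkSession.builder.getOrCreate()
# Assumed table shaped like Table 6: one array<string> of cleaned transaction texts per customer.
docs = spark.table("customer_transaction_lists").rdd.map(lambda row: row.transactions)

w2v = (Word2Vec()
       .setVectorSize(50)       # size of each output embedding
       .setWindowSize(5)        # context window around each transaction
       .setMinCount(500)        # minimum occurrences of a transaction text
       .setNumPartitions(5))    # partitions trained separately and merged
model = w2v.fit(docs)                         # one 50-dimensional vector per transaction text
print(model.findSynonyms("ARTGALLERIES", 5))  # nearest merchants by cosine similarity (hypothetical token)
```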

Two sets of vectors for embedding merchant information were created:

▪ vectors based on the mcc: the algorithm was trained directly on the merchant category code (mcc). This is equivalent to replacing the transaction text with the mcc converted to a string.
▪ vectors based on the merchant's transaction text: the algorithm was trained on the cleaned transaction text using the methods described in 5.1.3.


Embedding merchant information provides interpretable results. For example, in the case of the embeddings based on the mcc, the closest vectors based on cosine distance (interpreted as most synonymous) to "Art Galleries" were "Photofinishing and Photo Developing Laboratories," "Artist's Supply and Craft Shops" and "Tourist Attractions and Exhibits."

Once the transaction-level vectors are determined, they must be aggregated into customer-level vectors.

5.3.3 Customer Vectors

Methods for creating customer-level vectors typically involve summing the individual transaction-level vectors, with variations of scaling by the transaction amount (the value of, or amount spent on, a given transaction) and by the frequency of the merchant amongst all transactions or amongst an individual customer’s transactions.

On that basis, this thesis has explored similar methodologies as well as opportunities for improvement. The following three methodologies for customer vector aggregations were tested:

Method 1

For each customer, sum the product of each transaction’s vector and transaction amount (am).

For each transaction vector $v_n$, from $v_1$ to $v_T$, the customer vector $v_c$ is calculated as:

$$v_c = \sum_{n=1}^{T} (am_n)\,(v_n)$$

Method 2

Use the SIF method of Arora et al. (15), with the addition of scaling by transaction amount, followed by removal of the first principal component. Here $|c|$ is the number of transactions made by the customer, $a$ is a constant, and $p(n)$ is the probability of the word occurring; $p(n)$ is determined as the number of occurrences of the transaction text $n$ across all customers, divided by the total count of transactions.


The constant $a$ was tested at 0.001, per the recommendation of Arora et al. (15); other values, such as the average number of times a transaction text occurs in the dataset, were also tested but showed decreased performance.

$$v_c = \frac{1}{|c|} \sum_{n=1}^{T} \frac{(am_n)\,a}{a + p(n)}\, v_n$$

Next, for the matrix $C$, whose columns are all customer vectors $[v_{c_1}\ v_{c_2}\ \dots\ v_{c_n}]$, the component along the first singular vector $u$ is removed (15):

$$C = C - uu^{T}C$$

Method 3

Use the SIF without scaling by transaction amount, followed by removal of the first principal component.
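As a sketch of the simplest aggregation (Method 1), assuming the transaction-level embeddings have already been joined to the transactions as a hypothetical RDD of (customerId, amount, embedding):

```scala
import org.apache.spark.rdd.RDD

// Method 1: the customer vector is the element-wise sum of
// (transaction amount x transaction embedding) over the customer's transactions.
def method1(transactionVectors: RDD[(String, Double, Array[Double])]): RDD[(String, Array[Double])] =
  transactionVectors
    .map { case (customerId, amount, vec) => (customerId, vec.map(_ * amount)) }  // scale by amount
    .reduceByKey((a, b) => a.zip(b).map { case (x, y) => x + y })                  // element-wise sum
```

Method 2 would additionally weight each scaled vector by $a/(a + p(n))$, divide by the customer's transaction count, and then subtract the projection onto the first singular vector, as in the equations above.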

For binned transactions (1000+ vector length), dimensionality reduction through Principal Component Analysis was initially tested, and resulted in lower model performance. It was therefore not used in the final modelling.

5.3.4 K-Means Clustering

Bank models frequently use a select number of features based on domain knowledge and statistical feature selection (e.g. importance, p-value). These features can easily be joined from regularly updated tables at the Bank, whereas updating and storing full customer vectors on a regular basis requires significantly more computation and storage. The commonly used unsupervised algorithm K-means clustering therefore offers a simple, potentially useful predictor that can be extracted from this data (30). The output of this algorithm is a cluster identifier, which may be used as a predictor in a model even if the meaning or significance of cluster membership is not known. The algorithm was run on the customer vectors described in 5.3.3, setting the output number of clusters to 5. Although other cluster numbers (10, 100) were initially tested, 5 provided the best result. In the future, a more comprehensive search for the optimal number of clusters should be conducted.
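A minimal sketch of this step with Spark ML's KMeans, assuming a hypothetical DataFrame of customer vectors with `customer_id` and `features` (Spark ML Vector) columns:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.sql.DataFrame

// Cluster the customer vectors and keep only the cluster identifier,
// which is a much more compact feature than the full vector.
def clusterCustomers(customerVectors: DataFrame): DataFrame = {
  val kmeans = new KMeans()
    .setK(5)                        // 5 clusters gave the best result in this study
    .setFeaturesCol("features")
    .setPredictionCol("cluster_id")

  kmeans.fit(customerVectors)
    .transform(customerVectors)
    .select("customer_id", "cluster_id")
}
```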


An example of a clustering output, where the number of clusters is set to 100, is shown in Table 7 (only 10 out of 100 clusters shown).

Table 7: Sample Output of K-Means Clustering for a Customer Sample

Customer   cluster_ID
1111       10
2222       8
3333       2
4444       10


5.3.5 Summary of Tables

In 5.4, the features described in 5.3 are used in machine learning models. The table below summarizes the tables used. The corresponding model number in 5.4 is provided.

Table 8: Modelling Feature Data Tables

Model No.   Name                                                Description                                                Reference Section
1           General Banking Features                            HELOC: selected features from a benchmark marketing       3.3, 5.1.2
                                                                model (16 features); SA: all available features (540)
2a          Transaction Features 1                              Sum of a customer's transaction-level vectors scaled      3.2, 5.3.3
                                                                by the amount of the transaction (Method 1)
2b          Transaction Features 2                              Sum of a customer's transaction amounts in each mcc,      3.2, 5.3.3
                                                                scaled by the amount of the transaction (Method 1)
2c          Transaction Features 3                              Sum of a customer's transaction-level vectors based on    3.2, 5.3.3
                                                                mcc, scaled by the amount of the transaction (Method 1)
2d          Transaction Features 4                              Customer's transaction-level vectors aggregated using     3.2, 5.3.3
                                                                SIF: a) Method 2, b) Method 3
3           General Banking Features + Transaction Features 4   See above
3b          General Banking Features + Transaction Features 1   See above
4a          General Banking Features + Transaction Features 4   See above
            + Cluster ID
4b          General Banking Features + Transaction Features 2   See above
            + Cluster ID

5.4 Models

Both target variables (HELOC and SA) are highly imbalanced, meaning there are significantly more examples of the negative outcome (the customer is not a responder and did not acquire the banking product within the observation window) than of the positive outcome.

Using this data directly for modelling would cause any default model to skew its predictions towards the majority class, as this would minimize the model’s loss function (error) and maximize accuracy. However, the objective of these models is to identify the minority of positive outcomes.

Two approaches were explored to mitigate this error:

1) The negative class could be down-sampled and the positive class over-sampled. This involves a combination of replicating the positive data points in the dataset and taking a sample of the negative data points until the desired ratio is achieved (for example, a 1:1 ratio of positive to negative examples). The downside of down-sampling is that it decreases the overall number of examples the model has available to learn from. The downside of over-sampling is that the model may over-fit to the particular examples that appear in the training data and fail to generalize to new data.

2) The loss function of the model could be scaled so that positive examples are given more weight and therefore the model does not skew towards predicting the majority class in order to minimize the loss function.

The loss function (F) for machine learning models is a function of the prediction ($\hat{y}_i$) and the true value ($y_i$), summed across all examples ($n$). An example of F could be the squared error used in linear regression.

$$\text{Total Loss}(x) = (\text{Scaling Factor}) \sum_{i=1}^{n} F(\hat{y}_i, y_i)$$

Scaled loss function

The distribution of the data (customers chosen for study) is shown in Table 9.

Table 9: Raw Target Data Distribution for Modelling Data

                              HELOC     SA
Number of Customers           925464    9507974
Positive Outcome Percentage   <1%       2-3%


The native libraries provided by the Spark ML version used do not offer built-in functionality to scale the loss function. Therefore, for the initial modelling, approach 1) was used to create a 1:1 ratio of positive to negative examples. Other ratios were also tested; however, a skewed (not 1:1) ratio resulted in significant model skew towards the majority (negative) class.
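A minimal sketch of approach 1), assuming a hypothetical `label` column (1 = responder, 0 = non-responder); the replication count and sampling fraction would be chosen to reach the desired ratio:

```scala
import org.apache.spark.sql.DataFrame

// Over-sample positives by replication and down-sample negatives by random sampling.
def rebalance(data: DataFrame, positiveCopies: Int, negativeFraction: Double): DataFrame = {
  val positives = data.filter("label = 1")
  val negatives = data.filter("label = 0")
    .sample(withReplacement = false, negativeFraction, seed = 42L)

  val oversampled = Seq.fill(positiveCopies)(positives).reduce(_ union _)
  oversampled.union(negatives)
}
```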

5.4.1 Logistic Regression

Spark’s implementation of Logistic Regression was tested with default parameters.

Table 10: Logistic Regression Parameters

Parameter        Description                         Value
regParam         Regularization parameter            0
standardization  Training feature standardization    True
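A minimal sketch of this configuration, assuming a hypothetical training DataFrame with `features` and `label` columns:

```scala
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.sql.DataFrame

// Logistic regression with the settings of Table 10 (no regularization,
// feature standardization enabled).
def fitLogistic(train: DataFrame): LogisticRegressionModel =
  new LogisticRegression()
    .setRegParam(0.0)
    .setStandardization(true)
    .fit(train)
```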

5.4.2 Gradient Boosted Decision Tree

Spark’s implementation of the Gradient Boosted Decision Tree was tested with default parameters, and significantly outperformed logistic regression.

Table 11: Gradient Boosted Decision Tree Parameters

Parameter    Description                                                   Value
maxDepth     Maximum depth of the tree                                     None
minInfoGain  Minimum information gain for a split to be considered at a   0
             tree node
numTrees     Number of trees built                                         20
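A minimal sketch of this configuration, again assuming hypothetical `features` and `label` columns; the numTrees of Table 11 corresponds to maxIter (the number of boosting iterations) in Spark ML's GBTClassifier, and maxDepth is left at the library default:

```scala
import org.apache.spark.ml.classification.{GBTClassifier, GBTClassificationModel}
import org.apache.spark.sql.DataFrame

// Gradient boosted trees with the settings of Table 11.
def fitGBT(train: DataFrame): GBTClassificationModel =
  new GBTClassifier()
    .setMaxIter(20)        // 20 trees
    .setMinInfoGain(0.0)   // minimum gain required for a split
    .fit(train)
```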

5.4.3. XGBoost

Although Spark ML’s native Gradient Boosted Decision Tree classifier consistently outperformed logistic regression, the open-source XGBoost library offers additional functionality and efficiency on distributed platforms such as Hadoop (19). The back-end algorithms developed by Tianqi Chen and Carlos Guestrin allow for faster training and efficient scalability (19). Although the framework has been developed for several platforms, the Spark-Scala API was used. Table 12 outlines the hyperparameters used.

One of the main advantages of the model was the ability to set the parameter scale_pos_weight. The recommended value of $\frac{\text{sum of negative instances}}{\text{sum of positive instances}}$ was used, which allowed more negative examples in the training data to be retained. The scaling occurs when the model loss is calculated and minimized: for positive examples, the log loss is scaled by the scale_pos_weight factor. Additional hyperparameters are shown in Table 12.

$$\text{Log Loss} = -\frac{1}{n}\sum_{i=1}^{n} (\mathit{scale\_pos\_weight})\left[\, y_i \log(p(y_i)) + (1-y_i)\log(1-p(y_i)) \,\right]$$

Example: log loss function scaled by scale_pos_weight for each data point

Table 12: XGBoost Parameters

Parameter         Description                                                          Default      SA               HELOC
eta               Learning rate: step size shrinkage used in update                    0.3          0.2              0.1
max_depth         Maximum depth of a tree                                              6            6                15
alpha             L1 regularization term on weights                                    0            0.001            0.001
scale_pos_weight  Controls the balance of positive and negative weights,               1            5                20
                  useful for unbalanced classes
subsample         Subsample ratio of the training instances per boosting iteration     1            0.25             0.75
objective         Learning objective                                                   reg:linear   binary:logistic  binary:logistic
num_round         The number of rounds for boosting (number of trees)                  -            30               100

Parameter descriptions and default values are found in the XGBoost documentation (32).
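As a rough sketch of this configuration using the HELOC column of Table 12 (the exact class and constructor vary between XGBoost4J-Spark releases, so this assumes the XGBoostClassifier interface and hypothetical `features`/`label` column names):

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
import org.apache.spark.sql.DataFrame

// XGBoost on Spark with the HELOC hyperparameters from Table 12.
def fitXGBoostHeloc(train: DataFrame) = {
  val params = Map(
    "eta"              -> 0.1,
    "max_depth"        -> 15,
    "alpha"            -> 0.001,
    "scale_pos_weight" -> 20.0,
    "subsample"        -> 0.75,
    "objective"        -> "binary:logistic",
    "num_round"        -> 100
  )
  new XGBoostClassifier(params)
    .setFeaturesCol("features")
    .setLabelCol("label")
    .fit(train)
}
```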

Because the data was highly skewed, it was down-sampled prior to scaling the positive weight. Although this resulted in a smaller dataset with fewer negative examples, the alternative of applying a very high scaling factor to positive examples may have resulted in overfitting to the specific positive examples seen in the training set.

The HELOC data contained fewer positive examples and less data, so the down-sample ratio was set to allow for more negative examples and hence more data to be used.


The process is outlined below:

1) Positive examples are triplicated.
2) Data is down-sampled to a 20:1 negative:positive ratio for HELOC and 5:1 for SA models.
3) The scale_pos_weight factor is set to 20 for HELOC and 5 for SA.

The final result of this process is that positive and negative examples have the same impact on model loss and hence on model training.

5.5 Model Tuning and Testing

5.5.1 Model Testing

Machine learning typically involves three sets of data:

▪ Training Set: The input data to the model, used in calculating weights or optimal parameters (such as splits for a decision tree).
▪ Validation Set: A sample of data set aside during training (not used in model training). Once the model is trained, its performance is evaluated on the validation data. The model is then adjusted accordingly and tested again.
▪ Testing Set: A set of data previously not seen by the model. In a predictive model this would represent new data – for example a new cohort of customers – that the model could be used on.

The training cohort is divided into training (90% of data) and validation (10% of data) sets. The splits are done randomly, and the best performing models are additionally tested by splitting the data 3 different times and then evaluating performance on the validation set to ensure the model is consistent and robust.
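A minimal sketch of this split, assuming `cohort` is the training-cohort DataFrame; repeating the call with different seeds gives the robustness check described above:

```scala
import org.apache.spark.sql.DataFrame

// 90/10 random split into training and validation sets.
def splitCohort(cohort: DataFrame, seed: Long): (DataFrame, DataFrame) = {
  val Array(training, validation) = cohort.randomSplit(Array(0.9, 0.1), seed)
  (training, validation)
}
```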

The entire testing cohort is used for evaluating the performance of the best models. This process is meant to mimic how the model would perform in practice if used in the future, when the targets are not yet known.


5.5.2 Parameter Tuning with Grid Search

Although most machine learning packages provide default parameters, which may perform well across a large range of datasets, selecting appropriate hyperparameters for a model is necessary to achieve optimal performance (as shown with the scale_pos_weight parameter). The default parameters for XGBoost are provided in Table 12.

Spark ML provides built-in functionality for hyperparameter tuning with cross-validation using grid search. K-fold cross validation involves partitioning the data into k sets (folds). The model is trained on all but one of the sets, and then tested on the left-out set. This is repeated until each fold has a turn being “left-out”. The average result and standard deviation across folds can be analyzed. This is done to ensure that the model is not overfitting to a particular set of data, and can generalize to new data.

Grid search involves specifying a list of values for each parameter to be tested. The model is then trained on each possible combination of parameters, and returns the best combination based on a specified metric. Each combination of parameters could be tested using cross validation.

This process involves re-training the model several times and is therefore resource intensive. The more folds are used, the more times the model must be trained. However, using few folds involves using smaller subsets of the data in each training iteration. In addition, as Chapter 6 shows, overfitting to a specific time period (training cohort) is more significant than potential overfitting to a subset of the data.
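A minimal sketch of the pattern in Spark ML, illustrated with the native GBTClassifier (the same structure applies to any estimator exposing Spark ML Params); the grid values mirror the HELOC row of Table 13 below:

```scala
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val gbt = new GBTClassifier()

val paramGrid = new ParamGridBuilder()
  .addGrid(gbt.maxDepth, Array(5, 10, 15))
  .addGrid(gbt.maxIter, Array(10, 25, 100))   // boosting rounds
  .build()

val cv = new CrossValidator()
  .setEstimator(gbt)
  .setEvaluator(new BinaryClassificationEvaluator())  // areaUnderROC by default
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)    // each combination: train on 2/3 of the data, validate on 1/3

// val cvModel = cv.fit(trainingSet)   // trainingSet: the training cohort DataFrame
```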

The maximum depth of each tree, and the number of boosting rounds (number of consecutive trees) was tested using cross validation. The results are shown below:

Table 13: Hyperparameter Tuning Results

            num_round                       max_depth
Model       Tested         Best Result      Tested        Best Result
HELOC       10, 25, 100    100              5, 10, 15     15
SA          10, 30         30               4, 6          6


It can be observed that a higher accuracy is achieved with more boosting rounds and a larger maximum tree depth - both of which create a more complex model.

Fewer parameters were tested for the SA model because it uses significantly more data and therefore more resources. In a separate test, 100 boosting rounds (num_round) showed comparable results to 30 rounds, so it was not included in the hyperparameter tuning.

5.5.3 Evaluation Metrics

The metrics used to evaluate a model depend on the data and how the model will be used.

In order to determine the value of new features and models, the results were assessed using the metrics currently used by the Bank’s marketing group – lift and response rate. These metrics are applicable to the Bank’s marketing campaign assessment as well as to data that is highly imbalanced.

Lift and response rate compare the outcomes of a predictive model to a baseline, in this case random sampling. Lift in this application is defined as the rate of true positive outcomes (responders) in a portion of the customer universe divided by the average response rate of the entire customer universe modelled.

The models used output both a prediction (binary) and a continuous probability value. The top decile is the top 10% of customers, when ordered by descending probability of belonging to the positive class. This is an effective metric in marketing as it allows for the selection of a smaller subset of customers, who will be more likely to (positively) respond, rather than marketing to all eligible customers.

$$\text{Lift in top decile} = \frac{\%\ \text{positive outcomes in the top 10\% of predicted positive probabilities}}{\text{average \% of positive outcomes}}$$

The response rate in this study is the percentage of positive outcomes captured within the top 3 deciles (top 30%) of customers, ordered from most to least likely to be responders as predicted by the model. This demonstrates a model’s effectiveness: for example, it may recommend that a campaign be extended to only 30% of potential customers, while that subset captures 70% of the actual responders.
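A minimal sketch of computing both metrics from a scored dataset, assuming hypothetical `label` (1 = responder) and predicted positive `probability` columns:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Returns (lift in top decile, fraction of responders captured in top 3 deciles).
def liftAndCapture(scored: DataFrame): (Double, Double) = {
  val n = scored.count().toDouble
  val totalResponders = scored.filter(col("label") === 1).count().toDouble
  val ranked = scored.orderBy(col("probability").desc)

  val topDecile = ranked.limit((n * 0.1).toInt)
  val top3Deciles = ranked.limit((n * 0.3).toInt)

  val averageResponse = totalResponders / n
  val topDecileResponse = topDecile.filter(col("label") === 1).count().toDouble / topDecile.count()

  val lift = topDecileResponse / averageResponse
  val captured = top3Deciles.filter(col("label") === 1).count().toDouble / totalResponders
  (lift, captured)
}
```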


6 Results and Analysis

This chapter provides the results of the models and features described in Chapter 5, as well as an analysis of the results. The results are reported for the XGBoost model with the parameters outlined in 5.4.3, using the features described in 5.3.5.

6.1 Results

Model performance is reported in terms of the commonly used metrics at the Bank’s marketing group. Lift and (fraction) Captured in Top 3 Deciles are explained in more detail in 5.5.3.

The data and metrics reported in this section are:

1) Lift Top Decile: fraction of positive respondents in the top decile divided by fraction of positive responders in the entire dataset

Responders Top Decile: percentage of positive respondents in the top decile
Average Response: percentage of positive responders in the entire dataset

2) Captured in Top 3 Deciles: percentage of all positive responders in the dataset found in the top 3 deciles

The output of the models is the probability of a customer being a (positive) respondent. The dataset is then ordered from most likely to least likely to be a responder. This ordered set is then divided equally into ten parts, or deciles. The top decile is the decile that captures the 10% of customers predicted most likely to be a responder. The top 3 deciles capture the 30% of customers predicted most likely to be a responder.

6.1.1 Benchmark

The models tested in this chapter are based upon propensity models for HELOC and SA used at the Bank. Although there are several versions and iterations of each model, the models that were most recent and applicable to the data and objectives of this study were selected as a benchmark.


[Figure: timeline from January to December divided into an Observation Window, a Latency period, and an Outcome Window]

Figure 8: Example of Observation and Outcome Window for Benchmark Models

Figure 8 shows an example of the possible observation, latency, and outcome windows for previously tested models at the Bank. Some models use a larger observation window; however, this window is often used to calculate statistics over time, which are then summarized in month-end tables (e.g. 6-month average account balance), and the summarized data is collected as of the end of the observation window. For a model to be usable in production, a latency period must also be introduced to allow time to act on new data. The “Out-of-Time” (testing) cohort is taken in subsequent months. The details of the benchmark models and their performance are proprietary to the Bank.

6.1.2 Modelling Results

To test the performance of the features summarized in 5.3.5, the best performing XGBoost model was trained using these features, with the parameters shown in Table 12. Tables 14 and 15 show the validation results on the training cohort. In the training cohort, both the HELOC and SA models show improvement over the benchmark.

Table 14: HELOC XGBoost Modelling Results

HELOC Model No.   Lift Top Decile   % Captured in Top 3 Deciles
1                 3.88              65
2a                3.38              56
2b                2.80              42
2c                3.13              49
2d-a              3.25              53
2d-b              3.70              56
3                 4.25              67
3b                3.87              58
4a                3.70              59
4b                3.93              64


Table 15: SA XGBoost Modelling Results

SA Model No.   Lift Top Decile   % Captured in Top 3 Deciles
1              3.83              68
2a             2.69              56
2b             2.79              57
2c             2.57              54
2d-a           2.41              53
2d-b           2.41              54
3              3.89              68
3b             3.97              69

6.1.3 Testing Results

To further assess the best performing models in Tables 14 and 15, they must be tested on the “Out-of-Time” testing cohort. Although the SA performance remains consistent and shows improvement over the benchmark, the HELOC model shows a large drop in performance on the testing cohort.

Table 16: HELOC Modelling Results on Testing Cohort

HELOC Model No.   Lift Top Decile   % Captured in Top 3 Deciles
1                 1.80              46
2d-a              1.40              39
3                 1.90              47

Table 17: SA XGBoost Modelling Results on Testing Cohort

SA Model No.   Lift Top Decile   % Captured in Top 3 Deciles
1              3.86              68
2d-a           2.50              55
3              3.86              68


6.1.4 Average Results for Best Model

The best model for each target was Model No.3. The best model was re-trained 3 times to ensure that it was reproducible and robust. Each time, the random split between training and validation was different, generating a slightly different model. Each of the three models was then tested on the testing data. It can be observed that the model performance does not change significantly across each run (the standard deviation is also reported). The full results for each test are shown in Appendix 1.

Table 18: Repeated Training on Model No. 3

                             Validation                                     Out of Time
Model Trial                  Lift Top Decile   % Captured in Top 3 Deciles  Lift Top Decile   % Captured in Top 3 Deciles
HELOC - Average              3.97              64                           1.86              47
HELOC - Standard Deviation   0.29              3                            0.02              0.5
SA - Average                 3.87              68                           3.86              68
SA - Standard Deviation      0.05              0.3                          0.00              0

6.1.5 Additional Parameter Testing

To mitigate the potential overfitting in the HELOC model, additional parameters were tested. It should be noted that when parameter tuning with cross-validation is completed on the training cohort it could still result in over-fitting to the specific time period. Although a more complex model could generalize well within the cohort, it may not perform on new, unseen data. Selecting parameters to optimize testing results negates the purpose of using the testing cohort as an imitation of completely unseen data. However, to identify if overfitting was occurring, an additional HELOC model was trained using a smaller tree depth (6 instead of 15), a smaller number of boosting rounds (25 instead of 100), as well as higher regularization.

Table 19: Additional XGBoost Hyperparameters Tested

Parameter         Description                                                          Default      HELOC - new parameters   HELOC
eta               Learning rate: step size shrinkage used in update                    0.3          0.1                      0.1
max_depth         Maximum depth of a tree                                              6            6                        15
alpha             L1 regularization term on weights                                    0            0.005                    0.001
scale_pos_weight  Controls the balance of positive and negative weights,               1            20                       20
                  useful for unbalanced classes
subsample         Subsample ratio of the training instances per boosting iteration     1            0.75                     0.75
objective         Learning objective                                                   reg:linear   binary:logistic          binary:logistic
num_round         The number of rounds for boosting (number of trees)                  -            25                       100

6.1.6 Results for Additional Parameters

A model with the hyperparameters shown in Table 19 was trained and evaluated on both the validation data and the testing cohort. The results are shown in Table 20.

Table 20: Results for Additional Hyperparameter Testing

HELOC Model Trial   Lift Top Decile   % Captured in Top 3 Deciles
Validation          3.11              62
Testing             2.06              50

As shown in Table 20, when the HELOC model was simplified it showed improved performance on the testing cohort; however, this was insufficient to outperform the benchmark model.

6.2 Analysis

6.2.1 HELOC

The results in Table 14 show that using both the general banking features (16 features used in the benchmark marketing model) and the transaction features (Model No.3) provides a significant increase in lift on the training cohort validation data, in comparison to the benchmark marketing model. The lift in the top decile for this model (Model No.3) is 4.25, with 67% of the positive responders captured within the top 3 deciles, demonstrating significant (at least 10%) improvement over benchmark models at the Bank.


To further assess the various feature engineering techniques used, the XGBoost model was also trained to predict the target using only the transaction features. All methods show improvement over simply binning the transactions by merchant category code (Model No.2b), which demonstrates the utility of dense transaction embeddings using word2vec.

An improvement is observed when using the methodology proposed by Arora et al. in “A Simple but Tough-to-Beat Baseline for Sentence Embeddings” over other methods for creating customer-level vectors when transaction features are tested alone (Model No. 2d).

An improvement in model performance on the training cohort was also observed when using the marketing features together with the cluster identifier. The cluster ID was more effective when k-means was run on the simple sums per transaction category code (Model No. 4b) rather than on customer vectors built from transaction vectors (Model No. 4a). Although a possible explanation is that the increased complexity could have introduced noise, the usefulness of clustering requires more comprehensive exploration.

Although the best HELOC model showed a large improvement on the training cohort validation data, this improvement was not observed in the “Out-of-Time” testing cohort (Table 18). Both the top decile lift and the responders captured in the top 3 deciles were significantly lower, demonstrating that the model over-fit to the time period it was trained on. It can also be noted that for the benchmark model, testing occurs in subsequent months, rather than one full year forward as is done in this study.

The fraction of respondents for HELOC acquisition is smaller than for SA, and the training cohort is smaller because to be eligible for HELOC the customer must have had a mortgage with the Bank. The smaller size of the dataset may cause it to be more vulnerable to fluctuations, which are observed in the data between different cohorts.

The average response rate changed by over 50% between training and testing – a change of similar magnitude was observed in benchmark models. Based on the literature review in Chapter 2, HELOC acquisition may be more sensitive to external conditions than new savings accounts.


To prevent overfitting, the parameters of the model were adjusted in order to make the model less complex. These parameters are shown in Table 19. Although these parameters showed some improvement on the testing cohort, the final testing result was still below the benchmark.

Considering the decrease in model performance on the testing cohort, a future consideration would be to create models using several years of historical data or retraining the model with additional data collected year to year rather than a single year.

6.2.2 SA

As shown in Table 15, the SA model demonstrated the best results using both the banking features (all available features) and the transaction features (Model No. 3), although the improvement over the banking features alone is less significant. There is significant improvement in lift (at least 10%) and some (2-5%) improvement in the fraction captured in the top 3 deciles using the combined model in comparison to benchmarks at the Bank.

As shown in Table 17, improvement is also seen in the testing cohort. The lift remains at 3.86, with 68% of responders captured in the top 3 deciles. The average response rate for acquisition of SA accounts is also very stable year-to-year.

The transaction features alone are not as effective at predicting SA acquisition, and the distinction between various methodologies is not as clear. Therefore, although using transaction data combined with banking features results in overall improvement over the benchmark, further research is required to determine the optimal methodology.


7 Conclusions

Although word embedding methodologies for transaction data have been previously explored, new ways of aggregating transaction-level vectors into customer vectors have been tested and used in conjunction with open-source machine learning models in the Apache Spark framework. New approaches have been tested, such as embedding the merchant category code, and using cluster identifiers from the k-means algorithm as a feature. The XGBoost model showed potential improvement over traditional models such as Logistic Regression when sufficient data is used.

For both models, transaction features alone are able to provide model lift. It can also be noted that in the case of the SA model, the feature selection process did not require any expert opinion or manual feature selection. The model is able to use all of the available features, and the results show an improvement over the benchmark. The improvement in lift is at least 0.2. This means that for a customer universe of 1 million, for example, the top decile (top 10%) would contain 100 000 customers. On average, the fraction of customers who will acquire a savings account in the outcome window is 0.02, or 2000 customers. The portion of responders in the top decile of the model is about 0.08, or 8000 customers. Therefore, an improvement in lift allows responders to be targeted more effectively. The HELOC model demonstrated a larger improvement in lift on the training cohort, but did not generalize well to new data in the testing cohort.

The initial results demonstrate that there is potential for transaction features to improve the performance of the Bank’s models. They also show that raw transactions could be converted to usable features without a rule-based approach. The HELOC model shows the potential usefulness of clustering transaction features and using the cluster identifier in the model, rather than storing full customer vectors.

The models demonstrate several benefits and drawbacks of using more advanced machine learning models. This includes the potential for overfitting, and the requirement for a larger amount of data and resources. Additional complexity is added to the interpretation of features and feature importance.


7.1 Future Work

Although the results demonstrate the potential benefits of using machine learning and automated feature engineering strategies in traditional banking models, they also demonstrate the large potential for overfitting when models become too complex.

The parameters, models, and methods outlined in this report are not comprehensive and there remains potential for further improvement and parameter tuning in the word2vec and customer vector building algorithms, as well as the models themselves.

As described in 5.3.2, the parameters chosen for word2vec are based on previous benchmarks, research, as well as intuition. The usefulness of the transaction-level vectors for embedding merchant information could benefit from more data, and further study of optimal vector length, window size, and transaction text cleaning.

The use of only one year of data (a single time period) was a significant limitation of this work. As shown in the results, it could cause overfitting to a specific time period and poor generalization out-of-time. To mitigate this effect, several years of data could be agglomerated, or a sliding window could be used. This would have several benefits. Firstly, it would provide the machine learning models with more data, which could improve model performance, assist in filtering out “noise”, and limit overfitting to a specific time period. Secondly, it would allow using changes across time as a feature. For example, the change in year-to-year spending in a specific merchant category code could be an indicative feature. Thirdly, a longer history of data could allow for time-series models to be run. Because the 6-month observation window took place from January to October, seasonal behaviour was overlooked. A significant potential downside would be the increase in resources required to train the model.

Although this study has been conducted for a Bank, the applications can extend beyond financial institutions. An evident application would be the retail industry, where each transaction also provides itemized information on products. This structured data format lends itself well to rule-based modelling; however, the unprecedented volumes of data create opportunities for machine learning methodologies, allowing for automated extraction of patterns in the data.


This thesis focused solely on the standardized transaction tables provided by credit and debit card institutions and banks. There is extensive further research that could be done by joining this data to information on the merchant or the contents of the transaction. Other factors, such as socioeconomic and political changes should also be assessed in more detail when analyzing decisions such as HELOC acquisition.


References

1. Accenture. Exploring Next Generation Financial Services: The Big Data Revolution. 2017.

2. Continuous Embedding Spaces for Bank Transaction Data. Ali Batuhan Dayioglugil, Yusuf Sinan Akgul. İstanbul : Cybersoft R&D Center, 2017. Foundations of Intelligent Systems, Lecture Notes in Computer Science. Vol. 10352.

3. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. 2013.

4. Emilian Siman, Michael S. Finke, Melvin Corlija. Which Consumers Are Using Home Equity Lines of Credit? . Consumer Interests Annual. 2006, Vol. 52.

5. Femina Bahari T, Sudheep Elayidom M. An Efficient CRM-Data Mining Framework for the Prediction of Customer Behaviour . International Conference on Information and Communication Technologies. 2014.

6. Hsieh, Nan-Chen. An Integrated Data Mining and Behavioral Scoring Model for Analyzing Bank Customers. Expert Systems with Applications. 2004, 27.

7. Transaction aggregation as a strategy for credit card fraud detection. C. Whitrow, D. J. Hand, P. Juszczak, D. Weston, N. M. Adams. s.l. : Springer Science+Business Media, 2008, Data Min Knowl Disc (2009).

8. Alejandro Correa Bahnsen, Djamila Aouada, Aleksandar Stojanovic, Björn Ottersten. Feature engineering strategies for credit card fraud detection. University of Luxembourg, Interdisciplinary Centre for Security, Reliability and Trust. s.l. : Expert Systems With Applications, 2016. pp. 134-142.

9. Leonardo Baldassini, Jose Antonio Rodriguez Serrano. client2vec: Towards Systematic Baselines for Banking Applications. BBVA Data & Analytics. 2018.

10. USDA. USDA Departmental Management. VISA MERCHANT CATEGORY CLASSIFICATION. [Online] https://www.dm.usda.gov/procurement/card/card_x/mcc.pdf.

11. Towards Universal Paraphrastic Sentence Embeddings. John Wieting, Mohit Bansal, Kevin Gimpel, Karen Livescu. Chicago : s.n., 2016. ICLR.

12. Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. 2013.

13. Academo. 3d-vector-plotter. Academo.org. [Online]

14. Apache Spark. Word2Vec. [Online] https://spark.apache.org/docs/2.2.0/mllib-feature-extraction.html#word2vec.


15. A Simple but Tough-to-Beat Baseline for Sentence Embeddings. Sanjeev Arora, Yingyu Liang, Tengyu Ma. 2017. ICLR.

16. Avrim Blum, John Hopcroft, and Ravindran Kannan. Foundations of Data Science. 2018.

17. Han, J., Kamber, M., & Pei, J. Data Mining: Concepts and Techniques (3rd ed.). Amsterdam : Elsevier/Morgan Kaufmann, 2012.

18. Friedman, Jerome H. Greedy Function Approximation: A Gradient Boosting Machine. 1999.

19. Tianqi Chen, Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. University of Washington.

20. Starmer, Josh [StatQuest with Josh Starmer]. Gradient Boost Part 3: Classification. April 8, 2019.

21. Government of Canada. Getting a home equity line of credit. Financial Consumer Agency of Canada. [Online] 06 26, 2018.

22. Apache Hadoop. Apache Hadoop. Apache Hadoop. [Online] Apache. https://hadoop.apache.org/.

23. Cloudera. Cloudera Data Warehouse. Cloudera. [Online] https://www.cloudera.com/products/data-warehouse.html.

24. Apache Spark. Spark Overview. [Online] https://spark.apache.org/docs/2.2.1/.

25. Ghemawat, Jeffrey Dean and Sanjay. MapReduce: Simplified Data Processing on Large Clusters. Google, Inc. 2004.

26. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. s.l. : University of California, Berkeley, 2012.

27. Apache Spark. Cluster Mode Overview. Apache Spark. [Online] https://spark.apache.org/docs/2.2.1/cluster-overview.html.

28. —. Machine Learning Library (MLlib) Guide. Apache Spark. [Online] http://spark.apache.org/docs/latest/ml-guide.html.

29. Apache Hive. Apache Hive. [Online] https://hive.apache.org/.

30. Yun Hsuan Lee, Shi Miao Zhang. A Machine Learning Approach to Joint Finance Prediction: Exploring the Usefulness of Transaction Data. s.l. : University of Toronto, 2017.

31. GloVe: Global Vectors for Word Representation. Jeffrey Pennington, Richard Socher, Christopher D. Manning. Doha, Qatar : Association for Computational Linguistics, 2014. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543.

32. XGBoost. XGBoost Parameters. [Online] 2016.


33. Enhancing Sentence Embedding with Generalized Pooling. Qian Chen, Zhen-Hua Ling, Xiaodan Zhu. Santa Fe : s.n., 2018, Proceedings of the 27th International Conference on Computational Linguistics, pp. 1815–1826.

34. M.Bishop, Christopher. Pattern Recognition and Machine Learning. Singapore : Springer, 2006.


Appendices

Appendix 1: Additional Results

                             Validation                                     Out of Time
Model Trial                  Lift Top Decile   % Captured in Top 3 Deciles  Lift Top Decile   % Captured in Top 3 Deciles
HELOC 1                      4.25              67.04                        1.90              46.78
HELOC 2                      3.56              59.57                        1.86              45.99
HELOC 3                      4.09              66.00                        1.84              47.09
SA 1                         3.89              67.76                        3.86              67.82
SA 2                         3.80              67.28                        3.86              67.88
SA 3                         3.91              68.10                        3.86              67.85
HELOC - Average              3.97              64.20                        1.86              46.62
HELOC - Standard Deviation   0.29              3.30                         0.02              0.46
SA - Average                 3.87              67.71                        3.86              67.85
SA - Standard Deviation      0.05              0.34                         0.00              0.02

