Feature Engineering and Methodologies using Customer Debit and Credit Transactions for Predicting Savings Account and Home Equity Line of Credit Acquisition

By

Susanne Pyda

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science

Department of Chemical Engineering & Applied Chemistry University of Toronto

Supervisor: Professor J.C. Paradi

© Copyright Susanne Pyda 2019

Feature Engineering and Machine Learning Methodologies using Customer Debit and Credit Transactions for Predicting Savings Account and Home Equity Line of Credit Acquisition

Susanne Pyda

Master of Applied Science

Department of Chemical Engineering & Applied Chemistry

University of Toronto

2019

Abstract

The objective of this thesis is to investigate feature engineering methodologies for customer credit and debit transaction data and evaluate their utility in developing machine learning models on the Apache Spark framework for two common marketing targets: savings account acquisition and home equity line of credit acquisition.

The utility of the new features (predictor variables) is evaluated in comparison to the Bank’s previous marketing models, and in combination with an existing library of predictor variables aggregated from various databases.

The results show that there is potential to both automate the process of creating features from transaction data using word embeddings, and to use them in combination with gradient boosted decision trees to increase model performance. However, the parameters, methods, and applications tested in this thesis are not exhaustive, and should serve as an example for further exploration.

Acknowledgements

I would like to express sincere gratitude to the following persons who have made this thesis possible:

My thesis supervisor Professor Joseph C. Paradi, who gave me the opportunity to learn and pursue my interests, and provided ongoing advice, patience, and support.

My parents, who encouraged me to continue my studies and whose support made it possible.

My fellow CMTE candidates, who provided encouragement and companionship throughout the entire process.

Everyone at the Bank who provided invaluable suggestions, advice, expertise, and mentorship.


Table of Contents

Abstract
Acknowledgements
List of Tables
List of Figures
Glossary
1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Scope
2 Literature Review
2.1 Banking Product Models
2.1.1 Consumers using HELOC
2.1.2 Consumers Acquiring Savings Accounts
2.2 Feature Engineering for Transaction Data
2.2.1 Transaction Aggregation
2.2.2 Vector Representation
2.2.3 Sentence Embedding
2.3 Algorithms in Feature Engineering
2.3.0 Aggregation by Category Binning
2.3.1 Word2Vec
2.3.2 Simple Baseline proposed by Arora et al.
2.3.3 K-Means
2.4 Supervised Learning
2.4.1 Logistic Regression
2.4.2 Tree Models
3 Data
3.0 Customer Universe
3.1 Targets
3.1.1 Savings Account
3.1.2 Home Equity Line of Credit (HELOC)
3.2 Transaction Data
3.3 General Banking Features
3.4 Out of Time Customers
4 Technology
4.1 Cloudera
4.2 Apache Spark
4.2.1 Spark Machine Learning Libraries
4.3 Apache Hive
5 Methodology and Implementation
5.1 Data Cleaning
5.1.2 General Banking Features
5.1.3 Transaction Feature
5.3 Transaction Features
5.3.1 Binning
5.3.2 Word2Vec
5.3.3 Customer Vectors
5.3.4 K-Means Clustering
5.3.5 Summary of Tables
5.4 Models
5.4.1 Logistic Regression
5.4.2 Gradient Boosted Decision Tree
5.4.3 XGBoost
5.5 Model Tuning and Testing
5.5.1 Model Testing
5.5.2 Parameter Tuning with Grid Search
5.5.3 Evaluation Metrics
6 Results and Analysis
6.1 Results
6.1.1 Benchmark
6.1.2 Modelling Results
6.1.3 Testing Results
6.1.4 Average Results for Best Model
6.1.5 Additional Parameter Testing
6.1.6 Results for Additional Parameters
6.2 Analysis
6.2.1 HELOC
6.2.2 SA
7 Conclusions
7.1 Future Work
References
Appendices
Appendix 1: Additional Results

List of Tables

Table 1: Aggregating Transactions by Category
Table 2: Appending Percentile Bin to Category Sums
Table 3: Example of a List of Transactions with 4 Transaction Elements
Table 4: Example of One-Hot Encoding
Table 5: Example of Raw Credit Card Transaction Attributes (Bahnsen et al., 2016)
Table 6: Data Input Format for Word2Vec
Table 7: Sample Output of K-Means Clustering for a Customer Sample
Table 8: Modelling Feature Data Tables
Table 9: Raw Target Data Distribution for Modelling Data
Table 10: Logistic Regression Parameters
Table 11: Gradient Boosted Decision Tree Parameters
Table 12: XGBoost Parameters
Table 13: Hyperparameter Tuning Results
Table 14: HELOC XGBoost Modelling Results
Table 15: SA XGBoost Modelling Results
Table 16: HELOC Modelling Results on Testing Cohort
Table 17: SA XGBoost Modelling Results on Testing Cohort
Table 18: Repeated Training on Model No. 3
Table 19: Additional XGBoost Hyperparameters Tested
Table 20: Results for Additional Hyperparameter Testing

List of Figures

Figure 1: One-Hot Vectors Plotted in 3D Space
Figure 2: Mikolov et al. Skip-Gram Architecture, as published in "Distributed Representations of Words and Phrases and their Compositionality"
Figure 3: Example of Decision Tree and Prediction at Output Node
Figure 4: One Round of Boosting for GBDT
Figure 5: Observation and Outcome Window
Figure 6: Observation and Outcome Window for Out of Time Customer Universe
Figure 7: Spark Components (Apache Spark, Cluster Mode Overview)
Figure 8: Example of Observation and Outcome Window for Benchmark Models


Glossary

GBDT gradient boosted decision tree

HELOC Home Equity Line of Credit

SA Savings Account

MCC merchant category code

SIF smooth inverse frequency

SVD singular value decomposition


1 Introduction

1.1 Motivation

Financial institutions such as the Bank1 have always collected and stored large amounts of data. However, unprecedented growth in the volume of data, computing capabilities, and data centralization, commonly referred to as the "Big Data Revolution", not only facilitates advanced analytics but necessitates it to maintain a competitive advantage (1).

Banks have traditionally utilized their data to make decisions, including decisions related to marketing product offerings such as savings accounts, mortgages, personal loans, and lines of credit. In a competitive market, it is crucial for a Bank to foresee and meet a customer’s financial needs before a competitor does.

Predictors for these models come from several different sources – credit bureaus, personal customer data, financial holdings, branch interactions, as well as from online activity tracked through cookies. Features from debit and credit transactions are typically aggregations and aggregate statistics (e.g. spending sum per time period, standard deviation over time periods), or rule-based features (e.g. travel using tropical airlines in winter months, transactions occurring within a list of selected “luxury” retailers).

Within the volume of debit and credit transaction data the Bank holds, there exists the opportunity to draw more than heuristic insights, and to use machine learning to automate the feature extraction and engineering processes (2).

“Big Data” allows an organization to generate features (model predictors) beyond what can be manually or explicitly determined. The premise is that with enough data and advanced algorithms, the model can learn important attributes of the data on its own (2).

1 The Bank is a major Canadian bank, and the provider of the data used in this thesis. It is referred to as "the Bank" in this paper for confidentiality.

1.2 Objectives

Through the collection and storage of transaction data, the Bank has unique insight into its customer’s spending habits, lifestyle, and potential future financial decisions. The Bank seeks to optimize its use of this data to drive business value and better serve customers.

To the present day, the Bank has focused largely on banking data and rule-based transaction features. The Bank has also prototyped embedding customer transactions, adopting word2vec, a popular algorithm from Natural Language Processing (3). The objective of this thesis is to further develop feature engineering methodologies that use customer transaction patterns to predict financial and banking decisions. The utility will be tested on the prediction of new savings account and home equity line of credit (HELOC) acquisition.

The objectives of this thesis are as follows:

1) Review existing academic literature and industry practices related to:

▪ Converting financial transactions to model features.
▪ Predicting savings account and HELOC acquisition, in order to prepare an appropriate benchmark and testing environment for new features.

2) Develop and implement automated feature extraction methodologies using algorithms and techniques such as binning, clustering, and vector embedding to develop model features from customer credit and debit transactions.

3) Utilize the Apache Spark framework to construct machine learning models for the classification tasks and provide evaluation metrics to test the usefulness of the developed features and models.

1.3 Scope

The objective of this thesis is to investigate automated feature engineering methodologies for customer credit and debit transaction data and evaluate their utility in developing machine learning models for two common marketing targets: savings account acquisition and home equity line of credit acquisition.


The utility of the new features (predictor variables) is evaluated in comparison to the Bank’s previous marketing models, and in combination with the Bank’s existing library of predictor variables aggregated from various databases.

The transaction data used consists solely of the data provided by standardized transaction databases at the Bank. The data consists of a single cohort of bank activity divided into training and validation sets (training cohort). The models are further evaluated on a new cohort of bank activities set one year forward after the training cohort (testing cohort).

The observation window for transactions occurs over 6 months - Jan 1, 2015 to June 30, 2015. Other customer banking data is acquired as of June 30, 2015. The model outcome window occurs over 4 months – June 30, 2015 to October 1, 2015. This means that if the customer made the bank product acquisition within this window, the target or outcome value was positive. The testing cohort has the same structure except in 2016.

The work of this thesis was completed within the Bank’s Apache Hadoop environment, utilizing SQL in Apache Hive and Apache Spark through Python and Scala APIs. Modelling was done using Apache Spark’s native Machine Learning (ML and MLlib) libraries.

Although several of Spark ML’s native machine learning models were preliminarily tested, the open-source library XGBoost provided the most functionality and ability to tune the model, resulting in the best performance. XGBoost is an implementation of the gradient boosted decision tree algorithm (GBDT).


2 Literature Review

This chapter reviews predictors for HELOC and SA in literature, as well as feature engineering methodologies which are used for banking data. The algorithms used are outlined, as well as supervised machine learning techniques for classification.

2.1 Banking Product Models

2.1.1 Consumers using HELOC

Although details are often kept proprietary by financial institutions, variables for predicting and understanding customers' financial decisions have been studied in the literature. In 2006, Siman, Finke, and Corlija conducted a review of studies on HELOC customers, and noted that households owning more expensive homes with a high portion of equity in the home were most likely to obtain a HELOC (4). Propensity was found to be largely dependent on socioeconomic status, including factors such as income level, net worth, age and education. It is difficult, however, to use these factors to predict the exact time a customer will make the decision to obtain a HELOC. Other significant factors included the economic climate and market conditions, and a customer's sensitivity to changes in the market (4).

2.1.2 Consumers Acquiring Savings Accounts

The Bank has developed its own models for specific customer behaviours such as the acquisition of a new savings account. These models include expert opinion and an amalgamation of customer characteristics from credit bureaus, customer personal data, and recent activity with the Bank.

The objective of these models is to understand customer needs and engage effectively and efficiently in marketing related activities. Customer relationship management using machine learning has seen an increase in popularity. For example, features used to predict campaign output for a long-term deposit application include age, yearly balance, education status, salary, credit history and communication history with the bank (5). Transaction data has also been used to analyze bank customers, for example by assessing credit card repayment behaviour (6).


2.2 Feature Engineering for Transaction Data

2.2.1 Transaction Aggregation

Transaction data offers valuable information for financial institutions, and has been well studied, most significantly in the context of credit card fraud detection, motivated by the billions of dollars lost annually to credit card fraud.

Whitrow et al. observe that (in a fraud detection context) using an entire series of transactions is not practical due to high dimensionality as well as heterogeneity of the transactions, and present a case for aggregation strategies (7). Transaction-level features as described by Whitrow et al. refer to rule-based indicators of fraud, such as a large purchase from a certain merchant category (7). Customer behaviour can be better represented by aggregating some patterns over time. By aggregating transactions over some of the features shown in Table 5, and calculating statistics, Whitrow et al. were able to show that new aggregate transactions could be useful in many situations (7). Building on this concept, Bahnsen et al. analyze periodic behaviour by aggregating using the von Mises distribution (8). They increase the feature space by aggregating on different factors (e.g. country and merchant group). The length of time to aggregate on has been discussed by both Whitrow et al. and Bahnsen et al., emphasizing that the marginal value of an additional transaction added to the aggregation diminishes with a larger time window (7). Bahnsen et al. recommend a 7-day aggregation period (8). However, it should be emphasized that feature engineering for fraud detection is a unique application, because any derived features must allow for immediate classification of a transaction, rather than for the long-term changes which are studied in this thesis.

Aggregation and rule-based raw transaction level methods for extracting useful features both require significant manual effort and expert knowledge, making the process time consuming, difficult to update and prone to human error (2). Dayioglugil and Akgul suggest that “automatically discovering hidden patterns in observed customer data is essential for several financial tasks such as fraud detection, new product offers, and customer behaviour analysis” and raw customer transaction data lends itself to this task.


2.2.2 Vector Representation

Baldassini and Serrano aim to automate the process of representing customer behaviour using marginalized stacked denoising autoencoders on current account transaction data. Through this process, clients are embedded by vectors which can be used for segmentation, profiling and targeting. To accomplish this task, techniques introduced for Natural Language Processing (NLP) tasks have been applied. Continuous vector representation of words in the field of NLP has had a significant impact on unsupervised capturing of syntactic and semantic relationships between words, phrases, and documents (9). This algorithm, known as word2vec, is further described in 2.3.1.

Transactions in this approach are analogous to words – although they hold individual definition, their order, context and combination are imperative to meaning. Clients are treated as a sequence of words, or a document (2). Baldassini and Serrano first aggregate each client's transactions into a vector representing the transaction sum in each expense category. Credit cards provide a standardized merchant category code list and this is the code provided in transaction data (10).

This approach is described stepwise below, based on the description and figures provided by Baldassini and Serrano in “client2vec: Towards Systematic Baselines for Banking Applications” (2018).

Step 1: For each client, sum the transaction amount over some time period (e.g. 1 year) in each spending category. A fabricated example is shown below.

Table 1: Aggregating Transactions by Category

         | Category 1 (Grocery Stores) | Category 2 (Automobile Supply Stores) | Category 3 (Drug Stores and Pharmacies) | ... | Category N
Client 1 | 34020.37                    | 66.22                                 | 8891.54                                 | ... |
Client 2 | 500.30                      | 66587.44                              | 687.24                                  | ... |

Step 2: To convert the sums in Step 1 to word-like entities, the number is first converted to a percentile range within that category. Therefore, if Client 2's spending in Category 1 falls within the 10-20th percentile in that category compared to all of the customers, 500.30 is changed to 10_20. To make this unique to Category 1, "CAT1" is appended to this entity. Client 2's spending in Category 1 (500.30) is converted to CAT1:10_20.

Step 3: Replace the transactions in the raw dataset from Step 1 with “the label of the bin they fall inside (…) yielding a finite set of repeating words that depend on the nature of the transaction (9).”

Table 2 shows what Client 1’s transactions may look like (bins here are arbitrary for demonstration purposes):

Table 2: Appending Percentile Bin to Category Sums

         | Category 1 | Category 2 | Category 3 | ... | Category N
Client 1 | CAT1:50_60 | CAT2:0_10  | CAT3:70_80 | ... | CATN:80_90

The sentence representation of each customer would be the combination of these entities such as Client 1: CAT1:50_60, CAT2:0_10 […] CATN:80_90.
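The three steps above can be sketched in a few lines of pandas. The DataFrame, bin width, and label format below are illustrative assumptions for this example, not the authors' implementation.

```python
import pandas as pd

# Hypothetical category sums per client (Step 1); values echo Table 1.
sums = pd.DataFrame(
    {"CAT1": [34020.37, 500.30], "CAT2": [66.22, 66587.44], "CAT3": [8891.54, 687.24]},
    index=["Client 1", "Client 2"],
)

def to_bin_labels(col):
    # Percentile rank of each client within the category, mapped to a 10-point bin label.
    pct = col.rank(pct=True) * 100
    lo = (pct // 10 * 10).clip(upper=90).astype(int)
    return col.name + ":" + lo.astype(str) + "_" + (lo + 10).astype(str)

tokens = sums.apply(to_bin_labels)       # Steps 2-3: one "word" per category
sentences = tokens.apply(list, axis=1)   # one "sentence" of bin labels per client
print(sentences["Client 1"])
```

The resulting lists of repeating bin labels are what make encoding methods from NLP applicable to the transaction data.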

This vocabulary allowed for feature engineering using encoding methods such as word2vec and autoencoders, demonstrating effectiveness at embedding customer information (9).

Continuous embedding of bank transactions was explored by Dayioglugil and Akgul at the 2017 International Symposium on Methodologies for Intelligent Systems. The objective of this paper was to predict customer behaviour without any manual intervention (2). As one of the first studies to overcome the high dimensionality of transaction data using continuous word embedding methodologies developed in NLP, the results encourage further study.

Dayioglugil and Akgul recognize the resemblance of a customer's chronological transaction history to a sentence. In their work, each transaction is defined by 10 categorized elements, such as age, business segment, education level, marital status, and transaction process bank group. These elements serve as words, and form a grouping comparable to a sentence in natural language. Word2vec is used to create element embeddings, which are agglomerated into customer vectors by concatenating all of the transaction element vectors (9).

An example of what a chronological list of transactions (all customers grouped together) may look like is shown in Table 3.


Table 3: Example of a List of Transactions with 4 Transaction Elements

Age | Amount ($)      | Gender | Transaction Process Group Code
19  | Amnt1 (<100)    | M      | 123
57  | Amnt2 (200-400) | F      | 456

Notes: Here 4 transaction elements are shown, whereas Dayioglugil and Akgul used 10.

The elements are treated as words, and the transactions consisting of 10 elements as sentences. The example in Table 3 would be concatenated into a document as [19 Amnt1 M 123. 57 Amnt2 F 456.]. Each "word" could then be given a vector representation using the word2vec algorithm described in 2.3.1 (2).

2.2.3 Sentence Embedding

The ability to effectively embed words into dense vectors has consequently sparked interest in embedding groups of words - from sentences to paragraphs and entire documents – while retaining some form of meaning (11). To further study transaction embedding, the grouping of transaction vectors into customer vectors must be considered.

Concatenation, as performed by Dayioglugil and Akgul, is an option for a sentence size of 10; however, for longer sentence or document representations the resulting dimensionality would be high. Baldassini and Serrano discuss methods such as max-pooling, mean-pooling, and Vector of Locally Aggregated Descriptors (VLAD). VLAD encoding involves computing a cluster centroid for word vectors and concatenating the distance between each word vector and its centroid for all words in a sentence (8). Other studies have considered methods from simple vector addition to neural networks – such methods typically being tailored to the domain (11). Wieting et al. propose a universally applicable sentence embedding methodology using neural networks trained on phrase pairs from the Paraphrase Database (11).

Acknowledging the objectives of Wieting et al., Arora, Liang, and Ma propose "A Simple But Tough-To-Beat Baseline for Sentence Embeddings," which involves taking a weighted average of the word vectors and modifying it using principal components. This methodology is discussed further in 2.3.2.


2.3 Algorithms in Feature Engineering

This section provides information on the methodologies and algorithms discussed in 2.2 which have been implemented in this thesis.

2.3.0 Aggregation by Category Binning

Aggregation by binning, or some form of summation, are well-researched methods for extracting features from transaction data as discussed in 2.2.1. Because the data contains a merchant category code, an intuitive approach is to sum the total spending per customer, per time period in each of the categories. Although there are almost 1000 category codes, some categories offer little granularity – categories include "grocery stores, supermarkets" and “men's and women's clothing stores". Nevertheless, total spending per category can offer insights into a customer's general profile and this simple method can effectively serve as a benchmark for more complex methodologies.

Another binning approach which may have been considered is binning by the full transaction text. This is the merchant text descriptor determined at the transaction point. This transaction text is not standardized and therefore the number of distinct transaction texts is very high. It was determined that even if the text was truncated or cleaned, this representation would be too sparse to be meaningful.

2.3.1 Word2Vec

As discussed in 2.2.2, the word2vec algorithm has had a significant impact on Natural Language Processing. Its effectiveness for other applications has been recognized by drawing parallels between unstructured text data and data sources such as transaction data.

The algorithm was popularized by Mikolov et al., in the publications "Distributed Representations of Words and Phrases and their Compositionality" (3) and “Efficient Estimation of Word Representations in Vector Space” (12).

The algorithm addresses the NLP challenge of creating a useful representation of words, which can be used in machine learning models. Most preceding techniques, such as the N-gram model, create predictor variables by extracting the count of words or n-grams (sequences of words). This is equivalent to binning (or aggregating by count) the document by unique n-gram. The representation of a single word is a one-hot vector having the size of the vocabulary, making all vectors orthogonal to each other – the cosine similarity between words is therefore zero and word similarity is not captured (12).

One-hot values can only take the values of 0 and 1. For example, in the document [one two three one] there are 3 distinct words in the vocabulary [one two three]. The size of the vector is equal to the size of the vocabulary, 3. The value is 1 if the given word is equal to the word at that index in the vocabulary (order is arbitrary) or 0 otherwise. Each word in the vocabulary could be one-hot encoded as:

Table 4: Example of One-Hot Encoding

vocabulary word | one | two | three | vector
one             | 1   | 0   | 0     | [1,0,0]
two             | 0   | 1   | 0     | [0,1,0]
three           | 0   | 0   | 1     | [0,0,1]
one             | 1   | 0   | 0     | [1,0,0]
SUM             | 2   | 1   | 1     | [2,1,1]

These vectors can be plotted in 3D Space:

Figure 1: One-Hot Vectors Plotted in 3D Space (plotted using "3d-vector-plotter" by Academo.org (13))


It can be seen that the vectors are orthogonal – perpendicular with a zero dot product. The resulting vector representation of the document using the n-gram model would be the summation of the vectors as shown in Table 4.

Mikolov et al. further emphasize that the treatment of words as distinct units does not extract information related to syntactic and semantic similarity (12). Although there are several variations of the word2vec algorithm, the skip-gram method (used within Apache Spark's MLlib library) will be discussed here. The skip-gram model inputs each word into a log-linear classifier, with the target being words within a window of the input word.

For example, in Figure 2, the input w(t) is a word at position t. In a document consisting of words [one two three four five], a window size of one can be selected. The words within one word of “three” are “two” and “four”. Therefore, one training example would be “three” as w(t) and “four” as w(t+1). However, words cannot be used as data points so they are one-hot encoded as shown in Table 4.

Figure 2: Mikolov et al. Skip-Gram Architecture, as published in “Distributed Representations of Words and Phrases and their Compositionality”
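The construction of training pairs and one-hot inputs for this example can be illustrated with a minimal Python sketch (Spark's Word2Vec builds these internally; the code below is purely for illustration).

```python
# Document ["one", "two", "three", "four", "five"], window size k = 1.
doc = ["one", "two", "three", "four", "five"]
k = 1

pairs = []
for t, w in enumerate(doc):
    for j in range(-k, k + 1):
        if j != 0 and 0 <= t + j < len(doc):
            pairs.append((w, doc[t + j]))      # (w(t), w(t+j)) training example
# pairs contains ("three", "two") and ("three", "four"), as described in the text.

vocab = sorted(set(doc))                        # arbitrary but fixed ordering
one_hot = {w: [1 if w == v else 0 for v in vocab] for w in vocab}
print(pairs)
print(one_hot["three"])
```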

Per the Apache Spark documentation, given a sequence of training words (w1,w2 … wT), the objective of the skip-gram model is to maximize the average log-likelihood (where k is the window size) (14).


$$\frac{1}{T}\sum_{t=1}^{T}\sum_{j=-k}^{k}\log p\left(w_{t+j}\mid w_{t}\right)$$

Objective Function Word2Vec (14)

The projection layer creates a dense embedding for each word, and has proven to capture context and similarity – embeddings for words with contextual similarity have a smaller cosine distance (12).

As discussed in 2.2.3, aggregating the word vectors to represent sentences is challenging. Arora et al. propose a "Simple but Tough-to-Beat Baseline" to improve sentence embeddings.

2.3.2 Simple Baseline proposed by Arora et al.

The methodology proposed by Arora et al. has several advantages (15).

1) It is unsupervised and does not require a labelled dataset such as the Paraphrase Database used by Wieting et al. (11).
2) It is relatively simple compared to sophisticated models used by neural networks.
3) It shows improvements of 10-30% in textual similarity tasks.
4) It offers theoretical justification for its reweighting method (SIF).

The methodology is comprised of two steps – taking the weighted average of the word vectors in the sentence, followed by removing the projection of the average vectors on their first principal component (15).

The weighted average, here referred to as the smooth inverse frequency (SIF), is calculated using the weight

$$\frac{a}{a + p(w)}$$

where a is a parameter and p(w) is the word frequency. It is noted that SIF is closely related to TF-IDF (Term Frequency-Inverse Document Frequency), a common reweighting method in Information Retrieval. TF-IDF weighs a word with inverse proportionality to its count in the document – accounting for the fact that very commonly used words may add less additional information.


The second step proposed is removing the first (or first few) principal components, as calculated using Singular Value Decomposition (SVD). This method is often referred to as "common component removal" (15).

The singular value decomposition of a matrix A is the factorization of A into the product of three matrices, $A = UDV^T$, where U and V are orthogonal matrices whose columns are orthonormal (orthogonal unit) singular vectors, and the matrix D is diagonal with positive real entries (16).

2.3.3 K-Means

The K-Means algorithm is an unsupervised algorithm used to identify a predetermined number of groups (clusters) in a multidimensional space. Each data point is assigned to the cluster whose centroid has the minimum Euclidean distance to that point. First, cluster centers are randomly assigned, and all data points are assigned to the closest cluster. Next, each cluster center is recalculated as the centroid of the data points assigned to it. The process is repeated until the cluster centers no longer change.

In this section, various methods for creating features from transactions have been discussed. Once the features are developed, their utility is tested in supervised machine learning algorithms.

2.4 Supervised Learning

To evaluate the usefulness and predictive value of the features developed, they will be tested to predict the two targets discussed: HELOC and SA. The supervised machine learning classification algorithms used are logistic regression and gradient boosted decision trees.

2.4.1 Logistic Regression

Logistic regression is a traditional classification model, with many applications at the Bank. Built upon concepts from linear regression, it applies the sigmoid nonlinearity to the output of a linear regression.

The sigmoid function is defined as:

$$\sigma(y) = \frac{1}{1 + e^{-y}}$$


The general form of logistic regression is therefore:

$$y(x) = \sigma\left(w^{T}x + b\right)$$

The sigmoid function smooths the output to a value between 0 and 1, representing the probability, P(y=1).
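As a minimal illustration of these two equations (with made-up weights, not the Bank's model):

```python
import numpy as np

def sigmoid(y):
    # Smooths any real value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-y))

def predict_proba(x, w, b):
    # General form of logistic regression: P(y = 1 | x) = sigma(w^T x + b).
    return sigmoid(np.dot(w, x) + b)

# Illustrative weights: a point on the decision boundary gives probability 0.5.
print(predict_proba(np.array([1.0, 2.0]), np.array([0.5, -0.25]), 0.0))   # 0.5
```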

2.4.2 Tree Models

A classification model can also be created using a decision tree. A decision tree model uses a tree structure, where at each node a given data point is tested on an attribute, and moves on to the subsequent branch depending on the outcome. The final leaf node decides the class. When the tree is being built, the algorithm decides which attribute to split on using information gain, as in the ID3 algorithm (17).

Decision Tree

A decision tree branches data at the root incrementally based on internal decision nodes. At the node, the data from the previous branch is tested against an attribute – and split into a branch depending on the outcome. After the final split, the terminal (or leaf) node decides on the outcome.

The algorithm behind decision trees, ID3 (Iterative Dichotomiser), was developed by J. Ross Quinlan (17). The algorithm splits on the attribute with the highest information gain, minimizing the number of splits required to classify the remaining data. Entropy represents the amount of information required to classify a point, and information gain is the difference in entropy before and after a split (17).

Decision trees are easy to interpret because of the attribute tests at each split. They inherently select features which provide the most information and can model nonlinearity.

Random Forest

A random forest model trains a collection ("forest") of decision trees, each on a random sample of attributes and data points from the original data set. The trained trees collectively vote on the classification (17). They have the benefit of preventing the overfitting that may occur when training a single tree.

Gradient Boosted Decision Tree

Gradient Boosted Decision Trees apply boosting to an ensemble of decision trees. Freund and Schapire are considered the first to develop Adaptive Boosting (AdaBoost) – based on creating an accurate prediction by combining a set of weak predictors. In 1999, Friedman improved upon the algorithm with gradient boosting, the concept of fitting the subsequent weak learner on the errors (residuals) of the previous learner at each stage (18).

The algorithm has been adapted and scaled in modern machine learning libraries. In this thesis, Spark's GBTClassifier implementation was tested. It supports both continuous and categorical data, and does not require feature scaling. Tree boosting has been shown to produce state-of-the-art results in many machine learning tasks (19), and was therefore chosen as a benchmark for initial comparisons of results, and in the final modelling.

An overview of the implementation of gradient boosted decision trees for binary classification (labelled 1 and 0) is described below (20).

1. Begin with an initial model F_0. For the binary classification task, this is initialized by finding the log odds of the positive class in the training data. This is converted to a probability using the sigmoid function. The initial predicted probability (e.g. of class 1) for each data point is this probability.

This allows for the calculation of residuals:

Residual (R) = true class (0 or 1) – predicted probability (probability that y = 1)

For example, if a data point belonged to class 1 and the predicted probability of class 1 was 0.6, the residual would be 1 – 0.6 = 0.4. The intermediate trees for boosting are regression trees fit to the residuals.

2. An initial weak learner (h_m) is trained on the data to make predictions for the residuals. A weak learner here refers to a simple model that likely performs poorly on its own. In a standard decision tree regressor, the predicted value at a node is the average of the data points in the node. However, for gradient boosted trees the result must be in terms of log odds so that it can be converted to a probability (P). The calculation for the predicted residual (output) at each node (R_{i+1}) is shown in Figure 3. The rectangular nodes represent the output nodes, at which predictions are made.

Figure 3: Example of Decision Tree and Prediction at Output Node

4. Next, the model is updated with the first regression tree. The new model $F_{m+1}$ is the sum of the initial model and the first weak learner scaled by a learning rate $\gamma_{m+1}$:

$$F_{m+1}(x) = F_{m}(x) + \gamma_{m+1}\,h_{m+1}(x)$$

Figure 4: One Round of Boosting for GBDT

5. The prediction $F_{m+1}(x)$ is now in terms of log odds. It can again be converted to a probability using the sigmoid function, the residuals can be calculated, and another weak learner can be built.


6. This process is continued until the predetermined number of weak learners is reached. The final prediction is the summation of all the weak learners:

$$F(x) = \sum_{i}\gamma_{i}\,h_{i}(x)$$

This is converted to a probability using the sigmoid function, and a prediction can be made based on the predetermined threshold (e.g. if P(y=1) > 0.5, predict 1, else 0).
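To make the boosting loop concrete, a minimal sketch is given below, using numpy and a scikit-learn regression tree as the weak learner. It is purely illustrative and is not the Spark GBTClassifier or XGBoost implementation used in this thesis.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_binary(X, y, n_trees=10, lr=0.1, max_depth=2):
    """Minimal gradient boosting sketch for binary labels y in {0, 1};
    returns fitted probabilities on the training data X."""
    p0 = y.mean()
    F = np.full(len(y), np.log(p0 / (1.0 - p0)))          # step 1: initial model = log odds
    for _ in range(n_trees):
        p = 1.0 / (1.0 + np.exp(-F))                      # convert log odds to probabilities
        residual = y - p                                  # pseudo-residuals
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        leaf = tree.apply(X)                              # output node of each sample
        # leaf output in log-odds units: sum(residuals) / sum(p * (1 - p))
        gamma = {l: residual[leaf == l].sum() / (p[leaf == l] * (1 - p[leaf == l])).sum()
                 for l in np.unique(leaf)}
        F = F + lr * np.array([gamma[l] for l in leaf])   # update the model
    return 1.0 / (1.0 + np.exp(-F))                       # final predicted probabilities
```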


3 Data

3.0 Customer Universe

Before conducting any specific data selection, a “customer universe” was defined from the Bank’s existing customer base. This is defined as all of the customers from which a modelling cohort may be reasonably selected. This was done using customer universe queries provided by the Bank. Some of the main customers which were removed included:

▪ commercial customers
▪ deceased customers
▪ customers with joint accounts

Because this study has focused on transaction data, only customers actively using a debit or credit card are considered. The threshold for active use was set at a total of 50 transactions over 6 months. Although this does not capture the entire customer base, it covers a large portion of the customer universe and therefore affects overall performance.

3.1 Targets

Predicting the behaviour of customers individually and in aggregate is fundamental to the Bank’s operations and competitive advantage. Almost every line of business within the Bank could benefit from predicting future activity.

In order to evaluate and quantify the predictive value of features (predictor variables), two target variables (dependent variables) have been identified within the marketing group at the Bank. These targets are: acquisition of a new savings account and acquisition of a home equity line of credit.

3.1.1 Savings Account

The opening of a new savings account is a long-established model, applicable for testing the usefulness of new features from transaction data. A savings account allows a customer to easily access and earn interest on funds. Properties of the account vary across institutions, including interest rate, number of allowable or free transactions, fees, and minimum balance, among other features. Savings accounts are not typically used for regular expenses like a checking account, but offer the liquidity to cover any larger purchases.

Although the Bank offers a selection of savings accounts, a specific account type was selected for the model, offering basic features. Details of the account are proprietary to the Bank.

The modelling universe consists of customers who do not currently hold any savings accounts with the Bank, but hold an active credit or debit account with the Bank, and have at least 6 months of transaction history.

The objective is to determine if features engineered from credit and debit transactions can add predictive value, additional to general banking features, in determining if a customer will open a new savings account within the given observation window.

3.1.2 Home Equity Line of Credit (HELOC)

A HELOC is a secured form of credit, secured against the equity of a person’s home. The equity in the home is used as a form of collateral. This typically allows for a lower interest rate and higher credit limit than other forms of credit.

A customer may obtain a HELOC at the time of purchasing a home or at another point in time when they would require a loan. A HELOC may be used for any purpose, such as investment, consolidating debt, or managing expenses (21). The target group in this paper is customers who currently have a mortgage with the Bank (or have been recorded to have a mortgage within the past 10 years), and do not already have a HELOC with the Bank. This is meant to identify all potential home-owners or mortgage holders who are eligible for a HELOC.

A HELOC may be acquired due to a large unforeseen expense, or a significant and planned life event requiring a large amount of funds – such as a home improvement, education, or a wedding.

The objective is to determine if a customer’s spending history is indicative of their propensity to acquire a HELOC with the bank.


3.2 Transaction Data

Credit and debit transaction data, as provided through customers' use of debit and credit cards, provides an abundant but challenging source of information. Data collection is highly regulated by International Financial Reporting Standards (IFRS). An overview of standard data collection by a credit card is shown in Table 5 below (8).

Table 5: Example of Raw Credit Card Transaction Attributes (Bahnsen et al., 2016)

Attribute Name   | Description
Transaction ID   | Transaction identification number
Time             | Date and time of transaction
Account number   | Identification number of the customer
Card number      | Identification number of the credit card
Transaction type | e.g. Internet, ATM, POS
Entry mode       | e.g. Chip and pin, magnetic strip
Amount           | Amount of the transaction
Merchant code    | Identification of the merchant type
Merchant group   | Merchant group identification
Country          | Country of transaction
Country 2        | Country of residence
Type of card     | e.g. Visa debit, Mastercard, American Express
Gender           | Gender of the card holder
Age              | Card holder age
Bank             | Issuer bank of the card

Adapted from Bahnsen et al., "Feature engineering strategies for credit card fraud detection" (8)

The Bank stores all credit and debit transactions (including interest payments and bank payments) in an account transaction table. In addition to the attributes shown in Table 5, the table contains a transaction text column. This alphanumeric column contains an identifier that the merchant determines upon implementation of the payment method. This column text is therefore not unique or standardized. Different points of sale for a single organization may have the same transaction text, or they may alter the name (e.g. truncate, include hyphens and other punctuation) or append a numeric store number or other identifier. Although this makes it difficult to accurately identify the same organizations, it adds information in terms of uniqueness of the transaction point.


The merchant category code is a standardized attribute which introduces a consistent identifier of the nature of the transaction – although the predetermined category groupings are limiting and not as granular as transaction text. Merchant categories include “clothing retail” and “groceries” for example.

The observation window for transactions occurs over 6 months, from January 1st, 2015 to June 30th, 2015. Other customer banking data is pulled as of June 30th, 2015. The model outcome window occurs over 4 months, from June 30th 2015 to October 1st 2015. This means that if the customer made the bank product acquisition within this window, the target or outcome value was positive. The testing cohort has the same structure except in 2016.

An observation window of 6 months is chosen as it captures enough individual transactions to train the word2vec algorithm with reasonable accuracy, and within the resource and storage limitations at the Bank (more data could potentially provide improved word embeddings). The observation window gives the same weight to transactions occurring 6 months before the model outcome window as transactions occurring immediately before. Additional latency was not introduced as recent transactions may hold significant information.

In reality, a latency (typically a month) should be introduced to allow for the Bank to act on new information (for example send a promotional campaign to a selected group of likely to respond customers). Various observation windows should be tested and compared depending on the target in question. However, for the purpose of creating automated and general features which could be used for several different targets, only a 6-month window is considered in this study.

Figure 5: Observation and Outcome Window (observation window: Jan 2015 – June 2015; outcome window: July 2015 – Oct 2015)

3.3 General Banking Features

The Bank collects a variety of information about customers throughout its operations. The information is stored in different databases across the Bank; however, it is regularly aggregated into a central table which can serve as a repository of features. Banking features include aggregated metrics and statistics (e.g. total spending over 6 months, change in account balance, and standard deviation of account balance). These features can be categorized as follows:

▪ Customer Personal Information: personal information about the customer such as age, gender, tenure at the bank, branch number, location region
▪ Product and Account Information: information on customers' ownership of bank products (account type, mortgage, line of credit) and detailed account information (account balances, loan amounts, monthly transaction count)
▪ Channel Information: information on a customer's usage of the Bank's service channels such as branch visits, online banking, telephone banking (frequency of branch visits, whether online banking is used)

After data cleaning and preparation, 540 of these features have been used for modelling. The features within this set have served as some of the primary predictors in the Bank’s models.

Two approaches were taken in using the general banking features. For the HELOC model, the most significant (by contribution to model lift) 16 features were selected or recreated from the general banking features. Some of these features were available “as-is” (such as customer age) while others required some minor calculation/aggregation (such as loan-to-value ratio). The specific features used are proprietary to the Bank.

The SA model utilized features that involved more complex pre-processing such as utilizing data from other bank sources and calculating cumulative statistics (such as average change in account balance). To create a simple benchmark and to test the ability of machine learning models to “learn” feature importance and combinations of features, all of the general banking features were used “as-is”.

3.4 Out of Time Customers

In order to correctly evaluate a predictive model it is not sufficient to test on the validation data removed from the training cohort. If the model is meant to serve a predictive purpose it must be applicable to future customers and should not over-fit to the training timeframe. To prevent any inconsistency due to seasonality, the "out of time" or testing customer universe is taken one year forward (2016). All models are applied to customer data in the observation window and tested on the outcome window. This time period is unseen by the original model, and will evaluate whether the model over-fits to one specific time window. Customer spending and banking characteristics can change across years, so it is important that a model is able to extract and generalize patterns effectively.

Figure 6: Observation and Outcome Window for Out of Time Customer Universe (observation window: Jan 2016 – June 2016; outcome window: July 2016 – Oct 2016)


4 Technology

The Bank utilizes Apache Hadoop for its big data ecosystem. It is an open source software library for reliable, scalable, distributed computing (22). The framework "allows for the distributed processing of large data sets across clusters of computers using simple programming models" (22). The main tools within the framework used for this research include Apache Spark and Apache Hive.

4.1 Cloudera

The Bank utilizes Cloudera services to facilitate its Enterprise Data Warehouse. Cloudera assists in centralizing various sources of data across the Bank, and integrating with Apache Hadoop libraries. It provides support with data security, governance, and availability (23).

4.2 Apache Spark

Apache Spark is “a fast and general-purpose cluster computing system” (24). Spark’s Scala and Python APIs were used in this research, although APIs in Java and R are also provided. Structured data processing was made possible through the Spark SQL module (24).

Cluster computing frameworks have been created and adopted to facilitate big data operations, allowing parallel process execution of data blocks on separate nodes, followed by collection to a single output (25). The framework allows users to write programs that are automatically parallelized and executed on a cluster of machines (25).

Spark and the Resilient Distributed Datasets (RDDs) it utilizes were created to address the fact that MapReduce was not effective at using distributed memory – in particular, storing intermediate results in memory for reuse in tasks such as iterative machine learning models (26). RDDs provide map, filter, and join operations, which function by storing the transformations used to build a dataset instead of the data itself (26).
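A minimal PySpark sketch (with toy data) of the lazy transformation/action model described above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Transformations (filter/reduceByKey) are only recorded; the collect() actions
# trigger distributed execution and return results to the driver.
rdd = spark.sparkContext.parallelize([("cust1", 25.0), ("cust2", 110.0), ("cust1", 40.0)])
large = rdd.filter(lambda kv: kv[1] > 30.0)
totals = rdd.reduceByKey(lambda a, b: a + b)
print(large.collect())
print(totals.collect())    # e.g. [('cust1', 65.0), ('cust2', 110.0)]
```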

Spark applications coordinate cluster processes on the driver program (running the main function) using SparkContext, and a cluster manager program is used to allocate resources. This allows Spark to distribute data amongst worker nodes, execute tasks, and collect the result on the driver (27).

A schematic is shown below (27):

Figure 7: Spark Components (Apache Spark, Cluster Mode Overview)

4.2.1 Spark Machine Learning Libraries

Several native Spark libraries were utilized, most significantly Spark ML/MLlib, Spark’s scalable machine learning library, which provides tools for machine learning algorithms, featurization, pipelines, persistence and utilities for data handling (28).

MLlib is Spark’s original RDD-based machine learning library. The library referred to as ML is Spark’s updated DataFrame-based iteration of machine learning tools, currently under development (28).

4.3 Apache Hive

All data used is accessed using Apache Hive. Apache Hive is a data warehouse software that facilitates reading, writing and managing large datasets residing in distributed storage using Structured Query Language (SQL) (29).

Data selection and pre-processing is done directly through Apache Hive or through Spark using the Spark SQL module, which can read and write tables from databases stored in Hive.


5 Methodology and Implementation

Chapter 5 discusses the implementation of the methodologies in Chapter 2, using the data and technology discussed in Chapters 3 and 4 respectively. All work was conducted using the Hadoop framework.

5.1 Data Cleaning

The general banking features stored in Hadoop's data warehouse, Hive, comprise terabytes of data with heterogeneous origins. This makes them susceptible to missing, noisy, and inconsistent data (17). As described in Chapter 4, transaction data is stored in a raw format, requiring pre-processing before it is usable. Data cleaning is essential for any modelling or data mining task, and it is well understood that low quality data will yield low quality results (17).

Both the transaction data and general banking features contain inconsistencies, missing data, and ambiguities which could hinder model performance, or prevent Spark’s native libraries from executing (e.g. in the case of NULL values).

The data cleaning has been processed separately for general banking features and transaction data.

5.1.2 General Banking Features

As discussed in Chapter 3, the data available on the Hadoop platform was pulled from several sources across the Bank, and across different lines of business. To develop a comprehensive cleaning process, the methodology and logic of collecting and joining this data would have to be well understood; however, that is outside the scope of this research. Additionally, as there are over 600 general banking features, analyzing all inconsistencies and quality issues is out of scope for this work, and would require deep domain knowledge. An additional objective was to automate the feature preparation process and limit the required expert input. The data cleaning methodology was based on the work of previous research on the data (30).

An overview of the data cleaning process is provided below:


1) Features related to specific product and account type are joined from several different databases to the customer in the general banking features table. If the customer does not exist in the database being joined, a missing value occurs. The missing values are imputed with 0 – indicating that the customer does not hold that product or account.

2) Features with over 80% total missing customer data points are removed.

3) Categorical personal customer information is imputed with the mode, and continuous information with the mean. This method assumes that there is no pattern to the distribution of null data, and prevents the loss of data which would occur if the data points were dropped.
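A hedged PySpark sketch of these three cleaning rules is given below; the table and column names (general_banking_features, holds_heloc, holds_savings) are hypothetical stand-ins for the Bank's proprietary schema.

```python
import pyspark.sql.functions as F
from pyspark.ml.feature import Imputer
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("general_banking_features")

# 1) Missing product/account joins imply the customer does not hold the product.
df = df.fillna(0, subset=["holds_heloc", "holds_savings"])

# 2) Drop features with more than 80% missing values.
missing = df.select([(F.count(F.when(F.col(c).isNull(), c)) / F.count(F.lit(1))).alias(c)
                     for c in df.columns]).first().asDict()
df = df.select([c for c, frac in missing.items() if frac <= 0.80])

# 3) Impute continuous features with the mean (mode imputation for categorical
#    columns would be handled separately).
numeric = [c for c, t in df.dtypes if t in ("float", "double")]
imputer = Imputer(strategy="mean", inputCols=numeric,
                  outputCols=[c + "_imputed" for c in numeric])
df = imputer.fit(df).transform(df)
```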

5.1.3 Transaction Feature

Two years of transaction data were made available for this research. Transaction history is stored in a large table, each row containing a single transaction, the associated account identifier, and other available transaction information (stored in separate columns) as outlined in 3.2.

Accounts that are shared amongst two or more customers have also been removed. This is due to the potential bias it could induce in training, and because certain attributes must apply to a single person – for example, age.

As discussed in 3.2, the transaction text is not standardized, and required pre-processing before it could be used. There are approximately 1000 unique merchant category codes, and millions of distinct transaction texts. The degree of specificity that could be informative to study is not known, so different approaches have been evaluated. Any bank transactions were removed (e.g. interest charge, transfer).

The methods tested include:

1) No revision to original transaction text.

2) Truncating transaction text to 4-6 characters and appending the merchant category code.

3) Utilizing the regular expression library in Spark to execute the following steps:


▪ remove any digits
▪ remove underscores and punctuation
▪ remove whitespace

Method 3) provided the best results and has been used in the models described in this paper.
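A sketch of method 3 using Spark's regexp_replace is shown below; the table and column names (transactions, txn_text) are assumptions for illustration.

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
txn = spark.table("transactions")            # assumed Hive table of raw transactions

cleaned = (txn
           .withColumn("txn_text", F.regexp_replace("txn_text", r"\d", ""))      # remove digits
           .withColumn("txn_text", F.regexp_replace("txn_text", r"[_\W]", ""))   # underscores and punctuation
           .withColumn("txn_text", F.regexp_replace("txn_text", r"\s", "")))     # whitespace
```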

5.3 Transaction Features

After the data had been cleaned, several methodologies were used to create features. These methodologies are discussed in the subsequent sections.

5.3.1 Binning

Binning was the simplest methodology implemented, and refers to grouping the transactions into “bins” by the merchant category code. For each customer, the total transaction amount in each merchant category code over the 6-month observation window was added, creating a vector with a length of over 1000.
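A sketch of this binning step using Spark's groupBy/pivot is shown below; the table and column names (transactions, customer_id, mcc, amount) are assumptions.

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
txn = spark.table("transactions")

binned = (txn
          .groupBy("customer_id")
          .pivot("mcc")                   # one column per merchant category code (~1000)
          .agg(F.sum("amount"))           # total spend per category over the observation window
          .fillna(0.0))                   # no spending in a category -> 0
```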

5.3.2 Word2Vec

Application of the word2vec algorithm was previously implemented in proof-of-concept models by the Bank, and in the work of prior research students (30). In this section, methods to improve its usefulness are explored.

The word2vec algorithm has several parameters which are set by the modeller (any parameters not set remain at Spark’s defaults). The parameters set in this study, as well as their description per the Spark documentation, are described below (14).

MinCount is the minimum number of times a token must appear to be included in the word2vec model’s training vocabulary. This was set to 500 because the transaction text was already cleaned as outlined in 5.1.3. The cleaning process should have decreased the number of unique transaction texts (e.g. by removing digits GROCER123 and GROCER456 would be grouped together). Therefore, considering that there are millions of transactions, 500 was a reasonable threshold value.


NumPartitions is the number of partitions (a smaller number increases accuracy). This value partitions the data to be trained separately and then merged. This value was increased from 1 to 5 because the platform could not support running on one partition.

Vector Size is the size of the output embedded vector for each word, and was set to 50. The default value is 100, but the ideal vector size depends on the data used. A small vector may decrease accuracy by embedding the data in a low-dimensional space; however a large vector requires more storage space and more resources for training and modelling (31).

Window Size is the number of words before and after a given input word to build training examples from (as explained in 2.3.1). This value was set to 5 (the default). The ideal size depends on the data used. Too large of a window could add irrelevant data points, while too small of a window could leave out important data points.

The input required for Spark’s word2vec algorithm is a RDD of string sequences (14). To obtain this format, the cleaned transaction texts for each customer were grouped into chronological lists, with a single row per customer. The entirety of customer lists then acts as a group of “documents” for input for word2vec training. The product is a word embedding for each transaction text.

An example of the input is shown in Table 6:

Table 6: Data Input Format for Word2Vec

Client# | All Transactions (ordered by timestamp) in 6-month observation window
1       | MerchantA, MerchantB, MerchantA, MerchantD, …
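A sketch of training Spark MLlib's Word2Vec with the parameters listed above is given below. The source table and column names (customer_transaction_lists, transactions) and the example token "ARTGALLERIES" are assumptions, not the Bank's actual schema.

```python
from pyspark.sql import SparkSession
from pyspark.mllib.feature import Word2Vec

spark = SparkSession.builder.getOrCreate()
# Assumed table shaped like Table 6: one array<string> of cleaned transaction texts per customer.
docs = spark.table("customer_transaction_lists").rdd.map(lambda row: row.transactions)

w2v = (Word2Vec()
       .setVectorSize(50)       # size of each output embedding
       .setWindowSize(5)        # context window around each transaction
       .setMinCount(500)        # minimum occurrences of a transaction text
       .setNumPartitions(5))    # partitions trained separately and merged
model = w2v.fit(docs)                         # one 50-dimensional vector per transaction text
print(model.findSynonyms("ARTGALLERIES", 5))  # nearest merchants by cosine similarity (hypothetical token)
```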

Two sets of vectors for embedding merchant information were created:

▪ vectors based on the mcc: the algorithm was trained directly on the merchant category code (mcc). This is equivalent to replacing the transaction text with the mcc converted to a string.
▪ vectors based on the merchant's transaction text: the algorithm was trained on the cleaned transaction text using the methods described in 5.1.3.


Embedding merchant information provides interpretable results. For example, in the case of the embeddings based on the mcc, the closest vectors based on cosine distance (interpreted as most synonymous) to "Art Galleries" were "Photofinishing and Photo Developing Laboratories," "Artist's Supply and Craft Shops" and "Tourist Attractions and Exhibits."

Once the transaction-level vectors are determined, they must be aggregated into customer-level vectors.

5.3.3 Customer Vectors

Methods for creating customer-level vectors typically involve summing the individual transaction-level vectors, with variations of scaling by the transaction amount (the value of, or amount spent on, a given transaction) and by the frequency of the merchant amongst all transactions or amongst an individual customer’s transactions.

On that basis, this thesis has explored similar methodologies as well as opportunities for improvement. The following three methodologies for customer vector aggregations were tested:

Method 1

For each customer, sum the product of each transaction’s vector and transaction amount (am).

For each transaction vector $v_n$, from $v_1$ to $v_T$, the customer vector $v_c$ is calculated as:

$$v_c = \sum_{n=1}^{T} (am_n)\,(v_n)$$

Method 2

Use the SIF method of Arora et al. (15), with the addition of scaling by transaction amount, followed by removal of the first principal component. Here $|c|$ is the number of transactions made by the customer, $a$ is a constant, and $p(n)$ is the probability of the word occurring; $p(n)$ is determined as the number of occurrences of the transaction text $n$ across all customers, divided by the total count of transactions.


The constant $a$ was tested at 0.001, per the recommendation of Arora et al. (15); other values, such as the average number of times a transaction text occurs in the dataset, were also tested but showed decreased performance.

$$v_c = \frac{1}{|c|} \sum_{n=1}^{T} \frac{(am_n)\,a}{a + p(n)}\, v_n$$

Next, for the matrix $C$, whose columns are all customer vectors $[v_{c_1}\ v_{c_2}\ \dots\ v_{c_n}]$, the component along the first singular vector $u$ is removed (15):

$$C = C - uu^{T}C$$

Method 3

Use the SIF without scaling by transaction amount, followed by removal of the first principal component.
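As a sketch of the simplest aggregation (Method 1), assuming the transaction-level embeddings have already been joined to the transactions as a hypothetical RDD of (customerId, amount, embedding):

```scala
import org.apache.spark.rdd.RDD

// Method 1: the customer vector is the element-wise sum of
// (transaction amount x transaction embedding) over the customer's transactions.
def method1(transactionVectors: RDD[(String, Double, Array[Double])]): RDD[(String, Array[Double])] =
  transactionVectors
    .map { case (customerId, amount, vec) => (customerId, vec.map(_ * amount)) }  // scale by amount
    .reduceByKey((a, b) => a.zip(b).map { case (x, y) => x + y })                  // element-wise sum
```

Method 2 would additionally weight each scaled vector by $a/(a + p(n))$, divide by the customer's transaction count, and then subtract the projection onto the first singular vector, as in the equations above.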

For binned transactions (1000+ vector length), dimensionality reduction through Principal Component Analysis was initially tested, and resulted in lower model performance. It was therefore not used in the final modelling.

5.3.4 K-Means Clustering

Bank models frequently use a select number of features based on domain knowledge and statistical feature selection (e.g. importance, p-value). These features can easily be joined from regularly updated tables at the Bank, whereas updating and storing full customer vectors on a regular basis requires significantly more computation and storage. The commonly used unsupervised algorithm K-means clustering therefore offers a simple, potentially useful predictor that can be extracted from this data (30). The output of this algorithm is a cluster identifier, which may be used as a predictor in a model even if the meaning or significance of cluster membership is not known. The algorithm was run on the customer vectors described in 5.3.3, setting the output number of clusters to 5. Although other cluster numbers (10, 100) were initially tested, 5 provided the best result. In the future, a more comprehensive search for the optimal number of clusters should be conducted.
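A minimal sketch of this step with Spark ML's KMeans, assuming a hypothetical DataFrame of customer vectors with `customer_id` and `features` (Spark ML Vector) columns:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.sql.DataFrame

// Cluster the customer vectors and keep only the cluster identifier,
// which is a much more compact feature than the full vector.
def clusterCustomers(customerVectors: DataFrame): DataFrame = {
  val kmeans = new KMeans()
    .setK(5)                        // 5 clusters gave the best result in this study
    .setFeaturesCol("features")
    .setPredictionCol("cluster_id")

  kmeans.fit(customerVectors)
    .transform(customerVectors)
    .select("customer_id", "cluster_id")
}
```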


An example of a clustering output, where the number of clusters is set to 100, is shown in Table 7 (only 10 out of 100 clusters shown).

Table 7: Sample Output of K-Means Clustering for a Customer Sample

Customer   cluster_ID
1111       10
2222       8
3333       2
4444       10


5.3.5 Summary of Tables

In 5.4, the features described in 5.3 are used in machine learning models. The table below summarizes the tables used. The corresponding model number in 5.4 is provided.

Table 8: Modelling Feature Data Tables

Model No.   Name                                                Description                                                Reference Section
1           General Banking Features                            HELOC: selected features from a benchmark marketing       3.3, 5.1.2
                                                                model (16 features); SA: all available features (540)
2a          Transaction Features 1                              Sum of a customer's transaction-level vectors scaled      3.2, 5.3.3
                                                                by the amount of the transaction (Method 1)
2b          Transaction Features 2                              Sum of a customer's transaction amounts in each mcc,      3.2, 5.3.3
                                                                scaled by the amount of the transaction (Method 1)
2c          Transaction Features 3                              Sum of a customer's transaction-level vectors based on    3.2, 5.3.3
                                                                mcc, scaled by the amount of the transaction (Method 1)
2d          Transaction Features 4                              Customer's transaction-level vectors aggregated using     3.2, 5.3.3
                                                                SIF: a) Method 2, b) Method 3
3           General Banking Features + Transaction Features 4   See above
3b          General Banking Features + Transaction Features 1   See above
4a          General Banking Features + Transaction Features 4   See above
            + Cluster ID
4b          General Banking Features + Transaction Features 2   See above
            + Cluster ID

5.4 Models

Both target variables (HELOC and SA) are highly imbalanced, meaning there are significantly more examples of the negative outcome (the customer is not a responder and did not acquire the banking product within the observation window) than of the positive outcome.

Using this data directly for modelling would cause any default model to skew its predictions towards the majority class, as this would minimize the model’s loss function (error) and maximize accuracy. However, the objective of these models is to identify the minority of positive outcomes.

Two approaches were explored to mitigate this error:

1) The negative class could be down-sampled and the positive class over-sampled. This involves a combination of replicating the positive data points in the dataset and taking a sample of the negative data points until the desired ratio is achieved (for example, a 1:1 ratio of positive to negative examples). The downside of down-sampling is that it decreases the overall number of examples the model has available to learn from. The downside of over-sampling is that the model may over-fit to the particular examples that appear in the training data and fail to generalize to new data.

2) The loss function of the model could be scaled so that positive examples are given more weight and therefore the model does not skew towards predicting the majority class in order to minimize the loss function.

The loss function (F) for machine learning models is a function of the prediction ($\hat{y}_i$) and the true value ($y_i$), summed across all examples ($n$). An example of F could be the squared error used in linear regression.

$$\text{Total Loss}(x) = (\text{Scaling Factor}) \sum_{i=1}^{n} F(\hat{y}_i, y_i)$$

Scaled loss function

The distribution of the data (customers chosen for study) is shown in Table 9.

Table 9: Raw Target Data Distribution for Modelling Data

                              HELOC     SA
Number of Customers           925464    9507974
Positive Outcome Percentage   <1%       2-3%


The native libraries provided by the Spark ML version used do not offer built-in functionality to scale the loss function. Therefore, for the initial modelling, approach 1) was used to create a 1:1 ratio of positive to negative examples. Other ratios were also tested; however, a skewed (not 1:1) ratio resulted in significant model skew towards the majority (negative) class.
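A minimal sketch of approach 1), assuming a hypothetical `label` column (1 = responder, 0 = non-responder); the replication count and sampling fraction would be chosen to reach the desired ratio:

```scala
import org.apache.spark.sql.DataFrame

// Over-sample positives by replication and down-sample negatives by random sampling.
def rebalance(data: DataFrame, positiveCopies: Int, negativeFraction: Double): DataFrame = {
  val positives = data.filter("label = 1")
  val negatives = data.filter("label = 0")
    .sample(withReplacement = false, negativeFraction, seed = 42L)

  val oversampled = Seq.fill(positiveCopies)(positives).reduce(_ union _)
  oversampled.union(negatives)
}
```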

5.4.1 Logistic Regression

Spark’s implementation of Logistic Regression was tested with default parameters.

Table 10: Logistic Regression Parameters

Parameter        Description                         Value
regParam         Regularization parameter            0
standardization  Training feature standardization    True
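A minimal sketch of this configuration, assuming a hypothetical training DataFrame with `features` and `label` columns:

```scala
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.sql.DataFrame

// Logistic regression with the settings of Table 10 (no regularization,
// feature standardization enabled).
def fitLogistic(train: DataFrame): LogisticRegressionModel =
  new LogisticRegression()
    .setRegParam(0.0)
    .setStandardization(true)
    .fit(train)
```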

5.4.2 Gradient Boosted Decision Tree

Spark’s implementation of the Gradient Boosted Decision Tree was tested with default parameters, and significantly outperformed logistic regression.

Table 11: Gradient Boosted Decision Tree Parameters

Parameter    Description                                                   Value
maxDepth     Maximum depth of the tree                                     None
minInfoGain  Minimum information gain for a split to be considered at a   0
             tree node
numTrees     Number of trees built                                         20
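A minimal sketch of this configuration, again assuming hypothetical `features` and `label` columns; the numTrees of Table 11 corresponds to maxIter (the number of boosting iterations) in Spark ML's GBTClassifier, and maxDepth is left at the library default:

```scala
import org.apache.spark.ml.classification.{GBTClassifier, GBTClassificationModel}
import org.apache.spark.sql.DataFrame

// Gradient boosted trees with the settings of Table 11.
def fitGBT(train: DataFrame): GBTClassificationModel =
  new GBTClassifier()
    .setMaxIter(20)        // 20 trees
    .setMinInfoGain(0.0)   // minimum gain required for a split
    .fit(train)
```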

5.4.3. XGBoost

Although Spark ML’s native Gradient Boosted Decision Tree classifier consistently outperformed logistic regression, the open-source XGBoost library offers additional functionality and efficiency on distributed platforms such as Hadoop (19). The back-end algorithms developed by Tianqi Chen and Carlos Guestrin allow for faster training and efficient scalability (19). Although the framework has been developed for several platforms, the Spark-Scala API was used. Table 12 outlines the hyperparameters used.

One of the main advantages of the model was the ability to set the parameter scale_pos_weight. The recommended value of $\frac{\text{sum of negative instances}}{\text{sum of positive instances}}$ was used, which allowed more negative examples in the training data to be retained. The scaling occurs when the model loss is calculated and minimized: for positive examples, the log loss is scaled by the scale_pos_weight factor. Additional hyperparameters are shown in Table 12.

$$\text{Log Loss} = -\frac{1}{n}\sum_{i=1}^{n} (\mathit{scale\_pos\_weight})\left[\, y_i \log(p(y_i)) + (1-y_i)\log(1-p(y_i)) \,\right]$$

Example: log loss function scaled by scale_pos_weight for each data point

Table 12: XGBoost Parameters

Parameter         Description                                                          Default      SA               HELOC
eta               Learning rate: step size shrinkage used in update                    0.3          0.2              0.1
max_depth         Maximum depth of a tree                                              6            6                15
alpha             L1 regularization term on weights                                    0            0.001            0.001
scale_pos_weight  Controls the balance of positive and negative weights,               1            5                20
                  useful for unbalanced classes
subsample         Subsample ratio of the training instances per boosting iteration     1            0.25             0.75
objective         Learning objective                                                   reg:linear   binary:logistic  binary:logistic
num_round         The number of rounds for boosting (number of trees)                  -            30               100

Parameter descriptions and default values are found in the XGBoost documentation (32).
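As a rough sketch of this configuration using the HELOC column of Table 12 (the exact class and constructor vary between XGBoost4J-Spark releases, so this assumes the XGBoostClassifier interface and hypothetical `features`/`label` column names):

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
import org.apache.spark.sql.DataFrame

// XGBoost on Spark with the HELOC hyperparameters from Table 12.
def fitXGBoostHeloc(train: DataFrame) = {
  val params = Map(
    "eta"              -> 0.1,
    "max_depth"        -> 15,
    "alpha"            -> 0.001,
    "scale_pos_weight" -> 20.0,
    "subsample"        -> 0.75,
    "objective"        -> "binary:logistic",
    "num_round"        -> 100
  )
  new XGBoostClassifier(params)
    .setFeaturesCol("features")
    .setLabelCol("label")
    .fit(train)
}
```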

Because the data was highly skewed, it was down-sampled prior to scaling the positive weight. Although this resulted in a smaller dataset with fewer negative examples, the alternative of applying a very high scaling factor to positive examples may have resulted in overfitting to the specific positive examples seen in the training set.

The HELOC data contained fewer positive examples and less data, so the down-sample ratio was set to allow for more negative examples and hence more data to be used.


The process is outlined below:

1) Positive examples are triplicated.
2) Data is down-sampled to a 20:1 negative:positive ratio for HELOC and 5:1 for SA models.
3) The scale_pos_weight factor is set to 20 for HELOC and 5 for SA.

The final result of this process is that positive and negative examples have the same impact on model loss and hence on model training.

5.5 Model Tuning and Testing

5.5.1 Model Testing

Machine learning typically involves three sets of data:

▪ Training Set: The input data to the model, used in calculating weights or optimal parameters (such as splits for a decision tree).
▪ Validation Set: A sample of data set aside during training (not used in model training). Once the model is trained, its performance is evaluated on the validation data. The model is then adjusted accordingly and tested again.
▪ Testing Set: A set of data previously not seen by the model. In a predictive model this would represent new data – for example a new cohort of customers – that the model could be used on.

The training cohort is divided into training (90% of data) and validation (10% of data) sets. The splits are done randomly, and the best performing models are additionally tested by splitting the data 3 different times and then evaluating performance on the validation set to ensure the model is consistent and robust.
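A minimal sketch of this split, assuming `cohort` is the training-cohort DataFrame; repeating the call with different seeds gives the robustness check described above:

```scala
import org.apache.spark.sql.DataFrame

// 90/10 random split into training and validation sets.
def splitCohort(cohort: DataFrame, seed: Long): (DataFrame, DataFrame) = {
  val Array(training, validation) = cohort.randomSplit(Array(0.9, 0.1), seed)
  (training, validation)
}
```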

The entire testing cohort is used for evaluating the performance of the best models. This process is meant to mimic how the model would perform in practice if used in the future, when the targets are not yet known.


5.5.2 Parameter Tuning with Grid Search

Although most machine learning packages provide default parameters, which may perform well across a large range of datasets, selecting appropriate hyperparameters for a model is necessary to achieve optimal performance (as shown with the scale_pos_weight parameter). The default parameters for XGBoost are provided in Table 12.

Spark ML provides built-in functionality for hyperparameter tuning with cross-validation using grid search. K-fold cross validation involves partitioning the data into k sets (folds). The model is trained on all but one of the sets, and then tested on the left-out set. This is repeated until each fold has a turn being “left-out”. The average result and standard deviation across folds can be analyzed. This is done to ensure that the model is not overfitting to a particular set of data, and can generalize to new data.

Grid search involves specifying a list of values for each parameter to be tested. The model is then trained on each possible combination of parameters, and returns the best combination based on a specified metric. Each combination of parameters could be tested using cross validation.

This process involves re-training the model several times and is therefore resource intensive. The more folds are used, the more times the model must be trained. However, using few folds involves using smaller subsets of the data in each training iteration. In addition, as Chapter 6 shows, overfitting to a specific time period (training cohort) is more significant than potential overfitting to a subset of the data.
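A minimal sketch of the pattern in Spark ML, illustrated with the native GBTClassifier (the same structure applies to any estimator exposing Spark ML Params); the grid values mirror the HELOC row of Table 13 below:

```scala
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val gbt = new GBTClassifier()

val paramGrid = new ParamGridBuilder()
  .addGrid(gbt.maxDepth, Array(5, 10, 15))
  .addGrid(gbt.maxIter, Array(10, 25, 100))   // boosting rounds
  .build()

val cv = new CrossValidator()
  .setEstimator(gbt)
  .setEvaluator(new BinaryClassificationEvaluator())  // areaUnderROC by default
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)    // each combination: train on 2/3 of the data, validate on 1/3

// val cvModel = cv.fit(trainingSet)   // trainingSet: the training cohort DataFrame
```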

The maximum depth of each tree, and the number of boosting rounds (number of consecutive trees) was tested using cross validation. The results are shown below:

Table 13: Hyperparameter Tuning Results

            num_round                       max_depth
Model       Tested         Best Result      Tested        Best Result
HELOC       10, 25, 100    100              5, 10, 15     15
SA          10, 30         30               4, 6          6


It can be observed that a higher accuracy is achieved with more boosting rounds and a larger maximum tree depth - both of which create a more complex model.

Fewer parameters were tested for the SA model because it uses significantly more data and therefore more resources. In a separate test, 100 boosting rounds (num_round) showed comparable results to 30 rounds, so it was not included in the hyperparameter tuning.

5.5.3 Evaluation Metrics

The metrics used to evaluate a model depend on the data and how the model will be used.

In order to determine the value of new features and models, the results were assessed using the metrics currently used by the Bank’s marketing group – lift and response rate. These metrics are applicable to the Bank’s marketing campaign assessment as well as to data that is highly imbalanced.

Lift and response rate compare the outcomes of a predictive model to a baseline, in this case random sampling. Lift in this application is defined as the rate of true positive outcomes (responders) in a portion of the customer universe divided by the average response rate of the entire customer universe modelled.

The models used output both a prediction (binary) and a continuous probability value. The top decile is the top 10% of customers, when ordered by descending probability of belonging to the positive class. This is an effective metric in marketing as it allows for the selection of a smaller subset of customers, who will be more likely to (positively) respond, rather than marketing to all eligible customers.

$$\text{Lift in top decile} = \frac{\%\ \text{positive outcomes in the top 10\% of predicted positive probabilities}}{\text{average \% of positive outcomes}}$$

The response rate in this study is the percentage of positive outcomes captured within the top 3 deciles (top 30%) of customers, ordered from most to least likely to be responders as predicted by the model. This demonstrates a model’s effectiveness: for example, it may recommend that a campaign be extended to only 30% of potential customers, while that subset captures 70% of the actual responders.
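A minimal sketch of computing both metrics from a scored dataset, assuming hypothetical `label` (1 = responder) and predicted positive `probability` columns:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Returns (lift in top decile, fraction of responders captured in top 3 deciles).
def liftAndCapture(scored: DataFrame): (Double, Double) = {
  val n = scored.count().toDouble
  val totalResponders = scored.filter(col("label") === 1).count().toDouble
  val ranked = scored.orderBy(col("probability").desc)

  val topDecile = ranked.limit((n * 0.1).toInt)
  val top3Deciles = ranked.limit((n * 0.3).toInt)

  val averageResponse = totalResponders / n
  val topDecileResponse = topDecile.filter(col("label") === 1).count().toDouble / topDecile.count()

  val lift = topDecileResponse / averageResponse
  val captured = top3Deciles.filter(col("label") === 1).count().toDouble / totalResponders
  (lift, captured)
}
```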


6 Results and Analysis

This chapter provides the results of the models and features described in Chapter 5, as well as an analysis of the results. The results are reported for the XGBoost model with the parameters outlined in 5.4.3, using the features described in 5.3.5.

6.1 Results

Model performance is reported in terms of the commonly used metrics at the Bank’s marketing group. Lift and (fraction) Captured in Top 3 Deciles are explained in more detail in 5.5.3.

The data and metrics reported in this section are:

1) Lift Top Decile: fraction of positive respondents in the top decile divided by fraction of positive responders in the entire dataset

Responders Top Decile: percentage of positive respondents in the top decile
Average Response: percentage of positive responders in the entire dataset

2) Captured in Top 3 Deciles: percentage of all positive responders in the dataset found in the top 3 deciles

The output of the models is the probability of a customer being a (positive) respondent. The dataset is then ordered from most likely to least likely to be a responder. This ordered set is then divided equally into ten parts, or deciles. The top decile is the decile that captures the 10% of customers predicted most likely to be a responder. The top 3 deciles capture the 30% of customers predicted most likely to be a responder.

6.1.1 Benchmark

The models tested in this chapter are based upon propensity models for HELOC and SA used at the Bank. Although there are several versions and iterations of each model, the models that were most recent and applicable to the data and objectives of this study were selected as a benchmark.


[Figure: timeline from January to December divided into an Observation Window, a Latency period, and an Outcome Window]

Figure 8: Example of Observation and Outcome Window for Benchmark Models

Figure 8 shows an example of the possible observation, latency, and outcome windows for previously tested models at the Bank. Some models use a larger observation window; however, this window is often used to calculate statistics over time, which are then summarized in month-end tables (e.g. 6-month average account balance), and the summarized data is collected as of the end of the observation window. For a model to be usable in production, a latency period must also be introduced to allow time to act on new data. The “Out-of-Time” (testing) cohort is taken in subsequent months. The details of the benchmark models and their performance are proprietary to the Bank.

6.1.2 Modelling Results

To test the performance of the features summarized in 5.3.5, the best performing XGBoost model was trained using these features, with the parameters shown in Table 12. Tables 14 and 15 show the validation results on the training cohort. In the training cohort, both the HELOC and SA models show improvement over the benchmark.

Table 14: HELOC XGBoost Modelling Results

HELOC Model No.   Lift Top Decile   % Captured in Top 3 Deciles
1                 3.88              65
2a                3.38              56
2b                2.80              42
2c                3.13              49
2d-a              3.25              53
2d-b              3.70              56
3                 4.25              67
3b                3.87              58
4a                3.70              59
4b                3.93              64


Table 15: SA XGBoost Modelling Results

SA Model No.   Lift Top Decile   % Captured in Top 3 Deciles
1              3.83              68
2a             2.69              56
2b             2.79              57
2c             2.57              54
2d-a           2.41              53
2d-b           2.41              54
3              3.89              68
3b             3.97              69

6.1.3 Testing Results

To further assess the best performing models in Tables 14 and 15, they must be tested on the “Out-of-Time” testing cohort. Although the SA performance remains consistent and shows improvement over the benchmark, the HELOC model shows a large drop in performance on the testing cohort.

Table 16: HELOC Modelling Results on Testing Cohort

HELOC Model No.   Lift Top Decile   % Captured in Top 3 Deciles
1                 1.80              46
2d-a              1.40              39
3                 1.90              47

Table 17: SA XGBoost Modelling Results on Testing Cohort

SA Model No.   Lift Top Decile   % Captured in Top 3 Deciles
1              3.86              68
2d-a           2.50              55
3              3.86              68


6.1.4 Average Results for Best Model

The best model for each target was Model No.3. The best model was re-trained 3 times to ensure that it was reproducible and robust. Each time, the random split between training and validation was different, generating a slightly different model. Each of the three models was then tested on the testing data. It can be observed that the model performance does not change significantly across each run (the standard deviation is also reported). The full results for each test are shown in Appendix 1.

Table 18: Repeated Training on Model No. 3

                             Validation                                     Out of Time
Model Trial                  Lift Top Decile   % Captured in Top 3 Deciles  Lift Top Decile   % Captured in Top 3 Deciles
HELOC - Average              3.97              64                           1.86              47
HELOC - Standard Deviation   0.29              3                            0.02              0.5
SA - Average                 3.87              68                           3.86              68
SA - Standard Deviation      0.05              0.3                          0.00              0

6.1.5 Additional Parameter Testing

To mitigate the potential overfitting in the HELOC model, additional parameters were tested. It should be noted that when parameter tuning with cross-validation is completed on the training cohort it could still result in over-fitting to the specific time period. Although a more complex model could generalize well within the cohort, it may not perform on new, unseen data. Selecting parameters to optimize testing results negates the purpose of using the testing cohort as an imitation of completely unseen data. However, to identify if overfitting was occurring, an additional HELOC model was trained using a smaller tree depth (6 instead of 15), a smaller number of boosting rounds (25 instead of 100), as well as higher regularization.

Table 19: Additional XGBoost Hyperparameters Tested

Parameter         Description                                                          Default      HELOC - new parameters   HELOC
eta               Learning rate: step size shrinkage used in update                    0.3          0.1                      0.1
max_depth         Maximum depth of a tree                                              6            6                        15
alpha             L1 regularization term on weights                                    0            0.005                    0.001
scale_pos_weight  Controls the balance of positive and negative weights,               1            20                       20
                  useful for unbalanced classes
subsample         Subsample ratio of the training instances per boosting iteration     1            0.75                     0.75
objective         Learning objective                                                   reg:linear   binary:logistic          binary:logistic
num_round         The number of rounds for boosting (number of trees)                  -            25                       100

6.1.6 Results for Additional Parameters

A model with the hyperparameters shown in Table 19 was trained and evaluated on both the validation data and the testing cohort. The results are shown in Table 20.

Table 20: Results for Additional Hyperparameter Testing

HELOC Model Trial   Lift Top Decile   % Captured in Top 3 Deciles
Validation          3.11              62
Testing             2.06              50

As shown in Table 20, when the HELOC model was simplified it showed improved performance on the testing cohort; however, this was insufficient to outperform the benchmark model.

6.2 Analysis

6.2.1 HELOC

The results in Table 14 show that using both the general banking features (16 features used in the benchmark marketing model) and the transaction features (Model No.3) provides a significant increase in lift on the training cohort validation data, in comparison to the benchmark marketing model. The lift in the top decile for this model (Model No.3) is 4.25, with 67% of the positive responders captured within the top 3 deciles, demonstrating significant (at least 10%) improvement over benchmark models at the Bank.


To further assess the various feature engineering techniques used, the XGBoost model was also trained to predict the target using only the transaction features. All methods show improvement over simply binning the transactions by merchant category code (Model No.2b), which demonstrates the utility of dense transaction embeddings using word2vec.

An improvement is observed when using the methodology proposed by Arora et al. in “A Simple but Tough-to-Beat Baseline for Sentence Embeddings” over other methods for creating customer-level vectors when transaction features are tested alone (Model No. 2d).

An improvement in model performance on the training cohort was also observed when using the marketing features together with the cluster identifier. The cluster ID was more effective when k-means was run on the simple sums per transaction category code (Model No. 4b) rather than on customer vectors built from transaction vectors (Model No. 4a). Although a possible explanation is that the increased complexity could have introduced noise, the usefulness of clustering requires more comprehensive exploration.

Although the best HELOC model showed a large improvement on the training cohort validation data, this improvement was not observed in the “Out-of-Time” testing cohort (Table 18). Both the top decile lift and the responders captured in the top 3 deciles were significantly lower, demonstrating that the model over-fit to the time period it was trained on. It can also be noted that for the benchmark model, testing occurs in subsequent months, rather than one full year forward as is done in this study.

The fraction of respondents for HELOC acquisition is smaller than for SA, and the training cohort is smaller because to be eligible for HELOC the customer must have had a mortgage with the Bank. The smaller size of the dataset may cause it to be more vulnerable to fluctuations, which are observed in the data between different cohorts.

The average response rate changed by over 50% between training and testing – a change of similar magnitude was observed in benchmark models. Based on the literature review in Chapter 2, HELOC acquisition may be more sensitive to external conditions than new savings accounts.


To prevent overfitting, the parameters of the model were adjusted in order to make the model less complex. These parameters are shown in Table 19. Although these parameters showed some improvement on the testing cohort, the final testing result was still below the benchmark.

Considering the decrease in model performance on the testing cohort, a future consideration would be to create models using several years of historical data or retraining the model with additional data collected year to year rather than a single year.

6.2.2 SA

As shown in Table 15, the SA model demonstrated the best results using both the banking features (all available features) and the transaction features (Model No. 3), although the improvement over the banking features alone is less significant. There is significant improvement in lift (at least 10%) and some (2-5%) improvement in the fraction captured in the top 3 deciles using the combined model in comparison to benchmarks at the Bank.

As shown in Table 17, improvement is also seen in the testing cohort. The lift remains at 3.86, with 68% of responders captured in the top 3 deciles. The average response rate for acquisition of SA accounts is also very stable year-to-year.

The transaction features alone are not as effective at predicting SA acquisition, and the distinction between various methodologies is not as clear. Therefore, although using transaction data combined with banking features results in overall improvement over the benchmark, further research is required to determine the optimal methodology.


7 Conclusions

Although word embedding methodologies for transaction data have been previously explored, new ways of aggregating transaction-level vectors into customer vectors have been tested and used in conjunction with open-source machine learning models in the Apache Spark framework. New approaches have been tested, such as embedding the merchant category code, and using cluster identifiers from the k-means algorithm as a feature. The XGBoost model showed potential improvement over traditional models such as Logistic Regression when sufficient data is used.

For both models, transaction features alone are able to provide model lift. It can also be noted that in the case of the SA model, the feature selection process did not require any expert opinion or manual feature selection. The model is able to use all of the available features, and the results show an improvement over the benchmark. The improvement in lift is at least 0.2. This means that for a customer universe of 1 million, for example, the top decile (top 10%) would contain 100 000 customers. On average, the fraction of customers who will acquire a savings account in the outcome window is 0.02, or 2000 customers. The portion of responders in the top decile of the model is about 0.08, or 8000 customers. Therefore, an improvement in lift allows responders to be targeted more effectively. The HELOC model demonstrated a larger improvement in lift on the training cohort, but did not generalize well to new data in the testing cohort.

The initial results demonstrate that there is potential for transaction features to improve the performance of the Bank’s models. They also show that raw transactions could be converted to usable features without a rule-based approach. The HELOC model shows the potential usefulness of clustering transaction features and using the cluster identifier in the model, rather than storing full customer vectors.

The models demonstrate several benefits and drawbacks of using more advanced machine learning models. This includes the potential for overfitting, and the requirement for a larger amount of data and resources. Additional complexity is added to the interpretation of features and feature importance.


7.1 Future Work

Although the results demonstrate the potential benefits of using machine learning and automated feature engineering strategies in traditional banking models, they also demonstrate the large potential for overfitting when models become too complex.

The parameters, models, and methods outlined in this report are not comprehensive and there remains potential for further improvement and parameter tuning in the word2vec and customer vector building algorithms, as well as the models themselves.

As described in 5.3.2, the parameters chosen for word2vec are based on previous benchmarks, research, as well as intuition. The usefulness of the transaction-level vectors for embedding merchant information could benefit from more data, and further study of optimal vector length, window size, and transaction text cleaning.

The use of only one year of data (a single time period) was a significant limitation of this work. As shown in the results, it could cause overfitting to a specific time period and poor generalization out-of-time. To mitigate this effect, several years of data could be agglomerated, or a sliding window could be used. This would have several benefits. Firstly, it would provide the machine learning models with more data, which could improve model performance, assist in filtering out “noise”, and limit overfitting to a specific time period. Secondly, it would allow using changes across time as a feature. For example, the change in year-to-year spending in a specific merchant category code could be an indicative feature. Thirdly, a longer history of data could allow for time-series models to be run. Because the 6-month observation window took place from January to October, seasonal behaviour was overlooked. A significant potential downside would be the increase in resources required to train the model.

Although this study has been conducted for a Bank, the applications can extend beyond financial institutions. An evident application would be the retail industry, where each transaction also provides itemized information on products. This structured data format lends itself well to rule-based modelling; however, the unprecedented volumes of data create opportunities for machine learning methodologies, allowing for automated extraction of patterns in the data.


This thesis focused solely on the standardized transaction tables provided by credit and debit card institutions and banks. There is extensive further research that could be done by joining this data to information on the merchant or the contents of the transaction. Other factors, such as socioeconomic and political changes should also be assessed in more detail when analyzing decisions such as HELOC acquisition.


References

1. Accenture. Exploring Next Generation Financial Services: The Big Data Revolution. 2017.

2. Continuous Embedding Spaces for Bank Transaction Data. Ali Batuhan Dayioglugil, Yusuf Sinan Akgul. İstanbul : Cybersoft R&D Center, 2017. Foundations of Intelligent Systems, Lecture Notes in Computer Science. Vol. 10352.

3. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. 2013.

4. Emilian Siman, Michael S. Finke, Melvin Corlija. Which Consumers Are Using Home Equity Lines of Credit? . Consumer Interests Annual. 2006, Vol. 52.

5. Femina Bahari T, Sudheep Elayidom M. An Efficient CRM-Data Mining Framework for the Prediction of Customer Behaviour . International Conference on Information and Communication Technologies. 2014.

6. Hsieh, Nan-Chen. An Integrated Data Mining and Behavioral Scoring Model for Analyzing Bank Customers. Expert Systems with Applications. 2004, 27.

7. Transaction aggregation as a strategy for credit card fraud detection. C. Whitrow, D. J. Hand, P. Juszczak, D. Weston, N. M. Adams. s.l. : Springer Science+Business Media, 2008, Data Min Knowl Disc (2009).

8. Alejandro Correa Bahnsen, Djamila Aouada, Aleksandar Stojanovic, Björn Ottersten. Feature engineering strategies for credit card fraud detection. University of Luxembourg, Interdisciplinary Centre for Security, Reliability and Trust. s.l. : Expert Systems With Applications, 2016. pp. 134-142.

9. Leonardo Baldassini, Jose Antonio Rodriguez Serrano. client2vec: Towards Systematic Baselines for Banking Applications. BBVA Data & Analytics. 2018.

10. USDA. USDA Departmental Management. VISA MERCHANT CATEGORY CLASSIFICATION. [Online] https://www.dm.usda.gov/procurement/card/card_x/mcc.pdf.

11. Towards Universal Paraphrastic Sentence Embeddings. John Wieting, Mohit Bansal, Kevin Gimpel, Karen Livescu. Chicago : s.n., 2016. ICLR.

12. Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. 2013.

13. Academo. 3d-vector-plotter. Academo.org. [Online]

14. Apache Spark. Word2Vec. [Online] https://spark.apache.org/docs/2.2.0/mllib-feature-extraction.html#word2vec.


15. A Simple but Tough-to-Beat Baseline for Sentence Embeddings. Sanjeev Arora, Yingyu Liang, Tengyu Ma. 2017. ICLR.

16. Avrim Blum, John Hopcroft, and Ravindran Kannan. Foundations of Data Science. 2018.

17. Han, J., Kamber, M., & Pei, J. Data Mining: Concepts and Techniques (3rd ed.). Amsterdam : Elsevier/Morgan Kaufmann, 2012.

18. Friedman, Jerome H. Greedy Function Approximation: A Gradient Boosting Machine. 1999.

19. Tianqi Chen, Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. University of Washington.

20. Starmer, Josh [StatQuest with Josh Starmer]. Gradient Boost Part 3: Classification. April 8, 2019.

21. Government of Canada. Getting a home equity line of credit. Financial Consumer Agency of Canada. [Online] 06 26, 2018.

22. Apache Hadoop. Apache Hadoop. Apache Hadoop. [Online] Apache. https://hadoop.apache.org/.

23. Cloudera. Cloudera Data Warehouse. Cloudera. [Online] https://www.cloudera.com/products/data-warehouse.html.

24. Apache Spark. Spark Overview. [Online] https://spark.apache.org/docs/2.2.1/.

25. Ghemawat, Jeffrey Dean and Sanjay. MapReduce: Simplified Data Processing on Large Clusters. Google, Inc. 2004.

26. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. s.l. : University of California, Berkeley, 2012.

27. Apache Spark. Cluster Mode Overview. Apache Spark. [Online] https://spark.apache.org/docs/2.2.1/cluster-overview.html.

28. —. Machine Learning Library (MLlib) Guide. Apache Spark. [Online] http://spark.apache.org/docs/latest/ml-guide.html.

29. Apache Hive. Apache Hive. [Online] https://hive.apache.org/.

30. Yun Hsuan Lee, Shi Miao Zhang. A Machine Learning Approach to Joint Finance Prediction: Exploring the Usefulness of Transaction Data. s.l. : University of Toronto, 2017.

31. GloVe: Global Vectors for Word Representation. Jeffrey Pennington, Richard Socher, Christopher D. Manning. Doha, Qatar : Association for Computational Linguistics, 2014. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543.

32. XGBoost. XGBoost Parameters. [Online] 2016.


33. Enhancing Sentence Embedding with Generalized Pooling. Qian Chen, Zhen-Hua Ling, Xiaodan Zhu. Santa Fe : s.n., 2018, Proceedings of the 27th International Conference on Computational Linguistics, pp. 1815–1826.

34. M.Bishop, Christopher. Pattern Recognition and Machine Learning. Singapore : Springer, 2006.


Appendices

Appendix 1: Additional Results

                             Validation                                     Out of Time
Model Trial                  Lift Top Decile   % Captured in Top 3 Deciles  Lift Top Decile   % Captured in Top 3 Deciles
HELOC 1                      4.25              67.04                        1.90              46.78
HELOC 2                      3.56              59.57                        1.86              45.99
HELOC 3                      4.09              66.00                        1.84              47.09
SA 1                         3.89              67.76                        3.86              67.82
SA 2                         3.80              67.28                        3.86              67.88
SA 3                         3.91              68.10                        3.86              67.85
HELOC - Average              3.97              64.20                        1.86              46.62
HELOC - Standard Deviation   0.29              3.30                         0.02              0.46
SA - Average                 3.87              67.71                        3.86              67.85
SA - Standard Deviation      0.05              0.34                         0.00              0.02

