Inductive Logic Programming with Gradient Descent for Supervised Binary Classification

by Nicholas Wu

B.S., Massachusetts Institute of Technology (2019)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, February 2020.

© Massachusetts Institute of Technology 2020. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, February 2020

Certified by: Andrew W. Lo, Charles E. and Susan T. Harris Professor, Sloan School of Management, Thesis Supervisor

Accepted by: Katrina LaCurts, Chairman, Master of Engineering Thesis Committee

Submitted to the Department of Electrical Engineering and Computer Science in February 2020, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science

Abstract

As machine learning techniques have become more advanced, interpretability has become a major concern for models making important decisions. In contrast to Local Interpretable Model-Agnostic Explanations (LIME), this thesis seeks to develop an interpretable model using logical rules, rather than explaining existing blackbox models. We extend recent inductive logic programming methods developed by Evans and Grefenstette [3] to develop a gradient descent-based inductive logic programming technique for supervised binary classification. We start by developing our methodology for binary input data, and then extend the approach to numerical data using a threshold-gate based binarization technique. We test our implementations on datasets with varying pattern structures and noise levels, and select our best performing implementation. We then present an example where our method generates an accurate and interpretable rule set, whereas the LIME technique fails to generate a reasonable model. Further, we test our original methodology on the FICO Home Equity Line of Credit dataset. We run a hyperparameter search over differing numbers of rules and rule sizes. Our best performing model achieves a 71.7% accuracy, which is comparable to multilayer perceptron and random forest models. We conclude by suggesting directions for future applications and potential improvements.

Thesis Supervisor: Andrew W. Lo Title: Charles E. and Susan T. Harris Professor, Sloan School of Management

Acknowledgments

My journey through MIT hasn’t been linear, and there have been so many twists and turns along the way. As such, there are so many people I have to thank for their guidance, advice, and support. To start, I extend my heartfelt gratitude to Professor Andrew Lo for all his help through this thesis. I am incredibly grateful for all the guidance he has provided for me along the way. Whenever I had questions, whenever I seemed to get stuck or confused, Professor Lo would always offer a different angle or another way forward. His mentorship throughout the entire research process helped make this thesis possible. I also recognize the help I received from the staff at the Laboratory for Financial Engineering. I sincerely thank Jayna Cummings, Crystal Myler, and Mavanee Nealon, for all their behind-the-scenes work coordinating meetings and making sure my research process went smoothly. Additionally, I want to thank some of my close friends and classmates who have always been listening and helping me along my way throughout MIT. I’ve been helped by so many people, whether it be inspiring me, listening to my ideas, providing general advice, or teaching me things I don’t know. To that end, I want to thank Alap Sahoo, Luis Sandoval, Haris Brkic, Evan Tey, Justin Yu, and Henry La Soya for all they’ve done to help me through this thesis, and especially for being good friends to me throughout this time. I also want to thank Jenny Shi for all her support over the past two years. From late-night proofreading, motivating me to work, and getting me past mental stumbling blocks on my research path, she has done so much to help me get here. Lastly, I have to thank my parents, Daniel and Li, and my sister, Jackie, for their continual support throughout everything in the past four and a half years. Throughout this entire time, my family has stood by me, helping me through all my struggles and celebrating my successes. Without them, I likely would never have even gotten the opportunity to come to MIT, and I extend my greatest thanks for their constant support.

Contents

1 Introduction
  1.1 Research Goals
  1.2 Thesis Structure and Result Summary
    1.2.1 Chapter 2
    1.2.2 Chapter 3
    1.2.3 Chapter 4
    1.2.4 Chapter 5
    1.2.5 Chapter 6
    1.2.6 Chapter 7

2 Inductive Logic Programming
  2.1 Notation and Definition
  2.2 Early methods for Inductive Logic Programming
    2.2.1 RLGG refinements
    2.2.2 Top-down approaches
    2.2.3 Inverse Entailment
  2.3 Recent Approaches
    2.3.1 Boolean Satisfiability Reduction
    2.3.2 Learning Process
    2.3.3 Approach Results

3 Explainable Artificial Intelligence
  3.1 Why Explainability?
  3.2 Blackbox Interpretability Methods
    3.2.1 LIME
    3.2.2 Gradient Approximation
    3.2.3 General Approach Shortcomings
  3.3 Inductive Methods versus Regression

4 Model Development
  4.1 Approximating Logical Structures
    4.1.1 Parametrization
    4.1.2 Implementing AND
    4.1.3 General Learning Procedure
  4.2 Experiments
    4.2.1 Data Construction
    4.2.2 Model Comparisons
  4.3 Discussion

5 Inducing Rules on Numerical Data
  5.1 Approach Details
    5.1.1 Feature Expansion
    5.1.2 Interpretability
  5.2 Experiments
    5.2.1 Experimental Results
  5.3 Discussion
    5.3.1 Comparison to LIME

6 FICO Home Equity Line of Credit Tests
  6.1 Dataset
    6.1.1 Weight-of-Evidence Encoding
  6.2 Binary Data Rule Learning
    6.2.1 Experiments
  6.3 Numerical Data Rule Learning
    6.3.1 Results and Analysis
  6.4 Discussion
    6.4.1 Comparisons to Related Work
    6.4.2 Generalizability
    6.4.3 Approach Shortcomings

7 Conclusion and Next Steps
  7.1 Key Ideas
  7.2 Future Work

A Tables

List of Figures

4-1 Relation between number of features and model performance
4-2 Relation between number of rules and model performance
4-3 Relation between rule size and model performance
4-4 Relation between noise level and model performance

5-1 Relation between error rate and numerical model performance
5-2 Relation between number of rules and numerical model performance
5-3 Relation between rule size and numerical model performance

6-1 Example Training Run, Plotting Accuracy and Loss

List of Tables

4.1 List of all constructed dataset configurations for binary data rule learning
4.2 Model Run Results, Constructed Dataset 1

5.1 List of all constructed dataset configurations for numerical rule learning
5.2 LIME Coefficients around test point

6.1 Rules for learning non-creditworthiness
6.2 Rules for learning creditworthiness
6.3 Rules for learning non-creditworthiness, numerical data
6.4 Rules for learning creditworthiness, numerical data
6.5 Comparison between our descent-based inductive logic programming and other models

A.1 Model Run Results, Varying Number of Features
A.2 Model Run Results, Varying Number of Rules
A.3 Model Run Results, Varying Rule Size
A.4 Model Run Results, Varying Noise Level
A.5 Numerical Model Run Results, Varying Noise Level
A.6 Numerical Model Run Results, Varying Number of Rules
A.7 Numerical Model Run Results, Varying Rule Size
A.8 Hyperparameter search, predicting non-creditworthiness with HELOC Binarized Data
A.9 Hyperparameter search, predicting creditworthiness using HELOC Binarized Data

A.10 Hyperparameter search, predicting non-creditworthiness using HELOC Numerical Data
A.11 Hyperparameter search, predicting creditworthiness using HELOC Numerical Data

Chapter 1

Introduction

In the past decade, the proliferation of machine learning and artificial intelligence techniques has allowed such models to outperform many traditional methods for a myriad of applications, from image processing to natural language processing. However, many successful machine learning models function as blackboxes, largely because these models frequently compute extremely intricate functions in which none of the model’s parameters has an intuitive meaning. Further, as models become more complex, the number of parameters in a model such as a deep neural network can exceed several million, making it difficult to understand model behavior. As such, there has been interest in producing machine learning models that are explainable to humans without sacrificing accuracy. One of the interesting approaches to developing explainable artificial intelligence comes from the field of inductive logic programming. Inductive logic programming deals with the development of a hypothesis that logically entails a set of background examples. This approach addresses the explainability issue explicitly in that any logic program consists of a set of Horn clauses, which are explainable as logical rules.

1.1 Research Goals

This research seeks to extend recent developments in the field of inductive logic programming in order to adapt inductive logic programming methods to the general task of supervised binary classification. That is, given some target binary label and some input features, we wish to develop a method to learn a logic program that predicts the label from the given inputs, such that the resulting logic program has the following properties:

1. Accuracy: the logic program should fit the training data relatively accurately. That is, the logic program should predict the correct class for a high fraction of the training examples.

2. Generalizability: the logic program should accurately classify examples not seen before.

3. Interpretability: the rules of the logic program should be easily understandable to a human observer.

1.2 Thesis Structure and Result Summary

We outline the content in the chapters of this thesis, and summarize key results.

1.2.1 Chapter 2

In this chapter, we provide a brief overview of the field of inductive logic programming and explain the different inductive logic approaches. We specifically highlight the 2018 paper by Evans and Grefenstette [3] in order to set up our extension of their methodology.

1.2.2 Chapter 3

In this chapter, we discuss the importance of interpretability in artificial intelligence. We highlight important motivations for interpretability, and cite examples where traditional machine learning methods fail to achieve these goals. We then present a literature survey in explainable artificial intelligence, specifically regarding local explanations for blackbox models. We present the LIME methodology, and then discuss

a recent extension of the LIME method that utilizes inductive logic programming. We also briefly discuss gradient interpretation as a method of analyzing blackbox models, before closing with a discussion of the relative flaws in interpreting a blackbox model through approximation.

1.2.3 Chapter 4

We begin this section by presenting the formalization of our task. The subsequent parts of this section discuss our original methodologies for performing inductive rule learning. We describe our general approach for forming logical rules, the various implementation strategies, and then provide experimental data over constructed datasets to examine the practicality of the methods given various levels of rule size, rule quantity, and noise.

1.2.4 Chapter 5

In this section, we describe how to extend the methods developed in Chapter 4 to handle non-binary input data. Specifically, we describe how to potentially learn logical thresholds for numerical data. We analyze the efficacy of these methods for extending inductive logic onto continuous numerical data by training models on constructed numerical datasets, and report on these results. We finish this chapter by discussing an example where our inductive logic programming model is able to generate reasonable explanations, but the LIME technique fails to generate a feasible global approximation.

1.2.5 Chapter 6

For this section, we apply the cumulative methods developed in the two previous chapters to a real-world dataset. We briefly describe the Home Equity Line of Credit (HELoC) dataset. We first explore utilizing a binarized approach over the Weight-of-Evidence (WoE) based cutoffs, and then explore applying the methods from Chapter 5 to the original numerical data to manually learn cutoffs. We lastly discuss the

approximation breakdown phenomenon, and suggest potential future directions for addressing the issue.

1.2.6 Chapter 7

In the final chapter, we discuss concluding remarks and recapitulate the main results of the thesis. We provide some ideas for future directions, including potential methodology expansions, alternative implementations, and different applications of the methods discussed.

Chapter 2

Inductive Logic Programming

We will start by defining logical induction and inductive logic programming, and discuss current background in this area. We also introduce the important precursor work in inductive logic programming which we will extend in later chapters. We will not expand upon every inductive logic programming development in detail; instead, we seek to provide a higher-level overview of the field. Induction refers to the general ability to learn a conclusion from some examples; for example, we might observe many different material compositions and colors, and eventually inductively reason that certain pigments induce certain colors on the materials. In the framework of logical induction, we intend to learn a conclusion in the form of some logical statements. Thus, we begin by introducing exactly what we mean by logical statements.

2.1 Notation and Definition

First, we will utilize the following symbols:

∙ ∨ to denote logical OR

∙ ∧ to denote logical AND

∙ ¬푎 to denote the logical inverse of 푎 (i.e. NOT 푎)

In general, the standard form of a logical rule takes the form of a Horn clause, which we define as:

Definition 2.1. A Horn clause is a logical rule of the form:

$$a_1 \land a_2 \land \cdots \land a_i \rightarrow k$$

where each of the $a_j$ is a positive or negative literal.

Note that atomic facts can also be expressed as Horn clauses. For example,

푖푠퐺푟푒푒푛(grass)

also can be expressed as → 푖푠퐺푟푒푒푛(grass)

Finally, we present our entailment notation. Let ⊨ denote entailment. That is, for two sets of Horn clauses 퐴 and 퐵, 퐴 ⊨ 퐵 implies that every rule or fact in 퐵 can be logically derived from the elements of 퐴. Having presented our background and formalism, we are now ready to formally define the inductive logic programming problem.

Definition 2.2. Inductive Logic Programming (ILP): Given a set of background knowledge 퐵 consisting of logical rules and facts, and a set 퐸 of logical facts, compute a set of Horn clauses 퐻 such that 퐻 ∧ 퐵 ⊨ 퐸. We call 퐻 the hypothesis.

2.2 Early methods for Inductive Logic Programming

Many of the earlier methods for performing inductive logic programming involved generating a search space of possible hypotheses, and processing through these hypotheses in some order (i.e. from most general to most specific). In general, all of these methods take this underlying approach, with refinements in managing the search space of hypotheses.

The early foundation for inductive logic programming was laid by Plotkin in his PhD thesis in 1971 [12]. Plotkin’s method of relative least general generalization (RLGG) proceeds as follows:

1. For every positive example, generate a rule by treating all the background knowledge as the body of the rule, and the example as the target of the rule. This step is called relativization.

2. Replace all specific objects with variables, and unify similar rules. This step is referred to as anti-unification.

3. Delete extraneous literals that do not involve the given variables.

As proposed, the relative least general generalization (RLGG) approach has many flaws. For example, RLGG fails to learn rules that involve additional internal variables. To see this, consider the grandparent relation 푔(푥, 푦). In terms of a parent relation 푝, grandparent can be defined as 푔(푥, 푦) = 푝(푥, 푧) ∧ 푝(푧, 푦). Plotkin’s method fails to learn this since it discards the internal variable 푧 as irrelevant. Further, the size of the relative least general generalization in Plotkin’s method can grow exponentially in the size of the background knowledge, quickly making this method intractable for large datasets.

2.2.1 RLGG refinements

In 1990, Muggleton and Feng extended Plotkin’s method by creating Golem to address some of these issues. By imposing some determinacy restrictions on background knowledge and the hypothesis space, the Golem method for ILP polynomially bounds the size of the RLGG, and adapts the learning methodology for larger datasets [10]. In 2009, Muggleton et al. extended Golem further to address some of its shortcomings, producing ProGolem [11]. The determinacy restrictions, being essential to the polynomial bound on the size of the RLGG in Golem, made Golem inapplicable to several key datasets. ProGolem utilizes the Asymmetric Relative Minimal Generalization (ARMG) rather than the determinate relative least general generalization (RLGG) to produce hypotheses.

2.2.2 Top-down approaches

Another class of ILP methods searches the hypothesis space from the top down; they generate a rule that is too broad, explaining all the positive examples, and then refine the rule to cut out negative examples. One major class of approaches follows the first-order inductive learner (FOIL) method developed by Quinlan in 1990 [13]. To learn a rule, the Horn clause starts with an empty body, and literals are appended to the body in an order computed according to some score, refining the rule until it no longer covers any negative examples. As with the earlier generalization-based methods, FOIL becomes intractable for larger datasets. To address this limitation, in 2014, Zeng et al. developed an extension called QuickFOIL by implementing a different scoring function for adding literals to the body of a rule, and a refined pruning strategy for evaluating candidate rules [18].
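To make the refinement loop concrete, here is a minimal propositional sketch of FOIL-style greedy rule learning. It is an illustration under simplifying assumptions, not Quinlan's algorithm: features are binary, and literals are scored by simple positive-example purity rather than FOIL's information gain.

```python
def learn_rule(examples):
    """Greedily add literals until the rule covers no negative examples.

    examples: list of (features dict, label) pairs with binary values.
    Returns the rule body as a list of (feature, required value) literals.
    """
    body, covered = [], examples
    features = list(examples[0][0])
    while any(label == 0 for _, label in covered):
        best = None
        for f in features:
            for v in (0, 1):
                if (f, v) in body:
                    continue
                sub = [(x, t) for x, t in covered if x[f] == v]
                if not sub:
                    continue
                purity = sum(t for _, t in sub) / len(sub)
                if best is None or purity > best[0]:
                    best = (purity, (f, v), sub)
        if best is None:
            break  # no literal can be added; give up on a pure rule
        body.append(best[1])
        covered = best[2]
    return body

examples = [({"a": 1, "b": 1}, 1), ({"a": 1, "b": 0}, 0),
            ({"a": 0, "b": 1}, 0), ({"a": 0, "b": 0}, 0)]
print(learn_rule(examples))  # [('a', 1), ('b', 1)], i.e. a AND b -> target
```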

2.2.3 Inverse Entailment

In contrast to approaches that search the hypothesis space from most general to most specific, there are other bottom-up approaches that try hypotheses from most specific to most general; specifically, Cigol (“logic” spelled backwards) uses this methodology [9]. However, the more successful refinement of this approach evolved in 1995 from a different observation about logical induction [8]. The learning system Progol, developed by Muggleton, relies on the following observation: given that we wish to derive a hypothesis 퐻 such that 퐵 ∧ 퐻 ⊨ 퐸 for the background knowledge 퐵 and the examples 퐸, it follows that 퐵 ∧ ¬퐸 ⊨ ¬퐻 (where ¬퐴 denotes the logical complement of 퐴). Practically, this results in the following algorithm:

1. Pick a positive example as a seed.

2. Start with the most specific possible rule that could explain this example, and generate the space of possible generalizations.

3. Search this space. In the case of Progol, an 퐴* search is used.

4. Repeat until all positive examples are covered.

2.3 Recent Approaches

While all of these approaches have remained subjects of recent research, a fundamental hindrance to many of them lies in their sensitivity to error. Although some adaptations of the previously mentioned approaches manually allow for some amount of error tolerance, a DeepMind publication in 2018 produced a more organic method of handling error tolerance by utilizing a gradient descent-based approach to learning rules. Termed 휕ILP, this approach treats the rule-learning process as a loss minimization problem solvable through gradient descent. The main part of this thesis extends ideas from this approach rather than the earlier inductive logic programming approaches.

2.3.1 Boolean Satisfiability Reduction

The first key note is that we can interpret the inductive logic programming task as a boolean satisfiability problem. That is, if all the possible rules for defining a predicate are enumerated, we simply have to learn a true/false indicator for each of these rules as to whether the rule defines the given predicate. Specifically, to get a complete description, [3] shows that any predicate can be expressed in terms of two rules with at most two predicates each,

푎 ∧ 푏 → 푐 where one or more of the literals in the body of the rule may be “invented”, and the rule definition of such an “invented” predicate can also be learned. Hence, the approach used by 휕ILP enumerates all pairs of possible rules, and learns a probability distribution over which pair of rules generates the best theory.

2.3.2 Learning Process

In order to learn which rules are correct, 휕ILP performs a fixed number 푁 of steps of fuzzy logical reasoning from the background knowledge to obtain a truth assessment of the relevant target predicate for the given examples. By fuzzy logic, we refer to the extension of logical operators to values in the continuous [0, 1] interval. Hence, the overall approach starts with an example, attempts to deduce the target predicate from the other features, and forms a probabilistic estimate, which is used to compute a binary cross entropy loss to minimize. Specifically, for every potential pair of defining rules, a fixed logical reasoning function is generated to approximate the process of logical reasoning with those rules. One step of forward reasoning then takes the non-target information as input, applies each pair of rules using the corresponding reasoning function, and unifies the resulting valuations according to the probability distribution over the rule pairs.

2.3.3 Approach Results

This approach achieves performance improvement over other statistical reasoning-based methods and comparable performance to inductive logic methods, but particularly provides the benefit of being fault-tolerant to data mislabeling. The approach developed in the paper additionally outperformed multilayer perceptron models at inductive tasks; in general, the multilayer perceptron could not generalize as well as 휕ILP [3]. This improved generalizability and explainability motivates the extension of 휕ILP as the basis for our research.

Chapter 3

Explainable Artificial Intelligence

In general, the notion of explainability or interpretability is subject to many different ideas on exactly how it should be defined. As [2] notes, there are two broad approaches across the field. The first approach is to demonstrate usefulness under a specific application; for example, previous work evaluated the practicality of the Local Interpretable Model-Agnostic Explanation (LIME) technique on the Home Equity Line of Credit (HELoC) dataset [7]. The alternative assumes that a certain type of model satisfies interpretability, and describes how to optimize such a model.

3.1 Why Explainability?

Before we review some recent approaches and literature around interpretability in machine learning, we first explore why interpretability provides any value. Specifically, interpretability provides value by allowing humans to sanity-check models for:

1. Security: Without any understanding of the internal mechanism of a machine learning model, the model can easily fall prey to specifically engineered attacks. As noted in [16], intricate deep neural networks for computer vision are extremely susceptible to one-pixel attacks; that is, modifying the value of a single pixel can induce an algorithm to falsely label an image. With some measure of interpretability, such a security vulnerability can be protected against.

2. Ethics: Often, the human notion of fairness is hard to judge with a machine learning model. For example, in the HELoC case, part of the interest in interpretability stems from the fairness of rejecting any loan application; specifically, a loan provider would have to explain any rejected loan application. In other cases, biased datasets may induce implicit biases in the models trained on them, which might negatively affect certain groups of people.

3. Multiple objectives: Typically, machine learning models are trained to minimize a loss function. However, loss functions cannot always account for phenomena in practice; although a machine learning algorithm may optimize for one objective, it may not account for others. For example, using FICO credit data, Hardt et al. demonstrated that optimizing for nondiscrimination and accuracy are not always compatible [5].

3.2 Blackbox Interpretability Methods

In this section, we introduce various methods for generating interpretable explanations from an existing blackbox model. These explanations all approximate the blackbox model in the local vicinity of some point of interest 푥.

3.2.1 LIME

One of the main approaches to interpretability is the method of Local Interpretable Model-Agnostic Explanations, proposed in 2016 as a method for understanding existing blackbox models [14]. The general LIME algorithm proceeds as follows: given a blackbox function 푓 and a focal point 푥,

1. Randomly generate 푆, a set of perturbed points sampled in some neighborhood of 푥.

2. Compute 푓(푆) = {푓(푠) | 푠 ∈ 푆}.

3. Perform a linear regression of 푓(푆) on the points in 푆; i.e. minimize the mean squared error between a linear function of the inputs and 푓(푆).

As [7] states, LIME does not require the underlying blackbox model to be differentiable, but it inherits the shortcomings of linear regression, especially susceptibility to correlations between input features.
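As an illustration of the three steps above, here is a minimal sketch of the local-regression procedure, assuming a blackbox callable f that maps a batch of points to scores; note that the full LIME method also weights samples by proximity to 푥 and fits a sparse model, both of which this sketch omits.

```python
import numpy as np

def lime_explain(f, x, n_samples=1000, radius=0.1, seed=0):
    """f: blackbox mapping an (m, n) array to (m,) scores; x: (n,) point."""
    rng = np.random.default_rng(seed)
    # 1. Sample perturbed points in a neighborhood of x.
    S = x + radius * rng.standard_normal((n_samples, x.shape[0]))
    # 2. Evaluate the blackbox on the perturbed sample.
    fS = f(S)
    # 3. Fit a least-squares linear model to (S, f(S)).
    A = np.hstack([S, np.ones((n_samples, 1))])  # append intercept column
    coef, *_ = np.linalg.lstsq(A, fS, rcond=None)
    return coef[:-1], coef[-1]  # (feature weights, intercept)
```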

LIME-FOLD

In an extension of the original LIME method proposed in [14], another approximation-based approach performs an inductive logic programming induction over the perturbed sample, rather than the linear regression of LIME [15]. Specifically, given a function 푓 and focal point 푥, LIME-FOLD proceeds as follows:

1. Randomly generate 푆, a set of perturbed points sampled in some neighborhood of 푥.

2. Compute 푓(푆) = {푓(푠)|푠 ∈ 푆}.

3. Utilize the FOLD inductive logic programming algorithm (derived from FOIL [13]) to approximate rules governing 푓 around 푥.

While this retains many of the benefits of LIME in that it is model-agnostic, it also avoids some of the pitfalls of linear regression. However, this approach shares the overall shortcomings of blackbox approximation approaches, such as the inability to discover global structure.

3.2.2 Gradient Approximation

Another blackbox function approximation approach computes the gradient of the model at a point of interest, and uses the gradient vector to guide explanation [1]. Termed the explanation vector in 2010 by Baehrens et al., the gradient of the model at a given point near a decision boundary points towards a direction that induces a classification change. However, this method does require the underlying blackbox model to be differentiable everywhere. Further, it is possible that gradients may be deceptively small near certain inputs [17], and thus the gradient vector may not provide a good explanation.

3.2.3 General Approach Shortcomings

All of the blackbox interpretability methods discussed rely on approximating the blackbox function near a focal point. While this does generate potentially reasonable explanations near the decision boundary, these methods struggle to interpret global structure in the blackbox functions. To avoid this flaw, we take a different approach to interpretability; specifically, we impose an inductive bias on the model itself in order to maintain interpretability.

3.3 Inductive Methods versus Regression

Logical rules have an advantage over regression models in that they formalize feature interactions; that is, relationships between inputs. Consider a theoretical example, where we have binary input variables 푥1 and 푥2 determining the behavior of a binary output variable 푦. Suppose the presence of either input determines 푦; we can easily express this as 푥1 ∨ 푥2 → 푦. However, assuming we have an approximately equal number of observations of (푥1, 푥2, 푦) for all possible pairs (푥1, 푥2), the linear regression of 푦 on

푥1 and 푥2 yields coefficients of approximately 0.5 each for 푥1 and 푥2, which remains oblivious to the interaction between 푥1 and 푥2. We could, of course, include 푥1푥2 as an interaction term in our regression; the resulting regression would yield the model 푥1 + 푥2 − 푥1푥2. However, this implicitly requires an “a priori” intuition that there exists a feature interaction between 푥1 and 푥2; in models with many parameters, there are an exponential number of potential feature interactions, and including them all in the regression would result in an exponential number of regression variables, which quickly becomes infeasible. The ability to perform logical

induction, and generate rules programmatically, provides a supplement to regression techniques, especially in terms of teasing out feature interactions, without sacrificing anything in terms of interpretability.

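The claim about the coefficients is easy to verify numerically; here is a short sketch with a balanced design matrix, where the values printed are assumptions of this example rather than reported results from the thesis.

```python
import numpy as np

# y = x1 OR x2 over all four balanced (x1, x2) pairs.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 1.])

A = np.hstack([X, np.ones((4, 1))])            # plain regression + intercept
print(np.linalg.lstsq(A, y, rcond=None)[0])    # ~[0.5, 0.5, 0.25]

A_int = np.hstack([X, X[:, :1] * X[:, 1:], np.ones((4, 1))])
print(np.linalg.lstsq(A_int, y, rcond=None)[0])  # ~[1, 1, -1, 0]
```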
Chapter 4

Model Development

In this chapter, we present our investigation into original methodologies for performing logical induction-based supervised learning on binary data and analyze the accuracy and applicability of these approaches. Formally, we will suppose that we have an input collection 푆 consisting of 푛-dimensional vectors 푥 ∈ {0, 1}푛, possibly repeated, and a label 푦 ∈ {0, 1} for each vector in our collection. Under this supervised learning context, we suppose that we have access to these labels, and we wish to learn a set of consistent rules 푅* such that logical deduction from these rules recovers the labels with as high accuracy as possible.

For the sake of clarity throughout this chapter, we reiterate the definition of the softmax function. For some vector $X \in \mathbb{R}^n$,

$$\mathrm{softmax}(X)_i = \frac{e^{X_i}}{\sum_{j \in Z(n)} e^{X_j}}$$

Additionally, we denote the sigmoid function $\sigma$:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

4.1 Approximating Logical Structures

Recall that a Horn clause takes the form

푎1 ∧ 푎2 ∧ ...푎푖 → 푡

Generally, this form is interpretable to us as a logical rule: if the given conditions are met, the result holds. However, a single Horn clause is insufficient to express an arbitrary Boolean function; for example, consider 푐 governed by inputs 푎 and 푏 as 푐 = (푎 ∧ 푏) ∨ (¬푎 ∧ ¬푏). No single Horn clause can capture this relationship in 푐; hence, for most learning settings, we need to learn sets of Horn clauses. Typically, the collective OR of all the Horn clauses is used to characterize a variable. Hence, we can interpret clause learning as characterizing a disjunctive normal form expression for target value 푡 of the form:

$$t = \bigvee_{j} \bigwedge_{i} a_{i,j}$$

where 푎푖,푗 denotes the 푖th literal in the 푗th Horn clause. We would like to be able to determine the 푎푖,푗 terms; that is, which literals belong to which rules.

4.1.1 Parametrization

We consider two main ways to develop a parametrized representation for these rules. In both cases, our parameters represent some probability distribution characterizing the rules.

Fixed-Size Parametrization

The first parametrization requires a given number of rules 푁 and a given maximum rule size 푅. We construct Π, an 푁 × 푅 × 2푛 tensor, and interpret

softmax(Π[푖, 푗])

as a probability distribution for the 푗th term in the 푖th rule; that is, we get a discrete probability distribution of size 2푛 over the 푛 features and their negations.

From this parametrization and extensions ∨̂ and ∧̂ of OR and AND to [0, 1], we can compute an approximated label 푦̂ for an input vector 푥 by concatenating 푥 with its complement 1 − 푥 to get a 2푛-vector 푥*:

$$\hat{y} = \hat{\bigvee_i} \hat{\bigwedge_j} \left( x^{*} \cdot \mathrm{softmax}(\Pi[i, j]) \right)$$

where 푖 ranges over the number of rules and 푗 ranges over the size of a rule.

We can also resolve the representation Π to a set of logical rules: for each rule index $r \in Z(N)$, we take the rule

$$\text{target} \leftarrow \bigwedge_{c=1}^{R} \operatorname{argmax}_k\big(\Pi[r, c, k]\big)$$

The set of rules is also expressible as a single expression in disjunctive normal form, as

$$\text{target} \leftarrow \bigvee_{r=1}^{N} \bigwedge_{c=1}^{R} \operatorname{argmax}_k\big(\Pi[r, c, k]\big)$$
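To make the forward computation concrete, here is a minimal sketch of the fixed-size parametrization in PyTorch, assuming the product-based AND and De Morgan OR discussed later in this chapter; the shapes and names are illustrative, not the thesis's actual implementation.

```python
import torch

# Pi: (N_rules, R_size, 2n) parameters; x: (B, n) batch of binary inputs.
def predict_fixed(Pi, x):
    x_star = torch.cat([x, 1 - x], dim=-1)              # (B, 2n): features and negations
    dist = torch.softmax(Pi, dim=-1)                    # distribution over 2n literals
    terms = torch.einsum("bf,nrf->bnr", x_star, dist)   # soft truth value of each slot
    rules = terms.prod(dim=-1)                          # product-based AND within a rule
    return 1 - (1 - rules).prod(dim=-1)                 # OR across rules via De Morgan

n, N, R = 25, 3, 3
Pi = torch.randn(N, R, 2 * n, requires_grad=True)
x = torch.randint(0, 2, (8, n)).float()
y_hat = predict_fixed(Pi, x)  # (8,) soft labels in [0, 1]
```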

Variable-Size Parametrization

Under a different parametrization, we can learn rules of variable sizes. That is, given a hyperparameter 푁 for the number of rules, we can construct a different parameter Π′ as an 푁 × 푛 × 3 tensor and interpret

softmax(Π′[푖, 푗]) as the discrete probability distribution over whether the 푗th feature appears in the 푖th rule, whether its negation appears in the 푖th rule, or whether it does not appear at all.

Given our parametrization and soft logic extensions ∨̂ and ∧̂ for OR and AND on [0, 1] values, we can compute an estimated label 푦̂ as follows:

$$\hat{y} = \hat{\bigvee_i} \hat{\bigwedge_j} \Big( x \cdot \mathrm{softmax}(\Pi'[i, j])[0] + (1 - x) \cdot \mathrm{softmax}(\Pi'[i, j])[1] + \mathrm{softmax}(\Pi'[i, j])[2] \Big)$$

where 푖 ranges over the number of rules and 푗 ranges over the number of indicators in a rule.

We can similarly resolve the representation Π′ to a set of logical rules by taking rule 푖 to be

$$\{\, k \mid \operatorname{argmax}(\Pi'[i, k]) = 0 \,\} \cup \{\, \neg k \mid \operatorname{argmax}(\Pi'[i, k]) = 1 \,\}$$

That is, rule 푖 contains a positive literal for each indicator 푘 to which the probability distribution from Π′ assigns the highest probability of appearing in the rule, and a negated literal for each indicator 푘 to which it assigns the highest probability of appearing in the rule negated.
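Analogously, here is a minimal sketch of the variable-size parametrization under the same assumptions (product-based AND, De Morgan OR; names illustrative):

```python
import torch

# Pi_v: (N_rules, n, 3) parameters; channel 0 = feature appears, channel 1 =
# negation appears, channel 2 = feature absent from the rule.
def predict_variable(Pi_v, x):
    dist = torch.softmax(Pi_v, dim=-1)                  # (N, n, 3)
    # Soft indicator that feature j is satisfied in rule i for input x:
    passes = (x.unsqueeze(1) * dist[..., 0]
              + (1 - x).unsqueeze(1) * dist[..., 1]
              + dist[..., 2])                           # (B, N, n)
    rules = passes.prod(dim=-1)                         # AND over all indicators
    return 1 - (1 - rules).prod(dim=-1)                 # OR across rules

N, n = 3, 25
Pi_v = torch.randn(N, n, 3, requires_grad=True)
x = torch.randint(0, 2, (8, n)).float()
print(predict_variable(Pi_v, x).shape)  # torch.Size([8])
```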

4.1.2 Implementing AND

In implementing such a learning procedure, we require the continuous extensions of

AND and OR given by ∧̂ and ∨̂. In implementation, we will make use of De Morgan’s Law; that is,

$$\hat{\bigvee_i} x_i = \neg\, \hat{\bigwedge_i} \neg x_i$$

Hence, we need only develop a continuous extension 퐴 of the AND function, and the corresponding OR extension is simply given by

$$\hat{\bigvee}(x) = 1 - A(1 - x)$$

Minimum Function AND

Given a set of binary variables $a_1, a_2, \dots, a_n$, one continuous extension of $a_1 \land a_2 \land \cdots \land a_n$ is given by

$$\min_i a_i$$

It is obvious that for binary 푎푖, min푖 푎푖 = 1 only when all the 푎푖 are 1, and hence this is consistent with the binary logic AND.

Product AND

Another proposed AND function is given by

$$\hat{\bigwedge_i} a_i = \prod_i a_i$$

We can confirm that for binary 푎푖, this is 1 if and only if all of the 푎푖 are 1.

Sigmoid-based AND

Next, we note that another approximation to the AND stems from the fact that

$$\bigwedge_{i=1}^{n} a_i = \left( \sum_{i=1}^{n} a_i \geq n \right)$$

Given that the sigmoid function can be considered an approximation to the step function ($\sigma(x) \approx (x \geq 0)$), we can obtain the following approximation:

$$\hat{\bigwedge_{i=1}^{n}} a_i = \sigma\left( \sum_{i=1}^{n} a_i - n + \epsilon \right)$$

for some $\epsilon \in (0, 1)$. Note that this approach has the drawback that

$$\sigma\left( \left( \sum_{i=1}^{n} 1 \right) - n + \epsilon \right) = \sigma(\epsilon) \neq 1$$

so this function only approximates the AND of the inputs, and in fact is not an exact AND.

ReLU-based AND

We recall the rectified linear unit given by:

$$\mathrm{ReLU}(x) = \max(0, x)$$

Using the ReLU function, we can create an exact extension of the AND function from

$$\hat{\bigwedge_{i=1}^{n}} a_i = \mathrm{ReLU}\left( \sum_{i=1}^{n} a_i - (n - 1) \right)$$

We can easily check that for binary 푎푖, this is 1 if the 푎푖 are all 1, and zero otherwise. However, this has the disadvantage of constant behavior outside of the range where the sum $\sum_i a_i$ is between 푛 − 1 and 푛, which prevents learning on most initializations due to the zero gradient in the constant region of the rectified linear unit. As noted in [6], a parametric rectified linear unit replaces the constant 0 in the ReLU function with a slightly sloping line, which prevents the gradient from completely disappearing at negative values. Specifically, for some given learnable parameter 푐 ≥ 1, we can design a parametrized ReLU:

$$\mathrm{PReLU}(x) = \max\left( \frac{1}{2c(n-1)}\, x + \frac{1}{2c},\; \frac{2c-1}{2c}\, x + \frac{1}{2c} \right)$$

This was deliberately engineered such that for 푥 = −(푛 − 1), the function evaluates to 0, and for 푥 = 1, the function evaluates to 1. We can thus use our extension as

$$\hat{\bigwedge_{i=1}^{n}} a_i = \mathrm{PReLU}\left( \sum_{i=1}^{n} a_i - (n - 1) \right)$$
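For reference, here are minimal sketches of the four continuous AND extensions discussed in this section, each reducing a tensor of [0, 1] values along a dimension; the helper names are illustrative. The OR extension follows from any of them by De Morgan's law.

```python
import torch

def and_min(a, dim=-1):
    return a.min(dim=dim).values            # minimum-function AND

def and_product(a, dim=-1):
    return a.prod(dim=dim)                  # product AND

def and_sigmoid(a, dim=-1, eps=0.5):
    n = a.shape[dim]
    return torch.sigmoid(a.sum(dim=dim) - n + eps)  # approximate; never exactly 1

def and_prelu(a, dim=-1, c=2.0):
    n = a.shape[dim]                        # assumes n >= 2
    s = a.sum(dim=dim) - (n - 1)
    # Engineered so s = -(n - 1) maps to 0 and s = 1 maps to 1.
    return torch.maximum(s / (2 * c * (n - 1)) + 1 / (2 * c),
                         (2 * c - 1) / (2 * c) * s + 1 / (2 * c))

def soft_or(and_fn, a, dim=-1):
    return 1 - and_fn(1 - a, dim=dim)       # De Morgan's law
```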

4.1.3 General Learning Procedure

Now, supposing we have a parametrization of rule structure, we discuss the procedure for learning a rule structure to fit the input data. As discussed in the previous sections, every parametrization has a method for developing an estimated label 푦ˆ from an input

vector 푥. We then use gradient descent to minimize the binary cross entropy loss between our estimates 푦̂ and the true labels 푦 with respect to our parametrization.

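Here is a minimal sketch of this training loop, assuming the predict_fixed function from the earlier sketch and float tensors X and y; the hyperparameters mirror those used in the experiments below (Adam optimizer, 0.01 learning rate, batch size 25).

```python
import torch

def train(Pi, X, y, epochs=100, batch_size=25, lr=0.01):
    opt = torch.optim.Adam([Pi], lr=lr)
    loss_fn = torch.nn.BCELoss()
    for _ in range(epochs):
        perm = torch.randperm(len(X))
        for i in range(0, len(X), batch_size):
            idx = perm[i:i + batch_size]
            # Clamp to avoid log(0) in the binary cross entropy.
            y_hat = predict_fixed(Pi, X[idx]).clamp(1e-6, 1 - 1e-6)
            loss = loss_fn(y_hat, y[idx])
            opt.zero_grad()
            loss.backward()
            opt.step()
    return Pi
```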
4.2 Experiments

Experiments were run with the discussed implementations of the extended AND function. In all cases, we tested the approach on constructed noisy datasets, where each dataset was generated by randomly generating binary vectors in {0, 1}푛, applying an underlying set of logical rules to generate a target label 푦, and randomly perturbing each target label with some small noise probability 푝. We explore the abilities of each approach on these constructions with varying rule complexities, feature set sizes, and noise levels. We test each approach over a grid of hyperparameters controlling for rule size and number of rules.

4.2.1 Data Construction

We first describe how we construct examples for testing our methods. Each constructed dataset is characterized by the number of features 푛, the number of rules used to generate the target 푁, the maximum size of a rule 푅, and an error rate 훿. For each configuration of features, rules, and rule sizes, we randomly select a set of rules to try to learn. We then generate 퐼 random binary vectors, and apply the selected rule set to generate a label for each vector. Finally, we randomly switch each label with probability 훿 to add noise to the data. In table 4.1, we have listed the different configurations of 푛, 푁, 푅, 훿, and 퐼 that we tested.
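A minimal sketch of this construction follows; the generator draws each rule's literals uniformly at random, which is one simple reading of "select a set of rules randomly" rather than the thesis's exact sampling scheme.

```python
import numpy as np

def make_dataset(n=25, N=3, R=3, I=10000, delta=0.01, seed=0):
    rng = np.random.default_rng(seed)
    # Each rule is R literals: (feature index, negated?) pairs.
    rules = [[(int(rng.integers(n)), int(rng.integers(2))) for _ in range(R)]
             for _ in range(N)]
    X = rng.integers(0, 2, size=(I, n))

    def fires(x, rule):
        return all(x[f] == (1 - neg) for f, neg in rule)

    y = np.array([int(any(fires(x, r) for r in rules)) for x in X])
    flip = rng.random(I) < delta              # label noise with probability delta
    return X, np.where(flip, 1 - y, y), rules
```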

4.2.2 Model Comparisons

For the first set of experiments, we run each combination of parametrization and AND function discussed on the initial configuration 1 (25 indicators, 3 rules of size 3, 10000 data points, and a 0.01 error rate). For every combination, we train a model to predict 3 rules (and, if using a fixed-size parametrization, rules of size at most 3). For

Configuration   n    N   R   I        δ
1               25   3   3   10000    0.01
2               15   3   3   10000    0.01
3               50   3   3   100000   0.01
4               100  3   3   1000000  0.01
5               25   2   3   10000    0.01
6               25   4   3   10000    0.01
7               25   5   3   10000    0.01
8               25   6   3   10000    0.01
9               25   3   2   10000    0.01
10              25   3   4   10000    0.01
11              25   3   5   10000    0.01
12              25   3   6   10000    0.01
13              25   3   3   10000    0.001
14              25   3   3   10000    0.005
15              25   3   3   10000    0.05
16              25   3   3   10000    0.1

Table 4.1: List of all constructed dataset configurations for binary data rule learning

the learning process, we train the model using an Adam optimizer with a 0.01 learning rate, using a batch size of 25 examples for 100 epochs to ensure convergence. Each model is trained with five different random parameter initializations, since different initializations may yield convergence to different minima. After confirming each model attains convergence through examination of the loss curves, we report performance using two different metrics. The first metric is the average rule set accuracy over all runs; that is, how accurately each learned rule set predicts the patterns in the given data. The other metric is a rough measure of precision: the fraction of learned rules that are relevant, i.e. the fraction of the learned rules that were actually used to generate the dataset. The results can be found in table 4.2. From these results, we see that the most accurate models for recovering these patterns are those that use either a PReLU-based AND or a product-based AND. A possible explanation is that the sigmoid AND function shrinks the magnitude of the gradient and fails to learn any patterns, while the min-function AND does not propagate gradients back to all of the parameter entries.

Model   Parametrization   AND       Rule Accuracy   Rule Precision
0       Fixed             Min       0.868           0.533
1       Fixed             PReLU     0.990           1.000
2       Fixed             Sigmoid   0.646           0.000
3       Fixed             Product   0.990           1.000
4       Variable          Min       0.753           0.133
5       Variable          PReLU     0.990           1.000
6       Variable          Sigmoid   0.644           0.000
7       Variable          Product   0.990           1.000

Table 4.2: Model Run Results, Constructed Dataset 1

[Figure 4-1: Relation between number of features and model performance. Two panels plot rule accuracy and rule precision against the number of features 푛 for the Fixed/PReLU, Fixed/Product, Variable/PReLU, and Variable/Product models.]

To explore the behavior of these methods on varying numbers of features (i.e. 푛 in the mathematical formulation), we tested the most successful four models from the previous experiment on inputs with varying numbers of features (15, 25, 50, 100) and report how each model type performs in terms of the same metrics (rule set accuracy, rule precision). Again, we use the same learning parameters (Adam optimizer with learning rate 0.01). We train long enough to ensure convergence: 100 epochs for smaller datasets, and 10-20 epochs for the larger datasets. The performances for each model are shown in Figure 4-1, and the table of results can be found in Appendix A in table A.1.

[Figure 4-2: Relation between number of rules and model performance. Two panels plot rule accuracy and rule precision against the number of rules 푁 for the Fixed/PReLU, Fixed/Product, Variable/PReLU, and Variable/Product models.]

From these results, we see that the PReLU-based AND function generally fares slightly worse than the product-based AND, even across varying numbers of input features to discriminate from. However, the performances of the different models are fairly close, with the note that the product-based AND performs markedly better on smaller numbers of features. For the next set of experimental runs, we train the models on dataset configurations 5 through 12, where we vary the number of rules used to generate the pattern and the rule sizes in the pattern. Again, we use the same Adam optimizer under a 0.01 learning rate, training each model for 100 epochs. As before, we report rule set accuracy and rule precision as our metrics of performance. The results under varying numbers of rules are in table A.2 and the results under varying rule size are in table A.3. The results have also been graphed in figures 4-2 and 4-3. As the figures demonstrate, the product-based AND models perform best as the number of rules and the rule size vary, regardless of parametrization. Finally, we explore the performances of different implementations under varying levels of noise. As before, we test each combination of both parametrizations and the PReLU-based and product-based AND functions. We use an Adam optimizer with a learning rate of 0.01, for 100 epochs, on our constructed datasets 13-16, where we

[Figure 4-3: Relation between rule size and model performance. Two panels plot rule accuracy and rule precision against the rule size 푅 for the Fixed/PReLU, Fixed/Product, Variable/PReLU, and Variable/Product models.]

have varied the noise rate, with our original 0.01 error rate for reference. We again report average accuracy and average rule precision over five runs. The results can be found in table A.4, and we have graphed them in figure 4-4. We note that the models, with the exception of the combination of a fixed-length rule parametrization and PReLU-based AND, recover all of the learnable pattern.

4.3 Discussion

Overall, it appears that the best models for predicting rule structure on binary data use either parametrization in combination with a product-based AND gate. Across experiments varying the number of features, rule sizes, number of rules, and noise level, the product-based AND gate models consistently perform best. In terms of overall model performance, we note that across most experiments, the models with the product-based AND gate averaged at least a 90% accuracy. We note that [4] proves a complexity bound for the general inductive logic programming problem; that is, learning Horn clauses to fit data points. Gottlob [4] shows that the complexity class of inductive logic programming is actually harder

than NP-complete; in fact, it is $\Sigma_2^P$-complete.

[Figure 4-4: Relation between noise level and model performance. Two panels plot rule accuracy and rule precision against the noise level 훿 for the Fixed/PReLU, Fixed/Product, Variable/PReLU, and Variable/Product models.]

Thus, we cannot reasonably expect a polynomial-time deterministic algorithm here that always recovers the original pattern. However, we have shown that our approach, using random parameter initializations, recovers nearly 90% of the pattern with the fixed-length rule parametrization and product-based AND gate. Further, one of the most promising aspects of the developed approach is evident through our experiments varying the noise level. We particularly note that for higher error rates, the rule set never fits the noise; that is, the training set accuracy of our model never exceeds 1 − 훿, the maximum learnable pattern. With reasonable hyperparameter controls restricting rule size and rule number, we can be confident that the approach developed suffers less from the overfitting issues of models like deep neural networks due to its complexity restrictions.

Chapter 5

Inducing Rules on Numerical Data

In the previous chapter, we discussed how to induce logical rules explaining a target attribute from binary input data using a gradient-descent based procedure. Here, we discuss how we learn logical rules from non-binary, numerical data. We explore the ability to learn interpretable binary features from these numerical data in conjunction with logical rule induction, and discuss the merits of this approach relative to the LIME technique.

5.1 Approach Details

Extending the methods in chapter 4 to numerical data requires a mapping of numerical values into [0, 1]. To do this, we use the sigmoid function to map values into the desired range. Specifically, we transform our numerical inputs by a function

푇 : R → [0, 1] defined as:

푇 (푥) = 휎(푎(푥 − 푏))

We note that as 푎 → ∞, 푇 (푥) becomes a threshold gate, which is 1 for 푥 > 푏 and 0 for 푥 < 푏. Our procedure is then as follows: we maintain parameters 푎, 푏 of the same dimension as the inputs 푥, and we compute the transformation

푥푇 = 푇 (푥) as our updated “binarized” inputs in [0, 1]. We then use 푥푇 as our inputs

to the model described in Chapter 4. During the gradient descent optimization step, we update the parameters 푎, 푏 along with the inductive logic parameters.
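A minimal sketch of this transformation as a learnable module, assuming per-feature parameters 푎 and 푏 (the initial values here are illustrative):

```python
import torch

class Binarizer(torch.nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.a = torch.nn.Parameter(5.0 * torch.ones(n_features))  # sharpness
        self.b = torch.nn.Parameter(torch.zeros(n_features))       # threshold

    def forward(self, x):
        # T(x) = sigmoid(a * (x - b)); as a grows, this approaches a hard
        # threshold gate at b.
        return torch.sigmoid(self.a * (x - self.b))
```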

5.1.1 Feature Expansion

It is possible that a model could require multiple binary features for a single continuous feature; i.e. one such theory might be whether 푥 > 8 or 푥 < 3. As such, for an

푛-length input vector 푥, we construct the 푘푛-length transformed inputs 푥푇

$$x_T[i] = \sigma\big(a[i]\,(x[\lfloor i/k \rfloor] - b[i])\big)$$

We will refer to the hyperparameter 푘 as the feature expansion constant.
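A minimal sketch of the expansion, reading the index in the formula above as the output index 푖, so that each of the 푘 copies of a feature carries its own sharpness and threshold:

```python
import torch

def expand(x, a, b, k):
    # x: (B, n); a, b: (k * n,) learnable parameters, one pair per copy.
    x_rep = x.repeat_interleave(k, dim=-1)   # x_rep[i] = x[floor(i / k)]
    return torch.sigmoid(a * (x_rep - b))    # (B, k * n) soft binary features
```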

5.1.2 Interpretability

As before, our main interest in rule learning stems from the necessity of learning interpretable machine learning models. In order to maintain interpretability, we utilize the fact that for large 푎, 푇 (푥) approaches a threshold gate. Therefore, we can interpret the resulting features 푥푇 as simply thresholded features for whether 푥 is larger or smaller than the cutoff 푏, which is easily understandable.

5.2 Experiments

As in chapter 4, we construct several datasets on which to test our methodology. For these experiments, our constructed datasets are characterized by the following parameters:

∙ 훿: the error rate of the target. These are taken in the set {0.001, 0.005, 0.01, 0.05, 0.1}

∙ 푁: the number of rules. We vary this on integers from 1 to 5.

∙ 푅: the maximum rule size. We vary this on integers from 1 to 5.

Each dataset is then constructed by generating 푛-length vectors with values in the range [0, 1]. We then randomly generate a rule pattern with 푁 rules of size at most 푅,

and assign a label to each vector based on the rule pattern. We finally perturb these labels with independent probability 훿 for each label. The parameter configurations governing our constructed datasets are listed in table 5.1.

Configuration   N   R   δ
1               2   2   0.01
2               2   2   0.001
3               2   2   0.005
4               2   2   0.05
5               2   2   0.1
6               1   2   0.01
7               3   2   0.01
8               4   2   0.01
9               5   2   0.01
10              2   1   0.01
11              2   3   0.01
12              2   4   0.01
13              2   5   0.01

Table 5.1: List of all constructed dataset configurations for numerical rule learning

5.2.1 Experimental Results

We first test the best methods from chapter 4 with the numerical binarization technique described in section 5.1. That is, we train both parametrizations introduced in chapter 4 with a product-based AND function and a feature expansion constant of 2. In all cases, we report the accuracy of the learned rule set as our metric of performance. First, we vary the error rate 훿 over the set {0.001, 0.005, 0.01, 0.05, 0.1}. For each parametrization, we run the training process with five different random parameter initializations, and we report the average accuracy of the learned rule set over all initializations. We have graphed our results in figure 5-1, and the results are also tabulated in the appendix in table A.5. In general, we can see that the variable-length parametrization fares slightly worse than the fixed-length parametrization for rule structure.

[Figure 5-1: Relation between error rate and numerical model performance. Rule set accuracy is plotted against the error rate 훿 for the fixed and variable parametrizations.]

For the next set of experiments, we varied the number of rules used to generate the label pattern and the size of the rules used to create the label pattern. Again, for each parametrization, we run the training process with five different random parameter initializations, and we report the average accuracy of the learned rule set. We have graphed the average accuracy as we vary the number of rules in figure 5-2 and as we vary the rule size in figure 5-3. The results for varying the number of rules are in table A.6, and the results for varying rule size are in table A.7. Overall, across both experiments, the variable-length rule parametrization fares slightly worse than the fixed-length parametrization, similar to the earlier experiments varying the error rate.

[Figure 5-2: Relation between number of rules and numerical model performance. Rule set accuracy is plotted against the number of rules for the fixed and variable parametrizations.]

[Figure 5-3: Relation between rule size and numerical model performance. Rule set accuracy is plotted against the rule size for the fixed and variable parametrizations.]

5.3 Discussion

The best model for our gradient descent-based rule learning procedure uses the fixed-length parametrization with a product-based AND function. We also note that, in general, the models we trained achieve very high performance; they are able to recover most, if not all, of the pattern used to generate the target. Both of the models tested recovered over 90% of the pattern present, and the fixed-length parametrization recovered around 95% of the pattern even as rule size and rule number varied.

5.3.1 Comparison to LIME

In this section, we show how the rules learned by logical induction produce a more interpretable model than LIME applied to a more sophisticated model. We utilize our constructed dataset configuration 1, which has 10 input features, 10,000 data points, and a label generated as follows:

푇 ← (푋1 < 0.8) ∧ (푋2 > 0.3)

푇 ← (푋0 < 0.4)

We first confirm that our inductive model is able to recover these rules. Indeed, it recovered the following rules with an overall 98% accuracy.

푇̂ ← (푋1 < 0.7969) ∧ (푋2 > 0.3135)

푇̂ ← (푋0 < 0.4042)

For comparison, we also trained a multilayer perceptron model on this dataset. We then ran the LIME technique around the following randomly chosen point of interest, and we report the regression coefficients in table 5.2. We note that the regression coefficient with the largest magnitude actually belongs to feature 6, which has no correlation with the target variable. In fact, the magnitudes of all coefficients are less than 0.05, suggesting that changing none of the features would change the target label. Clearly, the LIME regression around the point fails to account for global behavior, especially the feature interaction between features 1 and 2. We note that our rule set provides an interpretable model that captures this interaction; specifically, our model suggests that simultaneously decreasing feature 1 and increasing feature 2 changes the classification, whereas the LIME method predicts no feasible way to change the classification. Thus, it is clear that there are points of interest and scenarios for which the LIME method fails to produce viable explanations, but for which the model structure is learnable and interpretable under our inductive logic programming technique.

Feature   Point of Interest Value   LIME Coefficient
0         0.965                     0.0285
1         0.201                     -0.0312
2         0.888                     0.0197
3         0.735                     0.0002
4         0.135                     0.0142
5         0.489                     -0.0115
6         0.963                     -0.0405
7         0.152                     -0.0068
8         0.777                     0.0141
9         0.912                     -0.0285

Table 5.2: LIME Coefficients around test point

Chapter 6

FICO Home Equity Line of Credit Tests

Finally, we test our methodology for rule learning on the FICO Home Equity Line of Credit (HELoC) dataset. We start by describing the dataset and providing brief context for the target variable being predicted. We then discuss applying the methods of chapter 4 to binary data obtained by transforming the original HELoC dataset using the weight-of-evidence bins. We then discuss applying the methods of chapter 5 for rule learning over numerical data to the HELoC dataset, before concluding with a discussion of the results.

6.1 Dataset

We begin by describing the FICO Home Equity Line of Credit (HELoC) dataset. A home equity line of credit is a loan in which the collateral is the loanee’s equity in their house. This dataset consists of 10,459 loanees who took out a home equity line of credit between March 2000 and March 2002. A year later, a label of “creditworthy” or “non-creditworthy” was assigned to each loanee based on their performance, where delinquent or charged-off loanees were labeled “non-creditworthy”. We seek to predict this creditworthiness rating using the 23 other features in the dataset, which are:

1. ExternalRiskEstimate: A metric of the loanee’s credit risk computed by FICO

51 2. MSinceFirstLOC: Months since the first line of credit was opened

3. MSinceNewestLOC: Months since the newest line of credit was opened

4. AvgAgeOfLOC: Average age of all existing lines of credit

5. NumLOCNotDelq: Number of lines of credit not currently delinquent

6. NumLOC60PlusDaysDelq: Number of lines of credit that have been 60 or more days delinquent at some point in time

7. NumLOC90PlusDaysDelq: Number of lines of credit that have been 90 or more days delinquent at some point in time

8. PercentLOCNeverDelq: Percentage of lines of credit that have never been delin- quent

9. MSinceMRecentDelq: Months since the most recent delinquency

10. MaxDelqLast12M: Maximum delinquency in days in the past year

11. MaxDelqEver: Maximum delinquency in days

12. NumTotalLOC: Number of total lines of credit opened

13. NumLOCInLast12M: Number of lines of credit opened in the last year

14. PercentInstLOC: Percentage of lines of credit that are installment lines of credit

15. MSinceNewLOCReqExPastWeek: Months since the most recent request for a line of credit, excluding the past week

16. NumLOCReqLast6M: Number of lines of credit requested in the past six months

17. NumLOCReqLast6MExPastWeek: Number of lines of credit requested in the past six months, excluding the past week

18. FracRevLOCLimitUse: Fraction of revolving credit limits in use

19. FracInstLOCUse: Fraction of installment lines of credit in use

20. NumRevLOCWBalance: Number of revolving lines of credit with positive balance

21. NumInstLOCWBalance: Number of installment lines of credit with positive balance

22. NumBankOrNatlLoansWHighUtil: Number of bank loans and national loans with high utilization

23. PercentLOCWBalance: Percentage of lines of credit with a positive balance

A few features take on special or unique values; a full description of these features is given in [7]. To prepare the dataset for training, we drop examples with missing data; we use the remaining 9,861 examples for our experiments.

6.1.1 Weight-of-Evidence Encoding

The weight-of-evidence (WoE) encoding is a technique for binning continuous or categorical values and assigning a value to each bin. Given a bin $b$, let $P_b$ be the fraction of individuals in bin $b$ with a "creditworthy" rating, and let $N_b$ be the fraction of individuals in bin $b$ with a "non-creditworthy" rating. Then the weight-of-evidence value for bin $b$ is given by:

$$W_b = 100 \ln\left(\frac{P_b}{N_b}\right)$$

The weight-of-evidence value of a bin intuitively represents the ability of the bin to separate creditworthy and non-creditworthy applicants [7]. The weight-of-evidence bins provide a natural mechanism for introducing binary variables to encapsulate the data; specifically, by one-hot encoding the bins given by the weight-of-evidence technique, we obtain a natural set of binary indicators for each input feature.
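As a concrete illustration, the weight-of-evidence value of each bin can be computed directly from this definition; a sketch assuming pandas and hypothetical column names:

    import numpy as np
    import pandas as pd

    def woe_per_bin(df: pd.DataFrame, bin_col: str, label_col: str) -> pd.Series:
        """Compute W_b = 100 * ln(P_b / N_b) for each bin of `bin_col`.

        `label_col` is assumed to hold 1 for "creditworthy" and 0 for
        "non-creditworthy"; P_b and N_b are the within-bin fractions of
        each rating, following the definition in the text.
        """
        counts = df.groupby([bin_col, label_col]).size().unstack(fill_value=0)
        P_b = counts[1] / counts.sum(axis=1)  # fraction creditworthy in bin b
        N_b = counts[0] / counts.sum(axis=1)  # fraction non-creditworthy in bin b
        return 100 * np.log(P_b / N_b)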

6.2 Binary Data Rule Learning

For our initial set of experiments, we train the models developed in chapter 4 on a one-hot encoding of the weight-of-evidence bins as our inputs. We describe our hyperparameter configurations, and report our results in terms of rule set accuracy.

6.2.1 Experiments

We run a set of experiments testing the best model developed in the previous chapters (i.e., a fixed-length parametrization and a product-based AND gate) on the binarized features for the categories determined by the weight-of-evidence encoding. To test our models, we randomly select 60% of the examples for training, and reserve the remaining out-of-sample examples for validation and testing. We first learn rules that predict whether a loanee will not be creditworthy. We run a hyperparameter grid search varying the rule size and the number of rules, each over the range 2-5, training each hyperparameter configuration 5 times and taking the best of those five runs in terms of training accuracy; a minimal sketch of this search appears below. The results for each hyperparameter configuration are reported in table A.8, where we report both the accuracy of the learned rule set on the training data and the accuracy on an out-of-sample validation set. For the sake of completeness, we also train a set of models over the same hyperparameter grid to predict the inverse of our original target label; that is, we attempt to predict whether a loanee will be creditworthy. The results are reported in table A.9. As before, we report the accuracy of the learned rule set on training data and out-of-sample validation data.
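The search itself is a simple best-of-five grid; in the sketch below, train_model is a hypothetical handle to the chapter-4 training routine, assumed to return a fitted model exposing a train_accuracy attribute:

    import itertools

    def grid_search(X_train, y_train, train_model, runs_per_config=5):
        """Best-of-five grid search over rule count and rule size (both 2-5)."""
        best = {}
        for num_rules, rule_size in itertools.product(range(2, 6), range(2, 6)):
            models = [
                train_model(X_train, y_train,
                            num_rules=num_rules, rule_size=rule_size)
                for _ in range(runs_per_config)
            ]
            # Keep the best of the five runs by training accuracy.
            best[(num_rules, rule_size)] = max(models,
                                               key=lambda m: m.train_accuracy)
        return best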

Results and Analysis

Overall, the best models trained to predict non-creditworthiness achieve training rule set accuracies in the range [0.66, 0.72], and the best models trained to predict creditworthiness achieve training accuracies in the range [0.64, 0.74]. Further, for nearly all of the models, the difference between training set accuracy and validation set accuracy is very small, under 3%; it seems likely that the strong restrictions imposed by our inductive model help safeguard against overfitting.

We evaluate the hyperparameter combinations by selecting for the best performance on the validation set. For the rules trained to predict non-creditworthiness, the best combination of hyperparameters uses five rules, each having at most three conditions; the rules are given in table 6.1 and have a combined accuracy of 71.7% on the test set. For the rules predicting creditworthiness, the best combination of hyperparameters uses three rules of size at most two; the rules have a combined accuracy of 71.1%. We remark that after eliminating rule redundancies, this rule set can actually be expressed using two rules, which we show in table 6.2 along with the individual rule accuracies.

Rule                                                                 Accuracy
(NumLOCInLast12M < 4) and (FracRevLOCLimitUse ≥ 77)
  and (PercentLOCNeverDelq ≥ 89)                                     80.98%
(PercentLOCWBalance ≥ 50) and (ExternalRiskEstimate < 64)            82.63%
(AvgAgeOfLOC < 75) and (MSinceNewLOCReqExPastWeek < 1)
  and (MSinceNewLOCReqExPastWeek has usable trades)                  70.7%
(PercentLOCNeverDelq < 82) and (MSinceMRecentDelq < 16)              82.17%
(ExternalRiskEstimate < 71) and (MSinceNewLOCReqExPastWeek < 1)
  and (MSinceNewLOCReqExPastWeek has usable trades)                  79.06%

Table 6.1: Rules for learning non-creditworthiness

Rule                                                                 Accuracy
(ExternalRiskEstimate ≥ 68) and (MSinceNewLOCReqExPastWeek ≥ 1)      73.75%
(ExternalRiskEstimate ≥ 75) and (PercentLOCNeverDelq ≥ 96)           75.33%

Table 6.2: Rules for learning creditworthiness
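Each learned rule translates directly into a checkable predicate over the original features. A sketch for the second rule of table 6.1, assuming a pandas DataFrame with the dataset's column names and a hypothetical 0/1 label column non_creditworthy:

    import pandas as pd

    def rule_fires(df: pd.DataFrame) -> pd.Series:
        """Second rule of table 6.1: True where it predicts non-creditworthiness."""
        return (df["PercentLOCWBalance"] >= 50) & (df["ExternalRiskEstimate"] < 64)

    # One plausible way to score an individual rule: the fraction of examples
    # on which the rule's prediction agrees with the true label.
    # accuracy = (rule_fires(df).astype(int) == df["non_creditworthy"]).mean()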

6.3 Numerical Data Rule Learning

In this section, we apply the methods developed in chapter 5 to the HELoC data without binarization via the weight-of-evidence bins. Here, we start with the original numeric features of the HELoC dataset. Note that the binarization technique developed in chapter 5 learns a threshold gate; we develop binary features based on comparisons between numeric features and cutoffs. Hence, any

monotonic transformation of our numeric data will preserve learnable cutoffs. To facilitate learning by normalizing all features to the same scale, we use the percentile rank of each feature instead of the original numeric values, since the percentile rank is a monotonic transformation. As in the binary learning section, we run two hyperparameter searches, training two separate sets of models to predict creditworthiness and non-creditworthiness. We train our numerical data learning model with a feature expansion constant of 3, varying the number of rules and the rule size from 2 to 5. We report our results for learning non-creditworthiness in table A.10, and our results for learning creditworthiness in table A.11. As our performance metrics, we report the accuracies on the training data and on an out-of-sample validation set, where we use only 60% of the data for training, reserving the rest for validation and testing.
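The percentile-rank transformation above takes only a few lines; a sketch using scipy (one of several equivalent implementations):

    import numpy as np
    from scipy.stats import rankdata

    def percentile_rank(column: np.ndarray) -> np.ndarray:
        """Map each value to its percentile rank in [0, 1]; ties share a rank."""
        ranks = rankdata(column, method="average")  # ranks run from 1 to n
        return (ranks - 1) / (len(column) - 1)

    # The transformation is monotonic, so any cutoff b learned on the ranks
    # corresponds to a cutoff on the original feature values.
    print(percentile_rank(np.array([3.2, 7.7, 1.0, 7.7, 5.5])))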

6.3.1 Results and Analysis

We note that across all the hyperparameter configurations, the numerical data learning performs more consistently than the binary data learning using the weight-of-evidence encoding; with only a few exceptions, nearly all of the hyperparameter configurations achieve at least 70% accuracy on the training and validation data. Once again, we evaluate the performance of a hyperparameter configuration by the validation set performance of its best run. The best hyperparameter configuration for learning non-creditworthiness uses four rules of size four (given in table 6.3), and the best hyperparameter configuration for learning creditworthiness uses two rules of size five (given in table 6.4). The best configuration for learning non-creditworthiness has a test set accuracy of 71.0%, and the best configuration for learning creditworthiness has a test set accuracy of 71.3%. Neither of these models reaches the 71.7% accuracy of our binary data approach, but both are very close to that threshold. We also note that the numerical methods are able to recover distinct numerical cutoffs; for example, the rule set for learning creditworthiness learns a useful cutoff for the ExternalRiskEstimate feature at 54, which is not one of the original weight-of-evidence bin thresholds. Moreover, some of the weight-of-evidence bin thresholds were recovered organically by the model (for example, the cutoff near 60 for the AvgAgeOfLOC feature), suggesting that our numerical model is able to learn reasonable thresholds.

Rule                                                                 Accuracy
(ExternalRiskEstimate ≤ 69)
  and (NumInstLOCWBalance has usable trades)                         75.35%
(FracRevLOCLimitUse > 54) and (MSinceMRecentDelq ≤ 14)               82.84%
(AvgAgeOfLOC ≤ 61) and (No NewLOCReqExPastWeek)
  and (PercentLOCWBalance has usable trades)                         76.46%
(FracRevLOCLimitUse ≤ 74) and (PercentLOCNeverDelq ≤ 86)
  and (MSinceNewLOCReqExPastWeek has usable trades)
  and (PercentLOCWBalance has usable trades)                         74.53%

Table 6.3: Rules for learning non-creditworthiness, numerical data

Rule                                                                 Accuracy
(FracRevLOCLimitUse ≤ 63) and (MaxDelqLast12M > 6)
  and (ExternalRiskEstimate > 71) and (PercentLOCNeverDelq > 92)
  and (FracRevLOCLimitUse has usable trades)                         74.51%
(ExternalRiskEstimate > 54) and (PercentLOCNeverDelq ≥ 96)
  and (MSinceNewLOCReqExPastWeek has no usable trades)
  and (MSinceMRecentDelq has usable trades)                          87.62%

Table 6.4: Rules for learning creditworthiness, numerical data


6.4 Discussion

We conclude with a discussion of the methods we have developed and their applicability to the FICO Home Equity Line of Credit dataset. We compare the performance of our methods to other machine learning models, analyze the generalizability of the learned models, and discuss some shortcomings of our approach.

Model                                         Test Accuracy
Multilayer Perceptron                         74.7%
Randomized Forests                            72.3%
Descent-Based Inductive Logic Programming     71.7%

Table 6.5: Comparison between our descent-based inductive logic programming and other models

6.4.1 Comparisons to Related Work

For comparison, we have included results from related papers for other models on this dataset in table 6.5; the related model results are from [7]. The rules learned by our best inductive logic programming method are comparable in performance to randomized forests and not far from the performance of multilayer perceptron models: the best multilayer perceptron is 3% better than our inductive logic programming methodology. Given that even neural networks fail to achieve a significantly better accuracy, it appears that the maximum learnable accuracy from the given inputs is not much higher than the performance of our descent-based model. Overall, our model is fairly successful at discovering a significant fraction of the learnable pattern, and in an interpretable format; it is quite remarkable that most of the learnable pattern can be expressed with as few as five rules.

6.4.2 Generalizability

From the results in tables A.8 to A.11, we note that the greatest discrepancy between validation set accuracy and training set accuracy was less than 4%, indicating that the rules fit to the training data generalized to the out-of-sample validation data with little change in performance.

However, beyond our learned rule sets yielding comparable performance on the training and validation datasets, the interpretability of our methodology allows for human control over generalizability. The fixed-size parametrization limits the number and size of the learned rules, which helps prevent overfitting. Finally, because our model is interpretable, we can analyze each individual rule for sensibility, discarding rules that are nonsensical.

Figure 6-1: Example Training Run, Plotting Rule Set Accuracy on Training Data and Average Loss During Training over 100 Epochs


6.4.3 Approach Shortcomings

After observing model training performance during the hyperparameter searches, it became apparent that the models we trained were able to learn most of the pattern very quickly, jumping close to the final rule accuracy in the early epochs. In figure 6-1, we plot the rule set accuracy on the training data and the average value of our loss function after each of one hundred training epochs. Examining the plots in figure 6-1, we first remark that our model converges to some minimum; we observe this from the loss function flatlining near 0.53 at the end of training. We also note that the rule set accuracy seems to peak around halfway through the training process. Examining the underlying values, the rule set accuracy reaches a maximum of 0.7219 after epoch 48 and then decreases slightly, ending at 0.7099. This decay results in a net accuracy decrease of 1.2%, which is fairly small compared to the overall accuracy

magnitude. Examining many of the 160 runs in our hyperparameter search, we notice this behavior, in varying degrees, across many of the training runs. Specifically, we concern ourselves with the phenomenon where a period of normal training behavior (increasing accuracy, decreasing loss) is followed by a small decrease in training accuracy even though the loss function indicates the model is still converging. We term this phenomenon approximation breakdown. We note that approximation breakdown is distinct from overfitting, since the measured accuracy is taken over the same data used for training rather than out-of-sample data; one would expect the decrease in the loss function to yield higher accuracy.

To explain this phenomenon, we recall some details of our model implementation. Recall that when we transform our numerical indicators into the $[0, 1]$ range, we utilize the following transformation:

$$T(x) = \sigma(a(x - b))$$

As noted earlier, as $a \to \infty$, the binarized input $T(x)$ approaches the ideal threshold gate at value $b$. However, we cannot guarantee that the model will take a learned parameter $a$ to be arbitrarily large.
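The effect is easy to see numerically; a small sketch evaluating the soft gate at a few scales (the values of a and b are chosen purely for illustration):

    import numpy as np

    def T(x, a, b):
        """Sigmoid approximation of a threshold gate at cutoff b with scale a."""
        return 1.0 / (1.0 + np.exp(-a * (x - b)))

    x = np.linspace(0.0, 1.0, 5)
    for a in (1, 10, 100):
        # As a grows, T(x) approaches a hard step at b = 0.5; at a moderate
        # learned a, the gate stays soft, contributing to approximation breakdown.
        print(a, np.round(T(x, a, 0.5), 3))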

Similarly, recall the details of our fixed-length rule parametrization. We maintain an $N \times R \times 2n$ tensor $\Pi$, and interpret $\mathrm{softmax}(\Pi[i, j])$ as a probability distribution over the potential features for the $j$th term of the $i$th rule. Model convergence does not require this probability distribution to approach a $\delta$-distribution; that is, the optimal probability distribution for our model design may not assign all of the probability mass to one feature.

In summary, we can attribute approximation breakdown to the discrepancy between our model structure and perfect logical rules. We recognize that optimal

parameter values for our model may not yield perfect logical values, analogous to how relaxations of integer linear programs can yield non-integral optimal solutions. For future work, one could explore potential ways to control the fidelity of the approximation. For example, to manually control the fidelity of the numerical approximation (i.e., how close the sigmoid function is to a true threshold gate), we can directly regulate the scaling constant.
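One hypothetical way to regulate the scaling constant is to multiply the learned scale by a schedule that grows over training, pushing each gate toward a hard threshold; in the sketch below, the linear schedule and the max_boost constant are our own illustrative choices:

    import torch

    a = torch.nn.Parameter(torch.tensor(1.0))  # learned scale
    b = torch.nn.Parameter(torch.tensor(0.5))  # learned cutoff

    def binarize(x: torch.Tensor, epoch: int, total_epochs: int,
                 max_boost: float = 50.0) -> torch.Tensor:
        # The boost factor grows linearly from 1 to 1 + max_boost, so the
        # sigmoid sharpens toward a true threshold gate late in training.
        boost = 1.0 + max_boost * epoch / total_epochs
        return torch.sigmoid(boost * a * (x - b))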

Chapter 7

Conclusion and Next Steps

This thesis sought to formulate a method for learning logical rules for arbitrary supervised binary classification problems that satisfies the goals of accuracy, generalizability, and interpretability. We conclude by summarizing our key ideas and results, and indicating directions for future work.

7.1 Key Ideas

Our general framework for learning rules is as follows (a minimal sketch of the training loop appears after the list):

1. Create a parametrization for rule structure, and select an extension of the logical AND function for inputs on [0, 1].

2. For each training example:

(a) Use the parametrization given to generate an estimated target label.

(b) Compute the binary cross-entropy loss between the estimated target label and the true label for this example.

(c) Utilize backpropagation to estimate the gradient for the model parameters, and minimize the binary cross-entropy loss using gradient descent.

3. Repeat the loop for the desired number of training epochs.
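As a concrete illustration of this loop, here is a minimal PyTorch sketch using the fixed-size parametrization and product-based AND described below; the toy pattern, tensor sizes, and the probabilistic OR used to combine rules are illustrative assumptions rather than the exact thesis implementation:

    import torch

    # Illustrative sizes: N rules, R terms per rule, n binary input features.
    N, R, n = 3, 2, 8
    Pi = torch.nn.Parameter(torch.randn(N, R, 2 * n))  # fixed-size parametrization
    opt = torch.optim.Adam([Pi], lr=0.05)

    def predict(x):
        # x*: concatenation of x and 1 - x, so literals include negations.
        x_star = torch.cat([x, 1 - x], dim=1)
        # Each term is a softmax-weighted mixture over the 2n literals.
        term = torch.einsum("bf,nrf->bnr", x_star, Pi.softmax(dim=-1))
        rule = term.prod(dim=2)            # product-based AND over a rule's terms
        return 1 - (1 - rule).prod(dim=1)  # soft OR combining the N rules

    # Toy training data: the target pattern is (x0 AND NOT x1).
    x = torch.randint(0, 2, (512, n)).float()
    y = x[:, 0] * (1 - x[:, 1])

    for epoch in range(200):
        opt.zero_grad()
        y_hat = predict(x).clamp(1e-6, 1 - 1e-6)  # estimated target labels
        loss = torch.nn.functional.binary_cross_entropy(y_hat, y)
        loss.backward()                            # backpropagate
        opt.step()                                 # gradient descent step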

We discussed two ways to parametrize rule structure, and four ways to extend the logical AND function to [0, 1]. After experimenting with different methods in chapters 4 and 5, we concluded that the best performing model structure utilizes a fixed-size parametrization and a product-based AND.

In our fixed-size rule parametrization, we use a parameter $\Pi$ of size $N \times R \times 2n$, where $N$ is a hyperparameter governing the number of rules, $R$ is a hyperparameter governing the rule size, and $n$ is the number of input features. The product-based AND extends $a \wedge b$ with $a \mathbin{\hat{\wedge}} b = a \cdot b$. Under this modeling methodology, the estimated target label for input $x$ is given by

$$\hat{y} = \widehat{\bigvee_i} \, \widehat{\bigwedge_j} \left( x^* \cdot \mathrm{softmax}(\Pi[i, j]) \right)$$

where $x^*$ denotes the concatenation of $x$ and $1 - x$.

To extend the input to numerical data, we binarize numerical features by approximating a threshold gate $\sigma(a(x - b))$, where $a$ and $b$ are learnable parameters. We can then feed the result into our model from chapter 4 and repeat the same learning process.
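A sketch of this binarization step as a learnable PyTorch layer; the class name is ours, and the expansion parameter mirrors the feature expansion constant of 3 used in chapter 6:

    import torch

    class ThresholdBinarizer(torch.nn.Module):
        """Learnable soft threshold gates sigma(a * (x - b)), several per feature."""

        def __init__(self, n_numeric: int, expansion: int = 3):
            super().__init__()
            # One (a, b) pair per gate, with `expansion` gates per feature.
            self.a = torch.nn.Parameter(torch.ones(n_numeric, expansion))
            self.b = torch.nn.Parameter(torch.rand(n_numeric, expansion))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, n_numeric) -> soft binary features in [0, 1],
            # shape (batch, n_numeric * expansion), ready for the chapter-4 model.
            z = self.a * (x.unsqueeze(-1) - self.b)
            return torch.sigmoid(z).flatten(start_dim=1)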

We tested our model on constructed datasets with varying pattern structures and noise levels, finding that it can recover rules very accurately. We noted that in particular scenarios, our model learns logical rules that accurately explain the global structure of a pattern, whereas the LIME method applied to a multilayer perceptron does not generate reasonable explanations of model behavior at a global scale.

Finally, we tested our methods on the FICO Home Equity Line of Credit dataset. We tested our binary data approach on features binarized with the weight-of-evidence bins, and our numerical approach on the feature percentile ranks. Our best models achieve accuracies between 71% and 72%, performing comparably to multilayer perceptrons (74.7%) and randomized forests (72.3%).

7.2 Future Work

In the last section of chapter 6, we introduced the approximation breakdown phenomenon of our model; that is, the optimal parameter values minimizing the binary cross-entropy loss may not converge to exact logical equivalents. One future direction involves addressing this discrepancy; there are various potential ways to minimize approximation breakdown, including adding terms to the loss function and controlling the scaling parameters.

Another potential direction is to explore alternative binarization techniques for numerical input data. In this thesis, the only binarization technique we explored involved learning a threshold gate via a sigmoid approximation. Future work could explore alternatives; for example, one could use logit-based or probit-based approaches to generate more intricate binary features. Additionally, one advantage of using gradient descent to learn logical rules is that it allows optimization of all parameters along the computation path for the estimated labels; therefore, we can chain more complicated machine learning models with our inductive logic method to perform more advanced feature binarization.

Throughout this thesis, we used inductive logic programming to independently develop interpretable models. However, we can also use our inductive logic programming methods to approximate and explain deep learning models. Specifically, we can approximate intermediate features within more complex neural network structures by training rules to predict those intermediate features. Thus, another potential application of our methods is to generate interpretable rule-based explanations for existing blackbox models.

Lastly, future research can extend descent-based inductive logic programming to tasks beyond binary classification. Potential directions include multilabel classification, learning rules for different labels, and learning numerical output labels.

Appendix A

Tables

Model   Parametrization   AND       n     Rule Set Accuracy   Rule Precision
2       Fixed             PReLU     15    0.945               0.667
2       Fixed             PReLU     25    0.990               1.000
2       Fixed             PReLU     50    0.879               0.600
2       Fixed             PReLU     100   0.958               1.000
4       Fixed             Product   15    0.992               1.000
4       Fixed             Product   25    0.990               1.000
4       Fixed             Product   50    0.971               0.933
4       Fixed             Product   100   0.990               1.000
6       Variable          PReLU     15    0.905               0.600
6       Variable          PReLU     25    0.990               1.000
6       Variable          PReLU     50    0.894               0.733
6       Variable          PReLU     100   0.878               0.933
8       Variable          Product   15    0.992               1.000
8       Variable          Product   25    0.990               1.000
8       Variable          Product   50    0.898               0.733
8       Variable          Product   100   0.958               0.867

Table A.1: Model Run Results, Varying Number of Features

Model   Parametrization   AND       N    Rule Set Accuracy   Rule Precision
2       Fixed             PReLU     2    0.991               1.000
2       Fixed             PReLU     3    0.990               1.000
2       Fixed             PReLU     4    0.882               0.600
2       Fixed             PReLU     5    0.943               0.880
2       Fixed             PReLU     6    0.807               0.267
4       Fixed             Product   2    0.991               1.000
4       Fixed             Product   3    0.990               1.000
4       Fixed             Product   4    0.990               1.000
4       Fixed             Product   5    0.990               1.000
4       Fixed             Product   6    0.990               1.000
6       Variable          PReLU     2    0.991               1.000
6       Variable          PReLU     3    0.990               1.000
6       Variable          PReLU     4    0.990               1.000
6       Variable          PReLU     5    0.911               0.840
6       Variable          PReLU     6    0.718               0.167
8       Variable          Product   2    0.991               1.000
8       Variable          Product   3    0.990               1.000
8       Variable          Product   4    0.990               1.000
8       Variable          Product   5    0.979               0.960
8       Variable          Product   6    0.983               0.967

Table A.2: Model Run Results, Varying Number of Rules

Model   Parametrization   AND       R    Rule Set Accuracy   Rule Precision
2       Fixed             PReLU     2    0.990               1.000
2       Fixed             PReLU     3    0.990               1.000
2       Fixed             PReLU     4    0.990               1.000
2       Fixed             PReLU     5    0.970               0.933
2       Fixed             PReLU     6    0.974               0.933
4       Fixed             Product   2    0.990               1.000
4       Fixed             Product   3    0.990               1.000
4       Fixed             Product   4    0.990               1.000
4       Fixed             Product   5    0.991               1.000
4       Fixed             Product   6    0.990               1.000
6       Variable          PReLU     2    0.659               0.333
6       Variable          PReLU     3    0.990               1.000
6       Variable          PReLU     4    0.850               0.733
6       Variable          PReLU     5    0.779               0.400
6       Variable          PReLU     6    0.969               0.733
8       Variable          Product   2    0.990               1.000
8       Variable          Product   3    0.990               1.000
8       Variable          Product   4    0.990               1.000
8       Variable          Product   5    0.991               1.000
8       Variable          Product   6    0.990               1.000

Table A.3: Model Run Results, Varying Rule Size

Model   Parametrization   AND       δ       Rule Set Accuracy   Rule Precision
2       Fixed             PReLU     0.001   0.998               1.000
2       Fixed             PReLU     0.005   0.938               1.000
2       Fixed             PReLU     0.01    0.990               1.000
2       Fixed             PReLU     0.05    0.924               0.800
2       Fixed             PReLU     0.1     0.899               1.000
4       Fixed             Product   0.001   0.998               1.000
4       Fixed             Product   0.005   0.995               1.000
4       Fixed             Product   0.01    0.990               1.000
4       Fixed             Product   0.05    0.949               1.000
4       Fixed             Product   0.1     0.899               1.000
6       Variable          PReLU     0.001   0.998               1.000
6       Variable          PReLU     0.005   0.995               1.000
6       Variable          PReLU     0.01    0.990               1.000
6       Variable          PReLU     0.05    0.949               1.000
6       Variable          PReLU     0.1     0.899               1.000
8       Variable          Product   0.001   0.998               1.000
8       Variable          Product   0.005   0.995               1.000
8       Variable          Product   0.01    0.990               1.000
8       Variable          Product   0.05    0.949               1.000
8       Variable          Product   0.1     0.899               1.000

Table A.4: Model Run Results, Varying Noise Level

Parametrization   δ       Rule Set Accuracy
Fixed             0.001   0.988
Fixed             0.005   0.917
Fixed             0.01    0.966
Fixed             0.05    0.929
Fixed             0.1     0.880
Variable          0.001   0.849
Variable          0.005   0.94
Variable          0.01    0.835
Variable          0.05    0.930
Variable          0.1     0.831

Table A.5: Numerical Model Run Results, Varying Noise Level

Parametrization   Number of Rules   Rule Set Accuracy
Fixed             1                 0.987
Fixed             2                 0.966
Fixed             3                 0.979
Fixed             4                 0.976
Fixed             5                 0.979
Variable          1                 0.948
Variable          2                 0.835
Variable          3                 0.897
Variable          4                 0.946
Variable          5                 0.966

Table A.6: Numerical Model Run Results, Varying Number of Rules

Parametrization   Rule Size   Rule Set Accuracy
Fixed             1           0.982
Fixed             2           0.966
Fixed             3           0.944
Fixed             4           0.965
Fixed             5           0.952
Variable          1           0.925
Variable          2           0.835
Variable          3           0.907
Variable          4           0.903
Variable          5           0.919

Table A.7: Numerical Model Run Results, Varying Rule Size

Number of Rules   Rule Size   Best Rule Set Accuracy   Validation Accuracy
2                 2           0.7126                   0.7116
2                 3           0.7064                   0.7082
2                 4           0.6964                   0.7093
2                 5           0.6976                   0.6788
3                 2           0.6604                   0.6464
3                 3           0.7150                   0.7082
3                 4           0.7045                   0.7082
3                 5           0.6866                   0.6650
4                 2           0.7121                   0.7113
4                 3           0.7052                   0.7052
4                 4           0.7030                   0.7042
4                 5           0.7137                   0.7164
5                 2           0.7118                   0.7147
5                 3           0.7162                   0.7184
5                 4           0.7099                   0.7099
5                 5           0.7133                   0.693

Table A.8: Hyperparameter search, predicting non-creditworthiness with HELOC Binarized Data

Number of Rules   Rule Size   Best Rule Set Accuracy   Validation Accuracy
2                 2           0.6624                   0.6389
2                 3           0.7159                   0.7045
2                 4           0.6545                   0.6525
2                 5           0.6545                   0.6474
3                 2           0.7017                   0.7214
3                 3           0.7263                   0.7092
3                 4           0.7270                   0.7031
3                 5           0.7121                   0.7133
4                 2           0.6895                   0.7062
4                 3           0.7267                   0.7204
4                 4           0.7240                   0.6960
4                 5           0.7230                   0.7082
5                 2           0.6978                   0.6839
5                 3           0.7088                   0.6960
5                 4           0.7306                   0.7062
5                 5           0.7344                   0.7133

Table A.9: Hyperparameter search, predicting creditworthiness using HELOC Bina- rized Data

Number of Rules   Rule Size   Best Rule Set Accuracy   Validation Accuracy
2                 2           0.7001                   0.7076
2                 3           0.7125                   0.6978
2                 4           0.7101                   0.7049
2                 5           0.7037                   0.7099
3                 2           0.7108                   0.6941
3                 3           0.7191                   0.7157
3                 4           0.7132                   0.7018
3                 5           0.7101                   0.7096
4                 2           0.7037                   0.6805
4                 3           0.7150                   0.6927
4                 4           0.7206                   0.7181
4                 5           0.7150                   0.6917
5                 2           0.7132                   0.6937
5                 3           0.7196                   0.6988
5                 4           0.7160                   0.7120
5                 5           0.7155                   0.7039

Table A.10: Hyperparameter search, predicting non-creditworthiness using HELOC Numerical Data

Number of Rules   Rule Size   Best Rule Set Accuracy   Validation Accuracy
2                 2           0.7084                   0.7082
2                 3           0.7133                   0.7163
2                 4           0.7224                   0.7234
2                 5           0.7273                   0.7254
3                 2           0.7160                   0.7204
3                 3           0.7197                   0.7194
3                 4           0.7263                   0.7194
3                 5           0.7258                   0.7092
4                 2           0.7110                   0.6981
4                 3           0.7069                   0.7102
4                 4           0.7268                   0.7102
4                 5           0.7328                   0.7102
5                 2           0.7099                   0.7123
5                 3           0.7294                   0.7082
5                 4           0.7179                   0.6809
5                 5           0.7304                   0.6910

Table A.11: Hyperparameter search, predicting creditworthiness using HELOC Nu- merical Data

Bibliography

[1] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. J. Mach. Learn. Res., 11:1803–1831, August 2010.

[2] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. 2017.

[3] Richard Evans and Edward Grefenstette. Learning explanatory rules from noisy data. J. Artif. Int. Res., 61(1):1–64, January 2018.

[4] Georg Gottlob, Nicola Leone, and Francesco Scarcello. On the complexity of some inductive logic programming problems. In Nada Lavrač and Sašo Džeroski, editors, Inductive Logic Programming, pages 17–32, Berlin, Heidelberg, 1997. Springer Berlin Heidelberg.

[5] Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. CoRR, abs/1610.02413, 2016.

[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. CoRR, abs/1502.01852, 2015.

[7] Sudhanshu Nath Mishra. Explaining Machine Learning Predictions: Rationales and Effective Modifications. Master's thesis, Massachusetts Institute of Technology, Cambridge (MA), 2018.

[8] Stephen Muggleton. Inverse entailment and progol. New Generation Computing, 13(3):245–286, Dec 1995.

[9] Stephen Muggleton and Wray Buntine. Machine invention of first-order predicates by inverting resolution. In Machine Learning Proceedings 1988, pages 339–352. Kaufmann, San Francisco (CA), 1988.

[10] Stephen Muggleton and Cao Feng. Efficient induction of logic programs. In New Generation Computing. Academic Press, 1990.

[11] Stephen Muggleton, Jose Santos, and Alireza Tamaddoni-Nezhad. Progolem: A system based on relative minimal generalisation. In ILP, 2009.

[12] Gordon Plotkin. Automatic methods of inductive inference. PhD thesis, University of Edinburgh, 1971.

[13] J.R. Quinlan. Learning logical definitions from relations. Machine Learning, 5(3):239–266, Aug 1990.

[14] Marco Túlio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. CoRR, abs/1602.04938, 2016.

[15] Farhad Shakerin and Gopal Gupta. Induction of non-monotonic logic programs to explain boosted tree models using LIME. CoRR, abs/1808.00629, 2018.

[16] Jiawei Su, Danilo Vasconcellos Vargas, and Kouichi Sakurai. One pixel attack for fooling deep neural networks. CoRR, abs/1710.08864, 2017.

[17] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. CoRR, abs/1703.01365, 2017.

[18] Qiang Zeng, Jignesh M. Patel, and David Page. Quickfoil: Scalable inductive logic programming. Proc. VLDB Endow., 8(3):197–208, November 2014.
