An Integer Programming Approach to Deep Neural Networks with Binary Activation Functions

Jannis Kurtz 1 Bubacarr Bah 2 3

*Equal contribution  1 Chair for Mathematics of Information Processing, RWTH Aachen University, Aachen, Germany  2 African Institute for Mathematical Sciences, Cape Town, South Africa  3 Division

Abstract

We study deep neural networks with binary activation functions (BDNN), i.e. the activation function only has two states. We show that the BDNN can be reformulated as a mixed-integer linear program which can be solved to global optimality by classical integer programming solvers. Additionally, we present a heuristic solution algorithm and study the model under data uncertainty, applying a two-stage robust optimization approach. We apply our methods to random and real datasets and show that the heuristic version of the BDNN outperforms classical deep neural networks on the Breast Cancer Wisconsin dataset while performing worse on random data.

1. Introduction

Deep learning (DL) methods have of late reinvigorated interest in artificial intelligence and data science, and they have had many successful applications in computer vision, natural language processing, and data analytics (LeCun et al., 2015). The training of deep neural networks relies mostly on (stochastic) gradient descent, hence the use of differentiable activation functions such as ReLU, sigmoid or the hyperbolic tangent is the state of the art (Rumelhart et al., 1986; Goodfellow et al., 2016). On the contrary, discrete activations, which may be more analogous to biological activations, present a training challenge due to non-differentiability and even discontinuity. If additionally the weights are considered to be binary, the use of discrete activation networks can reduce computation and storage complexity, provides for better interpretation of solutions, and has the potential to be more robust to adversarial perturbations than continuous activation networks (Qin et al., 2020). Furthermore, low-powered computations may benefit from discrete activations as a form of coarse quantization (Plagianakos et al., 2001; Bengio et al., 2013; Courbariaux et al., 2015; Rastegari et al., 2016). Nevertheless, gradient descent-based training behaves like a black box, raising many questions regarding the explainability and interpretability of internal representations (Hampson & Volper, 1990; Plagianakos et al., 2001; Bengio et al., 2013).

On the other hand, integer programming (IP) is known as a powerful tool to model a huge class of real-world optimization problems (Wolsey, 1998). Recently it has been successfully applied to problems involving sparsity constraints and to the evaluation of trained neural networks (Bertsimas et al., 2017; 2019b; Fischetti & Jo, 2018).

Contributions: In Section 3 we reformulate the binary neural network, referred to in the sequel as a binary deep neural network (BDNN), as an IP formulation which can be solved to global optimality by classical IP solvers. All our results can also be applied to BDNNs with binary weights; solving the latter class of networks to optimality was still an open problem. We note that it is straightforward to extend the IP formulation to more general settings and to several variations of the model. We present a heuristic algorithm to calculate locally optimal solutions of the BDNN in Section 3.1 and apply a two-stage robust optimization approach to protect the model against adversarial attacks in Section 4. In Section 5 we present computational results for random and real datasets and compare the BDNN to a deep neural network using ReLU activations (DNN). Despite scalability issues and a slightly worse accuracy on random datasets, the results indicate that the heuristic version outperforms the DNN on the Breast Cancer Wisconsin dataset.
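To illustrate the modeling idea behind such an IP reformulation, the following sketch trains a small one-hidden-layer network with a binary step activation as a mixed-integer linear program using the open-source PuLP library. It is not the paper's formulation from Section 3: the big-M indicator constraints, the McCormick-style linearization of products between binary activations and continuous output weights, the toy data, the variable bounds, the margin eps and the 0/1 loss are all illustrative choices made here.

# A minimal sketch (not the paper's exact formulation): training a one-hidden-layer
# network with a binary step activation as a mixed-integer linear program in PuLP.
# Big-M indicator constraints and a McCormick-style linearization of binary-times-
# continuous products are standard modeling devices; the toy data, variable bounds,
# big-M value, margin eps and 0/1 loss are illustrative assumptions.
import pulp

# Toy data: n points in d dimensions with 0/1 labels.
X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
y = [1, 1, 0, 0]
n, d, H = len(X), len(X[0]), 2        # H hidden units
M, eps = 10.0, 1e-3                   # big-M constant and strict-inequality margin

prob = pulp.LpProblem("bdnn_training_sketch", pulp.LpMinimize)

# Continuous, bounded weights and biases of both layers.
W = [[pulp.LpVariable(f"w_{j}_{k}", -1, 1) for k in range(d)] for j in range(H)]
b = [pulp.LpVariable(f"b_{j}", -1, 1) for j in range(H)]
c = [pulp.LpVariable(f"c_{j}", -1, 1) for j in range(H)]
b_out = pulp.LpVariable("b_out", -1, 1)

# Binary hidden activations, linearized products z = c * u, and predicted labels.
u = [[pulp.LpVariable(f"u_{i}_{j}", cat=pulp.LpBinary) for j in range(H)] for i in range(n)]
z = [[pulp.LpVariable(f"z_{i}_{j}", -1, 1) for j in range(H)] for i in range(n)]
p = [pulp.LpVariable(f"p_{i}", cat=pulp.LpBinary) for i in range(n)]

for i in range(n):
    for j in range(H):
        a_ij = pulp.lpSum(W[j][k] * X[i][k] for k in range(d)) + b[j]  # pre-activation
        # u[i][j] = 1  <=>  a_ij >= 0  (big-M indicator constraints)
        prob += a_ij >= -M * (1 - u[i][j])
        prob += a_ij <= M * u[i][j] - eps * (1 - u[i][j])
        # Exact linearization of z = c[j] * u[i][j] (u binary, c in [-1, 1])
        prob += z[i][j] <= u[i][j]
        prob += z[i][j] >= -u[i][j]
        prob += z[i][j] <= c[j] + (1 - u[i][j])
        prob += z[i][j] >= c[j] - (1 - u[i][j])
    s_i = pulp.lpSum(z[i]) + b_out                                     # output pre-activation
    # p[i] = 1  <=>  s_i >= 0
    prob += s_i >= -M * (1 - p[i])
    prob += s_i <= M * p[i] - eps * (1 - p[i])

# Minimize the number of misclassified points (the 0/1 loss is linear in p for fixed labels).
prob += pulp.lpSum((1 - p[i]) if y[i] == 1 else p[i] for i in range(n))

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("misclassified points:", pulp.value(prob.objective))

Because every activation is a binary variable and every product of a binary variable with a bounded continuous variable can be linearized exactly, the whole training problem becomes a mixed-integer linear program that an off-the-shelf solver can, in principle, solve to global optimality.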

2. Related Literature