An Integer Programming Approach to Deep Neural Networks with Binary Activation Functions
Jannis Kurtz^1   Bubacarr Bah^{2,3}

*Equal contribution. ^1 Chair for Mathematics of Information Processing, RWTH Aachen University, Aachen, Germany. ^2 African Institute for Mathematical Sciences, Cape Town, South Africa. ^3 Division of Applied Mathematics, Stellenbosch University, Stellenbosch, South Africa. Correspondence to: Jannis Kurtz <[email protected]>.

Abstract

We study deep neural networks with binary activation functions (BDNN), i.e. the activation function only has two states. We show that the BDNN can be reformulated as a mixed-integer linear program which can be solved to global optimality by classical integer programming solvers. Additionally, a heuristic solution algorithm is presented and we study the model under data uncertainty, applying a two-stage robust optimization approach. We implemented our methods on random and real datasets and show that the heuristic version of the BDNN outperforms classical deep neural networks on the Breast Cancer Wisconsin dataset while performing worse on random data.

1. Introduction

Deep learning (DL) methods have of late reinvigorated interest in artificial intelligence and data science, and they have had many successful applications in computer vision, natural language processing, and data analytics (LeCun et al., 2015). The training of deep neural networks relies mostly on (stochastic) gradient descent, hence the use of differentiable activation functions like ReLU, the sigmoid, or the hyperbolic tangent is the state of the art (Rumelhart et al., 1986; Goodfellow et al., 2016). On the contrary, discrete activations, which may be more analogous to biological activations, present a training challenge due to non-differentiability and even discontinuity. If additionally the weights are considered to be binary, the use of discrete activation networks can reduce the computation and storage complexities, provides for better interpretation of solutions, and has the potential to be more robust to adversarial perturbations than continuous activation networks (Qin et al., 2020). Furthermore, low-powered computations may benefit from discrete activations as a form of coarse quantization (Plagianakos et al., 2001; Bengio et al., 2013; Courbariaux et al., 2015; Rastegari et al., 2016). Nevertheless, gradient descent-based training behaves like a black box, raising many questions regarding the explainability and interpretability of internal representations (Hampson & Volper, 1990; Plagianakos et al., 2001; Bengio et al., 2013).

On the other hand, integer programming (IP) is known as a powerful tool to model a huge class of real-world optimization problems (Wolsey, 1998). Recently it was successfully applied to machine learning problems involving sparsity constraints and to evaluate trained neural networks (Bertsimas et al., 2017; 2019b; Fischetti & Jo, 2018).

Contributions: In Section 3, we reformulate the binary neural network as an IP formulation, referred to in the sequel as a binary deep neural network (BDNN), which can be solved to global optimality by classical IP solvers. All our results can also be applied to BDNNs with binary weights; solving the latter class of networks to optimality was still an open problem. We note that it is straightforward to extend the IP formulation to more general settings and several variations of the model. We present a heuristic algorithm to calculate locally optimal solutions of the BDNN in Section 3.1 and apply a two-stage robust optimization approach to protect the model against adversarial attacks in Section 4. In Section 5 we present computational results for random and real datasets and compare the BDNN to a deep neural network using ReLU activations (DNN). Despite scalability issues and a slightly worse accuracy on random datasets, the results indicate that the heuristic version outperforms the DNN on the Breast Cancer Wisconsin dataset.
2. Related Literature

The interest in BDNNs goes back to (McCulloch & Pitts, 1943), where BDNNs were used to simulate Boolean functions. However, until the beginning of this century, concerted efforts were made to train these networks either by specific schemes (Gray & Michel, 1992; Kohut & Steinbach, 2004) or via back propagation with modifications of the gradient descent method (Widrow & Winter, 1988; Toms, 1990; Barlett & Downs, 1992; Goodman & Zeng, 1994; Corwin et al., 1994; Plagianakos et al., 2001). More recent work on back propagation, mostly motivated by the low complexity of computation and storage, builds on the pioneering works of (Bengio et al., 2013; Courbariaux et al., 2015; Hubara et al., 2016; Rastegari et al., 2016; Kim & Smaragdis, 2016); see (Qin et al., 2020) for a detailed survey. Regarding the generalization error of BDNNs, it was already proved that the VC-dimension of deep neural networks with binary activation functions is $\rho \log(\rho)$, where $\rho$ is the number of weights of the BDNN; see (Baum & Haussler, 1989; Maass, 1994; Sakurai, 1993).

On the other hand, IP methods were successfully applied to sparse classification problems (Bertsimas et al., 2017), sparse regression (Bertsimas et al., 2019b) and to evaluate trained neural networks (Fischetti & Jo, 2018). In (Icarte et al., 2019) BDNNs with weights restricted to $\{-1,1\}$ are trained by a hybrid method based on constraint programming and mixed-integer programming. In (Khalil et al., 2018) the authors calculate optimal adversarial examples for BDNNs using a MIP formulation and integer propagation. Furthermore, robust optimization approaches were used to protect against adversarial attacks for other machine learning methods (Xu et al., 2009b;a; Bertsimas et al., 2019a).
3. Discrete Neural Networks

In this work we study binary deep neural networks (BDNN), i.e. classical deep neural networks with binary activation functions. As in the classical framework, for a given input vector $x \in \mathbb{R}^n$ we study classification functions $f$ of the form

\[ f(x) = \sigma^K\left(W^K \sigma^{K-1}\left(W^{K-1} \cdots \sigma^1\left(W^1 x\right) \cdots \right)\right) \]

for weight matrices $W^k \in \mathbb{R}^{d_k \times d_{k-1}}$ and activation functions $\sigma^k$, which are applied component-wise. The dimension $d_k$ is called the width of the $k$-th layer. In contrast to the recent developments of the field we consider the activation functions to be binary, more precisely each function is of the form

\[ \sigma^k(\alpha) = \begin{cases} 0 & \text{if } \alpha < \lambda_k \\ 1 & \text{otherwise} \end{cases} \tag{1} \]

for $\alpha \in \mathbb{R}$, where the parameters $\lambda_k \in \mathbb{R}$ can be learned by our model simultaneously with the weight matrices, which is normally not the case in classical neural network approaches. Note that it is also possible to fix the values $\lambda_k$ in advance. In the following we use the notation $[p] := \{1, \ldots, p\}$ for $p \in \mathbb{N}$.

Given a set of labeled training samples

\[ X \times Y = \left\{ (x^i, y^i) \mid i \in [m] \right\} \subset \mathbb{R}^n \times \{0,1\}, \]

we consider loss functions

\[ \ell : \{0,1\} \times \mathbb{R}^{d_K} \to \mathbb{R} \tag{2} \]

and the task is to find the optimal weight matrices which minimize the empirical loss over the training samples, i.e. we want to solve the problem

\[ \begin{aligned} \min \; & \sum_{i=1}^m \ell\left(y^i, z^i\right) \\ \text{s.t. } \; & z^i = \sigma^K\left(W^K \sigma^{K-1}\left(\cdots \sigma^1\left(W^1 x^i\right) \cdots \right)\right) && \forall i \in [m] \\ & W^k \in \mathbb{R}^{d_k \times d_{k-1}} && \forall k \in [K] \\ & \lambda_k \in \mathbb{R} && \forall k \in [K] \end{aligned} \tag{3} \]

for given dimensions $d_0, \ldots, d_K$ where $d_0 = n$. We set $d_K =: L$. All results in this work also hold for regression problems, more precisely for loss functions

\[ \ell_r : \mathbb{R}^L \times \mathbb{R}^L \to \mathbb{R}, \]

where we minimize the empirical loss $\sum_{i=1}^m \ell_r(x^i, z^i)$ instead. We use labels $\{0,1\}$ instead of the classical labels $\{-1,1\}$ for ease of notation in our model. Nevertheless the same approach can be adapted to labels $\tilde{y}^i \in \{-1,1\}$ by adding the additional constraints $\tilde{y}^i = -1 + 2y^i$.

Several variants of the model can be considered:

• The output of the neural network is 0 or 1, defining the class the input is assigned to. As a loss function, we count the number of misclassified training samples. This variant can be modeled by choosing the last layer to have width $d_K = 1$ together with the loss function $\ell(y,z) = |y - z|$.

• The output of the neural network is a continuous vector $z \in \mathbb{R}^L$. This case can be modeled by choosing the activation function of the last layer $\sigma^K$ as the identity and $d_K = L$. Any classical loss function may be considered, e.g. the squared loss $\ell(y,z) = \|y - z\|_2^2$. Note that a special case of this model is the autoencoder, where in contrast to the continuous framework the input is encoded into a vector with 0-1 entries.

• All results can easily be generalized to multiclass classification; see Section 3.

In the following lemma we show how to reformulate Problem (3) as an integer program.

Lemma 1. Assume the Euclidean norm of each data point in $X$ is bounded by $r$. Then Problem (3) is equivalent to the mixed-integer non-linear program

\[ \min \; \sum_{i=1}^m \ell\left(y^i, u^{i,K}\right) \quad \text{s.t.} \tag{4} \]

since otherwise Constraint (6) would be violated. Note that if $u_j^{i,1} = 1$, then Constraint (5) is always satisfied, since all entries of $W^1$ are in $[-1,1]$ and all entries of $x^i$ are in $[-r,r]$, and therefore $|(w_j^1)^\top x^i| \le nr < M_1$. Similarly we can show that $u_j^{i,1} = 0$ if and only if $(w_j^1)^\top x^i < \lambda_1$. Hence $u^{i,1}$ is the output of the first layer for data point $x^i$, which is applied to $W^2$ in Constraints (7) and (8).
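To make the classification model of this section concrete, the following minimal NumPy sketch evaluates the forward map $f(x) = \sigma^K(W^K \sigma^{K-1}(\cdots \sigma^1(W^1 x)\cdots))$ with the binary activation of Eq. (1). It only illustrates how a fixed BDNN maps an input to a 0-1 output; the function names, the toy layer sizes, and the sampled weights are illustrative assumptions of ours, and training the weights via the integer program of Lemma 1 or the heuristic of Section 3.1 is not shown.

```python
import numpy as np

def binary_activation(alpha, lam):
    """Binary activation of Eq. (1): 0 if alpha < lambda_k, 1 otherwise (component-wise)."""
    return (alpha >= lam).astype(float)

def bdnn_forward(x, weights, thresholds):
    """Evaluate f(x) = sigma^K(W^K sigma^{K-1}(... sigma^1(W^1 x) ...)).

    weights    -- list of matrices W^k, where W^k has shape (d_k, d_{k-1})
    thresholds -- list of scalars lambda_k, one per layer
    """
    z = x
    for W, lam in zip(weights, thresholds):
        z = binary_activation(W @ z, lam)  # the output of each layer is a 0-1 vector
    return z

# Toy example (hypothetical sizes): n = d_0 = 4 inputs, one hidden layer of
# width d_1 = 3, output width d_2 = L = 1; weight entries lie in [-1, 1],
# matching the assumption used in the discussion of Lemma 1.
rng = np.random.default_rng(0)
weights = [rng.uniform(-1.0, 1.0, size=(3, 4)), rng.uniform(-1.0, 1.0, size=(1, 3))]
thresholds = [0.0, 0.5]  # lambda_1, lambda_2; in the model these can also be learned
x = rng.uniform(-1.0, 1.0, size=4)
print(bdnn_forward(x, weights, thresholds))  # 0-1 output vector, e.g. array([1.])
```

For the first variant listed above (0/1 classification with $d_K = 1$), the returned vector has a single entry giving the predicted class, and the empirical loss of Problem (3) would simply count how often this entry differs from $y^i$ over the training samples.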