Consistent Feature Selection for Analytic Deep Neural Networks

Vu Dinh*
Department of Mathematical Sciences
University of Delaware
Delaware, USA
[email protected]

Lam Si Tung Ho*
Department of Mathematics and Statistics
Dalhousie University
Halifax, Nova Scotia, Canada
[email protected]

Abstract

One of the most important steps toward interpretability and explainability of neural network models is feature selection, which aims to identify the subset of relevant features. Theoretical results in the field have mostly focused on the prediction aspect of the problem, with virtually no work on feature selection consistency for deep neural networks due to the model's severe nonlinearity and unidentifiability. This lack of theoretical foundation casts doubt on the applicability of deep learning to contexts where correct interpretations of the features play a central role. In this work, we investigate the problem of feature selection for analytic deep networks. We prove that for a wide class of networks, including deep feed-forward neural networks, convolutional neural networks, and a major sub-class of residual neural networks, the Adaptive Group Lasso selection procedure with Group Lasso as the base estimator is selection-consistent. The work provides further evidence that Group Lasso might be inefficient for feature selection with neural networks and advocates the use of Adaptive Group Lasso over the popular Group Lasso.

1 Introduction

In recent years, neural networks have become one of the most popular models for learning systems thanks to their strong approximation properties and superior predictive performance. Despite their successes, their "black box" nature provides little insight into how predictions are being made. This lack of interpretability and explainability stems from the fact that there has been little work investigating the statistical properties of neural networks, owing to their severe nonlinearity and unidentifiability.

One of the most important steps toward model interpretability is feature selection, which aims to identify the subset of relevant features with respect to an outcome. Numerous techniques have been developed in the statistics literature and adapted to neural networks over the years. For neural networks, the standard Lasso is not ideal because a feature can only be dropped if all of its connections have been shrunk to zero together, an objective that Lasso does not actively pursue. Zhao et al. [2015], Scardapane et al. [2017] and Zhang et al. [2019] address this concern by utilizing Group Lasso and its variations for selecting features of deep neural networks (DNNs). Feng and Simon [2017] propose fitting a neural network with a Sparse Group Lasso penalty on the first-layer input weights while adding a sparsity penalty to later layers. Li et al. [2016] propose adding a sparse one-to-one linear layer between the input layer and the first hidden layer of a neural network and performing $\ell_1$ regularization on the weights of this extra layer. Similar ideas are also applied to various types of networks and learning contexts [Tank et al., 2018, Ainsworth et al., 2018, Lemhadri et al., 2019]. Beyond regularization-based methods, Horel and Giesecke [2019] develop a test to assess the statistical significance of the feature variables of a neural network.

*These authors contributed equally to this work.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Alternatively, a ranking of feature importance based on local perturbations, which includes fitting a model in the local region around the input or locally perturbing the input to see how predictions change, can also be used as a proxy for feature selection [Simonyan et al., 2013, Ibrahim et al., 2014, Ribeiro et al., 2016, Nezhad et al., 2016, Lundberg and Lee, 2017, Shrikumar et al., 2017, Ching et al., 2018, Taherkhani et al., 2018, Lu et al., 2018]. "Though these methods can yield insightful interpretations, they focus on specific architectures of DNNs and can be difficult to generalize" [Lu et al., 2018].

Despite the success of these approaches, little is known about their theoretical properties. Results in the field have been either about shallow networks with one hidden layer [Dinh and Ho, 2020, Liang et al., 2018, Ye and Sun, 2018], or focused on posterior concentration, prediction consistency, parameter-estimation consistency and convergence of feature importance [Feng and Simon, 2017, Farrell et al., 2018, Polson and Ročková, 2018, Fallahgoul et al., 2019, Shen et al., 2019, Liu, 2019], with virtually no work on feature selection consistency for deep networks. This lack of a theoretical foundation for feature selection casts doubt on the applicability of deep learning to applications where correct interpretations play a central role, such as the medical and engineering sciences. This is problematic since work in other contexts has indicated that Lasso and Group Lasso can be inconsistent or inefficient for feature selection [Zou, 2006, Zhao and Yu, 2006, Wang and Leng, 2008], especially when the model is highly nonlinear [Zhang et al., 2018].

In this work, we investigate the problem of feature selection for deep networks. We prove that for a wide class of deep analytic neural networks, the GroupLasso+AdaptiveGroupLasso procedure (i.e., Adaptive Group Lasso selection with Group Lasso as the base estimator) is feature-selection-consistent. The results also provide further evidence that Group Lasso might be inconsistent for feature selection and advocate the use of the Adaptive Group Lasso over the popular Group Lasso.

2 Feature selection with analytic deep neural networks

Throughout the paper, we consider a general analytic neural network model described as follows. Given an input $x$ that belongs to a bounded open set $\mathcal{X} \subset \mathbb{R}^{d_0}$, the output map $f_\alpha(x)$ of an $L$-layer neural network with parameters $\alpha = (P, p, S, Q, q)$ is defined by

• input layer: $h_1(x) = P \cdot x + p$;
• hidden layers: $h_j(x) = \psi_{j-1}(S, h_{j-1}(x), h_{j-2}(x), \ldots, h_1(x))$, for $j = 2, \ldots, L-1$;
• output layer: $f_\alpha(x) = h_L(x) = Q \cdot h_{L-1}(x) + q$;

where $d_i$ denotes the number of nodes in the $i$-th hidden layer, $P \in \mathbb{R}^{d_1 \times d_0}$, $Q \in \mathbb{R}^{d_L \times d_{L-1}}$, $p \in \mathbb{R}^{d_1}$, $q \in \mathbb{R}^{d_L}$, and $\psi_1, \psi_2, \ldots, \psi_{L-2}$ are analytic functions parameterized by the hidden layers' parameter $S$. This framework allows interactions across layers of the network architecture and only requires that (i) the model interacts with the inputs through a finite set of linear units, and (ii) the activation functions are analytic. This parameterization encompasses a wide class of models, including feed-forward networks, convolutional networks, and (a major subclass of) residual networks. Hereafter, we assume that the set of all feasible vectors $\alpha$ of the model is a hypercube $\mathcal{W} = [-A, A]^{n_\alpha}$, and we use the notation $h^{[i]}$ to denote the $i$-th component of a vector $h$ and $u^{[i,k]}$ to denote the $[i,k]$-entry of a matrix $u$.

We study the feature selection problem for regression in the model-based setting:

Assumption 2.1. Training data $\{(X_i, Y_i)\}_{i=1}^n$ are independent and identically distributed (i.i.d.) samples generated from $P^*_{X,Y}$ such that the input density $p_X$ is positive and continuous on its domain $\mathcal{X}$, and $Y_i = f_{\alpha^*}(X_i) + \epsilon_i$ where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ and $\alpha^* \in \mathcal{W}$.

Remarks. Although the analyses of the paper focus on Gaussian noise, the results also apply to all models for which $\epsilon_i$ is sub-Gaussian, including the case of bounded noise. The assumption that the input density $p_X$ is positive and continuous on its bounded domain $\mathcal{X}$ ensures that there are no perfect correlations among the inputs and plays an important role in our analysis. Throughout the paper, the network is assumed to be fixed, and we are interested in the asymptotic behavior of estimators in the learning setting as the sample size $n$ increases.
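As a concrete (hypothetical) instantiation of this framework, the sketch below implements a feed-forward network with hyperbolic tangent activations in PyTorch; the class name, constructor signature, and exact layer bookkeeping are our own illustrative choices, not the authors' code.

```python
import torch
import torch.nn as nn

class AnalyticFeedForward(nn.Module):
    """A plain feed-forward instance of the framework above (illustrative):
    h_1(x) = P x + p, analytic hidden maps h_j = tanh(W h_{j-1} + c), and
    f_alpha(x) = Q h_{L-1}(x) + q. Since tanh is analytic, the model falls
    in the analytic class described in the text."""

    def __init__(self, d0, widths, dL=1):
        super().__init__()
        self.first = nn.Linear(d0, widths[0])      # parameters (P, p): inputs enter only here
        self.hidden = nn.ModuleList(
            [nn.Linear(widths[i], widths[i + 1]) for i in range(len(widths) - 1)]
        )                                          # hidden-layer parameters S
        self.last = nn.Linear(widths[-1], dL)      # parameters (Q, q)

    def forward(self, x):
        h = self.first(x)                          # h_1(x) = P x + p
        for layer in self.hidden:
            h = torch.tanh(layer(h))               # analytic hidden layers
        return self.last(h)                        # f_alpha(x) = Q h_{L-1}(x) + q
```

For instance, `AnalyticFeedForward(d0=50, widths=[20, 20, 20])` has the rough shape of the networks used in Section 3.4; only the first layer `self.first` touches the inputs, which is what makes a group penalty on its columns natural.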

We assume that the "true" model $f_{\alpha^*}(x)$ only depends on $x$ through a subset of significant features while being independent of the others. The goal of feature selection is to identify this set of significant features from the given data. For convenience, we separate the inputs into two groups $s \in \mathbb{R}^{n_s}$ and $z \in \mathbb{R}^{n_z}$ (with $n_s + n_z = d_0$) that denote the significant and non-significant variables of $f_{\alpha^*}(x)$, respectively. Similarly, the parameters of the first layer ($P$ and $p$) are grouped by their corresponding inputs, i.e.,
$$x = (s, z), \quad P = (u, v), \quad p = (b_1, b_2),$$
where $u \in \mathbb{R}^{d_1 \times n_s}$, $v \in \mathbb{R}^{d_1 \times n_z}$, $b_1 \in \mathbb{R}^{n_s}$ and $b_2 \in \mathbb{R}^{n_z}$. We note that this separation is simply for mathematical convenience and the training algorithm is not aware of such a dichotomy. We define significance and selection consistency as follows.

Definition 2.2 (Significance). The $i$-th input $x^{[i]}$ of a network $f_\alpha(x)$ is referred to as non-significant iff the output of the network does not depend on the value of that input variable. That is, letting $g_i(x, s)$ denote the vector obtained from $x$ by replacing the $i$-th component of $x$ by $s \in \mathbb{R}$, we have $f_\alpha(x) = f_\alpha(g_i(x, s))$ for all $x \in \mathcal{X}$, $s \in \mathbb{R}$.

Definition 2.3 (Feature selection consistency). An estimator $\alpha_n$ with first-layer parameters $(u_n, v_n)$ is feature selection consistent if for any $\delta > 0$, there exists $N$ such that for $n > N$, we have
$$u_n^{[:,k]} \neq 0, \ \forall k = 1, \ldots, n_s, \quad \text{and} \quad v_n^{[:,l]} = 0, \ \forall l = 1, \ldots, n_z$$
with probability at least $1 - \delta$.

One popular method for feature selection with neural networks is the Group Lasso (GL). Moreover, since the model of our framework only interacts with the inputs through the first layer, it is reasonable that the penalty should be imposed only on these parameters. A simple GL estimator for neural networks is thus defined by
$$\hat{\alpha}_n := \operatorname{argmin}_{\alpha = (u, v, b_1, b_2, S, Q, q)} \ \frac{1}{n} \sum_{i=1}^n \ell(\alpha, X_i, Y_i) + \lambda_n L(\alpha), \quad \text{where } L(\alpha) = \sum_{k=1}^{n_s} \|u^{[:,k]}\| + \sum_{l=1}^{n_z} \|v^{[:,l]}\|,$$
$\ell(\alpha, x, y) = (y - f_\alpha(x))^2$ is the square loss, $\lambda_n > 0$, $\|\cdot\|$ is the standard Euclidean norm, and $u^{[:,k]}$ is the vector of parameters associated with the $k$-th significant input.

While GL and its variation, the Sparse Group Lasso, have become the foundation for many feature selection algorithms in the field, there is no known result about selection consistency for this class of estimators. Furthermore, like the regular Lasso, GL penalizes groups of parameters with the same regularization strength, and it has been shown that excessive penalty applied to the significant variables can affect selection consistency [Zou, 2006, Wang and Leng, 2008]. To address these issues, Dinh and Ho [2020] propose the "GroupLasso+AdaptiveGroupLasso" (GL+AGL) estimator for feature selection with neural networks, defined as
$$\tilde{\alpha}_n := \operatorname{argmin}_{\alpha = (u, v, b_1, b_2, S, Q, q)} \ \frac{1}{n} \sum_{i=1}^n \ell(\alpha, X_i, Y_i) + \zeta_n M_n(\alpha),$$
where
$$M_n(\alpha) = \sum_{k=1}^{n_s} \frac{1}{\|\hat{u}_n^{[:,k]}\|^\gamma} \|u^{[:,k]}\| + \sum_{k=1}^{n_z} \frac{1}{\|\hat{v}_n^{[:,k]}\|^\gamma} \|v^{[:,k]}\|.$$
Here, we use the convention $0/0 = 1$, $\gamma > 0$, $\zeta_n$ is the regularizing constant, and $\hat{u}_n, \hat{v}_n$ denote the $u$ and $v$ components of the GL estimate $\hat{\alpha}_n$. As is typical with adaptive lasso estimators, GL+AGL uses its base estimator to provide a rough data-dependent estimate with which to shrink groups of parameters with different regularization strengths. As $n$ grows, the weights for non-significant features get inflated (to infinity) while the weights for significant ones remain bounded [Zou, 2006]. Dinh and Ho [2020] prove that GL+AGL is selection-consistent for irreducible shallow networks with hyperbolic tangent activation. In this work, we argue that the results can be extended to general analytic deep networks.
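To make the two-stage estimator concrete, the following minimal sketch computes $L(\alpha)$ and $M_n(\alpha)$ from a first layer of the kind defined above (a `torch.nn.Linear` whose weight matrix has shape $d_1 \times d_0$). This is our own illustration under the stated convention; `gamma` and the handling of exactly-zero columns are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

def group_lasso_penalty(first_layer: nn.Linear) -> torch.Tensor:
    """L(alpha): sum of Euclidean norms of the first-layer columns,
    one group of weights per input feature."""
    return first_layer.weight.norm(dim=0).sum()

def adaptive_weights(gl_first_layer: nn.Linear, gamma: float = 2.0) -> torch.Tensor:
    """Per-feature weights 1 / ||column||^gamma computed from the Group Lasso fit.
    Columns that GL shrank to exactly zero receive an infinite weight."""
    col_norms = gl_first_layer.weight.detach().norm(dim=0)
    return 1.0 / col_norms.pow(gamma)

def adaptive_group_lasso_penalty(first_layer: nn.Linear, weights: torch.Tensor) -> torch.Tensor:
    """M_n(alpha): column norms reweighted by the first-stage estimates."""
    return (weights * first_layer.weight.norm(dim=0)).sum()
```

In practice, a feature whose GL column is numerically zero can simply be removed before the second stage, which is exactly the effect the infinite weight is meant to have.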

3 Consistent feature selection via GroupLasso+AdaptiveGroupLasso

The analysis of GL+AGL is decomposed into two steps. First, we establish that the Group Lasso provides a good proxy to estimate regularization strengths. Specifically, we show that

• The $u$-components of $\hat{\alpha}_n$ are bounded away from zero.

• The $v$-components of $\hat{\alpha}_n$ converge to zero at a polynomial rate.

Second, we prove that a selection procedure based on the weights obtained from the first step can correctly select the set of significant variables given informative data with high probability.

One technical difficulty of the proof concerns the geometry of the risk function around the set of risk minimizers
$$\mathcal{H}^* = \{\alpha \in \mathcal{W} : R(\alpha) = R(\alpha^*)\},$$
where $R(\alpha)$ denotes the risk function $R(\alpha) = \mathbb{E}_{(X,Y) \sim P^*_{X,Y}}[(f_\alpha(X) - Y)^2]$. Since deep neural networks are highly unidentifiable, the set $\mathcal{H}^*$ can be quite complex. For example:

(i) A simple rearrangement of the nodes in the same hidden layer leads to a new configuration that produces the same mapping as the generating network.

(ii) If either all incoming weights or all outgoing weights of a node are zero, changing the others also has no effect on the output.

(iii) For hyperbolic tangent activation, multiplying all the incoming and outgoing parameters associated with a node by $-1$ also leads to an equivalent configuration.

These unidentifiabilities create a barrier to studying deep neural networks; a numerical check of (i) and (iii) is sketched below. Existing results in the field either concern equivalent graph transformations [Chen et al., 1993] or only hold in some generic sense [Fefferman and Markel, 1994] and thus are not sufficient to establish consistency. Traditional convergence analyses also often rely on local expansions of the risk function around isolated optima, which is no longer possible for neural networks since $\mathcal{H}^*$ may contain subsets of high dimension (e.g., case (ii) above) and the Hessian matrix (and higher-order derivatives) at an optimum might be singular.

In this section, we illustrate that these issues can be avoided with two adjustments. First, we show that while the behavior of a generic estimator may be erratic, the regularization effect of Group Lasso constrains it to converge to a subset $\mathcal{K} \subset \mathcal{H}^*$ of "well-behaved" optima. Second, instead of using local expansions, we employ Lojasiewicz's inequality for analytic functions to upper bound the distance from $\hat{\alpha}_n$ to $\mathcal{H}^*$ by the excess risk. This removes the necessity of the regularity of the Hessian matrix and enlarges the class of networks for which selection consistency can be analyzed.
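The symmetries in (i) and (iii) are easy to check numerically; the following sketch (our illustration, using a small one-hidden-layer tanh network of arbitrary size) verifies that permuting the hidden nodes and flipping the signs of all parameters attached to a single node leave the output map unchanged.

```python
import torch

torch.manual_seed(0)
d0, d1 = 5, 4
W1, b1 = torch.randn(d1, d0), torch.randn(d1)     # first layer (P, p)
W2, b2 = torch.randn(1, d1), torch.randn(1)       # output layer (Q, q)

def net(x, W1, b1, W2, b2):
    return torch.tanh(x @ W1.T + b1) @ W2.T + b2

x = torch.randn(10, d0)
out = net(x, W1, b1, W2, b2)

# (i) permuting hidden nodes gives the same function
perm = torch.randperm(d1)
out_perm = net(x, W1[perm], b1[perm], W2[:, perm], b2)

# (iii) flipping the signs of all weights in and out of one node gives the
# same function, because tanh is odd
W1f, b1f, W2f = W1.clone(), b1.clone(), W2.clone()
W1f[0], b1f[0], W2f[:, 0] = -W1[0], -b1[0], -W2[:, 0]
out_flip = net(x, W1f, b1f, W2f, b2)

print(torch.allclose(out, out_perm), torch.allclose(out, out_flip))  # True True
```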

3.1 Characterizing the set of risk minimizers

Lemma 3.1. (i) There exists $c_0 > 0$ such that $\|u_\alpha^{[:,k]}\| \geq c_0$ for all $\alpha \in \mathcal{H}^*$ and $k = 1, \ldots, n_s$. (ii) For $\alpha \in \mathcal{H}^*$, the vector $\phi(\alpha)$, obtained from $\alpha$ by setting its $v$-components to zero, also belongs to $\mathcal{H}^*$.

Proof. We first prove that $\alpha' \in \mathcal{H}^*$ if and only if $f_{\alpha'} = f_{\alpha^*}$. Indeed, since $f_{\alpha^*}(X) = \mathbb{E}_{P^*_{X,Y}}[Y \mid X]$, we have
$$R(\alpha^*) = \min_g \mathbb{E}_{(X,Y) \sim P^*_{X,Y}}[(g(X) - Y)^2] \leq \min_{\alpha \in \mathcal{W}} R(\alpha) \leq R(\alpha'),$$
with equality only if $f_{\alpha'} = f_{\alpha^*}$ a.s. on the support of $p_X(x)$. Since $p_X(x)$ is continuous and positive on its open domain $\mathcal{X}$ and the maps $f_\alpha$ are analytic, we deduce that $f_{\alpha'} = f_{\alpha^*}$ everywhere.

(i) Assume that no such $c_0$ exists. Since $\mathcal{W}$ is compact and $f_\alpha$ is an analytic function in $\alpha$ and $x$, we deduce that there exist $\alpha_0 \in \mathcal{H}^*$ and $k$ such that $u_{\alpha_0}^{[:,k]} = 0$. This means that $f_{\alpha_0} = f_{\alpha^*}$ does not depend on the significant input $s_k$, which is a contradiction.

(ii) Since $\alpha \in \mathcal{H}^*$, we have $f_{\alpha^*}(s, z) = f_\alpha(s, z) = f_\alpha(s, 0) = f_{\phi(\alpha)}(s, z)$, which implies $\phi(\alpha) \in \mathcal{H}^*$.

Remarks. Lemma 3.1 provides a way to study the behavior of Group Lasso without a full characterization of the geometry of $\mathcal{H}^*$ as in Dinh and Ho [2020]. First, as long as $d(\hat{\alpha}_n, \mathcal{H}^*) \to 0$, it is straightforward that the $u$-components of $\hat{\alpha}_n$ are bounded away from zero (part (i)). Second, it shows that for all $\alpha \in \mathcal{H}^*$, $\phi(\alpha)$ is a "better" hypothesis in terms of penalty while remaining an optimal hypothesis in terms of prediction. This enables us to prove that if we define
$$\mathcal{K} = \{\alpha \in \mathcal{W} : f_\alpha = f_{\alpha^*} \text{ and } v_\alpha = 0\}$$
and $\lambda_n$ converges to zero slowly enough, the regularization term will force $d(\hat{\alpha}_n, \mathcal{K}) \to 0$. This helps establish that Group Lasso provides a good proxy to construct appropriate regularization strengths.

To provide a polynomial convergence rate for $\hat{\alpha}_n$, we need the following lemma.

Lemma 3.2. There exist $c_2, \nu > 0$ such that $R(\beta) - R(\alpha^*) \geq c_2\, d(\beta, \mathcal{H}^*)^\nu$ for all $\beta \in \mathcal{W}$.

Proof. We first note that since $f_\alpha(x)$ is analytic in both $\alpha$ and $x$, the excess risk $g(\alpha) = R(\alpha) - R(\alpha^*)$ is also analytic in $\alpha$. Thus $\mathcal{H}^* = \{\alpha : R(\alpha) = R(\alpha^*)\}$ is the zero level-set of the analytic function $g$. By Lojasiewicz's inequality for algebraic varieties [Ji et al., 1992], there exist positive constants $C$ and $\nu$ such that $d(\beta, \mathcal{H}^*)^\nu \leq C\, |g(\beta)|$ for all $\beta \in \mathcal{W}$, which completes the proof.

We note that for cases when $\mathcal{H}^*$ is finite, Lemma 3.2 reduces to the standard Taylor expansion inequality around a local optimum, with $\nu = 2$ if the Hessian matrix at the optimum is non-singular [Dinh and Ho, 2020]. When $\mathcal{H}^*$ is a high-dimensional algebraic set, Lojasiewicz's inequality and Lemma 3.2 are more appropriate. For example, for $g(x, y) = (x - y)^2$, it is impossible to attain an inequality of the form
$$g(x', y') - g(x_0, x_0) \geq C \cdot d((x', y'), (x_0, x_0))^\nu$$
in any neighborhood of any minimum $(x_0, x_0)$, while it is straightforward that
$$g(x', y') - g(x_0, x_0) \geq d((x', y'), Z)^2, \quad \text{where } Z = \{(x, x) : x \in \mathbb{R}\}.$$
This approach provides a way to avoid dealing with model unidentifiability but requires some adaptation of the analysis to accommodate a new mode of convergence.
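For completeness, the second inequality can be verified directly (a short check we add here, not in the original text): the Euclidean distance from $(x', y')$ to the diagonal $Z$ is $|x' - y'|/\sqrt{2}$, so
$$g(x', y') - g(x_0, x_0) = (x' - y')^2 \geq \frac{(x' - y')^2}{2} = d((x', y'), Z)^2.$$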

3.2 Convergence of Group Lasso

The two lemmas in the previous section enable us to analyze the convergence of the Group Lasso estimate, in the sense that $d(\hat{\alpha}_n, \mathcal{K}) \to 0$. First, we define the empirical risk function
$$R_n(\alpha) = \frac{1}{n} \sum_{i=1}^n (f_\alpha(X_i) - Y_i)^2$$
and note that, since the network of our framework is fixed, the learning problem is continuous and parametric, for which the following standard generalization bound can be obtained (proof in Appendix).

Lemma 3.3 (Generalization bound). For any $\delta > 0$, there exists $c_1(\delta) > 0$ such that
$$|R_n(\alpha) - R(\alpha)| \leq c_1 \frac{\log n}{\sqrt{n}}, \quad \forall \alpha \in \mathcal{W},$$
with probability at least $1 - \delta$.

Combining Lemmas 3.2 and 3.3, we have

Theorem 3.4 (Convergence of Group Lasso). For any $\delta > 0$, there exist $C, C' > 0$ and $N > 0$ such that for all $n \geq N$,
$$d(\hat{\alpha}_n, \mathcal{H}^*) \leq C \left( \lambda_n^{\nu/(\nu-1)} + \frac{\log n}{\sqrt{n}} \right)^{1/\nu} \quad \text{and} \quad \lambda_n \|\hat{v}_n\| \leq 4c_1 \frac{\log n}{\sqrt{n}} + C' \lambda_n\, d(\hat{\alpha}_n, \mathcal{H}^*)$$
with probability at least $1 - \delta$. Moreover, if $\lambda_n \sim n^{-1/4}$, then with probability at least $1 - \delta$,
$$d(\hat{\alpha}_n, \mathcal{H}^*) \leq C \left( \frac{\log n}{n} \right)^{\frac{1}{4(\nu-1)}} \quad \text{and} \quad \|\hat{v}_n\| \leq C \left( \frac{\log n}{n} \right)^{\frac{1}{4(\nu-1)}}.$$

Proof. Let $\phi(\alpha)$ denote the weight vector obtained from $\alpha$ by setting the $v$-components to zero. If we define $\beta_n = \operatorname{argmin}_{\alpha \in \mathcal{H}^*} \|\hat{\alpha}_n - \alpha\|$, then $\phi(\beta_n) \in \mathcal{H}^*$ and $R(\beta_n) = R(\phi(\beta_n))$. Since $L(\alpha)$ is a Lipschitz function, we have
$$c_2\, d(\hat{\alpha}_n, \mathcal{H}^*)^\nu = c_2 \|\hat{\alpha}_n - \beta_n\|^\nu \leq R(\hat{\alpha}_n) - R(\beta_n) \leq 2c_1 \frac{\log n}{\sqrt{n}} + \lambda_n \left( L(\beta_n) - L(\hat{\alpha}_n) \right) \leq 2c_1 \frac{\log n}{\sqrt{n}} + C \lambda_n \|\hat{\alpha}_n - \beta_n\|,$$
which implies (through Young's inequality, details in Appendix) that
$$\|\hat{\alpha}_n - \beta_n\|^\nu \leq C_1 \lambda_n^{\nu/(\nu-1)} + C_2 \frac{\log n}{\sqrt{n}}.$$
Let $K$ denote the part of the regularization term without the $v$-components. We note that $K$ is a Lipschitz function, $K(\phi(\alpha)) = K(\alpha)$ for all $\alpha$, and $R(\phi(\beta_n)) = R(\alpha^*) \leq R(\hat{\alpha}_n)$. Thus,
$$\lambda_n \sum_l \|\hat{v}_n^{[:,l]}\| \leq R_n(\phi(\beta_n)) - R_n(\hat{\alpha}_n) + \lambda_n \left[ K(\phi(\beta_n)) - K(\hat{\alpha}_n) \right] \leq 2c_1 \frac{\log n}{\sqrt{n}} + R(\phi(\beta_n)) - R(\hat{\alpha}_n) + \lambda_n \left[ K(\beta_n) - K(\hat{\alpha}_n) \right] \leq 2c_1 \frac{\log n}{\sqrt{n}} + C \lambda_n \|\hat{\alpha}_n - \beta_n\|.$$
This completes the proof.
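For the reader's convenience, the Young's inequality step can be unpacked as follows (our own sketch; the full details are in the Appendix). Writing $t = \|\hat{\alpha}_n - \beta_n\|$, the first display reads $c_2 t^\nu \leq 2c_1 \frac{\log n}{\sqrt{n}} + C\lambda_n t$. Young's inequality with conjugate exponents $\nu$ and $\nu/(\nu-1)$ gives, for any $\eta > 0$,
$$C\lambda_n t \leq \frac{\nu - 1}{\nu} \left( \frac{C\lambda_n}{\eta} \right)^{\nu/(\nu-1)} + \frac{(\eta t)^\nu}{\nu},$$
so choosing $\eta$ with $\eta^\nu / \nu \leq c_2/2$ and absorbing the resulting $t^\nu$ term into the left-hand side yields $\|\hat{\alpha}_n - \beta_n\|^\nu \leq C_1 \lambda_n^{\nu/(\nu-1)} + C_2 \frac{\log n}{\sqrt{n}}$.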

Together, the two parts of Theorem 3.4 show that the Group Lasso estimator converges to the set of "well-behaved" optimal hypotheses $\mathcal{K} = \{\alpha \in \mathcal{W} : f_\alpha = f_{\alpha^*} \text{ and } v_\alpha = 0\}$ (proof in Appendix).

Corollary 3.5. For any $\delta > 0$, there exist $C > 0$ and $N > 0$ such that for all $n \geq N$,
$$P\left[ d(\hat{\alpha}_n, \mathcal{K}) \leq \frac{4c_1}{\lambda_n} \frac{\log n}{\sqrt{n}} + C \left( \lambda_n^{\nu/(\nu-1)} + \frac{\log n}{\sqrt{n}} \right)^{1/\nu} \right] \geq 1 - \delta.$$

3.3 Feature selection consistency of GroupLasso + AdaptiveGroupLasso

We are now ready to prove the main theorem of our paper.

Theorem 3.6 (Feature selection consistency of GL+AGL). Let $\gamma > 0$, $\epsilon > 0$, $\lambda_n \sim n^{-1/4}$, and $\zeta_n = \Omega(n^{-\gamma/(4\nu-4)+\epsilon})$. Then the GroupLasso+AdaptiveGroupLasso estimator is feature selection consistent.

Proof. Since $d(\hat{\alpha}_n, \mathcal{H}^*) \to 0$ (Theorem 3.4), we have $\min_{\alpha \in \mathcal{H}^*} \|\hat{u}_n^{[:,k]} - u_\alpha^{[:,k]}\| \to 0$ for all $k$. By Lemma 3.1, we conclude that $\|\hat{u}_n^{[:,k]}\|$ is bounded away from zero as $n \to \infty$. Thus,
$$M_n(\alpha^*) = \sum_{k=1}^{n_s} \frac{1}{\|\hat{u}_n^{[:,k]}\|^\gamma} \|u^{*[:,k]}\| < \infty$$
and
$$c_2\, d(\tilde{\alpha}_n, \mathcal{H}^*)^\nu \leq R(\tilde{\alpha}_n) - R(\alpha^*) \leq 2c_1 \frac{\log n}{\sqrt{n}} + \zeta_n \left( M_n(\alpha^*) - M_n(\tilde{\alpha}_n) \right) \leq 2c_1 \frac{\log n}{\sqrt{n}} + \zeta_n M_n(\alpha^*),$$
which shows that $d(\tilde{\alpha}_n, \mathcal{H}^*) \to 0$. Thus $\|\tilde{u}_n^{[:,k]}\|$ is also bounded away from zero for $n$ large enough.

We now assume that $\tilde{v}_n^{[:,k]} \neq 0$ for some $k$ and define a new weight configuration $g_n$ obtained from $\tilde{\alpha}_n$ by setting the $v^{[:,k]}$ component to $0$. By the definition of the estimator $\tilde{\alpha}_n$, we have

$$R_n(\tilde{\alpha}_n) + \zeta_n \frac{1}{\|\hat{v}_n^{[:,k]}\|^\gamma} \|\tilde{v}_n^{[:,k]}\| \leq R_n(g_n).$$
By the (probabilistic) Lipschitzness of the empirical risk (proof in Appendix), there exists $M$ such that

$$\zeta_n \frac{1}{\|\hat{v}_n^{[:,k]}\|^\gamma} \|\tilde{v}_n^{[:,k]}\| \leq R_n(g_n) - R_n(\tilde{\alpha}_n) \leq M \|g_n - \tilde{\alpha}_n\| = M \|\tilde{v}_n^{[:,k]}\|$$
with probability at least $1 - \delta$. Since $\tilde{v}_n^{[:,k]} \neq 0$, we deduce that $\zeta_n \frac{1}{\|\hat{v}_n^{[:,k]}\|^\gamma} \leq M$. This contradicts Theorem 3.4, which proves that for $n$ large enough
$$\zeta_n \frac{1}{\|\hat{v}_n^{[:,k]}\|^\gamma} \geq C \left( \frac{n}{\log n} \right)^{\frac{\gamma}{4(\nu-1)}} \zeta_n \geq 2M$$
with probability at least $1 - \delta$. This completes the proof.

Figure 1: The false positive rate and false negative rate of GL (left panel) and GL+AGL (right panel), respectively. Note that the true positive rate of GL and the true negative rate of GL+AGL are 100% for all but one run and are not shown in the figure.

Remark. One notable aspect of Theorem 3.6 is the absence of regularity conditions on the correlation of the inputs that are often used in lasso-type analyses, such as the Irrepresentable Conditions [Meinshausen and Bühlmann, 2006, Zhao and Yu, 2006], the restricted isometry property [Candes and Tao, 2005], restricted eigenvalue conditions [Bickel et al., 2009, Meinshausen and Yu, 2009], or the sparse Riesz condition [Zhang and Huang, 2008]. We recall that, by using different regularization strengths for individual parameters, adaptive lasso estimators often require less strict conditions for selection consistency than the standard lasso. For linear models, the classical adaptive lasso only assumes that the limiting design matrix is positive definite (i.e., there is no perfect correlation among the inputs) [Zou, 2006]. In our framework, this corresponds to the assumption that the density of $X$ is positive on its open domain (Assumption 2.1), which plays an essential role in the proof of Lemma 3.1.

3.4 Simulations

To further illustrate the theoretical findings of the paper, we use both synthetic and real data to investigate the algorithmic behaviors of GL and GL+AGL.²

The simulations, implemented in PyTorch, focus on single-output deep feed-forward networks with three hidden layers of constant width. In these experiments, regularizing constants are chosen from a coarse grid $\{0.001, 0.01, 0.05, 0.1, 0.5, 1, 2\}$ with $\gamma = 2$, using average test errors from random train-test splits of the corresponding dataset. The algorithms are trained over 20000 epochs using proximal gradient descent, which allows us to identify the exact support of the estimators without having to use a cut-off value for selection; a sketch of the proximal step is given below.
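The exact zeros in the support come from the proximal step. Below is a minimal sketch (our illustration, not the released implementation) of the block soft-thresholding operator applied to the first-layer columns after each gradient step; `lr` and `lam` denote the step size and the regularizing constant $\lambda_n$.

```python
import torch

@torch.no_grad()
def prox_group_lasso(first_layer, lr, lam):
    """Block soft-thresholding of each first-layer column (one block per input
    feature): col <- max(0, 1 - lr * lam / ||col||) * col. Columns whose norm
    falls below lr * lam become exactly zero, so the selected support can be
    read off the weights directly."""
    W = first_layer.weight                              # shape (d1, d0): one column per input
    norms = W.norm(dim=0, keepdim=True).clamp_min(1e-12)
    W.mul_((1.0 - lr * lam / norms).clamp_min(0.0))
```

Alternating an ordinary gradient step on the unpenalized squared loss with a call to `prox_group_lasso(net.first, lr, lam)` gives one iteration of proximal gradient descent for the GL objective; for the adaptive stage, `lam` can simply be rescaled per column by the weights obtained from the GL fit.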

In the first experiment, we consider a network with three hidden layers of 20 nodes. The input consists of 50 features, 10 of which are significant while the others are rendered insignificant by setting the corresponding weights to zero. We generate 100 datasets of size $n = 5000$ from the generic model
$$Y = f_{\alpha^*}(X) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1),$$
where the non-zero weights of $\alpha^*$ are sampled independently from $\mathcal{N}(0, 1)$. We perform GL and GL+AGL on each simulated dataset with regularizing constants chosen using average test errors from three random three-fold train-test splits. We observe that, overall, GL+AGL has superior performance, selecting the correct support in 63 out of 100 runs, while GL cannot identify the support in any run. Except for one pathological case in which both GL and GL+AGL choose a constant model, GL always selects the correct significant inputs but fails to de-select the insignificant ones (Figure 1, left panel), while GL+AGL always performs well with the insignificant inputs but sometimes over-shrinks the significant ones (Figure 1, right panel).
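For concreteness, one simulated dataset of this experiment can be generated roughly as follows (a sketch under our reading of the setup; the input distribution and the `AnalyticFeedForward` class from Section 2 are illustrative assumptions).

```python
import torch

torch.manual_seed(0)
n, d0, n_sig = 5000, 50, 10

teacher = AnalyticFeedForward(d0=d0, widths=[20, 20, 20])  # hypothetical class from Section 2
with torch.no_grad():
    for param in teacher.parameters():
        param.normal_()                           # non-zero weights of alpha* drawn from N(0, 1)
    teacher.first.weight[:, n_sig:] = 0.0         # zero the columns of the 40 insignificant inputs

X = torch.rand(n, d0)                             # inputs from a bounded domain (assumed uniform)
with torch.no_grad():
    Y = teacher(X) + torch.randn(n, 1)            # Y = f_{alpha*}(X) + eps, eps ~ N(0, 1)
```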

² The code is available at https://github.com/vucdinh/alg-net.
³ http://lib.stat.cmu.edu/datasets/boston

Next, we apply the methods to the Boston housing dataset.³ This dataset consists of 506 observations of house prices and 13 predictors. To analyze the data, we consider a network with three hidden layers of 10 nodes. GL and GL+AGL are then performed on this dataset using average test errors from 20 random train-test splits (with the size of the test sets being 25% of the original dataset). GL identifies all 13 predictors as important, while GL+AGL only selects 11 of them. To further investigate the robustness of the results, we follow the approach of Lemhadri et al. [2019] and add 13 random Gaussian noise predictors to the original dataset for analysis. 100 such datasets are created to compare the performance of GL against GL+AGL using the same experimental setting as above. The results are presented in Figure 2, from which we observe that GL struggles to distinguish the random noise predictors from the correct predictors. We note that Lemhadri et al. [2019] identify 11 of the original predictors along with 2 random predictors as the optimal set of features for prediction, which is consistent with the performance of GL+AGL.

Figure 2: Performance on the Boston Housing dataset: the frequency of being selected by GL (blue) and GL+AGL (red) for each of the 26 predictors. Note that predictors 14–26 are random Gaussian noise predictors while 1–13 are from the original dataset.
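Since proximal training drives entire first-layer columns to exact zeros, the selection frequencies in Figure 2 can be read directly off the fitted networks; a helper of the following kind (our illustration, reusing the hypothetical `first` layer from Section 2) suffices.

```python
import torch

def selected_features(net, tol=0.0):
    """Feature k is selected iff the k-th column of the first-layer weight
    matrix is not (numerically) zero, cf. Definition 2.3."""
    col_norms = net.first.weight.detach().norm(dim=0)
    return (col_norms > tol).nonzero(as_tuple=True)[0].tolist()

# Selection frequency of predictor k over repeated fits:
#   freq[k] = (# runs with k in selected_features(fitted_net)) / (# runs)
```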

4 Conclusions and Discussions

In this work, we prove that GL+AGL is feature-selection-consistent for all analytic deep networks that interact with inputs through a finite set of linear units. Both theoretical and simulation results of the work advocate the use of GL+AGL over the popular Group Lasso for feature selection.

4.1 Comparison to related works

To the best of our knowledge, this is the first work that establishes selection consistency for deep networks. This is in contrast to Dinh and Ho [2020], Liang et al. [2018] and Ye and Sun [2018], which only provide results for shallow networks with one hidden layer, or Polson and Ročková [2018], Feng and Simon [2017], Liu [2019], which focus on posterior concentration, prediction consistency, parameter-estimation consistency and convergence of feature importance.

We note that for the classical linear model with the lasso estimate, it is known (see Section 2 of Zou [2006] and the discussion therein) that the lasso estimator is parameter-estimation consistent as long as the regularizing parameter $\lambda_n \to 0$ (Zou [2006], Lemmas 2 and 3), but is not feature-selection consistent for $\lambda_n \sim n^{-1/2}$ (Zou [2006], Proposition 1), or for any choice of $\lambda_n$ if some necessary condition on the covariance matrix is not satisfied (Zou [2006], Theorem 1). For both linear models and neural networks, parameter-estimation consistency directly implies prediction consistency and convergence of feature importance. In general, there is no known trivial way to extend these approaches to obtain feature-selection consistency.

Moreover, existing works usually assume that the network used for training has exactly the same size as the minimal network that generates the data. This assumption is either made explicitly (as in Dinh and Ho [2020]) or implicitly implied by the regularity of the Hessian matrix at the optima (Assumption 5 in Ye and Sun [2018] and Condition 1 in Feng and Simon [2017]). We note that these latter two conditions cannot be satisfied if the size of the training network is not "correct" (for example, when data is generated by a one-hidden-layer network with 5 hidden nodes, but the training network is one with 10 nodes). The framework of our paper does not have this restriction. Finally, since our paper focuses on model interpretability, the framework has been constructed in such a way that all assumptions are minimal and verifiable. This is in contrast to many previous results. For example, Ye and Sun [2018] takes Assumption 6 (which is difficult to check) as given, while we can avoid this assumption by using Lemma 3.2.

4.2 Future works

There are many avenues for future directions. First, while simulations seem to hint that Group Lasso may not be optimal for feature selection with neural networks, a rigorous answer to this hypothesis requires a deeper understanding of the behavior of the estimator that is beyond the scope of this paper. Second, the analyses in this work rely on the fact that the network model is analytic, which puts some restrictions on the type of activation function that can be used. While many tools to study non-analytic networks (as well as other learning settings, e.g., classification, regression with a different loss function, or networks that do not have a strict structure of layers) already exist, such extensions of the results require non-trivial efforts. Throughout the paper, we assume that the network size ($p$) is fixed while the sample size ($n$) increases. Although this assumption is reasonable in many application settings, in some contexts, for example the task of identifying genes that increase the risk of a type of cancer, one would be interested in the case $p \gg n$. There are existing results of this type for neural networks on the prediction aspect of the problem [Feng and Simon, 2017, Farrell et al., 2018], and it would be of general interest to see how analyses of selection consistency apply in those cases. Finally, since the main interest of this work is theoretical, many aspects of the performance of GL+AGL across different experimental settings and types of networks are left as subjects of future work.

Acknowledgments

LSTH was supported by startup funds from Dalhousie University, the Canada Research Chairs program, and the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant RGPIN-2018-05447. VD was supported by a startup fund from University of Delaware and National Science Foundation grant DMS-1951474.

Broader Impact

Deep learning has transformed modern science in an unprecedented manner and created a new force for technological developments. However, its black-box nature and the lack of theoretical justification have hindered its applications in fields where correct interpretations play an essential role. In many applications, a linear model with a justified confidence interval and a rigorous feature selection procedure is much more favored than a deep learning system that cannot be interpreted. The use of deep learning in processes that require transparency, such as judicial and public decisions, is still completely out of the question.

To the best of our knowledge, this is the first work that establishes feature selection consistency, an important cornerstone of interpretable statistical inference, for deep learning. The results of this work will greatly extend the set of problems to which statistical inference with deep learning can be applied. Medical sciences, public health decisions, and various fields of engineering, which depend upon well-founded estimates of uncertainty, fall naturally within the domain this work tries to explore. Researchers from these fields and the public alike may benefit from such a development, and no one is put at a disadvantage by this research. By trying to select a parsimonious and transparent model out of an over-parametrized deep learning system, the approach of this work further provides a systematic way to detect and reduce bias in analysis. The analytical tools and the theoretical framework derived in this work may also be of independent interest in statistics, machine learning, and other fields of applied sciences.

References

Samuel Ainsworth, Nicholas Foti, Adrian KC Lee, and Emily Fox. Interpretable VAEs for nonlinear group factor analysis. arXiv preprint arXiv:1802.06765, 2018.

Peter J Bickel, Ya'acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.

Emmanuel J Candes and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.

An Mei Chen, Haw-minn Lu, and Robert Hecht-Nielsen. On the geometry of feedforward neural network error surfaces. Neural Computation, 5(6):910–927, 1993.

Travers Ching, Daniel S Himmelstein, Brett K Beaulieu-Jones, Alexandr A Kalinin, Brian T Do, Gregory P Way, Enrico Ferrero, Paul-Michael Agapow, Michael Zietz, and Michael M Hoffman. Opportunities and obstacles for deep learning in biology and medicine. Journal of The Royal Society Interface, 15(141):20170387, 2018.

Vu Dinh and Lam Ho. Consistent feature selection for neural networks via Adaptive Group Lasso. arXiv preprint arXiv:2006.00334, 2020.

Hasan Fallahgoul, Vincentius Franstianto, and Gregoire Loeper. Towards explaining the ReLU feed-forward network. Available at SSRN, 2019.

Max H Farrell, Tengyuan Liang, and Sanjog Misra. Deep neural networks for estimation and inference. arXiv preprint arXiv:1809.09953, 2018.

Charles Fefferman and Scott Markel. Recovering a feed-forward net from its output. In Advances in Neural Information Processing Systems, pages 335–342, 1994.

Jean Feng and Noah Simon. Sparse-input neural networks for high-dimensional nonparametric regression and classification. arXiv preprint arXiv:1711.07592, 2017.

Enguerrand Horel and Kay Giesecke. Towards Explainable AI: Significance tests for neural networks. arXiv preprint arXiv:1902.06021, 2019.

Rania Ibrahim, Noha A Yousri, Mohamed A Ismail, and Nagwa M El-Makky. Multi-level gene/MiRNA feature selection using deep belief nets and active learning. In 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pages 3957–3960. IEEE, 2014.

Shanyu Ji, János Kollár, and Bernard Shiffman. A global Lojasiewicz inequality for algebraic varieties. Transactions of the American Mathematical Society, 329(2):813–818, 1992.

Ismael Lemhadri, Feng Ruan, and Robert Tibshirani. A neural network with feature sparsity. arXiv preprint arXiv:1907.12207, 2019.

Yifeng Li, Chih-Yu Chen, and Wyeth W Wasserman. Deep feature selection: theory and application to identify enhancers and promoters. Journal of Computational Biology, 23(5):322–336, 2016.

Faming Liang, Qizhai Li, and Lei Zhou. Bayesian neural networks for selection of drug sensitive genes. Journal of the American Statistical Association, 113(523):955–972, 2018.

Jeremiah Zhe Liu. Variable selection with rigorous uncertainty quantification using deep Bayesian neural networks: Posterior concentration and Bernstein-von Mises phenomenon. arXiv preprint arXiv:1912.01189, 2019.

Yang Lu, Yingying Fan, Jinchi Lv, and William Stafford Noble. DeepPINK: reproducible feature selection in deep neural networks. In Advances in Neural Information Processing Systems, pages 8676–8686, 2018.

Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774, 2017.

Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the Lasso. The Annals of Statistics, 34(3):1436–1462, 2006.

Nicolai Meinshausen and Bin Yu. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 37(1):246–270, 2009.

Milad Zafar Nezhad, Dongxiao Zhu, Xiangrui Li, Kai Yang, and Phillip Levy. SAFS: A deep feature selection approach for precision medicine. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 501–506. IEEE, 2016.

Nicholas G Polson and Veronika Ročková. Posterior concentration for sparse deep learning. In Advances in Neural Information Processing Systems, pages 930–941, 2018.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.

Simone Scardapane, Danilo Comminiello, Amir Hussain, and Aurelio Uncini. Group sparse regularization for deep neural networks. Neurocomputing, 241:81–89, 2017.

Xiaoxi Shen, Chang Jiang, Lyudmila Sakhanenko, and Qing Lu. Asymptotic properties of neural network sieve estimators. arXiv preprint arXiv:1906.00875, 2019.

Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3145–3153, 2017.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

Aboozar Taherkhani, Georgina Cosma, and T Martin McGinnity. Deep-FS: A feature selection algorithm for Deep Boltzmann Machines. Neurocomputing, 322:22–37, 2018.

Alex Tank, Ian Covert, Nicholas Foti, Ali Shojaie, and Emily Fox. Neural Granger causality for nonlinear time series. arXiv preprint arXiv:1802.05842, 2018.

Hansheng Wang and Chenlei Leng. A note on Adaptive Group Lasso. Computational Statistics & Data Analysis, 52(12):5277–5286, 2008.

Mao Ye and Yan Sun. Variable selection via penalized neural network: a drop-out-one loss approach. In International Conference on Machine Learning, pages 5620–5629, 2018.

Cheng Zhang, Vu Dinh, and Frederick A Matsen IV. Non-bifurcating phylogenetic tree inference via the Adaptive Lasso. Journal of the American Statistical Association, 2018. arXiv preprint arXiv:1805.11073.

Cun-Hui Zhang and Jian Huang. The sparsity and bias of the Lasso selection in high-dimensional linear regression. The Annals of Statistics, 36(4):1567–1594, 2008.

Huaqing Zhang, Jian Wang, Zhanquan Sun, Jacek M Zurada, and Nikhil R Pal. Feature selection for neural networks using Group Lasso regularization. IEEE Transactions on Knowledge and Data Engineering, 2019.

Lei Zhao, Qinghua Hu, and Wenwu Wang. Heterogeneous feature selection with multi-modal deep neural networks and Sparse Group Lasso. IEEE Transactions on Multimedia, 17(11):1936–1948, 2015.

Peng Zhao and Bin Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7(Nov):2541–2563, 2006.

Hui Zou. The Adaptive Lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.
