Links between Perceptrons, MLPs and SVMs

Ronan Collobert [email protected]
Samy Bengio [email protected]
IDIAP, Rue du Simplon 4, 1920 Martigny, Switzerland

Abstract

We propose to study links between three important classification algorithms: Perceptrons, Multi-Layer Perceptrons (MLPs) and Support Vector Machines (SVMs). We first study ways to control the capacity of Perceptrons (mainly regularization parameters and early stopping), using the margin idea introduced with SVMs. After showing that under simple conditions a Perceptron is equivalent to an SVM, we show that it can be computationally expensive in time to train an SVM (and thus a Perceptron) with stochastic gradient descent, mainly because of the margin maximization term in the cost function. We then show that if we remove this margin maximization term, the learning rate or the use of early stopping can still control the margin. These ideas are extended afterward to the case of MLPs. Moreover, under some assumptions it also appears that MLPs are a kind of mixture of SVMs, maximizing the margin in the hidden layer space. Finally, we present a very simple MLP based on the previous findings, which yields better performance in generalization and speed than the other models.

1. Introduction

Since the Perceptron algorithm proposed by Rosenblatt (1957), several machine learning algorithms have been proposed for classification problems. Among them, two algorithms had a large impact on the research community. The first one is the Multi-Layer Perceptron (MLP), which became useful with the introduction of the back-propagation training algorithm (LeCun, 1987; Rumelhart et al., 1986). The second one, proposed more recently, is the Support Vector Machine algorithm (SVM) (Vapnik, 1995), which supplied the large margin classifier idea, known to improve the generalization performance for binary classification tasks. The aim of this paper is to give a better understanding of Perceptrons and MLPs applied to binary classification, using the knowledge of the margin introduced with SVMs. Indeed, one of the key problems in machine learning is the control of the generalization ability of the models, and the margin idea introduced in SVMs appears to be a nice way to control it (Vapnik, 1995).

After the definition of our mathematical framework in the next section, we first point out the equivalence between a Perceptron trained with a particular criterion (for binary classification) and a linear SVM. As Perceptrons are in general trained with stochastic gradient descent (LeCun et al., 1998), and SVMs are trained by minimizing a quadratic problem under constraints (Vapnik, 1995), the only remaining difference appears to be the training method. Based on this remark, we then study traditional ways to control the capacity of a Perceptron trained with stochastic gradient descent. We first focus on the use of a weight decay parameter in section 3, which can lead to a computationally expensive training time. Then, we show in section 4 that if we remove this weight decay term during stochastic gradient descent, we can still control the margin, using for instance the learning rate or the mechanism of early stopping. These ideas are then extended in section 6 to MLPs with one hidden layer, for binary classification. Under some assumptions it also appears that MLPs are a kind of mixture of SVMs, maximizing the margin in the hidden layer space. Based on the previous findings, we finally propose a simplified version of MLPs, easy to implement and fast to train, which gives in general better generalization performance than the other models.

2. Framework Definition

We consider a two-class classification problem, given a training set (x_l, y_l)_{l=1...L} with (x_l, y_l) ∈ R^n × {−1, 1}. Here x_l represents the input vector of the l-th example, and y_l represents its corresponding class. We focus on three models to perform this classification: Perceptrons (including their non-linear extension called Φ-machines¹), Multi-Layer Perceptrons (MLPs), and Support Vector Machines (SVMs). The decision function f_θ(·) of all these models can be written in the general form

f_θ(x) = w · Φ(x) + b ,   (1)

where w is a real vector of weights and b ∈ R is a bias. The generic vector of parameters θ represents all the parameters of the decision function (that is, w, b, and all parameters of Φ, if any). For Φ-machines and SVMs, Φ is an arbitrarily chosen function; in the special case Φ(x) = x, they reduce respectively to Perceptrons and linear SVMs. We also consider MLPs with one hidden layer and N hidden units. In that case the hidden layer can be decomposed as Φ(·) = (Φ_1(·), Φ_2(·), ..., Φ_N(·)), where the i-th hidden unit is described by

Φ_i(x) = h(v_i · x + d_i) .   (2)

Here (v_i, d_i) ∈ R^n × R represents the weights of the i-th hidden unit, and h is a transfer function, usually a sigmoid or a hyperbolic tangent. Note that the hyperspace defined by {Φ(x), x ∈ R^n} will be called "feature space", or sometimes "hidden layer space" in the special case of MLPs.

¹ Φ-machines were introduced very early in the machine learning community (Nilsson, 1965). They are to Perceptrons what non-linear SVMs are to SVMs: after choosing an arbitrary function Φ which maps the input space into another arbitrary space, we apply the Perceptron algorithm to the examples (Φ(x_l), y_l) instead of (x_l, y_l), which allows non-linear discrimination over training examples.
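To make these definitions concrete, here is a minimal NumPy sketch (ours, not from the paper) of the decision function (1) with the one-hidden-layer feature map (2). The dimensions, the random parameter values and the choice h = tanh are illustrative assumptions; replacing phi with the identity recovers the Perceptron / linear SVM case Φ(x) = x.

    import numpy as np

    # Illustrative sketch of eqs (1)-(2); names n, N, rng and the values are ours.
    rng = np.random.default_rng(0)
    n, N = 5, 3                      # input dimension and number of hidden units

    V = rng.normal(size=(N, n))      # rows are the hidden-unit weights v_i
    d = rng.normal(size=N)           # hidden-unit biases d_i
    w = rng.normal(size=N)           # output weights
    b = 0.0                          # output bias

    def phi(x):
        """Feature map of eq. (2): Phi_i(x) = h(v_i . x + d_i), with h = tanh here."""
        return np.tanh(V @ x + d)

    def f_theta(x):
        """Decision function of eq. (1): f_theta(x) = w . Phi(x) + b.
        With phi(x) = x instead, this is a plain Perceptron / linear SVM."""
        return w @ phi(x) + b

    x = rng.normal(size=n)
    print(np.sign(f_theta(x)))       # predicted class in {-1, +1}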
Let us now review the SVM algorithm (Vapnik, 1995): it aims at finding a hyperplane f_θ(·) in the feature space which separates the two classes while maximizing the margin (in the feature space) between this hyperplane and the two classes. This is equivalent to minimizing the cost function

θ → (µ/2) ‖w‖² + (1/L) ∑_{l=1}^{L} |1 − y_l f_θ(x_l)|_+ ,   (3)

where |z|_+ = max(0, z). Here µ ∈ R^+ is a hyper-parameter (also called the weight decay parameter) which has to be tuned, and which controls the trade-off between the maximization of the margin and the number of margin errors. SVMs are usually trained by minimizing a quadratic problem under constraints (Vapnik, 1995): if we introduce Lagrange multipliers α = (α_1, α_2, ..., α_L) ∈ R^L, the minimization of (3) is equivalent to minimizing in the "dual space"

α → −∑_{l=1}^{L} α_l + (1/(2µ)) ∑_{k=1}^{L} ∑_{l=1}^{L} y_k y_l α_k α_l Φ(x_k) · Φ(x_l) ,   (4)

subject to the constraints

∑_{l=1}^{L} y_l α_l = 0  and  0 ≤ α_l ≤ 1/L  ∀l .   (5)

The weight w is then given² by

w = (1/µ) ∑_{l=1}^{L} y_l α_l Φ(x_l) .   (6)

Note that in general only a few α_l will be non-zero; the corresponding training examples (x_l, y_l) are called "support vectors" (Vapnik, 1995) and give a sparse description of the weight w. Efficient optimization of (4) has been proposed by Joachims (1999). Note also that the inner products Φ(x_k) · Φ(x_l) in the feature space are usually replaced by a kernel function, which allows efficient inner products in high dimensional feature spaces. However, we will not focus on kernels in this paper.

² Traditional SVM notation uses a hyper-parameter C instead of the weight decay parameter µ. Here we chose a notation coming from MLPs to establish clear links between all the models. The relation is given simply by C = 1/(Lµ). The Lagrange multipliers α_l given here also have to be divided by µ to recover the traditional SVM notation.
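As an illustration of this dual parametrization (a sketch under our own toy values, not code from the paper), the following checks the constraints (5), recovers the primal weight w through eq. (6), and reads off the support vectors as the examples with non-zero α_l. In practice α would come from a quadratic-programming solver such as the one of Joachims (1999); here it is simply hard-coded.

    import numpy as np

    # Toy linear case Phi(x_l) = x_l; the data, alpha and mu are made-up values.
    L, n = 4, 2
    X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    mu = 0.1
    alpha = np.array([0.25, 0.0, 0.25, 0.0])      # toy values chosen to satisfy (5)

    # Constraints (5): sum_l y_l alpha_l = 0 and 0 <= alpha_l <= 1/L.
    assert abs(np.dot(y, alpha)) < 1e-12 and np.all((alpha >= 0) & (alpha <= 1.0 / L))

    # Eq. (6): w = (1/mu) * sum_l y_l alpha_l Phi(x_l).
    w = (y * alpha) @ X / mu
    support_vectors = np.flatnonzero(alpha > 0)   # the few examples with non-zero alpha_l

    # Footnote 2: traditional SVM notation is recovered with C = 1/(L*mu),
    # and the traditional multipliers are alpha / mu.
    C = 1.0 / (L * mu)
    print(w, support_vectors, C)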
The training of Perceptrons, Φ-machines and MLPs is usually achieved by minimizing a given criterion Q(f_θ(·), ·) over the training set,

θ → (1/L) ∑_{l=1}^{L} Q(f_θ(x_l), y_l) ,   (7)

using gradient descent, until reaching a local optimum. We will focus on stochastic gradient descent, which gives good results in time and training performance for large data sets (LeCun et al., 1998). Note that the original Perceptron algorithm (Rosenblatt, 1957) is equivalent to using the criterion

Q(f_θ(x_l), y_l) = |−y_l f_θ(x_l)|_+ .   (8)

However, as this criterion has some limitations in practice (see Bishop, 1995), we will not focus on it. Instead, the criterion usually chosen (for all kinds of MLPs) is either the Mean Squared Error or the Cross-Entropy criterion (Bishop, 1995). Minimizing (7) using the Mean Squared Error criterion is equivalent, from a likelihood perspective, to maximizing the likelihood under the hypothesis that the classes y_l are generated from a smooth function with added Gaussian noise.

[Figure 1. The Cross-Entropy criterion z → log(1 + exp(−z)) versus the margin criterion z → |1 − z|_+.]

We will consider the minimization of

θ → (µ/2) ‖w‖² + (ν/2) ∑_i ‖v_i‖² + (1/L) ∑_l |1 − y_l f_θ(x_l)|_+ .   (11)

With this criterion, it appears obvious that Φ-machines (respectively Perceptrons) are equivalent to SVMs (respectively linear SVMs). Only the training method remains different: Perceptrons are trained by stochastic gradient descent on (3), whereas SVMs are trained by minimizing the quadratic problem (4) under constraints (see Joachims, 1999, for details on SVM training). Thus, we will study in the following two sections the training process of Perceptrons (or linear SVMs) with stochastic gradient descent.
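To illustrate the training-method difference (our sketch, not the authors' implementation), stochastic (sub)gradient descent on criterion (3) in the linear case Φ(x) = x can be written as follows. The learning rate eta, the number of epochs, the toy data and the convention of applying the weight decay term at every update are assumptions. For |1 − y f|_+, the subgradient with respect to w is −y x when y f(x) < 1 and 0 otherwise, while the term (µ/2)‖w‖² contributes µ w.

    import numpy as np

    def train_perceptron_hinge(X, y, mu=0.01, eta=0.1, epochs=100, seed=0):
        """Stochastic subgradient descent on criterion (3) with Phi(x) = x (sketch)."""
        rng = np.random.default_rng(seed)
        L, n = X.shape
        w, b = np.zeros(n), 0.0
        for _ in range(epochs):
            for l in rng.permutation(L):
                f = w @ X[l] + b
                if y[l] * f < 1.0:                 # inside the margin: hinge term is active
                    w -= eta * (mu * w - y[l] * X[l])
                    b -= eta * (-y[l])
                else:                              # only the weight decay term acts
                    w -= eta * mu * w
        return w, b

    # Toy, linearly separable data (ours), just to exercise the update rule.
    X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    w, b = train_perceptron_hinge(X, y)
    print(np.sign(X @ w + b))                      # should match y on this toy set

Dropping the mu * w term in the update corresponds to the setting studied in section 4, where the margin is controlled by the learning rate or by early stopping rather than by the weight decay.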