Distributed Training Strategies for the Structured Perceptron
Ryan McDonald   Keith Hall   Gideon Mann
Google, Inc., New York / Zurich
{ryanmcd|kbhall|gmann}@google.com

Abstract

Perceptron training is widely applied in the natural language processing community for learning complex structured models. Like all structured prediction learning frameworks, the structured perceptron can be costly to train as training complexity is proportional to inference, which is frequently non-linear in example sequence length. In this paper we investigate distributed training strategies for the structured perceptron as a means to reduce training times when computing clusters are available. We look at two strategies and provide convergence bounds for a particular mode of distributed structured perceptron training based on iterative parameter mixing (or averaging). We present experiments on two structured prediction problems – named-entity recognition and dependency parsing – to highlight the efficiency of this method.

1 Introduction

One of the most popular training algorithms for structured prediction problems in natural language processing is the perceptron (Rosenblatt, 1958; Collins, 2002). The structured perceptron has many desirable properties, most notably that there is no need to calculate a partition function, which is necessary for other structured prediction paradigms such as CRFs (Lafferty et al., 2001). Furthermore, it is robust to approximate inference, which is often required for problems where the search space is too large and where strong structural independence assumptions are insufficient, such as parsing (Collins and Roark, 2004; McDonald and Pereira, 2006; Zhang and Clark, 2008) and machine translation (Liang et al., 2006). However, like all structured prediction learning frameworks, the structured perceptron can still be cumbersome to train. This is due both to the increasing size of available training sets and to the fact that training complexity is proportional to inference, which is frequently non-linear in sequence length, even with strong structural independence assumptions.

In this paper we investigate distributed training strategies for the structured perceptron as a means of reducing training times when large computing clusters are available. Traditional machine learning algorithms are typically designed for a single machine, and designing an efficient training mechanism for analogous algorithms on a computing cluster – often via a map-reduce framework (Dean and Ghemawat, 2004) – is an active area of research (Chu et al., 2007). However, unlike many batch learning algorithms that can easily be distributed through the gradient calculation, a distributed training analog for the perceptron is less clear cut. It employs online updates and its loss function is technically non-convex.

A recent study by Mann et al. (2009) has shown that distributed training through parameter mixing (or averaging) for maximum entropy models can be empirically powerful and has strong theoretical guarantees. A parameter mixing strategy, which can be applied to any parameterized learning algorithm, trains separate models in parallel, each on a disjoint subset of the training data, and then takes an average of all the parameters as the final model (this averaging step is written out in the equation at the end of this section). In this paper, we provide results which suggest that the perceptron is ill-suited for straight-forward parameter mixing, even though it is commonly used for large-scale structured learning, e.g., Whitelaw et al. (2008) for named-entity recognition. However, a slight modification we call iterative parameter mixing can be shown to: 1) have similar convergence properties to the standard perceptron algorithm, 2) find a separating hyperplane if the training set is separable, 3) reduce training times significantly, and 4) produce models with comparable (or superior) accuracies to those trained serially on all the data.
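To make the averaging step concrete: if w^(1), ..., w^(S) are the weight vectors trained independently on S disjoint shards, the mixed model is their convex combination. This simply restates line 3 of Figure 2 (Section 4), which introduces the mixing coefficients µ_i; uniform mixing, µ_i = 1/S, is one choice satisfying the constraints.

\[
\mathbf{w} \;=\; \sum_{i=1}^{S} \mu_i\, \mathbf{w}^{(i)}, \qquad \mu_i \ge 0, \quad \sum_{i=1}^{S} \mu_i = 1 .
\]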
2 Related Work

Distributed cluster computation for many batch training algorithms has previously been examined by Chu et al. (2007), among others. Much of the relevant prior work on online (or sub-gradient) distributed training has focused on asynchronous optimization via gradient descent. In this scenario, multiple machines run stochastic gradient descent simultaneously as they update and read from a shared parameter vector asynchronously. Early work by Tsitsiklis et al. (1986) demonstrated that if the delay between model updates and reads is bounded, then asynchronous optimization is guaranteed to converge. Recently, Zinkevich et al. (2009) performed a similar type of analysis for online learners with asynchronous updates via stochastic gradient descent. The asynchronous algorithms in these studies require shared memory between the distributed computations and are less suitable to the more common cluster computing environment, which is what we study here.

While we focus on the perceptron algorithm, there is a large body of work on training structured prediction classifiers. For batch training the most common is conditional random fields (CRFs) (Lafferty et al., 2001), which is the structured analog of maximum entropy. As such, its training can easily be distributed through the gradient or sub-gradient computations (Finkel et al., 2008). However, unlike the perceptron, CRFs require the computation of a partition function, which is often expensive and sometimes intractable. Other batch learning algorithms include M3Ns (Taskar et al., 2004) and Structured SVMs (Tsochantaridis et al., 2004). Due to their efficiency, online learning algorithms have gained attention, especially for structured prediction tasks in NLP. In addition to the perceptron (Collins, 2002), others have looked at stochastic gradient descent (Zhang, 2004), passive aggressive algorithms (McDonald et al., 2005; Crammer et al., 2006), the recently introduced confidence weighted learning (Dredze et al., 2008) and coordinate descent algorithms (Duchi and Singer, 2009).

3 Structured Perceptron

Perceptron(T = {(x_t, y_t)}_{t=1}^{|T|})
1. w^(0) = 0; k = 0
2. for n : 1..N
3.   for t : 1..T
4.     Let y′ = arg max_{y′} w^(k) · f(x_t, y′)
5.     if y′ ≠ y_t
6.       w^(k+1) = w^(k) + f(x_t, y_t) − f(x_t, y′)
7.       k = k + 1
8. return w^(k)

Figure 1: The perceptron algorithm.

The structured perceptron was introduced by Collins (2002) and we adopt much of the notation and presentation of that study. The structured perceptron algorithm – which is identical to the multi-class perceptron – is shown in Figure 1. The perceptron is an online learning algorithm and processes training instances one at a time during each epoch of training. Lines 4-6 are the core of the algorithm. For an input-output training instance pair (x_t, y_t) ∈ T, the algorithm predicts a structured output y′ ∈ Y_t, where Y_t is the space of permissible structured outputs for input x_t, e.g., parse trees for an input sentence. This prediction is determined by a linear classifier based on the dot product between a high-dimensional feature representation of a candidate input-output pair f(x, y) ∈ R^M and a corresponding weight vector w ∈ R^M, which are the parameters of the model.¹ If this prediction is incorrect, then the parameters are updated to add weight to features for the corresponding correct output y_t and take weight away from features for the incorrect output y′. For structured prediction, the inference step in line 4 is problem dependent, e.g., CKY for context-free parsing.

¹ The perceptron can be kernelized for non-linearity.
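For concreteness, the training loop of Figure 1 can be sketched in Python as follows. This is an illustrative sketch, not the implementation used in the paper: the feature map features(x, y) and the candidate generator candidates(x) are hypothetical stand-ins, and the explicit argmax over candidates(x) replaces the problem-dependent inference of line 4 (e.g., Viterbi or CKY), which is only feasible to enumerate for small output spaces.

from collections import defaultdict

def train_structured_perceptron(data, candidates, features, epochs):
    """Sketch of Figure 1. `data` is a list of (x, y) pairs, `candidates(x)`
    enumerates the permissible outputs Y_x, and `features(x, y)` returns a
    sparse feature vector as a dict. The mistake counter k of Figure 1 is
    omitted since only the final weights are returned here."""
    w = defaultdict(float)                       # w^(0) = 0

    def score(x, y):
        return sum(w[f] * v for f, v in features(x, y).items())

    for _ in range(epochs):                      # for n : 1..N
        for x, y_gold in data:                   # for t : 1..T
            y_pred = max(candidates(x), key=lambda y: score(x, y))   # line 4
            if y_pred != y_gold:                 # line 5
                for f, v in features(x, y_gold).items():   # line 6: add correct output
                    w[f] += v
                for f, v in features(x, y_pred).items():   # line 6: subtract prediction
                    w[f] -= v
    return dict(w)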
A training set T is separable with margin γ > 0 if there exists a vector u ∈ R^M with ‖u‖ = 1 such that u · f(x_t, y_t) − u · f(x_t, y′) ≥ γ, for all (x_t, y_t) ∈ T and for all y′ ∈ Y_t such that y′ ≠ y_t. Furthermore, let R ≥ ‖f(x_t, y_t) − f(x_t, y′)‖, for all (x_t, y_t) ∈ T and y′ ∈ Y_t. A fundamental theorem of the perceptron is as follows:

Theorem 1 (Novikoff (1962)). Assume training set T is separable by margin γ. Let k be the number of mistakes made training the perceptron (Figure 1) on T. If training is run indefinitely, then k ≤ R²/γ².

Proof. See Collins (2002) Theorem 1.

Theorem 1 implies that if T is separable then 1) the perceptron will converge in a finite amount of time, and 2) it will produce a w that separates T. Collins also proposed a variant of the structured perceptron where the final weight vector is a weighted average of all parameters that occur during training, which he called the averaged perceptron and which can be viewed as an approximation to the voted perceptron algorithm (Freund and Schapire, 1999).

4 Distributed Structured Perceptron

In this section we examine two distributed training strategies for the perceptron algorithm based on parameter mixing. The first strategy, parameter mixing, is shown in Figure 2.

PerceptronParamMix(T = {(x_t, y_t)}_{t=1}^{|T|})
1. Shard T into S pieces T = {T_1, ..., T_S}
2. w^(i) = Perceptron(T_i)   †
3. w = Σ_i µ_i w^(i)   ‡
4. return w

Figure 2: Distributed perceptron using a parameter mixing strategy. † Each w^(i) is computed in parallel. ‡ µ = {µ_1, ..., µ_S}, ∀µ_i ∈ µ: µ_i ≥ 0 and Σ_i µ_i = 1.

For maximum entropy models, Mann et al. (2009) bound the difference between parameters trained on all the data serially versus parameters trained with parameter mixing. However, their analysis requires a stability bound on the parameters of a regularized maximum entropy model, which is not known to hold for the perceptron. In Section 5, we present empirical results showing that parameter mixing for distributed perceptron can be sub-optimal. Additionally, Dredze et al. (2008) present negative parameter mixing results for confidence weighted learning, which is another online learning algorithm. The following theorem may help explain this behavior.
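The parameter mixing procedure of Figure 2 can be sketched in the same style as the perceptron sketch above. The sketch reuses the hypothetical train_structured_perceptron helper from that sketch; the round-robin sharding and the serial loop over shards stand in for the data distribution and parallel per-shard training that a real cluster (e.g., map-reduce) implementation would perform.

def perceptron_param_mix(data, candidates, features, epochs, num_shards, mix=None):
    """Sketch of Figure 2. Each shard would normally be trained on a separate
    machine; the list comprehension below stands in for that parallel step."""
    # 1. Shard T into S disjoint pieces (round-robin split for illustration).
    shards = [data[i::num_shards] for i in range(num_shards)]

    # 2. Train a separate perceptron on each shard (in parallel in practice).
    models = [train_structured_perceptron(shard, candidates, features, epochs)
              for shard in shards]

    # 3. Mix: w = sum_i mu_i * w^(i), with mu_i >= 0 and sum_i mu_i = 1.
    #    Uniform mixing (mu_i = 1/S) is used here as a simple valid choice.
    if mix is None:
        mix = [1.0 / num_shards] * num_shards
    w = {}
    for mu, model in zip(mix, models):
        for f, v in model.items():
            w[f] = w.get(f, 0.0) + mu * v
    return w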