Distributed Training Strategies for the Structured Perceptron

Ryan McDonald    Keith Hall    Gideon Mann
Google, Inc., New York / Zurich
{ryanmcd|kbhall|gmann}@google.com

Abstract

Perceptron training is widely applied in the natural language processing community for learning complex structured models. Like all structured prediction learning frameworks, the structured perceptron can be costly to train as training complexity is proportional to inference, which is frequently non-linear in example sequence length. In this paper we investigate distributed training strategies for the structured perceptron as a means to reduce training times when computing clusters are available. We look at two strategies and provide convergence bounds for a particular mode of distributed structured perceptron training based on iterative parameter mixing (or averaging). We present experiments on two structured prediction problems – named-entity recognition and dependency parsing – to highlight the efficiency of this method.

1 Introduction

One of the most popular training algorithms for structured prediction problems in natural language processing is the perceptron (Rosenblatt, 1958; Collins, 2002). The structured perceptron has many desirable properties, most notably that there is no need to calculate a partition function, which is necessary for other structured prediction paradigms such as CRFs (Lafferty et al., 2001). Furthermore, it is robust to approximate inference, which is often required for problems where the search space is too large and where strong structural independence assumptions are insufficient, such as parsing (Collins and Roark, 2004; McDonald and Pereira, 2006; Zhang and Clark, 2008) and machine translation (Liang et al., 2006). However, like all structured prediction learning frameworks, the structured perceptron can still be cumbersome to train. This is due both to the increasing size of available training sets and to the fact that training complexity is proportional to inference, which is frequently non-linear in sequence length, even with strong structural independence assumptions.

In this paper we investigate distributed training strategies for the structured perceptron as a means of reducing training times when large computing clusters are available. Traditional machine learning algorithms are typically designed for a single machine, and designing an efficient training mechanism for analogous algorithms on a computing cluster – often via a map-reduce framework (Dean and Ghemawat, 2004) – is an active area of research (Chu et al., 2007). However, unlike many batch learning algorithms that can easily be distributed through the gradient calculation, a distributed training analog for the perceptron is less clear cut. It employs online updates and its loss function is technically non-convex. A recent study by Mann et al. (2009) has shown that distributed training through parameter mixing (or averaging) for maximum entropy models can be empirically powerful and has strong theoretical guarantees. A parameter mixing strategy, which can be applied to any parameterized learning algorithm, trains separate models in parallel, each on a disjoint subset of the training data, and then takes an average of all the parameters as the final model. In this paper, we provide results which suggest that the perceptron is ill-suited for straight-forward parameter mixing, even though it is commonly used for large-scale structured learning, e.g., Whitelaw et al. (2008) for named-entity recognition. However, a slight modification, which we call iterative parameter mixing, can be shown to: 1) have similar convergence properties to the standard perceptron algorithm, 2) find a separating hyperplane if the training set is separable, 3) reduce training times significantly, and 4) produce models with comparable (or superior) accuracies to those trained serially on all the data.

2 Related Work

Distributed cluster computation for many batch training algorithms has previously been examined by Chu et al. (2007), among others. Much of the relevant prior work on online (or sub-gradient) distributed training has focused on asynchronous optimization via gradient descent. In this scenario, multiple machines run stochastic gradient descent simultaneously as they update and read from a shared parameter vector asynchronously. Early work by Tsitsiklis et al. (1986) demonstrated that if the delay between model updates and reads is bounded, then asynchronous optimization is guaranteed to converge. Recently, Zinkevich et al. (2009) performed a similar type of analysis for online learners with asynchronous updates via stochastic gradient descent. The asynchronous algorithms in these studies require shared memory between the distributed computations and are less suitable to the more common cluster computing environment, which is what we study here.

While we focus on the perceptron algorithm, there is a large body of work on training structured prediction classifiers. For batch training the most common is conditional random fields (CRFs) (Lafferty et al., 2001), which is the structured analog of maximum entropy. As such, its training can easily be distributed through the gradient or sub-gradient computations (Finkel et al., 2008). However, unlike the perceptron, CRFs require the computation of a partition function, which is often expensive and sometimes intractable. Other batch learning algorithms include M3Ns (Taskar et al., 2004) and Structured SVMs (Tsochantaridis et al., 2004). Due to their efficiency, online learning algorithms have gained attention, especially for structured prediction tasks in NLP. In addition to the perceptron (Collins, 2002), others have looked at stochastic gradient descent (Zhang, 2004), passive aggressive algorithms (McDonald et al., 2005; Crammer et al., 2006), the recently introduced confidence weighted learning (Dredze et al., 2008) and coordinate descent algorithms (Duchi and Singer, 2009).

3 Structured Perceptron

The structured perceptron was introduced by Collins (2002) and we adopt much of the notation and presentation of that study. The structured perceptron algorithm – which is identical to the multi-class perceptron – is shown in Figure 1. The perceptron is an online learning algorithm and processes training instances one at a time during each epoch of training. Lines 4-6 are the core of the algorithm. For an input-output training instance pair $(x_t, y_t) \in T$, the algorithm predicts a structured output $y' \in \mathcal{Y}_t$, where $\mathcal{Y}_t$ is the space of permissible structured outputs for input $x_t$, e.g., parse trees for an input sentence. This prediction is determined by a linear classifier based on the dot product between a high-dimensional feature representation of a candidate input-output pair $f(x, y) \in \mathbb{R}^M$ and a corresponding weight vector $w \in \mathbb{R}^M$, which are the parameters of the model.¹ If this prediction is incorrect, then the parameters are updated to add weight to features for the corresponding correct output $y_t$ and take weight away from features for the incorrect output $y'$. For structured prediction, the inference step in line 4 is problem dependent, e.g., CKY for context-free parsing.

  Perceptron(T = {(x_t, y_t)}_{t=1}^{|T|})
  1. w^(0) = 0; k = 0
  2. for n : 1..N
  3.   for t : 1..T
  4.     Let y' = argmax_{y'} w^(k) · f(x_t, y')
  5.     if y' ≠ y_t
  6.       w^(k+1) = w^(k) + f(x_t, y_t) − f(x_t, y')
  7.       k = k + 1
  8. return w^(k)

Figure 1: The perceptron algorithm.

A training set T is separable with margin $\gamma > 0$ if there exists a vector $u \in \mathbb{R}^M$ with $\|u\| = 1$ such that $u \cdot f(x_t, y_t) - u \cdot f(x_t, y') \geq \gamma$, for all $(x_t, y_t) \in T$, and for all $y' \in \mathcal{Y}_t$ such that $y' \neq y_t$. Furthermore, let $R \geq \|f(x_t, y_t) - f(x_t, y')\|$, for all $(x_t, y_t) \in T$ and $y' \in \mathcal{Y}_t$.

¹ The perceptron can be kernelized for non-linearity.
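To make Figure 1 concrete, the following is a minimal Python sketch of the structured perceptron training loop; it is not code from the paper. The inference step of line 4 is abstracted as an argmax over a user-supplied candidate set, and the names `feature_fn` and `candidates_fn` as well as the sparse dictionary representation are illustrative assumptions.

```python
from collections import defaultdict

def dot(w, feats):
    # Sparse dot product between a weight dict and a feature dict.
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def structured_perceptron(train, feature_fn, candidates_fn, epochs=10):
    """Sketch of Figure 1.
    train         : list of (x, y) training pairs
    feature_fn    : (x, y) -> sparse feature dict f(x, y)
    candidates_fn : x -> iterable of permissible outputs Y_t
    """
    w = defaultdict(float)
    for _ in range(epochs):
        mistakes = 0
        for x, y_gold in train:
            # Line 4: inference -- pick the highest-scoring candidate output.
            y_pred = max(candidates_fn(x), key=lambda y: dot(w, feature_fn(x, y)))
            # Lines 5-6: additive update on a mistake.
            if y_pred != y_gold:
                mistakes += 1
                for f, v in feature_fn(x, y_gold).items():
                    w[f] += v
                for f, v in feature_fn(x, y_pred).items():
                    w[f] -= v
        if mistakes == 0:   # training set separated
            break
    return dict(w)
```

In practice the explicit enumeration over candidates would be replaced by problem-specific search such as Viterbi decoding or CKY; the update in lines 5-6 is unchanged.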

A fundamental theorem of the perceptron is as follows:

Theorem 1 (Novikoff (1962)). Assume training set T is separable by margin $\gamma$. Let $k$ be the number of mistakes made training the perceptron (Figure 1) on T. If training is run indefinitely, then $k \leq \frac{R^2}{\gamma^2}$.

Proof. See Collins (2002), Theorem 1.

Theorem 1 implies that if T is separable then 1) the perceptron will converge in a finite amount of time, and 2) it will produce a $w$ that separates T. Collins also proposed a variant of the structured perceptron where the final weight vector is a weighted average of all parameters that occur during training, which he called the averaged perceptron and which can be viewed as an approximation to the voted perceptron algorithm (Freund and Schapire, 1999).

4 Distributed Structured Perceptron

In this section we examine two distributed training strategies for the perceptron algorithm based on parameter mixing.

4.1 Parameter Mixing

Distributed training through parameter mixing is a straight-forward way of training classifiers in parallel. The algorithm is given in Figure 2. The idea is simple: divide the training data T into S disjoint shards such that $T = \{T_1, \ldots, T_S\}$. Next, train perceptron models (or any learning algorithm) on each shard in parallel. After training, set the final parameters to a weighted mixture of the parameters of each model using mixture coefficients $\mu$. Note that we call this strategy parameter mixing as opposed to parameter averaging to distinguish it from the averaged perceptron (see previous section). It is easy to see how this can be implemented on a cluster through a map-reduce framework, i.e., the map step trains the individual models in parallel and the reduce step mixes their parameters.

  PerceptronParamMix(T = {(x_t, y_t)}_{t=1}^{|T|})
  1. Shard T into S pieces T = {T_1, ..., T_S}
  2. w^(i) = Perceptron(T_i)    †
  3. w = Σ_i µ_i w^(i)          ‡
  4. return w

Figure 2: Distributed perceptron using a parameter mixing strategy. † Each w^(i) is computed in parallel. ‡ µ = {µ_1, ..., µ_S}, ∀µ_i ∈ µ: µ_i ≥ 0 and Σ_i µ_i = 1.

The advantages of parameter mixing are: 1) it is parallel, making it possible to scale to extremely large data sets, and 2) it is resource efficient, in particular with respect to network usage, as parameters are not repeatedly passed across the network as is often the case for exact distributed training strategies.

For maximum entropy models, Mann et al. (2009) show it is possible to bound the norm of the difference between parameters trained on all the data serially versus parameters trained with parameter mixing. However, their analysis requires a stability bound on the parameters of a regularized maximum entropy model, which is not known to hold for the perceptron. In Section 5, we present empirical results showing that parameter mixing for the distributed perceptron can be sub-optimal. Additionally, Dredze et al. (2008) present negative parameter mixing results for confidence weighted learning, which is another online learning algorithm. The following theorem may help explain this behavior.

Theorem 2. For any training set T separable by margin $\gamma$, the perceptron algorithm trained through a parameter mixing strategy (Figure 2) does not necessarily return a separating weight vector $w$.

Proof. Consider a binary classification setting where $\mathcal{Y} = \{0, 1\}$ and T has 4 instances. We distribute the training set into two shards, $T_1 = \{(x_{1,1}, y_{1,1}), (x_{1,2}, y_{1,2})\}$ and $T_2 = \{(x_{2,1}, y_{2,1}), (x_{2,2}, y_{2,2})\}$. Let $y_{1,1} = y_{2,1} = 0$ and $y_{1,2} = y_{2,2} = 1$. Now, let $w, f \in \mathbb{R}^6$ and, using block features, define the feature space as,

  f(x_{1,1}, 0) = [1 1 0 0 0 0]    f(x_{1,1}, 1) = [0 0 0 1 1 0]
  f(x_{1,2}, 0) = [0 0 1 0 0 0]    f(x_{1,2}, 1) = [0 0 0 0 0 1]
  f(x_{2,1}, 0) = [0 1 1 0 0 0]    f(x_{2,1}, 1) = [0 0 0 0 1 1]
  f(x_{2,2}, 0) = [1 0 0 0 0 0]    f(x_{2,2}, 1) = [0 0 0 1 0 0]

Assuming label 1 tie-breaking, the perceptrons trained on the two shards return $w_1 = [1\ 1\ 0\ {-1}\ {-1}\ 0]$ and $w_2 = [0\ 1\ 1\ 0\ {-1}\ {-1}]$. For any $\mu$, the mixed weight vector $w$ will not separate all the points. If both $\mu_1$ and $\mu_2$ are non-zero, then all examples will be classified 0. If $\mu_1 = 1$ and $\mu_2 = 0$, then $(x_{2,2}, y_{2,2})$ will be incorrectly classified as 0, and $(x_{1,2}, y_{1,2})$ will be when $\mu_1 = 0$ and $\mu_2 = 1$. But there is a separating weight vector $w = [-1\ 2\ {-1}\ 1\ {-2}\ 1]$.

This counter-example does not say that a parameter mixing strategy will not converge. On the contrary, if T is separable, then each of its subsets is separable, and each shard's perceptron converges via Theorem 1. What it does say is that, independent of $\mu$, the mixed weight vector produced after convergence will not necessarily separate the entire data, even when T is separable.
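The counter-example in the proof above is small enough to verify mechanically. The sketch below (hypothetical helper names, not from the paper) trains a multi-class perceptron on each shard with label-1 tie-breaking, mixes the two weight vectors uniformly, and checks that the mixture misclassifies the label-1 instances while $u = [-1, 2, -1, 1, -2, 1]$ separates all four points.

```python
# Feature vectors from the proof of Theorem 2: (instance, label) -> block feature vector.
F = {
    ("x11", 0): [1, 1, 0, 0, 0, 0], ("x11", 1): [0, 0, 0, 1, 1, 0],
    ("x12", 0): [0, 0, 1, 0, 0, 0], ("x12", 1): [0, 0, 0, 0, 0, 1],
    ("x21", 0): [0, 1, 1, 0, 0, 0], ("x21", 1): [0, 0, 0, 0, 1, 1],
    ("x22", 0): [1, 0, 0, 0, 0, 0], ("x22", 1): [0, 0, 0, 1, 0, 0],
}
T1 = [("x11", 0), ("x12", 1)]   # shard 1
T2 = [("x21", 0), ("x22", 1)]   # shard 2

def dot6(w, v):
    return sum(a * b for a, b in zip(w, v))

def predict(w, x):
    # Label-1 tie-breaking, as assumed in the proof (max keeps the first maximal label).
    return max([1, 0], key=lambda y: dot6(w, F[(x, y)]))

def train_shard(shard, epochs=10):
    w = [0.0] * 6
    for _ in range(epochs):
        for x, y in shard:
            y_hat = predict(w, x)
            if y_hat != y:
                w = [wi + a - b for wi, a, b in zip(w, F[(x, y)], F[(x, y_hat)])]
    return w

w1, w2 = train_shard(T1), train_shard(T2)      # -> [1,1,0,-1,-1,0] and [0,1,1,0,-1,-1]
mix = [0.5 * a + 0.5 * b for a, b in zip(w1, w2)]
data = T1 + T2
print([predict(mix, x) == y for x, y in data])  # [True, False, True, False]: mixture fails
u = [-1, 2, -1, 1, -2, 1]
print([predict(u, x) == y for x, y in data])    # [True, True, True, True]: u separates
```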

4.2 Iterative Parameter Mixing

Consider a slight augmentation to the parameter mixing strategy. Previously, each parallel perceptron was trained to convergence before the parameter mixing step. Instead, shard the data as before, but train a single epoch of the perceptron algorithm for each shard (in parallel) and mix the model weights. This mixed weight vector is then re-sent to each shard and the perceptrons on those shards reset their weights to the new mixed weights. Another single epoch of training is then run (again in parallel over the shards) and the process repeats. This iterative parameter mixing algorithm is given in Figure 3. Again, it is easy to see how this can be implemented as map-reduce, where the map computes the parameters for each shard for one epoch and the reduce mixes and re-sends them. This is analogous to batch distributed gradient descent methods where the gradient for each shard is computed in parallel in the map step and the reduce step sums the gradients and updates the weight vector. The disadvantage of iterative parameter mixing, relative to simple parameter mixing, is that the amount of information sent across the network will increase. Thus, if network latency is a bottleneck, this can become problematic. However, for many parallel computing frameworks, including both multi-core computing as well as cluster computing with high rates of connectivity, this is less of an issue.

  PerceptronIterParamMix(T = {(x_t, y_t)}_{t=1}^{|T|})
  1. Shard T into S pieces T = {T_1, ..., T_S}
  2. w = 0
  3. for n : 1..N
  4.   w^(i,n) = OneEpochPerceptron(T_i, w)   †
  5.   w = Σ_i µ_{i,n} w^(i,n)                ‡
  6. return w

  OneEpochPerceptron(T, w*)
  1. w^(0) = w*; k = 0
  2. for t : 1..T
  3.   Let y' = argmax_{y'} w^(k) · f(x_t, y')
  4.   if y' ≠ y_t
  5.     w^(k+1) = w^(k) + f(x_t, y_t) − f(x_t, y')
  6.     k = k + 1
  7. return w^(k)

Figure 3: Distributed perceptron using an iterative parameter mixing strategy. † Each w^(i,n) is computed in parallel. ‡ µ_n = {µ_{1,n}, ..., µ_{S,n}}, ∀µ_{i,n} ∈ µ_n: µ_{i,n} ≥ 0 and ∀n: Σ_i µ_{i,n} = 1.
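A compact Python sketch of Figure 3 is given below, reusing the `dot`, `feature_fn`, and `candidates_fn` conventions from the earlier perceptron sketch. The "parallel" map over shards is written here as an ordinary loop, and the round-robin sharding and uniform mixing coefficients are illustrative choices rather than anything prescribed by the paper.

```python
def one_epoch_perceptron(shard, w, feature_fn, candidates_fn):
    """OneEpochPerceptron: one pass over a shard, starting from the mixed weights w.
    Returns (updated weights, number of mistakes k_{i,n})."""
    w = dict(w)                      # local copy; shards do not share memory
    mistakes = 0
    for x, y_gold in shard:
        y_pred = max(candidates_fn(x), key=lambda y: dot(w, feature_fn(x, y)))
        if y_pred != y_gold:
            mistakes += 1
            for f, v in feature_fn(x, y_gold).items():
                w[f] = w.get(f, 0.0) + v
            for f, v in feature_fn(x, y_pred).items():
                w[f] = w.get(f, 0.0) - v
    return w, mistakes

def iter_param_mix(train, feature_fn, candidates_fn, n_shards=10, epochs=10):
    """PerceptronIterParamMix with uniform mixing (mu_{i,n} = 1/S)."""
    shards = [train[i::n_shards] for i in range(n_shards)]
    w = {}
    for _ in range(epochs):
        # "Map" step: one epoch per shard, each starting from the current mixed weights.
        results = [one_epoch_perceptron(s, w, feature_fn, candidates_fn) for s in shards]
        # "Reduce" step: mix the returned weight vectors and broadcast for the next round.
        mixed = {}
        for w_i, _ in results:
            for f, v in w_i.items():
                mixed[f] = mixed.get(f, 0.0) + v / n_shards
        w = mixed
        if sum(k for _, k in results) == 0:   # every shard separated: converged
            break
    return w
```

In a real map-reduce deployment the list comprehension over shards would be the map phase (e.g., a multiprocessing pool or cluster job), and the mixing loop the reduce phase.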
Theorem 3. Assume a training set T is separable by margin $\gamma$. Let $k_{i,n}$ be the number of mistakes that occurred on shard $i$ during the $n$th epoch of training. For any $N$, when training the perceptron with iterative parameter mixing (Figure 3),

$$\sum_{n=1}^{N} \sum_{i=1}^{S} \mu_{i,n} k_{i,n} \;\leq\; \frac{R^2}{\gamma^2}$$

Proof. Let $w^{(i,n)}$ be the weight vector for the $i$th shard after the $n$th epoch of the main loop, and let $w^{([i,n]-k)}$ be the weight vector that existed on shard $i$ in the $n$th epoch $k$ errors before $w^{(i,n)}$. Let $w^{(\mathrm{avg},n)}$ be the mixed vector from the weight vectors returned after the $n$th epoch, i.e.,

$$w^{(\mathrm{avg},n)} = \sum_{i=1}^{S} \mu_{i,n} w^{(i,n)}$$

Following the analysis from Collins (2002) Theorem 1, by examining line 5 of OneEpochPerceptron in Figure 3 and the fact that $u$ separates the data by $\gamma$:

$$\begin{aligned}
u \cdot w^{(i,n)} &= u \cdot w^{([i,n]-1)} + u \cdot \big(f(x_t, y_t) - f(x_t, y')\big) \\
&\geq u \cdot w^{([i,n]-1)} + \gamma \\
&\geq u \cdot w^{([i,n]-2)} + 2\gamma \\
&\;\;\vdots \\
&\geq u \cdot w^{(\mathrm{avg},n-1)} + k_{i,n}\gamma \qquad \text{(A1)}
\end{aligned}$$

That is, $u \cdot w^{(i,n)}$ is bounded below by the same quantity for the average weight vector of the $(n{-}1)$st epoch plus the number of mistakes made on shard $i$ during the $n$th epoch times the margin $\gamma$. Next, by line 5 of OneEpochPerceptron, the definition of $R$, and the fact that $w^{([i,n]-1)} \cdot (f(x_t, y_t) - f(x_t, y')) \leq 0$ when line 5 is called:

$$\begin{aligned}
\|w^{(i,n)}\|^2 &= \|w^{([i,n]-1)}\|^2 + \|f(x_t, y_t) - f(x_t, y')\|^2 + 2\, w^{([i,n]-1)} \cdot \big(f(x_t, y_t) - f(x_t, y')\big) \\
&\leq \|w^{([i,n]-1)}\|^2 + R^2 \\
&\leq \|w^{([i,n]-2)}\|^2 + 2R^2 \\
&\;\;\vdots \\
&\leq \|w^{(\mathrm{avg},n-1)}\|^2 + k_{i,n} R^2 \qquad \text{(A2)}
\end{aligned}$$

That is, the squared L2-norm of a shard's weight vector is bounded above by the same value for the average weight vector of the $(n{-}1)$st epoch plus the number of mistakes made on that shard during the $n$th epoch times $R^2$.

Using A1/A2 we prove two inductive hypotheses:

$$u \cdot w^{(\mathrm{avg},N)} \;\geq\; \sum_{n=1}^{N} \sum_{i=1}^{S} \mu_{i,n} k_{i,n} \gamma \qquad \text{(IH1)}$$

$$\|w^{(\mathrm{avg},N)}\|^2 \;\leq\; \sum_{n=1}^{N} \sum_{i=1}^{S} \mu_{i,n} k_{i,n} R^2 \qquad \text{(IH2)}$$

IH1 implies $\|w^{(\mathrm{avg},N)}\| \geq \sum_{n=1}^{N} \sum_{i=1}^{S} \mu_{i,n} k_{i,n} \gamma$ since $u \cdot w \leq \|u\| \|w\|$ and $\|u\| = 1$.

The base case is $w^{(\mathrm{avg},1)}$, where we can observe:

$$u \cdot w^{(\mathrm{avg},1)} = \sum_{i=1}^{S} \mu_{i,1}\, u \cdot w^{(i,1)} \;\geq\; \sum_{i=1}^{S} \mu_{i,1} k_{i,1} \gamma$$

using A1 and the fact that $w^{(\mathrm{avg},0)} = 0$ for the second step. For the IH2 base case we can write:

$$\|w^{(\mathrm{avg},1)}\|^2 = \Big\| \sum_{i=1}^{S} \mu_{i,1} w^{(i,1)} \Big\|^2 \;\leq\; \sum_{i=1}^{S} \mu_{i,1} \|w^{(i,1)}\|^2 \;\leq\; \sum_{i=1}^{S} \mu_{i,1} k_{i,1} R^2$$

The first inequality is Jensen's inequality, and the second is true by A2 and $\|w^{(\mathrm{avg},0)}\|^2 = 0$.

Proceeding to the general case, $w^{(\mathrm{avg},N)}$:

$$\begin{aligned}
u \cdot w^{(\mathrm{avg},N)} &= \sum_{i=1}^{S} \mu_{i,N} \big(u \cdot w^{(i,N)}\big) \\
&\geq \sum_{i=1}^{S} \mu_{i,N} \big(u \cdot w^{(\mathrm{avg},N-1)} + k_{i,N}\gamma\big) \\
&= u \cdot w^{(\mathrm{avg},N-1)} + \sum_{i=1}^{S} \mu_{i,N} k_{i,N} \gamma \\
&\geq \Big[ \sum_{n=1}^{N-1} \sum_{i=1}^{S} \mu_{i,n} k_{i,n} \gamma \Big] + \sum_{i=1}^{S} \mu_{i,N} k_{i,N} \gamma \\
&= \sum_{n=1}^{N} \sum_{i=1}^{S} \mu_{i,n} k_{i,n} \gamma
\end{aligned}$$

The first inequality uses A1, the second step uses $\sum_i \mu_{i,N} = 1$, and the second inequality uses the inductive hypothesis IH1. For IH2, in the general case, we can write:

$$\begin{aligned}
\|w^{(\mathrm{avg},N)}\|^2 &\leq \sum_{i=1}^{S} \mu_{i,N} \|w^{(i,N)}\|^2 \\
&\leq \sum_{i=1}^{S} \mu_{i,N} \big(\|w^{(\mathrm{avg},N-1)}\|^2 + k_{i,N} R^2\big) \\
&= \|w^{(\mathrm{avg},N-1)}\|^2 + \sum_{i=1}^{S} \mu_{i,N} k_{i,N} R^2 \\
&\leq \Big[ \sum_{n=1}^{N-1} \sum_{i=1}^{S} \mu_{i,n} k_{i,n} R^2 \Big] + \sum_{i=1}^{S} \mu_{i,N} k_{i,N} R^2 \\
&= \sum_{n=1}^{N} \sum_{i=1}^{S} \mu_{i,n} k_{i,n} R^2
\end{aligned}$$

The first inequality is Jensen's, the second A2, and the third the inductive hypothesis IH2. Putting together IH1, IH2 and $\|w^{(\mathrm{avg},N)}\| \geq u \cdot w^{(\mathrm{avg},N)}$:

$$\Big[ \sum_{n=1}^{N} \sum_{i=1}^{S} \mu_{i,n} k_{i,n} \Big]^2 \gamma^2 \;\leq\; \Big[ \sum_{n=1}^{N} \sum_{i=1}^{S} \mu_{i,n} k_{i,n} \Big] R^2$$

which yields:

$$\sum_{n=1}^{N} \sum_{i=1}^{S} \mu_{i,n} k_{i,n} \;\leq\; \frac{R^2}{\gamma^2}$$

4.3 Analysis

If we set each $\mu_n$ to be the uniform mixture, $\mu_{i,n} = 1/S$, then Theorem 3 guarantees convergence to a separating hyperplane. If $\sum_{i=1}^{S} \mu_{i,n} k_{i,n} = 0$, then the previous weight vector already separated the data. Otherwise, $\sum_{n=1}^{N} \sum_{i=1}^{S} \mu_{i,n} k_{i,n}$ is still increasing, but is bounded and cannot increase indefinitely. Also note that if $S = 1$, then $\mu_{1,n}$ must equal 1 for all $n$ and this bound is identical to Theorem 1.

However, we are mainly concerned with how fast convergence occurs, which is directly related to the number of training epochs each algorithm must run, i.e., $N$ in Figure 1 and Figure 3. For the non-distributed variant of the perceptron we can say that $N_{\mathrm{non\text{-}dist}} \leq R^2/\gamma^2$, since in the worst case a single mistake happens on each epoch.² For the distributed case, consider setting $\mu_{i,n} = k_{i,n}/k_n$, where $k_n = \sum_i k_{i,n}$. That is, we mix parameters proportionally to the number of errors each shard made during the previous epoch. Theorem 3 still implies convergence to a separating hyperplane with this choice. Further, we can bound the required number of epochs $N_{\mathrm{dist}}$:

$$N_{\mathrm{dist}} \;\leq\; \sum_{n=1}^{N_{\mathrm{dist}}} \prod_{i=1}^{S} [k_{i,n}]^{k_{i,n}/k_n} \;\leq\; \sum_{n=1}^{N_{\mathrm{dist}}} \sum_{i=1}^{S} \frac{k_{i,n}}{k_n} k_{i,n} \;\leq\; \frac{R^2}{\gamma^2}$$

Ignoring epochs in which all $k_{i,n}$ are zero (since the algorithm will have converged), the first inequality is true since either $k_{i,n} \geq 1$, implying that $[k_{i,n}]^{k_{i,n}/k_n} \geq 1$, or $k_{i,n} = 0$ and $[k_{i,n}]^{k_{i,n}/k_n} = 1$. The second inequality is true by the generalized arithmetic-geometric mean inequality, and the final inequality is Theorem 3. Thus, the worst-case number of epochs is identical for both the regular and distributed perceptron – but the distributed perceptron can theoretically process each epoch S times faster. This observation holds only for cases where $\mu_{i,n} > 0$ when $k_{i,n} \geq 1$ and $\mu_{i,n} = 0$ when $k_{i,n} = 0$, which does not include uniform mixing.

² It is not hard to derive such degenerate cases.
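Switching from uniform to error-proportional mixing only changes the reduce step of the iterative-mixing sketch given earlier. The small fragment below (hypothetical names, same dictionary-of-weights convention) computes $\mu_{i,n} = k_{i,n}/k_n$ from the per-shard mistake counts, falling back to uniform weights once no shard makes a mistake.

```python
def mix_weights(results):
    """results: list of (w_i, k_i) pairs as returned by one_epoch_perceptron.
    Error-proportional mixing: mu_i = k_i / k_n; uniform if no shard made a mistake."""
    total = sum(k for _, k in results)
    mus = ([k / total for _, k in results] if total > 0
           else [1.0 / len(results)] * len(results))
    mixed = {}
    for (w_i, _), mu in zip(results, mus):
        for f, v in w_i.items():
            mixed[f] = mixed.get(f, 0.0) + mu * v
    return mixed
```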

5 Experiments

To investigate the distributed perceptron strategies discussed in Section 4 we look at two structured prediction tasks – named entity recognition and dependency parsing. We compare up to four systems:

1. Serial (All Data): This is the classifier returned if trained serially on all the available data.

2. Serial (Sub Sampling): Shard the data, select one shard randomly and train serially.

3. Parallel (Parameter Mix): Parallel strategy discussed in Section 4.1 with uniform mixing.

4. Parallel (Iterative Parameter Mix): Parallel strategy discussed in Section 4.2 with uniform mixing (Section 5.1 looks at mixing strategies).

For all four systems we compare results for both the standard perceptron algorithm as well as the averaged perceptron algorithm (Collins, 2002).

We report the final test set metrics of the converged classifiers to determine whether any loss in accuracy is observed as a consequence of distributed training strategies. We define convergence as either: 1) the training set is separated, or 2) the training set performance measure (accuracy, f-measure, etc.) does not change by more than some pre-defined threshold on three consecutive epochs. As with most real world data sets, convergence by training set separation was rarely observed, though in both cases training set accuracies approached 100%. For both tasks we also plot test set metrics relative to the user wall-clock time taken to obtain the classifier. The results were computed by collecting the metrics at the end of each epoch for every classifier. All experiments used 10 shards (Section 5.1 looks at convergence relative to different shard sizes).

Our first experiment is a named-entity recognition task using the English data from the CoNLL 2003 shared-task (Tjong Kim Sang and De Meulder, 2003). The task is to detect entities in sentences and label them as one of four types: people, organizations, locations or miscellaneous. For our experiments we used the entire training set (14041 sentences) and evaluated on the official development set (3250 sentences). We used a straight-forward IOB label encoding with a 1st order Markov factorization. Our feature set consisted of predicates extracted over word identities, word affixes, orthography, part-of-speech tags and corresponding concatenations. The evaluation metric used was micro f-measure over the four entity class types.

Results are given in Figure 4. There are a number of things to observe here: 1) training on a single shard clearly provides inferior performance to training on all data, 2) the simple parameter mixing strategy improves upon a single shard, but does not meet the performance of training on all data, 3) iterative parameter mixing achieves performance as good as or better than training serially on all the data, and 4) the distributed algorithms return better classifiers much quicker than training serially on all the data. This is true regardless of whether the underlying algorithm is the regular or the averaged perceptron. Point 3 deserves more discussion. In particular, the iterative parameter mixing strategy has a higher final f-measure than training serially on all the data with the standard perceptron (f-measure of 87.9 vs. 85.8). We suspect this happens for two reasons. First, the parameter mixing has a bagging-like effect which helps to reduce the variance of the per-shard classifiers (Breiman, 1996). Second, the fact that parameter mixing is just a form of parameter averaging perhaps has the same effect as the averaged perceptron.

[Figure 4 plots: test data f-measure versus wall clock for the regular perceptron (left) and the averaged perceptron (right); curves for Serial (All Data), Serial (Sub Sampling), Parallel (Parameter Mix), and Parallel (Iterative Parameter Mix).]

                                      Reg. Perceptron   Avg. Perceptron
                                      F-measure         F-measure
  Serial (All Data)                        85.8              88.2
  Serial (Sub Sampling)                    75.3              76.6
  Parallel (Parameter Mix)                 81.5              81.6
  Parallel (Iterative Parameter Mix)       87.9              88.1

Figure 4: NER experiments. Upper figures plot test data f-measure versus wall clock for both regular perceptron (left) and averaged perceptron (right). Lower table is f-measure for converged models.

Our second set of experiments looked at the much more computationally intensive task of dependency parsing. We used the Prague Dependency Treebank (PDT) (Hajič et al., 2001), which is a Czech language treebank and currently one of the largest dependency treebanks in existence. We used the CoNLL-X training (72703 sentences) and testing splits (365 sentences) of this data (Buchholz and Marsi, 2006) and dependency parsing models based on McDonald and Pereira (2006), which factor features over pairs of dependency arcs in a tree. To parse all the sentences in the PDT, one must use a non-projective parsing algorithm, which is a known NP-complete inference problem when not assuming strong independence assumptions. Thus, the use of approximate inference techniques is common in order to find the highest weighted tree for a sentence. We use the approximate parsing algorithm given in McDonald and Pereira (2006), which runs in time roughly cubic in sentence length. Training such a model is computationally expensive and can take on the order of days on a single machine.

Unlabeled attachment scores (Buchholz and Marsi, 2006) are given in Figure 5. The same trends are seen for dependency parsing that are seen for named-entity recognition. That is, iterative parameter mixing learns classifiers faster and has a final accuracy as good as or better than training serially on all data. Again we see that the iterative parameter mixing model returns a more accurate classifier than the regular perceptron, but at about the same level as the averaged perceptron.

5.1 Convergence Properties

Section 4.3 suggests that different weighting strategies can lead to different convergence properties, in particular with respect to the number of epochs. For the named-entity recognition task we ran four experiments comparing two different mixing strategies – uniform mixing ($\mu_{i,n} = 1/S$) and error mixing ($\mu_{i,n} = k_{i,n}/k_n$) – each with two shard sizes – $S = 10$ and $S = 100$. Figure 6 plots the number of training errors per epoch for each strategy.

We can make a couple of observations. First, the mixing strategy makes little difference. The reason is that the number of observed errors per epoch is roughly uniform across shards, making both strategies ultimately equivalent. The other observation is that increasing the number of shards can slow down convergence when viewed relative to epochs.³ Again, this appears in contradiction to the analysis in Section 4.3, which, at least for the case of error weighted mixtures, implied that the number of epochs to convergence was independent of the number of shards. But that analysis was based on worst-case scenarios where a single error occurs on a single shard at each epoch, which is unlikely to occur in real world data.

³ As opposed to raw wall-clock/CPU time, which benefits from faster epochs the more shards there are.

[Figure 5 plots: test data unlabeled attachment score versus wall clock for the regular perceptron (left) and the averaged perceptron (right); curves for Serial (All Data), Serial (Sub Sampling), and Parallel (Iterative Parameter Mix).]

                                      Reg. Perceptron              Avg. Perceptron
                                      Unlabeled Attachment Score   Unlabeled Attachment Score
  Serial (All Data)                        81.3                         84.7
  Serial (Sub Sampling)                    77.2                         80.1
  Parallel (Iterative Parameter Mix)       83.5                         84.5

Figure 5: Dependency Parsing experiments. Upper figures plot test data unlabeled attachment score versus wall clock for both regular perceptron (left) and averaged perceptron (right). Lower table is unlabeled attachment score for converged models.

Instead, consider the uniform mixture case. Theorem 3 implies:

$$\sum_{n=1}^{N} \sum_{i=1}^{S} \frac{k_{i,n}}{S} \;\leq\; \frac{R^2}{\gamma^2} \;\implies\; \sum_{n=1}^{N} \sum_{i=1}^{S} k_{i,n} \;\leq\; S \times \frac{R^2}{\gamma^2}$$

Thus, for cases where training errors are uniformly distributed across shards, it is possible that, in the worst case, convergence may slow down proportionally to the number of shards. This implies a trade-off between slower convergence and quicker epochs when selecting a large number of shards. In fact, we observed a tipping point in our experiments at which increasing the number of shards began to have an adverse effect on training times, which for the named-entity experiments occurred around 25-50 shards. This is due both to the reasons described in this section and to the added overhead of maintaining and summing multiple high-dimensional weight vectors after each distributed epoch.

It is worth pointing out that a linear term S in the convergence bound above is similar to convergence/regret bounds for asynchronous distributed online learning, which typically have bounds linear in the asynchronous delay (Mesterharm, 2005; Zinkevich et al., 2009). This delay will be on average roughly equal to the number of shards S.

[Figure 6 plot: number of training mistakes (y-axis) versus training epochs (x-axis) for error mixing and uniform mixing with 10 and 100 shards.]

Figure 6: Training errors per epoch for different shard sizes and parameter mixing strategies.

6 Conclusions

In this paper we have investigated distributing the structured perceptron via simple parameter mixing strategies. Our analysis shows that an iterative parameter mixing strategy is both guaranteed to separate the data (if possible) and significantly reduces the time required to train high accuracy classifiers. However, there is a trade-off between the faster epochs obtained through distributed computation and slower convergence relative to the number of shards. Finally, we note that using similar proofs to those given in this paper, it is possible to provide theoretical guarantees for distributed online passive-aggressive learning (Crammer et al., 2006), which is a form of large-margin perceptron learning. Unfortunately space limitations prevent exploration here.

Acknowledgements: We thank Mehryar Mohri, Fernando Pereira, Mark Dredze and the three anonymous reviewers for their helpful comments on this work.

References

L. Breiman. 1996. Bagging predictors. Machine Learning, 24(2):123–140.

S. Buchholz and E. Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the Conference on Computational Natural Language Learning.

C.T. Chu, S.K. Kim, Y.A. Lin, Y.Y. Yu, G. Bradski, A.Y. Ng, and K. Olukotun. 2007. Map-Reduce for machine learning on multicore. In Advances in Neural Information Processing Systems.

M. Collins and B. Roark. 2004. Incremental parsing with the perceptron algorithm. In Proceedings of the Conference of the Association for Computational Linguistics.

M. Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. 2006. Online passive-aggressive algorithms. The Journal of Machine Learning Research, 7:551–585.

J. Dean and S. Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. In Sixth Symposium on Operating System Design and Implementation.

M. Dredze, K. Crammer, and F. Pereira. 2008. Confidence-weighted linear classification. In Proceedings of the International Conference on Machine Learning.

J. Duchi and Y. Singer. 2009. Efficient learning using forward-backward splitting. In Advances in Neural Information Processing Systems.

J.R. Finkel, A. Kleeman, and C.D. Manning. 2008. Efficient, feature-based, conditional random field parsing. In Proceedings of the Conference of the Association for Computational Linguistics.

Y. Freund and R.E. Schapire. 1999. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277–296.

J. Hajič, B. Vidová Hladká, J. Panevová, E. Hajičová, P. Sgall, and P. Pajas. 2001. Prague Dependency Treebank 1.0. LDC, 2001T10.

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning.

P. Liang, A. Bouchard-Côté, D. Klein, and B. Taskar. 2006. An end-to-end discriminative approach to machine translation. In Proceedings of the Conference of the Association for Computational Linguistics.

G. Mann, R. McDonald, M. Mohri, N. Silberman, and D. Walker. 2009. Efficient large-scale distributed training of conditional maximum entropy models. In Advances in Neural Information Processing Systems.

R. McDonald and F. Pereira. 2006. Online learning of approximate dependency parsing algorithms. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics.

R. McDonald, K. Crammer, and F. Pereira. 2005. Online large-margin training of dependency parsers. In Proceedings of the Conference of the Association for Computational Linguistics.

C. Mesterharm. 2005. Online learning with delayed label feedback. In Proceedings of Algorithmic Learning Theory.

A.B. Novikoff. 1962. On convergence proofs on perceptrons. In Symposium on the Mathematical Theory of Automata.

F. Rosenblatt. 1958. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408.

B. Taskar, C. Guestrin, and D. Koller. 2004. Max-margin Markov networks. In Advances in Neural Information Processing Systems.

E. F. Tjong Kim Sang and F. De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Conference on Computational Natural Language Learning.

J. N. Tsitsiklis, D. P. Bertsekas, and M. Athans. 1986. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803–812.

I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. 2004. Support vector machine learning for interdependent and structured output spaces. In Proceedings of the International Conference on Machine Learning.

C. Whitelaw, A. Kehlenbeck, N. Petrovic, and L. Ungar. 2008. Web-scale named entity recognition. In Proceedings of the International Conference on Information and Knowledge Management.

Y. Zhang and S. Clark. 2008. A tale of two parsers: Investigating and combining graph-based and transition-based dependency parsing using beam-search. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

T. Zhang. 2004. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the International Conference on Machine Learning.

M. Zinkevich, A. Smola, and J. Langford. 2009. Slow learners are fast. In Advances in Neural Information Processing Systems.
