Accelerated Estimation of Conditional Random Fields using a Pseudo-Likelihood-inspired Variant

Teemu Ruokolainen (a), Miikka Silfverberg (b), Mikko Kurimo (a), Krister Lindén (b)
(a) Department of Signal Processing and Acoustics, Aalto University, firstname.lastname@aalto.fi
(b) Department of Modern Languages, University of Helsinki, firstname.lastname@helsinki.fi

Abstract

We discuss a simple estimation approach for conditional random fields (CRFs). The approach is derived heuristically by defining a variant of the classic perceptron algorithm in the spirit of pseudo-likelihood for maximum likelihood estimation. The resulting approximative algorithm has a linear time complexity in the size of the label set and contains a minimal amount of tunable hyper-parameters. Consequently, the algorithm is suitable for learning CRF-based part-of-speech (POS) taggers in the presence of large POS label sets. We present experiments on five languages. Despite its heuristic nature, the algorithm provides surprisingly competitive accuracies and running times against reference methods.

1 Introduction

The conditional random field (CRF) model (Lafferty et al., 2001) has been successfully applied to several sequence labeling tasks in natural language processing, including part-of-speech (POS) tagging. In this work, we discuss accelerating CRF model estimation in the presence of a large number of labels, say, hundreds or thousands. Large label sets occur in POS tagging of morphologically rich languages (Erjavec, 2010; Haverinen et al., 2013).

CRF training is most commonly associated with the (conditional) maximum likelihood (ML) criterion employed in the original work of Lafferty et al. (2001). In this work, we focus on an alternative training approach using the averaged perceptron algorithm of Collins (2002). While yielding competitive accuracy (Collins, 2002; Zhang and Clark, 2011), the perceptron algorithm avoids the extensive tuning of hyper-parameters and regularization required by the stochastic algorithm employed in ML estimation (Vishwanathan et al., 2006). Additionally, while ML and perceptron training share an identical time complexity, the perceptron is in practice faster due to sparser parameter updates.

Despite its simplicity, running the perceptron algorithm can be tedious in case the data contains a large number of labels. Previously, this problem has been addressed using, for example, k-best beam search (Collins and Roark, 2004; Zhang and Clark, 2011; Huang et al., 2012) and parallelization (McDonald et al., 2010). In this work, we explore an alternative strategy, in which we modify the perceptron algorithm in the spirit of the classic pseudo-likelihood approximation for ML estimation (Besag, 1975). The resulting novel algorithm has linear complexity w.r.t. the label set size and contains only a single hyper-parameter, namely, the number of passes taken over the training data set.

We evaluate the algorithm, referred to as the pseudo-perceptron, empirically in POS tagging on five languages. The results suggest that the approach can yield competitive accuracy compared to perceptron training accelerated using a violation-fixed 1-best beam search (Collins and Roark, 2004; Huang et al., 2012), which also provides a linear time complexity in label set size.

The rest of the paper is organized as follows. In Section 2, we describe the pseudo-perceptron algorithm and discuss related work. In Sections 3 and 4, we describe our experimental setup and the results, respectively. Conclusions on the work are presented in Section 5.

2 Methods

2.1 Pseudo-Perceptron Algorithm

The (unnormalized) CRF model for input and output sequences x = (x_1, x_2, \ldots, x_{|x|}) and y = (y_1, y_2, \ldots, y_{|x|}), respectively, is written as

    p(y \mid x; w) \propto \exp(w \cdot \Phi(y, x)) = \prod_{i=n}^{|x|} \exp(w \cdot \phi(y_{i-n}, \ldots, y_i, x, i)),    (1)

where w denotes the model parameter vector, \Phi the vector-valued global feature extracting function, \phi the vector-valued local feature extracting function, and n the model order. We denote the tag set as \mathcal{Y}. The model parameters w are estimated based on training data, and test instances are decoded using the Viterbi search (Lafferty et al., 2001).

Given the model definition (1), the parameters w can be estimated in a straightforward manner using the structured perceptron algorithm (Collins, 2002). The algorithm iterates over the training set a single instance (x, y) at a time and updates the parameters according to the rule w^{(i)} = w^{(i-1)} + \Delta\Phi(x, y, z), where \Delta\Phi(x, y, z) for the ith iteration is written as \Delta\Phi(x, y, z) = \Phi(x, y) - \Phi(x, z). The prediction z is obtained as

    z = \arg\max_{u \in \mathcal{Y}(x)} w \cdot \Phi(x, u)    (2)

by performing the Viterbi search over \mathcal{Y}(x) = \mathcal{Y} \times \cdots \times \mathcal{Y}, a product of |x| copies of \mathcal{Y}. In case the perceptron algorithm yields a small number of incorrect predictions on the training data set, the parameters generalize well to test instances with a high probability (Collins, 2002).

The time complexity of the Viterbi search is O(|x| \times |\mathcal{Y}|^{n+1}). Consequently, running the perceptron algorithm can become tedious if the label set cardinality |\mathcal{Y}| and/or the model order n is large. In order to speed up learning, we define a variant of the algorithm in the spirit of pseudo-likelihood (PL) learning (Besag, 1975). In analogy to PL, the key idea of the pseudo-perceptron (PP) algorithm is to obtain the required predictions over single variables y_i while fixing the remaining variables to their true values. In other words, instead of using the Viterbi search to find z as in (2), we find a z' for each position i \in 1..|x| as

    z' = \arg\max_{u \in \mathcal{Y}'_i(x)} w \cdot \Phi(x, u),    (3)

with \mathcal{Y}'_i(x) = \{y_1\} \times \cdots \times \{y_{i-1}\} \times \mathcal{Y} \times \{y_{i+1}\} \times \cdots \times \{y_{|x|}\}. Subsequent to training, test instances are decoded in a standard manner using the Viterbi search.

The appeal of PP is that the time complexity of the search is reduced to O(|x| \times |\mathcal{Y}|), i.e., linear in the number of labels in the label set. On the other hand, we no longer expect the obtained parameters to necessarily generalize well to test instances.[1] Consequently, we consider PP a heuristic estimation approach motivated by the rather well-established success of PL (Korč and Förstner, 2008; Sutton and McCallum, 2009).[2]

[1] We leave formal treatment to future work.
[2] Meanwhile, note that pseudo-likelihood is a consistent estimator (Gidas, 1988; Hyvärinen, 2006).
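To make the position-wise search in (3) concrete, the sketch below shows one pseudo-perceptron pass for a first-order chain model (the setting used in our experiments). It is a minimal Python illustration rather than the authors' C++ implementation: the weight vector is a sparse dictionary, phi is a hypothetical local feature extractor, and parameter averaging (Collins, 2002) is omitted. Since all positions other than i keep their gold labels, maximizing (3) only requires scoring the factors that contain position i.

```python
from collections import defaultdict

def score(w, feats):
    """Dot product between a sparse weight vector and a list of active features."""
    return sum(w[f] for f in feats)

def factors_at(phi, words, tags, i, candidate):
    """Features of the factors containing position i, with the neighbouring
    positions fixed to their gold tags (cf. Eq. (3))."""
    prev_tag = tags[i - 1] if i > 0 else "<s>"
    feats = list(phi(prev_tag, candidate, words, i))
    if i + 1 < len(words):  # the factor over (i, i+1) also contains y_i
        feats += phi(candidate, tags[i + 1], words, i + 1)
    return feats

def pp_pass(w, data, labels, phi):
    """One pass of the pseudo-perceptron over (words, gold_tags) pairs."""
    for words, tags in data:
        for i, gold in enumerate(tags):
            # Position-wise "pseudo" search: all other positions keep gold tags.
            pred = max(labels,
                       key=lambda c: score(w, factors_at(phi, words, tags, i, c)))
            if pred != gold:
                # Perceptron update restricted to the factors at position i;
                # all remaining factors cancel in Phi(x, y) - Phi(x, z').
                for f in factors_at(phi, words, tags, i, gold):
                    w[f] += 1.0
                for f in factors_at(phi, words, tags, i, pred):
                    w[f] -= 1.0
    return w

# Usage sketch: w = pp_pass(defaultdict(float), training_data, tag_set, phi)
```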
Next, we study yet another heuristic pseudo-variant of the perceptron algorithm, referred to as the piecewise-pseudo-perceptron (PW-PP). This algorithm is analogous to the piecewise-pseudo-likelihood (PW-PL) approximation presented by Sutton and McCallum (2009). In this variant, the original graph is first split into smaller, possibly overlapping subgraphs (pieces). Subsequently, we apply the PP approximation to the pieces. We employ the approach coined factor-as-piece by Sutton and McCallum (2009), in which each piece contains n + 1 consecutive variables, where n is the CRF model order.

The PW-PP approach is motivated by the results of Sutton and McCallum (2009), who found PW-PL to increase stability w.r.t. accuracy compared to plain PL across tasks. Note that the piecewise approximation in itself is not interesting in chain-structured CRFs, as it results in the same time complexity as standard estimation. Meanwhile, the PW-PP algorithm has the same time complexity as PP.

2.2 Related work

Previously, impractical running times of perceptron learning have been addressed most notably using the k-best beam search method (Collins and Roark, 2004; Zhang and Clark, 2011; Huang et al., 2012). Here, we consider the "greedy" 1-best beam search variant most relevant, as it shares the time complexity of the pseudo search. Therefore, in the experimental section of this work, we compare PP and 1-best beam search.

We are aware of at least two other learning approaches inspired by PL, namely, the pseudo-max and piecewise algorithms of Sontag et al. (2010) and Alahari et al. (2010), respectively. Compared to these approaches, the PP algorithm provides a simpler estimation tool, as it avoids the hyper-parameters involved in the stochastic gradient descent algorithms as well as the regularization and margin functions inherent to the approaches of Alahari et al. (2010) and Sontag et al. (2010).

On the other hand, Sontag et al. (2010) show that the pseudo-max approach achieves consistency given certain assumptions on the data generating function. Meanwhile, as discussed in the previous section, we consider PP a heuristic and do not provide any generalization guarantees. To our understanding, Alahari et al. (2010) do not provide generalization guarantees for their algorithm.

3 Experimental Setup

3.1 Data

For a quick overview of the data sets, see Table 1.

lang.   train.   dev.    test    tags   train. tags
eng     38,219   5,527   5,462     45        45
rom      5,216     652     652    405       391
est      5,183     648     647    413       408
cze      5,402     675     675    955       908
fin      5,043     630     630  2,355     2,141

Table 1: Overview of the data. The training (train.), development (dev.) and test set sizes are given in sentences. The columns titled tags and train. tags correspond to the total number of tags in the data set and the number of tags in the training set, respectively.

Penn Treebank. The first data set we consider is the classic Penn Treebank. The complete treebank is divided into 25 sections of newswire text extracted from the Wall Street Journal. We split the data into training, development, and test sets using the sections 0-18, 19-21, and 22-24, according to the standardly applied division introduced by Collins (2002).

Multext-East. The second data set we consider is the multilingual Multext-East (Erjavec, 2010) corpus. The corpus contains the novel 1984 by George Orwell. From the available seven languages, we utilize the Czech, Estonian and Romanian sections. Since the data does not have a standard division into training and test sets, we assign the 9th and 10th of each 10 consecutive sentences to the development and test sets, respectively. The remaining sentences are assigned to the training set.

Turku Dependency Treebank. The third data set we consider is the Finnish Turku Dependency Treebank (Haverinen et al., 2013). The treebank contains text from 10 different domains. We use the same data split strategy as for Multext-East.
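To illustrate the split, a minimal sketch of the 9th/10th-sentence assignment follows (the function name is ours, not from the paper). It reproduces the roughly 8:1:1 sentence ratios visible in Table 1.

```python
def multext_style_split(sentences):
    """Assign the 9th and 10th sentence of every block of 10 consecutive
    sentences to the development and test sets, respectively; all remaining
    sentences form the training set."""
    train, dev, test = [], [], []
    for idx, sentence in enumerate(sentences):
        pos = idx % 10        # position within the current block of 10
        if pos == 8:          # 9th sentence (0-based index 8)
            dev.append(sentence)
        elif pos == 9:        # 10th sentence
            test.append(sentence)
        else:
            train.append(sentence)
    return train, dev, test
```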
3.2 Reference Methods

We compare the PP and PW-PP algorithms with perceptron learning accelerated using 1-best beam search modified with the early update rule (Huang et al., 2012). While Huang et al. (2012) experimented with several violation-fixing methods (early, latest, maximum, hybrid), they appeared to reach termination at the same rate in POS tagging. Our preliminary experiments using the latest violation updates supported this. Consequently, we employ the early updates.

We also provide results using the CRFsuite toolkit (Okazaki, 2007), which implements a 1st-order CRF model. To the best of our knowledge, CRFsuite is currently the fastest freely available CRF implementation.[3] In addition to the averaged perceptron algorithm (Collins, 2002), the toolkit implements several training procedures (Nocedal, 1980; Crammer et al., 2006; Andrew and Gao, 2007; Mejer and Crammer, 2010; Shalev-Shwartz et al., 2011). We run CRFsuite using these algorithms with their default parameters and the feature extraction scheme and stopping criterion described in Section 3.3. We then report the results provided by the most accurate algorithm on each language.

[3] See benchmark results at http://www.chokkan.org/software/crfsuite/benchmark.html

3.3 Details on CRF Training and Decoding

While the methods discussed in this work are applicable to nth-order CRFs, we employ 1st-order CRFs in order to avoid overfitting the relatively small training sets.

We employ a simple feature set including word forms at positions t-2, ..., t+2, suffixes of the word at position t up to four letters, and three orthographic features indicating whether the word at position t contains a hyphen, a capital letter, or a digit.
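As an illustration, the emission part of this feature set could be extracted along the following lines (a sketch; the function name and feature string format are ours, not taken from the authors' implementation):

```python
def emission_features(words, t):
    """Emission features for position t: surrounding word forms, suffixes of
    the current word up to four letters, and three orthographic indicators."""
    feats = []
    # Word forms at positions t-2, ..., t+2 (padded at sentence boundaries).
    for offset in range(-2, 3):
        j = t + offset
        form = words[j] if 0 <= j < len(words) else "<pad>"
        feats.append(f"word[{offset}]={form}")
    # Suffixes of the word at position t, up to four letters.
    for k in range(1, min(4, len(words[t])) + 1):
        feats.append(f"suffix{k}={words[t][-k:]}")
    # Orthographic indicators: hyphen, capital letter, digit.
    if "-" in words[t]:
        feats.append("has_hyphen")
    if any(c.isupper() for c in words[t]):
        feats.append("has_capital")
    if any(c.isdigit() for c in words[t]):
        feats.append("has_digit")
    return feats
```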

All the perceptron variants (PP, PW-PP, 1-best beam search) initialize the model parameters with zero vectors and process the training instances in the order they appear in the corpus. At the end of each pass, we apply the CRFs using the latest averaged parameters (Collins, 2002) to the development set. We assume the algorithms have converged when the model accuracy on the development set has not increased during the last three iterations. After termination, we apply the averaged parameters yielding the highest performance on the development set to the test instances.

Test and development instances are decoded using a combination of Viterbi search and the tag dictionary approach of Ratnaparkhi (1996). In this approach, candidate tags for known word forms are limited to those observed in the training data. Meanwhile, for word forms unseen during training, the full label set is considered.

3.4 Software and Hardware

The experiments are run on a standard desktop computer. We use our own C++-based implementation of the methods discussed in Section 2.

4 Results

The obtained training times and test set accuracies (measured using accuracy and out-of-vocabulary (OOV) accuracy) are presented in Table 2. The training CPU times include the time (in minutes) consumed by running the perceptron algorithm variants as well as the evaluation of the development set accuracy. The column labeled it. corresponds to the number of passes over the training set made by the algorithms before termination.

method        it.   time (min)   acc.    OOV
English
PP             9        6        96.99   87.97
PW-PP         10        7        96.98   88.11
1-best beam   17        8        96.91   88.33
Pas.-Agg.      9        1        97.01   88.68
Romanian
PP             9        8        96.81   83.66
PW-PP          8        7        96.91   84.38
1-best beam   17       10        96.88   85.32
Pas.-Agg.     13        9        97.06   84.69
Estonian
PP            10        8        93.39   78.10
PW-PP          8        6        93.35   78.66
1-best beam   23       15        92.95   75.65
Pas.-Agg.     15       12        93.27   77.63
Czech
PP            11       26        89.37   70.67
PW-PP         16       41        89.84   72.52
1-best beam   14       19        88.95   70.90
Pegasos       15      341        90.42   72.59
Finnish
PP            11       58        87.09   58.58
PW-PP         11       56        87.16   58.50
1-best beam   21       94        87.38   59.29
Pas.-Agg.     16      693        87.17   57.58

Table 2: Results. We report the CRFsuite results provided by the most accurate algorithm on each language: Pas.-Agg. and Pegasos refer to the algorithms of Crammer et al. (2006) and Shalev-Shwartz et al. (2011), respectively.

We summarize the results as follows. First, PW-PP provided higher accuracies compared to PP on Romanian, Czech, and Finnish. The differences were statistically significant[4] on Czech. Second, while yielding similar running times compared to 1-best beam search, PW-PP provided higher accuracies on all languages apart from Finnish. The differences were significant on Estonian and Czech. Third, while fastest on the Penn Treebank, the CRFsuite toolkit became substantially slower compared to PW-PP when the number of labels was increased (see Czech and Finnish). The differences in accuracies between the best performing CRFsuite algorithm and PP and PW-PP were significant on Czech.

[4] We establish significance (with confidence level 0.95) using the standard 1-sided Wilcoxon signed-rank test performed on 10 randomly divided, non-overlapping subsets of the complete test sets.
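For reference, the test described in footnote [4] is a paired, one-sided Wilcoxon signed-rank test over per-subset accuracies, which could be computed along these lines with SciPy (a sketch under our assumptions about the data layout, not the authors' evaluation code):

```python
from scipy.stats import wilcoxon

def significantly_better(acc_a, acc_b, alpha=0.05):
    """Paired, one-sided Wilcoxon signed-rank test: does method A obtain
    significantly higher accuracy than method B?

    acc_a, acc_b: accuracies of the two methods on the same 10 non-overlapping
    test subsets, paired by subset."""
    stat, p_value = wilcoxon(acc_a, acc_b, alternative="greater")
    return p_value < alpha, p_value
```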

5 Conclusions

We presented a heuristic perceptron variant for estimation of CRFs in the spirit of the classic pseudo-likelihood estimator. The resulting approximative algorithm has a linear time complexity in the label set cardinality and contains only a single hyper-parameter, namely, the number of passes taken over the training data set. We evaluated the algorithm in POS tagging on five languages. Despite its heuristic nature, the algorithm provided competitive accuracies and running times against the reference methods.

Acknowledgements

This work was financially supported by Langnet (Finnish doctoral programme in language studies) and the Academy of Finland under grant no. 251170 (Finnish Centre of Excellence Program (2012-2017)). We would like to thank Dr. Onur Dikmen for the helpful discussions during the work.

References

Karteek Alahari, Chris Russell, and Philip H.S. Torr. 2010. Efficient piecewise learning for conditional random fields. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 895–901.

Galen Andrew and Jianfeng Gao. 2007. Scalable training of L1-regularized log-linear models. In Proceedings of the 24th International Conference on Machine Learning, pages 33–40.

Julian Besag. 1975. Statistical analysis of non-lattice data. The Statistician, pages 179–195.

Michael Collins and Brian Roark. 2004. Incremental parsing with the perceptron algorithm. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, page 111.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, volume 10, pages 1–8.

Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. 2006. Online passive-aggressive algorithms. The Journal of Machine Learning Research, 7:551–585.

Tomaž Erjavec. 2010. Multext-East version 4: Multilingual morphosyntactic specifications, lexicons and corpora. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10).

Basilis Gidas. 1988. Consistency of maximum likelihood and pseudo-likelihood estimators for Gibbs distributions. In Stochastic Differential Systems, Stochastic Control Theory and Applications, pages 129–145.

Katri Haverinen, Jenna Nyblom, Timo Viljanen, Veronika Laippala, Samuel Kohonen, Anna Missilä, Stina Ojala, Tapio Salakoski, and Filip Ginter. 2013. Building the essential resources for Finnish: the Turku Dependency Treebank. Language Resources and Evaluation.

Liang Huang, Suphan Fayong, and Yang Guo. 2012. Structured perceptron with inexact search. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 142–151.

Aapo Hyvärinen. 2006. Consistency of pseudolikelihood estimation of fully visible Boltzmann machines. Neural Computation, 18(10):2283–2292.

Filip Korč and Wolfgang Förstner. 2008. Approximate parameter learning in conditional random fields: An empirical investigation. In Pattern Recognition, pages 11–20.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289.

Ryan McDonald, Keith Hall, and Gideon Mann. 2010. Distributed training strategies for the structured perceptron. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 456–464.

Avihai Mejer and Koby Crammer. 2010. Confidence in structured-prediction using confidence-weighted models. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 971–981.

Jorge Nocedal. 1980. Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 35(151):773–782.

Naoaki Okazaki. 2007. CRFsuite: a fast implementation of conditional random fields (CRFs). URL http://www.chokkan.org/software/crfsuite.

Adwait Ratnaparkhi. 1996. A maximum entropy model for part-of-speech tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, volume 1, pages 133–142.

Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. 2011. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30.

David Sontag, Ofer Meshi, Tommi Jaakkola, and Amir Globerson. 2010. More data means less inference: A pseudo-max approach to structured learning. In Advances in Neural Information Processing Systems 23, pages 2181–2189.

Charles Sutton and Andrew McCallum. 2009. Piecewise training for structured prediction. Machine Learning, 77(2):165–194.

S.V.N. Vishwanathan, Nicol Schraudolph, Mark Schmidt, and Kevin Murphy. 2006. Accelerated training of conditional random fields with stochastic gradient methods. In Proceedings of the 23rd International Conference on Machine Learning, pages 969–976.

Yue Zhang and Stephen Clark. 2011. Syntactic processing using the generalized perceptron and beam search. Computational Linguistics, 37(1):105–151.
