
Unsupervised POS Induction with Word Embeddings

Chu-Cheng Lin, Waleed Ammar, Chris Dyer, Lori Levin
School of Computer Science, Carnegie Mellon University
{chuchenl,wammar,cdyer,lsl}@cs.cmu.edu

Abstract

Unsupervised word embeddings have been shown to be valuable as features in supervised learning problems; however, their role in unsupervised problems has been less thoroughly explored. In this paper, we show that embeddings can likewise add value to the problem of unsupervised POS induction. In two representative models of POS induction, we replace multinomial distributions over the vocabulary with multivariate Gaussian distributions over word embeddings and observe consistent improvements in eight languages. We also analyze the effect of various choices while inducing word embeddings on "downstream" POS induction results.

1 Introduction

Unsupervised POS induction is the problem of assigning word tokens to syntactic categories given only a corpus of untagged text. In this paper we explore the effect of replacing words with their vector space embeddings[1] in two POS induction models: the classic first-order HMM (Kupiec, 1992) and the newly introduced conditional random field autoencoder (Ammar et al., 2014). In each model, instead of using a conditional multinomial distribution[2] to generate a word token w_i ∈ V given a POS tag t_i ∈ T, we use a conditional Gaussian distribution and generate a d-dimensional word embedding v_{w_i} ∈ R^d given t_i.

Our findings suggest that, in both models, substantial improvements are possible when word embeddings are used rather than opaque word types. However, the independence assumptions made by the model used to induce embeddings strongly determine its effectiveness for POS induction: embedding models that model short-range context are more effective than those that model longer-range contexts. This result is unsurprising, but it illustrates the lack of an evaluation metric that measures the syntactic (rather than semantic) information in word embeddings. Our results also confirm the conclusions of Sirts et al. (2014), who were likewise able to improve POS induction results, albeit using a custom clustering model based on the distance-dependent Chinese restaurant process (Blei and Frazier, 2011).

Our contributions are as follows: (i) reparameterization of token-level POS induction models to use word embeddings; and (ii) a systematic evaluation of word embeddings with respect to the syntactic information they contain.

[1] Unlike Yatbaz et al. (2014), we leverage easily obtainable and widely used embeddings of word types.
[2] Also known as a categorical distribution.

2 Vector Space Word Embeddings

Word embeddings represent words in a language's vocabulary as points in a d-dimensional space such that nearby words (points) are similar in terms of their distributional properties. A variety of techniques for learning embeddings have been proposed, e.g., matrix factorization (Deerwester et al., 1990; Dhillon et al., 2011) and neural language modeling (Mikolov et al., 2011; Collobert and Weston, 2008).

For the POS induction task, we specifically need embeddings that capture syntactic similarities. Therefore we experiment with two types of embeddings that are known for such properties:


• Skip-gram embeddings (Mikolov et al., 2013) are based on a log-bilinear model that predicts an unordered set of context words given a target word. Bansal et al. (2014) found that smaller context window sizes tend to result in embeddings with more syntactic information. We confirm this finding in our experiments.

• Structured skip-gram embeddings (Ling et al., 2015) extend the standard skip-gram embeddings (Mikolov et al., 2013) by taking into account the relative positions of words in a given context.

We use the word2vec tool[3] and Ling et al. (2015)'s modified version[4] to generate both plain and structured skip-gram embeddings in nine languages.

3 Models for POS Induction

In this section, we briefly review two classes of models used for POS induction (HMMs and CRF autoencoders), and explain how to generate word embedding observations in each class. We will represent a sentence of length ℓ as w = ⟨w_1, w_2, ..., w_ℓ⟩ ∈ V^ℓ and a sequence of tags as t = ⟨t_1, t_2, ..., t_ℓ⟩ ∈ T^ℓ. The embedding of word type w ∈ V will be written as v_w ∈ R^d.

3.1 Hidden Markov Models

The hidden Markov model (HMM) with multinomial emissions is a classic model for POS induction. This model makes the assumption that a latent Markov process with discrete states representing POS categories emits individual words in the vocabulary according to state- (i.e., tag-) specific emission distributions. An HMM thus defines the following joint distribution over sequences of observations and tags:

  p(w, t) = \prod_{i=1}^{\ell} p(t_i \mid t_{i-1}) \times p(w_i \mid t_i)    (1)

where p(t_i | t_{i-1}) is the transition probability and p(w_i | t_i) is the emission probability, i.e., the probability of a particular tag generating the word at position i.[5]

We consider two variants of the HMM as baselines:

• p(w_i | t_i) is parameterized as a "naive multinomial" distribution with one distinct parameter for each word type.

• p(w_i | t_i) is parameterized as a multinomial logistic regression model with hand-engineered features as detailed in (Berg-Kirkpatrick et al., 2010).

Gaussian Emissions. We now consider incorporating word embeddings in the HMM. Given a tag t ∈ T, instead of generating the observed word w ∈ V, we generate the (pre-trained) embedding v_w ∈ R^d of that word. The conditional probability density assigned to v_w | t follows a multivariate Gaussian distribution with mean µ_t and covariance matrix Σ_t:

  p(v_w; \mu_t, \Sigma_t) = \frac{\exp\left(-\frac{1}{2}(v_w - \mu_t)^\top \Sigma_t^{-1} (v_w - \mu_t)\right)}{\sqrt{(2\pi)^d \, |\Sigma_t|}}    (2)

This parameterization makes the assumption that embeddings of words which are often tagged as t are concentrated around some point µ_t ∈ R^d, and that the concentration decays according to the covariance matrix Σ_t.[6]

Now, the joint distribution over a sequence of observations v = ⟨v_{w_1}, v_{w_2}, ..., v_{w_ℓ}⟩ (which corresponds to the word sequence w = ⟨w_1, w_2, ..., w_ℓ⟩) and a tag sequence t = ⟨t_1, t_2, ..., t_ℓ⟩ becomes:

  p(v, t) = \prod_{i=1}^{\ell} p(t_i \mid t_{i-1}) \times p(v_{w_i}; \mu_{t_i}, \Sigma_{t_i})

We use the Baum–Welch algorithm to fit the µ_t and Σ_t parameters. In every iteration, we update µ_{t*} as follows:

  \mu_{t^*}^{\text{new}} = \frac{\sum_{v \in \mathcal{T}} \sum_{i=1...\ell} p(t_i = t^* \mid v) \times v_{w_i}}{\sum_{v \in \mathcal{T}} \sum_{i=1...\ell} p(t_i = t^* \mid v)}    (3)

where \mathcal{T} is a data set of word embedding sequences v, each of length |v| = ℓ, and p(t_i = t* | v) is the posterior probability of label t* at position i in the sequence v. Likewise, the update to Σ_{t*} is:

  \Sigma_{t^*}^{\text{new}} = \frac{\sum_{v \in \mathcal{T}} \sum_{i=1...\ell} p(t_i = t^* \mid v) \times \delta\delta^\top}{\sum_{v \in \mathcal{T}} \sum_{i=1...\ell} p(t_i = t^* \mid v)}    (4)

where δ = v_{w_i} − µ_{t*}^{new}.

[3] https://code.google.com/p/word2vec/
[4] https://github.com/wlin12/wang2vec/
[5] Terms for the starting and stopping transition probabilities are omitted for brevity.
[6] "Essentially, all models are wrong, but some are useful." — George E. P. Box
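To make Eqs. 2–4 concrete, here is a minimal sketch (not the authors' implementation) of the Gaussian emission log-density and the closed-form M-step updates. It assumes the E-step (forward–backward over the HMM) has already produced posterior tag probabilities per sentence; the function and variable names are illustrative.

```python
# Minimal sketch of the Gaussian-emission pieces in Eqs. 2-4, assuming posteriors
# p(t_i = t | v) are given as one [sentence_length, num_tags] array per sentence.
import numpy as np

def log_gaussian_emission(v, mu, sigma):
    """Log of Eq. 2: log N(v; mu, sigma) for one embedding v of dimension d."""
    d = v.shape[0]
    diff = v - mu
    _, logdet = np.linalg.slogdet(sigma)
    quad = diff @ np.linalg.solve(sigma, diff)
    return -0.5 * (quad + logdet + d * np.log(2.0 * np.pi))

def m_step(embedding_seqs, posteriors, num_tags):
    """Eqs. 3-4: posterior-weighted mean and covariance for every tag."""
    vecs = np.concatenate(embedding_seqs, axis=0)    # all v_{w_i}, shape [N, d]
    gammas = np.concatenate(posteriors, axis=0)      # p(t_i = t | v), shape [N, num_tags]
    mus, sigmas = [], []
    for t in range(num_tags):
        w = gammas[:, t]
        total = w.sum()
        mu = (w[:, None] * vecs).sum(axis=0) / total      # Eq. 3
        diffs = vecs - mu                                  # delta = v_{w_i} - mu_new
        sigma = (w[:, None] * diffs).T @ diffs / total     # Eq. 4
        mus.append(mu)
        sigmas.append(sigma)
    return np.stack(mus), np.stack(sigmas)
```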

3.2 Conditional Random Field Autoencoders

The second class of models this work extends is called CRF autoencoders, which we recently proposed in (Ammar et al., 2014). It is a scalable family of models for feature-rich learning from unlabeled examples. The model conditions on one copy of the structured input, and generates a reconstruction of the input via a set of interdependent latent variables which represent the linguistic structure of interest. As shown in Eq. 5, the model factorizes into two distinct parts: the encoding model p(t | w) and the reconstruction model p(ŵ | t), where w is the structured input (e.g., a token sequence), t is the linguistic structure of interest (e.g., a sequence of POS tags), and ŵ is a generic reconstruction of the input. For POS induction, the encoding model is a linear-chain CRF with feature vector λ and local feature functions f:

  p(\hat{w}, t \mid w) = p(t \mid w) \times p(\hat{w} \mid t) \propto p(\hat{w} \mid t) \times \exp \sum_{i=1}^{|w|} \lambda \cdot f(t_i, t_{i-1}, w)    (5)

In (Ammar et al., 2014), we explored two kinds of reconstructions ŵ: surface forms and Brown clusters (Brown et al., 1992), and used "stupid multinomials" as the underlying distributions for re-generating ŵ.

Gaussian Reconstruction. In this paper, we use d-dimensional word embedding reconstructions ŵ_i = v_{w_i} ∈ R^d, and replace the multinomial distribution of the reconstruction model with the multivariate Gaussian distribution in Eq. 2. We again use the Baum–Welch algorithm to estimate µ_{t*} and Σ_{t*}, similar to Eq. 3. The only difference is that the posterior label probabilities are now conditional on both the input sequence w and the embedding sequence v, i.e., we replace p(t_i = t* | v) in Eq. 3 with p(t_i = t* | w, v).

4 Experiments

In this section, we attempt to answer the following questions:

• §4.1: Do syntactically informed word embeddings improve POS induction? Which model performs best?

• §4.2: What kind of word embeddings are suitable for POS induction?

4.1 Choice of POS Induction Models

Here, we compare the following models for POS induction:

• Baseline: HMM with multinomial emissions (Kupiec, 1992),

• Baseline: HMM with log-linear emissions (Berg-Kirkpatrick et al., 2010),

• Baseline: CRF autoencoder with multinomial reconstructions (Ammar et al., 2014),[7]

• Proposed: HMM with Gaussian emissions, and

• Proposed: CRF autoencoder with Gaussian reconstructions.

Data. To train the POS induction models, we used the plain text from the training sections of the CoNLL-X shared task (Buchholz and Marsi, 2006) (for Danish and Turkish), the CoNLL 2007 shared task (Nivre et al., 2007) (for Arabic, Basque, Greek, Hungarian, and Italian), and the Ukwabelana corpus (Spiegler et al., 2010) (for Zulu). For evaluation, we obtain the corresponding gold-standard POS tags by deterministically mapping the language-specific POS tags in the aforementioned corpora to the corresponding universal POS tag set (Petrov et al., 2012). This is the same setup we used in (Ammar et al., 2014).

Setup. In this section, we used skip-gram (i.e., word2vec) embeddings with a context window of size 1 and dimensionality d = 100, trained on the largest corpora for each language in (Quasthoff et al., 2006), in addition to the plain text used to train the POS induction models.[8] In the proposed models, we only show results for estimating µ_t, assuming a fixed diagonal covariance matrix Σ_t(k, k) = 0.45 ∀k ∈ {1, ..., d}.[9] While the CRF autoencoder with multinomial reconstructions was carefully initialized as discussed in (Ammar et al., 2014), the CRF autoencoder with Gaussian reconstructions was initialized uniformly at random in [−1, 1].

[7] We use the configuration with best performance, which reconstructs Brown clusters.
[8] We used the corpus/tokenize-anything.sh script in the cdec decoder (Dyer et al., 2010) to tokenize the corpora from (Quasthoff et al., 2006). The other corpora were already tokenized. In Arabic and Italian, we found a lot of discrepancies between the tokenization used for inducing word embeddings and the tokenization used for evaluation. We expect our results to improve with consistent tokenization.
[9] Surprisingly, we found that estimating Σ_t significantly degrades the performance. This may be due to overfitting (Shinozaki and Kawahara, 2007). Possible remedies include using a prior (Gauvain and Lee, 1994).
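For concreteness, the sketch below shows the embedding configuration described in the Setup paragraph (skip-gram, window size 1, d = 100). The paper itself trains embeddings with the word2vec tool and Ling et al. (2015)'s modified version; gensim and the file names here are stand-ins, not the authors' pipeline.

```python
# Illustrative only: reproduces the Setup configuration (skip-gram, window 1,
# d = 100) with gensim 4.x. "corpus.txt" is a hypothetical pre-tokenized plain
# text file, one sentence per line.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("corpus.txt")
model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimensionality d
    window=1,         # small context window -> more syntactic embeddings
    sg=1,             # skip-gram rather than CBOW
    min_count=1,
    workers=4,
)
model.wv.save_word2vec_format("vectors.txt")  # v_w table for the POS models
```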

[Figure 1 appears here: grouped bar charts of V-measure per language (Arabic, Basque, Danish, Greek, Hungarian, Italian, Turkish, Zulu, and the average). Left panel: Multinomial HMM, Multinomial Featurized HMM, Multinomial CRF Autoencoder, Gaussian HMM, Gaussian CRF Autoencoder. Right panel: HMM and CRF Autoencoder with standard vs. structured skip-gram embeddings. See the caption below.]

Figure 1: POS induction results (V-measure, higher is better). Window size is 1 for all word embeddings. Left: models which use standard skip-gram word embeddings (i.e., Gaussian HMM and Gaussian CRF Autoencoder) outperform all baselines on average across languages. Right: comparison between standard and structured skip-grams on the Gaussian HMM and CRF Autoencoder.

All HMM models were also randomly initialized. We tuned all hyperparameters on the English PTB corpus, then fixed them for all languages.

Evaluation. We use the V-measure evaluation metric (Rosenberg and Hirschberg, 2007) to evaluate the predicted syntactic classes at the token level.[10]

Results. The results in Fig. 1 clearly suggest that we can use word embeddings to improve POS induction. Surprisingly, the feature-less Gaussian HMM outperforms the strong feature-rich baselines: the Multinomial Featurized HMM and the Multinomial CRF Autoencoder. One explanation is that our word embeddings were induced using larger unlabeled corpora than those used to train the POS induction models. The best results are obtained by combining word embeddings and feature-rich modeling in the Gaussian CRF autoencoder. This set of results suggests that word embeddings and hand-engineered features play complementary roles in POS induction. It is worth noting that the CRF autoencoder model with Gaussian reconstructions did not require careful initialization.[11]

4.2 Choice of Embeddings

Standard skip-gram vs. structured skip-gram. On Gaussian HMMs, structured skip-gram embeddings score moderately higher than standard skip-grams, and as the context window size gets larger, the gap widens (as shown in Fig. 2). The reason may be that structured skip-gram embeddings give each position within the context window its own projection matrix, so the smearing effect as the window grows is not as pronounced as with standard embeddings. However, the best performance is still obtained when the window size is small.[12]

Dimensions = 20 vs. 200. We also varied the number of dimensions in the word vectors (d ∈ {20, 50, 100, 200}). The best V-measure we obtain is 0.504 (d = 20) and the worst is 0.460 (d = 100). However, we did not observe a consistent pattern, as shown in Fig. 3.

Window size = 1 vs. 16. Finally, we varied the window size of the context surrounding target words (w ∈ {1, 2, 4, 8, 16}). w = 1 yields the best average V-measure across the eight languages, as shown in Fig. 2. This is true for both standard and structured skip-gram models.

[10] We found the V-measure results to be consistent with the many-to-one evaluation metric (Johnson, 2007). We only show one set of results for brevity.
[11] In (Ammar et al., 2014), we found that careful initialization for the CRF autoencoder model with multinomial reconstructions is necessary.
[12] In preliminary experiments, we also compared standard skip-gram embeddings to SENNA embeddings (Collobert et al., 2011), which are trained in a semi-supervised multi-task learning setup with one task being POS tagging, on a subset of the English PTB corpus. As expected, the induced POS tags are much better when using SENNA embeddings, yielding a V-measure score of 0.57 compared to 0.51 for skip-gram embeddings. Since SENNA embeddings are only available in English, we did not include them in the comparison in Fig. 1.
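As an illustration of the token-level evaluation described above, the sketch below computes V-measure with scikit-learn and a simple many-to-one accuracy (cf. footnote 10). The authors do not specify their implementation, so the library choice and the toy sequences are assumptions.

```python
# Token-level evaluation sketch: induced tags are arbitrary cluster IDs, so both
# metrics compare them to gold universal POS tags without a fixed label mapping.
from collections import Counter, defaultdict
from sklearn.metrics import v_measure_score

gold_tags = ["NOUN", "VERB", "DET", "NOUN", "VERB"]   # flattened over the corpus (toy data)
induced_tags = [3, 7, 1, 3, 7]                        # cluster IDs per token (toy data)

def many_to_one(gold, induced):
    """Map each induced cluster to its most frequent gold tag, then score."""
    by_cluster = defaultdict(Counter)
    for g, c in zip(gold, induced):
        by_cluster[c][g] += 1
    mapping = {c: cnt.most_common(1)[0][0] for c, cnt in by_cluster.items()}
    return sum(mapping[c] == g for g, c in zip(gold, induced)) / len(gold)

print("V-measure:  ", v_measure_score(gold_tags, induced_tags))
print("Many-to-one:", many_to_one(gold_tags, induced_tags))
```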

[Figures 2 and 3 appear here: average V-measure curves (vertical axis 0.30–0.45) for standard vs. structured skip-gram embeddings, plotted against window size (1, 2, 4, 8, 16) and against dimension size (20, 50, 100, 200). See the captions below.]

Figure 2: Effect of window size and embedding type on POS induction over the languages in Fig. 1. d = 100. The model is the HMM with Gaussian emissions.

Figure 3: Effect of dimension size on POS induction on a subset of the English PTB corpus. w = 1. The model is the HMM with Gaussian emissions.

Notably, larger window sizes appear to produce word embeddings with less syntactic information. This result confirms the observations of Bansal et al. (2014).

4.3 Discussion

We have shown that (re)generating word embeddings does much better than generating opaque word types in unsupervised POS induction. At a high level, this confirms prior findings that unsupervised word embeddings capture syntactic properties of words, and shows that some embeddings capture more syntactically salient information than others. As such, we contend that unsupervised POS induction can be seen as a diagnostic metric for assessing the syntactic quality of embeddings.

To get a better understanding of what the multivariate Gaussian models have learned, we conduct a hill-climbing experiment on our English dataset. We seed each POS category with the average vector of 10 randomly sampled words from that category and train the model. Seeding unsurprisingly improves tagging performance. We also find that the words nearest to the centroids generally agree with the correct category label, which validates our assumption that syntactically similar words tend to cluster in the high-dimensional embedding space. It also shows that careful initialization of model parameters can bring further improvements.

However, we also find that words close to the centroid are not necessarily representative of what linguists consider to be prototypical. For example, Hopper and Thompson (1983) show that physical, telic, past-tense verbs are more prototypical with respect to case marking, agreement, and other syntactic behavior. Yet the verbs nearest our centroid all seem rather abstract: in English, the nearest 5 words in the verb category are entails, aspires, attaches, foresees, deems. This may be because these words seldom serve functions other than verbs, so placing the centroid around them incurs less penalty (in contrast to physical verbs, e.g. bite, which often also act as nouns). Therefore one should be cautious in interpreting what is prototypical about them.

5 Conclusion

We propose using a multivariate Gaussian model to generate vector space representations of observed words in generative or hybrid models for POS induction, as a superior alternative to using multinomial distributions to generate categorical word types. We find the performance of a simple Gaussian HMM competitive with strong feature-rich baselines. We further show that substituting the emission part of the CRF autoencoder can bring further improvements. We also confirm previous findings which suggest that smaller context windows in skip-gram models result in word embeddings which encode more syntactic information. It would be interesting to see if we can apply this approach to other tasks which require generative modeling of textual observations, such as language modeling and grammar induction.

Acknowledgements

This work was sponsored by the U.S. Army Research Laboratory and the U.S. Army Research Office under contract/grant numbers W911NF-11-2-0042 and W911NF-10-1-0533. The statements made herein are solely the responsibility of the authors.
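For readers who want to reproduce the centroid analysis of Section 4.3, here is a rough sketch of the seeding and nearest-neighbor steps; the data structures are hypothetical and this is not the authors' code.

```python
# Rough sketch of the Section 4.3 analysis: each category mean is initialized to
# the average embedding of 10 sampled words from that category, and after training
# the word types closest to each learned mean mu_t are inspected. `embeddings`
# maps word -> vector; `seed_words_by_tag` is a hypothetical seed dictionary.
import numpy as np

def seed_means(seed_words_by_tag, embeddings):
    """Initialize mu_t as the average embedding of the sampled seed words."""
    return {tag: np.mean([embeddings[w] for w in words], axis=0)
            for tag, words in seed_words_by_tag.items()}

def nearest_words(mu, embeddings, k=5):
    """Return the k word types whose embeddings lie closest to centroid mu."""
    words = list(embeddings)
    dists = np.array([np.linalg.norm(embeddings[w] - mu) for w in words])
    return [words[i] for i in np.argsort(dists)[:k]]
```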

References

Waleed Ammar, Chris Dyer, and Noah A. Smith. 2014. Conditional random field autoencoders for unsupervised structured prediction. In NIPS.

Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring continuous word representations for dependency parsing. In Proc. of ACL.

Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. 2010. Painless unsupervised learning with features. In Proc. of NAACL.

David M. Blei and Peter I. Frazier. 2011. Distance dependent Chinese restaurant processes. JMLR.

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In CoNLL-X.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proc. of ICML.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. JMLR, 12:2493–2537, November.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

Paramveer S. Dhillon, Dean Foster, and Lyle Ungar. 2011. Multi-view learning of word embeddings via CCA. In NIPS, volume 24.

Chris Dyer, Adam Lopez, Juri Ganitkevitch, Jonathan Weese, Ferhan Ture, Phil Blunsom, Hendra Setiawan, Vladimir Eidelman, and Philip Resnik. 2010. cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models. In Proc. of ACL.

J. Gauvain and Chin-Hui Lee. 1994. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing, 2(2):291–298, April.

Paul Hopper and Sandra Thompson. 1983. The iconicity of the universal categories "noun" and "verb". In John Haiman, editor, Iconicity in Syntax: Proceedings of a Symposium on Iconicity in Syntax.

Mark Johnson. 2007. Why doesn't EM find good HMM POS-taggers? In Proc. of EMNLP.

Julian Kupiec. 1992. Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language, 6:225–242.

Wang Ling, Chris Dyer, Alan Black, and Isabel Trancoso. 2015. Two/too simple adaptations of word2vec for syntax problems. In Proc. of NAACL.

Tomas Mikolov, Stefan Kombrink, Anoop Deoras, Lukar Burget, and J. Cernocky. 2011. RNNLM — recurrent neural network language modeling toolkit. In Proc. of the 2011 ASRU Workshop, pages 196–201.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. ArXiv e-prints, January.

Joakim Nivre, Johan Hall, Sandra Kübler, Ryan McDonald, Jens Nilsson, Sebastian Riedel, and Deniz Yuret. 2007. The CoNLL 2007 shared task on dependency parsing. In Proc. of CoNLL.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proc. of LREC, May.

Uwe Quasthoff, Matthias Richter, and Christian Biemann. 2006. Corpus portal for search in monolingual corpora. In Proc. of LREC, pages 1799–1802.

Andrew Rosenberg and Julia Hirschberg. 2007. V-measure: A conditional entropy-based external cluster evaluation measure. In EMNLP-CoNLL.

T. Shinozaki and T. Kawahara. 2007. HMM training based on CV-EM and CV Gaussian mixture optimization. In Proc. of the 2007 ASRU Workshop, pages 318–322, December.

Kairit Sirts, Jacob Eisenstein, Micha Elsner, and Sharon Goldwater. 2014. POS induction with distributional and morphological information using a distance-dependent Chinese restaurant process. In Proc. of ACL.

Sebastian Spiegler, Andrew van der Spuy, and Peter A. Flach. 2010. Ukwabelana: An open-source morphological Zulu corpus. In Proc. of COLING, pages 1020–1028.

Mehmet Ali Yatbaz, Enis Rıfat Sert, and Deniz Yuret. 2014. Unsupervised instance-based part of speech induction using probable substitutes. In Proc. of COLING.
