Unsupervised POS Induction with Word Embeddings

Chu-Cheng Lin   Waleed Ammar   Chris Dyer   Lori Levin
School of Computer Science, Carnegie Mellon University
{chuchenl,wammar,cdyer,lsl}@cs.cmu.edu

Abstract

Unsupervised word embeddings have been shown to be valuable as features in supervised learning problems; however, their role in unsupervised problems has been less thoroughly explored. In this paper, we show that embeddings can likewise add value to the problem of unsupervised POS induction. In two representative models of POS induction, we replace multinomial distributions over the vocabulary with multivariate Gaussian distributions over word embeddings and observe consistent improvements in eight languages. We also analyze the effect of various choices while inducing word embeddings on "downstream" POS induction results.

1 Introduction

Unsupervised POS induction is the problem of assigning word tokens to syntactic categories given only a corpus of untagged text. In this paper we explore the effect of replacing words with their vector space embeddings[1] in two POS induction models: the classic first-order HMM (Kupiec, 1992) and the newly introduced conditional random field autoencoder (Ammar et al., 2014). In each model, instead of using a conditional multinomial distribution[2] to generate a word token w_i ∈ V given a POS tag t_i ∈ T, we use a conditional Gaussian distribution and generate a d-dimensional word embedding v_{w_i} ∈ R^d given t_i.

Our findings suggest that, in both models, substantial improvements are possible when word embeddings are used rather than opaque word types. However, the independence assumptions made by the model used to induce embeddings strongly determine its effectiveness for POS induction: embedding models that model short-range context are more effective than those that model longer-range contexts. This result is unsurprising, but it illustrates the lack of an evaluation metric that measures the syntactic (rather than semantic) information in word embeddings. Our results also confirm the conclusions of Sirts et al. (2014), who were likewise able to improve POS induction results, albeit using a custom clustering model based on the distance-dependent Chinese restaurant process (Blei and Frazier, 2011).

Our contributions are as follows: (i) a reparameterization of token-level POS induction models to use word embeddings; and (ii) a systematic evaluation of word embeddings with respect to the syntactic information they contain.

[1] Unlike Yatbaz et al. (2014), we leverage easily obtainable and widely used embeddings of word types.
[2] Also known as a categorical distribution.

2 Vector Space Word Embeddings

Word embeddings represent words in a language's vocabulary as points in a d-dimensional space such that nearby words (points) are similar in terms of their distributional properties. A variety of techniques for learning embeddings have been proposed, e.g., matrix factorization (Deerwester et al., 1990; Dhillon et al., 2011) and neural language modeling (Mikolov et al., 2011; Collobert and Weston, 2008).

For the POS induction task, we specifically need embeddings that capture syntactic similarities. Therefore we experiment with two types of embeddings that are known for such properties:

• Skip-gram embeddings (Mikolov et al., 2013) are based on a log bilinear model that predicts an unordered set of context words given a target word. Bansal et al. (2014) found that smaller context window sizes tend to result in embeddings with more syntactic information. We confirm this finding in our experiments.

• Structured skip-gram embeddings (Ling et al., 2015) extend the standard skip-gram embeddings (Mikolov et al., 2013) by taking into account the relative positions of words in a given context.

We use the word2vec tool[3] and Ling et al. (2015)'s modified version[4] to generate both plain and structured skip-gram embeddings in nine languages.

[3] https://code.google.com/p/word2vec/
[4] https://github.com/wlin12/wang2vec
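For concreteness, the following sketch shows how skip-gram embeddings with the small context window favored here could be trained. It uses gensim (version 4 or later assumed) as a stand-in; the experiments in this paper use the word2vec tool and Ling et al. (2015)'s modified version, and the toy corpus below is purely illustrative.

```python
# Sketch: window-1 skip-gram embeddings, assuming gensim >= 4.0 is installed.
# The toy corpus stands in for the monolingual corpora described in Section 4.
from gensim.models import Word2Vec

corpus = [["the", "dog", "barks"], ["a", "cat", "sleeps"]]  # tokenized sentences
model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # d = 100, as in the main setup of this paper
    window=1,         # small context window: more syntactic embeddings
    sg=1,             # skip-gram rather than CBOW
    min_count=1,
    workers=4,
)
v_dog = model.wv["dog"]  # the vector v_w later used as an HMM/CRF observation
```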
3 Models for POS Induction

In this section, we briefly review two classes of models used for POS induction (HMMs and CRF autoencoders), and explain how to generate word embedding observations in each class. We will represent a sentence of length ℓ as w = ⟨w_1, w_2, ..., w_ℓ⟩ ∈ V^ℓ and a sequence of tags as t = ⟨t_1, t_2, ..., t_ℓ⟩ ∈ T^ℓ. The embedding of word type w ∈ V will be written as v_w ∈ R^d.

3.1 Hidden Markov Models

The hidden Markov model with multinomial emissions is a classic model for POS induction. This model makes the assumption that a latent Markov process with discrete states representing POS categories emits individual words in the vocabulary according to state-specific (i.e., tag-specific) emission distributions. An HMM thus defines the following joint distribution over sequences of observations and tags:

p(w, t) = \prod_{i=1}^{\ell} p(t_i \mid t_{i-1}) \times p(w_i \mid t_i)    (1)

where p(t_i | t_{i-1}) is the transition probability and p(w_i | t_i) is the emission probability, i.e., the probability of a particular tag generating the word at position i.[5]

We consider two variants of the HMM as baselines:

• p(w_i | t_i) is parameterized as a "naïve multinomial" distribution with one distinct parameter for each word type.

• p(w_i | t_i) is parameterized as a multinomial logistic regression model with hand-engineered features, as detailed in (Berg-Kirkpatrick et al., 2010).

Gaussian Emissions. We now consider incorporating word embeddings in the HMM. Given a tag t ∈ T, instead of generating the observed word w ∈ V, we generate the (pre-trained) embedding v_w ∈ R^d of that word. The conditional probability density assigned to v_w | t follows a multivariate Gaussian distribution with mean μ_t and covariance matrix Σ_t:

p(v_w; \mu_t, \Sigma_t) = \frac{\exp\left(-\frac{1}{2}(v_w - \mu_t)^\top \Sigma_t^{-1} (v_w - \mu_t)\right)}{\sqrt{(2\pi)^d \, |\Sigma_t|}}    (2)

This parameterization makes the assumption that embeddings of words which are often tagged as t are concentrated around some point μ_t ∈ R^d, and that the concentration decays according to the covariance matrix Σ_t.[6]

Now, the joint distribution over a sequence of observations v = ⟨v_{w_1}, v_{w_2}, ..., v_{w_ℓ}⟩ (which corresponds to the word sequence w = ⟨w_1, w_2, ..., w_ℓ⟩) and a tag sequence t = ⟨t_1, t_2, ..., t_ℓ⟩ becomes:

p(v, t) = \prod_{i=1}^{\ell} p(t_i \mid t_{i-1}) \times p(v_{w_i}; \mu_{t_i}, \Sigma_{t_i})

We use the Baum–Welch algorithm to fit the μ_t and Σ_t parameters. In every iteration, we update μ_{t*} as follows:

\mu_{t^*}^{\text{new}} = \frac{\sum_{v \in \mathcal{T}} \sum_{i=1}^{\ell} p(t_i = t^* \mid v) \, v_{w_i}}{\sum_{v \in \mathcal{T}} \sum_{i=1}^{\ell} p(t_i = t^* \mid v)}    (3)

where \mathcal{T} is a data set of word embedding sequences v, each of length |v| = ℓ (we write \mathcal{T} to distinguish the data set from the tag set T), and p(t_i = t* | v) is the posterior probability of label t* at position i in the sequence v. Likewise, the update to Σ_{t*} is:

\Sigma_{t^*}^{\text{new}} = \frac{\sum_{v \in \mathcal{T}} \sum_{i=1}^{\ell} p(t_i = t^* \mid v) \, \delta \delta^\top}{\sum_{v \in \mathcal{T}} \sum_{i=1}^{\ell} p(t_i = t^* \mid v)}    (4)

where \delta = v_{w_i} - \mu_{t^*}^{\text{new}}.

[5] Terms for the starting and stopping transition probabilities are omitted for brevity.
[6] "Essentially, all models are wrong, but some are useful" – George E. P. Box
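A minimal sketch of the M-step updates in Eqs. 3 and 4, assuming the token-level posteriors p(t_i = t* | v) from the Baum–Welch E-step have been stacked into a single matrix over all corpus positions; the array layout and function name are ours, not the authors' implementation.

```python
import numpy as np

def m_step_gaussian(posteriors, embeddings):
    """Re-estimate Gaussian emission parameters (Eqs. 3 and 4).

    posteriors: (N, K) array; entry [n, k] is p(t_n = k | v) for token
                position n, computed by the forward-backward E-step.
    embeddings: (N, d) array; row n is the pre-trained embedding v_{w_n}.
    Returns per-tag means (K, d) and covariances (K, d, d).
    A sketch of the published updates, not the authors' implementation.
    """
    K = posteriors.shape[1]
    d = embeddings.shape[1]
    gamma = posteriors.sum(axis=0)                     # per-tag denominators
    mu = (posteriors.T @ embeddings) / gamma[:, None]  # Eq. 3: weighted means
    sigma = np.zeros((K, d, d))
    for k in range(K):
        delta = embeddings - mu[k]                     # delta = v_{w_n} - mu_k^new
        sigma[k] = (posteriors[:, k, None] * delta).T @ delta / gamma[k]  # Eq. 4
    return mu, sigma
```

In practice these updates alternate with the forward–backward E-step until the likelihood converges.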
3.2 Conditional Random Field Autoencoders

The second class of models this work extends is called CRF autoencoders, which we recently proposed in (Ammar et al., 2014). It is a scalable family of models for feature-rich learning from unlabeled examples. The model conditions on one copy of the structured input and generates a reconstruction of the input via a set of interdependent latent variables which represent the linguistic structure of interest. As shown in Eq. 5, the model factorizes into two distinct parts: the encoding model p(t | w) and the reconstruction model p(ŵ | t), where w is the structured input (e.g., a token sequence), t is the linguistic structure of interest (e.g., a sequence of POS tags), and ŵ is a generic reconstruction of the input. For POS induction, the encoding model is a linear-chain CRF with feature vector λ and local feature functions f:

p(\hat{w}, t \mid w) = p(t \mid w) \times p(\hat{w} \mid t) \propto p(\hat{w} \mid t) \times \exp \sum_{i=1}^{|w|} \lambda \cdot f(t_i, t_{i-1}, w)    (5)

In (Ammar et al., 2014), we explored two kinds of reconstructions ŵ: surface forms and Brown clusters (Brown et al., 1992), and used "stupid multinomials" as the underlying distributions for re-generating ŵ.

Gaussian Reconstruction. In this paper, we use d-dimensional word embedding reconstructions ŵ_i = v_{w_i} ∈ R^d and replace the multinomial distribution of the reconstruction model with the multivariate Gaussian distribution in Eq. 2. We again use the Baum–Welch algorithm to estimate μ_{t*} and Σ_{t*}, similarly to Eq. 3. The only difference is that the posterior label probabilities are now conditioned on both the input sequence w and the embedding sequence v, i.e., we replace p(t_i = t* | v) in Eq. 3 with p(t_i = t* | w, v).
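To make the factorization in Eq. 5 concrete, here is a hedged sketch of the unnormalized log-score that the Gaussian-reconstruction variant assigns to a tag sequence: a CRF encoding term plus a Gaussian reconstruction term at every position. The feature function feats, the parameter containers lam, mu, and sigma, and the overall interface are illustrative placeholders, not the authors' code.

```python
from scipy.stats import multivariate_normal

def crf_autoencoder_logscore(tags, words, embeddings, feats, lam, mu, sigma):
    """Unnormalized log p(w_hat, t | w) under Eq. 5 with Gaussian reconstructions:
    sum_i [ lam . f(t_i, t_{i-1}, w) + log N(v_{w_i}; mu_{t_i}, Sigma_{t_i}) ].

    tags:       list of tag ids, one per token (t_0 is treated as None).
    embeddings: (len(tags), d) array; row i is the reconstruction target v_{w_i}.
    feats:      hypothetical feature function feats(t, t_prev, words, i) -> dict.
    lam:        dict of CRF feature weights; mu, sigma: per-tag Gaussian parameters.
    """
    total, prev = 0.0, None
    for i, t in enumerate(tags):
        for name, value in feats(t, prev, words, i).items():
            total += lam.get(name, 0.0) * value                 # encoding (CRF) term
        total += multivariate_normal.logpdf(embeddings[i],      # reconstruction term
                                            mean=mu[t], cov=sigma[t])
        prev = t
    return total
```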

4 Experiments

In this section, we attempt to answer the following questions:

• §4.1: Do syntactically-informed word embeddings improve POS induction? Which model performs best?

• §4.2: What kind of word embeddings are suitable for POS induction?

4.1 Choice of POS Induction Models

Here, we compare the following models for POS induction:

• Baseline: HMM with multinomial emissions (Kupiec, 1992),

• Baseline: HMM with log-linear emissions (Berg-Kirkpatrick et al., 2010),

• Baseline: CRF autoencoder with multinomial reconstructions (Ammar et al., 2014),[7]

• Proposed: HMM with Gaussian emissions, and

• Proposed: CRF autoencoder with Gaussian reconstructions.

Data. To train the POS induction models, we used the plain text from the training sections of the CoNLL-X shared task (Buchholz and Marsi, 2006) (for Danish and Turkish), the CoNLL 2007 shared task (Nivre et al., 2007) (for Arabic, Basque, Greek, Hungarian and Italian), and the Ukwabelana corpus (Spiegler et al., 2010) (for Zulu). For evaluation, we obtain the corresponding gold-standard POS tags by deterministically mapping the language-specific POS tags in the aforementioned corpora to the universal POS tag set (Petrov et al., 2012). This is the same setup we used in (Ammar et al., 2014).

Setup. In this section, we used skip-gram (i.e., word2vec) embeddings with a context window size of 1 and dimensionality d = 100, trained with the largest corpora for each language in (Quasthoff et al., 2006), in addition to the plain text used to train the POS induction models.[8] In the proposed models, we only show results for estimating μ_t, assuming a fixed diagonal covariance matrix Σ_t(k, k) = 0.45 for all k ∈ {1, ..., d}.[9] While the CRF autoencoder with multinomial reconstructions was carefully initialized as discussed in (Ammar et al., 2014), the CRF autoencoder with Gaussian reconstructions was initialized uniformly at random in [−1, 1]. All HMM models were also randomly initialized. We tuned all hyperparameters on the English PTB corpus, then fixed them for all languages.

[7] We use the configuration with the best performance, which reconstructs Brown clusters.
[8] We used the corpus/tokenize-anything.sh script in the cdec decoder (Dyer et al., 2010) to tokenize the corpora from (Quasthoff et al., 2006). The other corpora were already tokenized. In Arabic and Italian, we found many discrepancies between the tokenization used for inducing word embeddings and the tokenization used for evaluation. We expect our results to improve with consistent tokenization.
[9] Surprisingly, we found that estimating Σ_t significantly degrades the performance. This may be due to overfitting (Shinozaki and Kawahara, 2007). Possible remedies include using a prior (Gauvain and Lee, 1994).
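Under this setup only μ_t is estimated and the covariance is fixed to 0.45·I, so the emission density of Eq. 2 reduces to an isotropic Gaussian whose log-density is cheap to evaluate. A small sketch of that computation (our own helper, not the authors' implementation):

```python
import numpy as np

def isotropic_log_emission(v, mu, var=0.45):
    """log p(v; mu, var * I) per Eq. 2 with the fixed diagonal covariance
    Sigma_t(k, k) = 0.45 used in this setup. v, mu: d-dimensional vectors."""
    diff = np.asarray(v, dtype=float) - np.asarray(mu, dtype=float)
    d = diff.shape[0]
    return -0.5 * diff.dot(diff) / var - 0.5 * d * np.log(2 * np.pi * var)
```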

Figure 1: POS induction results (V-measure; higher is better). Window size is 1 for all word embeddings. Left: models which use standard skip-gram word embeddings (i.e., Gaussian HMM and Gaussian CRF autoencoder) outperform all baselines on average across languages. Right: comparison between standard and structured skip-grams for the Gaussian HMM and the Gaussian CRF autoencoder.

Evaluation. We use the V-measure evaluation metric (Rosenberg and Hirschberg, 2007) to evaluate the predicted syntactic classes at the token level.[10]

Results. The results in Fig. 1 (left) clearly suggest that we can use word embeddings to improve POS induction. Surprisingly, the feature-less Gaussian HMM model outperforms the strong feature-rich baselines: the Multinomial Featurized HMM and the Multinomial CRF Autoencoder. One explanation is that our word embeddings were induced using larger unlabeled corpora than those used to train the POS induction models. The best results are obtained by combining word embeddings with feature-rich modeling in the Gaussian CRF autoencoder. This set of results suggests that word embeddings and hand-engineered features play complementary roles in POS induction. It is worth noting that the CRF autoencoder model with Gaussian reconstructions did not require careful initialization.[11]

[10] We found the V-measure results to be consistent with the many-to-one evaluation metric (Johnson, 2007). We only show one set of results for brevity.
[11] In (Ammar et al., 2014), we found that careful initialization for the CRF autoencoder model with multinomial reconstructions is necessary.

4.2 Choice of Embeddings

Standard skip-gram vs. structured skip-gram. On Gaussian HMMs, structured skip-gram embeddings score moderately higher than standard skip-grams, and as the context window size gets larger, the gap widens. The reason may be that structured skip-gram embeddings give each position within the context window its own projection matrix, so the smearing effect is less pronounced as the window grows than with the standard embeddings. However, the best performance is still obtained when the window is small.[12]

Dimensions = 20 vs. 200. We also varied the number of dimensions in the word vectors (d ∈ {20, 50, 100, 200}). The best V-measure we obtain is 0.504 (d = 20) and the worst is 0.460 (d = 100). However, we did not observe a consistent pattern, as shown in Fig. 3.

Window size = 1 vs. 16. Finally, we varied the size of the context window surrounding target words (w ∈ {1, 2, 4, 8, 16}). w = 1 yields the best average V-measure across the eight languages, as shown in Fig. 2. This is true for both standard and structured skip-gram models. Notably, larger window sizes appear to produce word embeddings with less syntactic information. This result confirms the observations of Bansal et al. (2014).

[12] In preliminary experiments, we also compared standard skip-gram embeddings to SENNA embeddings (Collobert et al., 2011), which are trained in a semi-supervised multi-task learning setup with one task being POS tagging, on a subset of the English PTB corpus. As expected, the induced POS tags are much better when using SENNA embeddings, yielding a V-measure score of 0.57 compared to 0.51 for skip-gram embeddings. Since SENNA embeddings are only available in English, we did not include them in the comparison in Fig. 1.
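The V-measure scores reported above are an external clustering metric computed over token-level tag assignments. As an illustration only (the paper does not say which implementation was used), such a score can be computed with scikit-learn:

```python
from sklearn.metrics import v_measure_score

# Toy example: gold universal tags and induced cluster ids, one entry per token.
gold = ["NOUN", "VERB", "DET", "NOUN", "VERB"]
induced = [3, 7, 1, 3, 7]
print(v_measure_score(gold, induced))  # 1.0: the toy clustering matches the gold classes
```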

Figure 2: Effect of window size and embedding type on POS induction over the languages in Fig. 1, with d = 100. The model is the HMM with Gaussian emissions.

Figure 3: Effect of dimension size on POS induction on a subset of the English PTB corpus. Window size is set to 1 for all configurations. The model is the HMM with Gaussian emissions.

4.3 Discussion

We have shown that (re)generating word embeddings does much better than generating opaque word types in unsupervised POS induction. At a high level, this confirms prior findings that unsupervised word embeddings capture syntactic properties of words, and shows that some embeddings capture more syntactically salient information than others. As such, we contend that unsupervised POS induction can be seen as a diagnostic metric for assessing the syntactic quality of embeddings.

To get a better understanding of what the multivariate Gaussian models have learned, we conduct a hill-climbing experiment on our English dataset. We seed each POS category with the average vector of 10 randomly sampled words from that category and train the model. Seeding unsurprisingly improves tagging performance. We also find that the words nearest to the centroids generally agree with the correct category label, which validates our assumption that syntactically similar words tend to cluster in the high-dimensional embedding space. It also shows that careful initialization of model parameters can bring further improvements.

However, we also find that words close to a centroid are not necessarily representative of what linguists consider to be prototypical. For example, Hopper and Thompson (1983) show that physical, telic, past-tense verbs are more prototypical with respect to case marking, agreement, and other syntactic behavior. However, the verbs nearest our centroid all seem rather abstract. In English, the five words nearest the verb centroid are entails, aspires, attaches, foresees, and deems. This may be because these words seldom serve functions other than verbs, so placing the centroid near them incurs less penalty (in contrast to physical verbs, e.g., bite, which often also act as nouns). Therefore one should be cautious in interpreting what is prototypical about them.

5 Conclusion

We propose using a multivariate Gaussian model to generate vector space representations of observed words in generative or hybrid models for POS induction, as a superior alternative to using multinomial distributions to generate categorical word types. We find the performance of a simple Gaussian HMM competitive with strong feature-rich baselines. We further show that substituting the emission part of the CRF autoencoder can bring further improvements. We also confirm previous findings which suggest that smaller context windows in skip-gram models result in word embeddings which encode more syntactic information. It would be interesting to see if we can apply this approach to other tasks which require generative modeling of textual observations, such as language modeling and grammar induction.

Acknowledgements

This work was sponsored by the U.S. Army Research Laboratory and the U.S. Army Research Office under contract/grant numbers W911NF-11-2-0042 and W911NF-10-1-0533. The statements made herein are solely the responsibility of the authors.
References

Waleed Ammar, Chris Dyer, and Noah A. Smith. 2014. Conditional random field autoencoders for unsupervised structured prediction. In NIPS.

Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring continuous word representations for dependency parsing. In Proc. of ACL.

Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. 2010. Painless unsupervised learning with features. In Proc. of NAACL.

David M. Blei and Peter I. Frazier. 2011. Distance dependent Chinese restaurant processes. JMLR.

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In CoNLL-X.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proc. of ICML.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. JMLR, 12:2493–2537, November.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

Paramveer S. Dhillon, Dean Foster, and Lyle Ungar. 2011. Multi-view learning of word embeddings via CCA. In NIPS, volume 24.

Chris Dyer, Adam Lopez, Juri Ganitkevitch, Johnathan Weese, Ferhan Ture, Phil Blunsom, Hendra Setiawan, Vladimir Eidelman, and Philip Resnik. 2010. cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models. In Proc. of ACL.

J. Gauvain and Chin-Hui Lee. 1994. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing, 2(2):291–298, April.

Paul Hopper and Sandra Thompson. 1983. The iconicity of the universal categories "noun" and "verb". In John Haiman, editor, Iconicity in Syntax: Proceedings of a Symposium on Iconicity in Syntax.

Mark Johnson. 2007. Why doesn't EM find good HMM POS-taggers? In Proc. of EMNLP.

Julian Kupiec. 1992. Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language, 6:225–242.

Wang Ling, Chris Dyer, Alan Black, and Isabel Trancoso. 2015. Two/too simple adaptations of word2vec for syntax problems. In Proc. of NAACL.

Tomas Mikolov, Stefan Kombrink, Anoop Deoras, Lukar Burget, and J. Cernocky. 2011. RNNLM – recurrent neural network language modeling toolkit. In Proc. of the 2011 ASRU Workshop, pages 196–201.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. ArXiv e-prints, January.

Joakim Nivre, Johan Hall, Sandra Kubler, Ryan McDonald, Jens Nilsson, Sebastian Riedel, and Deniz Yuret. 2007. The CoNLL 2007 shared task on dependency parsing. In Proc. of CoNLL.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proc. of LREC, May.

Uwe Quasthoff, Matthias Richter, and Christian Biemann. 2006. Corpus portal for search in monolingual corpora. In Proc. of LREC, pages 1799–1802.

Andrew Rosenberg and Julia Hirschberg. 2007. V-measure: A conditional entropy-based external cluster evaluation measure. In EMNLP-CoNLL.

T. Shinozaki and T. Kawahara. 2007. HMM training based on CV-EM and CV Gaussian mixture optimization. In Proc. of the 2007 ASRU Workshop, pages 318–322, December.

Kairit Sirts, Jacob Eisenstein, Micha Elsner, and Sharon Goldwater. 2014. POS induction with distributional and morphological information using a distance-dependent Chinese restaurant process. In Proc. of ACL.

Sebastian Spiegler, Andrew van der Spuy, and Peter A. Flach. 2010. Ukwabelana: An open-source morphological Zulu corpus. In Proc. of COLING, pages 1020–1028.

Mehmet Ali Yatbaz, Enis Rıfat Sert, and Deniz Yuret. 2014. Unsupervised instance-based part of speech induction using probable substitutes. In Proc. of COLING.