Unsupervised POS Induction with Word Embeddings

Chu-Cheng Lin   Waleed Ammar   Chris Dyer   Lori Levin
School of Computer Science, Carnegie Mellon University
{chuchenl,wammar,cdyer,lsl}@cs.cmu.edu

Abstract

Unsupervised word embeddings have been shown to be valuable as features in supervised learning problems; however, their role in unsupervised problems has been less thoroughly explored. In this paper, we show that embeddings can likewise add value to the problem of unsupervised POS induction. In two representative models of POS induction, we replace multinomial distributions over the vocabulary with multivariate Gaussian distributions over word embeddings and observe consistent improvements in eight languages. We also analyze the effect of various choices while inducing word embeddings on "downstream" POS induction results.

1 Introduction

Unsupervised POS induction is the problem of assigning word tokens to syntactic categories given only a corpus of untagged text. In this paper we explore the effect of replacing words with their vector space embeddings[1] in two POS induction models: the classic first-order HMM (Kupiec, 1992) and the newly introduced conditional random field autoencoder (Ammar et al., 2014). In each model, instead of using a conditional multinomial distribution[2] to generate a word token w_i ∈ V given a POS tag t_i ∈ T, we use a conditional Gaussian distribution and generate a d-dimensional word embedding v_{w_i} ∈ R^d given t_i.

Our findings suggest that, in both models, substantial improvements are possible when word embeddings are used rather than opaque word types. However, the independence assumptions made by the model used to induce embeddings strongly determine its effectiveness for POS induction: embedding models that model short-range context are more effective than those that model longer-range contexts. This result is unsurprising, but it illustrates the lack of an evaluation metric that measures the syntactic (rather than semantic) information in word embeddings. Our results also confirm the conclusions of Sirts et al. (2014), who were likewise able to improve POS induction results, albeit using a custom clustering model based on the distance-dependent Chinese restaurant process (Blei and Frazier, 2011).

Our contributions are as follows: (i) a reparameterization of token-level POS induction models to use word embeddings; and (ii) a systematic evaluation of word embeddings with respect to the syntactic information they contain.

[1] Unlike Yatbaz et al. (2014), we leverage easily obtainable and widely used embeddings of word types.
[2] Also known as a categorical distribution.

2 Vector Space Word Embeddings

Word embeddings represent words in a language's vocabulary as points in a d-dimensional space such that nearby words (points) are similar in terms of their distributional properties. A variety of techniques for learning embeddings have been proposed, e.g., matrix factorization (Deerwester et al., 1990; Dhillon et al., 2011) and neural language modeling (Mikolov et al., 2011; Collobert and Weston, 2008).

For the POS induction task, we specifically need embeddings that capture syntactic similarities. Therefore we experiment with two types of embeddings that are known for such properties:

• Skip-gram embeddings (Mikolov et al., 2013) are based on a log bilinear model that predicts an unordered set of context words given a target word. Bansal et al. (2014) found that smaller context window sizes tend to result in embeddings with more syntactic information. We confirm this finding in our experiments.

• Structured skip-gram embeddings (Ling et al., 2015) extend the standard skip-gram embeddings (Mikolov et al., 2013) by taking into account the relative positions of words in a given context.

We use the word2vec tool[3] and Ling et al. (2015)'s modified version[4] to generate both plain and structured skip-gram embeddings in nine languages.

[3] https://code.google.com/p/word2vec/
[4] https://github.com/wlin12/wang2vec
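For concreteness, the following sketch shows how skip-gram embeddings with the small context window favored here could be trained. It uses gensim (version 4 or later assumed) as a stand-in; the experiments in this paper use the word2vec tool and Ling et al. (2015)'s modified version, and the toy corpus below is purely illustrative.

```python
# Sketch: window-1 skip-gram embeddings, assuming gensim >= 4.0 is installed.
# The toy corpus stands in for the monolingual corpora described in Section 4.
from gensim.models import Word2Vec

corpus = [["the", "dog", "barks"], ["a", "cat", "sleeps"]]  # tokenized sentences
model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # d = 100, as in the main setup of this paper
    window=1,         # small context window: more syntactic embeddings
    sg=1,             # skip-gram rather than CBOW
    min_count=1,
    workers=4,
)
v_dog = model.wv["dog"]  # the vector v_w later used as an HMM/CRF observation
```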
3 Models for POS Induction

In this section, we briefly review two classes of models used for POS induction (HMMs and CRF autoencoders), and explain how to generate word embedding observations in each class. We will represent a sentence of length ℓ as w = ⟨w_1, w_2, ..., w_ℓ⟩ ∈ V^ℓ and a sequence of tags as t = ⟨t_1, t_2, ..., t_ℓ⟩ ∈ T^ℓ. The embedding of word type w ∈ V will be written as v_w ∈ R^d.

3.1 Hidden Markov Models

The hidden Markov model with multinomial emissions is a classic model for POS induction. This model makes the assumption that a latent Markov process with discrete states representing POS categories emits individual words in the vocabulary according to state-specific (i.e., tag-specific) emission distributions. An HMM thus defines the following joint distribution over sequences of observations and tags:

p(w, t) = \prod_{i=1}^{\ell} p(t_i \mid t_{i-1}) \times p(w_i \mid t_i)    (1)

where p(t_i | t_{i-1}) is the transition probability and p(w_i | t_i) is the emission probability, i.e., the probability of a particular tag generating the word at position i.[5]

We consider two variants of the HMM as baselines:

• p(w_i | t_i) is parameterized as a "naïve multinomial" distribution with one distinct parameter for each word type.

• p(w_i | t_i) is parameterized as a multinomial logistic regression model with hand-engineered features, as detailed in (Berg-Kirkpatrick et al., 2010).

Gaussian Emissions. We now consider incorporating word embeddings in the HMM. Given a tag t ∈ T, instead of generating the observed word w ∈ V, we generate the (pre-trained) embedding v_w ∈ R^d of that word. The conditional probability density assigned to v_w | t follows a multivariate Gaussian distribution with mean μ_t and covariance matrix Σ_t:

p(v_w; \mu_t, \Sigma_t) = \frac{\exp\left(-\frac{1}{2}(v_w - \mu_t)^\top \Sigma_t^{-1} (v_w - \mu_t)\right)}{\sqrt{(2\pi)^d \, |\Sigma_t|}}    (2)

This parameterization makes the assumption that embeddings of words which are often tagged as t are concentrated around some point μ_t ∈ R^d, and that the concentration decays according to the covariance matrix Σ_t.[6]

Now, the joint distribution over a sequence of observations v = ⟨v_{w_1}, v_{w_2}, ..., v_{w_ℓ}⟩ (which corresponds to the word sequence w = ⟨w_1, w_2, ..., w_ℓ⟩) and a tag sequence t = ⟨t_1, t_2, ..., t_ℓ⟩ becomes:

p(v, t) = \prod_{i=1}^{\ell} p(t_i \mid t_{i-1}) \times p(v_{w_i}; \mu_{t_i}, \Sigma_{t_i})

We use the Baum–Welch algorithm to fit the μ_t and Σ_t parameters. In every iteration, we update μ_{t*} as follows:

\mu_{t^*}^{\text{new}} = \frac{\sum_{v \in \mathcal{T}} \sum_{i=1}^{\ell} p(t_i = t^* \mid v) \, v_{w_i}}{\sum_{v \in \mathcal{T}} \sum_{i=1}^{\ell} p(t_i = t^* \mid v)}    (3)

where \mathcal{T} is a data set of word embedding sequences v, each of length |v| = ℓ (we write \mathcal{T} to distinguish the data set from the tag set T), and p(t_i = t* | v) is the posterior probability of label t* at position i in the sequence v. Likewise, the update to Σ_{t*} is:

\Sigma_{t^*}^{\text{new}} = \frac{\sum_{v \in \mathcal{T}} \sum_{i=1}^{\ell} p(t_i = t^* \mid v) \, \delta \delta^\top}{\sum_{v \in \mathcal{T}} \sum_{i=1}^{\ell} p(t_i = t^* \mid v)}    (4)

where \delta = v_{w_i} - \mu_{t^*}^{\text{new}}.

[5] Terms for the starting and stopping transition probabilities are omitted for brevity.
[6] "Essentially, all models are wrong, but some are useful" – George E. P. Box
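A minimal sketch of the M-step updates in Eqs. 3 and 4, assuming the token-level posteriors p(t_i = t* | v) from the Baum–Welch E-step have been stacked into a single matrix over all corpus positions; the array layout and function name are ours, not the authors' implementation.

```python
import numpy as np

def m_step_gaussian(posteriors, embeddings):
    """Re-estimate Gaussian emission parameters (Eqs. 3 and 4).

    posteriors: (N, K) array; entry [n, k] is p(t_n = k | v) for token
                position n, computed by the forward-backward E-step.
    embeddings: (N, d) array; row n is the pre-trained embedding v_{w_n}.
    Returns per-tag means (K, d) and covariances (K, d, d).
    A sketch of the published updates, not the authors' implementation.
    """
    K = posteriors.shape[1]
    d = embeddings.shape[1]
    gamma = posteriors.sum(axis=0)                     # per-tag denominators
    mu = (posteriors.T @ embeddings) / gamma[:, None]  # Eq. 3: weighted means
    sigma = np.zeros((K, d, d))
    for k in range(K):
        delta = embeddings - mu[k]                     # delta = v_{w_n} - mu_k^new
        sigma[k] = (posteriors[:, k, None] * delta).T @ delta / gamma[k]  # Eq. 4
    return mu, sigma
```

In practice these updates alternate with the forward–backward E-step until the likelihood converges.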
3.2 Conditional Random Field Autoencoders

The second class of models this work extends is called CRF autoencoders, which we recently proposed in (Ammar et al., 2014). It is a scalable family of models for feature-rich learning from unlabeled examples. The model conditions on one copy of the structured input and generates a reconstruction of the input via a set of interdependent latent variables which represent the linguistic structure of interest. As shown in Eq. 5, the model factorizes into two distinct parts: the encoding model p(t | w) and the reconstruction model p(ŵ | t), where w is the structured input (e.g., a token sequence), t is the linguistic structure of interest (e.g., a sequence of POS tags), and ŵ is a generic reconstruction of the input. For POS induction, the encoding model is a linear-chain CRF with feature vector λ and local feature functions f:

p(\hat{w}, t \mid w) = p(t \mid w) \times p(\hat{w} \mid t) \propto p(\hat{w} \mid t) \times \exp \sum_{i=1}^{|w|} \lambda \cdot f(t_i, t_{i-1}, w)    (5)

In (Ammar et al., 2014), we explored two kinds of reconstructions ŵ: surface forms and Brown clusters (Brown et al., 1992), and used "stupid multinomials" as the underlying distributions for re-generating ŵ.

Gaussian Reconstruction. In this paper, we use d-dimensional word embedding reconstructions ŵ_i = v_{w_i} ∈ R^d and replace the multinomial distribution of the reconstruction model with the multivariate Gaussian distribution in Eq. 2. We again use the Baum–Welch algorithm to estimate μ_{t*} and Σ_{t*}, similarly to Eq. 3. The only difference is that the posterior label probabilities are now conditioned on both the input sequence w and the embedding sequence v, i.e., we replace p(t_i = t* | v) in Eq. 3 with p(t_i = t* | w, v).
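To make the factorization in Eq. 5 concrete, here is a hedged sketch of the unnormalized log-score that the Gaussian-reconstruction variant assigns to a tag sequence: a CRF encoding term plus a Gaussian reconstruction term at every position. The feature function feats, the parameter containers lam, mu, and sigma, and the overall interface are illustrative placeholders, not the authors' code.

```python
from scipy.stats import multivariate_normal

def crf_autoencoder_logscore(tags, words, embeddings, feats, lam, mu, sigma):
    """Unnormalized log p(w_hat, t | w) under Eq. 5 with Gaussian reconstructions:
    sum_i [ lam . f(t_i, t_{i-1}, w) + log N(v_{w_i}; mu_{t_i}, Sigma_{t_i}) ].

    tags:       list of tag ids, one per token (t_0 is treated as None).
    embeddings: (len(tags), d) array; row i is the reconstruction target v_{w_i}.
    feats:      hypothetical feature function feats(t, t_prev, words, i) -> dict.
    lam:        dict of CRF feature weights; mu, sigma: per-tag Gaussian parameters.
    """
    total, prev = 0.0, None
    for i, t in enumerate(tags):
        for name, value in feats(t, prev, words, i).items():
            total += lam.get(name, 0.0) * value                 # encoding (CRF) term
        total += multivariate_normal.logpdf(embeddings[i],      # reconstruction term
                                            mean=mu[t], cov=sigma[t])
        prev = t
    return total
```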

4 Experiments

In this section, we attempt to answer the following questions:

• §4.1: Do syntactically-informed word embeddings improve POS induction? Which model performs best?

• §4.2: What kind of word embeddings are suitable for POS induction?

4.1 Choice of POS Induction Models

Here, we compare the following models for POS induction:

• Baseline: HMM with multinomial emissions (Kupiec, 1992),

• Baseline: HMM with log-linear emissions (Berg-Kirkpatrick et al., 2010),

• Baseline: CRF autoencoder with multinomial reconstructions (Ammar et al., 2014),[7]

• Proposed: HMM with Gaussian emissions, and

• Proposed: CRF autoencoder with Gaussian reconstructions.

Data. To train the POS induction models, we used the plain text from the training sections of the CoNLL-X shared task (Buchholz and Marsi, 2006) (for Danish and Turkish), the CoNLL 2007 shared task (Nivre et al., 2007) (for Arabic, Basque, Greek, Hungarian and Italian), and the Ukwabelana corpus (Spiegler et al., 2010) (for Zulu). For evaluation, we obtain the corresponding gold-standard POS tags by deterministically mapping the language-specific POS tags in the aforementioned corpora to the universal POS tag set (Petrov et al., 2012). This is the same setup we used in (Ammar et al., 2014).

Setup. In this section, we used skip-gram (i.e., word2vec) embeddings with a context window size of 1 and dimensionality d = 100, trained with the largest corpora for each language in (Quasthoff et al., 2006), in addition to the plain text used to train the POS induction models.[8] In the proposed models, we only show results for estimating μ_t, assuming a fixed diagonal covariance matrix Σ_t(k, k) = 0.45 for all k ∈ {1, ..., d}.[9] While the CRF autoencoder with multinomial reconstructions was carefully initialized as discussed in (Ammar et al., 2014), the CRF autoencoder with Gaussian reconstructions was initialized uniformly at random in [−1, 1]. All HMM models were also randomly initialized. We tuned all hyperparameters on the English PTB corpus, then fixed them for all languages.

[7] We use the configuration with the best performance, which reconstructs Brown clusters.
[8] We used the corpus/tokenize-anything.sh script in the cdec decoder (Dyer et al., 2010) to tokenize the corpora from (Quasthoff et al., 2006). The other corpora were already tokenized. In Arabic and Italian, we found many discrepancies between the tokenization used for inducing word embeddings and the tokenization used for evaluation. We expect our results to improve with consistent tokenization.
[9] Surprisingly, we found that estimating Σ_t significantly degrades the performance. This may be due to overfitting (Shinozaki and Kawahara, 2007). Possible remedies include using a prior (Gauvain and Lee, 1994).
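Under this setup only μ_t is estimated and the covariance is fixed to 0.45·I, so the emission density of Eq. 2 reduces to an isotropic Gaussian whose log-density is cheap to evaluate. A small sketch of that computation (our own helper, not the authors' implementation):

```python
import numpy as np

def isotropic_log_emission(v, mu, var=0.45):
    """log p(v; mu, var * I) per Eq. 2 with the fixed diagonal covariance
    Sigma_t(k, k) = 0.45 used in this setup. v, mu: d-dimensional vectors."""
    diff = np.asarray(v, dtype=float) - np.asarray(mu, dtype=float)
    d = diff.shape[0]
    return -0.5 * diff.dot(diff) / var - 0.5 * d * np.log(2 * np.pi * var)
```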

Figure 1: POS induction results (V-measure; higher is better). Window size is 1 for all word embeddings. Left: models which use standard skip-gram word embeddings (i.e., Gaussian HMM and Gaussian CRF autoencoder) outperform all baselines on average across languages. Right: comparison between standard and structured skip-grams for the Gaussian HMM and the Gaussian CRF autoencoder.

Evaluation. We use the V-measure evaluation metric (Rosenberg and Hirschberg, 2007) to evaluate the predicted syntactic classes at the token level.[10]

Results. The results in Fig. 1 (left) clearly suggest that we can use word embeddings to improve POS induction. Surprisingly, the feature-less Gaussian HMM model outperforms the strong feature-rich baselines: the Multinomial Featurized HMM and the Multinomial CRF Autoencoder. One explanation is that our word embeddings were induced using larger unlabeled corpora than those used to train the POS induction models. The best results are obtained by combining word embeddings with feature-rich modeling in the Gaussian CRF autoencoder. This set of results suggests that word embeddings and hand-engineered features play complementary roles in POS induction. It is worth noting that the CRF autoencoder model with Gaussian reconstructions did not require careful initialization.[11]

[10] We found the V-measure results to be consistent with the many-to-one evaluation metric (Johnson, 2007). We only show one set of results for brevity.
[11] In (Ammar et al., 2014), we found that careful initialization for the CRF autoencoder model with multinomial reconstructions is necessary.

4.2 Choice of Embeddings

Standard skip-gram vs. structured skip-gram. On Gaussian HMMs, structured skip-gram embeddings score moderately higher than standard skip-grams, and as the context window size gets larger, the gap widens. The reason may be that structured skip-gram embeddings give each position within the context window its own projection matrix, so the smearing effect is less pronounced as the window grows than with the standard embeddings. However, the best performance is still obtained when the window is small.[12]

Dimensions = 20 vs. 200. We also varied the number of dimensions in the word vectors (d ∈ {20, 50, 100, 200}). The best V-measure we obtain is 0.504 (d = 20) and the worst is 0.460 (d = 100). However, we did not observe a consistent pattern, as shown in Fig. 3.

Window size = 1 vs. 16. Finally, we varied the size of the context window surrounding target words (w ∈ {1, 2, 4, 8, 16}). w = 1 yields the best average V-measure across the eight languages, as shown in Fig. 2. This is true for both standard and structured skip-gram models. Notably, larger window sizes appear to produce word embeddings with less syntactic information. This result confirms the observations of Bansal et al. (2014).

[12] In preliminary experiments, we also compared standard skip-gram embeddings to SENNA embeddings (Collobert et al., 2011), which are trained in a semi-supervised multi-task learning setup with one task being POS tagging, on a subset of the English PTB corpus. As expected, the induced POS tags are much better when using SENNA embeddings, yielding a V-measure score of 0.57 compared to 0.51 for skip-gram embeddings. Since SENNA embeddings are only available in English, we did not include them in the comparison in Fig. 1.
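The V-measure scores reported above are an external clustering metric computed over token-level tag assignments. As an illustration only (the paper does not say which implementation was used), such a score can be computed with scikit-learn:

```python
from sklearn.metrics import v_measure_score

# Toy example: gold universal tags and induced cluster ids, one entry per token.
gold = ["NOUN", "VERB", "DET", "NOUN", "VERB"]
induced = [3, 7, 1, 3, 7]
print(v_measure_score(gold, induced))  # 1.0: the toy clustering matches the gold classes
```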

Figure 2: Effect of window size and embedding type on POS induction over the languages in Fig. 1, with d = 100. The model is the HMM with Gaussian emissions.

Figure 3: Effect of dimension size on POS induction on a subset of the English PTB corpus. Window size is set to 1 for all configurations. The model is the HMM with Gaussian emissions.

4.3 Discussion

We have shown that (re)generating word embeddings does much better than generating opaque word types in unsupervised POS induction. At a high level, this confirms prior findings that unsupervised word embeddings capture syntactic properties of words, and shows that some embeddings capture more syntactically salient information than others. As such, we contend that unsupervised POS induction can be seen as a diagnostic metric for assessing the syntactic quality of embeddings.

To get a better understanding of what the multivariate Gaussian models have learned, we conduct a hill-climbing experiment on our English dataset. We seed each POS category with the average vector of 10 randomly sampled words from that category and train the model. Seeding unsurprisingly improves tagging performance. We also find that the words nearest to the centroids generally agree with the correct category label, which validates our assumption that syntactically similar words tend to cluster in the high-dimensional embedding space. It also shows that careful initialization of model parameters can bring further improvements.

However, we also find that words close to a centroid are not necessarily representative of what linguists consider to be prototypical. For example, Hopper and Thompson (1983) show that physical, telic, past-tense verbs are more prototypical with respect to case marking, agreement, and other syntactic behavior. However, the verbs nearest our centroid all seem rather abstract. In English, the five words nearest the verb centroid are entails, aspires, attaches, foresees, and deems. This may be because these words seldom serve functions other than verbs, so placing the centroid near them incurs less penalty (in contrast to physical verbs, e.g., bite, which often also act as nouns). Therefore one should be cautious in interpreting what is prototypical about them.

5 Conclusion

We propose using a multivariate Gaussian model to generate vector space representations of observed words in generative or hybrid models for POS induction, as a superior alternative to using multinomial distributions to generate categorical word types. We find the performance of a simple Gaussian HMM competitive with strong feature-rich baselines. We further show that substituting the emission part of the CRF autoencoder can bring further improvements. We also confirm previous findings which suggest that smaller context windows in skip-gram models result in word embeddings which encode more syntactic information. It would be interesting to see if we can apply this approach to other tasks which require generative modeling of textual observations, such as language modeling and grammar induction.

Acknowledgements

This work was sponsored by the U.S. Army Research Laboratory and the U.S. Army Research Office under contract/grant numbers W911NF-11-2-0042 and W911NF-10-1-0533. The statements made herein are solely the responsibility of the authors.
References

Waleed Ammar, Chris Dyer, and Noah A. Smith. 2014. Conditional random field autoencoders for unsupervised structured prediction. In NIPS.

Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring continuous word representations for dependency parsing. In Proc. of ACL.

Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. 2010. Painless unsupervised learning with features. In Proc. of NAACL.

David M. Blei and Peter I. Frazier. 2011. Distance dependent Chinese restaurant processes. JMLR.

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In CoNLL-X.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proc. of ICML.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. JMLR, 12:2493–2537, November.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

Paramveer S. Dhillon, Dean Foster, and Lyle Ungar. 2011. Multi-view learning of word embeddings via CCA. In NIPS, volume 24.

Chris Dyer, Adam Lopez, Juri Ganitkevitch, Johnathan Weese, Ferhan Ture, Phil Blunsom, Hendra Setiawan, Vladimir Eidelman, and Philip Resnik. 2010. cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models. In Proc. of ACL.

J. Gauvain and Chin-Hui Lee. 1994. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing, 2(2):291–298, April.

Paul Hopper and Sandra Thompson. 1983. The iconicity of the universal categories "noun" and "verb". In John Haiman, editor, Iconicity in Syntax: Proceedings of a Symposium on Iconicity in Syntax.

Mark Johnson. 2007. Why doesn't EM find good HMM POS-taggers? In Proc. of EMNLP.

Julian Kupiec. 1992. Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language, 6:225–242.

Wang Ling, Chris Dyer, Alan Black, and Isabel Trancoso. 2015. Two/too simple adaptations of word2vec for syntax problems. In Proc. of NAACL.

Tomas Mikolov, Stefan Kombrink, Anoop Deoras, Lukar Burget, and J. Cernocky. 2011. RNNLM – recurrent neural network language modeling toolkit. In Proc. of the 2011 ASRU Workshop, pages 196–201.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. ArXiv e-prints, January.

Joakim Nivre, Johan Hall, Sandra Kubler, Ryan McDonald, Jens Nilsson, Sebastian Riedel, and Deniz Yuret. 2007. The CoNLL 2007 shared task on dependency parsing. In Proc. of CoNLL.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proc. of LREC, May.

Uwe Quasthoff, Matthias Richter, and Christian Biemann. 2006. Corpus portal for search in monolingual corpora. In Proc. of LREC, pages 1799–1802.

Andrew Rosenberg and Julia Hirschberg. 2007. V-measure: A conditional entropy-based external cluster evaluation measure. In EMNLP-CoNLL.

T. Shinozaki and T. Kawahara. 2007. HMM training based on CV-EM and CV Gaussian mixture optimization. In Proc. of the 2007 ASRU Workshop, pages 318–322, December.

Kairit Sirts, Jacob Eisenstein, Micha Elsner, and Sharon Goldwater. 2014. POS induction with distributional and morphological information using a distance-dependent Chinese restaurant process. In Proc. of ACL.

Sebastian Spiegler, Andrew van der Spuy, and Peter A. Flach. 2010. Ukwabelana: An open-source morphological Zulu corpus. In Proc. of COLING, pages 1020–1028.

Mehmet Ali Yatbaz, Enis Rıfat Sert, and Deniz Yuret. 2014. Unsupervised instance-based part of speech induction using probable substitutes. In Proc. of COLING.