
Unsupervised POS Induction with Word Embeddings

Chu-Cheng Lin, Waleed Ammar, Chris Dyer, Lori Levin
School of Computer Science, Carnegie Mellon University
{chuchenl,wammar,cdyer,lsl}@cs.cmu.edu

Abstract

Unsupervised word embeddings have been shown to be valuable as features in supervised learning problems; however, their role in unsupervised problems has been less thoroughly explored. In this paper, we show that embeddings can likewise add value to the problem of unsupervised POS induction. In two representative models of POS induction, we replace multinomial distributions over the vocabulary with multivariate Gaussian distributions over word embeddings and observe consistent improvements in eight languages. We also analyze the effect of various choices while inducing word embeddings on "downstream" POS induction results.

1 Introduction

Unsupervised POS induction is the problem of assigning word tokens to syntactic categories given only a corpus of untagged text. In this paper we explore the effect of replacing words with their vector space embeddings[1] in two POS induction models: the classic first-order HMM (Kupiec, 1992) and the newly introduced conditional random field autoencoder (Ammar et al., 2014). In each model, instead of using a conditional multinomial distribution[2] to generate a word token w_i ∈ V given a POS tag t_i ∈ T, we use a conditional Gaussian distribution and generate a d-dimensional word embedding v_{w_i} ∈ R^d given t_i.

Our findings suggest that, in both models, substantial improvements are possible when word embeddings are used rather than opaque word types. However, the independence assumptions made by the model used to induce embeddings strongly determine its effectiveness for POS induction: embedding models that model short-range context are more effective than those that model longer-range contexts. This result is unsurprising, but it illustrates the lack of an evaluation metric that measures the syntactic (rather than semantic) information in word embeddings. Our results also confirm the conclusions of Sirts et al. (2014), who were likewise able to improve POS induction results, albeit using a custom clustering model based on the distance-dependent Chinese restaurant process (Blei and Frazier, 2011).

Our contributions are as follows: (i) reparameterization of token-level POS induction models to use word embeddings; and (ii) a systematic evaluation of word embeddings with respect to the syntactic information they contain.

[1] Unlike Yatbaz et al. (2014), we leverage easily obtainable and widely used embeddings of word types.
[2] Also known as a categorical distribution.

2 Vector Space Word Embeddings

Word embeddings represent words in a language's vocabulary as points in a d-dimensional space such that nearby words (points) are similar in terms of their distributional properties. A variety of techniques for learning embeddings have been proposed, e.g., matrix factorization (Deerwester et al., 1990; Dhillon et al., 2011) and neural language modeling (Mikolov et al., 2011; Collobert and Weston, 2008).

For the POS induction task, we specifically need embeddings that capture syntactic similarities. Therefore we experiment with two types of embeddings that are known for such properties:


• Skip-gram embeddings (Mikolov et al., 2013) are based on a log-bilinear model that predicts an unordered set of context words given a target word. Bansal et al. (2014) found that smaller context window sizes tend to result in embeddings with more syntactic information. We confirm this finding in our experiments.

• Structured skip-gram embeddings (Ling et al., 2015) extend the standard skip-gram embeddings (Mikolov et al., 2013) by taking into account the relative positions of words in a given context.

We use the word2vec tool[3] and Ling et al. (2015)'s modified version[4] to generate both plain and structured skip-gram embeddings in nine languages.

3 Models for POS Induction

In this section, we briefly review two classes of models used for POS induction (HMMs and CRF autoencoders), and explain how to generate word embedding observations in each class. We will represent a sentence of length ℓ as w = ⟨w_1, w_2, ..., w_ℓ⟩ ∈ V^ℓ and a sequence of tags as t = ⟨t_1, t_2, ..., t_ℓ⟩ ∈ T^ℓ. The embedding of word type w ∈ V will be written as v_w ∈ R^d.

3.1 Hidden Markov Models

The hidden Markov model (HMM) with multinomial emissions is a classic model for POS induction. This model makes the assumption that a latent Markov process with discrete states representing POS categories emits individual words in the vocabulary according to state- (i.e., tag-) specific emission distributions. An HMM thus defines the following joint distribution over sequences of observations and tags:

  p(w, t) = \prod_{i=1}^{\ell} p(t_i \mid t_{i-1}) \times p(w_i \mid t_i)    (1)

where p(t_i | t_{i-1}) is the transition probability and p(w_i | t_i) is the emission probability, i.e., the probability of a particular tag generating the word at position i.[5]

We consider two variants of the HMM as baselines:

• p(w_i | t_i) is parameterized as a "naive multinomial" distribution with one distinct parameter for each word type.

• p(w_i | t_i) is parameterized as a multinomial logistic regression model with hand-engineered features as detailed in (Berg-Kirkpatrick et al., 2010).

Gaussian Emissions. We now consider incorporating word embeddings in the HMM. Given a tag t ∈ T, instead of generating the observed word w ∈ V, we generate the (pre-trained) embedding v_w ∈ R^d of that word. The conditional probability density assigned to v_w | t follows a multivariate Gaussian distribution with mean µ_t and covariance matrix Σ_t:

  p(v_w; \mu_t, \Sigma_t) = \frac{\exp\left(-\frac{1}{2}(v_w - \mu_t)^\top \Sigma_t^{-1} (v_w - \mu_t)\right)}{\sqrt{(2\pi)^d \, |\Sigma_t|}}    (2)

This parameterization makes the assumption that embeddings of words which are often tagged as t are concentrated around some point µ_t ∈ R^d, and that the concentration decays according to the covariance matrix Σ_t.[6]

Now, the joint distribution over a sequence of observations v = ⟨v_{w_1}, v_{w_2}, ..., v_{w_ℓ}⟩ (which corresponds to the word sequence w = ⟨w_1, w_2, ..., w_ℓ⟩) and a tag sequence t = ⟨t_1, t_2, ..., t_ℓ⟩ becomes:

  p(v, t) = \prod_{i=1}^{\ell} p(t_i \mid t_{i-1}) \times p(v_{w_i}; \mu_{t_i}, \Sigma_{t_i})

We use the Baum–Welch algorithm to fit the µ_t and Σ_t parameters. In every iteration, we update µ_{t*} as follows:

  \mu_{t^*}^{\text{new}} = \frac{\sum_{v \in \mathcal{T}} \sum_{i=1...\ell} p(t_i = t^* \mid v) \times v_{w_i}}{\sum_{v \in \mathcal{T}} \sum_{i=1...\ell} p(t_i = t^* \mid v)}    (3)

where \mathcal{T} is a data set of word embedding sequences v, each of length |v| = ℓ, and p(t_i = t* | v) is the posterior probability of label t* at position i in the sequence v. Likewise, the update to Σ_{t*} is:

  \Sigma_{t^*}^{\text{new}} = \frac{\sum_{v \in \mathcal{T}} \sum_{i=1...\ell} p(t_i = t^* \mid v) \times \delta\delta^\top}{\sum_{v \in \mathcal{T}} \sum_{i=1...\ell} p(t_i = t^* \mid v)}    (4)

where δ = v_{w_i} − µ_{t*}^{new}.

[3] https://code.google.com/p/word2vec/
[4] https://github.com/wlin12/wang2vec/
[5] Terms for the starting and stopping transition probabilities are omitted for brevity.
[6] "Essentially, all models are wrong, but some are useful." — George E. P. Box
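To make Eqs. 2–4 concrete, here is a minimal sketch (not the authors' implementation) of the Gaussian emission log-density and the closed-form M-step updates. It assumes the E-step (forward–backward over the HMM) has already produced posterior tag probabilities per sentence; the function and variable names are illustrative.

```python
# Minimal sketch of the Gaussian-emission pieces in Eqs. 2-4, assuming posteriors
# p(t_i = t | v) are given as one [sentence_length, num_tags] array per sentence.
import numpy as np

def log_gaussian_emission(v, mu, sigma):
    """Log of Eq. 2: log N(v; mu, sigma) for one embedding v of dimension d."""
    d = v.shape[0]
    diff = v - mu
    _, logdet = np.linalg.slogdet(sigma)
    quad = diff @ np.linalg.solve(sigma, diff)
    return -0.5 * (quad + logdet + d * np.log(2.0 * np.pi))

def m_step(embedding_seqs, posteriors, num_tags):
    """Eqs. 3-4: posterior-weighted mean and covariance for every tag."""
    vecs = np.concatenate(embedding_seqs, axis=0)    # all v_{w_i}, shape [N, d]
    gammas = np.concatenate(posteriors, axis=0)      # p(t_i = t | v), shape [N, num_tags]
    mus, sigmas = [], []
    for t in range(num_tags):
        w = gammas[:, t]
        total = w.sum()
        mu = (w[:, None] * vecs).sum(axis=0) / total      # Eq. 3
        diffs = vecs - mu                                  # delta = v_{w_i} - mu_new
        sigma = (w[:, None] * diffs).T @ diffs / total     # Eq. 4
        mus.append(mu)
        sigmas.append(sigma)
    return np.stack(mus), np.stack(sigmas)
```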

3.2 Conditional Random Field Autoencoders

The second class of models this work extends is called CRF autoencoders, which we recently proposed in (Ammar et al., 2014). It is a scalable family of models for feature-rich learning from unlabeled examples. The model conditions on one copy of the structured input, and generates a reconstruction of the input via a set of interdependent latent variables which represent the linguistic structure of interest. As shown in Eq. 5, the model factorizes into two distinct parts: the encoding model p(t | w) and the reconstruction model p(ŵ | t), where w is the structured input (e.g., a token sequence), t is the linguistic structure of interest (e.g., a sequence of POS tags), and ŵ is a generic reconstruction of the input. For POS induction, the encoding model is a linear-chain CRF with feature vector λ and local feature functions f:

  p(\hat{w}, t \mid w) = p(t \mid w) \times p(\hat{w} \mid t) \propto p(\hat{w} \mid t) \times \exp \sum_{i=1}^{|w|} \lambda \cdot f(t_i, t_{i-1}, w)    (5)

In (Ammar et al., 2014), we explored two kinds of reconstructions ŵ: surface forms and Brown clusters (Brown et al., 1992), and used "stupid multinomials" as the underlying distributions for re-generating ŵ.

Gaussian Reconstruction. In this paper, we use d-dimensional word embedding reconstructions ŵ_i = v_{w_i} ∈ R^d, and replace the multinomial distribution of the reconstruction model with the multivariate Gaussian distribution in Eq. 2. We again use the Baum–Welch algorithm to estimate µ_{t*} and Σ_{t*}, similar to Eq. 3. The only difference is that the posterior label probabilities are now conditional on both the input sequence w and the embedding sequence v, i.e., we replace p(t_i = t* | v) in Eq. 3 with p(t_i = t* | w, v).

4 Experiments

In this section, we attempt to answer the following questions:

• §4.1: Do syntactically informed word embeddings improve POS induction? Which model performs best?

• §4.2: What kind of word embeddings are suitable for POS induction?

4.1 Choice of POS Induction Models

Here, we compare the following models for POS induction:

• Baseline: HMM with multinomial emissions (Kupiec, 1992),

• Baseline: HMM with log-linear emissions (Berg-Kirkpatrick et al., 2010),

• Baseline: CRF autoencoder with multinomial reconstructions (Ammar et al., 2014),[7]

• Proposed: HMM with Gaussian emissions, and

• Proposed: CRF autoencoder with Gaussian reconstructions.

Data. To train the POS induction models, we used the plain text from the training sections of the CoNLL-X shared task (Buchholz and Marsi, 2006) (for Danish and Turkish), the CoNLL 2007 shared task (Nivre et al., 2007) (for Arabic, Basque, Greek, Hungarian, and Italian), and the Ukwabelana corpus (Spiegler et al., 2010) (for Zulu). For evaluation, we obtain the corresponding gold-standard POS tags by deterministically mapping the language-specific POS tags in the aforementioned corpora to the corresponding universal POS tag set (Petrov et al., 2012). This is the same setup we used in (Ammar et al., 2014).

Setup. In this section, we used skip-gram (i.e., word2vec) embeddings with a context window of size 1 and dimensionality d = 100, trained on the largest corpora for each language in (Quasthoff et al., 2006), in addition to the plain text used to train the POS induction models.[8] In the proposed models, we only show results for estimating µ_t, assuming a fixed diagonal covariance matrix Σ_t(k, k) = 0.45 ∀k ∈ {1, ..., d}.[9] While the CRF autoencoder with multinomial reconstructions was carefully initialized as discussed in (Ammar et al., 2014), the CRF autoencoder with Gaussian reconstructions was initialized uniformly at random in [−1, 1].

[7] We use the configuration with best performance, which reconstructs Brown clusters.
[8] We used the corpus/tokenize-anything.sh script in the cdec decoder (Dyer et al., 2010) to tokenize the corpora from (Quasthoff et al., 2006). The other corpora were already tokenized. In Arabic and Italian, we found a lot of discrepancies between the tokenization used for inducing word embeddings and the tokenization used for evaluation. We expect our results to improve with consistent tokenization.
[9] Surprisingly, we found that estimating Σ_t significantly degrades the performance. This may be due to overfitting (Shinozaki and Kawahara, 2007). Possible remedies include using a prior (Gauvain and Lee, 1994).
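For concreteness, the sketch below shows the embedding configuration described in the Setup paragraph (skip-gram, window size 1, d = 100). The paper itself trains embeddings with the word2vec tool and Ling et al. (2015)'s modified version; gensim and the file names here are stand-ins, not the authors' pipeline.

```python
# Illustrative only: reproduces the Setup configuration (skip-gram, window 1,
# d = 100) with gensim 4.x. "corpus.txt" is a hypothetical pre-tokenized plain
# text file, one sentence per line.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("corpus.txt")
model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimensionality d
    window=1,         # small context window -> more syntactic embeddings
    sg=1,             # skip-gram rather than CBOW
    min_count=1,
    workers=4,
)
model.wv.save_word2vec_format("vectors.txt")  # v_w table for the POS models
```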

[Figure 1 appears here: grouped bar charts of V-measure per language (Arabic, Basque, Danish, Greek, Hungarian, Italian, Turkish, Zulu, and the average). Left panel: Multinomial HMM, Multinomial Featurized HMM, Multinomial CRF Autoencoder, Gaussian HMM, Gaussian CRF Autoencoder. Right panel: HMM and CRF Autoencoder with standard vs. structured skip-gram embeddings. See the caption below.]

Figure 1: POS induction results (V-measure, higher is better). Window size is 1 for all word embeddings. Left: models which use standard skip-gram word embeddings (i.e., Gaussian HMM and Gaussian CRF Autoencoder) outperform all baselines on average across languages. Right: comparison between standard and structured skip-grams on the Gaussian HMM and CRF Autoencoder.

All HMM models were also randomly initialized. We tuned all hyperparameters on the English PTB corpus, then fixed them for all languages.

Evaluation. We use the V-measure evaluation metric (Rosenberg and Hirschberg, 2007) to evaluate the predicted syntactic classes at the token level.[10]

Results. The results in Fig. 1 clearly suggest that we can use word embeddings to improve POS induction. Surprisingly, the feature-less Gaussian HMM outperforms the strong feature-rich baselines: the Multinomial Featurized HMM and the Multinomial CRF Autoencoder. One explanation is that our word embeddings were induced using larger unlabeled corpora than those used to train the POS induction models. The best results are obtained by combining word embeddings and feature-rich modeling in the Gaussian CRF autoencoder. This set of results suggests that word embeddings and hand-engineered features play complementary roles in POS induction. It is worth noting that the CRF autoencoder model with Gaussian reconstructions did not require careful initialization.[11]

4.2 Choice of Embeddings

Standard skip-gram vs. structured skip-gram. On Gaussian HMMs, structured skip-gram embeddings score moderately higher than standard skip-grams, and as the context window size gets larger, the gap widens (as shown in Fig. 2). The reason may be that structured skip-gram embeddings give each position within the context window its own projection matrix, so the smearing effect as the window grows is not as pronounced as with standard embeddings. However, the best performance is still obtained when the window size is small.[12]

Dimensions = 20 vs. 200. We also varied the number of dimensions in the word vectors (d ∈ {20, 50, 100, 200}). The best V-measure we obtain is 0.504 (d = 20) and the worst is 0.460 (d = 100). However, we did not observe a consistent pattern, as shown in Fig. 3.

Window size = 1 vs. 16. Finally, we varied the window size of the context surrounding target words (w ∈ {1, 2, 4, 8, 16}). w = 1 yields the best average V-measure across the eight languages, as shown in Fig. 2. This is true for both standard and structured skip-gram models.

[10] We found the V-measure results to be consistent with the many-to-one evaluation metric (Johnson, 2007). We only show one set of results for brevity.
[11] In (Ammar et al., 2014), we found that careful initialization for the CRF autoencoder model with multinomial reconstructions is necessary.
[12] In preliminary experiments, we also compared standard skip-gram embeddings to SENNA embeddings (Collobert et al., 2011), which are trained in a semi-supervised multi-task learning setup with one task being POS tagging, on a subset of the English PTB corpus. As expected, the induced POS tags are much better when using SENNA embeddings, yielding a V-measure score of 0.57 compared to 0.51 for skip-gram embeddings. Since SENNA embeddings are only available in English, we did not include them in the comparison in Fig. 1.
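As an illustration of the token-level evaluation described above, the sketch below computes V-measure with scikit-learn and a simple many-to-one accuracy (cf. footnote 10). The authors do not specify their implementation, so the library choice and the toy sequences are assumptions.

```python
# Token-level evaluation sketch: induced tags are arbitrary cluster IDs, so both
# metrics compare them to gold universal POS tags without a fixed label mapping.
from collections import Counter, defaultdict
from sklearn.metrics import v_measure_score

gold_tags = ["NOUN", "VERB", "DET", "NOUN", "VERB"]   # flattened over the corpus (toy data)
induced_tags = [3, 7, 1, 3, 7]                        # cluster IDs per token (toy data)

def many_to_one(gold, induced):
    """Map each induced cluster to its most frequent gold tag, then score."""
    by_cluster = defaultdict(Counter)
    for g, c in zip(gold, induced):
        by_cluster[c][g] += 1
    mapping = {c: cnt.most_common(1)[0][0] for c, cnt in by_cluster.items()}
    return sum(mapping[c] == g for g, c in zip(gold, induced)) / len(gold)

print("V-measure:  ", v_measure_score(gold_tags, induced_tags))
print("Many-to-one:", many_to_one(gold_tags, induced_tags))
```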

[Figures 2 and 3 appear here: average V-measure curves (vertical axis 0.30–0.45) for standard vs. structured skip-gram embeddings, plotted against window size (1, 2, 4, 8, 16) and against dimension size (20, 50, 100, 200). See the captions below.]

Figure 2: Effect of window size and embedding type on POS induction over the languages in Fig. 1. d = 100. The model is the HMM with Gaussian emissions.

Figure 3: Effect of dimension size on POS induction on a subset of the English PTB corpus. w = 1. The model is the HMM with Gaussian emissions.

Notably, larger window sizes appear to produce word embeddings with less syntactic information. This result confirms the observations of Bansal et al. (2014).

4.3 Discussion

We have shown that (re)generating word embeddings does much better than generating opaque word types in unsupervised POS induction. At a high level, this confirms prior findings that unsupervised word embeddings capture syntactic properties of words, and shows that some embeddings capture more syntactically salient information than others. As such, we contend that unsupervised POS induction can be seen as a diagnostic metric for assessing the syntactic quality of embeddings.

To get a better understanding of what the multivariate Gaussian models have learned, we conduct a hill-climbing experiment on our English dataset. We seed each POS category with the average vector of 10 randomly sampled words from that category and train the model. Seeding unsurprisingly improves tagging performance. We also find that the words nearest to the centroids generally agree with the correct category label, which validates our assumption that syntactically similar words tend to cluster in the high-dimensional embedding space. It also shows that careful initialization of model parameters can bring further improvements.

However, we also find that words close to the centroid are not necessarily representative of what linguists consider to be prototypical. For example, Hopper and Thompson (1983) show that physical, telic, past-tense verbs are more prototypical with respect to case marking, agreement, and other syntactic behavior. Yet the verbs nearest our centroid all seem rather abstract: in English, the nearest 5 words in the verb category are entails, aspires, attaches, foresees, deems. This may be because these words seldom serve functions other than verbs, so placing the centroid around them incurs less penalty (in contrast to physical verbs, e.g. bite, which often also act as nouns). Therefore one should be cautious in interpreting what is prototypical about them.

5 Conclusion

We propose using a multivariate Gaussian model to generate vector space representations of observed words in generative or hybrid models for POS induction, as a superior alternative to using multinomial distributions to generate categorical word types. We find the performance of a simple Gaussian HMM competitive with strong feature-rich baselines. We further show that substituting the emission part of the CRF autoencoder can bring further improvements. We also confirm previous findings which suggest that smaller context windows in skip-gram models result in word embeddings which encode more syntactic information. It would be interesting to see if we can apply this approach to other tasks which require generative modeling of textual observations, such as language modeling and grammar induction.

Acknowledgements

This work was sponsored by the U.S. Army Research Laboratory and the U.S. Army Research Office under contract/grant numbers W911NF-11-2-0042 and W911NF-10-1-0533. The statements made herein are solely the responsibility of the authors.
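For readers who want to reproduce the centroid analysis of Section 4.3, here is a rough sketch of the seeding and nearest-neighbor steps; the data structures are hypothetical and this is not the authors' code.

```python
# Rough sketch of the Section 4.3 analysis: each category mean is initialized to
# the average embedding of 10 sampled words from that category, and after training
# the word types closest to each learned mean mu_t are inspected. `embeddings`
# maps word -> vector; `seed_words_by_tag` is a hypothetical seed dictionary.
import numpy as np

def seed_means(seed_words_by_tag, embeddings):
    """Initialize mu_t as the average embedding of the sampled seed words."""
    return {tag: np.mean([embeddings[w] for w in words], axis=0)
            for tag, words in seed_words_by_tag.items()}

def nearest_words(mu, embeddings, k=5):
    """Return the k word types whose embeddings lie closest to centroid mu."""
    words = list(embeddings)
    dists = np.array([np.linalg.norm(embeddings[w] - mu) for w in words])
    return [words[i] for i in np.argsort(dists)[:k]]
```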

References

Waleed Ammar, Chris Dyer, and Noah A. Smith. 2014. Conditional random field autoencoders for unsupervised structured prediction. In NIPS.

Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring continuous word representations for dependency parsing. In Proc. of ACL.

Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. 2010. Painless unsupervised learning with features. In Proc. of NAACL.

David M. Blei and Peter I. Frazier. 2011. Distance dependent Chinese restaurant processes. JMLR.

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In CoNLL-X.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proc. of ICML.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. JMLR, 12:2493–2537, November.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

Paramveer S. Dhillon, Dean Foster, and Lyle Ungar. 2011. Multi-view learning of word embeddings via CCA. In NIPS, volume 24.

Chris Dyer, Adam Lopez, Juri Ganitkevitch, Jonathan Weese, Ferhan Ture, Phil Blunsom, Hendra Setiawan, Vladimir Eidelman, and Philip Resnik. 2010. cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models. In Proc. of ACL.

J. Gauvain and Chin-Hui Lee. 1994. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing, 2(2):291–298, April.

Paul Hopper and Sandra Thompson. 1983. The iconicity of the universal categories "noun" and "verb". In John Haiman, editor, Iconicity in Syntax: Proceedings of a Symposium on Iconicity in Syntax.

Mark Johnson. 2007. Why doesn't EM find good HMM POS-taggers? In Proc. of EMNLP.

Julian Kupiec. 1992. Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language, 6:225–242.

Wang Ling, Chris Dyer, Alan Black, and Isabel Trancoso. 2015. Two/too simple adaptations of word2vec for syntax problems. In Proc. of NAACL.

Tomas Mikolov, Stefan Kombrink, Anoop Deoras, Lukar Burget, and J. Cernocky. 2011. RNNLM — recurrent neural network language modeling toolkit. In Proc. of the 2011 ASRU Workshop, pages 196–201.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. ArXiv e-prints, January.

Joakim Nivre, Johan Hall, Sandra Kübler, Ryan McDonald, Jens Nilsson, Sebastian Riedel, and Deniz Yuret. 2007. The CoNLL 2007 shared task on dependency parsing. In Proc. of CoNLL.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proc. of LREC, May.

Uwe Quasthoff, Matthias Richter, and Christian Biemann. 2006. Corpus portal for search in monolingual corpora. In Proc. of LREC, pages 1799–1802.

Andrew Rosenberg and Julia Hirschberg. 2007. V-measure: A conditional entropy-based external cluster evaluation measure. In EMNLP-CoNLL.

T. Shinozaki and T. Kawahara. 2007. HMM training based on CV-EM and CV Gaussian mixture optimization. In Proc. of the 2007 ASRU Workshop, pages 318–322, December.

Kairit Sirts, Jacob Eisenstein, Micha Elsner, and Sharon Goldwater. 2014. POS induction with distributional and morphological information using a distance-dependent Chinese restaurant process. In Proc. of ACL.

Sebastian Spiegler, Andrew van der Spuy, and Peter A. Flach. 2010. Ukwabelana: An open-source morphological Zulu corpus. In Proc. of COLING, pages 1020–1028.

Mehmet Ali Yatbaz, Enis Rıfat Sert, and Deniz Yuret. 2014. Unsupervised instance-based part of speech induction using probable substitutes. In Proc. of COLING.
