Redefining part-of-speech classes with distributional semantic models

Andrey Kutuzov, Erik Velldal, Lilja Øvrelid
Department of Informatics, University of Oslo

Abstract

This paper studies how word embeddings trained on the British National Corpus interact with part of speech boundaries. Our work targets the Universal PoS tag set, which is currently actively being used for annotation of a range of languages. We experiment with training classifiers for predicting PoS tags for words based on their embeddings. The results show that the information about PoS affiliation contained in the distributional vectors allows us to discover groups of words with distributional patterns that differ from other words of the same part of speech. This data often reveals hidden inconsistencies of the annotation process or guidelines. At the same time, it supports the notion of 'soft' or 'graded' part of speech affiliations. Finally, we show that information about PoS is distributed among dozens of vector components, not limited to only one or two features.

1 Introduction

Parts of speech (PoS) are useful abstractions, but still abstractions. Boundaries between them in natural languages are flexible. Sometimes, large open classes of words are situated on the verge between several parts of speech: for example, participles in English are in many respects both verbs and adjectives. In other cases, closed word classes 'intersect', e.g., it is often difficult to tell a preposition from a subordinating conjunction. As Houston (1985) puts it, 'Grammatical categories exist along a continuum which does not exhibit sharp boundaries between the categories'.

When annotating natural language texts for parts of speech, the choice of a PoS tag in many ways depends on the human annotators themselves, but also on the quality of the linguistic conventions behind the division into different word classes. That is why there have been several attempts to refine the definitions of parts of speech and to make them more empirically grounded, based on corpora of real texts: see, among others, the seminal work of Biber et al. (1999). The aim of such attempts is to identify clusters of words occurring naturally and corresponding to what we usually call 'parts of speech'. One of the main distance metrics that can be used in detecting such clusters is the distance between distributional features of words (their contexts in a reference training corpus).

In this paper, we test this approach using predictive models developed in the field of distributional semantics. Recent achievements in training distributional models of language using machine learning allow for robust representations of natural language semantics created in a completely unsupervised way, using only large corpora of raw text. Relations between dense word vectors (embeddings) in the resulting vector space are as a rule used for semantic purposes. But can they be employed to discover something new about grammar, particularly parts of speech? Do learned embeddings help here? Below we show that such models do contain a lot of interesting data related to PoS classes.

The rest of the paper is organized as follows. In Section 2 we briefly cover the previous work on parts of speech and distributional models. Section 3 describes data processing and the training of a PoS predictor based on word embeddings. In Section 4 errors of this predictor are analyzed and the insights gained from them described. Section 5 introduces an attempt to build a full-fledged PoS tagger within the same approach. It also analyzes the correspondence between

particular word embedding components and PoS affiliation, before we conclude in Section 6.

2 Related work

Traditionally, 3 types of criteria are used to distinguish different parts of speech: formal (or morphological), syntactic (or distributional) and semantic (Aarts and McMahon, 2008). Arguably, syntactic and semantic criteria are not very different from each other, if one follows the famous distributional hypothesis stating that meaning is determined by context (Firth, 1957). Below we show that unsupervised distributional semantic models contain data related to parts of speech.

For several years already it has been known that some information about morphological word classes is indeed stored in distributional models. Words belonging to different parts of speech possess different contexts: in English, articles are typically followed by nouns, verbs are typically accompanied by adverbs, and so on. It means that during the training stage, words of one PoS should theoretically cluster together, or at least their embeddings should retain some similarity allowing for their separation from words belonging to other parts of speech. Recently, among others, Tsuboi (2014) and Plank et al. (2016) have demonstrated how word embeddings can improve supervised PoS-tagging.

Mikolov et al. (2013b) showed that there also exist regular relations between words from different classes: the vector of 'Brazil' is related to 'Brazilian' in the same way as 'England' is related to 'English', and so on. Later, Liu et al. (2016) demonstrated how words of the same part of speech cluster into distinct groups in a distributional model, and Tsvetkov et al. (2015) proved that dimensions of distributional models are correlated with different linguistic features, releasing an evaluation dataset based on this.

Various types of distributional information have also played an important role in previous work done on the related problem of unsupervised PoS acquisition. As discussed in Christodoulopoulos et al. (2010), we can separate at least three main directions within this line of work: disambiguation approaches (Merialdo, 1994; Toutanova and Johnson, 2007; Ravi and Knight, 2009) that start out from a dictionary providing possible tags for different words; prototype-driven approaches (Haghighi and Klein, 2006; Christodoulopoulos et al., 2010) based on a small number of prototypical examples for each PoS; and induction approaches that are completely unsupervised and make no use of prior knowledge. The latter is also the main focus of the comparative survey provided by Christodoulopoulos et al. (2010).

Work on PoS induction has a long history, including the use of distributional methods, going back at least to Schütze (1995), and recent work has demonstrated that word embeddings can be useful for this task as well (Yatbaz et al., 2012; Lin et al., 2015; Ling et al., 2015a).

In terms of positioning this study relative to previous work, it falls somewhere in between the distinctions made above. It is perhaps closest to disambiguation approaches, but it is not unsupervised, given that we make use of existing tag annotations when training our embeddings and predictors. The goal is also different; rather than performing PoS acquisition or tagging for its own sake, the main focus here is on analyzing the boundaries of different PoS classes. In Section 5, this analysis is complemented by experiments with using word embeddings for PoS prediction on unlabeled data, and here our approach can perhaps be seen as related to previous so-called prototype-driven approaches, but in these experiments we also make use of labeled data when defining our prototypes.

It seems clear that one can infer data about PoS classes of words from distributional models in general, including embedding models. As a next step then, these models could also prove useful for deeper analysis of part of speech boundaries, leading to the discovery of separate words or whole classes that tend to behave in non-typical ways. Discovering such cases is one possible way to improve the performance of existing automatic PoS taggers (Manning, 2011). These 'outliers' may signal the necessity to revise the annotation strategy or classification system in general. Section 3 describes the process of constructing typical PoS clusters and detecting words that belong to a cluster different from their traditional annotation.

3 PoS clusters in distributional models

Our hypothesis is that for the majority of words their parts of speech can be inferred from their embeddings in a distributional model. This inference can be considered a classification problem: we are to train an algorithm that takes a word vector as input and outputs its part of speech.

If the word embeddings do contain PoS-related data, the properly trained classifier will correctly predict PoS tags for the majority of words: it means that these lexical entities conform to a dominant distributional pattern of their part of speech class. At the same time, the words for which the classifier outputs incorrect predictions are expected to be 'outliers', with distributional patterns different from other words in the same class. These cases are the points of linguistic interest, and in the rest of the paper we mostly concentrate on them.

To test the initial hypothesis, we used the XML Edition of the British National Corpus (BNC), a balanced and representative corpus of English of about 98 million word tokens in size. As stated in the corpus documentation, 'it was [PoS-]tagged automatically, using the CLAWS4 automatic tagger developed by Roger Garside at Lancaster, and a second program, known as Template Tagger, developed by Mike Pacey and Steve Fligelstone' (Burnard, 2007). The corpus authors report a precision of 0.96 and a recall of 0.99 for their tools, based on a manually checked sample. For this research, it is important that the BNC is an established and well-studied corpus of English with PoS tags and lemmas assigned to all words.

We produced a version of the BNC where all words were replaced with their lemmas and the PoS tags were converted into the Universal Part-of-Speech Tagset (Petrov et al., 2012)[1]. Thus, each token was represented as a concatenation of its lemma and PoS tag (for example, 'love_VERB' and 'love_NOUN' yield different word types). The mappings between BNC tags and Universal tags were created by us and released online[2].

The main motivation for the use of the Universal PoS tag set was that this is a newly emerging standard which is actively being used for annotation of a range of different languages through the community-driven Universal Dependencies effort (Nivre et al., 2016). Additionally, this tag set is coarser than the original BNC one: it simplifies the workflow and eliminates the necessity to merge 'inflectional' tags into one (e.g., singular and plural nouns into one 'NOUN' class). This conforms with our interest in parts of speech proper, not inflectional forms within one PoS. We worked with the following 16 Universal tags: ADJ, ADP, ADV, AUX, CONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, SCONJ, SYM, VERB, X (punctuation tokens marked with the PUNCT tag were excluded).

[1] We used the latest version of the tagset available at http://universaldependencies.org
[2] http://bit.ly/291BlpZ

Then, a Continuous Skipgram embedding model (Mikolov et al., 2013a) was trained on this corpus, using a vector size of 300, 10 negative samples, a symmetric window of 2 words, no down-sampling, and 5 iterations over the training data. Words with corpus frequency less than 5 were ignored. This model represents the semantics of the words it contains. But at the same time, for each word, a PoS tag is known (from the BNC annotation). It means that it is possible to test how good the word embeddings are in grouping words according to their parts of speech.
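To make the training setup concrete, the following is a minimal sketch of how such a model could be trained with the gensim library (whose Word2Vec class implements Continuous Skipgram). The corpus file name is hypothetical and the parameter names assume gensim 4.x; the hyperparameter values simply restate the ones listed above.

# Sketch: train a Continuous Skipgram model on a one-sentence-per-line corpus
# whose tokens are lemma_TAG strings (e.g. "love_VERB").
# The file name is a placeholder, not the authors' actual file.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

corpus = LineSentence("bnc_lemma_pos.txt")  # hypothetical pre-processed BNC

model = Word2Vec(
    corpus,
    sg=1,             # Continuous Skipgram
    vector_size=300,  # embedding dimensionality
    window=2,         # symmetric window of 2 words
    negative=10,      # 10 negative samples
    sample=0,         # no down-sampling of frequent words
    min_count=5,      # ignore words with corpus frequency below 5
    epochs=5,         # 5 iterations over the training data
)

model.save("bnc_skipgram_300.model")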
sal PoS tag set was that this is a newly emerg- Note that during training (and subsequent test- ing standard which is actively being used for an- ing), each word’s vector was used several times, notation of a range of different languages through proportional to frequency of the word in the cor- the community-driven Universal Dependencies ef- pus, so the classifier was trained on 177 343 fort (Nivre et al., 2016). Additionally, this tag set (sometimes repeating) instances, instead of the is coarser than the original BNC one: it simpli- original 10 000. This was done to alleviate clas- fies the workflow and eliminates the necessity to sification bias due to class imbalance: There are merge ‘inflectional’ tags into one (e.g., singular much fewer word types in the closed PoS classes and plural nouns into one ‘’ class). This con- (, conjunctions, etc.) than in the open forms with our interest in parts of speech proper, ones (nouns, verbs, etc.), so without considering not inflectional forms within one PoS. We worked word frequency, the model does not have a chance with the following 16 Universal tags: ADJ, ADP, to learn good predictors for ‘small’ classes and ADV, AUX, CONJ, DET, INTJ, NOUN, NUM, ends up never predicting them. At the same time, words from closed classes occur very frequently 1We used the latest version of the tagset available at http://universaldependencies.org in the running text, so after ‘weighting’ training 2http://bit.ly/291BlpZ instances by corpus frequency, the balance is re-

The resulting classifier showed a weighted macro-averaged F-score (over all PoS classes) and an accuracy equal to 0.98, with 10-fold cross-validation on the training set. This is a significant improvement over the one-feature baseline classifier (classifying using only the one vector dimension with the maximum F-value in relation to the class tags), with an F-score equal to only 0.22. Thus, the results support the hypothesis that word embeddings contain information that allows us to group words together based on their parts of speech. At the same time, we see that this information is not restricted to some particular vector component: rather, it is distributed among several axes of the vector space. After training the classifier, we were able to use it to detect 'outlying' words in the BNC (judging by the distributional model). So as not to experiment on the same data we had trained our classifier on, we compiled another test set of 17 000 vectors for words with BNC frequencies between 100 and 500. They were weighted by word frequency in the same way as the training set, and the resulting test set contained 30 710 instances. Compared to the training error reported above, we naturally observe a drop in performance when predicting PoS for this unseen data, but the classifier still appears quite robust, yielding an F-score of 0.91. However, some of the drop is also due to the fact that we are applying the classifier to words with lower frequency, and hence we have somewhat less training data for the input embeddings.

Furthermore, to make sure that the results can potentially be extended to other texts, we applied the trained classifier to all lemmas from the human-annotated Universal Dependencies English Treebank (Silveira et al., 2014). The words not present in the distributional model were omitted (they sum to 27% of word types and 10% of word tokens). The classifier showed an F-score equal to 0.99, further demonstrating its robustness. Note, however, that part of this performance is because the UD Treebank contains many words from the classifier training set. Essentially, it means that the decisions of the UD human annotators are highly consistent with the distributional patterns of words in the BNC.

[Figure 1: Centroid embedding for coordinating conjunctions]
[Figure 2: Centroid embedding for subordinating conjunctions]

In sum, the vast majority of words are classified correctly, which means that their embeddings enable the detection of their parts of speech. In fact, one can visualize 'centroid' vectors for each PoS by simply averaging the vectors of words belonging to this part of speech. We did this for the 10 000 words from our training set.

Plots for the centroid vectors of coordinating and subordinating conjunctions are shown in Figures 1 and 2 respectively. Even visually one can notice a very strongly expressed feature near the '100' mark on the horizontal axis (component number 94). In fact, this is indeed an idiosyncratic feature of conjunctions: none of the other parts of speech shows such a property. More details about which vector components are relevant to part of speech affiliation are given in Section 5.

Additionally, with centroid PoS vectors we can find out how similar different parts of speech are to each other, by simply measuring the cosine similarity between them.

Table 1. Distributional similarity between parts of speech (fragment)

    Cosine similarity    PoS pair
    0.81                 NOUN  ADJ
    0.77                 ADV   PRON
    0.73                 DET   PRON
    0.73                 ADV   ADJ
    ...                  ...
    0.37                 INTJ  NUM
    0.36                 AUX   NUM

Table 2. Most frequent PoS misclassifications of the distributional predictor. The # column lists the number of word types.

    #     Actual PoS    Predicted PoS
    347   PROPN         NOUN
    313   ADJ           NOUN
    190   NOUN          ADJ
    91    NOUN          PROPN
    87    PROPN         ADJ
    57    VERB          ADJ
    55    NOUN          NUM
    52    NUM           NOUN
    45    NUM           PROPN
    28    ADV           PROPN
    25    ADV           NOUN
    25    ADJ           PROPN
    20    ADV           ADJ

If we rank PoS pairs according to their similarity (Table 1), we see that nouns and adjectives are close to each other, determiners and pronouns are also similar, as well as prepositions and subordinating conjunctions; quite in accordance with linguistic intuition. Proper nouns are not very similar to common nouns, with the cosine similarity between them only 0.67 (even adverbs are closer). Arguably, this is explained by co-occurrences together with the definite article, and as we show below, this helps the model to successfully separate the former from the latter.

Despite the generally good performance of the classifier, if we look at our BNC test set, 1741 word types (about 10% of the whole test set vocabulary) were still classified incorrectly. Thus, they are somehow dissimilar to 'prototypical' words of their parts of speech. These are the 'outliers' we were after. We analyze the patterns found among them in the next section.

4 Not from this crowd: analyzing outliers

First, we filtered out misclassified word types with 'X' BNC annotation (they are mostly foreign words or typos). This leaves us with 1558 words for which the classifier assigned part of speech tags different from the ones in the BNC. It probably means that these words' distributional patterns differ somehow from what is more typically observed, and that they tend to exhibit behavior similar to another part of speech. Table 2 shows the most frequent misclassification cases, together accounting for more than 85% of errors.

Additionally, we ranked misclassification cases by 'part of speech coverage', that is, by the ratio of words belonging to a particular PoS for which our classifier outputs this particular type of misclassification. For example, proper nouns misclassified as common nouns constitute the most numerous error type in Table 2, but in fact only 9% of all proper nouns in the test set were misclassified in this way. There are parts of speech with a much larger portion of word types predicted erroneously: e.g., 22% of subordinate conjunctions were classified as adverbs. Table 3 lists the error types with the highest coverage (we excluded error types with absolute frequency equal to 1, as it is impossible to speculate on solitary cases).

Table 3. Coverage of misclassifications with the distributional predictor, i.e., the ratio of errors over all word types of a given PoS. The absolute type count is given by #.

    Coverage    Actual PoS    Predicted PoS    #
    0.22        SCONJ         ADV              2
    0.17        INTJ          PROPN            8
    0.11        ADP           ADJ              3
    0.09        ADJ           NOUN             313
    0.09        PROPN         NOUN             347
    0.09        NUM           NOUN             52
    0.08        NUM           PROPN            45

We now describe some of the interesting cases. Almost 30% of error types (judging by the absolute amount of misclassified words) consist of proper nouns predicted to be common ones and vice versa. These cases do not tell us anything new, as it is obvious that distributionally these two classes of words are very similar, take the same syntactic contexts and can hardly be considered different parts of speech at all. At the same time, it is interesting that the majority of proper nouns in the test set (88%) were correctly predicted as such. It means that in spite of the contextual similarity, the distributional model has managed to extract features typical for proper names. Errors mostly cover comparatively rare names, such as 'luftwaffe', 'stasi', 'stonehenge', or 'himalayas'. Our guess is that the model was just not presented with enough contexts for these words to learn meaningful representations. Also, they are mostly not personal names but toponyms or organization names, probably occurring together with the definite article the, unlike personal names.
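Both rankings used in this section (absolute counts of misclassified word types as in Table 2, and 'coverage' as in Table 3) reduce to simple counting over pairs of gold and predicted tags; a small sketch with illustrative names:

# Sketch: rank misclassification types by absolute word-type count and by
# coverage = misclassified word types / all word types of that PoS.
from collections import Counter

def error_rankings(gold, predicted):
    """gold, predicted: dicts mapping word type -> PoS tag."""
    pos_sizes = Counter(gold.values())
    errors = Counter((gold[w], predicted[w])
                     for w in gold if predicted[w] != gold[w])
    by_count = errors.most_common()
    by_coverage = sorted(
        ((count / pos_sizes[actual], actual, pred, count)
         for (actual, pred), count in errors.items()
         if count > 1),              # drop solitary cases, as above
        reverse=True)
    return by_count, by_coverage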

Another 30% of errors are due to vague boundaries between nominal and adjectival distribution patterns in English: nouns can be modified by both (it seems that cases where a proper noun is mistaken for an adjective are often caused by the same factor). Words like 'materialist_NOUN', 'starboard_NOUN' or 'hypertext_NOUN' are tagged as nouns in the BNC, but they often modify other nouns, and their contexts are so 'adjectival' that the distributional model actually assigned them semantic features highly similar to those of adjectives. Vice versa, 'white-collar_ADJ' (an adjective in the BNC) is regarded as a noun from the point of view of our model. Indeed, there can be contradicting views on the correct part of speech for this word in phrases like 'and all the other white-collar workers'. Thus, in this case the distributional model highlights the already known similarity between the two word classes.

The cases of verbs mistaken for adjectives seem to be caused mostly by passive participles ('was overgrown', 'is indented', etc.), which intuitively are indeed very adjective-like. So, this gives us a set of verbs dominantly (or almost exclusively, like 'to intertwine' or 'to disillusion') used in the passive. Of course, we will hardly announce such verbs to be adjectives based on that evidence, but at least we can be sure that this sub-class of verbs is clearly semantically and distributionally different from other verbs.

The next numerous type of errors consists of common nouns predicted to be numerals. A quick glance at the data reveals that 90% of these 'nouns' are in fact currency amounts and percentages ('£70', '33%', '$1', etc.). It seems reasonable to classify these as numerals, even though they contain some kind of nominative entities inside. Judging by the decisions of the classifier, their contexts do not differ much from those of simple numbers, and their semantics is similar. The Universal Dependencies Treebank is more consistent in this respect: it separates entities like '1$' into two tokens: a numeral (NUM) and a symbol (SYM). Consequently, when our classifier was tested on the words from the UD Treebank, there was only one occurrence of this type of error.

Related to this is the inverse case of numerals predicted to be common or proper nouns. It is interesting that this error type also ranks quite high in terms of coverage: if we combine numerals predicted to be common and proper nouns, we will see that 17% of all numerals in the test set were subject to this error. The majority of these 'numerals' are years ('1804', '1776', '1822') and decades ('1820s', '60s' and even 'twelfths'). Intuitively, such entities do indeed function as nouns ('I'd like to return to the sixties'). Anyway, it is difficult to invent a persuasive reason for why 'fifty pounds' should be tagged as a noun, but 'the year 1776' as a numeral. So, this points to possible (minor) inconsistencies in the annotation strategy of the BNC. Note that a similar problem exists in the Penn Treebank as well (Manning, 2011).

Adverbs classified as nouns (53 words in total for both common and proper nouns) are possibly the ones often followed by verbs or appearing in the company of adjectives (examples are 'ultra' and 'kinda'). This made the model treat them as close to the nominative classes. Interestingly, most 'adverbs' predicted to be proper nouns are time indicators ('7pm', '11am'); this also raises questions about what adverbial features are really present in these entities. Once again, unlike the BNC, the UD Treebank does not tag them as adverbs.

The cases we described above revealed some inconsistencies in the BNC annotation. However, it seems that with adverbs mistaken for adjectives, we actually found a systematic error in the BNC tagging: these cases are mostly connected to adjectives like 'plain', 'clear' or 'sharp' (including comparative and superlative forms) erroneously tagged in the corpus as adverbs. These cases are not rare: just the three adjectives we mentioned alone appear in the BNC about 600 times with an adverb tag, mostly in clauses of the kind 'the author makes it plain that...', so-called small clauses (Aarts, 2012). Sometimes these tokens are tagged as ambiguous, and the adjective tag is there as a second variant; however, the corpus

documentation states that in such cases the first variant is always more likely. Thus, distributional models can actually detect outright errors in PoS-tagged corpora, when incorrectly tagged words strongly tend to cluster with another part of speech. In the UD Treebank such examples can also be observed, but they are much fewer and more 'adverbial', like 'it goes clear through'.

Turning to Table 3, most of the entries were already covered above, except the first three cases. These relate to closed word classes (functional words), which is why the absolute number of affected word types is low, but the coverage (the ratio over all words of this PoS) is quite high.

First, out of 9 distinct subordinate conjunctions in the test set, 2 were predicted to be adverbs. This is not surprising, as these words are 'seeing' and 'immediately'. For 'seeing' the prediction seems to be just a random guess (the prediction confidence was as low as 0.3), but with 'immediately' the classifier was actually more correct than the BNC tagger (the prediction confidence was about 0.5). In the BNC, these words are mostly tagged as subordinate conjunctions in cases when they occur sentence-initially ('Immediately, she lowered the gun'). The other words marked as SCONJ in the test set really are such, and the classifier made correct predictions matching the BNC tags.

Interjections mistaken for proper names do not seem very interpretable (examples are 'gee', 'oy' and 'farewell'). At the same time, the 3 prepositions predicted to be adjectives clearly form a separate group: they are 'cross', 'pre' and 'pro'. They are not often used as separate words, but when they are ('Did anyone encounter any trouble from Hibs fans in Edinburgh pre season?'), they are very close to adjectives or adverbs, so the predictions of the distributional classifier once again suggest shifting parts of speech boundaries a bit.

Error analysis on the vocabulary from the Universal Dependencies Treebank showed pretty much the same results, except for some differences already mentioned above.

There exists another way to retrieve this kind of data: to process tagged data with a conventional PoS tagger and analyze the resulting confusion matrix. We tested this approach by processing the whole BNC with the Stanford PoS Tagger (Toutanova et al., 2003). Note that as input to the tagger we used not whole sentences from the corpus, but separate tokens, to mimic our workflow with the distributional predictor. Prior to this, BNC tags were converted to the Penn Treebank tagset[3] to match the output of the tagger. As we are interested in coarse, 'overarching' word classes, inflectional forms were merged into one tag. That was easy to accomplish by dropping all characters of the tags after the first two (excluding proper noun tags, which were all converted to NNP).

[3] https://www.cis.upenn.edu/~treebank/

Table 4. Most frequent PoS misclassifications with the Stanford tagger (counting word types).

    #        Actual    Predicted
    172675   NNP       NN
    47202    VB        NN
    40218    JJ        NN
    24075    NN        JJ
    9723     JJ        VB

Analysis of the confusion matrix (cases where the tag predicted by the Stanford tagger was different from the BNC tag) revealed the most frequent error types, shown in Table 4. Despite the similar top positions of the error types 'proper noun predicted as common noun' and 'nouns and adjectives mistaken for each other', there are also very frequent errors of the types 'verb to noun' and 'adjective to verb', not observed in the distributional confusion matrix (Table 2). We would not be able to draw the same insights that we did from the distributional confusion matrix: the case with verbs mistaken for adjectives is ranked only 12th, adverbs mistaken for nouns only 13th, etc.

Table 5 shows the top misclassification types by their word type coverage. Once again, the interesting cases we discovered with the distributional confusion matrix (like subordinating conjunctions mistaken for adverbs and prepositions mistaken for adjectives) did not show up. Obviously, a lot of other insights can be extracted from the Stanford Tagger errors (as has been shown in previous work), but it seems that employing a distributional predictor reveals different error cases and thus is useful in evaluating the sanity of tag sets.
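The tag coarsening described above (keeping only the first two characters of each Penn Treebank tag, with all proper-noun tags mapped to NNP) can be sketched as a small helper:

# Sketch: collapse fine-grained Penn Treebank tags into coarse classes.
def coarsen_ptb_tag(tag):
    if tag.startswith("NNP"):   # NNP, NNPS -> NNP (keep proper nouns apart)
        return "NNP"
    return tag[:2]              # e.g. NNS -> NN, VBD -> VB, JJR -> JJ

assert coarsen_ptb_tag("VBZ") == "VB"
assert coarsen_ptb_tag("NNPS") == "NNP"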

To sum up, the analysis of 'boundary cases' detected by a classifier trained on distributional vectors indeed reveals sub-classes of words lying on the verge between different parts of speech. It also allows for quickly discovering systematic errors or inconsistencies in PoS annotations, whether they be automatic or manual. Thus, discussions about PoS boundaries would benefit from taking this kind of data into consideration.

Table 5. Coverage of misclassifications (from all word types of this PoS) with the Stanford tagger.

    Coverage    Actual    Predicted    #
    0.91        NNP       NN           172675
    0.8         UH        NN           576
    0.79        DT        NN           217
    0.78        EX        JJ           11
    0.78        PR        NN           517

5 Embeddings as PoS predictors

In the experiment described in the previous section, we used a model trained on words concatenated with their PoS tags. Thus, our 'classifier' was a bit artificial in that it required a word plus a tag as input, and then its output is a judgment about what tag is most applicable to this combination from the point of view of the BNC distributional patterns. This was not a problem for us, as our aim was exactly to discover lexical outliers. But is it possible to construct a proper predictor in the same way, one which is able to predict a PoS tag for a word without any pre-existing tags as hints? Preliminary experiments seem to indicate that it is.

We trained a Continuous Skipgram distributional model on the BNC lemmas without PoS tags. After that, we constructed a vocabulary of all unambiguous lemmas from the UD Treebank training set. 'Unambiguous' here means that the lemma either was always tagged with one and the same PoS tag in the Treebank, or has one 'dominant' tag, with the frequencies of the other PoS assignments not exceeding 1/2 of the dominant assignment frequency. Our hypothesis was that these words are prototypical examples of their PoS classes, with the corresponding prototypical features most pronounced; this approach is conceptually similar to (Haghighi and Klein, 2006). We also removed words with frequency less than 10 in the Treebank. This left us with 1564 words from all Universal Tag classes (excluding PUNCT, X and SYM, as we hardly want to predict punctuation or symbol tags).

Then the same simple logistic regression classifier was trained on the distributional vectors from the model for these 1564 words only, using UD Treebank tags as class labels (the training instances were again weighted proportionally to the words' frequencies in the Treebank). The resulting classifier showed an accuracy of 0.938 after 10-fold cross-validation on the training set.

We then evaluated the classifier on tokens from the UD Treebank test set. Now the input to the classifier consisted of these tokens' lemmas only. Lemmas which were missing from the model's vocabulary were omitted (860 of a total of 21 759 tokens in the test set). The model reached an accuracy of 0.84 (weighted precision 0.85, weighted recall 0.84).

These numbers may not seem very impressive in comparison with the performance of current state-of-the-art PoS taggers. However, one should remember that this classifier knows absolutely nothing about a word's context in the current sentence. It assigns PoS tags based solely on the proximity of the word's distributional vector in an unsupervised model to those of prototypical PoS examples. The classifier was in fact based only on knowledge of what words occurred in the BNC near other words within a symmetric window of 2 words to the left and to the right. It did not even have access to information about the exact word order within this sliding window, which makes its performance even more impressive.

It is also interesting that one needs as few as a thousand example words to train a decent classifier. Thus, it seems that PoS affiliation is expressed quite strongly and robustly in word embeddings. It can be employed, for example, in preliminary tagging of large corpora of resource-poor languages. Only a handful of non-ambiguous words need to be manually PoS-tagged, and the rest is done by a distributional model trained on the corpus.

Note that applying a K-neighbors classifier instead of logistic regression returned somewhat lower results, with 0.913 accuracy on 10-fold cross-validation with the training set, and 0.81 accuracy on the test set. This seems to support our hypothesis that several particular embedding components correspond to part of speech affiliation, but not all of them. As a result, the K-neighbors classifier fails to separate these important features from all the others and predicts the word class based on its nearest neighbors, with all dimensions of the semantic space equally important. At the same time, logistic regression learns to pay more attention to the relevant features, neglecting unimportant ones.
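A sketch of the prototype selection and of the resulting predictor, under the selection criteria stated above (a 'dominant' tag is one whose competitors each occur at most half as often, and lemmas with Treebank frequency below 10 are dropped); the treebank reader and the untagged Skipgram model are assumed to exist elsewhere:

# Sketch: pick prototypical (unambiguous) lemmas from a tagged treebank and
# train a PoS predictor on their embeddings only.
from collections import Counter, defaultdict
from sklearn.linear_model import LogisticRegression

def select_prototypes(lemma_tag_pairs, min_freq=10):
    """lemma_tag_pairs: iterable of (lemma, pos_tag) tokens from the treebank."""
    counts = defaultdict(Counter)
    for lemma, tag in lemma_tag_pairs:
        counts[lemma][tag] += 1
    prototypes = {}
    for lemma, tags in counts.items():
        (best_tag, best), *rest = tags.most_common()
        total = sum(tags.values())
        if total >= min_freq and all(c <= best / 2 for _, c in rest):
            prototypes[lemma] = (best_tag, total)
    return prototypes  # lemma -> (dominant tag, treebank frequency)

def train_prototype_predictor(model, prototypes):
    lemmas = [l for l in prototypes if l in model.wv]
    X = [model.wv[l] for l in lemmas]
    y = [prototypes[l][0] for l in lemmas]
    weights = [prototypes[l][1] for l in lemmas]  # weight by frequency
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y, sample_weight=weights)
    return clf  # clf.predict([model.wv[lemma]]) tags an unseen token's lemma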

[Figure 3: Classifier accuracy depending on the number of used vector components (k)]

To find out how many features are important for the classifier, we used the same training and test set, and ranked all embedding components (features, vector dimensions) by their ANOVA F-value related to the PoS class. Then we successively trained the classifier on increasing amounts of top-ranked features (the top k best) and measured the training set accuracy.

The results are shown in Figure 3. One can see that the accuracy grows smoothly with the number of used features, eventually reaching almost ideal performance on the training set. It is difficult to define the point where the influence of adding features reaches a plateau; it may lie somewhere near k = 100. It means that the knowledge about PoS affiliation is distributed among at least one hundred components of the word embeddings, quite consistent with the underlying idea of embedding models.

One might argue that the largest gap in performance is between k = 2 and k = 3 (from 0.38 to 0.51) and thus most PoS-related information is contained in the 3 components with the largest F-value (in our case, these 3 features were components 31, 51 and 11). But an accuracy of 0.51 is certainly not an adequate result, so even if important, these components are not sufficient to robustly predict part of speech affiliation for a word. Further research is needed to study the effects of adding features to the classifier training.

Regardless, an interesting finding is that part of speech affiliation is distributed among many components of the word embeddings, not concentrated in one or two specific features. Thus, the strongly expressed component 94 in the average vector of conjunctions (Figures 1 and 2) seems to be a solitary case.
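The feature-ranking procedure above maps directly onto scikit-learn's univariate feature selection, where f_classif computes exactly the ANOVA F-value of each component with respect to the class labels; a minimal sketch with illustrative names:

# Sketch: rank embedding components by ANOVA F-value against the PoS labels
# and retrain the classifier on the top-k components only.
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

def accuracy_over_k(X, y, ks=(1, 2, 3, 10, 50, 100, 200, 300)):
    results = {}
    for k in ks:
        selector = SelectKBest(f_classif, k=k).fit(X, y)
        X_k = selector.transform(X)
        clf = LogisticRegression(max_iter=1000).fit(X_k, y)
        results[k] = clf.score(X_k, y)  # training-set accuracy, as above
    return results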

6 Conclusion

Distributional semantic vectors trained on word contexts from large text corpora can learn knowledge about part of speech clusters. Arguably, they are good at this precisely because part of speech boundaries are not strict, and are even sometimes considered to be a non-categorical linguistic phenomenon (Manning, 2015).

In this paper we have demonstrated that semantic features derived in the process of training a PoS prediction model on word embeddings can be employed both in supporting linguistic hypotheses about part of speech class changes and in detecting and fixing possible annotation errors in corpora. The prediction model is based on simple logistic regression, and the word embeddings are trained using the Continuous Skip-Gram model over PoS-tagged lemmas. We show that the word embeddings contain robust data about the PoS classes of the corresponding words, and that this knowledge seems to be distributed among several components (at least a hundred in our case of a 300-dimensional model). We also report preliminary results for predicting PoS tags using a classifier trained on a small number of prototypical members (words with a dominant PoS class) and applying it to embeddings estimated from unlabeled data. A detailed error analysis and experimental results are reported for both the BNC and the UD Treebank.

The reported experiments form part of ongoing research, and we plan to extend it, particularly by conducting similar experiments with other languages typologically different from English. We also plan to continue studying the issue of the correspondence between particular embedding components and part of speech affiliation. Another direction of future work is finding out how different hyperparameters for training distributional models (including training corpus pre-processing) influence their performance in PoS discrimination, and also comparing the results to using structured embedding models like those of Ling et al. (2015b).

References

Bas Aarts and April McMahon. 2008. The Handbook of English Linguistics. John Wiley & Sons.

Bas Aarts. 2012. Small Clauses in English. The Nonverbal Types. De Gruyter Mouton, Boston.

Douglas Biber, Stig Johansson, Geoffrey Leech, Susan Conrad, Edward Finegan, and Randolph Quirk. 1999. Longman Grammar of Spoken and Written English, volume 2. MIT Press.

Lou Burnard. 2007. Users Reference Guide for British National Corpus (XML Edition). Oxford University Computing Services, UK.

Christos Christodoulopoulos, Sharon Goldwater, and Mark Steedman. 2010. Two decades of unsupervised POS induction: How far have we come? In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 575–584. Association for Computational Linguistics.

John Firth. 1957. A synopsis of linguistic theory, 1930-1955. Blackwell.

Aria Haghighi and Dan Klein. 2006. Prototype-driven learning for sequence models. In Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 320–327. Association for Computational Linguistics.

Ann Celeste Houston. 1985. Continuity and change in English morphology: The variable (ING). Ph.D. thesis, University of Pennsylvania.

Chu-Cheng Lin, Waleed Ammar, Chris Dyer, and Lori Levin. 2015. Unsupervised POS induction with word embeddings. arXiv preprint arXiv:1503.06760.

Wang Ling, Lin Chu-Cheng, Yulia Tsvetkov, and Silvio Amir. 2015a. Not all contexts are created equal: Better word representations with variable attention. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

Wang Ling, Chris Dyer, Alan W. Black, and Isabel Trancoso. 2015b. Two/too simple adaptations of word2vec for syntax problems. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies, pages 1299–1304, Denver, Colorado.

Quan Liu, Zhen-Hua Ling, Hui Jiang, and Yu Hu. 2016. Part-of-speech relevance weights for learning word embeddings. arXiv preprint arXiv:1603.07695.

Christopher D Manning. 2011. Part-of-speech tagging from 97% to 100%: is it time for some linguistics? In Computational Linguistics and Intelligent Text Processing, pages 171–189. Springer.

Christopher D Manning. 2015. Computational linguistics and deep learning. Computational Linguistics, 41:701–707.

Bernard Merialdo. 1994. Tagging English text with a probabilistic model. Computational Linguistics, 20(2):155–172.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013a. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space word representations. In Proceedings of NAACL-HLT 2013, pages 746–751.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the International Conference on Language Resources and Evaluation (LREC).

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In LREC 2012.

Barbara Plank, Anders Søgaard, and Yoav Goldberg. 2016. Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. arXiv preprint arXiv:1604.05529.

Sujith Ravi and Kevin Knight. 2009. Minimized models for unsupervised part-of-speech tagging. In Proceedings of ACL-IJCNLP 2009, pages 504–512, Singapore.

Hinrich Schütze. 1995. Distributional part-of-speech tagging. In Proceedings of the Seventh Conference of the European Chapter of the Association for Computational Linguistics, pages 141–148. Morgan Kaufmann Publishers Inc.

Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer, and Christopher D. Manning. 2014. A gold standard dependency corpus for English. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014).

Kristina Toutanova and Mark Johnson. 2007. A Bayesian LDA-based model for semi-supervised part-of-speech tagging. In Proceedings of the Neural Information Processing Systems Conference (NIPS).

Kristina Toutanova, Dan Klein, Christopher D Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 NAACL-HLT Conference - Volume 1, pages 173–180. Association for Computational Linguistics.

Yuta Tsuboi. 2014. Neural networks leverage corpus-wide information for part-of-speech tagging. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 938–950.

Yulia Tsvetkov, Manaal Faruqui, Wang Ling, Guillaume Lample, and Chris Dyer. 2015. Evaluation of word vector representations by subspace alignment. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17-21 September 2015, pages 2049–2054.

Mehmet Ali Yatbaz, Enis Sert, and Deniz Yuret. 2012. Learning syntactic categories using paradigmatic representations of word context. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 940–951. Association for Computational Linguistics.