Redefining Part-Of-Speech Classes with Distributional Semantic Models

Redefining part-of-speech classes with distributional semantic models Andrey Kutuzov Erik Velldal Lilja Øvrelid Department of Informatics Department of Informatics Department of Informatics University of Oslo University of Oslo University of Oslo [email protected] [email protected] [email protected] Abstract ways depends on the human annotators them- selves, but also on the quality of linguistic con- This paper studies how word embeddings ventions behind the division into different word trained on the British National Corpus in- classes. That is why there have been several at- teract with part of speech boundaries. Our tempts to refine the definitions of parts of speech work targets the Universal PoS tag set, and to make them more empirically grounded, which is currently actively being used for based on corpora of real texts: see, among others, annotation of a range of languages. We ex- the seminal work of Biber et al. (1999). The aim periment with training classifiers for pre- of such attempts is to identify clusters of words dicting PoS tags for words based on their occurring naturally and corresponding to what we embeddings. The results show that the in- usually call ‘parts of speech’. One of the main formation about PoS affiliation contained distance metrics that can be used in detecting such in the distributional vectors allows us to clusters is a distance between distributional fea- discover groups of words with distribu- tures of words (their contexts in a reference train- tional patterns that differ from other words ing corpus). of the same part of speech. In this paper, we test this approach using pre- This data often reveals hidden inconsisten- dictive models developed in the field of distribu- cies of the annotation process or guide- tional semantics. Recent achievements in training lines. At the same time, it supports the distributional models of language using machine notion of ‘soft’ or ‘graded’ part of speech learning allow for robust representations of nat- affiliations. Finally, we show that infor- ural language semantics created in a completely mation about PoS is distributed among unsupervised way, using only large corpora of raw dozens of vector components, not limited text. Relations between dense word vectors (em- to only one or two features. beddings) in the resulting vector space are as a rule used for semantic purposes. But can they be 1 Introduction employed to discover something new about gram- Parts of speech (PoS) are useful abstractions, but mar and syntax, particularly parts of speech? Do still abstractions. Boundaries between them in nat- learned embeddings help here? Below we show ural languages are flexible. Sometimes, large open that such models do contain a lot of interesting classes of words are situated on the verge between data related to PoS classes. several parts of speech: for example, participles in The rest of the paper is organized as follows. English are in many respects both verbs and ad- In Section 2 we briefly cover the previous work jectives. In other cases, closed word classes ‘inter- on the subject of parts of speech and distributional sect’, e.g., it is often difficult to tell a determiner models. Section 3 describes data processing and from a possessive pronoun. As Houston (1985) the training of a PoS predictor based on word em- puts it, ‘Grammatical categories exist along a con- beddings. In Section 4 errors of this predictor are tinuum which does not exhibit sharp boundaries analyzed and insights gained from them described. between the categories’. Section 5 introduces an attempt to build a full- When annotating natural language texts for fledged PoS tagger within the same approach. It parts of speech, the choice of a PoS tag in many also analyzes the correspondence between partic- 115 Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL), pages 115–125, Berlin, Germany, August 7-12, 2016. c 2016 Association for Computational Linguistics ular word embedding components and PoS affilia- et al., 2010) based on a small number of pro- tion, before we conclude in Section 6. totypical examples for each PoS; induction approaches that are completely unsupervised and 2 Related work make no use of prior knowledge. This is also the main focus of the comparative survey provided by Traditionally, 3 types of criteria are used to distin- (Christodoulopoulos et al., 2010). guish different parts of speech: formal (or mor- Work on PoS induction has a long history – in- phological), syntactic (or distributional) and se- cluding the use of distributional methods – going mantic (Aarts and McMahon, 2008). Arguably, back at least to Schütze (1995), and recent work syntactic and semantic criteria are not very differ- has demonstrated that word embeddings can be ent from each other, if one follows the famous dis- useful for this task as well (Yatbaz et al., 2012; tributional hypothesis stating that meaning is de- Lin et al., 2015; Ling et al., 2015a). termined by context (Firth, 1957). Below we show In terms of positioning this study relative to pre- that unsupervised distributional semantic models vious work, it falls somewhere in between the dis- contain data related to parts of speech. tinctions made above. It is perhaps closest to dis- For several years already it has been known ambiguation approaches, but it is not unsupervised that some information about morphological word given that we make use of existing tag annotations classes is indeed stored in distributional models. when training our embeddings and predictors. The Words belonging to different parts of speech pos- goal is also different; rather than performing PoS sess different contexts: in English, articles are typ- acquisition or tagging for its own sake, the main ically followed by nouns, verbs are typically ac- focus here is on analyzing the boundaries of dif- companied by adverbs and so on. It means that ferent PoS classes. In Section 5, this analysis is during the training stage, words of one PoS should complemented by experiments with using word theoretically cluster together or at least their em- embeddings for PoS prediction on unlabeled data, beddings should retain some similarity allowing and here our approach can perhaps be seen as re- for their separation from words belonging to other lated to previous so-called prototype-driven ap- parts of speech. Recently, among others, Tsuboi proaches, but in these experiments we also make (2014) and Plank et al. (2016) have demonstrated use of labeled data when defining our prototypes. how word embeddings can improve supervised It seems clear that one can infer data about PoS-tagging. PoS classes of words from distributional models in Mikolov et al. (2013b) showed that there also general, including embedding models. As a next exist regular relations between words from dif- step then, these models could also prove useful ferent classes: the vector of ‘Brazil’is related to for deeper analysis of part of speech boundaries, ‘Brazilian’ in the same way as ‘England’ is re- leading to discovery of separate words or whole lated to ‘English’ and so on. Later, Liu et al. classes that tend to behave in non-typical ways. (2016) demonstrated how words of the same part Discovering such cases is one possible way to im- of speech cluster into distinct groups in a distri- prove the performance of existing automatic PoS butional model, and Tsvetkov et al. (2015) proved taggers (Manning, 2011). These ‘outliers’ may that dimensions of distributional models are cor- signal the necessity to revise the annotation strat- related with different linguistic features, releasing egy or classification system in general. Section 3 an evaluation dataset based on this. describes the process of constructing typical PoS Various types of distributional information has clusters and detecting words that belong to a clus- also played an important role in previous work ter different from their traditional annotation. done on the related problem of unsupervised PoS acquisition. As discussed in Christodoulopou- 3 PoS clusters in distributional models los et al. (2010), we can separate at least three main directions within this line of work: Disam- Our hypothesis is that for the majority of words biguation approaches (Merialdo, 1994; Toutanova their parts of speech can be inferred from their em- and Johnson, 2007; Ravi and Knight, 2009) that beddings in a distributional model. This inference start out from a dictionary providing possible tags can be considered a classification problem: we are for different words; prototype-driven approaches to train an algorithm that takes a word vector as in- (Haghighi and Klein, 2006; Christodoulopoulos put and outputs its part of speech. If the word em- 116 beddings do contain PoS-related data, the properly PART, PRON, PROPN, SCONJ, SYM, VERB, trained classifier will correctly predict PoS tags for X (punctuation tokens marked with the PUNCT the majority of words: it means that these lexical tag were excluded). entities conform to a dominant distributional pat- Then, a Continuous Skipgram embedding tern of their part of speech class. At the same time, model (Mikolov et al., 2013a) was trained on this the words for which the classifier outputs incor- corpus, using a vector size of 300, 10 negative rect predictions, are expected to be ‘outliers’, with samples, a symmetric window of 2 words, no distributional patterns different from other words down-sampling, and 5 iterations over the training in the same class. These cases are the points of data. Words with corpus frequency less than 5 linguistic interest, and in the rest of the paper we were ignored. This model represents the seman- mostly concentrate on them. tics of the words it contains. But at the same time, To test the initial hypothesis, we used the XML for each word, a PoS tag is known (from the BNC Edition of British National Corpus (BNC), a bal- annotation).

Redefining Part-Of-Speech Classes with Distributional Semantic Models

Learn Pronouns As Part of Speech for Bank & SSC Exams

6 the Major Parts of Speech

TRADITIONAL GRAMMAR REVIEW I. Parts of Speech Traditional

PARTS of SPEECH ADJECTIVE: Describes a Noun Or Pronoun; Tells

Automatic Extraction of Compound Verbs from Bangla Corpora

Phrasal Verbs: a Contribution Towards a More Accurate Definition

Verbals and Verbal Phrases

Chapter 3 Basic Concepts

Parts of Speech Overview

Grade 8 Writing and Language Teacher At-Home Activity Packet 3

Constructions and Result: English Phrasal Verbs As Analysed in Construction Grammar

Animacy Acquisition Using Morphological Case