
Empirical Linguistic Study of Sentence Embeddings

Katarzyna Krasnowska-Kieraś, Alina Wróblewska
Institute of Computer Science, Polish Academy of Sciences
ul. Jana Kazimierza 5, 01-248 Warsaw, Poland
[email protected] [email protected]

Abstract

The purpose of the research is to answer the question whether linguistic information is retained in vector representations of sentences. We introduce a method of analysing the content of sentence embeddings based on universal probing tasks, along with the classification datasets for two contrasting languages. We perform a series of probing and downstream experiments with different types of sentence embeddings, followed by a thorough analysis of the experimental results. Aside from dependency parser-based embeddings, linguistic information is retained best in the recently proposed LASER sentence embeddings.

1 Introduction

Modelling natural language with neural networks has been an extensively researched area for several years now. On the one hand, deep learning enormously reduced the cost of feature engineering. On the other hand, we are largely unaware of the features that are used in estimating a neural model and, therefore, of the kinds of information that a trained neural model relies most heavily on. Since neural network-based models work very well in many NLP tasks and often provide state-of-the-art results, it is extremely interesting and desirable to understand which properties of words, phrases or sentences are retained in their embeddings. An approach to investigate whether linguistic properties of English sentences are encoded in their embeddings is proposed by Shi et al. (2016), Adi et al. (2017), and Conneau et al. (2018). It consists in designing a series of classification problems focusing on linguistic properties of sentences, so-called probing tasks (Conneau et al., 2018). In a probing task, sentences are labelled according to a particular linguistic property. Given a model that generates an embedding vector for any sentence, the model is applied to the probing sentences. A classifier is then trained with the resulting embeddings as inputs and probing labels as targets. The performance of the resulting classifier is considered a proxy for how well the probed property is retained in the sentence embeddings.

We propose an extension and generalisation of the methodology of the probing task-based experiments. First, the current experiments are conducted on two typologically and genetically different languages: English, which is an isolating Germanic language, and Polish, which is a fusional Slavic one. Our motivation for conducting experiments on two contrasting languages is as follows. English is undoubtedly the most prominent language with multiple resources and tools. However, English language processing is only a part of NLP in general. Methods designed for English are not guaranteed to be universal. In order to verify whether an NLP algorithm is powerful, it is not enough to evaluate it solely on English. Evaluation on additional languages can shed light on an investigated method. We select Polish as our contrasting language for pragmatic reasons, i.e. there is a Polish dataset – CDSCorpus (Wróblewska and Krasnowska-Kieraś, 2017) – which is comparable to the SICK relatedness/entailment corpus (Bentivogli et al., 2014). Both datasets are used in downstream evaluation. Second, the designed probing tests are universal for both tested languages. For syntactic processing of both languages, we use the Universal Dependencies schema (UD, Nivre et al., 2016). [1] Since we use automatically parsed UD trees for generating probing datasets, analogous tests can be generated for any language with a UD treebank on which a parser can be trained.

[1] The Universal Dependencies initiative aims at developing a cross-linguistically consistent morphosyntactic annotation schema and at building a large multilingual collection of treebanks annotated according to this schema. It is worth noting that the UD schema has become the de facto standard for syntactic annotation in recent years.
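To make the probing approach sketched above concrete, the following is a minimal sketch (not the authors' code) of the generic procedure: sentence vectors produced by some embedding model are fed to a simple classifier, and its test accuracy serves as the proxy measure. The embedding function is a placeholder, and the linear classifier is only an illustrative stand-in for the MLP used later in the paper.

```python
# Minimal probing-task sketch: the accuracy of a classifier trained on sentence
# embeddings is used as a proxy for how well a property is encoded.
# `embed` is a placeholder for any sentence embedding model (LASER, BERT, ...).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def embed(sentences):
    # placeholder: return one vector per sentence
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(sentences), 1024))

def run_probing_task(train_pairs, test_pairs):
    # each pair is (sentence, label), e.g. ("She was seen by him.", 1) for Passive
    X_train = embed([s for s, _ in train_pairs])
    y_train = [y for _, y in train_pairs]
    X_test = embed([s for s, _ in test_pairs])
    y_test = [y for _, y in test_pairs]
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))
```

The experiments reported below use SentEval's Multilayer Perceptron classifier (Section 4.1); the linear classifier appears here only to keep the sketch short.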

The contributions of this work are twofold. (1) We introduce a method of analysing the content of sentence embeddings based on universal probing tasks, along with the classification datasets for two contrasting languages. (2) We carry out a series of empirical experiments based on publicly released probing datasets [2] created within the described work and the obtainable downstream task datasets with different types of sentence embeddings, followed by a thorough analysis of the experimental results. We test sentence embeddings obtained with max-pooling and mean-pooling operations over word embeddings or contextualised word embeddings, sentence embeddings estimated on small corpora, and sentence embeddings estimated on large monolingual or multilingual corpora.

[2] http://git.nlp.ipipan.waw.pl/Scwad/SCWAD-probing-data

2 Experimental Methodology

The purpose of the research is to answer the question whether linguistic information is retained in vector representations of sentences. Assessment of the linguistic content in sentence embeddings is not a trivial task and we verify whether it is possible with a probing task-based method (see Section 2.1). Probing sentence embeddings for individual linguistic properties does not examine the overall performance of embeddings in composing the meaning of the represented sentence. We therefore provide two downstream tasks for a general evaluation (see Section 2.2).

2.1 Probing Task-based Method

A probing task can be defined as "a classification problem that focuses on simple linguistic properties of sentences" (Conneau et al., 2018). A probing dataset contains pairs of sentences and their categories. For example, the dataset for the Passive probing task (a binary classification) consists of two types of pairs: ⟨a passive voice sentence, 1⟩ and ⟨a non-passive (active) voice sentence, 0⟩. The sentence–category pairs are automatically extracted from a corpus of dependency-parsed sentences. The extraction procedure is based on a set of rules compatible with the Universal Dependencies annotation schema. The proposed rules of creating the probing task datasets are thus universal for languages with UD-style dependency treebanks. A classifier is trained and tested on vector representations of the probing sentences generated with a sentence embedding model. If a linguistic property is encoded in the sentence embeddings and the classifier learns how this property is encoded, it will correctly classify the test sentence embeddings. The efficiency of the classifiers for each probing task is measured with accuracy. The probing tasks are described in Section 3.

2.2 Downstream Task-based Method

Two downstream tasks are proposed in our experiments: Relatedness and Entailment. The semantic relatedness [3] task is to measure the degree of any kind of lexical or functional association between two terms, phrases or sentences. The efficiency of the classifier for semantic relatedness is measured with Pearson's r and Spearman's ρ coefficients. The textual entailment task is to assess whether the meaning of one sentence is entailed by the meaning of another sentence. There are three entailment classes: entailment, contradiction, and neutral. The efficiency of the classifier for entailment, in turn, is measured with accuracy.

[3] Semantic relatedness is not equivalent to semantic similarity. Semantic similarity is only a special case of semantic relatedness, e.g. CAR and AUTO are similar terms and CAR and GARAGE are related terms.

3 Probing Tasks

The point of reference for designing our probing tasks is the work by Conneau et al. (2018). The authors propose several probing tasks and divide them into those pertaining to surface, syntactic and semantic phenomena. However, we decide to discard the 'syntactic versus semantic' distinction and consider all tasks either surface (see Section 3.1) or compositional (see Section 3.2).

This decision is motivated by the fact that both syntactic and semantic principles are undoubtedly compositional by their nature. The syntax admitting well-formed expressions on the basis of the lexicon works in tandem with the semantics. According to Jacobson's notion of Direct Compositionality (Jacobson, 2014, 43), "each syntactic rule which predicts the existence of some well-formed expression (as output) is paired with a semantic rule which gives the meaning of the output expression in terms of the meaning(s) of the input expressions".

[Figure 1: a UD dependency tree over the tokens ROOT She has starred with many leading actors . with arcs labelled root, nsubj, aux, obl, case, amod, amod, punct]
Figure 1: An example UD tree of the sentence She has starred with many leading actors.

3.1 Tests on Surface Properties

The tests investigate whether surface properties of sentences (i.e. sentence length and lexical content) are retained in their embeddings. We follow the definition of surface probing tasks and the procedure of preparing training data as described by Conneau et al. (2018).

SentLen (sentence length) This task consists in classifying sentences by their length. There are 6 sentence length classes with the following token intervals: 0: (3, 5), 1: (6, 8), 2: (9, 11), 3: (12, 14), 4: (15, 17), 5: (18, 20), 6: (21, 23).
Example: The sentence from Figure 1 has the category 1, since it contains 8 tokens.

WC (word content) This task consists in a 750-way classification of sentences containing exactly one of 750 pre-selected target words (i.e. the categories correspond to the 750 words). The words are selected based on their frequency ranking in the corpus from which the probing datasets were extracted: the top 2000 words are discarded and the next 750 words are used as task categories. [4]

[4] Conneau et al. (2018) use 1000 target words selected in a similar manner, but since our datasets are smaller, we proportionally decreased this number in order to maintain the same number of training/validation/testing instances per target word.

TreeDepth (dependency tree depth) This task consists in classifying sentences based on the depth of the corresponding dependency trees. The task is defined similarly to Conneau et al. (2018), but dependency trees are used instead of constituent trees. Similarly to the original TreeDepth task, the data is decorrelated with respect to sentence length.
Example: The dependency tree in Figure 1 has a depth of 3, because the path from the root node to any token node contains 3 tokens at most.

TopDeps (top dependency schema) The idea of this task is based on the TopConst task [6] (Conneau et al., 2018), but adapted to dependency trees. The task consists in predicting a multiset of the dependency types labelling the relations between the top-most node (the ROOT's only dependent) and all its children, barring punct relations. The position of a phrase in an English sentence largely determines its grammatical function. In Polish, in turn, word order is relatively free and therefore not a strong determinant of grammatical functions. We thus extract multisets of dependency types, not taking into account the text order of their respective phrases. The extracted multisets roughly correspond to predicate-argument structures. There are 20 classes for each language: the 19 most common top dependency schemata and the {OTHER} class.
Example: The TopDeps class of the sentence in Figure 1 is {aux nsubj obl}.

[6] In the original TopConst task, the classifier learns to detect one of the 19 most common top constructions or OTHER, e.g. the top construction sequence of the tree for [Then][very dark gray letters on a black screen][appeared][.] consists of four constituent labels.

3.2 Compositional Tests

The tests on compositional principles are significantly modified (e.g. TreeDepth, TopDeps, Tense) with respect to Conneau et al. (2018) or designed anew (i.e. Passive and SentType), because the basis for preparing probing datasets is constituted by dependency trees. [5]

[5] We reject the bigram shift task (BShift) as it is applicable only for isolating languages and practically useless for fusional languages with relatively free word order. This task consists in detecting sentences with two random, adjacent words switched. According to Conneau et al. (2018), such a shift generally leads to an erroneous utterance (acceptable sentences can be generated accidentally). However, given a language with less strict word order, the intuition is that the BShift procedure could produce too many correct sentences. A very preliminary case study involving several shift strategies and one sentence (Autorka we wszystkich książkach każe bohaterom szukać tożsamości. 'The author tells the characters in all her books to look for identity.', lit. 'The author in her all books tells the characters to look for identity.') confirmed this intuition, as most of the BShift-modified sentences were accepted by Polish speakers.

Passive (passive voice) This is a binary classification task where the goal is to predict whether a sentence embedding represents a passive voice sentence (the class 1) or an active sentence (the class 0). In the case of complex sentences, only the voice of the matrix (main) clause is detected. [7] In order to identify passive voice sentences, we adhere to the following procedure: the predicate of a passive voice sentence governs an auxiliary and the relation is labelled aux:pass. Furthermore, the predicate (part-of-speech VERB or ADJ) has the features Voice=Pass and VerbForm=Part. The dependency nsubj:pass (passive nominal subject) can be helpful, but as the subject may be dropped in Polish, it is not sufficient.
Example: The active voice sentence in Figure 1 is classified as 0.

[7] The sentence Although the announcement was probably made to show progress in identifying and breaking up terror cells, I don't find the news that the Baathists continue to penetrate the Iraqi government very hopeful. is classified as 0, even if it contains a passive voice subordinate clause.

Tense (grammatical tense) This is a binary classification of sentences by the grammatical tense of their main predicates. The sentence predicates can be marked for the present (the pres class) or past (the past class) grammatical tense. The present tense predicates have the following properties: the UD POS tag VERB and the feature Tense=Pres. The past tense predicates have the following properties: the UD POS tag VERB and the feature Tense=Past.
Example: The sentence in Figure 1 is classified as past.

SubjNum (grammatical number of subjects) In this binary classification task, sentences are classified by the grammatical number of nominal subjects (marked with the UD label nsubj) of main predicates. There are two classes: sing (the UD POS tag NOUN and the feature Number=Sing) and plur (the UD POS tag NOUN and the feature Number=Plur).
Example: The sentence in Figure 1 is categorised as sing.

ObjNum (grammatical number of objects) This binary classification task is analogous to the one above, but this time sentences are classified by the grammatical number of direct objects of main predicates. The classes are again sing to represent the singular nominal objects (the obj label, the NOUN tag, and the feature Number=Sing), and plur for the plural/mass ones (the obj label, the NOUN tag, and the feature Number=Plur).

SentType (sentence type) This is a new probing task consisting in classifying sentences by their types. There are three classes: inter for interrogative sentences (e.g. Do you like him?), imper for imperative sentences (e.g. Get out of here!), and other for declarative sentences (e.g. He likes her.) and exclamatory sentences (e.g. What a liar!).

4 Experiments

4.1 SentEval Toolkit

We use the SentEval toolkit (Conneau and Kiela, 2018) in our experiments. The toolkit provides utility for testing any vector representation of sentences in probing and downstream scenarios. Given a function f mapping a list of sentences to a list of vectors (serving as an interface to the tested sentence embedding model), a task and a dataset (with sentences or pairs of sentences as input data), SentEval performs evaluation in the context of the task. More specifically, it generates vectors for the dataset sentences using f, trains a classifier with the vectors as inputs and task-specific labels as outputs, and evaluates it. Applying an identical evaluation procedure with the same dataset to different sentence embedding models provides a meaningful comparison of the models.

For the purpose of our tests, the probing datasets provided with the toolkit are replaced with our own, the CDS downstream task dataset is added and the SICK dataset is retained. Other SentEval downstream tasks are not used, having no Polish counterparts. In all experiments we use SentEval's Multilayer Perceptron classifier. [8]

[8] With parameters as follows: kfold=10, batch_size=128, nhid=50, optim=adam, tenacity=5, epoch_size=4.

4.2 Probing Datasets

For English and Polish, 9 probing datasets are extracted from Paralela [9] (Pęzik, 2016), the largest Polish-English parallel corpus with nearly 4M sentence pairs. An important objective is to make the probing datasets in both languages maximally similar. The choice of a parallel corpus as their source allows us to draw probing sentences from collections of texts that have analogous distributions of genre, style, sentence complexity etc. Note that we do not extract parallel sentence pairs (sharing common target classes) for individual probing datasets (sentences are often not translated literally), but we construct the English and Polish datasets separately.

[9] http://paralela.clarin-pl.eu
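As an illustration of how the UD-based rules from Section 3.2 can be applied to parsed sentences, here is a minimal sketch for the Passive task using the rule described above: the matrix predicate must govern an aux:pass dependent and carry Voice=Pass and VerbForm=Part. The use of the third-party conllu package and the helper names are illustrative choices, not the authors' released extraction scripts.

```python
# Sketch of a UD-based rule for the Passive probing task (assumes CoNLL-U input
# parsed with the `conllu` package; not the authors' extraction code).
import conllu

def is_passive(sentence):
    """Return True if the matrix clause of a parsed sentence is in passive voice."""
    # the matrix predicate is the ROOT's only dependent (deprel == "root")
    root = next(tok for tok in sentence if tok["deprel"] == "root")
    feats = root["feats"] or {}
    governs_aux_pass = any(
        tok["head"] == root["id"] and tok["deprel"] == "aux:pass"
        for tok in sentence
    )
    return (
        governs_aux_pass
        and root["upos"] in {"VERB", "ADJ"}
        and feats.get("Voice") == "Pass"
        and feats.get("VerbForm") == "Part"
    )

def label_passive(conllu_text):
    """Yield (sentence text, class) pairs: 1 for passive, 0 for active."""
    for sent in conllu.parse(conllu_text):
        text = sent.metadata.get("text", " ".join(tok["form"] for tok in sent))
        yield text, int(is_passive(sent))
```

The analogous rules for Tense, SubjNum and ObjNum differ only in which node is inspected and which UPOS/feature values are required.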
The sentences are tokenised with UDPipe [10] (Straka and Straková, 2017) and POS-tagged and dependency parsed with COMBO [11] (Rybak and Wróblewska, 2018). The UDPipe and COMBO models are trained on the UD English-EWT treebank [12] (Silveira et al., 2014) with 16k trees (254k tokens) and on the Polish PDB-UD treebank [13] (Wróblewska, 2018) with 22k trees (351k tokens). The set of UD-based rules is applied to the dependency-parsed sentences to extract the final probing datasets for both languages.

[10] https://github.com/ufal/udpipe/releases/tag/v1.2.0
[11] https://github.com/360er0/COMBO
[12] https://github.com/UniversalDependencies/UD_English-EWT
[13] http://git.nlp.ipipan.waw.pl/alina/PDBUD

Following Conneau et al. (2018), for the probing tasks constructed by determining selected properties of a certain dependency tree node (e.g. the main predicate's tense, the direct object's number, etc.), the division into training, validation and test sets ensures that all data instances where the relevant token of the sentence (the target token) bears the same word form are not distributed into different sets. For example, all SubjNum instances where the subject phrase is headed by the token cats (and the plur class is determined based on the features of this token) are assigned to the same set.

For each probing dataset, only relevant sentences are included (sentences with no subject are irrelevant for SubjNum, utterances with no main predicate in the present/past tense are irrelevant for Tense etc.). Moreover, the target tokens are filtered based on their frequency (the most and least frequent are discarded) and the number of occurrences of any target token is limited (to prevent the more frequent ones from dominating the datasets). Finally, the datasets are balanced with relation to the target class.

With the above restrictions implemented, we are able to extract datasets consisting of 90k examples each (75k for training, 7.5k for validation and testing). The dataset sizes are smaller than the 120k examples proposed by Conneau et al. (2018), but remain in the same order of magnitude. The lower number of examples per dataset is due to the fact that we strive to build comparable datasets for both investigated languages based on the parallel corpus.

4.3 Downstream Datasets

Two datasets for evaluation of compositional distributional semantic models are used in our experiments. The SICK corpus [14] (Bentivogli et al., 2014) consists of 10k pairs of English sentences. Each sentence pair is human-annotated for relatedness in meaning and entailment. The relatedness score indicates the extent to which the meanings of two sentences are related and is calculated as the average of ten human ratings collected for this sentence pair on the 5-point Likert scale. The entailment relation between two sentences, in turn, is labelled with entailment, contradiction, or neutral, selected by the majority of human annotators.

[14] http://clic.cimec.unitn.it/composes/materials/SICK.zip

CDSCorpus [15] (Wróblewska and Krasnowska-Kieraś, 2017) is a comparable corpus of 10k pairs of Polish sentences human-annotated for relatedness and entailment. The degree of semantic relatedness between two sentences is calculated as the average of six human ratings on the 0–5-point scale. As an entailment relation between two sentences does not have to be symmetric, sentence pairs are annotated with bi-directional entailment labels, i.e. pairs of entailment, contradiction, and neutral.

[15] http://git.nlp.ipipan.waw.pl/Scwad/SCWAD-CDSCorpus
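The word-form-aware division described in Section 4.2 above — all instances sharing a target word form land in the same set — can be sketched as a grouped assignment. This is an illustrative reconstruction under stated assumptions (the quota sizes follow the figures given above), not the authors' dataset-building code.

```python
# Sketch of the split that keeps all instances with the same target word form
# in a single set (train/valid/test); an illustrative reconstruction only.
import random
from collections import defaultdict

def split_by_target_form(instances, seed=0, quotas=(75000, 7500, 7500)):
    """instances: list of (sentence, target_word_form, label) triples."""
    groups = defaultdict(list)
    for inst in instances:
        groups[inst[1].lower()].append(inst)  # group by target word form

    group_list = list(groups.values())
    random.Random(seed).shuffle(group_list)

    splits = {"train": [], "valid": [], "test": []}
    names = ["train", "valid", "test"]
    for group in group_list:
        # put the whole group into the first split that still has room
        for name, quota in zip(names, quotas):
            if len(splits[name]) + len(group) <= quota:
                splits[name].extend(group)
                break
    return splits
```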
4.4 Sentence Embeddings

Three types of sentence embeddings are tested in our experiments: (1) sentence embeddings obtained with max-pooling and mean-pooling over pre-trained word embeddings or contextualised word embeddings, (2) sentence embeddings estimated on small comparable corpora, and (3) pre-trained sentence embeddings estimated on large monolingual or multilingual corpora.

Max/Mean-pool Sentence Embeddings Words can be represented as continuous vectors in a low-dimensional space, i.e. word embeddings. Word embeddings are assumed to capture linguistic (e.g. morphological, syntactic, semantic) properties of words. Recently, they are often learnt as part of a neural network trained on an unsupervised or semi-supervised objective task using massive amounts of data (e.g. Mikolov et al., 2013; Grave et al., 2018). [16] In our experiments, we test FASTTEXT embeddings [17] (Grave et al., 2018) and contextualised word embeddings provided with the multi-layer bidirectional transformer encoder BERT [18] (Devlin et al., 2018) for English and Polish. [19] Apart from the FASTTEXT and BERT models, we use parts of the dependency parsing models of COMBO to generate sentence embeddings. COMBO has a BiLSTM-based module that produces contextualised word embeddings based on concatenations of word level embeddings and character level embeddings. As the contextualised word embeddings are originally used to predict dependency trees, they should be linguistic information-rich. Since there is some overlap between the PDB-UD treebank (used to train the COMBO parsing model for Polish) and CDSCorpus (the source of downstream datasets for Polish), a separate COMBO model [20] is trained on PDB-UD data without the overlapping sentences. The model is used to obtain the embeddings for both probing and downstream evaluations. [21]

[16] Embeddings can also be estimated by dimensionality reduction on a co-occurrence counts matrix (e.g. Pennington et al., 2014).
[17] Pre-trained models from https://fasttext.cc.
[18] Pre-trained language model from https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip.
[19] We also tested BPEmb embeddings (Heinzerling and Strube, 2018) from https://nlp.h-its.org/bpemb. Sentence embeddings estimated on these word embeddings were of a comparable or worse quality, so we do not give the results.
[20] http://mozart.ipipan.waw.pl/~alina/Polish_dependency_parsing_models/190520_COMBO_PDBUD_noCDS_nosem.pkl
[21] This overlap is in fact only relevant for downstream tasks evaluation. Therefore, for creating the probing datasets, a model based on the full PDB-UD treebank is used.

For all three models listed above, sentence embeddings are obtained by mean or max pooling over individual word embeddings. For FASTTEXT and COMBO, the UDPipe tokenisation of the probing sentences is used and a sequence of embedding vectors is obtained by model lookup and by reading the outputs of the parser's BiLSTM module respectively. In the case of BERT (which uses its own tokenisation mechanism), whole sentences are passed to the module and the outputs of its penultimate layer are treated as token embeddings.

Small Corpora-based Sentence Embeddings English and Polish sentence embeddings are estimated on the Paralela corpus. The sentences that are included in any probing dataset are to be excluded from any data used for training sentence embeddings. Furthermore, the Paralela corpus contains not only 1-to-1 sentence alignments, but also 1-to-many or even many-to-many ones. As we aim at estimating sentence embedding models, only proper sentences are selected from the corpus. The English and Polish sentence embedding models are trained on 3M sentences with the SENT2VEC library [22] (Pagliardini et al., 2018). The SENT2VEC models are estimated with a neural architecture which resembles the CBOW model architecture by Mikolov et al. (2013). The tested models (SENT2VEC-NS) are estimated on unigrams and bigrams with the loss function coupled with negative sampling, to improve training efficiency.

[22] https://github.com/epfml/sent2vec
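A minimal sketch of the max/mean pooling step from the Max/Mean-pool subsection above — turning a sequence of token vectors into a single sentence vector; the token-embedding lookup is a placeholder for FASTTEXT lookups, BERT penultimate-layer outputs or COMBO BiLSTM states.

```python
# Max- and mean-pooled sentence embeddings over per-token vectors.
# `tokens` stands in for any per-token embedding matrix, shape (num_tokens, dim).
import numpy as np

def pool_sentence(token_vectors: np.ndarray, mode: str = "mean") -> np.ndarray:
    if mode == "mean":
        return token_vectors.mean(axis=0)
    if mode == "max":
        return token_vectors.max(axis=0)
    raise ValueError(f"unknown pooling mode: {mode}")

# toy usage: 8 tokens, 300-dimensional word vectors
tokens = np.random.default_rng(0).normal(size=(8, 300))
sentence_mean = pool_sentence(tokens, "mean")  # shape (300,)
sentence_max = pool_sentence(tokens, "max")    # shape (300,)
```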
Pre-trained Sentence Embeddings We test English sentence embeddings provided by the pre-trained SENT2VEC and USE models, and multilingual sentence embeddings generated by the LASER model.

The SENT2VEC-ORIG model [23] trained on the Toronto Book corpus [24] (70M sentences) outputs 700-dimensional sentence embeddings. The Universal Sentence Encoder model [25] (USE, Cer et al., 2018) was estimated in a multi-task learning scenario on a variety of data sources [26] with a Transformer encoder. It takes a variable length English text (e.g. a sentence, phrase, or short paragraph) as input and produces a 512-dimensional vector. The Language-Agnostic SEntence Representations model [27] (LASER, Artetxe and Schwenk, 2018) was trained on 223M parallel sentences (93 languages) from various sources. The encoder is implemented as a 5-layer BiLSTM network that represents a sentence as a 1,024-dimensional vector (max-pooling over the last hidden states of the BiLSTM).

[23] https://drive.google.com/file/d/0B6VhzidiLvjSdENLSEhrdWprQ0k
[24] http://www.cs.toronto.edu/~mbweb/
[25] https://tfhub.dev/google/universal-sentence-encoder-large/3
[26] Estimated on Wikipedia, web news, web question-answer pages, discussion forums, and the Stanford Natural Language Inference corpus (SNLI, Bowman et al., 2015).
[27] https://github.com/facebookresearch/LASER

5 Results

Results reported by SentEval are summarised in Table 1. The best result for each task in each language is highlighted in grey. For almost all probing tasks, the most accurate embedding is one of the two COMBO-based representations. This is not surprising as the contextualised vector representations produced by COMBO are learnt in the context of dependency parsing. Moreover, the target classes in the probing tasks are derived from trees produced by a parser that uses virtually the same neural model, which can be considered a kind of information leak.

Task | Lang | Measure | FASTTEXT-MAX | FASTTEXT-MEAN | BERT-MAX | BERT-MEAN | COMBO-MAX | COMBO-MEAN | SENT2VEC-NS | SENT2VEC-ORIG | LASER | USE
SentLen | E | a | 52.55 | 72.27 | 72.66 | 82.13 | 85.03 | 87.38 | 71.56 | 64.76 | 85.98 | 60.00
SentLen | P | a | 52.63 | 67.44 | 70.79 | 82.19 | 84.46 | 86.31 | 65.15 | — | 86.73 | —
WC | E | a | 24.44 | 46.73 | 35.24 | 45.53 | 9.39 | 11.05 | 59.96 | 79.23 | 59.79 | 43.11
WC | P | a | 19.83 | 45.84 | 38.56 | 43.60 | 23.04 | 26.23 | 63.85 | — | 49.03 | —
TreeDepth | E | a | 29.91 | 33.00 | 33.97 | 38.20 | 49.08 | 51.87 | 33.92 | 31.03 | 39.48 | 31.09
TreeDepth | P | a | 26.99 | 30.12 | 34.43 | 37.81 | 44.96 | 47.35 | 32.84 | — | 40.04 | —
TopDeps | E | a | 60.49 | 71.11 | 78.20 | 79.33 | 93.99 | 93.87 | 75.77 | 65.31 | 83.33 | 63.88
TopDeps | P | a | 65.45 | 70.67 | 71.68 | 75.28 | 88.16 | 88.53 | 73.44 | — | 78.84 | —
Passive | E | a | 84.13 | 89.47 | 89.77 | 92.40 | 98.48 | 98.41 | 88.73 | 89.04 | 92.85 | 86.61
Passive | P | a | 85.19 | 91.92 | 92.16 | 94.77 | 98.41 | 98.71 | 92.44 | — | 95.37 | —
Tense | E | a | 75.04 | 84.47 | 89.32 | 90.89 | 96.65 | 96.64 | 83.19 | 85.25 | 92.19 | 85.64
Tense | P | a | 81.56 | 88.89 | 93.73 | 96.09 | 97.35 | 97.47 | 87.36 | — | 96.87 | —
SubjNum | E | a | 73.87 | 81.43 | 88.43 | 90.75 | 93.19 | 93.37 | 82.27 | 80.88 | 94.21 | 81.65
SubjNum | P | a | 76.73 | 87.01 | 89.89 | 91.51 | 94.20 | 95.03 | 87.84 | — | 93.79 | —
ObjNum | E | a | 71.75 | 79.24 | 85.16 | 86.89 | 93.23 | 94.71 | 77.23 | 80.12 | 89.33 | 79.61
ObjNum | P | a | 69.41 | 76.05 | 80.24 | 82.64 | 90.27 | 90.31 | 74.77 | — | 82.53 | —
SentType | E | a | 96.23 | 96.20 | 97.39 | 97.76 | 96.85 | 96.04 | 97.17 | 93.76 | 97.84 | 85.25
SentType | P | a | 90.61 | 96.09 | 98.36 | 98.57 | 98.53 | 98.56 | 98.09 | — | 98.39 | —
Relatedness | E | p | 75.71 | 76.02 | 74.23 | 76.54 | 58.94 | 59.38 | 73.43 | 79.81 | 84.54 | 86.86
Relatedness | E | s | 69.35 | 69.20 | 68.61 | 69.54 | 58.35 | 58.59 | 67.97 | 70.64 | 79.03 | 80.80
Relatedness | P | p | 76.10 | 78.06 | 78.46 | 83.08 | 77.40 | 77.44 | 76.53 | — | 88.09 | —
Relatedness | P | s | 77.01 | 79.31 | 78.91 | 83.65 | 77.81 | 77.98 | 76.72 | — | 89.30 | —
Entailment | E | a | 76.72 | 76.86 | 77.71 | 77.11 | 72.82 | 72.58 | 78.59 | 78.26 | 83.26 | 81.77
Entailment | P | a | 86.10 | 87.40 | 86.70 | 83.90 | 84.70 | 86.10 | 83.80 | — | 87.80 | —

Table 1: Probing and downstream task results. Languages: P=Polish, E=English, measures: a=accuracy, p=Pearson's r, s=Spearman's ρ. All measures are expressed in %.

With the COMBO models excluded from the ranking due to their obvious handicap, the best-performing sentence embeddings (shown in boldface) for 17 task–language pairs out of 22 are yielded by LASER. The exceptions are ObjNum and SentType for Polish (where the advantage of BERT-MEAN is so small it might be insignificant), Relatedness for English (suggesting that a comparable USE model could beat LASER in the Polish version of the task as well) and WC (where SENT2VEC performs visibly better than all others, even if it is trained on a relatively small corpus).

An interesting observation is that among the pooled embeddings, the MEAN variants quite consistently outperform their MAX counterparts.

Figure 2 visualises the results yielded by selected models in the particular tasks. The models shown are BERT-MEAN (the best pooled model), SENT2VEC-NS (trained on the Paralela corpus) and LASER (best-performing apart from COMBO, pre-trained on massive multilingual data). The plots are very similar in shape, the only striking difference being the discrepancy in WC results, with LASER and SENT2VEC-NS faring similarly (and better than BERT-MEAN) for English and SENT2VEC-NS yielding visibly the best results for Polish.

We also measure the correlations between results for Polish and English in two ways. First, for each embedding model we compare the results it yielded in all Polish tasks and all English tasks. Second, for each task type we compare the results obtained using all models in the Polish and English variant of the task. [28] The corresponding correlation coefficients are plotted in Figure 3.

[28] The SENT2VEC-ORIG and USE models are excluded from both calculations as they were only tested for English.

All the per-model correlations are high, which strongly suggests that given embeddings encode a given property similarly well (or poorly) relative to other properties regardless of the language.

[Figure 2: two grouped bar charts (per-task results on a 0–100 scale) for BERT-MEAN, SENT2VEC-NS and LASER; left panel English, right panel Polish]

Figure 2: Results in probing and downstream tasks for 3 selected embedding models (left: English, right: Polish). The measure is accuracy (except for Relatedness, where Spearman’s ρ is shown). All measures are expressed in %.
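The per-model and per-task correlations plotted in Figure 3 can be computed directly from a results table like Table 1. Below is a small sketch under the assumption that the results are kept in a language → task → model mapping; the variable and function names are illustrative, not the authors' analysis code.

```python
# Sketch of the Polish/English correlation analysis behind Figure 3.
# `results` maps language -> {task: {model: score}}.
from scipy.stats import pearsonr, spearmanr

def per_model_correlations(results, models, tasks):
    """Correlate each model's Polish vs English scores across all tasks."""
    corrs = {}
    for m in models:
        pl = [results["P"][t][m] for t in tasks]
        en = [results["E"][t][m] for t in tasks]
        r_p, _ = pearsonr(pl, en)
        r_s, _ = spearmanr(pl, en)
        corrs[m] = (r_p, r_s)
    return corrs

def per_task_correlations(results, models, tasks):
    """Correlate Polish vs English scores of all models within each task."""
    corrs = {}
    for t in tasks:
        pl = [results["P"][t][m] for m in models]
        en = [results["E"][t][m] for m in models]
        r_p, _ = pearsonr(pl, en)
        r_s, _ = spearmanr(pl, en)
        corrs[t] = (r_p, r_s)
    return corrs
```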

[Figure 3: bar charts of correlation coefficients (Spearman's ρ and Pearson's r, 0–1 scale); left panel one bar per embedding model, right panel one bar per task]

Figure 3: Correlation (measured by Spearman's ρ and Pearson's r) between results for Polish and English (left: per model, right: per task).

In the case of per-task correlations, there are three tasks with visibly lower correlations: SentType and the two downstream tasks. Therefore, for these tasks, the relative performance of individual models differs more between languages. For the downstream tasks this might be partially due to the fact that their respective datasets were created entirely independently and are expected to differ more. As far as SentType is concerned, the accuracies obtained for this task are generally very high and most of them fit within a small range.

6 Related Work

Our study follows a research trend in exploring sentence embeddings by means of probing methods, initiated by Shi et al. (2016) and Adi et al. (2017), and continued by Conneau et al. (2018). Investigating NMT systems, Shi et al. (2016) found out that LSTM-based encoders can learn source-language syntax, storing different syntactic properties (e.g. voice, tense, top level constituents, part-of-speech tags) in different layers of NMT models. Adi et al. (2017) designed probing tasks for surface properties of sentences (i.e. sentence length, word content, and word order). Two types of sentence embeddings were tested: averaging of CBOW word embeddings and sentence representations output by an LSTM encoder. Conneau et al. (2018) carried out a series of large-scale experiments on understanding English sentence embeddings with human-validated upper bounds for all probing tasks. They designed 10 probing tasks capturing simple linguistic properties of sentences, tested various sentence encoding architectures (i.e. BiLSTM and gated convolutional network), and various training objectives (e.g. neural machine translation, autoencoding, SkipThought). Following the mentioned approaches, we examine how much linguistic information is retained in sentence embeddings using 9 similar probing tasks. However, Universal Dependency trees instead of constituent trees are the core of our probing tasks. Furthermore, our experiments are carried out on two contrasting languages, to verify, in another language scenario, the validity of the evaluation method proposed for English.

Ettinger et al. (2018) considered a very important aspect of sentence meaning – composition. They proposed a method of assessing compositional meaning content in sentence embeddings on the examples of semantic role and negation phenomena. This study has drawn our attention to the compositional dimension of our probing tasks. Related works by Linzen et al. (2016) and Warstadt and Bowman (2019) proposed evaluation of sentence encoders (e.g. LSTMs, transformers) in terms of their ability to learn grammatical information, e.g. to assess sentences as grammatically correct or not (i.e. acceptability judgments).

Finally, several studies were devoted to exploring morphosyntactic properties of sentence embeddings in neural machine translation systems (e.g. Shi et al., 2016; Belinkov et al., 2017).

7 Conclusion

We presented a methodology of empirical research on the retention of linguistic information in sentence embeddings using probing and downstream tasks. In the probing-based scenario, a set of language-independent tests was designed and probing datasets were generated for two contrasting languages – English and Polish. The procedure of generating probing datasets is based on the Universal Dependencies schema. It is thereby universal for all languages with a UD treebank on which a natural language pre-processing system can be trained. In the downstream-based scenario, the publicly available datasets for semantic relatedness and entailment were used.

We performed a series of probing and downstream experiments with three types of sentence embeddings in the SentEval environment, followed by a thorough analysis of the linguistic content of sentence embeddings. We found out that the COMBO-based embeddings, designed to convey morphosyntax, encode linguistic information in the most accurate way. Aside from COMBO embeddings, linguistic information is retained most exactly in the recently proposed LASER sentence embeddings, provided by an encoder designed with a relatively simple BiLSTM architecture, but estimated on tremendous multilingual data. Further research is required to find out where the success of LASER embeddings lies: in the embedding size, in the magnitude of training data, or maybe in the multitude of used languages.

Acknowledgments

The research presented in this paper was supported by SONATA 8 grant no 2014/15/D/HS2/03486 from the National Science Centre Poland.

References

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In Proceedings of the International Conference on Learning Representations (ICLR 2017).

Mikel Artetxe and Holger Schwenk. 2018. Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond. CoRR, abs/1812.10464.

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do Neural Machine Translation Models Learn about Morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872. Association for Computational Linguistics.

Luisa Bentivogli, Raffaella Bernardi, Marco Marelli, Stefano Menini, Marco Baroni, and Roberto Zamparelli. 2014. SICK through the SemEval Glasses. Lesson learned from the evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. Journal of Language Resources and Evaluation, 50:95–124.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642. Association for Computational Linguistics.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Brian Strope, and Ray Kurzweil. 2018. Universal Sentence Encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 169–174. Association for Computational Linguistics.

Alexis Conneau and Douwe Kiela. 2018. SentEval: An Evaluation Toolkit for Universal Sentence Representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), pages 1699–1704. European Language Resources Association.

Alexis Conneau, Germán Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single \$&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR, abs/1810.04805.

Allyson Ettinger, Ahmed Elgohary, Colin Phillips, and Philip Resnik. 2018. Assessing Composition in Sentence Vector Representations. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1790–1801. Association for Computational Linguistics.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning Word Vectors for 157 Languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), pages 3483–3487. European Language Resources Association.

Benjamin Heinzerling and Michael Strube. 2018. BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), pages 2989–2993. European Language Resources Association.

Pauline Jacobson. 2014. Compositional Semantics. An Introduction to the Syntax/Semantics Interface. Oxford Textbooks in Linguistics. Oxford University Press.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Neural Information Processing Systems (NIPS).

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan T. McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A Multilingual Treebank Collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 1659–1666.

Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. 2018. Unsupervised Learning of Sentence Embeddings Using Compositional n-Gram Features. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 528–540. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. Association for Computational Linguistics.

Piotr Pęzik. 2016. Exploring Phraseological Equivalence with Paralela. In Polish-Language Parallel Corpora, pages 67–81. Instytut Lingwistyki Stosowanej UW, Warsaw.

Piotr Rybak and Alina Wróblewska. 2018. Semi-Supervised Neural System for Tagging, Parsing and Lematization. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 45–54. Association for Computational Linguistics.

Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does String-Based Neural MT Learn Source Syntax? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1526–1534. Association for Computational Linguistics.

Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer, and Christopher D. Manning. 2014. A Gold Standard Dependency Corpus for English. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), pages 2897–2904. European Language Resources Association.

Milan Straka and Jana Straková. 2017. Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88–99, Vancouver, Canada. Association for Computational Linguistics.

Alex Warstadt and Samuel R. Bowman. 2019. Grammatical Analysis of Pretrained Sentence Encoders with Acceptability Judgments. CoRR, abs/1901.03438.

Alina Wróblewska. 2018. Extended and Enhanced Polish Dependency Bank in Universal Dependencies Format. In Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), pages 173–182. Association for Computational Linguistics.

Alina Wróblewska and Katarzyna Krasnowska-Kieraś. 2017. Polish evaluation dataset for compositional distributional semantics models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 784–792. Association for Computational Linguistics.
