
How Much Competence Is There in Performance? Assessing the Distributional Hypothesis in Word Bigrams

Johann Seltmann, Luca Ducceschi, Aurélie Herbelot
University of Potsdam♦, University of Trento♣, University of Trento♠
[email protected] [email protected] [email protected]

♦Department of Linguistics, ♣Dept. of Psychology and Cognitive Science, ♠Center for Mind/Brain Sciences, Dept. of Information Engineering and Computer Science

Abstract

The field of Distributional Semantics (DS) is built on the 'distributional hypothesis', which states that meaning can be recovered from statistical information in observable language. It is however notable that the computations necessary to obtain 'good' DS representations are often very involved, implying that if meaning is derivable from linguistic data, it is not directly encoded in it. This prompts fundamental questions about language acquisition: if we regard text data as linguistic performance data, what kind of 'innate' mechanisms must operate over that data to reach competence? In other words, how much of semantic acquisition is truly data-driven, and what must be hard-encoded in a system's architecture? In this paper, we introduce a new methodology to pull those questions apart. We use state-of-the-art computational models to investigate the amount and nature of transformations required to perform particular semantic tasks. We apply that methodology to one of the simplest structures in language: the word bigram, giving insights into the specific contribution of that linguistic component.¹

¹ Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The traditional notions of performance and competence come from Chomsky's work on syntax (Chomsky, 1965), where much emphasis is put on the mental processes underpinning language acquisition. Chomsky posits the existence of a Universal Grammar (UG), innate in the human species, which gets specialised to the particular language of a speaker. By exposure to the imperfect utterances of their community (referred to as performance data), an individual configures their UG to reach some ideal knowledge of that community's language, thereby reaching competence.

The present paper borrows the notions of 'performance', 'competence' and 'innateness' to critically analyse the semantic 'acquisition' processes simulated by Distributional Semantics models (DSMs). Our goal is to tease apart how much of their observed competence is due to the performance data they are exposed to, and how much is contributed by 'innate' properties of those systems, i.e. by their specific architectures.

DSMs come in many shapes. Traditional unsupervised architectures rely on counting co-occurrences of words with other words or documents (Turney and Pantel, 2010; Erk, 2012; Clark, 2012). Their neural counterparts, usually referred to as 'predictive models' (Baroni et al., 2014), learn from a language modelling task over raw linguistic data (e.g. Word2Vec, Mikolov et al., 2013; GloVe, Pennington et al., 2014). The most recent language embedding models (Vaswani et al., 2017; Radford et al., 2018), ELMo (Peters et al., 2018), or BERT (Devlin et al., 2018) compute contextualised word representations and sentence representations, yielding state-of-the-art results on sentence-related tasks, including translation. In spite of their differences, all models claim to rely on the Distributional Hypothesis (Harris, 1954; Firth, 1957), that is, the idea that distributional patterns of occurrence in language correlate with specific aspects of meaning.
The Distributional Hypothesis, as stated in the DSM literature, makes semantic acquisition sound like an extremely data-driven procedure. But we should ask to what extent meaning is indeed to be found in statistical patterns. The question is motivated by the observation that the success of the latest DSMs relies on complex mechanisms being applied to the underlying linguistic data or the task at hand (e.g. attention, self-attention, negative sampling, particular objective functions). Such mechanisms have been shown to apply very significant transformations to the original input data: for instance, the Word2Vec objective function introduces parallelisms in the space that make it perform particularly well on analogy tasks (Gittens et al., 2017). Models such as BERT apply extensive processing to the input through stacks of encoders. So while meaning can be derived from training regimes involving raw data, it is not directly encoded in it.

Interestingly, Harris himself (Harris, 1954) points out that a) distributional structure is in no simple relation to the structure of meaning; b) different distributions in language encode different phenomena with various levels of complexity. We take both points as highlighting the complex relation between linguistic structure and the cognitive mechanisms that must be applied to the raw input to retrieve semantic information. The point of our paper is to understand better what is encoded in observable linguistic structures (at the level of raw performance data), and how much distortion of the input needs to be done to acquire meaning (i.e. what cognitive mechanisms are involved in learning semantic competence).

In the spirit of Harris, we think it is worth investigating the behaviour of specific components of language and understanding which aspects of meaning they encode, and to what extent. The present work illustrates our claim by presenting an exploratory analysis of one of the simplest recoverable structures in corpora: the word bigram. Our methodology is simple: we test the raw distributional behaviour of this constituent over different tasks, comparing it to a state-of-the-art model. We posit that each task embodies a specific aspect of competence. By inspecting the difference in performance between the simplest and more complex models, we get some insight into the way a particular structure (here, the bigram) contributes to the acquisition of specific linguistic faculties. The failure of raw linguistic data to encode a particular competence points at some necessary, 'innate' constraint of the acquisition process, which might be encoded in a model's architecture as well as in the specific task that it is required to solve.

In what follows, we propose to investigate the behaviour of the bigram with respect to three different levels of semantic competence, corresponding to specific tasks from the DS literature: a) word relatedness; b) sentence relatedness; c) sentence autoencoding (Turney, 2014; Bowman et al., 2016). The first two tasks test to which extent the linguistic structure under consideration encodes topicality: if it does, it should prove able to cluster together similar lexical items, both in isolation and as the constituents of sentences. The third task evaluates the ability of a system to build a sentence representation and, from that representation alone, recover the original utterance. That is, it tests the distinguishability of representations. Importantly, distinguishability is at odds with the relatedness tasks, which favour clusterability. The type of space learned from the raw data will necessarily favour one or the other. Our choice of tasks thus allows us to understand which type of space can be learned from the bigram: we will expand on this in our discussion (§6).²

² Our code for this investigation can be found at https://github.com/sejo95/DSGeneration.git.

2 Related work

The Distributional Hypothesis is naturally encoded in count-based models of Distributional Semantics (DS), which build lexical representations by gathering statistics over word co-occurrences. Over the years, however, these simple models have been superseded by so-called predictive models such as Word2Vec (Mikolov et al., 2013) or FastText (Bojanowski et al., 2017), which operate via language modelling tasks. These neural models involve sets of more or less complex procedures, from subsampling to negative sampling and subword chunking, which give them a clear advantage over methods that stick more closely to distributions in corpora.
At the level of higher constituents, the assumption is that a) additional composition functions must be learned over the word representations to generate meaning 'bottom-up' (Clark, 2012; Erk, 2012); b) the semantics of a sentence influences the meaning of its parts 'top-down', leading to a notion of contextualised word semantics, retrievable by yet another class of distributional models (Erk and Padó, 2008; Erk et al., 2010; Thater et al., 2011; Peters et al., 2018). Bypassing the word level, some research investigates the meaning of sentences directly. Following from the classic work on seq2seq architectures and attention, various models have been proposed to generate sentence embeddings through highly parameterised stacks of encoders (Vaswani et al., 2017; Radford et al., 2018; Devlin et al., 2018).

This very brief overview of work in DS shows the variety of models that have been proposed to encode meaning at different levels of constituency, building on more and more complex mechanisms. Aside from those efforts, much research has also focused on finding ideal hyperparameters for the developed architectures (Bullinaria and Levy, 2007; Baroni et al., 2014), ranging from the amount of context taken into account by the model to the type of task it should be trained on. Overall, it is fair to say that if meaning can be retrieved from raw language data, the process requires knowing the right transformations to apply to that data, and the right parametrisation for those transformations, including the type of linguistic structure the model should focus on. One important question remains for the linguist to answer: how much semantics was actually contained in corpus statistics, and where?
We attempt to set up a methodology to answer this question, and use two different types of tasks (relatedness and autoencoding) to support our investigation.

While good progress has been made in the DS community on modelling relatedness, distinguishability has received less attention. Some approaches to autoencoding suggest using syntactic elements (such as syntax trees) for decomposition of an embedding vector into a sentence (Dinu and Baroni, 2014; Iyyer et al., 2014). However, some research suggests that this may not be necessary and that continuous bag-of-words representations and n-gram models contain enough word order information to reconstruct sentences (Schmaltz et al., 2016; Adi et al., 2017). Our own methodology is inspired by White et al. (2016b), who decode a sentence vector into a bag of words using a greedy search over the vocabulary. In order to also recover word order, those authors expand their original system in White et al. (2016a) by combining it with a traditional model, which they use to reconstruct the original sentence from the bag of words.

3 Methodology

3.1 A bigram model of Distributional Semantics

We construct a count-based DS model by taking bigrams as our context windows. Specifically, for a word w_i, we construct an embedding vector v_i which has one entry for each word w_j in the model. The entry v_ij then contains the bigram probability p(w_j|w_i).

We talked in our introduction of 'raw' linguistic structure without specifying at which level it is to be found. Following Church and Hanks (1990), we consider the joint probability of two events, relative to their probability of occurring independently, to be a good correlate of the fundamental psycholinguistic notion of association. As per previous work, we thus assume that a PMI-weighted DS space gives the most basic representation of the information contained in the structure of interest. For our bigram model, the numerator and denominator of the PMI calculation exactly correspond to elements in our bigram matrix B weighted by elements of our unigram vector U:

    pmi(w_i, w_j) ≡ log ( p(w_j|w_i) / p(w_j) )    (1)

In practice, we use PPMI weighting and map every negative PMI value to 0.
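For concreteness, the construction just described can be sketched as follows. This is a minimal illustration assuming a corpus given as a list of token lists; the function and variable names are ours, not taken from the released code.

```python
import math
from collections import Counter, defaultdict

BOS, EOS = "<s>", "</s>"

def bigram_ppmi_vectors(sentences, vocab_size=50000):
    """Build sparse PPMI-weighted bigram vectors, starting from p(w_j | w_i)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = [BOS] + sent + [EOS]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    # restrict to the most frequent words, plus sentence-boundary tokens
    vocab = {w for w, _ in unigrams.most_common(vocab_size)} | {BOS, EOS}
    total = sum(c for w, c in unigrams.items() if w in vocab)

    vectors = defaultdict(dict)
    for (w1, w2), count in bigrams.items():
        if w1 not in vocab or w2 not in vocab:
            continue
        p_cond = count / unigrams[w1]     # p(w2 | w1), an entry of the bigram matrix B
        p_w2 = unigrams[w2] / total       # p(w2), an entry of the unigram vector U
        pmi = math.log(p_cond / p_w2)
        if pmi > 0:                       # PPMI: negative PMI values are mapped to 0
            vectors[w1][w2] = pmi
    return vectors

# Example (toy corpus):
# vecs = bigram_ppmi_vectors([["the", "cat", "sat"], ["the", "dog", "ran"]])
# vecs["the"] is then a sparse PPMI-weighted vector over following words.
```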
Word relatedness: following standard practice, we compute relatedness scores as the cosine similarity of two PPMI-weighted word vectors, cos(w_i, w_j). For evaluation, we use the MEN test collection (Bruni et al., 2014), which contains 3000 word pairs annotated for relatedness; we compute the Spearman ρ correlation between system and human scores.

Sentence relatedness: we follow the proof given by Paperno and Baroni (2016), indicating that the meaning of a phrase ab in a count-based model with PMI weighting is roughly equivalent to the addition of the PMI-weighted vectors of a and b (shifted by some usually minor correction). Thus, we can compute the similarity of two sentences S_1 and S_2 as:

    Σ_{w_i ∈ S_1} Σ_{w_j ∈ S_2} cos(w_i, w_j)    (2)

We report sentence relatedness scores on the SICK dataset (Marelli et al., 2014), which contains 10,000 utterance pairs annotated for relatedness. We calculate the relatedness for each pair in the dataset and order the pairs according to the results. We then report the Spearman correlation between the results of the model and the ordering of the dataset.
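A sketch of how these two evaluations can be run, assuming the sparse PPMI vectors built above and gold-annotated pairs loaded from MEN- or SICK-style files (the helper names are ours):

```python
import math
from scipy.stats import spearmanr

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts word -> weight)."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def word_relatedness(vectors, annotated_pairs):
    """Spearman rho between cosine scores and human ratings, as for MEN."""
    system = [cosine(vectors.get(w1, {}), vectors.get(w2, {}))
              for w1, w2, _ in annotated_pairs]
    gold = [score for _, _, score in annotated_pairs]
    return spearmanr(system, gold).correlation

def sentence_relatedness(vectors, s1, s2):
    """Equation (2): sum of pairwise cosines between the words of the two sentences."""
    return sum(cosine(vectors.get(wi, {}), vectors.get(wj, {}))
               for wi in s1 for wj in s2)
```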
Autoencoding of sentences: White et al. (2016b) encode a sentence as the sum of the vectors of the words of that sentence. They decode that vector (the target) back into a bag of words in two steps. The first step, greedy addition, begins with an empty bag of words. In each step a word is selected such that the sum of the word vectors in the bag and the vector of the candidate item is closest to the target (using Euclidean distance as the similarity measure). This is repeated until no new word could bring the sum closer to the target than it already is. The second step, n-substitution, begins with the bag of n words found in the greedy addition. For each subbag of size m ≤ n, it considers replacing it with another possible subbag of size ≤ m. The replacement that brings the sum closest to the target vector is chosen. We follow the same procedure, except that we only consider subbags of size 1, i.e. substitution of single words, for computational efficiency. In addition, the bigram component of our model B lets us turn the bags of words back into an ordered sequence. We use a beam search to perform this step, following Schmaltz et al. (2016).³

³ Note that although a bigram language model would normally perform rather poorly on sentence generation, having a constrained bag of words to reorder makes the task considerably simpler.
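The decoding procedure (greedy addition followed by single-word substitution) can be sketched roughly as below, assuming dense numpy word vectors; this is our own simplified illustration, not the released implementation.

```python
import numpy as np

def greedy_addition(target, vectors):
    """Grow a bag of words whose vector sum approaches the target (Euclidean distance)."""
    bag, total = [], np.zeros_like(target, dtype=float)
    best = np.linalg.norm(target - total)
    while True:
        chosen, dist = None, best
        for word, vec in vectors.items():          # brute-force scan over the vocabulary
            d = np.linalg.norm(target - (total + vec))
            if d < dist:
                chosen, dist = word, d
        if chosen is None:                          # no word brings the sum closer: stop
            return bag
        bag.append(chosen)
        total += vectors[chosen]
        best = dist

def substitute_single_words(target, bag, vectors):
    """Try replacing each word in the bag by one other word if it reduces the distance."""
    total = sum((vectors[w] for w in bag), np.zeros_like(target, dtype=float))
    for i, word in enumerate(bag):
        base = total - vectors[word]
        candidate = min(vectors,
                        key=lambda c: np.linalg.norm(target - (base + vectors[c])))
        if np.linalg.norm(target - (base + vectors[candidate])) < np.linalg.norm(target - total):
            bag[i] = candidate
            total = base + vectors[candidate]
    return bag
```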

We evaluate sentence autoencoding in two ways. First, we test the bag-of-words reconstruction on its own, by feeding the system the encoded sentence embedding and evaluating whether it can retrieve all the single words contained in the original utterance. We report the proportion of perfectly reconstructed bags of words across all test instances. Second, we test the entire autoencoding process, including word re-ordering. We use two different metrics: a) the BLEU score (Papineni et al., 2002), which computes how many n-grams of a decoded sentence are shared with several reference sentences, giving a precision score; b) the CIDEr-D score (Vedantam et al., 2015), which accounts for both precision and recall and is computed using the average similarity between the vector of a candidate sentence and a set of reference vectors. For this evaluation, we use the PASCAL-50S dataset (included in CIDEr-D), a caption generation dataset that contains 1000 images with 50 reference captions each. We encode and decode the first reference caption for each image and use the remaining 49 as references for the CIDEr and BLEU calculations.

For the actual implementation of the model, we build B and U from 90% of the BNC (≈ 5.4 million sentences), retaining 10% for development purposes. We limit our vocabulary to the 50000 most common words in the corpus; the matrix is therefore of size 50002 × 50002, including tokens for sentence beginning and end.

3.2 Comparison

In what follows, we compare our model to two Word2Vec models, which provide an upper bound for what a DS model may be able to achieve. One model, W2V-BNC, is trained from scratch on our BNC background corpus, using gensim (Řehůřek and Sojka, 2010) with 300 dimensions, a window size of ±5, and ignoring words that occur less than five times in the corpus. The other model, W2V-LARGE, is given by the out-of-the-box vectors released by Baroni et al. (2014): that model is trained on 2.5B words, giving an idea of the system's performance on larger data. In all cases, we limit the vocabulary to the same 50,000 words included in the bigram model.

Note that given space restrictions, we do not disentangle the contribution of the models themselves and the particular type of linguistic structure they are trained on. Our results should thus be taken as an indication of the amount of information encoded in a raw bigram model compared to what can be obtained by a state-of-the-art model using the best linguistic structure at its disposal (here, a window of ±5 words around the target).
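The W2V-BNC comparison model can be reproduced along the following lines with gensim, using the hyperparameters reported above. The variable bnc_sentences is an assumption (an iterable of token lists covering the BNC training portion); note that gensim versions before 4.0 call the dimensionality argument `size` rather than `vector_size`.

```python
from gensim.models import Word2Vec

# `bnc_sentences` is assumed: an iterable of token lists (90% of the BNC).
w2v_bnc = Word2Vec(
    sentences=bnc_sentences,
    vector_size=300,   # 300-dimensional embeddings
    window=5,          # symmetric context window of +/-5 words
    min_count=5,       # ignore words occurring fewer than five times
)

# Keep only the word vectors for later comparison with the bigram model.
w2v_bnc.wv.save("w2v_bnc.kv")
```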
4 Results

Word relatedness: the bigram model obtains an acceptable ρ = 0.48 on the MEN dataset. W2V-BNC and W2V-LARGE perform very well, reaching ρ = 0.72 and ρ = 0.80 respectively. Note that whilst the bigram model lags well behind W2V, it achieves its score with what is in essence a unidirectional model with a window of size 1 – that is, with as minimal an input as it can get, seeing 10 times fewer co-occurrences than W2V-BNC.

Sentence relatedness: the bigram model obtains ρ = 0.40 on the sentence relatedness task. Interestingly, that score increases by 10 points, to ρ = 0.50, when filtering away frequent words with probability over 0.005. W2V-BNC and W2V-LARGE give respectively ρ = 0.59 and ρ = 0.61.

Sentence autoencoding: we evaluate sentence autoencoding on sentences from the Brown corpus (Kučera and Francis, 1967), using seven bins for different sentence lengths (from 3-5 words to 21-23 words). Each bin contains 500 sentences. In some cases, the sentences contained words that are not present in the matrix and which are therefore skipped for encoding. We thus look at two different values: a) in how many cases the reconstruction returns exactly the words in the sentence; b) in how many cases the reconstruction returns the words in the sentence which are contained in the matrix (results in Table 1).

                original sents.       in matrix
sent. length    W2V       CB          W2V       CB
3-5             0.556     0.792       0.686     0.988
6-8             0.380     0.62        0.646     0.988
9-11            0.279     0.586       0.548     1.0
12-14           0.210     0.578       0.402     1.0
15-17           0.178     0.338       0.366     0.978
18-20           0.366     0.404       0.984     0.974
21-23           0.306     0.392       0.982     0.968

Table 1: Fraction of exact matches in bag-of-word reconstruction (W2V refers to W2V-LARGE; CB to our count-based bigram model).

The bigram model shines in this task: ignoring words not contained in the matrix leads to almost perfect reconstruction. In comparison, the W2V model has extremely erratic performance (Table 1), with scores decreasing as a function of sentence length (from 0.686 for length 3-5 to 0.366 for length 15-17), but increasing again for lengths over 18.

One interesting aspect of the bigram model is that it also affords a semantic competence that W2V does not naturally have: encoding a sequence and decoding it back into an ordered sequence. We inspect how well the model does at that task, compared to a random reordering baseline. Results are listed in Table 2. The bigram model clearly beats the baseline for all sentence lengths. But it is expectedly limited by the small n-gram size provided by the model. Table 3 contains examples of sentences from the Brown corpus and their reconstructions. We see that local ordering is reasonably modelled, but the entire sentence structure fails to be captured.

                     all       2-10      11-23
CIDEr-D bigram       1.940     1.875     2.047
BLEU bigram          0.193     0.209     0.176
CIDEr-D random       1.113     1.1       1.134
BLEU random          0.053     0.059     0.045

Table 2: CIDEr-D and BLEU scores on reordering of bags-of-words using our bigram matrix and random reordering. Results are given for all sentences as well as sentences of lengths 2-10 and 11-23.

Original: They have to be.
Reconstruction: they have to be .

Original: Six of these were proposed by religious groups.
Reconstruction: by these were six of religious groups proposed .

Original: His reply, he said, was that he agreed to the need for unity in the country now.
Reconstruction: the need for the country , in his reply , he said that he was now agreed to unity .

Table 3: Examples of decoded and reordered sentences. All words in the original sentences were retrieved by the model, but the ordering is only perfectly recovered in the first case.
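The reordering step evaluated in Tables 2 and 3 amounts to searching for a likely sequence over the bigram probabilities in B. A rough sketch of such a beam search is given below; it is our own illustration (the names are hypothetical), not the released implementation. Here bigram_prob is assumed to be a nested dictionary holding the conditional probabilities p(w_j|w_i) from B (not the PPMI weights).

```python
import math

def reorder(bag, bigram_prob, beam_width=10, bos="<s>", eos="</s>"):
    """Order a bag of words with a beam search scored by bigram log-probabilities."""
    if not bag:
        return []
    beams = [((), 0.0)]                                  # (partial sequence, log-probability)
    for _ in range(len(bag)):
        candidates = []
        for seq, score in beams:
            remaining = list(bag)
            for used in seq:                             # multiset difference: drop used tokens
                remaining.remove(used)
            prev = seq[-1] if seq else bos
            for word in set(remaining):
                p = bigram_prob.get(prev, {}).get(word, 1e-10)   # crude smoothing
                candidates.append((seq + (word,), score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    # prefer a full sequence whose final word also transitions well to the end token
    best_seq, _ = max(beams, key=lambda c: c[1] +
                      math.log(bigram_prob.get(c[0][-1], {}).get(eos, 1e-10)))
    return list(best_seq)
```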
5 Discussion

On the back of our results, we can start commenting on the particular contribution of bigrams to the semantic competences tested here. First, bigrams are moderately efficient at capturing relatedness: in spite of encoding extremely minimal co-occurrence information, they manage to make for two thirds of W2V's performance, trained on the same data with a much larger window and a complex algorithm (see ρ = 0.48 for the bigram model vs ρ = 0.72 for W2V-BNC). So relatedness, the flagship task of DS, seems to be present in the most basic structures of language use, although in moderate amount.

The result of the bigram model on sentence relatedness is consistent with its performance at the word level. The improved result obtained by filtering out frequent words, though, reminds us that logical terms are perhaps not so amenable to the distributional hypothesis, despite indications to the contrary (Abrusán et al., 2018).

As for sentence autoencoding, the excellent results of the bigram model might at first be considered trivial and due to the dimensionality of the space, much larger for the bigram model than for W2V. Indeed, at the bag-of-words level, sentence reconstruction can in principle be perfectly achieved by having a space of the dimensionality of the vocabulary, with each word symbolically expressed as a one-hot vector.⁴ However, as noted in §2, the ability to encode relatedness is at odds with the ability to distinguish between meanings. There is a trade-off between having a high-dimensionality space (which allows for more discrimination between vectors and thus easier reconstruction – see White et al., 2016b) and capturing latent features between concepts (which is typically better achieved with lower dimensionality). Interestingly, bigrams seem to be biased towards more symbolic representations, generating representations that distinguish very well between word meanings, but they do also encapsulate a reasonable amount of lexical information. This makes them somewhat of a hybrid constituent, between proper symbols and continuous vectors.

⁴ To make this clear, if we have a vocabulary V = {cat, dog, run} and we define cat = [100], dog = [010] and run = [001], then, trivially, [011] corresponds to the bag-of-words {dog, run}.

6 Conclusion

So what can be said about bigrams as distributional structure? They encode a very high level of lexical discrimination while accounting for some basic relatedness. They of course also encode minimal sequential information, which can be used to retrieve local sentence ordering. Essentially, they result in representations that are perhaps more 'symbolic' than continuous. It is important to note that the reasonable correlations obtained on relatedness tasks were achieved after application of PMI weighting, implying that the raw structure requires some minimal preprocessing to generate lexical information.

On the back of our results, we can draw a few conclusions with respect to the relation of performance and competence at the level of bigrams. Performance data alone produces very distinct word representations without any further processing. Some traces of relatedness are present, but require some hard-encoded preprocessing step in the shape of the PMI function. We conclude from this that, as a constituent involved in acquisition, the bigram is mostly a marker of the uniqueness of word meaning. Interestingly, we note that the notion of contrast (words that differ in form differ in meaning) is an early feature of children's language acquisition (Clark, 1988). The fact that it is encoded in one of the most simple structures in language is perhaps no coincidence.

In future work, we plan a more encompassing study of other linguistic components. Crucially, we will also investigate which aspects of state-of-the-art models such as W2V contribute to score improvement on lexical aspects of semantics. We hope to thus gain insights into the specific cognitive processes required to bridge the gap between raw distributional structure as it is found in corpora, and actual speaker competence.
References

Márta Abrusán, Nicholas Asher, and Tim Van de Cruys. 2018. Content vs. function words: The view from distributional semantics. In Proceedings of Sinn und Bedeutung 22.

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. International Conference on Learning Representations (ICLR), Toulon, France.

Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In ACL (1), pages 238–247.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 10–21. Association for Computational Linguistics.

E. Bruni, N. K. Tran, and M. Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49:1–47.

John A. Bullinaria and Joseph P. Levy. 2007. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39(3):510–526.

Noam Chomsky. 1965. Aspects of the theory of syntax. MIT Press.

Kenneth Ward Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.

Eve V. Clark. 1988. On the logic of contrast. Journal of Child Language, 15(2):317–335.

Stephen Clark. 2012. Vector space models of lexical meaning. In Shalom Lappin and Chris Fox, editors, Handbook of Contemporary Semantics – second edition. Wiley-Blackwell.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Georgiana Dinu and Marco Baroni. 2014. How to make words with vectors: Phrase generation in distributional semantics. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 624–633.

Katrin Erk. 2012. Vector space models of word meaning and phrase meaning: A survey. Language and Linguistics Compass, 6:635–653.

Katrin Erk and Sebastian Padó. 2008. A structured vector space model for word meaning in context. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP 2008), pages 897–906, Honolulu, HI.

Katrin Erk, Sebastian Padó, and Ulrike Padó. 2010. A flexible, corpus-driven model of regular and inverse selectional preferences. Computational Linguistics, 36(4):723–763.

John Rupert Firth. 1957. A synopsis of linguistic theory, 1930–1955. Philological Society, Oxford.

Alex Gittens, Dimitris Achlioptas, and Michael W. Mahoney. 2017. Skip-gram – Zipf + uniform = vector additivity. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 69–76.

Zellig Harris. 1954. Distributional structure. Word, 10(2-3):146–162.

Mohit Iyyer, Jordan Boyd-Graber, and Hal Daumé III. 2014. Generating sentences from semantic vector space representations. In NIPS Workshop on Learning Semantics.

Henry Kučera and Winthrop Nelson Francis. 1967. Computational analysis of present-day American English. Dartmouth Publishing Group.

Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In LREC, pages 216–223.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751.

Denis Paperno and Marco Baroni. 2016. When the whole is less than the sum of its parts: How composition affects PMI values in distributional semantic vectors. Computational Linguistics, 42(2):345–350.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311–318, Stroudsburg, PA, USA. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.

Allen Schmaltz, Alexander M. Rush, and Stuart Shieber. 2016. Word ordering without syntax. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2319–2324, Austin, Texas. Association for Computational Linguistics.

S. Thater, H. Fürstenau, and M. Pinkal. 2011. Word meaning in context: A simple and effective vector model. In Proceedings of the 5th International Joint Conference on Natural Language Processing, Chiang Mai, Thailand.

Peter D. Turney. 2014. Semantic composition and decomposition: From recognition to generation. CoRR, abs/1405.7908.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

R. Vedantam, C. L. Zitnick, and D. Parikh. 2015. CIDEr: Consensus-based image description evaluation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4566–4575.

L. White, R. Togneri, W. Liu, and M. Bennamoun. 2016a. Modelling sentence generation from sum of word embedding vectors as a mixed integer programming problem. In 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), pages 770–777.

Lyndon White, Roberto Togneri, Wei Liu, and Mohammed Bennamoun. 2016b. Generating bags of words from the sums of their word embeddings. In 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing).