Post-Specialisation: Retrofitting Vectors of Words Unseen in Lexical Resources

Ivan Vulić¹, Goran Glavaš², Nikola Mrkšić³, Anna Korhonen¹
¹ Language Technology Lab, University of Cambridge
² Data and Web Science Group, University of Mannheim
³ PolyAI
{iv250,alk23}@cam.ac.uk, [email protected], [email protected]

Abstract

Word vector specialisation (also known as retrofitting) is a portable, light-weight approach to fine-tuning arbitrary distributional word vector spaces by injecting external knowledge from rich lexical resources such as WordNet. By design, these post-processing methods only update the vectors of words occurring in external lexicons, leaving the representations of all unseen words intact. In this paper, we show that constraint-driven vector space specialisation can be extended to unseen words. We propose a novel post-specialisation method that: a) preserves the useful linguistic knowledge for seen words; and b) propagates this external signal to unseen words in order to improve their vector representations as well. Our post-specialisation approach fits an explicit non-linear specialisation function, in the form of a deep neural network, which learns to predict specialised vectors from their original distributional counterparts. The learned function is then used to specialise vectors of unseen words. This approach, applicable to any post-processing model, yields considerable gains over the initial specialisation models both in intrinsic word similarity tasks and in two downstream tasks: dialogue state tracking and lexical text simplification. The positive effects persist across three languages, demonstrating the importance of specialising the full vocabulary of distributional word vector spaces.

1 Introduction

Word representation learning is a key research area in current Natural Language Processing (NLP), with its usefulness demonstrated across a range of tasks (Collobert et al., 2011; Chen and Manning, 2014; Melamud et al., 2016b). The standard techniques for inducing distributed word representations are grounded in the distributional hypothesis (Harris, 1954): they rely on co-occurrence information in large textual corpora (Mikolov et al., 2013b; Pennington et al., 2014; Levy and Goldberg, 2014; Levy et al., 2015; Bojanowski et al., 2017). As a result, these models tend to coalesce the notions of semantic similarity and (broader) conceptual relatedness, and cannot accurately distinguish antonyms from synonyms (Hill et al., 2015; Schwartz et al., 2015). Recently, we have witnessed a rise of interest in representation models that move beyond stand-alone distributional training: they leverage external knowledge in human- and automatically-constructed lexical resources to enrich the semantic content of distributional word vectors, in a process termed semantic specialisation.

This is often done as a post-processing (sometimes referred to as retrofitting) step: input word vectors are fine-tuned to satisfy linguistic constraints extracted from lexical resources such as WordNet or BabelNet (Faruqui et al., 2015; Mrkšić et al., 2017). The use of external curated knowledge yields improved word vectors for the benefit of downstream applications (Faruqui, 2016). At the same time, this specialisation of the distributional space distinguishes between true similarity and relatedness, and supports language understanding tasks (Kiela et al., 2015; Mrkšić et al., 2017).

While there is consensus regarding their benefits and ease of use, one property of the post-processing specialisation methods slips under the radar: most existing post-processors update word embeddings only for words which are present (i.e., seen) in the external constraints, while vectors of all other (i.e., unseen) words remain unaffected. In this work, we propose a new approach that extends the specialisation framework to unseen words, relying on the transformation of the vector (sub)space of seen words. Our intuition is that the process of fine-tuning seen words provides implicit information on how to leverage the external knowledge for unseen words. The method should preserve the already injected knowledge for seen words, while simultaneously propagating the external signal to unseen words in order to improve their vectors.

The proposed post-specialisation method can be seen as a two-step process, illustrated in Fig. 1a: 1) we use a state-of-the-art specialisation model to transform the subspace of seen words from the input distributional space into the specialised subspace; 2) we learn a mapping function based on the transformation of the "seen subspace", and then apply it to the distributional subspace of unseen words. We allow the proposed post-specialisation model to learn from large external linguistic resources by implementing the mapping as a deep feed-forward neural network with non-linear activations. This allows the model to learn the generalisation of the fine-tuning steps taken by the initial specialisation model, itself based on a very large number (e.g., hundreds of thousands) of external linguistic constraints.

As indicated by the results on word similarity and two downstream tasks (dialogue state tracking and lexical text simplification), our post-specialisation method consistently outperforms state-of-the-art methods which specialise seen words only. We report improvements using three distinct input vector spaces for English and for three test languages (English, German, Italian), verifying the robustness of our approach.

2 Related Work and Motivation

Vector Space Specialisation A standard approach to incorporating external and background knowledge into word vector spaces is to pull the representations of similar words closer together and to push words in undesirable relations (e.g., antonyms) away from each other. Some models integrate such constraints into the training procedure and jointly optimise distributional and non-distributional objectives: they modify the prior or the regularisation (Yu and Dredze, 2014; Xu et al., 2014; Bian et al., 2014; Kiela et al., 2015), or use a variant of the SGNS-style objective (Liu et al., 2015; Ono et al., 2015; Osborne et al., 2016; Nguyen et al., 2017). In theory, word embeddings obtained by these joint models could be as good as representations produced by models which fine-tune the input vector space. However, their performance falls behind that of fine-tuning methods (Wieting et al., 2015). Another disadvantage is that their architecture is tied to a specific underlying distributional model.

In contrast, fine-tuning models inject external knowledge from available lexical resources (e.g., WordNet, PPDB) into pre-trained word vectors as a post-processing step (Faruqui et al., 2015; Rothe and Schütze, 2015; Wieting et al., 2015; Nguyen et al., 2016; Mrkšić et al., 2016; Cotterell et al., 2016; Mrkšić et al., 2017). Such post-processing models are popular because they offer a portable, flexible, and light-weight approach to incorporating external knowledge into arbitrary vector spaces, yielding state-of-the-art results on language understanding tasks (Faruqui et al., 2015; Mrkšić et al., 2016; Kim et al., 2016; Vulić et al., 2017b).

Existing post-processing models, however, suffer from a major limitation. Their modus operandi is to enrich the distributional information with external knowledge only if such knowledge is present in an external lexical resource. This means that they update and improve only representations of words actually seen in external resources. Because such words constitute only a fraction of the whole vocabulary (see Sect. 4), most words, unseen in the constraints, retain their original vectors. The main goal of this work is to address this shortcoming by specialising all words from the initial distributional space.

3 Methodology: Post-Specialisation

Our starting point is the state-of-the-art specialisation model ATTRACT-REPEL (AR) (Mrkšić et al., 2017), outlined in Sect. 3.1. We opt for the AR model due to its strong performance and ease of use, but we note that the proposed post-specialisation approach for specialising unseen words, described in Sect. 3.2, is applicable to any post-processor, as empirically validated in Sect. 5.

3.1 Initial Specialisation Model: AR

Let V_s be the vocabulary, A the set of synonymous ATTRACT word pairs (e.g., rich and wealthy), and R the set of antonymous REPEL word pairs (e.g., increase and decrease). The ATTRACT-REPEL procedure operates over mini-batches of such pairs, B_A and B_R. Let each word pair (x_l, x_r) in these sets correspond to a vector pair (x_l, x_r). A mini-batch of b_att attract word pairs is given by B_A = [(x_l^1, x_r^1), ..., (x_l^{k_1}, x_r^{k_1})] (analogously for B_R, which consists of b_rep pairs).

Next, the sets of negative examples T_A = [(t_l^1, t_r^1), ..., (t_l^{k_1}, t_r^{k_1})] and T_R = [(t_l^1, t_r^1), ..., (t_l^{k_2}, t_r^{k_2})] are defined as pairs of negative examples for each ATTRACT and REPEL pair in the mini-batches B_A and B_R.

These negative examples are chosen from the word vectors present in B_A or B_R so that, for each ATTRACT pair (x_l, x_r), the negative example pair (t_l, t_r) is chosen such that t_l is the vector closest (in terms of cosine distance) to x_l and t_r is closest to x_r.¹ The negatives are used 1) to force ATTRACT pairs to be closer to each other than to their respective negative examples; and 2) to force REPEL pairs to be further away from each other than from their negative examples. The first term of the cost function pulls ATTRACT pairs together:

Att(B_A, T_A) = \sum_{i=1}^{b_{att}} [ \tau(\delta_{att} + x_l^i \cdot t_l^i - x_l^i \cdot x_r^i) + \tau(\delta_{att} + x_r^i \cdot t_r^i - x_l^i \cdot x_r^i) ]   (1)

where \tau(z) = max(0, z) is the standard rectifier function (Nair and Hinton, 2010) and \delta_{att} is the attract margin: it determines how much closer these vectors should be to each other than to their respective negative examples. The second, REPEL term in the cost function is analogous: it pushes REPEL word pairs away from each other by the margin \delta_{rep}.

Finally, in addition to the ATTRACT and REPEL terms, a regularisation term is used to preserve the semantic content originally present in the distributional vector space, as long as this information does not contradict the injected external knowledge. Let V(B) be the set of all word vectors present in a mini-batch; the distributional regularisation term is then:

Reg(B_A, B_R) = \sum_{x_i \in V(B_A \cup B_R)} \lambda_{reg} \| \widehat{x}_i - x_i \|_2   (2)

where \lambda_{reg} is the L2-regularisation constant and \widehat{x}_i denotes the original (distributional) word vector for the word x_i. The full ATTRACT-REPEL cost function is finally constructed as the sum of all three terms.

¹ Similarly, for each REPEL pair (x_l, x_r), the negative pair (t_l, t_r) is chosen from the in-batch vectors so that t_l is the vector furthest away from x_l and t_r is furthest from x_r. All vectors are unit length (re)normalised after each epoch.
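To make the AR cost terms concrete, the following is a minimal NumPy sketch of the ATTRACT term in Eq. (1) and the regularisation term in Eq. (2). It assumes unit-normalised vectors (so dot products act as cosine similarities) and hypothetical array shapes; it illustrates the equations above rather than the authors' implementation.

```python
import numpy as np

def rectifier(z):
    # tau(z) = max(0, z)
    return np.maximum(0.0, z)

def attract_cost(x_l, x_r, t_l, t_r, delta_att=0.6):
    """ATTRACT term of Eq. (1) for one mini-batch.

    x_l, x_r: (b_att, dim) vectors of ATTRACT pairs.
    t_l, t_r: (b_att, dim) their negative examples.
    Rows are assumed unit length, so dot products act as cosine similarities.
    """
    sim_pair = np.sum(x_l * x_r, axis=1)    # x_l^i . x_r^i
    sim_neg_l = np.sum(x_l * t_l, axis=1)   # x_l^i . t_l^i
    sim_neg_r = np.sum(x_r * t_r, axis=1)   # x_r^i . t_r^i
    return np.sum(rectifier(delta_att + sim_neg_l - sim_pair)
                  + rectifier(delta_att + sim_neg_r - sim_pair))

def regularisation_cost(x_batch, x_batch_original, lambda_reg=1e-9):
    """Eq. (2): keep updated vectors close to their distributional originals."""
    return lambda_reg * np.sum(np.linalg.norm(x_batch - x_batch_original, axis=1))
```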
3.2 Specialisation of Unseen Words

Problem Formulation The goal is to learn a global transformation function that generalises the perturbations of the initial vector space made by ATTRACT-REPEL (or any other specialisation procedure), as conditioned on the external constraints. The learned function propagates the signal coded in the input constraints to all the words unseen during the specialisation process. We seek a regression function f: R^dim -> R^dim, where dim is the vector space dimensionality. It maps word vectors from the initial vector space X to the specialised target space X'. Let X̂' = f(X) refer to the predicted mapping of the vector space, while the mapping of a single word vector is denoted x̂'_i = f(x_i).

An input distributional vector space X_d represents words from a vocabulary V_d. V_d may be divided into two vocabulary subsets: V_d = V_s ∪ V_u, V_s ∩ V_u = ∅, with the accompanying vector subspaces X_d = X_s ∪ X_u. V_s refers to the vocabulary of seen words: those that appear in the external linguistic constraints and have their embeddings changed in the specialisation process. V_u denotes the vocabulary of unseen words: those not present in the constraints and whose embeddings are unaffected by the specialisation procedure.

The AR specialisation process transforms only the subspace X_s into the specialised subspace X'_s. All words x_i ∈ V_s may now be used as training examples for learning the explicit mapping function f from X_s into X'_s. If N = |V_s|, we in fact rely on N training pairs: {(x_i, x'_i) : x_i ∈ X_s, x'_i ∈ X'_s}. The function f can then be applied to unseen words x ∈ V_u to yield the specialised subspace X̂'_u = f(X_u). The specialised space containing all words is then X_f = X'_s ∪ X̂'_u. The complete high-level post-specialisation procedure is outlined in Fig. 1a.

Note that another variant of the approach could obtain X_f as X_f = f(X_d), that is, the entire distributional space is transformed by f. However, this variant seems counter-intuitive, as it forgets the actual output of the initial specialisation procedure and replaces word vectors from X'_s with their approximations, i.e., f-mapped vectors.²

Objective Functions As mentioned, the N seen words x_i ∈ V_s in fact serve as our "pseudo-translation" pairs supporting the learning of a cross-space mapping function. In practice, in its high-level formulation, our mapping problem is equivalent to those encountered in the literature on cross-lingual word embeddings, where the goal is to learn a shared cross-lingual space given monolingual vector spaces in two languages and N_1 translation pairs (Mikolov et al., 2013a; Lazaridou et al., 2015; Vulić and Korhonen, 2016b; Artetxe et al., 2016, 2017; Conneau et al., 2017; Ruder et al., 2017).

² We have empirically confirmed the intuition that the first variant is superior to this alternative. We do not report the actual quantitative comparison for brevity.
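In code, the high-level recipe of this subsection amounts to pairing each seen word's distributional vector with its specialised counterpart, fitting f on those pairs, and mapping the remaining (unseen) words with the learned f. A schematic sketch follows; the dictionaries X_dist and X_spec and the fit/predict callables are hypothetical placeholders for whatever regression model is used.

```python
import numpy as np

def build_training_pairs(X_dist, X_spec):
    """Seen words are those that received an AR-specialised vector."""
    seen = sorted(set(X_dist) & set(X_spec))
    X_s = np.stack([X_dist[w] for w in seen])        # inputs: original vectors
    X_s_prime = np.stack([X_spec[w] for w in seen])  # targets: specialised vectors
    return seen, X_s, X_s_prime

def post_specialise(X_dist, X_spec, fit, predict):
    """fit/predict implement the regression f (e.g., the DFFN of Fig. 1b)."""
    seen, X_s, X_s_prime = build_training_pairs(X_dist, X_spec)
    model = fit(X_s, X_s_prime)          # learn f on the N = |V_s| pairs
    X_f = dict(X_spec)                   # keep AR-specialised vectors for seen words
    for w, v in X_dist.items():
        if w not in X_f:                 # unseen words: apply the learned mapping f
            X_f[w] = predict(model, v)
    return X_f                           # X_f = X'_s  union  X̂'_u
```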

[Figure 1 appears here; panels (a) and (b) are described in the caption below. Panel (b) depicts the DFFN mapping from x_i ∈ X_u to x̂'_i ∈ X̂'_u, with input/output dimensionality d = 300, H hidden layers of dimensionality 512, and swish activations.]

Figure 1: (a) High-level illustration of the post-specialisation approach: the subspace X_s of the initial distributional vector space X_d = X_s ∪ X_u is first specialised/fine-tuned by the ATTRACT-REPEL specialisation model (or any other post-processing model) to obtain the transformed subspace X'_s. The words present (i.e., seen) in the input set of linguistic constraints are now assigned different representations in X_s (the original distributional vector) and X'_s (the specialised vector): they are therefore used as training examples to learn a non-linear cross-space mapping function. This function is then applied to all word vectors x_i ∈ X_u representing words unseen in the constraints to yield a specialised subspace X̂'_u. The final space is X_f = X'_s ∪ X̂'_u, and it contains transformed representations for all words from the initial space X_d. (b) The actual implementation of the non-linear regression function which maps from X_u to X̂'_u: a deep feed-forward fully-connected neural net with non-linearities and H hidden layers.

In our setup, the standard objective based on L2-penalised least squares may be formulated as follows:

f_{MSE} = \arg\min_f \| f(X_s) - X'_s \|_F^2   (3)

where \| \cdot \|_F^2 denotes the squared Frobenius norm. In the most common form, f(X_s) is simply a linear map/matrix W_f ∈ R^{dim×dim} (Mikolov et al., 2013a), as follows: f(X) = W_f X. After learning f based on the X_s -> X'_s transformation, one can simply apply f to unseen words: X̂'_u = f(X_u). This linear mapping model, termed LINEAR-MSE, has an analytical solution (Artetxe et al., 2016), and has been proven to work well with cross-lingual embeddings. However, given that the specialisation model injects hundreds of thousands (or even millions) of linguistic constraints into the distributional space (see later in Sect. 4), we suspect that the assumption of linearity is too limiting and does not fully hold in this particular setup.

Using the same L2-penalised least squares objective, we can thus replace the linear map with a non-linear function f: R^dim -> R^dim. The non-linear mapping, illustrated by Fig. 1b, is implemented as a deep feed-forward fully-connected neural network (DFFN) with H hidden layers and non-linear activations. This variant is called NONLINEAR-MSE.

Another variant objective is the contrastive margin-based ranking loss with negative sampling (MM), similar to the original ATTRACT-REPEL objective and used for other applications in prior work, e.g., for cross-modal mapping (Weston et al., 2011; Frome et al., 2013; Lazaridou et al., 2015; Kummerfeld et al., 2015). Let x̂'_i = f(x_i) denote the predicted vector for the word x_i ∈ V_s, and let x'_i refer to the "true" vector of x_i in the specialised space X'_s after the AR specialisation procedure. The MM loss is then defined as follows:

J_{MM} = \sum_{i=1}^{N} \sum_{j \neq i}^{k} \tau( \delta_{mm} - \cos(\widehat{x}'_i, x'_i) + \cos(\widehat{x}'_i, x'_j) )

where cos is the cosine similarity measure, \delta_{mm} is the margin, and k is the number of negative samples. The objective tries to learn the mapping f so that each predicted vector x̂'_i is, by the specified margin \delta_{mm}, closer to the correct target vector x'_i than to any of the k target vectors x'_j serving as negative examples.³ The function f can again be either a simple linear map (LINEAR-MM) or implemented as a DFFN (NONLINEAR-MM, see Fig. 1b).

³ We have also experimented with a simpler hinge loss function without negative examples, formulated as J = \sum_{i=1}^{N} \tau( \delta_{mm} - \cos(\widehat{x}'_i, x'_i) ). For instance, with \delta_{mm} = 1.0, the idea is to learn a mapping f that, for each x_i, enforces the predicted vector and the correct target vector to have maximum cosine similarity. We do not report the results with this variant: although it outscores the MSE-style objective, it was consistently outperformed by the MM objective.
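For illustration, a compact PyTorch sketch of the NONLINEAR-MM variant is given below, using the hyperparameters reported later in Sect. 4 (H hidden layers of size 512, swish activations, HE-normal initialisation, margin delta_mm and k in-batch negatives). It is a hedged re-implementation sketch, not the released code; in particular, negatives are drawn here by cyclically shifting the batch, which is only one possible sampling scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Swish(nn.Module):
    def forward(self, x):
        return x * torch.sigmoid(x)          # swish(x) with beta = 1

class DFFN(nn.Module):
    """f: R^dim -> R^dim with H hidden layers; no activation on the output layer."""
    def __init__(self, dim=300, hidden=512, H=5):
        super().__init__()
        layers, d_in = [], dim
        for _ in range(H):
            layers += [nn.Linear(d_in, hidden), Swish()]
            d_in = hidden
        layers.append(nn.Linear(d_in, dim))   # full-range (linear) output
        self.net = nn.Sequential(*layers)
        for m in self.net:
            if isinstance(m, nn.Linear):
                nn.init.kaiming_normal_(m.weight)   # HE normal initialisation
                nn.init.zeros_(m.bias)

    def forward(self, x):
        return self.net(x)

def mm_loss(pred, target, delta_mm=0.6, k=25):
    """Max-margin (MM) loss: each predicted vector should be closer (by delta_mm)
    to its own target than to k other target vectors from the same mini-batch.
    Assumes the batch holds more than k examples so shifted rows act as negatives."""
    pos = F.cosine_similarity(pred, target)                  # cos(x̂'_i, x'_i)
    loss = pred.new_zeros(())
    for shift in range(1, k + 1):                            # k in-batch negatives, j != i
        neg = F.cosine_similarity(pred, torch.roll(target, shifts=shift, dims=0))
        loss = loss + torch.clamp(delta_mm - pos + neg, min=0.0).sum()
    return loss
```

The LINEAR-MSE baseline of Eq. (3), by contrast, needs no network at all: with a row-vector convention for X_s and X'_s it reduces to an ordinary least-squares problem, e.g., `W, *rest = numpy.linalg.lstsq(X_s, X_s_prime, rcond=None)`.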

4 Experimental Setup

Starting Word Embeddings (X_d = X_s ∪ X_u) To test the robustness of our approach, we experiment with three well-known, publicly available collections of English word vectors: 1) Skip-Gram with Negative Sampling (SGNS-BOW2) (Mikolov et al., 2013b) trained on the Polyglot Wikipedia (Al-Rfou et al., 2013) by Levy and Goldberg (2014) using bag-of-words windows of size 2; 2) GLOVE Common Crawl (Pennington et al., 2014); and 3) FASTTEXT (Bojanowski et al., 2017), a SGNS variant which builds word vectors as the sum of their constituent character n-gram vectors. All word embeddings are 300-dimensional.⁴

AR Specialisation and Constraints (X_s -> X'_s) We experiment with the linguistic constraints used before by Mrkšić et al. (2017) and Vulić et al. (2017a): they extracted monolingual synonymy/ATTRACT pairs from the Paraphrase Database (PPDB) (Ganitkevitch et al., 2013; Pavlick et al., 2015) (640,435 synonymy pairs in total), while their antonymy/REPEL constraints came from BabelNet (Navigli and Ponzetto, 2012) (11,939 pairs).⁵

The coverage of the V_d vocabulary words in the constraints illustrates well the problem of unseen words with the fine-tuning specialisation models. For instance, the constraints cover only a small subset of the entire vocabulary V_d for SGNS-BOW2: 16.6%. They also cover only 15.3% of the top 200K most frequent V_d words from FASTTEXT.

Network Design and Parameters (X_u -> X̂'_u) The non-linear regression function f: R^d -> R^d is a DFFN with H hidden layers, each of dimensionality d_1 = d_2 = ... = d_H = 512 (see Fig. 1b). Non-linear activations are used in each layer and omitted only before the final output layer to enable full-range predictions (see Fig. 1b again).

The choices of non-linear activation and initialisation are guided by recent recommendations from the literature. First, we use swish (Ramachandran et al., 2017; Elfwing et al., 2017) as the non-linearity, defined as swish(x) = x · sigmoid(βx). We fix β = 1 as suggested by Ramachandran et al. (2017).⁶ Second, we use the HE normal initialisation (He et al., 2015), which is preferred over the XAVIER initialisation (Glorot and Bengio, 2010) for deep models (Mishkin and Matas, 2016; Li et al., 2016), although in our experiments we do not observe a significant difference in performance between the two alternatives. We set H = 5 in all experiments without any fine-tuning; we also analyse the impact of the network depth in Sect. 5.

Optimisation For the AR specialisation step, we adopt the originally suggested model setup. Hyperparameter values are set to: δ_att = 0.6, δ_rep = 0.0, λ_reg = 10^-9 (Mrkšić et al., 2017). The models are trained for 5 epochs with Adagrad (Duchi et al., 2011), with batch sizes set to b_att = b_rep = 50, again as in the original work.

For training the non-linear mapping with the DFFN (Fig. 1b), we use the Adam algorithm (Kingma and Ba, 2015) with default settings. The model is trained for 100 epochs with early stopping on a validation set. We reserve 10% of all available seen data (i.e., the words from V_s represented in X_s and X'_s) for validation; the rest are used for training. For the MM objective, we set δ_mm = 0.6 and k = 25 in all experiments without any fine-tuning.

5 Results and Discussion

5.1 Intrinsic Evaluation: Word Similarity

Evaluation Protocol The first set of experiments evaluates vector spaces with different specialisation procedures intrinsically on word similarity benchmarks: we use the SimLex-999 dataset (Hill et al., 2015) and SimVerb-3500 (Gerz et al., 2016), a recent verb pair similarity dataset providing similarity ratings for 3,500 verb pairs.⁷ Spearman's ρ rank correlation is used as the evaluation metric.

⁴ For further details regarding the architectures and training setup of the used vector collections, we refer the reader to the original papers. Additional experiments with other word vectors, e.g., with CONTEXT2VEC (Melamud et al., 2016a) (which uses bidirectional LSTMs (Hochreiter and Schmidhuber, 1997) for context modeling) and with dependency-based word embeddings (Bansal et al., 2014; Melamud et al., 2016b), lead to similar results and the same conclusions.
⁵ We have experimented with another set of constraints used in prior work (Zhang et al., 2014; Ono et al., 2015), reaching similar conclusions: these were extracted from WordNet (Fellbaum, 1998) and Roget (Kipfer, 2009), and comprise 1,023,082 synonymy pairs and 380,873 antonymy pairs.
⁶ According to Ramachandran et al. (2017), for deep networks swish has a slight edge over the family of LU/ReLU-related activations (Maas et al., 2013; He et al., 2015; Klambauer et al., 2017). We also observe a minor (and insignificant) difference in performance in favour of swish.
⁷ While other gold standards such as WordSim-353 (Finkelstein et al., 2002) or MEN (Bruni et al., 2014) coalesce the notions of true semantic similarity and (more broad) conceptual relatedness, SimLex and SimVerb provide explicit guidelines to discern between the two, so that related but non-similar words (e.g., tiger and jungle) have a low rating.

                              Setup: hold-out                               Setup: all
                        GLOVE       SGNS-BOW2   FASTTEXT         GLOVE       SGNS-BOW2   FASTTEXT
                        SL    SV    SL    SV    SL    SV          SL    SV    SL    SV    SL    SV

Distributional: X_d     .408  .286  .414  .275  .383  .255        .408  .286  .414  .275  .383  .255
+AR specialisation: X'_s .408 .286  .414  .275  .383  .255        .690  .578  .658  .544  .629  .502
++Mapping unseen: X_f
  LINEAR-MSE            .504  .384  .447  .309  .405  .285        .690  .578  .656  .551  .628  .502
  NONLINEAR-MSE         .549  .407  .484  .344  .459  .329        .694  .586  .663  .556  .631  .506
  LINEAR-MM             .548  .422  .468  .329  .419  .308        .697  .582  .663  .554  .628  .487
  NONLINEAR-MM          .603  .480  .531  .391  .471  .349        .705  .600  .667  .562  .638  .507

Table 1: Spearman's ρ correlation scores for three word vector collections on two English word similarity datasets, SimLex-999 (SL) and SimVerb-3500 (SV), using different mapping variants, evaluation protocols, and word vector spaces: from the initial distributional space X_d to the fully specialised space X_f. H = 5.
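The correlations in Table 1 follow the standard word similarity protocol: rank the model's cosine similarities against the human ratings. A minimal sketch with SciPy, using hypothetical inputs (`vectors`: word -> NumPy array; `pairs`: list of (word1, word2, human_score) tuples):

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_word_similarity(vectors, pairs):
    """Spearman's rho between cosine similarities and human similarity ratings.
    Pairs containing out-of-vocabulary words are skipped for simplicity."""
    model_scores, gold_scores = [], []
    for w1, w2, gold in pairs:
        if w1 in vectors and w2 in vectors:
            v1, v2 = vectors[w1], vectors[w2]
            cos = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
            model_scores.append(cos)
            gold_scores.append(gold)
    rho, _ = spearmanr(model_scores, gold_scores)
    return rho
```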

[Figure 2 appears here: three panels, (a) GLOVE, (b) SGNS-BOW2, (c) FASTTEXT, each plotting Spearman's ρ correlation on SimLex-999 and SimVerb-3500 against the number of hidden layers H (0-8); see the caption below.]

Figure 2: The results of the hold-out experiments on SimLex-999 and SimVerb-3500 after applying our non-linear vector space transformation with different depths (hidden layer size H, see Fig. 1b). The results are presented as averages over 20 runs with the NONLINEAR-MM variant; the shaded regions are spanned by the maximum and minimum scores obtained. Thick horizontal lines refer to Spearman's rank correlations achieved in the initial space X_d. H = 0 denotes the standard linear regression model (Mikolov et al., 2013a; Lazaridou et al., 2015) (LINEAR-MM is shown since it outperforms LINEAR-MSE).

We evaluate word vectors in two settings. First, in a synthetic hold-out setting, we remove all linguistic constraints which contain words from the SimLex and SimVerb evaluation data, effectively forcing all SimLex and SimVerb words to be unseen by the AR specialisation model. The specialised vectors for these words are estimated by the learned non-linear DFFN mapping model. Second, the all setting is a standard "real-life" scenario where some test (SimLex/SimVerb) words do occur in the constraints, while the mapping is learned for the remaining words.

Results and Analysis The results with the three word vector collections are provided in Tab. 1. In addition, Fig. 2 plots the influence of the network depth H on the model's performance.

The results suggest that the mapping of unseen words is universally useful, as the highest correlation scores are obtained with the final fully specialised vector space X_f for all three input spaces. The results in the hold-out setup are particularly indicative of the improvement achieved by our post-specialisation method. For instance, it achieves a +0.2 correlation gain with GLOVE on both SimLex and SimVerb by specialising vector representations for words present in these datasets without seeing a single external constraint which contains any of these words. This suggests that the perturbation of the seen subspace X_s by ATTRACT-REPEL contains implicit knowledge that can be propagated to X_u, learning better representations for unseen words. We observe small but consistent improvements across the board in the all setup. The smaller gains can be explained by the fact that a majority of SimLex and SimVerb words are present in the external constraints (93.7% and 87.2%, respectively).

             GLOVE        SGNS-BOW2    FASTTEXT
             SL    SV     SL    SV     SL    SV

X_d          .408  .286   .414  .275   .383  .255
X_f: RFit    .493  .365   .412  .285   .413  .279
X_f: CFit    .540  .401   .439  .318   .441  .306

Table 2: Post-specialisation applied to two other post-processing methods. SL: SimLex; SV: SimVerb. Hold-out setting. NONLINEAR-MM.

The scores also indicate that both non-linearity and the chosen objective function contribute to the quality of the learned mapping: the largest gains are reported with the NONLINEAR-MM variant, which a) employs non-linear activations and b) replaces the basic mean-squared-error objective with max-margin. The usefulness of the latter has been established in prior work on cross-space mapping learning (Lazaridou et al., 2015). The former indicates that the initial AR transformation is non-linear. It is guided by a large number of constraints; their effect cannot be captured by a simple linear map as in prior work on, e.g., cross-lingual word embeddings (Mikolov et al., 2013a; Ruder et al., 2017).

Finally, the analysis of the network depth H indicates that going deeper helps only to a certain extent. Adding more layers allows for a richer parametrisation of the network (which is beneficial given the number of linguistic constraints used by AR). This makes the model more expressive, but performance seems to saturate with larger H values.

Post-Specialisation with Other Post-Processors We also verify that our post-specialisation approach is not tied to the ATTRACT-REPEL method, and is indeed applicable on top of any post-processing specialisation method. We analyse the impact of post-specialisation in the hold-out setting using the original retrofitting (RFit) model (Faruqui et al., 2015) and counter-fitting (CFit) (Mrkšić et al., 2016) in lieu of ATTRACT-REPEL. The results on word similarity with the best-performing NONLINEAR-MM variant are summarised in Tab. 2.

The scores again indicate the usefulness of post-specialisation. As expected, the gains are lower than with ATTRACT-REPEL. RFit falls short of CFit as, by design, it can leverage only synonymy (i.e., ATTRACT) external constraints.

5.2 Downstream Task I: DST

Next, we evaluate the usefulness of post-specialisation for two downstream tasks – dialogue state tracking and lexical text simplification – in which discerning semantic similarity from other types of semantic relatedness is crucial. We first evaluate the importance of post-specialisation for a downstream language understanding task of dialogue state tracking (DST) (Henderson et al., 2014; Williams et al., 2016), adopting the evaluation protocol and data of Mrkšić et al. (2017).

DST: Model and Evaluation The DST model is the first component of modern dialogue pipelines (Young, 2010): it captures the users' goals at each dialogue turn and then updates the dialogue state. Goals are represented as sets of constraints expressed as slot-value pairs (e.g., food=Chinese). The set of slots and the set of values for each slot constitute the ontology of a dialogue domain. The probability distribution over the possible states is the system's estimate of the user's goals, and it is used by the dialogue manager module to select the subsequent system response (Su et al., 2016). An example in Fig. 3 illustrates the DST pipeline.
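As a toy illustration of the state representation described above (and not of the NBT model introduced next), a dialogue state can be kept as a dictionary of slot-value constraints that is updated at every turn; a turn then counts towards the joint goal accuracy metric used later in this section only if the full predicted goal matches the gold goal.

```python
from typing import Dict

def update_state(state: Dict[str, str], turn_goals: Dict[str, str]) -> Dict[str, str]:
    """Overwrite or add the slot-value constraints expressed in the current turn.
    e.g., update_state({"food": "Chinese"}, {"price": "cheap"})
          -> {"food": "Chinese", "price": "cheap"}"""
    new_state = dict(state)
    new_state.update(turn_goals)
    return new_state

def joint_goal_correct(predicted: Dict[str, str], gold: Dict[str, str]) -> bool:
    """A turn is counted as correct only if every gold slot-value pair
    is predicted exactly (and no spurious constraints are added)."""
    return predicted == gold
```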
For evaluation, we use the Neural Belief Tracker (NBT), a state-of-the-art DST model which was the first to reason purely over pre-trained word vectors (Mrkšić et al., 2017).⁸ The NBT uses no hand-crafted semantic lexicons, instead composing word vectors into intermediate utterance and context representations.⁹ For full model details, we refer the reader to the original paper. The importance of word vector specialisation for the DST task (e.g., distinguishing between synonyms and antonyms by pulling northern and north closer in the vector space while pushing north and south away) has been established (Mrkšić et al., 2017).

Figure 3: DST labels (user goals given by slot-value pairs) in a multi-turn dialogue (Mrkšić et al., 2015).

⁸ https://github.com/nmrksic/neural-belief-tracker
⁹ The NBT keeps word vectors fixed during training to enable generalisation for words unseen in the DST training data.

ENGLISH                          hold-out   all

Distributional: X_d              .797       .797
+AR Spec.: X'_s ∪ X_u            .797       .817
++Mapping: X_f = X'_s ∪ X̂'_u
  LINEAR-MM                      .815       .818
  NONLINEAR-MM                   .827       .835

Table 3: DST results in two evaluation settings (hold-out and all) with different GLOVE variants.

Again, as in prior work, the DST evaluation is based on the Wizard-of-Oz (WOZ) v2.0 dataset (Wen et al., 2017; Mrkšić et al., 2017), comprising 1,200 dialogues split into training (600 dialogues), development (200), and test data (400). In all experiments, we report the standard DST performance measure: joint goal accuracy, and report scores as averages over 5 NBT training runs.

Results and Analysis We again evaluate word vectors in two settings: 1) hold-out, where linguistic constraints with words appearing in the WOZ data are removed, making all WOZ words unseen by ATTRACT-REPEL; and 2) all. The results for the English DST task with different GLOVE word vector variants are summarised in Tab. 3; similar trends in results are observed with the two other word vector collections. The scores maintain the conclusions established in the word similarity task. First, semantic specialisation with ATTRACT-REPEL is again beneficial, and discerning between synonyms and antonyms improves DST performance. However, specialising unseen words (the final X̂'_u subspace) yields further improvements in both evaluation settings, supporting our claim that the specialisation signal can be propagated to unseen words. This downstream evaluation again demonstrates the importance of non-linearity, as the peak scores are reported with the NONLINEAR-MM variant.

More substantial gains in the all setup are observed in the DST task compared to the word similarity task. This stems from a lower coverage of the WOZ data in the AR constraints: 36.3% of all WOZ words are unseen words. Finally, the scores are higher on average in the all setup, since this setup uses more external constraints for AR, and consequently uses more training examples to learn the mapping.

Other Languages We test the portability of our framework to two other languages for which we have similar evaluation data: German (DE) and Italian (IT). SimLex-999 has been translated and rescored in the two languages by Leviant and Reichart (2015), and the WOZ data were translated and adapted by Mrkšić et al. (2017). Exactly the same setup is used as in our English experiments, without any additional language-specific fine-tuning. Linguistic constraints were extracted from the same sources: synonyms from the PPDB (135,868 in DE, 362,452 in IT), antonyms from BabelNet (4,124 in DE and 16,854 in IT). Our starting distributional vector spaces are taken from prior work: IT vectors are from Dinu et al. (2015), DE vectors are from Vulić and Korhonen (2016a). The results are summarised in Tab. 4.

Our post-specialisation approach yields consistent improvements over the initial distributional space and the AR specialisation model in both tasks and for both languages. We do not observe any gain on IT SimLex in the all setup since the IT constraints have almost complete coverage of all IT SimLex words (99.3%; the coverage is 64.8% in German). As expected, the DST scores in the all setup are higher than in the hold-out setup due to a larger number of constraints and training examples.

Lower absolute scores for Italian and German compared to the ones reported for English are due to multiple factors, as discussed previously by Mrkšić et al. (2017): 1) the AR model uses fewer linguistic constraints for DE and IT; 2) distributional word vectors are induced from smaller corpora; 3) linguistic phenomena (e.g., cases and compounding in DE) contribute to data sparsity and also make the DST task more challenging. However, it is important to stress the consistent gains over the vector space specialised by the state-of-the-art ATTRACT-REPEL model across all three test languages. This indicates that the proposed approach is language-agnostic and portable to multiple languages.

                        GERMAN                                   ITALIAN
                        SimLex (Similarity)   WOZ (DST)          SimLex (Similarity)   WOZ (DST)
                        hold-out   all        hold-out   all     hold-out   all        hold-out   all

Distributional: X_d     .267       .267       .487       .487    .363       .363       .618       .618
+AR Spec.: X'_s ∪ X_u   .267       .422       .487       .535    .363       .616       .618       .634
++Mapping: X_f
  LINEAR-MM             .354       .449       .485       .533    .401       .616       .627       .633
  NONLINEAR-MM          .367       .466       .496       .538    .428       .616       .637       .647

Table 4: Results on word similarity (Spearman's ρ) and DST (joint goal accuracy) for German and Italian.

5.3 Downstream Task II: Lexical Simplification

In our second downstream task, we examine the effects of post-specialisation on lexical simplification (LS) in English. LS aims to substitute complex words (i.e., less commonly used words) with their simpler synonyms in the context. Simplified text must keep the meaning of the original text, which is why discerning similarity from relatedness is important (e.g., in "The automobile was set on fire" the word "automobile" should be replaced with "car" or "vehicle", but not with "wheel" or "driver").
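To see why similarity-specialised vectors matter for this task, consider a deliberately simplified substitution routine that picks the nearest, more frequent neighbour of a complex word in a given vector space. This toy sketch (with hypothetical inputs) is not the LIGHT-LS system described next.

```python
import numpy as np

def simplify_word(word, vectors, frequency, top_n=10):
    """Replace `word` with its most similar, more frequent (i.e., 'simpler') neighbour.

    vectors:   dict word -> unit-normalised np.ndarray
    frequency: dict word -> corpus frequency, used as a crude simplicity proxy
    """
    query = vectors[word]
    scored = sorted(
        ((float(np.dot(query, vec)), cand) for cand, vec in vectors.items() if cand != word),
        reverse=True,
    )[:top_n]
    candidates = [cand for _, cand in scored
                  if frequency.get(cand, 0) > frequency.get(word, 0)]
    return candidates[0] if candidates else word
```

In a relatedness-dominated space, the top neighbour of "automobile" may well be "driver" or "wheel"; in a space specialised for true similarity, it is more likely to be "car", which is the effect the following evaluation probes.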

Vectors      Specialisation            Acc.   Ch.

GLOVE        Distributional: X_d       66.0   94.0
             +AR Spec.: X'_s ∪ X_u     67.6   87.0
             ++Mapping: X_f            72.3   87.6

FASTTEXT     Distributional: X_d       57.8   84.0
             +AR Spec.: X'_s ∪ X_u     69.8   89.4
             ++Mapping: X_f            74.3   88.8

SGNS-BOW2    Distributional: X_d       56.0   79.1
             +AR Spec.: X'_s ∪ X_u     64.4   86.7
             ++Mapping: X_f            70.9   86.8

Table 5: Lexical simplification performance with post-specialisation applied on three input spaces.

We employ LIGHT-LS (Glavaš and Štajner, 2015), a lexical simplification algorithm that: 1) makes substitutions based on word similarities in a semantic vector space; and 2) can be provided an arbitrary embedding space as input.¹⁰ For a complex word, LIGHT-LS considers the most similar words from the vector space as simplification candidates. Candidates are ranked according to several features indicating simplicity and fitness for the context (semantic relatedness to the context of the complex word). The substitution is made if the best candidate is simpler than the original word. By providing vector spaces post-specialised for semantic similarity to LIGHT-LS, we expect to more often replace complex words with their true synonyms.

We evaluate LIGHT-LS performance in the all setup on the LS benchmark compiled by Horn et al. (2014), who crowdsourced 50 manual simplifications for each complex word. As in prior work, we evaluate performance with the following metrics: 1) Accuracy (Acc.) is the number of correct simplifications made (i.e., the system made the simplification and its substitution is found in the list of crowdsourced substitutions), divided by the total number of indicated complex words; 2) Changed (Ch.) is the percentage of indicated complex words that were replaced by the system (whether or not the replacement was correct).

LS results are summarised in Tab. 5. Post-specialised vector spaces consistently yield a 5-6% gain in Accuracy compared to the respective distributional vectors and to embeddings specialised with the state-of-the-art ATTRACT-REPEL model. Similar to the DST evaluation, improvements over ATTRACT-REPEL demonstrate the importance of specialising the vectors of the entire vocabulary and not only the vectors of words from the external constraints.

6 Conclusion and Future Work

We have presented a novel post-processing model, termed post-specialisation, that specialises word vectors for the full vocabulary of the input vector space. Previous post-processing specialisation models fine-tune word vectors only for words occurring in external lexical resources. In this work, we have demonstrated that the specialisation of the subspace of seen words can be leveraged to learn a mapping function which specialises vectors for all other words, unseen in the external resources. Our results across word similarity and downstream language understanding tasks show consistent improvements over the state-of-the-art specialisation method for all three test languages.

In future work, we plan to extend our approach to specialisation for asymmetric relations such as hypernymy or meronymy (Glavaš and Ponzetto, 2017; Nickel and Kiela, 2017; Vulić and Mrkšić, 2018). We will also investigate more sophisticated non-linear functions. The code is available at: https://github.com/cambridgeltl/post-specialisation/.

Acknowledgments

We thank the three anonymous reviewers for their insightful suggestions. This work is supported by the ERC Consolidator Grant LEXICAL: Lexical Acquisition Across Languages (no 648909).

¹⁰ https://github.com/codogogo/lightls

524 References Stefan Elfwing, Eiji Uchibe, and Kenji Doya. 2017. Sigmoid-weighted linear units for neural network Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. function approximation in reinforcement learning. 2013. Polyglot: Distributed word representations for CoRR, abs/1702.03118. multilingual NLP. In Proceedings of CoNLL, pages 183–192. Manaal Faruqui. 2016. Diverse Context for Learning Word Representations. Ph.D. thesis, Carnegie Mel- Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. lon University. Learning principled bilingual mappings of word em- beddings while preserving monolingual invariance. Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, In Proceedings of EMNLP, pages 2289–2294. Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Proceedings of NAACL-HLT, pages 1606–1615. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of ACL, pages Christiane Fellbaum. 1998. WordNet. 451–462. Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Tailoring continuous word representations for depen- Ruppin. 2002. Placing search in context: The con- Proceedings of ACL dency . In , pages 809– cept revisited. ACM Transactions on Information 815. Systems, 20(1):116–131. Jiang Bian, Bin Gao, and Tie-Yan Liu. 2014. Andrea Frome, Gregory S. Corrado, Jonathon Shlens, Knowledge-powered deep learning for word embed- Samy Bengio, Jeffrey Dean, Marc’Aurelio Ranzato, ding. In Proceedings of ECML-PKDD, pages 132– and Tomas Mikolov. 2013. DeViSE: A deep visual- 148. semantic embedding model. In Proceedings of Piotr Bojanowski, Edouard Grave, Armand Joulin, and NIPS, pages 2121–2129. Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the ACL, Juri Ganitkevitch, Benjamin Van Durme, and Chris 5:135–146. Callison-Burch. 2013. PPDB: The Paraphrase Database. In Proceedings of NAACL-HLT, pages Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. 758–764. Multimodal . Journal of Ar- tificial Intelligence Research, 49:1–47. Daniela Gerz, Ivan Vulic,´ Felix Hill, Roi Reichart, and Anna Korhonen. 2016. SimVerb-3500: A large- Danqi Chen and Christopher D. Manning. 2014.A scale evaluation set of verb similarity. In Proceed- fast and accurate dependency parser using neural net- ings of EMNLP, pages 2173–2182. works. In Proceedings of EMNLP, pages 740–750. Goran Glavaš and Simone Paolo Ponzetto. 2017. Ronan Collobert, Jason Weston, Léon Bottou, Michael Dual tensor model for detecting asymmetric lexico- Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. semantic relations. In Proceedings of EMNLP, 2011. Natural language processing (almost) from pages 1758–1768. scratch. Journal of Research, 12:2493–2537. Goran Glavaš and Sanja Štajner. 2015. Simplifying lex- ical simplification: Do we need simplified corpora? Alexis Conneau, Guillaume Lample, Marc’Aurelio In Proceedings of ACL, pages 63–68. Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word translation without parallel data. CoRR, Xavier Glorot and Yoshua Bengio. 2010. Understand- abs/1710.04087. ing the difficulty of training deep feedforward neural networks. In Proceedings of AISTATS, pages 249– Ryan Cotterell, Hinrich Schütze, and Jason Eisner. 256. 2016. Morphological smoothing and extrapolation of word embeddings. In Proceedings of ACL, pages Zellig S. Harris. 1954. Distributional structure. 
Word, 1651–1660. 10(23):146–162.

Georgiana Dinu, Angeliki Lazaridou, and Marco Ba- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian roni. 2015. Improving zero-shot learning by mitigat- Sun. 2015. Delving deep into rectifiers: Surpass- ing the hubness problem. In Proceedings of ICLR ing human-level performance on ImageNet classifi- (Workshop Papers). cation. In Proceedings of ICCV, pages 1026–1034.

John C. Duchi, Elad Hazan, and Yoram Singer. 2011. Matthew Henderson, Blaise Thomson, and Jason D. Adaptive subgradient methods for online learning Wiliams. 2014. The Second Dialog State Tracking and stochastic optimization. Journal of Machine Challenge. In Proceedings of SIGDIAL, pages 263– Learning Research, 12:2121–2159. 272.

525 Felix Hill, Roi Reichart, and Anna Korhonen. 2015. Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. SimLex-999: Evaluating semantic models with (gen- 2013. Rectifier nonlinearities improve neural net- uine) similarity estimation. Computational Linguis- work acoustic models. In Proceedings of ICML. tics, 41(4):665–695. Oren Melamud, Jacob Goldberger, and Ido Dagan. Sepp Hochreiter and Jürgen Schmidhuber. 1997. 2016a. Context2vec: Learning generic context em- Long Short-Term Memory. Neural Computation, bedding with bidirectional LSTM. In Proceedings 9(8):1735–1780. of CoNLL, pages 51–61.

Colby Horn, Cathryn Manduca, and David Kauchak. Oren Melamud, David McClosky, Siddharth Patward- 2014. Learning a lexical simplifier using wikipedia. han, and Mohit Bansal. 2016b. The role of context In Proceedings of the ACL, pages 458–463. types and dimensionality in learning word embed- Douwe Kiela, Felix Hill, and Stephen Clark. 2015. dings. In Proceedings of NAACL-HLT, pages 1030– Specializing word embeddings for similarity or re- 1040. latedness. In Proceedings of EMNLP, pages 2044– 2048. Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013a. Exploiting similarities among languages Joo-Kyung Kim, Gokhan Tur, Asli Celikyilmaz, Bin for . arXiv preprint, CoRR, Cao, and Ye-Yi Wang. 2016. Intent detection us- abs/1309.4168. ing semantically enriched word embeddings. In Pro- ceedings of SLT. Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013b. Distributed rep- Diederik P. Kingma and Jimmy Ba. 2015. Adam: A resentations of words and phrases and their compo- method for stochastic optimization. In Proceedings sitionality. In Proceedings of NIPS, pages 3111– of ICLR (Conference Track). 3119.

Barbara Ann Kipfer. 2009. Roget’s 21st Century The- Dmytro Mishkin and Jiri Matas. 2016. All you need saurus (3rd Edition). Philip Lief Group. is a good init. In Proceedings of ICLR (Conference Track). Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. 2017. Self-normalizing Nikola Mrkšic,´ Diarmuid Ó Séaghdha, Tsung-Hsien neural networks. CoRR, abs/1706.02515. Wen, Blaise Thomson, and Steve Young. 2017. Neu- Jonathan K. Kummerfeld, Taylor Berg-Kirkpatrick, ral belief tracker: Data-driven dialogue state track- and Dan Klein. 2015. An empirical analysis of op- ing. In Proceedings of ACL, pages 1777–1788. timization for max-margin NLP. In Proceedings of EMNLP, pages 273–279. Nikola Mrkšic,´ Diarmuid Ó Séaghdha, Blaise Thom- son, Milica Gašic,´ Pei-Hao Su, David Vandyke, Angeliki Lazaridou, Georgiana Dinu, and Marco Ba- Tsung-Hsien Wen, and Steve Young. 2015. Multi- roni. 2015. Hubness and pollution: Delving into domain dialog state tracking using recurrent neural cross-space mapping for zero-shot learning. In Pro- networks. In Proceedings of ACL, pages 794–799. ceedings of ACL, pages 270–280. Nikola Mrkšic,´ Diarmuid Ó Séaghdha, Blaise Thom- Ira Leviant and Roi Reichart. 2015. Separated by son, Milica Gašic,´ Lina Maria Rojas-Barahona, Pei- an un-common language: Towards judgment lan- Hao Su, David Vandyke, Tsung-Hsien Wen, and guage informed vector space modeling. CoRR, Steve Young. 2016. Counter-fitting word vectors abs/1508.00106. to linguistic constraints. In Proceedings of NAACL- HLT. Omer Levy and Yoav Goldberg. 2014. Dependency- based word embeddings. In Proceedings of ACL, Nikola Mrkšic,´ Ivan Vulic,´ Diarmuid Ó Séaghdha, Ira pages 302–308. Leviant, Roi Reichart, Milica Gašic,´ Anna Korho- Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Im- nen, and Steve Young. 2017. Semantic specialisa- proving distributional similarity with lessons learned tion of distributional word vector spaces using mono- from word embeddings. Transactions of the ACL, lingual and cross-lingual constraints. Transactions 3:211–225. of the ACL, 5:309–324.

Sihan Li, Jiantao Jiao, Yanjun Han, and Tsachy Vinod Nair and Geoffrey E. Hinton. 2010. Rectified Weissman. 2016. Demystifying ResNet. CoRR, linear units improve restricted Boltzmann machines. abs/1611.01186. In Proceedings of ICML, pages 807–814.

Quan Liu, Hui Jiang, Si Wei, Zhen-Hua Ling, and Roberto Navigli and Simone Paolo Ponzetto. 2012. Ba- Yu Hu. 2015. Learning semantic word embeddings belNet: The automatic construction, evaluation and based on ordinal knowledge constraints. In Proceed- application of a wide-coverage multilingual seman- ings of ACL, pages 1501–1511. tic network. Artificial Intelligence, 193:217–250.

526 Kim Anh Nguyen, Maximilian Köper, Sabine Ivan Vulic´ and Anna Korhonen. 2016b. On the role of Schulte im Walde, and Ngoc Thang Vu. 2017. seed lexicons in learning bilingual word embeddings. Hierarchical embeddings for hypernymy detection In Proceedings of ACL, pages 247–257. and directionality. In Proceedings of EMNLP, pages 233–243. Ivan Vulic´ and Nikola Mrkšic.´ 2018. Specialising word vectors for lexical entailment. In Proceedings of Kim Anh Nguyen, Sabine Schulte im Walde, and NAACL-HLT. Ngoc Thang Vu. 2016. Integrating distributional lexical contrast into word embeddings for antonym- Ivan Vulic,´ Nikola Mrkšic,´ and Anna Korhonen. 2017a. synonym distinction. In Proceedings of ACL, pages Cross-lingual induction and transfer of verb classes 454–459. based on word vector space specialisation. In Pro- ceedings of EMNLP, pages 2536–2548. Maximilian Nickel and Douwe Kiela. 2017. Poincaré embeddings for learning hierarchical representa- Ivan Vulic,´ Nikola Mrkšic,´ Roi Reichart, Diarmuid tions. In Proceedings of NIPS. Ó Séaghdha, Steve Young, and Anna Korhonen. 2017b. Morph-fitting: Fine-tuning word vector Masataka Ono, Makoto Miwa, and Yutaka Sasaki. spaces with simple language-specific rules. In Pro- 2015. -based antonym detection ceedings of ACL, pages 56–68. using thesauri and distributional information. In ´ Proceedings of NAACL-HLT, pages 984–989. Tsung-Hsien Wen, David Vandyke, Nikola Mrkšic, Milica Gašic,´ Lina M. Rojas-Barahona, Pei-Hao Su, Dominique Osborne, Shashi Narayan, and Shay Cohen. Stefan Ultes, and Steve Young. 2017. A network- 2016. Encoding prior knowledge with eigenword based end-to-end trainable task-oriented dialogue embeddings. Transactions of the ACL, 4:417–430. system. In Proceedings of EACL.

Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Jason Weston, Samy Bengio, and Nicolas Usunier. Benjamin Van Durme, and Chris Callison-Burch. 2011. WSABIE: Scaling up to large vocabulary Proceedings of IJCAI 2015. PPDB 2.0: Better paraphrase ranking, fine- image annotation. In , pages grained entailment relations, word embeddings, and 2764–2770. style classification. In Proceedings of ACL, pages John Wieting, Mohit Bansal, Kevin Gimpel, and Karen 425–430. Livescu. 2015. From paraphrase database to compo- sitional paraphrase model and back. Transactions of Jeffrey Pennington, Richard Socher, and Christopher the ACL, 3:345–358. Manning. 2014. Glove: Global vectors for word rep- resentation. In Proceedings of EMNLP, pages 1532– Jason D. Williams, Antoine Raux, and Matthew Hen- 1543. derson. 2016. The Dialog State Tracking Challenge series: A review. Dialogue & Discourse, 7(3):4–33. Prajit Ramachandran, Barret Zoph, and Quoc V. Le. 2017. Searching for activation functions. CoRR, Chang Xu, Yalong Bai, Jiang Bian, Bin Gao, Gang abs/1710.05941. Wang, Xiaoguang Liu, and Tie-Yan Liu. 2014. RC- NET: A general framework for incorporating knowl- Sascha Rothe and Hinrich Schütze. 2015. AutoEx- edge into word representations. In Proceedings of tend: Extending word embeddings to embeddings CIKM, pages 1219–1228. for synsets and lexemes. In Proceedings of ACL, pages 1793–1803. Steve Young. 2010. Cognitive User Interfaces. IEEE Signal Processing Magazine. Sebastian Ruder, Ivan Vulic,´ and Anders Søgaard. 2017. A survey of cross-lingual embedding models. CoRR, Mo Yu and Mark Dredze. 2014. Improving lexical em- abs/1706.04902. beddings with semantic knowledge. In Proceedings of ACL, pages 545–550. Roy Schwartz, Roi Reichart, and Ari Rappoport. 2015. Symmetric pattern based word embeddings for im- Jingwei Zhang, Jeremy Salwen, Michael Glass, and Al- proved word similarity prediction. In Proceedings fio Gliozzo. 2014. Word semantic representations of CoNLL, pages 258–267. using bayesian probabilistic tensor factorization. In Proceedings of EMNLP, pages 1522–1531. Pei-Hao Su, Milica Gašic,´ Nikola Mrkšic,´ Lina Rojas- Barahona, Stefan Ultes, David Vandyke, Tsung- Hsien Wen, and Steve Young. 2016. Continuously learning neural dialogue management. In arXiv preprint: 1606.02689.

Ivan Vulic´ and Anna Korhonen. 2016a. Is "universal syntax" universally useful for learning distributed word representations? In Proceedings of ACL, pages 518–524.
