embeddings (Nickel and Kiela, 2017, 2018) as follows: Roller et al. (2018) showed that Hearst patterns can provide important constraints for hypernymy extraction from distributional contexts. However, it is also well known that Hearst patterns suffer from missing and incorrect extractions, as words must co-occur in exactly the right pattern to be detected successfully. For this reason, we first extract potential is-a relationships from a corpus using Hearst patterns and build a directed weighted graph from these extractions. We then embed this Hearst graph in hyperbolic space to infer missing hypernymy relations and remove wrong extractions. By using hyperbolic space for the embedding, we can exploit the following important advantages:

Consistency. Hyperbolic entailment cones (Ganea et al., 2018) allow us to enforce transitivity of is-a relations in the entire embedding space. This improves the taxonomic consistency of the model, as it enforces that (x, is-a, z) holds if (x, is-a, y) and (y, is-a, z) hold. To improve optimization properties, we also propose a new method to compute hyperbolic entailment cones in the Lorentz model of hyperbolic space.

Efficiency. Hyperbolic space allows for very low-dimensional embeddings of graphs with latent hierarchies and heavy-tailed degree distributions. For embedding large Hearst graphs – which exhibit both properties (e.g., see Figure 2) – this is an important advantage. In our experiments, we will show that hyperbolic embeddings allow us to decrease the embedding dimension by over an order of magnitude while outperforming SVD-based methods.

Interpretability. In hyperbolic embeddings, similarity is captured via distance while the generality of terms is captured through their norms. This makes it easy to interpret the embeddings with regard to their hierarchical structure and allows us to gain additional insights, e.g., about a term's degree of generality.

Figure 1 shows an example of a two-dimensional embedding of the Hearst graph that we use in our experiments. Although we will use higher dimensionalities for our final embedding, the visualization serves as a good illustration of the hierarchical structure that is obtained through the embedding.

2 Related Work

Hypernym detection. Detecting is-a relations from text is a long-standing task in natural language processing. A popular approach is to exploit high-precision lexico-syntactic patterns, as first proposed by Hearst (1992). These patterns may be predefined or learned automatically (Snow et al., 2005; Shwartz et al., 2016; Nakashole et al., 2012). However, it is well known that such pattern-based methods suffer significantly from missing extractions, as terms must occur in exactly the right configuration to be detected (Shwartz et al., 2016; Roller et al., 2018). Recent works improve coverage by leveraging search engines (Kozareva and Hovy, 2010) or by exploiting web-scale corpora (Seitner et al., 2016), but these also come with precision trade-offs.

To overcome the sparse extractions of pattern-based methods, focus has recently shifted to distributional approaches, which provide rich representations of lexical meaning. These methods alleviate the sparsity issue but also require specialized similarity measures to distinguish different lexical relationships. To date, most measures are inspired by the Distributional Inclusion Hypothesis (DIH; Geffet and Dagan 2005), which hypothesizes that for a subsumption relation (cat, is-a, mammal) the subordinate term (cat) should appear in a subset of the contexts in which the superior term (mammal) occurs. Unsupervised methods for hypernymy detection based on distributional approaches include WeedsPrec (Weeds et al., 2004), invCL (Lenci and Benotto, 2012), SLQS (Santus et al., 2014), and DIVE (Chang et al., 2018). Distributional representations that are based on positional or dependency-based contexts may also capture crude Hearst-pattern-like features (Levy et al., 2015; Roller and Erk, 2016). Shwartz et al. (2017) showed that such contexts play an important role for the success of distributional methods. Camacho-Collados et al. (2018) proposed a new shared task for hypernym retrieval from text corpora.

Recently, Roller et al. (2018) performed a systematic study of unsupervised distributional and pattern-based approaches for hypernym detection. Their results showed that pattern-based methods are able to outperform DIH-based methods on several challenging hypernymy benchmarks. Key aspects to good performance were the extraction of patterns from large text corpora and the use of embedding methods to overcome the sparsity issue. Our work builds on these findings by replacing their embeddings with ones that have a natural hierarchical structure.

    X which is a (example | class | kind | . . . ) of Y
    X (and | or) (any | some) other Y
    X which is called Y
    X is JJS (most)? Y
    X a special case of Y
    X is an Y that
    X is a !(member | part | given) Y
    !(features | properties) Y such as X1, X2, . . .
    (Unlike | like) (most | all | any | other) Y, X
    Y including X1, X2, . . .

Table 1: Hearst patterns used in this study. Patterns are lemmatized, but listed as inflected for clarity.

Figure 2: Frequency distribution of words appearing in the Hearst pattern corpus (on a log-log scale). [Figure omitted; axes: Rank (log scale) vs. Frequency (log scale).]

Taxonomy induction. Although detecting hypernymy relationships is an important and difficult task, these systems alone do not produce rich taxonomic graph structures (Camacho-Collados, 2017), and complete taxonomy induction may be seen as a parallel and complementary task.

Many works in this area consider a taxonomic graph as the starting point, and consider a variety of methods for growing or discovering areas of the graph. For example, Snow et al. (2006) train a classifier to predict the likelihood of an edge in WordNet, and suggest new undiscovered edges, while Kozareva and Hovy (2010) propose an algorithm which repeatedly crawls for new edges using a web search engine and an initial seed taxonomy. Cimiano et al. (2005) considered learning ontologies using Formal Concept Analysis. Similar works consider noisy graphs discovered from Hearst patterns, and provide algorithms for pruning edges until a strict hierarchy remains (Velardi et al., 2005; Kozareva and Hovy, 2010; Velardi et al., 2013). Maedche and Staab (2001) proposed a method to learn ontologies in a Semantic Web context.

Embeddings. Recently, works have proposed a variety of graph embedding techniques for representing and recovering hierarchical structure. Order-embeddings (Vendrov et al., 2016) represent text and images with embeddings where the ordering over individual dimensions forms a partially ordered set. Hyperbolic embeddings represent words in hyperbolic manifolds such as the Poincaré ball and may be viewed as a continuous analogue to tree-like structures (Nickel and Kiela, 2017, 2018). Recently, Tifrea et al. (2018) also proposed an extension of GLOVE (Pennington et al., 2014) to hyperbolic space. In addition, works have considered how distributional co-occurrences may be used to augment order-embeddings (Li et al., 2018) and hyperbolic embeddings (Dhingra et al., 2018). Further methods have focused on the often complex overlapping structure of word classes, and induced hierarchies using box-lattice structures (Vilnis et al., 2018) and Gaussian word embeddings (Athiwaratkun and Wilson, 2018). Compared to many of the purely graph-based works, these methods generally require supervision of hierarchical structure, and cannot learn taxonomies using only unstructured noisy data.

3 Methods

In the following, we discuss our method for unsupervised learning of concept hierarchies. We first discuss the extraction and construction of the Hearst graph, followed by a description of the hyperbolic embeddings.

3.1 Hearst Graph

The main idea introduced by Hearst (1992) is to exploit certain lexico-syntactic patterns to detect is-a relationships in natural language. For instance, patterns like "NPy such as NPx" or "NPx and other NPy" often indicate a hypernymy relationship (u, is-a, v). By treating unique noun phrases as nodes in a large, directed graph, we may construct a Hearst graph using only unstructured text and very limited prior knowledge in the form of patterns. Table 1 lists the only patterns that we use in this work. Formally, let $E = \{(u, v)\}_{i=1}^N$ denote the set of is-a relationships that have been extracted from a corpus. Furthermore, let $w(u, v)$ denote how often we have extracted the relationship (u, is-a, v). We then represent the extracted patterns as a weighted directed graph $G = (V, E, w)$.
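As a concrete illustration of this construction, the sketch below builds such a weighted Hearst graph from raw sentences. It is a minimal, simplified reading of the procedure: the single regular expression stands in for the lemmatized, POS-tagged patterns of Table 1, and the function and variable names (build_hearst_graph, SUCH_AS) are ours, not part of the original pipeline.

```python
import re
from collections import Counter

# Simplified stand-in for one Hearst pattern from Table 1 ("Y such as X");
# the real pipeline matches lemmatized, POS-tagged noun phrases.
SUCH_AS = re.compile(r"(\w+) such as (\w+)")

def build_hearst_graph(sentences):
    """Count extracted pairs: w[(u, v)] = number of times (u, is-a, v) was matched."""
    w = Counter()
    for sentence in sentences:
        for y, x in SUCH_AS.findall(sentence):
            w[(x, y)] += 1                 # x is the narrow term, y the broad term
    nodes = {term for pair in w for term in pair}
    return nodes, w                        # G = (V, E, w), with E the support of w

# "animals such as cats" yields the edge (cats, is-a, animals)
V, w = build_hearst_graph(["animals such as cats", "animals such as dogs"])
print(w[("cats", "animals")])              # 1
```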

Let $\Theta = \{v_i\}_{i=1}^M$ be the set of embeddings. To find an embedding that minimizes the overall energy, we then solve the optimization problem
\[
\hat{\Theta} = \operatorname*{arg\,min}_{\Theta \in \mathcal{H}^n} \sum_{u, v \in V} L(u, v) \tag{4}
\]
where
\[
L(u, v) = \begin{cases} E(u, v) & \text{if } (u, v) \in E \\ \max(0, \gamma - E(u, v)) & \text{otherwise} \end{cases}
\]
is the max-margin loss as used in (Ganea et al., 2018; Vendrov et al., 2016). The goal of Equation (4) is to find a joint embedding of all terms that best explains the observed Hearst patterns.

To solve Equation (4), we follow Nickel and Kiela (2018) and perform stochastic optimization via Riemannian SGD (RSGD; Bonnabel 2013). In RSGD, updates to the parameters $v$ are computed via
\[
v_{t+1} = \exp_{v_t}\!\bigl(-\eta \operatorname{grad} f(v_t)\bigr) \tag{5}
\]
where $\operatorname{grad} f(v_t)$ denotes the Riemannian gradient and $\eta$ denotes the learning rate. In Equation (5), the Riemannian gradient of $f$ at $v$ is computed via
\[
\operatorname{grad} f(v_t) = \operatorname{proj}_{v_t}\!\bigl(g_\ell^{-1} \nabla f(v)\bigr)
\]
where $\nabla f(v)$ denotes the Euclidean gradient of $f$ and where
\[
\operatorname{proj}_v(x) = x + \langle v, x \rangle_{\mathcal{L}}\, v \qquad\text{and}\qquad g_\ell^{-1}(v) = \operatorname{diag}([-1, 1, \ldots, 1])
\]
denote the projection from the ambient space $\mathbb{R}^{n+1}$ onto the tangent space $T_v\mathcal{L}^n$ and the inverse of the metric tensor, respectively. Finally, the exponential map for $\mathcal{L}^n$ is computed via
\[
\exp_v(x) = \cosh(\lVert x \rVert_{\mathcal{L}})\, v + \sinh(\lVert x \rVert_{\mathcal{L}})\, \frac{x}{\lVert x \rVert_{\mathcal{L}}}
\]
where $\lVert v \rVert_{\mathcal{L}} = \sqrt{\langle v, v \rangle_{\mathcal{L}}}$ and $v \in T_x\mathcal{L}^n$.

As suggested by Nickel and Kiela (2018), we initialize the embeddings close to the origin of $\mathcal{L}^n$ by sampling from the uniform distribution $\mathcal{U}(-0.001, 0.001)$ and by setting $v_0$ to $\sqrt{1 + \lVert v' \rVert^2}$, which ensures that the sampled points are located on the surface of the hyperboloid.
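To make the update in Equations (4)–(5) concrete, the following is a minimal NumPy sketch of one RSGD step in the Lorentz model. It is an illustration under our own naming (lorentz_inner, proj, exp_map, rsgd_step), not the authors' implementation; in particular, the Euclidean gradient of the entailment-cone energy E(u, v) from Section 3.2 is passed in as an abstract argument.

```python
import numpy as np

def lorentz_inner(u, v):
    """Lorentzian inner product <u, v>_L on R^{n+1}."""
    return -u[0] * v[0] + np.dot(u[1:], v[1:])

def proj(v, x):
    """Project x from the ambient space onto the tangent space T_v L^n."""
    return x + lorentz_inner(v, x) * v

def exp_map(v, x):
    """Exponential map exp_v(x) for a tangent vector x at v."""
    norm = np.sqrt(max(lorentz_inner(x, x), 1e-12))
    return np.cosh(norm) * v + np.sinh(norm) * x / norm

def rsgd_step(v, euclidean_grad, lr=0.1):
    """One RSGD update: v_{t+1} = exp_v(-lr * grad f(v))."""
    h = euclidean_grad.copy()
    h[0] = -h[0]                           # g_l^{-1} = diag([-1, 1, ..., 1])
    return exp_map(v, -lr * proj(v, h))

# Initialization as suggested by Nickel and Kiela (2018):
# sample v' ~ U(-0.001, 0.001)^n and set v_0 = sqrt(1 + ||v'||^2)
rng = np.random.default_rng(0)
v_prime = rng.uniform(-0.001, 0.001, size=10)
v = np.concatenate(([np.sqrt(1.0 + v_prime @ v_prime)], v_prime))
v = rsgd_step(v, euclidean_grad=rng.normal(size=11))
print(lorentz_inner(v, v))                  # approx. -1: v stays on the hyperboloid
```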
4 Experiments

To evaluate the efficacy of our method, we evaluate on several commonly-used hypernymy benchmarks (as described in Roller et al. (2018)) as well as in a reconstruction setting (as described in Nickel and Kiela (2017)). Following Roller et al. (2018), we compare to the following methods for unsupervised hypernymy detection:

Pattern-Based Models. Let $E = \{(x, y)\}_{i=1}^N$ be the set of Hearst patterns in our corpus, $w(x, y)$ be the count of how many times $(x, y)$ occurs in $E$, and $W = \sum_{(x,y) \in E} w(x, y)$. We then consider the following pattern-based methods:

Count Model (p). This model simply outputs the count, or equivalently, the extraction probabilities of Hearst patterns, i.e.,
\[
p(x, y) = \frac{w(x, y)}{W}
\]

PPMI Model (ppmi). To correct for skewed occurrence probabilities, the PPMI model predicts hypernymy relations based on the Positive Pointwise Mutual Information over the Hearst pattern corpus. Let $p^{-}(x) = \sum_{(x,y) \in E} w(x, y)/W$ and $p^{+}(x) = \sum_{(y,x) \in E} w(y, x)/W$, then:
\[
\operatorname{ppmi}(x, y) = \max\!\left(0,\; \log \frac{p(x, y)}{p^{-}(x)\, p^{+}(y)}\right)
\]

SVD Count (sp). To account for missing relations, we also compare against low-rank embeddings of the Hearst corpus using Singular Value Decomposition (SVD). Specifically, let $X \in \mathbb{R}^{M \times M}$, such that $X_{ij} = w(i, j)/W$, and let $U \Sigma V^{\top}$ be the singular value decomposition of $X$, then:
\[
\operatorname{sp}(x, y) = u_x^{\top} \Sigma_r v_y
\]

SVD PPMI (spmi). We also evaluate against the SVD of the PPMI matrix, which is identical to sp(i, j), with the exception that $X_{ij} = \operatorname{ppmi}(i, j)$ instead of $p(i, j)$. Roller et al. (2018) showed that this method provides state-of-the-art results for unsupervised hypernymy detection.

Hyperbolic Embeddings (HypeCones). We embed the Hearst graph into hyperbolic space as described in Section 3.2. At evaluation time, we predict the likelihood using the model energy E(u, v).

Distributional Models. The distributional models in our evaluation are based on the DIH, i.e., the idea that contexts in which a narrow term x (e.g., cat) may appear should be a subset of the contexts in which a broader term y (e.g., animal) may appear.
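As an illustration of the pattern-based baselines defined above, the following sketch computes the p, ppmi, and SVD-based scores from a raw count matrix. It is a simplified reading of the definitions with our own function name (pattern_scores) and a hypothetical rank parameter r; it is not the evaluation code used in the paper.

```python
import numpy as np

def pattern_scores(C, r=5):
    """C[i, j]: count of Hearst extractions (term i, is-a, term j)."""
    W = C.sum()
    p = C / W                                     # Count model: p(x, y)
    p_minus = p.sum(axis=1, keepdims=True)        # p^-(x): x as hyponym
    p_plus = p.sum(axis=0, keepdims=True)         # p^+(y): y as hypernym
    with np.errstate(divide="ignore", invalid="ignore"):
        ppmi = np.maximum(0.0, np.log(p / (p_minus * p_plus)))
    ppmi = np.nan_to_num(ppmi)                    # undefined cells (0/0) -> 0
    # sp / spmi: scores from a rank-r SVD of p (resp. ppmi)
    def svd_score(X):
        U, S, Vt = np.linalg.svd(X)
        return (U[:, :r] * S[:r]) @ Vt[:r, :]
    return p, ppmi, svd_score(p), svd_score(ppmi)

# Toy usage with a 3-term vocabulary
C = np.array([[0., 2., 1.],
              [0., 0., 3.],
              [0., 0., 0.]])
p, ppmi, sp, spmi = pattern_scores(C, r=2)
print(spmi[0, 1])   # smoothed score for (term 0, is-a, term 1)
```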

             Detection (AP)                        Direction (Acc.)          Graded (ρ)
             BLESS  EVAL  LEDS  SHWARTZ  WBLESS    BLESS  WBLESS  BIBLESS    HYPERLEX
Cosine        .12    .29   .71    .31      .53      .00     .54     .52        .14
WeedsPrec     .19    .39   .87    .43      .68      .63     .59     .45        .43
invCL         .18    .37   .89    .38      .66      .64     .60     .47        .43
SLQS          .15    .35   .60    .38      .69      .75     .67     .51        .16
p(x, y)       .49    .38   .71    .29      .74      .46     .69     .62        .62
ppmi(x, y)    .45    .36   .70    .28      .72      .46     .68     .61        .60
sp(x, y)      .66    .45   .81    .41      .91      .96     .84     .80        .51
spmi(x, y)    .76    .48   .84    .44      .96      .96     .87     .85        .53
HypeCones     .81    .50   .89    .50      .98      .94     .90     .87        .59

Table 2: Experimental results comparing distributional and pattern-based methods in all settings.

WeedsPrec. The first distributional model we consider is WeedsPrec (Weeds et al., 2004), which captures the features of $x$ which are included in the set of the more general term's features, $y$:
\[
\operatorname{WeedsPrec}(x, y) = \frac{\sum_{i=1}^{n} x_i \cdot \mathbb{1}_{y_i > 0}}{\sum_{i=1}^{n} x_i}
\]

invCL. Lenci and Benotto (2012) introduce the idea of distributional exclusion by also measuring the degree to which the broader term contains contexts not used by the narrower term. The degree of inclusion is denoted as:
\[
\operatorname{CL}(x, y) = \frac{\sum_{i=1}^{n} \min(x_i, y_i)}{\sum_{i=1}^{n} x_i}
\]
To measure the inclusion of $x$ and $y$ and the non-inclusion of $y$ in $x$, invCL is then computed as
\[
\operatorname{invCL}(x, y) = \sqrt{\operatorname{CL}(x, y) \cdot (1 - \operatorname{CL}(y, x))}
\]

SLQS. The SLQS model is based on the informativeness hypothesis (Santus et al., 2014; Shwartz et al., 2017), i.e., the idea that general words appear mostly in uninformative contexts, as measured by entropy. SLQS depends on the median entropy of a term's top $k$ contexts:
\[
E_x = \operatorname{median}_{i=1}^{k}\,[H(c_i)]
\]
where $H(c_i)$ is the Shannon entropy of context $c_i$ across all terms. SLQS is then defined as:
\[
\operatorname{SLQS}(x, y) = 1 - E_x / E_y
\]

Corpora and Preprocessing. We construct our Hearst graph using the same data, patterns, and procedure as described in Roller et al. (2018): Hearst patterns are extracted from the concatenation of GigaWord and Wikipedia. The corpus is tokenized, lemmatized, and POS-tagged using CoreNLP 3.8.0 (Manning et al., 2014). The full set of Hearst patterns is provided in Table 1. These include prototypical Hearst patterns, like "animals [such as] big cats", as well as broader patterns like "New Year [is the most important] holiday." Noun phrases were allowed to match limited modifiers, and produced additional hits for the head of the noun phrase. The final corpus contains circa 4.5M matched pairs, 431K unique pairs, and 243K unique terms.

Hypernymy Tasks. We consider three distinct subtasks for evaluating the performance of these models for hypernymy prediction:

• Detection: Given a pair of words (u, v), determine if v is a hypernym of u.

• Direction: Given a pair (u, v), determine if u is more general than v or vice versa.

• Graded Entailment: Given a pair of words (u, v), determine the degree to which u is a v.

For detection, we evaluate all models on five commonly-used benchmark datasets: BLESS (Baroni and Lenci, 2011), LEDS (Baroni et al., 2012), EVAL (Santus et al., 2015), SHWARTZ (Shwartz et al., 2016), and WBLESS (Weeds et al., 2014). In addition to positive hypernymy relations, these datasets include negative samples in the form of random pairs, co-hyponymy, antonymy, meronymy, and adjectival relations. For directionality and graded entailment, we also use BIBLESS (Kiela et al., 2015) and HYPERLEX (Vulic et al., 2016). We refer to Roller et al. (2018) for an in-depth discussion of these datasets. For all models, we use the identical text corpus and tune hyperparameters on the validation sets.
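The DIH-based measures defined earlier in this section can be computed directly from context vectors. The sketch below assumes non-negative context-count vectors and uses our own function names; it is meant only to make the formulas above concrete.

```python
import numpy as np

def weeds_prec(x, y):
    """WeedsPrec(x, y): fraction of x's feature mass covered by y's features."""
    return np.sum(x * (y > 0)) / np.sum(x)

def cl(x, y):
    """Degree of inclusion CL(x, y)."""
    return np.sum(np.minimum(x, y)) / np.sum(x)

def inv_cl(x, y):
    """invCL(x, y) = sqrt(CL(x, y) * (1 - CL(y, x)))."""
    return np.sqrt(cl(x, y) * (1.0 - cl(y, x)))

# Toy context-count vectors for a narrow term (cat) and a broad term (animal)
cat = np.array([3.0, 2.0, 0.0, 0.0])
animal = np.array([4.0, 3.0, 5.0, 1.0])
print(weeds_prec(cat, animal), inv_cl(cat, animal))   # high inclusion of cat in animal
```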

             Animals                       Plants                        Vehicles
             All     Missing  Transitive   All     Missing  Transitive   All    Missing  Transitive
p(x, y)      350.18  512.28   455.27       271.38  393.98   363.73       43.12  82.57    66.10
ppmi(x, y)   350.47  512.28   455.38       271.40  393.98   363.76       43.20  82.57    66.16
sp(x, y)      56.56   77.10    11.22        43.40   64.70    17.88        9.19  26.98    14.84
spmi(x, y)    58.40  102.56    12.37        40.61   71.81    14.80        9.62  17.96     3.03
HypeCones     25.33   37.60     4.37        17.00   31.53     6.36        5.12  10.28     2.74
∆%            56.6    51.2     61.1         58.1    51.3     57.0         44.3  42.8      9.6

Table 3: Reconstruction of Animals, Plants, and Vehicles subtrees in WORDNET.

Table 2 shows the results for all tasks. It can be seen that our proposed approach provides substantial gains on the detection and directionality tasks and, overall, achieves state-of-the-art results on seven of nine benchmarks. In addition, our method clearly outperforms other embedding-based approaches on HYPERLEX, although it cannot fully match the count-based methods. As Roller et al. (2018) noted, this might be an artifact of the evaluation metric, as count-based methods benefit from their sparse predictions in this setting.

Our method also achieves strong performance when compared to Poincaré GLOVE on the task of hypernymy prediction. While Tifrea et al. (2018) report Spearman's rho ρ = 0.421 on HYPERLEX and accuracy ACC = 0.790 on WBLESS, our method achieves ρ = 0.59 (HYPERLEX) and ACC = 0.909 (WBLESS). This illustrates the importance of the distributional constraints that are provided by Hearst patterns.

An additional benefit is the efficiency of our embeddings. For all tasks, we have used a 20-dimensional embedding for HYPECONES, while the best results for SVD-based methods have been achieved with 300 dimensions. This reduction in parameters by over an order of magnitude clearly highlights the efficiency of hyperbolic embeddings for representing hierarchical structures.

Reconstruction. In the following, we compare embedding and pattern-based methods on the task of reconstructing an entire subtree of WORDNET, i.e., the animals, plants, and vehicles taxonomies, as proposed by Kozareva and Hovy (2010). In addition to predicting the existence of single hypernymy relations, this allows us to evaluate the performance of these models for inferring full taxonomies and to perform an ablation for the prediction of missing and transitive relations. We follow previous work (Bordes et al., 2013; Nickel and Kiela, 2017) and report, for each observed relation (u, v) in WORDNET, its score ranked against the scores of the ground-truth negative edges. In Table 3, All refers to the ranking of all edges in the subtree, Missing to edges that are not included in the Hearst graph G, and Transitive to missing transitive edges in G (i.e., all edges $\{(x, z) : (x, y), (y, z) \in E \wedge (x, z) \notin E\}$).
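The ranking protocol just described can be illustrated with a short sketch. The names edge_rank and mean_rank are ours, and the mean rank is shown only as one common way to aggregate per-edge ranks; the sketch assumes that each observed relation comes with the scores of its ground-truth negative candidates.

```python
import numpy as np

def edge_rank(true_score, negative_scores):
    """Rank of a true edge's score among negative-edge scores (1 = best)."""
    return 1 + int(np.sum(np.asarray(negative_scores) > true_score))

def mean_rank(scored_edges):
    """scored_edges: list of (true_score, negative_scores) per observed relation."""
    return float(np.mean([edge_rank(s, negs) for s, negs in scored_edges]))

# Toy example: two observed relations with their negative candidates
edges = [(0.9, [0.2, 0.5, 0.95]),   # rank 2
         (0.7, [0.1, 0.3, 0.4])]    # rank 1
print(mean_rank(edges))             # 1.5
```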
We fol- show that our embeddings achieve state-of-the-art low previous work (Bordes et al., 2013; Nickel performance on a variety of commonly-used hyper- and Kiela, 2017) and report for each observed nymy benchmarks.

References

Rolf Apweiler, Amos Bairoch, Cathy H Wu, Winona C Barker, Brigitte Boeckmann, Serenella Ferro, Elisabeth Gasteiger, Hongzhan Huang, Rodrigo Lopez, Michele Magrane, et al. 2004. UniProt: the universal protein knowledgebase. Nucleic Acids Research, 32(suppl 1):D115–D119.

Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein, Heather Butler, J Michael Cherry, Allan P Davis, Kara Dolinski, Selina S Dwight, Janan T Eppig, et al. 2000. Gene Ontology: tool for the unification of biology. Nature Genetics, 25(1):25.

Ben Athiwaratkun and Andrew Gordon Wilson. 2018. Hierarchical density order embeddings. In Proceedings of the International Conference on Learning Representations.

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735. Springer.

Marco Baroni, Raffaella Bernardi, Ngoc-Quynh Do, and Chung-chieh Shan. 2012. Entailment above the word level in distributional semantics. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 23–32. Association for Computational Linguistics.

Marco Baroni and Alessandro Lenci. 2011. How we BLESSed distributional semantic evaluation. In Proceedings of the 2011 Workshop on GEometrical Models of Natural Language Semantics, pages 1–10, Edinburgh, UK.

Tim Berners-Lee, James Hendler, and Ora Lassila. 2001. The semantic web. Scientific American, 284(5):34–43.

Silvere Bonnabel. 2013. Stochastic gradient descent on Riemannian manifolds. IEEE Trans. Automat. Contr., 58(9):2217–2229.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.

Jose Camacho-Collados. 2017. Why we have switched from building full-fledged taxonomies to simply detecting hypernymy relations. arXiv preprint arXiv:1703.04178.

Jose Camacho-Collados, Claudio Delli Bovi, Luis Espinosa Anke, Sergio Oramas, Tommaso Pasini, Enrico Santus, Vered Shwartz, Roberto Navigli, and Horacio Saggion. 2018. SemEval-2018 task 9: Hypernym discovery. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 712–724.

Haw-Shiuan Chang, Ziyun Wang, Luke Vilnis, and Andrew McCallum. 2018. Distributional inclusion vector embedding for unsupervised hypernymy detection. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 485–495, New Orleans, Louisiana. Association for Computational Linguistics.

Philipp Cimiano, Andreas Hotho, and Steffen Staab. 2005. Learning concept hierarchies from text corpora using formal concept analysis. Journal of Artificial Intelligence Research, 24:305–339.

Ido Dagan, Bill Dolan, Bernardo Magnini, and Dan Roth. 2010. Recognizing textual entailment: Rational, evaluation and approaches – erratum. Natural Language Engineering, 16(1):105–105.

Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. In OSDI'04: Sixth Symposium on Operating System Design and Implementation, pages 137–150, San Francisco, CA.

Bhuwan Dhingra, Christopher Shallue, Mohammad Norouzi, Andrew Dai, and George Dahl. 2018. Embedding text in hyperbolic spaces. In Proceedings of the Twelfth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-12), pages 59–69, New Orleans, Louisiana, USA. Association for Computational Linguistics.

Octavian-Eugen Ganea, Gary Bécigneul, and Thomas Hofmann. 2018. Hyperbolic entailment cones for learning hierarchical embeddings. arXiv preprint arXiv:1804.01882.

Maayan Geffet and Ido Dagan. 2005. The distributional inclusion hypotheses and lexical entailment. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 107–114. Association for Computational Linguistics.

Gene Ontology Consortium. 2016. Expansion of the gene ontology knowledgebase and resources. Nucleic Acids Research, 45(D1):D331–D338.

Marti A Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics - Volume 2, pages 539–545. Association for Computational Linguistics.

Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, and Gerhard Weikum. 2013. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence, 194:28–61.

Douwe Kiela, Laura Rimell, Ivan Vulic, and Stephen Clark. 2015. Exploiting image generality for lexical entailment detection. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015), pages 119–124. ACL.

Zornitsa Kozareva and Eduard Hovy. 2010. A semi-supervised method to learn and construct taxonomies using the web. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1110–1118. Association for Computational Linguistics.

Douglas B. Lenat. 1995. Cyc: a large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38.

Alessandro Lenci and Giulia Benotto. 2012. Identifying hypernyms in distributional semantic spaces. In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 75–79. Association for Computational Linguistics.

Omer Levy, Steffen Remus, Chris Biemann, and Ido Dagan. 2015. Do supervised distributional methods really learn lexical inference relations? In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 970–976.

Xiang Li, Luke Vilnis, and Andrew McCallum. 2018. Improved representation learning for predicting commonsense ontologies. In International Conference on Machine Learning Workshop on Deep Structured Prediction.

Dekang Lin. 1998. An information-theoretic definition of similarity. In Proceedings of the 14th International Conference on Machine Learning, volume 98, pages 296–304.

Carolus Linnaeus et al. 1758. Systema naturae, Vol. 1.

Alexander Maedche and Steffen Staab. 2001. Ontology learning for the semantic web. IEEE Intelligent Systems, 16(2):72–79.

Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

George Miller and Christiane Fellbaum. 1998. WordNet: An electronic lexical database.

George A Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J Miller. 1990. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3(4):235–244.

Ndapandula Nakashole, Gerhard Weikum, and Fabian Suchanek. 2012. PATTY: a taxonomy of relational patterns with semantic types. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1135–1145. Association for Computational Linguistics.

Maximilian Nickel and Douwe Kiela. 2017. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems 30, pages 6338–6347. Curran Associates, Inc.

Maximilian Nickel and Douwe Kiela. 2018. Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In Proceedings of the Thirty-fifth International Conference on Machine Learning.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Philip Stuart Resnik. 1993. Selection and information: a class-based approach to lexical relationships. IRCS Technical Reports Series, page 200.

FB Rogers. 1963. Medical subject headings. Bulletin of the Medical Library Association, 51:114–116.

Stephen Roller and Katrin Erk. 2016. Relations such as hypernymy: Identifying and exploiting Hearst patterns in distributional vectors for lexical entailment. arXiv preprint arXiv:1605.05433.

Stephen Roller, Douwe Kiela, and Maximilian Nickel. 2018. Hearst patterns revisited: Automatic hypernym detection from large text corpora. arXiv preprint arXiv:1806.03191.

Enrico Santus, Alessandro Lenci, Qin Lu, and Sabine Schulte im Walde. 2014. Chasing hypernyms in vector spaces with entropy. In 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 38–42. EACL.

Enrico Santus, Frances Yung, Alessandro Lenci, and Chu-Ren Huang. 2015. EVALution 1.0: an evolving semantic dataset for training and evaluation of distributional semantic models. In Proceedings of the 4th Workshop on Linked Data in Linguistics: Resources and Applications, pages 64–69.

Julian Seitner, Christian Bizer, Kai Eckert, Stefano Faralli, Robert Meusel, Heiko Paulheim, and Simone Paolo Ponzetto. 2016. A large database of hypernymy relations extracted from the web. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, May 23-28, 2016.

Vered Shwartz, Yoav Goldberg, and Ido Dagan. 2016. Improving hypernymy detection with an integrated path-based and distributional method. arXiv preprint arXiv:1603.06076.

Vered Shwartz, Enrico Santus, and Dominik Schlechtweg. 2017. Hypernyms under siege: Linguistically-motivated artillery for hypernymy detection. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 65–75, Valencia, Spain. Association for Computational Linguistics.

GO Simms. 1992. The ICD-10 classification of mental and behavioural disorders: clinical descriptions and diagnostic guidelines, volume 1. World Health Organization.

Rion Snow, Daniel Jurafsky, and Andrew Y Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. In Advances in Neural Information Processing Systems, pages 1297–1304.

Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2006. Semantic taxonomy induction from heterogenous evidence. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 801–808. Association for Computational Linguistics.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, WWW 2007, Banff, Alberta, Canada, May 8-12, 2007, pages 697–706.

Alexandru Tifrea, Gary Bécigneul, and Octavian-Eugen Ganea. 2018. Poincaré GloVe: Hyperbolic word embeddings. arXiv preprint arXiv:1810.06546.

Paola Velardi, Stefano Faralli, and Roberto Navigli. 2013. OntoLearn reloaded: A graph-based algorithm for taxonomy induction. Computational Linguistics, 39(3):665–707.

Paola Velardi, Roberto Navigli, Alessandro Cuchiarelli, and R Neri. 2005. Evaluation of OntoLearn, a methodology for automatic learning of domain ontologies. Ontology Learning from Text: Methods, Evaluation and Applications, 123(92).

Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. 2016. Order-embeddings of images and language. In Proceedings of the International Conference on Learning Representations (ICLR), volume abs/1511.06361.

Luke Vilnis, Xiang Li, Shikhar Murty, and Andrew McCallum. 2018. Probabilistic embedding of knowledge graphs with box lattice measures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 263–272. Association for Computational Linguistics.

Ivan Vulic, Daniela Gerz, Douwe Kiela, Felix Hill, and Anna Korhonen. 2016. HyperLex: A large-scale evaluation of graded lexical entailment. arXiv preprint arXiv:1608.02117.

Julie Weeds, Daoud Clarke, Jeremy Reffin, David Weir, and Bill Keller. 2014. Learning to distinguish hypernyms and co-hyponyms. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2249–2259. Dublin City University and Association for Computational Linguistics.

Julie Weeds, David Weir, and Diana McCarthy. 2004. Characterising measures of lexical distributional similarity. In Proceedings of the 20th International Conference on Computational Linguistics, page 1015. Association for Computational Linguistics.

Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q Zhu. 2012. Probase: A probabilistic taxonomy for text understanding. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 481–492. ACM.

Amir R Zamir, Alexander Sax, William Shen, Leonidas J Guibas, Jitendra Malik, and Silvio Savarese. 2018. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3712–3722.
