HyperText: Endowing FastText with Hyperbolic Geometry

Yudong Zhu, Di Zhou, Jinghui Xiao, Xin Jiang, Xiao Chen, Qun Liu
Huawei Noah's Ark Lab
{zhuyudong3,zhoudi7,xiaojinghui4,Jiang.Xin,chen.xiao2,qun.liu}@huawei.com

Abstract

Natural language data exhibit tree-like hierarchical structures such as the hypernym-hyponym relations in WordNet. FastText, as the state-of-the-art text classifier based on a shallow neural network in Euclidean space, may not model such hierarchies precisely with limited representation capacity. Considering that hyperbolic space is naturally suitable for modeling tree-like hierarchical data, we propose a new model named HyperText for efficient text classification by endowing FastText with hyperbolic geometry. Empirically, we show that HyperText outperforms FastText on a range of text classification tasks with substantially fewer parameters.

[Figure 1: The architecture comparison of FastText (upper) and HyperText (lower). FastText stacks an embedding layer, average pooling and a linear classifier; HyperText replaces these with a Poincaré embedding layer, Einstein midpoint pooling and a Möbius linear (hyperbolic) classifier.]

1 Introduction

FastText (Joulin et al., 2016) is a simple and efficient neural network for text classification, which achieves comparable performance to many deep models such as char-CNN (Zhang et al., 2015) and VDCNN (Conneau et al., 2016), with a much lower computational cost in training and inference. However, natural language data exhibit tree-like hierarchies in several respects (Dhingra et al., 2018), such as the taxonomy of WordNet. In Euclidean space, the representation capacity of a model is strictly bounded by the number of parameters. Thus, conventional shallow neural networks (e.g., FastText) may not represent tree-like hierarchies efficiently given limited model complexity.

Fortunately, hyperbolic space is naturally suitable for modeling tree-like hierarchical data. Theoretically, hyperbolic space can be viewed as a continuous analogue of trees, and it can embed trees with arbitrarily low distortion (Krioukov et al., 2010). Experimentally, Nickel and Kiela (2017) first used the Poincaré ball model to embed hierarchical data into hyperbolic space and achieved promising results on learning word embeddings in WordNet.

Inspired by their work, we propose HyperText for text classification by endowing FastText with hyperbolic geometry. We base our method on the Poincaré ball model of hyperbolic space. Specifically, we exploit the Poincaré ball embedding of words or ngrams to capture the latent hierarchies in natural language sentences. Further, we use the Einstein midpoint (Gulcehre et al., 2018) as the pooling method to emphasize semantically specific words, which usually carry more information but occur less frequently than general ones (Dhingra et al., 2018). Finally, we employ the Möbius linear transformation (Ganea et al., 2018) to play the part of the hyperbolic classifier. We evaluate our approach on text classification using ten standard datasets and observe that HyperText outperforms FastText on eight of them. Besides, HyperText is much more parameter-efficient: across different tasks, only 17% ∼ 50% of the parameters of FastText are needed for HyperText to achieve comparable performance. Meanwhile, the computational cost of our model increases moderately (2.6x in inference time) over FastText.

2 Method

2.1 Overview

Figure 1 illustrates the connection and distinction between FastText and HyperText. The differences in model architecture are three-fold. First, the input tokens in HyperText are embedded using hyperbolic geometry, specifically the Poincaré ball model, instead of Euclidean geometry. Second, the Einstein midpoint is adopted in the pooling layer so as to emphasize semantically specific words. Last, the Möbius linear transformation is chosen as the prediction layer. Besides, Riemannian optimization is employed in training HyperText.

2.2 Poincaré Embedding Layer

There are several alternative models of hyperbolic space, such as the Poincaré ball model, the Hyperboloid model and the Klein model, which offer different affordances for computation. In HyperText, we choose the Poincaré ball model to embed the input words and ngrams so as to better exploit the latent hierarchical structure in text. The Poincaré ball model of hyperbolic space corresponds to the Riemannian manifold defined as follows:

$$\mathbb{P}^d = (\mathbb{B}^d, g_x), \qquad (1)$$

where $\mathbb{B}^d = \{x \in \mathbb{R}^d : \|x\| < 1\}$ is an open d-dimensional unit ball ($\|\cdot\|$ denotes the Euclidean norm) and $g_x$ is the Riemannian metric tensor:

$$g_x = \lambda_x^2 \, g^E, \qquad (2)$$

where $\lambda_x = \frac{2}{1-\|x\|^2}$ is the conformal factor and $g^E = \mathbf{I}_d$ denotes the Euclidean metric tensor. While performing back-propagation, we use the Riemannian gradients to update the Poincaré embeddings. The Riemannian gradients are computed by rescaling the Euclidean gradients:

$$\nabla_R f(x) = \frac{1}{\lambda_x^2} \nabla_E f(x). \qquad (3)$$

Since ngrams retain the sequence order information, given a text sequence $S = \{w_i\}_{i=1}^{m}$, we embed all the words and ngrams into the Poincaré ball, denoted as $X = \{x_i\}_{i=1}^{k}$, where $x_i \in \mathbb{B}^d$.

2.3 Einstein Midpoint Pooling Layer

Average pooling is the standard way to summarize features in FastText. In Euclidean space, the average pooling is

$$\bar{u} = \frac{\sum_{i=1}^{k} x_i}{k}. \qquad (4)$$

To extend the average pooling operation to hyperbolic space, we adopt a weighted midpoint method called the Einstein midpoint (Gulcehre et al., 2018). In the d-dimensional Klein model $\mathbb{K}^d$, the Einstein midpoint takes the weighted average of the embeddings, which is given by

$$\bar{m}_K = \frac{\sum_{i=1}^{k} \gamma_i x_i}{\sum_{i=1}^{k} \gamma_i}, \quad x_i \in \mathbb{K}^d, \qquad (5)$$

where $\gamma_i = \frac{1}{\sqrt{1-\|x_i\|^2}}$ are the Lorentz factors.

However, our embedding layer is based on the Poincaré model rather than the Klein model, which means we cannot directly compute the Einstein midpoints using Equation (5). Nevertheless, the various models commonly used for hyperbolic geometry are isometric, which means we can first project the input embeddings to the Klein model, execute the Einstein midpoint pooling, and then project the result back to the Poincaré model. The transition formulas between the Poincaré and Klein models are as follows:

$$x_K = \frac{2 x_P}{1 + \|x_P\|^2}, \qquad (6)$$

$$\bar{m}_P = \frac{\bar{m}_K}{1 + \sqrt{1 - \|\bar{m}_K\|^2}}, \qquad (7)$$

where $x_P$ and $x_K$ respectively denote token embeddings in the Poincaré and Klein models, and $\bar{m}_P$ and $\bar{m}_K$ are the Einstein midpoint pooling vectors in the Poincaré and Klein models. It should be noted that points near the boundary of the Poincaré ball receive larger weights in the Einstein midpoint formula. These points (tokens) are regarded as more representative and provide salient information for the text classification task (Dhingra et al., 2018).
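
As an illustration (not the authors' released implementation), here is a minimal NumPy sketch of Equations (5)-(7): it maps Poincaré-ball token embeddings to the Klein model, computes the Lorentz-factor-weighted Einstein midpoint, and maps the pooled vector back to the Poincaré ball. Function names and the small numerical epsilon are illustrative assumptions.

```python
import numpy as np

def poincare_to_klein(x_p):
    """Eq. (6): map points from the Poincaré ball to the Klein model."""
    sq_norm = np.sum(x_p ** 2, axis=-1, keepdims=True)
    return 2.0 * x_p / (1.0 + sq_norm)

def klein_to_poincare(x_k):
    """Eq. (7): map a point from the Klein model back to the Poincaré ball."""
    sq_norm = np.sum(x_k ** 2, axis=-1, keepdims=True)
    return x_k / (1.0 + np.sqrt(np.clip(1.0 - sq_norm, 0.0, None)))

def einstein_midpoint(x_p, eps=1e-5):
    """Eq. (5): Lorentz-weighted midpoint of Poincaré-ball embeddings.

    x_p: array of shape (k, d), every row strictly inside the unit ball.
    Returns the pooled vector in the Poincaré ball, shape (d,).
    """
    x_k = poincare_to_klein(x_p)                               # project to Klein model
    sq_norm = np.sum(x_k ** 2, axis=-1, keepdims=True)
    gamma = 1.0 / np.sqrt(np.clip(1.0 - sq_norm, eps, None))   # Lorentz factors
    m_k = (gamma * x_k).sum(axis=0) / gamma.sum()              # weighted average in Klein model
    return klein_to_poincare(m_k)                              # project back to Poincaré ball

# Tokens closer to the ball boundary (larger norm) receive larger Lorentz weights.
tokens = np.array([[0.10, 0.05], [0.60, 0.30], [-0.85, 0.40]])
print(einstein_midpoint(tokens))
```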

2.4 Möbius Linear Layer

The Möbius linear transformation is the analogue of a linear map in Euclidean neural networks. We use the Möbius linear layer to combine the features output by the pooling layer and complete the classification task, which takes the form

$$o = M \otimes \bar{m}_P \oplus b, \qquad (8)$$

where $\otimes$ and $\oplus$ denote the Möbius matrix multiplication and Möbius addition, defined as follows (Ganea et al., 2018):

$$M \otimes x = \frac{1}{\sqrt{c}} \tanh\!\left(\frac{\|Mx\|}{\|x\|} \tanh^{-1}\!\big(\sqrt{c}\,\|x\|\big)\right) \frac{Mx}{\|Mx\|},$$

$$x \oplus b = \frac{\big(1 + 2c\langle x, b\rangle + c\|b\|^2\big)\,x + \big(1 - c\|x\|^2\big)\,b}{1 + 2c\langle x, b\rangle + c^2\|x\|^2\|b\|^2},$$

where $M \in \mathbb{R}^{d \times n}$ denotes the weight matrix, $n$ denotes the number of classes, $b \in \mathbb{R}^n$ is the bias vector, and $c$ is a hyper-parameter that denotes the curvature of the hyperbolic space. In order to obtain the categorical probabilities $\hat{y}$, a softmax layer is applied after the Möbius linear layer:

$$\hat{y} = \mathrm{softmax}(o). \qquad (9)$$
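
A hedged NumPy sketch of the Möbius operations in Equation (8) is given below. It assumes unit curvature (c = 1) and a weight matrix of shape (n, d) acting on column vectors; the function names, the toy sizes and the numerical epsilons are assumptions for illustration, not the paper's own implementation.

```python
import numpy as np

def mobius_matvec(M, x, c=1.0, eps=1e-9):
    """Möbius matrix-vector multiplication M ⊗ x (Ganea et al., 2018)."""
    x_norm = max(np.linalg.norm(x), eps)
    Mx = M @ x
    Mx_norm = max(np.linalg.norm(Mx), eps)
    scale = np.tanh((Mx_norm / x_norm) * np.arctanh(np.sqrt(c) * x_norm))
    return scale * Mx / (np.sqrt(c) * Mx_norm)

def mobius_add(x, b, c=1.0):
    """Möbius addition x ⊕ b (Ganea et al., 2018)."""
    xb = np.dot(x, b)
    x2, b2 = np.dot(x, x), np.dot(b, b)
    num = (1 + 2 * c * xb + c * b2) * x + (1 - c * x2) * b
    den = 1 + 2 * c * xb + c ** 2 * x2 * b2
    return num / den

def hyperbolic_classifier(m_p, M, b, c=1.0):
    """Eqs. (8)-(9): Möbius linear layer followed by a softmax."""
    o = mobius_add(mobius_matvec(M, m_p, c), b, c)
    e = np.exp(o - o.max())
    return e / e.sum()

# Toy example: d = 2 pooled features, n = 3 classes (hypothetical sizes).
rng = np.random.default_rng(0)
M = rng.normal(scale=0.1, size=(3, 2))   # hypothetical weight matrix
b = np.zeros(3)                          # bias initialized at the origin of the ball
print(hyperbolic_classifier(np.array([0.2, -0.1]), M, b))
```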

2.5 Model Optimization

We use the cross-entropy loss for the multi-class classification task:

$$L = -\frac{1}{N} \sum_{i=1}^{N} y_i \cdot \log(\hat{y}_i), \qquad (10)$$

where $N$ is the number of training examples and $y_i$ is the one-hot representation of the ground-truth label. For training, we use the Riemannian optimizer (Bécigneul and Ganea, 2018), which is more accurate for hyperbolic models. We refer the reader to the original paper for more details.
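
To make the update of the Poincaré embeddings concrete, the following is a simplified Riemannian-SGD-style step: it rescales the Euclidean gradient as in Equation (3) and then clips the updated point back inside the unit ball. The paper itself uses the Riemannian adaptive optimizer of Bécigneul and Ganea (2018), which additionally involves retraction/exponential maps and adaptive learning rates, so this is only a sketch under those simplifying assumptions.

```python
import numpy as np

def riemannian_sgd_step(x, euclid_grad, lr=0.05, eps=1e-5):
    """One simplified update of a single Poincaré embedding row.

    Eq. (3): rescale the Euclidean gradient by 1 / lambda_x^2,
    then take a gradient step and keep the result strictly inside the ball.
    """
    lam = 2.0 / (1.0 - np.dot(x, x))          # conformal factor, Eq. (2)
    riem_grad = euclid_grad / (lam ** 2)      # Riemannian gradient, Eq. (3)
    x_new = x - lr * riem_grad                # first-order, retraction-free step
    norm = np.linalg.norm(x_new)
    if norm >= 1.0:                           # project back into the open unit ball
        x_new = x_new / norm * (1.0 - eps)
    return x_new

x = np.array([0.3, -0.2])
g = np.array([0.8, 0.1])                      # hypothetical Euclidean gradient
print(riemannian_sgd_step(x, g))
```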
3 Experiments

3.1 Experimental Setup

Datasets. To make a comprehensive comparison with FastText, we use the same eight datasets as Joulin et al. (2016) in our experiments. We also add two Chinese text classification datasets from the Chinese CLUE benchmark (Xu et al., 2020), which are presumably more challenging. The statistics of the datasets are summarized in Table 2.

Table 2: Dataset statistics.

Dataset | #Classes | #Train | #Test
AG | 4 | 120,000 | 7,600
Sogou | 5 | 450,000 | 60,000
DBP | 14 | 560,000 | 70,000
Yelp P. | 2 | 560,000 | 38,000
Yelp F. | 5 | 650,000 | 50,000
Yah. A. | 10 | 1,400,000 | 60,000
Amz. F. | 5 | 3,000,000 | 650,000
Amz. P. | 2 | 3,600,000 | 400,000
TNEWS | 15 | 53,360 | 10,000
IFLYTEK | 119 | 12,133 | 2,599

Hyperparameters. Following Joulin et al. (2016), we set the embedding dimension to 10 for the first eight datasets in Table 1. On the TNEWS and IFLYTEK datasets, we use 200-dimensional and 300-dimensional embeddings respectively. The learning rate is selected on a validation set from {0.001, 0.05, 0.01, 0.015}. In addition, we use the PKUSEG tool (Luo et al., 2019) for Chinese word segmentation.

3.2 Experimental Results

Comparison with FastText and deep models. The results of our experiments are displayed in Table 1. Our proposed HyperText model outperforms FastText on eight out of ten datasets, and the accuracy of HyperText is 0.7% higher than FastText on average. In addition, we observe that HyperText works significantly better than FastText on datasets with more label categories, such as Yah. A., TNEWS and IFLYTEK. This arguably confirms our hypothesis that HyperText can better model the hierarchical relationships of the underlying data and extract more discriminative features for classification. Moreover, HyperText outperforms DistilBERT (Sanh et al., 2019) and FastBERT (Liu et al., 2020), which are two distilled versions of BERT, and it achieves performance comparable to the very deep convolutional network VDCNN (Conneau et al., 2016), which consists of 29 convolutional layers. Overall, HyperText attains better or comparable classification accuracy to these deep models while requiring several orders of magnitude less computation.

Table 1: Accuracy (%) of different models. ∗The results of DistilBERT are cited from Liu et al. (2020).

Model | AG | Sogou | DBP | Yelp P. | Yelp F. | Yah. A. | Amz. F. | Amz. P. | TNEWS | IFLYTEK
FastText | 92.5 | 96.8 | 98.6 | 95.7 | 63.9 | 72.3 | 60.2 | 94.6 | 54.6 | 54.0
VDCNN | 91.3 | 96.8 | 98.7 | 95.7 | 64.7 | 73.4 | 63.0 | 95.7 | 54.8 | 55.4
DistilBERT (1-layer)∗ | 92.9 | - | 99.0 | 91.6 | 58.6 | 74.9 | 59.5 | - | - | -
FastBERT (speed=0.8) | 92.5 | - | 99.0 | 94.3 | 60.7 | 75.0 | 61.7 | - | - | -
HyperText | 93.2 | 97.3 | 98.5 | 96.1 | 64.6 | 74.3 | 60.1 | 94.6 | 55.9 | 55.2

Embedding Dimension. Since the input embeddings account for more than 90% of the model parameters, we investigate the impact of the input embedding dimension on classification accuracy. The experimental results are presented in Figure 2. As we can see, on most tasks HyperText performs consistently better than FastText across various dimension settings. In particular, on the IFLYTEK and TNEWS datasets, HyperText with 50-dimensional embeddings achieves better performance than FastText with 300-dimensional and 200-dimensional embeddings respectively. On the other eight less challenging datasets, the experiments are conducted in low-dimensional settings, and HyperText generally requires fewer dimensions to reach its optimal performance. This verifies that, thanks to its ability to capture the internal structure of text, the hyperbolic model is more parameter-efficient than its Euclidean competitor.
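
To see why the embedding table dominates the parameter count (the observation motivating this experiment), a quick back-of-the-envelope computation with hypothetical sizes (not the paper's actual vocabulary) is shown below.

```python
# Hypothetical sizes, for illustration only.
vocab_size = 500_000      # words plus hashed ngram buckets
dim = 10                  # embedding dimension used for the English datasets
num_classes = 4           # e.g. AG news

embedding_params = vocab_size * dim                    # 5,000,000
classifier_params = dim * num_classes + num_classes    # 44

total = embedding_params + classifier_params
print(f"embedding share of parameters: {embedding_params / total:.4%}")
```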

[Figure 2: Accuracy vs. embedding dimension. The x-axis represents the embedding dimension and the y-axis the accuracy (%). Panels: (a) TNEWS, (b) IFLYTEK, (c) AG, (d) Sogou, (e) DBP, (f) Yelp P., (g) Yelp F., (h) Yah. A., (i) Amz. F., (j) Amz. P.]

Computation in Inference. FastText is well known for its fast inference speed. We compare the FLOPs versus accuracy under different dimensions in Figure 3. Due to the additional non-linear operations, HyperText generally requires more (4.5 ∼ 6.7x) computation than FastText at the same dimension. But since HyperText is more parameter-efficient, when constrained to the same level of FLOPs, HyperText mostly outperforms FastText in classification accuracy. Besides, the FLOPs level of VDCNN is about 10^5 times higher than that of HyperText and FastText.

[Figure 3: FLOPs (×10^3) vs. accuracy (%) under different dimensions. The x-axis represents the FLOPs and the y-axis the accuracy; different points correspond to different embedding dimensions. Panels: (a) TNEWS, (b) AG, (c) DBP, (d) Yah. A.]

Ablation study. We conduct an ablation study to determine the contribution of each layer. The results on several datasets are presented in Table 3. Whenever we replace a hyperbolic layer with its counterpart in Euclidean geometry, the model performs worse. The results show that all the hyperbolic layers (the Poincaré embedding layer, the Einstein midpoint pooling layer and the Möbius linear layer) are necessary to achieve the best performance.

Table 3: Ablation study of each component in HyperText (PE for Poincaré Embedding, EM for Einstein Midpoint, and ML for Möbius Linear layer).

Model | Yelp P. | AG | Yah. A. | TNEWS
HyperText | 96.1 | 93.2 | 74.3 | 55.9
-PE&EM | 95.9 | 92.8 | 73.9 | 55.6
-ML | 95.6 | 92.3 | 73.2 | 54.6

4 Related Work

Hyperbolic space can be regarded as a continuous version of trees, which makes it a natural choice for representing hierarchical data (Nickel and Kiela, 2017, 2018; Sa et al., 2018). Hyperbolic geometry has been applied to learning knowledge graph representations. HyperKG (Kolyvakis et al., 2019) extends TransE to hyperbolic space and obtains a large improvement over TransE on the WordNet dataset. Balažević et al. (2019) propose the MuRP model, which minimizes the hyperbolic distances between head and tail entities in multi-relational graphs. Instead of using the hyperbolic distance, Chami et al. (2019, 2020) use hyperbolic rotations and reflections to better model the rich kinds of relations in knowledge graphs. Specifically, the authors use hyperbolic rotations to capture anti-symmetric relations and hyperbolic reflections to capture symmetric relations, and combine these operations with an attention mechanism, achieving significant improvements at low dimensions.

Hyperbolic geometry has also been applied to natural language data so as to exploit the latent hierarchies in word sequences (Tifrea et al., 2019). Recently, many deep neural networks based on hyperbolic geometry (Gulcehre et al., 2018; Ganea et al., 2018) have achieved promising results, especially when the number of parameters is limited. There are further applications based on hyperbolic geometry, such as question answering systems (Tay et al., 2018), recommender systems (Chamberlain et al., 2019) and image embeddings (Khrulkov et al., 2020).

5 Conclusion

We have shown that hyperbolic geometry can endow shallow neural networks with the ability to capture latent hierarchies in natural language. The empirical results indicate that HyperText consistently outperforms FastText on a variety of text classification tasks. On the other hand, HyperText requires far fewer parameters to retain performance on par with FastText, which suggests that neural networks in hyperbolic space could have a stronger representation capacity.

References

Ivana Balažević, Carl Allen, and Timothy Hospedales. 2019. Multi-relational Poincaré graph embeddings. In Advances in Neural Information Processing Systems.

Gary Bécigneul and Octavian-Eugen Ganea. 2018. Riemannian adaptive optimization methods. arXiv preprint arXiv:1810.00760.

Benjamin Paul Chamberlain, Stephen R. Hardwick, David R. Wardrope, Fabon Dzogang, Fabio Daolio, and Saúl Vargas. 2019. Scalable hyperbolic recommender systems. CoRR, abs/1902.08648.

Ines Chami, Adva Wolf, Frederic Sala, and Christopher Ré. 2019. Low-dimensional knowledge graph embeddings via hyperbolic rotations. In Graph Representation Learning NeurIPS 2019 Workshop.

Ines Chami, Adva Wolf, Da-Cheng Juan, Frederic Sala, Sujith Ravi, and Christopher Ré. 2020. Low-dimensional hyperbolic knowledge graph embeddings. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020).

Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. 2016. Very deep convolutional networks for text classification. arXiv preprint arXiv:1606.01781.

Bhuwan Dhingra, Christopher J. Shallue, Mohammad Norouzi, Andrew M. Dai, and George E. Dahl. 2018. Embedding text in hyperbolic spaces. In Proceedings of the Twelfth Workshop on Graph-Based Methods for Natural Language Processing, TextGraphs@NAACL-HLT.

Octavian-Eugen Ganea, Gary Bécigneul, and Thomas Hofmann. 2018. Hyperbolic neural networks. In Advances in Neural Information Processing Systems, pages 5345–5355.

Caglar Gulcehre, Misha Denil, Mateusz Malinowski, Ali Razavi, Razvan Pascanu, Karl Moritz Hermann, Peter Battaglia, Victor Bapst, David Raposo, Adam Santoro, and Nando de Freitas. 2018. Hyperbolic attention networks. In International Conference on Learning Representations.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL).

Valentin Khrulkov, Leyla Mirvakhabova, Evgeniya Ustinova, Ivan Oseledets, and Victor Lempitsky. 2020. Hyperbolic image embeddings. arXiv preprint arXiv:1904.02239.

Prodromos Kolyvakis, Alexandros Kalousis, and Dimitris Kiritsis. 2019. HyperKG: Hyperbolic knowledge graph embeddings for knowledge base completion. arXiv preprint arXiv:1908.04895.

Dmitri Krioukov, Fragkiskos Papadopoulos, Maksim Kitsak, Amin Vahdat, and Marián Boguñá. 2010. Hyperbolic geometry of complex networks. Physical Review E, 82(3):036106.

Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Haotang Deng, and Qi Ju. 2020. FastBERT: a self-distilling BERT with adaptive inference time. CoRR, abs/2004.02178.

Ruixuan Luo, Jingjing Xu, Yi Zhang, Xuancheng Ren, and Xu Sun. 2019. PKUSEG: A toolkit for multi-domain Chinese word segmentation. CoRR, abs/1906.11455.

Maximillian Nickel and Douwe Kiela. 2017. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pages 6338–6347.

Maximillian Nickel and Douwe Kiela. 2018. Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 3776–3785.

Christopher De Sa, Albert Gu, Christopher Ré, and Frederic Sala. 2018. Representation tradeoffs for hyperbolic embeddings. In Proceedings of the 35th International Conference on Machine Learning, PMLR, volume 80, pages 4460–4469.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. 2018. Hyperbolic representation learning for fast and efficient neural question answering. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (WSDM), pages 583–591.

Alexandru Tifrea, Gary Bécigneul, and Octavian-Eugen Ganea. 2019. Poincaré GloVe: Hyperbolic word embeddings. In International Conference on Learning Representations.

Liang Xu, Xuanwei Zhang, Lu Li, Hai Hu, Chenjie Cao, Weitang Liu, Junyi Li, Yudong Li, Kai Sun, Yechen Xu, Yiming Cui, Cong Yu, Qianqian Dong, Yin Tian, Dian Yu, Bo Shi, Rongzhao Wang, Jun Zeng, Weijian Xie, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou, Shaoweihua Liu, Qipeng Zhao, Cong Yue, Xinrui Zhang, Zhengliang Yang, and Zhenzhong Lan. 2020. CLUE: A Chinese language understanding evaluation benchmark. arXiv preprint arXiv:2004.05986.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.
