HyperText: Endowing FastText with Hyperbolic Geometry

Yudong Zhu, Di Zhou, Jinghui Xiao, Xin Jiang, Xiao Chen, Qun Liu
Huawei Noah's Ark Lab
{zhuyudong3,zhoudi7,xiaojinghui4,Jiang.Xin,chen.xiao2,qun.liu}@huawei.com

Abstract

Natural language data exhibit tree-like hierarchical structures such as the hypernym-hyponym relations in WordNet. FastText, as the state-of-the-art text classifier based on a shallow neural network in Euclidean space, may not model such hierarchies precisely with limited representation capacity. Considering that hyperbolic space is naturally suitable for modeling tree-like hierarchical data, we propose a new model named HyperText for efficient text classification by endowing FastText with hyperbolic geometry. Empirically, we show that HyperText outperforms FastText on a range of text classification tasks with substantially fewer parameters.

[Figure 1: The architecture comparison of FastText (upper) and HyperText (lower). FastText stacks an embedding layer, average pooling and a linear classifier; HyperText replaces these with a Poincaré embedding layer, Einstein midpoint pooling and a Möbius linear (hyperbolic) classifier.]

1 Introduction

FastText (Joulin et al., 2016) is a simple and efficient neural network for text classification, which achieves comparable performance to many deep models such as char-CNN (Zhang et al., 2015) and VDCNN (Conneau et al., 2016), with a much lower computational cost in training and inference. However, natural language data exhibit tree-like hierarchies in several respects (Dhingra et al., 2018), such as the taxonomy of WordNet. In Euclidean space, the representation capacity of a model is strictly bounded by the number of parameters. Thus, conventional shallow neural networks (e.g., FastText) may not represent tree-like hierarchies efficiently given limited model complexity.

Fortunately, hyperbolic space is naturally suitable for modeling tree-like hierarchical data. Theoretically, hyperbolic space can be viewed as a continuous analogue of trees, and it can embed trees with arbitrarily low distortion (Krioukov et al., 2010). Experimentally, Nickel and Kiela (2017) first used the Poincaré ball model to embed hierarchical data into hyperbolic space and achieved promising results on learning word embeddings in WordNet.

Inspired by their work, we propose HyperText for text classification by endowing FastText with hyperbolic geometry. We base our method on the Poincaré ball model of hyperbolic space. Specifically, we exploit the Poincaré ball embedding of words or ngrams to capture the latent hierarchies in natural language sentences. Further, we use the Einstein midpoint (Gulcehre et al., 2018) as the pooling method to emphasize semantically specific words, which usually carry more information but occur less frequently than general ones (Dhingra et al., 2018). Finally, we employ the Möbius linear transformation (Ganea et al., 2018) to play the part of the hyperbolic classifier. We evaluate our approach on text classification using ten standard datasets and observe that HyperText outperforms FastText on eight of them. Besides, HyperText is much more parameter-efficient: across different tasks, only 17% ∼ 50% of the parameters of FastText are needed for HyperText to achieve comparable performance. Meanwhile, the computational cost of our model increases moderately (2.6x in inference time) over FastText.

2 Method

2.1 Overview

Figure 1 illustrates the connection and distinction between FastText and HyperText. The differences in model architecture are three-fold. First, the input tokens in HyperText are embedded using hyperbolic geometry, specifically the Poincaré ball model, instead of Euclidean geometry. Second, the Einstein midpoint is adopted in the pooling layer so as to emphasize semantically specific words. Last, the Möbius linear transformation is chosen as the prediction layer. Besides, Riemannian optimization is employed in training HyperText.

2.2 Poincaré Embedding Layer

There are several alternative models of hyperbolic space, such as the Poincaré ball model, the Hyperboloid model and the Klein model, which offer different affordances for computation. In HyperText, we choose the Poincaré ball model to embed the input words and ngrams so as to better exploit the latent hierarchical structure in text. The Poincaré ball model of hyperbolic space corresponds to the Riemannian manifold defined as follows:

$$\mathbb{P}^d = (\mathbb{B}^d, g_x), \qquad (1)$$

where $\mathbb{B}^d = \{x \in \mathbb{R}^d : \|x\| < 1\}$ is an open d-dimensional unit ball ($\|\cdot\|$ denotes the Euclidean norm) and $g_x$ is the Riemannian metric tensor:

$$g_x = \lambda_x^2 \, g^E, \qquad (2)$$

where $\lambda_x = \frac{2}{1-\|x\|^2}$ is the conformal factor and $g^E = \mathbf{I}_d$ denotes the Euclidean metric tensor. While performing back-propagation, we use the Riemannian gradients to update the Poincaré embeddings. The Riemannian gradients are computed by rescaling the Euclidean gradients:

$$\nabla_R f(x) = \frac{1}{\lambda_x^2} \nabla_E f(x). \qquad (3)$$

Since ngrams retain the sequence order information, given a text sequence $S = \{w_i\}_{i=1}^{m}$, we embed all the words and ngrams into the Poincaré ball, denoted as $X = \{x_i\}_{i=1}^{k}$, where $x_i \in \mathbb{B}^d$.

2.3 Einstein Midpoint Pooling Layer

Average pooling is the standard way to summarize features in FastText. In Euclidean space, the average pooling is

$$\bar{u} = \frac{\sum_{i=1}^{k} x_i}{k}. \qquad (4)$$

To extend the average pooling operation to hyperbolic space, we adopt a weighted midpoint method called the Einstein midpoint (Gulcehre et al., 2018). In the d-dimensional Klein model $\mathbb{K}^d$, the Einstein midpoint takes the weighted average of the embeddings, which is given by

$$\bar{m}_K = \frac{\sum_{i=1}^{k} \gamma_i x_i}{\sum_{i=1}^{k} \gamma_i}, \quad x_i \in \mathbb{K}^d, \qquad (5)$$

where $\gamma_i = \frac{1}{\sqrt{1-\|x_i\|^2}}$ are the Lorentz factors.

However, our embedding layer is based on the Poincaré model rather than the Klein model, which means we cannot directly compute the Einstein midpoints using Equation (5). Nevertheless, the various models commonly used for hyperbolic geometry are isometric, which means we can first project the input embeddings to the Klein model, execute the Einstein midpoint pooling, and then project the result back to the Poincaré model. The transition formulas between the Poincaré and Klein models are as follows:

$$x_K = \frac{2 x_P}{1 + \|x_P\|^2}, \qquad (6)$$

$$\bar{m}_P = \frac{\bar{m}_K}{1 + \sqrt{1 - \|\bar{m}_K\|^2}}, \qquad (7)$$

where $x_P$ and $x_K$ respectively denote token embeddings in the Poincaré and Klein models, and $\bar{m}_P$ and $\bar{m}_K$ are the Einstein midpoint pooling vectors in the Poincaré and Klein models. It should be noted that points near the boundary of the Poincaré ball receive larger weights in the Einstein midpoint formula. These points (tokens) are regarded as more representative and provide salient information for the text classification task (Dhingra et al., 2018).
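
As an illustration (not the authors' released implementation), here is a minimal NumPy sketch of Equations (5)-(7): it maps Poincaré-ball token embeddings to the Klein model, computes the Lorentz-factor-weighted Einstein midpoint, and maps the pooled vector back to the Poincaré ball. Function names and the small numerical epsilon are illustrative assumptions.

```python
import numpy as np

def poincare_to_klein(x_p):
    """Eq. (6): map points from the Poincaré ball to the Klein model."""
    sq_norm = np.sum(x_p ** 2, axis=-1, keepdims=True)
    return 2.0 * x_p / (1.0 + sq_norm)

def klein_to_poincare(x_k):
    """Eq. (7): map a point from the Klein model back to the Poincaré ball."""
    sq_norm = np.sum(x_k ** 2, axis=-1, keepdims=True)
    return x_k / (1.0 + np.sqrt(np.clip(1.0 - sq_norm, 0.0, None)))

def einstein_midpoint(x_p, eps=1e-5):
    """Eq. (5): Lorentz-weighted midpoint of Poincaré-ball embeddings.

    x_p: array of shape (k, d), every row strictly inside the unit ball.
    Returns the pooled vector in the Poincaré ball, shape (d,).
    """
    x_k = poincare_to_klein(x_p)                               # project to Klein model
    sq_norm = np.sum(x_k ** 2, axis=-1, keepdims=True)
    gamma = 1.0 / np.sqrt(np.clip(1.0 - sq_norm, eps, None))   # Lorentz factors
    m_k = (gamma * x_k).sum(axis=0) / gamma.sum()              # weighted average in Klein model
    return klein_to_poincare(m_k)                              # project back to Poincaré ball

# Tokens closer to the ball boundary (larger norm) receive larger Lorentz weights.
tokens = np.array([[0.10, 0.05], [0.60, 0.30], [-0.85, 0.40]])
print(einstein_midpoint(tokens))
```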

2.4 Möbius Linear Layer

The Möbius linear transformation is the analogue of a linear map in Euclidean neural networks. We use the Möbius linear layer to combine the features output by the pooling layer and complete the classification task, which takes the form

$$o = M \otimes \bar{m}_P \oplus b, \qquad (8)$$

where $\otimes$ and $\oplus$ denote the Möbius matrix multiplication and Möbius addition, defined as follows (Ganea et al., 2018):

$$M \otimes x = \frac{1}{\sqrt{c}} \tanh\!\left(\frac{\|Mx\|}{\|x\|} \tanh^{-1}\!\big(\sqrt{c}\,\|x\|\big)\right) \frac{Mx}{\|Mx\|},$$

$$x \oplus b = \frac{\big(1 + 2c\langle x, b\rangle + c\|b\|^2\big)\,x + \big(1 - c\|x\|^2\big)\,b}{1 + 2c\langle x, b\rangle + c^2\|x\|^2\|b\|^2},$$

where $M \in \mathbb{R}^{d \times n}$ denotes the weight matrix, $n$ denotes the number of classes, $b \in \mathbb{R}^n$ is the bias vector, and $c$ is a hyper-parameter that denotes the curvature of the hyperbolic space. In order to obtain the categorical probabilities $\hat{y}$, a softmax layer is applied after the Möbius linear layer:

$$\hat{y} = \mathrm{softmax}(o). \qquad (9)$$
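
A hedged NumPy sketch of the Möbius operations in Equation (8) is given below. It assumes unit curvature (c = 1) and a weight matrix of shape (n, d) acting on column vectors; the function names, the toy sizes and the numerical epsilons are assumptions for illustration, not the paper's own implementation.

```python
import numpy as np

def mobius_matvec(M, x, c=1.0, eps=1e-9):
    """Möbius matrix-vector multiplication M ⊗ x (Ganea et al., 2018)."""
    x_norm = max(np.linalg.norm(x), eps)
    Mx = M @ x
    Mx_norm = max(np.linalg.norm(Mx), eps)
    scale = np.tanh((Mx_norm / x_norm) * np.arctanh(np.sqrt(c) * x_norm))
    return scale * Mx / (np.sqrt(c) * Mx_norm)

def mobius_add(x, b, c=1.0):
    """Möbius addition x ⊕ b (Ganea et al., 2018)."""
    xb = np.dot(x, b)
    x2, b2 = np.dot(x, x), np.dot(b, b)
    num = (1 + 2 * c * xb + c * b2) * x + (1 - c * x2) * b
    den = 1 + 2 * c * xb + c ** 2 * x2 * b2
    return num / den

def hyperbolic_classifier(m_p, M, b, c=1.0):
    """Eqs. (8)-(9): Möbius linear layer followed by a softmax."""
    o = mobius_add(mobius_matvec(M, m_p, c), b, c)
    e = np.exp(o - o.max())
    return e / e.sum()

# Toy example: d = 2 pooled features, n = 3 classes (hypothetical sizes).
rng = np.random.default_rng(0)
M = rng.normal(scale=0.1, size=(3, 2))   # hypothetical weight matrix
b = np.zeros(3)                          # bias initialized at the origin of the ball
print(hyperbolic_classifier(np.array([0.2, -0.1]), M, b))
```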

2.5 Model Optimization

We use the cross-entropy loss for the multi-class classification task:

$$L = -\frac{1}{N} \sum_{i=1}^{N} y_i \cdot \log(\hat{y}_i), \qquad (10)$$

where $N$ is the number of training examples and $y_i$ is the one-hot representation of the ground-truth label. For training, we use the Riemannian optimizer (Bécigneul and Ganea, 2018), which is more accurate for hyperbolic models. We refer the reader to the original paper for more details.
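
To make the update of the Poincaré embeddings concrete, the following is a simplified Riemannian-SGD-style step: it rescales the Euclidean gradient as in Equation (3) and then clips the updated point back inside the unit ball. The paper itself uses the Riemannian adaptive optimizer of Bécigneul and Ganea (2018), which additionally involves retraction/exponential maps and adaptive learning rates, so this is only a sketch under those simplifying assumptions.

```python
import numpy as np

def riemannian_sgd_step(x, euclid_grad, lr=0.05, eps=1e-5):
    """One simplified update of a single Poincaré embedding row.

    Eq. (3): rescale the Euclidean gradient by 1 / lambda_x^2,
    then take a gradient step and keep the result strictly inside the ball.
    """
    lam = 2.0 / (1.0 - np.dot(x, x))          # conformal factor, Eq. (2)
    riem_grad = euclid_grad / (lam ** 2)      # Riemannian gradient, Eq. (3)
    x_new = x - lr * riem_grad                # first-order, retraction-free step
    norm = np.linalg.norm(x_new)
    if norm >= 1.0:                           # project back into the open unit ball
        x_new = x_new / norm * (1.0 - eps)
    return x_new

x = np.array([0.3, -0.2])
g = np.array([0.8, 0.1])                      # hypothetical Euclidean gradient
print(riemannian_sgd_step(x, g))
```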
3 Experiments

3.1 Experimental Setup

Datasets. To make a comprehensive comparison with FastText, we use the same eight datasets as Joulin et al. (2016) in our experiments. We also add two Chinese text classification datasets from the Chinese CLUE benchmark (Xu et al., 2020), which are presumably more challenging. The statistics of the datasets are summarized in Table 2.

Table 2: Dataset statistics.

Dataset | #Classes | #Train | #Test
AG | 4 | 120,000 | 7,600
Sogou | 5 | 450,000 | 60,000
DBP | 14 | 560,000 | 70,000
Yelp P. | 2 | 560,000 | 38,000
Yelp F. | 5 | 650,000 | 50,000
Yah. A. | 10 | 1,400,000 | 60,000
Amz. F. | 5 | 3,000,000 | 650,000
Amz. P. | 2 | 3,600,000 | 400,000
TNEWS | 15 | 53,360 | 10,000
IFLYTEK | 119 | 12,133 | 2,599

Hyperparameters. Following Joulin et al. (2016), we set the embedding dimension to 10 for the first eight datasets in Table 1. On the TNEWS and IFLYTEK datasets, we use 200-dimensional and 300-dimensional embeddings respectively. The learning rate is selected on a validation set from {0.001, 0.05, 0.01, 0.015}. In addition, we use the PKUSEG tool (Luo et al., 2019) for Chinese word segmentation.

3.2 Experimental Results

Comparison with FastText and deep models. The results of our experiments are displayed in Table 1. Our proposed HyperText model outperforms FastText on eight out of ten datasets, and the accuracy of HyperText is 0.7% higher than FastText on average. In addition, we observe that HyperText works significantly better than FastText on datasets with more label categories, such as Yah. A., TNEWS and IFLYTEK. This arguably confirms our hypothesis that HyperText can better model the hierarchical relationships of the underlying data and extract more discriminative features for classification. Moreover, HyperText outperforms DistilBERT (Sanh et al., 2019) and FastBERT (Liu et al., 2020), which are two distilled versions of BERT, and it achieves performance comparable to the very deep convolutional network VDCNN (Conneau et al., 2016), which consists of 29 convolutional layers. Overall, HyperText attains better or comparable classification accuracy to these deep models while requiring several orders of magnitude less computation.

Table 1: Accuracy (%) of different models. ∗The results of DistilBERT are cited from Liu et al. (2020).

Model | AG | Sogou | DBP | Yelp P. | Yelp F. | Yah. A. | Amz. F. | Amz. P. | TNEWS | IFLYTEK
FastText | 92.5 | 96.8 | 98.6 | 95.7 | 63.9 | 72.3 | 60.2 | 94.6 | 54.6 | 54.0
VDCNN | 91.3 | 96.8 | 98.7 | 95.7 | 64.7 | 73.4 | 63.0 | 95.7 | 54.8 | 55.4
DistilBERT (1-layer)∗ | 92.9 | - | 99.0 | 91.6 | 58.6 | 74.9 | 59.5 | - | - | -
FastBERT (speed=0.8) | 92.5 | - | 99.0 | 94.3 | 60.7 | 75.0 | 61.7 | - | - | -
HyperText | 93.2 | 97.3 | 98.5 | 96.1 | 64.6 | 74.3 | 60.1 | 94.6 | 55.9 | 55.2

Embedding Dimension. Since the input embeddings account for more than 90% of the model parameters, we investigate the impact of the input embedding dimension on classification accuracy. The experimental results are presented in Figure 2. As we can see, on most tasks HyperText performs consistently better than FastText across various dimension settings. In particular, on the IFLYTEK and TNEWS datasets, HyperText with 50-dimensional embeddings achieves better performance than FastText with 300-dimensional and 200-dimensional embeddings respectively. On the other eight less challenging datasets, the experiments are conducted in low-dimensional settings, and HyperText generally requires fewer dimensions to reach its optimal performance. This verifies that, thanks to its ability to capture the internal structure of text, the hyperbolic model is more parameter-efficient than its Euclidean competitor.
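
To see why the embedding table dominates the parameter count (the observation motivating this experiment), a quick back-of-the-envelope computation with hypothetical sizes (not the paper's actual vocabulary) is shown below.

```python
# Hypothetical sizes, for illustration only.
vocab_size = 500_000      # words plus hashed ngram buckets
dim = 10                  # embedding dimension used for the English datasets
num_classes = 4           # e.g. AG news

embedding_params = vocab_size * dim                    # 5,000,000
classifier_params = dim * num_classes + num_classes    # 44

total = embedding_params + classifier_params
print(f"embedding share of parameters: {embedding_params / total:.4%}")
```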

[Figure 2: Accuracy vs. embedding dimension. The x-axis represents the embedding dimension and the y-axis the accuracy (%). Panels: (a) TNEWS, (b) IFLYTEK, (c) AG, (d) Sogou, (e) DBP, (f) Yelp P., (g) Yelp F., (h) Yah. A., (i) Amz. F., (j) Amz. P.]

Computation in Inference. FastText is well known for its fast inference speed. We compare the FLOPs versus accuracy under different dimensions in Figure 3. Due to the additional non-linear operations, HyperText generally requires more (4.5 ∼ 6.7x) computation than FastText at the same dimension. But since HyperText is more parameter-efficient, when constrained to the same level of FLOPs, HyperText mostly outperforms FastText in classification accuracy. Besides, the FLOPs level of VDCNN is about 10^5 times higher than that of HyperText and FastText.

[Figure 3: FLOPs (×10^3) vs. accuracy (%) under different dimensions. The x-axis represents the FLOPs and the y-axis the accuracy; different points correspond to different embedding dimensions. Panels: (a) TNEWS, (b) AG, (c) DBP, (d) Yah. A.]

Ablation study. We conduct an ablation study to determine the contribution of each layer. The results on several datasets are presented in Table 3. Whenever we replace a hyperbolic layer with its counterpart in Euclidean geometry, the model performs worse. The results show that all the hyperbolic layers (the Poincaré embedding layer, the Einstein midpoint pooling layer and the Möbius linear layer) are necessary to achieve the best performance.

Table 3: Ablation study of each component in HyperText (PE for Poincaré Embedding, EM for Einstein Midpoint, and ML for Möbius Linear layer).

Model | Yelp P. | AG | Yah. A. | TNEWS
HyperText | 96.1 | 93.2 | 74.3 | 55.9
-PE&EM | 95.9 | 92.8 | 73.9 | 55.6
-ML | 95.6 | 92.3 | 73.2 | 54.6

4 Related Work

Hyperbolic space can be regarded as a continuous version of trees, which makes it a natural choice for representing hierarchical data (Nickel and Kiela, 2017, 2018; Sa et al., 2018). Hyperbolic geometry has been applied to learning knowledge graph representations. HyperKG (Kolyvakis et al., 2019) extends TransE to hyperbolic space and obtains a large improvement over TransE on the WordNet dataset. Balažević et al. (2019) propose the MuRP model, which minimizes the hyperbolic distances between head and tail entities in multi-relational graphs. Instead of using the hyperbolic distance, Chami et al. (2019, 2020) use hyperbolic rotations and reflections to better model the rich kinds of relations in knowledge graphs. Specifically, the authors use hyperbolic rotations to capture anti-symmetric relations and hyperbolic reflections to capture symmetric relations, and combine these operations with an attention mechanism, achieving significant improvements at low dimensions.

Hyperbolic geometry has also been applied to natural language data so as to exploit the latent hierarchies in word sequences (Tifrea et al., 2019). Recently, many deep neural networks based on hyperbolic geometry (Gulcehre et al., 2018; Ganea et al., 2018) have achieved promising results, especially when the number of parameters is limited. There are further applications based on hyperbolic geometry, such as question answering systems (Tay et al., 2018), recommender systems (Chamberlain et al., 2019) and image embeddings (Khrulkov et al., 2020).

5 Conclusion

We have shown that hyperbolic geometry can endow shallow neural networks with the ability to capture latent hierarchies in natural language. The empirical results indicate that HyperText consistently outperforms FastText on a variety of text classification tasks. On the other hand, HyperText requires far fewer parameters to retain performance on par with FastText, which suggests that neural networks in hyperbolic space could have a stronger representation capacity.

References

Ivana Balažević, Carl Allen, and Timothy Hospedales. 2019. Multi-relational Poincaré graph embeddings. In Advances in Neural Information Processing Systems.

Gary Bécigneul and Octavian-Eugen Ganea. 2018. Riemannian adaptive optimization methods. arXiv preprint arXiv:1810.00760.

Benjamin Paul Chamberlain, Stephen R. Hardwick, David R. Wardrope, Fabon Dzogang, Fabio Daolio, and Saúl Vargas. 2019. Scalable hyperbolic recommender systems. CoRR, abs/1902.08648.

Ines Chami, Adva Wolf, Frederic Sala, and Christopher Ré. 2019. Low-dimensional knowledge graph embeddings via hyperbolic rotations. In Graph Representation Learning NeurIPS 2019 Workshop.

Ines Chami, Adva Wolf, Da-Cheng Juan, Frederic Sala, Sujith Ravi, and Christopher Ré. 2020. Low-dimensional hyperbolic knowledge graph embeddings. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020).

Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. 2016. Very deep convolutional networks for text classification. arXiv preprint arXiv:1606.01781.

Bhuwan Dhingra, Christopher J. Shallue, Mohammad Norouzi, Andrew M. Dai, and George E. Dahl. 2018. Embedding text in hyperbolic spaces. In Proceedings of the Twelfth Workshop on Graph-Based Methods for Natural Language Processing, TextGraphs@NAACL-HLT.

Octavian-Eugen Ganea, Gary Bécigneul, and Thomas Hofmann. 2018. Hyperbolic neural networks. In Advances in Neural Information Processing Systems, pages 5345–5355.

Caglar Gulcehre, Misha Denil, Mateusz Malinowski, Ali Razavi, Razvan Pascanu, Karl Moritz Hermann, Peter Battaglia, Victor Bapst, David Raposo, Adam Santoro, and Nando de Freitas. 2018. Hyperbolic attention networks. In International Conference on Learning Representations.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL).

Valentin Khrulkov, Leyla Mirvakhabova, Evgeniya Ustinova, Ivan Oseledets, and Victor Lempitsky. 2020. Hyperbolic image embeddings. arXiv preprint arXiv:1904.02239.

Prodromos Kolyvakis, Alexandros Kalousis, and Dimitris Kiritsis. 2019. HyperKG: Hyperbolic knowledge graph embeddings for knowledge base completion. arXiv preprint arXiv:1908.04895.

Dmitri Krioukov, Fragkiskos Papadopoulos, Maksim Kitsak, Amin Vahdat, and Marián Boguñá. 2010. Hyperbolic geometry of complex networks. Physical Review E, 82(3):036106.

Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Haotang Deng, and Qi Ju. 2020. FastBERT: a self-distilling BERT with adaptive inference time. CoRR, abs/2004.02178.

Ruixuan Luo, Jingjing Xu, Yi Zhang, Xuancheng Ren, and Xu Sun. 2019. PKUSEG: A toolkit for multi-domain Chinese word segmentation. CoRR, abs/1906.11455.

Maximillian Nickel and Douwe Kiela. 2017. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pages 6338–6347.

Maximillian Nickel and Douwe Kiela. 2018. Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 3776–3785.

Christopher De Sa, Albert Gu, Christopher Ré, and Frederic Sala. 2018. Representation tradeoffs for hyperbolic embeddings. In Proceedings of the 35th International Conference on Machine Learning, PMLR, volume 80, pages 4460–4469.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. 2018. Hyperbolic representation learning for fast and efficient neural question answering. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (WSDM), pages 583–591.

Alexandru Tifrea, Gary Bécigneul, and Octavian-Eugen Ganea. 2019. Poincaré GloVe: Hyperbolic word embeddings. In International Conference on Learning Representations.

Liang Xu, Xuanwei Zhang, Lu Li, Hai Hu, Chenjie Cao, Weitang Liu, Junyi Li, Yudong Li, Kai Sun, Yechen Xu, Yiming Cui, Cong Yu, Qianqian Dong, Yin Tian, Dian Yu, Bo Shi, Rongzhao Wang, Jun Zeng, Weijian Xie, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou, Shaoweihua Liu, Qipeng Zhao, Cong Yue, Xinrui Zhang, Zhengliang Yang, and Zhenzhong Lan. 2020. CLUE: A Chinese language understanding evaluation benchmark. arXiv preprint arXiv:2004.05986.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.
