SiameseXML: Siamese Networks meet Extreme Classifiers with 100M Labels

Kunal Dahiya 1, Ananye Agarwal 1, Deepak Saini 2, Gururaj K 3, Jian Jiao 3, Amit Singh 3, Sumeet Agarwal 1, Purushottam Kar 4 2, Manik Varma 2 1

Abstract

The task of deep extreme multi-label learning (XML) requires training deep architectures capable of tagging a data point with its most relevant subset of labels from an extremely large label set. Applications of XML include tasks such as ad and product recommendation that involve labels that are rarely seen during training but which nevertheless hold the key to recommendations that delight users. Effective utilization of label metadata and high-quality predictions for rare labels at the scale of millions of labels are key challenges in contemporary XML research. To address these, this paper develops the SiameseXML method by proposing a novel probabilistic model, suitable for extreme scales, that naturally yields a Siamese architecture offering generalization guarantees that can be entirely independent of the number of labels, melded with high-capacity extreme classifiers. SiameseXML could effortlessly scale to tasks with 100 million labels, and in live A/B tests on a popular search engine it yielded significant gains in click-through rates, coverage, revenue and other online metrics over state-of-the-art techniques currently in production. SiameseXML also offers predictions 3-12% more accurate than leading XML methods on public benchmark datasets. The generalization bounds are based on a novel uniform Maurey-type sparsification lemma which may be of independent interest. Code for SiameseXML will be made available publicly.

1. Introduction

Overview: Extreme Multi-label Learning (XML) involves tagging a data point with its most relevant subset of labels from an extremely large set. XML finds applications in a myriad of ranking and recommendation tasks such as product-to-product (Mittal et al., 2021), product-to-query (Chang et al., 2020), query-to-product (Medini et al., 2019), query-to-bid-phrase (Dahiya et al., 2021), etc., that present specific statistical as well as computational challenges.

Short-text Applications with Label Metadata: Applications where data points are endowed with short textual descriptions (e.g., product title, search query) containing 3-10 tokens are known as short-text applications. These accurately model the ranking and recommendation tasks mentioned above and have attracted much attention recently (Chang et al., 2020; Dahiya et al., 2021; Medini et al., 2019; Mittal et al., 2021). This paper specifically focuses on applications where labels (e.g., related items, bid phrases) are also endowed with short-text descriptions. This is in sharp contrast with XML works, e.g., (Babbar & Schölkopf, 2019; Prabhu et al., 2018b; You et al., 2019), that treat labels as identifiers devoid of descriptive features. We note that other forms of label metadata such as label hierarchies could also be used, but this paper focuses on label textual descriptions as a reliable and readily available form of label metadata.

Technical Challenges: Label sets in XML tasks routinely contain several millions of labels, yet latency requirements demand that predictions on a test point be made within milliseconds. The vast majority of labels in XML tasks are rare labels for which very little training data (often < 5 training points) is available. The use of label metadata has been demonstrated (Mittal et al., 2021) to be beneficial in accurately predicting rare labels for which training data alone may not inform the classifier model adequately. However, it remains challenging to design architectures that effectively utilize label metadata at the scale of millions of labels.

Contributions: This paper presents SiameseXML, an effective solution for training XML models utilizing label text at the scale of 100 million labels. From a technical standpoint, a) SiameseXML is based on a novel probabilistic model that naturally motivates a modular approach melding Siamese networks with extreme classifiers, enhancing the capacity of existing Siamese architectures. b) The Siamese module of SiameseXML offers generalization bounds entirely independent of the number of labels L, whereas the Extreme module bounds explicitly depend only on log L. This is advantageous for tasks with as many as 100M labels. These bounds are based on a novel Maurey-type sparsification lemma which may be of independent interest. From an application standpoint, c) SiameseXML offers superior scalability compared to existing methods, scaling to tasks with 100M labels while still offering predictions within milliseconds. d) SiameseXML's predictions can be 3-12% more accurate than those of leading XML methods on benchmark datasets, and SiameseXML was found to improve the quality of predictions by 11% in an application matching user queries to advertiser bid phrases.

(1 Indian Institute of Technology Delhi, 2 Microsoft Research, 3 Microsoft, 4 Indian Institute of Technology Kanpur. Correspondence to: Kunal Dahiya. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).)

2. Related Works

Extreme Multi-label Learning (XML): Much prior work has focused on designing classifiers for fixed features such as bag-of-words, or else pre-trained features such as FastText (Joulin et al., 2017). Representative works include (Agrawal et al., 2013; Babbar & Schölkopf, 2017; 2019; Bhatia et al., 2015; Jain et al., 2016; 2019; Jasinska et al., 2016; Khandagale et al., 2019; Mineiro & Karampatziakis, 2015; Niculescu-Mizil & Abbasnejad, 2017; Prabhu & Varma, 2014; Prabhu et al., 2018a;b; Tagami, 2017; Wydmuch et al., 2018; Xu et al., 2016; Yen et al., 2016). While highly scalable, using fixed or pre-trained features leads to sub-optimal accuracy in real-world scenarios, as demonstrated in several recent works that propose deep learning algorithms to jointly learn task-dependent features and classifiers. These include XML-CNN (Liu et al., 2017), AttentionXML (You et al., 2019), MACH (Medini et al., 2019), X-Transformer (Chang et al., 2020) and Astec (Dahiya et al., 2021), which respectively use CNN, attention, MLP, transformer, and bag-of-embeddings-based architectures. It is notable that deep extreme classifiers not only outperform classical approaches that use fixed features, but also scale to millions of labels.

Label Metadata: Recent works have demonstrated that incorporating label metadata can also lead to significant performance boosts compared to approaches that consider labels as feature-less identifiers. Prominent among them are the X-Transformer (Chang et al., 2020) and DECAF (Mittal et al., 2021). Unfortunately, both methods struggle to accurately scale to tasks with several millions of labels, as is discussed below. Moreover, experiments show that the SiameseXML method proposed in this paper can outperform both the X-Transformer and DECAF on smaller datasets. The X-Transformer makes use of label text to learn intermediate representations and learns a vast ensemble of classifiers based on the powerful transformer architecture. However, training multiple transformer models requires a rather large array of GPUs, and the method has not been shown to scale to several millions of labels. On the other hand, DECAF uses label text embeddings as a component in its 1-vs-all label classifiers. Although relatively more scalable, DECAF still struggles with several millions of labels. The method crucially relies on clustering labels into L̂ meta-labels and seems to demand progressively larger values of L̂ as the number of labels L grows, for example L̂ ≈ 130K for L ≈ 1M labels. Since DECAF requires meta-label representations corresponding to all L̂ meta-labels to be recomputed for every iteration (mini-batch), the method becomes expensive for larger datasets. Taking a small value, say L̂ ≈ 8K, does improve training time but hurts performance. SiameseXML will demonstrate that a direct label-text embedding can offer better scalability, as well as better performance, compared to DECAF's meta-label based approach.

Siamese Networks: Siamese networks typically learn data point and label embeddings by optimizing the pairwise contrastive loss (Chen et al., 2020; Xiong et al., 2020) or else the triplet loss (Schroff et al., 2015; Wu et al., 2017). Performing this optimization exactly can be computationally prohibitive at extreme scales, as for N data points and L labels, a per-epoch cost of O(NL) and O(NL^2) is incurred for pairwise and triplet losses, respectively. To avoid this, it is common to train each data point with respect to only a small, say O(log L)-sized, subset of labels, which brings training cost down to O(N log L) per epoch. This subset typically contains all positive labels of the data point (of which there are only O(log L) in most XML applications) and a carefully chosen set of O(log L) negative labels which seem the most challenging for this data point. There is some debate as to whether the absolutely "hardest" negatives for a data point should be considered (Wu et al., 2017; Xiong et al., 2020), especially in situations where missing labels abound (Jain et al., 2016), or whether considering multiple hard negatives is essential. For instance, Schroff et al. (2015) and Harwood et al. (2017) observed that using only "hard" negatives could lead to overfitting in Siamese networks. Nevertheless, such negative mining has become a cornerstone for training classifier models at extreme scales.
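To make the shortlist-based training just described concrete, the following minimal PyTorch sketch (illustrative only, not any particular method's code; `encoder`, the margin value, and all tensor names are assumptions) computes a contrastive-style loss for one document using only its positive labels and a small set of sampled negative labels, so the per-epoch cost scales as O(N log L) rather than O(NL).

```python
import torch
import torch.nn.functional as F

def shortlist_loss(encoder, doc_input, pos_label_inputs, neg_label_inputs, margin=0.5):
    """Loss for one document over a small label shortlist: its positives plus
    an O(log L)-sized set of sampled negatives."""
    doc = F.normalize(encoder(doc_input), dim=-1)            # (D,)
    pos = F.normalize(encoder(pos_label_inputs), dim=-1)     # (P, D)
    neg = F.normalize(encoder(neg_label_inputs), dim=-1)     # (K, D)
    pos_sim = pos @ doc                                       # (P,) similarities
    neg_sim = neg @ doc                                       # (K,) similarities
    # encourage every positive to score above every sampled negative by a margin
    return F.relu(margin - pos_sim[:, None] + neg_sim[None, :]).mean()
```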
Negative Mining: Although relatively well understood for approaches that use fixed features, negative mining becomes an interesting problem in itself when features are jointly learnt, since the set of challenging negatives for a data point may keep changing as the feature representation of that data point changes over the epochs. Several approaches have been proposed to address this. Online approaches look for challenging negative labels for a data point within the positive labels of other data points in its mini-batch (Faghri et al., 2018; Guo et al., 2019; Chen et al., 2020; He et al., 2020). Although computationally cheap, as the number of labels grows it becomes more and more unlikely that the most useful negative labels for a data point would just happen to get sampled within its mini-batch. Thus, at extreme scales, either negative mining quality suffers or else mini-batch sizes have to be enlarged, which hurts scalability (Chen et al., 2020; Dahiya et al., 2021). Offline approaches that utilize approximate nearest neighbor search (ANNS) structures have also been studied. This can either be done once if using pre-trained features (Luan et al., 2020) or else periodically if using learnt features (Xiong et al., 2020; Harwood et al., 2017). However, repeated creation of ANNS structures needs to be done within computational budgets.

Generalization Bounds: Previous works have studied the generalization behavior of multi-label problems for linear (Lei et al., 2019; Yu et al., 2014) as well as deep models (Bartlett et al., 2017; Wei & Ma, 2019), but, to the best of our knowledge, they do not take label features into account and treat labels simply as identifiers. This introduces an inescapable (implicit or explicit) dependence on the number of labels in these bounds. The Siamese module of SiameseXML, on the other hand, offers generalization bounds that incorporate deep-learnt features for data points as well as labels, and that are entirely independent of the number of labels. This is notable given that SiameseXML operates at the scale of 100M labels. The bounds are based on a novel uniform Maurey-type sparsification lemma that uses powerful Bernstein-type tail bounds on Hilbert spaces, as opposed to existing results that use empirical Maurey-type lemmata based on the less powerful Chebyshev bound. We refer to Appendix E for a more relaxed discussion.

3. An Extreme Probabilistic Model

Notation: Let {{x_i, y_i}_{i=1}^N, {z_l}_{l=1}^L} be a multi-label training set with L labels and N documents. x_i, z_l ∈ X denote representations of the textual descriptions of document i and label l respectively. For each i ∈ [N], y_i ∈ {−1, +1}^L denotes the ground-truth label vector for document i such that y_il = +1 if label l is relevant to document i and y_il = −1 otherwise. S^{D−1} denotes the D-dimensional unit sphere.

Probabilistic Model: Let E_θ : X → S^{D−1} denote a parameterized embedding model with parameters θ ∈ Θ. The first postulate below posits that "extreme" 1-vs-all (OvA) classifiers suffice to solve the XML problem. This postulate is widely accepted by 1-vs-all methods popular in the literature, e.g., (Babbar & Schölkopf, 2019; Chang et al., 2020; Jain et al., 2019; Prabhu et al., 2018b; Yen et al., 2017).

Postulate 1 (OvA sufficiency). For some known point likelihood function p_l : {−1, +1} × [−1, +1] → [0, 1] satisfying p_l(+1, v) + p_l(−1, v) = 1 for every v ∈ [−1, +1], unknown parameters θ* ∈ Θ, and extreme classifiers w_l* ∈ S^{D−1},

\mathbb{P}\left[\mathbf{y} \mid x_i, \{w_l^*\}, \theta^*\right] = \prod_{l=1}^{L} \mathbb{P}\left[y_{il} \mid x_i, w_l^*, \theta^*\right] = \prod_{l=1}^{L} p_l\left(y_{il}, \mathcal{E}_{\theta^*}(x_i)^\top w_l^*\right)

This motivates the following negative log-likelihood (NLL) expression for MLE-based estimation of θ*, {w_l*}_{l=1}^L:

\mathcal{L}(\theta, \{w_l\}) := -\frac{1}{NL} \sum_{i=1}^{N} \sum_{l=1}^{L} \ln p_l\left(y_{il}, \mathcal{E}_\theta(x_i)^\top w_l\right) \qquad (1)

Equation (1) closely resembles objectives used by 1-vs-all methods in the literature and is equally prohibitive to optimize naively at extreme scales, requiring Ω(NLD) time to perform even a single gradient descent step. To remedy this, a novel postulate is introduced below which posits that for any label l ∈ [L], if its label text z_l is presented as a document, then label l itself is likely to be found relevant.

Postulate 2 (Label Self-Proximity). For every label l ∈ [L], its label text z_l ∈ X satisfies P[y_l = +1 | z_l, w_l*, θ*] = p_l(+1, E_{θ*}(z_l)^⊤ w_l*) ≥ p_l(+1, 1) − ε_l for some ε_l ≤ 1.

Note that Postulate 2 always holds for some ε_l ≤ 1 and is thus not an assumption. However, in applications where documents and labels reside in similar spaces, it can be expected to hold with small values of ε_l → 0. This is especially true of product-to-product recommendation tasks where documents and labels come from the same universe and a product can reasonably be expected to be strongly related to itself. However, this is also expected in other applications enumerated in Section 1, such as product-to-query and query-to-bid-phrase, where the textual representations of related document-label pairs, i.e., pairs with y_il = 1, convey similar intent. This postulate is key to the modular nature of SiameseXML since it immediately yields the following.

Lemma 1. If p_l(+1, ·) is monotonically increasing and has an inverse that is C_lip-Lipschitz over the range set R_p := {p_l(+1, v) : v ∈ [−1, +1]}, then for all l ∈ [L] and x ∈ X,

\mathcal{E}_{\theta^*}(x)^\top w_l^* - \mathcal{E}_{\theta^*}(x)^\top \mathcal{E}_{\theta^*}(z_l) \le \sqrt{2 C_{\mathrm{lip}} \cdot \epsilon_l}

In particular, if ln p_l(+1, ·) and ln p_l(−1, ·) are D_lip-Lipschitz over [−1, +1], then

\mathcal{L}(\theta^*, \{\mathcal{E}_{\theta^*}(z_l)\}) \le \mathcal{L}(\theta^*, \{w_l^*\}) + D_{\mathrm{lip}} \cdot \sqrt{2 C_{\mathrm{lip}} \cdot \bar\epsilon},

where ε̄ := (1/L) Σ_{l=1}^L ε_l is the average label self-proximity value.

Lemma 1 shows that label text embeddings are capable of acting as stand-ins for the 1-vs-all classifiers. This in turn motivates a notion of partial NLL that is defined over the embedding model parameters alone:

\mathcal{L}^1(\theta) := -\frac{1}{NL} \sum_{i=1}^{N} \sum_{l=1}^{L} \ln p_l\left(y_{il}, \mathcal{E}_\theta(x_i)^\top \mathcal{E}_\theta(z_l)\right) \qquad (2)

Note that the expression for L^1 naturally suggests a Siamese architecture, as it requires learning a shared embedding model E_θ to embed both document and label text, so that L^1 is minimized.
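As a concrete illustration of how Equation (2) simply substitutes the label-text embedding E_θ(z_l) for the classifier w_l in Equation (1), consider the following minimal PyTorch sketch (illustrative only; the likelihood anticipates the form chosen in Section 4, and the constants c, d and all tensor names are assumptions).

```python
import torch

def point_likelihood(y, v, c=0.9, d=3.0):
    """p_l(y, v) with p_l(+1, v) = c * exp(d * v) / exp(d), p_l(-1, v) = 1 - p_l(+1, v)."""
    p_pos = c * torch.exp(d * (v - 1.0))
    return torch.where(y > 0, p_pos, 1.0 - p_pos)

def full_nll(doc_emb, clf, y):
    """Equation (1): doc_emb is N x D (unit rows), clf is L x D, y is N x L in {-1,+1}."""
    scores = doc_emb @ clf.T                      # N x L inner products in [-1, 1]
    return -torch.log(point_likelihood(y, scores)).mean()

def siamese_nll(doc_emb, label_emb, y):
    """Equation (2): identical to Equation (1) with w_l replaced by E_theta(z_l)."""
    return full_nll(doc_emb, label_emb, y)
```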

SiameseXML crucially exploits this to bootstrap the learning process. Given a solution, say θ̂, that (approximately) optimizes L^1, the 1-vs-all extreme classifiers are now reparameterized as

w_l = N(\mathcal{E}_{\hat\theta}(z_l) + \eta_l),

where η_l ∈ R^D are "refinement" vectors, learnt one per label, and N is the normalization operator, i.e., N : v ↦ v/‖v‖_2 ∈ S^{D−1}. Yet another partial NLL expression, this time over the refinement vectors alone, is used to train them:

\mathcal{L}^2(\{\eta_l\}) := -\frac{1}{NL} \sum_{i=1}^{N} \sum_{l=1}^{L} \ln p_l\left(y_{il}, \mathcal{E}_{\hat\theta}(x_i)^\top w_l\right) \qquad (3)

The following result shows that this multi-stage learning process does yield embedding model parameters and 1-vs-all classifiers with bounded suboptimality with respect to the original NLL function L from Equation (1).

Lemma 2. Suppose θ̂ is obtained by (approximately) optimizing L^1 with δ_opt being the suboptimality, i.e., L^1(θ̂) ≤ min_{θ∈Θ} L^1(θ) + δ_opt, and ŵ_l are created using refinement vectors η̂_l obtained by (approximately) optimizing L^2. Then

\mathcal{L}(\hat\theta, \{\hat w_l\}) \le \mathcal{L}(\theta^*, \{w_l^*\}) + D_{\mathrm{lip}} \cdot \sqrt{2 C_{\mathrm{lip}} \cdot \bar\epsilon} + \delta_{\mathrm{opt}}.

The above discussion is sufficient preparation to present a detailed description of the SiameseXML method.

4. SiameseXML

Architecture Details: SiameseXML uses sparse TF-IDF vectors as input representations for both document and label texts. Thus, if V is the number of tokens in the vocabulary, then X = R^V and x_i, z_l ∈ R^V. However, note that the results of Section 3 are agnostic to this choice. SiameseXML uses a frugal embedding architecture E_θ parameterized as θ = {E, R}, where E ∈ R^{D×V} denotes D-dimensional embeddings e_t ∈ R^D learnt for each token t ∈ [V] in the vocabulary and R ∈ R^{D×D} represents a residual layer. Initially, an "intermediate" bag-of-embeddings representation f_E(x) = N(ReLU(Ex)) is computed (N is the normalization operator, i.e., N : v ↦ v/‖v‖_2) and then passed through the residual layer to get the "final" embedding E_θ(x) (please see Figure 1 for details). Similar architectures have been demonstrated to offer superior performance on short-text applications with millisecond latency (Dahiya et al., 2021; Mittal et al., 2021). SiameseXML uses the following probability functions in its NLL objectives:

p_l(+1, v) = \frac{c \cdot \exp(d \cdot v)}{\exp(d)}, \qquad p_l(-1, v) = 1 - p_l(+1, v)

The purpose of the scaling constants c, d is to control the aggressiveness of the NLL function. For values c ∈ (0, 1), d ≥ 1, the inverse of p_l(+1, ·) exists and is exp(2d)/(cd)-Lipschitz over R_p. Also, ln p_l(+1, ·) and ln p_l(−1, ·) are respectively d- and cd/(1−c)-Lipschitz over [−1, +1]. The calculations are given in Appendix C. Hyper-parameters are detailed in Appendix F.

Figure 1: SiameseXML's embedding and classifier creation architectures, summarized by the text embedding block f_E(x) = N(ReLU(Ex)), the document embedding E_θ(x) = N(f_E(x) + g_R(f_E(x))) with g_R(v) = ReLU(Rv), and the OvA label classifier w_l = N(E_θ(z_l) + η_l). (Best viewed under magnification in color.)

Overview of Training: Section 3 suggests a modular training pipeline, similar to the one recently proposed by Dahiya et al. (2021). SiameseXML is trained over 3 modules. In Module I (Siamese Module), a Siamese architecture E_θ is adopted to embed document and label texts and L^1 is minimized to learn θ̂^0 = {Ê, R̂^0}. In Module II (Negative Mining Module), O(log L) hard negative labels are mined for each document based on embeddings learned in Module I. Finally, in Module III (Extreme Module), L^2 is minimized (considering only terms corresponding to positive and mined negative labels from Module II per data point instead of using all NL terms) to obtain refinement vectors η_l and, by reparameterization, the extreme classifiers w_l. Jointly training the residual matrix in E_θ in this module to obtain a fine-tuned value R̂ boosted performance. Note that this does not affect the applicability of Lemma 2 (justified in Appendix B). Module III thus yields 1-vs-all classifiers Ĥ = [η̂_1, ..., η̂_L] and a fine-tuned embedding model θ̂ = {Ê, R̂} (Ê is frozen after Module I).

Negative Mining: The objectives L^1, L^2 contain NL terms, and performing optimization naively takes Ω(NDL) time for a single gradient step, which is prohibitive since N, L ≥ 10^6 and D ≥ 10^2. The terms for each i can be expressed as Σ_{l ∈ P_i ∪ N_i} ln p_l(y_il, ...), where P_i = {l | y_il = +1} and N_i = {l | y_il = −1} are the sets of positive and negative labels for document i respectively. Although |P_i| ≤ O(log L) in typical XML applications, |N_i| ≥ Ω(L), resulting in the Ω(NDL) runtime. Now, whereas optimization of L^2 in Module III relies on negatives mined in Module II, Module I faces a bootstrapping problem and needs to mine negatives for itself. SiameseXML does this by adopting a novel combination of effective online and offline negative mining techniques to optimize L^1 in O(ND log L) time.
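The frugal embedding architecture described above is simple enough to state directly in code. The following PyTorch sketch is a minimal illustration under the stated architecture, not the released implementation; class and argument names are assumptions, and sparse TF-IDF inputs are represented densely for simplicity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrugalEmbedding(nn.Module):
    """E_theta(x) = N(f_E(x) + ReLU(R f_E(x))) with f_E(x) = N(ReLU(E x))."""
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.E = nn.Linear(vocab_size, dim, bias=False)  # token embeddings E in R^{D x V}
        self.R = nn.Linear(dim, dim, bias=False)          # residual layer R in R^{D x D}

    def intermediate(self, x: torch.Tensor) -> torch.Tensor:
        # f_E(x): bag-of-embeddings, ReLU, then unit normalization
        return F.normalize(torch.relu(self.E(x)), dim=-1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.intermediate(x)
        return F.normalize(f + torch.relu(self.R(f)), dim=-1)

# Document and label TF-IDF vectors pass through the same (Siamese) encoder;
# classifiers are later formed as w_l = N(E_theta(z_l) + eta_l).
```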

Module I: the Siamese Module: In this module, the partial NLL objective L^1 is optimized to learn parameters θ^0 for the "Siamese" model E_θ(·) shared by documents and labels.

Online negative mining: It is common practice to create batches over documents and mine negative labels for a document within positive labels of other documents in its mini-batch (Faghri et al., 2018; Guo et al., 2019; Chen et al., 2020; He et al., 2020). However, this was found to neglect rare labels, and a disproportionate number of mined negatives ended up being popular labels. This not only starved rare labels of gradient updates, but also starved documents of their most useful negative labels, which could be rare. SiameseXML mitigated this problem by creating batches over labels and mining negative documents instead (see Figure 2). Note that the Siamese nature of this module seamlessly allows such a mirrored strategy. Label mini-batches of size B were created at random. For each label l in the batch, a single positive document i, i.e., y_il = +1, was sampled uniformly at random, and the top-κ hard negative documents were selected from the pool of B − 1 positive documents for other labels in the batch. L^1 was trained only on these B(κ + 1) document-label pairs in this mini-batch.

Figure 2: Label-wise batching offers far superior precision than document-wise batching in Module I (Precision@1 vs. training epochs on LF-WikiSeeAlsoTitles-320K).

Offline negative mining: For very large datasets, online negative mining does not suffice, as it becomes unlikely that the most useful negative labels for a data point would be present as positive labels of other documents in its mini-batch. Increasing batch sizes as a remedy is not an option, as it increases the cost and memory footprint of forward and backward passes, which hurts scalability. The alternative is to mine negatives globally using some approximate nearest neighbor (ANNS) structure. Although effective, if used alone, this strategy risks overfitting, especially in situations with missing labels. SiameseXML tackles this issue by deploying a hybrid negative mining strategy which mines κ = 3 negative documents per label: 1 from an ANNS data structure that is periodically refreshed after every few epochs to account for evolving embeddings, and the other 2 from the mini-batch.
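The in-batch part of this hybrid strategy can be sketched as follows (illustrative PyTorch only; the function and variable names are assumptions, the ANNS-mined negative is omitted, and documents that happen to be positive for more than one label in the batch are not filtered out in this sketch).

```python
import torch

def mine_in_batch_negatives(label_emb, doc_emb, kappa=2):
    """label_emb: B x D label-text embeddings; doc_emb: B x D embeddings of one
    sampled positive document per label (unit-normalized rows). Returns, for each
    label, the indices of its top-kappa hardest negative documents in the batch."""
    sims = label_emb @ doc_emb.T                  # B x B cosine similarities
    sims.fill_diagonal_(float("-inf"))            # a label's own document is not a negative
    return sims.topk(kappa, dim=1).indices        # B x kappa document indices
```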
Module II: the Negative Mining Module: Let Ê, R̂^0 be the parameters learnt in Module I by minimizing L^1. To accelerate training using L^2 in Module III, SiameseXML mines O(log L) hard negative labels for each document by creating two small-world (HNSW) graphs (Malkov & Yashunin, 2016): a) ANNS^z, learned over the intermediate embeddings of the label text, i.e., f_Ê(z_l) for label l; and b) ANNS^μ, computed over label centroids defined as v_l := N((1/|R_l|) Σ_{i∈R_l} f_Ê(x_i)), where R_l := {i | y_il = 1} is the set of documents for which label l is relevant. Negative mining was done using intermediate embeddings f_Ê(·) rather than the final embeddings, since Ê remains frozen through Module III whereas E_θ gets modified in Module III as the residual layer R is fine-tuned jointly with the 1-vs-all classifiers. The graphs were queried to generate O(log L) ≤ 500 negative labels for document i with two components: a) N̂_i^z = {l | z_l ∈ ANNS^z(f_Ê(x_i))} and b) N̂_i^μ = {l | v_l ∈ ANNS^μ(f_Ê(x_i))}. Including a set N̂_i^r of O(log L) ≤ 50 negative labels sampled from a multinomial distribution led to more robust training. Negative mining incurs a cost of O(ND log L) (derivation in Appendix D).

Module III: the Extreme Module: SiameseXML trains the refinement vectors η_l, thus implicitly training the 1-vs-all vectors w_l = N(E_θ(z_l) + η_l), and jointly fine-tunes the residual layer R present in the embedding architecture E_θ (token embeddings Ê learnt in Module I remain frozen). Training for each data point i ∈ [N] is restricted to the set S_i := P_i ∪ N̂_i, where P_i = {l | y_il = 1} is the set of positive labels for document i and N̂_i = N̂_i^z ∪ N̂_i^μ ∪ N̂_i^r is the label shortlist obtained in Module II. Recall that data points in XML applications typically have O(log L) positive labels, i.e., |P_i| ≤ O(log L). Since |N̂_i| ≤ O(log L) by design, we have |S_i| ≤ O(log L). The following objective, which contains only O(N log L) terms, was minimized using mini-batches over documents and the Adam optimizer (Kingma & Ba, 2014)

\hat{\mathcal{L}}^2(\{\eta_l\}, R) := -\frac{1}{NL} \sum_{i=1}^{N} \sum_{l \in S_i} \ln p_l\left(y_{il}, \mathcal{E}_{\hat\theta}(x_i)^\top w_l\right)

to obtain the 1-vs-all classifiers Ĥ = [η̂_1, ..., η̂_L] and a fine-tuned embedding model θ̂ = {Ê, R̂}.

Asynchronous distributed training on 100M labels: Optimizing L̂^2 on a single GPU was infeasible for the largest dataset Q2BP-100M given the Ω(LD) memory footprint needed to train all L 1-vs-all classifiers. Multiple GPUs were deployed to speed up training by partitioning the 1-vs-all classifiers {w_l} into subsets {w_l}^1, ..., {w_l}^g, each subset being trained on a separate GPU. As R is jointly trained in this module and shared across labels, this brought up synchronization issues. SiameseXML eliminated the need for a high-bandwidth connection between GPUs and effected entirely asynchronous training on completely disconnected machines by learning a separate residual R^j for each label subset using the slightly modified objective

\hat{\mathcal{L}}^2_j(\{\eta_l\}, R^j) := -\frac{1}{NL} \sum_{i=1}^{N} \sum_{l \in S_i^j} \ln p_l\left(y_{il}, \mathcal{E}^j_{\hat\theta}(x_i)^\top w_l\right),

where j is the id of the GPU, S_i^j = S_i ∩ {w_l}^j, and E^j_θ̂(x_i) is the embedding function using R^j as the residual. Using L̂^2_j instead of L̂^2 was found to yield comparable accuracy.

Log-time Prediction: XML applications demand predictions within milliseconds to meet the latency and throughput requirements of real-world applications. Given a test point x ∈ R^V, iterating over all L 1-vs-all classifiers takes O(DL) time, which is prohibitive when L ≈ 10^8 and D ≥ 10^2. SiameseXML makes use of the nearest neighbor graphs available from Module II to offer accelerated inference in 3 steps: a) extract the intermediate and final embeddings for the test point, i.e., f_Ê(x) and E_θ̂(x); b) shortlist O(log L) of the most relevant labels as S = {l | z_l ∈ ANNS^z(f_Ê(x))} ∪ {l | v_l ∈ ANNS^μ(f_Ê(x))}; and finally c) evaluate 1-vs-all classifiers for only the shortlisted labels to get the scores ŷ_l = α · w_l^⊤ E_θ̂(x) + (1 − α) · (f_Ê(z_l) + v_l)^⊤ E_θ̂(x) if label l ∈ S, and ŷ_l = −∞ otherwise, where α ∈ [0, 1] is a hyper-parameter set via validation. Recall that v_l are label centroids. This allowed SiameseXML to make predictions in O(D^2 + D log L) time (derivation in Appendix D). In practice, this translated to around 12 milliseconds on a CPU even on the dataset with 100M labels.
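The three-step inference procedure can be sketched as follows (illustrative NumPy only; a brute-force top-k search stands in for the HNSW graphs of Module II, and all array and parameter names are assumptions).

```python
import numpy as np

def predict_scores(x_int, x_fin, label_int, centroids, label_fin, eta, alpha=0.7, k=300):
    """x_int = f_E(x), x_fin = E_theta(x); label_int[l] = f_E(z_l); centroids[l] = v_l;
    label_fin[l] = E_theta(z_l); eta[l] = refinement vector. All rows unit-normalized."""
    # b) shortlist labels via (a stand-in for) ANNS over label embeddings and centroids
    cand_z = np.argsort(-label_int @ x_int)[:k]
    cand_mu = np.argsort(-centroids @ x_int)[:k]
    shortlist = np.union1d(cand_z, cand_mu)

    # c) evaluate classifiers w_l = N(E_theta(z_l) + eta_l) only on the shortlist
    w = label_fin[shortlist] + eta[shortlist]
    w /= np.linalg.norm(w, axis=1, keepdims=True)
    clf_score = w @ x_fin
    emb_score = (label_int[shortlist] + centroids[shortlist]) @ x_fin
    scores = np.full(label_fin.shape[0], -np.inf)
    scores[shortlist] = alpha * clf_score + (1 - alpha) * emb_score
    return scores
```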
5. Generalization Bounds

Generalization bounds for multi-label problems incur either an explicit dependence on the number of labels L (Zhang, 2004) or else an implicit one via regularization constants or norms of 1-vs-all classifiers (Bartlett et al., 2017; Yu et al., 2014). However, Module I of SiameseXML learns parameters θ that are of size independent of L, and yet yields a fully functional classifier model, namely ŵ_l^0 = E_{θ̂^0}(z_l), owing to Lemma 1. This presents an opportunity to establish bounds that genuinely avoid any dependence on L, explicit or implicit. Theorem 3 (Part I) establishes such a result.

Contemporary generalization bounds for deep networks (Bartlett et al., 2017; Wei & Ma, 2019) do not take label metadata into account and thus their results do not directly apply to Siamese architectures. Moreover, such an application is expected to yield bounds for Module I that depend on the Frobenius norm of Z = [z_1, ..., z_L] ∈ R^{V×L}, which can scale as Ω(√L), due to their use of "empirical" covering numbers (Bartlett et al., 2017). Theorem 3 presents a result for Siamese architectures that not only avoids any dependence on L but also exploits task structure, such as sparsity of the input vectors x_i, z_l ∈ R^V, that existing bounds don't. In fact, Bartlett et al. (2017) state that it is a "tantalizing" open question to explore bounds that are better adapted to neural networks encountered in practice. Theorem 3 realizes this. However, to avoid dependence on L, Theorem 3 cannot rely on pushing empirical covers across layers of the network as previous works do. Instead, "uniform" covers are used, for which a novel uniform Maurey-type sparsification lemma is established that uses powerful Bernstein bounds over Hilbert spaces instead of the Chebyshev inequality used by standard "empirical" Maurey lemmata.

Theorem 3 does present a trade-off: it depends on quantities such as the token embeddings E via √(‖E‖_{∞,1} ‖E‖_{1,1}) ≤ ‖E‖_{1,1}, whereas contemporary bounds (Bartlett et al., 2017) incur a dependence on the mixed norm ‖E^⊤‖_{2,1} (see below for notation). This indicates a suboptimality of up to √D. However, in XML settings, small D ≈ 100, i.e., √D ≈ 10, are common, which suggests that Theorem 3 presents a favorable trade-off, as it avoids a dependence on L, e.g., via ‖Z‖_F, which could scale as Ω(√L), in favor of an extra √D factor. Theorem 3 (Part II) extends to the model learnt in Module III that incorporates extreme classifiers. This bound is no longer independent of L but depends on it explicitly in a weak manner via log L, and implicitly via the L_{1,1} norm of the refinement vector matrix H = [η_1, ..., η_L] ∈ R^{D×L}.

Notation: For a label vector y ∈ {−1, +1}^L, let P_y := {l : y_l = +1} and N_y := {l : y_l = −1} denote the sets of positive and negative labels respectively. Given a score vector s = [s_1, ..., s_L] ∈ [−1, 1]^L, let π_s ∈ Sym([L]) be the permutation that ranks labels in decreasing order of their scores according to s, i.e., s_{π_s(1)} ≥ s_{π_s(2)} ≥ .... We will also let π_s^+ ∈ Sym(P_y) denote the permutation that ranks the positive labels in decreasing order of their scores according to s, i.e., π_s^+(t) ∈ P_y for all t ∈ [|P_y|]. For any A ∈ R^{m×n}, ‖A‖_σ := sup_{x∈R^n} ‖Ax‖_2 / ‖x‖_2 denotes its spectral norm. For p, q ∈ [1, ∞], define the mixed norm ‖A‖_{p,q} := ‖[‖A_{1,:}‖_p, ‖A_{2,:}‖_p, ..., ‖A_{m,:}‖_p]‖_q. This first takes the p-th norm over all rows, then the q-th norm. To be sure, this is slightly different from the popular convention that takes norms over columns first. However, this slight change of convention will simplify our notation and avoid clutter due to repeated transpose symbols.
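Since the rows-first mixed norm above departs from the more common columns-first convention, a tiny reference implementation may help avoid ambiguity (plain NumPy, purely illustrative).

```python
import numpy as np

def mixed_norm(A: np.ndarray, p: float, q: float) -> float:
    """||A||_{p,q} as used here: p-norm of each row, then q-norm of the resulting vector."""
    row_norms = np.linalg.norm(A, ord=p, axis=1)
    return float(np.linalg.norm(row_norms, ord=q))

# Example: for A = [[3, 4], [0, 1]], ||A||_{2,1} = 5 + 1 = 6.
```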

Definitions: The prec@k loss counts the fraction of top ranks that are not occupied by relevant labels:

\wp_k(s, y) := 1 - \frac{1}{k} \sum_{t=1}^{k} \mathbb{I}\{\pi_s(t) \in P_y\}.

Given a margin γ > 0, define the γ-ramp function as

r_\gamma(v) := \begin{cases} 0 & \text{if } v < 0 \\ v/\gamma & \text{if } v \in [0, \gamma] \\ 1 & \text{if } v > \gamma \end{cases}

For k ∈ N, let the surrogate prec@k loss be

\ell^{\mathrm{prec}}_{\gamma,k}(s, y) := 1 - \frac{1}{k} \sum_{t=1}^{\min\{k, |P_y|\}} r_\gamma\left(s_{\pi_s^+(t)} - \max_{l' \in N_y} s_{l'}\right).

For any data point x ∈ R^V and model θ, denote s^θ(x) = [s_1^θ, ..., s_L^θ] ∈ R^L, where s_l^θ := E_θ(x)^⊤ E_θ(z_l) are the scores assigned to the various labels for this data point using the Siamese model obtained after Module I. For models obtained after Module III that incorporate extreme classifiers H = [η_1, ..., η_L], scores are redefined as s_l^{θ,H} := E_θ(x)^⊤ w_l. For training data sampled as (x_i, y_i) ∼ D, i ∈ [N], the surrogate empirical risk for a model θ is ℓ̂_N(θ) := (1/N) Σ_{i=1}^N ℓ^prec_{γ,k}(s^θ(x_i), y_i). The population prec@k risk for a model θ is ℘_k(θ) := E_{(x,y)∼D}[℘_k(s^θ(x), y)]. ℓ̂_N(θ, H) and ℘_k(θ, H) are similarly defined using s_l^{θ,H} instead.

Theorem 3. (Part I) If θ̂^0 is the model obtained after Module I and ‖x_i‖_0, ‖z_l‖_0 ≤ s, then with probability 1 − δ,

\wp_k(\hat\theta^0) \le \hat\ell_N(\hat\theta^0) + \frac{1}{\gamma} \cdot \frac{P \ln(N)}{\sqrt{N}} + \sqrt{\frac{\ln(1/\delta)}{N}},

where P = O(√(D ln(D)) · √(R^R_∞ R^R_1) + s ln(DV) · R^R_σ · √(R^E_∞ R^E_1)), with R^E_1 = ‖Ê‖_{1,1}, R^E_∞ = ‖Ê‖_{∞,1}, R^R_1 = ‖R̂^0‖_{1,1}, R^R_∞ = ‖R̂^0‖_{∞,1}, and R^R_σ = ‖R̂^0‖_σ.

(Part II) If {θ̂, Ĥ} is the model obtained after Module III (Ê from Module I; R̂, Ĥ learnt in Module III), then with probability 1 − δ,

\wp_k(\hat\theta, \hat{H}) \le \hat\ell_N(\hat\theta, \hat{H}) + \frac{1}{\gamma} \cdot \frac{Q \ln(N)}{\sqrt{N}} + \sqrt{\frac{\ln(1/\delta)}{N}},

where Q = P + O(ln(DL) · √(R^C_∞ R^C_1)), with R^C_1 = ‖Ĥ‖_{1,1}, R^C_∞ = ‖Ĥ‖_{1,∞}, and now R^R_1 = ‖R̂‖_{1,1}, R^R_∞ = ‖R̂‖_{∞,1}, and R^R_σ = ‖R̂‖_σ.

Appendix E contains a more relaxed discussion of related works, a proof of Theorem 3, as well as an outline of how ℓ̂_N can be replaced with L (i.e., the NLL expression optimized by SiameseXML; see Equation (1)) in Theorem 3.

Table 1: SiameseXML could be significantly more accurate than leading deep XML methods including Astec, DECAF, and X-Transformer on public datasets. Results are only presented for methods that converged within the timeout. SiameseXML-3 refers to an ensemble of 3 learners.

LF-AmazonTitles-131K
Method | PSP@1 | PSP@3 | PSP@5 | P@1 | P@5 | Training Time (hr)
SiameseXML | 33.83 | 38.34 | 42.82 | 39.43 | 18.87 | 0.66
SiameseXML-3 | 34.51 | 39.20 | 43.89 | 40.26 | 19.38 | 1.97
Astec | 29.22 | 34.64 | 39.49 | 37.12 | 18.24 | 1.83
DECAF | 30.85 | 36.44 | 41.42 | 38.40 | 18.65 | 2.16
X-Transformer | 21.72 | 24.42 | 27.09 | 29.95 | 13.07 | 64.40
MACH | 24.97 | 30.23 | 34.72 | 33.49 | 16.45 | 3.30
SLICE+FastText | 23.08 | 27.74 | 31.89 | 30.43 | 14.84 | 0.08
AttentionXML | 23.97 | 28.60 | 32.57 | 32.25 | 15.61 | 20.73
Parabel | 23.27 | 28.21 | 32.14 | 32.60 | 15.61 | 0.03
Bonsai | 24.75 | 30.35 | 34.86 | 34.11 | 16.63 | 0.10
DiSMEC | 25.86 | 32.11 | 36.97 | 35.14 | 17.24 | 3.10

LF-WikiSeeAlsoTitles-320K
Method | PSP@1 | PSP@3 | PSP@5 | P@1 | P@5 | Training Time (hr)
SiameseXML | 22.68 | 24.10 | 25.80 | 27.63 | 14.02 | 1.05
SiameseXML-3 | 23.29 | 24.88 | 26.66 | 28.51 | 14.62 | 3.15
Astec | 13.69 | 15.81 | 17.50 | 22.72 | 11.43 | 4.17
DECAF | 16.73 | 18.99 | 21.01 | 25.14 | 12.86 | 11.16
MACH | 9.68 | 11.28 | 12.53 | 18.06 | 8.99 | 8.23
SLICE+FastText | 11.24 | 13.45 | 15.20 | 18.55 | 9.68 | 0.20
AttentionXML | 9.45 | 10.63 | 11.73 | 17.56 | 8.52 | 56.12
Parabel | 9.24 | 10.65 | 11.80 | 17.68 | 8.59 | 0.07
Bonsai | 10.69 | 12.44 | 13.79 | 19.31 | 9.55 | 0.37
DiSMEC | 10.56 | 13.01 | 14.82 | 19.12 | 9.87 | 15.56

6. Experiments

Datasets: Multiple benchmark short-text datasets for product-to-product recommendation (LF-AmazonTitles-131K), as well as for predicting related Wikipedia articles based on titles (LF-WikiSeeAlsoTitles-320K), were considered. These can be downloaded from the Extreme Classification Repository (Bhatia et al., 2016). Results are also reported on proprietary datasets with up to 100 million labels for matching user queries to advertiser bid phrases (Q2BP-4M, Q2BP-40M, and Q2BP-100M). These were created by mining click logs of a popular search engine where a query was treated as a data point and clicked advertiser bid phrases became its labels. The Q2BP-40M and Q2BP-100M datasets were created by mining logs of different streams, whereas Q2BP-4M was created by randomly sampling 10% of the labels from Q2BP-40M to allow experimentation with less scalable methods. Please refer to Appendix F for data statistics.

Baselines: Comparisons are presented against leading deep XML methods such as X-Transformer (Chang et al., 2020), Astec (Dahiya et al., 2021), MACH (Medini et al., 2019), DECAF (Mittal et al., 2021), and AttentionXML (You et al., 2019), as well as classical methods such as DiSMEC (Babbar & Schölkopf, 2017), Bonsai (Khandagale et al., 2019), Slice (Jain et al., 2019), and Parabel (Prabhu et al., 2018b). X-Transformer and DECAF are of particular interest as they make use of label text to improve their predictions. Implementations provided by the respective authors were used in all cases. All methods were offered a timeout of one week on a single GPU. Results are also reported for popular Siamese networks for ad retrieval such as TwinBert (Lu et al., 2020) and CDSSM (Huang et al., 2013) on the Q2BP datasets. Please refer to Appendix F for SiameseXML's hyper-parameters and their settings.

Evaluation metrics: Performance was evaluated using standard XML performance measures: precision@k (P@k, k ∈ {1, 5}) and propensity-scored precision@k (PSP@k, k ∈ {1, 3, 5}). Results on nDCG@k (N@k) and propensity-scored nDCG@k (PSN@k) are presented in Appendix F, which also contains the definitions of all these metrics. All training times are reported on a 24-core Intel Xeon 2.6 GHz machine with a single Nvidia V100 GPU; however, the Q2BP-40M and Q2BP-100M datasets were afforded multiple GPUs.
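For reference, a minimal NumPy sketch of the two headline metrics is given below (illustrative only; the propensity-scored variant follows the standard formulation of Jain et al. (2016) in unnormalized form, while the exact definitions used in this paper are given in Appendix F).

```python
import numpy as np

def precision_at_k(scores, relevant, k=5):
    """P@k: fraction of the top-k ranked labels that are relevant (relevant is a 0/1 vector)."""
    topk = np.argsort(-scores)[:k]
    return relevant[topk].sum() / k

def psp_at_k(scores, relevant, propensities, k=5):
    """Unnormalized propensity-scored precision@k: rare labels (small propensity) count more.
    Reported PSP@k additionally normalizes by the best achievable value."""
    topk = np.argsort(-scores)[:k]
    return (relevant[topk] / propensities[topk]).sum() / k
```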

Table 2: SiameseXML could be significantly more accurate than leading extreme classifiers as well as Siamese networks, including TwinBert and CDSSM, for matching user queries to advertiser bid phrases.

Method | PSP@1 | PSP@3 | PSP@5 | P@1 | P@3 | P@5
Q2BP-100M
SiameseXML | 74.82 | 84.28 | 87.36 | 80.84 | 40.33 | 26.71
TwinBert | 60.76 | 66.58 | 69.37 | 61.54 | 30.51 | 20.42
Q2BP-40M
SiameseXML | 58.86 | 66.44 | 70.59 | 70.33 | 48.13 | 37.38
TwinBert | 52.44 | 55.13 | 56.43 | 57.69 | 41.53 | 33.2
CDSSM | 43.41 | 42.57 | 42.83 | 48.6 | 32.29 | 25.19
Parabel | 36.04 | 41.52 | 44.84 | 47.71 | 36.64 | 30.45
Q2BP-4M
SiameseXML | 67.24 | 72.82 | 76.03 | 69.53 | 38.58 | 27.12
CDSSM | 44.95 | 48.19 | 50.78 | 45.58 | 24.62 | 17.38
Astec | 57.78 | 69.16 | 73.84 | 66.39 | 37.35 | 26.52
SLICE+CDSSM | 47.41 | 59.52 | 65.89 | 54.77 | 33.15 | 24.11
Parabel | 55.18 | 66.08 | 71.21 | 61.61 | 36.36 | 26.01

Table 3: Impact of incrementally adding components, i.e., label embeddings (f_Ê(z_l)), label centroids (v_l) and label classifiers w_l, on the LF-WikiSeeAlsoTitles-320K dataset.

Components | P@1 | Recall@500
{f_Ê(z_l)} | 24.71 | 48.61
{f_Ê(z_l), v_l} | 25.54 | 58.31
{f_Ê(z_l), v_l, w_l} | 27.63 | 58.31

Offline results: Table 1 presents results on benchmark datasets for short-text XML tasks, where SiameseXML could be 3-12% more accurate than DECAF and X-Transformer in propensity-scored precision. This indicates that SiameseXML's gains are more prominent on rare labels, which are more rewarding in real-world scenarios. SiameseXML could also be 3-97x faster at training than DECAF and X-Transformer. Moreover, SiameseXML was found to be 2-13% more accurate than Astec, AttentionXML, and MACH, indicating that SiameseXML can be more accurate and simultaneously faster to train than leading deep extreme classifiers. Table 4 includes a qualitative analysis of SiameseXML's predictions. SiameseXML could also be 4-11% more accurate than leading XML methods with fixed features, including Parabel, DiSMEC, Bonsai, and Slice. Results are also reported for an ensemble of three learners, named SiameseXML-3, which offers an additional 1% gain in performance over SiameseXML.

Table 2 demonstrates that SiameseXML could be 10-30% more accurate than leading XML techniques that could scale to the Q2BP datasets. Most existing techniques could not scale to 100 million labels, whereas SiameseXML could be trained within 2 days on 6x Nvidia V100 GPUs. SiameseXML was also found to be 6-21% more accurate than Siamese networks, including TwinBert and CDSSM, on the Q2BP datasets. This demonstrates that SiameseXML can scale to datasets with 100M labels while still being more accurate than both Siamese networks and extreme classifiers.

Online live deployment: SiameseXML was deployed on a popular search engine to perform A/B tests on live search engine traffic for matching user-entered queries to advertiser bid phrases (Q2BP) and was compared to an ensemble of leading (proprietary) information retrieval, XML, generative and graph-based techniques. Performance was measured in terms of Click-Yield (CY), Query Coverage (QC), and Click-Through Rate (CTR). Click-Yield is defined as the number of clicks per unit search query (please refer to Appendix F for details). SiameseXML was found to increase CY and CTR by 1.4% and 0.6%, respectively. This indicates that ads surfaced using the proposed method are more relevant to the end user. Additionally, SiameseXML offered 2.83% higher QC, indicating its ability to surface ads for queries where ads were previously not shown. Further, human labelling by expert judges established that SiameseXML could increase the quality of predictions by 11% over state-of-the-art in-production techniques.

Ablations: The aim of the ablation experiments was to investigate SiameseXML's design choices in Modules I, II, and III. First, the label mini-batching based sampling strategy in Module I could be up to 3% more accurate than the popular strategy based on creating document mini-batches (see Figure 2). Second, label shortlists in Module II (N̂_i) could be up to 4% and 10% more accurate than a shortlist computed solely based on label embeddings (N̂_i^z) in terms of precision and recall respectively (see Table 3), with comparatively larger gains on larger datasets. This demonstrates that the label text by itself may not be as informative for some labels, and that using multiple representatives for each label can lead to performance gains in short-text applications. As noted earlier, Lemma 1 shows that Module I itself yields a fully functional classifier model with bounded suboptimality. Thus, in principle, SiameseXML could stop training after Module I and use the label text embeddings as classifiers along with the label shortlist N̂_i^z to make final predictions. However, this is suboptimal, and training an extreme classifier in Module III can lead to 2-6% more accurate predictions, with comparatively larger gains on larger datasets, indicating the utility of extreme classifiers (see Table 3).

Table 4: SiameseXML's predictions for the document "List of Go players" from LF-WikiSeeAlsoTitles-320K are more accurate as compared to leading methods including DECAF and AttentionXML. Mispredictions are typeset in light gray in the original table.

Method | Predictions
SiameseXML | List of Go organizations, Go players, International Go Federation, Go professional, List of professional Go tournaments
DECAF | Go players, List of Go organizations, Music of the Republic of Macedonia, Players, List of all-female bands
AttentionXML | List of NHL players, List of professional Go tournaments, List of foreign NBA players, List of chess grandmasters, List of Israeli chess players

References

Agrawal, R., Gupta, A., Prabhu, Y., and Varma, M. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In WWW, 2013.
Babbar, R. and Schölkopf, B. DiSMEC: Distributed sparse machines for extreme multi-label classification. In WSDM, 2017.
Babbar, R. and Schölkopf, B. Data scarcity, robustness and extreme multi-label classification. ML, 2019.
Bartlett, P. L., Foster, D. J., and Telgarsky, M. Spectrally-normalized margin bounds for neural networks. In NIPS, 2017.
Bhatia, K., Jain, H., Kar, P., Varma, M., and Jain, P. Sparse local embeddings for extreme multi-label classification. In NIPS, December 2015.
Bhatia, K., Dahiya, K., Jain, H., Mittal, A., Prabhu, Y., and Varma, M. The extreme classification repository: Multi-label datasets & code, 2016. URL http://manikvarma.org/downloads/XC/XMLRepository.html.
Chang, W.-C., Yu, H.-F., Zhong, K., Yang, Y., and Dhillon, I. S. Taming pretrained transformers for extreme multi-label text classification. In KDD, 2020.
Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In ICML, 2020.
Dahiya, K., Saini, D., Mittal, A., Shaw, A., Dave, K., Soni, A., Jain, H., Agarwal, S., and Varma, M. DeepXML: A deep extreme multi-label learning framework applied to short text documents. In WSDM, 2021.
Faghri, F., Fleet, D. J., Kiros, J. R., and Fidler, S. VSE++: Improving visual-semantic embeddings with hard negatives. In BMVC, 2018.
Guo, C., Mousavi, A., Wu, X., Holtmann-Rice, D. N., Kale, S., Reddi, S., and Kumar, S. Breaking the glass ceiling for embedding-based classifiers for large output spaces. In NeurIPS, 2019.
Harwood, B., Kumar, B. G. V., Carneiro, G., Reid, I., and Drummond, T. Smart mining for deep metric learning. In ICCV, 2017.
He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
Huang, P. S., He, X., Gao, J., Deng, L., Acero, A., and Heck, L. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013.
Jain, H., Prabhu, Y., and Varma, M. Extreme multi-label loss functions for recommendation, tagging, ranking and other missing label applications. In KDD, August 2016.
Jain, H., Balasubramanian, V., Chunduri, B., and Varma, M. Slice: Scalable linear extreme classifiers trained on 100 million labels for related searches. In WSDM, 2019.
Jasinska, K., Dembczynski, K., Busa-Fekete, R., Pfannschmidt, K., Klerx, T., and Hüllermeier, E. Extreme F-measure maximization using sparse probability estimates. In ICML, 2016.
Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T. Bag of tricks for efficient text classification. In EACL, 2017.
Khandagale, S., Xiao, H., and Babbar, R. Bonsai - diverse and shallow trees for extreme multi-label classification. CoRR, 2019.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. 2014.
Lei, Y., Dogan, U., Zhou, D.-X., and Kloft, M. Data-dependent generalization bounds for multi-class classification. IEEE Transactions on Information Theory, 65(5):2995-3021, 2019. doi: 10.1109/TIT.2019.2893916.
Liu, J., Chang, W., Wu, Y., and Yang, Y. Deep learning for extreme multi-label text classification. In SIGIR, 2017.
Lu, W., Jiao, J., and Zhang, R. TwinBERT: Distilling knowledge to twin-structured compressed BERT models for large-scale retrieval. In CIKM, 2020.
Luan, Y., Eisenstein, J., Toutanova, K., and Collins, M. Sparse, dense, and attentional representations for text retrieval. 2020.
Malkov, Y. A. and Yashunin, D. A. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. CoRR, 2016.
Medini, T. K. R., Huang, Q., Wang, Y., Mohan, V., and Shrivastava, A. Extreme classification in log memory using count-min sketch: A case study of Amazon search with 50M products. In NeurIPS, 2019.
Mineiro, P. and Karampatziakis, N. Fast label embeddings via randomized linear algebra. In ECML/PKDD, 2015.
Mittal, A., Dahiya, K., Agrawal, S., Saini, D., Agarwal, S., Kar, P., and Varma, M. DECAF: Deep extreme classification with label features. In WSDM, 2021.
Niculescu-Mizil, A. and Abbasnejad, E. Label filters for large scale multilabel classification. In AISTATS, 2017.
Prabhu, Y. and Varma, M. FastXML: A fast, accurate and stable tree-classifier for extreme multi-label learning. In KDD, August 2014.
Prabhu, Y., Kag, A., Gopinath, S., Dahiya, K., Harsola, S., Agrawal, R., and Varma, M. Extreme multi-label learning with label features for warm-start tagging, ranking and recommendation. In WSDM, 2018a.
Prabhu, Y., Kag, A., Harsola, S., Agrawal, R., and Varma, M. Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In WWW, 2018b.
Schroff, F., Kalenichenko, D., and Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In CVPR, 2015.
Tagami, Y. AnnexML: Approximate nearest neighbor search for extreme multi-label classification. In KDD, 2017.
Wei, C. and Ma, T. Data-dependent sample complexity of deep neural networks via Lipschitz augmentation. In NeurIPS, 2019.
Wu, C.-Y., Manmatha, R., Smola, A. J., and Krähenbühl, P. Sampling matters in deep embedding learning. In ICCV, 2017.
Wydmuch, M., Jasinska, K., Kuznetsov, M., Busa-Fekete, R., and Dembczynski, K. A no-regret generalization of hierarchical softmax to extreme multi-label classification. In NIPS, 2018.
Xiong, L., Xiong, C., Li, Y., Tang, K.-F., Liu, J., Bennett, P., Ahmed, J., and Overwijk, A. Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808, 2020.
Xu, C., Tao, D., and Xu, C. Robust extreme multi-label learning. In KDD, 2016.
Yen, E. I., Huang, X., Zhong, K., Ravikumar, P., and Dhillon, I. S. PD-Sparse: A primal and dual sparse approach to extreme multiclass and multilabel classification. In ICML, 2016.
Yen, E. I., Huang, X., Dai, W., Ravikumar, P., Dhillon, I., and Xing, E. PPDSparse: A parallel primal-dual sparse method for extreme classification. In KDD, 2017.
You, R., Dai, S., Zhang, Z., Mamitsuka, H., and Zhu, S. AttentionXML: Extreme multi-label text classification with multi-label attention based recurrent neural networks. In NeurIPS, 2019.
Yu, H., Jain, P., Kar, P., and Dhillon, I. S. Large-scale multi-label learning with missing labels. In ICML, 2014.
Zhang, T. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5:1225-1251, October 2004.