SiameseXML: Siamese Networks meet Extreme Classifiers with 100M Labels

Kunal Dahiya 1, Ananye Agarwal 1, Deepak Saini 2, Gururaj K 3, Jian Jiao 3, Amit Singh 3, Sumeet Agarwal 1, Purushottam Kar 4 2, Manik Varma 2 1

1 Indian Institute of Technology Delhi, 2 Microsoft Research, 3 Microsoft, 4 Indian Institute of Technology Kanpur. Correspondence to: Kunal Dahiya <[email protected]>. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

The task of deep extreme multi-label learning (XML) requires training deep architectures capable of tagging a data point with its most relevant subset of labels from an extremely large label set. Applications of XML include tasks such as ad and product recommendation that involve labels which are rarely seen during training but which nevertheless hold the key to recommendations that delight users. Effective utilization of label metadata and high-quality predictions for rare labels at the scale of millions of labels are key challenges in contemporary XML research. To address these, this paper develops the SiameseXML method by proposing a novel probabilistic model suitable for extreme scales that naturally yields a Siamese architecture, offering generalization guarantees that can be entirely independent of the number of labels, melded with high-capacity extreme classifiers. SiameseXML could effortlessly scale to tasks with 100 million labels, and in live A/B tests on a popular search engine it yielded significant gains in click-through rates, coverage, revenue and other online metrics over state-of-the-art techniques currently in production. SiameseXML also offers predictions 3-12% more accurate than leading XML methods on public benchmark datasets. The generalization bounds are based on a novel uniform Maurey-type sparsification lemma which may be of independent interest. Code for SiameseXML will be made available publicly.

1. Introduction

Overview: Extreme Multi-label Learning (XML) involves tagging a data point with its most relevant subset of labels from an extremely large set. XML finds applications in a myriad of ranking and recommendation tasks such as product-to-product (Mittal et al., 2021), product-to-query (Chang et al., 2020), query-to-product (Medini et al., 2019), query-to-bid-phrase (Dahiya et al., 2021), etc., that present specific statistical as well as computational challenges.

Short-text Applications with Label Metadata: Applications where data points are endowed with short textual descriptions (e.g., product title, search query) containing 3-10 tokens are known as short-text applications. These accurately model the ranking and recommendation tasks mentioned above and have attracted much attention recently (Chang et al., 2020; Dahiya et al., 2021; Medini et al., 2019; Mittal et al., 2021). This paper specifically focuses on applications where labels (e.g., related items, bid-phrases) are also endowed with short-text descriptions. This is in sharp contrast with XML works, e.g., (Babbar & Schölkopf, 2019; Prabhu et al., 2018b; You et al., 2019), that treat labels as identifiers devoid of descriptive features. We note that other forms of label metadata, such as label hierarchies, could also be used, but this paper focuses on label textual descriptions as a reliable and readily available form of label metadata.

Technical Challenges: Label sets in XML tasks routinely contain several millions of labels, yet latency requirements demand that predictions on a test point be made within milliseconds. The vast majority of labels in XML tasks are rare labels for which very little training data (often < 5 training points) is available. The use of label metadata has been demonstrated (Mittal et al., 2021) to be beneficial in accurately predicting rare labels for which training data alone may not inform the classifier model adequately. However, it remains challenging to design architectures that effectively utilize label metadata at the scale of millions of labels.

Contributions: This paper presents SiameseXML, an effective solution for training XML models utilizing label text at the scale of 100 million labels. From a technical standpoint, a) SiameseXML is based on a novel probabilistic model that naturally motivates a modular approach melding Siamese networks with extreme classifiers, enhancing the capacity of existing Siamese architectures. b) The Siamese module of SiameseXML offers generalization bounds entirely independent of the number of labels L, whereas the bounds for the Extreme module depend only on log L. This is advantageous for tasks with as many as 100M labels. These bounds are based on a novel Maurey-type sparsification lemma which may be of independent interest. From an application standpoint, c) SiameseXML offers better scalability than existing methods, scaling to tasks with 100M labels while still offering predictions within milliseconds. d) SiameseXML's predictions can be 3-12% more accurate than those of leading XML methods on benchmark datasets, and SiameseXML was found to improve the quality of predictions by 11% in the application of matching user queries to advertiser bid phrases.
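To make the two-tower idea concrete, below is a minimal sketch of a Siamese encoder of the kind discussed in this paper: a single text encoder, shared between data points and labels, maps both onto the unit sphere so that relevance reduces to an inner product. This is not the authors' implementation; the bag-of-embeddings encoder (in the spirit of the lightweight short-text architectures cited above), the dimensions, and all names here are illustrative assumptions.

```python
# A minimal sketch (assumed, not the paper's code) of a shared two-tower
# encoder: both queries and label descriptions pass through the SAME
# module, so rare labels benefit from parameters trained on all text.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedTextEncoder(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 64):
        super().__init__()
        # EmbeddingBag averages token embeddings: a cheap encoder suited
        # to short texts (3-10 tokens) such as queries and label titles.
        self.embed = nn.EmbeddingBag(vocab_size, dim, mode="mean")

    def forward(self, token_ids: torch.Tensor, offsets: torch.Tensor):
        # L2-normalize so cosine similarity equals the dot product.
        return F.normalize(self.embed(token_ids, offsets), dim=-1)

# Usage: the same encoder embeds a query and a label description;
# their inner product scores the (query, label) pair.
encoder = SharedTextEncoder(vocab_size=100_000)
query = torch.tensor([3, 17, 42])   # token ids of one short query
label = torch.tensor([17, 99])      # token ids of one label title
zero = torch.tensor([0])            # single-sequence offset
q = encoder(query, zero)
l = encoder(label, zero)
score = (q * l).sum(-1)             # relevance score in [-1, 1]
```

Because label representations come from text rather than per-label free parameters, the encoder's size is independent of L, which is what allows generalization guarantees of the kind claimed above to avoid any explicit dependence on the number of labels.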
2. Related Works

Extreme Multi-label Learning (XML): Much prior work has focused on designing classifiers for fixed features such as bag-of-words, or else pre-trained features such as FastText (Joulin et al., 2017). Representative works include (Agrawal et al., 2013; Babbar & Schölkopf, 2017; 2019; Bhatia et al., 2015; Jain et al., 2016; 2019; Jasinska et al., 2016; Khandagale et al., 2019; Mineiro & Karampatziakis, 2015; Niculescu-Mizil & Abbasnejad, 2017; Prabhu & Varma, 2014; Prabhu et al., 2018a;b; Tagami, 2017; Wydmuch et al., 2018; Xu et al., 2016; Yen et al., 2016). While highly scalable, using fixed or pre-trained features leads to sub-optimal accuracy in real-world scenarios, as demonstrated in several recent works that propose deep learning algorithms to jointly learn task-dependent features and classifiers. These include XML-CNN (Liu et al., 2017), AttentionXML (You et al., 2019), MACH (Medini et al., 2019), X-Transformer (Chang et al., 2020) and Astec (Dahiya et al., 2021), which respectively use CNN-, attention-, MLP-, transformer-, and bag-of-embeddings-based architectures. It is notable that deep extreme classifiers not only outperform classical approaches that use fixed features, but also scale to millions of labels.

Label Metadata: Recent works have demonstrated that incorporating label metadata can also lead to significant performance boosts as compared to approaches that consider labels as feature-less identifiers. Prominent among them are the X-Transformer (Chang et al., 2020) and DECAF (Mittal et al., 2021). Unfortunately, both methods struggle to accurately scale to tasks with several millions of labels, as discussed below. Moreover, experiments show that the SiameseXML method proposed in this paper can outperform both the X-Transformer and DECAF on smaller datasets. The X-Transformer makes use of label text to learn intermediate representations and learns a vast ensemble of classifiers based on the powerful transformer architecture. However, training multiple transformer models requires a rather large array of GPUs, and the method has not been shown to scale to tasks with several millions of labels. DECAF, on the other hand, clusters labels into L̂ meta-labels, and L̂ must grow as the number of labels L grows, for example L̂ ≈ 130K for L ≈ 1M labels. Since DECAF requires the meta-label representations corresponding to all L̂ meta-labels to be recomputed for every iteration (mini-batch), the method becomes expensive for larger datasets. Taking a small value, say L̂ ≈ 8K, does improve training time but hurts performance. SiameseXML will demonstrate that a direct label-text embedding can offer better scalability, as well as better performance, compared to DECAF's meta-label based approach.

Siamese Networks: Siamese networks typically learn data point and label embeddings by optimizing the pairwise contrastive loss (Chen et al., 2020; Xiong et al., 2020) or else the triplet loss (Schroff et al., 2015; Wu et al., 2017). Performing this optimization exactly can be computationally prohibitive at extreme scales, as for N data points and L labels, per-epoch costs of O(NL) and O(NL²) are incurred for the pairwise and triplet losses, respectively. To avoid this, it is common to train each data point with respect to only a small, say O(log L)-sized, subset of labels, which brings the training cost down to O(N log L) per epoch. This subset typically contains all positive labels of the data point (of which there are only O(log L) in most XML applications) and a carefully chosen set of O(log L) negative labels which seem the most challenging for this data point, as sketched below. There is some debate as to whether the absolutely "hardest" negatives for a data point should be considered (Wu et al., 2017; Xiong et al., 2020), especially in situations where missing labels abound (Jain et al., 2016), or whether considering multiple hard negatives is essential. For instance, (Schroff et al., 2015; Harwood et al., 2017) observed that using only "hard" negatives could lead to overfitting in Siamese networks. Nevertheless, such negative mining has become a cornerstone for training classifier models at extreme scales.
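The sketch below shows one way the O(N log L) regime just described can be realized: the loss for a data point touches only its few positive labels and a small shortlist of sampled negatives, never all L labels. The triplet-style hinge, the margin value, and the shapes are illustrative assumptions rather than the loss used by any particular method.

```python
# A sketch (assumed formulation) of training on an O(log L)-sized label
# subset: P positives and K sampled hard negatives per data point.
import torch
import torch.nn.functional as F

def sampled_triplet_loss(q, pos, neg, margin: float = 0.3):
    """q: (d,) query embedding; pos: (P, d) positive label embeddings;
    neg: (K, d) sampled negative label embeddings, with P, K = O(log L)."""
    pos_sim = pos @ q                 # (P,) similarities to positives
    neg_sim = neg @ q                 # (K,) similarities to negatives
    # Hinge on every (positive, negative) pair drawn from the shortlist;
    # cost is O(P * K) per point, independent of the full label set size.
    gaps = margin - pos_sim[:, None] + neg_sim[None, :]
    return F.relu(gaps).mean()

# Usage with random unit vectors standing in for learnt embeddings.
d, P, K = 64, 4, 20
q = F.normalize(torch.randn(d), dim=0)
pos = F.normalize(torch.randn(P, d), dim=1)
neg = F.normalize(torch.randn(K, d), dim=1)
loss = sampled_triplet_loss(q, pos, neg)
```

How the K negatives are chosen is exactly the negative-mining question taken up next: sampling them uniformly is cheap but uninformative, while always picking the very hardest ones risks the overfitting noted by (Schroff et al., 2015; Harwood et al., 2017).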
Negative Mining: Although relatively well understood for approaches that use fixed features, negative mining becomes an interesting problem in itself when features are jointly learnt, since the set of challenging negatives for a data point may keep changing as the feature representation of that data point changes over the epochs. Several approaches have been proposed to address this. Online approaches look for challenging negative labels for a data point within the positive labels of other data points in the same mini-batch (Faghri et al., 2018; Guo et al., 2019; Chen et al., 2020; He et al., 2020). Although computationally cheap, this strategy becomes less effective as the number of labels grows: it becomes more and more unlikely that the most useful negative labels for a data point would just happen to get sampled within its mini-batch.
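For concreteness, below is one common realization of such in-batch negative mining, an InfoNCE-style softmax over the batch: with one positive label embedded per data point, every other row of the batch's label matrix serves as a negative, giving B - 1 "free" negatives per point at no extra encoding cost. The softmax/cross-entropy formulation and the temperature are illustrative choices, not a loss prescribed by the works cited above.

```python
# A sketch (assumed formulation) of in-batch negatives: the (i, j)
# off-diagonal entries of the B x B similarity matrix act as negatives
# for data point i, while the diagonal holds its true positive.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(Q, L_pos, temperature: float = 0.1):
    """Q: (B, d) normalized data-point embeddings; L_pos: (B, d)
    normalized embeddings of one positive label per data point."""
    logits = (Q @ L_pos.T) / temperature              # (B, B) similarities
    targets = torch.arange(Q.shape[0], device=Q.device)
    return F.cross_entropy(logits, targets)
```

The limitation discussed above is visible in this sketch: the negatives for point i are whatever labels happened to be positives for the other B - 1 points in the batch, so as L grows far beyond B, the chance that a truly confusing negative lands in the batch shrinks accordingly.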