DECAF: Deep Extreme Classification with Label Features

Anshul Mittal, Kunal Dahiya, Sumeet Agarwal (IIT Delhi, India); Sheshansh Agrawal, Deepak Saini (Microsoft Research, India); Purushottam Kar (IIT Kanpur and Microsoft Research, India); Manik Varma (Microsoft Research and IIT Delhi, India)

arXiv:2108.00368v1 [cs.CL] 1 Aug 2021

ABSTRACT

Extreme multi-label classification (XML) involves tagging a data point with its most relevant subset of labels from an extremely large label set, with several applications such as product-to-product recommendation with millions of products. Although leading XML algorithms scale to millions of labels, they largely ignore label metadata such as textual descriptions of the labels. On the other hand, classical techniques that can utilize label metadata via representation learning using deep networks struggle in extreme settings. This paper develops the DECAF algorithm, which addresses these challenges by learning models enriched by label metadata that jointly learn model parameters and feature representations using deep networks and offer accurate classification at the scale of millions of labels. DECAF makes specific contributions to model architecture design, initialization, and training, enabling it to offer up to 2-6% more accurate prediction than leading extreme classifiers on publicly available benchmark product-to-product recommendation datasets, such as LF-AmazonTitles-1.3M. At the same time, DECAF was found to be up to 22× faster at inference than leading deep extreme classifiers, which makes it suitable for real-time applications that require predictions within a few milliseconds. The code for DECAF is available at https://github.com/Extreme-classification/DECAF.

CCS CONCEPTS

• Computing methodologies → Machine learning; Supervised learning by classification.

KEYWORDS

Extreme multi-label classification; product-to-product recommendation; label features; label metadata; large-scale learning

ACM Reference Format:
Anshul Mittal, Kunal Dahiya, Sheshansh Agrawal, Deepak Saini, Sumeet Agarwal, Purushottam Kar, and Manik Varma. 2021. DECAF: Deep Extreme Classification with Label Features. In Proceedings of the Fourteenth ACM International Conference on Web Search and Data Mining (WSDM '21), March 8–12, 2021, Virtual Event, Israel. ACM, New York, NY, USA, 18 pages. https://doi.org/10.1145/3437963.3441807

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
WSDM '21, March 8–12, 2021, Virtual Event, Israel
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8297-7/21/03...$15.00
https://doi.org/10.1145/3437963.3441807

1 INTRODUCTION

Objective: Extreme multi-label classification (XML) refers to the task of tagging data points with a relevant subset of labels from an extremely large label set. This paper demonstrates that XML algorithms stand to gain significantly by incorporating label metadata. The DECAF algorithm is proposed, which can be up to 2-6% more accurate than leading XML methods such as Astec [8], MACH [23], Bonsai [20], AttentionXML [38], etc., while offering predictions within a fraction of a millisecond, making it suitable for high-volume and time-critical applications.

Short-text applications: Applications such as predicting related products given a retail product's name [23], or predicting related webpages given a webpage title, or related searches [14], all involve short texts, with the product name, webpage title, or search query containing just 3-10 words on average. In addition to the statistical and computational challenges posed by a large label set, short-text tasks are particularly challenging because only a few words are available per data point. This paper focuses on short-text applications such as related-product and webpage recommendation.

Label metadata: Metadata for labels can be available in various forms: textual representations, label hierarchies, label taxonomies [19, 24, 31], or label correlation graphs, and can capture semantic relations between labels. For instance, the Amazon products (which serve as labels in a product-to-product recommendation task) "Panzer Dragoon" and "Panzer Dragoon Orta" do not share any common training point but are semantically related. Label metadata can enable collaborative learning, which especially benefits tail labels. Tail labels are those for which very few training points are available, and they form the majority of labels in XML applications [2, 3, 15]. For instance, in the LF-AmazonTitles-131K dataset, just 14 documents are tagged with the label "Panzer Dragoon Orta" while 23 documents are tagged with the label "Panzer Dragoon". In this paper, we will focus on utilizing label text as a form of label metadata.

DECAF: DECAF learns a separate linear classifier per label based on the 1-vs-All approach. These classifiers critically utilize label metadata and require careful initialization, since random initialization [10] leads to inferior performance at extreme scales. DECAF proposes using a shortlister with large fanout to cut down training and prediction time drastically. Specifically, given a training set of N examples, L labels, and D-dimensional embeddings being learnt, the shortlister brings training time down from O(NDL) to O(ND log L) (by training only on the O(log L) most confusing negative labels for every training point), and prediction time down from O(DL) to O(D log L) (by evaluating classifiers corresponding to only the O(log L) most likely labels). An efficient and scalable two-stage strategy is proposed to train the shortlister.

Comparison with state-of-the-art: Experiments conducted on publicly available benchmark datasets revealed that DECAF can be 5% more accurate than leading approaches such as DiSMEC [2], Parabel [29], Bonsai [20], AnnexML [33], etc., which utilize pre-computed features. DECAF was also found to be 2-6% more accurate than leading deep learning-based approaches such as Astec [8], AttentionXML [38], and MACH [23] that jointly learn feature representations and classifiers. Furthermore, DECAF can be up to 22× faster at prediction than deep learning methods such as MACH and AttentionXML.

Contributions: This paper presents DECAF, a scalable deep learning architecture for XML applications that effectively utilizes label metadata. Specific contributions are made in designing a shortlister with a large fanout and a two-stage training strategy. DECAF also introduces a novel initialization strategy for classifiers that leads to accuracy gains, most prominently on data-scarce tail labels. DECAF scales to XML tasks with millions of labels and makes predictions that are significantly more accurate than those of state-of-the-art XML methods. Even on datasets with more than a million labels, DECAF can make predictions in a fraction of a millisecond.

… representations, since it is expensive to repeatedly update these search structures as deep-learned representations keep getting updated across learning epochs. Thus, these approaches are unable to utilize deep-learned features, which leads to inaccurate solutions. DECAF avoids these issues by its use of the shortlister, which offers high-recall filtering of labels, allowing training and prediction costs that are logarithmic in the number of labels.

Tree classifiers: Tree-based classifiers typically partition the label space to achieve logarithmic prediction complexity. In particular, MLRF [1], FastXML [30], and PfastreXML [15] learn an ensemble of trees where each node in a tree is partitioned by optimizing an objective based on the Gini index or nDCG. CRAFTML [32] deploys random partitioning of features and labels to learn an ensemble of trees. However, such algorithms can be expensive in terms of training time and model size.

Deep feature representations: Recent works such as MACH [23], X-Transformer [7], XML-CNN [22], and AttentionXML [38] have graduated from using fixed or pre-learned features to task-specific feature representations that can be significantly more accurate. However, CNN- and attention-based mechanisms were found to be inaccurate on short-text applications (as shown in [8]), where scant information (3-10 tokens) is available per data point. Furthermore, approaches like X-Transformer and AttentionXML that learn label-specific document representations do not scale well.

Using label metadata: Techniques that use label metadata, e.g. label text, include SwiftXML [28], which uses a pre-trained Word2Vec [25] model to compute label representations. However, SwiftXML is designed for warm-start settings where a subset of ground-truth labels for each test point is already available; this is a non-standard scenario that is beyond the scope of this paper. [11] demonstrated, using the GlaS regularizer, that modeling label correlations could lead to gains on tail labels. Siamese networks [34] are a popular …
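The prediction-cost argument for the shortlister can be sketched concretely. The toy code below is a minimal illustration under stated assumptions, not DECAF's actual architecture: labels are grouped into clusters, a shortlister scores only the C cluster representatives (C << L), and the 1-vs-All classifiers of labels in the top-scoring clusters are the only ones evaluated, so prediction costs roughly O(D log L) instead of O(DL) when clusters are balanced. All names, shapes, and the centroid-based shortlister are hypothetical.

```python
# Hypothetical sketch of shortlisted 1-vs-All prediction.
# Stage 1 scores C label clusters; stage 2 evaluates only the
# classifiers of labels inside the top `beam` clusters.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def predict_with_shortlister(x, centroids, cluster_to_labels, classifiers, beam):
    # Stage 1: score every cluster representative, cost O(C * D).
    cluster_scores = [dot(c, x) for c in centroids]
    top = sorted(range(len(centroids)), key=lambda i: -cluster_scores[i])[:beam]
    # Stage 2: evaluate only the shortlisted labels' classifiers,
    # cost O(|shortlist| * D) instead of O(L * D).
    shortlist = [l for c in top for l in cluster_to_labels[c]]
    scored = [(dot(classifiers[l], x), l) for l in shortlist]
    scored.sort(reverse=True)
    return [l for _, l in scored], [s for s, _ in scored]

# Toy setup: L = 8 labels in C = 4 clusters of 2 labels each, D = 2.
classifiers = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9],
               [-1, 0], [-0.9, -0.1], [0, -1], [-0.1, -0.9]]
cluster_to_labels = [[0, 1], [2, 3], [4, 5], [6, 7]]
centroids = [[(classifiers[a][d] + classifiers[b][d]) / 2 for d in range(2)]
             for a, b in cluster_to_labels]

labels, scores = predict_with_shortlister([1.0, 0.5], centroids,
                                          cluster_to_labels, classifiers, beam=2)
print(labels)  # → [0, 1, 3, 2]  (only 4 of the 8 classifiers were evaluated)
```

With beam = 2 and balanced clusters of size L/C, only 2L/C of the L classifiers are touched; a tree-structured shortlister deepens this to the O(log L) behaviour claimed above.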
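The training-cost reduction from O(NDL) to O(ND log L) rests on the same shortlister: for each training point, only its positive labels plus the O(log L) shortlisted labels it is most likely to be confused with are used as training pairs. A hypothetical helper (the name and the assumption that the shortlist is ordered by decreasing shortlister score are illustrative, not from the paper):

```python
def select_hard_negatives(shortlist, positives, k):
    """Keep up to k shortlisted labels that are not true positives.
    The shortlist is assumed to be ordered by decreasing shortlister
    score, so the labels retained are the most 'confusing' negatives."""
    pos = set(positives)
    return [l for l in shortlist if l not in pos][:k]

# A point whose true labels are {0, 3}: of the shortlisted labels
# [0, 1, 3, 2], only the confusing negatives are kept for training.
print(select_hard_negatives([0, 1, 3, 2], [0, 3], k=2))  # → [1, 2]
```

Each point thus contributes O(log L) negative pairs instead of L - |positives|, which is where the logarithmic factor in the stated training cost comes from.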