Extreme Multi-label Learning for Semantic Matching in Product Search

Wei-Cheng Chang, Daniel Jiang, Hsiang-Fu Yu, Choon-Hui Teo, Jiong Zhang, Kai Zhong, Kedarnath Kolluri, Qie Hu, Nikhil Shandilya, Vyacheslav Ievgrafov, Japinder Singh, Inderjit S. Dhillon
Amazon, USA

ABSTRACT
We consider the problem of semantic matching in product search: given a customer query, retrieve all semantically related products from a huge catalog of size 100 million or more. Because of large catalog spaces and real-time latency constraints, semantic matching algorithms not only desire high recall but also need to have low latency. Conventional lexical matching approaches (e.g., Okapi-BM25) exploit inverted indices to achieve fast inference time, but fail to capture behavioral signals between queries and products. In contrast, embedding-based models learn semantic representations from customer behavior data, but their performance is often limited by shallow neural encoders due to latency constraints. Semantic product search can be viewed as an eXtreme Multi-label Classification (XMC) problem, where customer queries are input instances and products are output labels. In this paper, we aim to improve semantic product search by using tree-based XMC models whose inference time complexity is logarithmic in the number of products. We consider hierarchical linear models with n-gram features for fast real-time inference. Quantitatively, our method maintains a low latency of 1.25 milliseconds per query and achieves a 65% improvement of Recall@100 (60.9% vs. 36.8%) over a competing embedding-based DSSM model. Our model is robust to weight pruning with varying thresholds, which can flexibly meet different system requirements for online deployments. Qualitatively, our method can retrieve products that are complementary to the existing product search system and add diversity to the match set.

CCS CONCEPTS
• Computing methodologies → Machine learning; Natural language processing; • Information systems → Information retrieval.

KEYWORDS
Product Search; Semantic Matching; Extreme Multi-label Learning

ACM Reference Format:
Wei-Cheng Chang, Daniel Jiang, Hsiang-Fu Yu, Choon-Hui Teo, Jiong Zhang, Kai Zhong, Kedarnath Kolluri, Qie Hu, Nikhil Shandilya, Vyacheslav Ievgrafov, Japinder Singh, and Inderjit S. Dhillon. 2021. Extreme Multi-label Learning for Semantic Matching in Product Search. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '21), August 14–18, 2021, Virtual Event, Singapore. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3447548.3467092

1 INTRODUCTION
In general, product search consists of two building blocks: the matching stage, followed by the ranking stage. When a customer issues a query, the query is passed to several matching algorithms to retrieve relevant products, resulting in a match set. The match set passes through stages of ranking, where top results from the previous stage are re-ranked before the most relevant items are displayed to customers. Figure 1 outlines a product search system.

Figure 1: System architecture for augmenting the match set with extreme multi-label learning models.

The matching (i.e., retrieval) phase is critical. Ideally, the match set should have a high recall [29], containing as many relevant and diverse products that match the customer intent as possible; otherwise, many relevant products won't be considered in the final ranking phase. On the other hand, the matching algorithm should be highly efficient to meet real-time inference constraints: it must retrieve a small subset of relevant products in time sublinear to the enormous catalog space, whose size can be 100 million or more.

Matching algorithms can be put into two categories. The first type is lexical matching approaches, which score a query-product pair by a weighted sum of overlapping keywords between the pair. One representative example is Okapi-BM25 [36, 37], which has been state-of-the-art for decades and is still widely used in many retrieval tasks such as open-domain question answering [6, 27] and passage/document retrieval [5, 14]. While the inference of Okapi-BM25 can be done efficiently using an inverted index [53], these index-based lexical matching approaches cannot capture semantic and customer behavior signals (e.g., clicks, impressions, or purchases) tailored to the product search system and are fragile to morphological variants or spelling errors.

The second option is embedding-based neural models, which learn semantic representations of queries and products based on customer behavior signals. The similarity is measured by inner products or cosine distances between the query and product embeddings. To infer the match set of a novel query in real time, embedding-based neural models need to first vectorize the query tokens into a query embedding, and then find its nearest neighbor products in the embedding space. Finding nearest neighbors can be done efficiently via approximate nearest neighbor search (ANN) methods (e.g., HNSW [28] or ScaNN [15]). Vectorizing the query tokens into an embedding, nevertheless, is often the inference bottleneck, which depends on the complexity of the neural network architecture.

Using BERT-based encoders [11] for query embeddings is less practical because of the large latency in vectorizing queries. Specifically, consider retrieving the match set of a query from an output space of 100 million products on a CPU machine. In Table 1, we compare the latency of a BERT-based encoder (a 12-layer deep Transformer) with a shallow DSSM [32] encoder (a 2-layer MLP). The inference latency of the BERT-based encoder is 50 milliseconds per query (ms/q) in total, where vectorization and ANN search need 88% and 12% of the time, respectively. On the other hand, the latency of the DSSM encoder is 3.13 ms/q in total, where vectorization only takes 33% of the time. Because of real-time inference constraints, industrial product search engines do not have the luxury to leverage state-of-the-art deep Transformer encoders in embedding-based neural methods, whereas shallow MLP encoders result in limited performance.

Table 1: Comparing the inference latency of a BERT-based encoder with the DSSM encoder on a CPU machine, with single-batch, single-thread settings. The latency of the BERT-based encoder is taken from HuggingFace Transformers [43, 44]. BERT's ANN time is estimated by linear extrapolation on the dimensionality d, where BERT has d = 768 and DSSM has d = 256.

Vectorizer       | Vectorizer latency (ms/q) | ANN       | ANN latency (ms/q)
DSSM [32]        | 1.00                      | HNSW [28] | 2.13
BERT-based [11]  | 44.00                     | HNSW [28] | ≈ 6.39

In this paper, we take another route to tackle semantic matching by formulating it as an eXtreme Multi-label Classification (XMC) problem, where the goal is to tag an input instance (i.e., query) with its most relevant output labels (i.e., products). XMC approaches have received great attention recently in both the academic and industrial communities. For instance, the deep learning XMC model X-Transformer [7, 52] achieves state-of-the-art performance on public academic benchmarks [3]. Partition-based methods such as Parabel [33] and XReg [34], as another example, find successful applications to dynamic search advertising in Bing. In particular, tree-based partitioning XMC models are a staple of modern search engines and recommender systems due to their inference time being sub-linear (i.e., logarithmic) in the enormous output space (e.g., 100 million or more).

In this paper, we aim to answer the following research question: Can we develop an effective tree-based XMC approach for semantic matching in product search? Our goal is not to replace lexical-based matching methods (e.g., Okapi-BM25). Instead, we aim to complement/augment the match set with more diverse candidates from the proposed tree-based XMC approach, so that the final-stage ranker can produce a good and diverse set of products in response to search queries.

We make the following contributions in this paper.
• We apply the leading tree-based XR-Linear (PECOS) model [52] to semantic matching. To our knowledge, we are the first to successfully apply tree-based XMC methods to an industrial-scale product search system.
• We explore various n-gram TF-IDF features for XR-Linear, as vectorizing queries into sparse TF-IDF features is highly efficient for real-time inference.
• We study weight pruning of the XR-Linear model and demonstrate its robustness to different disk-space requirements.
• We present the trade-off between recall rates and real-time latency. Specifically, for beam size b = 15, XR-Linear achieves Recall@100 of 60.9% with a latency of 1.25 ms/q; for beam size b = 50, XR-Linear achieves Recall@100 of 65.7% with a latency of 3.48 ms/q. In contrast, DSSM only has Recall@100 of 36.2% with a latency of 3.13 ms/q.

The implementation of XR-Linear (PECOS) is publicly available at https://github.com/amzn/pecos.

2 RELATED WORK

2.1 Semantic Product Search Systems
Two-tower models (a.k.a. dual encoders, Siamese networks) are arguably one of the most popular embedding-based neural models used in passage or document retrieval [31, 46], dialogue systems [18, 30], open-domain question answering [16, 24, 27], and recommender systems [9, 47, 50]. It is not surprising that two-tower models with ResNet [17] encoders and deep Transformer encoders [41] achieve state-of-the-art results on most computer vision [26] and textual-domain retrieval benchmarks [10], respectively. The more complex the encoder architecture is, nevertheless, the less applicable the deep two-tower models are for industrial product search systems: very few of them can meet the low real-time latency constraint (e.g., 5 ms/q) due to the vectorization bottleneck of query tokens.

One of the exceptions is the Deep Semantic Search Model (DSSM) [32], a variant of two-tower retrieval models with shallow multi-layer perceptron (MLP) layers. DSSM is tailored for industrial-scale semantic product search, and was A/B tested online on the Amazon Search Engine [32]. Other examples are the two-tower recommender models for Youtube [9] and Google Play [47], which also embrace shallow MLP encoders for fast vectorization of novel queries to meet online latency constraints. Finally, [38] leverages distributed GPU training and KNN softmax, scaling two-tower models to the Alibaba Retail Product Dataset with 100 million products.

2.2 eXtreme Multi-label Classification (XMC)
To overcome computational issues, most existing XMC algorithms with textual inputs use sparse TF-IDF features and leverage different partitioning techniques on the label space to reduce complexity.

Sparse linear models. Linear one-versus-rest (OVR) methods such as DiSMEC [1], ProXML [2], and PPD-Sparse [48, 49] explore parallelism to speed up the algorithm and reduce the model size by truncating model weights to encourage sparsity. Linear OVR classifiers are also building blocks for many other XMC approaches, such as Parabel [33], SLICE [20], and XR-Linear (PECOS) [52], to name just a few. However, naive OVR methods are not applicable to semantic product search because their inference time complexity is still linear in the output space.

Partition-based methods. The efficiency and scalability of sparse linear models can be further improved by incorporating different partitioning techniques on the label spaces. For instance, Parabel [33] partitions the labels through a balanced 2-means label tree using label features constructed from the instances. Other approaches attempt to improve on Parabel, for instance, eXtremeText [45], Bonsai [25], NAPKINXC [21], XReg [34], and PECOS [52]. In particular, the PECOS framework [52] allows different indexing and matching methods for XMC problems, resulting in two realizations, XR-Linear and X-Transformer. Also note that tree-based methods with neural encoders, such as AttentionXML [51], X-Transformer (PECOS) [7, 52], and LightXML [22], mostly achieve state-of-the-art results on XMC benchmarks, at the cost of longer training time and more expensive inference. To sum up, tree-based sparse linear methods are more suitable for industrial-scale semantic matching problems due to their sub-linear inference time and fast vectorization of the query tokens.

Graph-based methods. SLICE [20] and AnnexML [39] build an approximate nearest neighbor (ANN) graph to index the large output space. For a given instance, the relevant labels can be found quickly via ANN search. SLICE then trains linear OVR classifiers with negative samples induced from the ANN graph. While the inference time complexity of advanced ANN algorithms (e.g., HNSW [28] or ScaNN [15]) is sub-linear, ANN methods typically work better on low-dimensional dense embeddings. Therefore, the inference latency of SLICE and AnnexML still hinges on the vectorization time of pre-trained embeddings.

3 XR-LINEAR (PECOS): A TREE-BASED XMC METHOD FOR SEMANTIC MATCHING
Overview. Tree-based XMC models for semantic matching can be characterized as follows: given a vectorized query x ∈ R^d and a set of labels (i.e., products) Y = {1, ..., ℓ, ..., L}, produce a tree-based model that retrieves the top k most relevant products in Y. The model parameters are estimated from the training dataset {(x_i, y_i) : i = 1, ..., n}, where y_i ∈ {0, 1}^L denotes the relevant labels of query i from the output label space Y.

Figure 2: Illustration of inference of XR-Linear (PECOS) [52] using beam search with beam width b = 2 to retrieve 4 relevant products for the given input query x.

There are many tree-based linear methods [21, 25, 33, 34, 45, 52], and we consider the XR-Linear model within PECOS [52] in this paper, due to its flexibility and scalability to large output spaces. The XR-Linear model partitions the enormous label space Y with hierarchical clustering to form a hierarchical label tree. The j-th label cluster at depth t is denoted by Y_j^(t). The leaves of the tree are the individual labels (i.e., products) of Y. See Figure 2 for an illustration of the tree structure and inference with beam search.

Each layer of the XR-Linear model has a linear OVR classifier that scores the relevance of a cluster Y_j^(t) to a query x. Specifically, the unconditional relevance score of a cluster Y_j^(t) is

    f(x, Y_j^(t)) = σ(x⊤ w_j^(t)),    (1)

where σ is an activation function and w_j^(t) ∈ R^d are the model parameters of the j-th node at the t-th layer.

Practically, the weight vectors w_j^(t) for each layer t are stored in a d × K_t weight matrix

    W^(t) = [ w_1^(t)  w_2^(t)  ...  w_{K_t}^(t) ],    (2)

where K_t denotes the number of clusters at layer t, such that K_D = L and K_0 = 1. In addition, the tree topology at layer t is represented by a cluster indicator matrix C^(t) ∈ {0, 1}^{K_t × K_{t-1}}. Next, we discuss how to construct the cluster indicator matrix C^(t) and learn the model weight matrix W^(t).
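To make the notation above concrete, the following minimal sketch (ours, not part of the paper; all sizes are toy values and the weights are random) builds the per-layer weight matrices W^(t) of Eq. (2) and the cluster indicator matrices C^(t) for a complete B-ary tree with L = 8 labels, B = 2, and depth D = 3, and scores one query against the clusters of a layer as in Eq. (1).

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
d, B, D, L = 16, 2, 3, 8                       # toy feature dim, branching factor, depth, #labels
K = [1] + [B**t for t in range(1, D)] + [L]    # K_0 = 1, K_1 = 2, K_2 = 4, K_3 = L = 8

# W[t]: d x K_t weight matrix of the OVR classifiers at layer t (Eq. 2), random here
W = {t: rng.normal(size=(d, K[t])) for t in range(1, D + 1)}

# C[t]: K_t x K_{t-1} cluster indicator matrix; child j belongs to parent j // B
C = {t: csr_matrix((np.ones(K[t]), (np.arange(K[t]), np.arange(K[t]) // B)),
                   shape=(K[t], K[t - 1])) for t in range(1, D + 1)}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=d)        # a vectorized query
scores = sigmoid(x @ W[2])    # Eq. (1): relevance of every cluster at layer t = 2
print(scores.shape)           # (4,)
print(C[3].toarray())         # leaf-to-parent assignment at the bottom layer
```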
3.1 Hierarchical Label Tree
Tree Construction. Given semantic label representations {z_ℓ : ℓ ∈ Y}, the XR-Linear model constructs a hierarchical label tree via B-ary partitioning (i.e., clustering) in a top-down fashion, where the numbers of clusters at layers t = 1, ..., D are

    K_D = L,  K_{D-1} = B^{D-1},  ...,  K_1 = B^1.    (3)

Node k of layer t-1 corresponds to a cluster Y_k^(t-1) containing the labels assigned to it. For example, Y_1^(0) is the root node cluster including all the labels {ℓ : 1, ..., L}. To proceed from layer t-1 to layer t, we perform Spherical B-Means [12] clustering to partition Y_k^(t-1) into B clusters that form its B child nodes {Y_{B(k-1)+j}^(t) : j = 1, ..., B}. Applying this to all parent nodes k = 1, ..., B^{t-1} at layer t-1, the cluster indicator matrix C^(t) ∈ {0, 1}^{K_t × K_{t-1}} at layer t is given by

    C_{j,k}^(t) = 1 if Y_j^(t) is a child of Y_k^(t-1), and 0 otherwise.    (4)

An illustration with B = 2 is given in Figure 3.

Figure 3: An illustration of hierarchical label clustering [52] with recursive B-ary partitions with B = 2.

Label Representations. For the semantic matching problem, we present two ways to construct label embeddings, depending on the quality of the product information (i.e., titles and descriptions). If product information such as titles and descriptions is noisy or missing, each label is represented by aggregating the query feature vectors of its positive instances (namely, PIFA):

    z_ℓ^PIFA = v_ℓ / ‖v_ℓ‖,  where  v_ℓ = Σ_{i=1}^{n} Y_{iℓ} x_i,  ℓ ∈ Y.    (5)

Otherwise, labels with rich textual information can be represented by featurization (e.g., n-gram TF-IDF or pre-trained neural embeddings) of their titles and descriptions.
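As an illustration of Eq. (5) and of one level of the B-ary partitioning, here is a small sketch under our own simplifications (it is not the PECOS implementation): spherical B-means is approximated by running ordinary k-means on ℓ2-normalized label embeddings, and the toy matrices stand in for the real query features and query-product signals.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def pifa_label_embeddings(X, Y):
    """Eq. (5): z_l = v_l / ||v_l|| with v_l = sum_i Y_il * x_i, i.e. normalized rows of Y^T X."""
    V = Y.T.dot(X)                       # L x d aggregated positive-instance features
    return normalize(V, norm="l2", axis=1)

def partition_labels(Z, B=2, seed=0):
    """One level of B-ary partitioning: cluster label embeddings into B child nodes.
    k-means on l2-normalized rows is used here as a stand-in for spherical B-means [12]."""
    assign = KMeans(n_clusters=B, n_init=10, random_state=seed).fit_predict(Z)
    L = Z.shape[0]
    # cluster indicator matrix C in {0,1}^{L x B} for this level (cf. Eq. 4)
    return csr_matrix((np.ones(L), (np.arange(L), assign)), shape=(L, B))

# toy data: n = 6 queries with d = 5 TF-IDF features, L = 4 labels
X = csr_matrix(np.abs(np.random.default_rng(0).normal(size=(6, 5))))
Y = csr_matrix(np.array([[1,0,0,0],[1,0,0,0],[0,1,0,0],[0,0,1,1],[0,0,1,0],[0,0,0,1]]))
Z = pifa_label_embeddings(X, Y)
print(partition_labels(Z, B=2).toarray())   # which of the 2 child clusters each label joins
```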

3.2 Sparse Linear Matcher
Given the query feature matrix X ∈ R^{n×d}, the label matrix Y ∈ {0, 1}^{n×L}, and the cluster indicator matrices {C^(t) ∈ {0, 1}^{K_t × K_{t-1}} : K_0 = 1, K_D = L, t = 1, ..., D}, we are ready to learn the model parameters W^(t) for layers t = 1, ..., D in a top-down fashion. For example, the bottom layer t = D corresponds to the original XMC problem on the given training data {X, Y_D = Y}. For layer t = D-1, we construct the XMC problem with the induced dataset {X, Y_{D-1} = binarize(Y_D C^(D))}. Similar recursion is applied to all other intermediate layers.

Learning OVR Classifiers. At layer t with the corresponding XMC dataset {X, Y_t}, the XR-Linear model considers a point-wise loss to learn the parameter matrix W^(t) ∈ R^{d×K_t} independently for each column ℓ, 1 ≤ ℓ ≤ K_t. Column ℓ of W^(t) is found by solving a binary classification sub-problem:

    w_ℓ^(t) = argmin_w  Σ_{i : M^(t)_{i,c_ℓ} ≠ 0} L(Y^t_{iℓ}, w⊤ x_i) + (λ/2) ‖w‖²,   ℓ = 1, ..., K_t,    (6)

where λ is the regularization coefficient, L(·, ·) is the point-wise loss function (e.g., the squared hinge loss), M^(t) ∈ {0, 1}^{n×K_{t-1}} is a negative sampling matrix at layer t, and c_ℓ ∈ {1, ..., K_{t-1}} is the cluster index of the ℓ-th label (cluster) at the t-th layer. Crucially, the negative sampling matrix M^(t) not only finds hard negative instances (queries) for stronger learning signals, but also significantly reduces the size of the active instance set for faster convergence.

In practice, we solve the binary classification sub-problem (6) via the state-of-the-art efficient solver LIBLINEAR [13]. Furthermore, the K_t independent sub-problems at layer t can be computed in an embarrassingly parallel manner to fully utilize the multi-core CPU design of modern hardware.
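A minimal sketch of how one column of W^(t) could be fit as in Eq. (6), assuming the active instance set for label ℓ has already been selected via the negative sampling matrix M^(t). scikit-learn's LinearSVC is used here as a stand-in for a direct LIBLINEAR call (it wraps the same solver family with an L2-regularized squared-hinge objective); the data are toy values.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.svm import LinearSVC

def fit_column(X_active, y_active, lam=1.0):
    """Solve one binary sub-problem of Eq. (6) on the active (sampled) instances.
    LinearSVC minimizes 0.5*||w||^2 + C * sum(squared hinge), so C plays the role of 1/lambda."""
    clf = LinearSVC(loss="squared_hinge", penalty="l2", C=1.0 / lam, fit_intercept=False)
    clf.fit(X_active, y_active)
    return clf.coef_.ravel()              # w_l^(t), a d-dimensional weight vector

# toy active set for one column: 8 queries, d = 10 TF-IDF features, binary relevance
rng = np.random.default_rng(1)
X_active = csr_matrix(np.abs(rng.normal(size=(8, 10))))
y_active = np.array([1, 1, 0, 1, 0, 0, 1, 0])
w = fit_column(X_active, y_active, lam=0.5)
print(w.shape)                            # (10,)
```

The K_t columns are independent, so in practice each call like this runs in parallel across CPU cores, one per label or cluster at the layer.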
3.3 Inference
Weight Pruning. The XR-Linear model is composed of D OVR linear classifiers f^(t)(x, ℓ) parametrized by matrices W^(t) ∈ R^{d×K_t}, 1 ≤ t ≤ D. Naively storing the dense parameter matrices is not feasible. To overcome a prohibitive model size, we apply entry-wise weight pruning to sparsify W^(t) [1, 33]. Specifically, after solving for w_ℓ^(t) in each binary classification sub-problem, we perform a hard thresholding operation that truncates parameters with magnitude smaller than a pre-specified value ε ≥ 0 to zero:

    W_{ij}^(t) = W_{ij}^(t) if |W_{ij}^(t)| > ε, and 0 otherwise.    (7)

We set the pruning threshold ε such that the parameter matrices can be stored in main memory for real-time inference.

Beam Search. The relevance score of a query-cluster pair (x, Ỹ) is defined to be the aggregation over all ancestor nodes in the tree:

    f(x, Ỹ) = Π_{w ∈ A(Ỹ)} σ(w · x),    (8)

where A(Ỹ) = {w_i^(t) | Ỹ ⊆ Y_i^(t), t ≠ 0} denotes the weight vectors of all ancestors of Ỹ in the tree, including Ỹ itself and disregarding the root. Naturally, this definition extends all the way to the individual labels ℓ ∈ Y at the bottom of the tree.

In practice, exact inference is typically intractable, as it requires searching the entire label tree. To circumvent this issue, XR-Linear uses a greedy beam search of the tree as an approximation. For a query x, this approach discards any clusters at a given level that do not fall into the top b most relevant clusters, where b is the beam search width. The inference time complexity of XR-Linear with beam search [52] is

    O( b × Σ_{t=1}^{D} (K_t / K_{t-1}) × T_f ) = O( D × b × max{B, L/B^{D-1}} × T_f ),    (9)

where T_f is the time to compute a relevance score f^(t)(x, ℓ). We see that if the tree depth is D = O(log_B L) and the maximum leaf size L/B^{D-1} is a small constant such as 100, the overall inference time complexity of XR-Linear is

    O( b × T_f × log L ),    (10)

which is logarithmic in the size of the original output space.
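The sketch below (our own simplified, dense-matrix version; the production model stores pruned sparse matrices) applies the hard thresholding of Eq. (7) and then runs the greedy beam search of Eq. (8) down a complete B-ary tree whose layer-t classifier weights are given as d × K_t matrices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prune(W, eps):
    """Eq. (7): zero out entries with |W_ij| <= eps (kept as sparse matrices in practice)."""
    W = W.copy()
    W[np.abs(W) <= eps] = 0.0
    return W

def beam_search(x, W_layers, B, beam_size, topk):
    """Greedy beam search of Eq. (8). W_layers[t-1] holds the d x K_t weights of layer t;
    the children of cluster j at layer t-1 are clusters j*B, ..., j*B + B - 1 at layer t."""
    beam = {0: 1.0}                                    # root cluster with path score 1
    for t, W in enumerate(W_layers, start=1):          # layers t = 1, ..., D
        cand = {}
        for parent, score in beam.items():
            for child in range(parent * B, parent * B + B):
                cand[child] = score * sigmoid(x @ W[:, child])   # aggregate ancestor scores
        keep = beam_size if t < len(W_layers) else topk
        beam = dict(sorted(cand.items(), key=lambda kv: -kv[1])[:keep])
    return sorted(beam.items(), key=lambda kv: -kv[1])           # (leaf id, score) pairs

# toy model: d = 16 features, B = 4, depth D = 3, so L = B^D = 64 products
rng = np.random.default_rng(0)
d, B, D = 16, 4, 3
W_layers = [prune(rng.normal(size=(d, B ** t)), eps=0.1) for t in range(1, D + 1)]
x = rng.normal(size=d)
print(beam_search(x, W_layers, B=B, beam_size=2, topk=4))
```

Note how the per-layer work is bounded by b times the branching factor (or the leaf size at the bottom), which is exactly the count behind Eq. (9).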
3.4 Tokenization and Featurization
To pre-process the input query or product title, we apply simple normalization (e.g., lower-casing and removing basic punctuation) and use white space to tokenize the string into a sequence of smaller components such as words, phrases, or characters. We combine word unigrams, word bigrams, and character trigrams into a bag-of-n-grams feature, and construct sparse high-dimensional TF-IDF features as the vectorization of the query.

Word n-grams. The basic form of tokenization is a word unigram. For example, the word unigrams of "artistic iphone 6s case" are ["artistic", "iphone", "6s", "case"]. However, word unigrams lose word ordering information, which leads us to use higher-order n-grams, such as bigrams. For example, the word bigrams of the same query become ["artistic#iphone", "iphone#6s", "6s#case"]. These n-grams capture phrase-level information, which helps the model better infer the customer intent for search.

Character Trigrams. The string is broken into a list of all three-character sequences, and we only consider the trigrams within a word boundary. For example, the character trigrams of the query "artistic iphone 6s case" are ["#ar", "art", "rti", "tis", "ist", "sti", "tic", "ic#", "#ip", "iph", "pho", "hon", "one", "ne#", "#6s", "6s#", "#ca", "cas", "ase", "se#"]. Character trigrams are robust to typos ("iphone" and "iphonr") and can handle compound words ("amazontv" and "firetvstick") more naturally. Another advantage for product search is the ability to capture similarity of model parts and sizes.
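A minimal sketch of this bag-of-n-grams tokenization (ours; the production featurizer additionally applies TF-IDF weighting, vocabulary truncation, and ℓ2 normalization as described in Section 4.1):

```python
def word_unigrams(query):
    return query.lower().split()

def word_bigrams(tokens):
    return [f"{a}#{b}" for a, b in zip(tokens, tokens[1:])]

def char_trigrams(tokens):
    """Character trigrams within each word boundary, padded with '#'."""
    grams = []
    for tok in tokens:
        padded = f"#{tok}#"
        grams += [padded[i:i + 3] for i in range(len(padded) - 2)]
    return grams

query = "artistic iphone 6s case"
uni = word_unigrams(query)
print(uni)                 # ['artistic', 'iphone', '6s', 'case']
print(word_bigrams(uni))   # ['artistic#iphone', 'iphone#6s', '6s#case']
print(char_trigrams(uni))  # ['#ar', 'art', ..., 'se#']
```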
4 EXPERIMENTAL RESULTS

4.1 Data Preprocessing
We sample over 1 billion positive query-product pairs as the dataset, which covers over 240 million queries and 100 million products. A positive pair means the aggregated counts of clicks or purchases are above a threshold. While each query-product pair may have a real-valued weight representing the aggregated counts, we do not currently leverage this information and simply treat the training signals as binary feedback. We split the train/test sets by time horizon, where we use 12 months of search logs as the training set and the trailing 1 month as the offline evaluation test set. Approximately 12% of the products in the test set are unseen in the training set; these are also known as cold-start products.

For the XR-Linear model and Okapi-BM25, we consider white space tokenization for both the query text and the product title to vectorize the query and product features with n-gram TF-IDF. The training signals are extremely sparse, and the resulting query feature matrix has an average of 27.3 non-zero elements per query, while the product feature matrix has an average of 142.1 non-zero elements per product, with dimensionality d = 4.2M. For the DSSM model, both the query feature matrix and the product feature matrix are dense low-dimensional (i.e., 256) embeddings, which are more suitable for HNSW inference.

Featurization. For the proposed XR-Linear method, we consider n-gram sparse TF-IDF features for the input queries and conduct ℓ2 normalization of each query vector. The vocabulary size is 4.2 million, which is determined by the most frequent tokens of word- and character-level n-grams. In particular, there are 1 million word unigrams, 3 million word bigrams, and 200 thousand character trigrams. Note that we consider characters within each word boundary when building the character-level n-gram vocabulary. The motivation for using character trigrams is to better capture specific model types, certain brand names, and typos.

For out-of-vocabulary (OOV) tokens, we map those unseen tokens into a shared unique token id. It is possible to construct larger bins with hashing tricks [23, 32, 42] that potentially handle the OOV tokens better. We leave the hashing tricks as future work.

4.2 Experimental Setup
Evaluation Protocol. We sample 100,000 queries as the test set, which comprises 182,000 query-product pairs with purchase counts of at least one. Given a query in the retrieval stage, semantic matching algorithms need to find the top 100 relevant products from the catalog space consisting of 100 million candidate products. We measure the performance with recall metrics, which are widely used in retrieval [6, 46] and XMC tasks [20, 33, 35]. Specifically, for a predicted score vector ŷ ∈ R^L and a ground truth label vector y ∈ {0, 1}^L, Recall@k (k = 10, 50, 100) is defined as

    Recall@k = (1 / |y|) Σ_{ℓ ∈ top_k(ŷ)} y_ℓ,

where top_k(ŷ) denotes the labels with the k largest predicted values.

Comparing Methods and Hyper-parameters. We compare XR-Linear with two representative retrieval approaches and describe the hyper-parameter settings.
• XR-Linear: the proposed tree-based XMC model, where the implementation is from PECOS [52]¹ (a usage sketch of the released library is given at the end of this subsection). We use PIFA to construct semantic product embeddings and build the label tree via hierarchical K-means, where the branching factor is B = 32, the tree depth is D = 5, and the maximum number of labels in each cluster is L/B^{D-1} = 100. The negative sampling strategy is Teacher Forcing Negatives (TFN) [52], the transformation predictive function is ℓ3-hinge, and the default beam size is b = 10. Unless otherwise specified, we follow the default hyper-parameter settings of PECOS.
• DSSM [32]: the leading neural embedding-based retrieval model in semantic product search applications. We take a saved checkpoint of the DSSM model from [32] and generate query and product embeddings on our dataset for evaluation. For fast real-time inference, we use the state-of-the-art approximate nearest neighbor (ANN) search algorithm HNSW [28] to retrieve relevant products efficiently. Specifically, we use the implementation of NMSLIB [4]². Regarding the hyper-parameters of HNSW, we set M0 = 32, efC = 300, and vary the beam search width efS = {10, 50, 100, 250, 500} to examine the trade-off between inference latency and retrieval quality.
• Okapi-BM25 [36]: the lexical matching baseline widely used in retrieval tasks. We use the implementation of [40]³, where the hyper-parameters are set to k1 = 0.5 and b = 0.45.

¹ https://github.com/amzn/pecos
² https://github.com/nmslib/nmslib
³ https://github.com/dorianbrown/rank_bm25

Computing Environments. We run XR-Linear and Okapi-BM25 on an x1.32xlarge AWS instance (128 CPUs and 1952 GB RAM). For the embedding-based model DSSM, the query and product embeddings are generated from a pre-trained DSSM model [32] on a p3.16xlarge AWS instance (8 Nvidia V100 GPUs and 128 GB RAM). The inference of DSSM leverages the ANN algorithm HNSW, which is conducted on the same x1.32xlarge AWS instance for a fair comparison of inference latency.
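For reference, the sketch below shows a typical way to drive the released PECOS library end-to-end, following the pattern in its public README at the time of writing. The exact module paths, keyword names (e.g., beam_size, only_topk), and defaults are assumptions to be checked against the installed version, and the random matrices stand in for the real n-gram TF-IDF features and purchase signals; the paper's settings (B = 32, max leaf size 100, TFN negatives) would be passed through the corresponding indexer and training options rather than the defaults used here.

```python
import numpy as np
import scipy.sparse as smat
from pecos.xmc import Indexer, LabelEmbeddingFactory
from pecos.xmc.xlinear.model import XLinearModel

# Toy stand-ins for the n-gram TF-IDF query matrix X (n x d) and the
# query-product relevance matrix Y (n x L); PECOS expects float32 CSR inputs.
rng = np.random.default_rng(0)
X = smat.csr_matrix(rng.random((1000, 500)), dtype=np.float32)
Y = smat.csr_matrix(rng.random((1000, 200)) > 0.98, dtype=np.float32)

# Section 3.1: PIFA label embeddings and a hierarchical k-means label tree
# (branching factor and leaf size are left at library defaults in this sketch).
label_feat = LabelEmbeddingFactory.create(Y, X, method="pifa")
cluster_chain = Indexer.gen(label_feat)

# Section 3.2: train the per-layer sparse linear matchers.
model = XLinearModel.train(X, Y, C=cluster_chain)

# Section 3.3: beam-search inference, keeping the top 100 products per query.
Y_pred = model.predict(X, beam_size=10, only_topk=100)

# Recall@100 bookkeeping (computed against the training labels here, for illustration only).
hits = np.asarray(Y.multiply(Y_pred > 0).sum(axis=1)).ravel()
denom = np.maximum(np.asarray(Y.sum(axis=1)).ravel(), 1.0)
print(f"Recall@100 = {np.mean(hits / denom):.3f}")
```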
4.3 Main Results
A comparison of XR-Linear (PECOS) with two representative retrieval approaches (i.e., the neural embedding-based model DSSM and the lexical matching method Okapi-BM25) is presented in Table 2. For XR-Linear, we study different n-gram TF-IDF feature realizations. Specifically, we consider four configurations: 1) word-level unigrams with 3 million features; 2) word-level unigrams and bigrams with 3 million features; 3) word-level unigrams and bigrams with 5.6 million features; 4) word-level unigrams and bigrams plus character-level trigrams, with a total of 4.2 million features. Configuration 3) has a larger vocabulary size compared to 2) and 4) to examine the effect of character trigrams.

From n-gram TF-IDF configurations 1) to 3), we observe that adding bigram features increases the Recall@100 from 54.86% to 58.88%, a significant improvement over unigram features alone. However, this also comes at the cost of a 1.57x larger model size on disk. We discuss the trade-off between model size and inference latency in Section 4.4. Next, from configuration 3) to 4), adding character trigrams further improves the Recall@100 from 58.88% to 60.30%, with a slightly larger model size. This empirical result suggests that character trigrams can better handle typos and compound words, which are common patterns in real-world product search systems. Similar observations are also found in [19, 32].

All n-gram TF-IDF configurations of XR-Linear achieve higher Recall@k compared to the competitive neural retrieval model DSSM with the approximate nearest neighbor search algorithm HNSW. For a sanity check, we also conduct exact kNN inference on the DSSM embeddings, and the conclusion remains the same. Finally, the lexical matching method Okapi-BM25 performs poorly on this semantic product search dataset, which may be due to the fact that purchase signals are not well correlated with overlapping token statistics. This indicates another unique difference between the semantic product search problem and the conventional web retrieval problem, where Okapi-BM25 variants are often a strong baseline [8, 14, 27].

Table 2: Comparing XR-Linear (PECOS) with different n-gram features to representative semantic matching algorithms, where the output space is 100 million products. The results of XR-Linear are obtained with weight pruning threshold ε = 0.1 and beam search size b = 10. Training time is measured in hours (hr) and model size on disk is measured in Gigabytes (GB). The model size of DSSM includes the encoders and the stored product embeddings for retrieval. The training time of the DSSM model is measured on Nvidia V100 GPUs, while all other methods are benchmarked on Intel CPUs.

Methods                                        | Recall@10 | Recall@50 | Recall@100 | Training Time (hr) | Model Size (GB)
XR-Linear (PECOS), threshold ε = 0.1, beam size b = 10:
  Unigrams (3M)                                | 38.40     | 52.49     | 54.86      | 25.7               | 90.0
  Unigrams + Bigrams (3M)                      | 39.66     | 54.93     | 57.70      | 31.5               | 212.0
  Unigrams + Bigrams (5.6M)                    | 40.71     | 56.09     | 58.88      | 67.9               | 232.0
  Unigrams + Bigrams + Char Trigrams (4.2M)    | 42.86     | 57.78     | 60.30      | 42.5               | 295.0
DSSM [32] + HNSW [28]                          | 18.03     | 30.22     | 36.19      | 263.5              | 129.0
DSSM [32] + exact kNN                          | 18.37     | 30.72     | 36.79      | 260.0              | 192.0
Okapi-BM25 [36]                                | 11.62     | 18.35     | 21.72      | -                  | -

Training Cost on AWS. We compare the AWS instance cost of XR-Linear (PECOS) and DSSM, where the former is trained on an x1.32xlarge CPU instance ($13.3 per hour) and the latter is trained on a p3.8xlarge GPU instance ($24.5 per hour). From Table 2, the training of XR-Linear takes 42.5 hours (including vectorization, clustering, and matching), which amounts to $567 USD. In contrast, the training of DSSM takes 260.0 hours (training DSSM with early stopping and building the HNSW indexer), which amounts to $6365 USD. In other words, the proposed XR-Linear enjoys a 10x smaller training cost compared to DSSM, and thus offers a more frugal solution for large-scale applications such as product search systems.

4.4 Recall and Inference Latency Trade-off
Depending on the computational environment and resources, semantic product search systems can have various constraints such as the model size on disk, the real-time inference memory, and, most importantly, the real-time inference latency, measured in milliseconds per query (i.e., ms/q) under a single-thread, single-CPU setup. In this subsection, we dive deep into the trade-off between Recall@100 and inference latency by varying two XR-Linear hyper-parameters: 1) the weight pruning threshold ε; 2) the inference beam search size b.

Weight Pruning. As discussed in Section 3.3, we conduct element-wise weight pruning on the parameters of the XR-Linear model trained on Unigram + Bigram + Char Trigram (4.2M) features. We experiment with the thresholds ε = {0.1, 0.2, 0.3, 0.35, 0.4, 0.45}, and the results are shown in Table 3.

From ε = 0.1 to ε = 0.35, the model size has a 3x reduction (i.e., from 295 GB to 90 GB), reaching the same model size as the XR-Linear model trained on unigram (3M) features. Crucially, the pruned XR-Linear model still enjoys a sizable improvement over the XR-Linear model trained on unigram (3M) features, where the Recall@100 is 57.63% versus 54.86%. While the real-time inference memory of XR-Linear closely follows its model disk space, the latter has a smaller impact on the inference latency. In particular, from ε = 0.1 to ε = 0.45, the model size has a 5x reduction (from 295 GB to 62 GB), while the inference latency is reduced by only 32.5% (from 1.20 ms/q to 0.81 ms/q). The inference beam search size, on the other hand, has a larger impact on real-time inference latency, as we discuss in the following paragraph.

Beam Size. We analyze how the beam search size used in XR-Linear prediction affects Recall@100 and inference latency. The results are presented in Table 4. We also compare the throughput (i.e., the inverse of latency, higher is better) versus Recall@100 for XR-Linear and DSSM + HNSW, as shown in Figure 4.

From beam size b = 1 to b = 15, we see a clear trade-off between Recall@100 and latency, where the former increases from 20.95% to 60.94% at the cost of 7x higher latency (from 0.16 ms/q to 1.24 ms/q). Real-world product search systems typically limit the real-time latency of matching algorithms to be smaller than 5 ms/q. Therefore, even with b = 30, the proposed XR-Linear approach remains highly applicable for real-world online deployment.

In Figure 4, we also examine the throughput-recall trade-off for the embedding-based retrieval model DSSM [32] that uses HNSW [28] for inference. XR-Linear outperforms DSSM + HNSW significantly, with higher retrieval quality and larger throughput (smaller inference latency).

In Figure 5, we present the retrieved products for an example test query, "rose of jericho plant", and compare the retrieval quality of XR-Linear (PECOS), DSSM + HNSW, and Okapi-BM25. From the top 10 predictions (i.e., retrieved products), we see that XR-Linear covers more products that were purchased, and the retrieved set is more diverse, compared to the other two baselines.

Table 3: Weight pruning of the best XR-Linear (PECOS) model with various thresholds ε. The model memory is measured by the stable memory consumption when processing input queries sequentially in a single-thread real-time inference setup. The inference latency includes the time for featurizing query tokens into a vector as well as the prediction by the XR-Linear model.

Pruning threshold (ε) | Recall@10 | Recall@50 | Recall@100 | #Parameters | Disk Space (GB) | Memory (GB) | Latency (ms/q)
ε = 0.1               | 42.86     | 57.78     | 60.30      | 26,994M     | 295             | 258         | 1.2046
ε = 0.2               | 42.42     | 57.22     | 59.81      | 15,419M     | 168             | 166         | 1.0176
ε = 0.3               | 41.35     | 55.98     | 58.52      | 9,957M      | 109             | 118         | 0.8834
ε = 0.35              | 40.50     | 55.09     | 57.63      | 8,182M      | 90              | 102         | 0.8551
ε = 0.4               | 39.41     | 53.86     | 56.35      | 6,784M      | 75              | 88          | 0.8330
ε = 0.45              | 38.08     | 52.37     | 54.85      | 5,657M      | 62              | 76          | 0.8138

Table 4: The trade-off between retrieval performance and inference latency of XR-Linear (PECOS) with threshold ε = 0.35, which can be flexibly controlled by the beam size b.

Beam Size (b) | Recall@100 | Latency (ms/q) | Throughput (q/s)
b = 1         | 20.95      | 0.1628         | 6,140.97
b = 5         | 49.23      | 0.4963         | 2,014.82
b = 10        | 57.63      | 0.8551         | 1,169.39
b = 15        | 60.94      | 1.2466         | 802.20
b = 20        | 62.72      | 1.6415         | 609.19
b = 25        | 63.78      | 2.0080         | 498.00
b = 30        | 64.46      | 2.4191         | 413.37
b = 50        | 65.74      | 3.4756         | 287.72
b = 75        | 66.17      | 4.7759         | 209.38
b = 100       | 66.30      | 6.0366         | 165.66

Figure 4: Throughput-Recall trade-off comparison between XR-Linear (PECOS) and DSSM + HNSW. The curve of the latter method is obtained by sweeping the HNSW inference hyper-parameters efS = {10, 50, 100, 250, 500}.

Figure 5: Side-by-side comparison of the retrieved products for XR-Linear (PECOS), DSSM + HNSW, and Okapi-BM25. The test query is "rose of jericho plant"; a green box means there is at least one purchase, while a red box means no purchase.

4.5 Online Experiments
We conducted an online A/B test to experiment with different semantic matching algorithms on a large-scale e-commerce retail product website. The control in this experiment is the traditional lexical-based matching augmented with candidates generated by DSSM. The treatment is the same traditional lexical-based matching, but augmented with candidates generated by XR-Linear instead. After a period of time, many key performance indicators of the treatment were statistically significantly higher than the control, which is consistent with our offline observations. We leave how to combine DSSM and XR-Linear to generate a better match set for product search as future work.

5 DISCUSSION AND FUTURE WORK
In this work, we presented XR-Linear (PECOS), a tree-based XMC model to augment the match set of an online retail search system and improve product discovery with better recall rates. Specifically, to retrieve products from a 100 million product catalog, the proposed solution achieves Recall@100 of 60.9% with a low latency of 1.25 milliseconds per query (ms/q). Our tree-based linear model is robust to weight pruning, which can flexibly meet different system memory or disk space requirements.

One challenge for our method is how to retrieve cold-start products with no training signal (e.g., no clicks nor purchases). One naive solution is to assign cold-start products to the nearest relevant leaf node clusters based on the distances between the product embeddings and the cluster centroids. We can then use the average of the weight vectors of the existing products that are nearest neighbors of the novel product to induce its relevance score. This is an interesting setup and we leave it for future work.

REFERENCES
[1] Rohit Babbar and Bernhard Schölkopf. 2017. DiSMEC: distributed sparse machines for extreme multi-label classification. In WSDM.
[2] Rohit Babbar and Bernhard Schölkopf. 2019. Data scarcity, robustness and extreme multi-label classification. Machine Learning (2019), 1–23.
[3] K. Bhatia, K. Dahiya, H. Jain, A. Mittal, Y. Prabhu, and M. Varma. 2016. The extreme classification repository: Multi-label datasets and code. http://manikvarma.org/downloads/XC/XMLRepository.html
[4] Leonid Boytsov and Bilegsaikhan Naidan. 2013. Engineering efficient and effective non-metric space library. In International Conference on Similarity Search and Applications. Springer, 280–293.
[5] Leonid Boytsov and Eric Nyberg. 2020. Flexible retrieval with NMSLIB and FlexNeuART. arXiv preprint arXiv:2010.14848 (2020).
[6] Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, and Sanjiv Kumar. 2020. Pre-training Tasks for Embedding-based Large-scale Retrieval. In International Conference on Learning Representations.
[7] Wei-Cheng Chang, Hsiang-Fu Yu, Kai Zhong, Yiming Yang, and Inderjit S Dhillon. 2020. Taming Pretrained Transformers for Extreme Multi-label Text Classification. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3163–3171.
[8] Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to Answer Open-Domain Questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL) (Volume 1: Long Papers). 1870–1879.
[9] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for Youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. 191–198.
[10] Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M Voorhees. 2020. Overview of the TREC 2019 deep learning track. arXiv preprint arXiv:2003.07820 (2020).
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
[12] Inderjit S Dhillon and Dharmendra S Modha. 2001. Concept decompositions for large sparse text data using clustering. Machine Learning 42, 1-2 (2001), 143–175.
[13] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9, Aug (2008), 1871–1874.
[14] Luyu Gao, Zhuyun Dai, Zhen Fan, and Jamie Callan. 2020. Complementing lexical retrieval with semantic residual embedding. arXiv preprint arXiv:2004.13969 (2020).
[15] Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. 2020. Accelerating large-scale inference with anisotropic vector quantization. In International Conference on Machine Learning. PMLR, 3887–3896.
[16] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 119). 3929–3938.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.
[18] Matthew Henderson, Ivan Vulić, Daniela Gerz, Iñigo Casanueva, Paweł Budzianowski, Sam Coope, Georgios Spithourakis, Tsung-Hsien Wen, Nikola Mrkšić, and Pei-Hao Su. 2019. Training neural response selection for task-oriented dialogue systems. arXiv preprint arXiv:1906.01543 (2019).
[19] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. 2333–2338.
[20] Himanshu Jain, Venkatesh Balasubramanian, Bhanu Chunduri, and Manik Varma. 2019. SLICE: Scalable Linear Extreme Classifiers Trained on 100 Million Labels for Related Searches. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. ACM, 528–536.
[21] Kalina Jasinska-Kobus, Marek Wydmuch, Krzysztof Dembczynski, Mikhail Kuznetsov, and Robert Busa-Fekete. 2020. Probabilistic Label Trees for Extreme Multi-label Classification. arXiv preprint arXiv:2009.11218 (2020).
[22] Ting Jiang, Deqing Wang, Leilei Sun, Huayi Yang, Zhengyang Zhao, and Fuzhen Zhuang. 2021. LightXML: Transformer with Dynamic Negative Sampling for High-Performance Extreme Multi-label Text Classification. In Proceedings of the AAAI Conference on Artificial Intelligence.
[23] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of Tricks for Efficient Text Classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL): Volume 2, Short Papers. Association for Computational Linguistics, 427–431.
[24] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-Tau Yih. 2020. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906 (2020).
[25] Sujay Khandagale, Han Xiao, and Rohit Babbar. 2020. BONSAI: Diverse and Shallow Trees for Extreme Multi-label Classification. Machine Learning (2020).
[26] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. 2020. The open images dataset v4. International Journal of Computer Vision (2020), 1–26.
[27] Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL).
[28] Y. A. Malkov and D. A. Yashunin. 2020. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence 42, 4 (2020), 824–836.
[29] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press. https://doi.org/10.1017/CBO9780511809071
[30] Pierre-Emmanuel Mazaré, Samuel Humeau, Martin Raison, and Antoine Bordes. 2018. Training millions of personalized dialogue agents. In EMNLP.
[31] Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. In CoCo@NIPS.
[32] Priyanka Nigam, Yiwei Song, Vijai Mohan, Vihan Lakshman, Weitian Ding, Ankit Shingavi, Choon Hui Teo, Hao Gu, and Bing Yin. 2019. Semantic product search. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2876–2885.
[33] Yashoteja Prabhu, Anil Kag, Shrutendra Harsola, Rahul Agrawal, and Manik Varma. 2018. Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In WWW.
[34] Yashoteja Prabhu, Aditya Kusupati, Nilesh Gupta, and Manik Varma. 2020. Extreme regression for dynamic search advertising. In Proceedings of the 13th International Conference on Web Search and Data Mining. 456–464.
[35] Sashank J Reddi, Satyen Kale, Felix Yu, Dan Holtmann-Rice, Jiecao Chen, and Sanjiv Kumar. 2019. Stochastic Negative Mining for Learning with Large Output Spaces. In AISTATS.
[36] Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval 3, 4 (2009), 333–389.
[37] Stephen E Robertson and Steve Walker. 1994. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In SIGIR '94. Springer, 232–241.
[38] Liuyihan Song, Pan Pan, Kang Zhao, Hao Yang, Yiming Chen, Yingya Zhang, Yinghui Xu, and Rong Jin. 2020. Large-Scale Training System for 100-Million Classification at Alibaba. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2909–2930.
[39] Yukihiro Tagami. 2017. AnnexML: Approximate nearest neighbor search for extreme multi-label classification. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 455–464.
[40] Andrew Trotman, Antti Puurula, and Blake Burgess. 2014. Improvements to BM25 and language models examined. In Proceedings of the 2014 Australasian Document Computing Symposium. 58–65.
[41] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.
[42] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. 2009. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning. 1113–1120.
[43] Thomas Wolf, Julien Chaumond, Lysandre Debut, Victor Sanh, Clement Delangue, Anthony Moi, Pierric Cistac, Morgan Funtowicz, Joe Davison, Sam Shleifer, et al. [n.d.]. HuggingFace Transformer Benchmarking. https://huggingface.co/transformers/benchmarks.html. Accessed: 2021-02-08.
[44] Thomas Wolf, Julien Chaumond, Lysandre Debut, Victor Sanh, Clement Delangue, Anthony Moi, Pierric Cistac, Morgan Funtowicz, Joe Davison, Sam Shleifer, et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 38–45.
[45] Marek Wydmuch, Kalina Jasinska, Mikhail Kuznetsov, Róbert Busa-Fekete, and Krzysztof Dembczynski. 2018. A no-regret generalization of hierarchical softmax to extreme multi-label classification. In NIPS.
[46] Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. 2021. Approximate nearest neighbor negative contrastive learning for dense text retrieval. In ICLR.
[47] Ji Yang, Xinyang Yi, Derek Zhiyuan Cheng, Lichan Hong, Yang Li, Simon Xiaoming Wang, Taibai Xu, and Ed H Chi. 2020. Mixed Negative Sampling for Learning Two-tower Neural Networks in Recommendations. In Companion Proceedings of the Web Conference 2020. 441–447.
[48] Ian EH Yen, Xiangru Huang, Wei Dai, Pradeep Ravikumar, Inderjit S. Dhillon, and Eric Xing. 2017. PPDsparse: A parallel primal-dual sparse method for extreme classification. In KDD. ACM.
[49] Ian EH Yen, Xiangru Huang, Kai Zhong, Pradeep Ravikumar, and Inderjit S Dhillon. 2016. PD-Sparse: A Primal and Dual Sparse Approach to Extreme Multiclass and Multilabel Classification. In International Conference on Machine Learning (ICML).
[50] Xinyang Yi, Ji Yang, Lichan Hong, Derek Zhiyuan Cheng, Lukasz Heldt, Aditee Kumthekar, Zhe Zhao, Li Wei, and Ed Chi. 2019. Sampling-bias-corrected neural modeling for large corpus item recommendations. In Proceedings of the 13th ACM Conference on Recommender Systems. 269–277.
[51] Ronghui You, Zihan Zhang, Ziye Wang, Suyang Dai, Hiroshi Mamitsuka, and Shanfeng Zhu. 2019. AttentionXML: Label Tree-based Attention-Aware Deep Model for High-Performance Extreme Multi-Label Text Classification. In Advances in Neural Information Processing Systems. 5812–5822.
[52] Hsiang-Fu Yu, Kai Zhong, and Inderjit S Dhillon. 2020. PECOS: Prediction for Enormous and Correlated Output Spaces. arXiv preprint arXiv:2010.05878 (2020).
[53] Justin Zobel and Alistair Moffat. 2006. Inverted files for text search engines. ACM Computing Surveys (CSUR) 38, 2 (2006), 6–es.