Distant-supervised algorithms with applications to text mining, product search, and scholarly networks.

A DISSERTATION SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY

Saurav Manchanda

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Professor George Karypis, Advisor

November, 2020

© Saurav Manchanda 2020
ALL RIGHTS RESERVED

Acknowledgements

A Ph.D. is more about the journey than the destination, and I am lucky enough to be surrounded by many people who joined me on this journey. First and foremost, I would like to thank my advisor, George. He is an excellent advisor who not only uplifted my research skills but always inspired me to do great research work. Every interaction with him was an opportunity for me to learn. I hope to become as smart and hardworking as him one day. I would like to thank Professors Zhi-Li Zhang, Rui Kuang, Brian Reese, and Tian He for serving on my committee for my thesis, thesis proposal, and preliminary exams. Their comments and suggestions have been invaluable in shaping my research, and I am fortunate to have them on my committees. My Ph.D. journey would be very different without my lab, which I consider my home away from home. I made great memories during my time in the Karypis lab that I will never forget (especially the silly arguments and quarrels, like what happens in a family). My lab-family includes: Agi, Ancy, David, Dominique, Eva, Haoji, Jake, Jeremy, Kosta, Maria, Mohit, Prableen, Rohit, Sara, Shaden, Shalini, Zeren. I would also like to acknowledge the grants that supported my research: NSF (1447788, 1704074, 1757916, 1834251), Army Research Office (W911NF1810344) and Intel Corp. I also thank Walmart Labs for providing data for my research. Lastly, I would like to thank the staff at the Department of Computer Science and Engineering and the Digital Technology Center for providing me with the necessary resources for my research.

Dedication

To papa, mummy, chai, akoo and the reader!

Abstract

In recent times, data has become the lifeblood of virtually all businesses, and the real-world impact of data-driven machine learning has grown by leaps and bounds. It has established itself as a standard tool for organizations to draw insights from data at scale, and hence, to enhance their profits. However, one of the key bottlenecks in deploying machine learning models in practice is the unavailability of labeled training data. Manually labeled training sets are expensive and tedious to create. Besides, they cannot practically be reused for new objectives if the underlying distribution of the data changes with time. Distant supervision provides an alternative to expensive hand-labeled datasets by leveraging other sources of weak supervision. In this thesis, we identify and provide solutions to some of the challenges that can benefit from distant-supervised approaches. First, we present a distant-supervised approach to accurately and efficiently estimate a vector representation for each sense of multi-sense words. Second, we present approaches for distant-supervised text segmentation and annotation, which is the task of associating individual parts of a multilabel document with their most appropriate class labels. Third, we present approaches for query understanding in product search. Specifically, we developed distant-supervised solutions to three challenges in query understanding: (i) when multiple terms are present in a query, determining the relevant terms that are representative of the query's product intent; (ii) bridging the vocabulary gap between the terms in the query and the product's description; and (iii) annotating individual terms in a query with the corresponding intended product characteristics (product type, brand, gender, size, color, etc.). Fourth, we present approaches to estimate content-aware bibliometrics that accurately quantify the scholarly impact of a publication. Our proposed metric assigns content-aware weights to the edges of a citation network that quantify the extent to which the cited node informs the citing node. Consequently, this weighted network can be used to derive impact metrics for the various entities involved, such as publications and authors.

Contents

Acknowledgements

Dedication

Abstract

List of Tables

List of Figures

1 Introduction
    1.1 Thesis statement
    1.2 Thesis Outline and Original Contributions
        1.2.1 Chapter 3: Distributed representation of multi-sense words
        1.2.2 Chapter 4: Credit attribution in multilabel documents
        1.2.3 Chapters 5 and 6: Query understanding in product search
        1.2.4 Chapter 7: Impact assessment in scholarly networks
    1.3 Bibliographic Notes

2 Background and Related Work
    2.1 Distributed representation of multi-sense words
    2.2 Credit attribution in multilabel documents
    2.3 Improving relevance in product-search
        2.3.1 Term-weighting methods
        2.3.2 Improving relevance for the tail queries
        2.3.3 Query refinement
        2.3.4 Slot-filling
        2.3.5 Entity linking
    2.4 Importance assessment in scholarly networks
        2.4.1 Citation Indexing
        2.4.2 Citation recommendation
        2.4.3 Link-prediction

3 Distributed representation of multi-sense words
    3.1 Definitions and Notations
    3.2 Loss driven multisense identification (LDMI)
        3.2.1 Identifying the words with multiple senses
        3.2.2 Clustering the occurrences
        3.2.3 Putting everything together
    3.3 Experimental methodology
        3.3.1 Datasets
        3.3.2 Evaluation methodology and metrics
    3.4 Results and discussion
        3.4.1 Quantitative analysis
        3.4.2 Qualitative analysis

4 Credit attribution in multilabel documents
    4.1 Definitions and Notations
    4.2 Dynamic programming based methods
        4.2.1 Overview
        4.2.2 Segmentation algorithm
        4.2.3 Building the classification model
    4.3 Credit Attribution With Attention (CAWA)
        4.3.1 Architecture
        4.3.2 Model estimation
        4.3.3 Segment inference
    4.4 Experimental methodology
        4.4.1 Datasets
        4.4.2 Competing approaches
        4.4.3 Evaluation Methodology
        4.4.4 Parameter selection
        4.4.5 Performance Assessment Metrics
    4.5 Results and Discussion
        4.5.1 Credit Attribution
        4.5.2 Multilabel classification
        4.5.3 Ablation study

5 Intent term selection and refinement in e-commerce queries
    5.1 Definitions and Notations
    5.2 Proposed methods
        5.2.1 Contextual term-weighting (CTW) model
        5.2.2 Contextual query refinement (CQR) model
    5.3 Experimental methodology
        5.3.1 Dataset
        5.3.2 Evaluation Methodology and Performance Assessment
    5.4 Results and Discussion
        5.4.1 Quantitative evaluation
        5.4.2 Qualitative evaluation

6 Distant-supervised slot-filling for e-commerce queries
    6.1 Definitions and Notations
    6.2 Slot Filling via Distant Supervision
        6.2.1 Uniform Slot Distribution (USD)
        6.2.2 Markovian Slot Distribution (MSD)
        6.2.3 Correlated Uniform Slot Distribution (CUSD)
        6.2.4 Correlated Uniform Slot Distribution with Subset Selection (CUSDSS)
    6.3 Experimental methodology
        6.3.1 User engagement and annotated dataset
        6.3.2 Evaluation methodology
        6.3.3 Performance Assessment metrics
    6.4 Results and Discussion

7 Importance assessment in scholarly networks
    7.1 Definitions and Notations
    7.2 Content-Informed Index (CII)
    7.3 Experimental methodology
        7.3.1 Evaluation methodology and metrics
        7.3.2 Baselines
        7.3.3 Datasets
        7.3.4 Parameter selection
    7.4 Results and discussion
        7.4.1 Quantitative analysis
        7.4.2 Qualitative analysis

8 Conclusion

List of Tables

3.1 Dataset statistics.
3.2 Results for the Spearman rank correlation (ρ × 100).
3.3 Top similar words for different senses of the multi-sense words*.
3.4 Senses discovered by the competing approaches*.
4.1 Notation used throughout the chapter.
4.2 Dataset statistics.
4.3 Hyperparameter values.
4.4 Results on the segmentation task.
4.5 Sentence classification performance when the sentences belong to similar classes.
4.6 Statistics of the predicted segments.
4.7 Results on the multilabel classification task.
5.1 Notation used throughout the chapter.
5.2 Results for the query term-weighting problem.
5.3 Results for the query refinement problem.
5.4 Predicted term-weights for some queries.
5.5 Top 20 predicted terms for a few selected queries.
6.1 Notation used throughout the chapter.
6.2 Retrieval results (Task T1).
6.3 Slot-filling results when cq is observed (Task T2).
6.4 Slot-filling results when cq is unobserved (Task T3).
6.5 Individual precision (P), recall (R) and F1-score (F1) when cq is observed.
6.6 Individual precision (P), recall (R) and F1-score (F1) when cq is unobserved.
6.7 Examples of product categories estimated by CUSD and CUSDSS.
7.1 Notation used throughout the paper.
7.2 Results on the Somers' ∆ metric.
7.3 Examples of lowest and highest CII-weighted citations.

List of Figures

3.1 Distribution of the average contextual loss for all words
3.2 Distribution of the number of senses
4.1 Plot summary of the movie Blondie Johnson, belonging to the crime/drama genre
4.2 Example of the attention weights
4.3 CAWA architecture with an example
4.4 Architecture of the attention mechanism.
4.5 Architecture of the per-class binary classifier.
4.6 Illustration of the minov and maxov
4.7 Advantage of using the SOV metric over the PPPA metric
4.8 Effect of average pooling and β on credit attribution performance
4.9 Effect of α on credit attribution performance
4.10 Effect of α on multilabel classification performance
5.1 Contextual term-weighting (CTW) model.
5.2 Contextual query refinement (CQR) model.
5.3 Variation of the MRR with query length.
6.1 Query understanding with slot-filling.
6.2 Plate notation of the developed approaches
6.3 Variation of the q-accuracy with query length.
7.1 Overview of Content-Informed Index
7.2 Word-cloud (frequently occurring words) that appear in the citation context of the citations with the highest predicted importance weights.
7.3 Word-cloud (frequently occurring words) that appear in the citation context of the citations with the least predicted importance weights.

Chapter 1

Introduction

In recent years, data has become the lifeblood of virtually all businesses, and data is often said to be the new oil. Many machine learning models use supervised learning to refine this data and extract its real value. Thus, for such models, it is the labeled data that is the precious commodity. If this labeled data is the result of manual annotation, it becomes the key bottleneck in deploying machine learning models in practice. Manually labeled training sets are expensive and tedious to create. Besides, they cannot practically be reused for new objectives if the underlying distribution of the data changes with time. Distant supervision provides a potential solution to the cost and scalability problems of manual annotation. Distant supervision refers to leveraging alternative sources of weak supervision. Examples of distant supervision include generating labeled data using heuristics, rules, ontologies, or coarsely labeled data, to name a few common techniques.

1.1 Thesis statement

In this thesis, we identify and address some of the challenges that can benefit from distant-supervised approaches. To that end, we contribute several state-of-the-art distant-supervised approaches to various challenges arising in text-mining, product-search, and scholarly-networks. We provide an extensive experimental evaluation of our contributions to show their advantage over the competing approaches.

1.2 Thesis Outline and Original Contributions

We start by providing the necessary background and overall motivation for this thesis in Chapter 2. The remaining outline and major contributions of this thesis are discussed below in the given order:

1.2.1 Chapter 3: Distributed representation of multi-sense words

Many NLP tasks benefit by embedding the words of a collection into a low dimensional space in a way that captures their syntactic and semantic information. Such NLP tasks include analogy/similarity questions [86], part-of-speech tagging [4], named entity recognition [5], machine translation [139, 87], etc. Distributed representations of words are real-valued, low dimensional embeddings based on the distributional properties of words in large samples of the language data. However, representing each word by a single vector does not properly model the words that have multiple senses (i.e., polysemous and homonymous words). For multi-sense words, a single representation leads to a vector that is the amalgamation of all its different senses, which can lead to ambiguity. We present an extension to the Skip-Gram model of Word2Vec to accurately and efficiently estimate a vector representation for each sense of multi-sense words. Our model relies on the fact that, given a word, the Skip-Gram model's loss associated with predicting the words that co-occur with that word should be greater when that word has multiple senses than when it has a single sense. This information is used to identify the words that have multiple senses and estimate a different representation for each of the senses. These representations are estimated using the Skip-Gram model by first clustering the occurrences of the multi-sense words while accounting for the diversity of the words in their contexts.

1.2.2 Chapter 4: Credit attribution in multilabel documents

Credit attribution in multilabel documents is the task of segmenting documents into semantically coherent sections and annotating the sections with their most appropriate class labels. Segmenting text into semantically coherent sections is an important task with applications in information retrieval and text summarization. Developing accurate segmentation approaches requires the availability of training data with ground truth information at the segment level. However, generating such labeled datasets, especially for applications in which the meaning of the class-labels is user-defined, is expensive and time-consuming. We have developed approaches that leverage the set of class-labels that are associated with a document as a source of distant supervision, to accurately and efficiently segment text documents. The set of class-labels is easier to obtain than segment-level ground truth information, since the training data essentially corresponds to a multilabel dataset. Specifically, the approaches that we have developed can be broadly classified into two categories as described below:

Dynamic programming-based approaches

We developed dynamic programming based methods that exploit the structure within the documents by relying on the fact that sentences occurring together in a document have high chances of belonging to the same class. Our methods employ an iterative approach. In each iteration, they estimate a model for predicting the classes of an arbitrary-length text segment, then use dynamic programming to segment the document into semantically-coherent chunks using the estimated model, and fine-tune the model with these segments. Every iteration provides a more precise classification model and hence better segmentation. However, these approaches cannot model the case where sentences can belong to multiple classes; thus, they cannot correctly model semantically similar classes.
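The segmentation step of each iteration can be viewed as a standard interval dynamic program. The following is a minimal illustrative sketch, not the thesis implementation; the scoring function `segment_score` (standing in for the per-segment classification model) and the `max_len` cap on segment length are assumptions.

```python
# Hypothetical sketch of the dynamic-programming segmentation step.
# `segment_score(sents, c)` is an assumed stand-in for the estimated
# classification model: it scores how well a contiguous block of
# sentences matches class label c (e.g., a log-likelihood).

def segment_document(sents, labels, segment_score, max_len=20):
    """Split `sents` into contiguous chunks, each tagged with one of the
    document's `labels`, maximizing the total segment score."""
    n = len(sents)
    best = [float("-inf")] * (n + 1)   # best[j]: best score for sents[:j]
    best[0] = 0.0
    back = [None] * (n + 1)            # back[j]: (start, label) of last chunk
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            for c in labels:
                s = best[i] + segment_score(sents[i:j], c)
                if s > best[j]:
                    best[j], back[j] = s, (i, c)
    # Recover the segmentation by walking the back-pointers.
    segments, j = [], n
    while j > 0:
        i, c = back[j]
        segments.append((i, j, c))
        j = i
    return list(reversed(segments))
```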

Attention mechanism-based approaches

To deal with the limitations of our dynamic programming-based methods, we developed Credit Attribution With Attention (CAWA), a neural-network-based approach that models multi-topic segments. CAWA uses the class labels of a multilabel document as the source of distant-supervision, to assign class labels to the individual sentences of the document. CAWA leverages the attention mechanism to compute weights that establish the relevance of a sentence for each of the classes. The attention weights, which can be interpreted as a probability distribution over the classes, allow CAWA to capture the semantically similar classes by modeling each sentence as a distribution over the classes, instead of mapping it to just one class. In addition, CAWA leverages a simple average pooling layer to constrain the neighboring sentences to belong to the same class by smoothing their class distributions. CAWA uses an end-to-end learning framework to combine the attention mechanism with the multilabel classifier.
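The two mechanisms described above can be illustrated with a minimal numpy sketch (not the CAWA implementation): a softmax over per-sentence class-compatibility scores yields per-sentence class distributions, and average pooling over a window of neighboring sentences smooths them. The scores and window size below are fabricated for illustration.

```python
import numpy as np

# Illustrative sketch only: per-sentence class distributions from a
# softmax over class-compatibility scores, followed by average-pooling
# smoothing that encourages neighboring sentences to share a class.
# `scores` has one row per sentence, one column per document label.

def sentence_class_distributions(scores):
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)     # softmax per sentence

def smooth(dists, window=3):
    # Average pooling over a window of neighboring sentences.
    n = len(dists)
    half = window // 2
    out = np.empty_like(dists)
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out[i] = dists[lo:hi].mean(axis=0)
    return out

scores = np.array([[2.0, 0.1], [1.5, 0.3], [0.2, 2.2], [0.1, 2.5]])
dists = smooth(sentence_class_distributions(scores))
print(dists.argmax(axis=1))   # per-sentence class assignments
```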

1.2.3 Chapters 5 and 6: Query understanding in product search

Online shopping has become a popular activity, with a recent report from the US Department of Commerce stating that online sales grew 15.1% in 2016 and accounted for 8.1% of total retail sales for the year. With such a growth rate, improving the user experience in product search is a valuable challenge for e-commerce companies. We have developed distant-supervised solutions to three challenges in query understanding, as described below:

Intent Term Selection in E-Commerce Queries

E-commerce search engines can fail to retrieve results that satisfy a query's product intent because conventional retrieval approaches, such as BM25, may ignore the important terms in queries owing to their low inverse document frequency (IDF). We leveraged the historical query reformulation logs of a large e-retailer (walmart.com) to develop a distant-supervision-based approach to identify the relevant terms that characterize the query's product intent. The key idea underpinning our approach is that the terms retained in the reformulation of a query are more important in describing the query's product intent than the discarded terms. We also use the fact that the significance of a term depends on its context (other terms in the neighborhood) in the query to determine the term's importance towards the query's product intent.
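A small illustrative sketch of the distant-supervision signal, under the assumption that the logs provide (original query, reformulated query) pairs: terms kept in the reformulation become positive (intent-defining) examples and dropped terms become negative ones. The data format and query below are fabricated.

```python
# Illustrative sketch (assumed data format, not the thesis pipeline):
# derive distant term-importance labels from query reformulation pairs.
# A term retained in the user's reformulation is treated as important
# for the product intent; a discarded term is treated as unimportant.

def term_labels(original_query, reformulated_query):
    kept = set(reformulated_query.lower().split())
    return [(term, 1 if term in kept else 0)
            for term in original_query.lower().split()]

print(term_labels("cheap striped mens shirt", "mens shirt"))
# [('cheap', 0), ('striped', 0), ('mens', 1), ('shirt', 1)]
```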

Intent Term Refinement in E-Commerce Queries

E-commerce search engines can also fail to retrieve results that satisfy a query's product intent if there is a vocabulary gap between the terms in the query and the product's description, i.e., terms used in the query are semantically similar but different from the terms in the product description. Hence, predicting additional terms that describe the query's product intent better than the existing query terms is an essential task in e-commerce search. Similar to our last task, we leveraged the historical query reformulation logs of a major e-commerce retailer to develop distant-supervised approaches to solve this problem. Experiments illustrate that our context-aware approaches outperform the non-contextual baselines on the task of predicting the additional terms that represent product intent.

Query-tagging for e-commerce queries

Query-tagging or slot-filling refers to the task of annotating individual terms in a query with the corresponding intended product characteristics (product type, brand, gender, size, color, etc.). These characteristics can then be used by a search engine to return results that better match the query's product intent. Traditional methods for query tagging require the availability of training data with ground truth slot-annotation information. However, generating such labeled data, especially in e-commerce, is expensive and time-consuming because the number of slots increases as new products are added. We developed distant-supervised probabilistic generative models that require no manual annotation. These approaches leverage historical query logs and the purchases that these queries led to, and also exploit co-occurrence information among the slots to identify intended product characteristics.

1.2.4 Chapter 7: Impact assessment in scholarly networks

Scientific, engineering, and technological (SET) innovations have been the drivers behind many of the significant positive advances in our modern economy, society, and life. To measure various impact-related aspects of these innovations, various quantitative metrics have been developed and deployed. These metrics play an important role as they are used to influence how resources are allocated, assess the performance of personnel, identify intellectual property (IP)-related takeover targets, value a company's intangible assets (IP is such an asset), and identify strategic and/or emerging competitors. Citation networks of peer-reviewed scholarly publications (e.g., journal/conference articles and patents) have widely been used and studied in order to derive such metrics for the various entities involved (e.g., articles, researchers, institutions, companies, journals, conferences, countries, etc. [3]). However, most of these traditional metrics, such as citation counts and the h-index, treat all citations and publications equally, and do not take into account the content of the publications and the context in which a prior scholarly work was cited. We propose machine-learning-driven approaches that automatically estimate the weights of the edges in a citation network, such that edges with higher weights correspond to higher-impact citations. The developed approaches leverage the readily available content of the papers as a source of distant-supervision.

1.3 Bibliographic Notes

Findings of Chapter 3 on estimating representations for multi-sense words have been published as a conference paper, 'Distributed representation of multi-sense words: A loss driven approach', at the 22nd Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD 2018) [75]. Contributions made in Chapter 4 have been published in two conference papers: (i) 'Text segmentation on multilabel documents: A distant-supervised approach', published at the 18th IEEE International Conference on Data Mining (ICDM 2018) [76], and (ii) 'CAWA: An Attention-Network for Credit Attribution', published at the 34th AAAI Conference on Artificial Intelligence (AAAI 2020) [77]. The findings of Chapter 5 have been published as a conference paper, 'Intent term weighting in e-commerce queries', at the 28th ACM International Conference on Information and Knowledge Management (CIKM 2019) [78]. Additional findings were also collected as part of an arXiv preprint [79].

Chapter 2

Background and Related Work

2.1 Distributed representation of multi-sense words

Various models have been developed to estimate distributed representations of multi-sense words. The models presented in this section work by estimating multiple vector-space representations per word, one for each sense. Most of these models estimate a fixed number of vector representations for each word, irrespective of the number of senses associated with a word. In the rest of this section, we review these models and discuss their limitations. Reisinger and Mooney [105] cluster the occurrences of a word using the mixture of von Mises-Fisher distributions clustering method [8] to assign a different sense to each occurrence of the word. The clustering is performed on all the words, even those that have a single sense. This approach estimates a fixed number of vector representations for each word in the vocabulary. As per the authors, the model captures meaningful variation in word usage and does not assume that each vector representation corresponds to a different sense. Huang et al. [52] use the same idea and estimate a fixed number of senses for each word, using spherical k-means [24] to cluster the occurrences. Neelakantan et al. [91] proposed two models built on top of the Skip-Gram model: Multiple-Sense Skip-Gram (MSSG) and its non-parametric counterpart NP-MSSG. MSSG estimates a fixed number of senses per word, whereas NP-MSSG discovers a varying number of senses. MSSG maintains clusters of the occurrences of each word, each cluster corresponding to a sense.

Each occurrence of a word is assigned a sense based on the similarity of its context with the already-maintained clusters, and the corresponding vector representation, as well as the sense cluster of the word, are updated. During training, NP-MSSG creates a new sense for a word with probability proportional to the distance of the context to the nearest sense cluster. Both MSSG and NP-MSSG create an auxiliary vector to represent an occurrence by taking the average of the vectors associated with all the words belonging to its context. The similarity between two occurrences is computed as the cosine similarity between these auxiliary vectors. This approach does not take into consideration the variation among the words that occur within the same context. Another disadvantage is that the auxiliary vector is biased towards the words having a higher L2 norm. This leads to noisy clusters, and hence, the senses discovered by these models are not robust.

2.2 Credit attribution in multilabel documents

Various unsupervised, supervised, and distant-supervised methods have been developed to deal with the credit attribution problem. The unsupervised methods do not rely on labeled training data and use approaches such as clustering or graph search to place boundaries at the locations where the transition from one topic to another happens, but no assumption is made regarding the class-labels of the segments. The supervised methods rely on labeled training data, and the problem in this case can be mapped to a classification problem, the task being to predict if a sentence corresponds to the beginning (or end) of a segment. The distant-supervised methods do not rely on explicit segment-level labeled data, but use the bag-of-classes associated with a document to associate individual words/sentences in a document with their most appropriate classes. In the rest of this section, we review these prior approaches for credit attribution and discuss their limitations. Popular examples of the unsupervised approaches include TextTiling [49], C99 [18] and GraphSeg [31]. TextTiling assumes that a set of lexical items is in use during the course of a topic discussion, and detects a change in topic by means of a change in vocabulary. Given a sequence of textual units, TextTiling computes the similarity between each pair of consecutive textual units, and locates the segment boundaries based on the minima in the resultant similarity sequence. C99 [18] makes the same assumption, and computes the similarity between a pair of sentences based on their constituent words. These similarity values are used to build a similarity matrix. Each cell in this similarity matrix is assigned a rank based on the similarity values of its neighboring cells, such that two neighboring sentences belonging to the same segment are expected to have higher ranks. C99 then uses divisive clustering on this ranking matrix to locate the segment boundaries. GraphSeg [31] builds a semantic relatedness graph in which nodes denote the sentences and edges are created for pairs of semantically related sentences. The segments are determined by finding maximal cliques of this semantic relatedness graph. Supervised approaches for text segmentation include the ones using decision trees [38], multiple regression analysis [41], an exponential model [9], probabilistic modeling [118] and, more recently, deep neural network based approaches [6, 67]. The decision-tree based method uses acoustic-prosodic features to predict the discourse structure in the text. The multiple regression analysis method [41] uses various surface linguistic cues as features to predict the segment boundaries. The exponential model [9] uses feature selection to identify features that are used to predict segment boundaries. The selected features are lexical (including a topicality measure and a number of cue-word features). The probabilistic model [118] combines both lexical and prosodic cues using Hidden Markov Models and decision trees to predict the segment boundaries. The LSTM-based neural model [67] is composed of a hierarchy of two LSTM networks. The lower-level sub-network generates sentence representations and the higher-level sub-network is the segmentation prediction network. The outputs of the higher-level sub-network are passed on to a fully-connected network, which predicts a binary label for each sentence, indicating whether the sentence ends a segment.
The attention-based model [6] uses an LSTM with attention, where the sentence representations are estimated with CNNs and the segments are predicted based on contextual information. Finally, a fully-connected network outputs a binary label for each sentence, predicting whether the sentence begins a segment. The methods presented in this thesis use the set of classes that are associated with a document as a source of supervision, instead of using explicit segment-level ground truth information. Prior approaches proposed for distant-supervised text segmentation include Labeled Latent Dirichlet Allocation (LLDA) [102], Partially Labeled Dirichlet Allocation (PLDA) [103] and the Multi-Label Topic Model (MLTM) [113]. We review these prior approaches below. Labeled Latent Dirichlet Allocation (LLDA) [102] is a probabilistic graphical model for credit attribution. It assumes a one-to-one mapping between the classes and the topics. Like Latent Dirichlet Allocation, LLDA models each document as a mixture of underlying topics and generates each word from one topic. Unlike LDA, LLDA incorporates supervision by simply constraining the topic model to use only those topics that correspond to a document's (observed) classes. LLDA assigns each word in a document to one of the document's classes. Partially Labeled Dirichlet Allocation (PLDA) [103] is a further extension of LLDA that allows more than one topic for every class, as well as some general topics that are not associated with any class. The Multi-Label Topic Model (MLTM) [113] improves upon PLDA by allowing each topic to belong to multiple, one, or even zero classes probabilistically. MLTM also assigns a class to each sentence, based on the assigned topics of the constituent words. The classes of the documents are generated from the classes of the constituent sentences. A common problem with the above-mentioned approaches is that they model the document as a bag of words/sentences and do not take into consideration that the sentences occurring together in a document tend to talk about the same topic, and hence have a greater chance of belonging to the same class. Moreover, these methods are built on top of Latent Dirichlet Allocation (LDA), which requires inference even during the testing phase and is thus computationally expensive, making them impractical for use in real applications.
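To make the lexical-cohesion idea behind TextTiling-style methods concrete, the following is a toy sketch (not any published implementation): adjacent sentences are compared with bag-of-words cosine similarity and boundaries are placed at local minima of the resulting similarity sequence. The one-sentence blocks are a simplification of the original algorithm's token-sequence blocks.

```python
import numpy as np

# Toy sketch of the TextTiling-style idea: score each gap between
# consecutive sentences by the cosine similarity of bag-of-words
# vectors on either side, and place boundaries at local minima of
# that similarity sequence. Illustrative only.

def bow(sentence, vocab):
    v = np.zeros(len(vocab))
    for w in sentence.lower().split():
        if w in vocab:
            v[vocab[w]] += 1
    return v

def boundaries(sentences):
    vocab = {w: i for i, w in enumerate(
        {w for s in sentences for w in s.lower().split()})}
    vecs = [bow(s, vocab) for s in sentences]
    sims = []
    for i in range(len(vecs) - 1):
        a, b = vecs[i], vecs[i + 1]
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        sims.append(a @ b / denom if denom else 0.0)
    # A gap is a boundary if its similarity is a local minimum.
    return [i + 1 for i in range(1, len(sims) - 1)
            if sims[i] < sims[i - 1] and sims[i] < sims[i + 1]]
```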

2.3 Improving relevance in product-search

2.3.1 Term-weighting methods

Feature selection using term-weighting is a well-researched topic in information retrieval, but the prior work in this area has been about ranking terms according to their discriminating power in determining relevance, where relevance is based on the text-matching between the query and the document (or product). Popular examples of such term-weighting methods include term frequency-inverse document frequency (TF-IDF) [112] and Okapi Best Matching 25 (Okapi BM25) [107, 106]. Intuitively, such techniques determine the importance of a given term in a particular document. Terms that are common in a single document or a small group of documents tend to have higher weights than frequent terms. Such term-weighting techniques are well suited to rank documents in response to a query, but cannot be applied to a group of queries to rank the terms in individual queries with respect to their confidence in defining the query's product intent. A term might be discriminatory with respect to a query (that is, the term occurs in a small number of queries), but this does not necessarily mean that the term is important towards defining that query's product intent. For example, in the query "striped mens shirt", the terms mens and shirt are expected to be more frequent than the term striped. The above weighting techniques will tend to give a higher weight to striped, but we expect that the terms mens and shirt should carry more weight towards defining the query's product intent. Another relevant approach that can be used to estimate term-weights is to first construct a click graph between the queries and documents and estimate a vector representation for each entity (queries and documents) using a vector propagation model (VPCG & VG) [59]. The queries or documents can then be broken down into individual units (e.g., n-grams), and we can learn a vector representation for each n-gram based on the vectors already estimated from the click graph. We can then estimate a weight for each n-gram with a regression model that minimizes the square of the Euclidean distance between the representation of a query vector and the weighted linear combination of the representations of the n-grams in the query. The limitation of this approach is that it estimates a global weight for each n-gram, and hence does not account for the fact that the significance of a term (n-gram) is dependent upon the context in which it is used.
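As a toy illustration of the issue described above, the following computes IDF-style weights for the query "striped mens shirt" over a small fabricated set of queries; the rare modifier receives the largest weight even though it is the least intent-defining term.

```python
import math

# Toy illustration (fabricated query set): IDF-style weighting gives the
# rare modifier "striped" a higher weight than the intent-defining terms
# "mens" and "shirt", which is the behavior criticized above.

queries = ["mens shirt", "womens shirt", "mens jeans",
           "striped mens shirt", "blue mens shirt"]
N = len(queries)

def idf(term):
    df = sum(term in q.split() for q in queries)
    return math.log(N / df)

for term in "striped mens shirt".split():
    print(term, round(idf(term), 3))
# striped gets the largest IDF weight, mens/shirt the smallest.
```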

2.3.2 Improving relevance for the tail queries

Another class of methods that is relevant to the work presented in this thesis is the work on understanding tail queries [115, 25]. Queries can be classified into head queries (frequent queries, associated with rich historical user engagement information) and tail queries (rare queries that do not have much historical user engagement data). Head queries enable retrieval engines to utilize statistical models to correctly identify the query's product intent from the historical engagement data (clicks, add-to-carts, and orders). Tail queries constitute the majority of unique queries submitted by users, and hence, improving search relevance for tail queries is a very valuable challenge for e-commerce companies. One of the proposed approaches to improve the relevance for tail queries is to predict the likelihood of one query being the reformulation of another query [25]. This approach is based on Probabilistic Latent Semantic Analysis (PLSA). It models each pair of consecutive queries that a user composes as generated by a single latent topic. As such, the model makes sense when we are looking to find the most likely (or a set of most likely) reformulations given a query and the set of candidate reformulations is small. But the reformulations cannot be limited to a small set of head queries. Another approach to improve search relevance for tail queries is to extract single- or multi-word expressions from both the head and tail queries and map them to a common concept space defined by a knowledge base [115]. Once both the head and tail queries are mapped to the same concept space, historical engagement data for the head queries can be used for the tail queries as well, via the mapped concepts. The limitation of this approach is that mapping tail queries to a logical concept space using query text is again prone to noisy terms, and the concepts estimated from the noisy terms will degrade performance. The VPCG & VG [59] approach described in Section 2.3.1 can also be used to improve the relevance of tail queries. From the estimated weights and vectors of the n-grams, we can estimate the vectors for the tail queries by a linear combination of the vectors of the constituent n-grams. As mentioned before, the limitation of this approach is that it does not account for the fact that the significance of a term is dependent upon the context in which it is used.

2.3.3 Query refinement

Query refinement (QR) refers to reformulating a given query to improve its search relevance. Query refinement approaches can be either knowledge-based (using knowledge bases to find semantically related terms) or data-based (statistical models using historical data, like search logs and engagement information). Examples of query refinement include query expansion and query suggestion [134]. The query refinement work presented in this thesis is data-based and uses historical search logs to build a statistical model. One of the most popular approaches for query refinement is the use of relevance feedback [108], where the term suggestions (suggesting terms relevant to the query's product intent) are based on the user engagement with the results retrieved for the initial query. In the absence of user feedback, term suggestion can be done via pseudo relevance feedback, which assumes that the top retrieved results are relevant to the query's product intent. These relevance based approaches assume that the retrieval system is able to understand the query's product intent and retrieve relevant results to some extent, which is an invalid assumption for tail queries with noisy terms. A number of prior approaches have been proposed to use historical search logs for query refinement. The prior approaches work by finding similarity between the queries, so that, given a query, other similar queries can be used to refine it. The reformulated queries from historical logs can be clustered to identify different topics or aspects of a query [101, 22]. Chien and Immorlica [16] use the temporal correlation in sessions between two queries as the similarity between queries. Cucerzan and Brill [21] and Boldi et al. [12] use the session co-occurrence information between two queries to find the similarity between them. Sadikov et al. [111] also use the document-click information in addition to co-occurrence information between two queries for query recommendation. A common drawback of the above approaches is that they cannot work on unseen queries, and hence are not helpful with tail queries, which are mostly unique and issued rarely. One of the approaches analyzes the relations of terms inside a query and uses the discovered term association patterns for effective query reformulation [124]. Since this approach is based on the co-occurrence patterns of the terms, it can be applied to tail queries. It estimates a context distribution for terms occurring in a query and then suggests similar words based on their distributional similarity. The drawback of this approach is that the term association patterns are discovered from the co-occurrence statistics of the terms within a query, independent of actual reformulations of that query. Therefore, the discovered patterns lack the task-specific context of reformulation.

2.3.4 Slot-filling

Slot-filling is a well-researched topic in spoken language understanding, and involves extracting relevant semantic slots from a natural language text. Generative approaches designed for the slot-filling task include those based on hidden Markov models and context-free grammar composite models [126, 98, 74]. Conditional models designed for slot-filling based on conditional random fields (CRFs) include [104, 125, 57, 128]. Recently, recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been applied to the slot-filling task [85, 130, 129, 84, 70, 122, 136, 128]. A common drawback of these approaches is that they require the availability of tagged sequences as training data. To tackle the absence of labeled training data, some approaches have been developed; these approaches are either specific to the domain of spoken language understanding or make stronger assumptions about the similarity of the words and the slot values [125, 99, 138, 50]. The unsupervised method developed in [133] is closely related to ours, but it requires manually described grammar rules to annotate the queries and is limited to the task of predicting only two slots (product type and brand), as curating a grammar for more diverse slots is a non-trivial task. Furthermore, new products continue to be added to the inventory, bringing new product types, brands, etc., which leads to a continuously evolving vocabulary. Hence, designing manual rules to align slot-values to the query words for e-commerce queries requires a continuous effort, which is expensive.

2.3.5 Entity linking

Entity linking, also known as record linkage or entity resolution, involves mapping a named entity to an appropriate entry in a knowledge base. Some of the earliest work on entity linking [14, 21] relies upon the lexical similarity between the context of the named entity and the features derived from its Wikipedia page. The task most relevant to the problems addressed in this thesis is the optional entity linking task [81, 58], in which the systems can only use the attributes in the knowledge base; this corresponds to the task of updating a knowledge base with no 'backing' text, such as Wikipedia text. However, our setup is stricter because of the absence of entity attributes and the lack of lexical context, as most e-commerce queries are concise.

2.4 Importance assessment in scholarly networks

The research areas relevant to the work presented in this thesis are citation indexing, citation recommendation, and link prediction. We briefly discuss these areas below:

2.4.1 Citation Indexing

A citation index indexes the links between publications that authors make when they cite other publications. Citation indexes aim to improve the dissemination and retrieval of scientific literature. CiteSeer [30, 68] was the first automated citation indexing system; it works by downloading publications from the Web and converting them to text. It then parses the papers to extract the citations and the context in which the citations are made in the body of the paper, storing this information in a database. Other examples of popular citation indices include Google Scholar (https://scholar.google.com/), Web of Science (http://www.webofknowledge.com/) by Clarivate Analytics, Scopus (https://www.scopus.com/) by Elsevier, and Semantic Scholar (https://www.semanticscholar.org/). Some examples of subject-specific citation indices include INSPIRE-HEP (https://inspirehep.net/), which covers high energy physics, PubMed (https://pubmed.ncbi.nlm.nih.gov/), which covers life sciences and biomedical topics, and the Astrophysics Data System (http://ads.harvard.edu/), which covers astronomy and physics.

2.4.2 Citation recommendation

Citation recommendation describes the task of recommending citations for a given text. It is an essential task, as all claims written by the authors need to be backed up in order to ensure reliability and truthfulness. The approaches developed for citation recommendation can be grouped into four groups [28]: hand-crafted feature based approaches, topic-modeling based approaches, machine-translation based approaches, and neural-network based approaches. Hand-crafted feature based approaches rely on features that are manually engineered by the developers. For example, the text similarity between the citation context and the candidate papers can be used as one of the text-based features. Examples of papers that propose hand-crafted feature based approaches include [28, 48, 71, 72, 109]. Topic-modeling based approaches represent the candidate papers' text and the citation contexts by means of abstract topics, thereby exploiting the latent semantic structure of texts. Examples of topic-modeling based approaches include [47, 62]. The machine-translation based approaches apply the idea of translating the citation context into the cited document to find the candidate papers worth citing. Examples in this category include [45, 53]. Finally, popular examples of neural-network based models include [27, 43, 54, 66, 117, 131].

2.4.3 Link-prediction

A link is a connection between two nodes in a network. As such, link-prediction is the problem of predicting the existence of a link between two nodes in a network. A good link-prediction model predicts the likelihood of a link between two nodes, so it can not only be used to predict new links, but also to curate the graph by filtering out less-likely links that are already present. Thus, link-prediction can be a useful tool to find likely citations in a citation network. The citation recommendation task described previously can be thought of as a special case of link-prediction. Following the taxonomy described in [80], link-prediction approaches can be broadly categorized into three categories: similarity-based approaches, probabilistic and statistical approaches, and algorithmic approaches. The similarity-based approaches assume that nodes tend to form links with other similar nodes, and that two nodes are similar if they are connected to similar nodes or are near each other in the network according to a given similarity function. Examples of popular similarity functions include the number of common neighbors [69], the Adamic-Adar index [1], etc. The probabilistic and statistical approaches assume that the network has a known structure. These approaches estimate the model parameters of the network structure using statistical methods, and use these parameters to calculate the likelihood of the presence of a link between two nodes. Examples of probabilistic and statistical approaches include [19, 40, 55, 123]. Algorithmic approaches directly use link-prediction as supervision to build the model. For example, the link-prediction task can be formulated as a binary classification task where the positive instances are the pairs of nodes that are connected in the network, and the negative instances are the unconnected nodes. Examples include [83, 11]. Unsupervised or self-supervised node embedding approaches (such as DeepWalk [96] and node2vec [39]) followed by training a binary classifier, as well as graph neural network approaches such as GraphSAGE [42], belong to this category.
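For concreteness, the following is a toy sketch of the two neighborhood-based similarity scores mentioned above (common-neighbor count and the Adamic-Adar index) on a small fabricated undirected graph; it is illustrative only.

```python
import math

# Toy sketch of neighborhood-based link-prediction scores: common-neighbor
# count and the Adamic-Adar index. The graph is fabricated; nodes could be
# papers in a citation network.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d", "e"},
    "d": {"a", "c"},
    "e": {"c"},
}

def common_neighbors(u, v):
    return len(graph[u] & graph[v])

def adamic_adar(u, v):
    # Shared neighbors with small degree contribute more to the score.
    return sum(1.0 / math.log(len(graph[z])) for z in graph[u] & graph[v])

print(common_neighbors("b", "d"), round(adamic_adar("b", "d"), 3))
```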

Chapter 3

Distributed representation of multi-sense words

Many NLP tasks benefit by embedding the words of a collection into a low dimensional space in a way that captures their syntactic and semantic information. Such NLP tasks include analogy/similarity questions [86], part-of-speech tagging [4], named entity recognition [5], machine translation [139, 87], etc. Distributed representations of words are real-valued, low dimensional embeddings based on the distributional properties of words in large samples of the language data. However, representing each word by a single vector does not properly model the words that have multiple senses (i.e., polysemous and homonymous words). For multi-sense words, a single representation leads to a vector that is the amalgamation of all its different senses, which can lead to ambiguity. To address this problem, models have been developed to estimate a different representation for each of the senses of multi-sense words. The common idea utilized by these models is that if the words have different senses, then they tend to co-occur with different sets of words. The models proposed by Reisinger and Mooney [105], Huang et al. [52] and the Multiple-Sense Skip-Gram (MSSG) model of Neelakantan et al. [91] estimate a fixed number of representations per word, without discriminating between the single-sense and multi-sense words. As a result, these approaches fail to identify the right number of senses per word and estimate multiple representations for the words that have a single sense.

In addition, these approaches cluster the occurrences without taking into consideration the diversity of words that occur within the contexts of these occurrences (explained in Section 2.1). The Non-Parametric Multiple-Sense Skip-Gram (NP-MSSG) model [91] estimates a varying number of representations for each word but uses the same clustering approach and hence, is not effective in taking into consideration the diversity of words that occur within the same context. We present an extension to the Skip-Gram model of Word2Vec to accurately and efficiently estimate a vector representation for each sense of multi-sense words. Our model relies on the fact that, given a word, the Skip-Gram model's loss associated with predicting the words that co-occur with that word should be greater when that word has multiple senses than when it has a single sense. This information is used to identify the words that have multiple senses and estimate a different representation for each of the senses. These representations are estimated using the Skip-Gram model by first clustering the occurrences of the multi-sense words while accounting for the diversity of the words in these contexts. We evaluated the performance of our model for the contextual similarity task on Stanford's Contextual Word Similarities (SCWS) dataset. When comparing the most likely contextual sense of words, our model was able to achieve approximately 13% and 10% improvement over the NP-MSSG and MSSG approaches, respectively. In addition, our qualitative evaluation shows that our model does a better job of identifying the words that have multiple senses than the competing approaches.

3.1 Definitions and Notations

Distributed representations of words quantify the syntactic and semantic relations among the words based on their distributional properties in large samples of the language data. The underlying assumption is that co-occurring words should be similar to each other. We say that the word $w_j$ co-occurs with the word $w_i$ if $w_j$ occurs within a window around $w_i$. The context of $w_i$ corresponds to the set of words that co-occur with $w_i$ within a window and is represented by $C(w_i)$. The state-of-the-art technique to learn the distributed representation of words is Word2Vec. The word vector representations produced by Word2Vec are able to capture fine-grained semantic and syntactic regularities in the language data. Word2Vec provides two models to learn word vector representations. The first is the Continuous Bag-of-Words model, which involves predicting a word using its context. The second is called the Continuous Skip-Gram model, which involves predicting the context using the current word. To estimate the word vectors, Word2Vec trains a simple neural network with a single hidden layer to perform the following task: given an input word $w_i$, the network computes the probability for every word in the vocabulary of being in the context of $w_i$. The network is trained such that, if it is given $w_i$ as an input, it will give a higher probability to $w_j$ in the output layer than to $w_k$ if $w_j$ occurs in the context of $w_i$ but $w_k$ does not. The set of all words in the vocabulary is represented by $V$. The vector associated with the word $w_i$ is denoted by $\mathbf{w}_i$. The vector corresponding to the word $w_i$ when it is used in a context is denoted by $\tilde{\mathbf{w}}_i$. The size of the word vector $\mathbf{w}_i$ or the context vector $\tilde{\mathbf{w}}_i$ is denoted by $d$. The objective function for the Skip-Gram model with negative sampling is given by [32]

$$\text{minimize} \;\; -\sum_{i=1}^{|V|} \left( \sum_{w_j \in C(w_i)} \log \sigma(\langle \mathbf{w}_i, \tilde{\mathbf{w}}_j \rangle) + \sum_{\substack{k \in R(m,|V|) \\ w_k \notin C(w_i)}} \log \sigma(-\langle \mathbf{w}_i, \tilde{\mathbf{w}}_k \rangle) \right),$$

where $R(m, n)$ denotes a set of $m$ random numbers from the range $[1, n]$ (negative samples), $\langle \mathbf{w}_i, \tilde{\mathbf{w}}_j \rangle$ is the dot product of $\mathbf{w}_i$ and $\tilde{\mathbf{w}}_j$, and $\sigma(\langle \mathbf{w}_i, \tilde{\mathbf{w}}_j \rangle)$ is the sigmoid function. The parameters of the model are estimated using Stochastic Gradient Descent (SGD), in which, for each iteration, the model makes a single pass through every word in the training corpus (say $w_i$) and gathers the context words within a window. The negative samples are sampled from a probability distribution that favors the frequent words. The model also down-samples the frequent words using a hyper-parameter called the sub-sampling parameter.
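The per-word loss above can be sketched in a few lines of numpy. The toy vocabulary, embedding size, and random initialization below are made up for illustration; this is not the Word2Vec implementation itself.

```python
import numpy as np

rng = np.random.default_rng(0)
d, V = 8, 100                          # toy embedding size and vocabulary size
W = rng.normal(size=(V, d)) * 0.1      # word vectors w_i
W_ctx = rng.normal(size=(V, d)) * 0.1  # context vectors w~_i

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def skipgram_loss(i, context_ids, m=5):
    """Negative-sampling loss for one occurrence of word i."""
    loss = 0.0
    for j in context_ids:                     # observed context words
        loss -= np.log(sigmoid(W[i] @ W_ctx[j]))
    negatives = rng.integers(0, V, size=m)    # random negative samples
    for k in negatives:
        if k not in context_ids:
            loss -= np.log(sigmoid(-(W[i] @ W_ctx[k])))
    return loss

print(skipgram_loss(3, [10, 42, 7]))
```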

3.2 Loss driven multisense identification (LDMI)

In order to address the limitations of the existing models, we developed an extension to the Skip-Gram model that combines two ideas. The first is to identify the multi-sense 21 words and the second is to cluster the occurrences of the identified words such that the clustering correctly accounts for the variation among the words that occur within the same context. We explain these parts as follows:

3.2.1 Identifying the words with multiple senses

For the Skip-Gram model, the loss associated with an occurrence of wi is ! X X L(wi) = − log σ(hwi, w˜ji) + log σ(−hwi, w˜ki) .

wj ∈C(wi) k∈R(m,|V |)wk∈/C(wi)

The model minimizes L(wi) by increasing the probability of the co-occurrence of wj and wi if wj is present in the context of wi and decreasing the probability of the co- occurrence of wk and wi if wk is not present in the context of wi. This happens by aligning the directions of wi and w˜j closer to each other and aligning the directions of wi and w˜k farther from each other. At the end of the optimization process, we expect that the co-occurring words have their vectors aligned closer in the vector space. However, consider the polysemous word bat. We expect that the vector representation of bat is aligned in a direction closer to the directions of the vectors representing the terms like ball, baseball, sports etc. (the sense corresponding to sports). We also expect that the vector representation of bat is aligned in a direction closer to the directions of the vectors representing the terms like animal, batman, nocturnal etc. (the sense corresponding to animals). But at the same time, we do not expect that the directions of the vectors representing the words corresponding to the sports sense are closer to the directions of the vectors representing the words corresponding to the animal sense. This leads to the direction of the vector representing bat lying in between the directions of the vectors representing the words corresponding to the sports sense and the directions of the vectors representing the words corresponding to the animal sense. Consequently, the multi-sense words will tend to contribute more to the overall loss than the words with a single sense. Having a vector representation for each sense of the word bat will avoid this scenario, as each sense can be considered as a new single-sense word in the vocabulary. Hence, the loss associated with a word provides us information regarding whether a word has 22 multiple senses or not. LDMI leverages this insight to identify a word wi as multi- sense if the average L(wi) across all its occurrences is more than a threshold. However,

L(wi) has a random component associated with it, in the form of negative samples. We found that, in general, infrequent words have higher loss as compared to the frequent words. This can be attributed to the fact that given a random negative sample while calculating the loss, there is a greater chance that the frequent words have already seen this negative sample before during the optimization process as compared to the infrequent words. This way, infrequent words end up having higher loss than frequent words. Therefore, for the selection purposes, we ignore the loss associated with negative samples. We denote the average loss associated with the prediction of the context words + in an occurrence of wi as L (wi) and define it as

+ 1 X L (wi) = − log σ(hwi, w˜ji). |C(wi)| wj ∈C(wi) + We describe L (wi) as the contextual loss associated with an occurrence of wi. To identify the multi-sense words, LDMI performs a few iterations to optimize the loss function on the text dataset, and shortlist the words with average contextual loss + (average L (wi) across all the occurrences of the wi) that is higher than a threshold. These shortlisted words represent the identified multi-sense words, which form the input of the second step described in the next section.

3.2.2 Clustering the occurrences

To assign senses to the occurrences of each of the identified multi-sense words, LDMI clusters the occurrences of each such word so that each cluster corresponds to a particular sense.

The clustering solution employs the $I_1$ criterion function [137], which maximizes the objective function of the form

\text{maximize} \; \sum_{i=1}^{k} n_i Q(S_i), \qquad (3.1)

where Q(Si) is the quality of cluster Si whose size is ni. We define Q(Si) as

Q(S_i) = \frac{1}{n_i^2} \sum_{u, v \in S_i} \text{sim}(u, v),

where sim(u, v) denotes the similarity between the occurrences u and v, and is given by

\text{sim}(u, v) = \frac{1}{|C(u)|\,|C(v)|} \sum_{x \in C(u)} \sum_{y \in C(v)} \cos(x, y). \qquad (3.2)

According to Equation (3.2), LDMI measures the similarity between two occurrences as the average of the pairwise cosine similarities between the words belonging to the contexts of these occurrences. This approach considers the variation among the words that occur within the same context. We can simplify Equation (3.1) to the following equation

\text{maximize} \; \sum_{i=1}^{k} \frac{1}{n_i} \left\lVert \sum_{u \in S_i} \sum_{x \in C(u)} \frac{x}{\lVert x \rVert_2} \right\rVert_2^{2}.

LDMI maximizes this objective function using a greedy incremental strategy [137].
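The sketch below illustrates one way such a greedy incremental strategy could look, assuming each occurrence has already been summarized by the sum of its L2-normalized context-word vectors; the naive objective recomputation and all names are illustrative assumptions rather than the thesis implementation.

```python
# Sketch: greedy incremental maximization of the I1 criterion over a word's
# occurrences. `occ_vecs` is an (n_occ, dim) numpy array, one row per occurrence.
import numpy as np

def i1_objective(occ_vecs, labels, k):
    total = 0.0
    for i in range(k):
        members = occ_vecs[labels == i]
        if len(members) > 0:
            total += np.linalg.norm(members.sum(axis=0)) ** 2 / len(members)
    return total

def cluster_occurrences(occ_vecs, k=2, n_sweeps=10, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(occ_vecs))
    best = i1_objective(occ_vecs, labels, k)
    for _ in range(n_sweeps):
        moved = False
        for u in range(len(occ_vecs)):
            current = labels[u]
            for target in range(k):
                if target == current:
                    continue
                labels[u] = target                     # try moving occurrence u
                obj = i1_objective(occ_vecs, labels, k)  # recomputed naively for clarity
                if obj > best:
                    best, current, moved = obj, target, True
                else:
                    labels[u] = current                # undo the move
        if not moved:
            break
    return labels
```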

3.2.3 Putting everything together

LDMI is an iterative algorithm with two steps in each iteration. The first step is to perform a few SGD iterations to optimize the loss function. In the second step, it calculates the contextual loss associated with each occurrence of each word and identifies the words whose average contextual loss is more than a threshold. It then clusters the occurrences of each identified multi-sense word into two clusters (k = 2) as per the clustering approach discussed earlier. The algorithm terminates after a fixed number of iterations. Running LDMI for x iterations can estimate a maximum of 2^x senses for each word.
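Putting the pieces together, a high-level sketch of this iterative loop (reusing the sketches above) might look as follows; `run_sgd_iterations`, `occurrence_context_vectors`, and `split_word_senses` are hypothetical helpers standing in for the SGD optimization, the construction of per-occurrence context vectors, and the replacement of a word by sense-specific vocabulary entries.

```python
# Sketch of the LDMI outer loop; the `model` attributes and helper functions
# below are hypothetical placeholders, not the thesis code.
def ldmi(corpus, model, n_outer_iters=3, loss_threshold=2.15):
    for _ in range(n_outer_iters):
        run_sgd_iterations(corpus, model, n_iters=5)             # step 1: optimize the loss
        flagged = shortlist_multisense(model.occurrences,         # step 2: find multi-sense words
                                       model.in_vecs, model.out_vecs,
                                       loss_threshold)
        for w in flagged:
            occ_vecs = occurrence_context_vectors(model, w)
            labels = cluster_occurrences(occ_vecs, k=2)           # two clusters per iteration
            split_word_senses(model, corpus, w, labels)           # each cluster becomes a new "word"
    return model
```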

3.3 Experimental methodology

3.3.1 Datasets

We train LDMI on two corpora of varying sizes: the Wall Street Journal (WSJ) dataset [44] and Google's One Billion Word (GOBW) dataset [15]. In preprocessing, we removed all the words that contained a number or did not contain any alphabetic characters, and converted the remaining words to lower case.

Table 3.1: Dataset statistics.

Dataset   Vocabulary size   Total words
WSJ       88,118            62,653,821
GOBW      73,443            710,848,599

For WSJ, we removed all the words with frequency less than 10, and for GOBW, we removed all the words with frequency less than 100. The statistics of these datasets after preprocessing are presented in Table 3.1. We use Stanford's Contextual Word Similarities (SCWS) dataset [52] for evaluation on the contextual word similarity task. SCWS contains human judgments on pairs of words (2,003 in total) presented in sentential context. The word pairs are chosen so as to reflect interesting variations in meanings. When contextual information is not present, different people can consider different senses when giving a similarity judgment. Therefore, having representations for all the senses of a word can help us find similarities that align better with the human judgments, as compared to having a single representation per word. To investigate this, we evaluated our model on the WordSim-353 dataset [29], which consists of 353 pairs of nouns, without any contextual information. Each pair has an associated similarity score averaged over human judgments.

3.3.2 Evaluation methodology and metrics

Baselines.

We compare the LDMI model with the MSSG and NP-MSSG approaches as they are also built on top of the Skip-Gram model. As mentioned earlier, MSSG estimates the vectors for a fixed number of senses per word, whereas NP-MSSG discovers a varying number of senses per word. To illustrate the advantage of using the clustering with the I1 criterion over the clustering approach used by the competing models, we also compare LDMI with LDMI-SK. LDMI-SK uses the same approach to select the multi-sense words as LDMI, but instead of clustering with the I1 criterion, it uses spherical K-means [24].


Figure 3.1: Distribution of the average contextual loss for all words (Words on the x-axis are sorted in order of their loss).

Parameter selection.

For all our experiments, we consider 10 negative samples and a symmetric window of 10 words. The sub-sampling parameter is 10^{-4} for both datasets. To avoid clustering the infrequent words and stop-words, we only consider the words within a frequency range when selecting the multi-sense words. For the WSJ dataset, we consider the words with frequency between 50 and 30,000, while for the GOBW dataset, we consider the words with frequency between 500 and 300,000. For the WSJ dataset, we consider only 50-dimensional embeddings, while for the GOBW dataset, we consider 50-, 100- and 200-dimensional embeddings. The model checks for multi-sense words after every 5 iterations. We selected our hyperparameter values by a small amount of manual exploration to get the best performing model. To decide the threshold on the average contextual loss used to select the multi-sense words, we consider the distribution of the average contextual loss after running an iteration of Skip-Gram. For example, Fig. 3.1 shows the average contextual loss of every word in the vocabulary for the GOBW dataset and the 50-dimensional embeddings. We can see that there is an increase in the average contextual loss around 2.0-2.4. We experiment around this range to select a loss threshold for which our model performs best. For the experiments presented in this chapter, this threshold is set to 2.15 for WSJ (50-dimensional embeddings), and 2.15, 2.10 and 2.05 for GOBW, corresponding to the 50-, 100- and 200-dimensional embeddings, respectively. With increasing dimensionality of the vectors, we are able to model the information in the dataset better, which leads to a relatively lower loss. For the MSSG and NP-MSSG models, we use the same hyperparameter values as used by Neelakantan et al. [91]. For MSSG, the number of senses is set to 3. Increasing the number of senses involves a compromise between getting the correct number of senses for some words and noisy senses for others. For NP-MSSG, the maximum number of senses is set to 10 and the parameter λ is set to −0.5 (a new sense cluster is created if the similarity of an occurrence to the existing sense clusters is less than λ). The models are trained using SGD with AdaGrad [26], with 0.025 as the initial learning rate, and we run 15 iterations.

Metrics.

For evaluation, we use the similarities calculated by our model and sort them to create an ordering among all the word-pairs. We compare this ordering against the one obtained from the human judgments. To do this comparison, we use the Spearman rank correlation (ρ). A higher Spearman rank correlation corresponds to better agreement between the respective orderings. For the SCWS dataset, to measure the similarity between two words given their sentential contexts, we use two different metrics [105]. The first is maxSimC, which, for each word in the pair, identifies the sense of the word that is most similar to its context and then compares those two senses. It is computed as

\text{maxSimC}(w_1, w_2, C(w_1), C(w_2)) = \cos(\hat{\pi}(w_1), \hat{\pi}(w_2)),

where \hat{\pi}(w_i) is the vector representation of the sense that is most similar to C(w_i). As in Equation (3.2), we measure the similarity between x and C(w_i) as

\text{sim}(x, C(w_i)) = \frac{1}{Z} \sum_{y \in C(w_i)} \sum_{j=1}^{m(y)} \cos(x, V(y, j)),

where Z = \sum_{y \in C(w_i)} m(y), m(y) is the number of senses discovered for the word y, and V(y, j) is the vector representation associated with the jth sense of the word y. For simplicity, we consider all the senses of the words in the sentential context for the similarity calculation. The second metric is avgSimC, which calculates the similarity between the two words as the weighted average of the similarities between each of their senses. It is computed as

\text{avgSimC}(w_1, w_2, C(w_1), C(w_2)) = \sum_{i=1}^{m(w_1)} \sum_{j=1}^{m(w_2)} \Pr(w_1, i, C(w_1)) \Pr(w_2, j, C(w_2)) \cos(V(w_1, i), V(w_2, j)),

where \Pr(x, i, C(x)) is the probability that x takes the ith sense given the context C(x). We calculate \Pr(x, i, C(x)) as

\Pr(x, i, C(x)) = \frac{1}{N} \left( \frac{1}{1 - \text{sim}(V(x, i), C(x))} \right),

where N is the normalization constant so that the probabilities add to 1. Note that the maxSimC metric models the similarity between two words with respect to the most probable identified sense for each of them. If there are noisy senses as a result of overclustering, maxSimC will penalize them. Hence, maxSimC is a stricter metric compared to avgSimC. For the WordSim-353 dataset, we used the avgSim metric, which is qualitatively similar to avgSimC but does not take contextual information into consideration. The avgSim metric is calculated as

\text{avgSim}(w_1, w_2) = \frac{1}{m(w_1)\, m(w_2)} \sum_{i=1}^{m(w_1)} \sum_{j=1}^{m(w_2)} \cos(V(w_1, i), V(w_2, j)).

For qualitative analysis, we look into the similar words corresponding to different senses for some of the words identified as multi-sense by LDMI and compare them to the ones discovered by the competing approaches.
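For illustration, a minimal sketch of the avgSim and maxSimC computations is given below, assuming `senses[w]` maps each word to a list of its sense vectors; the helper names and data layout are assumptions, not the evaluation code used in this chapter.

```python
# Sketch of avgSim (WordSim-353) and maxSimC (SCWS) given per-sense vectors.
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def avg_sim(senses, w1, w2):
    # Unweighted average similarity over all sense pairs.
    return np.mean([cos(v1, v2) for v1 in senses[w1] for v2 in senses[w2]])

def sim_to_context(x, context, senses):
    # Average similarity of vector x to all senses of all context words.
    vecs = [v for y in context if y in senses for v in senses[y]]
    return np.mean([cos(x, v) for v in vecs])

def max_sim_c(senses, w1, c1, w2, c2):
    # Pick, for each word, the sense most similar to its own sentential
    # context, then compare those two senses.
    p1 = max(senses[w1], key=lambda v: sim_to_context(v, c1, senses))
    p2 = max(senses[w2], key=lambda v: sim_to_context(v, c2, senses))
    return cos(p1, p2)
```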

Table 3.2: Results for the Spearman rank correlation (ρ × 100).

Dataset  Model      d    maxSimC (SCWS)  avgSimC (SCWS)  avgSim (WordSim-353)
WSJ      Skip-Gram  50   57.0            57.0            54.9
WSJ      MSSG       50   41.4            56.3            50.5
WSJ      NP-MSSG    50   33.0            52.2            47.4
WSJ      LDMI-SK    50   57.1            57.9            55.2
WSJ      LDMI       50   57.9            58.9            56.8
GOBW     Skip-Gram  50   60.1            60.1            62.0
GOBW     MSSG       50   50.0            59.6            57.1
GOBW     NP-MSSG    50   48.2            60.0            58.9
GOBW     LDMI-SK    50   60.1            60.6            62.8
GOBW     LDMI       50   60.6            61.2            63.8
GOBW     Skip-Gram  100  61.7            61.7            64.3
GOBW     MSSG       100  53.4            62.6            60.4
GOBW     NP-MSSG    100  47.9            63.3            61.7
GOBW     LDMI-SK    100  61.9            62.4            64.9
GOBW     LDMI       100  62.2            63.1            65.3
GOBW     Skip-Gram  200  63.1            63.1            65.4
GOBW     MSSG       200  54.7            64.0            64.2
GOBW     NP-MSSG    200  51.5            64.1            62.8
GOBW     LDMI-SK    200  63.3            63.9            66.4
GOBW     LDMI       200  63.9            64.4            66.8

3.4 Results and discussion

3.4.1 Quantitative analysis

Table 3.2 shows the Spearman rank correlation (ρ) on the SCWS and WordSim-353 datasets for various models and different vector dimensions. For all the vector dimensions, LDMI performs better than the competing approaches on the maxSimC metric. For the GOBW dataset, LDMI shows an average improvement of about 13% over NP-MSSG and 10% over MSSG on the maxSimC metric, where the average is taken over all vector dimensions. This shows the advantage of LDMI over the competing approaches. For the avgSimC metric, LDMI performs at par with the competing approaches.


Figure 3.2: Distribution of the number of senses.

The other approaches are not as effective in identifying the correct number of senses, leading to noisy clusters and hence poor performance on the maxSimC metric. LDMI also performs better than LDMI-SK on both maxSimC and avgSimC, demonstrating the effectiveness of the clustering approach employed by LDMI over spherical K-means. Similarly, LDMI performs better than the other approaches on the avgSim metric for the WordSim-353 dataset in all the cases, further demonstrating the advantage of LDMI. Fig. 3.2 shows the distribution of the number of senses discovered by the LDMI, LDMI-SK and NP-MSSG models for the GOBW dataset and 200-dimensional embeddings. We can see that LDMI and LDMI-SK discover 88% of the words as single-sense, while NP-MSSG discovers 63% of the words as single-sense. In addition, we used the Kolmogorov-Smirnov two-sample test to assess if LDMI's performance advantage over Skip-Gram is statistically significant. We performed the test on the maxSimC and avgSimC metrics corresponding to 1,000 runs each of LDMI and Skip-Gram on the WSJ dataset. Under the null hypothesis that the two samples are derived from the same distribution, the resulting p-value (≈ 10^{-8}) shows that the difference is statistically significant for both the maxSimC and avgSimC metrics. Similarly, the difference between LDMI's and LDMI-SK's performance is also found to be statistically significant.
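A sketch of this significance test using SciPy's two-sample Kolmogorov-Smirnov test is shown below; the argument names are placeholders for the per-run metric values collected from the repeated runs.

```python
# Sketch of the two-sample Kolmogorov-Smirnov significance test.
from scipy.stats import ks_2samp

def significance_test(ldmi_scores, skipgram_scores, alpha=0.01):
    # ldmi_scores / skipgram_scores: per-run metric values (e.g., maxSimC x 100)
    # from repeated runs of each model on the same dataset.
    statistic, p_value = ks_2samp(ldmi_scores, skipgram_scores)
    return p_value < alpha, p_value
```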

Table 3.3: Top similar words for different senses of the multi-sense words*.

Word    Similar words                                                       Sense
figure  status; considered; iconoclast; charismatic; stature; known         leader
        calculate; understand; know; find; quantify; explain; how; tell     deduce
        doubling; tenth; average; percentage; total; cent; gdp; estimate    numbers
cool    breezy; gentle; chill; hot; warm; chilled; cooler; sunny; frosty    weather
        pretty; liking; classy; quite; nice; wise; fast; nicer; okay; mad   expression
block   amend; revoke; disallow; overturn; thwart; nullify; reject          hindrance
        alley; avenue; waterside; duplex; opposite; lane; boulevard         address
digest  eat; metabolize; starches; reproduce; chew; gut; consume            food
        editor; guide; penguin; publisher; compilers; editions; paper       magazine
head    arm; shoulder; ankles; neck; throat; torso; nose; limp; toe         body
        assistant; associate; deputy; chief; vice; executive; adviser       organization

* Different lines in a row correspond to different senses.

3.4.2 Qualitative analysis

In order to evaluate the actual senses that the different models identify, we look into the similar words corresponding to different senses for some of the words identified as multi-sense by LDMI, and we compare these discovered senses with those of the competing approaches. Table 3.3 shows the similar words (according to cosine similarity) with respect to some of the words that LDMI identified as multi-sense and for which it estimated a different vector representation for each sense. The results correspond to the 50-dimensional embeddings for the GOBW dataset. The table illustrates that LDMI is able to identify meaningful senses. For example, it is able to identify two senses of the word digest, one corresponding to the food sense and the other to the magazine sense. For the word block, it is able to identify two senses, corresponding to the hindrance and address senses. Table 3.4 shows the similar words with respect to the senses of digest and block identified by the competing approaches. We can see that LDMI is able to identify more comprehensible senses for digest and block, compared to MSSG and NP-MSSG. Compared to LDMI, LDMI-SK finds redundant senses for the word digest, but overall, the senses found by LDMI-SK are comparable to the ones found by LDMI.

Table 3.4: Senses discovered by the competing approaches*.

digest (Skip-Gram)                               block (Skip-Gram)
nutritional; publishes; bittman; reader          annex; barricade; snaked; curving; narrow

digest (MSSG)                                    block (MSSG)
comenu; ponder; catch; turn; ignore              street; corner; brick; lofts; lombard; wall
areat; grow; tease; releasing; warts             yancey; linden; calif; stapleton; spruce; ellis
nast; conde; blender; magazine; edition          bypass; allow; clears; compel; stop

digest (NP-MSSG)                                 block (NP-MSSG)
guide; bible; ebook; danielle; bookseller        acquire; pipeline; blocks; stumbling; owner
snippets; find; squeeze; analyze; tease          override; approve; thwart; strip; overturn
eat; ingest; starches; microbes; produce         townhouse; alley; blocks; street; entrance
oprah; cosmopolitan; editor; conde; nast         mill; dix; pickens; dewitt; woodland; lane
disappointing; ahead; unease; nervousness        slices; rebounded; wrestled; effort; limit
observer; writing; irina; reveals; bewildered    target; remove; hamper; remove; binding
                                                 hinder; reclaim; thwart; hamper; stop
                                                 side; blocks; stand; walls; concrete; front
                                                 approve; enforce; overturn; halted; delay
                                                 inside; simply; retrieve; track; stopping

digest (LDMI-SK)                                 block (LDMI-SK)
almanac; deloitte; nast; wired; guide            cinder; fronted; avenue; flagstone; bricks
sugars; bacteria; ingest; enzymes; nutrients     amend; blocking; withhold; bypass; stall
liking; sort; swallow; find; bite; whole         find; fresh; percolate; tease; answers

* Different lines in a row correspond to different senses.

We present the concluding remarks in Chapter 8.

Chapter 4

Credit attribution in multilabel documents

A document can be visualized as a sequence of semantically-coherent segments. Text segmentation refers to the task of breaking down documents into these segments. Segmenting the text into these semantically-coherent segments is useful in many natural language processing tasks [49]: it can improve information retrieval (by indexing/recognizing documents more precisely or by returning the specific part of a document corresponding to the query as a result) and text summarization (by including information from each of the classes present in a document). In the case of multilabel documents, these segments can be mapped to one or more of the class-labels associated with the document. For example, Fig. 4.1 shows the plot summary of the 1933 movie Blondie Johnson. The movie belongs to the crime/drama genre. We can segment the plot summary text into segments belonging individually to the crime genre and the drama genre (red, italicized text corresponds to the crime genre and black, unitalicized text belongs to the drama genre). Manually generating these segments and annotating them with the corresponding classes in a document is a tedious and expensive task. The task of associating individual parts in a document with their most appropriate classes is termed credit attribution [102]. As discussed in Section 2.2, many distant-supervised methods have been developed to address the problem of credit attribution. The prior distant-supervised approaches use topic modelling to associate individual words/sentences in a document with their

Set during the Great Depression, Blondie Johnson quits her job after a co-worker sexually harasses her. She next is evicted with her sick mother, but cannot get relief. After her mother dies, Blondie is determined to become rich. She soon gets involved in the criminal circuit. She falls in love with a gangster, whom she convinces to take down his boss. Blondie eventually climbs up the criminal ladder, becoming boss to the "little navy" gang.

Figure 4.1: Plot summary of the movie Blondie Johnson, belonging to the crime/drama genre. Red, italicized text corresponds to the crime genre and black, unitalicized text belongs to the drama genre.

most appropriate topics (classes). Methods like Labeled Latent Dirichlet Allocation (LLDA) [102], Partially Labeled Dirichlet Allocation (PLDA) [103] and Multi-Label Topic Model (MLTM) [113] belong to this category. These topic-modeling based methods model the document as a bag of words/sentences and do not take into consideration that the sentences occurring together in a document tend to talk about the same topic, and hence have a higher chance of belonging to the same class. Moreover, these methods are built on top of Latent Dirichlet Allocation (LDA), whose high computational complexity makes them undesirable for use in real applications. In this chapter, we present distant-supervised methods to accurately and efficiently solve the credit attribution problem. Our methods exploit the structure within the documents by relying on the fact that sentences occurring together in a document have a high chance of belonging to the same class. We present two classes of methods: (i) dynamic-programming based methods, and (ii) attention-mechanism based methods. The dynamic-programming based methods employ an iterative approach. In each iteration, they estimate a model for predicting the classes of an arbitrary-length text segment, then use dynamic programming to segment the document into semantically-coherent chunks using the estimated model, and fine-tune the model with these segments. Every iteration provides a more precise classification model and hence better segmentation. These dynamic-programming based methods cannot model the case where sentences can belong to multiple classes; thus, they cannot correctly model semantically similar classes. Therefore, we also present another class of methods that addresses these limitations, as described next. The attention-mechanism based method uses a neural network that models multi-topic segments. It uses the class labels of a multilabel document as the source of distant supervision to assign class labels to the individual sentences of the document. It leverages the attention mechanism to compute weights that establish the relevance of a sentence for each of the classes. The attention weights, which can be interpreted as a probability distribution over the classes, allow our model to capture semantically similar classes by modeling each sentence as a distribution over the classes, instead of mapping it to just one class. In addition, our approach leverages a simple average pooling layer to constrain the neighboring sentences to belong to the same class by smoothing their class distributions. Our approach uses an end-to-end learning framework to combine the attention mechanism with the multilabel classifier. We evaluated the performance of our methods on five datasets belonging to varying domains. On the credit attribution task, our best performing method has an average performance gain of 7.5% as compared to the competing approaches. On the multilabel classification task, our methods also perform better than the competing approaches. The performance of our best performing method with respect to the F1 score between the predicted and the actual classes is on average 4.1% better than the baselines.
While our discussion and evaluation focus on credit attribution in documents, our methods are not restricted to this domain. For example, our methods can be extended to multilabel images, where it is expensive to manually annotate the bounding boxes corresponding to various segments. The remainder of the chapter is organized as follows. Section 4.1 presents the definitions, notations and background used in the chapter. The chapter discusses the proposed dynamic-programming based methods in Section 4.2 and the attention-mechanism based methods in Section 4.3, followed by the experiments in Section 4.4. Section 4.5 discusses the results.

4.1 Definitions and Notations

Let C be a set of classes and D be a set of multilabel documents. For each document d ∈ D, let Ld ⊆ C be its set of classes and let dq be the number of sentences that it contains. The approach developed in this chapter assumes that in multilabel documents,

Table 4.1: Notation used throughout the chapter.

Symbol                       Description
D                            Collection of multilabel documents
C                            Set of classes associated with D
d                            A document from the collection D
dq                           Number of sentences in the document d
Sd                           Segmentation of the document d
du                           Number of segments in the segmentation Sd
d[i]                         ith sentence of the document d
d[i, . . . , j]              Subsequence of the sentences in the document d
Ld                           Set of classes from which each segment in Sd needs to be annotated
y(d, c)                      Binary indicator of whether the class c is present in the document d
s(d, c)                      Probability of the document d belonging to the class c
Hd                           Sequence of classes corresponding to the segmentation Sd
γ                            Segment creation penalty
score(d[i, . . . , j], c)    Score (likelihood) that the segment d[i, . . . , j] belongs to the class c
kw(x)                        Key vector for the word x
vw(x)                        Value vector for the word x
kd(i)                        Key vector for the ith sentence in the document d
vd(i)                        Value vector for the ith sentence in the document d
a(d, i, c)                   Attention-weight of the ith sentence of d for the class c
rd(c)                        Class-specific representation of the document d for the class c
LC(D)                        Weighted binary cross-entropy loss associated with multilabel classification of the documents
LS(D)                        Attention loss, which penalizes the attention on the absent classes in the documents
α                            Hyperparameter to control the relative contribution of LC(D) and LS(D) towards the final loss
β                            Hyperparameter to control the relative contribution of attention-weights and the document's classification probability towards the predicted class of a sentence

each sentence can be labeled with class label(s) from that document. In particular, given a document d, we assume that each sentence d[i] can be labeled with a class y(d, i) ∈ Ld. We seek to find these sentence-level class labels; the training data consists of the multilabel documents and their class labels, i.e., we do not have access to the sentence-level class labels for training. Table 4.1 provides a reference for the notation used throughout the chapter.

4.2 Dynamic programming based methods

Our dynamic-programming based approaches consider the structure within the documents by making use of the fact that neighboring sentences tend to talk about the same topic. Our methods work iteratively: in each iteration, they estimate a model for predicting the classes of an arbitrary-length text segment, then use dynamic programming to segment the document into semantically-coherent chunks using the estimated model, and fine-tune the model based on these segments. Every iteration leads to a better segmentation, which gives us more precise sections of text belonging to each class, and therefore a more accurate prediction model in each iteration. We explain our methods in detail in this section.

4.2.1 Overview

Given a document d, we assume that there is a set of indices 0 = s0 < s1 < . . . < sdu = dq which partitions d into du non-overlapping contiguous text segments:

S_d = \langle d[s_0 + 1, \ldots, s_1], \ldots, d[s_{d_u - 1} + 1, \ldots, s_{d_u}] \rangle,

that span the entire document. The indices s0, . . . , sdu are called segment boundaries and correspond to changes in the topic being discussed in the document d. We refer to Sd as the segmentation of d. Let Hd be the sequence of classes corresponding to the segmentation Sd, that is, the ith element of Hd (Hdi) is the annotated class of the segment d[si + 1, . . . , si+1]. Furthermore, we will use d[i, . . . , j] to refer to the part of the document starting at the ith sentence and ending at the jth sentence (inclusive), and we will refer to that part as a subsequence of the document d. We seek a segmentation Sd of the document d and the corresponding segment class-labels Hd such that each sentence in the segment Sdi ∈ Sd belongs to the class Hdi. We can quantify the degree to which a text segment belongs to a class by using a multiclass classifier, which takes a piece of text of arbitrary length as input and gives as output the likelihood of that text belonging to each of the classes in the set Ld. We intend to find a segmentation that maximizes the aggregated score (likelihood) of all the segments belonging to their respective annotated classes. This leads to an optimization function of the following form:

J(d) = \underset{S_d, H_d}{\text{maximize}} \; \sum_{i=0}^{d_u - 1} \text{score}(d[s_i + 1, \ldots, s_{i+1}], H_{di}), \qquad (4.1)

where score(d[si + 1, . . . , si+1], Hdi) is the score (likelihood) that the segment d[si + 1, . . . , si+1] belongs to the class Hdi. This score is obtained as the output of a multiclass classifier. According to Equation (4.1), in order to obtain a segmentation of a document d, we require two components: (i) a model for predicting the classes, which takes a piece of text of arbitrary length as input and gives as output the likelihood of that text belonging to each of the classes in the set Ld, and (ii) an algorithm to optimize the objective function corresponding to Equation (4.1). We explain each of these two parts in the next two sections. First, we look into the segmentation algorithm, and then we look into how to build the prediction model to get the likelihood of belonging to a class.

4.2.2 Segmentation algorithm

We developed two different segmentation algorithms. Given a document d and a set of classes Ld associated with it, the first computes a segmentation Sd and an assignment of classes to each segment without requiring that every single class in Ld is used. The second algorithm imposes the constraint that each class in Ld is mapped to at least one segment in Sd. We define both algorithms in detail below.

Unconstrained segmentation

Given the additive formulation of our objective function (Equation (4.1)), we use dynamic programming (DP) to obtain an efficient and optimal solution. Let F(d, i, j, l, R) be the objective value associated with the optimal segmentation of the subsequence d[i, . . . , j] of document d such that the class associated with the first segment in the subsequence d[i, . . . , j] is l, and the classes associated with all segments (including l) are drawn from the set R. Using this definition, we can see that the objective value of the optimal segmentation for the document d, according to Equation (4.1), is given by:

J(d) = \max_{l \in L_d} F(d, 1, d_q, l, L_d).

In order to prevent the segmentation algorithm from creating very small segments that switch topics, we use the γ parameter, which assigns a fixed cost to the creation of a new segment. This is motivated by the gap-opening costs often used to obtain meaningful alignments in biological sequences. The recurrence relation for F(d, i, j, l, R) is given as follows:

F(d, i, j, l, R) = \max_{\substack{i \le k \le j \\ l' \in R,\; l' \ne l}} \left( \text{score}(d[i, \ldots, k], l) + F(d, k + 1, j, l', R) - \gamma \right). \qquad (4.2)
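For illustration, a compact memoized version of this recurrence, specialized to segmenting the full document and omitting the backtracking pointers needed to recover the actual segment boundaries, could look as follows; the `score(i, k, l)` signature is an assumption about how the classifier is queried.

```python
# Sketch of the unconstrained segmentation DP (Equation 4.2) for d[1..dq].
# score(i, k, l) is assumed to return the classifier's log-likelihood that
# sentences i..k (inclusive, 1-based) belong to class l.
from functools import lru_cache

def unconstrained_segmentation(dq, classes, score, gamma):
    @lru_cache(maxsize=None)
    def F(i, l):
        # Best objective for segmenting sentences i..dq when the first segment
        # has class l; subsequent segments must switch to a different class.
        best = score(i, dq, l) - gamma              # one segment covers i..dq
        others = [l2 for l2 in classes if l2 != l]
        for k in range(i, dq):                      # first segment ends at sentence k
            if others:
                tail = max(F(k + 1, l2) for l2 in others)
                best = max(best, score(i, k, l) - gamma + tail)
        return best

    return max((F(1, l), l) for l in classes)       # (objective, class of first segment)
```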

Constrained segmentation

Next, we present a DP formulation with the constraint that every class in Ld is mapped to at least one segment in Sd. Let G(d, i, j, R) be the objective value associated with the optimal segmentation of the subsequence d[i, . . . , j] of the document d such that all the classes associated with the resultant segments are drawn from the set R and each class in R is mapped to at least one segment. Therefore, the optimal segmentation objective value for the document d, according to Equation (4.1) and the constrained segmentation, is given by:

J(d) = G(d, 1, d_q, L_d).

Algorithm 1 Segmentation helper

Input: Document d of length dq, class-set R, segment creation penalty γ, likelihood function score
Output: A dictionary Qd of size |R|, the key being each class of R (say l), and the value being the segmentation of the document d such that the class associated with the first segment in the resultant segmentation is l, together with the corresponding optimal objective.
 1: W: a 2D array of size dq × |R| to store optimal objective values; Wptr, Wclass: 2D arrays of size dq × |R| to store pointer information.
 2: Qd: a dictionary of size |R|
Initialization:
 3: for i = 1 to dq do
 4:   for j = 1 to |R| do
 5:     W(i, j) = score(d[i, . . . , dq], j) − γ; Wptr(i, j) = dq + 1; Wclass(i, j) = −1.
 6:   end for
 7: end for
Maximization:
 8: for i = 2 to dq do
 9:   start = dq − i.
10:   for j = 1 to |R| do
11:     for k = start + 1 to dq − 1 do
12:       for q = 1 to |R| do
13:         if (q == j) then
14:           continue.
15:         end if
16:         if (W(i, j) < score(d[i, . . . , k − 1], j) − γ + W(k, q)) then
17:           W(i, j) = score(d[i, . . . , k − 1], j) − γ + W(k, q).
18:           Wptr(i, j) = k; Wclass(i, j) = q.
19:         end if
20:       end for
21:     end for
22:   end for
23: end for
Backtracking:
24: for l = 1 to |R| do
25:   start = l; i = 1; temp_seg = an array of size dq.
26:   while i ≤ dq do
27:     for j = i to Wptr(i, start) − 1 do
28:       temp_seg(j) = start
29:     end for
30:     (i, start) = (Wptr(i, start), Wclass(i, start))
31:   end while
32:   Qd(l) = (temp_seg, W(1, l)).
33: end for
34: return Qd

Algorithm 2 DP algorithm for unconstrained segmentation

Input: Document d of length dq, class-set Ld, segment creation penalty γ, likelihood function score
Output: Segmentation Segd of document d
 1: Segd: an array of size dq to store class information for all points.
 2: Qd = output of the segmentation helper (Algorithm 1) on document d, class-set Ld, segment creation penalty γ, and likelihood function score
 3: optobjective = −inf
 4: for l = 1 to |Ld| do
 5:   (seg, obj) = Qd(l)
 6:   if (obj > optobjective) then
 7:     optobjective = obj; Segd = seg.
 8:   end if
 9: end for
10: return Segd

The recurrence relation for the optimum segmentation as per our constrained segmentation formulation is given as follows:

G(d, i, j, R) = \max_{\substack{l \in R \\ i \le k < j}} \left( G(d, i, k, R - \{l\}) + F(d, k + 1, j, l, R) \right),

4.2.3 Building the classification model

For the segmentation algorithm, we require a model that gives us the score corresponding to the likelihood of a text segment belonging to a class, i.e., the score function in Equation (4.2). Our method achieves this by using a multiclass classifier that takes as

Algorithm 3 DP algorithm for constrained segmentation

Input: Document d of length dq, class-set Ld, segment creation penalty γ, likelihood function score
Output: Segmentation Segd of document d
 1: PSd = powerset of Ld
 2: W: a 2D array of size dq × |PSd| to store optimal objective values; Wclass: a 2D array of size dq × |PSd| to store pointer information.
Initialization:
 3: for i = 1 to dq do
 4:   for j ∈ PSd do
 5:     W(i, set(j)) = −inf.
 6:   end for
 7:   for j = 1 to |Ld| do
 8:     W(i, set(j)) = score(d[1, . . . , i], j) − γ.
 9:     Wclass(i, set(j)) = [j, j, . . . (i times)].
10:   end for
11: end for
Maximization:
12: for i = 2 to |Ld| do
13:   B = set of subsets of Ld of size i.
14:   for j ∈ B do
15:     for k = 2 to dq do
16:       for m = 1 to k do
17:         Qkm = output of the segmentation helper (Algorithm 1) on the sequence d[m + 1, . . . , k], class-set j, segment creation penalty γ, and likelihood function score
18:         for q ∈ j do
19:           subset = j − q.
20:           (seg, obj) = Qkm(q)
21:           if (W(k, j) < W(m, subset) + obj) then
22:             W(k, j) = W(m, subset) + obj.
23:             Wclass(k, j) = [Wclass(m, subset), seg].
24:           end if
25:         end for
26:       end for
27:     end for
28:   end for
29: end for
30: Segd = Wclass(dq, Ld).

31: return Segd

input a text segment and outputs the likelihood of it belonging to each of the classes. In this section, we look at methods to train this classifier, the training examples being the set of multilabel documents. Arguably, the most obvious way to train this classifier is by using each document-class pair as a training example; that is, we convert the multilabel documents into single-class documents by pairing every document with each of its associated classes independently. However, as discussed earlier, the classes of the document do not apply with equal specificity across the complete document. In this section, we look at methods to improve over this training exercise by using class-specific parts from these training examples. We propose an iterative method to train the classifier in a more precise manner as follows. We start by training the classifier using each document-class pair in the simple manner described above. In each iteration, we perform segmentation on the training documents to get specific segments belonging to each class and fine-tune the classifier with these class-specific training examples. Therefore, in each iteration, we expect to get a better segmentation of the training documents, which we use to fine-tune the classifier to make it more precise. Since we have observed class-sets for the training documents, we can use either unconstrained or constrained segmentation. Based on the discussion above, we have three ways to build the classifier:

1. Train a noisy classifier by using each document-class pair as the training examples. We call training in this way segmentation with noise (SEG-NOISY).

2. Refine the classifier iteratively by performing unconstrained segmentation on the training documents in each iteration. We call training in this way segmentation with refinement (SEG-REFINE).

3. Refine the classifier iteratively by performing constrained segmentation on the training documents in each iteration. We call training in this way constrained segmentation with refinement (CON-REFINE).

In this chapter, we use a simple two-layer feedforward neural network as our multiclass classifier. The input layer has the number of nodes equal to the vocabulary size (|V|), followed by a hidden layer, and an output layer of size (|C| + 1). We add an extra node in the output layer to capture the segments that do not belong to any of the classes (we call it the null class). We apply a Rectified Linear Unit (ReLU) non-linearity on the hidden layer and a softmax on the output layer. We minimize the cross-entropy loss for training the model. In the cases where we refine (fine-tune) the network iteratively (SEG-REFINE and CON-REFINE), we find the text segments corresponding to each class (except the null class) and train the network with a smaller learning rate (the weights are reinitialized from the previous iteration). The pseudocode for training the classifier model is presented in Algorithm 4.

Algorithm 4 Algorithm to iteratively build the classification model
Input: Training documents set T, initial learning rate lr, fine-tuning learning rate lr', number of iterations iter, training algorithm alg (either SEG-NOISY, SEG-REFINE or CON-REFINE)
Output: Trained multiclass classifier Z
Initialization: Build initial classifier Z using all training documents T and learning rate lr.
 1: for i = 1 to iter do
 2:   if (alg == SEG-REFINE) then
 3:     Segment each training document in T using unconstrained segmentation to get precise segments belonging to each of the classes, and redefine T to consist of these segments and their corresponding classes.
 4:   end if
 5:   if (alg == CON-REFINE) then
 6:     Segment each training document in T using constrained segmentation to get precise segments belonging to each of the classes, and redefine T to consist of these segments and their corresponding classes.
 7:   end if
 8:   Fine-tune Z using documents from T and learning rate lr'.
 9: end for
10: return Z
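A minimal PyTorch sketch of such a classifier is shown below; the layer sizes, optimizer settings, and class names are illustrative and not the exact implementation used in the experiments.

```python
# Sketch: bag-of-words multiclass classifier with one hidden layer (ReLU) and
# an output layer of size |C| + 1 (the extra node is the null class).
import torch
import torch.nn as nn

class SegmentClassifier(nn.Module):
    def __init__(self, vocab_size, num_classes, hidden_size=256):
        super().__init__()
        self.hidden = nn.Linear(vocab_size, hidden_size)
        self.out = nn.Linear(hidden_size, num_classes + 1)   # +1 for the null class

    def forward(self, bow):
        # bow: (batch, |V|) bag-of-words counts for a text segment
        return self.out(torch.relu(self.hidden(bow)))         # logits; softmax lives in the loss

# Cross-entropy training; later fine-tuning iterations would reuse these weights
# with a smaller learning rate (e.g., 0.0001 instead of 0.001).
model = SegmentClassifier(vocab_size=10000, num_classes=6)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```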

4.3 Credit Attribution With Attention (CAWA)

As discussed earlier, our dynamic-programming-based approaches do not model the semantically similar classes accurately. To address this limitation, we present a neural-network based approach, Credit Attribution With Attention (CAWA). CAWA captures the semantically similar classes by modeling each sentence as a distribution over the classes, instead of mapping it to just one class. In addition, CAWA leverages the local structure within the documents by using a simple average pooling layer to constrain the neighboring sentences to have similar class distributions. To this end, CAWA combines an attention mechanism with a multilabel classifier and uses this multilabel classifier to predict the classes of an input document.

Figure 4.2: Example of the attention weights. Each sentence contributes towards the class-specific document representation; the extent of the contribution, and hence the relevance for a class, is decided by the attention weights.

For each predicted class, the attention mechanism allows CAWA to precisely identify the sentences of the input document which are relevant towards predicting that class. Using these relevant sentences, CAWA estimates a class-specific document representation for each class. Finally, each sentence is assigned the class for which it is most relevant, i.e., for which it has the highest attention weight. Figure 4.2 shows an example of sentence-labeling using the attention weights. Additionally, CAWA uses a simple average pooling layer to constrain the neighboring sentences to have similar attention weights (class distributions). We explain CAWA in detail in this section.

4.3.1 Architecture

CAWA consists of three components: (i) a sentence representation generator, which is responsible for generating a representation of the sentences in the input document; (ii) an attention module, which is responsible for generating a class-specific document representation from the sentence representations; and (iii) a multilabel classifier, which is responsible for predicting the classes of the document using the class-specific document representations as input. These three components form an end-to-end learning framework, as shown in Figure 4.3. We explain each of these components in detail in this section.

Figure 4.3: CAWA architecture with an example. The input document consists of two sentences, having class labels 1 and 2, respectively. The sentence-representation generator generates the key and value representations for these sentences. The attention module generates class-specific document representations using the key and value representations of the sentences. Finally, the multilabel classifier uses these class-specific representations to predict the correct classes of the document. Although we do not have direct supervision about the sentence-level class labels, the attention mechanism allows us to find how much each sentence is relevant to a class, which can be used to predict the sentence-level class labels.

Sentence-representation generator (SRG):

The SRG takes the document as input and generates two different representations for each sentence in the document. The two representations correspond to the keys and the values that will be taken as input by the attention mechanism, as explained in the next section. For both keys and values, the SRG generates the representation of a sentence as the average of the representations of the constituent words of the sentence, i.e.,

k^d(i) = \frac{1}{|d[i]|} \sum_{x \in d[i]} k^w(x); \qquad v^d(i) = \frac{1}{|d[i]|} \sum_{x \in d[i]} v^w(x),

where d[i] is the ith sentence of document d, k^d(i) is the key-representation of d[i], k^w(x) is the key-representation of word x, v^d(i) is the value-representation of d[i], and v^w(x) is the value-representation of word x. These representations for the words are estimated during training.

Figure 4.4: Architecture of the attention mechanism.

Attention module:

The attention module takes the sentence representations (keys and values) as input and outputs the class-specific representations of the document, one document-representation for each class. Since different sentences have different relevance for each class, we estimate the class-specific representations as a weighted average of the value-representations of the sentences. We calculate the attention weights for this weighted average using a feed-forward network. Specifically, we estimate the class-specific representations as

r^d(c) = \sum_{i=1}^{d_q} a(d, i, c) \times v^d(i),

where r^d(c) is the class-specific representation of document d for class c, and a(d, i, c) is the attention-weight of the ith sentence of d for class c. The feed-forward network to calculate the attention weights takes as input the key representation of a sentence and outputs the attention weight of the sentence for each class. This feed-forward network plays the role of the sentence classifier and outputs, on its output layer, the probability of the input sentence belonging to each of the classes. We implement this feed-forward network with two hidden layers, and we use a softmax on the output layer to calculate the attention-weights. To leverage the local structure within the document, i.e., to constrain the neighboring sentences to have similar class distributions, we apply average pooling before the softmax layer. Average pooling smooths out the neighboring class distributions and cancels the effect of random variation. Note that we could also use more flexible sequence modeling approaches, such as Recurrent Neural Networks (RNNs), to leverage the local structure, but we choose a simple average pooling layer due to its simplicity. We also add a residual connection between the first hidden layer and the output layer, which eases the optimization of the model [46]. The architecture of the attention mechanism is shown in Figure 4.4.
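The following is an illustrative PyTorch sketch of the attention module described above (feed-forward network over the key vectors, average pooling across neighboring sentences, softmax over classes, and a weighted sum of the value vectors); the residual connection is omitted and the layer sizes are assumptions.

```python
# Sketch of the attention module: per-sentence class logits, smoothed across
# neighboring sentences, turned into attention weights a(d, i, c), then used to
# build class-specific document representations r_d(c).
import torch.nn as nn
import torch.nn.functional as F

class AttentionModule(nn.Module):
    def __init__(self, dim, num_classes, pool_kernel=3):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                nn.Linear(dim, dim), nn.ReLU(),
                                nn.Linear(dim, num_classes))
        self.pool = nn.AvgPool1d(pool_kernel, stride=1, padding=pool_kernel // 2)

    def forward(self, keys, values):
        # keys, values: (num_sentences, dim) sentence representations
        logits = self.ff(keys)                               # (num_sentences, num_classes)
        # smooth the per-class logits across neighboring sentences
        smoothed = self.pool(logits.t().unsqueeze(0)).squeeze(0).t()
        attn = F.softmax(smoothed, dim=1)                    # a(d, i, c); each row sums to 1
        doc_reps = attn.t() @ values                         # (num_classes, dim): r_d(c)
        return doc_reps, attn
```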

Figure 4.5: Architecture of the per-class binary classifier.

Multilabel classifier:

Several architectures and loss functions have been proposed for multilabel classification, such as Backpropagation for Multi-Label Learning (BP-MLL) [135]. However, as the focus of this chapter is credit attribution and not multilabel classification, we simply implement the multilabel classifier as a separate binary classifier for each class. Therefore, each binary classifier predicts whether a particular class is present in the document or not. The input to each of these binary classifiers is the class-specific representation, which is the output of the attention module. We implement each of these binary classifiers as a feed-forward network with two hidden layers and use a sigmoid on the output layer to predict the probability of the document belonging to that class. The architecture of the class-specific binary classifiers is shown in Figure 4.5.

4.3.2 Model estimation

To quantify the loss for predicting the classes of a document, we minimize the weighted binary cross-entropy loss [90], which is a widely used loss function for multilabel classification. The weighted binary cross-entropy loss associated with all the documents in collection D is given by:

L_C(D) = -\frac{1}{|D|} \sum_{d \in D} \sum_{c \in C} w_c \left( y(d, c) \log(s(d, c)) + (1 - y(d, c)) \log(1 - s(d, c)) \right),

where y(d, c) = 1 if the class c is present in document d, and y(d, c) = 0 otherwise, s(d, c) is the prediction probability of document d belonging to class c, and w_c is the class-specific weight for class c. This weight w_c is used to handle the class imbalance by increasing the importance of infrequent classes (upsampling), and we empirically set it to w_c = \sqrt{|D| / n_c}, where n_c is the number of documents belonging to the class c. Note that we require the sentences to be labeled with the class which they describe. However, the attention mechanism can also assign high attention-weights to sentences that provide a negative signal for a class. For example, if a document can exclusively belong to only one of the classes A and B, the text describing one of the classes (say A) will also provide a negative signal for the other class (B), and hence will get a high attention-weight for both classes A and B. To constrain the attention to focus only on the classes that are actually present in the document, we introduce the attention loss, which penalizes the attention on the absent classes, and is given by

L_S(D) = -\frac{1}{|D|} \sum_{d \in D} \frac{1}{d_q} \sum_{i=1}^{d_q} \sum_{c \in C} (1 - y(d, c)) \log(1 - a(d, i, c)),

where a(d, i, c) is the attention weight for class c on the ith sentence of the document d. To estimate CAWA, we minimize the weighted sum of both L_C(D) and L_S(D), given by L(D) = \alpha L_C(D) + (1 - \alpha) L_S(D), where α is a hyperparameter to control the relative contribution of L_C(D) and L_S(D) towards the final loss.
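A sketch of this combined objective for a single document is given below; the tensor shapes and the small epsilon for numerical stability are assumptions made for illustration.

```python
# Sketch: weighted binary cross-entropy (L_C) plus attention loss (L_S) for one
# document. `s`: (num_classes,) predicted probabilities; `y`: (num_classes,)
# 0/1 indicators; `attn`: (num_sentences, num_classes) attention weights;
# `w`: (num_classes,) class weights sqrt(|D| / n_c).
import torch

def cawa_loss(s, y, attn, w, alpha, eps=1e-8):
    bce = -(w * (y * torch.log(s + eps) + (1 - y) * torch.log(1 - s + eps))).mean()
    # penalize attention mass placed on classes absent from the document
    attn_loss = -((1 - y) * torch.log(1 - attn + eps)).mean()
    return alpha * bce + (1 - alpha) * attn_loss
```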

4.3.3 Segment inference

We can directly use the estimated attention-weights to assign a class to each sentence, corresponding to the class with the maximum attention-weight. However, to ensure consensus between the predicted sentence-level classes and the document's classes, we use a linear combination of the attention-weights and the document's predicted class-probabilities to assign a class to each sentence, i.e.,

l(d, i) = \underset{c}{\operatorname{argmax}} \left( \beta \times a(d, i, c) + (1 - \beta) \times s(d, c) \right), \qquad (4.3)

where l(d, i) is the predicted class for the ith sentence of d and β is a hyperparameter to control the relative contribution of the attention-weights and the document's classification probability. Additionally, s(d, c) acts as a global bias term, and makes the sentence-level predictions less prone to random variation in the attention weights.

4.4 Experimental methodology
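The corresponding inference step is a one-liner; the sketch below assumes the attention weights and the document-level class probabilities are available as tensors.

```python
# Sketch of Equation (4.3): label each sentence with the class that maximizes
# the combined attention weight and document-level probability.
def label_sentences(attn, doc_probs, beta):
    # attn: (num_sentences, num_classes) attention weights a(d, i, c)
    # doc_probs: (num_classes,) document-level probabilities s(d, c)
    scores = beta * attn + (1 - beta) * doc_probs
    return scores.argmax(dim=1)          # predicted class index per sentence
```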

4.4.1 Datasets

We performed experiments on five multilabel text datasets (Movies [7], Ohsumed [51], TMC2007¹, Patents², Delicious [140]). The datasets belong to different domains, as described below:

Movies

This dataset contains movie plot summaries extracted from Wikipedia and correspond- ing genres extracted from Freebase. We randomly take a subset of the movies from this dataset corresponding to six common genres: Romance Film, Comedy, Action, Thriller, Musical, Science Fiction.

Ohsumed

The Ohsumed test collection is a subset of the MEDLINE database. The labels correspond to 23 Medical Subject Headings (MeSH) categories of the cardiovascular diseases group.

TMC2007

This is the dataset used for the SIAM 2007 Text Mining competition. The documents are aviation safety reports corresponding to one or more problems that occurred during certain flights. There are a total of 22 unique labels.

Patents

This dataset contains the brief summary text of patents, and the labels correspond to the associated Cooperative Patent Classification (CPC) group labels. We randomly take a subset of summaries corresponding to the eight CPC groups: A: Human Necessities, B: Operations and Transport, C: Chemistry and Metallurgy, D: Textiles, E: Fixed Constructions, F: Mechanical Engineering, G: Physics, H: Electricity.

¹ https://c3.nasa.gov/dashlink/resources/138/
² http://www.patentsview.org/download/

Table 4.2: Dataset statistics.

Dataset    Vocab size  Classes  Training docs  Avg. classes (training)  Test docs (classification)  Test docs (segmentation)  Avg. classes (test, segmentation)
Movies     9,568       6        3,834          2.23                     959                         488                       2.26
Ohsumed    13,253      23       12,800         2.41                     3,200                       500                       2.39
TMC2007    10,686      22       15,131         2.69                     3,783                       500                       2.66
Patents    5,681       8        9,257          2.25                     2,314                       500                       2.22
Delicious  8,778       20       6,871          2.54                     1,718                       488                       2.48

Delicious

This dataset contains tagged web pages retrieved from the social bookmarking site delicious.com. Tags for the web pages in this dataset are not selected from a predefined set of labels; rather, users of the website delicious.com bookmarked each page with single-word tags. We randomly choose documents corresponding to 20 common tags as our class labels: humour, computer, money, news, music, shopping, games, science, history, politics, lifehacks, recipes, health, travel, math, movies, economics, psychology, government, journalism. All datasets were pre-processed by removing stop-words and rare words (words that occur less than 4 times), by stemming the words with the Porter stemmer [100], and by removing sentences with fewer than 4 or more than 10 words. For the multilabel classification task, both the training and test data are the documents with at least two classes associated with each document. For credit attribution, the test dataset is synthetic, and each test document corresponds to multiple single-label documents concatenated together (thus giving us ground-truth sentence class labels for a document). Additionally, we also use a validation dataset, created in a similar manner to this synthetic test dataset, for the parameter selection of our model. Table 4.2 reports the statistics of these datasets.

4.4.2 Competing approaches

As mentioned in Section 2.2, few approaches have been developed that are specifically designed to solve the credit attribution problem. However, any multilabel classifier can be used to perform credit attribution, by training on the multilabel documents and predicting the classes of the individual sentences. Thus, apart from the credit-attribution-specific approaches, we compare our approaches against several multilabel classification approaches. We selected our baselines from diverse domains such as graphical models, deep neural networks, dynamic programming, as well as classical approaches for text classification such as Multinomial Naive Bayes. Specifically, we compare our approaches against the following baselines:

Multi-Label Topic Model (MLTM) [113]

MLTM is a probabilistic generative approach that generates the classes of a document from the classes of its constituent sentences, which are in turn generated from the classes of the constituent words.

Deep Neural Network with Attention (DNN+A)

As mentioned earlier, any multilabel classifier can be used to perform credit attribution, by training on the multilabel documents and predicting the classes of the individual sentences. Thus, we compare CAWA against a deep neural network based multilabel classifier. For a fair comparison, we use the same architecture as CAWA for DNN+A, except the components specific to CAWA (attention loss and average pooling layer).

Deep Neural Network without Attention (DNN-A)

DNN-A has the same architecture as DNN+A, except the attention, i.e., each class gives equal emphasis on all the sentences.

Multi-Label k-Nearest Neighbor (ML-KNN) [135]

ML-KNN is a popular method for multilabel classification. It uses the k nearest neighbors of a test example and uses Bayesian inference to assign classes to the test example.

Binary relevance is also a popular approach for multilabel classification, and amounts to independently training a binary classifier for each class. The predicted label set is the union of the positive predictions of the per-class classifiers. We use Multinomial Naive Bayes as the per-class binary classifier, which is a popular classical approach for text classification.

4.4.3 Evaluation Methodology

We describe our evaluation methodology for the segmentation and multilabel classification tasks below.

Segmentation: To test our methods on the segmentation task, we require ground-truth data for the segmentation and associated classes. Since we are not aware of any publicly available datasets with segment-level classes, we created a synthetic test set to evaluate our methods on the segmentation task as follows. For each of the multilabel datasets described in the previous section, first we separated the single-label documents from the other documents (those containing at least two classes). From this set of single-label documents, we randomly chose documents and merged them to form a multilabel document. To mimic the naturally co-occurring classes, we only created documents whose resultant class-set is present in the naturally occurring multilabel documents. Since we generated these documents synthetically, we know the segments and the associated classes. This allows us to assess the performance of our methods on the segmentation task. To predict the segmentation of a document in this test set, we use unconstrained segmentation with the class-set being the set of all classes (Ld = C). To train the classifier required for segmentation, we use the same training examples as in the multilabel classification task, which we describe next.

Multilabel classification: We used the multilabel documents (each document has at least two classes associated with it) to create a training set and a test set to test our methods on the multilabel classification task. We did an 80-20 split to form the training-test sets. Table 4.2 reports the statistics of these datasets.

Table 4.3: Hyperparameter values.

Dataset    ML-KNN (k)  MLTM (m)  SEG-NOISY (γ)  SEG-REFINE (γ)  CON-REFINE (γ)  CAWA (α)  CAWA (β)
Movies     100         120       0.30           0.50            0.50            0.20      0.10
Ohsumed    20          90        0.10           0.40            0.35            0.10      0.10
TMC2007    50          90        0.20           0.40            0.45            0.10      0.30
Patents    50          110       0.20           0.45            0.50            0.50      0.30
Delicious  20          70        0.15           0.50            0.60            0.10      0.20

4.4.4 Parameter selection

For the neural-network based approaches, the number of nodes in each of the hidden layers, the length of all representations, as well as the batch size for training were set to 256. For regularization, we used a dropout [116] of 0.5 between all layers, except between the penultimate and output layers. For optimization, we used the ADAM [64] optimizer. For average pooling in CAWA, we fixed the kernel size to three. The key and value embeddings are initialized randomly, i.e., we did not use any pretraining. We trained CAWA for 100 epochs, with the learning rate set to 0.001. For SEG-NOISY, SEG-REFINE and CON-REFINE, we ran a total of three refinement iterations. In each iteration, we trained the network for 100 epochs. For the initial training of the neural network, we set the learning rate to 0.001, and for the fine-tuning of the network, we used 0.0001 as the learning rate. We chose the remaining hyperparameters of our methods and the baselines with respect to the best performance on the SOV metric on the validation set. The tunable hyperparameter for MLTM is m (the number of topics), while that of ML-KNN is k (the number of neighbors). The chosen values of the hyperparameters for the different datasets and methods are shown in Table 4.3. For ML-KNN, we used the cosine similarity measure to find the nearest neighbors, which is a commonly used similarity measure for text documents. For ML-KNN and BR-MNB, we used the implementation provided by scikit-multilearn³. For MLTM, we used the implementation provided by the authors⁴.

³ http://scikit.ml/
⁴ https://github.com/hsoleimani/MLTM/

4.4.5 Performance Assessment Metrics

We describe the metrics that we used to evaluate our approaches on the segmentation and multilabel classification tasks below.

Segmentation: For evaluation on the segmentation task, we first look into the per-point prediction accuracy, which corresponds to the fraction of sentences that are predicted correctly. We call this metric PPPA (per-point prediction accuracy) and define it as:

\text{PPPA}(S_1, S_2) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(S_1(i) == S_2(i)).

As a single-point measure, PPPA does not take into account the correlation between the neighboring sentences. To overcome this limitation of PPPA, we also report results on the Segment OVerlap score (SOV), a metric commonly used to evaluate the quality of protein secondary structure predictions [110]. SOV measures how well the observed and the predicted segments align with each other. SOV is defined as

\text{SOV}(S_1, S_2) = \frac{1}{N} \sum_{\substack{s_1 \in S_1,\; s_2 \in S_2 \\ (s_1, s_2) \in s}} \frac{\text{minov}(s_1, s_2) + \delta(s_1, s_2)}{\text{maxov}(s_1, s_2)} \times \text{len}(s_1),

where N is the total number of sentences in the document we are segmenting, S_1 is the observed segmentation, and S_2 is the predicted segmentation. The sum is taken over all segment pairs (s_1, s_2) ∈ s for which s_1 and s_2 overlap on at least one point.

The actual overlap between s_1 and s_2 is minov(s_1, s_2), that is, the number of points that both segments have in common, while maxov(s_1, s_2) is the total extent of both segments. The calculation of minov and maxov is illustrated in Figure 4.6. The accepted variation δ(s_1, s_2) brings robustness against minor deviations at the ends of segments.

δ(s_1, s_2) is defined as [132]:

δ(s_1, s_2) = min { maxov(s_1, s_2) − minov(s_1, s_2),  minov(s_1, s_2),  ⌊len(s_1)/2⌋,  ⌊len(s_2)/2⌋ }.

Compared to the PPPA metric, SOV penalizes fragmented segments and favors continuity in the predictions. For example, prediction errors at the ends of segments are penalized less by SOV than prediction errors in the middle of segments.
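To make the two segmentation metrics concrete, here is a small Python sketch of PPPA and SOV over per-sentence class labels. It assumes segments are maximal runs of identical labels and restricts SOV pairs to observed and predicted segments of the same class; these conventions follow our reading of [110, 132] and may differ in minor details from the evaluation code behind the reported numbers.

```python
# Illustrative sketch of PPPA and SOV; conventions (same-class pairing, run-based
# segments) are assumptions, not necessarily the exact evaluation script used here.
from itertools import groupby

def pppa(s1, s2):
    """Fraction of sentences whose predicted class matches the observed class."""
    assert len(s1) == len(s2)
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)

def segments(labels):
    """Return (class, start, end) for maximal runs of identical labels (end inclusive)."""
    segs, pos = [], 0
    for cls, run in groupby(labels):
        length = len(list(run))
        segs.append((cls, pos, pos + length - 1))
        pos += length
    return segs

def sov(observed, predicted):
    """Segment OVerlap score of the predicted labeling against the observed one."""
    n = len(observed)
    total = 0.0
    for c1, b1, e1 in segments(observed):
        for c2, b2, e2 in segments(predicted):
            if c1 != c2:
                continue
            minov = min(e1, e2) - max(b1, b2) + 1          # points both segments share
            if minov <= 0:
                continue                                    # segments do not overlap
            maxov = max(e1, e2) - min(b1, b2) + 1          # total extent of the pair
            len1, len2 = e1 - b1 + 1, e2 - b2 + 1
            delta = min(maxov - minov, minov, len1 // 2, len2 // 2)
            total += (minov + delta) / maxov * len1
    return total / n
```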

    A A A B B B B B B B B B B B B B A
    A B B B B B B B B B B B A A A A A

Figure 4.6: Illustration of minov and maxov. The two segments compared correspond to the ones labeled with the letter B.

Actual labels (L): a a a a a a a a a a b b b b b b b b b b

Prediction 1 (P1): a b a b a b a b a b a b a b a b a b a b

Prediction 2 (P2): b b b b b a a a a a a a a a a b b b b b

Figure 4.7: Advantage of using the SOV metric over the PPPA metric. Predictions P1 and P2 score the same on the PPPA metric, with PPPA(L, P1) = PPPA(L, P2) = 0.5. But P2, being more aligned with the ground-truth classes, scores higher than P1 on the SOV metric, with SOV(L, P1) = 0.5 and SOV(L, P2) = 0.83.

Figure 4.7 illustrates the advantage of using the SOV metric over the PPPA metric. L shows the ground-truth segments with their corresponding classes, and P1 and P2 are two segmentation predictions. P1 and P2 score the same on the PPPA metric, with PPPA(L, P1) = PPPA(L, P2) = 0.5. But P2, being more aligned with the ground-truth classes, scores higher than P1 on the SOV metric, with SOV(L, P1) = 0.5 and SOV(L, P2) = 0.83.

Multilabel classification: In addition to performing segmentation, we also used our methods to solve the multilabel classification problem. This was done by predicting for each document the union of the classes that were identified during the segmentation process. To evaluate the performance of our methods on this task, we used the following measures. The first is the F1 score: for each document, we compute the F1 score based on the predicted classes and the observed classes, and we report the mean F1 score over all the documents. We call it F1mean. F1mean is an important metric because it tells how well the predicted classes of the sentences correspond to the document classes. The second is the Area Under the Receiver Operating Characteristic Curve (AUC) [13]. AUC gives the probability that a randomly chosen positive example ranks above (is deemed to have a higher probability of being positive than) a randomly chosen negative example. We report AUC both under the micro (AUCµ) and macro (AUCM) settings. AUCM computes the metric independently for each class and then takes the average (hence treating all classes equally), whereas AUCµ aggregates the contributions of all classes to compute the metric. To compute the AUC metrics, we need soft scores corresponding to the likelihood of belonging to a class. We take the score of a document for a class to be the maximum of the scores of any segment of that document for that class.
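For reference, a short Python sketch of these multilabel metrics is given below, using scikit-learn for the AUC variants; the array shapes and helper names are illustrative assumptions.

```python
# Hedged sketch of the multilabel evaluation metrics; names and shapes are illustrative.
import numpy as np
from sklearn.metrics import roc_auc_score

def f1_mean(y_true, y_pred):
    """Mean per-document F1; y_true and y_pred are binary indicator arrays of shape
    (num_documents, num_classes)."""
    scores = []
    for t, p in zip(np.asarray(y_true), np.asarray(y_pred)):
        tp = (t * p).sum()
        prec = tp / max(p.sum(), 1)
        rec = tp / max(t.sum(), 1)
        scores.append(0.0 if tp == 0 else 2 * prec * rec / (prec + rec))
    return float(np.mean(scores))

def document_scores(segment_scores):
    """Soft document-level score for a class = maximum score of any of its segments;
    segment_scores is a (num_segments, num_classes) array for one document."""
    return segment_scores.max(axis=0)

def auc_micro_macro(y_true, y_score):
    """AUCµ and AUCM over the label-indicator matrix y_true and the soft scores y_score."""
    return (roc_auc_score(y_true, y_score, average="micro"),
            roc_auc_score(y_true, y_score, average="macro"))
```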

4.5 Results and Discussion

4.5.1 Credit Attribution

Table 4.4 shows the performance of the various methods on the credit attribution task. The credit attribution specific approaches (CAWA, SEG-NOISY, SEG-REFINE, CON-REFINE and MLTM) perform considerably better than the other multilabel approaches (DNN+A, DNN-A, ML-KNN and BR-MNB). Among the credit-attribution methods, our methods perform better than the competing approaches on four out of the five datasets (Ohsumed, TMC2007, Patents and Delicious) on the SOV metric, and give comparable performance on the Movies dataset. Our approaches avoid fragmentation by assigning the neighboring sentences the same classes, leading to better performance on the SOV metric. On the PPPA metric, while our attention-mechanism based approach, CAWA, also outperforms the other approaches on the same four datasets (Ohsumed, TMC2007, Patents and Delicious), the dynamic-programming based approaches perform at par with the next-best performing approach, i.e., MLTM. This can be attributed to the fact that, while the dynamic-programming based approaches are able to avoid fragmentation by leveraging the structure of the document, they still have the limitation of not modeling the semantically similar classes. On the other hand, while MLTM models the semantically similar classes, it does not leverage the local structure of the documents. Thus, we do not see a clear trend on the PPPA performance between MLTM and the dynamic-programming based approaches. Overall, CAWA is the best performing method; its average performance gain over MLTM on the SOV and PPPA metrics is 7.5% and 5.8%, respectively.

Table 4.4: Results on the segmentation task.

Dataset     Model*        SOV    PPPA
Movies      CAWA          0.50   0.38
            SEG-NOISY     0.49   0.36
            SEG-REFINE    0.49   0.36
            CON-REFINE    0.46   0.33
            MLTM          0.50   0.40
            DNN+A         0.33   0.27
            DNN-A         0.33   0.27
            ML-KNN        0.38   0.30
            BR-MNB        0.39   0.31
Ohsumed     CAWA          0.65   0.55
            SEG-NOISY     0.63   0.48
            SEG-REFINE    0.63   0.47
            CON-REFINE    0.61   0.44
            MLTM          0.55   0.46
            DNN+A         0.44   0.37
            DNN-A         0.33   0.31
            ML-KNN        0.48   0.38
            BR-MNB        0.29   0.30
TMC2007     CAWA          0.56   0.47
            SEG-NOISY     0.56   0.44
            SEG-REFINE    0.59   0.44
            CON-REFINE    0.58   0.43
            MLTM          0.50   0.44
            DNN+A         0.43   0.37
            DNN-A         0.35   0.34
            ML-KNN        0.45   0.35
            BR-MNB        0.30   0.33
Patents     CAWA          0.58   0.50
            SEG-NOISY     0.53   0.42
            SEG-REFINE    0.56   0.45
            CON-REFINE    0.55   0.45
            MLTM          0.55   0.49
            DNN+A         0.53   0.43
            DNN-A         0.51   0.42
            ML-KNN        0.45   0.37
            BR-MNB        0.50   0.43
Delicious   CAWA          0.50   0.39
            SEG-NOISY     0.45   0.33
            SEG-REFINE    0.48   0.36
            CON-REFINE    0.47   0.35
            MLTM          0.49   0.37
            DNN+A         0.22   0.18
            DNN-A         0.21   0.17
            ML-KNN        0.24   0.19
            BR-MNB        0.25   0.19

* The models CAWA, SEG-NOISY, SEG-REFINE, CON-REFINE and MLTM have been specifically designed to solve the credit attribution problem, while the models DNN+A, DNN-A, ML-KNN and BR-MNB are multilabel classification approaches.

Table 4.5: Sentence classification performance when the sentences belong to similar classes.

Dataset     Class                   Model        F1
Ohsumed     Nutritional/metabolic   CAWA         0.68
                                    SEG-REFINE   0.64
            Endocrine disease       CAWA         0.39
                                    SEG-REFINE   0.26
Patents     Electricity             CAWA         0.53
                                    SEG-REFINE   0.48
            Physics                 CAWA         0.41
                                    SEG-REFINE   0.33
Delicious   Health                  CAWA         0.50
                                    SEG-REFINE   0.47
            Recipes                 CAWA         0.62
                                    SEG-REFINE   0.58

Out of the dynamic-programming based approaches, SEG-REFINE usually outperforms the others. On the TMC2007, Patents and Delicious datasets, SEG-REFINE performs better than SEG-NOISY, while on the Movies and Ohsumed datasets, SEG-REFINE performs as well as SEG-NOISY. The average performance gain of SEG-REFINE over SEG-NOISY over all the datasets, in terms of the SOV metric, is 4.5%. For CON-REFINE, we observe two distinct patterns: on the Movies and Ohsumed datasets, CON-REFINE does not perform as well as SEG-REFINE and SEG-NOISY, while on the TMC2007, Patents and Delicious datasets, CON-REFINE performs better than SEG-NOISY and at par with SEG-REFINE. The reason for this behavior can be explained as follows. For a document d, as the segment-creation penalty γ is increased, the number of segments decreases, and as a result, the number of possible segmentations also decreases. Thus, for smaller γ, there are more possible ways to find a segment that provides enough signal for a class so as to be annotated with it. As γ is increased, if no segment provides enough signal for a particular class (as the number of possible segmentations decreases), SEG-REFINE is free to ignore that class and use fewer classes to annotate the document (just one class in the extreme case, the one having the strongest signal in the document). However, for CON-REFINE, as γ is increased, segments will still be annotated with all the classes, even though they do not provide enough signal for those classes. Thus, the extent to which we can exploit the fact that neighboring sentences talk about the same class (i.e., the extent to which γ can be increased without generating false positives in the training documents) is more limited for CON-REFINE than for SEG-REFINE. Thus, we expect CON-REFINE to generate fragmented segments as compared to SEG-REFINE if enough signal is not present for a class in the training documents. The first observation that validates our hypothesis is that the two datasets on which CON-REFINE does not perform as well as SEG-NOISY (Movies and Ohsumed) are the ones on which SEG-NOISY performs at par with SEG-REFINE, i.e., individual sentences provide a weak signal of belonging to individual classes, thus limiting the extent to which refinement can help in making precise classifiers. The second observation pertains to verifying that CON-REFINE in fact leads to fragmented segments as compared to SEG-REFINE on the datasets on which it does not perform well. Table 4.6 reports the average number of predicted segments per document and the corresponding average length (number of sentences) of each segment, for the results presented in Table 4.4. As expected, CON-REFINE predicts significantly more segments per document than SEG-REFINE (more than two additional segments per document on average) for the Movies and Ohsumed datasets, and a similar number of segments per document as SEG-REFINE on the remaining datasets. This fragmentation leads to the weaker performance of CON-REFINE on the Movies and Ohsumed datasets. To validate our hypothesis that CAWA models the semantically similar classes more effectively than SEG-REFINE, we looked into the performance of both CAWA and SEG-REFINE on the two most similar classes for each of the Ohsumed, Patents and Delicious datasets. To measure the similarity between two classes, we calculated the Jaccard similarity [56] between them, based on the number of documents in which they occur. For each of these selected classes, we calculated the F1 score based on the predicted and actual classes of the sentences in the segmentation dataset. Table 4.5 shows the results for this analysis. For the Ohsumed dataset, the two selected classes are Nutritional/Metabolic disease and Endocrine disease, which are very similar. Similarly, the selected classes for the Patents and Delicious datasets are also similar. We see that, for all the selected classes, CAWA performs better than SEG-REFINE on the F1 metric, illustrating the effectiveness of CAWA on modeling semantically similar classes.

Table 4.6: Statistics of the predicted segments.

Dataset     Model        Average length of the   Average number of predicted
                         predicted segments      segments per document
Movies      SEG-NOISY     3.89                   10.57
            SEG-REFINE    3.58                   11.47
            CON-REFINE    2.94                   13.97
Ohsumed     SEG-NOISY     4.04                    7.62
            SEG-REFINE    3.74                    8.24
            CON-REFINE    2.92                   10.57
TMC2007     SEG-NOISY     5.25                    7.52
            SEG-REFINE    3.82                   10.32
            CON-REFINE    3.59                   10.99
Patents     SEG-NOISY     2.23                    8.11
            SEG-REFINE    2.66                    6.79
            CON-REFINE    2.70                    6.71
Delicious   SEG-NOISY     8.11                    8.60
            SEG-REFINE   10.13                    6.89
            CON-REFINE   10.08                    6.92

We further investigate the effect of the various parameters of CAWA on the credit attribution task in Section 4.5.3.

In addition, DNN+A also performs considerably better than DNN-A on both the SOV and PPPA metrics for all the datasets. The average performance gain of DNN+A over DNN-A is 13% and 7.3% on the SOV and PPPA metrics, respectively. This shows the effectiveness of the proposed attention architecture in correctly modeling the multilabel documents.

4.5.2 Multilabel classification

Table 4.7 shows the performance of the different methods on the multilabel text classification task. Similar to the segmentation task, among the credit-attribution methods, our methods in general perform better than the competing approaches on the F1mean metric. This shows that the classes predicted for the sentences by our methods correlate better with the document classes than the classes predicted by the other credit-attribution approach, i.e., MLTM. Similar to the segmentation task, CAWA is the best performing method, with an average performance gain of 5.1% over MLTM. The superior performance of CAWA can be attributed to the way we calculate the sentence classes (Equation (4.3)), which ensures the consensus between the predicted sentence-level and document-level classes.

Table 4.7: Results on the multilabel classification task.

Dataset     Model*        F1mean   AUCµ   AUCM
Movies      CAWA          0.65     0.81   0.78
            SEG-NOISY     0.64     0.82   0.81
            SEG-REFINE    0.63     0.81   0.80
            CON-REFINE    0.63     0.80   0.78
            MLTM          0.65     0.82   0.80
            DNN+A         0.62     0.84   0.82
            DNN-A         0.61     0.85   0.83
            ML-KNN        0.63     0.83   0.81
            BR-MNB        0.53     0.82   0.84
Ohsumed     CAWA          0.64     0.93   0.89
            SEG-NOISY     0.65     0.94   0.93
            SEG-REFINE    0.65     0.94   0.92
            CON-REFINE    0.67     0.94   0.92
            MLTM          0.59     0.93   0.90
            DNN+A         0.67     0.94   0.92
            DNN-A         0.58     0.94   0.92
            ML-KNN        0.59     0.90   0.87
            BR-MNB        0.31     0.82   0.71
TMC2007     CAWA          0.68     0.95   0.91
            SEG-NOISY     0.66     0.96   0.92
            SEG-REFINE    0.68     0.95   0.90
            CON-REFINE    0.69     0.96   0.91
            MLTM          0.62     0.96   0.91
            DNN+A         0.68     0.96   0.92
            DNN-A         0.59     0.96   0.92
            ML-KNN        0.71     0.95   0.89
            BR-MNB        0.62     0.89   0.72
Patents     CAWA          0.61     0.88   0.86
            SEG-NOISY     0.62     0.87   0.86
            SEG-REFINE    0.61     0.86   0.85
            CON-REFINE    0.61     0.86   0.85
            MLTM          0.59     0.85   0.84
            DNN+A         0.64     0.89   0.87
            DNN-A         0.63     0.89   0.88
            ML-KNN        0.51     0.82   0.80
            BR-MNB        0.50     0.87   0.86
Delicious   CAWA          0.52     0.85   0.84
            SEG-NOISY     0.48     0.86   0.85
            SEG-REFINE    0.49     0.85   0.85
            CON-REFINE    0.50     0.86   0.85
            MLTM          0.50     0.84   0.83
            DNN+A         0.38     0.87   0.86
            DNN-A         0.36     0.88   0.87
            ML-KNN        0.35     0.82   0.80
            BR-MNB        0.05     0.76   0.73

* The models CAWA, SEG-NOISY, SEG-REFINE, CON-REFINE and MLTM have been specifically designed to solve the credit attribution problem, while the models DNN+A, DNN-A, ML-KNN and BR-MNB are multilabel classification approaches.

Figure 4.8: Change in the SOV with β as it is increased from 0 to 1 for the two cases: (i) the average-pooling layer is used, and (ii) the average-pooling layer is not used. The plots correspond to the values of α that give the best validation performance.

Additionally, with respect to the AUCµ and AUCM metrics, our methods perform as well as, if not somewhat better than, MLTM and the other multilabel classification approaches. This illustrates the effectiveness of the proposed approaches even on the multilabel text classification task.

Compared to the approaches specific to the multilabel classification task, either CAWA or DNN+A achieves the best performance on the F1mean metric on all but the TMC2007 dataset, where ML-KNN achieves the best performance. This further verifies the effectiveness of the proposed attention architecture in correctly modeling the multilabel documents. Similar to the credit attribution task, DNN+A also performs better than DNN-A on the F1mean metric for all the datasets, showing the effectiveness of the proposed attention architecture. Even on the AUCµ and AUCM metrics, DNN+A and DNN-A in general perform better than the other approaches, thus validating the strength of the proposed architecture.

4.5.3 Ablation study

Effect of average pooling and β:

Figure 4.8 shows the change in the SOV metric with β for all the datasets. For each dataset, we plot the SOV metric as β is increased from 0.0 to 1.0 for two cases: (i) the average-pooling layer is used, and (ii) the average-pooling layer is not used. For both cases, when β = 0, each sentence gets the same class, which is the class with the maximum prediction probability for the complete document. As β increases, the effect of the attention weights starts coming into play, leading to each sentence getting its own class, and thus a sharp jump in performance on the SOV metric. However, as β increases further, the contribution of the attention weights overpowers the overall document class probabilities, and the predicted sentence classes become more prone to noise in the attention weights, leading to performance degradation for large β. Comparing the performance curves for the case when the average-pooling layer is used to the one when it is not, average pooling leads to better performance for all values of β. Thus, average pooling effectively constrains the nearby sentences to have similar attention weights, leading to better performance on the SOV metric.

Figure 4.9: Sub-figure (a) shows the change in SOV with change in α. Sub-figure (b) shows the β values for which the maximum SOV is obtained for each α.

Effect of α:

Figure 4.9 shows the change in performance on the SOV metric with change in α for all the datasets. Figure 4.9(a) reports the maximum value of SOV for each α over all the β values. Figure 4.9(b) reports the corresponding value of β for each α that gives the maximum performance on the SOV metric. For all the cases, as α increases from 0.0 to 0.1, SOV shows a sharp increase, which can be attributed to the classification loss (L_C(D)) coming into effect. Additionally, we see that as α increases, the corresponding value of β giving the maximum performance also increases in general. As α increases, the contribution of the attention loss decreases, thus requiring more contribution from the attention weights to accurately predict the sentence classes. This explains the increase in the value of β as α increases. The exceptionally high value of β when α = 0 can be explained as follows: α = 0 corresponds to the case when we are only minimizing the attention loss (L_S(D)) and ignoring the loss for predicting the document's classes (L_C(D)). The multilabel classifier does not get trained at all in this case, leading to y(d, c) taking random values. Therefore, β takes large values to ignore the contribution of y(d, c) (which is random) towards the sentence-level labels, so as to make correct predictions.

Figure 4.10: Sub-figures (a) and (b) show the change in AUCµ and AUCM with change in α, respectively.

Figures 4.10(a) and 4.10(b) show the change in performance on the AUCµ and AUCM metrics with change in α, respectively. For both metrics, the performance increases with an increase in α, i.e., the performance on the AUC metrics is negatively impacted by the attention loss. As explained earlier, the attention loss ignores the sentences that provide negative signals for the classes when making its predictions, and thus adversely affects the multilabel classification performance. Average pooling, α and β have the same effect on the PPPA metric too. We present the concluding remarks in Chapter 8.

Chapter 5

Intent term selection and refinement in e-commerce queries

Retrieving the products that are relevant to a search query in an e-commerce site is a fundamental problem, as it is often the first step a customer performs during an e-commerce transaction. The search engine relies on the terms in the query to return a set of products whose attributes match the terms in the query. Different terms in the search query describe different characteristics of the product intent (the products relevant to the query), such as brand, product type, and other product attributes. For example, for the search query "motorola phone", the product-type intent is phone and the brand intent is motorola. However, a search engine in e-commerce that relies on terms in a query for retrieval suffers from two different issues. First, when a query has multiple terms, the engine retrieves all the products whose attributes match the terms in the query, but this hurts the search results, as it may also match the terms in the query that do not describe the query's product intent; e.g., the query "socks for running shoes" may return "running shoes" in addition to "socks". Moreover, the search engine suffers from the vocabulary gap when a term in the query that describes the intended product is not identical to the terms used in the description of the relevant products in the catalog; e.g., the term "outdoor" in the query "outdoor paint" corresponds to the term "exterior" often used in the descriptions of the relevant products.

The state-of-the-art term-frequency based retrieval approaches, e.g., BM25 and BM25F, use the appearance of query terms in the product descriptions to compute a score indicating the relevance of the product for the query. However, a term that frequently occurs in the catalog will have a low IDF weight, thereby lowering its contribution to the final score for queries having that term. This low contribution can hurt the recall of products, as these terms can be the most critical terms in the query; e.g., the term "women" in "women shoes" is important, but due to its high frequency and consequently low IDF weight, the search engine may show "men shoes", leading to a bad customer experience followed by a loss in sales and revenue. For the vocabulary gap problem, the existing approaches [16, 12, 21, 111] rely on the historical engagement of a query to suggest terms for already seen queries, and hence these approaches do not work for rare queries.

In this chapter, we use the query reformulation data derived from the historical search logs of a major e-commerce retailer (walmart.com) to develop distant-supervised approaches [89] to solve these problems. Specifically, we leverage the fact that the terms in the reformulated query describe the query's product intent better than the terms in the original query. Additionally, our approaches also take into account the context of a term, i.e., the entities in the neighborhood of the term in the query, to estimate the representation of the query's product intent with Recurrent Neural Networks (RNNs). For example, the same term 3-piece has different significance for queries like "3-piece kids dinnerware" and "3-piece mens suit". For the first query, the context is kids dinnerware, and for the second query, the context is mens suit.

In order to identify the relevant terms in a query that represent its intent, we estimate a weight for each term in the query that indicates the importance of the term towards expressing the query's product intent. To estimate these term weights, we present a model that uses an intent encoder to learn the query's product-intent representation using RNNs, a feature generator to create features that capture the contextual information, and a weight calculator to estimate the weight for each term. Finally, to bridge the vocabulary gap between a query and the products relevant to the query, we develop a query refinement approach to suggest terms that represent the query's product intent but are not in the original query. Our query refinement model uses an intent encoder followed by a multilabel classifier to predict the terms that express the query's product intent.

The approaches presented in this chapter can be generalized to all queries. However, from the e-commerce perspective, we can use the historical engagement, e.g., clicks, add-to-cart, and purchases, for the queries that have been issued by the users in the past to retrieve the relevant items for those queries. Therefore, we have restricted our evaluation to the rare queries, i.e., cold-start or tail queries, and our analysis shows that these queries constitute a significant chunk of the overall search traffic of the e-commerce retailer. The impact of the developed approaches is greater for the tail queries, as we do not have any prior historical information to retrieve all the relevant items for these queries.

To evaluate our approach on the term-weight prediction problem, we performed two tasks. The first one evaluates the ranking of the search results after weighting (boosting) higher the terms that represent the query's product intent. On the mean reciprocal rank (MRR) [121] metric, our term-weight prediction approach improves the ranking by 3% over the BM25F [107] approach. Considering that the e-commerce retailer's catalog contains millions of products, this improvement is significant and will help by generating a better set of initial candidate products, followed by the application of learning-to-rank methods [61] to produce the final ranking. The second task predicts the terms that are important towards defining the query's product intent and hence are retained in the reformulated query. With respect to predicting the important terms, our approach performs better than the non-contextual baseline, with a relative performance gain of 6.7%. For query refinement, we evaluated our approach on predicting all the terms in the reformulated query, given the current query. With respect to predicting the terms in the reformulated query, our approach beats the non-contextual baseline with a relative performance gain of 3.4%.

The remainder of the chapter is organized as follows. Section 5.1 presents the definitions and notation used in the chapter. Section 5.2 discusses the developed methods, followed by the details of our experimental methodology in Section 5.3. Section 5.4 discusses the results.

Table 5.1: Notation used throughout the chapter.

Symbol     Description
V          Vocabulary (set of all the terms)
q          Query such that the results displayed do not satisfy the query's intent
|q|        Number of terms in q
R(q)       Reformulation of q such that the results displayed satisfy the query's intent
Q          Collection of all the queries q
r_x        Word vector representation of the word x
d          Length of the word vector representation
h_t^f      Hidden state at position t for the forward GRU (GRU^f)
h_t^b      Hidden state at position t for the backward GRU (GRU^b)
l          Size of the hidden states for the GRUs
s(x)       Importance weight for the term x

5.1 Definitions and Notations

Let V be the set of distinct terms that appear in all the queries of a website, referred to as the vocabulary. Let q be a query such that the results (products) displayed for q do not satisfy its product intent. The query q is a sequence of |q| terms from the vocabulary V , that is,

q = ⟨v_{q_1}, . . . , v_{q_|q|}⟩.

The set of all such queries q is denoted by Q. Let R(q) denote a reformulation of q such that the results displayed for R(q) satisfy its product intent. Each word u_i in the vocabulary will also be represented by a d-dimensional vector r_{u_i}. Table 5.1 provides a reference for the notation used throughout the chapter.

5.2 Proposed methods

In order to identify the terms that express a query's product intent, we do not rely on access to labeled data, but instead use the query reformulation logs to develop a distant-supervised approach to find the terms that represent the query's product intent. Specifically, we assume that we are given a query q and its reformulation R(q) such that the results displayed for q do not satisfy its product intent, while the results displayed for R(q) satisfy the product intent.


Figure 5.1: Contextual term-weighting (CTW) model.

We assume that the terms in q that represent its product intent are also retained in R(q), i.e., the terms in the set q ∩ R(q) are more critical in expressing q's product intent than the terms in the set q \ R(q). Our contextual term-weighting (CTW) model solves this problem by predicting a weight for each term in a query that indicates the importance of that term towards defining the query's product intent. CTW leverages the context of a term, i.e., the entities in the neighborhood of the term, to estimate its weight. CTW comprises an intent encoder that estimates the representation of a term in the query with Recurrent Neural Networks (RNNs), a feature generator that uses the intent representation to generate features that capture the contextual information, and a weight calculator that uses a multilayer perceptron (MLP) to estimate the weight for each term from the generated features.

Additionally, we focus on addressing the vocabulary gap between the terms in a query and a relevant product's description, i.e., the case in which the terms used in the query are semantically similar to but different from the terms in the description of the product. For example, for the query "outdoor paint", a relevant product's title contains the semantically similar term "exterior" instead of "outdoor", i.e., "exterior paint". We want to refine a query by suggesting other terms that are not present in the query but express its product intent better than the original terms in the query. We assume that the terms in R(q) are candidates for the query refinement of q, as the terms in the reformulated query define the product intent better than the terms in q. Our contextual query refinement (CQR) model suggests relevant terms by classifying each term in the vocabulary according to whether it defines q's product intent.


Figure 5.2: Contextual query refinement (CQR) model.

Similar to CTW, CQR also leverages the context of a term to make its predictions. CQR comprises an intent encoder that estimates the product intent of the query with Recurrent Neural Networks (RNNs) and a multilabel classifier that classifies each term in the vocabulary as to whether it is relevant towards defining the query's product intent.

5.2.1 Contextual term-weighting (CTW) model

Given a query ⟨v_{q_1}, . . . , v_{q_|q|}⟩, our task is to estimate a weight s(v_{q_t}) for each term v_{q_t}, which captures how important v_{q_t} is towards defining q's product intent. The supervision during training comes from the fact that the terms in the set q ∩ R(q) are more important in defining the product intent than the terms in the set q \ R(q). Figure 5.1 provides an illustration of the contextual term-weighting model. The contextual term-weighting model has three modules: the intent encoder, the feature generator and the weight calculator. The intent encoder takes q as input and outputs a representation of the product intent of q. Next, the feature generator uses the output of the intent encoder as input and generates features for each query term v_{q_t} that capture its importance towards the product intent of q. For each term v_{q_t}, the weight calculator uses the features produced by the feature generator as input and outputs the importance weight for it. We discuss these three modules in detail below.

Intent encoder: This module encodes the sequence of terms in the query q into a fixed-length representation using bidirectional Gated Recurrent Units (GRUs) [17]. GRUs are a type of Recurrent Neural Network (RNN) that models variable-length sequential input using a recurrent, shared hidden state. The hidden state can be thought of as a summary of the complete sequential input. The sequence of word vectors in the query q is denoted by ⟨r_{v_{q_1}}, . . . , r_{v_{q_|q|}}⟩. The bidirectional GRU encodes the query q as:

h_t^f = GRU^f(h_{t-1}^f, r_{v_{q_t}}),    (5.1)
h_t^b = GRU^b(h_{t+1}^b, r_{v_{q_t}}),    (5.2)

where GRU^f encodes the query q in the forward direction and GRU^b encodes it in the backward direction. h_t^f is the hidden state at position t for the forward GRU and corresponds to the summary of the sequence ⟨r_{v_{q_1}}, . . . , r_{v_{q_t}}⟩. Similarly, h_t^b is the hidden state at position t for the backward GRU and stores the summary of the sequence ⟨r_{v_{q_t}}, . . . , r_{v_{q_|q|}}⟩. The output of the intent encoder is the hidden states (h_t^f and h_t^b) at each position t.

Feature generator: This module generates features for each term v_{q_t} in the query q, the features capturing its importance towards defining q's product intent. The importance of a term is not just dependent on the term itself, but also on the context in which the term is used. For example, the same term 3-piece can have different importance for different queries like "3-piece kids dinnerware" and "3-piece mens suit". The term-level features can be captured using the word vectors. To capture the contextual features of a term, we use the output from the intent encoder module. Consider the forward encoder GRU^f. At each position t, GRU^f updates the current summary h_{t-1}^f with the term v_{q_t} at position t. The contribution of the term v_{q_t} towards defining the product intent of the query should be manifested in the extent to which it updates the summary at position t − 1. Therefore, the contribution of a term at position t towards defining the product intent of the query should be a function of the difference in the summaries at positions t and t − 1 (h_t^f − h_{t-1}^f). Similarly, for the reverse encoder GRU^b, it should be a function of the difference in the summaries at positions t and t + 1 (h_t^b − h_{t+1}^b).

For each term v_{q_t} at position t in the query, the output of the feature generator, f_t, is the concatenation of the vectors r_{v_{q_t}}, h_t^f − h_{t-1}^f, and h_t^b − h_{t+1}^b:

f_t = [r_{v_{q_t}}, h_t^f − h_{t-1}^f, h_t^b − h_{t+1}^b].

Weight calculator: This module takes as input the features f_t generated by the feature generator for a term v_{q_t}, and outputs a weight for that term, the weight manifesting the importance the term v_{q_t} has towards defining the product intent of the query q. We model the weight calculator as a multilayer perceptron (MLP) with a single node at the output layer. The weight calculator outputs a weight between 0 and 1. We refer to the weight corresponding to the query term v_{q_t} as s(v_{q_t}). To train the model, we minimize the binary cross-entropy loss [90]. Using the binary cross-entropy loss allows our model to make predictions independently for each term, that is, predicting a high weight for one term will not affect the weights of the other terms. The binary cross-entropy loss associated with all the queries in the collection Q is given by:

L(Q) = − Σ_{q∈Q} Σ_{t=1}^{|q|} [ y(v_{q_t}) log(s(v_{q_t})) + (1 − y(v_{q_t})) log(1 − s(v_{q_t})) ],

where y(v_{q_t}) = 1 if the term v_{q_t} is also present in the reformulated query R(q), and y(v_{q_t}) = 0 otherwise.
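To make the three modules concrete, here is a compact PyTorch sketch of a CTW-style network (intent encoder, feature generator via hidden-state differences, and an MLP weight calculator trained with binary cross entropy). It is a hedged illustration under assumed hyperparameters, not the implementation used for the reported results.

```python
# Sketch of a CTW-style network; hyperparameters, padding handling, and the class
# name are illustrative assumptions.
import torch
import torch.nn as nn

class CTW(nn.Module):
    def __init__(self, vocab_size, word_dim=300, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, word_dim)
        # Intent encoder: bidirectional GRU over the query terms.
        self.gru = nn.GRU(word_dim, hidden, batch_first=True, bidirectional=True)
        # Weight calculator: small MLP producing one weight per term.
        self.mlp = nn.Sequential(nn.Linear(word_dim + 2 * hidden, 10), nn.ReLU(),
                                 nn.Linear(10, 1), nn.Sigmoid())
        self.hidden = hidden

    def forward(self, query):                      # query: (batch, |q|) term ids
        r = self.emb(query)                        # (batch, |q|, d)
        out, _ = self.gru(r)                       # (batch, |q|, 2*hidden)
        h_f, h_b = out[..., :self.hidden], out[..., self.hidden:]
        zeros = torch.zeros_like(h_f[:, :1])
        # Feature generator: word vector plus the change each term induces in the
        # forward and backward summaries (differences with the neighboring states).
        df = h_f - torch.cat([zeros, h_f[:, :-1]], dim=1)   # h_t^f - h_{t-1}^f
        db = h_b - torch.cat([h_b[:, 1:], zeros], dim=1)    # h_t^b - h_{t+1}^b
        feats = torch.cat([r, df, db], dim=-1)     # (batch, |q|, d + 2*hidden)
        return self.mlp(feats).squeeze(-1)         # (batch, |q|) term weights in [0, 1]

# Training target: y(v_qt) = 1 iff the term also appears in R(q), with BCE loss.
model = CTW(vocab_size=12118)
loss_fn = nn.BCELoss()
```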

5.2.2 Contextual query refinement (CQR) model

Given a query ⟨v_{q_1}, . . . , v_{q_|q|}⟩, our task is to find and recommend terms v_i from the vocabulary V that define the product intent of the query q, and hence can be used to refine q. The supervision during training comes from the fact that the terms in the reformulation R(q) define the product intent better than the terms in the original query q, as the products displayed for R(q) satisfied the product intent. We model query refinement as a multilabel classification problem, the label space being the set of all the terms in the vocabulary V.

Figure 5.2 provides an illustration of the contextual query refinement model. The query refinement model has two modules: the intent encoder and the multilabel classifier. The intent encoder takes as input the query q and outputs a representation of the product intent of q. Next, the multilabel classifier uses the output of the intent encoder as input and generates a weight s(v_i) for each term v_i in the vocabulary V, the weight establishing the confidence of that term towards defining q's product intent. We discuss these two modules in detail below.

Intent encoder: Similar to the query term-weighting model, we encode query q using a bidirectional GRU, using Equations (5.1) and (5.2). However, unlike the intent encoder for weighting query terms, we are only interested in the product-intent representation of the complete query, not the representation at individual positions in the query. For the forward encoder GRU^f, the hidden state at position |q| captures the product intent of the query q. Similarly, for the reverse encoder GRU^b, the hidden state at position 1 captures the intent of the query q. We use the concatenation of h_{|q|}^f and h_1^b as the overall intent representation of the query q, and represent it by h_q:

h_q = [h_{|q|}^f, h_1^b].

Multilabel classifier: This module takes the product-intent representation of the query q as input and outputs a weight for each term v_i in the vocabulary V, the weight manifesting the importance the term v_i has towards defining the product intent of the query q. We model the multilabel classifier as a multilayer perceptron (MLP) whose output layer has a number of nodes equal to the size of the vocabulary (|V|). For every term v_i in the vocabulary, the multilabel classifier outputs a weight between 0 and 1. We refer to the weight corresponding to the term v_i as s(v_i). To estimate the model, we minimize the binary cross-entropy loss. The binary cross-entropy loss associated with all the queries in the collection Q is given by:

L(Q) = − Σ_{q∈Q} Σ_{i=1}^{|V|} [ y(v_i) log(s(v_i)) + (1 − y(v_i)) log(1 − s(v_i)) ],

where y(v_i) = 1 if the term v_i is present in the reformulated query R(q), and y(v_i) = 0 otherwise. Both models, i.e., the contextual term-weighting (CTW) model and the contextual query refinement (CQR) model, can generalize to unseen tail queries that have the same vocabulary as the queries in the historical search logs.
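A matching sketch of a CQR-style network follows: the same bidirectional intent encoder, with the final forward and backward states concatenated and fed to a multilabel classifier over the vocabulary. Again, names and sizes beyond those stated in the text are assumptions.

```python
# Sketch of a CQR-style network; sizes not stated in the text are assumptions.
import torch
import torch.nn as nn

class CQR(nn.Module):
    def __init__(self, vocab_size, word_dim=300, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, word_dim)
        self.gru = nn.GRU(word_dim, hidden, batch_first=True, bidirectional=True)
        # Multilabel classifier: MLP whose output layer has one node per vocabulary term.
        self.clf = nn.Sequential(nn.Linear(2 * hidden, 2 * vocab_size), nn.ReLU(),
                                 nn.Linear(2 * vocab_size, vocab_size), nn.Sigmoid())
        self.hidden = hidden

    def forward(self, query):                      # query: (batch, |q|) term ids
        out, _ = self.gru(self.emb(query))         # (batch, |q|, 2*hidden)
        h_f_last = out[:, -1, :self.hidden]        # forward state at position |q|
        h_b_first = out[:, 0, self.hidden:]        # backward state at position 1
        h_q = torch.cat([h_f_last, h_b_first], dim=-1)   # query intent representation
        return self.clf(h_q)                       # (batch, |V|) term confidences

# Training target: y(v_i) = 1 iff v_i appears in R(q), with BCE loss.
loss_fn = nn.BCELoss()
```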

5.3 Experimental methodology

5.3.1 Dataset

To understand the query reformulation patterns in product search, we analyzed the historical search logs of a major e-commerce retailer (walmart.com) and looked at the query transitions of the form a → b, where a and b are two queries within the same session and b is issued immediately after a. The transitions a → b broadly fall into the following five categories:

Transition from a general to specific product intent: For example, a is furniture and b is wooden furniture. The query a in this category is usually short and the set of the terms in a is usually a proper subset of the set of terms in b.

Transition from an incomplete to a complete query (user pressed enter key before finishing the query): For example, a is air condi and b is air conditioner. The query a in this category tends to have spelling errors.

Change of intent: For example, a is bedside lamps and b is light bulbs. The terms in a tend to be different than the terms in b.

Transition from a specific to general product intent: For example, a is 3-piece kids dinnerware and b is kids dinnerware.

Reformulation with the same product intent: For example, a is promo code for motorola phone and b is motorola phone on sale.

We need to extract query pairs of the form (q, R(q)) such that the user is unable to find the intended product from the results retrieved for the query q, reformulates the query to R(q) within the same session, and is able to find the required products from the results displayed for R(q). We assume that the user's engagement with the displayed products (click, add-to-cart, order) is a proxy for the satisfaction of the query's product intent. However, a lack of user engagement does not necessarily mean that the displayed products do not satisfy the query's product intent (for example, a is furniture and b is wooden furniture).

However, for the following two types of transitions, we can say that the product intent was not satisfied for the initial query: Transition from a specific to general product intent and Reformulation with the same product intent. We could manually look into the query reformulations to collect the required query pairs, but this approach is expensive in terms of both time and money. Instead, we generate the required query pairs by performing the following steps:

• We sampled an initial dataset of query pairs (a, b) from the historical search-log data spanning eleven months (July '17 to May '18), such that the user searches for b after query a, and the user searches at most two other queries between a and b.

• We only keep the pairs for which there is an add to cart (ATC) event associated with b. This restricts b to the queries for which the product intent is satisfied.

• We limit a to only rare queries (fewer than 300 occurrences over a 60-day period, and the fraction of times a user clicks any product displayed for a, i.e., the click-through rate, is less than 5%). Note that we limit ourselves to the rare queries because frequent queries with a small click-through rate can be attributed to crawling by bots, which is a common practice in e-commerce by online retailers to monitor prices, product descriptions, etc.

• We only look into the pairs such that the Jaccard similarity [56] between the sets of terms in a and b is at least 0.2. This ensures that the product intent does not change drastically between the queries.

• We limit our dataset to contain only those a’s whose constituent terms’ frequency is more than 100 over the complete dataset (if some term is frequent, it is probably not a spelling error). This filters out the query pairs for which a’s product intent is not satisfied as a result of spelling errors.

• Since the lack of user engagement does not necessarily mean that the displayed products do not satisfy the query's product intent (when the initial query has a very general product intent, for example, the initial query is furniture and the reformulated query is wooden furniture), we limit our dataset to the pairs such that the set of terms in a is not a proper subset of the set of terms in b. We also only keep pairs with a having at least three terms, as the queries with a general product intent tend to be shorter.

From the remaining pairs, we create a training-test-validation split. We have 722,235 query pairs in the training set, 50,000 query pairs in the test set and 5,000 pairs in the validation set. The vocabulary contains 12,118 terms. Note that the approaches we developed can be generalized to all queries. However, from the e-commerce perspective, we can use the historical engagement, e.g., clicks, add-to-cart, and conversions, for the queries that have been issued by the users in the past to retrieve the relevant items for those queries. For the same reason, we constructed our dataset to only contain the reformulations where the initial query is a rare query. Hence, we have restricted our training and evaluation to the queries that are not frequent, i.e., cold-start or tail queries, and these queries constitute a significant chunk of the overall search traffic of the e-commerce retailer. The impact of the developed approaches is greater for the tail queries, as we do not have any prior historical information to retrieve all the relevant items for these queries.

5.3.2 Evaluation Methodology and Performance Assessment

Methodology and metrics

To evaluate our approach on the term-weight prediction problem, we performed two tasks. We describe our evaluation methodology and metrics for the term-weighting and query refinement problems below.

Term-weighting: We need to evaluate how well the intent term weighting improves the ranking of the search results in response to a query. To apply the intent term weighting for ranking products in response to a query, we first retrieve the top products for the query using a BM25F [107] based retrieval algorithm. The original relevance score of each product (without the intent term weighting) is calculated as the sum of the corresponding BM25F scores for each term in the query. Then, the individual BM25F score is scaled (boosted) with the corresponding computed term weight, and re-ranking is performed with the modified scores. We calculate the Mean Reciprocal Rank (MRR), which is the average of the reciprocal of the rank of the relevant product in response to a query. The relative performance improvement in MRR is given by

MRR_ratio = MRR_boost / MRR_BM25F,

where MRR_boost is the MRR when we have scaled the BM25F scores of each term with the corresponding term weight, and MRR_BM25F is the MRR when we have used the BM25F scores of each term as is.

Moreover, we also evaluated our approach on how well it is able to estimate weights for the query terms in order of their importance towards defining the product intent of the query. We assume that, if a term is important, it should also be present in the reformulated query, for which the displayed products satisfy the product intent. Therefore, to evaluate our approach on the term-weighting problem, we looked at how well our model is able to estimate higher weights for the terms in the set q ∩ R(q) than for the terms in the set q \ R(q). For this prediction task, we used Precision@k (P@k), which is a popular metric used in the multilabel classification literature [2]. Given a list of ground-truth labels ranked according to their relevance, P@k measures the precision of predicting the first k labels from this list. Since we cannot score the ground-truth labels for the term-weighting task, that is, the ground-truth labels are binary, we use k = nnz, where nnz is the output sparsity of the ground-truth labels. We also ignored the stop words in the calculation of P@k. For example, let's say the initial query is "promo code for motorola phone" and the reformulated query is "motorola phone on sale". In this case, the words motorola and phone are retained in the reformulation, and hence we use nnz = 2 for the term-weighting problem. We report the averaged P@nnz (AP@nnz) over all the test instances. We also present results for k = 1, 2, 3, as the queries in product search tend to be short (the average length of the reformulated queries in our test set is 3.6).

Query refinement: The terms that appear in the reformulation R(q), for which the query's product intent is satisfied, describe the product intent better than the terms in the initial query q. Therefore, if we can predict the terms in the reformulation, we can refine the query with these predicted terms. We evaluated our query refinement approach on how well it is able to predict the terms in the reformulated query for which the displayed products satisfy the product intent. Similar to the term-weighting task, we report the averaged P@nnz (AP@nnz) over all the test instances. P@nnz in this case measures how many of the top nnz predicted terms also appear in the reformulated query. Let's say the initial query is "promo code for motorola phone" and the reformulated query is "motorola phone on sale". For the query refinement problem, we use nnz = 3, corresponding to the three non-stop words in the reformulated query (motorola, phone and sale). We also present results for k = 1, 2, 3.
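For reference, a small Python sketch of the two measures follows (reciprocal rank for MRR, and precision at k = nnz); stop-word handling and tie-breaking conventions are assumptions made for illustration, and the toy weights below are hypothetical.

```python
# Illustrative helpers for the evaluation metrics; conventions are assumptions.
def reciprocal_rank(ranked_products, relevant_product):
    """1 / rank of the relevant product in the ranked result list (0 if absent)."""
    for rank, product in enumerate(ranked_products, start=1):
        if product == relevant_product:
            return 1.0 / rank
    return 0.0

def precision_at_k(scored_terms, ground_truth_terms, k):
    """Fraction of the k highest-scoring terms that appear in the ground-truth set."""
    top_k = sorted(scored_terms, key=scored_terms.get, reverse=True)[:k]
    return sum(t in ground_truth_terms for t in top_k) / k

# Example: query "promo code for motorola phone" reformulated to "motorola phone on sale";
# for term-weighting, nnz = 2 (motorola and phone are retained). Weights are hypothetical.
weights = {"promo": 0.1, "code": 0.1, "for": 0.0, "motorola": 0.7, "phone": 0.9}
print(precision_at_k(weights, {"motorola", "phone"}, k=2))  # -> 1.0
```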

Baselines

Our approaches use the contextual information for term-weighting and query refinement. To illustrate the advantage of using contextual information, we compared our approaches against the baselines that do not take the context into consideration.

Term-weighting: Given a query q and a term v_{q_t} in q, s(v_{q_t}) denotes the confidence of the word v_{q_t} towards defining q's product intent. As mentioned earlier, we assume that if the term v_{q_t} has high confidence towards defining the product intent of q, v_{q_t} should also be present in the reformulated query R(q). Hence, we can calculate s(v_{q_t}) from the historical data as follows:

s(v_{q_t}) = P(v_{q_t} ∈ R(q) | v_{q_t} ∈ q)
           = P(v_{q_t} ∈ R(q), v_{q_t} ∈ q) / P(v_{q_t} ∈ q).

Using maximum likelihood estimation, we have

s(v_{q_t}) = Σ_{q'∈Q} 1(v_{q_t} ∈ R(q'), v_{q_t} ∈ q') / Σ_{q'∈Q} 1(v_{q_t} ∈ q'),

where Σ_{q'∈Q} 1(v_{q_t} ∈ R(q'), v_{q_t} ∈ q') denotes the number of times the term v_{q_t} occurs in any query q' ∈ Q and is retained in the reformulated query R(q'), and Σ_{q'∈Q} 1(v_{q_t} ∈ q') denotes the number of times v_{q_t} occurs in the initial query q' ∈ Q. We call this method of estimating query term-weights frequentist term-weighting (FTW).

As discussed in Section 2.3.1, term-weighting methods that rank terms according to their discriminating power are not well suited to rank terms according to their contribution towards the product intent of the query. A term might be discriminatory with respect to a query (that is, the term occurs in a small number of queries), but this does not necessarily mean that the term is important towards defining that query's product intent. To illustrate this, we also compared our approach against term frequency-inverse document frequency (TF-IDF).

As a competing approach, we also compared our approach against one of the methods that use click-through logs over query-product pairs to estimate vector representations for both queries and products in the vocabulary space of search queries. We use the vector propagation method (VPCG) to learn the vector representations of search queries and products in the same space [59]. Next, we learn the representation of the n-grams (uni- and bi-grams) present in the queries by a weighted sum of the representations of the queries having these n-grams. We estimate the vector representation of a new query by a weighted linear combination of the representations of the n-grams present in the query (VPCG & VG). These weights of the individual n-grams are estimated by minimizing the square of the Euclidean distance between the representation of a query vector and the weighted linear combination of the representations of the n-grams in the query. The relevance score of a product for a query is calculated by the scalar product of the vector representations of the product and the query. Note that we can use the estimated weight for each unigram as the importance weight for that term, but there is only one weight estimated per term, and hence this approach does not take the context into consideration.

Query refinement: For the query refinement problem, we need to calculate the confidence of each term in the vocabulary towards defining the product intent of the query. Given any word v_i in the vocabulary, let s(v_i) denote the confidence of the word v_i towards defining the product intent of the query q. We define s(v_i) as the sum of the probabilities of the term v_i appearing in the reformulation given each of the terms v_{q_t} in the query q, that is, we calculate s(v_i) as

s(v_i) = Σ_{v_{q_t}∈q} P(v_i ∈ R(q) | v_{q_t} ∈ q)
       = Σ_{v_{q_t}∈q} P(v_i ∈ R(q), v_{q_t} ∈ q) / P(v_{q_t} ∈ q)
       = Σ_{v_{q_t}∈q} [ Σ_{q'∈Q} 1(v_i ∈ R(q'), v_{q_t} ∈ q') / Σ_{q'∈Q} 1(v_{q_t} ∈ q') ].

We call this method of finding query refinement terms frequentist query refinement (FQR). FQR can be thought of as discovering term association patterns, as proposed by Wang and Zhai [124]. However, instead of discovering these patterns from the co-occurrence statistics of the terms within a query, we discover them from the query reformulations, thus making these patterns task-specific for a fair evaluation.
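As a concrete reference, the following is a small Python sketch of the FTW estimator computed from (q, R(q)) pairs; FQR is the analogous computation that additionally counts, for every vocabulary term, how often it appears in R(q') when a given query term appears in q'. The pairs variable and function name are illustrative assumptions.

```python
# Sketch of the frequentist term-weighting (FTW) baseline from (q, R(q)) pairs;
# `pairs` is an assumed list of (query_terms, reformulation_terms).
from collections import Counter

def estimate_ftw(pairs):
    """s(v) = #{q' : v in q' and v in R(q')} / #{q' : v in q'}."""
    occurs, retained = Counter(), Counter()
    for query_terms, reform_terms in pairs:
        reform_set = set(reform_terms)
        for term in set(query_terms):
            occurs[term] += 1
            if term in reform_set:
                retained[term] += 1
    return {term: retained[term] / occurs[term] for term in occurs}

pairs = [("promo code for motorola phone".split(), "motorola phone on sale".split()),
         ("battery night light with timer".split(), "munchkin night light".split())]
ftw = estimate_ftw(pairs)
print(ftw["phone"], ftw["promo"])   # -> 1.0 0.0 on this toy sample
```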

Parameter selection

We explored different hyperparameters on the validation set to select our final hyperparameters. For both term-weighting and query refinement, we used 300-dimensional word vectors as input to the GRUs in the intent encoder. We initialized the word vectors using pre-trained skip-gram vectors [88] trained on one year of query data. We used 2-layer GRUs for both forward and reverse encoding, with a 256-dimensional hidden layer. For regularization, we used a dropout [116] of 0.25 between the first and second layers of both GRUs. For the weight calculator module in term-weighting, we used a two-layer multilayer perceptron (MLP) with 10 nodes in the hidden layer. For the multilabel classifier module in query refinement, we used a two-layer multilayer perceptron (MLP) with 2 × |V| nodes in the hidden layer (there are |V| nodes in the output layer). For both of these MLPs, we used a dropout of 0.25 between the input and hidden layers. We applied the Rectified Linear Unit (ReLU) non-linearity on the hidden layer and the sigmoid function on the output-layer nodes. For both models, we used the ADAM [64] optimizer with 0.001 as the initial learning rate and a batch size of 512. We trained both models for 20 epochs.

For VPCG & VG, we experimented with different dimensionalities of the vector representations and selected the final value, 50, based on the ranking performance on a validation set. We ran the vector propagation method for a maximum of 50 iterations, or until the ranking performance stopped improving on the validation set. We used stochastic gradient descent to learn the weights for the different n-grams, with 0.001 as the learning rate.

Table 5.2: Results for the query term-weighting problem.

Model        MRR_ratio   AP@nnz   AP@1    AP@2    AP@3
CTW*         1.030       0.746    0.792   0.804   0.832
FTW          1.019       0.699    0.746   0.750   0.790
TF-IDF       1.006       0.548    0.514   0.603   0.700
VPCG & VG    0.694       0.558    0.537   0.616   0.709

* The performance of the approach was found to be statistically significant (p-value < 0.05 using paired t-test).

5.4 Results and Discussion

5.4.1 Quantitative evaluation

Term-weighting

Table 5.2 shows the performance statistics of the various methods for the term-weighting problem. The intent term weighting performed by our proposed method CTW, as well as by FTW and TF-IDF, leads to a better ranking of the search results, as shown by the improvement on the MRR metric. CTW leads to the maximum increase in the MRR, achieving an improvement of 3% over the case when no intent term weighting is applied. This illustrates the advantage of using contextual knowledge to find the important terms in a search query. We found the difference between the reciprocal ranks (RR) for CTW and FTW to be statistically significant according to the paired sample t-test (p-value < 0.05). Moreover, the poor performance of TF-IDF as compared to the non-contextual baseline (FTW) shows that the discriminatory power of a term does not necessarily correlate with the importance of that term towards defining the product intent of the query. Our competing approach VPCG & VG gives a worse ranking of the search results than the baseline BM25F based retrieval algorithm, further demonstrating the advantage of using contextual knowledge to identify the important terms in a search query, especially for the tail queries. Figure 5.3 shows how the MRR is affected by the increase in query length. For all query lengths, CTW gives a better ranking of the search results than the baselines. However, for all the methods, the performance usually drops with the increase in query length, as noisy terms are expected to appear in longer queries.


Figure 5.3: Variation of the MRR with query length.

Table 5.3: Results for the query refinement problem.

Model    AP@nnz   AP@1    AP@2    AP@3
CQR*     0.611    0.794   0.727   0.680
FQR      0.591    0.753   0.691   0.657

* The performance of the approach was found to be statistically significant (p-value < 0.01 using a paired t-test).

The same trend is seen on the other metrics too. The relative performance gain of CTW over FTW, in terms of AP@nnz, is 6.7%. We found the difference between the AP@nnz for CTW and FTW to be statistically significant according to the paired sample t-test (p-value < 0.01). Even on AP@1, AP@2 and AP@3, CTW performs significantly better than FTW, showing that the top 1, 2 and 3 terms predicted by the CTW model are better at explaining the query's product intent than the terms predicted by the FTW model. We provide further qualitative discussion regarding this in Section 5.4.2.

Query refinement

Table 5.3 shows the performance statistics for the query refinement problem. Similar to the term-weighting problem, our proposed method contextual query refinement (CQR) beats the non-contextual baseline on all the metrics. The relative performance gain of CQR over FQR, in terms of AP@nnz, is 3.4%. We found the difference between the AP@nnz for CQR and FQR to be statistically significant according to the paired sample t-test (p-value < 0.01). Similar to term-weighting, CQR also performs significantly better than FQR on the AP@1, AP@2 and AP@3 metrics, showing that the top 1, 2 and 3 terms predicted by the CQR model are better at explaining the query's product intent than the terms predicted by the FQR model. We provide further qualitative discussion regarding this in Section 5.4.2.

5.4.2 Qualitative evaluation

Term-weighting

In order to visualize the weights produced by CTW and compare them with the non-contextual baselines, we looked into a few selected search queries and the weights estimated for their terms by the various methods. Table 5.4 shows the predictions for some of the selected queries. The weights are normalized so that the maximum weight assigned to a term is one. Table 5.4a shows an example of a query that the user reformulated to capture a more general product intent. The initial query is "battery night light with timer", for which the product intent is a night light with some particular attributes. The product intent for the query was not satisfied, and the user reformulated the query to "munchkin night light". Without the contextual information, it would be difficult to estimate the product intent because of the presence of terms like battery and timer. TF-IDF gives the highest weight to the term timer, because timer is less frequent than the other terms, but the term timer will produce irrelevant results if it is not accompanied by night light. Similarly, FTW gives the highest weight to the term battery, because battery is a common term that defines the intent of many queries like "battery operated fan", "battery charger", etc. The term battery is retained in a large number of reformulations, thus FTW gives it a high weight. In a similar manner, VPCG & VG gives a higher weight to the term battery than to the term night, because battery is a common term that defines the intent of many queries. In comparison, CTW is able to estimate the correct product type and gives the highest weights to the terms night and light. Similarly, Table 5.4e shows an example where the initial query is "auto seat cover wonder woman" and the reformulated query is "auto seat cover". CTW is able to estimate the correct product type and gives the highest weights to the terms auto, seat and cover, while the other approaches fail to do so.

Table 5.4: Predicted term-weights for some selected queries.

Model       battery  night  light  with  timer
CTW         0.73     0.80   1.00   0.51  0.29
FTW         1.00     0.77   0.88   0.49  0.97
TF-IDF      0.62     0.80   0.61   0.48  1.00
VPCG & VG   0.83     0.09   1.00   0.00  0.53

(a) Initial query is “battery night light with timer” and the reformulated query is “munchkin night light”.

Model       cars  shaving  kit
CTW         0.58  1.00     0.77
FTW         0.79  1.00     0.70
TF-IDF      0.72  1.00     0.66
VPCG & VG   1.00  0.74     0.03

(b) Initial query is “cars shaving kit” and the reformulated query is “kids shaving kit”.

Model       12    piece  gold  flatware  set
CTW         0.71  0.63   0.55  1.00      0.89
FTW         0.69  0.73   0.76  1.00      0.70
TF-IDF      0.66  0.68   0.62  1.00      0.46
VPCG & VG   1.00  0.01   0.01  0.38      0.22

(c) Initial query is “12 piece gold flatware set” and the reformulated query is “flatware set for 12”.

Model       helmets  for   electric  scooters  for   girls
CTW         1.00     0.90  0.82      0.69      0.79  0.66
FTW         0.95     0.58  0.91      1.00      0.58  0.72
TF-IDF      1.00     0.34  0.67      0.98      0.34  0.52
VPCG & VG   1.00     0.54  0.02      0.96      0.54  0.04

(d) Initial query is “helmets for electric scooters for girls” and the reformulated query is “helmets”.

Model       auto  seat  cover  wonder  woman
CTW         0.78  1.00  0.94   0.39    0.21
FTW         0.52  0.85  0.59   0.91    1.00
TF-IDF      0.92  0.71  0.74   1.00    0.73
VPCG & VG   0.19  0.10  0.07   1.00    0.16

(e) Initial query is “auto seat cover wonder woman” and the reformulated query is “auto seat cover”.

* The bold terms refer to the terms retained in the reformulated query.

Table 5.5: Top 20 predicted terms for a few selected queries.

Model  Top predicted terms
CQR    hose, nozzle, orbit, garden, water, red, high, wand, pressure, hoses, gilmour, flexible, rain, house, quick, nozzles, sprayer, valve, gutter, connector
FQR    hose, water, garden, orbit, red, nozzle, better, gum, homes, timer, spray, bottle, pressure, washer, outdoor, rv, home, spearmint, adapter, black

(a) Initial query is “orbit red garden hose water nozzle” and the reformulated query is “orbit hose nozzle”.

Model  Top predicted terms
CQR    ball, pit, balls, little, tikes, tykes, bounce, soccer, indoor, playground, pits, kids, plastic, sports, play, house, basketball, toys, put, set
FQR    little, tikes, ball, balls, pit, fire, kids, tennis, set, toy, boss, golf, baby, toys, pony, play, girls, car, table, grill

(b) Initial query is “little tikes balls for ball pit” and the reformulated query is “ball pit balls”.

Model  Top predicted terms
CQR    physical, therapy, tools, strap, ankle, tool, stretching, machine, braces, kids, yoga, wedge, pain, inversion, seller, foot, body, support, relief, hands
FQR    therapy, ankle, socks, physical, tools, roller, foam, gift, weights, mens, black, body, kids, cards, set, cushion, womens, tool, oil, wedge

(c) Initial query is “physical therapy tools for ankle” and the reformulated query is “physical therapy tools”.

Model  Top predicted terms
CQR    paint, house, exterior, outdoor, spray, wood, white, metal, plastic, chalk, based, rust, kit, concrete, kits, gloss, gallon, cream, hide, color
FQR    paint, outdoor, cream, house, spray, kids, white, set, ice, black, lights, christmas, light, acrylic, table, coffee, baby, dog, maxwell, face

(d) Initial query is "outdoor paint for house cream" and the reformulated query is "exterior paint for house".

Model  Top predicted terms
CQR    liner, easy, shelf, kitchen, paper, laminate, duck, x, shelving, oven, shelves, lining, liners, granite, roll, top, contact, plastic, rolls, peel
FQR    paper, kitchen, shelving, liner, easy, shelf, wire, storage, shower, set, x, toilet, plastic, white, black, unit, wall, bags, towels, trash

(e) Initial query is "kitchen shelving easy liner paper" and the reformulated query is "easy liner shelf liner".

Query refinement

We looked into a few search queries and the top terms predicted for them by the CQR model, and compared them with the ones predicted by the FQR model. Table 5.5 shows the top 20 predicted terms for some of these queries. Table 5.5a shows an example of a query and its reformulation. The initial query is "orbit red garden hose water nozzle", for which the product intent is a water spray nozzle of brand orbit and color red. The product intent for the query was not satisfied, and the user reformulated the query to "orbit hose nozzle". Without the contextual information, it is difficult to estimate the product intent because orbit is also a popular chewing gum brand. FQR ends up predicting terms like gum and spearmint because of the presence of the term orbit. In comparison, CQR is able to correctly estimate the product type water spray nozzle and predicts only the terms relevant to it. The other examples show a similar trend. We present the concluding remarks in Chapter 8.

Chapter 6

Distant-supervised slot-filling for e-commerce queries

Online shopping accounts for an ever-growing portion of total retail sales.1 A way to help customers find what they are searching for is to analyze a customer's product search query in order to identify the different product characteristics that the customer is looking for, i.e., intended product characteristics such as product type, brand, gender, size, color, etc. For example, for the query "nike men black running shoes", the term nike describes the brand, men describes the gender, black describes the color, and the terms running and shoes describe the product type. Once the query's intended product characteristics are understood, they can be used by a search engine to return results that correspond to the products whose attributes match these intended characteristics, or as a feature in various learning-to-rank methods [61]. Slot-filling refers to the task of annotating individual terms in a query with the corresponding intended product characteristics, where each product characteristic is a key-value pair, e.g., key: brand, value: Nike Inc. Figure 6.1 illustrates the role of slot-filling in understanding the query's product intent. Slot-filling can be thought of as an instance of entity resolution or entity linking, which is the problem of mapping the textual mentions of entities to their respective entries in a knowledge base.

1 https://www.census.gov/programs-surveys/arts.html


Query:       nike       men     black   running shoes
Slot-keys:   brand      gender  color   product type
Slot-values: Nike Inc.  Mens    Black   Athletic shoes

Figure 6.1: Query understanding with slot-filling.

In the case of search queries, these entities are the predefined set of slots (key-value pairs). Slot-filling has traditionally been treated as a word sequence labeling problem, which assigns a tag (slot) to each word in the given input word sequence. Traditional approaches require the availability of tagged sequences as training data [126, 98, 74, 104, 125, 57, 128, 85, 130, 129, 84, 70, 122, 136]. However, generating such labeled data is expensive and time-consuming. To overcome this problem, approaches have been developed that work in the absence of labeled data [125, 99, 138, 50]. However, these approaches are either specific to the domain of spoken language understanding or assume that the words are very similar to the slot values. To address the problem of lack of labeled data, we develop distant-supervised approaches, which use engagement data that is readily available in search engine query logs and does not require any manual labeling effort. We present probabilistic generative models that use the products that are engaged (e.g., clicked, added-to-cart, ordered) for the queries in e-commerce search logs as a source of distant-supervision [89]. The key insight that we exploit is that in most cases, if a particular query term is associated with products that have a specific value for a slot, e.g., brand: Nike Inc., then there is a high likelihood that the query term will have that particular slot. Hence, the slots for a query are a subset of the characteristics of the engaged products for that query. Since our approaches are distant-supervision based, they do not need any information about this subset or the mapping of the slots to query words. Moreover, they also leverage the co-occurrence information of the product characteristics to achieve better performance. To the best of our knowledge, our work is the first of its kind that leverages engagement data in search logs for the slot-filling task. We evaluated our approaches on their impact on retrieval and on correctly predicting the slots. In terms of the retrieval task, our approaches achieve up to 156% better NDCG performance than the Okapi BM25 similarity measure. Additionally, our approach leveraging the co-occurrence information among the slots gained ≈ 3.3% improvement over the approach that does not leverage the co-occurrence information. With respect to correctly predicting the slots, our approach leveraging the co-occurrence information gained an 8% performance improvement over the approach that does not leverage the co-occurrence information. Therefore, our work provides an easy solution for slot-filling, by only using the readily available search query logs. The remainder of the chapter is organized as follows. Section 6.1 presents the definitions and notation used in the chapter. The chapter discusses the developed methods in Section 6.2, followed by the details of our experimental methodology in Section 6.3. Section 6.4 discusses the results.

6.1 Definitions and Notations

Let V be the vocabulary, i.e., the set of distinct terms that appear across the set of queries Q on a website. A query q is a sequence of |q| terms from the vocabulary V, that is, q = ⟨vq1, ..., vqi, ..., vq|q|⟩. Each query q is associated with a set aq of product characteristics (slots), also referred to as the product intent in e-commerce. The collection of aq, ∀q is denoted by A. Each slot is a key-value pair (e.g., brand: Nike Inc.); let M denote the set of all possible slots and L denote the set of all possible slot-keys (e.g., product-type, brand, color, etc.). Let yq,i be the slot associated with the term vqi in query q, and let yq denote the sequence of slots for the terms in the query q, i.e., yq = ⟨yq,1, ..., yq,|q|⟩. The collection of yq, ∀q is denoted by Y. Given a query q, let cq be the set of slots extracted from the characteristics of the products that any user who issued q engaged with. The product intent aq is a subset of cq. The set cq (and hence aq) is constrained to have at most one slot for each unique slot-key, i.e., cq cannot contain multiple brands, product-types, etc. The collection of cq, ∀q is denoted by C. Let zq denote the product category of query q, let K represent the set of all the product categories, and let Z denote the collection of zq, ∀q. Let ωq denote a set with ωq,i = 1 if the candidate slot cq,i is present in the slot-set aq, and 0 otherwise. Let I denote the collection of all possible candidate slot-sets. Throughout the chapter, the superscript −(q, i) on a collection symbol denotes that collection excluding position i in query q. For example, Y^−(q,i) is the collection of all slot sequences, excluding the one at position i of query q. Similarly, the superscript −(q) on a collection symbol denotes that collection excluding the query q. A star (∗) in place of a symbol indicates a collection with the star taking all possible values; for example, φ∗ denotes the collection of φi, ∀i. Table 6.1 provides a reference for the notation used in the chapter.

The discrete uniform distribution, denoted by Uniform(·|u), has a finite number of values from the set u, each value equally likely to be observed with probability 1/|u|. The Dirichlet distribution, denoted by Dir(·|λ), is parameterized by a vector λ of positive reals; a k-dimensional Dirichlet random variable x takes values in the (k−1)-simplex. The categorical distribution, denoted by Cat(·|x), is a discrete probability distribution that describes the possible results of a random variable that can take on one of k possible categories, with the probability of each category separately specified in the vector x. The Dirichlet distribution is the conjugate prior of the categorical distribution. The posterior mean of a categorical distribution with a Dirichlet prior is given by

xi = (λi + ni) / Σ_{j=1}^{k} (λj + nj),  ∀i ∈ [1, ..., k],     (6.1)

where ni is the number of times category i is observed. The Bernoulli distribution, denoted by Bern(·|p), is a discrete distribution having two possible outcomes, i.e., selection, which occurs with probability p, and rejection, which occurs with probability 1 − p. In this work, α represents the Dirichlet prior for sampling the product categories; β is a set of |K| vectors, such that βk denotes the Dirichlet prior for sampling the slots from the product category k; and γ is the Bernoulli parameter for selecting/rejecting a slot from cq.

δ is a set of |M| vectors, where δm denotes the Dirichlet prior for the query words from the slot m, and ζ is a set of |M| vectors, where ζm denotes the Dirichlet prior for transitioning from the slot m. φ is the probability distribution of the product categories in the query collection. χ is a set of |K| vectors, where χk denotes the probability distribution of the slots in the product category k. ψ is a set of |M| vectors, where ψm denotes the probability distribution of the query words in the slot m. Finally, υ is a set of |M| vectors, where υm denotes the transition probability distribution from the slot m.
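As a quick illustration of the posterior-mean update in Equation (6.1), the following Python sketch (not part of the original text; the counts and priors are hypothetical toy values) computes the smoothed categorical parameters from observed category counts and a Dirichlet prior:

import numpy as np

def dirichlet_categorical_posterior_mean(counts, prior):
    """Posterior mean of a categorical distribution with a Dirichlet prior.

    counts[i] -- number of times category i was observed (n_i in Equation (6.1))
    prior[i]  -- Dirichlet pseudo-count for category i (lambda_i in Equation (6.1))
    Returns x with x[i] = (lambda_i + n_i) / sum_j (lambda_j + n_j).
    """
    counts = np.asarray(counts, dtype=float)
    prior = np.asarray(prior, dtype=float)
    return (prior + counts) / np.sum(prior + counts)

# Example: 3 categories observed 5, 0 and 2 times with a symmetric prior of 0.1.
print(dirichlet_categorical_posterior_mean([5, 0, 2], [0.1, 0.1, 0.1]))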

Table 6.1: Notation used throughout the chapter.

Symbol   Description
V        Vocabulary (set of all the terms).
q        A query.
Q        Collection of all the queries q.
zq       The product category of the query q.
Z        Collection of zq, ∀q.
aq       The slot-set for the query q.
A        Collection of aq, ∀q.
cq       The candidate slot-set for the query q.
C        Collection of cq, ∀q.
I        Collection of all possible candidate slot-sets.
ωq       A set with ωq,i = 1 if the candidate slot cq,i is present in the slot-set aq, and 0 otherwise.
Ω        Collection of ωq, ∀q.
yq       The sequence of slots for the terms in the query q.
Y        Collection of yq, ∀q.
K        Set of all the product categories.
M        Set of all the possible slots (product characteristics).
L        Set of all the possible slot-keys.
α        Dirichlet prior for sampling the product categories.
β        Set of |K| vectors, βk denoting the Dirichlet prior for sampling the slots from the product category k.
γ        Bernoulli parameter for selecting/rejecting a slot from cq.
δ        Set of |M| vectors, δm denoting the Dirichlet prior for the query words from the slot m.
ζ        Set of |M| vectors, ζm denoting the Dirichlet prior for transitioning from the slot m.
φ        Probability distribution of the product categories in the query collection.
χ        Set of |K| vectors, χk denoting the probability distribution of the slots in the product category k.
ψ        Set of |M| vectors, ψm denoting the probability distribution of the query words in the slot m.
υ        Set of |M| vectors, υm denoting the transition probability distribution from the slot m.

Algorithm 5 Outline of the generative process for slot-filling
1: for each query q ∈ Q do
2:    Generate cq from I; aq from cq; yq from aq
3:    for each i = 1 to |q| do
4:       Generate vq,i from yq,i
5:    end for
6: end for

6.2 Slot Filling via Distant Supervision

To find the most appropriate slot for the terms in a query when labeled training data is unavailable, we developed generative probabilistic approaches. Our approaches leverage the information about the engaged products from the historical search logs as a source of distant-supervision. The training data for our approaches are the search queries and the product characteristics of the engaged products, which form the corresponding candidate slot-sets. For example, for the search query "nike men running shoes", the characteristics of an engaged product are {product-type: athletic shoes, brand: Nike Inc., color: red, size: 10, gender: mens}. These characteristics form the candidate slot-set for the search query. Given a query q and its candidate slot-set cq, our approaches find the most probable slot-sequence yq, such that yq,i ∈ cq, ∀i. Our approaches follow some variation of the common generative process outlined in Algorithm 5. The generative process has four unknowns: generating the candidate slot-set cq from I, generating the product intent aq from cq, generating the slot sequence yq from aq, and generating each query word vq,i from yq,i. To generate vq,i from yq,i, our approaches model each slot m as a categorical distribution over the vocabulary V. The categorical distribution for the slot m, denoted by ψm, is sampled from a Dirichlet distribution with prior δm. Our first approach, uniform slot distribution (USD), arguably the simplest approach, assumes that cq is sampled from I under a discrete uniform distribution. USD directly generates yq from cq, thus bypassing generating aq. USD samples each slot yq,i in the slot-sequence independently from a uniform discrete distribution. The rest of our approaches relax the independence assumptions of USD and leverage different ideas to capture the co-occurrence of the slots. Markovian slot distribution (MSD) leverages the idea that two co-occurring slots (m1 and m2) should have high transition probabilities, i.e., the probability of m2 followed by m1 and/or of m1 followed by m2 should be relatively high. Correlated uniform slot distribution (CUSD) extends

USD so as to sample cq from I such that the slots in cq are more probable to co-occur. To model this co-occurrence, CUSD leverages the idea that the product intent of a search query is sampled from one of the product categories (set of semantically similar product characteristics). As opposed to CUSD which models all the slots in cq as belonging to the same product category, correlated uniform slot distribution with subset selection

(CUSDSS) assumes that only a subset of cq belongs to a product category, and that this subset corresponds to the actual product intent of the query (aq). For example, consider a laundry detergent product with the following slots: {product-type: laundry detergent, brand: tide, color: yellow}; we expect that color is not an important attribute when someone is buying laundry detergent. The plate notation for the approaches developed in this chapter is shown in Figure 6.2. In the remainder of this section, we discuss these approaches.

6.2.1 Uniform Slot Distribution (USD)

The uniform slot distribution (USD) model assumes that cq is sampled from I under a uniform distribution. Further, each slot is sampled independently from a uniform distribution, i.e.,

P(yq | cq) = ∏_{i=1}^{|q|} P(yq,i | cq) = ∏_{i=1}^{|q|} 1/|cq|.

Algorithm 6 and Figure 6.2(a) show the generative process and the plate notation for USD, respectively.

Learning:

As mentioned before, our approaches model each slot as a categorical distribution over the vocabulary V. The probability distribution for the slot m is denoted by ψm, and is sampled from a Dirichlet distribution with prior δm. We use Gibbs sampling to perform approximate inference. The parameters that we need to estimate are ψ and Y. However, since the Dirichlet is the conjugate prior of the categorical distribution, we can integrate out ψ and use collapsed Gibbs sampling [37] to estimate only Y.

Figure 6.2: Plate notation of the developed approaches: (a) USD, (b) MSD, (c) CUSD, (d) CUSDSS.

Once the inference is complete, ψ can then be computed from Y. Therefore, we only have to sample the slot for each query word from its conditional distribution given the remaining parameters, which is given by

P(yq,i | Q, C, Y^−(q,i), δ∗) ∝ (δ_{yq,i, vq,i} + T^−(q,i)(yq,i, vq,i)) / Σ_{v′∈V} (δ_{yq,i, v′} + T^−(q,i)(yq,i, v′)),     (6.2)

where T(a, b) is the count of word a tagged with the slot b. The RHS in Equation (6.2) is exactly the posterior distribution of ψ_{yq,i}, excluding the current assignment. Once the learning is complete, ψm, ∀m can be estimated as in Equation (6.1).

Algorithm 6 Generative process for USD
1: for each characteristic m ∈ M do

2:    Generate ψm = ⟨ψm,1, ..., ψm,|V|⟩ ∼ Dir(·|δm)
3: end for
4: for each query q ∈ Q do
5:    Generate cq ∼ Uniform(·|I)
6:    for each i = 1 to |q| do
7:       Generate yq,i ∼ Uniform(·|cq); vq,i ∼ Cat(·|ψ_{yq,i})
8:    end for
9: end for

Inference:

The slot for a query word vq,i is simply the slot m for which ψ_{m, vq,i} is maximum, i.e., yq,i = argmax_{m ∈ cq} ψ_{m, vq,i}.
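To make the USD learning and inference steps concrete, the sketch below implements the collapsed Gibbs update of Equation (6.2) and the argmax inference rule. It is a minimal illustration, assuming a symmetric prior delta, integer word and slot ids, and that cq is observed for every training query; the variable and function names are ours, not from the original implementation.

import numpy as np

def usd_collapsed_gibbs(queries, cand_slots, n_slots, n_words, delta=0.1,
                        n_iters=50, rng=np.random.default_rng(0)):
    """Collapsed Gibbs sampling for USD.

    queries[q]    -- list of word ids for query q
    cand_slots[q] -- list of candidate slot ids for query q (the set c_q)
    Returns the slot-word count matrix T and the slot assignments y.
    """
    T = np.zeros((n_slots, n_words))                 # T[m, v]: word v tagged with slot m
    y = [[rng.choice(cand_slots[q]) for _ in qw] for q, qw in enumerate(queries)]
    for q, qw in enumerate(queries):
        for i, v in enumerate(qw):
            T[y[q][i], v] += 1
    for _ in range(n_iters):
        for q, qw in enumerate(queries):
            cands = np.array(cand_slots[q])
            for i, v in enumerate(qw):
                T[y[q][i], v] -= 1                   # exclude the current assignment
                p = (delta + T[cands, v]) / (delta * n_words + T[cands].sum(axis=1))
                p /= p.sum()
                y[q][i] = rng.choice(cands, p=p)     # Equation (6.2), restricted to c_q
                T[y[q][i], v] += 1
    return T, y

def usd_infer(query_words, cand_slots, T, delta=0.1):
    """Tag each word with the candidate slot m maximizing psi_{m, v} (posterior mean)."""
    psi = (delta + T) / (delta * T.shape[1] + T.sum(axis=1, keepdims=True))
    return [cand_slots[np.argmax(psi[cand_slots, v])] for v in query_words]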

6.2.2 Markovian Slot Distribution (MSD)

Similar to USD, the Markovian slot distribution (MSD) model samples cq from I under a discrete uniform distribution. However, instead of sampling each slot independently of the other slots, MSD assumes that the slots in the slot sequence yq follow a first-order Markov process, i.e.,

P(yq | cq) = ∏_{i=1}^{|q|} P(yq,i | cq) = ∏_{i=1}^{|q|} P(yq,i | yq,i−1).

MSD leverages the idea that two co-occurring slots (m1 and m2) should have high transition probabilities, i.e., P(m2 | m1) and/or P(m1 | m2) should be high. The generative process for MSD is outlined in Algorithm 7 and the plate notation is shown in Figure 6.2(b).

Learning:

To learn the transition probabilities P(m1 | m2), we model each slot as a categorical distribution over the set of all slots M. The probability distribution for the slot m is denoted by υm, and is sampled from a Dirichlet distribution with prior ζm.

Algorithm 7 Generative process for MSD
1: for each characteristic m ∈ M do

2:    Generate ψm = ⟨ψm,1, ..., ψm,|V|⟩ ∼ Dir(·|δm)
3:    Generate υm = ⟨υm,1, ..., υm,|M|⟩ ∼ Dir(·|ζm)
4: end for
5: for each query q ∈ Q do
6:    Generate cq ∼ Uniform(·|I)
7:    for each i = 1 to |q| do
8:       Generate yq,i ∼ Cat(·|υ_{yq,i−1}); vq,i ∼ Cat(·|ψ_{yq,i})
9:    end for
10: end for

Note that, unlike USD, we sample the states (slots) only from cq and not from M. This makes integrating out υ∗ difficult. Hence, we do not collapse out υ∗, but estimate it in the course of learning. The conditional distribution for the slot of a query word, given the remaining parameters, is given by

P(yq,i | Q, C, Y^−(q,i), υ∗, δ∗, ζ∗) ∝ P1 × P2,

where P1 corresponds to sampling the slot yq,i and equals P1 = υ_{yq,i−1, yq,i} × υ_{yq,i, yq,i+1}, and P2 corresponds to sampling the query word vq,i from the slot yq,i and equals

P2 = (δ_{yq,i, vq,i} + T^−(q,i)(yq,i, vq,i)) / Σ_{v′∈V} (δ_{yq,i, v′} + T^−(q,i)(yq,i, v′)).

Equation (6.1) is used to estimate ψ∗ and υ∗.

Inference:

Once we have ψ∗ and υ∗, the slot-sequence for a query can be estimated using the Viterbi decoding algorithm [120].
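For illustration, a minimal Viterbi decoding sketch restricted to a query's candidate slot-set, assuming the emission matrix psi and the transition matrix upsilon have already been estimated; a uniform prior over cq is assumed for the first position, which the text above does not specify.

import numpy as np

def viterbi_slots(query_words, cand_slots, psi, upsilon):
    """Most probable slot sequence for a query, restricted to its candidate slots.

    psi[m, v]      -- probability of word v under slot m
    upsilon[m, m'] -- probability of transitioning from slot m to slot m'
    """
    S, n = list(cand_slots), len(query_words)
    logp = np.log(psi[S, query_words[0]])            # first position (uniform prior over c_q)
    back = np.zeros((n, len(S)), dtype=int)
    for i in range(1, n):
        trans = np.log(upsilon[np.ix_(S, S)])        # transitions restricted to c_q
        scores = logp[:, None] + trans + np.log(psi[S, query_words[i]])[None, :]
        back[i] = np.argmax(scores, axis=0)
        logp = np.max(scores, axis=0)
    path = [int(np.argmax(logp))]
    for i in range(n - 1, 0, -1):                    # backtrack the best path
        path.append(int(back[i, path[-1]]))
    return [S[j] for j in reversed(path)]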

The primary advantage of exploiting the co-occurrence is in performing the selection of cq from I when cq is not observed. For this, we can use the transition probabilities of the slot sequence as a proxy for how likely the slots are to co-occur. To achieve this, we estimate the slot sequence and the corresponding candidate slot-set as follows:

yq, cq = argmax_{ŷq, ĉq} (P(ŷq | ĉq))^µ P(ŷq | q, ĉq);   ĉq ∈ I,

Algorithm 8 Generative process for sampling a product category and intended slots from the product category

1: Generate φ = ⟨φ1, ..., φ|K|⟩ ∼ Dir(·|α)
2: for each product category k ∈ K do
3:    Generate χk = ⟨χk,1, ..., χk,|M|⟩ ∼ Dir(·|βk)
4: end for
5: for each query q ∈ Q do
6:    Generate zq ∼ Cat(·|φ)
7:    for each i = 1 to |cq| do
8:       Generate cq,i ∼ Cat(·|χ_{zq})
9:    end for
10: end for

where P(ŷq | ĉq) = ∏_{i=1}^{|q|−1} P(yq,i+1 | yq,i), µ is a hyperparameter establishing how much co-occurrence information we want to incorporate, and P(ŷq | q, ĉq) corresponds to the a posteriori estimate.

6.2.3 Correlated Uniform Slot Distribution (CUSD)

To model the co-occurrence among the slots, CUSD leverages the idea that the slots in the candidate slot-set cq are sampled from one of the product categories. CUSD extends

USD by favoring the cq whose slots are more likely to belong to the same product category. The number of product categories is a hyperparameter denoted by |K|. Hence, sampling cq from I corresponds to sampling a product category and then sampling the slot-set cq from this product category. This sampling step is modeled as a mixture of unigrams, i.e., P(cq | I) = P(zq) ∏_{i=1}^{|cq|} P(cq,i | zq). The generative process for sampling the product category and the candidate slots from the product category is outlined in Algorithm 8. The plate notation for CUSD is shown in Figure 6.2(c). Note that, when cq is observed (training queries), the slot annotations generated by USD and CUSD are the same.

Learning:

We denote the probability of sampling a product category, P(zq), as φ, and model it as a categorical distribution sampled from a Dirichlet distribution with prior α. Each product category is further modeled as a categorical distribution over the slots M. The probability distribution for the product category k over the slots is denoted by χk, and is sampled from a Dirichlet distribution with prior βk. For the training data, since the candidate slot-sets are observed, the candidate slot-set sampling step can be treated independently of the subsequent sampling steps (which are the same as in USD). So, we focus on estimating the parameters corresponding to the product categories. The update equation for the product category of a query is given as

P(zq | C, Z^−(q), α, β∗) ∝ P1 × P2,

where P1 corresponds to sampling a product category and equals P1 = α_{zq} + U^−(q)(zq), where U(zq) is the count of the queries sampled from the product category zq. P2 corresponds to sampling each of the slots in cq from the product category zq and equals

P2 = ∏_{m∈cq} (β_{zq,m} + R^−(q)(zq, m)) / ∏_{i=0}^{|cq|−1} (Σ_{m′∈M} (β_{zq,m′} + R^−(q)(zq, m′)) + i),

where R(zq, m) is the count of the slot m sampled from zq.

Inference:

The product category for a query q given its candidate slot-set cq can be found using a maximum a posteriori estimate, i.e., zq = argmax_k φk ∏_{i=1}^{|cq|} χ_{k, cq,i}. The advantage of

CUSD over USD is in performing the selection of cq from I when cq is not observed. For this, we can use the probability of the candidate slot-set cq being sampled from the product category zq as a proxy for how likely the candidate slots are to co-occur. To achieve this, we estimate the slot sequence and the corresponding candidate slot-set as follows:

yq, cq = argmax_{ŷq, ĉq} (P(ĉq, ẑq))^µ P(ŷq | q, ĉq);   ĉq ∈ I,     (6.3)

where P(ĉq, ẑq) = φ_{ẑq} ∏_{i=1}^{|ĉq|} χ_{ẑq, ĉq,i}, µ is another hyperparameter establishing how much co-occurrence information we want to incorporate, and P(ŷq | q, ĉq) corresponds to the a posteriori estimate according to USD.
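A small sketch of the maximum a posteriori product-category assignment used by CUSD, computed in log space to avoid underflow; the toy numbers and names below are illustrative only, not from the original experiments.

import numpy as np

def cusd_map_category(cand_slots, phi, chi):
    """MAP product category for a query: argmax_k phi_k * prod_i chi[k, c_{q,i}]."""
    log_scores = np.log(phi) + np.log(chi[:, cand_slots]).sum(axis=1)
    return int(np.argmax(log_scores))

# Toy example with 2 product categories and 4 possible slots.
phi = np.array([0.7, 0.3])
chi = np.array([[0.4, 0.3, 0.2, 0.1],
                [0.1, 0.1, 0.4, 0.4]])
print(cusd_map_category([2, 3], phi, chi))   # -> 1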

6.2.4 Correlated Uniform Slot Distribution with Subset Selection (CUSDSS)

CUSD leverages the co-occurrence among the slots by favoring the cq whose slots are more likely to belong to the same product category. However, not all the slots in cq are related to each other. For example, the query tide detergent liquid has the product intent {product-type: laundry detergent, brand: tide}, but the product engaged with this query can have more slots than the query's intent, for example {color: yellow}, as the catalog records all the possible slots. We expect that color is not an important attribute when someone is buying detergent; in this case, the slot color: yellow is not related to the other slots in cq. Therefore, a better way to capture the co-occurrence among the slots is to favor the aq (the product intent of the query q) whose slots are more likely to belong to the same product category. We present correlated uniform slot distribution with subset selection (CUSDSS), which leverages this idea. CUSDSS extends CUSD by selecting or rejecting each slot in the set cq using a Bernoulli coin toss with parameter γ. The selected slots make up the set aq ⊆ cq. For each query q, we define the set ωq of size |cq|, with ωq,i = 1 if cq,i ∈ aq and 0 otherwise. The collection of all these ωq, ∀q is denoted by Ω. The generative process for CUSDSS is outlined in Algorithm 9 and the plate notation is shown in Figure 6.2(d).

Learning:

Unlike CUSD, we do not sample the slot-sequence yq from the observed cq but from the unobserved aq. Therefore, sampling aq is not independent of the subsequent sampling steps.

Specifically, we perform block Gibbs sampling updating aq, zq and yq,i in one step, as follows:

P(yq,i, zq, aq, ωq | Q, C, A^−(q,i), Z^−(q), Ω^−(q,i), α, β∗, γ, δ∗, ζ∗) ∝ P1 × P2 × P3 × P4 × P5,

where P1 corresponds to sampling the product category zq and equals P1 = α_{zq} + U^−(q)(zq).

Algorithm 9 Generative process for CUSDSS

1: Generate φ = ⟨φ1, ..., φ|K|⟩ ∼ Dir(·|α)
2: for each product category k ∈ K do
3:    Generate χk = ⟨χk,1, ..., χk,|M̄|⟩ ∼ Dir(·|βk)
4: end for
5: for each slot m ∈ M do
6:    Generate ψm = ⟨ψm,1, ..., ψm,|V|⟩ ∼ Dir(·|δm)
7: end for
8: for each query q ∈ Q do
9:    Generate zq ∼ Cat(·|φ)
10:   for each slot i = 1 to |cq| do
11:      Generate ωq,i ∈ {0, 1} ∼ Bern(·|γ)
12:   end for
13:   for each i = 1 to |cq| do
14:      if ωq,i == 1 then
15:         Generate cq,i ∼ Cat(·|χ_{zq}); add cq,i to aq
16:      else
17:         Generate ¬cq,i ∼ Cat(·|χ_{zq})
18:      end if
19:   end for
20:   for each i = 1 to |q| do
21:      Generate yq,i ∼ Uniform(·|aq); vq,i ∼ Cat(·|ψ_{yq,i})
22:   end for
23: end for

P2 corresponds to sampling the slots in aq from the product category zq and equals

P2 = ∏_{m∈aq} (β_{zq,m} + R^−(q)(zq, m)) · ∏_{m∈cq−aq} (β_{zq,m} + R^−(q)(¬zq, m)) / ∏_{i=0}^{|cq|−1} (Σ_{m′∈M̄} (β_{zq,m′} + R^−(q)(zq, m′)) + i).

P3 corresponds to the subset selection and equals

P3 = 1,           if yq,i ∈ aq^−vq,i,
P3 = γ/(1 − γ),   if yq,i ∉ aq^−vq,i.

P4 corresponds to sampling each slot and equals P4 = 1/|aq|. P5 corresponds to the probability of sampling the query word vq,i given the slot yq,i and equals

P5 = (δ_{yq,i, vq,i} + T^−(q,i)(yq,i, vq,i)) / Σ_{v′∈V} (δ_{yq,i, v′} + T^−(q,i)(yq,i, v′)).

Inference:

We perform the inference iteratively using block Gibbs sampling, in the same manner as learning. Similar to CUSD, we use the probability of the product intent aq being sampled from zq as a proxy for how likely the slots in the product intent are to co-occur. Therefore, in this case, we estimate the slot sequence and the corresponding candidate slot-set as follows:

yq, cq = argmax_{ŷq, ĉq} (P(ĉq, ẑq))^µ P(ŷq | q, ĉq);   ĉq ∈ I,     (6.4)

where P(ĉq, ẑq) = φ_{ẑq} ∏_{i=1}^{|ĉq|} χ_{ẑq, ĉq,i}^{ωq,i} (1 − χ_{ẑq, ¬ĉq,i})^{(1−ωq,i)}, and µ controls how much co-occurrence information we want to incorporate.

6.3 Experimental methodology

6.3.1 User engagement and annotated dataset

We used two datasets to estimate and evaluate our models. We call the first dataset engagement; it contains pairs of the form (q, cq), i.e., queries along with their candidate slot-sets. We call the second dataset annotated; it contains ground-truth annotated queries. We used the historical search logs of an e-commerce retailer to prepare our engagement dataset. This data contains 48,785 distinct queries, and the size of the vocabulary is

1,936. We assume that cq corresponds to the product characteristics of the products that were engaged for q. The product characteristics belong to the following six types (slot-keys), with their number of distinct values in parentheses: product-type (983), brand (807), gender (7), color (20), age (86) and size (243). We created the required dataset by sampling query-item pairs from the historical logs spanning 13 months (July'17 to July'18), such that the number of orders of that item for that query over a month is at least five. Note that each query can be engaged with multiple items, all of which were included to create cq. We limited our dataset to the queries whose words are frequent, i.e., each query word occurs at least 50 times. Furthermore, we use an item only if each of its product characteristics is present in at least 50 items. These filtering steps avoid data sparsity and thus help estimate the models accurately. The annotated dataset contains a manually annotated representative sample of the queries. The instructions for slot-key annotation, along with a few examples illustrating them, were given to the annotators. Two annotators tagged each query, and when the annotators disagreed over their annotations, they were asked to resolve the conflicts through reasoning and discussion. Furthermore, we discarded the queries for which the annotators failed to resolve conflicts. In addition to the slot-keys product-type, brand, gender, color, age and size, the set of slot-keys contains an additional tag miscellaneous corresponding to the words that do not fit into any of the other tags. The final annotated dataset contains a total of 3,430 annotated queries.

6.3.2 Evaluation methodology

We evaluate our methods on three tasks, as described below.

T1: This task evaluates how well the predicted slots for a query improve the retrieval performance. We calculate the similarity between a query and a product by counting the number of predicted slots for the query that match the characteristics of the product. In a typical e-commerce ranking framework, the similarity of a query with a product is composed of many features, such as the textual similarity between the query and the product description. Okapi Best Matching 25 (BM25) [107, 106] is a widely used text similarity measure in search engines. We incorporate BM25 into our ranking function by normalizing the BM25 scores of the candidate products for a query between zero and one using min-max scaling, and add this score to the slot-filling score of the corresponding query-product pair to compute the final ranking score. As discussed in Section 2.3, our work is the first attempt at using engagement data for distant-supervised slot-filling in e-commerce queries. This limits the availability of baselines for a fair evaluation, and thus we limit our baselines to the BM25-based approaches.

T2: This task compares the predicted slots with the manually annotated queries. Since the candidate slot-set is observed, this task evaluates how well our methods can tag each word in the query with the corresponding slot, given a set of slots to choose from.

T3: This task is similar to task T2, but the candidate slot-set cq is not observed. Thus, in addition to the tagging performance, this task also evaluates how well our methods are able to correctly choose the candidate slot-set cq. When cq is not observed, it has to be sampled from the collection I, which can be very large (of the order of 10^12 for the presented experimental setup). This makes calculating Equations (6.3) and (6.4) computationally expensive. Hence, we heuristically decrease the number of candidate slot-sets. For each query word vq,i, we find the top-t most probable slots for each possible slot-key that may have generated vq,i, i.e., the top-t indices corresponding to ψ∗,vq,i for each possible slot-key. The collection of candidate slot-sets for a query then becomes the set of all possible candidate slot-sets based on the top-t slots for each query word (a small enumeration sketch is given after the dataset description below). In this chapter, we use t = 1 for simplicity. This also means that our approaches are exponential in the number of words in the query, which is typically small in the case of product-search queries (the mean query length is 3.6); thus the exponential complexity does not affect the scalability of our approaches. Note that a query word can have only one slot-key but multiple slot-values, leading to annotation ambiguity. To avoid this, we relax our evaluation to predicting the correct slot-keys for tasks T2 and T3. However, when the candidate slot-set is observed, predicting the correct slot-key corresponds to predicting the correct slot, as there is only one possible slot-value for a slot-key. We generated our training, test and validation datasets from the engagement data and annotated data as follows:

• From the annotated data, we used the queries not present in the engagement data to evaluate our methods on task T3. We divided these queries into a validation and a test set, containing 454 and 2,976 queries, respectively.

• We used the queries present in both the annotated data and the engagement data as the test dataset for evaluation on tasks T1 and T2, leading to a total of 1,699 test queries. For T2, we generated the candidate slot-set of a query as the characteristics of the product that had the largest number of orders for the query. An additional slot miscellaneous was added to each candidate slot-set. For T1, we use the purchase count of a product as its relevance for a query.

• We used the remaining queries from the engagement data to estimate our models. Unlike the test data for task T2 prepared earlier, we do not generate cq from the most purchased item, but create a new (q, cq) pair for each engaged item. We added an additional slot miscellaneous to each candidate slot-set. The size of our training set is 108,101 (q, cq) pairs, with 47,086 distinct queries.
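The sketch below illustrates the candidate slot-set enumeration heuristic used for task T3 (top-t slots per slot-key for every query word, with all per-word combinations enumerated). It is a simplification: the at-most-one-slot-per-slot-key constraint on each resulting set is omitted, and the helper names are hypothetical.

import numpy as np
from itertools import product

def candidate_slot_sets(query_words, psi, slot_key_of, t=1):
    """Heuristically enumerate candidate slot-sets for a query.

    For every query word, keep the top-t slots of each slot-key according to
    psi[:, v]; a candidate slot-set is any combination that picks one of these
    slots per word.  slot_key_of[m] maps a slot id to its slot-key id.
    """
    per_word = []
    for v in query_words:
        best = {}                                    # slot-key -> top-t slot ids for word v
        for m in np.argsort(-psi[:, v]):
            key = slot_key_of[m]
            best.setdefault(key, [])
            if len(best[key]) < t:
                best[key].append(int(m))
        per_word.append([m for slots in best.values() for m in slots])
    # One choice of slot per word; duplicate slots collapse inside the frozenset.
    return {frozenset(choice) for choice in product(*per_word)}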

6.3.3 Performance Assessment metrics

For T1, we calculate the Mean Reciprocal Rank (MRR) [121] and the Normalized Discounted Cumulative Gain (NDCG) [127], which are popular measures of ranking quality. To calculate the MRR, we assume that the relevant products for a query are the most purchased products for that query. To calculate the NDCG, we assume that the relevance of a product for a query is the number of times that product was purchased after issuing that query. We calculate NDCG only up to rank 10 (NDCG@10), as users usually pay attention to the top few results, and we break ties as proposed in [82]. For T2 and T3, our first metric, accuracy, calculates the overall percentage of the query words whose slot-key is correctly predicted. Our second metric calculates the percentage of the words whose slot-key is predicted correctly within a query; we report the average of this percentage over all the queries and call this metric q-accuracy. The q-accuracy metric treats all queries equally, whereas accuracy is biased towards longer queries. Compared to accuracy, q-accuracy, being a transaction-level metric, gives a better picture of how our methods impact the traffic. The accuracy and q-accuracy metrics are biased towards the frequently occurring slot-keys; therefore, we also report the macro-averaged precision (avgprec), recall (avgrec) and F1-score (avgF1). We used grid search on the validation set to select our hyper-parameters.
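To make the distinction between accuracy and q-accuracy explicit, the following sketch computes both from per-query lists of predicted and annotated slot-keys (the toy data is hypothetical):

def accuracy_and_q_accuracy(pred, gold):
    """pred[q][i], gold[q][i] -- predicted / annotated slot-key of word i in query q."""
    correct = total = 0.0
    per_query = []
    for p, g in zip(pred, gold):
        hits = sum(pi == gi for pi, gi in zip(p, g))
        correct += hits
        total += len(g)
        per_query.append(hits / len(g))
    accuracy = correct / total                       # word-level; biased towards long queries
    q_accuracy = sum(per_query) / len(per_query)     # averaged per query
    return accuracy, q_accuracy

print(accuracy_and_q_accuracy([["brand", "color"], ["product-type"]],
                              [["brand", "size"],  ["product-type"]]))   # (0.666..., 0.75)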

Table 6.2: Retrieval results (Task T1).

Model     Features     MRR     NDCG@10
-         Only BM25    0.036   0.039
USD       Only Slots   0.071   0.096
          Slots+BM25   0.088   0.121
MSD       Only Slots   0.072   0.097
          Slots+BM25   0.088   0.121
CUSD      Only Slots   0.071   0.096
          Slots+BM25   0.091   0.123
CUSDSS1   Only Slots   0.075   0.100
          Slots+BM25   0.090   0.125

1 The performance of CUSDSS is statistically significant over the other models on the NDCG metric, and over USD and MSD on the MRR metric (p-value ≤ 0.01 using t-test), for both the feature settings (only slots and slots+BM25).

6.4 Results and Discussion

Task T1: Table 6.2 shows the performance of the different approaches on the retrieval task. CUSDSS performs the best of all the models on both metrics, whereas CUSD performs better than USD and MSD when it is combined with BM25. This validates our hypothesis that modeling co-occurrence and subset selection leads to better performance on slot-filling, which in turn leads to better retrieval. The performance improvement of CUSDSS over USD, using only the predicted slots in the ranking function, is ≈ 5.6% and ≈ 4.2% on the MRR and NDCG metrics, respectively. Upon incorporating BM25 with the predicted slots in the ranking function, the performance improvement is ≈ 2.3% and ≈ 3.3% on the MRR and NDCG metrics, respectively. On both the MRR and NDCG metrics, using the predicted slots to rank the products leads to better performance than using BM25. Additionally, using BM25 in conjunction with the predicted slots leads to even better performance. For example, the performance improvement of CUSDSS over BM25, using only the slots in the ranking function, is ≈ 108% and ≈ 156% on the MRR and NDCG metrics, respectively. The performance further improves upon incorporating BM25 with the predicted slots in the ranking function. This is particularly appreciable because the developed approaches do not require any labeled data; instead, they leverage the readily available search query logs to boost the retrieval performance. We also see that USD and MSD perform almost equally. This is expected, as all the queries nike mens running shoes, mens nike running shoes, running mens shoes nike, etc. are possible, favoring uniform transition probabilities.

Table 6.3: Slot-filling results when cq is observed (Task T2).

Model      accuracy  q-accuracy  avgprec  avgrec  avgF1
USD/CUSD   0.678     0.708       0.426    0.587   0.453
MSD        0.679     0.709       0.428    0.589   0.455
CUSDSS1    0.657     0.686       0.571    0.493   0.465

1 The performance of CUSDSS is statistically significant over USD and MSD on the avgprec and avgF1 metrics (p-value ≤ 0.01 using t-test).

Table 6.4: Slot-filling results when cq is unobserved (Task T3).

Model     accuracy  q-accuracy  avgprec  avgrec  avgF1
USD       0.558     0.583       0.506    0.645   0.472
MSD       0.558     0.583       0.506    0.645   0.472
CUSD1     0.565     0.588       0.510    0.650   0.476
CUSDSS2   0.566     0.586       0.526    0.622   0.511

1 The performance of CUSD is statistically significant over USD and MSD on the accuracy, q-accuracy and avgrec metrics (p-value ≤ 0.01 using t-test).
2 The performance of CUSDSS is statistically significant over USD and MSD on the accuracy, q-accuracy, avgprec and avgF1 metrics, and over CUSD on the avgprec and avgF1 metrics (p-value ≤ 0.01 using t-test).

Table 6.5: Individual precision (P), recall (R) and F1-score (F1) when cq is observed.

           product-type    brand           gender          color           age             size            miscellaneous
Model      P    R    F1    P    R    F1    P    R    F1    P    R    F1    P    R    F1    P    R    F1    P    R    F1
USD/CUSD   0.93 0.74 0.82  0.35 0.89 0.51  0.63 0.82 0.69  0.11 0.64 0.16  0.27 0.49 0.36  0.28 0.27 0.30  0.42 0.25 0.33
MSD        0.93 0.74 0.82  0.35 0.89 0.51  0.63 0.82 0.69  0.11 0.64 0.17  0.27 0.50 0.36  0.29 0.27 0.30  0.42 0.26 0.33
CUSDSS     0.90 0.74 0.81  0.26 0.88 0.40  0.88 0.69 0.78  0.34 0.54 0.41  0.72 0.19 0.29  0.41 0.18 0.25  0.49 0.24 0.32

Task T2 and T3: Tables 6.3 and 6.4 show the performance for the slot classification when the candidate slot-set is observed and unobserved, respectively. We report the average of 100 runs with random initializations. For all the methods, we see that q-accuracy is higher than accuracy, which shows that all the methods perform better on shorter queries. Figure 6.3 shows how the q-accuracy varies with the number of words in the queries; for all the methods, q-accuracy decreases with the length of the query. Similar to the ranking task T1, we see that USD and MSD perform equally well in both cases. CUSD performs better than USD and MSD on all the metrics, which shows that selecting cq with co-occurring slots leads to better performance. Moreover, the co-occurrence information is exploited when the query has multiple words: as shown in Figure 6.3, CUSD performs considerably better than USD and MSD on the q-accuracy metric as the number of words in the queries increases. Interestingly, CUSDSS performs better than USD and MSD on the q-accuracy and accuracy metrics when the candidate slot-set is observed, but does not perform as well when the candidate slot-set is unobserved. In addition, CUSDSS performs best on the avgprec and avgF1 metrics, but not as well on the avgrec metric.

Figure 6.3: Variation of the q-accuracy with query length, for USD, MSD, CUSD and CUSDSS (x-axis: number of terms in the query, from 1 to 7; y-axis: q-accuracy).

Table 6.6: Individual precision (P), recall (R) and F1-score (F1) when cq is unobserved.

           product-type    brand           gender          color           age             size            miscellaneous
Model      P    R    F1    P    R    F1    P    R    F1    P    R    F1    P    R    F1    P    R    F1    P    R    F1
USD        0.91 0.59 0.71  0.20 0.85 0.35  0.81 0.93 0.76  0.39 0.88 0.42  0.17 0.62 0.42  0.23 0.50 0.40  0.83 0.14 0.24
MSD        0.91 0.59 0.71  0.20 0.85 0.35  0.81 0.93 0.77  0.39 0.88 0.41  0.17 0.62 0.42  0.23 0.50 0.40  0.83 0.14 0.23
CUSD       0.91 0.60 0.72  0.20 0.85 0.36  0.81 0.94 0.76  0.40 0.88 0.42  0.18 0.63 0.43  0.24 0.51 0.41  0.84 0.14 0.24
CUSDSS     0.90 0.61 0.72  0.19 0.84 0.32  0.83 0.90 0.86  0.37 0.87 0.52  0.46 0.53 0.48  0.34 0.41 0.38  0.61 0.20 0.30

To investigate this peculiar behavior of CUSDSS, we compared the individual precision, recall and F1 scores for the slot-keys of CUSDSS with those of the other approaches. Tables 6.5 and 6.6 show these statistics for each of the slots when the candidate slot-set is observed and unobserved, respectively. Subset selection leads to CUSDSS labeling more query words with frequent slots like product-type and brand. This not only decreases the recall of the other, less frequent slots (gender, color, age, size) but also decreases the precision of the frequent slots. However, it also tends to make CUSDSS predict only those less frequent slots for which it has high confidence, leading to an increase on the avgprec metric. Consequently, CUSDSS also performs the best on the avgF1 metric (3% improvement over USD when cq is observed, and 8% improvement when cq is not observed). When the candidate slot-set is unobserved, there are more possible ways to make an error in slot prediction than when the candidate slot-set is observed. For the unobserved case, as the prediction is biased towards the frequent slots, CUSDSS performs better on the micro-averaged metrics, i.e., accuracy and q-accuracy; for the observed case, CUSDSS does not benefit, as the scope for error is small. Additionally, even though the overall performance of CUSDSS on the q-accuracy metric is not at par with CUSD, the performance is considerably better for the queries with multiple words, and the difference in performance increases with the number of words in the query, as depicted in Figure 6.3. When there is only one word in the query, CUSDSS favors frequent slots, leading to degraded performance, as there is no co-occurrence information to be leveraged. To study qualitatively how subset selection helps CUSDSS make better predictions than CUSD, Table 6.7 shows the most probable slots for some product categories estimated by CUSDSS and CUSD.

Table 6.7: Examples of product categories estimated by CUSD and CUSDSS.

Houseware
  Most probable slots estimated by CUSD (ranked by probability): age: adult, gender: unisex, brand: mainstays, size: 1, color: white, color: black, brand: better homes & gardens, color: brown, pt: storage chests & boxes, brand: sterilite, color: silver, color: gray
  Most probable slots estimated by CUSDSS (ranked by probability): brand: mainstays, pt: storage chests & boxes, pt: vacuum cleaners, brand: sterilite, brand: better homes & gardens, brand: the pioneer woman, pt: food storage jars & containers, brand: holiday time

Toys
  Most probable slots estimated by CUSD (ranked by probability): pt: dolls, color: multicolor, age: child, gender: unisex, brand: barbie, pt: doll playsets, gender: girls, age: adult, brand: shopkins, brand: disney princess, brand: my life as, age: 3 years & up, brand: disney
  Most probable slots estimated by CUSDSS (ranked by probability): pt: dolls, pt: action figures, pt: play vehicles, brand: barbie, pt: doll playsets, pt: stuffed animals & plush toys, brand: disney, pt: interlocking block building sets, brand: lego, brand: paw patrol

Dental hygiene
  Most probable slots estimated by CUSD (ranked by probability): gender: unisex, age: adult, color: multicolor, pt: toothpastes, brand: crest, brand: colgate, brand: oral-b, size: 1, color: white, age: child, size: 4, pt: mouthwash, pt: powered toothbrushes, age: teen, size: 6
  Most probable slots estimated by CUSDSS (ranked by probability): pt: toothpastes, brand: crest, brand: colgate, brand: oral-b, pt: baby foods, brand: gerber, pt: mouthwash, pt: meal replacement drinks, pt: manual toothbrushes, pt: powered toothbrushes, brand: ensure
We manually labeled the estimated product categories with a relevant semantic name. We observe that the product categories estimated by CUSD contain more slots corresponding to the slot-keys age, gender, size and color, while the product categories estimated by CUSDSS contain more slots corresponding to the slot-keys product-type and brand. For example, the product category houseware for CUSD has age: adult as the most probable slot, while CUSDSS has brand: mainstays as the most probable slot. This follows from the fact that CUSD models the complete candidate slot-set cq as belonging to the same product category, which leads to the frequently occurring slots in cq having high probabilities in the product categories. In contrast, CUSDSS models the actual product intent aq as belonging to the same product category, which leads to the actually intended slots having high probabilities in the product categories. We present the concluding remarks in Chapter 8.

Chapter 7

Importance assessment in scholarly networks

Citation networks of peer-reviewed scholarly publications (e.g., journal/conference articles and patents) have widely been used and studied in order to derive impact metrics for the various entities involved (e.g., articles, researchers, institutions, companies, journals, conferences, countries, etc. [3]). However, most of these traditional metrics, such as citation counts and the h-index, treat all citations and publications equally, and do not take into account the content of the publications and the context in which a prior scholarly work was cited. Another related line of work, such as PageRank [94] and HITS [65], takes node centrality into consideration (as a proxy for publication influence), but still operates in a content-agnostic manner. These content-agnostic metrics fail to reliably measure the scholarly impact of an article, as they do not differentiate between the possible reasons a scholarly work is being cited. Being content-agnostic, these metrics can be easily manipulated by malicious entities, such as publication venues indulging in self-citations, which lead to inflated impact factors, or groups of scholars citing each other's work. For example, Journal Citation Reports (JCR)1 routinely suppresses journals that indulge in citation stacking, a practice where reviewers and journal editors pressure authors to cite papers that either they wrote or that are published in "their" journal. Thus, there is a need to establish

1 http://help.incites.clarivate.com/incitesLiveJCR/JCRGroup/titleSuppressions.html

content-aware metrics to accurately and quantitatively measure various innovation-related aspects of scholarly work, such as its significance, novelty, impact, and market value. Such metrics are essential for ensuring that SET-driven innovations will play an ever more significant role in the future. In this chapter, we propose machine-learning-driven approaches that automatically estimate the weights of the edges in a citation network, such that edges with higher weights correspond to higher-impact citations. There has been considerable effort in the past to identify important citations [119, 60, 20]. These approaches treat this task as a supervised text-classification problem and thus require the availability of training data with ground-truth annotations. However, generating such labeled data is difficult and time-consuming, especially when the meaning of the labels is user-defined. In contrast, our approaches are distant-supervised and require no manual annotation. The proposed approaches leverage the readily available content of the papers as a source of distant-supervision. Specifically, we formulate the problem as how well a linear combination of the representations of the cited publications explains the representation of the citing publication. The weights in this linear combination quantify the extent to which each cited publication informs the citing publication. Experiments on three manually annotated datasets show the advantage of the proposed method over the competing approaches, achieving up to 106% improvement in performance as compared to the second-best performing approach. While our discussion and evaluation focus on identifying informing citations, our approach is not restricted to this domain and can be used to derive impact metrics for the various involved entities. The content-aware weights estimated by the proposed approach convert the original unweighted citation network into a weighted one. Consequently, this weighted network can be used to derive impact metrics for the various involved entities, like the publications, authors, etc. For example, to find the impact of a publication, the sum of the weights of the edges incident on its corresponding node can be used, instead of the vanilla citation count. The remainder of the chapter is organized as follows. The mathematical notation used in the chapter is described in Section 7.1. The chapter discusses the proposed method in Section 7.2, followed by the experiments in Section 7.3. Section 7.4 discusses the results.


Figure 7.1: Overview of the Content-Informed Index. Paper P1 cites papers P2, P3 and P4. The representations of the cited papers are combined, with unit-normalized weights, into a representation of the historical concepts, and the explanation loss against P1's representation is minimized. The weights w21, w31, and w41 quantify the extent to which P2, P3 and P4 inform P1, respectively. The function f is implemented as a multilayer perceptron.

Table 7.1: Notation used throughout the chapter.

Symbol   Description
Pi       A paper.
Ci       Set of concepts in the paper Pi.
Hi       Set of historical concepts in the paper Pi.
Ni       Set of novel concepts in the paper Pi.
Cji      Set of concepts that are informed by the paper Pj to Pi.
r(X)     Representation of the set of concepts denoted by X.
wji      A weight quantifying the information in the citation Pi → Pj.

7.1 Definitions and Notations

We assume that each paper Pi can be represented as a set of concepts Ci. Further, we assume that each paper Pi is built on top of a set of historical concepts Hi, and that its novelty Ni is the new set of concepts it proposes. The contribution of a cited paper Pj towards the citing paper Pi is the set of concepts Cji = Cj ∩ Hi. In other terms, the set

Ci = Ni ∪ Hi = Ni ∪ [∪PicitesPj Cji].

Table 7.1 provides a reference for the notation used in the chapter.

7.2 Content-Informed Index (CII)

In the absence of labels that define the impact, we assume that the extent to which a cited paper informs the citing paper is an indication of the citation’s impact. The task at hand is to quantify the extent to which Cji contributes towards Hi. To achieve this task, we look into the following directions:

• How do we supervise the exercise? We minimize the novelty of paper Pi,

by trying to explain the concepts in paper Pi (denoted by Ci) using the historical

concepts, i.e., the concepts of the papers it cites (Cj). We call the loss associated with this minimization as the explanation loss. This gives rise to the following optimization problem:

X X minimize Ni = minimize Ci − Hi. i i

To proceed in this direction, we need to answer two questions, (i) How to represent

the the set of concepts associated with the paper Pi?, and (ii) How do we represent

the set of historical concepts Hi? As we show next, we use the textual content of

the papers to estimate the representations of Ci and Hi. Thus, we formulate our problem as a distant supervised, and the content of the papers acts as a source of distant-supervision.

• How to represent the set of concepts associated with a paper? For

simplicity, we represent the set of concepts associated with a paper (Ci) as a pre- trained vector representation (embedding) of its abstract, such as Word2Vec [86], GloVe [95], BERT [23], ELMo [97], etc. In this chapter, we use the pretrained representations pretrained on scientific documents provided by ScispaCy [92]. The

representation of Ci is denoted by r(Ci). 114

• How do we represent the set of historical concepts Hi? As the set of

historical concepts Hi is a union of the the borrowed concepts from the cited

papers (Cj), we simply represent the set of historical concepts as a weighted linear combination of the representation of the concepts of the cited papers, i.e., X r(Hi) = w˜jir(Cj)

Pi cites Pj X 2 subject to w˜ji = 1 Pi cites Pj

w˜ji ≥ 0; ∀(i, j).

P 2 We have the constrained norm condition ( w˜ji = 1) to make the represen- Pi cites Pj tation of r(Hi) agnostic to the number of cited-papers (a paper can cite multiple papers to reference the same borrowed concepts).

The weightsw ˜ji can be thought of as normalized similarity measure between the

concepts of the cited paper, and the citation context. Thus, to estimatew ˜ji, we

first estimate unnormalizedw ˜ji, denoted by wji, and then normalize wji so as to

have unit norm. The unnormalized weight wji is precisely the extent to which Cj

contributes towards Hi (and hence Ci), i.e., the weight that we wish to estimate

in this paper. We estimate wji as a multilayer perceptron, that takes as input the representations of the cited paper and the citation context. We use the represen- tation associated with the corresponding concepts as the representations of the

cited papers (r(Cj)). Similar to r(Cj), we use the ScispaCy vector representation for the citation context as the representation of the context, and denote it by r(j → i). 115 The above discussion leads to the following formulation:

$$\begin{aligned}
\underset{f}{\text{minimize}} \quad & \sum_i \Big\| r(C_i) - \sum_{P_i \text{ cites } P_j} \tilde{w}_{ji}\, r(C_j) \Big\|^2 \\
\text{subject to} \quad & \tilde{w}_{ji} = \frac{w_{ji}}{\sqrt{\sum_{P_i \text{ cites } P_j} w_{ji}^2}}; \;\; \forall (i, j), \\
& w_{ji} = f\big(r(C_j), r(C_{ji})\big); \;\; \forall (i, j), \\
& w_{ji} \geq 0; \;\; \forall (i, j), \\
& w_{ji} \leq b; \;\; \forall (i, j).
\end{aligned} \tag{7.1}$$

The max-bound constraint (wji ≤ b) is introduced to limit the projection space of the weights wji. Without this constraint, for a given citing paper Pi, if a set of weights wji minimizes Equation (7.1), then so does any scalar multiple of those weights; this can potentially make the estimated weights incomparable across different citing papers. Having a max bound on the estimated weights helps avoid this scenario. To satisfy the constraints, the function f(·) can be implemented as an L2-regularized multilayer perceptron with a single output node and a non-negative mapping at the output node. Note that we do not explicitly set the max-bound b; it is implicitly set by the L2 regularization of the weights of the function f, and the L2 regularization parameter is treated as a hyperparameter. Figure 7.1 shows an overview of the Content-Informed Index (CII).
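To make the estimation concrete, the following is a minimal sketch of the weight model and the explanation loss, assuming 200-dimensional ScispaCy embeddings, that the two input representations are concatenated before the MLP, and that the non-negative output mapping is a softplus; the names (CitationWeightMLP, explanation_loss) are illustrative, not the thesis code.

```python
import torch
import torch.nn as nn

class CitationWeightMLP(nn.Module):
    """f(r(C_j), r(j -> i)) -> unnormalized, non-negative weight w_ji."""
    def __init__(self, dim=200, hidden=(256, 64, 8)):
        super().__init__()
        layers, prev = [], 2 * dim      # cited-abstract and context embeddings concatenated (assumption)
        for h in hidden:
            layers += [nn.Linear(prev, h), nn.ReLU()]
            prev = h
        layers += [nn.Linear(prev, 1), nn.Softplus()]  # single output node, non-negative mapping (assumed softplus)
        self.net = nn.Sequential(*layers)

    def forward(self, cited_emb, context_emb):
        return self.net(torch.cat([cited_emb, context_emb], dim=-1)).squeeze(-1)

def explanation_loss(citing_emb, cited_embs, context_embs, f):
    """Squared error between r(C_i) and the weighted combination of the r(C_j).

    citing_emb:   (d,)    embedding of the citing abstract
    cited_embs:   (k, d)  embeddings of the k cited abstracts
    context_embs: (k, d)  embeddings of the k citation contexts
    """
    w = f(cited_embs, context_embs)                        # (k,) unnormalized weights
    w_tilde = w / (w.norm(p=2) + 1e-12)                    # unit L2 norm over this citing paper's citations
    r_hist = (w_tilde.unsqueeze(-1) * cited_embs).sum(0)   # r(H_i)
    return ((citing_emb - r_hist) ** 2).sum()
```

At inference time, the unnormalized output of the MLP for a citation is the content-informed weight reported for that citation.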

7.3 Experimental methodology

7.3.1 Evaluation methodology and metrics

We need to evaluate how well the weights estimated by our proposed approach quantify the extent to which a cited paper informs the citing paper. To this end, we leverage various manually annotated datasets (explained later in Section 7.3.3), where the annotations quantify the extent of information in the citation. The task inherently becomes one of ordinal association, and we need to evaluate how well the ranking imposed by our proposed method associates with the ranking imposed by the manual annotations. As a measure of rank correlation, we use the non-parametric Somers' Delta [114] (denoted by ∆). Values of ∆ range from −1 (100% negative association, or perfect inversion) to +1 (100% positive association, or perfect agreement). A value of zero indicates the absence of association. Formally, given a dependent variable (i.e., the weights predicted by our model) and an independent variable (i.e., the manually annotated ground truth), ∆ is the difference between the number of concordant and discordant pairs, divided by the number of pairs whose independent-variable values are unequal.
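For reference, below is a small sketch of Somers' Delta exactly as defined above (concordant minus discordant pairs, divided by the pairs with unequal ground-truth values); the quadratic-time implementation is adequate for the few hundred annotated citations used here, and the function name is illustrative.

```python
from itertools import combinations

def somers_delta(predicted, ground_truth):
    """Somers' Delta of `predicted` (dependent) with respect to `ground_truth` (independent)."""
    concordant = discordant = unequal = 0
    for (p1, g1), (p2, g2) in combinations(zip(predicted, ground_truth), 2):
        if g1 == g2:
            continue                      # ties on the independent variable are excluded
        unequal += 1
        s = (p1 - p2) * (g1 - g2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1               # ties on the prediction count only in the denominator
    return (concordant - discordant) / unequal if unequal else 0.0

# e.g. somers_delta([0.9, 0.2, 0.6], [3, 0, 1]) -> 1.0 (perfect agreement)
```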

Relation of ∆ to other metrics:

When the independent variable has only two distinct classes (binary variable), the area under the receiver operating characteristic curve (AUC ROC) statistic is equivalent to ∆ [93]. Thus, ∆ can also be viewed as a generalization of AUC ROC to ordinal classification with multiple classes. Further, as the dependent variable (the weights estimated by our proposed approach) is real-valued, two tied values on the dependent variable are very unlikely. Thus, for our case, ∆ is equivalent to Goodman and Kruskal's Gamma [33, 34, 35, 36], and is just a scaled variant of Kendall's τ coefficient [63], which are other popular measures of ordinal association.

7.3.2 Baselines

We choose representative baselines from diverse categories as discussed below:

Link-prediction approaches:

The citation weights that we estimate in this chapter can also be viewed from the link-prediction perspective, i.e., assigning a score to every citation (link) in the citation graph, with the score portraying the likelihood of the existence of the link. Thus, citations that are noisy, i.e., that do not make sense with respect to the underlying link-prediction model, get smaller weights. We compare against two link-prediction methods, one based on a classic network-embedding approach, and the other belonging to recent Graph Neural Network (GNN) based approaches.

• DeepWalk [96]: DeepWalk is a popular method to learn node embeddings. DeepWalk borrows ideas from language modeling and incorporates them with network concepts. Its main proposition is that linked nodes tend to be similar and should have similar embeddings as well. Once we have node embeddings as the output of DeepWalk, we train a binary classifier, with the positive instances being pairs of nodes that are connected in the network and the negative instances being unconnected nodes (generated using negative sampling). We provide results using two different classifiers: Logistic Regression (denoted by DeepWalk+LR) and Multilayer Perceptron (denoted by DeepWalk+MLP); a sketch of this two-step pipeline follows the list below. Note that DeepWalk is a transductive model and only considers the network topology, i.e., DeepWalk does not use the content of the papers to estimate the model.

• GraphSage [42]: GraphSAGE is a Graph Convolutional Network (GCN) based framework for inductive representation learning on large graphs. GraphSage is trained with the link-prediction loss, so we do not use a second step (as in DeepWalk) to train a separate classifier. Note that GraphSage is an inductive model, so it also considers the content of the papers in addition to the topology of the network when estimating the model.
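Below is a hedged sketch of the DeepWalk+LR second step referenced above: a logistic-regression link predictor trained on precomputed DeepWalk node embeddings. The Hadamard edge feature and the negative-sampling routine are assumptions for illustration, not details given in the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def edge_features(emb, edges):
    """Hadamard product of the two endpoint embeddings, one row per edge.
    `emb` maps a node id to its DeepWalk vector."""
    return np.array([emb[u] * emb[v] for u, v in edges])

def train_link_predictor(emb, pos_edges, num_nodes, rng=np.random.default_rng(0)):
    # Negative edges: random node pairs that are not citations in the graph.
    pos = set(pos_edges)
    neg_edges = []
    while len(neg_edges) < len(pos_edges):
        u, v = rng.integers(num_nodes, size=2)
        if u != v and (u, v) not in pos:
            neg_edges.append((u, v))
    X = np.vstack([edge_features(emb, pos_edges), edge_features(emb, neg_edges)])
    y = np.concatenate([np.ones(len(pos_edges)), np.zeros(len(neg_edges))])
    return LogisticRegression(max_iter=1000).fit(X, y)

# The classifier's predicted probability for a (citing, cited) pair is then used
# as that citation's weight; DeepWalk+MLP swaps the classifier for an MLP.
```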

Text-similarity based baselines:

We can think of the function f as a similarity measure between the cited paper and the citation context. Thus, we consider the following similarity measures as our baselines:

• Similarity-Abstract-Context: Similarity between the cited abstract and the citation context.

• Similarity-Context-Abstract: Similarity between the citing abstract and the citation context.

• Similarity-Abstract-Abstract: Similarity between the cited abstract and citing abstract.

To calculate each of the above similarity measures, we use the same pretrained representations that we used as an input to CII, and cosine similarity as the similarity measure, which is a popular similarity measure for text data. The baselines belonging to this category can also be thought of as similarity-based link-prediction approaches. In addition, we also consider another simple baseline, referred to as Reference Frequency, where we assume that the more frequently the cited paper is referenced in the citing paper, the higher the chances of the cited paper informing the citing paper. This assumption has also been used as a feature in prior supervised approaches [119]. The absolute frequency of referencing a cited paper may provide a good signal regarding the information borrowed from the cited paper when comparing with other papers being cited by the same citing paper. However, as citation behavior differs between papers, the absolute frequency may not be comparable across different citing papers. Thus, we also provide results after normalizing the absolute frequency of the citation references for each citing paper. We provide results for mean, max, and min normalization. Specifically, given a citation and the corresponding citing paper, the information weight for the citation is calculated by dividing the number of references of that citation by the mean, max, or min of the references of all the citations in that citing paper, respectively.
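A small sketch of these baselines follows, assuming ScispaCy supplies the document vectors; the pipeline name en_core_sci_lg and the helper names are illustrative.

```python
import numpy as np
import spacy

nlp = spacy.load("en_core_sci_lg")   # ScispaCy pretrained pipeline (assumed installed)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def similarity_abstract_context(cited_abstract, citation_context):
    """The Similarity-Abstract-Context baseline; the other two variants swap the inputs."""
    return cosine(nlp(cited_abstract).vector, nlp(citation_context).vector)

def reference_frequency(counts, mode="absolute"):
    """counts: number of in-text references for each citation of one citing paper."""
    counts = np.asarray(counts, dtype=float)
    if mode == "mean":
        return counts / counts.mean()
    if mode == "max":
        return counts / counts.max()
    if mode == "min":
        return counts / counts.min()
    return counts
```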

7.3.3 Datasets

The Semantic Scholar Open Research Corpus (S2ORC) [73]

The S2ORC dataset is a citation graph of 81.1 million academic publications and 380.5 million citation edges. We only consider the publications for which the full text is available and the abstract contains at least 50 words. This leaves us with a total of 5,653,297 papers and 30,533,111 edges (citations).

ACL-2015 [119]

The ACL-2015 dataset contains 465 citations gathered from the ACL Anthology2, represented as tuples of (cited paper, citing paper), with ordinal labels ranging from 0 to 3 in increasing order of importance. The citations were annotated by one expert, followed by annotation by another expert on a subset of the dataset to verify the inter-annotator agreement. We only use the citations for which we have inter-annotator agreement and which are present in the S2ORC dataset described above. The selected dataset contains 300 citations among 316 unique publications. The total number of unique citing publications is 283 and the total number of unique cited publications is 38.

2 https://www.aclweb.org/anthology/

ACL-ARC [60]

ACL-ARC is a dataset of citation intents based on a sample of papers from the ACL Anthology Reference Corpus [10]; it includes 1,941 citation instances from 186 papers and is annotated by domain experts. The dataset provides ACL IDs for the papers in the ACL corpus, but does not provide an identifier for the papers outside the ACL corpus, making it difficult to map many citations to the S2ORC corpus. However, it provides the titles of those papers, and we used these titles to map them to papers in the S2ORC dataset whenever we found matching titles. The selected dataset contains 460 citations among 547 unique publications. The total number of unique citing publications is 145 and the total number of unique cited publications is 413.

SciCite [20]

SciCite is a dataset of citation intents based on a sample of papers from the Semantic Scholar corpus3, consisting of papers in the general computer science and medicine domains. Citation intent was labeled using crowdsourcing. The annotators were asked to identify the intent of a citation and were directed to select among three citation-intent options: Method, Result/Comparison, and Background. This resulted in a total of 9,159 crowdsourced instances. We use the citations that are present in the S2ORC dataset described above. The selected dataset contains 352 citations among 704 unique publications. There is no repeated citing or cited publication in this dataset; thus, the number of unique citing publications as well as the number of unique cited publications is 352.

7.3.4 Parameter selection

We treat one of the evaluation datasets (ACL-ARC) as the validation set, and choose the hyperparameters of our approaches and baselines with respect to the best performance on this dataset. For DeepWalk, we use the implementation provided here4, with the default parameters, except the dimensionality of the estimated representations, which is set to 200 (for the sake of fairness, as we use 200-dimensional text representations for CII).

3 https://www.semanticscholar.org/
4 https://github.com/xgfs/deepwalk-c

Table 7.2: Results on the Somers' ∆ metric.

  Model                              ACL-2015          ACL-ARC           SciCite
  Content-Informed Index (CII)       0.428 ± 0.013     0.308 ± 0.010     0.296 ± 0.006
  Ref. Frequency (Absolute)          0.325 ± 0.000     0.308 ± 0.000     0.144 ± 0.000
  Ref. Frequency (Mean-normalized)   0.351 ± 0.000     0.300 ± 0.000     0.120 ± 0.000
  Ref. Frequency (Min-normalized)    0.321 ± 0.000     0.298 ± 0.000     0.145 ± 0.000
  Ref. Frequency (Max-normalized)    0.270 ± 0.000     0.172 ± 0.000     0.035 ± 0.000
  Similarity-Abstract-Abstract       −0.041 ± 0.000    0.091 ± 0.000     −0.003 ± 0.000
  Similarity-Abstract-Context        −0.147 ± 0.000    0.090 ± 0.000     −0.125 ± 0.000
  Similarity-Context-Abstract        0.013 ± 0.000     −0.062 ± 0.000    −0.202 ± 0.000
  Deepwalk+LR                        −0.071 ± 0.016    0.190 ± 0.006     −0.037 ± 0.018
  Deepwalk+MLP                       −0.026 ± 0.011    0.205 ± 0.024     −0.047 ± 0.015
  GraphSage                          0.023 ± 0.045     0.132 ± 0.024     0.049 ± 0.019

For the models that require learning, i.e., the logistic regression part of DeepWalk, the MLP part of DeepWalk, GraphSage, and CII, we used the ADAM [64] optimizer with an initial learning rate of 0.0001, together with a step learning-rate scheduler that exponentially decays the learning rate by a factor of 0.2 every epoch. We use an L2 regularization of 0.0001. The function f in CII was implemented as a multilayer perceptron with three hidden layers of 256, 64, and 8 neurons, respectively. We use the same network architecture for the MLP part of DeepWalk+MLP as we do for the function f in CII. We train the logistic regression and MLP parts of DeepWalk, GraphSage, and CII for a maximum of 50 epochs, and do early stopping if the validation performance does not improve for 5 epochs. For GraphSage, we use the implementation provided by DGL5. We used a mini-batch size of 1024 for training the models.
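A minimal sketch of this training configuration in PyTorch follows, assuming 200-dimensional input embeddings and a softplus as the non-negative output mapping; the variable names are illustrative rather than the thesis code.

```python
import torch
import torch.nn as nn

dim = 200
f = nn.Sequential(                      # the function f of CII: hidden layers of 256, 64, and 8 neurons
    nn.Linear(2 * dim, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 8), nn.ReLU(),
    nn.Linear(8, 1), nn.Softplus(),     # single output node with a non-negative mapping (assumed softplus)
)

# ADAM with initial learning rate 0.0001 and L2 regularization (weight decay) 0.0001.
optimizer = torch.optim.Adam(f.parameters(), lr=1e-4, weight_decay=1e-4)
# Step scheduler that decays the learning rate by a factor of 0.2 every epoch.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.2)

# Training then runs for at most 50 epochs over mini-batches of size 1024,
# calling scheduler.step() once per epoch and stopping early if the validation
# performance does not improve for 5 consecutive epochs.
```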

7.4 Results and discussion

7.4.1 Quantitative analysis

Table 7.2 shows the performance of the various approaches on the Somers' Delta (∆) metric for each of the datasets ACL-2015, ACL-ARC, and SciCite. For ACL-2015 and SciCite, the proposed approach CII outperforms the competing approaches, while for the ACL-ARC dataset, CII performs on par with the best performing approach. The improvement of CII over the second best performing approach is 22% and 103% on the ACL-2015 and SciCite datasets, respectively. Interestingly, the simplest baseline, Reference-frequency, and its normalized forms are the second best performing approaches. While Reference-frequency performs on par with CII on the ACL-ARC dataset, it does not perform as well on the other two datasets. This can be attributed to the fact that the number of unique citing papers in the ACL-ARC dataset is relatively small. Thus, many citations in ACL-ARC are shared by the same citing paper, which is not the case with the other two datasets. As mentioned in Section 7.3.2, the absolute frequency of referencing a cited paper may provide a good signal regarding the information borrowed from the cited paper when comparing with other papers being cited by the same citing paper. Further, even the normalized forms of Reference-frequency lead to only a marginal increase in performance for the ACL-2015 and SciCite datasets. Thus, simple normalizations (such as the mean, max, and min normalization used in this chapter) are not sufficient to address the differences in citation behavior that occur between different papers. Furthermore, we observe that simple similarity-based approaches, such as cosine similarity between pairs of the various entities (each combination of cited abstract, citing abstract, and citation context), perform close to random scoring (∆ value close to zero). This validates that simple similarity measures, like cosine similarity, are not sufficient to manifest the information that a cited paper lends to the citing paper, showing the necessity of more expressive approaches, like CII. In addition, the other learning-based link-prediction approaches perform considerably worse than the simple Reference-frequency baseline. While on the ACL-2015 and SciCite datasets they perform close to random scoring, their performance on the ACL-ARC dataset is better than the random baseline.

5 https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage

Figure 7.2: Word-cloud (frequently occurring words) that appear in the citation contexts of the citations with the highest predicted importance weights.

7.4.2 Qualitative analysis

In order to understand the patterns that the proposed approach CII learns, we look into the data instances with the highest and lowest predicted weights. As the function f takes as input both the abstract of the cited paper and the citation context, the learnt patterns can be a complex function of the cited-paper abstract and the citation context. Thus, for simplicity, we limit the discussion in this section to the linguistic patterns in the citation context, and how these patterns associate with the CII weights predicted for them. In this direction, we select the 10,000 citation contexts corresponding to the citations with the highest predicted weights, and plot the word clouds for these contexts. We repeat the same exercise for the citation contexts with the lowest predicted weights. Figures 7.2 and 7.3 show the word clouds for the highest-weighted and lowest-weighted citations, respectively. These figures show some clear discriminatory patterns between the highest-weighted and lowest-weighted citations that relate well with the information carried by a citation. For example, words such as 'used' and 'using' are very frequent in the citation contexts of the highest-weighted citations. This is expected, as such verbs provide a strong signal that the cited work was indeed employed by the citing paper, and hence the cited paper informed the citing work. Another interesting pattern in the highest-weighted citations is the presence of words like 'fig', 'figure' and 'table'. Such words are usually present when the authors present or describe important concepts, such as methods and results. As such, citations in these important sections indicate that the cited work is used or extended in the citing paper, which signals importance.

Figure 7.3: Word-cloud (frequently occurring words) that appear in the citation contexts of the citations with the lowest predicted importance weights.

On the other hand, the word cloud for the lowest-weighted citations (Figure 7.3) is dominated by weasel words such as 'may', 'many', 'however', etc. Words such as 'many' commonly occur in the related-work section of a paper, where the paper presents some examples of other related works to emphasize the problem that the citing paper is solving. Words like 'may', 'however', 'but', etc. are commonly used to describe some limitation of the cited work. Such citations are expected to be incidental, carrying less information as compared to other citations. We also look at some examples of individual citation contexts and the weights predicted by CII for them. Table 7.3 shows three citing papers, with an example of a high-weighted citation and an example of a low-weighted citation for each of those papers.
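A small sketch of how the word clouds of Figures 7.2 and 7.3 can be produced, assuming the wordcloud package and a list of (predicted weight, citation context) pairs; the function name is illustrative.

```python
from wordcloud import WordCloud

def wordcloud_for_extreme_citations(scored_contexts, top_k=10000, highest=True):
    """Build a word cloud from the citation contexts with the highest
    (or, with highest=False, the lowest) predicted CII weights."""
    ranked = sorted(scored_contexts, key=lambda x: x[0], reverse=highest)
    text = " ".join(ctx for _, ctx in ranked[:top_k])
    return WordCloud(width=800, height=400, background_color="white").generate(text)

# wordcloud_for_extreme_citations(pairs, highest=True).to_file("high_weight.png")
```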

Table 7.3: Examples of the lowest and highest CII-weighted citations (citing-paper title, relevant citation context, and score).

Citing paper: "AAV-mediated gene therapy for retinal disorders: from mouse to man"
  Highest-weighted (score 8.32e−1): "The number of regenerating axons per nerve was then calculated at each distance using a previously developed formula (Lim et al., 2016; Bei et al., 2016), with the total number of axons equal to πr2."
  Lowest-weighted (score 3.46e−5): "Most experimental therapies that stimulate RGC axon regeneration involve interventions at the time of injury or, in the case of many gene therapies, prior to injury (Buch et al., 2008). While such studies are valuable for identifying therapeutic targets and elucidating mechanisms of RGC axon regeneration, they are not readily translatable to human patients."

Citing paper: "Hippocampal Memory Traces are Differentially Modulated by Experience, Time, and Adult Neurogenesis"
  Highest-weighted (score 8.33e−1): "Exploration in response to a novel open field (OF) was measured as previously described (Richardson-Jones et al., 2010)."
  Lowest-weighted (score 1.55e−5): "Many models have stressed the importance of the hippocampus (HPC) subregions in distinguishing similar patterns (pattern separation) and in completing partial patterns (pattern completion) (Bakker et al., 2008; Leutgeb et al., 2007; Marr, 1971; McHugh et al., 2007; O'Reilly and McClelland, 1994; Treves and Rolls, 1992)."

Citing paper: "The companion dog as a model for human aging and mortality"
  Highest-weighted (score 8.35e−1): "We combined the three youngest bins (0–2 weeks, 2 weeks–2 months, and 2–6 months) into one group labeled 0.25 years; for all other age bins, midpoint ages for each bin were used as in Hoffman, Creevy and Promislow (2013)."
  Lowest-weighted (score 7.34e−3): "Many authors have pointed out that age is the single greatest risk factor for a variety of causes of death (Finkel, 2005; Kaeberlein et al., 2015; Kennedy et al., 2014). However, with few exceptions, we know relatively little about the factors that determine how age shapes disease risk."

For each of these examples, we see that the high predicted weight corresponds to the cited work indeed being employed by the citing paper. For example, the high-weighted citations for the papers titled 'AAV-mediated gene therapy for retinal disorders: from mouse to man' and 'Hippocampal Memory Traces are Differentially Modulated by Experience, Time, and Adult Neurogenesis' in Table 7.3 correspond to formulas and procedures employed by these papers that were developed in the cited papers. Similarly, the lowest-weighted citations correspond to cited papers that are not informative. For example, for the paper titled 'AAV-mediated gene therapy for retinal disorders: from mouse to man', the lower-weighted citation corresponds to describing a limitation of the cited paper. Similarly, for the paper titled 'Hippocampal Memory Traces are Differentially Modulated by Experience, Time, and Adult Neurogenesis', the lower-weighted citation corresponds to describing background work, which is not an informing citation. We present the concluding remarks in Chapter 8.

Chapter 8

Conclusion

In this thesis, we contributed distant-supervised solutions to various challenges arising in text mining, product search, and scholarly networks. In Chapter 3, we presented LDMI, a model to estimate distributed representations of multi-sense words. As a source of distant supervision, LDMI relies on the idea that, if a word carries multiple senses, then having a different representation for each of its senses should lead to a lower loss associated with predicting its co-occurring words, as opposed to the case when a single vector representation is used for all the senses. LDMI is able to efficiently identify the meaningful senses of words and estimate the vector embeddings for each sense of these identified words. The vector embeddings produced by LDMI achieve state-of-the-art results on the contextual-similarity task, outperforming the other related work. In Chapter 4, we presented methods to automatically segment documents into semantically-coherent segments and annotate these segments with the corresponding class. These approaches, instead of using segment-level labeled data, use the set of class labels that are associated with an entire document as a source of distant supervision. We presented two classes of approaches: (i) dynamic-programming-based approaches, and (ii) attention-mechanism-based approaches. Experiments on text segmentation and multi-class classification demonstrate the superior performance of the presented approaches over the competing approaches. In Chapter 5, we presented our distant-supervised solution to two of the challenges in query understanding that can potentially be used to improve users' experience: (i)

when multiple terms are present in a query, determining the relevant terms that are representative of the query's product intent, and (ii) the vocabulary gap between the terms in the query and the product's description. We leveraged the historical query-reformulation logs of walmart.com to estimate and evaluate our approaches. The developed approaches perform better than the non-contextual baselines on multiple experiments, demonstrating the advantage of these approaches. To further verify the generalizability of our methods, it is important to use more test collections for evaluation; this constitutes an important future work. Apart from the query reformulations, another valuable source of information that can be used to identify the important terms in a query, as well as to recommend new terms, is the description of the engaged products. It would be an interesting research direction to combine these multiple sources of information to develop approaches that improve users' product-search experience. To further evaluate our query-refinement approach, it would be interesting to analyze users' behaviour regarding how they modify their queries after going through the suggested terms. In Chapter 6, we presented our distant-supervised solution to another challenge in query understanding, i.e., (iii) annotating individual terms in a query with the corresponding intended product characteristics (product type, brand, gender, size, color, etc.). The developed approaches belong to the class of probabilistic generative models. Our approaches leverage historical query logs and the purchases that these queries led to, and also exploit co-occurrence information among the slots in order to identify intended product characteristics. The developed approaches perform better than the baselines on multiple experiments, demonstrating the advantage of these approaches. In Chapter 7, we presented approaches to estimate content-aware bibliometrics that accurately and quantitatively measure the scholarly impact of a publication. Our distant-supervised approaches use the content of the publications to weight the edges of a citation network, where the weights quantify the extent to which the cited publication informs the citing publication. Experiments on three manually annotated datasets show the advantage of the proposed method over the competing approaches. We believe this approach is a good starting point towards developing rigorous quality-related metrics. As future work, an interesting direction is to develop metrics that also address (i) robustness with respect to malicious entities, and (ii) fairness with respect to various exogenous environmental and societal factors. This thesis makes a step towards leveraging distant supervision for a variety of problems where the key bottleneck is the unavailability of labeled training data. We envision that this work will serve as a motivation for other applications that rely on labeled training data, which is expensive and time-consuming to create.

References

[1] Lada A Adamic and Eytan Adar. Friends and neighbors on the web. Social networks, 25(3):211–230, 2003.

[2] Rahul Agrawal, Archit Gupta, Yashoteja Prabhu, and Manik Varma. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In Proceedings of the 22nd international conference on World Wide Web, pages 13–24. ACM, 2013.

[3] Herman Aguinis, Isabel Suárez-González, Gustavo Lannelongue, and Harry Joo. Scholarly impact revisited. Academy of Management Perspectives, 26(2):105–132, 2012.

[4] Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. Polyglot: Distributed word representations for multilingual nlp. CoNLL, 2013.

[5] Rami Al-Rfou, Vivek Kulkarni, Bryan Perozzi, and Steven Skiena. Polyglot-ner: Massive multilingual named entity recognition. In SDM, 2015.

[6] Pinkesh Badjatiya, Litton J Kurisinkel, Manish Gupta, and Vasudeva Varma. Attention-based neural text segmentation. In European Conference on Informa- tion Retrieval, pages 180–193. Springer, 2018.

[7] David Bamman, Brendan O’Connor, and Noah A Smith. Learning latent personas of film characters. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), page 352, 2014.

[8] Arindam Banerjee, Inderjit S Dhillon, Joydeep Ghosh, and Suvrit Sra. Clustering

on the unit hypersphere using von mises-fisher distributions. Journal of Machine Learning Research, 6(Sep), 2005.

[9] Doug Beeferman, Adam Berger, and John Lafferty. Statistical models for text segmentation. Machine learning, 34(1-3):177–210, 1999.

[10] Steven Bird, Robert Dale, Bonnie J Dorr, Bryan Gibson, Mark Thomas Joseph, Min-Yen Kan, Dongwon Lee, Brett Powley, Dragomir R Radev, and Yee Fan Tan. The acl anthology reference corpus: A reference dataset for bibliographic research in computational linguistics. 2008.

[11] Catherine A Bliss, Morgan R Frank, Christopher M Danforth, and Peter Sheridan Dodds. An evolutionary algorithm approach to link prediction in dynamic social networks. Journal of Computational Science, 5(5):750–764, 2014.

[12] Paolo Boldi, Francesco Bonchi, Carlos Castillo, Debora Donato, Aristides Gionis, and Sebastiano Vigna. The query-flow graph: model and applications. In Pro- ceedings of the 17th ACM conference on Information and knowledge management, pages 609–618. ACM, 2008.

[13] Andrew P Bradley. The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern recognition, 30(7):1145–1159, 1997.

[14] Razvan Bunescu and Marius Pa¸sca. Using encyclopedic knowledge for named entity disambiguation. In EACL, 2006.

[15] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. Technical report, 2013. URL http://arxiv.org/abs/1312.3005.

[16] Steve Chien and Nicole Immorlica. Semantic similarity between search engine queries using temporal correlation. In Proceedings of the 14th international con- ference on World Wide Web, pages 2–11. ACM, 2005.

[17] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.

[18] Freddy YY Choi. Advances in domain independent linear text segmentation. In Proceedings of the 1st North American chapter of the Association for Computa- tional Linguistics conference, pages 26–33. Association for Computational Lin- guistics, 2000.

[19] Aaron Clauset, Cristopher Moore, and Mark EJ Newman. Hierarchical structure and the prediction of missing links in networks. Nature, 453(7191):98–101, 2008.

[20] Arman Cohan, Waleed Ammar, Madeleine van Zuylen, and Field Cady. Structural scaffolds for citation intent classification in scientific publications. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3586–3596, 2019.

[21] Silviu Cucerzan and Eric Brill. Extracting semantically related queries by exploit- ing user session information. Technical report, Technical report, MSR, 2005.

[22] Van Dang, Xiaobing Xue, and W Bruce Croft. Inferring query aspects from reformulations using clustering. In Proceedings of the 20th ACM international conference on Information and knowledge management, pages 2117–2120. ACM, 2011.

[23] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre- training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[24] Inderjit S Dhillon and Dharmendra S Modha. Concept decompositions for large sparse text data using clustering. Machine learning, 42(1-2), 2001.

[25] Doug Downey, Susan Dumais, and Eric Horvitz. Heads and tails: studies of web search with common and rare queries. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 847–848. ACM, 2007.

[26] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2011.

[27] Travis Ebesu and Yi Fang. Neural citation network for context-aware citation recommendation. In Proceedings of the 40th international ACM SIGIR conference on research and development in information retrieval, pages 1093–1096, 2017.

[28] Michael Färber and Adam Jatowt. Citation recommendation: Approaches and datasets. arXiv preprint arXiv:2002.06961, 2020.

[29] Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. Placing search in context: The concept re- visited. In Proceedings of the 10th international conference on World Wide Web, 2001.

[30] C Lee Giles, Kurt D Bollacker, and Steve Lawrence. Citeseer: An automatic citation indexing system. In Proceedings of the third ACM conference on Digital libraries, pages 89–98, 1998.

[31] Goran Glavaš, Federico Nanni, and Simone Paolo Ponzetto. Unsupervised text segmentation using semantic relatedness graphs. Association for Computational Linguistics, 2016.

[32] Yoav Goldberg and Omer Levy. word2vec explained: Deriving mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722, 2014.

[33] Leo A Goodman and William H Kruskal. Measures of association for cross classifi- cations. ii: Further discussion and references. Journal of the American Statistical Association, 54(285):123–163, 1959.

[34] Leo A Goodman and William H Kruskal. Measures of association for cross classifications iii: Approximate sampling theory. Journal of the American Statistical Association, 58(302):310–364, 1963.

[35] Leo A Goodman and William H Kruskal. Measures of association for cross classifications, iv: Simplification of asymptotic variances. Journal of the American Statistical Association, 67(338):415–421, 1972.

[36] Leo A Goodman and William H Kruskal. Measures of association for cross classifi- cations. In Measures of association for cross classifications, pages 2–34. Springer, 1979.

[37] Thomas L Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National academy of Sciences, 101(suppl 1):5228–5235, 2004.

[38] Barbara Grosz and Julia Hirschberg. Some intonational characteristics of discourse structure. In Second international conference on spoken language processing, 1992.

[39] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for net- works. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864, 2016.

[40] Roger Guimerà and Marta Sales-Pardo. Missing and spurious interactions and the reconstruction of complex networks. Proceedings of the National Academy of Sciences, 106(52):22073–22078, 2009.

[41] Mochizuki Hajime, Honda Takeo, and Okumura Manabu. Text segmentation with multiple surface linguistic cues. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics-Volume 2, pages 881–885. Association for Computa- tional Linguistics, 1998.

[42] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in neural information processing systems, pages 1024–1034, 2017.

[43] Jialong Han, Yan Song, Wayne Xin Zhao, Shuming Shi, and Haisong Zhang. hyperdoc2vec: Distributed representations of hypertext documents. arXiv preprint arXiv:1805.03793, 2018.

[44] Donna Harman and Mark Liberman. Tipster complete. Corpus number LDC93T3A, Linguistic Data Consortium, Philadelphia, 1993.

[45] Jing He, Jian-Yun Nie, Yang Lu, and Wayne Xin Zhao. Position-aligned trans- lation model for citation recommendation. In International symposium on string processing and information retrieval, pages 251–263. Springer, 2012.

[46] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE CVPR, 2016.

[47] Qi He, Jian Pei, Daniel Kifer, Prasenjit Mitra, and Lee Giles. Context-aware citation recommendation. In Proceedings of the 19th international conference on World wide web, pages 421–430, 2010.

[48] Qi He, Daniel Kifer, Jian Pei, Prasenjit Mitra, and C Lee Giles. Citation recom- mendation without author supervision. In Proceedings of the fourth ACM inter- national conference on Web search and data mining, pages 755–764, 2011.

[49] Marti A Hearst. Texttiling: Segmenting text into multi-paragraph subtopic pas- sages. Computational linguistics, 23(1):33–64, 1997.

[50] Matthew S Henderson. Discriminative methods for statistical spoken dialogue systems. PhD thesis, University of Cambridge, 2015.

[51] William Hersh, Chris Buckley, TJ Leone, and David Hickam. Ohsumed: an inter- active retrieval evaluation and new large test collection for research. In SIGIR’94, pages 192–201. Springer, 1994.

[52] Eric H Huang, Richard Socher, Christopher D Manning, and Andrew Y Ng. Im- proving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, 2012.

[53] Wenyi Huang, Saurabh Kataria, Cornelia Caragea, Prasenjit Mitra, C Lee Giles, and Lior Rokach. Recommending citations: translating papers into references. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 1910–1914, 2012.

[54] Wenyi Huang, Zhaohui Wu, Chen Liang, Prasenjit Mitra, and C Lee Giles. A neural probabilistic model for context based citation recommendation. In Twenty-ninth AAAI conference on artificial intelligence, 2015.

[55] Zan Huang. Link prediction based on graph topology: The predictive value of generalized clustering coefficient. Available at SSRN 1634014, 2010.

[56] Paul Jaccard. Étude comparative de la distribution florale dans une portion des alpes et des jura. Bull Soc Vaudoise Sci Nat, 37:547–579, 1901.

[57] Minwoo Jeong and Gary Geunbae Lee. Structures for spoken language under- standing: A two-step approach. In ICASSP, volume 4, pages IV–141. IEEE, 2007.

[58] Heng Ji, Ralph Grishman, Hoa Trang Dang, Kira Griffitt, and Joe Ellis. Overview of the tac 2010 knowledge base population track. In TAC 2010, volume 3, 2010.

[59] Shan Jiang, Yuening Hu, Changsung Kang, Tim Daly Jr, Dawei Yin, Yi Chang, and Chengxiang Zhai. Learning query and document relevance from a web-scale click graph. In SIGIR, 2016.

[60] David Jurgens, Srijan Kumar, Raine Hoover, Dan McFarland, and Dan Jurafsky. Measuring the evolution of a scientific field through citation frames. Transactions of the Association for Computational Linguistics, 6:391–406, 2018.

[61] Shubhra Kanti Karmaker Santu, Parikshit Sondhi, and ChengXiang Zhai. On application of learning to rank for e-commerce search. In SIGIR. ACM, 2017.

[62] Saurabh Kataria, Prasenjit Mitra, and Sumit Bhatia. Utilizing context in gen- erative bayesian models for linked corpus. In Aaai, volume 10, page 1. Citeseer, 2010.

[63] Maurice G Kendall. A new measure of rank correlation. Biometrika, 30(1/2): 81–93, 1938.

[64] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. 2015. URL http://arxiv.org/abs/1412.6980.

[65] Jon M Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5):604–632, 1999.

[66] Yuta Kobayashi, Masashi Shimbo, and Yuji Matsumoto. Citation recommenda- tion using distributed representation of discourse facets in scientific articles. In Proceedings of the 18th ACM/IEEE on joint conference on digital libraries, pages 243–251, 2018.

[67] Omri Koshorek, Adir Cohen, Noam Mor, Michael Rotman, and Jonathan Berant. Text segmentation as a supervised learning task. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 469–473, 2018.

[68] Huajing Li, Isaac Councill, Wang-Chien Lee, and C Lee Giles. Citeseerx: an architecture and web service design for an academic document search engine. In Proceedings of the 15th international conference on World Wide Web, pages 883– 884, 2006.

[69] David Liben-Nowell and Jon Kleinberg. The link-prediction problem for social networks. Journal of the American society for information science and technology, 58(7):1019–1031, 2007.

[70] Bing Liu and Ian Lane. Attention-based recurrent neural network models for joint intent detection and slot filling. arXiv preprint arXiv:1609.01454, 2016.

[71] Ya’ning LIU, Rui YAN, and Hongfei YAN. Personalized citation recommendation based on user’s preference and language model. Journal of Chinese Information Processing, (2):18, 2016.

[72] Avishay Livne, Vivek Gokuladas, Jaime Teevan, Susan T Dumais, and Eytan Adar. Citesight: supporting contextual citation recommendation using differen- tial search. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, pages 807–816, 2014.

[73] Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld. S2ORC: The semantic scholar open research corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4969–4983, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.447. URL https://www.aclweb.org/anthology/2020.acl-main.447.

[74] Klaus Macherey, Franz Josef Och, and Hermann Ney. Natural language under- standing using statistical machine translation. In Eurospeech, 2001.

[75] Saurav Manchanda and George Karypis. Distributed representation of multi- sense words: A loss driven approach. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 337–349. Springer, 2018.

[76] Saurav Manchanda and George Karypis. Text segmentation on multilabel docu- ments: A distant-supervised approach. In ICDM, pages 1170–1175. IEEE, 2018.

[77] Saurav Manchanda and George Karypis. Cawa: An attention-network for credit attribution. arXiv preprint arXiv:1911.11358, 2019.

[78] Saurav Manchanda, Mohit Sharma, and George Karypis. Intent term selection and refinement in e-commerce queries. arXiv preprint arXiv:1908.08564, 2019.

[79] Saurav Manchanda, Mohit Sharma, and George Karypis. Intent term weighting in e-commerce queries. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 2345–2348. ACM, 2019.

[80] Víctor Martínez, Fernando Berzal, and Juan-Carlos Cubero. A survey of link prediction in complex networks. ACM computing surveys (CSUR), 49(4):1–33, 2016.

[81] Paul McNamee and Hoa Trang Dang. Overview of the tac 2009 knowledge base population track. In Text Analysis Conference (TAC), volume 17, pages 111– 113. National Institute of Standards and Technology (NIST) Gaithersburg, Mary- land . . . , 2009.

[82] Frank McSherry and Marc Najork. Computing information retrieval performance measures efficiently in the presence of tied scores. In ECIR, 2008.

[83] Aditya Krishna Menon and Charles Elkan. Link prediction via matrix factorization. In Joint european conference on machine learning and knowledge discovery in databases, pages 437–452. Springer, 2011.

[84] Grégoire Mesnil, Xiaodong He, Li Deng, and Yoshua Bengio. Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding. In Interspeech, pages 3771–3775, 2013.

[85] Grégoire Mesnil, Yann Dauphin, Kaisheng Yao, Yoshua Bengio, Li Deng, Dilek Hakkani-Tur, Xiaodong He, Larry Heck, Gokhan Tur, Dong Yu, et al. Using recurrent neural networks for slot filling in spoken language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015.

[86] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[87] Tomas Mikolov, Quoc V Le, and Ilya Sutskever. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168, 2013.

[88] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Dis- tributed representations of words and phrases and their compositionality. In Ad- vances in neural information processing systems, pages 3111–3119, 2013.

[89] Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction without labeled data. In ACL-IJCNLP, 2009.

[90] Jinseok Nam, Jungi Kim, Eneldo Loza Mencía, Iryna Gurevych, and Johannes Fürnkranz. Large-scale multi-label text classification—revisiting neural networks. In ECML PKDD, 2014.

[91] Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. Efficient non-parametric estimation of multiple embeddings per word in vector space. In EMNLP, 2014.

[92] Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 319–327, Florence, Italy, August 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-5034. URL https://www.aclweb.org/anthology/W19-5034.

[93] Roger Newson. Parameters behind “nonparametric” statistics: Kendall’s tau, somers’ d and median differences. The Stata Journal, 2(1):45–64, 2002.

[94] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.

[95] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.

[96] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710, 2014.

[97] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representa- tions. arXiv preprint arXiv:1802.05365, 2018.

[98] Roberto Pieraccini, Evelyne Tzoukermann, Zakhar Gorelov, J-L Gauvain, Esther Levin, C-H Lee, and Jay G Wilpon. A speech understanding system based on statistical representation of semantics. In ICASSP, volume 1, pages 193–196. IEEE, 1992.

[99] Stephen Della Pietra, Mark Epstein, Salim Roukos, and Todd Ward. Fertility models for statistical natural language understanding. In EACL, pages 168–173, 1997.

[100] Martin F Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.

[101] Filip Radlinski, Martin Szummer, and Nick Craswell. Inferring query intent from reformulations and clicks. In Proceedings of the 19th international conference on World wide web, pages 1171–1172. ACM, 2010.

[102] Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D Manning. Labeled lda: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1, pages 248–256. Association for Computational Linguistics, 2009.

[103] Daniel Ramage, Christopher D Manning, and Susan Dumais. Partially labeled topic models for interpretable text mining. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 457–465. ACM, 2011.

[104] Christian Raymond and Giuseppe Riccardi. Generative and discriminative algo- rithms for spoken language understanding. In Interspeech, 2007.

[105] Joseph Reisinger and Raymond J Mooney. Multi-prototype vector-space models of word meaning. In NAACL:HLT, 2010.

[106] Stephen E Robertson and Steve Walker. Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In SIGIR’94.

[107] Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. Okapi at trec-3. Nist Special Publication Sp, 1995.

[108] Joseph John Rocchio. Relevance feedback in information retrieval. The SMART retrieval system: experiments in automatic document processing, pages 313–323, 1971.

[109] Lior Rokach, Prasenjit Mitra, Saurabh Kataria, Wenyi Huang, and Lee Giles. A supervised learning method for context-aware citation recommendation in a large corpus. INVITED SPEAKER: Analyzing the Performance of Top-K Retrieval Algorithms, page 1978, 1978.

[110] Burkhard Rost, Chris Sander, and Reinhard Schneider. Redefining the goals of protein secondary structure prediction. Journal of molecular biology, 235(1):13–26, 1994.

[111] Eldar Sadikov, Jayant Madhavan, Lu Wang, and Alon Halevy. Clustering query refinements by user intent. In Proceedings of the 19th international conference on World wide web, pages 841–850. ACM, 2010.

[112] Gerard Salton and Michael J McGill. Introduction to modern information re- trieval. 1986.

[113] Hossein Soleimani and David J Miller. Semisupervised, multilabel, multi-instance learning for structured data. Neural computation, 29(4):1053–1102, 2017.

[114] Robert H Somers. A new asymmetric measure of association for ordinal variables. American sociological review, pages 799–811, 1962.

[115] Yangqiu Song, Haixun Wang, Weizhu Chen, and Shusen Wang. Transfer un- derstanding from head queries to tail queries. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Manage- ment, pages 1299–1308. ACM, 2014.

[116] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfit- ting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[117] Xuewei Tang, Xiaojun Wan, and Xun Zhang. Cross-language context-aware ci- tation recommendation in scientific articles. In Proceedings of the 37th inter- national ACM SIGIR conference on Research & development in information re- trieval, pages 817–826, 2014.

[118] Gökhan Tür, Dilek Hakkani-Tür, Andreas Stolcke, and Elizabeth Shriberg. Integrating prosodic and lexical cues for automatic topic segmentation. Computational linguistics, 27(1):31–57, 2001.

[119] Marco Valenzuela, Vu Ha, and Oren Etzioni. Identifying meaningful citations. In Workshops at the twenty-ninth AAAI conference on artificial intelligence, 2015.

[120] Andrew Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE transactions on Information Theory, 1967.

[121] Ellen M Voorhees et al. The trec-8 question answering track report. In Trec, volume 99, pages 77–82, 1999.

[122] Ngoc Thang Vu. Sequential convolutional neural networks for slot filling in spoken language understanding. arXiv preprint arXiv:1606.07783, 2016.

[123] Chao Wang, Venu Satuluri, and Srinivasan Parthasarathy. Local probabilistic models for link prediction. In Seventh IEEE international conference on data mining (ICDM 2007), pages 322–331. IEEE, 2007.

[124] Xuanhui Wang and ChengXiang Zhai. Mining term association patterns from search logs for effective query reformulation. In Proceedings of the 17th ACM conference on Information and knowledge management, pages 479–488. ACM, 2008.

[125] Ye-Yi Wang, Li Deng, and Alex Acero. Spoken language understanding. IEEE Signal Processing Magazine, 22(5):16–31, 2005.

[126] Yeyi Wang, Li Deng, and Alex Acero. Semantic frame-based spoken language un- derstanding. Spoken Language Understanding: Systems for Extracting Semantic Information from Speech, pages 41–91, 2011.

[127] Yining Wang, Liwei Wang, Yuanzhi Li, Di He, Wei Chen, and Tie-Yan Liu. A theoretical analysis of ndcg ranking measures. In COLT, volume 8, page 6, 2013.

[128] Puyang Xu and Ruhi Sarikaya. Convolutional neural network based triangular crf for joint intent detection and slot filling. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pages 78–83. IEEE, 2013.

[129] Kaisheng Yao, Geoffrey Zweig, Mei-Yuh Hwang, Yangyang Shi, and Dong Yu. Recurrent neural networks for language understanding. In Interspeech, 2013.

[130] Kaisheng Yao, Baolin Peng, Yu Zhang, Dong Yu, Geoffrey Zweig, and Yangyang Shi. Spoken language understanding using long short-term memory neural networks. In 2014 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2014.

[131] Jun Yin and Xiaoming Li. Personalized citation recommendation via convolutional neural networks. In Asia-Pacific web (APWeb) and web-age information management (WAIM) joint conference on web and big data, pages 285–293. Springer, 2017.

[132] Adam Zemla, Česlovas Venclovas, Krzysztof Fidelis, and Burkhard Rost. A modified definition of sov, a segment-based measure for protein secondary structure prediction assessment. Proteins: Structure, Function, and Genetics, 34(2):220–223, 1999.

[133] Ke Zhai, Zornitsa Kozareva, Yuening Hu, Qi Li, and Weiwei Guo. Query to knowledge: Unsupervised entity extraction from shopping queries using adaptor grammars. In SIGIR, pages 255–264. ACM, 2016.

[134] Aston Zhang, Amit Goyal, Ricardo Baeza-Yates, Yi Chang, Jiawei Han, Carl A Gunter, and Hongbo Deng. Towards mobile query auto-completion: An effi- cient mobile application-aware approach. In Proceedings of the 25th International Conference on World Wide Web, pages 579–590. International World Wide Web Conferences Steering Committee, 2016.

[135] Min-Ling Zhang and Zhi-Hua Zhou. Ml-knn: A lazy learning approach to multi- label learning. Pattern recognition, 40(7):2038–2048, 2007.

[136] Xiaodong Zhang and Houfeng Wang. A joint model of intent determination and slot filling for spoken language understanding. In IJCAI, volume 16, 2016.

[137] Ying Zhao and George Karypis. Criterion functions for document clustering. Tech- nical report, Department of Computer Science, University of Minnesota, 2005.

[138] Deyu Zhou and Yulan He. Learning conditional random fields from unaligned data for natural language understanding. In ECIR, pages 283–288. Springer, 2011.

[139] Will Y Zou, Richard Socher, Daniel M Cer, and Christopher D Manning. Bilingual word embeddings for phrase-based machine translation. In EMNLP, 2013.

[140] Arkaitz Zubiaga, Alberto P García-Plaza, Víctor Fresno, and Raquel Martínez. Content-based clustering for tag cloud visualization. In Social Network Analysis

[140] Arkaitz Zubiaga, Alberto P Garc´ıa-Plaza,V´ıctorFresno, and Raquel Mart´ınez. Content-based clustering for tag cloud visualization. In Social Network Analysis 144 and Mining, 2009. ASONAM’09. International Conference on Advances in, pages 316–319. IEEE, 2009.