Distant-supervised algorithms with applications to text mining, product search, and scholarly networks

A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY Saurav Manchanda IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy

Professor George Karypis, Advisor

November, 2020

© Saurav Manchanda 2020. ALL RIGHTS RESERVED

Acknowledgements

A Ph.D. is more about the journey than the destination, and I am lucky enough to be surrounded by many people who joined me on this journey. First and foremost, I would like to thank my advisor, George. He is an excellent advisor who not only sharpened my research skills but also inspired me to do great research work. Every interaction with him was an opportunity for me to learn. I hope to become as smart and hardworking as him one day.

I would like to thank Professors Zhi-Li Zhang, Rui Kuang, Brian Reese, and Tian He for serving on my committees for my thesis, thesis proposal, and preliminary exams. Their comments and suggestions have been invaluable in shaping my research, and I am fortunate to have had them on my committees.

My Ph.D. journey would have been very different without my lab, which I consider my home away from home. I have great memories of my time in the Karypis lab that I will never forget (especially the silly arguments and quarrels, like what happens in a family). My lab family includes: Agi, Ancy, David, Dominique, Eva, Haoji, Jake, Jeremy, Kosta, Maria, Mohit, Prableen, Rohit, Sara, Shaden, Shalini, and Zeren.

I would also like to acknowledge the grants that supported my research: NSF (1447788, 1704074, 1757916, 1834251), the Army Research Office (W911NF1810344), and Intel Corp. I also thank Walmart Labs for providing data for my research. Lastly, I would like to thank the staff at the Department of Computer Science and the Digital Technology Center for providing me with the necessary resources for my research.

Dedication

To papa, mummy, chai, akoo and the reader!
Abstract

In recent times, data has become the lifeblood of nearly all businesses. As such, the real-world impact of data-driven machine learning has grown in leaps and bounds. It has established itself as a standard tool for organizations to draw insights from data at scale, and hence, to enhance their profits. However, one of the key bottlenecks in deploying machine learning models in practice is the unavailability of labeled training data. Manually-labeled training sets are expensive and tedious to create. Besides, they cannot practically be reused for new objectives if the underlying distribution of the data changes over time. Distant supervision provides an alternative to expensive hand-labeled datasets by leveraging weaker, indirect sources of supervision. In this thesis, we identify and provide solutions to some of the challenges that can benefit from distant-supervised approaches.

First, we present a distant-supervised approach to accurately and efficiently estimate a vector representation for each sense of multi-sense words. Second, we present approaches for distant-supervised text segmentation and annotation, which is the task of associating individual parts of a multilabel document with their most appropriate class labels. Third, we present approaches for query understanding in product search. Specifically, we developed distant-supervised solutions to three challenges in query understanding: (i) when multiple terms are present in a query, determining the relevant terms that are representative of the query's product intent; (ii) bridging the vocabulary gap between the terms in the query and the product's description; and (iii) annotating individual terms in a query with the corresponding intended product characteristics (product type, brand, gender, size, color, etc.).
Fourth, we present approaches to estimate content-aware bibliometrics that accurately quantify the scholarly impact of a publication. Our proposed metric assigns content-aware weights to the edges of a citation network, quantifying the extent to which the cited node informs the citing node. Consequently, this weighted network can be used to derive impact metrics for the various entities involved, such as publications and authors.

Contents

Acknowledgements . . . i
Dedication . . . ii
Abstract . . . iii
List of Tables . . . viii
List of Figures . . . x

1 Introduction . . . 1
  1.1 Thesis statement . . . 1
  1.2 Thesis Outline and Original Contributions . . . 2
    1.2.1 Chapter 3: Distributed representation of multi-sense words . . . 2
    1.2.2 Chapter 4: Credit attribution in multilabel documents . . . 2
    1.2.3 Chapters 5 and 6: Query understanding in product search . . . 4
    1.2.4 Chapter 7: Impact assessment in scholarly networks . . . 5
  1.3 Bibliographic Notes . . . 6

2 Background and Related Work . . . 7
  2.1 Distributed representation of multi-sense words . . . 7
  2.2 Credit attribution in multilabel documents . . . 8
  2.3 Improving relevance in product-search . . . 10
    2.3.1 Term-weighting methods . . . 10
    2.3.2 Improving relevance for the tail queries . . . 11
    2.3.3 Query refinement . . . 12
    2.3.4 Slot-filling . . . 14
    2.3.5 Entity linking . . . 14
  2.4 Importance assessment in scholarly networks . . . 15
    2.4.1 Citation Indexing . . . 15
    2.4.2 Citation recommendation . . . 15
    2.4.3 Link-prediction . . . 16

3 Distributed representation of multi-sense words . . . 18
  3.1 Definitions and Notations . . . 19
  3.2 Loss driven multisense identification (LDMI) . . . 20
    3.2.1 Identifying the words with multiple senses . . . 21
    3.2.2 Clustering the occurrences . . . 22
    3.2.3 Putting everything together . . . 23
  3.3 Experimental methodology . . . 23
    3.3.1 Datasets . . . 23
    3.3.2 Evaluation methodology and metrics . . . 24
  3.4 Results and discussion . . . 28
    3.4.1 Quantitative analysis . . . 28
    3.4.2 Qualitative analysis . . . 30

4 Credit attribution in multilabel documents . . . 32
  4.1 Definitions and Notations . . . 34
  4.2 Dynamic programming based methods . . . 36
    4.2.1 Overview . . . 36
    4.2.2 Segmentation algorithm . . . 37
    4.2.3 Building the classification model . . . 40
  4.3 Credit Attribution With Attention (CAWA) . . . 43
    4.3.1 Architecture . . . 44
    4.3.2 Model estimation . . . 47
    4.3.3 Segment inference . . . 48
  4.4 Experimental methodology . . . 49
    4.4.1 Datasets . . . 49
    4.4.2 Competing approaches . . . 51
    4.4.3 Evaluation Methodology . . . 52
    4.4.4 Parameter selection . . . 53
    4.4.5 Performance Assessment Metrics . . . 54
  4.5 Results and Discussion . . . 56
    4.5.1 Credit Attribution . . . 56
    4.5.2 Multilabel classification . . . 60
    4.5.3 Ablation study . . . 62

5 Intent term selection and refinement in e-commerce queries . . . 65
  5.1 Definitions and Notations . . . 68
  5.2 Proposed methods . . . 68
    5.2.1 Contextual term-weighting (CTW) model . . . 70
    5.2.2 Contextual query refinement (CQR) model . . . 72
  5.3 Experimental methodology . . . 74
    5.3.1 Dataset . . . 74
    5.3.2 Evaluation Methodology and Performance Assessment . . . 76
  5.4 Results and Discussion . . . 81
    5.4.1 Quantitative evaluation . . . 81
    5.4.2 Qualitative evaluation . . . 83

6 Distant-supervised slot-filling for e-commerce queries . . . 87
  6.1 Definitions and Notations . . . 89
  6.2 Slot Filling via Distant Supervision . . . 92
    6.2.1 Uniform Slot Distribution (USD) . . . 93
    6.2.2 Markovian Slot Distribution (MSD) . . . 95
    6.2.3 Correlated Uniform Slot Distribution (CUSD) . . . 97
    6.2.4 Correlated Uniform Slot Distribution with Subset Selection (CUSDSS) . . . 99
  6.3 Experimental methodology . . . 101
    6.3.1 User engagement and annotated dataset . . . 101
    6.3.2 Evaluation methodology . . . 102
    6.3.3 Performance Assessment metrics . . . 104
  6.4 Results and Discussion . . . 105

7 Importance assessment in scholarly networks . . . 110
  7.1 Definitions and Notations . . . 112
  7.2 Content-Informed Index (CII) . . . 113
  7.3 Experimental methodology . . . 115
    7.3.1 Evaluation methodology and metrics . . . 115
    7.3.2 Baselines . . . 116
    7.3.3 Datasets . . . 118
    7.3.4 Parameter selection . . . 119
  7.4 Results and discussion . . . 121
    7.4.1 Quantitative analysis . . . 121
    7.4.2 Qualitative analysis . . . 122

8 Conclusion . . . 126

List of Tables

3.1 Dataset statistics. . . . 24
3.2 Results for the Spearman rank correlation (ρ × 100). . . . 28
3.3 Top similar words for different senses of the multi-sense words*. . . . 30
3.4 Senses discovered by the competing approaches*. . . . 31
4.1 Notation used throughout the chapter. . . . 35
4.2 Dataset statistics. . . . 50
4.3 Hyperparameter values. . . . 53
4.4 Results on the segmentation task. . . . 57
4.5 Sentence classification performance when the sentences belong to similar classes. . . . 58
4.6 Statistics of the predicted segments. . . . 60
4.7 Results on the multilabel classification task. . . . 61
5.1 Notation used throughout the chapter. . . . 68
5.2 Results for the query term-weighting problem. . . . 81
5.3 Results for the query refinement problem. . . . 82
5.4 Predicted term-weights for some queries. . . . 84
5.5 Top 20 predicted terms for a few selected queries. . . . 85
6.1 Notation used throughout the chapter. . . . 91
6.2 Retrieval results (Task T1). . . . 105
6.3 Slot-filling results when cq is observed (Task T2). . . . 106
6.4 Slot-filling results when cq is unobserved (Task T3). . . . 106
6.5 Individual precision (P), recall (R) and F1-score (F1) when cq is observed. . . . 106
6.6 Individual precision (P), recall (R) and F1-score (F1) when cq is unobserved. . . . 107
6.7 Examples of product categories estimated by CUSD and CUSDSS. . . . 108
7.1 Notation used throughout the paper. . . . 112
7.2 Results on the Somers' ∆ metric. . . . 120
7.3 Examples of lowest and highest CII-weighted citations. . . . 124

List of Figures

3.1 Distribution of the average contextual loss for all words