Leveraging Open Source Web Resources to Improve Retrieval of Low Text Content Items
Total Page:16
File Type:pdf, Size:1020Kb
Leveraging open source web resources to improve retrieval of low text content items A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY Ayush Singhal IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY JAIDEEP SRIVASTAVA August, 2014 c Ayush Singhal 2014 ALL RIGHTS RESERVED Acknowledgements There are many people to whom I feel grateful. First of all, I would like to thank my advisor Prof. Jaideep Srivastava for being such a nice and open-minded advisor. He always encouraged me to pursue my research interests. I am grateful to him for taking me as his PhD student even after one year of graduate studies. I also express my special gratitude to Prof. Vipin Kumar for giving me the opportunity to come to Univesity of Minnesota for my PhD. His advice about PhD and his understanding nature helped me so much in times of difficulties. I am grateful to him for that. I sincerely thanks Prof. Nikolaos Papanikolopoulos and Prof. Adam Rothman for spending their valuable time to serve on my thesis committee. I also thanks Prof Maria Gini and Prof. Abhishek Chandra for being in my preliminary exam committee. The way I have developed during my PhD work and the strength to be able to face several challenges is a result of the wonderful teachings of Dr. Krishnan. I feel indebted to him for his constant efforts to encourage me to do this work with the right motivation– for the cause of education and welfare of the society. Because of him, today I feel that all the dynamics in my life have a clear purpose and I can be always joyful to pursue that purpose. I am extremely grateful to Dr. Ankush Mittal, who introduced me to research during my undergraduate studies at IIT -Roorkee. He has given me several loving suggestions both before and during my PhD which I would like to use in time to come. I am most delighted to express my loving feelings towards my dear friends Sanyam and Ashish. They have been like my brothers throughout my PhD work. I also thank Shashank, Ankit and Dr. Saket for being such nice friends. I would like to share my appreciation for my family members, especially my mother, who cooperateed very favorably with me. She always encouraged me to fulfill the aim with which I have come to US. I am grateful for their love and sacrifices. I extend a loving thanks to my brothers Arpit and Kushagra for just being good brothers. i Dedication To my teacher Dr. Krishnan, who always keeps me up... with his teachings, precept and encouragements. ii Abstract With the exponential increase in the amount of digital information in the world, search en- gines and recommendation systems have become the most convenient ways to find relevant information. As an example, the number of web pages on the world wide web was estimated to be over a trillion mark by the year 2008. However, search today is no longer limited to docu- ments on the world wide web. The new “information needs” such as multimedia items (images, videos) opens up challenging avenues for scientific research. Thus the search techniques used to find items which are content rich ( e.g documents) no longer holds for items with low-text content. In the literature, several solutions are proposed for developing search framework for multimedia item search which includes using the visual or audio content of such items for re- trieval purposes. However, there is little research on this problem in the domain of scientific research artifacts. This thesis investigates the problem of retrieval of low-text content items for search and recommendation purposes and propose novel techniques to improve retrieval of such items. In particular, we focus on scientific research datasets owing to their importance and exponential growth in the last decade. One of the main challenges in searching research datasets is the lack of text content or the meta-information about the datasets. While the datasets themselves have raw content, the problem of low text content makes the conventional text based search techniques inadequate for their retrieval. In comparison to the multimedia items, where visual and audio features have been utilized to enhance search or recommendation based retrieval, scientific research datasets lack a uniform schema for representing their raw content. Although solutions such as curation and annotation by experts/data scientists exists but these are infeasible for practical operation on a large scale. As a solution, this thesis provides a computational and an efficient framework for retrieving such low-text content items. We primarily present two retrieval models, namely, (1) a user profile based search, and (2) keyword based search. For the user profile based search model, we show that the text content of the item can be derived from the user’s profile and the relevance ranking can also be derived based on the information in the users profile. We find that the proposed approach which lever- ages the information from open source web resources for item extraction outperforms the local content based extraction approaches. For the keyword based search model, we have developed a iii content rich database for research datasets. We use novel content generation techniques to over- come the low-text content challenge for datasets. The content information is extracted from open source and crowd sourced web resources like academic search engines and Wikipedia. In addition to the stand-alone quantitative assessment of the content generated, we evaluate the efficiency of the entire keyword based search framework via a user study. Based on user re- sponses, the thesis reports positive evidence that the proposed search framework is better than the popular general purpose search engine for searching research datasets with a context based queries. The ideas developed in this thesis are implemented in a real search tool- DataGopher.org: an open source search engine for scientific research datasets. Moreover, the approaches de- veloped for research datasets can have application to other low-content items such as short text document, news feeds, Twitter and Facebook postings. In summary, the computational approaches proposed in this work advance the state-of-the-art in retrieval of low-text content items. Whereas the extensive evaluations that are performed on items like scientific research datasets and low text content documents demonstrate the validity of the findings. iv Contents Acknowledgements i Dedication ii Abstract iii List of Tables ix List of Figures xi 1 Introduction 1 1.1 Information retrieval . .1 1.1.1 Basic definitions and concepts . .1 1.1.2 A brief history . .4 1.2 Research challenges and thesis problem . .5 1.3 Overview of the thesis . .6 1.3.1 Organization of the thesis . .6 2 Leveraging user profile to enhance search of low-content items 9 2.1 Background and Problem Statement . 11 2.2 Related Work . 11 2.3 Proposed Approach . 12 2.3.1 Context creation from user’s interest . 13 2.3.2 Item finding . 15 2.3.3 Item ranking . 17 v 2.4 Experimental Analysis . 18 2.4.1 Experimental analysis for item finding algorithm . 18 2.4.2 Experimental analysis for overall framework . 20 2.5 Experimental results and discussion . 22 2.6 Conclusions and Future Work . 24 3 Automating keyword assignment to short text documents 25 3.1 Related Work . 28 3.2 Background of open source web resources . 29 3.2.1 WikiCFP . 29 3.2.2 Crowd-sourced knowledge . 30 3.2.3 Academic search engines . 31 3.3 Overview of our work . 31 3.4 Web context extraction . 33 3.5 Keyword abstraction . 33 3.5.1 Finding keywords for documents with 5-10 text words . 33 3.5.2 Automating keyword abstraction and de-noising . 37 3.6 Keyword extraction from the summary text of a document . 40 3.6.1 TextRank[1] . 40 3.6.2 Rake[2] . 40 3.6.3 Alchemy[3] . 41 3.6.4 Adding web context . 41 3.7 Experimental analysis for keyword abstraction for documents with 5-10 text words ....................................... 42 3.7.1 Test dataset description . 42 3.7.2 Ground truth . 42 3.7.3 Baseline approach . 42 3.7.4 Evaluation metrics . 43 3.7.5 Results and discussion for keyword abstraction without de-noising . 44 3.7.6 Results and discussion for keyword abstraction with de-noising . 46 3.8 Experimental analysis of the keyword extraction approach . 50 3.8.1 Dataset description . 50 vi 3.8.2 Preliminary analysis of the data . 51 3.8.3 Experimental design parameters . 51 3.8.4 Results and discussion . 54 3.9 Conclusions and future work . 58 4 Automating annotation for research datasets 59 4.1 Problem description . 62 4.2 Proposed Approach . 63 4.2.1 Context generation . 64 4.2.2 Identifying data type labels . 64 4.2.3 Concept generation . 65 4.2.4 Finding similar datasets . 66 4.3 Experiments . 67 4.3.1 Dataset Used . 68 4.3.2 Experiments for data type labelling . 69 4.3.3 Experiments for concept generation . 70 4.3.4 Experiments for similar dataset assignment . 72 4.4 Experimental results and discussion . 73 4.4.1 Data type labelling . 73 4.4.2 Concept generation . 74 4.4.3 Similar dataset assignment . 77 4.5 Use case study . 78 4.5.1 Qualitative evaluation . 79 4.5.2 Quantitative evaluation . 80 4.6 Related work . 81 4.7 Conclusions and future work . 83 5 Annotation expansion for low text content items 84 5.1 Problem Setting . 86 5.2 Proposed Approach .