Dynamically Exploiting Available Metadata for Browsing and Information Retreival
Total Page:16
File Type:pdf, Size:1020Kb
.1.11,-1- .......... Dynamically Exploiting Available Metadata For Browsing And Information Retreival by Vineet Sinha Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY September 2003 @ Massachusetts Institute of Technology 2003. All rights reserved. I /IA Author .... Dep c ngineering.. .. and Computer Science September 2003 Certified by.. ..... ... .... ....................... Professor David R. Karger Associate Professor Thesis Supervisor Accepted by ............... .. Arthur C. Smith Chairman, Department Committee on Graduate Students WASSWHUSETTS INSTWTE OF TECHNOLOGY BARKER APR 15 2004 LIBRARIES Dynamically Exploiting Available Metadata For Browsing And Information Retreival by Vineet Sinha Submitted to the Department of Electrical Engineering and Computer Science on September 2003, in partial fulfillment of the requirements for the degree of Master of Science Abstract Systems trying to help users deal with information overload need to be able to sup- port the user in an information-centric manner, and need to support portions of the information which are structured - like creation dates - while at the same time allowing for irregularity and evolution of the data schemas which are not allowed in databases, i.e. they need to support semistructured repositories. Thus, to solve information overload there is a need for a system to support finding information in these semistructured repositories. Since users rarely know the exact query to ask a system when searching for infor- mation the first time, they will want to take the results from their first query and use it to successively improve their queries to "home in" in on the needed information. This research investigates a framework, system, and user interface for supporting ef- fective information retrieval and browsing in such repositories. This system is general purpose, involving no hard-coded assumptions about the structure of the repository. An understanding of the end-user's search process is used to provide effective "next steps" to the users leading them towards the information they are seeking. Thesis Supervisor: Professor David R. Karger Title: Associate Professor 2 VIIIII IiIM~ninII93'" Acknowledgments I would like to thank my thesis advisor David R. Karger for his insights, guidance, and effort in clarifying the ideas relevant to the thesis. Additionally, I would like to thank David Huynh for his help with bugs in both my code and the interface as well as Dennis Quan for his effort in building most of the code in Haystack. I would also like to thank the members of the Haystack group whose comments gave me directions for my research. I would also like to thank Mark Ackerman and the pilots in my user study who helped finalizing its design, as well as the actual study participants. I would like to thank Nippon Telegraph and Telephone Corporation, the MIT oxygen project, the Packard Foundation and IBM for their financial support. I would also like to thank all of my friends for their encouragements and support. Lastly but not the least, I want to thank my parents for their sacrifices and encourage- ments, as well as my brothers for living with the annoyances that come with knowing me. 3 4 Contents 1 Introduction 13 1.1 Information-centric environments 14 1.2 Semistructured Information ... .. 15 1.3 M etadata . .. .. ... 16 1.4 The Resource Description Framework (RDF) 17 1.5 Haystack . ............. .. 18 1.6 An Example ....... ....... 19 1.7 Contribution .......... .... 20 1.8 Overview ...... .......... 21 2 Related Work 23 2.1 Information Retrieval Interfaces .. 23 2.2 Search Interfaces in Databases ... 25 2.3 Interfaces supporting metadata . 26 2.4 Semistructured data ......... 27 3 The Navigation Engine 29 3.1 User Search Needs ................. .... 29 3.1.1 An Incremental Process .......... ..... 29 3.1.2 Granularity of Assistance .......... ..... 30 3.1.3 Navigation Support Desiderata ... ... ..... 31 3.2 Navigation Experts . ...... ...... .... ..... 32 3.2.1 Support for Documents .. .... ... .. .. .. 33 5 3.2.2 Support for Collections .......... ........... 33 3.2.3 Support for Strategies .................. .... 35 3.3 Implementation .. ..... ...... ..... .... ..... ... 36 3.3.1 A gents ..... ...... ...... ...... ...... .. 36 3.3.2 A Blackboard system .... ...... ...... ...... 37 3.3.3 Query Engine .... ...... ..... ...... ...... 38 4 User Interface 41 4.1 Interface Design Desiderata ....... ............. ... 41 4.1.1 Integrated Querying and Navigation ........... ... 42 4.1.2 System Integration ............... ......... 42 4.2 Starting searches ................... .......... 43 4.2.1 Pervasive Querying Capabilities ....... .......... 43 4.2.2 Support for large Collections ......... .......... 44 4.3 The Navigation Pane .... ............ ........... 45 4.4 Support for Annotations on Attribute-Types ...... ........ 47 4.4.1 Supporting continuous variables ......... ........ 47 4.4.2 Supporting hierarchical attribute values ..... ........ 48 5 A Vector Space Model for Semistructured Data 49 5.1 Building the model ....................... ..... 49 5.1.1 Traditional Use .... ............ .......... 49 5.1.2 Adapting to Semistructured data ........ ........ 51 5.1.3 Extending The Model .... ........ ....... ... 51 5.2 Using The Model .... ......... ........ ........ 54 5.2.1 Implementation Considerations ... ........... ... 54 5.2.2 Normalizing the Model .................. .... 54 5.3 Refinement ........... ............. ........ 56 5.4 Similar Documents .......... ............ ...... 56 6 IrillWI IRRI, O~lillililisilMW111Pll Irr, 6 Evaluation 57 6.1 Datasets ......... ......................... 57 6.1.1 Datasets in Haystack ..... .................. 57 6.1.2 External Datasets ... ...................... 58 6.2 User Study ................................ 62 6.2.1 Study Data Preparation ................. .... 62 6.2.2 Iterative Developm ent ......................64 6.2.3 Study Design ........................... 65 6.2.4 Apparatus ............................. 67 6.2.5 Results ............................... 68 7 Future Work 73 7.1 Navigation ................................. 73 7.2 Organizing ................................. 75 A User-Study Details 77 A .1 Phase 1 - Survey ........................ ..... 77 A .2 Phase 2 - Tasks .............................. 80 A.2.1 Instructions ............................ 80 A .2.2 Task 1 ............................... 81 A .2.3 Task 2 ............. .................. 82 A .2.4 Task 3 ............................... 83 A .2.5 Task 4 ........ ....................... 84 A .3 Phase 3 - Post Tasks Survey .................. ..... 84 7 8 List of Figures 1-1 Haystack's navigation system on a collection of the users favorites. .. 19 4-1 Haystack's scrapbook collection shown in the left pane. ........ 42 4-2 Haystack's navigation system on a large collection. .......... 43 4-3 Haystack's navigation system on a large collection. .......... 44 4-4 Haystack's navigation system on a collection of the users favorites. 45 4-5 The navigation interface on a side panel. ................ 46 4-6 Possible interfaces that can be used to show date options ....... 48 5-1 Translation of "Betty bought some butter, but the butter was bitter" to the Vector Space Model . ....................... 50 5-2 An RDF Graph of some recipes. ..................... 51 5-3 The vector space model representation of the recipes. ......... 52 5-4 Adding support to make the navigation engine look up authors' uni- versities for any paper ............. ............. 53 6-1 The output provided by the navigation system in Haystack for the system 's Inbox. .............................. 59 6-2 The output provided by the navigation system in Haystack for the 50 states dataset (as given). ....................... 60 6-3 Clicking on Bird:Cardinal. ....................... 60 6-4 Adding rdfs:label's to improve data. ........ .......... 61 6-5 Annotating state area better. ...................... 61 6-6 Attributes used in the Epicurious.com web-site. ............ 63 9 6-7 Listing of added ingredients to the Recipes corpus. ...... ... 64 6-8 Listing of added ingredients to the Recipes corpus. (continued) . ... 65 6-9 Users evaluating the system (out of 9). ..... .... ..... ... 68 10 List of Tables 6.1 Experts displayed for the Base Interface ................ 66 6.2 Additional Experts displayed for the Complete Interface ....... 66 6.3 Users response to Success and Difficulty levels of Tasks 2 and 3 (ranked out of 10) .......... ................................. 71 11 12 Chapter 1 Introduction With people spending increasing amounts of time using computers and the steep de- crease in price of both storage and communication, the amount of digital information available to users is growing rapidly. These information-centric environments contain information or documents that are inherently semistructured in nature, i.e. they often contain portions of structured information (like creation dates) along with unstruc- tured portions of text (paragraphs). Using this information properly brings a need to be able to search and browse semistructured data repositories. Users frequently do not know exactly what query to ask the system to satisfy their information need. Instead, a user may take an iterative approach, repeatedly posing a query, looking at the results that come back, and refining the query to "home in" on the information they want. This iteration