INSTITUTO TECNOLÓGICO Y DE ESTUDIOS SUPERIORES DE MONTERREY
CAMPUS MONTERREY
GRADUATE PROGRAM IN ELECTRONICS, COMPUTING, INFORMATION AND COMMUNICATIONS
TECNOLÓGICO DE MONTERREY

CONCEPT BASED RANKING IN PERSONAL REPOSITORIES

THESIS PRESENTED AS A PARTIAL REQUIREMENT TO OBTAIN THE ACADEMIC DEGREE OF MASTER OF SCIENCE IN INFORMATION TECHNOLOGY

BY ALBERTO JOAN SAAVEDRA CHOQUE

MONTERREY, N. L., DECEMBER 2007

Concept Based Ranking in Personal Repositories
by Alberto Joan Saavedra Choque

Thesis Presented to the Graduate Program in Electronics, Computing, Information and Communications School

This thesis is a partial requirement to obtain the academic degree of Master of Science in Information Technology

Instituto Tecnológico y de Estudios Superiores de Monterrey
Campus Monterrey
December 2007

To Gary, my brother, who motivates me to better myself every day and to look at circumstances from different points of view.

To Yolanda, my mother, who supports me unconditionally and was with me in person during part of my journey in this country that welcomed me.

To Alberto, my father, who guides me when I need it and is my point of reference for overcoming the obstacles that arise along the way.

To Pilar, my best friend, who knows better than anyone the sacrifice made during this adventure and kept me focused on successfully completing the path I had begun.

To my grandfather, uncles, aunts and cousins, who made me feel, from a distance, the backing and support of the whole family.

Acknowledgments

To Dr. Juan Carlos Lavariega, my advisor, for his trust and support in carrying out this research.

To Dr. Raúl Pérez, committee member and coordinator of the master's program, for his constant availability and guidance in answering my questions.

To Dr. David Garza, for trusting me to take part in research initiatives such as the PDLib project.

To my colleagues in the PDLib project and at ITESM, for giving me the opportunity to learn from and work alongside very capable people.
To the professors of the PDLib project and ITESM, for their understanding and patience in passing on their knowledge and clearing up my doubts.

To LASPAU/OEA, for giving me the opportunity to get to know and be part of this prestigious institution: Tec de Monterrey.

ALBERTO JOAN SAAVEDRA CHOQUE
Instituto Tecnológico y de Estudios Superiores de Monterrey
December 2007

Concept Based Ranking in Personal Repositories

Summary

We are storing an increasing amount of digital information to support our work and life activities. Emails, documents, articles, bookmarks, e-books, images, audio and video are being stored not only on our hard drives but also in the storage provided by newer Web 2.0 services. All this information constitutes a kind of personal digital memory, whose key feature is the ability to find what we are looking for as fast as possible.

Information requests in these personal repositories mostly involve re-finding documents we have already read or seen. Full text search does the job when we remember the exact words, which is getting harder nowadays; more often we remember only a concept that our memory associates with the information object we are looking for. Concept based query expansion addresses this problem: with query expansion, our chances of finding the requested document increase. But if the list of results is long, locating the document in it will probably take some time.

In repositories like the Personal Digital Library system, which targets optimal use on portable devices, an adequate ranking of the documents is important for a better search experience. The analysis of the linking structure of documents is a good ranking factor for general search engines. In the context of personal collections, however, this factor is not available, so more information about each document is needed. Tagging seems the best option for providing additional criteria to rank the list of results appropriately.
This research proposes an architecture for a conceptual information retrieval system for personal repositories, which includes tagging as the mechanism that provides semantics and query expansion as the recall enabler. The architecture includes an additional feature to gather more semantics: the indexing of concepts using latent semantic analysis.

Contents

Acknowledgments
Summary
List of Tables
List of Figures
Chapter 1 Introduction
  1.1 Personal Digital Repositories
    1.1.1 Personal Digital Libraries
    1.1.2 Search in Personal Repositories
  1.2 Problem Statement
    1.2.1 Research Goal
    1.2.2 Hypothesis
    1.2.3 Contributions
  1.3 Thesis Outline
Chapter 2 Literature Review
  2.1 Concept based Information Retrieval
    2.1.1 Set-theoretic conceptual models
    2.1.2 Algebraic conceptual models
    2.1.3 Probabilistic conceptual models
  2.2 Concept Based Organization
    2.2.1 Tagging
    2.2.2 Information Retrieval in Folksonomies
  2.3 Personal Information Re-Use
    2.3.1 Personal Digital Libraries Retrieval
  2.4 Concept Based Ranking
    2.4.1 Search Engine Ranking
    2.4.2 Domain Specific Ranking
    2.4.3 Personalized Ranking
    2.4.4 Query expansion
    2.4.5 Relevance Feedback
    2.4.6 Retrieval Performance
  2.5 Summary
Chapter 3 Personal Conceptual Architecture
  3.1 Architecture of a Conceptual Information Retrieval System
  3.2 Conceptual Modules
    3.2.1 Concepts Labeling
    3.2.2 Concepts Indexing
    3.2.3 Concepts Ranking
    3.2.4 Indexing Engine
    3.2.5 Tagging Store
    3.2.6 Document Retrieval
  3.3 Integration Modules
Chapter 4 Prototype
  4.1 Technologies used
    4.1.1 Tagging Store database: db4o
    4.1.2 Tagging interface: del.icio.us API
    4.1.3 Concepts Indexing: Lingo
    4.1.4 Indexing Engine: Lucene
    4.1.5 Text Summarization: blindLight
    4.1.6 Query Expansion: WordNet
  4.2 The Experiment
    4.2.1 Experiment documents selection
    4.2.2 Indices structure
    4.2.3 Querying and Ranking
  4.3 Integration with other systems
    4.3.1 Search services
    4.3.2 Upload service
    4.3.3 Indexing service
  4.4 Considerations for PDLib
    4.4.1 Advantages
    4.4.2 Considerations
  4.5 Summary
Chapter 5 Results, Conclusions and Future Work
  5.1 Results
  5.2 Conclusions
  5.3 Future work
    5.3.1 Tagging interface and usability
    5.3.2 Attention metadata for relevance feedback
    5.3.3 Probabilistic Latent Semantic Analysis for Concepts Indexing
Bibliography
Appendix A XML Files
Appendix B UML diagrams
  B.1 Class diagrams
  B.2 Sequence diagrams

List of Tables

2.1 Constraints and Retrieval Formulas
4.1 Comparison of database technologies
4.2 Test set summary
4.3 Set features for concepts indexing
5.1 Comparison of precision and recall averages

List of Figures

2.1 Information Retrieval Models
3.1 Information Retrieval System
3.2 The Retrieval Process
3.3 Architecture of a conceptual IR
3.4 Concepts Labeling Module
3.5 Concepts Indexing Module
3.6 Concepts Ranking Module
4.1 Tagging Store Data Model
4.2 Delicious Class Diagram
4.3 Lingo Algorithm
4.4 Lucene Index Update and Search Configuration
4.5 Automatic Text Summarization with blindLight
4.6 WordNet Lucene index
4.7 del.icio.us user interface
4.8 Test set tag cloud
4.9 Lucene index document example
4.10 Concepts in db4o database
4.11 Bayesian Network for Ranking
4.12 delPDLib Web Service WSDL diagram
5.1 Google Custom Search Engine Results
5.2 Expanded Relevant Ranking Results
B.1 Package Dependency
B.2 Package edu.itesm.pdlib.delicious
B.3 Package edu.itesm.pdlib.lucene - index process
B.4 Package edu.itesm.pdlib.lucene - query process
B.5 Package edu.itesm.pdlib.model - Documents model
B.6 Package edu.itesm.pdlib.model - Results model
B.7 retrieveAllPosts method - sequence diagram

Chapter 1
Introduction

Our society's progress has depended on access to the knowledge generated by our ancestors. Throughout history, libraries have gathered, archived and preserved written knowledge, classifying and organizing these collections to make them available to the public. Nowadays, since a significant share of our interactions can be digitized, we can access the information generated by our colleagues almost on an as-needed basis; organizing this information is the job of digital libraries.

A digital library is a collection of digital information objects and services that support users in managing, accessing, storing and manipulating these objects, with special attention to organizing and presenting the objects so that users find them useful and have fewer problems understanding them [78]. One of the key requirements for a digital library is to provide adequate search tools that allow agile access to the requested information [55].

The amount of information available is huge. According to EMC's Digital Universe whitepaper [51], 161 exabytes were used to store digital information in 2006, and only 25% of those 161 exabytes was original data. These numbers will keep growing as more information becomes digitally available (for example through digital TV broadcasting or the digitization of book collections by the Open Content Alliance and Google) and as archival efforts like the Internet Archive or television broadcast archiving [66] bring memory capabilities to the most powerful media in our culture. Digital libraries have to take this constant growth into consideration.