A Protocol and Prototype Application to Find Maps
Total Page:16
File Type:pdf, Size:1020Kb
MAPSEARCH: A PROTOCOL AND PROTOTYPE APPLICATION TO FIND MAPS by JUDITH GELERNTER A Dissertation submitted to the Graduate School-New Brunswick Rutgers, The State University of New Jersey In partial fulfillment of the requirements for the degree of Doctor of Philosophy Graduate Program in Communication, Information and Library Studies Written under the direction of Professor Michael E. Lesk and approved by ________________________ ________________________ ________________________ ________________________ New Brunswick, New Jersey May 2008 © 2008 Judith Gelernter ALL RIGHTS RESERVED ABSTRACT OF THE DISSERTATION MAPSEARCH: A Protocol and Prototype Application to Find Maps By Judith Gelernter Dissertation Director: Professor Michael Lesk Even geographers need ways to find what they need among the thousands of maps buried in map libraries and in journal articles. It is not enough to provide search by region and keyword. Studies of queries show that people often want to look for maps showing a certain location at a certain time period or with a subject theme. The difficulties in finding such maps are several. Maps in physical and digital collections often are organized by region. Multi-dimensional manual indexing is time-consuming and so many maps are not indexed. Further, maps in non-geographical publications are indexed rarely, making them essentially invisible. In an attempt to solve actual problems, this dissertation research automatically indexes maps in published documents so that they become visible to searchers. The MapSearch prototype aggregates journal components to allow finer-grained searching of article content. MapSearch allows search by region, time, or theme as well as by keyword (http://scilsresx.rutgers.edu/~gelern/maps/). Automatic classification of maps is a multi-step process. A sample of 150 maps and the text (that becomes metadata) describing the maps have been copied from a random assortment of journal articles. Experience taking metadata manually enabled the writing of instructions to mine data automatically; experience with manual classification allowed for writing algorithms that classify maps by region, time and theme automatically. That classification is supported by ontologies for region, time and theme that have been ii generated or adapted for the purpose and that allow what has been called intelligent search, or smart search. The 150 map training set was loaded into the MapSearch engine repeatedly, each time comparing automatically-assigned classification to manually- assigned classification. Analysis of computer misclassifications suggested whether the ontology or classification algorithm should be modified in order to improve classification accuracy. After repeated trials and analyses to improve the algorithms and ontologies, MapSearch was evaluated with a set of 55 previously unseen maps in a test set. Automated classification of the test set of maps was compared to the manual classification, with the assumption that the manual process provides the most accurate classification obtainable. Results showed an accuracy, or a correspondence between manual and automated classification, of 75% for region, 69% for time, and 84% for theme. The dissertation contributes: (1) a protocol to harvest metadata from maps in published articles that could be adapted to aggregate other sorts of journal article components such as charts, diagrams, cartoons or photographs, (2) a method for ontology-supported metadata processing to allow for improved result relevance that could be applied to other sorts of data, (3) algorithms to classify maps into region, time and theme facets that could be adapted to classify other document types, and (4) a proof-of-concept MapSearch system that could be expanded with heterogeneous map types. iii Acknowledgement I am thankful for my chief advisor Michael Lesk’s wisdom and kindness that have guided me throughout this research. His exceptional knowledge of data mining and word sense disambiguation allowed me to delve deeply into the heart of the problem. His translation of my algorithms into the web programming language Perl breathed life into MapSearch and created the prototype. I will miss our work on this project, but look forward to research together in future. Thanks are due to committee members in my department Dan O’Connor and Marie Radford for advice on the rigors of form and style, and for being available when I have been in need. Thanks also to Michael Goodchild from the University of California at Santa Barbara for sharing some of his latest research and offering advice on this dissertation’s geographic aspects. I thank my colleagues in the graduate school Heather Moulaison and Sarah Legins for the careful attention they gave to indexing the map sample, which was used for the evaluation of the classification algorithms. Nancy Kandoian of the Map Division of the New York Public Library, and Cathay Crosby of the Internet Public Library, were most helpful in providing data on how people ask for maps. This research was supported in part by the Rutgers Academic Excellence Fund. iv Table of Contents 1 Introduction ……………………………………………………………. 1 1.1 Problem statement 1.2 Significance 1.3 Research questions and small scale answer to the problem 1.4 Contributions and long term goals 2 Background for the research ………………………………………… 5 2.1 What is a map? 2.2 How can one find a digital map? 2.3 Why finding a digital map can be a problem: spatial metadata 2.4 Theme maps and those who use them 2.5 MapSearch in response to present problems 2.6 Approach 3 Related research ……………………………………………………… 12 3.1 Information retrieval 3.2 Data mining and automated classification 3.3 Lexical search using an ontology and weighted indexing 3.4 Determining relevance: semantic similarity 3.5 Indexing attributes for maps 3.6 Systems to retrieve maps 4 User-defined facets …………………………………………………… 22 4.1 How do people ask for maps? 4.1.1 New York Public Library Map Division Method Data Findings 4.1.2 Actual questions from the Internet Public Library Method Data Findings 4.2 Interpretation of both studies 5 Overview of automatic cataloging and classification of maps …..... 28 5.1 System architecture 5.2 Harvesting metadata 5.2.1 Collecting the sample 5.2.2 Creating and storing 5.2.3 Data cleaning 5.2.4 Weighting and indexing 5.2.5 Information extraction 5.3 Metadata classification 5.3.1 Classification rules v 5.3.2 Classification categories 5.3.3 An ontology for each facet 5.4 Query processing: assessment and improvement 5.4.1 Algorithms 5.4.2 How to use an ontology to classify metadata or query 5.4.3 How to rank results 5.4.4 How good is a given classification? Scoring 5.5 Improving classification algorithms 6 Classification architecture …………………………………………… 42 6.1 Method 6.2 Region 6.2.1 Classes and their application 6.2.2 Geo-ontology 6.2.3 Algorithm for region 6.2.4 Findings (research question 1) 6.3 Time 6.3.1 Classes and their application 6.3.2 Chrono-ontology 6.3.3 Algorithm for time 6.3.4 Findings (research question 2) 6.4 Theme 6.4.1 Classes and their application 6.4.2 Concept ontology 6.4.3 Algorithm for theme 6.4.4 Findings (research question 3) 6.5 Summary of results 6.6 Discussion 7 Interface design ……………………………………………………… 70 7.1 Method 7.2 Keyword search 7.3 Faceted category search 7.4 Results display 7.5 Messages to the user 7.6 Further testing 8 Evaluation ……………………………………………………………. 77 8.1 Potential approaches for evaluation 8.2 How well do the algorithms work 8.2.1 Method for manual classification 8.2.2 Assembling the results of manual classifications 8.2.3 Evaluation of automatic classification 8.3 Proposed experiment for data mining 8.4 Proposed experiment for classification categories 8.5 Proposed experiment for user satisfaction vi 8.6 Cost effectiveness 8.7 Importance of testing 9 Limitations ………………………..…………………………………… 86 9.1 Collection, current and expanded 9.2 Categories 9.3 Classifying maps 9.4 Ontologies 9.5 Interface design 9.6 MapSearch is a question-answering tool—but what are the questions? 10 Contribution: what is new and why it matters ………………………. 90 10.1 Hybrid method for classification 10.2 Subject ontology 10.3 Algorithms 10.4 MapSearch prototype expandable 10.5 Toward a simpler spatial metadata scheme 11 Future Research ………………………………………………………… 93 11.1 Expanding the search protocol in order to expand the collection 11.2 Adding features for users 11.3 MapSearch as a model vii List of tables Table 1 Findings for region ………………………………………………….. 50 Table 2 Findings for time ……………………………………………………. 58 Table 3 Findings for theme …………………………………………………... 64 Table 4 Comparison of automatic classifications in the training set by facet .. 66 Table 5 Time and theme assignments for the training set ………………….... 68 Table 6 Comparison of automatic classifications in the testing set by facet .... 80 List of illustrations Figure 1 The architecture of the proposed map retrieval system ……………… 28 Figure 2a Map as it appears in the article; Center and Right: Demonstration of the text layer extraction program ………………………………… 30 Figure 2b Words from the article that can be taken using the data mining algorithm outlined in Appendix B…………………………………… 31 Figure 3 Example of weighting: Sample output of classification by theme ….. 38 Figure 4 Library of Congress Classification excerpt ………..………………... 61 Figure 5 MapSearch interface ………………………………………………… 75 Figure 6 MapSearch results display …………………………………………… 76 viii Concluding Material Appendix A. Glossary of terms in the dissertation ……………………………. 97 Appendix B. Instructions to mine data from around maps…………………….. 100 Appendix C. Ontology building with Geonames .…………………………….. 102 Appendix D. Materials for evaluation ………………………………………… 103 D.1 Instructions to indexers D.2 Regions expanded D.3 Regions mapped D.4 Themes expanded References ……………………………………………………………………. 117 Curriculum vitae (abbreviated) …………………………………………….... 126 ix 1 1 Introduction 1.1 Problem statement How do we find maps in non-geographic publications? In digital publications, often we do not—maps are invisible because they are not well-indexed.