THE UNIVERSITY OF NEW SOUTH WALES

Development, Evaluation and Application of a Geographic Information Retrieval System

You-Heng Hu

School of Surveying and Spatial Information Systems, University of New South Wales, Sydney 2052, Australia

A thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy (Surveying and Spatial Information Systems) July 2007

Abstract

Geographic Information Retrieval (GIR) systems provide users with functionalities of representation, storage, organisation of and access to various types of electronic information resources based on their textual and geographic context. In contrast with conventional Information Retrieval (IR) systems, GIR places the main emphasis on geographic information, which can be defined as information that references some part of the Earth's surface. This thesis explores various aspects of the development, evaluation and application of GIR systems, including: 1) the extraction and grounding of geographic information entities from documents written in natural languages; 2) the geo-textual ranking of GIR retrieval results; 3) the GIR information retrieval model; and 4) the publishing, browsing and navigation of geographic information on the World Wide Web (WWW).

The first study focuses upon the extraction and grounding of geographic information entities. This research investigates the nature of ambiguities in geographic information entities as well as the heuristics of disambiguation. My approach for this study consists of a hierarchical structure-based geographic relationship model that is used to describe connections between geographic information entities, and a supervised machine learning algorithm that is used to resolve ambiguities. The proposed approach has been evaluated on a toponym disambiguation task using a large collection of news articles.

The second study details the development and validation of a GIR ranking mechanism. The proposed approach takes advantage of the power of the Genetic Programming (GP) paradigm with the aim of finding an optimal functional form that integrates both textual and geographic similarities between retrieved documents and a given user query. My approach has been validated by applying it to a large collection of geographic metadata documents. The performance evaluation shows that significant improvement over existing ranking methods can be achieved with appropriate GP implementations.

The third study addresses the problem of modelling the GIR retrieval process taking into account both thematic and geographic criteria. Based on the Spreading Activation Network (SAN), the proposed model consists of a two-layer associative network that is used to construct a structured search space; a constrained spreading activation algorithm that is used to retrieve and to rank relevant documents; and a geographic knowledge base that is used to provide the necessary domain knowledge for the network. The retrieval performance of my model has been evaluated using the GeoCLEF 2006 tasks. The unswTitleBase and unswNarrBase runs achieved 32.75% and 39.64% improvements respectively over the mean average precision of all submitted runs in the monolingual English task. The unswTitleBase run was ranked fourth among the 73 entries in the title and description fields only run.

The fourth study discusses the publishing, browsing and navigation of geographic information on the World Wide Web. Key challenges in designing and implementing a GIR user interface, through which online content can be systematically organised based on its geospatial characteristics and can be efficiently accessed and interrelated, are addressed. The components of the proposed system are described in detail, and the effectiveness and usefulness of the system are shown by applying it to a large collection of geo-tagged web pages.

In summary, this thesis has developed a GIR system to support the representation, retrieval and presentation of geographic information. The system is designed to exploit both the textual and geographic context of information, providing better performance than keyword-based searching. Experimental results and case studies have shown the potential of the approaches in various GIR application domains.

Acknowledgements

First of all, I wish to acknowledge my supervisor, Dr. Linlin Ge, for making this research possible. I am sincerely grateful for his trust, patience and the unwavering guidance, encouragement and support he has given me throughout my studies. Special thanks also go to my co-supervisor and Head of School, Professor Chris Rizos for his encouragement and invaluable advice.

I would also like to thank all the other members of the School of Surveying and Spatial Information Systems, especially Associate Professor Andrew Dempster and Dr. Jinling Wang for their enthusiasm and guidance throughout my postgraduate studies. Thank you also to Dr. Joel Barnes, Michael Chang, Leon Daras, Weidong (John) Ding, Brian Donnelly, Dr. Bruce Harvey, Dr. Balakrishnan Kannan, Sharon Kazemi, Wing Yip Lau, Peter Leech, Dr. Binghao Li, Xiaojing Li, Dr. Yong Li, Dr. Samsung Lim, Peter Mumford, Tajul Musa, Alex NG, Maria Ponce, Dr. Craig Roberts, Asghar Tabatabaei, Jiangou (Jack) Wang, Thomas Yan, Eugene Yu and Dr. Yincai Zhou for all the friendship, support and encouragement.

I am extremely grateful to the Faculty of Engineering for awarding me the Faculty of Engineering Research Scholarship and allowing me to pursue my Ph.D. studies at UNSW.

My deep gratitude is due to my mother and my parents-in-law for their support during the years of this study. Finally, I would like to express my deepest gratitude to my dear wife, Lu Gu, for her steady love and dedication. I dedicate this thesis to her and to our daughters, Julia and Emma.

List of Publications

Hu, Y-H & Lim, S (2005), 'A Novel Approach for Publishing and Discovery of Location-based Services', in Proceedings of the Spatial Sciences Conference, Melbourne, Australia, 12-16 September, 2005. pp. 272-281.

Hu, Y-H, Lim, S & Rizos, C (2005), 'Delivering GNSS Data Over the Internet using RSS for Post-Processing Applications', in Proceedings of the International Symposium on GPS/GNSS, Hong Kong, China, 8-10 December, 2005. http://www.gmat.unsw.edu.au/snap/publications/hu_etal2005b.pdf.

Hu, Y-H & Ge, L (2006), 'UNSW at GeoCLEF 2006', in Proceedings of the The Cross-Language Evaluation Forum 2006 Working Notes, Alicante, Spain, 20-22 September, 2006. http://www.gmat.unsw.edu.au/snap/publications/hu_etal2006b.pdf.

Hu, Y-H & Ge, L (2007), 'A Supervised Machine Learning Approach to Toponym Disambiguation', in A Scharl & K Tochtermann (eds), The Geospatial Web - How Geobrowsers, and the Web 2.0 are Shaping the Network Society, Springer, pp. 117-130.

Hu, Y-H & Ge, L (2007), 'A Spreading Activation Network Model for Gographic Information Retrieval', in Proceedings of the Spatial Sciences Conference, Hobart, Australia, 14-18 May, 2007. pp. 1049-1058.

Hu, Y-H & Ge, L (2007), 'The University of New South Wales at GeoCLEF 2006', Lecture Notes for Computer Science. (Accepted on 21 February 2007).

Hu, Y-H & Ge, L (2007), 'The Design and Development of a Geo-referencing and Browsing System for Geospatial Web Content ', Journal of Spatial Science. (Accepted on 27 February 2007).  Hu, Y-H & Ge, L (2007), 'GeoTagMapper: An Online Map-based Geographic Information Retrieval System for Geo-Tagged Web Content', in MP Peterson (ed.), International Perspectives on Maps and the Internet. (Accepted on 16 March 2007).

Hu, Y-H & Ge, L (2007), 'Learning Ranking Functions for Geographic Information Retrieval Using Genetic Programming', Journal of Research and Practice in Information Technology. (Accepted on 20 December 2007).

Table of Contents

Chapter 1 INTRODUCTION
1.1 Problem statement
1.1.1 A generic framework of GIR systems
1.1.2 Research problems
1.2 Purpose of the research
1.3 Significant Contributions
1.4 Thesis organisation
Chapter 2 BACKGROUND
2.1 Geographic information entity extraction and grounding
2.1.1 Definition and ambiguities of geographic information entities
2.1.2 Geographic information entity extraction
2.1.3 Geographic information entity grounding
2.2 Geo-textual indexing, searching and ranking
2.3 Geo-enabled visualisation and interaction
2.4 Geographic knowledge bases
2.5 Performance evaluation
2.6 Conclusions
Chapter 3 A SUPERVISED MACHINE LEARNING APPROACH TO TOPONYM DISAMBIGUATION
3.1 Introduction
3.2 A supervised machine learning approach to toponym disambiguation
3.2.1 Geographic relationship model
3.2.2 Geographic information entity recognition
3.2.3 Supervised machine learning
3.2.4 Geographic knowledge base
3.3 Experiments
3.3.1 Data
3.3.2 Evaluation
3.3.3 Discussion
3.4 Conclusions
Chapter 4 LEARNING RANKING FUNCTIONS FOR GEOGRAPHIC INFORMATION RETRIEVAL USING GENETIC PROGRAMMING
4.1 Introduction
4.2 Methodology
4.2.1 Genetic Programming
4.2.2 Terminals and functions
4.2.3 Fitness functions
4.2.4 Genetic operators
4.3 Experiments
4.3.1 Data
4.3.2 Baselines for retrieval performance evaluation
4.3.3 Experiment 1: GP evolution control parameters
4.3.4 Experiment 2: ranking function learning
4.3.5 Discussion
4.4 Conclusions
Chapter 5 A SPREADING ACTIVATION NETWORK MODEL FOR GEOGRAPHIC INFORMATION RETRIEVAL
5.1 Introduction
5.2 Spreading Activation Network
5.3 Methodology
5.3.1 An overview of the Spreading Activation Network for GIR
5.3.2 Spreading of activation
5.3.3 Network constraints
5.3.4 Retrieval and ranking
5.3.5 Construction of a G-SAN network

5.4 Conclusions
Chapter 6 GEOTAGMAPPER: AN ONLINE MAP-BASED GEOGRAPHIC INFORMATION RETRIEVAL SYSTEM FOR GEO-TAGGED WEB CONTENT
6.1 Introduction
6.2 Background
6.2.1 Tagging and geo-tagging
6.2.2 GeoRSS
6.3 GeoTagMapper
6.3.1 System overview
6.3.2 Geo-parsing
6.3.3 Textual-geo indexing
6.3.4 Retrieval model
6.3.5 Visualisation and interaction
6.4 Case study
6.5 Conclusions
Chapter 7 THE UNIVERSITY OF NEW SOUTH WALES AT GEOCLEF 2006
7.1 Introduction
7.2 GeoCLEF 2006
7.3 Approaches for GeoCLEF 2006
7.3.1 System overview
7.3.2 Geographic knowledge base
7.3.3 Textual-geo indexing
7.3.4 Document retrieval
7.3.5 Document ranking
7.4 Experiments
7.5 Results
7.6 Conclusions
Chapter 8 APPLICATIONS
8.1 GNSS-RSS
8.2 Local news reader
8.3 Web sites finder
8.4 Property list map
8.5 Conclusions
Chapter 9 CONCLUSION AND FUTURE WORK
9.1 Thesis summary
9.2 Limitations
9.3 Future work
9.3.1 Integration with Location-based Service
9.3.2 Extending geographic knowledge base
9.3.3 Multimedia GIR
9.3.4 Advanced thematic searching
9.3.5 Fuzzy geographic reasoning
9.3.6 High performance computation
Appendix
Reference

List of Abbreviations

ABC  Australian Broadcasting Corporation
ADL  Alexandria Digital Library
AJAX  Asynchronous JavaScript and XML
AR  Augmented Reality
ASDD  Australian Spatial Data Directory
CLEF  Cross Language Evaluation Forum
CPU  Central Processing Unit
CVE  Collaborative Virtual Environments
DLESE  Digital Library for Earth System Education
GA  Genetic Algorithm
GDL  Geo-referenced Digital Library
GeoRSS  Geographically Encoded Objects for RSS feeds
GIR  Geographic Information Retrieval
GIS  Geographical Information Systems
GISci  Geographic Information Science
GKB  Geographic Knowledge Bases
GML  Geography Markup Language
G-NAF  Geocoded National Address File
GNIS  Geographic Names Information System
GNSS  Global Navigation Satellite System
GP  Genetic Programming
GPS  Global Positioning System
GREASE  Geographic Reasoning for Search Engines Geographic Knowledge Base
HTML  HyperText Markup Language
ICBM  Intercontinental Ballistic Missile
IDF  Inverse Document Frequency
IGS  International GNSS Service
IR  Information Retrieval
ITS  Intelligent Transportation Systems
LBS  Location-Based Services
LSA  Latent Semantic Analysis
MAP  Mean Average Precision
MGIR  Multimedia Geographic Information Retrieval
MUC  Message Understanding Conference
NER  Named Entity Recognition
NLP  Natural Language Processing
OASIS  Ontologically-Augmented Spatial Information System
OGC  Open Geospatial Consortium Inc.
RDF  Resource Description Framework
RINEX  The Receiver Independent Exchange Format
RSS  Rich Site Summary, Resource Description Framework (RDF) Site Summary, or Really Simple Syndication
SAN  Spreading Activation Network
TF  Term Frequency

TGN  Thesaurus of Geographic Names
UN/LOCODE  United Nations Code for Trade and Transport Locations
USGS  United States Geological Survey
VSM  Vector Space Model
WGS84  World Geodetic System 1984
WSD  Word Sense Disambiguation
XML  eXtensible Markup Language

List of Figures

Figure 1-1 A generic framework of GIR systems
Figure 2-1 An Australian Broadcasting Corporation (ABC) news on 30th August 2004
Figure 2-2 An example of GeoCLEF 2005 query topics
Figure 2-3 An example GeoCLEF 2005 document
Figure 3-1 The GeoRelationshipWeighting algorithm
Figure 3-2 Class diagram of the geographic knowledge base data model
Figure 3-3 An instance level view of the geographic knowledge base data model
Figure 4-1 An example of the tree representation of GP individuals
Figure 4-2 The crossover operation
Figure 4-3 An example document used in the experiments
Figure 4-4 An example query used in the experiments
Figure 4-5 Precision values at the standard 11 recall levels
Figure 4-6 The tree representation of the ranking function learned using Fitness4
Figure 4-7 The generation number of the best fitness value found for each fitness function
Figure 5-1 The G-SAN spreading activation network for GIR
Figure 6-1 An example of geo-tagged digital photos on flickr.com with geo-tags geo:lat=-37.798993 and geo:lon=144.96049
Figure 6-2 Architecture of the GeoTagMapper system
Figure 6-3 The search interface
Figure 6-4 The result page of the user interface: a) showing all results; b) showing sampled results with a fixed sample rate; c) showing sampled results with varied sample rates; and d) a document shown in a pop-up window
Figure 7-1 An example topic used in GeoCLEF 2006
Figure 7-2 An example GeoCLEF document
Figure 7-3 System architecture of the GIR system used in the GeoCLEF 2006
Figure 8-1 Architecture of SydNET data distribution platform
Figure 8-2 An example of SydNET RSS feeds item
Figure 8-3 An example of SydNET RSS feed
Figure 8-4 Desktop RSS aggregator
Figure 8-5 The Web-based RSS aggregator
Figure 8-6 A screen shot of the Local News Reader
Figure 8-7 An example of ABC regional news feeds
Figure 8-8 A screen shot of the Web Sites Finder
Figure 8-9 Address meta tags of http://pye.dyndns.org/
Figure 8-10 Search results returned from the GeoURL address Server
Figure 8-11 A screen shot of the Property List Map

List of Tables

Table 3-1 Statistics from the geographic knowledge base for Australia toponyms disambiguation
Table 3-2 Parameter configurations used in the experiments
Table 3-3 Toponyms disambiguation accuracy (%) on the ABC local news collection
Table 4-1 Terminals used in the GP learning process
Table 4-2 Results of the best fitness value under four different combinations of pc, po and pr
Table 4-3 Comparison of the results (mean interpolated precision values at the standard 11 recall levels) obtained using four fitness functions with three baselines
Table 6-1 Statistics of data sources used in GeoTagMapper
Table 7-1 Statistics for the geographic knowledge base
Table 7-2 The sources used for the construction of the geographic knowledge base
Table 7-3 Terminals used in the ranking function learning process
Table 7-4 GeoCLEF 2006 monolingual English tasks results
Table 7-5 Precision average values (%) for individual queries

CHAPTER 1 INTRODUCTION

1.1 Problem statement

With the explosive growth of information available in digital format, it becomes increasingly difficult for people to find information they need. Information Retrieval (IR) techniques are efforts to solve this problem through the provision of tools for representation, storage, organisation of, and access to digital documents (Baeza-Yates & Ribeiro-Neto 1999). While the main focus of conventional IR systems is to support users retrieving documents using keyword-based queries, there is an increasing need for a new class of IR systems that takes into account the geographic context of information when indexing and searching. Such systems are known as Geographic Information Retrieval (GIR) systems (Larson 1996).

Bringing together a wide variety of perspectives from disciplines such as information science, geographic science and cognitive science, GIR can be considered an extension of conventional IR, but one that places the main emphasis on geographic information, which can be defined as information that references some part of the physical Earth's surface (Cai 2002). Given the fact that all human activities are associated with a geographic context, many types of digital information can be categorised as geographic information. Media that communities use to distribute geographic information include textual documents, World Wide Web (WWW) pages, digital maps and aerial photographs. The goal of GIR is to retrieve relevant documents from a larger document collection, based on user queries that in general consist of thematic criteria and geographic criteria. Examples of GIR queries are: "Find news stories about bushfires in Sydney" and "Find maps of the nearest Chinese restaurant." Retrieved results are then presented in a geo-enabled way, which most likely is map-based. In GIR, the relevance of a document to a given query is determined not only by thematic similarity measures, but also by geographic associations between them.

1.1.1 A generic framework of GIR systems

A generic framework of GIR systems is shown in Figure 1-1. The whole framework consists of six modules: 1) the document collection management module; 2) the indexing module; 3) the user interface module; 4) the query processing module; 5) the searching module; and 6) the ranking module.

Figure 1-1 A generic framework of GIR systems

The control flow and data flow between GIR modules can be described from two different perspectives. Firstly, from the viewpoint of document management, the processing procedure can be specified as follows:

Collect documents: A document is a basic unit for information storage, organisation and retrieval. During this phase, selected documents are assembled into a document collection.

Index documents: After a document is collected, the indexing module creates index entries for the document. These index entries will be used by the searching module to answer user queries. This phase consists of four operations, one for textual indexing and three for geographic indexing: 1) extract keywords from the document and then add them to the textual index subsystem (e.g. an inverted file system); 2) extract geographic information entities from the document; 3) ground geographic information entities, i.e. associate each geographic information entity with a place, and 4) create geographic index entries and add them to the geographic index subsystem (e.g. a geographic-enabled database).

Compared with conventional IR systems, two additional facilities are required to support geographic indexing operations. The first is a geographic knowledge base that provides important characteristics (e.g. names, geometric properties and relationships to other entities) of geographic entities, which are useful for geographic information entity extraction and grounding. The other is the geographic index subsystem that is used for storing and searching geographic index entries.
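
The indexing flow described above can be sketched as follows. This is a minimal, hypothetical Python illustration only: the tiny in-memory gazetteer, the exact-match grounding and the dictionary-based index structures stand in for the geographic knowledge base and index subsystems of a real GIR system.

```python
import re
from collections import defaultdict

# Illustrative gazetteer: place name -> (latitude, longitude). A real geographic
# knowledge base would also hold feature types, hierarchies and relationships.
GAZETTEER = {"sydney": (-33.87, 151.21), "darwin": (-12.46, 130.84)}

textual_index = defaultdict(set)     # keyword -> document ids (inverted file)
geographic_index = defaultdict(set)  # document id -> grounded (lat, lon) pairs

def index_document(doc_id, text):
    """Create textual and geographic index entries for one document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # 1) textual indexing: add every keyword to the inverted file
    for token in tokens:
        textual_index[token].add(doc_id)
    # 2) and 3) extract geographic information entities and ground them against
    # the gazetteer (here: exact name match, with no disambiguation at all)
    for token in tokens:
        if token in GAZETTEER:
            # 4) add a geographic index entry for the grounded place
            geographic_index[doc_id].add(GAZETTEER[token])

index_document("d1", "Bushfires threaten homes in western Sydney")
print(textual_index["bushfires"], geographic_index["d1"])
```

A production system would replace the dictionaries with an inverted file on disk and a spatial access method, and would disambiguate extracted entities rather than accepting the first gazetteer match.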

On the other hand, from the viewpoint of system users, GIR systems take user queries as a starting point. The query processing procedure can be specified as follows:

1) Submit a query: A user query is submitted to the system through the user interface module.

2) Search the document collection: During this phase, the searching module searches the textual and geographic index entries, and then retrieves all documents that meet both the thematic criteria and the geographic criteria defined in the user query.

3) Rank results: After the searching module retrieves all relevant documents, the ranking module calculates a numeric score for each retrieved document based on the similarity measure between the document and the user query. These scores are then used to rank results.

4) Present results: The ranked results are presented in the user interface module in a geo-enabled way.

5) Refine the query: In this phase, the user browses the result and reads documents that interest him, and may refine his query, in which case the whole procedure goes back to the start.

1.1.2 Research problems

GIR has developed rapidly in recent years. Many research challenges, at different levels, have been identified during efforts to develop high-performance GIR systems. The four problems addressed in this thesis are: 1) geographic information entity extraction and grounding; 2) geo-textual ranking of retrieval results; 3) GIR retrieval process modelling; and 4) the publishing, browsing and navigation of geographic information.

Firstly, geographic information entities are information entities that have geographic attributes. Geographic information entities discovered from a document collectively describe the document's geographic context. Semantic meaning of geographic information entities can be expressed explicitly or implicitly, examples of the former include street addresses, place names and numeric coordinates. Names of famous local people, organisations, and important local events are examples of the latter. Existing Named Entity Recognition (NER) approaches developed in conventional IR systems, such as gazetteer-based text matching, rule-based linguistics analysis and regular expression-based text matching are useful for the task of extraction of geographic information entities (Humphreys et al. 1998; Southall 2003). However, like much other information described using human language, geographic information entities have a high risk of ambiguity. Grounding provides additional facilities of removing uncertainties and associating each geographic information entity with a place in the real world (Li et al. 2002).

Secondly, GIR systems need efficient geo-textual searching and ranking methods to answer user queries. As discussed in the problem statement, user queries in GIR systems consist of keyword-based thematic criteria and, at the same time, geographic criteria. Therefore, searching and ranking in GIR must involve not only textual-based technologies, but also geographic-based ones (van Kreveld et al. 2004).

Thirdly, the development of a single and unified retrieval model that integrates both the textual and geographic context of documents and user queries is one of the fundamental problems in developing GIR systems. An information retrieval model describes how documents relevant to a user query are retrieved. When geographic context is taken into account in retrieval, keyword-based models developed in classic textual IR systems are not sufficient to provide insight into the complexity of geographic information. In particular, a certain mechanism is required to define, discover, and utilise various geographic relationships between documents and user queries (Markowetz et al. 2005).

Fourthly, digital geographic data is used in a wide variety of applications by a variety of user bases, including government agencies, private organisations and individual users. Due to the explosive increase in the volume and complexity of geographic information, which reflects many of the properties we can recognise of the real world, it is necessary to develop sophisticated map-based human-computer interfaces through which information can be systematically organised and efficiently accessed (MacEachren & Kraak 1997).

1.2 Purpose of the research

The increase in the volume, variety and popularity of geographic information has created a strong need for efficient and reliable solutions to support user tasks involved in searching and accessing geographic information by specifying both thematic and geographic requirements. A systematic design and evaluation of geographically enabled information retrieval techniques can provide clues in regard to the construction, performance and effectiveness of all essential components of GIR systems. It can also provide insights into how best to integrate existing textual information retrieval techniques with geographic data analysis and processing techniques to improve retrieval performance and user experiences.

The purpose of this thesis is to develop, implement and evaluate a GIR system that supports GIR user tasks in a flexible, efficient and easy manner. The question to be answered is: how does an information retrieval system that emphasises the geographic context of digital information influence the production, distribution, and consumption of geographic information?

The core of this study consists of six components: 1) the methodology of extraction and grounding of geographic information entities; 2) the methodology of discovering optimal GIR ranking functions; 3) the modelling of GIR retrieval process; 4) the mechanism of publishing, browsing and navigation of geographic information on the World Wide Web; 5) the implementation and evaluation of a prototype GIR system; and 6) the demonstration of real-world GIR applications. The validity of the proposed approaches is confirmed by using quantitative experiments and case studies.

Information derived from this study will contribute to the growing body of knowledge concerning information retrieval in the fields of geographic information science and geoinformatic research, and should provide useful clues to those involved in modelling, developing and evaluating GIR systems. This study will also provide useful tools and data resources for further development of telegeoinformatics applications.

1.3 Significant Contributions

This study presents the development, evaluation and applications of a GIR system that facilitates advances in both geographic science and information science, with a focus on analysing geographic properties and geographic relationships. This effort will provide geoinformatic researchers and developers with an overview and an in-depth analysis of how such an alternative approach can be developed and used to support various GIR user tasks, such as indexing, searching, ranking, browsing and publishing of geographic information.

The implementation and application of a GIR system could also have the potential to provide new functional capabilities and a broader range of applications for GIR technologies. That is, using a GIR system as a geographic information processing tool may result in changes in the behaviours and practices of both information providers and consumers on how information is organised, published, represented and accessed. Therefore, the research will contribute to improving the quality of geographic-enabled information services.

One of the key elements of GIR systems is the geographic knowledge base, which consists of information about geographic entities and geographic relationships. Geographic knowledge bases are essential to GIR systems as they provide the necessary background knowledge to support geographic information processing. However, the construction of a geographic knowledge base is a time-consuming and nontrivial task. The lack of comprehensive geographic knowledge bases is a major bottleneck in developing GIR systems (Martins, Silva & Chaves 2005; Souza et al. 2005). Therefore, one contribution of this study is the geographic knowledge base constructed during the research, which can be used in the future development of GIR systems.

It is difficult to make comparisons between various GIR implementations without a standard test-bed. However, one of the current major issues in GIR is the lack of a data set that can be used as a gold standard for performance evaluation. This study will benefit the research community by providing a set of documents, query topics and relevance judgments that have been collected, annotated and justified in the experiments for retrieval performance evaluation and comparison.

In summary, by developing and evaluating a GIR system, it is anticipated that the topics discussed in this thesis will provide insight into the fundamental ways we model and utilise geographic information.

1.4 Thesis organisation

The remainder of this thesis is organised as follows:

Chapter 2 reviews important prior research in the GIR research areas.

Chapter 3 describes the proposed methodology addressing the problem of geographic information entity extraction and grounding using supervised machine learning. The results from experiments that evaluate the method on a large collection of local news articles are then presented.

Chapter 4 presents the new approach for learning GIR ranking functions using Genetic Programming. The performance of the proposed method is evaluated on a large collection of geographic metadata documents.

Chapter 5 models the GIR retrieval process using approaches based on the Spreading Activation Network model.

Chapter 6 discusses the publishing, browsing and navigation of geographic information on the World Wide Web. Key challenges in designing and implementing a geographic-enabled user interactive environment are highlighted.

Chapter 7 evaluates the proposed approaches by detailing the participation in the GeoCLEF 2006 tasks.

Chapter 8 demonstrates GIR applications using several case studies in different domains.

Chapter 9 concludes the thesis. The contributions and limitations of this research are discussed, and directions of future research are suggested.

CHAPTER 2 BACKGROUND

This chapter reviews literature in the GIR research areas, including: 1) geographic information entity extraction and grounding; 2) geo-textual indexing, searching and ranking; 3) geo-enabled user interfaces and visualisation; 4) construction of geographic knowledge bases; and 5) performance evaluation of GIR systems.

2.1 Geographic information entity extraction and grounding

2.1.1 Definition and ambiguities of geographic information entities

In GIR, geographic information entities can be defined as information entities that have geographic senses. Geographic information entities discovered from a document collectively describe the document's geographic context. Semantic meaning of geographic information entities can be expressed explicitly or implicitly, examples of the former include street addresses, place names and numeric coordinates. Names of famous local people, organisations, and important local events are examples of the latter.

In open domain unstructured and semi-structured documents written using human language, such as free text and HyperText Markup Language (HTML) web pages, many types of geographic information entities, in particular place names, have a high risk of ambiguity. Two types of ambiguities have been identified: geo/non-geo and geo/geo (Amitay et al. 2004).

Geo/non-geo ambiguity happens when a geographic information entity corresponds to other non-geographic references. Such ambiguity can be further subdivided into three types:

Ambiguity with other proper names: many place names can be found in other types of proper names, such as people's and organisation names. In the case of people's names, in English some surnames are taken from place names, such as York and Lancashire (Jobling 2001). On the other hand, many places in the world are named after people. Examples in Australia include Darwin, the capital city of the Northern Territory, which was named after Charles Darwin, the British naturalist. Country and city names might also be included in organisations' names, to indicate their origin or headquarters. Examples include "Bank of Queensland" and "Sydney Dance Company".

Used as metonymies: place names can be used to refer to another related concept, such as a governmental or community body. As the following example (see Figure 2-1), extracted from a news article provided by the Australian Broadcasting Corporation (ABC) on 30th August 2004, indicates, the country name "Australia" in this context refers to its sports team. Place names can also be included in attributive phrases, such as "the chef from China" and "the Prime Minister of Australia". In these examples place names are used as an attribute of their relational nouns and do not necessarily refer to any geographic location.

Figure 2-1 An Australian Broadcasting Corporation (ABC) news on 30th August 2004

Ambiguity with common words: Many place names are named using common words. Examples in Australia include Sunshine and Waterfall.

Geo/geo ambiguity reflects the fact that several distinct places may share the same name. As a very simple example, the mention "Sydney" in "Sydney is a great holiday destination for families" could refer to many possible places around the world, such as Sydney, New South Wales, Australia; Sydney, Nova Scotia, Canada; or Sydney, Florida, United States. Real instances of such ambiguity can be easily found all around the world. Statistical data from the Getty Thesaurus of Geographic Names (TGN), a widely used geographic gazetteer developed by the Getty Information Institute, has shown that the percentage of place names that are used by more than one place ranges from 16.6% for Europe to 57.1% for North and Central America (Smith & Crane 2001).

Disambiguation of geographic information entities is one of the most important challenges in many geoinformatic applications, including geographic information systems, geographic information retrieval and geographic digital libraries (Garbin & Mani 2005; Larson 1996; Zong et al. 2005). Automatically discovering and removing ambiguities from geographic information entities involves two steps: extraction and grounding.

2.1.2 Geographic information entity extraction

The main task of the extraction procedure is to identify all geographic information entities and remove geo/non-geo ambiguities from a given document. This procedure can be considered as an extension of the Named Entity Recognition (NER) tasks, a research domain that aims to identify and classify instances of different types of named entities from text. NER is a very important component of many natural language processing (NLP) applications, including information extraction and information retrieval. Seven types of named entity have been defined for NER tasks in the Message Understanding Conference (MUC) standard: person, organisation, location, date, time, money and percent.

The key aspect that makes geographic information entity extraction different from NER is that it deals not only with geographic information entities, such as place names, geographic coordinates, postal codes and telephone numbers, which are found in the document content, but also with geographic related references that are found in other sources outside the document itself; instances of these sources include metadata of documents, host locations of Web pages and the geographic distribution of hyperlinks. Geographic information entities which fall into the first category are called internal geographic information entities, and those which fall into the second category are called external geographic information entities.

Other types of named entities can also be considered as external geographic information entities when geographic attributes of these entities can be identified; examples include people's names and company names.

Different types of geographic information entities need different information extraction strategies. Existing techniques that are used to extract internal geographic information entities can be categorised into three groups: those that use gazetteer-based text matching, grammar rule-based linguistics analysis and regular expression-based text matching (Humphreys et al. 1998; Southall 2003); those that use machine learning techniques to learn statistical models from large amounts of training data (Yangarber & Grishman 1998); and those that combine the two previous approaches, such as Mikheev, Grover & Moens (1998). For external geographic information entities, in particular Web-related sources, the WHOIS database that provides location information of Web hosts is widely used (Buyukkokten et al. 1999; Junyan, Luis & Narayanan 2000).
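
As an illustration of the first group of techniques, the following hedged Python sketch combines gazetteer-based and regular expression-based matching to pull candidate internal geographic information entities out of free text. The gazetteer contents and the coordinate pattern are assumptions made for the example, not part of any system cited above.

```python
import re

# Illustrative gazetteer of known place names (lower-cased).
GAZETTEER = {"sydney", "darwin", "new south wales", "queensland"}

# Regular expression for decimal latitude/longitude pairs such as "-33.87, 151.21".
COORD_PATTERN = re.compile(r"(-?\d{1,2}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)")

def extract_candidates(text):
    """Return candidate internal geographic information entities found in text."""
    lowered = text.lower()
    candidates = []
    # gazetteer-based text matching on word boundaries
    for name in GAZETTEER:
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            candidates.append(("place_name", name))
    # regular expression-based matching of numeric coordinates
    for lat, lon in COORD_PATTERN.findall(text):
        candidates.append(("coordinate", (float(lat), float(lon))))
    return candidates

print(extract_candidates("Bushfires near Sydney, New South Wales (-33.87, 151.21)"))
```

The candidates returned here are still ambiguous; the grounding step described in the next section is responsible for resolving them to actual places.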

2.1.3 Geographic information entity grounding

After geographic information entities are extracted, it is necessary to ground them. Grounding is a procedure of removing geo/geo ambiguities and associating each geographic information entity with a place in the real world (Li et al. 2002).

Geographic information entity grounding has its root in the problem of word sense disambiguation (WSD) in the domain of computational linguistics. WSD takes free text as input, and assigns each word its most likely sense. For a given word that has several distinct senses, the WSD procedure typically involves two steps: 1) finding all possible senses of the word, and 2) selecting a single sense for the word (Ide & Veronis 1998). Geographic information entity grounding can be seen as an extension of WSD into the geographic domain, in which the words to be disambiguated and their senses are all geographically related.

Various approaches have been proposed for the grounding problem. Generally, these approaches can be grouped into three categories: machine learning-based statistical methods, rule-based linguistics analysis and geographic heuristic approaches.

The machine learning-based statistical methods, such as Garbin & Mani (2005) and Smith & Mann (2003), apply statistical analysis and machine learning algorithms to a document collection in which geographic information entities are previously tagged and disambiguated in order to learn statistical models and classifiers that can be used to disambiguate geographic information entities in unseen documents.

The rule-based linguistics analysis approaches disambiguate geographic information entities based on their local context using linguistic pattern-matching and co-occurrence analysis (Rauch, Bukatin & Baker 2003; Zong et al. 2005). Examples of some widely used heuristics in these methods include "a place name is qualified by its following place names" and "place names in one document share the same geographic context". By applying these rules, the true reference of the place name "Sydney" in, for instance, "Many people think Sydney, New South Wales is the capital of Australia" can be easily found. Rule-based linguistics methods are simple and easy to implement.

The geographic heuristic approaches (Smith & Crane 2001; Woodruff & Plaunt 1994) disambiguate geographic information entities by performing geographic computations based on the geographic nature of each possible candidate place. Possible geographic features that can be used include geographic distance and area size. Rules such as assigning higher scores to candidates that are geographically closer to other places that have no ambiguity, or to candidates that cover a larger area, are used as heuristic guidance to find the right candidate. A geographic knowledge base is essential for these methods as it provides the necessary underlying geographic data.
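
The following Python sketch illustrates how such geographic heuristics might be combined into a single candidate score; the weights, the use of plain coordinate distance and the sample area values are assumptions for the example only.

```python
import math

def score_candidate(candidate, anchors, w_dist=1.0, w_area=0.2):
    """Score one candidate place for an ambiguous name.

    candidate: dict with a (lat, lon) 'centroid' and an 'area' in square km.
    anchors: centroids of unambiguous places mentioned in the same document.
    Candidates closer to the anchors, and covering larger areas, score higher.
    """
    nearest = min((math.dist(candidate["centroid"], a) for a in anchors), default=0.0)
    return -w_dist * nearest + w_area * math.log1p(candidate["area"])

candidates = [
    {"name": "Sydney, NSW, Australia", "centroid": (-33.87, 151.21), "area": 12368.0},
    {"name": "Sydney, NS, Canada", "centroid": (46.14, -60.18), "area": 2471.0},
]
anchors = [(-35.28, 149.13)]  # e.g. an unambiguous mention of Canberra
best = max(candidates, key=lambda c: score_candidate(c, anchors))
print(best["name"])
```

In practice the distance would be computed on the ellipsoid (for example a great-circle distance) and the weights would be tuned against annotated data drawn from a geographic knowledge base.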

2.2 Geo-textual indexing, searching and ranking

GIR systems need efficient geo-textual indexing, searching and ranking methods to answer user queries. As discussed previously, user queries in GIR systems consist of keyword-based thematic criteria and, at the same time, geographic criteria. Therefore, indexing, searching and ranking in GIR must involve not only textual-based technologies, but also geographic-based ones.

A geo-textual index scheme can be considered as a combination of two independent index schemes: a textual index scheme and a geographic index scheme. The construction of a geo-textual index scheme for a document collection is relatively straightforward: textual indexing structures, such as inverted files (Araújo, Navarro & Ziviani 1997) and suffix arrays (Manber & Myers 1993), are used to create textual index entries for all keywords in the document collection; and geographic indexing methods, which are called spatial access methods in the literature, such as R* trees (Norbert et al. 1990) and quadtrees (Dyer, Rosenfeld & Samet 1980), are used to create geographic index entries for all geographic information entities discovered from the document collection. The textual index scheme and the geographic index scheme are usually linked together using document identifiers.
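
A toy Python sketch of this linking is given below, assuming a dictionary-based inverted file and a flat list of bounding boxes in place of a real spatial access method such as an R* tree; the names and data are illustrative only.

```python
from collections import defaultdict

inverted_file = defaultdict(set)   # keyword -> doc ids (textual index scheme)
footprints = {}                    # doc id -> bounding box (geographic index scheme)

def index(doc_id, keywords, bbox):
    """Add one document to both index schemes, linked by its identifier."""
    for kw in keywords:
        inverted_file[kw].add(doc_id)
    footprints[doc_id] = bbox      # (min_lat, min_lon, max_lat, max_lon)

def overlaps(a, b):
    """True if two bounding boxes intersect (a stand-in for a spatial index query)."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def search(keywords, query_bbox):
    """Boolean AND of the textual and geographic criteria, joined on doc ids."""
    textual_hits = set.intersection(*(inverted_file[kw] for kw in keywords))
    return {d for d in textual_hits if overlaps(footprints[d], query_bbox)}

index("d1", ["bushfire"], (-34.1, 150.5, -33.5, 151.4))   # around Sydney
index("d2", ["bushfire"], (-12.8, 130.5, -12.2, 131.1))   # around Darwin
print(search(["bushfire"], (-34.5, 150.0, -33.0, 152.0)))
```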

Textual index schemes allow full text search, and geographic index schemes support geographic searching. In the meantime, the separation of textual and geographic index schemes enables efficient processing of queries that only contain textual or geographic parts, which makes load balancing less complex (Andrade & Silva 2006). A deep analysis and comparison of the total computation cost (e.g. storage costs and computational time) of different implementations of geo-textual index schemes can be found in Vaid et al. (2005).

Searching in GIR can be described as a combination of two models: the thematic searching model and the geographic searching model. The thematic searching model describes relationships between documents and user queries using keywords and search terms.

The Vector Space Model (VSM) (Salton 1971) is one of the most popular models that can be applied in GIR. In VSM, both documents and user queries are represented as n-dimensional vectors, where n is the total number of index terms in the system. Each component of a VSM vector is a weight value associated with an index term and a document (or a query). The definitions of a document d_j and a query q are given as the following equations:

\vec{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{n,j})    (Equation 2-1)

\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{n,q})    (Equation 2-2)

The similarity between d_j and q is calculated as the cosine of the angle between the two vectors:

sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \times |\vec{q}|} = \frac{\sum_{i=1}^{n} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}} \times \sqrt{\sum_{i=1}^{n} w_{i,q}^{2}}}    (Equation 2-3)
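
A small Python sketch of Equations 2-1 to 2-3 is shown below, using raw term frequencies as the vector weights; the vocabulary and the weighting choice are illustrative assumptions.

```python
import math
from collections import Counter

def tf_vector(tokens, vocabulary):
    """Build an n-dimensional term-frequency vector (one weight per index term)."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

def cosine_similarity(d, q):
    """Equation 2-3: cosine of the angle between a document and a query vector."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return dot / norm if norm else 0.0

vocabulary = ["bushfire", "sydney", "flood", "darwin"]
d_j = tf_vector("bushfire near sydney bushfire warning".split(), vocabulary)
q = tf_vector("sydney bushfire".split(), vocabulary)
print(round(cosine_similarity(d_j, q), 3))
```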

On the other hand, the geographic searching model has its roots in Geographic Information Science (GISci), in which information is modelled as a hierarchy of layers that are all in registration, where each layer represents one specific theme. At each layer, geographic features of information are represented as geometric objects (e.g. points, lines and polygons) in a Euclidean space, and other features are represented as attributes of these geometric objects (Tomlin 1990). User queries in the geographic model are expressed using structured query languages, in which query criteria (geographic and non-geographic) are well specified and are linked using predefined predicates. Geographic criteria of queries are processed based on geometric computation, and non-geographic criteria are processed by matching with attribute values (Burrough & McDonnell 1998).

Using the above two models, each document (or user query) is represented as a combination of two complementary and non-redundant subspaces: a thematic subspace that reflects the textual context of the document (or the query), and a geographic subspace that reflects the geographic context of the document (or the query). During a search, the thematic searching model is used in the thematic subspace and the geographic searching model is used in the geographic subspace. Search results from the two subspaces can be merged using a Boolean AND operator.

Similarly, a geo-textual ranking method can be considered a combination of a textual ranking method and a geographic ranking method. Textual ranking scores can be calculated using inverse document frequency (IDF) and term frequency (TF) measurements for query terms (Gerard 1989). In VSM the most common approach to relevance ranking is to give each document a score based on the sum of the weights of terms common to the document and query, where the TF-IDF measures are used to calculate term weights. Geographic ranking scores can be calculated based on the geographic distance, the size of the overlap area, and the geographic relationship between documents and user queries (Martins, Silva & Andrade 2005). Different from the Boolean AND combination used in geo-textual searching, a geo-textual ranking method defines a function that takes the textual ranking score and the geographic ranking score as input, and then calculates a numerical value that can be used to measure the degree of relevance of a document to a user query (Cai 2002). Both linear functions (Göbel & Klein 2002) and non-linear functions (van Kreveld et al. 2004) can be used to combine textual and geographic ranking scores.
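
A hedged Python sketch of such a linear combination follows, using TF-IDF for the textual score and a simple distance decay for the geographic score; the weighting parameter alpha, the decay function and the sample document frequencies are assumptions for the example.

```python
import math
from collections import Counter

def tf_idf_score(doc_tokens, query_tokens, doc_freq, n_docs):
    """Textual ranking score: sum of TF * IDF over terms shared with the query."""
    tf = Counter(doc_tokens)
    score = 0.0
    for term in set(query_tokens):
        if tf[term] and doc_freq.get(term):
            score += tf[term] * math.log(n_docs / doc_freq[term])
    return score

def geo_score(doc_point, query_point):
    """Geographic ranking score: decays with distance between footprints (toy)."""
    return 1.0 / (1.0 + math.dist(doc_point, query_point))

def combined_score(doc_tokens, doc_point, query_tokens, query_point,
                   doc_freq, n_docs, alpha=0.6):
    """Linear geo-textual combination: alpha * textual + (1 - alpha) * geographic."""
    return (alpha * tf_idf_score(doc_tokens, query_tokens, doc_freq, n_docs)
            + (1 - alpha) * geo_score(doc_point, query_point))

doc_freq = {"bushfire": 50, "sydney": 200}   # assumed document frequencies
print(combined_score(["bushfire", "bushfire", "sydney"], (-33.87, 151.21),
                     ["bushfire", "sydney"], (-33.87, 151.21),
                     doc_freq, n_docs=1000))
```

Because the two scores live on different scales, a practical implementation would normalise them to a comparable range before combining them.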

2.3 Geo-enabled visualisation and interaction

Due to the explosive increase in the volume and complexity of geographic information, it is necessary to develop sophisticated user interfaces through which geographic information can be systematically organised and efficiently accessed. Geographic visualisation and interaction techniques aim to provide users with virtual environments for visualisation and communication of geographic information by using advanced cartographic and geoinformatic tools (MacEachren & Kraak 1997).

Map-based human-computer interfaces are used by many GIR systems (Brown 1999; Lim et al. 2002), in which geographic criteria of user queries and retrieval results can be specified and displayed on maps. Digital maps used in such environments share three important features: integrative, interactive and dynamic (Andrienko 1999). The term integrative refers to combining visual map representation with both geographic and thematic properties of information. The term interactive refers to enabling users to control the data exploration process by moving viewpoints, changing the level of detail (e.g. resolution of maps) and distorting the visualisation space. The term dynamic refers to the capacity to change the visualisations as the underlying data changes. These changes can be made manually by applying data filters, or automatically when new data are added to the system. All these techniques have been successfully adopted to support user activities like geographic thinking, geographic reasoning and geographic knowledge construction.

Remote geographic visualisation uses the client-server paradigm to interact with users by providing network-based mapping facilities. Three elements are involved in remote geographic visualisation applications: server-side applications that store data and serve users' requests; client-side applications that let users explore data space interactively and interact with servers; and a communication network that connects servers and clients.

Web-based mapping (Plewe 1997) is a typical example of remote geographic visualisation, in which online mapping services are used to implement server-side functionalities, Web browsers (e.g. Microsoft Internet Explorer, Mozilla Firefox) are employed as the only required user front-end, and the Internet is used as the communication medium. Applications of Web-based mapping include planning and resource management, Intelligent Transportation Systems (ITS) and Location-based Services (LBS) (Peng & Tsou 2003).

With the recent emergence of Web 2.0 technologies, many mapping service providers have started to offer a wide range of Web mapping services (e.g. driving directions, street maps and satellite/aerial photography) to public non-expert users; example products include Google Maps (c.f. http://maps.google.com/), Google Earth (c.f. http://earth.google.com/), Yahoo Maps (c.f. http://maps.yahoo.com/) and Microsoft Virtual Earth (c.f. http://www.microsoft.com/virtualearth/). Custom applications can integrate functionalities and data of these Web services via the application programming interfaces (APIs) provided. These applications are called mashups (O'Reilly 2005).

Another important application domain for geographically enabled interaction and visualisation is geo-referenced digital libraries (GDL). A geo-referenced digital library is a library information system that stores geo-referenced resources and provides a geographic orientation to those resources in terms of discovery, browsing, viewing, and access (Janée, Frew & Hill 2004). Collections of such libraries include maps, aerial photographs, remotely sensed images and documents that are relevant to one or several identifiable geographic places. Examples of geo-referenced digital libraries include the Alexandria Digital Library (ADL) Project (cf. http://www.alexandria.ucsb.edu/), and the Digital Library for Earth System Education (DLESE) (cf. http://www.dlese.org/).

Geo-referenced digital libraries in general allow users to access information using map-based interfaces. The geospatial and non-geospatial context of the materials is described using predefined metadata schemes, and is then indexed into a hierarchy of categories for searching and browsing.

Though maps as a means of representing geographic characteristics and relationships have been widely used for a very long time, several issues must be addressed in order to design effective user interfaces.

Firstly, using a map-based interface, information is organised based on its geographic context. However, at the same time this may break other connections within the information, such as hierarchical and sequential relationships, which are important in many scenarios. One interesting approach to address this issue is the use of the spatial hypertext model, in which information is organised in a two-dimensional space and other characteristics are differentiated by using visual attributes, such as colour, size and shape (Shipman & Catherine 1999).


Another problem arises when a large number of documents are retrieved as query results. Because of the limitation of display space on users' screens, not all retrieved results can be presented at the same time in many situations. In conventional IR systems, retrieved documents are sorted by their ranking scores and are then organised into pages. Each page consists of detailed information for a certain number of the ranked documents. At the same time, the total number of pages and the current page location provide users with a global understanding of the overall result. In GIR systems, how massive amounts of geographic information can be organised and displayed in a satisfactory way is still an open question (Couclelis 1998; Skupin 2000).

2.4 Geographic knowledge bases

GIR systems require high quality geographic knowledge bases to support GIR processes, such as geographic information entity extraction, grounding and geographic reasoning. Research efforts in this area comprise three main lines of interest: flat gazetteers, geographic thesauri and geographic ontologies.

Flat gazetteers consist of simple attributes of geographic entities, such as alternative names that are expressed in natural language, geographic locations that are expressed using geographic coordinates, type designations and population numbers. Flat gazetteers can be seen as specialised Geographic Information Systems (GIS) with specific data management solutions that aim to provide functionalities for indirect geographic location identification through names and types in a large scale and collective data environment (Brandt, Hill & Goodchild 1999; Hill, Frew & Zheng 1999).

Many gazetteers of different granularity have been published by government agencies, research institutions and industry as public resources. Examples include the United Nations Code for Trade and Transport Locations (UN/LOCODE) database (http://www.unece.org/locode/) published by the United Nations Economic Commission for Europe, the Alexandria Digital Library gazetteer (http://www.alexandria.ucsb.edu/gazetteer/) developed by the University of California, and the Geographic Names Information System (GNIS) (http://geonames.usgs.gov/) developed by the United States Geological Survey (USGS). Millions of geographic names and associated properties can be found in these gazetteers.

Geographic thesauri differ from flat gazetteers by placing emphasis not only on the attributes of single geographic entities, but also on the geographic and semantic relationships between geographic entities. An implementation of a geographic thesaurus defines a relationship model based on which various relationships are encoded to connect geographic entities to each other. Relationships between geographic entities make it possible to perform automated reasoning under a set of rules (Jones, Alani & Tudhope 2002).

The most common geographic thesaurus structure is the hierarchical data model, in which two types of relationships can be represented: the whole-part relationship and the genus-species relationship (Hill 1998). An example of the former is "Sydney is part of New South Wales which is part of Australia", while examples of the latter include "Sydney is a city" and "Australia is a country". A hierarchically structured geographic thesaurus allows for the possibility of expanding a place name term by finding its contained places and its parent places. It can also be used to answer questions like "Find all coastal cities in New South Wales, Australia" when combined with GIS systems, which provide geographic attributes of thesaurus entries. This approach has been used in many widely used geographic thesauri, such as the Getty Thesaurus of Geographic Names developed by the Getty Information Institute, and the Geographic Thesaurus of Petroleum Abstracts developed at the University of Tulsa.
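
A minimal Python sketch of such hierarchical expansion over whole-part relationships is given below; the hierarchy data is illustrative only.

```python
# A tiny whole-part hierarchy: place -> parent place (illustrative data only).
PART_OF = {
    "Sydney": "New South Wales",
    "Newcastle": "New South Wales",
    "New South Wales": "Australia",
    "Darwin": "Northern Territory",
    "Northern Territory": "Australia",
}

def parents(place):
    """Walk whole-part links upwards, e.g. Sydney -> New South Wales -> Australia."""
    chain = []
    while place in PART_OF:
        place = PART_OF[place]
        chain.append(place)
    return chain

def contained_places(place):
    """Expand a term downwards to every place recorded as (transitively) part of it."""
    return [p for p in PART_OF if place in parents(p)]

print(parents("Sydney"))                    # ['New South Wales', 'Australia']
print(contained_places("New South Wales"))  # ['Sydney', 'Newcastle']
```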

Geographic ontologies are fundamental elements of the Geospatial Semantic Web; they can be seen as narrowly focused domain ontologies that aim to model geographic relationships and properties using a standardised and interoperable semantic model (Egenhofer 2002). As with ontologies developed in other domains, geographic ontologies are usually encoded in the Web Ontology Language (OWL), a Web language designed to define and describe formal semantics. Semantic query languages, such as the OWL Query Language (OWL-QL), can be used to query and reason with geographic ontologies (Hiramatsu & Reitsma 2004).

By compiling, associating, and integrating data from multiple distinct sources (e.g. flat gazetteers and geographic thesauri), geographic ontologies that support different levels of granularity and richness can be built (Kavouras, Kokla & Tomai 2005). Examples of geographic ontologies include the Geographic Reasoning for Search Engines Geographic Knowledge Base (GREASE GKB) developed at the University of Lisbon, and the Ontologically-Augmented Spatial Information System (OASIS) developed at Cardiff University.

The use of geographic knowledge bases in GIR has two aspects: 1) place names and their attributes are usually used for geographic information entity recognition and extraction (Nissim, Matheson & Reid 2004; Wang et al. 2005); and 2) geographic relationships are used for grounding and reasoning (Leidner, Sinclair & Webber 2003; Li et al. 2002). Although vast volumes of place names can be easily collected, the current availability of data about geographic relationships is very limited. Take the GREASE GKB as an example: among 12,293 features around the world, only 13 adjacency relationships can be found (Chaves, Silva & Martins 2005). The current lack of comprehensive geographic knowledge bases is a major bottleneck in developing GIR systems.

2.5 Performance evaluation

It is difficult to compare various GIR implementations without a standard test set. In 2005, through the efforts of research groups at the University of

California, Berkeley and the University of Sheffield, the Cross Language Evaluation

Forum (CLEF) started a new track named GeoCLEF that aims to provide the necessary framework in which GIR systems can be evaluated using search tasks involving both geographic and multilingual aspects (Gey et al. 2005).

Participants of GeoCLEF are provided with a common set of document collections and query topics. System retrieval performances are then measured using relevance judgments that are determined by human assessors. GeoCLEF tasks can be performed in two contexts, monolingual and bilingual. In the monolingual context both documents and topics are provided in the same language, while in the bilingual context documents and topics are given in different languages. In GeoCLEF 2005, available options for document languages include English and German, and available options for query topic languages include English, German, Portuguese and Spanish.


In GeoCLEF 2005, query topics consist of fields of title, description, narrative, geographic locations and geographic relationships (e.g. in, near, south of). An example of GeoCLEF 2005 topics is shown in Figure 2-2. In this example the topic title is given in line 4, the description is given from line 5 to 6, the narrative is given from line 7 to

11, the geographic relationship is given in line 14, and the geographic locations are given in lines 15 and 16.

Figure 2-2 An example of GeoCLEF 2005 query topics

Document collections used in GeoCLEF consist of internal and national newswire stories that have very different geographic scope. The English document collection used in GeoCLEF 2005 consists of 169,477 documents from the British newspaper "The

Glasgow Herald" (1995) and the American newspaper "The Los Angeles Times" (1994).

Figure 2-3 shows an example of GeoCLEF documents. In this example, the document number is given in line 2, and the newswire text is given in lines 11 to 12.

Figure 2-3 An example GeoCLEF 2005 document


Twelve research groups from six countries participated in GeoCLEF 2005. An overview and comparison of the GeoCLEF 2005 results can be found in Gey et al.

(2006). GeoCLEF has been a regular track of CLEF since 2006. Seventeen research groups from eight different countries participated in GeoCLEF 2006, including our group. The details of our participation in GeoCLEF 2006 and the analysis of our results are discussed in Chapter 7.

2.6 Conclusions

In conclusion, geographic information is diverse and complex. Effective management and utilisation of geographic information require mechanisms that combine advances from many different disciplines, such as information science, geographic science and cognitive science. The differences between GIR systems and conventional IR systems in the ways that documents are indexed, searched, ranked and presented introduce unique research challenges to the community. This chapter provides an extensive review of a range of activities that fall under GIR and offers insights into techniques that are fundamental to GIR systems, from which we can see that although considerable efforts have been undertaken in this field, GIR is still in its early development stage.

Considering the importance and ubiquity of geographic information and the potential value of GIR systems for geographic information researchers, providers and consumers, as evidenced by the literature review, developing and evaluating a GIR system that emphasises geographic features and geographic relationships may significantly advance the understanding of how geographic information can be processed and retrieved effectively and efficiently.

CHAPTER 3 A SUPERVISED MACHINE LEARNING APPROACH TO TOPONYM

DISAMBIGUATION

This chapter addresses the geographic information entity extraction and grounding problem by presenting a supervised machine learning approach to toponym disambiguation tasks. The proposed approach uses a simple hierarchical geographic relationship model to describe geographic entities and the geographic relationships among them. The disambiguation procedure begins with the identification of toponyms in documents by applying and extending state-of-the-art named entity recognition technologies, and then performs disambiguation as a supervised classification process over a feature space of geographic relationships. A geographic knowledge base is modelled and constructed to support the whole procedure. This chapter is mostly based on a manuscript published in the book – “The Geospatial Web - How Geobrowsers,

Social Software and the Web 2.0 are Shaping the Network Society”.

3.1 Introduction

Ambiguities exist in assigning a physical place to a given toponym in open domain unstructured and self-structured documents, such as free text and HyperText Markup

Language (HTML) web pages. This is a natural consequence of the fact that many distinct places on the earth may use the same name. As a very simple example, the mention "Sydney" in "Sydney is a great holiday destination for families" could refer to many possible places around the world, such as Sydney, New South Wales, Australia;

Sydney, Nova Scotia, Canada; or Sydney, Florida, United States. Real instances of such ambiguity can be easily found all around the world. Statistical data of the Getty

Thesaurus of Geographic Names (TGN), a widely used geographic gazetteer developed by the Getty Information Institute, has shown that the percentage of toponyms that are used by more than one place ranges from 16.6% for Europe to 57.1% for North and

Central America (Smith & Crane 2001).

Ambiguity of toponyms must be resolved in order to gain a full understanding of the geographic context of documents. A common and unique feature of toponyms is that a toponym always refers to a place on the earth, and thus corresponds to certain geographic properties (e.g. geometry, topology and thematic data). This important feature makes the problem of toponym disambiguation different from the disambiguation of general word senses and the disambiguation of other proper nouns (e.g. personal names and organisation names) in natural language processing. Toponym disambiguation is one of the most important challenges in many geoinformatic applications including geographic information systems, geographic information retrieval and geographic digital libraries

(Garbin & Mani 2005; Larson 1996; Zong et al. 2005).

In general, a toponym disambiguation procedure consists of two steps. Firstly, a set of toponyms T = {t1, t2, t3, …, tn} is extracted from a document. This step can be performed by using software tools such as Named Entity Recognition (NER) systems. Secondly, the extracted toponyms and their context (i.e. the document from which the toponyms are extracted) are input into a toponym disambiguation algorithm, which then outputs a set of unique places P = {p1, p2, p3, …, pn}, each of which is the disambiguated referent of a toponym in the input set. The places in the output set can be expressed using numerical or descriptive formats: a geographic coordinate (e.g. latitude, longitude) is an example of the former, and a hierarchical administrative structure (e.g. Australia/New South

Wales/Sydney) is an example of the latter.

Two types of evidence can be used to support the disambiguation procedure: textual evidence and geographic evidence. For a toponym that is to be disambiguated, textual evidence refers to its linguistic environment, including linguistic syntax and discourse semantics. Geographic evidence refers to its related geographic properties, including geographic attributes (e.g. location, distance) and geographic relationships (e.g. administrative hierarchy, adjacency). Textual evidence can be discovered from the local context using natural language processing techniques. On the other hand, external resources, such as geographic knowledge bases, are required to obtain geographic evidence. Both textual and geographic evidence play very important roles in the task.

The different ways in which these types of evidence are utilised and integrated lead to different implementations of toponym disambiguation systems.

This chapter proposes a supervised machine learning approach to the problem of toponym disambiguation. In particular, our approach consists of the following four components: 1) a geographic relationship model that is used to describe geographic entities and geographic relationships among them; 2) a geographic information entity recognition module that is used to extract geographic information entities from context by applying the current state-of-the-art NER techniques; 3) a learning module that aims to learn a disambiguation model from statistics of geographic relationships among geographic entities derived from a given training data set. The learned model then can be applied to new data sets that are similar to the training data, and 4) a geographic knowledge base that provides the necessary underlying data resource to support the extraction and disambiguation procedures.


An experimental system that makes use of our approach has been implemented and evaluated on a large document collection that consists of 15,194 articles collected from the Australian Broadcasting Corporation (ABC) local news collection. The overall evaluation results provide quantitative support for the major hypothesis of this study that by employing the geographic relationship model and a general supervised machine learning algorithm, it is possible to improve the performance of toponym disambiguation.

The main contributions of this chapter are three-fold: firstly, the proposed approach is described, implemented and validated; secondly, a large document collection that can be used as a test-bed in future research is constructed; and lastly, a geographic knowledge base that contains hierarchical information about common place names of Australia is developed.

The remainder of this chapter is organised as follows. Section 3.2 describes our methodology. Section 3.3 presents the experiments with the implementation of the algorithm, the obtained results and the important observations. Section 3.4 concludes this chapter and gives some directions for future research.

3.2 A supervised machine learning approach to toponym disambiguation

The proposed approach consists of four components: a geographic relationship model, a geographic information entity recognition module, a learning module and a geographic knowledge base. This section describes each of these components in more detail.

3.2.1 Geographic relationship model

Geographic relationships between geographic entities reflect the nature of their embedding in the real world. The geographic relationship model in our approach is designed to describe geographic entities to which toponyms are mapped and the geographic relationships that hold between geographic entities. Specifically, the model defines a simple hierarchical structure that is used to map toponyms to geographic entities, and three geographic relationships that are considered in the disambiguation procedure.

Our approach maps toponyms to a hierarchical structure, which is based on the political administrative properties of toponyms. Nodes on the higher level of the hierarchy consist of one or more nodes on the lower levels in a political sense. A three level country/state/city hierarchy is an instance of the structure. By using this structure, a toponym can be mapped to a single place in the world.

The reasons that this mapping strategy is selected are: firstly, it is simple and well-understood; secondly, compared with other possible mapping strategies that are concerned with geometric properties (e.g. location coordinates, boundary edge and bounding boxes), this approach requires less computation and storage cost; and lastly, there are many existing resources available for building a geographic knowledge base that uses this structure as internal data structure.

Based on the above hierarchical structure, three alternative relationships that are possible between two geographic entities are defined in our geographic relationship model, namely: identical, similar and part-of.


Identical: Two geographic entities are identical if they have the same values at all levels of the hierarchical structure.

As with the "one sense per discourse" heuristic in the Natural Language Processing (NLP) literature (Gale, Church & Yarowsky 1992), in many cases several mentions of a toponym in a document are mapped to a single geographic entity. However, toponyms with different spellings can refer to identical geographic entities as well. This is because the same place can have more than one name. For example, both Stalingrad and

Volgograd are mapped to the same famous Russian city on the west bank of the Volga River.

This issue also has an important impact on multilingual applications.

Similar: Two geographic entities are similar if: 1) they are at the same hierarchical level, and 2) they have the same upper level values. For example, Australia/New South Wales and Australia/Queensland are similar, because both of them are states of Australia.

Australia and China are similar, because both of them are country-level geographic entities.

Part-of: A geographic entity X is part-of another geographic entity Y if X lies at a lower hierarchical level beneath Y. For example, Australia/New South Wales/Sydney is part-of

Australia/New South Wales, and Australia/New South Wales is part-of Australia.

It is also important to note that the part-of relationship in the hierarchical structure is different from the part-of concept derived from a geometric point of view: the former focuses on the political relationship between two geographic entities,

but the latter checks whether a geographic entity is entirely surrounded by another one.

Real world examples that can be used to describe this difference are enclaves and exclaves.

These relationships are helpful to resolve toponym ambiguities. Let us consider a simple example.

Example 3-1: Many people think Sydney is the capital of Australia.

Two toponyms can be extracted from the above sentence: Sydney and

Australia. The toponym that needs to be disambiguated is Sydney, which can be mapped to many geographic entities, including:

Australia/New South Wales/Sydney

Canada/Nova Scotia/Sydney

United States/North Dakota/Sydney

United States/Florida/Sydney

Based on the geographic relationships between Sydney and Australia, it is easy to find out the correct mapping (i.e. Australia/New South Wales/Sydney), because there is a part-of relationship between this mapping and Australia, and there is no relationship that can be found using other mappings.

Now, let's look at an interesting question arising from the following examples.

Example 3-2: China, Texas is located in Jefferson County.

Example 3-3: In the United States, Texas is the third largest exporter to China.

It is easy for a human reader to understand the context and then realise that the toponym

China is mapped to two different geographic entities in the above two examples: United

States/Texas/China for the former, and China for the latter.

The ambiguity of China in Example 3-2 can be resolved easily, as only one relationship (i.e. United States/Texas/China is part-of United States/Texas) can be found from the context for the toponym China.

However, for Example 3-3, the ambiguity of China is more complex because many possible relationships can be found:

United States/Texas/China is part-of United States/Texas

China is similar to United States

In our approach, the disambiguation procedure for this example is regarded as a classification problem in which each ambiguous toponym is classified into one of its possible geographic entity mappings. Relationships connected to geographic entities are used as the classification features, based on which ambiguities are resolved using supervised machine learning techniques.

3.2.2 Geographic information entity recognition

The goal of the geographic information entity recognition module is to identify all toponyms in a given document. This procedure is carried out by employing the

state-of-the-art NER technologies that combine gazetteer-based string matching and statistical analysis methods. Two important extensions are also made to improve system performance.

The first extension is called "geographic stop words". In information retrieval, stop words are words that do not have semantic meaning (e.g. a, the) or occur in many of the documents in the collection (e.g. say, you). These words are not useful for information retrieval tasks and can be eliminated during the information processing procedure in order to reduce processing and storage costs. The concept of geographic stop words in our approach is similar, but with a focus on geographic place names, and it only applies to the gazetteer-based string matching method. Examples of toponyms that can be seen as geographic stop words include US, Mobile (a city in

Alabama, U.S.), Orange (a city in Texas, U.S., and a city in New South Wales, Australia) and Reading (a town in Berkshire, U.K.). The existence of these words in a document introduces geo/non-geo ambiguity into the whole toponym disambiguation procedure when the gazetteer-based string matching method is used. By applying a geographic stop word list, this kind of geo/non-geo ambiguity can be efficiently removed and the size of the NER result sets can be reduced. Therefore, the number of toponyms that need to be disambiguated can be reduced, and the computational costs of further processing can be reduced as well.

The second extension is called "multi-word toponym merging". Many toponyms consist of more than one word, and in some cases a part of a toponym could be a toponym as well. An example is New South Wales (a state of Australia), in which both the phrases "South Wales" and "Wales" could be used as toponyms that refer to

areas of the United Kingdom. The multi-word toponym merging algorithm finds the maximum overlap to merge two or more geographic named entities into one toponym by checking their positions in the document.

In summary, the whole geographic named entity recognition procedure can be described as a procedure of three steps. The first step performs a simple string matching against all documents in the collection, utilising the gazetteer derived from our geographic knowledge base; toponyms in the geographic stop word list are eliminated during this step. The second step performs a NER process using a pre-trained NER tagger to tag three types of named entities (location, person and organisation) in all documents. The final step matches the result sets from the two previous steps using the following rules: 1) each string found in the first step is eliminated if it is tagged as a non-location entity (i.e. person or organisation) in the second step; otherwise it is added to the result set; 2) each toponym in the geographic stop word list of the first step is added to the result set if it is tagged as a location entity in the second step; and 3) two or more toponyms are merged using the above multi-word toponym merging algorithm.
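The three-step procedure can be sketched as follows. This is an illustrative Java outline only: the NerTagger interface, the tag names and the merging helper are hypothetical stand-ins, and the thesis's actual tagger and merging implementation are not reproduced here.

import java.util.*;

// Illustrative sketch of the three-step toponym recognition procedure described above.
public class ToponymRecognition {

    interface NerTagger {
        // Returns the entity tag ("LOCATION", "PERSON", "ORGANISATION" or null) for a mention.
        String tagOf(String mention, String document);
    }

    public static Set<String> recognise(String document,
                                        Set<String> gazetteerMatches,    // step 1: gazetteer string matches
                                        Set<String> geographicStopWords,
                                        NerTagger tagger) {              // step 2: statistical NER
        Set<String> result = new LinkedHashSet<>();

        // Rule 1: keep a gazetteer match unless the tagger says it is a person or organisation.
        for (String s : gazetteerMatches) {
            if (geographicStopWords.contains(s)) continue;   // handled by rule 2
            String tag = tagger.tagOf(s, document);
            if (!"PERSON".equals(tag) && !"ORGANISATION".equals(tag)) result.add(s);
        }

        // Rule 2: a geographic stop word is kept only when the tagger marks it as a location.
        for (String s : geographicStopWords) {
            if (document.contains(s) && "LOCATION".equals(tagger.tagOf(s, document))) result.add(s);
        }

        // Rule 3: merge multi-word toponyms, e.g. "New South Wales" absorbs "South Wales" and "Wales".
        return mergeMultiWordToponyms(result);
    }

    // Simplified merging: keeps only strings not contained in a longer toponym.
    // The thesis's algorithm checks token positions in the document instead.
    static Set<String> mergeMultiWordToponyms(Set<String> toponyms) {
        Set<String> merged = new LinkedHashSet<>();
        for (String t : toponyms) {
            boolean contained = false;
            for (String other : toponyms) {
                if (!other.equals(t) && other.contains(t)) { contained = true; break; }
            }
            if (!contained) merged.add(t);
        }
        return merged;
    }
}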

3.2.3 Supervised machine learning

After all toponyms are identified and tagged in a document, a supervised machine learning module is used to disambiguate all toponyms that could be mapped to more than one geographic place. The disambiguation procedure can be treated as a supervised classification procedure, and the statistics of geographic relationships among toponyms are used as features for classification.

Typically, four steps are involved in a classification process based on supervised machine learning methods, such as the Naive Bayes classifier (Duda & Hart

1973), Logistic regression (Hosmer & Lemeshow 1989) and Decision trees (Quinlan

1986).

Firstly, one or more quantitative characteristics are selected as features for representing the data entities to be classified. The collection of these features is usually referred to as a feature space, which is normally modelled as a multi-dimensional vector space. The selection of features is problem dependent and can be done either manually or automatically.

Secondly, a statistical model is derived from a training data set in which ground truth of each data entity to be classified is already known. By using this statistical model, the total feature space is divided into a number of sub-spaces, each of which corresponds to a class. Both linear and non-linear functions can be used to describe this model.

Thirdly, a classification rule is constructed for the statistical model and is used to estimate the class-conditional probabilities (i.e. the likelihood that a given data entity has a particular set of feature values) and the probability of appearance of each class for any data entity to be classified based on Bayes' rule of conditional probability, in which prior probabilities are estimated from training data.

Lastly, the class probabilities calculated in the previous step are used to predict a class for each data entity based on maximum likelihood estimation, i.e. for a given data entity, the class that has the maximum probability given the observed feature values is

selected as the result class.

Our proposed classification method covers all the above-mentioned steps. The feature scheme used is based on statistics of geographic relationships among possible mappings of toponyms. Here we discuss the detail of the classification scheme, feature selection and feature value acquisition.

Let T = {t1, t2, t3, …, tn} be the set of all toponyms extracted from a given document, and let Pk = {pk1, pk2, …, pkm} be the set of all possible geographic place mappings of tk. The problem of disambiguating tk is defined as a classification procedure that assigns one element of Pk to tk. As an example, considering the above Example

3-1, we have T = {Sydney, Australia} and PSydney= {Australia/New South Wales/Sydney,

Canada/Nova Scotia/Sydney, United States/North Dakota/Sydney, United

States/Florida/Sydney} and PAustralia = {Australia}.

Once all toponyms are identified and all their possible mappings are acquired, a weighting value, which is used as the quantitative feature for classification, is assigned to each element in all P sets. The weighting algorithm is described as follows:


Figure 3-1 The GeoRelationshipWeighting algorithm

The constant values (i.e. IDENTICAL_WEIGHT, PARTOF_WEIGHT, and

SIMILAR_WEIGHT) in the GeoRelationshipWeighting algorithm are decided by experiments.
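Since Figure 3-1 is not reproduced in this text, the following Java sketch shows only one plausible form of the GeoRelationshipWeighting step, under the assumption that each candidate mapping of a toponym accumulates the weights of the relationships it holds with the candidate mappings of the other toponyms in the same document; the constant names mirror those in the text, but the exact algorithm is the one given in Figure 3-1.

import java.util.*;

// Plausible sketch of a relationship-based weighting step (illustrative, not Figure 3-1 itself).
public class GeoRelationshipWeighting {
    static final double IDENTICAL_WEIGHT = 1.0;   // values as in run 1 of the experiments
    static final double PARTOF_WEIGHT = 1.0;
    static final double SIMILAR_WEIGHT = 1.0;

    enum Relation { IDENTICAL, PART_OF, SIMILAR, NONE }

    // candidates: toponym -> its possible hierarchical mappings, e.g. "Australia/New South Wales/Sydney".
    // relation(a, b) is assumed to be derived from the geographic knowledge base.
    static Map<String, Double> weight(Map<String, List<String>> candidates,
                                      java.util.function.BiFunction<String, String, Relation> relation) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, List<String>> e : candidates.entrySet()) {
            for (String mapping : e.getValue()) {
                double w = 0.0;
                for (Map.Entry<String, List<String>> other : candidates.entrySet()) {
                    if (other.getKey().equals(e.getKey())) continue;   // skip the toponym itself
                    for (String otherMapping : other.getValue()) {
                        switch (relation.apply(mapping, otherMapping)) {
                            case IDENTICAL: w += IDENTICAL_WEIGHT; break;
                            case PART_OF:   w += PARTOF_WEIGHT;    break;
                            case SIMILAR:   w += SIMILAR_WEIGHT;   break;
                            default: break;
                        }
                    }
                }
                weights.put(e.getKey() + " -> " + mapping, w);
            }
        }
        return weights;
    }
}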

3.2.4 Geographic knowledge base

The geographic knowledge base in our approach provides a rich repository from which all the necessary information for geographic named entity recognition and geographic relationship analysis can be acquired. The data schema of our geographic knowledge base is defined using the object-oriented modelling method. Figure 3-2 shows the class diagram of the schema, in which four classes are defined: 1) a geographic entity class that is used to describe a distinct place on the earth, whose properties include id, qualified name and type (e.g. city, state and country); 2) a geoname class that is used to represent names of geographic entities; 3) a relationship class that is used to model the geographic relationships between geographic entities; and 4) a part-of class, which is a subclass of the relationship class.

Figure 3-2 Class diagram of the geographic knowledge base data model

Given two toponyms and their possible mappings, the relationships between them can be derived from the knowledge base using the following rules (a code sketch follows the list):

1) If the two toponyms are mapped to the same geographic entity, an identical relationship exists

2) If the two toponyms are mapped to two geographic entities between which a

Part-Of connection is found, a part-of relationship exists

3) If the two toponyms are mapped to two geographic entities both of which have a

Part-Of connection point to a same parent, a similar relationship exists.
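A minimal Java sketch of these three rules is given below, using hierarchical path strings such as "Australia/New South Wales/Sydney" as a simplified stand-in for knowledge base entities; a special case is added for two top-level entities, following the earlier definition of the similar relationship.

// Minimal sketch of the three derivation rules over hierarchical path strings (illustrative only).
public class RelationshipRules {

    enum Relation { IDENTICAL, PART_OF, SIMILAR, NONE }

    static Relation derive(String a, String b) {
        if (a.equals(b)) return Relation.IDENTICAL;                  // rule 1: same entity
        if (a.startsWith(b + "/") || b.startsWith(a + "/"))
            return Relation.PART_OF;                                 // rule 2: Part-Of chain
        String pa = parentOf(a), pb = parentOf(b);
        if (pa == null && pb == null) return Relation.SIMILAR;       // two top-level (e.g. country) entities
        if (pa != null && pa.equals(pb)) return Relation.SIMILAR;    // rule 3: same parent
        return Relation.NONE;
    }

    // Returns the parent path, e.g. "Australia/New South Wales" for ".../Sydney".
    static String parentOf(String path) {
        int i = path.lastIndexOf('/');
        return i < 0 ? null : path.substring(0, i);
    }

    public static void main(String[] args) {
        System.out.println(derive("Australia/New South Wales/Sydney", "Australia"));       // PART_OF
        System.out.println(derive("Australia/New South Wales", "Australia/Queensland"));   // SIMILAR
        System.out.println(derive("Australia", "China"));                                  // SIMILAR
    }
}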

Figure 3-3 gives an example of instance level view of the model. Seven geographic entities are defined in this example: two country entities (i.e. United States and

Australia), three state level entities (i.e. United States/Florida, Australia/New South

Wales and Australia/Queensland) and two city level entities (i.e. United

States/Florida/Sydney and Australia/New South Wales/Sydney). Some observations we can make from this example include: United States/Florida/Sydney and Australia/New

South Wales/Sydney share the same name: Sydney. The United States has seven names:

United States, United States of America, the States, US, U.S., USA and U.S.A.; all of these names point to the same geographic entity, and therefore identical relationships can be derived. Part-Of relationships can be seen between United States and United

States/Florida, Australia and Australia/New South Wales, United States/Florida and

United States/Florida/Sydney, and Australia/New South Wales and Australia/New South

Wales/Sydney. A similar relationship can be seen between Australia/New South Wales and Australia/Queensland.

Figure 3-3 An instance level view of the geographic knowledge base data model

A relational database management system (RDBMS) is used to implement the geographic knowledge base.

3.3 Experiments

An experimental system has been implemented to evaluate the proposed method. This section first describes the data and document collection used in the experiments, and then presents the baseline methods and the evaluation of our system.

3.3.1 Data

Our experimental system aims to disambiguate toponyms of Australia. The toponym disambiguation process is modelled as a procedure that classifies each ambiguous toponym into one of eight state-level geographic entities (i.e. New South Wales,

Queensland, South Australia, Tasmania, Victoria, Western Australia, Northern Territory and Australian Capital Territory).

A geographic knowledge base instance is constructed based on the data model presented in section 3.2.4. Statistics from knowledge base are summarised in Table 3-1. The two resources from which geographic knowledge is acquired are the Gazetteer of Australia developed by Geoscience Australia and the Postcode Datafile provided by the Australia

Post.

Table 3-1 Statistics from the geographic knowledge base for Australian toponym disambiguation

State/Territory                Distinct toponyms   Toponyms with ambiguities   % with ambiguities
New South Wales                2454                274                         11.17
Queensland                     2396                249                         10.39
South Australia                826                 132                         15.98
Tasmania                       605                 121                         20.00
Victoria                       1692                245                         14.48
Western Australia              1681                167                         9.93
Northern Territory             112                 10                          8.93
Australian Capital Territory   123                 30                          24.39
Australia                      9203                1228                        13.34

The document collection used in our experiments consists of 15,194 articles collected

from the Australian Broadcasting Corporation (ABC) local news collection. The ground truth for all toponyms in the collection is obtained from the ABC news feeds, which are tagged by the editors. An example of the ABC news feeds can be found at http://abc.net.au/xmlcontent/indexes/southwestwa/southwestwa_rss_index..

3.3.2 Evaluation

To evaluate the performance of our system, experiments were carried out using 10-fold cross-validation over the ABC local news collection. During the experiment, the original document collection is partitioned into 10 sub-samples. Of the 10 sub-samples, a single sub-sample is retained as the validation data for testing the model, and the remaining 9 sub-samples are used as training data. The cross-validation process is then repeated 10 times (the folds), with each of the 10 sub-samples used exactly once as the validation data. The 10 results from the folds are then averaged. The two baseline methods used for comparison are described as follows:

Baseline 1 - Maximum occurrence: Given a toponym to be disambiguated, this method counts all candidates (i.e. state level geographic entities) from the training data set, and the one that has the maximum occurrence is output as the final result. In cases that the toponym is not seen in the training data set, a default mapping determined by experts is assigned.

Baseline 2 - Maximum local weighting score: This method runs the

GeoRelationshipWeighting algorithm to calculate weighting scores for all toponyms and their mappings in a document. The mapping that has the maximum weighting score is

assigned to a toponym. A default mapping is used when the maximum weighting score is zero, which happens when no geographic relationship can be found for a toponym.

Three supervised machine learning algorithms are trained and applied: the Naive Bayes classifier; J48 decision trees, in which non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes; and boosted J48. These algorithms are implemented using the Weka software package developed at the

Department of Computer Science, University of Waikato, New Zealand (Witten &

Frank 2005).
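The evaluation setup described above can be reproduced with a few lines of Weka code. The sketch below is illustrative rather than the thesis's actual code: the ARFF file name is hypothetical, and it is assumed to contain the relationship-weight features and the state-level class label for each ambiguous toponym.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.J48;
import weka.core.Instances;

// Sketch of the 10-fold cross-validation with the three Weka classifiers mentioned in the text.
public class DisambiguationExperiment {
    public static void main(String[] args) throws Exception {
        // Hypothetical feature file; the last attribute holds the state/territory class label.
        Instances data = new Instances(new BufferedReader(new FileReader("toponym-features.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        AdaBoostM1 boostedJ48 = new AdaBoostM1();
        boostedJ48.setClassifier(new J48());             // boosted J48, as used in the experiments

        Classifier[] classifiers = { new NaiveBayes(), new J48(), boostedJ48 };
        for (Classifier c : classifiers) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));   // 10-fold cross-validation
            System.out.printf("%s: %.2f%% correct%n", c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}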

In total, three runs were performed, each using a different parameter configuration, as shown in Table 3-2.

Table 3-2 Parameter configurations used in the experiments

                   Run 1   Run 2   Run 3
IDENTICAL_WEIGHT   1.0     1.0     1.0
PARTOF_WEIGHT      1.0     0.8     0.5
SIMILAR_WEIGHT     1.0     0.8     0.5

Table 3-3 Toponym disambiguation accuracy (%) on the ABC local news collection

                         Run 1   Run 2   Run 3   Average
Baseline 1               78.71   78.71   78.71   78.71
Baseline 2               51.85   52.33   52.11   52.10
Naive Bayes classifier   75.62   73.70   73.55   74.29
J48                      85.37   84.70   84.58   84.88
Boosted J48              85.38   85.08   84.73   85.06

Table 3-3 shows the results of our experiments. In all three runs, the J48 and boosted

J48 algorithms performed consistently better than the two baseline methods, the Naive

Bayes classifier performed better than baseline 2, but worse than baseline 1. The best accuracy result was 85.38% using boosted J48 in run 1.

The best results for the three machine learning algorithms (i.e. 75.62% for Naive Bayes

classifier, 85.37% for J48 and 85.38% for boosted J48) were all found in run 1, with the parameter configuration IDENTICAL_WEIGHT = 1.0, PARTOF_WEIGHT

= 1.0 and SIMILAR_WEIGHT = 1.0, and their accuracy results decreased when the values of PARTOF_WEIGHT and SIMILAR_WEIGHT decreased.

3.3.3 Discussion

From the above results several observations emerge. Firstly, the statistics of toponyms of

Australia confirm the claim that toponym ambiguity is a common fact in real-world practice. The percentage of toponyms that have ambiguities ranges from 8.93% for

Northern Territory to 24.39% for Australian Capital Territory, and an overall value of

13.34% is found for the whole Australia. These figures show that resolution of toponym ambiguities must be taken into account in any geographic-related natural language processing and information retrieval applications.

Secondly, the results from the above experiments clearly show that the overall disambiguation accuracy can be significantly improved by employing the proposed method and appropriate supervised machine learning algorithms. Based on the average disambiguation accuracy, the boosted J48 algorithm achieved 8.07% and 63.26% improvements compared to the baseline 1 and baseline 2 methods respectively. This observation indicates that our approach, including the geographic relationship model and the geographic knowledge base, is a very promising one for the toponym disambiguation problem.

Lastly, we are aware that the performance of the geographic information entity recognition module is not fully evaluated. The main reason for this limitation is that we

lack the necessary human and technical resources to annotate large document collections.

However, the effectiveness of the module could be tested with existing annotated corpora and gold-standard data. Further experiments are planned to address this issue.

3.4 Conclusions

This chapter addresses geographic information entity extraction and grounding problem by presenting a supervised machine learning approach for toponym disambiguation tasks. In particular, the details of four components, including a geographic relationship model, a geographic information entity recognition module, a supervised machine learning module and a geographic knowledge base, have been discussed. The proposed approach has been evaluated over a large document collection consisting of 15,194

Australian local news articles. The experimental results demonstrate that the disambiguation accuracy ranges from 73.55% to 85.38% depending on the running parameters and the learning strategies used; in particular, the configurations using the J48 and boosted J48 machine learning methods provide better performance than the baseline methods.

CHAPTER 4 LEARNING RANKING FUNCTIONS FOR GEOGRAPHIC INFORMATION

RETRIEVAL USING GENETIC PROGRAMMING

This chapter describes an approach that learns GIR ranking functions using Genetic

Programming (GP) methods based on textual statistics and geographic features derived from documents and user queries. The proposed approach has been applied to a large collection of geographic metadata documents, and the experimental results show that the ranking functions learned using the proposed method achieved significant improvement over existing ranking mechanisms in retrieval performance.

4.1 Introduction

Ranking is one of the key research questions in Information Retrieval (IR). Given a user query, IR ranking functions assign each retrieved document a numerical score, which reflects the relevance of the document to the query. Ranking scores are then used by IR systems to order the retrieved documents, and documents that are most relevant to the query are presented to users first. With the ranked results users can find information of interest more easily. In addition, most of the current evaluation methods for IR system retrieval performance are based on ranked results.

In Geographic Information Retrieval (GIR), the relevance of a document to a user query is determined by their thematic context, and also geographic context. This important characteristic of GIR suggests two distinct but interrelated hypotheses: firstly, geographic features of documents and user queries should be used for the calculation of

geographic ranking scores; and secondly, thematic ranking scores and geographic ranking scores should be combined to produce the final ranked result.

Thematic ranking methods assign ranking scores to documents based on lexical and syntactic statistics of documents and queries. On the other hand, geographic ranking methods assign ranking scores to documents based on geographic features discovered from documents and user queries. The main issue in the integration of several ranking approaches into one ranking function is how different measures can be combined. Many integration schemes have been proposed in the literature, which can be divided into three categories: standard combinations, linear combinations, and non-linear combinations. Standard combinations calculate the maximal, minimal or median values of individual ranking scores as final results. Linear combinations assign a weight value to each individual ranking score, and take the sum of weighted ranking scores as final results. The performance of standard combinations and linear combinations are studied extensively using large unstructured text collections by Fox & Shaw (1994), Lee (1997) and Vogt & Cottrell (1998). The construction of weight matrices used in linear combinations can be automated using machine learning (Bartell, Cottrell & Belew 1994) and logistic regression (Gey 1994, Savoy & Rasolofo 2002). In contrast to standard and linear combinations, non-linear combinations allow individual ranking scores to be combined in more complex manners. Neural network and evolutionary computation are two important approaches for the learning of non-linear ranking functions (Bartell 1994;

Trotman 2005). Although many experimental results show that significant improvements on retrieval performance can be achieved by combining multiple ranking strategies, there is little previous work on the integration of textual ranking mechanisms

with geographic ranking mechanisms for GIR systems, which is the main focus of this chapter.

This chapter reports my experience in learning optimal GIR ranking functions for a given document collection using Genetic Programming (GP). GP is an evolutionary computation technique that aims to automatically find an optimised computer program for a specified problem using the principles of the natural evolution process (Koza 1992). The retrieval performance of the proposed approach has been compared with traditional ranking methods, including: a classic ranking method based on the Vector Space Model (VSM), a simple linear combination of multiple ranking scores, and a linear combination of multiple ranking scores using logistic regression. The results show that the proposed method can achieve significant improvement over these methods. In addition, this chapter highlights issues inherent in the design and implementation of GP-based GIR ranking functions. The impact of different fitness functions and evolution strategies on retrieval performance has been examined. The test collection used in the experiments consists of 4,000 geographic metadata records collected from the New South Wales Natural Resources Data Directory, a state government node of the Australian Spatial Data Directory (ASDD). The evaluation method used for retrieval performance comparison was based on the precision and recall measurements of ranked results.

The contributions of this chapter are as follows. Firstly, it describes the overall design of a GIR ranking function learning method using GP; secondly, it highlights important decisions regarding GP evolution strategies; thirdly, it proposes and compares four

different fitness functions for the GP algorithm; and finally it reports experimental results to support the proposed approach.

The remaining sections of this chapter are structured as follows: Section 4.2 describes the methodology for learning GIR ranking functions using GP, Section 4.3 reports the experiments and results, and Section 4.4 concludes the chapter.

4.2 Methodology

This section describes the methodology for learning GIR ranking functions for a given document collection and a set of queries using GP. There are several reasons that GP is chosen for this task: firstly, GP has been demonstrated to be capable of evolving complex program structures and giving solutions to some problems that are competitive with human-written algorithms; secondly, GP can be used for solving both linear and non-linear problems; and finally, previous work in conventional IR systems has shown that ranking functions learned using GP can achieve significant improvement in retrieval performance (Fan, Gordon & Pathak 2005).

4.2.1 Genetic Programming

Genetic Programming is a sub-field of evolutionary computation that aims to automatically find an optimised computer program for a specified problem using the principles of the natural evolution process. GP theory has its roots in the classical Genetic

Algorithm (GA), but GP provides a more expressive way to represent the search space and solutions. In the GP literature, the search space is called the population and a solution is called an individual. GP algorithms utilise genetic operators such as reproduction,

crossover and mutation on each generation of individuals to produce new generations of better solutions. The most important elements of a GP implementation include: a set of terminals and functions that are used as the logic units of a computer program; a fitness function that evaluates how well each individual in the population solves the problem; and an evolution strategy that specifies control mechanisms, such as the genetic operators used in the evolution and the frequencies with which these operators are applied.

4.2.2 Terminals and functions

Terminals and functions are the logic units of GP individuals. Terminals reflect logical views of documents and user queries. In the proposed method, index terms (i.e. keywords) are used to describe the textual content of documents, and place names and geometric objects (i.e. polygons) are used to describe geographic features of documents. User queries are represented in the same way. Based on such a representation, the terminals listed in Table 4-1 are selected for GP. The basic hypothesis underlying the geographic ranking method is that the relevance between a document and a user query increases or decreases with an increase or decrease in the geographic distance and/or overlap area between them (Beard & Sharma 1997; Larson

& Frontiera 2004).

Terminals listed in Table 4-1 can be categorised into two groups: local terminals and global terminals. Local terminals reflect the content of one particular document.

RAWFREQ_KEYWORD, RAWFREQ_PLACE, MAX_RAWFREQ,

OVERLAP_AREA, DOC_AREA and QUERY_AREA are examples of local terminals.

In contrast, global terminals reflect the content of the whole collection. DOC_COUNT,

DOCFREQ_KEYWORD, DOCFREQ_PLACE and DOC_OVERLAP are examples of global terminals.

Functions reflect relationships between terminals. Functions used in the experiments include addition (+), subtraction (-), multiplication (×), division (/) and natural logarithm (log). Additional controls are added to the function definitions to handle exception cases, such as division by zero and the logarithm of non-positive numbers (a sketch of such protected operations is given after Table 4-1).

Table 4-1 Terminals used in the GP learning process

Name               Description
DOC_COUNT          Number of documents in the collection
RAWFREQ_KEYWORD    Raw frequency of the keyword in the document
DOCFREQ_KEYWORD    Number of documents in which the keyword appears
RAWFREQ_PLACE      Raw frequency of the place name in the document
DOCFREQ_PLACE      Number of documents in which the place name appears
MAX_RAWFREQ        The maximum raw frequency of all keywords and place names in the document
OVERLAP_AREA       Area size of overlap between the document and the query
DOC_AREA           Area of the document
QUERY_AREA         Area of the query
DOC_OVERLAP        Number of documents which overlap the query
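The exact guards used for the protected functions are not specified above, so the following Java sketch shows one common choice (returning 1 for division by zero and 0 for the logarithm of a non-positive number); the fallback values are illustrative only.

// One plausible set of "protected" GP function definitions; the fallback results are
// illustrative choices, not necessarily those used in the thesis implementation.
public final class ProtectedOps {
    private ProtectedOps() {}

    // Division that never throws: falls back to 1.0 when the divisor is zero.
    public static double div(double a, double b) {
        return b == 0.0 ? 1.0 : a / b;
    }

    // Natural logarithm guarded against non-positive arguments.
    public static double log(double a) {
        return a <= 0.0 ? 0.0 : Math.log(a);
    }
}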

Each GP individual is represented as a tree, in which inner nodes are selected from the function collection and leaf nodes are selected from the terminal collection. Figure 4-1

shows an example of a GP individual whose mathematical formula is \( (x + y) / (x^2 + y^2) \).

Figure 4-1 An example of the tree representation of GP individuals
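A possible Java representation of such trees is sketched below; terminal values are assumed to be looked up from a map of per-document and per-query statistics named after the entries in Table 4-1, which is my own simplification rather than the thesis's implementation.

import java.util.Map;

// Sketch of the tree representation of a GP individual. Terminals are looked up in a map
// of statistics (names as in Table 4-1); inner nodes apply a function to their sub-trees.
public abstract class GpNode {
    public abstract double eval(Map<String, Double> stats);

    // Leaf node: one of the terminals listed in Table 4-1, e.g. "OVERLAP_AREA".
    public static class Terminal extends GpNode {
        private final String name;
        public Terminal(String name) { this.name = name; }
        public double eval(Map<String, Double> stats) {
            Double v = stats.get(name);
            return v == null ? 0.0 : v;
        }
    }

    // Inner node: a binary arithmetic function applied to two sub-trees.
    public static class Binary extends GpNode {
        private final char op;
        private final GpNode left, right;
        public Binary(char op, GpNode left, GpNode right) { this.op = op; this.left = left; this.right = right; }
        public double eval(Map<String, Double> stats) {
            double a = left.eval(stats), b = right.eval(stats);
            switch (op) {
                case '+': return a + b;
                case '-': return a - b;
                case '*': return a * b;
                case '/': return b == 0.0 ? 1.0 : a / b;   // protected division, as sketched above
                default:  throw new IllegalArgumentException("unknown operator: " + op);
            }
        }
    }

    // Unary node for the (protected) natural logarithm function.
    public static class Log extends GpNode {
        private final GpNode child;
        public Log(GpNode child) { this.child = child; }
        public double eval(Map<String, Double> stats) {
            double v = child.eval(stats);
            return v <= 0.0 ? 0.0 : Math.log(v);
        }
    }
}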

4.2.3 Fitness functions

Fitness functions reflect how good each individual is. A well-designed fitness function can help to reduce learning time and to produce better solutions. The effectiveness of GP algorithms is sensitive to the fitness function used, and the careful design and selection of fitness functions has a beneficial impact on GP performance. Four fitness functions are considered in the proposed method:

Fitness1. The first fitness function utilises the harmonic mean of standard precision and recall measures.

\[ \mathit{Fitness1} = \frac{1}{Q} \times \sum_{i=1}^{Q} \frac{2}{1/P_i + 1/R_i} \qquad \text{(Equation 4-1)} \]
where Pi and Ri are the precision and the recall value of the ith query respectively, and Q is the total number of queries. This function assigns a high fitness value to a solution when both recall and precision are high. The advantage of this function is its mathematical simplicity. However, the main disadvantage of this function is that the ranking order of retrieved results is not taken into account.

Fitness2. The second fitness function returns the arithmetic mean of the precision values at 50% recall for all queries as results.

\[ \mathit{Fitness2} = \frac{1}{Q} \times \sum_{i=1}^{Q} P_{i,50} \qquad \text{(Equation 4-2)} \]
where Q is the total number of queries and Pi,50 is the precision value at the 50% recall level of the ith query. The advantage of this function is that it takes into account the order of the retrieved results, and higher fitness values are assigned to solutions that retrieve relevant documents quickly. The reason for using 50% recall in Fitness2 is that precision measured at

this level would indicate how many non-relevant documents a user would have to read in order to find half the relevant ones.

Fitness3. The third fitness function utilises the idea of average precision at seen relevant documents.

\[ \mathit{Fitness3} = \frac{1}{Q} \times \sum_{i=1}^{Q} \frac{1}{D_i} \times \sum_{j=1}^{D_i} \left( r(d_j) \times \frac{1}{j} \sum_{k=1}^{j} r(d_k) \right) \qquad \text{(Equation 4-3)} \]

where Q is the total number of queries, Di is the total number of retrieved documents for the ith query, and r(dk) is a function that returns 1 if the document dk is relevant to the ith query and returns 0 otherwise.

Fitness4. The last fitness function is of my own design, which utilises the weighted sum of precision values on the 11 standard recall levels, from 0 to 100% in steps of 10%.

\[ \mathit{Fitness4} = \frac{1}{Q} \times \sum_{i=1}^{Q} \sum_{j=0}^{10} \frac{P_{i,j}}{(j+1)^m} \qquad \text{(Equation 4-4)} \]
where Q is the total number of queries, Pi,j is the precision value at the jth of the 11 standard recall levels for the ith query, and m is a positive scaling factor determined from experiments. This fitness function prefers systems that have high precision values at low recall levels. Fitness4 is believed to be a good choice because precision values at all standard recall levels contribute to the final result, and the scaling factor m can be tuned for different document collections and query sets.

In all these fitness functions, a higher fitness value indicates a better individual.
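As an illustration of how such fitness values can be computed from ranked results, the following Java sketch implements Fitness2 (mean interpolated precision at the 50% recall level); the input representation, a relevance flag per retrieved document in rank order plus the total number of relevant documents per query, is an assumption made for this sketch.

import java.util.List;

// Sketch of Fitness2: the mean interpolated precision at the 50% recall level over all queries.
public class FitnessFunctions {

    // relevanceByQuery: for each query, the relevance flag of each retrieved document in rank order.
    // totalRelevant: for each query, the total number of relevant documents in the collection.
    public static double fitness2(List<boolean[]> relevanceByQuery, int[] totalRelevant) {
        double sum = 0.0;
        for (int i = 0; i < relevanceByQuery.size(); i++) {
            sum += interpolatedPrecision(relevanceByQuery.get(i), totalRelevant[i], 0.5);
        }
        return sum / relevanceByQuery.size();
    }

    // Interpolated precision at a recall level: the maximum precision observed at any rank
    // whose recall is at least the requested level.
    static double interpolatedPrecision(boolean[] ranked, int totalRelevant, double recallLevel) {
        double best = 0.0;
        int relevantSeen = 0;
        for (int k = 0; k < ranked.length; k++) {
            if (ranked[k]) relevantSeen++;
            double precision = relevantSeen / (double) (k + 1);
            double recall = totalRelevant == 0 ? 0.0 : relevantSeen / (double) totalRelevant;
            if (recall >= recallLevel && precision > best) best = precision;
        }
        return best;
    }
}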

4.2.4 Genetic operators

The evolution procedure in GP can be described as a repeated procedure that creates a new generation by applying various genetic operators to the previous generation. Three genetic operators are used in the proposed method to create new generations: creation, crossover and reproduction.

The creation operator creates a new individual via a random selection procedure, in which each terminal and function has the same probability of being selected. A node is first selected and added as the root node of the tree that represents the solution. If the selected node is a terminal, the procedure is finished. If the selected node is a function, one or more sub-trees need to be created. The number of sub-trees a function node has is decided by the function definition. For example, the addition, subtraction, multiplication and division functions have two sub-trees, because they need two input parameters; the natural logarithm function has one sub-tree, because only one input parameter is required. Sub-trees are recursively created using the same method until no more sub-trees are required. The individuals produced by the creation operator and added to the new generation increase the diversity of the population.

The crossover operator first selects two individuals as parents, and randomly selects a sub-tree from each parent. Then two new individuals are generated by swapping the two sub-trees of each parent. The crossover operator is one of the most widely used genetic operators in GP, because it creates new individuals faster than other operators, and it

increases the diversity of individuals in the new generation. Figure 4-2 illustrates how the crossover operator works.

Figure 4-2 The crossover operation

The reproduction operator selects one individual and copies it into the new generation directly without any modification. This operator does not improve the diversity of the new generation, but it helps to keep good individuals during the evolution.
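The following Java sketch illustrates subtree crossover on a simple mutable tree representation (kept separate from the evaluation sketch above for brevity); the node class and the uniform random subtree selection are illustrative choices, not the thesis's implementation.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of subtree crossover: copy two parents, pick a random subtree in each copy,
// and swap the two subtrees in place.
public class Crossover {

    public static class Node {
        String label;                        // terminal name or function symbol
        final List<Node> children = new ArrayList<>();
        Node(String label) { this.label = label; }

        Node copy() {
            Node n = new Node(label);
            for (Node c : children) n.children.add(c.copy());
            return n;
        }
    }

    public static Node[] crossover(Node parentA, Node parentB, Random rnd) {
        Node childA = parentA.copy(), childB = parentB.copy();
        Node subA = randomNode(childA, rnd), subB = randomNode(childB, rnd);
        // Swap the contents (label and children) of the two selected subtrees.
        String label = subA.label;
        List<Node> kids = new ArrayList<>(subA.children);
        subA.label = subB.label;
        subA.children.clear();
        subA.children.addAll(subB.children);
        subB.label = label;
        subB.children.clear();
        subB.children.addAll(kids);
        return new Node[] { childA, childB };
    }

    // Uniformly picks one node from the tree.
    static Node randomNode(Node root, Random rnd) {
        List<Node> all = new ArrayList<>();
        collect(root, all);
        return all.get(rnd.nextInt(all.size()));
    }

    static void collect(Node n, List<Node> out) {
        out.add(n);
        for (Node c : n.children) collect(c, out);
    }
}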

4.3 Experiments

A set of experiments was conducted to evaluate the effectiveness of learning ranking functions using GP for GIR. Constants used in the experiments include the maximum number of generations that was set to 50, and the maximum population size that was set

to 100. These values were determined from experiments based on suggested value ranges given by Jong (1975).

Because of the randomness of the GP evolution, each experiment was run three times and the best result of the three runs is selected as the final result. The experiments were implemented using JAVA programming language, and were run on a 2.8 GHz processor

Linux machine with 512MB memory.

4.3.1 Data

The document collection used in the experiments consists of 4000 geographic metadata records collected from the New South Wales Natural Resources Data Directory, a state government node of Australian Spatial Data Directory (ASDD). The original document collection was split into two subsets, training data (75%, 3000 documents) and test data

(25%, 1000 documents). One hundred queries were generated by random selection from a large pool for both training and testing. Each query consists of three fields: keyword, place name and the minimum-bounding rectangle (MBR) of the place. Binary relevance judgments (i.e. whether a document is relevant to a query or not) were made by human judges for each query. Figure 4-3 and Figure 4-4 show an example document and an example query used in the experiments.


Figure 4-3 An example document used in the experiments

Figure 4-4 An example query used in the experiments

4.3.2 Baselines for retrieval performance evaluation

Three existing ranking methods were used as baselines for comparison of the experimental results and for evidence of the effectiveness of the ranking function learned by GP.

1) Jakarta Lucene Search Engine (Lucene): The Apache Jakarta Lucene (c.f. http://lucene.apache.org) is a high performance full text search engine, in which the

ranking function is based on the Vector Space Model (VSM) and the Term Frequency-Inverse Document Frequency (TF-IDF) scheme.

2) Linear combination (LC): This ranking function assigns a ranking score to each retrieved document by linearly combining the thematic similarity and the geographic similarity measures between a document and a query. The ranking function is defined as the following:

\[ \mathit{score}(d,q) = \lambda \times \frac{2 \times \mathit{overlap}(d,q)}{\mathit{area}(d) + \mathit{area}(q)} + (1-\lambda) \times \mathit{LuceneScore}(d,q) \qquad \text{(Equation 4-5)} \]

Given a document d and a query q, overlap(d, q) returns the overlap area between d and q, area(d) and area (q) return the area of d and q respectively. λ is the weighting value, and the LuceneScore(d,q) function returns the Lucene ranking score between d and q.

The training data set, which consists of 75% (3000 documents) of the ASDD document collection, was used to find the best λ value. Each possible value from 0 to 1 was checked with an increment step of 0.05, and finally λ = 0.90 was selected for the experiments as it achieved the best performance.
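A small Java sketch of Equation 4-5 and of the λ grid search is given below; the evaluate callback, which would measure retrieval performance on the training set for a candidate λ, is a hypothetical stand-in.

import java.util.function.DoubleUnaryOperator;

// Sketch of the linear-combination score of Equation 4-5 and of the grid search over λ
// (0 to 1 in steps of 0.05). Illustrative only.
public class LinearCombinationRanking {

    // Equation 4-5: λ · 2·overlap(d,q)/(area(d)+area(q)) + (1−λ) · LuceneScore(d,q)
    public static double score(double lambda, double overlap, double areaDoc, double areaQuery,
                               double luceneScore) {
        double geo = (areaDoc + areaQuery) == 0.0 ? 0.0 : 2.0 * overlap / (areaDoc + areaQuery);
        return lambda * geo + (1.0 - lambda) * luceneScore;
    }

    // Grid search for the λ that maximises some retrieval-performance measure on training data.
    public static double tuneLambda(DoubleUnaryOperator evaluate) {
        double bestLambda = 0.0, bestQuality = Double.NEGATIVE_INFINITY;
        for (double lambda = 0.0; lambda <= 1.0 + 1e-9; lambda += 0.05) {
            double quality = evaluate.applyAsDouble(lambda);
            if (quality > bestQuality) {
                bestQuality = quality;
                bestLambda = lambda;
            }
        }
        return bestLambda;
    }
}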

3) Logistic regression (LR): This algorithm is developed based on the probabilistic IR model, in which the ranking score for a document to a query is modelled as

\[ \mathit{score}(d,q) = -5.68999 + 22.3253 \times \mathit{LuceneScore}(d,q) + 2.4842 \times \frac{\mathit{overlap}(d,q)}{\mathit{area}(q)} + 1.9533 \times \frac{\mathit{overlap}(d,q)}{\mathit{area}(d)} \qquad \text{(Equation 4-6)} \]

where the coefficients were selected using regression analysis on the same training data set as that used in the LC method.

4.3.3 Experiment 1: GP evolution control parameters

Before the running of the GP algorithm, it is necessary to find the appropriate values of the probability of each GP operator (i.e. pc for creation, po for crossover and pr for reproduction). The first experiment conducted was a parameter tuning procedure to find the optimal combination of pc, po and pr values. The fitness function used in this experiment is the Fitness2 discussed above, which calculates the fitness value as precision value at 50% recall. Four different configurations were tested and compared.

Table 4-2 illustrates the pc, po and pr values of each configuration, as well as the best fitness values achieved with these configurations, which indicate that the third configuration (i.e. pc = 30%, po = 50% and pr = 20%) outperforms the others. This configuration was used in the second experiment and remained fixed during the runs.

Table 4-2 Results of the best fitness value under four different combinations of pc, po and pr

     pc (%)   po (%)   pr (%)   Best Fitness Value
#1   10       90       0        0.766
#2   20       70       10       0.745
#3   30       50       20       0.780
#4   50       30       20       0.769

4.3.4 Experiment 2: ranking function learning

This experiment used the proposed GP algorithm to learn ranking functions using the evolution parameters selected in Experiment 1. The mean interpolated precision values at the 11 standard recall levels over all 100 queries were used as the evaluation metric. The ranking functions learned using the fitness functions Fitness1, Fitness2, Fitness3 and Fitness4 are denoted f1, f2, f3 and f4 respectively. For Fitness4, the scaling factor m was set to 10. The retrieval performance comparison for the three baseline methods and the ranking functions learned using the proposed method is listed in Table

4-3. The recall-precision curves are also shown in Figure 4-5.


As can be seen from Table 4-3 and Figure 4-5, the best of the three baselines was the LR method. Three out of the four GP ranking functions outperformed LR, and the performance improvement in terms of Mean Average Precision (MAP) ranged from

143.33% to 196.28%. The greatest improvement was obtained with the ranking function f4 learned using Fitness4. The only GP ranking function that did not show an improvement was f1, learned using Fitness1. One possible reason for this may be that this fitness function does not take into account the order of the retrieved documents. This confirms the hypothesis that the effectiveness of GP algorithms is sensitive to the fitness function used, and that the careful design and selection of fitness functions has a beneficial impact on GP performance.

Table 4-3 Comparison of the results (mean interpolated precision values at the standard 11 recall levels) obtained using four fitness functions with three baselines

Recall   Lucene   LC       LR       f1       f2       f3       f4
0.0      0.5647   0.5343   0.5553   0.5017   0.9272   0.7314   0.9323
0.1      0.4302   0.4210   0.4692   0.3788   0.8989   0.7259   0.9130
0.2      0.2913   0.3115   0.3715   0.2771   0.8748   0.7135   0.8755
0.3      0.2269   0.2694   0.3320   0.2531   0.8396   0.7006   0.8513
0.4      0.1675   0.2033   0.2653   0.2444   0.8185   0.6972   0.8348
0.5      0.1471   0.1703   0.2392   0.2353   0.8038   0.6945   0.8239
0.6      0.1353   0.1514   0.2305   0.2206   0.7859   0.6891   0.8066
0.7      0.1297   0.1393   0.2149   0.2020   0.7551   0.6847   0.7709
0.8      0.1234   0.1273   0.2025   0.1962   0.7345   0.6834   0.7486
0.9      0.1124   0.1124   0.1809   0.1865   0.7041   0.6793   0.7176
1.0      0.1106   0.1110   0.1758   0.1830   0.6980   0.6779   0.6981
MAP      0.2014   0.2114   0.2691   0.2365   0.7837   0.6548   0.7973


Figure 4-5 Precision values at the standard 11 recall levels

Figure 4-6 illustrates the tree representation of the ranking function that achieved the best result. Terminals in the ranking function include both statistics derived from textual content (e.g. RAWFREQ_KEYWORD) and statistics derived from geographic properties (e.g. OVERLAP_AREA). It is important to note that because of the random nature inherent in GP, the mathematical interpretation of the ranking function is usually incomprehensible to human logic and understanding.

Figure 4-6 The tree representation of the ranking function learned using Fitness4

Another interesting comparison is the computation time of GP learning, since GP learning, like other evolutionary computation technologies, is a time-consuming procedure. Figure 4-7 shows the generation at which the best fitness value was found

for each fitness function. This figure shows that the best fitness value was found after at most 15 generations. Given that the computation time is around 10 minutes per generation in the experimental environment, the runtime performance is acceptable, as the maximum runtime for finding the best fitness value was 150 minutes.

Figure 4-7 The generation number of the best fitness value found for each fitness function

4.3.5 Discussion

Several important observations emerged from this study. Firstly, the experimental results show that ranking functions learned using the proposed GP algorithm can achieve significant improvements in retrieval performance for GIR with acceptable runtime performance. The GP terminals and functions used are very simple and straightforward, but as can be seen from the comparisons with the baseline methods, very promising results have been obtained with appropriate GP implementations (e.g. evolution strategies and fitness functions).

Secondly, both thematic and geographic similarity measures are critical for GIR ranking. Among the seven ranking methods implemented in the experiments, the Lucene search engine is the only one that entirely ignores geographic similarity measures. As can be seen from the experimental results, Lucene's retrieval performance is the worst of all methods. Although conventional IR has proven extremely valuable for many information searching and retrieval problems, it must be complemented by geographic knowledge to produce feasible solutions for GIR applications.

Finally, ranking order must be taken into account when designing GP fitness functions. Previous work has shown that order-based fitness functions can achieve higher performance in various IR tasks. The experiments confirmed this point by quantifying the performance of fitness functions that do and do not use ranking order information.

These observations are important not only for effectively designing GIR ranking mechanisms using GP and other evolutionary computation techniques, but also for understanding the fundamental difference between textual IR systems and GIR systems in how documents are modelled, searched and ranked.

4.4 Conclusions

Ranking of retrieved results plays a very important role in all IR applications. In contrast with conventional IR, GIR ranking must take into account both thematic and geographic similarity measures. This chapter described a GIR ranking function learning method based on Genetic Programming. Selected statistics derived from the textual content and geographic features of documents are used in the learning process. Two experiments were conducted to evaluate the proposed method. The first was devoted to the selection of a GP evolution strategy, and the second learned ranking functions using the selected evolution strategy and four different fitness functions. The retrieval performance of the learned ranking functions was compared with that of existing ranking mechanisms. The results showed that the proposed method produced significant improvement.

CHAPTER 5 A SPREADING ACTIVATION NETWORK MODEL FOR GEOGRAPHIC INFORMATION RETRIEVAL

This chapter proposes a new GIR retrieval model based on the Spreading Activation

Network. The fundamental components of the model include a two-layer associative network that is used to construct a structured search space; a constrained spreading activation algorithm that is used to retrieve and to rank relevant documents; and a geographic knowledge base that is used to provide necessary domain knowledge for the model.

5.1 Introduction

Although many studies have been conducted in Geographic Information Retrieval

(GIR), the retrieval of information based on not only textual context but also geographic context continues to be a very challenging research problem in Information Retrieval

(IR) and Natural Language Processing (NLP) applications. One of the most fundamental problems is the integration of both textual and geographic context into a single and unified retrieval model.

When geographic context is taken into account in retrieval, the keyword-based models developed in classic textual IR systems are not sufficient to capture the complexity of geographic information. In particular, a mechanism is required to define, discover and utilise various geographic relationships between documents and user queries. As a simple example, if a user asks for news stories about "New South Wales", a document that mentions "Sydney" should be retrieved even if the term "New South Wales" is absent from the document, given that Sydney is the cultural, financial and media centre of New South Wales. Thus there is a clear need to build a unified retrieval model for GIR because of its strong potential for geography-related information processing and management applications.

This chapter proposes a new GIR retrieval model based on the Spreading Activation

Network. The fundamental components of the model include: 1) an associative network that consists of two layers: a textual layer and a geographic layer. This network is used to construct a structured search space in which documents are represented as nodes, with weighted links between those documents that contain the same keywords in the textual layer and those documents that contain related geographic entities in the geographic layer; 2) a constrained spreading activation algorithm that is used to retrieve relevant documents and to assign each of them a ranking score. The activation process starts by selecting a set of document nodes that are explicitly related to user queries and then activating those document nodes that are implicitly related; and 3) a geographic knowledge base that is used to provide necessary domain knowledge for the construction of links and the calculation of weighting scores in the geographic layer.

This chapter is organised as follows. Section 5.2 provides background information related to the topic. Section 5.3 describes the proposed methodology. Section 5.4 concludes the chapter.

5.2 Spreading Activation Network

The Spreading Activation Network (SAN) model has its origins in the psychological principles of human memory retrieval processes (Crestani 1997; Rumelhart & Norman 1983). With different configurations and extensions, SAN has been widely used in the IR field (Belew 1989; Lee 1999; Gelgi et al. 2005). However, to the author's knowledge this is the first time that the SAN model has been used in GIR.

The SAN model consists of two main components: a network of interconnected nodes and an iterative information retrieval procedure. In a SAN network, nodes represent semantic features of documents and user queries, while weighted links represent associations between individual documents and between documents and queries. The weighting scheme used in the SAN network can be based simply on the statistics of terms in the document collection (Giuliano & Jones 1962), or on semantic similarity relations (Cohen & Kjeldsen 1987).

The information retrieval procedure in the SAN model starts with choosing a set of originating nodes, each of which represents a document that is related to the given query. At the same time, an activation level is assigned to each of these nodes. From these selected nodes, and taking into account the link directions and weighting values, nodes that link to the originating nodes are activated. This procedure is iterated until some termination criteria are met. Documents corresponding to nodes that have a positive activation level are then retrieved as results.

5.3 Methodology

This section describes the proposed approach to modelling GIR based on SAN, beginning with an overview of the model and then discussing specific aspects, including the spreading of activation, network constraints, retrieval and ranking, and the construction of the network.

5.3.1 An overview of the Spreading Activation Network for GIR

The proposed Spreading Activation Network for GIR (G-SAN) is a two-level SAN model that consists of a textual level and a geographic level. Two different types of nodes exist in the network: T-nodes that are nodes in the textual level, and G-nodes that are nodes in the geographic level.

T-nodes are used to represent textual features of documents. Each T-node in G-SAN corresponds to a document. The content of a T-node is the text content of a document.

The total number of document T-nodes in a G-SAN network is the number of documents in the collection.

G-nodes are used to represent geographic features of documents. Each G-node in

G-SAN is labelled with a distinct geographic named entity, and is connected to those documents from which the geographic named entity can be recognised. The total number of G-nodes that connect to a document is determined by the number of distinct geographic named entities discovered from the document. As each document in the

G-SAN is represented as a T-node, each G-node connects to one or more T-nodes.

Based on this structure, documents that have common geographic features are connected to each other by shared G-nodes.

User queries in G-SAN are modeled using the same approach. Each user query is modeled as a T-node that represents its text content and a set of G-nodes that represents its geographical features. Furthermore, the T-node and G-nodes of a given user query can be used to identify documents that contain keywords and geographic entities specified in the query.

G-SAN is designed with an emphasis on geographic features and geographic relationships. In particular, each G-node is mapped to a geographic entity that is represented using a political administrative hierarchical structure. In addition to links between G-nodes and documents, three types of links are used to describe relationships between two G-nodes: identical, part-of and adjacent. An identical relationship holds between two G-nodes that correspond to the same political administrative entity on the Earth; for example, United States is identical to U.S.A. A part-of relationship holds between two G-nodes where one contains the other; for example, Australia/New South Wales/Sydney is part-of Australia/New South Wales. An adjacent relationship holds between two G-nodes that have adjacent boundaries; for example, United States is adjacent to Canada.

The identical and adjacent links are undirected: when a G-node g1 is identical or adjacent to another G-node g2, then g2 is identical or adjacent to g1 as well. In contrast, part-of links are directed from the G-node at the lower hierarchical level to the G-node at the higher hierarchical level.
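To make the link structure of the geographic layer concrete, the following Java sketch models G-nodes and the three relationship types, including the directedness rules just described. It is an illustrative data structure only; the class and method names are assumptions and are not taken from the actual implementation.

import java.util.ArrayList;
import java.util.List;

/** A node in the geographic layer, labelled with a distinct geographic named entity. */
class GNode {
    final String name;                                     // e.g. "Australia/New South Wales/Sydney"
    final List<GNode> identical = new ArrayList<>();
    final List<GNode> adjacent  = new ArrayList<>();
    final List<GNode> partOf    = new ArrayList<>();       // directed: lower level -> higher level

    GNode(String name) { this.name = name; }

    /** Identical links are undirected, so both nodes record the relationship. */
    void addIdentical(GNode other) { identical.add(other); other.identical.add(this); }

    /** Adjacent links are undirected as well. */
    void addAdjacent(GNode other)  { adjacent.add(other);  other.adjacent.add(this); }

    /** Part-of links point from the lower hierarchical level to the higher one. */
    void addPartOf(GNode parent)   { partOf.add(parent); }
}

class GeoLayerExample {
    public static void main(String[] args) {
        GNode sydney = new GNode("Australia/New South Wales/Sydney");
        GNode nsw    = new GNode("Australia/New South Wales");
        GNode us     = new GNode("United States");
        GNode usa    = new GNode("U.S.A.");
        GNode canada = new GNode("Canada");

        sydney.addPartOf(nsw);     // Sydney is part-of New South Wales
        us.addIdentical(usa);      // United States is identical to U.S.A.
        us.addAdjacent(canada);    // United States is adjacent to Canada
    }
}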

A simple example of the G-SAN network is shown in Figure 5-1, in which user queries, T-nodes and G-nodes are denoted as grey boxes, white boxes and white circles, respectively. Links between T-nodes and G-nodes are denoted as solid lines. The three relationships (identical, part-of and adjacent) are denoted as labelled dashed lines.

Figure 5-1 The G-SAN spreading activation network for GIR

5.3.2 Spreading of activation

The activation spreading workflow in the G-SAN network consists of three phases: initialisation, spreading and termination.

During the initialisation phase, a set of T-nodes and G-nodes that are directly related to the user query are activated. The activation procedure is straightforward: first, the T-nodes that contain one or more keywords from the user query are activated; then the G-nodes that are labelled with geographic named entities appearing in the user query are activated.

After the initialisation phase, the activation spreads to other documents during the spreading phase, which is an iterative procedure. In each spreading iteration, activation spreads from activated nodes to the nodes connected with them. As mentioned before, G-SAN puts emphasis on geographic relationships; therefore, the spreading of activation only happens between G-nodes. The path of spreading is decided at runtime based on the type of geographic relationship between G-nodes and the geographic criteria of the user query. An inactivated G-node is tagged as activated when activation is spread from activated G-nodes that are connected to it. The spreading is terminated at a newly activated G-node if it is not connected to any inactivated G-nodes, or if the geographic relationships between it and those inactivated G-nodes do not allow the spreading. Otherwise, activation spreading starts from the newly activated G-nodes in the next iteration.

The termination check is executed at the end of each spreading iteration to decide whether the spreading of activation should be terminated. Termination can happen for several reasons, for example: no further G-nodes can be activated; a user-defined maximum iteration number is reached; or users issue a termination request.
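The three phases can be summarised in code. The following Java sketch is a simplified, hypothetical rendering of the workflow; the node class, the path-constraint check and the termination settings are placeholders rather than the thesis implementation.

import java.util.*;

/** A minimal sketch of the G-SAN activation spreading workflow (hypothetical names). */
class SpreadingActivationSketch {

    /** A G-node with links to related G-nodes in the geographic layer. */
    static class GNode {
        final String name;
        final List<GNode> links = new ArrayList<>();
        GNode(String name) { this.name = name; }
    }

    /** Placeholder path constraint: does the query's geographic criterion allow this link? */
    static boolean allowsSpreading(GNode from, GNode to, String geoCriterion) {
        return true;   // e.g. follow only part-of links for an "in" criterion
    }

    /** Initialisation, iterative spreading and termination, as described above. */
    static Set<GNode> spread(Set<GNode> initiallyActivated, String geoCriterion, int maxIterations) {
        Set<GNode> activated = new HashSet<>(initiallyActivated);
        Set<GNode> frontier  = new HashSet<>(initiallyActivated);

        for (int i = 0; i < maxIterations && !frontier.isEmpty(); i++) {
            Set<GNode> newlyActivated = new HashSet<>();
            for (GNode g : frontier) {
                for (GNode neighbour : g.links) {
                    if (!activated.contains(neighbour)
                            && allowsSpreading(g, neighbour, geoCriterion)) {
                        newlyActivated.add(neighbour);   // spreading happens only between G-nodes
                    }
                }
            }
            activated.addAll(newlyActivated);
            frontier = newlyActivated;   // the next iteration starts from newly activated G-nodes
        }
        return activated;                // terminated: no new G-nodes or iteration limit reached
    }
}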

5.3.3 Network constraints

One of the major challenges in the implementation of a SAN-based model is the computation cost incurred during the activation spreading procedure discussed in the previous section. As with other SAN models, the G-SAN model applies a number of constraints to control the activation spreading procedure in order to improve computation performance.

Constraints used in G-SAN can be categorised as path constraints, which are defined as a set of application-dependent rules that control the activation spreading path. More specifically, the constraints included in G-SAN are based on the geographic relationship types supported by the system. Based on the geographic relationships described above, three rules are defined to support the "in", "adjacent" and "is a" conditions defined in GIR queries.

5.3.4 Retrieval and ranking

Once the activation spreading procedure is finished, the retrieval and ranking procedure is started in order to produce a list of documents ordered by their relevance to the user query.

Each document that connects to activated T-nodes and, at the same time, activated G-nodes is assigned a ranking score. After ranking scores are calculated, documents with positive scores are retrieved and sorted in descending order of their ranking scores. The top N documents are then returned to users as the search result, where N can be specified by the system or by users.

5.3.5 Construction of a G-SAN network

Given a document collection and a set of user queries, four steps are involved to build a

G-SAN network: 1) collecting and organising geographic knowledge, 2) indexing documents, 3) recognising geographic named entities and 4) identifying and establishing geographic relationships.

A comprehensive geographic knowledge base is an essential element of a G-SAN network, as it provides the necessary geographic knowledge for all geographic processing procedures. In practice, G-SAN integrates multiple resources, including public gazetteer lists and geographic ontologies, into a single schema.


The goal of geographic named entity recognition is to identify all geographic named entities in a given document. This procedure is carried out using gazetteer lists derived from the geographic knowledge base. In addition, a grounding module is used in this procedure to map a recognised geographic named entity to a geographic object; various types of ambiguity associated with geographic names (e.g. distinct places may use the same name, and place names may be confused with people's names) are resolved in this step.

The last step of building a G-SAN network is to identify and establish relationships between geographic named entities. Supported by the domain knowledge provided by the geographic knowledge base, the three types of geographic relationships used in G-SAN (i.e. identical, part-of and adjacent) are discovered, and related G-nodes are connected. This step can be run in a "once-for-all" manner to build the whole network or, in order to reduce computation costs, in an "on-demand" manner, in which only a part of the network is involved.

5.4 Conclusions

This chapter proposed a new GIR retrieval model based on the Spreading Activation Network, called G-SAN. The fundamental components of the model were discussed in detail first, and then the retrieval process and the construction of a G-SAN network were presented. Further details of the implementation and evaluation of an experimental system based on G-SAN are described in Chapter 7.

CHAPTER 6 GEOTAGMAPPER: AN ONLINE MAP-BASED GEOGRAPHIC INFORMATION RETRIEVAL SYSTEM FOR GEO-TAGGED WEB CONTENT

This chapter addresses several key challenges in designing and implementing retrieval systems for geo-tagged Web content by developing an online map-based Geographic

Information Retrieval (GIR) system called GeoTagMapper. The components of the proposed system are described in detail and the effectiveness and the usefulness of the system are demonstrated by applying it to a large collection of geo-tagged webpages and digital photos.

6.1 Introduction

Geo-tagged Web content is a term that describes World Wide Web (WWW) content to which small pieces of information (i.e. geo-tags) are attached to describe its geographic context. With the emergence of many Web 2.0 software tools (e.g. flickr.com and del.icio.us), Web users are able to geo-tag online information such as webpages, digital images, audio and video files while they are browsing. Moreover, with advanced hardware support such as digital cameras integrated with Global Positioning System (GPS) receivers, geo-tags can be created automatically.

With the growth of geo-tagged Web content, efficient management and utilisation of these data become more and more important. These requirements present new opportunities for Geographic Information Retrieval (GIR) systems.

In a retrieval system, a sophisticated user interface for visualisation and interaction is necessary when a large number of documents are retrieved as search results. As general requirements for a visualisation tool, three basic functionalities must be implemented, namely overview, zoom and filter, and details-on-demand (Shneiderman 1996). In many current textual information retrieval systems, these functionalities are implemented by separating large result sets into pages according to their degree of relevance. Such a simple approach is not suitable for the problem under investigation. How massive amounts of geographic information can be organised and presented on online maps in a satisfactory way for information retrieval tasks is still an open question.

This chapter addresses these issues by developing the GeoTagMapper system, an online map-based GIR system for geo-tagged Web content. The proposed system consists of four modules: 1) a geo-tag parsing module that identifies geo-tags in Web content and converts them to a uniform format based on the GeoRSS specification (Reed 2006); 2) an integrated geo-textual indexing module that is composed of a textual indexing scheme based on inverted files (Araújo, Navarro & Ziviani 1997) and a geographic indexing scheme based on the Generalized Search Tree (GiST) (Hellerstein, Naughton & Pfeffer 1995); 3) a Boolean retrieval model; and 4) a user interface that exploits multi-resolution presentation techniques with online map services.

The effectiveness and the usefulness of the system are demonstrated by applying it to a large document collection that consists of geo-tagged webpages collected from the GeoURL Server (c.f. http://geourl.org/) and Wikipedia (c.f. http://en.wikipedia.org), and geo-tagged digital photos collected from flickr.com.

Geo-tagged Web content is very popular in Australian Web communities. A recent search on flickr.com using the tag "Australia" returned more than 900,000 photos, and the tag "Sydney" returned more than 450,000 photos. Such common interest reflects the fact that all human activities are associated with geographic context. By developing and leveraging the GeoTagMapper system, it is hoped that the system will provide useful entry points into the searching, filtering and navigation of geo-tagged Web content. At the same time, from an international perspective, it is anticipated that the issues discussed in this chapter will provide insight into the fundamental ways we model and utilise geographic information.

This chapter is organised as follows. Section 6.2 gives some necessary background knowledge. Section 6.3 discusses the design and implementation of the GeoTagMapper system. Section 6.4 describes a usage case study of the system. Section 6.5 concludes the chapter and gives directions for future work.

6.2 Background

This section reviews some background knowledge related to this work, including tagging and geo-tagging, and the GeoRSS specification.

6.2.1 Tagging and geo-tagging

Tagging is a process in which users label content with self-selected tags, which introduces a lightweight and efficient way of describing and categorising Web content (Golder & Huberman 2006). The selection of tags to be attached to Web content is arbitrary, so various vocabularies and syntaxes can be used to meet the needs of individual users.

Geo-tagging refers to conventions that are used to encode location information within a single tag. Tags that adopt geo-tagging conventions are called geo-tags. Geo-tags are one of the most common online tag types (Guy & Tonkin 2006).

Figure 6-1 shows a geo-tagged digital photo published on the flickr.com photo sharing system, in which geographic information is described using three tags: geotagged, geo:lat and geo:lon.

Figure 6-1 An example of geo-tagged digital photos on flickr.com with geo-tags geo:lat=-37.798993 and geo:lon=144.96049

6.2.2 GeoRSS

The Geographically Encoded Objects for RSS feeds (GeoRSS) is a proposal developed at www.georss.org for geo-tagging Web content with location information. The GeoRSS information model consists of geometry objects (e.g., point, line, box and polygon), their attributes (e.g., relationship, feature type, elevation and radius) and their associations with external information entities. The World Geodetic System 1984

(WGS84) is used in GeoRSS as a fixed global reference frame for all geographic coordinates.


GeoRSS specifies two different encoding models for information representation:

GeoRSS Simple and GeoRSS GML. The GeoRSS Simple model uses concise semantics and the GeoRSS GML model uses the Geography Markup Language (GML) developed by the Open Geospatial Consortium Inc (OGC).
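As an illustration of the Simple encoding, the short Java sketch below formats a latitude/longitude pair as a GeoRSS Simple point element ("latitude longitude" in WGS84). The helper class and method are assumptions for illustration and are not part of any GeoRSS reference implementation.

/** Minimal sketch: format a latitude/longitude pair as a GeoRSS Simple point element. */
class GeoRssSimpleExample {

    static String toGeoRssPoint(double lat, double lon) {
        // GeoRSS Simple encodes a point as "lat lon" (WGS84) inside a georss:point element
        return String.format(java.util.Locale.ROOT,
                "<georss:point>%f %f</georss:point>", lat, lon);
    }

    public static void main(String[] args) {
        // e.g. the flickr photo from Figure 6-1 (geo:lat=-37.798993, geo:lon=144.96049)
        System.out.println(toGeoRssPoint(-37.798993, 144.96049));
    }
}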

6.3 GeoTagMapper

This section describes the specific approaches used in the development of the GeoTagMapper system. The following subsections discuss the overall system architecture and the details of the design and implementation of each module.

6.3.1 System overview

Figure 6-2 shows the overall architecture of the GeoTagMapper system. The system consists of four major components: 1) the geo-tag parsing module, which identifies geo-tags in the collected Web content and converts them to a GeoRSS-based uniform format; 2) the textual-geo indexing module, which creates textual and geographic indices for all documents in the system to support fast search and retrieval; 3) the Boolean retrieval module, which processes thematic and geographic searches issued by users; and 4) an integrated user interface, which exploits geographic visualisation and interaction techniques with online map services.


Figure 6-2 Architecture of the GeoTagMapper system

At a high level, the data and control flow between these components can be described from the following two perspectives.

From the viewpoint of document management, once a document is collected, it is first geo-parsed and then indexed into the integrated textual-geo indexing system. Compared with conventional information retrieval systems, a geographically enabled database that can be used for storing and searching geographic indices is essential.

On the other hand, from the viewpoint of system users, the submission of a query is the starting point. A retrieval model using the Boolean AND operation is adopted in GeoTagMapper to search the textual and geographic indices and retrieve all documents that meet both the thematic criteria and the geographic criteria defined in the user query. A ranking function is then used to calculate a numeric score for each retrieved document based on the geographic and thematic similarity measures between the document and the user query. These ranking scores directly affect how retrieved results are presented.

The GeoTagMapper system is developed as a Web application using the Java language. The PostGIS database, an open source geographic information system, is used as the backend database for geographic indexing and searching. The Lucene search engine, an Apache open source project that provides full-text search functionalities, is used for textual indexing and searching. The GeoTagMapper user interface is implemented using JavaServer Pages (JSP) and Java Servlet technologies together with Google Maps. JSP pages and Java Servlets run on the server side to process user input and generate dynamic responses. The Google Maps APIs are used to display digital maps and aerial photographs.

6.3.2 Geo-tag parsing

The geo-tag parsing module identifies geo-tags in the collected Web content and then converts them to a GeoRSS-based uniform format. Geo-tags may be represented in various formats; the strategies adopted in the GeoTagMapper system to discover geo-tags are discussed below.

HTML parsing: Many webpages written in HyperText Markup Language (HTML) have location information embedded in their META elements. The syntax used for this type of geo-tag is traditionally referred to as the ICBM (originally an acronym for Intercontinental Ballistic Missile) address, in which locations are given as pairs of latitude and longitude coordinates. GeoTagMapper uses an HTML parser to extract geo-tags from HTML META elements.

Regular expression-based text matching: Location information, such as geographic coordinates, can also be found in document content. Regular expressions, a powerful tool for searching and manipulating bodies of text based on certain patterns, are used in GeoTagMapper to extract geo-tags that are embedded in the document content.

XML parsing: Another important representation method for geo-tagged Web content is the Extensible Markup Language (XML), which provides a structured data model for the storage and exchange of information. Many tagging systems use XML to encode the invocations and responses of their services. GeoTagMapper uses the Java XML APIs to extract geo-tags that are encoded using schemas defined by content providers from XML documents.

After geo-tags are extracted from Web content, they are converted to a uniform format. This format includes a Uniform Resource Locator (URL) element that identifies the individual Web content item, and a geographic feature element that describes its geographic location based on the GeoRSS specification.
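To make the parsing strategies concrete, the Java sketch below uses java.util.regex to pull an ICBM-style latitude/longitude pair out of an HTML META element and wraps it, together with the page URL, into a simple uniform record. The META attribute layout matched by the pattern and the record fields are illustrative assumptions, not the exact GeoTagMapper format.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch of geo-tag parsing with regular expressions. */
class GeoTagParserSketch {

    /** A simplified uniform record: a URL plus a GeoRSS-style point (assumed fields). */
    static class GeoTagRecord {
        final String url;
        final double lat;
        final double lon;
        GeoTagRecord(String url, double lat, double lon) {
            this.url = url; this.lat = lat; this.lon = lon;
        }
    }

    // Matches e.g. <meta name="ICBM" content="-33.8688, 151.2093"> (attribute order assumed)
    private static final Pattern ICBM = Pattern.compile(
            "<meta\\s+name=[\"']ICBM[\"']\\s+content=[\"']\\s*(-?\\d+\\.?\\d*)\\s*,\\s*(-?\\d+\\.?\\d*)\\s*[\"']",
            Pattern.CASE_INSENSITIVE);

    static GeoTagRecord parse(String url, String html) {
        Matcher m = ICBM.matcher(html);
        if (m.find()) {
            return new GeoTagRecord(url,
                    Double.parseDouble(m.group(1)),
                    Double.parseDouble(m.group(2)));
        }
        return null;   // no geo-tag found in this document
    }
}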

6.3.3 Textual-geo indexing

After geo-tags are extracted from a document, the textual-geo indexing module creates index entries for the document. A textual-geo index entry consists of two independent index entries, a textual index entry that reflects textual content and a geographic index entry that reflects geographic content of the document.

Textual indexing in GeoTagMapper is done through Lucene APIs. Inverted files are used for storing index entries. Two fields are used in the GeoTagMapper textual indexing scheme: URL and text. The URL field is used as the unique identification of each document and the text field contains the text body of the document.


The geographic index in GeoTagMapper is created using the PostGIS database with its built-in implementation of the R-tree data structure based on the Generalized Search Tree (GiST). The spatial referencing system used for geographic indexing is defined based on the World Geodetic System 1984 (WGS 84) (NIMA 1997). The GeoTagMapper geographic indexing scheme includes two fields, URL and geom, where the URL field is the same as the one in the textual indexing scheme and the geom field is used to store the location data.
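A minimal sketch of how such a geographic index could be set up through JDBC is given below. The table and column names, connection URL and credentials are assumptions; the PostGIS functions and the GiST index syntax shown are standard, although this is a modern sketch rather than the thesis-era setup.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

/** Sketch: create and populate a PostGIS-backed geographic index (assumed table layout). */
class GeoIndexSketch {

    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/geotagmapper", "user", "password")) {

            try (Statement st = conn.createStatement()) {
                // url identifies the document; geom stores its location in WGS 84 (SRID 4326)
                st.execute("CREATE TABLE IF NOT EXISTS geo_index ("
                         + " url TEXT PRIMARY KEY,"
                         + " geom GEOMETRY(Point, 4326))");
                // PostGIS spatial indexing uses a GiST-based R-tree over the geometry column
                st.execute("CREATE INDEX IF NOT EXISTS geo_index_gist ON geo_index USING GIST (geom)");
            }

            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO geo_index (url, geom) VALUES (?, ST_SetSRID(ST_MakePoint(?, ?), 4326))")) {
                ps.setString(1, "http://example.org/page.html");   // hypothetical document URL
                ps.setDouble(2, 144.96049);                        // longitude
                ps.setDouble(3, -37.798993);                       // latitude
                ps.executeUpdate();
            }
        }
    }
}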

6.3.4 Retrieval model

The searching and ranking in a GIR system are controlled by a retrieval model, which defines how the search space is pruned and how the retrieved documents are ranked.

GeoTagMapper uses a Boolean model to retrieve documents that meet both the thematic criteria and the geographic criteria defined in user queries. The searching process can be described as a three-step procedure. Firstly, the textual index is used to retrieve all documents that satisfy the keyword-based thematic criteria; the Lucene search APIs are used to perform textual searching and to retrieve documents that contain the query terms, and the URLs of documents that satisfy the thematic criteria are collected in this step. Secondly, the geographic index is used to retrieve all documents that satisfy the geographic criteria; the URLs of these documents are retrieved from the PostGIS database using spatial queries. Lastly, the answer sets from the two previous steps are combined with the Boolean AND operator: a document is considered relevant only when its URL appears in both answer sets.
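The final intersection step can be expressed directly over the two URL answer sets; the sketch below is illustrative only and the method name is an assumption.

import java.util.HashSet;
import java.util.Set;

/** Sketch of the final step of the Boolean retrieval model: intersect the two answer sets. */
class BooleanAndSketch {

    static Set<String> combine(Set<String> thematicUrls, Set<String> geographicUrls) {
        // a document is relevant only when its URL appears in both answer sets
        Set<String> relevant = new HashSet<>(thematicUrls);
        relevant.retainAll(geographicUrls);
        return relevant;
    }
}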

Once retrieved, each document is assigned a numeric ranking score that reflects the relevance of the document to the query. The ranking method used in GeoTagMapper is a linear combination of two independent ranking functions, one for textual ranking and one for geographic ranking, as shown in the following equation:

RankScore = TextualRankScore + λ × GeographicRankScore, (0 ≤ λ ≤ 1)    Equation 6-1

The default Lucene ranking function is used as the textual ranking function (i.e. TextualRankScore). The geographic ranking score is calculated based on the geographic distance between the geo-tag of the document and the geographic criteria in the query, as shown in the following equation:

GeographicRankScore(d, q) = 1 / (1 + log(distance + 1))    Equation 6-2

where distance is the distance in km between the geo-tag of the document and the centre of the query range along the sphere.

The weight value λ is controlled by users rather than given by the system as a constant. This approach has two advantages: firstly, it does not require a training process beforehand and is easy to implement and computationally efficient; and secondly, users are able to reorder the retrieved results to fulfil their requirements.
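Taken together, Equations 6-1 and 6-2 (as reconstructed above) amount to a few lines of arithmetic. The Java sketch below implements them, using the haversine great-circle distance as one reasonable reading of "distance along the sphere"; the method names are illustrative.

/** Sketch of the GeoTagMapper ranking combination (Equations 6-1 and 6-2, as reconstructed). */
class RankingSketch {

    static final double EARTH_RADIUS_KM = 6371.0;

    /** Great-circle distance in km between two WGS 84 points (haversine formula). */
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }

    /** Equation 6-2: 1 / (1 + log(distance + 1)). */
    static double geographicRankScore(double distanceKm) {
        return 1.0 / (1.0 + Math.log(distanceKm + 1.0));
    }

    /** Equation 6-1: textual score plus a user-controlled geographic weight lambda in [0, 1]. */
    static double rankScore(double textualRankScore, double geographicRankScore, double lambda) {
        return textualRankScore + lambda * geographicRankScore;
    }
}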

6.3.5 Visualisation and interaction

The visualisation and interaction design in GeoTagMapper focuses on two questions that are fundamental in all map-based visualisation systems for large databases: 1) how massive amounts of information can be represented on users' screens; and 2) how the display is influenced by the ranking of retrieved results.

In many situations, a user query tends to retrieve a large number of relevant documents. Because of the limited display space on users' screens, not all retrieved results can be presented at the same time. In conventional information retrieval systems, retrieved documents are sorted by their ranking scores and then organised into pages. Each page consists of detailed information for a certain number of the ranked documents and, at the same time, the total page number and the current page location provide users with a global understanding of the overall result. This approach does not work in geographic visualisation environments for several reasons: firstly, whether the detailed information of a given retrieved document is presented to the user is determined not only by its position in the ranking sequence, but also by the user's current viewpoint; secondly, retrieved documents are clustered based on their corresponding geographic locations instead of their ranking order; and lastly, to gain a global understanding of the whole result set, not only the total number but also the geographic distribution of the retrieved documents must be taken into account.

GeoTagMapper gives users three options for displaying retrieved results: showing all results in one session; showing sampled results with a fixed sample rate; and showing results with varied sample rates.

The first option is the simplest: all retrieved results are shown in the user interface. This method gives users the ability to see the whole result set at once. The drawback is that the computation cost of rendering the display and the amount of data to communicate increase as the result size grows. Therefore, this method only works well for result sets that contain a small number of retrieved documents.

The second option reduces the volume of data displayed to users by sampling the whole result set at a fixed sample rate. To view retrieved documents that are not sampled, users can reduce the search space by zooming in on the map or by revising the query keywords. In the current implementation, the sample rate is specified by users.

Compared with the second option, which uses a fixed sample rate, the third option samples the result set using varied sample rates that decrease as document ranking scores decrease, which means that retrieved documents with higher ranking scores are given a higher level of detail. Users can control the setting of the sample rates as well.
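The difference between the two sampling options can be sketched as follows. The rank thresholds mirror the worked example in Section 6.4 and are otherwise arbitrary; the accumulator-based selection is one simple way to realise a sample rate and is not taken from the actual implementation.

import java.util.ArrayList;
import java.util.List;

/** Sketch of fixed-rate and varied-rate sampling of a ranked result list. */
class ResultSamplingSketch {

    /** Keep roughly a fraction `rate` of the ranked list, evenly spread. */
    static List<String> fixedRateSample(List<String> rankedDocs, double rate) {
        List<String> sampled = new ArrayList<>();
        double accumulator = 0.0;
        for (String doc : rankedDocs) {
            accumulator += rate;
            if (accumulator >= 1.0) { sampled.add(doc); accumulator -= 1.0; }
        }
        return sampled;
    }

    /** Higher-ranked documents get higher sample rates (e.g. 1.0 / 0.5 / 0.2). */
    static List<String> variedRateSample(List<String> rankedDocs) {
        List<String> sampled = new ArrayList<>();
        double accumulator = 0.0;
        for (int rank = 0; rank < rankedDocs.size(); rank++) {
            double rate = rank < 10 ? 1.0 : rank < 50 ? 0.5 : 0.2;
            accumulator += rate;
            if (accumulator >= 1.0) { sampled.add(rankedDocs.get(rank)); accumulator -= 1.0; }
        }
        return sampled;
    }
}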

6.4 Case study

This section presents a case study to illustrate the effectiveness and the usefulness of GeoTagMapper. The case study uses a total of 104,293 webpages and digital photos collected from the following three data sources.

- The GeoURL Server: GeoURL is a location-to-URL reverse directory that lists Web URLs based upon ICBM addresses embedded in HTML META elements. Records are added to GeoURL by website owners and other contributors. The ICBM tags (the acronym originally stood for Intercontinental Ballistic Missile) derive from a historical application in which the form (i.e. a pair of longitude and latitude, preferably to seconds-of-arc accuracy) was used to register a site with the mapping project for generating geographically-correct maps of Usenet links on a plotter.

- The Wikipedia: Using the regular expression-based text matching approach discussed above, articles that contain geographic location information were extracted from the English Wikipedia database dump file created on 16th February 2006. An example of Wikipedia geographic coordinate markup is {{coord|51|28|40|N|0|0|6}}.

- Flickr: Two flickr APIs (i.e. flickr.photos.search and flickr.photos.getInfo) are used to pull data on geo-tagged digital photos from flickr.com. Although the total number of geo-tagged photos in the flickr.com database is very large, only the first few thousand can be retrieved due to the limitations of the flickr.com search engine.

Statistics of these data sources are shown in Table 6-1.

Table 6-1 Statistics of data sources used in GeoTagMapper

Number of documents of the GeoURL collection       64,043
Number of documents of the Wikipedia collection    37,675
Number of documents of the Flickr collection        2,575
Total number of documents                         104,293
Total file size of the GeoURL collection           1.40 GB
Total file size of the Wikipedia collection        730 MB
Total file size of the Flickr collection           18.5 MB
Textual index size                                 3.03 GB
Geographic index size                              74.1 MB

Figure 6-3 shows a snapshot of the initial screen of the user interface. The Google Maps display shows satellite imagery and digital maps of the user's current view. Users can perform zoom and pan operations to move around the map display. An HTML form is used to let users input textual and geographic search criteria and submit the query to the server-side application. Two geographic search modes are supported: the first limits the query region to the current user view, and the second limits the query region using a user-specified distance from the centre of the current user view. The user query in this example can be described as "find all documents that mention beach and have geo-tags within the current user view, using a ranking factor of 0.5".

The retrieved results for the above example query are shown in Figure 6-4. As discussed in the previous section, users can select one of the three visualisation strategies to view the retrieved result set: Figure 6-4a shows all results, Figure 6-4b shows the result set sampled at a fixed rate (i.e. 0.2), and Figure 6-4c shows the result set sampled at varied rates (i.e. 1.0 for ranks 1 to 10, 0.5 for ranks 11 to 50 and 0.2 for the remaining results). Each retrieved document is presented to users as a clickable icon on the Google Maps display and an HTML link that leads to its original source. The details of a retrieved document are shown in a pop-up window when its icon is clicked (see Figure 6-4d). By using zoom in and zoom out operations, users can choose different levels of detail to examine their search results.

Figure 6-3 The search interface

From the three different displays we can see that, compared with the show-all-results approach (i.e. Figure 6-4a), which displays all retrieved results in one window, the sample-based approaches (i.e. Figure 6-4b and Figure 6-4c) considerably reduce the number of visualisation elements and make user interaction easier and faster. At the same time, the communication costs of transferring data from the server to client-side browsers, and the computation costs on both the server and client sides, are reduced as well.


Figure 6-4 The result page of the user interface: a) showing all results; b) showing sampled results with a fixed sample rate; c) showing sampled results with varied sample rates; and d) a document shown in a pop-up window

As a further comparison, the fixed and varied sample rate approaches profile the same result set quite differently. Using varied sample rates, retrieved results are given a different level of detail based on their ranking scores, which means that results with higher ranking scores have a greater chance of being presented to users, so users may find the information they are interested in more quickly. On the other hand, the varied sample rate approach may lose information about lower-ranked documents, as the level of detail decreases along with the sample rate for lower-ranked results.

6.5 Conclusions

Within this chapter, GeoTagMapper, an online map-based Geographic Information Retrieval system for geo-tagged Web content, was presented. It showed how geo-tagged Web content from heterogeneous data sources is modelled, indexed, retrieved and visualised in an integrated framework, with emphasis on supporting information retrieval tasks that take into account not only the textual content but also the geographic context of documents. The architecture and components of the system were described in detail first, and then a case study involving a large Web document collection was presented to show the potential of the proposed system, which aims to provide users with a rich experience in geographically enabled information retrieval tasks.

CHAPTER 7 THE UNIVERSITY OF NEW SOUTH WALES AT GEOCLEF 2006

This chapter describes my participation in the GeoCLEF monolingual English task of the Cross Language Evaluation Forum 2006. The main objective of this study is to evaluate the retrieval performance of the proposed geographic information retrieval system. The system used in the evaluation consists of four modules: a geographic knowledge base that provides information about important geographic entities around the world and the relationships between them; an indexing module that creates and maintains textual and geographic indices for document collections; a document retrieval module that uses the G-SAN model to retrieve documents that meet both textual and geographic criteria; and a ranking module that ranks retrieved results based on ranking functions learned using Genetic Programming. Experimental results show that the geographic knowledge base, the indexing module and the retrieval module are useful for geographic information retrieval tasks, but the proposed ranking function learning method does not work well.

7.1 Introduction

The GeoCLEF track of the Cross Language Evaluation Forum (CLEF) aims to provide a standard test-bed for retrieval performance evaluation of Geographic Information

Retrieval (GIR) systems using search tasks involving both geographic and multilingual aspects (Gey et al. 2006; Gey et al. 2005). The main objective of this study is to evaluate the performance of a GIR system developed based on approaches proposed in previous chapters.

Five key challenges have been identified in building a GIR system for the GeoCLEF 2006 tasks:

Firstly, a comprehensive geographic knowledge base that provides not only flat gazetteer lists but also relationships between geographic entities must be built, as it is essential for geographic information entity extraction and grounding during all GIR query parsing and processing procedures.

Secondly, compared with GeoCLEF 2005, GeoCLEF 2006 no longer provides explicit expressions for geographic criteria. Topics must be geo-parsed to identify and extract the geographic information entities that are embedded in the title, description and narrative tags. In addition, new geographic relationships are introduced in GeoCLEF 2006, such as geographic distances (e.g. within 100 km of Frankfurt) and complex geographic expressions (e.g. Northern Germany).

Thirdly, a textual-geo indexing scheme must be built for efficient document access and searching. Although the computation cost is not considered in the system evaluation, the indexing scheme is necessary for a practical retrieval system in which large numbers of documents are involved.

Fourthly, the relevance of a document to a given topic in GIR must be determined not only by thematic similarity, but also by geographic associations between them.

And lastly, a uniform ranking function that takes into account both textual and geographic similarity measures must be specified to calculate a numerical ranking score for each retrieved document.

The rest of this chapter is organised as follows. Section 7.2 describes the GeoCLEF 2006 tasks. Section 7.3 describes the design and implementation of the GIR system. Section 7.4 presents the runs carried out for the monolingual English task. Section 7.5 discusses the obtained results. Section 7.6 concludes the chapter.

7.2 GeoCLEF 2006

It is difficult to compare various GIR implementations without a standard test-bed. In 2005, through the efforts of research groups at the University of California, Berkeley and the University of Sheffield, CLEF started a new track named GeoCLEF that aims to provide a framework in which GIR systems are evaluated using search tasks involving both geographic and multilingual aspects. GeoCLEF has been a regular CLEF track since 2006.

GeoCLEF 2006 tasks can be performed in two contexts, monolingual and bilingual. In the monolingual context both documents and topics are provided in the same language, while in the bilingual context documents and topics are given in different languages. Available language options for documents and topics include English, German, Portuguese and Spanish. In total, 25 topics are defined for GeoCLEF 2006. An example topic is illustrated in Figure 7-1. The English document collections used in GeoCLEF 2006 are the same as those used in GeoCLEF 2005, and consist of a total of 169,477 documents: 56,472 documents collected from the British newspaper "The Glasgow Herald" (1995) and 113,005 documents collected from the American newspaper "The Los Angeles Times" (1994). Figure 7-2 shows an example GeoCLEF document.

Figure 7-1 An example topic used in GeoCLEF 2006

Figure 7-2 An example GeoCLEF document

7.3 Approaches for GeoCLEF 2006

This section describes the specific approaches for my participation in the GeoCLEF

2006 monolingual English task. I did not work with the collections in other languages

(e.g. German, Portuguese and Spanish).

7.3.1 System overview

The proposed methodology includes a geographic knowledge base for the representation and organisation of geographic data and knowledge, an integrated textual-geo indexing scheme for document indexing and searching, a retrieval model based on G-SAN for document retrieval, and a ranking function learning algorithm based on Genetic Programming (GP).

Figure 7-3 shows the overall architecture of the system. The Java programming language is used to implement the whole system; the MySQL database is used as the backend database; the Apache Lucene search engine (c.f. http://lucene.apache.org) is used for textual indexing and searching; and the Named Entity Recognition (NER) functionalities used in geographic information entity extraction and grounding are implemented using the Alias-i LingPipe system (c.f. http://www.alias-i.com/lingpipe).

Figure 7-3 System architecture of the GIR system used in the GeoCLEF 2006

7.3.2 Geographic knowledge base

Data in the geographic knowledge base is collected from various public sources and compiled into the MySQL database. The statistics of the geographic knowledge base are given in Table 7-1. The main sources used for carrying out the experiments are listed in Table 7-2.

Table 7-1 Statistics for the geographic knowledge base

Description                                               Statistic
Number of distinct geographic entities                    7817
Number of distinct geographic entity names                8612
- Number of countries                                     266
- Number of country names                                 502
- Number of administrative divisions                      3124
- Number of administrative division names                 3358
- Number of cities                                        3215
- Number of city names                                    3456
- Number of oceans, seas, gulfs, rivers                   849
- Number of ocean, sea, gulf, river names                 921
- Number of regions                                       363
- Number of region names                                  375
Average names per entity                                  1.10
Number of relationships                                   9287
- Number of part-of relationships                         8203
- Number of adjacency relationships                       1084
Number of entities that have only one name                7266 (92.95%)
Number of entities without any relationship               69 (0.88%)
Number of entities without any part-of relationship       123 (1.57%)
Number of entities without any adjacency relationship     6828 (87.35%)

Table 7-2 The sources used for the construction of the geographic knowledge base

Resource | Geographic data
The Federal Information Processing Standard Publication 10-4: Countries, Dependencies, Areas of Special Sovereignty, and Their Principal Administrative Divisions | Countries, administrative divisions
The World Factbook published by the Central Intelligence Agency of the United States | Border countries, coastlines, country capital cities
The Wikipedia (c.f. http://en.wikipedia.org/) | Oceans, seas, gulfs, rivers, regions
Large cities in the world collected from TravelGIS.com | Cities
The Standard Country and Area Codes Classifications (M49) published by the United Nations Statistics Division | Regions, continents
The ESRI Gazetteer Server developed by the Environmental Systems Research Institute, Inc. | Minimum Boundary Rectangle (MBR) of countries
The WordNet developed by the Cognitive Science Laboratory at Princeton University | Variant place names

7.3.3 Textual-geo indexing

The textual-geo indexing scheme creates and maintains a textual index and a geographic index separately, and links the two indices using document identifiers.

The textual index is built using Lucene with its built-in support for stop word removal (Salton 1971) and the Porter stemming algorithm (Porter 1980).

The geographic index is built in three steps. The first step performs simple string matching against all documents in the collections using the place name list derived from the geographic knowledge base. The second step performs an NER process to tag three types of named entities: PERSON, LOCATION and ORGANISATION. The final step matches the result sets from the two previous steps using the following rules: 1) each string found in the first step is eliminated if it is tagged as a non-location entity in the second step, and otherwise is added to the geographic index; 2) each place name in the stop word list of the first step is added to the geographic index if it is tagged as a location entity in the second step.
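The two matching rules can be written down directly. In the sketch below, the gazetteer matches, the place-name stop word list and the NER tag results are represented by plain string sets, which is an assumption about the data structures rather than the thesis code.

import java.util.HashSet;
import java.util.Set;

/** Sketch of the final step of geographic index construction (the two matching rules). */
class GeoIndexMatchingSketch {

    static Set<String> buildGeoIndexEntries(Set<String> gazetteerMatches,       // step 1 results
                                            Set<String> placeNameStopWords,     // ambiguous names held back in step 1
                                            Set<String> nerLocationEntities,    // step 2: strings tagged LOCATION
                                            Set<String> nerNonLocationEntities) // step 2: PERSON / ORGANISATION
    {
        Set<String> entries = new HashSet<>();

        // Rule 1: keep a gazetteer match unless NER tagged it as a non-location entity
        for (String name : gazetteerMatches) {
            if (!nerNonLocationEntities.contains(name)) {
                entries.add(name);
            }
        }
        // Rule 2: a place name from the stop word list is kept only if NER tagged it as a location
        for (String name : placeNameStopWords) {
            if (nerLocationEntities.contains(name)) {
                entries.add(name);
            }
        }
        return entries;
    }
}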

7.3.4 Document retrieval

The retrieval of relevant documents in the system is a two-phase procedure that involves query parsing and searching.

The GeoCLEF query topics are in general modelled as Q = (textual criteria, geographic criteria) in the system. However, the query parser is configured in an ad hoc fashion for the GeoCLEF 2006 tasks at hand. Given a topic, the parser performs the following steps:

1) Removes guidance information, such as "Documents about" and "Relevant documents describe"; descriptions of irrelevant documents are removed as well. 2) Extracts geographic criteria using string matching against the name and type data obtained from the geographic knowledge base. The discovered geographic information entities, geographic relationships and geographic concepts are added to the geographic criteria, and the geography-related words are then removed from the query topic. 3) Removes stop words. 4) Expands all-capitalised abbreviations (e.g. ETA in GC049) using the WordNet APIs. The resulting text is then treated as the textual keywords.
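A hypothetical outline of this four-step parsing pipeline in Java is given below; every helper call is a placeholder standing in for the real modules (guidance-phrase removal, knowledge-base lookup, stop word filtering, WordNet expansion), and the names are not taken from the implementation.

import java.util.List;

/** Hypothetical outline of the GeoCLEF topic parsing steps described above. */
class TopicParserSketch {

    static class ParsedQuery {
        String textualKeywords;
        List<String> geographicCriteria;
    }

    static ParsedQuery parse(String topicText) {
        ParsedQuery q = new ParsedQuery();

        // 1) remove guidance phrases such as "Documents about" and "Relevant documents describe"
        String text = removeGuidance(topicText);

        // 2) extract geographic entities, relationships and concepts via the knowledge base,
        //    then drop the geography-related words from the topic text
        q.geographicCriteria = extractGeographicCriteria(text);
        text = removeGeographicWords(text, q.geographicCriteria);

        // 3) remove stop words; 4) expand all-capitalised abbreviations (e.g. ETA) via WordNet
        text = removeStopWords(text);
        text = expandAbbreviations(text);

        q.textualKeywords = text;
        return q;
    }

    // Placeholders standing in for the real modules.
    static String removeGuidance(String s) { return s; }
    static List<String> extractGeographicCriteria(String s) { return List.of(); }
    static String removeGeographicWords(String s, List<String> geo) { return s; }
    static String removeStopWords(String s) { return s; }
    static String expandAbbreviations(String s) { return s; }
}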

After query topics are parsed, the G-SAN based retrieval model is used to retrieve documents that meet both the textual and geographic criteria. The Lucene search engine is used to activate all documents (i.e. T-nodes) that contain the textual keywords, and the geographic index and the geographic knowledge base are used to activate the G-nodes that meet the geographic criteria. After the G-SAN activation spreading, documents that connect to activated T-nodes and, at the same time, activated G-nodes are considered relevant.

7.3.5 Document ranking

A Genetic Programming-based algorithm is developed in the system to learn ranking functions based on the methodology developed in Chapter 4. The GP algorithm consists of two elements: 1) a set of terminals and functions that can be used as the logic units of a ranking function; and 2) a set of fitness functions that are used to evaluate the performance of each candidate ranking function.

Terminals reflect logical views of documents and user queries. Table 7-3 lists terminals used in the system. Functions used in the experiments include addition (+), subtraction

(-), multiplication (×), division (/) and natural logarithm (log).


Table 7-3 Terminals used in the ranking function learning process

Name                  Description
DOC_COUNT             Number of documents in the collection
DOC_LENGTH            Length of the document
LUCENE_SCORE          Lucene ranking score of the document
GEO_NAME_NUM          How many geographic names appear in the document
GEO_NAME_COUNT        Total number of geographic names of all geographic entities discovered from the document
GEO_ENTITY_COUNT      How many entities have the geographic name
GEO_RELATED_COUNT     How many entities have the geographic name and are related to the query
GEO_NAME_DOC_COUNT    Number of documents that have the geographic name
GEO_COUNT             How many times the geographic name appears in the document
NAME_COUNT            Number of geographic names in the geographic knowledge base
ENTITY_COUNT          Number of entities in the geographic knowledge base

Three fitness functions are used in the system:

F_P50. This fitness function returns the arithmetic mean of the precision values at 50% recall for all queries as results.

F_P50 = \frac{1}{Q} \sum_{i=1}^{Q} P_{i,50}    Equation 7-1

where Q is the total number of queries and P_{i,50} is the precision value at the 50% recall level for the ith query.

F_MAP. This fitness function utilises the idea of average precision at seen relevant documents.

F_MAP = \frac{1}{Q} \sum_{i=1}^{Q} \frac{1}{|D_i|} \sum_{j=1}^{|D_i|} \left( r(d_j) \times \frac{1}{j} \sum_{k=1}^{j} r(d_k) \right)    Equation 7-2

where Q is the total number of queries, D_i is the set of retrieved documents for the ith query (with |D_i| its size), and r(d_k) is a function that returns 1 if the document d_k is relevant to the ith query and 0 otherwise.

F_WP. This fitness function utilises the weighted sum of precision values on the 11 standard recall levels.

F_WP = \frac{1}{Q} \sum_{i=1}^{Q} \sum_{j=0}^{10} \frac{1}{(j+1)^m} P_{i,j}    Equation 7-3

where Q is the total number of queries, P_{i,j} is the precision value at the jth of the 11 standard recall levels for the ith query, and m is a positive scaling factor determined from experiments.
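As reconstructed above, F_P50 and F_WP reduce to simple averages over precomputed precision values. The Java sketch below illustrates them; the precision arrays are assumed to be produced by the evaluation harness, and the method names are illustrative.

/** Sketch of two of the fitness functions, following Equations 7-1 and 7-3 as reconstructed. */
class FitnessFunctionSketch {

    /** Equation 7-1: mean of the precision values at the 50% recall level over all queries. */
    static double fP50(double[] precisionAt50) {
        double sum = 0.0;
        for (double p : precisionAt50) sum += p;
        return sum / precisionAt50.length;
    }

    /**
     * Equation 7-3: weighted sum of the precision values at the 11 standard recall levels,
     * where precision[i][j] is the precision of query i at recall level j (j = 0..10) and
     * m is the positive scaling factor.
     */
    static double fWP(double[][] precision, double m) {
        double total = 0.0;
        for (double[] query : precision) {
            for (int j = 0; j <= 10; j++) {
                total += query[j] / Math.pow(j + 1, m);
            }
        }
        return total / precision.length;
    }
}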

7.4 Experiments

The following five runs were submitted for the GeoCLEF 2006 monolingual English task:

unswTitleBase: This run used the title and description tags of the topics for query parsing and searching. After relevant documents were retrieved, the Lucene ranking scores were used to rank results.

unswNarrBase: This run used the title, description and narrative tags of the topics for query parsing and searching. After relevant documents were retrieved, the Lucene ranking scores were used to rank the results.

The above two runs were mandatory tasks defined by GeoCLEF, and they were used as baseline methods.

unswTitleF46: This run used the title and description tags of the topics for query parsing and searching. After relevant documents were retrieved, the ranking function given below was used to rank results. This ranking function was discovered using fitness function F_WP with m = 6.

LUCENE_SCORE * LUCENE_SCORE * LUCENE_SCORE / GEO_NAME_COUNT    Equation 7-4

unswNarrF41: This run used the title, description and narrative tags of the topics for query parsing and searching. After relevant documents were retrieved, the ranking function given below was used to rank results. This ranking function was discovered using fitness function F_WP with m = 1.

LUCENE_SCORE * LUCENE_SCORE * LUCENE_SCORE * GEO_RELATED_COUNT / DOC_LENGTH    Equation 7-5

unswNarrMap: This run used the title, description and narrative tags of the topics for query parsing and searching. After relevant documents were retrieved, the ranking function given below was used to rank results. This ranking function was discovered using fitness function F_MAP.

GEO_RELATED_COUNT * LUCENE_SCORE / DOC_COUNT / DOC_COUNT    Equation 7-6

The ranking functions used in the above three runs were learned using the GP learning algorithm. The GeoCLEF 2005 topics and relevance judgments were used as training data.

7.5 Results

Table 7-4 summarises the results of the five runs using the following evaluation metrics: Average Precision, R-Precision (i.e. the precision at the R-th position in the ranking, where R is the total number of relevant documents for the current query) and the increment over the mean average precision (i.e. 19.75%) obtained from all submitted runs.

The average precision values for individual queries are shown in Table 7-5.

Table 7-4 GeoCLEF 2006 monolingual English task results

Run            AvgP. (%)   R-Precision (%)   Δ AvgP. over GeoCLEF Avg P. (%)
unswTitleBase  26.22       28.21             +32.75
unswNarrBase   27.58       25.88             +39.64
unswTitleF46   22.15       26.87             +12.15
unswNarrF41    4.01        4.06              -79.70
unswNarrMap    4.00        4.06              -79.75

(Δ AvgP. is the increment over the mean average precision obtained from all submitted runs.)

Several observations can be made from the obtained results. Firstly, the geographic knowledge base and the retrieval model used in the system showed their potential usefulness for GIR, as can be seen from the higher average precision values of unswTitleBase (26.22%) and unswNarrBase (27.58%), which achieved a 32.75% and a 39.64% improvement, respectively, compared with the mean average precision of all submitted runs. The unswTitleBase run was ranked fourth out of the 73 entries in the title-and-description-fields-only category (Gey et al. 2006).


Secondly, the ranking function learning algorithm used in the system does not work well for the GeoCLEF tasks, particularly for those runs (i.e. unswNarrF41 and unswNarrMap) that utilise the narrative information of the queries. It is supposed that this behaviour is due to a strong over-training effect. However, the unswTitleF46 run performed better than the two baseline runs for a small set of topics (i.e. GC027, GC034 and GC044).

Thirdly, it is not immediately obvious that the narrative information should be included in the query processing. The unswTitleBase run achieved the same performance as the unswNarrBase run for 10 topics (i.e. GC026, GC027, GC030, GC036, GC037, GC040, GC041, GC046, GC049 and GC050), and it achieved better results for 6 topics (i.e. GC028, GC029, GC033, GC034, GC039 and GC044).

Table 7-5 Average precision values (%) for individual queries

Topic   unswTitleBase  unswNarrBase  unswTitleF46  unswNarrF41  unswNarrMap
GC026   30.94          30.94         15.04         0.58         0.56
GC027   10.26          10.26         12.32         10.26        10.26
GC028   7.79           3.35          5.09          0.36         0.31
GC029   24.50          4.55          16.33         0.53         0.53
GC030   77.22          77.22         61.69         6.55         6.55
GC031   4.75           5.37          5.09          3.31         3.31
GC032   73.34          93.84         53.54         5.71         5.71
GC033   46.88          38.88         44.77         33.71        33.71
GC034   21.43          2.30          38.46         0.14         0.13
GC035   32.79          43.80         28.06         3.19         3.11
GC036   0.00           0.00          0.00          0.00         0.00
GC037   21.38          21.38         13.17         0.81         0.81
GC038   6.25           14.29         0.12          0.12         0.12
GC039   46.96          45.42         34.07         3.50         3.50
GC040   15.86          15.86         13.65         0.34         0.30
GC041   0.00           0.00          0.00          0.00         0.00
GC042   10.10          36.67         1.04          0.33         0.33
GC043   6.75           16.50         4.33          0.55         0.54
GC044   21.34          17.23         13.80         4.78         4.78
GC045   1.85           3.96          2.38          1.42         1.42
GC046   66.67          66.67         66.67         3.90         3.90
GC047   8.88           11.41         9.80          1.02         0.98
GC048   58.52          68.54         51.55         8.06         8.06
GC049   50.00          50.00         50.00         0.09         0.09
GC050   11.06          11.06         12.73         11.06        11.06

Lastly, it is interesting to see that the system did not retrieve any relevant documents for topics GC036 and GC041. This is not surprising for GC036, as no document was identified as relevant in the assessment result. For GC041, which is about "Shipwrecks in the Atlantic Ocean", the keyword "shipwreck" does not appear in any of the four relevant documents (i.e. GH950210-000051, GH950210-000197, LA071094-0080 and LA121094-0182). This problem also leads to the poor performance on topics GC031, GC045 and GC050.

7.6 Conclusions

This chapter reported my participation in the GeoCLEF 2006 monolingual English task by developing a GIR system using the approaches proposed in previous chapters. The key components of the system, including a geographic knowledge base, an integrated textual-geo indexing scheme, a Boolean retrieval model and a Genetic Programming-based ranking function discovery algorithm, were described in detail. The evaluation results were presented and the observations made from the results were discussed.

CHAPTER 8 APPLICATIONS

This chapter demonstrates the GIR techniques developed in previous chapters using four case studies in different application domains: 1) the GNSS-RSS application, which distributes GNSS/GPS data over the Internet using RSS; 2) the Local News Reader, which categorises online news stories based on their geospatial context; 3) the Web Sites Finder, which enables users to find web sites that are close to their current locations; and 4) the Property List Map, which provides users with a map-based interface for searching and reviewing real estate properties.

The rest of this chapter is organised as follows: Section 8.1 gives background information about RSS-based web content syndication and describes the GNSS-RSS application. Section 8.2 presents the Local News Reader application. Section 8.3 describes the Web Sites Finder application. Section 8.4 introduces the Property List

Map application. Section 8.5 concludes the chapter.

8.1 GNSS-RSS

GNSS-RSS publishes RSS feeds for each GNSS (Global Navigation Satellite System) data file produced by the SydNET GPS observation network – a network of permanently operating high quality GPS base stations in the Sydney basin area (Rizos,

Yan & Kinlyside 2004). In SydNET, GPS data is acquired from each base station and transferred to the SydNET Control Centre via optical fibre links for real-time and post-processing applications. In the current version of the system, the GPS data for post-processing applications is distributed online on a request/response basis. A client

has to log onto the SydNET website and request data by specifying a desired base station, time period and data rate; the system then extracts data from a repository and creates Receiver INdependent EXchange (RINEX) files for the client to download. The proposed GNSS-RSS application introduces an alternative approach for the distribution of GNSS data using the emerging RSS techniques.

RSS, which may stand for Rich Site Summary, Resource Description Framework (RDF)

Site Summary or Really Simple Syndication, is an emerging technology for web content syndication and delivery (Hammersley 2003). RSS is a lightweight type of eXtensible

Markup Language (XML) format, designed for syndicating web pages or sharing news headlines. An RSS delivery system produces a simple XML file comprising <title>, <link> and <description> elements. The Uniform Resource Locator (URL) for the full content of an item is specified in the <link> tag.

RSS feeds from various publishers can be automatically syndicated, ordered and grouped using RSS aggregators. RSS aggregators are dedicated programs that read RSS files. Users can have their own personalised aggregator so that only preferred topics/data are subscribed to. Although market research shows that this technology is still at an early stage of adoption (Rainie 2005), major web content providers, e.g. the British Broadcasting Corporation (cf. news.bbc.co.uk) and Amazon (cf. www.amazon.com), are delivering their content, such as news articles, weblogs, and audio and video files, to consumers.

Compared with traditional online information subscription services, in which email plays a major role, RSS has several significant advantages. Firstly, content providers no longer need to maintain subscription lists, and server-side network bandwidth consumption is reduced; secondly, users are free to subscribe and unsubscribe to a content source at any time without being constrained by a mailing list management system; finally, RSS introduces a spam-free environment.

The proposed RSS publishing platform is middleware that links data providers and data users. The architecture of the platform consists of three software components: 1) the RSS Feed Generator, 2) the External Data Proxy, and 3) the RSS Feed Publisher. Figure 8-1 illustrates the architecture.

Figure 8-1 Architecture of the SydNET data distribution platform

The RSS Feed Generator is a software component that creates RSS feed items for GNSS data files. An example of such a data structure is provided in Figure 8-2.

Figure 8-2 An example of a SydNET RSS feed item

Based on the RSS 2.0 specification, the description of a GNSS data file is represented by an <item> element and its sub-elements: an <item> element contains a <title> element, which gives the title of the item; a <link> element, which provides the URL of the item and an access point for RINEX file downloading; a <description> element, which provides descriptive information about the item using HTML mark-up; a <pubDate> element, which gives the time at which the item is published; and a <guid> element, which contains a string assigned by the system as a globally unique identifier for the item.

For some post-processing applications, external GNSS data such as International GNSS Service (IGS) ephemerides and meteorological data are required. Usually these data are available for download through providers' Web or FTP servers. The effort required to create RSS feeds for such diverse data sources is not trivial: different data providers use different methods to describe their data, and most of these descriptions are given as textual notes that a human must interpret. For a software program, the properties (i.e. URL, creation time, etc.) and contents of data files are useful for discovering RSS-related information. For example, the URL of an IGS ephemerides file is http://igscb.jpl.nasa.gov/igscb/product/1324/igr13243.sp3.Z, from which one can infer that it is a rapid file and that the GPS date is the fourth day of week 1324.
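
As an illustration of this kind of file-name-based inference, a minimal sketch is given below. It assumes the legacy IGS sp3 naming convention (product prefix, four-digit GPS week, one-digit day of week) and uses hypothetical class and method names rather than the actual agent classes of the system.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /** Illustrative sketch only: infers basic properties of an IGS orbit file from
     *  the legacy sp3 file naming convention (product prefix + GPS week + day of week). */
    public class IgsFileNameParser {

        // e.g. igr13243.sp3.Z -> prefix "igr" (rapid), GPS week 1324, day-of-week 3
        private static final Pattern SP3_NAME =
                Pattern.compile("(igs|igr|igu)(\\d{4})(\\d)\\.sp3(\\.Z)?");

        public static void main(String[] args) {
            String url = "http://igscb.jpl.nasa.gov/igscb/product/1324/igr13243.sp3.Z";
            String fileName = url.substring(url.lastIndexOf('/') + 1);
            Matcher m = SP3_NAME.matcher(fileName);
            if (m.matches()) {
                String product = m.group(1);                   // "igr" -> rapid product
                int gpsWeek = Integer.parseInt(m.group(2));    // 1324
                int dayOfWeek = Integer.parseInt(m.group(3));  // 3, i.e. the fourth day counting from 0
                System.out.printf("product=%s week=%d day=%d%n", product, gpsWeek, dayOfWeek);
            }
        }
    }

In practice each external data provider would need its own such parser, which is exactly the role of the agent classes discussed next.
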
Therefore, the domain knowledge along with the file name convention of each data provider is necessary for the implementation of the agent classes.

The RSS Feed Publisher is the user interface of the system; it accepts users' requests and produces RSS feeds in association with the other system components. The RSS feed items from the RSS Feed Generator and the External Data Proxy are syndicated and published by the RSS Feed Publisher. A typical example of an RSS XML document created by the RSS Feed Publisher is shown in Figure 8-3. Each RSS document contains one <channel> element, which consists of a collection of descriptions of the feed. The <title> element presents the name of the feed, the <link> element gives the URL of the feed, and the <docs> element gives the URL that points to the RSS specification.

Figure 8-3 An example of a SydNET RSS feed

RSS readers can be used to subscribe to multiple RSS feeds at the same time. RSS feeds from different SydNET stations and external data sources can be syndicated at users' sites to provide enhanced and integrated access.

The RSS software components are implemented as Java 2 Platform, Enterprise Edition (J2EE) servlets and embedded in the SydNET web application, which is deployed on a JBoss Application Server.

Figure 8-4 Desktop RSS aggregator

Figure 8-5 The Web-based RSS aggregator

To subscribe to and view the SydNET RSS feeds, subscribers should use RSS aggregators and add the URLs of the SydNET RSS web application. For example, the RSS URL for the SydNET RINEX feed is http://www.gmat.unsw.edu.au/sydnet/GetSydNETFeed?stationID=2. After these URLs are added to an RSS aggregator, the aggregator connects to the SydNET web application and pulls the RSS feed XML documents on a regular basis. After the RSS documents are validated and parsed, all items contained in the documents are displayed in the subscriber's interface, and subscribers are notified when new items are found. Figure 8-4 illustrates GNSS-RSS feeds in the RssReader software, a Windows desktop RSS aggregator, and Figure 8-5 illustrates GNSS-RSS feeds in Bloglines (cf. http://www.bloglines.com/), a well-known online Web-based RSS aggregator.

8.2 Local news reader

The second demonstration application is the Local News Reader, which is designed to provide users with a map-based interface for reading news articles. Figure 8-6 shows the user interface for this application. Each place marker on the map represents a place that has one or more related news articles. The information window of a place marker consists of the name of the place, weather condition data and a list of hyperlinks, each of which links to the full version of an article.

In total, 44 regional news RSS feeds published by the Australian Broadcasting Corporation (ABC) are used in this application as the underlying data source. A list of links to these feeds can be found on the homepage of the ABC RSS feeds (cf. http://abc.net.au/news/services/default.htm). Figure 8-7 shows an example of an ABC feed for the state of New South Wales. Each ABC feed consists of meta-information (e.g. title, description, URL) for one or many news items. The locations of a news item are described using one <category domain="ABCPrimaryLocation"> element and zero or more <category domain="ABCLocations"> elements. In this application, feeds are first fed to a geographic entity parsing module; a minimal sketch of this step is given below.
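
The sketch assumes the feed structure shown in Figure 8-7 and uses hypothetical class and method names rather than the actual implementation.

    import java.util.ArrayList;
    import java.util.List;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    /** Illustrative sketch only: collect place names from the <category> elements
     *  of an ABC regional news RSS feed. */
    public class AbcFeedLocationParser {

        public static List<String> extractPlaceNames(String feedUrl) throws Exception {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document feed = builder.parse(feedUrl);               // fetch and parse the RSS XML
            NodeList categories = feed.getElementsByTagName("category");
            List<String> places = new ArrayList<String>();
            for (int i = 0; i < categories.getLength(); i++) {
                Element category = (Element) categories.item(i);
                String domain = category.getAttribute("domain");
                if ("ABCPrimaryLocation".equals(domain) || "ABCLocations".equals(domain)) {
                    places.add(category.getTextContent().trim()); // e.g. "Sydney"
                }
            }
            return places; // subsequently filtered against the gazetteer and geo-coded
        }
    }

The returned names are then filtered and geo-coded as described next.
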
Place names included in the related <category> elements are extracted using a wrapper program based on XML parsing and are then filtered using a gazetteer place name list. A geo-coding module then assigns longitude and latitude coordinates to the extracted place names through a name/coordinate lookup table. Where an article mentions more than one distinct place name, its link is included in all related place markers.

To demonstrate the syndication functionality of the system, weather reports provided by ninemsn.com.au are integrated into the application as well. Weather data is extracted from web pages using a wrapper program and is then anchored to the place markers.

Figure 8-6 A screen shot of the Local News Reader

Figure 8-7 An example of the ABC regional news feeds

8.3 Web sites finder

Web Sites Finder is an application that enables users to find web sites that are close to their current locations. The application supports two modes. The first is the direct mode, which is designed for users who have Global Positioning System (GPS) receivers connected to their computers; the second is the indirect mode, which allows users to specify locations on the query interface. Figure 8-8 shows a screen shot of the application. The user location is given as (-33.916675, 151.283335), and the search radius is set to 10 km. Web sites located inside the search region are displayed as place markers on the Google map.

Once a place marker is clicked, an information window pops up and shows the user summary information about the web site, including its title, a homepage thumbnail and a hyperlink to the web site.

The search function in the Web Sites Finder is very straightforward. The application uses the GeoURL Address Server (cf. http://geourl.org/) as the underlying data source. Currently, the GeoURL server lists more than 200,000 web sites. Each web site is indexed based on its address information (i.e. longitude and latitude pairs) embedded in META elements in its homepage (see Figure 8-9).

Figure 8-8 A screen shot of the Web Sites Finder

Figure 8-9 Address META tags of http://pye.dyndns.org/

To find web sites near a given location, the application queries the GeoURL server by sending it an HTTP request in a format like http://geourl.org/near?lat=-33.916675&long=151.283335&dist=10. The result returned from the GeoURL server is an HTML page that lists all web sites that meet the search criteria (see Figure 8-10). A wrapper program is then used to parse the title and homepage URL for each web site. Once the URLs are extracted from the GeoURL results, the application retrieves the homepage of each web site, and a clickable icon that leads to its thumbnail is placed on the map at the corresponding address.

Figure 8-10 Search results returned from the GeoURL Address Server

Finally, all summary information is encoded into an XML file using the enhanced RSS 2.0 schema. A JavaScript module, which takes this XML file as input and updates the user interface, is developed using Asynchronous JavaScript and XML (AJAX) techniques.
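
To illustrate the search function described above, a minimal sketch of a GeoURL query and a simple wrapper is given below; the class name and the anchor-based scraping pattern are hypothetical, and the real GeoURL result page layout may differ.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /** Illustrative sketch only: query the GeoURL Address Server for sites near a
     *  point and scrape title/URL pairs from the returned HTML. */
    public class GeoUrlClient {

        public static void main(String[] args) throws Exception {
            String query = "http://geourl.org/near?lat=-33.916675&long=151.283335&dist=10";
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(query).openStream(), "UTF-8"));
            StringBuilder html = new StringBuilder();
            for (String line; (line = in.readLine()) != null; ) {
                html.append(line).append('\n');
            }
            in.close();

            // Very simple wrapper: grab every anchor's href attribute and link text.
            Pattern anchor = Pattern.compile("<a\\s+href=\"(http[^\"]+)\"[^>]*>([^<]+)</a>",
                    Pattern.CASE_INSENSITIVE);
            Matcher m = anchor.matcher(html);
            while (m.find()) {
                System.out.println(m.group(2).trim() + " -> " + m.group(1)); // title -> homepage URL
            }
        }
    }

A production wrapper would of course need to be kept in step with the actual markup of the GeoURL result pages.
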
8.4 Property list map

Property List Map aims to enhance the user experience in finding real estate properties. The user interface is shown in Figure 8-11. Users first input some search criteria, such as the usage purpose (e.g. buy, rent or share), the name of the target suburb, the property type, the number of bedrooms and the price range. Once users submit their queries, the application displays all properties that meet the search criteria. Similar to the Web Sites Finder, the application places all properties as place markers on a Google map.

Users can click the place marker icons and retrieve summary information about the selected property. Elements in the information window include a title, a short description, a photo and a hyperlink. Through the hyperlink users are redirected to the web page that contains detailed information.

In this demonstration application the underlying data comes from www.domain.com.au, one of the biggest online service providers for the property market in Australia. Search queries issued by users are redirected to the search engine of www.domain.com.au without any change. The resulting HTML page is fed to a wrapper in order to extract summary information for each property.

Figure 8-11 A screen shot of the Property List Map

The physical location of a property is obtained by a geo-coding module using the property street addresses extracted by the wrapper. Properties that do not have street addresses specified are eliminated from the result list. Using the street address as input, the geo-coding module queries Australia's Geocoded National Address File (G-NAF) database and retrieves the corresponding coordinates.

The information encoding and transformation procedures in the Property List Map are the same as in the Web Sites Finder application. One limitation of the Property List Map arises from the fact that the results returned from www.domain.com.au are organised into pages for convenient display, but the application only reads and parses the first page.

8.5 Conclusions

This chapter demonstrated four applications in different domains that utilise the GIR techniques developed in previous chapters. Based on the observation that GIR approaches can be used to provide useful cues for people's synthesis and recognition of the geospatial context of information in everyday life, the user scenarios, the user interfaces for navigation and browsing, and the implementation of these applications were presented. In particular, the details of the integration and syndication of various web information sources were presented.

CHAPTER 9 CONCLUSION AND FUTURE WORK

This chapter provides a summary of the research presented in this thesis, including a discussion of its limitations and recommendations for future research directions.

9.1 Thesis summary

Textual information retrieval systems treat documents and user queries as sets of keywords or search terms. In such a paradigm, the information retrieval process is based on the thematic features of documents and user queries. However, these keyword-based approaches are incompatible with geographic information retrieval tasks due to the geographically oriented nature of geographic information, which can be defined as information that references places on the Earth's surface.

The Geographic Information Retrieval (GIR) system developed in this study has adopted a different strategy. By focusing on the geographic features and geographic relationships discovered from documents and user queries, the GIR system aims to combine advances from both information science and geographic science to improve retrieval performance and user experience.
</p><p>In supporting of this idea, research focuses on the following themes have been presented. </p><p>Firstly, a supervised machine learning based methodology for geographic information entity grounding is developed and evaluated. </p><p>121 </p><p>Secondly, a GIR ranking function discovery algorithm using Genetic Programming is developed and evaluated. </p><p>Thirdly, an integrated GIR retrieval model based on the Spreading Action Network is proposed. </p><p>Fourthly, a platform for publishing, navigation and browsing of geographic information is described. </p><p>Fifthly, the retrieval performance evaluation of the proposed system is given by the participation in the GeoCLEF 2006 monolingual English tasks. </p><p>Finally, applications of the developed system are demonstrated using several case studies in different domains. </p><p>9.2 Limitations </p><p>There are several limitations inherent to the proposed approaches for geographic information retrieval tasks. </p><p>Firstly, the study regarding the recognising and extraction of geographic information entities is limited to information references discovered from document content that have explicit geographic senses only, such as place names and geographic coordinates, which can be categorised into internal geographic information entities. On the other hand, there are also external geographic information entities, which can be defined as </p><p>122 information references that have implicit geographic sense or are discovered from somewhere outside document content. Examples of external geographic information entities include Internet domain name and host locations for web pages, telephone numbers, import local events, names of famous local people and organisations. </p><p>Research has shown that external geographic information entities are also useful for word sense disambiguation (McDonald 1993) and geographic scope assigning (Zong et al. 2005). Additional efforts should be undertaken to expand the geographic information entities extraction module and the geographic knowledge base to utilise them. For those examples listed above, the inclusion of telephone numbers is straightforward, as they have specified format that can be easily modelled using methods like regular expressions, and it is easy to map a telephone number to a place using cues such as country code, area code and white page lookup, depending on the application scale. </p><p>However, a generic and extensible way in which names of local people and organisations can be integrated is more challenging. </p><p>Secondly, there are also inadequacies with the proposed approach to evaluate the user interaction interface. At present, the effeteness and usefulness of the user interface developed in the system was presented using several case studies in different application domains, and was tested by a small number of experienced users. Finding from this effort had an important impact on making the user interface easier to use and more useful from the perspective of presenting the geographic context of information. </p><p>However, the behaviour and interactions of end-users with the user interface are not totally aware. 
Much research in user interface design shows that, for a given interactive system, the positive and negative characteristics of the interface and the interaction process identified by domain experts and by end-users can differ considerably, for various reasons such as different expectations of the system, different levels of familiarity with the content and the design, different usage styles and even different ages (Hix & Hartson 1993). Therefore, follow-up end-user tests should be conducted to identify usability issues using quantitative and qualitative measures based on subjective and objective evaluation techniques, such as heuristic evaluation (Nielsen & Molich 1990) and survey-based user testing (Perlman 2001). Data collected in such tests would be very useful for developing a user-centred interaction environment.

A final limitation is the lack of quantitative evaluation of the system's computing performance. Computing performance evaluation in IR and GIR systems studies the computing costs related to document indexing, retrieval and ranking. Although current evaluation activities such as GeoCLEF do not compare the computing costs of different implementations, the results of computing performance evaluations may affect system design decisions, as computing performance is important for practical systems that deal with large document collections, where end users are concerned not only with the quality of the retrieved results, but also with how quickly their requests are answered.

In the proposed system, the textual-geo indexing module has the most important impact on the system's computing performance. Previous research on combining textual and geographic indexing has shown that different integration strategies lead to different computing costs (Vaid et al. 2005). A carefully designed evaluation experiment would provide insights into three questions: firstly, the time needed to index a given document collection; secondly, the time needed to perform a search on the index to produce retrieval results; and lastly, the resources (i.e. memory, CPU and hard disk) required to support the indexing and searching procedures. Results from these evaluations would also be useful for the selection of an underlying hardware and software platform.

9.3 Future work

Ample opportunities exist to expand and refine the presented work. It is hoped that my efforts will be extended and new functionalities added to lower the barriers typically associated with the production, processing and consumption of geographic information. The following subsections outline further work related to integration with Location-based Services (LBS), extending the geographic knowledge base, multimedia GIR, integration with advanced thematic searching, fuzzy geographic reasoning and high-performance computation.

9.3.1 Integration with Location-based Services

Location-based Services (LBS) aim to provide information services to users based on their location (Karimi & Hammad 2004). In a general LBS application scenario, a user sends to a service provider his or her information requirements along with location data that can be acquired using, for example, GPS receivers, beacons, or positioning services provided by mobile network carriers; the server-side application then processes the user's request and returns information that meets the user's needs.
LBS have many applications in the tourism industry (Zipf & Malaka 2001), wireless emergency services (Rao & Minakakis 2003), personal navigation services (Virrantaus et al. 2001) and urban gaming (Graham-Rowe 2005).

It is natural to integrate GIR with LBS. From the integration perspective, techniques developed in LBS provide front-end capabilities and communication mechanisms, while GIR technologies provide back-end support to enable information searching and retrieval. Work in this direction will focus on researching and developing a content-adaptation-based information distribution platform that works with mobile devices that have different input and output capabilities, network connectivity and programming models (Mohan, Smith & Li 1999). It is anticipated that this research will provide mobile users with an efficient, informative and friendly environment for performing location-enabled searching, navigation and browsing of online geographic information.

9.3.2 Extending the geographic knowledge base

The geographic knowledge base plays an essential role in almost all modules of GIR systems. The quality and coverage of the geographic knowledge base have a direct impact on the system's retrieval performance. In the current implementation, the geographic knowledge base is constructed in an ad-hoc fashion: geographic data is first collected from many sources and then compiled into a single schema semi-manually, which is a task that consumes considerable time and human resources. An important direction for future research is investigating methods by which geographic data and knowledge collected from various sources can be modelled, normalised and aligned automatically (Jones, C. B., Abdelmoty & Fu 2003).

The current data model used to describe geographic entities in the geographic knowledge base is limited to geographic place names, geographic locations (i.e. coordinates), geographic coverage (i.e. minimum boundary rectangles) and three geographic relationships (i.e. part-of, adjacent and same-as). However, research such as Amitay et al. (2004) and Zong et al. (2005) has shown that non-geographic attributes, such as population numbers, are also useful for GIR tasks. It is planned to expand the geographic knowledge base data model to integrate non-geographic attributes and apply them to other modules, in particular the grounding module and the ranking module. It is also planned to expand the geographic knowledge base to include information entity types, such as names of local people, local businesses and local organisations, that have implicit geographic senses.

9.3.3 Multimedia GIR

The document collections and query topics used in this thesis are all textual material. However, it has been seen from the log files of real-world web search engines that many user queries containing explicit and implicit geographic constraints also involve multimedia material (e.g. audio and video files) (Zhang et al. 2006). Although multimedia information retrieval has long been studied, there are few investigations into applying GIR methods to multimedia material. Therefore, one interesting extension of this work would be the research and development of Multimedia Geographic Information Retrieval (MGIR) methods to support information retrieval tasks involving textual, geographic and multimedia features. Example application areas of MGIR include location-based search (Tollmar, Yeh & Darrell 2004).
</p><p>Existing evaluation campaigns such as the ImageCLEF can be used to evaluate the retrieval performance of MGIR systems. Similarly with GeoCLEF, ImageCLEF is a </p><p>CLEF subtask that performs comparative laboratory-style evaluation for image retrieval tasks. Takes query topics used in ImageCLEF 2006 (Clough et al. 2006) as examples, those that can be used for MGIR are: "tourist accommodation near Lake Titicaca", </p><p>127 which specifies a location and a geographic operator "near". </p><p>9.3.4 Advanced thematic searching </p><p>The presented GIR system is designed with emphasise on geographic properties and geographic relationships discovered from documents and user queries, the textual retrieval is developed based on existing off-shore methods (i.e. the Apache Lucene search engine). However, the current implementation limits the system performance, as it only retrieval documents that contains query keywords. The retrieval performance evaluation results shown in Chapter 7, in particular the results of GeoCLEF topic </p><p>GC031, GC041, GC045 and GC050, suggests that to improve the system performance, it would be necessary to leverage advanced textual information retrieval techniques, such as Latent semantic analysis (LSA) (Deerwester et al. 1990) and blind feedback </p><p>(Jones, K. S., Walker & Robertson 1998). </p><p>LSA considers documents that have many words in common to be semantically close, and documents with few words in common to be semantically distant. By searching and ranking based on related words, LSA can relevant not only documents contain keyword specified in the user query, but also documents contain related keywords. </p><p>Blind feedback, also known as pseudo-relevance feedback and local feedback, uses statistics of terms in the top-ranked documents of initial retrieval results to revise user queries by expanding the set of query terms and/or by adjusting the weights of the query terms. Using the revised queries, ranks of known relevant documents would be improved and other relevant documents missed in the initial rounds would be retrieved. </p><p>128 Future research exploring how these methods can be integrated into the current GIR system would be a major contribution to the improvement of system retrieval performance. </p><p>9.3.5 Fuzzy geographic reasoning </p><p>The three geographic relationships (i.e. part-of, adjacent and same as) defined in current system work well in the grounding and retrieval process discussed in this thesis. </p><p>However, in real world applications, there are more complicated scenarios in which geographic context are affected by the phenomenon of vagueness. Take an everyday example: how can we map the phrase "West of Sydney" to a precisely defined geometric object that can be processed by a computer? Human beings have an extraordinary ability to understand such geographic descriptions and relationships, but this ability is generally influenced by personal experiences. The answer to the above question will be different from one person to another, and at different times. For a GIR system, the question can be answered from two possible perspectives: form a geometric perspective, a middle line between the east and west boundaries of Sydney can be used to determine the west part of Sydney; from a human geography perspective, boundaries of local suburbs can be used to determine the border of "West of Sydney". 
A future geographic relationship model should support both of these approaches by applying semantic analysis of complex linguistic expressions, and symbolic and numeric geographic computation (Bilhaut et al. 2003). Research in this direction will lead to more sophisticated logical and computational geographic matching and reasoning.

9.3.6 High performance computation

The current system is implemented in a single-computer environment. The system's computing performance is limited by the host computer's hardware capabilities (e.g. processor speed, memory size); when the volume of documents and indices increases, the performance decreases. Future work will need to address this issue by researching and developing distributed geographic indexing methods. Traditional geographic indexing structures such as the R-tree (Guttman 1984) and the Quadtree (Finkel & Bentley 1974) require a centralised storage scheme to perform geographic range queries and cover queries. In contrast, a distributed geographic indexing method such as the geographic hash table approach (Ratnasamy et al. 2002) distributes the storage and computational load throughout a network of computers that are organised using a coordinate-based geographic hierarchy. Numerical simulations have shown that geographically distributed index methods can provide efficient index construction and range searches (Greenstein et al. 2003). This research will focus on the integration of the textual indexing scheme with a distributed geographic indexing scheme, and it is hoped that it will result in a scalable, higher-performance computing platform for geographic information indexing and searching.

It is also planned to review and revise the implementation of time-consuming tasks in the system, such as the supervised machine learning procedure for ranking function discovery and the construction of the geographic information entity grounding model, in order to apply parallel computation techniques to reduce the time required for these tasks.

In summary, I believe that the GIR system developed in this thesis is likely to have significant benefit, and there are a number of ways in which this research can be extended to provide high quality geographic information services.

APPENDIX GeoCLEF 2006 Query Topics

<?xml version="1.0" encoding="iso-8859-1" ?> <GeoCLEF-2006-topics-in-English-May-10-2006> <top> <num>GC026</num> <EN-title>Wine regions around rivers in Europe</EN-title> <EN-desc>Documents about wine regions along the banks of European rivers</EN-desc> <EN-narr>Relevant documents describe a wine region along a major river in European countries. To be relevant the document must name the region and the river.</EN-narr> </top>

<top> <num>GC027</num> <EN-title>Cities within 100km of Frankfurt</EN-title> <EN-desc>Documents about cities within 100 kilometers of the city of Frankfurt in Western Germany</EN-desc> <EN-narr>Relevant documents discuss cities within 100 kilometers of Frankfurt am Main Germany, latitude 50.11222, longitude 8.68194. To be relevant the document must describe the city or an event in that city. Stories about Frankfurt itself are not relevant</EN-narr> </top>

<top> <num>GC028</num> <EN-title>Snowstorms in North America</EN-title> <EN-desc>Documents about snowstorms occurring in the north part of the American continent</EN-desc> <EN-narr>Relevant documents state cases of snowstorms and their effects in North America. Countries are Canada, United States of America and Mexico.
Documents about other kinds of storms are not relevant (e.g. rainstorm, thunderstorm, electric storm, windstorm)</EN-narr> </top> </p><p><top> <num>GC029</num> <EN-title>Diamond trade in Angola and South Africa</EN-title> <EN-desc>Documents regarding diamond trade in Angola and South Africa</EN-desc> <EN-narr>Relevant documents are about diamond trading in these two countries and its consequences (e.g. smuggling, economic and political instability)</EN-narr> </top> </p><p>132 <top> <num>GC030</num> <EN-title>Car bombings near Madrid</EN-title> <EN-desc>Documents about car bombings occurring near Madrid</EN-desc> <EN-narr>Relevant documents treat cases of car bombings occurring in the capital of Spain and its outskirts</EN-narr> </top> </p><p><top> <num>GC031</num> <EN-title>Combats and embargo in the northern part of Iraq</EN-title> <EN-desc>Documents telling about combats or embargo in the northern part of Iraq</EN-desc> <EN-narr>Relevant documents are about combats and effects of the 90s embargo in the northern part of Iraq. Documents about these facts happening in other parts of Iraq are not relevant</EN-narr> </top> </p><p><top> <num>GC032</num> <EN-title>Independence movement in Quebec</EN-title> <EN-desc>Documents about actions in Quebec for the independence of this Canadian province</EN-desc> <EN-narr>Relevant documents treat matters related to Quebec independence movement (e.g. referendums) which take place in Quebec</EN-narr> </top> </p><p><top> <num>GC033</num> <EN-title> International sports competitions in the Ruhr area</EN-title> <EN-desc> World Championships and international tournaments in the Ruhr area</EN-desc> <EN-narr> Relevant documents state the type or name of the competition, the city and possibly results. Irrelevant are documents where only part of the competition takes place in the Ruhr area of Germany, e.g. Tour de France, Champions League or UEFA-Cup games.</EN-narr> </top> </p><p><top> <num> GC034 </num> <EN-title> Malaria in the tropics </EN-title> <EN-desc> Malaria outbreaks in tropical regions and preventive vaccination </EN-desc> <EN-narr> Relevant documents state cases of malaria in tropical regions and possible preventive measures like chances to vaccinate against the disease. Outbreaks must be of epidemic scope. Tropics are defined as the region between the Tropic of Capricorn, latitude 23.5 degrees South and the Tropic of Cancer, latitude 23.5 degrees North. Not relevant are documents about a single person's infection. </EN-narr> </top> </p><p>133 <top> <num> GC035 </num> <EN-title> Credits to the former Eastern Bloc </EN-title> <EN-desc> Financial aid in form of credits by the International Monetary Fund or the World Bank to countries formerly belonging to the "Eastern Bloc" aka the Warsaw Pact, except the republics of the former USSR</EN-desc> <EN-narr> Relevant documents cite agreements on credits, conditions or consequences of these loans. The Eastern Bloc is defined as countries under strong Soviet influence (so synonymous with Warsaw Pact) throughout the whole Cold War. Excluded are former USSR republics. Thus the countries are Bulgaria, Hungary, Czech Republic, Slovakia, Poland and Romania. 
Thus not all communist or socialist countries are considered relevant.</EN-narr> </top> </p><p><top> <num> GC036 </num> <EN-title> Automotive industry around the Sea of Japan </EN-title> <EN-desc> Coastal cities on the Sea of Japan with automotive industry or factories </EN-desc> <EN-narr> Relevant documents report on automotive industry or factories in cities on the shore of the Sea of Japan (also named East Sea (of Korea)) including economic or social events happening there like planned joint-ventures or strikes. In addition to Japan, the countries of North Korea, South Korea and Russia are also on the Sea of Japan.</EN-narr> </top> </p><p><top> <num> GC037 </num> <EN-title> Archeology in the Middle East </EN-title> <EN-desc> Excavations and archeological finds in the Middle East </EN-desc> <EN-narr> Relevant documents report recent finds in some town, city, region or country of the Middle East, i.e. in Iran, Iraq, Turkey, Egypt, Lebanon, Saudi Arabia, Jordan, Yemen, Qatar, Kuwait, Bahrain, Israel, Oman, Syria, United Arab Emirates, Cyprus, West Bank, or the Gaza Strip</EN-narr> </top> </p><p><top> <num> GC038 </num> <EN-title> Solar or lunar eclipse in Southeast Asia </EN-title> <EN-desc> Total or partial solar or lunar eclipses in Southeast Asia </EN-desc> <EN-narr> Relevant documents state the type of eclipse and the region or country of occurrence, possibly also stories about people travelling to see it. Countries of Southeast Asia are Brunei, Cambodia, East Timor, Indonesia, Laos, Malaysia, Myanmar, Philippines, Singapore, Thailand and Vietnam. </EN-narr> </top> </p><p>134 </p><p><top> <num> GC039 </num> <EN-title> Russian troops in the southern Caucasus </EN-title> <EN-desc> Russian soldiers, armies or military bases in the Caucasus region south of the Caucasus Mountains </EN-desc> <EN-narr> Relevant documents report on Russian troops based at, moved to or removed from the region. Also agreements on one of these actions or combats are relevant. Relevant countries are: Azerbaijan, Armenia, Georgia, Ossetia, Nagorno-Karabakh. Irrelevant are documents citing actions between troops of nationality different from Russian (with Russian mediation between the two.) </EN-narr> </top> </p><p><top> <num> GC040 </num> <EN-title> Cities near active volcanoes </EN-title> <EN-desc> Cities, towns or villages threatened by the eruption of a volcano </EN-desc> <EN-narr> Relevant documents cite the name of the cities, towns, villages that are near an active volcano which recently had an eruption or could erupt soon. Irrelevant are reports which do not state the danger (i.e. for example necessary preventive evacuations) or the consequences for specific cities , but just tell that a particular volcano (in some country) is going to erupt, has erupted or that a region has active volcanoes. </EN-narr> </top> </p><p><top> <num>GC041</num> <EN-title>Shipwrecks in the Atlantic Ocean</EN-title> <EN-desc>Documents about shipwrecks in the Atlantic Ocean</EN-desc> <EN-narr>Relevant documents should document shipwreckings in any part of the Atlantic Ocean or its coasts.</EN-narr> </top> </p><p><top> <num>GC042</num> <EN-title>Regional elections in Northern Germany</EN-title> <EN-desc>Documents about regional elections in Northern Germany</EN-desc> <EN-narr>Relevant documents are those reporting the campaign or results for the state parliaments of any of the regions of Northern Germany. The states of northern Germany are commonly Bremen, Hamburg, Lower Saxony, Mecklenburg-Western Pomerania and Schleswig-Holstein. 
Only regional elections are relevant; municipal, national and European elections are not.</EN-narr> </top> </p><p>135 <top> <num>GC043</num> <EN-title>Scientific research in New England Universities</EN-title> <EN-desc>Documents about scientific research in New England universities</EN-desc> <EN-narr>Valid documents should report specific scientific research or breakthroughs occurring in universities of New England. Both current and past research are relevant. Research regarded as bogus or fraudulent is also relevant. New England states are: Connecticut, Rhode Island, Massachusetts, Vermont, New Hampshire, Maine. </EN-narr> </top> </p><p><top> <num>GC044</num> <EN-title>Arms sales in former Yugoslavia</EN-title> <EN-desc>Documents about arms sales in former Yugoslavia</EN-desc> <EN-narr>Relevant documents should report on arms sales that took place in the successor countries of the former Yugoslavia. These sales can be legal or not, and to any kind of entity in these states, not only the government itself. Relevant countries are: Slovenia, Macedonia, Croatia, Serbia and Montenegro, and Bosnia and Herzegovina. </EN-narr> </top> </p><p><top> <num>GC045</num> <EN-title>Tourism in Northeast Brazil</EN-title> <EN-desc>Documents about tourism in Northeastern Brazil</EN-desc> <EN-narr>Of interest are documents reporting on tourism in Northeastern Brazil, including places of interest, the tourism industry and/or the reasons for taking or not a holiday there. The states of northeast Brazil are Alagoas, Bahia, Ceará, Maranhão, Paraíba, Pernambuco, Piauí, Rio Grande do Norte and Sergipe.</EN-narr> </top> </p><p><top> <num>GC046</num> <EN-title>Forest fires in Northern Portugal</EN-title> <EN-desc>Documents about forest fires in Northern Portugal</EN-desc> <EN-narr>Documents should report the ocurrence, fight against, or aftermath of forest fires in Northern Portugal. The regions covered are Minho, Douro Litoral, Trás-os-Montes and Alto Douro, corresponding to the districts of Viana do Castelo, Braga, Porto (or Oporto), Vila Real and Bragança. </EN-narr> </top> </p><p>136 <top> <num>GC047</num> <EN-title>Champions League games near the Mediterranean </EN-title> <EN-desc>Documents about Champion League games played in European cities bordering the Mediterranean </EN-desc> <EN-narr>Relevant documents should include at least a short description of a European Champions League game played in a European city bordering the Mediterranean Sea or any of its minor seas. European countries along the Mediterranean Sea are Spain, France, Monaco, Italy, the island state of Malta, Slovenia, Croatia, Bosnia and Herzegovina, Serbia and Montenegro, Albania, Greece, Turkey, and the island of Cyprus.</EN-narr> </top> </p><p><top> <num>GC048</num> <EN-title>Fishing in Newfoundland and Greenland</EN-title> <EN-desc>Documents about fisheries around Newfoundland and Greenland</EN-desc> <EN-narr>Relevant documents should document fisheries and economical, ecological or legal problems associated with it, around Greenland and the Canadian island of Newfoundland. </EN-narr> </top> </p><p><top> <num>GC049</num> <EN-title>ETA in France</EN-title> <EN-desc>Documents about ETA activities in France</EN-desc> <EN-narr>Relevant documents should document the activities of the Basque terrorist group ETA in France, of a paramilitary, financial, political nature or others. 
</EN-narr> </top> </p><p><top> <num>GC050</num> <EN-title>Cities along the Danube and the Rhine</EN-title> <EN-desc>Documents describe cities in the shadow of the Danube or the Rhine</EN-desc> <EN-narr>Relevant documents should contain at least a short description of cities through which the rivers Danube and Rhine pass, providing evidence for it. The Danube flows through nine countries (Germany, Austria, Slovakia, Hungary, Croatia, Serbia, Bulgaria, Romania, and Ukraine). Countries along the Rhine are Liechtenstein, Austria, Germany, France, the Netherlands and Switzerland. </EN-narr> </top> </GeoCLEF-2006-topics-in-English-May-10-2006> </p><p>137 REFERENCE </p><p>Amitay, E, Har'El, N, Sivan, R & Soffer, A (2004), 'Web-a-Where: <a href="/tags/Geotagging/" rel="tag">Geotagging</a> Web Content', in Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield, United Kingdom, 25-29 July, 2004. pp. 273-280. </p><p>Andrade, L & Silva, MJ (2006), 'Indexing Structures for Geographic Web Retrieval', in Proceedings of the Conference on Mobile and Ubiquitous Systems, Guimarães, Portugal, 29-30 July, 2006. http://xldb.fc.ul.pt/data/Publications_attach/andrade-141.pdf. </p><p>Andrienko, GL (1999), 'Interactive Maps for Visual Data Exploration', International Journal of Geographical Information Science, vol. 13, no. 4, pp. 355-374. </p><p>Araújo, M, Navarro, G & Ziviani, N (1997), 'Large Text Searching Allowing Errors', in Proceedings of the 4th South American Workshop on String Processing, Valparaíso, Chile, 14-15 November, 1997. pp. 2-20. </p><p>Baeza-Yates, R & Ribeiro-Neto, B (1999), Modern Information Retrieval, Addison Wesley, 513. </p><p>Bartell, BT (1994), 'Optimizing Ranking Functions: A Connectionist Approach to Adaptive Information Retrieval', PhD thesis, University of California. 268. </p><p>Bartell, BT, Cottrell, GW & Belew, RK (1994), 'Automatic Combination of Multiple Ranked Retrieval Systems', in Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, 3-6 July, 1994. pp. 173-181. </p><p>Beard, KM & Sharma, VM (1997), 'Multidimensional Ranking for Data in Digital Spatial Libraries', International Journal on Digital Libraries, vol. 1, no. 2, pp. 153-160. </p><p>Belew, RK (1989), ' Adaptive Information Retrieval: using a Connectionist Representation to Retrieve and Learn about Documents', in Proceedings of the 12th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Cambridge, Massachusetts, United States, 25 -28 June, 1989. pp. 11-20. </p><p>Bilhaut, F, Charnois, T, Enjalbert, P & Mathet, Y (2003), 'Geographic Reference Analysis for Geographic Document Querying', in Proceedings of the HTL/NAACL Workshop on the Analysis of Geographic References, Edmonton, Alberta, Canada, 31 May-1 June, 2003. pp. 55-62. </p><p>Brandt, L, Hill, LL & Goodchild, M (1999), Digital Gazetteer Information Exchange (DGIE), viewed 1 March 2007, http://www.alexandria.ucsb.edu/gazetteer/dgie/DGIE_website/DGIEfinal.pdf. </p><p>Brown, I (1999), 'Developing a Virtual Reality User Interface (VRUI) for Geographic Information Retrieval on the Internet', Transactions in GIS, vol. 3, no. 3, pp. 207-220. </p><p>Burrough, PA & McDonnell, RA (1998), Principles of Geographical Information Systems, Oxford University Press, 356. 
</p><p>Buyukkokten, O, Cho, J, Garcia-Molina, H, Gravano, L & Shivakumar, N (1999), 'Exploiting Geographical Location Information of Web Pages', in Proceedings of the 2nd International Workshop on the Web and Databases, Philadelphia, Pennsylvania, United States, 4-15 June, 1999. pp. 91-96. </p><p>Cai, G (2002), 'GeoVSM: An Integrated Retrieval Model for Geographic Information', in Proceedings of the 2nd International Conference on Geographic Information Science, Boulder, Colorado, United States, 25-28 September, 2002. pp. 65-79. </p><p>Chaves, MS, Silva, MJ & Martins, B (2005), 'A Geographic Knowledge Base for Semantic Web Applications', in Proceedings of the 20th Brazilian Symposium on Databases, Uberlândia, Minas </p><p>138 Gerais, Brazil, 3-7 October, 2005. pp. 40-54. </p><p>Clough, P, Grubinger, M, Deselaers, T, Hanbury, A & Müller, H (2006), 'Overview of the ImageCLEF 2006 Photographic Retrieval and Object Annotation Tasks', in Proceedings of the 2006 CLEF Workshop, Alicante, Spain, 20-22 September, 2006. http://www.clef-campaign.org/2006/working_notes/workingnotes2006/cloughOCLEF2006.pdf. </p><p>Cohen, PR & Kjeldsen, R (1987), 'Information Retrieval by Constrained Spreading Activation in Semantic Networks', Information Processing and Management: an International Journal, vol. 23, no. 4, pp. 255-268. </p><p>Couclelis, H (1998), 'Worlds of Information: The Geographic Metaphor in the Visualization of Complex Information', Cartography and Geographic Information Systems, vol. 25, no. 4, pp. 209-220. </p><p>Crestani, F (1997), 'Application of Spreading Activation Techniques in Information Retrieval', Artificial Intelligence Review, vol. 11, no. 6, pp. 453-482. </p><p>Deerwester, S, Dumais, ST, Furnas, GW, Landauer, TK & Harshman, R (1990), 'Indexing by Latent Semantic Analysis', Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407. </p><p>Duda, RO & Hart, PE (1973), Pattern Classification and Scene Analysis, John Wiley & Sons Inc, 482. </p><p>Dyer, CR, Rosenfeld, A & Samet, H (1980), 'Region Representation: Boundary Codes from Quadtrees', Communication of the ACM, vol. 23, no. 3, pp. 171-179. </p><p>Egenhofer, M (2002), 'Toward the Semantic Geospatial Web', in Proceedings of the 10th ACM International Symposium on Advances in Geographic Information Systems, McLean, Virginia, United States, 8-9 November, 2002. pp. 1-4. </p><p>Fan, WP, Gordon, MDP & Pathak, PP (2005), 'Genetic Programming-based Discovery of Ranking Functions for Effective Web Search', Journal of Management Information Systems, vol. 21, no. 4, pp. 37-56. </p><p>Finkel, RA & Bentley, JL (1974), 'Quad Trees: A Data Structure for Retrieval on Composite Keys', Acta Informatica, vol. 4, no. 1, pp. 1-9. </p><p>Fox, EA & Shaw, JA (1994), 'Combination of Multiple Searches', in Proceedings of the 2nd Text Retrieval Conference (TREC-2), Gaithersburg, Maryland, United States, 31 August-2 September, 1993. pp. 243-252. </p><p>Gale, W, Church, K & Yarowsky, D (1992), 'One Sense per Discourse', in Proceedings of the 5th DARPA Speech and Natural Language Workshop, Harriman, New York, United States, 23-26 February, 1992. pp. 233-237. </p><p>Garbin, E & Mani, I (2005), 'Disambiguating Toponyms in News', in Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Vancouver, British Columbia, Canada, 6-8 October 2005. pp. 363-370. 
</p><p>Gelgi, F, Vadrevu S & Davulcu, H (2005), ' Improving Web Data Annotations with Spreading Activation', in Proceedings of the 6th International Conference on Web Information Systems Engineering, New York, New York, United States, 20-22 November 2005. pp. 95-106. </p><p>Gerard, S (1989), Automatic Text Processing: the Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley Longman Publishing Co., Inc., 530. </p><p>Gey, F (1994), 'Inferring Probability of Relevance using the Method of Logistic Regression', in Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, 3-6 July, 1994. pp. 222-231. </p><p>139 Gey, F, Larson, R, Sanderson, M, Bischoff, K, Mandl, T, Womser-Hacker, C, Santos, D, Rocha, P, Nunzio, GMD & Ferro, N (2006), 'Geoclef 2006: The CLEF 2006 Cross-Language Geographic Information Retrieval Track Overview', in Proceedings of the 2006 CLEF Workshop, Alicante, Spain, 20-22 September, 2006. http://www.linguateca.pt/documentos/Geyetal2006.pdf. </p><p>Gey, F, Larson, R, Sanderson, M, Joho, H, Clough, P & Petras, V (2005), 'Geoclef: The CLEF 2005 Cross-Language Geographic Information Retrieval Track Overview', in Proceedings of the 2005 CLEF Workshop, Vienna, Austria, 21-23 September, 2005. http://clef.isti.cnr.it/2005/working_notes/workingnotes2005/sanderson05.pdf. </p><p>Giuliano, VE & Jones, PE (1962), 'Linear Associative Information Retrieval', in PW Howerton (ed.), Vistas in Information Handling, Spartan Books, pp. 30-54. </p><p>Göbel, S & Klein, P (2002), 'Ranking Mechanisms in Meta-data Information Systems for Geo-spatial Data', in Proceedings of the Workshop on Earth Observation and Geo-Spatial Data, Ispra, Italy, 13-15 May, 2002. http://hci.uni-konstanz.de/downloads/eogeo2002.html. </p><p>Goldberg, DE (1989), Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley Professional, 432. </p><p>Golder, SA & Huberman, BA (2006), 'Usage Patterns of Collaborative Tagging Systems', Journal of Information Science, vol. 32, no. 2, pp. 198-208. </p><p>Graham-Rowe, D (2005), 'Gamers Turn Cities into a Battleground', New Scientist, vol. 186, no. 2503, pp. 26-27. </p><p>Greenstein, B, Estrin, D, Govindan, R, Ratnasamy, S & Shenker, S (2003), 'DIFS: A Distributed Index for Features in Sensor Networks', in Proceedings of the 1st IEEE International Workshop on Sensor Network Protocols and Applications, Anchorage, Alaska, United States, 11 May, 2003. pp. 163-173. </p><p>Guttman, A (1984), 'R-Trees: A Dynamic Index Structure for Spatial Searching', in Proceedings of the ACM SIGMOD International Conference on Management of Data, Boston, Massachusetts, United States, 18-21 June, 1984. pp. 47-57. </p><p>Guy, M & Tonkin, E (2006), '<a href="/tags/Folksonomy/" rel="tag">Folksonomies</a>: Tidying up Tags?' D-Lib Magazine, vol. 12, no. 1. http://www.dlib.org/dlib/january06/guy/01guy.html. </p><p>Hammersley, B (2003), Content Syndication with RSS, O'Reilly, 256. </p><p>Hellerstein, JM, Naughton, JF & Pfeffer, A (1995), 'Generalized Search Trees for Database Systems', in Proceedings of the 21st International Conference on Very Large Data Bases, Zurich, Switzerland, 11-15 September, 1995. pp. 562-573. </p><p>Hill, LL (1998), Building Georeferenced Collections Gazetteer Services, viewed 2 March 2006, http://www.alexandria.ucsb.edu/~lhill/Gazetteer_Taxonomy_Presentation/Taxonomy_presentatio n.htm. 
</p><p>Hill, LL, Frew, J & Zheng, Q (1999), 'Geographic Names: The Implementation of a Gazetteer in a Georeferenced Digital Library', D-Lib Magazine, vol. 5, no. 1. http://www.dlib.org/dlib/january99/hill/01hill.html. </p><p>Hiramatsu, K & Reitsma, F (2004), 'GeoReferencing the Semantic Web: Ontology based Markup of Geographically Referenced Information', in Proceedings of the Joint EuroSDR/EuroGeographics Workshop on Ontologies and Schema Translation Services, Paris, France, 15-16 April 2004. http://www.mindswap.org/2004/geo/geoStuff_files/HiramatsuReitsma04GeoRef.pdf. </p><p>Hirschman, L & Chinchor, N (1998), 'MUC-7 Coreference Task Definition', in Proceedings of the 7th Message Understanding Conferences (MUC-7), Fairfax, Virginia, United States, 29 April-1 May, 1998. http://www-nlpir.nist.gov/related_projects/muc/proceedings/co_task.html. </p><p>140 </p><p>Hix, D & Hartson, HR (1993), Developing User Interfaces: Ensuring Usability through Product & Process, John Wiley & Sons, Inc., 416. </p><p>Hosmer, DW & Lemeshow, S (1989), Applied Logistic Regression, Wiley, 392. </p><p>Humphreys, K, Gaizauskas, R, Azzam, S, Huyck, C, Mitchell, B, Cunningham, H & Wilks, Y (1998), 'University of Sheffield: Description of the Lasie-II System as used for MUC-7', in Proceedings of the 7th Message Understanding Conferences (MUC-7), Fairfax, Virginia, United States, 29 April-1 May, 1998. http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_proceedings/sheffie ld_muc7.ps. </p><p>Ide, N & Veronis, J (1998), 'Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art', Computational Linguistics, vol. 24, no. 1, pp. 1-40. </p><p>Janée, G, Frew, J & Hill, LL (2004), 'Issues in Georeferenced Digital Libraries', D-Lib Magazine, vol. 10, no. 5. http://www.dlib.org/dlib/may04/janee/05janee.html. </p><p>Jobling, MA (2001), 'In the Name of the Father: Surnames and Genetics', Trends Genet, vol. 17, pp. 353-357. </p><p>Jones, CB, Abdelmoty, AI & Fu, G (2003), 'Maintaining Ontologies for Geographical Information Retrieval on the Web', Lecture Notes in Computer Science, vol. 2888, pp. 934-951. </p><p>Jones, CB, Alani, H & Tudhope, D (2002), 'Geographical Terminology Servers - Closing the Semantic Divide', in M Goodchild, M Duckham & M Worboys (eds), Perspectives on Geographic Information Science, Taylor and Francis, pp. 201-218. </p><p>Jones, KS, Walker, S & Robertson, S (1998), A Probabilistic Model of Information Retrieval: Development and Status, TR 446, Computer Laboratory, University of Cambridge. </p><p>Jong, KAD (1975), 'An Analysis of the Behavior of a Class of Genetic Adaptive Systems', PhD thesis, University of Michigan. 266. </p><p>Junyan, D, Luis, G & Narayanan, S (2000), 'Computing Geographical Scopes of Web Resources', in Proceedings of the 26th International Conference on Very Large Data Bases, Cairo, Egypt, 10-14 September, 2000. pp. 545-556. </p><p>Karimi, HA & Hammad, A (eds) 2004, Telegeoinformatics: Location-based Computing and Services, CRC Press, 392. </p><p>Kavouras, M, Kokla, M & Tomai, E (2005), 'Comparing Categories among Geographic Ontologies', Computers & Geosciences, vol. 31, no. 2, pp. 145-154. </p><p>Koza, JR (1992), Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press, 840. </p><p>Larson, R (1996), 'Geographic Information Retrieval and Spatial Browsing', in LC Smith & M Gluck (eds), Geographic Information Systems and Libraries: Patrons, Maps, and Spatial Information, University of Illinois, pp. 81-124. 
Larson, R & Frontiera, P (2004), 'Spatial Ranking Methods for Geographic Information Retrieval (GIR) in Digital Libraries', Lecture Notes in Computer Science, vol. 3232, pp. 45-56.
Lee, J (1999), 'Context-Sensitive Vocabulary Mapping with a Spreading Activation Network', in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, California, United States, 15-19 August, 1999. pp. 198-205.
Lee, JH (1997), 'Analyses of Multiple Evidence Combination', in Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Philadelphia, Pennsylvania, United States, 27-31 July, 1997. pp. 267-276.
Leidner, JL, Sinclair, G & Webber, B (2003), 'Grounding Spatial Named Entities for Information Extraction and Question Answering', in Proceedings of the HLT-NAACL Workshop on the Analysis of Geographic References, Edmonton, Alberta, Canada, 31 May-1 June, 2003. pp. 31-38.
Li, H, Srihari, RK, Niu, C & Li, W (2002), 'Location Normalization for Information Extraction', in Proceedings of the 19th International Conference on Computational Linguistics, Taipei, Taiwan, 24 August-1 September, 2002. pp. 1-7.
Lim, EP, Goh, DHL, Liu, Z, Ng, WK, Khoo, CSG & Higgins, SE (2002), 'G-Portal: A Map-based Digital Library for Distributed Geospatial and Georeferenced Resources', in Proceedings of the 2nd ACM/IEEE-CS Joint Conference on Digital Libraries, Portland, Oregon, United States, 14-18 July, 2002. pp. 351-358.
MacEachren, AM & Kraak, MJ (1997), 'Exploratory Cartographic Visualization: Advancing the Agenda', Computers & Geosciences, vol. 23, no. 4, pp. 335-343.
Manber, U & Myers, G (1993), 'Suffix Arrays: A New Method for On-line String Searches', SIAM Journal on Computing, vol. 22, no. 5, pp. 935-948.
Markowetz, A, Chen, YY, Suel, T, Long, X & Seeger, B (2005), 'Design and Implementation of a Geographic Search Engine', in Proceedings of the 8th International Workshop on the Web and Databases, Baltimore, Maryland, United States, 16-17 June, 2005. pp. 19-24.
Martins, B, Silva, MJ & Andrade, L (2005), 'Indexing and Ranking in Geo-IR Systems', in Proceedings of the Workshop on Geographic Information Retrieval, Bremen, Germany, 31 October-5 November, 2005. pp. 31-34.
Martins, B, Silva, MJ & Chaves, MS (2005), 'Challenges and Resources for Evaluating Geographical IR', in Proceedings of the Workshop on Geographic Information Retrieval, Bremen, Germany, 31 October-5 November, 2005. pp. 65-69.
McDonald, D (1993), 'Internal and External Evidence in the Identification and Semantic Categorization of Proper Names', in B Boguraev & J Pustejovsky (eds), Corpus Processing for Lexical Acquisition, MIT Press, pp. 21-39.
Mikheev, A, Grover, C & Moens, M (1998), 'Description of the LTG System used for MUC-7', in Proceedings of the 7th Message Understanding Conference (MUC-7), Fairfax, Virginia, United States, 29 April-1 May, 1998. http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_proceedings/ltg_muc7.pdf.
Mohan, R, Smith, JR & Li, CS (1999), 'Adapting Multimedia Internet Content for Universal Access', IEEE Transactions on Multimedia, vol. 1, no. 1, pp. 104-114.
Nielsen, J & Molich, R (1990), 'Heuristic Evaluation of User Interfaces', in Proceedings of the Conference on Human Factors in Computing Systems: Empowering People, Seattle, Washington, United States, 1-5 April, 1990. pp. 249-256.
NIMA TR 8350.2 (1997), 'Department of Defense World Geodetic System 1984: Its Definition and Relationships with Local Geodetic Systems', Third Edition, National Imagery and Mapping Agency, viewed 1 March 2008, http://earth-info.nga.mil/GandG/publications/tr8350.2/wgs84fin.pdf.
Nissim, M, Matheson, C & Reid, J (2004), 'Recognising Geographical Entities in Scottish Historical Documents', in Proceedings of the Workshop on Geographic Information Retrieval at SIGIR 2004, Sheffield, United Kingdom, 25-29 July, 2004. http://www.ltg.ed.ac.uk/seer/papers/gir2004.pdf.
Norbert, B, Hans-Peter, K, Ralf, S & Bernhard, S (1990), 'The R*-tree: An Efficient and Robust Access Method for Points and Rectangles', in Proceedings of the ACM SIGMOD International Conference on Management of Data, Atlantic City, New Jersey, United States, 23-25 May, 1990. pp. 322-331.
O'Reilly, T (2005), What Is Web 2.0: Design Patterns and Business Models for the Next Generation of Software, viewed 15 March 2007, http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html.
Peng, ZR & Tsou, MH (2003), Internet GIS: Distributed Geographic Information Services for the Internet and Wireless Networks, Wiley, 720.
Perlman, G (2001), Web-based User Interface Evaluation with Questionnaires, viewed 1 March 2006, http://www.acm.org/perlman/question.html.
Plewe, B (1997), GIS Online: Information Retrieval, Mapping, and the Internet, OnWord Press, 336.
Porter, MF (1980), 'An Algorithm for Suffix Stripping', Program, vol. 14, no. 3, pp. 130-137.
Quinlan, JR (1986), 'Induction of Decision Trees', Machine Learning, vol. 1, no. 1, pp. 81-106.
Rainie, L (2005), The State of Blogging, viewed 13 September 2005, http://www.pewinternet.org/pdfs/PIP_blogging_data.pdf.
Rao, B & Minakakis, L (2003), 'Evolution of Mobile Location-based Services', Communications of the ACM, vol. 46, no. 12, pp. 61-65.
Ratnasamy, S, Karp, B, Yin, L, Yu, F, Estrin, D, Govindan, R & Shenker, S (2002), 'GHT: A Geographic Hash Table for Data-Centric Storage', in Proceedings of the 1st ACM International Workshop on Wireless Sensor Networks and Applications, Atlanta, Georgia, United States, 28 September, 2002. pp. 78-87.
Rauch, E, Bukatin, M & Baker, K (2003), 'A Confidence-based Framework for Disambiguating Geographic Terms', in Proceedings of the HLT-NAACL Workshop on the Analysis of Geographic References, Edmonton, Alberta, Canada, 31 May-1 June, 2003. pp. 50-54.
Reed, C (2006), An Introduction to GeoRSS: A Standards based Approach for Geo-enabling RSS Feeds, viewed 1 August 2006, http://www.opengeospatial.org/pt/06-050r3.
Rizos, C, Yan, TS & Kinlyside, DA (2004), 'Development of SydNET Permanent Real-Time GPS Network', Journal of GPS, vol. 3, no. 1-2, pp. 296-301.
Rumelhart, DE & Norman, DA (1983), 'Representation in Memory', in AM Aitkenhead & JM Slack (eds), Issues in Cognitive Modeling, Lawrence Erlbaum Associates, London, pp. 15-62.
Salton, G (1971), The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice-Hall, 556.
Savoy, J & Rasolofo, Y (2002), 'Report on the TREC 11 Experiment: Arabic, Named Page and Topic Distillation Searches', in The Eleventh Text REtrieval Conference (TREC 2002), National Institute of Standards and Technology (NIST), online publication: http://trec.nist.gov/pubs/trec11/t11_proceedings.html.
Shipman, FM & Catherine, CM (1999), 'Spatial Hypertext: An Alternative to Navigational and Semantic Links', ACM Computing Surveys, vol. 31, no. 4, p. 14.
Shneiderman, B (1996), 'The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations', in Proceedings of the IEEE Symposium on Visual Languages, Boulder, Colorado, United States, 3-6 September, 1996. pp. 336-343.
Skupin, A (2000), 'From Metaphor to Method: Cartographic Perspectives on Information Visualization', in Proceedings of the IEEE Symposium on Information Visualization, Salt Lake City, Utah, United States, 9-10 October, 2000. pp. 91-97.
Smith, DA & Crane, G (2001), 'Disambiguating Geographic Names in a Historical Digital Library', in Proceedings of the 5th European Conference on Research and Advanced Technology for Digital Libraries, Darmstadt, Germany, 4-9 September, 2001. pp. 127-136.
Smith, DA & Mann, GS (2003), 'Bootstrapping Toponym Classifiers', in Proceedings of the HLT-NAACL Workshop on the Analysis of Geographic References, Edmonton, Alberta, Canada, 31 May-1 June, 2003. pp. 45-49.
Southall, H (2003), 'Defining and Identifying the Roles of Geographic References within Text: Examples from the Great Britain Historical GIS Project', in Proceedings of the HLT-NAACL Workshop on the Analysis of Geographic References, Edmonton, Alberta, Canada, 31 May-1 June, 2003. pp. 69-78.
Souza, LA, Davis Jr, CA, Borges, KAV, Delboni, TM & Laender, AHF (2005), 'The Role of Gazetteers in Geographic Knowledge Discovery on the Web', in Proceedings of the 3rd Latin American Web Congress, Buenos Aires, Argentina, 31 October-2 November, 2005. pp. 157-165.
Tollmar, K, Yeh, T & Darrell, T (2004), 'IDeixis: Image-based Deixis for Finding Location-based Information', Lecture Notes in Computer Science, vol. 3160, pp. 288-299.
Tomlin, CD (1990), Geographic Information Systems and Cartographic Modeling, Prentice Hall College Div, 249.
Trotman, AS (2005), 'Learning to Rank', Information Retrieval, vol. 8, no. 3, pp. 359-381.
Vaid, S, Jones, CB, Joho, H & Sanderson, M (2005), 'Spatio-Textual Indexing for Geographical Search on the Web', in Proceedings of the 9th International Symposium on Spatial and Temporal Databases, Angra dos Reis, Brazil, 22-24 August, 2005. pp. 218-235.
van Kreveld, M, Reinbacher, I, Arampatzis, A & van Zwol, R (2004), 'Distributed Ranking Methods for Geographic Information Retrieval', in Proceedings of the 20th European Workshop on Computational Geometry, Seville, Spain, 24-26 March, 2004. pp. 225-228.
Virrantaus, K, Markkula, J, Garmash, A, Terziyan, V, Veijalainen, J, Katanosov, A & Tirri, H (2001), 'Developing GIS-supported Location-based Services', in Proceedings of the 2nd International Conference on Web Information Systems Engineering, Kyoto, Japan, 3-6 December, 2001. pp. 66-75.
Vogt, CC & Cottrell, GW (1998), 'Predicting the Performance of Linearly Combined IR Systems', in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 24-28 August, 1998. pp. 190-196.
Wang, L, Wang, C, Xie, X, Forman, J, Lu, Y, Ma, WY & Li, Y (2005), 'Detecting Dominant Locations from Search Queries', in Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, 15-19 August, 2005. pp. 424-431.
Witten, IH & Frank, E (2005), Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 560.
Woodruff, AG & Plaunt, C (1994), 'GIPSY: Georeferenced Information Processing System', Journal of the American Society for Information Science, vol. 45, no. 9, pp. 645-655.
Yangarber, R & Grishman, R (1998), 'NYU: Description of the Proteus/PET System as used for MUC-7 ST', in Proceedings of the 7th Message Understanding Conference (MUC-7), Fairfax, Virginia, United States, 29 April-1 May, 1998. http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_proceedings/nyu_st_paper.pdf.
Zhang, VW, Rey, B, Stipp, E & Jones, R (2006), 'Geomodification in Query Rewriting', in Proceedings of the 3rd Workshop on Geographic Information Retrieval, Seattle, Washington, United States, 6-11 August, 2006. http://www.geo.unizh.ch/~rsp/gir06/papers/individual/zhang_jones.pdf.
Zipf, A & Malaka, R (2001), 'Developing Location based Services for Tourism - the Service Providers View', in Proceedings of the 8th International Conference in Information and Communication Technologies in Tourism, Montreal, Quebec, Canada, 24-27 April, 2001. pp. 83-92.
Zong, W, Wu, D, Sun, A, Lim, EP & Goh, DHL (2005), 'On Assigning Place Names to Geography Related Web Pages', in Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries, Denver, Colorado, United States, 7-11 June, 2005. pp. 354-362.