A Geospatial Web Approach to Exploring Online Epidemiological Information

Thesis

Presented in Partial Fulfillment of the Requirements for the Degree Master of Arts in the Graduate School of The Ohio State University

By

Qian Hao, M.A.

Graduate Program in Geography

The Ohio State University

2011

Thesis Committee:

Ningchuan Xiao, Advisor

Mei-Po Kwan

Ola Ahlqvist

Copyright by

Qian Hao

2011

Abstract

The web provides a tremendous amount of information about diseases and their environments, and much of this information has a geographic context. Effectively exploring such information, however, presents a significant challenge to GIScience research because the data are often poorly organized on the web. Commonly used search engines such as Google provide only a list of raw web pages, which often do not contribute to discovering knowledge about the diseases. In this thesis, a geospatial web approach will be developed for efficiently exploring online epidemiological information. A geospatial web organizes information based on geographic and ontological relationships rather than merely on keywords. We will focus on news articles about foot-and-mouth disease and construct an ontology that specifies the relationships among relevant diseases, geographical terms, social and economic concepts, and cultural contexts.

A prototype of this geospatial web approach will contain several components: a short list of news articles about foot-and-mouth disease, a map showing where the events in these articles happened or were reported, an ontology graph of the domain, a list of news articles on topics related to the search term, and a list of news about events that happened nearby. This prototype not only allows people to explore closely related information in terms of semantics and location but also provides an effective way to visualize and analyze such information.

Key words: geospatial web, ontology, epidemiology, knowledge, foot-and-mouth disease


Dedication

This thesis is dedicated to my parents.

Acknowledgments

I would like to thank my advisor, Dr. Ningchuan Xiao, for his great help with my thesis and for the considerable time he spent discussing this research and advising me on it. I also want to thank Dr. Mei-Po Kwan and Dr. Ola Ahlqvist for their comments and suggestions on my thesis. Special thanks go to Dr. Rebecca Garabed, Dr. Laura Pomero and Dr. Mark Moritz for their helpful input on the foot-and-mouth disease ontology used in this research.

I am also grateful to my friends, Yanfei Yin and Rong Cong, for their support and help, and to my department colleague, Wei Chen, for his help with some technical issues.

I especially thank my mother; without her support and love, I could not have finished my thesis and my studies.


Vita

Jul. 2009...... B.S. Geography, Central South University

Jan. 2011 to Aug. 2011...... Graduate Research Associate, Department of Geography, The Ohio State University

Fields of Study

Major Field: Geography

Table of Contents

Abstract

Dedication

Acknowledgments

Vita

Table of Contents

List of Tables

List of Figures

CHAPTER 1: Introduction

CHAPTER 2: Methodology and Framework

CHAPTER 3: Implementation of the Geospatial

CHAPTER 4: Application Interface and Results

CHAPTER 5: Conclusions

REFERENCE

Appendix A: User's Manual

Appendix B: Sample saved news

List of Tables

Table 1. The results of locations extraction methods' precision

Table 2. The summary of main location detection method

Table 3. Thesaurus for concepts in the ontology

List of Figures

Figure 1. A simple communication between client and server

Figure 2. An example of using CGI technique

Figure 3. A common communication between client and server

Figure 4. An example of how Ajax technique works

Figure 5. A framework of a geospatial web application

Figure 6. Workflow of developing a geospatial web application for exploring online epidemiological information

Figure 7. Code snippet of using Perl to make a news search request

Figure 8. Code snippet of saving the news as a text file

Figure 9. Code snippet of encoding process

Figure 10. Encoded file on the server

Figure 11. Webpage table for storing attributes for every news

Figure 12. Html table for storing html source code of each news

Figure 13. Code snippet of saving news into database using Perl

Figure 14. Code snippet for saving location information into database

Figure 15. Code snippet for locations extraction

Figure 16. Locations extraction results for one particular news

Figure 17. Example of equal search criteria and sub-string search criteria

Figure 18. Records contains “Columbus”

Figure 19. Location table for storing locations in each news

Figure 20. A tentative ontology system for foot-and-mouth disease domain

Figure 21. Ontology graph drawn by HTML

Figure 22. Logical process for the implementation of the web application

Figure 23. The overview of the semantic geospatial web application for exploring epidemiological information

Figure 24. The section for active news

Figure 25. The section for showing map with locations

Figure 26. Example of exploring information by clicking the map

Figure 27. The section for list five geographic related news to the active news

Figure 28. Example of before clicking on the geographic related news

Figure 29. Example of after clicking on the geographic related news

Figure 30. Sections for showing ontologically related news and the ontology graph

Figure 31. Example of before clicking on the ontology graph

Figure 32. Example of after clicking on the ontology graph

Chapter 1: Introduction

1.1 Background

The World Wide Web (WWW) has evolved into an enormous data repository serving all kinds of purposes, and it has become increasingly difficult to retrieve desired information from it efficiently. Traditional information retrieval methods, typically built around Internet search engines such as Google and Yahoo!, often do not satisfy users' information needs because they do not capture the syntactic and semantic aspects of words. Current methods for searching and presenting information are limited to keyword searches and present the results as lists of web pages, many of which are irrelevant to the searched concepts. With the growth and development of the Internet, more efficient information retrieval techniques and presentation methods are needed.

The problem with current methods for searching and presenting information is that they treat the searched terms as strings without meaning, so computers cannot filter out irrelevant information. To fill this gap, Tim Berners-Lee introduced the concept of the “semantic web” as the next generation of the current web and as an environment that enables computers and humans to cooperate based on well-defined meanings (Berners-Lee et al. 2001). In this way, the web will become a knowledge-based web that provides qualitative services by considering the underlying semantics of data, web pages, and other web sources (Ding et al. 2002). The semantic web will be able to understand, identify, integrate, and filter different kinds of information from various sources and return more relevant results than current web search methods. For example, in a semantic web, when users purchase flight tickets or book a hotel, the system could compare their calendars and schedules and return only those results without a time conflict.

A geospatial semantic web is a kind of semantic web that pays particular attention to geospatial context (Egenhofer, 2002). Because of the large amount of geographic information on the web, the spatial properties of objects can be an important factor in online searches, and geospatial queries will therefore play an important role in information searching. Unlike currently available geographic searches, such as Google Maps, the geospatial semantic web treats geographic information as a search criterion and returns the results that contain the given geographic information. Whereas the semantic web organizes information based on its underlying semantics, the geospatial semantic web organizes information based on both its semantics and its geographic relationships, for example by presenting information within a certain region.

To implement a geospatial semantic web, the first step is to assign meanings and explicit structures to web content so that it can be interpreted and processed intelligently by computers. For example, eXtensible Markup Language (XML) gives the text on web pages a structure. XML classifies the content into different classes and assigns tags to each class. In XML-based languages, a place name such as “Columbus” can be enclosed in a tag that marks it as a city. In this format, computers will treat “Columbus” as a city rather than something else, and it will be recognized as a polygon-type spatial object. Tagging text in this way alone, however, will not help computers understand the semantics of the classes. Ontologies play a critical role in making web content such as markup meaningful to machines. In philosophy, ontology refers to the study of existence and being. In the context of the semantic web, however, the term ontology refers specifically to a representation of human knowledge that can be understood and shared by computers (Gruber, 1993). Overall, ontologies determine the concepts and the relationships among concepts within a specific domain. An ontology can also help overcome heterogeneity issues when information needs to be integrated from different sources. In a geospatial semantic web, different ontology systems are needed for geographic knowledge and for other, non-spatial information.

The purpose of this thesis is to design and implement a geospatial semantic web application that facilitates the exploration of epidemiological information on the web. In the context of this thesis, news about foot-and-mouth disease is used as an example of epidemiological information. Foot-and-mouth disease is highly contagious, and its outbreaks cause huge economic losses (Blake et al., 2003; Yang et al., 1999; Perry et al., 1999; Pendell et al., 2007; Paarlberg et al., 2002). The geospatial semantic web application developed in this thesis is designed as a new way of browsing online news about foot-and-mouth disease. Rather than simply relying on, for example, a Google search that returns a list of web pages largely based on popularity rankings, the web application organizes news about foot-and-mouth disease based on geographic and semantic relationships. A framework for developing a geospatial semantic web application is proposed and implemented, and it should be applicable to other topics with similar purposes and requirements.

1.2 Literature Review of Geospatial Web Research

1.2.1 Semantic Geospatial web

The content of the current web is presented as plain text that is understandable only to humans. With the growth of the World Wide Web and the increasing amount of data on it, this huge data repository needs an efficient and intelligent way to organize and manage its information. The concept of the “Semantic Web”, proposed by Tim Berners-Lee in 2001, takes semantics into consideration and offers a good way to organize and retrieve information. The semantic web is an upgrade of the current web and is designed not only for human understanding but also for meaningful processing by computers (Berners-Lee et al. 2001).

The semantic web is primarily concerned with how to represent human knowledge and make the text on web pages understandable to computers. Berners-Lee et al. (2001) mentioned two already available technologies for building a semantic web, namely eXtensible Markup Language (XML) and the Resource Description Framework (RDF). XML allows users to create tags in web-based documents using a tree structure. With XML, the content is structured to some extent, which makes it easier for computers to capture, but it is still difficult to analyze the relationships among the tags, because XML may separate one sentence into different parts without considering the sentence as a whole. RDF, however, can indicate the relationships between tags, because it represents a statement in the form of a subject, a predicate and an object (Klyne et al. 2004). Compared with XML, RDF preserves the relationships between words.
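The subject, predicate and object form described above can be illustrated with a minimal Python sketch. It does not use real RDF syntax or an RDF library; the `Triple` type, the `objects_of` helper and the sample statements are all hypothetical and serve only to show how a statement keeps the relationship between its terms.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical statements; the identifiers are illustrative only.
triples: List[Triple] = [
    Triple("Columbus", "is_a", "City"),
    Triple("Columbus", "located_in", "Ohio"),
    Triple("Ohio", "is_a", "State"),
]


def objects_of(subject: str, predicate: str, store: List[Triple]) -> List[str]:
    """Return every object linked to `subject` through `predicate`."""
    return [t.obj for t in store
            if t.subject == subject and t.predicate == predicate]


print(objects_of("Columbus", "located_in", triples))
```

Because each statement names its predicate explicitly, a program can ask how two terms are related, which a bare tag around a word does not allow.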

Since 80% of the information on the Web has a geographic reference (Franklin et al. 1992), browsing information based on geographic features and relationships is one possible way to organize it. By incorporating the geographic and semantic relationships of web resources, a geospatial semantic web was proposed (Egenhofer, 2002). The geospatial semantic web is a particular semantic web that gives special consideration to the geospatial context of online information (Egenhofer, 2009). Scharl (2007) provides a hypothetical example to argue that geospatial technology has an important impact on people's daily lives. In his scenario, he describes how advanced geospatial technology changes and benefits the workflow and working environment of a freelance editor who sells her ability to gather and handle electronic information. The geospatial semantic web gives her the capability to view information from geographic and semantic perspectives, observe the trends of topics, and even simulate what-if scenarios. A geospatial semantic web will also allow users to perform more sophisticated geospatial queries than the current web. Geospatial queries are queries by spatial relationships, such as overlay and intersect. A typical example of a geospatial query is “find data about highways in Ohio”. Such a query cannot be fully answered by today's web because it lacks the capability to query geographically. Because a semantic web with a geospatial ontology treats a place as a spatial object rather than a word, it can consider the spatial relationships between objects, such as overlay, intersect and contain.
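To make the idea of querying by spatial relationship concrete, the following Python sketch treats places as spatial objects and tests the intersect and contain relationships. It is a simplification under stated assumptions: every object is reduced to an axis-aligned bounding box, and the `BBox` class, the extent used for Ohio and the three highway extents are rough, hypothetical values rather than real data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Axis-aligned bounding box in longitude/latitude degrees."""
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float

    def intersects(self, other: "BBox") -> bool:
        """True if the two boxes share at least one point."""
        return (self.min_lon <= other.max_lon and other.min_lon <= self.max_lon
                and self.min_lat <= other.max_lat and other.min_lat <= self.max_lat)

    def contains(self, other: "BBox") -> bool:
        """True if `other` lies entirely inside this box."""
        return (self.min_lon <= other.min_lon and other.max_lon <= self.max_lon
                and self.min_lat <= other.min_lat and other.max_lat <= self.max_lat)


# Rough, illustrative extents only (assumptions, not survey data).
ohio = BBox(-84.8, 38.4, -80.5, 42.0)
highways = {
    "highway_a": BBox(-83.2, 39.8, -82.8, 40.2),  # entirely inside the region
    "highway_b": BBox(-81.0, 40.0, -79.5, 40.5),  # crosses the eastern edge
    "highway_c": BBox(-90.0, 35.0, -89.0, 36.0),  # somewhere else
}

# "Find data about highways in Ohio" expressed as a spatial filter.
in_ohio = sorted(name for name, box in highways.items() if ohio.intersects(box))
print(in_ohio)
```

A keyword search for "Ohio" would miss a highway record that never mentions the state by name, whereas the spatial test depends only on where the object is.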

There are two major mechanisms in a geospatial web: geoparsing and geotagging (Scharl, 2007; Scharl et al. 2008). Geoparsing is the process of recognizing geographic information in text, such as the name of the place where an event happened. Geotagging, or geocoding, is the process of assigning the correct geographic coordinates to this geographic information, which is usually accomplished by querying a standard, well-structured geographic database such as a gazetteer (Hill et al. 1999; Tochtermann et al. 1997). A certain amount of content on the web does not have explicit geographic information that can be recognized and distinguished easily (Delboni et al. 2005). News articles in particular often mention more than one place, including the place where the event happened and the place from which it was reported (Morimoto et al. 2003). How to retrieve the exact locations from a news article is therefore an important issue in geoparsing. A helpful technique called Named Entity Recognition can extract named entities, such as the names of people, organizations and places, from textual data (Weiss et al., 2005).
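The two mechanisms can be sketched in a few lines of Python. This is a simple dictionary lookup, not Named Entity Recognition and not the extraction method developed later in this thesis; the miniature `GAZETTEER`, its approximate coordinates and the sample sentence are assumptions made for illustration.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical miniature gazetteer: place name -> (latitude, longitude).
GAZETTEER: Dict[str, Tuple[float, float]] = {
    "Columbus": (39.96, -83.00),
    "London": (51.51, -0.13),
    "Seoul": (37.57, 126.98),
}


def geoparse(text: str) -> List[str]:
    """Geoparsing: return gazetteer names found in the text, in order of first appearance."""
    found: List[str] = []
    for match in re.finditer(r"[A-Z][a-z]+", text):
        word = match.group()
        if word in GAZETTEER and word not in found:
            found.append(word)
    return found


def geotag(places: List[str]) -> Dict[str, Tuple[float, float]]:
    """Geotagging: attach coordinates to each recognized place name."""
    return {place: GAZETTEER[place] for place in places}


article = "Officials in Seoul confirmed a new outbreak, a London newspaper reported."
places = geoparse(article)
print(places)
print(geotag(places))
```

The sample sentence also shows the difficulty noted above: both an event place and a report place are found, and a lookup alone cannot tell which is which.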

1.2.2 Ontology in a Semantic Geospatial Web

The term ontology originates from philosophy, where it denotes the study of existence and being. In computer science, researchers have redefined the concept as “a document or file that formally defines the relations among terms” (Berners-Lee et al. 2001). Ontology is one of the fundamental components of artificial intelligence in general and of the semantic web in particular, as a way of representing human knowledge in a form that computers can understand and manipulate. Agarwal (2005) gave a good overview of ontology. She argued that an ontology should have domain-specific and user-dependent properties. The definition of ontology, and the aspects of a domain about which an ontology is constructed, vary across disciplines. Agarwal pointed out that the purpose of incorporating ontology into GIScience is to serve as a bridge that allows interoperability and integrates data from different sources, users and systems.
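As a rough illustration of an ontology as a formal definition of the relations among terms, the sketch below stores a handful of concepts as a directed graph and looks up the concepts related to a given term. The concepts and links in `ONTOLOGY` are invented for this example and are not the foot-and-mouth disease ontology defined in this thesis; `related_concepts` is likewise a hypothetical helper.

```python
from typing import Dict, Set

# Hypothetical concepts and links; each key points to the concepts it relates to.
ONTOLOGY: Dict[str, Set[str]] = {
    "foot-and-mouth disease": {"livestock", "vaccination", "economic loss"},
    "livestock": {"cattle", "trade"},
    "vaccination": set(),
    "economic loss": set(),
    "cattle": set(),
    "trade": set(),
}


def related_concepts(concept: str, ontology: Dict[str, Set[str]],
                     max_hops: int = 1) -> Set[str]:
    """Return the concepts reachable from `concept` within `max_hops` links."""
    seen = {concept}
    frontier = {concept}
    for _ in range(max_hops):
        # Follow one more link from every concept reached in the previous step.
        frontier = {n for c in frontier for n in ontology.get(c, set())} - seen
        seen |= frontier
    return seen - {concept}


print(sorted(related_concepts("foot-and-mouth disease", ONTOLOGY)))
print(sorted(related_concepts("foot-and-mouth disease", ONTOLOGY, max_hops=2)))
```

With such a structure, a search for one term can be widened to semantically related terms instead of being limited to an exact string match.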

There have been many recent studies of geospatial ontology. Ontology in the context of geographic information systems has been identified as a research area needing attention by the University Consortium for Geographic Information Science (Mark et al. 2003). The research topics fall into several categories: cross-discipline integration (Kolas et al. 2005; Karimi et al. 2003), resolving semantic heterogeneity (Klien et al. 2006; Lutz et al. 2006), and geographic information integration (Harvey et al. 1999; Fonseca et al. 2002; Fonseca et al. 2003). Fonseca et al. (2002) proposed the five-universes paradigm for understanding the function of ontology in a geospatial environment. The paradigm includes (1) the physical universe, which contains the objects and phenomena of reality; (2) the mathematical or logical universe, which holds the formal definitions of those objects and phenomena; (3) the representation universe, which describes the definitions in the mathematical universe symbolically; (4) the implementation universe, which converts the elements of the representation universe into a structure that computers can understand; and (5) the cognitive universe, which is placed between the physical universe and the mathematical universe because it captures human understanding of the physical world. This paradigm suggests that ontology can link human knowledge about the world with how computers interpret it (Peachavanish et al. 2007).

Kolas et al. (2005) outlined five ontology types that play key roles in constructing a geospatial semantic system: “Base geospatial ontology, Feature data source ontology, Geospatial service ontology, Geospatial filter ontology, Domain ontology.” The base geospatial ontology provides the representation of geospatial data and is referenced by all other ontologies. The feature data source ontology defines Web Feature Service (WFS) data. The geospatial service ontology enables geospatial services to be executed. The geospatial filter ontology defines how geospatial relationships are integrated into queries. The domain ontology is defined for specific user and domain perspectives.

Incorporating ontology into geospatial science makes geospatial queries meaningful and feasible for computers. Delboni et al. (2007) and Peachavanish et al. (2007) both discussed how ontology in the geospatial context could benefit the implementation of geospatial queries. The former stated that geospatial ontology helps define the spatial terms on web pages more explicitly, while the latter presented an ontological engineering methodology that can narrow the gap between users' limited knowledge and the knowledge and skills required for geospatial queries. Delboni et al. (2007) developed a geospatial search that analyzes the semantics of geographic context in natural language without using a large geospatial database. This query method uses natural-language positioning expressions and confirms the expectations of the geospatial semantic web. Peachavanish et al. (2007) presented a new ontological method for interpreting and mapping geospatial queries.

1.2.3 Geospatial web approach to exploring epidemiological information

Geospatial web applications are a useful research tool for exploring the vast amount of information on the web, especially when the information has a geographic characteristic. Epidemiological information is one example of such data because of its spatial and temporal features (Wilesmith et al. 2003). Moreover, Geographic Information Systems (GIS) have been widely used in disease mapping and spatial epidemiology (AvRuskin et al. 2004; Busgeeth et al. 2004). Most applications in this area are mainly concerned with the spatial patterns of disease distribution or the management of large amounts of epidemiological information.

Perez et al. (2009) proposed a web-based system for real-time information sharing and space-time cluster analysis of infectious disease, especially foot-and-mouth disease. In this paper, they developed a real-time system for surveillance with the ability to visualize data and detect clusters. Although the system did not have a formal and systematic way to obtain data for real surveillance, it contributed to promoting global infectious disease surveillance and real-time information sharing.

This thesis applies an advanced geospatial semantic web approach to exploring and discovering online epidemiological information, primarily news about foot-and-mouth disease (FMD). Foot-and-mouth disease is attracting increasing attention all over the world because it spreads easily among livestock and can cause huge economic losses.

This thesis aims to design and implement a web-based platform for exploring epidemiological information within a geospatial semantic environment. In this platform, information is organized and explored according to its geospatial and semantic relationships. Based on the above review of the relevant literature, the research process is divided into two major steps. The first step is to extract geospatial information from online data. In this step, a simple but effective method for extracting geospatial information (mainly place names and their geographic coordinates) is proposed and implemented. The second step is to build a semantic environment by exploring the underlying semantic relationships among different topics within the epidemiology domain relevant to foot-and-mouth disease. In this step, a simple tentative ontology for foot-and-mouth disease is defined based on expert input and used together with the geographic locations of the online data. The platform takes advantage of open-source geospatial technologies, which are cost-effective and whose functions can be customized. Eventually, a scalable web-based geospatial semantic framework will be developed that can be applied to other similar topics.

Chapter 2: Methodology and Framework

With the advent of Web 2.0, a new generation of the World Wide Web, the web is moving from static HTML pages towards a collaborative and interactive network (O'Reilly, 2005). Web 2.0 emphasizes networking, collaboration and interaction, which opens up many opportunities for people to share information. More and more Web 2.0 technologies recognize the importance of geography and location as a way to access information on the web (Haklay et al., 2008). The development of Web 2.0 technologies has in turn helped the geospatial web become an interactive, collaborative, information-sharing platform.

2.1 Web application architecture

In many web-based applications, two sides are required to request and deliver web services: the client side and the server side. The client side refers to an application program running on a host (e.g., a computer) that sends service requests to, and receives services from, the server side (Kurose et al. 2003). The server side refers to programs running on other hosts that process requests from the client side and send results back through the network. There are many kinds of client-side programs, notably web browsers and email clients such as Thunderbird or Outlook. Examples of server-side programs include a web server, such as Apache, a database server, such as MySQL, and a map server, such as MapServer. Server-side programs are often designed to perform special functions such as retrieving web pages (a web server) or making maps (a map server).

The messages sent between the client and server sides must follow an application-layer protocol, such as the HyperText Transfer Protocol (HTTP). These protocols define important aspects of network communication, such as how messages are sent and received between the two sides and the format of the requests and results. An HTTP request can be sent from the client side by clicking a link. The server side receives the HTTP request and responds to the client side with the results, such as the content of the web page specified by the clicked link. Figure 1 shows an example of a transaction between the client side and the server side. However, this communication is not always limited to the client-side and server-side programs themselves. In many current web applications, a technique called the Common Gateway Interface (CGI) is frequently used. CGI programs run on the server side and are designed to enable dynamic interaction between users and the web server. For example, when users fill out a form on a website, the data are usually sent to and processed by an application program on the server. After the program receives the data, a confirmation message is sent back to the users. An example of using CGI is shown in Figure 2.
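The form-handling exchange described above can be sketched as follows. The example is written in Python for brevity and is not the Perl CGI code used in this research; the function name `handle_form` and the form fields `name` and `topic` are hypothetical. A real CGI program would read the submitted data from its environment (for a GET request, the `QUERY_STRING` variable) and print the response, whereas this sketch takes and returns strings so that it can run on its own.

```python
from html import escape
from urllib.parse import parse_qs


def handle_form(query_string: str) -> str:
    """Build the response a CGI-style program would return for a submitted form."""
    fields = parse_qs(query_string)           # e.g. {"name": ["Alice"], ...}
    name = fields.get("name", ["guest"])[0]   # fall back when the field is missing
    body = (
        "<html><body>"
        f"<p>Thank you, {escape(name)}. Your form was received.</p>"
        "</body></html>"
    )
    # A CGI response consists of header lines, a blank line, and then the body.
    return "Content-Type: text/html\n\n" + body


response = handle_form("name=Alice&topic=FMD")
print(response)
```

The confirmation page is generated at request time from the submitted data, which is what distinguishes this kind of interaction from serving a static page.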


Figure 1. A simple communication between client and server (Brinzarea, 2009)

Figure 2. An example of using CGI technique (CGI)

On the server side, there can be several critical components, which may include a web server, a database server, and a map server, the latter especially for dealing with geographic information. On the client side, a client-side program, usually a web browser, is indispensable. Besides these programs, scripting languages for implementing specific functions are required on both sides. Commonly used scripting languages include PHP and Perl on the server side and JavaScript on the client side. Figure 3 shows an example of a complete interaction between the two sides using the PHP and JavaScript scripting languages.

Figure 3. A common communication between client and server (Brinzarea, 2009)

2.1.1 Server side components

2.1.1.1 Web Server

The web server, also called an HTTP server, refers to the hardware, and the software running on it, that stores and delivers web content. The main function of a web server is to respond to clients with the requested web pages. Several web servers are currently available. Apache (http://httpd.apache.org/) is one of the most widely used open-source HTTP servers. In this research, the Apache HTTP server runs on the server side to handle and respond to HTTP requests.

2.1.1.2 Database Server

On the server side, besides the web server, a database server is often necessary when there is a need to store and handle large datasets. A database server stores the data other than web pages, and the dataset for a web application can sometimes be very large. A database server provides not only the ability to manage large amounts of information efficiently but also sophisticated built-in computational functions for processing data. For example, a spatial database can easily calculate the distance between two spatial objects and the area or length of a spatial object.

Many kinds of database servers are widely used by various applications, including Oracle (http://www.oracle.com/index.html), Microsoft SQL Server (http://www.microsoft.com/sqlserver/2005/en/us/default.aspx) and PostgreSQL (http://www.postgresql.org/). PostgreSQL is an open-source object-relational database management system that can handle spatial data. PostgreSQL is used in this research for its efficient data management and analysis features, especially its full-fledged spatial data processing capabilities, which make it more suitable as a geospatial database than the others.

The spatial extension of the PostgreSQL database server is called PostGIS. The table structure for spatial data in PostGIS is straightforward: only one column differs from a traditional relational table, and the other columns store attributes as usual. This special column is the geometry column, which stores the geometry of each object, such as a point, line or polygon. One advantage of this structure is that the data can be used directly in spatial functions, such as calculating the area of a polygon or the distance between two points. In this way, one record keeps both its spatial property and its non-spatial attributes together as a whole. Moreover, storing spatial objects in PostGIS in this way makes them easier for other programs to use. For example, a map server accepts geometry-type data and draws those geographic objects on maps without data type incompatibility issues.
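The idea of a record that carries ordinary attributes alongside one geometry column, and of a spatial function that operates on that column, can be imitated in plain Python. This sketch does not use PostGIS; the `NewsLocation` record, its field names and the sample coordinates are assumptions, and the great-circle (haversine) formula with a mean Earth radius of 6371 km merely stands in for the distance functions a spatial database provides.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class NewsLocation:
    """One row: ordinary attributes plus a point geometry (longitude, latitude)."""
    news_id: int
    place_name: str
    geom: Tuple[float, float]


def haversine_km(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Great-circle distance in kilometers between two (lon, lat) points in degrees."""
    lon1, lat1 = map(math.radians, a)
    lon2, lat2 = map(math.radians, b)
    dlon, dlat = lon2 - lon1, lat2 - lat1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


# Hypothetical rows with approximate coordinates.
columbus = NewsLocation(1, "Columbus", (-83.00, 39.96))
cleveland = NewsLocation(2, "Cleveland", (-81.69, 41.50))

distance = haversine_km(columbus.geom, cleveland.geom)
print(f"{columbus.place_name} to {cleveland.place_name}: {distance:.0f} km")
```

Keeping the geometry inside the record means that a query such as "news reported near this place" can be answered from the same table that stores the other attributes.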

2.1.1.3 Server-side scripting languages – Perl and PHP

Besides the server programs residing on the server side, scripting languages are also important because they serve as a bridge connecting server programs and client-side programs for information retrieval and rendering. Moreover, server-side scripting languages can create web pages dynamically based on users' HTTP requests and connect the components described above, the web server and the database server, so that they work together.

Some commonly used server-side scripting languages are Active Server Pages (ASP, http://www.asp.net/), JavaServer Pages (JSP, http://java.sun.com/products/jsp/docs.html), and Hypertext Preprocessor (PHP, http://www.php.net/). Given the requirements of this research, PHP is used as the server-side scripting language for accessing the database and for communication between server-side and client-side programs. Additionally, Perl is used as a CGI scripting language for fetching data from the web.

2.1.2 Client side components

2.1.2.1 JavaScript

Client-side scripting is necessary to enable interaction between web pages and users and to make web pages more dynamic. Web pages without embedded scripts are just static documents consisting of, for example, text and figures. Client-side scripting languages equip web pages with much more interesting and advanced functions.

JavaScript (http://www.w3schools.com/js/default.asp) is the most widely used client-side scripting language. It is relatively easy to learn, and the many powerful ready-to-use JavaScript libraries on the web make it relatively simple to build interesting and meaningful web pages. Moreover, more and more web-based applications, such as Google Search, provide an application programming interface (API) that mostly supports JavaScript-based development. An API is a set of predefined specifications and instructions that programmers use to customize web applications. It works as an interface between web-based applications. A common example is embedding Google Search in a personal or company web page rather than linking to Google.com.

2.1.2.2 Ajax

Ajax stands for Asynchronous JavaScript and XML and is applied on the client side to create interactive and responsive web pages (Smith, 2006). Ajax is the fundamental technology that enabled the emergence of Web 2.0 (O'Reilly, 2005). With the advent of Ajax, the performance of the front-end interface (the interface that a web page user sees) is greatly enhanced. Without Ajax, the traditional communication between the client side and the server side is a synchronous process: once the client side sends an HTTP request, the web page interface can do nothing until the server side sends back the results. This is an inconvenient and time-consuming experience for users. Ajax solves this problem. As shown in Figure 4, Ajax allows users to keep working with the web page while waiting for the server's response. For example, when users register as members on a website, after entering a username they can continue to fill out other information while waiting for the server to tell them whether the username is available. This is a typical Ajax scenario. In addition, with Ajax only part of a web page needs to be refreshed rather than reloading the whole page.
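The username example can be mimicked with Python's `asyncio` to show what "asynchronous" means in practice. This is only an analogy: real Ajax runs in the browser through JavaScript and the XMLHttpRequest object, and the `TAKEN` set, the function names and the simulated delays below are all hypothetical.

```python
import asyncio
from typing import List

TAKEN = {"admin", "webmaster"}  # hypothetical usernames already registered


async def check_username(name: str) -> bool:
    """Stand-in for the server round trip; the sleep simulates network latency."""
    await asyncio.sleep(0.05)
    return name not in TAKEN


async def fill_other_fields(log: List[str]) -> None:
    """Stand-in for the user continuing to work with the page."""
    for field in ("e-mail", "password"):
        await asyncio.sleep(0.01)
        log.append(f"user filled in {field}")


async def register(name: str) -> List[str]:
    log: List[str] = []
    # Send the request in the background instead of blocking the page.
    pending = asyncio.create_task(check_username(name))
    await fill_other_fields(log)  # the page stays usable meanwhile
    available = await pending     # handle the response when it arrives
    log.append(f"username available: {available}")
    return log


log = asyncio.run(register("newuser"))
print(log)
```

In a synchronous design the two field entries could not begin until the availability check had returned; here they proceed while the check is still in flight.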

Figure 4. An example of how Ajax technique works (Brinzarea, 2009)

Ajax consists of three essential components (Brinzarea, 2009):

1. JavaScript and the Document Object Model (DOM): identify and manipulate parts of the HTML page

2. XMLHttpRequest object: enables scripting languages to access the server asynchronously in the background

3. Server-side scripting languages, for example PHP: handle requests from the client side

“The Document Object Model (DOM) is an application programming interface (API) for valid HTML and well-formed XML documents” (DOM). The DOM gives static web pages a structure so that JavaScript knows which part of the page it is interacting with. The XMLHttpRequest object is used to send requests to the server and receive its responses asynchronously. The results sent back from the server are usually in the eXtensible Markup Language (XML) or JavaScript Object Notation (JSON) data exchange format. Both formats have a well-organized structure that is easy to parse and understand.
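As a minimal illustration of why these two formats are easy to parse, the sketch below decodes the same hypothetical news record from JSON and from XML. It is written in Python purely for illustration (the application itself parses responses with JavaScript on the client); the field names `title` and `location` and the sample values are assumptions, not the application's actual response format.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical server responses describing the same news record.
json_response = '{"title": "FMD outbreak reported", "location": "China"}'
xml_response = (
    "<news>"
    "<title>FMD outbreak reported</title>"
    "<location>China</location>"
    "</news>"
)


def parse_json(text):
    """Decode a JSON response into a plain dictionary."""
    return json.loads(text)


def parse_xml(text):
    """Decode an XML response into a dictionary keyed by tag name."""
    root = ET.fromstring(text)
    return {child.tag: child.text for child in root}


json_record = parse_json(json_response)
xml_record = parse_xml(xml_response)
print(json_record == xml_record)
```

Both parsers yield the same dictionary, which is why either format can carry the server's results to the client.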

2.1.2.3 OpenLayers

OpenLayers9 is a popular open source JavaScript library for mapping geographic data. It provides developers with a JavaScript-based API so that OpenLayers can be incorporated into various web applications. Like other web applications that can display geographic data, the OpenLayers API offers a wide range of mapping functions, including map controls such as zoom in and zoom out, map scales, and so forth. By using the OpenLayers API, a web page can display a map overlaid with users' geographic data, such as a population distribution layer. More customized features and advanced functions can be implemented as well, for example, assigning different colors to different markers and popping up an information window after the map is clicked. The OpenLayers API is implemented in JavaScript and has no dependence on the server, so it should be considered a helpful and necessary component on the client side, especially for geography-related web applications.

9 http://openlayers.org/

2.2 A framework for geospatial web application

A general framework of a geospatial web application is shown in figure 5. The entire structure can be examined from the client and server sides. The client side is responsible for displaying elements such as maps and text, and for collecting users' inputs and sending them to the server side. On the server side, the workflow can be divided into two parts. The first is retrieving data and constructing a spatial database, and the second is communicating with the client side to parse and respond to users' HTTP requests. The workflow starts with users' inputs. After a user clicks on the map or a news item, the title of the clicked item is sent to the server side. The server-side programming language, PHP, receives the query parameter, queries the database with it, and returns the information found to the client side in XML format. Parts of the web page then change accordingly.
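The server-side step described above (receive a title, query the database, return XML) can be sketched as follows. This is a hedged illustration in Python rather than the PHP used in the application, with an in-memory SQLite database standing in for PostgreSQL; the function names, the column subset, and the sample record are assumptions made for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET


def build_demo_database():
    """Create an in-memory table with one sample news record (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE webpage (title TEXT PRIMARY KEY, url TEXT, location TEXT)"
    )
    conn.execute(
        "INSERT INTO webpage VALUES (?, ?, ?)",
        ("FMD outbreak reported", "http://example.com/news/1", "China"),
    )
    return conn


def handle_request(conn, title):
    """Look up a clicked title and return the matching record as XML text."""
    row = conn.execute(
        "SELECT title, url, location FROM webpage WHERE title = ?", (title,)
    ).fetchone()
    root = ET.Element("news")
    if row is not None:
        for tag, value in zip(("title", "url", "location"), row):
            ET.SubElement(root, tag).text = value
    return ET.tostring(root, encoding="unicode")


conn = build_demo_database()
response = handle_request(conn, "FMD outbreak reported")
print(response)
```

Passing the title as a bound parameter (the `?` placeholder) rather than concatenating it into the SQL string keeps a user-supplied title from being interpreted as SQL.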

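[Figure 5 diagram: the user interface (web page) is rendered by JavaScript and the OpenLayers API on the client side; the client sends HTTP requests to PHP and Perl on the server side, which query the PostgreSQL server-side database and return the response to the client as XML results.]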

Figure 5. A framework of a geospatial web application

This geospatial web application focuses on exploring epidemiological information on the web, specifically news about foot-and-mouth disease. The workflow (figure 6) of developing this application is divided into five parts. First, news data is retrieved periodically from Google News. Second, tables are created in a database and the news is saved into these tables. Third, geographic information is extracted from every piece of news and assigned location coordinates by searching a gazetteer. Fourth, an ontology is built for the foot-and-mouth disease domain and each piece of news is tagged using the concepts in the ontology. Lastly, the different components are integrated into the web page.

[Figure 6 diagram. Steps: Google News retrieval (Perl); database creation and update (PostgreSQL); geoparsing and geocoding (Perl); building the ontology and tagging news (Perl); developing the web page (HTML, JavaScript, PHP). Input data shown: Google News, gazetteer, experts' knowledge. Output data shown: tables for storing web pages, tables for storing locations, ontology of the FMD domain, news associated with tags, web application interface.]

Figure 6. Workflow of developing a geospatial web application for exploring online epidemiological information

Chapter 3: Implementation of the Geospatial Web Application

3.1 Data collection

This research mainly focuses on exploring online epidemiological information in the form of news about foot-and-mouth disease (FMD). Because news data often has distinctive spatial and temporal characteristics, it is well suited to examining the difference between a semantic geospatial web approach and the traditional web search approach.

3.1.1 Data retrieval

The data was collected from online news related to foot-and-mouth disease through Google News Search. Google News Search provides a News Search API that lets users embed Google News in their own web pages and develop customized programs. Google supports the News Search API in various programming languages, such as PHP, Perl, Python and Java. In this research, Perl is chosen to send requests to Google News Search. Compared to other languages, Perl has modules that provide convenient access to the Google News Search API and convenient parsing of the results. Figure 7 shows the code snippet that uses Perl to make a request.

Two Perl modules are needed for this search. LWP, the World-Wide Web library for Perl, is an important module that provides a simple and consistent API to the World-Wide Web (LWP, 6.02). LWP::UserAgent is a class in the LWP module that is used to send HTTP requests and receive the responses. The other module is the JSON (JavaScript Object Notation) encoder/decoder, which is used to parse the responses, which are in JSON format. Google News Search allows a maximum of 64 results per search, so the program runs periodically to get more news.

Figure 7. Code snippet of using Perl to make a news search request

The returned JSON file contains a lot of useful information about each news item, such as the title, URL, publication date, and a short description. This information is saved for displaying news on the web page. The news search process is divided into two steps. The first step uses the Google News Search API to retrieve the 64 returned news items, and the second saves the HTML file of each news item as a separate text file (see figure 7 and figure 8). The second step needs another Perl module, LWP::Simple, which can retrieve the HTML code based on the web page's URL.

Figure 8. Code snippet of saving the news as a text file

There are several reasons to save the HTML code of each news web page as a text file. First of all, the original HTML file cannot be saved into the database directly due to encoding issues. Some special characters, such as “ and „, need to be converted; figure 9 shows the encoding process. Another reason is that a web page saved on the server is helpful for examining the content later on, because some web pages may be removed at some point. Figure 10 shows the HTML file that is saved on the server and in the database.
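The conversion of special characters mentioned above can be approached in several ways. The sketch below shows one common option, replacing every non-ASCII character with a numeric character reference. It is written in Python for illustration and is not the Perl encoding routine of figure 9; the function name and the sample sentence are assumptions.

```python
def encode_special_characters(html_text):
    """Replace every non-ASCII character with a numeric character reference."""
    return html_text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# Sample text containing curly quotation marks (U+201C and U+201D).
sample = "Officials said \u201cno new cases\u201d were found."
encoded = encode_special_characters(sample)
print(encoded)
```

The encoded text contains only ASCII characters, so it can be stored regardless of the database's character set, and a browser still renders the references as the original quotation marks.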

Figure 9. Code snippet of encoding process

Figure 10. Encoded html file on the server

3.1.2 Data storage and management

The data is stored in a PostgreSQL database on the server. There are two tables for storing the data. The “webpage” table stores most of the information about the news, including the title, URL, published date, a short description, and the main location mentioned in the news as well as its coordinates. The other table, “html”, holds the HTML source code of every news item. These two tables are related by the title of each news item. The “title” attribute is the primary key in the “webpage” table and a foreign key in the other tables. A foreign key means that the values of the “title” attribute in the “html” table must exist in the “title” column of the “webpage” table. The “webpage” table is called the referenced table, while the “html” table is called the referencing table. Figure 11 and figure 12 show the structures of the two tables in PostgreSQL.
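The primary key and foreign key relationship between the two tables can be demonstrated with a small sketch. It uses Python with an in-memory SQLite database in place of PostgreSQL so that it runs on its own; the column names and types are assumptions based on the description above, not the actual definitions in figure 11 and figure 12.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request

# Referenced table: one row per news item, keyed by title.
conn.execute(
    """CREATE TABLE webpage (
           title TEXT PRIMARY KEY,
           url TEXT,
           published TEXT,
           description TEXT,
           location TEXT,
           latitude REAL,
           longitude REAL
       )"""
)
# Referencing table: the title must already exist in webpage.
conn.execute(
    """CREATE TABLE html (
           title TEXT REFERENCES webpage (title),
           html TEXT
       )"""
)

conn.execute(
    "INSERT INTO webpage (title, url) VALUES (?, ?)",
    ("FMD outbreak reported", "http://example.com/news/1"),
)
conn.execute(
    "INSERT INTO html VALUES (?, ?)", ("FMD outbreak reported", "<html>...</html>")
)


def insert_orphan_is_rejected(connection):
    """Return True if html for an unknown title violates the foreign key."""
    try:
        connection.execute(
            "INSERT INTO html VALUES (?, ?)", ("Unknown title", "<html></html>")
        )
    except sqlite3.IntegrityError:
        return True
    return False


rejected = insert_orphan_is_rejected(conn)
print(rejected)
```

The second insert into “html” is refused because its title has no matching row in “webpage”, which is the guarantee the foreign key provides.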

Figure 11. Webpage table for storing attributes for every news


Figure 12. Html table for storing html source code of each news

The Perl programming language is chosen for connecting to the database and saving the news into tables. Connecting to the database requires a Perl module called DBI, a database interface for Perl that defines a set of variables and methods for accessing and manipulating different kinds of databases. Another Perl module, HTML::TokeParser, is used to parse the HTML source code of each news web page. The program reads all the source code in the HTML file, encodes it, and saves it in the “html” column of the html table. This process is shown in figure 13.

The original method separated the location and its coordinates from the “webpage” table into a spatial database. The intention behind creating a spatial database was to make it easy to build a map file for those locations via MapServer. MapServer can take spatial point data as input and use this format in map files. But MapServer is not indispensable in this research, and showing locations using OpenLayers is much more convenient. In addition, OpenLayers allows nicer markers than MapServer. OpenLayers is therefore applied in this research instead of MapServer. OpenLayers can also take non-spatial attributes as input to map the locations, so a spatial database is not necessary here and was replaced with a relational database. For further research, however, a spatial database has some irreplaceable advantages compared to a relational database. For example, it is easier and faster to calculate the distance between two spatial objects.

Figure 13. Code snippet of saving news into database using Perl

Figure 14 shows the process of updating the main location in the webpage table. The location table and the webpage table are related by the title attribute. The main location of a piece of news is the first of all the locations that are associated with this piece of news and stored in the location table. The Perl code snippet in figure 14 extracts the first location and writes it to the corresponding news record in the webpage table.

Figure 14. Code snippet for saving location information into database

3.2 Geospatial information Extraction and Recognition

3.2.1 Geoparsing – spatial information extraction

News is a kind of information that often contains explicit geographic information. But recognizing geographic information accurately is not as straightforward for computers as it is for humans. Some advanced location extraction methods use Natural Language Processing (NLP) techniques, which analyze sentences based on their lexical and syntactic structures. In this research, a simpler method is used. The extraction results show that it is an effective method for recognizing locations, although some results contain errors.

Based on human knowledge and experience, place names appear in news in certain patterns. One pattern is “Columbus, OH”: there is a comma between two words, and both words start with a capital letter. Another pattern can be found in “United States”: two consecutive words, each with an initial capital letter. “China” is a pattern too: a single word with an initial capital letter. The program goes through the body of the web page and searches for words with those patterns, as the code snippet in figure 15 shows. The program splits the whole text into single words, and punctuation marks stay attached to the preceding word, so the first pattern is folded into the second. The program then searches for words matching the second and third patterns.
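The capital-letter patterns described above can be approximated with a regular expression. The following is a rough Python sketch rather than the Perl program of figure 15: the pattern, the function name, and the sample sentence are assumptions, and the word-by-word splitting of the original program is replaced here by a single regular expression.

```python
import re

# One capitalised word, optionally followed by a second capitalised word,
# with an optional comma in between ("China", "United States", "Columbus, OH").
CANDIDATE = re.compile(r"\b[A-Z][A-Za-z]*(?:,? [A-Z][A-Za-z]*)?")


def extract_candidates(text):
    """Return every capitalised word or word pair that might be a place name."""
    return CANDIDATE.findall(text)


text = "Officials in Columbus, OH compared the outbreak with cases in China."
candidates = extract_candidates(text)
print(candidates)
```

The sentence-initial word “Officials” is returned along with the two real place names, which illustrates why the candidates still have to be checked against a gazetteer in a later step.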

Figure 15. Code snippet for locations extraction

Using this method, the extracted information is not always a location; in fact, it contains a lot of errors. Figure 16 shows the extraction result for one piece of news. The method successfully retrieved the location “China”, although the result also contains some unrelated words. Examining this news item manually confirms that it discusses the foot-and-mouth disease situation in China. Using this method, 80% of the news in the database is associated with at least one location. In this step, the program only retrieves all the words that can potentially be places. The next step identifies which of them are place names.


Figure 16. Locations extraction results for one particular news

3.2.2 Geotagging – spatial coordinates identifying

Assigning coordinates to geographic objects is called geotagging. In the previous step, all the words matching the patterns were extracted. Without geographic coordinates, however, those words are meaningless. Matching those words with accurate coordinates is accomplished by querying a standard geographic dictionary, a gazetteer. If a word appears in the gazetteer, it is assumed to be a place name.

A gazetteer is a formal definition of places around the world. Various gazetteers are available online. Some of them contain millions of locations, while others contain only a small number. The choice of gazetteer can influence the accuracy of the results to some extent. The first gazetteer used in this research was from the National Geospatial-Intelligence Agency website and had over seven million records. Although it was a very comprehensive gazetteer, it did not turn out to be satisfactory. One issue with this gazetteer is the format of the records. For example, “North Korea” exists in this gazetteer as “North Korea Highlands”, and there is no place name called “North Korea”. In this case, it is hard for the program to extract the right record, because the Perl program queries by finding matching strings, and “North Korea Highlands” does not equal “North Korea”. The alternative is the sub-string criterion: as long as the searched words are part of a record, that record is returned as a satisfactory result. Figure 17 shows the difference between the two search criteria. If the program uses the sub-string criterion rather than the equality criterion, many unrelated records of other places are extracted. For example, if “Columbus” is the key word and the condition is to return records that contain “Columbus”, 23 records are returned, as shown in figure 18.
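The trade-off between the two search criteria can be seen in a few lines of code. The sketch below is in Python rather than Perl, and the short list of names is a hypothetical stand-in for the gazetteer, not an extract from it.

```python
# Hypothetical gazetteer names used only for illustration.
GAZETTEER_NAMES = [
    "North Korea Highlands",
    "Columbus",
    "Columbus Junction",
    "New Columbus",
]


def exact_matches(word, names):
    """Return records whose name equals the searched word."""
    return [name for name in names if name == word]


def substring_matches(word, names):
    """Return records whose name contains the searched word."""
    return [name for name in names if word in name]


print(exact_matches("North Korea", GAZETTEER_NAMES))
print(substring_matches("North Korea", GAZETTEER_NAMES))
print(substring_matches("Columbus", GAZETTEER_NAMES))
```

The equality criterion misses “North Korea” entirely, while the sub-string criterion finds it but also returns every record that merely contains “Columbus”.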

Figure 17. Example of equal search criteria and sub-string search criteria

Figure 18. Records containing “Columbus”

Because of those issues with the first gazetteer, another gazetteer is used. It was from the world-gazetteer website10 and was a country-level gazetteer with about 1037 records. Based on the extraction results, this gazetteer works really well. According to observations, when a piece of news mentions place names, most of them are country names or major city names. One province in South Africa, “KwaZulu-Natal”, does not exist in the gazetteer, so it cannot be extracted. But in this case, the program can still pinpoint the location to Africa or South Africa, if they are mentioned in the news as well.

10 http://www.world-gazetteer.com/wg.php?x=&men=gcis&lng=en&des=wg&srt=npan&col=abcdefghinoq&msz=1500&geo=0

Another reason to use this simple gazetteer is the scale of events. Foot-and-mouth disease is an animal disease that usually occurs at a province/state or country scale. Some common diseases, such as flu, are human diseases and most likely occur in a small region. By manually examining 64 pieces of news, excluding those without locations mentioned, the precision of this location extraction method was assessed, and the results are given in table 1. As the results show, as long as the location extraction results are not empty, they are all identified as real locations.

                     True locations   False locations
Extraction results   64               0

Table 1. The results of the location extraction method's precision

The geographic information retrieved from each news record is saved in the PostgreSQL database. A table called “location” is created to store the location information, with a foreign key “title” referencing the primary key “title” in the “webpage” table. The table structure is shown in figure 19.

Figure 19. Location table for storing locations in each news

The difference between this table and the “location” column in the “webpage” table is that the “webpage” table holds only the single most important location of each news record, while the “location” table records all the locations that appear in the news. Most of the time, a news record talks about more than one place. For example, one news item may discuss the foot-and-mouth disease situation in the USA and also mention several other countries for comparison. In this case, the USA is labeled as the main location of this news item, while the other places are categorized as secondary locations. It is necessary to pick one location as the main location of each news report in this research, because the main location is used to measure the distance between news items. One of the important aspects of this semantic geospatial web application is allowing users to view news based on geographic relationships.

The geographic relationship refers to the closeness between two news reports. When the user is viewing one news record, the application recommends the five other news reports that are closest to it. To determine closeness, the distance between two points serves as a good measurement. Therefore, the main location of each news item is selected and used in calculating the distance between two news items.
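Ranking news by the distance between main locations can be sketched as follows. The text does not state which distance formula the application uses, so the haversine great-circle distance here is an assumption, as are the function names, the dictionary keys `lat` and `lon`, and the sample coordinates. The sketch is in Python for illustration.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * radius_km * math.asin(math.sqrt(a))


def nearest_news(active, others, count=5):
    """Return the `count` items of `others` whose main location is closest to `active`."""
    def distance(item):
        return haversine_km(active["lat"], active["lon"], item["lat"], item["lon"])

    return sorted(others, key=distance)[:count]


# Illustrative main locations on the equator.
active = {"title": "A", "lat": 0.0, "lon": 0.0}
others = [
    {"title": "far", "lat": 0.0, "lon": 90.0},
    {"title": "near", "lat": 0.0, "lon": 1.0},
    {"title": "middle", "lat": 0.0, "lon": 10.0},
]
ranked = [item["title"] for item in nearest_news(active, others)]
print(ranked)
```

With the default `count=5`, the function returns at most five items, matching the five recommendations described above.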

Determining the most relevant location of a news report is hard for computers to do. Some studies calculate the occurrence frequency of every location and pick the most frequently mentioned one as the main location. A simple but effective approach is applied in this research. If any location is mentioned in the news title, it is recognized as the main location. Otherwise, the first location that appears in the news report is considered the main location. This method may seem rough, but the results are acceptable. A summary of the accuracy, obtained by manually checking 64 news web pages and excluding those without explicit locations, is shown in table 2. The column headings give the role of a location in the news (main or secondary), and the row headings give the position of the location extracted by the program (first or non-first). In only 8 of the 64 news records checked does a location appear first in the report without being the main location. Some news items, such as B2 in appendix B, talk about only one place, so the number of news items whose non-first location is a secondary location is smaller than the number whose first location is the main location. The results in table 2 indicate that assigning the first location as the main location is feasible and satisfactory.

                     Main location   Secondary locations
First location       56              8
Non-first location   8               44

Table 2. The summary of main location detection method
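The main-location rule described above (a location named in the title, otherwise the first location mentioned) is simple enough to sketch in a few lines. The Python function below is an illustration, not the Perl program used in this research; the function name and the sample titles are assumptions.

```python
def pick_main_location(title, locations_in_order):
    """Pick the main location: one named in the title, else the first mentioned.

    `locations_in_order` lists the recognised place names in the order in
    which they appear in the news body.
    """
    for location in locations_in_order:
        if location in title:
            return location
    return locations_in_order[0] if locations_in_order else None


print(pick_main_location("FMD spreads in South Korea", ["Japan", "South Korea"]))
print(pick_main_location("New outbreak confirmed", ["Japan", "South Korea"]))
```

With the counts in table 2, the first-mentioned location is the main location in 56 of 64 cases, or 87.5%.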

3.3 Semantic context Development

3.3.1 Defining an ontology for the Foot-and-mouth Disease domain

Another critical component of this semantic geospatial web application is defining ontological relationships between news items so that they can be viewed based on their semantic context. An ontology is a representation of human knowledge as a set of concepts and the relationships between those concepts in a domain. An ontology is always built and classified for a specific domain, since it is impossible to build a giant ontology system that includes everything, and defining an ontology for a specific domain requires input from experts' knowledge.

The ontology is used as the criterion to determine which concepts are related to each other and the degree of relevance. In this geospatial web application, the purpose of incorporating an ontology is to determine which other news topics should be displayed along with the search results. If there is a relationship between two concepts, then when one of them is searched, news tagged with the other term is considered related and displayed as well. For example, within the foot-and-mouth disease domain, two basic concepts can be “infectious disease” and “prevention”. A possible relationship between these two concepts is that prevention methods can prevent the occurrence of infectious disease. Once these two concepts are linked by a relationship, when users search for news about “infectious disease”, news about “prevention” is considered relevant and presented to users as well. Conversely, if the ontology defines no relationship between “infectious disease” and “prevention”, news about those two terms should not be presented together.

There are some ontologies for the disease domain available on the web, but they are very comprehensive and not appropriate for the study topic in this paper. In this research, a specific ontology system for the foot-and-mouth disease domain is defined. By manually examining news and papers about foot-and-mouth disease and consulting with experts in the veterinary area, a tentative and scalable ontology system is proposed and developed, as shown in figure 20. Eight critical and frequently mentioned terms were found in the reviewed resources. There are various ways to interpret and define the relationships among those terms. A questionnaire about the relationships among those terms was sent out to veterinary experts, and they offered valuable suggestions. Taking their suggestions into consideration, figure 20 represents one possible interpretation, which is applied in this research. Figure 20 shows the ontology as a graph with nodes referring to concepts and edges representing relationships. If there is an edge linking two terms, these two terms are directly related. For example, “infectious disease” is directly related to “foot-and-mouth disease”, while it is only indirectly related, or unrelated, to “outbreak”.

Figure 20. A tentative ontology system for foot-and-mouth disease domain

The next phase is to tag each news record with the eight concepts in this ontology. The tagging process is divided into two steps. First, a thesaurus is developed for each term. When news records talk about the same topic, they may not always use the same words. For example, a news item that discusses infectious disease may use “contagious disease” instead of “infectious disease”. It is therefore necessary to use similar words to capture the topic of a news item accurately. Table 3 shows the thesaurus for each concept.

Concept                Thesaurus
Infectious disease     contagious disease, epidemic, virus
Prevention             prevent, control methods
Vaccines               inject, injection, medicine
FMD                    hoof-and-mouth disease
Outbreak               spread
Livestock              moving, mobile, grazing, transit
Herds                  cows, cattle, sheep, pigs, flocks
Disease transmission   infection, catching disease, disease dispersion

Table 3. Thesaurus for concepts in the ontology

The second step is tagging. Every news item is searched for the thesaurus terms, and if a news item contains one of the terms, it is tagged with the corresponding concept. For example, if a news item mentions “epidemic”, it is tagged with “infectious disease”. As a result, every news item has at least one tag, and most have more than one. A new table called “tags” is created in the database for storing the tags of each news item.
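The tagging step can be sketched with the thesaurus of table 3. The Python code below is an illustration rather than the Perl program used in this research. Two details are assumptions: matching is done by case-insensitive substring search, and each concept name is also treated as one of its own search terms.

```python
# Thesaurus terms per concept, taken from table 3.
THESAURUS = {
    "infectious disease": ["contagious disease", "epidemic", "virus"],
    "prevention": ["prevent", "control methods"],
    "vaccines": ["inject", "injection", "medicine"],
    "FMD": ["hoof-and-mouth disease"],
    "outbreak": ["spread"],
    "livestock": ["moving", "mobile", "grazing", "transit"],
    "herds": ["cows", "cattle", "sheep", "pigs", "flocks"],
    "disease transmission": ["infection", "catching disease", "disease dispersion"],
}


def tag_news(text):
    """Return the concepts whose name or thesaurus terms occur in the text."""
    lowered = text.lower()
    tags = []
    for concept, terms in THESAURUS.items():
        candidates = [concept] + terms  # assumption: the concept name also counts
        if any(term.lower() in lowered for term in candidates):
            tags.append(concept)
    return tags


tags = tag_news("An epidemic among cattle led officials to order injections.")
print(tags)
```

Substring search is deliberately loose: “inject” also matches “injections”. A stricter variant could match whole words only, at the cost of missing inflected forms.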

3.3.2 Displaying the ontology on the web

The HTML canvas element is used to draw a clickable ontology graph on the web page. The canvas element is designed for drawing graphics on a web page via a scripting language such as JavaScript. Unlike a static image, users can click on and interact with this graph. Figure 21 shows the ontology graph drawn on the web page. Each node represents a concept and is placed at a fixed position, and an edge is drawn when two nodes are related. Each node is defined as an object of a class that declares several attributes and methods, such as its tag, its neighbors, and a method for drawing itself. When a user clicks on a node, the program finds the right node from the position of the click. After locating the clicked node, it retrieves all its neighbors, that is, the nodes directly related to the clicked one. The program then searches the database for news with those tags and displays them on the web page as ontologically related news. For example, if the currently highlighted node is “vaccines”, its only neighbor is “prevention”, so the program searches for news with a “prevention” tag and randomly displays five of the returned news items.
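The click handling described above combines three small steps: locating the node under the click, collecting its neighbors, and sampling related news. The sketch below shows them in Python for illustration (the application implements this in JavaScript on a canvas). The node positions, the radius, and the function names are assumptions, and the edge list is only a subset of the ontology: the vaccines-prevention and outbreak-FMD links come from examples in the text.

```python
import math
import random

# Hypothetical node centres in canvas pixels.
NODES = {
    "vaccines": (50, 50),
    "prevention": (150, 50),
    "outbreak": (50, 150),
    "FMD": (150, 150),
}
# Subset of the ontology edges mentioned in the text.
EDGES = [("vaccines", "prevention"), ("outbreak", "FMD")]
NODE_RADIUS = 20  # assumed radius of each drawn node


def node_at(x, y):
    """Return the node whose circle contains the click position, or None."""
    for name, (cx, cy) in NODES.items():
        if math.hypot(x - cx, y - cy) <= NODE_RADIUS:
            return name
    return None


def neighbors(name):
    """Return the nodes directly linked to `name` by an edge."""
    linked = [b for a, b in EDGES if a == name]
    linked += [a for a, b in EDGES if b == name]
    return linked


def related_news(news_tags, tags, count=5):
    """Randomly pick up to `count` titles carrying any of the given tags."""
    matches = [
        title for title, title_tags in news_tags.items()
        if set(tags) & set(title_tags)
    ]
    return random.sample(matches, min(count, len(matches)))


clicked = node_at(55, 45)
print(clicked, neighbors(clicked))
```

A click that falls outside every node returns `None`, so the caller can simply ignore it.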

Figure 21. Ontology graph drawn by HTML canvas element

3.4 Web application Design and Implementation

The whole web page is divided into five parts: one block for displaying one active news item at a time, one block for displaying the map with location markers, one block for listing five news items that are geographically related to the active news, one block for listing five news items that are ontologically related to the active news, and one block for showing the ontology graph. All five parts are linked together and clickable. Every time a news record is picked, it is displayed in the upper left block and called the “active news”. The two blocks below it list the geographically related and the ontologically related news. The ontology graph highlights the tag that is associated with the active news. The map in the upper right corner shows the locations of the active news as well as those of the news listed in the other two blocks. Different locations are marked with different colors. The red marker without a number represents the main location of the active news. The red markers with numbers represent other locations associated with the active news. The blue markers with numbers are the main locations of the news that are geographically related to the active news. The green markers with letters are the main locations of the news that are ontologically related to the active news.

The web page is developed in HTML and JavaScript. JavaScript is responsible for sending HTTP requests to the server, while PHP on the server side receives and responds to the HTTP requests. Thanks to Ajax, JavaScript can send several HTTP requests at the same time and update the news in different parts of the web page without conflict.

The logical process of this web application starts at the active news block. The other four blocks are all related to the active news and change accordingly. For example, if the active news changes, the logical process described in figure 22 is followed. The application recalculates the distances and finds the five news items closest to the new active news record. Meanwhile, according to the tag of the new active news, it finds five other ontologically related news reports and refreshes the ontology graph to reset the highlighted node. The map is updated as well.

[Figure 22 diagram: the active news is displayed on the web page; calculating distance yields five geographically related news items; searching neighbor concepts and searching related news based on the ontology yields five ontologically related news items.]

Figure 22. Logical process for the implementation of the web application

Chapter 4: Application Interface and Results

4.1 Web application Interface

Figure 23. The overview of the semantic geospatial web application for exploring epidemiological information

The interface of this semantic geospatial web application is designed and implemented as shown in figure 23. The whole web page is made up of six sections. At the top is the banner showing the title. In the upper left corner is the section for showing one news item at a time; this news item is called the “active news”. To its right is the map showing locations with markers in different colors. In the lower left corner is the block listing the five news items that are closest to the active news. To its right is a list of five news items ontologically related to the active one. In the lower right corner is the ontology graph, with the current tag of the active news highlighted.

The search in this semantic geospatial web application is not based on key words. Instead, the search is done by querying geographic and semantic relationships. “Exploring” news is therefore a more appropriate description of this application than “searching”.

4.2 Exploring active news

Figure 24. The section for active news

Figure 24 shows the section for displaying one active news item at a time. When the web page first loads, an active news item is already chosen. In this section, the news item is presented with detailed information, such as its title, URL, published date, main location and coordinates, and a short description. The URL directs users to the news web page when clicked. The main location is only one of the locations associated with this news item, as explained in the previous chapter. The coordinates of the main location are presented in latitude and longitude format, which OpenLayers uses to mark this location on the map. The active news can be explored in detail, while the news items listed in other sections show only their titles. Clicking on any other news item makes it the active news.

4.3 Exploring news on the map

Figure 25. The section for showing map with locations

Figure 25 shows the section for displaying the world map with location markers. There are several markers in different colors on the map. The red marker with nothing inside marks the main location of the active news, while the other red markers with numbers inside are the other locations of the active news. If only one red marker is shown, only one location is assigned to the active news. The blue markers with numbers inside are the main locations of the five news items in the geographically related section. The blue markers should always be the markers closest to the red marker. The distribution of the blue markers indicates what other news topics are discussed nearby.

The green markers with letters inside represent the main locations of the five news items that are ontologically related to the active news. They have no geographic relationship to the active news, so they can be spread all over the world. The distribution of the green markers shows where in the world the related news is being discussed. Some markers appear to be missing in figure 25. For example, only blue markers 4 and 5 are visible. This is because the main locations of some news items may be at the same place; when the next marker is added, it overlays the previous one, so only one marker shows at each place.

Clicking on the map allows users to explore information by place. For example, if users want to find out what news is discussed around the United States, clicking on the United States finds the news closest to the clicked point. An example is shown in figure 26.


Figure 26. Example of exploring information by clicking the map

4.4 Exploring nearby news

Figure 27. The section for listing five news geographically related to the active news

Figure 27 shows the section for listing five news ranked by distance to the active

49 news. All the news are clickable and when the mouse is on one of them, it will become a pointer. These five news tell users what other topic discussed around the active news. Clicking on any of these five news allows users to examine the clicked news in detail, since the clicked news will become the new active news. And at the same time, geographic related news will be recalculated and a new set of five news will be listed here. Moreover, the location maps will be refreshed as well. As for the ontology graph, the highlighted node will change to the tag of the new active news.

Based on the newly highlighted node and its neighbors, five other ontologically related news items are displayed in place of the old ones. Figures 28 and 29 show this process. In Figure 28, the first news item geographically related to the active news was clicked. In Figure 29, it has become the active news and the other sections have changed accordingly.
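The distance ranking behind this section can be illustrated with a short sketch. The thesis does not state which distance measure or data layout the application uses, so the great-circle (haversine) distance, the `NEWS` dictionary, its identifiers and its approximate coordinates are all assumptions made for illustration. The sketch compares the reference point with every other news item, one main location per item.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

# Hypothetical news ids mapped to an approximate main location (lat, lon).
NEWS = {
    "london": (51.5, -0.1),
    "paris": (48.9, 2.4),
    "berlin": (52.5, 13.4),
    "washington": (38.9, -77.0),
    "seoul": (37.6, 127.0),
    "tokyo": (35.7, 139.7),
}


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def nearest_news(point, news, k=5, exclude=None):
    """Return the ids of the k news items whose main location is closest to point.

    Every item is compared with the point (a linear scan); `exclude` lets the
    active news be left out of its own list of neighbors.
    """
    candidates = [
        (haversine_km(point, location), news_id)
        for news_id, location in news.items()
        if news_id != exclude
    ]
    return [news_id for _, news_id in sorted(candidates)[:k]]
```

Under these assumptions, `nearest_news(NEWS["london"], NEWS, k=5, exclude="london")` would play the role of the geographically related list, while calling the function with a clicked map point and `k=1` corresponds to exploring by clicking on the map.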

Figure 28. Example before clicking on a geographically related news item


Figure 29. Example after clicking on a geographically related news item

4.5 Exploring semantically related news

Figure 30. Sections for showing ontologically related news and the ontology graph


Figure 30 shows the sections that display the ontologically related news and the ontology graph. In the ontology graph, the node highlighted in red represents the tag of the active news. The news shown in the ontologically related section is identified as ontologically related to the active news according to the ontology graph. In this example, the node directly related to the highlighted node “outbreak” is the node “FMD”, so the five news items on the left are all tagged “FMD”.

There are two ways to explore information by ontological relationship. One is to click on one of the five ontologically related news items, which works in the same way as clicking on a geographically related news item. The other is to click on the ontology graph. Figures 31 and 32 show this process. In Figure 31, the current active news has the tag “outbreak”. If users want to see news about “livestock”, they only need to click on the “livestock” node; as shown in Figure 32, the active news then changes to one with the tag “livestock”, and all other sections change accordingly.
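The two ways of exploring by ontological relationship can be sketched as lookups on a small graph. This is only an illustration under stated assumptions: the adjacency dictionary `ONTOLOGY`, the `NEWS_TAGS` table, the news identifiers and the “vaccination” concept are hypothetical, and the thesis does not specify how the application orders the related news or which tagged item it promotes to active news after a node is clicked.

```python
# Hypothetical ontology: each concept maps to its directly related concepts.
ONTOLOGY = {
    "outbreak": {"FMD"},
    "FMD": {"outbreak", "livestock", "vaccination"},
    "livestock": {"FMD"},
    "vaccination": {"FMD"},
}

# Hypothetical tags assigned to news items.
NEWS_TAGS = {
    "n1": {"outbreak"},
    "n2": {"FMD"},
    "n3": {"FMD", "livestock"},
    "n4": {"livestock"},
    "n5": {"vaccination"},
}


def related_news(active_id, ontology, news_tags, limit=5):
    """Return up to `limit` news ids tagged with a neighbor of the active news tags."""
    neighbor_tags = set()
    for tag in news_tags[active_id]:
        neighbor_tags |= ontology.get(tag, set())
    related = [
        news_id
        for news_id, tags in news_tags.items()
        if news_id != active_id and tags & neighbor_tags
    ]
    # Sorting by id is an arbitrary, illustrative ordering.
    return sorted(related)[:limit]


def news_for_concept(concept, news_tags):
    """Return the news ids tagged with a concept, as when a graph node is clicked."""
    return sorted(news_id for news_id, tags in news_tags.items() if concept in tags)
```

With this sample data, an active news item tagged “outbreak” is related to the items tagged “FMD”, and clicking the “livestock” node yields the candidates from which a new active news item would be chosen.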


Figure 31. Example before clicking on the ontology graph

Figure 32. Example after clicking on the ontology graph

Chapter 5: Conclusion

5.1 Results evaluation and summary

5.1.1 Data evaluation

The data evaluation can be divided into two aspects. One is the news data retrieved from Google News. The other is the data saved in the database.

The data extracted from Google News via the Google News Search API contains a limited number of news items: each request returns at most 64. To expand the database, the program needs to run periodically. However, retrieving news every one or two days returns almost the same items, since the results are not updated within such a short interval. Another issue is that some web pages do not remain available. Such news cannot be explored in detail, even though its data is saved in the database.
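Because consecutive retrievals overlap heavily, a periodic run only expands the database by the items it has not seen before. A minimal sketch of that bookkeeping follows; the thesis does not describe how repeated items are handled, so keying items by URL, the dictionary-based store and the `merge_batch` name are assumptions made for illustration.

```python
def merge_batch(store, batch):
    """Add the items of one retrieval to the store and return how many were new.

    `store` maps a URL to a news item; `batch` is a list of dictionaries that
    each carry at least a "url" key. Items already in the store are skipped.
    """
    added = 0
    for item in batch:
        url = item["url"]
        if url not in store:
            store[url] = item
            added += 1
    return added
```

Running two overlapping retrievals through `merge_batch` shows the effect described above: the second run contributes only the few items that were not already stored.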

The other kind of data, the information saved in the database, has some issues as well. This data refers to the locations extracted from the web pages. In an observation of 64 news items, 53 were associated with at least one location. There are a few reasons why the other news items have no locations.

First, no explicit places are mentioned in the news. B3 in Appendix B is an example of news without any places mentioned. Second, the extraction method only goes through the body of the web page rather than the entire HTML source code. The body is obtained by extracting the content between “p” tags in the HTML source code. For most web pages this method works, although it picks up some irrelevant information as well. For some web pages, however, it either extracts irrelevant information instead of the body of the news or returns nothing at all, and without the main body of the news the program cannot extract locations. The third reason is the gazetteer problem, for which B2 in Appendix B is an example. In this news, “Chungcheong” is a province in South Korea, but it does not exist in the current gazetteer. Some places are extracted but cannot be identified as locations because they are absent from the current gazetteer. Overall, the quality of the current data is good enough for use in this application.
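The extraction of body text between “p” tags can be illustrated with a small sketch. It uses Python's standard `html.parser` module as a stand-in, which is not necessarily how the thesis program is implemented; the class and function names are hypothetical. The sketch also reproduces both weaknesses noted above: footer text placed in a “p” tag is collected along with the body, and a page without “p” tags yields nothing.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found between <p> and </p> tags."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._chunks = []

    def flush_paragraph(self):
        """Store the text gathered so far as one paragraph, if there is any."""
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # A new <p> implicitly closes a paragraph that was left open.
            self.flush_paragraph()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush_paragraph()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)


def extract_paragraphs(html):
    """Return the list of paragraph texts contained in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser.flush_paragraph()  # keep a final paragraph that was never closed
    return parser.paragraphs
```

Text inside nested inline tags such as “b” is kept as part of the enclosing paragraph, while text in other containers such as “div” is ignored.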

5.1.2 Web page performance evaluation

The performance of the web page should be evaluated from several perspectives. First, in terms of functionality, the current web page works well. Its purpose, exploring information from geographic and ontological perspectives, is fulfilled to some extent. Second, in terms of efficiency, finding the closest news requires calculating and comparing the distances between the active news and all other news in the database. Without the support of a spatial database, this method may slow the web page's response as the database grows.

Third, the interface of the web page needs improvement. Although the current interface is straightforward to use, there are still some sources of confusion and inconvenience. For example, the news and the markers need a better connection.

5.2 Limitations

Based on the evaluation, a few limitations of this semantic geospatial web application are discussed as follows.

First, the method for extracting the body of the news is limited. Detecting content between “p” tags cannot guarantee a satisfactory result for every web page, because not all web pages use the “p” tag for the main body text. In addition, the content between “p” tags contains some irrelevant information alongside the body text, and this irrelevant information may affect location extraction if it contains place names.

Second, the methods for extracting locations and determining the main location need to be improved. The current extraction method works, but it is neither efficient nor intelligent. If the news contains many words with an initial capital letter, the result will include a lot of irrelevant information, which may affect the efficiency of the program. More advanced techniques, such as Natural Language Processing (NLP), should be considered to exclude irrelevant words. The extracted locations are sometimes unsatisfactory because of the gazetteer problem: an extracted name cannot be identified as a place if it does not appear in the gazetteer. This problem is hard to resolve, because no available gazetteer contains every place in the world, and some comprehensive gazetteers that hold millions of records have format compatibility issues. One possible way to improve the situation is to continuously update the gazetteer to meet the needs of the task.

Third, the current ontology for the foot-and-mouth disease domain is small and informal. As a result, the news in the database is associated with many tags, so most of the news items are directly related. When the ontologically related news is displayed, some news items are therefore always presented, which is not helpful for information exploration. Moreover, the current ontology is too simple to represent the knowledge of the foot-and-mouth disease domain, and without annotations the relationships between concepts are hard to understand.
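The second limitation can be made concrete with a small sketch of capital-letter candidate extraction followed by a gazetteer lookup. It is a simplified illustration rather than the thesis program: the regular expression, the greedy longest-match rule, the function names and the tiny `GAZETTEER` with approximate coordinates are all assumptions.

```python
import re

# Hypothetical gazetteer: place name mapped to an approximate (lat, lon).
GAZETTEER = {
    "Seoul": (37.57, 126.98),
    "South Korea": (36.5, 127.8),
}

# A candidate is a run of consecutive words that start with a capital letter.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*")


def match_run(words, gazetteer):
    """Return the gazetteer names found in a run of words, longest match first."""
    matches = []
    i = 0
    while i < len(words):
        for j in range(len(words), i, -1):
            name = " ".join(words[i:j])
            if name in gazetteer:
                matches.append(name)
                i = j
                break
        else:
            i += 1  # no name starts at this word
    return matches


def extract_locations(text, gazetteer):
    """Return (locations found in the gazetteer, capitalized runs left unmatched)."""
    locations, unmatched = [], []
    for run in CAPITALIZED_RUN.findall(text):
        names = match_run(run.split(), gazetteer)
        if names:
            locations.extend(n for n in names if n not in locations)
        else:
            unmatched.append(run)
    return locations, unmatched
```

In this sketch, a sentence-initial word such as “The” becomes an irrelevant candidate simply because it is capitalized, and “South Chungcheong Province” is extracted as a candidate but discarded because the sample gazetteer has no entry for it.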

5.3 Future research

Although the current web application functions well and presents the core idea of this research, further studies are needed. Based on the evaluation and the discussion of limitations, future research should concentrate on resolving those limitations and improving the practicality of this application. It should not only focus on improving the algorithms, but also make the framework more flexible so that it can be applied to other topics and made available to more people.

Several components can be added to this framework, such as a semantic map that maps news according to their semantics and helps visualize the trend of discussion of various topics.

57 REFERENCE

Agarwal, P. (2005). Ontological considerations in GIScience. International journal of Geographic Information Science, 19(5): 501-536

AvRuskin, G.A., Jacquez, G.M., et al. Visualization and exploratory analysis of epidemiologic data using a novel space time information system. International Journal of Health Geographics, V3 N26: (2004).

Berners-Lee, T., Hendler, J., Lassila, O. (2001). "The Semantic Web". Scientific American Magazine, May: 34-43.

Adam, B., Thea, S.M., Guntur, S. (2003). Quantifying the impact of foot and mouth disease on tourism and the UK economy. Tourism Economics, 9(4): 449-465

Busgeeth, K., Ulrike, R. The use of a spatial information system in the management of HIV/AIDS in South Africa. International Journal of Health Geographics, V3 N13: (2004).

Brinzarea, B., Hendrix , A., Darie, C. (2009) AJAX and PHP. (2nd Ed.). Birmingham, UK: Packt Publishing

CGI, http://condor.cc.ku.edu/~grobe/docs/forms-intro.shtml

Delboni, T.M., Borges, K.A., Laender, A.H.F., Davis, C.A., Jr. (2007). Semantic expansion of geographic web queries based on natural language positioning expressions. Transactions in GIS 11(3): 377-397.

Delboni T M, Borges K A V, Laender A H F (2005). Geographic web search based on positioning expressions. In Proceedings of the 2005 Workshop on Geographic Information Retrieval, Bremen, Germany: 61–4

Ding, Y., Fensel, D., Klein, M., Omelayenko, B. (2002). The semantic web: yet another hip? Data & Knowledge Engineering 41: 205-227

DOM, http://www.w3.org/DOM/

Egenhofer, M. J. (2002). Towards the semantic geospatial web. Proceedings of the

58 10th ACM international symposium on Advances in geographic information systems

Fonseca, F.T., Egenhofer, M., Agouris, P., Câmara, G. (2002) Using Ontologies for Integrated Geographic Information Systems. Transactions in GIS 6(3), 231–257

Fonseca, F.T, Davis, C. Camara, C. (2003) Bridging ontologies and conceptual schemas in geographic information integration. GeoInformatica 7: 355–78

Franklin, C., Paula, H. (1992). An introduction to GIS: linking maps to databases. Database. 15 (2): 17-22.

Gruber, T. R. (1993). A translation approach to portable ontology specifications. Knowledge Acquisition 5 (2): 199–220.

Haklay, M., Singleton, A., Parker, C. (2008). 2.0: The Neogeography of the GeoWeb. Geography Compass, 2(6):2011-2039

Harvey, F., Kuhn, W., Pundt, H., Bishr, Y., Riedemann, C. (1999). Semantic interoperability: A central issue for sharing geographic information. The Annals of Regional Science, 33(2): 213-232, DOI: 10.1007/s001680050102

Hill, L.L., Frew, J., Zheng, Q. (1999). Geographic Names – the Implementation of a Gazetteer in a Georeferenced Digital Library. D-Lib Magazine, 5(1).

Karimi, H. A., Akinci, B., Boukamp, F., Peachavanish, R. (2003) Semantic interoperability in infrastructure systems. In Proceedings of the Fourth Joint Symposium on Information Technology in Civil Engineering, Nashville, Tennessee

Klien, E., Lutz, M., Kuhn, W. (2006). Ontology-based discovery of geographic information services – An application in disaster management. Computers, Environment and Urban Systems, 30:102-123

Klyne, G., Carroll, J. J. (2004). Resource Description Framework (RDF): Concepts and Abstract Syntax. Retrieved from: http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/

Kolas, D., Hebeler, J., Dean, M. (2005). Geospatial Semantic Web: Architecture of Ontologies. Lecture Notes in Computer Science, 3799/2005, 183-194, DOI: 10.1007/11586180_13

Kurose, J., Ross, K. (2003). Computer Networking: A Top Down Approach Featuring the Internet. (2nd Ed.) Boston, Addison-Wesley

Le Hégaret, P., Wood, L., Robie, J. (2000). What is the Document Object Model? Retrieved from: http://www.w3.org/TR/DOM-Level-2-Core/introduction.html

59 Lutz, M., Klien, E. (2006). Ontology-based retrieval of geographic information. International journal of Geographic Information Science, 20(3): 233-260

LWP 6.02, http://search.cpan.org/dist/libwww-perl/lib/LWP.pm

Mark, D. M., Smith, B., Egenhofer, M., Hirtle, S. (2003). Ontological foundations for geographic information science. In Usery L and McMaster R B (eds) Research Challenges in Geographic Information Science. Boca Raton, FL, CRC Press: 335–50

Morimoto, Y., Aono, M., Houle, M.E., McCurley, K.S. (2003). Extracting Spatial Knowledge from the Web. Symposium on Applications and the Internet (SAINT-2003). Orlando, FL: IEEE Computer Society, 326–333.

O‟Reilly, T. (2005). What is Web 2.0–design patterns and business models for the next generation of software. http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html.

Peachavanish, R., Karimi, H.A. (2007). Ontological engineering for interpreting geospatial queries. Transactions in GIS 11(1): 115-130.

Pendell, D.L., Leatherman, J., Schroeder, T.C., Alward, G.S. (2007). The Economic Impacts of a Foot-And-Mouth Disease Outbreak: A Regional Analysis. Journal of Agricultural and Applied Economics 39:19-33.

Perez, A. M., Zeng, D., Tseng, C.J., Chen, H., Whedbee, Z., Paton, D., Thurmond, M.C. (2009). A web-based system for near real-time surveillance and space-time cluster analysis of foot-and-mouth disease and other animal diseases. Preventive Veterinary Medicine, 91: 39–45

Perl BDI, http://dbi.perl.org/

Perry, B.D., Kalpravidh, W., Coleman, P.G., Horst, H.S., McDermott, J.J., Randolph, T.F., Gleeson, L.J. (1999). The economic impact of foot and mouth disease and its control in South-East Asia: a preliminary assessment with special reference to Thailand. Revue Scientifique et Technique 2, 478–479

Philip, L. P., Lee, J.G., Seitzinger, A.H. (2002). Potential revenue impact of an outbreak of foot-and-mouth disease in the United States. Vet Med Today: Food Animal Economics, 220(7):988-992

Scharl, A. (2007). Towards the Geospatial Web: Media Platforms for Managing Geotagged Knowledge Repositories. Advanced Information and Knowledge Processing, Part 1, 3-14, DOI: 10.1007/978-1-84628-827-2_1

Scharl, A., Stern, H. Weichselbraun, A. (2008). Annotating and visualizing location data in geospatial web applications. Proceedings of the first international workshop on

60 Location and the web, DOI: 10.1145/1367798.1367809

Snow, J. (1936). Snow on Cholera, New York: The Commonwealth Fund: Oxford University Press, 1936

Smith, K. (2006). Simplifying Ajax-style Web development. Computer, 39 (5):98-101

Tochtermann, K., Riekert, W.-F., Wiest, G., Seggelke, J., Mohaupt-Jahr, B. (1997). Using Semantic, Geographical, and Temporal Relationships to Enhance Search and Retrieval in Digital Catalogs. 1st European Conference on Research and Advanced Technology for Digital Libraries (LNCS, Vol. 1324). Pisa, Italy, 73–86.

Weiss, S.M., Indurkhya, N., Zhang, T., Damerau, F.J. (2005). Text Mining – Predictive Methods for Analyzing Unstructured Information. New York: Springer.

Wilesmith, J.W., Stevenson, M.A., King, C.B., Morris, R.S. (2003). Spatio-temporal epidemiology of foot-and-mouth disease in two counties of Great Britain in 2001. Preventive Veterinary Medicine, 61(3): 157-170

Yang, P.C., Chu, R.M., Chung, W.B., Sung, H.T. (1999). Epidemiological characteristics and financial costs of the 1997 foot-and-mouth disease epidemic in Taiwan. Veterinary Record, 145:731-734

Appendix A: User's Manual

A1 View active news

1. Click on the URL of the active news in the upper left section to go to the web page of the active news.

2. Locations for the active news are shown as red markers. The red marker without a number is the main location.

A2 Explore news on the map

1. Click on the map to show the news closest to the clicked point as the active news.

2. News in the geographically related and ontologically related sections will change accordingly.

3. The highlighted node in the ontology graph will change accordingly.

A3 Explore geographically related news

1. Click on a news item listed in the geographically related news section, and the clicked one will become the active news.

2. The five geographically related news items will be recalculated and updated.

3. News in the ontologically related section will change accordingly.

4. The highlighted node in the ontology graph will change accordingly.

A4 Explore ontologically related news

1. Click on a news item listed in the ontologically related news section, and the clicked one will become the active news.

2. A new set of five ontologically related news items will be selected and updated based on the ontological relationships shown in the ontology graph.

3. News in the geographically related section will change accordingly.

4. The highlighted node in the ontology graph will change accordingly.

A5 View news by different concepts

1. Click on a node in the ontology graph. Each node represents a concept in the ontology and is assigned to several news items. Clicking on the node will present an active news item that has this concept as a tag.

2. News in the geographically related and ontologically related sections will change accordingly. The ontologically related news all have a tag that is linked to the currently highlighted tag.

3. The highlighted node in the ontology graph will change accordingly.

Appendix B: Sample news saved in the database

B1 News with locations extracted

Title: Hand, foot and mouth disease: spatiotemporal transmission and climate
Published Date: 05 April 2011
URL: http://7thspace.com/headlines/378097/hand_foot_and_mouth_disease_spatiotemporal_transmission_and_climate.html

HTML source code saved in table:

The Hand-Foot-Mouth Disease (HFMD) is the most common infectious disease in China, its total incidence being around 500,000 ~1,000,000 cases per year. The composite space-time disease variation is the result of underlining attribute mechanisms that could provide clues about the physiologic and demographic determinants of disease transmission and also guide the appropriate allocation of medical resources to control the disease.Methods and FindingsHFMD cases were aggregated into 1456 counties and during a period of 11 months. Suspected climate attributes to HFMD were recorded daily at 740 stations throughout the country and subsequently interpolated within 145611 cells across space-time (same as the number of HFMD cases) using the Bayesian Maximum Entropy (BME) method while taking into consideration the relevant uncertainty sources. The dimensionalities of the two datasets together with the integrated dataset combining the two previous ones are very high when the topologies of the space-time relationships between cells are taken into account. Using a self-organizing map (SOM) algorithm the dataset dimensionality was effectively reduced into 2 dimensions, while the spatiotemporal attribute structure was maintained. 16 types of spatiotemporal HFMD transmission were identified, and 3-4 high spatial incidence clusters of the HFMD types were found throughout China, which are basically within the scope of the monthly climate (precipitation) types. Conclusions: HFMD propagates in a composite space-time domain rather than showing a purely spatial and purely temporal variation. There is a clear relationship between HFMD occurrence and climate. HFMD cases are geographically clustered and closely linked to the monthly precipitation types of the region. The occurrence of the former depends on the later. © 2011 7thSpace Interactive All Rights Reserved - About | Disclaimer | Helpdesk

There are currently 46933 people browsing 7thSpace

B2 News without location extracted due to gazetteer issue

Title: Livestock Restrictions for Foot-and-Mouth Fully Lifted
Published Date: 04 April 2011
URL: http://english.chosun.com/site/data/html_dir/2011/04/04/2011040400886.html

HTML source code saved in table:

The foot-and-mouth disease outbreak that has haunted farms nationwide seems to be under control. The Agriculture Ministry announced that it lifted the last remaining ban on livestock movement in the country on Sunday, in South Chungcheong Province. It said the highly contagious animal disease appears to have been contained, with the last case being reported on Feb. 25. No livestock has been culled since mid-March, following nationwide vaccinations that began in December last year. Some nearly 3.5 million pigs and cattle were culled at a cost of over US$2.7 billion after the first case was confirmed in late November. The government plans to remain on alert and closely monitor the situation.

B3 News without locations mentioned

Title: Early Detection and Quick Response Key to Minimizing Losses Resulting from Foot and Mouth Disease
Published Date: 28 March 2011
URL: http://www.farmscape.com/f2ShowScript.aspx?i=23638&q=Early+Detection+and+Quick+Response+Key

HTML source code saved in table:

Farmscape for March 28, 2011 (Episode 3545)

The chair of the Canadian Swine Health Board says early detection and quick response are key to limiting the losses that would be caused by an outbreak of foot and mouth disease.The first in a series of Swine Health Awareness Bulletins designed to keep the pork industry informed of emerging or potential health threats deals with swine vesicular diseases several of which produce symptoms similar to foot and mouth disease, including fluid-filled blisters in the mouth and on the snout, feet and teats of recently farrowed sows.Canadian Swine Health Board chair Florian Possberg says, because food and mouth disease is so rare, it's something producers don't see very often.span style="FONT-SIZE: 12pt; mso-bidi-font-size: 10.0pt; mso-ansi-language: EN-CA">
Clip-Florian Possberg-Canadian Swine Health Board:
The issue around swine vesicular disease is the most severe form of it is foot and mouth disease which can really devestate a whole livestock industry for a country.
But there are other conditions that mimic what you would expect from foot and mouth disease so our bulletin is really to provide information to alert producers as to what to look for in terms of what this condition is.One of the real keys of dealing with this disease is early detection and control so, if we can provide information to producers so they can actually identify suspicious cases and engage our veterinarians and soon in identifying the disease very early on, we can sort through things much better than having it get out control.
Possberg says the Canadian Swine health Board is focusing on early detection to get on top of any outbreak.He says, if you can get early detection of the disease and can quarantine affected areas to prevent it's spread, it's possible to make a potentially huge problem much more manageable. The bulletin is being distributed to industry stakeholders and being made available through the Canadian Swine Health Board web site.
