An IE and IR Approach to deal with Geographic Information Scope in Textual Documents

Christian Sallaberry & Mustapha Baziz & Julien Lesbegueries & Mauro Gaio Laboratoire d’Informatique-Université de Pau (UPPA) Avenue du doyen Poplawski BP 575 64013 PAU Cedex, {christian.sallaberry, mustapha.baziz, julien.lesbegueries}@univ-pau.fr

Abstract

We briefly present requirements and a methodology of semantic annotation for automatic indexing and geo-referencing of text documents. The first evaluation results show that combining a spatial approach with a classical (statistical-based) IR approach significantly improves retrieval accuracy, particularly in the case of “realistic” queries.

Key-Words

Information Extraction, Information Retrieval, Geographic Information Scope, Digital Libraries, Cultural Heritage.

1. Introduction

Geographically related queries form nearly one fifth of all queries submitted to the Excite search engine, the most frequently occurring terms being place names (Sanderson et al., 2004). Our contribution focuses on digital libraries and proposes to extend the basic services of existing Library Management Systems with new ones dedicated to geographic (spatial) information extraction and retrieval (PIV project). Geographic information in such a repository is composed of a spatial feature (Lesbegueries et al., 2006), a temporal feature and a thematic one. “Music instruments in the vicinity of Laruns in the XIXth century” is an example of a complete geographic feature: “Music instruments” is the thematic feature, “vicinity of Laruns” is the spatial feature and “XIXth century” is the temporal one.

Our spatial model supports Absolute and Relative Geographic Features. Named geographic features such as “ district” are well-known named places; we call them Absolute Geographic Features (AGF). Complex geographic features such as “Biarritz vicinity” or “South of Biarritz district” have to be interpreted and therefore require spatial reasoning; such features are called Relative Geographic Features (RGF). We associate each RGF with one or more spatial relationships (adjacency, inclusion, distance, orientation), which yields a recursive definition.

Our approach differs from others such as SPIRIT (Jones et al., 2004) and GIPSY (Woodruff et al., 1994) in the back-office spatial reasoning used for the interpretation and indexing of both AGFs and RGFs. For instance, the SPIRIT system mainly tags AGFs within web documents (open domain), whereas we are mainly concerned with domain-specific corpora drawn from the cultural heritage of a closed and specific region (the south-western area of France).
Another specificity concerns the granularity of the managed information units: textual paragraphs of digitized archives in our case, and web pages in the case of the SPIRIT system. In the proposed approach, a refined spatial information interpretation and markup process is applied both at the indexing stage of the information units and during the interpretation of users’ queries. As we work on specific digital library collections, and as these collections are quite stable (contrary to Web pages, for example), the approach seems suitable and the cost of such refined spatially aware indexing is reasonable. Queries are interpreted dynamically, and detailed GF indexes allow more accurate information retrieval. In this paper, we test the assumptions made in the PIV approach for dealing with spatial and thematic features. We combine our purely qualitative spatial approach (PIV) with classical quantitative IR approaches in order to enhance retrieval accuracy in the case of general queries.

Conference RIAO2007, Pittsburgh PA, U.S.A. May 30-June 1, 2007 - Copyright C.I.D. Paris, France

2. The Geographic Core Model

In this model, following the linguistic hypothesis, a geographic feature (GF) is recursively defined from one or several other GFs, and spatial relations are part of the GF’s definition. The target/landmark principle (Vandeloise, 1986) can be approximately defined in a recursive manner. For instance, the GF “north of the Biarritz-Pau line” is first defined by the “Biarritz” and “Pau” landmarks, which are well-known named places; the term “line” then creates a new well-known geometrical object linking the two landmarks and cutting the space into two sub-spaces; finally, an orientation relation creates a reference to the target to focus on. Figure 1 shows that a GF has at least one representation (A) with a natural or artificial boundary; it can be specialized (B) into an absolute feature (AGF), i.e. a named place, or a relative feature (RGF). An RGF is defined with a reference, i.e. a relation linking at least one other GF (C). The cycle represents the recursive definition.

Figure 1. Geographic core model simplified schema

Figure 2. An excerpt of the PIV system rules used to extract GFs from free text

Therefore, a GF can be:
• an Absolute Geographic Feature (AGF) if it only consists of a well-known named place, i.e. a toponym with its geocode,
• a Relative Geographic Feature (RGF) if it is defined using a spatial relation (generally topological) linking at least one GF (which can be an AGF or another RGF).

For textual IE, this approach has been adapted into a recursive grammar (Figure 2). In the core model, all these spatial references have attributes used to characterize them. For instance, distance has a numerical or a qualitative parameter, and adjacency has a qualifier as defined in (Lesbegueries et al., 2006; Muller, 2002). An XML tree complying with the XML schema thus describes any GF.
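As an illustration, the recursive AGF/RGF definition can be sketched as a small Python data structure. This is only a minimal sketch and not the PIV XML schema: the class names, the field names (relation, landmarks, qualifier) and the "line" relation label are assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class AGF:
    """Absolute Geographic Feature: a well-known named place."""
    name: str
    geocode: Optional[tuple] = None  # hypothetical field; left unset in this sketch


@dataclass
class RGF:
    """Relative Geographic Feature: a spatial relation over one or more GFs."""
    relation: str                     # e.g. "adjacency", "inclusion", "distance", "orientation"
    landmarks: List["GF"]             # at least one AGF or RGF (recursive definition)
    qualifier: Optional[str] = None   # e.g. "north" for an orientation relation


GF = Union[AGF, RGF]


def describe(gf: GF) -> str:
    """Render a GF as a nested expression, following the recursion."""
    if isinstance(gf, AGF):
        return gf.name
    inner = ", ".join(describe(g) for g in gf.landmarks)
    label = gf.relation if gf.qualifier is None else f"{gf.relation}[{gf.qualifier}]"
    return f"{label}({inner})"


def depth(gf: GF) -> int:
    """Number of nested relations above the deepest AGF."""
    if isinstance(gf, AGF):
        return 0
    return 1 + max(depth(g) for g in gf.landmarks)


# "north of the Biarritz-Pau line": "line" is an assumed label for the
# geometrical object linking the two landmarks.
line = RGF("line", [AGF("Biarritz"), AGF("Pau")])
north_of_line = RGF("orientation", [line], qualifier="north")
```

Here the orientation relation wraps the line, which in turn wraps the two named places, mirroring the cycle in Figure 1.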

2.1. Geographic IE within PIV Textual Documents

Hereinafter, we briefly describe the Linguistic Processing Sequence (LPS) supporting the PIV spatial IE process. The goal of the LPS is to populate a structured information repository (XML indexes) from heterogeneous information sources. We also used it to separate the spatial (geographic) features from the thematic ones in the query when evaluating the IR results (Section 3).

Figure 3. Linguistic processing sequence (LPS)

Following previous work on textual documents (Lesbegueries et al., 2006), we adopt an active reading behavior; that is to say, the sought-after information is known a priori. This is why, unlike standard Natural Language Processing (NLP) (Abolhassani, 2003), our linguistic processing sequence is applied locally, near candidate named places. To mark these candidates, a lexicon is used to provide a reasonably generic bootstrap process. AGFs (i.e. names of villages, forests, etc.) are thus detected and marked first. RGFs are then built from the previously identified AGFs.

The data processing sequence used to highlight geographic features is implemented as shown in Figure 3. A tokeniser and a splitter parse the whole textual flow (Figure 3-A). This pre-processing produces a new textual flow in which the initial content is enriched with marks for logical sub-structures and word separators, together with the lemma of each word (thanks to an embedded lemmatization phase). In the second stage (Figure 3-B), candidate geographic features are detected as follows. First, all sentences containing a token that starts with a capital letter and is preceded by a token listed in a lexicon of geographic feature initiators (“in”, “from”, etc.) are marked. Then, a Part-Of-Speech (POS) tagger parses these marked sentences and retrieves the words tagged as proper names (e.g. “Paul” and “Laruns”). In the third stage (Figure 3-C), a Definite Clause Grammar (DCG) based analysis interprets the extracted syntagms (inclusion, adjacency, distance to another geographic feature, etc.). The feature “near Laruns” is interpreted as an RGF defined by an adjacency relation and by the AGF “Laruns”. The GF validation stage (Figure 3-D) calls external services (gazetteers) to confirm every candidate AGF. We use IGN (French Geographic Institute) and ViaMichelin resources. For the sentence “Paul passe près de Laruns” (Paul passes near Laruns), the “Laruns” GF is confirmed whereas the “Paul” GF is removed. All candidate RGFs associated with a non-validated AGF are also removed. Finally, a minimum bounding rectangle (MBR) representation consisting of geocode coordinates is added to the XML tree.
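The candidate detection and validation stages can be illustrated with a short Python sketch. It rests on several simplifying assumptions: the initiator lexicon and the gazetteer are tiny hard-coded sets (hypothetical stand-ins for the PIV lexicon and for the IGN and ViaMichelin services), and the POS tagger is replaced by a plain capitalization test. It is not the PIV implementation.

```python
import re

# Hypothetical initiator lexicon; the source only names "in" and "from".
INITIATORS = {"in", "from", "de"}

# Hypothetical stand-in for the external gazetteer services.
GAZETTEER = {"Laruns", "Pau", "Biarritz"}


def tokenize(sentence):
    """Very rough tokeniser: sequences of word characters."""
    return re.findall(r"\w+", sentence)


def is_marked(tokens):
    """A sentence is marked if a capitalized token follows an initiator."""
    return any(
        prev.lower() in INITIATORS and tok[:1].isupper()
        for prev, tok in zip(tokens, tokens[1:])
    )


def candidate_agfs(sentence):
    """Capitalized tokens of a marked sentence (crude proxy for proper names)."""
    tokens = tokenize(sentence)
    if not is_marked(tokens):
        return []
    return [tok for tok in tokens if tok[:1].isupper()]


def validate(candidates, gazetteer=GAZETTEER):
    """Keep only the candidates confirmed by the gazetteer."""
    return [c for c in candidates if c in gazetteer]


candidates = candidate_agfs("Paul passe près de Laruns")
confirmed = validate(candidates)
```

On the example sentence, both “Paul” and “Laruns” are proposed as candidates, and only “Laruns” survives the gazetteer check, as described above.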

2.2. Query Evaluation based on spatial criteria: GFs Intersections

Our search technique is based on a spatial mapping between the GFs of the query and those of the documents. This mapping relies on the geospatial footprints created dynamically for the query and on those stored in the index files of the corpus. For example, Figure 4-A illustrates a query and some indexed areas (precise geospatial footprints for AGFs and approximated MBRs for RGFs) representing Pyrenean villages named in the corpus. Figure 4-A shows that the “Laruns” village is more relevant to the query than the “Louvie-Soubiron” village. In the same way, one can deduce that “Center of Beost” is not relevant to the same query, since its footprint does not overlap with that of the query.

Figure 4. A/ An example of a query, “I want documents dealing with places which are near Eaux-Bonnes.”, and its corresponding box (the biggest one). The other polygons represent GFs (extracted from documents of our corpus) that may match the query. B/ Relevance computing of the retrieved documents.

The selection process consists in processing index files and computing intersections (Lesbegueries et al., 2006) with a GIS. We then select the corresponding relevant document fragments (Df). The relevance of a document fragment is computed by evaluating the surface of the intersection I between the GF of the document fragment (Df) and that of the query (Q). For any query, the relevance of each retrieved document may therefore differ (Figure 4):

\[ Df_{precision} = \frac{surface_I}{surface_{Df}}, \quad Df_{significance} = \frac{surface_I}{surface_Q}, \quad Df_{distance} = \frac{d}{D} \]

We then compute the score of Df as follows:

\[ score_{Df} = \frac{Df_{precision} + Df_{significance}}{2 + Df_{distance}} \quad (1) \]

The closer the centroids of I and Q are to each other, the higher the relevance score of Df. An XML DBMS (eXist, http://exist.sourceforge.net) and a GIS (PostGIS, http://postgis.refractions.net) support these searching and computing operations on the corpus indexes.
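To make formula (1) concrete, the following Python sketch computes the score for axis-aligned boxes. Several points are assumptions made for the example only: footprints are reduced to rectangles, d is taken as the distance between the centroids of I and Q, and D is taken as the diagonal of the query box (the source does not define D). The actual system computes intersections with a GIS.

```python
from dataclasses import dataclass
from math import hypot
from typing import Optional


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle used as a simplified geospatial footprint."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    @property
    def area(self) -> float:
        return (self.xmax - self.xmin) * (self.ymax - self.ymin)

    @property
    def centroid(self) -> tuple:
        return ((self.xmin + self.xmax) / 2, (self.ymin + self.ymax) / 2)

    @property
    def diagonal(self) -> float:
        return hypot(self.xmax - self.xmin, self.ymax - self.ymin)


def intersection(a: Box, b: Box) -> Optional[Box]:
    """Return the overlapping box, or None when the boxes do not overlap."""
    xmin, ymin = max(a.xmin, b.xmin), max(a.ymin, b.ymin)
    xmax, ymax = min(a.xmax, b.xmax), min(a.ymax, b.ymax)
    if xmin >= xmax or ymin >= ymax:
        return None
    return Box(xmin, ymin, xmax, ymax)


def score(df: Box, q: Box) -> float:
    """Formula (1): (precision + significance) / (2 + distance)."""
    i = intersection(df, q)
    if i is None:
        return 0.0  # no overlap: the fragment is not retrieved
    precision = i.area / df.area
    significance = i.area / q.area
    (ix, iy), (qx, qy) = i.centroid, q.centroid
    d = hypot(ix - qx, iy - qy)  # ASSUMPTION: centroid distance between I and Q
    big_d = q.diagonal           # ASSUMPTION: normalizing distance D
    return (precision + significance) / (2 + d / big_d)


query = Box(0, 0, 4, 4)
same_area = Box(0, 0, 4, 4)      # footprint identical to the query
partial = Box(2, 2, 6, 6)        # overlaps one quarter of the query
disjoint = Box(10, 10, 12, 12)   # no overlap
```

Under these assumptions, a footprint identical to the query obtains the maximum score of 1, a partially overlapping footprint obtains a lower score, and a disjoint one obtains 0.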

2.3. Query Evaluation based on statistical criteria: term frequency

The classical IR approach is based on the notion of a “bag” of single words [2]. In such full-text approaches, documents are first indexed using classical term indexing. This consists in selecting the single words occurring in the documents, stemming these words with an appropriate stemmer [9] and, finally, removing stop-words according to a stoplist. A weight Wtd(t_i, d_j) is then assigned to each term t_i in a document d_j following the formula given in (2):

\[ Wtd(t_i, d_j) = \frac{2 \cdot tf_{i,j} \cdot \log\frac{N - n_i + 0.5}{n_i + 0.5}}{2 \cdot \left(0.25 + 0.75 \cdot \frac{dl_j}{avg\_dl}\right) + tf_{i,j}} \quad (2) \]

where tf_{i,j} represents the frequency of the term t_i in the document d_j, n_i is the number of documents containing the term t_i, and N is the total number of documents in the collection. dl_j represents the length of the document d_j and avg_dl the average document length in the collection. This weighting method, an enhanced TF.IDF formula, is introduced to attenuate the negative impact of long documents at the search stage [2]. A vector-based model [1] is then used to retrieve documents: for a given query q, the inner product between the query vector and the vector of each document d_j in the collection is computed to obtain the relevance score:

\[ Rel(q, d_j) = \sum_{k=1}^{|q|} Wtq(t_k, q) \cdot Wtd(t_k, d_j) \quad (3) \]

This relevance score is used to determine the ranking of the document (dj) in the final list of retrieved documents in response to the query (q).
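Formulas (2) and (3) can be illustrated with the following Python sketch. It is a toy example under stated assumptions: documents are already tokenized, stemming and stop-word removal are omitted, the five-document corpus is invented, and every query term weight Wtq is set to 1 (the source does not specify how Wtq is computed).

```python
import math
from collections import Counter


def term_weights(doc_tokens, doc_freq, n_docs, avg_dl):
    """Formula (2): weight of each term of one document."""
    dl = len(doc_tokens)
    weights = {}
    for term, tf in Counter(doc_tokens).items():
        n_i = doc_freq[term]
        idf = math.log((n_docs - n_i + 0.5) / (n_i + 0.5))
        weights[term] = (2 * tf * idf) / (2 * (0.25 + 0.75 * dl / avg_dl) + tf)
    return weights


def relevance(query_terms, doc_weights):
    """Formula (3) with the ASSUMPTION Wtq(t_k, q) = 1 for every query term."""
    return sum(doc_weights.get(term, 0.0) for term in query_terms)


# Invented toy corpus, for illustration only.
corpus = [
    ["music", "instrument", "laruns"],
    ["church", "laruns"],
    ["music", "festival"],
    ["shepherd", "valley"],
    ["river", "bridge", "stone"],
]
n_docs = len(corpus)
avg_dl = sum(len(doc) for doc in corpus) / n_docs
doc_freq = Counter(term for doc in corpus for term in set(doc))

weights = [term_weights(doc, doc_freq, n_docs, avg_dl) for doc in corpus]
query = ["music", "instrument"]
scores = [relevance(query, w) for w in weights]

# Document indexes sorted by decreasing relevance score.
ranking = sorted(range(n_docs), key=lambda j: scores[j], reverse=True)
```

In this toy corpus the first document contains both query terms and is ranked first, followed by the document that contains only “music”; documents sharing no term with the query receive a score of 0.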

3. Preliminary results

It is not possible to develop here a complete case study combining the PIV pure spatial information extraction and retrieval approach with the classical (statistical-based) one. Roughly, the idea is to split the query into a spatial sub-query and a thematic sub-query. The spatial sub-query corresponds to the Absolute/Relative Geographic Features (AGF/RGF) identified by the Linguistic Processing Sequence (LPS) (Section 2.1) and is submitted to the PIV system, whereas the remaining words form the thematic sub-query, which is submitted to the statistical-based IR system. The final result is then built by intersecting the two sets returned by the PIV and Classical approaches. The final ranking is based on the one obtained by PIV: each ranked document in the PIV result set is added to the final result if it also belongs to the Classical result set.

Table 1 gives the results obtained with actual queries dealing with both spatial and other thematic features (e.g. “music instruments in Laruns vicinity in the XIX century”) by the spatial approach (case A) and the classical approach (case B). Case C concerns the results obtained when combining the spatial and classical approaches by intersecting the two result sets, as explained above. Precision is much lower for the pure spatial (PIV) approach: only 15% at top 5, whereas the classical approach reaches 48%. A careful analysis of the results shows that some relevant documents are retrieved but are not ranked at the top. The PIV system is therefore not suitable for rank-ordering in the case of general (spatial + thematic) queries; indeed, PIV’s IE and IR processes deal only with spatial information. When the PIV and classical approaches are combined, the results improve markedly: at top 5, precision reaches 70%, compared with 48% for the classical approach and only 15% for the spatial approach.

However, the trivial combination used (intersection criterion) reduces the number of retrieved documents: for example, for query 12 (not given in the table) the combined approach returns only four documents, whereas the Classical approach returns 233 and the PIV one returns 724. This will probably cause a decrease in recall.

An open issue is therefore how to merge the two result sets (spatial-based results and classical full-text ones) so as to optimize not only precision at the top retrieved documents, but also recall. This could probably be achieved by complementing the intersection operator with more complex ranking operators.
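The intersection-based combination described in this section can be sketched in a few lines of Python. The function name and the document identifiers are hypothetical; only the rule itself (keep the PIV order, retain a document only if the classical system also returned it) comes from the text.

```python
def combine(piv_ranked, classical_ranked):
    """Intersect two result lists while preserving the PIV ranking."""
    classical = set(classical_ranked)
    return [doc for doc in piv_ranked if doc in classical]


# Hypothetical result lists for one query.
piv_results = ["d3", "d1", "d7", "d2"]
classical_results = ["d2", "d9", "d1"]
combined = combine(piv_results, classical_results)
```

With these lists, only the documents returned by both systems remain, in the order given by PIV, which also shows how the intersection shrinks the result set.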

All queries                                               P@5    P@10   P@15   # of responses
A) Spatial approach (PIV)                           Avg   0.15   0.18   0.18   1154
B) Classical approach                               Avg   0.48   0.39   0.36   331
C) Intersecting Spatial and Classical results sets  Avg   0.70   0.50   0.43   25.75

Table 1. Results of PIV and Classical approaches and their intersection.
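For readers unfamiliar with the P@k columns, precision at rank k can be computed as in the sketch below. The ranked list and the relevance judgments are invented for the example, and dividing by k even when fewer than k documents are returned is a common convention that is assumed here, since the source does not state which convention was used.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    top = ranked[:k]
    return sum(1 for doc in top if doc in relevant) / k


# Hypothetical ranking and relevance judgments for one query.
ranked_docs = ["d1", "d2", "d3", "d4", "d5"]
relevant_docs = {"d1", "d4", "d9"}
p_at_5 = precision_at_k(ranked_docs, relevant_docs, 5)
```

The averages in Table 1 would then correspond to the mean of this value over all queries.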

4. Conclusion

In this paper, we briefly proposed an approach to information extraction and retrieval dealing with geographic information scope in textual documents. The implemented PIV prototype combines original geographic semantics Information Extraction (IE) and Information Retrieval (IR) approaches. The PIV system relies on a web services architecture and fully supports XML (schemas, GF representations, indexes, document extracts, etc.); hence, PIV web services can be easily integrated into existing Library Management Systems. The proposed PIV spatial-based approach is compared with the classical statistical-based one. The first results show that the PIV approach needs to be combined with classical statistical-based approaches in order to enhance retrieval accuracy in the case of actual queries (dealing with both spatial and thematic scopes). It remains to formalize the combination of the two approaches using operators more complex than the simple intersection used here.

References

[1] Boughanem, M., Chrisment, C., Tmar, M. (2001). Mercure and MercureFiltre Applied for Web and Filtering Tasks at TREC-10. In Proceedings of TREC.
[2] Robertson, S.E., Walker, S., Hancock-Beaulieu, M., Gatford, M., Payne, A. (1995). Okapi at TREC-4. In Proceedings of TREC.
[3] Sanderson, M., Kohler, J. (2004). Analyzing geographic queries. In Proceedings of the Workshop on Geographic Information Retrieval, SIGIR, www.geo.unizh.ch/~rsp/gir/
[4] Lesbegueries, J., Sallaberry, C., Gaio, M. (2006). Associating spatial patterns to text-units for summarizing geographic information. Workshop GIR - SIGIR.
[5] Jones, C.-B., Abdelmoty, A.-I., Finch, D., Fu, G., Vaid, S. (2004). The Spirit Spatial Search Engine: Architecture, Ontologies and Spatial Indexing. Third International Conference - Geographic Information Science, Adelphi, USA, pp. 125-139.
[6] Woodruff, A.G., Plaunt, C. (1994). GIPSY: Automated Geographic Indexing of Text Documents. Journal of the American Society for Information Science, 45(9), pp. 645-655.
[7] Vandeloise, C. (1986). L’espace en français. Travaux Linguistiques. Seuil.
[8] Muller, P. (2002). Topological spatio-temporal reasoning and representation. Computational Intelligence, pp. 420-450.
[9] Porter, M. (1980). An algorithm for suffix stripping. Program, 14(3), pp. 130-137.
