An IE and IR Approach to Deal with Geographic Information Scope in Textual Documents

Christian Sallaberry & Mustapha Baziz & Julien Lesbegueries & Mauro Gaio
Laboratoire d’Informatique-Université de Pau (UPPA)
Avenue du doyen Poplawski BP 575 64013 PAU Cedex, France
{christian.sallaberry, mustapha.baziz, julien.lesbegueries}@univ-pau.fr

Abstract

We briefly present the requirements and a methodology of semantic annotation for the automatic indexing and geo-referencing of text documents. The first evaluation results show that combining a spatial approach with a classical (statistical) IR approach significantly improves retrieval accuracy, particularly in the case of “realistic” queries.

Key-Words

Information Extraction, Information Retrieval, Geographic Information Scope, Digital Libraries, Cultural Heritage.

1. Introduction

Geographically related queries form nearly one fifth of all queries submitted to the Excite search engine, and the terms occurring most frequently are place names (Sanderson et al., 2004). Our contribution focuses on digital libraries and proposes to extend the basic services of an existing Library Management System with new ones dedicated to geographic (spatial) information extraction and retrieval (PIV project). Geographic information in such a repository is composed of a spatial feature (Lesbegueries et al., 2006), a temporal feature and a thematic one. “Music instruments in the vicinity of Laruns in the XIXth century” is an example of a complete geographic feature: “Music instruments” is the thematic feature, “vicinity of Laruns” is the spatial feature and “XIXth century” is the temporal one.

Our spatial model supports Absolute and Relative Geographic Features. Named geographic features such as “Biarritz district” are well-known named places. We call them Absolute Geographic Features (AGF). Complex geographic features such as “Biarritz vicinity” or “South of Biarritz district” have to be interpreted and therefore need some spatial reasoning processes.
Such features are called Relative Geographic Features (RGF). We associate each RGF with one or more spatial relationships (adjacency, inclusion, distance, orientation), which gives a recursive definition.

Our approach differs from others such as SPIRIT (Jones et al., 2004) and GIPSY (Woodruff et al., 1994) in the back-office spatial reasoning used to interpret and index both AGFs and RGFs. For instance, the SPIRIT system mainly tags AGFs within web documents (open domain), whereas we are mainly concerned with domain-specific corpora drawn from the cultural heritage of a closed and specific region (south-western France). Another difference concerns the granularity of the managed information units: textual paragraphs of digitized archives in our case, and web pages in the case of the SPIRIT system.

In the proposed approach, a refined spatial information interpretation and markup process is applied both at the indexing stage of the information units and during the interpretation of users’ queries. As we work on specific digital library collections, and as these collections are quite stable (unlike Web pages, for example), the approach seems suitable and the cost of such refined, spatially aware indexing is reasonable. Queries are interpreted dynamically, and detailed GF indexes allow more accurate information retrieval. We make propositions and test the assumptions made in the PIV approach to deal with spatial and thematic features. We combine our purely qualitative spatial approach (PIV) with classical quantitative IR approaches in order to enhance retrieval accuracy in the case of general queries.

Conference RIAO2007, Pittsburgh PA, U.S.A. May 30-June 1, 2007 - Copyright C.I.D. Paris, France

2. The Geographic Core Model

In this model, according to the linguistic hypothesis, a GF is recursively defined from one or several other GFs, and spatial relations are part of the GF’s definition.
The target/landmark principle (Vandeloise 1986) can be approximately defined in a recursive manner. For instance, the GF “north of the Biarritz-Pau line” is first defined by the landmarks “Biarritz” and “Pau”, which are well-known named places; the term “line” then creates a new well-known geometrical object linking the two landmarks and cutting the space into two sub-spaces; finally, an orientation relation creates a reference that selects the target to focus on.

Figure 1 shows that a GF has at least one representation (A) with a natural or artificial boundary; it can be specialized (B) into an absolute feature (AGF), i.e. a named place, or a relative feature (RGF). An RGF is defined with a reference, i.e. a relation linking at least one other GF (C). The cycle represents the recursive definition.

Figure 1. Geographic core model simplified schema

Figure 2. An excerpt of the PIV system rules used to extract GFs from free text

Therefore, a GF can be:
• an Absolute Geographic Feature (AGF) if it only consists of a well-known named place, i.e. a toponym with its geocode,
• a Relative Geographic Feature (RGF) if it is defined using a spatial relation (generally topological) linking at least one GF (which can be an AGF or another RGF).

For textual IE, this approach has been adapted into a recursive grammar (Figure 2). In the core model, all these spatial references have attributes used to characterize them. For instance, distance has a numerical or a qualitative parameter, and adjacency has a qualifier as defined in (Lesbegueries et al., 2006; Muller, 2002). An XML tree complying with the XML schema thus describes any GF.

2.1. Geographic IE within PIV Textual Documents

Hereinafter, we briefly describe the Linguistic Processing Sequence (LPS) supporting the PIV spatial IE process. The goal of the LPS is to populate a structured information repository (XML indexes) from heterogeneous information sources.
We also used it to separate spatial (geographic) features from thematic ones in the query when evaluating the IR results (section 3).

Figure 3. Linguistic processing sequence (LPS)

In line with previous work on textual documents (Lesbegueries et al., 2006), we adopt an active reading behavior; that is to say, the sought-after information is known a priori. This is why, unlike standard Natural Language Processing (NLP) (Abolhassani, 2003), our linguistic processing sequence is applied locally, near candidate named places. To mark these candidates, a lexicon is used to bootstrap the process in a fairly generic way. AGFs (names of villages, forests, etc.) are thus detected and marked first. RGFs are then built from the previously identified AGFs.

The processing sequence used to highlight geographic features is shown schematically in Figure 3. A tokeniser and a splitter parse the whole textual flow (Figure 3-A). This pre-processing produces a new textual flow in which the initial content is enriched with marks for logical sub-structures and word separators, and in which words are annotated with their lemmas (thanks to an embedded lemmatization phase).

In the second stage (Figure 3-B), geographic features called “candidates” are detected as follows. First, all sentences containing a token that starts with a capital letter and is preceded by a token from a lexicon of geographic feature initiators (“in”, “from”…) are marked. Then, a Part-Of-Speech (POS) tagger parses these marked sentences and retrieves the words tagged as proper names (e.g. “Paul” and “Laruns”).

In the third stage (Figure 3-C), a Definite Clause Grammar (DCG) based analysis is carried out; it interprets the extracted syntagms (inclusion, adjacency, distance to another geographic feature, etc.). The feature “near Laruns” is interpreted as an RGF, itself defined by an adjacency relation and by the AGF “Laruns”.
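The recursive interpretation of a feature such as “near Laruns” can be illustrated with a minimal Python sketch. It is not the PIV grammar: the system described here uses a DCG whose rules (Figure 2) are not reproduced in this text, so a small recursive function stands in for it. The class names `AGF` and `RGF`, the `LEXICON` table and the functions `interpret` and `landmark` are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class AGF:
    """Absolute Geographic Feature: a named place (geocode omitted here)."""
    name: str


@dataclass(frozen=True)
class RGF:
    """Relative Geographic Feature: a spatial relation applied to another GF."""
    relation: str                    # adjacency, inclusion, distance, orientation
    reference: "GF"                  # the GF the relation refers to (AGF or RGF)
    qualifier: Optional[str] = None  # attribute of the relation, e.g. "south"


GF = Union[AGF, RGF]

# Hypothetical toy lexicon (an assumption, not the PIV lexicon):
# spatial expression -> (relation, qualifier).
LEXICON = {
    "near": ("adjacency", None),
    "vicinity of": ("adjacency", None),
    "north of": ("orientation", "north"),
    "south of": ("orientation", "south"),
}


def interpret(phrase: str) -> GF:
    """Interpret a phrase as a nested GF, peeling one relation per call."""
    text = phrase.strip()
    if text.lower().startswith("the "):
        text = text[4:].lstrip()
    # Try the longest expressions first so that a short entry cannot
    # shadow a longer one sharing the same prefix.
    for expr in sorted(LEXICON, key=len, reverse=True):
        if text.lower().startswith(expr + " "):
            relation, qualifier = LEXICON[expr]
            return RGF(relation, interpret(text[len(expr) + 1:]), qualifier)
    # No relation left: what remains is treated as a candidate named place.
    return AGF(text)


def landmark(gf: GF) -> AGF:
    """Return the innermost AGF on which a (possibly nested) GF is built."""
    while isinstance(gf, RGF):
        gf = gf.reference
    return gf


near_laruns = interpret("near Laruns")
nested = interpret("south of the vicinity of Laruns")
```

In this sketch `near_laruns` is an `RGF` with an adjacency relation whose reference is the `AGF` “Laruns”, and `nested` wraps a second RGF around it, which mirrors the recursive definition of the core model.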
The GF validation stage calls external services (gazetteers) to confirm every candidate AGF (Figure 3-D). We use IGN (French Geographic Institute) and ViaMichelin resources. For the sentence “Paul passe près de Laruns” (Paul passes near Laruns), the GF “Laruns” is confirmed whereas the GF “Paul” is removed. All candidate RGFs associated with a non-validated AGF are also removed. Finally, an MBR (minimum bounding rectangle) representation consisting of geocode coordinates is added to the XML tree.

2.2. Query Evaluation based on spatial criteria: GFs Intersections

Our search technique is based on a spatial mapping between the GFs of the query and those of the documents. This mapping relies on the geospatial footprints created dynamically for the query and on those stored in the index files of the corpus. For example, Figure 4-A illustrates a query and some indexed areas (precise geospatial footprints for AGFs and approximated MBRs for RGFs) representing Pyrenean villages named in the corpus. Figure 4-A shows that the village of “Laruns” is more relevant to the query than the village of “Louvie-Soubiron”. In the same way, one can deduce that “Center of Beost” is not relevant to the same query, since its footprint does not overlap the footprint of the query.

Figure 4. A/ An example of query: “I want documents dealing with places which are near Eaux-Bonnes.” and its corresponding box (the biggest one). The other polygons represent GFs (extracted from documents of our corpus) that may match the query. B/ Relevance computing of the retrieved documents.

The selection process consists in processing index files and computing intersections (Lesbegueries et al. 2006) with a GIS. Then, we select the corresponding relevant document fragments (Df).
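The selection step can be sketched as follows, under strong simplifying assumptions: every footprint is reduced to an axis-aligned MBR, whereas the system described here uses a GIS and precise footprints for AGFs. The `MBR` class, the `select_fragments` function and all coordinates are hypothetical; the boxes are arbitrary numbers chosen to reproduce the overlap pattern described for Figure 4-A, not real geocodes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MBR:
    """Axis-aligned minimum bounding rectangle."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def intersection_area(self, other: "MBR") -> float:
        """Surface shared by the two rectangles (0.0 if they do not overlap)."""
        width = min(self.max_x, other.max_x) - max(self.min_x, other.min_x)
        height = min(self.max_y, other.max_y) - max(self.min_y, other.min_y)
        return max(0.0, width) * max(0.0, height)


def select_fragments(query: MBR, index: Dict[str, MBR]) -> List[str]:
    """Keep the indexed GFs whose footprint overlaps the query footprint."""
    return [name for name, footprint in index.items()
            if query.intersection_area(footprint) > 0.0]


# Arbitrary illustrative boxes (assumption: not real coordinates).
query = MBR(0.0, 0.0, 10.0, 10.0)            # "near Eaux-Bonnes"
index = {
    "Laruns": MBR(2.0, 2.0, 6.0, 6.0),             # fully inside the query box
    "Louvie-Soubiron": MBR(8.0, 8.0, 12.0, 12.0),  # partial overlap
    "Center of Beost": MBR(11.0, 0.0, 13.0, 2.0),  # no overlap
}
selected = select_fragments(query, index)
```

With these boxes, “Laruns” and “Louvie-Soubiron” are selected and “Center of Beost” is discarded because its box does not overlap the query box.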
We compute the relevance of a document fragment by evaluating the surface of the intersection I between the GF of the document fragment and the GF of the query. For any query, the relevance of each retrieved document may be different (Figure 4):

\[
Df_{\mathrm{precision}} = \frac{I_{\mathrm{surface}}}{Df_{\mathrm{surface}}}, \qquad
Df_{\mathrm{significance}} = \frac{I_{\mathrm{surface}}}{Q_{\mathrm{surface}}}, \qquad
Df_{\mathrm{distance}} = \frac{d}{D}
\]

Therefore, we compute the score of Df as follows:

\[
Df_{\mathrm{score}} = \frac{Df_{\mathrm{precision}} + Df_{\mathrm{significance}}}{2 + Df_{\mathrm{distance}}} \qquad (1)
\]

The closer the centroids of I and Q are to each other, the higher the relevance score of Df.
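Equation (1) translates directly into a few lines of Python. This is only a sketch: the function name `df_score` and its parameter names are hypothetical, and because this section does not define d and D precisely, they are passed in as plain numbers rather than computed. Reading d as the distance between the centroids of I and Q, and D as a normalising distance, is an assumption based on the closing sentence above.

```python
def df_score(i_surface: float, df_surface: float, q_surface: float,
             d: float, big_d: float) -> float:
    """Relevance score of a document fragment Df, following equation (1).

    i_surface:  surface of the intersection I between Df and the query Q
    df_surface: surface of the document fragment footprint
    q_surface:  surface of the query footprint
    d, big_d:   the distances d and D of the formula (assumed meaning:
                centroid distance and a normalising distance)
    """
    if df_surface <= 0 or q_surface <= 0 or big_d <= 0:
        raise ValueError("surfaces and D must be strictly positive")
    precision = i_surface / df_surface
    significance = i_surface / q_surface
    distance = d / big_d
    return (precision + significance) / (2 + distance)


# Identical footprints and coincident centroids give the maximum score.
best = df_score(i_surface=4.0, df_surface=4.0, q_surface=4.0, d=0.0, big_d=1.0)
# A quarter of each footprint overlaps, centroids coincide.
partial = df_score(i_surface=1.0, df_surface=4.0, q_surface=4.0, d=0.0, big_d=1.0)
```

Since the intersection can be no larger than either footprint, precision and significance are each at most 1, so the score lies between 0 and 1 and decreases as the ratio d/D grows.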