University of Hagen at CLEF 2004: Indexing and Translating Concepts for the GIRT Task
Johannes Leveling, Sven Hartrumpf
Intelligent Information and Communication Systems
University of Hagen (FernUniversität in Hagen)
58084 Hagen, Germany
{Johannes.Leveling,Sven.Hartrumpf}@fernuni-hagen.de

Abstract

This paper describes the work done at the University of Hagen for our participation in the German Indexing and Retrieval Test (GIRT) task of the CLEF 2004 evaluation campaign. We conducted both monolingual and bilingual information retrieval experiments. For monolingual experiments with the German document collection, the focus is on applying and comparing three indexing methods targeting full word forms, disambiguated concepts, and extended semantic networks. The bilingual experiments for retrieving English documents for German topics rely on translating and expanding query terms based on a ranking of semantically related English terms for a German concept. English translations are compiled from heterogeneous resources, including multilingual lexicons such as EuroWordNet and dictionaries available online.

1 Introduction

This paper investigates the performance of different indexing methods and automated retrieval strategies for the second participation of the University of Hagen in the evaluation campaign of the Cross Language Evaluation Forum (CLEF). In 2003, retrieval strategies based on generating query variants for a natural language (NL) query were compared (Leveling, 2003). Our best experiment in 2003 achieved a mean average precision (MAP) of 0.2064 for a run using both topic title and description.

Before presenting the different approaches and results for monolingual and bilingual retrieval experiments with the GIRT documents (German Indexing and Retrieval Test database, see Kluck and Gey (2001)) in separate sections, we give a short overview of the analysis and transformation of natural language queries and documents.
Natural language processing (NLP) as described in the following subsections is part of query processing for the NLI-Z39.50¹ (Leveling and Helbig, 2002), a natural language interface for databases supporting the Internet protocol Z39.50. The major part of our experimental infrastructure was developed for and is applied in this NLP system.

¹ The NLI-Z39.50 is being developed as part of the project “Natürlichsprachliches Interface für die internationale Standardschnittstelle Z39.50” and funded by the DFG (Deutsche Forschungsgemeinschaft) within the support program for libraries “Modernisierung und Rationalisierung in wissenschaftlichen Bibliotheken”.

1.1 The MultiNet Paradigm

In the NLI-Z39.50, natural language queries (corresponding to a topic’s title, description, or narrative) are transformed into a well-documented knowledge and meaning representation, Multilayered Extended Semantic Networks (abbreviated as MultiNet) (Helbig, 2001; Helbig and Gnörlich, 2002). The core of a MultiNet consists of concepts (nodes) and semantic relations and functions between them (edges). Figure 1 shows the relational structure of the MultiNet representation for the description of GIRT topic 116.

The MultiNet Paradigm defines a fixed set of 93 semantic relations (plus a set of functions) to describe the meaning connections between concepts, including synonymy (SYNO), subordination, i.e. hyponymy and hypernymy (SUB), meronymy and holonymy (PARS), antonymy (ANTO), and relations for change of sorts between lexemes. For example, the relation CHPA indicates a change from a property (such as ‘deep’) into an abstract object (such as ‘depth’).

[Figure 1: semantic network graph, not reproduced here. Concept nodes (with English glosses where given): du.1.1 (you), streß.1.1 (stress), psychisch.1.1 (mental), dokument.1.1 (document), problem.1.1 (problem), prüfling.1.1 (examinee), kandidat.1.1 (candidate), finden.1.1 (find), berichten.2.2 (report), prüfungskandidat.1.1, prüfung.1.1 (exam); inner nodes c1 to c10.]

Figure 1: The core MultiNet representation for the query “Finde Dokumente, die über psychische Probleme oder Stress von Prüfungskandidaten oder Prüflingen berichten.” / ‘Find documents reporting on mental problems or stress of examination candidates or examinees.’ (description of GIRT topic 116). Numerical suffixes for reading distinction of English concepts are omitted.

The relations shown in Figure 1 are association (ASSOC), attachment of object to object (ATTCH), property relationship (PROP), predicative concept specifying a plurality (PRED), experiencer (EXP), an informational process or object (MCONT), carrier of a state (SCAR), state specifier (SSPE), conceptual subordination for objects (SUB), conceptual subordination for situations (SUBS), neutral object (OBJ), temporal restriction for a situation (TEMP), and a function for the introduction of alternatives (*ALTN1).

1.2 The WOCADI Parser

A syntactico-semantic parser is applied both when preprocessing NL queries and when parsing all documents in the concept indexing approach and the network matching approach described in Section 2.1. The system uses the WOCADI parser (WOrd ClAss based DIsambiguating parser; see for example Helbig and Hartrumpf (1997); Hartrumpf (2003)), which is based on the principles of Word Class Functional Analysis (WCFA). For a given German sentence, the parser generates a semantic representation in the form of a semantic network of the MultiNet formalism.
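To make the node-and-edge view of a MultiNet concrete, the following minimal sketch stores a semantic network as a set of labeled, directed edges. It is only an illustration under stated assumptions: the class `SemanticNetwork`, its methods, the node name `n1`, and the edge directions are hypothetical and are not taken from the WOCADI or MultiNet software, and the edges are not a transcription of Figure 1.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticNetwork:
    """Toy labeled directed graph: concepts are nodes, relation names label edges."""

    # Each edge is a (source concept, relation name, target concept) triple.
    edges: set = field(default_factory=set)

    def add(self, source: str, relation: str, target: str) -> None:
        """Add one labeled edge to the network."""
        self.edges.add((source, relation, target))

    def related(self, concept: str, relation: str) -> list:
        """Return the sorted targets reachable from `concept` via one `relation` edge."""
        return sorted(t for s, r, t in self.edges if s == concept and r == relation)

    def relations_used(self) -> list:
        """Return the sorted set of relation names occurring in the network."""
        return sorted({r for _, r, _ in self.edges})


net = SemanticNetwork()
# CHPA example from the text: a property ('deep') changes into an abstract object ('depth').
net.add("deep", "CHPA", "depth")
# Hypothetical edge: an inner node n1 conceptually subordinated (SUB) to dokument.1.1.
net.add("n1", "SUB", "dokument.1.1")

print(net.related("deep", "CHPA"))
print(net.relations_used())
```

A set of triples keeps the sketch small; a real implementation would also need the additional layers and functions of the formalism, which are not modeled here.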
The NL analysis is supported by HaGenLex (Hartrumpf et al., 2003), a domain-independent computer lexicon linked to and supplemented by external sources of lexical and morphological information, in particular CELEX (Baayen et al., 1995) and GermaNet (Kunze and Wagner, 2001). HaGenLex includes:

• A lexicon with full morpho-syntactic and semantic information for more than 22,000 lexemes.
• A shallow lexicon containing words with morpho-syntactic information only. This lexicon comprises about 50,000 entries.
• Several lexicons with more than 200,000 proper nouns (including names of products, companies, countries, cities, etc.).

MultiNet (and therefore also HaGenLex) differentiates between homographs, polysemes, and meaning molecules.² The WOCADI parser provides powerful disambiguation modules (Hartrumpf, 2003, 2001), whose rules and statistics use syntactic and semantic information to disambiguate lexemes and structures.

² A meaning molecule is a regular polyseme with different meaning facets which can occur in the same sentence. For instance, two facets of ‘bank’ (building and legal person) are referred to in the sentence ‘The bank across the street charges a nominal fee for account management.’

The semantic network in Figure 1 illustrates several features of the parser: the disambiguation of a verb (the correct reading is represented by the concept berichten.2.2³), the representation of a nominal compound prüfungskandidat.1.1⁴ not contained in the lexicons together with its constituents prüfung.1.1 and kandidat.1.1, and the correct attachment and coordination of noun phrases. These features are important for our approach to translating queries: linguistic challenges for translation such as lexical, syntactic, or semantic ambiguities are already resolved by the parser.
1.3 The Database Independent Query Representation

To support access to a wide range of different target databases with different protocols and formal retrieval languages, the semantic network representation of a user query is transformed into an intermediate representation, a Database Independent Query Representation (DIQR). A DIQR expression comprises features typical for database queries:

• Attributes (so-called semantic access points), such as author, publisher, title, or date-of-publication.
• Term relations specifying how to search and match query terms. For example, the term relation ‘<’ indicates that a matching document must contain a term with a value less than the given search term.
• Term types indicating a data type for a search term. Typical examples for term types are number, date, name, word, or phrase.
• Search terms identifying what terms a document representation should contain. Search terms include concepts (for example, prüfung.1.1 / ‘exam’) and word forms (for example, “Prüfungen” / ‘exams’).
• Boolean operators in prefix notation for the combination of attributes, term relations, term types, search terms, or expressions to construct more complex expressions, for example ‘AND’ (conjunction) and ‘OR’ (disjunction). By convention, the operator ‘AND’ can be omitted, because it is assumed as a default.
• Optional numeric weights associated with search terms. These weights are used in information retrieval (IR) tasks to indicate how important a search term is considered in a query.

The DIQR is the result of a rule-based transformation of the semantic network representation using a RETE-based compiler and interpreter (the implementation is described in more detail by Leveling and Helbig (2002)). It is mapped to a query in a formal language the database management software supports (such as a query for the Z39.50 protocol, an SQL query, or a SOAP request), which is then submitted to the target system.
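The components listed above can be pictured as a small expression tree that is printed in prefix notation. The sketch below is a hedged illustration only: the names `Term`, `Bool`, `Query`, and `render` are hypothetical, the output syntax is an assumption modeled loosely on the description above, and optional weights are left out because their notation is not specified here. It does not reproduce the RETE-based implementation.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Term:
    """A typed search term, e.g. a word form or a disambiguated concept."""
    term_type: str            # e.g. "word", "phrase", "number", "date"
    values: Tuple[str, ...]   # one or more concepts or word forms


@dataclass(frozen=True)
class Bool:
    """A Boolean operator applied to sub-expressions (prefix notation)."""
    op: str                   # "AND" or "OR"
    args: Tuple["Expr", ...]


Expr = Union[Term, Bool]


@dataclass(frozen=True)
class Query:
    """Attributes (semantic access points), a term relation, and an expression."""
    attributes: Tuple[str, ...]
    relation: str             # e.g. "=" or "<"
    expr: Expr


def render(node) -> str:
    """Render a Term, Bool, or Query as a parenthesized prefix expression."""
    if isinstance(node, Term):
        quoted = " ".join(f'"{v}"' for v in node.values)
        return f"({node.term_type} {quoted})"
    if isinstance(node, Bool):
        return f"({node.op} " + " ".join(render(a) for a in node.args) + ")"
    if isinstance(node, Query):
        # Several attributes are combined with OR; a single one is written as-is.
        attrs = (node.attributes[0] if len(node.attributes) == 1
                 else "(OR " + " ".join(node.attributes) + ")")
        return f"({attrs} {node.relation} {render(node.expr)})"
    raise TypeError(f"unsupported node: {node!r}")


# Search for the concept or the word form mentioned in the text, in title or abstract.
query = Query(
    attributes=("title", "abstract"),
    relation="=",
    expr=Bool("OR", (Term("word", ("prüfung.1.1",)), Term("word", ("Prüfungen",)))),
)
print(render(query))
```

Keeping the tree immutable (frozen dataclasses) makes it straightforward to apply later rewriting steps, such as term expansion, by building new trees instead of modifying existing ones.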
For example, the semantic network in Figure 1 is transformed into the DIQR

((OR title abstract) =
  (AND
    (OR (phrase “psychologisch.1.1” “problem.1.1”)
        (word “stress.1.1”))
    (OR (word “prüfungskandidat.1.1”)
        (wordlist “prüfung.1.1” “kandidat.1.1”)
        (word “prüfling.1.1”))))

After expanding query terms with a disjunction of semantically related terms, the DIQR is normalized into a disjunctive normal form (DNF) and its components,