Precise Information Retrieval Exploiting Predicate-Argument Structures

Daisuke Kawahara†  Keiji Shinzato‡  Tomohide Shibata†  Sadao Kurohashi†
†Graduate School of Informatics, Kyoto University
‡Rakuten Institute of Technology
{dk, shibata, kuro}@i.kyoto-u.ac.jp, [email protected]

Abstract

A concept can be linguistically expressed in various syntactic constructions. Such syntactic variations spoil the effectiveness of incorporating dependencies between words into information retrieval systems. This paper presents an information retrieval method for normalizing syntactic variations via predicate-argument structures. We conduct experiments on standard test collections and show the effectiveness of our approach. Our proposed method significantly outperforms a baseline method based on word dependencies.

1 Introduction

Most conventional approaches to information retrieval (IR) deal with words as independent terms. In query sentences¹ and documents, however, dependencies exist between words.² To capture these dependencies, some extended IR models have been proposed in the last decade (Jones, 1999; Lee et al., 2006; Song et al., 2008; Shinzato et al., 2008). These models, however, did not achieve consistent significant improvements over models based on independent words.

¹ In this paper, we handle queries written in natural language.
² While dependencies between words are sometimes considered to be the co-occurrence of words in a sentence, in this paper we consider dependencies to be syntactic or semantic dependencies between words.

One of the reasons for this is the linguistic variation of syntax; that is, the same content can be syntactically expressed in various ways. For instance, the same or similar meaning can be expressed using the passive voice or the active voice in a sentence. Previous approaches based on dependencies cannot identify such variations. This is because they use the output of a dependency parser, which generates syntactic (grammatical) dependencies built upon surface word sequences. Consider, for example, the following sentence in a document:

(1) YouTube was acquired by Google.

Dependency parsers based on the Penn Treebank and the head percolation table (Collins, 1999) judge the head of "YouTube" to be "was" ("YouTube ← was"; hereafter, we denote a dependency by "modifier ← head"). This dependency, however, cannot be matched with the dependency "YouTube ← acquire" in a query like:

(2) I want to know the details of the news that Google acquired YouTube.

Furthermore, even if a dependency link in a query matches one in a document, a mismatch of dependency type can cause another problem. This is because previous models did not distinguish dependency types. For example, the dependency "YouTube ← acquire" in query sentence (2) can be found in the following irrelevant document:

(3) Google acquired PushLife for $25M ... YouTube acquired Green Parrot Pictures ...

While this document does indeed contain the dependency "YouTube ← acquire," its type is different; specifically, the query dependency is accusative while the document dependency is nominative. That is to say, ignoring differences in dependency types can lead to inaccurate information retrieval.
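To make these two failure modes concrete, the following sketch contrasts matching on untyped modifier-head pairs with matching on case-typed pairs for examples (1)-(3). It is purely illustrative: the dependency tuples are transcribed by hand, and the toy matcher is our own simplification rather than the retrieval model proposed in this paper.

```python
# Illustrative sketch only: hand-written dependency tuples for examples (1)-(3);
# a real system would obtain them from a parser.

# Untyped dependencies as (modifier, head) pairs.
query_dep = ("YouTube", "acquire")    # from query (2)
doc1_deps = {("YouTube", "was")}      # the arc discussed for sentence (1); other arcs omitted
doc3_deps = {("Google", "acquire"), ("PushLife", "acquire"),
             ("YouTube", "acquire"), ("Green Parrot Pictures", "acquire")}  # document (3)

print(query_dep in doc1_deps)   # False: the relevant sentence (1) is missed
print(query_dep in doc3_deps)   # True: the irrelevant document (3) matches

# Case-typed dependencies as (modifier, case, head) triples.
query_typed = ("YouTube", "ACC", "acquire")    # YouTube is the object in (2)
doc3_typed = {("Google", "NOM", "acquire"), ("PushLife", "ACC", "acquire"),
              ("YouTube", "NOM", "acquire"), ("Green Parrot Pictures", "ACC", "acquire")}

print(query_typed in doc3_typed)  # False: the spurious match disappears
```

The first two comparisons reproduce the missed match against (1) and the spurious match against (3); the third shows that recording the dependency type is enough to reject document (3).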
In this paper, we propose an IR method that does not use syntactic dependencies but rather predicate-argument structures, which are normalized forms of sentence meanings. For example, query sentence (2) is interpreted as the following predicate-argument structure (hereafter, we denote a predicate-argument structure by ⟨· · ·⟩):³

(4) ⟨NOM:Google acquire ACC:YouTube⟩.

³ In this paper, we use the following abbreviations: NOM (nominative), ACC (accusative), DAT (dative), ALL (allative), GEN (genitive), CMI (comitative), LOC (locative), ABL (ablative), CMP (comparative), DEL (delimitative) and TOP (topic marker).

Sentence (1) is also represented as the same predicate-argument structure, and documents including this sentence can be regarded as relevant documents. Conversely, the irrelevant document (3) has predicate-argument structures different from (4), as follows:

(5) a. ⟨NOM:Google acquire ACC:PushLife⟩,
    b. ⟨NOM:YouTube acquire ACC:Green Parrot Pictures⟩.

In this way, by considering this kind of predicate-argument structure, more precise information retrieval is possible.
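As a concrete illustration of this representation, a predicate-argument structure can be viewed as a predicate together with a mapping from case roles to arguments. The sketch below transcribes (4) and (5) by hand and compares them with simple equality; it is our own simplification for exposition, not the indexing and ranking scheme described in Section 3.

```python
# Hand-transcribed predicate-argument structures from examples (4) and (5).
# Each structure is a (predicate, frozenset of (case, argument)) pair.
def pas(predicate, **args):
    return (predicate, frozenset(args.items()))

query_pas = pas("acquire", NOM="Google", ACC="YouTube")  # (4), derived from query (2)
sentence1 = pas("acquire", NOM="Google", ACC="YouTube")  # normalized form of the passive sentence (1)
document3 = [
    pas("acquire", NOM="Google", ACC="PushLife"),                # (5a)
    pas("acquire", NOM="YouTube", ACC="Green Parrot Pictures"),  # (5b)
]

print(query_pas == sentence1)                  # True: the relevant sentence matches
print(any(query_pas == p for p in document3))  # False: the irrelevant document does not
```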
We mainly evaluate our proposed method using the NTCIR test collection, which consists of approximately 11 million Japanese web documents. We also conduct an experiment on the TREC Robust 2004 test collection, which consists of around half a million English documents, to validate the applicability of our approach to languages other than Japanese.

This paper is organized as follows. Section 2 introduces related work, and Section 3 describes our proposed method. Section 4 presents the experimental results and discussion. Section 5 describes the conclusions.

2 Related work

There have been two streams of related work that consider dependencies between words in a query sentence.

One stream is based on linguistically-motivated approaches that exploit natural language analysis to identify dependencies between words. For example, Jones proposed an information retrieval method that exploits linguistically-motivated analysis, especially dependency relations (Jones, 1999). However, Jones noted that dependency relations did not contribute to significantly improving performance due to the low accuracy and robustness of syntactic parsers. Subsequently, both the accuracy and robustness of dependency parsers were dramatically improved (Nivre and Scholz, 2004; McDonald et al., 2005), with such parsers being applied more recently to information retrieval (Lee et al., 2006; Song et al., 2008; Shinzato et al., 2008). For example, Shinzato et al. investigated the use of the syntactic dependencies output by a dependency parser and reported a slight improvement over a baseline method that used only words. However, the use of dependency parsers still introduces the problems stated in the previous section because such parsers handle only syntactic dependencies.

The second stream of research has attempted to integrate dependencies between words into information retrieval models. These models include a dependence language model (Gao et al., 2004), a Markov Random Field model (Metzler and Croft, 2005), and a quasi-synchronous dependence model (Park et al., 2011). However, these studies focus on integrating term dependencies into their respective models without explicitly considering any syntactic or semantic structures in language. Therefore, their purpose can be considered different from ours.

Park and Croft (2010) proposed a method for ranking query terms and selecting the most effective ones by exploiting typed dependencies in the analysis of query sentences. They did not, however, use typed dependencies for indexing documents.

The work closest to ours is that of Miyao et al. (2006), who proposed a method for the semantic retrieval of relational concepts in the domain of biomedicine. They retrieved sentences that match a given query using predicate-argument structures via a framework of region algebra. That is, they approached the task of sentence matching, which is not the same as document retrieval (or ranking). As for the types of queries they used, although their method could handle natural language queries, they used short queries like "TNF activate IL6." Because of the heavy computational load of region algebra, if a query matches several thousand sentences, for example, it requires several thousand seconds to return all sentence matches (though it takes on average 0.01 seconds to return the first matched sentence).

In the area of question answering, predicate-argument structures have been used to precisely match a query with a passage in a document (e.g., Narayanan and Harabagiu, 2004; Shen and Lapata, 2007; Bilotti et al., 2010). However, the candidate documents from which an answer is extracted are retrieved using conventional search engines without predicate-argument structures.

3 Information retrieval exploiting predicate-argument structures

3.1 Overview

Our key idea is to exploit the normalization of linguistic expressions based on their predicate-argument structures to improve information retrieval.

The process of an information retrieval system can be decomposed into offline processing and online processing. During offline processing, analysis is first applied to a document collection. For example, typical analyses for English include tokenization and stemming, while those for Japanese include morphological analysis. In addition, previous models that use dependencies between words also performed dependency parsing. In this paper, we employ predicate-argument structures. Predicate-argument structure analysis normalizes the following linguistic expressions (the passive-voice case is illustrated in the sketch after this list):

• relative clause
• passive voice (the predicate is normalized to active voice)
• causative (the predicate is normalized to normal form)
• intransitive (the predicate is normalized to transitive)
• giving and receiving expressions (the predicate is normalized to a giving expression)

In the case of Japanese, we use the morphological analyzer JUMAN,⁴ and the predicate-argument structure analyzer KNP (Kawahara and Kurohashi, 2006).⁵ The accuracy of the syntactic dependencies output by KNP is around 89%, and that of
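The following toy sketch illustrates the passive-voice case from the list above on example (1): the surface subject of the passive clause is restored to the ACC slot and the agent to the NOM slot, yielding the structure in (4). The analysis dictionary, its field names, and the normalization function are our own assumptions for exposition; they do not reflect the internal representation or behavior of KNP.

```python
# Toy illustration of passive-voice normalization (one item from the list above).
# The input is a hand-written analysis of sentence (1), "YouTube was acquired by Google";
# a real system would obtain such an analysis from a predicate-argument structure analyzer.

def normalize_passive(analysis):
    """Map a passive-voice analysis to an active-voice predicate-argument structure."""
    if not analysis.get("passive"):
        return analysis
    args = dict(analysis["args"])
    normalized = {"predicate": analysis["predicate"], "passive": False, "args": {}}
    # The surface subject of the passive clause becomes the accusative argument,
    # and the "by"-agent becomes the nominative argument.
    if "NOM" in args:
        normalized["args"]["ACC"] = args.pop("NOM")
    if "AGENT" in args:
        normalized["args"]["NOM"] = args.pop("AGENT")
    normalized["args"].update(args)  # any remaining arguments are kept unchanged
    return normalized

passive_analysis = {"predicate": "acquire", "passive": True,
                    "args": {"NOM": "YouTube", "AGENT": "Google"}}
print(normalize_passive(passive_analysis))
# {'predicate': 'acquire', 'passive': False, 'args': {'ACC': 'YouTube', 'NOM': 'Google'}}
```

Analogous rewrites would cover the relative-clause, causative, intransitive, and giving/receiving cases listed above.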