AM03 Procv3.Pdf

Contents A Fuzzy Set Approach to Query Syntax Analysis in Information Retrieval Systems ……. 1 Dariusz Josef Kogut Acquisition of Hypernyms and Hyponyms from the WWW …………………………….… 7 Ratanachai Sombatsrisomboon, Yutaka Matsuo, Mitsuru Ishizuka Micro View and Macro View Approaches to Discovered Rule Filtering ………………….. 13 Yasuhiko Kitamura, Akira Iida, Keunsik Park, Shoji Tatsumi Relevance Feedback Document Retrieval using Support Vector Machines ……………… 22 Takashi Onoda, Hiroshi Murata, Seiji Yamada Using Sectioning Information for Text Retrieval: a Case Study with the MEDLINE Abstracts …...………………………………… 32 Masashi Shimbo, Takahiro Yamasaki, Yuji Matsumoto Rule-Based Chase Algorithm for Partially Incomplete Information Systems ………….… 42 Agnieszka Dardzinska-Glebocka, Zbigniew W Ras Data Mining Oriented CRM Systems Based on MUSASHI: C-MUSASHI …………….… 52 Katsutoshi Yada, Yukinobu Hamuro, Naoki Katoh, Takashi Washio, Issey Fusamoto, Daisuke Fujishima, Takaya Ikeda Integrated Mining for Cancer Incidence Factors from Healthcare Data ……………….… 62 Xiaolong Zhang, Tetsuo Narita Multi-Aspect Mining for Hepatitis Data Analysis …………………………………………... 74 Muneaki Ohshima, Tomohiro Okuno, Ning Zhong, Hideto Yokoi Investigation of Rule Interestingness in Medical Data Mining ……………………………. 85 Miho Ohsaki, Yoshinori Sato, Shinya Kitaguchi, Hideto Yokoi, Takahira Yamaguchi Experimental Evaluation of Time-series Decision Tree …………………………………… 98 Yuu Yamada, Einoshin Suzuki, Hideto Yokoi, Katsuhiko Takabayashi Extracting Diagnostic Knowledge from Hepatitis Dataset by Decision Tree Graph-Based Induction ……………………………………….… 106 Warodom Geamsakul, Tetsuya Yoshida, Kouzou Ohara, Hiroshi Motoda, Takashi Washio Discovery of Temporal Relationships using Graph Structures …………………………… 118 Ryutaro Ichise, Masayuki Numao A Scenario Development on Hepatitis B and C ……………………………………………… 130 Yukio Ohsawa, Naoaki Okazaki, Naohiro Matsumura, Akio Saiura, Hajime Fujie Empirical Comparison of Clustering Methods for Long Time-Series Databases ………… 141 Shoji Hirano, Shusaku Tsumoto Classification of Pharmacological Activity of Drugs Using Support Vector Machine …….. 152 Yoshimasa Takahashi, Katsumi Nishikoori, Satoshi Fujishima Mining Chemical Compound Structure Data Using Inductive Logic Programming …… 159 Cholwich Nattee, Sukree Sinthupinyo, Masayuki Numao, Takashi Okada Development of a 3D Motif Dictionary System for Protein Structure Mining …………..… 169 Hiroaki Kato, Hiroyuki Miyata, Naohiro Uchimura, Yoshimasa Takahashi, Hidetsugu Abe Spiral Mining using Attributes from 3D Molecular Structures …………………………… 175 Takashi Okada, Masumi Yamakawa, Hirotaka Niitsuma Architecture of Spatial Data Warehouse for Traffic Management ………………………… 183 Hiroyuki Kawano, Eiji Hato Title Index …………………………………………………………...……………………...…. iii Author Index ……………..…………………………………………………………………… v A Fuzzy Set Approach to Query Syntax Analysis in Information Retrieval Systems Dariusz J. Kogut Institute of Computer Science, Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warsaw, Poland [email protected] Abstract. This article investigates whether a fuzzy set approach to the natural language syntactic analysis can support information retrieval systems. It concentrates on a web search since the internet becomes a vast resource of information. In addition, this article presents a module of syntax analysis of TORCH project where the fuzzy set disambiguation has been implemented and tested. 1 Introduction Some recent developments on information technology have concured to acceler- ate a research in the field of artificial intelligence [5]. The appeal of fantasizing about intelligent computers that understand human communication is practi- cally unavoidable. However, the natural language research seems to be one of the hardest problems of artificial intelligence due to the complexity, irregularity and diversity of human languages [8]. This article investigates whether a fuzzy set approach to natural language processing can support the search engines in web universe. An overview of syntactic analysis based on fuzzy set disambiguation will be presented and may provide some insight for further inquiry. 2 Syntax Analysis in TORCH - a Fuzzy Set Approach TORCH is an information retrieval system with a natural language interface. It has been designed and implemented by author in order to facilitate searching of physical data in CERN (European Organization for Nuclear Research, Geneva). TORCH relies on a textual database composed of web documents published in the internet and intranet networks. It became an add-on module in Infoseek Search Engine environment [6]. 2.1 TORCH Architecture The architecture of TORCH has been shown in figure 1. To begin with, a natural language query is captured by the TORCH user interface module and transmit- ted to the syntactic analysis section. After the analysis is completed, the query Fig. 1. TORCH Architecture is being conveyed to the semantic analysis module [6], then interpreted and translated into a formal Infoseek Search Engine query. Let us focus on the syntax analysis module of TORCH which is shown in figure 2. The syntax analysis unit has been based on a stepping-up parsing algorithm and equipped with a fuzzy logic engine that supports syntactic disambiguation. The syntactic dissection goes through several autonomous phases [6]. Once the natural language query is transformed into a surface structure [4], the preprocessor unit does the introductory work and query filtering. Fig. 2. Syntax Analysis Module Next, the morphological transformation unit turns each word from a non- basic form into its canonical one [7]. WordNet lexicon and CERN Public Pages Dictionary provide the basic linguistic data and therefore play an important role in both syntactic and semantic analysis. WordNet Dictionary has been built on Princeton University [7], while CERN Public Pages Dictionary extends TORCH knowledge on advanced particle physics. At the end of dissection process, a fuzzy logic engine formulates the part of speech membership functions [6] which char- acterize each word of the sentence. This innovatory approach allows to handle the most of syntax ambiguity cases which happen in English (tests proved that approx. 90% of word atoms are properly identified). 2.2 Fuzzy Set Approach The fuzzy logic provides a rich and meaningful addition to standard logic and creates the opportunity for expressing those conditions which are inherently imprecisely defined [11]. The architecture of fuzzy logic engine has been shown in figure 3. Fig. 3. Fuzzy Logic Engine The engine solves several cases where syntactic ambiguation may occur. Therefore, it constructs and updates four descrete membership functions des- ignated as ΨN , ΨV , ΨAdj and ΨAdv. At first stage, the linguistic data retrieved from WordNet lexicon are used to form the draft membership functions, as it is described below: Let us assume that : LK - refers to the amount of categories that may be assigned to the word; Nx - refers to the amount of meanings within the selected category; Note that x ∈M,whereM = {N(oun),V(erb),Adj,Adv}; NZ - refers to the amount of meanings within the all categories; w(n) - refers to the n-vector (n-word) of the sentence; ξK - is a heuristic factor that describes a category weight (e.g., ξK =0.5) The membership functions have been defined as follows: NZ [w(n)]= Nx[w(n)] x∈M 1 − ξK ξK ΨN [w(n)]= · NN [w(n)]+ NZ [w(n)] LK Similarly, 1 − ξK ξK ΨV [w(n)]= · NV [w(n)]+ NZ [w(n)] LK 1 − ξK ξK ΨAdj[w(n)]= · NAdj[w(n)]+ NZ [w(n)] LK 1 − ξK ξK ΨAdv[w(n)]= · NAdv[w(n)]+ NZ [w(n)] LK In order to illustrate the algorithm, let us take an example query: Why do the people drive on the right side of the road? Figures 4 and 5 describe the draft ΨN , ΨV , ΨADJ and ΨADV functions. 1 1 VERB NOUN Ψ Ψ 0.75 0.75 0.5 0.5 0.25 0.25 n n 0 0 0 1 2 3 4 5 6 7 8 9 10 11 12 0 1 2 3 4 5 6 7 8 9 10 11 12 Fig. 4. ΨN and ΨV membership functions 1 1 ADJ ADV Ψ Ψ 0.75 0.75 0.5 0.5 0.25 0.25 n n 0 0 0 1 2 3 4 5 6 7 8 9 10 11 12 0 1 2 3 4 5 6 7 8 9 10 11 12 Fig. 5. ΨADJ and ΨADV membership functions Certainly, each word must be considered as a specific part of speech and may belong to only one of the lingual categories in the context of a given sentence. Thus, the proper category must be assigned in a process of defuzzyfication, which may be described by the following steps and formulas (KN , KV , KAdj and KAdv represent the part of speech categories): Step 1.: w(n) ∈KN ⇔ ΨN [w(n)] ≥ max(ΨAdj[w(n)],ΨV [w(n)],ΨAdv[w(n)]) Step 2.: w(n) ∈KAdj ⇔ ΨAdj[w(n)] ≥ max(ΨV [w(n)],ΨAdv[w(n)]) ∧ (ΨAdj[w(n)] >ΨN [w(n)]) Step 3.: w(n) ∈KV ⇔ ΨV [w(n)] > max(ΨN [w(n)],ΨAdj[w(n)]) ∧ (ΨV [w(n)] ≥ ΨAdv[w(n)]) Step 4.: w(n) ∈KAdv ⇔ ΨAdv[w(n)] > max(ΨN [w(n)],ΨAdj[w(n)],ΨV [w(n)]) Unfortunately, the syntactic disambiguation based on lexicon data exclusively cannot handle all the cases. Therefore, a second stage of processing - based on any grammar principles - seems to be necessary [2] [3]. For that reason, TORCH employs its own English grammar engine [6] (shown in figure 6) with a set of simple grammar rules that can verify and approve the membership functions. Fig. 6. English Grammar Engine It utilizes a procedural parsing, so the rules are stored in a sort of static library [10] and may be exploited on demand. When the engine exploits English grammar rules upon the query and verifies the fuzzy logic membership functions, the deep structure of the question is constructed [1] [9] and the semantic analysis initialized [6]. The query syntax analysis of TORCH seems to be simple and efficient. A set of tests based on 20000-word samples of e-text books has been done, and the accuracy results (Acc-A when the English grammar engine was disable, and Acc-B with the grammar processing switched on) are shown in table 1. Natural language clearly offers advantages in convenience and flexibility, but also involves challenges in query interpretation.

Load more