Development of a Morpheme Dictionary for the Analysis of Agricultural Information

Akane Takezaki1, Mitsuyuki Saito1 and Akiko Okabe1

1 Tsukuba Office, Agriculture, Forestry and Fisheries Research Council Secretariat, Japan, [email protected]

Abstract

This paper describes the dictionary we have developed for the morpheme analysis of agricultural bibliographical information. We have found inaccuracies in the automatically indexed records of the Japan Agricultural Science Index (JASI) provided by the Tsukuba Office. Problems in morpheme dictionary-based segmentation were one reason for these inaccuracies. Some technical terms were not recognized as words by the automatic indexing system because the morpheme dictionary used for the indexing did not include a wide range of agricultural technical terms. To improve the automatic indexing system, we created a new morpheme dictionary containing terminology for all subject fields in agriculture, forestry, fisheries, food security and other related domains by using terms from the Japan Agricultural Thesaurus that the Tsukuba Office had developed. Approximately 16,000 names of registered plant varieties, including crops, vegetables, fruits, flowers, and trees, were added to this dictionary. It has approximately 55,000 Japanese and English technical terms that are nouns, along with information on their orthographic variants (e.g., kanji variants, kanji-hiragana relationships and abbreviations). To evaluate the quality of this dictionary, we extracted morphemes from documents (both abstracts and titles) of records stored in JASI in 2000, 2003, 2004, 2005 and 2006 by using the well-known morpheme analyzer “ChaSen,” its dictionary of general terms, “IPADIC,” and the newly developed morpheme dictionary. We examined the unknown words extracted automatically by ChaSen and identified some of them as new terms. Each time we analyzed the morphemes of a set of about 1,000 sentences in JASI, we added the new terms to the dictionary and modified word segmentations. The number of unknown words extracted by the morpheme analysis decreased as we added the new terms. We plan to introduce this dictionary into the JASI system and will examine automatically indexed JASI records.

Keywords: Morpheme dictionary, Automatic indexing, JASI, ChaSen

Introduction

JASI, the Japan Agricultural Science Index database, has been provided by the Tsukuba Office since 1985 (Tsukuba Office, 1985). JASI includes bibliographic information on well over 270,000 journal articles from 500 journals published in Japan from 1970 to the present. In an information retrieval system, index words are extracted from documents to represent their content (Tokunaga et al., 2006). In JASI, records stored from 1970 to 2002 were indexed manually from their documents (both titles and abstracts) by trained personnel, whereas records stored from 2003 to the present have been indexed automatically. However,

IAALD AFITA WCCA2008 WORLD CONFERENCE ON AGRICULTURAL INFORMATION AND IT

we have found inaccuracies in the automatically indexed words of JASI entries, which can degrade retrieval effectiveness. Generally, a simple or compound word is used as an index word. Word boundaries are hard to determine in Japanese documents because words are not separated by spaces. Consequently, in the JASI search system the text is first analyzed into morphemes, the minimal meaningful elements, before index terms are extracted. The morphemes are determined with reference to a dictionary. Most index words in JASI are agricultural technical terms. However, some of them cannot be recognized as words by morpheme analysis if the dictionary used for the analysis does not include a wide range of technical terms (Horyu et al., 2004). Conversely, adding technical terms to a standard dictionary has been reported to improve word segmentation by morpheme analysis. A lack of agricultural technical terms in the morpheme dictionary may therefore be one reason for the inaccuracies in the automatic indexing of JASI. This paper describes a new morpheme dictionary containing approximately 55,000 agricultural technical terms that we developed to improve the morpheme analysis of agricultural bibliographical information.
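The effect of a missing technical term on segmentation can be illustrated with a minimal sketch. It uses greedy longest-match lookup, which is a deliberate simplification and not the method used by ChaSen or the JASI system; the function name `segment` and the two small dictionaries are assumptions made only for this illustration.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation (a simplification, not ChaSen's method).

    Returns a list of (morpheme, known) pairs, where known is False for
    characters that match no dictionary entry.
    """
    max_len = max(map(len, dictionary), default=1)
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in dictionary:
                result.append((piece, True))
                i += size
                break
        else:
            # No entry matches here: emit one character as an unknown word.
            result.append((text[i], False))
            i += 1
    return result


# Hypothetical dictionaries: general terms only, then with one technical term added.
general = {"は", "程度", "が", "強い"}
technical = general | {"摘蕾"}

text = "摘蕾は程度が強い"
print(segment(text, general))    # "摘" and "蕾" come out as separate unknown words
print(segment(text, technical))  # "摘蕾" is recognized as a single known term
```

With only general terms, the technical term is broken into single unknown characters; once it is in the dictionary, it is kept whole and can serve as an index word.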

JASI search system

JASI includes bibliographic information such as article title, author name, author affiliation, journal name, volume, issue number, year of publication, index words, and, where the author has given permission, the abstract (Fig. 1). A full-text search system retrieves the relevant bibliographic records when a user submits a query against a selected search item, such as keyword, article title, abstract, author name, journal name, or index word (Fig. 2). The default search item is the keyword, which is drawn from both the article title and the index words. Abstracts are reproduced with permission for the bibliographic records of only about 70% of the journals. In the JASI search system, the index words are extracted from both the article title and the abstract to represent their content. This makes it possible to use the content of an abstract for retrieval even when the abstract itself cannot be reproduced. Before the index words are extracted, the morphemes are determined with reference to a dictionary containing both general and agricultural technical terms. However, we found inaccuracies in the automatically indexed words, such as ill-formed or inappropriate words, and assumed that they could be due to a lack of agricultural technical terms in the dictionary, which included only about 3,000 technical terms.

Lexical information stored in the new morpheme dictionary

In 2001, we cooperated with the FAO (the Food and Agriculture Organization of the United Nations) in the AGROVOC project. AGROVOC was created by the FAO and the Commission of the European Communities as a multilingual structured thesaurus containing terminology for all subject fields in agriculture, forestry, fisheries, food security and other related domains (Takezaki et al., 2008). In 2002, we translated approximately 27,300 English AGROVOC terms into Japanese and submitted our work to the FAO. In addition to the Japanese terms submitted to AGROVOC, we have developed the Japan Agricultural Thesaurus (JAT), because the Japanese version of the AGROVOC thesaurus in its current state cannot be used for retrieving Japanese agricultural information: it lacks many Japan-specific agricultural terms. The JAT contains approximately 48,000 terms written in both Japanese and English. By using terms from the JAT, we have developed a new morpheme dictionary to improve the automatic indexing system. Approximately 16,000 names of registered plant varieties, including crops, vegetables, fruits, flowers, and trees, were added to this dictionary (Ministry of Agriculture, Forestry and Fisheries, 2008). As a result, the morpheme dictionary contains approximately 55,000 Japanese and English technical terms that are nouns. This dictionary includes information on the orthographic variants of the terms, such as kanji variants, “芥子 (karashi)” and “辛子 (karashi)”; kanji-hiragana relationships, “葉脈壊死 (youmyaku-eshi)” and “葉脈え死 (youmyaku-eshi)”; and abbreviations, “捕食線虫 (hoshoku-senchu)” and “補食的線虫 (hoshokuteki-senchu).” This information helps to prevent essential information from being lost because of spelling variants.
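How variant information can prevent such loss is sketched below, assuming that each variant spelling is mapped to one preferred form before terms are compared. The table layout, the choice of which spelling is preferred, and the names `VARIANTS`, `normalize` and `same_term` are assumptions for illustration; the paper does not describe the dictionary's internal format.

```python
# Hypothetical variant table: variant spelling -> preferred form.
# The pairs are the examples from the text; which member of each pair
# is "preferred" is an arbitrary choice made for this sketch.
VARIANTS = {
    "辛子": "芥子",          # kanji variant
    "葉脈え死": "葉脈壊死",  # kanji-hiragana relationship
}


def normalize(term):
    """Return the preferred form of a term, or the term itself if it has no entry."""
    return VARIANTS.get(term, term)


def same_term(a, b):
    """True when two spellings refer to the same dictionary term."""
    return normalize(a) == normalize(b)


print(same_term("辛子", "芥子"))
print(same_term("葉脈え死", "葉脈壊死"))
```

A query written with one spelling can then retrieve records indexed under the other, because both are reduced to the same form before comparison.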

Fig. 1 Example of a bibliographic record in JASI

(Flowchart elements: bibliographic records of JASI, comprising article title, author name, abstract, index word, and year of publication; user selection of search item and submission of query; extraction of words; comparison by N-gram analysis; display of search results.)

Fig. 2 Flowchart of JASI search process
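The comparison by N-gram analysis named in Fig. 2 can be sketched as follows. The paper does not give the value of N or the matching rule, so character bigrams and a simple containment test are assumptions here, as are the function names and the sample titles.

```python
def char_ngrams(text, n=2):
    """Return the set of overlapping character n-grams in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def matches(query, field, n=2):
    """True when every n-gram of the query also occurs in the field.

    This is an approximation of substring search: for long queries it can
    accept a field that contains all n-grams but not in the query's order.
    """
    grams = char_ngrams(query, n)
    return bool(grams) and grams <= char_ngrams(field, n)


# Hypothetical records; only the title field is searched in this sketch.
records = [
    {"title": "摘蕾が収量に及ぼす影響"},
    {"title": "ドブガイの幼生の形態"},
]

hits = [r["title"] for r in records if matches("収量", r["title"])]
print(hits)
```

Because the comparison works on character sequences, it needs no word boundaries, which is why it is convenient for unsegmented Japanese text. A query shorter than N characters produces no n-grams and therefore matches nothing in this sketch.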

Evaluation of the quality of the new morpheme dictionary

To evaluate the quality of the new dictionary, we extracted morphemes from documents (both abstracts and titles) of records stored in JASI in 2000, 2003, 2004, 2005, and 2006 by using the well-known morpheme analyzer “ChaSen” (Matsumoto et al., 2000), its dictionary of general terms, “IPADIC,” and the newly developed dictionary. We examined the unknown words extracted automatically by ChaSen (Fig. 4) and identified some of them as new terms. Initially, we analyzed 5,449 records from 2006 and found 1,831 unknown words (Table 1). These words were technical terms, place names, variety names, organization names, and prefixes that are often used in scientific articles, such as “含 (gan)” and “耐 (tai).” Most of the unknown words were added to the morpheme dictionary as new terms, either without modification or in combination with other morphemes. The records from 2000 (5,000 records), 2003 (5,725 records), 2004 (5,452 records), and 2005 (5,130 records) were then analyzed in that order. Each time we analyzed the morphemes of a set of about 1,000 records, we added new terms to the dictionary and modified word segmentations. Adding new terms after each set of approximately 1,000 records decreased the unknown word rate, defined as (number of unknown words) / (number of morphemes) × 100. The rate for 2000 was similar to that for 2006 but higher than that for 2003 (Table 1), and the rate for 2003 was slightly higher than those for 2004 and 2005. These results showed that adding new terms to the morpheme dictionary improved the accuracy of the morpheme analyses. Nevertheless, 776 unknown words were still detected in the 2005 documents, which suggested that the dictionary still lacked adequate agricultural technical terms.
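As a check on the arithmetic, the unknown word rate can be recomputed from the yearly totals in Table 1. Note one assumption: the table reports the average and standard error over sets of about 1,000 records, whereas this sketch uses the pooled yearly counts, so the two need not agree exactly. The names `TABLE_1` and `unknown_word_rate` are introduced only for this illustration.

```python
# Yearly totals from Table 1, in the order analyzed: year -> (morphemes A, unknown words B).
TABLE_1 = {
    2006: (742_691, 1_831),
    2000: (247_425, 621),
    2003: (780_421, 1_106),
    2004: (717_602, 951),
    2005: (666_084, 776),
}


def unknown_word_rate(morphemes, unknown):
    """Unknown word rate in percent: B / A x 100."""
    return unknown / morphemes * 100


for year, (a, b) in TABLE_1.items():
    print(year, f"{unknown_word_rate(a, b):.2f}")
```

Rounded to two decimals, the pooled rates are 0.25, 0.25, 0.14, 0.13 and 0.12, matching the averages reported in Table 1.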

(Flowchart elements: bibliographic records of JASI, comprising article title and abstract; morpheme dictionary, comprising general terms and agricultural technical terms (about 3,000 terms); extraction of words by morpheme analyses; comparison; word weighting; extraction of index words.)

Fig. 3 Flowchart of JASI automatic indexing

Example 1: 摘/蕾/は/程度/が/強い/と/収量/の/低下/を/招く/が/、/草勢/の/維持/及び/大/果/生産/に/有効/で/あっ/た/
Example 2: ドブガイ/に/見/られる/遺伝/的/2/型/の/グロキディウム幼生/の/形態/

Fig. 4 Examples of morpheme analysis results. Characters between slashes show morphemes extracted by analysis. Red characters show unknown words. “摘蕾 (tekirai)” in Example 1 and “ドブガイ (dobugai)” in Example 2 were identified as new terms.
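One way to turn such output into candidate terms is sketched below: runs of adjacent unknown morphemes are joined into a single candidate for manual review. The unknown-word flags are hypothetical, because the red highlighting of the original figure is not available here, and the function name `candidate_terms` is an assumption; in the paper the new terms were identified by examination, not by this rule.

```python
from itertools import groupby


def candidate_terms(tokens):
    """Join each run of consecutive unknown morphemes into one candidate term.

    tokens is a list of (surface, is_unknown) pairs in text order.
    """
    return [
        "".join(surface for surface, _ in run)
        for is_unknown, run in groupby(tokens, key=lambda t: t[1])
        if is_unknown
    ]


# Opening morphemes of the two examples; the True/False flags are assumed.
example_1 = [("摘", True), ("蕾", True), ("は", False), ("程度", False)]
example_2 = [("ドブガイ", True), ("に", False), ("見", False), ("られる", False)]

print(candidate_terms(example_1))
print(candidate_terms(example_2))
```

Under these assumed flags the sketch proposes “摘蕾” and “ドブガイ,” the two terms that Fig. 4 reports as newly identified.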

Table 1 Subject records for evaluation of the quality of the new dictionary

Year   Number of   Number of       Number of           Unknown word rate
       records     morphemes (A)   unknown words (B)   (B/A × 100)
2006   5,449       742,691         1,831               0.25±0.007*
2000   5,000       247,425           621               0.25±0.048
2003   5,725       780,421         1,106               0.14±0.003
2004   5,452       717,602           951               0.13±0.003
2005   5,130       666,084           776               0.12±0.009

*Average±standard error

Discussion

We introduced the developed morpheme dictionary into the JASI search system and began verifying the morpheme analyses in the system. Our analyses suggested that the dictionary still lacked sufficient agricultural technical terms. To enrich it, new words identified by analyzing the records of the remaining years (1970 to 1999, 2001, and 2002), together with variety names and place names, should be added. In this analysis, prefixes extracted as unknown words by ChaSen, such as “含 (gan)” and “耐 (tai),” were added to the morpheme dictionary in combination with other morphemes, as in “含核果率 (gan-kakukaritsu)” and “耐摩耗性 (tai-mamousei).” However, it is impossible to store every term that contains such prefixes in the dictionary. Improving the analysis of prefixes remains an issue to be solved.

References

Horyu D., T. Fukutatsu, A. Otuka, T. Kiura, M. Hirafuji, and S. Ninomiya (2004) Development of Japanese morphological analysis server for agricultural documents and verification through automated text categorization. 13:127-138.
Matsumoto Y. (2000) Japanese Morphological Analysis System ChaSen. JPSJ Magazine 41: 1208-1214.
Ministry of Agriculture, Forestry and Fisheries (2008) The Plant variety protection. Available at http://www.hinsyu.maff.go.jp/maff/hinshu.nsf/enumberlist-set (verified March 15, 2008). Ministry of Agriculture, Forestry and Fisheries.
Takezaki A., M. Saito and A. Okabe (2008) Development of Japanese agricultural for efficient searching. J. Soc. Agr. Info., 17: 42-49.
Tokunaga T. (2006) Information retrieval and natural language processing, University of Tokyo Press. Tokyo.
Tsukuba Office (1985) JASI Search system. Available at http://www.affrc.go.jp/db_search/jasi (verified May 23, 2008). Tsukuba Office. Tsukuba.
