Collection of Usage Information for Language Resources from Academic Articles

Collection of Usage Information for Language Resources from Academic Articles Shunsuke Kozaway, Hitomi Tohyamayy, Kiyotaka Uchimotoyyy and Shigeki Matsubaray yGraduate School of Information Science, Nagoya University yyInformation Technology Center, Nagoya University Furo-cho, Chikusa-ku, Nagoya, 464-8601, Japan yyyNational Institute of Information and Communications Technology 4-2-1 Nukui-Kitamachi, Koganei, Tokyo, 184-8795, Japan fkozawa,[email protected], [email protected], [email protected] Abstract Recently, language resources (LRs) are becoming indispensable for linguistic researches. However, existing LRs are often not fully utilized because their variety of usage is not well known, indicating that their intrinsic value is not recognized very well either. Regarding this issue, lists of usage information might improve LR searches and lead to their efficient use. In this research, therefore, we collect a list of usage information for each LR from academic articles to promote the efficient utilization of LRs. This paper proposes to construct a text corpus annotated with usage information (UI corpus). In particular, we automatically extract sentences containing LR names from academic articles. Then, the extracted sentences are annotated with usage information by two annotators in a cascaded manner. We show that the UI corpus contributes to efficient LR searches by combining the UI corpus with a metadata database of LRs and comparing the number of LRs retrieved with and without the UI corpus. 1. Introduction Thesaurus1. In recent years, such language resources (LRs) as corpora • He also employed Roget’s Thesaurus in 100 words of and dictionaries are being widely used for research in the window to implement WSD. fields of linguistics, natural language processing, and spo- ken language processing, reflecting the recognition that ob- Therefore, we could more easily find LRs suitable for jectively analyzing linguistic behavior based on actual ex- our own purposes by collecting lists of usage information amples is important. Therefore, since the importance of for LRs from academic articles and integrating them with LRs is widely recognized, they have been constructed as metadata contained in SHACHI. Although the method for a research infrastructure and are becoming indispensable automatically extracting the lists was proposed (Kozawa et for linguistic research. However, existing LRs are not fully al., 2008), the variation of the extracted usage information utilized. Even though metadata search services for LR was limited since their extraction rules were based on the archives (Hughes and Kamat, 2005) and web services for analysis of small lists of usage information. This issue LRs (Dalli et al., 2004; Biemann et al., 2004; Quasthoff et would be addressed by collecting large lists of usage infor- al., 2006; Ishida et al., 2008) have become available, it has mation and then analyzing them to expand the extraction not been enough for users to efficiently find and use LRs rules. Therefore, in this paper, we propose to construct a suitable for their own purposes so far. text corpus annotated with usage information (UI corpus). If there exists a system which could give us a list of LRs This paper is organized as follows. In sections 2, we intro- that can be answers to the questions such as “Which LRs duce the design of UI corpus. Then, we construct the UI can be used for developing a syntactic parser?” and “Which corpus by extracting sentences containing LR names from LRs can be used for developing a Chinese-English machine academic articles and annotating them with usage informa- translation system?,” it would help users efficiently find ap- tion in section 3. In sections 4 and 5, we provide statistics propriate LRs. of the UI corpus and analytical results of usage information contained in the UI corpus. We show that the UI corpus The information satisfying these demands is sometimes de- contributes to efficient LR searches by combining the UI scribed as usages of individual LRs on their official home corpus with a metadata database of LRs and comparing the pages. The metadata database of LRs named SHACHI (To- number of LRs retrieved with and without the UI corpus in hyama et al., 2008) is managing and providing it in an in- section 6. Finally, in section 7, we describe the summary of tegrated fashion by collecting and listing it as “usage infor- this paper and the future work. mation” for LRs. SHACHI contains metadata on approximately 2,400 LRs. However, the number of lists of usage 2. Design of the UI Corpus information registered in SHACHI is only about 900 LRs since the usage information is not usually described on the 2.1. Data Collection official home while it is often described in academic ar- It is unrealistic to collect all sentences and annotate them ticles. For instance, the following sentence found in the because only small number of sentences in an article in- proceedings of ACL2006 shows that Roget’s Thesaurus is clude usage information for LRs. In this issue, Kozawa et useful for word sense disambiguation, although usage information is not announced on the web page of Roget’s 1http://poets.notredame.ac.jp/Roget/about.html Figure 1: Flow of the UI corpus construction. al. reported that most of the instances of usage information (E) usage information for LRs are found in the sentences containing LR names Each word sequence matched with usage information (Kozawa et al., 2008). Therefore, in this research, we col- for a certain LR is annotated with UI tags. Word se- lect sentences having LR names from academic articles to quences are annotated with usage information by re- build the UI corpus. ferring only to a given sentence without adjacent sentences in order to reduce the labor costs of annotators. 2.2. Annotation Policy In our research, we assume that usage information A The collected sentences are annotated with the following for LR X can be paraphrased as “X is used for A.” information: (A), (B) and (C) are provided for each sen- The followings are examples of usage information for tence. (D) and (E) are provided for word sequences. (A) WordNet. through (D) are automatically provided when sentences have been collected from academic articles. • We use hLRiWordNeth/LRi for hUIilexical lookuph/UIi. (A) sentence ID • hLRiWordNeth/LRi hUIispecifies relationships among the meanings of wordsh/UIi. (B) the title of the proceeding • It uses the content of hLRiWordNeth/LRi to (C) article ID hUIimeasure the similarity or relatedness between the senses of a target word and its sur- (D) LR name rounding wordsh/UIi . Word sequences matched with LR names are annotated with LR tags as the following example: Note that since usage information indicates specific events, such vague expressions as “our proposed • For comparison, hLRiPenn Treebankh/LRi con- method” and “this purpose” are not our target. We tains over 2400 (much shorter) WSJ articles. also ignore expressions that can be represented by “X is used for X” and those that represent updating, ex- Note that the LR tags with which word sequences are pansion, or modification of LR X, as shown in the fol- automatically annotated need to be examined, since lowing example: homographs of the LR names (ex. the names of • We applied an automatic mapping from projects and associations) are sometimes erroneously hLRiWordNet 1.6h/LRi to hLRiWordNet annotated as LR names. For example, ‘Penn Tree- 1.7.1h/LRi synset labels. bank’ is often used as an LR name and it is sometimes used as a project name in the different context. There- 3. Construction of the UI Corpus fore, it is difficult to discriminate proper LR names from others. Then, if inappropriate word sequences This section describes a method for constructing the UI cor- are annotated with LR tags, they are manually elimi- pus. Figure 1 shows the flow of the corpus construction. nated. First, sentences containing LR names are automatically extracted from academic articles. Then, the extracted sen- If word sequences matched with two or more LRs have tences are annotated by two annotators in a cascaded man- a coordinate structure as the following example, “Chi- ner. nese” and “English Propbanks” are annotated with LR tags. This is because we distinguish between two or 3.1. Automatic Extraction of Sentences Containing more LR names (e.g. Chinese and English PropBanks) LR Names and an LR name which have a coordinate structure First, we converted academic articles to plain texts using (e.g. Cobuild Concordance and Collocations Sam- the Xpdf2. We used 2,971 articles which are contained in pler). the proceedings of ACL from 2000 to 2006, LREC2004 and LREC2006 because LRs were often used in the field • The functional tags for hLRiChineseh/LRi and of computational linguistics. Next, we extracted sentences hLRiEnglish PropBanksh/LRi are to a large ex- tent similar. 2http://www.foolabs.com/xpdf/ Figure 2: Flow of corpus annotation. Figure 3: Web-based GUI for supporting annotation. containing the LR names from the articles. As for the LR Item Number names, approximately 2,400 LRs registered in SHACHI articles 1533 (Tohyama et al., 2008) were used. Consequently, 10,959 sentences 8135 sentences were extracted from 1,848 articles and word se- LR tags 10504 quences matched with the LR names are automatically an- sentences containing usage information 1110 notated with the LR tags. UI tags 1183 3.2. Annotation of Word Sequences with Usage Table 1: Size of the UI corpus. Information Two annotators were involved in the annotation. One of the • annotators (Annotator 1) had an experience in collecting Annotating word sequences representing the usage in- metadata in SHACHI, while the major of the others (Anno- formation for LRs with the UI tags tator 2) was computational linguistics.

Collection of Usage Information for Language Resources from Academic Articles

Collection of Usage Information for Language Resources from Academic Articles

Conference Abstracts

The Translation Equivalents Database (Treq) As a Lexicographer’S Aid

Better Web Corpora for Corpus Linguistics and NLP

Unit 3: Available Corpora and Software

The Bulgarian National Corpus: Theory and Practice in Corpus Design

Edited Proceedings 1July2016

New Kazakh Parallel Text Corpora with On-Line Access

Language, Communities & Mobility

Corpus Linguistics: Corpus Annotation

A GLOSSARY of CORPUS LINGUISTICS 809 01 Pages I-Iv Prelims 5/4/06 12:13 Page Ii

Proceedings of the Workshop on Challenges in the Management of Large Corpora and Big Data and Natural Language Processing (CMLC