Journal of Information Processing Vol.24 No.3 554–564 (May 2016)
[DOI: 10.2197/ipsjjip.24.554] Regular Paper
NaDev: An Annotated Corpus to Support Information Extraction from Research Papers on Nanocrystal Devices
Thaer M. Dieb1,a) Masaharu Yoshioka1 Shinjiro Hara2
Received: May 22, 2015, Accepted: January 12, 2016
Abstract: The process of nanocrystal device development is not well systematized. To support this process, anal- ysis of the information produced by developmental experiments is required. In this study, we constructed an an- notated corpus to support the extraction of experimental information from relevant publications. We designed the corpus-construction guidelines by cooperating with a domain expert. We evaluated these guidelines through corpus- construction experiments with graduate students from this domain, and then evaluated the corpus with the domain expert. In the corpus construction experiments, we achieved a sufficient level of Inter-Annotator Agreement by using a loose agreement measure that ignored the term-boundary mismatch problem, and made an agreement corpus that excluded annotations based on misunderstanding the guidelines. The domain expert evaluated this agreement corpus and modified the guidelines based on real examples. Using these guidelines, we finalized the corpus called “NaDev” (Nanocrystal Device development corpus). The NaDev corpus and its construction guidelines will be released via our website, http://nanoinfo.ist.hokudai.ac.jp/. The NaDev corpus aims to support automatic information extraction from publications relevant to nanocrystal device development. This information can be used to solve problems in the nan- otechnology domain using the massive availability of fresh information. To the best of our knowledge, this is the first corpus constructed for the development of nanocrystal devices.
Keywords: nanoinformatics, annotation, corpus construction, information extraction, nanocrystal device develop- ment
Nanodevice Design and Manufacturing” to support the nanocrys- 1. Introduction tal device development process [12]. This is a joint research Nanoinformatics is a newly developing interdisciplinary re- project between the Research Center for Integrated Quantum search domain that aims to use information technology to sup- Electronics (RCIQE) and the Division of Computer Science at port research in the nanoscience field [1], [2]. Nanoinformatics is Hokkaido University. As part of this project, we want to exploit the science and practice of determining which information is rel- information about the development of nanocrystal devices that evant to the nanoscale science and engineering community and is reported in research publications. This information would be then developing and implementing effective mechanisms for col- used to facilitate a more effective development process through lecting, validating, storing, sharing, analyzing, modeling, and ap- various applications, including but not limited to experimental plying this information [3]. Alternatively, nanoinformatics has result analysis. been defined as an emerging area of information technology at In this study, we have developed a method for constructing the intersection of bioinformatics, computational chemistry, and an annotated corpus of publications relevant to nanocrystal de- nanobiotechnology [4]. Nanoinformatics could play the same vice development to support automatic information extraction. role in nanotechnology and nanomedicine as bioinformatics and The tag set and the corpus-construction guidelines were designed medical informatics have played in biology and medicine [5]. in collaboration with a domain expert. We evaluated the reli- Nanocrystal device development is an area of nanoscale re- ability of these guidelines through corpus construction experi- search where nanoelectronic devices are developed for future ments with graduate students from this domain. We evaluated nanoelectronic industrial applications using electronic materi- the constructed corpus using Inter-Annotator Agreement (IAA) als such as semiconducting, insulating, and magnetic materi- and confirmed that the guidelines achieved a satisfactory IAA als [6], [7], [8], [9], [10]. However, the process of nanocrystal level. We also constructed an agreement corpus that excluded device development is not well systematized, requiring both en- incorrect annotations based on misunderstanding the guidelines. gineering knowledge and craftsmanship [11]. We have been con- The domain expert evaluated this agreement corpus and modi- ducting a project called the “Knowledge Exploratory Project for fied the guidelines by checking them with real annotation exam- ples. Based on these modified guidelines, we finalized the corpus 1 Graduate School of Information Science and Technology, Hokkaido Uni- called “NaDev” (Nanocrystal Device development corpus) and its versity, Sapporo, Hokkaido 060–0814, Japan ffi 2 Research Center for Integrated Quantum Electronics, Hokkaido Univer- construction guidelines for an o cial release. sity, Sapporo, Hokkaido 060–8628, Japan Several attempts have been conducted to extract information a) [email protected]