RESEARCH ARTICLE

Assessing the Impact of Case Sensitivity and Term Information Gain on Biomedical Concept Recognition

Tudor Groza1¤*, Karin Verspoor2,3

1 School of Information Technology and Electrical Engineering, The University of Queensland, St Lucia, Australia, 2 Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia, 3 Health and Biomedical Informatics Centre, The University of Melbourne, Melbourne, Australia

¤ Current address: Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, Darlinghurst, Australia
* [email protected]

OPEN ACCESS

Citation: Groza T, Verspoor K (2015) Assessing the Impact of Case Sensitivity and Term Information Gain on Biomedical Concept Recognition. PLoS ONE 10(3): e0119091. doi:10.1371/journal.pone.0119091
Academic Editor: Indra Neil Sarkar, University of Vermont, UNITED STATES
Received: August 22, 2014
Accepted: January 9, 2015
Published: March 19, 2015

Abstract

Concept recognition (CR) is a foundational task in the biomedical domain. It supports the important process of transforming unstructured resources into structured knowledge. To date, several CR approaches have been proposed, most of which focus on a particular set of biomedical ontologies. Their underlying mechanisms vary from shallow natural language processing and dictionary lookup to specialized machine learning modules. However, no prior approach considers the impact of the case sensitivity characteristics and the term distribution of the underlying ontology on the CR process. This article proposes a framework that models the CR process as an information retrieval task in which both case sensitivity and the information gain associated with tokens in lexical representations (e.g., term labels, synonyms) are central components of a strategy for generating term variants. The case sensitivity of a given ontology is assessed based on the distribution of so-called case sensitive tokens in

Copyright: © 2015 Groza, Verspoor.