Indicators for the Data Usage Index (DUI): An

Ingwersen and Chavan BMC Bioinformatics 2011, 12(Suppl 15):S3 http://www.biomedcentral.com/1471-2105/12/S15/S3 RESEARCH Open Access Indicators for the Data Usage Index (DUI): an incentive for publishing primary biodiversity data through global information infrastructure Peter Ingwersen1,2†, Vishwas Chavan3*† Abstract Background: A professional recognition mechanism is required to encourage expedited publishing of an adequate volume of ‘fit-for-use’ biodiversity data. As a component of such a recognition mechanism, we propose the development of the Data Usage Index (DUI) to demonstrate to data publishers that their efforts of creating biodiversity datasets have impact by being accessed and used by a wide spectrum of user communities. Discussion: We propose and give examples of a range of 14 absolute and normalized biodiversity dataset usage indicators for the development of a DUI based on search events and dataset download instances. The DUI is proposed to include relative as well as species profile weighted comparative indicators. Conclusions: We believe that in addition to the recognition to the data publisher and all players involved in the data life cycle, a DUI will also provide much needed yet novel insight into how users use primary biodiversity data. A DUI consisting of a range of usage indicators obtained from the GBIF network and other relevant access points is within reach. The usage of biodiversity datasets leads to the development of a family of indicators in line with well known citation-based measurements of recognition. Background down into single units, whereby each reference turns into Access to biodiversity data is essential for understanding a citation that can be aggregated in many different ways, the state of the art of biotic diversity and for taking forming a wide range of citation impact indicators [3,4]. informed decisions about sustainable use of biotic Typically, original academic articles, their references and resources and their conservation. Among several impe- the citations they receive are indexed in citation indexes, diments to publishing and discovery of primary biodi- such as the Thomson-Reuters’ Web of Science database versity data is a lack of professional recognition [1]. An [5]. These means of academic recognition and impact in important incentive for scientists to publish research science constitute the central indicators applied in the articles, monographs or conference papers is the explicit well established field of ‘scientometrics’. Similarly, we recognition that their work receives by means of cita- believe that institutionalization of a Data Usage Index tions from fellow scholars. Owing to the common but (DUI) [1] demonstrating impact of data publishing is fea- tacit conventions established in academic communities, sible, even for a dynamic and complex network, such as scientists recognize the use of previous research, criti- the Global Biodiversity Information Facility (GBIF). So cally as well as in a positive sense, by adding references far, however, no metrics exist for data usage, especially to such work into their publication text and list of refer- biodiversity data usage, that recognize all players involved ences. References can be seen as a kind of normative in the life cycle of those data from collection to publica- payment [2]. The reference lists can easily be broken tion. A set of DUI indicators is lacking [1]. We propose the indicators for the development of a DUI based on search events and dataset download instances - thus not * Correspondence: [email protected] † Contributed equally based on traditional scholarly references and citations 3Global Biodiversity Information Facility Secretariat, Universitetsparken 15, DK because no data citation mechanism now exists. As a 2100, Copenhagen, Denmark spin-off, the DUI is also intended to provide novel Full list of author information is available at the end of the article © 2011 Ingwersen and Chavan; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Ingwersen and Chavan BMC Bioinformatics 2011, 12(Suppl 15):S3 Page 2 of 10 http://www.biomedcentral.com/1471-2105/12/S15/S3 insights into how scholars make use of primary biodiver- based on requests, viewing and downloading of research sity data in a variety of ways. Similar to scientometric publications in the form of metadata, abstracts or full analyses applying rank distributions, time series, impact text via the astronomy digital library client logs [6]. The indicators and similar calculations based on academic citation analysis avenue clearly refers to the authors’ publications, the usage of primary biodiversity datasets intellectual property in the astronomy papers. The usage leads to the development of a family of indicators and metrics avenue may also include players responsible for other significant metrics. the technical infrastructure presenting such properties. By applying instances of viewing, searching and down- The usage impact could be shared and the distribution loading biodiversity dataset records, three characteristics of credit would be the responsibility of the host institu- are observable that differ from the use of publication tion: a digital library or a data publisher. references. First, in contrast to having fairly complete Thus, the proposed DUI for primary biodiversity data- information on the nature of the original work (and its sets is initially intended to apply this second avenue of journal) citing a work, one has only limited knowledge action, based on usage indicators extracted from the of the internet protocol (IP) address that viewed, usage logs of the GBIF data portal [7] and later on other searched or downloaded dataset records, such as its access point log data. This avenue constitutes phase one location, that is, its geographic and institutional affilia- of three, implementing a universal DUI, as outlined in tion.Wedonotknowwhoactuallyviewedordown- [1]. The proposed DUI is thus expected to make the loaded the dataset records. Second, we only know the dataset usage visible, providing deserved recognition for data publisher’s name and location. We do not know their creators, managers and publishers, and to encou- who in reality designed, collected and prepared the con- ragethebiodiversitydatapublishersandusersto: tents of the dataset and its records. The proposed DUI increase the volume of high quality data mobilization indicators are thus directly attributable to the academic and publishing; further use of primary biodiversity data institution rather than to the scholars behind it. Third, in scientific, conservation, and sustainable resources use the basic unit in the proposed DUI is a biodiversity purposes; and improve formal citation behavior regard- dataset record. Thus, we regard the dataset record as ing datasets in research. analogous to a journal article and the datasets as analo- When developing DUI indicators one needs to take gous to a journal. Biodiversity datasets are produced by into account the fundamental characteristics of datasets data publishers. The latter may produce several datasets. and their usage patterns. As academic publications As in similar scientometric analyses normalization is indexed in traditional citation databases, such as the done by means of the basic analysis unit: here this is the Web of Science [5], PubMed [8] or SCOPUS [9], entire dataset record. datasets rarely become deleted from the databases cur- rently contributing to the GBIF data portal [7] and Why a Data Usage Index? other similar archives and publishing infrastructures. In line with the publication and citation behavior men- Their original records are rarely edited or erased; data- tioned above, and as stated by Chavan and Ingwersen sets can, however, be updated and grow in number of (p. 5 of [1]), “[the] DUI is intended to demonstrate to records over time or be modified or restructured. This data publishers that their biodiversity efforts creating characteristic of datasets is associated with the potential primary biodiversity datasets do have impact by being for change also observed in many web-based documents. accessed, searched and viewed or downloaded by fellow By applying the usage logs of the GBIF data portal [7], scientists”. All players and their host institutions the DUI indicators are confined to that context. The involved in the data life cycle from collection of data up usage as measured by searches or downloads of dataset to its publication require incentives to continue their records is detectable only within the coverage of the efforts and recognition of their contribution. In a scien- host system logs, as in a library. However, in contrast to tific digital library and open access environment, such as a closed library log system a substantial amount of log that developed for bibliographic information in astron- data are publicly accessible from the GBIF data portal omy [6], usage is measured in a two-dimensional way. usage logs and are consequently open to indicator calcu- The straightforward way is to apply common sciento- lations. The properties of the GBIF-mobilized data usage metric indicators with respect to citation patterns and indicators bridges between known scientometric indica- impact. However, this track is not yet feasible in the tors on impact and existing socio-cognitive

Indicators for the Data Usage Index (DUI): An

High-Throughput Analysis Suggests Differences in Journal False

Tracking the Popularity and Outcomes of All Biorxiv Preprints

Event Extraction on Pubmed Scale Filip Ginter1*, Jari Björne1,2, Sampo Pyysalo3 from Workshop on Advances in Bio Text Mining Ghent, Belgium

A Comparison of Feature Selection and Classification Methods in DNA

Use and Mis-Use of Supplementary Material in Science Publications Mihai Pop1* and Steven L

Viewed: 8 Were Accepted As Full Papers, 4 As Poster Large Investment of Effort Needed to Develop Sufficiently Papers and One As a Demo Paper

Mining Locus Tags in Pubmed Central to Improve Microbial Gene Annotation Chris J Stubben* and Jean F Challacombe

Analyzing the Field of Bioinformatics with the Multi-Faceted Topic Modeling Technique Go Eun Heo1, Keun Young Kang1, Min Song1* and Jeong-Hoon Lee2

Semantic Web Applications and Tools for the Life Sciences: SWAT4LS 2010

Paperbot: Open-Source Web-Based Search and Metadata Organization of Scientific Literature Patricia Maraver1,2, Rubén Armañanzas1,2, Todd A

BMC Bioinformatics Biomed Central

Knowledge Discovery of Drug Data on the Example of Adverse Reaction Prediction Pinar Yildirim1*, Ljiljana Majnarić2, Ozgur Ilyas Ekmekci1, Andreas Holzinger3