Against the Grain

Volume 28 | Issue 1 Article 11

2016

How and Why Data Repositories are Changing Academia

Phill Jones
Digital Science

Mark Hahnel
Figshare


Recommended Citation: Jones, Phill and Hahnel, Mark (2016) "How and Why Data Repositories are Changing Academia," Against the Grain: Vol. 28: Iss. 1, Article 11. DOI: https://doi.org/10.7771/2380-176X.7269

This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries.

How and Why Data Repositories are Changing Academia
by Phill Jones (Head of Publisher Outreach, Digital Science) and Mark Hahnel (Founder, Figshare)

Academic and [...] is unquestionably in the process of undergoing a revolution. It seems, however, that the nature of that revolution is still a somewhat open question. Libraries in particular are undergoing not so much a shift in focus as a diversification of roles. Where the library once consisted primarily of a physical building containing curated collections of books, journals and other resources, it is now a diverse set of services ranging from research assessment to technology support to the new frontier of data curation and dissemination.

Why Should Librarians Care about [...]?

The role of the library as manager of collections of information for the use of patrons is still alive and well. Increasingly, however, libraries have been concerned with recording and curating the output of their institutions. This expansion of role has on some level been driven by a shift in the way that scholars are communicating their work and accounting for its value. Arguably, this trend began around 15 years ago with the rise of open access publishing, which itself was made possible by the shift to more scalable electronic journals. Many libraries at the time took an interest in the new publishing model by either setting up central funds for the payment of article processing charges or supporting and educating scholars in how and why to publish open access. Later, institutional repositories provided avenues for green open access, and library publishing operations began to develop during the first decade of the 2000s, culminating in the creation of the Library Publishing Coalition in 2012. Many library publishing operations, in contrast with traditional university presses, aim to support niche areas of scholarship of interest to their own faculty. However, early suggestions that institutional open access paper repositories might replace the role of traditional publishers have proven to be a bridge too far. One can postulate many reasons for this, but publisher brands and the need to publish in high impact factor journals seem the most likely. This is not the case for the emerging requirements of data dissemination. There are as yet no impact factors or prestige publication outputs. This means that libraries may have another opportunity to play a key role in communicating the academic content that comes out of their institutions.

As the movement has grown in momentum over the past decade and a half, scholars have sought new outlets for new types of scientific output. The blogosphere has been used to "publish" work almost in real time, resulting in some noteworthy cases. For instance, Rosie Redfield of the University of British Columbia documented on her blog her attempts to replicate NASA's claims of discovering arsenic-based life, ahead of publishing the results in AAAS Science, which debunked the claim. However, this sort of blogging/publishing generally acts as a more rapid medium for hypothesis-driven scientific narratives, similar in concept to traditional articles, rather than as a way to make data sets available.

For many people interested in data publishing, what's required is a new infrastructure for communicating data and other research outputs that is separate from hypothesis-driven narratives and judged on its own terms. The features of this infrastructure are not entirely clear, but we do know that it must be able to cope with large quantities of data. Some data will be in well-codified and well-documented formats, but much of it won't be. Data needs to be discoverable and at least somewhat interpretable, so that it is available for re-use and re-analysis when needed. Finally, there's a need to protect a researcher's ability to fully analyse their own data first through embargoes, and also to protect commercially or medically sensitive information.

Taking all this together, data publishing seems to be a fairly complicated issue, but one that the library is well-placed to tackle.

Why Researchers Care

There are a number of potential advantages to scholars of sharing their data. Probably the most compelling reason is the apparent citation advantage.1 Other reasons include requirements from funders, journals and institutions, as well as a personal desire to make science more open.

Many researchers believe that open data is necessary to make scholarship more effective. The academic system does work, but it can be an inefficient machine. The majority of inefficiencies lie in the inability of academics to build directly on the research that has gone before them, to better stand on the shoulders of giants. Increased transparency can also improve academia's ability to self-correct through openness to scrutiny and challenge.

Making data sharable and open has the added benefit of encouraging standards and codification, a vital step to making data machine readable. The power of computers means that data can be interrogated and cross-referenced in order to automatically look for correlations between research outputs. Of course, today's artificial intelligence won't enable computers to generate and confirm hypotheses the way a person can, hence the need for academics with subject-specific knowledge to build research programs based on machine-suggested relationships. Immediately, this provides many more promising avenues to explore across all fields of research, in a practice that pharmaceutical companies have been exploiting with computational chemistry for decades.

Barriers to Sharing

The reasons why many researchers choose not to share their data, or share it only upon request through closed systems like email, are less well explored than the benefits mentioned above. Last year, a survey of Wiley authors, which was reported on in the Scholarly Kitchen by Alice Meadows, found that just under half of researchers choose not to share data.2 Wiley produced a survey infographic, linked from the Scholarly Kitchen article, that contains a long list of reasons why some researchers are reluctant to share. Broadly, there seem to be three overarching themes. The first issue is a fear that sharing data would have negative consequences, either because another researcher appropriates the data and scoops the original experimenter, or because the work gets picked apart and unfairly discredited. The appropriate use of embargoes should mitigate many of those concerns. The second issue is a lack of researcher understanding of how to share data. Answers like "My funder/institution does not require data sharing," or "I don't think it was my responsibility" aren't evidence of a positive decision not to share, but rather that some researchers are still not yet seriously considering it. It's easy to see how librarians and information professionals can help with that one. Finally, many of the responses speak to a lack of time and resources. This last issue is perhaps the toughest to tackle, so let's look at it in more depth.

Researchers are often juggling many disparate and seemingly unconnected responsibilities, from research to managing their labs and getting grants, to teaching, to university administrative tasks and committees. With such a diverse workload, it can be challenging to incorporate new workflows. For this reason, simplicity and intuitive workflows are increasingly important. You only have to look at the rising pressure that publishers are under to simplify their submission systems and eliminate author burden, or at the success of simplified search like Google, to see that researchers often value simplicity and intuitiveness over comprehensive functionality. Against that background, it's not surprising that many researchers are choosing to share data using supplementary materials services offered by publishers, despite the fact that in many cases those systems were not designed with data sharing in mind.3 If data sharing is to become the norm, it will be important to create systems that are not only robust and scalable, but also very simple and time effective to use.

Data as a First Class Research Object

The idea that datasets should be treated as an equal output to academic articles is a controversial one, but one that funders and advisory committees are beginning to support. Most notably, the Royal Society's "Science as an Open Enterprise: Open Data for Open Science" report4 in 2012 suggested that: "Assessment of university research should reward the development of open data on the same scale as journal articles and other publications." This has led to many funders requiring that all data from the research they fund be made openly available.5 An obvious corollary is that the rewards for open data would need to be comparable with those for traditional articles.

Before we address whether data should have such a status, there's a more fundamental but less obvious question to answer. Just what exactly are data? There are several definitions, but the general theme across disciplines is that data are the digital products of academic research. This can range from digitized field notes in biology to videos of dramatic performances to niche file formats in computational chemistry. The ubiquity of digital scholarship means that any platform for disseminating research should work across the full range of disciplines, with filters applied so that content can be grouped arbitrarily. That is to say, we need persistent file storage which is discoverable and interpretable by machines and humans alike.

A long-standing problem in academia is that technology has traditionally limited us to one research output type with limited forms of assessment, namely peer review and citation metrics like impact factor. We are now at a point where all products of research can be released (unless prevented by ethical or commercial reasons). The number of evaluation metrics has exploded to include altmetrics as supplements to citations, as well as open post-publication peer review. However, when we look at data, that is, any digital output of research, we have to ask whether we can apply the same criteria to a video as we do to spreadsheet data, and how those criteria should differ from the existing criteria for paper publications. Most likely, we will need to define both review and assessment criteria for each type of output. These may be difficult to define and challenging to scale.

There have been suggestions that peer review is only really of use for data when it is to be reused. There have been examples of serious problems being discovered when researchers have tried to reanalyse data. One instance is the case of LaCour, whose fraudulent data was exposed in 2015.6 However, by the time the fraud came to light, the research had been published in Science and covered by the mainstream media, so the critical review arguably happened too late.

One interesting development in this space has been the idea of machine readable badges (http://openresearchbadges.org/). These are essentially automated or manual markup of content to better describe and accredit research outputs.

Scholarly Publishers and Data

Over the past decade, some traditional publishers have worked with repositories to link raw digitised objects that underlie research to the hypothesis-driven narrative of the article. The goal is to standardize the approach to linking research data to publications, irrespective of the repository that hosts the data.

Early successful repositories, such as the Protein Data Bank (http://www.rcsb.org/pdb/) and GenBank (http://www.ncbi.nlm.nih.gov/genbank/), archive molecules and genetic sequences to help reproduce research in the life sciences. Later, generic repositories came to the forefront through projects like Dryad (http://datadryad.org/), which helped motivate ecologists to make all of their one-moment-in-time series data available.

When funders started requiring that data be made available at the point of article publication, academic publishers took steps to help researchers comply with these requirements. Partnerships with repositories such as Figshare (www.figshare.com) allow journals to preview the digital files embedded within the HTML version of the article. The long-term preservation of the data is contractually maintained and each object is individually citable. Later, some publishers developed data journals, like the Geoscience Data Journal (http://onlinelibrary.wiley.com/journal/10.1002/(ISSN)2049-6060) published by the Royal Meteorological Society, which allow researchers to publish short descriptive articles that aren't hypothesis driven, linked to data archived in approved repositories.

In 2014, Nature Publishing Group launched Scientific Data, which applies traditional peer review to data descriptor articles: "Acceptance for publication is based on the technical rigour of the procedures used to generate the data, the reuse value of the data, and the completeness of the data description."

There are movements to codify standards for data sharing outside of publishers, particularly in the sciences. A good example of this is the Open Microscopy Environment project (OME, http://www.openmicroscopy.org/). OME develops both standards in microscopy and open source imaging software. Organizations like Research Data Alliance, CODATA, the Data FAIRport initiative and FORCE11 are working towards standards for data storage, markup and dissemination. The work being carried out by DataCite and ORCID is of particular interest.7 This will enable research repositories to automatically update a researcher's ORCID profile. This collaboration extends to CrossRef, so that all academics should be able to sync their publications as well as their data with no extra effort.

Subject Specific and Structured Repositories

Certain disciplines lend themselves more easily to data sharing, such as astronomy and the -omics disciplines. Structured repositories require data to comply with format standards, thereby encouraging their adoption. They play a key role in data science as community or funder-driven focal points for collaborative and industrial scale efforts to assemble super-datasets like Zooniverse's Galaxy Zoo (http://data.galaxyzoo.org/) and the NIH's GenBank. There are a number of libraries and other groups that maintain lists of these types of databases. Perhaps the most notable are the Registry of Research Data Repositories (www.re3data.org), which was started in 2012 and is funded by the German Research Foundation (DFG), and Biosharing (www.biosharing.org), which is hosted by Oxford University. Encouraging patron participation in these repositories where appropriate is just one way that librarians can assist the open data movement.

Institutional Data Repositories

Institutional data repositories have been historically designed with a view to managing and curating the output of institutions. In that sense, they are intertwined with both research assessment and library publishing efforts; at some institutions, library publishing and data repository services are provided using the same platform.8 As data dissemination becomes increasingly important, it makes sense to look at some of the work that pioneering library publishing efforts have done in populating and popularizing their repositories.

In her 2011 article Institutional Repositories: Keys to Success,9 Joan Giesecke, then Dean of Libraries at the University of Nebraska-Lincoln, outlined how they successfully transformed their repository from what she calls a collection-centric viewpoint, which assumes faculty participation and focuses on curation, to one of service provision, which focuses on making the repository an attractive place to put content. Giesecke notes the danger that institutional repositories can become overly restrictive, focusing too much on the desire to create an orderly collection, thereby unintentionally creating barriers to participation. By adopting the service-driven approach of a university press, with a focus on discoverability, dissemination, search engine optimization and improved user experience, University of Nebraska-Lincoln were able to grow their traffic from zero to 300,000 uses per month in under five years.

Unstructured or General Repositories

With the growth in popularity of data sharing among academics and the increase in funder mandates, it's clear that all researchers are going to need data sharing solutions. Subject specific and institutional repositories form an overlapping and occasionally incomplete patchwork of coverage for authors looking to place content, particularly data that doesn't fit into the predefined data formats that structured repositories support.

There has been very little research into the volume of data produced by academics. The true scale and nature of research data is unknown, as much of it sits on institutional and departmental servers or on the hard drives of computers under researchers' desks. Anecdotally, researchers generally have large personal collections of data in a diverse range of formats.

As part of Figshare's partnership with Nature Publishing Group and their journal Scientific Data, we've been able to analyze user behaviour and preferences. Scientific Data asks researchers to place data in structured data repositories, institutional repositories or both when suitable ones exist. Tellingly, over 30% of data submissions were made to Figshare, making it the most used repository. This suggests that many researchers require an unstructured repository for their data. The extent to which this will change over time as codification and structuring efforts proceed is open to debate. It is our opinion that there will always be a strong need for unstructured repositories, because it is the nature of research that many experiments and techniques are novel and unique.

Where Does this Leave Us?

It has taken longer than expected for the promise of the digital age to begin to make a real difference to the way scholars communicate their work. The persistence of traditional measures of quality is the most likely explanation for academia's apparent conservatism, but with funding bodies increasingly encouraging and mandating the sharing of data, we are finally seeing diversification of what is considered legitimate scholarship.

The publishing industry has made strides over the last decade or so to integrate with institutional, funder and community based repositories. Together with groups interested in the standardization of data formats, a lot of progress has been made to codify formats in many fields. There remains, however, a large quantity of data on researchers' hard drives and servers that doesn't fit into easily standardized formats because the techniques are either new or unique.

There are still many open questions in data publishing, from how to deal with embargoes or sensitive data to how best to assess the quality of the diverse range of digital research outputs. The field of data publishing is still in its formative stages and represents an opportunity for both publishers and libraries to help academics adapt to new requirements.

Endnotes

1. Piwowar H. A., Day R. S., Fridsma D. B. (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308
2. Meadows A. (2014) Scholarly Kitchen. http://scholarlykitchen.sspnet.org/2014/11/11/to-share-or-not-to-share-that-is-the-research-data-question/
3. Schaffer T. and Jackson K. M. (2004) The use of online supplementary material in high-impact scientific journals. Science & Technology Libraries 25(1/2): 73-85.
4. The Royal Society Science Policy Centre report: Science as an open enterprise (2012) https://royalsociety.org/~/media/Royal_Society_Content/policy/projects/sape/2012-06-20-SAOE.pdf
5. Valen D. and Blanchat K. (2015) Figshare Blog. https://figshare.com/articles/Overview_of_OSTP_Responses/1367165
6. Stemwedel J. D. (2015) Forbes. http://www.forbes.com/sites/janetstemwedel/2015/06/01/reasons-to-keep-discussing-the-lacour-and-green-retraction/
7. Haak L. (2015) ORCID blog. https://orcid.org/blog/2015/10/26/auto-update-has-arrived-orcid-records-move-next-level
8. Jones P. B., Wesolek A., Scherer D., Watkinson A. (2015) A Game of Spot the Difference: Librarians, Repository Managers and Publishers. Presentation slides. http://works.bepress.com/andrew_wesolek/30/
9. Giesecke J. (2011) "Institutional Repositories: Keys to Success." Faculty Publications, UNL Libraries. Paper 255. http://digitalcommons.unl.edu/libraryscience/255

Everything Evolves, Even Publishing by Jason Hoyt, PhD. (CEO and Co-founder, PeerJ) and Peter Binfield, PhD. (Publisher and co-founder, PeerJ)

We sometimes hear that for all the promise of the Internet, it is a shame that it has yet to impact scholarly communication in the same way it has other industries. One could argue this point quite effectively: prestige still dominates; the journal name matters just as much as it always has; the same legacy publishers still control most of the literature; Open Access is just a small fraction of all articles, etc., etc. Meanwhile, in other industries it is easy to spot how the old guards have changed and new names have sprung up: Google, Wikipedia, Amazon, Uber and Facebook to name just a few.

On the other hand, does anyone believe Open Access is going away? Will data not become more widely available? Will tools to make publishing faster never be developed? Why have "megajournals" appeared in the past ten years and not just survived, but become the future revenue model for new and old publishers? Why are scholarly societies struggling after decades/centuries of thriving? Why are governments and funders making Open Access mandates? These events contradict the notion that the Internet hasn't changed things in an "unmovable" 300 year-old industry. Indeed, the evidence actually suggests that we are in the midst of a change so expansive that we don't quite know how to adapt to it.

We take comfort in the way things worked in the past, as they had slowly developed in manageable timetables over the 20th century. There was certainty in how to communicate science, who to trust, or what to do for academic career progression. We now live in an era with an alluring future, but one that raises new concerns: How will we fund scholarly output? How much should we make open, and how? Is publishing Open Access a bet on the future, or will it negatively affect my students or my career?

What the last ten years or so have done is to open our minds to questions that many of us never anticipated having to find solutions for. It could be argued that just as the Internet has made us more globally aware, so academia has grown more concerned with its impacts outside of the ivory tower. The decentralization that occurred with the World Wide Web makes it clear how we affect those around us, and this has influenced our professional lives in a similar way. It's not that scientists are only just now waking up to the fact that they can be open; they just didn't realize it was possible until recently. Our policies and infrastructures are as unprepared for these changes as we are to leave the comfort of the past.

There Would be no Open or Mega-Journals without the Internet

Just as the printed journal was a foregone conclusion of the printing press, so too were Open Access and the megajournal a natural by-product of the Internet. Perhaps someone
continued on page 26

Against the Grain / February 2016 25