
Open Metadata of Scholarly Publications

Open Science Monitor Case Study

Ludo Waltman EN July 2019

Open Metadata of Scholarly Publications

European Commission Directorate-General for Research and Innovation Directorate G — Research and Innovation Outreach Unit G.4 — E-mail [email protected] [email protected] European Commission B-1049 Brussels

Manuscript completed in July 2019. This document has been prepared for the European Commission; however, it reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein.

More information on the European Union is available on the internet (http://europa.eu).

Luxembourg: Publications Office of the European Union, 2019

EN PDF ISBN 978-92-76-12011-7 doi: 10.2777/132318 KI-01-19-807-EN-N © European Union, 2019. Reuse is authorised provided the source is acknowledged. The reuse policy of European Commission documents is regulated by Decision 2011/833/EU (OJ L 330, 14.12.2011, p. 39).

For any use or reproduction of photos or other material that is not under the EU copyright, permission must be sought directly from the copyright holders.

EUROPEAN COMMISSION

Open Metadata of Scholarly Publications Open Science Monitor Case Study

2019 Directorate-General for Research and Innovation EN

Table of Contents

ACKNOWLEDGEMENTS
1 Introduction
2 Drivers
3 Barriers
4 Impact
5 Lessons learnt
6 Policy conclusions

ACKNOWLEDGEMENTS

Disclaimer: The information and views set out in this study report are those of the author(s) and do not necessarily reflect the official opinion of the Commission. The Commission does not guarantee the accuracy of the data included in this case study. Neither the Commission nor any person acting on the Commission’s behalf may be held responsible for the use which may be made of the information contained therein.

The case study is part of the Open Science Monitor led by the Lisbon Council together with CWTS, ESADE and Elsevier.

Authors

Ludo Waltman – Centre for Science and Technology Studies (CWTS)

4 STUDY ON OPEN SCIENCE: MONITORING TRENDS AND DRIVERS (Reference: PP-05622-2017)

1 Introduction

The Open Science Monitor partly relies on proprietary data sources, in particular Scopus. Scopus is a data source that provides metadata of scholarly publications. It has been created by Elsevier, which contributes to the Open Science Monitor as a subcontractor. The use of Scopus data in the Open Science Monitor has been the subject of debate and was criticized in a complaint to the European Commission. Among other things, the signatories of this complaint raised the following question: “Given the EU’s emphasis on Open Science, including [...], why is there (apparently) no requirement to insist that the Open Science Monitor must be based upon open data, open standards, and open source tools (with appropriate licenses for re-use accessibility) as a matter of principle?”1 The response of the Open Science Monitor consortium has been that it is not possible to create the Monitor based exclusively on open data sources. Given the currently available data sources, the only way to create the Open Science Monitor is to make use of proprietary data sources such as Scopus or Web of Science. The same response has also been given by the European Commission: “Overall, the Commission wishes to have an as comprehensive Monitor as possible. … as long as there is in the European Union no fully open and transparent data-infrastructure, we are dependent on a fragmented data infrastructure and data sources from private operators. This implies that the Monitor has to be constructed under non-optimal conditions.”

The debate about the Open Science Monitor illustrates the importance of developments toward open metadata of scholarly publications (e.g., open metadata of articles in scholarly journals and in conference proceedings). For many publications, metadata such as titles, abstracts, author lists, and reference lists is available in proprietary data sources such as Scopus, produced by Elsevier, and Web of Science, produced by Clarivate Analytics. The use of metadata provided by these proprietary data sources usually involves considerable cost and is subject to significant restrictions. Open data sources make metadata of publications available under minimal restrictions.

Open metadata has several benefits. Open availability of metadata enables more researchers to carry out bibliometric studies, which will help to get a better understanding of the science system. There will also be more possibilities for testing the reproducibility of bibliometric studies. In addition, open metadata can be used in applied bibliometric analyses that aim to support research evaluation and research management. These analyses can be made more transparent, which will contribute to more responsible use of bibliometric analyses. There will also be more freedom in designing applied bibliometric analyses. For instance, these analyses do not need to rely on decisions made by a central authority (e.g., the producer of Scopus or Web of Science) on which publications can and cannot be included in an analysis. Finally, open metadata may make scientific literature easier to find. New search engines for scientific literature can be developed based on open metadata.

Open metadata is closely related to open access publishing. An increasing proportion of all scholarly publications are openly accessible. If a publication is openly accessible, its metadata is openly accessible as well, although not necessarily in a machine-readable format or in association with similar metadata from other publications. Conversely, if a publication is not openly accessible, the metadata of the publication may or may not be openly accessible, depending on the policies of the publisher.

This report first provides an overview of the drivers of and barriers to open metadata of scholarly publications. It then demonstrates the impact of open metadata. Finally, lessons learnt and policy conclusions are discussed. The focus of this report is on metadata of scholarly publications. Metadata of other types of scholarly outputs (e.g., data sets and software) is also of considerable importance, but falls outside the scope of this report.

1 https://doi.org/10.5281/zenodo.2554199


2 Drivers

A prominent driver of open metadata of scholarly publications is the United States National Library of Medicine (NLM) at the National Institutes of Health. The NLM maintains PubMed, an open data source of metadata of a large share of all scholarly publications in the biomedical domain. PubMed was launched more than two decades ago, in 1996. It is widely used by biomedical researchers. A limitation of PubMed is that it does not include the reference lists of publications. Citation links between publications are therefore not available in PubMed. Also, for many publications, PubMed does not provide complete data on author affiliations.

In recent years, there have been a number of significant developments toward open metadata of scholarly publications. First of all, scholarly publishers have increasingly made metadata of their publications openly available in Crossref, a registration agency for Digital Object Identifiers (DOIs). Publishers that are members of Crossref are obliged to “deposit timely and accurate metadata for (their) content”.2 When a publisher registers a DOI for a publication, Crossref obtains basic metadata for this publication, such as the title, the names of the authors, and the name of the journal in which the publication has appeared. Crossref then makes this metadata openly available. The data “is not subject to copyright and available to use for whatever purpose you may have”.3 In many cases, publishers also deposit the references of publications in Crossref. However, references are made openly available by Crossref only if the publisher grants permission for this.
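The way such a record might be read can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any particular tool: it assumes the record follows the JSON layout of the Crossref REST API (fields such as `title`, `author`, `container-title`, `reference-count`, and `reference`), and it assumes that the `reference` list is present only when the publisher has allowed the references to be open. The helper names `works_url` and `summarize_record` and the sample record are hypothetical.

```python
from urllib.parse import quote

# Base URL of the Crossref REST API endpoint for single works.
CROSSREF_WORKS_URL = "https://api.crossref.org/works/"


def works_url(doi):
    """Return the REST API URL for one DOI (no request is sent here)."""
    return CROSSREF_WORKS_URL + quote(doi, safe="/")


def summarize_record(message):
    """Reduce a Crossref-style record to basic metadata plus reference openness."""
    titles = message.get("title") or []
    journals = message.get("container-title") or []
    authors = [
        " ".join(part for part in (a.get("given"), a.get("family")) if part)
        for a in message.get("author") or []
    ]
    # Assumption: the list of references is only included when it is open.
    open_references = message.get("reference") or []
    return {
        "title": titles[0] if titles else None,
        "authors": authors,
        "journal": journals[0] if journals else None,
        "references_deposited": message.get("reference-count", 0),
        "references_open": len(open_references),
    }


# Hypothetical record, shaped like the "message" object of an API response.
sample = {
    "title": ["An example article"],
    "author": [{"given": "Ada", "family": "Example"}, {"family": "Example"}],
    "container-title": ["Journal of Examples"],
    "reference-count": 12,
}

print(works_url("10.5555/12345678"))
print(summarize_record(sample))
```

In this hypothetical record, twelve references have been deposited but none is openly available, which is the situation the last two sentences of the paragraph above describe.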

To persuade publishers to make the references of publications openly available in Crossref, the Initiative for Open Citations (I4OC) was established in April 2017.4 I4OC is an advocacy group that started as a collaboration of six organizations: OpenCitations, Wikimedia Foundation, PLOS, eLife, DataCite, and the Centre for Culture and Technology at Curtin University. The initiative is supported by a large number of other organizations. I4OC has had a major effect on the openness of citation data. Before the launch of I4OC, the references were open for only 1% of the publications with references deposited in Crossref. Two years after the launch of I4OC, this share has increased to 55%, resulting in about half a billion references being openly available in Crossref. Most large publishers, including for instance Springer, support I4OC and make the references of their publications openly available.5 One publisher, BMJ, even went to the effort of extracting over two million references from the PDF files of old articles in order to make these references openly available in Crossref.6

I4OC has also received support from the scientometric community. Over 350 scientometricians have signed a letter supporting I4OC.7 Additional support is provided by Plan S, a plan developed by a number of research funders to promote open access publishing. Plan S strongly recommends that journals and repositories make references openly available according to the standards of I4OC.8

OpenCitations represents another important development toward open metadata of scholarly publications.9 OpenCitations is a scholarly infrastructure organization that uses Semantic Web technologies to make metadata of publications and citation links between publications openly available under a CC0 public domain dedication. The two most important data sources provided by OpenCitations are the OpenCitations Corpus (OCC) and the OpenCitations Index of Crossref open DOI-to-DOI citations (COCI). OCC provides open metadata of publications and in particular of their references. OCC currently is still of limited size. It includes mainly metadata of publications in the Open Access subset of PubMed Central, although various other sources of metadata are considered as well.10 In contrast, COCI provides open citation links obtained from Crossref. It currently includes almost 450 million citation links. David Shotton, the founder of OpenCitations, is one of the earliest advocates for open citations. In a commentary published in Nature in 2013, Shotton wrote: “it is a scandal that reference lists from journal articles — core elements of scholarly communication that permit the attribution of credit and integrate our independent research endeavours — are not readily and freely available for use by all scholars”.11

2 https://www.crossref.org/member-obligations/
3 https://www.crossref.org/services/metadata-delivery/rest-api/
4 https://i4oc.org
5 https://i4oc.org/#publishers
6 https://www.scholarcy.com/unlocking-100-years-of-scientific-papers-how-scholarcy-partnered-with-bmj-to-further-i4oc/
7 http://issi-society.org/open-citations-letter/
8 https://www.coalition-s.org/principles-and-implementation/
9 http://opencitations.net

WikiCite, an initiative of Wikimedia, is another driver of open metadata of scholarly publications.12 It provides an open data source of citation links to which anyone can contribute.

Another potential driver of open metadata is Microsoft Academic Graph.13 This data source provides metadata of publications and citation links between publications. The completeness and accuracy of the data are still a subject of study. The data is made available under an ODC-BY license.

A final driver of open metadata of scholarly publications is Metadata 2020, which describes itself as “a collaboration that advocates richer, connected, and reusable, open metadata for all research outputs”.14 Almost 100 individuals are involved in Metadata 2020, organized in six community groups: researchers, service provider/platforms and tools, funders, publishers, librarians, and data publishers and repositories.15 According to one of the draft metadata principles formulated by a Metadata 2020 project group, “metadata must be as open, interoperable, parsable, machine actionable, human readable as possible” and also “as complete and comprehensive as possible”.16

3 Barriers

Crossref appears to provide the most promising way to realize large-scale availability of open metadata of scholarly publications. From this point of view, three barriers to improving the availability of open metadata can be identified.

First, many publications do not have a DOI from Crossref. Some of these publications do not have a DOI at all. Others do not have a DOI from Crossref but from another DOI registration agency.17 In either case, Crossref cannot be used to make the metadata of a publication openly available.

When publications do have a DOI from Crossref, a second barrier is that many publishers neglect to deposit full metadata in Crossref. For instance, abstracts, author affiliations, and references are often not deposited.18 In a recent analysis, it was found that abstracts are available only for 4% of the publications from the period 2008–2017 in Crossref, while affiliations are available only for 16% of the publications.19 A comparison of Crossref with Web of Science and Scopus showed that tens of millions of references are missing in Crossref because publishers failed to deposit these references.20
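Completeness figures of this kind are simple shares over a set of metadata records. The sketch below shows one way such shares could be computed; it is not the method used in the cited analyses. It assumes records in the JSON layout of the Crossref REST API, where an abstract appears under `abstract`, affiliations appear per author under `affiliation`, and references appear under `reference`. The function name `completeness` and the four sample records are hypothetical.

```python
def has_abstract(record):
    """True if the record carries a non-empty abstract."""
    return bool(record.get("abstract"))


def has_affiliation(record):
    """True if at least one author has a non-empty affiliation list."""
    return any(a.get("affiliation") for a in record.get("author") or [])


def has_references(record):
    """True if the record carries a non-empty reference list."""
    return bool(record.get("reference"))


def completeness(records):
    """Return the percentage of records that contain each metadata element."""
    records = list(records)
    checks = {
        "abstract": has_abstract,
        "affiliation": has_affiliation,
        "reference": has_references,
    }
    if not records:
        return {name: 0.0 for name in checks}
    total = len(records)
    return {
        name: 100.0 * sum(1 for r in records if check(r)) / total
        for name, check in checks.items()
    }


# Four hypothetical records with varying levels of completeness.
records = [
    {
        "abstract": "Example abstract.",
        "author": [{"family": "A", "affiliation": [{"name": "Example University"}]}],
        "reference": [{"DOI": "10.5555/1"}],
    },
    {
        "author": [{"family": "B", "affiliation": [{"name": "Example Institute"}]}],
        "reference": [{"DOI": "10.5555/2"}],
    },
    {"author": [{"family": "C", "affiliation": []}], "reference": [{"DOI": "10.5555/3"}]},
    {"author": [{"family": "D"}]},
]

print(completeness(records))
```

Counting a publication as having affiliations when any single author has one is a design choice made for this sketch; a stricter definition would require affiliations for all authors and would yield lower shares.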

The third barrier is that some publishers deposit references in Crossref but do not allow Crossref to make these references openly available. Prior to the launch of I4OC, almost all publishers followed Crossref’s default policy of not making references openly available. After the launch of I4OC, many publishers decided to make references openly available in Crossref. In addition, Crossref decided to change its default policy, resulting in references by default being made openly available. However, for 45% of the publications with references deposited in Crossref, the references are still not open.

10 https://opencitations.wordpress.com/2018/03/23/early-adopters-of-the-opencitations-data-model/
11 https://doi.org/10.1038/502295a
12 https://meta.wikimedia.org/wiki/WikiCite
13 https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/
14 http://www.metadata2020.org
15 http://www.metadata2020.org/about/
16 http://www.metadata2020.org/blog/2019-05-13-principles-call-for-comment/
17 http://www.doi.org/registration_agencies.html
18 https://www.tedhabermann.com/blog/2019/3/25/the-big-picture-how-has-crossref-metadata-completeness-improved
19 https://deffopera.dk/wp-content/uploads/2019/04/Ludo-Waltman_presentation_28032009-1.
20 https://www.cwts.nl/blog?article=n-r2s234

The largest publishers that do not make references openly available in Crossref are the American Chemical Society, Elsevier, and IEEE.21 This appears to relate to the interests these publishers have in the commercial exploitation of the metadata of their publications. For instance, Elsevier owns Scopus, a proprietary data source in which it makes metadata of its own publications and those of other publishers available.22 Elsevier’s decision not to make references openly available in Crossref recently led to the resignation of the entire editorial board of an Elsevier journal and to the launch of a new competing journal.23 In the case of the American Chemical Society, a petition asking the society to make references openly available in Crossref has been signed by almost 350 individuals.24 Recently, the American Chemical Society and IEEE were strongly criticized for not making references openly available: “it is deeply regrettable and almost incomprehensible that any professional organization, or university press, whose primary mission is to serve the interests of the practitioners, scholars and readers it represents, should choose not open all its publications’ reference lists as a public good”.25

Another obstacle to the availability of open metadata of scholarly publications is the challenge of guaranteeing the long-term sustainability of open scholarly infrastructures. PubMed is funded by the National Institutes of Health in the United States, which seems to guarantee its long-term sustainability. Crossref receives revenues from publishers, who pay membership fees and DOI registration fees. These revenues seem to be sufficient to ensure a healthy financial position and to guarantee the long-term sustainability of Crossref.26 However, for other open metadata infrastructures, such as OpenCitations, the long-term sustainability is uncertain. Obtaining stable long-term funding for these infrastructures is a major challenge.

4 Impact

The impact of open metadata of scholarly publications can best be illustrated using examples of platforms and tools that benefit from open metadata. Most of these platforms and tools are freely accessible. In some cases, a subscription is required. Platforms and tools benefiting from open metadata basically serve two use cases. On the one hand, they make available bibliometric indicators and other types of analytics that can be used to support research evaluation and research management. On the other hand, they provide search engines for scientific literature. Below, a number of examples are discussed of platforms and tools that benefit from open metadata.

Dimensions, launched in 2018, is a platform that provides metadata of publications and citation links between publications. The full version of Dimensions, which also offers metadata of grants, patents, clinical trials, and policy documents, requires a subscription. A more limited version is freely accessible. Dimensions relies heavily on citation data that, as a result of the efforts of I4OC, is made openly available in Crossref. According to the founders of Dimensions, “I4OC has played a critical role in making citation data more openly available over the last 18 months to the extent that building Dimensions would have been significantly more challenging, time-consuming, and the data would contain many more errors without their efforts”.27

21 https://opencitations.wordpress.com/2019/02/07/crowdsourcing-open-citations-with-croci/
22 https://medium.com/@ryregier/the-longer-elsevier-refuses-to-make-their-citations-open-the-clearer-it-becomes-that-their-high-78576a48e64e
23 http://issi-society.org/blog/posts/2019/january/the-international-society-for-scientometrics-and-informetrics-ends-support-for-journal-of-informetrics-launches-new-open-access-journal-quantitative-science-studies/ (Full disclosure: The author of the current report served as Editor-in-Chief of the Elsevier journal and now holds the same position at the newly established journal.)
24 https://www.change.org/p/asking-the-american-chemical-society-to-join-the-initiative-for-open-citations
25 https://opencitations.wordpress.com/2019/02/07/crowdsourcing-open-citations-with-croci/
26 https://www.crossref.org/pdfs/annual-report-2016-17.pdf

Like Dimensions, the Lens is a platform that provides metadata of publications and citation links between publications. The Lens is freely accessible. It combines metadata obtained from a number of open data sources, including Crossref, Microsoft Academic Graph, and PubMed.28

Europe PMC is a freely accessible platform that provides access to metadata and in many cases also to the full text of biomedical publications. It obtains its data primarily from PubMed and PubMed Central.29 Citation data is obtained from Crossref.30

There are several freely available tools that make use of open metadata of publications. An example is Publish or Perish, a popular tool for calculating citation statistics that can be used with data from Crossref and Microsoft Academic Graph.31 Another example is the plug-in that has been developed for the reference management software Zotero to show citation data obtained from COCI.32

A number of tools are available for visualizing citation networks based on open citation data. VOSviewer can be used to visualize citation networks based on data from Crossref, OCC, COCI, and WikiCite.33 CiteSpace supports Crossref data for visualizing citation networks.34 Citation Gecko visualizes citation networks based on data from Crossref, Microsoft Academic, and OCC.35 While VOSviewer and CiteSpace are focused primarily on bibliometric analysis, Citation Gecko is intended to be used to support literature search. Other similar tools are still in a more experimental stage of development.36
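The data structure behind such visualizations can be illustrated with a short Python sketch. It is not how VOSviewer, CiteSpace, or Citation Gecko are implemented; it only assumes that open citation data arrives as DOI-to-DOI links, each a pair of a citing and a cited DOI, as in COCI. The function names and the sample DOIs are hypothetical.

```python
from collections import defaultdict


def build_citation_network(links):
    """Index (citing DOI, cited DOI) pairs in both directions.

    Sets are used so that a link deposited twice is counted once.
    """
    cites = defaultdict(set)      # citing DOI -> DOIs it cites
    cited_by = defaultdict(set)   # cited DOI -> DOIs citing it
    for citing, cited in links:
        cites[citing].add(cited)
        cited_by[cited].add(citing)
    return cites, cited_by


def citation_counts(cited_by):
    """Number of distinct citing publications per cited DOI."""
    return {doi: len(citers) for doi, citers in cited_by.items()}


def most_cited(cited_by, n):
    """The n most cited DOIs, ties broken alphabetically by DOI."""
    counts = citation_counts(cited_by)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


# Hypothetical DOI-to-DOI links; the last one duplicates the first.
links = [
    ("10.5555/a", "10.5555/c"),
    ("10.5555/b", "10.5555/c"),
    ("10.5555/a", "10.5555/b"),
    ("10.5555/a", "10.5555/c"),
]

cites, cited_by = build_citation_network(links)
print(citation_counts(cited_by))
print(most_cited(cited_by, 1))
```

Keeping both directions of each link makes it equally cheap to list the references of a publication and the publications citing it, which is what a tool for literature search and a tool for bibliometric analysis each need.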

Open metadata of publications and open citation data are increasingly being used in bibliometric studies. One study, for instance, examined whether open data sources could replace proprietary ones in the qualification process for university professors in the field of computer science in Italy.37 It was found that the amount of citation data that is currently openly available is still insufficient to replace proprietary data sources. In another study, open citation data was used to try to identify new opportunities for collaboration between different schools within the University of Manchester.38

5 Lessons learnt

Despite their limitations, PubMed, Crossref, and I4OC provide examples of initiatives that have been successful in increasing the availability of open metadata of scholarly publications. PubMed demonstrates how a major research funder, the National Institutes of Health in the United States, has been able to set up a successful open infrastructure for metadata of publications. Crossref demonstrates how scholarly publishers have been able to work together to establish a powerful and sustainable infrastructure for metadata of publications. I4OC shows how an advocacy group has successfully put pressure on publishers to make better use of existing infrastructures.

Another lesson learnt is that publishers are in general supportive of initiatives that aim to increase the availability of open metadata of their publications. This may not be surprising, since such initiatives help to make the content of publishers easier to find. However, as is shown by the reluctance of a few large publishers (i.e., American Chemical Society, Elsevier, and IEEE) to participate in I4OC, publishers may decide not to support open metadata initiatives when they fear that such initiatives could harm their interests in the commercial exploitation of the metadata of their publications.

27 https://doi.org/10.3389/frma.2018.00023
28 https://about.lens.org
29 https://europepmc.org/About
30 https://figshare.com/articles/Europe_PMC_and_open_citations/7039739
31 https://harzing.com/resources/publish-or-perish/manual/using/data-sources
32 https://github.com/zuphilip/zotero-open-citations
33 https://www.cwts.nl/blog?article=n-r2v284
34 https://sourceforge.net/projects/citespace/
35 https://workshop-oc.github.io/presentations/Poster_Walker.pdf
36 http://visualbib.uniud.it/en/visualbib2/ and https://dossier-ng.univ-st-etienne.fr/scd/www/oci/OCI_graphe_accueil.
37 https://arxiv.org/abs/1902.03287
38 https://blog.research-plus.library.manchester.ac.uk/2019/03/04/using-open-citation-data-to-identify-new-research-opportunities/

In recent years, discussions about open metadata of publications have focused primarily on open citation data. This reflects the impact of I4OC. Many publishers have made the reference lists of their publications openly available in Crossref. However, most publishers still do not make other metadata elements, such as abstracts and author affiliations, available in Crossref. The lack of author affiliations in Crossref is one of the reasons why the Open Science Monitor needs to make use of proprietary data sources. To convince publishers to make abstracts, author affiliations, and other metadata elements available in Crossref, there seems to be a need for new initiatives, perhaps by building on existing initiatives such as I4OC and Metadata 2020.

A final lesson learnt relates to the importance of finding ways to guarantee the long-term sustainability of open infrastructures for metadata of publications. PubMed and Crossref are successful examples, but for other open metadata infrastructures, such as OpenCitations, this remains a major challenge. Without stable funding for these infrastructures, their long-term sustainability will remain at risk. The recently launched Invest in Open Infrastructure initiative could potentially contribute to a solution.39

6 Policy conclusions

During the past few years, important steps have been taken to increase the availability of open metadata of scholarly publications. Crossref, I4OC, and OpenCitations have played a crucial role in these developments. They have focused their attention mainly on making citation data openly available. A natural next step for these organizations seems to be to broaden their activities to other metadata elements, such as abstracts and author affiliations.

Some publishers with an interest in the commercial exploitation of the metadata of their publications are reluctant to make metadata openly available. To increase the availability of open metadata, research institutions and research funders could put pressure on these publishers. Research institutions could do so in their negotiations with publishers. Research funders could require grantees to publish their research in journals that support open metadata.40 The research funders participating in Plan S have indeed formulated such a requirement. In order to be compliant with Plan S, journals need to make available “high-quality article level metadata in standard interoperable non-proprietary format, under a CC0 public domain dedication”.41

In order to guarantee the long-term sustainability of scholarly infrastructures for open metadata, stable funding is needed. Research institutions and research funders seem to be in the best position to provide such funding, for instance through the Invest in Open Infrastructure initiative. When stable funding for open metadata infrastructures is provided, many new platforms can be expected to emerge that build on these infrastructures. These platforms may for instance provide high-quality data curation, they may offer innovative research analytics, or they may enable new approaches to scholarly literature search. Some of these platforms may be freely accessible, while others may require a subscription. Compared with the present situation, there will be more competition between platforms and more opportunities for innovation. This can be expected to lead to significant cost savings for research institutions and research funders, so that they will easily earn back their investments in open metadata infrastructures.

39 https://investinopen.org
40 https://doi.org/10.1038/d41586-018-00104-7
41 https://www.coalition-s.org/principles-and-implementation/


