January 2018

A study of current policies, practices and infrastructure supporting the sharing of data to prevent and respond to epidemic and pandemic threats

Elizabeth Pisani, Amrita Ghataure, Laura Merson

Summary: Data sharing in public health emergencies: Wellcome/GloPID-R ii

Executive summary

Introduction

Discussions around sharing public health research data have been running for close to a decade, and yet when the Ebola epidemic hit West Africa in 2014, data sharing remained the exception, not the norm. In response, the GloPID-R consortium of research funders commissioned a study to take stock of current data sharing practices involving research on pathogens with pandemic potential. The study catalogued different data sharing practices, and investigated the governance and curation standards of the most frequently used platforms and mechanisms. In addition, it sought to identify specific areas of support or investment which could lead to more effective sharing of data to prevent or limit future epidemics.

Methods

The study proceeded in three phases: a search for shared data, interviews with investigators, and a review of policies and statements about data sharing. We began by searching the academic literature, trials registries and research repositories for papers and data related to 12 pathogens named by the WHO as being of priority concern because of their epidemic or pandemic potential. The pathogens included Chikungunya, Crimean Congo Haemorrhagic Fever (CCHF), Ebola, Lassa Fever, Marburg, Middle East Respiratory Syndrome Coronavirus (MERS-CoV), Severe Acute Respiratory Syndrome (SARS), Nipah, Rift Valley Fever (RVF), Severe Fever with Thrombocytopenia Syndrome (SFTS) and Zika. For biomedical research involving these pathogens in humans, animals, in vitro and in silico, we sought to access the data underlying each paper, including through a short, on-line survey sent to 110 corresponding authors. A total of 28 people conducting or supporting research on the pathogens of interest were interviewed for 30-90 minutes about their data sharing practices and policies. In addition, we reviewed 105 institutional policies, discussion documents and academic commentaries about standards and norms in data sharing.
Data sharing practice

Extent of sharing

We identified a total of 787 research studies related to the priority pathogens published since 2003. The majority were case studies involving a single patient. Of the remaining 319, 98 provided the underlying data in an openly accessible form, and a further 15 authors said they would make data available on request. Excluding the case reports, two thirds of all papers were based on undiscoverable data. Some 161 clinical trials relating to the priority pathogens were registered in either clinicaltrials.gov or the Pan-African registry of trials; 58 of them have registered completion, but 41 of those remain unpublished. Two trialists posted links to aggregate data several years after completion. None linked to individual patient data. Just over half of all published papers gave dates for data collection. Authors who shared data published faster than those who didn't (18 vs. 30 months median time lag to publication).

Open sharing mechanisms

The most commonly used mechanism for the formal sharing of data was embedding supplementary material in journal articles, followed by deposit in general-purpose academic repositories. Very few of the datasets shared were in structured, machine-readable formats. Over half of the files provided as supplementary information included data in PDF format. Datasets in academic repositories were more likely to be structured, though many used proprietary software formats. Genetic sequence and protein structure data are shared more commonly than clinical data for the priority pathogens, generally through extremely well-resourced, community-designated databases such as GenBank or the Protein Data Bank. However, even these data do not appear to be shared rapidly; often, the public release of sequence data is timed to coincide with paper publication, even for pathogens with pandemic potential. Though registries do list some data repositories with a special interest in priority pathogens, uptake was limited amongst the researchers interviewed. Other formal sharing mechanisms included journal preprints and institutional repositories, together with "Big Data" approaches which collate information from public sources.

Closed sharing mechanisms

Our quantitative search strategy, based largely on publication, was biased towards data sharing mechanisms that were formal, findable and open. However, in interviews with the research community, we identified a second set of mechanisms for sharing data: closed consortia, regional collaborations and informal professional networks. We were unable to quantify the sharing of data through these mechanisms, but it is clear that for most of the priority pathogens informal or consortium sharing is currently more extensive, and faster, than sharing through formal open mechanisms.

In short:

- Most research data related to priority pathogens are not being shared for reuse through formal, discoverable means.
- Data that are shared through publication or open repositories are most often shared in formats that are not interoperable or easily reusable.
- Closed consortia and informal, trust-based networks are favoured by researchers, particularly those working in resource-constrained settings.

Table A summarises the strengths and weaknesses of some of the different sharing mechanisms. Broadly speaking, those that invest most in curation are most likely to provide long-term access to data that can most readily be reused for additional knowledge generation. However, they also tend to be the most costly. Sharing through informal networks of trusted colleagues is popular and can happen rapidly, but it lacks transparency, greatly restricts potential users, and essentially excludes machine learning.

Table A: Platforms and mechanisms used for the sharing of data about priority pathogens

Data sharing mechanism | Advantages | Disadvantages
Genomic/structural databases | Well-established; discoverable; standardised metadata | Expensive; slow
Supplementary material in journals | Findable by humans, with some effort; public investment not vital | Not machine readable, limiting discoverability; depends on electronic access to papers, often paywalled; diverse, largely unusable data formats
Journal preprints | Faster than traditional publishing models | Not discoverable even by humans; diverse, largely unusable data formats
Disease-specific curated repositories | Standardised metadata; community support; greatly facilitates reuse and interoperability | Extensive set-up costs; sustainability questionable; duplication of effort
General purpose academic repositories | Potentially cost-effective | Standardised metadata rare; sustainability uncertain
Institutional repositories | Strong support for data management and curation; easy entry point for academics reluctant to share | Duplication of effort; not easily discoverable; standardised metadata rare
Closed consortia | Standardised metadata; facilitates reuse and interoperability; protects equity and research interests, so strong community support | Excludes reuse by non-members; set-up costs; duplication of effort; sustainability questionable
Informal professional networks | Trust-based, so strong community support | Zero transparency
Big Data approaches | Cost-effective | Neither sensitive nor specific

Table B: FREE-FAIRER Principles and their inclusion in institutional data sharing policies

Principle | Definition | Notes
Findable | Data should be described with rich, machine-readable metadata, persistently and uniquely identified, and indexed in a searchable resource. | Most policies address "findability" and require data to be linked to a DOI. In practice DOIs are linked to papers, rarely to data. No policy requires standardised ontologies for metadata. No policies apply to surveillance data.
Rapidly available | Data should be shared rapidly to maximise their potential utility in outbreak situations. | Few institutional policies exist specifically for research relevant to potential emergencies. Most are tied to publication models and other quality control norms that are inimical to speed. Preprints increase speed but reduce findability. Structures for rapid release of surveillance data are dysfunctional.
Ethical | Data are collected with appropriate consent procedures. Data sharing protects individual privacy, without undermining the rapid protection of public health. | Discussions relating to ethics prioritise the protection of individuals. Few address the ethics of withholding data or public health importance.
Equitable | Data sharing delivers fair benefit to those who collected the data, and to the communities from which data were collected. | Some policies include aspirational statements about equity, but few provide guidance on how to operationalize them. For interviewees, especially in research-poor settings, this is the biggest obstacle to sharing. In those policies that mention it, 'equity' is often reduced to norms for citation and authorship.
Forever | Data are shared through a mechanism that uses persistent identifiers and has clear, secure funding strategies to ensure the long-term preservation of data. | Many policies address persistent identifiers; few address long-term preservation or sustainability of sharing platforms.
Accessible | Potential data users are physically able to retrieve shared data using identifiers, after authentication as necessary. | Few policies address file formats or physical retrieval mechanisms.
Interoperable | Data formats must be platform agnostic and metadata must use a formal, shared language that allows for exchange of information between datasets. | Discussion of interoperability is increasing, but policies addressing ontologies, languages and file formats remain rare. Awareness of these issues among researchers interviewed was low.
Reliable | Data should be quality assured before being shared, unless robust mechanisms for community quality assurance post-release are in place. | The onus for quality control is put on researchers; interviewees say that conflicts with demands for the rapid release of data.
Economically viable | | Most policies address direct costs of sharing, but few consider infrastructural costs or opportunity costs. To researchers, the latter weigh most heavily. -based 'big data' approaches increasingly suggested for early warning of outbreaks.
Reusable | Data licenses allow for reuse; metadata allow for pooled analysis or other community reuse. | Many papers call for more "off the shelf" licenses, and an increasing number of policies specify licenses for reuse. Few encourage use of community-standardised metadata, or curated platforms for clinical or surveillance data.

Desirable norms in data sharing

In addition to examining data sharing in practice, we reviewed the institutional policies of research funders, academic institutions, data management organisations and publishers, as well as a large number of position statements, discussion papers, and academic recommendations around data sharing. Most of these referred to sharing of research data across a number of contexts. Only a few considered the particular needs of potential epidemics or public health emergencies, or focused specifically on infectious disease data.

Many principles and features of shared data were considered important. The 'FAIR' principles proposed by Wilkinson and colleagues in 2016 relate to properties of the data themselves (findable, accessible, interoperable, reusable) and are widely agreed to be essential. In the particular situation of potential emergencies, 'reusable' is probably the most important of these. However, we found other principles, related to research and to infrastructure, that are also deemed important. In keeping with the existing FAIR acronym, we have dubbed the expanded set 'FREE-FAIRER'. They are shown in Table B above.

Of the additions to the set, the most important in emergency settings are probably 'rapid' and 'equitable'. Sharing quickly is of paramount importance when many lives are potentially at stake. But relying on "public good" arguments has in the past failed to guarantee sharing. It is important to acknowledge that many of the people actually collecting samples and conducting clinical research likely to be of epidemic/pandemic importance are researchers in lower income countries marked by decades of under-investment in science. For them, a large disease outbreak of international interest represents an important career opportunity. Data sharing mechanisms will fail unless they work to support the needs and aspirations of investigators closest to the outbreak.
Recommended areas for investment

We mapped the FREE-FAIRER principles against different data sharing platforms and mechanisms. No data sharing mechanism is optimal across all of the principles. Under current practices, closed consortia and informal networks are the most likely to lead to the (limited) sharing of reusable data in emergency settings, followed by curated repositories. However, data sharing practices will change along with technology, community norms and incentives, and the investment decisions made by research funders and others. We identified several key areas in which investment or changes in practice could reduce existing hurdles to effective sharing of research data to prevent or limit pandemics.

- Develop community-agreed metadata standards for priority pathogens, to increase findability, interoperability and cost effectiveness, and to facilitate reuse.
- Develop and/or implement the use of software to increase data comparability at points of collection and curation. This also lowers costs and increases reliability.
- Support a "pre-qualification" system for data sharing governance templates.
- Publish institutional policies explicitly requiring rapid access to priority pathogen data which can be effectively reused; actively monitor and reward compliance.
- Support the development of disease-specific networks and platforms.
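To make the first recommendation more concrete, the following is a minimal sketch of how a repository might check that a dataset record carries an agreed minimum set of metadata in a machine-readable format. The field names, the list of required fields, the set of formats treated as machine-readable and the example record are all hypothetical; no such standard is defined in this report.

```python
# Hypothetical minimum metadata fields; not taken from any agreed standard.
REQUIRED_FIELDS = ("identifier", "pathogen", "data_type", "collection_start",
                   "collection_end", "file_format", "licence")

# File formats treated here as structured and machine-readable (an assumption).
MACHINE_READABLE = {"csv", "tsv", "json", "xml"}


def check_record(record: dict) -> list:
    """Return a list of problems found in a dataset metadata record."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if not record.get(name)]
    fmt = str(record.get("file_format", "")).lower()
    if fmt and fmt not in MACHINE_READABLE:
        problems.append(f"format not machine-readable: {fmt}")
    return problems


example = {
    "identifier": "doi:10.0000/example",  # placeholder, not a real DOI
    "pathogen": "Ebola",
    "data_type": "clinical",
    "collection_start": "2014-06-01",
    "collection_end": "2014-12-31",
    "file_format": "pdf",
    "licence": "",
}

for problem in check_record(example):
    print(problem)
```

A check of this kind could run at the point of deposit, which is where the second recommendation (software support at collection and curation) would apply.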

Table of contents

1 INTRODUCTION
2 METHODS
3 RESULTS
3.1 DATA SHARED THROUGH PEER REVIEWED PUBLICATION
3.2 DATA SHARED THROUGH TRIAL REGISTRIES
3.3 MOST COMMONLY USED DATA SHARING MECHANISMS
3.4 OTHER POTENTIAL REPOSITORIES AND SHARING MECHANISMS
4 STRENGTHS AND WEAKNESSES OF DIFFERENT DATA SHARING PLATFORMS AND MECHANISMS
4.1 GENOMIC AND STRUCTURAL DATABASES
4.2 JOURNAL PUBLICATIONS AND SUPPLEMENTARY MATERIAL
4.3 PREPRINTS
4.4 INSTITUTIONAL REPOSITORIES
4.5 GENERAL PURPOSE ACADEMIC REPOSITORIES
4.6 DISEASE-SPECIFIC CURATED REPOSITORIES
4.7 CLOSED CONSORTIA AND FORMAL PROFESSIONAL NETWORKS
4.8 INFORMAL PROFESSIONAL NETWORKS, AND REGIONAL HUBS
4.9 AGGREGATION OF DATA FROM PUBLIC SOURCES
5 DESIRABLE PRACTICE: ASPIRATIONS FOR DATA SHARING
6 SUMMARY AND RECOMMENDED ACTIONS
6.1 METADATA STANDARDS, SOFTWARE AND INDEXING
6.2 CONSENSUS ON GOVERNANCE
6.3 REFINING AND IMPLEMENTING INSTITUTIONAL POLICIES
6.4 DISEASE-SPECIFIC NETWORKS AND CURATED PLATFORMS
7 CONCLUSION
BIBLIOGRAPHY

Data sharing in public health emergencies. Wellcome/GloPID-R 1

1 Introduction

Since at least the SARS outbreak of 2003, public health authorities and researchers have been thinking seriously about how best to share samples, genetic sequences and other epidemiological and research data to allow for a rapid response to the threat posed by infectious diseases with pandemic potential. Years of discussion have resulted in many research funders, global health organisations, industry groups, academic publishers and academic institutions developing policies that support data sharing. Some structures that facilitate data sharing have been developed. And yet when Ebola raged across West Africa in 2014, the rapid sharing of data remained the exception, rather than the norm.

In response, the GloPID-R consortium of research funders renewed their commitment to promoting the rapid sharing of research data to help the world to prepare better for the next infectious disease outbreak, and to respond better when future outbreaks occur. The Wellcome Trust, which chairs GloPID-R's working group on data sharing, in early 2017 commissioned a study that aimed to take stock of current data sharing practices involving research on pathogens with pandemic potential. The study focused principally on data generated under research protocols, although implications for surveillance data were also considered. The study addressed the following questions:

- What data platforms are available for data relating to emerging infectious diseases? Where are these platforms located and how are they governed, funded and used?
- What systems are used or available to support data discovery, access and use as part of a rapid research response?
- What are the existing data standards and governance mechanisms for rapid storing, accessing and sharing of research and results data?
- In the absence of existing data platforms, where are data held?
- What could researchers, responders and research funders do to support and sustain data platforms relevant to public health emergencies?
The study aims to inform the programme of work around data sharing for GloPID-R members. Here we report the results of that investigation. We first describe our methods, then, in Section 3, investigate which platforms are used in practice. In Section 4 we consider these together with others that exist but are less used. We examine the strengths and weaknesses of each type of platform or data sharing mechanism, and summarise reasons driving their use or non-use. In order to address the final question (what could be done to support and sustain data platforms), Section 5 reviews academic literature and institutional policies laying out ideals for data sharing in the biomedical sciences. Finally, Section 6 maps those ideals against our findings, and identifies areas in which targeted investment and action would be most useful in supporting rapid and productive sharing of data in public health emergencies.

2 Methods

We conducted the study in three phases.

2.1.1 Phase 1: searching for shared data

We searched PubMed for research papers published since 2003 relating to pathogens named by the WHO as priority research targets in public health emergencies, or named as areas of concern: Chikungunya, Crimean Congo Haemorrhagic Fever (CCHF), Ebola, Lassa Fever, Marburg, Middle East Respiratory Syndrome Coronavirus (MERS-CoV), Severe Acute Respiratory Syndrome (SARS), Nipah, Rift Valley Fever (RVF), Severe Fever with Thrombocytopenia Syndrome (SFTS), Zika. Our search criteria excluded editorials, commentaries and reviews. We read the abstracts of every paper. Those deemed to be in the scope of our study (covering human, animal, in vitro and in silico research involving the pathogen, but excluding socio-economic research) were entered into a spreadsheet and downloaded into an electronic library. Because of time constraints, we did not include papers related to genetic sequence data, since the culture and infrastructure for sharing these data are already well established and much discussed. We do, however, include a summary of the strengths and weaknesses of genetic data platforms in Section 4 of this report. In accordance with the brief, we also excluded discussion of biological sample sharing.

For each paper, we sought to access underlying data. We looked for a data availability statement in each paper, and inspected the paper for embedded supplementary information. Where we could find no data, we contacted corresponding authors with a short, on-line survey about the underlying data specific to that paper. We excluded case reports from further investigation because case reports tend to include all relevant data in the paper itself.
The detailed search workflow, a spreadsheet recording our results, and a bibliography of all papers deemed in scope are available at DOI:10.7910/DVN/D1HAPO.

Secondly, we searched clinicaltrials.gov, the largest clinical trial registry, for studies on each of the priority pathogens which may not have been published in peer-reviewed literature. The Pan-African registry was also searched. We recorded the availability of results, links to any published papers which we may have previously missed, and the principal investigator. Where we could find no publications or results relating to a study that was no longer recruiting, we contacted principal investigators with listed email addresses and asked about the availability of underlying data specific to that trial.

Thirdly, we performed keyword searches in publicly searchable repositories or platforms commonly used by researchers: Dataverse, Dryad, F1000, Figshare, GitHub, Mendeley and , where appropriate by searching on the names of the priority pathogens as subject tags or title words, and restricting the information type to dataset.

Finally, we included broad and less systematic internet-based searches using Google, combining the name of each priority pathogen with 'data repository' and other synonyms.

For each of the platforms that were used to share data, we collected information from public sources on metadata and curation standards, governance and funding.

We sent personalised emails to 110 researchers, requesting that they fill in a very short questionnaire relating to data from a specific paper or clinical trials registry entry. Despite sending two reminders, we had only 13 responses. We therefore do not report specifically on the results from the on-line survey.
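The PubMed search described in Phase 1 could be expressed as query strings along the following lines. This is only an assumed reconstruction for illustration: the authors' actual search workflow is archived with the study data and may differ. The field tags are standard PubMed search tags, but matching on title and abstract, and mapping "commentaries" to the "Comment" publication type, are assumptions.

```python
# Pathogen names as listed in the Methods section.
PATHOGENS = [
    "Chikungunya", "Crimean Congo Haemorrhagic Fever", "Ebola", "Lassa Fever",
    "Marburg", "Middle East Respiratory Syndrome Coronavirus",
    "Severe Acute Respiratory Syndrome", "Nipah", "Rift Valley Fever",
    "Severe Fever with Thrombocytopenia Syndrome", "Zika",
]

# Publication types excluded by the search criteria (assumed mapping).
EXCLUDED_TYPES = ["Editorial", "Comment", "Review"]


def build_query(pathogen: str, start_year: int = 2003) -> str:
    """Build an illustrative PubMed query string for one pathogen."""
    topic = f'"{pathogen}"[Title/Abstract]'
    # Open-ended date range: start_year up to a far-future year.
    dates = f"{start_year}:3000[Date - Publication]"
    excluded = " OR ".join(f'"{t}"[Publication Type]' for t in EXCLUDED_TYPES)
    return f"({topic}) AND ({dates}) NOT ({excluded})"


for name in PATHOGENS:
    print(build_query(name))
```

Generating one query per pathogen keeps the per-pathogen counts separable, which matches the way results are tabulated by pathogen later in the report.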

2.1.2 Phase 2: understanding the mechanisms of sharing

Having searched for data using the methods that might most commonly be used by scientists or policy makers with access only to the scientific commons, we tried to understand what other formal or informal methods are used to share data, and why those are used. This phase involved conducting telephone or face-to-face interviews lasting 30-90 minutes with researchers and others in the research community, including those who share data informally, and those who have chosen not to make data available. Agreement to use interview content for this report was obtained from all interviewees whose contributions are included. Responses were noted in real time and recorded when agreed. Notes and quotes were entered into NVivo software and coded to the criteria developed in Phase 1.

We conducted a total of 17 interviews with a total of 28 individuals. All but two interviews were individual; one was a group telephone interview and one was a group face-to-face interview. Interviewees were purposefully sampled to include wide geographic representation in a range of roles and fields of interest to this study. Interviewees included experts in academic clinical research (8), academic preclinical research (2), government public health and/or development agencies (5), not-for-profit standards development firms (3), non-governmental organisations with a surveillance or epidemic response remit (7), and regional and international inter-governmental health organisations (3). Interviewees were associated with organisations based in North America (8), South America (1), Europe (12), Africa (3), and Asia (4).
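As a quick consistency check on the sample described above, the counts by sector and by region can each be summed and compared with the stated total of 28 interviewees. The dictionary labels are shortened paraphrases of the categories in the text; the counts are taken directly from it.

```python
# Interviewee counts by sector and by region, as reported in the text.
by_sector = {
    "academic clinical research": 8,
    "academic preclinical research": 2,
    "government public health/development agencies": 5,
    "not-for-profit standards development": 3,
    "NGOs with surveillance or epidemic response remit": 7,
    "regional and international health organisations": 3,
}
by_region = {"North America": 8, "South America": 1, "Europe": 12,
             "Africa": 3, "Asia": 4}

TOTAL_INTERVIEWEES = 28

sector_total = sum(by_sector.values())
region_total = sum(by_region.values())
print(sector_total, region_total)

# Share of interviewees per region, rounded to whole percentages.
region_share = {region: round(100 * n / TOTAL_INTERVIEWEES)
                for region, n in by_region.items()}
print(region_share)
```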

2.1.3 Phase 3: identifying areas for action

In order to identify areas where investment would be most valuable, we sought to pinpoint the disparity between the practices identified in phases 1 and 2, and normative 'best practices'. To identify those best practices, we reviewed academic papers, reports, policy papers and institutional websites to collate observations, opinions and recommendations about:

- the types of data essential to effective preparedness and response
- where data are/should be hosted
- how data sharing is/should be funded
- how data access is/should be governed
- the principles that do/should underlie data sharing

We searched PubMed, Google Scholar and SSRN using the key words "data shar*", "data manage*", "data repositor*" and "public health emergenc*" to identify academic papers relating to data sharing, and followed up references in papers thus identified. We also searched for institutional policies or guidance relating to data sharing on the websites of the World Health Organisation and other international and multilateral health organisations, research funders, global health think tanks, academic publishers, pharmaceutical companies and academic institutions.

We read and coded a total of 105 statements, commentaries, academic papers and policy documents about data sharing, including 47 institutional policies/guidance documents. All documents were entered into NVivo software and coded thematically. High-order codes were deductive, derived from the original study research plan. Further codes were developed inductively from the content of the data themselves, and refined iteratively as data analysis progressed. Based on the themes that emerged most strongly, we developed the list of criteria which we used to structure our investigation of current practice.

A summary spreadsheet of key documents, together with the study research plan, a downloadable bibliography of all the documents consulted, and other documents related to this research are available at DOI:10.7910/DVN/D1HAPO.

3 Results

3.1 Data shared through peer reviewed publication

Our methodological approach took as its starting point the normative narrative of scientific research: that publication in peer-reviewed journals is the primary means of sharing quality-assured research results. Our initial assumption was that if data are shared, they will be linked to those publications.

As Table 1 shows, some 60% of published papers relating to priority pathogens involved case reports, which by nature already describe individual patient data. For the remaining 319 papers, we looked for data sharing statements, and examined supplementary material. Papers that did not provide underlying data as supplementary material, and that also provided no way of accessing underlying data (including by request from study authors) were deemed to have "inaccessible data".

As Table 1 indicates, 98 of the 319 papers that were not case reports (some 31%) provided access to all the data underlying the paper, without having to request it from authors. In most cases data were embedded in the paper itself (although it was not always easy to assess whether supplementary material was comprehensive). Authors of a further 15 papers said they would make data available on request, an assertion that we did not test. Some 207 of 319 papers (65%) provided no information allowing readers to discover or access underlying data.

A total of 28 papers provided links to data stored in repositories; excluding case studies, that is less than nine percent of all papers. In addition, we found three Ebola datasets in repositories which provided data underlying a published paper but were not mentioned in the paper, and a further 21 datasets in repositories containing data not apparently linked with a published paper. All but one of these contained Ebola data; the outlier described MERS. The GitHub platform, which is aimed largely at software developers, also listed many collections related to Ebola and Zika. Many of these compiled data scraped from other sources; we did not have the resources to assess all of these for relevance and completeness.
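The shares quoted in this section can be recomputed directly from the counts given in the text. The short sketch below does so; the variable names are illustrative only. It also shows that the three accessibility categories sum to one more than the number of papers, which is consistent with the note to Table 1 that one Chikungunya paper is counted in two categories.

```python
# Totals quoted in the text (papers excluding case reports).
papers = 319
openly_available = 98
on_request = 15
not_accessible = 207
linked_to_repository = 28


def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


print("openly available:", pct(openly_available, papers))
print("not accessible:", pct(not_accessible, papers))
print("linked to a repository:", pct(linked_to_repository, papers))

# The three categories sum to one more than the number of papers, which is
# consistent with a single paper being counted in two categories.
overlap = openly_available + on_request + not_accessible - papers
print("papers counted twice:", overlap)
```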

Table 1: Number of research papers or completed registered clinical trials related to priority pathogens, and data accessibility

Pathogen | Total papers and studies | Papers excluding case reports | All underlying data openly available | Data available on request | Data not accessible | % with data not accessible**
Lassa Fever | 23 | 8 | 2 | 0 | 6 | 75%
CCHF | 93 | 19 | 8 | 1 | 10 | 53%
Ebola | 180 | 124 | 42* | 11 | 71 | 57%
Marburg | 11 | 8 | 0 | 0 | 8 | 100%
MERS-CoV | 73 | 32 | 13* | 1 | 18 | 56%
SARS | 54 | 24 | 0 | 0 | 24 | 100%
Nipah | 31 | 31 | 3 | 0 | 28 | 90%
RVF | 15 | 10 | 1 | 0 | 9 | 90%
SFTS | 31 | 10 | 4 | 0 | 6 | 60%
Zika | 121 | 25 | 15* | 1 | 9 | 36%
Chikungunya | 155 | 28 | 10 | 1^ | 18 | 64%
TOTAL | 787 | 319 | 98 | 15 | 207 | 65%

CCHF: Crimean Congo Haemorrhagic Fever. MERS-CoV: Middle East Respiratory Syndrome Coronavirus. SARS: Severe Acute Respiratory Syndrome. RVF: Rift Valley Fever. SFTS: Severe Fever with Thrombocytopenia Syndrome.
*Four Ebola, one MERS-CoV and two Zika papers did not indicate that data were available. We identified the relevant data for these papers in a separate search of repositories.
^One Chikungunya paper is categorised as both data openly available and data available on request.
**Percentage of papers excluding case reports.

The more rapidly data are shared, the more likely they are to lead to an improved response in a public health emergency. We calculated the time lag to publication for those papers which included embedded data or links to those data, and those which did not. The results are shown in Table 2 for studies which involve several patients, as well as for case studies. It is interesting to note that even papers related to priority pathogens that involve only a single individual take more than a year, on average, to make it on to the academic record. More complex studies that do not make all underlying data readily available (in the paper itself, as supplementary material or in a repository) push that up to two and a half years, compared to 18 months for papers that do make data available. Preparing data for sharing does not, then, appear to add to the lag time between data collection and publication. We attempted to classify papers by timing (pre-outbreak, during outbreak, inter-outbreak etc.) but found this difficult since a) the definition of an "outbreak" appears to vary and b) half of papers did not mention the time of data collection.

Table 2: Average time between completion of data collection and publication of results, for papers which specified the time of data collection

Paper type | Papers with dates for fieldwork | Average time lag (months) | Range (months)
PubMed-indexed paper with shared data (N=98) | 27 | 18 | 2-59
PubMed-indexed paper with no shared data (N=195)^ | 95 | 30 | 0-73
PubMed-indexed case study (N=468) | 267 | 15 | 0-158
TOTAL* (761) | 389 | 63 | 0-158

^These include papers that did not provide all the underlying data and those offering access on request.
*This total does not include 27 papers which were published in Chinese, or in journals inaccessible to researchers at Oxford University.

We further attempted to disaggregate data by type of research, coding studies as in silico, in vitro, surveillance, observational (cohort, case-control, health care), clinical trial, animal research, genetic research or environmental. Besides case studies, the most prominent classifiable research types were other observational studies (n=80), animal research (n=43) and in vitro research (n=37). However, over 100 reports, including in peer reviewed papers, did not provide sufficient detail to allow for a clear classification, and we thus do not report the results here.
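As a cross-check on Table 2, the share of papers in each group that reported dates for data collection can be computed from the group sizes (N) and the counts of papers with fieldwork dates. This is a minimal sketch using only the figures in the table; the group labels are shortened for readability.

```python
# (papers in group, papers giving dates for fieldwork), from Table 2.
groups = {
    "shared data": (98, 27),
    "no shared data": (195, 95),
    "case study": (468, 267),
}

for label, (n_papers, n_dated) in groups.items():
    print(f"{label}: {100 * n_dated / n_papers:.0f}% gave dates")

total_papers = sum(n for n, _ in groups.values())
total_dated = sum(d for _, d in groups.values())
print(f"all papers: {100 * total_dated / total_papers:.0f}% gave dates")
```

The overall share agrees with the statement that about half of papers did not mention the time of data collection. The time-lag comparison for papers with shared data rests on the 27 papers in that group that gave dates.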

3.2 Data shared through trial registries

The World Health Organisation has been encouraging transparency in clinical trials for over a decade, including through the sharing of research protocols and related information. In a resolution in 2005, it endorsed the registration of all clinical trials, and it has since supported a portal which allows users to search for trials across a variety of registries which meet particular quality standards. In its 2015 statement on disclosure of clinical trials results, the WHO demanded that:

- the results of clinical trials are submitted for publication in a peer reviewed journal within 12 months of study completion;
- published results are available in an format within 24 months of study completion;
- key study outcomes are made available in (or linked to) the clinical trials registry, within 12 months of study completion.(1)

To understand compliance with these requirements, we searched clinical trial registries for trials of interventions for priority pathogens. As Table 3 shows, just two of 58 completed clinical trials found through their registry entries provided aggregate results in that entry, and none included any link to individual patient data. Of the two studies which posted results to the trial registry, one posted aggregate results three and a half years after completion, the other seven and a half years.

In short, neither peer reviewed publications nor clinical trial registries are widely used to share data rapidly in public health emergencies. Explanations gathered during interviews for not sharing data more rapidly included lack of confidence in the utility of the data, and therefore unwillingness to invest resources to prepare it to be shared; lack of confidence in data quality; absence of academic incentive for speed; and a disconnect between those who are collecting the data and those who wish to use it quickly.


Table 3: Publication of clinical trial results

| Pathogen | Trials registered | Trials completed | Results posted to registry | Results published < 24 months | Unpublished completed trials | Av. months since completion |
|---|---|---|---|---|---|---|
| Lassa Fever | 3 | 1 | 0 | 0 | 1 | 8 |
| CCHF | 3 | 0 | 0 | 0 | 0 | N/A |
| Ebola | 80 | 37 | 0 | 18* | 22 | 16 |
| Marburg | 2 | 2 | 0 | 1 | 1 | 67 |
| MERS-CoV | 7 | 3 | 0 | 1* | 3 | 21 |
| SARS | 16 | 7 | 0 | 0 | 7 | 115 |
| Nipah | 1 | 0 | 0 | 0 | 0 | N/A |
| RVF | 4 | 3 | 1 | 0 | 3 | 110 |
| SFTS | 0 | 0 | 0 | 0 | 0 | N/A |
| Zika | 25 | 1 | 0 | 0 | 1 | 5 |
| Chikungunya | 20 | 4 | 1 | 2 | 3 | 6 |
| TOTAL | 161 | 58 | 2 | 22 | 41 | 42 |

CCHF: Crimean Congo Haemorrhagic Fever. MERS-CoV: Middle East Respiratory Syndrome Coronavirus. SARS: Severe Acute Respiratory Syndrome. RVF: Rift Valley Fever. SFTS: Severe Fever with Thrombocytopenia Syndrome.
*Results from four of the Ebola trials and the MERS trial were published before the study was completed.

Conclusion 1 The majority of research data relating to priority pathogens are simply not being shared through formal, discoverable mechanisms.

3.3 Most commonly used data sharing mechanisms

We supplemented the literature and registry search with searches of other potential platforms and repositories, as well as with an online questionnaire, to try to ascertain which mechanisms were most commonly used to share data from published papers other than case reports. The results are shown in Table 4.


Table 4: Mechanisms used to share data

| Data sharing mechanism | N | Notes |
|---|---|---|
| No stated mechanism | 205 | |
| Supplementary material in journal articles | 60* | Completeness of supplementary data difficult to ascertain |
| Figshare | 25 | Appears to be driven by journal & funder requirements. While 25 papers stored some related data in Figshare, comprehensive data only available for 6 papers. |
| Dryad | 17 | Appears to be driven by journal & funder requirements |
| Request to author | 17 | Authors state they will make data available on request. We did not test this assertion, and have no data on file formats. |
| GitHub | 10 | |
| Harvard DataVerse | 1 | |
| Zenodo | 1 | |
| TOTAL** | 336 | |

*Includes only supplementary materials in papers that claim to provide access to all underlying data.
**This exceeds the number of papers identified because some of the data found in repositories was not linked to papers.

Most researchers interviewed in Phase 2 said they were willing to share data with trusted colleagues through informal networks, and sometimes through consortia. These responses are not reflected in Table 4; however, they are discussed at greater length below.

The potential utility of shared data is greatly influenced by the format in which it is made available. Data are most easily reusable if they are in downloadable, software-agnostic formats, and if they use standardised metadata that increase inter-operability. Noting that by far the most common way of sharing data relating to studies of priority pathogens is through supplementary files embedded in electronic versions of published papers, we further looked at file formats used. Structured, machine-readable formats using XML, RDF and similar standards appear currently to be the exception, even where data are shared.

Table 5 gives details of file formats used in supplementary data linked to publications about priority pathogens. We found that 57% of the supplementary data related to published papers was in .pdf formats, which do not allow for easy data scraping or indexing, and a further 15% were Word files. Some of this material included protocols and other documents not containing data per se. Some 14% of supplementary data files were downloadable numerical files in .xls or .csv spreadsheet formats, or formatted for statistical software. The remaining files were image, video or presentation formats.
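As an illustration of the software-agnostic, metadata-bearing formats described above, the following is a minimal sketch rather than a recommended standard. It serialises a small table as CSV text and describes it in a JSON record keyed by Dublin Core element names. The rows, column names and metadata values are all hypothetical placeholders, not data from this study.

```python
import csv
import io
import json

# Hypothetical aggregate counts; illustrative values only, not study data.
rows = [
    {"pathogen": "Ebola", "week": "2014-W40", "confirmed_cases": 12},
    {"pathogen": "Ebola", "week": "2014-W41", "confirmed_cases": 9},
]

# Descriptive metadata keyed by Dublin Core element names.
# Every value is a placeholder chosen for this example.
metadata = {
    "title": "Weekly confirmed cases (illustrative)",
    "creator": "Example Research Group",
    "date": "2018-01-01",
    "format": "text/csv",
    "rights": "CC0 1.0",
    "description": "One row per pathogen per ISO week.",
}


def to_csv(records):
    """Serialise a list of dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv(rows)
metadata_text = json.dumps(metadata, indent=2, sort_keys=True)
```

Both outputs are plain text, so they can be opened without proprietary software, and the metadata record travels alongside the data file rather than being embedded in a PDF.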


Table 5: Availability and format of supplementary information

Columns, in order: Papers excluding case reports; Includes supplementary information: Any; .pdf; .doc; .xls, .csv, .dta, .sav; image, video, vector

Lassa Fever 8 3 1 2
CCHF 19 6 5 1 1
Ebola 124 79 49 10 14 5
Marburg 8 3 3
MERS-CoV 32 18 10 2 1 4
SARS 24 2 1 1
Nipah 31 15 4 3 2 6
RVF 10 10 3 0
SFTS 10 10 3 2 1
Zika 24 12 5 2 2 3
Chikungunya 28 5 4 1
TOTAL 318 149 89 23 22 18

CCHF: Crimean Congo Haemorrhagic Fever. MERS-CoV: Middle East Respiratory Syndrome Coronavirus. SARS: Severe Acute Respiratory Syndrome. RVF: Rift Valley Fever. SFTS: Severe Fever with Thrombocytopenia Syndrome.

We did not attempt a similar quantitative analysis of routine surveillance data relating to priority pathogens. However, some of the resources we did identify (particularly in GitHub) are derived from "scraping" data provided by government and international agencies in .pdf formats, or other formats in which the metadata for interoperable use is essentially absent.

Conclusion 2 Research data that are shared are most often shared in formats that are not inter-operable or easily reusable.


3.4 Other potential repositories and sharing mechanisms

By taking published papers as a point of departure, the study methods clearly decrease the likelihood of identifying sharing that is unlinked to publication in peer reviewed journals. Although we did not look in detail at genomic or structural data, we conducted a brief search on the taxonomical terms describing the priority pathogens in Protein Data Bank and the International Nucleotide Sequence Database Collaboration (which groups GenBank, DNA Data Bank of Japan and the European Molecular Biology Laboratory databases). We also searched two registers of research data repositories – BioSharing and Re3Data – for platforms that may be used to share data on priority pathogens.

Table 6: Entries related to priority pathogens in INSDC/GenBank and Protein Data Bank

| Pathogen | GenBank | Protein Data Bank |
|---|---|---|
| Lassa Fever | 1,006 | 23 |
| CCHF | 176 | 14 |
| Ebola | 2,554 | 277 |
| Marburg | 321 | 133 |
| MERS-CoV | 835 | 281 |
| SARS | 1,237 | 403 |
| Nipah | 156 | 74 |
| RVF | 1,126 | 92 |
| SFTS | 1,341 | 29 |
| Zika | 994 | 171 |
| Chikungunya | 4,166 | 88 |

INSDC: International Nucleotide Sequence Database Collaboration. CCHF: Crimean Congo Haemorrhagic Fever. MERS-CoV: Middle East Respiratory Syndrome Coronavirus. SARS: Severe Acute Respiratory Syndrome. RVF: Rift Valley Fever. SFTS: Severe Fever with Thrombocytopenia Syndrome.

As Table 6 shows, the common international databases for genetic sequences and for macromolecular structures are indeed used to deposit data related to the pathogens of interest. Although we did not carry out a rigorous analysis of time between data collection and publication on these databases, a cursory glance at a random selection of gene sequences for two of the most commonly deposited pathogens, SARS and Ebola, reveals that data deposit is far from instantaneous. SARS samples listed as collected in 2009 and 2011 were first published in 2015, while Ebola samples collected in February 2012 were made visible to other researchers in December 2016. The date of publication in GenBank appears to coincide principally with submission of a manuscript to a peer reviewed journal, or its publication. (GenBank allows researchers to embargo the publication of a deposited sequence until a specified date.)


Searching BioSharing and Re3Data for databases that refer specifically to priority infectious pathogens, we found that some of the specialist repositories listed simply aggregate data from other sources, such as Protein Data Bank or INSDC. Even these were limited, as Table 7 shows. The listings appear currently to exclude specialist databases maintained by consortia, such as the respiratory illness database of the International Severe Acute Respiratory and emerging Infection Consortium (ISARIC). There was very limited overlap between the repositories listed in these registries and those used by the researchers we identified as sharing data relating to priority pathogens through our study search strategies.

In interviews, researchers mentioned other platforms that could be used for data sharing. These include highly curated pathogen-specific platforms such as those maintained by the Infectious Diseases Data Observatory (IDDO) and ISARIC. They are discussed at greater length in the next section.

Conclusion 3 While researchers do use global databases such as GenBank and Protein Data Bank, they often embargo data availability until manuscript publication dates. Few other priority-pathogen-specific resources are available, and those that exist are not widely used.


Table 7: Repositories listed in BioSharing and Re3Data directories as particularly relevant for priority pathogens

Pathogen columns: Lassa Fever, CCHF, MERS-CoV, Chikungunya, Ebola, Zika, SARS.
x – found when searching this pathogen on Biosharing.org. 0 – found when searching this pathogen on Re3data.org.

| Repository | Pathogen markers | Data type | Standards | Data access |
|---|---|---|---|---|
| Apollo | x x x | Epidemic scenarios | Apollo SV | CC 3.0 license |
| Ebola and Hemorrhagic Fever Virus Database | x x x | Viral sequences | FASTA | Free, web-based |
| VIRsiRNAdb | x | siRNA sequence, viral target and subtype | FASTA | No policy specified |
| Virus Pathogen Database and Analysis Resource | 0 x x | Sequence records, gene and protein annotations, 3D protein structures, immune epitope locations, clinical and surveillance metadata | XML, FASTA, CLUSTAL-W, GenBank | Free, web-based |
| Malaria Atlas Project | 0 | Geo-spatial | None cited | CCBY 3.0 |
| Zika Open Data Portal | 0 | Unclear. Does not appear to have been widely used | None cited | Public domain |
| NCBI SARS | 0 | Genomic | None cited | CC0 |


4 Strengths and weaknesses of different data sharing platforms and mechanisms

Broader discussions cite two major benefits of data sharing: transparency and utility. The data sharing mechanisms which best achieve each of these goals differ slightly.(2) In the case of public health emergencies, utility – the ability to combine data from different sources to generate new learning more rapidly – is the driving motivation. We therefore characterised the strengths and weaknesses of each of the available data sharing mechanisms in terms of achieving that goal. These are summarised in Table 8, and discussed at greater length below.

Table 8: Advantages and disadvantages of data sharing platforms/mechanisms

| Data sharing mechanism | Advantages | Disadvantages |
|---|---|---|
| Genomic/structural databases | Well-established; Discoverable; Standardised metadata | Expensive; Slow |
| Supplementary material in journals | Findable by humans, with some effort; Public investment not vital | Not machine readable, limiting discoverability; Depends on electronic access to papers; Often paywalled; Diverse, largely unusable formats |
| Journal preprints | Faster than traditional publishing models | Not discoverable even by humans; Diverse, largely unusable data formats |
| Disease-specific curated repositories | Standardised metadata; Some community support; Greatly facilitates reuse and interoperability | Extensive set-up costs; Sustainability questionable; Duplication of effort |
| General purpose academic repositories | Potentially cost-effective | Standardised metadata rare; Sustainability uncertain |
| Institutional repositories | Strong support for data management and curation; Easy entry point for reluctant academics | Duplication of effort; Not easily discoverable; Standardised metadata rare |
| Closed consortia | Standardised metadata; Facilitates reuse and interoperability; Protects equity and research interests, so strong community support | Excludes reuse by non-members; Set-up costs; Duplication of effort; Sustainability questionable |
| Informal professional networks | Trust-based, so strong community support | Zero transparency |
| Big Data approaches | Cost-effective | Neither sensitive nor specific |


One challenge faced across all platform types is sustainability and the long-term preservation of data. This may initially seem of secondary importance in emergency situations, where rapidly sharing new data takes precedence. However, in the case of pathogens such as Ebola which emerge in sudden, often self-limiting bursts, the ability to access and reanalyse data from previous outbreaks may prove critical to testing new hypotheses. The International Organization for Standardization (ISO) does issue standards for trustworthy digital repositories, including standards that cover long-term preservation. However, very few of the existing data sharing repositories or platforms we investigated publish clear, funded strategies that guarantee that the data they hold will continue to be available should they go out of business.

A second issue that cuts across all data sharing mechanisms is the timing of sharing data. In interviews, almost all researchers said they were willing to make data quickly available if it could impact an active outbreak. However, in inter-outbreak periods, or at other times when information might be used to prepare rather than to react, many were reluctant to share the data before acceptance of the manuscript through more traditional publishing channels.

Researchers also had strong opinions about the utility of sharing different data types. The opportunity cost of cleaning, coding, anonymising, and organising data and metadata for sharing is significant, especially in outbreak situations. It is only worth undertaking if there is a high likelihood the data will be reused by others. People working in pre-clinical and vaccine research particularly claimed that individual level data are of little practical use, because there are no standardised clinical or laboratory methods that would allow for reliable interpretation. As one preclinical researcher put it: “The circumstances under which data are created are complex… developers choose how they measure.” With regard to sharing data from vaccine research, an investigator stated: “What’s important are the safety and immunogenicity profiles at a population level – so its aggregate data that you are looking at and this is provided in the manuscript. Individual patient level data is not relevant.”

Several interviewees said that, rather than spend time and resources in sharing data that has little use outside of the primary study, they would rather invest their resources in building consensus around standardised procedures, especially for preclinical and vaccine research. Similarly, they would rather research funders supported neutral testing centres where samples from different research teams could be centrally compared. They said this would go further towards generating new knowledge than sharing data created using heterogeneous methods.

4.1 Genomic and structural databases

Of the quantifiable data sharing mechanisms or platforms, those commonly used for information related to priority pathogens are the large, global databases with a very narrow disciplinary focus – principally genetic sequences and protein structures. Their success derives from an unusual conjunction of factors. The development of the International Nucleotide Sequence Database Collaboration (INSDC) provides the best example. The deposition of gene sequences in GenBank and other openly accessible databases became common only after the two funders that at the time dominated non-profit investment in genetic research (the National Institutes of Health and the Wellcome Trust) demanded it of those they funded. With other partners, they also invested very heavily in the infrastructure to make it possible. Publishers of some biomedical journals began to require that authors provide accession numbers showing they had deposited relevant sequence data in accessible databases on publication. In addition, since the vast majority of sequencing was taking place in high income countries when these norms were established, those pushing for sharing of genomic data were not focused on protecting the interests of researchers in low-income countries against those with greater access to analytic skills and computing power.

It is difficult to ascertain the exact cost of maintaining these large databases, but it is certainly substantial. In its US budget appropriation request for 2016, the National Library of Medicine requested US$190 million "to process, and provide public access to, the enormous quantities of data emanating from new NIH-funded sequencing, microarray, and small molecule screening technologies", as well as clinical trial data. The National Center for Biotechnology Information, which maintains GenBank, has the equivalent of 288 full-time employees.(3) The 2016 budget for 'scientific services' at the EMBL, which includes the European Nucleotide Archive and Protein Data Bank in Europe (PDBe), was over Euro 65 million. All are funded through long-term commitments from public bodies together with a very small number of large philanthropic research funders.

Despite the substantial investment, there are still barriers to data sharing in this field, including in public health emergencies. As Yozwiak and colleagues noted, no genetic data on Ebola was released between early August and early November 2014. This period represents the height of the West African Ebola outbreak of 2014, when several research groups were known to be carrying out genetics research.(4) Researchers feared, in part, that premature access would allow others to analyse sequences and "scoop" their own publications. In response, the Nature journal group stated that it would "encourage" authors to provide access to sequence information in public archives on submission of papers rather than on publication.(5)

In terms of pathogen-specific resources, GenBank maintains a bioinformatics database for SARS-related data. The largest repository for genetic data related to a pathogen with pandemic potential is EpiFlu, compiled by the Global Initiative on Sharing All Influenza Data (GISAID). EpiFlu gathers pre- and post-publication genetic sequence data, as well as related epidemiologic information. It is described at greater length in Box 2 on page 24.

An additional database for genetic sequences from pathogens collected in Africa is currently under development under the auspices of the Pan African Bioinformatics Network for H3Africa (H3ABioNet). While this will share infrastructural norms and standards with the INSDC resources (and human data will be held in the European Genome-Phenome Archive), its governance standards will differ. Data can remain private (to allow for quality control and cleaning) for two months, and are then shared within a consortium of African genetics researchers for a further nine months. Thereafter, they are available to other researchers with a 12-month publication embargo. In total, African researchers in the H3Africa consortium will have 23 months to publish on their data before others can.(6) Most of the research in the H3Africa consortium centres on chronic and non-communicable diseases, however. It is thus unclear whether the new resource will influence data sharing related to diseases with pandemic potential.
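The staged release schedule described above for the H3ABioNet resource can be checked with simple arithmetic. The sketch below is illustrative only: the stage labels are paraphrases of the text, and treating each stage as ending exactly at its month boundary is an assumption, not a documented rule of the consortium.

```python
# Stage lengths in months, taken from the description in the text.
STAGES = [
    ("private (quality control and cleaning)", 2),
    ("shared within the consortium", 9),
    ("available to others, publication embargoed", 12),
]


def access_stage(months_since_deposit):
    """Return the stage that applies a given number of months after deposit.

    Assumption: a stage of length n covers months [start, start + n).
    """
    elapsed = 0
    for name, length in STAGES:
        elapsed += length
        if months_since_deposit < elapsed:
            return name
    return "available to others, no embargo"


# Total head start for consortium researchers: 2 + 9 + 12 months.
total_head_start = sum(length for _, length in STAGES)
```

Under these assumptions, `total_head_start` is 23, matching the figure quoted in the text, and `access_stage(5)` returns the consortium-only stage.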


4.2 Journal publications and supplementary material

Publication in peer reviewed journals remains the most common way of "sharing" information about scientific research, if not the underlying data. Though an increasing number of journals require data underlying papers to be shared, compliance is uneven. Of the publishing groups, PLOS has the most developed policies related to data sharing. A recent analysis of trends in data sharing by the multi-disciplinary journal PLOS One showed that by 2016, nearly two thirds of authors complied fully with data sharing rules (up from just 40% two years earlier). However, just 20% of current authors deposited their own data in a publicly accessible repository such as Dryad, Figshare or Dataverse, while 60% made data available as supplementary material to the article.(7)

Table 2 on page 6 showed that publication in the peer reviewed literature is rarely rapid, even in public health emergencies. In addition, data in publications are rarely structured. Take case studies, for example – a study type which makes up 60% of all the published papers we found relating to priority pathogens. These are most often reported in journal papers with no supplementary material, and are frequently paywalled. These reports often provide early insights into the characterisation of emerging infections, and combined with geographic information, data from these studies could be reused to give rapid overviews of disease distribution. But because of the unstructured formats, these data are not easily extracted and combined, other than manually.

Supplementary material attached to journal articles can usually only be accessed from an html version of the paper. Where papers are posted in pdf format (as they are, for example, on the Zika Open platform provided by the World Health Organization), hyperlinks are often inactive and supplementary material entirely unavailable. Data are most often provided in pdf formats. Where data are provided in a downloadable form, they are rarely software agnostic, and there is little evidence that they have been subjected to validity checks. For example, we found Excel spreadsheet files containing numerical values stored as strings and other features that limit machine readability, let alone interoperability. This in turn makes it hard to reuse such data.

There remains no term in MEDLINE/PubMed metadata that indicates the existence of supplementary data, and many publishers do not include the existence of supplementary material in article metadata.(8) This renders supplementary data largely invisible to machines, and means it is not possible even for humans to restrict searches to papers that include embedded data (or even links to data), limiting their easy discoverability.

Few of the journals that publish studies in scope currently provide persistent digital object identifiers (DOIs) for data objects, other than for the paper itself. This means that e-journal archiving services such as CLOCKSS (a not-for-profit collaboration between libraries and academic publishers funded by user fees) cannot ensure that data, as well as journal articles, remain available to researchers if a publisher goes out of business, or if subscriptions to a journal are terminated. The supplementary material, which most often contains data on which the publication is based, is least likely to be assigned a DOI because, say Rosenthal and colleagues, "supplemental materials … by their nature are normally regarded as less valuable than the primary content [of a published paper]". Of the 98 papers identified that included supplementary or linked data, 58 provided separate DOIs for the data. Open access journals, for example those in the PLOS and BMC families, are most likely to identify data supplied together with a journal article persistently.

In the absence of a step-change in investment by academic publishers in standards and transparency mechanisms, data sharing models which further entrench the use of supplementary materials in journals represent a lose-lose solution.
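The spreadsheet problem noted in this section, numerical values stored as strings, is the kind of issue a simple validity check can surface before data are shared. The following is a minimal sketch under stated assumptions: the rows are hypothetical, cells are assumed to arrive as Python values after reading a spreadsheet export, and commas are assumed to be thousands separators. It is not a description of any check applied in this study.

```python
def looks_numeric(value):
    """Return True if a string parses as a number once surrounding
    whitespace and commas (assumed thousands separators) are removed."""
    if not isinstance(value, str):
        return False
    cleaned = value.strip().replace(",", "")
    if not cleaned:
        return False
    try:
        float(cleaned)
    except ValueError:
        return False
    return True


def columns_with_text_numbers(rows):
    """List columns in which at least one cell is a number stored as text."""
    flagged = []
    for column in rows[0].keys():
        if any(looks_numeric(row[column]) for row in rows):
            flagged.append(column)
    return flagged


# Hypothetical rows, as they might be read from a spreadsheet export.
rows = [
    {"site": "A", "age_years": "34", "viral_load": 1200.0},
    {"site": "B", "age_years": " 41 ", "viral_load": "1,850"},
]
flagged = columns_with_text_numbers(rows)
```

Here `flagged` lists `age_years` and `viral_load`, the two columns that a curator would need to convert to true numeric types before the file could be combined reliably with other datasets.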

4.3 Preprints

Preprints have been suggested as a way of reducing the time taken to share information, while maintaining academic publication as the principal vehicle for communicating research results. An early experiment with this strategy in the context of public health emergencies was the WHO initiative Zika Open, an on-line publication platform for papers about the virus and its impact. A total of five papers about Zika were published on the platform between February and July 2016, usually within just a few days of being submitted. One went on through the peer review process to achieve "full paper" status. That paper was posted within a day of its initial submission in May 2016. It was accepted for full publication after peer review in November 2016, and finally published in March 2017. The preprint process made the information in the paper available to researchers and responders 10 months earlier than it otherwise would have been, but only if potential users knew the paper existed. Preprints are not indexed until they have passed peer review, so they remain undiscoverable using standard publication-based search techniques.

This situation highlights the tension between speed of data sharing and quality assurance. We found no policy or paper defining adequate quality assurance in the absence of peer review, the research community norm. Similarly, none suggests any mechanisms to limit liability or reputational damage for researchers whose interim data, released in good faith in response to calls for rapid sharing in emergencies, are subsequently found to be more imperfect than would be desirable under normal circumstances. As a vehicle for communication of raw data, preprints suffer from the same limitations as other journal-based mechanisms relating to file formats, persistence etc.

Researchers interviewed in this study were dubious about the value of preprints, and distrustful of assurances by the International Committee of Medical Journal Editors that making data available through preprints would not prejudice later publication in high-impact journals.(9) In the words of one clinical researcher: “You only publish in preprints upon acceptance of your paper… and you would not make raw data available. There’s no guarantee that ICMJE member journals will uphold their commitment as you don’t ever really know why a paper is rejected.”

4.4 Institutional repositories

Some universities, pharmaceutical companies and other research institutions now provide repositories into which researchers can deposit data. These are usually funded out of core budgets of well-resourced institutions, and restrict use to staff members only. Data standards and reuse licenses vary widely. While uploading files and metadata for each dataset relies on the investigators, professional data management staff often provide intensive support in terms of data management and curation. This reduces barriers to sharing, while simultaneously increasing the likely usability of the data. However, institutional repositories also re-create many of the data hosting infrastructures provided by general purpose repositories that are institution-agnostic, while potentially duplicating curation and governance procedures used by disease-specific curated repositories.

4.5 General purpose academic repositories

General purpose academic repositories were originally developed to further research transparency, rather than increase the likely re-use of data. Their use has grown as research funders, scientific publishers and (less frequently) academic institutions have adopted policies mandating or encouraging sharing of data. In our study, we found they were mainly used to share data underlying specific papers. The major general purpose repositories are listed in Table 9, along with some of their key characteristics.

The global data ecosystem as a whole appears to be shifting towards these sorts of discipline-agnostic repositories. While most enforce the use of minimal shared standards such as Dublin Core, they provide little or no subject-specific standardisation. None specifies the sort of ontologies or metadata standards that would make data deposited by different users easily comparable, or interoperable with other data sources.

Succession plans that guarantee long-term preservation vary widely. Figshare, the repository most commonly used by authors included in this study, states only that data will be preserved "for the lifetime of the repository", and mentions agreements with academic publishers to maintain DOIs for at least 10 years. However, the service's terms of submission state "Company does not guarantee that any Content or User Submissions [i.e. data deposited in a Figshare repository] will be made available or will be continuously available on or through the Service". Although succession plans are provided for institutional users, none are available for individual data depositors. Dryad, another repository favoured by life scientists, is a member of the DataOne distributed data storage network and replicates all content through CLOCKSS, meaning that data should be available if the repository ceases to exist. Meanwhile Mendeley backs all of its data up to DANS, a data archiving service supported by the government of the Netherlands.


Table 9: General purpose repositories which are/could be used to share pre-clinical and clinical data about priority pathogens

| Repository/Platform | Data access | Curation; metadata standards; standardisation | Funding | Data types | DOI? |
|---|---|---|---|---|---|
| Dryad | Open, web-based. Reusable under Creative Commons Zero waiver. | Data QC on ingest; Dublin Core metadata standards. No further standardisation. Clear succession plan. | Fees for data deposit, and institutional membership fees. | All data associated with published article. | Yes |
| Figshare | Web-based; depositor chooses terms of license, and can restrict access, except to metadata. | Automated checks for data integrity. All metadata CC0 licenced. DataCite standards enforced. No further standardisation. No clear succession plan. | Owned by Holtzbrinck publishing empire. Free to deposit and access data | All academic data types. | Yes |
| Harvard DataVerse | Web-based; Depositor can restrict access to data, except to metadata | Minimal curation; Dublin Core metadata standards. No further standardisation. Active preservation policy, but only social science data duplicated. | Harvard university and philanthropic grants | | Yes |
| Zenodo | Web-based; Depositor chooses terms of license, and can restrict access, except to metadata. | Minimal curation; JSON metadata standards. No further standardisation. Poor preservation policy, vague succession plan. | EU funded; infrastructure maintained by CERN | All | Yes |
| GitHub | Web-based development platform; host chooses terms of license and visibility of data (public/private). They can invite collaborators, change and limit interactions. | Limited curation; User is responsible for content and appropriate licensing. Metadata information available repositories. | Start-up business. Investors: various capital funds. | Many data types, though mostly code. | No* |
| Mendeley | Web-based; Depositor can restrict access to data and metadata | Minimal curation. DataCite metadata required. No further standardisation. Anonymisation required. Clear persistence plan. | Owned by Elsevier publishing empire. Free to deposit and access data | "Scientific" datasets only, no previous DOI, no copyrighted material | Yes |

*The GitHub platform is integrated with Figshare and Zenodo, through which DOIs can be assigned.

Data sharing in public health emergencies. Wellcome/GloPID-R 21

Table 10: Disease-specific repositories, including examples of closed consortia

IDDO (launching 2018 for Ebola)
Data access: Metadata openly available. Data access through independent data access committee.
Curation; metadata standards*; standardisation: Highly curated, including standardising metadata using CDISC standards. Persistence plan under discussion.
Funding: Dependent on philanthropic/development grant funding.
Data types: Individual patient and pathogen data; disease-specific platforms.
DOI?: Under discussion

ISARIC
Data access: Data only visible to ISARIC researchers. External collaborations welcomed.
Curation; metadata standards*; standardisation: Highly curated, including standardising metadata using CDISC standards. No clear persistence plan.
Funding: Dependent on philanthropic/development grant funding.
Data types: Individual patient; currently respiratory disease-specific, with plans to expand; ongoing, real-time data updates.
DOI?: No

PREPARE
Data access: Data only visible to network researchers. External collaborations welcomed.
Curation; metadata standards*; standardisation: Highly curated. No clear persistence plan.
Funding: Dependent on grant funding.
Data types: Individual patient and pathogen data; respiratory and arbovirus projects underway.
DOI?: No

Saudi MERS Network
Data access: Data only visible to network researchers. External collaborations welcomed.
Curation; metadata standards*; standardisation: Highly curated, including standardising metadata using CDISC standards. No clear persistence plan.
Funding: Dependent on grant funding.
Data types: Individual patient; severe respiratory illness; ongoing, real-time data updates.
DOI?: No

Zika Open
Data access: Public domain under CC BY IGO 3.0 license.
Curation; metadata standards*; standardisation: No standards cited. No clear persistence plan.
Funding: Part of WHO Bulletin.
Data types: Public health information of international significance.
DOI?: For papers yes, data no

Zika Open Data Portal
Data access: Public domain.
Curation; metadata standards*; standardisation: No standards cited. No clear persistence plan.
Funding: Unclear.
Data types: Human; animal; raw; aggregate; not widely used.
DOI?: No

IDDO: Infectious Diseases Data Observatory; ISARIC: International Severe and Acute Respiratory and emerging Infection Consortium; PREPARE: Platform for European Preparedness Against (Re-)emerging Epidemics
*CDISC: Clinical Data Interchange Standards Consortium

4.6 Disease-specific curated repositories

Disease-specific repositories are the most likely to house data that are standardised across time and place and thus easily reusable. Some research funders and a growing number of journals strongly encourage the use of domain-specific repositories for shared data where they are available, though only a handful name a specific database to which data should be submitted. Some of those most relevant to pathogens with pandemic potential are listed, with their characteristics, in Table 10.

Curated repositories tend to use controlled metadata standards and ontologies. These make it easy to use data across different studies held in a repository; where the standards are adopted by a wider community, they also greatly increase the likelihood of interoperability with other data sources. An example of community standards in the clinical field is provided by the Clinical Data Interchange Standards Consortium (CDISC). These have gained traction since being adopted as the common metadata required for submitting applications for drug licensing in Japan and the United States (see Box 3 below).

Disease-specific data repositories are relatively well-established in fields that are of interest to pharmaceutical companies, and researchers in high-income countries. Most pathogens known to have pandemic potential do not fall into these categories. (Possible exceptions are influenza and SARS, as discussed in Box 2 below.) Specialist platforms for the priority pathogens discussed in this paper are dependent largely on public or philanthropic funding. But funding calls which include provisions for the relatively high infrastructure start-up costs are elusive. Information sharing ventures that arise in response to a specific outbreak are especially vulnerable to evaporation, disappearing when the funding streams that supported them change course.
For example, APEC-EINET, an information exchange for flu data which survived for a decade and was once considered a model to emulate, is now moribund. Open Zika and an F1000 equivalent launched for Ebola failed to thrive. Platforms which curate disease-specific databases for reuse by consortium members and other researchers (such as IDDO, the Alpha Network and ISARIC) may have succession plans which move the data to other hosts in the case of closure, but have yet to develop business models that would sustain them securely in the absence of philanthropic or public research funding.

Box 1: The Ebola Data Platform: Leveraging good will to set the standard for international collaboration in data sharing

The 2013-2016 West African Ebola outbreak response was the effort of a large number of local and international players. Non-government organisations, national public health agencies, military, charities and academic institutions each brought different approaches to addressing the emergency. They also brought different approaches to collecting, using and storing data on affected individuals. The result is many disparate datasets, including record volumes of clinical, laboratory and epidemiological data, scattered globally on files, hard drives and servers. Some of these data have been used for research. However, the power and utility of these data have been limited to analysis of small batches, usually from single Ebola Treatment Units or from single organisations that responded to the outbreak. There is no common data repository, which undermines the ability of the research community to use these resources effectively to address the large number of remaining knowledge gaps, improve care of Ebola survivors, and reduce the impact of future outbreaks.

A multi-disciplinary, international partnership has taken on the challenge of developing systems to aggregate, standardise and analyse these data within a strong ethical and governance framework that secures benefit to the communities of data origin. Development of the Ebola Data Platform is underway as a collaboration between West African governments, inter-government organisations, academia, health funders and non-government organisations. This novel initiative is leveraging the good will of those who contributed to outbreak response, to support the research priorities of affected communities. West African data, health, research and policy experts are involved in the harmonisation of datasets, the design of governance systems and the analysis of data. Support for the strengthening of human and technical resources in these areas is integrated into the design of the platform, managed by the Infectious Diseases Data Observatory (IDDO). The platform is planned for launch in 2018. The success of this initiative could provide a much-needed model of global collaboration and best practice in data management for wider implementation across regions at risk of emerging infections.

4.7 Closed consortia and formal professional networks

While externally-accessible disease-specific repositories exist, they remain a rarity. Most disease-specific data sharing platforms take the form of closed consortia, regional networks, or other groupings to which non-members have no access. In interviews, the reason most frequently given for favouring this model is that it helps to foster equity in international research collaborations. Researchers conducting clinical research in lower income countries are keenly aware that large disease outbreaks of international interest represent important opportunities for them to access resources, further develop research skills and advance their careers. In interviews, several echoed oft-expressed concerns that increasing pressure to share data rapidly would undermine that opportunity. They feared that other, better resourced researchers who were not simultaneously trying to care for patients in an outbreak situation could analyse and publish data before those who collected it had the chance to do so. Recent cases of publications which make use of shared data without appropriate attribution attest to the fact that their concern is not without foundation. (10–12) Closed research consortia provide a way around this fear by providing data generators with access to skills and resources that allow for wider analyses, while ensuring that all contributors are appropriately credited. Data shared within consortia are often highly curated and standardised, which greatly increases their utility by allowing for pooled analyses across time and place, as well as for geographic and demographic comparisons.

Research consortia tend to be funded on a project basis, so the long-term preservation of data collected and shared within the consortium is rarely assured.

Box 2: Influenza and SARS: can data and benefit sharing models be replicated?*

Influenza provides perhaps the most important model for sharing both data and benefits from research on a pathogen with pandemic potential. The WHO has maintained an influenza surveillance network (now known as the Global Influenza Surveillance and Response System) for 65 years. In this network of specialist laboratories, core Collaborating Centres receive specimens from 140 National Influenza Centres in over 100 countries. Their biological, and more recently genetic, investigations of these strains form the basis of the biannual seasonal flu vaccine selection, but the circulation of samples and data, between laboratories and onward to industry, was not always transparent. It was no secret among scientists that samples and sequencing data were sometimes withheld until publication of associated manuscripts in peer reviewed journals. Then, in the midst of a deadly outbreak of H5N1 in 2006, Indonesia openly refused to share H5N1 samples with the WHO network unless it received a guarantee that any resulting product would be made affordably available to the country.(14) At the same time, some countries began to grow uncomfortable about contributing genetic sequencing data to the US-funded flu database in Los Alamos, USA, which had been collecting genetic sequencing data since 1997.

The WHO responded to Indonesia's challenge by initiating negotiations which led, four years later, to the Pandemic Influenza Preparedness Framework. The framework, which is not legally binding, encourages countries to share flu samples, and commits industry to covering half the cost of the WHO surveillance system, which currently costs some US$56.5 million annually. Meanwhile, a private philanthropist with expertise in intellectual property rights worked with scientists and others to tackle the mistrust and perverse incentives that discouraged scientists from sharing flu data. That groundwork, combined with a private investment described as "a low-mid seven figure sum" in dollars, led to the establishment of the Global Initiative on Sharing All Influenza Data (GISAID). Indonesia resumed sharing data through GISAID in 2008, and in 2009 the group launched the EpiFlu database, which sought to encourage more rapid sharing of genetic sequence data than was being achieved through public access resources such as GenBank. The key difference was a data access agreement which restricted data use to registered users, required acknowledgement of data generators, and encouraged collaboration with them wherever possible.

The database is hosted in Germany and still sustained by German government funding. It now contains over 650,000 genetic sequences, over 1,000 of them classified as having human pandemic potential. Around 60% of the total are ingested into EpiFlu from International Nucleotide Sequence Database Collaboration (INSDC) databases and further curated to ensure quality. However, data from more recently collected samples are much more likely to be submitted directly to the database (93% for those collected over the last 6 months), suggesting that EpiFlu is a more trusted repository for pre-publication data. The database additionally contains searchable data of epidemiological importance relating to the subjects from whom specimens were collected. For WHO Collaborating Centres for Influenza, it has become a core resource.

The H7N9 outbreak in China in 2013 illustrates the value of the database in preventing potential pandemics. China uploaded the genomic sequences of viruses isolated from the first three human cases on the same day that it reported the outbreak to WHO. Within days, scientists in the US had used the data to synthesize the genes, allowing for rapid vaccine development. When Chinese researchers felt that American colleagues breached the spirit of the data access agreement by failing to invite collaboration on research using the sequences deposited by the Chinese lab, the matter was resolved within the framework of GISAID. The platform appears to be widely trusted by scientists in low and middle income countries, some of whom have been engaged through capacity building initiatives. Laboratories in Vietnam, Brazil, Argentina, Cambodia, Thailand, Chile, Kenya, and Morocco have all submitted their sequence data.

In the clinical sphere, advances have been made in standardising the approach to characterising emerging respiratory pathogens and in building networks to share data which can impact outbreak response. Since 2012, the International Severe Acute Respiratory and emerging Infections Consortium (ISARIC) has worked alongside WHO to coordinate preparedness-focused standardised data collection on acute respiratory illness. This network of research networks is currently active in 33 countries where efforts during the inter-epidemic period are invested in the development of standardised data collection tools, including built-in quality checks. ISARIC coordinates training to promote effective data collection in the field, so that data errors are minimised at the point of collection, thus cutting both curation costs and quality problems at the point of sharing. A rapid, outbreak-ready governance framework is in place for sharing the data collected both with researchers and public health agencies.

The ISARIC model has triggered or is linked to a number of regional initiatives such as the China Severe Acute Respiratory-tract Infection Surveillance Platform, based at the Peking University People's Hospital. The platform is active in 20 provinces and autonomous regions, providing training in standardised data and sample collection procedures, and maintaining a communication network poised for collaboration. Similarly, the Saudi MERS network has implemented the ISARIC-developed case record forms to collect data across a national network of hospitals to address priority questions in the characterisation and treatment of Middle East respiratory syndrome coronavirus. These regional initiatives enable a refined implementation of data sharing governance which protects local application of a mandate for equity. As one researcher involved put it: “Our network's position is that we are happy to collaborate with investigators outside of our Network to do additional analyses as appropriate that would be useful for public health decision making. We are not very receptive to the idea of just posting our raw data and allowing its use in an uncontrolled manner.” This regional ownership of international best practice has worked best where human, financial and technical resources are available to implement research. But funding to extend this model to less-resourced areas remains elusive.

It is unclear whether these models can be extended to other pathogens. GISAID refused to take on Ebola when requested during the West Africa outbreak, and an October 2016 review of the Pandemic Influenza Preparedness Framework convened by WHO recommended against extending it to other pathogens. The review noted that flu differs from other pathogens because:
• influenza strains are permanently in circulation, in virtually all countries
• collaboration between scientists and laboratories working on flu is well-established
• the disease is of interest to pharmaceutical firms who produce vaccine stock annually for wealthy countries.(15)

It's noteworthy that pandemic influenza shares some characteristics with SARS, which, after a rocky start, became the other poster-child for successful international data sharing and collaboration in the face of an emergency. Both diseases:
• occurred principally in countries with strong governments and public health systems
• directly threatened populations that can afford to pay for research, prevention and control.(16)

More recent outbreaks, such as Ebola, do not share these characteristics. They are sporadic, making it hard to amortise and sustain the cost of a permanent infrastructure. In addition, they principally affect citizens of low income countries with poor health systems and limited buying power. In these situations, the investments necessary to achieve successful global data and benefit-sharing mechanisms, though much talked of, have not been made.

* This section draws heavily on Elbe and Buckland Merrett, 2017. (13)

4.8 Informal professional networks, and regional hubs

The quantitative methods used in this study are not able to record data shared through informal networks that link trusted colleagues. However, this emerged in interviews as among the most common ways of sharing information. In the words of one interviewee, describing Brazilian researchers' behaviour during the Zika outbreak: “Researchers didn’t make their data available on repositories. They had it and you had to write them to get it. Researchers are not clear what ‘repository’ means. In Brazilian culture people are very happy to collaborate... but it needs to have some personal interaction. You can’t collaborate with a website... People like to share experience when they share information.”

Some attempts have recently been made to leverage the power of informal information exchange, especially at the regional level, and in relation to surveillance data rather than research data. For example, the East African Integrated Disease Surveillance Network has surfaced from minimal investment to build regional capacity in the conduct and communication of disease surveillance. The West African Network for Infectious Diseases Surveillance (WANIDS), formally launched in April 2017, does much the same in that region, using a data sharing platform managed by the West African Health Organisation. These networks are primarily supported by willing government staff on minimal budgets, and recognition of their impact by global funders has been patchy. According to one interviewee, the absence of recognition by international funders has supported “the continuation of [the Networks'] own agendas” without too much external influence. However, the strained national health budgets on which many of these initiatives rely limit their rate of development and potential for regionally-led impact. Amongst the regional surveillance networks supported by Connecting Organisations for Regional Diseases Surveillance (CORDS), “Some [networks] are better funded than others, …some are struggling for funding. Engaging in joint projects they strengthen to get more funding, but they are worried about sustainability.”

West African health ministers have pointed out that regional data sharing networks have proved more robust and functional than the previous model in which all data related to public health emergencies is supposed to be passed up to WHO in Geneva before being shared with other countries. However, the ministers observe, they continue to be side-lined in reviews of the global public health architecture. (57)

In the research context, the preference for informal exchange of data is to some extent recognised by the data access policies of many journals and funders, which allow individual investigators to manage access as long as clear instructions for requesting data are available. While this presents challenges of sustainability and inefficient accessibility, it may help to address common concerns including risks of data misinterpretation and loss of publication control, while promoting culturally appropriate collaboration with researchers in endemic countries.

4.9 Aggregation of data from public sources

A few initiatives have taken a "Big Data" approach, scraping local and social media, internet search histories and other digital sources for information that might not have been reported through official channels. The World Health Organisation's Global Outbreak Alert and Response Network makes use of the Global Public Health Intelligence Network, a data scraping service, to inform its early warning system. Although the Network is funded by Canadian taxpayers, its data are available only to fee-paying subscribers. Other systems, such as ProMed, perform a similar function but are openly available. ProMed aggregates data from formal and informal sources, and supplements it with information reported directly to the system by a network of individuals, including health care providers. ProMed is based in the United States but has several national chapters and provides information by email and over the internet in a number of languages. It has been active since 1994 and is funded by donations from individuals and philanthropic organisations, through the International Society for Infectious Diseases.

5 Desirable practice: aspirations for data sharing

Our review found that in practice, the majority of data related to pathogens with epidemic potential that are collected through surveillance systems or generated through research are not shared rapidly; often they are not shared at all. This situation is very far from the ideal envisaged by the many global health institutions and research funders that have developed data sharing policies. It does not accord, either, with the recommendations of a number of think tanks and academic observers who have been working to develop principles and standards that will maximise the utility of data sharing. Here, we briefly review those expressed ideals.

One set of principles for data sharing, known as the "FAIR" principles (for findable, accessible, interoperable and reusable), has gained considerable traction in discussion of sharing results from health-related research in general.(17) However, we also found other areas of agreement. Some of these, such as the expectation that data should be shared as rapidly as possible, were particularly pertinent to public health emergencies. Others were more general. Our analysis of over 100 institutional policies, review papers and commentaries about data sharing in the biomedical sciences yielded an extended set of principles, which we have dubbed FREE-FAIRER. These incorporate the three principles that funders of public health research in low and middle income settings espoused in their 2010 declaration on data sharing: Equitable, Ethical and Efficient (which we translate as Economically viable). Though some of the words chosen differ (largely for the sake of an acronym), they map very closely on to the principles laid out in the GloPID-R data sharing working group's March 2017 Principles for Data Sharing in Public Health Emergencies, as shown in Table 11.(18,19)

Many of these papers, and all but one of the policies, relate to data collected in research settings. The International Health Regulations, negotiated by the World Health Organization in 2005, provide a policy governing the sharing of surveillance data, which are of particular importance when outbreaks of infectious disease threaten. However, many of their data sharing provisions kick in only once a public health emergency has been declared. By definition, that's too late to use shared data for primary prevention of outbreaks of pathogens known to have pandemic potential.

Table 11 gives a brief description of the principles identified, together with a non-exhaustive list of references to the policies that require or suggest their consideration, and papers that discuss their importance. In some cases, it was difficult to classify institutional documents as either "policies" or "papers". Many institutions have issued discussion documents, guidance, statements of principle, or high level policies around data sharing which are non-binding or which include opt-outs or other flexibilities. Some ask researchers to report their plans or practices in areas such as metadata standards or licenses, without requiring that they adhere to specific criteria. Only the Bill and Melinda Gates Foundation and the National Institutes of Health mention verification mechanisms or potential sanctions for non-compliance with specific requirements to share data.

Broadly speaking, the original FAIR principles relate principally to attributes that are intrinsic to the data themselves. 'Forever' and 'Economically Viable' are related principally to the infrastructure which supports data sharing. A third group, 'Rapidly available', 'Ethical', 'Equitable' and 'Reliable', are more strongly related to the research process. Researchers interviewed for this study were most concerned about this last group of attributes. As several of the papers we reviewed note, there are trade-offs on this list of ideals. For example, the need to make data rapidly available to the research community in an outbreak may make it difficult to achieve the degree of data management and curation that would be needed to make the data immediately interoperable. Similarly, many papers discuss the tension between speed of release and data quality. However, other attributes are interdependent. Several, including rapidly available, findable, accessible and interoperable, are prerequisites for effective reuse of data in an emergency setting. And in interviews, it became apparent that unless issues of equity are effectively addressed, data will not be shared at all.

Table 11: Policies and papers discussing the data sharing principles most frequently cited as necessary for the effective and useful sharing of research data in public health emergencies

Each entry gives the principle (and GloPID-R equivalent) with its description, followed by institutional policies, recommending papers and summary notes.

Findable*
Data should be described with rich, machine-readable metadata, persistently and uniquely identified, and indexed in a searchable resource.
Institutional policies: (20–30)
Recommending papers: (8,12,17,31–47)
Summary notes: Most policies address "findability" and require data to be linked to a DOI. In practice DOIs are linked to papers, rarely to data. No policy requires standardised ontologies for metadata. No policies apply to surveillance data.

Rapidly available (Timely)
Data should be shared rapidly to maximise their potential utility in outbreak situations.
Institutional policies: (5,19,48–51)
Recommending papers: (4,9,16,19,39,41,44,52–60)
Summary notes: Few institutional policies exist specifically for research relevant to potential emergencies. Most are tied to publication models and other quality control norms that are inimical to speed. Preprints increase speed but are not discoverable until papers are published in indexed journals. Structures for rapid release of surveillance data are dysfunctional.

Ethical (Ethical)
Data are collected with appropriate consent procedures. Data sharing protects individual privacy, without undermining the rapid protection and promotion of public health.
Institutional policies: (22,23,25,61)
Recommending papers: (52,62–68)
Summary notes: Discussions relating to ethics prioritise the protection of individuals. Few address the ethics of withholding data or public health importance.

Equitable (Equitable, Fairness)
Data sharing delivers fair benefit to those who collected the data, and to the communities from which data were collected.
Institutional policies: (19,50,69)
Recommending papers: (2,4,5,12,13,15,34,39–41,44,53,53,58,59,66,70–76)
Summary notes: Some policies include aspirational statements about equity, but few provide guidance on how to operationalize them. For interviewees, especially in research-poor settings, this is the biggest obstacle to sharing. In those policies that mention it, 'equity' is often reduced to norms for citation and authorship.

Forever
Data are shared through a mechanism that uses persistent identifiers and has clear, secure funding strategies to ensure the long-term preservation of data.
Institutional policies: (20–22,25,29,30,77–80)
Recommending papers: (8,38,81–84)
Summary notes: Many policies address persistent identifiers; few address long-term preservation or sustainability of sharing platforms.

Accessible* (Accessible)
Potential data users are physically able to retrieve shared data using identifiers, after authentication as necessary.
Institutional policies: (19,21,22,24,51,85)
Recommending papers: (4,7,12,17,42,86–88)
Summary notes: Few address file formats or physical retrieval mechanisms.

Interoperable*
Data formats must be platform agnostic and metadata must use a formal, shared language that allows for exchange of information between datasets.
Institutional policies: (19,22)
Recommending papers: (12,17,32,43,46,89)
Summary notes: Discussion of interoperability is increasing, but policies addressing ontologies, languages and file formats remain rare. Awareness of these issues among researchers interviewed was low.

Reliable (Quality)
Data should be quality assured before being shared, or by immediate community review.
Institutional policies: (19)
Recommending papers: (45,53,71,89,90)
Summary notes: The onus for quality control is put on researchers; interviewees say that conflicts with demands for the rapid release of data.

Economically viable
Institutional policies: (19,25,30,91–93)
Recommending papers: (8,12,15,38,44,45,56–58,60,70,94–96)
Summary notes: Most policies address direct costs of sharing, but few consider infrastructural costs or opportunity costs. To researchers, the latter weigh most heavily. Internet-based 'big data' approaches increasingly suggested for early warning of outbreaks. New funding mechanisms suggested.

Reusable*
Data licenses allow for reuse; metadata allow for pooled analysis or other community reuse.
Institutional policies: (21,22,24,86,93,97–99)
Recommending papers: (9,12,17,32,39,42,43,45,100)
Summary notes: Many papers call for more "off the shelf" licenses, and an increasing number of policies specify licenses for reuse. Few encourage use of community-standardised metadata, or curated platforms for clinical or surveillance data.

*The original "FAIR" principles, from Wilkinson et al, 2016

6 Summary and recommended actions

A constellation of different mechanisms for data sharing is currently available. Data standards, governance and funding mechanisms vary within and between groups; each has different advantages and disadvantages from the point of view of potential data depositors and data users. Table 12 attempts to summarise the pros and cons of each data sharing mechanism with respect to each of the attributes of shared data considered by the research community to be most important in emergency settings. A "+" indicates that the mechanism supports or favours the attribute, a "-" that it does not. "+/-" indicates that it is neutral or varies between platforms within that class of mechanism. Because of the variability within a given mechanism, these ratings are necessarily somewhat subjective and reflect the general consensus of those interviewed as a part of this study.

In the case of pathogens with pandemic potential, the key purpose of data sharing is to allow faster learning about how best to prevent and contain outbreaks, and how to care for patients. To that extent, the two most important attributes are probably 'rapidly shared' and 'reusable'. The data sharing mechanism which is used in ways which best align these is closed consortia and formal networks. Curated repositories (for genomic, molecular and clinical data) greatly facilitate reuse, but are less adapted to rapid sharing. In practice, this is because to date, researchers have proven reluctant to contribute data to openly-accessible data repositories before they have published them.

Information collected in interviews suggests that this reluctance is prevalent. However, it seems especially pronounced in low and middle income countries with high burdens of infectious disease and limited resources to support research, the very settings where pandemics are most likely to arise. In these settings, data sharing mechanisms which are most 'equitable' are most likely to lead to effective sharing.
Again, closed consortia or professional networks score highly in theory and in practice as voiced by those interviewed. While data embedded as supplementary material in journals score well for reliability, on the assumption that the peer review process is an effective form of quality control, they score poorly on most other attributes of importance in potential outbreaks. Preprints – a mechanism designed to compensate for the long lead times often involved in academic publishing – score better for speed but otherwise suffer from many of the same limitations as data attached to journals, without the benefit of quality assurance or widespread indexing. The ratings in Table 12 are based on the status quo. However, these may change along with technology, community norms, or the investment decisions made by research funders. We turn, then, to the final question addressed by this study: What could researchers, responders and research funders do to support and sustain data platforms relevant to public health emergencies? To answer this, we considered the potential impact of changes in particular areas of technology, research infrastructure or governance on the FREE-FAIRER attributes that are considered important for effective sharing of data related to priority pathogens. Our analysis is shown in Table 13.

Data sharing in public health emergencies. Wellcome/GloPID-R 33

Table 12: Pros and cons of different data sharing mechanisms, relative to attributes considered important for the effective sharing of data on priority pathogens

Mechanisms, in column order: (1) Genomic and structural repositories; (2) Embedded in journal; (3) Preprints; (4) Institutional repositories; (5) General academic repositories; (6) Curated repositories; (7) Closed consortia, formal networks; (8) Informal networks; (9) Big Data

Findable: +++ - --- - ++ ------+/-
Rapidly shared: + --- ++ +/- +/- +/- ++ +++ +++
Ethical: + +/- +/- + +/- + ++ +/- +/-
Equitable: +/- +/- +/- +/- +/- ++ +++ +++ +/-
Forever: +++ --- - + + ------
Accessible: +++ -- ++ + ++ ++ -- --- +
Interoperable: +++ ------+/- - ++ + ------
Reliable: +++ +++ +/- +/- +/- +++ +++ +/- --
Economically viable: --- + + + + -- -- +++ ++
Reusable: +++ ------+++ +++ +/- -

+ indicates that the mechanism supports or favours the FREE-FAIRER attribute; - indicates that the mechanism does not support or favour the FREE-FAIRER attribute; +/- indicates neutral support for the FREE-FAIRER attribute or that support varies between platforms within that class of mechanism

With the exception of "big data" approaches, most research-relevant data must be actively shared by those who generate it. Thus, researchers obviously have an important role to play in bringing data sharing practices into line with aspirations in potential pandemic situations. Changes in their data collection instruments, data management techniques and data sharing practices (including use of secondary data for analysis) stand at the core of more effective sharing. And yet almost all of the actions open to them are dependent on infrastructures and incentives that are controlled not by individual researchers and research groups but by research funders or the wider research ecosystem. Here, we focus on actions and investments that others could take to facilitate the use of effective data sharing mechanisms by researchers. We recommend that these measures should apply for preclinical and clinical research on priority pathogens with pandemic and epidemic potential, regardless of whether or not an emergency has been formally declared. Data must be findable, interoperable and reusable from the outset, if they are to be quickly used when an emergency looms.

Table 13: Potential impact of different interventions to enhance attributes considered important for the effective sharing of data on priority pathogens

Improvement/investment in this area (columns): Metadata standards; Software standards; Governance; Strengthening and policing institutional policies; Disease networks & curated platforms

…yields benefits for this attribute (rows):

Findable: +++ ++
Rapidly shared: +++ +++ ++ +
Ethical: +++ ++
Equitable: +++ +++ ++
Forever: + + ++
Accessible: ++ + +++ +
Interoperable: +++ +++ +++
Reliable: ++ +++ ++
Economically viable: +++ + +
Reusable: +++ ++ ++ ++ +++

+ indicates the likely change that investment in this area will have on the FREE-FAIRER attribute, from + (some improvement) to +++ (strong improvement).

6.1 Metadata standards, software and indexing

Investment in community-agreed metadata standards, and in software that reinforces standards and allows for effective searching, makes data more findable and interoperable, as well as reducing the cost of data management. Pre-agreed case definitions, case report forms, research protocols and machine-readable metadata standards are all needed. Centralising the development of these standards increases the opportunity for community input and consensus. They must be findable, openly accessible and user-friendly.

These standards can and should be built into pre-designed data collection software, including mobile-phone based apps with secure, off-line data entry capability that can be rapidly developed and iterated, then pushed out to data collectors globally. Built-in error checking in such devices reduces mistakes during data entry and management, cutting the time needed for data cleaning, and leading to quicker release of reliable data.

The standards development should build on existing work. The most appropriate home for work on metadata standards is likely to be the Clinical Data Interchange Standards Consortium (CDISC) collaboration, which has already developed generic standards for clinical trials and other types of data collection, as well as standards for Ebola, influenza, virology and vaccines. Support for CDISC implementation has already been demonstrated in the pharmaceutical sector by wide-scale adoption due to the mandate from the US and Japanese drug administrations. However, making the standards accessible to a wider and less resourced research community will require investment in training and tools. The collaboration could be supported to develop standards for other priority pathogens, and contingency funding mechanisms could be used to develop and disseminate provisional data standards in the case of emerging pathogens (see details and limitations in Box 3).
While it doesn't relate directly to data sharing, we note also that investment in standardised reagents, assays and methodologies for pre-clinical and vaccine research would increase researchers' perception that their data are likely to be usefully interpretable to others, thus reducing an important barrier to sharing.
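As an illustration of the built-in error checking discussed in this section, the sketch below validates a single case record against a set of field rules at the point of entry. It is a minimal Python example under stated assumptions: the field names, permitted codes and ranges in FIELD_RULES are hypothetical and are not drawn from CDISC or any other published standard.

```python
from datetime import date

# Hypothetical field rules for an illustrative case report form. The names,
# codes and ranges are assumptions, not taken from any published standard.
FIELD_RULES = {
    "case_id": {"type": str, "required": True},
    "onset_date": {"type": date, "required": True},
    "age_years": {"type": int, "required": True, "min": 0, "max": 120},
    "temperature_c": {"type": float, "required": False, "min": 30.0, "max": 45.0},
    "outcome": {"type": str, "required": False,
                "allowed": {"recovered", "died", "unknown"}},
}


def validate_record(record, rules=FIELD_RULES, today=None):
    """Return a list of human-readable problems found in one case record."""
    today = today or date.today()
    problems = []
    for name, rule in rules.items():
        value = record.get(name)
        if value is None:
            if rule["required"]:
                problems.append(f"{name}: missing required value")
            continue
        if not isinstance(value, rule["type"]):
            problems.append(f"{name}: expected {rule['type'].__name__}")
            continue
        if "min" in rule and value < rule["min"]:
            problems.append(f"{name}: below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            problems.append(f"{name}: above maximum {rule['max']}")
        if "allowed" in rule and value not in rule["allowed"]:
            problems.append(f"{name}: not one of the permitted codes")
        if rule["type"] is date and value > today:
            problems.append(f"{name}: date is in the future")
    # Flag fields that are not part of the agreed form at all.
    for name in record:
        if name not in rules:
            problems.append(f"{name}: not defined in the form")
    return problems
```

A data collection app could call validate_record before saving each entry and prompt the collector to correct any problems it returns, so that fewer errors are left for later data cleaning.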

Box 3: Clinical Data Interchange Standards Consortium (CDISC)

CDISC produces domains, specifications and models to represent data across the clinical and non-clinical research processes. Standards exist for clinical trials in general, as well as for virology and vaccine development in general, and epidemiology standards are in development. A number of disease-specific guidance documents exist, but only one of WHO's priority pathogens is covered to date: Ebola (released in December 2016). Influenza (released in November 2014) is also available and covers much of the same data as emerging coronaviruses, which feature on the WHO list.

These standards are XML-based, but they are freely available in PDF version only, while more functional machine-readable downloads are available to paying members of the CDISC consortium. There are two limitations to this approach. Firstly, the requirement of paid membership for access to XML files is a barrier, as open standards are a sine qua non for their successful and widespread use in the context of public health emergencies. Secondly, since they are XML-based, CDISC ontologies may not interact as well with other ontologies on the semantic web as those expressed in RDF.

The standards are heavily used by the pharmaceutical industry, but have been slow to gain traction among the academic research community. This is partly because the initial training and conversion from current practice required for uptake is an investment justified only by larger research institutions with high data volumes. However, academics are beginning to join in, thanks in part to the efforts of IDDO and ISARIC, which have worked with the research community to develop data collection forms mapped to the CDISC CDASH standards for Zika, Ebola and severe acute respiratory syndromes. ISARIC describes the development and use of these agreed standards in data collection, analysis and reporting as a pre-condition for a successful platform.(101) A domain-specific platform for Ebola, which uses CDISC standards to curate data so that disparate datasets can be compared and collectively analysed, is currently under development by the Infectious Disease Data Observatory.

However, across pre-clinical, vaccine, and clinical fields, the large majority of researchers interviewed as a part of this study were not using, and often not familiar with, any specific standards for data collection or organisation. This is in part because standards are not yet available for most of the pathogens with pandemic potential. Developing standards requires a great deal of community consultation and is thus time-consuming and costly; funds for such activities are often lacking in the absence of a disease outbreak, while there is no time for such activities once an outbreak occurs.

In addition to the need for metadata and software developments, we identified a number of practical barriers to data discoverability and accessibility which could very easily be fixed within the confines of existing structures, including through indexing of preprints, and of datasets more generally. Machine-readable structured formats for case reports in both academic publications and public health reporting forms would allow automated scripts to find data related to cases, combine them with geographic, ecological, vector and other data, and map the results electronically, as soon as data are published on the internet.

Efforts to improve discoverability of datasets are currently coordinated under the auspices of the Big Data to Knowledge initiative, funded by the National Institutes of Health. Researchers from 56 research institutions, all in North America and Europe, have been working on developing DataMED, an indexing system for data which aims to do for datasets what PubMed does for research papers, allowing researchers to discover datasets across multiple repositories. The challenges are greater for datasets than for papers; while papers by definition pass through a publishing process which allows for the imposition of standardised metadata, datasets rarely do. However, much work has already been done to develop a tagging ontology based on that used so successfully for papers.(102) We recommend deepening these efforts, including by supporting the inclusion of research topics and datasets of interest in public health emergencies, and the participation in product development of researchers from lower income countries.
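To illustrate how an automated script could pick up machine-readable case reports and combine them with geographic data, the sketch below filters structured reports for one pathogen and joins them to a lookup table of places, ready for mapping. It is a minimal Python example only: the JSON layout, field names, place codes and coordinates are all invented for illustration and do not represent any existing reporting format.

```python
import json

# Hypothetical structured case reports, as they might be published alongside
# a paper or a surveillance bulletin. All values are invented.
CASE_REPORTS_JSON = """
[
  {"case_id": "c1", "pathogen": "Zika", "report_date": "2016-02-03", "place_code": "P01"},
  {"case_id": "c2", "pathogen": "Ebola", "report_date": "2016-02-04", "place_code": "P02"},
  {"case_id": "c3", "pathogen": "Zika", "report_date": "2016-02-10", "place_code": "P01"}
]
"""

# Hypothetical gazetteer linking place codes to names and coordinates.
PLACES = {
    "P01": {"name": "District A", "lat": -8.05, "lon": -34.90},
    "P02": {"name": "District B", "lat": 8.48, "lon": -13.23},
}


def cases_by_place(reports_json, places, pathogen):
    """Count reported cases of one pathogen per place, with coordinates."""
    summary = {}
    for report in json.loads(reports_json):
        if report["pathogen"] != pathogen:
            continue
        code = report["place_code"]
        place = places.get(code)
        if place is None:
            # Reports with an unknown place code cannot be mapped; skip them.
            continue
        entry = summary.setdefault(code, {
            "name": place["name"],
            "lat": place["lat"],
            "lon": place["lon"],
            "cases": 0,
            "last_report": "",
        })
        entry["cases"] += 1
        # ISO 8601 date strings sort chronologically, so max() keeps the latest.
        entry["last_report"] = max(entry["last_report"], report["report_date"])
    return summary


if __name__ == "__main__":
    for code, entry in cases_by_place(CASE_REPORTS_JSON, PLACES, "Zika").items():
        print(code, entry)
```

The same join could be extended to ecological or vector data keyed on the same place codes; this only works if case reports are published in an agreed, structured format in the first place.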

6.2 Consensus on governance

A short head behind metadata and associated data collection software comes governance. Uncertainty about terms of access and use, concerns about legal standing with respect to data ownership and privacy protection, and conflicting views about intellectual property are all regularly cited as brakes on sharing data, even among those who profess themselves willing.

A great deal of the uncertainty can be taken out of the picture by using community-agreed norms and "boiler-plate" templates for data sharing. A number of templates related to sharing data from public health surveillance have recently been compiled by Chatham House.(100) These or other templates must now be "pre-qualified" by national governments, research funders, researchers, research users and institutional review boards.

In data sharing, terms of use must always try to balance the potential gains against the possible harms, including to individuals involved in research. With pathogens capable of triggering public health emergencies, the potential costs of not sharing data quickly are great, so terms of use approved for these settings may differ from those more habitually used in oncology research or other fields. Interviewees highlighted that agreed terms of use, collaboration and accreditation are especially needed for data generated during an active outbreak, because those who are collecting the data and responding to the outbreak may have less time to analyse data. If data are shared rapidly, the analysis may be completed by others before those who are treating patients have a chance to sit down at a computer, thus fostering inequity in opportunity to use shared data.

Research funders should push for, support and participate in the discussions needed to arrive at consensus around norms and templates for patient consent as well as terms of data submission and use, specifically for pathogens with epidemic and pandemic potential.

6.3 Refining and implementing institutional policies

Most institutional data sharing policies are broadly drawn to cover many different types of research; they try to balance the burden on researchers and resources against the potential benefits of sharing data. However, data relating to pathogens that might cause pandemics cannot be bracketed with research on chronic or endemic diseases, because the cost of not sharing is potentially very much higher. It seems reasonable, then, to specify more stringent norms for sharing data relating to priority pathogens, and to enforce those norms with greater conviction.

The gap between even the more general policies and practice suggests that institutions could also do more to support data sharing simply by ensuring that existing policies are implemented, and by actively incentivising sharing. Research funders could, for example, request details of data availability in end-of-grant reports, and prioritise datasets rapidly shared through approved repositories over peer-reviewed publications when evaluating funding applications for research relating to priority pathogens. Interviews indicated that such requirements would be accepted if they were made clear from the start within the call for funding applications.

Our review of institutional data sharing policies found only a very small handful that specified any sort of accountability mechanisms, and we could find no records of funders enforcing those mechanisms.* We were unable to find records of any case where a researcher has been meaningfully sanctioned for violating principles around equity, including for re-using data collected and shared by others without adequate recognition. There is clearly room for institutions issuing data sharing policies (including but not limited to research funders) to oversee them more vigorously, including by using machine-actionable data management plans.(87)

6.4 Disease-specific networks and curated platforms

The rhetoric around data sharing favours a maximum of openness to users regardless of their institutional or geographic affiliation. In practice, however, current incentive structures in academia and public health work against data sharing except within tight networks of trusted investigators. As long as these incentives persist, investment in disease-specific networks and platforms is likely to deliver more in the way of fruitful data sharing than absolutely open access. Such networks create a zone of trust within which data generators and users from different constituencies can collaborate, and force agreement around metadata, data sharing governance and attribution, thus greatly facilitating reuse while also improving the science. Interviews showed strong support for this approach.

However, the downside of disease-specific approaches is that it is difficult to sustain interest and funding in times when there is no crisis. They may also re-invent or duplicate existing technologies, tools and governance systems. One potentially cost-effective solution is to disaggregate the different roles involved in the effective sharing of data, which can be grouped broadly into three areas: domain networks, community platforms and technological infrastructure.

• Domain networks are composed of researchers with a specialist knowledge of a particular disease area. Besides generating the data that are to be shared, the members of these networks are best placed to contribute to the development and implementation of domain-specific protocols, methods and metadata standards.

• Community platforms are developed by researchers with a shared interest in a disease context or data-type that may be relevant across many specialist areas, such as genomics or imaging. They are able to interface with domain networks and develop and manage governance policies and practices which can be shared across several specialisations. In the case of data of importance in preventing and responding to public health emergencies, a single "public health emergency" platform could support curated databases for each of the priority pathogens, sharing data management and governance expertise across all of them.

• Technological infrastructure can be provided by infrastructural organisations which support any number of community platforms, in much the same way as the privately-owned company Figshare provides the back-end for a number of repositories curated and branded for specific universities or institutions.

It is quite possible for these three functions to be performed within a single organisation. However, in some cases it may be more desirable and cost-effective for the roles to be undertaken by different configurations of partners. This would allow for domain networks to be supported flexibly in accordance with epidemiological need, while reducing duplication of effort and promoting consistency in governance across similar areas through broader and longer term platform support. At the same time, a critical mass of expertise and technology shared across many platforms would allow for economies of scale in infrastructure over long periods.

* We know of one journal, PLOS ONE, that has employed sanctions with respect to biomedical data on one occasion, contacting the authors' institution and later attaching an "expression of concern" to the paper.(88)

7 Conclusion

No existing data sharing platform that we discovered delivers on all of the elements that the research community considers desirable for sharing data effectively and rapidly in potential public health emergencies, across all pathogens of interest. For flu, which is more likely to threaten high income countries than many of the others with epidemic potential, well-resourced global data sharing efforts have been established. Similarly, some data types, notably genetic and protein structure data, are shared through globally accepted mechanisms which appear to be securely funded and adequately staffed. No similar investment has yet been made for pathogens which principally threaten tropical regions, or for clinical data. In Section 6 above, we have identified a number of investments and actions which are within the sphere of research funders, and which will lower specific barriers to the FREE-FAIRER sharing of data in public health emergencies. We note that these investments are "necessary but not sufficient" to change the data sharing landscape. Many of the hurdles to more effective sharing are political or cultural, deeply embedded in incentive structures that shape global health and the research that relates to it, and not easily altered by research funders alone. By starting with feasible and important wins, however, GloPID-R members can support willing data sharers in developing new norms and in demonstrating the truth of the existing rhetoric: sharing data prevents pandemics and saves lives.

Bibliography

1. World Health Organization. WHO Statement on Public Disclosure of Clinical Trial Results [Internet]. WHO; 2015 [cited 2017 Mar 1]. Available from: http://www.who.int/ictrp/results/reporting/en/

2. Pisani E, Aaby P, Breugelmans JG, Carr D, Groves T, Helinski M, et al. Beyond open data: realising the health benefits of sharing data. BMJ. 2016 Oct 10;i5295.

3. US National Library of Medicine. Congressional Justification FY2016 [Internet]. 2015 [cited 2017 Sep 1]. Available from: https://www.nlm.nih.gov/about/2016CJ.html#Budget_graphs

4. Yozwiak NL, Schaffner SF, Sabeti PC. Data sharing: Make outbreak research open access. Nature. 2015;518(7540):477.

5. Nature. Benefits of sharing. Nature News. 2016 Feb 11;530(7589):129.

6. Beiswanger CM, Abimiku A ’le, Carstens N, Christoffels A, de Vries J, Duncanson A, et al. Accessing Biospecimens from the H3Africa Consortium. Biopreservation and Biobanking. 2017 Apr;15(2):95–8.

7. Byrne M. Making Progress Toward Open Data: Reflections on Data Sharing at PLOS ONE [Internet]. EveryONE. 2017 [cited 2017 May 26]. Available from: http://blogs.plos.org/everyone/2017/05/08/making-progress-toward-open-data/

8. Rosenthal DS, Reich VA. Archiving supplemental materials. Information Standards Quarterly. 2010;22(3):16–21.

9. Whitty CJM, Mundel T, Farrar J, Heymann DL, Davies SC, Walport MJ. Providing incentives to share data early in health emergencies: the role of journal editors. The Lancet. 2015 Nov;386(10006):1797–8.

10. Butler D, Cyranoski D. Flu papers spark row over credit for data. Nature News. 2013 May 2;497(7447):14.

11. Callaway E. Zika-microcephaly paper sparks data-sharing confusion. Nature [Internet]. 2016 Feb 12 [cited 2016 Feb 14]; Available from: http://www.nature.com/doifinder/10.1038/nature.2016.19367

12. Brack M, Castillo T. Data Sharing for Public Health: Key Lessons from Other Sectors [Internet]. London: Chatham House; 2015 Apr [cited 2017 Mar 28]. Available from: https://www.chathamhouse.org//node/17453

13. Elbe S, Buckland-Merrett G. Data, disease and diplomacy: GISAID’s innovative contribution to global health. Global Challenges. 2017 Jan 1;1(1):33–46.

14. Sedyaningsih ER, Isfandari S, Soendoro T, Supari SF. Towards mutual trust, transparency and equity in virus sharing mechanism: the avian influenza case of Indonesia. Annals Academy of Medicine Singapore. 2008;37(6):482.

15. World Health Organization. Report of the 2016 PIP Framework Review Group [Internet]. Geneva: WHO; 2016 Oct [cited 2017 May 16]. Available from: http://apps.who.int/gb/ebwha/pdf_files/EB140/B140_16-en.pdf

16. Institute of Medicine (US) Forum on Microbial Threats. Learning from SARS: Preparing for the Next Disease Outbreak: Workshop Summary [Internet]. Knobler S, Mahmoud A, Lemon S, Mack A, Sivitz L, Oberholtzer K, editors. Washington (DC): National Academies Press (US); 2004 [cited 2017 Mar 28]. (The National Academies Collection: Reports funded by National Institutes of Health). Available from: http://www.ncbi.nlm.nih.gov/books/NBK92462/

17. Wilkinson MD, Dumontier M, Aalbersberg IjJ, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data [Internet]. 2016 Mar 15 [cited 2017 Mar 13];3. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4792175/

18. Wellcome Trust. Sharing research data to improve public health: full joint statement by funders of health research [Internet]. 2010 [cited 2014 Dec 10]. Available from: http://www.wellcome.ac.uk/About-us/Policy/Spotlight-issues/Data-sharing/Public-health-and-epidemiology/WTDV030690.htm

19. GloPID-R Data Sharing Working Group. Principles for Data Sharing in Public Health Emergencies [Internet]. 2017 [cited 2017 Apr 6]. Available from: https://figshare.com/articles/Principles_for_Data_Sharing_in_Public_Health_Emergencies/4733590

20. Cancer Research UK. Submission of a data sharing and preservation strategy [Internet]. CRUK; 2014 [cited 2017 Apr 20]. Available from: http://www.cancerresearchuk.org/funding-for-researchers/applying-for-funding/policies-that-affect-your-grant/submission-of-a-data-sharing-and-preservation-strategy

21. Bill & Melinda Gates Foundation. Open Access Policy Frequently Asked Questions [Internet]. Bill & Melinda Gates Foundation. 2017 [cited 2017 Apr 20]. Available from: http://www.gatesfoundation.org/How-We-Work/General-Information/Open-Access-Policy/Page-2#UNDERLYINGDATAGUIDELINES

22. Wellcome Trust. Data guidelines [Internet]. Wellcome Open Research. 2016 [cited 2017 Jan 3]. Available from: https://wellcomeopenresearch.org/for-authors/data-guidelines

23. National Institutes of Health. NIH Data Sharing Policy and Implementation Guidance [Internet]. NIH. 2003 [cited 2017 Apr 20]. Available from: https://grants.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm

24. Nature Scientific Data. Data Policies [Internet]. Undated [cited 2017 May 9]. Available from: https://www.nature.com/sdata/policies/data-policies

25. Medical Research Council. Medical Research Council Data Sharing Policy v 2.0 [Internet]. MRC; 2016. Available from: http://www.mrc.ac.uk/PolicyGuidance/EthicsAndGovernance/DataSharing/PolicyonDataSharingandPreservation/index.htm

26. European Centre for Disease Prevention and Control. Policy on data submission, access, and use of data within TESSy – 2015 revision [Internet]. European Centre for Disease Prevention and Control. 2015 [cited 2017 Mar 29]. Available from: http://ecdc.europa.eu/en/aboutus/what-we-do/surveillance/Pages/data-access.aspx

27. Research Councils UK. RCUK Common Principles on Data Policy [Internet]. Research Councils UK. 2015 [cited 2017 Apr 20]. Available from: http://www.rcuk.ac.uk/research/datapolicy/

28. European Research Council. Guidelines on the Implementation of Open Access to Scientific Publications and Research Data in projects supported by the European Research Council under Horizon 2020. Version 1.1 [Internet]. ERC; 2017 [cited 2017 May 3]. Available from: https://erc.europa.eu/sites/default/files/document/file/ERC%20Open%20Access%20guidelines-Version%201.1._10.04.2017.pdf

29. US Food and Drug Administration. Study Data Standards: What you need to know [Internet]. FDA; 2016 [cited 2017 Jul 24]. Available from: https://www.fda.gov/downloads/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/UCM511237.pdf

30. Netherlands Organisation for Scientific Research. Data management protocol [Internet]. NWO; 2016 [cited 2017 May 29]. Available from: https://www.nwo.nl/en/policies/open+science/data+management

31. Canham S, Ohmann C. A metadata schema for data objects in clinical research. Trials [Internet]. 2016 Dec [cited 2017 Mar 13];17(1). Available from: http://trialsjournal.biomedcentral.com/articles/10.1186/s13063-016-1686-5

32. Sansone S-A, Rocca-Serra P. Review: Interoperability standards [Internet]. 2016 [cited 2017 Sep 5]. Available from: https://figshare.com/articles/Review_Interoperability_standards/4055496

33. International Council for Science: Committee on Data for Science and Technology. Coordinating Data Standards amongst Scientific Unions [Internet]. CODATA. Undated [cited 2017 Aug 30]. Available from: http://www.codata.org/task-groups/coordinating-data-standards

34. Pisani E, AbouZahr C. Sharing health data: good intentions are not enough. Bull World Health Organ. 2010 Jun;88(6):462–6.

35. Oliva A. The semantic way of thinking [Internet]. Thoughts on Medical Informatics. 2017 [cited 2017 Jul 25]. Available from: http://aolivamd.blogspot.com/

36. Shotton D, editor. Force11 White Paper: Improving The Future of Research Communications and e-Scholarship [Internet]. 2012 [cited 2017 Mar 29]. Available from: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.678.9012&rep=rep1&type=pdf

37. Royal Society. Science as an open enterprise: open data for open science. London: The Royal Society; 2012.

38. Goldacre B, Gray J. OpenTrials: towards a collaborative open database of all available information on all clinical trials. Trials [Internet]. 2016 Dec [cited 2017 Mar 13];17(1). Available from: http://trialsjournal.biomedcentral.com/articles/10.1186/s13063-016-1290-8

39. Delaunay S, Kahn P, Tatay M, Liu J. Knowledge sharing during public health emergencies: from global call to effective implementation. Bulletin of the World Health Organization. 2016 Apr 1;94(4):236–236A.

40. Sane J, Edelstein M. Overcoming barriers to data sharing in public health: A global perspective [Internet]. London: Chatham House; 2015 Apr [cited 2017 Mar 23]. Available from: http://www.a51.nl/sites/default/files/pdf/20150417OvercomingBarriersDataSharingPublicHealthSaneEdelstein.pdf

41. Chretien J-P, Rivers CM, Johansson MA. Make Data Sharing Routine to Prepare for Public Health Emergencies. PLOS Medicine. 2016 Aug 16;13(8):e1002109.

42. Vallance P, Freeman A, Stewart M. Data Sharing as Part of the Normal Scientific Process: A View from the Pharmaceutical Industry. PLOS Medicine. 2016 Jan 5;13(1):e1001936.

43. Chan M, Kazatchkine M, Lob-Levyt J, Obaid T, Schweizer J, Sidibe M, et al. Meeting the Demand for Results and Accountability: A Call for Action on Health Data from Eight Global Health Agencies. PLoS Medicine. 2010 Jan 26;7(1):e1000223.

44. Littler K, Boon W-M, Carson G, Depoortere E, Mathewson S, Mietchen D, et al. Progress in promoting data sharing in public health emergencies. Bulletin of the World Health Organization. 2017 Apr 1;95(4):243–243.

45. Harris PA, Taylor R, Thielke R, Payne J, Gonzalez N, Conde JG. Research electronic data capture (REDCap) - A metadata-driven methodology and workflow process for providing translational research informatics support. Journal of Biomedical Informatics. 2009 Apr;42(2):377–81.

46. Franklin JD, Guidry A, Brinkley JF. A partnership approach for Electronic Data Capture in small-scale clinical trials. Journal of Biomedical Informatics. 2011 Dec;44:S103–8.

47. Medical Research Council. Good practice principles for sharing individual participant data from publicly funded clinical trials. V1.0 [Internet]. MRC; 2015 [cited 2017 May 9]. Available from: http://www.methodologyhubs.mrc.ac.uk/files/7114/3682/3831/Datasharingguidance 2015.pdf

48. Dye C, Bartolomeos K, Moorthy V, Kieny MP. Data sharing in public health emergencies: a call to researchers. Bull World Health Organ. 2016;94(3):158.

49. World Health Organization. Policy Statement on Data Sharing by the World Health Organization in the Context of Public Health Emergencies [Internet]. WHO; 2016 [cited 2017 Mar 1]. Available from: www.who.int/ihr/procedures/SPG_data_sharing.pdf

50. World Health Organization, editor. International health regulations (2005); Third edition. Third edition. Geneva, Switzerland: World Health Organization; 2016. 74 p.

51. Wellcome Trust. Statement on Data Sharing in Public Health Emergencies [Internet]. Wellcome Trust; 2016 [cited 2017 Mar 22]. Available from: https://wellcome.ac.uk/press-release/global-scientific-community-commits-sharing-data-zika

52. World Health Organization. Developing global norms for sharing data and results during public health emergencies: Statement arising from a WHO Consultation held on 1-2 September 2015 (Includes consensus statement from journals) [Internet]. WHO. 2015 [cited 2017 Mar 22]. Available from: http://www.who.int/medicines/ebola-treatment/blueprint_phe_data-share-results/en/

53. Modjarrad K, Moorthy VS, Millett P, Gsell P-S, Roth C, Kieny M-P. Developing Global Norms for Sharing Data and Results during Public Health Emergencies. PLOS Medicine. 2016 Jan 5;13(1):e1001935.

54. Mietchen D. Data sharing in public health emergencies. International Journal of Infectious Diseases. 2016;53:35–36.

55. Heymann DL, Chen L, Takemi K, Fidler DP, Tappero JW, Thomas MJ, et al. Global health security: the wider lessons from the west African Ebola virus disease epidemic. The Lancet. 2015;385(9980):1884–1901.

56. Keller M, Blench M, Tolentino H, Freifeld CC, Mandl KD, Mawudeku A, et al. Use of Unstructured Event-Based Reports for Global Infectious Disease Surveillance. Emerging Infectious Diseases. 2009 May;15(5):689–95.

57. Milinovich GJ, Williams GM, Clements AC, Hu W. Internet-based surveillance systems for monitoring emerging infectious diseases. The Lancet Infectious Diseases. 2014;14(2):160–168.

58. Dahn B, Fallah MP, Platts J, Moon S, Kimball AM. Getting pandemic prevention right. The Lancet. 2017;389(10075):1189.

59. Dahn B, Diallo A, Cooper Z, Nasidi A, Adjisomo B. West African (Monrovia) Workshop on Post-Ebola Global Reforms: The West African Perspective [Internet]. ECOWAS; 2016 [cited 2017 May 14]. Available from: https://globalhealth.harvard.edu/west-african-comminique

60. Moon S, Leigh J, Woskie L, Checchi F, Dzau V, Fallah M, et al. Post-Ebola reforms: ample analysis, inadequate action. BMJ. 2017 Jan 23;j280.

61. Medical Research Council. MRC Policy and guidance on sharing of Research Data from population and patient studies [Internet]. MRC; 2011 [cited 2017 Apr 20]. Available from: https://www.mrc.ac.uk/publications/browse/mrc-policy-and-guidance-on-sharing-of-research-data-from-population-and-patient-studies/

62. Shaw D, Elger BS. Publication ethics in public health emergencies. J Public Health (Oxf). :1–3.

63. Haug CJ. Whose Data Are They Anyway? Can a Patient Perspective Advance the Data-Sharing Debate? New England Journal of Medicine. 2017 Jun 8;376(23):2203–5.

64. Chatham House. Ethical principles. 2017.

65. Bauchner H, Golub RM, Fontanarosa PB. Data sharing: An ethical and scientific imperative. JAMA. 2016 Mar 22;315(12):1238–40.

66. Langat P, Pisartchik D, Silva D, Bernard C, Olsen K, Smith M, et al. Is There a Duty to Share? Ethics of Sharing Research Data in the Context of Public Health Emergencies. Public Health Ethics. 2011 Apr 1;4(1):4–11.

67. Parker M, Bull SJ, de Vries J, Agbenyega T, Doumbo OK, Kwiatkowski DP. Ethical Data Release in Genome-Wide Association Studies in Developing Countries. PLoS Medicine. 2009 Nov 24;6(11):e1000143.

68. Hrynaszkiewicz I, Altman D. Towards agreement on best practice for publishing raw clinical trial data. Trials. 2009;10(1):17.

69. World Health Organization. Pandemic influenza preparedness Framework [Internet]. Geneva: WHO; 2011 [cited 2017 Mar 1]. Report No.: 978 92 4 150308 2. Available from: http://www.who.int/influenza/resources/pip_framework/en/

70. Moorthy VS, Roth C, Olliaro P, Dye C, Kieny MP. Best practices for sharing information through data platforms: establishing the principles. Bulletin of the World Health Organization. 2016 Apr 1;94(4):234–234A.

71. Goldacre B, Harrison S, Mahtani K, Heneghan C. Background Briefing for WHO consultation on Data and Results Sharing During Public Health Emergencies [Internet]. Oxford: Centre for Evidence-Based Medicine, Oxford; 2015 Sep [cited 2015 Dec 14]. Available from: http://www.who.int/medicines/ebola-treatment/background_briefing_on_data_results_sharing_during_phes.pdf

72. Chatham House. Ethical principles [Internet]. Public Health Surveillance Data and Benefits Sharing. 2017 [cited 2017 May 29]. Available from: https://datasharing.chathamhouse.org/ethical-principles/

73. The Chatham House data sharing advisory group. Public Health Surveillance: A Call to Share Data [Internet]. International Association of National Public Health Institutes; 2016 [cited 2016 May 30]. Available from: http://ianphi.org/news/2016/datasharing1.html

74. Ross E. Perspectives on Data Sharing in Disease Surveillance [Internet]. London: Chatham House; 2014 Apr [cited 2017 Mar 28]. Available from: https://www.chathamhouse.org//node/13983

75. Bierer BE, Crosas M, Pierce HH. Data Authorship as an Incentive to Data Sharing. New England Journal of Medicine [Internet]. 2017 Mar 29 [cited 2017 Apr 3]; Available from: http://www.nejm.org/doi/full/10.1056/NEJMsb1616595

76. Pisani E, Whitworth J, Zaba B, Abou-Zahr C. Time for fair trade in research data. The Lancet. 2010 Mar;375(9716):703–5.

77. Dataverse. Harvard Dataverse Preservation Policy [Internet]. Dataverse Project. 2015 [cited 2017 Jun 25]. Available from: http://best-practices.dataverse.org/harvard-policies/harvard-preservation-policy.html

78. Figshare. Preservation Policies [Internet]. 2016 [cited 2017 Jun 25]. Available from: https://support.figshare.com/support/solutions/articles/6000079077-preservation-policies

79. Office for National Statistics. Protocol on Data Management, Documentation and Preservation, 1.0. United Kingdom Office for National Statistics; 2004.

80. DeFeo C. Mendeley Data awarded Data Seal of Approval [Internet]. Mendeley Blog. 2017 [cited 2017 Jul 4]. Available from: https://blog.mendeley.com/2017/06/30/mendeley-data-awarded-data-seal-of-approval/

81. Mannheimer S, Yoon A, Greenberg J, Feinstein E, Scherle R. A balancing act: The ideal and the realistic in developing Dryad’s preservation policy. First Monday [Internet]. 2014 Aug 5 [cited 2017 Jun 25];19(8). Available from: http://firstmonday.org/ojs/index.php/fm/article/view/5415

82. CLOCKSS. The CLOCKSS archive [Internet]. CLOCKSS. 2017 [cited 2017 Jun 25]. Available from: https://clockss.org/clockss/Home

83. Medical Research Council. Open research data: clinical trials and public health interventions [Internet]. MRC Policy on Data Sharing and Preservation. 2017 [cited 2017 Apr 20]. Available from: https://www.mrc.ac.uk/research/policies-and-guidance-for-researchers/open-research-data-clinical-trials-and-public-health-interventions/

84. Digital Curation Centre. Trustworthy Repositories [Internet]. Digital Curation Centre. [cited 2017 Jan 5]. Available from: http://www.dcc.ac.uk/resources/repository-audit-and-assessment/trustworthy-repositories

85. PLoS. Data Availability [Internet]. PLoS One. Undated [cited 2017 Apr 20]. Available from: http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories

86. Assante M, Candela L, Castelli D, Tani A. Are Scientific Data Repositories Coping with Research Data Publishing? Data Science Journal [Internet]. 2016 Apr 26 [cited 2017 Jun 1];15(0). Available from: http://datascience.codata.org/articles/10.5334/dsj-2016-006/

87. Simms S, Jones S, Mietchen D, Miksa T. Machine-actionable data management plans (maDMPs). Research Ideas and Outcomes. 2017 Apr 5;3:e13086.

88. Puebla I, Heber J. Data sharing in clinical research: challenges and open opportunities [Internet]. EveryONE. 2017 [cited 2017 May 26]. Available from: http://blogs.plos.org/everyone/2017/05/02/data-sharing-in-clinical-research/

89. Institute of Medicine. Enabling Rapid and Sustainable Public Health Research During Disasters: Summary of a Joint Workshop by the Institute of Medicine and the U.S. Department of Health and Human Services [Internet]. Washington, D.C.: National Academies Press; 2015 [cited 2017 Apr 10]. Available from: http://www.nap.edu/catalog/18967

90. World Health Organization. Developing global norms for sharing data and results during public health emergencies [Internet]. WHO; 2015 [cited 2016 Jun 27]. Available from: http://www.who.int/medicines/ebola-treatment/blueprint_phe_data-share-results/en/

91. Higher Education Funding Council for England, Research Councils UK, Universities UK, Wellcome Trust. Concordat on Open Research Data [Internet]. Research Councils UK; 2016 [cited 2017 May 3]. Available from: http://www.rcuk.ac.uk/documents/documents/concordatonopenresearchdata-pdf/

92. Dryad. Terms of service [Internet]. Dryad. 2016 [cited 2017 Jun 25]. Available from: http://datadryad.org/pages/policies

93. Figshare. Terms and conditions [Internet]. Figshare. 2012 [cited 2017 Jun 25]. Available from: https://figshare.com/terms

94. Velasco E, Agheneza T, Denecke K, Kirchner G, Eckmanns T. Social media and internet-based data in global systems for public health surveillance: a systematic review. Milbank Quarterly. 2014;92(1):7–33.

95. Adhanom T. Remarks delivered by Dr Tedros Adhanom Ghebreyesus to G20 [Internet]. WHO; 2017 [cited 2017 Jul 11]. Available from: http://www.who.int/dg/speeches/2017/g20-summit/en/

96. World Bank. Pandemic Emergency Financing Facility [Internet]. 2017 [cited 2017 Jul 11]. Available from: http://www.worldbank.org/en/topic/pandemics/brief/pandemic-emergency-financing-facility

97. Hrynaszkiewicz I, Busch S, Cockerill MJ. Licensing the future: report on BioMed Central’s public consultation on open data in peer-reviewed journals. BMC Research Notes. 2013;6(1):318.

98. Nature Scientific Data. Recommended Data Repositories [Internet]. Nature Scientific Data. Undated [cited 2017 Apr 20]. Available from: https://www.nature.com/sdata/policies/repositories

99. European Commission. H2020 Programme: Guidelines to the Rules on Open Access to Scientific Publications and Open Access to Research Data in Horizon 2020. V 3.2. European Commission; 2017.

100. Chatham House. Public Health Surveillance Data and Benefits Sharing: Model agreement and annexes [Internet]. Chatham House; 2017 [cited 2017 May 29]. Available from: https://datasharing.chathamhouse.org/resource/model-agreement/

101. International Severe Acute Respiratory and Emerging Infection Consortium. ISARIC Sample and Data Sharing Policy V4 [Internet]. ISARIC; 2014 [cited 2017 May 11]. Available from: https://isaric.tghn.org/articles/isaric-sample-and-data-sharing-policy-v4/

102. Ohno-Machado L, Sansone S-A, Alter G, Fore I, Grethe J, Xu H, et al. Finding useful data across multiple biomedical data repositories using DataMed. Nature Genetics. 2017 Jun;49(6):816–9.