A Study of Current Policies, Practices and Infrastructure Supporting the Sharing of Data to Prevent and Respond to Epidemic and Pandemic Threats

Total Pages: 16

File Type: PDF, Size: 1020 KB

A Study of Current Policies, Practices and Infrastructure Supporting the Sharing of Data to Prevent and Respond to Epidemic and Pandemic Threats

January 2018

Elizabeth Pisani, Amrita Ghataure, Laura Merson

Executive summary

Introduction

Discussions around sharing public health research data have been running for close to a decade, and yet when the Ebola epidemic hit West Africa in 2014, data sharing remained the exception, not the norm. In response, the GloPID-R consortium of research funders commissioned a study to take stock of current data sharing practices involving research on pathogens with pandemic potential. The study catalogued different data sharing practices, and investigated the governance and curation standards of the most frequently used platforms and mechanisms. In addition, it sought to identify specific areas of support or investment which could lead to more effective sharing of data to prevent or limit future epidemics.

Methods

The study proceeded in three phases: a search for shared data, interviews with investigators, and a review of policies and statements about data sharing. We began by searching the academic literature, trials registries and research repositories for papers and data related to 12 pathogens named by the WHO as of priority concern because of their epidemic or pandemic potential: Chikungunya, Crimean Congo Haemorrhagic Fever (CCHF), Ebola, Lassa Fever, Marburg, Middle East Respiratory Syndrome Coronavirus (MERS-CoV), Severe Acute Respiratory Syndrome (SARS), Nipah, Rift Valley Fever (RVF), Severe Fever with Thrombocytopenia Syndrome (SFTS) and Zika. For biomedical research involving these pathogens in humans, animals, in vitro and in silico, we sought to access the data underlying each paper, including through a short online survey sent to 110 corresponding authors. A total of 28 people conducting or supporting research on the pathogens of interest were interviewed for 30-90 minutes about their data sharing practices and policies. In addition, we reviewed 105 institutional policies, discussion documents and academic commentaries about standards and norms in data sharing.

Data sharing practice

Extent of sharing

We identified a total of 787 research studies related to the priority pathogens published since 2003. The majority were case studies involving a single patient. Of the remaining 319, 98 provided the underlying data in an openly accessible form, and a further 15 authors said they would make data available on request. Excluding the case reports, two thirds of all papers were therefore based on undiscoverable data. Some 161 clinical trials relating to the priority pathogens were registered in either clinicaltrials.gov or the Pan-African registry of trials; 58 of them have registered completion, but 41 of those remain unpublished. Two trialists posted links to aggregate data several years after completion; none linked to individual patient data. Just over half of all published papers gave dates for data collection. Authors who shared data published faster than those who didn't (a median time lag to publication of 18 vs. 30 months).
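The "two thirds" figure follows directly from the counts above; a minimal sketch of the arithmetic (counts copied from this summary, results rounded):

    # Counts reported above
    non_case_reports = 319   # studies that were not single-patient case reports
    openly_shared = 98       # provided underlying data in an openly accessible form
    on_request = 15          # authors offered to share data on request

    undiscoverable = non_case_reports - openly_shared - on_request
    print(undiscoverable)                                # 206
    print(round(undiscoverable / non_case_reports, 2))   # 0.65, roughly two thirds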
Open sharing mechanisms

The most commonly used mechanism for the formal sharing of data was embedding supplementary material in journal articles, followed by deposit in general-purpose academic repositories. Very few of the datasets shared were in structured, machine-readable formats: over half of the files provided as supplementary information included data in PDF format. Datasets in academic repositories were more likely to be structured, though many used proprietary software formats.

Genetic sequence and protein structure data are shared more commonly than clinical data for the priority pathogens, generally through extremely well-resourced community-designated databases such as GenBank or the Protein Data Bank. However, even these data do not appear to be shared rapidly; often, the public release of sequence data is timed to coincide with paper publication, even for pathogens with pandemic potential. Though registries do list some data repositories with a special interest in priority pathogens, uptake was limited amongst the researchers interviewed. Other formal sharing mechanisms included journal preprints and institutional repositories, together with "Big Data" approaches which collate information from public sources.

Closed sharing mechanisms

Our quantitative search strategy, based largely on publication, was biased towards data sharing mechanisms that were formal, findable and open. However, in interviews with the research community, we identified a second set of mechanisms for sharing data: closed consortia, regional collaborations and informal professional networks. We were unable to quantify the sharing of data through these mechanisms, but it is clear that for most of the priority pathogens informal or consortium sharing is currently more extensive, and faster, than sharing through formal open mechanisms.

In short:
• Most research data related to priority pathogens are not being shared for reuse through formal, discoverable means.
• Data that are shared through publication or open repositories are most often shared in formats that are not interoperable or easily reusable.
• Closed consortia and informal, trust-based networks are favoured by researchers, particularly those working in resource-constrained settings.

Table A summarises the strengths and weaknesses of some of the different sharing mechanisms. Broadly speaking, those that invest most in curation are most likely to provide long-term access to data that can most readily be reused for additional knowledge generation. However, they also tend to be the most costly. Sharing through informal networks of trusted colleagues is popular and can happen rapidly, but it lacks transparency, greatly restricts potential users, and essentially excludes machine learning.
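To illustrate the contrast between machine-actionable community databases and data locked in PDF supplements, here is a minimal sketch of fetching a sequence record programmatically from GenBank via NCBI's public E-utilities service; the accession number is a placeholder standing in for any record of interest, not one drawn from the study.

    import urllib.parse
    import urllib.request

    # NCBI E-utilities: efetch returns the record for a given accession.
    # The accession below is a placeholder; substitute any GenBank ID.
    params = urllib.parse.urlencode({
        "db": "nucleotide",
        "id": "NC_002549",   # placeholder accession (an Ebola virus reference genome)
        "rettype": "fasta",
        "retmode": "text",
    })
    url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?" + params

    with urllib.request.urlopen(url) as response:
        fasta = response.read().decode("utf-8")
    print(fasta[:200])   # header line and the first bases of the sequence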
Table A: Platforms and mechanisms used for the sharing of data about priority pathogens

Genomic/structural databases
  Advantages: well-established; discoverable; standardised metadata.
  Disadvantages: expensive; slow.

Supplementary material in journals
  Advantages: findable by humans, with some effort; public investment not vital.
  Disadvantages: not machine readable, limiting discoverability; depends on electronic access to papers, which are often paywalled; diverse, largely unusable data formats.

Journal preprints
  Advantages: faster than traditional publishing models.
  Disadvantages: not discoverable even by humans; diverse, largely unusable data formats.

Disease-specific curated repositories
  Advantages: standardised metadata; community support; greatly facilitates reuse and interoperability.
  Disadvantages: extensive set-up costs; sustainability questionable; duplication of effort.

General-purpose academic repositories
  Advantages: potentially cost-effective.
  Disadvantages: standardised metadata rare; sustainability uncertain; duplication of effort.

Institutional repositories
  Advantages: strong support for data management and curation; easy entry point for academics reluctant to share.
  Disadvantages: not easily discoverable; standardised metadata rare.

Closed consortia
  Advantages: standardised metadata; facilitates reuse and interoperability; protects equity and research interests, so strong community support.
  Disadvantages: excludes reuse by non-members; set-up costs; duplication of effort; sustainability questionable.

Informal professional networks
  Advantages: trust-based, so strong community support.
  Disadvantages: zero transparency.

Big Data approaches
  Advantages: cost-effective.
  Disadvantages: neither sensitive nor specific.

Table B: FREE-FAIRER principles and their inclusion in institutional data sharing policies

Findable
  Principle: data should be described with rich, machine-readable metadata, persistently and uniquely identified, and indexed in a searchable resource.
  Notes: most policies address "findability" and require data to be linked to a DOI. In practice DOIs are linked to papers, rarely to data. No policy requires standardised ontologies for metadata. No policies apply to surveillance data.

Rapidly available
  Principle: data should be shared rapidly to maximise their potential utility in outbreak situations.
  Notes: few institutional policies exist specifically for research relevant to potential emergencies. Most are tied to publication models and other quality control norms that are inimical to speed. Preprints increase speed but reduce findability. Structures for rapid release of surveillance data are dysfunctional.

Ethical
  Principle: data are collected with appropriate consent procedures. Data sharing protects individual privacy, without undermining the rapid protection of public health.
  Notes: discussions relating to ethics prioritise the protection of individuals. Few address the ethics of withholding data or public health importance.

Equitable
  Principle: data sharing delivers fair benefit to those who collected the data, and to the communities from which data were collected.
  Notes: some policies include aspirational statements about equity, but few provide guidance on how to operationalise them. For interviewees, especially in research-poor settings, this is the biggest obstacle to sharing. In those policies that mention it, 'equity' is often reduced to norms for citation and authorship.

Forever
  Principle: data are shared through a mechanism that uses persistent identifiers and has clear, secure funding strategies to ensure the long-term […]
  Notes: many policies address persistent identifiers; few address long-term preservation or sustainability of sharing platforms.
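Since the "Findable" principle above turns on whether a DOI resolves to machine-readable metadata for the data themselves, here is a minimal sketch of retrieving such metadata through standard DOI content negotiation; it assumes the requests library and a DataCite-registered dataset DOI, and the DOI shown is a placeholder.

    import requests

    # Placeholder DOI for a deposited dataset; substitute a real dataset DOI.
    doi = "10.5061/dryad.example"

    # doi.org supports content negotiation: asking for DataCite JSON returns
    # the registered machine-readable metadata rather than an HTML landing page.
    response = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.datacite.datacite+json"},
        timeout=30,
    )
    response.raise_for_status()

    metadata = response.json()
    print(metadata.get("titles"))      # title metadata for the dataset
    print(metadata.get("publisher"))   # the hosting repository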
Recommended publications
  • Get Laid, Get Paid, Get High, Survive
    Get Laid, Get Paid, Get High, Survive: An analytical and personal response to Elizabeth Pisani's deconstruction of HIV and AIDS. Simone LeClaire, SOC 407: Contemporary Theory, Dr. Lee Garth Vigilant, March 7, 2012.

    The Wisdom of Whores: Bureaucrats, Brothels, and the Business of AIDS explores the phenomenon of HIV and AIDS through the enduringly personal account of epidemiologist Elizabeth Pisani. Pisani's voice weaves itself energetically through what could easily become an overwhelming amount of meaningless statistics and numbers, but instead remains a compelling and eye-opening illustration of her own experience and knowledge of the global problem of HIV and AIDS, including all kinds of personal, social, and political complications linked to the issue. Thus, here is a powerful example of standpoint epistemology, an approach that, according to Dorothy Smith, "first, places sociologists where we are actually situated...and second, makes our direct embodied experience of the everyday world the primary ground of our knowledge" (1974). In other words, this first-person account is based on Pisani's own experience, and she is careful to record the context and shortcomings of that experience throughout the entirety of her account. Pisani sets the tone of her book by immediately identifying her career as "sex and drugs", and then gives a brief introduction of how she moved through a degree in Chinese, a bout of investigative journalism, and a public health education, to eventually become involved with epidemiology and specifically HIV research. The main issue that she immediately saw within the field was a gap between knowledge and communication or action.
  • Recursive Definitions Across the Footprint That a Straight-Line Program Manipulated, E.g. …
    Natural Proofs for Verification of Dynamic Heaps. P. Madhusudan, Univ. of Illinois at Urbana-Champaign (visiting MSR, Bangalore). Workshop on Making Formal Verification Scalable and Useable, CMI, Chennai, India. Joint work with Xiaokang Qiu (graduating!), Gennaro Parlato, Andrei Stefanescu and Pranav Garg.

    GOAL: BUILD RELIABLE SOFTWARE. Build systems with proven reliability and security guarantees. (Not the same as finding bugs!)

    Deductive verification with automatic theorem proving:
    • Floyd/Hoare style verification
    • User-supplied modular annotations (pre/post conditions, class invariants) and loop invariants
    • Resulting verification conditions are derived automatically for each linear program segment (using weakest-pre/strongest-post)
    Verification conditions are then proved valid using mostly automatic theorem proving (like SMT solvers); see the sketch after this entry.

    TARGET: RELIABLE SYSTEMS SOFTWARE. Systems software: operating systems, browsers, VMs, app platforms. Applications: Angry Birds, Excel, mail clients, … Several success stories:
    • Microsoft Hypervisor verification using VCC [MSR, EMIC]
    • A secure foundation for mobile apps [@Illinois, ASPLOS'13]

    A SECURE FOUNDATION FOR MOBILE APPS. In collaboration with King's systems group at Illinois. First OS architecture that provides verifiable, high-level abstractions for building mobile applications. Application state is private by default:
    • Meta-data on files controls access by apps; only an app with appropriate permissions can access files
    • Every page is encrypted before being sent to the storage service
    • Memory isolation between apps
    • When a page fault handler serves a file-backed page for a process, the file has to be opened by the same process
    • Only the current app can write to the screen buffer
    • IPC channel correctness
    • …

    VERIFICATION TOOLS ARE MATURE ENOUGH TO BUILD RELIABLE S/W. And it works!
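    As a rough illustration of the verification-condition step described in this entry, here is a minimal sketch, assuming the z3-solver Python package (the program and annotations are invented for the example): it checks the Hoare triple {x >= 0} y := x + 1 {y > 0} by asking an SMT solver whether the negated verification condition is satisfiable.

        from z3 import Int, Solver, Not, unsat

        x, y = Int("x"), Int("y")

        # Hoare triple: {x >= 0}  y := x + 1  {y > 0}
        # The triple is valid iff "x >= 0 and y == x + 1 and not (y > 0)"
        # has no solution, i.e. the negated verification condition is unsat.
        solver = Solver()
        solver.add(x >= 0, y == x + 1, Not(y > 0))

        if solver.check() == unsat:
            print("Verification condition valid: the Hoare triple holds.")
        else:
            print("Counterexample:", solver.model())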
  • Type 2 Diabetes Prediction Using Machine Learning Algorithms
    Type 2 Diabetes Prediction Using Machine Learning Algorithms. Parisa Karimi Darabi*, Mohammad Jafar Tarokh, Department of Industrial Engineering, K. N. Toosi University of Technology, Tehran, Iran. Article type: original article. Article history: received 15 Jul 2020; revised 1 Aug 2020; accepted 5 Aug 2020. *Correspondence: Parisa Karimi Darabi, Department of Industrial Engineering, K. N. Toosi University of Technology, Tehran, Iran, [email protected]. DOI: 10.29252/jorjanibiomedj.8.3.4

    Abstract. Background and Objective: Currently, diabetes is one of the leading causes of death in the world. Owing to several contributing factors, diagnosis of this disease is complex and prone to human error. This study aimed to analyze the risk of having diabetes based on laboratory information, lifestyle and family history with the help of machine learning algorithms. When the model is trained properly, people can examine their risk of having diabetes. Material and Methods: To classify patients, eight different machine learning algorithms (Logistic Regression, Nearest Neighbor, Decision Tree, Random Forest, Support Vector Machine, Naive Bayes, Neural Network and Gradient Boosting) were implemented in Python and evaluated by accuracy, sensitivity, specificity and ROC curve parameters. Results: The model based on the gradient boosting algorithm showed the best performance, with a prediction accuracy of 95.50%. Conclusion: In the future, this model can be used for diagnosing diabetes; further research can build on this study by developing models with other machine learning algorithms. Keywords: prediction, diabetes, machine learning, gradient boosting, ROC curve.

    Introduction: … people die every year from diabetes disease; by 2045, it will reach 629 million (1).
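    A minimal sketch of the kind of pipeline this abstract describes, assuming scikit-learn; the synthetic feature matrix is a placeholder for the paper's laboratory, lifestyle and family-history data, which are not shown, and only three of the eight named algorithms are included for brevity.

        from sklearn.datasets import make_classification
        from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import accuracy_score, roc_auc_score
        from sklearn.model_selection import train_test_split

        # Placeholder data standing in for laboratory/lifestyle/family-history features.
        X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

        models = {
            "logistic_regression": LogisticRegression(max_iter=1000),
            "random_forest": RandomForestClassifier(random_state=0),
            "gradient_boosting": GradientBoostingClassifier(random_state=0),
        }

        for name, model in models.items():
            model.fit(X_train, y_train)
            scores = model.predict_proba(X_test)[:, 1]
            print(name,
                  "accuracy:", accuracy_score(y_test, model.predict(X_test)),
                  "ROC AUC:", roc_auc_score(y_test, scores))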
  • Scalable Fault-Tolerant Elastic Data Ingestion in Asterixdb
    UC Irvine Electronic Theses and Dissertations. Title: Scalable Fault-Tolerant Elastic Data Ingestion in AsterixDB. Permalink: https://escholarship.org/uc/item/9xv3435x. Author: Grover, Raman. Publication date: 2015. License: https://creativecommons.org/licenses/by-nd/4.0/. Peer reviewed | Thesis/dissertation. eScholarship.org, powered by the California Digital Library, University of California.

    UNIVERSITY OF CALIFORNIA, IRVINE. Scalable Fault-Tolerant Elastic Data Ingestion in AsterixDB. Dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Information and Computer Science by Raman Grover. Dissertation committee: Professor Michael J. Carey (chair), Professor Chen Li, Professor Sharad Mehrotra. © 2015 Raman Grover. Dedication: To my wonderful parents...

    Table of contents (excerpt):
    1 Introduction
      1.1 Challenges in Data Feed Management
        1.1.1 Genericity and Extensibility
        1.1.2 Fetch-Once Compute-Many Model
        1.1.3 Scalability and Elasticity
        1.1.4 Fault Tolerance
      1.2 Contributions
      1.3 Organization
    2 Related Work
      2.1 Stream Processing Engines
      2.2 Data Routing Engines
      2.3 Extract Transform Load (ETL) Systems
      2.4 Other Systems
        2.4.1 Flume
        2.4.2 Chukwa
        2.4.3 Kafka
        2.4.4 Sqoop
      2.5 Summary
    3 Background and Preliminaries
      3.1 AsterixDB
        3.1.1 AsterixDB Architecture
        3.1.2 AsterixDB Data Model
        3.1.3 Querying Data …
  • Anomaly Detection and Identification Scheme for VM Live Migration in Cloud Infrastructure
    Future Generation Computer Systems 56 (2016) 736–745. Anomaly detection and identification scheme for VM live migration in cloud infrastructure. Tian Huang (a), Yongxin Zhu (a,*), Yafei Wu (a), Stéphane Bressan (b), Gillian Dobbie (c). (a) School of Microelectronics, Shanghai Jiao Tong University, China; (b) School of Computing, National University of Singapore, Singapore; (c) Department of Computer Science, University of Auckland, New Zealand.

    Highlights:
    • We extend Local Outlier Factors to detect anomalies in time series and adapt to smooth environmental changes.
    • We propose Dimension Reasoning LOF (DR-LOF) that can point out the "most anomalous" dimension of the performance profile. DR-LOF provides clues for administrators to pinpoint and clear the anomalies.
    • We incorporate DR-LOF and Symbolic Aggregate ApproXimation (SAX) measurement as a scheme to detect and identify the anomalies in the progress of VM live migration (see the sketch after this entry).
    • In experiments, we deploy a typical computing application on small-scale clusters and evaluate our anomaly detection and identification scheme by imposing two kinds of common anomalies onto VMs.

    Article history: received 3 March 2015; received in revised form 11 May 2015.

    Abstract: Virtual machines (VM) offer simple and practical mechanisms to address many of the manageability problems of leveraging heterogeneous computing resources. VM live migration is an important feature of virtualization in cloud computing: it allows administrators to transparently tune the performance of the computing infrastructure. However, VM live migration may open the door to security threats.
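    A minimal sketch of plain Local Outlier Factor scoring on sliding windows of a performance time series, in the spirit of (but much simpler than) the DR-LOF scheme described above; it assumes scikit-learn and NumPy, and the data are synthetic placeholders.

        import numpy as np
        from sklearn.neighbors import LocalOutlierFactor

        rng = np.random.default_rng(0)

        # Synthetic performance metric (e.g. CPU utilisation) with an injected anomaly.
        series = rng.normal(50.0, 2.0, size=300)
        series[200:205] += 25.0  # simulated disturbance during a live migration

        # Embed the series as overlapping windows so LOF sees local temporal context.
        window = 5
        windows = np.lib.stride_tricks.sliding_window_view(series, window)

        lof = LocalOutlierFactor(n_neighbors=20)
        labels = lof.fit_predict(windows)   # -1 marks outlying windows
        print(np.where(labels == -1)[0])    # indices near 200 should appear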
  • A Decidable Logic for Tree Data-Structures with Measurements
    A Decidable Logic for Tree Data-Structures with Measurements. Xiaokang Qiu and Yanjun Wang, Purdue University.

    Abstract. We present Dryaddec, a decidable logic that allows reasoning about tree data-structures with measurements. This logic supports user-defined recursive measure functions based on Max or Sum, and recursive predicates based on these measure functions, such as AVL trees or red-black trees. We prove that the logic's satisfiability is decidable. The crux of the decidability proof is a small model property which allows us to reduce the satisfiability of Dryaddec to quantifier-free linear arithmetic theory, which can be solved efficiently using SMT solvers (see the sketch after this entry). We also show that Dryaddec can encode a variety of verification and synthesis problems, including natural-proof verification conditions for functional correctness of recursive tree-manipulating programs, legality conditions for fusing tree traversals, and synthesis conditions for conditional linear-integer arithmetic functions. We developed the decision procedure and successfully solved 220+ Dryaddec formulae raised from these application scenarios, including verifying functional correctness of programs manipulating AVL trees, red-black trees and treaps, checking the fusibility of height-based mutually recursive tree traversals, and counterexample-guided synthesis from linear integer arithmetic specifications. To our knowledge, Dryaddec is the first decidable logic that can solve such a wide variety of problems requiring flexible combination of measure-related, data-related and shape-related properties for trees.

    1 Introduction. Logical reasoning about tree data-structures has been needed in various application scenarios such as program verification [32,24,26,42,49,4,14], compiler optimization [17,18,9,44] and webpage layout engines [30,31].
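    To make the reduction target concrete, here is a minimal sketch, assuming the z3-solver Python package, of solving a quantifier-free linear integer arithmetic formula with an SMT solver; the AVL-style height constraints are an invented toy, not the paper's actual encoding.

        from z3 import Ints, If, Solver, sat

        # Toy quantifier-free linear arithmetic constraints on the subtree
        # heights of a node that should satisfy the AVL balance property.
        hl, hr, h = Ints("hl hr h")

        solver = Solver()
        solver.add(hl >= 0, hr >= 0)
        solver.add(h == If(hl >= hr, hl, hr) + 1)  # height definition
        solver.add(hl - hr <= 1, hr - hl <= 1)     # AVL balance condition
        solver.add(h == 5)                         # ask for a node of height 5

        if solver.check() == sat:
            print(solver.model())  # e.g. a model with h = 5 and hl, hr in {3, 4}
        else:
            print("unsatisfiable")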
  • Data Intensive Computing with Linq to Hpc Technical
    DATA INTENSIVE COMPUTING WITH LINQ TO HPC: TECHNICAL OVERVIEW. Ade Miller, Principal Program Manager, LINQ to HPC.

    Agenda: introduction to LINQ to HPC; using LINQ to HPC; systems management; integrating with other data technologies.

    The Data Spectrum. One extreme is analyzing raw, unstructured data: the analyst does not know exactly what the data contains, nor what cube would be justified, and needs to do ad-hoc analyses that may never be run again (the LINQ to HPC case). The other extreme is analytics targeting a traditional data warehouse: the analyst knows the cube he or she wants to build, and knows the data sources (the Parallel Data Warehouse case).

    What kind of data? Large data volume: 100s of TBs to 10s of PBs of unstructured and semi-structured data with weak relational schemas (text, images, videos, logs). New questions and new insights: How popular is my product? What is the best ad to serve? Is this a fraudulent transaction? New data sources: sensors and devices, traditional applications, web servers, public data. New technologies: distributed parallel processing frameworks with MapReduce-style programming models that are easy to scale on commodity hardware.

    Overview: moving the data to the compute. So how does it work? First, store the data: data intensive computing with HPC Server (Windows HPC Server 2008 R2 Service Pack 2). Second, take the processing to the data:

        var logentries =
            from line in logs
            where !line.StartsWith("#")
            select new LogEntry(line);

        var user = from access …
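    For readers without LINQ at hand, a rough Python analogue of the query fragment above (a minimal sketch: LogEntry and the log lines are invented placeholders, and this runs on one machine rather than being distributed across an HPC cluster):

        from dataclasses import dataclass

        @dataclass
        class LogEntry:
            """Placeholder parse of one web-server log line."""
            raw: str

        logs = [
            "#Fields: date time c-ip",          # comment line, filtered out
            "2011-06-01 12:00:01 192.0.2.10",
            "2011-06-01 12:00:02 192.0.2.11",
        ]

        # Equivalent of: from line in logs where !line.StartsWith("#")
        #                select new LogEntry(line)
        logentries = [LogEntry(line) for line in logs if not line.startswith("#")]
        print(logentries)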
  • Cloud Computing Paradigms for Pleasingly Parallel Biomedical
    Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications. Thilina Gunarathne, Tak-Lon Wu, Jong Youl Choi, Seung-Hee Bae, Judy Qiu. School of Informatics and Computing / Pervasive Technology Institute, Indiana University, Bloomington. {tgunarat, taklwu, jychoi, sebae, xqiu}@indiana.edu

    Abstract. Cloud computing offers exciting new approaches for scientific computing that leverage the hardware and software investments in large-scale data centers by major commercial players. Loosely coupled problems are very important in many scientific fields and are on the rise with the ongoing move towards data-intensive computing. There exist several approaches to leverage clouds and cloud-oriented data processing frameworks to perform pleasingly parallel computations. In this paper we present three pleasingly parallel biomedical applications: 1) assembly of genome fragments, 2) sequence alignment and similarity search, and 3) dimension reduction in the analysis of chemical structures, implemented using the cloud infrastructure-service-based utility computing models of Amazon Web Services and Microsoft Windows Azure, as well as the MapReduce-based data processing frameworks Apache Hadoop and Microsoft DryadLINQ. We review and compare each of the frameworks and perform a comparative study among them based on performance, efficiency, cost and usability. The cloud-service-based utility computing model and managed parallelism (MapReduce) exhibited comparable performance and efficiencies for the applications we considered. We analyze the variations in cost between the different platform choices (e.g. EC2 instance types), highlighting the need to select the appropriate platform based on the nature of the computation.

    1. Introduction. Scientists are overwhelmed with the increasing amount of data processing needs arising from the storm of data that is flowing through virtually every field of science.
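    A minimal sketch of the "pleasingly parallel" pattern this abstract refers to: independent work items mapped over a pool of workers with no inter-task communication. The task here is an invented placeholder; the paper's implementations dispatch such tasks to cloud instances or MapReduce frameworks rather than local processes.

        from multiprocessing import Pool

        def align_fragment(fragment: str) -> int:
            """Placeholder for an independent task such as aligning one sequence."""
            return len(fragment)  # stand-in for the real computation

        if __name__ == "__main__":
            fragments = ["ACGT", "GGCTA", "TTGACA", "CATG"]
            # Pleasingly parallel: tasks are independent, so a simple map suffices.
            with Pool(processes=4) as pool:
                results = pool.map(align_fragment, fragments)
            print(results)  # [4, 5, 6, 4]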
  • ERC Document on Open Research Data and Data Management Plans
    Open Research Data and Data Management Plans: Information for ERC grantees, by the ERC Scientific Council. Version 4.0, 11 August 2021.

    This document is regularly updated in order to take into account new developments in this rapidly evolving field. Comments, corrections and suggestions should be sent to the Secretariat of the ERC Scientific Council Working Group on Open Science via the address [email protected]. The list below summarises the main changes that this document has undergone.

    History of changes:
    • Version 1.0 (23.02.2018): initial version.
    • Version 2.0 (24.04.2018): part 'Open research data and data deposition in the Physical Sciences and Engineering domain' added; minor editorial changes; faulty link corrected; contact address added.
    • Version 3.0 (23.04.2019): name of WG updated; text added to the section on 'Data deposition'; reference to FAIRsharing moved to the general part from the Life Sciences part and extended; example of the Austrian Science Fund added in the section on 'Policies of other funding organisations'; links related to the German Research Foundation and the Arts and Humanities Research Council updated; reference to the Science Europe guide added; small changes to the text on 'Image data'; reference to the Ocean Biogeographic Information System (OBIS) added; text related to BioStudies reformulated; new text in the section on 'Metadata' in the Life Sciences part; reference to openICPSR added; references to ioChem-BD and ChemSpider added …
  • Universal Health Coverage UNIVERSAL HEALTH COVERAGE
    September 2019. A collection published by The Harvard Global Health Institute and The BMJ: Universal Health Coverage.

    Editorial
    1 Delivering on the promise of universal health coverage (Ashish Jha, Fiona Godlee, Kamran Abbasi)

    Analysis
    2 Rethinking assumptions about delivery of healthcare: implications for universal health coverage (Jishnu Das, Liana Woskie, Ruma Rajbhandari, Kamran Abbasi, Ashish Jha)
    7 Climate change threatens the achievement of effective universal healthcare (Renee N Salas, Ashish K Jha)
    13 Achieving universal health coverage for mental disorders (Vikram Patel, Shekhar Saxena)
    16 Motivating provision of high quality care: it is not all about the money (Mylène Lagarde, Luis Huicho, Irene Papanicolas)
    21 Overcoming distrust to deliver universal health coverage: lessons from Ebola (Liana R Woskie, Mosoka P Fallah)
    26 Global health security and universal health coverage: from a marriage of convenience to a strategic, effective partnership (Clare Wenham, Rebecca Katz, Charles Birungi, Lisa Boden, …)
  • In the Eye of the Beholder: to Make Global Health Estimates Useful, Make Them More Socially Robust
    Global Health Action, Current Debate. In the eye of the beholder: to make global health estimates useful, make them more socially robust. Elizabeth Pisani (1*) and Maarten Kok (2). 1: The Policy Institute, King's College London, London, UK; 2: Institute for Health Policy and Management, Erasmus University, Rotterdam, Netherlands.

    A plethora of new development goals and funding institutions have greatly increased the demand for internationally comparable health estimates in recent years and have brought important new players into the field of health estimate production. These changes have rekindled debates about the validity and legitimacy of global health estimates. This paper draws on country case studies and personal experience to support our opinion that the production and use of estimates are deeply embedded in specific social, economic, political, and ideational contexts, which vary at different levels of the global health architecture. Broadly, most global health estimates tend to be made far from the localities where the data upon which they are based are collected and where the results of estimation processes must ultimately be used if they are to make a difference to the health of individuals. Internationally standardised indicators are necessary, but they are no substitute for data that meet local needs and that fit with local ideas of what is credible and useful; in other words, data that are both technically and socially robust for those who make key decisions about health. We suggest that greater engagement of local actors (and local data) in the formulation, communication, and interpretation of health estimates would increase the likelihood that these data will be used by those most able to translate them into health gains for the longer term.
  • Open Science Toolbox
    Open Science Toolbox: A Collection of Resources to Facilitate Open Science Practices. Lutz Heil, [email protected], LMU Open Science Center. April 14, 2018 (last edit: July 10, 2019).

    Contents: Preface; 1 Getting Started; 2 Resources for Researchers (2.1 Plan your Study; 2.2 Conduct your Study; 2.3 Analyse your Data; 2.4 Publish your Data, Material, and Paper); 3 Resources for Teaching; 4 Key Papers; 5 Community.

    Preface. The motivation to engage in more transparent research and the actual implementation of Open Science practices into the research workflow can be two very distinct things. Work on the motivational side has been done successfully in various contexts (see the section Key Papers). Here the focus lies rather on closing the intention-action gap. Providing an overview of handy Open Science tools and resources might facilitate the implementation of these practices and thereby foster Open Science on a practical rather than a theoretical level. There is by now a vast body of helpful tools, and without doubt all of them add value to more transparent research. But for reasons of clarity, and to save the time that would otherwise be consumed by trawling through the web, this toolbox aims at providing only a selection of links to these resources and tools. Our goal is to give a short overview of possibilities for enhancing your Open Science practices without consuming too much of your time.