The Development of Large-Scale De-Identified Biomedical Databases in the Age of Genomics—Principles and Challenges Fida K
Total Page:16
File Type:pdf, Size:1020Kb
Dankar et al. Human Genomics (2018) 12:19 https://doi.org/10.1186/s40246-018-0147-5 REVIEW Open Access The development of large-scale de-identified biomedical databases in the age of genomics—principles and challenges Fida K. Dankar1* , Andrey Ptitsyn2 and Samar K. Dankar3 Abstract Contemporary biomedical databases include a wide range of information types from various observational and instrumental sources. Among the most important features that unite biomedical databases across the field are high volume of information and high potential to cause damage through data corruption, loss of performance, and loss of patient privacy. Thus, issues of data governance and privacy protection are essential for the construction of data depositories for biomedical research and healthcare. In this paper, we discuss various challenges of data governance in the context of population genome projects. The various challenges along with best practices and current research efforts are discussed through the steps of data collection, storage, sharing, analysis, and knowledge dissemination. Keywords: Biomedical database, Data privacy, Data governance, Whole genome sequencing Background first, now increasingly shifting to whole exome and whole Overview genome sequencing [2, 3]). The project started by Decode Databases are both the result and the instrument of has generated one of the best resources for discovery in research. From the earliest times, assembling collections biomedical sciences and inspired development of multiple of samples and stories was essential for any research pro- populational and national genomics projects, also feeding ject. The results of research feeding back into the libraries into integrated databases. Genomics England [4], Human and collections create a positive feedback in the accumula- Longevity [5], All of US (formerly known as Precision tion of knowledge limited only by the technological plat- Medicine Initiative) [6], China’s Precision Medicine Initia- form for storage and retrieval of information. The modern tive [7], Korean Reference Genome Project [8], Saudi times did not change the principle but further emphasized Human Genome Program [9], and Qatar Genome [10] it with the advent of computers, mass information storage, programs are just a few recent examples of active and high-throughput research instrumentation. Modern large-scale projects generating enormous databases of biomedical databases may vary in size, specialization, and complex biomedical information. Large-scale population type of access but with a few exceptions are voluminous genomics projects proliferating in the second decade and include complex data from multiple sources. Argu- of the twenty-first century show enormous diversity in ably, the first integrated database of the population scale goals and strategies. The Icelandic genome program has was initiated in Iceland when Decode Genetics started in evolved from the largest population genetics study of the 1996 [1]. This new generation of integrated biomedical time and has primary objectives in advancing biomedical databases incorporates both phenotype (medical records, research. China’s Precision Medicine Initiative is one of clinical studies, etc.) and genotype (variation screening at the most ambitious programs with an aim to sequence 100 million whole human genomes by 2030. The objective * Correspondence: [email protected] is to improve disease diagnosis, develop targeted treat- 1 College of IT, UAEU, Al Ain, UAE ments, and provide better wellness regimes. Genomics Full list of author information is available at the end of the article © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Dankar et al. Human Genomics (2018) 12:19 Page 2 of 15 England is an augmented (100,000) research cohort First, genomic data is highly distinguishable. There study that implies sampling of the most common dis- is confirmation that a sequence of 30 to 80 SNPs eases and reflecting the genetic diversity of the popu- could uniquely identify an individual [13]. Genomic lation in Great Britain. The All of Us project has data is also very stable [14]. It undergoes little similar objectives and aims to collect a sufficiently changes over the lifetime of an individual and thus large cohort (1,000,000). The numbers alone have a has a long-lived value (as opposed to other bio- great ameliorating effect on statistical power of asso- medical data such as blood tests which have ex- ciation studies. Deep phenotyping and follow-up sam- piry dates). pling in All of Us are aiming to develop the new Second, genetic data provides sensitive information level of precision in diagnostic and treatment of mul- about genetic conditions and predispositions to tiple diseases. The declared aims of the Human certain diseases such as cancer, Alzheimer, and Longevity project are even more focused on a specific schizophrenia. If breached, such information can be range of age-associated diseases. To achieve its goals, stigmatizing to participants and can be used against Human Longevity plans to recruit about 1,000,000 them in employment and insurance opportunities, donors. The Saudi Human Genome Program has a even if these pre-dispositions never materialize. very different focus; it aims to develop effective Third, genetic data does not only provide methods and facilities for early diagnostics and treat- information about the sequenced individuals but ment of heritable diseases. Such goal does not require also about their ancestors and off springs. Whole the genome sequencing effort on the same scale as genome data increases our ability to predict All of Us or Genomics England. The program imple- information related to relatives’ present and future ments only a small number of whole genome sequen- health risks, which raises the question as to the cing and up to 100,000 whole exome sequencing to obligation of a consented participant towards their collect the data reflecting local genetic variation and family members (the authors in [15] describe design a microarray chip for cost-effective mass neo- privacy risks to family members of individuals who natal screening. In contrast, the national genome pro- shared their genetic data for medical research). gram in Kuwait requires complete sampling of the Finally, and most concerning, there is great fear entire population including nationals and non-citizen from the potential information hidden within residents because the principal goal, according to the genomic data [16]. As our knowledge in genomics recently adopted DNA Law [11], is to counteract ter- evolves, so will our view on the sensitivity of rorist activity by precise unequivocal identification of genomic data (in other words, it is not possible to every human being. The Qatar Genome Programme quantify the amount and sensitivity of personal (QGP) aims to integrate genome sequencing informa- information that can be derived from it). tion of all Qatari nationals with electronic medical re- cords(EMRs)andresultsofclinicalstudiesto Paper outline provide quick and precise personalized diagnostic and In this paper, we discuss various privacy and govern- treatment of diseases. The goal is to provide a solid ance challenges encountered during the construction basis for the biomedical research in the country. and deployment of population-scale sequencing pro- These biomedical databases are often viewed as a plat- jects. The various challenges are discussed through form for regional and worldwide collaborative research the stages of: projects. Both the construction of these resources and serving them to a growing research community (national 1. Initial data collection, and international) present a significant challenge toward 2. Data storage, preserving the privacy of the participants. 3. Data sharing (utilization), and 4. Dissemination of research findings to the Particularities of genomic data community. In 2008, James Watson, a co-discoverer of the double- helix DNA model, opted to release his sequenced gen- At each stage, we discuss current practices and chal- ome in a public database with the exception of his lenges, as well as contemporary research efforts, with a APOE gene (which has been associated with Alzheimer’s particular interest in data sharing for research purposes disease). However, a statistical model was later devel- [17]. We provide examples from a diversity of large- oped that inferred the missing gene with a high degree scale population sequencing projects and reflect on their of confidence [12]. This incident conveys one of many scope and data governance models. new privacy concerns that genomic data raises and that Note that the above division is simplistic as the differ- are difficult to deal with: ent stages are not mutually exclusive; however, it makes Dankar et al. Human Genomics (2018) 12:19 Page 3 of 15 for a simpler and more organized presentation of the consultation with the community through surveys, different ideas. focus groups (from different ethnic, racial,