Population Genetic Considerations for Using Biobanks As International
Total Page:16
File Type:pdf, Size:1020Kb
Carress et al. BMC Genomics (2021) 22:351 https://doi.org/10.1186/s12864-021-07618-x REVIEW Open Access Population genetic considerations for using biobanks as international resources in the pandemic era and beyond Hannah Carress1, Daniel John Lawson2 and Eran Elhaik1,3* Abstract The past years have seen the rise of genomic biobanks and mega-scale meta-analysis of genomic data, which promises to reveal the genetic underpinnings of health and disease. However, the over-representation of Europeans in genomic studies not only limits the global understanding of disease risk but also inhibits viable research into the genomic differences between carriers and patients. Whilst the community has agreed that more diverse samples are required, it is not enough to blindly increase diversity; the diversity must be quantified, compared and annotated to lead to insight. Genetic annotations from separate biobanks need to be comparable and computable and to operate without access to raw data due to privacy concerns. Comparability is key both for regular research and to allow international comparison in response to pandemics. Here, we evaluate the appropriateness of the most common genomic tools used to depict population structure in a standardized and comparable manner. The end goal is to reduce the effects of confounding and learn from genuine variation in genetic effects on phenotypes across populations, which will improve the value of biobanks (locally and internationally), increase the accuracy of association analyses and inform developmental efforts. Keywords: Bioinformatics, Population structure, Population stratification bias, Genomic medicine, Biobanks Background individuals, families, communities and populations, ne- Association studies aim to detect whether genetic vari- cessitated genomic biobanks. ants found in different individuals are associated with a The completion of the human genome allowed gen- trait or disease of interest, by comparing the DNA of in- omic biobanks to be envisioned. The International Hap- dividuals that vary in relation to the phenotypes [1]. For Map Project, practically the first international biobank example, the major-histocompatibility-complex antigen [3], facilitated the routine collection of data for genome- loci are the prototypical candidates that modulate the wide association studies (GWAS) [4]. GWAS to improve genetic susceptibility to infectious diseases. As a result, clarity soon after became the leading genetic tool for association studies aim to identify which loci may pro- phenotype-genotype investigations. Over time, GWAS vide valuable information for strategising prevention, have been used to identify associations between thou- treatment, vaccination and clinical approaches [2]. Such sands of variants for a wide variety of traits and diseases, cardinal questions striking the core differences between with mixed results. GWAS drew much criticism con- cerning their validity, error rate, interpretation, applica- tion, biological causation [5] and replication [6]. Since * Correspondence: [email protected] 1Department of Animal and Plant Sciences, University of Sheffield, Sheffield, much of this criticism was due to spurious associations UK yielded from small sample sizes with reduced power of 3Department of Biology, Lund University, Lund, Sweden association analyses, major efforts were taken to recruit Full list of author information is available at the end of the article © The Author(s). 2021 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. Carress et al. BMC Genomics (2021) 22:351 Page 2 of 19 tens of thousands of participants into studies where their remained largely inaccessible, and the company filed for biological data and prognosis were collected. These col- bankruptcy [19]. The experience of deCODE highlighted lections served as the basis for what is considered today the risks in entrusting private companies to manage gen- as a (genomic) biobank [7]. omic databases, promoting similar efforts to have at least Today, biobanks are known as massive scale datasets partial government control in the dozens of newly containing many hundreds of thousands of participants founded biobanks (reviewed in [20]), as illustrated in from specified populations. Biobanks have brought enor- Fig. 1. Moreover, as the use of biobanks is expanding be- mous power to association studies. Although it was un- yond their locality, for example, in the case of rare con- clear whether these new databases would deliver their ditions where samples need to be pooled from multiple most ambitious promises, the potential of biobanks in biobanks, the view of biobanks should be changed from enabling personalised treatment was noted before the locally-managed resources to more global resources. technology matured. It was initially expected that these These should adhere to international standards to in- databases would lead to the rapid discovery of a better crease the accuracy of association studies and the use of genetic understanding of complex disorders, allowing for biobanks [21]. personalised treatments [8]. However, it is now clear Even past the formation of biobanks, many associa- that this expectation was exaggerated [8]. For example, a tions results failed to replicate (e.g., [22]) or show a dif- comprehensive review of the genomics of hypertension ference in the effect across worldwide populations, in on its way to personalised medicine concluded that des- traits and disorders like body-mass index (BMI) [23], pite the wealth of identified genomic signals, actionable schizophrenia [24], hypertension [25] and Parkinsons’ results are lacking [9]. No new drugs for the treatment disease [26]. Although strong associations between gen- of hypertension were approved for more than two de- etic variants and a phenotype typically replicated within cades. Moreover, the tailoring of therapy to each patient the population that was studied, they may not have been has not progressed beyond considering self-reported Af- replicated elsewhere. This leads naturally to further rican ancestry and serum renin levels [9]. Another ex- questioning the value and cost-effectiveness of associ- ample is autism, the most extensively studied (40 years) ation studies and biobanks [27] – what do the associa- and heavily funded ($2.4B in NIH funding over the past tions mean, and what are they useful for? How can we ten years [10]) mental disorder with nearly three dozen decide whether the association is relevant for different biobanks [11]. Despite these major efforts at understand- individuals, particularly those of mixed origins or those ing the disorder, there is still no single genetic test for who may not know their origins? What are the consider- autism, not to mention genetic treatment [12]. These ations when designing a new biobank or merging data gloomy reports of the state of knowledge in two of the from multiple biobanks? most studied complex disorders, which typically harness We argue that understanding population structure is a massive biobanks, were not what the biobank enthusiasts key component to answering these questions and con- envisioned at the beginning of the century [8]. tributing to the usefulness of biobanks and their ability Back then, both private and government-sponsored to serve the general population [28–30]. In the following, banks began amassing tissues and data. For example, we review the current state of knowledge on the import- Generation Scotland [13] includes DNA, tissues and ance of population structure to association studies and phenotypic information from nearly 30,000 Scots [14]; biobanks and the implications to downstream analyses. the 100,000 Genomes Project sequenced the genomes of We then review biobank relevant models that describe over 100,000 NHS patients with rare diseases, aiming to population structure. We end with the challenges and understand the aetiology of their conditions from their benefits of the tools that implement these models. genomic data [15]; and the UK Biobank project se- quenced the complete genomes of over half a million in- Main text dividuals [16] with the aim of improving the prevention, Population diversity diagnosis and treatment of a wide range of diseases [17]. Human genetic variation is a significant contributor to Pending projects include the Genome Russia Project, phenotypic variation among individuals and populations, which aims to fill the gap in the mapping of human pop- with single-nucleotide polymorphisms (SNPs) being the ulations by providing the whole-genome sequences