Comment

DNA.Land is a framework to collect genomes and phenomes in the era of abundant genetic information

Creating large genome/phenome collections can require consortium-scale resources. DNA.Land is a digital biobank that collects genetic data from individuals tested by consumer genomic companies using a fraction of the resources of traditional studies.

Jie Yuan, Assaf Gordon, Daniel Speyer, Richard Aufrichtig, Dina Zielinski, Joseph Pickrell and Yaniv Erlich

NATURE GENETICS | VOL 50 | FEBRUARY 2018 | 160–165 | www.nature.com/naturegenetics
© 2018 Nature America Inc., part of Springer Nature. All rights reserved.

Elucidating the genetic basis of complex traits requires substantial quantities of genomic data¹. In the last 20 years, the field has seen an exponential decline in the cost of genomic technologies. As of today, a genotyping array costs on the order of tens of dollars, and whole-genome sequencing costs about $1,000. However, collecting genetic and phenotype data at scale is a time- and resource-consuming task that poses massive logistical and operational challenges. On top of the costs of genotyping, researchers need to advertise the study, recruit participants, obtain consent, provide DNA collection kits, track and store samples, extract DNA, and prepare the DNA library before data can be available in a digital format. Phenotyping requires further resources, even when done using online questionnaires. These operations are labor intensive and incur massive costs. For example, the US National Institutes of Health’s Precision Medicine Initiative (“All of Us”) has recently allocated $50 million for recruitment centers (“HPO”) and biobank operations that collectively proposed to recruit and handle biospecimens and basic phenotypic information from a total of ~500,000 participants. These costs translate to about $100 per participant before genotyping and without the inclusion of more advanced data collection methods, such as wearable devices (Supplementary Table 1). In Europe, the UK Biobank reported that it needed “careful configuration” of its operational chain to support the recruitment of one hundred participants per day in each of its centers²,³.

We sought to develop a cost-effective alternative for collecting genome and phenome data at scale. The past five years have witnessed the advent of large-scale direct-to-consumer (DTC) genetic services for genealogy and the satisfaction of personal curiosity, with companies such as 23andMe, AncestryDNA, FamilyTreeDNA, and MyHeritage⁴. These services provide a dense genotyping array with approximately half a million SNPs for about $69–$99 per participant. As of today, more than 8 million individuals have been tested with these services, and over 10,000 new DTC kits are purchased daily. None of these companies currently shares individual-level data with researchers, and to the best of our knowledge only 23andMe and MyHeritage collect phenotype information on disease traits. These policies limit the ability to migrate data to academic studies through collaboration with these companies. However, all of these services hold the view that the raw genetic information belongs to the tested individual and allow downloading of the genomic data in a tabulated textual format. The ability to download the raw genotypes provides researchers with an opportunity to crowdsource the raw genetic data and repurpose the data for academic studies, circumventing the cumbersome sample-processing procedures of traditional studies.

Previous efforts to crowdsource DTC genomic data using an online platform have shown mixed results. For example, OpenSNP.org offers a not-for-profit service and provides a basic mechanism for users to upload their DTC genomic data and publicly share their data, but it does not offer features such as privacy controls or the sort of Institutional Review Board (IRB) protection generally provided for those participating in research⁵. While OpenSNP serves as an important open resource for the community, its approach has yet to become a viable alternative to traditional genomic data collection. Analysis of uploading dates shows that the website attracts only one or two participants per day, and after five years of operation, it has reached only 5,000 participants. Another website for crowdsourcing DTC genomic data is GedMatch.com, which is operated by a small for-profit company. This website offers a wide repertoire of genetic genealogy tools that extend the features offered by DTC companies. By serving the genetic genealogy community, GedMatch has reached critical mass and grown a large community of hundreds of thousands of individuals in approximately five years of operation. However, the website does not focus on basic research: it neither obtains consent from users nor collects phenotypic information, and it provides minimal privacy settings, reducing its attractiveness for human genetic research by academic groups. Nonetheless, its success highlights the possibility of achieving large-scale collection of DTC data by developing a third-party service that offers added value in the form of genetic-genealogy analysis for participants.

Building upon these observations, we developed DNA.Land, a website to crowdsource genomic and phenotypic information for human genetics research. DNA.Land has two overall goals: (a) to demonstrate the potential for genotype and phenotype collection by crowdsourcing data from users of direct-to-consumer companies and (b) to promote the idea of patient-led genetic research, with controls left to the participants: for example, the choice of the degree of sharing of phenotype data, and avenues for providing feedback to researchers.

In 20 months of operation, DNA.Land has collected over 50,000 genomic datasets from DTC participants and is growing daily. Notably, this effort was accomplished by a small team in an academic environment. In this Comment, we describe the operating guidelines, the ELSI (ethical, legal and social implications) approach, and technical details of our website, while highlighting key points and lessons learned about operating a digital biobank. We hope this information can be useful for other academic efforts seeking alternatives to traditional approaches for constructing genetic databases, for start-ups that operate in the growing DTC domain, and for bioinformaticians interested in learning more about the architecture of scalable pipelines for the analysis of genetic data.

Design principles and user experience

The design and operation of DNA.Land have emphasized two principles, reciprocation and autonomy, which were highlighted by previous studies as a viable route for large-scale engagement in genomics⁶–⁸. Participants who volunteer their genomic data contribute an essential resource for advancing research. We hypothesized that providing services in return would help maintain user interest and interaction with our study and encourage participation from new users. For every piece of information requested of the user, we aim to reciprocate by displaying online reports detailing interesting information about his or her genome. In addition, we provide a “Learn More” link that explains the value of the information for science and for the user. To respect the autonomy of individuals, we give our users the ability to choose the extent of involvement in the [...]

We found that the amount of time users spent on the consent documents corresponds to their length. For example, the users spent approximately 17 seconds (s.d. = 22 seconds) on the ‘just-in-time’ consent page displayed for a breast cancer survey that contains approximately 250 words. For the trait consent, with twice as many words, the users spent 34 seconds (s.d. = 22 seconds) on average. These reading rates correspond to 15 words per second. The increased dwell time on longer consent pieces suggests that most users do not just ‘click through’ the page. However, the fast reading time indicates that the users mostly skim through the language, presumably to detect major issues, and argues against lengthy consent forms.

After the consent, participants upload their genetic data and can optionally provide minimal information about themselves. We currently accept data files from all major DTC companies: 23andMe, AncestryDNA, FamilyTreeDNA, and MyHeritage.

Once the user has logged in, the main profile page presents three primary types of reports to users: ancestry composition, relative matching, and trait [...]

[...] in April 2016, generated a massive spike in traffic, and we have since observed many participants publicly sharing their ancestry results on Facebook pages dedicated to genetic genealogy. On the other hand, we believed that users would value highly the option to download their fully imputed genome, with 39 million variants, as compared to their half-a-million array. We instead found that most users do not have the computational resources to analyze their genome, and this feature proved to be infrequently used¹⁴.

Finally, we provide tech support and engagement for our users through a dedicated member of our team. The need for this function became apparent when we were flooded with hundreds of emails after the launch, straining our ability to respond and diverting significant amounts of time from the development team. In addition, our DNA.Land Facebook page has become a place for users to report bugs and pose questions about the website, whereas our initial expectation was that it would only serve for promotional purposes and would be of minor importance. Our tech support [...]