comment

DNA.Land is a framework to collect genomes and phenomes in the era of abundant genetic information

Creating large genome/phenome collections can require consortium-scale resources. DNA.Land is a digital biobank that collects genetic data from individuals tested by consumer genomic companies using a fraction of the resources of traditional studies.

Jie Yuan, Assaf Gordon, Daniel Speyer, Richard Aufrichtig, Dina Zielinski, Joseph Pickrell and Yaniv Erlich

Nature Genetics | VOL 50 | FEBRUARY 2018 | 160–165 | www.nature.com/naturegenetics. © 2018 Nature America Inc., part of Springer Nature. All rights reserved.

Elucidating the genetic basis of complex traits requires substantial quantities of genomic data¹. In the last 20 years, the field has seen an exponential decline in the cost of genomic technologies. As of today, a genotyping array costs on the order of tens of dollars, and whole-genome sequencing costs about $1,000. However, collecting genetic and phenotype data at scale is a time- and resource-consuming task that poses massive logistical and operational challenges. On top of the costs of genotyping, researchers need to advertise the study, recruit participants, obtain consent, provide DNA collection kits, track and store samples, extract DNA, and prepare the DNA library before data can be available in a digital format. Phenotyping requires further resources, even when done using online questionnaires. These operations are labor intensive and incur massive costs. For example, the US National Institutes of Health’s Precision Medicine Initiative (“All of Us”) has recently allocated $50 million for recruitment centers (“HPO”) and biobank operations that collectively proposed to recruit and handle biospecimens and basic phenotypic information from a total of ~500,000 participants. These costs translate to about $100 per participant before genotyping and without the inclusion of more advanced data collection methods, such as wearable devices (Supplementary Table 1). In Europe, the UK Biobank reported that it needed “careful configuration” of its operational chain to support the recruitment of one hundred participants per day in each of its centers²,³.

We sought to develop a cost-effective alternative for collecting genome and phenome data at scale. The past five years have witnessed the advent of large-scale direct-to-consumer (DTC) genetic services for genealogy and the satisfaction of personal curiosity, with companies such as 23andMe, AncestryDNA, FamilyTreeDNA, and MyHeritage⁴. These services provide a dense genotyping array with approximately half a million SNPs for about $69–$99 per participant. As of today, more than 8 million individuals have been tested with these services, and over 10,000 new DTC kits are purchased daily. None of these companies currently shares individual-level data with researchers, and to the best of our knowledge only 23andMe and MyHeritage collect phenotype information on disease traits. These policies limit the ability to migrate data to academic studies through collaboration with these companies. However, all of these services hold the view that the raw genetic information belongs to the tested individual and allow downloading of the genomic data in a tabulated textual format. The ability to download the raw data provides researchers with an opportunity to crowdsource the raw genetic data and repurpose the data for academic studies, circumventing the cumbersome sample-processing procedures of traditional studies.

Previous efforts to crowdsource DTC genomic data using an online platform have shown mixed results. For example, OpenSNP.org offers a not-for-profit service and provides a basic mechanism for users to upload their DTC genomic data and publicly share their data, but it does not offer features such as privacy controls or the sort of Institutional Review Board (IRB) protection generally provided for those participating in research⁵. While OpenSNP serves as an important open resource for the community, its approach has yet to become a viable alternative to traditional genomic data collection. Analysis of uploading dates shows that the website attracts only one or two participants per day, and after five years of operation, it has reached only 5,000 participants. Another website for crowdsourcing DTC genomic data is GedMatch.com, which is operated by a small for-profit company. This website offers a wide repertoire of tools that extend the features offered by DTC companies. By serving the genetic genealogy community, GedMatch has reached critical mass and grown a large community of hundreds of thousands of individuals in approximately five years of operation. However, the website does not focus on basic research: it neither obtains consent from users nor collects phenotypic information, and it provides minimal privacy settings, reducing its attractiveness for human genetic research by academic groups. Nonetheless, its success highlights the possibility of achieving large-scale collection of DTC data by developing a third-party service that offers added value in the form of genetic-genealogy analysis for participants.

Building upon these observations, we developed DNA.Land, a website to crowdsource genomic and phenotypic information for human genetics research. DNA.Land has two overall goals: (a) to demonstrate the potential for genome and phenotype collection by crowdsourcing data from users of direct-to-consumer companies and (b) to promote the idea of patient-led genetic research, with controls left to the participants: for example, the choice of the degree of sharing of phenotype data, and avenues for providing feedback to researchers. In 20 months of operation, DNA.Land has collected over 50,000 genomic datasets from DTC participants and is growing daily. Notably, this effort was accomplished by a small team in an academic environment. In this Comment, we describe the operating guidelines, the ELSI (ethical, legal, and social implications) approach, and technical details of our website, while highlighting key points and lessons learned about operating a digital biobank. We hope this information can be useful for other academic efforts seeking alternatives to traditional approaches for constructing genetic databases, for start-ups that operate in the growing DTC domain, and for bioinformaticians interested in learning more about the architecture of scalable pipelines for the analysis of genetic data.

Design principles and user experience

The design and operation of DNA.Land have emphasized two principles, reciprocation and autonomy, which were highlighted by previous studies as a viable route for large-scale engagement in genomics⁶⁻⁸. Participants who volunteer their genomic data contribute an essential resource for advancing research. We hypothesized that providing services in return would help maintain user interest and interaction with our study and encourage participation from new users. For every piece of information requested of the user, we aim to reciprocate by displaying online reports detailing interesting information about his or her genome. In addition, we provide a “Learn More” link that explains the value of the information for science and for the user. To respect the autonomy of individuals, we give our users the ability to choose the extent of involvement in the website in terms of data contribution and information sharing. Lastly, security is a major concern of the website, and we discuss in the Supplementary Note measures taken to safeguard uploaded genetic data and user information.

New users start their interaction with DNA.Land with account creation and a consent form. Previous studies have shown that users rarely read website terms of service⁹, but despite that, IRBs insist on overly long consent forms¹⁰,¹¹. To address this challenge, our consent philosophy uses a ‘just-in-time’ presentation of information. Rather than enumerating all possible scenarios, as in a traditional consent form for broad research¹²,¹³, our consent form sets only the framework for the relationships between the user and the study and describes in plain language the risks and benefits of sharing genetic data. While exploring the website, users may decide to increase their involvement by answering questionnaires about health traits or contributing genealogical data. In these cases, we present additional consent forms that are geared toward the specific feature before allowing the user to contribute more data. The ‘just-in-time’ approach allows the general consent form to be only 1,500 words long, or a five-minute read at a normal pace, increasing the chance that users will read it. We share the consent language under a CC-BY-2.0 license to facilitate adoption by the community (Supplementary Note).

We found that the amount of time users spent on the consent documents corresponds to their length. For example, the users spent approximately 17 seconds (s.d. = 22 seconds) on the ‘just-in-time’ consent page displayed for a breast cancer survey that contains approximately 250 words. For the trait consent, with twice as many words, the users spent 34 seconds (s.d. = 22 seconds) on average. These reading rates correspond to 15 words per second. The increased dwell time on longer consent pieces suggests that most users do not just ‘click through’ the page. However, the fast reading time indicates that the users mostly skim through the language, presumably to detect major issues, and argues against lengthy consent forms.

After the consent, participants upload their genetic data and can optionally provide minimal information about themselves. We currently accept data files from all major DTC companies: 23andMe, AncestryDNA, FamilyTreeDNA, and MyHeritage.

Once the user has logged in, the main profile page presents three primary types of reports to users: ancestry composition, relative matching, and trait prediction (Fig. 1). On average, the ancestry reports are available after 7.1 hours (median: 4.6 hours), and the relative matching and the trait predictions are processed by batch every 12 hours, so typically users will wait a maximum of 24 hours for results.

Currently, our trait-prediction reports describe only physical and wellness features such as height and neuroticism and do not include any disease-related traits, to avoid regulatory complexities. However, we do collect questionnaires about disease traits, such as family history of breast cancer. The relative finder and trait-prediction reports are ‘opt in’ and implement a ‘just-in-time’ consent for participation. About 90% of users opted in to the relative-matching report, which, among other features, makes their username and email address publicly visible for other genetically related DNA.Land users. The trait-prediction report, having launched a year after the main site, currently has a 34% opt-in rate, likely because early users have not revisited the website to activate this feature.

Interestingly, the popularity of the reports did not match our initial expectations. We initially believed that the ancestry report would generate only secondary interest among users, as similar reports are returned by DTC services. However, the ancestry report has proven to be one of the most popular features and generates nearly as much traffic as the relative-matching report (Supplementary Fig. 1a). The launch of a more visually appealing ancestry report, in April 2016, generated a massive spike in traffic, and we have since observed many participants publicly sharing their ancestry results on Facebook pages dedicated to genetic genealogy. On the other hand, we believed that users would value highly the option to download their fully imputed genome, with 39 million variants, as compared to their half-a-million array. We instead found that most users do not have the computational resources to analyze their genome, and this feature proved to be infrequently used¹⁴.

Finally, we provide tech support and engagement for our users through a dedicated member of our team. The need for this function became apparent when we were flooded with hundreds of emails after the launch, straining our ability to respond and diverting significant amounts of time from the development team. In addition, our DNA.Land Facebook page has become a place for users to report bugs and pose questions about the website, whereas our initial expectation was that it would only serve for promotional purposes and would be of minor importance. Our tech support answers emails, responds to user comments on our Facebook page, and writes blog posts promoting DNA.Land on social media, keeping users apprised of our development efforts.

Data acquisition during the project

DNA.Land collects several forms of data from users: genome-wide genotyping data, basic demographic information about the participant and immediate family, and questionnaires about traits. With the exception of the genomic information, all other types of data are optional for participation in DNA.Land.

We launched DNA.Land in October 2015. As of July 2017, the project has collected 50,000 genomic datasets from participants (Fig. 2a,b; Supplementary Fig. 1b,c). In general, we can divide the participation rates into three phases. The launch phase in the first month saw a rapid rate of growth of nearly 8,000 genomes. Then, after the initial excitement, the rate declined to an average of 900 genomes per month. Finally, after launching the improved ancestry report in April 2016, we have seen a steady growth of nearly 2,000 new genomic datasets per month. About 45% of the users submit files from AncestryDNA, 40% from 23andMe, and 15% from FamilyTreeDNA (Fig. 2c).

We also allow users to delete their accounts at any time. Since the launch of the website, the deletion rate has remained at an average fraction of 4.9% of new user uploads. The deletions are mostly technical


Fig. 1 | The DNA.Land reports. a, Ancestry report based on a STRUCTURE-like algorithm and a specialized reference of worldwide populations. b, Trait-prediction report. Predictions are calculated from published genome-wide association study summary statistics and users’ imputed genomes. The report also displays the distribution of DNA.Land predicted scores and the effect sizes and locations of relevant SNPs. c, Relative matching is based on finding shared IBD segments and calculating the most likely genealogical relationship. Each row of the report indicates a matching user and provides statistics relevant to the match, such as degree of relatedness, total length of matching segments, the likelihood distribution on the degree of relatedness, and a display of the location of matching segments on the genome.

in cause and represent users encountering technical problems, such as uploading a truncated genome file. In addition, about 6.3% of all submitted genomes are essentially identical. These cases mostly reflect users who have been tested by more than one company and have created a separate profile for each of the resulting genome datasets.

We gather phenotypes by providing users with various questionnaires about physical and health traits (Supplementary Note; Supplementary Fig. 2a,b). Each questionnaire pertains to a single trait, and users may choose which questionnaires to complete. To facilitate participation, we limited the number of questions in each questionnaire to a maximum of 15, and most users spend less than 2 minutes completing each questionnaire (Supplementary Fig. 2c). We launched the questionnaires in October 2016, a year after DNA.Land launched, and since then about 12,000 of our users have completed at least one questionnaire. Users have answered over 275,000 questions in total, or about 3,100 questions per day since the feature’s launch. We have not discovered any significant differences between participation rates in the different questionnaires even though they sample very different traits.

We also give users the opportunity to provide detailed information about relatives, with an emphasis on identifying nuclear families. We have integrated into DNA.Land’s relative finder an option for users of Geni.com, a website for building family trees, to link their Geni accounts with those of their matching relatives on DNA.Land. Family trees built by Geni.com users have been shown to facilitate large-scale analyses of populations, such as historical migration patterns¹⁵. We also provide survey questions for users to directly identify their mother and father. Lastly, an analysis of the results of the relative-matching algorithm across all DNA.Land users shows that 7,100 profiles have at least one immediate family member. Additional information about relative-matching statistics of DNA.Land users is presented in Supplementary Fig. 3.

Analysis of the demographic data provided by users shows that the average participant is of North European ancestry in her late 40s (interquartile range: 36–63 years old) (Supplementary Fig. 4a). We see a slight over-representation of self-reported females (53%) versus males (47%). To understand the ethnic composition of our study, we analyzed the genetic ancestry of individuals and identified the leading ancestry component of each individual. While this measure may not directly correspond to how users self-identify their ancestry¹⁶, it provides


[Figure 2 image. a, new user count per week. b, cumulative new users; x-axis, date (MM/DD/YYYY), 10/07/15 to 02/22/17. c, genotype file origin (Total, FTDNA, Ancestry, 23andMe).]

Fig. 2 | The growth of DNA.Land. a, The number of new users participating in DNA.Land during its first 16 months of operation. Pink indicates the number of new user registrations; green indicates the net number of user genomes uploaded, with users who subsequently deleted their accounts subtracted; and dark blue indicates the number of users completing at least one trait-prediction questionnaire. Large spikes in new user uploads occurred during launch and after the release of an updated ancestry report in April 2016. b, Cumulative new users per week since launch. c, The bar graph corresponds to the net genomes uploaded and indicates the proportion of total genomes arriving from each currently accepted direct-to-consumer genotyping company.

a proxy for the demography of individuals represented in our data. The genetic analysis shows that the primary ancestry of 53.9% of our users is Northern European, with the next most common groups from other parts of Europe (Supplementary Table 2; Supplementary Fig. 4b,c).

Data acquisition costs

DNA.Land employs a hybrid cloud design to achieve cost-effective, scalable operation (Fig. 3a). The architecture of the project is extensively documented in the Supplementary Note, so we outline here only general details important to the operational costs. Briefly, the front end of the website operates on an Amazon Web Services (AWS) EC2 reserved instance. It provides the web interface for managing users, collecting genomic and phenotypic data, compiling surveys, and reporting relative-matching and trait-prediction results. The pipeline for the processing of genomic data (e.g., imputation and ancestry analysis) is executed on AWS spot instances, which process each genome in parallel, allowing us to scale up quickly in periods of high demand. The imputation and ancestry results are stored on AWS S3. A physical in-house server then runs the relative-matching and trait-prediction processes, which are CPU, RAM, and disk intensive. The processed results, including lists of inferred relatives, are transferred to the database on the front-end server.

The data acquisition costs of our digital approach are low and translate to a few dollars per genome. The cost of running our hybrid cloud operation is on the order of $5,000 per month (Fig. 3b), which includes computing engines, storage, and transfer costs, in addition to irregular costs for development and purchases relating to our in-house server. To keep the costs low, we have developed an automated bidding system that will bid for spot instances for up to $0.60 per hour, but we can manually decide to bid higher prices in situations of acute need, such as the days following a feature launch during which we experience an influx of new users. As of December 2016, the cumulative cost has been approximately $73.4k, or about $2 per genome-wide genotyping array, often in combination with a phenome consisting of tens of data points and/or with genealogy information. In addition, the DNA.Land team has consisted of approximately two full-time academic programmers, who are mainly required for the development of new features to collect new types of information, and a part-time position for technical support.

Discussion

We have described a scalable, software-based method to gather direct-to-consumer genotype data at low cost and low personnel requirements relative to traditional


[Figure 3 image. a, architecture diagram: DNA.Land user; front-end web server (AWS EC2); EC2 spot instances (imputation and ancestry inference); Amazon S3 (storage of genomes); Amazon SQS (queueing for spot instances and physical server); physical server (relative matching, trait prediction, and backups). b, monthly expenses in US$ by date (MM/YYYY), 10/2015 to 12/2016.]

Fig. 3 | DNA.Land operation and expenses. a, Overview of DNA.Land architecture. EC2 spot instances process uploaded genome files and perform imputation and ancestry inference. The physical server performs computation involved in relative matching and trait prediction, and performs backups of genotype data. Genotype files and output files of the impute pipeline are stored in an AWS S3 repository, and selected results are stored in a PostgreSQL database on the front-end server. User web requests are handled by the high-efficiency NGINX web server, with dynamic HTML content generated by Python code using the scalable Flask web framework. AWS SQS is used to manage assignment of new users to spot instances or processing by the physical server. b, Monthly expenses for all AWS services. EC2 services (blue) are used to process new users in the pipeline. S3 (yellow) is used to store uploaded genotype files and any output files from the imputation pipeline. Transfer costs (green) pertain to user downloads of their imputed genome files. Irregular costs (pink) indicate purchases of EC2 reserved instances, as well as the purchase of our current physical server, in February 2016.

biobanking methods. In the span of 20 months, we have managed to obtain over 50,000 genomes, many of which are paired with additional phenotypic, demographic, and family data.

We credit the success of DNA.Land to several factors. First, we achieved great momentum immediately after the launch of the project, and within the first month of operation we had collected over 8,000 genomes. We attribute this successful launch to working closely with leaders in the genetic genealogy community, who promoted the resource to their social media followers and were crucial in communicating to us the needs of the community. Indeed, less than half of our traffic comes from searches, and a substantial proportion of users come from Facebook pages mentioning DNA.Land, genetic genealogy community websites, and blog posts (Supplementary Table 3). In addition, the initial website already included several interesting features not presented in most existing DTC reports, such as a visualization of shared identical-by-descent (IBD) segments between matching relatives. Second, we invested considerable efforts into addressing user concerns on a variety of issues including the quality of our results, privacy and consent policies, and even suggestions for improving our user interface, such as making our visualizations color-blind friendly. We posit that this process, while resource intensive, has signaled to the community that we are serious partners who can be trusted with their information. Third, we placed an emphasis on scalable software. After the initial growing pains of stabilizing the website, the day-to-day operation of DNA.Land has required only minimal efforts to maintain. This has allowed our small team to focus mainly on the development of new features and reports, which further drive participation. This stands in contrast to traditional biobanking techniques that require scaling of personnel to increase sample collection efforts.

We also faced a few challenges in running DNA.Land. First, in academia, the availability of scientific software is usually welcomed, regardless of its quality, but this is not the case when providing a public website for a non-academic population. Most of our users showed little patience for technical problems with our website, and we found very quickly that we needed to operate at the highest standards of software development and quality assurance, including providing support for various browsers as well as mobile and laptop devices. We addressed this issue in part by having a development environment that enables prototyping and testing of code before the launch. In addition, we found soft launching (launching without substantial promotion) to be a more reliable path. This technique has allowed us to test new features with a smaller set of users and detect technical issues before a feature is discussed and promoted widely on the Facebook groups of our participants. In the future, we hope also to be able to launch a feature to only a small subset of users, but currently our framework does not support this option.

Second, our experience highlights the necessity of a ‘customer support’ function. We were initially overwhelmed by the amount of communication from participants, mainly on our Facebook page, which we had set up as a means to release messages to the community. We did not anticipate the necessity of a support function before the launch but found ourselves unexpectedly answering thousands of emails and Facebook posts in the first week of operation while also managing the development issues that arose during the launch. We encourage others who undertake a similar endeavor to dedicate a member of their team to answering those emails. In addition, we greatly benefited from an internal system developed by the team that makes it possible to track the status of each sample in the computational pipeline. This has allowed us to provide participants with accurate information and manage technical issues.

Third, unlike in traditional biobanking, DNA.Land can only recruit people who have already been tested by one of the DTC


companies. Not every person can participate in our biobank. Thus, our marketing needs to focus on this much smaller group rather than on the general population. We partly overcame this problem by introducing the website to leaders in the genetic genealogy community, but even this community encompasses only a fraction of the overall people who have been tested by DTC companies. In addition to creating a marketing challenge, this restriction means that the ethnic composition of our users reflects the DTC customer base and consists mostly of Northern European ancestry. We hope that as the price of DTC genotyping services continues to decrease, our website will see an increased representation of minority groups.

Finally, we learned to pay closer attention to the ‘actionability’ of the data from the user perspective. The research community usually encourages sharing of various types of raw data, but we found some challenges to that philosophy among our participants. For example, we thought that the imputation feature would be in high demand, as we generate for participants the status of 39 million variants from an array which may have only 700,000 SNPs. However, this feature met with negative feedback from many users who found that it was not clear what to do with the file and noted that it is impossible to open with standard applications such as Excel or Notepad. We have addressed this through the development of more tools, such as DNA.Land Compass¹⁴, which provides a graphical-user-interface-based website with which to browse the data and learn about each SNP.

The future for DNA.Land involves more granular consent and expansion of the ways in which we collect phenotypic information. We are developing a method for participants to share their data with other organizations using an organization-specific consent. In a first attempt, we recently partnered with the National Breast Cancer Coalition (NBCC), a patient advocacy group, to collect genotype and phenotype information for breast cancer research. We re-consent users who participate in our survey and allow them to opt in to sharing their genome with the NBCC under a specific code of conduct provided by the NBCC. Six months after the feature was launched, more than 10,000 participants have completed the survey. We aim to create more opportunities that will empower participants to decide for themselves about sharing their data. In addition, we plan to reduce the burden on our participants when collecting phenotypic information. The current procedure of answering questionnaires is cumbersome and does not scale well, as it requires participants to repeatedly visit the website. The last few years have highlighted the rise of digital phenotypes, a term referring to the quantification of phenotypes from human interactions with digital technology¹⁷. Recent studies have shown that a range of traits can be measured with data collected on web activity. These include measuring five-factor personality traits from Facebook likes¹⁸, highly accurate quantification of heart rate from videos¹⁹, and finding early signals of pancreatic cancer from Internet searches²⁰. Unlike traditional questionnaires, collection of digital phenotypes requires less labor from the participant, as it leverages existing data using the APIs of social media sites such as Facebook and allows measurement of longitudinal changes. We hope to focus on collecting such phenotypes after obtaining proper consent from our participants.

As an ultimate goal, we hope to create a digital biobank that integrates streams of data from genetic, genealogical, and social media resources. This approach will establish a complementary effort to existing large-scale traditional studies. Our data-intensive society offers growing numbers of opportunities to harness existing resources, and we envision that the value and scope of such integrative approaches will continue to rise.

URLs. DNA.Land, https://dna.land; DNA.Land general consent, https://dna.land/consent; DNA.Land breast cancer consent, https://dna.land/nbcc-consent; DNA.Land trait consent, https://dna.land/traits-consent; GitHub code, https://github.com/TeamErlich/dnaland.

Data sharing. Key software for DNA.Land is available on GitHub. We also share, upon request and with partial cost-sharing, the summary statistic genotype–phenotype data for this project for any of the traits collected. Finally, information on over 8,000 individual-level genome and breast cancer phenotypes is available through our joint project with the National Breast Cancer Coalition. Interested researchers can contact the corresponding author.

Jie Yuan¹,², Assaf Gordon¹, Daniel Speyer¹,², Richard Aufrichtig¹, Dina Zielinski¹, Joseph Pickrell¹,³ and Yaniv Erlich¹,²*
¹New York Genome Center, New York, NY, USA. ²Department of Computer Science, Fu Foundation School of Engineering, Columbia University, New York, NY, USA. ³Center for Computational Biology and Bioinformatics (C2B2), Department of Systems Biology, Columbia University, New York, NY, USA.
Jie Yuan and Assaf Gordon contributed equally to this work.
*e-mail: [email protected]

Published online: 26 January 2018
https://doi.org/10.1038/s41588-017-0021-8

References
1. Ashley, E. A. Nat. Rev. Genet. 17, 507–522 (2016).
2. Sudlow, C. et al. PLoS Med. 12, e1001779 (2015).
3. Downey, P. & Peakman, T. C. Int. J. Epidemiol. 37(Suppl. 1), i46–i50 (2008).
4. Khan, R. & Mittelman, D. Genome Biol. 14, 139 (2013).
5. Greshake, B., Bayer, P. E., Rausch, H. & Reda, J. PLoS ONE 9, e89204 (2014).
6. Erlich, Y. et al. PLoS Biol. 12, e1001983 (2014).
7. Delaney, S. K. et al. Expert. Rev. Mol. Diagn. 16, 521–532 (2016).
8. Wilbanks, J. & Friend, S. H. Nat. Biotechnol. 34, 377–379 (2016).
9. Bakos, Y., Marotta-Wurgler, F. & Trossen, D. R. J. Legal. Stud. 43, 1–35 (2014).
10. Albala, I., Doyle, M. & Appelbaum, P. S. IRB Ethics Hum. Res. 32, 3 (2010). Available at http://www.thehastingscenter.org/irb_article/the-evolution-of-consent-forms-for-research-a-quarter-century-of-changes/ (accessed 17 September 2017).
11. Klitzman, R. L. J. Empir. Res. Hum. Res. Ethics 8, 8–19 (2013).
12. Lunshof, J. E., Chadwick, R., Vorhaus, D. B. & Church, G. M. Nat. Rev. Genet. 9, 406–411 (2008).
13. Ball, M. P. et al. Proc. Natl. Acad. Sci. USA 109, 11920–11927 (2012).
14. Curnin, C., Gordon, A. & Erlich, Y. Bioinformatics 33, 2191–2193 (2017).
15. Kaplanis, J. et al. Preprint at https://www.biorxiv.org/content/early/2017/02/07/106427.1 (2017).
16. Bryc, K., Durand, E. Y., Macpherson, J. M., Reich, D. & Mountain, J. L. Am. J. Hum. Genet. 96, 37–53 (2015).
17. Jain, S. H., Powers, B. W., Hawkins, J. B. & Brownstein, J. S. Nat. Biotechnol. 33, 462–463 (2015).
18. Kosinski, M., Stillwell, D. & Graepel, T. Proc. Natl. Acad. Sci. USA 110, 5802–5805 (2013).
19. Wu, H.-Y. et al. Eulerian video magnification for revealing subtle changes in the world. Preprint at https://dspace.mit.edu/openaccess-disseminate/1721.1/86955 (2012).
20. Paparrizos, J., White, R. W. & Horvitz, E. J. Oncol. Pract. 12, 737–744 (2016).

Acknowledgements
Y.E. holds a Career Award at the Scientific Interface from the Burroughs Wellcome Fund. This study was supported by a generous gift from Andria and Paul Heafy to the Erlich Laboratory, funding from the National Breast Cancer Coalition, and support from Amazon Web Services’ Education Grants. J.Y. is supported by the Columbia University Integrative Graduate Education and Research Traineeship (IGERT), funded by NSF research grant number 1144854. We thank the tens of thousands of DNA.Land participants (especially our early adopters, whose feedback was integral in our efforts to improve the site) and genetic genealogist C. Moore for her valuable advice. We welcome inquiries by researchers who are interested in collecting genotype and phenotype information with our resource.

Author contributions
J.Y., A.G., D.S., D.Z., J.P., and Y.E. coded the DNA.Land website. A.G. is the chief architect of DNA.Land. J.Y., A.G., and Y.E. wrote the manuscript and analyzed the data. Y.E. conceived the website. R.A. provided technical assistance. J.P. and Y.E. supervised the study.

Competing interests
Y.E. is the Chief Science Officer of https://www.MyHeritage.com. J.P. is the CEO and co-founder of Gencove.

Additional information
Supplementary information is available for this paper at https://doi.org/10.1038/s41588-017-0021-8.

Nature Genetics | VOL 50 | FEBRUARY 2018 | 160–165 | www.nature.com/naturegenetics
© 2018 Nature America Inc., part of Springer Nature. All rights reserved.