Feature to share genome data, the field is generally viewed as generous compared with other disciplines. Still, the repositories meant to foster sharing often present barriers to those HOW A FIELD uploading and downloading data. Researchers tell tales of spending months or years track- ing down data sets, only to find dead ends or unusable files. And journal editors and funding agencies struggle to monitor whether scien- BUILT ON DATA tists are sticking to their agreements. Many scientists are pushing for change, but it can’t come fast enough. Clinical genomicist Heidi Rehm says the SHARING BECAME field has come to recognize that big scientific advances require vast amounts of genomic data linked to disease and health-trait data. “But it isn’t compatible and shareable,” says Rehm, based at Massachusetts General Hos- A TOWER OF BABEL pital in Boston and the Broad Institute in Cambridge. “How do we get everyone in the The immediate and open exchange of world — patients, clinicians and researchers information was key to the success of — to share?” Barriers everywhere the 20 years Sequencing the human genome made it easier ago. Now the field is struggling to keep to study diseases associated with mutations in a single gene — Mendelian disorders such as its data accessible. By Kendall Powell non-syndromic hearing loss2 (see page 218). But identifying the genetic roots of more com- mon complex diseases, including cardiovascu- lar disease, and other leading causes of death, required the identification of multiple genetic risk factors throughout the genome. n July 2000, David Haussler remembers scientists hoarded the data they were produc- To do this, researchers in the mid-2000s began crying as he watched the first fully assem- ing, it would derail the project. So in 1996, the comparing the genotypes of thousands to hun- bled human genome streaming across HGP researchers got together to lay out what dreds of thousands of individuals with and his computer screen. He and , a became known as the Bermuda Principles, with without a specific disease or condition, in an graduate student at the time, built the all parties agreeing to make the human genome approach known as genome-wide association first-ever web-based tool for exploring the sequences available in public databases, ideally studies, or GWAS. three billion letters of the human genome. within 24 hours — no delays, no exceptions. The approach proved popular — more than They had published the rough draft of the Fast-forward two decades, and the field 10,700 GWAS have been conducted since Igenome on the Internet a mere 11 days after is bursting with genomic data, thanks to 2005. And that has produced oceans of data, finishing the herculean task of stitching it all improved technology both for sequencing says Chiea Chuen Khor, a group leader at the together — a task assigned to them as part of whole genomes and for genotyping them Genome Institute of Singapore, who studies the Human Genome Project (HGP), the inter- by sequencing a few million select spots to the genetic basis of glaucoma. A study with national collaboration that had been working quickly capture the variation within. These 10,000 people, looking at 1 million genetic towards this goal for a decade. It would still be efforts have produced genetic readouts for markers in each, for example, says Khor, would several months before the group published its tens of millions of individuals, and they sit generate a spreadsheet with 10 billion entries. analysis of the genome in the pages of Nature1, in data repositories around the globe. The Most of these individual-level genomic but the data were ready to share. principles laid out during the HGP, and later data now live in ‘controlled-access’ databases. “There it was, going out into the whole adopted by journals and funding agencies, These were set up to deal with the sticky legal world,” recalls Haussler, scientific director of meant that anyone should be able to access and ethical concerns that come with genomic the University of California Santa Cruz Genom- the data created for published genome studies data that have been linked to personal infor- ics Institute. Soon, every person in the world and use them to power new discoveries. mation — ‘phenotype data’ that can include could explore it — chromosome by chromo- If only it were that simple. health-care records, disease status or lifestyle some, gene by gene, base by base — on the web. The explosion of data led governments, choices. Even in anonymized data sets, it’s It was a historic moment, says Haussler. funding agencies, research institutes and pri- technically possible that individuals can be Before the HGP launched in the early 1990s, vate research consortia to develop their own reidentified. So, controlled-access databases “there had not been a serious discussion about custom-built databases for handling the com- vet the researchers seeking access and ensure data sharing in biomedical research”, Haus- plex and sometimes sensitive data sets. And that the data are used only for the purposes sler says. “The standard was that a successful the patchwork of repositories, with various that participants consented to. investigator held onto their own data as long rules for access and no standard data format- The US National Institutes of Health (NIH) as they could.” ting, has led to a “Tower of Babel” situation, requires its grant recipients to place GWAS That standard clearly wouldn’t work for such says Haussler. data into its official repository, the Database a large and collaborative effort. If countries or Although some researchers are reluctant for Genotypes and Phenotypes, or dbGaP.

198 | Nature | Vol 590 | 11 February 2021 ©2021 Spri nger Nature Li mited. All ri ghts reserved. ©2021 Spri nger Nature Li mited. All ri ghts reserved.

©

2 0 2 1

S p r i n g e r

N a t u r e

L i m i t e d .

A l l r i g h t s r e s e r v e d .

ILLUSTRATION BY ANA KOVA ©

2 0 2 1

S p r i n g e r

N a t u r e

L i m i t e d .

A l l r i g h t s r e s e r v e d . Nature |Vol 590 |11 February 2021 |199 Feature European researchers can deposit data into UK BioBank, which holds genomic data on Sussman explains that the journal’s editors the European Genome-phenome Archive 500,000 people, are still invaluable. Mathias will work through data-sharing obstacles with (EGA) housed at the European Bioinformat- is fiercely protective of the participants in authors on a case-by-case basis to find solu- ics Institute (EMBL-EBI) in Hinxton, UK. Simi- TOPMed and sees merit in the protection tions. This can go as far as asking authors to larly, other large generators of genomic data, that controlled access provides. Like many, reapply for approval from their institutional such as the for-profit company 23andMe in she would like to see the repositories better review board, going back to participants to Sunnyvale, California, and the non-profit resourced. But, she says, “I am an advocate for reobtain their consent or rerunning an analysis England in London, operate their the checks and balances”. after removing unshareable data. The journal own controlled-access databases. And others are happy to have access, even has turned away authors who state upfront But uploading data into some of these repos- if it is hard to obtain. “It’s out of our scope to that they cannot share data. “The community itories often takes a long time. As a result, says generate that amount of data,” says Melanie and the funders demand this transparency and Khor, the data are often “minimal and sparse”, Bahlo, who runs a statistical-genetics lab at reproducibility,” she says. because researchers are depositing just what’s the Walter and Eliza Hall Institute of Medical But even when authors do agree to share required to be compliant. Research in Melbourne, Australia. Her lab is data, editors and reviewers have limited ability Sometimes the data get stored in more than more than willing to wade through the digital to confirm that it is being done. They might not one place, and that creates other challenges. paperwork to use the dbGaP, and has done so have the time — or the access to controlled-ac- Rasika Mathias, a genetic epidemiologist for more than ten projects. She also recently cess databases — to check data quality, format- at Johns Hopkins University in Baltimore, spent a fruitless six months chasing after a data ting or completeness. Maryland, who studies the genetics of asthma set that was supposed to be publicly available Trenkmann says funders should require in people of African ancestry, says that decen- through a research institute’s data portal, but researchers to have a concrete data-sharing tralization is a huge problem. She is part of wasn’t. plan from the outset of a project. This could TOPMed, a precision-medicine programme “Nothing is harder than getting data out of help to shift attitudes so that researchers see run by the NIH’s National Heart, Lung, and dbGaP and EGA,” says Khor, “unless it’s getting sharing as a duty, she says. Blood Institute. It consists of more than it from a researcher who is unwilling to share.” An NIH-wide data-sharing policy to be 155,000 research participants across more implemented in January 2023 does just that. than 80 studies and shares its data in sev- The sharing police It requires all NIH grant applicants to put a eral repositories, including dbGaP and some Twenty years on from the HGP, there is no Data Management and Sharing (DMS) Plan into university-based portals. specific universal policy that says research their grant proposals and allows researchers “It’s a remarkable resource,” says Mathias. groups have to share their human-genome to allocate some of their budget to the task. But it’s cumbersome for an outsider to find all data, or share them in a particular format This should ensure that data sharing is the pieces of available data and request access, or database. That said, many journals have aligned both with ethical and privacy consid- she says. They must often provide detailed continued to abide by the Bermuda Princi- erations, and with the FAIR principles — which proposals and letters of support. “It’s unnec- ples, requiring that genomic data be shared mean that data must be findable, accessible, essarily difficult.” in approved databases at the time of publica- interoperable and reusable, says Carolyn Many look for workarounds. “I personally do tion. Enforcement of these policies can be hit Hutter, director of the National Human not download dbGaP data, I just go straight to or miss. Genome Research Institute’s (NHGRI) Division the researchers and ask if they want to collabo- of Genome Sciences in Bethesda. “That does rate,” says Ruth Loos, a genetic epidemiologist not mean, I threw my data over a wall and hope at the Icahn School of Medicine at Mount Sinai someone caught it,” she says. in New York City. Several years ago, she tried to “The enforcement part of it is tricky,” Hutter access a dbGaP data set, filing multiple rounds adds, “because data sharing often comes at of digital paperwork, only to be rejected. “Even the very end of the project.” And like journal logging into dbGaP can be a pain. It’s just not IF YOU DON’T HAVE editors, grant administrators can only do spot researcher-friendly,” she says. THE RAW DATA, checks of any data-sharing accession numbers Stephen Sherry, acting director of the NIH’s that appear in annual progress reports. National Center for Biotechnology Informa- YOU CAN’T LOOK tion in Bethesda, Maryland, which runs the Searching for solutions dbGaP, acknowledges that the processes to AT THE QUALITY. ” There could be ways to share more simply submit and access data are “imperfect and without falling foul of proprietary or privacy painful”. And the complex, heterogeneous Michelle Trenkmann, senior editor for issues. Many genomic stakeholders agree data require case-by-case review, which cannot genetics and genomics at Nature in London, that an aggregated form of GWAS data, called simply be sped up by throwing “more people says that authors are often reluctant to share, GWAS summary statistics, can and should be at the crank to turn it faster”. citing concerns over participant privacy, con- shared broadly and freely. These summaries But, Sherry says, the NIH is investing in mod- sent or national or corporate rules governing include the aggregated scores for each genetic ernizing the system to make it more stream- who owns the data. “What’s remarkable, is variant found to be associated with a disease lined and flexible. Carrie Wolinetz, associate that, as a field, geneticists expect the data to or condition across multiple genomes. They director for science policy at the NIH, says it be shared, but sometimes they do not want are easier for researchers to work with, and is yet to be determined whether the remedy to share their own data,” she says. Trenkmann they protect participant privacy. will be a dbGaP 2.0 or an alternative resource. pushes back in such cases, and if the challenges Many research consortia do share these on “Do you put in a stop-gap measure, or is it time can’t be overcome, the authors must spell out their websites or portals. But an open-access to invest in a whole bathroom renovation?” their reasons directly in the paper for trans- collaboration between EMBL-EBI and the she asks. parency. (Nature’s news team is editorially NHGRI, called the GWAS Catalog3, is working For all the problems that controlled access independent from its journal team.) towards a centralized, standardized solution. causes in sharing genome data, many research- The journal Genome Research, has a ‘no Starting in 2020, the GWAS Catalog gave ers say databases such as dbGaP and the exceptions’ policy. Executive editor Hillary researchers a way to submit their summary

200 | Nature | Vol 590 | 11 February 2021 ©2021 Spri nger Nature Li mited. All ri ghts reserved. ©2021 Spri nger Nature Li mited. All ri ghts reserved.

harmonize data (like the GWAS Catalog on a grander scale) and to federate, or loosely link, the disparate data warehouses. GA4GH chief executive Peter Goodhand uses the analogy of global mobile-phone communications. There’s huge competition between mobile-phone makers and service providers, but at the end of the day, they all have to work on the same network. “For true interoperability to take place, there have to be working relationships between the providers,” says Goodhand. “You can set up the systems that permit the sharing and make it easier.” Scientists used a GA4GH standard to create the Matchmaker Exchange, for example. This service lets clinicians and researchers working on the rarest of rare diseases search a single federated network of eight international data- bases to find individuals with a similar geno- type or phenotype to a case they’re working on. If a match is returned, both parties are connected in a way that protects both patient confidentiality and research ownership and authorship. The NIH’s RAS will also use a GA4GH standard, called the Data Repository

REGENTS OF UNIV. CALIFORNIA. COURTESY OF SPECIAL COLLECTIONS, UNIV. LIBRARY, UNIV. UNIV. LIBRARY, UNIV. COLLECTIONS, OF SPECIAL COURTESY CALIFORNIA. OF UNIV. REGENTS PHOTOGRAPHS. SERVICES PHOTOGRAPHY CRUZ UC SANTA CRUZ. SANTA CALIFORNIA, Service, a software interface that helps differ- In 2000, Jim Kent, a graduate student at the University of California, Santa Cruz, helped to ent repositories to communicate. assemble and share the results of the decade-long Human Genome Project. Bahlo and others say that data federation efforts become even more important as the statistics along with metadata describing the the GWAS Catalog. “The dbGaP wasn’t posi- field pivots to digging deeper into phenotype study and participants. In return, research- tioned to evolve and handle every new type of data, which have grown in scope and com- ers get a prepublication accession ID to use in data,” she says. For example, the cost of storing plexity. “That data comes in all sorts of forms preprints and submitted manuscripts. data from whole genomes is very different from — environmental exposures, smoking status, But many researchers say that summary that for GWAS data. As such, the NHGRI has cre- medical imaging data,” says Bahlo. statistics are not sufficient for advancing ated a cloud-based infrastructure known as the She and others see data federation as a genomic science. “That’s a major threat to Analysis, Visualization, and Informatics Lab- great opportunity to inject global equity into GWAS,” says Chris Amos, a genetic epidemiolo- space (AnVIL), where researchers can share and genomic data sharing. Researchers from devel- gist who studies lung cancer at Baylor College analyse across large genomic data sets, includ- oping countries could access and work with of Medicine in Houston, Texas. Researchers ing whole genome and exome sequences. data sets without needing to generate their need the individual-level genome data and the Another NIH initiative is the Researcher Auth own data or have their own supercomputing linked phenotypic trait data to reveal exactly Service (RAS), which would authorize research- resources. And better data sharing should how genetic variation plays out in disease. ers to access AnVIL, the dbGaP and several also improve representation of non-white, They also need the full data to check the sci- other data resources. “The vision is that we’d non-European global ancestries. Under-rep- ence. “If you don’t have the raw data, you can’t push this out like a visa stamp,” says Sherry, resentation is especially stark for continental look at the quality. That is not good enough allowing researchers to ultimately merge and African ancestries, which make up less than to make a reproducible finding,” Amos says. analyse data at will in cloud-based systems. 0.5% of all GWAS participants4 (see pages 209 And the owners of the data for very large “We’re building one of the first systems of and 220). cohorts, such as 23andMe and Genomics library cards for researchers,” says Sherry. Haussler thinks that positive peer pressure England, don’t give unrestricted access to Haussler and some other big-data wranglers should convince scientists to share in better their summary statistics. They cite concerns also have ideas. As data-sharing frustrations ways. The need is only growing. Twenty years over participants’ data privacy and the wish were mounting in 2013, Haussler, along with after releasing the first human genome to the to retain ownership of their data. In effect, David Altshuler, Eric Lander and other inter- Internet, his team has built a way for anyone to they run their own controlled-access data- national colleagues laid the groundwork for explore the SARS-CoV-2 viral genome5. bases, with custom processes for accessing the Global Alliance for Genomics and Health, “Data should be a living thing,” says and reanalysing their data. A precondition for or GA4GH (see go.nature.com/3app3xr). Haussler. “I want to click on it and play with it working with much of their data is allowing the It started with the same ideals as the HGP. immediately. That should be the motivation. companies to share authorship of the resulting “We’d get the world to share data on one big If you don’t share your data, you can’t do that.” work. Bahlo says these kinds of requirements database, and we’d all agree on how we’d use set too high a bar for her and other bioinforma- that data, and Kumbaya,” says Haussler. “Very Kendall Powell is a freelance science ticians who wish to crunch data from Genom- quickly, it became evident that that was utterly journalist in Lafayette, Colorado.

ics England’s 100,000 Genomes Project. impossible.” 1. International Human Genome Sequencing Consortium. Hutter acknowledges that not all the cur- Instead, the GA4GH now focuses on creating Nature 409, 860–921 (2001). rent growing pains of genomic data sharing standards for the multitude of genomic data- 2. Chong, J. X. et al. Am. J. Hum. Genet. 97, 199–215 (2015). 3. Buniello, A. et al. Nucl. Acids Res. 47, D1005–D1012 (2019). can be fixed simply through improvements to bases around the world. Its working hypoth- 4. Mills, M. C. & Rahal, C. Commun. Biol. 2, 9 (2019). the dbGaP or by sharing summary statistics in esis is that it will be technically possible to 5. Fernandes, J. D. et al. Nature Genet. 52, 991–998 (2020).

Nature | Vol 590 | 11 February 2021 | 201 ©2021 Spri nger Nature Li mited. All ri ghts reserved. ©2021 Spri nger Nature Li mited. All ri ghts reserved.