A Wealth of Discovery Built on the Human Genome Project — by the Numbers

Comment

Alexander J. Gates, Deisy Morselli Gysi, Manolis Kellis & Albert-László Barabási

A new analysis traces the story of the draft genome's impact on genomics since 2001, linking its effects on publications, drug approvals and understanding of disease.

212 | Nature | Vol 590 | 11 February 2021. ©2021 Springer Nature Limited. All rights reserved.

The 20th anniversary of the publication of the first draft of the human genome [1,2] offers an opportunity to track how the project has empowered research into the genetic roots of human disease, changed drug discovery and helped to revise the idea of the gene itself.

Here we distil these impacts and trends. We combined several data sets to quantify the different types of genetic element that have been discovered and that generated publications, and how the pattern of discovery and publishing has changed over the years. Our analysis linked together data including 38,546 RNA transcripts; around 1 million single nucleotide polymorphisms (SNPs); 1,660 human diseases with documented genetic roots; 7,712 approved and experimental pharmaceuticals; and 704,515 scientific publications between 1900 and 2017 (see Supplementary information; SI).

The results highlight how the Human Genome Project (HGP), with its comprehensive list of protein-coding genes, spurred a new era of elucidating the function of the non-coding portion of the genome and paved the way for therapeutic developments. Crucially, the results track the emergence of a systems-level view of biology alongside the conventional single-gene perspective, as researchers mapped the interactions between cellular building blocks (see 'No jump for big teams').

There are limitations to our analysis. For example, there is no consensus on where a gene starts and ends or, surprisingly, even what sequence exactly encodes some genes [3]. Multiple naming conventions are in use for some genomic elements, so sometimes our methodology did not connect them. And other links between publications and elements might not have been added to databases by authors. Finally, our graphs end in 2017, because there can be a time lag between an article's publication and entry into the databases we used.

However, we do not expect these issues to affect the trends we note in how genome research has changed over time. The trends remain when we control for the growth in biology publications over the same period (see SI, Fig. S6). We did not control for time since the discovery of genes, but estimate that doing so would not have altered our conclusions.

These connections offer a snapshot of the evolution of the research landscape before and after the HGP. It shows an intense focus on a small number of 'superstar' protein-coding genes, potentially to the detriment of interesting work that could be done on others. There has been a pivot towards non-protein-coding sections of the genome, and to understanding interactions between genetic material and proteins. And drug discovery has been grounded in just a few protein targets.

Some of these trends are familiar to biologists, but to quantify and visualize them is to consider them anew.

There is no world without an HGP for comparison. So it is impossible to say whether these trends would have arisen anyway. Other factors, from increased computing power to sophisticated sequencing methods, also had a role in many of these developments. It is nonetheless clear that the HGP's catalogue catalysed the continuing genetic revolution.

Superstar genes

The popular perception is that the HGP marked the start of the intensive search for protein-coding genes. In fact, the 2001 draft HGP paper signalled the end of a decades-long hunt [1,2]. Indeed, evidence for the first protein-coding gene emerged in 1902, with the discovery of the hormone secretin [4] (SCT gene), 50 years before the structure of DNA was uncovered, and 75 years before genome sequencing became commonplace. Our analysis shows that, between the start of the HGP in 1990 and its completion in 2003 (after the draft was published in 2001), the number of discovered (or 'annotated') human genes grew drastically. It levelled out suddenly in the mid-2000s at about 20,000 protein-coding genes (see 'Twenty years of junk, stars and drugs: Non-coding elements'), far short of the 100,000-strong estimate previously adopted by many in the scientific community [2].

Although discoveries of protein-coding genes reached a plateau, interest in individual genes grew rapidly following the HGP. Each year since 2001, between 10,000 and 20,000 papers mentioning protein-coding genes have been published (see SI; Fig. S3).

However, that interest has focused largely on just a few genes. Before 1990, HBA1 was the most studied because it encodes one of the proteins in adult haemoglobin. From 1990, attention then shifted to CD4 (based on the cumulative number of publications) given the protein's involvement in T-cell immunity and as the cell receptor for HIV. Yet the interest in these two genes pales next to the explosion of attention on individual genes following the draft 2001 HGP sequence. Some superstar genes, including TP53, TNF and EGFR, became the subject of hundreds of publications a year, with most other genes receiving scant attention (see 'Deep impact' and 'Twenty years of junk, stars and drugs: Star genes'). We find that, by 2017, 22% of gene-related publications referenced just 1% of genes.

Intense study is, of course, justified for genes that have profound biological importance. A good example is TP53: it is crucial to cell growth and death, and leads to cancer when inactivated or altered. Variations in this gene are found in more than 50% of tumour sequences. It is mentioned in 9,232 publications between 1976 and 2017 (see SI, Fig. S4).

[Figure: DEEP IMPACT. The 19,757 genes that encode proteins are arranged according to their relative position along each of the chromosomes, shown as rings. The plane marks the publication of the draft Human Genome Project in 2001. Length beneath the plane scales to the number of publications on a gene since then; height above it denotes publications beforehand. The breadth of the base of each peak reflects the number of diseases associated with each gene. A few genes, distributed across the genome and chromosomes, have been studied intensely, as have non-coding elements in between (not shown). In the past two decades, researchers have learnt that these latter regions help to regulate the dynamic code of life. Legend: number of diseases; number of publications before 2001; number of publications after 2001; chromosomes ordered by number of genes, including X, Y and the M chromosome (mitochondrial DNA). Top 8 genes: 1. TP53; 2. TNF; 3. EGFR; 4. IL6; 5. VEGFA; 6. APOE; 7. TGFB1; 8. MTHFR. Callouts: "Tiny dots": 3% of genes were not discussed by any publications. The gene ADRA1A is targeted by 99 different drugs, 5% of all those approved; it is the subject of just 130 publications. TNF is associated with 160 known diseases, the most of any gene. "Long story": the gene TP53 on chromosome 17 was discovered in 1979; associated with most cancers, it has since accumulated 9,232 publications. Source: Barabási Lab.]

One might assume that the more that is known about the same genes, the greater the incentive would be to explore the rest of the genome. Instead, the opposite happened during the past two decades: more attention was lavished on a select few. Despite this being flagged as a potential problem on the tenth anniversary [5] of the draft genome's publication, there has been no course correction.

Our previous work on other, very different systems from human social networks to the World Wide Web indicates that this vast imbalance can be explained by a 'rich-gets-richer' dynamic [6,7] rooted in social factors. It is likely that as the number of papers focusing on TP53 […]

[…] regions of genome that were called junk DNA, or the dark matter of the genome? Thanks in large part to the HGP, it is now appreciated that the majority of functional sequences in the human genome do not encode proteins. Rather, elements such as long non-coding RNAs, promoters, enhancers and countless gene-regulatory motifs work together to bring the genome to life. Variation in these regions does not alter proteins, but it can perturb the networks governing protein expression.

The discovery of non-protein-coding […]

[…] thousands of individuals; these included the International HapMap Project [8] (the third and final phase of which was completed in 2010) and the 1000 Genomes Project [9] (completed in 2015). These data sets, combined with the advances in statistical analysis, ushered in genome-wide association studies (GWAS) of countless traits, including height [10], obesity [11] and susceptibility to complex diseases such as schizophrenia [12].

There are now more than 30,000 papers per year linking SNPs and traits. A large fraction of these associations are in the once-dismissed non-coding regions (see SI, Table S3). Cellular function relies on weak and strong links between genetic material and proteins. Mapping out this network now complements the Mendelian perspective (see page 218).
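
The concentration figure quoted under 'Superstar genes' (22% of gene-related publications referencing just 1% of genes) is a top-share statistic. The sketch below shows one plausible way to compute such a share from per-gene publication counts. It is an illustration only, not the authors' method: the function name `top_share`, its `top_percent` parameter and the toy counts are all hypothetical, and it counts gene mentions rather than distinct papers, so a paper that mentions two genes is counted twice.

```python
def top_share(pub_counts, top_percent=1):
    """Share of all gene mentions held by the most-studied `top_percent` of genes.

    pub_counts: one publication count per gene (hypothetical input format).
    """
    if not pub_counts:
        raise ValueError("pub_counts must not be empty")
    ranked = sorted(pub_counts, reverse=True)       # most-studied genes first
    k = max(1, len(ranked) * top_percent // 100)    # size of the top group, at least one gene
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


# Toy data (made up, not from the article): one heavily studied gene
# and 99 genes with a single publication each.
toy_counts = [50] + [1] * 99
print(f"Top 1% of genes hold {top_share(toy_counts):.1%} of gene mentions")
```

Applied to real per-gene counts, the same calculation could be repeated year by year to trace how the imbalance changes over time.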
