Expanded Human Gene Tally Reignites Debate: After 15 Years, Researchers Still Can’t Agree on How Many Genes Are in the Human Genome
NEWS IN FOCUS: GENETICS

BY CASSANDRA WILLYARD

One of the earliest attempts to estimate the number of genes in the human genome involved tipsy geneticists, a bar in Cold Spring Harbor, New York, and pure guesswork.

That was in 2000, when a draft human genome sequence was still in the works; geneticists were running a sweepstake on how many genes humans have, and wagers ranged from tens of thousands to hundreds of thousands. Almost two decades later, scientists armed with real data still can’t agree on the number, a knowledge gap that they say hampers efforts to spot disease-related mutations.

The latest attempt to plug that gap uses data from hundreds of human tissue samples and was posted on the bioRxiv preprint server on 29 May (M. Pertea et al. Preprint at bioRxiv http://doi.org/cq5s; 2018). It includes almost 5,000 genes that haven’t previously been spotted, among them nearly 1,200 that carry instructions for making proteins. And the overall tally of more than 21,000 protein-coding genes is a substantial jump from previous estimates, which put the figure at around 20,000.

But many geneticists aren’t yet convinced that all the newly proposed genes will stand up to close scrutiny. Their criticisms underscore just how difficult it is to identify new genes, or even to define what a gene is.

“People have been working hard at this for 20 years, and we still don’t have the answer,” says Steven Salzberg, a computational biologist at Johns Hopkins University in Baltimore, Maryland, whose team produced the latest count.

HARD TO PIN DOWN

In 2000, with the genomics community abuzz over the question of how many human genes would be found, researcher Ewan Birney launched the GeneSweep contest. Birney, now co-director of the European Bioinformatics Institute (EBI) in Hinxton, UK, took the first bets at a bar during an annual genetics meeting, and the contest eventually attracted more than 1,000 entries and a US$3,000 jackpot. Bets on the number of genes ranged from more than 312,000 to just under 26,000, with an average of around 40,000. These days, the span of estimates has shrunk, with most now between 19,000 and 22,000, but there is still disagreement (see ‘Gene tally’).

Salzberg’s team used data from the Genotype-Tissue Expression (GTEx) project, which sequenced RNA from more than 30 different tissues taken from several hundred cadavers. RNA is the intermediary between DNA and proteins. The researchers wanted to identify genes that encode a protein and those that don’t, but that still have an important role in cells. So they assembled GTEx’s 900 billion tiny RNA snippets and aligned them with the human genome.

Just because a stretch of DNA is expressed as RNA, however, does not necessarily mean it’s a gene. So the team attempted to filter out noise using a variety of criteria. For example, the researchers compared their results with genomes from other species, reasoning that sequences shared by distantly related creatures have probably been preserved by evolution because they serve a useful purpose, and so are likely to be genes.

The team was left with 21,306 protein-coding genes and 21,856 non-coding genes, many more than are included in the two most widely used human-gene databases. The GENCODE gene set, maintained by the EBI, includes 19,901 protein-coding genes and 15,779 non-coding genes. RefSeq, a database run by the US National Center for Biotechnology Information (NCBI), lists 20,203 protein-coding genes and 17,871 non-coding genes.

Kim Pruitt, a genome researcher at the NCBI in Bethesda, Maryland, and a former head of RefSeq, says the difference is probably due in part to the volume of data that Salzberg’s team analysed. RefSeq relies on an older data set that contains 21 billion short sequences. GENCODE uses different data again: a type that makes recognizing transcripts easier, but which can miss genes. And there’s another major difference. Both GENCODE and RefSeq use manual curation: a person reviews the evidence for the gene and makes a final determination. Salzberg’s group relied solely on computer programs to sift the data.

“If people like our gene list, then maybe a couple years from now we’ll be the arbiter of human genes,” says Salzberg.

But many scientists say they need more evidence to be convinced that the latest list is accurate. Adam Frankish, a computational biologist at the EBI who coordinates the manual annotation of GENCODE, says that he and his group have scanned about 100 of the protein-coding genes identified by Salzberg’s team. By their assessment, only one of those seems to be a true protein-coding gene. And Pruitt’s team looked at about a dozen of the Salzberg group’s new protein-coding genes, but didn’t find any that would meet RefSeq’s criteria.

Salzberg acknowledges that the new genes on his team’s list will require validation by his group and others.

Further confounding counting efforts is the imprecise and changing definition of a gene. Biologists used to see genes as sequences that code for proteins, but then it became clear that some non-coding RNA molecules have important roles in cells. Judging which are important, and should be deemed genes, is controversial, and could explain some of the discrepancies between Salzberg’s count and others.

Having an accurate tally of all human genes is key for efforts to uncover links between genes and disease. Uncounted genes are often ignored, even if they contain a disease-causing mutation, Salzberg says. But hastily adding genes to the master list can pose risks, too, says Frankish. A gene that turns out to be incorrect can divert geneticists’ attention away from the real problem.

Still, the inconsistencies in the number of genes from database to database are problematic for researchers, Pruitt says. “People want one answer,” she adds, “but biology is complex.” ■

[Figure: GENE TALLY. Scientists still don’t agree on how many protein-coding genes the human genome holds, but the range of their estimates has narrowed in recent years. The chart plots the number of protein-coding genes (thousands) against year, 1960 to 2010, showing the estimated value and the range of estimates, with three marked events: (1) launch of the Human Genome Project; (2) first draft of human genome released; (3) refined analysis of complete genome. The latest count found 21,306 protein-coding genes. Source: M. Pertea & S. L. Salzberg]

354 | NATURE | VOL 558 | 21 JUNE 2018
© 2018 Macmillan Publishers Limited, part of Springer Nature. All rights reserved.

MEDICAL RESEARCH

Silent cancer cells targeted

Researchers hunt dormant cells that break off tumours, and aim to keep them asleep.

BY HEIDI LEDFORD

After decades of designing drugs to kill rapidly dividing tumour cells, many cancer researchers are switching gears: targeting malignant cells that lie silent and scattered around the body, before they give rise to new tumours.

These cells seed the metastases responsible for about 90% of cancer deaths. They are the source of the heartbreaking cancer resurgence seen in many people whose seemingly successful initial treatment had fostered hopes that they were cured. Treatments that target proliferating tumour cells often miss these silent cells because they’re not actively dividing.

Dormant cancer cells are rare, and they are difficult to sift from the trillions of normal cells in the body. For years, scientists lacked the tools to study them, says cancer researcher [...]

[...] such as those in the breast, prostate and pancreas, that recur at a high rate, sometimes many years after treatment. “You remove the tumour, you irradiate, you do this, you do that,” says cancer researcher Mina Bissell, of the Lawrence Berkeley National Laboratory in California. “But sooner or later the cancer metastasizes, and you say to yourself, ‘Where did these things come from?’”

[Pull quote: “As long as those cells remain dormant, they’re not killing my patient.”]

CELL SPOTTING

Mounting evidence suggests that dormant cells break away from a parent tumour early in its development and travel through blood vessels to new sites in the body (see Nature Methods 15, 249–252; 2018). But then, after settling into other tissues or organs, such cells will effectively go to sleep, lying dormant until a trigger (as yet unknown) rouses them. Only then do they begin dividing and form a new tumour.

When cancer researchers tried to study this [...]

[...] breast-cancer specialist at Indiana University in Indianapolis. But several labs have made progress, developing models to track dormant cells in mice for more than a year.

Techniques for identifying those cells are also improving. Joshua Snyder, a cell biologist at Duke University School of Medicine in Durham, North Carolina, uses a mix of fluorescent markers to identify and trace rogue cells expressing cancer-linked genes. And at the meeting in Montreal, geneticist Jason Bielas of the Fred Hutchinson Cancer Research Center in Seattle, Washington, will present preliminary results from his efforts to barcode such cells using specific DNA sequences.