Trends in Computational Biology—2010

FEATURE

H Craig Mak

Interviews with leading scientists highlight several notable breakthroughs in computational biology from the past year and suggest areas where computation may drive biological discovery.

H. Craig Mak is Associate Editor, Nature Biotechnology.

Nature Biotechnology, Volume 29, Number 1, January 2011, p. 45. © 2011 Nature America, Inc. All rights reserved.

The field of computational biology encompasses a set of investigative tools as much as being a research endeavor in its own right. It is often difficult to gauge the utility and significance of a computational tool, at least until the research community has had sufficient time to explore, exploit and hone it in various applications. In an effort to identify recent notable breakthroughs in the field of computational biology, Nature Biotechnology surveyed leading researchers in the area, asking them to nominate papers of particular interest published in the previous year that have influenced the direction of their research. Some of the nominated papers had been featured in our pages and elsewhere; others were completely off our radar. Although we surveyed a small group of 15 scientists, the nominated papers (Box 1) provide a snapshot of some of the most exciting areas of current computational biology research.

All the papers featured in the following pages were nominated by at least two scientists. Our analysis not only highlights the richness of approaches and growth of the field, but also suggests that researchers of a particular type are driving much of cutting-edge computational biology (Box 2). Read on to find out what characterizes them and what they’ve been doing in the past year.

Next-generation sequence analysis

Imagine an experiment generating a billion data points every day, the equivalent of running millions of agarose gels and taping (remember that!) the pictures into tens of thousands of laboratory notebooks, or hybridizing thousands of gene-expression microarrays. Computational biology has risen in prominence in recent years largely because of the increase in the data-generation capacity of high-throughput technology. More data create more opportunities and a more pressing need for systematic methods of analysis. And nowhere has that need been more evident than in the field of next-generation sequencing.

The latest sequencers take a week or two to generate about a billion short reads, stretches of about 50–400 bp of DNA sequenced from a longer molecule. Researchers face challenges on two levels when turning massive collections of reads into biologically meaningful information. The first set of challenges lies in processing the reads themselves: mapping them to their genomic locations, and then assembling them into longer contiguous stretches of DNA. The second set of challenges lies in interpreting large collections of reads, which may be assembled into whole genomes, to understand the functional effects of genetic variation. Thousands of genomes from humans, plants, animals and disease tissues have already been sequenced, and all are in need of better interpretation. Although algorithms, such as BLAST for searching and CLUSTALW for aligning, continue to be the workhorses of sequence analysis, several next-generation computational methods have emerged to cope with the DNA sequences captured in billions of short reads and thousands of genomes.

The advance. Two methods for de novo transcriptome assembly of short reads were published this year from Lior Pachter and colleagues (ref. 1) and from Aviv Regev and colleagues (ref. 2). The transcriptome can be analyzed by sequencing cDNA reverse transcribed from RNA (RNA-Seq), but mapping and assembling the resulting reads are challenging owing to the complexities introduced by RNA splicing. The two methods are the first that robustly assemble full-length transcripts, including alternative splicing isoforms. In contrast to previous approaches, these two methods first map reads to the genome using software that takes possible splice junctions into account, thereby making assembly more manageable. Then, they apply graph-based algorithms to determine (refs. 1,2) and quantify (ref. 1) the most likely splice isoforms. The algorithms were applied to mammalian transcriptomes to follow global patterns of splicing during a developmental time course (ref. 1) and to identify novel, spliced, long, noncoding RNAs that had not been annotated by existing methods (ref. 2).

Steven Salzberg collaborated with Lior Pachter and Barbara Wold to develop a method for assembling short reads sequenced from cDNA into full-length spliced transcripts.

Progress toward the second challenge of genome interpretation was reported in papers that demonstrate the potential of genome sequencing for genetic analysis of human traits. The approach, pioneered by Jay Shendure at the University of Washington in Seattle, sequences the exonic regions of several genomes to identify protein-disrupting mutations linked to disease. “This study dramatically demonstrates how we can make new genetic discoveries by sequencing all the exons in a set of patients,” says Steven Salzberg of the University of Maryland and a co-author with Pachter (ref. 1). Since the publication of the first success of this strategy in January 2010 (ref. 3), several additional studies have taken a similar approach to study the genetics of human diseases.

What it means. Advances in transcript assembly from RNA-Seq data should allow alternative splicing to be studied genome-wide across many biological conditions, such as in different tissues, over different time points and in response to genetic and chemical perturbations. In addition, RNA-Seq is now poised as a tool for discovering new RNAs that we may not have even known were transcribed from a genome. Armed with better knowledge of splicing patterns and comprehensive transcript catalogs, it should be possible to improve the annotation of genomes. “Thousands of people have accessed our software,” says Salzberg, “and it is being integrated into easy-to-use graphical interfaces such as Galaxy” (ref. 4).

The flood of whole genomes and exomes should also drive sequence-analysis methods. Attending an International Cancer Genome Consortium meeting in Brisbane, Australia, in December, Debbie Marks of Harvard University noted, “We’re still in the early days of whole-genome analysis…problems need to be articulated in ways that are computationally tractable.” Many of the problems will require algorithmic advances, such as better de novo assembly of transcriptomes and genomes. However, much of the work that goes into interpreting a patient’s whole genome to identify disease-causing mutations, for instance, involves filtering variants identified by sequencing against databases of known variants. As better databases should lead to improved genome analysis, it might be reasonable to expect the development and min[…]

[…]tific advances. As a result of incentives built into the Health Information Technology for Economic and Clinical Health (HITECH) Act passed as part of the Obama administration’s health-care reform, in 10 years almost every US hospital could be using electronic medical records, up from 1.5–2% of hospitals today. “What’s fun to think about,” says Atul Butte of Stanford University, “is what kind of science can we derive from this?”

Electronic medical records can contain a wealth of history on physical exams and treatment regimes. Particularly amenable to automated analysis in the records are the standardized administrative billing codes used to charge for each procedure, test or clinical visit. These codes can track anything from the diagnosis of coronary artery disease to the procedure for inserting a stent to keep blood vessels open. Several hospitals, with Vanderbilt University (Nashville, TN, USA) among the leaders, are pairing electronic medical records with the collection of tissue samples from every patient treated. These resources represent an unprecedented source of data on the genetic and physiological state of people linked to standardized, computable records of their phenome, or the set of all phenotypes including disease diagnosis and responses to treatment.

Atul Butte: “Ninety-nine percent of the work is not in software engineering or coding, it’s in coming up with the right kind of question:…[one that] no one even realizes we can ask today.”

The advance. […] tested to see whether they carried a total of five SNPs previously associated with seven diseases (coronary artery disease, carotid artery stenosis, atrial fibrillation, multiple sclerosis, lupus, rheumatoid arthritis and Crohn’s disease). To identify patient phenotypes in an automated fashion, they used billing codes in the electronic medical records to group patients into ‘case’ and ‘control’ populations for 776 phenotypes. Finally, a Chi-squared statistical test was used to evaluate whether patients harboring a specific SNP also tended to display a particular phenotype. The authors noted that, although there are many statistical challenges with this kind of analysis and there is much room for improvement of their method, four of the seven previously known disease-gene associations could be replicated, and several potential associations with other diseases were identified but not rigorously validated. These results highlight the possibility that novel biological discoveries might be made using this approach.

What it means. “Everyone wishes they could do this kind of study,” remarks Butte, “but it represents a multimillion dollar investment. Vanderbilt is leading the way.” Several other hospitals are making similar investments. The Mayo Clinic, for example, is coupling specimens collected from 20,000 patients with electronic medical records and other data gathered and standardized across the hospital system. “There has always been a question about whether electronic medical records would be of sufficient quality to allow genetic discovery,” says Russ Altman, also at Stanford. “This paper sets the stage for widespread use of electronic records for genomic discovery.”
