Volume 12, Issue 2
THE PRIMER
spring/summer 2015, volume 12, issue 2

In this issue
A Decade Using IMG
Tapping Microbial Communities in Colorado
Highlights from the Annual DOE JGI Meeting
Concerned about Melting Permafrost

Growing the Interest in Genomics

With a capacity crowd in attendance, the DOE JGI hosted the 10th Annual Genomics of Energy & Environment Meeting. To mark the occasion, instead of a single opening keynote address, the DOE JGI invited representatives from the three Bioenergy Research Centers to give a series of short talks that highlighted their collaborations with the DOE JGI, and featured applications of the basic science provided by the Institute. Blake Simmons from the Joint Bioenergy Institute (JBEI), Shawn Kaeppler of the Great Lakes Bioenergy Research Center (GLBRC), and Jerry Tuskan of the Bioenergy Science Center (BESC) all spoke briefly, while the closing keynote was delivered by Ed DeLong of the University of Hawaii at Manoa.

The themes of their talks echoed in presentations given over the three-day meeting held March 24–26, 2015 in Walnut Creek, Calif. Videos of these keynote talks, and of other presentations from the annual meeting, can be viewed on the DOE JGI YouTube channel at http://bit.ly/JGIUM2015videos. Images from the meeting are online at http://bit.ly/JGI15UMphotos.

Data Quality, Data Sets and New Directions: Plotting IMG’s Next 10 Years

At the recent 10th Annual Genomics of Energy & Environment meeting hosted by the U.S. Department of Energy Joint Genome Institute (DOE JGI), a DOE Office of Science User Facility, Nikos Kyrpides, head of the DOE JGI Prokaryote Super Program, received the van Niel International Prize in Bacterial Systematics. The van Niel Prize was established in 1986 in honor of microbiologist Cornelis van Niel’s contribution to scholarship in the field of microbiology, and is awarded every three years by the University of Queensland in Australia on the recommendation of a panel of experts of the International Committee on Systematics of Prokaryotes. Phil Hugenholtz, Director of the Center for Ecogenomics at the University of Queensland and a former DOE JGI colleague of Kyrpides, was on hand at the Meeting to present the award. Watch the ceremony at http://bit.ly/JGI15KyrpidesVanNiel.

An example of Kyrpides’ efforts to systematically describe and classify microbes in action can be seen in the Integrated Microbial Genomes (IMG) data management system that his program developed and maintains in partnership with the Biosciences Computing Group of Berkeley Lab’s Computational Research Division. IMG is the leading data analysis system of the DOE JGI’s Prokaryote Super Program, and Kyrpides has been pushing the developments as the scientific lead of the project from its first working prototype in 2005 to its current incarnation. On the IMG system’s 10th anniversary, he took time to reflect on the milestones achieved thus far and future directions.

What are the highlights of the last 10 years to you?

In a period of 10 years, IMG has broken several records and has been established as one of the premier data management systems in the community for comparative analysis of microbial genomes and metagenomes. Its data size has grown 70-fold in terms of number of data sets and 22,000-fold in number of genes. We currently have almost 50,000 genomes in our system, containing 90 million genes. It’s taken 20 years to sequence all of those genomes; I anticipate we will easily double that number in the next two years. We have 6,000 metagenome data sets, which contain 29 billion genes. As far as I know, this represents the largest publicly available database of metagenomics genes and therefore this is one more of IMG’s records. We’ve grown from a few hundred to about 12,000 registered users in more than 90 countries. We provide an alternative source of data, particularly for metagenomes, and we add significant value through the integration of various data types, as well as with curation and annotation.

In terms of data integration, we’ve managed to integrate several different data types including one of the largest collections of curated metadata from the GOLD database, as well as several omics types including transcriptomics, metatranscriptomics, proteomics, and methylomics. In an effort to connect to our DNA synthesis program at the JGI, we have integrated a large collection of known natural products and connected them to their biosynthetic gene clusters, creating one of the largest resources in the field. We are currently working towards the integration of metabolomics and transposomics data produced at the JGI.

Adding all of these means a completely different operation from the straightforward comparison of genes and genomes. With transcriptomes, for example, you’re now talking about the expression of genes you already have, and expression levels vary under varying conditions. In transposomics, you look at the genes that are essential or have different fitness under varying conditions. So the original IMG’s three-dimensional model of genes, genomes and functions has become more multidimensional as you add each of the different data types.

What do you think has helped IMG grow over the past 10 years?

One of the critical things is that it was a joint development between a group of engineers under the leadership of Victor Markowitz (http://bit.ly/LBNL-BCG), with long experience in genomic data, and a group of biologists that had very strong genomics and bioinformatics backgrounds. Biologists provided the requirements on how the data analysis tools and workflows should be organized, and the developers implemented exactly what the biologists wanted. It’s clear there was a grand vision upfront to handle this much growth in the past 10 years. We can continue another 10 years on this current system, although we also need to start exploring new solutions for more efficient handling of the data deluge ahead.

One more of our early choices that I believe proved to be critical both for the growth and the success of the system was to offer only a single data processing option for all datasets submitted into our system. We do the annotation for the users, and we process the datasets the way we know best. Maintaining a huge system such as IMG gives you great power, and with great power comes great responsibility. I believe we’re obliged to figure out and apply the best annotation practice at any time rather than allowing users to figure out what to use and which one to choose, as some other systems do. Providing an environment where all the data are uniformly processed and annotated is of paramount value and importance.

Looking forward to the next 10 years, what are some of the challenges the IMG system will need to tackle?

Our data sets are thousands of terabytes in size and we’ll be going to petabytes soon. We need to scale at the level of hundreds of thousands of data sets and hundreds of billions of genes. Right now our user interface can support the comparison of a few hundred datasets but what we need and what researchers are asking for is to compare thousands against thousands. No one is doing something like that now. Everyone is currently comparing a metagenome against isolate genomes, but no system can efficiently provide a comparison of a metagenome against other metagenomes. Given the size of the data involved, that would take weeks and you can’t do this efficiently on a production scale (i.e. on a weekly basis) even with high performance computing (HPC) right now.

The National Energy Research Scientific Computing Center (NERSC) is a vital partner in succeeding in the era of big data. We’re already operating at the scale where processing of our data requires an HPC environment and we are very fortunate that at the JGI this is provided by NERSC. We need a bigger database and bigger computer clusters to support the growing community demand, but we also need to have the right computational environment to run our pipelines.

Another big challenge is how to support big data without sacrificing data quality. For example, annotating the metadata in the Genomes OnLine Database (GOLD) is heavily manual, but it adds tremendous value to the sequence data. Manual annotation certainly conflicts with scaling, but the availability of metadata is critical for interpreting the data we have.

How do you see IMG integrating with KBase? What are the challenges here?

The two systems have different scientific goals and overall mission and because of that they also have fundamentally different design commitments, and follow different principles in data organization and user support. For example, while IMG’s focus is on the comparative analysis of microbial genomes and metagenomes with emphasis on the interface between the two, KBase’s focus seems to be more on the isolate genome side and metabolic modeling, at least for now. System integration [...]

[...]nomes, worldwide. In terms of new directions, my expectation is that in the next decade, the biggest overhaul in the landscape of microbial genomics and metagenomics will be at the interface of the two, and therefore this is where a large part of future IMG developments will focus. In keeping with its goal of supporting the analysis of both the parts and the whole, I would like to see IMG playing a central role in enabling the identification and analysis of individual populations from environmental communities, as well as facilitating the elucidation of their role within the community.