THE PRIMER
Summer 2017, volume 14, issue 3

In this issue
Filling the Genomic Encyclopedia of Microbes, page 2
Big Data Collaborations, page 3
Fungal Enzyme Complexes, page 4
Regulating Fungal Gene Expression, page 5
Science Highlights, pages 6-7
2018 JGI User Meeting, page 8

Expanding Minimum Information Genome Standards

(Artistic rendering by Zosia Rostomian, Berkeley Lab Creative Services)

During the Industrial Revolution, the establishment of standards (e.g., sizes of nuts and bolts) allowed builders to produce supplies in bulk, maintain production quality and fuel interstate commerce. The importance of standards is dramatically illustrated when they don’t exist or are not commonly accepted.

More than a century after the Industrial Revolution, advances in DNA sequencing technologies have caused similarly dramatic shifts in scientific research, and one affected area is the study of the planet’s biodiversity. In work published August 8, 2017 in Nature Biotechnology, an international team led by researchers at the DOE Joint Genome Institute (JGI), a DOE Office of Science User Facility, has developed standards for the minimum metadata to be supplied with single amplified genomes (SAGs) and metagenome-assembled genomes (MAGs) submitted to public databases.

Microbes play crucial roles in regulating global cycles involving carbon, nitrogen, and phosphorus among others, but many of them remain uncultured and unknown. Learning more about this so-called “microbial dark matter” involves sequencing metagenomes or amplified DNA of single cells, then bioinformatically extracting genomes. As genomic data production has ramped up over the past two decades, using various platforms around the world, scientists have (continued on page 4)

Uncovered: 1000 New Microbial Genomes

“Bacteria and archaea comprise the largest amount of biodiversity of free-living organisms on Earth,” said JGI’s Prokaryote Super Program head Nikos Kyrpides. “They have already conquered nearly every environment on the planet, so they have found ways to survive under the harshest of conditions with different enzymes and with different biochemistry.”

JGI scientists have taken a decisive step forward in uncovering the planet’s microbial diversity. In a paper published June 12, 2017 in Nature Biotechnology, Kyrpides and his team of researchers report the release of 1,003 phylogenetically diverse bacterial and archaeal reference genomes, the single largest release to date.

The effort is part of the JGI’s Genomic Encyclopedia of Bacteria and Archaea (GEBA) initiative that aims to sequence thousands of bacterial and archaeal genomes to fill in unexplored branches of the tree of life. Microbes play important roles in regulating Earth’s biogeochemical cycles (continued on page 2)

Genomic Encyclopedia

New Microbial Genomes continued from page 1

processes that govern nutrient circulation in terrestrial and marine environments, for example. Uncovering the functions of genes, enzymes and metabolic pathways through genome sequencing and analysis has wide applications in the fields of bioenergy, biomedicine, agriculture and environmental sciences.

“In addition to identifying over half a million new protein families, this effort has more than doubled the coverage of phylogenetic diversity of all type strains with genome sequences,” said Supratim Mukherjee, a JGI computational biologist and co-first author of the paper.

Since a great portion of research in microbial genomics has been focused on human pathogens or biotechnological work horses, GEBA is the main effort worldwide attempting to address the phylogenetic coverage knowledge gap by sequencing a diverse set of cultured but poorly characterized microbial type strains.

“It was recognized that we weren’t sampling many parts of the tree of life,” said Rekha Seshadri, a JGI computational biologist and co-first author of the paper. “And if we sampled some of those parts of the tree, we’d discover new functions, which could be an important resource for new applications.”

The release of these genomes is the culmination of almost a decade’s worth of work, with the first 56 GEBA genomes published in 2009. The microorganisms were isolated from environments ranging from sea water and soil to plants, cow rumen and termite guts. Genome sequencing and analysis was done at the JGI through the Community Science Program, and the 1,003 genomes are publicly available through the Integrated Microbial Genomes with Microbiomes (IMG/M) system, with all associated metadata in compliance with the Genomic Standards Consortium available through the Genomes OnLine Database.

Seshadri said that with the release of high quality genomic information, the JGI is providing a wealth of new sequences that will be invaluable to scientists interested in experiments such as characterizing biotechnologically relevant secondary metabolites or studying enzymes that work under specific conditions. And because Kyrpides’ research team sequenced type strains that are readily available from culture collections, scientists can perform follow-up experiments with them in the lab, she added.

“The partnership with culture collection centers such as the Leibniz Institute DSMZ in Germany and the ATCC Global Bioresource Center in the U.S. was critical in accomplishing this endeavor,” said Kyrpides. “At a time when throughput can come at the cost of quality, resulting in highly fragmented and chimeric or contaminated genomes, the significance of genomes from the type strains as invaluable taxonomic signposts cannot be overstated.”

(Artistic rendering by Zosia Rostomian, Berkeley Lab Creative Services)

Full story at http://jgi.doe.gov/uncovered-1000-new-microbial-genomes/.

2 / summer 2017 / volume 14 / issue 3

FICUS

Six Proposals Meld Genomics, Supercomputing in First JGI-NERSC Call

NERSC’s supercomputer Cori. (Roy Kaltschmidt, Berkeley Lab)

Six proposals were selected to participate in a new partnership between the JGI and the National Energy Research Scientific Computing Center (NERSC) through the “Facilities Integrating Collaborations for User Science” (FICUS) initiative.

“This FICUS program represents a major milestone for the JGI, as for the first time we provide a service to the community, not based on products (i.e. DNA sequencing or synthesis) but on data alone,” said Prokaryote Super Program head Nikos Kyrpides. “As Microbiome research is rapidly moving towards Data Science we anticipate that the demand for this program will also increase.”

The JGI-NERSC FICUS call is the latest partnership since the collaborative science initiative was formed in 2014 by the Office of Biological and Environmental Research (BER) to harness the combined expertise and resources of two of the national user facilities stewarded by the DOE Office of Science in support of DOE’s energy, environment, and basic research missions. The expertise and capabilities available at these national user facilities, both at the Lawrence Berkeley National Laboratory (Berkeley Lab), will help researchers explore the wealth of genomic and metagenomic data generated worldwide through access to supercomputing resources and computational science experts to accelerate discoveries. Through the JGI-NERSC FICUS call, users can query across all available data to look for patterns across data sets in the JGI’s Integrated Microbial Genomes and Microbiomes (IMG/M) database with the help of NERSC’s supercomputer Cori, resulting in a more powerful analysis with increased capacity for novel discoveries.

The accepted proposals come from:
• Patricia (Patsy) Babbitt of the University of California (UC), San Francisco (See sidebar)
• David Baker at the University of Washington
• Phillip Brooks of UC Davis
• Ed DeLong of the University of Hawaii at Manoa
• Steve Hallam of Canada’s University of British Columbia
• Kostas Konstantinidis of Georgia Institute of Technology

The full list of approved proposals is available at http://jgi.doe.gov/our-projects/csp-plans/fy-2017-csp-plans/#jgi-nersc.

“I really believe the future of computing is going to be dominated by biology. The volumes of biological data that need to be synthesized, aggregated and interrogated will require supercomputers,” said Kjiersten Fagnan, both JGI’s Chief Informatics Officer and NERSC’s Data Science Engagement Group Lead. “If you look at the data sets being generated and the questions that people have, you can see that researchers are going to have to combine different datasets (like genomics, metabolomics, protein crystal structures and potentially even brain scans and more) to find answers. This work cannot be done on a laptop or small cluster.”

Full story at http://jgi.doe.gov/doe-user-facilities-ficus-join-forces-to-tackle-biology-big-data/.

Sidebar: A sequence similarity network of a family of enzymes from the nitroreductase superfamily (some nitroreductases can reduce TNT or trinitrotoluene, a significant soil contaminant). Nodes represent enzyme sequences, while edges represent pairwise similarities more significant than 1e-42 (BLAST E-value). Red and blue nodes represent enzymes found in public sequence databases and belong to two sub-families, and white nodes represent sequences found only in the JGI’s metagenomic database (IMG/M). Large nodes represent experimentally-characterized enzymes of diverse functions. Notably, a significant expansion of the sequence space is observed (from 300 enzymes to >10,000), revealing a new potential group of enzymes found only in IMG/M. “Linkers” that are also unique to metagenomes display sequence similarity to experimentally-characterized enzymes of diverse functions and serve as attractive targets for synthesis and biochemical assays for intermediate function. (Eyal Akiva and Patsy Babbitt)
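For readers curious how a sequence similarity network like the one in the sidebar is put together, the following is a minimal sketch, not the method used by Akiva and Babbitt. It assumes pairwise BLAST results are already available as (sequence, sequence, E-value) tuples; the sequence names, the function names build_network and connected_groups, and the example values are all hypothetical. Only the 1e-42 cutoff comes from the sidebar caption.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 1e-42  # significance threshold quoted in the sidebar caption


def build_network(hits, cutoff=E_VALUE_CUTOFF):
    """Build an undirected graph from (seq_a, seq_b, e_value) tuples.

    An edge is kept only when the E-value is smaller (more significant)
    than the cutoff. Every sequence becomes a node, even with no edges.
    """
    graph = defaultdict(set)
    for seq_a, seq_b, e_value in hits:
        graph[seq_a]  # touching the key registers the node
        graph[seq_b]
        if seq_a != seq_b and e_value < cutoff:
            graph[seq_a].add(seq_b)
            graph[seq_b].add(seq_a)
    return dict(graph)


def connected_groups(graph):
    """Return connected components (candidate enzyme groups) as sets."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(graph[node] - seen)
        groups.append(group)
    return groups


# Hypothetical hits: "pub_*" stands for public-database sequences,
# "img_*" for sequences found only in IMG/M metagenomes.
example_hits = [
    ("pub_A", "pub_B", 1e-60),
    ("pub_B", "img_1", 1e-45),
    ("img_1", "img_2", 1e-30),  # weaker than the cutoff, so no edge
    ("img_2", "img_3", 1e-50),
]
network = build_network(example_hits)
groups = connected_groups(network)
```

In this toy input, img_2 and img_3 end up in a group with no public-database members, loosely analogous to the metagenome-only group described in the sidebar. At the scale of more than 10,000 sequences the all-against-all comparison is the expensive step, which is where supercomputing resources would plausibly come in.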

summer 2017 / volume 14 / issue 3 / 3

Early Fungal Lineages

Fungal Enzyme Complexes Formed to Break Down Cellulose

Scanning electron micrograph (SEM) of Neocallimastix californiae. (Chuck Smallwood, PNNL)

Cost-effectively breaking down bioenergy crops into sugars that can then be converted into fuel is one of the biggest barriers in the commercial production of sustainable biofuels. To reduce this barrier, bioenergy researchers are looking to nature and the estimated 1.5 million species of fungi that, collectively, can break down almost any substance on earth, including plant biomass.

Now a team led by researchers at the University of California (UC), Santa Barbara has found for the first time that early lineages of fungi can form complexes of enzymes capable of degrading plant biomass. By consolidating these enzymes, in effect into protein assembly lines, they can team up to work more efficiently than they would as individuals.

“There are protein complexes in bacteria called cellulosomes that pack together the enzymes to break down plant biomass,” said UC Santa Barbara’s Michelle O’Malley, senior author of the study published May 26, 2017 in Nature Microbiology. “The idea is that these clusters are better at attacking biomass because they are keeping the different enzymes in place with plugs called dockerins so they work more efficiently. This has been detailed in bacteria for more than 20 years, but now seen for the first time in fungi.”

The study involved a comparative genomic analysis of five fungi that belong to the Neocallimastigomycetes, a clade of the early-diverging lineages that is not well studied.

Read the full story at http://jgi.doe.gov/fungal-enzymes-team-up-efficiently-break-down-cellulose/.

Genome Standards continued from page 1

worked together to establish definitions for terms such as “draft assembly” and data collection standards that apply across the board. One critical item that needs standardization is “metadata,” defined simply as “data about other data.”

“Over the last several years, single-cell genomics has become a popular tool to complement metagenomics,” said study senior author Tanja Woyke, head of the JGI Microbial Program. “Starting in 2007, the first single-cell genomes from environmental cells appeared in public databases and they are draft assemblies with fluctuations in the data quality. Metagenome-assembled genomes have similar quality challenges. For researchers who want to conduct comparative analyses, it’s really important to know what goes into the analysis. Robust comparative genomics relies on extensive and correct metadata.”

In their paper, Woyke and her colleagues proposed four categories of genome quality. Low-Quality Drafts would be less than 50 percent complete, with minimal review of the assembled fragments, but would still be required to be less than 10 percent contaminated with non-target sequence. Medium-Quality Drafts would be at least 50 percent complete, with minimal review of the assembled fragments and less than 10 percent contamination. High-Quality Drafts would be more than 90 percent complete with the presence of the 23S, 16S and 5S rRNA genes, as well as at least 18 tRNAs, and with less than 5 percent contamination. The Finished Quality category is reserved for single contiguous sequences without gaps and with less than 1 error per 100,000 base pairs, the same standard applied to isolate genomes.

The JGI has generated approximately 80 percent of the over 2,800 SAGs and more than 4,500 MAGs registered on the JGI’s Genomes OnLine Database (GOLD). JGI scientist and study first author Bob Bowers said many of the SAGs already in GOLD would be considered Low-Quality or Medium-Quality Drafts. While these are highly valuable datasets, for some purposes, researchers might prefer to use High-Quality or Finished datasets.
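The four quality categories described above can be read as a simple decision rule. The sketch below is one illustrative reading of the thresholds as summarized in this article, not an official implementation of the published standard. The GenomeStats record, its field names, the classify function and the “Unclassified” label (for genomes at or above 10 percent contamination, which the article does not place in any category) are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

REQUIRED_RRNA = {"23S", "16S", "5S"}


@dataclass
class GenomeStats:
    """Hypothetical summary record for one SAG or MAG."""
    completeness: float              # percent, 0-100
    contamination: float             # percent, 0-100
    rrna_genes: Tuple[str, ...] = ()
    trna_count: int = 0
    contigs: int = 1
    gaps: int = 0
    errors_per_100kb: Optional[float] = None  # None when not assessed


def classify(g: GenomeStats) -> str:
    """Assign a quality category using the thresholds quoted in the article."""
    # Finished: one contiguous sequence, no gaps, under 1 error per 100,000 bp.
    if (g.contigs == 1 and g.gaps == 0
            and g.errors_per_100kb is not None and g.errors_per_100kb < 1):
        return "Finished"
    # Assumption: the article gives no category for >= 10 percent contamination.
    if g.contamination >= 10:
        return "Unclassified"
    # High-Quality Draft: >90% complete, <5% contaminated, all three rRNA
    # genes present, and at least 18 tRNAs.
    if (g.completeness > 90 and g.contamination < 5
            and REQUIRED_RRNA <= set(g.rrna_genes) and g.trna_count >= 18):
        return "High-Quality Draft"
    # Medium-Quality Draft: at least 50% complete, <10% contaminated.
    if g.completeness >= 50:
        return "Medium-Quality Draft"
    # Low-Quality Draft: under 50% complete, <10% contaminated.
    return "Low-Quality Draft"
```

Note that under this reading a genome that is 95 percent complete but lacks one of the rRNA genes drops to Medium-Quality Draft, which is consistent with Bowers’ remark that many existing SAGs would land in the lower categories.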

4 / summer 2017 / volume 14 / issue 3

Finding a Major Gene Expression Regulator in Fungi

Linderina pennispora (ZyGoLife Research Consortium, Flickr, CC BY-SA 2.0)

Though fungi have been around for a billion years and collectively are capable of degrading nearly all naturally-occurring polymers and even some human-made ones, most of the species that have been studied belong to just two phyla, the Ascomycota and Basidiomycota. The remaining 6 groups of fungi are classified as “early diverging lineages,” the earliest branches in fungal genealogy.

“By and large, early-diverging fungi are very poorly understood compared to other lineages. However, many of these fungi turn out to be important in a variety of ways,” said JGI analyst Stephen Mondo (See page 4). In the May 8, 2017 issue of Nature Genetics, a team led by Mondo and his JGI colleagues reported the prevalence of a marker for functionally important genes in early diverging fungi.

Changing a single letter, or base, in an organism’s genetic code can lead to changes in protein structures and functions, impacting its traits. In addition, though, subtler changes can and do happen, involving modifications of the DNA bases themselves. The best-known example of this kind of change is a methylation of the base cytosine at the 5th position on its aromatic ring (5mC). In eukaryotes, a less-well known modification involves adding a methyl group at position 6 of adenine (6mA).

“This is one of the first direct comparisons of 6mA and 5mC in eukaryotes, and the first 6mA study across the fungal kingdom,” said JGI Fungal Genomics head and senior author Igor Grigoriev. “6mA has been shown to have different functions depending on the organism. For example, in animals it is involved in suppressing transposon activity, while in algae it is positively associated with gene expression. Our analysis has shown that 6mA modifications are associated with expressed genes and is preferentially positioned based on gene function and conservation.”

Watch Stephen Mondo on the proposed role of 6mA in early diverging fungi at the 2017 Genomics of Energy & Environment Meeting at http://bit.ly/JGI2017Mondo.

Read the full story at http://jgi.doe.gov/finding-major-gene-expression-regulator-fungi/.

Genome Standards continued from page 4

Moving from a proposal in print to implementation requires community buy-in. Woyke and Bowers conceived of the minimum metadata requirements for SAGs and MAGs as extensions to existing metadata standards for sequence data, referred to as “MIxS,” developed and implemented by the Genomic Standards Consortium (GSC) in 2011. The GSC is an open-membership working body that ensures the research community is engaged in the standards development process and includes representatives from the National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI). This is important since these are the main data repositories where the minimum metadata requirements are implemented. By working directly with the data providers, the GSC can assist both large-scale data submitters and databases to align with the MIxS standard and submit compliant data.

Nikos Kyrpides, head of the JGI Prokaryote Super Program and GSC Board member, noted that JGI has been involved in organizing the community to develop genomic standards as part of its core mission. “The GSC has been instrumental in bringing the community together to develop and implement a growing body of relevant standards. In fact, the need to expand MIxS to uncultivated organisms was identified in one of the recent GSC meetings at the JGI.”

“These extensions complement the MIxS suite of metadata standards by defining the key data elements pertinent for describing the sampling and sequencing of single-cell genomes and genomes from metagenomes,” said GSC President and study co-author Lynn Schriml of the Institute of Genome Sciences at University of Maryland School of Medicine. “These standards open up a whole new area of metadata data exploration as the vast majority of microbes, referred to as microbial dark matter, are currently not described within the MIxS standard.”

Read the full story at http://jgi.doe.gov/defining-standards-genomes-uncultivated-microorganisms/.

summer 2017 / volume 14 / issue 3 / 5

Science Highlights

Hot spring at Yellowstone National Park. (Paul Blainey, Christina Mork and Geoffrey Schiebinger)

New Technology to Access Microbial Dark Matter

The majority of the planet’s microbial diversity remains uncultivated, and their genes and metabolic functions could have potential applications in fields ranging from bioenergy to biotechnology to environmental research. There are thousands of microbial datasets in the JGI’s IMG/M system, and many of them have been uncovered through the use of metagenomic sequencing and single-cell genomics. Despite their utility, these techniques have limitations: single-cell genome amplifications are time-consuming, and often incomplete, and shotgun sequencing generally works best if the environmental sample is not too complex.

In eLife, Stanford University researchers led by Stephen Quake reported the development of a microfluidics-based, mini-metagenomics approach to mitigate these challenges. They extracted 29 novel microbial genomes from Yellowstone hot spring samples while still preserving single-cell resolution to enable accurate analysis of genome function and abundance. Applying this new technology to additional sample sites will add to the range of hitherto uncharacterized microbial features with potential DOE mission applicability. The work was enabled by the JGI’s Emerging Technologies Opportunity Program (ETOP).

Tracking Microbial Succession in Petroleum Wells

While shifting toward more sustainable, alternative energy sources, people still rely on fossil fuels for energy and transportation fuels. Understanding how microbial communities in subsurface petroleum reservoirs are impacted by human activity, and how their responses can sour oil production, can influence oil industry practices. Microbes are invisible to the naked eye, but play key roles in maintaining the planet’s biogeochemical cycles.

In the North Sea is an offshore petroleum reservoir managed by a joint venture of which Shell, one of the world’s largest oil companies, is a partner. Over the past 15 years, 32 oil wells have been drilled, reaching depths over 2 kilometers below the seafloor to extract petroleum from this deep subsurface reservoir. To learn more about how some of these subsurface microbial populations respond to disruptions in their environment, researchers in the petroleum industry conducted a comparative genetic analysis of the microbial communities in multiple oil wells within an offshore oil field. During the analyses, reported in The ISME Journal, tools developed at the JGI and made available through the IMG/M system were utilized. The data generated provide insights into just how deep subsurface microbial communities are perturbed by active oil wells injecting foreign substances into these previously isolated populations, and help industry researchers develop new techniques for managing microbiological problems.

Lessons from Simulating a Deep Ocean Oil Spill

The 2010 Deepwater Horizon oil spill released 4.1 million barrels of oil into the Gulf of Mexico and was the first major release of oil and natural gases into the deep ocean (1,500 meters). Due to the depth of the spill, vast plumes of small oil droplets remained trapped deep in the ocean (900-1,300 meters) where they underwent biodegradation by the local microbial community. Until now, researchers have been puzzled over the metabolic capabilities driving the shifts between microbial communities to degrade the crude oil. In the Proceedings of the National Academy of Sciences, a team led by Berkeley Lab researchers has been able to present the first complete picture of how successive waves of microbial populations degraded the released oil. They were also able to recover high-quality genomes of the key microbial players, and determine the metabolic factors driving the shifts between microbial communities.

6 / summer 2017 / volume 14 / issue 3

Identifying the microbes involved in degrading hydrocarbons (the chief components of petroleum and natural gas), as well as the drivers that trigger successive waves of microbial responders, allows researchers to better understand how the microbial community adapted to the events of seven years ago. The expertise and resources used to reconstruct the microbial genomes demonstrate how new technology development through ETOP at the JGI is enabling energy and environment research. They discovered that the microbial community transformed with chemical changes in residual oil. Additionally, their lab-based method allowed for the first time the successful resolution of high-quality genomes and the characterization of functional capabilities for all the key microbes.

A Gene that Influences Grain Yields in Grasses

Field of Setaria viridis growing in western Nebraska. (Pu Huang)

Setaria species are related to several candidate bioenergy grasses including switchgrass and Miscanthus. As model systems to study grasses that photosynthetically fix carbon from CO2 through a water-conserving (C4) pathway, the genomes of both green foxtail (Setaria viridis) and foxtail millet (S. italica) have been sequenced and annotated through the JGI’s Community Science Program. A team led by Tom Brutnell at the Donald Danforth Plant Science Center and including JGI researchers reported in Nature Plants that they had identified genes that may play a role in flower development on the panicle of green foxtail, highlighting the utility of S. viridis as a model crop. Panicle development is critical for determining grain yield, which in turn is crucial to food crops as well as candidate crops for producing renewable and sustainable fuels. A homologous gene in maize was identified as playing a similar role, illustrating the value of model systems in finding genes involved in important properties in potential bioenergy-relevant plants.

Mutant Rice Database for Bioenergy Research

Genome-wide distribution of fast neutron-induced mutations in the Kitaake rice mutant population. (Guotian Li and Rashmi Jain)

For more than half of the world’s population, rice is the primary staple crop. As a grass, it is a close relative of the candidate bioenergy feedstock switchgrass. A team led by University of California, Davis, and including researchers at the JGI and the Joint BioEnergy Institute (JBEI), a DOE Bioenergy Research Center, has assembled the first major large-scale collection of mutations for grass models. They used the model rice cultivar Kitaake (Oryza sativa L. ssp. japonica), and compared the genes against the reference rice genome of another japonica cultivar, Nipponbare, available on the JGI Plant Portal Phytozome. Boosting yields of bioenergy feedstock crops such as grasses requires a better understanding of how enzymes and proteins synthesize plant cell walls in order to modify the processes and the composition. The team’s goal is to have a functional genomics resource for grass models involved in plant cell wall biosynthesis studies. Until now, mutant collections for grass models have lagged behind those available for the Arabidopsis model system.

Fast-neutron irradiation, or exposure to high-energy neutrons, induces a wide variety of mutations by making changes in DNA. Using this approach, rice researchers were able to create the first major, large-scale collection of mutations for grass models. Resequencing the 1,504 mutants has allowed researchers to identify structural variants and mutations, providing an invaluable resource for grass models being used to improve candidate bioenergy feedstock crops such as switchgrass. Information on this new, large-scale collection of more than 90,000 mutations affecting nearly 60 percent of all rice genes is available on a publicly accessible database called KitBase. This comprehensive resource could help identify rice lines with mutations in specific genes and characterize gene function.

To learn more about each of these stories, go to the JGI’s Science Highlights page: http://jgi.doe.gov/category/science-highlights/.

summer 2017 / volume 14 / issue 3 / 7

Join us in San Francisco for the 13th Annual Genomics of Energy & Environment Meeting, hosted by the U.S. Department of Energy Joint Genome Institute at the Hilton San Francisco Union Square.

March 13–16, 2018. Register now: http://usermeeting.jgi.doe.gov/

We welcome any and all researchers and students pursuing frontier energy and environmental genomics research, including all current JGI Community Science Program (CSP) users as well as investigators considering an application for future CSP calls.

AGENDA: Workshops will be held Tuesday, March 13 and Wednesday, March 14. The Meeting opening keynote address is at 5pm on Wednesday evening, followed by the first poster session. The main sessions take place all day Thursday, March 15 through Friday afternoon, March 16.

KEYNOTE SPEAKERS: John Ioannidis, Stanford; Victoria Orphan, Caltech; Kristala Prather, MIT.

TOPICS: Microbial genomics, fungal & algal genomics, metagenomics, and plant genomics; genome editing, secondary metabolites, pathway engineering, synthetic biology, high-throughput functional genomics, high-performance computing applications and societal impact of technological advances.

In parallel with the JGI User Meeting on March 14–15 will be the first ever Viral EcoGenomics and Applications (VEGA) 2018 Symposium: Big data approaches to help characterize Earth’s Virome. This Symposium brings together the “viral ecogenomics” community to foster discussions about how to best capture and characterize uncultivated viruses, understand the role of viruses in natural ecosystems, and functionally explore viral genetic diversity.

Nikos Kyrpides Named ASM’s 2018 USFCC/J. Roger Porter Awardee

Nikos Kyrpides, JGI’s Prokaryote Super Program head, has been selected as the 2018 USFCC/J. Roger Porter Award recipient by the American Society for Microbiology (ASM). This award recognizes outstanding efforts by a scientist who has demonstrated the importance of microbial biodiversity through sustained curatorial or stewardship activities for a major resource by the scientific community. For more than a decade, Dr. Kyrpides and his JGI colleagues have been working toward developing a comprehensive genomic catalog by sequencing the genome of at least one representative of every bacterial and archaeal species. (More on page 1)

Dr. Kyrpides will receive his award at the 2018 ASM Microbe Meeting in Atlanta, Ga.

Contact The Primer
David Gilbert, Managing Editor
[email protected]
Massie Santos Ballon, Editor

17-JGI-4277

8 / summer 2017 / volume 14 / issue 3