Metagenomics Tools: BAC/Fosmid Libraries and Whole Genome Shotgun Sequencing

Amy Apprill
OCN 750: Molecular Methods in Biological Oceanography
November 17, 2005

Why Metagenomics?

- Limited information about physiology and functional roles is available from cultured microbes
- Phylotypes of noncultured microbes derived from rRNA provide only phylogenetic information, with no information about physiology, biochemistry, or ecological function; they are also subject to PCR-based biases
- Metagenomics allows isolation of large portions of genomes, which provide access to protein-coding genes for biochemical pathways → insight into specific physiological and ecological functions and the metabolic variability of an environment

History of Marine Metagenomics

1991: Bacteriophage lambda used as a vector to create a 10-20 kb insert shotgun library for picoplankton rRNA sequences; the library also revealed other genes of interest (Schmidt TM, DeLong EF, Pace NR, 1991)

1992: Introduction of BAC & fosmid cloning vectors from E. coli improved cloning efforts by controlling copy number; BAC vectors can replicate inserts >300 kb and display few chimeras (Shizuya et al. 1992)

1996: First environmental fosmid library, built from samples collected off the Oregon coast (Stein et al. 1996)

2000: First BAC library from marine environment (Beja et al. 2000); proteorhodopsin discovered from Monterey Bay BAC (Beja et al. 2000)

2002: Aerobic anoxygenic phototroph (AAnP) diversity uncovered from Monterey Bay BAC (Beja et al. 2002)

2005: Whole genome shotgun sequencing approach first used on marine environmental samples, from the Sargasso Sea (Venter et al. 2005)

BAC: bacterial artificial chromosome

A modified plasmid that contains an origin of replication derived from the E. coli F factor, frequently used for large-insert cloning experiments; exists within the cell very much like a cellular chromosome.

Specifics:

- 100-300 kb (even 600 kb!) inserts; 1 insert ~10-15% of a bacterial genome
- Requires large amounts of DNA (800-2000 L seawater)
- Useful for screening specific protein-coding genes and genes of uncultivated microbes
- Used to discover proteorhodopsin in several phylotypes and genes for anoxygenic photosynthesis

Marine BAC/fosmid construction - general

(DeLong 2005)

How to create a BAC from seawater:

1. Collect ~1000 L seawater
2. Pre-filter, then use TFF to pellet cells
3. Embed cell pellet in agarose
4. Lyse agarose-embedded cells
5. Prepare large DNA fragments by HindIII digestion of agarose slices
   - Run PFGE
   - Excise 150-400 kbp regions
   - Extract gel-embedded DNA
(Beja et al. 2000, Fig. 1)

How to create a BAC, cont.

6. Ligate DNA into vector (previously removed from cells)
7. Transform vector into cells using electroporation
8. Screen for phylogenetic info, purify & sequence

http://www.ptf.okstate.edu/pulser.html

Pulsed-field gel electrophoresis (PFGE) of BAC clones digested with NotI shows the size of inserts

(Beja et al. 2000, Fig. 2A)

BAC Screening: rRNA Gene Surveys using Multiplex PCR
- Digest BAC/fosmid DNA to remove E. coli chromosome
- Screen clones for rRNA genes using 3 bacterial primer sets (SSU & LSU) and -specific
- Excise amplicons from gel, purify
- Clone & sequence purified products

Phylogenetically informative multiplex PCR products describe phylogenetic groupings (Beja et al. 2000, Fig. 5)

BAC Screening: ITS-LH-PCR

Uses natural length variation in the internal transcribed spacer (ITS), and the location of the tRNA-alanine gene within the ITS, to identify unique gene fragments corresponding to phylogenetic groupings

1. Pool Plasmid-Safe-treated DNA and run PCR with fluorescently labeled SSU & LSU primers to amplify ITS & tRNA genes
2. Capillary electrophoresis compares fragment lengths to size standards
3. Sequence unknown fragments w/ ITS primers and 16S primers
(Figure 4, Suzuki et al. 2004)

BAC Screening Comparison

rRNA gene surveys
PROS:
- Sequence data; no fragment interpretation
CONS:
- Contaminating E. coli DNA
- PCR-based biases
- Not suitable for high-throughput analysis

ITS-LH-PCR
PROS:
- No direct DNA sequencing
- Easier to distinguish E. coli fragments
- High-throughput analysis
CONS:
- Multiple clones w/ overlapping size
- Disruption of ITS may occur w/ cloning
- Some groups w/o linked SSU & LSU
- PCR-based biases

Pros & Cons of BAC libraries

Pros:
- Represents 10-15% of a bacterial genome; gain info about uncultured microbes
- Functional gene presence implies physiology or ecology
- Controlled replication (at 2 copies/cell)
- Low level of chimerism

Cons:
- Requires large amounts of sample (800-2000 L sw)
- No direct phylogenetic information
- Screening may introduce PCR biases
- Expensive (time, screening)

Fosmid library
- F-factor origin-based vector
- ~40 kb DNA inserts
- Requires smaller samples (>1 L sw)
PROS: Quick; takes days compared to months to a year for BACs
CONS: Recovers fewer clones & more sheared DNA compared to BACs
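The insert sizes quoted for BACs (100-300 kb) and fosmids (~40 kb) can be turned into a rough estimate of how many clones a library needs. The sketch below is not from the lecture: it uses the standard Clarke-Carbon estimate, N = ln(1 - P) / ln(1 - f), where f is the fraction of a genome carried by one insert and P is the desired probability of recovering any given sequence. The 2,000 kb genome size and the function names are illustrative assumptions, and the formula treats a single genome, so for a mixed community it is only a lower bound.

```python
import math


def genome_fraction(insert_kb: float, genome_kb: float) -> float:
    """Fraction of one genome carried by a single insert."""
    return insert_kb / genome_kb


def clones_needed(insert_kb: float, genome_kb: float, probability: float = 0.99) -> int:
    """Clarke-Carbon estimate: N = ln(1 - P) / ln(1 - f), rounded up."""
    f = genome_fraction(insert_kb, genome_kb)
    return math.ceil(math.log(1 - probability) / math.log(1 - f))


# Assumed (hypothetical) genome size of 2,000 kb for illustration only.
GENOME_KB = 2000
for name, insert_kb in [("BAC", 200), ("fosmid", 40)]:
    f = genome_fraction(insert_kb, GENOME_KB)
    n = clones_needed(insert_kb, GENOME_KB)
    print(f"{name}: insert = {f:.0%} of genome, ~{n} clones for 99% coverage")
```

Under these assumptions a 200 kb BAC insert carries about 10% of the genome, consistent with the 10-15% figure above, while a 40 kb fosmid insert carries about 2% and therefore needs several times as many clones for the same coverage.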

Figure from Epicentre® Biotechnologies (http://www.epibio.com/item.asp?ID=278&CatID=125&SubCatID=60)

Whole genome shotgun sequencing: cloning the entire genome in a random fashion and sequencing the resultant clones

- Collect >200 L seawater, pre-filter, then concentrate by TFF or 0.22 µm filtration
- Shotgun cloning of small fragments ranging from 2-6 kb
- Shotgun assembly: a computer program searches for overlapping sequences and assembles the sequenced fragments in the correct order
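The shotgun assembly step can be illustrated with a toy greedy overlap assembler. This is a minimal sketch, not the assembler used in the studies cited here: it assumes error-free reads from one strand and simply merges, again and again, the two reads whose suffix and prefix share the longest exact overlap. The function names, the `min_len` parameter, and the example reads are all hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Merge the best-overlapping pair repeatedly until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        # Discard sequences fully contained in another sequence.
        contigs = [r for r in contigs if not any(r != o and r in o for o in contigs)]
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three made-up overlapping reads, supplied out of order.
reads = ["TACGTTAGC", "ATGGCGTA", "GCGTACGT"]
print(greedy_assemble(reads))
```

Real environmental data add sequencing errors, both strands, repeats, and many closely related genomes, which is why correct assembly is listed among the challenges of this approach.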

(DeLong 2005)

Assembled Fragments: Prochlorococcus marinus MED4

(Figure 2, Venter et al. 2005)

Whole genome shotgun sequencing

Pros:
- Lots of data
- Various phylogenetic marker genes assess diversity without PCR biases
- Unbiased identification of gene diversity
- Functional gene info implies ecology and physiology, useful for generating hypotheses

Cons:
- Challenging to assemble fragments correctly in the current context (lots of data!)
- Redundant sequencing
- Unknown order and orientation of clones
- Expensive
- Large sample size (>200 L)

Figures and tables shown: Suzuki et al. 2004 (Table 1, Figures 1-4); Venter et al. 2005 (Figures 1-7, Tables 1-3)

Whole genome shotgun sequencing success

Sargasso Sea WGS (Venter et al. 2005):

Large magnitude and total gene count
- 1.045 billion base pairs of non-redundant sequence
- 1,625 Mb of DNA sequence
- 1,214,207 new genes identified

New discoveries

- 1,800 new microbial species
- 148 previously unknown bacterial phylotypes
- 782 new rhodopsin-like photoreceptors
- Open ocean Burkholderia and Shewanella presence (??)
- Archaea with amo gene (followed up by Francis et al. 2005)

What we can learn from marine BAC libraries

Apparent taxonomic affiliation of protein-encoding genes from different depths in Monterey Bay (DeLong 2005).

Published Metagenomics studies

DeLong 2005