Serial Analysis of Gene Expression (SAGE)

SAGE is a method that provides a read-out, via sequencing, of the spectrum of genes being expressed in a sample.

· A method for comprehensive analysis of gene expression patterns
· Serial sequencing of 15-bp tags unique to each gene
· Each sequencing run can generate sequence for up to 50 high-quality tags
· At least 50,000 tags are required per sample to approach saturation
· Costs upward of $5000 per sample
· Unbiased relative to microarrays:
  - Choice of reference sample
  - Hybridization artifacts
  - Clone representation

Three principles underlie the SAGE methodology:

1. A short sequence tag (10-15 bp) contains sufficient information to uniquely identify a transcript, provided that the tag is obtained from a unique position within each transcript.
2. Sequence tags can be linked together to form long serial molecules that can be cloned and sequenced.
3. Quantification of the number of times a particular tag is observed provides the expression level of the corresponding transcript.

An individual 9-bp tag is sufficient to distinguish all the genes in humans, since 4^9 equals 262,144, which is much larger than any of the proposed numbers of genes in the human genome (20,000-40,000). An extra stringency step that facilitates gene identification is that the tag must include the 3’-most anchoring site in a predicted transcript. A fraction of genes will have multiple tags due to alternative splicing near the 3’ end, or use of alternative polyadenylation sites, but for the most part these can be identified. The number of times a specific tag is found in the SAGE sequences reflects its abundance in the mRNA population.

Unlike array and chip methods, SAGE does not require a pre-existing collection of cDNA clones and ESTs. The expression information derives from SAGE tags, which are produced as part of the analysis. Sequence information is required to assign the tags to individual ORFs. However, unassigned SAGE tags are also useful (in species for which the complete genome has not been sequenced, unassigned tags will be encountered frequently). They can be used to pull out promoters from genomic clones, to provide information about coordinated gene regulation, and to identify previously unknown genes.

Quantitative comparison of SAGE samples is not always easy to interpret. A tag present in four copies in one sample of 50,000 tags and two copies in another may actually be twofold induced, or the difference may simply be due to random sampling.
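The four-versus-two example can be checked with a simple calculation. The sketch below is one possible way to do it, not a method prescribed by these notes: it assumes both samples contain 50,000 tags and uses an exact two-sided binomial test conditioned on the combined count. The function name `two_sided_binomial_p` is illustrative.

```python
from math import comb


def two_sided_binomial_p(count_a, count_b, total_a, total_b):
    """Exact two-sided p-value for a tag seen count_a and count_b times.

    Conditions on the combined count: if the tag is equally abundant in both
    samples, count_a follows Binomial(count_a + count_b, total_a / (total_a + total_b)).
    """
    n = count_a + count_b
    p = total_a / (total_a + total_b)
    probs = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = probs[count_a]
    # Sum the probability of every split that is no more likely than the observed one.
    return sum(q for q in probs if q <= observed * (1 + 1e-9))


# Assumption: the second sample also contains 50,000 tags.
p_value = two_sided_binomial_p(4, 2, 50_000, 50_000)
print(p_value)
```

A p-value this far from conventional significance thresholds means the apparent twofold difference is entirely compatible with random sampling, which is why low-count tags need deeper sequencing before they can be compared.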

Profile of the genes expressed in the human peripheral retina, macula, and retinal pigment epithelium determined through serial analysis of gene expression (SAGE). Dror Sharon, Seth Blackshaw, Constance L. Cepko, and Thaddeus P. Dryja. PNAS 2002 99(1): 315-320.

Use of Long SAGE
• Gene discovery by hybridization of SAGE tags to large genomes requires that these tags be unique to transcripts
• Used type IIS restriction enzymes that generate 21-bp tags; >99.8% of these should occur only once in the human genome
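The tag lengths mentioned in these notes (9, 15 and 21 bp) can be compared with a back-of-the-envelope calculation. The sketch below rests on stated assumptions that are not in the notes: a haploid genome of about 3 x 10^9 bp (`GENOME_BP`), both strands searched, and uniformly random sequence. Real genomes are repetitive, so the result is only a rough guide.

```python
from math import exp

GENOME_BP = 3.0e9  # assumption: approximate haploid human genome size


def tag_space(length):
    """Number of distinct sequences of the given length over A, C, G, T."""
    return 4**length


def expected_unique_fraction(length, genome_bp=GENOME_BP):
    """Poisson estimate of the fraction of tags with no additional chance match.

    Assumes uniformly random sequence and about 2 * genome_bp candidate
    positions (both strands); this is a simplification.
    """
    return exp(-2 * genome_bp / tag_space(length))


for length in (9, 15, 21):
    print(length, tag_space(length), expected_unique_fraction(length))
```

Under this model a 9-bp tag can distinguish a few tens of thousands of transcripts but is nowhere near unique in the whole genome, whereas a 21-bp tag is expected to be unique in more than 99.8% of cases, in line with the figure quoted above.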

Serial Analysis of Nuclear Transcript Accumulation (SANTA)
• Purify nuclei by flow sorting.
• Extract polyA+ RNA.
• Subject this to SAGE analysis.

A drawback of this technique is its cost: a large number of ditags must be sequenced to achieve in-depth genome coverage.
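The SAGE read-out used in these procedures reduces to two operations: taking the tag next to the 3’-most anchoring site of each transcript, and counting how often each tag occurs. The sketch below illustrates this on toy data. The anchoring site `CATG` (the NlaIII site commonly used in SAGE), the 10-bp tag length, and the sequences `gene_a` and `gene_b` are assumptions made for illustration and do not come from these notes.

```python
from collections import Counter

ANCHOR_SITE = "CATG"  # assumption: NlaIII anchoring site, not named in the notes
TAG_LENGTH = 10       # assumption: a value within the 10-15 bp range


def extract_tag(transcript, anchor=ANCHOR_SITE, tag_length=TAG_LENGTH):
    """Return the bases immediately 3' of the 3'-most anchor site, or None."""
    seq = transcript.upper()
    pos = seq.rfind(anchor)  # rfind gives the last, i.e. 3'-most, occurrence
    if pos == -1:
        return None
    start = pos + len(anchor)
    tag = seq[start:start + tag_length]
    return tag if len(tag) == tag_length else None


# Hypothetical toy transcripts (written 5'->3', polyA tail omitted).
gene_a = "TTCATGACGTACGTACGGTTT"
gene_b = "CATGAAAACATGTTGGCCAATTGG"

# A toy mRNA population: three copies of gene_a, one copy of gene_b.
population = [gene_a, gene_a, gene_a, gene_b]
tag_counts = Counter(extract_tag(t) for t in population)
print(tag_counts)
```

The tag counts mirror the composition of the toy population, which is the sense in which tag frequency reflects transcript abundance. Note that `gene_b` contains two anchoring sites and only the 3’-most one contributes a tag.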

Microbeads or LYNX technology

Combines the flexibility of hybridization-based procedures with the precision offered by tag sequencing. LYNX (www.lynxgen.com) technologies are microbeads (MegaClone), MegaSort (“fluid microarray”), and massively parallel signature sequencing (MPSS) (see Brenner et al. 2000, PNAS 97: 1665-1670).

This technology comprises three components:
• In vitro cloning and analysis of complex mixtures of DNA on microbeads.
• Physical separation of differentially expressed cDNA molecules.
• Massively parallel DNA sequencing.

Methodology
• Construction of a cDNA library on the surface of microbeads
• Every microbead displays on its surface about 100,000 copies of a specific cDNA
• Unlike bacterial clones, these clones cannot be propagated, but each microbead can be isolated and the cDNA attached to it can be analyzed.

Principle of the method

• Each nucleic acid molecule within a library is first labeled with an oligonucleotide address tag.
• The resulting library of tagged molecules is amplified using PCR.
• After amplification, a sample of n molecules yields a library containing n families of clones, each identified by a unique tag. This permits the cDNA-tag molecules to be collected on a microbead that carries the complement (antitag) to a family’s unique tag.

[Figure: tagged construct (fluorophore, cDNA, address tag, PCR handle, vector) and bead; step shown: amplify and expose address tag.]
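Whether a sample of n molecules really yields n distinct tag families depends on how n compares with the size of the tag repertoire. The sketch below estimates the expected fraction of sampled molecules that carry a tag shared with no other sampled molecule. It assumes tags are assigned uniformly at random, takes the repertoire size as 8^8 (the eight-word tags built from an eight-word vocabulary described in these notes), and uses sample sizes that are purely hypothetical.

```python
N_TAGS = 8**8  # eight-word tags drawn from an eight-word vocabulary


def expected_unique_fraction(n_sampled, n_tags=N_TAGS):
    """Expected fraction of sampled molecules whose tag is not shared.

    Assumes each molecule receives one of n_tags tags uniformly at random,
    independently of the others.
    """
    return (1 - 1 / n_tags) ** (n_sampled - 1)


# Hypothetical sample sizes, chosen only to show the trend.
for n in (50_000, 160_000, 1_000_000):
    print(n, expected_unique_fraction(n))
```

As long as the sample is a small fraction of the repertoire, nearly every molecule has its own tag, so each family of clones ends up on its own microbead. The fraction falls as the sample size approaches the repertoire size.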

• The bulk mRNA is converted to cDNA using oligo-dT primers.
• The 3’ sequences of the mRNA, with their polyA+ tails, are then cloned into a plasmid library of address tags.
• The entire library is amplified with common PCR primers, one of which introduces a fluorophore at the end of the cDNA.
• Finally, the address tags are rendered single stranded and annealed to the complementary capture oligonucleotide on the surface of microbeads.
• Ligation attaches one strand of the DNA permanently to the beads.
• During the 72 h hybridization process in the last step, each cDNA-address tag conjugate becomes anchored to its host microbead.

[Figure: workflow from mRNA; steps: convert to cDNA, clone into vector (cDNA, address tag, PCR handle), amplify, expose address tag, attach to microbeads, ligate.]

• The first step in the construction of this library is the attachment of a different 32-mer capture oligonucleotide to the surface of each microbead by combinatorial synthesis.
• Brenner et al. used a special DNA language with a vocabulary made up of eight four-base “words”: TTAC, AATC, TACT, ATCA, ACAT, TCTA, CTTT, and CAAA. Each tag consists of eight “words” (4 x 8 = 32 bp).
• Each word uses only three (A, T, and C) of the four DNA bases, and differs from all the other words in three of the four bases.
• All words form, with their respective complements, one G:C and three A:T base pairs.
• This eliminates self-complementarity within any sequence made up of the words, and prevents cleavage by any restriction enzyme whose symmetrical recognition site contains G and C.

Example (only seven words are shown):
5’-TACT.TTAC.ACAT.ATCA.CTTT.CAAA.AATC-3’
3’-ATGA.AATG.TGTA.TAGT.GAAA.GTTT.TTAG-5’

• Because all the eight-word tag/anti-tag duplexes have the same base pair composition (24 A:T and eight G:C), all the duplexes should have about the same Tm.
• Because all tag sequences differ from their nearest neighbors by at least one word, or at least three base mismatches, there is good discrimination between perfect matches and one-word mismatches.
• A total of 16,777,216 (8^8) eight-word tags are available, which is far in excess of the number of unique transcript fragments (usually about 50,000). Consequently each transcript molecule receives a unique signature, and each type of transcript will be represented by a number of different clones (microbeads) in proportion to its abundance in the transcriptome.
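The stated properties of the word vocabulary can be verified directly. The sketch below is a simple check written for these notes, not code from the original work. It confirms that every word uses only A, T and C with exactly one C, that any two words differ in three positions, that the repertoire size is 8^8, and that the lower strand of the example is the base-by-base complement of the upper strand. The helper names `hamming` and `antitag` are illustrative.

```python
from itertools import combinations

WORDS = ["TTAC", "AATC", "TACT", "ATCA", "ACAT", "TCTA", "CTTT", "CAAA"]
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def antitag(tag):
    """Base-by-base complement, read in the opposite orientation (3'->5')."""
    return tag.translate(COMPLEMENT)


only_atc = all(set(w) <= set("ATC") for w in WORDS)
one_c_each = all(w.count("C") == 1 for w in WORDS)  # one G:C pair per word
distances = {hamming(a, b) for a, b in combinations(WORDS, 2)}
n_tags = len(WORDS) ** 8

# The seven-word example from the notes, written without the dots.
example_tag = "TACTTTACACATATCACTTTCAAAAATC"
print(only_atc, one_c_each, distances, n_tags)
print(antitag(example_tag))
```

Because every pair of words differs at exactly three positions, a tag hybridizing to an anti-tag that is wrong by a single word always carries at least three mismatches, which is the basis for the discrimination described above.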

Tags were prepared by eight rounds of combinatorial synthesis wherein each word, w1-w8, is added from a separate column of a DNA synthesizer. After the eighth round, a portion of the microbeads was separated for further synthesis of a 5’ primer binding site, followed by cleavage, amplification, and insertion into the vector.

Differential expression (“fluid microarrays”)
• This approach physically extracts clones that are differentially represented in two libraries.
• After melting off the non-covalently attached DNA strand from the microbeads, the remaining strands are available for competitive hybridization with two probes labeled with different fluorophores, to produce fluid microarrays of millions of microbeads.
• The ratio of probes hybridizing to each bead can easily be measured using a fluorescence-activated cell sorter (FACS).
• The ability to generate >10^6 microbeads carrying different cDNA clones allows identification and selection of sequences differentially regulated at very low levels.
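The sorting decision in a fluid microarray can be thought of as a gate on the ratio of the two fluorophore signals measured for each bead. The sketch below is a minimal illustration of that idea and not the gating procedure of any actual instrument. The twofold threshold, the signal floor, the bead names and the intensity values are all hypothetical.

```python
from math import log2


def classify_bead(signal_a, signal_b, fold_threshold=2.0, floor=1.0):
    """Classify one bead from the signals of probe A and probe B.

    The floor keeps very weak signals from producing extreme ratios;
    both parameters are illustrative assumptions.
    """
    ratio = log2(max(signal_a, floor) / max(signal_b, floor))
    cutoff = log2(fold_threshold)
    if ratio >= cutoff:
        return "higher in A"
    if ratio <= -cutoff:
        return "higher in B"
    return "unchanged"


# Hypothetical fluorescence intensities (probe A, probe B) per bead.
beads = {
    "bead_1": (800.0, 790.0),
    "bead_2": (1200.0, 300.0),
    "bead_3": (50.0, 400.0),
}
calls = {name: classify_bead(a, b) for name, (a, b) in beads.items()}
print(calls)
```

Beads falling outside the "unchanged" gate would be the ones physically collected for further analysis. In practice the gate would be set from the distribution of ratios observed across the whole bead population.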

Massively Parallel Signature Sequencing (MPSS)
• It utilizes tagged cDNAs “cloned” on microbeads.
• The “fluid microarray” is analyzed in a flow cell.
• A novel signature sequencing approach is used:
  - Type IIS restriction endonucleases.
  - Ligation of adapters.
  - Fluorescent molecules bound to decoders.
  - Does not require gels.
  - Does not require DNA fragment sizing.
• Reading is done by a multispectral optical imager.
• The process is statistically robust.

Immobilization of the microbeads in the flow cell
• Reagents are pumped in and out through a microbead monolayer
• The flow cell is mounted on a confocal microscope
• Sequencing does not require physical separation of fragments
• Signatures are 16-20 bp in length
• Very highly parallel (many millions of sequences at one time)