ABOUT EA

EA, located in Durham, North Carolina, provides cutting-edge genomic sequencing, expression, genotyping, and bioinformatics services to the world’s largest pharmaceutical companies, diagnostic test developers, government agencies, and academic labs. All projects are conducted under clinical-grade quality control, ensured through CLIA certification, GLP compliance, and adherence to CLSI guidelines. EA’s bioinformatics staff are key contributors to the Food and Drug Administration’s MicroArray Quality Control (MAQC) and Sequencing Quality Control (SEQC) studies, which aim to improve standards and quality measures for reliable use of next-generation sequencing and gene expression technologies in clinical practice and regulatory decision-making. As part of its mission to improve human health, EA has donated more than $2.2 million towards academic genomic research grants and its “Leave Your Fingerprint on the Cure” Program for pediatric cancer hosted at the American Society for Human Genetics annual meeting.

HOW TO NAVIGATE

A. Click on items under each “Table of Contents” to link to specific sections within the guide.
B. All EA logos will bring you to ExpressionAnalysis.com.
C. Use the arrows in the bottom tool bar to flip back and forth through pages or jump to the beginning or end of the guide, or simply click on the bottom corner of a page to turn the page.
D. Select “Thumbnails” from the left hand tab to scroll through all pages and select which one you’d like to view.
E. Use the “Search” tab in the left hand tool bar or the Search Field in the top tool bar to search for a particular keyword.
F. Click any page to zoom in or use the minus and plus signs located in the bottom tool bar.
G. Use the buttons in the top tool bar to:
• Download a PDF
• Print the guide
• Share the ebook through email or other social media channels
• Change the settings of your guide
• Bookmark a particular page of the guide


www.ExpressionAnalysis.com • 919-405-2248 • 866-293-6094

TABLE OF CONTENTS

Specimen Quality Control
• LabChip Assay—RNA QC Analysis

Expression Profiling
• ENZO: Single-Round RNA Amplification and Biotin Labeling System
• Ambion WT Expression
• Illumina Expression Specimen Probe Output Guide
• Illumina Expression Control Probe Output Guide
• Illumina Summary Expression QC Report Guide

Genotyping
• Affymetrix Genome-Wide SNP 6.0 Assay
• Illumina Genotyping Assay
• Illumina Genotyping QC Summary Report Guide
• Illumina Genotyping QC Summary SNP Report Guide

Sequencing
• mRNA Seq
  o Analysis Pipeline
• PacBio RS Sequencing
  o PacBio Microbial Assembly
  o Variant Calling with PacBio RS Sequencing
• Sequencing Protocols

Sequence Enrichment: Coming Soon!

LabChip Assay—RNA QC Analysis

TABLE OF CONTENTS

• Quantitation
• Integrity
• Comparing Data Between the BioAnalyzer and LabChip

A key to the success of your RNA Expression project is starting with an adequate amount of high quality RNA. With this in mind, EA will determine the concentration and integrity of each of your RNA samples. Our laboratory procedures have been extensively characterized using specific amounts of high quality RNA. By assessing the amount and quality of your RNA samples in our laboratory as a precursor to your chosen expression assay, we can identify samples that are unlikely to perform well. It is helpful for us to obtain the QC results that you have obtained in your laboratory on your specimens, but since our assays have been optimized using our laboratory equipment it is nonetheless beneficial for EA to perform an independent assessment. EA therefore requires a small amount of additional sample to perform these tests.

You will be promptly notified in the event that any specimens do not pass our specifications. At your discretion, we can continue forward with the assay, discontinue testing, or suspend your project until you send a replacement. If you choose to discontinue testing you will only be charged for the quality assessment testing that was performed.

QUANTITATION

EA quantitates all incoming RNA specimens and reports the results to you electronically, in most cases as part of your final data package. However, if any specimens have an insufficient amount of RNA for your chosen assay, Project Management will forward the Test Result Report (TRR) to you for your review and will provide options for next steps. RNA specimens are generally quantitated by spectrophotometry (e.g., NanoDrop), although in specific cases we may use fluorometric methods (e.g., RiboGreen, Qubit). The actual method used is not typically reported with the quantitation results, but is tracked internally.

INTEGRITY

EA uses one of two methods to determine the integrity of your RNA specimens: the Agilent BioAnalyzer (BA) or the Caliper LabChip. Both devices use the same core technology: microfluidic-based electrophoretic separation of nucleic acids according to size. The resulting data from the BA and the LabChip are therefore very similar. However, during extensive testing EA has identified certain differences that result primarily from the use of different algorithms to summarize RNA integrity (see the comparative data below). Specifically, the Agilent BioAnalyzer calculates an RNA Integrity Number, or RIN value, by examining specific areas of the electrophoretic trace. The Caliper LabChip uses a similar approach to calculate an RNA Quality Score, or RQS value. However, the specific areas examined and the details of the calculation differ between the RIN and RQS values, and as a result, RQS and RIN values will not be the same for any given RNA sample. EA will report RNA integrity as a RIN or RQS value in the TRR, to indicate which of the two methods was used. EA continues to recommend a RIN ≥ 7.0 for most assays; the equivalent RQS value is ≥ 6.44. Specimens with RIN or RQS values below the recommended value will require your approval prior to proceeding to the assay. The figures below show representative electrophoretic traces from the two instrument platforms.
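The recommended cutoffs can be written as a small check. This is only an illustrative sketch: the function name `integrity_passes` and the lookup table are hypothetical and not part of any EA software; only the two cutoff values (RIN ≥ 7.0, RQS ≥ 6.44) come from the text.

```python
# Recommended minimum integrity scores quoted in this guide.
RECOMMENDED_MINIMUM = {"RIN": 7.0, "RQS": 6.44}


def integrity_passes(metric: str, value: float) -> bool:
    """Return True if the score meets the recommended minimum for its metric.

    `metric` is "RIN" (Agilent BioAnalyzer) or "RQS" (Caliper LabChip).
    Hypothetical helper for illustration only.
    """
    return value >= RECOMMENDED_MINIMUM[metric.upper()]


if __name__ == "__main__":
    # Hypothetical scores, not EA data.
    for metric, value in [("RIN", 7.4), ("RQS", 6.5), ("RQS", 6.2)]:
        status = "pass" if integrity_passes(metric, value) else "needs client approval"
        print(f"{metric} {value}: {status}")
```

Keeping the cutoff per metric, rather than converting one score into the other, reflects the point that RIN and RQS are calculated differently and are not interchangeable.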

Figure 1: Example of a BA trace

Figure 2: Example of a LabChip trace

[Trace axes: Fluorescence versus Aligned Time (sec)]

COMPARING DATA BETWEEN THE BIOANALYZER AND LABCHIP

EA compared BA and LabChip results directly by analyzing each of 83 different RNA samples in triplicate on the two instruments. Specimen types included:
• 22 human blood specimens
• 15 human cell line specimens (HEK and HeLa)
• 18 human nasal brush specimens
• 16 human liver specimens
• 12 mouse tissue specimens (4 each of brain, stomach and liver)

The RQS and RIN values are plotted in Figure 3. An exponential fit produced a slightly higher correlation (0.91) than a linear fit (0.87).

Figure 3: Scatter Plot of LabChip RQS and Bioanalyzer RIN values across 83 specimens (axes: RQS and RIN)

The maximum difference between the RQS and the RIN was 2.5 points, and in most cases the largest differences were observed for highly degraded samples. The vast majority of specimens (93%) had an absolute difference between RIN and RQS values of 1.5 or less. On the whole, RQS values tend to be slightly lower than RIN values for higher quality RNAs analyzed in parallel, while for lower quality RNAs the RQS value tends to be slightly higher. We also characterized variability in RIN and RQS values when samples were repeatedly measured on the same platform, and found that the maximum coefficient of variation (CV) for RIN values was 20%, whereas the maximum CV for RQS values was 14%.
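For readers less familiar with the coefficient of variation, the sketch below shows one common way to compute it for replicate measurements. The replicate values are hypothetical, and the use of the sample standard deviation is an assumption; the text does not state which form EA used.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the CV as a percentage: sample standard deviation / mean * 100."""
    return 100.0 * stdev(values) / mean(values)


# Hypothetical triplicate RIN readings for one specimen (not EA data).
replicate_rin = [8.0, 9.0, 10.0]
print(f"CV = {coefficient_of_variation(replicate_rin):.1f}%")  # CV = 11.1%
```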

EA has long recommended a RIN value ≥ 7.0 for most expression assays because experience has demonstrated that higher quality RNAs perform more predictably in all phases of assay processing. According to our analysis, the analogous RQS value is ≥ 6.44.

Enzo Single-Round RNA Amplification and Biotin Labeling System

TABLE OF CONTENTS

• Highlights
• Assay Summary
• Input Specifications
• Output Specifications
• Frequently Asked Questions
• References


HIGHLIGHTS

• Standard method for preparing total RNA samples for hybridization to Affymetrix 3’ expression arrays.
• This method produces data that is most closely aligned with Affymetrix One-Cycle data (discontinued).

INPUT SPECIFICATIONS

• RNA samples must be free of contaminating proteins and other cellular material, organic solvents (including phenol and ethanol), and salts used in many RNA isolation methods.
• It is generally recommended to use DNase-treated RNA.

Starting Material | Total Amount (ng) | Concentration (ng/µl) | Volume (µl) | RIN | 260:280
Total RNA | 2500* | 125 | 20 | >7 | >1.8

*We recommend 2500 ng to allow for QC of incoming RNA as well as reamplification, if necessary.
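The values in the table above are related by total amount = concentration × volume (125 ng/µl × 20 µl = 2500 ng). A minimal sketch of a pre-shipment check follows; the function names are hypothetical, and treating the tabulated amount as a minimum is an assumption rather than a stated EA rule.

```python
def total_ng(concentration_ng_per_ul, volume_ul):
    """Total nucleic acid mass in ng: concentration (ng/µl) times volume (µl)."""
    return concentration_ng_per_ul * volume_ul


def meets_enzo_input_spec(concentration_ng_per_ul, volume_ul, rin, a260_280):
    """Check a sample against the Enzo input table.

    Assumption: the tabulated total amount (2500 ng) is treated as a minimum.
    """
    return (
        total_ng(concentration_ng_per_ul, volume_ul) >= 2500
        and rin > 7
        and a260_280 > 1.8
    )


# The tabulated concentration and volume give the recommended total amount.
print(total_ng(125, 20))  # 125 ng/µl x 20 µl = 2500 ng
```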


ASSAY SUMMARY

RNA samples are converted into labeled target antisense RNA (cRNA) using the Single-Round RNA Amplification and Biotin Labeling System (Enzo). Briefly, total RNA is converted into double-stranded cDNA via reverse transcription using an oligo-d(T) primer-adaptor. This cDNA is purified and used as a template for in vitro transcription using T7 RNA polymerase and biotinylated ribonucleotides. The resulting cRNA is purified using magnetic beads and quantitated using spectrophotometry. Next, 11 µg of purified cRNA is fragmented using a 5X fragmentation buffer. A hybridization cocktail is then prepared using the Hybridization, Wash and Stain kit (Affymetrix), added to the fragmentation product, applied to arrays, and incubated at 45°C for 16 hours. Following hybridization, arrays are washed and stained using standard Affymetrix procedures before scanning on the Affymetrix GeneChip Scanner and data extraction using Expression Console.

INPUT SPECIFICATIONS (CONT.)

• Bioanalyzer: When analyzing amplified cRNA using the Agilent 2100 Bioanalyzer, the distribution of biotin-labeled cRNA should range from 200 nt to 6000 nt, with a broad peak between 1 kb and 2 kb.

OUTPUT SPECIFICATIONS

Standard Affymetrix Output Files
• EXP file (experimental information file)
• CEL file (averaged cell intensities)
• CHP file (analysis output file)

Standard Array Analysis Data Summaries
• Conversion of the raw Affymetrix output files into useable formats: tabulated expression intensity estimates in text files and/or Excel formats, summarized for all specimens in the experiment.


FREQUENTLY ASKED QUESTIONS

1. Is this method suitable for RNA specimens isolated from blood?
• Yes, there is a globin-reduction procedure* that may be performed at an additional cost.
*We recommend GR300, Affymetrix Enzo Service w/ Globin RNA Reduction.

REFERENCES

1. Enzo Technote, EA
2. EA Sales Quote Template (Affy Enzo Quote Temp v2), EA
3. http://www.enzolifesciences.com/ENZ-42420-10/single-round-rna-amplification-and-biotin-labeling-system/

Ambion WT Expression

TABLE OF CONTENTS

• Highlights
• Assay Summary
• Input Specifications
• Output Specifications
• Frequently Asked Questions
• References

HIGHLIGHTS

• This method is used to prepare total RNA samples for hybridization to Affymetrix GeneChip® Sense Target (ST) Arrays.
• WT100 denotes the service code for hybridization to Gene ST arrays.
• This method enables more sensitive and highly reproducible results with Affymetrix whole genome microarrays.

INPUT SPECIFICATIONS

• RNA samples must be free of contaminating proteins and other cellular material, organic solvents (including phenol and ethanol), and salts used in many RNA isolation methods.
• It is generally recommended to use DNase-treated RNA.

Starting Material | Total Amount (ng) | Concentration (ng/µl) | Volume (µl) | RIN | 260:280
Total RNA | 680 | 85 | 8 | >7 | >1.8

ASSAY SUMMARY

RNA samples are converted to sense-strand cDNA using the Ambion® WT Expression Kit and subsequently labeled using the Affymetrix GeneChip® WT Terminal Labeling Kit. The WT Expression Kit is optimized for use with human, mouse, and rat Affymetrix GeneChip® Sense Target (ST) Arrays. Briefly, 200 ng of total RNA undergoes a reverse transcription step designed to exclude priming of ribosomal RNA. Primers are designed using a proprietary oligonucleotide-matching algorithm that avoids rRNA binding, thereby providing comprehensive coverage of the transcriptome while significantly reducing coverage of rRNA. This method also avoids the 3’ bias inherent in methods that prime exclusively with oligo-dT-based primers. The resulting sense-strand cDNA is next fragmented using uracil-DNA glycosylase (UDG) and apurinic/apyrimidinic endonuclease 1 (APE1), which recognize and fragment the cDNA at dUTP residues incorporated during the 2nd cycle. Next, the fragmented DNA is labeled by terminal deoxynucleotidyl transferase (TdT) using the Affymetrix DNA Labeling Reagent. Hybridization cocktail is prepared using the Hybridization, Wash and Stain kit (Affymetrix), applied to arrays, and incubated at 45°C for 16 hours. Following hybridization, arrays are washed and stained using standard Affymetrix procedures before scanning on the Affymetrix GeneChip Scanner and data extraction using Expression Console.

OUTPUT SPECIFICATIONS

Standard Affymetrix Output Files
• CEL file (averaged cell intensities)
• CHP file (analysis output file)
• DAT files (available upon request)

Standard Array Analysis Data Summaries
• Conversion of the raw Affymetrix output files into useable formats: tabulated expression intensity estimates in text files and/or Excel formats, summarized for all specimens in the experiment.

FREQUENTLY ASKED QUESTIONS

1. Are there advantages to the Affy ST array when compared to the standard Affy 3’ expression array?
• A key feature of the ST array is that it may allow detection of alternatively spliced transcripts. This is because the target labeling method does not suffer from inherent 3’ bias and the probe sequences for transcripts can be distributed along the entire length of the transcript, rather than clustered near the 3’ end. In addition, Gene ST arrays tend to be less expensive than 3’ expression arrays.


2. How does the Affy ST array design differ from the standard Affy 3’ expression array?
• The main difference concerns the design of probes for the array. Standard 3’ expression arrays contain just over 20 probes per probe set that are typically designed to interrogate bases proximal to the 3’ end. Some genes are represented by more than one probe set. In contrast, Exon ST arrays contain about 40 probes per gene, with about 4 probes per exon (where applicable). Gene ST arrays have about 26 probes per gene distributed along the entire length of the transcribed gene. Thus, a higher number of probes interrogates a larger fraction of the targeted transcript.
• Probes on the standard Affy 3’ expression array are in the sense orientation and the target that is hybridized to the array is antisense. In contrast, probes on Affy ST arrays are antisense and the hybridized target is sense. Hence the designation ST, for Sense Target.
• Standard Affy 3’ expression arrays contain PM (perfect match) and MM (mis-match) probes. The difference in summarized signal intensity between the PM and MM probes is used to make detection calls. ST arrays contain PM probes only and there is therefore no detection call.
• An ST array cannot be analyzed individually and needs to be analyzed in comparison with other samples hybridized to the same ST array design. All samples comprising a project or comprising a group within a project must be analyzed at the same time.

FREQUENTLY ASKED QUESTIONS (CONT.)

3. How does the array data of an Affy ST array differ from the standard Affy 3’ expression array?
• Since the ST array contains many more probes, smaller features, and an additional chemistry step, data from ST arrays are somewhat more variable.
• There are no detection calls: expression data does not include percent present (%P) values.
• A primary QC metric is Positive versus Negative AUC (area under the curve), which is akin to Signal to Noise: it relates to the ability to distinguish true signal from noise. Another important value is RLE (relative log expression). As a general rule, RLE values within a biologically defined group should be consistent.

REFERENCES

1. The Ambion® WT Expression Kit Protocol, PN 4425209 Rev. C
2. The Affymetrix WT Terminal Labeling Assay Manual, PN 701880 Rev. 5
3. http://www.ambion.com/techlib/prot/fm_4411973.pdf
4. http://www.affymetrix.com/browse/level-0-landingpage.jsp

Illumina Expression Specimen Probe Output Guide

TABLE OF CONTENTS

• Summary
• Sample Tables
• Definitions

SUMMARY

Format = Excel; File Name = FinalReport[Norm|NoNorm]_EAID_yyyymmdd.xls

This report assimilates the standard Illumina expression information that is available for each specimen in a common batch. There is typically one file for all specimens in a batch or study. The results are provided in an Excel spreadsheet with one row per probe. An excerpt from a report is shown below. (In this illustration, data lines are wrapped to allow for easier viewing. In the actual report file, all information pertaining to a probe is contained in a single row. Additional specimens are reported in adjacent columns prior to the detailed annotations.)
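The file-name convention can be unpacked programmatically. The sketch below assumes that `[Norm|NoNorm]` means exactly one of the two literal strings appears and that the EA ID is whatever sits between the two underscores; both are readings of the pattern rather than documented rules, and the example file name is hypothetical.

```python
import re
from datetime import datetime

# Assumed reading of FinalReport[Norm|NoNorm]_EAID_yyyymmdd.xls
REPORT_NAME = re.compile(
    r"^FinalReport(?P<norm>Norm|NoNorm)_(?P<eaid>.+)_(?P<date>\d{8})\.xls$"
)


def parse_report_name(name):
    """Split a report file name into normalization flag, EA ID and date."""
    match = REPORT_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized report name: {name!r}")
    return {
        "normalized": match.group("norm") == "Norm",
        "eaid": match.group("eaid"),
        "date": datetime.strptime(match.group("date"), "%Y%m%d").date(),
    }


# Hypothetical file name for illustration only.
print(parse_report_name("FinalReportNoNorm_12345_20110315.xls"))
```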

SAMPLE TABLES

PROBE_ID | SYMBOL | 95289_4-1.AVG_Signal | 95289_4-1.Detection Pval | 95289_4-1.BEAD_STDERR | 95289_4-1.Avg_NBEADS | SEARCH_KEY | ILMN_GENE
ILMN_2993882 | Cyp11b1 | 128.345 | 0.000034 | 16.685732 | 36 | ILMN_245925 | CYP11B1
ILMN_2860070 | Cyp11b2 | 50.034 | 0.0031 | 12.699523 | 50 | ILMN_222335 | CYP11B2
ILMN_2748181 | Cyp11b2 | 43.348 | 0.0078 | 13.096592 | 25 | ILMN_222335 | CYP11B2

CHROMOSOME | DEFINITION | SYNONYMS | TargetID | ProbeID | SPECIES | SOURCE | TRANSCRIPT | SOURCE_REFERENCE_ID
15 | Mus musculus cytochrome P450 | CPN1; Cyp11b-1; | CYP11B1 | 5270092 | Mus musculus | Ref Seq | ILMN_245925 | NM_001033229.1
15 | Mus musculus cytochrome P450 | MGC151285; Cpn2; | CYP11B2 | 4760246 | Mus musculus | Ref Seq | ILMN_222335 | NM_009991.2
15 | Mus musculus cytochrome P450 | MGC151285; Cpn2; | CYP11B2 | 7560736 | Mus musculus | Ref Seq | ILMN_222335 | NM_009991.2

SAMPLE TABLES (CONT.)

REFSEQ_ID | UNIGENE_ID | ENTREZ_GENE_ID | GI | ACCESSION | PROTEIN_PRODUCT | ARRAY_ADDRESS_ID | PROBE_TYPE | PROBE_START
NM_0013229.1 | | 110115 | 8437029 | NM_0013229.1 | NP_0028401.1 | 5270092 | S | 2237
NM_009991.2 | | 13072 | 9845247 | NM_009991.2 | NP_034121.1 | 4760246 | S | 1392
NM_009991.2 | | 13072 | 9845247 | NM_009991.2 | NP_034121.1 | 7560736 | S | 898

PROBE_SEQUENCE | PROBE_CHR_ORIENTATION | PROBE_COORDINATES | CYTOBAND | ONTOLOGY_COMPONENT | ONTOLOGY_PROCESS | ONTOLOGY_FUNCTION
CATGTACT… | - | 74662234-74662283 | 15qD3 | membrane… | electron transport… | monooxygenase activity…
ACTGAAAA… | - | 74681530-74681579 | 15qD3 | membrane… | drinking behavior… | oxidoreductase activity…
CTCTCGAC… | - | 74683475-74683524 | 15qD3 | membrane… | drinking behavior… | oxidoreductase activity…


DEFINITIONS

REPORT HEADER DEFINITIONS

ITEM DESCRIPTION

PROBE_ID Illumina unique identifier for a probe sequence.

SYMBOL Gene name as reported in RefSeq.

PER SPECIMEN (PREFIX NOTATION IS EAID_ClientSampleID)

ITEM DESCRIPTION

AVG_Signal Average intensity of the bead type/target in the group.

Detection Pval Proportion of negative controls that had more intensity than the target probe avg. signal. A lower value implies it is more likely to be detected.

BEAD_STDERR Standard error associated with bead-to-bead variability for the specimen.

Avg_NBEADS Number of beads per bead type for the probe when using one array per specimen (i.e., it is usually not an average but a count of the number of beads used to estimate signal).
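Taken literally, the Detection Pval definition above is a simple proportion. The sketch below follows that wording only; it is not Illumina's implementation, which may differ in detail, and the intensities are hypothetical.

```python
def detection_pvalue(probe_signal, negative_control_signals):
    """Proportion of negative controls whose intensity exceeds the probe signal.

    Follows the definition in this guide literally; the vendor software may
    compute the value differently.
    """
    higher = sum(1 for s in negative_control_signals if s > probe_signal)
    return higher / len(negative_control_signals)


# Hypothetical intensities for illustration only.
negatives = [80.0, 85.0, 90.0, 95.0, 120.0]
print(detection_pvalue(100.0, negatives))  # 1 of 5 negatives is higher: 0.2
```

A probe whose signal exceeds every negative control gets a value of 0, which is why lower values indicate more confident detection.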

GENE/PROBE ANNOTATION

ITEM DESCRIPTION

SEARCH_KEY Gene identifier provided by the customer.

CHROMOSOME The chromosome on which the probe is located.

DEFINITION Single-line description of gene in RefSeq.

SYNONYMS Gene synonyms.

TargetID Identifies the nature of the probe, usually a gene name for this report.

ProbeID Identifies the bead type. Note: this is a different value than PROBE_ID.

SPECIES The species of the BeadChip product.

SOURCE Data table from which the annotation was acquired.

TRANSCRIPT RefSeq entry specifying an isoform (GI number).

SOURCE_REFERENCE_ID Database accession number.

REFSEQ_ID Reference sequence identifier from NCBI RefSeq database.

UNIGENE_ID NCBI's Unigene gene database identifier (historically based primarily on ESTs).

ENTREZ_GENE_ID Reference gene identifier based on RefSeq genomes.

GI RefSeq entry specifying an isoform (GI number).

ACCESSION RefSeq entry (NM or XM number).

PROTEIN_PRODUCT Reference protein identifier.

ARRAY_ADDRESS_ID Internal address used by Illumina software.

GENE/PROBE ANNOTATION (CONT.)

ITEM DESCRIPTION

PROBE_TYPE A, I, or S coding for dealing with isoforms:
• For transcripts with a single isoform, Illumina designs “-S” probes (S=single).
• For transcripts with multiple isoforms, Illumina designs two types of probes:
  o “-I” (I=isoform-specific) probes are designed to query only one of multiple isoforms.
  o “-A” (A=all) probes are designed to query all known isoforms of that transcript.

PROBE_START Coordinate in database entry where probe sequence begins (approximate distance from 5' end).

PROBE_SEQUENCE Sequence used as a probe on the array.

PROBE_CHR_ORIENTATION The DNA strand on which the probe is located (+ or -).

CYTOBAND Location of the gene in the designated subregion (cytogenetic band).

ONTOLOGY_COMPONENT The gene ontology (GO) component classification(s) for this probe.

ONTOLOGY_PROCESS The gene ontology (GO) process classification(s) for this probe.

ONTOLOGY_FUNCTION The gene ontology (GO) function classification(s) for this probe.

Illumina Expression Control Probe Output Guide

TABLE OF CONTENTS

• Summary
• Sample Table
• Definitions
• TargetID Table
• Definitions

SUMMARY

Format = Excel or Text; File Name = Control_Probe_Profile.[xls/txt]

This table assimilates the standard information for Illumina control probes that is available for each specimen in a common batch. There is typically one file for all specimens in a batch or study. This file should be used with FinalReport[Norm|NoNorm]_EAID_yyyymmdd.xls. The results are provided in table form with one row per probe. An excerpt from the table is shown below. Additional specimens would occupy adjacent columns.

SAMPLE TABLE

TargetID | ProbeID | 95289_4-1.AVG_Signal | 95289_4-1.Detection Pval | 95289_4-1.AVG_Signal | 95289_4-1.Detection Pval
BIOTIN | 3940609 | 12777.96 | 0 | 13372.770 | 0
BIOTIN | 6510136 | 9395.014 | 0 | 67833.302 | 0
CY3_HYB | 1010475 | 46150.16 | 0 | 46347.330 | 0
CY3_HYB | 2260059 | 1377.44 | 0 | 1385.611 | 0

DEFINITIONS

ITEM DESCRIPTION

TargetID Identifies the nature of the probe (housekeeping, etc.). See the TargetID table.

ProbeID Unique identifier for a probe sequence. Note: this is a different value than PROBE_ID.

AVG_Signal Average intensity of the bead type/target on the array.

Detection Pval Proportion of negative controls that had more intensity than the target probe avg. signal.

TargetID TABLE

HOUSEKEEPING (ProbeID/Gene)
Human arrays, 7 total (Human6 v3 and HumanHT 12 v3): 1430239/UBC, 2100273/EEF1A1, 2940403/TUBB2A, 3940446/TXN, 4490161/GAPDH, 5570132/ACTB, 590110/RPS9
Human arrays, 6 total (RefSeq8 and WGDASL): 2640088/ALDOA, 3450719/EEF1A1, 5080364/ACTN1, 5570132/ACTB, 5860347/AKR1D1, 6520026/NUCB1
Mouse array, 14 total: 10220/Txn1, 520379/3010033P07Rik, 1010296/Tubb2b, 1030133/Tubb2b, 1690689/Eef1a1, 2260521/Ubc, 2470521/3010033P07Rik, 2680722/Actb, 4640114/Actb, 4830750/Txn1, 5870154/Gapd, 5960379/Ubc, 6280626/Eef1a1, 6940475/Gapd

BIOTIN
Human arrays: 2 total
Mouse array: 2 total

CY3_HYB (ProbeID/Ranking)
Human arrays, 6 total: 1110170/high, 4010327/high, 2510500/middle, 4610291/middle, 1450438/low, 7560739/low
Mouse array, 6 total: 6450180/high, 3800196/high, 1400044/middle, 2600040/middle, 5820544/low, 730475/low

LABELING
Human arrays: 2 total
Mouse array: 8 total

LOW_STRINGENCY_HYB
Human arrays, 8 total: 4 PM and 4 MM
Mouse array, 8 total: 4 PM and 4 MM

NEGATIVE
Human arrays: 759 total (Human6 v3 and HumanHT 12 v3); 309 total (RefSeq8 and WGDASL)
Mouse array: 936 total


DEFINITIONS

ITEM DESCRIPTION

HOUSEKEEPING All Housekeeping probes reflect genes in the specimen that are typically expressed. Therefore, these genes should generally be well above background. Exceptions can occur under various biological conditions.

BIOTIN All Biotin probes should have high intensity.

CY3_HYB For these hybridization controls, the probes should have intensity relative to their class. “high” > “medium” > “low”

LABELING If used, Labeling controls should be above background. Otherwise, Labeling controls would behave similarly to background or negative control probes.

LOW_STRINGENCY_HYB There are typically 4 PM and 4 MM probes. In general, the PM probes should have higher intensity than the MM probes.

NEGATIVE These are nonredundant collections of probes using alien sequences that provide insight into background hybridization and intensity due to scanning, etc. They are used to remove estimated background levels from the array and to determine detection.
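The expectations in the definitions above lend themselves to a simple automated check. The following is a sketch under stated assumptions: the dictionary layout, the function name `check_controls`, and all intensity values are hypothetical, and "above background" is interpreted as greater than the average NEGATIVE control intensity.

```python
def check_controls(controls, background):
    """Return warnings for control averages that violate the expectations above.

    `controls` is a hypothetical nested dict of average intensities;
    `background` is the average NEGATIVE control intensity.
    """
    warnings = []
    hyb = controls["CY3_HYB"]
    if not hyb["high"] > hyb["middle"] > hyb["low"]:
        warnings.append("CY3_HYB not ordered high > middle > low")
    stringency = controls["LOW_STRINGENCY_HYB"]
    if not stringency["PM"] > stringency["MM"]:
        warnings.append("LOW_STRINGENCY_HYB PM not above MM")
    if not controls["HOUSEKEEPING"] > background:
        warnings.append("HOUSEKEEPING not above background")
    return warnings


# Hypothetical values for illustration only.
example = {
    "CY3_HYB": {"high": 46000.0, "middle": 9000.0, "low": 1400.0},
    "LOW_STRINGENCY_HYB": {"PM": 6200.0, "MM": 900.0},
    "HOUSEKEEPING": 4900.0,
}
print(check_controls(example, background=85.0))  # []
```

An empty list means none of the three checks fired. As the HOUSEKEEPING definition notes, biological exceptions can occur, so a warning is a prompt for review rather than a failure.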

Illumina Summary Expression QC Report Guide

TABLE OF CONTENTS

• Summary
• Sample Tables
• Definitions

SUMMARY

Format = Comma separated; File Name = TableControl_EAID_yyyymmdd.csv

This report assimilates the standard Illumina Expression information that is available for each specimen in a common batch. There is typically one file for all specimens in a batch or study. The results are provided in an Excel spreadsheet with one specimen per row. An excerpt from a report is shown below. (In this illustration, data lines are wrapped to allow for easier viewing. In the actual report file, all information pertaining to a specimen is contained in a single row.) This report is based on non-normalized probe intensity data.

SAMPLE TABLES

Index | Sample ID | Sample Group | Sentrix Barcode | Sample Section | Detected Genes (0.01)
1 | 93926_380 | 93926_380 | 4836278015 | D | 8222
2 | 93927_378 | 93927_378 | 4836278013 | B | 6765
3 | 93928_379 | 93928_379 | 4836278013 | F | 7918

Detected Genes (0.05) | Signal Average | Signal P05 | Signal P25 | Signal P50 | Signal P75 | Signal P95
9997 | 185.7895 | 79.166 | 82.915 | 86.851 | 100.901 | 341.663
8649 | 173.0811 | 80.208 | 85.724 | 91.911 | 107.264 | 282.910
9777 | 183.8995 | 82.003 | 86.434 | 91.228 | 105.852 | 323.232

Sample_Well | Sample_Plate | Pool_ID | BIOTIN | CY3_HYB | HOUSEKEEPING | LABELING
 | | | 5411.485 | 7383.296 | 4964.682 | 82.63918
 | | | 5076.515 | 6707.969 | 3954.465 | 90.24446
 | | | 5446.844 | 7676.829 | 4401.727 | 89.41646

LOW_STRINGENCY_HYB | NEGATIVE (background) | Noise
6221.482 | 85.19336 | 4.852496
5665.249 | 91.09724 | 11.69113
6481.939 | 89.07066 | 6.454274

DEFINITIONS

ITEM DESCRIPTION

Index The row index of the sample.

Sample ID The specimen identifier, usually a combination of the EA ID and a client-provided ID.

Sample Group An Illumina convention to allow group identification but which is typically just a restatement of the specimen identifier.

Sentrix Barcode The barcode of the Sentrix Array.

Sample Section The section on the Sentrix Array.

Detected Genes (0.01) The number of genes with a detection p-value of 0.01 or less.

Detected Genes (0.05) The number of genes with a detection p-value of 0.05 or less. Note: the detected number of genes changes based on normalization.

Signal Average The average of signal intensity for the specimen across all probes.

Signal P05 The fifth percentile signal intensity across all probes.

Signal P25 The twenty-fifth percentile signal intensity across all probes.

Signal P50 The fiftieth percentile signal intensity across all probes.

Signal P75 The seventy-fifth percentile signal intensity across all probes.

Signal P95 The ninety-fifth percentile signal intensity across all probes.

Sample_Well Typically blank.


ITEM DESCRIPTION

Sample_Plate Typically blank.

Pool_ID Bead pool ID but typically blank.

BIOTIN The average intensity across all biotin control probes.

CY3_HYB The average control intensity across all CY3_HYB probes.

HIGH_STRINGENCY_HYB The average intensity across all HIGH_STRINGENCY_HYB control probes.

HOUSEKEEPING The average intensity across all HOUSEKEEPING control probes.

LABELING The average intensity across all LABELING control probes.

LOW_STRINGENCY_HYB The average intensity across all LOW_STRINGENCY_HYB control probes.

NEGATIVE (background) The average intensity across all NEGATIVE (background) control probes.

NOISE The average intensity across all noise control probes.
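Two of the per-specimen summary values defined above, Detected Genes and Signal Average, can be derived from per-probe values as sketched below. The function names and the input lists are hypothetical, and counting one entry per probe is an assumption; the report itself speaks of genes.

```python
def count_detected(detection_pvalues, threshold):
    """Count entries with a detection p-value at or below the threshold."""
    return sum(1 for p in detection_pvalues if p <= threshold)


def signal_average(signals):
    """Mean signal intensity across all supplied probes."""
    return sum(signals) / len(signals)


# Hypothetical per-probe values for one specimen (not EA data).
pvalues = [0.0, 0.004, 0.01, 0.03, 0.2]
signals = [100.0, 200.0, 300.0]
print(count_detected(pvalues, 0.01))  # 3
print(count_detected(pvalues, 0.05))  # 4
print(signal_average(signals))        # 200.0
```

The count at 0.05 can never be smaller than the count at 0.01, which matches the pattern in the sample tables.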

Affymetrix Genome-Wide SNP 6.0 Assay

TABLE OF CONTENTS

• Highlights
• Assay Summary
• Input Specifications
• Output Specifications
• Frequently Asked Questions


HIGHLIGHTS

• Standard method for preparing genomic DNA samples for hybridization to Affymetrix Human SNP arrays.
• This method produces data that is most closely aligned with Affymetrix’ Human Mapping 250K and 500K arrays.

INPUT SPECIFICATIONS

• Genomic DNA samples must be highly intact and double-stranded.
• It is generally recommended to use the PicoGreen assay for quantitation, as it binds only to dsDNA. A UV spectrophotometer detects all nucleic acids, which misrepresents the actual input amount.

Starting Material | Total Amount (ng) | Concentration (ng/µl) | Volume (µl)
Genomic DNA | 1000* | 50 | 20

*We recommend 1000 ng to allow for QC of incoming DNA as well as reamplification, if necessary.


ASSAY SUMMARY

Genomic DNA samples (500 ng) are digested with Nsp I and Sty I restriction enzymes. After a 30-cycle PCR amplification, the amplified DNA is cleaned using a magnetic bead-based method and purified with filtered columns. The resulting purified DNA is quantitated using spectrophotometry. Next, 180 µg of purified DNA is fragmented using a 10X fragmentation buffer. A hybridization cocktail is then prepared using the Hybridization, Wash and Stain kit (Affymetrix), added to the fragmentation product, applied to arrays, and incubated at 50°C for 18 hours. Following hybridization, arrays are washed and stained using standard Affymetrix procedures before scanning on the Affymetrix GeneChip Scanner and data extraction using Genotyping Console.

OUTPUT SPECIFICATIONS

Standard Affymetrix Output Files
• CEL file (averaged cell intensities)
• CHP file (analysis output file)

Standard Array Analysis Data Summaries
• Conversion of the raw Affymetrix output files into useable formats: tabulated genotype intensity estimates in text files (birdseed files), summarized for all specimens in the experiment.

FREQUENTLY ASKED QUESTIONS

1. Is any QC performed?
• Yes, there is a QC check after amplification. If there are any failures, the client will review the data and respond before processing continues.

Illumina Genotyping Assay

TABLE OF CONTENTS

• Assay Summary
• Highlights
• Input Specifications
• Output Specifications

ASSAY SUMMARY

Genomic DNA samples (200-500 ng) are amplified overnight. The amplified DNA is then fragmented and purified by isopropanol precipitation. The resulting purified DNA is bound to the arrays during an overnight hybridization, followed by a wash and stain on Tecan robots. The arrays are then scanned on the Illumina iScan Scanner, and data are extracted using GenomeStudio.

expressionanalysis.com Illumina Genotyping Assay

HIGHLIGHTS

• Standard method for preparing genomic DNA samples for hybridization to Illumina Whole Genome Genotyping arrays.

INPUT SPECIFICATIONS

• Genomic DNA samples must be highly intact and double-stranded.
• We generally recommend the PicoGreen assay for quantitation, as it binds only to dsDNA. A UV spectrophotometer measures all nucleic acids, which misrepresents the actual input amount.

Starting Material Genomic DNA

Total Amount (ng) 1000

Concentration (ng/µl) 50

Volume (µl) 20

*We recommend 1000ng to allow for QC of incoming DNA as well as reamplification, if necessary.

OUTPUT SPECIFICATIONS

• Standard Illumina Output Files
  o IDAT file (raw intensities)
• Standard Array Analysis Data Summaries
  o Conversion of the raw Illumina output files into usable formats: tabulated genotype intensity estimates in Final Report files, summarized for all specimens in the experiment.

Illumina Genotyping QC Summary Report Guide

TABLE OF CONTENTS

• Summary
• Sample Tables
• Definitions
• DNA Table Summary
• DNA Table
• DNA Table Definitions

SUMMARY

Format = Text; File Name = EAID-PID (yyyy-mm-dd)_Samples Table.txt

These two tables (Samples and DNA) provide the standard Illumina QC summary information that is available for each specimen in a common genotyping study. There is one file containing the Samples Table for all specimens in a study. The results are provided in tab-delimited text with one row per specimen. An excerpt from the Samples Table is shown below.

SAMPLE TABLES

Index  Sample ID    Call Rate  Gender   p05 Grn  p50 Grn  p95 Grn  p05 Red  p50 Red
1      0741-07A_7A  0.99301    Unknown  1087     9005     19720    939      5865
2      0741-07B_7B  0.991772   Unknown  1162     8607     22189    977      5618
3      0741-07C_7C  0.990308   Unknown  700      5096     15678    619      3442

p95 Red  p10 GC    p50 GC    A.Sentrix ID  A.Sentrix Position
15251    0.396424  0.711446  4770740081    R01C01
16550    0.397104  0.711787  4770740081    R02C01
12473    0.396914  0.712086  4770740076    R01C02
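Because the Samples Table is tab-delimited text with one row per specimen, it can be screened with a few lines of code. The sketch below is illustrative only: it embeds three rows from the excerpt (a subset of columns) rather than reading a real file, and the `MIN_CALL_RATE` cutoff is a hypothetical value chosen for the example, not an EA or Illumina threshold.

```python
import csv
import io

# Rows copied from the excerpt above (subset of columns).
# For a real report, replace io.StringIO(...) with open(path, newline="").
SAMPLES_TABLE = (
    "Index\tSample ID\tCall Rate\tGender\n"
    "1\t0741-07A_7A\t0.99301\tUnknown\n"
    "2\t0741-07B_7B\t0.991772\tUnknown\n"
    "3\t0741-07C_7C\t0.990308\tUnknown\n"
)

MIN_CALL_RATE = 0.991  # hypothetical cutoff, for illustration only


def low_call_rate_samples(text, min_call_rate=MIN_CALL_RATE):
    """Return the Sample IDs whose Call Rate falls below min_call_rate."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        row["Sample ID"]
        for row in reader
        if float(row["Call Rate"]) < min_call_rate
    ]


flagged = low_call_rate_samples(SAMPLES_TABLE)
print(flagged)
```

With the hypothetical cutoff used here, only the third specimen is flagged; an appropriate cutoff depends on the array and the study.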

DEFINITIONS

ITEM DESCRIPTION

GenCall (GC) Score A quality metric that indicates the reliability of each genotype call and is a value between 0 and 1 assigned to every called genotype. GenCall Scores are calculated using information from the clustering of the samples.

Index Numbered index of the specimens.

Sample ID EA Identifier for each specimen, usually a combination of the EA ID and client ID.

Call Rate Percentage of SNPs (expressed as a decimal) whose GenCall score is greater than the specified threshold.

Gender Gender of the specimen.

p05 Grn 5th percentile of A-allele intensity.

p50 Grn 50th percentile of A-allele intensity.

p95 Grn 95th percentile of A-allele intensity.

p05 Red 5th percentile of B-allele intensity.

p50 Red 50th percentile of B-allele intensity.

p95 Red 95th percentile of B-allele intensity.

p10 GC 10th percentile GenCall score over all SNPs for this specimen.

p50 GC 50th percentile GenCall score over all SNPs for this specimen.

A.Sentrix ID The barcode that this specimen was hybridized to for Manifest A.

A.Sentrix Position The position within the array that this specimen was hybridized to for Manifest A (similarly for _B, _C, etc., depending on how many manifests are used with the project).

DNA TABLE SUMMARY

Format = Text; File Name = EAID-PID (yyyy-mm-dd)_DNAReport.txt

There is one file containing the DNA Table for all specimens in a study. The results are provided in tab-delimited text with one row per specimen. An excerpt from the DNA Table is shown below.

DNA TABLE

Row  DNA_Name                    #No_Calls  #Calls  Call_Rate  A/A_Freq
1    0741-07A_7A_4770740081_...  3925       557565  0.993      0.3157
2    0741-07B_7B_4770740081_...  4620       556870  0.9918     0.3164
3    0741-07C_7C_4770740081_...  5442       556048  0.9903     0.3182

A/B_Freq  B/B_Freq  Minor_Freq  50%_GC_Score  10%_GC_Score  0/1
0.3151    0.3692    0.4733      0.7114        0.3964        1
0.3231    0.3605    0.478       0.7118        0.3971        1
0.3239    0.3578    0.4802      0.7121        0.3969        1


DNA TABLE DEFINITIONS

ITEM DESCRIPTION

Row Index for the DNA.

DNA_Name Identifier for each specimen.

#No_Calls Number of loci with GenCall scores below the call region threshold.

#Calls Number of loci with GenCall score above the call region threshold.

Call_Rate Percentage of SNPs (expressed as a decimal) whose GenCall score is greater than the specified threshold.

A/A_Freq Frequency of homozygote allele A calls.

A/B_Freq Frequency of heterozygote calls.

B/B_Freq Frequency of homozygote allele B calls.

Minor_Freq Frequency of the minor allele. If the number of AA calls is less than the number of BB calls for a specimen, A is the minor allele, and its frequency for that specimen is (2*AAs + ABs) divided by (2*AAs + 2*ABs + 2*BBs), counted across all loci.

50%_GC_Score Median GenCall score when scores are ranked for all loci.

10%_GC_Score GenCall score at the 10th percentile when scores are ranked for all loci.

0/1 An indicator for whether a specimen is recommended for inclusion or exclusion by Illumina. 0 = Remove / 1 = Include
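The Call_Rate and Minor_Freq columns can be recomputed from the other columns of the DNA Table. The sketch below is a minimal illustration, assuming the call rate is #Calls / (#No_Calls + #Calls) and that each genotype call contributes two alleles; the function names are hypothetical and the inputs are the first row of the excerpt.

```python
def call_rate(no_calls, calls):
    """Fraction of loci that received a genotype call."""
    return calls / (no_calls + calls)


def minor_allele_freq(aa, ab, bb):
    """Minor allele frequency from genotype counts or frequencies.

    Each AA call carries two A alleles, each AB call one A and one B,
    and each BB call two B alleles, so the denominator is twice the
    number of calls.
    """
    total_alleles = 2 * (aa + ab + bb)
    if aa < bb:
        # A is the minor allele
        return (2 * aa + ab) / total_alleles
    # otherwise B is treated as the minor allele
    return (2 * bb + ab) / total_alleles


# First row of the DNA Table excerpt
rate = call_rate(no_calls=3925, calls=557565)
maf = minor_allele_freq(aa=0.3157, ab=0.3151, bb=0.3692)
print(round(rate, 4), round(maf, 4))
```

Both results agree with the Call_Rate (0.993) and Minor_Freq (0.4733) reported for that row, to within rounding.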

Illumina Genotyping Assay QC Summary SNP Report Guide

TABLE OF CONTENTS

• Summary
• Sample Tables
• Definitions

SUMMARY

Format = Text; File Name = EAID-PID_LocusSummary.txt

This table provides the standard QC summary information that is available for each locus or SNP in a common Illumina genotyping study. There is one file for all SNPs in a study. The results are provided in a tab-delimited text table with one locus/SNP per row. An excerpt from a table is shown below. (In this illustration, data lines are wrapped to allow for easier viewing. In the actual report file, all information pertaining to a locus is contained in a single row.)

SAMPLE TABLES

Row  Locus_Name  Illumicode_Name  #No_Calls  #Calls  Call_Freq  A/A_Freq  A/B_Freq
1    rs245703    24456879         0          12      1          0.167     0.5
2    rs205606    24436475         0          12      1          0.167     0.333
3    rs2893047   24449576         0          12      1          1         0

B/B_Freq  Minor_Freq  Gentrain_Score  50%_GC_Score  10%_GC_Score  Het_Excess_Freq  ChiTest_P100
0.333     0.417       0.8931          0.9299        0.4439        0.0286           0.7751
0.5       0.333       0.7824          0.7877        0.7877        -0.25            0.0124
0         0           0.8318          0.8612        0.7769        0                1

Cluster_Sep  AA_T_Mean  AA_T_Std  AB_T_Mean  AB_T_Std  BB_T_Mean  BB_T_Std
1            0.027      0.01      0.436      0.023     0.968      0.01
0.9871       0.05       0.016     0.456      0.025     0.98       0.01
0.514        0.05       0.023     0.362      0.038     0.959      0.022

AA_R_Mean  AA_R_Std  AB_R_Mean  AB_R_Std  BB_R_Mean  BB_R_Std
1.119      0.111     1.3        0.153     0.949      0.116
1.833      0.221     2.187      0.179     1.595      0.155
0.522      0.1       0.466      0.1       0.328      0.1

DEFINITIONS

ITEM DESCRIPTION

GC-Score GenCall (GC) Score is a quality metric that indicates the reliability of each genotype call. The GenCall Score is a value between 0 and 1 assigned to every called genotype. GenCall Scores are calculated using information from the clustering of the samples.

Row Row number.

Locus_Name Locus name, typically a database rsID (SNP).

Illumicode_Name Locus ID.

#No_Calls Number of specimens with GenCall score below the call region threshold.

#Calls Number of specimens with GenCall score above the call region threshold.

Call_Freq Call frequency, or call rate, calculated as follows: #Calls/(#No_Calls + #Calls)

A/A_Freq Frequency of homozygote allele A calls.

A/B_Freq Frequency of heterozygote calls.

B/B_Freq Frequency of homozygote allele B calls.

Minor_Freq Frequency of the minor allele. If the number of AA calls is less than the number of BB calls at a locus, A is the minor allele, and its frequency at that locus is (2*AAs + ABs) divided by (2*AAs + 2*ABs + 2*BBs), counted across all specimens.

GenTrain_Score A number between 0 and 1 indicating how well the specimens clustered for this locus.

50%_GC_Score Median GenCall score when scores are ranked for all specimens.


10%_GC_Score GenCall score at the 10th percentile when scores are ranked for all specimens.

Het_Excess_Freq Heterozygote excess frequency, calculated as (Obs - Exp) / Exp for the heterozygote class. If fAB is the heterozygote frequency observed at a locus, and p and q are the major and minor allele frequencies, then the expected heterozygote frequency under Hardy-Weinberg assumptions is 2pq, and Het Excess is the relative difference between the observed and expected heterozygote frequencies: (fAB - 2pq) / 2pq.

ChiTest_P100 Hardy-Weinberg p-value estimate calculated using genotype frequency. The value is calculated with 1 degree of freedom and normalized to 100 individuals.

Cluster_Sep Cluster separation score.

AA_T_Mean Mean of the normalized theta (θ) angles¹ for the AA genotypes.

AA_T_Std Standard deviation of the normalized theta (θ) angles for the AA genotypes.

AB_T_Mean Mean of the normalized theta (θ) angles for the AB genotypes.

AB_T_Std Standard deviation of the normalized theta (θ) angles for the AB genotypes.

BB_T_Mean Mean of the normalized theta (θ) angles for the BB genotypes.

BB_T_Std Standard deviation of the normalized theta (θ) angles for the BB genotypes.

AA_R_Mean Mean of the normalized R-values² for the AA genotypes.

AA_R_Std Standard deviation of the normalized R-values for the AA genotypes.

AB_R_Mean Mean of the normalized R-values for the AB genotypes.

AB_R_Std Standard deviation of the normalized R-values for the AB genotypes.

BB_R_Mean Mean of the normalized R-values for the BB genotypes.

BB_R_Std Standard deviation of the normalized R-values for the BB genotypes.
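The Het_Excess_Freq and ChiTest_P100 definitions can be illustrated with a short calculation. The sketch below is an assumption-laden reconstruction, not EA's or Illumina's code: it takes the expected genotype frequencies to be p², 2pq and q², scales the chi-square statistic to 100 individuals, and uses the fact that a chi-square p-value with 1 degree of freedom equals erfc(sqrt(x / 2)). The genotype counts are inferred from the excerpt (12 calls per locus multiplied by the reported frequencies).

```python
import math


def locus_stats(n_aa, n_ab, n_bb, normalize_to=100):
    """Return (het_excess, hwe_p_value) for one locus from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B

    observed = (n_aa / n, n_ab / n, n_bb / n)
    expected = (p * p, 2 * p * q, q * q)  # Hardy-Weinberg proportions

    if expected[1] == 0:
        # Monomorphic locus: no heterozygotes are expected or possible.
        return 0.0, 1.0

    het_excess = (observed[1] - expected[1]) / expected[1]

    # Chi-square on genotype frequencies, scaled to `normalize_to` individuals.
    chi2 = normalize_to * sum(
        (o - e) ** 2 / e for o, e in zip(observed, expected)
    )
    # Upper-tail probability of a chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return het_excess, p_value


# Counts inferred from the three loci in the excerpt (AA, AB, BB)
for counts in [(2, 6, 4), (2, 4, 6), (12, 0, 0)]:
    het_excess, p_value = locus_stats(*counts)
    print(counts, round(het_excess, 4), round(p_value, 4))
```

Under these assumptions the three loci give values that agree with the Het_Excess_Freq and ChiTest_P100 columns of the excerpt to within rounding.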

¹ Theta (θ) angle: the angle of the (A,B) allele probe intensity ordered values relative to the X-axis.
² R-value: typically defined as the log of the sum of the A and B allele probe intensities.

mRNA-Seq

TABLE OF CONTENTS

• Overview
• Highlights
• Advantages
• Assay Summary
• Sample Preparation
• Globin RNA Reduction Protocol
• Cluster Generation
• Sequencing
• RNA-Seq vs. Array Comparative Summary
• Effect of Sequencing Parameters
• Output Specifications
• Added Value Informatics
• Microarray Compatible File Formats
• Alignment
• EA Alignment vs Traditional Methods
• Abundance Estimation
• Differential Expression
• Frequently Asked Questions
• Example of a Typical QC Summary Report
• Summary
• Cloud-Based Solution for Biologists

OVERVIEW

mRNA-Seq (also known as transcriptome sequencing) can, in principle, profile the entire population of mRNA species, enabling mapping and quantification of all transcripts. It can also provide information on alternative splice sites, novel transcripts, previously unknown isoforms, rare transcripts and cSNPs.

PURPOSE

Transcriptome studies are a cornerstone of functional genomics and provide insight into which genes are expressed in specific circumstances, such as tissue-specific expression patterns, organismal and strain variation in expression patterns and differential expression in disease conditions. In contrast to microarray-based analyses, mRNA-Seq does not rely upon a predetermined set of probe sequences.

HIGHLIGHTS

• mRNA-Seq is a universal discovery platform used to sequence eukaryotic mRNA without the need for prior sequence information.
• mRNA-Seq is flexible: deep sequencing detects rare events, while basic sequencing gives differential detection results comparable to or better than microarray-based results.

ADVANTAGES

• More versatile and comprehensive than microarrays, with superior reproducibility.
• Enables quantification of and comparisons between transcript isoforms.
• Open platform that does not depend on any prior sequence information.
• Provides information about potentially all mRNAs within the sample and allows for the discovery of novel mRNAs.
• High sensitivity facilitates the detection of low abundance mRNAs.

SAMPLE PREPARATION

The TruSeq™ RNA Sample Prep Kit (Illumina) is used to build cDNA libraries for single-read and paired-end sequencing on the GAIIX or HiSeq. Starting with 500ng of total RNA, mRNA is first purified using polyA selection, then chemically fragmented. The mRNA fragments are converted into single-stranded cDNA using random hexamer priming of reverse transcription. Next, the second strand is generated to create double-stranded cDNA, followed by end repair and the addition of a single ‘A’ base at each end of the molecule. Adapters that enable attachment to the flow cell surface are then ligated to each end of the fragments. The adapters also contain unique index sequences that allow the libraries to be pooled and then individually identified during downstream analysis (multiplexing). PCR is performed to amplify and enrich the ligated material to create the final cDNA library (Illumina TruSeq RNA Sample Preparation Guide).

GLOBIN RNA REDUCTION PROTOCOL

Whole blood samples present a unique challenge to mRNA sequencing studies due to the large fraction of hemoglobin-related messenger RNA. In fact, hemoglobin message typically comprises more than 50% of all sequenced reads. As a result, a whole blood sample requires double the amount of sequencing to produce the expected levels of repeatable detection and precision in abundance estimation. EA provides a globin reduction protocol that reduces the hemoglobin message by 100x, to 1% of all sequence. At this level of abundance, the globin reduction protocol produces a 33% increase in isoform detection and a 22% increase in gene detection.
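The "double the amount of sequencing" figure follows from simple arithmetic: if a fraction g of reads is hemoglobin message, only (1 - g) of the reads are informative. The sketch below works this through; the target of 10 million non-globin reads is a hypothetical number chosen for illustration, and the function name is not part of any EA tool.

```python
def reads_needed(target_informative_reads, globin_fraction):
    """Total reads required so that the non-globin reads reach the target."""
    return target_informative_reads / (1.0 - globin_fraction)


TARGET = 10_000_000  # hypothetical target of non-globin reads

untreated = reads_needed(TARGET, globin_fraction=0.50)  # about half the reads are globin
reduced = reads_needed(TARGET, globin_fraction=0.01)    # after globin reduction

print(f"untreated whole blood: {untreated:,.0f} reads")
print(f"after globin reduction: {reduced:,.0f} reads")
```

At 50% globin the requirement doubles to 20 million reads, while at 1% globin it stays within about 1% of the target.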

ASSAY SUMMARY

mRNA-Seq is a cDNA sequencing application that generates a more comprehensive, quantitative view of the mRNA portion of the transcriptome than traditional microarrays. With no probes or primer design needed, RNA-Seq has the potential to provide high-quality sequence information from polyA-tailed RNA for analysis of gene expression, novel transcripts, novel isoforms, alternative splice sites, allele-specific expression, cSNPs, and rare transcripts in a single experiment, depending on read depth.

CLUSTER GENERATION

The Standard Cluster Generation Kit v5 binds cDNA libraries to the flow cell surface. The cBot isothermally amplifies the attached cDNA constructs to create clonal clusters of ~1000 copies each.

SEQUENCING

The DNA sequence is directly determined using sequencing-by-synthesis technology via the TruSeq SBS Kit.

SAMPLE REQUIREMENTS

Starting Material Total RNA

Total Amount (ng) 1250*

Concentration (ng/µl) 25

Volume (µl) 50

RIN >8

260:280 1.6-2.2

*Our standard method calls for 500ng of input total RNA. We have set our performance expectations and performed comparisons to microarray data using this amount. As little as 100ng of starting material can be used per specimen if sample is limiting. As with any assay, EA strongly recommends maintaining consistency throughout the course of a study. Your results will be the most consistent and repeatable if the input amount is uniform across all specimens.
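The sample requirements above can be turned into a simple pre-submission check. This is a sketch only: it assumes the table values act as minimums or acceptable ranges (total amount of at least 1250ng, RIN above 8, 260:280 between 1.6 and 2.2), and the function name and messages are hypothetical.

```python
def check_rna_sample(concentration_ng_per_ul, volume_ul, rin, ratio_260_280,
                     min_total_ng=1250):
    """Return (total_ng, problems) for one total RNA specimen."""
    total_ng = concentration_ng_per_ul * volume_ul
    problems = []
    if total_ng < min_total_ng:
        problems.append(f"total amount {total_ng:g}ng is below {min_total_ng}ng")
    if rin <= 8:
        problems.append(f"RIN {rin:g} is not above 8")
    if not 1.6 <= ratio_260_280 <= 2.2:
        problems.append(f"260:280 ratio {ratio_260_280:g} is outside 1.6-2.2")
    return total_ng, problems


# 25 ng/µl x 50 µl matches the 1250ng listed in the table
total, issues = check_rna_sample(25, 50, rin=8.5, ratio_260_280=2.0)
print(total, issues)
```

A specimen that fails a check is not necessarily unusable, since as little as 100ng can be used when sample is limiting; the check only flags deviations from the standard request.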


RNA-SEQ VS ARRAY COMPARATIVE SUMMARY

• EA alignment of 10M-12M paired reads provides performance similar to or better than currently available gene expression microarrays in terms of detection, estimation of fold change, and detection of differential expression for the gene and transcript content assayed by the microarray.
• EA alignment of 25M paired reads provides fold change estimates that are 75-100% larger, and significantly more transcripts are identified as differentially expressed for the content assayed by the microarray. When examining all isoforms, one typically sees a 5x increase in detection of differential expression.
• Paired-end (PE) sequencing provides modest benefits in all aspects of quantification relative to single-end (SE) sequencing, and the greatest benefit to de novo capture and assembly.

Reads Differential Expression Gene-level Detection Isoform-level Detection

12 Million 25% - 35% Increase 50% Increase 500% Increase

25 Million 40% - 50% Increase 67% Increase 550% Increase

50 Million 45% - 55% Increase 70% Increase 600% Increase

EFFECT OF SEQUENCING PARAMETERS

• Repeatability and reproducibility of differential expression estimates (the final results from 50b SE vs 50b PE experiments) improve with an increasing number of reads, as illustrated by the r² of log ratios from our Breast Cancer exemplar experiment (top graph 10M reads vs bottom 25M reads).
• 50-cycle paired-end (PE) sequencing is superior to 50-cycle single-end (SE) sequencing in terms of transcript and isoform detection.

OUTPUT SPECIFICATIONS

Standard Output Files – Primary Analysis
• QC summary statistics for each lane
• Demultiplexing of Illumina barcodes
• FASTQ and clipped FASTQ files

Secondary Analysis Data Summaries (Additional Charges May Apply)
• Simultaneous alignment to transcriptome and genome reference sequence
• Relative abundance estimation at the gene and transcript level

Information summarized for each lane of the run includes the total number of clusters, number of purity filtered clusters, pool ID, percent of clusters that could not be assigned to a barcode, average percent aligned across all samples in the lane, and average number of genes detected across all samples in the lane.

Lane  Yield       Raw Cluster Count  +/- Clusters Per Tile  PF Cluster Count  +/- PF Clusters Per Tile  1st Cycle Int  +/- 1st Cycle Int
1     10,407,352  258,425,985        434,327                208,147,057       306,778                   781            16
2     10,409,384  256,323,323        423,930                208,187,692       248,896                   774            17
3     8,736,013   262,535,079        393,149                174,720,265       312,187                   757            19
4     8,641,480   269,284,242        366,138                172,829,617       382,366                   757            19
5     8,674,731   266,627,418        280,481                173,494,620       345,586                   759            20
6     10,047,380  269,365,037        259,343                200,947,602       553,740                   758            19
7     10,323,174  262,602,063        373,642                206,463,483       364,982                   751            17
8     10,534,025  269,235,011        298,181                210,680,516       319,434                   766            20

Quality scores from base calling as well as frequency of individual bases are summarized and charted by cycle.

Raw sequencing information is provided in FASTQ format, which is a text file including four lines per sequence read. These are:
1) Cluster sequence identifier
2) Sequence
3) Cluster quality identifier
4) Quality score of each base in phred64 format

Paired-end runs will contain two FASTQ files representing each end of the cluster. More information regarding the FASTQ format can be found at: http://maq.sourceforge.net/fastq.shtml
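The four-line record layout can be read with a small parser. The sketch below is a minimal illustration rather than a production reader: it assumes each record occupies exactly four lines and that qualities use the phred64 encoding described above (quality = ASCII code minus 64). The record is the first one from the example data in this guide; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterator, List

PHRED_OFFSET = 64  # phred64 encoding, as stated for these files


@dataclass
class FastqRecord:
    name: str
    sequence: str
    qualities: List[int]


def parse_fastq(lines) -> Iterator[FastqRecord]:
    """Yield one FastqRecord per four-line FASTQ entry."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(rows) - 3, 4):
        header, sequence, separator, quality = rows[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at row {i + 1}")
        yield FastqRecord(
            name=header[1:],
            sequence=sequence,
            qualities=[ord(ch) - PHRED_OFFSET for ch in quality],
        )


EXAMPLE = [
    "@HWI-ST845:1:1101:18276:2378#0/1",
    "GGACTGGGAAGATGGCTCCCATCTCAGGGTGAGGGGCTTCGGCAGGCCCC",
    "+HWI-ST845:1:1101:18276:2378#0/1",
    "ggfgggcggggggggggggggggfgceffceeefdegdaecdcaaT^[aQ",
]

records = list(parse_fastq(EXAMPLE))
first = records[0]
print(first.name, len(first.sequence), first.qualities[:5])
```

A quality character of `g` decodes to 39 under this offset, which is why long runs of `g` indicate high-confidence base calls.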

@HWI-ST845:1:1101:18276:2378#0/1
GGACTGGGAAGATGGCTCCCATCTCAGGGTGAGGGGCTTCGGCAGGCCCC
+HWI-ST845:1:1101:18276:2378#0/1
ggfgggcggggggggggggggggfgceffceeefdegdaecdcaaT^[aQ
@HWI-ST845:1:1101:18442:2391#0/1
GACATATTTAACATACTTAGGAACTTTTTTTGTGCGGTGGGAATTCTCT
+HWI-ST845:1:1101:18442:2391#0/1
gggggggggggggggggggggggggggggggggggggegggeggggggg
@HWI-ST845:1:1101:18380:2463#0/1
CCTCAGTCCAACCCCAACCGGACACTCCCAGGGCCTCTGCTCACTGAC
+HWI-ST845:1:1101:18380:2463#0/1
gggggggggfffgggfgggggggggbgggggdggggdeegefggdfdg
@HWI-ST845:1:1101:18529:2286#0/1
CACCACATCAAACCCACTGAGTGAGCTCCCTTGTTGTTGCATGGGATGGC
+HWI-ST845:1:1101:18529:2286#0/1
d_eeeed\ec_aaaaccce`daa\_bcb`cedeeeeed`edeeedW\ccc

Added Value Informatics (Additional Charges May Apply)

MICROARRAY COMPATIBLE FILE FORMATS – SECONDARY ANALYSIS

EA has a long history of providing gene expression testing services with Affymetrix microarrays. The standard format for all Affymetrix array data is the CEL file, which contains intensity information for all probes on an array. Numerous databases and statistical packages have adopted this format as a standard input to facilitate downstream analysis. In contrast, no widely adopted standard has emerged for formatting mRNA sequencing results, which can complicate analysis tasks. EA has developed an mRNA-sequencing data format that mimics the Affymetrix CEL files and can be used by tools built to work with CEL files. By providing data in this form, scientists will be able to make more rapid inferences with mRNA sequencing results in familiar contexts. The wealth of additional sequence-based information that is more comprehensive in scope can be examined concurrently.

ALIGNMENT – SECONDARY ANALYSIS

Alignment of reads is a unique challenge in mRNA sequencing studies due to the extensive processing of mRNAs after transcription. Alignment to a representation of the transcriptome is necessary to efficiently align reads from exon junctions, alignment to the genome is needed to reveal previously unknown transcription events, and alignment of novel exon junctions faces both of these challenges simultaneously. EA also recognizes the importance of unambiguously aligned reads, as these are the most valuable sequences for precise quantification of isoform specific expression. With these challenges in mind, EA has developed an alignment strategy that demonstrates 10% more unambiguous alignment than state-of-the-art strategies. The improvement in unambiguous alignment results in improved detection, abundance estimation, and differential expression.

The results of alignment are provided in BAM format (binary). More information regarding the SAM and BAM formats can be found at: http://samtools.sourceforge.net/SAM1.pdf

HWI-ST845:1:1104:11264:41631#0 147 chr1 16602 255 50M = 16441 0 CCCACCTGAAGGAGACGCGCTGCTGCTGCTGTCGTCCTGCCTGGCGCCTT BDD.GAA>EDGFBGDFEFCFHAHHFFHHHEEHEFHHHHHHHHHHHHHHHH NM:i:0 NH:i:1
HWI-ST845:1:1108:2102:98002#0 147 chr1 16602 255 50M = 16441 0 CCCACCTGAAGGAGACGCGCTGCTGCTGCTGTCGTCCTGCCTGGCGCCTT EDDAE::FGFHHFGFEHHFFHBHHHHHHHHHHHHHHHHHHHHHHHHHHHH NM:i:0 NH:i:1
HWI-ST845:1:1205:6553:135498#0 147 chr1 16602 255 50M = 16441 0 CCCACCTGAAGGAGACGCGCTGCTGCTGCTGTCGTCCTGCCTGGCGCCTT ??C4DA<FFFEC??FFEFDHGFFHHHGHFHHDHHHHHHHHHHHHHHH NM:i:0 NH:i:1
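Each alignment record shown above follows the SAM layout: eleven mandatory fields followed by optional TAG:TYPE:VALUE tags. The sketch below splits the first record into named fields and decodes its FLAG value using the bit meanings from the SAM specification. It is illustrative only; it splits on whitespace because the excerpt is shown space-separated (real SAM text is tab-delimited), and the function and dictionary names are hypothetical.

```python
SAM_LINE = (
    "HWI-ST845:1:1104:11264:41631#0 147 chr1 16602 255 50M = 16441 0 "
    "CCCACCTGAAGGAGACGCGCTGCTGCTGCTGTCGTCCTGCCTGGCGCCTT "
    "BDD.GAA>EDGFBGDFEFCFHAHHFFHHHEEHEFHHHHHHHHHHHHHHHH NM:i:0 NH:i:1"
)

# FLAG bits as defined in the SAM specification
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}


def parse_sam_line(line):
    """Split one SAM record into a dict of named fields."""
    fields = line.split()
    flag = int(fields[1])
    tags = {}
    for item in fields[11:]:
        tag, value_type, value = item.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    return {
        "qname": fields[0],
        "flags": [name for bit, name in FLAG_BITS.items() if flag & bit],
        "rname": fields[2],
        "pos": int(fields[3]),
        "mapq": int(fields[4]),
        "cigar": fields[5],
        "seq": fields[9],
        "qual": fields[10],
        "tags": tags,
    }


record = parse_sam_line(SAM_LINE)
print(record["rname"], record["pos"], record["flags"], record["tags"])
```

FLAG 147 is 1 + 2 + 16 + 128, so the record is the second read of a properly paired fragment aligned to the reverse strand; the NH tag of 1 indicates a single reported alignment for the read.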

EA ALIGNMENT VS TRADITIONAL METHODS

[Figure: stacked bar chart of the percentage of unaligned, ambiguous, and unambiguous reads for Bowtie (RefSeq), Bowtie (hg19), TopHat, and EA. Unambiguous alignment is 44%, 69%, 79%, and 88%, respectively; other values recovered from the figure are 11%, 14%, 14%, and 28%.]

• Alignment estimates generated from the Illumina body map data
• Based on 50bp single end data
• Reference database and error tolerance were identical for all methods
• EA = EA developed alignment strategy
• RefSeq GTF was used for both TopHat and EA

ABUNDANCE ESTIMATION – TERTIARY ANALYSIS (Additional Charges Apply)

Once alignment is complete, abundance estimation is performed at the gene and individual transcript level using RSEM (RNA-Seq by Expectation Maximization) (http://www.biomedcentral.com/content/pdf/1471-2105-12-323.pdf) or TopHat/Cufflinks (http://cufflinks.cbcb.umd.edu/).

RSEM is the default method due to its algorithmic stability and the nature of its output. RSEM output includes raw and normalized gene and transcript isoform counts for the relevant transcriptome. RSEM estimates counts associated with genes and isoforms and apportions ambiguously mapped reads according to its expectation maximization algorithm. Using both unambiguous and ambiguous reads is important for achieving the best accuracy and improved repeatability and reproducibility.
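The idea of apportioning ambiguously mapped reads by expectation maximization can be shown with a toy example. The sketch below is not RSEM: it ignores transcript length, read errors, and fragment models, and all names and read counts are hypothetical. It only demonstrates the core loop, in which each ambiguous read is split among its compatible transcripts in proportion to the current abundance estimates, and the abundances are then re-estimated from those fractional counts.

```python
def em_apportion(reads, n_iter=100):
    """Toy EM: `reads` is a list of tuples of compatible transcript names.

    Returns the expected read count per transcript after n_iter iterations.
    """
    transcripts = sorted({t for compatible in reads for t in compatible})
    # Start from uniform relative abundances.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    counts = dict.fromkeys(transcripts, 0.0)

    for _ in range(n_iter):
        # E-step: split each read among its compatible transcripts.
        counts = dict.fromkeys(transcripts, 0.0)
        for compatible in reads:
            total = sum(theta[t] for t in compatible)
            for t in compatible:
                counts[t] += theta[t] / total
        # M-step: abundances become the share of fractional counts.
        theta = {t: counts[t] / len(reads) for t in transcripts}
    return counts


# Hypothetical data: 6 reads unique to isoform A, 2 unique to isoform B,
# and 4 reads compatible with both.
reads = [("A",)] * 6 + [("B",)] * 2 + [("A", "B")] * 4
estimated = em_apportion(reads)
print({t: round(c, 3) for t, c in estimated.items()})
```

In this example the four ambiguous reads end up split 3:1 in favor of isoform A, mirroring the 3:1 ratio of the uniquely mapped reads, so the estimates converge to 9 and 3 reads.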

The following illustrates estimated counts by gene for many of the well-documented breast cancer cell lines.

DIFFERENTIAL EXPRESSION – TERTIARY ANALYSIS (Additional Charges Apply)

*Additional charges apply – 2 hours per comparison for simple two-group comparisons, independent of the number of samples per group. Any comparison more complicated than a two-group comparison must be custom quoted.

When requested, differential expression may be calculated between pre-defined groups of samples. Our preferred differential expression test (t-test) uses a standard normal model when evaluating variation. Results indicate the fold change of each transcript between the groups as well as the p-value of significance. The false discovery rate at which the transcript is called significant is also estimated with the adjusted p-value (padj). However, this differential testing method may not be optimal for all situations, especially where there are fewer than 3 samples per group. In that case, methods that use the negative binomial distribution to model variation, as described by Huber et al. (http://dx.doi.org/10.1186/gb-2010-11-10-r106), are used (edgeR, or DESeq if requested). All techniques yield fold change and log ratio estimates, p-values and adjusted p-values.
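Adjusted p-values of this kind are commonly computed with the Benjamini-Hochberg procedure. The text above does not name the adjustment method, so treating padj as a Benjamini-Hochberg value is an assumption made for this sketch; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four transcripts
raw = [0.01, 0.04, 0.03, 0.20]
padj = benjamini_hochberg(raw)
print([round(p, 4) for p in padj])
```

Calling a transcript significant when its adjusted value is below a chosen level, such as 0.05, controls the expected proportion of false discoveries at that level under the procedure's assumptions.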

FREQUENTLY ASKED QUESTIONS

1. Can TruSeq RNA library prep be used for degraded or FFPE specimens?
Not in its current form, though we have had success with specific add-on procedures. Please contact your service consultant for additional information.

2. Is a QC report provided? Yes, see below.

EXAMPLE OF A TYPICAL QC SUMMARY REPORT

This report compiles the standard information that is available for each specimen in a common study. There is frequently one file for all specimens in a study, but intermediate reports may not contain the entire study.

REPORT HEADER DEFINITIONS

ITEM DESCRIPTION

Lab Sample EA identifier for specimen

Client Sample Name Specimen identifier used by the client

Sequence Start Date Date at which sequencing began (yyyymmdd)

Run ID Internal Run/Flowcell ID associated with the specimen

Sequencer Internal ID for sequencer that performed the specimen sequencing

Lane Channel the specimen is sequenced in

Barcode Barcode Label used for the specimen

Species Independent assessment of predominant species (best species alignment)

# Genes Detected The number of genes that have counts above the detection threshold


Total Genes Searched The number of genes that may theoretically be detected from the reference transcriptome

% Detected The ratio of (#genes detected) / (total genes searched)

Median Count In the estimated frequency (from alignment and quantitation) of the genes, the 50th percentile value across all genes for the specimen

75th Perc Count In the estimated frequency (from alignment and quantitation) of the genes, the 75th percentile or Q3 value across all genes for the specimen

Max Count In the estimated frequency (from alignment and quantitation) of the genes, the maximum value or highest count across all genes for the specimen

Max Count Transcript The gene associated with the transcript maximum count

Total Clusters (Mil) The total number of clusters (reads or paired reads) assigned to the specimen after barcode decoding (demultiplexing)

% All Mapped The percentage of all reads that were mapped to either the transcriptome or genome

% Transcriptome Mapped The percentage of all paired reads that were mapped to the transcriptome

%rRNA mapped The percentage of all reads that were mapped to ribosomal RNA

Median Mapping Quality The typical mapping quality of a read to genome/transcriptome

%(G+C) Aligned The percentage of bases aligned that were G or C compared to all bases from aligned reads

%(C/T) Ratio The ratio of the percentage of C bases versus the percentage of T bases aligned

SUMMARY

• More versatile and comprehensive than microarrays, with superior reproducibility
• Open platform that does not depend on any prior sequence information
• Higher sensitivity, which facilitates the detection of lower abundance mRNAs
• More quantitative
• Provides information about potentially all mRNAs within the sample and allows for the discovery of novel mRNAs
• Reduced batch bias
• Greatly improved specificity for detection of transcripts and isoforms
• Hypothesis-free data
• Enables comparisons between transcript isoforms

CLOUD-BASED SOLUTION FOR BIOLOGISTS

When you are ready to “make sense” of your data, take advantage of our cloud-based toolkit to combine primary, secondary and tertiary analysis. This state-of-the-art solution also features pathway analysis as well as genome browsing and viewing capabilities for a full and detailed picture of your data set.

MORE BIOLOGICAL INSIGHT

• SNV
• Indels
• Splice variants
• Isoform switching
• Novel genes
• Novel isoforms

mRNA-Seq Analysis Pipeline

TABLE OF CONTENTS

• EA Pipelines Powered by Golden Helix

• Why the Cloud?

EA PIPELINES POWERED BY GOLDEN HELIX

Get the Same EA Quality with the Convenience of the Cloud

As next generation sequencing continues to evolve and produce unprecedented amounts of data, your approach to data storage and analysis will play an even larger role in your success. Introducing EA Pipelines powered by Golden Helix, the answer to your RNA sequencing and gene expression bioinformatics needs.

Infinite Scalability, Global Capabilities

With EA Pipelines, your research is free from typical sequencing restrictions thanks to the power of cloud computing. Unlike internal servers and in-house networks that have limited processing speeds and data capacities, EA Pipelines offer users quick turnaround times, regardless of project size, and unlimited storage space. The cloud also promotes global collaboration by providing easy access to authorized users of your choosing, anywhere in the world. Supported by the software and bioinformatics expertise of Golden Helix, EA Pipelines provide your projects with the same trusted sequencing and pipeline experience you’ve come to expect from EA.

As Easy as 1-2-3

1. Choose the pipeline of interest and submit raw genetic information via the online portal
2. Data is analyzed following EA quality assurance standards
3. Results are available online anytime, anywhere, for as long as you like*

*Nominal monthly storage fee applies

WHY THE CLOUD?

EA Pipelines cloud-based service saves you from investing in sequencing platforms and infrastructure that often do not have the longevity or the flexibility to adapt to the ever-changing landscape of next-generation sequencing.

Take Your Research to the Next Level and Switch to the Cloud For:
• Infinite scalability
• Powerful processors for fast, efficient analysis of large and small projects
• Worldwide collaboration made easy
• Inexpensive, unlimited, and easily accessible storage
• Simple download and data distribution capabilities
• Unmatched reproducibility
• Unlimited access to current and past projects

Safe and Secure

EA Pipelines is built on Amazon Web Services, Amazon’s world-class, secured cloud-based computing service, so you can rest assured that your data is protected with the most up-to-date digital security system available. Amazon Web Services successfully completed SAS70 Type II audits, achieved ISO 27001 Certification, was approved as a Level 1 service provider under the Payment Card Industry (PCI) Data Security Standard (DSS), and received authorization from the U.S. General Services Administration. It is also the platform for applications with Authorities to Operate (ATOs) under the Defense Information Assurance Certification and Accreditation Program (DIACAP).

If you’re not ready to make the transition, EA’s internal pipelines and infrastructure are still in place to bring you the same high-quality analysis and results.

PacBio RS Sequencing

TABLE OF CONTENTS

• Highlights

• Background

• How It Works

• Assay Summary

• Library Preparation

• Sample Requirements

• In-Process QC Checkpoints

• Output Specifications/File Explanations

• Representative Data

• Frequently Asked Questions


HIGHLIGHTS

• Single molecule, real-time (SMRT®) sequencing technology based on real-time measurement of nucleotide base incorporation by a single DNA polymerase molecule sequencing in a continuous, processive manner. • Longer read lengths compared to traditional next-generation sequencing: With read lengths in the thousands of bases, the PacBio RS can resolve both SNPs and large-scale structural rearrangements. Long reads increase understanding of disease heritability through haplotype phasing. Long reads also simplify and improve genomic assembly by reducing the number of contigs. For single molecules of up to 1000 nucleotides, associated quality scores are superior to other next generation sequencing technologies. • SMRT detection provides direct measurement of individual molecules, capturing multiple dimensions of data. Templates can be prepared without PCR amplification, resulting in more uniform sequence coverage across genomic regions regardless of GC content or sequences that are otherwise difficult to amplify by PCR. • Faster time to results: sequencing in less than a week.

BACKGROUND

The PacBio RS system is a DNA sequencing system which conducts, monitors and analyzes biochemical reactions at the individual molecule level in real time. There are two sequencing strategies:

1. Standard sequencing generates single pass, long reads. Long insert lengths ensure continuous polymerase synthesis along a single strand.

2. Circular consensus sequencing uses DNA templates with shorter inserts to enable multiple reads across a single molecule. This approach provides both forward and reverse reads from a double-stranded template.

HOW IT WORKS

Inserts for the library are generated either by fragmenting genomic DNA or through PCR amplification of regions of interest. The library is generated by ligating a PacBio SMRTbell™ adapter to each end of a linear DNA fragment, producing a template that is structurally linear yet topologically circular.

A DNA polymerase is then bound to the SMRTbell™ template, which is sequenced on the PacBio RS as a continuous circle.

The polymerase is not halted by terminators, by reagent starvation, or by reaching the end of the template molecule. Instead, the polymerase continuously sequences the circular template until it is exhausted. For a large insert template, the resulting DNA sequence is a long contiguous read; for a short insert, the read contains the same sequence over and over again as the polymerase repeatedly makes it around the circle.

If you were to stretch out the circular long reads you typically will get the following pattern:

As you can see in the example above, the “yellow” (i.e. positive sense) strand of the DNA sequence and the “purple” (i.e. minus sense) of the DNA sequence are sequenced twice. This may occur for a portion of the molecules in your sequencing library, but you will also get different variations, e.g. 2 full yellow sequences, and 1.25 purple sequences, or only 1 full (or partial) yellow sequence. This all depends on how big your insert size is, where the polymerase initiates active incorporation of labeled nucleotides, and how long the polymerase retains its catalytic activity for the single molecule to which it is bound. The sequencing read diagrammed above is simply called a Long Read and contains the SMRTbell adapters as well as raw sequence bases. These reads are provided to you as is, in the base fastq files (see the next page for a detailed description of standard files provided to clients).

When a Long Read contains at least two full sequences (i.e. a full yellow and a full purple), these sequences are combined into what is called a Circular Consensus Sequence (CCS). A CCS read takes the consensus of its constituent sequences to create a very high quality sequencing read. Most CCS reads meet or exceed Q30 when derived from libraries containing inserts 500-1000 nt in length.
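For readers unfamiliar with the Q30 figure quoted above: a Phred quality score Q corresponds to an estimated per-base error probability of 10^(-Q/10), so Q30 means roughly one expected error per 1,000 bases. The short Python sketch below shows the conversion in both directions. The function names are illustrative only and are not part of any PacBio or EA software.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score to an estimated per-base error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert a per-base error probability (0 < p <= 1) to a Phred quality score."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Print the error probability implied by a few common quality thresholds.
    for q in (10, 20, 30):
        print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
```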

The first step in generating a CCS read is to strip the long reads of the PacBio adapters. The resulting sequences are called subreads.

These subreads then have two possible fates:

Scenario 1: two or more full-length subreads. When two or more full-length subreads are generated from the same molecule (as diagrammed above), the reads go into two separate files. The first file contains all subreads individually (PacBio defaults also filter out reads with a quality score below 75%); as discussed below, we put this file through a second filtering step before providing it to clients as the *clipped.fastq file. The second file contains the subreads collapsed into CCS reads and is provided to clients as the *ccs.fastq file (see the file explanations under Output Specifications).

Scenario 2: fewer than two full-length subreads. When PacBio sequencing does not generate two or more full-length subreads from a given molecule, the subreads (full-length and/or partial-length) are output into the same subreads file as in Scenario 1 (again with default quality filtering).
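The two scenarios can be illustrated with a simplified Python sketch. It assumes exact adapter matches and treats a subread as full length only when it is flanked by an adapter on both sides; the adapter string is a made-up placeholder, and the real PacBio software works on raw signal data with error-tolerant adapter detection, so this is a conceptual illustration rather than the actual implementation.

```python
ADAPTER = "TCTCTC"  # hypothetical placeholder, NOT the real SMRTbell adapter sequence


def split_subreads(long_read, adapter=ADAPTER):
    """Strip adapters from a long read; return (sequence, is_full_length) pairs.

    A subread counts as full length only if it lies between two adapters,
    meaning the polymerase read the whole insert from one end to the other.
    """
    pieces = long_read.split(adapter)
    subreads = []
    for index, sequence in enumerate(pieces):
        if not sequence:
            continue  # adapter at the very start or end of the read
        flanked_both_sides = 0 < index < len(pieces) - 1
        subreads.append((sequence, flanked_both_sides))
    return subreads


def is_ccs_eligible(subreads):
    """Scenario 1 applies when at least two full-length subreads are present."""
    return sum(1 for _, full in subreads if full) >= 2


if __name__ == "__main__":
    insert = "ACGTACGTAC"
    # Partial pass, two full passes, then another partial pass.
    read = insert[4:] + ADAPTER + insert + ADAPTER + insert + ADAPTER + insert[:3]
    subreads = split_subreads(read)
    print(subreads)
    print("CCS eligible:", is_ccs_eligible(subreads))
```

In this toy example all four subreads would go to the subreads file, and only the two full-length ones would contribute to a CCS read.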

The two files output by this process are then further processed by fastq-mcf, an EA-internal sequence clipper (available for free download). Both of the resulting fastq files carry the word “clipped” in their names (clipped.fastq and ccs.clipped.fastq) so that end users can easily identify them. For most users and most applications, these clipped files are the most appropriate for subsequent analysis.

ASSAY SUMMARY

Genomic DNA samples are sheared to the target insert size (500 bp, 6 kb or 10 kb, depending on the chosen sequencing strategy) using a Covaris Adaptive Focused Acoustics instrument or the g-Tube, also from Covaris. Sheared DNA is then purified using magnetic beads and verified on a Bioanalyzer or agarose gel, depending upon the target size. End-repair is performed, followed by ligation of universal hairpin adapters to produce the SMRTbell library. Hairpin adapter dimers formed during library production are substantially reduced using size-selective conditions during magnetic bead purification. Sequencing primer is annealed to the SMRTbell library and then polymerase is bound to the primer-library complex. Sequencing is carried out on the PacBio RS system.

LIBRARY PREPARATION

At EA we offer two different types of library preparation to take advantage of the unique characteristics of the PacBio platform. The type of library preparation chosen depends on your application. There is a tradeoff that must be made between long reads and CCS reads: as the size of the DNA insert increases, the likelihood of generating CCS reads decreases, because the polymerase is less likely to make it through two full-length subreads.

Long insert library preparation methods accept this downside in favor of providing the longest possible reads. This method of library preparation is primarily suited for hybrid assembly approaches, where the long PacBio reads can be “error corrected” by a higher accuracy data type (e.g., PacBio CCS reads, Illumina reads, or 454 reads).

The short insert library preparation method generates reads that, while still long compared to those of conventional next generation sequencing devices, have a higher quality score than reads from long insert libraries. In this scenario, the goal is specifically to generate as many CCS reads as possible while maintaining read lengths greater than those of conventional next generation sequencing platforms. We typically recommend using CCS reads for SNP calling or for correction of long reads during hybrid assembly.
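The tradeoff between insert size and CCS yield can be made concrete with a rough calculation. The Python sketch below estimates an upper bound on the number of complete passes a polymerase can make around a SMRTbell template for a given total read length. It assumes the polymerase starts exactly at an adapter boundary, and the adapter length of 45 nt is a placeholder chosen for illustration, not a figure from this guide. The example read length of 3,676 nt is the mean long read length reported under Representative Data.

```python
def full_passes(polymerase_read_len, insert_len, adapter_len=45):
    """Upper bound on complete insert-plus-adapter passes in one polymerase read.

    adapter_len is a hypothetical placeholder value.
    """
    return polymerase_read_len // (insert_len + adapter_len)


def can_yield_ccs(polymerase_read_len, insert_len, adapter_len=45):
    """A CCS read needs at least two full-length subreads."""
    return full_passes(polymerase_read_len, insert_len, adapter_len) >= 2


if __name__ == "__main__":
    for insert_len in (500, 1000, 6000, 10000):
        passes = full_passes(3676, insert_len)
        print(f"insert {insert_len} nt: up to {passes} full passes, "
              f"CCS possible: {can_yield_ccs(3676, insert_len)}")
```

Under these assumptions a typical read covers a 500 nt insert several times but cannot complete even one pass of a 6 kb insert, which is why CCS reads from long insert libraries arise only from unusually long reads or unusually short inserts.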

Long Read Libraries: Long insert libraries are specifically designed to obtain contiguous reads that are as long as possible. We generally target DNA inserts in the 8-12 Kb range, but some DNA samples can be difficult to reliably fragment into such large pieces. Thus, for certain libraries we may only be able to achieve fragment sizes in the 5-7 Kb range. Because these inserts are so long, there will be fewer CCS reads. Still, unless there are zero CCS reads, clients will receive a CCS file containing those CCS reads that were generated. Why are any CCS reads generated at all? Although we are aiming to get mostly long reads, occasionally one of two things can happen:

1. A polymerase is highly active (we have seen reads in excess of 15,000 nt).

2. An insert smaller than the targeted size has been incorporated into the library.

When one or both of these situations occur, the polymerase may be able to sequence two or more full-length subreads and therefore be able to generate a CCS read. This will be less likely to occur as the target insert size increases. These reads are worth keeping because they are highly accurate, but the bulk of the data will be in the other fastq and clipped.fastq files.

Short Insert (CCS Read) Libraries: Short insert libraries are specifically designed to obtain higher redundancy of measurement of individual molecules, thus yielding higher quality sequences. Higher quality scores are achieved through the error correction that occurs when the same DNA molecule’s sequence is sampled multiple times; this is known as circular consensus sequencing. The source of DNA can be amplicons or genomic DNA that has been fragmented to a smaller size range, typically 500-1000 bp. The clipped.fastq file will always contain the individual subreads that comprise each CCS read, but generally speaking there will be little, if any, reason for you to work with this data. Your main dataset will be the ccs.clipped.fastq file.

SAMPLE REQUIREMENTS

• Library preparation does not include amplification of the starting material, so larger amounts of DNA are required when compared to alternative sequencing technologies.

• Genomic DNA samples or PCR product should be provided in TE and must be free of contaminating proteins and other cellular material, organic solvents (including phenol and ethanol), and salts used in many DNA isolation methods.

• Concentration must be determined by an intercalating dye method (either PicoGreen assay or Qubit Fluorometer). Spectrophotometric methods, including Nanodrop measurements, tend to overestimate the amount of double-stranded genomic DNA.

• For large-scale library preps, a long term storage complex is created that is typically sufficient for at least 100 SMRT Cells per prep. The DNA requirement is 5 times the requirement listed in the table below. This type of prep is recommended for projects that will utilize a large number of SMRT Cells.

• DNA integrity and size should be determined by agarose gel for genomic DNA and by Agilent Bioanalyzer for fragmented DNA and PCR product. Degraded DNAs are usually poor substrates for library production.

Targeted Insert Size (bp) | 6000 | 10,000 | 500 | 500 ~ 1000
Starting Material | Genomic DNA | Genomic DNA | PCR product | Genomic DNA or PCR product
Total Amount (ng) | 4000 | 9000 | 750 | 750
Concentration (ng/µl) | 50 | 100 | 25 | 25
Volume (µl) | 80 | 90 | 30 | 30
RIN | NA | NA | NA | NA
260:280 | 1.8 ~ 2.1 | 1.8 ~ 2.1 | 1.8 ~ 2.1 | 1.8 ~ 2.1

Bioanalyzer electropherogram of a 975 bp PCR product.
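As a worked illustration of the requirements table, the Python sketch below checks a sample against the listed amount, concentration, volume and 260:280 values. Note that in every column the total amount equals concentration times volume (for example 50 ng/µl x 80 µl = 4000 ng). The sketch treats the tabulated values as minimums and applies the 5x large-scale factor to the total amount only; both of these are assumptions, and the function and key names are purely illustrative.

```python
# Values copied from the sample requirements table, keyed by targeted insert size (bp).
REQUIREMENTS = {
    "6000": {"total_ng": 4000, "conc_ng_per_ul": 50, "volume_ul": 80},
    "10000": {"total_ng": 9000, "conc_ng_per_ul": 100, "volume_ul": 90},
    "500": {"total_ng": 750, "conc_ng_per_ul": 25, "volume_ul": 30},
    "500-1000": {"total_ng": 750, "conc_ng_per_ul": 25, "volume_ul": 30},
}
A260_280_RANGE = (1.8, 2.1)
LARGE_SCALE_FACTOR = 5  # large-scale preps need 5 times the listed DNA amount


def check_sample(insert_size, conc_ng_per_ul, volume_ul, a260_280, large_scale=False):
    """Return a list of problems; an empty list means the sample passes."""
    req = REQUIREMENTS[insert_size]
    factor = LARGE_SCALE_FACTOR if large_scale else 1
    required_total = req["total_ng"] * factor  # assumption: only the total scales
    problems = []
    total_ng = conc_ng_per_ul * volume_ul
    if total_ng < required_total:
        problems.append(f"total {total_ng} ng is below {required_total} ng")
    if conc_ng_per_ul < req["conc_ng_per_ul"]:
        problems.append(f"concentration is below {req['conc_ng_per_ul']} ng/ul")
    if volume_ul < req["volume_ul"]:
        problems.append(f"volume is below {req['volume_ul']} ul")
    low, high = A260_280_RANGE
    if not low <= a260_280 <= high:
        problems.append(f"260:280 ratio {a260_280} is outside {low}-{high}")
    return problems


if __name__ == "__main__":
    print(check_sample("6000", conc_ng_per_ul=50, volume_ul=80, a260_280=1.9))
    print(check_sample("6000", conc_ng_per_ul=40, volume_ul=80, a260_280=1.6))
```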

Bioanalyzer electropherogram of genomic DNA fragmented to a target size of 6Kb.

Agarose gel electrophoresis of 100ng intact genomic DNA.

IN-PROCESS QC CHECKPOINTS

Samples are evaluated for concentration, total amount and size distribution prior to beginning the library preparation process. You will be notified if one or more of your samples does not meet our requirements. For certain library preparations, genomic DNA will be fragmented to achieve a targeted insert size. The size distribution and yield of fragmented DNA is evaluated.

After library preparation is completed, the final yield and the size distribution of the library is determined. The table on the next page summarizes the expected yield in nanograms and the approximate number of SMRT Cells that can be run with this amount of library. The electropherogram below depicts the size distribution of a library prepared from genomic DNA fragmented to 6 Kb.

Please note that for some projects the estimated number of SMRT Cells that can be run (based on the expected yield) will exceed the number of SMRT Cells necessary to complete the project. We are currently examining the feasibility of reducing the sample requirement, since a project that needs fewer SMRT Cells should be able to start with a reduced amount of sample. For the time being, however, we are unable to routinely support reduced sample amounts.

Type | SMRTbell library – Standard 6 kb | SMRTbell library – Standard 10 kb | SMRTbell library – 500 bp | SMRTbell library – Large Scale 10 kb
Expected yield (ng) | 300 ~ 500 | 750 ~ 1250 | 35 ~ 65 | 4250 ~ 6250
# of SMRT Cells for expected yield | 20 ~ 35 | 20 ~ 35 | 20 ~ 35 | 110 ~ 180

Bioanalyzer electropherogram of a library prepared from genomic DNA fragmented to 6 kb.

OUTPUT SPECIFICATIONS/FILE EXPLANATIONS

With each SMRT Cell that you run, you will receive between 12 and 14 associated files. Each of your files will have the following prefix which will allow you to uniquely identify your sample, the plate, and the SMRT Cell on which it was run: SAMPLENAME_CELL-ID_PLATEID_*

After sequencing is completed, PacBio RS software will automatically generate:

1. *metadata.xml: The metadata.xml file contains instrument information related to the conditions present during a sequencing run. It also provides key pieces of information for integration with primary and secondary analysis. This file is associated with one movie (denoted s1 or s2). Unless you are an advanced user, you will likely not need this information, but you should not delete it.

2. *bas.h5: These are the most basic sequence files output by the PacBio RS. Unless you are an advanced user you will likely not use this file; however, other than the metadata.xml file, all other files we provide can be derived from it. Therefore, we highly recommend that you back up all the *bas.h5 and metadata.xml files in a safe location.

3. fastq: This basic fastq file contains all the reads, with no adapter removal and no quality filters. It is automatically generated from the bas.h5 file.

4. ccs.fastq: This fastq file contains all the CCS reads with no quality filters. It is automatically generated from the bas.h5 file.

In addition to the files automatically generated by PacBio, EA also provides these additional output files for our clients:

1. clipped.fastq file – adapter sequences and internal control sequences are removed from the fastq file, and reads with quality below 75% (the PacBio recommended default) are filtered out, generating an intermediate file (clean.fastq). Reads in the clean.fastq file shorter than 50 nt are then removed and low quality bases are trimmed off the ends, resulting in the clipped.fastq file.

2. ccs.clipped.fastq file – reads shorter than 50 nt are removed and low quality bases are trimmed off the ends of the reads in the ccs.fastq file.

3. *clipped.info – In this file you will find an explanation of exactly what was removed during the transition from the clean.fastq to the clipped.fastq.

4. *.stats – EA provides four statistics files for each cell for a quick evaluation of your read data. Each statistics file is associated with a specific fastq file.

*not part of standard deliverables but available upon request
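The clipping step described above (trimming low quality bases from read ends and discarding reads shorter than 50 nt) can be sketched in Python as follows. This is a conceptual illustration only and does not reproduce the fastq-mcf algorithm: the quality cutoff of 10 and the choice to trim before applying the length filter are assumptions, while the 50 nt minimum length comes from the text.

```python
def trim_ends(sequence, qualities, min_quality):
    """Trim bases below min_quality from both ends of a read."""
    start, end = 0, len(sequence)
    while start < end and qualities[start] < min_quality:
        start += 1
    while end > start and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[start:end], qualities[start:end]


def clip_reads(reads, min_quality=10, min_length=50):
    """Trim each (name, sequence, qualities) read, then drop reads that are too short.

    min_quality is a hypothetical cutoff; min_length follows the 50 nt rule above.
    """
    kept = []
    for name, sequence, qualities in reads:
        sequence, qualities = trim_ends(sequence, qualities, min_quality)
        if len(sequence) >= min_length:
            kept.append((name, sequence, qualities))
    return kept


if __name__ == "__main__":
    reads = [
        ("read_a", "A" * 60, [2] * 5 + [30] * 50 + [2] * 5),  # 50 nt after trimming
        ("read_b", "C" * 59, [2] * 5 + [30] * 49 + [2] * 5),  # 49 nt after trimming
    ]
    for name, sequence, _ in clip_reads(reads):
        print(name, len(sequence))
```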

[Figure: file generation workflow. The PacBio instrument produces the bas.h5 and metadata.xml files (1 or 2 files each), from which the fastq and ccs.fastq files are generated. PacBio adapter removal, quality filtering and spike-in sequence filtering convert the fastq file into clean.fastq*. EA sequence clipping then produces clipped.fastq and clipped.fastq.info from clean.fastq, and ccs.clipped.fastq and ccs.clipped.fastq.info from ccs.fastq. EA generates a statistics file for each fastq file: fastq.stats, ccs.fastq.stats, clipped.fastq.stats and ccs.clipped.fastq.stats.]

REPRESENTATIVE DATA

FREQUENTLY ASKED QUESTIONS

1. What application(s) is the PacBio RS well suited for? The PacBio RS sequencing technology resolves single molecules in real time, allowing observation of structural and cell type variation not accessible with other technologies. These unique capabilities of the PacBio RS system are ideally suited for:

• Hybrid de novo assembly: long read lengths combined with high accuracy reads to produce high quality, finished genomes.

• Targeted sequencing: long read lengths provide comprehensive characterization of variants through standard and circular consensus sequencing.

2. Is amplification required? No, it is only needed if minimum DNA input requirements are not met. It is not a routine part of the sequencing assay. However, PCR enrichment has proven to be a very valuable method for selective enrichment of targets of interest in targeted resequencing experiments.

Typical Long Read Output (based on 62 SMRT Cells worth of data)

Number of Long Reads | Max Long Read Length | Long Read Mean | Long Read Total Bases
75,152 | 20,524 | 3,676 | 276,000,000

Clipped Long Reads | Clipped Max Long Read Length | Clipped Long Read Mean | Clipped Long Read Total Bases
44,232 | 14,630 | 2,630 | 116,000,000

Typical CCS Read Outputs (based on 24 SMRT Cells worth of data)

CCS Reads | CCS Read Mean | CCS Total Bases | Clipped CCS Reads | Clipped CCS Read Mean | Clipped CCS Total Bases
40,472 | 405 | 16,456,199 | 40,471 | 405 | 16,441,241

PacBio RS Sequencing Microbial Assembly

TABLE OF CONTENTS

• Introduction

• EA Microbial Assembly Pipeline (EA-MAP)

• Case Study

• Assembly Correctness

• Conclusions

• References

INTRODUCTION

Even with the volume of sequencing data generated today, de novo assembly of genomes remains a difficult and complex problem. Assembly software must take short stretches of DNA sequence (often less than 100 nt in length) and, using complex algorithms, assemble them into contiguous sequences (contigs), with the goal of achieving the lowest number of contigs with the greatest total length. Recently, Pacific Biosciences (PacBio) introduced a sequencing technology that significantly improves the read lengths achievable with high-throughput sequencing. Among a variety of applications for these reads, de novo assembly stands out as particularly well suited to this technology because of its ability to span repeat regions and its reduced sensitivity to GC bias. In particular, assemblies of the relatively small genomes (<6 Mbp) of microbial species are significantly improved by this technology, and several publications have demonstrated its utility [1-4].

While PacBio long reads have great promise in improving the assembly of microbial genomes, there are some drawbacks to the technology. PacBio long reads have a higher error rate than other next generation sequencing technologies (a stochastic error rate of ~15%) and a lower data output (~100 Mb per 90 min movie). Because of this, contemporary assembly procedures all focus on hybrid assembly, which uses short reads in conjunction with PacBio long reads to significantly reduce the complexity of any given assembly.

Two recent papers published in Nature Biotechnology illustrate two distinct ways to perform hybrid assembly [1,5]. In the first paper, the authors began with preassembled contigs generated from short-read data and, through a series of iterative steps, used long PacBio reads to connect or “scaffold” the contigs together. In the second paper, Koren and colleagues chose to correct the errors found in PacBio long reads by using the higher quality short reads as a reference. After correction, these long reads were then used in the assembly. At EA, we have recently adapted these methods to create our own hybrid microbial assembly pipeline that takes full advantage of all sequencing technologies, including PacBio, to produce the best possible assembly for our clients.

EA MICROBIAL ASSEMBLY PIPELINE (EA-MAP)

Similar to Koren et al., the EA-MAP first corrects PacBio reads with short, high quality reads using the pacBioToCA module from Celera Assembler [5]. These short reads can be from any technology (e.g. 454, IonTorrent, Illumina, and even PacBio CCS), as long as sufficient genomic coverage is obtained. Typically this is done with paired-end Illumina data, but optimally with CCS reads from inserts greater than 500 bp. Following correction, several filtering steps are performed, and then 20 independent assemblies of each specimen are performed using Celera Assembler with varying parameters/optimizations, in an effort to get the most complete and correct assembly possible.

One of the most critical, and often overlooked, parts of any assembly is its quality. Traditionally, many involved in the assembly of genomes looked at two metrics, the number of contigs and the N50 value. However, these numbers only assess completeness, not accuracy. The best assemblies are those with the fewest contigs while still remaining highly accurate. Because of this, the EA-MAP evaluates all 20 assemblies along 5 different features, designed to more accurately estimate the correctness of any given assembly. This evaluation uses the programs Amosvalidate [6] and FRCurve [7], which are capable of evaluating de novo assemblies. To do this, Amosvalidate maps the reads used for assembly back to the assembled contigs and then examines these alignments for possible signs of misassembly (see Table 1). FRCurve combines this data to generate a table that lists the number of “features” (i.e. potential assembly mistakes) as a function of total genome coverage. EA-MAP chooses the assembly with the lowest number of features at 90% of genome coverage as the best assembly.
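The selection rule described above can be sketched in Python. This is a simplified illustration and not the FRCurve or EA-MAP code: it assumes each assembly is given as a list of (contig length, feature count) pairs, adds contigs from longest to shortest until 90% of an estimated genome size is covered, and sums the features seen so far. The largest-first ordering and all names used here are assumptions.

```python
def features_at_coverage(contigs, genome_size, coverage=0.90):
    """Sum features over the longest contigs needed to reach the coverage target.

    contigs: list of (length, feature_count) pairs.
    Returns None if the assembly never reaches the target.
    """
    covered = 0
    features = 0
    for length, feature_count in sorted(contigs, key=lambda c: c[0], reverse=True):
        covered += length
        features += feature_count
        if covered >= coverage * genome_size:
            return features
    return None


def pick_best_assembly(assemblies, genome_size, coverage=0.90):
    """Return the name of the assembly with the fewest features at the target coverage."""
    scores = {}
    for name, contigs in assemblies.items():
        score = features_at_coverage(contigs, genome_size, coverage)
        if score is not None:
            scores[name] = score
    return min(scores, key=scores.get) if scores else None


if __name__ == "__main__":
    # Toy data: two candidate assemblies of a 1,000 bp "genome".
    assemblies = {
        "run_a": [(500, 2), (400, 1), (100, 5)],
        "run_b": [(600, 4), (300, 3), (100, 0)],
    }
    print(pick_best_assembly(assemblies, genome_size=1000))
```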

After reducing the 20 assemblies down to the 1 assembly that has been assessed as the best and most accurate, EA-MAP looks to connect and extend contigs into scaffolds. Because PacBio reads cover regions that are often missed by other technologies (e.g. repeat regions, high GC regions), the correction of PacBio long reads before assembly leaves many sequences uncorrected, and these are automatically discarded by the correction software. To take full advantage of these reads and of the ability of PacBio long reads to span regions that go uncovered by the more common next generation sequencing reads, we employ a final finishing step. This finishing step follows the general strategy outlined in Bashir et al. [1], in which contigs are connected with uncorrected PacBio long reads using the PacBio program AHA, part of the SMRTAnalysis package. These connections both extend the final contig length and generate scaffolds, which are simply contigs that are known to be connected but with unknown bases in between (denoted by “N”).

By using this assembly pipeline, EA is able to greatly enhance previously completed assemblies. For microbial genomes (<6 Mb), EA typically sees the total number of contigs reduced to approximately 100 per assembly, with maximum contig lengths of 300 kb or higher and N50 values greater than 150 kb. Scaffolds generated by the final EA-MAP finishing step demonstrate even higher levels of completeness. All of this is achieved while ensuring a high level of assembly accuracy (see Figures 1-4 and Tables 2 and 3).

CASE STUDY

To demonstrate the utility of our pipeline, four different E. coli strains were sequenced with both conventional second generation sequencing technology and the PacBio RS. DNA from each strain was sheared with the Covaris g-Tube, resulting in long DNA fragments (~10 kb), which then went into our standard PacBio library prep protocol. Following preparation, the libraries were sequenced on the PacBio instrument using 4 SMRT Cells per strain.

Second generation sequencing reads were generated on an Illumina HiSeq. For each strain, 300bp insert paired end libraries were generated and sequenced as 2x100 to approximately 50X coverage.

Each PacBio sequencing reaction underwent our standard processing pipeline (see our Sequencing Guide for details) before being plugged into EA-MAP along with its corresponding short read data. To compare these assemblies with a short-read only assembly, the 50X 2x100 Illumina data was also assembled with Velvet, a commonly used, open-source assembly package [8]. Six Velvet assembly conditions were tested for each strain of E. coli, and assemblies were evaluated in a manner similar to that used in the EA-MAP pipeline. After the best Velvet assembly for each strain was identified, the resultant contigs were filtered to remove contigs of less than 101 bases, which represent unassembled reads.

Comparisons were then made between the best EA-MAP assemblies and the best Velvet assembly. EA-MAP produced a stark improvement over short-read only data. Number of contigs went down by approximately 80% (Figure 1). While the maximum contig length was largely unaffected, the mean contig length was over 5 times longer (Figure 2), and the N50 value was increased by 25% (Figure 3).
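For reference, the N50 value used in these comparisons is the contig length at which contigs of that length or longer contain at least half of the total assembled bases. A minimal Python sketch of the calculation, using toy contig lengths rather than data from this study:

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the running sum to half the total.
        if running * 2 >= total:
            return length
    return 0


if __name__ == "__main__":
    toy_contigs = [80, 70, 50, 40, 30, 20]  # total 290, half is 145
    print("N50:", n50(toy_contigs))
```

Because N50 depends only on contig lengths, it measures contiguity and says nothing about whether the contigs are correct.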

Figure 1 – Number of contigs significantly reduced with PacBio and EA-MAP. Four E. coli strains were assembled with either Velvet and Illumina-only data, or with EA-MAP and Illumina + PacBio data. The number of contigs was drastically reduced when using EA-MAP.

Figure 2 – Mean contig length significantly increased with PacBio and EA-MAP. Four E. coli strains were assembled with either Velvet and Illumina-only data, or with EA-MAP and Illumina + PacBio data. The average length of the resultant contigs was roughly 5 times greater than with Velvet and Illumina.

Figure 3 – N50 length increased with PacBio and EA-MAP. Four E. coli strains were assembled with either Velvet and Illumina-only data, or with EA-MAP and Illumina + PacBio data. N50 was increased by 25% when using EA-MAP and Illumina + PacBio.

In addition to the E. coli assemblies, two strains of Salmonella enterica were also sequenced on the PacBio, this time with 6 SMRT Cells per strain. As with the strains of E. coli, genomic DNA was sheared, prepared into libraries and sequenced on the PacBio machine. After sequencing, the samples went through our standard PacBio pipeline. Short read data was downloaded from the Short Read Archive database maintained by NCBI. The short read data and PacBio data were fed into EA-MAP, and the resulting assembly statistics can be seen in Tables 2 and 3.

ASSEMBLY CORRECTNESS

Beyond reducing the number of contigs and increasing their average size, using EA-MAP also produces a more accurate assembly. To visualize this we used a program called FRCurve [7]. FRCurve, aspects of which are automatically incorporated into EA-MAP, allows visualization of de novo assembly accuracy by using the input reads to assign possible errors, which are called “features.” Since Velvet and EA-MAP assemblies use slightly different input data, there will be some differences in how FRCurve works with the two sets of data; however, the plots can give us a good idea of the comparative accuracies of each assembly. Figure 4 shows that the EA-MAP assemblies were significantly more accurate than those with Illumina-only data.

Figure 4 – The number of “features” (i.e. potential assembly errors) identified by FRCurve is approximately 10-fold lower when using EA-MAP with PacBio + Illumina data than when using Velvet with Illumina-only data. The number of potential errors was calculated with approximately 50% of the genome covered and with 75% of the genome covered.

CONCLUSIONS

Many researchers have generated short read next generation sequencing data and attempted to assemble their microbes of interest. Because of this, EA will gladly accept your previously generated data (minimum of 50X coverage, 100bp in length or greater), and combine it with PacBio data generated here. With these two sources of data, EA-MAP generates a new, more complete and accurate assembly.

EA consistently sees drastic improvements in the assembly of microbes. Critically important metrics such as number of contigs and mean contig length are greatly improved over short-read only data. More importantly, the assemblies come out more accurate than those assembled with only short read data.

REFERENCES

1. Bashir, A., et al. “A hybrid approach for the automated finishing of bacterial genomes.” Nature Biotechnology 30:7, 2012.

2. Grad, Y.H., et al. “Genomic epidemiology of the Escherichia coli O104:H4 outbreaks in Europe, 2011.” PNAS 109:8, 2012.

3. Rasko, D.A., et al. “Origins of the E. coli Strain Causing an Outbreak of Hemolytic–Uremic Syndrome in Germany.” N Engl J Med 365, 2011.

4. Chin, C-S., et al. “The Origin of the Haitian Cholera Outbreak Strain.” N Engl J Med 364, 2011.

5. Koren, S., et al. “Hybrid error correction and de novo assembly of single-molecule sequencing reads.” Nature Biotechnology 30:7, 2012.

6. Phillippy, A.M., et al. “Genome assembly forensics: finding the elusive mis-assembly.” Genome Biology 9:R55, 2008.

7. Narzisi, G. and Mishra, B. “Comparing De Novo Genome Assembly: The Long and Short of It.” PLoS ONE 6:4, 2011.

8. Zerbino, D.R. and Birney, E. “Velvet: Algorithms for de novo short read assembly using de Bruijn graphs.” Genome Research 18, 2008.

Table 1 – Assembly evaluation using Amosvalidate. Input reads are mapped back to assembled contigs, and the following metrics are evaluated.

Metric: Paired-end read concordance
When evaluated: Only when paired-end short-read data is provided
Description: When a paired-end library is generated, the distance between sequences is known and their orientation is known. This evaluation determines whether paired-end reads are oriented properly on the assembled contigs, and whether they are approximately the appropriate distance apart. When multiple paired-end reads pile up with incorrect orientation/spacing, this region becomes suspicious.

Metric: SNP detection
When evaluated: Always
Description: This occurs when the majority of reads indicate one particular base, but a subset of reads indicate a second, different base. This scenario indicates misassemblies such as collapsed repeats or multicopy genes.

Metric: Depth of coverage
When evaluated: Always
Description: If shearing of genomic DNA is random, then coverage across the contigs should be fairly uniform. Nonuniformity could again indicate problems with the assembly.

Metric: Singleton analysis
When evaluated: Always
Description: Reads that were input into the assembly, but not used in the actual assembly, are known as singletons. If multiple singletons partially align to a contig, with that alignment ending at the same position, it suggests that there was a misassembly in this place.

Metric: Repeat K-mer analysis
When evaluated: Always
Description: Directly from the amosvalidate website: "A k-mer is a k-length substring of a longer sequence. Using a sliding window across a sequence, we can catalog all k-mers and count the number of occurrences of each. Call K_r the set of k-mers in the reads, and K_c the set of k-mers in the contig consensus sequences. A normalized k-mer count, K*, is the number of times a given k-mer q occurs in K_r divided by the number of times q occurs in K_c. This simple statistic can reveal which repeats have been mis-assembled."
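The normalized k-mer count K* quoted in Table 1 can be illustrated with a few lines of Python. This sketch follows the quoted definition literally (occurrences in the reads divided by occurrences in the contig consensus) and, as a simplification, ignores reverse complements; it is not the Amosvalidate implementation, and the function names are illustrative.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every k-mer seen with a sliding window across the given sequences."""
    counts = Counter()
    for sequence in sequences:
        for start in range(len(sequence) - k + 1):
            counts[sequence[start:start + k]] += 1
    return counts


def normalized_kmer_counts(reads, contigs, k):
    """Return K* for each k-mer in the contigs: count in reads / count in contigs."""
    in_reads = kmer_counts(reads, k)
    in_contigs = kmer_counts(contigs, k)
    return {kmer: in_reads[kmer] / count for kmer, count in in_contigs.items()}


if __name__ == "__main__":
    reads = ["ACGT", "ACGT", "ACGA"]
    contigs = ["ACGT"]
    for kmer, k_star in sorted(normalized_kmer_counts(reads, contigs, k=3).items()):
        print(kmer, k_star)
```

A k-mer whose K* is much higher than the typical read coverage appears more often in the reads than the assembly accounts for, which is the signature of a collapsed repeat.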

Table 2 – Detailed results of Assemblies. Four strains of E. coli were assembled with either Illumina only and Velvet assembler or with Illumina and PacBio using EA-MAP. EA-MAP assemblies were more complete.

Genome | Short Read Data | ~Short read coverage | # of PacBio SMRT Cells | ~PacBio Coverage | Contigs | Mean Contig Length | Max Contig Length | N50 Contig Length
E. coli Strain 1 | 2x100 Illumina | 50x | n/a | n/a | 735 | 7546 | 475396 | 81963
E. coli Strain 1 | 2x100 Illumina | 50x | 4 | 80x | 105 | 54152 | 426720 | 108567
E. coli Strain 2 | 2x100 Illumina | 50x | n/a | n/a | 230 | 20871 | 513032 | 169443
E. coli Strain 2 | 2x100 Illumina | 50x | 4 | 80x | 42 | 116250 | 534930 | 236189
E. coli Strain 3 | 2x100 Illumina | 50x | n/a | n/a | 560 | 9406 | 374003 | 132620
E. coli Strain 3 | 2x100 Illumina | 50x | 4 | 80x | 154 | 36771 | 380906 | 144019
E. coli Strain 4 | 2x100 Illumina | 50x | n/a | n/a | 572 | 9263 | 345671 | 91842
E. coli Strain 4 | 2x100 Illumina | 50x | 4 | 80x | 117 | 46957 | 270371 | 116884
S. enterica Strain 1 | 2x150 Illumina (MiSeq), 454, Ion Torrent | 24x, 24x, 40x | 6 | 80x | 75 | 73210 | 716510 | 369620
S. enterica Strain 2 | 454 | 30X | 6 | 140x | 34 | 139574 | 495564 | 250915
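The summary figures quoted in the case study can be cross-checked against the E. coli rows of Table 2. The Python sketch below averages the per-strain changes between the Illumina-only (Velvet) rows and the Illumina + PacBio (EA-MAP) rows. Averaging per-strain ratios is an assumption about how the summary numbers were derived, and the variable and function names are illustrative.

```python
# (contigs, mean contig length, N50 contig length) copied from Table 2.
E_COLI = {
    "Strain 1": {"velvet": (735, 7546, 81963), "ea_map": (105, 54152, 108567)},
    "Strain 2": {"velvet": (230, 20871, 169443), "ea_map": (42, 116250, 236189)},
    "Strain 3": {"velvet": (560, 9406, 132620), "ea_map": (154, 36771, 144019)},
    "Strain 4": {"velvet": (572, 9263, 91842), "ea_map": (117, 46957, 116884)},
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def summarize(table):
    """Average the per-strain changes between the Velvet and EA-MAP assemblies."""
    rows = list(table.values())
    return {
        "contig_reduction": mean(1 - r["ea_map"][0] / r["velvet"][0] for r in rows),
        "mean_length_fold": mean(r["ea_map"][1] / r["velvet"][1] for r in rows),
        "n50_increase": mean(r["ea_map"][2] / r["velvet"][2] - 1 for r in rows),
    }


if __name__ == "__main__":
    summary = summarize(E_COLI)
    print(f"Contig count reduction: {summary['contig_reduction']:.0%}")
    print(f"Mean contig length fold change: {summary['mean_length_fold']:.1f}x")
    print(f"N50 increase: {summary['n50_increase']:.0%}")
```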

Table 3 – Final assembly scaffold numbers – After generation of the contigs, EA-MAP does a final scaffolding step which significantly improves the assembly. Compare scaffold statistics to contig statistics in Table 2.

Genome | Scaffolds | Mean Scaffold Length | Max Scaffold Length | N50 Scaffold Length
E. coli Strain 1 | 70 | 81926 | 506976 | 222032
E. coli Strain 2 | 30 | 163039 | 692905 | 506036
E. coli Strain 3 | 131 | 43287 | 665361 | 182649
E. coli Strain 4 | 86 | 64153 | 337220 | 152331
S. enterica Strain 1 | 72 | 76290 | 1276158 | 498294
S. enterica Strain 2 | 23 | 206480 | 917550 | 457103

PacBio RS Sequencing Variant Calling

TABLE OF CONTENTS

• Summary

• Introduction

• Example

• Case Study

• FLT3-ITD Study

• Conclusions

• References

SUMMARY

Single nucleotide polymorphism discovery is a key feature of any DNA sequencing study. EA is proud to offer a service to our clients that combines both traditional SNP calling and single molecule variant detection, using the PacBio RS sequencer. This service (EA-SMVD) calls variants both in isolation and on a per molecule basis, allowing the investigator to identify SNPs that occur on the same DNA strand.

• EA-SMVD works with amplicons up to 1.5 kb in length

• Detects SNP and genotype relative frequency (based on directly observed linkage), down to 1% sensitivity

• Infers amino acid changes (strand-specific) due to identified variants

• Provides easy-to-interpret reports

• Is applicable to any amplicon under 1.5 kb in length, including complex viral mixtures, tumor samples, and 16S rRNA profiling

INTRODUCTION

Single nucleotide polymorphisms (SNPs) are ubiquitous in genomic studies, since many contribute directly to a variety of phenotypes or are correlated with other variants underlying a phenotype of interest. Generally, these studies rely on sequencing DNA of interest in a high-throughput manner on a second-generation sequencing device in hopes of identifying SNPs that correlate with any number of phenotypes. However, there are a few drawbacks to using second-generation sequencing platforms to call SNPs. All second-generation sequencers are known to have base-calling biases that can obscure accurate variant calling [1-3]. Additionally, the short sequence reads generated by these devices greatly increase the difficulty of identifying SNPs that may be found on the same physical strand of DNA. Sometimes called phasing or linkage, information about which SNPs occur on the same physical nucleic acid strand has the potential to revolutionize our understanding of a variety of biological processes, such as viral drug response, and to provide deeper insights into the accumulation of somatic mutations that occurs in cancer [4].

Recently these drawbacks have been overcome with the advent of third-generation sequencing instruments that produce long sequencing reads in a high-throughput manner. A leading example is the PacBio RS instrument, which can generate highly accurate sequence reads of up to 1.5 kb with no detectable base-calling biases [1]. To take advantage of this technology and confidently identify SNPs that are in phase with one another, EA has developed a data analysis pipeline that identifies SNPs not only in isolation, but also in the context of the DNA molecule on which they are found. We call this the EA Single Molecule Variant Detector (EA-SMVD).

EXAMPLE

Consider Figure 1 in which a PCR product has been sequenced. In this example six strands are illustrated. Strands 1 and 6 have a SNP at nucleotide 222 that results in an A>G conversion. Strands 2 and 4 have two SNPs, one at nucleotide position 300 (G>C), and the second at position 920 (T>C). Finally strands 3 and 5 are wildtype.

Figure 1

A traditional SNP caller may report the following:

Position 222 A>G: 33.3%
Position 300 G>C: 33.3%
Position 920 T>C: 33.3%

Unfortunately, the investigator has no information as to whether these mutations are all on the same strand, if they are all on different strands, or if specific variants are linked. Contrast this to the output of EA-SMVD, which informs the investigator that there are three genotypes present and provides the relative percentage of each.

Genotype 1: 222:A>G Count: 2 Percent: 33.3%
Genotype 2: 300:G>C; 920:T>C Count: 2 Percent: 33.3%
Genotype 3: WT Count: 2 Percent: 33.3%

Now the investigator knows which SNPs are linked, which are independent, and how many strands contain no variants at all, providing more insight into the nature of the specimen.
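The difference between the two kinds of report can be illustrated with a short Python sketch. It is not EA-SMVD itself: the data structure and function names are hypothetical, and each of the six strands from the example is represented simply as the list of variants observed on it.

```python
from collections import Counter

# The six strands of the example, each listed with the variants it carries.
strands = {
    1: ["222:A>G"],
    2: ["300:G>C", "920:T>C"],
    3: [],
    4: ["300:G>C", "920:T>C"],
    5: [],
    6: ["222:A>G"],
}


def per_position_frequencies(strands):
    """Traditional view: percent of strands carrying each variant, linkage ignored."""
    counts = Counter(v for variants in strands.values() for v in variants)
    return {v: 100.0 * c / len(strands) for v, c in counts.items()}


def per_molecule_genotypes(strands):
    """Single-molecule view: count strands by their full combination of variants."""
    counts = Counter()
    for variants in strands.values():
        # Sort by nucleotide position so the same combination always gets the same key.
        ordered = sorted(variants, key=lambda v: int(v.split(":")[0]))
        counts["; ".join(ordered) if ordered else "WT"] += 1
    return {g: (c, 100.0 * c / len(strands)) for g, c in counts.items()}


position_report = per_position_frequencies(strands)
genotype_report = per_molecule_genotypes(strands)
```

Both reports are computed from the same reads. The per-position report discards the information about which variants share a strand, while the per-molecule report keeps it.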

In addition, EA-SMVD goes the extra step to infer, at the amino acid level, the nature of the mutation. EA-SMVD automatically translates each mutation in all three reading frames, and reports back those SNPs that result in synonymous and nonsynonymous mutations and how that impacts the strands. For example, suppose that the reference given in Figure 1 is given in reading frame 1. Furthermore, let us suppose that nucleotide position 222, which is equivalent to amino acid position 74, results in a synonymous mutation, while the mutation at position 300 results in an Arginine to Serine mutation at amino acid position 100 and mutation 920 results in a Leucine to Proline mutation at amino acid position 307. EA-SMVD will take this into account, producing an output SNP file that is similar to the following:

Mutation 222:A>G Count: 2 Percent: 33.3% Synonymous Mutation: A74A
Mutation 300:G>C Count: 2 Percent: 33.3% Nonsynonymous Mutation: R100S
Mutation 920:T>C Count: 2 Percent: 33.3% Nonsynonymous Mutation: L307P

EA-SMVD will also infer implications of these mutations for individual strands and will produce a second SMVD file based on amino acid changes. In this scenario, the strands containing the nucleotide mutation 222:A>G are now recognized as wildtype strands, since the change is synonymous:

Amino acid mutations: R100S; L307P Count: 2 Percent: 33.3%
Amino acid mutations: Wildtype Count: 4 Percent: 66.6%
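The mapping from nucleotide position to amino acid position, and the synonymous/nonsynonymous call, can be sketched as follows. This is an illustration using the standard genetic code, not the EA-SMVD implementation; the function names and the toy reference sequence are hypothetical, and the sketch assumes the position falls within a complete codon of the chosen reading frame.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def amino_acid_position(nt_pos, frame=1):
    """Map a 1-based nucleotide position to a 1-based amino acid position."""
    return (nt_pos - frame) // 3 + 1


def classify_snp(reference, nt_pos, alt, frame=1):
    """Return (kind, label) for a SNP, e.g. ("Nonsynonymous", "D2Y")."""
    offset = nt_pos - frame                      # 0-based offset from the frame start
    codon_start = (frame - 1) + (offset // 3) * 3
    codon = reference[codon_start:codon_start + 3]
    within = offset % 3                          # position of the SNP inside its codon
    mutant = codon[:within] + alt + codon[within + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutant]
    kind = "Synonymous" if ref_aa == alt_aa else "Nonsynonymous"
    return kind, f"{ref_aa}{amino_acid_position(nt_pos, frame)}{alt_aa}"


# Toy reference (hypothetical) encoding Met-Asp-Phe in reading frame 1.
toy_reference = "ATGGATTTT"
call = classify_snp(toy_reference, 4, "T")   # GAT -> TAT
```

With frame 1, nucleotide positions 222, 300, and 920 map to amino acid positions 74, 100, and 307, matching the example above.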

CASE STUDY

To validate EA-SMVD’s ability to properly identify phased SNPs and to determine its sensitivity, we performed an experiment in which a characterized wildtype amplicon derived from HCV-1a strain H77 was spiked with increasing quantities of a known mutant amplicon containing two mutations (Figure 2). The amplicons were approximately 1 kb in length. Mixtures were prepared at ratios of 50:50, 90:10, 95:5, and 99:1 wildtype:mutant.

Figure 2

The mixtures were prepared using standard PacBio library preparation procedures and sequenced on a PacBio RS instrument. After sequencing, the circular consensus reads (PacBio’s high-quality sequence reads; see our general introduction to PacBio sequencing) were processed through the EA-SMVD pipeline to determine how EA-SMVD would perform with actual third-generation sequencing data.

For each of the admixtures, EA-SMVD was able to correctly identify the linked mutations with few or no false-positive calls (Table 1). In addition, EA-SMVD translated these mutations in all three reading frames relative to the provided reference. In this case the reference was given in the first reading frame, and EA-SMVD identifies both mutations that result in amino acid changes.

Each EA-SMVD run produces four output files. The first file is a simple listing of all the different Genotypes present and their relative percentage. For example the output of our 50% mutant sample appears as follows:

Genotype 1: Wildtype Count: 861 Percent of strands: 48.920
Genotype 2: 446:C>T; 482:A>T Count: 878 Percent of strands: 49.972
Genotype 3: 446:C>T Count: 21 Percent of strands: 1.195

The second file is a listing of SNPs with no information regarding linkage of the mutations:

Mutation: 482A>T Count: 878 Percent of strands: 49.943 Missense Mutation: D161V
Mutation: 446C>T Count: 894 Percent of strands: 50.882 Missense Mutation: A149V

The next file includes information regarding how the different nucleotide mutations influence the amino acid sequence of the different genotypes.

A.A. mutation positions and changes: Wildtype Count: 861 Percent of strands: 48.920
A.A. mutation positions and changes: A149V;D161V Count: 878 Percent of strands: 49.972
A.A. mutation positions and changes: A149V Count: 21 Percent of strands: 1.195

Finally, a FASTA-formatted file with all genotypes present above a given threshold is provided so that the end user can view their mutations in the context of the entire reference sequence. In this case three sequences would be output: one with the reference sequence, the second with the 50% mutant sequence containing both nucleotide mutations, and the third with the 1% genotype containing one mutant nucleotide.
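One way such a FASTA file could be produced is sketched below. This is an assumption-laden illustration, not the actual EA-SMVD output code: the function names, the header layout, the default 1% threshold, and the toy reference are all hypothetical. Genotype strings follow the "position:ref>alt" form shown in the genotype listing above.

```python
def apply_genotype(reference, genotype):
    """Return the reference with every mutation of a genotype string applied."""
    if genotype in ("Wildtype", "WT"):
        return reference
    bases = list(reference)
    for mutation in genotype.split(";"):
        position, change = mutation.strip().split(":")
        ref_base, alt_base = change.split(">")
        index = int(position) - 1            # genotype positions are 1-based
        if bases[index] != ref_base:
            raise ValueError(f"reference base mismatch at position {position}")
        bases[index] = alt_base
    return "".join(bases)


def genotypes_to_fasta(reference, genotype_percents, threshold=1.0, width=60):
    """Write one FASTA record per genotype at or above the threshold percent."""
    records = []
    kept = 0
    for genotype, percent in genotype_percents.items():
        if percent < threshold:
            continue
        kept += 1
        sequence = apply_genotype(reference, genotype)
        header = f">genotype_{kept} {genotype} percent={percent}"
        lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
        records.append("\n".join([header] + lines))
    return "\n".join(records) + "\n"


# Toy reference and genotype frequencies (hypothetical values).
toy_reference = "ACGTACGTAC"
toy_genotypes = {"Wildtype": 48.9, "2:C>T; 5:A>T": 50.0, "2:C>T": 0.5}
fasta_text = genotypes_to_fasta(toy_reference, toy_genotypes)
```

In this toy run the third genotype falls below the threshold and is omitted, so two records are written.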

Table 1 – Different ratios of wildtype and mutant amplicons were mixed, sequenced on a PacBio RS, and evaluated through EA-SMVD. Background was defined as mutant strains identified by EA-SMVD in excess of 1%.

Wildtype:Mutant  Approximate Insert Size  Initial CCS Reads  Read Depth Post Filtering  Identified Mutant Percent  Background
99:1             1000                     3981               685                        1.335                      Two other strains
95:5             1000                     4296               1134                       4.223                      None
90:10            1000                     1773               468                        7.725                      None
50:50            1000                     6839               1760                       49.972                     One other strain*

*This other strain was one of the intentional mutations (446C>T) which 1% of the time was not found linked to the other.
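The effect of read depth on the precision of the percentages in Table 1 can be illustrated with a simple binomial sampling model. This is a back-of-the-envelope sketch, not part of EA-SMVD: it assumes each filtered read is an independent draw from the mixture and ignores sequencing error and amplification bias. The function names are hypothetical.

```python
import math


def expected_mutant_reads(depth, frequency):
    """Expected number of reads carrying the mutant genotype."""
    return depth * frequency


def binomial_standard_error(depth, frequency):
    """Standard error of the observed mutant fraction under binomial sampling."""
    return math.sqrt(frequency * (1.0 - frequency) / depth)


# The 99:1 mixture in Table 1 had 685 reads after filtering.
reads_at_685 = expected_mutant_reads(685, 0.01)
se_at_685 = binomial_standard_error(685, 0.01)
se_at_5000 = binomial_standard_error(5000, 0.01)
```

Under these assumptions, a 1% genotype at a depth of 685 is represented by only about seven reads, and the standard error of the estimated fraction is roughly 0.38 percentage points. At a depth of 5000 the standard error falls to roughly 0.14 percentage points.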

EA-SMVD results agreed closely with expected percentages. Also as expected, EA-SMVD became less accurate with lower depth, since this increases the likelihood of sampling bias. When designing experiments, we recommend a depth of 5000 CCS reads for calling SNPs between 1-5%, between 1000 and 5000 reads for calling SNPs at 5-10%, and at least 500 reads for calling SNPs at 50%. Nevertheless, with these lower numbers, EA-SMVD provided highly accurate estimates of genotype frequencies.

FLT3-ITD STUDY

As a further validation of our variant calling pipeline, we examined data recently published in Nature that explored SNP calling using PacBio reads in human acute myeloid leukemia patients [5]. Although this study was not specifically focused on detecting linked variants, the authors did identify one set of linked variants.

For this comparison study we first downloaded the data from NCBI’s Sequence Read Archive. After downloading and labeling the samples appropriately, each sample was run through the EA-SMVD pipeline, with some modifications. The reads in this study were of both lower quality and lower depth than what is typically generated at EA. Therefore we reduced our quality filter from reads with an average quality of Q30 to an average quality of Q20. Furthermore, we typically throw out any reads with more than 10 consecutive insertions as compared to the reference. Here we relaxed those restrictions, placing no limits on the number of consecutive insertions.
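The two read filters described above can be sketched as follows. This is an illustration under stated assumptions, not the EA-SMVD code: it assumes the average quality is the arithmetic mean of per-base Phred scores, that a read at exactly the cutoff passes, and that each read's alignment to the reference is available as a SAM-style CIGAR string. The function and parameter names are hypothetical.

```python
import re


def mean_quality(qualities):
    """Arithmetic mean of per-base Phred quality scores."""
    return sum(qualities) / len(qualities)


def longest_insertion(cigar):
    """Length of the longest insertion (I) operation in a CIGAR string."""
    operations = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return max((int(n) for n, op in operations if op == "I"), default=0)


def passes_filters(qualities, cigar, min_mean_quality=30, max_insertion=10):
    """Apply the quality filter and, unless disabled with None, the insertion filter."""
    if mean_quality(qualities) < min_mean_quality:
        return False
    if max_insertion is not None and longest_insertion(cigar) > max_insertion:
        return False
    return True


# A hypothetical read with mean quality Q25 and a 12-base insertion.
read_qualities = [25] * 100
read_cigar = "50M12I50M"
default_result = passes_filters(read_qualities, read_cigar)
relaxed_result = passes_filters(read_qualities, read_cigar,
                                min_mean_quality=20, max_insertion=None)
```

The default arguments correspond to the usual settings described in the text (Q30, at most 10 consecutive insertions). The relaxed call corresponds to the settings used for the published data set (Q20, no insertion limit).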

The data is divided into three subsets. Subset one consists of normal samples, subset two consists of patients prior to treatment, and subset three consists of patients after relapse. The focus of the study was on several mutations in the kinase domain of FLT3-ITD that confer resistance to the drug AC220. For the normal samples neither we nor the original authors detected any SNPs at the investigated sites. In patients prior to treatment with AC220 the authors also found no evidence of significant mutation accumulation at the drug-resistance sites. EA-SMVD largely confirmed these observations with one notable exception: EA-SMVD identified a G>T mutation resulting in a D835Y conversion in patient 1005-009 prior to treatment at a percentage of 15.7%.

To investigate this disagreement we examined this region in the Broad Institute’s Integrative Genomics Viewer (IGV) [6]. IGV confirmed the presence of this mutation in our studied sample, suggesting that EA-SMVD accurately called this mutation. Both IGV and EA-SMVD found exactly 49 G>T mutations at this site.

In contrast to the pretreatment samples, the relapse samples had many mutations at the investigated sites. We detected every SNP reported by the original investigators (Table 2). In most cases our percentages were also in close agreement with the original report. Small deviations are likely due to differences in pre-processing of the data. The authors do not go into a detailed discussion regarding the preprocessing of their reads. As mentioned, we filtered for reads with an average base quality of over Q20 and we applied some size filtering. In addition to the SNP calls, EA-SMVD was able to identify the same linked mutation as was identified in the study that results in a D835F nonsynonymous mutation in subject 1009-003.

Table 2 – Comparison of Patient Relapse SNP prevalence between a previously published report and EA-SMVD. In those situations where the two numbers disagreed by more than 10%, IGV was used as a third, independent measure of SNP count and percentages.

Subject Number  Mutation  Native Codon  Alternative Codon  Published Percent  Published Count  EA-SMVD Percent  EA-SMVD Count  IGV Percent  IGV Count
1009-003        D835Y     GAT           TAT                8.4                28               16.4             92             18           92
                D835V     GAT           GTT                3.3                11               15.3             86             16           86
                D835F**   GAT           TTT                10.2               34               8.4              47
1011-006        D835Y     GAT           TAT                41                 165              17.8             166            19           170
1011-007        F691L     TTT           TTG                6.2                21               7.5              35             8            35
                D835Y     GAT           TAT                3                  13               2.9              13
                D835V     GAT           GTT                29.6               129              27.6             124
1005-004        F691L     TTT           TTG                29.6               152              29.7             236
1005-006        D835Y     GAT           TAT                39.5               103              23.3             159            24           159
                D835F     GAT           TTT                2.7                7                4.1              28             4            28
1005-007        D835Y     GAT           TAT                4                  15               2.4              13             2            13
                D835V     GAT           GTT                47.4               179              31.4             169            36           170
1005-009        D835Y     GAT           TAT                50.6               225              46.2             165
1005-010        F691L     TTT           TTG                25.3               38               32.4             100            33           100

For those mutations in which EA-SMVD’s reported numbers differed significantly (>10%) we reviewed the calls in IGV. In all cases investigated IGV confirmed EA-SMVD as accurately calling the appropriate number of SNPs. Again, since the exact method of read pre-processing employed by the authors is unclear, we cannot determine the root cause of the differences.

CONCLUSIONS

Variant calling with the PacBio RS and EA-SMVD has proven to be an exquisitely sensitive and unique method for identifying SNPs, both in isolation and phased, in multiple datasets with multiple sample types. Its ability to accurately identify phased SNPs promises to have applications across a wide spectrum of research questions.

REFERENCES

1. Carneiro et al. “Pacific biosciences sequencing technology for genotyping and variation discovery in human data.” BMC Genomics 2012 13:375.
2. Ledergerber et al. “Base-calling for next-generation sequencing platforms.” Brief Bioinform. 2011 12:489.
3. Luo et al. “Direct Comparisons of Illumina vs. Roche 454 Sequencing Technologies on the Same Microbial Community DNA Sample.” PLOS One 2012 7.
4. Stratton, M.R. “Exploring the Genomes of Cancer Cells: Progress and Promise.” Science 2011 331:1553.
5. Smith et al. “Validation of ITD mutations in FLT3 as a therapeutic target in human acute myeloid leukemia.” Nature 2012 485:260.
6. Robinson et al. “Integrative Genomics Viewer.” Nature Biotechnology 2011 29:24.

PROTOCOLS

Sequencing Protocols

TABLE OF CONTENTS

• RNA-SEQ

• Agilent—Exome

• Small RNA Sequencing Protocols

RNA-SEQ

Libraries are prepared for RNA-Seq using the TruSeq RNA Sample Prep Kit (Illumina), including the use of Illumina in-line control spike-in transcripts. Prior to library preparation, RNA samples are quantitated by spectrophotometry using a Nanodrop ND-8000 spectrophotometer, and assessed for RNA integrity using an Agilent 2100 BioAnalyzer or Caliper LabChip GX. RNA samples with A260/A280 ratios ranging from 1.6 – 2.2, with RIN values ≥ 7.0, and for which at least 500 ng of total RNA is available will proceed to library preparation.

Library preparation begins with 500 ng of RNA in 50 µl of nuclease-free water, which is subjected to poly(A)+ purification using oligo-dT magnetic beads. After washing and elution, the polyadenylated RNA is fragmented to a median size of ~150 bp and then used as a template for reverse transcription. The resulting single-stranded cDNA is converted to double-stranded cDNA, ends are repaired to create blunt ends, then a single A residue is added to the 3’ ends to create A-tailed molecules. Illumina indexed sequencing adapters are then ligated to the A-tailed double-stranded cDNA. A single index is used for each sample. The adapter-ligated cDNA is then subjected to PCR amplification for 15 cycles. This final library product is purified using AMPure beads (Beckman Coulter), quantified by qPCR (Kapa Biosystems), and its size distribution assessed using an Agilent 2100 BioAnalyzer or Caliper LabChip GX. Following quantitation, an aliquot of the library is normalized to 2 nM concentration and equal volumes of specific libraries are mixed to create multiplexed pools in preparation for sequencing.
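The normalization to 2 nM is a standard dilution calculation (C1 × V1 = C2 × V2). A minimal sketch is shown below, assuming the library concentration in nM is already known from qPCR. The function name, the 20 µl final volume, and the 10 nM stock concentration are hypothetical values chosen for illustration.

```python
def dilution_volumes(stock_nm, target_nm=2.0, final_volume_ul=20.0):
    """Return (library volume, diluent volume) in µl to reach the target concentration.

    Uses C1 * V1 = C2 * V2, where C1 is the measured library concentration.
    """
    if stock_nm < target_nm:
        raise ValueError("library is already below the target concentration")
    library_ul = target_nm * final_volume_ul / stock_nm
    return library_ul, final_volume_ul - library_ul


# A hypothetical library measured at 10 nM, diluted to 2 nM in 20 µl.
library_ul, diluent_ul = dilution_volumes(10.0)
```

Once every library is at the same molar concentration, mixing equal volumes gives each sample an equal molar share of the multiplexed pool.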

Agilent—Exome

Libraries are prepared for Exome Sequencing using the SureSelect XT Target Enrichment System for Illumina Paired-End Sequencing Library (Agilent). Prior to library preparation, DNA samples are quantitated using PicoGreen (Invitrogen) and assessed for integrity via agarose gel electrophoresis. DNA samples that are demonstrated to be intact and for which at least 3000 ng is available will proceed to library preparation.

Library preparation begins with fragmentation of the genomic DNA to a peak size of 185 bp using a Covaris E210 and snap cap microTUBEs (Covaris). Following fragment purification using AMPure XP beads (Beckman Coulter), fragment ends are repaired using a cocktail of T4 DNA Polymerase, Klenow fragment DNA polymerase and T4 Polynucleotide Kinase. Following purification of the blunt-ended DNA fragments with AMPure XP beads, a single A residue is added to the 3’ ends using Exo(-) Klenow fragment DNA polymerase. The resulting molecules are purified using AMPure XP beads and indexed paired-end sequencing adapters are ligated to the ends, followed by AMPure XP purification. The resulting library undergoes an initial PCR amplification of 5 cycles, followed by AMPure XP purification. This product is then quantitated using PicoGreen, the concentration is normalized to 147 ng/µl, and the size range is determined using an Agilent 2100 BioAnalyzer. A 3.4 µl aliquot (500 ng) is then carried forward into hybridization using the appropriate biotinylated capture library. Hybridizations are allowed to proceed for 22-26 hrs. Hybrids are captured via the biotin moiety using Dynabeads MyOne Streptavidin T1 (Life Technologies). Multiple temperature-controlled washes are carried out at 65°C using SureSelect Wash Buffer #2 prior to eluting the captured DNA from the baits and purification with AMPure beads. The resulting captured DNA then undergoes a second PCR amplification of 11-16 cycles, depending upon the size of the capture library, followed by purification with AMPure beads. The final library products are quantified by qPCR (Kapa Biosystems), normalized to 2 nM concentration, and then mixed using equal volumes to create multiplex library pools in preparation for sequencing.

Small RNA

Libraries are prepared for Small RNA Sequencing using the TruSeq Small RNA Sample Prep Kit (Illumina). Prior to library preparation, RNA samples are quantitated by spectrophotometry using a Nanodrop ND-8000 spectrophotometer, and assessed for RNA integrity using an Agilent 2100 BioAnalyzer or Caliper LabChip GX. RNA samples with A260/A280 ratios ranging from 1.6-2.2, with RIN values ≥ 7.0, and for which at least 1000 ng of total RNA is available will proceed to library preparation. Total RNA samples must have been prepared using extraction chemistry that does not exclude small RNA species, for example, the QIAGEN miRNeasy kit.

Library preparation begins with 1000 ng of total RNA in 5 µl of nuclease-free water, to which is added an adapter oligonucleotide that is then ligated to the 3’ hydroxyl present on miRNA species using T4 RNA Ligase (New England Biolabs). Similarly, a different adapter sequence is ligated to the 5’ end of RNAs that possess a 5’ phosphate, to create a single-stranded molecule with defined sequences at both the 5’ and 3’ ends. This molecule is reverse-transcribed and amplified using 14 cycles of PCR using primers that include sequences complementary to the 5’ and 3’ adapter sequences, a specific index sequence, and Illumina sequencing adapter sequences. The resulting product is analyzed using an Agilent 2100 BioAnalyzer and the molar amount of mature miRNA present in the library is estimated by integrating the area under the curve in the 145-160 bp range. Individual libraries are mixed to create multiplexed pools, and the mixture is purified by gel electrophoresis, wherein the 145-160 bp range is excised from the gel, crushed using a Gel Breaker tube (IST Engineering), eluted into nuclease-free water, and concentrated by precipitation with ethanol. The concentration of the final library pool is determined using PicoGreen (Invitrogen) and the size distribution of the pool is determined using an Agilent 2100 BioAnalyzer. Library pools are normalized to 2 nM in preparation for sequencing.

Visit www.expressionanalysis.com for more information.

Expression Analysis, Inc.
4324 South Alston Avenue
Durham, NC 27713
P: 919-405-2248
F: 919-806-0219
EA © 2012