The Pennsylvania State University

The Graduate School

The Eberly College of Science

TISSUE SPECIFIC DE NOVO TRANSCRIPTOMICS IN THE PARASITIC

OROBANCHACEAE

A Dissertation in

Plant Biology

by

Loren Axel Honaas

© 2013 Loren Axel Honaas

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

December 2013

The dissertation of Loren Honaas was reviewed and approved* by the following:

Claude W. dePamphilis
Professor of Biology
Dissertation Advisor
Chair of Committee

Naomi S. Altman
Professor of Statistics and Bioinformatics

Dan Cosgrove
Professor and Holder of the Eberly Chair in Biology

Gabriele Monshausen
Assistant Professor of Biology

Dawn Luthe
Professor of Stress Biology

Teh-hui Kao
Distinguished Professor of Biochemistry & Molecular Biology
Chair, Intercollege Graduate Degree Program in Plant Biology

*Signatures are on file in the Graduate School

Abstract

The power to develop a comprehensive picture of any biological system lies with understanding the myriad processes underway in complex organs and tissues. Furthermore, we must develop such pictures in systems that are optimal rather than convenient. We must continue to strive towards single-cell resolution analyses, the ability to harvest these cells non-invasively, and classification and quantification of all biomolecules with minimal bias. To that end, the work presented here represents steps to help us explore the tissue and cell specific transcriptomes of select members of the Orobanchaceae, a plant family containing members that span the range of parasitic plant lifestyles, none of which have a sequenced genome. The first step was a detailed comparison of tools for de novo transcriptome assembly by referencing the 10th generation genome of the premier plant model system, Arabidopsis thaliana. The results of this comparison reveal that de novo assembly is reassuringly good and identify superior de novo assembly tools. Concurrently, a workflow was developed that allows high spatial resolution sampling of cryopreserved plant tissues followed by deep transcriptome sequencing with second generation sequencing technology. This workflow was applied to a generalist parasite member of the Orobanchaceae, Triphysaria versicolor, and was combined with a high performance transcriptome assembler to reveal that this generalist parasite expresses genes in a host specific manner at the host-parasite interface. The sample set was expanded to include two other economically important (i.e. weedy) members of Orobanchaceae, Striga and Orobanche (syn. Phelipanche). The comparison of these three members of Orobanchaceae, representing lifestyles with increasing dependence upon their hosts, revealed that the host-parasite interface in each case is enriched for genes of unknown function.
Excluding gene families detected in shoots, the interface transcriptomes share 11 parasite gene families and many members of these gene families are differentially expressed in tissues and life stages critical to parasitism. The differentially expressed genes that belong to these shared families are excellent candidates for focused studies that will shed light on the largely unknown molecular dialogue between parasitic plants and their host plants. These insights represent steps towards understanding the evolution of parasitism in plants and development of novel control strategies for the weedy Orobanchaceae. Finally, this work demonstrates the potential for discovery by leveraging cutting-edge technology to explore tissue specific transcriptomes in plants that lack a sequenced genome.

Table of Contents

List of Figures

List of Tables

Acknowledgements

Chapter 1 Introduction and Background

Part I The rapidly evolving field of transcriptomics

de novo transcriptomics

Tissue and cell specific transcriptomics

Part II The Orobanchaceae: Insights into plant biology and evolution

The weedy Orobanchaceae

The Haustorium and genes important for parasitism

Triphysaria is an important model parasite

Chapter 2 Genome Reference Based Evaluation of de novo Transcriptome Assembly

Introduction

Results

Discussion

Figures and Tables

Materials and Methods

Chapter 3 Functional Genomics of a Generalist Parasitic Plant: Laser Microdissection of host-parasite interface reveals host specific parasite gene expression

Introduction

Results

Discussion

Figures and Tables

Materials and Methods

Chapter 4 A Comparison of Laser Microdissected Host-Parasite Interface Transcriptomes of three parasitic members of the Orobanchaceae reveals shared interface genes

Introduction

Results

Discussion

Figures and Tables

Materials and Methods

Appendix A Comprehensive Assembly Statistics for the A. thaliana young leaf transcriptome - See Attached or https://drive.google.com/file/d/0B8QIACTZKU7JbnI5LXFiS2JpMzA/edit?usp=sharing

Appendix B Comprehensive Quality vs. Coverage Plots - See Attached or https://drive.google.com/file/d/0B8QIACTZKU7JM0RvaFdmZnB6dkk/edit?usp=sharing

Appendix C Detailed follow up analysis of the 6 poorly correlated qPCR candidates

Appendix D MEGAN analysis of putative Arabidopsis genes that do not align to TAIR10 cDNAs - See Attached or https://drive.google.com/file/d/0B8QIACTZKU7JZ2RBaUZVbC1TZkE/edit?usp=sharing

Appendix E OrthoMCL DB and InterProScan summary for unclassified unigenes - See Attached or https://drive.google.com/file/d/0B8QIACTZKU7JcEhxLWxHNE9sQjA/edit?usp=sharing

Appendix F BLASTx search of NR results for unigenes in Figure 3.8 - See Attached or 1: https://drive.google.com/file/d/0B8QIACTZKU7JZTY1Ml91OXJnV2M/edit?usp=sharing 2: https://drive.google.com/file/d/0B8QIACTZKU7JT1FBczc5TnNCdkE/edit?usp=sharing

Appendix G InterProScan results for unigenes in Figure 3.8 - See Attached or https://drive.google.com/file/d/0B8QIACTZKU7JZ2tYXzM1TzRyTWM/edit?usp=sharing

Works Cited

List of Figures

Figure 2.1 Read level coverage for each sequencing data set. Coverage of all representative TAIR10 cDNAs by reads. The darkest bar represents the number of the expressed gene set not tagged and each progressively lighter bar represents genes in a bin with a 5% increase in coverage with the two lightest bars showing the number of genes covered at >90% and >99%, respectively. 454 Norm = Normalized 454FLX Ti (Eurofins Operon), 454 BR12 = 454FLX combined biological replicates 1 and 2, BR12 Norm = Normalized Illumina library pooled biological replicates 1 and 2, BR12 = combined coverage of Illumina biological replicates 1 and 2, BR1 = Illumina biological replicate 1, BR2 = Illumina biological replicate 2.

Figure 2.2 SCERNA flowchart. SCERNA stands for Scaffolding and Error correction for de novo assemblies of RNA-Seq data. This collection of post-processing tools allows flexible implementation at various steps post assembly and with multiple assemblers and data types.

Figure 2.3 As minimum sequence length cutoffs are imposed, the assembly landscape becomes more even. The effect on the N50 of assembled sequence length and N50 of Mbp of assembled sequence resulting from sequence length cutoffs (imposed at 100-600bp) for the post-processed assemblies of Illumina biological replicate 1.

Figure 2.4 Summary diagram of assembly error types. Type I error reports cases of incomplete assemblies where a given transcript is not assembled into a single sequence (Case I = gap, Case II = insufficient overlap). Type I error can also consist of failure to bring contigs together (Case III) with sufficient overlap, presumably due to conflict. Type II error reports cases where portions of assembled sequences have good alignments to >1 TAIR10 cDNAs. Case I reports possible chimerism with Cases II-V reporting ambiguity in annotation.

Figure 2.5 The coverage of Arabidopsis cDNAs shows a subtle gradation of assembly completeness. Assembled sequences were aligned to TAIR10 cDNAs to determine coverage, which was expressed as the percent of cDNA bases covered by assembled sequence. The darkest bar is 0% or “No Hit” and each progressively lighter bar is a bin containing genes covered in 10% increments, with the last two bars representing the number of genes covered at >90% and >99%, respectively.

Figure 2.6 Post-processed assembly delta plot, showing the effect of post-processing in several assembly quality categories (X axis labels). The histogram shows the magnitude (% change) and direction of change in each category for the Mosaik-S, CLC-S and Trinity-ICB assemblies of Illumina biological replicate 1. The post processed values in each category are printed above the x axis for each assembly. A vertical line separates categories where an assembly improvement would result in a decrease in the respective measure (“Expect ∆<0” categories, left of line) or an increase in the respective measure (“Expect ∆>0” categories, right of line).

Figure 2.7 The quality of assembled sequences as a function of sequencing depth for Illumina biological replicate 1. The units for “Assembly Quality” are Normalized Bit Score (BS, maximum of 2) and the units of “Sequence Depth” are Sequenced Fragments/bp (SFB). The number printed in the plot area is the number of assembled sequences with normalized Bit Score above 1.5. A BS of 1.5 is an arbitrary threshold, yet represents long and accurate assemblies, and is used to illustrate the difference in the high-density region seen in most plots near BS 1.75-2.

Figure 2.8 Alignment comparison of CLC and Trinity (Inchworm) unigenes representing the AT1G31330.1 transcript. The sequence order from the top is gDNA (with a single intron – colored gray), cDNA, CDS and unigene(s). A) Alignment of AT1G31330 reference sequences and the Inchworm BR1 unigenes (x2) annotated as AT1G31330. B) Alignment of AT1G31330 reference sequences and the Trinity-ICB BR1 unigene sequence annotated as AT1G31330. C) Alignment of AT1G31330 reference sequences and the CLC BR1 unigenes (x623) annotated as AT1G31330. D) Alignment of AT1G31330 reference sequences and the CLC-S BR1 unigenes (x96) annotated as AT1G31330. For this highly expressed gene, Trinity is able to distill extensive variation into a single perfect unigene, whereas subsequences with minor differences (often single nucleotides) are maintained as numerous subsequences in CLC primary and post-processed assemblies including unigenes that seem to contain introns (middle portion of C & D). This alignment illustrates the trade-off of highly representative assemblies vs highly contiguous assemblies.

Figure 2.9 Normalization does not preferentially remove closely related gene pairs. Scatter plot of read counts to the expressed gene set of the Illumina biological replicates 1 & 2 and the normalized Illumina data set. The log2 read counts +1 (to avoid taking the log of zero) for each gene were calculated for the Illumina Normalized data set and the Combined (BR12) data set. The “detected gene set” is plotted in gray. The Ultra Conserved Orthologs (UCO, http://compgenomics.ucdavis.edu/) are plotted in green. The closely related gene set (CRG) is plotted in red.

Figure 2.10 Normalization improves the recovery rate of highly expressed genes. The units for “Assembly Quality” are Normalized Bit Score (BS) and the units of “Sequence Depth” are Sequenced Fragments/bp (SFB). The number printed in the plot area is the number of assembled sequences with normalized Bit Score above 1.5. A BS of 1.5 is an arbitrary threshold and is used to illustrate the differences in the high density region seen in most plots near BS 1.75-2.

Figure 2.11 A) The plot of Ks frequencies reveals that closely related genes in the “detected gene set” are not efficiently recovered in reference based or de novo transcriptome assembly. Gene pairs were identified by a reciprocal best BLAST hit. Gene number is on the y-axis, Ks value of pairs is on the x-axis. Equivalent best-fit model components are identified by similar color. “TAIR10” pairs were identified from the comprehensive cDNA collection. The “Detected cDNAs” pairs were identified from the detected gene set (at least one tag from any sequencing data set). The remaining plots are of pairs identified from the indicated assemblies (less unigenes <300bp).

Figure 2.11 B) The recovery of sufficiently expressed gene pairs (see A) follows a pronounced hit/no hit pattern with de novo assemblies outperforming the reference based assembly. Bit Score (BS) frequency histogram and assembly summary table for Expressed Gene Pairs (EGPs). 473 gene pairs present in the Ks plot of the “Expressed Genes” were absent in the Ks plot of the Mosaik assembly of BR1. For each assembly of BR1 the BS of each mate (946 genes) was plotted. Below the plot is a summary table of the fate of the 473 gene pairs absent in the Mosaik assembly relative to the “Expressed Genes” list.

Figure 2.12 Between gene pairs with expression sufficient for assembly of only one mate of the pair, the Ks value is generally higher than all pairs in the Ks analysis. The frequency of pairs with increasing Ks values was plotted, revealing that pairs with lower Ks values were more likely to have expression sufficient (BS >0.1) for assembly of both mates, yet pairs with higher Ks values were more likely to have one mate with reads insufficient for assembly. This may indicate that the expression pattern of closely related genes becomes divergent as the gene sequences also diverge.

Figure 2.13 Read titration analysis. Reads were mapped to post-processed assemblies of biological replicate 1 (BR1). The number of reads (x-axis) that map to an assembly is an indicator of assembly completeness and quality, while the incidence of new tags (y-axis) indicates how completely the assembly reflects the diversity present in the read data.

Figure 2.14 Ultra-conserved orthologs (UCO) coverage in assemblies of BR1. The darkest bar is 0% or “No Hit” and each progressively lighter bar is a bin containing genes covered in 10% increments, with the last two bars representing the number of genes covered at >90% and >99%, respectively. The use of UCOs as a proxy for the transcriptome assembly helps to reveal the leading assemblies when leveraged with the proportion of mappable reads and the read titration curve analysis. When all three are considered together, Trinity-ICB, which was the clear leader in our reference-based analyses, is also selected as the leader by the reference independent criteria.

Figure 2.15 Summary of the 12 candidates chosen for a follow-up analysis by qRT-PCR. The arrows on the plot show the candidates that were chosen for this analysis. Many of the poorly correlated features with high RNA-Seq signal relative to the array were organellar transcripts and were excluded. The “Probe–cDNA position” columns show where on the reference cDNA the microarray probes hybridized. Generally, the poorly correlated candidates also had a poorer probe set, which may also have contributed to the aberrant signal on the array.

Figure 2.16 Well correlated genes (see Figure 2.15), and genes with a robust probe set that were beyond the dynamic range of the array (see Figure 2.15), show excellent agreement by all estimates of gene expression. Fold difference in expression relative to AtActin (AT3G18780.1) was determined for candidates indicated. Those within the linear portion of the Array vs. RNA-Seq correlation with each method as appropriate (Figure 2.15). Well-correlated qRT-PCR candidates are indicated by solid black arrows (Figure 2.15). qRT-PCR candidates which extend beyond the range of the array but followed the linear trend are indicated by dashed black arrows (Figure 2.15).

Figure 2.17. Gene expression correlations. Array BR1 v BR2: correlation of background corrected, normalized array intensities for biological replicates 1 & 2. Illumina BR1 v BR2: correlation of log2 read counts (reads +1) from the non-normalized Illumina data sets generated from biological replicates 1 & 2. Array v Illumina: correlation of log2 read counts (reads +1) mapped at high stringency to the set of array probes and the average, background corrected, normalized array intensities from biological replicates 1 & 2. Mosaik v Illumina: correlation of average log2 read counts (reads +1) from biological replicates 1 & 2 mapped to the Mosaik-S assembly and the TAIR10 cDNA defined as the set of expressed genes. Trinity-S v Illumina: correlation of average log2 read counts (reads +1) from biological replicates 1 & 2 mapped to the Trinity-S assembly and the TAIR10 cDNA defined as the set of expressed genes. Trinity-S v Mosaik-S: correlation of log2 read counts (reads +1, less features that lack read counts) from biological replicate 1 mapped to the Trinity-S assembly and the Mosaik-S assembly. Pearson’s R is displayed in the upper left corner of each plot.

Figure 2.18. MetaGenome classification alignment stringency (via bitscore) plot shows that a threshold alignment quality of 175 is sufficient to exclude erroneous hits while still classifying plant genes. The frequency of non-assignment is minimally increased at alignment score ≥175. The increase of non-assignment from alignment scores of 125 to 175 is minimal yet the instance of hits to plant genes is also decreased from alignment scores of 125 to 175. Depending on the desired outcome, alignment scores >125 can be used with confidence to exclude erroneous classification while classifying more plant genes.

Figure 2.19 (MEGAN) metagenome analysis classification of unigenes that do not align to Arabidopsis TAIR10 cDNAs. The classification was determined for unigenes that aligned to sequences in NR with a bit score A) ≥125 and B) ≥175.

Figure 3.1 Laser Microdissected Haustorium. LPCM allows highly tissue- and cell-specific harvest after histological identification of tissues or cells of interest. A) Representative 25µm cross-section of T. versicolor haustorium on the host M. truncatula approximately 9 days post infestation, and prior to LPCM. The mature haustorium contains the xylem bridge that connects the parasite and host vasculature and is visible in the penetration peg. B) The same section after LPCM shows the cleared interface tissue from the user-defined region of interest (ROI). The flakes of tissue are catapulted by a photonic cloud resulting from pulses of laser light focused between the tissue and glass slide. Multiple pulses of laser light raster across the ROI causing tissue in the selected region to be catapulted and then captured in the adhesive coated cap of a 0.5mL tube held by a robotic arm in very close proximity (<0.5mm) to the upper surface of the section affixed to the slide.

Figure 3.2 Unigene Pairwise Nucleotide Identity Plot. Sequence identity between unigenes considered in this study and reference EST sets (PlantGDB public ESTs, http://www.plantgdb.org/) for the hosts Z. mays and M. truncatula. Triphysaria unigenes were aligned to the host reference to identify host contaminants and aligned to the reciprocal non-host reference sets to identify the incidental nucleotide pairwise identity. A whole plant normalized transcriptome assembly of Lindenbergia philippensis (a non-parasitic member of the Orobanchaceae) was used to determine the distribution of pairwise identity for a non-parasite to each host and to control for high unigene identity to host ESTs from potential cross contamination. A threshold of 95% was chosen to balance exclusion of host transcripts with retention of Triphysaria unigenes that had incident high identity to host ESTs.

Figure 3.3 Venn diagram summary of OrthoMCL DB and InterProScan (IPS) results. ESTScan ORF predictions from unigenes in each interface transcriptome that remained unclassified after extensive BLAST-based database searching were translated and submitted to OrthoMCL DB and InterProScan. The pattern is similar between unigenes from each transcriptome indicating equivalent unigene classification for T. versicolor grown on both hosts. The number of unigenes for which an ortholog or peptide motif was identified was relatively small, indicating our unigene classification using PlantTribes 2.0 and external database queries was robust. Approximately 25% of the known orthologs identified in the OrthoMCL database from each transcriptome are shared. A majority of the unigenes remain unknown, and these include many (~500 in each transcriptome) that are >300 bp and have read support.

Figure 3.4 Transcriptome Orthogroup Venn. Venn diagram showing the number of Orthogroups in the interface transcriptomes of T. versicolor with hosts Z. mays and M. truncatula and an above ground, autotrophically grown T. versicolor transcriptome (TrVeBC1) constructed from leaves, stems and inflorescences. Also shown are the numbers of host-derived Orthogroups. The lack of overlap between host and parasite transcriptomes does not imply lack of shared Orthogroups, but indicates the total number of host Orthogroups for a point of comparison.

Figure 3.5 GO Slim Category Summary. GO Slim category terms of unigenes in interface transcriptomes of T. versicolor and the above ground reference assembly of T. versicolor. Each series displays the average number of unigenes in equivalent transcriptome components with a given GO Slim term. For instance, “Interface Unique” indicates the average number of unigenes from interface unique components in both Medicago and Zea grown T. versicolor transcriptomes. Error bars are standard error of the mean. “Interface Unique” = unigenes from Orthogroups that are host and interface specific, “Interface Shared” = unigenes from Orthogroups that are interface specific and shared between interface transcriptomes, “Shared All” = unigenes from Orthogroups shared between both interface transcriptomes and the above ground transcriptome, “Interface/Above Ground Shared” = unigenes from Orthogroups that are shared between the above ground, autotrophic transcriptome and the host-specific interface transcriptome.

Figure 3.6 A GO Slim Function category analysis for the interface transcriptome of T. versicolor grown on Z. mays. Chi-Square test (P=<<0.0001) of GO Slim terms represented in the indicated regions (A-D) of the Venn. The numbers of unigenes in each GO category for regions A-D are indicated in the table. Cells with strongly positive residual values (>4) are indicated as bold+ and strongly negative residual values (<-4) are indicated as bold-.

Figure 3.6 B GO Slim Component category analysis for the interface transcriptome of T. versicolor grown on Z. mays. Chi-Square test (P=<<0.0001) of GO Slim terms represented in the indicated regions (A-D) of the Venn. The numbers of unigenes in each GO category for regions A-D are indicated in the table. Cells with strongly positive residual values (>4) are indicated as bold+ and strongly negative residual values (<-4) are indicated as bold-.

Figure 3.6 C GO Slim Process category analysis for the interface transcriptome of T. versicolor grown on Z. mays. Chi-Square test (P=<<0.0001) of GO Slim terms represented in the indicated regions (A-D) of the Venn. The numbers of unigenes in each GO category for regions A-D are indicated in the table. Cells with strongly positive residual values (>4) are indicated as bold+ and strongly negative residual values (<-4) are indicated as bold-.

Figure 3.6 D GO Slim Function category analysis for the interface transcriptome of T. versicolor grown on M. truncatula. Chi-Square test (P=<<0.0001) of GO Slim terms represented in the indicated regions (E-H) of the Venn. The numbers of unigenes in each GO category for regions E-H are indicated in the table. Cells with strongly positive residual values (>4) are indicated as bold+ and strongly negative residual values (<-4) are indicated as bold-.

Figure 3.6 E GO Slim Component category analysis for the interface transcriptome of T. versicolor grown on M. truncatula. Chi-Square test (P=<<0.0001) of GO Slim terms represented in the indicated regions (E-H) of the Venn. The numbers of unigenes in each GO category for regions E-H are indicated in the table. Cells with strongly positive residual values (>4) are indicated as bold+ and strongly negative residual values (<-4) are indicated as bold-.

Figure 3.6 F GO Slim Process category analysis for the interface transcriptome of T. versicolor grown on M. truncatula. Chi-Square test (P=<<0.0001) of GO Slim terms represented in the indicated regions (E-H) of the Venn. The numbers of unigenes in each GO category for regions E-H are indicated in the table. Cells with strongly positive residual values (>4) are indicated as bold+ and strongly negative residual values (<-4) are indicated as bold-.

Figure 3.7 Correlation of normalized read counts (RPKM) for unigenes in orthogroups shared between the interface transcriptomes and reference assembly TrVeBC1 (ppgp.huck.psu.edu). Reads from each interface transcriptome were mapped to a reference assembly (TrVeBC2, ppgp.huck.psu.edu) that included whole haustorium data from T. versicolor grown on M. truncatula. A subset of unigenes is more highly expressed in the interface transcriptome of T. versicolor grown on M. truncatula; a similar pattern is not observed for T. versicolor grown on Z. mays. This is due to a bias for Medicago grown Triphysaria unigenes in the reference dataset TrVeBC2, which was constructed with reads from Medicago grown Triphysaria. For unigenes in shared orthogroups, the RPKM values are highly correlated (Pearson’s R = 0.81) between interface transcriptomes indicating that technical and biological variation is low.

Figure 3.8 Highly Expressed Interface Unigenes. The 20 most highly expressed (RPKM) unigenes (ID) in each indicated portion of the transcriptome Venn diagram for the interaction of T. versicolor with each host species. NR BLASTx – description, species and %id.: the description, species of origin, and percent pairwise identity, respectively, of the best unigene alignment (<1e-10) resulting from the NR database query, 17 genomes BLAST and %id.: best hit species in a BLAST database of 17 annotated plant genomes with the percent pairwise identity in the nucleotide BLAST (N) or translated nucleotide BLAST (P). TXXX = IPS transmembrane prediction, SXXX = IPS secretion signal prediction.

Figure 3.8 A Top expressed unigenes from shared orthogroups between parasite-host interface transcriptomes of T. versicolor.

Figure 3.8 B Top expressed unigenes from orthogroups unique to interface transcriptomes of T. versicolor.

Figure 3.8 C Top expressed unigenes from host-specific yet above ground shared orthogroups of T. versicolor transcriptomes.

Figure 3.9 RaxML analysis of A: Triphysaria beta expansin gene TvEXPB1 (TrVeIntZeamaGB1_772*, green text), and B: alpha expansin gene TvEXPA4 (TrVeIntMedtrGB1_11*, green text). Bootstrap proportions are given above each node. Taxon abbreviations for A: Arabidopsis thaliana (AT), Oryza sativa (Os), Mimulus guttatus (Mg), Triphysaria versicolor (TrVe), Striga hermonthica (StHe), Phelipanche (=Orobanche) aegyptiaca (OrAe), Selaginella moellendorffii (Smoellendorffii). Taxon abbreviations for B: Oryza sativa (Os), Sorghum bicolor (Sb), Striga hermonthica (StHe), Phelipanche (=Orobanche) aegyptiaca (OrAe), Triphysaria versicolor (TrVe), Carica papaya (Carpa), Populus trichocarpa (Poptr), Medicago truncatula (Medtr), Vitis vinifera (Vitvi), Arabidopsis thaliana (AT), Selaginella moellendorffii (Selmo), Physcomitrella patens (Phypa).

Figure 3.10 Differential Expansin Expression. qRT-PCR analysis of TvEXPA4 and TvEXPB1 expression relative to TvActin in parasite-host interface cells harvested by LPCM from the haustoria of T. versicolor. *P<0.05

Figure 4.1 LPCM dissection strategy. Preparative cryo-sections of A) T. versicolor on Medicago, B) S. hermonthica on S. bicolor and C) O. aegyptiaca on A. thaliana. The sampling strategy (white outlined area = ROI) was to capture tissue enriched for intact parasite interface cells, thus a portion of the host interface was captured with the parasite tissue. T. versicolor haustoria were oriented individually with the host root on the vertical axis (haustorium cross section) since immediately adjacent connections were rare due to the co-culture strategy where ~20 parasites were placed in a plate with 1-3 host plants. The smaller dimensions of S. hermonthica and O. aegyptiaca made orientation of individual haustoria impractical. Additionally, many more (>100 germinating parasites) were “painted” onto host roots allowing multiple attachments to occur in a small space. Thus host roots with multiple parasites were harvested and oriented on the horizontal axis in the cryo-mold resulting in longitudinal haustorium sections.

Figure 4.2 MEGAN analysis of the interface transcriptomes of T. versicolor, S. hermonthica and O. aegyptiaca. Unigenes that were not classified as parasite or host based upon BLAST searches of the PPGP BigBuilds or host cDNA and EST sequences were classified by the best hit in NR with an e-value <1e-5 and an alignment bit score of 125 or better. A) The axenic co-culture system used for T. versicolor produced few unigenes attributable to non-plant origins whereas the bag co-culture system used for S. hermonthica and O. aegyptiaca produced unigenes that were assigned to numerous bacterial and fungal origins. The Viridiplantae clade is exploded for detail. B) Of the few non-plant unigenes detected in T. versicolor interface transcriptomes only the genus Burkholderia was strongly represented in both (arrow). The genus Burkholderia is represented with similar frequency in the interface transcriptomes of S. hermonthica and O. aegyptiaca as well (arrow). The Bacteria clade is exploded for detail.

Figure 4.3 Ultra Conserved Orthologs tagged in interface transcriptome assemblies. From each interface assembly the parasite UCOs were determined by best BLAST hit (BLASTx e-value <1e-20, >45 residues in the HSP) to Vitis genes in the UCO orthos and are reported as a percentage of Vitis UCOs (of 404). The frequency of unigene lengths is plotted in orange for each set of parasite UCOs compared to the whole assembly unigene length distribution.

Figure 4.4 Orthogroup classification Venn of Interface and Stage 6.1 parasite transcriptomes. The distinct and overlapping orthogroups represented in the “Putative parasite with PlantTribes Ortho” unigene class and filtered Stage 6.1 unigenes are shown.

Figure 4.5 The interface specific Orthogroups of T. versicolor, S. hermonthica and O. aegyptiaca. A) Venn diagram showing distinct and overlapping orthogroups from interface specific orthogroups in each interface transcriptome. B) Summary table of best descriptions for representatives of the 12 shared interface Orthogroups. tf = transcription factor; * = differentially expressed in the haustorium and/or interface in 2 of the 3 species.

Figure 4.6 Protein alignments of representatives of the 12 interface shared orthogroups. Interface unigenes from each interface transcriptome that were assigned to the 12 shared orthos were translated and aligned. The sequence order is the same in each alignment: O. aegyptiaca, T. versicolor and S. hermonthica. Agreements to the consensus sequence are highlighted. Interface unigenes were used to search the respective BigBuilds for unigenes with high nucleotide pairwise identity. Where alignments were not full length, stage specific PPGP assemblies were searched to find unigenes with high nucleotide pairwise identity to unigenes used to populate the alignments. Where a single unigene did not contain the longest open reading frame in a given alignment, consensus sequences were extracted, translated and aligned with unigenes/consensus sequences from each species. Orthogroups are listed at left. *Differentially expressed in 2 of 3 species.

Figure 4.7 Nucleotide alignment of a collection of T. versicolor unigenes with good alignments to interface unigenes in ortho 6571. A) The T. versicolor BigBuild sequences (prefix TrVeBC3) populating this alignment were identified by a BLASTn search using interface sequences (prefix TrVeInt). By classifying sequences based upon pairwise nucleotide identity, 2 distinct groups arise consisting of sequences 1-5 and 6-10 with 11 being classified as a subsequence of 10 (see B). Disagreements to the consensus are highlighted. B) Protein alignment of sequences 10 and 11 in “A”.

Figure 4.8 Stage Specific expression patterns of representative genes in the 12 orthos for each interface transcriptome. White bars = differential expression in at least 2 of the 3 species. Red outlined bars = DE in the species shown. A = T. versicolor plot with Medicago grown interface; B = T. versicolor plot with Zea grown interface; C = S. hermonthica; D = O. aegyptiaca; * = 500-1500 FPKM; ** = >1500 FPKM; “hetero” = cultured with host; “auto” = cultured without host. FPKM is on the y-axis, stages on the x-axis and orthos on the z-axis.

List of Tables

Table 2.1 Summary of assembly software.

Table 2.2 Sequencing and alignment statistics. The sequencing and alignment statistics for the normalized and non-normalized libraries sequenced for this study.

Table 2.3 Summary of assembly quality metrics.

Table 2.4 List of suffixes and abbreviations.

Table 2.5 Assembled sequence and RNA-Seq statistics for assemblies of Illumina biological replicate 1 (BR1). Unigenes shorter than 100bp were removed. N-content refers to the number of bases in assembled sequences that were ambiguous. Reads were mapped to the respective assemblies and percentages were calculated by dividing by the total number of reads from BR1 that mapped to TAIR10 cDNAs. RNA-Seq correlations used respective Mosaik assemblies as a reference. SCC = Spearman’s rank correlation coefficient and PCC = Pearson’s correlation coefficient. *includes gaps in cDNA reference coverage.

Table 3.1 Read and assembly level statistics for T. versicolor interface transcriptomes. Low quality reads were filtered before assembly, and host sequences were filtered both before and after assembly. Unigenes remaining after removal of host plant and non-plant sequences were aligned with BLASTx to sequences detected in any other PPGP transcriptome library of Triphysaria versicolor (http://ppgp.huck.psu.edu/). Unigenes with less than 95% pairwise identity to either host or to other Triphysaria libraries were sorted further if a BLASTx search of the NR (www.ncbi.nlm.nih.gov) database yielded alignments of 1e-10 or stronger. The remaining unclassified unigenes were submitted to OrthoMCL DB and InterProScan. Unigenes that remained unclassified after the final screen are called “no hit” unigenes.

Table 4.1 Summary of sequencing, assembly and classification statistics for interface transcriptomes. The parasitic plant and host plant pair are indicated in the top two rows. Read level statistics were calculated after quality filtering, to remove low quality reads, and quality trimming, to remove low quality bases from reads that passed the quality filter. Fields in italics are not mutually exclusive, but subsets of the field below – either “Putative parasite” list or “Unclassified”. The “Putative parasite with PlantTribes Ortho” field contains the unigene count for parasite unigenes considered in the Orthogroup level comparisons. BLAST based classifications (PPGP Parasite reference BigBuild, Host cDNA and Host EST) were done according to a 95% pairwise identity threshold. “Unclassified” refers to unigenes that cannot be classified by BLAST to the respective Parasitic Plant Genome Project Big Build, Host cDNA or EST, or a non-plant source by either BLASTx to NR or query of OrthoMCL db and are not assigned to a Plant Tribes Orthogroup.

Table 4.2 GO Slim Chi-Square test results of equivalent transcriptome components. Residuals <4 are represented by “-” and >4 by “+”. O = O. aegyptiaca, S = S. hermonthica, TM = Medicago grown T. versicolor and TZ = Zea grown T. versicolor. Shoots = Stage 6.1 unique PlantTribes Orthogroups, Shared = Stage 6.1 and Interface shared PlantTribes Orthogroups and Interface = Interface unique PlantTribes Orthogroups. Each panel consists of 3 tests separated by a darker line. For instance, the top left panel section contains results from a test on GO Slim Function categories in Shoot specific orthos for each of the 4 interface transcriptomes.

Table 4.3 GO Slim Chi-Square test between overlapping and distinct orthos in each interface transcriptome. Residuals <4 are represented by “-” and >4 by “+”. Red symbols represent O. aegyptiaca (OrAe) test results. Orange symbols represent S. hermonthica (StHe) test results. Green symbols represent M. truncatula grown T. versicolor (TrVeMed) test results. Teal symbols represent Z. mays grown T. versicolor (TrVeZea) test results. Shoots = Stage 6.1 unique PlantTribes Orthogroups, Shared = Stage 6.1 and interface shared PlantTribes Orthogroups and Interface = Interface unique PlantTribes Orthogroups.

Acknowledgements

I would like to thank Dr. Claude dePamphilis for excellent mentorship, support and feedback during the undertaking of this work. I would like to thank Dr. Teh-hui Kao for his enduring support and for exemplifying the role of an academic mentor. I would like to thank members of my committee and also members of the dePamphilis lab past and present. Those with contributions to specific work described here are listed below. Other lab mates to whom I owe thanks are Dave Arginteau, Apjeet, Laura Warg, and Lena Landherr. I would also like to thank my parents for unwavering and unconditional support. I am so very thankful, and lucky, to say that my supportive friends and family are too numerous to mention.

Individual contributions:

Chapter 1: Loren Honaas, Claude W. dePamphilis

Chapter 2: Loren A. Honaas*, Eric Wafula*, Joshua P. Der, Norman J. Wickett, Yeting Zhang, Zhenzhen Yang, Naomi Altman, Pat Edgar, Chris Pires, Dick McCombie, Claude W. dePamphilis

Chapter 3: Loren A. Honaas, Eric Wafula, Zhenzhen Yang, Joshua P. Der, Norman J. Wickett, Naomi S. Altman, Christopher G. Taylor, John I. Yoder, Michael P. Timko, James H. Westwood, Claude W. dePamphilis

Chapter 4: Loren A. Honaas, Eric Wafula, Zhenzhen Yang, John I. Yoder, Michael P. Timko, James H. Westwood, Claude W. dePamphilis

*These authors contributed equally to this work

Chapter 1

Part I

The rapidly evolving field of transcriptomics

The most cost effective approach for initial examination of the gene space (the parts of a genome that encode genes) of non-model systems is typically an assessment of an expressed set of genes by sequencing Expressed Sequence Tags (ESTs). Massively parallel next-generation sequencing technologies (NGS syn. SGS - second generation sequencing technology) have substantially reduced the cost of generating ESTs, driving a rapid expansion of sequencing data resources1. The Sequence Read Archive (SRA) at the National Center for Biotechnology Information (NCBI) is growing at an exponential rate and at the end of 2012 the total number of bases exceeded 1 x 10^15, with nearly 40% of those bases openly accessible2. Though the majority of sequence data is dedicated to exploring genomes, the desire to explore transcriptomes is evidenced by the Transcriptome Shotgun Assembly division of GenBank showing the largest annual growth, with a 370% increase in 2011 (ref. 3).

Historically, genomic and bioinformatic resources have been dominated with sequences from just a few model organisms. The power of these substantial informatics resources is indisputable, yet prohibitive cost has made such resources unattainable for non-model organisms.

The recent revolution in massively parallel sequencing technology4 has substantially reduced the cost per nucleotide base5. For instance, at the Huck Institutes of the Life Sciences Genomics Core Facility6 the price advantage of Illumina over Sanger sequencing is >10,000 fold. Initially the field of genomics was slow to embrace such revolutionary technologies7, likely due in part to key differences in the data structure, including shorter reads (25-800 bp) and an incomplete understanding of the various error structures in each SGS technology8. Both of these attributes present hurdles to de novo assembly, thus specific recommendations of 454 over short-read technologies for de novo assembly have been made9-15. Yet the affordability of short(er) read technologies offers greater power (per dollar spent) to generate unique sequence reads and also to estimate coverage depth that can inform relative sequence abundance or, in the case of RNA-Seq, gene expression.

RNA-Seq is the interrogation of a transcriptome by sequencing double stranded cDNAs derived from single stranded RNAs (i.e. ESTs). These RNAs potentially include protein-coding transcripts that are poly-adenylated (poly-A), many classes of small RNAs, ribosomal RNAs and antisense transcripts, though typically poly-A enriched RNA samples are targeted for RNA-Seq.

SGS technology allows RNA-Seq16 on a scale that offers the researcher a nearly comprehensive picture of the transcriptome landscape at a single lab budget level5.

The enticing prospect of using RNA-Seq as a tool for gene space exploration is accompanied by a desire to use individual sequencing reads to estimate gene expression, though this type of analysis requires a high quality mapping reference. While the reference is typically a genome it is not a requirement per se, though a set of sequences to which reads can be mapped is required for RNA-Seq. To that end a de novo assembled transcriptome can serve as a reference.

Additionally, de novo assembly has the potential to reveal novel components of the transcriptome (i.e. new genes and variants) whose relative abundance can then be assessed via RNA-Seq. This potential for discovery is a key advantage over the microarray, where analog intensity data are generated when labeled probes hybridize to known sequences at discrete locations on the array, compared to the ab initio, digital detection of sequences by SGS sequencing. Digital read data can also be reanalyzed and reinterpreted; unlike the microarray, the read data do not rely on a “fixed” annotation. As assembly and data analysis tools are developed and refined, the rapidly growing catalog of SGS read data can be leveraged over and over again to address new and persistent questions. Indeed the desire to leverage affordable transcriptome sequencing to explore the genomes of plants is clear with initiatives like the National Science Foundation’s Assembling the Tree of Life17 (AToL) and Beijing Genomics Institute’s 1000 plants initiative (1KP)18.

de novo transcriptomics

With the ability to capture a complete catalog of expressed genes and then determine their relative abundance using RNA-Seq, de novo transcriptome assembly remains as a substantial hurdle to the characterization of organisms that lack a sequenced genome. Early tools for de novo transcriptome assembly, like CAP319, were simple overlap layout methods8 with more sophisticated overlap layout consensus methods like Newbler20 and Mira21 being implemented for assembly of 454 data generated from non-model plants22-27. These methods relied primarily on joining reads together whose ends overlapped with a user-defined threshold pairwise identity and threshold alignment length. More recently, a departure from overlap layout consensus has occurred with the implementation of de Bruijn graphs28 in which sub-sequences of each sequencing read, called k-mers, are used to construct de Bruijn graphs representing transcripts29.

The paths through the graph are then traversed to produce contiguous sequences (i.e. contigs).

The de Bruijn assemblers were originally developed for genome, not transcriptome, assembly8 though they have been adapted and implemented for plant transcriptome assembly and include Velvet30,31, Oases32,33 (includes Velvet34), SOAPdenovo35-45, CLC46, ABySS47, Trinity48-53 plus hybrid approaches that leverage 454 and Illumina data with various overlap layout consensus and de Bruijn assemblers54.
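The k-mer decomposition and graph traversal described above can be illustrated with a minimal sketch. It is not the algorithm used by any of the assemblers named here: the function names, the toy reads and the choice of k = 4 are assumptions made for illustration only, and real assemblers additionally handle sequencing errors, reverse complements, coverage and branching paths.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every k-length sub-sequence (k-mer) of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous_path(graph, start):
    """Extend a contig from `start` while the current node has exactly one
    successor (a simplification: predecessors are not checked)."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Three hypothetical overlapping reads from one transcript.
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous_path(graph, "ATG")
print(contig)
```

In this toy case every node has a single successor, so one path through the graph reconstructs the full sequence spanned by the three reads. Branch points, which arise from splice variants, paralogs or errors, are where the assemblers differ in their heuristics.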

Despite numerous published examples of de novo assembly of plant transcriptomes, the community has not established a standard set of quality metrics8. Typically the length statistics for assembled sequences are presented revealing that the number of unigenes (singletons, contigs and scaffolds – collectively “unigenes”) is greater than the expected number of transcripts and that total length of the assembly differs from the expected length of the transcriptome.

Annotation efforts are largely aimed at reporting the number of unigenes with hits in an external database such as NCBI’s Non-redundant protein sequences database2 (NR), though the BLASTx parameters are not standardized and range from e-value thresholds of 1e-5 to 1e-10, with some studies failing to report BLASTx parameters55 and most choosing an e-value of 1e-5 for BLASTx searches. Furthermore, other parameters, like percent pairwise identity or alignment bitscore, are not typically considered in the assembly classification. Searches of other databases like the Kyoto Encyclopedia of Genes and Genomes56 (KEGG), Swiss-Prot57, Clusters of Orthologous Groups of proteins2 (COG) and The Gene Ontology58 (GO) are frequently used for functional unigene annotation. While it is clear that there is a consensus for what to do with an assembly once it has been generated, steps to evaluate the output are typically limited to reporting the N50, which provides the unigene length at which the cumulative assembled base pairs, summed from the longest unigene to the shortest, reach 50% of the total assembly length; other length statistics, like the number of genes greater than length x; or the proportion of sequencing reads that map back to an assembly. While these can be informative statistics, they fall short of adequately informing quality. For instance, a high rate of mis-assembly resulting in chimeras can inflate the N50 and length statistics, while assembly of primarily highly-expressed genes can result in assemblies that garner a large proportion of the sequencing reads while failing to assemble a large fraction of moderate to lowly expressed genes (i.e. an inefficient assembly). Other efforts involve reporting the relative frequency of hits to expected sequences (e.g. conserved plant sequences) in external databases, though by itself this approach can be rather arbitrary and fails to inform how well the assembly represents the data and thus the transcriptome, especially when the search parameters are not consistent.
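The N50 statistic, and its sensitivity to chimeric mis-assembly, can be made concrete with a short sketch. The unigene lengths below are hypothetical values chosen for illustration, and the function assumes the common convention of summing lengths from the longest unigene downward.

```python
def n50(lengths):
    """Return the length L such that unigenes of length >= L together
    contain at least half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one unigene length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical assembly of five unigenes (lengths in bp).
correct = [2000, 1500, 1000, 800, 700]

# The same bases, but the two longest unigenes are wrongly joined
# into a single 3500 bp chimera.
chimeric = [3500, 1000, 800, 700]

print(n50(correct), n50(chimeric))
```

Both assemblies contain the same 6000 bp, yet the chimeric assembly reports the higher N50, which is why N50 on its own is a weak proxy for assembly quality.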

Thus, the sequencing platforms and assembly tools that are optimal to address specific questions remain unknown. Standardized metrics, including reference-dependent and reference-independent, that serve as a proxy for assembly quality have been proposed15,59. Efforts to explore the parameters of single tools60 and use a reference as a quality standard10,61,62 have been undertaken. There is agreement that the ideal context for evaluation of a de novo assembly is via comparison to a high quality reference genome. The ideal experimental design begins with careful preparation of high quality, replicated RNA samples from a model plant with a “finished” or chromosome-level genome assembly, like Arabidopsis. The next step is the generation of a transcriptome dataset with leading sequencing technologies that is representative of a typical non-model plant transcriptome. Finally, de novo assembly with several leading de novo assemblers using standardized parameters, comparisons and quality metrics should be undertaken to determine which de novo assembler is the best choice for various research goals.

Tissue and cell specific transcriptomics

As the plant community strives to embrace and utilize SGS to explore non-model plant genomes, the need for higher spatio-temporal resolution sampling is evident. The precipitous drop in the cost of sequencing has opened the door to larger numbers of individuals, treatments and tissues for large-scale comparisons. The RNA-Seq Atlas of Glycine max63 used fine scale tissue sampling with an emphasis on seed developmental stages to reveal patterns of tissue specific expression of legume specific genes and identification of >177 genes that likely have a role in the seed filling process. Though this study relied on the G. max genome to map sequencing reads, the increased dynamic range and sensitivity of RNA-Seq compared to microarray64 provided an unprecedented view of the transcriptome of G. max.

Emphasis has been placed on the importance of high resolution, high throughput investigations of gene expression in a tissue and cell specific manner as well as the need to survey gene expression in a global manner65-67. Methods to harvest single cells using micro-capillaries have been used68,69 but success of downstream manipulation of the RNA is limited70. The use of cell specific promoter-marker fusions enhanced this technique69 yet the availability of cell specific promoters is limited. Preparation of protoplasts can also facilitate the isolation of cells of interest from complex tissues, but changes in gene expression during protoplast preparation71 have been observed. The labor intensive nature and contingencies (such as amenability to protoplasting and cell specific promoters) of these techniques make global, high throughput estimation of gene expression patterns difficult if not impossible. More recently, a method named TRAP-Seq72 was developed that aims to enrich RNA samples for transcripts associated with polysomes that contain tagged ribosomal proteins under the control of cell specific promoters. The review by Brandt65 describes tissue and cell specific sampling excluding discussion of Laser Assisted Microdissection (LAM) and the review by Day et al.66 is an excellent report on then available technologies leveraging LAM and Laser Capture Microdissection (LCM, though collectively hereafter referred to as LM). A recent review73 emphasizes the technical challenges and potential rewards of cell specific sampling to undertake systems level analyses in plants. These reviews highlight the relative strengths and weaknesses of the available technologies and variations thereof, with LM emerging as a powerful and desirable tool for high spatio-temporal resolution genomics. In fact, a recent survey using “laser (capture) microdissection” and “plant” retrieved 173 records since 2003 with approximately 80% of those records published since 2006 (ISI Web Of Knowledge).

Part II

The Orobanchaceae: Insights into plant biology and evolution

Approximately 1% of all extant angiosperm species are parasitic, deriving all or part of the water and nutrients from host plant species using specialized feeding structures known as haustoria. Among families containing parasites, only the Orobanchaceae contain species representing the full spectrum of parasitism from potentially free-living facultative forms to non-photosynthetic, obligate parasites74. Lindenbergia is a non-parasitic lineage of Orobanchaceae sister to all parasitic species75; therefore the family represents an ideal comparative framework to study the evolution of parasitism.

The weedy Orobanchaceae

Parasitic Orobanchaceae growing in Africa and the Mediterranean include the devastating agricultural pests witchweed (Striga), broomrape (Orobanche and Phelipanche), and Alectra76.

The Striga infestation covers 123.5 million acres resulting in annual yield losses greater than US$7 billion77,78. Broomrapes threaten nearly 40 million acres, though yield losses are difficult to assess due to the frequent abandonment of infested fields and unreliable data on yield loss79.

Striga is one of the primary biotic constraints to agriculture in Sub-Saharan Africa and the affected areas are increasing in size80. The weedy, parasitic Orobanchaceae also threaten parts of Asia, Europe, and North America81.

The Parasitic Plant Genome Project (PPGP) aims to discover the genetic changes in Orobanchaceae that allowed, and resulted from, the transition from autotrophy to heterotrophy by transcriptome analysis of various tissue and life stage-specific samples. The PPGP builds upon a rich body of work that has uncovered complex and varied interactions between parasitic Orobanchaceae and their hosts. This complex relationship presents a daunting hurdle when navigated by focused, single gene and small to medium scale interrogations. The need for comprehensive genomics resources for the weedy Orobanchaceae is becoming ever more evident in the face of control strategies that still allow losses of billions of US dollars each year79. Therefore, the means to generate comprehensive resources for non-sequenced organisms are a prerequisite for the integration of information that will lead to long-term solutions. In addition to the rich EST resource developed in the PPGP, the workflow, tools and general methodologies being developed in the context of the PPGP will allow comprehensive views into the biology of organisms that are optimal (i.e. parasites) rather than convenient (i.e. hosts with sequenced genomes) in efforts towards control.

The desire to elucidate global gene expression profiles of crop hosts interacting with weedy Orobanchaceae is evident in recent microarray studies. Swarbrick et al.82 profiled resistant and susceptible cultivars of rice challenged with Striga, with a focus on resistance QTLs, identifying genes potentially relevant to control strategies, including genes of unknown function.

Dita et al.83 analyzed gene expression in two accessions of Medicago displaying different modes of resistance, with a focus on temporal patterns of expression, to reveal hundreds of differentially regulated genes. Smaller scale approaches like Suppression-Subtractive Hybridization (SSH) have been employed in non-sequenced hosts challenged with weedy Orobanchaceae to identify differentially expressed host genes84-86. However, the means to investigate host-parasite interactions in a meaningful context are limited to technologies that are low throughput compared to NGS, or to systems with developed genomics resources. The groundwork for development of powerful comparative frameworks has been laid with years invested in carefully cataloging and describing interactions of crop hosts and the weedy Orobanchaceae with special attention to modes and degrees of host resistance (see 79,87-94 and references therein). Leveraging SGS in concert with a well-developed catalogue of host-parasite interactions and tools for de novo assembly and analysis holds the key to the establishment of comprehensive resources that comprise the foundation for long-term control strategies. Such a resource would allow identification of genes central to mechanisms spanning from host resistance to parasite virulence, thus presenting the international parasitic plant research community with a robust catalogue of gene candidates for control strategies.

The haustorium and genes important for parasitism

The parasitic plant life cycle contains two key semagenesis95 events (sema “sign” or “signal” and genesis “origins”, i.e. signal creation) that represent targets for control: xenognosis96 (xenos “host” and gignoskein “to recognize”, i.e. host-recognition), which leads to germination, and haustoriogenesis96,97 (i.e. formation of the parasitic attachment organ called the haustorium).

Strigol was described in 1972 (ref. 98) as a component of cotton root exudates that strongly induced germination of Striga lutea seeds. Subsequently, strigolactones, including strigol, were identified as branching factors for mycorrhizal fungi99 and then (simultaneously100,101) as a new class of plant hormones with branching inhibition activity. The discovery of strigol and related compounds that serve as germination stimulants as well as their role in other important plant processes has led to a boom in strigolactone research102. The ephemeral signals in the rhizosphere that serve as haustorium inducing factors (HIFs) have received less attention, yet several allelopathic compounds have been identified as HIFs103-106. The use of functional analogs of strigol (GR24) and the commercially available HIF 2,6-dimethoxy-p-benzoquinone (DMBQ) as positive controls for germination and haustoriogenesis, respectively, has provided researchers with powerful tools to investigate these two important signaling events in parasitic members of the Orobanchaceae. The recent expansion in strigolactone research has included massive surveys of the germination-inducing activity of myriad variants in several parasitic species and is beyond the scope of this work. For an excellent review see Xie et al.102.

Haustoriogenesis represents an important developmental landmark in the lifecycle of the parasitic Orobanchaceae. For obligate parasites like Striga this process is critical because the diminutive seed provides only limited support (e.g. 5 days in the case of Striga hermonthica) for autotrophic growth without host attachment107. Obligate parasites in the Orobanchaceae are very fecund and are predominantly r-strategists, with seed production from individual Orobanche and Striga reaching into the tens of thousands and seed bank survival measured in decades107. This fecundity makes control strategies aimed at the germination signaling event quite daunting. Additionally, the realization that the germination stimulants are important plant hormones and central to the molecular dialogue with beneficial mycorrhizal fungi102 adds serious contingencies to breeding strategies aimed at lower strigolactone production. The signaling event underpinning haustoriogenesis is also a difficult process to target since semagenesis involves host derived essential cell wall phenolics96. Indeed efforts to identify genes involved in these signaling events are underway, yet the mechanisms of host perception and the details of the host-parasite molecular dialogue remain largely unknown92,102.

The haustorium, presumed to be a modified root structure108, is adapted to enable feeding on the host plant and unique to parasitic plants; thus it is a focal point for interactions between the parasite and host108. Heide-Jorgensen and Kuijt109,110 showed that the haustorium of T. versicolor contains many specialized cells including haustorial hairs, a xylem bridge between the host and parasite, and transfer-like cells adjacent to vessel elements at the host-parasite interface. Although histological evidence for xylem connectivity between the haustorium of T. versicolor and its host is well documented109,110, there is no evidence for phloem connectivity. However, there is evidence that phloem-mobile virus particles move between host and parasite in the holoparasite Phelipanche (syn. Orobanche) ramosa111 and phloem continuity has been observed in Orobanche crenata112. The mechanisms of transport between host phloem and parasite phloem likely vary in different parasites from direct phloem connections112 to transport via apoplastic pathways. Bi-directional movement of small RNAs between host and parasite has been documented in T. versicolor attacking transgenic lettuce113. The anatomy of the haustorial interface cells and empirical evidence for bi-directional transport point to the host-parasite interface as an epicenter of host-parasite dialogue.

Triphysaria is an important model parasite

Motivated by the agronomic threat presented by some parasitic Orobanchaceae, Triphysaria versicolor has been developed as a model parasitic plant for the family. Native to western North America114, its broad host range includes 30 species in 17 families of monocot and eudicot host plants and the only plants known to be resistant to attack are other Triphysaria115-117. The investigation of parasite biology, and specifically the work to characterize haustorium initiation and development and host-parasite interactions, has resulted in a number of important insights into parasitic plant biology. This work includes description of species specific host recognition in T. versicolor118, examination of flavonoids as haustorium inducing factors (HIFs)106, demonstration of heritable variation in HIF sensitivity119, examination of self-compatibility in different species of Triphysaria120, description of hormone fluxes in haustorium development121 and multiple studies surveying gene expression in Triphysaria that identified asparagine synthetase122, expansins49,123,124 and the quinone oxidoreductases TvQR1 and TvQR2 (refs. 114,125) as host responsive genes in T. versicolor. Torres et al.126 report the establishment of an EST database, Pscroph, that contains over 9000 Suppressive Subtractive Hybridization (SSH) Sanger ESTs in T. versicolor aimed at identifying transcripts that are differentially regulated in response to DMBQ. Additionally the Yoder lab has demonstrated that root cultures of T. versicolor maintain the ability to form haustoria127, that transformed T. versicolor expressed a scorable marker and maintained parasitic competence128 and that hair-pin (HP) silencing constructs expressed in a host plant were able to silence transgenes in the parasite113. Recently Bandaranayake et al.129 have shown that the quinone oxidoreductase TvQR1 is necessary for haustorium development in T. versicolor; it is thought to encode an enzyme that generates a semiquinone intermediate that serves as an early signal for haustoriogenesis. As a transformable128 and tractable facultative generalist parasite, T. versicolor represents an excellent species to investigate the evolution of parasitism, haustorium development, plant-plant communication, host-parasite interactions, and many other facets of parasite biology118,130.

Chapter 2

Genome Reference Based Evaluation of de novo Transcriptome Assembly

Here we describe an in-depth comparison of multiple de novo assemblies of the Arabidopsis thaliana leaf transcriptome using the 10th generation Arabidopsis genome as a reference. Analysis of the transcriptome in light of the known genome offers the best possible estimate of gene content and structure that can be made, given a particular transcriptome sequence data set. Our analyses include sophisticated metrics that inform assembly completeness, overall quality and utility for RNA-Seq. Importantly, the exemplification of a transcriptome by a de novo assembly can have different meanings depending on the question at hand – a single best assembler may not yet exist.

With the tools summarized in Table 2.1 we assembled SGS RNA-Seq data including two biological replicates of paired-end Illumina mRNA-Seq, normalized paired-end Illumina mRNA-Seq and two biological replicates of a cDNA library with Roche’s 454FLX. Our analysis focuses on data produced by Illumina sequencing technology (GAIIx, HiSeq, MiSeq) for 3 primary reasons: 1) this technology contributes the majority of new data to NCBI’s sequence read archive, 2) the cost per base provides a price advantage over Roche’s 454 and 3) many of the current de novo assemblers are designed to assemble Illumina data (or a mixture of Illumina and 454) but not SOLiD data or 454 data alone. We also evaluated the accuracy of measurements of expression levels extracted from de novo assemblies by interrogating the same RNA samples with NimbleGen Multiplex microarrays. A set of 12 genes was carefully chosen for follow-up analyses including quantitative RT-PCR (qPCR). A key feature of our analysis is the use in all experiments of a pair of replicated, high-quality RNA samples prepared from A. thaliana young leaves. The verification that biological variability was very low in every aspect of our analysis allowed us to focus on the technical biases of sequencing platforms, transcriptome assembly and post-assembly processing steps.

The results of these analyses reveal that Trinity131, CLC132 and SOAPdenovo-trans133 have superior performance. We analyzed primary assemblies and those that were post-processed, a series of steps akin to the Trinity pipeline, which yielded improved assemblies. We have developed SCERNA, a collection of versatile post-processing tools and protocols for SGS transcriptome data that improve the quality of de novo transcriptome assemblies. One of the primary hurdles to overall assembly quality is highly variable transcript coverage. Generally, and as expected, lowly expressed genes were typically not represented by complete and accurate transcript assemblies due to lack of sufficient read depth. However, and to our surprise, increasing read coverage increased the likelihood of missing, incomplete or poorly assembled transcripts. We also evaluated hybrid assemblies of 454 and Illumina data finding that while very modest gains in select categories were afforded by the addition of 454 data, these paled in comparison to the value of additional, replicate Illumina sequence data.

Results

Read mapping reference definition and data summary statistics

To represent the transcripts contained in two biological replicate Arabidopsis thaliana young-leaf total RNA samples (used exclusively for all analyses presented here) we chose the 10th generation A. thaliana genome (TAIR10) cDNA set as a reference. This set was further refined to include only the longest splice variants for detected transcripts. We defined “detected genes” as the set of loci to which any sequencing read mapped uniquely. This resulted in a list of 25,512 genes, of which 25,246 are nuclear encoded and represent 75.8% of the nuclear encoded A. thaliana genes. Table 2.2 contains a summary of sequencing and read alignment statistics for all A. thaliana young leaf RNA-Seq libraries. Our largest single dataset resulted from sequencing the normalized Illumina library, which provided 6.4 Gbp in >55 million 120 bp reads. The biological replicate Illumina libraries were next largest, each with ~4.2 Gbp represented by >55 million 76 bp reads. A commercially prepared normalized 454 transcriptome yielded 210 Mbp of sequence in >0.6 million reads with an average read length of 345 bp. The biological replicate 454 libraries were the smallest with 24 Mbp and 21 Mbp of sequence represented by >100,000 and >89,000 reads, respectively, each with an average length of ~235 bp.

Gene tagging and cDNA coverage

The coverage of each A. thaliana gene was determined by mapping quality-trimmed reads to the longest splice variant of the detected TAIR10 cDNAs. Figure 2.1 shows the distribution of cDNA read-level coverage for each of the sequencing data sets in this study.
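As an illustration of how read-level coverage of a single cDNA could be computed from mapped reads, the following minimal Python sketch merges read alignment intervals and reports the fraction of cDNA bases covered by at least one read. The function name and the (start, end) interval representation are assumptions made for illustration; this is not the read-mapping pipeline used in this study.

```python
def cdna_coverage(cdna_length, read_alignments):
    """Fraction of cDNA bases covered by at least one read.

    read_alignments: iterable of (start, end) positions on the cDNA,
    0-based and half-open (hypothetical representation).
    """
    covered = 0
    current_end = 0  # right-most base already counted
    for start, end in sorted(read_alignments):
        start = max(start, current_end)  # skip bases already counted
        end = min(end, cdna_length)      # clip to the cDNA
        if end > start:
            covered += end - start
            current_end = end
    return covered / cdna_length


# Three 76 bp reads on a 1000 bp cDNA; the first two overlap by 26 bp.
example_reads = [(0, 76), (50, 126), (500, 576)]
print(cdna_coverage(1000, example_reads))
```

Averaging this value over all detected cDNAs would give a library-level figure comparable in spirit to the average cDNA coverage values reported here.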

Among single libraries, average cDNA coverage was highest in the normalized Illumina library at 79.7%. For each non-normalized Illumina biological replicate library the average coverage was 76% (±0.21) and when combined, the replicate Illumina libraries provided an average cDNA coverage of 80.7%. The normalized 454 transcriptome data set had an average cDNA coverage of 53.0% and the combined 454 non-normalized dataset provided an average cDNA coverage of 22.6%. While normalization did substantially increase the average coverage of both the 454 and Illumina transcriptomes, the combined, replicate non-normalized Illumina transcriptomes produced the greatest average cDNA read-level coverage. Though the normalized Illumina transcriptome had the greatest proportion of high coverage cDNAs (99-100%) it also lacked reads for a large proportion of the detected transcripts relative to the combined replicate non-normalized Illumina transcriptomes (Figure 2.1).

We predicted the sequencing statistics of the non-normalized Illumina biological replicate one (BR1) data set using ESTCalc5 at http://fgp.huck.psu.edu. This simulation indicated that BR1 would not only provide tags for all expressed genes (assuming a transcriptome of 18,000 genes), but cover the entire transcriptome. The non-normalized Illumina library generated from biological replicate one yielded ~4.2 Gbp of sequence data and generated tags for 96.2% of the detected gene set. When we doubled the sequence data amount with Illumina biological replicate two we tagged an additional 966 genes, bringing the total to 99.9% of detected genes tagged. This indicates that we tagged most genes by sequencing biological replicate one (~4.2 Gbp). Additionally, since mean transcript coverage was not substantially increased, BR1 was deemed sufficient to represent the transcriptome. The simulations performed by Wall et al.5 also included a cost component and revealed a substantial cost advantage by choosing Illumina over 454 in terms of cost-per-base pair of sequence data.

Assembly reference definition and data choice for de novo assembly

The TAIR10 cDNA reference based assembly (with Mosaik-SCERNA; see methods) of BR1 contained unique assembled sequences (unigenes) that represented 20,176 cDNAs (79.1% of the detected gene set). Of these transcripts, 17,723 (87.8%) were covered >50%. The remaining 5336 (20.9%) detected cDNAs were not represented in the reference-based assembly, indicating insufficient cDNA coverage (e.g. low expression). When we included a second data set of equivalent size and quality (Illumina biological replicate 2, BR2) for reference-based assembly with Mosaik, the number of detected genes represented by unigenes increased to 20,840 (81.7%) from 20,176 (79.1%). The number of assembled transcripts covered >50% by the assembly increased to 18,596 (89.2%), while the number of missing genes decreased to 4672 (18.3%). These modest gains indicate that BR1 (4.2 Gbp) was sufficient to afford a representative transcriptome assembly, and it was chosen to serve as a cost-effective benchmark data set for our detailed de novo transcriptome assembly comparison. Indeed the 454 datasets covered far less of the A. thaliana young leaf transcriptome and the generation of a 454 dataset similar in transcriptome coverage to the Illumina datasets was cost prohibitive. An important distinction between the cDNA reference and the de novo assembly reference is that between detection and reconstruction. While we detected 25,512 genes, the number of reconstructed genes (>50% coverage) is closer to 18,000 as indicated by the number of genes represented in our reference based transcriptome assemblies of BR1 with Mosaik.

Quality metrics and terms for comparison of de novo assembler outputs

We developed an extensive set of quality metrics for transcriptome assembly (Table 2.3) that includes reference-dependent and reference-independent metrics that are used to evaluate so-called “primary” assemblies (most basic output of a given assembler) and post-processed assemblies that represent refined assembly outputs. To facilitate a concise presentation of 78 assemblies plus multiple analyses and comparisons, we present a list of assembly abbreviations and terms, suffixes and associated definitions in Table 2.4. The majority of analyses presented below were carried out with carefully controlled parameters that aimed to level the playing field as much as possible for assemblies of Illumina biological replicate one (BR1) with the assemblers listed in Table 2.1. We have also included a select set of informative analyses of de novo assemblies generated with updated versions of CLC (CLCscaf) and SOAPdenovo (SOAPtrans) (see Table 2.1 and Materials and Methods).

Assembly statistics:

Evaluation of de novo assembly via a reference assembly standard

Primary assembly

The outputs of the primary assemblies are summarized in Table 2.5. Compared to the Mosaik assembly the primary assembly statistics display variable patterns of performance. For instance, the CLC primary assembly has outstanding RNA-Seq correlation (with Mosaik) statistics and is one of the larger assemblies, yet has the highest assembled sequences count and the lowest N50 length, indicating a preponderance of short unigenes. The Inchworm assembly has excellent length statistics and good RNA-Seq correlations, yet has >50,000 unigenes, which is ~2.5 times the number of assembled sequences found in the Mosaik assembly. The assembly with the most similar number of unigenes to Mosaik is NG IT. This assembly has good length statistics but is the second smallest assembly and is clearly mid-pack in terms of RNA-Seq correlation, suggesting a less complete assembly; curiously, this primary assembly garners the greatest proportion of mappable reads. These conflicting patterns suggest the output of each assembler is highly variable and simple length and read-mapping statistics do not describe the assembly sufficiently to choose a superior output (e.g. most similar to the reference assembly). That all assemblers have a greater number of sequences than the Mosaik assembly, and many are larger, indicates incomplete assembly that takes the form of fragmented and duplicated sequences. That others garner a small proportion of reads when mapped back to the assembly indicates that the assembly is not representative, highly fragmented or incomplete compared to the Mosaik standard.
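For readers unfamiliar with the N50 length statistic referred to above, the following minimal Python sketch shows one way to compute it. It assumes the common definition (the length L such that unigenes of length L or longer contain at least half of all assembled bases); the function name and the example lengths are hypothetical and are not taken from Table 2.5.

```python
def n50(unigene_lengths):
    """Return the N50 length of an assembly (0 for an empty assembly)."""
    total = sum(unigene_lengths)
    running = 0
    # Walk from the longest unigene down until half the bases are reached.
    for length in sorted(unigene_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# The same 2000 bases assembled contiguously versus in fragments:
contiguous = [1000, 1000]
fragmented = [250] * 8
print(n50(contiguous), n50(fragmented))
```

The two toy assemblies contain the same number of bases, yet the fragmented one has a much lower N50, which is the pattern described for the primary CLC assembly.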

Post-processed assembly

In Table 2.5 the landscape of assembly statistics appears much more even after refinement by post-processing with the Velvet-Oases, Inchworm-Chrysalis-Butterfly (Trinity) or our modular SCERNA (Figure 2.2) post-processing tools. The similarity is greater when increasing minimum length cutoffs are imposed (Figure 2.3), revealing that the comparatively large differences between the assemblies are contained in short rather than long unigenes. The goal of post-processing is to improve key (namely length-statistic and unigene count related) aspects of an assembly without diminishing other aspects such as usefulness as a read mapping reference (RNA-Seq correlation) and the proportion of mappable reads. The number of sequences was decreased in all assemblies and generally the length statistics are improved or marginally decreased. The RNA-Seq correlations were not substantially changed nor were the proportions of mappable reads. For a comprehensive summary of post-processing effects on all assemblies see Appendix A. Though all assemblies are improved by post-processing, none are as good as Mosaik-S assemblies of the same data.

Assembly statistics:

Evaluation of de novo assembly via a genome (cDNA) standard

Error rates

The statistics of each primary and post-processed assembly reveal only limited information about certain aspects of the assembly, namely the occurrence of errors that may contribute to fragmented, incomplete or incorrect assemblies. By estimating various types of error we can begin to explain the differences of each de novo assembly compared to the A. thaliana cDNA reference. Base call error rates were vanishingly low (<0.5%) in all post-processed BR1 assemblies and assemblies were improved (or nominally changed) by post-processing (Appendix A). The rate of alignment gaps (to TAIR10 cDNAs) was also very low and ranged from 0.001% for Mosaik-S to 0.005% for Inchworm-S, Velvet-S and CLC-S.

We defined two other error types that inform assembly failures. Type I errors indicate failure to assemble (fragmentation) and Type II errors indicate annotation ambiguity (with a subcategory that may suggest chimerism; Figure 2.4). The number of assembly gaps between 2 unigenes (Type I Case I) ranged from 16,931 for Mosaik-S to 8,964 in Oases-VO. Type I Case II counts (overlap ≤kmer-1) were very low in Mosaik-S at 56 and ranged from 5,637 in CLC-S to 1,227 in ABySS-S. Fragmentation that occurs despite an overlap >kmer-1 (Type I Case III) was also low in Mosaik-S at 256 and higher in the de novo assemblies, ranging from 7,819 in CLC-S to 1,381 in Oases-VO. Type II error rates are essentially a measure of the inability to unambiguously assign a unigene to a reference cDNA. The Type II error rates for post-processed assemblies of BR1 were all below 0.3% with many near 0. However, Type II Case I represents a special case that may suggest chimerism more strongly than cases II-V in that the ambiguous regions of the unigene do not overlap. The Trinity-S Type II Case I error count was 4, the count for Mosaik-S was 3 and NGMO-S had only 1. The highest number of possible chimeric assembly errors in post-processed assemblies of BR1 was in CLC-S at 56, yet this was still only 0.07% of all unigenes in the assembly. The proportion of Type II Case I errors was consistently less than half of all Type II errors, with many much lower. This indicates a rate of chimeric assembly that is very low across all assemblies. Our annotation strategy relied on BLASTn alignments to TAIR10 cDNAs and the majority of Type II errors are likely due to ambiguous sequence annotation rather than erroneous assemblies.

cDNA coverage statistics

The average read-level cDNA coverage of BR1 was 75.8% and the average unigene-level cDNA coverage of the Mosaik assembly of BR1 was 65.3%. All primary and post-processed de novo assemblies of BR1 yielded poorer coverage statistics than the respective TAIR10 cDNA reference based Mosaik assembly. The differences among the de novo assemblies were not dramatic and illustrate a subtle gradation in terms of completeness (Figure 2.5). Inchworm-S and CLC-S had the best unigene-level coverage statistics, yet were only incrementally superior to the other assemblies, which still yielded many thousands of unigenes that covered reference transcripts >75%.

Integration of assembly quality metrics reveals a complex assembly landscape

The effects of post-processing tools like those integrated into Velvet-Oases, Trinity and our versatile, stand-alone SCERNA (Figure 2.2) are seen in all assembly metrics presented thus far. In Figure 2.6 we summarize the effect of SCERNA post-processing on the reference assembly, Mosaik, and CLC as well as the effect of the Trinity pipeline on the intermediate output generated with Inchworm. The desired direction of change varies by category. For instance we expect that successful post-processing should reduce error rates and unigene numbers and increase length statistics, indicating error correction and consolidation of information resulting in a smaller, more concise and contiguous assembly. A summary of the effect of post-processing on all assemblies is presented in Appendix A.

The Mosaik assembly is generated by aligning reads to the reference set of cDNAs, thus the effect of post processing should be minimal since each unigene is reference checked during assembly. Very complete yet highly fragmented assemblies like CLC should benefit greatly from post-processing while high performance transcriptome assemblers like Trinity should show more modest gains during post-processing. The effect on the Mosaik assembly is minimal with the exception that the Type II error rate is reduced by ~35% to 0.025% and the number of genes missing from the assembly was increased. The effect on the CLC assembly was pronounced with substantial improvement of error rates and either minimally negative or positive changes in length statistics and unigene count. The changes for the Trinity assembly from the intermediate Inchworm output were more modest yet showed improvement or minimal change in all categories. Generally these data show that post-processing will have a minimal effect on excellent assemblies like Mosaik yet substantial positive effects on de novo assemblies by correcting error, improving length statistics and refining and simplifying the transcriptome assembly.

Visualizing the complex assembly landscape

Despite a rigorous examination of numerous assembly quality metrics, a comprehensive view of the assembly landscape remains obscured. It is clear that assemblies produced by CLC are very complete, yet highly fragmented. Others like Trinity produce concise assemblies by excluding portions of the transcriptome that are difficult to resolve as evidenced by the lower error rates and relatively greater number of missing genes. It is easy to make comparisons and see the effect of post-processing protocols, like SCERNA, in individual categories, yet the fate of each gene needs to be considered to draw conclusions about transcriptome-wide assembler performance.

We integrated our read mapping data (sequenced fragments/basepair - SFB) and BLAST based sequence identification data (BitScore/basepair - BS) to reveal how the quality of de novo transcriptome assembly changes across the dynamic range of expression in the A. thaliana young leaf transcriptome. Our approach selects a single unigene with the highest BLASTn bit score to a given TAIR10 cDNA, normalizes this score by cDNA length and then plots it against the normalized read depth (SFB). The ideal BS (bit score normalized for sequence length) is 2 for a perfectly matched alignment. However, owing to variations in nucleotide base frequency, and thus the likelihood of transitions at a given position, the BS for a perfect alignment will approach (and in rare cases reach) but not exceed 2.0 (ref. 134). For example, a unigene that aligns to a 1000bp reference transcript with 2 BS and 0.2 SFB is identical to the reference, and 200 sequenced fragments (pairs and/or orphans) map back to the unigene. If the coverage of a perfect alignment dropped to 75%, the BS would be 1.5.
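The two normalized quantities defined above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, the bit scores in the example are invented round numbers chosen to reproduce the worked example in the text, and returning None for a cDNA without any BLASTn hit is a choice made here for illustration.

```python
def sfb(sequenced_fragments, transcript_length_bp):
    """Normalized read depth: sequenced fragments per base pair (SFB)."""
    return sequenced_fragments / transcript_length_bp


def best_bs(unigene_bit_scores, cdna_length_bp):
    """Length-normalized bit score (BS) of the best-scoring unigene.

    unigene_bit_scores: BLASTn bit scores of all unigenes that hit one
    reference cDNA. Returns None when no unigene hit the cDNA.
    """
    if not unigene_bit_scores:
        return None
    return max(unigene_bit_scores) / cdna_length_bp


# Worked example from the text: a 1000 bp reference transcript.
print(sfb(200, 1000))             # 200 fragments map back to the unigene
print(best_bs([2000, 640], 1000)) # near-perfect, full-length best hit
print(best_bs([1500], 1000))      # perfect alignment over 75% of the cDNA
```

Plotting best_bs against sfb for every detected cDNA, one point per gene, gives the kind of view that Figure 2.7 presents for each assembly.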

In Figure 2.7 the results of this integrated approach are presented for all post-processed assemblies of BR1. In all assemblies lowly expressed genes are generally represented by low quality (i.e. incomplete) unigenes. The accumulation of read depth results in a rapid increase in unigene quality from 0.05 to 0.1 SFB for the Mosaik-S assembly, indicating that for an average gene (1000bp), >50 85x85bp paired-end Illumina reads is a practical minimum for full length and accurate reference based assembly. The read depth threshold for high quality unigenes is greater for de novo assembly. With the exception of the NG MO-S assembly, the read use efficiency at lower sequencing depths was more or less equivalent among assemblies, with a sequencing depth between 0.1 and 1 SFB routinely producing long and accurate assembled sequences (Figure 2.7).

Reassuringly, for cDNAs covered >50% by post-processed assemblies of BR1 (except
NG MO-S) 78±1.0% of those cDNAs were represented by at least one contig of BS >1.0. This indicates that, generally, all de novo assemblers considered here (except NG MO-S), given a sequencing depth of 0.1SFB (~100 Paired-end 76x76bp reads) will assemble at least one unigene that covers >50% of the original transcript. The Mosaik-S assembly, a best-case scenario, manages to accomplish this for 87% of cDNAs covered at >50%.

A surprising result is that for cDNAs covered at >75% by an assembly the proportion of single contigs >1.5BS for that locus decreases to 64±1.3% for de novo assemblies and to 77% for the Mosaik-S assembly. This counterintuitive pattern is illustrated in Figure 2.7 as a persistence of lower quality unigenes at a normalized read depth above 1.0 SFB (i.e. 1,000 sequenced fragments for a 1000bp transcript). The assemblies CLC-S, Oases-VO, Velvet-S, SOAPdenovo-S and ABySS-S all fail to assemble highly expressed transcripts at >10 SFB, while more robust assemblers like Inchworm-S, Trinity-ICB, CLCscaf-S, SOAPtrans-S, NG MO-S and NG IT-S continue to produce unigenes efficiently, although they also produce unigenes of lower quality. This pattern is also seen in our primary assemblies (Appendix B) and is not a result of post-processing.

For assemblies we know to be less complete (Velvet-S, Oases-VO, SOAPdenovo-S, ABySS-S) this is not surprising in that it points to potential missing transcriptome components. However, that CLC-S also displays a clear read-depth threshold for high-quality assemblies (SFB >10) while at the same time having the best RNA-Seq correlation and most complete cumulative assembly suggests that many of the fragmented transcripts are from highly expressed genes. In the case of the other assemblies that fail at high read counts there are virtually no unigenes that garner >5000 reads. The Inchworm-S, Trinity-ICB, CLCscaf-S, SOAPtrans-S, NGI-S and NGMO-S assemblies all lack a gene expression threshold for successful assembly. NGMO-S is inefficient, yet the implementation of an iterative approach (NGI-S, designed to control for extreme sequencing depth) substantially improves assembly efficiency and contiguity (Figure 2.7).

To illustrate the challenge of distilling tremendous variation into concise unigenes representing a single locus, we aligned unigenes from 4 assemblies of BR1 (Inchworm, Trinity-ICB, CLC and CLC-S) with the genomic DNA, cDNA and coding sequences (CDS) of AT1G31330.1, which encodes photosystem I subunit F (Figure 2.8). In this extreme case the SFB of this gene (in BR1) was 128 (>121,000 reads). The Inchworm-S assembly produced two unigenes that matched the AT1G31330.1 cDNA. The post-processed Trinity-ICB assembly produced a single perfect unigene that extended the 3’ and 5’ UTRs by 72bp and 74bp, respectively. Yet the primary CLC assembly produced 623 unigenes that aligned to the AT1G31330.1 cDNA. Even after post-processing the CLC-S assembly contained at least 96 unigenes that aligned to AT1G31330.1. The alignments of CLC unigenes reveal numerous single base-pair and structural differences, including unigenes that seem to contain introns (Figure 2.8, C & D). Trinity was able to distill these various differences into a single representative that was identical to the reference cDNA, while CLC maintained these differences in individual sub-sequences that, even after post-processing of the BR1 assembly by SCERNA, were represented by numerous unigenes. Depending upon research goals, it may be desirable to examine the numerous differences for AT1G31330.1 displayed in CLC assemblies or, alternatively, extract a single perfect unigene from the post-processed Trinity assembly for downstream analysis.

To further reveal read-use characteristics of each assembly a white trend line is plotted in each graph in Figure 2.7. This trend line shows the normalized read depth inflection point at which an assembly accumulates more contiguous and accurate unigenes. Except for NG MO-S, the trend of increasing quality as read depth increases is highly similar, with the most dramatic differences seen above 10 SFB. This analysis also shows that most de novo assemblers considered here accumulate unigenes that are representative of transcripts at 0.1-0.2 SFB, indicating that for an average 1000bp transcript, 100-200 reads (i.e. 17-34x depth with 85x85 paired end reads) is a practical lower limit for successful de novo assembly. More important, though, is the performance of de novo assemblers at very high read depth. The post-processed assemblies of BR1 using Inchworm-S, Trinity-ICB, CLCscaf-S, and SOAPtrans-S show equivalent trends at lower read depths and continue the trend of increasing assembly quality with higher read depth beyond 10 SFB up to 1000 SFB, a range of 10,000 to 1,000,000 reads for a 1000bp transcript.
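The arithmetic linking SFB, fragment counts and fold depth in the paragraph above can be checked with a short Python sketch. It assumes that each "read" counted in the text is a sequenced fragment, i.e. a pair of 85 bp reads contributing 170 sequenced bases; the function names are hypothetical.

```python
def fragments_at(sfb, transcript_length_bp):
    """Number of sequenced fragments implied by an SFB value."""
    return sfb * transcript_length_bp


def fold_depth(fragments, read_length_bp, transcript_length_bp, paired=True):
    """Average per-base sequencing depth over one transcript."""
    bases_per_fragment = read_length_bp * (2 if paired else 1)
    return fragments * bases_per_fragment / transcript_length_bp


# Lower limit for de novo assembly of a 1000 bp transcript (0.1-0.2 SFB):
for n_fragments in (100, 200):
    print(n_fragments, fold_depth(n_fragments, 85, 1000))

# Upper range discussed in the text (10 to 1000 SFB):
print(fragments_at(10, 1000), fragments_at(1000, 1000))
```

Under these assumptions 100 and 200 fragments correspond to 17x and 34x depth, and 10 to 1000 SFB correspond to 10,000 to 1,000,000 fragments for a 1000 bp transcript, in agreement with the values quoted above.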

Strategies to improve assembly quality

Our analysis of the de novo assembled BR1 dataset reveals that the dynamic range of transcript abundance is a primary hurdle to full-length and accurate transcript assembly, often resulting in missing or fragmented transcripts. Two approaches that should address this problem are normalization via Duplex Specific Nuclease (DSN) to reduce the frequency of the most abundant transcripts and hybrid assemblies that combine structurally dissimilar sequencing data (e.g. 454 + Illumina).

The effect of normalization

The normalized Illumina transcriptome provided the highest average coverage for any single dataset (Figure 2.1) yet also lacked reads for >2000 of our detected transcripts, more than double the number missing from BR1. This was unexpected since the process of normalization should remove highly abundant transcripts, thus increasing the likelihood of detecting more transcripts. The result should be a more read diverse library with an increase in average transcript coverage. We hypothesized that this apparent contradiction was due to the removal of lowly expressed genes with highly expressed and highly similar relatives: a result of imperfect complementary base-pairing leading to digestion by the DSN during normalization. To test this we identified two distinct gene sets, the Ultra-Conserved Orthologs (UCOs, http://compgenomics.ucdavis.edu/) and the Closely Related Genes set (CRG), a set of 300 gene pairs with a Ks of <0.2 and with each mate having a read depth of >0.1 SFB. We correlated the read counts from the normalized Illumina library and the average read counts from the combined non-normalized Illumina biological replicates (BR12) and highlighted the UCO and CRG genes (Figure 2.9). The results of this analysis do not support our hypothesis that the aberrant removal of transcripts is due to high sequence similarity of CRGs, since read counts for CRGs, compared to the UCOs and all detected transcripts, are not substantially affected in the normalized library. Despite the aberrant removal of transcripts from sequencing libraries, the increase in average cDNA coverage (Figure 2.1) indicates that normalization was successful in its primary purpose.

Since we have verified that normalization was successful in increasing cDNA level coverage (Figures 2.1 & 2.9), it stands to reason that de novo assembly of these data should help address the issues presented by a large dynamic range of transcript abundance that causes assembly failure at very high levels of expression (SFB >10, Figure 2.7). The assembly of normalized read data does increase the number of full length and accurate transcript assemblies (Figures 2.1, 2.9 & 2.10, Appendix A) by ~15% in the Mosaik and (Inchworm) Trinity assemblies and by a larger margin in the CLC assembly (~25%). Similar gains were observed in (Velvet) Oases and ABySS assemblies, though NG MO and SOAPdenovo assemblies were notably poorer, due possibly to structural changes in the data following the removal of SMART adapter sequences (see Materials and Methods).

The fate of closely related genes, hybrid and alternate assemblies, and instance of chimeras

Only one assembler considered here, CLC, was designed to generate hybrid assemblies from a mixture of data types. Our read level analysis indicated that the 454 data (normalized and non-normalized) covered far less of the detected gene set than even the Illumina biological replicate 1 dataset (BR1). Thus we had no expectation that a hybrid assembly would generate more accurate or complete assemblies. However, the ability to assemble certain genes (e.g. closely related genes) may be enhanced by the addition of the 454 data owing to the longer average read length that may span regions of high sequence identity, thus reducing ambiguity in read assignment.

For BR1 we plotted the Ks (synonymous substitutions/synonymous site) values for gene pairs detected in each assembly plus those detected in our expressed gene set and TAIR10 (Figure 2.11A). The profile of the Mosaik-S assembly of BR1 differed from that of the expressed gene set in that gene pairs with low Ks values, rather than a random subset, were absent from the Mosaik-S assembly, indicating that closely related gene pairs were preferentially excluded from the reference-based assembly. To determine if the difference was due to sampling bias (in which one mate of the pair is expressed at a level sufficient for assembly and the other is not) or due to limitations of assembly (highly similar reads from closely related gene pairs are ambiguous) we examined the read count statistics for the 2080 gene pairs lacking from the Mosaik-S plot relative to the detected gene (cDNAs) plot. For ~75% (1607) there were sufficient unique reads (SFB ≥0.1) for only one mate of the pair. Those pairs with reads sufficient for only one mate tended to have higher Ks values relative to the other pairs (Figure 2.12). This may indicate that paralogous gene pairs that have been diverged longer tend not to be co-expressed. The remaining 473 expressed gene pairs (EGP) had sufficient read depth for assembly, and constituted a list of genes likely to be misassembled into chimeric unigenes owing to high sequence similarity.

The instance of successful assembly (BS >1.5) of both mates in each EGP was very low for Mosaik-S at only 0.8% (Figure 2.11B). Roughly half (44%) of EGPs were represented by both mates in the Mosaik-S assembly, though the majority of these unigenes (74%) were below 0.5 BS, indicating very short and/or low accuracy assemblies. Generally, the de novo assemblies produced about the same number of high quality pairs, ranging from 1.5% for Inchworm-S and Velvet-S to 0.6% for NG MO-S and Trinity-ICB. The primary difference was in the total number of high quality (>1.5BS) unigenes, with nearly all de novo assemblers producing more high quality unigenes than Mosaik-S. Additionally, high performance de novo assemblies like CLC-S lacked ~7% and Trinity-ICB missed ~10% while Mosaik-S and the lower performance de novo assemblers lacked ~20% of the EGP (Figure 2.11B). Together this resulted in a pronounced reciprocal pattern of “hit/no hit” (Figure 2.11B) in the representation of EGP genes in the de novo assemblies. Between 66% and 76% of the EGP were represented by only one mate of the pair in all de novo assemblies. This pattern was far less pronounced (34% reciprocal “hit/no hit” pattern) in the Mosaik-S assembly of BR1 with the result being multiple low quality unigenes that represent more pairs, though fewer EGP genes. Generally, this complex pattern indicates that the de novo assemblers tend to assemble higher quality unigenes representing one mate, or the other, while the reference-based Mosaik-S produces lower quality unigenes that represent more pairs. Taken together with the vanishingly low Type II error rates, this analysis of the closely related genes suggests the instance of chimeric assembly is very low, if not practically zero.
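The categories used above to describe how each expressed gene pair is represented in an assembly can be expressed as a small classification routine. The following Python sketch is illustrative only: the category labels, the use of None for a mate with no matching unigene, and the example BS values are assumptions, and only the BS >1.5 threshold for a high quality unigene is taken from the text.

```python
from collections import Counter


def classify_pair(bs_a, bs_b, high_quality=1.5):
    """Classify one gene pair from the best BS of each mate.

    bs_a, bs_b: best length-normalized bit score for each mate, or None
    when no unigene in the assembly matched that mate.
    """
    hits = [bs for bs in (bs_a, bs_b) if bs is not None]
    if len(hits) == 2:
        if min(hits) > high_quality:
            return "both (high quality)"
        return "both"
    if len(hits) == 1:
        return "hit/no hit"
    return "missing"


def summarize(pairs):
    """Proportion of gene pairs falling in each category."""
    counts = Counter(classify_pair(a, b) for a, b in pairs)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical BS values for five gene pairs in one assembly.
example_pairs = [(1.8, 1.7), (1.9, None), (None, 1.2), (0.4, 0.3), (None, None)]
print(summarize(example_pairs))
```

Running such a summary once per assembly would yield the proportions of pairs with both mates, one mate, or neither mate represented, which is the comparison summarized in Figure 2.11B.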

Assembly statistics – evaluation of de novo assembly without a genome reference

Using our standardized quality metrics we can evaluate de novo assemblies and provide the framework for comparison of various parameter adjustments. However appealing de novo assembly is in the context of a reference genome, the power to study non-model organisms is undeniable. Reference independent assessment of a de novo assembly is challenging primarily since the context of the outcome is dependent on available reference genomes. Here we describe important reference independent metrics of assembly quality that leverage read mapping (coupled with rough gene count estimates) and defined groups of universally conserved orthologous gene families.

Read Titration Analysis

In principle, if an assembly is accurate and complete, i.e., it accurately represents all of the raw data from which it was constructed, then a large proportion of those reads should map back to the assembly. In addition to the proportion of reads that map back to an assembly, a measure of completeness, in terms of number of genes detected, is necessary. To evaluate the degree of read saturation present in a de novo assembly, we calculated gene accumulation curves for each post-processed assembly of BR1 (Figure 2.13). This analysis calculates the rate of gene detection (estimated using the number of unique unigenes observed) as a function of sampling effort (number of sequence reads). Analogous to the species accumulation curves used to estimate species richness in biodiversity inventories, similar approaches have been used to evaluate gene capture in transcriptome studies for systems without a reference genome sequence26,135. This approach allows us to qualitatively and quantitatively assess whether we have sequenced to sufficient depth to capture all of the genes present in a sample (an important consideration is the expected number of captured genes) and can be used to inform quality assessment in an assembly.
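A gene accumulation curve of the kind described above can be sketched as follows. This minimal Python example shuffles the reads, samples them in increasing amounts and counts the distinct unigenes observed at each step. The input representation (one unigene identifier per read, None for unmapped reads), the function name and the toy data are assumptions for illustration; this is not the implementation used to generate Figure 2.13.

```python
import random


def accumulation_curve(read_to_unigene, steps=10, seed=0):
    """Return (reads sampled, distinct unigenes observed) at each step.

    read_to_unigene: one entry per read giving the unigene it maps to,
    or None if the read does not map to the assembly.
    """
    order = list(read_to_unigene)
    random.Random(seed).shuffle(order)  # reads are sampled in random order
    checkpoints = {len(order) * i // steps for i in range(1, steps + 1)}
    seen = set()
    curve = []
    for n_sampled, unigene in enumerate(order, start=1):
        if unigene is not None:
            seen.add(unigene)
        if n_sampled in checkpoints:
            curve.append((n_sampled, len(seen)))
    return curve


# Toy data: three unigenes with very different expression levels.
toy_reads = ["g1"] * 50 + ["g2"] * 30 + ["g3"] * 20
for n_sampled, n_unigenes in accumulation_curve(toy_reads):
    print(n_sampled, n_unigenes)
```

A curve that flattens well before all reads are sampled suggests saturation, and the height of the plateau can then be compared with the expected gene count.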

The read titration analysis shows that superior assemblies (like Trinity-ICB) are representative of the read data and capture a large amount of sample diversity before approaching saturation. At saturation a large proportion of the data are sampled and selecting a read that maps to a previously un-sampled unigene becomes increasingly rare. It is clear that NG IT-S, NG MO-S, Inchworm-S, Trinity-ICB, CLCscaf-S and SOAPtrans-S reach saturation at ~110-150% of the expected gene count, while CLC-S has more than 3 times the expected number of sequences. We do not expect a typical plant transcriptome to contain 60,000 genes, nor do we expect that only a small fraction of reads will map to a high-quality de novo assembly. This easily excludes CLC-S since the proliferation of unique sub-sequences (Figure 2.13) causes an inflation of “unique tags” in the read titration analysis. Velvet-S, Oases-VO, ABySS-S and SOAPdenovo-S are excluded since they garner a very small proportion of reads before reaching saturation. However, the plot of NG IT-S is quite similar to the Mosaik-S assembly, closer in fact than Inchworm-S, which we know to be a leader by nearly all measures. This is a reflection of the highly accurate, if inefficient and exclusive, nature of NG IT-S assemblies. The unigenes that NG IT-S produces are very accurate and tend to require more reads for assembly while at the same time excluding conflicted unigenes and lowly expressed genes.

The results of this type of analysis will reveal which assemblies represent the read data well, while at the same time providing a measure of assembly diversity. In theory, a good assembly should produce a number of sequences that reflect the estimated transcript number.

However, the choice between leading assemblers is not clear, as in the case of NG IT-S, NG MO-S, Trinity-ICB, Inchworm-S, CLCscaf-S and SOAPtrans-S. To make a more informed choice we should not only estimate the number of expected sequences, but also look for specific genes (e.g. broadly conserved genes) we expect to find.

Ultra Conserved Orthologs (UCOs) as a proxy for the whole transcriptome

Our read level analyses suggest that UCOs are among the ~60% of genes expressed at moderate levels in the A. thaliana young leaf transcriptome (Figure 2.9). The presence of UCOs in a de novo assembly is an indicator of assembly quality since moderately expressed genes with low sequence similarity to other genes are likely to be assembled and represented by (often single) high quality unigenes. The Mosaik-S (BR1) assembly contained high quality assembled sequences (>BS1.5) for ~85% of UCOs and the leading de novo assemblers, Inchworm-S, Trinity-ICB, CLCscaf-S and SOAPtrans-S, all captured ~75% of UCOs with high quality assembled sequences (Figure 2.14). Combined with the read titration analysis we can now exclude NG IT-S and NG MO-S since the exclusive and inefficient nature of these assemblers results in an inferior complement of UCOs; NG IT-S lacks >100 UCOs covered at >90%, and ~45 UCOs covered at >99%, compared to the leading assemblies, while NG MO-S lacks even more. The results of these two genome reference independent analyses clearly identify Inchworm-S, Trinity-ICB, CLCscaf-S and SOAPtrans-S as the superior assemblies, which is concordant with our ranking based upon the 10th generation A. thaliana reference genome based analyses.
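As an illustration of the UCO tally, the sketch below computes the fraction of UCOs whose best-aligned unigene covers at least a given proportion of the reference cDNA. It is a simplified stand-in for the analysis behind Figure 2.14: the input layout (a dictionary of reference lengths and a list of aligned lengths per hit) and the name `uco_recovery` are assumptions made for this example, and alignment quality (bit score) is not considered here.

```python
def uco_recovery(uco_lengths, alignments, min_coverage=0.90):
    """Fraction of UCOs covered at >= min_coverage by their longest unigene alignment.

    uco_lengths: dict mapping UCO id -> reference cDNA length in bp.
    alignments: iterable of (uco_id, aligned_length_bp), one entry per unigene hit.
    """
    best = {}
    for uco_id, aligned in alignments:
        if uco_id in uco_lengths:  # ignore hits to sequences outside the UCO set
            best[uco_id] = max(best.get(uco_id, 0), aligned)

    recovered = sum(
        1 for uco_id, aligned in best.items()
        if aligned / uco_lengths[uco_id] >= min_coverage
    )
    return recovered / len(uco_lengths)
```

Calling the function with `min_coverage=0.99` gives the stricter tally quoted above for near-complete UCOs.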

Microarray and RNA-Seq analysis of the A. thaliana young leaf transcriptome

We interrogated the A. thaliana young leaf transcriptome with a NimbleGen Multiplex Microarray and variations of RNA-Seq to examine the correlation between global estimates of gene expression. A key difference (aside from analog vs digital detection) between RNA-Seq and the microarray is in which portion of a transcript (cDNA) is detected. Since RNA-Seq uses reads derived from whole transcripts and the microarray interrogates specific transcript regions (defined by complementary probe sequences) we reasoned that we could improve the correlation coefficient of gene expression by masking the RNA-Seq signal from all but microarray probe regions. We improved the Pearson’s correlation of Illumina BR1 and the NimbleGen Multiplex Microarray from R² = 0.817 (average Log2 RPKM vs. average Log2 Array intensity) to 0.853 (average Log2 reads/probe vs. average Log2 Array intensity). These results are an improvement over previous studies64,136 and consistent with Malone and Oliver137, albeit with fewer RNA-Seq replicates. The increased agreement between the two technologies comes at the expense of mapping only ~2.5% of the Illumina reads from both biological replicates to the array probe sequences.
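The comparison above rests on two small calculations: converting read counts to RPKM, and taking the squared Pearson correlation of log2-transformed values. The following sketch shows both under stated assumptions. The function names are hypothetical, and dropping genes with a zero value before the log transform is a choice made for this example; the text does not state how such genes were handled.

```python
import math


def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def log2_r_squared(seq_values, array_values):
    """R^2 between log2 RNA-Seq values and log2 array intensities.

    Assumption for this sketch: pairs with a non-positive value are dropped,
    since log2 is undefined for them.
    """
    pairs = [
        (math.log2(s), math.log2(a))
        for s, a in zip(seq_values, array_values)
        if s > 0 and a > 0
    ]
    xs, ys = zip(*pairs)
    return pearson_r(xs, ys) ** 2
```

For the probe-masked comparison, `seq_values` would hold reads per probe rather than RPKM; the correlation step is unchanged.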

Correlation of the array intensities and RNA-Seq read counts revealed that while ~85% (R² = 0.853) of the features can be explained by a linear relationship, RNA-Seq and the NimbleGen Microarray reported very different expression levels for some genes. After verifying the probe sequences (TAIR6 against the then current TAIR9) and excluding organellar genes, we used the correlation analysis to choose candidates for qRT-PCR analysis. Twelve genes (3 each in 4 categories - Figure 2.15) were chosen for analysis by qRT-PCR. We searched for candidates in each category (Figure 2.15) that had verified (still accurate from TAIR6 to TAIR9) probe sequences. The six candidates for the well-correlated group (black arrows Figure 2.15) were easily chosen and were the first 6 examined. In contrast, the search for candidates in the poorly correlated group (red arrows Figure 2.15) extended to 25-30 candidates before any were chosen due to updated or obsolete annotations, and each of these had only two probes, compared to three probes for the well-correlated gene set.

Gene expression levels for the well-correlated candidates were generally consistent between the microarray, qRT-PCR and multiple variations of RNA-Seq within the dynamic range of the array (Figure 2.16). Outside of the dynamic range of the array, qRT-PCR and RNA-Seq generally agreed (Figure 2.16). In contrast, the poorly correlated candidates represented either cases of false positive signal on the array (2 of 3) or had evidence of mis-annotation. For detailed results of the follow-up analyses on the six poorly correlated candidates, see Appendix C.

Since we established that the RNA-Seq data set and microarray data set agreed very well, we sought to investigate the power to predict levels of gene expression using our de novo assemblies as read mapping references. The RNA-Seq correlation (PCC) between experiments using TAIR10 and the Mosaik-S assembly of BR1 revealed high similarity (R² = 0.74) with the TAIR10 reference capturing more reads (Figure 2.17), as expected since the TAIR reference is more complete than the Mosaik-S assembly. The interpretation of RNA-Seq data using a de novo assembled transcriptome from a non-model organism is more challenging since annotation is based on similarity to known genes rather than to a reference genome. Our study reveals a best case scenario for assembled sequence annotation since the unigenes were easy to match to the Arabidopsis genome. Read counts from all de novo assemblies were correlated against a reference-based assembly of the same data set (Appendix A). We also verified that the RNA-Seq correlation between reads mapped to an excellent de novo assembly, Inchworm-S (BR1), and reads mapped to TAIR10 was good (R² = 0.64) and the PCC between the assemblies of BR1 by Mosaik-S and Inchworm-S was quite good with R² = 0.86 (Figure 2.17).

Metagenome analysis with MEGAN

The A. thaliana transcriptome analyzed here contains unigenes that do not align with A. thaliana cDNA sequences (TAIR10). To determine the origin of these putative transcripts we used MEGAN138 to classify these unigenes. The threshold alignment quality (bit score) of 125 was determined by plotting the frequency of hits to plants vs. frequency of non-assignment (Figure 2.18). At this threshold a substantial number of unigenes remain unassigned due to low quality alignments or low sequence complexity (Figure 2.19). A total of 357 unigenes from all post-processed assemblies of BR1 had best hits to fungal genes (Figure 2.19A), indicating low levels of contamination from the greenhouse grown plants. The genus Arabidopsis was represented by 61 unigenes (collectively from the post-processed assemblies of BR1), which align to 20 unique sequences in NCBI’s non-redundant protein sequences database (see Appendix D). Of these 20, none are represented by TAIR10 cDNAs, 14 are in the A. thaliana genome, 9 are predicted (ab initio) A. thaliana mRNA transcripts, 15 are A. thaliana ESTs (2 are not in the A. thaliana genome), and 1 is similar to an Arabidopsis lyrata ribosomal protein (see Appendix D). This collection of Arabidopsis unigenes represents evidence for new genes with concordant evidence in A. thaliana databases at NCBI2. If a more stringent alignment quality threshold is imposed, the remaining unigenes are classified primarily as fungi or into the genus Arabidopsis (Figure 2.19B) and the number of new genes detected falls by 50% to 10.
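The threshold selection described above (hits to plants vs. non-assignment as a function of bit score) can be illustrated with a simple sweep. This sketch does not reproduce MEGAN's taxonomic assignment; it assumes each unigene has already been reduced to its best hit with a bit score and a coarse taxon label, and the function name `threshold_sweep` and the label `"plant"` are hypothetical.

```python
def threshold_sweep(best_hits, thresholds):
    """Tabulate plant-hit and unassigned fractions at each bit score threshold.

    best_hits: list of (bitscore, taxon_label), one entry per unigene.
    thresholds: bit score cutoffs to evaluate.
    Returns a list of (threshold, fraction_plant, fraction_unassigned).
    """
    total = len(best_hits)
    rows = []
    for cutoff in thresholds:
        # A unigene counts as assigned only if its best hit meets the cutoff.
        assigned = [label for score, label in best_hits if score >= cutoff]
        n_plant = sum(1 for label in assigned if label == "plant")
        n_unassigned = total - len(assigned)
        rows.append((cutoff, n_plant / total, n_unassigned / total))
    return rows
```

Plotting the two fractions against the cutoff shows where raising the threshold starts to cost plant hits faster than it removes doubtful assignments.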

Discussion

The rigorous comparison described here highlights the challenges of de novo assembly evaluation, even with the benefit of a high quality plant genome reference. The integration of numerous quality metrics, including those that are genome reference independent, has revealed performance advantages for certain de novo transcriptome assemblers. None were able to reconstruct the Arabidopsis thaliana young leaf transcriptome as well as the reference based assembler Mosaik. However, the top performing de novo assemblers were able to accurately reconstruct a majority of the transcriptome and even exhibited superior performance in select comparisons to the A. thaliana reference genome-based assemblies.

Sequencing platform selection

The overwhelming depth, breadth, and cost-effectiveness afforded by Illumina over 454 were revealed in our read level analysis. Using ESTcalc (http://fgp.huck.psu.edu5) we predicted that 5Gbp of Illumina data would be sufficient to reconstruct a plant transcriptome consisting of 18,000 genes with ~75% of transcripts represented fully by a single unigene. The Mosaik assembly of BR1 (4.2Gbp of 85x85 bp paired-end reads) was able to reconstruct 17,723 transcripts at >50% coverage. This indicates that in practice 5Gbp of paired-end Illumina data is sufficient to reconstruct a typical plant transcriptome. This was verified by the minimal gains (<5% increase in the number of >50% coverage transcripts) seen when we doubled the amount of sequencing data by co-assembling BR1 and BR2.

Assuming perfect normalization and doubling the amount of data in our normalized 454 transcriptome (based on simulation data using ESTcalc), still only 85% of genes would be covered >90% and only 36% would be covered at 100% compared to 100% and 99.9%, respectively, for a non-normalized Illumina dataset the size of BR1 (4.2Gbp). In practice these values are lower owing to a more complex transcriptome, various sources of error and, in the case of normalization, less than perfectly equal representation of each transcript. In our data, 4,205 Mbp of Illumina data provided transcript coverage >90% for roughly twice as many genes (15,171 vs. 7,335) as 210 Mbp of normalized 454 data at a fraction (<1/8th) of the cost.

From strictly a cost-per-base perspective, Illumina sequencing is superior. But when also considering our ability to leverage a low cost (5Gbp) paired-end Illumina data set to accurately reconstruct a majority of the A. thaliana young leaf transcriptome without the drawbacks of normalization (additional cost and erroneous transcript removal) plus the benefit of affordable, replicated count data for gene expression, the argument for the use of Illumina as a reliable and cost effective tool for the initial exploration of gene space is strong. Differences in the data structure, namely complementary error structure and longer read lengths afforded by 454 compared to Illumina, improved certain aspects of assembly. We examined CLC assemblies of a combination of 454 and Illumina (45Mbp, non-normalized 454 + Illumina BR1) to see if closely related genes were differentiated and if fragmented genes were coalesced. Compared to an assembly with CLC of Illumina-only BR1, the fate of closely related genes remained virtually unchanged (as did nearly every other aspect of assembly evaluation – see Appendix A) though the number of unigenes was reduced while accompanied by only a small increase in error rates and N50 statistics. Interestingly, an overlap layout assembler (CAP3) applied to CLC assemblies of BR1 resulted in improved assembly statistics similar to the hybrid (454 + Illumina) assemblies.

Importantly, assemblies of BR1 with Inchworm-S, Trinity-ICB, SOAPtrans-S and CLCscaf-S were superior to the hybrid assemblies by a large margin.

de novo assembler selection

The top performing assemblies were reassuringly good and in many aspects quite competitive with the A. thaliana reference genome based assemblies generated with Mosaik. Trinity and SOAPdenovo-trans are open source while CLC can be purchased and offers a more intuitive and user-friendly graphical user interface. Our extensive list of assembly quality metrics falls into two categories: 1) reference dependent and 2) reference independent. With a carefully controlled set of quality metrics, assembly parameters and evaluation criteria we are able to readily identify superior assemblies in our reference dependent evaluations. Once we identified the superior assemblies, we were able to choose reference independent metrics that were most effective at informing assembly quality. Generally, the assemblers could be ranked based upon analyses that revealed a subtle gradation in performance rather than substantial differences. The most surprising result was not that the dynamic range of expression was a key factor in reconstructing the Arabidopsis young leaf transcriptome, but rather that the most highly expressed genes were unlikely to be accurately reconstructed by all but the best de novo transcriptome assemblers. This observation indicates that insufficient or excessive sequencing depth can cause assembly failure.

The Arabidopsis genome reference reveals accurate transcript assembly

A discussion of each reference based quality metric invariably leads to an interpretation based upon integration of each individual metric. To that end, we developed a method that reports numerous types of assembly error based on alignment to a reference sequence. It relies on the substitution matrix for BLASTn alignments to report the assembly (i.e. alignment) quality of a unigene relative to a given Arabidopsis thaliana transcript. By aligning the “best” unigene for a given locus, assemblies with fragmented or gapped transcripts are penalized in that only the highest quality alignment is considered; the other unigenes that align to said locus are excluded. This method rewards long and accurate assemblies while penalizing alignment gaps, mismatches and short alignments. Thus, length statistics, which inform contiguity, are integrated with base call error (mismatches) and the frequency of other errors (e.g. alignment gaps, frame shift errors, insertions/deletions, and Type I & II errors). The number of ambiguous base calls is also penalized with a reduced bitscore. This method also penalizes ambiguous or potentially chimeric assemblies since the length of the high scoring pair (HSP) can be shortened in the (rarely observed) event of a chimera. Finally, by considering the sequencing depth for each unigene, we can visualize the sequencing depths (high and low) that result in highly accurate and contiguous unigenes and how assembly quality changes over that dynamic range. Using this approach we have shown that the Trinity-ICB (Inchworm-S was excluded from this group since it is a subtle variation on the marginally superior Trinity pipeline), CLCscaf-S and SOAPtrans-S assemblies are robust to the large dynamic range of expression for reconstructed genes, have exceedingly low error rates and have superior read use efficiencies. That said, a discussion of individual evaluation categories reveals where these assemblies differ and what tradeoffs were made to produce the leading assemblies.
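The "best unigene per locus" selection described above can be sketched as a single pass over BLASTn tabular output. This is an illustration only, not the scoring code used in this study: it assumes the standard 12-column tabular format (query id in column 1, subject id in column 2, bit score in column 12), with unigenes as queries and reference transcripts as subjects, and the function name is hypothetical.

```python
def best_unigene_per_locus(blast_lines):
    """Keep only the highest bit score unigene alignment for each reference transcript.

    blast_lines: iterable of tab-separated BLAST tabular lines with 12 columns
    (assumed layout: qseqid, sseqid, ..., bitscore as the last column).
    Returns a dict mapping locus id -> (unigene id, bit score).
    """
    best = {}
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        unigene, locus, bitscore = fields[0], fields[1], float(fields[11])
        # Fragments of the same transcript compete; only the top alignment survives.
        if locus not in best or bitscore > best[locus][1]:
            best[locus] = (unigene, bitscore)
    return best
```

Because lower-scoring fragments are discarded rather than summed, an assembly that splits a transcript across several unigenes is scored on its single best piece, which is the penalty for fragmentation described in the text.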

The statistics were fairly even among the CLCscaf-S, Trinity-ICB and SOAPtrans-S assemblies in terms of assembly size (~13.5Mbp), N50 (~1550) and unigene count (~30,000). Trinity-ICB had the greatest number of reads that mapped back to the assembly along with the lowest Type II error rate and highest mismatch (base call) error rate. SOAPtrans-S had the lowest Type I error rates, garnered the fewest reads that mapped back to the assembly and lacked unigenes for the greatest number of reference cDNAs. CLCscaf-S had the lowest mismatch error rate and had the greatest number of high coverage and high quality unigenes while also producing the greatest number of unigenes. The differences among CLCscaf-S, Trinity-ICB and SOAPtrans-S were generally minor but indicate that SOAPtrans-S was more exclusive, and CLCscaf-S and Trinity-ICB more inclusive, in terms of resolving variation in the transcriptome read data. One assembly does get an honorable mention for the greatest number of mapped reads with competitive error rates, NG IT-S. This assembly was comparatively small (10Mbp and ~23,000 unigenes) and showed lower read use efficiency than CLCscaf-S, Trinity-ICB and SOAPtrans-S, though it has competitive RNA-Seq correlation statistics and produces a large number of high quality (~6500 >1.5BS) unigenes. This is especially impressive when considering the improvement over the NG MO-S assembly; however, the improvement comes at a substantial labor cost when compared to higher performance assemblers, and NG IT-S is therefore ranked below CLCscaf-S, Trinity-ICB and SOAPtrans-S.

Which genome reference independent metrics identify the leading assemblers?

By comparing our various assemblies of Arabidopsis young leaf transcriptomes to the Arabidopsis genome, we can easily choose superior assemblies. However, the task becomes more difficult when the reference sequences are not known. A closely related reference species is useful for evaluating a de novo assembly, yet the need for one can pose a major hurdle when choosing research organisms that are optimal rather than convenient. The reference independent metrics that were most useful to inform superior assemblies were selected by examining the results of these analyses knowing which assemblies were superior based upon the reference dependent metrics.

The assembly statistics that describe how data are distributed in the assembly are useful, namely N50, unigene count and overall assembly length. With the exception of the NG MO-S assembly, all post-processed assemblies of BR1 had similar N50 statistics, with greater similarity as increasing length cutoffs are imposed (see Figure 2.3). This indicates successful assembly of many transcripts. However, when considered with the cumulative assembled sequence length, it is clear that better assemblies generally had more assembled sequence, though like the pattern of transcript coverage, the differences reveal a subtle gradation in assembly quality. The N-content tends not to be informative since a leading assembly, SOAPtrans-S, has the highest N-content while both the worst and the best assembly have an N-content of 0. Surprisingly, the superior assemblies had a relatively large number of unigenes, which is inconsistent with the expectation that superior assemblies have fewer, longer unigenes. The proportion of mappable reads is a reliable indicator of assembly quality, though less efficient assemblers like NextGENe, which assemble highly expressed genes well, can lack a substantial portion of the transcriptome while still garnering a large number of reads from few, highly expressed genes. This reflects the inefficiency of NextGENe but also the ability to resolve variation and thus assemble transcripts from highly expressed genes (see Figure 2.7). Selection of a superior assembly based solely upon length and read mapping statistics is not possible, and in our study the superior assemblies often displayed counterintuitive patterns.
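For reference, the three length statistics named above (N50, unigene count and overall assembly length) can be computed from a list of unigene lengths as in the sketch below. The function names are hypothetical, and N50 is taken in its usual sense: the length of the shortest unigene in the smallest set of longest unigenes that together contain at least half of the assembled bases.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50 requires at least one sequence length")


def assembly_summary(lengths, min_length=0):
    """Unigene count, total assembled length and N50 after an optional length cutoff."""
    kept = [length for length in lengths if length >= min_length]
    return {"unigenes": len(kept), "total_bp": sum(kept), "n50": n50(kept)}
```

Recomputing the summary at increasing values of `min_length` corresponds to the comparison across length cutoffs mentioned above.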

A key reference independent analysis was the read titration plot. The shape of the gene accumulation curve is driven by two characteristics of a library and assembly. First, the rate of increase on the left-hand side of this curve will be driven by the frequency distribution of reads among sequences in the assembly (analogous to beta diversity in an ecological context; essentially the dynamic range of expression levels in the cDNA pool). Second, the asymptotic height of the curve will be limited by the total number of sequences in the assembly. These characteristics will be affected by both library and sample properties as well as by assembly features. If an assembly retains more of the transcriptome complexity present in the sample, both more unigenes and more reads will be present, extending the endpoint of the curve compared with an assembly that only produces unigenes for well represented transcripts in the cDNA pool. The relative sharpness of the inflection point informs assembly quality in that a more contiguous and efficient assembly will have a sharper inflection point, as is seen with Mosaik-S. This is due to rapid detection of new unigenes in the assembly, which are quickly exhausted, and a switch to re-sampling high coverage unigenes. In assemblies that fail to reconstruct highly expressed genes, like SOAPdenovo-S, the inflection point is also sharper, in this case due to the absence of fragmented, highly expressed genes. On the other end of the spectrum, CLC-S assemblies reconstruct highly expressed genes as a fragmented collection of unigenes, resulting in the more gradual detection of unique, though highly expressed, unigenes.

Knowing that de novo assembly can fail at extremes of expression, one can expect that the number of unigenes will exceed the number of expected transcripts due to fragmentation (Type I error). Trinity-ICB produced roughly 150% of the expected transcript number, which, compared to assemblies with unigene counts closer to the expected number of transcripts, may signal an assembly failure via high Type I error rates. Yet the reference based analysis shows that, for instance, Oases-VO lacks highly expressed genes, while, conversely, NG IT-S lacks moderate and lowly expressed genes. So, paradoxically, superior assemblies will have more unigenes than transcripts since the dynamic range of gene expression poses assembly challenges at both extremes. The changes in Type I error rates between Trinity-ICB assemblies of BR1 (4.2Gbp) and BR12 (8.4Gbp) were small and were seen with a concurrent increase in Type II error rates. This indicates that 4.2Gbp was sufficient to reconstruct a majority of the transcriptome and, in terms of reliable assembly, has reached a point of diminishing returns. Importantly, the majority of the transcriptome consists of moderately expressed genes. For this collection of genes (see the dense region of the de novo assembly plots in Figure 2.7 and in Figure 2.17) the sequencing level of BR1 (4.2Gbp) seems to just exceed the required read depth for assembly (see Figure 2.7), which is indicated by the inflection point in the plot of high quality unigene accumulation. A key difference that separates the superior assemblies is the limited assembly failure at high read depths displayed by Trinity, CLCscaf and SOAPtrans. The robust assemblers still performed well when the input data volume was increased 2-fold (from 4.2Gbp to 8.4Gbp), even though the increase had nominal effect on assembly quality (Appendix A, Appendix B).

A strategy that has been implemented to evaluate de novo assembly is the detection of broadly expressed, highly conserved, single copy genes26. A scan for genes expected to be present in the transcriptome (i.e. ultra conserved orthologs – UCOs) can inform assembly efficiency and completeness. In our comparison a strong indicator of assembly quality was UCO coverage by leading assemblies, which tagged >70% of the UCOs with unigenes covering >90% of the reference cDNA. As an exclusive indicator, UCO coverage was not ideal since assemblies that fail at high read counts may still assemble a majority of UCOs, as these genes tend to be moderately expressed; this is the case with Oases-VO.

Are Strategies for Sequencing and Assembly Improvement Warranted?

Post-processing of assemblies has been shown to be a critical step in generating superior assemblies. The Trinity and Velvet-Oases pipelines, and our collection of modular post-processing tools called SCERNA, made notable improvements to de novo assemblies in key categories while minimizing negative effects, such as increased Type II error rates. SCERNA also had minimal effects on Mosaik assemblies indicating that high quality assemblies would not be compromised. The most dramatic improvement in our analysis was in the implementation of the Iterative NextGENe assembly method139 which brought NextGENe from a last place ranking to fourth place. This approach was labor intensive, but directly addressed assembly failure for highly expressed genes. The single biggest difference in all assemblers considered here was the ability to efficiently assemble transcripts across a large dynamic expression range. The combination of sophisticated post-processing tools with efficient assemblers results in remarkably good de novo transcriptome assemblies.

A common approach to improve the detection of rare and lowly expressed transcripts is normalization. By the action of duplex specific nuclease, highly expressed transcripts are removed since they have distinct (e.g. more rapid) hybridization kinetics compared to more rare transcripts. This then increases sequencing efficiency by increasing the likelihood that more rare transcripts will be detected rather than more abundant transcripts re-sampled. Our comparison revealed that the normalized Illumina libraries had increased average transcript coverage (Figure 2.1, Figure 2.5 and Appendix B) and assembly quality (Figure 2.10 and Appendix B), indicating that abundant transcripts were reduced in frequency. However, we also observed that genes were erroneously removed during normalization. Given the cost of normalization, the loss of gene expression information, the high value of replicated RNA-Seq data and the erroneous removal of low to moderately expressed transcripts, the benefit of normalization is eclipsed by the benefits of sequenced biological replicates. Furthermore, the exceedingly high cost of the normalized 454 transcriptome, which failed to detect as many genes as even a modest Illumina data set (~4Gbp), makes the argument for normalization weaker still.

Hybrid assemblies also offer a strategy to improve assembly, namely by providing complementary data structure to correct assembly errors. Indeed, the hybrid 454 + Illumina assemblies were improved, namely by a decrease in unigene count, compared to assemblies with just Illumina data. Yet an even greater benefit to the comparable Illumina-only assembly was achieved by implementing the overlap layout assembler CAP3 rather than including 454 data. Importantly, both the CLC HYBRID-S and CLC CAP3-S assemblies were inferior to the leading assemblies, which include CLCscaf-S.

An important category where 454 should improve the assembly quality owing to longer read lengths is in the successful assembly of our set of closely related genes (CRGs). Surprisingly, the de novo assemblies generally outperformed the Arabidopsis genome based assemblies (using Mosaik) by reconstructing a greater proportion of high quality (>1.5 BS) mates of CRGs. The rate of reconstruction of CRGs in the hybrid assembly was virtually indistinguishable from the CLC CAP3-S and other leading assemblies. The reciprocal “hit/no hit” pattern seen in de novo assemblies indicates that generally only one mate of the CRG pair is recovered, and while this is in contrast to the Mosaik-S assembly where more mates are represented, it results in longer and more accurate assemblies of one of the mates of the CRG pair.

The global pattern of CRG pair expression was examined by Ks plot analysis (Figure 2.11) revealing that most gene pairs detected by read mapping (Figure 2.11 “Detected cDNAs”) were not sufficiently expressed for assembly of both mates (Figure 2.12). A stark contrast exists in the Ks values for pairs with sufficient reads for assembly of both mates as opposed to mates with reads sufficient for only one of the pair. This results in the absence of a majority of gene pairs in the transcriptome wide plot of Mosaik-S. Reassuringly, the plots for leading assemblies, like Trinity-ICB, strongly resemble those of the reference-based assembly. This suggests that more recently diverged genes (i.e. lower Ks) tend to be co-expressed while older duplicate genes (greater Ks) have evolved divergent expression patterns. The abundance of pairs with very low Ks (the negative exponential on the left side of the plot) likely results from very similar subsequences or transcript isoforms. The plot of ABySS Multi-K-S, the largest and consequently the most redundant assembly, is an extreme example of this. The post-processing tools in the Velvet-Oases and Trinity pipelines outperform SCERNA, yet the assembly which was most improved by SCERNA shows a plot which is nearly identical to the reference based assembly. Since SCERNA was optimized for improving CLC assemblies, this result suggests that post-processing with SCERNA may benefit from platform specific tuning.

de novo RNA-Seq

RNA-Seq has emerged as a powerful tool for gene expression analysis, yet the availability of a high quality reference is essential. Traditionally, a sequenced genome is used as a mapping reference, yet this limits the application of the technique to a select group of model organisms. The microarray is established as a powerful tool for gene expression analysis, yet this too is dependent upon a reference. Custom arrays are expensive and must be preceded by at least a partial reference for probe design, which is often provided by ESTs. Commercially available arrays are limited to a select group of model organisms with developed informatics resources. We have shown that by masking the RNA-Seq signal to cDNA probe binding regions we can improve the correlation coefficient with microarray data. Additionally, we have shown that high quality de novo assemblies can serve as a mapping reference for RNA-Seq, yielding Pearson’s correlation coefficients >0.85. More complete transcriptome assemblies like CLC-S yield better RNA-Seq correlations, yet the added complexity of numerous subsequences and isoforms can complicate interpretation of gene expression estimates. For de novo RNA-Seq the Trinity-ICB assembly offers a balance of completeness and concision.

Our follow-up analysis of the 12 candidates (Figure 2.17 and Appendix C) revealed that by-and-large the poorly correlated candidates represent false positives or inaccurate annotations. The poorly correlated candidates fell into 2 categories: 1) strong array signal and weak RNA-Seq signal and 2) vice versa. The follow-up analysis of category 1 candidates suggests that the array intensities represent false positives since these transcripts were virtually undetectable by qRT-PCR and three variations of RNA-Seq. The category 2 candidates all seem to have inaccurate or incomplete annotations that skew reference based estimates of transcript abundance. In contrast, the well-correlated candidates, and those at saturation on the array, produced gene expression estimates that all agreed quite well. This analysis reveals that a potentially useful way to test the accuracy of genome annotations, and the probe set on an array, is to correlate the analog array probe signal with the digital signal from RNA-Seq masked for all but the probe sequences. Even though our sample size was small, we were able to identify problematic probe sequences and potential annotation errors in the Arabidopsis genome, and to verify the robust probe set for 6 of these genes.

New Arabidopsis genes and the young leaf metagenome

For each de novo assembly the potential for discovery is appealing, since the assembly reveals patterns in the data rather than filtering for what matches a known reference sequence set. To explore the metagenome of Arabidopsis thaliana we used MEGAN138 to classify sequences that did not align to the TAIR10 cDNA set. This revealed that numerous unigenes had no hit in NCBI’s non-redundant protein sequences database (NR), were unassigned (below our alignment score threshold of bitscore 125) or were filtered for low complexity. Of the genes that were assigned to a putative source many were classified as fungi, which is not unexpected for greenhouse-grown plants. A small number were assigned to the core with the majority of these assigned to the genus Arabidopsis. We queried several Arabidopsis databases by BLASTn and detected putative transcripts for genomic loci that were not part of the TAIR10 cDNA set. These included transcripts that were predicted (ab initio) Arabidopsis mRNAs and those that did not align to the Arabidopsis genome, but did align to sequences in NR annotated as Arabidopsis in origin. A BLASTn search of the Arabidopsis thaliana EST database yielded hits to putative transcripts, and one putative new transcript produced a good alignment with a sequence in NR that was annotated as A. lyrata in origin. While some of these cases have more convergent evidence supporting their identity as cryptic A. thaliana genes than others, they do represent opportunities to enhance our understanding of arguably the best plant genome available. If the alignment score threshold is increased to 175, the number of new genes is roughly halved, but the frequency of hits outside the does decrease slightly suggesting that the false discovery rate is increased by the relaxed parameters.

Conclusion

De novo assembly of the Arabidopsis thaliana young leaf transcriptome has shown that, compared to a reference guided assembly, reconstruction of the transcriptome is reassuringly good. The leading assemblies were Trinity-ICB, CLCscaf-S and SOAPtrans-S. The reference based transcriptome assembly metrics revealed that the dynamic range of expression poses assembly challenges at both extremes. Furthermore, superior assemblies reconstruct transcripts accurately and efficiently at low transcript coverage depth and continue to generate accurate assemblies at high coverage depth. Paradoxically, the superior assemblies routinely produced more unigenes than expected transcripts due to their ability to assemble transcripts at the extremes of expression. Except for Type I error rates, which are predominantly due to insufficient coverage of lowly expressed genes, the rates of all other types of error were reassuringly low. The reference independent metrics we examined, when considered together, can inform relative, as well as absolute, assembly quality. The characteristics of a superior transcriptome assembly include: 1) a high proportion (>65%) of mappable reads, 2) unigenes that number roughly 150% of the expected transcript count, 3) N50 ≥1200bp, and 4) ≥70% recovery of Ultra Conserved Orthologs. This combination of reference independent metrics is useful to identify superior assemblies since the obvious leader in any individual category may not represent the superior assembly. Overall, the combination of affordable, replicated Illumina sequencing with high performance assemblers and sophisticated post-processing tools offers unprecedented opportunities to explore plant genomes that are selected based upon the most evident questions rather than convenient means.
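The four reference independent criteria listed above can be applied together as a simple checklist. The sketch below is a hypothetical helper, not a tool from this study: `n50` uses the conventional length-weighted definition of N50 (an assumption, since the exact calculation is not spelled out in this section), and the `ratio_tolerance` used to interpret "roughly 150%" is likewise an assumption chosen only for illustration.

```python
def n50(lengths):
    """Conventional N50: the length L such that sequences of length >= L
    contain at least half of all assembled bases (assumed definition)."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


def check_assembly(mappable_pct, unigene_count, expected_transcripts,
                   n50_bp, uco_recovery_pct, ratio_tolerance=0.25):
    """Evaluate the four criteria; returns one boolean per criterion.

    ratio_tolerance is a hypothetical window around the 150% target.
    """
    ratio = unigene_count / expected_transcripts
    return {
        "mappable_reads": mappable_pct > 65,                   # criterion 1
        "unigene_ratio": abs(ratio - 1.5) <= ratio_tolerance,  # criterion 2
        "n50": n50_bp >= 1200,                                 # criterion 3
        "uco_recovery": uco_recovery_pct >= 70,                # criterion 4
    }


# Hypothetical inputs, for illustration only.
report = check_assembly(mappable_pct=75.0, unigene_count=30000,
                        expected_transcripts=20000, n50_bp=1500,
                        uco_recovery_pct=80.0)
print(report, "superior" if all(report.values()) else "not superior")
```

Returning one flag per criterion, rather than a single verdict, reflects the point made above that the leader in any individual category may not be the superior assembly overall.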

Figures and Tables

Assembler ID        Description                                     Reference/URL
Mosaik              Mosaik Assembler (v1.1.0014)                    http://bioinformatics.bc.edu/marthlab/Mosaik
Trinity, Inchworm   Trinity (release 03122011)                      http://trinityrnaseq.sourceforge.net/ (ref. 131)
CLC                 CLC Assembly Cell (v3.2)                        http://www.clcbio.com/
CLCscaf*            CLC Assembly Cell + Scaffolding (v4.0.6 beta)
Oases               Oases (v0.1.22)                                 http://www.ebi.ac.uk/~zerbino/oases/ (ref. 140)
Velvet              Velvet (v1.1.03)
SOAPdenovo          SOAPdenovo (v1.04)                              http://soap.genomics.org.cn/about.html (ref. 141)
SOAPtrans*          SOAPdenovo-trans (v1.01)
ABySS               Trans-ABySS (v1.3.0)                            http://www.bcgsc.ca/platform/bioinfo/software/abyss (ref. 142)
NG MO, NG IT        NextGENe (v2.17)                                http://www.softgenetics.com/NextGENe_9.html

* used to assemble Illumina biological replicate 1 only

Table 2.1 Summary of assembly software.

Library Type        Platform        Sequence Protocol   Mean Read Length   Mbp     Read #       Quality Trimmed Read #   Quality Trimmed Read %   Aligned Read #   Aligned Read %
Normalized          454 GS-FLX Ti   Single              345                210     609,205      603,120                  99.0                     417,631          69.3
Biological Rep. 1   454 GS-FLX      Single              237                24      100,923      100,854                  99.9                     82,537           81.8
Biological Rep. 2   454 GS-FLX      Single              235                21      89,994       89,937                   99.9                     73,511           81.7
Normalized          GAIIx           Paired-end          120                6,615   55,132,260   53,150,341               96.4                     51,114,108       96.2
Biological Rep. 1   GAIIx           Paired-end          76                 4,205   55,332,394   55,307,542               99.9                     49,853,958       90.1
Biological Rep. 2   GAIIx           Paired-end          76                 4,185   55,075,520   55,064,050               99.9                     49,521,064       89.9

Table 2.2 Sequencing and alignment statistics. The sequencing and alignment statistics for the normalized and non-normalized libraries sequenced for this study.

Figure 2.1 Read level coverage for each sequencing data set. Coverage of all representative TAIR10 cDNAs by reads. The darkest bar represents the number of genes in the expressed gene set that were not tagged, and each progressively lighter bar represents genes in a bin with a 5% increase in coverage, with the two lightest bars showing the number of genes covered at >90% and >99%, respectively. 454 Norm = Normalized 454FLX Ti (Eurofins Operon), 454 BR12 = 454FLX combined biological replicates 1 and 2, BR12 Norm = Normalized Illumina library pooled biological replicates 1 and 2, BR12 = combined coverage of Illumina biological replicates 1 and 2, BR1 = Illumina biological replicate 1, BR2 = Illumina biological replicate 2.

Assembly Metric                     Description
Assembled Sequence Count            Number of assembled sequences
Median Length                       Median length of assembled sequences
Maximum Length                      Maximum length of assembled sequences
N50 Length                          50th percentile unigene length
N50 Mbp                             50th percentile mega base pairs
N-content                           Number of ambiguous base pairs
RNA-Seq SCC                         Spearman’s Rank Correlation Coefficient with Mosaik assembly
RNA-Seq PCC                         Pearson’s Correlation Coefficient with Mosaik assembly
Mappable Reads                      Number of total reads that map to an assembly (also expressed as %)
Mismatch Error (%)                  Number of base call errors (% of bases called incorrectly)
Alignment Gap Rate Error            Alignment gap rate - missing bases (% of assembled sequences with alignment gap errors)
Type I Coverage Gap Error, Case I   Number of internal coverage gaps (noncontiguous assembly)
Type I ISO Error, Case II           Insufficient Overlap - number of instances of non-assembly with alignment overlap ≤(kmer – 1)
Type I Conflict Error, Case III     Conflict Overlap - number of instances of non-assembly with alignment overlap >(kmer – 1)
Type II Error                       Ambiguity error - assembled sequence has “best hits” to multiple genes (Cases I-V see Figure 2.4)
No Hit                              Number of “detected genes” not represented by an assembled sequence
Coverage                            Percentage of cDNA covered by an assembly, not limited to a single assembled sequence, cumulative
Normalized Bit Score (BS)           Normalized Quality Metric - Bit score normalized for sequence length, indicator of long and accurate assembled sequences; penalties: mismatch -2, gap -5
Sequenced Fragment/bp (SFB)         Normalized Sequencing Depth Metric - The number of sequenced fragments (orphans or pairs) that map to an assembled sequence, normalized for sequence length

Table 2.3 Summary of assembly quality metrics.

Label             Description
Mosaik            Mosaik primary assembly
Mosaik-S          Mosaik assembly with SCERNA post-processing
Inchworm          Trinity intermediate output (i.e. Inchworm)
Inchworm-S        Trinity intermediate output with SCERNA
Trinity-ICB       Trinity with native clustering
CLC               CLC primary assembly
CLC-S             CLC assembly with SCERNA post-processing
CLCscaf           CLC + scaffolding primary assembly
CLCscaf-S         CLC + scaffolding assembly with SCERNA post-processing
Velvet            Oases intermediate output (i.e. Velvet)
Velvet-S          Oases intermediate output with SCERNA
Oases-VO          Oases with native clustering
SOAPdenovo        SOAPdenovo primary assembly
SOAPdenovo-S      SOAPdenovo assembly with SCERNA post-processing
SOAPtrans         SOAPdenovo-Trans primary assembly
SOAPtrans-S       SOAPdenovo-Trans assembly with SCERNA post-processing
ABySS             ABySS primary assembly
ABySS-S           ABySS assembly with SCERNA post-processing
NG IT             NextGENe Iterative primary assembly
NG IT-S           NextGENe Iterative assembly with SCERNA post-processing
NG MO             NextGENe Maximum Overlap primary assembly
NG MO-S           NextGENe Maximum Overlap assembly with SCERNA post-processing
ABySS Multi-K     ABySS Multiple-K primary assembly
ABySS Multi-K-S   ABySS Multiple-K assembly with SCERNA post-processing
CLC HYBRID        CLC 454 (non-normalized) + Illumina Hybrid primary assembly
CLC HYBRID-S      CLC 454 (non-normalized) + Illumina Hybrid assembly with SCERNA
CLC CAP3          CLC post-processing CAP3 primary assembly
CLC CAP3-S        CLC CAP3 assembly with SCERNA post-processing
BR1               Illumina paired-end biological replicate one data set
BR2               Illumina paired-end biological replicate two data set
BR12              Combined Illumina paired-end biological replicates one and two
NORM              Normalized Illumina data set (pooled biological replicates one and two)
UCO               Ultra Conserved Ortholog (http://compgenomics.ucdavis.edu/)
CRG               Closely Related Gene set (300 pairs with Ks<0.2, SFB >0.1)

Table 2.4 List of suffixes and abbreviations.

Assembly       Assembled Sequences Count   Median Length   Max. Length   N50 Length   N50 Mbp   N-content    RNA-Seq SCC   RNA-Seq PCC   % Reads Mapped

Primary Assemblies
Mosaik         20,930                      1,326           16,339        1,838        15.57     1,771,804*   1             1             71.7
Inchworm       51,896                      435             15,057        1,622        21.83     0            0.95          0.93          81.0
CLC            161,183                     135             15,057        578          22.04     7,022        0.96          0.95          65.0
CLCscaf        42,265                      313             12,532        1,421        14.73     47,268       0.95          0.93          68.7
Velvet         37,357                      629             13,061        1,582        17.85     12,748       0.93          0.91          35.4
SOAPdenovo     44,537                      289             15,247        1,322        14.09     304,230      0.87          0.87          24.9
SOAPtrans      37,469                      426             21,355        1,557        15.11     515,984      0.93          0.92          65.9
ABySS          28,362                      601             14,818        1,474        12.58     1,026        0.92          0.89          34.8
NG MO          30,880                      232             7,391         877          7.12      0            0.88          0.81          75.8
NG IT          24,137                      602             12,307        1,326        10.51     0            0.93          0.89          82.6

Post Processed Assemblies
Mosaik-S       20,178                      1,361           16,339        1,848        15.40     1,707,111*   1             1             71.0
Inchworm-S     34,812                      584             15,057        1,560        15.76     0            0.95          0.93          80.2
Trinity-ICB    30,086                      543             15,057        1,552        13.25     0            0.90          0.87          75.5
CLC-S          81,734                      153             15,057        1,114        16.63     179          0.96          0.94          66.6
CLCscaf-S      32,290                      509             12,532        1,511        13.70     1,809        0.95          0.93          68.5
Velvet-S       26,707                      729             13,061        1,561        13.40     147          0.91          0.88          34.1
Oases-VO       22,503                      722             13,061        1,579        11.36     122          0.82          0.78          28.1
SOAPdenovo-S   32,515                      472             15,175        1,427        12.86     10,040       0.84          0.81          25.4
SOAPtrans-S    27,260                      662             21,355        1,622        13.23     402,496      0.91          0.89          63.7
ABySS-S        24,930                      717             14,818        1,507        12.10     15           0.91          0.87          34.1
NG MO-S        27,971                      255             7,406         912          6.85      0            0.88          0.81          73.5
NG IT-S        22,783                      640             12,307        1,344        10.24     0            0.93          0.89          82.0

Table 2.5 Assembled sequence and RNA-Seq statistics for assemblies of Illumina biological replicate 1 (BR1). Unigenes shorter than 100bp were removed. N-content refers to the number of bases in assembled sequences that were ambiguous. Reads were mapped to the respective assemblies and percentages were calculated by dividing by the total number of reads from BR1 that mapped to TAIR10 cDNAs. RNA-Seq correlations used respective Mosaik assemblies as a reference. SCC = Spearman’s rank correlation coefficient and PCC = Pearson’s correlation coefficient. *includes gaps in cDNA reference coverage.


Figure 2.2 SCERNA flowchart. SCERNA stands for Scaffolding and Error correction for de novo assemblies of RNA-Seq data. This collection of post-processing tools allows flexible implementation at various steps post assembly and with multiple assemblers and data types.

Figure 2.3 As minimum sequence length cutoffs are imposed, the assembly landscape becomes more even. The effect on the N50 of assembled sequence length and N50 of Mbp of assembled sequence resulting from sequence length cutoffs (imposed at 100-600bp) for the post-processed assemblies of Illumina biological replicate 1.

Figure 2.4 Summary diagram of assembly error types. Type I error reports cases of incomplete assembly where a given transcript is not assembled into a single sequence (Case I = gap, Case II = insufficient overlap). Type I error can also consist of failure to bring contigs together despite sufficient overlap (Case III), presumably due to conflict. Type II error reports cases where portions of assembled sequences have good alignments to >1 TAIR10 cDNAs. Case I reports possible chimerism with Cases II-V reporting ambiguity in annotation.
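The three Type I cases can be told apart from the alignment positions of two neighboring assembled sequences on the same reference cDNA, using the (kmer – 1) overlap threshold given in Table 2.3. The following is a minimal sketch under stated assumptions: coordinates are 1-based and inclusive, exactly abutting sequences (overlap of zero) are treated as Case II, and the function name and example k-mer size are hypothetical.

```python
def classify_type1(first, second, kmer):
    """Classify a pair of assembled sequences aligned to one cDNA.

    first, second: (start, end) alignment positions, 1-based inclusive.
    kmer: k-mer length used by the assembler.
    Returns "Case I", "Case II" or "Case III".
    """
    # Positive values are shared bases; negative values are uncovered bases.
    overlap = min(first[1], second[1]) - max(first[0], second[0]) + 1
    if overlap < 0:
        return "Case I"    # internal coverage gap
    if overlap <= kmer - 1:
        return "Case II"   # insufficient overlap for assembly
    return "Case III"      # enough overlap, yet not joined (conflict)


# Hypothetical alignments on one cDNA, with an assumed k-mer size of 25.
print(classify_type1((1, 100), (110, 200), kmer=25))
print(classify_type1((1, 100), (91, 200), kmer=25))
print(classify_type1((1, 100), (51, 200), kmer=25))
```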

[Figure 2.5 panels: BR1, BR2, BR12, NORM; x-axis: % coverage; y-axis: gene #]

Figure 2.5 The coverage of Arabidopsis cDNAs shows a subtle gradation of assembly completeness. Assembled sequences were aligned to TAIR10 cDNAs to determine coverage, which was expressed as the percent of cDNA bases covered by assembled sequence. The darkest bar is 0% or “No Hit” and each progressively lighter bar is a bin containing genes covered in 10% increments, with the last two bars representing the number of genes covered at >90% and >99%, respectively.
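The coverage calculation behind this figure (percent of cDNA bases covered, cumulative over one or more assembled sequences) can be sketched as an interval union followed by binning. This is an illustrative sketch only: the 1-based inclusive coordinates, the exact bin edges and the function names are assumptions, not details taken from the analysis pipeline.

```python
import math


def percent_covered(cdna_length, intervals):
    """Percent of cDNA bases covered by the union of alignment intervals.

    intervals: iterable of (start, end), 1-based inclusive, possibly
    overlapping and coming from several assembled sequences.
    """
    covered = 0
    current_end = 0  # right-most base counted so far
    for start, end in sorted(intervals):
        start = max(start, 1)
        end = min(end, cdna_length)
        if end < start:
            continue
        if start > current_end:
            covered += end - start + 1      # disjoint from earlier intervals
            current_end = end
        elif end > current_end:
            covered += end - current_end    # count only the new bases
            current_end = end
    return 100.0 * covered / cdna_length


def coverage_bin(pct):
    """Assign a coverage percentage to a bin label (assumed bin edges)."""
    if pct == 0:
        return "No Hit"
    if pct > 99:
        return ">99%"
    if pct > 90:
        return ">90%"
    upper = math.ceil(pct / 10) * 10
    return f"{upper - 10}-{upper}%"


# Hypothetical example: two overlapping unigenes on a 1,000 bp cDNA.
pct = percent_covered(1000, [(1, 400), (301, 600)])
print(pct, coverage_bin(pct))
```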

Figure 2.6 Post-processed assembly delta plot, showing the effect of post-processing in several assembly quality categories (x-axis labels). The histogram shows the magnitude (% change) and direction of change in each category for the Mosaik-S, CLC-S and Trinity-ICB assemblies of Illumina biological replicate 1. The post-processed values in each category are printed above the x-axis for each assembly. A vertical line separates categories where an assembly improvement would result in a decrease in the respective measure (“Expect ∆<0” categories, left of line) or an increase in the respective measure (“Expect ∆>0” categories, right of line).

Figure 2.7 The quality of assembled sequences as a function of sequencing depth for Illumina biological replicate 1. The units for “Assembly Quality” are Normalized Bit Score (BS, maximum of 2) and the units of “Sequence Depth” are Sequenced Fragments/bp (SFB). The number printed in the plot area is the number of assembled sequences with normalized Bit Score above 1.5. A BS of 1.5 is an arbitrary threshold, yet represents long and accurate assemblies, and is used to illustrate the difference in the high-density region seen in most plots near BS 1.75-2.

Figure 2.8 Alignment comparison of CLC and Trinity (Inchworm) unigenes representing the AT1G31330.1 transcript. The sequence order from the top is gDNA (with a single intron, colored gray), cDNA, CDS and unigene(s). A) Alignment of AT1G31330 reference sequences and the Inchworm BR1 unigenes (x2) annotated as AT1G31330. B) Alignment of AT1G31330 reference sequences and the Trinity-ICB BR1 unigene sequence annotated as AT1G31330. C) Alignment of AT1G31330 reference sequences and the CLC BR1 unigenes (x623) annotated as AT1G31330. D) Alignment of AT1G31330 reference sequences and the CLC-S BR1 unigenes (x96) annotated as AT1G31330. For this highly expressed gene, Trinity is able to distill extensive variation into a single perfect unigene, whereas subsequences with minor differences (often single nucleotides) are maintained as numerous subsequences in CLC primary and post-processed assemblies, including unigenes that seem to contain introns (middle portion of C & D). This alignment illustrates the trade-off of highly representative assemblies vs highly contiguous assemblies.

Figure 2.9 Normalization does not preferentially remove closely related gene pairs. Scatter plot of read counts to the expressed gene set of the Illumina biological replicates 1 & 2 and the normalized Illumina data set. The log2 of read counts +1 (to avoid taking the log of zero) for each gene was calculated for the Illumina Normalized data set and the Combined (BR12) data set. The “detected gene set” is plotted in gray. The Ultra Conserved Orthologs (UCO, http://compgenomics.ucdavis.edu/) are plotted in green. The closely related gene set (CRG) is plotted in red.


Figure 2.10 Normalization improves the recovery rate of highly expressed genes. The units for “Assembly Quality” are Normalized Bit Score (BS) and the units of “Sequence Depth” are Sequenced Fragments/bp (SFB). The number printed in the plot area is the number of assembled sequences with normalized Bit Score above 1.5. A BS of 1.5 is an arbitrary threshold and is used to illustrate the differences in the high density region seen in most plots near BS 1.75-2.

Figure 2.11 A) The plot of Ks frequencies reveals that closely related genes in the “detected gene set” are not efficiently recovered in reference based or de novo transcriptome assembly. Gene pairs were identified by a reciprocal best BLAST hit. Gene number is on the y-axis, Ks value of pairs is on the x-axis. Equivalent best-fit model components are identified by similar color. “TAIR10” pairs were identified from the comprehensive cDNA collection. The “Detected cDNAs” pairs were identified from the detected gene set (at least one tag from any sequencing data set). The remaining plots are of pairs identified from the indicated assemblies (less unigenes <300bp).
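Identifying gene pairs by reciprocal best BLAST hit can be sketched as follows. The input is assumed to be a list of (query, subject, bit score) records from an all-against-all search; the gene names are hypothetical, self hits are ignored, and ties are resolved in favor of the first record seen, which is a simplification.

```python
def best_hits(hits):
    """Map each query to its highest-scoring non-self subject."""
    best = {}
    for query, subject, bitscore in hits:
        if query == subject:
            continue  # ignore self hits
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _score) in best.items()}


def reciprocal_best_pairs(hits):
    """Return the set of pairs (a, b) where a and b are each other's best hit."""
    best = best_hits(hits)
    pairs = set()
    for a, b in best.items():
        if best.get(b) == a:
            pairs.add(tuple(sorted((a, b))))
    return pairs


# Hypothetical all-against-all results: (query, subject, bit score).
example_hits = [
    ("geneA", "geneB", 500), ("geneA", "geneC", 300),
    ("geneB", "geneA", 480),
    ("geneC", "geneA", 290), ("geneC", "geneD", 100),
    ("geneD", "geneC", 120),
]
print(reciprocal_best_pairs(example_hits))
```

In this toy input only geneA and geneB are mutual best hits; geneC prefers geneA, so the geneC/geneD relationship is one-directional and is not reported.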

Figure 2.11 B) The recovery of sufficiently expressed gene pairs (see A) follows a pronounced hit/no hit pattern with de novo assemblies outperforming the reference based assembly. Bit Score (BS) frequency histogram and assembly summary table for Expressed Gene Pairs (EGPs). 473 gene pairs present in the Ks plot of the “Expressed Genes” were absent in the Ks plot of the Mosaik assembly of BR1. For each assembly of BR1 the BS of each mate (946 genes) was plotted. Below the plot is a summary table of the fate of the 473 gene pairs absent in the Mosaik assembly relative to the “Expressed Genes” list.

[Figure 2.12 plot. y-axis: number of gene pairs (0-500); x-axis: Ks (0-2); series: All gene pairs, Both mates expressed, Only one mate expressed]

Figure 2.12 Between gene pairs with expression sufficient for assembly of only one mate of the pair, the Ks value is generally higher than for all pairs in the Ks analysis. The frequency of pairs with increasing Ks values was plotted, revealing that pairs with lower Ks values were more likely to have expression sufficient (BS >0.1) for assembly of both mates, yet pairs with higher Ks values were more likely to have one mate with reads insufficient for assembly. This may indicate that the expression pattern of closely related genes becomes divergent as the gene sequences also diverge.

Figure 2.13 Read titration analysis. Reads were mapped to post-processed assemblies of biological replicate 1 (BR1). The number of reads (x-axis) that map to an assembly is an indicator of assembly completeness and quality, while the incidence of new tags (y-axis) indicates how completely the assembly reflects the diversity present in the read data.
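The idea of a read titration curve can be illustrated with a short sketch: reads are added in fixed increments and, for each increment, the number of assembled sequences tagged for the first time is recorded. This is one plausible reading of the analysis rather than the procedure actually used; the function name, the increment size and the input format (one assembled-sequence identifier per mapped read, in sampling order) are all assumptions.

```python
def titration_curve(mapped_targets, step):
    """Count newly tagged assembled sequences per increment of mapped reads.

    mapped_targets: list with one target identifier per mapped read,
                    in the order the reads are sampled.
    step: number of reads per increment.
    Returns a list of (reads_sampled_so_far, new_tags_in_this_increment).
    """
    if step <= 0:
        raise ValueError("step must be positive")
    seen = set()
    curve = []
    new_in_step = 0
    total = len(mapped_targets)
    for i, target in enumerate(mapped_targets, start=1):
        if target not in seen:
            seen.add(target)
            new_in_step += 1
        # Close the increment at each multiple of step and at the final read.
        if i % step == 0 or i == total:
            curve.append((i, new_in_step))
            new_in_step = 0
    return curve


# Hypothetical mapping results for six reads.
example = ["unigene1", "unigene1", "unigene2", "unigene3", "unigene3", "unigene3"]
print(titration_curve(example, step=2))
```

A curve whose new-tag counts fall toward zero as reads accumulate suggests that additional sequencing would tag few additional assembled sequences.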

[Figure 2.14 plot. y-axis: gene # (0-400); legend: % coverage bins (0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 99, 100); x-axis, assemblies: CLC-S, NG IT-S, Mosaik-S, Velvet-S, ABySS-S, Trinity-ICB, CLCscaf-S, Oases-VO, NG MO-S, Inchworm-S, SOAPtrans-S, SOAPdenovo-S]

Figure 2.14 Ultra-conserved orthologs (UCO) coverage in assemblies of BR1. The darkest bar is 0% or “No Hit” and each progressively lighter bar is a bin containing genes covered in 10% increments, with the last two bars representing the number of genes covered at >90% and >99%, respectively. The use of UCOs as a proxy for the transcriptome assembly helps to reveal the leading assemblies when leveraged with the proportion of mappable reads and the read titration curve analysis. When all three are considered together, Trinity-ICB, which was the clear leader in our reference-based analyses, is also selected as the leader by the reference independent criteria.

[Figure 2.15, panel A: Correlation of Array intensities and RNA-Seq (reads mapped to probes)]

B. Candidate description, expression values and Array probe position detail

Gene ID        Array intensity Log2   Illumina Reads Log2   Probe - cDNA position

Well correlated between Array and RNA-Seq (reads mapped to probes)
AT5G11530.1    7.75                   2.00                  2931   2585   2239
AT4G30935.1    10.54                  4.01                  252    1368   1253
AT3G56460.1    13.20                  6.00                  800    985    354

Robust probe set, beyond dynamic range of the array
AT1G19150.1    15.48                  7.88                  320    462    829
AT1G74470.1    15.25                  10.01                 341    1348   852
AT1G31330.1    15.82                  12.09                 476    47     241

Strong signal in Array, no signal in RNA-Seq (reads mapped to probes)
AT1G24851.1    12.57                  0                     120    127    NA
AT4G33120.1    12.32                  0                     574    120    NA
AT5G43640.1    12.62                  0                     210    287    NA

Strong signal in RNA-Seq (reads mapped to probes), proportionally weak signal in Array
AT1G24996.1    8.33                   7.74                  328    7      NA
AT3G24480.1    8.13                   7.36                  84     1121   NA
AT4G02970.1    4.73                   5.31                  708    718    NA

Figure 2.15 Summary of the 12 candidates chosen for a follow-up analysis by qRT-PCR. The arrows on the plot show the candidates that were chosen for this analysis. Many of the poorly correlated features with high RNA-Seq signal relative to the array were organellar transcripts and were excluded. The “Probe - cDNA position” columns show where on the reference cDNA the microarray probes hybridized. Generally, the poorly correlated candidates also had a poorer probe set, which may also have contributed to the aberrant signal on the array.

[Figure 2.16 plot: Candidate Expression Relative to AtActin. y-axis: fold difference (25 to -200); series: qPCR, Array, RNA-Seq - Probe, RNA-Seq - TAIR10, RNA-Seq - Trinity-ICB; x-axis, candidates: AT5G11530.1, AT4G30935.1, AT3G56460.1, AT1G19150.1, AT1G74470.1, AT1G31330.1]

Pearson's R             qPCR   Array   RNA-Seq - Probe   RNA-Seq - TAIR10   RNA-Seq - Trinity-ICB
qPCR                    1.00
Array                   0.93   1.00
RNA-Seq - Probe         0.96   0.99    1.00
RNA-Seq - TAIR10        0.96   0.98    >0.99             1.00
RNA-Seq - Trinity-ICB   0.96   0.98    >0.99             >0.99              1.00

Figure 2.16 Well correlated genes (see Figure 2.15), and genes with a robust probe set that were beyond the dynamic range of the array (see Figure 2.15), show excellent agreement by all estimates of gene expression. Fold difference in expression relative to AtActin (AT3G18780.1) was determined, with each method as appropriate, for the indicated candidates within the linear portion of the Array vs. RNA-Seq correlation (Figure 2.15). Well-correlated qRT-PCR candidates are indicated by solid black arrows (Figure 2.15). qRT-PCR candidates which extend beyond the range of the array but followed the linear trend are indicated by dashed black arrows (Figure 2.15).

Figure 2.17. Gene expression correlations. Array BR1 v BR2: correlation of background corrected, normalized array intensities for biological replicates 1 & 2. Illumina BR1 v BR2: correlation of log2 read counts (reads +1) from the non-normalized Illumina data sets generated from biological replicates 1 & 2. Array v Illumina: correlation of log2 read counts (reads +1) mapped at high stringency to the set of array probes and the average, background corrected, normalized array intensities from biological replicates 1 & 2. Mosaik v Illumina: correlation of average log2 read counts (reads +1) from biological replicates 1 & 2 mapped to the Mosaik-S assembly and the TAIR10 cDNA defined as the set of expressed genes. Trinity-S v Illumina: correlation of average log2 read counts (reads +1) from biological replicates 1 & 2 mapped to the Trinity-S assembly and the TAIR10 cDNA defined as the set of expressed genes. Trinity-S v Mosaik-S: correlation of log2 read counts (reads +1, less features that lack read counts) from biological replicate 1 mapped to the Trinity-S assembly and the Mosaik-S assembly. Pearson’s R is displayed in the upper left corner of each plot.

[Figure 2.18 plot: MEGAN - BLASTx stringency cutoff plot. y-axis: Unigene Number (log scale, 1-10000); x-axis: Bit Score Cutoff (25-250); series: Viridiplantae, Arabidopsis, Other Green plants, Not Assigned]

Figure 2.18. Metagenome classification alignment stringency (via bitscore) plot shows that a threshold alignment quality of 175 is sufficient to exclude erroneous hits while still classifying plant genes. The frequency of non-assignment is minimally increased at alignment score ≥175. The increase of non-assignment from alignment scores of 125 to 175 is minimal yet the instance of hits to plant genes is also decreased from alignment scores of 125 to 175. Depending on the desired outcome, alignment scores >125 can be used with confidence to exclude erroneous classification while classifying more plant genes.
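The effect of the bit score cutoff on classification can be explored with a simple sweep. The sketch below is not MEGAN and does not reproduce its lowest common ancestor algorithm; it only tallies, for each cutoff, how many unigenes keep their best-hit category and how many fall to “Not Assigned”. The input records, category names and function name are hypothetical.

```python
from collections import Counter


def classify_at_cutoffs(best_hit_records, cutoffs):
    """Tally categories at each bit score cutoff.

    best_hit_records: iterable of (unigene_id, category, bitscore),
                      one best hit per unigene.
    cutoffs: iterable of bit score thresholds.
    Returns {cutoff: {category: count}}.
    """
    records = list(best_hit_records)
    table = {}
    for cutoff in cutoffs:
        counts = Counter()
        for _unigene, category, bitscore in records:
            if bitscore >= cutoff:
                counts[category] += 1
            else:
                counts["Not Assigned"] += 1
        table[cutoff] = dict(counts)
    return table


# Hypothetical best hits for three unigenes.
records = [
    ("unigene1", "Arabidopsis", 300.0),
    ("unigene2", "Fungi", 150.0),
    ("unigene3", "Arabidopsis", 130.0),
]
for cutoff, counts in classify_at_cutoffs(records, range(25, 251, 25)).items():
    print(cutoff, counts)
```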


Figure 2.19 (MEGAN) metagenome analysis classification of unigenes that do not align to Arabidopsis TAIR10 cDNAs. The classification was determined for unigenes that aligned to sequences in NR with a bit score A) ≥125 and B) ≥175.

Materials and Methods

Arabidopsis growth conditions

Arabidopsis thaliana Col-0 was grown in the Penn State Biology department greenhouse (http://www.bio.psu.edu/general/greenhouse). Seeds were not stratified and were sown in a lawn at a density of approximately two seeds/cm². The plants were grown for 21-28 days in Metro-mix 360™ (SunGro) in late September through early October. Young leaves ranging from 7-12 mm in length were harvested at the petiole-blade junction with fine forceps and flash frozen in liquid N2. Tissue from two biological replicates (each sampled from four trays), grown in tandem under similar conditions, was harvested and then stored at -80°C.

Isolation of total RNA

Frozen leaf tissues were macerated in the presence of liquid N2 in an RNase free, pre-chilled mortar and pestle. Tissue was processed in ~200 mg portions using the RNAqueous™ Midi large scale phenol-free total RNA isolation kit (Ambion) according to the manufacturer’s instructions with the following exceptions: 1) lysis buffer solution was made fresh for each isolation, 2) Plant RNA Isolation Aid (Ambion) was added to each lysis buffer prep in a 1:8 ratio by volume. Total RNA was assessed on the Agilent Bioanalyzer using the RNA 6000 Nano kit (Agilent) with the Plant Total RNA assay. High quality total RNA samples (28s/18s ratio ≥1.7; RIN ≥8; A260/A280 ≥1.8) from individual biological replicates were pooled. Biological replicates were never mixed except prior to normalized library construction.
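The RNA quality thresholds used throughout these methods (28s/18s ratio ≥1.7, RIN ≥8, A260/A280 ≥1.8) can be expressed as a small filter. The class, field names and sample values below are hypothetical and serve only to make the acceptance rule explicit.

```python
from dataclasses import dataclass


@dataclass
class RnaSample:
    """Quality measurements for one total RNA sample (hypothetical fields)."""
    name: str
    ratio_28s_18s: float
    rin: float
    a260_a280: float


def passes_qc(sample):
    """True only if all three thresholds stated in the text are met."""
    return (sample.ratio_28s_18s >= 1.7
            and sample.rin >= 8
            and sample.a260_a280 >= 1.8)


# Hypothetical measurements, for illustration only.
samples = [
    RnaSample("prep_1", ratio_28s_18s=1.9, rin=8.6, a260_a280=2.0),
    RnaSample("prep_2", ratio_28s_18s=1.5, rin=8.2, a260_a280=1.9),
]
high_quality = [s.name for s in samples if passes_qc(s)]
print(high_quality)
```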

RNA precipitation and concentration

To further purify and concentrate RNA samples, they were divided into ~100 µg portions and precipitated by adding 0.1 volumes RNase-free 3M NaOAc pH 5.2 and three volumes 100% reagent grade ethanol, followed by incubation at -80°C overnight. Precipitated total RNA was collected by centrifugation at 14,000 x g at 4°C for one hour and the supernatant was discarded. The resulting pellet was washed twice with ice-cold 100% ethanol with a five-minute spin at 14,000 x g following each wash, and the supernatant was discarded each time. The pellet was allowed to air dry for 60 seconds and was resuspended in 100 µL RNase free water. Multiple precipitated samples from each biological replicate were pooled by replicate and mixed thoroughly before being stored at -80°C.

DNase treatment

Total RNA was treated with 2U amplification grade DNase (Invitrogen) in 100 µg aliquots in a total volume of 100 µL with a final buffer concentration of 1x and 40U of RNaseOUT™ (Invitrogen) at 25°C for 30 min. RNA was isolated from the reaction using an RNeasy™ (Qiagen) mini kit following the manufacturer’s instructions, with the following exceptions: 1) 350 µL of buffer RLT was added directly to the DNase reaction, 2) RNA elution was performed in two steps using 30 µL of RNase free water each time. Following DNase treatment, total RNA was re-assessed on the Agilent Bioanalyzer using the RNA 6000 Nano kit (Agilent) with the Plant Total RNA assay.

Paired-end mRNA-Seq library construction

Arabidopsis thaliana total RNA was assessed on an Agilent Bioanalyzer using the RNA 6000 Nano Kit (Agilent) with the Plant Total RNA Nano assay. Only high quality (28s/18s ratio ≥1.7; RIN ≥8; A260/A280 ≥1.8) RNA was used to prepare Illumina Paired End mRNA libraries according to the mRNA-Seq Sample Prep Guide (Illumina, 1004898 Rev. D) following the manufacturer’s instructions. The Illumina RNA-Seq libraries were assessed on the Agilent Bioanalyzer using the DNA 1000 kit (Agilent) with the DNA 1000 assay.

Paired-end sequencing on the Illumina Genome Analyzer IIx

Sequencing of the Illumina PE mRNA-Seq libraries was done in the McCombie lab at the Cold Spring Harbor Laboratory, Cold Spring Harbor, NY USA. Each library was sequenced in one lane following a paired-end (85x85bp) sequencing cycle protocol.


Normalized Illumina paired-end mRNA-Seq library construction

20 µg of high quality (28s/18s ratio ≥1.7; RIN ≥8; A260/A280 ≥1.8), DNase treated RNA from each biological replicate was thoroughly mixed. Pooled samples were provided to Chris Pires at the University of Missouri, Columbia for cDNA synthesis (Evrogen Mint dscDNA synthesis kit), normalization (Evrogen Trimmer Kit cDNA normalization kit) and Illumina sequencing following a paired-end (85x85bp) sequencing cycle protocol.

Poly-A RNA enrichment for 454 sequencing

RNA samples were enriched for Poly-A RNA using the Poly(A) Purist™ mRNA purification kit (Ambion) according to the manufacturer’s instructions. Poly-A enriched RNA was assessed on the Agilent Bioanalyzer using the RNA 6000 Nano kit (Agilent) with the mRNA assay.

cDNA synthesis for 454 sequencing

cDNA was synthesized using the JGI cDNA synthesis protocol version 1.0 (http://my.jgi.doe.gov/general/index.html) with the following modifications: 1) multiple reactions were pooled for phenol:chloroform extractions, 2) an additional chloroform extraction was done following the phenol:chloroform extraction at step 3, 3) Roche 454 GS20 library preparation specific steps (4-7) were omitted. cDNA was evaluated on an Agilent Bioanalyzer using the DNA 7500 kit (Agilent) with the DNA 7500 assay. The expected, and observed, yield by mass of cDNA from poly-A selected RNA was 50-75%.

Roche 454 FLX cDNA library construction

High-quality cDNA showing a mean length of ~1500bp or greater was used to prepare 454 FLX libraries according to the Roche 454 GS DNA Library Preparation Kit (Roche, 04852265001) following the manufacturer’s instructions. The 454 GS FLX libraries were assessed on the Agilent Bioanalyzer using the DNA 7500 kit (Agilent) with the DNA 7500 assay.


Roche 454 FLX sequencing

Sequencing of the Roche 454 FLX cDNA libraries was done at the Penn State Genomics Core Facility - University Park, PA USA. Each library was sequenced on ¼ plate.

Eurofins MWG Operon 454 GS-FLX Titanium normalized transcriptome

20 µg of high quality (28s/18s ratio ≥1.7; RIN ≥8; A260/A280 ≥1.8), DNase treated RNA from each biological replicate was thoroughly mixed and provided to Eurofins MWG Operon. RNA quality was verified post shipment with the Shimadzu MultiNA microchip electrophoresis system (Shimadzu). Poly-A RNA was prepared from the total RNA. First-strand cDNA synthesis was primed with random hexamers. 454 adapters A and B were ligated to the 5' and 3' ends of the double stranded cDNA (ds-cDNA). cDNA was amplified with PCR (16 cycles) using a proof-reading enzyme. Normalization was carried out by one cycle of denaturation and reassociation of the cDNA. Reassociated ds-cDNA was separated from the remaining (normalized) single-stranded cDNA (ss-cDNA) by passing the mixture over a hydroxylapatite column. After hydroxylapatite chromatography, the ss-cDNA was amplified with 10 PCR cycles. Adapter ligated ds-cDNA fragments were resolved on a preparative agarose gel. Fragments ranging from 500-700bp were excised and an aliquot of the size fractionated (~500-700bp) cDNA was analyzed by capillary electrophoresis with the Shimadzu MultiNA microchip electrophoresis system (Shimadzu). The library was sequenced on ½ plate.

Microarray hybridization and analysis

Arabidopsis thaliana total RNA was provided to the Penn State Genomics Core Facility - University Park, PA. Total RNA samples were amplified, labeled and hybridized to an Arabidopsis thaliana 4x72K Array (A4511001-00-01) (protocol: http://www.huck.psu.edu/facilities/genomics-core-up/faq/nimblegen-microarrays-procedure). The design consisted of biological replicates 1 and 2, with a dye swap, on a 4-plex array slide. An identical technical replicate was prepared on a second array. Arrays were scanned using the MS 200 Microarray Scanner (NimbleGen) and the MS 200 Data Collection Software (NimbleGen) according to the NimbleGen Array User’s Guide (V3.2, NimbleGen). Background correction of the array images was done using NimbleScan (NimbleGen), and RMA normalization was done on the raw signal intensities from technical replicates, also using NimbleScan.

Log2 transformed, background-corrected and normalized signal intensities were used as estimates of gene expression in subsequent comparisons.

qRT-PCR candidate gene selection

Gene candidates were chosen based on the correlation between NimbleGen Arabidopsis thaliana 4x72K Array intensities and the average number of PE Illumina mRNA-Seq reads mapped at high stringency to NimbleGen array probe sequences (no mismatches and alignment with the full probe 60mer). Well correlated examples (n=3) that spanned the dynamic range of log transformed read counts from Illumina were chosen to corroborate the estimates of expression reported by the array and RNA-Seq. Features that extend beyond the range of the array (n=3) but should follow the well-correlated trend were chosen as the other 3 “well correlated” candidates. Poorly correlated candidates (n=6, with n=3 high in array, low in RNA-Seq, and n=3 vice versa) were also chosen for interrogation by qRT-PCR. These candidates were selected by choosing the most poorly correlated estimates of gene expression that met the following criteria: 1) the gene model was not obsolete in subsequent iterations of the Arabidopsis genome (TAIR6 compared to the then current TAIR9¹⁴³), 2) the complete probe set matched the then current TAIR9 gene model perfectly, 3) the gene was a nuclear, protein-coding transcript, 4) the candidate fell within a log2 array intensity of 4 to 13, and 5) the then current TAIR9 annotation did not include known splice variants.

qRT-PCR

Arabidopsis thaliana total RNA was provided to the Penn State Genomics Core Facility - University Park, PA. Primers and probes for the 12 candidates were designed using Primer Express (v2.0, Applied Biosystems). For qRT-PCR analysis, high quality DNase treated RNA was reverse-transcribed with the High Capacity cDNA Reverse Transcription kit (Applied Biosystems) following the manufacturer’s instructions. Relative quantification by real-time PCR was performed by adding 10 or 20 ng of cDNA to 2X TaqMan Universal PCR Master Mix (Applied Biosystems) in a volume of 20 µl. Primers were added at a concentration of 400 nM and the TaqMan probe, labeled with a 5' FAM and a 3' Black Hole Quencher (Biosearch Tech, Novato, CA), at 200 nM. The amplification protocol consisted of 10 min at 95°C, followed by 40 cycles of 15 sec at 95°C and one min at 60°C in the 7300 Real-Time PCR System (Foster City, CA).

qPCR data analysis

Average crossing point (Ct) values for each gene and the PCR efficiency (E), calculated as E = 10^(-1/slope)¹⁴⁴, were used in the E^Ct/E^Ct calculation¹⁴⁵ to determine the transcript abundance relative to the reference gene AtActin (AT3G18780) in both biological replicates. We made two key assumptions for this analysis: 1) the cDNA population accurately represents the poly-A RNA population from which it was made (a standard assumption for microarray and RNA-Seq) and 2) for the TaqMan assay, the number of amplicons at the crossing point is the same for each reaction.
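The efficiency-corrected calculation can be illustrated with a minimal Python sketch. It is not the analysis code used in this work; the function names are hypothetical, and the orientation of the ratio (reference in the numerator) is an assumption that follows from the second assumption stated above, namely that N0 × E^Ct is the same for every reaction at the crossing point.

```python
def pcr_efficiency(slope):
    """Amplification efficiency from the slope of a standard curve
    (Ct plotted against log10 of template input): E = 10^(-1/slope)."""
    return 10 ** (-1.0 / slope)


def relative_abundance(e_target, ct_target, e_ref, ct_ref):
    """Starting abundance of the target relative to the reference gene.

    Assumption: every reaction holds the same number of amplicons at its
    crossing point, so N0_target * E_target**Ct_target equals
    N0_ref * E_ref**Ct_ref, and the ratio N0_target / N0_ref follows.
    """
    return (e_ref ** ct_ref) / (e_target ** ct_target)


if __name__ == "__main__":
    # Illustrative values only: a slope near -3.32 gives E close to 2.
    e = pcr_efficiency(-3.32)
    print(round(e, 3))
    print(relative_abundance(e, 24.0, e, 20.0))
```

With equal efficiencies, a target that crosses the threshold two cycles later than the reference is estimated at roughly one quarter of the reference abundance.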

Informatics

The Arabidopsis thaliana young leaf transcriptome normalized and non-normalized libraries were sequenced with both Roche’s 454 GS-FLX and Illumina’s GAIIx. 454 sequence fragments were clipped with version 0.2.8 of sff_extract (http://bioinf.comav.upv.es/sff_extract) at the clipping points recommended by the 454 software. Low-quality bases (50% original length with each remaining base having >Q20. SnoWhite¹⁴⁶ (version 1.1.4), a sequence-cleaning pipeline, was used to remove normalization adapter (sub-)sequences from the normalized read files. SnoWhite was originally implemented for 454 data and requires sequences in fasta format (and optionally base qualities in base-quality fasta format). Since Illumina sequences are typically provided in fastq format, Biopython¹⁴⁷ was used to convert the Illumina-fastq to standard-fastq, and then the resultant standard-fastq to fasta and base-quality fasta format before clipping adapters. After running SnoWhite, the read files were converted back to fastq format using Biopython, paired-end read files were reconstructed and single-end read files were written (for orphaned reads) using a custom script.
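The format conversions described above were done with Biopython. The sketch below shows, using only the Python standard library, what those conversions amount to for a single record. It assumes that "Illumina-fastq" means Phred+64 quality encoding and "standard-fastq" means Phred+33; the function names are hypothetical and this is not the conversion script used here.

```python
def illumina_to_standard_quality(qual_line):
    """Re-encode a quality string from Phred+64 to Phred+33.
    Assumption: the input uses the Phred+64 (Illumina 1.3+) offset."""
    return "".join(chr(ord(c) - 64 + 33) for c in qual_line)


def fastq_record_to_fasta_and_qual(header, seq, qual_line, offset=33):
    """Split one standard-fastq record into a fasta record and a
    base-quality fasta record (space-separated integer scores)."""
    name = header[1:] if header.startswith("@") else header
    fasta = ">{}\n{}\n".format(name, seq)
    scores = " ".join(str(ord(c) - offset) for c in qual_line)
    qual = ">{}\n{}\n".format(name, scores)
    return fasta, qual


if __name__ == "__main__":
    std_qual = illumina_to_standard_quality("hhhh")  # Q40 in Phred+64
    fasta, qual = fastq_record_to_fasta_and_qual("@read1", "ACGT", std_qual)
    print(fasta, qual, sep="")
```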

Reference Read Mapping

A reference-based assembly was created for detailed comparison with the de novo assemblies (below). All reads were aligned to TAIR10¹⁴⁸ cDNA representative gene models (sequences for the longest CDS at each locus) using version 1.1.0014 of Mosaik Assembler (http://bioinformatics.bc.edu/marthlab/Mosaik) with the recommended aligner settings. Mosaik alignments were converted to bam format, and the CoverageBed program in Bedtools¹⁴⁹ was used to compute the depth and breadth of coverage and to determine the number of TAIR10 gene sequences tagged. 25,512 TAIR10 gene models were tagged by at least one read in any of the data sets and are hereafter referred to as the “detected genes” in Arabidopsis thaliana young leaf.

Transcriptome assembly

The trimming of low-quality bases and adapter clipping can confound insert size estimation by read mapping. To estimate the insert sizes required for subsequent analysis steps, test assemblies were run for all three Illumina paired-end read libraries (BR1, BR2 and NORM) using CLC Assembly Cell (version 3.2). Then the paired reads from each Illumina library were uniquely mapped to that library’s large contigs (≥600 bp) using version 0.12.7 of Bowtie¹⁵⁰. This enabled estimation of the average insert size for each library. The insert sizes for the biological replicate one (BR1), biological replicate two (BR2), and normalized (NORM) libraries were 134 bases (sd = 15.06), 140 bases (sd = 18.15), and 217 bases (sd = 19.45), respectively.
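As a minimal illustration of this estimate, the following Python sketch computes the mean and standard deviation of insert sizes from read pairs mapped to large contigs. The input layout (contig name, leftmost and rightmost mapped coordinate of each pair) and the use of the sample standard deviation are assumptions made for the example, not details of the original analysis.

```python
import statistics


def insert_size_stats(pairs, contig_lengths, min_contig=600):
    """Mean and sample standard deviation of insert sizes.

    pairs: iterable of (contig, left, right), where left/right are the
           outermost mapped coordinates of a read pair (hypothetical layout).
    contig_lengths: dict mapping contig name to its length in bases.
    Only pairs on contigs of at least min_contig bases are used.
    """
    sizes = [right - left
             for contig, left, right in pairs
             if contig_lengths[contig] >= min_contig]
    return statistics.mean(sizes), statistics.stdev(sizes)


if __name__ == "__main__":
    lengths = {"contig_1": 700, "contig_2": 300}
    pairs = [("contig_1", 0, 130), ("contig_1", 10, 150), ("contig_2", 0, 250)]
    print(insert_size_stats(pairs, lengths))
```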

Reference assembly

Mosaik Assembler was used to resolve paired-end read alignments, filter out duplicate alignments, and assemble reference sequences from previously created Mosaik reference read alignments. Reference scaffolds were created from all Mosaik reference assembly ACE files for each assembly by clipping flanking reference bases of the consensus sequence, substituting intervening reference bases of the consensus sequence with nucleotide ambiguity code N, and filtering out sequences shorter than 100 bases (not including scaffolding Ns). Reference primary assembly statistics are shown in Appendix A.

de novo assembly

The following de novo assembly tools were used and evaluated in this study: ABySS¹⁴² (version 1.3.0), CLC Assembly Cell¹³² (version 3.2) & CLC Assembly Cell with Scaffolding (version 4.0.6 beta), Oases³⁴ (version 0.1.22) with Velvet¹⁴⁰ (version 1.1.03), SOAPdenovo¹⁵¹ (version 1.04) & SOAPdenovo-trans¹³³ (version 1.01), Trinity¹³¹ (release 03122011, includes Inchworm), and NextGENe¹⁵² (version 2.17). de novo assemblies (see Table 2.4 for abbreviations) were performed with default settings whenever possible, except for a minimum contig/scaffold cutoff of 100 bases, a k-value of 31 (except for the Trinity pipeline, which is only compatible with 25-mers), and scaffolding turned on whenever applicable. ABySS, Oases (Velvet), SOAPdenovo, SOAPdenovo-trans and CLC Assembly Cell + Scaffolding produced scaffolds, while CLC Assembly Cell and Trinity (Inchworm) produced contigs. All assemblies were performed using trimmed reads except for the SOAPdenovo non-normalized assemblies, which became more fragmented as a result of trimming. Trimmed normalized reads were used in the SOAPdenovo assembly because the untrimmed normalized reads were contaminated with adapter (sub-)sequences. To investigate the effect of a mixture of 454 and Illumina reads, a 454-Illumina-hybrid assembly was performed using reads from BR1 with CLC Assembly Cell. For the NextGENe Maximum Overlap (NG MO) assemblies we used NextGENe to quality filter the raw fastq data to remove reads with a median quality score of less than 22, trim reads at positions that had 3 consecutive bases with a quality score of less than 20, and remove any trimmed reads with a total length less than 40 bp. The quality-filtered data were assembled de novo using the Maximum Overlap assembler in NextGENe. The NextGENe Iterative assemblies (NG IT) were done as described in Wickett et al. 2011¹³⁹. We attempted to improve upon the BR1 hybrid 454 + Illumina CLC assembly using the overlap consensus assembler CAP3¹⁵³, with a minimum overlap length of 30 (k-value - 1) and at least 97% identity, to merge contigs with significant overlap that had not been assembled into contiguous sequences (due either to single base mismatches in the reads or path ambiguity in the graph). Illumina BR1 reads were also assembled over a range of k-values (25-74) using ABySS (ABySS Multi-K) and these assemblies were merged to remove redundancy as described in Robertson et al. 2010⁶².
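The three read-filtering rules applied before the NG MO assemblies (median quality, low-quality run trimming, minimum length) can be sketched in Python as follows. This is an illustration of the stated thresholds only, not a reimplementation of NextGENe; in particular, trimming everything from the start of the first low-quality run onward is an assumption about how "trim reads at positions" is applied.

```python
import statistics


def filter_and_trim(seq, quals, min_median=22, low_q=20, run=3, min_len=40):
    """Apply the three filtering rules to one read.

    Returns (trimmed_seq, trimmed_quals), or None if the read is discarded.
    Assumption: the read is cut at the first position where `run`
    consecutive bases all have quality below `low_q`.
    """
    # Rule 1: discard reads whose median quality is below the threshold.
    if statistics.median(quals) < min_median:
        return None
    # Rule 2: find the first run of consecutive low-quality bases.
    cut = len(seq)
    for i in range(len(quals) - run + 1):
        if all(q < low_q for q in quals[i:i + run]):
            cut = i
            break
    # Rule 3: discard reads that are too short after trimming.
    if cut < min_len:
        return None
    return seq[:cut], quals[:cut]


if __name__ == "__main__":
    read = "A" * 50
    quals = [30] * 45 + [10] * 5
    print(filter_and_trim(read, quals))
```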

Assembly post-processing

We developed a suite of post-processing tools we call SCERNA (Scaffolding Clustering Error correction of RNAseq data). These assembly post-processing tools address specific aspects of an assembly with the goal of generating high confidence sequences for downstream analysis. de novo assemblies had different entry points into the post-processing pipeline depending upon the output of each assembler (Figure 2.2). Paired-end information from Illumina reads was used to scaffold and extend CLC Assembly Cell and NextGENe contigs with SSPACE¹⁵⁴ (version 1.0) because these two assemblers do not have built-in scaffolding functions and thus tend to produce more fragmented sequences. A minimum of 5 links (read pairs) was required to join two contigs into a scaffold, and at least 30 bases of overlap (k-value - 1) and a minimum of 20 reads were required for contig extension. Similar parameters were set for the assembly programs with built-in scaffolding capabilities, whenever possible. In order to bridge some of the gaps introduced in transcripts during scaffolding, paired-end reads were used with SOAPdenovo’s GapCloser program (overlap length set at 30 bases (k-value - 1)) to either close or reduce the size of gaps in all de novo assemblies except for the gapless transcripts produced by Trinity. ESTScan 2.0¹⁵⁵, using HMM models built with A. thaliana, was used to identify translatable sequences. USEARCH¹⁵⁶ (version 4.0) was used to de-replicate (create non-redundant) de novo assemblies before the terminal clustering steps of SCERNA. Using UCLUST, a global alignment-clustering algorithm of USEARCH, scaffolds (or contigs where applicable) were clustered at 97% identity for each de novo assembly into non-redundant sets. The longest sequence in each cluster was selected as the best representative for a given putative locus in the final post-processed assembly. Additionally, the longest translatable sequences-per-cluster were selected from non-redundant natively clustered Oases (Oases-VO) and Trinity (Trinity-ICB) assemblies for comparison with assemblies clustered with UCLUST.
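The final selection step, taking the longest sequence in each cluster as its representative, is simple enough to show directly. The sketch below assumes cluster membership has already been parsed into a dictionary; the data layout and function name are hypothetical and do not reflect the UCLUST output format or the SCERNA code.

```python
def cluster_representatives(clusters, sequences):
    """Pick the longest member of each cluster.

    clusters: dict mapping cluster id to a list of sequence ids.
    sequences: dict mapping sequence id to its nucleotide sequence.
    Ties are resolved in favor of the member listed first.
    """
    return {
        cluster_id: max(members, key=lambda seq_id: len(sequences[seq_id]))
        for cluster_id, members in clusters.items()
    }


if __name__ == "__main__":
    seqs = {"s1": "ACGT", "s2": "ACGTACGT", "s3": "AC"}
    clusters = {"cluster_0": ["s1", "s2"], "cluster_1": ["s3"]}
    print(cluster_representatives(clusters, seqs))
```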

Completeness of Coverage

A commonly used criterion to assess the optimality of a de novo assembled transcriptome of a species with previously determined gene models is how well it recapitulates the models¹². Using BLASTn¹⁵⁷, with an e-value threshold of 1e-10, each assembly was aligned (using cluster representatives) to the set of 25,512 detected genes as determined by read mapping (described above). Only the most significant locus (lowest e-value) for each assembled sequence was recorded (with the exception of alignments for Type II error rate evaluation, where the two most significant loci were recorded). Using custom scripts, the alignments were parsed to compute the breadth of sequence coverage, mismatch and gap-opening rates, and Type I and Type II error rates across gene models¹⁵⁸. For each reference gene model, the breadth of sequence coverage was determined by computing the percentage of bases covered by assembled sequences from each assembly. The mismatch and gap-opening rates for each assembly were determined by computing the ratio of mismatches to total aligned bases and the ratio of gap-openings to total aligned bases, respectively.
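The three quantities defined above can be computed from parsed BLASTn tabular alignments along the following lines. This is a sketch under stated assumptions, not the custom scripts used in this work: each alignment is represented as a dictionary whose keys borrow the BLAST tabular column names (length, mismatch, gapopen, sstart, send), and subject coordinates are taken to be 1-based and inclusive.

```python
def mismatch_and_gap_rates(hsps):
    """Ratio of mismatches and of gap openings to total aligned bases."""
    total = sum(h["length"] for h in hsps)
    mismatches = sum(h["mismatch"] for h in hsps)
    gap_opens = sum(h["gapopen"] for h in hsps)
    return mismatches / total, gap_opens / total


def breadth_of_coverage(gene_length, hsps):
    """Percentage of gene-model bases covered by at least one alignment.
    Overlapping alignments are merged so that no base is counted twice."""
    intervals = sorted(
        (min(h["sstart"], h["send"]), max(h["sstart"], h["send"]))
        for h in hsps
    )
    covered = 0
    current_end = 0  # last covered position so far (1-based)
    for start, end in intervals:
        start = max(start, current_end + 1)
        if end >= start:
            covered += end - start + 1
            current_end = end
    return 100.0 * covered / gene_length


if __name__ == "__main__":
    hsps = [
        {"length": 100, "mismatch": 2, "gapopen": 1, "sstart": 1, "send": 100},
        {"length": 50, "mismatch": 0, "gapopen": 0, "sstart": 81, "send": 130},
    ]
    print(mismatch_and_gap_rates(hsps))
    print(breadth_of_coverage(200, hsps))
```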

Type I error estimation

Unigenes that correspond to the same locus but fail to assemble together were classified as Type I errors. We quantified these error rates in each assembly by identifying, in the BLASTn alignments of assembled sequences for each locus, regions with no overlap (i.e. a gap between scaffolds not spanned by unigenes, Case I), regions with insufficient overlap (overlap < k-value, Case II), and regions with conflicting overlaps (≥k-value - 1, Case III) (see Figure 2.4).

Type II error estimation

Potential mis-assemblies (Type II errors) in each assembly were quantified by determining the ratio of all ambiguously aligned sequences to all reference-aligned assembled sequences. An assembled sequence was flagged as potentially mis-assembled if, in BLASTn alignments with at least 99% sequence identity, different segments of the sequence aligned to at least two non-adjacent annotated gene loci (= “genes”) (Case I), the sequence aligned equally to two genes (Case II), the sequence aligned to a gene model and a subsequence of it aligned to a different gene (Case III), its alignments to two genes overlapped over more than 80% of the sequence length (Case IV), or its alignments to two genes overlapped over at most 80% of the sequence length (Case V), as illustrated in Figure 2.4.

Assembly Read Mapping Evaluation

The proportion of reads that can map back to an assembly is a useful criterion for the quality assessment of a given de novo assembly. Bowtie was used to map Illumina paired-end reads back to their respective assemblies, retaining only one best alignment for each read. The alignments were then parsed to determine the number of reads that mapped to each assembled sequence, and subsequently used for generating read mapping results for RNA-Seq expression analysis¹³⁶ and generation of gene accumulation curves²⁶.

Read count estimation

Read counts for unigenes in an assembly were binned by locus, based on BLASTn results against TAIR10, so that the counts of all unigenes corresponding to a single locus were summed. The cumulative log-transformed read counts for each de novo assembly and the corresponding reference-based assembly for a given data set were correlated using Spearman’s rank correlation and Pearson’s R.
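A compact Python sketch of this comparison is given below: unigene counts are summed per locus, log-transformed, and compared with Pearson’s R and Spearman’s rank correlation. It uses only the standard library. The log2(count + 1) transform is an assumption (the pseudocount avoids log of zero), and all function names are hypothetical.

```python
import math


def bin_reads_by_locus(unigene_counts, unigene_to_locus):
    """Sum read counts of all unigenes assigned to the same locus.
    Unigenes without a locus assignment are skipped."""
    binned = {}
    for unigene, count in unigene_counts.items():
        locus = unigene_to_locus.get(unigene)
        if locus is not None:
            binned[locus] = binned.get(locus, 0) + count
    return binned


def log_counts(binned, loci):
    """log2(count + 1) for each locus, in the order given (assumed transform)."""
    return [math.log2(binned.get(locus, 0) + 1) for locus in loci]


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def ranks(values):
    """1-based ranks, with tied values given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation as Pearson's R on the ranks."""
    return pearson(ranks(x), ranks(y))
```

A typical use would bin the de novo and the reference-based counts separately, build both vectors over the same ordered list of loci with `log_counts`, and pass them to `pearson` and `spearman`.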

Gene accumulation curves

To generate the gene accumulation curve for a transcriptome assembly, reads were mapped back to the assembly using Bowtie¹⁵⁰, retaining only one best alignment for each read as one would typically do in an RNA-Seq experiment. The frequency distribution of mapped reads in the assembly (i.e. the number of reads per unigene) was used to calculate the rate of new gene detection by randomly sampling reads without replacement and recording the total number of unigenes detected. Data points were recorded every 1000 reads sampled, and this process continued until all reads mapping to the assembly were examined. The total number of mapped reads and assembled unigenes was recorded as the end point. This procedure was automated using a modification of the script published by Der et al. 2011²⁶.
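The sampling procedure can be illustrated with a short Python sketch. It is not the modified script of Der et al.; it simply shuffles one entry per mapped read (which is equivalent to sampling without replacement) and records the number of distinct unigenes seen at each step. The function name, the fixed random seed, and the in-memory read pool are choices made for the example.

```python
import random


def gene_accumulation_curve(reads_per_unigene, step=1000, seed=0):
    """Return (reads_sampled, unigenes_detected) points.

    reads_per_unigene: dict mapping unigene id to its mapped read count.
    A point is recorded every `step` reads; the final point is the total
    number of mapped reads and of unigenes with at least one read.
    """
    # One entry per mapped read; shuffling samples without replacement.
    pool = [u for u, n in reads_per_unigene.items() for _ in range(n)]
    random.Random(seed).shuffle(pool)

    seen = set()
    curve = []
    for i, unigene in enumerate(pool, start=1):
        seen.add(unigene)
        if i % step == 0:
            curve.append((i, len(seen)))
    if not curve or curve[-1][0] != len(pool):
        curve.append((len(pool), len(seen)))  # end point
    return curve


if __name__ == "__main__":
    counts = {"unigene_a": 1500, "unigene_b": 400, "unigene_c": 100}
    for point in gene_accumulation_curve(counts):
        print(point)
```

For assemblies with tens of millions of mapped reads, building the full pool in memory may be impractical; drawing from the count distribution incrementally would be one alternative.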

Transcript length distribution and assembly size

The length distribution of assembled transcripts and the total size of the assembly (Mbp) reconstructed in a de novo assembly are further criteria that can be used to evaluate a de novo assembly. Short assembled sequences were filtered from all assemblies in 100 bp intervals from 100 bp to 600 bp, and we then determined the N50 length and captured Mbp at each interval, for each assembly.
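The N50 and captured-Mbp calculation at each length cutoff can be sketched as follows. The sketch assumes that filtering at a cutoff keeps sequences at least that long, and it uses the common definition of N50 (the length of the shortest sequence in the smallest set of longest sequences that together hold at least half of the assembled bases); neither detail is specified above.

```python
def n50(lengths):
    """N50 of a collection of sequence lengths (0 for an empty collection)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def n50_by_cutoff(lengths, cutoffs=range(100, 601, 100)):
    """Map each minimum-length cutoff to (N50, assembly size in Mbp)."""
    result = {}
    for cutoff in cutoffs:
        kept = [length for length in lengths if length >= cutoff]
        result[cutoff] = (n50(kept), sum(kept) / 1e6)
    return result


if __name__ == "__main__":
    example = [120, 250, 250, 480, 900, 1500]
    for cutoff, (n50_length, mbp) in n50_by_cutoff(example).items():
        print(cutoff, n50_length, mbp)
```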

Ks distribution curves


Ks distribution curves were generated as described in Jiao et al. 2011¹⁵⁹.

Quality versus Depth plots

The BLASTn results (see above “Completeness of Coverage”) and read counts (see above “Assembly Read Mapping Evaluation”) were used to determine how well each detected transcript was reconstructed in each assembly. For a given locus, the bitscore of the best BLASTn alignment produced from a given assembly was normalized by cDNA length, producing the value for “BS” (normalized bitscore). The expression signal for each detected gene was determined by summing reads from all unigenes that mapped to that locus. The read counts were then normalized by cDNA length to produce the value for “SFB” (sequenced fragments per base pair). This approach excluded alignments of equal or lower quality (≤bit score) than the best alignment for each locus. In this way we simulate the ability to identify a “best hit” when working without a genome reference and can report the ability of each assembler to reconstruct each locus into a single, contiguous and error free unigene. These two values were plotted against each other to show how each assembly accumulates accurate and contiguous unigenes as a function of sequencing depth. The density scatter plots were generated with custom R scripts¹⁶⁰.
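The two plotted quantities reduce to simple per-locus ratios, shown here as a minimal Python sketch. The function names and inputs are hypothetical; the plots themselves were produced with R, not with this code.

```python
def normalized_bitscore(best_bitscore, cdna_length):
    """BS: bitscore of the best BLASTn alignment for a locus,
    divided by the length of the reference cDNA."""
    return best_bitscore / cdna_length


def sequenced_fragments_per_base(unigene_read_counts, cdna_length):
    """SFB: reads summed over all unigenes mapping to a locus,
    divided by the length of the reference cDNA."""
    return sum(unigene_read_counts) / cdna_length


if __name__ == "__main__":
    # Illustrative values only: a 1000 bp cDNA, best alignment bitscore
    # of 1800, and two unigenes carrying 300 and 200 reads.
    print(normalized_bitscore(1800, 1000))
    print(sequenced_fragments_per_base([300, 200], 1000))
```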

AT1G31330.1 reference alignments and assembly

Using the “Completeness of Coverage” (see above) BLASTn results, Trinity and Trinity-S BR1 unigenes that had hits to AT1G31330.1 were aligned using the “Multiple Align” function in Geneious¹⁶¹ with the “Geneious align” option selected. The more numerous collection of CLC and CLC-S BR1 unigenes that had hits to AT1G31330.1 were assembled using the genomic DNA sequence of AT1G31330.1 as a reference. The coding and cDNA sequences were aligned to the assembly using the “Multiple Align” function in Geneious¹⁶¹ with the “Consensus align” function. Highlighting in each alignment shows agreements with the consensus sequence for each alignment.


MEGAN analysis and alignment stringency cutoff

Unigenes that failed to be assigned to a TAIR10 cDNA were queried against NCBI’s non-redundant protein sequences database² (NR) using BLASTx (e-value 1e-10, tabular output format). Taxon IDs from NCBI’s Taxonomy Browser² were appended to the tabular BLAST output with a custom Perl script (v5.12.3). The tabular BLASTx output plus Taxon IDs was imported into MEGAN¹³⁸ (Min. support = 1) with Min. Score values ranging from alignment bitscores of 25-250. The frequency of plant (Viridiplantae with subcategories Arabidopsis and Other Green Plants) vs. non-assignment was plotted to determine the optimum bitscore cutoff that would retain the greatest number of plant hits while maximizing the frequency of non-assignment. Bitscores of 125 and 175 were set as two useful thresholds for classification of de novo assembled unigenes. The 125 threshold allows roughly double the number of plant hits as 175. The 175 threshold is stricter in that the frequency of non-assignment has flattened out while at the same time excluding many alignments to plant sequences.

Chapter 3

Functional Genomics of a Generalist Parasitic Plant: Laser Microdissection of Host-Parasite Interface Reveals Host Specific Parasite Gene Expression

Triphysaria (Orobanchaceae) is a generalist parasite that feeds on a highly diverse collection of angiosperms in nature, including at least 30 species in 17 families of monocot and eudicot host plants¹¹⁷. We reasoned that sequencing transcriptomes from the haustorium of T. versicolor grown on distantly related hosts would maximize the potential to identify both shared and host-specific patterns of gene expression. The transcriptome datasets of T. versicolor provide a unique opportunity to leverage newly established genomic resources of the Parasitic Plant Genome Project (PPGP¹⁶²) with well developed functional protocols including parasite-host co-culture¹¹⁸,¹⁶³, haustorium induction assays¹¹⁹, and parasite transformation¹¹³,¹²⁸,¹²⁹. By characterizing the molecular signature of host-parasite interactions, we stand to gain insight into the molecular mechanisms that allow a generalist parasite to maintain a broad host range.

Intimate symbioses tend towards specialization (e.g. parasitism)¹⁶⁴. A true generalist strategy, where a parasite routinely feeds on many distantly related host species, is relatively uncommon in parasitic organisms¹⁶⁵. At face value, this is surprising, because a broad host range provides more feeding opportunities. Seedlings of most parasitic plants, for example, must contact and parasitize a suitable host plant soon after germination¹³⁰, and access to a wider range of potential host plants should increase the likelihood of survival, regardless of the specific plants growing nearby¹⁶⁶. Although less common than host plant specialists, generalists occur in many parasitic plant families, including some or all parasitic members of Orobanchaceae, Lauraceae, Convolvulaceae, Krameriaceae, and most of the 18 families of Santalales (sandalwoods, mistletoes and their relatives¹⁶⁷).


If mutations that increase specialized feeding strategies increase in frequency when specific host resources are predictable¹⁶⁸, then traits associated with maintenance of generalist abilities are likely to decrease in frequency. If a generalist strategy involves the evolution of a general-purpose suite of genes that are necessary and sufficient to successfully parasitize a wide range of hosts, then such a trend could lead to a long-term stable generalist strategy.

Alternatively, if generalists maintain distinct sets of genes specific to different hosts, then the long-term maintenance of gene sets for attacking different hosts may be unlikely unless there is frequent reinforcement by a diverse range of hosts.

Two substantial hurdles emerge when characterizing the transcriptomes of T. versicolor haustoria. The first is that gene expression profiles of specialized cells in the haustorium become diluted when harvesting even the tiny haustorium (1–2 mm diameter) of T. versicolor. The excellent histology and electron microscopy work by Heide-Jorgensen and Kuijt¹⁰⁹,¹¹⁰ revealed cells residing at the host-parasite interface that had transfer cell-like morphology. The anatomy of these specialized cells includes dense cytoplasm, numerous small vacuoles, a highly invaginated cell membrane, and a labyrinthine cell wall (for a review see¹⁶⁹). We hypothesized that the small collection of interface cells, including those with transfer cell-like morphology, facilitate the elusive molecular interaction between host and parasite, making them excellent candidates for transcriptome analysis. The second hurdle is that discovery of genes and subsequent gene expression analysis on a genome-wide scale is difficult without a sequenced and well-annotated genome, which is currently lacking for T. versicolor. Next Generation Sequencing (NGS) technologies have emerged as powerful tools for exploring new genomes because the cost per base is substantially lower than traditional dye-terminator or even pyro-sequencing (454) methods⁵. In the wake of the NGS revolution, several tools for data analysis (for a review see⁸), including high performance de novo transcriptome assemblers like Trinity¹³¹, have emerged to facilitate transcriptome analysis in uncharacterized model systems.


To overcome the limitations of reference independent transcriptome analysis of small numbers of difficult to harvest cells, we developed methods to sample parasite-host interface cells from T. versicolor grown on the distantly related and sequenced model hosts Zea mays (B73) (monocot) and Medicago truncatula (A17) (eudicot) via Laser Pressure Catapult Microdissection (LPCM). We extracted exceedingly small RNA samples, amplified them via T7-based linear amplification, and then deeply sequenced each of the amplified parasite-host interface transcriptomes. We assembled millions of paired-end Illumina reads de novo, annotated each assembly and then estimated levels of gene expression via read mapping to the de novo assembled transcriptome. Using this approach, we identified genes that were part of a host-specific response as well as those that are part of a shared response of T. versicolor to the different hosts. We also verified the host-specific differential expression pattern of two Triphysaria expansin genes. Expansins are among the few genes known to be differentially regulated in haustoria¹²⁴,¹⁷⁰. Analysis of expansin genes allowed us to verify the differential gene expression pattern present in the interface sequence data, and to provide the first evidence that a β-expansin is highly upregulated in T. versicolor when grown on the Z. mays host. Our results suggest that the maintenance of a generalist feeding strategy in Triphysaria involves both generalized and specialized gene responses.

Results

Parasite host co-culture and microdissection of the T. versicolor haustorium

T. versicolor and hosts were germinated and grown axenically in separate culture plates. To begin co-culture, hosts were transferred to fresh plates and T. versicolor seedlings were placed in close proximity (~1 mm) to host roots. The attachment rate of T. versicolor to host roots was ~90% for M. truncatula and ~50% for Z. mays. This difference was likely due to the more rapid growth rate of Z. mays (compared to M. truncatula) coupled with the confined dimensions of the co-culture Petri dish rather than differential parasite-host compatibility. Where host roots remained more or less stationary on the agar growth medium during early phases of co-culture, the attachment rates of T. versicolor were high (>90%) and equivalent between Z. mays and M. truncatula.

The first step in sample preparation for LPCM was isolation and cryosectioning of haustoria formed on each host. The optimum section thickness was determined empirically by micro-dissecting samples from sections cut at 1 µm section thickness intervals from 18-30 µm. For T. versicolor haustoria we determined that 25 µm cryosections were optimal to allow efficient tissue release from the adhesive coated StarFrost™ LPCM slides coupled with maximum tissue harvest volume. Parasite and host cells that were in contact with each other at the interface were difficult to separate, so to ensure capture of the entire parasite interface cell population we intentionally included a minimal amount of host tissue, knowing that host transcripts could be identified and removed informatically. Figure 3.1 shows a typical haustorium cross section before (A) and after (B) LPCM.

To generate representative interface-cell samples we pooled ~110 interface regions of interest (ROIs) from biological replicates (>8 haustoria). The average ROI for T. versicolor interface transcriptome samples grown on M. truncatula was 54,910 µm² with a total area of 6.1 million µm² that yielded 144 ng total RNA. These pooled samples had an RNA integrity number (RIN) of 7.6, an A260/A280 of 1.58 and an A260/A230 of 0.76. The average ROI for T. versicolor interface transcriptome samples grown on Z. mays was 56,079 µm² with a total area of 6.4 million µm² that yielded 160 ng total RNA with a RIN of 6.9, an A260/A280 of 1.68 and an A260/A230 of 0.11.

Linear mRNA amplification from Laser Microdissected tissues

The first step of T7 based amplification is cDNA synthesis with an oligo-dT/T7 RNA polymerase primer/promoter. It is critical that this step is highly efficient to minimize the bias toward shorter fragment lengths in amplified samples¹⁷¹. We frequently observed a yellow-brown material at the host-parasite interface that may have contributed to the low initial purity of the interface RNA samples, indicated by the low A260/A280 and A260/A230 ratios. Thus, we cleaned the interface total RNA with the Zymo™ RNA Clean and Concentrator kit. Subsequently, we observed consistent amplification performance between technical and biological replicates of the cleaned interface total RNA as well as performance consistent with positive control samples of A. thaliana young leaf RNA of high quality and purity (28S/18S ratio: 1.9; RIN: 8; A260/A280: 2.0, A260/A230: 2.1).

Amplification of ~100 ng of total RNA routinely yielded 50-100 µg of amplified RNA (aRNA) after two rounds of amplification, which was consistent with the positive control (Arabidopsis young leaf total RNA), the expected performance of the Message Amp™ II aRNA kit, and a previous report¹⁷². The aRNA yield after a single round of amplification was up to 100 ng, which is sufficient for construction of an Illumina sequencing library, yet we chose to amplify the samples for two rounds since it was desirable to have additional aRNA for further analyses including qRT-PCR validation of gene expression profiles. The fragment length profiles, as determined via Bioanalyzer™, were reduced from the first to the second round of amplification, which is consistent with a previous report¹⁷¹.

Sequencing and assembly statistics


Amplified interface RNA samples were sequenced on one lane each of Illumina’s Genome Analyzer IIx with an 83 x 83 bp paired-end cycle protocol. Sequencing data are available at http://ppgp.huck.psu.edu¹⁶². The T. versicolor interface transcriptome datasets (Table 3.1) contained 17.9 million read pairs on Z. mays and 19.1 million read pairs on M. truncatula. Host reads from each interface transcriptome dataset were mapped to their respective host genomes, leading to the removal of 1.5 million M. truncatula reads and 0.4 million Z. mays reads from each respective transcriptome data set. Reads were quality trimmed and filtered (see methods), leaving >26 million reads (orphans and mate pairs) for each sample that were then assembled separately using Inchworm (Trinity¹³¹) and post-processed to remove exact duplicate or non-translatable sequences. The interface transcriptome assembly of T. versicolor grown on Z. mays yielded 12.77 Mbp of assembled sequence represented by 28,126 unigenes with an N50 of 525 bp (Table 3.1). The interface transcriptome assembly of T. versicolor grown on M. truncatula yielded 12.25 Mbp of assembled sequence represented by 26,709 unigenes with an N50 of 536 bp (Table 3.1). Sequencing and assembly statistics were similar in all categories (Table 3.1), indicating that both data sets were of equivalent quality.

Unigene annotation

Unigenes were annotated using an objective classification of known plant genes from the PlantTribes 2.0 database173,174 as described in Wickett et al. (2011)139. We assigned genes into a hierarchy of gene clusters, which includes approximate gene families (Tribes), and potentially narrower lineages (Orthogroups) which seek to represent descendants of a single ancestral gene in the collection of reference plant species139,173,174. We also classified unigenes from our experiment using BLAST to query sequence databases (Table 3.1). To identify host derived unigenes in the mixed-species transcriptomes, we established a pairwise nucleotide identity threshold of 95% by querying a collection of Z. mays ESTs for T. versicolor grown on M. truncatula and vice versa (Figure 3.2). The plot of T. versicolor unigene identity to each host database is clearly divergent at 95% compared to the reciprocal host database query. To verify that the incident high identity was not due to cross contamination between samples, we also queried the host EST databases with a de novo transcriptome assembly of Lindenbergia philippensis, a non-parasitic member of the Orobanchaceae162. The BLAST identity plot of the Lindenbergia transcriptome shows a similar trend to the plot of Triphysaria interface transcriptomes queried against the respective non-host databases (Figure 3.2).
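The identity threshold described above can be illustrated with a short sketch that flags unigenes whose alignments to a host database meet a 95% nucleotide identity cutoff. This is a hedged example, not the pipeline used here: it assumes tab-delimited BLAST output in which the first column is the query ID and the third column is percent identity, it treats the cutoff as inclusive, and the function name and toy records are hypothetical.

```python
def host_like_queries(blast_lines, min_identity=95.0):
    """Return the set of query IDs with any hit at or above min_identity.

    Assumes tab-delimited records: query ID, subject ID, percent identity,
    followed by any further columns (which are ignored).
    """
    flagged = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id = fields[0]
        percent_identity = float(fields[2])
        if percent_identity >= min_identity:
            flagged.add(query_id)
    return flagged


# Toy records (hypothetical IDs and identities).
toy_hits = [
    "unigene_1\thost_est_A\t98.50\t300",
    "unigene_2\thost_est_B\t82.10\t250",
    "unigene_3\thost_est_C\t95.00\t410",
]
print(sorted(host_like_queries(toy_hits)))
```

In practice a minimum alignment length would likely be applied as well, so that short, high-identity alignments do not flag a unigene on their own; that choice is not specified in this section.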

In order to remove host genes that may have escaped detection during pre-assembly read mapping, we further screened the assembly based on sequence similarity to cDNA (Phytozome175) and EST databases (PlantGDB176). This screen removed 4,967 unigenes from the transcriptome of T. versicolor grown on Z. mays and 7,785 unigenes from the transcriptome of T. versicolor grown on M. truncatula (Table 3.1). The same reference transcripts used to screen the raw read data were also used to screen assemblies. The large number of putative host derived unigenes indicates that read screening with Mosaik at default values alone was insufficient to remove all host contamination. After the host screen the remaining unigenes were filtered for T. versicolor genes based on sequence similarity to genes detected in other PPGP libraries of Triphysaria versicolor162. We identified 17,887 unigenes from the transcriptome of T. versicolor grown on Z. mays and 14,352 unigenes from the transcriptome of T. versicolor grown on M. truncatula that had >95% identity at the nucleotide level to T. versicolor genes from the other PPGP libraries. After removing unigenes with high similarity to T. versicolor unigenes from other assemblies in the PPGP database, the remaining 5272 unigenes in the interface transcriptome of T. versicolor grown on Z. mays and 4572 unigenes in the interface transcriptome of T. versicolor grown on M. truncatula were used to query the non-redundant protein sequence database (NR) at NCBI2 using BLASTx at a threshold e-value of 1e-10. Roughly half of the remaining unigenes in each interface transcriptome had best hits to plants including the model species Arabidopsis, Populus, and Vitis, other Orobanchaceae, or >30 other plant species (“Other Plant Hits” Table 3.1).

Each T. versicolor interface transcriptome had ~2300 unigenes with no significant alignments to sequences in any of the above described external databases (Table 3.1). We took several additional steps to try to identify these unknown sequences. Though these unigenes are not classified by source, we identified potential plant gene orthologs for ~20-25% of the remaining unigenes via the query of the PlantTribes 2.0 database. We then queried the extensive InterProScan177 (IPS) and OrthoMCL DB178 databases with the translated sequences of the remaining, unclassified unigenes. The majority of these unigenes (>75% in each transcriptome) lacked significant similarity to genes in the OrthoMCL database, nor did they contain IPS peptide motifs (Figure 3.3); they are thus referred to as “no hit” unigenes (Table 3.1).

The scan of OrthoMCL DB resulted in identification of an additional 100 plant and 7 non-plant unigenes from the Zea grown T. versicolor and 85 plant and 16 non-plant unigenes from the Medicago grown T. versicolor (Table 3.1). Additionally, 370 Zea grown T. versicolor and 310 Medicago grown T. versicolor unigenes contained IPS motifs. Roughly half of the OrthoMCL DB hits lack descriptions or have minimal (e.g. one word) descriptions (Appendix E). A similar pattern exists in the IPS search results, where about half of the unigenes with IPS motifs contain only putative secretion signals and/or transmembrane domains (Appendix E). Overall our efforts to classify the unigenes in each assembly resulted in identification of potentially orthologous sequences for 82% of the Zea grown T. versicolor interface unigenes and 88% of the Medicago grown T. versicolor unigenes. We were able to assign a putative origin to >90% of unigenes in both transcriptomes and only 5% in each transcriptome remain unclassified. Of these, 493 unigenes from the T. versicolor grown on Medicago assembly and 536 unigenes from the T. versicolor grown on Zea assembly are longer than 300bp and have read support.


The interface transcriptome of T. versicolor grown on Z. mays and M. truncatula contained a total of 127 and 329 unigenes, respectively, with best hits to non-plant species (Table 3.1). The non-plant component of each interface transcriptome included best hits to 16 taxa shared by both interface libraries. These included Escherichia, Aspergillus, Clavispora, Burkholderia, and others. Among this set, Burkholderia was the most highly represented taxon (>20 fold increase over any other species) in the non-plant component of both interface transcriptomes. These interface Burkholderia sequences were not detected in the reference transcriptome (TrVeBC1, sequence identity cutoff >90%).

The remaining “no hit” sequences, especially those >300bp with read support, could represent unannotated host or parasite genes, uncharacterized associated symbionts, or incidental contamination. All remaining unigenes not assigned to a host plant or non-plant source in the NR database (23,032 for T. versicolor on Z. mays and 18,595 for T. versicolor on M. truncatula) are considered collectively as “putative T. versicolor derived unigenes.”

Comparative interface transcriptome profiles of T. versicolor

To examine the profiles of the T. versicolor interface transcriptomes, we used an annotated transcriptome from above ground tissues of autotrophically grown T. versicolor as a reference (Triphysaria assembly TrVeBC1162). We sorted unigenes by (PlantTribes 2.0) Orthogroups to identify host-specific and shared components of the interface transcriptomes of T. versicolor grown on Z. mays and M. truncatula (Figure 3.4). As expected, the largest number of Orthogroups (5947, or 53.6% of the total detected in T. versicolor) were shared between all three transcriptomes, and likely represent expression of genes involved in processes common to a wide variety of cell types. A large number of Orthogroups (1124) were shared between the interface transcriptomes of T. versicolor interacting with both hosts. These genes likely include a putative core set of parasite genes that are active irrespective of the host plant species. Many additional Orthogroups were either exclusive to the interface and host-specific (677 for Z. mays and 361 for M. truncatula), or shared with above ground phases of growth (1,066 for Z. mays and 314 for M. truncatula).
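The partition of Orthogroups into shared and unique components (Figure 3.4) amounts to a three-way set comparison. The sketch below illustrates that logic with Python sets; it is not the code used in this study, and the function name, dictionary keys, and toy Orthogroup IDs are hypothetical.

```python
def partition_orthogroups(zea, medicago, reference):
    """Split three sets of Orthogroup IDs into the seven regions of a
    three-way Venn diagram."""
    return {
        "all_three": zea & medicago & reference,
        "interface_shared": (zea & medicago) - reference,
        "zea_only": zea - medicago - reference,
        "medicago_only": medicago - zea - reference,
        "zea_and_reference": (zea & reference) - medicago,
        "medicago_and_reference": (medicago & reference) - zea,
        "reference_only": reference - zea - medicago,
    }


# Toy Orthogroup IDs (hypothetical, for illustration only).
zea_interface = {"OG1", "OG2", "OG3", "OG5"}
medicago_interface = {"OG1", "OG2", "OG4"}
above_ground = {"OG1", "OG5", "OG6"}

regions = partition_orthogroups(zea_interface, medicago_interface, above_ground)
for name, members in regions.items():
    print(name, sorted(members))
```

In this scheme the "interface_shared" region corresponds to Orthogroups detected with both hosts but not in the above ground reference, and "zea_only" and "medicago_only" correspond to the host-specific interface components.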

Our annotation strategy includes assignment of a GO Slim category term derived from the best BLAST hit in PlantTribes 2.0. GO Slim categories are the broadest designations of GO and are useful for transcriptome-wide comparisons. The host-specific component of the T. versicolor interface transcriptome is likely to contain genes that interact with unique aspects of host biology while those that are shared likely contain genes essential for parasitism. To determine if the annotation profiles were similar between the overlapping and unique transcriptome components we plotted the proportion of GO Slim categories of unigenes (Figure 3.5) represented by unique or overlapping Orthogroups in Figure 3.4. GO Slim category profiles between equivalent components of each interface transcriptome were generally similar to each other, yet often distinct from profiles of non-equivalent components of each interface transcriptome (Figure 3.5). For instance, interface-unique Orthogroup profiles were similar between both interface transcriptomes, yet distinct from the above ground Orthogroup profiles of both interface transcriptomes.

GO Slim category profiles in overlapping and unique sets of Orthogroups within each interface transcriptome were tested for proportionality by a Chi-Square test (Figure 3.6 A-F). The results of all 6 tests showed disproportionality and were strongly significant (P<<0.0001). The number of unigenes in each GO Slim category, with strong residual values (strongly positive or strongly negative, thus disproportionate), is indicated in Figure 3.6 A-F. The results of this analysis are concordant with the plot of GO Slim category profiles (Figure 3.5). The most striking result is the consistently strong over-representation of unigenes lacking GO Slim categories in interface-unique Orthogroups (Figure 3.6 A-C: columns A and B, Figure 3.6 D-F: columns E and F) compared to the consistent weak representation of unigenes lacking GO Slim categories in the shared Orthogroups (Figure 3.6 A-C: column D, Figure 3.6 D-F: column H).
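The test of proportionality and the residuals described above can be illustrated with a minimal chi-square computation on a contingency table of unigene counts (rows as GO Slim categories, columns as Orthogroup sets). This sketch is not the analysis code used here. It assumes Pearson residuals, (observed - expected) / sqrt(expected), since the type of residual is not stated in this section, and it uses an invented toy table; obtaining a P value would additionally require the chi-square distribution, for example from a statistics library.

```python
def chi_square_with_residuals(table):
    """Return (statistic, degrees_of_freedom, pearson_residuals) for a
    contingency table given as a list of equal-length rows of counts.

    Every expected count must be greater than zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    residuals = []
    for i, row in enumerate(table):
        residual_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            residual = (observed - expected) / expected ** 0.5
            residual_row.append(residual)
            statistic += residual ** 2
        residuals.append(residual_row)

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof, residuals


# Toy table (invented counts): two categories by two Orthogroup sets.
toy_table = [[30, 10],
             [10, 30]]
stat, dof, resid = chi_square_with_residuals(toy_table)
print(stat, dof)
print(resid)
```

Cells with strongly positive residuals are over-represented relative to proportional expectation and cells with strongly negative residuals are under-represented, which is how the disproportionate GO Slim categories in Figure 3.6 can be read.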


Interestingly, the Orthogroups shared in the interaction of T. versicolor with both hosts are overrepresented by “transcription factor activity” GO Slim Function terms (Figure 3.6 A: column B and 3.6 D: column F) and underrepresented by “transport” GO Slim Process terms (Figure 3.6 C: column B and 3.6 F: column F). This indicates that there are transcription factor genes active at the parasite-host interface that are not expressed in the above ground reference transcriptome. In contrast, the transporter gene families expressed at the interface are expressed in the above ground reference transcriptome as well.

Highly expressed genes at the host-parasite interface

An advantage of using a de novo assembled transcriptome for RNA-Seq is an intrinsic threshold for transcript detection. If the transcript is represented by sufficient reads for de novo assembly, the presence of a target for de novo RNA-Seq is evidence for the presence of a transcript. The reference assembly TrVeBC2162 includes data from the haustorium of T. versicolor grown on M. truncatula and was used as a reference to map reads from each interface transcriptome. We correlated normalized reads (reads/kilobase/million mappable reads (RPKM)) from unigenes belonging to Orthogroups shared between the interface transcriptomes and the above ground reference transcriptome, TrVeBC1 (Figure 3.7). For unigenes detected in both interface transcriptomes the correlation was high (Pearson’s R= 0.81), which indicates low technical and biological variability between the interface transcriptomes.
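The normalization and correlation described above can be sketched as follows. This is an illustration rather than the code used in this study: the function names and toy values are hypothetical, and whether the reported correlation was computed on raw or log-transformed RPKM values is not stated in this section.

```python
import math


def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (length_bp * total_mapped_reads)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy example: 500 reads on a 1,000 bp unigene in a library of
# 10 million mapped reads (invented values).
print(rpkm(500, 1000, 10_000_000))

# Toy expression vectors for the same unigenes in two libraries.
library_a = [1.0, 2.0, 3.0]
library_b = [2.0, 4.0, 6.0]
print(pearson_r(library_a, library_b))
```

Normalizing by both unigene length and library size makes expression values comparable across unigenes and across the two interface libraries before they are correlated.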

To determine the expression level of each unigene we also mapped reads to each respective interface de novo assembly. The 20 most highly expressed unigenes in each set of shared and unique Orthogroups from the two transcriptomes are presented in Figure 3.8A-C. We queried this set of 120 unigenes against the NR database using BLASTx (e-value threshold: 1e-10) and 17 annotated plant genomes using BLASTn and BLASTx. The results of the database queries using BLAST are presented in Figure 3.8A-C. The best-hit descriptions from searches in NR were concordant with the annotations assigned using PlantTribes 2.0 (Appendix F-1 & F-2).


We also used InterProScan to predict signal peptides and transmembrane domains for the unigenes listed in Figure 3.8A-C (See Appendix G). The motif prediction tools frequently identified putative transmembrane domains in unigenes annotated as transporters and secretion signals in unigenes annotated as secretory proteins. Of the 120 unigenes listed (Figure 3.8A-C), nine had no hit when queried against NR. The remaining unigenes had best hits to plant species. About 30% of these 120 unigenes have either no BLAST hits in NR or align to predicted, hypothetical, or otherwise uncharacterized sequences. This result is consistent with our finding that the interface is enriched with unigenes that lack GO Slim category assignments (and thus functional annotations).

Among the most highly expressed genes in the shared Orthogroups of interface samples of T. versicolor grown on both Z. mays and M. truncatula (Figure 3.8A) are a β-expansin gene (see below), genes for several other cell wall modifying enzymes, and a gene encoding a putative ap2-erf domain transcription factor. A striking pattern in the shared interface Orthogroups was 10 unigenes (including the most strongly expressed unigenes from T. versicolor grown on Medicago) with sequence identity to annotated pathogenesis-related proteins in other eudicot species. A single M. truncatula unigene passed through the host plant removal process in the common interface component (ID 5537); this also shared high sequence identity with a pathogenesis-related protein.

Of the unigenes listed in Figure 3.8 A-C, 42 from the interface transcriptome of T. versicolor grown on Z. mays and 34 from the interface transcriptome of T. versicolor grown on M. truncatula had strongest BLAST hits to Asterid genomes (including Mimulus guttatus). When we queried the 17 plant genomes database there were slightly more best hits to legumes in the Medicago grown Triphysaria data set, perhaps because there is less sequence divergence between the eudicots Triphysaria and Medicago than between the more distantly related Triphysaria and Zea. This results in a somewhat broader range of ambiguous sequence identity between host and parasite. Despite rigorous filtering, a single putative Z. mays transcript and three putative M. truncatula transcripts persevered (indicated in bold) in the highly expressed gene list in Figure 3.8.

A novel β-expansin is differentially expressed at the parasite-host interface

Among the highly expressed unigenes observed in the interface transcriptome of T. versicolor grown on Z. mays was a putative β-expansin (Figure 3.8A). Manual curation of the read mapping data indicated that it was highly expressed when grown on Z. mays, and a nearly identical unigene from the M. truncatula-grown Triphysaria interface was lowly expressed. This apparently host-specific gene expression pattern was of interest because expansins are cell-wall loosening proteins (for a review, see Sampedro and Cosgrove 2005179) that have been implicated in the interaction between parasitic plants and their hosts124,170. While the β-expansin gene was expressed in both samples, read mapping evidence suggested that this gene (unigene 772) was highly differentially expressed. As a point of comparison, we investigated a putative α-expansin, unigene 11, which showed a reciprocal pattern of high expression in the interface transcriptome of T. versicolor grown on M. truncatula. We verified the nucleotide sequence of unigenes 772 and 11 via dye-terminator sequencing of PCR products amplified from interface aRNA.

Phylogenetic analysis of β-expansin unigene 772 shows that it is nested within a supported clade of dicot β-expansin sequences (Figure 3.9 A) indicating that unigene 772 is a dicot β-expansin and not a Z. mays derived sequence. Annotation via InterProScan supports an expansin identity for 772 (Appendix G) and shows a putative 5’ signal peptide (Figure 3.8A), consistent with a role in the apoplast that is typical for expansins. Additionally, the results of all of the BLAST searches suggest that unigene 772 is a T. versicolor derived sequence. Phylogenetic evidence for unigene 11 does not yield a well-resolved tree of α-expansins (Figure 3.9B), but the BLAST results suggest that the pairwise nucleotide identity to known, or putative (e.g. ESTs) M. truncatula genes is <70%, while unigene 11 has high identity (>95% pairwise nucleotide identity) to Triphysaria unigenes in other PPGP assemblies.

Quantitative Real-Time PCR verification of host specific expansin expression

We sought to verify the reciprocal expression patterns of these two expansins via qRT-PCR. Unigenes 772 and 11 were assigned formal names TvEXPB1 and TvEXPA4, respectively. We verified that primers were specific to their targets by melting curve analysis. To further verify that the TvEXPB1, TvEXPA4, and TvActin primers were specific to parasite transcripts, we harvested portions of host roots that were immediately adjacent to mature T. versicolor attachments and interrogated them via qRT-PCR. In these host root samples we were able to detect ZmActin in Z. mays root samples and MtActin in M. truncatula samples, while the parasite primers yielded signal consistent with background.

We interrogated biological replicates of the T. versicolor host-parasite interface cells grown on both Z. mays and M. truncatula via qRT-PCR for expression of the reference gene, TvActin, TvEXPA4, and TvEXPB1 (Figure 3.10). TvEXPB1 is up-regulated >120 fold (P=0.024) in T. versicolor haustorial interface cells grown on Z. mays relative to T. versicolor grown on M. truncatula. TvEXPA4 shows a weak reciprocal pattern (P=0.17). The expression patterns of TvEXPA4 and TvEXPB1 are concordant with our de novo RNA-Seq results. Additionally, when we examined TvEXPB1 expression in whole haustorium samples the signal was indistinguishable from background, suggesting that the massive upregulation of TvEXPB1 is specific to a small number of interface cells.
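As an aid to interpreting a fold change from qRT-PCR data, the sketch below shows the widely used 2^-ΔΔCt calculation, in which a target gene is normalized to a reference gene (here, an actin) in each condition. This section does not state which quantification method was used, so the approach, the assumption of perfect amplification efficiency, the function name, and the Ct values are all assumptions made for illustration.

```python
def fold_change_ddct(ct_target_a, ct_ref_a, ct_target_b, ct_ref_b):
    """Relative expression of a target gene in condition A versus
    condition B by the 2^-ddCt method.

    Assumes one doubling of product per cycle (100% efficiency) for both
    the target and the reference gene.
    """
    delta_ct_a = ct_target_a - ct_ref_a  # target normalized to reference, A
    delta_ct_b = ct_target_b - ct_ref_b  # target normalized to reference, B
    delta_delta_ct = delta_ct_a - delta_ct_b
    return 2.0 ** (-delta_delta_ct)


# Invented Ct values: the target crosses threshold 7 cycles earlier in
# condition A than in condition B after normalization to the reference.
print(fold_change_ddct(ct_target_a=20.0, ct_ref_a=18.0,
                       ct_target_b=27.0, ct_ref_b=18.0))
```

Under these assumptions a normalized difference of about seven cycles corresponds to a fold change of 128, which gives a sense of the scale of Ct difference implied by a >120 fold up-regulation.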

Discussion

Using a workflow that allowed us to sample, sequence, and de novo assemble transcriptomes from cells at the host-parasite interface we have shown that T. versicolor expresses genes in a host specific manner. This preliminary look at genes expressed at the parasitic plant-host plant interface suggests that the basis for generalist parasitism is constituted, at least in part, by host-specific patterns of gene expression. Generally, this work demonstrates the potential to discover genes de novo and examine genome-wide patterns of gene expression in a highly tissue-specific manner in organisms that lack a sequenced genome.

Laser Microdissection (LM) is a powerful tool for plant transcriptomics

The power to develop a comprehensive picture of any biological system lies with understanding the myriad processes underway in complex organs and tissues. A primary hurdle to revealing this picture is the ability to identify, separate, capture and analyze tissues and cells of interest. Several authors have emphasized the importance of high resolution, high throughput investigations of gene expression in a tissue- and cell-specific manner as well as the need to survey gene expression in a global manner, and why LM (including LPCM) is emerging as a powerful tool for genomics65-67,180.

LM generated samples from some model systems have been examined using microarrays181-186 allowing investigators to analyze global gene expression patterns in specific tissues and cell types. More recently, LM has been increasingly coupled with SGS to sequence the transcriptomes of various tissues in Z. mays187-189 and S. lycopersicum172. The advent of de novo transcriptome assembly now makes global surveys of gene expression in specific tissues and cells of non-sequenced organisms a logical next step. To examine the parasite-host interface transcriptome of T. versicolor, we combined LPCM with robust T7 based linear RNA amplification171,190-193 in concert with Illumina mRNA-Seq, high performance de novo transcriptome assembly131, and various assembly post-processing tools.


For the haustorium of T. versicolor our sampling strategy was based largely on detailed histology and transmission electron microscopy work done with field-collected specimens109,110. This critical background information allowed us to identify cells of interest within the haustorium and then subsequently identify regions of the haustorium in cryosections that contained these cells. Plant tissues must be embedded prior to LPCM and preparatory steps can have an impact on the quantity and quality of RNA preparation194. However, the ability to identify tissues of interest must be balanced with downstream usability. The histological quality of the section is an important consideration that may determine the sample preparation method. Paraffin embedded sections generally provide high histological quality at the expense of RNA quality and yield195,196. Histological quality increases with thinner sections for sampling at a finer spatial resolution and the efficiency of pressure catapulting increases with thinner sections, yet Kerk et al. (2003)197 report increased RNA yield from thicker sections. We found that the optimum section thickness for capturing interface cells of the T. versicolor haustorium was 20-25µm. This was determined based on a balance of our ability for histological identification of tissues and cells of interest with efficient tissue release from the slide during the pressure catapult phase of LPCM. The integrity of plant tissues that are susceptible to damage by flash freezing can be preserved by infusion with a cryoprotectant70. Cryosectioning with the CryoJane™ (a cryosection transfer system) allowed us to easily capture serial cryosections of T. versicolor haustorium that routinely yielded high quality RNA from carefully chosen samples.

By design, our sampling strategy minimized the likelihood that differences in gene expression arose from temporal or spatial sampling artifacts. The co-culture of T. versicolor is not highly synchronous, so our sample of haustoria represents a broad temporal window of connections that are ~8-10 days old. We collected ~110 interface ROIs, which diminishes the likelihood of spatial sampling artifacts. Furthermore, highly similar statistics throughout sample processing and data analysis, including a high correlation of read counts for unigenes in shared Orthogroups (Figure 3.7) and verification of the expression pattern of TvEXPB1 in additional experiments with biological replicates (Figure 3.10), indicate low variation from either biological or technical sources.

Parasite-host interface transcriptomes help to identify core parasite genes

Orobanchaceae include the pernicious weeds Striga, Orobanche and Phelipanche, as well as the model parasite Triphysaria. We reasoned that by discovering a subset of genes that likely include those central to the response of T. versicolor to distantly related hosts, we would move closer to identifying a core set of parasitism genes operating in the weedy Orobanchaceae. The molecular dialogue of the parasitic plant with its host is largely uncharacterized92. Here we report the identification of the subset of genes expressed at the host-parasite interface by T. versicolor in response to both of the distantly related hosts Z. mays and M. truncatula. Additionally, we identified host-specific patterns of gene expression indicating that the generalist parasitic plant T. versicolor maintains suites of genes for use with different hosts.

The genes of T. versicolor expressed at the host-parasite interface in response to both Z. mays and M. truncatula included a substantial set of genes annotated in other plant species as pathogenesis- (or pathogen-response) related proteins. This included six of the most highly expressed genes in the Triphysaria interface grown on M. truncatula, and three when grown on Z. mays, but a number of additional unigenes were also putative homologs of genes that are upregulated during pathogen invasion (dirigent-like, acidic endochitinase, disease resistance proteins, etc.). While upregulation of genes of these classes would be expected as a defense response to pathogens, this observation suggests that pathways commonly involved in plant protection are also turned on by the parasite during the process of host invasion. It has already been suggested that parasitic plants may have recruited pathways from allelochemical detoxification for use during haustorium signal transduction115. Whether Triphysaria is defending itself during the invasion process with pathogen resistance (PR) genes, or has recruited PR pathways as offensive ‘weapons’ is not yet clear, but it does appear that a portion of the core gene set for Triphysaria parasitizing either host has been derived from pathways that plants use for defense against pathogens.

There is no consensus for why true generalist parasites occur or persist, even infrequently, despite evidence for an evolutionary trend toward specialization (Agosta et al. 2010165 and references therein). We hypothesized that a successful generalist feeding strategy might rely heavily on general-purpose genes that were deployed for feeding on any host. Instead, we see large components of the interface transcriptome that are detected only when feeding on one of the tested host plants. Of the 2162 Orthogroups detected only in interface transcriptomes, about half (52%) were expressed on both host species, but a substantial fraction of interface Orthogroups were detected only when grown on maize (31%), or on Medicago (17%). Long-term maintenance of extensive genetic machinery directed toward parasitism of subsets of possible hosts will require maintenance of selective pressures on the genes that are deployed in a host-specific manner. Unless the parasite routinely encounters and parasitizes a wide range of host plants, selective pressure on some host-specific genes will be relaxed and the gene functions eventually lost, limiting the plant’s potential to parasitize some hosts, despite the potential advantages of a large host range. The evidence for host specific gene expression suggests that T. versicolor regularly parasitizes across a broad host range, maintaining selective pressure on genes used in a host-specific manner. Indeed the advantage of a broad host range has been explored where host populations are moderately variable in space and time166. This study serves as a starting point for a broader survey of putative and potential hosts for Triphysaria as well as a first look at gene expression profiles of highly specialized tissues in parasitic plant haustoria.
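The percentages quoted for the interface-only Orthogroups follow directly from the counts reported in the Results (1124 shared between hosts, 677 detected only on Z. mays, and 361 detected only on M. truncatula). The short sketch below reproduces that arithmetic; the function name and dictionary labels are hypothetical and are used only for illustration.

```python
def percent_shares(counts):
    """Return the total and each category's share of it, rounded to the
    nearest whole percent."""
    total = sum(counts.values())
    shares = {name: round(100 * n / total) for name, n in counts.items()}
    return total, shares


# Counts of interface-only Orthogroups as reported in the Results.
interface_only = {
    "both hosts": 1124,
    "Z. mays only": 677,
    "M. truncatula only": 361,
}
total, shares = percent_shares(interface_only)
print(total)
print(shares)
```

The three counts sum to the 2162 interface-only Orthogroups cited above, and the rounded shares match the 52%, 31%, and 17% given in the text.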

It is intriguing that the host-specific and shared-interface gene families (Orthogroups) were over-represented by unigenes with no GO Slim category term assignment compared to the above ground transcriptome. Because Orthogroups were defined based on a classification of sequenced and annotated plant genomes139,173,174 to which the parasite genes were assigned, this observation highlights the fact that genes expressed in the haustorium include many that have been recruited from the subset of genes whose function is not yet known in any plant. In addition, approximately 1300 unigenes from each interface transcriptome lack strong homology to any known sequence, though 25% of these unigenes are high identity reciprocal best hits in the interface data sets from each host. Further, because such patterns are reproducible in the interface transcriptomes of T. versicolor when grown on different host plants, these data suggest that genes of unknown function are expressed in the haustorium in a host-specific manner. Our data also suggest that underground phases of growth in T. versicolor are enriched for genes of unknown function.

Of those unigenes with GO Slim category assignments, the shared interface-specific Orthogroups are overrepresented for the GO Slim Function category “transcription factor” and underrepresented for the GO Slim Process category “transport.” This indicates that there are transcription factor Orthogroups unique to the interface (relative to the above ground reference transcriptome), yet Orthogroups involved in transport processes are active in all three transcriptomes examined. The latter observation regarding transport does not rule out differential expression of particular genes that are expressed in all three transcriptomes in this study.

TvEXPB1 encodes a T. versicolor β-expansin that is part of a host-specific response

The expansin gene family includes 4 main groups: α, β, expansin-like A and expansin-like B179. Expansins are thought to loosen plant cell walls by allowing slippage of cell wall polymers179. Their activity is non-enzymatic, but they are distantly related to glycoside hydrolase family proteins179. Expansins are involved with cell growth198 and have been implicated in the interaction of bacterial plant pathogens199,200, plant-parasitic nematodes201,202 and parasitic plants124,170 with their plant hosts. In each case, expansins are suggested to play a role in host invasion.


Expansin activity has been assayed in the cell walls of both monocots203-205 and dicots206,207. β-expansins have activity that is specific to the cell walls of grasses, but not dicot or most other monocot cell walls208. The reverse pattern of action is found for α-expansins, which suggests substrate specificity for both α- and β-expansin proteins179,209. Typically, expansins are found at low concentrations179 with the exception of grass pollen that secretes massive amounts of β-expansins206 that likely serve to loosen stylar tissue during pollen tube growth210,211. Although the exact mechanism of action remains unknown, grass cell walls have relatively small amounts of xyloglucan and pectin; these are replaced with β-(1→3),(1→4)-D-glucan and glucuronoarabinoxylan. Both of these grass cell wall components are potential targets of β-expansins in their wall-loosening activity212. Throughout more than a decade of research, the accumulation of β-expansins to high levels was thought to be specific only to grass pollen212.

In this study, we have shown that the β-expansin TvEXPB1 is among the most highly expressed genes in the interface transcriptome of T. versicolor grown on Z. mays. Relative to the interface transcriptome of T. versicolor grown on M. truncatula, the expression of TvEXPB1 is greatly up-regulated (>120-fold) in the parasite-host interface tissues of T. versicolor grown on Z. mays. The massive and host-specific expression of TvEXPB1 is suggestive of a role in a dicot parasite’s interaction with a grass host. Additionally, the signal for TvEXPB1 in whole haustorium samples of T. versicolor grown on Z. mays was undetectable relative to the interface samples, suggesting that TvEXPB1 is highly specific to the host-parasite interface.

Taken together, the evidence for grass cell wall-specific activity of β-expansins and the massive upregulation of a parasite β-expansin at the parasite-host interface suggests that T. versicolor expresses TvEXPB1 when interacting with the monocot host Z. mays in an effort to manipulate host cell walls. Heide-Jørgensen and Kuijt109,110 observed that host cortical and epidermal cells seemed crushed and displaced at the host parasite interface in the haustorium of T. versicolor. The mechanism of host tissue displacement may include cell wall modifying proteins like expansins that have host cell wall-specific activity. Such a mechanism would allow the parasite to soften or separate host cell walls without affecting the integrity of its own cell walls in the penetrating haustorium. The specific role that expansins play in the host parasite interaction will only be uncovered through detailed functional analysis. This includes focused gene expression analysis, targeted silencing of T. versicolor expansin genes and biochemical characterization to determine the substrate specificity of the expansin proteins encoded by these genes.

Are a ubiquitous genus of soil bacteria symbionts of T. versicolor?

The best represented genus in the non-plant transcriptome component (~60% of non-plant hits in T. versicolor grown on Z. mays and ~45% of non-plant hits in T. versicolor grown on M. truncatula) was Burkholderia, a common genus of soil bacteria. The high frequency of hits to this bacterium is surprising for three reasons: (1) the relative frequency of other non-plant genera was much lower, (2) T. versicolor seeds were aggressively surface sterilized prior to axenic co-culture, and (3) Burkholderia-derived unigenes were not detected in the above ground reference assembly. The low frequency of hits to other genera (including plant pathogenic fungi, human and other bacteria) could be explained by incidental contamination from the lab environment; however, the preponderance of Burkholderia hits only in the interface samples suggests the presence of an organism belonging to this genus in the co-culture system. The unigenes were generally <1kbp (indicating transcript-sized unigenes) and RNA samples were DNase treated, diminishing the likelihood that these unigenes originated from genomic DNA contamination.

The genus Burkholderia has received increasing attention in the last two decades due in part to a diverse catalog of host interactions ranging from human pathogen to plant-growth promoting rhizosphere bacteria213. Of particular interest here is their potential role as beneficial plant endosymbionts213. Atsatt214 noted the presence of filamentous bacteria-like structures in haustorial cells and that application of terramycin significantly reduced haustorium formation in Orthocarpus purpurascens (syn. Castilleja exserta), a hemi-parasite that is a close relative of T. versicolor215. The presence of Burkholderia in an axenic co-culture system points to an intriguing possibility: a species of Burkholderia was carried through surface sterilization with the seeds of T. versicolor. Evidence that may suggest a role for a prokaryotic endosymbiont in parasitic plant biology214, considered with evidence for the presence of a genus of beneficial soil bacteria, hints at a symbiosis between Burkholderia and Triphysaria. The presence and possible roles of Burkholderia at the host-parasite interface in T. versicolor are currently under investigation.

Conclusions

Triphysaria represents an important asset in the identification of genes and evolutionary processes that are central to parasitic plant biology, and one that is key to the development of new control strategies for the weedy Orobanchaceae. Highly tissue-specific transcriptome analysis in Triphysaria versicolor, an experimental model parasitic plant that has yet to have its genome sequenced, has revealed host-specific gene expression. Furthermore, genes represented at the parasite-host interface are enriched for genes of unknown function relative to above ground phases of growth. Ongoing development of molecular techniques, including parasitic-plant transformation128,216 will facilitate the functional characterization of genes whose roles may be central in the parasitic plant-host plant interaction.

Figures and Tables

[Figure 3.1 image: panels A and B, each with a 100µm scale bar; labels: penetration peg, haustorium, host root, interface ROI]

Figure 3.1 Laser Microdissected Haustorium. LPCM allows highly tissue- and cell-specific harvest after histological identification of tissues or cells of interest. A) Representative 25µm cross-section of T. versicolor haustorium on the host M. truncatula approximately 9 days post infestation, and prior to LPCM. The mature haustorium contains the xylem bridge that connects the parasite and host vasculature and is visible in the penetration peg. B) The same section after LPCM shows the cleared interface tissue from the user-defined region of interest (ROI). The flakes of tissue are catapulted by a photonic cloud resulting from pulses of laser light focused between the tissue and glass slide. Multiple pulses of laser light raster across the ROI causing tissue in the selected region to be catapulted and then captured in the adhesive coated cap of a 0.5mL tube held by a robotic arm in very close proximity (<0.5mm) to the upper surface of the section affixed to the slide.


                              Host plant
                              Z. mays        M. truncatula
Sequencing
  Total raw sequence          2.73 Gbp       2.91 Gbp
Reads
  Total                       35,894,662     38,228,134
  Host reads                  (401,352)      (1,588,592)
  Host filtered               35,493,310     36,639,542
  Quality trimmed             26,947,737     27,325,845
Assembly
  Assembly length             12.77 Mbp      12.25 Mbp
Unigenes
  Total                       28,126         26,709
  Unigenes >500 bp            9,369          9,718
  Min/Max length (bp)         197/3,115      197/3,265
  N50                         525 bp         536 bp
  N50 >500 bp                 731 bp         695 bp
Annotation
  Host unigenes               (4,967)        (7,785)
  Non-plant unigenes          (127)          (329)
  Triphysaria unigenes        23,032         18,595
  Triphysaria hits            17,887         14,352
  Other Plant hits            2,975          2,086
  No hits                     2,170          2,157

Table 3.1 Read and assembly level statistics for T. versicolor interface transcriptomes. Low quality reads were filtered before assembly, and host sequences were filtered both before and after assembly. Unigenes remaining after removal of host plant and non-plant sequences were aligned with BLASTx to sequences detected in any other PPGP transcriptome library of Triphysaria versicolor (http://ppgp.huck.psu.edu/). Unigenes with less than 95% pairwise identity to either host or to other Triphysaria libraries were sorted further if a BLASTx search of the NR (www.ncbi.nlm.nih.gov) database yielded alignments with an E-value of 1e-10 or lower. The remaining unclassified unigenes were submitted to OrthoMCL DB and InterProScan. Unigenes that remained unclassified after the final screen are called “no hit” unigenes.
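The N50 values in Table 3.1 summarize the unigene length distribution of each assembly. As a minimal sketch, assuming the common definition of N50 (the length L such that unigenes of length L or longer hold at least half of the assembled bases), the statistic can be computed as below. The function name and the toy length list are hypothetical and are not taken from the assemblies reported here.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp).

    Assumed definition: walk the lengths from longest to shortest and
    return the length at which the running total first reaches half of
    the total assembled bases.
    """
    if not lengths:
        raise ValueError("at least one length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy unigene lengths (bp), for illustration only.
toy_lengths = [197, 300, 450, 525, 800, 1200, 3115]
n50_all = n50(toy_lengths)
# Restricting to unigenes >500 bp mirrors the "N50 >500 bp" row of the table.
n50_over_500 = n50([length for length in toy_lengths if length > 500])
```

Restricting the input to longer unigenes raises the N50, which is why the table reports both values.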


[Figure 3.2 plot: unigene number (log scale, 1 to 10,000) versus % pairwise nucleotide identity (70 to 100). Series: Triphysaria on Zea to Zea; Triphysaria on Medicago to Medicago; Triphysaria on Medicago to Zea; Triphysaria on Zea to Medicago; Lindenbergia to Zea; Lindenbergia to Medicago]

Figure 3.2 Unigene Pairwise Nucleotide Identity Plot. Sequence identity between unigenes considered in this study and reference EST sets (PlantGDB public ESTs, http://www.plantgdb.org/) for the hosts Z. mays and M. truncatula. Triphysaria unigenes were aligned to the host reference to identify host contaminants and aligned to the reciprocal non-host reference sets to identify the incidental nucleotide pairwise identity. A whole plant normalized transcriptome assembly of Lindenbergia philippensis (a non-parasitic member of the Orobanchaceae) was used to determine the distribution of pairwise identity for a non-parasite to each host and to control for high unigene identity to host ESTs from potential cross contamination. A threshold of 95% was chosen to balance exclusion of host transcripts with retention of Triphysaria unigenes that had incidentally high identity to host ESTs.
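The 95% identity threshold amounts to a simple partition of unigenes by their best alignment to the host reference. The sketch below illustrates that rule under stated assumptions: the input is a mapping from unigene ID to best percent identity against host ESTs (None when there is no alignment), and identity at or above the threshold is treated as host-derived. The function name, the input format, and the unigene IDs are hypothetical.

```python
HOST_IDENTITY_THRESHOLD = 95.0  # percent pairwise identity, from the caption


def split_by_host_identity(best_hits, threshold=HOST_IDENTITY_THRESHOLD):
    """Partition unigenes into host-like and retained sets.

    best_hits: dict mapping unigene ID -> best % identity to host ESTs,
               or None if the unigene had no alignment (assumed format).
    Returns (host_like, retained) as lists in input order.
    """
    host_like, retained = [], []
    for unigene_id, identity in best_hits.items():
        if identity is not None and identity >= threshold:
            host_like.append(unigene_id)   # treated as host contamination
        else:
            retained.append(unigene_id)    # kept as candidate parasite unigene
    return host_like, retained


# Hypothetical example values, for illustration only.
example_hits = {"u1": 99.2, "u2": 94.9, "u3": None, "u4": 95.0}
host_like, retained = split_by_host_identity(example_hits)
```

Raising the threshold retains more parasite unigenes at the cost of admitting more host transcripts, which is the trade-off the caption describes.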


Figure 3.3 Venn diagram summary of OrthoMCL DB and InterProScan (IPS) results. ESTScan ORF predictions from unigenes in each interface transcriptome that remained unclassified after extensive BLAST-based database searching were translated and submitted to OrthoMCL DB and InterProScan. The pattern is similar between unigenes from each transcriptome, indicating equivalent unigene classification for T. versicolor grown on both hosts. The number of unigenes for which an ortholog or peptide motif was identified was relatively small, indicating our unigene classification using PlantTribes 2.0 and external database queries was robust. Approximately 25% of the known orthologs identified in the OrthoMCL database from each transcriptome are shared. A majority of the unigenes remain unknown, and these include many (~500 in each transcriptome) that are >300 bp and have read support.


Figure 3.4 Transcriptome Orthogroup Venn. Venn diagram showing the number of Orthogroups in the interface transcriptomes of T. versicolor with hosts Z. mays and M. truncatula and an above ground, autotrophically grown T. versicolor transcriptome (TrVeBC1) constructed from leaves, stems and inflorescences. Also shown are the numbers of host-derived Orthogroups. The lack of overlap between host and parasite transcriptomes does not imply lack of shared Orthogroups, but indicates the total number of host Orthogroups for a point of comparison.


[Figure 3.5 image: three bar charts (GO Function, GO Component, GO Process) showing % unigenes per GO Slim term for the series Interface Unique, Interface Shared, Shared All, and Interface/Above Ground Shared]

Figure 3.5 GO Slim Category Summary. GO Slim category terms of unigenes in interface transcriptomes of T. versicolor and the above ground reference assembly of T. versicolor. Each series displays the average number of unigenes in equivalent transcriptome components with a given GO Slim term. For instance, “Interface Unique” indicates the average number of unigenes from interface unique components in both Medicago and Zea grown T. versicolor transcriptomes. Error bars are standard error of the mean. “Interface Unique” = unigenes from Orthogroups that are host and interface specific, “Interface Shared” = unigenes from Orthogroups that are interface specific and shared between interface transcriptomes, “Shared All” = unigenes from Orthogroups shared between both interface transcriptomes and the above ground transcriptome, “Interface/Above Ground Shared” = unigenes from Orthogroups that are shared between the above ground, autotrophic transcriptome and the host-specific interface transcriptome.

GO Function                       A       B       C       D
DNA or RNA binding                24      43      80      515
hydrolase activity                62      177     187     1688
kinase activity                   33      70      51      889
nucleic acid binding              3       13      11      152
nucleotide binding                19      20-     34      441
other binding                     107     205     147     1298
other enzyme activity             66      146-    159     1791
other molecular functions         227     581+    497     3642
protein binding                   32      55      42      644
receptor binding or activity      1       2       0       37
structural molecule activity      1-      12-     18      465-
transcription factor activity     61+     157+    63      531
transferase activity              25-     78-     91      1182
transporter activity              21      66-     57      974
No GO Function                    123+    237+    121     768-

Figure 3.6 A GO Slim Function category analysis for the interface transcriptome of T. versicolor grown on Z. mays. Chi-Square test (P << 0.0001) of GO Slim terms represented in the indicated regions (A-D) of the Venn. The numbers of unigenes in each GO category for regions A-D are indicated in the table. Cells with strongly positive residual values (>4) are indicated as bold+ and strongly negative residual values (<-4) are indicated as bold-.
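The residuals flagged in Figure 3.6 come from a chi-square test of independence between GO Slim category and Venn region. The following is a minimal sketch of that calculation applied to the Figure 3.6 A counts, assuming the residuals are Pearson residuals, (observed - expected) / sqrt(expected). The text does not state which residual type was used, so this choice, along with the function and variable names, is an assumption.

```python
import math

# Unigene counts for Venn regions A-D, transcribed from Figure 3.6 A.
GO_FUNCTION_ZEA = {
    "DNA or RNA binding": (24, 43, 80, 515),
    "hydrolase activity": (62, 177, 187, 1688),
    "kinase activity": (33, 70, 51, 889),
    "nucleic acid binding": (3, 13, 11, 152),
    "nucleotide binding": (19, 20, 34, 441),
    "other binding": (107, 205, 147, 1298),
    "other enzyme activity": (66, 146, 159, 1791),
    "other molecular functions": (227, 581, 497, 3642),
    "protein binding": (32, 55, 42, 644),
    "receptor binding or activity": (1, 2, 0, 37),
    "structural molecule activity": (1, 12, 18, 465),
    "transcription factor activity": (61, 157, 63, 531),
    "transferase activity": (25, 78, 91, 1182),
    "transporter activity": (21, 66, 57, 974),
    "No GO Function": (123, 237, 121, 768),
}


def pearson_residuals(table):
    """Return (chi_square, degrees_of_freedom, residuals) for a count table.

    table: dict mapping row label -> tuple of counts (one per column).
    Expected counts assume independence: row_total * col_total / grand_total.
    Residuals are Pearson residuals (an assumption, see the lead-in).
    """
    n_cols = len(next(iter(table.values())))
    col_totals = [sum(row[j] for row in table.values()) for j in range(n_cols)]
    grand_total = sum(col_totals)
    chi_square = 0.0
    residuals = {}
    for label, row in table.items():
        row_total = sum(row)
        row_residuals = []
        for j, observed in enumerate(row):
            expected = row_total * col_totals[j] / grand_total
            residual = (observed - expected) / math.sqrt(expected)
            row_residuals.append(residual)
            chi_square += residual ** 2
        residuals[label] = row_residuals
    dof = (len(table) - 1) * (n_cols - 1)
    return chi_square, dof, residuals


chi_square, dof, residuals = pearson_residuals(GO_FUNCTION_ZEA)
# Cells beyond +/-4 correspond to the bold+ / bold- convention in the caption.
flagged = {
    (label, "ABCD"[j]): round(value, 2)
    for label, row in residuals.items()
    for j, value in enumerate(row)
    if abs(value) > 4
}
```

The same function can be applied to the Component and Process tables, and to regions E-H for M. truncatula, by swapping in the corresponding counts.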

GO Component                      A       B       C       D
cell wall                         7       8       11      55
chloroplast                       59      110-    227+    1415
cytosol                           0       9       12      185
ER                                1       22      10      257
extracellular                     3       13      3       40
Golgi apparatus                   2       12      2       173
mitochondria                      18      42      36      647
nucleus                           82      189     129     1148
other cellular components         300     656     504     5455
other cytoplasmic components      11      25      31      493
other intracellular components    17      36      24      378
other membranes                   150     406     316     3151
plasma membrane                   4       10      12      145
plastid                           2       5       15      81
ribosome                          1       7-      17      309
No GO Component                   148+    312+    209-    1085-

Figure 3.6 B GO Slim Component category analysis for the interface transcriptome of T. versicolor grown on Z. mays. Chi-Square test (P << 0.0001) of GO Slim terms represented in the indicated regions (A-D) of the Venn. The numbers of unigenes in each GO category for regions A-D are indicated in the table. Cells with strongly positive residual values (>4) are indicated as bold+ and strongly negative residual values (<-4) are indicated as bold-.

GO Process                               A       B       C       D
cell organization and biogenesis         9       27      29      286
developmental processes                  14      19      30      145
DNA or RNA metabolism                    2       13      5       48
electron transport or energy pathways    21      39      27      298
other biological processes               376     910+    711     5827
other cellular processes                 47      124     96      1388
other metabolic processes                147-    357-    398     4393+
protein metabolism                       1       2       0       14
response to abiotic or biotic stimulus   12      16      28      230
response to stress                       2       6       1       65
signal transduction                      5       9       15      148
transcription                            0       1       3       42
transport                                29      72-     58-     1157+
No GO Process                            140+    267+    157     976-

Figure 3.6 C GO Slim Process category analysis for the interface transcriptome of T. versicolor grown on Z. mays. Chi-Square test (P << 0.0001) of GO Slim terms represented in the indicated regions (A-D) of the Venn. The numbers of unigenes in each GO category for regions A-D are indicated in the table. Cells with strongly positive residual values (>4) are indicated as bold+ and strongly negative residual values (<-4) are indicated as bold-.

GO Function                       E       F       G       H
DNA or RNA binding                9       41      11      444
hydrolase activity                33      156     47      1423
kinase activity                   39      66      20      718
nucleic acid binding              2       8       4       118
nucleotide binding                15      22      6       353
other binding                     51      184     38      1125
other enzyme activity             20      132-    38      1526
other molecular functions         102     568+    111     3172
protein binding                   12      51      16      569
receptor binding or activity      0       2       0       28
structural molecule activity      2       10-     7       446
transcription factor activity     30      141+    16      440
transferase activity              25      64-     26      991
transporter activity              9       62      11      795
No GO Function                    72+     219+    34      677-

Figure 3.6 D GO Slim Function category analysis for the interface transcriptome of T. versicolor grown on M. truncatula. Chi-Square test (P << 0.0001) of GO Slim terms represented in the indicated regions (E-H) of the Venn. The numbers of unigenes in each GO category for regions E-H are indicated in the table. Cells with strongly positive residual values (>4) are indicated as bold+ and strongly negative residual values (<-4) are indicated as bold-.

GO Component                      E       F       G       H
cell wall                         1       6       1       53
chloroplast                       20      111     45      1211
cytosol                           3       8       2       177
ER                                0       15      5       247
extracellular                     4       14      1       39
Golgi apparatus                   0       12      2       137
mitochondria                      0       35      6       544
nucleus                           44      169     28      941
other cellular components         152     572     128     4578
other cytoplasmic components      5       29      4       407
other intracellular components    8       36      12      359
other membranes                   102     402     86      2657
plasma membrane                   0       7       1       120
plastid                           2       5       2       75
ribosome                          2       6-      7       293
No GO Component                   78+     299+    55      987-

Figure 3.6 E GO Slim Component category analysis for the interface transcriptome of T. versicolor grown on M. truncatula. Chi-Square test (P << 0.0001) of GO Slim terms represented in the indicated regions (E-H) of the Venn. The numbers of unigenes in each GO category for regions E-H are indicated in the table. Cells with strongly positive residual values (>4) are indicated as bold+ and strongly negative residual values (<-4) are indicated as bold-.

GO Process                               E       F       G       H
cell organization and biogenesis         6       25      7       291
developmental processes                  4       15      8       130
DNA or RNA metabolism                    4       12      2       50
electron transport or energy pathways    10      45      16      237
other biological processes               198     854+    177     4946
other cellular processes                 53      116     25      1193
other metabolic processes                62-     310-    89      3750
protein metabolism                       0       1       0       12
response to abiotic or biotic stimulus   5       12      3       173
response to stress                       0       6       1       59
signal transduction                      2       10      2       124
transcription                            0       1       0       36
transport                                8       66-     13      967
No GO Process                            69+     253+    42      857-

Figure 3.6 F GO Slim Process category analysis for the interface transcriptome of T. versicolor grown on M. truncatula. Chi-Square test (P << 0.0001) of GO Slim terms represented in the indicated regions (E-H) of the Venn. The numbers of unigenes in each GO category for regions E-H are indicated in the table. Cells with strongly positive residual values (>4) are indicated as bold+ and strongly negative residual values (<-4) are indicated as bold-.

[Figure 3.7 scatter plot: normalized read counts for T. versicolor on M. truncatula (x axis, 0 to 20) versus T. versicolor on Z. mays (y axis, 0 to 20); Pearson's R = 0.81]

Figure 3.7 Correlation of normalized read counts (RPKM) for unigenes in orthogroups shared between the interface transcriptomes and reference assembly TrVeBC1 (ppgp.huck.psu.edu). Reads from each interface transcriptome were mapped to a reference assembly (TrVeBC2, ppgp.huck.psu.edu) that included whole haustorium data from T. versicolor grown on M. truncatula. A subset of unigenes is more highly expressed in the interface transcriptome of T. versicolor grown on M. truncatula; a similar pattern is not observed for T. versicolor grown on Z. mays. This is due to a bias for Medicago grown Triphysaria unigenes in the reference dataset TrVeBC2, which was constructed with reads from Medicago grown Triphysaria. For unigenes in shared orthogroups, the RPKM values are highly correlated (Pearson’s R = 0.81) between interface transcriptomes indicating that technical and biological variation is low.
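The comparison in Figure 3.7 rests on two calculations: normalizing read counts to RPKM and correlating the paired values across transcriptomes. The sketch below shows both under stated assumptions. RPKM is taken as reads per kilobase of unigene per million mapped reads, and Pearson's r is computed from its textbook definition. The function names and example values are hypothetical, and any log transformation applied before plotting is not reproduced here.

```python
import math


def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads per kilobase of unigene per million mapped reads."""
    if length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (length_bp * total_mapped_reads)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical example: 500 reads on a 1,000 bp unigene, 1 million mapped reads.
example_rpkm = rpkm(500, 1000, 1_000_000)
# Hypothetical paired expression values for three shared unigenes.
example_r = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Because RPKM divides by both unigene length and library size, values from the two interface libraries can be compared directly even though the libraries differ in read depth.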


Figure 3.8 A Top expressed unigenes from shared orthogroups between parasite-host interface transcriptomes of T. versicolor.


Figure 3.8 B Top expressed unigenes from orthogroups unique to interface transcriptomes of T. versicolor.


Figure 3.8 C Top expressed unigenes from host-specific yet above ground shared orthogroups of T. versicolor transcriptomes.

Figure 3.8 Highly Expressed Interface Unigenes. The 20 most highly expressed (RPKM) unigenes (ID) in each indicated portion of the transcriptome Venn diagram for the interaction of T. versicolor with each host species. NR BLASTx – description, species and %id.: the description, species of origin, and percent pairwise identity, respectively, of the best unigene alignment (<1e-10) resulting from the NR database query. 17 genomes BLAST and %id.: best hit species in a BLAST database of 17 annotated plant genomes with the percent pairwise identity in the nucleotide BLAST (N) or translated nucleotide BLAST (P). TXXX = IPS transmembrane prediction, SXXX = IPS secretion signal prediction.


[Figure 3.9 image: two phylogenetic trees with bootstrap values at nodes. Panel A: β-expansin tree including TrVeIntZeamaGuB1_772*. Panel B: α-expansin tree including TrVeIntMedtrGuB1_11*]

Figure 3.9 RAxML analysis of A: Triphysaria beta expansin gene TvEXPB1 (TrVeIntZeamaGB1_772*, green text), and B: alpha expansin gene TvEXPA4 (TrVeIntMedtrGB1_11*, green text). Bootstrap proportions are given above each node. Taxon abbreviations for A: Arabidopsis thaliana (AT), Oryza sativa (Os), Mimulus guttatus (Mg), Triphysaria versicolor (TrVe), Striga hermonthica (StHe), Phelipanche (=Orobanche) aegyptiaca (OrAe), Selaginella moellendorffii (Smoellendorffii). Taxon abbreviations for B: Oryza sativa (Os), Sorghum bicolor (Sb), Striga hermonthica (StHe), Phelipanche (=Orobanche) aegyptiaca (OrAe), Triphysaria versicolor (TrVe), Carica papaya (Carpa), Populus trichocarpa (Poptr), Medicago truncatula (Medtr), Vitis vinifera (Vitvi), Arabidopsis thaliana (AT), Selaginella moellendorffii (Selmo), Physcomitrella patens (Phypa).


[Figure 3.10 bar chart: expression relative to TvActin (log scale, 1 to 1000) for TvEXPB1 and TvEXPA4 in T. versicolor on M. truncatula and T. versicolor on Z. mays; an asterisk marks a significant difference]

Figure 3.10 Differential Expansin Expression. qRT-PCR analysis of TvEXPA4 and TvEXPB1 expression relative to TvActin in parasite-host interface cells harvested by LPCM from the haustoria of T. versicolor. *P<0.05
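Expression relative to a reference gene such as TvActin is often derived from qRT-PCR threshold cycle (Ct) values. The quantification method is not stated here, so the sketch below is only an illustration of one widely used approach, the efficiency-based ΔCt calculation, assuming perfect doubling per cycle by default. The function name, parameters, and Ct values are hypothetical.

```python
def relative_expression(ct_target, ct_reference, efficiency=2.0):
    """Target expression relative to a reference gene from Ct values.

    Assumed model: amplicon quantity multiplies by `efficiency` each cycle,
    so relative expression = efficiency ** (ct_reference - ct_target).
    A lower Ct for the target than for the reference gives a value > 1.
    """
    if efficiency <= 1.0:
        raise ValueError("amplification efficiency must exceed 1")
    return efficiency ** (ct_reference - ct_target)


# Hypothetical Ct values: target crosses threshold 5 cycles before reference.
example_ratio = relative_expression(ct_target=20.0, ct_reference=25.0)
```

With an efficiency of 2, each cycle of difference corresponds to a two-fold difference in relative expression, which is why such data are usually plotted on a log scale.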


Materials and Methods

Growth of plant material

Triphysaria versicolor

Seed was collected from hundreds of open pollinated plants growing in coastal grassland stands of Napa laboratories (University of California, Davis, USA) from the same source population as our prior transcriptome sequencing studies126,139. Seeds were surface sterilized in 70% ethanol for 10 min. while gently shaking, and then washed 3x with sterile distilled H2O. Seeds were further sterilized and scarified by a wash with a 50% bleach + 0.01% Triton X-100 (Sigma) solution for 30 min. while gently shaking. Seeds were then washed 10x with sterile distilled H2O and placed in petri dishes containing co-culture medium (1/4x Hoagland’s basal salt and nutrient mix, 7.5 g/L sucrose, 6 g/L plant tissue culture grade agarose, pH of 6.1) and wrapped with Parafilm™. Seeds were stratified for 4 days at 4°C in the dark then transferred to a 16°C growth chamber under a 12-hour light regime with a light intensity of 30µmoles photons/m2/sec. T. versicolor seedlings were grown for 14–17 days then transferred to fresh co-culture plates with hosts.
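The co-culture medium recipe scales linearly with batch volume. As a small illustrative sketch, the function below returns the component masses for a given volume using the sucrose and agarose concentrations stated above. The mass of Hoagland's mix at 1x strength depends on the supplier's formulation and is not given in the text, so it is passed in as a parameter; the value used in the example is hypothetical, as are the function and key names.

```python
SUCROSE_G_PER_L = 7.5       # from the co-culture medium recipe
AGAROSE_G_PER_L = 6.0       # from the co-culture medium recipe
HOAGLAND_STRENGTH = 0.25    # medium is made at 1/4x strength


def coculture_medium_masses(volume_l, hoagland_1x_g_per_l):
    """Return grams of each dry component for `volume_l` liters of medium.

    hoagland_1x_g_per_l: grams per liter of Hoagland's mix at 1x strength,
    taken from the supplier's label (an assumption; not stated in the text).
    """
    if volume_l <= 0:
        raise ValueError("volume must be positive")
    return {
        "sucrose_g": SUCROSE_G_PER_L * volume_l,
        "agarose_g": AGAROSE_G_PER_L * volume_l,
        "hoagland_mix_g": HOAGLAND_STRENGTH * hoagland_1x_g_per_l * volume_l,
    }


# Hypothetical example: 0.5 L batch, assuming a 1x mix of 1.6 g/L.
example_batch = coculture_medium_masses(0.5, hoagland_1x_g_per_l=1.6)
```

The pH adjustment to 6.1 is a bench step and is not represented in this calculation.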

Medicago truncatula (A17)

Seed was generously provided by Zengyu Wang (Noble Foundation, Oklahoma, USA). Seeds were scarified by incubation with occasional stirring in 18M H2SO4 for 8 min, and then washed 5x with sterile distilled H2O. Seeds were surface sterilized by a wash with 50% bleach + 0.01% Triton X-100 (Sigma) for 3 min. while gently shaking. Seeds were washed 10x with sterile distilled H2O and placed in a 50mL conical bottom tube with 25mL of sterile distilled H2O at 25°C, in the dark, overnight while gently shaking. The next day seedlings were transferred to sterile filter paper (Whatman #5), moistened with sterile distilled H2O, and placed in petri dishes. Seedlings were placed at 25°C, in the dark, overnight. Seedlings were transferred to co-culture medium and wrapped with Parafilm™ then placed in a growth chamber at 25°C under a 16-hour light regime with a light intensity of 100µmoles photons/m2/sec for 5–7 days, then transferred to fresh co-culture plates with parasites.

Zea mays (B73)

Seed was generously provided by David Braun (University of Missouri, Columbia, USA). Seeds were surface sterilized by a wash with 20% bleach + 0.08% Triton X-100 (Sigma) for 10 min. while gently shaking. Seeds were washed 3x with sterile distilled H2O. Seeds were further surface sterilized by a wash with 70% ethanol for 5 min. while gently shaking. The ethanol was decanted and the seeds were left to air dry in a laminar flow hood. Dry seeds were placed in petri dishes containing co-culture medium and wrapped with Parafilm™ then placed in a growth chamber at 25°C under a 16-hour light regime with a light intensity of 100µmoles photons/m2/sec for 5–7 days, then transferred to fresh co-culture plates with parasites.

Parasite and host co-culture

Zea mays or Medicago truncatula were transferred to fresh co-culture medium at the times indicated above. Host roots were carefully oriented to allow placement of T. versicolor seedlings (grown as described above) in close proximity to host roots, with the parasite root tips 0.5-1mm from the host root. Co-culture plates were sealed with Micropore™ (3M) surgical tape and placed in a growth chamber at 25°C under a 16-hour light regime with a light intensity of 100µmoles photons/m2/sec for 8–10 days.

Tissue processing and sample preparation for sequencing

Tissue harvest

After 8–10 days co-culture, the haustoria of T. versicolor were harvested by making cuts in both the host root and parasite root ~1-2mm adjacent to the haustorium under a stereomicroscope and were embedded in Shandon Cryomatrix™ (Thermo Scientific) dispensed in a 10mmx10mmx5mm Cryomold™ (Tissue-Tek). Dissected tissue was quickly oriented with the host root on the vertical axis and with all haustoria equidistant from the bottom of the mold. Samples were quickly frozen on dry ice and stored at −80°C.

Cryosectioning and dehydration

Embedded haustoria were sectioned in a Cryotome SME (ThermoFisher Scientific) at 20-25µm. Frozen sections were mounted to slides using the Cryo-Jane™ Tape Transfer System (Leica Microsystems). Mounted sections were dehydrated in a series of organic solvent baths (RT 70% ethanol (10 min.), 4°C 70% ethanol (2 min.), 4°C 95% ethanol (2 min.), 4°C 100% ethanol (2 min.), 4°C 100% xylene (2 min.), 4°C 100% xylene (2 min.), and finally RT 100% xylene (2 min.)). Slides were allowed to dry in a fume hood for 30 min.

Laser Pressure Catapult Microdissection (LPCM)

The P.A.L.M. Microbeam™ System (Zeiss) was used to harvest cells of interest. Only the pressure catapult function was used and the laser energy setting for the pressure catapult function was minimized. The laser focus was fixed, but the objective focus was manually adjusted to ensure efficient removal of tissues of interest during LPCM. Dissected tissue was captured in opaque adhesive cap 0.5mL tubes (Zeiss, part #415101-4400-250) and stored at −80°C.

RNA extraction and RNA cleanup

Total RNA was extracted from LPCM harvested material using the PicoPure™ RNA isolation kit (Arcturus) with adaptations derived, in part, from documentation accompanying the opaque adhesive cap 0.5 mL tubes (Zeiss, part #415101-4400-250). The changes include modification of the RNA extraction step as detailed in the Arcturus protocol (Step 1) and are as follows: We removed the adhesive cap and placed it into the tube body of a 0.5 mL Eppendorf Safe-Lock™ tube (part # 022363719) to achieve a better tube-cap seal. 50 µL of buffer XB was added to the tube and it was vortexed inverted for 30 seconds. The tube was then placed in a temperature equilibrated (42°C) custom clamping device (not shown) to prevent leakage during the lysis step, in which the tube is inverted and incubated at 42°C for 30 min. in an air incubator.

At 10 min. intervals the entire clamping device containing the tubes was vortexed inverted for 30 seconds. After the adapted lysis step we followed the PicoPure™ RNA isolation kit protocol beginning at step 2, RNA isolation. RNA was DNase treated on-column according to Appendix A in the PicoPure™ RNA isolation kit protocol. Total RNA was assessed on the Agilent Bioanalyzer using the RNA 6000 Nano kit (Agilent) with the Plant Total RNA assay with specific attention to the RNA Integrity Number217 (RIN, scale of 1 (degraded) to 10 (intact)). An additional clean-up of the total RNA prep was required to remove what we suspected to be polyphenolics and secondary metabolites that interfered with downstream enzymatic treatments. The RNA Clean and Concentrator™-5 kit (Zymo Research) was used to clean and concentrate the total RNA extracted from LPCM harvested samples following the General Procedure in protocol version 2.0.6 and RNA was eluted in 10 µL RNase-free water.

T7 based RNA amplification of mRNA

The Message Amp™ II aRNA kit (Ambion) was used to amplify the poly-A RNA contained in total RNA samples to yield amplified RNA (aRNA). The input amount was approximately 100 ng of total RNA. The manufacturer’s instructions were followed and the first round of in vitro transcription (IVT) was allowed to progress for 14 hours. The entire aRNA sample was concentrated in a Speed-vac (Savant) to <10 µL then entered into second round cDNA synthesis. The second round IVT was allowed to progress for 4 hours. Total RNA was assessed on the Agilent Bioanalyzer using the RNA 6000 Nano kit (Agilent) with the mRNA assay. Typical yields after the first round of amplification were up to 100 ng aRNA and yields after the second round of amplification ranged from 50–100 µg aRNA. High quality Arabidopsis young leaf RNA was used as a positive control and RT-PCR grade water (Ambion, included with the Message Amp II kit) was used as a negative control for amplification.

Illumina paired-end library construction

aRNA was used as input in Illumina’s mRNA-Seq library prep protocol (Rev D). We omitted the poly-A selection step and moved directly to “Fragment the RNA.” 100 ng aRNA was fragmented and the manufacturer’s instructions were followed for the rest of the library preparation with the following exceptions: 1) we size selected the adapter-ligated library fragments at 300 bp rather than 200 bp at the “Purify the cDNA Templates” step, 2) we performed a second size selection/purification step by running a gel in a similar fashion as described in the “Purify the cDNA Templates” step and excised the band at approximately 325 bp that contained the products of library enrichment. The second size selection was done to purify the library and further constrain the fragment distribution as recommended by Illumina for paired-end mRNA-Seq. The Illumina RNA-Seq library was assessed on the Agilent Bioanalyzer using the DNA 1000 kit (Agilent).

Sequencing, bioinformatics and phylogenetics

Sequencing

Paired-end (83x83bp) sequencing was performed on the Illumina Genome Analyzer IIX by the Genomics Core facility at the University of Virginia in Charlottesville, VA USA. Each library was sequenced in one lane.

Post sequencing data processing and annotation

Contaminating sequences were removed from the pre-assembled, paired-end reads by alignment to the annotated coding DNA sequences of the Medicago truncatula175 and Zea mays175 genomes using version 1.1.0014 of Mosaik Assembler218 with the recommended parameters (hs = 15, mm = 12, and act = 35). Unaligned reads were then trimmed to remove low-quality bases ( (http://www.clcbio.com/index.php) requiring additionally that the remaining read fragment be at least half the original read length. Paired-end read files were reconstructed from the trimmed read fragments and orphaned single-end read fragments written to separate files using a custom script.
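The trimming and pair-reconstruction logic described above can be illustrated with a short Python sketch. It is not the custom script used in this work. Simple end-trimming at a fixed cutoff stands in for the trimming tool, the quality cutoff (MIN_QUALITY) is a placeholder because no value is stated in the text above, and all function names are hypothetical. Only the half-length rule and the separation of intact pairs from orphaned reads follow the description.

```python
# Illustrative sketch only: end-trimming at a fixed cutoff stands in for the
# trimming tool, and MIN_QUALITY is an assumed placeholder value.
MIN_QUALITY = 20


def trim_read(seq, quals, min_quality=MIN_QUALITY):
    """Strip bases scoring below min_quality from both ends of a read."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_quality:
        start += 1
    while end > start and quals[end - 1] < min_quality:
        end -= 1
    return seq[start:end]


def keep_fragment(trimmed, original_length):
    """Keep a non-empty fragment that is at least half the original length."""
    return len(trimmed) > 0 and 2 * len(trimmed) >= original_length


def rebuild_pairs(pairs, min_quality=MIN_QUALITY):
    """Split trimmed reads into intact pairs and orphaned single-end reads.

    pairs: iterable of (name, (seq1, quals1), (seq2, quals2)).
    Returns (paired, orphans); each orphan records which mate (1 or 2) survived.
    """
    paired, orphans = [], []
    for name, (seq1, quals1), (seq2, quals2) in pairs:
        t1 = trim_read(seq1, quals1, min_quality)
        t2 = trim_read(seq2, quals2, min_quality)
        ok1 = keep_fragment(t1, len(seq1))
        ok2 = keep_fragment(t2, len(seq2))
        if ok1 and ok2:
            paired.append((name, t1, t2))
        elif ok1:
            orphans.append((name, 1, t1))
        elif ok2:
            orphans.append((name, 2, t2))
    return paired, orphans
```

Pairs in which both mates fail the length rule are dropped entirely, which matches the description of writing only surviving pairs and orphans to file.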

The resultant filtered and trimmed set of reads for each library was then de novo assembled using the Inchworm component in release 03122011 of Trinity131 with default parameters. Assemblies were filtered using version 2 of ESTScan155 to remove sequences that had numerous frame-shift errors in coding regions, and version 4.0 of USEARCH156 to remove similar (sub)sequences (to create non-redundant de novo assemblies). The post-processed unigenes for each build were then queried (BLASTx, 1e-5) against the 10 genomic proteomes in PlantTribes 2.0139,174 and assigned to PlantTribes 2.0 Orthogroups based on the cluster containing the best BLAST hit. Expression levels for each unigene were determined by mapping reads back to de novo assemblies (each interface transcriptome as well as TrVeBC1 and TrVeBC2) and computing RPKM values with the High-throughput Sequencing RNA-Seq Analysis program of CLC Genomics Workbench version 4.6 (parameters: mismatch cost = 2, insertion cost = 3, deletion cost = 3, length fraction = 0.5, similarity = 0.8, min insert size = 100, and max insert size = 250).
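For readers unfamiliar with the RPKM unit, the following Python sketch shows the standard definition (reads per kilobase of transcript per million mapped reads). It is not the CLC Genomics Workbench implementation, whose handling of multi-mapped or partially mapped reads may differ, and the function and argument names are hypothetical.

```python
def rpkm(read_count, unigene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = 10^9 * C / (N * L), where C is the number of reads mapped to the
    unigene, N is the total number of mapped reads in the library, and L is
    the unigene length in base pairs.
    """
    if unigene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (total_mapped_reads * unigene_length_bp)


# Example with made-up numbers: 500 reads on a 1,000 bp unigene
# in a library of 10 million mapped reads.
example = rpkm(500, 1000, 10_000_000)
```

Normalizing by both length and library size is what allows unigenes of different lengths, from libraries of different depths, to be ranked on one scale.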

Post annotation assembly filtering

To determine the appropriate pairwise identity (at the nucleotide level) to reference host sequences, BLASTn was used to determine the frequency of unigene pairwise identity to reference host ESTs (Zea mays and Medicago truncatula, mRNA ESTs176). The interface transcriptomes of T. versicolor grown on Z. mays and M. truncatula were BLASTed against both the Z. mays and M. truncatula EST databases. The BLAST to the non-host databases was to determine the incidental nucleotide pairwise identity of unigenes to a set of host references that should not be present in the assembly. To determine if unigenes with >95% identity to reciprocal host sequences were due to cross contamination, the distribution of unigene pairwise identity was determined for a whole-plant normalized assembly of a non-parasitic member of the Orobanchaceae, Lindenbergia philippensis (PPGP assembly LiPhGnB1162). An identity threshold of 95% was established to minimize host contamination while retaining parasite transcripts with incidentally high identity to the host reference sequences.

Final assemblies were filtered by BLASTn to remove unigenes with >95% identity to host derived sequences (Zea mays and Medicago truncatula, transcripts from175, mRNA ESTs from176). Remaining unigenes were then filtered by BLASTn to available PPGP transcriptome assemblies of T. versicolor162 to select for unigenes with >95% identity to other putative Triphysaria transcripts. Unigenes with less than 95% identity to host or T. versicolor sequences were queried against the non-redundant protein sequences database2 using BLASTx (1e-10). Sequences were sorted by best-hit species into non-plant and plant categories. Sequences that remained unannotated and unclassified after extensive efforts detailed above were translated based on the ESTScan ORF prediction and submitted to InterProScan177 via blast2go219 using default parameters. These unigenes were also submitted to the OrthoMCL DB178 database using default parameters.
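The two-step identity filter described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline used here: it assumes BLASTn results in the standard 12-column tabular format (query ID in column 1, percent identity in column 3), it uses each unigene's highest-identity hit, and it applies no alignment-length requirement. The function names are hypothetical.

```python
def best_identity(blast_lines):
    """Return {query_id: highest percent identity} from tabular BLAST lines."""
    best = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, pident = fields[0], float(fields[2])
        if pident > best.get(query, 0.0):
            best[query] = pident
    return best


def partition_unigenes(unigene_ids, host_hits, parasite_hits, threshold=95.0):
    """Assign each unigene to host, parasite, or unresolved.

    Host hits are checked first, mirroring the order in the text: unigenes
    above the threshold to host sequences are removed before the remainder
    is screened against T. versicolor assemblies.
    """
    host, parasite, unresolved = [], [], []
    for uid in unigene_ids:
        if host_hits.get(uid, 0.0) > threshold:
            host.append(uid)
        elif parasite_hits.get(uid, 0.0) > threshold:
            parasite.append(uid)
        else:
            unresolved.append(uid)  # candidates for the BLASTx search of NR
    return host, parasite, unresolved
```

Checking the host reference first is the conservative choice for a mixed-species sample, since a unigene that matches both references at >95% is treated as possible contamination.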

The unigenes with the greatest RPKM in unique and overlapping Orthogroups, excluding those belonging to Orthogroups shared in all three transcriptomes and those unique to the above ground transcriptome, were used to query the non-redundant protein sequences database (NR) at NCBI2 with BLASTx (1e-10) using blast2go219. This set of 120 unigenes was queried against a collection of 17 annotated plant genomes using BLASTn and BLASTx to determine best hits to annotated plant sequences not necessarily present in NR. The 17 genomes queried are as follows: Arabidopsis thaliana, Carica papaya, Fragaria vesca, Glycine max, Medicago truncatula, Mimulus guttatus, Oryza sativa, Phoenix dactylifera, Populus trichocarpa, Selaginella moellendorffii, Solanum lycopersicum, Solanum tuberosum, Sorghum bicolor, Thellungiella parvula, Theobroma cacao, Vitis vinifera, and Zea mays. Additionally, these unigenes were annotated with InterProScan177 via blast2go219 with the default settings.

GO Slim category analysis

An annotated T. versicolor PPGP transcriptome with no tissue overlap of interface transcriptomes (TrVeBC1162) was included in Orthogroup analysis to serve as a point of comparison for interface transcriptome analysis. Putative T. versicolor unigenes (excluding host and non-plant unigenes) were sorted based on Orthogroup assignment using Venny220. GO Slim annotations from unigenes present in Orthogroups were subject to a Chi-Square test using R160. Alpha was set to 0.05.
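The Chi-Square test of GO Slim proportions can be illustrated with a small Python sketch. The analysis itself was run in R; this version only shows how the statistic is built from a contingency table of counts (rows as transcriptome components, columns as GO Slim categories). It also returns Pearson residuals, (observed − expected) / √expected, as one common way to see which categories drive a significant result. That choice of residual type is an assumption for illustration, and the function name is hypothetical.

```python
def chi_square(table):
    """Chi-square statistic, degrees of freedom, and Pearson residuals.

    table: list of rows of observed counts; all rows must have equal length.
    Expected counts assume rows and columns are independent.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    residuals = []
    for i, row in enumerate(table):
        residual_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            residual = (observed - expected) / expected ** 0.5
            statistic += residual ** 2  # sum of squared Pearson residuals
            residual_row.append(residual)
        residuals.append(residual_row)

    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof, residuals
```

The p-value is obtained by comparing the statistic to a chi-square distribution with the returned degrees of freedom, which a statistics package such as R handles directly.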

Phylogenetic analysis of TvEXPB1 and TvEXPA4

Homologs of TvEXPB1 from Arabidopsis thaliana, Oryza sativa, Mimulus guttatus, and Selaginella moellendorffii were extracted from Phytozome175 v7.0: gene family #28891348 (β-expansins). Separately, alpha expansin homologs of TvEXPA4 were extracted from PlantTribes 2.0174 Orthogroup 1292. These sequences were combined with a subset of translatable sequences assembled in the PPGP project that had best BLAST hits with PlantTribes v2.0 Orthogroup 6163 (β-expansins) and 1292 (α-expansins). Inferred amino acid sequences for each data set were aligned using MUSCLE221 and the coding DNA sequences were then forced onto this alignment for phylogenetic analysis. RAxML222 version 7.2.8 was used to estimate the maximum likelihood tree under the GTR+gamma model of molecular evolution and bootstrap support values were estimated using 100 rapid bootstrapping replicates.

Verification of Expansin transcript sequence and relative transcript abundance

cDNA synthesis

cDNA was synthesized from the same amplified samples used for sequencing, plus an additional biological replicate of each, to use as a template for qRT-PCR using the iScript™ Reverse Transcription Supermix for RT-qPCR (BioRad). The manufacturer’s instructions were followed for all RNA samples as this kit uses both oligo-dT and random priming.

RNA isolation from host roots

RNA was isolated from host roots that were co-cultured, as described above, with T. versicolor. Host roots were harvested at sites adjacent to haustoria and flash frozen on liquid N2. Approximately 50 mg of tissue was vigorously macerated in Kimble Chase glass tissue grinders (part # KT885450-0020) in the presence of 450 µL buffer RLT + β-mercaptoethanol (Qiagen) for 3 min. The lysate was then transferred to a QiaShredder column and the RNeasy Plant Mini Kit (Qiagen) instructions were then followed with the addition of an on-column DNase treatment (Appendix D: RNeasy Mini Handbook 4th Edition) until elution, which was done in 30 µL RNase free water. Total RNA was assessed on the Agilent Bioanalyzer using the RNA 6000 Pico kit (Agilent) with the Plant Total RNA assay.

Primer design and sequence verification

Primers were designed using Geneious161 Pro (v5.5.4) to amplify TvEXPA4 and TvEXPB1 from aRNA based upon the unigenes that resulted from the de novo assembly and post-processing of the Illumina paired-end mRNA-Seq data.

Primer sequences:

TvEXPA_F5: GCTTTTGCCTACGACCAACTTATG

TvEXPA_R3: GACAGTTTTGCCATCGCTTGTAG

TvEXPB_F1: GCCATAGTTTCAACCCGAGGAC

TvEXPB_R2: GGCTTCTTCCTGCTCTCCTTACTTG

With these primers the putative α- and β-expansin transcripts were amplified and submitted for Sanger sequencing. Sequencing reads were quality trimmed manually by visually examining the electropherograms. MUSCLE was used to align the Sanger reads with the unigenes to confirm the unigene sequence. Any remaining gaps were sequenced with the same method using primers based on the Sanger sequence verified first-round PCR products.

Quantitative Real-Time Polymerase Chain Reaction

To verify the host-specific expression pattern of TvEXPB1 and TvEXPA4 observed in the de novo read mapping results, transcript levels of TvEXPB1, TvEXPA4, TvActin, ZmActin, and MtActin were estimated using qRT-PCR.

Primer sequences:

TvEXPA4:

For: 5’-TGGGAGGTGCTTGTGGGTAT-3’; Rev: 5’-CCGCAGGATAACCCATTGTT-3’

TvEXPB1:

For: 5’-GATGGCCTGACTGAAGTTGCA-3’; Rev: 5’-GCGGCAAATTCACCCTAAAA-3’

TvActin:

For: 5’-ACCCGATCCTTCTCACTGA-3’; Rev: 5’-CATGACAATACCAGTCGTACG-3’

ZmActinB223:

For: 5’-CAATGGCACTGGAATGGT-3’; Rev: 5’-ATCTTCAGGCGAAACACG-3’

MtActin224:

For: 5’-ATGTTGCTATTCAGGCCG-3’; Rev: 5’-GCTCATAGTCAAGGGCAAT-3’

Verification of primer specificity

To determine if the primers were specific to their intended targets, melt-curve analysis was performed for each PCR product. Primer specificity for parasite target genes was verified by submitting host RNA extracted from co-cultured host roots to analysis by qRT-PCR with parasite gene primers. In both cases host actin transcripts were detected, yet primer pairs specific to the parasite genes yielded signal consistent with background. All no-template controls (NTC) showed signal consistent with background signal and all reverse transcription negative (RTN) controls showed signal consistent with background.

qRT-PCR assay conditions

The qRT-PCR reaction was prepared using the KAPA™ SYBR FAST qPCR kit (KAPA Biosystems, KK4602) following the manufacturer’s instructions. The reaction was run on a BioRad MyiQ (170–9770) with the following program:

95°C for 8 min. (initial melt)

95°C for 0.5 min. (cycle melt)

60°C for 0.5 min. (cycle anneal/extend)

Repeat 40 cycles

Melt Curve: 0.5°C increments from 95°C – 25°C.

qRT-PCR data analysis

Crossing point (Ct) values for each of 3 technical replicates were used to calculate the average Ct value. The 2^(−ddCt) method was used to calculate the fold-change in expression in each sample relative to the control145. A one-tailed, two-sample t-test assuming unequal variance was performed using R160; alpha was set to 0.05.
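As a worked illustration of the 2^(−ddCt) calculation, the Python sketch below averages technical-replicate Ct values and computes the fold-change of a target gene relative to a control sample. It assumes a single reference gene (for example, the parasite actin) is used for normalization and that amplification efficiency is close to 100%. The function name and the Ct values in the example are hypothetical.

```python
from statistics import mean


def fold_change(target_sample, reference_sample, target_control, reference_control):
    """Fold-change by the 2^(-ddCt) method.

    Each argument is a list of technical-replicate Ct values:
      target_*    -> gene of interest
      reference_* -> reference (normalizer) gene
      *_sample    -> condition being tested
      *_control   -> condition used as the baseline
    """
    d_ct_sample = mean(target_sample) - mean(reference_sample)
    d_ct_control = mean(target_control) - mean(reference_control)
    dd_ct = d_ct_sample - d_ct_control
    return 2 ** (-dd_ct)


# Made-up Ct values: the target crosses threshold 3 cycles earlier in the
# sample than in the control after normalization, i.e. an 8-fold increase.
example = fold_change([22.0, 22.1, 21.9], [18.0, 18.0, 18.0],
                      [25.0, 25.0, 25.0], [18.0, 18.0, 18.0])
```

Because Ct is on a log2 scale, each cycle of difference after normalization corresponds to a two-fold difference in starting template.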

Chapter 4

Transcriptomes of Three Parasitic Members of the Orobanchaceae Reveal Shared, Differentially Expressed Genes at the Host Parasite Interface

The Parasitic Plant Genome Project162 focused upon the Orobanchaceae for several reasons including the devastating impact on agriculture visited by Striga (witchweed) and Orobanche (syn. Phelipanche, broomrape), the ideal comparative framework provided by this unique plant family and a methodological foundation for further study established for each of the target species: T. versicolor, S. hermonthica and O. aegyptiaca130. With the completion of global assemblies for each species along with rigorous annotation, classification and de novo RNA-Seq, the basis for cross species comparison is established. Further, our work in evaluation of de novo transcriptome assemblies using high performance tools like Trinity131, CLC Assembly Cell132 and SOAPdenovo-trans133 (see Chapter 2) and the development of a workflow that allows deep transcriptome sequencing from just a few cells within the haustorium of T. versicolor has laid the groundwork for comparison of highly tissue specific transcriptomes from each of the PPGP species.

The haustoria of T. versicolor are known to contain specialized cells109,225 that may play a role in the host-parasite dialogue. Our previous work (see Chapter 3) has shown that these specialized cells express genes in a host specific manner in T. versicolor and are enriched for genes of unknown function49. Similarly, the haustoria of Orobanche and Striga contain numerous specialized cells. In Orobanche there is evidence that functional phloem elements connecting the vasculature of the host to the vasculature of the parasite are present in the haustorium of Orobanche (syn. Phelipanche) crenata, and that the phloem elements arise from interspecific plasmodesmata and have distinct fine structure226. Evidence demonstrating the functional phloem connections of Orobanche (syn. Phelipanche) ramosa to its host was provided by movement of viroid particles from the host potato plant into the parasite, but not vice versa111. The dye carboxyfluorescein (which is not able to cross the plasma membrane), after being loaded into cut petioles of tobacco, was transported into attached O. aegyptiaca, suggesting direct phloem connections227 between host and parasite. Work with Striga aspera has shown that cortical cells in the haustorium had numerous plasmodesmata, which the authors suggest may play a role in metabolite transfer since phloem connections were not obvious228. Striga hermonthica and Striga asiatica develop oscula, ingrowths into host xylem elements, which presumably function to acquire water and nutrients via invasion of pits229. For xylem connectivity in Triphysaria and Striga species, and xylem and phloem connectivity in Orobanche species, very carefully controlled and synchronous developmental events must be coordinated by the molecular dialogue between the parasite and its host227,229. This molecular dialogue is poorly characterized92 and only recently have global, cross species comparisons been possible139 to describe family-wide gene expression patterns in the Orobanchaceae. Indeed, the evidence that the anatomy of haustoria in the Orobanchaceae is distinct raises the possibility that species specific signals may interfere with identification of shared components unless equivalent tissues are sampled and compared. Here we describe the laser microdissection of haustoria of the two additional PPGP species, S. hermonthica (grown on Sorghum bicolor) and O. aegyptiaca (grown on Arabidopsis thaliana). Using an approach similar to that described in Chapter 3, the interface transcriptomes of S. hermonthica and O. aegyptiaca were sampled, amplified, sequenced, de novo assembled, and functionally annotated.

The classification strategy of each mixed species transcriptome was refined and includes a step that leverages the comprehensive PPGP “BigBuilds” (see Material and Methods) to help identify parasite interface genes, as well as a metagenome analysis. Consistent with our observations in Chapter 3 that the interface is enriched for genes of unknown function, our results show that the interface specific gene families in S. hermonthica and O. aegyptiaca are enriched for genes of unknown function. Curiously, despite the relatively large amount of contaminant species in the transcriptomes of S. hermonthica and O. aegyptiaca and the relative lack of contamination in transcriptomes of T. versicolor, only Burkholderia are evenly represented across all 4 interface transcriptomes, further suggesting a role for this genus of soil bacteria in the interaction of parasitic Orobanchaceae and their hosts. Finally, we identified 11 parasite gene families that are shared between the interface transcriptomes of all three species and selected homologous unigenes by leveraging the entire catalog of PPGP data. The coding sequence of each homolog from each shared gene family was screened against a global PPGP read mapping experiment to identify parasite genes in these shared families that were differentially expressed in the interface or in the interface and haustorium. The differentially expressed gene families encode a cytokinin biosynthetic enzyme, the β-expansin family discussed in Chapter 3, an anion transporter whose relatives are involved in guard cell anion homeostasis, a peroxidase and a lateral organ boundary (LOB) domain transcription factor. These genes provide excellent candidates for further study including phylogenetic analyses aimed at discovering patterns of neo- and sub-functionalization of members of these gene families, as well as functional characterization to determine the specific roles that these genes play in parasite biology.

Results

Parasite host co-culture and microdissection of haustoria

To expand upon the Triphysaria versicolor sample set (grown on hosts Zea mays and Medicago truncatula) described in Chapter 3, Striga hermonthica was grown with Sorghum bicolor, and Orobanche (syn. Phelipanche) aegyptiaca with Arabidopsis thaliana, in clear polyethylene bag co-culture systems. Mature haustoria, indicated by a fast connection, were harvested at 8–10 days for T. versicolor compared to 3–5 days for the faster-maturing obligate parasites S. hermonthica and O. aegyptiaca. Freshly dissected haustoria from T. versicolor were oriented in the cryo-molds prior to flash freezing with the host root on the vertical axis, whereas S. hermonthica and O. aegyptiaca, owing to their relatively smaller size and greater numbers in individual replicate co-culture systems, were oriented with the host root (containing multiple haustoria) on the horizontal axis. For each parasite-host pair, ~110 regions of interest (ROIs) were harvested from 25 µm cryosections representing biological replicates (>8 haustoria). The larger haustoria of T. versicolor yielded a larger host-parasite interface (Figure 4.1, the average on M. truncatula was 54,910 µm² and on Z. mays was 56,079 µm²) compared to the smaller haustoria of S. hermonthica and O. aegyptiaca, which yielded average ROIs of 24,204 µm² and 22,637 µm², respectively. The average ROI area for S. hermonthica and O. aegyptiaca was roughly half of the average T. versicolor ROI area and the total RNA yield from the pooled micro-dissected interface tissues from S. hermonthica and O. aegyptiaca was similarly lower at 80–100 ng.

Anticipating that the S. hermonthica and O. aegyptiaca RNA samples were similar in purity to the T. versicolor LPCM samples, and to preserve as much RNA as possible, we immediately cleaned the RNA with the Zymo™ RNA Clean and Concentrator kit after RNA integrity (RIN > 6) was confirmed with the Bioanalyzer™. Consistent with the results from Chapter 3 for T. versicolor interface samples, the two-round amplification using the Message Amp™ II aRNA kit from Ambion™ (and the same Arabidopsis control RNA) yielded 50–100 µg of aRNA. Also consistent with the results in Chapter 3, and a previous report171, was a reduction in RNA fragment lengths as determined by Bioanalyzer™.

Sequencing and assembly statistics

Amplified interface RNA samples from S. hermonthica and O. aegyptiaca were sequenced on the Illumina Genome Analyzer IIx with an 83x83 paired-end protocol. After filtering and trimming reads for quality, the S. hermonthica interface data set contained 3.88 Gbp in 47,479,944 reads with an average trimmed read length of 81.7 bp (Table 4.1). The filtered and quality trimmed O. aegyptiaca interface data set was larger at 4.58 Gbp represented by 55,615,807 reads with an average trimmed read length of 82.3 bp (Table 4.1). Compared to the T. versicolor data sets (Table 4.1) the S. hermonthica and O. aegyptiaca interface data sets were larger (more reads) and of higher quality (longer average trimmed read length). Since read filtering prior to assembly was insufficient to remove host contamination, as reported in Chapter 3 with the T. versicolor interface assemblies, we chose instead to assemble reads filtered only for quality from the S. hermonthica and O. aegyptiaca interface data sets and then classify unigenes by source after assembly. The T. versicolor assemblies, which were filtered at the read level, were re-classified according to the modified scheme discussed below to facilitate comparisons between the interface transcriptomes from S. hermonthica, O. aegyptiaca and T. versicolor.

The first BLAST157 based classification identified interface unigenes with alignments greater than 95% pairwise identity and e-value <1e-5 to the extensively classified, comprehensive assembly162 (BigBuild) of the respective PPGP data for each species. This classification step identified the greatest single proportion of unigenes (47.8-79.5%) in the mixed tissue assembly for each interface transcriptome (Table 4.1) indicating a majority of tissue captured was of parasite origin. Between 15.7-26.1% of the interface unigenes were classified as host cDNAs.

After classification of unigenes with good alignments to the respective BigBuild and transcripts annotated in the respective host genomes176, the remaining unigenes were BLASTed against NCBI's non-redundant protein sequences database2 (NR) and then further classified with MEGAN138 to characterize the metagenome of each co-culture system. Any unigenes that were classified in Streptophyta, or that remained unassigned, were BLASTed against large collections of the respective host’s ESTs175 to screen for un-annotated host plant genes (absent from the genome reference cDNA set). The Arabidopsis genome is the premier plant genome and with the 10th generation annotation we expected to identify very few hits to Arabidopsis ESTs (Table 4.1) following a classification using the TAIR 10 cDNA set. In contrast, the first release S. bicolor genome was expected to have a less complete cDNA set and to yield more hits to ESTs that likely represent un-annotated host genes. The Arabidopsis EST screen yielded only 30 hits while the S. bicolor EST screen generated 1201 hits. This revised strategy slightly reduced the number of total host hits for T. versicolor grown on both hosts (compare to Table 3.1), yet the number of hits to unigenes in the latest T. versicolor BigBuild increased from 14,352 to 16,483 (15% increase) for the Medicago grown T. versicolor and 17,887 to 21,224 (18% increase) for Zea grown T. versicolor. These increases were seen with a drastic and concurrent drop in hits to other plants (Table 4.1 compare to Table 3.1), indicating that the classification in Chapter 3 was robust for identification of plant genes, and that leveraging more of the PPGP data (i.e. the latest version T. versicolor BigBuild) results in a unigene classification with more high confidence T. versicolor unigenes. The number of unigenes in the S. hermonthica interface transcriptome that were classified as host derived was double the number of host unigenes of the other three interface transcriptomes. The number of host derived unigenes in the O. aegyptiaca interface transcriptome was similar to the number of host unigenes in the T. versicolor interface transcriptomes. In total, between 16.5% and 28.7% of the interface transcriptome was host derived, which is consistent with the LPCM sampling strategy (Figure 4.1, see also Figure 3.1).
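The order of the classification steps described above can be summarized in a short Python sketch. It is a simplified illustration, not the scripts used in this study. The evidence for each unigene is assumed to be precomputed and passed in as a dictionary with hypothetical keys, the host cDNA and host EST screens are reduced to true/false flags, and MEGAN output is reduced to a single taxon label.

```python
def classify_unigene(evidence, min_identity=95.0, max_evalue=1e-5):
    """Assign a source label to one unigene using ordered screens.

    evidence (all keys optional, all names hypothetical):
      "bigbuild":    (percent_identity, evalue) of the best BigBuild alignment
      "host_cdna":   True if the unigene matched an annotated host transcript
      "megan_taxon": taxon label from MEGAN, or None if unassigned
      "host_est":    True if the unigene matched a host EST
    """
    # Step 1: alignment to the species' comprehensive BigBuild assembly.
    bigbuild = evidence.get("bigbuild")
    if bigbuild is not None:
        identity, evalue = bigbuild
        if identity > min_identity and evalue < max_evalue:
            return "parasite"

    # Step 2: transcripts annotated in the host genome.
    if evidence.get("host_cdna", False):
        return "host"

    # Step 3: MEGAN label; plant or unassigned unigenes go to the EST screen.
    taxon = evidence.get("megan_taxon")
    if taxon is None or taxon == "Streptophyta":
        if evidence.get("host_est", False):
            return "host (EST only)"
        return "unassigned" if taxon is None else "other plant"

    return taxon  # non-plant taxon, e.g. a bacterial or fungal genus
```

Running the screens in this fixed order means each unigene receives exactly one label, taken from the first screen it passes.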

MEGAN analysis of the interface transcriptomes

The metagenome of each interface transcriptome was classified by a BLASTx search to NR and further refined with MEGAN. The bitscore threshold was determined based on the analysis in Chapter 2 that revealed a threshold alignment bit score of 125 was sufficient such that the frequency of non-assignment was not substantially increased with more stringent alignment scores. The classification with MEGAN of interface unigenes that did not align to other parasite unigenes or host sequences revealed that the axenic culture condition for T. versicolor was quite clean compared to the polyethylene bag culture systems (Table 4.1, Figure 4.2A). Numerous fungi and bacteria derived unigenes are seen in the bag culture systems indicating a preponderance of non-plant genera in the bag co-culture system. Consistent with our observation that the genus Burkholderia is overrepresented in the interface transcriptome of T. versicolor (Chapter 3) we see that among non-plant genera, only the genus Burkholderia is evenly and strongly represented across all 4 interface transcriptomes presented here (Figure 4.2B). The hits in the NR database are not to the same Burkholderia genes, but the unigenes are highly similar (>95% pairwise nucleotide identity) to their hits in NR, suggesting that these are not horizontally transferred bacterial genes expressed by the parasite, but bona fide Burkholderia transcripts. The average length for Burkholderia unigenes is between 311 bp and 349 bp, further indicating that these are bacterial transcripts rather than genomic fragments that were carried through poly-A selection.

Magnoliophyta (syn. angiosperm, Figure 4.2A) was heavily populated indicating the presence of plant unigenes that could not be assigned to the respective host sources or respective BigBuilds and may represent parasite genes not expressed in other parasite tissues.

The number of unclassified unigenes is higher in S. hermonthica and O. aegyptiaca, indicating that the complexity of the mixed species transcriptome is higher. The scan of OrthoMCL DB and InterProScan (IPS) yielded results consistent with Chapter 3 in that only a very small number of unigenes were further classified (Table 4.1). The majority of IPS hits (53.1–71.9%) were predicted secretion signals or putative transmembrane domains. Altogether, the axenic co-culture systems of T. versicolor were less complex, which increased the proportion of classified unigenes to greater than 95%. For the bag-culture systems, the increased complexity likely contributed to a lower overall classification rate of ~90% for S. hermonthica and O. aegyptiaca.

Of the unclassified unigenes in S. hermonthica and O. aegyptiaca, 6,859 (96.9%) and 3,794 (94.0%), respectively, do not have read support and good protein level alignments to the other unclassified sequences in this study. The number is far lower for T. versicolor at 608 (83.5%) for Medicago grown T. versicolor and 515 (83.7%) for Zea grown T. versicolor. A three-way reciprocal best-hit search produced 9 good hits, all showing very high nucleotide pairwise identity (96.5–100%), with 2/3 having no hit in a relaxed (1e-1) blastx search of NR and 1/3 having hits to Burkholderia sequences with alignments scoring <125 (thus excluded from the MEGAN classification). Taken together this suggests that our classification is robust for classifying plant genes and that the large number of unclassified unigenes in S. hermonthica and O. aegyptiaca likely represents lab and species specific contamination in the bag co-culture systems and not uncharacterized plant genes.

Comparative annotation profiles of interface transcriptomes

To determine the extent and nature of overlap in the GO Slim profiles for S. hermonthica, O. aegyptiaca and T. versicolor (grown on the distantly related hosts Z. mays and M. truncatula) we identified Stage 6.1162 (leaves and stems or “shoots”) as a suitable reference transcriptome for each of the three species that would allow us to classify unigenes into three categories: 1) interface specific, 2) stage 6.1 specific, and 3) shared. By comparing the annotation profiles of equivalent transcriptome components we can verify that the unigene classification and de novo assembly produced similar results for equivalent transcriptomes while also discovering patterns in the processes underpinning parasitism in the Orobanchaceae.

The classification of each interface transcriptome identified between 18,119 and 31,808 unigenes that were of potential parasite origin (“Putative parasite with PlantTribes Ortho” in Table 4.1). Of these, 75–85% were assigned to PlantTribes 2.0173 Orthogroups (orthos = approximate gene families) which represent between 40.0–69.5% of the complete interface transcriptome assembly (Table 4.1). To determine if the assemblies were sufficient to represent each interface transcriptome and equivalent for comparison, a tag analysis of Ultra Conserved Orthologs230 (UCOs) was done using Vitis orthologs of Arabidopsis UCOs as determined from a search of the PlantTribes 2.0 database (Figure 4.3). The frequency of UCO tagging, via BLASTx to Vitis sequences, was similar across assemblies, ranging from 63.1% for Medicago grown T. versicolor to 81.7% for O. aegyptiaca, as was the distribution of best-hit unigene lengths (Figure 4.3). This result is consistent with a previous study26, though the stringency of our screen was more rigorous (1e-20 and HSP >45 amino acids). Despite the additional challenges of a low diversity cell population and inclusion of multiple out-crossed wild individuals, the proxy of UCO recovery indicated sufficiently complete interface transcriptome assemblies for global scale comparisons. Further, the best case scenario in Chapter 2 of ample, high quality data from a sequenced and isogenic model plant shows that the recovery of UCOs is not likely to exceed 90% in a de novo plant transcriptome assembly from a single tissue type (i.e. Arabidopsis young leaves).

The “Putative parasite with PlantTribes Ortho” unigene class in Table 4.1 was the representative unigene set for global transcriptome comparisons. The Stage 6.1 transcriptomes from S. hermonthica, O. aegyptiaca and T. versicolor were screened for unigenes with alignments of >95% nucleotide pairwise identity to unigenes in the respective BigBuilds to remove any potential contamination. Approximately 80-85% of each Stage 6.1 assembly was retained for comparison to the respective interface assembly. A majority of orthos were shared in the comparison of each interface transcriptome (Figure 4.4), yet a substantial number were stage-specific. The GO Slim annotation profiles of equivalent transcriptome components were tested for proportionality in GO Slim annotation terms by a Chi-Square test. The results of all comparisons showed disproportionality with p-values <0.001. Table 4.2 shows the categories with residuals <-4 (indicated by “-”) or >4 (indicated by “+”). This revealed that the differences were primarily in the proportions of orthos lacking GO Slim terms, “other metabolic processes” and “chloroplast”: S. hermonthica and O. aegyptiaca were enriched for orthos lacking GO Slim terms compared to T. versicolor, while the opposite held for “other metabolic processes” and “chloroplast”, where the obligate parasites were underrepresented compared to T. versicolor. This comparison between species shows that despite disproportionality, the differences were in just a few categories, indicating that generally the GO Slim annotation profiles were consistent between the interface transcriptomes of T. versicolor, S. hermonthica and O. aegyptiaca.
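To make the proportionality test concrete, the sketch below computes a Chi-Square statistic for a table of counts (rows = transcriptomes, columns = GO Slim categories) and flags cells by residual. It is an illustration under stated assumptions: the text does not say which residual was used, so Pearson residuals, (observed - expected) / sqrt(expected), are assumed here; the function names are hypothetical and the counts are invented. A p-value would be obtained by comparing the statistic to a Chi-Square distribution with the returned degrees of freedom.

```python
import math


def chi_square_residuals(table):
    """Return (statistic, dof, residuals) for a rows-by-columns count table.

    residuals[i][j] is the Pearson residual (O - E) / sqrt(E), where E is the
    count expected if all rows had the same category proportions.
    """
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n
            r = (obs - exp) / math.sqrt(exp)
            stat += r * r  # the statistic is the sum of squared residuals
            res_row.append(r)
        residuals.append(res_row)
    dof = (len(table) - 1) * (len(col_tot) - 1)
    return stat, dof, residuals


def flag(residual, cutoff=4.0):
    """'+' for residual > cutoff, '-' for residual < -cutoff, else ''."""
    if residual > cutoff:
        return "+"
    if residual < -cutoff:
        return "-"
    return ""


# Invented counts: two transcriptomes by three annotation categories.
counts = [[300, 500, 200],
          [150, 500, 350]]
stat, dof, res = chi_square_residuals(counts)
flags = [[flag(r) for r in row] for row in res]
print(round(stat, 2), dof, flags)
```

In this toy table the middle category is perfectly proportional, so only the outer two categories are flagged, which mirrors the situation described above where overall disproportionality is driven by a few categories.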

The differences in GO Slim annotation profiles between overlapping and distinct orthogroups within each species were dramatic compared to the differences observed between species. In Table 4.3 the results of the Chi-Square test reveal that the Stage 6.1 and interface GO Slim profiles are highly disproportionate (all p-values <<0.001), yet display strikingly similar patterns of disproportionality between species. For example, the GO Slim Function category “structural molecule activity” is overrepresented in the shared ortho component of the S. hermonthica and both T. versicolor interface transcriptomes while being underrepresented in all Stage 6.1 specific orthos. Additionally, the GO Slim Function category “transcription factor activity” is overrepresented in both stage specific ortho collections while “transporter activity” is underrepresented in the stage specific orthos. Similar consistent patterns are observed in several categories like the GO Slim Process categories “other metabolic process,” “transport” and “other biological processes.” For the GO Slim Component categories, the facultative parasite T. versicolor is overrepresented for the “chloroplast” category in Stage 6.1 while the “cytosol” component is underrepresented in the Stage 6.1 specific orthos. The most striking pattern here is consistent with the results of a similar analysis in Chapter 3, showing strong representation of orthos that lack GO Slim terms in the stage specific orthos with a concurrent weak representation for orthos lacking GO Slim terms in shared orthos.

Shared interface orthogroups of T. versicolor, S. hermonthica and O. aegyptiaca

The interface transcriptomes contain ~500-1000 orthos that are not shared with the respective Stage 6.1 transcriptome (Figure 4.4). The portion of each interface transcriptome that is derived from orthos shared between T. versicolor, S. hermonthica and O. aegyptiaca is likely to contain genes that are central to the function of the specialized cells residing deep within the haustorium. The comparison of interface specific orthos revealed that only 12 were shared between T. versicolor, S. hermonthica and O. aegyptiaca (Figure 4.5). Each interface unigene assigned to one of the 12 orthos was BLASTed into the respective BigBuild assembly to find high nucleotide pairwise identity hits (>95%). Sequences were grouped by ortho and species and were carefully aligned with extensive manual curation. Extensive BLAST searches into stage specific PPGP assemblies were done to recover the longest possible open reading frame (ORF), which occasionally extended the alignments (35-135 a.a. in orthos 6070, 6140, 6163 and 9416), but often returned near-perfect subsequences of 1) unigenes recovered in the first BLASTn search of each respective BigBuild, or 2) the original interface unigene. The best hit in the PlantTribes database (or Arabidopsis) was used to verify that at least one parasite sequence contained a full length ORF and that the translated ORF produced a good alignment. A summary of ortho descriptions is presented in Figure 4.5B. Full-length parasite unigenes were BLASTed against NR, producing hits with descriptions that were concordant with descriptions provided by the PlantTribes 2.0 annotation.
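The ortho-level comparisons used in this section (interface specific, Stage 6.1 specific, shared, and then interface specific in every species) reduce to set operations on orthogroup identifiers. The following is a minimal sketch with hypothetical function names and invented orthogroup identifiers; it does not reproduce the actual orthogroup lists.

```python
def classify_orthos(interface, stage61):
    """Split orthogroup IDs into the three categories used in the text."""
    interface, stage61 = set(interface), set(stage61)
    return {
        "interface_specific": interface - stage61,
        "stage61_specific": stage61 - interface,
        "shared": interface & stage61,
    }


def shared_interface_specific(per_species):
    """Orthogroups that are interface specific in every species.

    per_species maps a species name to (interface_orthos, stage61_orthos).
    """
    specific_sets = [
        classify_orthos(interface, stage61)["interface_specific"]
        for interface, stage61 in per_species.values()
    ]
    return set.intersection(*specific_sets)


# Invented orthogroup IDs, for illustration only.
species = {
    "T. versicolor": ({1, 2, 3, 4, 5}, {3, 4, 9}),
    "S. hermonthica": ({1, 2, 4, 6}, {4, 6, 7}),
    "O. aegyptiaca": ({1, 2, 3, 8}, {3, 8}),
}
print(classify_orthos(*species["T. versicolor"]))
print(shared_interface_specific(species))  # {1, 2}
```

Classifying within each species first and intersecting afterwards means an ortho must be absent from the Stage 6.1 transcriptome of all species to count as a shared interface specific ortho.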

Ortho 16138, which was not assigned a description based on the best hit to Papaya in PlantTribes 2.0, produced alignments to non-plant sources and plastid gene sequences when queried against NR. A BLASTx search of NR with the S. hermonthica interface unigene produced an alignment (85% pairwise identity, 8e-31 and bit score 115) to a Rhizobium etli protein sequence, and a BLASTn search of NCBI's nucleotide collection² (nt/nr) produced an alignment (85% nucleotide pairwise identity, 8e-112) to the genome sequence of Cellvibrio japonicus plus alignments to numerous uncultured soil bacteria. A BLASTn search of nt/nr with the T. versicolor interface unigene (T. versicolor unigenes from both hosts were identical) in ortho 16138 produced an alignment (1.8e-145, 97.7% nucleotide pairwise identity) with the Antirrhinum majus rpl2 plastid gene. A similar BLASTn search with the O. aegyptiaca interface unigene in ortho 16138 produced alignments (1.5e-93, 98.0% nucleotide pairwise identity) with the Niastella koreensis genome. Since the potential origins of unigenes from ortho 16138 are conflicting and the unigenes are likely not derived from the respective nuclear parasite genome, they were excluded from further analyses.

For each species, representative coding sequences were carefully chosen and extracted for each ortho from either the unigene containing the longest ORF, or consensus sequences that were generated based on nucleotide level alignments of high nucleotide pairwise identity (often 100%). These translated nucleotide sequences (Figure 4.6) from interface transcriptomes of T. versicolor, S. hermonthica and O. aegyptiaca were aligned to each other, resulting in pairwise identities ranging from 44.1% to 77.7%. In all cases at least one parasite unigene produced a full-length coding sequence (verified by alignment to the best hit in PlantTribes, see above), yet some alignments contained unigenes that were lacking the 5’ end of the coding sequence. This is likely due to the fragment length reduction observed in amplified RNA samples, which results in a 3’ bias, and not from insufficient read depth for assembly. In 2 cases the 3’ end of a unigene coding sequence lacked a stop codon. In the first case, the S. hermonthica representative in ortho 11528 seems only to lack the stop codon; in the second case, the S. hermonthica representative of ortho 12614 lacks roughly half of the coding sequence, and the missing sequence could not be recovered from any of the available PPGP assemblies. Reassuringly, in all cases at least one species provided a full length coding sequence with the others providing at least half of the coding sequence, while ~2/3 of the alignments were virtually full length for all 3 species (Figure 4.6).

Differential expression analysis of unigenes in the shared interface orthogroups

The analyses described above produced nucleotide level alignments that were useful to interpret read mapping (expression) evidence for genes in the 11 shared orthos. One of the primary challenges of RNA-Seq using de novo assembled transcriptomes is the increased complexity of the mapping reference compared to a non-redundant and contiguous cDNA collection from a sequenced plant genome. The extensive manual curation of each species and ortho specific nucleotide alignment revealed cases where a BigBuild unigene was virtually identical to the respective interface unigene and was easily chosen for the read level analysis (see below). Alternatively, cases where an interface unigene clearly joined 2 or more BigBuild unigenes, and vice versa, were observed. More complex cases were also observed, like the T. versicolor unigenes in ortho 6571 (Figure 4.7), where BLASTn searches using the interface unigenes produced good alignments to three BigBuild clusters. These were classified by their overall sequence identity to other members of the cluster or alignment. This approach resulted in 2 sequence groups, one consisting of unigenes TrVeIntZeamaGB1_12808, TrVeIntZeamaGB1_111912, TrVeIntZeamaGB1_6590, TrVeBC3_11223.2, and TrVeBC3_21385.1 (TrVeBC3_21385.2 was classified as a variant of TrVeBC3_21385.1 and had a very low expression signal, see Figure 4.7B) and the other populated by the remaining sequences (Figure 4.7). The former group was chosen since these are 97% identical to each other and show a stronger interface expression signal, suggesting that these represent a single transcript that is highly expressed at the interface.
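The grouping of unigenes by overall sequence identity was done here by manual curation. As a rough illustration of how such a rule could be automated, the sketch below groups sequences by single-linkage clustering on pairwise identity. The function name, the single-linkage choice, the 97% default threshold (borrowed from the within-group identity reported above) and the toy identifiers are all assumptions made for the example.

```python
def group_by_identity(names, identity, threshold=97.0):
    """Single-linkage grouping: link two sequences when identity >= threshold.

    identity maps (name_a, name_b) pairs to percent pairwise identity.
    Returns groups sorted by decreasing size, then alphabetically.
    """
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the group root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), pid in identity.items():
        if pid >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


# Invented identifiers and identities, for illustration only.
names = ["u1", "u2", "u3", "u4", "u5"]
identity = {
    ("u1", "u2"): 98.5,
    ("u2", "u3"): 97.2,
    ("u1", "u3"): 96.0,  # below threshold, but u1 and u3 are linked through u2
    ("u4", "u5"): 99.0,
    ("u3", "u4"): 80.0,
}
print(group_by_identity(names, identity))
# [['u1', 'u2', 'u3'], ['u4', 'u5']]
```

Single linkage tolerates fragmented unigenes that overlap only partially, at the cost of occasionally chaining distinct isoforms together, so manual inspection of the resulting groups would still be advisable.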

The nucleotide alignments described above were leveraged with the gene expression data from each PPGP stage specific library using the BigBuild sequences that represented the best ORF (Figure 4.6) to reconstruct the expression signal for genes in each ortho (Figure 4.8). After the gene expression signal for each representative was determined, we cross checked the BigBuild unigenes in the various alignments for differential expression using the DESeq package in R²³¹. Reassuringly, this analysis revealed that many of the reconstructed genes were composed of, or contained, BigBuild unigenes that were differentially expressed (DE) in the interface or interface and haustorium (Figure 4.8) compared to the other PPGP stages. The power for determining DE unigenes is greater in data sets with replication, like S. hermonthica, resulting in a greater number of DE genes. Thus we imposed the criterion that for each ortho a candidate gene must be DE in 2 of 3 species to be considered DE and shared. This criterion selected five genes (Figure 4.5B and Figure 4.6, indicated with an asterisk): APIPT1, an isopentenyl transferase; SLAH, an S-type anion channel; EXPB, a β-expansin (which includes the T. versicolor TvEXPB1 (Chapter 3)); RCI3, a peroxidase; and a LOB domain transcription factor. Curiously, five of the 11 genes in the 12 orthos are putative transcription factors, four of which are DE in at least one species considered here (Figures 4.5 & 4.8).
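The "DE in 2 of 3 species" rule can be stated compactly in code. The sketch below is illustrative only: the function name is hypothetical, the per-species DE calls are assumed to have been made beforehand (for example with DESeq), and the ortho labels and calls are invented rather than taken from Figure 4.8.

```python
def shared_de_orthos(de_calls, min_species=2):
    """Return orthos whose candidate gene is DE in at least min_species species.

    de_calls maps an ortho label to {species: True/False}.
    """
    return sorted(
        ortho for ortho, calls in de_calls.items()
        if sum(calls.values()) >= min_species
    )


# Invented calls, for illustration only.
calls = {
    "ortho_A": {"T. versicolor": True, "S. hermonthica": True, "O. aegyptiaca": False},
    "ortho_B": {"T. versicolor": False, "S. hermonthica": True, "O. aegyptiaca": False},
    "ortho_C": {"T. versicolor": True, "S. hermonthica": True, "O. aegyptiaca": True},
}
print(shared_de_orthos(calls))  # ['ortho_A', 'ortho_C']
```

Requiring agreement in two species rather than three accommodates the lower statistical power of the unreplicated data sets noted above.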

Discussion

The proven workflow that allowed us to interrogate the host-parasite interface cell transcriptome of T. versicolor during interactions with two hosts was leveraged to gain insight into broader patterns of gene expression in the parasitic Orobanchaceae. This work reveals that highly tissue specific transcriptomics helps us begin to understand the role of poorly characterized plant genes. Furthermore, the comparative power of the parasitic members of the Orobanchaceae chosen for the PPGP was leveraged to identify 11 gene families with members that are active at the parasite-host interface and differentially expressed in tissues critical to the parasitic lifestyle.

Tissue specific transcriptomes are rich in genes of unknown function

Laser microdissection was used to isolate and capture cells from the parasite-host interface of T. versicolor in our previous study presented in Chapter 3. The sample set was expanded to include similar parasite interface cell-enriched transcriptomes from S. hermonthica and O. aegyptiaca grown on model hosts with sequenced genomes, S. bicolor and A. thaliana, respectively. As a point of comparison, the Stage 6.1 transcriptome (generated from leaves and stems) from each species was used to identify gene families that were active in vegetative phases of growth, those that were active at the interface and those that were shared. The shared gene families are likely active in a wide variety of cell types, but the gene families that were stage-specific were each enriched for genes of unknown function. The Stage 6.1 transcriptomes contain a more complex mixture of cell types compared to the interface transcriptomes and similarly contain more gene families of unknown function than the interface transcriptomes.

In a recent review, Wang et al. 2012⁷³ note that very few global gene expression studies at high spatiotemporal (i.e. cell type) resolution have been done. Moreover, the generation of high spatiotemporal resolution transcriptomes is a prerequisite for systems level biology; the activity of individual cells through space and time must be known and integrated for an organism-level view of the transcriptional landscape. Our work joins the few efforts to date (April 2013) that combine very fine scale tissue sampling and deep transcriptome sequencing via SGS.

Laser microdissection of the shoot apical meristem (SAM) of maize¹⁸⁷ coupled with 454 sequencing revealed that tissue specific sequencing of the SAM, compared to meristem enriched samples, was powerful for investigation of rare genes as well as the discovery of new genes not present in the annotated maize genome. A survey of maize leaf development, using LM and Illumina sequencing, provided the basis for a systems level approach for the study of photosynthesis and also showed that a large number of transcription factors were differentially expressed in mesophyll and bundle sheath cells¹⁸⁹. Matas et al. 2011¹⁷² microdissected developing tomato fruits and showed that samples enriched for certain cell types, e.g. epidermis, were enriched for genes of unknown function. The results of our analyses are concordant with these previous reports and represent the first cross-species comparison of equivalent laser dissected tissues in plant systems that lack sequenced genomes.

The metagenome of co-cultured parasitic Orobanchaceae

Using MEGAN we classified components of each interface transcriptome that were not highly similar to other parasite sequences (PPGP BigBuilds) or host gene sequences (cDNAs). We then filtered unigenes classified into Streptophyta, plus those that were unassigned, for unigenes with high similarity to host derived ESTs. The remaining unigenes comprise the non-parasite, non-host metagenome of each co-culture system. Compared to the axenic approach with T. versicolor, the polyethylene bag system in which S. hermonthica and O. aegyptiaca were grown contained far more non-plant species, namely fungi and bacteria. One surprising exception in T. versicolor was the presence of numerous unigenes classified into the Burkholderia genus, a genus which was also strongly represented in the interface transcriptomes of S. hermonthica and O. aegyptiaca. That bacteria were present in the bag co-culture systems is not surprising given the abundance of other non-plant species, but that the genus Burkholderia was strongly represented in the axenic co-culture system of T. versicolor (see Chapter 3), and equally represented across all three species, suggests a role for Burkholderia species in the interaction of parasitic Orobanchaceae and their hosts. The likelihood that the presence of Burkholderia transcripts was due to contamination from seeds is low since the source material for these experiments came not only from different labs, but different states and in some cases, different continents. As noted in Chapter 3, the seeds for T. versicolor and hosts were aggressively surface sterilized. The seeds for S. hermonthica and O. aegyptiaca and hosts were also surface sterilized prior to co-culture, reducing the possibility of a contaminant from seed sources. Furthermore, as noted in Chapter 3, the unigene size (<1000 bp) is indicative of an RNA transcript carried through poly-A selection rather than genomic contamination.

The extensive non-plant component of the interface transcriptomes of S. hermonthica and O. aegyptiaca was classified into only a few shared clades, indicating the source of non-plant contamination was likely the respective lab environment. Of the non-classified unigenes in S. hermonthica and O. aegyptiaca, the vast majority (>94%) are unique to each transcriptome. While the proportion of unique, unclassified sequences is similar in T. versicolor at ~85%, the numbers are far lower at ~500-600, compared to ~4,000-7,000 in S. hermonthica and O. aegyptiaca. That the non-plant proportion of the interface transcriptomes in S. hermonthica and O. aegyptiaca is much higher than in T. versicolor, considered with the concurrent increase in unique and unclassified unigenes, leads us to conclude that most of the unclassified unigenes were likely of non-plant origin. This has several implications for the analysis of a de novo assembled transcriptome. First, the proportion of reads representing the target transcriptomes (i.e. parasite and host) is reduced, making the effective parasite interface transcriptome smaller. Second, the efforts to classify unigenes are complicated by the increase in potential transcript sources. Third, the assembly statistics are skewed since the effective transcriptome complexity is exaggerated. For instance, that the S. hermonthica transcriptome assembly yielded >60,000 unigenes may have indicated a poor de novo assembly with extensive Type I error (see Chapter 2), though after extensive unigene classification, the number of parasitic plant unigenes is closer to the expected number of expressed parasite genes. Overall, the axenic conditions in which T. versicolor was grown allowed us to identify non-plant organisms that may be playing a role in the interaction of parasitic plants with their hosts. The polyethylene bag co-culture system introduces an excess of non-plant organisms that serve to complicate global gene expression analyses and mask potentially relevant signals (e.g. Burkholderia transcripts).

Interface transcriptomes of parasitic Orobanchaceae have similar functional profiles

To compare the interface transcriptomes of T. versicolor, S. hermonthica and O. aegyptiaca we first determined if each transcriptome was sufficiently representative by determining what proportion of UCOs were tagged in each data set and the length distribution for parasite UCOs. This approach has been used to gauge the completeness of an assembly using the successful detection of this highly conserved gene set as a proxy for the whole transcriptome²⁶. Indeed the coverage of UCOs in the Arabidopsis young leaf transcriptome in Chapter 2 was one of the most informative metrics for superior assemblies. Here we used stricter criteria than Der et al. 2011²⁶ by requiring the HSP of the BLASTx alignment to exceed 45 amino acids at an e-value threshold of ≤1e-20. The results of this analysis show that the tagging frequency and unigene length distribution of all interface transcriptomes were similar, with the larger datasets for S. hermonthica and O. aegyptiaca affording slightly higher unigene lengths for more UCOs than the pair of T. versicolor transcriptomes. The frequency of interface UCO hits was lower than the results presented in Chapter 2 and those reported by Der et al. 2011²⁶. This is likely due to the highly cell specific nature of the interface cell transcriptomes compared to the whole leaf transcriptome presented in Chapter 2 and the whole plant normalized 454 transcriptome of Pteridium²⁶. Overall, this analysis shows that the interface transcriptomes are equivalent for global scale comparison and that de novo assembly was successful in reconstruction of a majority of interface transcripts.

Of the putative parasite unigenes (Table 4.1) ~75-85% were assigned to PlantTribes 2.0 Orthogroups. The Chi-Square test showed that the obligate parasites had fewer unigenes with Chloroplast GO Slim component terms, which is consistent with a previous report¹³⁹ showing that, compared to T. versicolor, S. hermonthica was underrepresented for photosynthesis pathway genes and O. aegyptiaca was completely deficient for the same genes. This test also showed that the obligate parasites were overrepresented for unigenes lacking GO Slim terms. For the other GO Slim terms the differences were less dramatic, indicating that the transcriptome-wide functional profiles of the interface transcriptomes were similar. When we tested the overlapping and unique transcriptome components within each species for proportionality the differences were much more dramatic, yet the GO Slim terms that showed strong disproportionality were remarkably consistent between species (Table 4.3). For instance, unigenes in stage specific orthos were generally underrepresented for GO Slim “transport” process terms while unigenes in the shared orthos were overrepresented for this term. The GO Slim function term “transcription factor” showed an overrepresentation in the stage specific assemblies, which is consistent with the results from Chapter 3.

The consistency of the GO Slim term comparisons indicates that transcriptome wide gene family representation is conserved across T. versicolor, S. hermonthica and O. aegyptiaca. The most striking pattern, which persists after adding the S. hermonthica and O. aegyptiaca interface transcriptomes to those presented in Chapter 3, is the consistent overrepresentation of unigenes lacking GO Slim terms in the stage specific assemblies. This indicates that stage specific transcriptomes are enriched for genes of unknown function and represent a frontier for laying the foundation of systems biology approaches. As noted in Wang et al. 2012⁷³, a prerequisite for any systems approach is to understand the biology of individual cells in their natural context. Since the majority of global transcriptome analyses are on a homogenized mix of several cell types, the signals of genes whose activity is highly cell and context specific are diluted. This work shows that increased spatial resolution sampling is a step towards elucidating the roles of genes of unknown function. Furthermore, ultra high-resolution techniques like laser microdissection can isolate genes of unknown function into smaller groups, as indicated by the smaller number of interface specific orthos compared to Stage 6.1 specific orthos (Figure 4.4). These results are consistent with findings in sequenced model systems¹⁷²,¹⁸⁷ and demonstrate that global transcriptome analysis of highly tissue specific samples in non-model systems will be useful to functionally characterize plant genes of unknown function.

Discovery of genes in shared interface orthogroups

After defining a set of interface specific orthogroups for each species we then searched for shared interface specific orthogroups. This search yielded 12 orthos (or approximate gene families) that were represented by unigenes in each interface transcriptome, though one of these, ortho 16138, was likely not a nuclear parasite gene. After selecting representative genes by careful manual curation and an exhaustive search of PPGP assemblies¹⁶², we cross checked a global unigene expression analysis (see methods) to identify five genes in these shared orthos that were differentially expressed in the interface or interface and haustorium in each species.

The alignment of unigenes in interface shared orthos with high identity matches in other PPGP builds sometimes revealed a single full length transcript with perfect subsequences, but at other times revealed a complex mixture of fragmented transcripts, subsequences and isoforms. To resolve these issues for this study it was feasible to manually align and classify these complex unigene collections. However, for a transcriptome wide consolidation effort we identified useful criteria for resolution of complex collections of unigenes into representative consensus transcripts. For a majority of these genes a search of all PPGP assemblies typically resulted in a collection of unigenes for which there was a single best (contained a full length coding sequence) unigene with numerous high identity subsequences. Though the best unigene was not always from the latest and largest assembly (i.e. BigBuild), a simple summation of reads from the respective BigBuild unigenes (effectively subsequences) would be sufficient to reconstruct the expression signal for a given gene. In other more complex scenarios like the case of ortho 6571 (Figure 4.7), simple pairwise identity rules could classify the members of a cluster (Trinity representatives plus subsequences and isoforms) into sub-clusters. The determination of relative expression of isoforms would require a more sophisticated approach since often the alternative forms have a mixture of divergent and identical regions (i.e. splice forms) that would have to be divided among sub-clusters and subsequences.
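The "simple summation of reads" suggested above for reconstructing a gene-level expression signal can be sketched as follows. This is a minimal illustration with a hypothetical function name and invented identifiers and counts; it sums raw read counts only and assumes any normalization across libraries would be applied separately.

```python
def reconstruct_expression(gene_members, read_counts):
    """Sum reads over the unigenes assigned to each gene, per library.

    gene_members maps a gene label to the BigBuild unigenes treated as its
    subsequences; read_counts maps a library name to {unigene: read count}.
    Unigenes with no reads in a library contribute zero.
    """
    return {
        library: {
            gene: sum(counts.get(unigene, 0) for unigene in members)
            for gene, members in gene_members.items()
        }
        for library, counts in read_counts.items()
    }


# Invented identifiers and counts, for illustration only.
gene_members = {"gene_X": ["bb_1", "bb_2"], "gene_Y": ["bb_3"]}
read_counts = {
    "interface": {"bb_1": 120, "bb_2": 30, "bb_3": 5},
    "stage_6.1": {"bb_1": 2, "bb_3": 40},
}
signal = reconstruct_expression(gene_members, read_counts)
print(signal)
# {'interface': {'gene_X': 150, 'gene_Y': 5}, 'stage_6.1': {'gene_X': 2, 'gene_Y': 40}}
```

This approach is adequate only when the member unigenes are true subsequences of one transcript; as noted above, isoforms with shared and divergent regions would need reads to be apportioned rather than summed.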

The T. versicolor unigenes in ortho 6571 also seemed to be linked to another cluster (classified into a different ortho) by a sequence which had high identity alignments to terminal sequences in both clusters. This likely resulted from adjacent genes whose transcripts overlap, causing a Type II error (see Chapter 2). While this may not be a common occurrence, it may complicate efforts to leverage other PPGP stage specific assemblies to discover full-length transcripts. Strand specific sequencing would reduce the frequency of this issue.

The functions of the genes in the shared families are diverse, yet five are putative transcription factors. Those that are differentially expressed include only one transcription factor, the putative LOB gene. This gene seems to be interface specific in S. hermonthica and O. aegyptiaca and more broadly expressed in all haustorial tissues in T. versicolor. The lateral organ boundaries domain-containing genes seem to be a class of plant specific transcription factors²³²,²³³. The loss of function mutant of a LOB domain containing protein AS2 in Arabidopsis thaliana produced numerous phenotypes including mis-formed leaves and vasculature as well as abnormal expression of KNOX class genes, which play a role in leaf development²³⁴. The role of this transcription factor at the interface is unknown, though determining the finer spatiotemporal gene expression patterns of this gene will shed light on its role in the haustorium of parasitic Orobanchaceae. The APIPT1 gene follows a similar expression pattern to the LOB domain gene in that it seems to be interface specific in S. hermonthica and O. aegyptiaca, but more broadly expressed in the haustorium of T. versicolor. Isopentenyl transferase genes function in cytokinin biosynthesis²³⁵ and several reports have shown that hormone fluxes play a role in parasite-host interactions⁹⁷,¹⁷⁰,²³⁶. Recently it has been shown that a cytokinin gradient plays a role in SAM maintenance by interacting with the transcription factor WUSCHEL (WUS)²³⁷. It is an intriguing possibility that the differentially expressed LOB domain transcription factor and cytokinin biosynthesis gene, having similar expression patterns across the various life stages of the parasitic Orobanchaceae, play a role together in haustorium development and/or maturation.

We have already discussed the potential role of members of the β-expansin family in the haustorium of T. versicolor in Chapter 3. The finding that homologs of this gene are not only differentially expressed in a host specific manner in T. versicolor (Chapter 3) but that they are differentially expressed at the interface (relative to other PPGP stages – see methods) of all three species considered here is interesting since half of the hosts in this study are dicots, not monocots, which may suggest a novel role for β-expansins.

The putative transporter in ortho 6140, with a best BLAST hit to SLAH-1 in tomato, is related to genes involved in guard cell anion homeostasis²³⁸. The role of this gene may be to regulate the flux of charged particles between host and parasite. Little is known of the specific role of such transporters in parasitic plant biology, though a role for transporters of any type is anticipated at the interface due to the presence of cells specialized for transport (see Introduction). The fifth differentially expressed shared gene we identified was a homolog of the RARE COLD INDUCIBLE 3 gene in Arabidopsis. This peroxidase generates reactive oxygen species (ROS) in response to potassium deficiency²³⁹. Peroxidases are thought to contribute to ROS production using H₂O₂ to catalyze the reduction of various substrates²³⁹. A central role for peroxidases has been proposed in semagenesis of haustorium induction signals in parasitic plants⁹⁶,¹²⁹ (see Chapter 1). Though the specific role of a peroxidase after haustorium maturation is unclear, ROS are common signaling intermediates in numerous plant processes²⁴⁰.

Conclusion

Comparative de novo transcriptomics in three parasitic Orobanchaceae has facilitated the discovery of gene families that play a role at the parasite-host interface. Furthermore, these gene families are not detected above ground and contain genes that are differentially expressed in tissues critical to the parasitic lifestyle. By sampling in a highly tissue specific manner we stand to functionally isolate genes of unknown function and begin to build the foundation of a systems level approach to understanding the biology of plants that are optimal rather than convenient in the pursuit of questions of broad interest to the plant community.

Figures and Tables

Figure 4.1 LPCM dissection strategy. Preparative cryo-sections of A) T. versicolor on Medicago, B) S. hermonthica on S. bicolor and C) O. aegyptiaca on A. thaliana. The sampling strategy (white outlined area = ROI) was to capture tissue enriched for intact parasite interface cells, thus a portion of the host interface was captured with the parasite tissue. T. versicolor haustoria were oriented individually with the host root on the vertical axis (haustorium cross section) since immediately adjacent connections were rare due to the co-culture strategy where ~20 parasites were placed in a plate with 1-3 host plants. The smaller dimensions of S. hermonthica and O. aegyptiaca made orientation of individual haustoria impractical. Additionally, many more (>100 germinating parasites) were “painted” onto host roots, allowing multiple attachments to occur in a small space. Thus host roots with multiple parasites were harvested and oriented on the horizontal axis in the cryo-mold, resulting in longitudinal haustorium sections.

Parasitic plant | T. versicolor | T. versicolor | S. hermonthica | O. aegyptiaca
Host plant | M. truncatula | Z. mays | S. bicolor | A. thaliana
Sequencing
Quality filtered and trimmed reads | 27,325,845 | 26,947,737 | 47,479,944 | 55,615,807
Quality filtered and trimmed Gbp | 1.88 | 1.82 | 3.88 | 4.58
Mean trimmed read length | 68.8 | 67.4 | 81.7 | 82.3
Assembly
Unigene count | 26,709 | 28,126 | 60,953 | 40,003
Mbp assembled | 12.25 | 12.77 | 29.02 | 19.59
Min/Max length | 197/3,265 | 197/3,115 | 197/6,904 | 197/6,377
N50 length | 536 | 525 | 571 | 598
Classification
Host cDNA | 4,196 (15.7%) | 4,505 (16.8%) | 15,928 (26.1%) | 6,579 (16.4%)
Host EST | 3,462 (13.0%) | 332 (1.2%) | 1,201 (2.0%) | 30 (0.1%)
Non-plant (MEGAN + OrthoMCL) | 144 + 60 (0.8%) | 55 + 16 (0.3%) | 3,920 + 1,015 (8.1%) | 1,159 + 1,071 (5.6%)
PPGP Parasite Reference (BigBuild) | 16,483 (61.7%) | 21,224 (79.5%) | 29,161 (47.8%) | 25,640 (64.1%)
Other plant (MEGAN + OrthoMCL) | 743 + 75 (3.1%) | 580 + 67 (2.3%) | 634 + 142 (1.3%) | 229 + 84 (0.8%)
Putative parasite with PlantTribes Ortho | 15,481 (58.0%) | 19,539 (69.5%) | 24,407 (40.0%) | 20,287 (50.7%)
Putative parasite (BigBuild or OrthoMCL plant Ortho or non-host plant or Plant Tribes Ortho) | 18,119 (67.8%) | 22,603 (80.4%) | 31,808 (52.2%) | 27,128 (67.8%)
IPS, signal peptide, or transmembrane domain | 204 (0.8%) | 192 (0.7%) | 2,106 (3.5%) | 1,465 (3.6%)
Unclassified | 728 (2.7%) | 615 (2.2%) | 7,081 (11.6%) | 4,036 (10.1%)

Table 4.1 Summary of sequencing, assembly and classification statistics for interface transcriptomes. The parasitic plant and host plant pair are indicated in the top two rows. Read level statistics were calculated after quality filtering, to remove low quality reads, and quality trimming, to remove low quality bases from reads that passed the quality filter. Fields in italics are not mutually exclusive, but subsets of the field below – either “Putative parasite” list or “Unclassified”. The “Putative parasite with PlantTribes Ortho” field contains the unigene count for parasite unigenes considered in the Orthogroup level comparisons. BLAST based classifications (PPGP Parasite reference BigBuild, Host cDNA and Host EST) were done according to a 95% pairwise identity threshold. “Unclassified” refers to unigenes that cannot be classified by BLAST to the respective Parasitic Plant Genome Project Big Build, Host cDNA or EST, or a non-plant source by either BLASTx to NR or query of OrthoMCL db and are not assigned to a Plant Tribes Orthogroup.
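The assembly rows of Table 4.1 (unigene count, Mbp assembled, Min/Max length, N50 length) can be derived from a list of unigene lengths. The following is a minimal sketch using the conventional definition of N50; the function names are hypothetical and the dissertation does not state which tool produced these statistics.

```python
def n50(lengths):
    """Return the N50: the length L such that unigenes of length >= L
    together contain at least half of all assembled bases."""
    if not lengths:
        raise ValueError("no unigene lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def summarize(lengths):
    """Collect the assembly statistics reported in a table such as Table 4.1."""
    return {
        "unigene_count": len(lengths),
        "mbp_assembled": sum(lengths) / 1e6,
        "min_length": min(lengths),
        "max_length": max(lengths),
        "n50": n50(lengths),
    }
```

Calling `summarize` on the lengths of one interface assembly would yield one column of the Assembly block, assuming lengths are counted in base pairs.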


Figure 4.2 MEGAN analysis of the interface transcriptomes of T. versicolor, S. hermonthica and O. aegyptiaca. Unigenes that were not classified as parasite or host based upon BLAST searches of the PPGP BigBuilds or host cDNA and EST sequences were classified by the best hit in NR with an e-value <1e-5 and an alignment bit score of 125 or better. A) The axenic co-culture system used for T. versicolor produced few unigenes attributable to non-plant origins whereas the bag co-culture system used for S. hermonthica and O. aegyptiaca produced unigenes that were assigned to numerous bacterial and fungal origins. The Viridiplantae clade is exploded for detail. B) Of the few non-plant unigenes detected in T. versicolor interface transcriptomes only the genus Burkholderia was strongly represented in both (arrow). The genus Burkholderia is represented with similar frequency in the interface transcriptomes of S. hermonthica and O. aegyptiaca as well (arrow). The Bacteria clade is exploded for detail.


Figure 4.3 Ultra Conserved Orthologs tagged in interface transcriptome assemblies. From each interface assembly the parasite UCOs were determined by best BLAST hit (BLASTx e-value <1e-20, >45 residues in the HSP) to Vitis genes in the UCO orthos and are reported as a percentage of Vitis UCOs (of 404). The frequency of unigene lengths is plotted in orange for each set of parasite UCOs compared to the whole assembly unigene length distribution.


Figure 4.4 Orthogroup classification Venn of Interface and Stage 6.1 parasite transcriptomes. The distinct and overlapping orthogroups represented in the “Putative parasite with PlantTribes Ortho” unigene class and filtered Stage 6.1 unigenes are shown.


Columns: Shoots (O, S, TM, TZ) | Shared (O, S, TM, TZ) | Interface (O, S, TM, TZ)
[Symbols are listed per row in left-to-right order; the column position of each symbol is not preserved in this text rendering.]

GO Slim Function
DNA or RNA binding
hydrolase activity
kinase activity
nucleic acid binding
nucleotide binding
other binding: +
other enzyme activity: -
other molecular functions: -
protein binding
receptor binding or activity
structural molecule activity
transcription factor activity: +
transferase activity
transporter activity
No GO Function: + + - + - - +

GO Slim Component
other cellular components: +
cell wall
chloroplast: - -
cytosol
ER
extracellular
Golgi apparatus
mitochondria
nucleus
other cytoplasmic components
other intracellular components
other membranes: - -
plasma membrane
plastid
ribosome
No GO Component: + + - + - +

GO Slim Process
cell organization and biogenesis
developmental processes
DNA or RNA metabolism
electron transport or energy pathways
other biological processes: +
other cellular processes
other metabolic processes: - - +
protein metabolism
response to abiotic or biotic stimulus
response to stress
signal transduction: -
transcription
transport
No GO Process: + + - - + - - - +

Table 4.2 GO Slim Chi-Square test results of equivalent transcriptome components. Residuals <4 are represented by “-” and >4 by “+”. O = O. aegyptiaca, S = S. hermonthica, TM = Medicago grown T. versicolor and TZ = Zea grown T. versicolor. Shoots = Stage 6.1 unique PlantTribes Orthogroups, Shared = Stage 6.1 and Interface shared PlantTribes Orthogroups and Interface = Interface unique PlantTribes Orthogroups. Each panel consists of 3 tests separated by a darker line. For instance, the top left panel section contains results from a test on GO Slim Function categories in Shoot specific orthos for each of the 4 interface transcriptomes.


Columns: Shoots | Shared | Interface; within each column, symbols appear in the order OrAe, StHe, TrVeMed, TrVeZea
[Symbols are listed per row in left-to-right order; symbol colors and the column position of each symbol are not preserved in this text rendering.]

GO Slim Function
DNA or RNA binding: + - -
hydrolase activity: - - -
kinase activity: -
nucleic acid binding: -
nucleotide binding
other binding: + + + + - - - +
other enzyme activity: - - + -
other molecular functions
protein binding: - -
receptor binding or activity
structural molecule activity: - - - - + + +
transcription factor activity: + + - + + +
transferase activity: - - + -
transporter activity: - - - + + + - - -
No GO Function: + + + + - - - - + + + +

GO Slim Component
other cellular components: -
cell wall
chloroplast: + + - - - -
cytosol: - - - - + +
ER: - - - + + -
extracellular: +
Golgi apparatus: +
mitochondria: - - - + +
nucleus: +
other cytoplasmic components: - - - - + +
other intracellular components
other membranes: - +
plasma membrane
plastid
ribosome: - - - - + +
No GO Component: + + + + - - - - + + + +

GO Slim Process
cell organization and biogenesis: -
developmental processes: + -
DNA or RNA metabolism: +
electron transport or energy pathways: +
other biological processes: + + + + - - -
other cellular processes: - - - + -
other metabolic processes: - - - - + + + + - -
protein metabolism
response to abiotic or biotic stimulus
response to stress
signal transduction: -
transcription
transport: - - - - + + + - - -
No GO Process: + + + + - - - - + + + +

Table 4.3 GO Slim Chi-Square test between overlapping and distinct orthos in each interface transcriptome. Residuals <4 are represented by “-” and >4 by “+”. Red symbols represent O. aegyptiaca (OrAe) test results. Orange symbols represent S. hermonthica (StHe) test results. Green symbols represent M. truncatula grown T. versicolor (TrVeMed) test results. Teal symbols represent Z. mays grown T. versicolor (TrVeZea) test results. Shoots = Stage 6.1 unique PlantTribes Orthogroups, Shared = Stage 6.1 and interface shared PlantTribes Orthogroups and Interface = Interface unique PlantTribes Orthogroups.


Figure 4.5 The interface specific Orthogroups of T. versicolor, S. hermonthica and O. aegyptiaca. A) Venn diagram showing distinct and overlapping orthogroups from interface specific orthogroups in each interface transcriptome. B) Summary table of best descriptions for representatives of the 12 shared interface Orthogroups. tf = transcription factor; * = differentially expressed in the haustorium and/or interface in 2 of the 3 species.


Figure 4.6 Protein alignments of representatives of the 12 interface shared orthogroups. Interface unigenes from each interface transcriptome that were assigned to the 12 shared orthos were translated and aligned. The sequence order is the same in each alignment: O. aegyptiaca, T. versicolor and S. hermonthica. Agreements to the consensus sequence are highlighted. Interface unigenes were used to search the respective BigBuilds for unigenes with high nucleotide pairwise identity. Where alignments were not full length, stage specific PPGP assemblies were searched to find unigenes with high nucleotide pairwise identity to unigenes used to populate the alignments. Where a single unigene did not contain the longest open reading frame in a given alignment, consensus sequences were extracted, translated and aligned with unigenes/consensus sequences from each species. Orthogroups are listed at left. *Differentially expressed in 2 of 3 species.


Figure 4.7 Nucleotide alignment of a collection of T. versicolor unigenes with good alignments to interface unigenes in ortho 6571. A) The T. versicolor BigBuild sequences (prefix TrVeBC3) populating this alignment were identified by a BLASTn search using interface sequences (prefix TrVeInt). By classifying sequences based upon pairwise nucleotide identity, 2 distinct groups arise consisting of sequences 1-5 and 6-10 with 11 being classified as a subsequence of 10 (see B). Disagreements to the consensus are highlighted. B) Protein alignment of sequences 10 and 11 in “A”.


Figure 4.8 Stage specific expression patterns of representative genes in the 12 orthos for each interface transcriptome. White bars = differential expression in at least 2 of the 3 species. Red outlined bars = DE in the species shown. A = T. versicolor plot with Medicago grown interface; B = T. versicolor plot with Zea grown interface; C = S. hermonthica; D = O. aegyptiaca; * = 500-1500 FPKM; ** = >1500 FPKM; “hetero” = cultured with host; “auto” = cultured without host. FPKM is on the y-axis, stages on the x-axis and orthos on the z-axis.

Materials and Methods

Growth of plant material

Triphysaria versicolor

T. versicolor was not re-grown for this study. See “Growth of plant material” in the Materials and Methods section of Chapter 3.

Striga hermonthica on Sorghum bicolor

S. hermonthica (local landrace collected in Mokwa, Nigeria) was grown on Sorghum bicolor (local cultivar collected in Kano, Nigeria) as described in Fernández-Aparicio et al. 2013 [241]. At 72 hr to 120 hr after inoculation, parasites with vascularized haustoria (fast attachments) were identified under a stereomicroscope. With fine forceps and a scalpel, portions of the host root with multiple S. hermonthica seedlings were excised. The excised tissue was rapidly oriented in Shandon Cryomatrix™ (Thermo Scientific) dispensed in 10 mm x 10 mm x 5 mm Cryomold™ (Tissue-Tek) molds such that the dissected tissue was oriented with the host root on the horizontal axis adjacent to the bottom of the mold. The samples were quickly frozen on dry ice and stored at −80°C.

Orobanche (syn. Phelipanche) aegyptiaca on Arabidopsis thaliana

O. aegyptiaca seeds were provided by Dr. Y. Kleifeld, Newe Ya’ar Research Center, ARO, Israel. O. aegyptiaca and Arabidopsis thaliana (Col-0) were grown as described in Westwood 2000 [242]. At 72 hr to 120 hr after inoculation, parasites with vascularized haustoria (fast attachments) were identified under a stereomicroscope. With fine forceps, fine scissors and a scalpel, portions of the host root with multiple O. aegyptiaca seedlings were excised. The excised tissue was rapidly oriented in Shandon Cryomatrix™ (Thermo Scientific) dispensed in 10 mm x 10 mm x 5 mm Cryomold™ (Tissue-Tek) molds such that the dissected tissue was oriented with the host root on the horizontal axis adjacent to the bottom of the mold. The samples were quickly frozen on dry ice and stored at −80°C.

Cryo-sectioning, dehydration and Laser Pressure Catapult Microdissection

Frozen, embedded seedlings attached to hosts were prepared for LPCM as described in Chapter 3. Because S. hermonthica and O. aegyptiaca were oriented differently than T. versicolor, with the tissues of interest closer to the initially cut surface of the embedding (originally the bottom of the mold), serial sections were captured as soon as possible. Dehydration of preparative cryo-sections and LPCM was done as described in Chapter 3.

RNA extraction, T7 amplification, Illumina library preparation and sequencing

RNA was extracted and cleaned as described in Chapter 3. RNA was evaluated using the Plant Total RNA assay with the RNA6000 Nano Kit (Agilent) on the Bioanalyzer (Agilent). Entire RNA samples were verified for quality (RIN >6) and amplified as described in Chapter 3. The resulting aRNA was evaluated, and Illumina library preparation and validation were carried out, as described in Chapter 3. Each library was sequenced on one lane of Illumina’s GAIIx with an 83x83 paired end protocol at the Genomics Core facility at the University of Virginia in Charlottesville, VA, USA.

Post sequencing data processing, annotation and post-assembly filtering

Transcriptome assembly and annotation was done as described in Chapter 3. Using the pairwise identity threshold determined in Chapter 3, interface unigenes from each assembly (T. versicolor on each host, S. hermonthica on S. bicolor, and O. aegyptiaca on A. thaliana) were filtered for parasitic plant transcripts by using BLASTn [157] to query Trinity [131] assemblies of the comprehensive PPGP data set for each species [162] (BigBuilds used were O. aegyptiaca = OrAeBC5, T. versicolor = TrVeBC3 and S. hermonthica = StHeBC3). This step classified interface unigenes with high identity (>95% nucleotide pairwise identity) to the extensively cleaned and classified comprehensive PPGP assemblies. Remaining unigenes were queried against the respective host cDNA set (Zea mays [243], Medicago truncatula [244], Sorghum bicolor [245], Arabidopsis thaliana [148]) with the same parameters to identify host transcripts from the mixed plant interface transcriptome. After classifying unigenes as a match to parasite transcripts or host cDNAs, the remaining unigenes were queried against NCBI’s non-redundant protein sequences database [2] (nr, downloaded Feb 2013) with BLASTx (e-value threshold 1e-5, best hit reported). The BLAST output was imported into MEGAN [138] and classified by source. The MEGAN parameters for classification were set to defaults with the following exceptions: Min. support = 1, Min Score = 125, Gene ID was determined by the TaxID lookup file (downloaded Feb. 2013 from http://www-ab.informatik.uni-tuebingen.de/) and the “min-complexity filter” was disabled. The threshold score (alignment bit score) was determined by the analysis presented in Chapter 2 that showed more stringent alignment score thresholds were not effective to decrease the frequency of non-assignment.
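The thresholds described above can be illustrated with a short filter over BLAST results. This is a minimal sketch that assumes the hits were exported in the standard 12-column BLAST tabular layout (an assumption; the text does not state the output format), and all function names are hypothetical. Because the text gives the identity cutoff both as “>95%” and as a “95% threshold”, the cutoff is left as a parameter and applied inclusively here.

```python
import csv
import io

# Column order of the standard 12-column BLAST tabular output (assumed format).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_hits(handle):
    """Yield one dict per tab-separated hit line, with numeric fields converted."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(FIELDS, row))
        for key in ("pident", "evalue", "bitscore"):
            hit[key] = float(hit[key])
        yield hit


def best_hits(hits):
    """Keep the highest bit score hit for each query unigene."""
    best = {}
    for hit in hits:
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


def passes_blastn(hit, min_identity=95.0):
    """Nucleotide search against a BigBuild, host cDNA or host EST set."""
    return hit["pident"] >= min_identity


def passes_blastx(hit, max_evalue=1e-5, min_bitscore=125.0):
    """Protein search against nr, mirroring the e-value and Min Score settings."""
    return hit["evalue"] <= max_evalue and hit["bitscore"] >= min_bitscore


# Invented example hits, for illustration only.
EXAMPLE = (
    "uni1\tpar1\t98.5\t300\t4\t0\t1\t300\t1\t300\t1e-150\t520\n"
    "uni1\tpar2\t91.0\t280\t25\t0\t1\t280\t1\t280\t1e-90\t350\n"
    "uni2\thost1\t93.2\t250\t17\t0\t1\t250\t1\t250\t1e-80\t300\n"
)
```

In this sketch the taxonomic binning itself is still left to MEGAN; the code only reproduces the numeric cutoffs applied before or during that step.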

Unigenes that were classified into non-plant clades or Chlorophyta were classified as derived from the source of the respective best hit species in nr. Unigenes that remained unassigned or that were assigned to Streptophyta were queried against a collection of host ESTs downloaded from PlantGDB [176] for Zea mays, Medicago truncatula, Sorghum bicolor and Arabidopsis thaliana with BLASTn (e-value 1e-5, 95% identity). Unigenes that remained after the search against PPGP databases, host cDNA and EST databases and NR were further screened for unigenes assigned to a PlantTribes Orthogroup (approximate plant gene family). The unigenes remaining after this screen were classified by submission to OrthoMCL DB and InterProScan as described in Chapter 3.
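The order of the screens described in this section can be summarized as a first-match cascade. The sketch below is a simplified illustration under stated assumptions: the evidence keys and class labels are hypothetical, the upstream BLAST and MEGAN results are taken as given, and any MEGAN assignment other than Streptophyta is treated as a non-plant or Chlorophyta source.

```python
from collections import Counter


def classify(evidence):
    """Assign one unigene to the first class whose screen it passes.

    evidence is a dict with hypothetical keys:
      bigbuild_hit, host_cdna_hit, host_est_hit, planttribes_orthogroup (bool)
      megan_clade (str, or None when MEGAN left the unigene unassigned)
    """
    if evidence.get("bigbuild_hit"):
        return "parasite (BigBuild)"
    if evidence.get("host_cdna_hit"):
        return "host cDNA"
    clade = evidence.get("megan_clade")
    if clade is not None and clade != "Streptophyta":
        return "non-plant or Chlorophyta (MEGAN)"
    # Unassigned or Streptophyta unigenes continue to the host EST screen.
    if evidence.get("host_est_hit"):
        return "host EST"
    if evidence.get("planttribes_orthogroup"):
        return "putative parasite (PlantTribes Ortho)"
    return "submit to OrthoMCL DB and InterProScan"


def tally(evidence_by_unigene):
    """Count unigenes per class and express each count as a percentage."""
    counts = Counter(classify(e) for e in evidence_by_unigene.values())
    total = sum(counts.values())
    return {label: (n, round(100.0 * n / total, 1)) for label, n in counts.items()}
```

A tally of this kind would correspond to the Classification block of a summary table, with the caveat that the final OrthoMCL DB and InterProScan step is not modeled here.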

Ultraconserved Ortholog (UCO) tagging analysis

A custom Perl script was used to extract Vitis vinifera genes in PlantTribes 2.0 [174] Orthogroups containing Arabidopsis thaliana UCOs [230]. This generated a list of 404 V. vinifera UCOs. Interface unigenes were then queried against the collection of V. vinifera UCOs using BLASTx (e-value 1e-20). The best hits from each interface transcriptome were selected and used to estimate UCO coverage. All HSPs were ≥49 residues, thus no minimum HSP length threshold (i.e. 30 residues as per Der et al. 2011 [26]) was imposed. The proportion of tagged interface UCOs was reported as a percentage of V. vinifera UCOs.
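The coverage estimate reduces to counting distinct UCOs that received at least one best hit. A minimal sketch, assuming the best BLASTx hit per unigene is already available as a mapping (the function and variable names are hypothetical):

```python
def uco_coverage(best_hit_by_unigene, uco_ids):
    """Return the set of tagged UCOs and the percentage of all UCOs tagged.

    best_hit_by_unigene maps each interface unigene to the V. vinifera gene
    of its best BLASTx hit; uco_ids is the full UCO list (404 genes in the text).
    Several unigenes hitting the same UCO count once.
    """
    uco_ids = set(uco_ids)
    tagged = {gene for gene in best_hit_by_unigene.values() if gene in uco_ids}
    return tagged, 100.0 * len(tagged) / len(uco_ids)
```

Counting distinct UCOs rather than hits avoids inflating coverage when fragmented unigenes tag the same ortholog more than once.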

Plant Tribes 2.0 Orthogroup level classification and GO Slim category analysis

The ortho level classification analysis was done as described in Chapter 3. The respective Parasitic Plant Genome Project Stage 6.1 (shoots i.e. leaves & stems) de novo assembled transcriptomes were used as a reference transcriptome to identify interface specific Orthogroups. The stage 6.1 assemblies were classified by a BLASTn search against the extensively cleaned (i.e. removal of all non-parasite unigenes), comprehensive assembly of all available PPGP data for a given species (including Stage 6.1) to remove any non-parasite unigenes. The BLAST parameters were the same as described above for other BLASTn searches. Chi-Square tests of GO Slim categories were done as described in Chapter 3.
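The Shoots, Shared and Interface orthogroup classes and the residual symbols used in the GO Slim tables can be sketched as follows. This is an illustration under explicit assumptions: the residuals are taken to be Pearson residuals, and “-” is taken to mark residuals below -4 and “+” residuals above 4; the exact procedure is the one described in Chapter 3, which is not reproduced here. All function names are hypothetical.

```python
import math


def partition_orthogroups(interface, shoots):
    """Split orthogroup IDs into shoot-unique, shared and interface-unique sets."""
    interface, shoots = set(interface), set(shoots)
    return {
        "Shoots": shoots - interface,
        "Shared": shoots & interface,
        "Interface": interface - shoots,
    }


def pearson_residuals(table):
    """Pearson residuals (observed - expected) / sqrt(expected) for a count table.

    table is a list of rows of observed counts; every row and column total
    must be non-zero, otherwise the expected count would be zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        out = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            out.append((observed - expected) / math.sqrt(expected))
        residuals.append(out)
    return residuals


def flag(residual, cutoff=4.0):
    """Map a residual to the table symbols; the symmetric cutoff is an assumption."""
    if residual > cutoff:
        return "+"
    if residual < -cutoff:
        return "-"
    return ""
```

For a GO Slim test, each row of `table` would hold the category counts and each column one transcriptome (or one orthogroup class), so that a flagged cell marks a category that is over- or under-represented relative to the expectation under independence.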

Selection of unigenes to represent the 3-interface shared orthogroups and differential expression analysis cross check

To select unigenes suitable for a global expression analysis, the interface unigenes belonging to the 12 shared orthogroups (see Results) were used to query the respective BigBuild assembly. BLASTn parameters were the same as described above for other searches. All BigBuild unigenes identified by this search (plus all cluster members) were aligned using Geneious [161] to the respective interface unigenes. The best hit cDNA in Plant Tribes 2.0 (or Arabidopsis) was used as a proxy for full length coding sequences (CDS) and correct ORFs. Alignments were extensively curated by hand to choose an open reading frame (ORF) that would represent the parasite gene for each ortho in each species. If the nucleotide alignments lacked a complete CDS, PPGP [162] stage specific assemblies were queried by BLASTn to search for unigenes to extend the alignments. To bring in additional sequences, high identity alignments (>95% pairwise identity, though often 100%) were required for more than 30 bp (though alignments were often much longer). After alignments were either full length, or PPGP assemblies were exhaustively searched, CDSs were chosen either from the unigene in each alignment with the best ORF or from a consensus sequence, generated from the nucleotide alignment, that contained the best ORF. For each of the 12 orthos, a protein alignment (translated ORF) of each species was generated using Geneious. The BigBuild unigene (or collection thereof) that represents the CDS for each ortho in each species was grouped for the gene expression analysis since these unigenes likely represent a single transcript. Reads from each stage specific transcriptome were mapped to the respective BigBuild (see Chapter 3 “Post sequencing data processing and annotation” for read mapping methods). Reads from each unigene “group” (all BigBuild unigenes) were summed to estimate the expression of each interface gene in the 12 orthos for each species.
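The summation over unigene groups can be sketched as below. The data structures and identifiers are hypothetical, and the sketch stops at raw read sums; conversion to FPKM would follow the read mapping methods of Chapter 3.

```python
from collections import defaultdict


def sum_group_reads(groups, read_counts):
    """Sum mapped reads over all BigBuild unigenes that represent one gene.

    groups:      {gene label: [unigene IDs in the group]}
    read_counts: {unigene ID: {stage: mapped reads}}
    Returns {gene label: {stage: summed reads}}.
    """
    totals = {}
    for gene, unigenes in groups.items():
        per_stage = defaultdict(int)
        for unigene in unigenes:
            for stage, reads in read_counts.get(unigene, {}).items():
                per_stage[stage] += reads
        totals[gene] = dict(per_stage)
    return totals
```

Treating the group as the unit of expression reflects the reasoning in the text that fragmented unigenes in a group likely derive from a single transcript.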

The BigBuild groups were checked for unigenes identified as differentially expressed (DE) by DESeq [231] analysis as follows. Read counts for each unigene in each PPGP stage specific library constitute the input matrix for DE analysis. Unigenes with low read counts (<16) were removed. DE analysis was done with the DESeq [231] package in R with default parameters (for both replicated and non-replicated datasets) to identify unigenes that are DE in two stages. If at least one unigene in the BigBuild “group” for each gene in the 12 orthos was identified as DE, the interface gene was classified as DE.
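The cross check amounts to a low-count filter followed by an any-member rule per group. The sketch below does not re-implement DESeq; it takes the set of DE unigenes as given. It also assumes that the <16 read cutoff applies to reads summed across libraries, which the text does not specify, and all names are hypothetical.

```python
def filter_low_counts(count_matrix, min_reads=16):
    """Drop unigenes whose reads, summed across libraries, fall below min_reads.

    count_matrix: {unigene ID: {library: read count}}
    """
    return {
        unigene: counts
        for unigene, counts in count_matrix.items()
        if sum(counts.values()) >= min_reads
    }


def classify_groups(groups, de_unigenes):
    """Call a gene DE when at least one unigene in its group is DE.

    groups:      {gene label: [unigene IDs in the group]}
    de_unigenes: set of unigene IDs reported as DE by the upstream analysis
    """
    return {
        gene: any(unigene in de_unigenes for unigene in members)
        for gene, members in groups.items()
    }
```

The any-member rule is permissive by design: a single DE unigene is enough to mark the whole interface gene as DE, as stated in the text.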


Appendix C

The poorly correlated candidates summary – indicated by red arrows in Figure 2.15

Strong signal in Array, no signal in RNA-Seq (reads mapped to probes)

AT1G24851.1 – No reads mapped to the probes for this gene. qPCR failed to detect transcripts in either biological replicate. Low levels of expression were detected when reads were mapped to the complete collection of TAIR10 cDNAs, suggesting a false positive on the array. The moderate SFB (0.3) suggests that the transcript has nearly sufficient reads for assembly and its absence from the de novo assemblies of biological replicates 1 and 2 is surprising and may indicate an inaccurate annotation in TAIR10. The SFB is higher in combined replicate and normalized assemblies, though the transcript is still absent from the assembly. This suggests that the locus annotation is inaccurate and its transcriptional product is poorly understood, therefore making a reference based annotation challenging if not impossible.

AT4G33120.1 - No reads mapped to the array probes for this gene. qPCR failed to detect transcripts in either biological replicate. The average fpkm (0.032) and SFB (1.32) suggest there are sufficient reads for assembly, though no assembler considered in this analysis produced a unigene that can be assigned to the locus. This suggests that the locus annotation is inaccurate and/or its transcriptional product is poorly understood, therefore making a reference based annotation challenging if not impossible.

AT5G43640.1 - No reads mapped to the probes for this gene. qPCR failed to detect transcripts in either biological replicate. The average SFB for this gene is 4.75. Transcripts with SFB >1 are not necessarily assembled well, or even at all. Such is the case for this transcript: excessive reads and no RNA-Seq signal at the probe suggest that the locus annotation is inaccurate and its transcriptional product is poorly understood, therefore making a reference based annotation challenging if not impossible.

Strong signal in RNA-Seq (reads mapped to probes), proportionally weak signal in Array

AT1G24996.1 - A transcript representing this gene was present in the BR1 assembly by Oases and in assemblies by Oases and ABySS in the combined replicate assemblies. Roughly 4500 reads from each biological replicate mapped to the TAIR10 cDNA covering the entire reference transcript. This transcript had sufficient reads to put it well above the threshold of assembly for many of the assemblers. We used the qRT-PCR primers to amplify an approximately 100bp region from each of the 12 candidates using DNase treated RNA in a properly controlled PCR experiment. We annealed the primers at 60°C, allowed a 1 minute extension time and completed 40 cycles to verify a clean qRT-PCR product. Those primer pairs that yielded reliable qRT-PCR amplification for the 6 well correlated candidates, and for AT3G24480.1, yielded a clean product of expected size. Those primer pairs that yielded unreliable or undetectable results in the qRT-PCR assay yielded similarly weak or undetectable signal save one: AT1G24996.1. The product of the PCR reaction using the qRT-PCR primers yielded a ~550bp product, not the expected 100bp product. The qRT-PCR primers were aligned to the TAIR10 cDNA for AT1G24996.1 revealing that they flanked the annotated intron and would yield a 551bp product if a partially processed transcript was the template (RT negative cDNA was used as a control for genomic DNA contamination and gave no signal). This result suggests that the predominant form of this transcript in young leaf tissue is immature, differentially spliced or partially processed.

AT3G24480.1 - A single unigene from a primary CLC assembly that mapped to AT3G24480.1 revealed a novel ~800bp low complexity region and a single bp deletion in the array probe binding region, suggesting that the poor Microarray vs RNA-Seq correlation may be explained by an inaccurate gene model. We verified the presence of this novel 800bp region in other de novo assemblies considered in this study. We attempted to amplify this gene from cDNA and genomic DNA and were unsuccessful, though we did amplify a closely related gene. All supporting EST data (http://gbrowse.arabidopsis.org/) consisted of ESTs that mapped to the 3’ end of the TAIR10 cDNA. A single clone (GenBank: AI995373.1) from inflorescence tissue extended further 5’ than any other EST and was gapped in its alignment with the contig and the TAIR10 cDNA. Further, proteomics data retrieved from AtProteome (http://fgcz-atproteome.unizh.ch) revealed that the polypeptide was only detected in cotyledons, buds, flowers and siliques and not in juvenile leaves. An intact reading frame was identified by considering the annotated start codon, novel indel, and gapped de novo EST. Together, this evidence suggests that we have detected an immature transcript in young leaf that is a product of a poorly annotated locus in the Arabidopsis genome. This transcript is represented in all assemblies save the SOAPdenovo assembly of the combined biological replicates.

AT4G02970.1 - qPCR failed to detect transcripts in either biological replicate. A CLC unigene that represented AT4G02970.1 was identified and contained a region that aligned perfectly with an annotated intron, suggesting a partially processed transcript was assembled de novo. A search of AtProteome (http://fgcz-atproteome.unizh.ch) revealed that the peptide has only been detected in dark grown cell cultures, yet the available ESTs (http://gbrowse.arabidopsis.org/) support the annotated intron as part of the mature transcript (GenBank: AV549571.1 & AA597599.1). This transcript is represented by the majority of de novo assemblies considered in this study.

Works Cited

1 Kodama, Y. et al. The Sequence Read Archive: Explosive Growth Of Sequencing Data. Nucleic Acids Res. 40, D54-D56 (2012).
2 National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov
3 Benson, D. A. et al. GenBank. Nucleic Acids Res. 40, D48-D53 (2012).
4 Ward, J. A. et al. Strategies For Transcriptome Analysis In Nonmodel Plants. Am. J. Bot. 99, 267-276 (2012).
5 Wall, P. K. et al. Comparison of next generation sequencing technologies for transcriptome characterization. BMC Genomics 10, 347 (2009).
6 The Huck Institutes of the Life Sciences Genomics Core Facility http://www.huck.psu.edu/facilities/genomics-up/
7 Schuster, S. C. Next-Generation Sequencing Transforms Today's Biology. Nat Meth 5, 16-18 (2008).
8 Strickler, S. R. et al. Designing A Transcriptome Next-Generation Sequencing Project For A Nonmodel Plant Species. Am. J. Bot. 99, 257-266 (2012).
9 Rokas, A. & Abbot, P. Harnessing Genomics For Evolutionary Insights. Trends Ecol Evol 24, 192-200 (2009).
10 Cahais, V. et al. Reference-Free Transcriptome Assembly In Non-Model Animals From Next-Generation Sequencing Data. Molecular Ecology Resources 12, 834-845 (2012).
11 Bräutigam, A. et al. Critical Assessment Of Assembly Strategies For Non-Model Species mRNA-Seq Data And Application Of Next-Generation Sequencing To The Comparison Of C3 And C4 Species. J. Exp. Bot. 62, 3093-3102 (2011).
12 Kumar, S. & Blaxter, M. L. Comparing De Novo Assemblers For 454 Transcriptome Data. BMC Genomics 11, 571 (2010).
13 Bräutigam, A. & Gowik, U. What Can Next Generation Sequencing Do For You? Next Generation Sequencing As A Valuable Tool In Plant Research. Plant Biology 12, 831-841 (2010).
14 Whiteford, N. An Analysis Of The Feasibility Of Short Read Sequencing. Nucleic Acids Res. 33, e171-e171 (2005).
15 Martin, J. A. & Wang, Z. Next-Generation Transcriptome Assembly. Nat. Rev. Genet. 12, 671-682 (2011).
16 Marioni, J. C. et al. RNA-Seq: An Assessment Of Technical Reproducibility And Comparison With Gene Expression Arrays. Genome Res. 18, 1509-1517 (2008).
17 The AToL initiative (Assembling the Tree of Life), http://www.phylo.org/atol/
18 The 1KP project, http://www.onekp.com/
19 Riggins, C. W. et al. Characterization Of De Novo Transcriptome For Waterhemp (Amaranthus tuberculatus) Using GS-FLX 454 Pyrosequencing And Its Application For Studies Of Herbicide Target-Site Genes. Pest Management Science 66, 1042-1052 (2010).
20 454 Life Sciences, http://454.com/
21 Chevreux, B. et al. Using The miraEST Assembler For Reliable And Automated mRNA Transcript Assembly And SNP Detection In Sequenced ESTs. Genome Res. 14, 1147-1159 (2004).
22 Barakat, A. et al. Comparison Of The Transcriptomes Of American Chestnut (Castanea dentata) And Chinese Chestnut (Castanea mollissima) In Response To The Chestnut Blight Infection. BMC Plant Biol. 9, 51 (2009).
23 Novaes, E. et al. High-Throughput Gene And SNP Discovery In Eucalyptus grandis, An Uncharacterized Genome. BMC Genomics 9, 312 (2008).

24 Dassanayake, M. et al. Shedding Light On An Extremophile Lifestyle Through Transcriptomics. New Phytol. 183, 764-775 (2009).
25 Franssen, S. U. et al. Comprehensive Transcriptome Analysis Of The Highly Complex Pisum sativum Genome Using Next Generation Sequencing. BMC Genomics 12, 227 (2011).
26 Der, J. P. et al. De Novo Characterization Of The Gametophyte Transcriptome In Bracken Fern, Pteridium aquilinum. BMC Genomics 12, 99 (2011).
27 Torales, S. L. et al. Transcriptome Survey Of Patagonian Southern Beech Nothofagus nervosa (= N. Alpina): Assembly, Annotation And Molecular Marker Discovery. BMC Genomics 13, 291 (2012).
28 Pevzner, P. A. et al. An Eulerian Path Approach To DNA Fragment Assembly. P. Natl. Acad. Sci. USA 98, 9748-9753 (2001).
29 Nagarajan, N. & Pop, M. Sequence Assembly Demystified. Nat. Rev. Genet. 14, 157-167 (2013).
30 Collins, L. J. et al. An Approach To Transcriptome Analysis Of Non-Model Organisms Using Short-Read Sequences. Vol. 21 Genome Informatics Series (eds J. Arthur & S. K. Ng) 3-14 (2008).
31 Annadurai, R. S. et al. Next Generation Sequencing And De Novo Transcriptome Analysis Of Costus pictus D. Don, A Non-Model Plant With Potent Anti-Diabetic Properties. BMC Genomics 13, 663 (2012).
32 Barrero, R. A. et al. De Novo Assembly Of Euphorbia fischeriana Root Transcriptome Identifies Prostratin Pathway Related Genes. BMC Genomics 12, 600 (2011).
33 Venturini, L. et al. De Novo Transcriptome Characterization Of Vitis vinifera Cv. Corvina Unveils Varietal Diversity. BMC Genomics 14, 41 (2013).
34 Schulz, M. H. et al. Oases: Robust De Novo RNA-Seq Assembly Across The Dynamic Range Of Expression Levels. Bioinformatics 28, 1086-1092 (2012).
35 Tang, Q. et al. An Efficient Approach To Finding Siraitia grosvenorii Triterpene Biosynthetic Genes By RNA-Seq And Digital Gene Expression Analysis. BMC Genomics 12, 343 (2011).
36 Zhang, J. A. et al. De Novo Assembly And Characterisation Of The Transcriptome During Seed Development, And Generation Of Genic-SSR Markers In Peanut (Arachis hypogaea L.). BMC Genomics 13, 90 (2012).
37 Sun, X. D. et al. De Novo Assembly And Characterization Of The Garlic (Allium sativum) Bud Transcriptome By Illumina Sequencing. Plant Cell Reports 31, 1823-1828 (2012).
38 Huang, H. H. et al. De Novo Characterization Of The Chinese Fir (Cunninghamia lanceolata) Transcriptome And Analysis Of Candidate Genes Involved In Cellulose And Lignin Biosynthesis. BMC Genomics 13, 648 (2012).
39 Gahlan, P. et al. De Novo Sequencing And Characterization Of Picrorhiza kurrooa Transcriptome At Two Temperatures Showed Major Transcriptome Adjustments. BMC Genomics 13, 126 (2012).
40 Wong, M. M. L. et al. Identification Of Lignin Genes And Regulatory Sequences Involved In Secondary Cell Wall Formation In Acacia auriculiformis And Acacia mangium Via De Novo Transcriptome Sequencing. BMC Genomics 12, 342 (2011).
41 Xia, Z. H. et al. RNA-Seq Analysis And De Novo Transcriptome Assembly Of Hevea brasiliensis. Plant Mol. Biol. 77, 299-308 (2011).
42 Huang, L. L. et al. The First Illumina-Based De Novo Transcriptome Sequencing And Analysis Of Safflower Flowers. PLoS ONE 7, 0038653 (2012).

43 Hao, D. C. et al. The First Insight Into The Tissue Specific Taxus Transcriptome Via Illumina Second Generation Sequencing. PLoS ONE 6, 0021220 (2011).
44 Wang, X. J. et al. Transcriptome Analysis Of Sacha Inchi (Plukenetia volubilis L.) Seeds At Two Developmental Stages. BMC Genomics 13, 716 (2012).
45 Sun, Q. et al. Transcriptome Analysis Of Stem Development In The Tumourous Stem Mustard Brassica juncea Var. Tumida Tsen Et Lee By RNA Sequencing. BMC Plant Biol. 12, 53 (2012).
46 Angeloni, F. et al. De Novo Transcriptome Characterization And Development Of Genomic Tools For Scabiosa columbaria L. Using Next-Generation Sequencing Techniques. Molecular Ecology Resources 11, 662-674 (2011).
47 Gruenheit, N. et al. Cutoffs And K-Mers: Implications From A Transcriptome Study In Allopolyploid Plants. BMC Genomics 13, 92 (2012).
48 Xu, D. L. et al. De Novo Assembly And Characterization Of The Root Transcriptome Of Aegilops variabilis During An Interaction With The Cereal Cyst Nematode. BMC Genomics 13, 133 (2012).
49 Honaas, L. A. et al. Functional Genomics Of A Generalist Parasitic Plant: Laser Microdissection Of Host-Parasite Interface Reveals Host-Specific Patterns Of Parasite Gene Expression. BMC Plant Biol. 13, 9 (2013).
50 Krishnan, N. M. et al. De Novo Sequencing And Assembly Of Azadirachta indica Fruit Transcriptome. Current Science 101, 1553-1561 (2011).
51 Zhao, Z. G. et al. Deep-Sequencing Transcriptome Analysis Of Chilling Tolerance Mechanisms Of A Subnival Alpine Plant, Chorispora bungeana. BMC Plant Biol. 12, 222 (2012).
52 Wang, S. F. et al. Transcriptome Analysis Of The Roots At Early And Late Seedling Stages Using Illumina Paired-End Sequencing And Development Of EST-SSR Markers In Radish. Plant Cell Reports 31, 1437-1447 (2012).
53 Liu, G. Q. et al. Transcriptomic Analysis Of 'Suli' Pear (Pyrus pyrifolia White Pear Group) Buds During The Dormancy By RNA-Seq. BMC Genomics 13, 700 (2012).
54 Garg, R. et al. Gene Discovery And Tissue-Specific Transcriptome Analysis In Chickpea With Massively Parallel Pyrosequencing And Web Resource Development. Plant Physiol. 156, 1661-1678 (2011).
55 Orshinsky, A. M. et al. RNA-Seq Analysis Of The Sclerotinia homoeocarpa - Creeping Bentgrass Pathosystem. PLoS ONE 7, 0041150 (2012).
56 Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia Of Genes And Genomes. Nucleic Acids Res. 28, 27-30 (2000).
57 Boeckmann, B. et al. The SWISS-PROT Protein Knowledgebase And Its Supplement TrEMBL In 2003. Nucleic Acids Res. 31, 365-370 (2003).
58 Ashburner, M. et al. Gene Ontology: Tool For The Unification Of Biology. The Gene Ontology Consortium. Nat. Genet. 25, 25-29 (2000).
59 Schliesky, S. et al. RNA-Seq Assembly: Are We There Yet? Frontiers in Plant Science 3, 220 (2012).
60 Gibbons, J. G. et al. Benchmarking Next-Generation Transcriptome Sequencing For Functional And Evolutionary Genomics. Mol. Biol. Evol. 26, 2731-2744 (2009).
61 Surget-Groba, Y. & Montoya-Burgos, J. I. Optimization Of De Novo Transcriptome Assembly From Next-Generation Sequencing Data. Genome Res. 20, 1432-1440 (2010).
62 Robertson, G. et al. De Novo Assembly And Analysis Of RNA-Seq Data. Nat Meth 7, 909-912 (2010).
63 Severin, A. J. et al. RNA-Seq Atlas Of Glycine Max: A Guide To The Soybean Transcriptome. BMC Plant Biol. 10, 160 (2010).

192 64 Marioni, J. et al. RNA-Seq: An Assessment Of Technical Reproducibility And Comparison With Gene Expression Arrays. Genome research 18, 1509 - 1517 (2008). 65 Brandt, S. P. Microgenomics: Gene Expression Analysis At The Tissue-Specific And Single-Cell Levels. J. Exp. Bot. 56, 495-505 (2005). 66 Day, R. C. et al. Be More Specific! Laser-Assisted Microdissection Of Plant Cells. Trends Plant Sci. 10, 397-406 (2005). 67 Nelson, T. et al. Plant Cell Types: Reporting And Sampling With New Technologies. Curr. Opin. Plant Biol. 11, 567-573 (2008). 68 Karrer, E. E. et al. In-Situ Isolation Of Messenger-Rna From Individual Plant-Cells - Creation Of Cell-Specific Cdna Libraries. P. Natl. Acad. Sci. USA 92, 3814-3818 (1995). 69 Brandt, S. et al. A Rapid Method For Detection Of Plant Gene Transcripts From Single Epidermal, Mesophyll And Companion Cells Of Intact Leaves. Plant J. 20, 245-250 (1999). 70 Nakazono, M. et al. Laser-Capture Microdissection, A Tool For The Global Analysis Of Gene Expression In Specific Plant Cell Types: Identification Of Genes Expressed Differentially In Epidermal Cells Or Vascular Tissues Of Maize. Plant Cell 15, 583-596 (2003). 71 Grosset, J. et al. Messenger-RNAs Newly Synthesized By Tobacco Mesophyll Protoplasts Are Wound-Inducible. Plant Mol. Biol. 15, 485-496 (1990). 72 Jiao, Y. L. & Meyerowitz, E. M. Cell-Type Specific Analysis Of Translating RNAs In Developing Flowers Reveals New Levels Of Control. Molecular Systems Biology 6, (2010). 73 Wang, D. X et al. Technologies For Systems-Level Analysis Of Specific Cell Types In Plants. Plant Science 197, 21-29 (2012). 74 Westwood, J. H et al. The Evolution Of Parasitism In Plants. Trends Plant Sci. 15, 227- 235 (2010). 75 Olmstead, R. G. et al. Disintegration Of The Scrophulariaceae. Am. J. Bot. 88, 348-361 (2001). 76 Parker, C. Parasitic Weeds: A World Challenge. Weed Sci. 60, 269-276 (2012). 77 Ejeta, G. The Striga Scourge In Africa: A Growing Pandemic. 
In Integrating New Technologies For Striga Control: Towards Ending The Witch-Hunt. (World Scientific Publishing Co. Singapore, 2007). 78 Scholes, J. D. & Press, M. C. Striga Infestation Of Cereal Crops - An Unsolved Problem In Resource Limited Agriculture. Curr. Opin. Plant Biol. 11, 180-186 (2008). 79 Parker, C. Observations On The Current Status Of Orobanche And Striga Problems Worldwide. Pest Management Science 65, 453-459 (2009). 80 Vurro, M. et al. Emerging Infectious Diseases Of Crop Plants In Developing Countries: Impact On Agriculture And Socio-Economic Consequences. Food Sec. 2, 113-132 (2010). 81 Press, M. C. & Graves, J. D. Parasitic Plants (Chapman and Hall, London, 1995). 82 Swarbrick, P. J. et al. Global Patterns Of Gene Expression In Rice Cultivars Undergoing A Susceptible Or Resistant Interaction With The Parasitic Plant Striga Hermonthica. New Phytol. 179, 515-529 (2008). 83 Dita, M. A. et al. Gene Expression Profiling Of Medicago Truncatula Roots In Response To The Parasitic Plant Orobanche Crenata. Weed Research 49, 66-80 (2009). 84 Hiraoka, Y. & Sugimoto, Y. Molecular Responses Of Sorghum To Purple Witchweed (Striga hermonthica) Parasitism. Weed Sci. 56, 356-363 (2008).

193 85 Hiraoka, Y.et al. Molecular Responses Of Lotus Japonicus To Parasitism By The Compatible Species Orobanche aegyptiaca And The Incompatible Species Striga hermonthica. J. Exp. Bot. 60, 641-650 (2009). 86 Die, J. V. et al. Identification By Suppression Subtractive Hybridization And Expression Analysis Of Medicago truncatula Putative Defence Genes In Response To Orobanche crenata Parasitization. Physiol Molec Plant Path 70, 49-59 (2007). 87 Cardoso, C et al. Strigolactones And Root Infestation By Plant-Parasitic Striga, Orobanche And Phelipanche Spp. Plant Science 180, 414-420 (2011). 88 Yoder, J. I. & Scholes, J. D. Host Plant Resistance To Parasitic Weeds; Recent Progress And Bottlenecks. Curr. Opin. Plant Biol. 13, 478-484 (2010). 89 Thorogood, C. J. & Hiscock, S. J. Compatibility Interactions At The Cellular Level Provide The Basis For Host Specificity In The Parasitic Plant Orobanche. New Phytol 186, 571-575 (2010). 90 Rubiales, D. et al. Breeding Approaches For Crenate Broomrape (Orobanche crenata forsk.) Management In Pea (Pisum sativum l.). Pest. Manag. Sci. 65, 553-559 (2009). 91 Perez-De-Luque, A. et al. Host Plant Resistance Against Broomrapes (Orobanche Spp.): Defence Reactions And Mechanisms Of Resistance. Annals of Applied Biology 152, 131- 141 (2008). 92 Rispail, N. et al. Plant Resistance To Parasitic Plants: Molecular Approaches To An Old Foe. New Phytol 173, 703-711 (2007). 93 Rubiales, D. et al. Screening Techniques And Sources Of Resistance Against Parasitic Weeds In Grain Legumes. Euphytica 147, 187-199 (2006). 94 Parker, C. Protection Of Crops Against Parasitic Weeds. Crop Prot. 10, 6-22 (1991). 95 Palmer, A. G. et al. Chemical Biology Of Multi-Host/Pathogen Interactions: Chemical Perception And Metabolic Complementation. Annu. Rev. Phytopathol. 42, 439-464 (2004). 96 Keyes, W. J. et al. Signaling Organogenesis In Parasitic Angiosperms: Xenognosin Generation, Perception, And Response. Journal of Plant Growth Regulation 19, 217-231 (2000). 
97 Tomilov, A. A. et al. Localized Hormone Fluxes And Early Haustorium Development In The Hemiparasitic Plant Triphysaria versicolor. Plant Physiol. 138, 1469-1480 (2005). 98 Cook, C. et al. Germination Stimulants. II. Structure Of Strigol, A Potent Seed Germination Stimulant For Witchweed (Striga lutea). J. Am. Chem. Soc. 94, 6198-6199 (1972). 99 Akiyama, K et al. Plant Sesquiterpenes Induce Hyphal Branching In Arbuscular Mycorrhizal Fungi. Nature 435, 824-827 (2005). 100 Gomez-Roldan, V. et al. Strigolactone Inhibition Of Shoot Branching. Nature 455, 189- 194 (2008). 101 Umehara, M. et al. Inhibition Of Shoot Branching By New Terpenoid Plant Hormones. Nature 455, 195-200 (2008). 102 Xie, X. et al. The Strigolactone Story. Phytopathology 48, 93 (2010). 103 Macqueen, M. Haustorial Initiating Activity Of Several Simple Phenolic Compounds. In Third International Symposium on Parasitic Weeds. (International Center for Agricultural Research, Aleppo, Syria, 1984. ed. Musselman LJ Parker C, Polhill RM, Wilson AK) pp118-122. 104 Riopel, J. L. & Timko, M. P. in Parasitic Plants (eds Malcolm C. Press & Jonathan D. Graves) pp39-79 (Chapman and Hall, 1995). 105 Smith, C. E. et al. A Mechanism For Inducing Plant Development: The Genesis Of A Specific Inhibitor. P. Natl. Acad. Sci. USA 93, 6986-6991 (1996).

194 106 Albrecht, H. et al. Flavonoids Promote Haustoria Formation In The Root Parasite Triphysaria versicolor. Plant Physiol. 119, 585-591 (1999). 107 Irving, L. J. & Cameron, D. D. You Are What You Eat:: Interactions Between Root Parasitic Plants And Their Hosts. Adv. Bot. Res. 50, 87-138 (2009). 108 Kuijt, J. The Biology of Parasitic Flowering Plants. (University of California Press, 1969). 109 Heide-Jørgensen, H. & Kuijt, J. Epidermal Derivatives As Xylem Elements And Transfer Cells: A Study Of The Host-Parasite Interface In Two Species Of Triphysaria (Scrophulariaceae). Protoplasma 174, 173-183 (1993). 110 Heide-Jørgensen, H. S. & Kuijt, J. The Haustorium Of The Root Parasite Triphysaria (Scrophulariaceae), With Special Reference To Xylem Bridge Ultrastructure. Am. J. Bot. 82, 782-797 (1995). 111 Vachev, T. et al. Trafficking Of The Potato Spindle Tuber Viroid Between Tomato And Orobanche ramosa. Virology 399, 187-193 (2010). 112 Birschwilks, M. et al. Transfer Of Phloem-Mobile Substances From The Host Plants To The Holoparasite Cuscuta Sp. J. Exp. Bot. 57, 911-921 (2006). 113 Tomilov, A. A. et al. Trans-Specific Gene Silencing Between Host And Parasitic Plants. Plant J. 56, 389-397 (2008). 114 Matvienko, M. et al. Transcriptional Responses In The Hemiparasitic Plant Triphysaria versicolor To Host Plant Signals. Plant Physiol. 127, 272-282 (2001). 115 Tomilov, A. et al. in Chemical Communication: From Gene to Ecosystem (ed M. Dicke) pp55-69 (Frontis, 2006). 116 Goldwasser, Y., et al. in The Arabidopsis Book (eds C. R. Somerville & E. M. Meyerowitz) (American Society of Plant Biologists, 2002). 117 Thurman, L. D. Genecological studies in Orthocarpus subgenus Triphysaria, University of California, (1966). 118 Yoder, J. I. A Species-Specific Recognition System Directs Haustorium Development In The Parasitic Plant Triphysaria (Scrophulariaceae). Planta 202, 407-413 (1997). 119 Jamison, D. S. & Yoder, J. I. 
Heritable Variation In Quinone-Induced Haustorium Development In The Parasitic Plant Triphysaria. Plant Physiol. 125, 1870-1879 (2001). 120 Yoder, J. I. Self And Cross-Compatibility In Three Species Of The Hemiparasite Triphysaria (Scrophulariaceae). Environ Ex Bot 39, 77-83 (1998). 121 Tomilov, A. A. et al. Localized Hormone Fluxes And Early Haustorium Development In The Hemiparasitic Plant Triphysaria versicolor. Plant Physiol. 138, 1469-1480 (2005). 122 Delavault, P. et al. Host-Root Exudates Increase Gene Expression Of Asparagine Synthetase In The Roots Of A Hemiparasitic Plant Triphysaria versicolor (Scrophulariaceae). Gene 222, 155-162 (1998). 123 Wrobel, R. L. & Yoder, J. I. Differential RNA Expression Of Alpha-Expansin Gene Family Members In The Parasitic Angiosperm Triphysaria versicolor (Scrophulariaceae). Gene 266, 85-93 (2001). 124 O'Malley, R. C. & Lynn, D. G. Expansin Message Regulation In Parasitic Angiosperms: Marking Time In Development. Plant Cell 12, 1455-1465 (2000). 125 Matvienko, M. et al. Quinone Oxidoreductase Message Levels Are Differentially Regulated In Parasitic And Non-Parasitic Plants Exposed To Allelopathic Quinones. Plant J. 25, 375-387 (2001). 126 Torres, M. J et al. Pscroph, A Parasitic Plant EST Database Enriched For Parasite Associated Transcripts. BMC Plant Biol. 5, 24 (2005).

195 127 Tomilov, A. et al. In Vitro Haustorium Development In Roots And Root Cultures Of The Hemiparasitic Plant Triphysaria versicolor. Plant Cell Tissue Organ Cult. 77, 257-265 (2004). 128 Tomilov, A. et al. Agrobacterium tumefaciens And Agrobacterium rhizogenes Transformed Roots Of The Parasitic Plant Triphysaria versicolor Retain Parasitic Competence. Planta 225, 1059-1071 (2007). 129 Bandaranayake, P. C. G. et al. A Single-Electron Reducing Quinone Oxidoreductase Is Necessary To Induce Haustorium Development In The Root Parasitic Plant Triphysaria. Plant Cell 22, 1404-1419 (2010). 130 Westwood, J. H. et al. The Parasitic Plant Genome Project: New Tools For Understanding The Biology Of Orobanche And Striga. Weed Sci. 60, 295-306 (2012). 131 Grabherr, M. G. et al. Full-Length Transcriptome Assembly From RNA-Seq Data Without A Reference Genome. Nat Biotech 29, 644-652 (2011). 132 CLC. CLCbio, http://www.clcbio.com/ 133 SOAP. Short Oligonucleotide Analysis Package, http://soap.genomics.org.cn/ 134 Ian Korf, M. Y., Joseph Bedell. BLAST. (O'Reilly and Associates Inc., 2003). 135 Hale, M. C. et al. Next-Generation Pyrosequencing Of Gonad Transcriptomes In The Polyploid Lake Sturgeon (Acipenser fulvescens): The Relative Merits Of Normalization And Rarefaction In Gene Discovery. BMC Genomics 10, 203 (2009). 136 Mortazavi, A et al. Mapping And Quantifying Mammalian Transcriptomes By RNA- Seq. Nat Meth 5, 621-628 (2008). 137 Malone, J. H. & Oliver, B. Microarrays, Deep Sequencing And The True Measure Of The Transcriptome. BMC Biology 9, 34 (2011). 138 Huson, D. H. et al. Integrative Analysis Of Environmental Sequences Using MEGAN4. Genome Res. 21, 1552-1560 (2011). 139 Wickett, N. J. et al. Transcriptomes Of The Parasitic Plant Family Orobanchaceae Reveal Surprising Conservation Of Chlorophyll Synthesis. Current biology 21, 2098-2104 (2011). 140 Zerbino, D. R. & Birney, E. Velvet: Algorithms For De Novo Short Read Assembly Using De Bruijn Graphs. Genome Res. 
18, 821-829 (2008). 141 Li, R. Q. et al. De Novo Assembly Of Human Genomes With Massively Parallel Short Read Sequencing. Genome Res. 20, 265-272 (2010). 142 Simpson, J. T. et al. ABySS: A Parallel Assembler For Short Read Sequence Data. Genome Res. 19, 1117-1123 (2009). 143 The Arabidopsis Information Resource, http://www.arabidopsis.org/index.jsp 144 Wilkening, S. & Bader, A. Quantitative Real-Time Polymerase Chain Reaction: Methodical Analysis And Mathematical Model. J of Biomolec Tech 15, 107 (2004). 145 Pfaffl, M. W. A New Mathematical Model For Relative Quantification In Real-Time RT- PCR. Nucleic Acids Res. 29, e45 (2001). 146 The Dlugosch Lab @ The University of , http://dlugoschlab.arizona.edu/index.html 147 Biopython, http://biopython.org/wiki/Main_Page 148 Lamesch, P. et al. The Arabidopsis Information Resource (TAIR): Improved Gene Annotation And New Tools. Nucleic Acids Res. 40, D1202-D1210 (2012). 149 Quinlan, A. R. & Hall, I. M. Bedtools: A Flexible Suite Of Utilities For Comparing Genomic Features. Bioinformatics 26, 841-842 (2010). 150 Langmead, B. et al. Ultrafast And Memory-Efficient Alignment Of Short DNA Sequences To The Human Genome. Genome Biol 10, R25 (2009).

196 151 Li, R. Q. et al. De Novo Assembly Of Human Genomes With Massively Parallel Short Read Sequencing. Genome Res. 20, 265-272 (2010). 152 SOFTGENETICS, http://www.softgenetics.com/ 153 Huang, X. Q. & Madan, A. CAP3: A DNA Sequence Assembly Program. Genome Res. 9, 868-877 (1999). 154 Boetzer, M. et al. Scaffolding Pre-Assembled Contigs Using SSPACE. Bioinformatics 27, 578-579 (2011). 155 Iseli, C. et al. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology. pp138-158 (AAAI Press, 1999). 156 Edgar, R. C. Search And Clustering Orders Of Magnitude Faster Than BLAST. Bioinformatics 26, 2460-2461 (2010). 157 Altschul, S. F et al. Basic Local Alignment Search Tool. J. Mol. Biol. 215, 403-410 (1990). 158 Wang, J. P. Z. et al. EST Clustering Error Evaluation And Correction. Bioinformatics 20, 2973-2984 (2004). 159 Jiao, Y. N. et al. Ancestral Polyploidy In Seed Plants And Angiosperms. Nature 473, 97- 113 (2011). 160 The R Project for Statistical Computing, http://www.r-project.org/ 161 Geneious, http://www.geneious.com/ 162 The Parasitic Plant Genome Project, http://ppgp.huck.psu.edu/ (2012). 163 Estabrook, E. M. & Yoder, J. I. Plant-Plant Communications: Rhizosphere Signaling Between Parasitic Angiosperms And Their Hosts. Plant Physiol. 116, 1-7 (1998). 164 Thompson, J. N. The Coevolutionary Process. (The University of Chicago Press, 1994). 165 Agosta, S. J. et al. How Specialists Can Be Generalists: Resolving The "Parasite Paradox" And Implications For Emerging Infectious Disease. Zoologia (Curitiba) 27, 151-162 (2010). 166 Atsatt, P. R. & Strong, D. R. Population Biology Of Annual Grassland Hemiparasites .1. Host Environment. Evolution 24, 278-291 (1970). 167 Nickrent, D. The Parasitic Plant Connection, http://www.parasiticplants.siu.edu/ 168 Schneeweiss, G. M. Correlated Evolution Of Life History And Host Range In The Nonphotosynthetic Parasitic Flowering Plants Orobanche And Phelipanche (Orobanchaceae). J. Evolution. 
Biol. 20, 471-478 (2007). 169 Offler, C. E. et al. Transfer Cells: Cells Specialized For A Special Purpose. Annu. Rev. Plant Biol. 54, 431-454 (2003). 170 Wrobel, R. & Yoder, J. Differential RNA Expression Of Alpha-Expansin Gene Family Members In The Parasitic Angiosperm Triphysaria versicolor (Scrophulariaceae). Gene 266, 85-93 (2001). 171 Kenzelmann, M. et al. High-accuracy amplification of nanogram total RNA amounts for gene profiling. Genomics 83, 550-558 (2004). 172 Matas, A. J. et al. Tissue- and Cell-Type Specific Transcriptome Profiling of Expanding Tomato Fruit Provides Insights into Metabolic and Regulatory Specialization and Cuticle Formation. THE PLANT CELL ONLINE, doi:10.1105/tpc.111.091173 (2011). 173 Wall, P. K. et al. PlantTribes: a gene and gene family resource for comparative genomics in plants. Nucleic Acids Res. 36, D970-D976, doi:10.1093/nar/gkm972 (2008). 174 The Floral Genome Project, http://fgp.bio.psu.edu 175 Phytozome, http://www.phytozome.net (2012). 176 PlantGBD, http://www.plantgdb.org (2012). 177 European Bioinformatics Institute, http://www.ebi.ac.uk/ 178 OrthoMCL DB, http://www.orthomcl.org/

197 179 Sampedro, J. & Cosgrove, D. J. The Expansin Superfamily. Genome Biol 6, 242 (2005). 180 Gomez, S. K. & Harrison, M. J. Laser Microdissection And Its Application To Analyze Gene Expression In Arbuscular Mycorrhizal Symbiosis. Pest Management Science 65, 504-511 (2009). 181 Woll, K. Isolation, Characterization, And Pericycle-Specific Transcriptome Analyses Of The Novel Maize Lateral And Seminal Root Initiation Mutant rum1. Plant Physiol. 139, 1255-1267 (2005). 182 Ithal, N. et al. Parallel Genome-Wide Expression Profiling Of Host And Pathogen During Soybean Cyst Nematode Infection Of Soybean. MPMI 20, 293-305 (2007). 183 Gomez, S. K. et al. Medicago truncatula And Glomus intraradices Gene Expression In Cortical Cells Harboring Arbuscules In The Arbuscular Mycorrhizal Symbiosis. BMC Plant Biol. 9, 10 (2009). 184 Klink, V. P. et al. A Gene Expression Analysis Of Syncytia Laser Microdissected From The Roots Of The Glycine max (Soybean) Genotype PI 548402 (Peking) Undergoing A Resistant Reaction After Infection By Heterodera glycines (Soybean Cyst Nematode). Plant Mol. Biol. 71, 525-567 (2009). 185 Matas, A. J. et al. Tissue-Specific Transcriptome Profiling Of The Citrus Fruit Epidermis And Subepidermis Using Laser Capture Microdissection. J. Exp. Bot. 61, 3321-3330 (2010). 186 Pillitteri, L. J. et al. Molecular Profiling Of Stomatal Meristemoids Reveals New Component Of Asymmetric Cell Division And Commonalities Among Stem Cell Populations In Arabidopsis. Plant Cell 23, 3260-3275 (2011). 187 Emrich, S. J et al. Gene Discovery And Annotation Using LCM-454 Transcriptome Sequencing. Genome Res. 17, 69-73 (2007). 188 Ohtsu, K. et al. Global Gene Expression Analysis Of The Shoot Apical Meristem Of Maize (Zea mays L.). Plant J. 52, 391-404 (2007). 189 Li, P. H. et al. The Developmental Dynamics Of The Maize Leaf Transcriptome. Nat. Genet. 42, 1060-1067 (2010). 190 Feldman, A. L. et al. Advantages Of mRNA Amplification For Microarray Analysis. 
BioTechniques 33, 906-912 (2002). 191 Polacek, D. C. et al. Fidelity And Enhanced Sensitivity Of Differential Transcription Profiles Following Linear Amplification Of Nanogram Amounts Of Endothelial mRNA. Physiol Genomics 13, 147-156 (2003). 192 Li, Y. et al. Systematic Comparison Of The Fidelity Of aRNA, mRNA And T-RNA On Gene Expression Profiling Using cDNA Microarray. Journal of Biotechnology 107, 19- 28 (2004). 193 King, C. et al. Reliability And Reproducibility Of Gene Expression Measurements Using Amplified RNA From Laser-Microdissected Primary Breast Tissue With Oligonucleotide Arrays. Journal of Molecular Diagnostics 7, 57 (2005). 194 Day, R. C. et al. Transcript Analysis Of Laser Microdissected Plant Cells. Physiol. Plantarum 129, 267-282 (2007). 195 Goldsworthy, S. M. et al. Effects Of Fixation On RNA Extraction And Amplification From Laser Capture Microdissected Tissue. Molecular Carcinogenesis 25, 86-91 (1999). 196 Gillespie, J. W. et al. Evaluation Of Non-Formalin Tissue Fixation For Molecular Profiling Studies. American Journal of Pathology 160, 449-457 (2002). 197 Kerk, N. M. Laser Capture Microdissection Of Cells From Plant Tissues. Plant Physiol. 132, 27-35 (2003). 198 Cosgrove, D. J. Growth Of The Plant Cell Wall. Nat Rev Mol Cell Biol 6, 850-861 (2005).

198 199 Albrecht, U. & Bowman, K. D. Gene Expression In Citrus sinensis (L.) Osbeck Following Infection With The Bacterial Pathogen Candidatus Liberibacter Asiaticus Causing Huanglongbing In Florida. Plant Science 175, 291-306 (2008). 200 Ding, X. H. et al. Activation Of The Indole-3-Acetic Acid-Amido Synthetase GH3-8 Suppresses Expansin Expression And Promotes Salicylate- And Jasmonate-Independent Basal Immunity In Rice. Plant Cell 20, 228-240 (2008). 201 Kikuchi, T. et al. Expressed Sequence Tag (EST) Analysis Of The Pine Wood Nematode Bursaphelenchus xylophilus And B. mucronatus. Mol. Biochem. Parasitol. 155, 9-17 (2007). 202 Gal, T. Z. et al. Expression Of A Plant Expansin Is Involved In The Establishment Of Root Knot Nematode Parasitism In Tomato. Planta 224, 155-162 (2006). 203 Li, Z. C., et al. An Oat Coleoptile Wall Protein That Induces Wall Extension In-Vitro And That Is Antigenically Related To A Similar Protein From Cucumber Hypocotyls. Planta 191, 349-356 (1993). 204 Cho, H.T. & Kende, H. Expansins In Deepwater Rice Internodes. Plant Physiology 113, 1137-1143 (1997). 205 Cosgrove, D. J et al. Group I Allergens Of Grass Pollen As Cell Wall-Loosening Agents. P. Natl. Acad. Sci. USA 94, 6559 (1997). 206 McQueenmason, S. J. & Cosgrove, D. J. Disruption Of Hydrogen Bonding Between Plant Cell Wall Polymers By Proteins That Induce Wall Extension. P. Natl. Acad. Sci. USA 91, 6574-6578 (1994). 207 McQueenmason, S. J. & Cosgrove, D. J. Expansin Mode Of Action On Cell-Walls - Analysis Of Wall Hydrolysis, Stress-Relaxation, And Binding. Plant Physiol. 107, 87- 100 (1995). 208 Li, Y. et al. Expansins And Cell Growth. Curr Opin Plant Biol 6, 603-610 (2003). 209 Cosgrove, D. J. Loosening Of Plant Cell Walls By Expansins. Nature 407, 321-326 (2000). 210 Kapu, N. U. S. & Cosgrove, D. J. Changes In Growth And Cell Wall Extensibility Of Maize Silks Following Pollination. J. Exp. Bot. 61, 4097-4107 (2010). 211 Tabuchi, A. et al. 
Matrix Solubilization And Cell Wall Weakening By Beta-Expansin (Group-1 Allergen) From Maize Pollen. Plant J. 68, 546-559 (2011). 212 Yennawar, N. H et al. Crystal Structure And Activities Of EXPB1 (Zea M 1), Alpha, Beta-Expansin And Group-1 Pollen Allergen From Maize. P. Natl. Acad. Sci. USA 103, 14664-14671 (2006). 213 Vial, L. et al. The Various Lifestyles Of The Burkholderia cepacia Complex Species: A Tribute To Adaptation. Environmental Microbiology 13, 1-12 (2011). 214 Atsatt, P. R. Parasitic Flowering Plants - How Did They Evolve? Am. Nat. 107, 502-510 (1973). 215 Chuang, T. I. & Heckard, L. R. Generic Realignment And Synopsis Of Subtribe Castillejinae (Scrophulariaceae-Tribe Pediculareae). Syst. Bot. 16, 644-666 (1991). 216 Fernandez-Aparicio, M et al. Transformation And Regeneration Of The Holoparasitic Plant Phelipanche aegyptiaca. Plant Methods 7, 36 (2011). 217 Schroeder, A. et al. The RIN: An RNA Integrity Number For Assigning Integrity Values To RNA Measurements. BMC Mol Biol 7,3 (2006). 218 The MarthLab, http://bioinformatics.bc.edu/marthlab/Main_Page 219 blast2go, http://www.blast2go.com/ 220 Venny, http://bioinfogp.cnb.csic.es/tools/venny/index.html 221 Edgar, R. C. MUSCLE: A Multiple Sequence Alignment Method With Reduced Time And Space Complexity. BMC Bioinformatics 5, 113 (2004).

199 222 Stamatakis, A. RAxML-VI-HPC: Maximum Likelihood-Based Phylogenetic Analyses With Thousands Of Taxa And Mixed Models. Bioinformatics 22, 2688-2690 (2006). 223 Xing, G. F. et al. Identification And Characterization Of A Novel Hybrid Upregulated Long Non-Protein Coding RNA In Maize Seedling Roots. Plant Science 179, 356-363 (2010). 224 Jones, K. M. et al. Differential Response Of The Plant Medicago Truncatula To Its Symbiont Sinorhizobium meliloti Or An Exopolysaccharide-Deficient Mutant. P. Natl. Acad. Sci. USA 105, 704-709 (2008). 225 Heide-Jorgensen, H. S. & Kuit, J. The Haustorium Of The Root Parasite Triphysaria (Scrophulariaceae), With Special Reference To Xylem Bridge Ultrastructure. Am. J. Bot. 82, 782-797 (1995). 226 Dorr, I. & Kollmann, R. Symplasmic Sieve Element Continuity Between Orobanche And Its Host. Bot. Acta 108, 47-55 (1995). 227 Aly, R. et al. Movement Of Protein And Macromolecules Between Host Plants And The Parasitic Weed Phelipanche Aegyptiaca Pers. Plant Cell Reports 30, 2233-2241 (2011). 228 Samb, P. I. & Ba, A. T. Structural And Ultrastructural Study Of The Haustoria Of Striga Aspera (Willd.) Benth. Acta Bot. Gall. 144, 45-56 (1997). 229 Dorr, I. How Striga Parasitizes Its Host: A TEM And SEM Study. Ann. Bot. 79, 463-472 (1997). 230 The Compositae Genome Project, http://compgenomics.ucdavis.edu/ (2013). 231 Anders, S. & Huber, W. Differential Expression Analysis For Sequence Count Data. Genome Biol 11, R106 (2010). 232 Shuai, B et al. The LATERAL ORGAN BOUNDARIES Gene Defines A Novel, Plant- Specific Gene Family. Plant Physiol. 129, 747-761 (2002). 233 Husbands, A et al. LATERAL ORGAN BOUNDARIES defines a new family of DNA- binding transcription factors and can interact with specific bHLH proteins. Nucleic Acids Res. 35, 6663-6671 (2007). 234 Moon, J. & Hake, S. How A Leaf Gets Its Shape. Curr. Opin. Plant Biol. 14, 24-30 (2011). 235 Miyawaki, K. et al. 
Expression Of Cytokinin Biosynthetic Isopentenyltransferase Genes In Arabidopsis: Tissue Specificity And Regulation By Auxin, Cytokinin, And Nitrate. Plant J. 37, 128-138 (2004). 236 Peron, T. et al. Role Of The Sucrose Synthase Encoding Prsus1 Gene In The Development Of The Parasitic Plant Phelipanche ramosa L. (Pomel). Mol. Plant Microbe In. 25, 402-411 (2012). 237 Chickarmane, V. S et al. Cytokinin Signaling As A Positional Cue For Patterning The Apical-Basal Axis Of The Growing Arabidopsis Shoot Meristem. P. Natl. Acad. Sci. USA 109, 4002-4007 (2012). 238 Negi, J. et al. CO2 Regulator SLAC1 And Its Homologues Are Essential For Anion Homeostasis In Plant Cells. Nature 452, 483-486 (2008). 239 Kim, M. J. et al. Peroxidase Contributes To ROS Production During Arabidopsis Root Response To Potassium Deficiency. Molecular Plant 3, 420-427 (2010). 240 Bailey-Serres, J. & Mittler, R. The Roles Of Reactive Oxygen Species In Plant Cells. Plant Physiol. 141, 311-311 (2006). 241 Fernández-Aparicio, M. et al. Application Of Qrt-PCR And RNA-Seq Analysis For The Identification Of Housekeeping Genes Useful For Normalization Of Gene Expression Values During Striga hermonthica Development. Mol Biol Rep 40, 3395-3407 (2013). 242 Westwood, J. H. Characterization Of The Orobanche-Arabidopsis System For Studying Parasite-Host Interactions. Weed Sci. 48, 742-748 (2000).

200 243 Schnable, P. S. et al. The B73 Maize Genome: Complexity, Diversity, And Dynamics. Science 326, 1112-1115 (2009). 244 Young, N. D. et al. The Medicago Genome Provides Insight Into The Evolution Of Rhizobial Symbioses. Nature 480, 520-524 (2011). 245 Paterson, A. H. et al. The Sorghum Bicolor Genome And The Diversification Of Grasses. Nature 457, 551-556 (2009).

Loren Axel Honaas

Education:
Ph.D. The Pennsylvania State University (2013) Plant Biology
B.S. Southeast Missouri State University (2003) Biology

Publications:
Guettler S, Jackson EN, Lucchese SA, Honaas L, Green A, Hittinger CT, Tian Y, Lilly WW, Gathman AC (2003) ESTs from the basidiomycete Schizophyllum commune grown on nitrogen-replete and nitrogen-limited media. Fungal Genet Biol. 39(2):191-8.
Hammes UZ, Nielsen E, Honaas LA, Taylor CG, Schachtman DP (2006) AtCAT6, a sink-tissue-localized transporter for essential amino acids in Arabidopsis. The Plant Journal 48, 414–426.
Wickett N, Honaas LA, Wafula E, Das M, Huang K, Wu B, Landherr L, Timko M, Yoder J, Westwood J, dePamphilis C (2011) Transcriptomes of the Parasitic Plant Family Orobanchaceae Reveal Surprising Conservation of Chlorophyll Synthesis. Current Biology 21, 2098–2104.
Westwood JH, dePamphilis CW, Das M, Fernández-Aparicio M, Honaas LA, Timko MP, Wafula EK, Wickett NJ, Yoder JI (2012) The Parasitic Plant Genome Project: New Tools for Understanding the Biology of Orobanche and Striga. Weed Science 60, 295-306.
Fernández-Aparicio M, Huang K, Wafula EK, Honaas LA, Wickett NJ, Timko MP, dePamphilis CW, Yoder JI, Westwood JH (2012) Application of qRT-PCR and RNA-Seq analysis for the identification of housekeeping genes useful for normalization of gene expression values during Striga hermonthica development. Molecular Biology Reports.
Honaas LA, Wafula EK, Yang Z, Der JP, Wickett NJ, Altman NS, Taylor CG, Timko MP, Westwood JH, dePamphilis CW (2013) Functional genomics of a generalist parasitic plant: Laser microdissection of host-parasite interface reveals host-specific patterns of parasite gene expression. BMC Plant Biology 13:9.
Zhang Y, Fernández-Aparicio M, Wafula EK, Das M, Jiao Y, Wickett NJ, Honaas LA, Ralph PE, Wojciechowski MF, Timko MP, Yoder JI, Westwood JH, dePamphilis CW (2013) Evolution of a horizontally acquired legume gene, albumin 1, in the parasitic plant Phelipanche aegyptiaca and related species. BMC Evolutionary Biology 13:48.

Awards/Honors:
Elected to membership of βββ Biological Honors Society, Iota Sigma chapter 2002
IBIOS 1st year Ph.D. Fellowship 2005-2006 ~ $33,000
Plant Physiology RA Fellowship Fall 2009 ~ $8000
Braddock Research Award 2010 - $1200
J. Ben and Helen Hill Memorial Fund Award 2011 - $600
Botanical Society of America, International Botanical Congress Travel Award 2011 - $2000
Institute of Molecular Evolution and Genomics Travel Award 2011 - $300
European Weed Research Society Travel Award 2011 - €400
Intercollege Graduate Degree Program in Plant Biology Travel Award 2011 - $450
Eva J. Pell Endowed Graduate Student Scholarship 2013 - $2500
Penn State PostDoc Society Travel Award 2013 - $250