
CHARACTERIZATION OF GENOMIC DIVERSITY AT A

QUANTITATIVE DISEASE RESISTANCE LOCUS IN MAIZE USING

IMPROVED BIOINFORMATIC TOOLS FOR TARGETED

RESEQUENCING

by

Felix Francis

A dissertation submitted to the Faculty of the University of Delaware in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Bioinformatics and Systems Biology

Spring 2018

© 2018 Felix Francis
All Rights Reserved

Approved: Cathy H. Wu, Ph.D. Chair of Bioinformatics & Computational Biology

Approved: Mark Rieger, Ph.D. Dean of the College of Agriculture and Natural Resources

Approved: Ann L. Ardis, Ph.D. Senior Vice Provost for Graduate and Professional Education

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Randall J. Wisser, Ph.D. Professor in charge of dissertation

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: J. Antoni Rafalski, Ph.D. Member of dissertation committee

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Shawn W. Polson, Ph.D. Member of dissertation committee

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Blake C. Meyers, Ph.D. Member of dissertation committee

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Li Liao, Ph.D. Member of dissertation committee

ACKNOWLEDGEMENTS

I would like to express my deep and sincere gratitude to my dissertation advisor, Dr. Randall J. Wisser, for the opportunity to pursue research in his lab. The experience has definitely helped me become a better scientist. In particular, I thank him for giving me the independence to explore my research ideas; through this experience, I have learned a lot, especially the importance of perseverance while dealing with challenging research problems.

I am extremely grateful to my dissertation committee members, Dr. Blake C. Meyers, Dr. J. Antoni Rafalski, Dr. Shawn W. Polson, and Dr. Li Liao, for their feedback and guidance, which greatly helped shape my research direction.

I thank the Wisser group for all the interesting discussions and for providing an enjoyable environment to learn. I greatly appreciate the assistance of Teclemariam Weldekidan, Michael Dumas and Scott Davis for various crucial validation and data generation work associated with this dissertation. I feel particularly lucky to have shared my time at the lab with Meredith Biedrzycki, Juliana Teixeira, Heather Manching and Terence Mhora, who continue to inspire me with their enthusiasm towards science and have been brilliant role models as early career scientists. I would also like to thank the other collaborators on the NSF grant that provided the funding for this research, especially Dr. Rebecca Nelson and Dr. Tiffany Jamann, who provided valuable insights into the biological questions addressed in this dissertation.

I appreciate those who got me started in life and research, especially my parents, for their support and encouragement and for introducing me to scientific research. I thank my teachers and advisors during my undergraduate and master's programs for inspiring me to pursue science. I am especially thankful to my wife, Pratha Sah, for her support and patience throughout these six years and beyond; her encouragement and sacrifice made this happen.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT

Chapter

1 INTRODUCTION

1.1 Role of genomic diversity for crop improvement
1.2 Challenges in plant genome sequencing projects
1.3 Maize genomic diversity
1.4 Complex traits and Quantitative disease resistance (QDR)

2 THERMOALIGN: A GENOME-AWARE PRIMER DESIGN TOOL FOR STANDARD PCR AND TILED AMPLICON RESEQUENCING

2.1 Abstract
2.2 Introduction
2.3 Results
2.3.1 Target Region Selection (TRS)
2.3.2 Unique Oligo Design (UOD)
2.3.3 Priming Specificity Evaluation (PSE)
2.3.4 Primer Pair Selection (PPS)
2.3.5 Empirical evaluation of priming specificity
2.4 Discussion
2.5 Methods
2.5.1 ThermoAlign pipeline
2.5.2 Target region selection (TRS)
2.5.3 Unique oligonucleotide design (UOD)
2.5.4 Priming specificity evaluation (PSE)
2.5.5 Primer pair selection (PPS)
2.5.6 PCR validation
2.5.7 SMRT sequencing and analysis of long-range PCR amplicons
2.6 Availability
2.7 Acknowledgments
2.8 Author contributions statement
2.9 Additional information

3 CLUSTERING OF CIRCULAR CONSENSUS SEQUENCES: ACCURATE ERROR CORRECTION AND ASSEMBLY OF SINGLE MOLECULE REAL-TIME READS FROM MULTIPLEXED AMPLICON LIBRARIES

3.1 Abstract
3.2 Background
3.3 Methods
3.3.1 Sequence data
3.3.2 Clustering of circular consensus sequences for long amplicon analysis
3.3.3 Evaluating the accuracy of C3S-LAA
3.4 Results and Discussion
3.5 Conclusion
3.6 Availability

4 RESEQUENCING OF A QUANTITATIVE DISEASE RESISTANCE LOCUS IN MAIZE PROVIDES BENCHMARK DATA AND INSIGHT INTO THE SPECTRUM OF SEQUENCE VARIATION AMONG INBRED LINES

4.1 Introduction
4.2 Methods
4.2.1 Barcoded DNA amplification of the qNLB 1 25721468 23298 locus
4.2.2 Sequencing, error correction and assembly of multiplexed amplicon libraries
4.2.3 Sequence characterization
4.2.4 Comparison to maize HapMap3
4.2.5 Annotation of variant effects
4.2.6 Association mapping
4.3 Results
4.3.1 Genomic diversity across the qNLB 1 25721468 23298 locus
4.3.2 Comparison to maize HapMap3
4.3.3 Analysis of the NLB susceptible, Tx303 haplotype
4.4 Discussion

5 DISCUSSION AND CONCLUSIONS

5.1 A ThermoAlign approach for targeted enrichment of repetitive genomes
5.2 SMRT sequencing and assembly of multiplexed amplicon libraries from the maize genome
5.3 Unravelling the genomic diversity at a maize quantitative disease resistance (QDR) locus using long molecule resequencing
5.4 Future directions

BIBLIOGRAPHY

Appendix

A SUPPLEMENTARY INFORMATION FOR THERMOALIGN: A GENOME-AWARE PRIMER DESIGN TOOL FOR STANDARD PCR AND TILED AMPLICON RESEQUENCING

B SUPPLEMENTARY INFORMATION FOR: CLUSTERING OF CIRCULAR CONSENSUS SEQUENCES: ACCURATE ERROR CORRECTION AND ASSEMBLY OF SINGLE MOLECULE REAL-TIME READS FROM MULTIPLEXED AMPLICON LIBRARIES

C SUPPLEMENTARY INFORMATION FOR: UNRAVELLING THE GENOMIC DIVERSITY AT A MAIZE QUANTITATIVE DISEASE RESISTANCE LOCUS USING LONG MOLECULE RESEQUENCING

D PERMISSIONS

LIST OF TABLES

2.1 Results from BLASTn alignment of error corrected PacBio consensus sequences to the B73 genome.

3.1 Comparison of LAA and C3S-LAA consensus sequences for B73 amplicons.

3.2 The number of consensus sequences generated from the multiplex library, following barcode demultiplexing.

4.1 Quartiles of genotyping accuracy for maize HapMap3 at the qNLB 1 25721468 23298 locus.

A.1 Comparison of ThermoAlign to related primer design tools.

A.2 Effects of the amplicon size range parameter on the minimum tiling path primer design for the 24 kb target region described in the main text.

A.3 Eight genomic loci in maize B73 genome, selected for targeted enrichment

A.4 Target and off-target qPCR assays used to quantify enrichment ratios

A.5 Enrichment ratios of RSE products

A.6 qPCR analysis of SWGA products

B.1 PacBio reads of insert protocol output metrics.

B.2 Padded and barcoded primer sequences used for amplification of six maize lines.

C.1 Tiling path of primers used for the resequenced 27 NAM founder lines

C.2 Padding, barcodes and primer sequences used for amplification of the 27 NAM founder lines.

LIST OF FIGURES

2.1 Schematic of the design strategy and workflow for ThermoAlign. A single run parameters file is used by all components of the pipeline. Colored boxes represent the four core modules of ThermoAlign, enumerated in their order of operation: (1) target region selection, (2) unique oligonucleotide design, (3) primer specificity evaluation, and (4) primer pair selection. Dashed boxes represent sub-routines within each of these modules and arrows depict their order of operation. The remaining elements are the database (reference genome sequence), external files (variant call format [.vcf] files and a run parameters file) and functions (nearest-neighbor model for the Tm of homodimer, heterodimer and hairpin interaction functions in Primer3). Connecting lines for these remaining components depict dependencies for the connected components (a filled dot is used to indicate the source from which information or a function is pulled). Required inputs for ThermoAlign are indicated by an asterisk.

2.2 Repeat content and GC percentage across the 24 kb target region described in the text. The figure is based on analysis of each 25 bp sequence (26 bp sliding window) of the plus strand. For all subfigures, red lines show the number of thermoalignments with an off-target Tm within 10 °C of the corresponding on-target Tm. Yellow lines (orange when overlapping with red) show the number of thermoalignments between a given primer and off-target sites with ≥70 percent identity (pid). Blue lines show the percent GC content. The search for off-target sites was based on BLASTn settings used in this study for priming specificity evaluation (see Methods), which had a maximum of 20 potential sites per pseudomolecule or a total of 260 possible sites. (a) Cumulative distribution of the number of repeats and percent GC content. (b) Genomic distribution of repeat content and GC percentage. The pseudomolecule coordinate of the 5'-nucleotide of each 25 bp sequence was used to position the plotted data. Black horizontal bars on the x-axis show the two genes in this region [left: GRMZM2G031364; right: GRMZM2G031239]. Among 25-mers in the region, ≈73% would be predicted to have a misprime Tm within 10 °C of the primer Tm. (c) The CIRCOS plot extends from a single primer in the region with the greatest number (n = 215) of predicted mispriming sites across the genome. Red lines of the CIRCOS plot connect the predicted mispriming sites on the pseudomolecules for chromosomes 1 to 10, the mitochondria (Mt), the plastid (Pt) and unmapped sequences (unkn).

2.3 Creation and characteristics of thermoalignments compared to local alignments and the number of mismatches. (a.1) Examples of full-length primer sequences. (a.2) The top-ranking BLASTn high-scoring segment pair (HSP) alignment for two off-target sequences (bottom strand) is processed into a (a.3) thermoalignment by end-filling (ungapped BLASTn) or removing gaps and end-filling (gapped BLASTn) the original BLASTn HSP alignment. (b) For 877 candidate primers outputted by the UOD module for the 24 kb region described in the text, the Tm was calculated for each top-ranking BLASTn HSP alignment and the corresponding thermoalignment. (c) Using the subset of thermoalignments formed from ungapped BLASTn HSPs (n=169,404 alignments), the plot shows the relationship between the off-target Tm for thermoalignments compared to the total number of mismatches. (d) Using the same subset of data in (c) the plot shows the difference between the on-target Tm and the off-target Tm of thermoalignments compared to the total number of mismatches.

2.4 Empirical evaluation of ThermoAlign using standard and long-range PCR to tile five genes. The products from two additional genes amplified with standard PCR but not long-range PCR (as described in the text) are not shown. Labels indicate the chromosome number of the target locus, the forward primer start site and the expected size of the product. Details on each primer are available in Supplementary File S3 online[45]. (a) Standard PCR products were quantified without post-PCR purification and approximately 7.5 ng were loaded into each well. For the two reactions that had no product, a volume equivalent to the average volume loaded was used. Multiplex reactions composed of primer pairs corresponding to each set for a given gene were loaded alongside the primers belonging to that same set. (b) Long-range PCR products from reactions without (-) and with (+) betaine. PCR products were quantified without post-PCR purification and ≈29 ng were loaded into each well. For the three reactions that had no product, the same volume used for the corresponding betaine reaction was loaded into the well. For the negative control, the maximum volume used among all of the reactions was loaded into the well. The negative control was composed of master mix, primer pair TA 1 25390617 27 F and TA 1 25395472 24 R with no DNA template. Lanes with background smearing were associated with reactions that required a greater volume of the product be loaded to achieve a standardized amount of product across lanes.

2.5 Directed graph analysis approach used in ThermoAlign to identify primer pairs for an amplicon tiling path at a targeted region. xi and yi represent forward and reverse primers respectively. Nodes correspond to expected amplicons (depicted as colored bars), which are ordered by their base pair coordinates. Directed edges (lines with an arrowhead) are drawn between overlapping amplicons, which in this case forms two subgraphs. Edge weights (depicted by the thickness of the lines representing edges; here, thicker lines represent the smaller weighted edges that would be selected) are computed based on the cumulative coverage and amount of overlap. Dijkstra's shortest path algorithm is applied to each subgraph to identify the primer pairs comprising the minimum tiling path of amplicons (black colored edges). As portrayed by the blue and yellow coloring, two groups of non-overlapping primer pairs are designated for multiplex consideration. The expected coverage for this example minimum tiling path is indicated by black (covered) and white (gaps) fill.

3.1 Graphical representation of the C3S-LAA process and pipeline. (a) Raw reads comprised of multiple subreads are depicted for three different amplicons [green, fuchsia and blue boxes; different shades of color are used to portray variable subread sequence qualities (darker shading portrays higher quality)]. Subreads are separated by a shared adapter sequence (grey boxes). The higher quality CCS read for each raw read is used to cluster the corresponding raw reads into CCS-based cluster groups. Error correction is performed per CCS-based cluster, producing top quality consensus sequences, followed by assembly of any overlapping consensus sequences. (b) A single run parameters file is used by all components of the pipeline. The grey highlighted rectangles represent two main steps of C3S-LAA. (i) Using the CCS reads generated by the SMRT analysis reads of insert protocol, C3S clusters the raw reads according to each barcode-primer pair combination, producing files of read identifiers to whitelist the corresponding raw reads. (ii) Raw read clusters are passed to Quiver to generate amplicon-specific consensus sequences, which are then passed to Minimus for sequence assembly. Rectangles with folded corners represent single files or multiple files (depicted as stacks of files) and those with rounded edges represent scripts and tools. Arrows indicate output files that are generated. Connecting lines with dots at one end depict input files, with the dot corresponding to the source data for the connected script or tool.

3.2 Sequence accuracy as a function of subread depth. (a) Accuracy of consensus and (b) assembly sequences. Data from all the amplicons were pooled together to evaluate the consensus calling accuracy as a function of depth of coverage of SMRT raw reads. The vertical line shows the minimum read depth of the consensus sequences used for assemblies.

3.3 Total number of accurate bootstrap assemblies versus CCS sample size. At each level of the CCS read depth sample (1-40), the figure shows the total number of bootstrapped assemblies that were 100% identical to the reference sequence. This was determined for the four target regions (25 bootstrap assemblies at each of 4 loci, giving rise to a maximum of 100 on the x-axis) formed from the consensus sequences among the eight overlapping amplicons.

3.4 Sequence alignment highlighting a recurring insertion error in some bootstrap samples. The alignment corresponds to the consensus sequence for a part of the amplicon from (a) locus 6 7045710 7052049 (Query) and (b) locus 1 25390617 25396540 (Query) on maize chromosomes 6 and 1, respectively, compared to the B73 v3 reference sequence (Sbjct).

4.1 Sequence variation at the qNLB 1 25721468 23298 locus. (a) Nucleotide coordinates corresponding to the B73 RefGen v4 genome are indicated at the top. The B73 V4 track corresponds to the reference genome sequence along with encoded genes annotated in the region, which include Protein UPSTREAM OF FLC (UFC), T-complex protein 1 subunit gamma (CCT3), uncharacterized protein (LOC100382810), remorin 6.3 (remorin) and Pentatricopeptide repeat (PPR) containing protein. The remaining tracks show each of the ReSeq samples. Grey and black bands represent consensus and variant sites (SNPs, MPNs and Indels) relative to the B73 RefGen v4 sequence, respectively. Horizontal thin lines indicate deletions, where grey lines indicate deletions shared with the reference and black lines are unique to the line. White bands represent assembly gaps. (b) The tabulated data is the number of different types of variants per line, adjusted according to the length of the corresponding assembly.

4.2 Haplotype network for the ReSeq assemblies. Each unique haplotype is represented by an individual pie chart that is labelled using Roman numerals. The radius of each pie chart is proportional to the number of genotypes with a given haplotype, and the number of sections indicates the number of individuals with that same haplotype. Edges connecting a given pair of haplotypes indicate the number of SNP differences between those haplotypes; however, edge numbers are not additive across more than two nodes.

4.3 Genotyping accuracy in relation to sequence and genotype features at the qNLB 1 25721468 23298 locus. (a) The minor allele frequency (MAF) among all samples in maize HapMap3. (b) Genotype accuracy estimated for the NAM set in HapMap3, based on the comparison to the ReSeq data. The size of each point reflects the percentage of heterozygous genotype calls present among samples constituting the NAM set in HapMap3. The color gradient reflects the median read depth of genotype calls for samples constituting the NAM set in HapMap3. (c) Sequence features of B73 RefGen v4, including the mean percent GC content (window size: 102 bp, step size: 25 bp), the number of thermoalignment defined repeats (i.e. sequences within the locus that were separated from other loci in the maize genome by less than 10 °C in melting temperature[45]), repeat masked segments and genes (from left to right: UFC, CCT3, LOC100382810, remorin and PPR [see Figure 4.1]).

4.4 Impacts and gene structure positions of variants based on VEP-based annotation. (a) Predicted impacts of variants between ReSeq Tx303 and B73 haplotypes; (b) Gene structure positions of variants in a; (c) Predicted impacts of variants unique to the ReSeq Tx303 haplotype (compared to all ReSeq NAM founder haplotypes, including B73); (d) Gene structure positions of variants in c; (e) Predicted impacts of variants among all ReSeq NAM haplotypes (including B73); (f) Gene structure positions of variants in e. Note: an individual variant can have more than one impact, due to alternative splicing annotation. Hence, the total number of VEPs exceeds the total number of variants per se.

A.1 Effects of UOD filters on all 82,520 primers at monomorphic sites across the 24 kb region described in the main text. The number of primers filtered are indicated for the parameters classified as (A) primer features and (B) primer interactions.

A.2 Relationship between ThermoAlign search speed and the search space required for a sufficient search to identify primers predicted to produce specific, on-target PCR products (i.e. on-target Tm more than 10 °C above all off-target Tm values). Results are based on primers designed to the 24 kb locus described in the main text. (A) Wall time seconds for running ThermoAlign using different BLASTn hsp values for PSE. (B) Lines in the plot show the difference between on-target and off-target Tm (computed from thermoalignments) of each primer across different BLASTn hsp values used for the PSE module. The search space is expanded by increasing the BLASTn hsp value. A single primer may have multiple off-target matches, but each primer is represented only by the off-target match with the minimum difference in Tm. Only primers that resulted in a change at different hsp values were plotted. Note: because hsp alignments with the same percent identity to the query sequence may be extended into thermoalignments that are not of the same percent identity, it is not guaranteed that the minimally distant off-target site will be identified when the hsp value is low. For this reason, some primers show a decreasing difference in Tm as the hsp setting is increased. Once there is no change, the minimally distant off-target site within the genome has been identified.

A.3 Minimum tiling paths of standard PCR and long range PCR primers tested in this study. Green lines indicate amplicons from primer pairs that worked without any reaction optimization. Orange lines indicate amplicons from primers that worked only with the addition of betaine. The gene track indicates genes to which the primer pairs were designed (Remorin: GRMZM2G107774; NPB: GRMZM2G022627; p450: GRMZM2G031364; LHT1: GRMZM2G127328; GST: GRMZM2G416632).

A.4 Depth of coverage for on-target versus off-target regions of the B73 reference genome from Illumina sequencing of RSE on B73 gDNA.

B.1 Dot-plots of alignments between amplicon reference sequences and inaccurate consensus sequences generated by LAA. (a-g) The reference sequence for each amplicon is labeled according to the chromosome, start and stop position in the reference genome of B73 v3 (x-axes). The consensus sequences are labeled 7-13 (y-axes). Green lines show alignments in the same orientation while red lines show alignments in reverse orientation.

C.1 Repeat content within the qNLB 1 25721468 23298 locus. For each line, the proportion of repetitive elements (identified by RepeatMasker) in the resequence assembly is indicated.

C.2 Error and discovery rates in maize HapMap3 at the qNLB 1 25721468 23298 locus. Genotyping error and discovery rates for maize HapMap3 were computed across qNLB 1 25721468 23298 for the (a) NAM set and (b) 282 panel. The error rate is the number of incorrect genotype calls in maize HapMap3, using ReSeq variants as the ground truth. The discovery rate is the number of ReSeq variants captured by maize HapMap3.

C.3 Relationship between minor allele frequency (MAF) and genotyping accuracy at the qNLB 1 25721468 23298 locus. The MAF was computed from among all samples in maize HapMap3, while genotyping accuracy was estimated for the NAM set. The color gradient represents the percentage of heterozygotes at the site for which genotyping accuracy was measured.

C.4 Local associations from the qNLB 1 25721468 23298 locus for DTA, DTS, GDD anthesis, GDD silking and NLB phenotypes. The p-values were corrected for false discovery rate using the Benjamini-Hochberg procedure. All markers are plotted with their p-values (as -log10 values) as a function of the genomic position (B73 RefGen v4). The markers are colored based on their genotyping accuracy, where blue and orange colors indicate high and low accuracy values respectively. The horizontal dashed grey line indicates a p-value of 0.05. Gene annotations (filtered set) were obtained from the NCBI B73 v4 annotation release 101 and are indicated by green arrows.

C.5 Counts of Tx303 variants present in other NAM founders and their maximum variant impact prediction. The allele counts indicate the occurrence of the Tx303 variant among other resequenced NAM founders. The maximum VEP of each variant is color coded.

ABSTRACT

Sequence variation is a fundamental component of biodiversity that underlies the response to selection and improvement of crops via plant breeding. The objective of this dissertation was to use targeted resequencing to characterize genomic diversity at regions of the maize genome associated with quantitative disease resistance (QDR). However, unsuccessful attempts to enrich or amplify specific regions of the maize genome using current methods (including a method tested in this dissertation), as well as the requirement for repeat-subtraction techniques for effective sequence capture in maize, indicated that current methods for targeted resequencing could be improved, at least for genomes with high repeat content. Moreover, sources of error from sequencing technology and bioinformatic algorithms are important factors to consider for the implementation of resequencing studies.

To address these problems, new tools were developed for the production and analysis of multiplexed amplicon sequencing libraries: (i) ThermoAlign: a genome-aware primer design tool tailored for tiled amplicon resequencing; (ii) C3S-LAA: a sequencing error correction and assembly pipeline for single molecule real-time (SMRT) sequence data from amplicon libraries. Given a reference genome sequence, ThermoAlign performs priming-relevant genome-wide alignments under a thermodynamics model for DNA hybridization ("thermoalignments") to identify locally specific primer pairs. It was determined that the number of mismatches in an alignment or subsequence specificity was a poor proxy for evaluating priming specificity, and laboratory validation experiments demonstrated that the thermoalignment approach did indeed generate specific primer pairs. Notably, ThermoAlign has broad applications and can be used to design primer pairs for routine PCR assays and may also be extended for evaluating specificity of DNA hybridization probes and CRISPR/Cas9 guide RNAs.

To minimize errors from sequence data processing, a clustering of circular consensus sequence (C3S) algorithm was developed and shown to eliminate the bioinformatic source of error encountered using the current, standard long amplicon analysis (LAA) method for SMRT sequence data of amplicon libraries.

These developments enabled successful resequencing of 27 founder lines of a maize nested association mapping (NAM) population at a fine-mapped section of chromosome 1 (approximately 23 kb) associated with QDR to Northern leaf blight. Nestled within the highly repetitive and diverse genome of maize, this locus contained relatively low repeat content and low sequence diversity. Single nucleotide polymorphisms were the dominant type of variants and no major structural variants were observed. Using the resequence data as a benchmark, maize HapMap3, a community resource cataloging approximately 60 M variants across the genome, had a mean genotyping error rate of 12% per line and a limited catalog (less than 10%) of the variants present at the locus. This study demonstrated the importance and scope of long-read sequencing for accurate and exhaustive identification of genomic variants. Previous work suggested that the inbred line Tx303 carries a susceptible allele of a remorin gene that is unique to Tx303 among the NAM founders, which underlies variation in QDR, but resequencing revealed that Tx303-specific variants were only present within another gene in the region (T-complex protein 1 subunit gamma gene). This work contributes new tools for genome science and provides new insights into the potential causal variants at a locus associated with QDR to NLB in maize.

Chapter 1

INTRODUCTION

Diverse individuals of any given species encompass variations in their genome, known as polymorphisms. The primary determinants of genomic diversity are mutation rate, effective population size and linked selection, which are in turn influenced by several other parameters. Genomic diversity arises as a result of the balance between formation of new genomic variants or alleles at each generation by spontaneous mutation and their subsequent loss[43]. Mutation rate varies across the genome and among different species[86]. Insights into genomic variation began with the observation of individual chromosomes under the microscope, leading to an understanding of their karyotypes[119]. Early allozyme-based studies revealed substantial genetic variation between species[55], which was subsequently validated by DNA sequencing technologies[78]. A primary driving force of biological evolution is this variable nature of genomes. Genomic diversity and its analysis have implications for human health[119], agricultural breeding[54], infectious disease management[87] and biodiversity conservation[36], and genomic diversity is essential for a species to respond and adapt to environmental changes[12].

1.1 Role of genomic diversity for crop improvement

Studying the genomic diversity of the vast germplasm is important for its effective utilization in crop improvement. In spite of a wealth of germplasm being available worldwide for a variety of crops, only a small fraction of it is being tapped by breeders because of a lack of information beyond taxonomy and geographical origin[49]. In recent years, molecular marker technology has enabled a thorough understanding of genetic diversity in a wide variety of crops. Modern advances in genomics and sequencing technology are now enabling whole-genome and targeted gene/locus surveys. High-throughput sequencing speed and capacity have provided insights into the genomic architecture and evolution of more than 100 plant genomes[97]. Intuitively, the next stage of plant genomics would aim at refining these numerous draft reference genomes, resequencing diversity panels, and targeted high coverage resequencing to identify the molecular genetic variation underlying various traits and pathways of interest. It is increasingly becoming feasible to resequence large numbers of crop varieties and populations using high-throughput sequencing technologies once a reference genome is available. The resulting information enables investigations of genome-wide or locus-specific patterns of diversity and associations with various phenotypes[71].

1.2 Challenges in plant genome sequencing projects Sequence processing and assembly tools that were developed for non-plant species pose several hurdles while dealing with the complexity and size of some crop genomes. Plant genome size can be as large as 150,000 Mb for Paris japonica, which makes it the largest known eukaryotic genome[113]. Size of genomes and repeat content, which are typically correlated to each other, and their higher ploidy levels makes plant genome assembly tricky. This has lead to most of the draft plant genomes remaining as thousands of fragmented contigs or hundreds of scaffolds with numerous assembly gaps[97]. Accurately resolving the variants of repetitive plant genomes by mapping short sequencing reads onto a reference is still a major challenge[129]. Studies that require accurate information on presence/absence variations, gene structure, and detailed characterization of SNPs in large plant genomes require more than just “genome skimming” using low coverage sequencing[37]. High coverage of large plant genomes, on the other hand, is cost prohibitive and is often unnecessary. An effective approach instead is to reduce the complexity of the genome by first targeting a specific portion for selective enrichment, followed by high coverage sequencing. However, application of targeted enrichment techniques is difficult in large, complex plant genomes such as maize, with large amount of repetitive regions

across the genome. In maize specifically, transposable elements comprise almost 85% of the genome[130]. Adding to this complexity are the structural rearrangements that are very common in the maize genome. These challenges make it difficult to apply many of the commonly used targeted enrichment techniques to the maize genome. Designing oligos corresponding to a repetitive region makes the enrichment non-specific and leads to unwanted background capture. Structural rearrangements and other polymorphisms mean that oligos designed from a reference sequence are not effective at polymorphic sites in the genomes of other lines. Errors arising from sequencing technology and bioinformatic analysis tools are other important factors to consider when implementing resequencing studies. To address the limitations of existing targeted resequencing technologies, two tools were developed in this study (Chapters 2 and 3): ThermoAlign, a genome-aware primer design tool, and C3S-LAA, a sequencing error correction and assembly pipeline for single molecule real-time (SMRT) sequence data from multiplexed amplicon libraries. The development of ThermoAlign stems from a couple of years of work related to this dissertation in which we aimed to systematically resequence genomic regions of tens to hundreds of kilobases within the highly repetitive genome of maize. After several unsuccessful attempts at targeted enrichment in maize using available methods of primer design and enrichment technology (commercial and public), I found the need for a better approach to design template-specific primer pairs, along with a comprehensive pipeline that could facilitate identification of tiling paths of amplicons traversing a target sequence. With this objective in mind, I developed ThermoAlign, which integrates novel functions and unique algorithms in an extensible tool.
Chapter 2 of this dissertation shows how ThermoAlign is effective for standard PCR and long-range PCR primer pair design, achieving perfect amplification specificity in laboratory tests. This chapter therefore addresses the major challenges faced by researchers when resequencing targeted regions of repetitive genomes, which encompass a variety of organisms of medical and economic significance. Chapter 3 presents a unique approach (C3S-LAA) for analyzing PacBio SMRT sequence data from amplicon libraries that rectifies

performance issues with the community-standard long amplicon analysis (LAA) tool by PacBio. An assembly step was also integrated into C3S-LAA to handle overlapping amplicon sequences. A use case scenario is included that clearly demonstrates the failure of LAA and how the C3S-LAA method corrects for it. Thus, not only is the C3S-LAA approach capable of producing optimized sequence results, but the extension that we have developed for the analysis of amplicon sequence data could also be used for targeted resequencing studies that rely on multiplexed, tiled amplicon-based enrichment of genomes.

1.3 Maize genomic diversity

Several studies have evaluated sequence diversity at various maize genomic loci and have reported remarkable levels of sequence variation[46, 21], unlike what was predicted based on an earlier genetic mapping study of members of closely related plant species[14]. The high genetic diversity has helped transform maize into one of the world’s most productive crops. At the same time, domestication and commercial breeding have created population bottlenecks and directional selection, leading to some loss of maize genetic diversity[142]. The genetic diversity in maize was initially characterized using molecular markers and by sequencing multiple alleles from a few selected loci[141, 23, 145]. Genome sequencing technologies have now opened up new avenues for identifying genomic variation. For instance, the maize HapMap projects described the genomic variation across several hundred maize lines[30, 24]. Another genotyping-by-sequencing study used reduced-representation high-throughput sequencing technology to genotype 2,815 maize inbred accessions[123]. These studies have made valuable contributions to understanding the genomic diversity in large populations of maize. However, the use of reduced-representation, low-coverage and short-read sequencing technologies in these studies limits the characterization of all genomic variation, especially the structural variation of the complex and highly repetitive maize genome. In addition, the accuracy of such genotyping studies has not been evaluated against accurate genotype data. Chapter 4 of this dissertation

provides insights into the genomic diversity and collinearity at a ≈ 23 kb locus on chromosome 1 of maize that was sequenced across 27 founder lines of a maize nested association mapping (NAM) population.

1.4 Complex traits and Quantitative disease resistance (QDR)

Previous genetic studies have revealed that the associated genetic signals for complex traits are spread across most of the genome, including in regions proximal to genes without a direct connection to the traits under study[134, 84, 115]. In many cases, an association between a variant or a locus and a particular trait may not provide insights into the target gene or underlying molecular mechanisms[115]. Primarily for this reason, the path from association studies to the underlying biology of various traits has not been easy. The heritability of complex traits is typically found to be spread across the genome, indicating that a large fraction of all genes contribute to the underlying variation[84, 132]. The strongest associated variants from GWAS studies typically explain only a fraction of the genetic variance, although there are some exceptions to this[92]. These subtle contributions from various parts of the genome, along with the enrichment of noncoding variants in a majority of GWAS studies, make a mechanistic understanding of the molecular pathways underlying quantitative traits very challenging[17]. Initial studies in quantitative genetics were based on linkage to polymorphic markers. Since the late 20th century, the discovery of molecular markers has rapidly advanced various genotyping and mapping methods to better understand quantitative traits[6]. However, technological limitations have prevented a comprehensive understanding of the genetic variation underlying quantitative traits in genes, the molecular basis of functional allele effects, and the population frequency of causal variants. The current genomic revolution, spearheaded by massively parallel and cost-effective sequencing technologies, has the potential to solve these challenges, which will lead to a better understanding of the genomic and systems genetics pathways underlying various quantitative traits[50].

Disease resistance in plants can be broadly categorized into qualitative and quantitative types of resistance. Qualitative resistance typically leads to high levels of race/strain-specific resistance, and is conditioned by a single dominant or recessive gene[15]. Immunity by means of this specific pathogen recognition is often broken down by adapting pathogens and is therefore typically effective only for a short period of time[15]. In contrast, quantitative disease resistance (QDR) is better in an agronomic context because of its longer durability and wider effectiveness against different races of a pathogen[83]. Therefore, QDR is the major form of disease resistance employed in maize breeding[150]. Because QDR phenotypes are incomplete (exhibiting continuous or ordinal variation) and are conditioned by multiple loci of minor effect, dissecting the molecular mechanisms that underlie QDR is difficult. Novel approaches for studying the genomic variation underlying QDR will lead to a better understanding of host defense mechanisms, ultimately leading to the development of long-term solutions for disease control. Employing host plant genetic resistance has great potential as an environmentally friendly and effective approach for prevention and control of most of these diseases[103]. Chapter 4 of this dissertation focuses on characterizing the genomic diversity at a fine-mapped section of chromosome 1 (≈ 23 kb) associated with QDR to Northern leaf blight. ThermoAlign and C3S-LAA (Chapters 2 and 3) enabled successful resequencing of this QDR locus across the 27 NAM founders of maize.

Chapter 2

THERMOALIGN: A GENOME-AWARE PRIMER DESIGN TOOL FOR STANDARD PCR AND TILED AMPLICON RESEQUENCING

2.1 Abstract

Isolating and sequencing specific regions in a genome is a cornerstone of molecular biology. This has been facilitated by computationally encoding the thermodynamics of DNA hybridization for automated design of hybridization and priming oligonucleotides. However, the repetitive composition of genomes challenges the identification of target-specific oligonucleotides, which limits genetics and genomics research on many species. Here, a tool called ThermoAlign was developed that ensures the design of target-specific primer pairs for DNA amplification. This is achieved by evaluating the thermodynamics of hybridization for full-length oligonucleotide-template alignments (thermoalignments) across the genome to identify primers predicted to bind specifically to the target site. For amplification-based resequencing of regions that cannot be amplified by a single primer pair, a directed graph analysis method is used to identify minimum amplicon tiling paths. Laboratory validation by standard and long-range polymerase chain reaction and amplicon resequencing with maize, one of the most repetitive genomes sequenced to date (≈ 85% repeat content), demonstrated the specificity-by-design functionality of ThermoAlign. ThermoAlign is released under an open source license and bundled in a dependency-free container for wide distribution. It is anticipated that this tool will facilitate multiple applications in genetics and genomics and be useful in the workflow of high-throughput targeted resequencing studies.

2.2 Introduction

Isolating homologous DNA sequence from individuals at specific regions of a genome is fundamental to research and applications in genetics and genomics. Despite the advent of high-throughput sequencing technologies, obtaining a fully contiguous and accurate region-specific consensus sequence for multiple individuals from whole-genome DNA libraries can be cost prohibitive, unnecessary or inefficient. This is particularly challenging for investigations of large genomes and when surveying variation in population studies where hundreds to thousands of individuals are concerned. Consequently, a number of techniques have been devised for targeted enrichment, including microarray-based sequence capture (e.g., [47, 90]), molecular inversion probes (e.g., [74]) and PCR (e.g., [11]), each of which requires the design of target-specific hybridization or priming oligonucleotides.

Genomes contain varying degrees of repetitive DNA[56], and for many species this represents the major fraction of the genome. It has been estimated that > 50%[77] and as much as 69%[41] of the human genome is repetitive, and over 80% of the genome of some plant species is repetitive (e.g., [130, 104]). This poses a significant challenge to designing oligonucleotides that will hybridize and prime only on-target. Targeted enrichment approaches relying on hybridization-based DNA capture can result in off-target sequences being captured due to nonspecific binding[96, 81, 88]. Similarly, primers used for amplification-based approaches may produce off-target products[98, 59]. This presents the need for “genome-aware” oligonucleotide design tools that leverage reference genome sequence data to maximize the enrichment of on-target sequences.
Although there are several computational tools now available that facilitate genome-aware primer design[9, 117, 131, 125], obtaining specific amplification of targeted sequences is still a difficult problem, especially for genomes with large amounts of repetitive DNA. A popular tool for choosing primers is Primer-BLAST[155], which integrates Primer3 for primer design with BLAST and Needleman-Wunsch alignments for evaluating primer specificity. The number of non-complementary bases in sequence

alignments is used as the basis for a specificity filter to select candidate primers. However, several factors contribute to the mispriming potential of primers including temperature, reaction chemistry, nucleotide composition and the position and type of mismatching nucleotides[126], such that the number of mismatches alone is likely to be an insufficient measure of mispriming potential.

Given the availability of empirical data on the thermodynamics of hybridization for complementary[127] and non-complementary[4, 3, 2, 1, 114] base pairings, genome-wide estimates of the thermodynamics of primer-template hybridization can be incorporated into the selection process for oligonucleotide design tools. Indeed, some algorithms have used this approach, including specificity-determining subsequence (SDSS)[98, 152, 91], MFEprimer-2.0[117] and PRIMEGENSw3[75]. These particular tools differ in terms of the algorithms they use to evaluate priming specificity and to select primer pairs. Nevertheless, for all of these tools, the assessment of mispriming (a critical step for identifying target-specific primers) is restricted because it is initialized by evaluation of specificity at only the 3’-end region of a candidate primer against potential off-target binding sites. Only candidate primers with perfect complementarity at the 3’-end region are further evaluated in terms of binding stability. However, studies have demonstrated that mismatches within the 3’-end region can still lead to PCR amplification[76, 98], and Miura et al. (2005), who developed SDSS, discussed the need for more sophisticated methods to evaluate off-target hybridization and priming. In this study, such a method was developed which quantitatively evaluates the thermodynamics of hybridization for full-length primer-template binding at the target site relative to the remainder of the genome.
In terms of selecting primers for DNA amplification, most tools assess specificity according to whether primer pairs (not individual primers) are predicted not to produce off-target amplicons within a specified size range (Supplementary Table A.1). This criterion can result in the selection of primers that individually produce single-strand synthesis products at multiple off-target sites. For instance, a primer with many perfectly complementary off-target binding sites could be selected by tools that use this

approach. In turn, such primers can create byproducts that act as mega-primers in PCR, leading to background amplification. This issue is expected to be exacerbated when working with repetitive genomes[59]. Therefore, evaluating the specificity of individual primers could improve the selection of target-specific primer pairs.

For targeted resequencing, it is still not feasible to enrich and sequence many tens of kilobases or more of contiguous DNA, particularly for large samples of individuals. Some techniques have been developed that might be used for this, but in our attempts neither region-specific enrichment[39] nor selective whole genome amplification[79] led to enrichment of targeted regions of the maize genome (Supplementary methods and results). Further optimization may enable these techniques to work for our target regions, but our results suggested the repetitive nature of the genome prevented enrichment. Indeed, others have found hybridization-based enrichment to be difficult[47] or unsuccessful[88] in maize due to its repeat content. In contrast, given primer pairs that specifically prime on-target, DNA amplification is a highly effective method for enrichment. Thus, coupling target-specific primer design for DNA amplification with automated identification of primer pairs that amplify overlapping segments across a target region is an alternative solution to creating libraries for targeted resequencing. Unfortunately, existing tools are not suited for this and output only a subset of the higher scoring primer pairs designed to a region of interest. This leads to parts of the targeted locus not being covered or a situation where users need to manually pick from among numerous possible primer pairs. Therefore, the automated design of groups of primer pairs producing a tiling path of amplicons would be useful for resequencing studies.
Some studies have approached the problem of designing amplicon tiling paths by segmentation of the genomic locus and evaluating potential amplicon sizes and their overlap (e.g., [48, 7, 156]). In ThermoAlign, a unique implementation of directed graph analysis was programmed for selecting the minimum number of primer pairs forming an amplicon tiling path with maximum coverage of the target region.

Finally, while many algorithms and tools have been developed for target-specific primer design, not all of these have reported on their performance in laboratory tests,

such that the applicability of some tools remains unknown (Supplementary Table A.1). This makes it difficult for end-users to assess the utility of a given tool amid a wide landscape of such tools. Of those that have been tested in the laboratory, none have achieved 100% specificity in DNA amplification; however, reports have shown fairly good performance in a few species (yeast, human and soybean). Miura et al. (2005) evaluated their SDSS approach on the budding yeast genome and encountered some cases of off-target amplification. A study of the PrimerStation tool[152], which also uses SDSS, showed that all 10 primer pairs amplified specific products on human chromosome X. Reports on PRIMEGENS showed that ≈ 90% of the primer pairs produced a single amplicon[137, 136, 75]. Working in a genome scenario with a substantially greater repeat content (maize), we sought to develop a method that achieves near perfection in the automated design of primer pairs that amplify only on target.

In this study, a genome-aware primer pair design algorithm embedded within a distributable tool called ThermoAlign was developed and validated. Priming specificity is determined based on genome-wide analysis of the hybridization for full-length primer-template interactions in the presence of non-complementary base pairings. For population studies, ThermoAlign is also capable of using prior information on the locations of genomic variants during the selection of primer pairs. In addition to the design of primer pairs for standard, single-locus PCR amplification, a directed graph analysis method is used to design primer pairs forming amplicon tiling paths for targeted sequencing of large regions in a genome. Finally, the amplification specificity of primer pairs designed by ThermoAlign was validated on maize, which is one of the most repetitive genomes (≈ 85% repetitive) for which a reference sequence is available[130].
ThermoAlign is expected to facilitate targeted genome sequence studies for any species.

2.3 Results

A schematic of the ThermoAlign pipeline is shown in Figure 2.1. In the following sections, results are presented pertaining to each module of the tool. A 24 kb target region of the maize genome (B73 RefGen v3 Chr3:33490673..33514673) was used to

demonstrate the pipeline and highlight features of ThermoAlign. Sixty-six percent of this region is annotated in the genome assembly as repeat-masked. Using the unmasked sequence and examining repetitiveness relevant to primer binding, 72% of the primers designed to this region would be predicted to produce off-target priming events at 1 to 215 sites for a given primer (Figure 2.2). This same region, along with other segments of the genome, was used to test the amplification specificity of primers designed by ThermoAlign.

2.3.1 Target Region Selection (TRS)

ThermoAlign produces an output file with summary information from the run. Output for the 24 kb target region showed that it contained no gaps in the reference sequence assembly, 1,073 SNPs, 93 indels and 46% GC content.

2.3.2 Unique Oligo Design (UOD)

The UOD algorithm was designed to identify every individual primer (not primer pairs) in a target region that is deemed favorable for PCR and has no identical matches elsewhere in the genome. For the 24 kb targeted region, among 184,145 total possible primers, 82,520 did not occur at sites containing polymorphisms in maize HapMap3[24]. Applying the full set of remaining UOD filters resulted in the selection of 877 candidate primers.

The classification of the 82,520 primers into UOD filtering categories was examined to see which features had the largest effect on the removal of primers. This was split into two parts, starting with filters for primer sequence features and ending with filters for primer interactions (Supplementary Figure A.1). In terms of sequence features, 75,073 primers were filtered. Considering primers that were associated with only one sequence feature category, the A/T-end filter removed the largest number of primers (n=9,217), comprising ≈ 50% of the collective set of primers that were specific to only one feature (Supplementary Figure A.1(a)). The A/T-end feature is a useful heuristic to eliminate primers with a greater potential for inefficient priming[63].

[Figure 2.1: workflow diagram. Modules: (1) Target Region Selection; (2) Unique Oligonucleotide Design; (3) Primer Specificity Evaluation; (4) Primer Pair Selection. External elements: BLAST DB, FASTA sequences, .vcf files, run parameters file, NN Tm model, and Primer3 homodimer, heterodimer and hairpin functions.]

Figure 2.1: Schematic of the design strategy and workflow for ThermoAlign. A single run parameters file is used by all components of the pipeline. Colored boxes represent the four core modules of ThermoAlign, enumerated in their order of operation: (1) target region selection, (2) unique oligonucleotide design, (3) primer specificity evaluation, and (4) primer pair selection. Dashed boxes represent sub-routines within each of these modules and arrows depict their order of operation. The remaining elements are the database (reference genome sequence), external files (variant call format [.vcf] files and a run parameters file) and functions (nearest-neighbor model for the Tm of homodimer, heterodimer and hairpin interaction functions in Primer3). Connecting lines for these remaining components depict dependencies for the connected components (a filled dot is used to indicate the source from which information or a function is pulled). Required inputs for ThermoAlign are indicated by an asterisk.

[Figure 2.2: (a) cumulative proportion versus repeat count or percent GC; (b) repeat count or percent GC versus genomic position (33,490,673 to 33,515,648 bp); (c) CIRCOS plot across chr1 to chr10, Mt, Pt and unkn. Legend: Tm ≤ 10 °C, pid ≥ 70, %GC.]

Figure 2.2: Repeat content and GC percentage across the 24 kb target region described in the text. The figure is based on analysis of each 25 bp sequence (26 bp sliding window) of the plus strand. For all subfigures, red lines show the number of thermoalignments with an off-target Tm within 10 °C of the corresponding on-target Tm. Yellow lines (orange when overlapping with red) show the number of thermoalignments between a given primer and off-target sites with ≥70 percent identity (pid). Blue lines show the percent GC content. The search for off-target sites was based on BLASTn settings used in this study for priming specificity evaluation (see Methods), which had a maximum of 20 potential sites per pseudomolecule or a total of 260 possible sites. (a) Cumulative distribution of the number of repeats and percent GC content. (b) Genomic distribution of repeat content and GC percentage. The pseudomolecule coordinate of the 5’-nucleotide of each 25 bp sequence was used to position the plotted data. Black horizontal bars on the x-axis show the two genes in this region [left: GRMZM2G031364; right: GRMZM2G031239]. Among 25-mers in the region, ≈ 73% would be predicted to have a misprime Tm within 10 °C of the primer Tm. (c) The CIRCOS plot extends from a single primer in the region with the greatest number (n = 215) of predicted mispriming sites across the genome. Red lines of the CIRCOS plot connect the predicted mispriming sites on the pseudomolecules for chromosomes 1 to 10, the mitochondria (Mt), the plastid (Pt) and unmapped sequences (unkn).

Optionally, the A/T-end filter or other filters may be excluded or re-parameterized to achieve a higher discovery rate of candidate primers, but this comes at the cost of increasing the computational time required for primer specificity evaluation (PSE; next section). For example, excluding the A/T-end filter from UOD resulted in 1,161 additional candidate primers (compared to 877 identified with the A/T-end filter applied) but took approximately four times longer in runtime seconds for PSE.

The primer interaction filters, which were applied to the 7,447 primers that remained after filtering based on sequence features, included the occurrence of an exact match at an off-target site in the genome, homodimer Tm, heterodimer Tm and hairpin Tm[144] (Supplementary Figure A.1(b)). This resulted in the filtering of an additional 6,570 primers, leaving 433 forward primers and 444 reverse primers, with 136 being from the same position on the two strands.
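The first stage of UOD described above (enumerating every oligo per size class with a 1 bp sliding window, then filtering on sequence features) can be illustrated with a minimal Python sketch. This is not ThermoAlign's code: the function names, the GC bounds, and the reading of the A/T-end filter as "reject primers whose 3' base is A or T" are assumptions made for illustration, and the Tm, dimer, hairpin and off-target exact-match filters are omitted.

```python
from typing import Iterator, List, Tuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def enumerate_oligos(target: str, sizes: range) -> Iterator[Tuple[int, str, str]]:
    """Yield (start, strand, sequence) for every oligo of every size class.

    A 1 bp sliding window is applied to the plus strand; the minus-strand
    oligo for the same window is its reverse complement.
    """
    for size in sizes:
        for start in range(len(target) - size + 1):
            window = target[start:start + size]
            yield start, "+", window
            yield start, "-", reverse_complement(window)


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def passes_sequence_filters(seq: str, gc_min: float = 0.4, gc_max: float = 0.6) -> bool:
    """Hypothetical sequence-feature filters; all thresholds are assumptions."""
    if set(seq) - set("ACGT"):      # assembly gaps or ambiguous bases
        return False
    if not gc_min <= gc_fraction(seq) <= gc_max:
        return False
    if seq[-1] in "AT":             # assumed reading of the A/T-end filter
        return False
    return True


def candidate_oligos(target: str, sizes: range) -> List[Tuple[int, str, str]]:
    """Enumerate all oligos and keep those passing the sequence filters."""
    return [o for o in enumerate_oligos(target, sizes) if passes_sequence_filters(o[2])]
```

Dropping the last check in `passes_sequence_filters` mimics the option of excluding the A/T-end filter, which enlarges the candidate set passed on to the more expensive specificity evaluation.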

2.3.3 Priming Specificity Evaluation (PSE)

A critical aspect of ThermoAlign is the algorithmic and quantitative approach used to characterize off-target hybridization sites. As part of the algorithm to determine the potential for mispriming, BLASTn alignments for each off-target match are edited into thermoalignments (full-length, ungapped primer-template alignments) that allow meaningful and accurate estimates of the Tm to be computed for a primer (Figure 2.3). Native BLASTn alignments with ≥ 70% sequence identity (which are mostly truncated local alignments) had an average Tm that was 7 °C higher than that of their thermoalignment (Figure 2.3b). However, the Tm for 10.8% (n=18,834) of the BLASTn alignments was less than that of their thermoalignment (Figure 2.3b). The difference in Tm for BLASTn alignments compared to corresponding thermoalignments ranged from −14 to 272 °C. Considering the relationship between the number of mismatches and Tm, Figure 2.3c-d shows that the number of mismatches, although correlated with thermoalignment Tm, is not a good proxy for the potential of mispriming. Even in the presence of multiple mismatches, the Tm for binding at off-target sites can be at temperatures typical for PCR (e.g. > 60 °C; Figure 2.3c). Moreover, the off-target Tm may not always be sufficiently distant from the on-target Tm for specific priming to occur (Figure 2.3d). For the data in Figure 2.3d, ≈ 80% of the thermoalignments had an on-target Tm more than 10 °C above the off-target Tm.
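The comparison between on-target and off-target Tm can be expressed as a simple margin test. The sketch below is illustrative only: the function names are hypothetical, the Tm values are assumed to have been computed elsewhere (for example with a nearest-neighbor model), and the 10 °C default is borrowed from the margin discussed in the text rather than taken from ThermoAlign's source.

```python
from typing import Iterable


def misprime_margin(on_target_tm: float, off_target_tms: Iterable[float]) -> float:
    """Return the on-target Tm minus the highest off-target Tm (deg C).

    If there are no off-target sites, the margin is reported as infinity.
    """
    worst = max(off_target_tms, default=float("-inf"))
    return on_target_tm - worst


def is_specific(on_target_tm: float, off_target_tms: Iterable[float],
                min_margin: float = 10.0) -> bool:
    """True when every off-target Tm is at least min_margin below the on-target Tm."""
    return misprime_margin(on_target_tm, off_target_tms) >= min_margin
```

For instance, a primer with an on-target Tm of 68.0 °C and off-target thermoalignment Tm values of 40.0 and 55.5 °C has a margin of 12.5 °C and would pass a 10 °C threshold, whereas a single off-target site at 61.0 °C would cause it to fail.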

2.3.4 Primer Pair Selection (PPS)

From among the 877 oligonucleotides that were expected to stably hybridize and specifically prime on-target within the reference genome, 2,818 combinations of primer pairs were found to be compatible for standard PCR. The parameter settings used for PPS (Supplementary File S2 online[45]) included the requirement of a +10 °C difference in the Tm between the primer with the lower Tm of a given pair and the greatest off-target Tm for either of the two primers. Reducing this threshold can increase the discovery rate for primers, but one must consider a lower limit at which off-target amplicons would be likely to arise in actual PCR. When set to +6 °C, the number of primer pairs selected by the PPS module for the 24 kb region increased to 4,189. Adjusting this threshold along with the upper limit in the Tm range used for UOD can also increase the discovery rate. Widening the Tm range by 5 °C (a change from 64–74 °C to 62–77 °C) while keeping a +10 °C maximum misprime difference led to the identification of 4,103 primer pairs via the UOD → PSE → PPS pipeline.

With the 877 primers from above, a directed graph method was used to identify the minimum number of primer pairs (shortest path) providing the maximum amount of coverage for the targeted region. The amplicon size range setting was a critical factor in the amount of coverage that could be achieved for the region examined here (Supplementary Table A.2). Smaller amplicon size ranges led to relatively low coverage and the largest size ranges (≥ 15 kb) led to no coverage. Maximum coverage was achieved for amplicon sizes between 5 and 15 kb. However, recalling that the A/T-end filter resulted in the loss of over one thousand primers, excluding this filter increased the expected coverage from a maximum of 61.8% (with the filter) to 88.7% (without the filter).
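The idea of a shortest path through a directed graph of overlapping amplicons can be sketched as follows. This is a simplified illustration and not ThermoAlign's implementation: amplicons are reduced to hypothetical (start, end) intervals, an edge joins two amplicons when the second overlaps the first and extends further right, and a breadth-first search returns the fewest amplicons that span the whole target. Unlike the method described above, the sketch returns `None` when full coverage is impossible instead of reporting the best partial coverage.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Amplicon = Tuple[int, int]  # (start, end), inclusive coordinates


def min_tiling_path(amplicons: List[Amplicon], target_start: int, target_end: int,
                    min_overlap: int = 1) -> Optional[List[Amplicon]]:
    """Return the fewest overlapping amplicons spanning the target, or None."""
    # Directed edge a -> b when b starts after a, overlaps a by at least
    # min_overlap bases, and ends beyond the end of a.
    edges: Dict[Amplicon, List[Amplicon]] = {
        a: [b for b in amplicons
            if a[0] < b[0] <= a[1] - min_overlap + 1 and b[1] > a[1]]
        for a in amplicons
    }
    # Start from every amplicon that covers the first base of the target.
    sources = [a for a in amplicons if a[0] <= target_start <= a[1]]
    queue = deque((a, [a]) for a in sources)
    visited = set(sources)
    while queue:
        node, path = queue.popleft()
        if node[1] >= target_end:       # this amplicon reaches the target end
            return path
        for nxt in edges[node]:
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None
```

Because breadth-first search explores paths in order of length, the first path that reaches the end of the target uses the minimum number of amplicons, and therefore the minimum number of primer pairs.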

[Figure 2.3: (a.1) example full-length primers 5’-CGACACCTACCTGCAATGAT-3’ and 5’-ACGTCTCAATCGTGCAGTCA-3’; (a.2) BLASTn HSP alignments; (a.3) thermoalignments after end-filling or gap correction and end-filling; (b) Tm (°C) of ungapped and gapped BLASTn alignments versus thermoalignments; (c) off-target Tm (°C) versus number of mismatches; (d) on-target Tm minus off-target Tm (°C) versus number of mismatches.]

Figure 2.3: Creation and characteristics of thermoalignments compared to local alignments and the number of mismatches. (a.1) Examples of full-length primer sequences. (a.2) The top-ranking BLASTn high-scoring segment pair (HSP) alignment for two off-target sequences (bottom strand) is processed into a (a.3) thermoalignment by end-filling (ungapped BLASTn) or removing gaps and end-filling (gapped BLASTn) the original BLASTn HSP alignment. (b) For 877 candidate primers outputted by the UOD module for the 24 kb region described in the text, the Tm was calculated for each top-ranking BLASTn HSP alignment and the corresponding thermoalignment. (c) Using the subset of thermoalignments formed from ungapped BLASTn HSPs (n=169,404 alignments), the plot shows the relationship between the off-target Tm for thermoalignments compared to the total number of mismatches. (d) Using the same subset of data in (c) the plot shows the difference between the on-target Tm and the off-target Tm of thermoalignments compared to the total number of mismatches.
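The end-filling step for an ungapped local alignment, as described in the caption above, can be sketched in a few lines of Python. The function names and coordinate conventions are assumptions made for illustration, the off-target site is represented by the genomic strand in the same sense as the primer, and neither gap removal for gapped alignments nor the Tm calculation is included.

```python
def end_fill(primer: str, template: str, q_start: int, s_start: int) -> str:
    """Return the full-length template window aligned to the primer.

    primer   : primer sequence, 5'->3'
    template : genomic sequence in the same sense as the primer
    q_start  : 0-based primer index where the local alignment begins
    s_start  : 0-based template index where the local alignment begins
    """
    window_start = s_start - q_start
    window_end = window_start + len(primer)
    if window_start < 0 or window_end > len(template):
        raise ValueError("template too short to end-fill this alignment")
    return template[window_start:window_end]


def match_string(primer: str, site: str) -> str:
    """Mark identical positions with '|' and mismatches with 'x'."""
    return "".join("|" if p == s else "x" for p, s in zip(primer, site))


# First example primer of Figure 2.3a; the local alignment covers primer
# bases 4..17, and the template is padded with two arbitrary flanking bases.
primer = "CGACACCTACCTGCAATGAT"
template = "TT" + "AGTTACTTACCTGCAATGGG" + "CC"
site = end_fill(primer, template, q_start=4, s_start=6)
pattern = match_string(primer, site)
```

In this example the local alignment contains a single mismatch, while the end-filled, full-length alignment exposes six, which is why the Tm of a truncated local alignment can differ substantially from that of its thermoalignment.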

2.3.5 Empirical evaluation of priming specificity

Primer pairs designed by ThermoAlign were tested using standardized conditions for standard PCR and long-range PCR (see Methods section). For standard PCR, 46 primer pairs associated with seven genes located on six chromosomes of maize were tested (Supplementary File S3 online[45]). Using the directed graph analysis method in PPS, these primer pairs were designed to tile from 1 kb upstream to 1 kb downstream of each gene. Thirty-eight of these primer pairs produced an amplicon, and for each of these a single specific amplicon of the expected size was observed; no off-target amplicons were detected for any of the primer pairs tested [Figure 2.4a shows the results for 29 of the 46 primer pairs, two of which failed to amplify (6:7,048,348 and 7:128,406,874)].

ThermoAlign integrates MultiPLX[70] while customizing the input and output to obtain two groups of multiplexes compatible with the amplification of overlapping tiling paths. For each of the seven targeted genes tested using standard PCR, under the "normal" stringency settings, MultiPLX identified multiplexes with no more than two primer pairs (the possibility existed to combine as many as five primer pairs). The amplicons produced using multiplex PCR were generally consistent with those produced by each primer pair individually (one primer pair in one multiplex set failed in the multiplex reaction) and no alternative amplicons were observed (Figure 2.4a).

For five of the seven genes mentioned above, 0.1–5.0 kb amplicon tiling paths were designed for each gene (independent of the standard PCR primers; Supplementary File S3 online[45]) and tested using long-range PCR. For each gene, two primer pairs were identified that would tile across the full length of the gene (one exception: with the settings used, primer pairs were not found that would cover the entire P450 gene on chromosome 3).
Similar to standard PCR, not all ten primer pairs produced an amplicon, but the seven that did produced a single prominent amplicon of the expected size (Figure 2.4b). For long-range PCR amplicons that failed to amplify or had low yield, more of the reaction product was loaded into the gel in order to normalize the products for comparison. This showed some background smearing that was greater than

[Figure 2.4 image: gel panels. (a) Standard PCR products for individual primer pairs and for multiplex sets 1 and 2 targeting the Remorin, NBP, P450, LHT1 and GST genes; size markers at 100, 200, 350 and 800 bp. (b) Long-range PCR products without (-) and with (+) betaine for the same genes, plus a negative control; size markers at 3 and 10 kb.]

Figure 2.4: Empirical evaluation of ThermoAlign using standard and long-range PCR to tile five genes. The products from two additional genes amplified with standard PCR but not long-range PCR (as described in the text) are not shown. Labels indicate the chromosome number of the target locus, the forward primer start site and the expected size of the product. Details on each primer are available in Supplementary File S3 online[45]. (a) Standard PCR products were quantified without post-PCR purification and ≈ 7.5 ng were loaded into each well. For the two reactions that had no product, a volume equivalent to the average volume loaded was used. Multiplex reactions composed of primer pairs corresponding to each set for a given gene were loaded alongside the primers belonging to that same set. (b) Long-range PCR products from reactions without (-) and with (+) betaine. PCR products were quantified without post-PCR purification and ≈ 29 ng were loaded into each well. For the three reactions that had no product, the same volume used for the corresponding betaine reaction was loaded into the well. For the negative control, the maximum volume used among all of the reactions was loaded into the well. The negative control was composed of master mix, primer pair TA 1 25390617 27 F and TA 1 25395472 24 R with no DNA template. Lanes with background smearing were associated with reactions that required a greater volume of the product be loaded to achieve a standardized amount of product across lanes.

the negative control, suggesting some amount of random off-target amplification had occurred during long-range PCR (potentially due to mega-primer amplification[59]).

Because primer design depends on a reference genome and some standard PCR and long-range PCR reactions failed to produce amplicons, we questioned whether these failed reactions were due to inaccuracies in the sequence assembly. Under the assumption that long-range PCR primer pairs that produced a specific amplicon of the expected size were an indication of an accurate assembly, the production of standard PCR amplicons nested within these long-range PCR amplicons was used to address this question. Twenty-nine standard PCR primer pairs were designed to the same five genes tested by long-range PCR and were nested within at least one of the expected long-range PCR amplicons. Some of the standard PCR amplicons were nested within overlapping sections of two long-range PCR amplicons where one of the primer pairs produced a product and the other did not. Excluding those standard PCR primer pairs from consideration, one out of 21 of the standard PCR primer pairs failed to produce an amplicon in regions where an amplicon was produced by long-range PCR. In contrast, all five standard PCR primer pairs produced an amplicon in regions where no amplicon was produced by long-range PCR. The association between successful and failed reactions for standard and long-range PCR was not significant (Fisher’s Exact Test, p = 1.0), which failed to implicate the assembly as the cause for PCR failures.

Considering the possibility that the sequence composition of the primers or amplification target affected success[59], betaine was added to the reactions, which resulted in all 10 long-range PCR primer pairs producing a specific product of the expected size (Figure 2.4b).
Subsequent testing of standard PCR primer pairs with betaine resulted in the recovery of a single specific amplicon for the two nested pairs that had failed in the absence of betaine, in addition to four primer pairs from the original set of 46. However, these products amplified poorly (data not shown). Additional PCR optimization could potentially improve the amplification efficiency of these primer pairs. The amplicons of reactions that were recovered by the addition of betaine for long-range

PCR had a higher median GC content by 3.2 percentage points for the primers and 7.8 percentage points for the expected amplicons (B73 reference genome sequence). Similarly, standard PCR reactions that were recovered using betaine (considering all 46 of the primer pairs) had a higher median GC content for the primers (3.7 percentage points) and expected amplicons (19.7 percentage points).

To confirm that the amplicons corresponded to the targeted loci, nine of the 10 long-range PCR products in Figure 2.4b were pooled and sequenced by single molecule, real-time sequencing. A primer-based clustering and sequence analysis approach generated exactly nine consensus sequences with perfect identity to the expected sequence (Table 2.1; Supplementary File S4[45]).

Table 2.1: Results from BLASTn alignment of error corrected PacBio consensus sequences to the B73 genome.

Amplicon label               Chr.  Start position  Stop position  Alignment length (bp)  Percent identity  CCS depth(1)
1: 25,390,617 (4,856 bp)     1     25,390,617      25,395,472     4,856                  100%              56
1: 25,391,952 (4,589 bp)     1     25,391,952      25,396,540     4,589                  100%              2172
2: 37,562,840 (4,602 bp)     2     37,562,840      37,567,441     4,602                  100%              3988
2: 37,564,580 (4,954 bp)     2     37,564,580      37,569,533     4,954                  100%              1384
3: 33,503,980 (3,330 bp)     3     33,503,980      33,507,309     3,330                  100%              2718
3: 33,506,037 (3,126 bp)     3     33,506,037      33,509,162     3,126                  100%              732
6: 7,045,710 (4,786 bp)      6     7,045,710       7,050,495      4,786                  100%              2031
6: 7,047,707 (4,343 bp)      6     7,047,707       7,052,049      4,343                  100%              254
7: 128,406,316 (4,933 bp)    7     128,406,316     128,411,248    4,933                  100%              4727

(1) Circular consensus sequence (CCS) is PacBio terminology for consensus sequences formed from subreads per zero-mode waveguide. The sequence used for BLASTn was the consensus of all CCSs for a given amplicon.

2.4 Discussion

Through the current era of high-throughput sequencing, the design of oligonucleotides has remained a fundamental need for genome science research and applications. Models for designing hybridization and priming oligonucleotides can shed light on the understanding of oligonucleotide-template interactions while providing the design

products for probing genomes. Two key aspects are whether hybridization and priming will occur and whether these events are specific to the targeted sequence. In this study, specificity of primer pair amplification was addressed in the development of ThermoAlign for the automated design of priming oligonucleotides. ThermoAlign assesses specificity by computing the nearest-neighbor thermodynamics of hybridization[126] for full-length oligonucleotide-template interactions across the genome. Computationally empirical results suggested there is no sufficient substitute for this. Approaches that rely on sub-sequences (characterized here based on local BLASTn alignments; Figure 2.3b) of a primer-template interaction or the number of mismatches in a thermoalignment (Figure 2.3d) for specificity evaluation are expected to have a greater risk of selecting primers that would hybridize at off-target sites and a lower discovery rate for identifying suitable primers.

Examination of genome-wide thermoalignments for primers in a 24 kb region of the maize genome showed that the proportion of repeat content in the context of a primer space (25 bp) was greater than that based on transposable element repeat masking. This emphasizes the importance of a genome-aware primer specificity evaluation process. Applying this approach to the repetitive genome of maize resulted in a single and specific amplicon of expected size for every primer pair that produced a product by standard PCR, without any modifications to the reaction chemistry (Figure 2.4). Long-range PCR gave essentially the same results. Although some of those reactions produced some random background amplification, a single prominent amplicon was observed, and the expected sequence was obtained by sequencing. We have continued to observe near-perfect specificity for an additional 50 functional primer pairs tested on multiple samples of maize (data not shown).
Our results demonstrate the accuracy of the nearest-neighbor thermodynamics model and the principles embedded in ThermoAlign for primer design. The results of this study lend support to prior studies that have developed chemistry-conditioned nearest-neighbor thermodynamic models to estimate energy parameters for DNA duplexes that may include non-complementary bases[127, 146, 1, 2, 3, 4, 114].

ThermoAlign is an elaborate pipeline that includes a suite of modules with a number of unique elements (Supplementary Table A.1). This includes the creation of thermoalignments to obtain accurate estimates of hybridization energies and a graph analysis approach for automated identification of amplicon tiling paths for resequencing studies. Therefore, ThermoAlign can be used not only to design primer pairs for standard, single-locus PCR applications, but also for resequencing studies of targeted regions that cannot be captured with a single primer pair. For population studies, a polymorphism-aware feature allows primers to be designed to monomorphic regions of the genome. However, this will be limited by the extent of prior information pertaining to the population(s) under study. To increase the efficiency of laboratory efforts to generate amplicons for resequencing, primer pairs constituting amplicon tiling paths are passed to the software MultiPLX[70] and organized into multiplex compatible sets (Figure 2.5). Finally, as a matter of extensibility, the graph analysis method used in ThermoAlign can be re-structured in terms of how edge weights are computed. For instance, as it is sometimes desirable to obtain more similarly sized amplicons for sequencing, we foresee that edge weights could be adjusted according to a probability function centered on a desired amplicon size.

This study demonstrates that nearest-neighbor estimation of Tm on thermoalignments is a robust solution for target-specific primer design. ThermoAlign requires minimal computational power and quickly identifies suitable primer pairs for routine applications targeting a single locus of a few hundred to a few thousand base pairs. This makes the tool useful for routine PCR applications. However, de novo searches using BLASTn may be suboptimal and PSE requires greater computational resources for large target regions with run parameterizations having broad search settings. This is despite having optimized the settings for BLASTn such that the search was limited to obtaining sufficient information for PSE (see Supplementary information and Supplementary Figure A.2). Clocking the speed of each component showed that, not surprisingly, the genome-wide search for non-exact matches in the PSE module is a key component to focus on for reducing run times (Supplementary File S1 online[45]).

Therefore, other alignment search methods or a database of pre-computed Tm [e.g., 152] could be considered for reducing the run time. Furthermore, the BLASTn approach may not identify all mispriming sites for a given primer because it does not allow for mismatches with the sequence used to seed an alignment. Ideally, the Tm of every possible thermoalignment across the genome would be computed for every candidate primer considered.

Another critical aspect that remains to be addressed by primer design tools is determining the likelihood for successful amplification. This has received recent attention, leading to insights into how this might be predicted in silico from oligonucleotide and template features[13, 8, 59]. Methods for predicting amplification success could be readily integrated into ThermoAlign. By comparing the success of standard PCR primer pairs nested within long-range PCR primer pairs, we observed that assembly errors were not likely to have caused failures to amplify the regions tested in this study. The conversion of failed reactions into successful ones following the addition of betaine to PCR reactions suggested secondary structures as the likely cause of failure. Betaine has been shown to improve the amplification of GC rich regions by reducing the formation of secondary structures, making the template more accessible to DNA polymerase[121, 57]. Indeed, as has been widely documented, this study found an association between high GC content of primers and the target sequences with failure to amplify by PCR in the absence of betaine.

ThermoAlign is founded upon thermodynamically relevant sequence alignments or thermoalignments for estimating hybridization energies of oligonucleotides across a sequenced genome. This results in the design of target-specific primer pairs for PCR and tiled amplicon resequencing. We have demonstrated that the selection of such primer pairs is fully automatable and was applicable to a species (Zea mays subsp.
mays) with ≈ 85% repeat content in its genome. The model used to extend local alignments was developed with priming in mind, but an alternative model could be developed for hybridization probes, which would require evaluation of the Tm for bidirectional gap-adjusted end-filling models (if the BLASTn algorithm is maintained).

The modular design of ThermoAlign and its open source license provide an extensible resource for continued development that can be adapted for a range of other applications including the design of hybridization oligonucleotides and gene editing techniques. The applications for ThermoAlign currently range from standard PCR assays to long-molecule resequencing. Multiplexed libraries containing pools of separate, long-range PCR products, such as those tiling across a target region and amplified on multiple individuals, can be sequenced on current parallel sequencing platforms, placing ThermoAlign as a tool for studies using high-throughput sequencing.

2.5 Methods

2.5.1 ThermoAlign pipeline

A thermodynamics-based, genome-wide specificity evaluation approach for designing priming oligonucleotides was developed. The approach consists of four stages: (1) selection of a targeted sequence in a genome and masking of variant sites; (2) extraction of all possible oligonucleotides from the target region that are unique and identification of those expected to stably hybridize with the DNA template; (3) evaluation of the mispriming potential for candidate primers; and (4) identification of all suitable primer pairs including a minimum set that provides maximum coverage of the targeted region (Figure 2.1). Users can modify run parameters corresponding to each of the ThermoAlign modules in a single settings file (parameters.py; see the GitHub repository referenced in the Availability section below). The ThermoAlign pipeline was developed in Python 2.7.5 and evaluated on a machine equipped with dual 64-bit 2.1 GHz AMD Opteron 6172 processors, using 5 of its cores, running Fedora release 19 and 23.

2.5.2 Target region selection (TRS)

The TRS module uses external files and custom pre-processing scripts to create a BLAST database and modify the targeted sequence for subsequent filters in the pipeline. For a given FASTA formatted reference genome sequence, the sequence for a target region is extracted based on user provided coordinates.
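The coordinate-based extraction described above can be sketched as follows. This is a minimal illustration, not the TRS implementation: the function names `read_fasta` and `extract_target` are hypothetical, and 1-based inclusive coordinates are assumed.

```python
import io


def read_fasta(handle):
    """Parse FASTA text into a {sequence_id: sequence} dictionary."""
    records, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            # Keep only the identifier (text up to the first space)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def extract_target(records, chrom, start, stop):
    """Return the target sequence for 1-based, inclusive coordinates (assumed)."""
    seq = records[chrom]
    if not 1 <= start <= stop <= len(seq):
        raise ValueError("coordinates fall outside the sequence")
    return seq[start - 1:stop]


# Toy record standing in for a reference pseudomolecule
demo = io.StringIO(">chr1 toy record\nACGTACGTAC\nGGGTTTAAAC\n")
records = read_fasta(demo)
target = extract_target(records, "chr1", 9, 13)
```

A whole-genome run would stream each pseudomolecule file rather than hold every sequence in memory; the dictionary is used here only to keep the sketch short.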

External files: ThermoAlign minimally requires a FASTA formatted file for a reference genome. This is used to extract the targeted sequence for designing oligonucleotides and to create a BLAST database to search for off-target priming sites. Standard variant call format (vcf v4.0 and v4.1; https://vcftools.github.io/index.html) files based on the same coordinate system of the reference genome sequence may be optionally used for polymorphism-aware primer design.

BLAST database: The sequences for each pseudomolecule are required as separate FASTA files using a specific naming convention. Formatting details are provided at the GitHub repository cited in the Availability section below.

Pre-processing vcf files (optional): A custom script (vcf conversion.py) is used to convert input vcf files into files compatible with ThermoAlign scripts containing only the genome coordinates and corresponding variant information. These pre-processed, smaller sized data files enable more efficient parsing and annotation of polymorphic bases in the target region.

Polymorphism annotation (optional): Nucleotides at variant sites are converted to “n” [single nucleotide polymorphism (SNP)] or “N” (indels), which are later used as indicator characters to filter primers associated with these sites. For this study, maize HapMap3[24] was used for prior variant information. Although indels may comprise multiple nucleotides, maize HapMap3 encodes indels as a single nucleotide position without information on the start and stop positions or length of the indel. Therefore, the individual nucleotide at the corresponding position for each indel in HapMap3 was converted to an N.
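A minimal sketch of this masking step is given below. The function name `mask_variants` and its arguments are hypothetical; variant positions are assumed to be 1-based genome coordinates, and variants outside the target region are ignored.

```python
def mask_variants(target, region_start, snp_positions, indel_positions):
    """Mask SNP sites as 'n' and indel sites as 'N' within a target sequence.

    region_start is the genome coordinate of the first base of the target
    (1-based, assumed). Indels are applied last, so a site listed as both a
    SNP and an indel ends up as 'N' (a choice made for this sketch).
    """
    bases = list(target)
    marks = [(p, "n") for p in snp_positions] + [(p, "N") for p in indel_positions]
    for pos, symbol in marks:
        offset = pos - region_start
        if 0 <= offset < len(bases):
            bases[offset] = symbol
    return "".join(bases)


# Toy target starting at genome coordinate 101
masked = mask_variants("ACGTACGTAC", 101, snp_positions=[103], indel_positions=[108, 500])
```

Because every masked site stays a single character, coordinates in the masked sequence remain aligned with the reference.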

Assembly gap consideration: The maize reference genome contains strings of 100 and 1000 “N” characters to represent gaps for sequences from individual clones and between assembly scaffolds, respectively, which could be distinguished from SNP and indel locations encoded as a single “n” or “N.” In terms of coordinates, every “n” and “N” character was treated as having a position (as was the case for the original reference genome coordinates).

2.5.3 Unique oligonucleotide design (UOD)

The UOD algorithm was designed to extract candidate oligonucleotides from the target region and characterize their properties relevant to hybridization and priming. Across a user-defined range of oligonucleotide lengths, and using a 1 bp sliding window for each length, every possible oligonucleotide sequence was extracted from both the positive and negative strands. A series of filters were applied to the entire set of oligonucleotides in the following order of operations:

Flanking primers (user-specified): UOD can optionally design primers that flank a given target locus. Forward primers whose 3’-end position is less than or equal to the locus start and reverse primers whose 3’-end position is greater than or equal to the locus stop are designed. Flanking primers are designed according to a specified flanking size.

Base composition (polymorphism filter optional): Oligonucleotides containing any character other than A, C, G or T were filtered (this removed primers containing any “n” or “N” character). This ensured that oligonucleotides corresponding to polymorphic sites and gaps in the sequence were avoided. However, the filtering of oligonucleotides with polymorphic sites is optional.

GC content (user-specified): Oligonucleotides that fell outside a defined range of percent GC were filtered. In this study, a range of 40–60% GC was used.

A/T-end (optional): For primers, a single G or C nucleotide at the 3’-end helps to stabilize binding near the site of extension, which can reduce the possibility of “breathing” and improves priming efficiency[63]. Therefore, primers ending in an A or a T base could optionally be filtered. In this study, the filter was applied.

GC clamp (optional): Oligonucleotides with more than three G or C nucleotides within the last five bases were filtered. This should help minimize mispriming at GC-rich binding sites[63].

Repeating nucleotides (user-specified): Oligonucleotides with four or more mononucleotide or dinucleotide repeats were filtered.
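The sliding-window extraction and the sequence-only filters described so far (base composition, GC content, A/T-end, GC clamp and repeats) can be sketched as below. The function names are hypothetical, the thresholds mirror the values reported for this study, and the repeat rule is one possible reading of "four or more mononucleotide or dinucleotide repeats" (four consecutive copies of the same base or of the same two-base unit).

```python
import re

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement, preserving 'n'/'N' indicator characters."""
    return seq.translate(COMPLEMENT)[::-1]


def candidate_oligos(target, min_len, max_len):
    """Yield (strand, start, oligo) for every 1 bp window on both strands.

    start is the 0-based offset of the window on the positive strand.
    """
    for length in range(min_len, max_len + 1):
        for start in range(len(target) - length + 1):
            window = target[start:start + length]
            yield "+", start, window
            yield "-", start, reverse_complement(window)


def passes_sequence_filters(oligo, gc_min=40.0, gc_max=60.0):
    """Apply the sequence-only filters in the order listed in the text."""
    if re.fullmatch(r"[ACGT]+", oligo) is None:       # base composition
        return False
    gc = 100.0 * sum(base in "GC" for base in oligo) / len(oligo)
    if not gc_min <= gc <= gc_max:                     # GC content
        return False
    if oligo[-1] in "AT":                              # A/T-end
        return False
    if sum(base in "GC" for base in oligo[-5:]) > 3:   # GC clamp
        return False
    if re.search(r"([ACGT])\1{3}", oligo):             # mononucleotide run
        return False
    if re.search(r"([ACGT]{2})\1{3}", oligo):          # dinucleotide repeat
        return False
    return True


kept = [oligo for _, _, oligo in candidate_oligos("ACGTACGTACGTACGTACGC", 18, 20)
        if passes_sequence_filters(oligo)]
```

The Tm, self-interaction and BLASTn-based filters that follow in the pipeline are not part of this sketch.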

Melting temperature (user-specified): An important component of the pipeline is estimation of melting temperature (Tm). The nearest-neighbor estimator was used as described by SantaLucia and Hicks[127]:

$$\Delta H^{\circ}_{37}(\mathrm{total}) = \Delta H^{\circ}_{37}(\mathrm{initiation}) + \sum \Delta H^{\circ}_{37}(\mathrm{stack}) + \Delta H^{\circ}_{37}(\mathrm{AT\ terminal}) \tag{2.1}$$

$$\Delta S^{\circ}_{37}(\mathrm{total}) = \Delta S^{\circ}_{37}(\mathrm{initiation}) + \sum \Delta S^{\circ}_{37}(\mathrm{stack}) + \Delta S^{\circ}_{37}(\mathrm{AT\ terminal}) \tag{2.2}$$

$$T_m = \frac{\Delta H^{\circ} \times 1000}{\Delta S^{\circ} + R \ln\left(\frac{C_t}{4}\right)} - 273.15 \tag{2.3}$$

where ∆H° and ∆S° are enthalpy (kcal mol⁻¹) and entropy (e.u.), respectively. We refer the reader to SantaLucia and Hicks, 2004 for further details on equations, units and definitions. To obtain Tm for different reaction conditions, the Na+ equivalent of monovalent and divalent ions was calculated according to Von Ahsen, 2001, and used to adjust ∆S°(total) [equation 5 in SantaLucia and Hicks, 2004], where:

$$\mathrm{Na}_{eq} = \mathrm{Na}^{+} + \mathrm{K}^{+} + \frac{\mathrm{Tris}}{2} + 120\sqrt{\mathrm{Mg}^{2+} - \mathrm{dNTPs}} \tag{2.4}$$

however, if dNTPs ≥ Mg2+, then

$$\mathrm{Na}_{eq} = \mathrm{Na}^{+} + \mathrm{K}^{+} + \frac{\mathrm{Tris}}{2} \tag{2.5}$$

$$\mathrm{Adjusted}\ \Delta S^{\circ}(\mathrm{total}) = \Delta S^{\circ}_{37}(\mathrm{total}) + 0.368\,(\mathrm{oligo\ length} - 1) \times \log[\mathrm{Na}_{eq}] \tag{2.6}$$

Primers that fell outside the specified range in Tm were filtered. In this study, a Tm range of 64–74 °C was used for primer design.
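As a worked illustration of equations 2.3 to 2.6, the sketch below computes a salt-adjusted Tm from total enthalpy and entropy values that are assumed to have been summed already (equations 2.1 and 2.2). Several points are assumptions rather than statements from the text: ion concentrations are taken in mM for the Na+ equivalent, the logarithm in equation 2.6 is taken as a natural log of the molar Na+ equivalent, R is 1.987 cal K⁻¹ mol⁻¹, and the input values are hypothetical.

```python
import math

R = 1.987  # gas constant in cal/(K*mol)


def sodium_equivalent(na, k, tris, mg, dntps):
    """Equations 2.4 and 2.5; all concentrations assumed to be in mM."""
    na_eq = na + k + tris / 2.0
    if dntps < mg:
        # The divalent term applies only when Mg2+ exceeds the dNTPs
        na_eq += 120.0 * math.sqrt(mg - dntps)
    return na_eq


def adjusted_entropy(delta_s, oligo_length, na_eq_mm):
    """Equation 2.6; assumes a natural log of the molar Na+ equivalent."""
    return delta_s + 0.368 * (oligo_length - 1) * math.log(na_eq_mm / 1000.0)


def melting_temperature(delta_h, delta_s, ct):
    """Equation 2.3; delta_h in kcal/mol, delta_s in cal/(K*mol), ct in mol/L."""
    return delta_h * 1000.0 / (delta_s + R * math.log(ct / 4.0)) - 273.15


# Hypothetical totals for a 25-mer; these numbers are not from the study
na_eq = sodium_equivalent(na=0.0, k=50.0, tris=20.0, mg=1.5, dntps=0.8)
delta_s = adjusted_entropy(-520.0, 25, na_eq)
tm = melting_temperature(-195.0, delta_s, ct=2.5e-7)
in_range = 64.0 <= tm <= 74.0  # Tm filter range used in this study
```

The nearest-neighbor stacking parameters themselves are left out on purpose; they should be taken from the cited parameter tables.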

Self-primer interactions (user-specified): The Primer3 thal.c library functions[144] were used to compute hairpin and self-dimer Tm values (functions: primer3.calcHairpin and primer3.calcHomodimer; Primer3 release 2.3.7). A primer was filtered if either of these Tm values was within 20 °C of the corresponding primer-DNA hybridization Tm.

Exact match (user-specified): The remaining candidate primers were screened by BLASTn (ThermoAlign default optimized settings: evalue: 30000; word_size: smallest primer length; gapopen: 2; gapextend: 2; reward: 1; penalty: -3; max_target_seqs: 2; max_hsps: 2) to determine if at least one off-target exact match occurs within the reference genome, in which case the primer was filtered. Note: setting max_target_seqs to 2 and max_hsps to 2 provides the quickest search that ensures at least one match other than the target priming site can be found. Setting either of these to less than 2 could cause ThermoAlign to fail in identifying off-target exact matches.

2.5.4 Priming specificity evaluation (PSE)

Following the identification of unique primer sequences with adequate properties for amplification, a more computationally intensive algorithm, PSE, was used to evaluate specificity of the remaining primers as follows:

Nearly identical match: BLASTn (ThermoAlign default optimized settings: evalue: 30000; word_size: 7; gapopen: 2; gapextend: 2; reward: 1; penalty: -1; perc_identity: 70; max_target_seqs: 13; max_hsps: 20) was used to identify off-target sites across the reference genome that partially match candidate primers, considering that these could serve as mispriming sites. The parameters used for PSE required some consideration. A low stringency e-value cutoff was used to account for the higher probability that short sequences would have matches in the database. The word size value is an important parameter. A perfect match across the word size is required to seed a BLASTn alignment, but the PSE module seeks to find imperfectly matching sequences. The smallest word size available for BLASTn was used to provide the maximum sensitivity for discovery of potential off-target sites. Gapopen and gapextend scores of 2, a reward score of 1 and a penalty score of -1 were used to support the alignment of divergent sequences[139]. Selecting alignments with a percent identity of 70% or more ensures a thorough evaluation of potential off-target priming sites. Figure 2.2 shows that primers with ≥ 9 mismatches with off-target sites have a Tm that is at least 10 °C less than that of the on-target site. The maximum number of target sequences (max_target_seqs) corresponded to the total number of FASTA sequences in the reference genome database (for maize this included 13 sequences comprising nuclear chromosomes 1–10, the mitochondrial genome, the chloroplast genome and a tethered set of unmapped sequences). To avoid extensive searches, based on a preliminary PSE algorithm it was found that 20 high-scoring segment pairs (HSPs) per target sequence identified most sites within the maize genome that candidate primers might hybridize with (i.e. this was a sufficient search space to provide information for filtering a primer; data not shown). While this max_hsps setting was used for designing primers in this study, an updated analysis suggested a max_hsps setting of 50 may be more appropriate for future work (Supplementary Figure A.1). The low complexity filter was disabled to allow for an exhaustive search.

Thermoalignments: Since the alignment algorithm of BLASTn maximizes the scores on local alignments, it can return truncated alignments between the primer and template sequences. For each truncated alignment, given the alignment start and stop coordinate and the length of the primer, an end-filling algorithm was developed that extends the alignment to the full length of the primer sequence (Figure 2.3). This was done in a manner that maintains strandedness. Given the strand-specific 5’ and 3’ coordinates to which the full-length primer is expected to align, the corresponding region of the genomic sequence was extracted. In some cases, BLASTn may introduce gaps in the alignment while real primer-template interactions are gapless. To address this, strand-specific algorithms were developed to create full-length, ungapped primer-template alignments relevant to priming oligonucleotides (Figure 2.3). Gaps introduced in the primer sequence (deletions in the primer relative to the template) were corrected by shifting the bases in the primer that were 5’ of the gap toward the 3’-end of the primer, and then discarding the resulting overhanging nucleotides at the 5’ end of the template sequence. Gaps introduced in the template sequence (insertions in the primer relative to the template) were corrected by shifting the bases in the template that were 3’ of the gap toward the 5’ end of the template, and then extending the template sequence corresponding to the resulting overhang at the 5’ end of the primer. This algorithm was designed to maximize the number of complementary bases toward the 3’ end of the primer, i.e., the site of primer extension; a different approach could be considered if ThermoAlign is extended for the identification of hybridization probes. The Tm was calculated for each ungapped full-length primer-template alignment which we refer to as a thermoalignment.
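The end-filling idea for the ungapped case can be sketched as follows. This is an illustrative reconstruction, not the ThermoAlign code: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive in the style of BLAST tabular output (with the subject start greater than the subject end for minus-strand hits), and the gap-shifting step for gapped HSPs is not implemented.

```python
def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def end_fill(primer_len, qstart, qend, sstart, send):
    """Extend a truncated, ungapped HSP to the full primer length.

    Returns (low, high, strand) as 1-based inclusive genome coordinates.
    """
    missing_5prime = qstart - 1          # primer bases before the HSP
    missing_3prime = primer_len - qend   # primer bases after the HSP
    if sstart <= send:                   # plus-strand hit
        return sstart - missing_5prime, send + missing_3prime, "+"
    return send - missing_3prime, sstart + missing_5prime, "-"


def thermoalignment(primer, genome, qstart, qend, sstart, send):
    """Return (site, mismatches), or None if the site runs off the sequence.

    The site is the genomic sequence in primer orientation, so a mismatch
    is any position where the primer and the site differ.
    """
    low, high, strand = end_fill(len(primer), qstart, qend, sstart, send)
    if low < 1 or high > len(genome):
        return None
    site = genome[low - 1:high]
    if strand == "-":
        site = reverse_complement(site)
    mismatches = sum(p != s for p, s in zip(primer, site))
    return site, mismatches


# Toy example: the HSP covers only primer positions 5-10
genome = "TTTTACGTACGTACGGGG"
primer = "ACGAACGTAC"
plus_hit = thermoalignment(primer, genome, 5, 10, 9, 14)
minus_hit = thermoalignment(primer, reverse_complement(genome), 5, 10, 10, 5)
```

In both toy hits the end-filled site exposes the mismatch at primer position 4 that lies outside the local alignment, which is the information the Tm calculation on a thermoalignment needs.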

Misprime Tm: Using the same formulae outlined under the Melting temperature calculations, the Tm was calculated for each thermoalignment between a primer and its off-target template locations. All of these thermoalignments contain at least one mismatch (oligonucleotides with exact matches at off-target sites were filtered in UOD). Therefore, parameters for G/T mismatches[1], G/A mismatches[2], A/C mismatches[3], C/T mismatches[4] and A/A, C/C, G/G, T/T mismatches[114] were used for the computation of Tm.

Maximum misprime Tm (user-defined): For each candidate primer, among all potential off-target hybridization sites in the genome, the maximum Tm and the 90th percentile Tm of thermoalignments are identified. Primers are selected if their on-target Tm exceeds their maximum off-target Tm by a user-defined threshold. In this study, a +10 °C threshold was used.

3’-end mismatch: Primers with a 3’-terminal mismatch at a binding site are not expected to produce off-target synthesis products under standard PCR conditions[76]. Therefore, off-target thermoalignments that included a mismatch at the 3’-terminal nucleotide were not considered in PSE. In addition, the proportion of thermoalignments with any number of mismatches within the last five bases at the 3’-end was recorded in output but not used for filtering.
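The specificity decision described in this section can be sketched as below. The function names and the data layout are hypothetical; off-target Tm values are assumed to have been computed already, sites are assumed to be given in primer orientation, and treating a difference exactly equal to the threshold as passing is a choice made for this sketch.

```python
def has_3prime_mismatch(primer, site):
    """True if the 3'-terminal base of the primer mismatches the site."""
    return primer[-1] != site[-1]


def passes_pse(primer, on_target_tm, off_targets, threshold=10.0):
    """Keep a primer if its on-target Tm clears every considered off-target Tm.

    off_targets is a list of (site_sequence, tm) thermoalignments. Sites with
    a 3'-terminal mismatch are ignored, as described in the text.
    """
    considered = [tm for site, tm in off_targets
                  if not has_3prime_mismatch(primer, site)]
    if not considered:
        return True
    return on_target_tm - max(considered) >= threshold


# Hypothetical values: the second site is ignored (3'-terminal mismatch)
primer = "ACGTACGTAC"
off_targets = [("ACGAACGTAC", 52.0), ("ACGTACGTAT", 63.0)]
keep = passes_pse(primer, 66.0, off_targets)
```

The 90th percentile Tm mentioned above is a reporting statistic and is left out of this sketch.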

2.5.5 Primer pair selection (PPS)

Once primers were determined by the TRS → UOD → PSE pipeline, primer pairs were identified using the following criteria:

Amplicon length (user-specified): Amplicon length was calculated as the distance between the 5’-ends of the forward and reverse primers. Primer pairs that would produce an expected on-target amplicon within a user-defined range in length were maintained. In this study, different settings were used for different experiments.

Assembly gaps and indels (including an optional gap filter): The presence of assembly gaps (represented as 100 or 1,000 nucleotide runs of an N) and indels (HapMap info) in the expected amplicon for a primer pair is indicated in the output. For this study, primer pairs expected to produce amplicons containing gaps in the reference genome were filtered while amplicons containing indels were not.

Tm difference (user-specified):

Primer pairs differing by 10 °C or less in Tm were selected. Although it seems inappropriate to choose such a large difference in Tm between primer pairs, so long as the Tm of each primer is sufficiently greater than the maximum misprime Tm of either primer then the primer pair should still be capable of locus specific amplification, using an annealing temperature appropriate for the primer with the lowest Tm.

Maximum misprime Tm (user-specified):

The lowest primer Tm in a given primer pair must be greater than the maximum misprime Tm of both primers in the pair by a user-defined threshold value. In this study, a 10 °C threshold was used.

Heterodimer ∆G (user-specified): The Primer3-py API 0.5.0 (https://github.com/libnano/primer3-py) was used to compute heterodimer ∆G values. Primer pairs with less than −2000 kcal/mol ∆G values were filtered in this study. We note that during our development phase, using what had been the most current version of Primer3 (v. 2.3.6), we discovered that the entropy and enthalpy estimates for dimers depended upon the input order of the two primer sequences. In v. 2.3.7 the results are no longer a function of the primer input order.

Amplicon features: The amplicon sequence corresponding to each primer pair was summarized in terms of features such as GC content, presence of indels and assembly gaps.
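The amplicon length, Tm difference and maximum misprime Tm criteria can be combined in a short sketch. The function names and the dictionary keys (`pos5`, `tm`, `max_misprime_tm`) are hypothetical. The inclusive length count is an assumption, chosen because it reproduces the amplicon sizes listed in Table 2.1; the heterodimer ∆G criterion is omitted because it depends on an external library.

```python
def amplicon_length(forward_5prime, reverse_5prime):
    """Distance between the primer 5' ends, counted inclusively (assumed)."""
    return reverse_5prime - forward_5prime + 1


def pair_passes(fwd, rev, min_len, max_len, max_tm_diff=10.0, misprime_margin=10.0):
    """Apply the length, Tm difference and misprime Tm criteria to a primer pair.

    fwd and rev are dicts with keys 'pos5' (5' genome coordinate), 'tm' and
    'max_misprime_tm'; these key names are hypothetical.
    """
    length = amplicon_length(fwd["pos5"], rev["pos5"])
    if not min_len <= length <= max_len:
        return False
    if abs(fwd["tm"] - rev["tm"]) > max_tm_diff:
        return False
    lowest_tm = min(fwd["tm"], rev["tm"])
    highest_misprime = max(fwd["max_misprime_tm"], rev["max_misprime_tm"])
    return lowest_tm - highest_misprime >= misprime_margin


# Coordinates from the first row of Table 2.1; Tm values are hypothetical
fwd = {"pos5": 25390617, "tm": 68.0, "max_misprime_tm": 55.0}
rev = {"pos5": 25395472, "tm": 71.5, "max_misprime_tm": 57.0}
ok = pair_passes(fwd, rev, min_len=3000, max_len=5000)
```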

[Figure 2.5 image: a target region with candidate amplicon nodes labeled by primer pairs (x1, y1) through (x14, y14), connected by directed edges, with the expected coverage shown beneath.]

Figure 2.5: Directed graph analysis approach used in ThermoAlign to identify primer pairs for an amplicon tiling path at a targeted region. xi and yi represent forward and reverse primers respectively. Nodes correspond to expected amplicons (depicted as colored bars), which are ordered by their base pair coordinates. Directed edges (lines with an arrowhead) are drawn between overlapping amplicons, which in this case forms two subgraphs. Edge weights (depicted by the thickness of the lines representing edges; here, thicker lines represent the smaller weighted edges that would be selected) are computed based on the cumulative coverage and amount of overlap. Dijkstra’s shortest path algorithm is applied to each subgraph to identify the primer pairs comprising the minimum tiling path of amplicons (black colored edges). As portrayed by the blue and yellow coloring, two groups of non-overlapping primer pairs are designated for multiplex consideration. The expected coverage for this example minimum tiling path is indicated by black (covered) and white (gaps) fill.
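The tiling-path search on a single subgraph can be sketched with the standard library alone (the study itself used NetworkX). Several details here are assumptions made for illustration: amplicons are given as (start, end) coordinates, the cumulative coverage of two overlapping amplicons is taken as the length of their union and the overlap as the length of their intersection, edge weights are rescaled by subtracting each from the largest one, and the path is searched from the leftmost amplicon to the one reaching furthest right. The function names are hypothetical.

```python
import heapq


def build_edges(amplicons):
    """Directed edges i -> j between overlapping amplicons, scored as C - O."""
    edges = {}
    for i, (si, ei) in enumerate(amplicons):
        for j, (sj, ej) in enumerate(amplicons):
            # j must start after i starts, overlap i, and end after i ends
            if si < sj <= ei < ej:
                coverage = ej - si + 1
                overlap = ei - sj + 1
                edges[(i, j)] = coverage - overlap
    return edges


def minimum_tiling_path(amplicons):
    """Dijkstra's shortest path over rescaled weights (largest score minus score)."""
    edges = build_edges(amplicons)
    source = min(range(len(amplicons)), key=lambda i: amplicons[i][0])
    target = max(range(len(amplicons)), key=lambda i: amplicons[i][1])
    top = max(edges.values(), default=0)
    adjacency = {}
    for (i, j), score in edges.items():
        adjacency.setdefault(i, []).append((j, top - score))
    best, previous, queue = {source: 0}, {}, [(0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, weight in adjacency.get(node, []):
            candidate = dist + weight
            if candidate < best.get(nxt, float("inf")):
                best[nxt] = candidate
                previous[nxt] = node
                heapq.heappush(queue, (candidate, nxt))
    if target not in best:
        return None  # the amplicons do not form one connected tiling path
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1]


# Toy amplicon coordinates (hypothetical)
amplicons = [(1, 1000), (600, 1500), (900, 2000), (1400, 2600), (1900, 3000)]
path = minimum_tiling_path(amplicons)
set_1, set_2 = path[0::2], path[1::2]  # alternate amplicons for multiplexing
```

Because every rescaled weight is non-negative and each extra edge adds to the path length, shorter chains of widely spaced amplicons are favored. Splitting the path into alternating sets keeps neighboring amplicons apart, although it only guarantees non-overlapping sets when amplicons two steps apart on the path do not overlap.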

Minimum primer pair set identification: In situations where amplification of a long section of the genome is the objective, a set of primer pairs producing a tiling path of overlapping amplicons may be designed. We employed a graph analysis strategy to identify the minimum set of primers to achieve maximum coverage of the target region, along with grouping of the resulting primers into multiplex compatible sets. For a given target region, a directed graph was constructed for all possible primer pairs (Figure 2.5). Each primer pair (potential amplicon) was represented as a node. Directed edges between any two nodes (Ni, Nj) were made for overlapping amplicons, given that the node Ni starts (5’ position of forward primer) at a position before that of the overlapping node Nj, and Nj ends (5’ position of reverse primer) after the corresponding end position of Ni. To handle instances in which a target region contained gaps because no pass-filter primer pairs could be identified to produce a continuous set of overlapping amplicons, subgraphs were created for the target region.

Edge weights (Ei) were calculated based on the cumulative sequence coverage for each pair of overlapping amplicons (Cjk), after penalizing for the amount of overlap between them (Ojk): Ei = Cjk − Ojk. Each original edge weight was then subtracted from the highest possible edge weight for a given graph (Mg) to adjust the scale, such that the original higher scored edges were transformed into the lower scored edges for a given subgraph: Wi = Mg − Ei. From the directed graph with adjusted edge weights (Wi), the shortest path was identified based on Dijkstra’s shortest path algorithm[42, 29]. The creation of the graph and identification of the shortest paths was implemented using the NetworkX 1.11[53] Python package. The directed graph based approach identifies the minimum number of primers that gives the maximum coverage of a targeted genomic region in a computationally efficient manner.

There was one exception in which we abandoned graph analysis. When two or more amplicons of different sizes covered the same genomic segment and started or ended with the same primer and did not overlap with other segments (i.e. amplicons were not offset on both sides), the single amplicon that gave the maximum coverage was used. Together, these strategies ensured that the minimum set of primer pairs with the maximum coverage for a given target region was identified.

The resulting primer pairs were split into two sets where each set contained primer pairs that would not produce overlapping amplicons (Figure 2.5). The separate sets were evaluated as to whether they could form multiplex sets using MultiPLX[70]. Attempting to multiplex primer pairs that produce overlapping amplicons in one set would lead to PCR production of undesirable or small amplicons from primers at the ends of overlapping segments; due to their smaller size their production could dominate the reaction.
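The graph construction and edge weighting described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the ThermoAlign implementation: it uses only the Python standard library rather than NetworkX (where `networkx.dijkstra_path` would perform the same search), it handles a single connected subgraph, and it assumes the path runs from the amplicon with the smallest start coordinate to the one with the largest end coordinate. The function names and example coordinates are hypothetical.

```python
import heapq


def build_edges(amplicons):
    """Return {(i, j): E} for overlapping amplicons where i starts and ends before j.

    amplicons -- list of (start, end) inclusive base pair coordinates
    E = C - O, with C the bases covered by the pair and O the bases they share.
    """
    raw = {}
    for i, (s_i, e_i) in enumerate(amplicons):
        for j, (s_j, e_j) in enumerate(amplicons):
            if s_i < s_j <= e_i < e_j:
                coverage = e_j - s_i + 1
                overlap = e_i - s_j + 1
                raw[(i, j)] = coverage - overlap
    return raw


def tiling_path(amplicons):
    """Shortest path over adjusted weights W = Mg - E (Dijkstra); None if disconnected."""
    raw = build_edges(amplicons)
    max_e = max(raw.values(), default=0)  # Mg, the highest raw edge weight
    adj = {}
    for (i, j), e in raw.items():
        adj.setdefault(i, []).append((j, max_e - e))
    # Assumed end points: leftmost-starting and rightmost-ending amplicons.
    source = min(range(len(amplicons)), key=lambda i: amplicons[i][0])
    target = max(range(len(amplicons)), key=lambda i: amplicons[i][1])
    dist, prev, heap, done = {source: 0}, {}, [(0, source)], set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


example = [(0, 1000), (200, 1200), (900, 2000), (1100, 2100), (1900, 3000)]
print(tiling_path(example))  # [0, 2, 4]
```

In this example three of the five candidate amplicons cover the whole region, because edges joining amplicons with high combined coverage and little overlap receive the smallest adjusted weights.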

ThermoAlign output

ThermoAlign produces the following primary output files: (i) a text file summarizing pipeline settings and results; (ii) a text file for ordering primers; (iii) a text file with information about the primer pairs and resulting amplicon features; (iv) a text file with the minimum primer pairs giving maximum coverage that are also grouped into compatible multiplex sets; and (v) .bed formatted files of the primers for further analysis and visualization. For further details on these and additional output files, see the README documentation at the GitHub repository.

2.5.6 PCR validation

Standard PCR was performed using Taq DNA polymerase with standard Taq (Mg-free) Buffer (New England Biolabs, Ipswich, MA; Cat#M0320). Reactions with a final concentration of 1X standard Taq (Mg-free) reaction buffer, 1.5 mM MgCl2, 200 µM dNTPs, 0.1 µM forward primer, 0.1 µM reverse primer and 0.625 U Taq were combined with 20 ng of gDNA and brought to a volume of 25 µL with molecular-grade water. Some reactions were supplemented to include a final concentration of 1 M betaine. Amplification was carried out on an Eppendorf Mastercycler pro S with the following conditions: 5 min at 95 °C; 30 cycles of 20 s at 94 °C, 20 s at 4 °C below the minimum Tm for the primer pair as computed by ThermoAlign and 1 min at 68 °C; 5 min at 68 °C; and a hold at 4 °C. PCR products were assayed by electrophoresis in a 3% TBE gel (Bio-Rad, Hercules, CA; Cat#161-3040) and imaged using a FluorChem HD2 with AlphaView SA v3.4.0.

Long-range PCR was performed using GoTaq Long PCR Master Mix (Promega, Madison, WI; Cat#M4021). Reactions with a final concentration of 1X GoTaq Long PCR Master Mix, 0.1 µM forward primer and 0.1 µM reverse primer were combined with 25 ng gDNA and brought to a volume of 16 µL with molecular-grade water. Some reactions were supplemented to include a final concentration of 1 M betaine. Amplification was carried out in an Eppendorf Mastercycler pro S with the following conditions: 2 min at 94 °C; 30 cycles of 30 s at 94 °C, 5.5 min at 65 °C (following a 1 min/kb guideline based on the longest expected amplicon in the set); 10 min at 72 °C; and a hold at 4 °C. PCR products were assayed via electrophoresis in a 1% TBE gel (Bio-Rad; Cat#161-3038) and imaged using a FluorChem HD2 with AlphaView SA v3.4.0.

2.5.7 SMRT sequencing and analysis of long-range PCR amplicons

Amplicons for each long-range PCR primer pair were cleaned with SPRIselect (Beckman Coulter, Brea, CA; Cat#: B23319) at a ratio of 1:1 (beads:sample). Fragment analysis confirming the expected size amplicons was performed using a bioanalyzer (data not shown). Cleaned samples were then quantified using the PicoGreen dsDNA quantitation assay (ThermoFisher, Waltham, MA; Cat#P7589) and pooled so that each amplicon was at 0.4 nM. Additionally, the same set of samples was pooled following[158] to account for differences in molar mass. For each pool, approximately 900 ng of pooled amplicons was submitted to the University of Delaware’s Sequencing and Genotyping Center (http://www1.udel.edu/dnasequence/Site/Home.html) for SMRTbell library construction and sequencing on one SMRTcell per library using the PacBio RSII.

PacBio SMRT raw reads were clustered by the primer sequences. The number of circular consensus sequences per cluster was recorded. Each cluster of reads was error corrected using Quiver in the PacBio SMRT Analysis v2.3 LAA protocol to obtain its consensus sequence. These sequences were aligned to the B73 reference genome using BLASTn (Expect threshold: 1e−4, word size: 15, match/mismatch scores: 1,-2).
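Pooling amplicons of different lengths at an equal molar concentration requires converting each measured mass concentration to molarity. The sketch below shows one way to do this; it assumes the common approximation of 660 g/mol per base pair of double-stranded DNA, which the text does not state, and the function names and example values are hypothetical.

```python
AVG_BP_MASS = 660.0  # g/mol per base pair of dsDNA (assumed approximation)


def ng_per_ul_to_nM(conc_ng_per_ul, length_bp):
    """Convert a dsDNA mass concentration (ng/uL) to molarity (nM)."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * length_bp)


def dilution_factor(conc_ng_per_ul, length_bp, target_nM=0.4):
    """Fold dilution needed to bring an amplicon to the target molarity."""
    return ng_per_ul_to_nM(conc_ng_per_ul, length_bp) / target_nM


# Hypothetical 5,000 bp amplicon quantified at 3.3 ng/uL (about 1 nM),
# which would need roughly a 2.5-fold dilution to reach 0.4 nM.
print(round(ng_per_ul_to_nM(3.3, 5000), 3), round(dilution_factor(3.3, 5000), 3))
```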

2.6 Availability

ThermoAlign is released under a GNU GPLv3 open source license at https://github.com/drmaize/ThermoAlign and has been packaged as a distributable tool in Docker containers. The following Docker images are available: (i) TA 1.0.0 d is a general distributable version which requires user supplied files; (ii) TA 1.0.0 s is a sample run version containing a small set of sample files that can be used to test ThermoAlign; (iii) TA 1.0.0 Zm3 is a maize ready version. Instructions for using ThermoAlign and running these containers can be found at the GitHub repository.

2.7 Acknowledgments

This work was supported by the U.S. NSF Plant Genome Program IOS-1127076. We thank Dr. Karol Miaskiewicz at the Delaware Biotechnology Institute for assistance with using BIOMIX, a high performance computing cluster. BIOMIX was supported by Delaware INBRE grant NIH/NIGMS GM103446 and a number of investigators who have contributed nodes to the cluster. We are grateful to the Primer3 development team for rapidly fixing a bug in the Primer3 thal.c heterodimer function.

2.8 Author contributions statement

RW guided the study. FF and RW conceived the design principles for ThermoAlign. FF wrote the program code and created Docker containers. FF executed the computational analysis and program optimization, which iterated based on feedback from MD and RW. MD performed the laboratory validation experiments. FF, RW and MD wrote the manuscript.

2.9 Additional information

Competing financial interests: The authors declare no competing financial interests.

Chapter 3

CLUSTERING OF CIRCULAR CONSENSUS SEQUENCES: ACCURATE ERROR CORRECTION AND ASSEMBLY OF SINGLE MOLECULE REAL-TIME READS FROM MULTIPLEXED AMPLICON LIBRARIES

3.1 Abstract

Background

Targeted resequencing with high-throughput sequencing (HTS) platforms can be used to efficiently interrogate the genomes of large numbers of individuals. A critical challenge for research and applications using HTS data, especially from long-read platforms, is errors arising from technological limits and bioinformatic algorithms.

Results

A single molecule real-time (SMRT) sequencing-error correction and assembly pipeline, C3S-LAA, was developed for libraries of pooled amplicons. By uniquely leveraging the structure of SMRT sequence data (comprised of multiple low quality subreads from which higher quality circular consensus sequences are formed) to cluster raw reads, C3S-LAA produced accurate consensus sequences and assemblies of overlapping amplicons from single sample and multiplexed libraries. In contrast, despite read depths in excess of 100X per amplicon, the standard long amplicon analysis module from Pacific Biosciences generated unexpected numbers of amplicon sequences with substantial inaccuracies in the consensus sequences. A bootstrap analysis showed that the C3S-LAA pipeline per se was effective at removing bioinformatic sources of error, but in rare cases a read depth of nearly 400X was not sufficient to overcome minor but systematic errors inherent to amplification or sequencing.

Conclusions

C3S-LAA uses a novel processing algorithm for SMRT amplicon-sequence data that produces accurate consensus sequences and local sequence assemblies. The community standard long amplicon analysis module from Pacific Biosciences is prone to substantial errors that raise concerns about findings based on this pipeline. The method developed here removed this confounding bioinformatics source of error, allowing for the identification of limited instances of errors due to DNA amplification or sequencing.

3.2 Background

High-throughput sequencing (HTS) platforms have revolutionized the study of genomes and genomic variation. However, HTS platforms and analysis of HTS data generate errors in base calling[122]. Even perfectly accurate sequence reads may be improperly assembled or incorrectly aligned to a reference sequence when read lengths are too short. Such errors can lead to incorrect results and misleading conclusions in a variety of settings ranging from scientific investigation[16] to clinical diagnostics[102].

Single molecule real-time (SMRT) sequencing by Pacific Biosciences (PacBio) generates long-read data, which, once error corrected [raw SMRT sequence data has a median error rate of ≈ 15%[124]], can help to produce complete de novo assemblies and accurate alignments to a reference genome. SMRT sequencing also exhibits relatively little sequence coverage bias, allowing regions of the genome with large differences in sequence complexity to be fully traversed[124]. Therefore, SMRT sequencing facilitates assembly, resequencing, haplotype phasing, characterization of isoforms and structural variation, etc., all of which are more prone to errors with "short-read" data[67].

For resequencing applications, SMRT sequencing of tiled amplicons allows kilobase or larger-scale target regions of a genome to be sequenced at great depth, providing the opportunity to generate highly accurate consensus assemblies[45]. In combination with molecular barcoding, sequencing of multiplexed amplicon libraries facilitates studies across broad biological disciplines[73, 25, 28, 154, 69]. However, such studies can be affected by confounding sources of errors arising during library preparation, sequencing and data analysis[38]. Isolating the sources and types of errors is crucial to progress in the development of sequencing technologies, sequence analysis methods and interpretation of sequence data.

Several computational pipelines have been developed for automated processing and analysis of amplicon sequence data produced on different HTS platforms, such as PyroNoise[72], mothur[72] and Long Amplicon Analysis (LAA)[111]. LAA is the standard pipeline for analysis of SMRT sequence data from amplicon libraries. LAA uses a “coarse clustering” approach to group raw reads according to pairwise similarity estimated from BLASR alignments[109]. The Quiver consensus calling framework[32] is then used to generate an error-corrected consensus sequence for each cluster. When we first used LAA to process amplicon sequences as part of a previous study[45], several of the consensus sequences output by LAA were incorrect. We found that clustering of high quality circular consensus sequences (i.e. clustering of CCS reads, which we refer to as C3S) to group the corresponding raw read data prior to performing analysis with Quiver recovered all of the expected sequences with high fidelity. Here, we investigated this further and present a new, open-source pipeline for processing tiled amplicon resequence data from multiplexed libraries.

3.3 Methods

3.3.1 Sequence data

PacBio sequence data (RS II chemistry P6/C4) from two amplicon libraries, a single sample library (SRX2880716) and a multiplex sample library (SRX3474979), were used for this study. SMRTbell libraries were constructed according to PacBio’s amplicon library protocol[110]. Sequencing was performed on a PacBio RS II instrument with one SMRT Cell used for each library, using P6/C4 chemistry with a 6-hour movie. SMRTbell library preparation and sequencing was carried out by the University of Delaware Sequencing and Genotyping Center (Newark, DE).

Sequence data from the single sample library was from a previous study[45] and was comprised of nine amplicons, which were amplified from the maize inbred line B73. The maximum expected amplicon size was 4,954 bp, such that the raw reads, which had a mean length of 23,794 bp, consisted of an average of approximately nine subreads per amplicon (Table B.1).

The multiplex sample library produced for this study was comprised of a tiling path of six amplicons spanning ≈ 23,000 bp of the maize genome, which were amplified from six different maize inbred lines (B73, CML277, Hp301, Mo17, P39, Tx303). The primer pairs used for the multiplex library had distinct symmetric barcodes for each sample and amplicon, along with a shared 5’ GTTAG padding sequence (Table B.2). The maximum expected amplicon size was 7,752 bp and the raw reads consisted of an average of nine subreads per amplicon (Supplementary Table B.1).

3.3.2 Clustering of circular consensus sequences for long amplicon analysis

A cluster and assembly pipeline was developed in which raw reads are clustered based on circular consensus sequences (CCS) prior to running error correction with Quiver. We refer to this pipeline as C3S-LAA, for Clustering of Circular Consensus Sequence (C3S) Long Amplicon Analysis.

Clustering is performed as follows. The reads of insert protocol in SMRT Portal is used to generate CCS reads (run settings: minimum of 1 subread at 90% CCS read accuracy). These are higher quality sequences formed from the corresponding raw reads based on their multiple subreads. Therefore, the CCS reads are used to cluster the data. Clustering is performed by a simple match function that identifies CCS reads containing both the forward and reverse primer sequences for each amplicon (considering the sense and antisense primer sequences). From this, a list of CCS read identifiers belonging to each amplicon cluster is produced. This list is then used to subset the corresponding raw reads, using the whitelist option in LAA, such that Quiver-based consensus calling[32] occurs on only the raw reads belonging to a given amplicon-specific cluster. Consensus sequences formed from clusters comprised of fewer than 100 subreads were eliminated when all available reads were used; this setting was adjusted to 0 for evaluation of accuracy (see below). The pipeline can be used to perform one-level clustering for non-barcoded amplicon libraries or two-level clustering for barcoded amplicon libraries. Because barcodes or other sequences may precede the primer sequence and may vary in length, the primer search space was designed as a user input parameter, which, for this study, was set to 21 bases at both ends of each sequence.

The pipeline proceeds to an assembly step (Figure 3.1). The C3S-LAA consensus sequences are automatically merged into a Multi-FASTA format file and assembled (per barcode if barcoding is used) using Minimus based on the overlap-layout-consensus paradigm[135]. To trim extraneous sequences (e.g. padding or barcodes) for downstream analysis, a user input parameter (trim bp) is specified to remove the corresponding number of bases from each end of the consensus sequences while writing them to the FASTA file. The assembly is then carried out among all trimmed consensus sequences, and mismatches between any two overlapping sequences are represented as Ns in the assembly sequence. Where there are more than two overlapping sequences with mismatches, the most frequent base will be represented in the assembly. In the case of barcoded sequencing libraries, the assembly is carried out separately for each barcode.
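The one-level clustering step described above can be sketched as follows. This is an illustration under stated assumptions rather than the C3S-LAA source code: exact substring matching, the function names and the toy sequences are all hypothetical, and a real run would read CCS sequences from the reads of insert output.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def cluster_ccs_reads(ccs_reads, primer_pairs, search_space=21):
    """Assign CCS read identifiers to amplicon clusters by primer matching.

    ccs_reads    -- dict of read identifier -> CCS sequence
    primer_pairs -- dict of amplicon name -> (forward primer, reverse primer)
    search_space -- number of bases inspected at each end of a read

    A read joins a cluster only if both primers are found, in either the
    sense or the antisense orientation.
    """
    clusters = {name: [] for name in primer_pairs}
    for read_id, seq in ccs_reads.items():
        head, tail = seq[:search_space], seq[-search_space:]
        for name, (fwd, rev) in primer_pairs.items():
            sense = fwd in head and reverse_complement(rev) in tail
            antisense = rev in head and reverse_complement(fwd) in tail
            if sense or antisense:
                clusters[name].append(read_id)
                break
    return clusters


# Toy data: one amplicon read in each orientation plus an unrelated read.
fwd, rev = "ACGTACGTAC", "TTGACCGGTA"
sense_read = fwd + "GATTACA" * 10 + reverse_complement(rev)
reads = {"r1": sense_read, "r2": reverse_complement(sense_read), "r3": "G" * 60}
print(cluster_ccs_reads(reads, {"amp1": (fwd, rev)}))
```

The resulting lists of read identifiers play the role of the whitelist passed to LAA, so that consensus calling only sees raw reads from one amplicon at a time.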

3.3.3 Evaluating the accuracy of C3S-LAA

First, evaluation of the performance of LAA was carried out on the sequence data from the single sample library. LAA v1 was run on SMRT Portal, using the following settings: minimum subread length: 2000 bp; maximum number of subreads: 2000 (default); ignore primer sequence when clustering: 0 bp (default); trim ends of sequences: 0 bp (default); provide only the most supported sequences: 0 (0=disabled filter; default); coarse cluster subreads by gene family: yes (default); phase alleles: no; split results from each barcode into independent output files: no; barcode: no. The minimum subread length was reduced from the default value of 3000 bp to 2000 bp since the sequencing library had one amplicon of 3330 bp, such that any partial sequences may also be considered. Phasing of alleles was not used since the amplicons were produced from homozygous individuals (inbred lines). The resulting LAA consensus sequences were aligned using BLASTn[5] to the B73 v3 reference genome of maize[130]. YASS[107] was used to align the incorrect (partial matches) consensus sequences formed by LAA to their expected amplicon sequence using the following score parameter settings: Scoring matrix (match: +5, transversion: -4, transition: -3, composition bias correction: -4), Gap costs (opening: -16, extension: -4), E-value threshold: 10 and X-drop threshold: 30.

[Figure 3.1 graphic: (a) raw reads, CCS reads, CCS-based clusters, error correction (consensus sequences) and assembly; (b) C3S-LAA pipeline flowchart linking the run parameters file, barcode and primer sequences, reads of insert (CCS), .h5 raw reads, clustering read IDs, Quiver and Minimus.]

Figure 3.1: Graphical representation of the C3S-LAA process and pipeline. (a) Raw reads comprised of multiple subreads are depicted for three different amplicons [green, fuchsia and blue boxes; different shades of color are used to portray variable subread sequence qualities (darker shading portrays higher quality)]. Subreads are separated by a shared adapter sequence (grey boxes). The higher quality CCS read for each raw read is used to cluster the corresponding raw reads into CCS-based cluster groups. Error correction is performed per CCS-based cluster, producing high quality consensus sequences, followed by assembly of any overlapping consensus sequences. (b) A single run parameters file is used by all components of the pipeline. The grey highlighted rectangles represent two main steps of C3S-LAA. (i) Using the CCS reads generated by the SMRT analysis reads of insert protocol, C3S clusters the raw reads according to each barcode-primer pair combination, producing files of read identifiers to whitelist the corresponding raw reads. (ii) Raw read clusters are passed to Quiver to generate amplicon-specific consensus sequences, which are then passed to Minimus for sequence assembly. Rectangles with folded corners represent single files or multiple files (depicted as stacks of files) and those with rounded edges represent scripts and tools. Arrows indicate output files that are generated. Connecting lines with dots at one end depict input files, with the dot corresponding to the source data for the connected script or tool.

The same sequence data from above was also processed using C3S-LAA. In addition, the relationship between subread depth and the accuracy of consensus sequence construction as well as assembly was evaluated for the output from C3S-LAA. For each amplicon, sample sets of 1, 2, 3, ..., 40 CCS read identifiers were randomly selected with replacement from among the eight amplicons. Using the corresponding raw reads of each CCS read set, C3S-LAA was used to create consensus sequences per amplicon cluster and assemblies from the corresponding group of consensus sequences belonging to a sampled CCS read set. This was repeated 25 times, such that a total of 8000 consensus sequences were generated in addition to the corresponding Minimus assemblies. BLASTn alignments with the B73 v3 reference genome were used to determine the map location and compute the percent identity for each of the amplicon-specific consensus sequences and corresponding assembly sequences. From these alignments, the numbers of mismatches and gaps were also recorded to characterize the types of errors present in the sequences. For each cluster of sequences, the number of subreads used to derive the consensus sequence was recorded.
The minimum number of subread counts for a set of overlapping amplicons that produced an assembly was used as the number of subreads for that assembly.

The performance of LAA versus C3S-LAA was also evaluated using the multiplex library. LAA was used to generate consensus sequences under the same settings indicated above, with an additional selection of the barcode demultiplexing option. Since the amplicons were barcoded using PacBio’s standard barcodes, the default preset in SMRT Portal pointing to PacBio barcodes with padding in the reference directory was used. C3S-LAA was used to perform two-level clustering of the CCS reads, using the primer and barcode sequence information. A search space of 121 bp was used for identifying barcode-primer sequences in order to cluster the CCS reads. Since one of the lines (B73) has a reference genome available, LAA and C3S-LAA consensus sequences associated with B73 were aligned using BLASTn to the B73 v3 reference genome. The C3S-LAA assembly for B73 was also compared to this reference.
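The bootstrap design used for the single sample library (sample depths of 1 to 40 CCS reads, drawn with replacement, 25 replicates, eight amplicons) can be sketched as below. The generator name, the seed and the placeholder read identifiers are assumptions for illustration; each sampled set would then be passed through consensus calling and assembly.

```python
import random


def bootstrap_ccs_sets(cluster_ids, max_depth=40, replicates=25, seed=0):
    """Yield (amplicon, depth, replicate, sampled CCS identifiers).

    cluster_ids -- dict of amplicon name -> list of CCS read identifiers
    Sampling is with replacement, so an identifier may occur more than once.
    """
    rng = random.Random(seed)
    for amplicon, ids in cluster_ids.items():
        for depth in range(1, max_depth + 1):
            for rep in range(replicates):
                yield amplicon, depth, rep, rng.choices(ids, k=depth)


# Placeholder clusters: eight amplicons with 50 CCS identifiers each.
clusters = {f"amp{a}": [f"amp{a}_ccs{i}" for i in range(50)] for a in range(8)}
samples = list(bootstrap_ccs_sets(clusters))
print(len(samples))  # 8 amplicons x 40 depths x 25 replicates = 8000
```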

[Figure 3.2 plots: (a) consensus accuracy (%) versus number of raw subreads; (b) assembly accuracy (%) versus number of subreads. Accuracy axes span 99.70 to 100.00; subread axes span 21 to 600.]

Figure 3.2: Sequence accuracy as a function of subread depth. (a) Accuracy of consensus and (b) assembly sequences. Data from all the amplicons were pooled together to evaluate the consensus calling accuracy as a function of depth of coverage of SMRT raw reads. The vertical line shows the minimum read depth of the consensus sequences used for assemblies.

3.4 Results and Discussion

PacBio SMRT sequence data from a pooled library of long-range PCR amplicons was previously produced and used for this study[45]. The data was processed with PacBio’s LAA protocol under default settings using all of the raw read data. This did not produce a consensus sequence for all of the expected amplicons and included several artifactual sequences (Table 3.1). Dot-plot visualization of the alignments of these seven incorrect consensus sequences with the reference sequence indicated the presence of spurious inverted duplications in six out of the seven consensus sequences and a truncated alignment for the one remaining sequence (Figure B.1).

The above errors led us to inspect LAA, which uses a custom algorithm based on the raw read data to pre-cluster similar sequences for analysis by Quiver. However, PacBio raw reads are notoriously poor in quality and overlapping or repetitive sequences could be present, either of which may cause errors in cluster formation (our speculation based on results presented below). Moreover, the primer sequences used to produce the amplicons in a library are not considered. Therefore, we hypothesized that using the higher quality CCS reads to group the corresponding raw reads into amplicon-specific clusters based on the expected primer sequences would improve the consensus sequence analysis. A bioinformatic pipeline, C3S-LAA, was developed to carry out such clustering (Figure 3.1). The divide and conquer principle of C3S-LAA simplifies the determination of consensus sequences by only operating on raw reads for which there is a high degree of certainty that they were derived from the same origin.

Table 3.1: Comparison of LAA and C3S-LAA consensus sequences for B73 amplicons.

Library type¹   Method     Number of             Complete match²    Truncated match    Partial match
                           consensus sequences   (100% identity)    (100% identity)    (<100% identity)
Single          LAA        14                    7                  1                  6
Single          C3S-LAA    9                     9                  0                  0
Multiplex       LAA        8                     4                  1                  3
Multiplex       C3S-LAA    6                     5                  0                  1

¹ The single library had nine expected consensus sequences, whereas the multiplex library had six expected consensus sequences.

² For the multiplex sample library, the B73 v3 assembly contained a gap relative to one of the five amplicon sequences. This gap was filled in the latest B73 v4 release.

Indeed, our results indicated that C3S-LAA rectified the errors generated by the standard LAA protocol. For a typical use case, where all the reads from a sequencing library are used, C3S-LAA could resolve and accurately call the consensus sequence for every amplicon in the single sample library with no extraneous sequences (Table 3.1). In contrast, more than half of the consensus sequences generated by LAA had truncated or partial matches to the reference genome, and LAA could only fully resolve six of the nine amplicons in the single sample library. Based on these results, we recommend C3S-LAA for analysis of PacBio amplicon sequence data. Moreover, the C3S concept may be used in other situations where some portion of the sequences are known in advance.

For tiled amplicon resequencing, C3S-LAA can also be used to assemble any overlapping segments that may exist among the consensus sequences output for a given genotype (Figure 3.1). We bootstrapped the read data from the single sample library to examine the accuracy of the assemblies, as well as the underlying consensus sequences, produced by C3S-LAA as a function of subread depth. All C3S-LAA alignments of the resulting amplicon consensus and assembly sequences mapped to the expected target region. The minimum subread depth from which amplicon-clustered consensus sequences were output by LAA was 21, which corresponds to ≈ 2 CCS reads for our 3-5 kb amplicons (mean number of passes was 9.39; Supplementary Table B.1). Accuracy of the consensus sequences from bootstrapped samples of amplicon-clustered data was generally high, with accuracies ranging from 99.72 to 100% (Figure 3.2a). By extension, Minimus assemblies of these consensus sequences were similarly accurate (Figure 3.2b).

Despite an increase in accuracy with subread depth, not all of the bootstrap replications from high CCS sample depths included completely accurate consensus sequences or assemblies, and even at a subread depth of nearly 400X some bootstrap samples included imperfect assemblies (Figure 3.3). This was primarily due to a specific error in one locus (locus 6 7045710 7052049) that was observed among some of the bootstrapped samples at different CCS sampling depths (rare instances of locus 1 25390617 25396540 also showed minor inaccuracies). For instance, at a CCS sample depth of 40, the consensus sequence for locus 6 7045710 7052049 contained a 2 bp insertion in two of the 25 bootstrap samples. This same type of insertion error occurred for both loci and was embedded within homopolymeric regions of the sequences (Figure 3.4), indicating this was due to PCR or sequencing and not the pipeline per se. Among all the assemblies generated from bootstrapping (n = 3787), errors in the form of insertions, deletions and single nucleotides contributed to 66.7, 17.2 and 16.1% of the total errors respectively.

[Figure 3.3 bar chart: number of bootstraps with 100% accurate assemblies versus number of CCS reads in the bootstrap sample (0 to 40).]

Figure 3.3: Total number of accurate bootstrap assemblies versus CCS sample size. At each level of the CCS read depth sample (1-40), the figure shows the total number of bootstrapped assemblies that were 100% identical to the reference sequence. This was determined for the four target regions (25 bootstrap assemblies at each of 4 loci, giving rise to a maximum of 100 on the y-axis) formed from the consensus sequences among the eight overlapping amplicons.

For the multiplex sample library, the number of consensus sequences formed by LAA differed from the expected number for four of the six samples, and LAA generated consensus sequences for barcodes that were not used to make the library (Table 3.2). In contrast, C3S-LAA produced the exact number of expected consensus sequences per sample and per barcode. As with the single sample library, comparing the B73-barcode derived consensus sequences to the B73 reference genome showed substantial errors in the consensus sequences from LAA but not C3S-LAA, where LAA only resolved four of the amplicons from B73 (Table 3.1); the one C3S-LAA consensus sequence

49 a Query 4201 GAGAGAGAGAGAGAGAGAAATGGGAGATTGGAGAGCGAGCTAGGGAGATCGAGGAAGGTG 4260 |||||||||||||||| |||||||||||||||||||||||||||||||||||||||||| Sbjct 7047828 GAGAGAGAGAGAGAGA--AATGGGAGATTGGAGAGCGAGCTAGGGAGATCGAGGAAGGTG 7047769

b Query 1951 GCGATCATCGGGTACGAGCAGATTCAGATTTGACAGTTTTTTTTTTTTGT 2000 ||||||||||||||||||||||||||||||||||||||||||||||| || Sbjct 25394569 GCGATCATCGGGTACGAGCAGATTCAGATTTGACAGTTTTTTTTTTT-GT 25394521

Figure 3.4: Sequence alignment highlighting a recurring insertion error in some bootstrap samples. The alignment corresponds to the consensus sequence for a part of the amplicon from (a) locus 6 7045710 7052049 (Query) and (b) lo- cus 1 25390617 25396540 (Query) on maize chromosome 6 and 1 respectively compared to the B73 v3 reference sequence (Sbjct).

Table 3.2: The number of consensus sequences generated from the multiplex library, following barcode demultiplexing.

Sample¹   Barcode ID   LAA consensus   C3S-LAA consensus
B73       32           8               6
CML277    35           6               6
Hp301     31           6               6
Mo17      20           7               6
P39       2            7               6
Tx303     4            7               6
N/A       8            7               0
N/A       23           5               0
N/A       49           1               0
N/A       82           2               0
N/A       85           1               0
N/A       91           6               0
N/A       92           3               0

¹ No samples were associated with the N/A barcodes.

with an imperfect match was due to two separate 1 bp insertions embedded within homopolymeric regions. Another C3S-LAA consensus sequence aligned to the expected region of chromosome 1 with 100% identity but spanned a 531 bp assembly gap in the reference genome. This gap was filled in the latest release of the B73 reference genome [v4;[68]], which matched perfectly to the C3S-LAA consensus sequence. None of the other results changed when using the B73 v4 reference sequence. C3S-LAA also produced assemblies for each sample from the corresponding set of consensus sequences. The 23,300 bp C3S-LAA assembly for B73 differed from the expected B73 reference genome sequence only by the differences indicated above.

C3S-LAA clearly outperformed LAA for the data examined in this study. We have observed the same performance using C3S-LAA on data from a multiplex library of another 21 individuals amplified across multiple overlapping amplicons (not shown). Nevertheless, a potential limitation of C3S-LAA is that it requires that the CCS reads have both the barcode and primer sequences intact. Accuracy of CCS reads is a function of the number of subreads[143]. Thus, for very long amplicons where only one or a few subreads are sequenced, reliance on CCS reads will limit the number of sequences used from the available data. It may be possible to use a less stringent clustering algorithm; however, the fragment lengths of most amplicon libraries are expected to be well below the current and increasingly long read lengths of PacBio data, such that fairly accurate CCS reads would be available for clustering. C3S-LAA is expected to be applicable to SMRT sequence data of amplicon libraries or wherever flanking sequences can be predefined. C3S-LAA was developed as part of an extension to tiled amplicon resequencing projects facilitated by ThermoAlign[45] and is released under an open source license.
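To illustrate why intact barcode and primer sequences matter for this kind of clustering, the following minimal Python sketch bins reads by an exact barcode-plus-primer match at the start of each read. It is not the C3S-LAA implementation: the function name, the exact-match rule and the toy sequences are assumptions made for illustration, and a practical tool would also need to handle reverse-complemented reads and the opposite end of each amplicon.

```python
from collections import defaultdict


def bin_ccs_reads(reads, barcodes, primers):
    """Group reads by (sample, amplicon) using an exact barcode+primer prefix.

    reads    : dict of read_id -> sequence (hypothetical CCS reads)
    barcodes : dict of sample name -> barcode sequence
    primers  : dict of amplicon name -> forward primer sequence
    Reads whose prefix matches no barcode+primer pair are returned as
    unassigned, which mirrors the limitation discussed in the text.
    """
    bins = defaultdict(list)
    unassigned = []
    for read_id, seq in reads.items():
        key = None
        for sample, barcode in barcodes.items():
            for amplicon, primer in primers.items():
                if seq.startswith(barcode + primer):
                    key = (sample, amplicon)
        if key is None:
            unassigned.append(read_id)
        else:
            bins[key].append(read_id)
    return dict(bins), unassigned


# Toy data: all sequences are invented for illustration only.
barcodes = {"B73": "ACGT", "Tx303": "TGCA"}
primers = {"amplicon_1": "GGAT", "amplicon_2": "CCTA"}
reads = {
    "r1": "ACGTGGATTTTT",  # B73 barcode + amplicon_1 primer
    "r2": "TGCACCTAAAAA",  # Tx303 barcode + amplicon_2 primer
    "r3": "ACGTGGATCCCC",  # B73 barcode + amplicon_1 primer
    "r4": "ACGAGGATTTTT",  # barcode damaged by a single error
}
bins, unassigned = bin_ccs_reads(reads, barcodes, primers)
```

In this toy example a single error in the barcode of `r4` is enough to exclude the read, which is the trade-off between stringency and data usage described above.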

3.5 Conclusion

This study shows that CCS-facilitated clustering of raw reads vastly improves the analysis of SMRT sequence data. This method directs error correction and consensus sequence analysis to be performed only on sequences derived from the same

amplicon and sample, leading to accurate consensus sequences and local assemblies. The community standard LAA module could not resolve all of the expected amplicons in a library and generated several spurious consensus sequences during barcode demultiplexing and sequence clustering. LAA v1 uses BLASR[109] for pairwise alignment of all reads, which are then clustered based on their similarity using a Markov Model (R. Lleras, Pacific Biosciences of California, Inc., pers. comm.). The underlying principle of LAA and C3S-LAA is essentially the same (use clustering to group reads from which consensus sequences should be formed), yet only C3S-LAA produced correct output, which indicates that the clustering algorithm of LAA is prone to error. This release of C3S-LAA provides users with a more accurate processing pipeline for SMRT sequence data, which addresses a critical gap in the analysis of amplicon sequence data.

3.6 Availability

C3S-LAA is released under an MIT open source license at: https://github.com/drmaize/C3S-LAA

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

RJW guided the study. FF and RJW conceived the design principles for C3S-LAA. MDD and SBD produced the sequence data. FF developed the code and executed the computational analysis and program optimization, which were iterated based on feedback from RJW. FF and RJW wrote the manuscript.

Acknowledgements

This work was supported by the U.S. NSF Plant Program IOS-1127076. We thank Dr. Karol Miaskiewicz at the Delaware Biotechnology Institute for assistance with using BIOMIX, a high performance computing cluster. BIOMIX was supported by Delaware INBRE grant NIH/NIGMS GM103446 and a number of investigators who have contributed nodes to the cluster.

Chapter 4

RESEQUENCING OF A QUANTITATIVE DISEASE RESISTANCE LOCUS IN MAIZE PROVIDES BENCHMARK DATA AND INSIGHT INTO THE SPECTRUM OF SEQUENCE VARIATION AMONG INBRED LINES

4.1 Introduction

Understanding how genomic variation influences phenotypes, and in turn is affected by selection, is a central goal of genetics. For quantitative traits, the predominant form of natural variation, an individual’s genetic merit is diffused across many regions of the genome, such that each of the separate causal sequence variants is inherently difficult to pinpoint. The development of high-throughput biology has helped pave the way towards understanding that different types of sequence variants can give rise to quantitative phenotypes, including single nucleotide polymorphisms[151], presence-absence variants and structural variants[35]. Therefore, characterizing sequence variation can provide insights pertinent to understanding the nature of functional variation, predicting disease risk and enabling fast-paced genomic selection in breeding programs[89].

Our current understanding of the genotype to phenotype relationship has arisen from advances in DNA sequencing technologies, the emergence of which has spurred consolidated sequencing initiatives that have produced large-scale repositories of publicly available genomic data (e.g.[10, 148]). While these data make many investigations possible, careful consideration of the data quality is critical to reaching stable conclusions for specific questions. For instance, using data from the 1000 Genomes Project, Nothnagel et al. (2011) observed that reliance on an individual sequencing platform for variant discovery was associated with an alarming proportion of false positive genotype

calls (3-17%)[108]. While some degree of error is inevitable, what might be acceptable for some studies could be unacceptable or even misleading for others. Moreover, error rates are unlikely to be uniform across a genome, such that studies focused on specific regions of a genome may be affected in unexpected ways. The full set of variants for a specific genomic region and in a given sample of interest may also not be catalogued in public repositories[22], requiring additional sequencing to be performed. In this study, as described below, we resequenced a quantitative trait locus in maize and assessed the discovery rate of variant sites and the accuracy of genotype calls in maize HapMap3[24], a public resource with ≈60 M variant sites that can be, and has been, used for a range of genetic studies.

Maize (Zea mays subsp. mays L.), besides being a major crop, has served as a model species for genetics and genomics research. The maize genome exhibits astounding levels of diversity, with sequence variation that is an order of magnitude greater than that in humans[141] and a pan-genome estimated to encompass genomes that vary by 40-50%[21, 85]. Sequence analyses of specific loci have revealed hallmarks of high transposon activity as well as non-collinear haplotypes[46, 21]. Combined with the complex genetic architecture of phenotypic variation in maize, this array of sequence variation challenges the identification of causal variants underlying quantitative traits.

To study sequence variation at specific regions of a genome, targeted resequencing is an effective alternative to whole genome sequencing[37]. The primary techniques for enriching specific sections of a genome are based on DNA hybridization and amplification[52], but the high repeat content of many plant genomes poses challenges to conventional applications of target enrichment[88, 45]. New computational tools for oligonucleotide and primer design and approaches for subtracting repetitive DNA have made targeted enrichment possible in plants including maize[47, 66, 45, 44]. Proof-of-concept studies have demonstrated the application of targeted resequencing to study plant genomes; e.g. targeted exome capture was used to genotype samples of black cottonwood[159] and multiplexed PCR enrichment of candidate genes for flowering time was used to examine diversity of maize landraces[66].

Targeted resequencing is thus a cost-effective and high-throughput approach for population scale genomic studies that is becoming feasible even for complex plant genomes.

In crop production systems, breeding for improved quantitative disease resistance (QDR) is an effective means of sustainable disease management. Although numerous studies have mapped loci associated with QDR in crop species, it is only recently that some of the causal genes or variants have been identified[106]. Northern leaf blight (NLB) disease of maize, caused by the fungal pathogen Setosphaeria turcica, ranks among the diseases most detrimental to maize production in the U.S.[100] and abroad[58]. A substantial body of research exists on the genetics of QDR to NLB (e.g.[150, 116]); however, with the exception of one relatively large, partial effect gene that has been cloned (HtN1 [62]), genes underlying QDR to NLB remain elusive. Jamann et al. used breakpoint mapping to fine-map two quantitative trait loci associated with distinct mechanisms of resistance to infection by S. turcica[33, 65, 64]. The qNLB1.02B73 locus was delimited to ≈23 kb on the short arm of chromosome 1, where functional analysis implicated a remorin gene as the cause of variation in QDR between the two contrasting haplotypes of inbred lines B73 and Tx303[64]. The DNA sequence of the susceptible haplotype of Tx303 was obtained for only ≈1 kb outside of the remorin gene (where there was a gap in the B73 RefGen v2 sequence), and data from maize HapMap2[30] were used to speculate on the variants that might affect remorin gene function.

To obtain the complete sequence of the fine-mapped region of qNLB1.02B73 across additional samples, uncover all of the variants between B73 and Tx303 and potentially identify the causal variant(s) for QDR to NLB, tiled amplicon resequencing was performed on the inbred line parents of a 5000-line nested association mapping (NAM) population of maize[95]. Comparative analysis of these resequenced haplotypes was used to address the hypothesis that Tx303 carries a susceptible allele of a remorin gene that is unique to Tx303 among the NAM founders[64].

4.2 Methods

4.2.1 Barcoded DNA amplification of the qNLB 1 25721468 23298 locus

Using the DNeasy 96 Plant Kit (Qiagen, Hilden, Germany; Cat# 69504), DNA was extracted from lyophilized leaf tissue of 27 samples (i.e. inbred line parents of a maize nested association mapping [NAM] population[95]: B73, B97, CML103, CML228, CML247, CML277, CML322, CML333, CML52, CML69, HP301, Il14H, Ki11, KI3, Ky21, M162W, M37W, Mo17, MO18W, Ms71, NC350, NC358, Oh43, Oh7B, P39, Tx303, Tzi8). DNA extracts were quantified using the Quant-iT PicoGreen dsDNA Assay Kit (Thermo Fisher, Waltham, MA; Cat# P11496) and normalized to a working concentration of 12.5 ng/µL.

Amplicon tiling paths covering the qNLB 1 25721468 23298 locus (a QTL for Northern leaf blight resistance on chromosome 1[64], starting at nucleotide position 25,721,468 and extending 23,298 bp [B73 RefGen v4[68]]) were designed using ThermoAlign v1.0.0[45]. The locus was previously resequenced across six of the NAM founder lines[44]. Through an iterative process of tiling path design and primer pair testing, the amplicon coverage of qNLB 1 25721468 23298 for the remaining 21 samples was filled as much as possible, resulting in the design of two sets of amplicon tiling paths used to amplify corresponding sets of samples (Supplementary Table C.1). The primer pairs were purchased from IDT (Integrated DNA Technologies, Inc., Coralville, Iowa) as desalted oligonucleotides and included symmetric barcodes (per amplicon by line combination) and a shared 5’ GTTAG padding sequence according to guidelines for using barcodes from Pacific Biosciences (http://www.pacb.com/wp-content/uploads/Shared-Protocol-PacBio-Barcodes-for-SMRT-Sequencing.pdf; Supplementary Table C.2).

PCR amplifications were performed as separate reactions for each primer pair on each sample. Amplifications were prepared in a total of 20 µL wherein the final concentrations were: 1X Phusion HF Buffer (New England Biolabs, Ipswich, MA; Cat# B0518S), 200 µM per dNTP, 8 µM betaine, 0.1 µM forward and reverse primer and 0.02 U/µL Phusion Hot Start II DNA polymerase (Thermo Fisher, Waltham, MA;

Cat# P11496). Amplifications were performed on an Applied Biosystems GeneAmp PCR System 9700 (Thermo Fisher, Waltham, MA; Cat# P/N N805-0200) with the following thermal cycling conditions: (i) denaturation for 30 s at 98 °C; (ii) 35 cycles of three steps, namely denaturation for 10 s at 98 °C, annealing for 30 s at 65 °C and extension for 4 minutes at 72 °C; and (iii) a final extension for 10 minutes at 72 °C. To minimize PCR errors in the sequence data, each amplification was replicated four times and the replicates were pooled. SPRIselect beads (Beckman Coulter, CA, USA; Cat# B24965AA) were used to purify each of the pooled amplicons with a 0.5X bead ratio and left side selection to eliminate excess primers and any other products less than ≈1200 bp (the smallest expected amplicon was 2275 bp). A 1% agarose gel was run to confirm the correct size and quality of amplicons. Based on PicoGreen quantification data of the purified PCR products, an equimolar pool of all DNA amplicons from all samples was formed according to Zhang et al.[158].
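The equimolar pooling step can be illustrated with a generic calculation. The Python sketch below converts a mass concentration and amplicon length into molarity and then into the volume that delivers a fixed molar amount. It assumes the common approximation of 660 g/mol per base pair of double-stranded DNA; the function names, the concentrations and the 50 fmol target are hypothetical, and the sketch is not necessarily the procedure of Zhang et al.[158].

```python
AVG_BP_MASS = 660.0  # approximate g/mol per base pair of double-stranded DNA


def molarity_nM(conc_ng_per_ul, length_bp):
    """Convert a dsDNA concentration (ng/uL) to nanomolar (equal to fmol/uL)."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * length_bp)


def equimolar_volumes(amplicons, target_fmol):
    """Volume (uL) of each amplicon that contains target_fmol of molecules.

    amplicons: dict of name -> (concentration in ng/uL, length in bp)
    """
    return {
        name: target_fmol / molarity_nM(conc, length)
        for name, (conc, length) in amplicons.items()
    }


# Hypothetical measurements: same mass concentration, different lengths.
amplicons = {"amp_A": (20.0, 2275), "amp_B": (20.0, 4550)}
volumes = equimolar_volumes(amplicons, target_fmol=50.0)
```

At equal mass concentration, an amplicon twice as long contains half as many molecules per microlitre, so twice the volume is needed to contribute the same number of molecules to the pool.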

4.2.2 Sequencing, error correction and assembly of multiplexed amplicon libraries

A single library was made for the 21 samples sequenced in this study (the remaining six samples were sequenced previously[44]). SMRTbell library preparation and PacBio sequencing were carried out by the University of Delaware Sequencing and Genotyping Center (Newark, DE). The library was constructed according to PacBio’s amplicon library protocol[110]. Sequencing was performed on a PacBio RS II instrument with three SMRT Cells using P6/C4 chemistry with a 6-hour movie. A file of filenames (FOFN) containing the absolute paths of the nine bax.h5 files produced by SMRT sequencing was used to generate circular consensus sequences (CCS) using the reads of insert protocol in SMRT Portal. Using the primer and barcode sequences and a flanking sequence search space of 121 bp, C3S-LAA[44] was used to construct consensus sequence assemblies for each line. For samples that produced multiple contigs due to gaps in the assembly, BLAST alignments were used

to scaffold the contigs across the target locus and the contigs were merged while adding 100 Ns to represent sequence gaps.
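The merging step can be sketched as follows. This minimal Python example assumes that each contig has already been oriented and assigned a start position on the reference by the BLAST alignments; the function name and the toy contigs are hypothetical.

```python
GAP = "N" * 100  # placeholder run representing a sequence gap


def merge_contigs(contigs):
    """Join contigs in reference order, separated by a run of 100 Ns.

    contigs: list of (reference_start, sequence) tuples, where
    reference_start is the position assigned from the alignment.
    """
    ordered = sorted(contigs, key=lambda contig: contig[0])
    return GAP.join(seq for _, seq in ordered)


# Toy contigs listed out of order on purpose.
scaffold = merge_contigs([(5000, "GGCC"), (100, "ACGT")])
```

The run of Ns marks where sequence is missing without implying the true size of the gap.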

4.2.3 Sequence characterization

Geneious 10.0.9 was used to generate a multiple sequence alignment from the 27 resequence assemblies and the B73 RefGen v4 reference sequence[68] using Mauve[40]. Sequences in the alignment were arranged according to the percentage identity of each assembly relative to the reference sequence. The numbers of SNPs, indels and MNPs were counted from the vcf files for each line. Contiguous multibase pair indels and MNPs were counted at a single point. These counts were normalized by dividing them by the length of the assembly expressed as a fraction of the length of the B73 reference. RepeatMasker (version: open-4.0.6) was used to screen repetitive elements in each of the assemblies against the repeat library for Panicoids (RMLib: 20160829) using rmblast under a speed/sensitivity setting of “slow.” An additional analysis of repeat content was performed that did not depend on a predefined repeat library. The sequence identity and Tm of every 102 bp sequence (26 bp sliding window) of B73 RefGen v4 were computed from thermoalignments[45]. A repeat was counted if the Tm for an off-target thermoalignment was within 10 °C of the corresponding on-target Tm. Ab initio gene prediction was carried out on each resequence assembly using the Augustus[138] plugin with the maize5 parameter set in Geneious 10.0.9. Nucleotide diversity (π)[105] among the 27 samples across the qNLB 1 25721468 23298 locus was computed using TASSEL 5.2.38[18]. The shared haplotype length was computed as the mean of the lengths of contiguous nucleotide matches between each variant site among all pairs of samples (similar to[61]). Based on a Clustal Omega alignment of all the resequenced assemblies, the R software package pegas[112] was used to generate a haplotype network based on pairwise SNP differences.
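For readers unfamiliar with nucleotide diversity, the following Python sketch computes π as the mean proportion of differing sites over all pairs of aligned sequences. It is an illustration of the statistic only and does not reproduce TASSEL's implementation; in particular, skipping any site where either sequence has a gap or ambiguous base is an assumption made here, and the toy sequences are invented.

```python
from itertools import combinations


def nucleotide_diversity(aligned):
    """Mean pairwise proportion of differing sites among aligned sequences.

    Sites where either sequence lacks an unambiguous base (A, C, G or T)
    are skipped for that pair (an assumption of this sketch).
    """
    per_pair = []
    for a, b in combinations(aligned, 2):
        compared = 0
        diffs = 0
        for x, y in zip(a, b):
            if x in "ACGT" and y in "ACGT":
                compared += 1
                diffs += x != y
        if compared:
            per_pair.append(diffs / compared)
    return sum(per_pair) / len(per_pair)


# Toy alignment of three haplotypes, 8 sites each.
seqs = ["ACGTACGT", "ACGTACGA", "ACGAACGA"]
pi = nucleotide_diversity(seqs)
```

Here the three pairs differ at 1, 2 and 1 of 8 sites, so π is the mean of 0.125, 0.25 and 0.125.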

4.2.4 Comparison to maize HapMap3

Each of the resequence assemblies was aligned to the B73 RefGen v4 sequence using Clustal Omega[133]. From these alignments, jvarkit[82] was used to generate a Variant Call Format (vcf) file, recording information for all the surveyed sites (variant and invariant). Not all of the variants in the vcf files produced by jvarkit used B73 as the reference, such that some genotype calls for B73 were inconsistent with the standard encoding of maize HapMap3 (i.e. 1/1 instead of 0/0;[24]). In addition, the coordinate of each surveyed site was in terms of the Clustal alignment (not the qNLB 1 25721468 23298 reference locus) and was also affected by the presence of indels. A custom script was used to recode the genotype calls and adjust the coordinates such that B73 was consistently the reference (0/0) and coordinates corresponded to the qNLB 1 25721468 23298 locus. Coordinate adjustments were performed on a per line basis, given the jvarkit vcf output from each alignment against B73. For each insertion in the non-reference sequence, the length of the corresponding (preceding) insertion was subtracted from the position of nucleotides downstream of the insertion (i.e. the end position for the insertion). The insertion variant was encoded as at the aligned nucleotide immediately flanking the 5’ end of the start of the insertion (same as maize HapMap3). For deletions, no coordinate adjustment was required and the deleted variant was encoded as .

HapMap3 includes the same NAM founder lines with genotype calls based on higher coverage sequence data (NAM parent subset: median read depth of 10) and lower coverage sequence data (282 diverse line subset: median read depth of 5), both of which we compared to the ReSeq data. All variant sites across the qNLB 1 25721468 23298 locus in HapMap3 (even if they were monomorphic among the subset of 27 samples) were compared to the ReSeq data in terms of zygosity and homozygous genotypes.
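The coordinate adjustment for insertions described earlier in this section can be sketched in Python as below. This is an illustration under stated assumptions rather than the custom script used in the study: the function name and the toy alignment are hypothetical, and the input is taken to be the gapped reference (B73) row of a pairwise alignment.

```python
def alignment_to_reference(ref_aln, locus_start=1):
    """Map each alignment column to a reference coordinate.

    ref_aln is the gapped reference row of a pairwise alignment. Columns
    where the reference has a gap ('-') are insertions in the sample; they
    receive the coordinate of the reference base immediately 5' of the
    insertion, so every downstream coordinate is reduced by the length of
    the preceding insertion(s).
    """
    coords = []
    pos = locus_start - 1  # coordinate of the last reference base seen
    for base in ref_aln:
        if base != "-":
            pos += 1
        coords.append(pos)
    return coords


# Toy alignment; the sample row (shown for context) carries a 2 bp insertion.
ref_aln = "ACG--TAC"     # reference row
sample_aln = "ACGTTTA-"  # sample row, not used by the function
coords = alignment_to_reference(ref_aln, locus_start=25721468)
```

In the toy alignment, column 6 lies two columns downstream of a 2 bp insertion and therefore maps to the fourth reference base (6 − 2 = 4), whereas the deletion in the last column needs no adjustment.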
These comparisons were made on a per line basis, excluding missing genotype calls in either HapMap3 or the ReSeq data. Based on this comparison, the following summary statistics were computed in a stepwise manner: (i) compute the total number of surveyed sites shared between HapMap3 and the ReSeq data; (ii) compute the proportion

of surveyed sites with matching genotype calls (genotype accuracy); (iii) compute the proportion of genotype errors due to differences in zygosity (i.e. 1/1 or 0/0 versus 0/1) versus homozygous genotypes (i.e. 0/0 versus 1/1). In addition, for each line, the HapMap3 discovery rate was computed as the total number of true positive genotype calls in HapMap3 (assuming the resequencing data were correct) divided by the total number of variant sites identified in the corresponding ReSeq assembly. Using the generalized linear models (glm) function in R[120], binomial regression was used to test the association of sequence and genotype features with genotype error in HapMap3.

$$Y_i \sim \mathrm{Binomial}(n_i, \pi_i) \tag{4.1}$$

where $Y_i$ is the observed number of true positives at coordinate $i$, $n_i$ is the total number of surveyed genotypes and $\pi_i$ is the probability of a true positive. The distribution of $Y_i$ was assumed to be binomial. A logit link function (equation 4.2) was used for modeling the log odds of the probability of true positive genotypes as a function of the explanatory variables.

$$g(\pi) = \operatorname{logit}(\pi) = \log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 \tag{4.2}$$

The binomial regression was modeled using the response variable as counts of true positive (numerator, $\pi_i$) and false positive (denominator, $1 - \pi_i$) events and corresponding continuous predictors of (i) sequence features of the locus including GC content ($X_1$), repeat masked sites ($X_2$) and thermoalignment defined repeats ($X_3$); and (ii) genotype features of HapMap3 such as read depth ($X_4$) and percentage of heterozygous genotype calls ($X_5$). Counts of coordinates that were repeat masked were recorded for every 25 bp and were used as $X_2$. GC content and thermoalignment defined repeat count values were averaged across 25 bp windows. The relationship between genotype accuracy based on comparison to the NAM set in HapMap3 and minor allele frequency (MAF) among all samples in HapMap3 was also evaluated.
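The per-line summary statistics defined in this section (shared sites, genotype accuracy, the partition of errors and the discovery rate) can be sketched in Python as follows. The function names, the genotype strings and the toy data are hypothetical. Treating a true positive as a non-reference ReSeq site whose HapMap3 call matches is an interpretation made for this sketch, not a definition taken from the study's code.

```python
MISSING = "./."


def is_het(genotype):
    """True when the two alleles of a diploid genotype string differ."""
    first, second = genotype.split("/")
    return first != second


def compare_line(hapmap, reseq):
    """Summarize HapMap3 calls against ReSeq calls for a single line.

    hapmap, reseq: dict of coordinate -> genotype string ('0/0', '0/1', ...).
    """
    shared = [p for p in hapmap
              if p in reseq and MISSING not in (hapmap[p], reseq[p])]
    matches = [p for p in shared if hapmap[p] == reseq[p]]
    errors = [p for p in shared if hapmap[p] != reseq[p]]
    zygosity = [p for p in errors if is_het(hapmap[p]) != is_het(reseq[p])]
    reseq_variants = [p for p, g in reseq.items() if g not in ("0/0", MISSING)]
    true_pos = [p for p in matches if reseq[p] != "0/0"]
    return {
        "shared_sites": len(shared),
        "accuracy": len(matches) / len(shared),
        "zygosity_error_fraction": len(zygosity) / len(errors) if errors else 0.0,
        "discovery_rate": len(true_pos) / len(reseq_variants),
    }


# Toy genotypes keyed by coordinate (invented for illustration).
hapmap = {1: "0/0", 2: "0/1", 3: "1/1", 4: "0/0", 5: "./.", 6: "1/1"}
reseq = {1: "0/0", 2: "1/1", 3: "1/1", 4: "1/1", 5: "1/1", 6: "1/1", 7: "1/1"}
stats = compare_line(hapmap, reseq)
```

In the toy data, site 5 is excluded because the HapMap3 call is missing and site 7 because it was not surveyed in HapMap3, leaving five shared sites.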

4.2.5 Annotation of variant effects

The Variant Effect Predictor (VEP)[94] was used to annotate the effects of variant sites for each of the resequence assemblies. VEP analysis was performed online via Gramene using the B73 RefGen v4 assembly (http://ensembl.gramene.org/Zea_mays/Tools/VEP; accessed: 080917)[101]. For VEP analysis, the standard vcf format output by jvarkit was used, where indels and multiple nucleotide polymorphisms are represented at a single position (in contrast to the maize HapMap format used above). However, the jvarkit vcf was still corrected so that B73 RefGen v4 was always the reference allele and the variant coordinates corresponded to the B73 RefGen v4 genome.

4.2.6 Association mapping

HapMap3 data[24] at the qNLB 1 25721468 23298 locus were used to perform univariate association mapping for all 38 traits measured on the 282 panel. Association tests were performed using the mixed linear model (MLM) under default settings in Tassel 5 (optimum compression and P3D variance component estimation)[19]. Best linear unbiased predictions downloaded from Panzea[26] were used as the response variable. The MLM was Q and K corrected; Q1 and Q2 were used as covariates, while Q3 was deselected. All markers were plotted after Benjamini–Hochberg false discovery rate correction in R[120], to highlight the statistical strength of associations in the context of gene annotations, by adapting an R script from the Diabetes Genetics Initiative of [128].
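The Benjamini–Hochberg correction was applied in R; for illustration, the same step-up adjustment can be written in a few lines of Python. The function name and the p-values below are hypothetical, and the sketch is not the script used in the study.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Each p-value is scaled by m / rank (rank 1 = smallest p-value), and a
    running minimum taken from the largest rank downwards keeps the
    adjusted values monotone.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p-values for four markers.
pvals = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(pvals)
```

For these four values the adjusted p-values are 0.02, 0.04, 0.04 and 0.02, so markers that share an adjusted value cannot be ranked against each other after correction.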

4.3 Results

4.3.1 Genomic diversity across the qNLB 1 25721468 23298 locus

Complete resequence (ReSeq) assemblies of the qNLB 1 25721468 23298 locus were obtained for 13 of the 27 NAM founders, with the remaining samples covering a minimum of 64% of the locus (Figure 4.1). Incomplete assemblies occurred primarily

when one of the six primer pairs constituting the amplicon tiling path did not produce a PCR product. For two of the samples, the C3S-LAA consensus sequence for one of the amplicons was truncated, resulting in a small gap in their assembly. The assemblies were built from a minimum and median subread depth of 131 and 500, respectively (PacBio reads are comprised of subreads from cyclical sequencing of the same molecule). Alignment of the ReSeq assembly of B73 against the B73 RefGen v4 sequence indicated the ReSeq data were highly accurate, showing two single base pair indels embedded in homopolymeric sections of the 23,298 bp locus (discussed previously:[44]).

We characterized the variation in sequence features among the ReSeq assemblies. Contiguous ReSeq assemblies that spanned the terminal primers varied in length by 666 bp. In contrast to the 85% repeat (transposon related) content of the maize

genome [130], the repeat content at qNLB 1 25721468 23298 was markedly lower (≈7% repeat content) with minimal variation among samples (Supplementary Figure C.1). Ab initio gene prediction on the B73 ReSeq assembly produced the same number of genes as the B73 RefGen v4 annotation; however, for each of the five predicted genes, the start and stop positions were not identical to those of the B73 RefGen v4 annotation. Within the ReSeq assemblies, where there was sufficient coverage, ab initio gene predictions corresponding to the Remorin and PPR genes showed two isoforms in the predicted coding sequence among the samples. At the analogous Remorin gene, CML52, CML322, CML333 and NC350 had a smaller coding sequence than B73. At the analogous PPR gene, CML52, CML277, CML322, CML333, NC350, and Tzi8 had a longer coding sequence than B73. Thus, the 27 NAM founder ReSeq assemblies for the qNLB 1 25721468 23298 locus were collinear, with no gene-level presence-absence variants or major structural rearrangements, but sequence variants among the samples modified the predicted structure for two ab initio predicted genes (Figure 4.1).

Among samples, a total of 671 variant sites (424 SNP sites, 247 MNP/indel sites) were present across the 23,298 bp locus (1 variant site per ≈35 bp; 1 SNP per ≈55 bp). All SNPs were bi-allelic, 13 MNP/indel sites were tri-allelic and one MNP/indel site contained six polymorphisms. Nucleotide diversity (π) for the locus was 0.0134 (0.0050 when MNP/indels were masked), which falls within the low range of nucleotide diversity in maize [149, 142]. Nevertheless, there were 25 distinct haplotypes among the 27 samples (Figure 4.2), for which the mean shared haplotype length was 162 bp. Adjusting for differences in assembly lengths, Tx303 (the most distant haplotype from B73) contained 1 variant per ≈57 bp, approximately four times the average rate among the NAM founders (i.e. 23,298/117 = 1 variant per ≈200 bp, where 117 was the mean number of normalized variant counts per haplotype in Figure 4.1b).
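The variant densities quoted above follow from dividing the locus length by the relevant counts. The short Python check below reproduces that arithmetic; the variable names are illustrative and the value of 57 bp per variant for Tx303 is taken directly from the text.

```python
LOCUS_BP = 23298  # length of the resequenced locus in bp


def bp_per_variant(n_variants):
    """Average spacing (bp) between variant sites across the locus."""
    return LOCUS_BP / n_variants


all_sites = bp_per_variant(671)     # all variant sites
snp_sites = bp_per_variant(424)     # SNP sites only
mean_founder = bp_per_variant(117)  # mean normalized count per haplotype

# Ratio of the mean founder spacing to the Tx303 spacing reported in the text.
tx303_fold = mean_founder / 57
```

These evaluate to roughly 34.7, 54.9 and 199.1 bp per variant, and the Tx303 rate is about 3.5 times the founder mean, which the text rounds to four.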

4.3.2 Comparison to maize HapMap3

Maize HapMap3 aims to identify genetic markers in genomic regions with preserved collinearity and serves as a community resource for analysis of genetic diversity

[Figure 4.1 graphic. (a) Sequence tracks for the B73 RefGen v4 reference and the ReSeq samples; coordinate axis labels: 25721468, 25725407, 25730357, 25735246, 25740056, 25744794. (b) Variant counts per line, one row per ReSeq track:]

SNP   MNP   Insertion   Deletion
2     0     0           0
13    3     8           12
40    4     13          16
55    4     6           9
41    4     11          14
36    4     9           12
32    4     13          15
70    4     17          19
75    5     17          17
38    4     14          14
39    3     12          12
33    2     10          6
33    3     7           4
31    1     7           2
28    1     8           2
28    1     8           2
24    1     10          6
34    0     6           6
26    0     7           5
53    0     14          11
136   34    32          32
170   42    39          38
152   35    33          24
101   3     17          12
259   47    46          54
173   9     24          30
205   42    40          46

Figure 4.1: Sequence variation at the qNLB 1 25721468 23298 locus. (a) Nucleotide coordinates corresponding to the B73 RefGen v4 genome are indicated at the top. The B73 V4 track corresponds to the reference genome sequence along with encoded genes annotated in the region, which include Protein UPSTREAM OF FLC (UFC), T-complex protein 1 subunit gamma (CCT3), uncharacterized protein (LOC100382810), remorin 6.3 (remorin) and Pentatricopeptide repeat (PPR) containing protein. The remaining tracks show each of the ReSeq samples. Grey and black bands represent consensus and variant sites (SNPs, MNPs and indels) relative to the B73 RefGen v4 sequence, respectively. Horizontal thin lines indicate deletions, where grey lines indicate deletions shared with the reference and black lines are unique to the line. White bands represent assembly gaps. (b) The tabulated data are the number of different types of variants per line, adjusted according to the length of the corresponding assembly.

and association mapping[24]. Using the ReSeq assemblies constructed from deep, long-read sequence data as a benchmark allowed us to investigate the accuracy of HapMap3 at the qNLB 1 25721468 23298 locus. Quartiles of genotyping accuracy per line and surveyed site are reported in Table 4.1, with extremes in genotype accuracy ranging from ≈75% to ≈97% and 0% to 100%, respectively. The range in the discovery rate per line (the proportion of ReSeq variants captured by HapMap3) was 1% to 29% (Figure C.2).

[Figure 4.2 graphic: haplotype network with nodes I to XXV; edge labels give the number of SNP differences.]

Legend (sample and haplotype): B73 (XXI), B97 (IX), CML103 (II), CML228 (XV), CML247 (XVI), CML277 (VI), CML322 (XXIV), CML333 (XXV), CML52 (XVIII), CML69 (XX), Hp301 (XII), Il14H (XIV), Ki11 (XVII), Ki3 (XVII), Ky21 (X), M162W (V), M37W (XI), Mo17 (VII), Mo18W (XIX), MS71 (XXII), NC350 (XXIII), NC358 (III), Oh43 (VIII), Oh7B (XIII), P39 (XIV), Tx303 (IV), Tzi8 (I).

Figure 4.2: Haplotype network for the ReSeq assemblies. Each unique haplotype is represented by an individual pie chart that is labelled using Roman numerals. The radius of each pie chart is proportional to the number of genotypes with a given haplotype, and the number of sections indicates the number of individuals with that same haplotype. Edges connecting a given pair of haplotypes indicate the number of SNP differences between those haplotypes; however, edge numbers are not additive across more than two nodes.

Among the 50,946 cumulative genotype calls for the NAM set in HapMap3, 5881 (≈12%) were incorrect (i.e. mismatches with the ReSeq data), of which 1215 were incorrectly typed as heterozygous genotypes while the remainder were incorrectly typed as homozygous genotypes. We checked our assumption that the samples were homozygous across the region. Indeed, alignment of the circular consensus sequences (CCS reads) for one of the amplicons (chr.1: 25721468..25727793) of Tx303 showed no evidence of heterozygosity. Only 12 sites in the alignment had a nucleotide difference in > 5% of the CCS reads, with a maximum of 10% (we attribute these to PCR or sequencing errors or minor levels of cross-contamination). In contrast, Tx303 in HapMap3 included 107 heterozygous genotype calls across this same region.
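A check of this kind can be sketched in Python as below: for each column of a read alignment, compute the fraction of reads that differ from the majority base and flag columns above 5%. The function names and toy reads are hypothetical, gap handling is ignored, and this is not the procedure used to produce the numbers reported above.

```python
from collections import Counter


def minor_fraction_by_site(aligned_reads):
    """Fraction of reads that differ from the majority base at each column."""
    fractions = []
    for column in zip(*aligned_reads):
        major_count = Counter(column).most_common(1)[0][1]
        fractions.append((len(column) - major_count) / len(column))
    return fractions


def flag_sites(aligned_reads, threshold=0.05):
    """1-based columns where more than `threshold` of reads differ."""
    fractions = minor_fraction_by_site(aligned_reads)
    return [i for i, f in enumerate(fractions, start=1) if f > threshold]


# Toy alignment of 20 reads: 2 reads differ at column 3, 1 read at column 1.
reads = ["ACGT"] * 17 + ["ACTT"] * 2 + ["TCGT"]
flagged = flag_sites(reads)
```

A truly heterozygous site would be expected to show the minor allele in roughly half of the reads, far above the 5% to 10% range that is attributed here to PCR error, sequencing error or contamination.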

Table 4.1: Quartiles of genotyping accuracy for maize HapMap3 at the qNLB 1 25721468 23298 locus.

Level¹          Set²        Min.   25%    50%    75%    Max.
Sample          NAM         76%    83%    87%    95%    97%
Sample          282 panel   74%    84%    88%    95%    96%
Surveyed site   NAM         0%     83%    100%   100%   100%
Surveyed site   282 panel   0%     82%    100%   100%   100%

¹ Accuracy values are indicated at the sample level (Sample) and at the nucleotide level (Surveyed site).

² HapMap3 genotypes from the 10X coverage NAM founder set (NAM) and the 5X coverage 282 panel are shown.

To gain insight into the potential cause of genotyping error in HapMap3, genotype accuracy was compared to sequence features of the locus, genotype features of HapMap3 and the population feature, minor allele frequency (Figure 4.3). Results from binomial regression indicated that the number of true positive genotype calls was significantly associated (p < 1e-10) with all sequence and genotype features except for GC content (p = 0.186). A high density of sites with 0% genotype accuracy occurred proximal to ≈25.735 Mb. These segments contained a high density of thermoalignment repeats (but no RepeatMasker defined repeats) and the genotypes had the highest levels of read depth and heterozygous calls. The minor allele frequency among HapMap3 samples, a feature derived from the HapMap3 genotype data, but which was not fit in

67 the model, showed a significant negative correlation with genotype accuracy (r = -0.8; Figure C.3).
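The correlation between minor allele frequency and genotype accuracy reported above is a Pearson-type coefficient; a minimal sketch of that calculation follows. The function name `pearson_r` and the per-site values are hypothetical and chosen only to show a negative relationship; they are not data from this study, and the binomial regression is not reproduced here.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-site values: minor allele frequency and genotype accuracy
maf = [0.01, 0.05, 0.10, 0.25, 0.40]
accuracy = [1.00, 1.00, 0.90, 0.60, 0.20]

r = pearson_r(maf, accuracy)
print(round(r, 3))
```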

4.3.3 Analysis of the NLB susceptible, Tx303 haplotype

Prior work using breakpoint analysis of haplotypes from B73 and Tx303, which have contrasting effects on NLB resistance, delimited the qNLB 1 25721468 23298 locus[64]. However, association mapping using the 282 panel (HapMap2 data[31]) failed to identify any variants associated with NLB[64], while linkage mapping using the NAM population suggested the Tx303 haplotype associated with NLB susceptibility is unique among the NAM founders[116, 64]. Based on additional evidence, Jamann et al. (2016) identified remorin as the most likely gene underlying resistance to NLB; this gene is emphasized in some of the following results. Thus, a primary motivation of this study was to catalogue all of the variants at qNLB 1 25721468 23298 in order to identify potential causal variants for susceptibility to NLB that may be specific to Tx303. We predicted effects on gene function for variants specific to the B73/Tx303 contrast (VEP analysis). We also determined whether those variants were typed in HapMap3 and computed their frequency in the NAM founders (under the expectation that the causal variant(s) is unique to Tx303 in the NAM population).

As described above, among the NAM samples, Tx303 was the most divergent haplotype from B73 (Figures 4.1 and 4.2). Despite having the greatest number of substitutions among the ReSeq haplotypes, including a 7 bp insertion in the 3' UTR of the remorin gene, none of the Tx303 variants were predicted to have a high impact on gene function. There was one variant that was predicted to have a high impact on remorin gene function, but it was not present in the Tx303 haplotype; instead, this variant was shared by CML52, CML322, CML333 and NC350. The majority of substitutions in the Tx303 haplotype, and among the NAM founder haplotypes in general, were predicted as modifiers (Figure 4.4).

[Figure 4.3 plot. Panels: (a) MAF (HapMap3); (b) Genotype accuracy (NAM), with point size keyed to the percentage of heterozygous genotypes (0%, 25%, 50%, 75%) and color keyed to read depth; (c) %GC, thermoalignment repeats (Tm < 10 °C), repeat masked segments and genes. Shared x-axis: Coordinate (Mb), 25.720 to 25.745.]

Figure 4.3: Genotyping accuracy in relation to sequence and genotype features at the qNLB 1 25721468 23298 locus. (a) The minor allele frequency (MAF) among all samples in maize HapMap3. (b) Genotype accuracy estimated for the NAM set in HapMap3, based on the comparison to the ReSeq data. The size of each point reflects the percentage of heterozygous genotype calls present among samples constituting the NAM set in HapMap3. The color gradient reflects the median read depth of genotype calls for samples constituting the NAM set in HapMap3. (c) Sequence features of B73 RefGen v4, including the mean percent GC content (window size: 102 bp, step size: 25 bp), the number of thermoalignment defined repeats (i.e., sequences within the locus that were separated from other loci in the maize genome by less than 10 °C in melting temperature[45]), repeat masked segments and genes (from left to right: UFC, CCT3, LOC100382810, remorin and PPR [see Figure 4.1]).

Across the qNLB 1 25721468 23298 locus, a total of 413 variants differentiated the B73 and Tx303 haplotypes, with 103 of these located in the remorin gene (including the upstream and downstream intergenic space). Among these B73/Tx303 variant sites, 255 were typed in HapMap3 and used for association mapping with the 282 panel (27 of these were in the remorin gene). HapMap3 genotype accuracies estimated at these sites had a median of 64% and were less than 81% for 90% of the sites, and neither these variant sites nor any other sites in HapMap3 showed a significant association with NLB (Figure C.4). For the NAM population, the fewer the non-B73 founders that share an allele, the lower the statistical power of tests for trait association by joint linkage and association mapping. Jamann et al. (2016) inferred that a B73/Tx303 variant in the remorin gene underlies the NLB effect at qNLB 1 25721468 23298 and that this was unique to Tx303 among the NAM founders. Even though the remorin gene did not have complete coverage across all lines (Figure 4.1), all of the B73/Tx303 variants were shared by at least one other NAM founder. Among these variant sites, there were 2 in 1 additional founder, 18 in 2 additional founders, 4 in 3 additional founders, 4 in 4 additional founders and 4 in 5 additional founders (Figure C.5). The two sites with the Tx303 variant shared by only one other NAM founder (NC358) were not typed in HapMap3 or available for association mapping with the 282 panel; 4 of the 18 sites with variants shared by two other founders (CML103 and NC358) were also not present in HapMap3. The only variants that were unique to Tx303 (n=66) occurred within the CCT3 gene (Figure 4.4).
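The tally of how many additional founders share each Tx303 variant can be illustrated with a short sketch. The function name `sharing_spectrum`, the site coordinates and the carrier sets below are hypothetical; only the idea of counting non-Tx303 carriers per site is taken from the text.

```python
from collections import Counter

def sharing_spectrum(variant_carriers, focal="Tx303"):
    """For each variant carried by the focal line, count the other founders carrying it.

    variant_carriers: dict mapping site -> set of founder names carrying the non-B73 allele
    Returns a Counter mapping number of additional founders -> number of sites.
    """
    spectrum = Counter()
    for site, carriers in variant_carriers.items():
        if focal not in carriers:
            continue  # variant absent from the focal haplotype
        spectrum[len(carriers - {focal})] += 1
    return spectrum

# Hypothetical sites and carriers, for illustration only
carriers_by_site = {
    25730001: {"Tx303"},
    25730050: {"Tx303", "NC358"},
    25730120: {"Tx303", "NC358", "CML103"},
    25730300: {"Tx303", "NC358", "CML103"},
    25730410: {"CML52", "CML322"},
}

spectrum = sharing_spectrum(carriers_by_site)
for n_other, n_sites in sorted(spectrum.items()):
    print(n_other, "additional founder(s):", n_sites, "site(s)")
```

Sites counted under zero additional founders correspond to focal-specific variants, the class that the hypothesis of a Tx303-specific causal allele would require.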

4.4 Discussion

Quantitative disease resistance is a common and durable form of resistance in maize, but limited information is available on the genes and mechanisms underlying QDR. In maize, sequencing of haplotypes encompassing full-length genes has revealed non-collinearity and high levels of nucleotide diversity[46, 99, 21, 147, 20, 93]. However, nestled within the highly repetitive (≈ 85%;[130]) and diverse

[Figure 4.4 plot. Bar charts of variant counts in three rows: B73/Tx303 (a, b), Tx303-specific (c, d) and NAM (e, f). Panels a, c and e group variants by predicted impact (Low, Moderate, High, Modifier); panels b, d and f group variants by gene structure position (5' UTR, exon, intron, 3' UTR, intergenic) across UFC (+), CCT3 (−), LOC100382810 (+), Remorin (−) and PPR (+).]

Figure 4.4: Impacts and gene structure positions of variants based on VEP annotation. (a) Predicted impacts of variants between ReSeq Tx303 and B73 haplotypes; (b) Gene structure positions of variants in a; (c) Predicted impacts of variants unique to the ReSeq Tx303 haplotype (compared to all ReSeq NAM founder haplotypes, including B73); (d) Gene structure positions of variants in c; (e) Predicted impacts of variants among all ReSeq NAM haplotypes (including B73); (f) Gene structure positions of variants in e. Note: an individual variant can have more than one impact, due to alternative splicing annotation. Hence, the total number of VEPs exceeds the total number of variants per se.

(π for domesticated maize ranges from 0.002 to 0.041[149, 157]) genome of maize, haplotypes of the NAM founders at qNLB 1 25721468 23298 showed relatively low levels of genomic diversity, both in terms of repeat content (≈ 7%) and nucleotide differences (πSNP = 0.005). Our data were not suitable for a formal test of selection, but this level of sequence variation indicates the locus may have been subjected to selection. These observations could also be explained by the paucity of transposable elements, which have been shown to generate variation in the maize genome[99]. This narrowed our focus to considering SNPs and indels as potential causal variants for QDR at qNLB 1 25721468 23298.

Maize HapMap is an expansive community resource that has been used for numerous studies. For HapMap3, Bukowski et al. (2015) restricted genotyping to putatively collinear sections of the genome and estimated that the per site error rate was 1-3%. Given the high level of sequence collinearity at qNLB 1 25721468 23298, this locus might be considered a best case scenario for accurately mapping sequences and determining genotype calls. Using our independent ReSeq data as a benchmark, we were able to determine the distribution of error rates in maize HapMap3 at qNLB 1 25721468 23298, where the median per sample error rate was 13% and the median per site error rate was 0%. However, the per site error rate rose steeply among a subset of surveyed sites, with an upper quartile error of 17% and a maximum error of 100%, which occurred at 17 of the 2281 sites (0% accuracy in Figure 4.3b). The majority of errors were due to incorrect homozygous genotype calls, but extremely high levels of error occurred non-randomly at sites where there was a high read depth underlying the genotype call and a high proportion of heterozygous genotypes (presumably originating from misalignments of sequencing reads[24]). The error structure produced a strong negative correlation between genotype accuracy and MAF, such that sites with higher MAF had lower accuracy (Figure C.3). Our results indicate that MAFs computed from HapMap3 data are biased and suggest that statistical tests for genotype-phenotype association will be underpowered when using maize HapMap3.
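For readers less familiar with the nucleotide diversity statistic cited earlier in this discussion (π), the sketch below computes the average proportion of pairwise differences from aligned haplotypes. It is an illustrative definition only: the function name `nucleotide_diversity`, the handling of gaps and ambiguous bases, and the toy alignment are assumptions and do not reproduce the estimate reported for this locus.

```python
from itertools import combinations

def nucleotide_diversity(sequences):
    """Average pairwise proportion of differing sites among aligned sequences.

    Sites where either sequence of a pair has a gap ('-') or 'N' are
    skipped for that pair (pairwise deletion).
    """
    if len(sequences) < 2:
        raise ValueError("at least two sequences are required")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    per_pair = []
    for a, b in combinations(sequences, 2):
        compared = 0
        diffs = 0
        for x, y in zip(a, b):
            if x in "-N" or y in "-N":
                continue
            compared += 1
            diffs += x != y
        if compared:
            per_pair.append(diffs / compared)
    if not per_pair:
        raise ValueError("no comparable sites between any pair of sequences")
    return sum(per_pair) / len(per_pair)

# Toy alignment (hypothetical haplotypes)
alignment = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGTTCGTAC",
    "ACGTTCGAAC",
]

pi = nucleotide_diversity(alignment)
print(round(pi, 4))
```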

Based on fine mapping and functional analysis of qNLB1.02B73, Jamann et al. (2016) suggested that the inbred line Tx303 carries a susceptible allele of a remorin gene that is unique to Tx303 among the NAM founders. Our results do not support this hypothesis, as no Tx303-specific variants were identified in the remorin gene among the NAM founders. Moreover, sequence analysis of the Tx303 and B73 haplotypes showed no evidence for loss of function for any of the genes in the region. In fact, examining sequence variation at qNLB 1 25721468 23298 failed to provide a clear indication for what might be the cause of variation in QDR. Perhaps the causal variant(s) is not specific to Tx303 among the NAM founders, in which case CML103 and NC358 are the most likely NAM founders to share the same allele (Figures 4.1 and 4.2). The T-complex protein 1 subunit gamma (CCT3) gene fit the hypothesis of a Tx303-specific allele, but without evidence for differential expression in response to pathogen infection[64], only sequence variants predicted to impact the protein structure of CCT3 would be considered plausible candidates. However, there were no unique non-synonymous substitutions in the CCT3 gene. Alternatively, as it appears to be common that quantitative traits in maize are conditioned by regulatory sites distal to the corresponding functional gene[140, 153, 27, 60], the differential response and functional effect previously demonstrated for remorin might be attributable to variants outside of the bio-statistically delimited qNLB 1 25721468 23298 locus, within the original 243 kb region delimited by breakpoint mapping[64]. More complex explanations, such as local epistasis and epigenetic variation, are also possibilities. Clearly, additional work is needed to elucidate the molecular basis for this QDR locus.

This study contributes to the understanding of haplotypes encompassing multiple genes in maize and the growing interest in understanding the maize pan-genome. Contrary to studies that have highlighted extensive non-collinearity in the maize genome, qNLB 1 25721468 23298 is a highly conserved section of the genome. The low variant discovery rate and the prevalence of genotyping errors identified in maize HapMap3 indicate the importance of independent validation for the ever increasing volumes of data being made available for genetic studies. As an extension of this study on the role of qNLB1.02B73 in NLB QDR, associations with traits other than NLB QDR were identified based on the 282 panel of HapMap3 data (Figure C.4). This leads to a hypothesis of potential overlap in the mechanisms underlying quantitative disease resistance and other traits, which is in line with the emerging omnigenic model[17] for complex traits.

Chapter 5

DISCUSSION AND CONCLUSIONS

The genomic era has become a reality and the consequences of the emergence of this new field are evident in a variety of domains[34]. In spite of advances in genome sequencing and bioinformatic analysis, it remains challenging to sequence and understand genomic variation in large and complex genomes such as maize. Because of extensive polymorphisms, including presence/absence variations across diverse lines of maize, the reference genome provides only a glimpse into the maize pan-genome[21, 85]. High coverage sequencing of large plant genomes is still cost prohibitive and unnecessary in many cases. A more effective approach is to reduce the complexity of the genome by targeting a specific portion for selective enrichment followed by high coverage sequencing. Several targeted genome enrichment techniques that were demonstrated to work successfully in human and other genomes proved quite challenging to test and optimize with the highly repetitive maize genome[45]. The task becomes even more daunting when sequence variations underlying complex traits in a highly repetitive genome need to be studied to understand genomic diversity and its implications. This is primarily because several parts of the genome, each contributing subtle effects, are responsible for such quantitative traits. This dissertation focused on characterizing the genomic diversity underlying a Northern leaf blight (NLB) quantitative disease resistance (QDR) locus in maize. Understanding the challenges of resequencing complex genomes like maize and addressing some of them through the development of ThermoAlign[45] and C3S-LAA[44] was crucial to attaining this goal. These developments enabled successful characterization of the genomic diversity across the 27 founder lines of a maize nested association mapping (NAM) population, at a fine-mapped section of chromosome 1 (≈ 23 kb) associated with NLB QDR. This study also helped us understand the low variant discovery rate as well as the prevalence of genotyping errors in the community maize HapMap3 data set, which is based on low coverage short read sequencing technology. This work contributes new tools for genome science and provides new insights into the potential causal variants at a locus associated with QDR to NLB in maize.

5.1 A ThermoAlign approach for targeted enrichment of repetitive genomes

Targeted resequencing of parts of a genome is an important component of genomic studies. Our unsuccessful attempts at testing various targeted enrichment/amplification techniques on the maize genome led to an understanding of the challenges posed by such repetitive genomes. Other research groups that attempted targeted enrichment in the maize genome reported similar outcomes[47, 88]. Further optimization or new technologies are required for hybridization probe based targeted enrichment approaches to work on repetitive genomes like maize. The challenges that were identified could broadly be classified as amplifiability and specificity related issues. Amplifiability related challenges arose from sequencing and assembly errors in the reference genome, complex conformations of repetitive genomic DNA and extensive polymorphisms among diverse lines of maize. Specificity related constraints arose from non-specific binding of primers and hybridization oligos and from self-genome/by-product priming. Repetitive DNA comprises varying fractions of different genomes, which may exceed 50% in the human genome and 80% in some plant genomes. Along with transposable elements and simple repeats, homeologous regions of the genome also contribute to the repeat fraction, where primers and oligos can bind non-specifically.

To address this challenge, I hypothesized that evaluating the hybridization thermodynamics across priming relevant alignments of primers across the genome would enable accurate selection of template specific primers. A critical aspect of ThermoAlign, the tool that was developed under this study, is the algorithmic and quantitative approach used to characterize off-target hybridization sites. Priming relevant "thermoalignments" were generated by gap adjusted end filling of local alignments towards the primer's 3' region, which ensures the design of locally specific primer pairs by evaluating the DNA hybridization thermodynamics at potential off-target hybridization sites. Prior information on polymorphisms is also used. This advance identifies primers that are highly specific to the local target site and that are expected to function across diverse individuals. Laboratory tests show that ThermoAlign designed primers provide perfect performance in terms of amplification specificity when tested on one of the most repetitive genomes sequenced to date (over 85% repeat content). Next, using a unique implementation of graph analysis, the minimum set of primers giving the maximum coverage of the targeted locus was efficiently identified. This facilitates amplification of genomic regions that cannot be amplified by a single primer pair. These tiling paths of primers were then grouped together as multiplex compatible sets.

This study revealed that there is no substitute for the priming relevant alignment based evaluation of the hybridization thermodynamics of primers. Approaches that evaluate the specificity of subsequences, or that rely on the number of mismatches to evaluate primer specificity, risk selecting primers that could hybridize to off-target sites or result in a low primer discovery rate. This work builds on fundamental knowledge of the thermodynamics of DNA hybridization, demonstrating that the nearest-neighbor estimation of Tm on thermoalignments is a robust solution for target-specific primer design. In summary, ThermoAlign offers a significant advance in our ability to study species with repetitive genomes and challenging regions of regular genomes. Besides targeted resequencing, ThermoAlign has broad applications for routine PCR assays and may also be extended for evaluating the specificity of DNA hybridization probes and CRISPR/Cas9 guide RNAs.
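The idea of selecting a minimum set of overlapping amplicons that tiles a target locus can be illustrated as a shortest path problem. The sketch below is not ThermoAlign's implementation; it assumes that candidate amplicons are given as half-open (start, end) intervals, uses an unweighted breadth-first search, and the names `min_tiling_path` and `min_overlap` as well as the coordinates are hypothetical.

```python
from collections import deque

def min_tiling_path(amplicons, target_start, target_end, min_overlap=0):
    """Return the fewest amplicons (start, end) that tile [target_start, target_end).

    Nodes are amplicons. An edge a -> b exists when b overlaps the end of a
    by at least min_overlap bases and extends further to the right.
    Breadth-first search from every amplicon covering target_start finds a
    path with the fewest amplicons; None is returned if no tiling exists.
    """
    amplicons = sorted(set(amplicons))
    starts = [a for a in amplicons if a[0] <= target_start < a[1]]
    queue = deque((a, [a]) for a in starts)
    seen = set(starts)
    while queue:
        current, path = queue.popleft()
        if current[1] >= target_end:
            return path
        for nxt in amplicons:
            overlaps = nxt[0] <= current[1] - min_overlap
            extends = nxt[1] > current[1]
            if nxt not in seen and overlaps and extends:
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None

# Hypothetical candidate amplicons over a 23 kb target
candidates = [
    (0, 5000), (4000, 9000), (4500, 12000),
    (8000, 15000), (11000, 23000), (14000, 23000),
]

path = min_tiling_path(candidates, 0, 23000, min_overlap=500)
print(path)
```

A weighted variant (for example, penalizing amplicon length or primer quality) would call for a weighted shortest path algorithm instead of breadth-first search.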

5.2 SMRT sequencing and assembly of multiplexed amplicon libraries from the maize genome

Applications of high throughput sequencing data in research and diagnostics are affected by errors resulting from sequencing and amplification based technological limitations as well as from downstream bioinformatic analysis. PacBio's single molecule real-time (SMRT) sequencing technology generates long sequencing reads which, once error corrected, can be useful for assembling genomes and for alignments to a reference genome. PacBio's Long Amplicon Analysis (LAA), the standard pipeline for the analysis of SMRT sequence data from amplicon libraries, resulted in incorrect consensus sequences including inverted duplications and truncations. PacBio raw reads are known to be of poor quality, primarily because of the prevalence of randomly distributed indels. These errors, along with repeats and overlapping regions in the amplicons, could potentially be causing errors in the clustering of reads by LAA. My hypothesis was that using higher quality circular consensus sequences (CCS), each of which is the consensus generated from multiple copies of reads coming from a single amplicon, along with sequence information from the primers and barcodes that were used to produce the amplicons in a sequencing library, would improve the clustering of sequence reads and the resulting consensus sequences. Our study showed that clustering of CCS and their corresponding raw reads using the primer and barcode information for each amplicon (the C3S-LAA approach) aided in the generation of accurate consensus sequences and their assemblies from multiplexed amplicon libraries. The C3S-LAA approach could fully resolve all the amplicons that were part of the sequencing library. Evaluating the relationship between read depth and consensus sequence as well as assembly accuracy, based on bootstrapping the read data, revealed that the minimum subread depth required for generating consensus sequences was 21. The accuracy of the resulting consensus sequences ranged from 99.72-100% and increased with read depth. However, rare instances of recurring errors were observed in some of the amplicon sequences even at high subread depth. These were caused by insertions at a homopolymeric region, indicating limited instances of errors due to DNA amplification or sequencing. C3S-LAA could resolve all the amplicon sequences from a multiplexed amplicon library that was sequenced, and clearly outperformed LAA.
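The primer based grouping of reads described above can be sketched as follows. This is a simplified illustration and not the C3S-LAA code: it assumes exact primer matches at both read ends, ignores barcodes and sequencing errors, and the function names, amplicon names and sequences are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def assign_amplicon(read, primer_pairs):
    """Return the amplicon whose forward primer starts the read and whose
    reverse primer (reverse complemented) ends it, in either orientation."""
    for candidate in (read, reverse_complement(read)):
        for name, (fwd, rev) in primer_pairs.items():
            if candidate.startswith(fwd) and candidate.endswith(reverse_complement(rev)):
                return name
    return None

def bin_reads(reads, primer_pairs):
    """Group reads by amplicon; reads matching no primer pair are dropped."""
    bins = {name: [] for name in primer_pairs}
    for read in reads:
        name = assign_amplicon(read, primer_pairs)
        if name is not None:
            bins[name].append(read)
    return bins

# Hypothetical primer pairs: name -> (forward primer, reverse primer)
primer_pairs = {
    "amp1": ("ACGTAC", "AAGCTC"),
    "amp2": ("TTGACA", "CAGGTA"),
}

read1 = "ACGTAC" + "TTTTT" + reverse_complement("AAGCTC")
read2 = "TTGACA" + "GGGGG" + reverse_complement("CAGGTA")
reads = [read1, reverse_complement(read1), read2, "GGGGGGGGGGGG"]

bins = bin_reads(reads, primer_pairs)
print({name: len(group) for name, group in bins.items()})
```

With real SMRT data, exact matching would be too strict; an error tolerant alignment of primers and barcodes to read ends would be needed, which is one reason the higher accuracy of CCS reads is useful for this step.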

5.3 Unravelling the genomic diversity at a maize quantitative disease resistance (QDR) locus using long molecule resequencing

DNA marker based studies that compared plant genetic diversity have indicated extensive collinearity of genetic maps between closely related species[14]. However, subsequent comparative sequence analysis studies in maize revealed extensive violation of intraspecific genetic collinearity[46, 99, 21]. Contrary to those studies, resequencing of an ≈23 kb NLB QDR associated locus on chromosome 1 identified high collinearity and low nucleotide diversity (π) in this section of the maize genome. Even though the data were not suitable for formal tests of selection, the low diversity value observed at this locus was close to that for genes that were under selection in maize[149]. Higher sequence homology at this locus could be attributed to fewer transposable elements, which are thought to be responsible for capturing and moving genes around the maize genome[99]. The goal of the maize HapMap3 project, which relies on mapping of short sequencing reads onto a reference genome, is to identify genetic markers in regions where genomic collinearity is fairly preserved[24]. The qNLB 1 25721468 23298 locus on the tail end of chromosome 1 that was resequenced in this study had a relatively low repeat fraction compared to the average values that have been estimated for the B73 reference genome. Despite this being an ideal region of the maize genome for genotyping (collinear, with a low π value and low repeat content), we observed a low variant discovery rate as well as a prevalence of genotyping errors in the maize HapMap3 data for this locus. Sites with high genotyping errors were especially prevalent in regions with higher median read depth, indicating instances of potential mismapping of sequencing reads in HapMap3. Such sites also corresponded to sites with a higher percentage of lines recorded with heterozygous genotypes in the HapMap3 dataset. This indicates a scenario where sequencing reads from regions of a resequenced genome that may even be absent from the reference get mapped to paralogous sites of the reference genome, resulting in incorrect variants being called. The low variant discovery rate and extent of genotyping errors in the maize HapMap3 data set indicate the relevance of independent validation for the ever increasing high throughput sequencing based genotype data that are being made available at population scales for a variety of organisms.

One of the resequenced lines, Tx303, was thought to have a unique susceptible allele for NLB QDR, based on an earlier fine mapping and functional analysis of the qNLB1.02B73 locus[64]. Our resequencing study showed that the candidate gene remorin did not have any variants that were unique to the Tx303 haplotype. However, we identified another gene within the qNLB 1 25721468 23298 locus, T-complex protein 1 subunit gamma (CCT3), that fit the hypothesis of a Tx303-specific allele. Jamann et al. (2016) found no evidence of differential expression for the CCT3 gene. This functional evidence, taken together with our resequencing based variant analysis, indicates that regulatory elements outside of the bio-statistically delimited qNLB 1 25721468 23298 locus, but within the fine mapped qNLB1.02B73 region, could be influencing the differential response and functional effect attributed to the remorin gene, which is a common phenomenon for quantitative traits[140, 153, 27, 60]. This study thus opens up the potential for additional work to examine the role of regulatory elements within the fine mapped qNLB1.02B73 region and to address more complex hypotheses regarding the role of local epistasis and epigenetic variation at this QDR locus. Association mapping (the 282 panel using HapMap3 data) identified significant associations with traits other than QDR, leading to a hypothesis of potential overlap in the mechanisms underlying quantitative disease resistance and other traits, which is in line with the emerging omnigenic model[17] for complex traits.

5.4 Future directions

My research has made significant progress in understanding the challenges of resequencing complex genomes like maize and has addressed some of those challenges through the development of ThermoAlign and C3S-LAA. Using these approaches, I could successfully characterize the genomic diversity underlying a Northern leaf blight resistance associated locus resequenced across the NAM founder lines of maize. However, to understand the molecular mechanisms underlying such complex traits, candidate genes and their variants need to be validated and their functions need to be identified by overlaying information from histopathology, transcriptomic and metabolomic studies. Such time series information will enable addressing various hypotheses regarding how QDR loci and corresponding variants affect various stages of disease development. In addition, studying the relationship between disease resistance and yield would help in understanding the cost associated with QDR. Another potential area of interest would be to explore how disease resistance progresses with different maturity stages of the host. This information would provide guidance for the management of diseases and the selection of traits related to maturity, such as flowering traits. With each additional genome or genomic locus sequenced across organisms or populations, our ability to explore genome function and diversity more precisely is increasing. The availability of accurate genome-scale polymorphism data sets in a variety of organisms will provide the opportunity to further understand the complex relationship between genetic diversity, population dynamics, evolutionary history and ecology. Such omics data need to be integrated at a systems level for a better understanding of the disease resistance system as well as the interplay of systems biology pathways underlying multiple complex traits.

BIBLIOGRAPHY

[1] Hatim T. Allawi and John SantaLucia. Thermodynamics and NMR of internal G/T mismatches in DNA. Biochemistry, 36(34):10581–10594, aug 1997.

[2] Hatim T. Allawi and John SantaLucia. Nearest neighbor thermodynamic parameters for internal G/A mismatches in DNA. Biochemistry, 37(8):2170–2179, feb 1998.

[3] Hatim T. Allawi and John SantaLucia. Nearest-neighbor thermodynamics of internal A/C mismatches in DNA: Sequence dependence and pH effects. Biochemistry, 37(26):9435–9444, jun 1998.

[4] Hatim T. Allawi and John SantaLucia. Thermodynamics of internal C/T mismatches in DNA. Nucleic Acids Research, 26(11):2694–2701, jun 1998.

[5] Stephen F Altschul, Thomas L Madden, Alejandro A Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25(17):3389–3402, 1997.

[6] David Altshuler, Mark J Daly, and Eric S Lander. Genetic mapping in human disease. Science (New York, N.Y.), 322(5903):881–8, nov 2008.

[7] Rumen Andonov, Nicola Yanev, Dominique Lavenier, and Philippe Veber. Combinatorial approaches for segmenting bacterium genomes. INRIA, RR-4853(inria-00071730):1–18, 2003.

[8] Reidar Andreson, Tõnu Möls, and Maido Remm. Predicting failure rate of PCR in large genomes. Nucleic Acids Research, 36(11):e66, jun 2008.

[9] Reidar Andreson, Eric Reppo, Lauris Kaplinski, and Maido Remm. GENOMEMASKER package for designing unique genomic PCR primers. BMC Bioinformatics, 7(1):172, 2006.

[10] Adam Auton, Gonçalo R. Abecasis, David M. Altshuler, Richard M. Durbin, David R. Bentley, Aravinda Chakravarti, Andrew G. Clark, Peter Donnelly, Evan E. Eichler, et al. A global reference for human genetic variation. Nature, 526(7571):68–74, sep 2015.

[11] W M Barnes. PCR amplification of up to 35-kb DNA with high fidelity and high yield from lambda bacteriophage templates. Proceedings of the National Academy of Sciences of the United States of America, 91(6):2216–2220, mar 1994.

[12] R Barrett and D Schluter. Adaptation from standing genetic variation. Trends in Ecology & Evolution, 23(1):38–44, jan 2008.

[13] Yair Benita, Ronald S Oosting, Martin C Lok, Michael J Wise, and Ian Humphery-Smith. Regionalized GC content of template DNA as a predictor of PCR success. Nucleic acids research, 31(16):e99, aug 2003.

[14] Jeffrey L Bennetzen. Comparative Sequence Analysis of Plant Nuclear Genomes: Microcolinearity and Its Many Exceptions. The Plant Cell, 12:1021–1029, 2000.

[15] Andrew F Bent and David Mackey. Elicitors, effectors, and R genes: the new paradigm and a lifetime supply of questions. Annual review of phytopathology, 45(1):399–436, 2007.

[16] Nicholas A Bokulich, Sathish Subramanian, Jeremiah J Faith, Dirk Gevers, Jeffrey I Gordon, Rob Knight, David A Mills, and J Gregory Caporaso. Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing. Nature Methods, 10(1):57–59, dec 2012.

[17] Evan A. Boyle, Yang I. Li, and Jonathan K. Pritchard. An Expanded View of Complex Traits: From Polygenic to Omnigenic. Cell, 169(7):1177–1186, jun 2017.

[18] P. J. Bradbury, Z. Zhang, D. E. Kroon, T. M. Casstevens, Y. Ramdoss, and E. S. Buckler. TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics, 23(19):2633–2635, oct 2007.

[19] P. J. Bradbury, Z. Zhang, D. E. Kroon, T. M. Casstevens, Y. Ramdoss, and E. S. Buckler. TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics, 23(19):2633–2635, oct 2007.

[20] K E Broglie, K H Butler, M G Butruille, A da Silva Conceição, T J Frey, J A Hawk, J S Jaqueth, E S Jones, D S Multani, and P.J.C.C. Wolters. Method for identifying maize plants with RCG1 gene conferring resistance to colletotrichum infection, 2011.

[21] W Stephan Brunner, Kevin Fengler, Michele Morgante, Scott Tingey, and Antoni Rafalski. Evolution of DNA Sequence Nonhomologies among Maize Inbreds. The Plant Cell Online, 17(2):343–360, 2005.

[22] Carrie C Buchanan, Eric S Torstenson, William S Bush, and Marylyn D Ritchie. A comparison of cataloged variation between International HapMap Consortium and 1000 Genomes Project data. Journal of the American Medical Informatics Association : JAMIA, 19(2):289–94, 2012.

[23] Edward S Buckler, Brandon S Gaut, and Michael D McMullen. Molecular and functional diversity of maize. Curr Opin Plant Biol, 9(2):172–6, 2006.

[24] Robert Bukowski, Xiaosen Guo, Yanli Lu, Cheng Zou, Bing He, Zhengqin Rong, Bo Wang, Dawen Xu, Bicheng Yang, Chuanxiao Xie, Longjiang Fan, Shibin Gao, Xun Xu, Gengyun Zhang, Yingrui Li, Yinping Jiao, John Doebley, Jeffrey Ross- Ibarra, Vince Buffalo, Edward S Buckler, Yunbi Xu, Jinsheng Lai, Doreen Ware, and Qi Sun. Construction of the third generation Zea mays haplotype map. bioRxiv, page 026963, sep 2015.

[25] S. M. Bybee, H. Bracken-Grissom, B. D. Haynes, R. A. Hermansen, R. L. Byers, M. J. Clement, J. A. Udall, E. R. Wilcox, and K. A. Crandall. Targeted Ampli- con Sequencing (TAS): A Scalable Next-Gen Approach to Multilocus, Multitaxa Phylogenetics. Genome Biology and Evolution, 3(0):1312–1323, dec 2011.

[26] P. Canaran, E. S. Buckler, J. C. Glaubitz, L. Stein, Q. Sun, W. Zhao, and D. Ware. Panzea: an update on new content and features. Nucleic Acids Research, 36(Database):D1041–D1043, dec 2007.

[27] Sara Castelletti, Roberto Tuberosa, Massimo Pindo, and Silvio Salvi. A MITE transposon insertion is associated with differential methylation at the maize flowering time QTL Vgt1. G3 (Bethesda, Md.), 4(5):805–12, mar 2014.

[28] Philip A. Chambers, Lucy F. Stead, Joanne E. Morgan, Ian M. Carr, Kate M. Sutton, Christopher M. Watson, Victoria Crowe, Helen Dickinson, Paul Roberts, Clive Mulatero, Matthew Seymour, Alexander F. Markham, Paul M. Waring, Philip Quirke, and Graham R. Taylor. Mutation Detection by Clonal Sequencing of PCR Amplicons and Grouped Read Typing is Applicable to Clinical Diagnostics. Human Mutation, 34(1):248–254, jan 2013.

[29] JC Chen. Dijkstra’s shortest path algorithm. Journal of Formalized Mathematics, 15, 2003.

[30] Jer Ming Chia, Chi Song, Peter J Bradbury, Denise Costich, Natalia de Leon, John Doebley, Robert J Elshire, Brandon Gaut, Laura Geller, Jeffrey C Glaubitz, Michael Gore, Kate E Guill, Jim Holland, Matthew B Hufford, Jinsheng Lai, Meng Li, Xin Liu, Yanli Lu, Richard McCombie, Rebecca Nelson, Jesse Poland, Boddupalli M Prasanna, Tanja Pyhäjärvi, Tingzhao Rong, Rajandeep S Sekhon, Qi Sun, Maud I Tenaillon, Feng Tian, Jun Wang, Xun Xu, Zhiwu Zhang, Shawn M Kaeppler, Jeffrey Ross-Ibarra, Michael D McMullen, Edward S Buckler, Gengyun Zhang, Yunbi Xu, and Doreen Ware. Maize HapMap2 identifies extant variation from a genome in flux. Nature genetics, 44(7):803–807, jul 2012.

[31] Jer-Ming Chia, Chi Song, Peter J Bradbury, Denise Costich, Natalia de Leon, John Doebley, Robert J Elshire, Brandon Gaut, Laura Geller, Jeffrey C Glaubitz, Michael Gore, Kate E Guill, Jim Holland, Matthew B Hufford, Jinsheng Lai, Meng Li, Xin Liu, Yanli Lu, Richard McCombie, Rebecca Nelson, Jesse Poland, Boddupalli M Prasanna, Tanja Pyhäjärvi, Tingzhao Rong, Rajandeep S Sekhon, Qi Sun, Maud I Tenaillon, Feng Tian, Jun Wang, Xun Xu, Zhiwu Zhang, Shawn M Kaeppler, Jeffrey Ross-Ibarra, Michael D McMullen, Edward S Buckler, Gengyun Zhang, Yunbi Xu, and Doreen Ware. Maize HapMap2 identifies extant variation from a genome in flux. Nature Genetics, 44(7):803–807, jun 2012.

[32] Chen-Shan Chin, David H Alexander, Patrick Marks, Aaron A Klammer, James Drake, Cheryl Heiner, Alicia Clum, Alex Copeland, John Huddleston, Evan E Eichler, Stephen W Turner, and Jonas Korlach. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nature methods, 10(6):563–9, jun 2013.

[33] Chia-Lin Chung, Joy M. Longfellow, Ellie K. Walsh, Zura Kerdieh, George Van Esbroeck, Peter Balint-Kurti, and Rebecca J. Nelson. Resistance loci affecting distinct stages of fungal pathogenesis: use of introgression lines for QTL mapping and characterization in the maize - Setosphaeria turcica pathosystem. BMC Plant Biology, 10(1):103, jun 2010.

[34] Francis S. Collins, Eric D. Green, Alan E. Guttmacher, and Mark S. Guyer. A vision for the future of genomics research. Nature, 422(6934):835–847, apr 2003.

[35] D. E. Cook, T. G. Lee, X. Guo, S. Melito, K. Wang, A. M. Bayless, J. Wang, T. J. Hughes, D. K. Willis, T. E. Clemente, B. W. Diers, J. Jiang, M. E. Hudson, and A. F. Bent. Copy Number Variation of Multiple Genes at Rhg1 Mediates Nematode Resistance in Soybean. Science, 338(6111):1206–1209, nov 2012.

[36] Richard T Corlett. A Bigger Toolbox: Biotechnology in Biodiversity Conservation. Trends in Biotechnology, 35(1):55–65, jan 2017.

[37] Richard Cronn, Brian J. Knaus, Aaron Liston, Peter J. Maughan, Matthew Parks, John V. Syring, and Joshua Udall. Targeted enrichment strategies for next-generation plant biology. American Journal of Botany, 99(2):291–311, 2012.

[38] S. M. Cummings, M. McMullan, D. A. Joyce, and C. van Oosterhout. Solutions for PCR, cloning and sequencing errors in population genetic analysis. Conservation Genetics, 11(3):1095–1097, jun 2010.

[39] Johannes Dapprich, Deborah Ferriola, Eleni E. Magira, Mark Kunkel, and Dimitri Monos. SNP-specific extraction of haplotype-resolved targeted genomic regions. Nucleic Acids Research, 36(15):e94, sep 2008.

[40] Aaron E. Darling, Bob Mau, Nicole T. Perna, S Batzoglou, and Y Zhong. progressiveMauve: Multiple Genome Alignment with Gene Gain, Loss and Rearrangement. PLoS ONE, 5(6):e11147, jun 2010.

[41] A. P Jason de Koning, Wanjun Gu, Todd A. Castoe, Mark A. Batzer, and David D. Pollock. Repetitive elements may comprise over Two-Thirds of the human genome. PLoS Genetics, 7(12):e1002384, dec 2011.

[42] E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1(1):269–271, dec 1959.

[43] Hans Ellegren and Nicolas Ellegren. Determinants of genetic diversity. Nature Publishing Group, 17, 2016.

[44] Felix Francis, Michael D Dumas, Scott B Davis, and Randall J Wisser. Clustering of Circular Consensus Sequences: Accurate Error Correction and Assembly of Single Molecule Real-Time Reads from Multiplexed Amplicon Libraries. bioRxiv, page 236893, dec 2017.

[45] Felix Francis, Michael D. Dumas, Randall J. Wisser, D. Schnorr, and S. A. Loening. ThermoAlign: a genome-aware primer design tool for tiled amplicon resequencing. Scientific Reports, 7:44437, mar 2017.

[46] Huihua Fu and Hugo K Dooner. Intraspecific violation of genetic colinearity and its implications in maize. Proceedings of the National Academy of Sciences of the United States of America, 99(14):9573–9578, 2002.

[47] Yan Fu, Nathan M. Springer, Daniel J. Gerhardt, Kai Ying, Cheng Ting Yeh, Wei Wu, Ruth Swanson-Wagner, Mark D’Ascenzo, Tracy Millard, Lindsay Freeberg, Natsuyo Aoyama, Jacob Kitzman, Daniel Burgess, Todd Richmond, Thomas J. Albert, W. Brad Barbazuk, Jeffrey A. Jeddeloh, and Patrick S. Schnable. Repeat subtraction-mediated sequence capture from a complex genome. Plant Journal, 62(5):898–909, jun 2010.

[48] Alain L. Gervais, Maud Marques, and Luc Gaudreau. PCRTiler: Automated design of tiled and specific PCR primer pairs. Nucleic Acids Research, 38(SUPPL. 2):W308–12, jul 2010.

[49] Jc Glaszmann, B Kilian, Hd Upadhyaya, Rk Varshney, Rajeev K Varshney, and Douglas R Cook. Accessing genetic diversity for crop improvement. Current Opinion in Plant Biology, 13:167–173, 2010.

[50] Sara Goodwin, John D. McPherson, and W. Richard McCombie. Coming of age: ten years of next-generation sequencing technologies. Nature Reviews Genetics, 17(6):333–351, may 2016.

[51] A Gordon and G J Hannon. Fastx-toolkit. FASTQ/A short-reads preprocessing tools (unpublished) http://hannonlab.cshl.edu/fastx_toolkit, 2010.

[52] Corrinne E Grover, Armel Salmon, and Jonathan F Wendel. Targeted sequence capture as a powerful tool for evolutionary analysis. American journal of botany, 99(2):312–9, feb 2012.

[53] Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. Exploring network structure, dynamics, and function using NetworkX. Proceedings of the 7th Python in Science Conference (SciPy 2008), (SciPy):11–15, 2008.

[54] Sarah Hake and Jeffrey Ross-Ibarra. Genetic, evolutionary and plant breeding insights from the domestication of maize. eLife, 4:e05861, mar 2015.

[55] H Harris. Enzyme polymorphisms in man. Proceedings of the Royal Society of London. Series B, Biological sciences, 164(995):298–310, mar 1966.

[56] Bernhard Haubold and Thomas Wiehe. How repetitive are genomes? BMC bioinformatics, 7(1):541, jan 2006.

[57] Wolfgang Henke, Kerstin Herdel, Klaus Jung, Dietmar Schnorr, and Stefan A. Loening. Betaine improves the PCR amplification of GC-rich DNA sequences. Nucleic Acids Research, 25(19):3957–3958, oct 1997.

[58] G G Hennessy, W A J De Milliano, and C G Mclaren. Influence of Primary Weather Variables on Sorghum Leaf Blight Severity in Southern Africa.

[59] Carl Maximilian Hommelsheim, Lamprinos Frantzeskakis, Mengmeng Huang, and Bekir Ülker. PCR amplification of repetitive DNA: a limitation to genome editing technologies and many other applications. Scientific reports, 4:5052, jan 2014.

[60] Cheng Huang, Huayue Sun, Dingyi Xu, Qiuyue Chen, Yameng Liang, Xufeng Wang, Guanghui Xu, Jinge Tian, Chenglong Wang, Dan Li, Lishuan Wu, Xiaohong Yang, Weiwei Jin, John F Doebley, and Feng Tian. ZmCCT9 enhances maize adaptation to higher latitudes. Proceedings of the National Academy of Sciences of the United States of America, 115(2):E334–E341, jan 2018.

[61] Matthew B Hufford, Xun Xu, Joost van Heerwaarden, Tanja Pyhäjärvi, Jer-Ming Chia, Reed A Cartwright, Robert J Elshire, Jeffrey C Glaubitz, Kate E Guill, Shawn M Kaeppler, Jinsheng Lai, Peter L Morrell, Laura M Shannon, Chi Song, Nathan M Springer, Ruth A Swanson-Wagner, Peter Tiffin, Jun Wang, Gengyun Zhang, John Doebley, Michael D McMullen, Doreen Ware, Edward S Buckler, Shuang Yang, and Jeffrey Ross-Ibarra. Comparative population genomics of maize domestication and improvement. Nature Genetics, 44(7):808–811, jun 2012.

[62] Severine Hurni, Daniela Scheuermann, Simon G. Krattinger, Bettina Kessel, Thomas Wicker, Gerhard Herren, Mirjam N. Fitze, James Breen, Thomas Presterl, Milena Ouzunova, and Beat Keller. The maize disease resistance gene Htn1 against northern corn leaf blight encodes a wall-associated receptor-like kinase. Proceedings of the National Academy of Sciences, 112(28):8780–8785, jul 2015.

[63] M A Innis. Optimization of PCRs. PCR Protocols: A Guide to Methods and Applications (Innis, MA, Gelfand, DH, Sninsky, JJ and White, TJ, eds.), pp. 21–27, 1990.

[64] Tiffany M. Jamann, Xingyu Luo, Laura Morales, Judith M. Kolkman, Chia-Lin Chung, and Rebecca J. Nelson. A remorin gene is implicated in quantitative disease resistance in maize. Theoretical and Applied Genetics, 129(3):591–602, mar 2016.

[65] Tiffany M. Jamann, Jesse A. Poland, Judith M. Kolkman, Laurie G. Smith, and Rebecca J. Nelson. Unraveling Genomic Complexity at a Quantitative Disease Resistance Locus in Maize. Genetics, 198(1):333–344, 2014.

[66] Tiffany M. Jamann, Shilpa Sood, Randall J. Wisser, and James B. Holland. High-Throughput Resequencing of Maize Landraces at Genomic Regions Associated with Flowering Time. PLOS ONE, 12(1):e0168910, jan 2017.

[67] Xiaoli Jiao, Xin Zheng, Liang Ma, Geetha Kutty, Emile Gogineni, Qiang Sun, Brad T Sherman, Xiaojun Hu, Kristine Jones, Castle Raley, Bao Tran, David J Munroe, Robert Stephens, Dun Liang, Tomozumi Imamichi, Joseph A Kovacs, Richard A Lempicki, and Da Wei Huang. A Benchmark Study on Error Assessment and Quality Control of CCS Reads Derived from the PacBio RS. Journal of data mining in genomics & proteomics, 4(3), jul 2013.

[68] Yinping Jiao, Paul Peluso, Jinghua Shi, Tiffany Liang, Michelle C. Stitzer, Bo Wang, Michael S. Campbell, Joshua C. Stein, Xuehong Wei, Chen-Shan Chin, Katherine Guill, Michael Regulski, Sunita Kumari, Andrew Olson, Jonathan Gent, Kevin L. Schneider, Thomas K. Wolfgruber, Michael R. May, Nathan M. Springer, Eric Antoniou, W. Richard McCombie, Gernot G. Presting, Michael McMullen, Jeffrey Ross-Ibarra, R. Kelly Dawe, Alex Hastie, David R. Rank, and Doreen Ware. Improved maize reference genome with single-molecule technologies. Nature, 546(7659):524, jun 2017.

[69] Bethan M. Jones and Adam B. Kustka. A quantitative SMRT cell sequencing method for ribosomal amplicons. Journal of Microbiological Methods, 135:77–84, apr 2017.

[70] Lauris Kaplinski and Maido Remm. MultiPLX: Automatic grouping and evaluation of PCR primers. Methods in Molecular Biology, 1275(8):127–142, 2015.

[71] Su Kim, Kirk E Lohmueller, Anders Albrechtsen, Yingrui Li, Thorfinn Korneliussen, Geng Tian, Niels Grarup, Tao Jiang, Gitte Andersen, Daniel Witte, Torben Jorgensen, Torben Hansen, Oluf Pedersen, Jun Wang, and Rasmus Nielsen. Estimation of allele frequency and association mapping using next-generation sequencing data. BMC Bioinformatics, 12(1):231, jun 2011.

[72] James J Kozich, Sarah L Westcott, Nielson T Baxter, Sarah K Highlander, and Patrick D Schloss. Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Applied and environmental microbiology, 79(17):5112–20, sep 2013.

[73] W John Kress, Kenneth J Wurdack, Elizabeth A Zimmer, Lee A Weigt, and Daniel H Janzen. Use of DNA barcodes to identify flowering plants. Proceedings of the National Academy of Sciences of the United States of America, 102(23):8369–74, jun 2005.

[74] Sujatha Krishnakumar, Jianbiao Zheng, Julie Wilhelmy, Malek Faham, Michael Mindrinos, and Ronald Davis. A comprehensive assay for targeted multiplex amplification of human DNA sequences. Proceedings of the National Academy of Sciences of the United States of America, 105(27):9296–9301, jul 2008.

[75] Garima Kushwaha, Gyan Prakash Srivastava, and Dong Xu. PRIMEGENSw3: A web-based tool for high-throughput primer and probe design. Methods in Molecular Biology, 1275:181–199, jan 2015.

[76] S Kwok, D E Kellogg, N McKinney, D Spasic, L Goda, C Levenson, and J J Sninsky. Effects of primer-template mismatches on the polymerase chain reaction: human immunodeficiency virus type 1 model studies. Nucleic acids research, 18(4):999–1005, feb 1990.

[77] Eric S. Lander, Lauren M. Linton, Bruce Birren, Chad Nusbaum, Michael C. Zody, Jennifer Baldwin, Keri Devon, Ken Dewar, Michael Doyle, William FitzHugh, Roel Funke, Diane Gage, Katrina Harris, Andrew Heaford, John Howland, Lisa Kann, Jessica Lehoczky, Rosie LeVine, Paul McEwan, Kevin McKernan, James Meldrim, Jill P. Mesirov, Cher Miranda, William Morris, Jerome Naylor, Christina Raymond, Mark Rosetti, Ralph Santos, Andrew Sheridan, Carrie Sougnez, Nicole Stange-Thomann, Nikola Stojanovic, Aravind Subramanian, Dudley Wyman, Jane Rogers, John Sulston, Rachael Ainscough, Stephan Beck, David Bentley, John Burton, Christopher Clee, Nigel Carter, Alan Coulson, Rebecca Deadman, Panos Deloukas, Andrew Dunham, Ian Dunham, Richard Durbin, Lisa French, Darren Grafham, Simon Gregory, Tim Hubbard, Sean Humphray, Adrienne Hunt, Matthew Jones, Christine Lloyd, Amanda McMurray, Lucy Matthews, Simon Mercer, Sarah Milne, James C. Mullikin, Andrew Mungall, Robert Plumb, Mark Ross, Ratna Shownkeen, Sarah Sims, Robert H. Waterston, Richard K. Wilson, LaDeana W. Hillier, John D. McPherson, Marco A. Marra, Elaine R. Mardis, Lucinda A. Fulton, Asif T. Chinwalla, Kymberlie H. Pepin, Warren R. Gish, Stephanie L. Chissoe, Michael C. Wendl, Kim D. Delehaunty, Tracie L. Miner, Andrew Delehaunty, Jason B. Kramer, Lisa L. Cook, Robert S. Fulton, Douglas L. Johnson, Patrick J. Minx, Sandra W. Clifton, Trevor Hawkins, Elbert Branscomb, Paul Predki, Paul Richardson, Sarah Wenning, Tom Slezak, Norman Doggett, Jan-Fang Cheng, Anne Olsen, Susan Lucas, Christopher Elkin, Edward Uberbacher, Marvin Frazier, Richard A. Gibbs, Donna M. Muzny, Steven E. Scherer, John B. Bouck, Erica J. Sodergren, Kim C. Worley, Catherine M. Rives, James H. Gorrell, Michael L. Metzker, Susan L. Naylor, Raju S. Kucherlapati, David L. Nelson, George M. Weinstock, Yoshiyuki Sakaki, Asao Fujiyama, Masahira Hattori, Tetsushi Yada, Atsushi Toyoda, Takehiko Itoh, Chiharu Kawagoe, Hidemi Watanabe, Yasushi Totoki, Todd Taylor, Jean Weissenbach, Roland Heilig, William Saurin, Francois Artiguenave, Philippe Brottier, Thomas Bruls, Eric Pelletier, Catherine Robert, Patrick Wincker, André Rosenthal, Matthias Platzer, Gerald Nyakatura, Stefan Taudien, Andreas Rump, Douglas R. Smith, Lynn Doucette-Stamm, Marc Rubenfield, Keith Weinstock, Hong Mei Lee, JoAnn Dubois, Huanming Yang, Jun Yu, Jian Wang, Guyang Huang, Jun Gu, Leroy Hood, Lee Rowen, Anup Madan, Shizen Qin, Ronald W. Davis, Nancy A. Federspiel, A. Pia Abola, Michael J. Proctor, Bruce A. Roe, Feng Chen, Huaqin Pan, Juliane Ramser, Hans Lehrach, Richard Reinhardt, W. Richard McCombie, Melissa de la Bastide, Neilay Dedhia, Helmut Blöcker, Klaus Hornischer, Gabriele Nordsiek, Richa Agarwala, L. Aravind, Jeffrey A. Bailey, Serafim Batzoglou, Peer Bork, Daniel G. Brown, Christopher B. Burge, Lorenzo Cerutti, Hsiu-Chuan Chen, Deanna Church, Michele Clamp, Richard R. Copley, Tobias Doerks, Sean R. Eddy, Evan E. Eichler, Terrence S. Furey, James Galagan, James G. R. Gilbert, Cyrus Harmon, Yoshihide Hayashizaki, David Haussler, Henning Hermjakob, Karsten Hokamp, Wonhee Jang, L. Steven Johnson, Thomas A. Jones, Simon Kasif, Arek Kaspryzk, Scot Kennedy, W. James Kent, Paul Kitts, Eugene V. Koonin, Ian Korf, David Kulp, Doron Lancet, Todd M. Lowe, Aoife McLysaght, Tarjei Mikkelsen, John V. Moran, Nicola Mulder, Victor J. Pollara, Chris P. Ponting, Greg Schuler, Jörg Schultz, Guy Slater, Arian F. A. Smit, Elia Stupka, Joseph Szustakowki, Danielle Thierry-Mieg, Jean Thierry-Mieg, Lukas Wagner, John Wallis, Raymond Wheeler, Alan Williams, Yuri I. Wolf, Kenneth H. Wolfe, Shiaw-Pyng Yang, Ru-Fang Yeh, Francis Collins, Mark S. Guyer, Jane Peterson, Adam Felsenfeld, Kris A. Wetterstrand, Richard M. Myers, Jeremy Schmutz, Mark Dickson, Jane Grimwood, David R. Cox, Maynard V. Olson, Rajinder Kaul, Christopher Raymond, Nobuyoshi Shimizu, Kazuhiko Kawasaki, Shinsei Minoshima, Glen A. Evans, Maria Athanasiou, Roger Schultz, Aristides Patrinos, and Michael J. Morgan. Initial sequencing and analysis of the human genome. Nature, 409(6822):860–921, feb 2001.

[78] Ellen M. Leffler, Kevin Bullaughey, Daniel R. Matute, Wynn K. Meyer, Laure Ségurel, Aarti Venkat, Peter Andolfatto, and Molly Przeworski. Revisiting an Old Riddle: What Determines Genetic Diversity Levels within Species? PLoS Biology, 10(9):e1001388, sep 2012.

[79] Aaron R Leichty and Dustin Brisson. Selective whole genome amplification for resequencing target microbial species from complex natural samples. Genetics, 198(2):473–481, oct 2014.

[80] Heng Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. mar 2013.

[81] Mingkun Li, Roland Schroeder, Albert Ko, and Mark Stoneking. Fidelity of capture-enrichment for mtDNA genome sequencing: Influence of NUMTs. Nucleic Acids Research, 40(18):e137, oct 2012.

[82] Pierre Lindenbaum. JVarkit: java-based utilities for Bioinformatics. jan 2015.

[83] Pim Lindhout. The perspectives of polygenic resistance in breeding for durable disease resistance. Euphytica, 124(2):217–226, 2002.

[84] Po-Ru Loh, Gaurav Bhatia, Alexander Gusev, Hilary K Finucane, Brendan K Bulik-Sullivan, Samuela J Pollack, Teresa R de Candia, Sang Hong Lee, Naomi R Wray, Kenneth S Kendler, Michael C O, Benjamin M Neale, Nick Patterson, and Alkes L Price. Contrasting genetic architectures of schizophrenia and other complex diseases using fast variance-components analysis. Nature Publishing Group, 47(12), 2015.

[85] Fei Lu, Maria C Romay, Jeffrey C Glaubitz, Peter J Bradbury, Robert J Elshire, Tianyu Wang, Yu Li, Yongxiang Li, Kassa Semagn, Xuecai Zhang, Alvaro G Hernandez, Mark A Mikel, Ilya Soifer, Omer Barad, and Edward S Buckler. High-resolution genetic mapping of maize pan-genome sequence anchors. Nature communications, 6:6914, apr 2015.

[86] Michael Lynch, Matthew S. Ackerman, Jean-Francois Gout, Hongan Long, Way Sung, W. Kelley Thomas, and Patricia L. Foster. Genetic drift, selection and the evolution of the mutation rate. Nature Reviews Genetics, 17(11):704–714, oct 2016.

[87] Tarah Lynch, Aaron Petkau, Natalie Knox, Morag Graham, and Gary Van Domselaar. A Primer on Infectious Disease Bacterial Genomics. Clinical microbiology reviews, 29(4):881–913, oct 2016.

[88] Zhaorong Ma and Michael J. Axtell. Long-range genomic enrichment, sequencing, and assembly to determine unknown sequences flanking a known microRNA. PLoS ONE, 8(12):e83721, 2013.

[89] Trudy F. C. Mackay, Eric A. Stone, and Julien F. Ayroles. The genetics of quantitative traits: challenges and prospects. Nature Reviews Genetics, 10(8):565–577, aug 2009.

[90] Lira Mamanova, Alison J Coffey, Carol E Scott, Iwanka Kozarewa, Emily H Turner, Akash Kumar, Eleanor Howard, Jay Shendure, and Daniel J Turner. Target-enrichment strategies for next-generation sequencing. Nature Methods, 7(2):111–118, feb 2010.

[91] Tobias Mann, Richard Humbert, Michael Dorschner, John Stamatoyannopoulos, and William Stafford Noble. A thermodynamic approach to PCR primer design. Nucleic Acids Research, 37(13):e95, jul 2009.

[92] Teri A. Manolio, Francis S. Collins, Nancy J. Cox, David B. Goldstein, Lucia A. Hindorff, David J. Hunter, Mark I. McCarthy, Erin M. Ramos, Lon R. Cardon, Aravinda Chakravarti, Judy H. Cho, Alan E. Guttmacher, Augustine Kong, Leonid Kruglyak, Elaine Mardis, Charles N. Rotimi, Montgomery Slatkin, David Valle, Alice S. Whittemore, Michael Boehnke, Andrew G. Clark, Evan E. Eichler, Greg Gibson, Jonathan L. Haines, Trudy F. C. Mackay, Steven A. McCarroll, and Peter M. Visscher. Finding the missing heritability of complex diseases. Nature, 461(7265):747–753, oct 2009.

[93] Lyza G Maron, Claudia T Guimarães, Matias Kirst, Patrice S Albert, James A Birchler, Peter J Bradbury, Edward S Buckler, Alison E Coluccio, Tatiana V Danilova, David Kudrna, Jurandir V Magalhaes, Miguel A Piñeros, Michael C Schatz, Rod A Wing, and Leon V Kochian. Aluminum tolerance in maize is associated with higher MATE1 gene copy number. Proceedings of the National Academy of Sciences of the United States of America, 110(13):5241–6, mar 2013.

[94] William McLaren, Laurent Gil, Sarah E Hunt, Harpreet Singh Riat, Graham R S Ritchie, Anja Thormann, Paul Flicek, and Fiona Cunningham. The Ensembl Variant Effect Predictor.

[95] M. D. McMullen, S. Kresovich, H. S. Villeda, P. Bradbury, H. Li, Q. Sun, S. Flint-Garcia, J. Thornsberry, C. Acharya, C. Bottoms, P. Brown, C. Browne, M. Eller, K. Guill, C. Harjes, D. Kroon, N. Lepak, S. E. Mitchell, B. Peterson, G. Pressoir, S. Romero, M. O. Rosas, S. Salvo, H. Yates, M. Hanson, E. Jones, S. Smith, J. C. Glaubitz, M. Goodman, D. Ware, J. B. Holland, and E. S. Buckler. Genetic Properties of the Maize Nested Association Mapping Population. Science, 325(5941):737–740, aug 2009.

[96] Cliff Meldrum, Maria a Doyle, and Richard W Tothill. Next-Generation Sequencing for Cancer Diagnostics: a Practical Perspective. The Clinical Biochemist Reviews, 32(4):177–195, nov 2011.

[97] Todd P Michael and Robert Vanburen. Progress, challenges and the future of crop genomes. Current Opinion in Plant Biology, 24:71–81, 2015.

[98] Fumihito Miura, Chihiro Uematsu, Yoshiyuki Sakaki, and Takashi Ito. A novel strategy to design highly specific PCR primers based on the stability and uniqueness of 3’-end subsequences. Bioinformatics, 21(24):4363–4370, dec 2005.

[99] Michele Morgante, Stephan Brunner, Giorgio Pea, Kevin Fengler, Andrea Zuccolo, and Antoni Rafalski. Gene duplication and exon shuffling by helitron-like transposons generate intraspecies diversity in maize. Nature Genetics, 37(9):997–1002, sep 2005.

[100] Daren S. Mueller, Kiersten A. Wise, Adam J. Sisson, Tom W. Allen, Gary C. Bergstrom, D. Bruce Bosley, Carl A. Bradley, Kirk D. Broders, Emmanuel Byamukama, Martin I. Chilvers, Alyssa Collins, Travis R. Faske, Andrew J. Friskop, Ron W. Heiniger, Clayton A. Hollier, David C. Hooker, Tom Isakeit, Tamra A. Jackson-Ziems, Douglas J. Jardine, Kasia Kinzer, Steve R. Koenning, Dean K. Malvick, Marcia McMullen, Ron F. Meyer, Pierce A. Paul, Alison E. Robertson, Gregory W. Roth, Damon L. Smith, Connie A. Tande, Albert U. Tenuta, Paul Vincelli, and Fred Warner. Corn Yield Loss Estimates Due to Diseases in the United States and Ontario, Canada from 2012 to 2015. Plant Health Progress, 2016.

[101] Sushma Naithani, Matthew Geniza, and Pankaj Jaiswal. Variant Effect Prediction Analysis Using Resources Available at Gramene Database. pages 279–297. Humana Press, New York, NY, 2017.

[102] Azeet Narayan, Nicholas J Carriero, Scott N Gettinger, Jeannie Kluytenaar, Kevin R Kozak, Torunn I Yock, Nicole E Muscato, Pedro Ugarelli, Roy H Decker, and Abhijit A Patel. Ultrasensitive measurement of hotspot mutations in tumor DNA in blood using error-suppressed multiplexed deep sequencing. Cancer research, 72(14):3492–8, jul 2012.

[103] P. Narayanasamy. Crop Disease Management. In Microbial Plant Pathogens, pages 224–305. John Wiley & Sons, Ltd, Chichester, UK, feb 2017.

[104] David B Neale, Jill L Wegrzyn, Kristian A Stevens, Aleksey V Zimin, Daniela Puiu, Marc W Crepeau, Charis Cardeno, Maxim Koriabine, Ann E Holtz-Morris, John D Liechty, Pedro J Martínez-García, Hans A Vasquez-Gross, Brian Y Lin, Jacob J Zieve, William M Dougherty, Sara Fuentes-Soriano, Le-Shin Wu, Don Gilbert, Guillaume Marçais, Michael Roberts, Carson Holt, Mark Yandell, John M Davis, Katherine E Smith, Jeffrey F D Dean, W Walter Lorenz, Ross W Whetten, Ronald Sederoff, Nicholas Wheeler, Patrick E McGuire, Doreen Main, Carol A Loopstra, Keithanne Mockaitis, Pieter J DeJong, James A Yorke, Steven L Salzberg, and Charles H Langley. Decoding the massive genome of loblolly pine using haploid DNA and novel assembly strategies. Genome biology, 15(3):R59, jan 2014.

[105] M Nei and W H Li. Mathematical model for studying genetic variation in terms of restriction endonucleases. Proceedings of the National Academy of Sciences of the United States of America, 76(10):5269–73, oct 1979.

[106] Rebecca Nelson, Tyr Wiesner-Hanks, Randall Wisser, and Peter Balint-Kurti. Navigating complexity to breed disease-resistant crops. Nature Reviews Genetics, 19(1):21–33, nov 2017.

[107] L. Noe and G. Kucherov. YASS: enhancing the sensitivity of DNA similarity search. Nucleic Acids Research, 33(Web Server):W540–W543, jul 2005.

[108] Michael Nothnagel, Alexander Herrmann, Andreas Wolf, Stefan Schreiber, Matthias Platzer, Reiner Siebert, Michael Krawczak, and Jochen Hampe. Technology-specific error signatures in the 1000 Genomes Project data. Human Genetics, 130(4):505–516, oct 2011.

[109] Pacific Biosciences. Consensus Tools (https://github.com/PacificBiosciences/SMRT-Analysis/wiki/ConsensusTools-v2.3.0-Documentation. Accessed: 20, July, 2017.).

[110] Pacific Biosciences. Shared Protocol Amplicon Template Preparation and Sequencing General Workflow for Amplicon Sample Preparation and Sequencing. Technical report.

[111] Pacific Biosciences. SMRT-Analysis (https://github.com/PacificBiosciences/SMRT-Analysis. Accessed: 20, July, 2017.).

[112] E. Paradis. pegas: an R package for population genetics with an integrated-modular approach. Bioinformatics, 26(3):419–420, feb 2010.

[113] Jaume Pellicer, Michael F Fay Fls, and Ilia J Leitch Fls. The largest eukaryotic genome of them all? Botanical Journal of the Linnean Society, 164:10–15, 2010.

[114] Nicolas Peyret, P. Ananda Seneviratne, Hatim T. Allawi, and John SantaLucia. Nearest-neighbor thermodynamics and NMR of DNA sequences with internal A/A, C/C, G/G, and T/T mismatches. Biochemistry, 38(12):3468–3477, mar 1999.

[115] Joseph K Pickrell, Tomaz Berisa, Jimmy Z Liu, Laure Segurel, Joyce Y Tung, and David Hinds. Detection and interpretation of shared genetic influences on 42 human traits.

[116] Jesse A Poland, Peter J Bradbury, Edward S Buckler, and Rebecca J Nelson. Genome-wide nested association mapping of quantitative resistance to northern leaf blight in maize. Proceedings of the National Academy of Sciences of the United States of America, 108(17):6893–8, apr 2011.

[117] Wubin Qu, Yang Zhou, Yanchun Zhang, Yiming Lu, Xiaolei Wang, Dongsheng Zhao, Yi Yang, and Chenggang Zhang. MFEprimer-2.0: A fast thermodynamics-based program for checking PCR primer specificity. Nucleic Acids Research, 40(W1):W205–8, jul 2012.

[118] A. R. Quinlan and I. M. Hall. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6):841–842, mar 2010.

[119] Lluís Quintana-Murci and Andrew G. Clark. Population genetic tools for dissecting innate immunity in humans. Nature Reviews Immunology, 13(4):280–293, mar 2013.

[120] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2017.

[121] William A. Rees, Thomas D. Yager, John Korte, and P H von Hippel. Betaine can eliminate the base pair composition dependence of DNA melting. Biochemistry, 32(1):137–144, jan 1993.

[122] Kimberly Robasky, Nathan E. Lewis, and George M. Church. The role of replicates for error mitigation in next-generation sequencing. Nature Reviews Genetics, 15(1):56–62, dec 2013.

[123] Maria C Romay, Mark J Millard, Jeffrey C Glaubitz, Jason a Peiffer, Kelly L Swarts, Terry M Casstevens, Robert J Elshire, Charlotte B Acharya, Sharon E Mitchell, Sherry a Flint-Garcia, Michael D McMullen, James B Holland, Edward S Buckler, and Candice a Gardner. Comprehensive genotyping of the USA national maize inbred seed bank. Genome biology, 14(6):R55, 2013.

[124] Michael G Ross, Carsten Russ, Maura Costello, Andrew Hollinger, Niall J Lennon, Ryan Hegarty, Chad Nusbaum, and David B Jaffe. Characterizing and measuring bias in sequence data. Genome Biology, 14(5):R51, 2013.

[125] Kirill Rotmistrovsky, Wonhee Jang, and Gregory D. Schuler. A web server for performing electronic PCR. Nucleic Acids Research, 32(WEB SERVER ISS.):W108–12, jul 2004.

[126] J SantaLucia. A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proceedings of the National Academy of Sciences of the United States of America, 95(4):1460–5, 1998.

[127] John SantaLucia and Donald Hicks. The thermodynamics of DNA structural motifs. Annual review of biophysics and biomolecular structure, 33:415–40, 2004.

[128] Richa Saxena, Benjamin F. Voight, Valeriya Lyssenko, Noël P. Burtt, Paul I. W. de Bakker, Hong Chen, Jeffrey J. Roix, Sekar Kathiresan, Joel N. Hirschhorn, Mark J. Daly, Thomas E. Hughes, Leif Groop, David Altshuler, Peter Almgren, Jose C. Florez, Joanne Meyer, Kristin Ardlie, Kristina Bengtsson Boström, Bo Isomaa, Guillaume Lettre, Ulf Lindblad, Helen N. Lyon, Olle Melander, Christopher Newton-Cheh, Peter Nilsson, Marju Orho-Melander, Lennart Råstam, Elizabeth K. Speliotes, Marja-Riitta Taskinen, Tiinamaija Tuomi, Candace Guiducci, Anna Berglund, Joyce Carlson, Lauren Gianniny, Rachel Hackett, Liselotte Hall, Johan Holmkvist, Esa Laurila, Marketa Sjögren, Maria Sterner, Aarti Surti, Margareta Svensson, Malin Svensson, Ryan Tewhey, Brendan Blumenstiel, Melissa Parkin, Matthew DeFelice, Rachel Barry, Wendy Brodeur, Jody Camarata, Nancy Chia, Mary Fava, John Gibbons, Bob Handsaker, Claire Healy, Kieu Nguyen, Casey Gates, Carrie Sougnez, Diane Gage, Marcia Nizzari, Stacey B. Gabriel, Gung-Wei Chirn, Qicheng Ma, Hemang Parikh, Delwood Richardson, Darrell Ricke, and Shaun Purcell. Genome-Wide Association Analysis Identifies Loci for Type 2 Diabetes and Triglyceride Levels. Science, 316(5829), 2007.

[129] Michael C Schatz, Jan Witkowski, and W Richard McCombie. Current challenges in de novo plant genome sequencing and assembly. Genome Biology, 13(4):243, apr 2012.

[130] P. S. Schnable, D. Ware, R. S. Fulton, J. C. Stein, F. Wei, S. Pasternak, C. Liang, J. Zhang, L. Fulton, T. A. Graves, P. Minx, A. D. Reily, L. Courtney, S. S. Kruchowski, C. Tomlinson, C. Strong, K. Delehaunty, C. Fronick, B. Courtney, S. M. Rock, E. Belter, F. Du, K. Kim, R. M. Abbott, M. Cotton, A. Levy, P. Marchetto, K. Ochoa, S. M. Jackson, B. Gillam, W. Chen, L. Yan, J. Higginbotham, M. Cardenas, J. Waligorski, E. Applebaum, L. Phelps, J. Falcone, K. Kanchi, T. Thane, A. Scimone, N. Thane, J. Henke, T. Wang, J. Ruppert, N. Shah, K. Rotter, J. Hodges, E. Ingenthron, M. Cordes, S. Kohlberg, J. Sgro, B. Delgado, K. Mead, A. Chinwalla, S. Leonard, K. Crouse, K. Collura, D. Kudrna, J. Currie, R. He, A. Angelova, S. Rajasekar, T. Mueller, R. Lomeli, G. Scara, A. Ko, K. Delaney, M. Wissotski, G. Lopez, D. Campos, M. Braidotti, E. Ashley, W. Golser, H. Kim, S. Lee, J. Lin, Z. Dujmic, W. Kim, J. Talag, A. Zuccolo, C. Fan, A. Sebastian, M. Kramer, L. Spiegel, L. Nascimento, T. Zutavern, B. Miller, C. Ambroise, S. Muller, W. Spooner, A. Narechania, L. Ren, S. Wei, S. Kumari, B. Faga, M. J. Levy, L. McMahan, P. Van Buren, M. W. Vaughn, K. Ying, C.-T. Yeh, S. J. Emrich, Y. Jia, A. Kalyanaraman, A.-P. Hsia, W. B. Barbazuk, R. S. Baucom, T. P. Brutnell, N. C. Carpita, C. Chaparro, J.-M. Chia, J.-M. Deragon, J. C. Estill, Y. Fu, J. A. Jeddeloh, Y. Han, H. Lee, P. Li, D. R. Lisch, S. Liu, Z. Liu, D. H. Nagel, M. C. McCann, P. SanMiguel, A. M. Myers, D. Nettleton, J. Nguyen, B. W. Penning, L. Ponnala, K. L. Schneider, D. C. Schwartz, A. Sharma, C. Soderlund, N. M. Springer, Q. Sun, H. Wang, M. Waterman, R. Westerman, T. K. Wolfgruber, L. Yang, Y. Yu, L. Zhang, S. Zhou, Q. Zhu, J. L. Bennetzen, R. K. Dawe, J. Jiang, N. Jiang, G. G. Presting, S. R. Wessler, S. Aluru, R. A. Martienssen, S. W. Clifton, W. R. McCombie, R. A. Wing, and R. K. Wilson. The B73 Maize Genome: Complexity, Diversity, and Dynamics. Science, 326(5956):1112–1115, nov 2009.

[131] Gregory D Schuler. Sequence mapping by electronic PCR. Genome Research, 7(5):541–550, may 1997.

[132] Huwenbo Shi, Gleb Kichaev, and Bogdan Pasaniuc. Contrasting the Genetic Architecture of 30 Complex Traits from Summary Association Data. The American Journal of Human Genetics, 99(1):139–153, jul 2016.

[133] Fabian Sievers, Andreas Wilm, David Dineen, Toby J Gibson, Kevin Karplus, Weizhong Li, Rodrigo Lopez, Hamish McWilliam, Michael Remmert, Johannes Söding, Julie D Thompson, and Desmond G Higgins. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular systems biology, 7:539, oct 2011.

[134] Shanya Sivakumaran, Felix Agakov, Evropi Theodoratou, James G Prendergast, Lina Zgaga, Teri Manolio, Igor Rudan, Paul McKeigue, James F Wilson, and Harry Campbell. Abundant pleiotropy in human complex diseases and traits. American journal of human genetics, 89(5):607–18, nov 2011.

[135] Daniel D Sommer, Arthur L Delcher, Steven L Salzberg, and Mihai Pop. Minimus: a fast, lightweight genome assembler.

[136] Gyan Prakash Srivastava, Mamatha Hanumappa, Garima Kushwaha, Henry T. Nguyen, and Dong Xu. Homolog-specific PCR primer design for profiling splice variants. Nucleic Acids Research, 39(10):e69, may 2011.

[137] Gyan Prakash Srivastava and Dong Xu. Genome-Scale Probe and Primer Design with PRIMEGENS. pages 159–175. 2007.

[138] M. Stanke, R. Steinkamp, S. Waack, and B. Morgenstern. AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Research, 32(Web Server):W309–W312, jul 2004.

[139] David J. States, Warren Gish, and Stephen F. Altschul. Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. Methods, 3(1):66–70, aug 1991.

[140] Anthony Studer, Qiong Zhao, Jeffrey Ross-Ibarra, and John Doebley. Identification of a functional transposon insertion in the maize domestication gene tb1. Nature Genetics, 43(11):1160, sep 2011.

[141] M. I. Tenaillon, M. C. Sawkins, A. D. Long, R. L. Gaut, J. F. Doebley, and B. S. Gaut. Patterns of DNA sequence polymorphism along chromosome 1 of maize (Zea mays ssp. mays L.). Proceedings of the National Academy of Sciences, 98(16):9161–9166, jul 2001.

[142] M. I. Tenaillon, Jana U’Ren, Olivier Tenaillon, and Brandon S Gaut. Selection Versus Demography: A Multilocus Investigation of the Domestication Process in Maize. Molecular Biology and Evolution, 21(7):1214–1225, mar 2004.

[143] K. J. Travers, C.-S. Chin, D. R. Rank, J. S. Eid, and S. W. Turner. A flexible and efficient template format for circular consensus sequencing and SNP detection. Nucleic Acids Research, 38(15):e159–e159, aug 2010.

[144] Andreas Untergasser, Ioana Cutcutache, Triinu Koressaar, Jian Ye, Brant C. Faircloth, Maido Remm, and Steven G. Rozen. Primer3-new capabilities and interfaces. Nucleic Acids Research, 40(15):e115, aug 2012.

[145] Yves Vigouroux, Jeffrey C. Glaubitz, Yoshihiro Matsuoka, Major M. Goodman, Jesús Sánchez G., and John Doebley. Population structure and genetic diversity of New World maize races assessed by DNA microsatellites. American Journal of Botany, 95(10):1240–1253, 2008.

[146] N von Ahsen, C T Wittwer, and E Schütz. Oligonucleotide melting temperatures under PCR conditions: nearest-neighbor corrections for Mg(2+), deoxynucleotide triphosphate, and dimethyl sulfoxide concentrations with comparison to alternative empirical formulas. Clinical chemistry, 47(11):1956–61, nov 2001.

[147] Qinghua Wang and Hugo K Dooner. Remarkable variation in maize genome structure inferred from haplotype diversity at the bz locus. Proceedings of the National Academy of Sciences of the United States of America, 103(47):17644–9, 2006.

[148] Detlef Weigel and Richard Mott. The 1001 genomes project for Arabidopsis thaliana. Genome biology, 10(5):107, 2009.

[149] S E White and J F Doebley. The molecular evolution of terminal ear1, a regulatory gene in the genus Zea. Genetics, 153(3):1455–62, nov 1999.

[150] Randall J. Wisser, Peter J. Balint-Kurti, and Rebecca J. Nelson. The Genetic Architecture of Disease Resistance in Maize: A Synthesis of Published Studies. Phytopathology, 96(2):120–129, feb 2006.

[151] Weixun Wu, Xiao-Ming Zheng, Guangwen Lu, Zhengzheng Zhong, He Gao, Liping Chen, Chuanyin Wu, Hong-Jun Wang, Qi Wang, Kunneng Zhou, Jiu-Lin Wang, Fuqing Wu, Xin Zhang, Xiuping Guo, Zhijun Cheng, Cailin Lei, Qibing Lin, Ling Jiang, Haiyang Wang, Song Ge, and Jianmin Wan. Association of functional nucleotide polymorphisms at DTH2 with the northward expansion of rice cultivation in Asia. Proceedings of the National Academy of Sciences of the United States of America, 110(8):2775–80, feb 2013.

[152] Tomoyuki Yamada, Haruhiko Soma, and Shinichi Morishita. PrimerStation: A highly specific multiplex genomic PCR primer design server for the human genome. Nucleic Acids Research, 34(Web Server issue):W665–W669, jul 2006.

[153] Qin Yang, Zhi Li, Wenqiang Li, Lixia Ku, Chao Wang, Jianrong Ye, Kun Li, Ning Yang, Yipu Li, Tao Zhong, Jiansheng Li, Yanhui Chen, Jianbing Yan, Xiaohong Yang, and Mingliang Xu. CACTA-like transposable element in ZmCCT attenuated photoperiod sensitivity and accelerated the postdomestication spread of maize. Proceedings of the National Academy of Sciences of the United States of America, 110(42):16969–74, oct 2013.

[154] Yao Yang, Robert Sebra, Benjamin S Pullman, Wanqiong Qiao, Inga Peter, Robert J Desnick, C Ronald Geyer, John F DeCoteau, and Stuart A Scott. Quantitative and multiplexed DNA methylation analysis using long-read single-molecule real-time bisulfite sequencing (SMRT-BS). BMC Genomics, 16, 2015.

[155] J. Ye, G. Coulouris, I. Zaretskaya, I. Cutcutache, S. Rozen, and T. L. Madden. Primer-BLAST: A tool to design target-specific primers for polymerase chain reaction. BMC Bioinformatics, 13(1):134, 2012.

[156] Nouri Ben Zakour, Michel Gautier, Rumen Andonov, Dominique Lavenier, Marie Françoise Cochet, Philippe Veber, Alexei Sorokin, and Yves Le Loir. GenoFrag: Software to design primers optimized for whole genome scanning by long-range PCR amplification. Nucleic Acids Research, 32(1):17–24, jan 2004.

[157] Liqing Zhang, Andrew S Peek, Detiger Dunams, and Brandon S Gaut. Population genetics of duplicated disease-defense genes, hm1 and hm2, in maize (Zea mays ssp. mays L.) and its wild ancestor (Zea mays ssp. parviglumis). Genetics, 162(2):851–60, oct 2002.

[158] Xiaojing Zhang, Karen W Davenport, Wei Gu, Hajnalka E Daligault, A Christine Munk, Hazuki Tashima, Krista Reitenga, Lance D Green, and Cliff S Han. Improving genome assemblies by sequencing PCR products with PacBio. BioTechniques, 53(1):61–2, jul 2012.

[159] Lecong Zhou and Jason A Holliday. Targeted enrichment of the black cottonwood (Populus trichocarpa) gene space using sequence capture. BMC Genomics, 13(1):703, dec 2012.

Appendix A

SUPPLEMENTARY INFORMATION FOR THERMOALIGN: A GENOME-AWARE PRIMER DESIGN TOOL FOR STANDARD PCR AND TILED AMPLICON RESEQUENCING

Supplementary Tables

Table A.1: Comparison of ThermoAlign to related primer design tools.

Specificity evaluation

Name | Primer design¹ | Probe design | Genome aware | NN model² | Full-length alignment³ | Off-site primer pair | Search algorithm | Amplicon tiling path | Multiplex grouping | Lab validation | Reference⁴
Primer3 | Y (C) | Y (C) | N | N | Y | N | BLAST; Smith-Waterman | N | N | N | Untergasser et al., 2012 [144], a
GENOMEMASKER | Y (S) | N | Y | N | N | Y | k-mer counts; e-PCR | N | N | N | Andreson et al., 2006 [9], a
MFEprimer-2.0 | N | N | Y | Y | N | Y | Indexed database search | N | N | N | Qu et al., 2012 [117], a
PRIMEGENSw3 | Y (C,S) | Y (C,S) | Y | Y | N | Y | MegaBLAST | N | N | Y | Srivastava et al., 2011 [136], a; Kushwaha et al., 2015 [75], b
Primer-BLAST | Y (S) | Y (S) | Y | N | Y | Y | BLASTn; Needleman-Wunsch | N | N | N | Ye et al., 2012 [155], a
SDSS | Y (C) | N | Y | Y | N | N | Indexed database search | N | N | Y | Miura et al., 2005 [98], a; Yamada et al., 2006 [152], b; Mann et al., 2009 [91], b
in silico PCR | N | N | Y | N | Y | Y | Indexed database search | N | N | N | http://genome.ucsc.edu/cgi-bin/hgPcr, b
e-PCR | N | N | Y | N | N | Y | Indexed database search | N | N | N | Schuler, 1997 [131], a; Rotmistrovsky et al., 2004 [125], b
PCRTiler | Y (S) | N | Y | N | Y | Y | BLASTn | Y (C) | N | Y | Gervais et al., 2010 [48], a
ThermoAlign | Y (C) | N | Y | Y | Y | N | BLASTn; thermoalignment | Y (C) | Y (S) | Y | https://github.com/drmaize/ThermoAlign, b

¹ Y: yes; N: no; C: custom approach; S: relies on a secondary tool.
² NN model: nearest neighbor thermodynamic model.
³ GENOMEMASKER, MFEprimer-2.0, PRIMEGENSw3, SDSS and e-PCR initiate the search specificity using a subsequence.
⁴ For citations see the bibliography for the paper. Citations are highlighted as either an (a) original publication or a (b) software resource for the method or tool.

Table A.2: Effects of the amplicon size range parameter on the minimum tiling path primer design for the 24 kb target region described in the main text.

Amplicon size range (kb)

Metric | 0.1-0.5 | 0.1-1.0 | 0.1-5.0 | 5-10 | 10-15 | 15-20 | 20-24
Number of possible primer pairs | 1,463 | 3,505 | 400 | 348 | 81 | 0 | 0
Minimum amplicon length (bp) | 100 | 100 | 100 | 5,000 | 10,043 | 0 | 0
Maximum amplicon length (bp) | 496 | 990 | 5,000 | 9,603 | 14,839 | 0 | 0
Mean amplicon length (bp) | 280 | 534 | 2,076 | 6,237 | 13,047 | 0 | 0
Median amplicon length (bp) | 301 | 587 | 1,435 | 6,323 | 12,853 | 0 | 0
Cumulative amplicon coverage (bp) | 2,699 | 4,530 | 8,566 | 14,840 | 14,840 | 0 | 0
Percent coverage | ≈11.3% | ≈18.9% | ≈35.7% | ≈61.8% | ≈61.8% | 0 | 0
Number of subnetworks | 5 | 3 | 1 | 1 | 1 | 0 | 0
Total number of primer pairs | 9 | 9 | 4 | 3 | 1 | 0 | 0
Percent coverage (A/T-end filter off)¹ | ≈13.8% | ≈19.8% | ≈36.6% | ≈61.8% | ≈88.7% | ≈82.3% | 0

¹ Coverage based on primer pairs designed when excluding the A/T-end filter as described in the results section on UOD.

[Figure A.1 panels. (A) Primer features; input: 82,520 primers; filters: primer Tm, repeating nucleotides, primer GC, 3' A/T-end, GC clamp. (B) Primer interactions; input (from A): 7,447 primers; filters: exact match, hairpin Tm, homodimer Tm; remaining: 877 primers.]

Figure A.1: Effects of UOD filters on all 82,520 primers at monomorphic sites across the 24 kb region described in the main text. The numbers of primers filtered are indicated for the parameters classified as (A) primer features and (B) primer interactions.

Supplementary methods and results

Region Specific Enrichment on maize

A Region Specific Enrichment (RSE) technology used to resequence a 150 kb section of the human genome [39] was tested on maize. Region specific enrichment involves hybridization of primers followed by single-strand synthesis with a mixture of standard and biotinylated nucleotides. The synthesized fragments, while still bound to the copied DNA, are pulled down with streptavidin-coated magnetic beads. Through this approach, the targeted DNA fragment is captured and enriched, uncoupled from the single-strand synthesis fragments, amplified using whole genome amplification and sequenced.

Generation Biotech (Lawrenceville, NJ) was contracted to perform RSE on eight regions of the maize genome (Table A.3). A pilot experiment was performed using six samples of maize, including the same B73 sample used for the reference genome [130]. Generation Biotech provided paired-end 100-bp sequence data produced on an Illumina HiSeq 2000 (reads are deposited at the NCBI sequence read archive; SRA ID: SRP048756, BioProject ID: PRJNA263415). The reads for each sample (here, we present results for the B73 sample) were quality trimmed using FASTX toolkit [51], mapped to the B73 reference genome (AGP v2) with bwa-mem [80] and summarized in terms of on-target versus off-target coverage using the "-hist" option of bedtools [118] (Figure A.4). Region specific enrichment showed effectively no enrichment of the target regions.

Table A.3: Eight genomic loci in the maize B73 genome selected for targeted enrichment

Target site # | Target locus name | Chr. | Start pos. (B73 v3) | Stop pos. (B73 v3) | Length (bp)
1 | qNLB_1_25376615_22184 | 1 | 25,376,615 | 25,398,798 | 22,184
2 | qNLB_1_187278617_197947 | 1 | 187,278,617 | 187,476,563 | 197,947
3 | qSLB_2_37556845_13001 | 2 | 37,556,845 | 37,569,845 | 13,001
4 | qSLB_3_33490673_24001 | 3 | 33,490,673 | 33,514,673 | 24,001
5 | qSLB_3_219917184_72001 | 3 | 219,917,184 | 219,989,184 | 72,001
6 | qSLB_6_7002788_135001 | 6 | 7,002,788 | 7,137,788 | 135,001
7 | qMDR_7_128386997_50394 | 7 | 128,386,997 | 128,437,390 | 50,394
8 | qSLB_9_16206364_303882 | 9 | 16,206,364 | 16,510,245 | 303,882
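
The Length column in Table A.3 is consistent with 1-based, inclusive coordinates (length = stop − start + 1). The short Python check below illustrates this; the variable and function names are hypothetical, and the coordinates are copied from the table.

```python
# (QTL label, chromosome, start, stop) copied from Table A.3.
# Assumption: coordinates are 1-based and inclusive, which is what the
# Length column implies.
TARGETS = [
    ("qNLB", 1, 25_376_615, 25_398_798),
    ("qNLB", 1, 187_278_617, 187_476_563),
    ("qSLB", 2, 37_556_845, 37_569_845),
    ("qSLB", 3, 33_490_673, 33_514_673),
    ("qSLB", 3, 219_917_184, 219_989_184),
    ("qSLB", 6, 7_002_788, 7_137_788),
    ("qMDR", 7, 128_386_997, 128_437_390),
    ("qSLB", 9, 16_206_364, 16_510_245),
]


def interval_length(start: int, stop: int) -> int:
    """Length in bp of a 1-based, inclusive interval."""
    return stop - start + 1


# Per-locus lengths and the total targeted sequence across all eight loci.
lengths = [interval_length(start, stop) for _, _, start, stop in TARGETS]
total_bp = sum(lengths)
```

Under this reading, the eight loci together span 818,411 bp of the reference genome.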

Follow-up experiments were performed to examine the potential cause for the lack of enrichment. For this, three on-target and three off-target qPCR assays (Applied Biosystems™ TaqMan®; Table A.4) were used to more rapidly assess enrichment as

[Figure A.2 panels. (A) Runtime (s) versus BLASTn hsp value (1 to 2000). (B) On-target Tm minus off-target Tm versus BLASTn hsp value (1 to 2000).]

Figure A.2: Relationship between ThermoAlign search speed and the search space required for a sufficient search to identify primers predicted to produce specific, on-target PCR products (i.e. on-target Tm more than 10 °C above all off-target Tm values). Results are based on primers designed to the 24 kb locus described in the main text. (A) Wall time in seconds for running ThermoAlign using different BLASTn hsp values for PSE. (B) Lines in the plot show the difference between on-target and off-target Tm (computed from thermoalignments) of each primer across the different BLASTn hsp values used for the PSE module. The search space is expanded by increasing the BLASTn hsp value. A single primer may have multiple off-target matches, but each primer is represented only by the off-target match with the minimum difference in Tm. Only primers that resulted in a change at different hsp values were plotted. Note: because hsp alignments with the same percent identity to the query sequence may be extended into thermoalignments that are not of the same percent identity, it is not guaranteed that the minimally distant off-target site will be identified when the hsp value is low. For this reason, some primers show a decreasing difference in Tm as the hsp setting is increased. Once there is no change, the minimally distant off-target site within the genome has been identified.
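
The rule described in the caption (each primer is represented by its closest off-target match, and a primer is considered specific when its on-target Tm exceeds every off-target Tm by more than 10 °C) can be sketched in Python. This is an illustration only: the function names are hypothetical and are not taken from ThermoAlign, and only the 10 °C threshold and the minimum-difference rule come from the caption.

```python
from typing import Iterable, Optional


def min_tm_difference(on_target_tm: float,
                      off_target_tms: Iterable[float]) -> Optional[float]:
    """Smallest (on-target Tm - off-target Tm) over all off-target matches.

    Returns None when the search found no off-target matches.
    """
    diffs = [on_target_tm - tm for tm in off_target_tms]
    return min(diffs) if diffs else None


def is_specific(on_target_tm: float,
                off_target_tms: Iterable[float],
                threshold: float = 10.0) -> bool:
    """True when the on-target Tm exceeds every off-target Tm by more than threshold."""
    diff = min_tm_difference(on_target_tm, off_target_tms)
    # No off-target match at all is treated as specific (an assumption).
    return diff is None or diff > threshold
```

Because the minimum is taken over the off-target matches found so far, enlarging the search space (a higher hsp value) can only keep this difference the same or lower it, which matches the decreasing lines described for panel B.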

Figure A.3: Minimum tiling paths of standard PCR and long range PCR primers tested in this study. Green lines indicate amplicons from primer pairs that worked without any reaction optimization. Orange lines indicate amplicons from primers that worked only with the addition of betaine. The gene track indicates genes to which the primer pairs were designed (Remorin: GRMZM2G107774; NPB: GRMZM2G022627; p450: GRMZM2G031364; LHT1: GRMZM2G127328; GST: GRMZM2G416632).

Figure A.4: Depth of coverage for on-target versus off-target regions of the B73 reference genome from Illumina sequencing of RSE on B73 gDNA.

the ratio between qPCR estimates of on-target and off-target DNA amounts. The RSE products were provided by Generation Biotech. Quantitative PCR was performed at the University of Delaware using TaqMan Gene Expression Master Mix and reagents (Applied Biosystems; Cat: #4369016) following the manufacturer's recommendations. Each assay was prepared in triplicate and read with an Applied Biosystems 7900HT Fast Real-Time PCR System. When the primers for RSE were excluded from the reaction, gDNA was captured with approximately equal amounts of on-target and off-target DNA (Table A.5). Estimates of the amount of enrichment by qPCR were not significantly different (α = 0.05) for RSE with and without primers used for capture. When the same test was performed without primers but also excluding polymerase, no gDNA was captured (data not shown). Together, this suggested that our samples of maize gDNA led to non-specific single-strand synthesis products (required for capture via incorporation of biotinylated dNTPs), which in turn led to non-specific capture and failure to enrich the target regions.

Table A.4: Target and off-target qPCR assays used to quantify enrichment ratios

Target / off-target | TaqMan assay name | Target gene accession # | Chr. | Target locus name
Target | Zm03998176_s1 | NP_001148434 | 1 | qNLB_1_187278617_197947
Target | Zm04041148_s1 | NP_001140650 | 1 | qNLB_1_187278617_197947
Target | Zm04077775_s1 | NP_001136630 | 9 | qSLB_9_16206364_303882
Off-target | Zm04080181_s1 | NP_001147508 | 8 | -
Off-target | Zm04040029_s1 | NM_001157035 | 4 | -
Off-target | Zm04084821_s1 | NP_001150767 | 5 | -

Table A.5: Enrichment ratios of RSE products

B73 gDNA | Enrichment ratio ± SE (target/non-target DNA)
RSE with primers | 4.13 ± 0.53
RSE without primers | 2.44 ± 0.47
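
The enrichment ratio is described in the text as the on-target DNA amount divided by the off-target DNA amount, estimated by qPCR with each assay in triplicate. How the standard error was obtained is not stated. The Python sketch below therefore assumes first-order error propagation for a quotient of two independent means; the function names are hypothetical, and any replicate values passed in are illustrative rather than data from this study.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def standard_error(values: Sequence[float]) -> float:
    """Standard error of the mean of replicate measurements."""
    return stdev(values) / math.sqrt(len(values))


def enrichment_ratio(on_target: Sequence[float],
                     off_target: Sequence[float]) -> Tuple[float, float]:
    """Return (ratio, SE) for mean on-target over mean off-target DNA amount.

    Assumption: the SE is propagated to first order for a quotient of
    independent means; the source does not state the method actually used.
    """
    m_on, m_off = mean(on_target), mean(off_target)
    ratio = m_on / m_off
    relative_se = math.sqrt((standard_error(on_target) / m_on) ** 2 +
                            (standard_error(off_target) / m_off) ** 2)
    return ratio, ratio * relative_se
```

A ratio near 1 would indicate no enrichment, so ratios are best read together with their standard errors, as reported in Table A.5.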

Selective whole genome amplification on maize

Selective whole genome amplification (SWGA) is a technique based on multiple displacement amplification (MDA) using primers prevalent in a target genome and rare in the background genome(s) in order to selectively amplify the target genome [79]. Using phi29 polymerase for MDA, this approach resulted in >10^5-fold amplification of the targeted Borrelia burgdorferi genome (as compared to <6.7-fold background amplification) from a 1:2000 mixture of the B. burgdorferi and Escherichia coli genomes [79].

The efficacy of phi29-based MDA for targeted enrichment on maize (DNA from B73) was tested, considering the targeted region as the "target genome" and the non-targeted space as the "background genome." Forward and reverse primers flanking the target of two qPCR assays (Zm03998176_s1 and Zm04041148_s1 in Table A.4) were designed for MDA. Off-target Tm for primers in the target region were calculated based on their BLASTn local alignments (not thermoalignments, as these tests were performed prior to the development of ThermoAlign) with the background genome, and primers with the maximum difference in target vs off-target Tm were used. qPCR based quantification of the amount of DNA corresponding to the target and non-target region indicated that the SWGA technology resulted in little to no enrichment of the target regions (Table A.6). We note our study was not an attempt to validate the findings of Leichty et al. [79]. We consider our failure to achieve enrichment to be specific to our tests, which used a more complex genome and applied SWGA in a different manner than described by Leichty et al. [79]. We also tested this using primers designed prior to our development of ThermoAlign. Nevertheless, a difficulty of this approach with repetitive genomes like maize is identifying primers flanking the target region that will bind specifically at 30 °C, which is required for phi29 amplification.

Selective whole genome amplification was performed using phi29 DNA polymerase (New England Biolabs; Cat: #M0269L). Prior to assembling the reactions, template DNA was heated for 2 min at 90 °C and immediately placed on ice. Reactions (prepared on ice) with a final concentration of 1X phi29 reaction buffer, 0.2 mg/mL BSA, 1 mM dNTPs, 0.125, 0.250, or 0.500 µM forward and reverse primer set (total) and 30 U phi29 were combined with 50 ng of gDNA and brought to a volume of 50 µL with molecular-grade water. Amplification was carried out on an Eppendorf Mastercycler pro S with the following conditions: 5 min at 39 °C; 10 min at 38 °C; 15 min at 37 °C; 20 min at 36 °C; 30 min at 35 °C; 960 min at 34 °C; 15 min at 65 °C; and a hold at 4 °C. PCR products were assayed by electrophoresis in a 1% TBE gel and imaged using a FluorChem HD2 with AlphaView SA v3.4.0, which showed amplification had occurred (not shown). Amplified products were cleaned with SPRIselect and qPCR was performed using TaqMan Gene Expression Master Mix and reagents (Applied Biosystems; Cat: #4369016) following the manufacturer's recommendations. Each assay was prepared in triplicate and read with an Applied Biosystems 7900HT Fast Real-Time PCR System.

Table A.6: qPCR analysis of SWGA products

qPCR assay that the primers flanked | PCR primer concentration | Enrichment ratio ± SE (target/non-target DNA)
Zm03998176_s1 | 500 nM | 1.54 ± 0.23
Zm03998176_s1 | 250 nM | 1.96 ± 0.12
Zm03998176_s1 | 125 nM | 1.71 ± 0.17
Zm04041148_s1 | 500 nM | 0.98 ± 0.05
Zm04041148_s1 | 250 nM | 1.03 ± 0.06
Zm04041148_s1 | 125 nM | 1.02 ± 0.10
No amplification (gDNA) | N/A | 1.23 ± 0.09

Appendix B

SUPPLEMENTARY INFORMATION FOR: CLUSTERING OF CIRCULAR CONSENSUS SEQUENCES: ACCURATE ERROR CORRECTION AND ASSEMBLY OF SINGLE MOLECULE REAL-TIME READS FROM MULTIPLEXED AMPLICON LIBRARIES

Supplementary Tables

Table B.1: PacBio reads of insert protocol output metrics.

Metric | Amplicon library: Single | Amplicon library: Multiplex
Reads of Insert | 136,008,130 | 210,038,102
Mean Read Length of Insert | 3,697 | 4,094
Mean Read Quality of Insert | 98.54% | 98.39%
Mean Number of Passes | 9.39 | 9.00

Table B.2: Padded and barcoded primer sequences used for amplification of six maize lines.

Sample Primer Sequence

P39 11.4 f 1 GTTAGCTATACATGACTCTGCGCGGCTGTGAAGATGTCATGGACA

Tx303 11.4 f 2 GTTAGTGTGTATCAGTACATGGCGGCTGTGAAGATGTCATGGACA

Mo17 11.4 f 3 GTTAGTATGTGATCGTCTCTCGCGGCTGTGAAGATGTCATGGACA

Hp301 11.4 f 4 GTTAGTACGACTACATATCAGGCGGCTGTGAAGATGTCATGGACA

B73 11.4 f 5 GTTAGTATCTCTGTAGAGTCTGCGGCTGTGAAGATGTCATGGACA

CML277 11.4 f 6 GTTAGTCTATGTCTCAGTAGTGCGGCTGTGAAGATGTCATGGACA

P39 11.4 r 1 GTTAGGCAGAGTCATGTATAGTGGAACAAGAGCTGGCCTGTTCGA

Tx303 11.4 r 2 GTTAGCATGTACTGATACACATGGAACAAGAGCTGGCCTGTTCGA

Mo17 11.4 r 3 GTTAGGAGAGACGATCACATATGGAACAAGAGCTGGCCTGTTCGA

Hp301 11.4 r 4 GTTAGCTGATATGTAGTCGTATGGAACAAGAGCTGGCCTGTTCGA

B73 11.4 r 5 GTTAGAGACTCTACAGAGATATGGAACAAGAGCTGGCCTGTTCGA

CML277 11.4 r 6 GTTAGACTACTGAGACATAGATGGAACAAGAGCTGGCCTGTTCGA

P39 2.2 f 1 GTTAGCTATACATGACTCTGCGCGGCATACAACAGGCAAAGTTAGC

Tx303 2.2 f 2 GTTAGTGTGTATCAGTACATGGCGGCATACAACAGGCAAAGTTAGC

Mo17 2.2 f 3 GTTAGTATGTGATCGTCTCTCGCGGCATACAACAGGCAAAGTTAGC

Hp301 2.2 f 4 GTTAGTACGACTACATATCAGGCGGCATACAACAGGCAAAGTTAGC

B73 2.2 f 5 GTTAGTATCTCTGTAGAGTCTGCGGCATACAACAGGCAAAGTTAGC

CML277 2.2 f 6 GTTAGTCTATGTCTCAGTAGTGCGGCATACAACAGGCAAAGTTAGC

P39 2.2 r 1 GTTAGGCAGAGTCATGTATAGGGCTGCTATGGTCTGTATCGTCTCCAAC

Tx303 2.2 r 2 GTTAGCATGTACTGATACACAGGCTGCTATGGTCTGTATCGTCTCCAAC

Mo17 2.2 r 3 GTTAGGAGAGACGATCACATAGGCTGCTATGGTCTGTATCGTCTCCAAC

Hp301 2.2 r 4 GTTAGCTGATATGTAGTCGTAGGCTGCTATGGTCTGTATCGTCTCCAAC

B73 2.2 r 5 GTTAGAGACTCTACAGAGATAGGCTGCTATGGTCTGTATCGTCTCCAAC

CML277 2.2 r 6 GTTAGACTACTGAGACATAGAGGCTGCTATGGTCTGTATCGTCTCCAAC

P39 2.1 f 1 GTTAGCTATACATGACTCTGCTTGTGGCATCAGTAGGGTCTGAAAC

Tx303 2.1 f 2 GTTAGTGTGTATCAGTACATGTTGTGGCATCAGTAGGGTCTGAAAC

Mo17 2.1 f 3 GTTAGTATGTGATCGTCTCTCTTGTGGCATCAGTAGGGTCTGAAAC

Hp301 2.1 f 4 GTTAGTACGACTACATATCAGTTGTGGCATCAGTAGGGTCTGAAAC

B73 2.1 f 5 GTTAGTATCTCTGTAGAGTCTTTGTGGCATCAGTAGGGTCTGAAAC

CML277 2.1 f 6 GTTAGTCTATGTCTCAGTAGTTTGTGGCATCAGTAGGGTCTGAAAC

P39 2.1 r 1 GTTAGGCAGAGTCATGTATAGCCTTCCCTTAACCAAAGTTGAATAGCAG

Tx303 2.1 r 2 GTTAGCATGTACTGATACACACCTTCCCTTAACCAAAGTTGAATAGCAG

Mo17 2.1 r 3 GTTAGGAGAGACGATCACATACCTTCCCTTAACCAAAGTTGAATAGCAG

Hp301 2.1 r 4 GTTAGCTGATATGTAGTCGTACCTTCCCTTAACCAAAGTTGAATAGCAG

B73 2.1 r 5 GTTAGAGACTCTACAGAGATACCTTCCCTTAACCAAAGTTGAATAGCAG

CML277 2.1 r 6 GTTAGACTACTGAGACATAGACCTTCCCTTAACCAAAGTTGAATAGCAG

P39 13.4 f 1 GTTAGCTATACATGACTCTGCTCCTCATTCTCCTCTCGGAATCGCT

Tx303 13.4 f 2 GTTAGTGTGTATCAGTACATGTCCTCATTCTCCTCTCGGAATCGCT

Mo17 13.4 f 3 GTTAGTATGTGATCGTCTCTCTCCTCATTCTCCTCTCGGAATCGCT

Hp301 13.4 f 4 GTTAGTACGACTACATATCAGTCCTCATTCTCCTCTCGGAATCGCT

B73 13.4 f 5 GTTAGTATCTCTGTAGAGTCTTCCTCATTCTCCTCTCGGAATCGCT

CML277 13.4 f 6 GTTAGTCTATGTCTCAGTAGTTCCTCATTCTCCTCTCGGAATCGCT

P39 13.4 r 1 GTTAGGCAGAGTCATGTATAGATACCACCTCAAGAAAGCAGGCCTA

Tx303 13.4 r 2 GTTAGCATGTACTGATACACAATACCACCTCAAGAAAGCAGGCCTA

Mo17 13.4 r 3 GTTAGGAGAGACGATCACATAATACCACCTCAAGAAAGCAGGCCTA

Hp301 13.4 r 4 GTTAGCTGATATGTAGTCGTAATACCACCTCAAGAAAGCAGGCCTA

B73 13.4 r 5 GTTAGAGACTCTACAGAGATAATACCACCTCAAGAAAGCAGGCCTA

CML277 13.4 r 6 GTTAGACTACTGAGACATAGAATACCACCTCAAGAAAGCAGGCCTA

P39 6.5 f 1 GTTAGCTATACATGACTCTGCCTTGCCGCTAAACCATCTCGCGATT

Tx303 6.5 f 2 GTTAGTGTGTATCAGTACATGCTTGCCGCTAAACCATCTCGCGATT

Mo17 6.5 f 3 GTTAGTATGTGATCGTCTCTCCTTGCCGCTAAACCATCTCGCGATT

Hp301 6.5 f 4 GTTAGTACGACTACATATCAGCTTGCCGCTAAACCATCTCGCGATT

B73 6.5 f 5 GTTAGTATCTCTGTAGAGTCTCTTGCCGCTAAACCATCTCGCGATT

CML277 6.5 f 6 GTTAGTCTATGTCTCAGTAGTCTTGCCGCTAAACCATCTCGCGATT

P39 6.5 r 1 GTTAGGCAGAGTCATGTATAGTTCTTGCGCCCACCTCCTTCTGAAA

Tx303 6.5 r 2 GTTAGCATGTACTGATACACATTCTTGCGCCCACCTCCTTCTGAAA

Mo17 6.5 r 3 GTTAGGAGAGACGATCACATATTCTTGCGCCCACCTCCTTCTGAAA

Hp301 6.5 r 4 GTTAGCTGATATGTAGTCGTATTCTTGCGCCCACCTCCTTCTGAAA

B73 6.5 r 5 GTTAGAGACTCTACAGAGATATTCTTGCGCCCACCTCCTTCTGAAA

CML277 6.5 r 6 GTTAGACTACTGAGACATAGATTCTTGCGCCCACCTCCTTCTGAAA

P39 1.6 f 1 GTTAGCTATACATGACTCTGCAAGCTCTAAGCGCCGCCATGGTTA

Tx303 1.6 f 2 GTTAGTGTGTATCAGTACATGAAGCTCTAAGCGCCGCCATGGTTA

Mo17 1.6 f 3 GTTAGTATGTGATCGTCTCTCAAGCTCTAAGCGCCGCCATGGTTA

Hp301 1.6 f 4 GTTAGTACGACTACATATCAGAAGCTCTAAGCGCCGCCATGGTTA

B73 1.6 f 5 GTTAGTATCTCTGTAGAGTCTAAGCTCTAAGCGCCGCCATGGTTA

CML277 1.6 f 6 GTTAGTCTATGTCTCAGTAGTAAGCTCTAAGCGCCGCCATGGTTA

P39 1.6 r 1 GTTAGGCAGAGTCATGTATAGGCATCTTGGCGTACACCTTGTTGG

Tx303 1.6 r 2 GTTAGCATGTACTGATACACAGCATCTTGGCGTACACCTTGTTGG

Mo17 1.6 r 3 GTTAGGAGAGACGATCACATAGCATCTTGGCGTACACCTTGTTGG

Hp301 1.6 r 4 GTTAGCTGATATGTAGTCGTAGCATCTTGGCGTACACCTTGTTGG

B73 1.6 r 5 GTTAGAGACTCTACAGAGATAGCATCTTGGCGTACACCTTGTTGG

CML277 1.6 r 6 GTTAGACTACTGAGACATAGAGCATCTTGGCGTACACCTTGTTGG
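
The sequences in Table B.2 appear to share one layout: a 5-nt padding (GTTAG), a 16-nt barcode, and then the locus-specific primer, with the reverse primer carrying the reverse complement of the barcode used on the forward primer. The Python sketch below reproduces the two P39 11.4 rows under that reading; the function names are hypothetical, and the layout is inferred from the table rather than stated in the text.

```python
# Padding inferred from the shared 5' prefix of every sequence in Table B.2.
PADDING = "GTTAG"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def tagged_primer_pair(barcode: str, forward_primer: str, reverse_primer: str):
    """Build padded, barcoded forward and reverse primers.

    The forward primer carries the barcode as given; the reverse primer
    carries its reverse complement (an inference from the table).
    """
    forward = PADDING + barcode + forward_primer
    reverse = PADDING + reverse_complement(barcode) + reverse_primer
    return forward, reverse


# Barcode 1 with the 11.4 forward and reverse primers (rows for sample P39).
fwd, rev = tagged_primer_pair(
    "CTATACATGACTCTGC",
    "GCGGCTGTGAAGATGTCATGGACA",
    "TGGAACAAGAGCTGGCCTGTTCGA",
)
```

With these inputs, `fwd` and `rev` match the P39 11.4 f 1 and P39 11.4 r 1 sequences listed above.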

[Figure B.1 panels (reference sequence on the x-axis, consensus cluster on the y-axis): (a) chr1_25390617_25395472, Cluster7; (b) chr2_37564580_37569533, Cluster11; (c) chr3_33506037_33509162, Cluster9; (d) chr2_37562840_37567441, Cluster10; (e) chr6_7045710_7050495, Cluster13; (f) chr1_25391952_25396540, Cluster12; (g) chr1_25391952_25396540, Cluster8.]

Figure B.1: Dot-plots of alignments between amplicon reference sequences and inaccurate consensus sequences generated by LAA. (a-g) The reference sequence for each amplicon is labeled according to the chromosome, start and stop position in the reference genome of B73 v3 (x-axes). The consensus sequences are labeled 7-13 (y-axes). Green lines show alignments in the same orientation while red lines show alignments in reverse orientation.

Appendix C

SUPPLEMENTARY INFORMATION FOR: UNRAVELLING THE GENOMIC DIVERSITY AT A MAIZE QUANTITATIVE DISEASE RESISTANCE LOCUS USING LONG MOLECULE RESEQUENCING

Table C.1: Tiling path of primers used for the resequenced 27 NAM founder lines

Set | F primer name | F primer | R primer name | R primer
1 | 11.4 f | GCGGCTGTGAAGATGTCATGGACA | 11.4 r | TGGAACAAGAGCTGGCCTGTTCGA
1 | 2.2 f | GCGGCATACAACAGGCAAAGTTAGC | 2.2 r | GGCTGCTATGGTCTGTATCGTCTCCAAC
1 | 13.4 f | TCCTCATTCTCCTCTCGGAATCGCT | 13.4 r | ATACCACCTCAAGAAAGCAGGCCTA
2 | 16.1 f | GCGGCTGTGAAGATGTCATGGACA | 16.1 r | CAAGAGCTGGCCTGTTCGAGGTTA
2 | 4.1 f | GCAAACAAACATTGGTCAGGCATGC | 4.1 r | TTATGTACAGAGTAACACATCGATCGG
2 | 15.4 f | ATTCTCCTCTCGGAATCGCTCCCT | 15.4 r | TACCACCTCAAGAAAGCAGGCCTA
1,2 | 2.1 f | TTGTGGCATCAGTAGGGTCTGAAAC | 2.1 r | CCTTCCCTTAACCAAAGTTGAATAGCAG
1,2 | 6.5 f | CTTGCCGCTAAACCATCTCGCGATT | 6.5 r | TTCTTGCGCCCACCTCCTTCTGAAA
1,2 | 1.6 f | AAGCTCTAAGCGCCGCCATGGTTA | 1.6 r | GCATCTTGGCGTACACCTTGTTGG

Lines, set 1: B73, CML277, Hp301, Mo17, P39, Tx303
Lines, set 2: B97, CML103, CML228, CML247, CML322, CML333, CML52, CML69, Il14H, Ki3, Ki11, Ky21, M162W, M37W, Mo18W, MS71, NC350, NC358, Oh43, Oh7B, Tzi8
Lines, set 1,2: B73, CML277, Hp301, Mo17, P39, Tx303, B97, CML103, CML228, CML247, CML322, CML333, CML52, CML69, Il14H, Ki3, Ki11, Ky21, M162W, M37W, Mo18W, MS71, NC350, NC358, Oh43, Oh7B, Tzi8

Table C.2: Padding, barcodes and primer sequences used for amplification of the 27 NAM founder lines.

Primer name Sequence

11.4 f 1 GTTAGCTATACATGACTCTGCGCGGCTGTGAAGATGTCATGGACA

11.4 f 2 GTTAGTGTGTATCAGTACATGGCGGCTGTGAAGATGTCATGGACA

11.4 f 3 GTTAGTATGTGATCGTCTCTCGCGGCTGTGAAGATGTCATGGACA

11.4 f 4 GTTAGTACGACTACATATCAGGCGGCTGTGAAGATGTCATGGACA

11.4 f 5 GTTAGTATCTCTGTAGAGTCTGCGGCTGTGAAGATGTCATGGACA

11.4 f 6 GTTAGTCTATGTCTCAGTAGTGCGGCTGTGAAGATGTCATGGACA

11.4 r 1 GTTAGGCAGAGTCATGTATAGTGGAACAAGAGCTGGCCTGTTCGA

11.4 r 2 GTTAGCATGTACTGATACACATGGAACAAGAGCTGGCCTGTTCGA

11.4 r 3 GTTAGGAGAGACGATCACATATGGAACAAGAGCTGGCCTGTTCGA

11.4 r 4 GTTAGCTGATATGTAGTCGTATGGAACAAGAGCTGGCCTGTTCGA

11.4 r 5 GTTAGAGACTCTACAGAGATATGGAACAAGAGCTGGCCTGTTCGA

11.4 r 6 GTTAGACTACTGAGACATAGATGGAACAAGAGCTGGCCTGTTCGA

2.2 f 1 GTTAGCTATACATGACTCTGCGCGGCATACAACAGGCAAAGTTAGC

2.2 f 2 GTTAGTGTGTATCAGTACATGGCGGCATACAACAGGCAAAGTTAGC

2.2 f 3 GTTAGTATGTGATCGTCTCTCGCGGCATACAACAGGCAAAGTTAGC

2.2 f 4 GTTAGTACGACTACATATCAGGCGGCATACAACAGGCAAAGTTAGC

2.2 f 5 GTTAGTATCTCTGTAGAGTCTGCGGCATACAACAGGCAAAGTTAGC

2.2 f 6 GTTAGTCTATGTCTCAGTAGTGCGGCATACAACAGGCAAAGTTAGC

2.2 r 1 GTTAGGCAGAGTCATGTATAGGGCTGCTATGGTCTGTATCGTCTCCAAC

2.2 r 2 GTTAGCATGTACTGATACACAGGCTGCTATGGTCTGTATCGTCTCCAAC

2.2 r 3 GTTAGGAGAGACGATCACATAGGCTGCTATGGTCTGTATCGTCTCCAAC

2.2 r 4 GTTAGCTGATATGTAGTCGTAGGCTGCTATGGTCTGTATCGTCTCCAAC

2.2 r 5 GTTAGAGACTCTACAGAGATAGGCTGCTATGGTCTGTATCGTCTCCAAC

2.2 r 6 GTTAGACTACTGAGACATAGAGGCTGCTATGGTCTGTATCGTCTCCAAC

2.1 f 1 GTTAGCTATACATGACTCTGCTTGTGGCATCAGTAGGGTCTGAAAC

2.1 f 2 GTTAGTGTGTATCAGTACATGTTGTGGCATCAGTAGGGTCTGAAAC

2.1 f 3 GTTAGTATGTGATCGTCTCTCTTGTGGCATCAGTAGGGTCTGAAAC

2.1 f 4 GTTAGTACGACTACATATCAGTTGTGGCATCAGTAGGGTCTGAAAC

2.1 f 5 GTTAGTATCTCTGTAGAGTCTTTGTGGCATCAGTAGGGTCTGAAAC

2.1 f 6 GTTAGTCTATGTCTCAGTAGTTTGTGGCATCAGTAGGGTCTGAAAC

2.1 r 1 GTTAGGCAGAGTCATGTATAGCCTTCCCTTAACCAAAGTTGAATAGCAG

2.1 r 2 GTTAGCATGTACTGATACACACCTTCCCTTAACCAAAGTTGAATAGCAG

2.1 r 3 GTTAGGAGAGACGATCACATACCTTCCCTTAACCAAAGTTGAATAGCAG

2.1 r 4 GTTAGCTGATATGTAGTCGTACCTTCCCTTAACCAAAGTTGAATAGCAG

2.1 r 5 GTTAGAGACTCTACAGAGATACCTTCCCTTAACCAAAGTTGAATAGCAG

2.1 r 6 GTTAGACTACTGAGACATAGACCTTCCCTTAACCAAAGTTGAATAGCAG

13.4 f 1 GTTAGCTATACATGACTCTGCTCCTCATTCTCCTCTCGGAATCGCT

13.4 f 2 GTTAGTGTGTATCAGTACATGTCCTCATTCTCCTCTCGGAATCGCT

13.4 f 3 GTTAGTATGTGATCGTCTCTCTCCTCATTCTCCTCTCGGAATCGCT

13.4 f 4 GTTAGTACGACTACATATCAGTCCTCATTCTCCTCTCGGAATCGCT

13.4 f 5 GTTAGTATCTCTGTAGAGTCTTCCTCATTCTCCTCTCGGAATCGCT

13.4 f 6 GTTAGTCTATGTCTCAGTAGTTCCTCATTCTCCTCTCGGAATCGCT

13.4 r 1 GTTAGGCAGAGTCATGTATAGATACCACCTCAAGAAAGCAGGCCTA

13.4 r 2 GTTAGCATGTACTGATACACAATACCACCTCAAGAAAGCAGGCCTA

13.4 r 3 GTTAGGAGAGACGATCACATAATACCACCTCAAGAAAGCAGGCCTA

13.4 r 4 GTTAGCTGATATGTAGTCGTAATACCACCTCAAGAAAGCAGGCCTA

13.4 r 5 GTTAGAGACTCTACAGAGATAATACCACCTCAAGAAAGCAGGCCTA

13.4 r 6 GTTAGACTACTGAGACATAGAATACCACCTCAAGAAAGCAGGCCTA

6.5 f 1 GTTAGCTATACATGACTCTGCCTTGCCGCTAAACCATCTCGCGATT

6.5 f 2 GTTAGTGTGTATCAGTACATGCTTGCCGCTAAACCATCTCGCGATT

6.5 f 3 GTTAGTATGTGATCGTCTCTCCTTGCCGCTAAACCATCTCGCGATT

6.5 f 4 GTTAGTACGACTACATATCAGCTTGCCGCTAAACCATCTCGCGATT

6.5 f 5 GTTAGTATCTCTGTAGAGTCTCTTGCCGCTAAACCATCTCGCGATT

6.5 f 6 GTTAGTCTATGTCTCAGTAGTCTTGCCGCTAAACCATCTCGCGATT

6.5 r 1 GTTAGGCAGAGTCATGTATAGTTCTTGCGCCCACCTCCTTCTGAAA

6.5 r 2 GTTAGCATGTACTGATACACATTCTTGCGCCCACCTCCTTCTGAAA

6.5 r 3 GTTAGGAGAGACGATCACATATTCTTGCGCCCACCTCCTTCTGAAA

6.5 r 4 GTTAGCTGATATGTAGTCGTATTCTTGCGCCCACCTCCTTCTGAAA

6.5 r 5 GTTAGAGACTCTACAGAGATATTCTTGCGCCCACCTCCTTCTGAAA

6.5 r 6 GTTAGACTACTGAGACATAGATTCTTGCGCCCACCTCCTTCTGAAA

1.6 f 1 GTTAGCTATACATGACTCTGCAAGCTCTAAGCGCCGCCATGGTTA

1.6 f 2 GTTAGTGTGTATCAGTACATGAAGCTCTAAGCGCCGCCATGGTTA

1.6 f 3 GTTAGTATGTGATCGTCTCTCAAGCTCTAAGCGCCGCCATGGTTA

1.6 f 4 GTTAGTACGACTACATATCAGAAGCTCTAAGCGCCGCCATGGTTA

1.6 f 5 GTTAGTATCTCTGTAGAGTCTAAGCTCTAAGCGCCGCCATGGTTA

1.6 f 6 GTTAGTCTATGTCTCAGTAGTAAGCTCTAAGCGCCGCCATGGTTA

1.6 r 1 GTTAGGCAGAGTCATGTATAGGCATCTTGGCGTACACCTTGTTGG

1.6 r 2 GTTAGCATGTACTGATACACAGCATCTTGGCGTACACCTTGTTGG

1.6 r 3 GTTAGGAGAGACGATCACATAGCATCTTGGCGTACACCTTGTTGG

1.6 r 4 GTTAGCTGATATGTAGTCGTAGCATCTTGGCGTACACCTTGTTGG

1.6 r 5 GTTAGAGACTCTACAGAGATAGCATCTTGGCGTACACCTTGTTGG

1.6 r 6 GTTAGACTACTGAGACATAGAGCATCTTGGCGTACACCTTGTTGG

16.1 f 1 GTTAGCTATACATGACTCTGCGCGGCTGTGAAGATGTCATGGACA

16.1 f 2 GTTAGTGTGTATCAGTACATGGCGGCTGTGAAGATGTCATGGACA

16.1 f 3 GTTAGTATGTGATCGTCTCTCGCGGCTGTGAAGATGTCATGGACA

16.1 f 4 GTTAGTACGACTACATATCAGGCGGCTGTGAAGATGTCATGGACA

16.1 f 5 GTTAGTATCTCTGTAGAGTCTGCGGCTGTGAAGATGTCATGGACA

16.1 f 6 GTTAGTCTATGTCTCAGTAGTGCGGCTGTGAAGATGTCATGGACA

16.1 r 1 GTTAGGCAGAGTCATGTATAGCAAGAGCTGGCCTGTTCGAGGTTA

16.1 r 2 GTTAGCATGTACTGATACACACAAGAGCTGGCCTGTTCGAGGTTA

16.1 r 3 GTTAGGAGAGACGATCACATACAAGAGCTGGCCTGTTCGAGGTTA

16.1 r 4 GTTAGCTGATATGTAGTCGTACAAGAGCTGGCCTGTTCGAGGTTA

16.1 r 5 GTTAGAGACTCTACAGAGATACAAGAGCTGGCCTGTTCGAGGTTA

16.1 r 6 GTTAGACTACTGAGACATAGACAAGAGCTGGCCTGTTCGAGGTTA

4.1 f 1 GTTAGCTATACATGACTCTGCGCAAACAAACATTGGTCAGGCATGC

4.1 f 2 GTTAGTGTGTATCAGTACATGGCAAACAAACATTGGTCAGGCATGC

4.1 f 3 GTTAGTATGTGATCGTCTCTCGCAAACAAACATTGGTCAGGCATGC

4.1 f 4 GTTAGTACGACTACATATCAGGCAAACAAACATTGGTCAGGCATGC

4.1 f 5 GTTAGTATCTCTGTAGAGTCTGCAAACAAACATTGGTCAGGCATGC

4.1 f 6 GTTAGTCTATGTCTCAGTAGTGCAAACAAACATTGGTCAGGCATGC

4.1 r 1 GTTAGGCAGAGTCATGTATAGTTATGTACAGAGTAACACATCGATCGG

4.1 r 2 GTTAGCATGTACTGATACACATTATGTACAGAGTAACACATCGATCGG

4.1 r 3 GTTAGGAGAGACGATCACATATTATGTACAGAGTAACACATCGATCGG

4.1 r 4 GTTAGCTGATATGTAGTCGTATTATGTACAGAGTAACACATCGATCGG

4.1 r 5 GTTAGAGACTCTACAGAGATATTATGTACAGAGTAACACATCGATCGG

4.1 r 6 GTTAGACTACTGAGACATAGATTATGTACAGAGTAACACATCGATCGG

15.4 f 1 GTTAGCTATACATGACTCTGCATTCTCCTCTCGGAATCGCTCCCT

15.4 f 2 GTTAGTGTGTATCAGTACATGATTCTCCTCTCGGAATCGCTCCCT

15.4 f 3 GTTAGTATGTGATCGTCTCTCATTCTCCTCTCGGAATCGCTCCCT

15.4 f 4 GTTAGTACGACTACATATCAGATTCTCCTCTCGGAATCGCTCCCT

15.4 f 5 GTTAGTATCTCTGTAGAGTCTATTCTCCTCTCGGAATCGCTCCCT

15.4 f 6 GTTAGTCTATGTCTCAGTAGTATTCTCCTCTCGGAATCGCTCCCT

15.4 r 1 GTTAGGCAGAGTCATGTATAGTACCACCTCAAGAAAGCAGGCCTA

15.4 r 2 GTTAGCATGTACTGATACACATACCACCTCAAGAAAGCAGGCCTA

15.4 r 3 GTTAGGAGAGACGATCACATATACCACCTCAAGAAAGCAGGCCTA

15.4 r 4 GTTAGCTGATATGTAGTCGTATACCACCTCAAGAAAGCAGGCCTA

15.4 r 5 GTTAGAGACTCTACAGAGATATACCACCTCAAGAAAGCAGGCCTA

15.4 r 6 GTTAGACTACTGAGACATAGATACCACCTCAAGAAAGCAGGCCTA

11.4 f 4 2 GTTAGTACGACTACATATCAGGCGGCTGTGAAGATGTCATGGACA

2.2 f 4 2 GTTAGTACGACTACATATCAGGCGGCATACAACAGGCAAAGTTAGC

2.1 f 4 2 GTTAGTACGACTACATATCAGTTGTGGCATCAGTAGGGTCTGAAAC

13.4 f 4 2 GTTAGTACGACTACATATCAGTCCTCATTCTCCTCTCGGAATCGCT

6.5 f 4 2 GTTAGTACGACTACATATCAGCTTGCCGCTAAACCATCTCGCGATT

1.6 f 4 2 GTTAGTACGACTACATATCAGAAGCTCTAAGCGCCGCCATGGTTA

11.4 r 4 2 GTTAGCTGATATGTAGTCGTATGGAACAAGAGCTGGCCTGTTCGA

2.2 r 4 2 GTTAGCTGATATGTAGTCGTAGGCTGCTATGGTCTGTATCGTCTCCAAC

2.1 r 4 2 GTTAGCTGATATGTAGTCGTACCTTCCCTTAACCAAAGTTGAATAGCAG

13.4 r 4 2 GTTAGCTGATATGTAGTCGTAATACCACCTCAAGAAAGCAGGCCTA

6.5 r 4 2 GTTAGCTGATATGTAGTCGTATTCTTGCGCCCACCTCCTTCTGAAA

1.6 r 4 2 GTTAGCTGATATGTAGTCGTAGCATCTTGGCGTACACCTTGTTGG

[Figure: y-axis, Sample (B73, B97, CML103, CML228, CML247, CML277, CML322, CML333, CML52, CML69, HP301, Il14H, Ki11, Ki3, Ky21, M162W, M37W, Mo17, MO18W, Ms71, NC350, NC358, Oh43, Oh7B, P39, Tx303, Tzi8); x-axis, Percentage (0 to 100); legend, Sequence type: Retroelements, DNA_transposons, Unclassified, Unmasked.]

Figure C.1: Repeat content within the qNLB 1 25721468 23298 locus. For each line, the proportion of repetitive elements (identified by RepeatMasker) in the resequence assembly is indicated.

[Figure: panels a (NAM) and b (282 panel), each showing Error rate and Discovery rate; y-axis, Sample (B97 through Tzi8); x-axis, Percentage (0 to 30).]

Figure C.2: Error and discovery rates in maize HapMap3 at the qNLB 1 25721468 23298 locus. Genotyping error and discovery rates for maize HapMap3 were computed across qNLB 1 25721468 23298 for the (a) NAM set and (b) 282 panel. The error rate is the number of incorrect genotype calls in maize HapMap3, using ReSeq variants as the ground truth. The discovery rate is the number of ReSeq variants captured by maize HapMap3.
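The two rates defined in the caption of Figure C.2 can be illustrated with a minimal Python sketch. It is not the analysis code used for this dissertation. The function name, the dictionary representation of genotype calls, and the choice of denominators (shared sites for the error rate, all ReSeq variant sites for the discovery rate) are assumptions made so that both quantities come out as percentages, as on the figure axes.

```python
def error_and_discovery_rates(reseq_calls, hapmap_calls):
    """Return (error_rate, discovery_rate) as percentages.

    reseq_calls, hapmap_calls: dicts mapping a site position to a genotype
    call (hypothetical representation, e.g. {25721500: "A/A"}).
    ReSeq calls are treated as the ground truth.
    """
    # Sites genotyped in both data sets.
    shared_sites = reseq_calls.keys() & hapmap_calls.keys()

    # Assumed definition: share of common sites where HapMap3 disagrees with ReSeq.
    incorrect = sum(1 for site in shared_sites
                    if hapmap_calls[site] != reseq_calls[site])
    error_rate = (100.0 * incorrect / len(shared_sites)
                  if shared_sites else float("nan"))

    # Assumed definition: share of ReSeq variant sites also present in HapMap3.
    discovery_rate = (100.0 * len(shared_sites) / len(reseq_calls)
                      if reseq_calls else float("nan"))
    return error_rate, discovery_rate


# Toy data (invented for illustration only).
reseq = {100: "A/A", 200: "C/C", 300: "G/G", 400: "T/T"}
hapmap = {100: "A/A", 200: "C/T", 300: "G/G"}
error_rate, discovery_rate = error_and_discovery_rates(reseq, hapmap)
print(error_rate, discovery_rate)
```

In this toy case, one of the three shared sites disagrees and three of the four ReSeq sites are present in the second call set.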

[Figure: y-axis, Genotype Accuracy (NAM), 0 to 100; x-axis, MAF (HapMap3), 0.0 to 0.5; color scale, Percent heterozygotes.]

Figure C.3: Relationship between minor allele frequency (MAF) and genotyping accuracy at the qNLB 1 25721468 23298 locus. The MAF was computed across all samples in maize HapMap3, while genotyping accuracy was estimated for the NAM set. The color gradient represents the percentage of heterozygotes at the site for which genotyping accuracy was measured.

Figure C.4: Local associations at the qNLB 1 25721468 23298 locus for DTA, DTS, GDD anthesis, GDD silking and NLB phenotypes. The p-values were corrected for false discovery rate using the Benjamini-Hochberg procedure. All markers are plotted with their p-values (as -log10 values) as a function of genomic position (B73 RefGen v4). The markers are colored by genotyping accuracy, with blue and orange indicating high and low accuracy, respectively. The horizontal dashed grey line indicates a p-value of 0.05. Gene annotations (filtered set) were obtained from the NCBI B73 v4 annotation release 101 and are indicated by green arrows.
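For readers unfamiliar with the Benjamini-Hochberg procedure named in the caption of Figure C.4, the following is a minimal Python sketch of the standard step-up adjustment, followed by the -log10 transformation used for plotting. It is a generic illustration, not the code used to produce the figure. The function name and the example p-values are hypothetical.

```python
import math


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Invented p-values for four markers (illustration only).
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
neg_log10 = [-math.log10(q) for q in adjusted]
significant = [q < 0.05 for q in adjusted]
print(adjusted, neg_log10, significant)
```

A marker would lie above a 0.05 threshold line in a plot of this kind when its adjusted p-value is below 0.05, that is, when its -log10 value exceeds roughly 1.3.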

[Figure: y-axis, Tx303 allele count (1 to 25); x-axis, Coordinate (Mb), 25.720 to 25.745; legend, maxImpact: Moderate, Modifier.]

Figure C.5: Counts of Tx303 variants present in other NAM founders and their maximum variant impact prediction. The allele counts indicate the occurrence of the Tx303 variant among other resequenced NAM founders. The maximum VEP impact of each variant is color coded.

Appendix D

PERMISSIONS

Chapter 2 [45] was published in Scientific Reports (https://www.nature.com/srep/journal-policies/editorial-policies). At the time of submission of this dissertation, Chapter 3 was under review with BMC Bioinformatics (https://bmcbioinformatics.biomedcentral.com/submission-guidelines/editorial-policies). These journals do not require authors to obtain permission before publishing their own articles in their dissertations. The research described in this dissertation, if published in any journal, would be the property of the respective journal or press.
