Chromosome-Scale Assembly and Annotation of the Macadamia Genome (Macadamia Integrifolia HAES 741)
Total Page:16
File Type:pdf, Size:1020Kb
GENOME REPORT Chromosome-Scale Assembly and Annotation of the Macadamia Genome (Macadamia integrifolia HAES 741) Catherine J. Nock,*,1,2 Abdul Baten,*,†,1 Ramil Mauleon,* Kirsty S. Langdon,* Bruce Topp,‡ Craig Hardner,‡ Agnelo Furtado,‡ Robert J. Henry,‡ and Graham J. King* *Southern Cross Plant Science, Southern Cross University, Lismore NSW 2480, Australia, †AgResearch NZ, Grasslands Research Centre, Palmerston North 4442, New Zealand, and ‡Queensland Alliance for Agriculture and Food Innovation, University of Queensland, St Lucia QLD 4069, Australia ORCID ID: 0000-0001-5609-4681 (C.J.N.) ABSTRACT Macadamia integrifolia is a representative of the large basal eudicot family Proteaceae and KEYWORDS the main progenitor species of the Australian native nut crop macadamia. Since its commercialisation in Proteaceae Hawaii fewer than 100 years ago, global production has expanded rapidly. However, genomic resources genome are limited in comparison to other horticultural crops. The first draft assembly of M. integrifolia had good pseudo- coverage of the functional gene space but its high fragmentation has restricted its use in comparative chromosome genomics and association studies. Here we have generated an improved assembly of cultivar HAES transcriptome 741 (4,094 scaffolds, 745 Mb, N50 413 kb) using a combination of Illumina paired and PacBio long read nut crop sequences. Scaffolds were anchored to 14 pseudo-chromosomes using seven genetic linkage maps. This assembly has improved contiguity and coverage, with .120 Gb of additional sequence. Following annotation, 34,274 protein-coding genes were predicted, representing 90% of the expected gene content. Our results indicate that the macadamia genome is repetitive and heterozygous. The total repeat content was 55% and genome-wide heterozygosity, estimated by read mapping, was 0.98% or an average of one SNP per 102 bp. This is the first chromosome-scale genome assembly for macadamia and the Proteaceae. It is expected to be a valuable resource for breeding, gene discovery, conservation and evolutionary genomics. The genomes of most crop species have now been sequenced and nut crop over the past 10 years (International Nut and Dried Fruit their availability is transforming breeding and agricultural pro- Council, http://www.nutfruit.org). As a member of the Gondwanan ductivity (Bevan et al. 2017, Chen et al. 2019). Macadamia is the first family Proteaceae (83 genera, 1660 species) (Christenhusz and Byng Australian native plant to become a global food crop. In 2018/2019, 2016), Macadamia (F.Muell.) is a basal eudicot and is phylogenet- world production was valued at US$1.1 billion (59,307 MT/year, ically divergent from other tree crops (Nock et al. 2014). M. kernel basis), reflecting the most rapid increase in production of any integrifolia, the main species used in cultivation (Hardner 2016), is a mid-storey tree endemic to the lowland rainforests of sub- tropical Australia (Powell et al. 2014) (Figure 1). Macadamia was Copyright © 2020 Nock et al. commercialised as a nut crop from the 1920s in Hawaii and doi: https://doi.org/10.1534/g3.120.401326 cultivated varieties are closely related to their wild ancestors in Manuscript received April 24, 2020; accepted for publication August 1, 2020; published Early Online August 3, 2020. Australia (Nock et al. 2019). This is an open-access article distributed under the terms of the Creative Despite the rapid expansion in production over the past 50 years Commons Attribution 4.0 International License (http://creativecommons.org/ and commercial cultivation in 18 countries, breeding is restricted licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction by a paucity of information on the genes underlying important in any medium, provided the original work is properly cited. 1 crop traits (Topp et al. 2019). There are few genomic resources These authors contributed equally to this work. 2Corresponding author: Southern Cross Plant Science, Southern Cross University, for either macadamia or within the Proteaceae. Transcriptomic 1 Military Road, Lismore NSW 2480, Australia. E-mail: [email protected] data are available for Macadamia (Chabika et al. 2020) and other Volume 10 | October 2020 | 3497 Figure 1 Macadamia integrifolia (a) orchard (b) nut in husk (c) racemes. Proteaceae genera including Banksia (Lim et al. 2017), Grevillea MATERIALS AND METHODS (Damerval et al. 2019), Protea (Akman et al. 2016) and Gevuina (Ostria-Gallardo et al. 2016) andhavebeenutilizedtohelpto Sample collection and extraction , fl understand the evolution of floral architecture (Damerval et al. Leaf, shoot and ower tissue were collected from a single M. 2019) and variation in locally adaptive traits (Akman et al. 2016). integrifolia cultivar HAES 741 individual located in the M2 Re- ° Recent genome wide association studies in macadamia have iden- gional Variety Trial plot, Clunes, New South Wales, Australia (28 ’ ° ’ tified markers associated with commercially important traits 43.843 S; 153 23.702 E). An herbarium specimen was submitted to (O’Connor et al. 2020)buttheirlocationandcontextinthegenome the Southern Cross Plant Science Herbarium [accession PHARM- is unknown. An earlier highly fragmented (193,493 scaffolds, N50 13-0813]. On collection, samples were snap frozen in liquid ° 4,745) draft genome assembly of the widely grown M. integrifolia nitrogen or placed on dry ice and stored at -80 prior to extraction. cultivar HAES 741 was constructed from Illumina short read Previously described methods for and DNA/RNA extraction for Illumina (Nock et al. 2016) and PacBio (Furtado 2014) sequencing sequence data (Nock et al. 2016). Here we report on the first annotated and anchored genome were followed. assembly for macadamia and the Proteaceae. Long read PacBio and paired-end Illumina sequence data were generated and 187 Gb of data Library preparation and sequencing were used to construct an improved HAES 741 assembly (Table 1). In addition to the original 480 bp and 700 bp insert and 8 kb mate pair Assembled sequence scaffolds from the new hybrid de novo assembly libraries (51.6 Gb total), new libraries with 200, 350 and 550 bp insert (745 Mb, 4094 scaffolds, N50 413 kb) were then anchored and sizes were prepared with TruSeq DNA PCR-free kits and sequenced oriented to 14 pseudo-chromosomes using methods to maximize with Illumina HiSeq 2500 producing 57.0, 29.8 and 29.6 Gb paired- colinearity across seven genetic linkage maps (Langdon et al. 2020). A end sequence data. A PacBio library (20-Kb) was prepared and contiguous genome sequence for macadamia is expected to facilitate sequenced across 10 SMRT cells on a PacBio RSII system (P6-C4 the identification of candidate genes and support marker assisted chemistry) generating 6.44 Gb data. Transcriptome sequencing of selection to accelerate the development of new improved varieties. leaf, shoot and flower tissue followed previously described methods 3498 | C. J. Nock et al. n■ Table 1 Data files and library information for Macadamia integrifolia genome sequencing. ÃData deposited for draft assembly v1.1 Sequencing Insert Read Sequence Sequence Platform Library size (bp) length (bp) reads (Million) bases (Gb) Accession Genomic data Illumina Pair end 200 2 x 125 228.8 57.2 SRR10896963 Pair end 350 2 x 125 119.0 29.8 SRR10896962 Pair end 550 2 x 125 107.6 26.9 SRR10896961 Pair end 480 2 x 150 58.7 17.7 ERX1468522-23Ã Pair end 700 2 x 150 28.7 8.7 ERX1468524Ã Mate pair 8000 2 x 100 200.3 40.5 ERX1468525Ã PacBio NA 9769 1.0 6.4 10896960 Total gDNA 744.1 187.2 Transcriptomic data Illumina young leaf 300 2 x 100 77.9 15.6 SRR10897159 shoot 300 2 x 100 71.6 14.3 SRR10897158 flower 300 2 x 100 83.3 16.7 SRR10897157 Total RNA-seq 232.8 46.6 (Nock et al. 2016) and generated 44.6 Gb paired-end RNA-seq data in plants (Sun et al. 2018). Genome-wide heterozygosity using reads total (Table 1). data were estimated with GenomeScope (Vurture et al. 2017) from the k-mer 25 histogram computed using Jellyfish. In addition, De novo assembly and anchoring genome-based heterozygosity was determined by mapping post-QC Illumina raw read data were initially assessed for quality using short read data to the assembly using minimap2 (Li 2018) and FastQC (Andrews 2010)Á Low quality bases (Q , 20), adapters heterozygous sites identified using GATK best practices (Van der and chloroplast reads were removed from Illumina data using Auwera et al. 2013) for SNP discovery. BBMaps (Bushnell 2020). PacBio reads were error-corrected using the Long Read DBG Error Correction method (LoRDEC) with over Genome annotation and gene prediction 170 · post-QC Illumina data (Salmela and Rivals 2014). The per- Repetitive elements were first identified in the final assembly by formance of hybrid error correction methods including LoRDEC is modeling repeats using RepeatModeler, and then quantified using optimized with increasing coverage of more accurate short, rather RepeatMasker (Tarailo-Graovac and Chen 2009). Transcriptome than long read, sequence data since each long read is corrected assembly was performed using the Trinity pipeline (Grabherr independently (Fu et al. 2019). Hybrid de novo assembly, scaffolding et al. 2011). Following repeat masking, the final assembly was and gap closing was performed using MaSuRCA (Zimin et al. 2013). annotated using the MAKER gene model prediction pipeline Illumina paired-end and mate-pair reads were extended into super- (Campbell et al. 2014). Sources of evidence for gene prediction reads of variable lengths, and combined with PacBio reads to generate included the Trinity assembled transcripts and protein sequences the assembly. The MaSURCA output was further scaffolded in two of the taxonomically closest available genome sequence of the sacred rounds, first using SSPace (Boetzer et al. 2011) scaffolder with the lotus Nelumbo nucifera, and the model plant Arabidopsis thaliana. Illumina mate pair reads, and second, using L_RNA_Scaffolder (Xue et al.