bioRxiv preprint doi: https://doi.org/10.1101/2020.05.29.121947; this version posted May 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

A chromosome-level genome assembly of the woolly apple aphid,

Eriosoma lanigerum (Hausmann) (Hemiptera: Aphididae)

Roberto Biello1, Archana Singh2, Cindayniah J. Godfrey3, Felicidad Fernández Fernández3,

Sam T. Mugford1, Glen Powell3, Saskia A. Hogenhout1 and Thomas C. Mathers1*

1Department of Crop Genetics, John Innes Centre, Norwich Research Park, Norwich, United

Kingdom

2Earlham Institute, John Innes Centre, Norwich Research Park, Norwich, United Kingdom

3NIAB EMR, New Road, East Malling, Kent, ME19 6BJ, United Kingdom

*Corresponding author


ABSTRACT

Woolly apple aphid (WAA, Eriosoma lanigerum Hausmann) (Hemiptera: Aphididae) is a

major pest of apple trees (Malus domestica, order Rosales) and is critical to the economics of

the apple industry in most parts of the world. Here, we generated a chromosome-level genome

assembly of WAA – representing the first genome sequence from the aphid subfamily

Eriosomatinae – using a combination of 10X Genomics linked-reads and in vivo Hi-C data.

The final genome assembly is 327 Mb, with 91% of the assembled sequences anchored into

six chromosomes. The contig and scaffold N50 values are 158 kb and 71 Mb, respectively, and

we predicted a total of 28,186 protein-coding genes. The assembly is highly complete,

including 97% of conserved single-copy orthologues based on BUSCO analysis.

Phylogenomic analysis of WAA and nine previously published aphid genomes, spanning four

aphid tribes and three subfamilies, reveals that the tribe Eriosomatini (represented by WAA) is

recovered as a sister group to Aphidini + Macrosiphini (subfamily Aphidinae). We identified

syntenic blocks of genes between our WAA assembly and the genomes of other aphid species

and find that two WAA chromosomes (El5 and El6) map to the conserved Macrosiphini and

Aphidini X chromosome. Our high-quality WAA genome assembly and annotation provides a

valuable resource for research in a broad range of areas such as comparative and population

genomics, insect-plant interactions and pest resistance management.

KEYWORDS

Hemiptera, plant pest, comparative genomics, phylogenetics, sex chromosome.

1 INTRODUCTION

There are approximately 5,000 known species of aphid (Hemiptera: Aphididae), and

approximately 100 of these are of significant agricultural economic importance (Blackman &

Eastop, 2017). While aphid genomics research has mostly focussed on the subfamily Aphidinae


(International Aphid Genomics Consortium, 2010; Y. Li, Park, Smith, & Moran, 2019;

Mathers, 2020; Mathers et al., 2017; Mathers, Mugford, et al., 2020; Mathers, Wouters, et al.,

2020; Nicholson et al., 2015; Thorpe et al., 2018; Wenger et al., 2016), a large and widespread

group including many important pests (Blackman & Eastop, 2000), investigation of genome

evolution across the wider diversity of aphids has been limited (but see Julca et al., 2020). This

report represents the first complete genome sequence for a species from the Eriosomatinae

subfamily, which includes three potentially polyphyletic tribes (Eriosomatini, Fordini and

Pemphigini), all of which are distantly related to members of Aphidinae (X. Li, Jiang, & Qiao,

2014; Nováková et al., 2013; Ortiz-Rivas & Martínez-Torres, 2010).

Woolly apple aphid (WAA, Eriosoma lanigerum Hausmann, tribe Eriosomatini) is probably

North American in origin. It was likely introduced to Britain in 1796 or 1797 with infested

apple trees imported from America (Theobald, 1920) and has subsequently become a

cosmopolitan and highly-damaging pest of apple worldwide (Barbagallo et al., 1997). Up to

twenty generations per year on apple have been reported. First-instar nymphs (crawlers) are

dispersive and walk to new feeding sites, but later developmental stages tend to be more

sedentary, forming colonies distinguishable based on their fluffy white protective wax coating

(Barbagallo et al., 1997). The aphid is able to feed on apple roots, trunks, branches and shoots.

Saliva secreted whilst feeding causes cambium cells to divide rapidly, forming a gall which

collapses and becomes pulpy under pressure from proliferating cells, making the area

vulnerable to fungal infection (Barbagallo et al., 1997; Childs, 1929; Staniland, 1924).

Edaphic (soil-dwelling) WAA has a significant negative effect on plant growth, especially of

young, non-bearing apple trees where feeding significantly reduces trunk diameter (Brown,

Schmitt, Ranger, & Hogmire, 1995), which is strongly correlated with fruit production (Waring,

1920). Such widespread and significant damage has made WAA resistance a key objective for

rootstock breeding since the early 20th century (Cummins & Aldwinckle, 1983).


Sexual and asexual reproduction in aphid life cycles have essential impacts on population

structure and allelic diversity (Delmotte, Leterme, Gauthier, Rispe, & Simon, 2002). Species

in the subfamily Eriosomatinae typically go through phases of both sexual and parthenogenetic

reproduction during the life cycle, but (unlike in other aphid subfamilies) the sexual males and

egg-laying females lack mouthparts and are therefore unable to feed. The life cycle and mode

of reproduction of WAA are somewhat ambiguous. In North America, where WAA populations

have been reported to induce leaf galls on American elm (Ulmus americana L.) (Baker, 1915),

it has been suggested that WAA has a heteroecious life cycle, alternating between apple

(parthenogenetic reproduction) and elm (sexual reproduction). However, it is not clear whether

genotypes found on elm are also capable of feeding on apple (Blackman & Eastop, 2000). In

other parts of the world, WAA is assumed to be entirely anholocyclic (asexual) on apple (e.g.

Dumbleton & Jeffreys, 1938; Eastop, 1966). Sexual males and females have sometimes been

observed on apple, but the deposited eggs usually do not hatch, and such populations are

assumed to be functionally asexual (Blackman & Eastop, 2000).

The genetic structure of WAA populations also has important practical relevance in the context

of host-plant resistance. Four genes associated with WAA resistance in apple (Er1-4) have

been identified from various sources (Bus et al., 2010; King et al., 1991; Ranatunga, Gardiner,

Bassett, Rikkerink, & Bus, 1999). Some genotypes of WAA have been observed to feed on

rootstocks with Er1-, Er2-, and Er3-mediated resistance (Cummins & Aldwinckle, 1983; Rock

& Zeiger, 1974; Sandanayaka, Bus, Connolly, & Newcomb, 2003). The prevalence and spread

of such resistance-breaking genotypes within WAA populations has not yet been investigated,

and the availability of a full genome sequence will benefit such studies considerably.

Here, we generated a high-quality chromosome-level genome assembly of WAA using a

combination of 10X Genomics linked-reads and in vivo chromatin conformation capture (Hi-C)

sequencing. Subsequently, gene prediction, functional annotation and phylogenetic analysis


were carried out to determine the relationship of WAA within the Aphidoidea superfamily. Our

reference genome can provide information about genome organization in the Eriosomatinae

subfamily and allows comparative genomic studies for a better understanding of the evolution

of aphids.

2 MATERIALS AND METHODS

2.1 Sampling

Aphids were collected from a population feeding on apple trees grown under glass at NIAB

EMR. The glasshouse-grown plants were deliberately infested using WAA collected from local

orchards. A single colony infesting a glasshouse-grown potted apple plant (a clonally-

propagated aphid-susceptible breeding line in the rootstock breeding programme at NIAB

EMR) was sampled for aphids in November 2018. All sampled aphids were collected from one

(12 cm) section of infested woody stem. Wax filaments covering sampled aphids were

removed using a fine paint brush and the aphids placed in Eppendorf tubes. Adult individuals

(20 in total) were collected into separate tubes for genome analysis. An additional three

samples of grouped aphids (25 in total) were pooled into individual tubes for RNA-Seq

analysis. These groups consisted of: apterous (wingless) adults; mid-instar nymphs (mix of

second, third and fourth instars); mixed stages (all of the above life stages present). Collected

aphids were snap frozen by immersing tubes in liquid nitrogen.

2.2 DNA extraction and sequencing

Total genomic DNA was extracted from a single aphid using the Illustra Nucleon Phytopure

kit (GE Healthcare) with a modified version of the manufacturer’s protocol. The lysis step was performed

by adding 10 µl of Proteinase K and incubating the sample in a water bath for 2 hours at 55°C.

DNA was precipitated using NaAc (3 M) together with isopropanol for


increasing the DNA yield. DNA quality and quantity were assessed using a Nanodrop

spectrophotometer (Thermo Scientific), a Qubit double-stranded DNA BR Assay Kit

(Invitrogen, Thermo Fisher Scientific) and Femto fragment analyser (Agilent). 10X Genomics

library preparation and Illumina genome sequencing (HiSeq X, 150bp paired-end) were

performed by Novogene Bioinformatics Technology Co, Beijing, China, in accordance with

standard protocols.

A pooled sample of mixed life stages was used to construct a Hi-C chromatin contact map to enable

a chromosome-level assembly. Dovetail Genomics created the Hi-C library with the DpnII

restriction enzyme following a similar protocol to Lieberman-Aiden et al. (2009). The Hi-C library

was sequenced on an Illumina HiSeq X sequencer, generating 150 bp paired-end reads.

2.3 Genome assembly

To create the de novo assembly, the 10X Genomics linked-read data were assembled using

Supernova 2.1.1 (Weisenfeld, Kumar, Shah, Church, & Jaffe, 2017) with the default

parameters and the recommended number of reads (--maxreads=199222871) to produce the

pseudohaplotype assembly output (--style=pseudohap). We improved the initial Supernova

assembly by performing iterative scaffolding using all of the 10X Genomics raw data (364

million reads; Supplementary Table 1) following the procedure set out in Mathers,

Wouters, et al. (2020). Briefly, we performed two rounds of Scaff10X

(https://github.com/wtsi-hpag/Scaff10X) with the parameters “-longread 1 -edge 50000 -block 50000”, followed by

mis-assembly detection and correction with Tigmint (Jackman et al., 2018). These steps were

followed by a final round of scaffolding with ARCS (Yeo, Coombe, Warren, Chu, & Birol,

2018). We aligned the Hi-C reads to the 10x Genomics assembly using the Juicer pipeline

(Durand et al., 2016). The assembly was then scaffolded with Hi-C data (Supplementary

Table 1) into chromosome-level organization using the 3D-DNA pipeline (Dudchenko et al.,


2017), followed by manual correction using Juicebox Assembly Tools (JBAT) (Dudchenko et

al., 2018). The assembly was polished after JBAT review using the 3D-DNA seal module to

reintegrate genomic content removed from super-scaffolds through false-positive manual

editing to create a final scaffolded assembly.

We checked the Hi-C assembly for contamination using the BlobTools pipeline v0.9.19

(Kumar, Jones, Koutsovoulos, Clarke, & Blaxter, 2013; Laetsch & Blaxter, 2017) by

generating taxon-annotated GC content-coverage plots (known as “BlobPlots”). Each scaffold

was annotated with taxonomy information based on BLASTN v2.2.31 (Camacho et al., 2009)

searches against the NCBI nucleotide database (nt, downloaded 13/10/2017) with the options

“-outfmt '6 qseqid staxids bitscore std sscinames sskingdoms stitle' -culling_limit 5 -evalue 1e-25”.

To calculate average coverage per scaffold, we mapped the 10X Genomics raw reads,

after barcode removal using proc10xG (https://github.com/ucdavis-bioinformatics/proc10xG),

to the assembly using BWA-MEM v0.7.7 (Li, 2013) with default parameters. The resulting

BAM file was sorted with SAMtools v1.3 (Li et al., 2009) and passed to BlobTools along with

the table of BLASTN results. The mitochondrial genome was identified and removed based on

alignment to the WAA mitochondrial genome (NCBI accession number NC_033352.1) with

nucmer v4.0.0.beta2 (Marçais et al., 2018), and patterns of coverage and GC content obtained

from BlobTools. A frozen release was generated for the final assembly with scaffolds renamed

and ordered by size with SeqKit v0.9.1 (Shen, Le, Li, & Hu, 2016). We assessed the quality of

the genome assembly by searching for conserved, single copy, arthropod genes (n=1,066) with

Benchmarking Universal Single-Copy Orthologs (BUSCO) v3.0 (Waterhouse et al., 2018) and

by analysis of k-mer spectra with KAT v2.3.2 (Mapleson, Garcia Accinelli, Kettleborough,

Wright, & Clavijo, 2017) using the default k-mer size (k = 27). To generate a k-mer spectrum,

we compared the k-mer content of the raw sequencing reads to the k-mer content of the assembly


using KAT comp. Using this spectrum, we also estimated the WAA genome size, heterozygosity

level and genome assembly completeness.
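The read-versus-assembly k-mer comparison described above can be illustrated with a minimal sketch. This is not the KAT implementation; the function names (`count_kmers`, `spectra_cn`) are hypothetical, and a toy k-mer size is used in the test rather than k = 27. The idea is to count canonical k-mers in the reads and in the assembly, then tabulate how many distinct read k-mers occur at each (read multiplicity, assembly copy number) combination. K-mers with assembly copy number 0 correspond to content present in the reads but missing from the assembly.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq, k):
    """Yield the canonical form (lexicographic minimum of a k-mer and its
    reverse complement) of every k-mer made only of A, C, G and T."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield min(kmer, revcomp(kmer))


def count_kmers(seqs, k):
    """Count canonical k-mers across an iterable of sequences."""
    counts = Counter()
    for seq in seqs:
        counts.update(canonical_kmers(seq, k))
    return counts


def spectra_cn(read_counts, asm_counts):
    """Map (read multiplicity, assembly copy number) to the number of
    distinct k-mers observed in the reads with that combination."""
    matrix = Counter()
    for kmer, n_reads in read_counts.items():
        matrix[(n_reads, asm_counts.get(kmer, 0))] += 1
    return matrix
```

Plotting the counts for each assembly copy number against read multiplicity gives the kind of stacked spectrum shown by KAT, although the real tool uses far more memory-efficient k-mer hashing.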

2.4 RNA extraction and sequencing

Total RNA was extracted from two pools of approximately 25 aphids, one containing adults

and one containing mid-instar nymphs, collected from the same colony. Each sample was

ground under liquid nitrogen in a 1.5 ml Eppendorf tube using a plastic pestle. RNA was

extracted using Trizol (Sigma) according to the manufacturer’s protocol. RNA was further

purified using RNeasy with on-column DNAse treatment (Qiagen) according to the

manufacturer’s protocol and eluted in 100 µl of nuclease-free water. RNA quality was assessed

by electrophoresis of 5 µl denatured in formamide on a 1% agarose gel. Purity was assessed

using a Nanodrop spectrophotometer (ThermoFisher) to measure the A260/A280 and A260/A230 ratios.

Concentration of RNA was measured, and the presence of contaminating DNA was assessed

using a Qubit (Lifetech).

Quality control and trimming for adapters and low-quality bases (quality score < 30) of the

RNA-seq raw reads were performed using FastQC v0.11.8 (Andrews, 2010) and trim_galore

v0.5.0 (http://www.bioinformatics.babraham.ac.uk/projects/trim_galore) respectively.

2.5 Gene prediction

Before running the gene prediction, we identified repeats and transposable elements (TEs) with

RepeatMasker v4.0.7 (Tarailo-Graovac & Chen, 2009) using the RepBase Insecta repeat

library (Bao, Kojima, & Kohany, 2015) with the parameters “-e ncbi -species insecta -a -xsmall

-gff” (Jurka et al., 2005).

We mapped the quality and adapter trimmed RNA-seq reads (Supplementary Table 2) to the

soft-masked assembly with HISAT2 v2.0.5 (Kim, Langmead, & Salzberg, 2015) with the


following parameters: “--max-intronlen 25000 --dta-cufflinks” followed by sorting and

indexing with SAMtools v1.3 (Li et al., 2009). Strand-specific RNA-seq alignments were split

by forward and reverse strands and passed to BRAKER2 as separate BAM files. We then

ran BRAKER2 v2.1.2 (Hoff, Lange, Lomsadze, Borodovsky, & Stanke, 2016; Hoff,

Lomsadze, Borodovsky, & Stanke, 2019) to train AUGUSTUS (Lomsadze, Burns, &

Borodovsky, 2014; Stanke, Diekhans, Baertsch, & Haussler, 2008) and predict protein-coding

genes, incorporating evidence from the RNA-seq alignments and alignment of BUSCO genes

with the following parameters “--softmasking --gff3 --prg=gth --gth2traingenes”. After gene

prediction, completeness of the gene set was checked with BUSCO using the longest transcript

of each gene as the representative transcript.
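Selecting the longest transcript per gene as its representative can be sketched as follows. The input layout, a list of (gene ID, transcript ID, length) tuples, and the function name are assumptions made for illustration; in practice these values would be parsed from the annotation GFF3 file.

```python
def longest_transcripts(transcripts):
    """Return {gene_id: transcript_id} keeping the longest transcript of
    each gene. `transcripts` is an iterable of
    (gene_id, transcript_id, length) tuples; ties keep the first seen."""
    best = {}
    for gene_id, tx_id, length in transcripts:
        if gene_id not in best or length > best[gene_id][1]:
            best[gene_id] = (tx_id, length)
    return {gene_id: tx_id for gene_id, (tx_id, _) in best.items()}
```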

2.6 Functional annotation

All the unique transcripts were converted to peptide sequence using cufflinks v2.2.1 (Trapnell

et al., 2010). Sequences were searched against the non-redundant NCBI protein database using

BLASTP v2.6.0 with an E-value cut-off of ≤ 1 × 10⁻⁵. BLAST2GO v5.0 (Conesa et al., 2005)

and InterProScan v2.5.0 (Quevillon et al., 2005) were used to assign Gene Ontology (GO)

terms. Protein domains were annotated by searching against the InterPro v32.0 (Hunter et al.,

2012) and Pfam v27.0 (Punta et al., 2012) databases, using InterProScan v2.5.0 (Quevillon et

al., 2005) and HMMER v3.1 (Finn, Clements, & Eddy, 2011), respectively. The pathways in

which the genes might be involved were assigned by BLAST against the KEGG databases

(release 53), with an E-value cut-off of 1 × 10⁻⁵.
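Applying an E-value cut-off to BLAST results can be sketched as below, assuming the hits were written in the standard 12-column tabular format (E-value in column 11, bit score in column 12). The function name and the choice to keep only the highest-scoring hit per query are illustrative assumptions, not a description of the BLAST2GO or KEGG workflows themselves.

```python
def best_hits(lines, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} for the highest
    bit-score hit per query among hits with evalue <= max_evalue.
    `lines` are rows of standard 12-column BLAST tabular output."""
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best
```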

2.7 Phylogeny and comparative genomics

Orthologous groups in Aphididae genomes were identified from the predicted protein

sequences of WAA and nine other aphid genomes already published (Supplementary Table

3): Myzus cerasi (Mathers, Mugford, et al., 2020; Thorpe et al., 2018), Myzus persicae


(Mathers, Wouters, et al., 2020), Diuraphis noxia (Nicholson et al., 2015), Acyrthosiphon

pisum (Mathers, Wouters, et al., 2020), Pentalonia nigronervosa (Mathers, Mugford, et al.,

2020), Aphis glycines (Mathers, 2020), Rhopalosiphum maidis (Chen et al., 2019),

Rhopalosiphum padi (Thorpe et al., 2018) and Cinara cedri (Julca et al., 2020). As an outgroup,

we included the genome of the silverleaf whitefly Bemisia tabaci (Chen et al., 2016). We used

the longest transcript to represent the gene model when several transcripts of a gene were

annotated. OrthoFinder v2.2.3 (Emms & Kelly, 2015, 2019) with Diamond v0.9.14 (Buchfink,

Xie, & Huson, 2015), MAFFT v7.305 (Katoh & Standley, 2013) and FastTree v2.1.7 (Price,

Dehal, & Arkin, 2009, 2010) were used to cluster proteins into orthogroups, reconstruct gene

trees and estimate the species tree. The OrthoFinder species tree was automatically rooted by

OrthoFinder based on informative gene duplications with STRIDE (Emms & Kelly, 2017).

Gene Ontology (GO) analysis of lineage-specific gene families and genes that have undergone

lineage-specific duplication was carried out using the Bioconductor (Gentleman et al., 2004)

package topGO (Alexa & Rahnenführer, 2009). We used a Fisher exact test to identify

overrepresented GO terms.
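For a single GO term, the over-representation test amounts to a one-sided Fisher exact test on a 2 × 2 table, which is equivalent to the upper tail of a hypergeometric distribution. The sketch below shows that calculation only; it does not reproduce topGO, whose algorithms can additionally account for the structure of the GO graph. The function and argument names are hypothetical.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher exact p-value for over-representation.

    N: genes in the background set
    K: background genes annotated with the GO term
    n: genes in the test set (e.g. lineage-specific duplicates)
    k: test-set genes annotated with the GO term
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)
```

Because many GO terms are tested, the resulting p-values would normally be interpreted with some form of multiple-testing consideration.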

2.8 Synteny analysis

Syntenic blocks of genes were identified between the chromosome-level genome assemblies

of WAA, M. persicae, A. pisum and R. maidis (see Supplementary Table 3 for details of

assembly and annotation versions used) using MCScanX v1.1 (Wang et al., 2012). For each

comparison, we carried out an all vs. all BLAST search of annotated protein sequences using

BLASTALL v2.2.22 (Altschul, Gish, Miller, Myers, & Lipman, 1990) with the options “-p

blastp -e 1e-10 -b 5 -v 5 -m 8” and ran MCScanX with the parameters “-s 5 -b 2”, requiring

synteny blocks to contain at least five consecutive genes and to have a gap of no more than 25

genes. MCScanX results were visualised with SynVisio (https://synvisio.github.io/#/).
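The two block criteria (at least five genes, gaps of no more than 25 genes) can be illustrated with a simplified greedy sketch. This is not the MCScanX algorithm, which scores collinear chains by dynamic programming and also handles inverted blocks; the sketch only chains forward-oriented anchors, and the function name and input layout (gene-order indices of homologous gene pairs on two chromosomes) are assumptions.

```python
def collinear_blocks(anchors, min_genes=5, max_gap=25):
    """Group homologous gene pairs into forward-oriented collinear blocks.

    anchors: iterable of (i, j), the gene-order index of a gene on
    chromosome A and of its homologue on chromosome B.
    Consecutive anchors are chained when both indices increase and at
    most `max_gap` genes intervene on either chromosome. Blocks with
    fewer than `min_genes` anchors are discarded.
    """
    blocks, current = [], []
    for i, j in sorted(anchors):
        if current:
            prev_i, prev_j = current[-1]
            chained = (0 < i - prev_i <= max_gap + 1
                       and 0 < j - prev_j <= max_gap + 1)
            if not chained:
                if len(current) >= min_genes:
                    blocks.append(current)
                current = []
        current.append((i, j))
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks
```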


3 RESULTS AND DISCUSSION

3.1 Genome sequencing and assembly

We generated a high-quality chromosome-level genome assembly of WAA using a

combination of 10X Genomics linked-reads and in vivo Hi-C data (Figure 1a). In total, we

generated 54.72 Gb of 10X Genomics linked-reads and 71.95 Gb of Hi-C reads, corresponding

to 167x and 220x coverage of the WAA genome, respectively. Initial de novo assembly of the

10X Genomics linked-reads with Supernova produced a contiguous assembly totalling 330 Mb

(Table 1; scaffold N50 = 4.16 Mb). We further improved the Supernova assembly by iterative

scaffolding and misjoin correction with Scaff10X (two rounds) and Tigmint (one round),

increasing the scaffold N50 of the assembly to 4.22 Mb and reducing the number of scaffolds

from 8,967 to 7,929, with the longest scaffold spanning 12.58 Mb (Table 1). To generate

chromosome length super-scaffolds, we scaffolded the 10X assembly using in vivo Hi-C data.

After manual curation, the final assembly comprised 327 Mb, with 91% of the assembly

anchored into six chromosomes (Figure 1a), consistent with the WAA 2n karyotype being 12

chromosomes (Gautam & Verma, 1982, 1983; Kulkarni, 1984; Robinson & Chen, 1969). The

lengths of the six chromosomes ranged from 29.68 to 71.23 Mb.
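The reported fold-coverage values follow from dividing the sequenced bases by the genome size. The short check below assumes the 327 Mb final assembly size was used as the denominator, which the text does not state explicitly but which is consistent with the reported figures.

```python
def fold_coverage(sequenced_bases, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    return sequenced_bases / genome_size


assembly_size = 327e6  # final assembly size in bp (assumed denominator)
linked = fold_coverage(54.72e9, assembly_size)  # 10X Genomics linked reads
hic = fold_coverage(71.95e9, assembly_size)     # Hi-C reads
print(f"10X linked reads: {linked:.0f}x, Hi-C: {hic:.0f}x")
# prints "10X linked reads: 167x, Hi-C: 220x"
```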


Table 1. Genome summary statistics for each step of the assembly of WAA. SP = Supernova; SC = Scaff10X; TG = Tigmint; HC = Hi-C.

| Assembly | SP | SP+SC+TG | SP+SC+TG+HC |
|---|---|---|---|
| Base pairs (Mb) | 330 | 330 | 327 |
| Number of contigs | 12,566 | 12,703 | 12,065 |
| Contig N50 (Mb) | 0.158 | 0.158 | 0.158 |
| Number of scaffolds | 8,967 | 7,929 | 7,146 |
| Scaffold N50 (Mb) | 4.164 | 4.222 | 62.861 |
| Longest scaffold (Mb) | 16.081 | 12.584 | 71.231 |
| % of assembly in chromosome-length scaffolds | 0 | 0 | 91 |
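For readers unfamiliar with the N50 statistic reported in Table 1, a minimal sketch of its calculation is given here: N50 is the length of the sequence at which the cumulative length, taken from longest to shortest, first reaches half of the total. The helper that derives contig lengths by splitting scaffolds at runs of N is an assumption about how contigs were defined; assembly tools differ in the minimum gap length they use.

```python
import re


def n50(lengths):
    """Length L such that sequences of length >= L contain at least
    half of the total bases. Returns 0 for an empty input."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def contig_lengths(scaffold_seq):
    """Split a scaffold at runs of N (gaps) and return contig lengths."""
    return [len(c) for c in re.split(r"[Nn]+", scaffold_seq) if c]
```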

Figure 1. Chromosome-level genome assembly of WAA. (a) Heatmap showing frequency of Hi-C contacts along WAA genome assembly. Blue lines indicate super scaffolds and green lines show contigs. Genome scaffolds are ordered from longest to shortest with the X and Y axis showing cumulative length in millions of base pairs (Mb). (b) KAT k-mer plots comparing k-mer content of 10X Genomics raw reads (barcodes removed) with Hi-C assembly. The black area of the graphs represents the distribution of k-mers present in the reads but not in the assembly and the red area represents the distribution of k-mers present in the reads and once in the assembly. Other colours show k-mers found multiple times in the genome assembly. (c) Genome assembly completeness assessed by the recovery of universal single-copy genes (BUSCOs) using the Arthropoda gene set (n=1,066). BUSCO assessment result for Eriosoma lanigerum (in bold) compared with aphid genomes available from National Center for Biotechnology Information (NCBI). The species are coloured by aphid tribe (see Figure 2).


The WAA genome assembly is accurate, complete and free from contamination. K-mer

analysis comparing genomic content of the 10X reads (after barcode removal) with the WAA

genome assembly reveals little missing single-copy genome content and very low levels of

duplicated content caused by the assembly of haplotigs (Figure 1b). Furthermore, our 327 Mb

WAA genome assembly is close to the genome size estimate based on k-mer analysis with

KAT (363 Mb) (Figure 1b; Supplementary Table 4) and the genome size of Eriosoma

americanum (330 Mb) (Finston, Hebert, & Foottit, 1995). This analysis is further supported by

high representation of conserved arthropod genes (n=1,066) in the assembly, with 97%

(n=1,032) found as complete single copies. Indeed, the WAA assembly contains the highest

number of conserved single-copy Arthropoda genes of any published aphid genome (Figure

1c). A taxon-annotated GC content-coverage plot (known as a “BlobPlot”; Kumar, Jones,

Koutsovoulos, Clarke, & Blaxter, 2013) revealed the co-assembly of the obligate aphid

bacterial symbiont Buchnera aphidicola (Baumann et al., 1995; Douglas, 1998; Hansen &

Moran, 2011; Shigenobu & Wilson, 2011) and a secondary symbiont, Serratia symbiotica

(Burke & Moran, 2011; Manzano-Marín & Latorre, 2016; Moran, Russell, Koga, & Fukatsu,

2005) (Supplementary Figure 1). These bacterial scaffolds were filtered from the final

assembly, along with scaffolds showing atypical GC content and read coverage, leaving the

final assembly free from obvious contamination (Supplementary Figure 2). The B. aphidicola

genome was assembled into 99 scaffolds totalling 444 kb (Supplementary Table 5). The S.

symbiotica genome was fragmented and incomplete (165 kb total length, 30 scaffolds, N50 =

9 kb).
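Flagging scaffolds with atypical GC content or read coverage can be sketched as a simple range filter. The thresholds in the test are hypothetical values chosen for illustration only; the cut-offs actually applied are not given in the text, and the function names are likewise assumptions. BlobTools itself additionally uses the taxonomic assignment of each scaffold.

```python
def gc_fraction(seq):
    """Fraction of G and C among unambiguous (A, C, G, T) bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def flag_atypical(scaffolds, gc_range, cov_range):
    """Return names of scaffolds whose GC fraction or mean coverage lies
    outside the given (min, max) ranges.

    scaffolds: dict mapping name -> (sequence, mean_read_coverage).
    """
    flagged = []
    for name, (seq, coverage) in scaffolds.items():
        gc = gc_fraction(seq)
        gc_ok = gc_range[0] <= gc <= gc_range[1]
        cov_ok = cov_range[0] <= coverage <= cov_range[1]
        if not (gc_ok and cov_ok):
            flagged.append(name)
    return flagged
```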

3.2 Genome annotation

We generated 44 Gb of strand-specific RNA-seq data to aid genome annotation. A total of

28,186 protein-coding genes (28,297 transcripts) were predicted in the WAA genome

assembly. Completeness of the gene set reflected that of the genome assembly, with 97.3% of


the BUSCO Arthropoda gene set found as complete copies (95.6% complete and single copy;

Supplementary Figure 3). In total, 55.7% (15,477) of the predicted transcripts were

functionally annotated with at least one GO-term and/or protein domain. 9,903 transcripts were

assigned GO term annotations using BLAST2GO and 9,934 using InterProScan. 14,228

transcripts contained at least one known InterProScan protein domain.

3.3 Phylogeny and comparative genomics

WAA is the first member of the aphid subfamily Eriosomatinae to have its genome sequenced.

To place this new genome assembly in a phylogenetic context and to investigate gene family

evolution across aphids, we compared the WAA proteome (the complete set of annotated

protein coding genes) to the proteomes of nine other aphid species that have fully sequenced

genomes (Supplementary Table 3) and to the whitefly, Bemisia tabaci MEAM1 (Chen et al.,

2016). In total, we clustered 240,702 proteins into 23,294 orthogroups (gene families) and

26,969 singleton genes (Supplementary Table 6). Maximum likelihood phylogenetic analysis

based on a concatenated alignment of 3,079 conserved single-copy genes produced a fully

resolved species tree with 100% support at all nodes (Figure 2). Eriosomatini (represented by

WAA) is recovered as a sister group to Aphidini + Macrosiphini (Aphidinae), with Lachnini

(represented by C. cedri) placed as an outgroup to all other sequenced aphid species (Figure

2).


Figure 2. Maximum likelihood phylogeny of Eriosoma lanigerum and nine other aphid species based on a concatenated alignment of 3,079 conserved one-to-one orthologs. The tree is rooted with the whitefly B. tabaci MEAM1 (not shown). Clades are coloured by aphid tribe. Branch lengths are in amino acid substitutions per site. The bar chart shows the gene count for each species with orthology relationships among aphids.

The number of predicted genes in WAA (28,186) is within the range of other aphid genomes

(16,992 – 31,001) but among the highest (Figure 2). However, predicted gene numbers can

vary depending on both the quality of the genome assembly and the different pipelines used

for predicting genes (Denton et al., 2014; Yandell & Ence, 2012). Of the 28,186 predicted

genes in the WAA genome, 18,738 (66%) have an ortholog in at least one other aphid species

and 13,278 (47%) are conserved in the majority (at least 9/10) of aphid species (Figure 2). The

high number of genes with orthologs in other aphid species despite the absence of close WAA

relatives in our analysis likely reflects the early emergence of many aphid gene families (Julca

et al., 2020).
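The gene categories used here (genes with an ortholog in at least one other aphid, genes conserved in at least 9 of 10 aphids, and genes in lineage-specific families) can be derived from orthogroup membership, as in the sketch below. The data layout and function name are assumptions; in practice the membership table would come from the OrthoFinder output, restricted to the aphid species.

```python
def classify_genes(orthogroups, focal, n_species_required):
    """Classify genes of the focal species by orthogroup membership.

    orthogroups: dict og_id -> dict species -> list of gene IDs.
    Returns three sets of focal-species genes:
      shared            in an orthogroup with at least one other species
      conserved         in an orthogroup present in at least
                        n_species_required species (including focal)
      lineage_specific  in a multi-gene orthogroup found only in focal
    """
    shared, conserved, lineage_specific = set(), set(), set()
    for members in orthogroups.values():
        focal_genes = members.get(focal, [])
        if not focal_genes:
            continue
        others = [s for s, genes in members.items() if s != focal and genes]
        if others:
            shared.update(focal_genes)
            if len(others) + 1 >= n_species_required:
                conserved.update(focal_genes)
        elif len(focal_genes) > 1:
            lineage_specific.update(focal_genes)
    return shared, conserved, lineage_specific
```

With the counts reported above, 18,738 of 28,186 genes corresponds to 66% and 13,278 of 28,186 to 47%.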

Aphid genomes are also subject to high levels of ongoing gene duplication (Fernández et al.,

2020; International Aphid Genomics Consortium, 2010; Julca et al., 2020; Mathers et al., 2017;

Thorpe et al., 2018). WAA is no exception and we detect a large number of lineage-specific

gene families (689 orthogroups corresponding to 3,954 genes; Figure 2) and identify 9,936


genes that have undergone lineage-specific duplication. These genes are enriched for a diverse

set of functions including sensory perception and metabolic process (Supplementary Table

7; Supplementary Table 8). As additional Eriosomatini genomes become available, the

diversification of these genes and gene families will be investigated in greater detail.
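
As a rough illustration of how lineage-specific gene families can be tallied, the sketch below works on a simplified per-species gene count table. It assumes that a family counts as lineage-specific when all of its genes come from the focal species. That definition, the toy orthogroup IDs, the species labels and the function name are all hypothetical and are not taken from the original pipeline.

```python
from typing import Dict

# Hypothetical toy table: orthogroup -> gene count per species.
counts: Dict[str, Dict[str, int]] = {
    "OG1": {"WAA": 4, "M_persicae": 0, "A_pisum": 0},
    "OG2": {"WAA": 1, "M_persicae": 1, "A_pisum": 1},
    "OG3": {"WAA": 2, "M_persicae": 0, "A_pisum": 0},
    "OG4": {"WAA": 0, "M_persicae": 3, "A_pisum": 1},
}


def lineage_specific(table: Dict[str, Dict[str, int]], focal: str) -> Dict[str, int]:
    """Return orthogroups whose genes all come from the focal species,
    mapped to the number of focal-species genes they contain."""
    return {
        og: per_species[focal]
        for og, per_species in table.items()
        if per_species.get(focal, 0) > 0
        and all(n == 0 for sp, n in per_species.items() if sp != focal)
    }


specific = lineage_specific(counts, "WAA")
n_orthogroups = len(specific)      # number of lineage-specific families
n_genes = sum(specific.values())   # genes contained in those families

# Mean family size implied by the figures reported in the text
# (689 orthogroups corresponding to 3,954 genes).
mean_size_reported = 3954 / 689

print(n_orthogroups, n_genes, round(mean_size_reported, 1))
```

On the reported figures, the lineage-specific families contain between five and six genes on average.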

3.4 X chromosome fragmentation in WAA

We have previously shown that the autosomes of aphids within the tribes Macrosiphini and

Aphidini have undergone extensive rearrangement over the last ~30 million years while the

aphid sex (X) chromosome has been conserved (Mathers, Wouters, et al., 2020). Given that we

now have a chromosome-level genome assembly of an aphid from a third, more divergent tribe,

we used our WAA assembly to investigate the evolution of aphid genome structure. We

identified syntenic blocks of genes between our WAA assembly and the genomes of M.

persicae (Figure 3a) and A. pisum (Figure 3b) from Macrosiphini (Mathers, Wouters, et al.,

2020), and R. maidis (Figure 3c) from Aphidini (Chen et al., 2019). All comparisons reveal

high levels of genome rearrangement between the autosomes. Surprisingly, however, we find

that two WAA chromosomes (El5 and El6) map to the conserved Macrosiphini and Aphidini

X chromosome, suggesting either fragmentation of the X chromosome in WAA or that the

large Aphidinae (Macrosiphini + Aphidini) X chromosome was the result of an ancient fusion

event. Additional chromosome-level assemblies of diverse aphid species will be required to

test these two competing hypotheses. However, the lack of autosomal rearrangements between

either El5 or El6 and the autosomes suggests that recent X chromosome fragmentation in

WAA is more likely.
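
The chromosome-to-chromosome mapping described above can be summarised by counting syntenic gene pairs for each chromosome pair. The sketch below is a minimal illustration under stated assumptions: the input is a simplified list of (WAA chromosome, other-species chromosome) tuples rather than real MCScanX output, and the toy pairs, the autosome labels (A1 to A3) and the function name are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Hypothetical syntenic gene pairs: (WAA chromosome, chromosome in other species).
pairs: List[Tuple[str, str]] = [
    ("El1", "A1"), ("El1", "A2"), ("El1", "A1"),
    ("El2", "A2"), ("El2", "A3"), ("El2", "A3"),
    ("El5", "X"), ("El5", "X"), ("El5", "X"),
    ("El6", "X"), ("El6", "X"), ("El6", "X"),
]


def dominant_targets(gene_pairs: List[Tuple[str, str]]) -> Dict[str, Tuple[str, float]]:
    """For each query chromosome, return the target chromosome holding most of
    its syntenic gene pairs and the fraction of pairs that fall on it."""
    tallies: Dict[str, Counter] = defaultdict(Counter)
    for query_chrom, target_chrom in gene_pairs:
        tallies[query_chrom][target_chrom] += 1
    result = {}
    for query_chrom, tally in tallies.items():
        target, n = tally.most_common(1)[0]
        result[query_chrom] = (target, n / sum(tally.values()))
    return result


dominant = dominant_targets(pairs)
# Query chromosomes whose syntenic genes map mainly to the target X chromosome.
x_linked = sorted(q for q, (target, _) in dominant.items() if target == "X")

print(x_linked)
```

In this toy input, two query chromosomes map entirely to the target X chromosome, while the remaining chromosomes have their gene pairs spread over several autosomes. This is the qualitative pattern reported for El5 and El6.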

Figure 3: Genome reorganisation across aphids and fragmentation of the sex (X) chromosome in WAA. Pairwise synteny relationships are shown between WAA and the chromosome-scale genome assemblies of Myzus persicae (a), Acyrthosiphon pisum (b) and Rhopalosiphum maidis (c) (see Figure 2 for phylogenetic relationships between compared species). Links indicate the boundaries of syntenic blocks of genes identified by MCScanX and are colour coded by WAA chromosome ID. Black arrows along chromosomes indicate reverse complement orientation relative to WAA.

4 CONCLUSIONS

WAA is a widespread pest of apple trees that is particularly critical to the economics of the

apple industry in most parts of the world. This study provides the first chromosome-level

genome assembly of WAA and the first genome sequence of any member of the subfamily

Eriosomatinae. The WAA genome will be useful as a reference for investigation of genetic

differences among wild WAA populations. The high quality of the genome will also allow

molecular marker development and detection using large-scale genome resequencing.

Population resequencing data can be used to investigate genomic regions and functional genes

that show genetic variability and to analyse demographic history events in WAA populations.

Finally, as the only genome available for the Eriosomatinae subfamily and as an additional

outgroup to other sequenced aphids from the subfamily Aphidinae, the WAA genome will

allow more extensive comparative genomics analysis of aphids.

DATA ACCESSIBILITY

The genome raw reads, the RNA sequencing data and the genome assembly are available at

the National Center for Biotechnology Information (NCBI) with the BioProject number

PRJNA623270. The genome assembly and annotation and orthogroup clustering results are

available for download from Zenodo (10.5281/zenodo.3797131).

ACKNOWLEDGMENTS

TCM is funded by a BBSRC Future Leader Fellowship (BB/R01227X/1). Additional support

was received from the BBSRC Institute Strategy Programme (BB/P012574/1) and the John

Innes Foundation. CJG acknowledges receipt of a studentship funded by the BBSRC, AHDB

and an industry consortium (http://www.ctp-fcr.org/industry-partners/).

AUTHOR CONTRIBUTIONS

GP, SAH and TCM conceived the study. GP, FFF and CJG provided samples. RB and STM

extracted DNA and RNA. RB, AS and TCM performed the genome assembly, gene model

prediction, gene annotation, and comparative analyses. RB, CJG, GP, SAH and TCM wrote

the manuscript with input from all authors. All authors reviewed the manuscript.

REFERENCES

Alexa, A., & Rahnenführer, J. (2009). Gene set enrichment analysis with topGO. Bioconductor

Improv, 27.

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local

alignment search tool. Journal of Molecular Biology, 215(3), 403–410.

Andrews, S. (2010). FastQC: a quality control tool for high throughput sequence data.

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

Baker, A. C. (1915). The woolly apple aphis. U.S. Dep. Agric. Rep. 101.

Bao, W., Kojima, K. K., & Kohany, O. (2015). Repbase Update, a database of repetitive

elements in eukaryotic genomes. Mobile Dna, 6(1), 11.

Barbagallo, S., Cravedi, P., Pasqualini, E., Patti, I., & Stroyan, H. L. (1997). Aphids of the

principal fruit-bearing crops. Milano Bayer, pp.117.

Baumann, P., Baumann, L., Lai, C.-Y., Rouhbakhsh, D., Moran, N. A., & Clark, M. A. (1995).

Genetics, physiology, and evolutionary relationships of the genus Buchnera:

intracellular symbionts of aphids. Annual Review of Microbiology, 49(1), 55–94.

Blackman R.L., & Eastop V.F. (2017). Taxonomic issues. In: van Emden HF, Harrington R,

editors. Aphids as Crop Pests. 2nd ed. UK: CAB International.

Blackman, R. L., & Eastop, V. F. (2000). Aphids on the world’s crops: an identification and

information guide. John Wiley & Sons Ltd.

Brown, M., Schmitt, J., Ranger, S., & Hogmire, H. (1995). Yield reduction in apple by edaphic

woolly apple aphid (Homoptera: Aphididae) populations. Journal of Economic

Entomology, 88(1), 127–133.

Buchfink, B., Xie, C., & Huson, D. H. (2015). Fast and sensitive protein alignment using

DIAMOND. Nature Methods, 12(1), 59.

Burke, G. R., & Moran, N. A. (2011). Massive Genomic Decay in Serratia symbiotica, a

Recently Evolved Symbiont of Aphids. Genome Biology and Evolution, 3, 195–208.

doi: 10.1093/gbe/evr002

Bus, V. G., Bassett, H. C., Bowatte, D., Chagné, D., Ranatunga, C. A., Ulluwishewa, D., …

Gardiner, S. E. (2010). Genome mapping of an apple scab, a powdery mildew and a

woolly apple aphid resistance gene from open-pollinated Mildew Immune Selection.

Tree Genetics & Genomes, 6(3), 477–487.

Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., & Madden,

T. L. (2009). BLAST+: architecture and applications. BMC Bioinformatics, 10(1), 421.

Chen, W., Hasegawa, D. K., Kaur, N., Kliot, A., Pinheiro, P. V., Luan, J., … Sun, H. (2016).

The draft genome of whitefly Bemisia tabaci MEAM1, a global crop pest, provides

novel insights into virus transmission, host adaptation, and insecticide resistance. BMC

Biology, 14(1), 110.

Chen, W., Shakir, S., Bigham, M., Richter, A., Fei, Z., & Jander, G. (2019). Genome sequence

of the corn leaf aphid (Rhopalosiphum maidis Fitch). GigaScience, 8(4). doi:

10.1093/gigascience/giz033

Childs, L. (1929). The relation of woolly apple aphis to perennial canker with other

notes on the disease. Station Bulletin, 243. Corvallis, Oregon: Agricultural Experiment

Station Oregon State Agricultural College.

Conesa, A., Götz, S., García-Gómez, J. M., Terol, J., Talón, M., & Robles, M. (2005).

Blast2GO: a universal tool for annotation, visualization and analysis in functional

genomics research. Bioinformatics, 21(18), 3674–3676. doi:

10.1093/bioinformatics/bti610

Cummins, J. N., & Aldwinckle, H. S. (1983). Breeding apple rootstocks. In Plant breeding

reviews (pp. 294–394). Springer.

Delmotte, F., Leterme, N., Gauthier, J., Rispe, C., & Simon, J. (2002). Genetic architecture of

sexual and asexual populations of the aphid Rhopalosiphum padi based on allozyme

and microsatellite markers. Molecular Ecology, 11(4), 711–723.

Denton, J. F., Lugo-Martinez, J., Tucker, A. E., Schrider, D. R., Warren, W. C., & Hahn, M.

W. (2014). Extensive error in the number of genes inferred from draft genome

assemblies. PLoS Comput Biol, 10(12), e1003998.

Douglas, A. (1998). Nutritional interactions in insect-microbial symbioses: aphids and their

symbiotic bacteria Buchnera. Annual Review of Entomology, 43(1), 17–37.

Dudchenko, O., Batra, S. S., Omer, A. D., Nyquist, S. K., Hoeger, M., Durand, N. C., … Aiden,

A. P. (2017). De novo assembly of the Aedes aegypti genome using Hi-C yields

chromosome-length scaffolds. Science, 356(6333), 92–95.

Dudchenko, O., Shamim, M. S., Batra, S., Durand, N. C., Musial, N. T., Mostofa, R., …

Stamenova, E. (2018). The Juicebox Assembly Tools module facilitates de novo

assembly of mammalian genomes with chromosome-length scaffolds for under $1000.

Biorxiv, 254797.

Dumbleton, L. J., & Jeffreys, F. (1938). The control of the woolly aphis by Aphelinus mali.

Department of Scientific and Industrial Research.

Durand, N. C., Shamim, M. S., Machol, I., Rao, S. S., Huntley, M. H., Lander, E. S., & Aiden,

E. L. (2016). Juicer provides a one-click system for analyzing loop-resolution Hi-C

experiments. Cell Systems, 3(1), 95–98.

Eastop, V. F. (1966). A taxonomic study of Australian Aphidoidea (Homoptera). Australian

Journal of Zoology, 14(3), 399–592.

Emms, D. M., & Kelly, S. (2015). OrthoFinder: solving fundamental biases in whole genome

comparisons dramatically improves orthogroup inference accuracy. Genome Biology,

16(1), 157.

Emms, D. M., & Kelly, S. (2017). STRIDE: species tree root inference from gene duplication

events. Molecular Biology and Evolution, 34(12), 3267–3278.

Emms, D. M., & Kelly, S. (2019). OrthoFinder: phylogenetic orthology inference for

comparative genomics. Genome Biology, 20(1), 1–14.

Fernández, R., Marcet-Houben, M., Legeai, F., Richard, G., Robin, S., Wucher, V., … Tagu,

D. (2020). Selection following gene duplication shapes recent genome evolution in the

pea aphid Acyrthosiphon pisum. Molecular Biology and Evolution. doi:

10.1093/molbev/msaa110

Finn, R. D., Clements, J., & Eddy, S. R. (2011). HMMER web server: interactive sequence

similarity searching. Nucleic Acids Research, 39(suppl_2), W29–W37. doi:

10.1093/nar/gkr367

Finston, T. L., Hebert, P. D., & Foottit, R. B. (1995). Genome size variation in aphids. Insect

Biochemistry and Molecular Biology, 25(2), 189–196.

Gautam, D., & Verma, L. (1982). Karyotype analysis and mitotic cycle of woolly apple aphid

(Eriosoma lanigerum Hausmann). Entomon., 7, 167–171.

Gautam, D., & Verma, L. (1983). Chromosome numbers in different morphs of the woolly

apple aphid, Eriosoma lanigerum Hausmann. Chromosome Information Service, (35),

9–10.

Gentleman, R. C., Carey, V. J., Bates, D. M., Bolstad, B., Dettling, M., Dudoit, S., … Gentry,

J. (2004). Bioconductor: open software development for computational biology and

bioinformatics. Genome Biology, 5(10), R80.

Hansen, A. K., & Moran, N. A. (2011). Aphid genome expression reveals host–symbiont

cooperation in the production of amino acids. Proceedings of the National Academy of

Sciences, 108(7), 2849–2854.

Hoff, K. J., Lange, S., Lomsadze, A., Borodovsky, M., & Stanke, M. (2016). BRAKER1:

unsupervised RNA-Seq-based genome annotation with GeneMark-ET and

AUGUSTUS. Bioinformatics, 32(5), 767–769.

Hoff, K. J., Lomsadze, A., Borodovsky, M., & Stanke, M. (2019). Whole-genome annotation

with BRAKER. In Gene Prediction (pp. 65–95). Springer.

Hunter, S., Jones, P., Mitchell, A., Apweiler, R., Attwood, T. K., Bateman, A., … Yong, S.-Y.

(2012). InterPro in 2011: new developments in the family and domain prediction

database. Nucleic Acids Research, 40(D1), D306–D312. doi: 10.1093/nar/gkr948

International Aphid Genomics Consortium. (2010). Genome sequence of the pea aphid

Acyrthosiphon pisum. PLoS Biology, 8(2).

Jackman, S. D., Coombe, L., Chu, J., Warren, R. L., Vandervalk, B. P., Yeo, S., … Jones, S. J.

(2018). Tigmint: correcting assembly errors using linked reads from large molecules.

BMC Bioinformatics, 19(1), 1–10.

Julca, I., Marcet-Houben, M., Cruz, F., Vargas-Chavez, C., Johnston, J. S., Gómez-Garrido, J.,

… Gabaldón, T. (2020). Phylogenomics Identifies an Ancestral Burst of Gene

Duplications Predating the Diversification of Aphidomorpha. Molecular Biology and

Evolution, 37(3), 730–756. doi: 10.1093/molbev/msz261

Jurka, J., Kapitonov, V. V., Pavlicek, A., Klonowski, P., Kohany, O., & Walichiewicz, J.

(2005). Repbase Update, a database of eukaryotic repetitive elements. Cytogenetic and

Genome Research, 110(1–4), 462–467.

Katoh, K., & Standley, D. M. (2013). MAFFT multiple sequence alignment software version

7: improvements in performance and usability. Molecular Biology and Evolution,

30(4), 772–780.

Kim, D., Langmead, B., & Salzberg, S. L. (2015). HISAT: a fast spliced aligner with low

memory requirements. Nature Methods, 12(4), 357–360.

King, G. J., Alston, F., Battle, I., Chevreau, E., Gessler, C., Janse, J., … Schmidt, H. (1991).

The ‘European Apple Genome Mapping Project’-developing a strategy for mapping

genes coding for agronomic characters in tree species. Euphytica, 56(1), 89–94.

Kulkarni, P. (1984). Chromosomes of seven species of aphids (Homoptera: Aphididae).

Bulletin of the Zoological Survey of India, 6, 267–270.

Kumar, S., Jones, M., Koutsovoulos, G., Clarke, M., & Blaxter, M. (2013). Blobology:

exploring raw genome data for contaminants, symbionts and parasites using taxon-

annotated GC-coverage plots. Frontiers in Genetics, 4, 237.

Laetsch, D. R., & Blaxter, M. L. (2017). BlobTools: Interrogation of genome assemblies.

F1000Research, 6(1287), 1287.

Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-

MEM. ArXiv Preprint ArXiv:1303.3997.

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., … Durbin, R. (2009).

The sequence alignment/map format and SAMtools. Bioinformatics, 25(16), 2078–

2079.

Li, X., Jiang, L., & Qiao, G. (2014). Is the subfamily Eriosomatinae (Hemiptera: Aphididae)

monophyletic? Turkish Journal of Zoology, 38(3), 285–297.

Li, Y., Park, H., Smith, T. E., & Moran, N. A. (2019). Gene family evolution in the pea aphid

based on chromosome-level genome assembly. Molecular Biology and Evolution,

36(10), 2143–2156.

Lieberman-Aiden, E., Berkum, N. L. van, Williams, L., Imakaev, M., Ragoczy, T., Telling, A.,

… Dekker, J. (2009). Comprehensive Mapping of Long-Range Interactions Reveals

Folding Principles of the Human Genome. Science, 326(5950), 289–293. doi:

10.1126/science.1181369

Lomsadze, A., Burns, P. D., & Borodovsky, M. (2014). Integration of mapped RNA-Seq reads

into automatic training of eukaryotic gene finding algorithm. Nucleic Acids Research,

42(15), e119–e119.

Manzano-Marín, A., & Latorre, A. (2016). Snapshots of a shrinking partner: Genome reduction

in Serratia symbiotica. Scientific Reports, 6(1), 32590. doi: 10.1038/srep32590

Mapleson, D., Garcia Accinelli, G., Kettleborough, G., Wright, J., & Clavijo, B. J. (2017).

KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies.

Bioinformatics, 33(4), 574–576.

Marçais, G., Delcher, A. L., Phillippy, A. M., Coston, R., Salzberg, S. L., & Zimin, A. (2018).

MUMmer4: a fast and versatile genome alignment system. PLoS Computational

Biology, 14(1), e1005944.

Mathers, T. C. (2020). Improved Genome Assembly and Annotation of the Soybean Aphid

(Aphis glycines Matsumura). G3: Genes|Genomes|Genetics, 10(3),

g3.400954.2019-g3.400954.2019. doi: 10.1534/g3.119.400954

Mathers, T. C., Chen, Y., Kaithakottil, G., Legeai, F., Mugford, S. T., Baa-Puyoulet, P., …

Hogenhout, S. A. (2017). Rapid transcriptional plasticity of duplicated gene clusters

enables a clonally reproducing aphid to colonise diverse plant species. Genome

Biology, 18(1), 27. doi: 10.1186/s13059-016-1145-3

Mathers, T. C., Mugford, S. T., Hogenhout, S. A., & Tripathi, L. (2020). Genome sequence

of the banana aphid, Pentalonia nigronervosa Coquerel (Hemiptera: Aphididae) and its

symbionts. BioRxiv. doi: 10.1101/2020.04.25.060517

Mathers, T. C., Wouters, R. H. M., Mugford, S. T., Swarbreck, D., Oosterhout, C. van, &

Hogenhout, S. A. (2020). Chromosome-scale genome assemblies of aphids reveal

extensively rearranged autosomes and long-term conservation of the X chromosome.

BioRxiv. doi: 10.1101/2020.03.24.006411

Moran, N. A., Russell, J. A., Koga, R., & Fukatsu, T. (2005). Evolutionary Relationships of

Three New Species of Enterobacteriaceae Living as Symbionts of Aphids and Other

Insects. Applied and Environmental Microbiology, 71(6), 3302–3310. doi:

10.1128/AEM.71.6.3302-3310.2005

Nicholson, S. J., Nickerson, M. L., Dean, M., Song, Y., Hoyt, P. R., Rhee, H., … Puterka, G.

J. (2015). The genome of Diuraphis noxia, a global aphid pest of small grains. BMC

Genomics, 16(1), 1–16. doi: 10.1186/s12864-015-1525-1

Nováková, E., Hypša, V., Klein, J., Foottit, R. G., von Dohlen, C. D., & Moran, N. A. (2013).

Reconstructing the phylogeny of aphids (Hemiptera: Aphididae) using DNA of the

obligate symbiont Buchnera aphidicola. Molecular Phylogenetics and Evolution,

68(1), 42–54.

Ortiz-Rivas, B., & Martínez-Torres, D. (2010). Combination of molecular data support the

existence of three main lineages in the phylogeny of aphids (Hemiptera: Aphididae)

and the basal position of the subfamily Lachninae. Molecular Phylogenetics and

Evolution, 55(1), 305–317.

Price, M. N., Dehal, P. S., & Arkin, A. P. (2009). FastTree: computing large minimum

evolution trees with profiles instead of a distance matrix. Molecular Biology and

Evolution, 26(7), 1641–1650.

Price, M. N., Dehal, P. S., & Arkin, A. P. (2010). FastTree 2 – approximately maximum-

likelihood trees for large alignments. PloS One, 5(3).

Punta, M., Coggill, P. C., Eberhardt, R. Y., Mistry, J., Tate, J., Boursnell, C., … Finn, R. D.

(2012). The Pfam protein families database. Nucleic Acids Research, 40(Database

issue), D290–D301. doi: 10.1093/nar/gkr1065

Quevillon, E., Silventoinen, V., Pillai, S., Harte, N., Mulder, N., Apweiler, R., & Lopez, R.

(2005). InterProScan: protein domains identifier. Nucleic Acids Research, 33(suppl_2),

W116–W120.

Ranatunga, C., Gardiner, S., Bassett, H., Rikkerink, E., & Bus, V. (1999). Marker assisted

selection for pest and disease resistance in the New Zealand apple breeding programme.

In Eucarpia symposium on Fruit Breeding and Genetics 538.

Robinson, A., & Chen, Y.-H. (1969). Cytotaxonomy of Aphididae. Canadian Journal of

Zoology, 47(4), 511–516.

Rock, G. C., & Zeiger, D. C. (1974). Woolly apple aphid infests Malling and Malling-Merton

rootstocks in propagation beds in North Carolina. Journal of Economic Entomology,

67(1), 137–138.

Sandanayaka, W., Bus, V., Connolly, P., & Newcomb, R. (2003). Characteristics associated

with woolly apple aphid Eriosoma lanigerum, resistance of three apple rootstocks.

Entomologia Experimentalis et Applicata, 109(1), 63–72.

Shen, W., Le, S., Li, Y., & Hu, F. (2016). SeqKit: a cross-platform and ultrafast toolkit for

FASTA/Q file manipulation. PloS One, 11(10).

Shigenobu, S., & Wilson, A. C. (2011). Genomic revelations of a mutualism: the pea aphid and

its obligate bacterial symbiont. Cellular and Molecular Life Sciences, 68(8), 1297–

1309.

Staniland, L. (1924). The immunity of apple stocks from attacks of woolly aphis (Eriosoma

lanigerum, Hausmann). Part II. The causes of the relative resistance of the stocks.

Bulletin of Entomological Research, 15(2), 157–170.

Stanke, M., Diekhans, M., Baertsch, R., & Haussler, D. (2008). Using native and syntenically

mapped cDNA alignments to improve de novo gene finding. Bioinformatics, 24(5),

637–644.

Tarailo-Graovac, M., & Chen, N. (2009). Using RepeatMasker to identify repetitive elements

in genomic sequences. Current Protocols in Bioinformatics, 25(1), 4–10.

Theobald, F. V. (1920). The woolly aphid of the apple and elm. Journal of Pomology and

Horticultural Science, 2(2), 73–92.

Thorpe, P., Escudero-Martinez, C. M., Cock, P. J. A., Eves-van den Akker, S., & Bos, J. I. B.

(2018). Shared Transcriptional Control and Disparate Gain and Loss of Aphid

Parasitism Genes. Genome Biology and Evolution, 10(10), 2716–2733. doi:

10.1093/gbe/evy183

Trapnell, C., Williams, B. A., Pertea, G., Mortazavi, A., Kwan, G., Van Baren, M. J., …

Pachter, L. (2010). Transcript assembly and quantification by RNA-Seq reveals

unannotated transcripts and isoform switching during cell differentiation. Nature

Biotechnology, 28(5), 511.

Wang, Y., Tang, H., DeBarry, J. D., Tan, X., Li, J., Wang, X., … Guo, H. (2012). MCScanX:

a toolkit for detection and evolutionary analysis of gene synteny and collinearity.

Nucleic Acids Research, 40(7), e49–e49.

Waring, J. (1920). The probable value of trunk circumference as an adjunct to fruit yield in

interpreting apple orchard experiments. American Society of Horticultural Science,

179–185.

Waterhouse, R. M., Seppey, M., Simão, F. A., Manni, M., Ioannidis, P., Klioutchnikov, G., …

Zdobnov, E. M. (2018). BUSCO applications from quality assessments to gene

prediction and phylogenomics. Molecular Biology and Evolution, 35(3), 543–548.

Weisenfeld, N. I., Kumar, V., Shah, P., Church, D. M., & Jaffe, D. B. (2017). Direct

determination of diploid genome sequences. Genome Research, 27(5), 757–767.

Wenger, J. A., Cassone, B. J., Legeai, F., Johnston, J. S., Bansal, R., Yates, A. D., … Michel,

A. (2016). Whole genome sequence of the soybean aphid, Aphis glycines. Insect

Biochemistry and Molecular Biology. doi: 10.1016/j.ibmb.2017.01.005

Yandell, M., & Ence, D. (2012). A beginner’s guide to eukaryotic genome annotation. Nature

Reviews Genetics, 13(5), 329–342.

Yeo, S., Coombe, L., Warren, R. L., Chu, J., & Birol, I. (2018). ARCS: scaffolding genome

drafts with linked reads. Bioinformatics, 34(5), 725–731.
