Supporting Information

Boothby et al. 10.1073/pnas.1510461112

SI Text

Repetitive Sequences in the H. dujardini Genome. A total of 4,440 (10.72%) of the PacBio contigs, making up 7.91% of the assembled sequence, do not have direct homologous matches in the Illumina assembly. These contigs may represent highly repetitive sections of the genome that are difficult to assemble with short Illumina reads, or they may be misassemblies of the PacBio data due to the inherently high rate of sequencing error (10–15%). To try to resolve this question, we aligned the Illumina reads used to create our assembly to all assembled PacBio contigs using Bowtie 2. Alignment of our Illumina assembly to our PacBio assembly showed that the average coverage of the differential PacBio contigs is ∼66% of the average coverage across all PacBio contigs, indicating that these sequences are represented in the raw Illumina data but are likely repetitive, and therefore very difficult to assemble accurately.

Tandem repeats (39) (at least 2 bp × 10) were found within 20 bp of the ends of 22.85% of scaffolds. Combined with the larger scale repetitive sequence suggested by our PacBio sequencing, these results indicate that these elements are widespread and a significant hindrance to genome assembly.

Distribution of Foreign DNA Among Scaffolds. Although, on average, the distribution of foreign genes is approximately the same throughout our assembly, looking at individual scaffolds, one can see that there is a bias for smaller scaffolds having either 100% HGT genes or 0% HGT genes (Fig. S2E). This bias is due to smaller scaffolds having fewer genes on them, and this bias decreases with increasing minimal gene per scaffold number (Fig. S2F). So, when comparing scaffolds with at least one gene with scaffolds with at least 10 genes, one can see a trend toward a more ubiquitous distribution of foreign genes as minimal gene count is increased.

SI Experimental Procedures

Genome and Transcriptome Sequences. This Whole Genome Shotgun project has been deposited at DDBJ/EMBL/GenBank under the accession LMYF00000000. The version described in this paper is version LMYF01000000.

Tardigrade Culture and Isolation. Tardigrade cultures were maintained using spring water and Chlorococcum algae (Sciento) as previously described (9). For large-scale culture, H. dujardini cultures were grown at room temperature in 10-L plastic carboys filled to 5 L with deionized water. Carboys were inoculated with Chlorococcum algae, which served as a food source for the water bears. To separate tardigrades from their algal food source for DNA and RNA isolations, carboys were shaken to suspend algal sediment evenly and this solution was decanted into 150-mm glass Petri dishes. Petri dishes were placed near a bright window, but out of direct sunlight. It has previously been observed that H. dujardini move away from sunlight. After the suspended sediment had settled on the bottom of the dish, a glass pipette was used to clear algae from the edge of the dish furthest from the window. The dish was monitored over a week until visible accumulation of tardigrades could be seen in this cleared area. Using a glass pipette, tardigrades were collected and transferred to a 50-mL beaker filled with deionized water. Tardigrades were allowed to settle to the bottom of the beaker for 2 h, and the majority of the water was then removed from the beaker without disturbing the tardigrades. The beaker was again filled with deionized water to wash the tardigrades. This washing procedure was repeated an additional two times. Following the final wash, tardigrades were allowed to settle and were left overnight at room temperature.

DNA Extraction. Isolated specimens were transferred to 1.5-mL microcentrifuge tubes and centrifuged at 16,837 × g for 2 min to pellet animals. Excess liquid was removed, and tubes with pelleted specimens were dipped in liquid nitrogen and simultaneously homogenized with plastic pestles and then left briefly to thaw. Freeze/thaw douncing was repeated five times to homogenize samples. To isolate genomic DNA, homogenized samples were processed with Qiagen’s DNeasy Blood and Tissue Kit (catalog no. 69506) according to the manufacturer’s instructions. Following isolation, genomic DNA was ethanol-precipitated.

Assessment of EST and Core Eukaryotic Genes Mapping Approach Representation in the Genome. All available H. dujardini ESTs were obtained from the GenBank. The BLAST-like alignment tool (BLAT) was used to find matches between H. dujardini ESTs and our sequencing datasets. Resulting BLAT files were used with baa.pl (40).

Core eukaryotic genes mapping approach analysis was performed as previously described (10).

Assessment of Redundancy Within the Genome. BLAT analysis using baa.pl was performed as described above, comparing all predicted protein sequences from the H. dujardini genome annotation with assembled scaffolds.

Genomic Library Preparation and Sequencing.

Short insert mate pair libraries. Genomic DNA libraries were constructed by Cofactor Genomics according to the following procedure. Briefly, genomic DNA was sheared to the desired size using a Covaris S2 instrument. Three libraries were constructed with the following insert sizes: 300 bp, 500 bp, and 800 bp. Following shearing, DNA was end-repaired and A-tailed to prepare for adaptor ligation. Indexed adaptors were ligated to sample DNA, and the adaptor-ligated DNA was then size-selected on a 2% SizeSelect E-Gel (Invitrogen) and amplified by PCR. Library quality was assessed by measuring nanomolar concentration and the fragment size in base pairs. Sequencing was performed on an Illumina HiSeq 2000 platform with a 2 × 100 paired-end read configuration.

Moleculo long-read library preparation. Isolated DNA was submitted to Illumina for preparation and sequencing using the “long-read” DNA sequencing service. Resulting sequences were assembled into long reads by Illumina.

PacBio library preparation and sequencing. Quantity and quality were assessed using an Invitrogen Qubit 2.0 fluorometer (Invitrogen Life Technologies) and an Experion instrument (Bio-Rad). DNA was purified using Agencourt AMPure beads (Beckman Coulter). DNA (∼10 μg) was further prepared for sequencing using a PacBio 10-kb Preparation Kit (Pacific Biosciences) following the manufacturer’s protocol. Five single molecule real-time (SMRT) cells were run using the P6-C4 chemistry and a 1 × 240 movie at the University of North Carolina at Chapel Hill High-Throughput Sequencing Facility.

Genomic Sequence Assembly. Assembly was performed with the Celera assembler, version 8.1 (35). Long reads were first converted from a Fastq to Celera Assembler format with the command “fastqToCA -libraryname Hd -technology moleculo -reads LongRead.fastq > lr.frg.” The assembler was then run with the following parameters [largely following the method of McCoy et al. (36)]: unitigger = bogart merSize = 31 ovlMinLen = 2,000 utgGraphErrorRate = 0.03 utgMergeErrorRate = 0.045

Boothby et al. www.pnas.org/cgi/content/short/1510461112

obtErrorRate = 0.03 obtErrorLimit = 4.5 ovlMerSize = 31 ovlErrorRate = 0.03 batThreads = 24 cnsConcurrency = 12 frgCorrConcurrency = 12 frgCorrThreads = 12 mbtConcurrency = 12 merOverlapperThreads = 12 merylThreads = 12 ovlConcurrency = 12 ovlCorrConcurrency = 12 ovlThreads = 12. The resulting contigs and unplaced unitigs were then scaffolded using the 300-bp, 500-bp, and 800-bp mate pair libraries with SSPACE, version 2.0 (41), and the command-line parameters: -l = libfile.txt -s = Longreads.fasta -b = Hd -x = 1 -z = 0 -k = 4 -a = 0.7 -n = 15 -T = 24 -p = 1 -o = 20 -t = 0 -m = 32 -r = 0.9.

Genomic Annotation. Annotations for the H. dujardini genome assembly were generated using the automated genome annotation pipeline MAKER (16, 37, 38), which aligns and filters EST and protein homology evidence, identifies repeats, produces ab initio gene predictions, infers 5′ and 3′ UTRs, and integrates these data to produce final downstream gene models, along with quality control statistics.

Inputs for MAKER included the H. dujardini genome assembly, H. dujardini ESTs, a tardigrade-specific repeat library made with RepeatModeler (42) based on the reference genome assembly, and a protein database combining the UniProt/Swiss-Prot (43, 44) protein database and all sequences for D. melanogaster and C. elegans from the National Center for Biotechnology Information (NCBI) protein database (45). Ab initio gene predictions were created by MAKER using the programs SNAP (46) and Augustus (47). A total of three iterative runs of MAKER were used to produce the final gene set.

Following genome annotation, final gene models were then analyzed using the program InterProScan (48) to identify putative protein domains. The final annotation set contained a total of 38,145 genes, 67% of which contain a protein domain as detected by InterProScan and 92% of which have an annotation edit distance less than 0.5, consistent with a well-annotated genome (16, 38, 49).

Taxonomic Evaluation. The top hit to NCBI’s nr protein database was determined for each predicted gene using blastp (50). GenInfo Identifier numbers for these hits were used as input for the Galaxy fetch taxonomy tool (51, 52). Taxonomic ranks derived from this tool were used for downstream analysis.

Analysis of Putative rRNA Contamination. All available bacterial rRNA sequences were downloaded from the Ribosomal Database Project (rdp.cme.msu.edu/) and used to perform reciprocal best-hit BLAST analysis against scaffolds in our genome assembly. Sequences with reciprocal best BLAST hits to rRNA sequences with an expected value (e-value) of less than 1E-10 were counted as putative bacterial rRNA contamination (Dataset S1).

Analysis of Putative Human Contamination. All predicted proteins from the H. dujardini genome were used as search queries against a human protein database obtained from the NCBI. Twenty-one reciprocal best BLAST hits were found and further analyzed to assess if these sequences are truly of human origin. The 21 sequences were BLASTed against a chimpanzee (Pan troglodytes) protein database (NCBI), and the e-value of each hit was used to gauge the human (or nonhuman) origin of the sequences. In all cases, the chimpanzee sequences resulted in a lower e-value, indicating that these sequences are likely not of human origin (Dataset S1).

HGT Index Calculation. An HGT index score was calculated for each predicted gene with a top blastp hit to a nonmetazoan sequence. As previously described, a custom metazoan BLAST protein database was constructed and used to perform blastp searches, and HGT indices were calculated by subtracting the bit-score of the top metazoan BLAST hit from the bit-score of the top nonmetazoan BLAST hit for a given gene, with a threshold of 30 used to indicate likely HGT for a gene (14). All BLAST searches were carried out using the BLOSUM62 matrix.

Gene Tree Construction and Analysis. Alignments for phylogenetic analysis were compiled by blasting predicted protein sequences of prospective horizontally transferred genes against the NCBI nr database. One hundred cases were chosen at random (SI Experimental Procedures, Bridged PCR). The top five hits from metazoa, fungi, Embryophyta (multicellular plants), single-celled eukaryotes, Archaea, and Eubacteria were used for alignment, except when probable homologs were not present in a lineage. Sequences were aligned with MUSCLE (53) and edited by eye. Alignments were trimmed using the program Gblocks (54) and by eye. Both maximum likelihood and Bayesian approaches were used for phylogenetic analyses, implementing the Le-Gascuel (LG) model (55), with an estimated proportion of invariable sites and an estimated gamma shape parameter. PhyML (56) was used to determine maximum likelihood tree topologies, with 500 bootstrap replicates to evaluate branch supports. The Bayesian analysis was performed in MRBAYES (57), which ran for between 600,000 and 3,600,000 generations, depending on when convergence occurred, and sampled every 100 generations. For each maximum likelihood tree, posterior probability branch support estimates were calculated from the last 4,500 trees of the posterior distribution of trees sampled during the Bayesian analysis.

Bridged PCR. Genomic coordinates for adjacent genes were obtained from the MAKER2 output general feature format (gff) file. A numbered list of all foreign genes within 4,000 bases (for efficient PCR) of another gene was constructed. This list was split into sublists based on the predicted origin of foreign genes (bacterial, plant, fungi, archaeal, viral), and test candidates were selected for PCR using a random number generator to select genes from each of these sublists. A forward primer was designed within the last 200 bases of the upstream gene, and a corresponding reverse primer was designed within the first 200 bases of the downstream gene. Primer sequences and genomic coordinates used in this study are reported in Dataset S2. Primer sets were used in conjunction with single tardigrade PCR (discussed below). Following PCR, products were run on 1% ethidium bromide agarose gels to confirm the correct amplicon length.

Single Tardigrade PCR. A single adult tardigrade was transferred from an active culture to a dish of deionized water. The tardigrade was pipetted up and down in autoclaved deionized water with a mouth pipette several times to wash it. The cleaned tardigrade was transferred to the open lid of a 0.2-mL PCR tube. Five microliters of single tardigrade lysis solution [51 mM KCl, 10 mM Tris (pH 8.3), 2.5 mM MgCl2, 0.45% Nonidet P-40, 0.45% Tween-20, 0.01% gelatin, and 100 μg/mL proteinase K] was added to the bottom of each tube. Tubes were capped and centrifuged briefly to transfer the tardigrade from the cap to the lysis solution. Immediately following centrifugation, PCR tubes were transferred to −80 °C for 20 min. Following freezing, PCR tubes were placed in a thermocycler and heated at 65 °C for 60 min, followed by 80 °C for 15 min.

Lysates were used for PCR using either 2× GoTaq Master Mix (Promega) or LongAmp Taq (New England BioLabs), both according to each manufacturer’s instructions. Touchdown PCR was used, starting with an annealing temperature of 70 °C and decreasing to 52 °C at −1 °C per cycle. Subsequent to touchdown cycles, an additional 30 cycles of PCR were performed at 52 °C. Extension times used were calculated based on the expected amplicon length and the manufacturer’s instructions.

Intron Splice Site Sequence Analysis. U2 spliceosomal introns are characterized by a conserved GT dinucleotide at their 5′ splice site and a conserved AG dinucleotide at their 3′ splice site. The

sequence 10 bases upstream and downstream of each conserved dinucleotide was obtained manually for 50 introns present in genes of bacterial origin in the H. dujardini genome (Dataset S3). For comparison, the same procedure was used for 20 introns present in genes of metazoan origin in the H. dujardini genome (Dataset S3). Sequences were used as input for Galaxy’s Sequence Logo Motif tool (51, 52).

Codon Use. For codon use analysis, the full predicted coding sequence of 50 randomly selected genes was obtained for genes in the H. dujardini genome of metazoan, as well as foreign, origin. In addition, the full predicted coding sequence of 50 randomly selected genes obtained from the NCBI was obtained for the two most prevalent bacterial species represented in the H. dujardini genome (Niastella koreensis and Fluviicola taffensis; Dataset S1). These sequences were used for input into the sequence manipulation suite codon use calculator (58), and the frequency of codon use for each dataset was compared. Sequences used for this analysis are available in Dataset S3.

Gene/Protein Family and Gene Ontology Analysis. InterPro and gene ontology identifiers were parsed from the MAKER gff output. The occurrence of each unique IPR identifier was counted (Dataset S4). To compare counts of IPR domains in the H. dujardini genome, we obtained count data for both C. elegans and D. melanogaster from the InterProScan website (www.ebi.ac.uk/interpro/interproscan.html).

PacBio Genome Assembly. We used the minhash alignment process (biorxiv.org/content/early/2014/08/14/008003) to perform alignment and self-correction of PacBio subreads. PacBio self-correction reduced read error 10-fold from an estimated 15% to 1.5% after correction. We then assembled the corrected PacBio reads using the Celera assembler.

We also performed a hybrid assembly using a strict overlap-layout-consensus method that combined our original data and the PacBio data. Although this approach marginally improved the assembly (higher N50, more EST mapped to contigs), it did not significantly improve our original assembly and added a new complication by introducing data from a new set of animals into the analysis. Intriguingly, we noted that contigs produced by all three assemblies tended to fail (i.e., end) at similar points, suggesting that current assembly algorithms may not be sufficient for assembling these regions regardless of data type.

After ∼20-fold coverage PacBio sequencing of DNA from a new set of individuals, we performed self-correction and assembly of the PacBio data. Because PacBio has a high per base error rate, ∼20-fold is not sufficient to assemble a whole genome de novo from PacBio data. However, the high-quality contigs generated by these data can be compared with the original assembly.
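The N50 statistic used above to compare the hybrid and original assemblies can be illustrated with a minimal sketch. This is not the code used in this study; the function name and the contig lengths in the usage note are hypothetical, and N50 is taken here in its usual sense: the length of the contig at which the running total of lengths, sorted from longest to shortest, first reaches half of the total assembly length.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    Illustrative helper only; not part of any assembler's API.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)   # longest contig first
    half_total = sum(ordered) / 2             # half of the assembly size
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:             # first contig that crosses 50%
            return length


# Hypothetical contig lengths, for illustration only.
example_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
example_n50 = n50(example_lengths)
```

With these made-up lengths the total is 54, and the three longest contigs (10, 9, and 8) together reach half of that total, so the N50 is 8.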

Fig. S1. Comparison of H. dujardini genome statistics. (A) Genomic sequencing resources generated by this study. A comparison of the genome size (B), number of genes (C), average coding sequence length (D), guanine-cytosine content (E), average exon number per gene (F), and average exon size (G) for our H. dujardini assembly with the genome assemblies of several other model organisms was performed. Complete (H) and partial (I) core eukaryotic gene (CEG) coverage within our H. dujardini genome assembly is shown. A. gambiae, Anopheles gambiae; A. vaga, Adineta vaga; D. pulex, Daphnia pulex; I. scapularis, Ixodes scapularis; P. pacificus, Pristionchus pacificus; S. maritima, Strigamia maritima; T. urticae, Tetranychus urticae.

Fig. S2. Most foreign tardigrade genes score well above the HGT index threshold. (A) Graph showing the number of genes with a given HGT index score (to the nearest integer) for the tardigrade H. dujardini, rotifer A. ricciae, and nematode C. elegans. (B) Graph showing the cumulative percentage of foreign genes accounted for as the HGT index threshold is lowered to 30. For example, with an HGT index threshold of 30, 100% of H. dujardini foreign genes are accounted for, whereas increasing the threshold to 250 still accounts for 50% of all H. dujardini foreign genes. (C) Average coverage for genes of metazoan (Met) or foreign origin. (D) Average coverage for scaffolds containing genes of metazoan or foreign origin. (E) Plot showing the percentage of horizontally acquired genes vs. scaffold size. (F) Graph showing the percentage of scaffolds with no (black bars), all (dark gray bars), or some (light gray bars) horizontally acquired genes, as the minimal gene count per scaffold is increased. As the minimal number of genes per scaffold increases, the number of scaffolds with only horizontally acquired genes decreases.
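The tally summarized in panel F can be sketched as follows. This is a minimal illustration rather than the analysis code behind the figure: the function name, the input layout (one list of per-gene flags per scaffold, True meaning the gene was classified as horizontally acquired), and the example scaffolds are all assumptions made for this example.

```python
from collections import Counter


def classify_scaffolds(scaffolds, min_genes):
    """Percentage of scaffolds with no, all, or some foreign genes.

    scaffolds: dict mapping a scaffold ID to a list of booleans, one per
               gene (True = gene classified as horizontally acquired).
    min_genes: scaffolds with fewer genes than this are ignored.
    """
    counts = Counter()
    for flags in scaffolds.values():
        if len(flags) < min_genes:
            continue                      # scaffold too small for this bin
        n_foreign = sum(flags)
        if n_foreign == 0:
            counts["none"] += 1
        elif n_foreign == len(flags):
            counts["all"] += 1
        else:
            counts["some"] += 1
    total = sum(counts.values())
    return {
        key: (100.0 * counts[key] / total if total else 0.0)
        for key in ("none", "all", "some")
    }


# Hypothetical scaffolds, for illustration only.
example = {
    "s1": [True],                               # single foreign gene
    "s2": [False],                              # single metazoan gene
    "s3": [True, False, False],
    "s4": [False, False, False, False, True],
}
at_least_one = classify_scaffolds(example, min_genes=1)
at_least_three = classify_scaffolds(example, min_genes=3)
```

In this toy input, single-gene scaffolds can only fall in the "none" or "all" class, so raising the minimal gene count removes them and shifts the remaining scaffolds toward the "some" class, which is the small-scaffold bias described for panels E and F.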

Fig. S3. Genes of metazoan and foreign origin are physically linked in the H. dujardini genome. Results and a schematic of PCRs performed with primers bridging genes predicted to be present on the same genomic scaffold (associated information is available in Dataset S2) are shown. Green and orange boxes highlight foreign-metazoan gene pairs and foreign-foreign gene pairs, respectively, that produced PCR products of the correct size. Large (red) boxes highlight PCR reactions failing to recover products of the correct size.

Fig. S4. Comparison with PacBio long reads supports the accuracy of the H. dujardini genome assembly. A heat map shows the concordance between the PacBio and Illumina assemblies. The independently sequenced PacBio sequences are highly similar to the Illumina assembly, and there is no significant off-diagonal homology that would indicate misassembly or residual heterozygosity. Both sets of contigs are highly congruent. Synteny within contigs is clearly preserved, and assemblies are 98.53% concordant per base.

Fig. S5. Foreign genes implicated in stress tolerance have been transferred into the H. dujardini genome. Cladograms show the evolutionary relationships between foreign H. dujardini (gray), bacterial (orange), and metazoan (blue) genes implicated in stress tolerance. Numbers on branches indicate Bayesian followed by bootstrap (500) supports. recA, DNA recombination and repair A; spds, spermidine synthase.

Fig. S6. Model of HGT in desiccation-tolerant organisms. Speculative model of HGT acquisition in desiccation-tolerant organisms. Prolonged desiccation induces dsDNA breaks. During rehydration, membranes become transiently leaky, allowing the transfer of large macromolecules into and out of cells, including fragmented foreign DNA. Anhydrobiotic organisms possess robust DNA repair mechanisms for fixing desiccation-induced DNA damage. If foreign DNA is present in the nucleus of an anhydrobiotic cell, it may be accidentally incorporated during postrehydration genomic repair.

Dataset S1. HGT index scores

The dataset lists gene identifications (column A), putative identity (column B), best nr BLAST hit GenInfo Identifier number (column C), percentage identity for best nr BLAST hit (column D), best metazoan BLAST bitscore (column E), best nonmetazoan BLAST bitscore (column F), HGT index score (column G), nonmetazoan source (column H), nonmetazoan source genus and species (column I), HGT gene start coordinates (column J), HGT gene end coordinates (column K), predicted InterPro and Pfam domains (column L), associated gene ontology terms (column M), HGT gene length (column N), BLAST match length (column O), query coverage (column P), exon count (column Q), and intron count (column R). Additional tabs contain quantifications for foreign species, putative bacterial rRNA, and human contamination.
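The relationship between columns E, F, and G can be sketched as below, following the definition given in the HGT Index Calculation section (best nonmetazoan bitscore minus best metazoan bitscore, with a threshold of 30). This is an illustrative sketch and not the script used to build the dataset: the function names and example bitscores are hypothetical, and treating a score exactly equal to the threshold as passing is an assumption, because the text does not say whether the comparison is strict.

```python
HGT_THRESHOLD = 30  # threshold stated in the text


def hgt_index(best_nonmetazoan_bitscore, best_metazoan_bitscore):
    """HGT index (column G) = column F minus column E."""
    return best_nonmetazoan_bitscore - best_metazoan_bitscore


def is_likely_hgt(best_nonmetazoan_bitscore, best_metazoan_bitscore,
                  threshold=HGT_THRESHOLD):
    """True when the HGT index reaches the threshold.

    Assumption: an index equal to the threshold counts as passing.
    """
    index = hgt_index(best_nonmetazoan_bitscore, best_metazoan_bitscore)
    return index >= threshold


# Hypothetical bitscores (nonmetazoan, metazoan), for illustration only.
example_rows = [(250.0, 100.0), (120.0, 100.0), (130.0, 100.0)]
example_calls = [is_likely_hgt(nonmet, met) for nonmet, met in example_rows]
```

In this example the first and third rows have indices of 150 and 30 and are flagged, while the second row, with an index of 20, is not.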

Dataset S2. Gene trees and genomic PCR-associated information

Gene trees for a given gene are presented with Bayesian (Upper) and bootstrap (Lower) (500) supports. Gene tree information includes the gene identifier for a given tardigrade gene, NCBI accession number, and taxon label in each tree for genes used in our alignments. Pertinent information for each genomic PCR is also provided.

Dataset S3. HGT codon use and splice site information

Dataset S4. Quantification of InterProScan families and domains of interest in the H. dujardini genome

(A) Total count for each InterProScan family/domain is given in column C, along with the count for each entry attributed to genes acquired from foreign sources (column D) and the percentage of representation the HGT accounts for (column E). (B) Dataset shows representative InterProScan entries (column A), their raw count in our H. dujardini genome assembly (column B), the count represented by genes acquired through HGT (column C), the count in the D. melanogaster genome (column D), and the count in the C. elegans genome (column E). (C) Colors indicate InterProScan entries associated with heat shock and chaperone activity (red), DNA repair (blue), and antioxidant activity (yellow), as well as sugar and lipid metabolism (green). The dataset shows the counts for the 20 most abundant InterProScan entries in the H. dujardini genome. The occurrence of InterProScan domains found in H. dujardini metazoan genes (column C) compared with D. melanogaster (column E) and C. elegans (column F) is shown. The count (column D) and percentage of representation (column G) derived from HGT are given for each entry.

Dataset S5. Best reciprocal BLAST hits for HGT genes found in H. dujardini and the rotifer A. ricciae

The dataset shows the best reciprocal BLAST hits listing: A. ricciae gene identification (column A), H. dujardini gene identification (column B), putative identity (column C), bitscore of best nonmetazoan BLAST hit for H. dujardini gene (column D), bitscore of best nonmetazoan BLAST hit for A. ricciae gene (column E), bitscore for BLAST alignment between H. dujardini and A. ricciae genes (column F), taxonomy of best nonmetazoan BLAST hit for H. dujardini gene (column G), taxonomy of best nonmetazoan BLAST hit for A. ricciae gene (column H), and predicted timing of transfer event (column I).
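The reciprocal best-hit criterion behind this dataset can be sketched as follows. This is a minimal illustration, not the pipeline used in this study: the function name, the input layout (each dictionary maps a query gene to its single top-scoring BLAST hit in the other species), and the gene identifiers are hypothetical.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs that are each other's best hit.

    best_a_to_b: dict mapping each gene in set A to its top hit in set B.
    best_b_to_a: dict mapping each gene in set B to its top hit in set A.
    """
    return sorted(
        (a, b)
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a        # b's best hit must point back to a
    )


# Hypothetical gene identifiers, for illustration only.
ricciae_to_dujardini = {"Ar1": "Hd1", "Ar2": "Hd1", "Ar3": "Hd3"}
dujardini_to_ricciae = {"Hd1": "Ar1", "Hd3": "Ar9"}
example_pairs = reciprocal_best_hits(ricciae_to_dujardini, dujardini_to_ricciae)
```

Only the pair Ar1 and Hd1 is retained here: Ar2 also hits Hd1, but Hd1's best hit is Ar1, and Hd3's best hit is a gene other than Ar3.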