Supporting Information
Boothby et al. 10.1073/pnas.1510461112

SI Text

Repetitive Sequences in the H. dujardini Genome. A total of 4,440 (10.72%) of the PacBio contigs, making up 7.91% of the assembled sequence, do not have direct homologous matches in the Illumina assembly. These contigs may represent highly repetitive sections of the genome that are difficult to assemble with short Illumina reads, or they may be misassemblies of the PacBio data due to the inherently high rate of sequencing error (10–15%). To try to resolve this question, we aligned the Illumina reads used to create our assembly to all assembled PacBio contigs using Bowtie 2. Alignment of our Illumina assembly to our PacBio assembly showed that the average coverage of the differential PacBio contigs is ∼66% of the average coverage across all PacBio contigs, indicating that these sequences are represented in the raw Illumina data but are likely repetitive, and therefore very difficult to assemble accurately. Tandem repeats (39) (at least 2 bp × 10) were found within 20 bp of the ends of 22.85% of scaffolds. Combined with the larger scale repetitive sequence suggested by our PacBio sequencing, these results indicate that these elements are widespread and a significant hindrance to genome assembly.

Distribution of Foreign DNA Among Scaffolds. Although, on average, the distribution of foreign genes is approximately the same throughout our assembly, looking at individual scaffolds, one can see that there is a bias for smaller scaffolds having either 100% HGT genes or 0% HGT genes (Fig. S2E). This bias is due to smaller scaffolds having fewer genes on them, and this bias decreases with increasing minimal gene per scaffold number (Fig. S2F). So, when comparing scaffolds with at least one gene with scaffolds with at least 10 genes, one can see a trend toward a more ubiquitous distribution of foreign genes as minimal gene count is increased.

SI Experimental Procedures

Genome and Transcriptome Sequences. This Whole Genome Shotgun project has been deposited at DDBJ/EMBL/GenBank under the accession LMYF00000000. The version described in this paper is version LMYF01000000.

Tardigrade Culture and Isolation. Tardigrade cultures were maintained using spring water and Chlorococcum algae (Sciento) as previously described (9). For large-scale culture, H. dujardini cultures were grown at room temperature in 10-L plastic carboys filled to 5 L with deionized water. Carboys were inoculated with Chlorococcum algae, which served as a food source for the water bears. To separate tardigrades from their algal food source for DNA and RNA isolations, carboys were shaken to suspend algal sediment evenly and this solution was decanted into 150-mm glass Petri dishes. Petri dishes were placed near a bright window, but out of direct sunlight. It has previously been observed that H. dujardini move away from sunlight. After the suspended sediment had settled on the bottom of the dish, a glass pipette was used to clear algae from the edge of the dish furthest from the window. The dish was monitored over a week until visible accumulation of tardigrades could be seen in this cleared area. Using a glass pipette, tardigrades were collected and transferred to a 50-mL beaker filled with deionized water. Tardigrades were allowed to settle to the bottom of the beaker for 2 h, and the majority of the water was then removed from the beaker without disturbing the tardigrades. The beaker was again filled with deionized water to wash the tardigrades. This washing procedure was repeated an additional two times. Following the final wash, tardigrades were allowed to settle and were left overnight at room temperature.

DNA Extraction. Isolated specimens were transferred to 1.5-mL microcentrifuge tubes and centrifuged at 16,837 × g for 2 min to pellet animals. Excess liquid was removed, and tubes with pelleted specimens were dipped in liquid nitrogen and simultaneously homogenized with plastic pestles and then left briefly to thaw. Freeze/thaw douncing was repeated five times to homogenize samples. To isolate genomic DNA, homogenized samples were processed with Qiagen's DNeasy Blood and Tissue Kit (catalog no. 69506) according to the manufacturer's instructions. Following isolation, genomic DNA was ethanol-precipitated.

Assessment of EST and Core Eukaryotic Genes Mapping Approach Representation in the Genome. All available H. dujardini ESTs were obtained from GenBank. The BLAST-like alignment tool (BLAT) was used to find matches between H. dujardini ESTs and our sequencing datasets. Resulting BLAT files were used with baa.pl (40). Core eukaryotic genes mapping approach analysis was performed as previously described (10).

Assessment of Redundancy Within the Genome. BLAT analysis using baa.pl was performed as described above, comparing all predicted protein sequences from the H. dujardini genome annotation with assembled scaffolds.

Genomic Library Preparation and Sequencing.

Short insert mate pair libraries. Genomic DNA libraries were constructed by Cofactor Genomics according to the following procedure. Briefly, genomic DNA was sheared to the desired size using a Covaris S2 instrument. Three libraries were constructed with the following insert sizes: 300 bp, 500 bp, and 800 bp. Following shearing, DNA was end-repaired and A-tailed to prepare for adaptor ligation. Indexed adaptors were ligated to sample DNA, and the adaptor-ligated DNA was then size-selected on a 2% SizeSelect E-Gel (Invitrogen) and amplified by PCR. Library quality was assessed by measuring nanomolar concentration and the fragment size in base pairs. Sequencing was performed on an Illumina HiSeq 2000 platform with a 2 × 100 paired-end read configuration.

Moleculo long-read library preparation. Isolated DNA was submitted to Illumina for preparation and sequencing using the "long-read" DNA sequencing service. Resulting sequences were assembled into long reads by Illumina.

PacBio library preparation and sequencing. Quantity and quality were assessed using an Invitrogen Qubit 2.0 fluorometer (Invitrogen Life Technologies) and an Experion instrument (Bio-Rad). DNA was purified using Agencourt AMPure beads (Beckman Coulter). DNA (∼10 μg) was further prepared for sequencing using a PacBio 10-kb Preparation Kit (Pacific Biosciences) following the manufacturer's protocol. Five single molecule real-time (SMRT) cells were run using the P6-C4 chemistry and a 1 × 240 movie at the University of North Carolina at Chapel Hill High-Throughput Sequencing Facility.

Genomic Sequence Assembly. Assembly was performed with the Celera assembler, version 8.1 (35). Long reads were first converted from a Fastq to Celera Assembler format with the command "fastqToCA -libraryname Hd -technology moleculo -reads LongRead.fastq > lr.frg". The assembler was then run with the following parameters [largely following the method of McCoy et al. (36)]: unitigger = bogart, merSize = 31, ovlMinLen = 2000, utgGraphErrorRate = 0.03, utgMergeErrorRate = 0.045, obtErrorRate = 0.03, obtErrorLimit = 4.5, ovlMerSize = 31, ovlErrorRate = 0.03, batThreads = 24, cnsConcurrency = 12, frgCorrConcurrency = 12, frgCorrThreads = 12, mbtConcurrency = 12, merOverlapperThreads = 12, merylThreads = 12, ovlConcurrency = 12, ovlCorrConcurrency = 12, ovlThreads = 12. The resulting contigs and unplaced unitigs were then scaffolded using the 300-bp, 500-bp, and 800-bp mate pair libraries with SSPACE, version 2.0 (41), and the command-line parameters: -l = libfile.txt -s = Longreads.fasta -b = Hd -x = 1 -z = 0 -k = 4 -a = 0.7 -n = 15 -T = 24 -p = 1 -o = 20 -t = 0 -m = 32 -r = 0.9.

Genomic Annotation. Annotations for the H. dujardini genome assembly were generated using the automated genome annotation pipeline MAKER (16, 37, 38), which aligns and filters EST and protein homology evidence, identifies repeats, produces ab initio gene predictions, infers 5′ and 3′ UTRs, and integrates these data to produce final downstream gene models, along with quality control statistics.

Inputs for MAKER included the H. dujardini genome assembly, H. dujardini ESTs, a tardigrade-specific repeat library made with RepeatModeler (42) based on the reference genome assembly, and a protein database combining the UniProt/Swiss-Prot

used to indicate likely HGT for a gene (14). All BLAST searches were carried out using the BLOSUM62 matrix.

Gene Tree Construction and Analysis. Alignments for phylogenetic analysis were compiled by blasting predicted protein sequences of prospective horizontally transferred genes against the NCBI nr database. One hundred cases were chosen at random (SI Experimental Procedures, Bridged PCR). The top five hits from metazoa, fungi, Embryophyta (multicellular plants), single-celled eukaryotes, Archaea, and Eubacteria were used for alignment, except when probable homologs were not present in a lineage. Sequences were aligned with MUSCLE (53) and edited by eye. Alignments were trimmed using the bioinformatics program gBlocks (54) and by eye. Both maximum likelihood and Bayesian approaches were used for phylogenetic analyses, implementing the Le-Gascuel (LG) model (55), with an estimated proportion of invariable sites and an estimated gamma shape parameter. PhyML (56) was used to determine maximum likelihood tree topologies, with 500 bootstrap replicates to evaluate branch supports. The Bayesian analysis was performed in MRBAYES (57), which ran for between 600,000 and 3,600,000 generations, depending on when convergence occurred, and