Supplementary Tables and Figures:

Table S1. Approximately Unbiased (AU) Topology Tests under the LG+C60+F+Γ4 model

| Tree | logL | p-AU* |
|---|---|---|
| IQ-Tree ML tree (Fig. 1) (best) | -4340429.354 | 0.9939 |
| PhyloBayes-MPI Converged Set 1 (Fig. S2) | -4340564.740 | 0.0021 |
| PhyloBayes-MPI Converged Set 2 (Fig. S3) | -4340603.430 | 0.0054 |
| (All other taxa,(Metamonads+Ancyromonads)),(Malawimonads+Amorphea+CRuMs) ** | -4342974.658 | 0.0000 |
| (All other taxa,(Metamonads+Malawimonads)),(Ancyromonads+Amorphea+CRuMs) ** | -4340483.458 | 0.0168 |
| (All other taxa,(Metamonads,(Ancyromonads+Malawimonads))),(Amorphea+CRuMs) ** | -4340481.797 | 0.0005 |
| (All other taxa,((Metamonads+Ancyromonads),Malawimonads)),(Amorphea+CRuMs) ** | -4340509.012 | 0.0002 |
| (All other taxa,((Metamonads,Malawimonads),Ancyromonads)),(Amorphea+CRuMs) ** | -4347156.410 | 0.0001 |
| (All other taxa,Metamonads),(Ancyromonads,(CRuMs,(Amorphea,Malawimonads))) ** | -4340579.038 | 0.0000 |
| (All other taxa,Metamonads),(CRuMs,(Amorphea,(Ancyromonads+Malawimonads))) ** | -4340753.903 | 0.0000 |

*AU tests (Shimodaira 2002) were performed with 10,000 resamplings using the RELL method (Kishino et al. 1990).

**Test of the optimal tree inferred using ML under the specified topological constraint.
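As a reading aid for Table S1, the log-likelihood difference between each tree and the best tree can be tabulated next to its AU p-value. The sketch below is only an illustration built from the values in the table. The short "Constraint N" labels (numbered in table order) and the 0.05 rejection threshold are assumptions made here; the table itself does not state a cutoff.

```python
# Rows of Table S1 as (label, log-likelihood, AU p-value).
# "Constraint 1".."Constraint 7" are hypothetical short labels for the
# seven constrained topologies, in the order they appear in the table.
TABLE_S1 = [
    ("IQ-Tree ML tree (Fig. 1)", -4340429.354, 0.9939),
    ("PhyloBayes-MPI Converged Set 1 (Fig. S2)", -4340564.740, 0.0021),
    ("PhyloBayes-MPI Converged Set 2 (Fig. S3)", -4340603.430, 0.0054),
    ("Constraint 1", -4342974.658, 0.0000),
    ("Constraint 2", -4340483.458, 0.0168),
    ("Constraint 3", -4340481.797, 0.0005),
    ("Constraint 4", -4340509.012, 0.0002),
    ("Constraint 5", -4347156.410, 0.0001),
    ("Constraint 6", -4340579.038, 0.0000),
    ("Constraint 7", -4340753.903, 0.0000),
]

ALPHA = 0.05  # assumed conventional significance level, not from the table


def summarize(rows, alpha=ALPHA):
    """Return (label, logL difference from the best tree, rejected?) per row."""
    best = max(logl for _, logl, _ in rows)
    return [(label, round(best - logl, 3), p < alpha) for label, logl, p in rows]


for label, delta, rejected in summarize(TABLE_S1):
    print(f"{label}: delta logL = {delta:.3f}, rejected at {ALPHA} = {rejected}")
```

Under this assumed threshold, every topology other than the IQ-Tree ML tree is rejected.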

Figure S1. Phylogenetic tree for 64 eukaryotes, inferred from 351 proteins using Maximum Likelihood (LG+C60+F+Γ-PMSF model). The numbers on branches show (in order) support values from 100 real bootstrap replicates (LG+C60+F+Γ-PMSF model) and posterior probabilities of sets of converged chains in PHYLOBAYES-MPI under the CAT-GTR+Γ model (i.e., MLBS/PP). Four PHYLOBAYES-MPI chains were run for ~3000 generations. Two chains converged at ~200 generations (maxdiff = 0) and the posterior probabilities are mapped upon the ML tree. Full support is denoted with circles at nodes. Arrows indicate differences in topologies observed in the converged PHYLOBAYES-MPI CAT-GTR+Γ analyses.

Figure S2. Phylogenetic tree for 61 eukaryotes, inferred from 351 proteins using Bayesian Inference with PHYLOBAYES (CAT-GTR+Γ model). Tree represents two converged chains with a burnin of 800 generations and sampling 5000 generations post burnin. Arrows denote the differences of topologies compared to Fig. S3.

Figure S3. Phylogenetic tree for 61 eukaryotes, inferred from 351 proteins using Bayesian Inference with PHYLOBAYES (CAT-GTR+Γ model). Tree represents two converged chains (different from the chains used for Fig. S2) with a burnin of 2000 generations and sampling 3000 generations post burnin.

Figure S4. Conservative consensus phylogeny of eukaryotes used for identification of potential contaminants and paralogs prior to final dataset assembly.

Supplementary Methods:

Diphylleia rotans NIES-3764 was cultivated with the cyanobacterial strain Microcystis aeruginosa NIES-298 as a food source, in C-Si medium (http://mcc.nies.go.jp/02medium.html, last accessed February 9, 2016) at 20 °C under 10–50 micromole photons/m²/s with a 14 h light/10 h dark cycle (Kamikawa et al. 2016). Rigifila ramosa CCAP 1967/1 was cultivated at 20 °C in Rye Grass•Prescott Liquid medium (http://www.ccap.ac.uk/media/recipes/RPL.htm) under continuous darkness (Yabuki et al. 2013). Total RNA was extracted with TRIzol (Invitrogen) according to the manufacturer’s instructions, and RNA was sent to Hokkaido System Sciences Co. Ltd. (Sapporo, Hokkaido, Japan) for library construction with the TruSeq RNA Sample Prep Kit (Illumina) and Illumina HiSeq 2000 sequencing. This resulted in 40.7 million and 27.7 million reads (paired-end) for D. rotans and R. ramosa, respectively.

Ancyromonas sigmoides strain B-70 (CCAP 1958/3) and Fabomonas tropica strain NYK3C were each grown for 4 days at room temperature under ambient light in Petri plates containing ~10–15 ml 50% seawater, with concentrated, washed Enterobacter aerogenes as food. Cells were harvested from 150 plates (~4.5 × 10¹¹ cells total) by discarding the supernatant medium, then scraping cells from the plate surfaces with a sterile cell scraper and collecting them with a pipette. Mantamonas plastica strain Bass1 (CCAP 1946/1) was grown as above, but with Klebsiella pneumoniae (ATCC 23432) as a food source.

Total RNA from each of A. sigmoides, F. tropica and M. plastica was extracted with TRIzol (Invitrogen) according to the manufacturer’s instructions, and submitted to either GeneWiz (South Plainfield, NJ, USA) or Macrogen Inc. (Seoul, South Korea) for library construction and Illumina HiSeq 2000 sequencing. This resulted in 80.0 million (A. sigmoides), 101.2 million (F. tropica), and 132.0 million (M. plastica) reads (paired-end).

Transcriptomic Sequencing Assembly

Raw sequence data from the Illumina sequencing platform were processed with TRIMMOMATIC (Bolger et al. 2014) to clean and trim poorly called reads and to remove the adaptor sequences used in paired-end sequencing. The quality-filtered reads were then assembled de novo with TRINITY (Grabherr et al. 2011).

To identify any other eukaryotic contamination in the transcriptomes, TBLASTN (Altschul et al. 1990; Altschul et al. 1997) was used to identify all contigs containing sequences corresponding to translation elongation factor 1-alpha or elongation factor 1-like (EFL), which would select for eukaryotic cDNA transcripts (Kamikawa et al. 2013). Nucleotide sequences were translated with TRANSDECODER (https://transdecoder.github.io/), which also predicts open reading frames and searches for homologues in the Pfam protein-family database (Finn et al. 2016). Transcriptome contigs were searched against the NCBI nr (‘non-redundant’) protein-sequence database using the BLASTX program of the BLAST+ suite (Camacho et al. 2009).
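The contig-screening step can be sketched as follows, assuming the TBLASTN results were saved in BLAST+ tabular format (`-outfmt 6`, whose twelve default columns end in e-value and bit score). The function name, the example identifiers and the e-value cutoff are hypothetical; the text does not state which threshold was applied for this screen.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def contigs_with_hits(handle, max_evalue=1e-10):
    """Return {contig_id: lowest e-value} for hits passing the cutoff.

    In a TBLASTN search the protein query is `qseqid` and the matched
    transcriptome contig is `sseqid`. The default cutoff is an assumption.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        evalue = float(hit["evalue"])
        if evalue <= max_evalue:
            contig = hit["sseqid"]
            best[contig] = min(evalue, best.get(contig, evalue))
    return best


# Made-up hits for illustration only.
example = io.StringIO(
    "EF1A_query\tcontig_12\t78.5\t430\t90\t2\t1\t430\t55\t1344\t1e-150\t520\n"
    "EF1A_query\tcontig_98\t31.0\t80\t50\t3\t10\t89\t400\t639\t0.002\t35.1\n"
    "EFL_query\tcontig_12\t40.2\t410\t230\t8\t5\t414\t61\t1290\t3e-60\t210\n"
)
print(contigs_with_hits(example))
```

Contigs returned by such a filter would then be inspected to decide whether they belong to the target organism or to a contaminant.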

Phylogenomic dataset construction

The phylogenomic dataset comprises 351 orthologous proteins (97,002 amino acid sites in total), as developed in Tice et al. 2016 and Kang et al. 2017. This main dataset was constructed with an original emphasis on the evolution of Amoebozoa and, more widely, Amorphea. It was built from orthologs identified as both highly transcribed (and therefore likely to be present in RNAseq data) and globally distributed across the eukaryotic tree (as described in Kang et al. 2017). For each of the 351 orthologs of interest, a representative sequence, most often from Arabidopsis thaliana or Homo sapiens, was used as a query for TBLASTN or BLASTP searches (Table S2, Reference Ortholog Queries). Potential homolog sequences from our novel RNAseq data and other publicly available data (see Table S2) were identified using a threshold e-value of 1e-10. These putative homologs were then used as BLASTP queries against the ORTHOMCL v. 5.0 database, and all sequences matched below a threshold e-value of 1e-10 were obtained. The candidate sequences that matched the correct ORTHOMCL ortholog ID (Table S2, ORTHOMCL IDs), and which did not correspond to prokaryotic sequences, were designated as putative orthologs. These putative orthologs were added to the existing protein alignment, and the resultant gene clusters were re-aligned using MAFFT-LINSI (Katoh et al. 2005). Ambiguously aligned sequences were trimmed and culled using BMGE (Criscuolo and Gribaldo 2010). We obtained individual maximum-likelihood trees from these orthologs using RAXML v. 8.0 (Stamatakis 2014) under the LG model + gamma distribution of rate heterogeneity, with 4 discrete gamma rate classes (LG+Γ). Each tree was ML bootstrapped (MLBS) with 100 pseudoreplicates, and the MLBS values were compared to a consensus tree of well-supported eukaryotic groupings (Fig. S4) using a custom PYTHON script. The bipartitions of each single-ortholog tree were examined, either by eye or with a custom script, for any paralogy or extreme branch lengths supported by bootstrap values above 70%. Once each single-ortholog tree was free of problematic sequences, a single concatenated supermatrix was assembled using the custom pipeline known as ORTHOLAGER (Kang et al. 2017).
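The final concatenation step can be illustrated with a small sketch. This is not the ORTHOLAGER pipeline, whose internals are not described here; it only shows the general idea of joining per-ortholog alignments into one supermatrix, padding taxa that lack a given ortholog with gap characters. The function name and the toy alignments are hypothetical.

```python
def concatenate(alignments, missing="-"):
    """Join per-gene alignments ({taxon: aligned sequence}) into a supermatrix.

    Taxa absent from a gene are padded with `missing` characters so that
    every row of the result has the same length.
    """
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    rows = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment differ in length")
        width = lengths.pop()
        for taxon in taxa:
            rows[taxon].append(aln.get(taxon, missing * width))
    return {taxon: "".join(parts) for taxon, parts in rows.items()}


# Toy alignments for illustration only (not real data).
gene_a = {"Homo": "MKV-L", "Arabidopsis": "MKIAL", "Rigifila": "MRV-L"}
gene_b = {"Homo": "GGT", "Rigifila": "GAT"}

supermatrix = concatenate([gene_a, gene_b])
for taxon, seq in supermatrix.items():
    print(taxon, seq)
```

Padding with gaps keeps column positions aligned across genes, so a partition file recording each gene's start and end columns could be derived from the per-gene widths.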

References:

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403–410.

Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.

Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120.

Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. 2009. BLAST+: architecture and applications. BMC Bioinformatics 10:421.

Criscuolo A, Gribaldo S. 2010. BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evol. Biol. 10:210.

Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas A, Salazar GA, Tate J, Bateman A. 2016. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 44:D279–D285.

Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, et al. 2011. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nat. Biotechnol. 29:644–652.

Kamikawa R, Brown MW, Nishimura Y, Sako Y, Heiss AA, Yubuki N, Gawryluk R, Simpson AGB, Roger AJ, Hashimoto T, et al. 2013. Parallel re-modeling of EF-1α function: divergent EF-1α genes co-occur with EFL genes in diverse distantly related eukaryotes. BMC Evol. Biol. 13:131.

Kamikawa R, Shiratori T, Ishida K-I, Miyashita H, Roger AJ. 2016. Group II Intron-Mediated Trans-Splicing in the Gene-Rich Mitochondrial Genome of an Enigmatic Eukaryote, Diphylleia rotans. Genome Biol. Evol. 8:458–466.

Kang S, Tice AK, Spiegel FW, Silberman JD, Pánek T, Cepicka I, Kostka M, Kosakyan A, Alcântara DM, Roger AJ, et al. 2017. Between a pod and a hard test: the deep evolution of amoebae. Mol. Biol. Evol.

Katoh K, Kuma K, Toh H, Miyata T. 2005. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 33:511–518.

Kishino H, Miyata T, Hasegawa M. 1990. Maximum likelihood inference of protein phylogeny and the origin of chloroplasts. J. Mol. Evol. 30:151–160.

Shimodaira H. 2002. An approximately unbiased test of phylogenetic tree selection. Syst. Biol. 51:492–508.

Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313.

Yabuki A, Ishida K-I, Cavalier-Smith T. 2013. Rigifila ramosa n. gen., n. sp., a filose apusozoan with a distinctive pellicle, is related to Micronuclearia. Protist 164:75–88.