Supplementary Tables and Figures

Table S1 | Genome datasets of GTDB Release R04-RS89. Genomes were obtained from RefSeq/GenBank release r89 and augmented with MAGs derived from the Sequence Read Archive (SRA). The data set was refined by applying a quality threshold (completeness - 5x contamination >50%) and by dereplication at the species rank to remove highly similar genomes and to select the best genome as a species representative.

Category                            No. of genomes
RefSeq/GenBank                      2,661
SRA derived                         187
Total archaeal genomes before QC    2,848
Pass QC                             2,392
Species representatives             1,248

Table S2 | Set of 122 archaeal marker proteins used for phylogenomic inferences. The 122 archaeal proteins were identified as being present in ≥90% of bacterial or archaeal genomes and, when present, single-copy in ≥95% of genomes. The protein sequences were tested for congruency to avoid proteins affected by horizontal gene transfer [1].

Marker ID   Name             Description  Length (aa)
PF01990.12  ATP-synt_F       ATP synthase (F/14-kDa) subunit  95
PF01866.12  Diphthamide_syn  Putative diphthamide synthesis protein  307
PF04104.9   DNA_primase_lrg  Eukaryotic and archaeal DNA primase, large subunit  260
PF01984.15  dsDNA_bind       Double-stranded DNA-binding domain  107
PF02006.11  DUF137           Protein of unknown function DUF137  178
PF04019.7   DUF359           Protein of unknown function (DUF359)  121
PF01864.12  DUF46            Putative integral membrane protein DUF46  175
PF04919.7   DUF655           Protein of unknown function (DUF655)  181
PF07541.7   EIF_2_alpha      Eukaryotic translation initiation factor 2 alpha subunit  114
PF13685.1   Fe-ADH_2         Iron-containing alcohol dehydrogenase  250
PF01269.12  Fibrillarin      Fibrillarin  229
PF00368.13  HMG-CoA_red      Hydroxymethylglutaryl-coenzyme A reductase  373
PF01798.13  Nop              Putative snoRNA binding domain  150
PF00687.16  Ribosomal_L1     L1p/L10e family  220
PF00466.15  Ribosomal_L10    Ribosomal protein L10  100
PF00827.12  Ribosomal_L15e   Ribosomal L15  192
PF01280.15  Ribosomal_L19e   Ribosomal protein L19e  148
PF01157.13  Ribosomal_L21e   Ribosomal protein L21e  99
PF01198.14  Ribosomal_L31e   Ribosomal protein L31e  83
PF01655.13  Ribosomal_L32e   Ribosomal protein L32  110
PF01090.14  Ribosomal_S19e   Ribosomal protein S19e  140
PF01282.14  Ribosomal_S24e   Ribosomal protein S24e  84
PF01200.13  Ribosomal_S28e   Ribosomal protein S28e  69
PF01015.13  Ribosomal_S3Ae   Ribosomal S3Ae family  195
PF00900.15  Ribosomal_S4e    Ribosomal family S4e  77
PF01092.14  Ribosomal_S6e    Ribosomal protein S6e  127
PF00410.14  Ribosomal_S8     Ribosomal protein S8  129
PF01000.21  RNA_pol_A_bac    RNA polymerase Rpb3/RpoA insert domain  112
PF13656.1   RNA_pol_L_2      RNA polymerase Rpb3/Rpb11 dimerisation domain  77
PF01194.12  RNA_pol_N        RNA polymerases N / 8 kDa subunit  60
PF03874.11  RNA_pol_Rpb4     RNA polymerase Rpb4  117
PF01191.14  RNA_pol_Rpb5_C   RNA polymerase Rpb5, C-terminal domain  74
PF02978.14  SRP_SPB          Signal peptide binding domain  104
PF01868.11  UPF0086          Domain of unknown function UPF0086  89
PF01496.14  V_ATPase_I       V-type ATPase 116kDa subunit family  759
TIGR00021   rpiA             ribose 5-phosphate isomerase A  218
TIGR00037   eIF_5A           translation elongation factor IF5A  130
TIGR00042   TIGR00042        non-canonical purine NTP pyrophosphatase, RdgB/HAM1 family  184
TIGR00064   ftsY             signal recognition particle-docking protein FtsY  279
TIGR00111   pelota           mRNA surveillance protein pelota  351
TIGR00134   gatE_arch        glutamyl-tRNA(Gln) amidotransferase, subunit E  622
TIGR00240   ATCase_reg       aspartate carbamoyltransferase, regulatory subunit  150
TIGR00264   TIGR00264        alpha-NAC homolog  116
TIGR00270   TIGR00270        TIGR00270 family protein  154
TIGR00279   uL16_euk_arch    ribosomal protein uL16  172
TIGR00283   arch_pth2        peptidyl-tRNA hydrolase  115
TIGR00291   RNA_SBDS         rRNA metabolism protein, SBDS family  231
TIGR00293   TIGR00293        prefoldin, alpha subunit  129
TIGR00307   eS8              ribosomal protein eS8  127
TIGR00308   TRM1             N2,N2-dimethylguanosine tRNA methyltransferase  375
TIGR00323   eIF-6            putative translation initiation factor eIF-6  215
TIGR00324   endA             tRNA-intron lyase  177
TIGR00335   primase_sml      putative DNA primase, eukaryotic-type, small subunit  324
TIGR00336   pyrE             orotate phosphoribosyltransferase  173
TIGR00337   PyrG             CTP synthase  526
TIGR00373   TIGR00373        transcription factor E  162
TIGR00389   glyS_dimeric     glycine--tRNA ligase  565
TIGR00392   ileS             isoleucine--tRNA ligase  861
TIGR00398   metG             methionine--tRNA ligase  530
TIGR00405   KOW_elon_Spt5    transcription elongation factor Spt5  145
TIGR00408   proS_fam_I       proline--tRNA ligase  475
TIGR00422   valS             valine--tRNA ligase  863
TIGR00425   CBF5             putative rRNA pseudouridine synthase  322
TIGR00432   arcsn_tRNA_tgt   tRNA-guanine(15) transglycosylase  637
TIGR00442   hisS             histidine--tRNA ligase  406
TIGR00448   rpoE             DNA-directed RNA polymerase  179
TIGR00456   argS             arginine--tRNA ligase  569
TIGR00458   aspS_nondisc     aspartate--tRNA(Asn) ligase  428
TIGR00463   gltX_arch        glutamate--tRNA ligase  560
TIGR00468   pheS             phenylalanine--tRNA ligase, alpha subunit  324
TIGR00471   pheT_arch        phenylalanine--tRNA ligase, beta subunit  551
TIGR00490   aEF-2            translation elongation factor aEF-2  720
TIGR00491   aIF-2            translation initiation factor aIF-2  594
TIGR00501   met_pdase_II     methionine aminopeptidase, type II  295
TIGR00521   coaBC_dfp        phosphopantothenoylcysteine decarboxylase / phosphopantothenate--cysteine ligase  392
TIGR00522   dph5             diphthine synthase  258
TIGR00549   mevalon_kin      mevalonate kinase  276
TIGR00658   orni_carb_tr     ornithine carbamoyltransferase  304
TIGR00670   asp_carb_tr      aspartate carbamoyltransferase  304
TIGR00729   TIGR00729        ribonuclease HII  207
TIGR00936   ahcY             adenosylhomocysteinase  416
TIGR00982   uS12_E_A         ribosomal protein uS12  139
TIGR01008   uS3_euk_arch     ribosomal protein uS3  195
TIGR01012   uS2_euk_arch     ribosomal protein uS2  196
TIGR01018   uS4_arch         ribosomal protein uS4  162
TIGR01020   uS5_euk_arch     ribosomal protein uS5  212
TIGR01025   uS19_arch        ribosomal protein uS19  135
TIGR01028   uS7_euk_arch     ribosomal protein uS7  186
TIGR01038   uL22_arch_euk    ribosomal protein uL22  148
TIGR01046   uS10_euk_arch    ribosomal protein uS10  99
TIGR01052   top6b            DNA topoisomerase VI, B subunit  488
TIGR01060   eno              phosphopyruvate hydratase  425
TIGR01077   L13_A_E          ribosomal protein uL13  141
TIGR01080   rplX_A_E         ribosomal protein uL24  116
TIGR01213   pseudo_Pus10arc  tRNA pseudouridine(54/55) synthase  387
TIGR01309   uL30_arch        ribosomal protein uL30  151
TIGR01952   nusA_arch        NusA family KH domain protein, archaeal  141
TIGR02076   pyrH_arch        putative uridylate kinase  222
TIGR02153   gatD_arch        glutamyl-tRNA(Gln) amidotransferase, subunit D  405
TIGR02236   recomb_radA      DNA repair and recombination protein RadA  311
TIGR02258   2_5_ligase       2'-5' RNA ligase  180
TIGR02338   gimC_beta        prefoldin, beta subunit  110
TIGR02389   RNA_pol_rpoA2    DNA-directed RNA polymerase, subunit A''  367
TIGR02390   RNA_pol_rpoA1    DNA-directed RNA polymerase subunit A'  867
TIGR02651   RNase_Z          ribonuclease Z  302
TIGR03626   L3_arch          ribosomal protein uL3  331
TIGR03627   uS9_arch         ribosomal protein uS9  130
TIGR03628   arch_S11P        ribosomal protein uS11  117
TIGR03629   uS13_arch        ribosomal protein uS13  144
TIGR03636   uL23_arch        ribosomal protein uL23  77
TIGR03653   uL6_arch         ribosomal protein uL6  170
TIGR03665   arCOG04150       arCOG04150 universal archaeal KH domain protein  173
TIGR03670   rpoB_arch        DNA-directed RNA polymerase subunit B  599
TIGR03671   cca_archaeal     CCA-adding enzyme  410
TIGR03672   rpl4p_arch       50S ribosomal protein uL4  251
TIGR03673   uL14_arch        50S ribosomal protein uL14  131
TIGR03674   fen_arch         flap structure-specific endonuclease  338
TIGR03677   eL8_ribo         ribosomal protein eL8  117
TIGR03680   eif2g_arch       translation initiation factor 2, gamma subunit  407
TIGR03683   A-tRNA_syn_arch  alanine--tRNA ligase  902
TIGR03684   arCOG00985       arCOG04150 universal archaeal PUA-domain protein  154
TIGR03722   arch_KAE1        universal archaeal protein Kae1  323
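The two selection thresholds named in the caption of Table S2 (present in ≥90% of genomes and, when present, single-copy in ≥95% of genomes) can be illustrated with a minimal Python sketch. It assumes that per-genome copy counts for a candidate protein are already available as a list of integers; the function name and the toy counts are hypothetical, and the congruency test against horizontal gene transfer is not covered.

```python
def passes_marker_thresholds(copy_counts, min_presence=0.90, min_single_copy=0.95):
    """Return True if a candidate protein meets both thresholds.

    copy_counts: one integer per genome, the number of copies detected
    (hypothetical input format, not taken from the source).
    """
    n_genomes = len(copy_counts)
    present = [c for c in copy_counts if c > 0]
    if n_genomes == 0 or not present:
        return False
    # Fraction of all genomes in which the protein is present.
    presence = len(present) / n_genomes
    # Fraction of the genomes carrying the protein that carry exactly one copy.
    single_copy = sum(1 for c in present if c == 1) / len(present)
    return presence >= min_presence and single_copy >= min_single_copy


# Toy example: present in 19 of 20 genomes, always single-copy.
print(passes_marker_thresholds([1] * 19 + [0]))
```

Note that the single-copy fraction is computed only over genomes in which the protein was found, matching the "when present" wording of the caption.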

Table S3 | NCBI taxonomy F measure comparison. Named NCBI taxa classified as polyphyletic (red) and operationally monophyletic (yellow; defined as having an F measure ≥0.95) are shown. F measures were calculated from the ar122.r89 tree (IQ-TREE, C10, PMSF) decorated with the r89 NCBI taxonomy. Abbreviations: “Taxon” = taxon (lineage) defined by NCBI; “No. Expected in Tree” = number of genomes assigned to an NCBI taxon in the ar122.r89 tree; “F-measure” = the harmonic mean of precision and recall, which has been proposed for decorating trees with a donor taxonomy; “Precision” = the fraction of genomes grouped under a given taxon in the tree that are classified as that taxon according to the NCBI r89 taxonomy. For example, p__Nanoarchaeota (bold font) has a precision of 1 (100%) since all 4 genomes grouped within the Nanoarchaeota are classified as Nanoarchaeota. For f__Caldisphaeraceae (bold font) the precision is 0.67 since 3 genomes were grouped under Caldisphaeraceae, but only 2 of these are actually classified as Caldisphaeraceae; “Recall” = the fraction of genomes classified as a taxon that were placed into that taxon in the tree. For example, p__Nanoarchaeota has a recall of 0.4 since only 4 of the 10 expected genomes classified as Nanoarchaeota were grouped within Nanoarchaeota; “No. genomes from Taxon” = out of all genomes assigned by NCBI to a lineage (taxon), the number that fall within this taxon in the ar122.r89 tree; “No. genome in Lineage” = the total number of genomes, regardless of the NCBI taxonomy assignment, that fall into an NCBI-defined lineage in the ar122.r89 tree.

Taxon                                 No. Expected  F-measure  Precision  Recall  No. genomes  No. genome
                                      in Tree                                     from Taxon   in Lineage

Polyphyletic (red)
p__Candidatus Aenigmarchaeota         4             0.75       0.75       0.75    3            4
p__Candidatus Woesearchaeota          26            0.902      0.92       0.8846  23           25
p__Nanoarchaeota                      10            0.5714     1          0.4     4            4
c__Methanomicrobia                    150           0.6547     1          0.4867  73           73
o__Desulfurococcales                  21            0.766      0.6923     0.8571  18           26
o__Halobacteriales                    93            0.7517     1          0.6022  56           56
o__Thermoplasmatales                  24            0.8        1          0.6667  16           16
f__Acidilobaceae                      4             0.8571     1          0.75    3            3
f__Caldisphaeraceae                   2             0.8        0.6667     1       2            3
f__Desulfurococcaceae                 14            0.8333     1          0.7143  10           10
f__Halobacteriaceae                   18            0.6154     1          0.4444  8            8
f__Methanocaldococcaceae              9             0.875      1          0.7778  7            7
g__Candidatus Nanobsidianus           2             0.8        0.6667     1       2            3
g__Halopiger                          4             0.6667     1          0.5     2            2
g__Haloterrigena                      8             0.5        0.3333     1       8            24
g__Methanobacterium                   26            0.8387     0.7222     1       26           36
g__Methanococcoides                   5             0.8889     1          0.8     4            4
g__Methanococcus                      7             0.9231     1          0.8571  6            6
g__Methanothermobacter                5             0.8889     1          0.8     4            4
g__Methanothermococcus                2             0.6667     1          0.5     1            1
g__Natrinema                          8             0.9333     1          0.875   7            7
g__Natronolimnobius                   2             0.6667     1          0.5     1            1
g__Natronorubrum                      6             0.9231     0.8571     1       6            7
g__Pyrococcus                         7             0.9333     0.875      1       7            8
g__Sulfolobus                         9             0.8        1          0.6667  6            6
g__Thermococcus                       32            0.9014     0.8205     1       32           39
s__Candidatus Nanobsidianus stetteri  2             0.8        0.6667     1       2            3
s__Methanobacterium formicicum        2             0.8        0.6667     1       2            3
s__Methanococcoides methylutens       2             0.8        0.6667     1       2            3
s__Methanoculleus marisnigri          2             0.6667     1          0.5     1            1
s__Methanosarcina barkeri             3             0.8        1          0.6667  2            2
s__Methanospirillum hungatei          2             0.6667     1          0.5     1            1

Operationally monophyletic (yellow; F measure ≥0.95)
p__Candidatus Bathyarchaeota          34            0.9565     0.9429     0.9706  33           35
p__Candidatus Micrarchaeota           12            0.9565     1          0.9167  11           11
p__Candidatus Nanohaloarchaeota       12            0.9565     1          0.9167  11           11
p__Crenarchaeota                      92            0.989      1          0.9783  90           90
p__Euryarchaeota                      823           0.9914     0.9988     0.9842  810          811
p__Thaumarchaeota                     73            0.9865     0.9733     1       73           75
c__Halobacteria                       223           0.9978     0.9955     1       223          224
c__Nanohaloarchaea                    12            0.9565     1          0.9167  11           11
c__Thermoplasmata                     59            0.9833     0.9672     1       59           61
o__Methanomassiliicoccales            28            0.9655     0.9333     1       28           30
o__Methanomicrobiales                 68            0.9851     1          0.9706  66           66
o__Methanosarcinales                  70            0.9855     1          0.9714  68           68
o__Natrialbales                       56            0.991      1          0.9821  55           55
o__Nitrosopumilales                   19            0.9744     0.95       1       19           20
f__Haloarculaceae                     23            0.9583     0.92       1       23           25
f__Haloferacaceae                     35            0.9565     0.9706     0.9429  33           34
f__Halorubraceae                      38            0.961      0.9487     0.9737  37           39
f__Natrialbaceae                      56            0.991      1          0.9821  55           55
g__Methanoculleus                     17            0.9697     1          0.9412  16           16
g__Nitrosopumilus                     12            0.96       0.9231     1       12           13
o__Haloferacales                      74            0.9867     0.9737     1       74           76

Colour key: red = polyphyletic; yellow = operationally monophyletic (F measure ≥0.95)
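The precision, recall and F measure definitions given in the caption of Table S3 can be reproduced from three of the table's columns. The following Python sketch is an illustration only: the function and argument names are hypothetical, and the zero guard for taxa with no correctly placed genomes is an added assumption.

```python
def f_measure(expected_in_tree, genomes_from_taxon, genomes_in_lineage):
    """Return (precision, recall, F) for one taxon.

    expected_in_tree   -> "No. Expected in Tree"
    genomes_from_taxon -> "No. genomes from Taxon"
    genomes_in_lineage -> "No. genome in Lineage"
    """
    if genomes_from_taxon == 0:
        # Assumed convention: nothing placed correctly gives all zeros.
        return 0.0, 0.0, 0.0
    precision = genomes_from_taxon / genomes_in_lineage
    recall = genomes_from_taxon / expected_in_tree
    # Harmonic mean of precision and recall.
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


def is_operationally_monophyletic(f, threshold=0.95):
    """Apply the F measure >= 0.95 cut-off used in Table S3."""
    return f >= threshold


# p__Nanoarchaeota: 10 expected, 4 from taxon, 4 in lineage.
p, r, f = f_measure(10, 4, 4)
print(round(p, 4), round(r, 4), round(f, 4))  # 1.0 0.4 0.5714
```

With the f__Caldisphaeraceae row (2 expected, 2 from taxon, 3 in lineage) the same function gives a precision of 2/3 and a recall of 1, as described in the caption.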

Table S4 | Internal nodes with bootstrap values below 90%. Shown are the nodes with support <90% which were assigned names to preserve existing classifications in GTDB (R04-RS89). Bootstrap values below 75% are highlighted in red.

Decorated internal node           Bootstrap support (%)
g__Halobiforma                    64
c__Lokiarchaeia                   70
o__Desulfurococcales              70
p__Nanoarchaeota;c__Nanoarchaeia  77
c__Thermoprotei                   79
f__Natrialbaceae                  79
c__Nitrososphaeria                80
f__GW2011-AR4                     82
g__Halogranum                     82
g__MGIIb-O3                       82
o__Methanosarcinales              83
c__Archaeoglobi                   85
p__Euryarchaeota                  86
c__Thermoplasmata                 86
o__Thermofilales                  86
o__UBA9212                        86
o__SCGC-AAA252-I15                88
f__UBA10161                       88
g__Acidianus                      89
g__Natronorubrum                  89

Table S5 | Number of decorated internal nodes in the ar122.r89 tree. The upper table shows the total number of nodes including the extant taxa, i.e. the species (total no. of nodes incl. species), the total number of nodes with a GTDB taxon label (no. of nodes with GTDB taxon label), and the percentage of these labelled nodes. The lower table shows the total number of internal nodes only, excluding extant taxa, i.e. the species (total no. of internal nodes excl. species), and the number of decorated internal nodes (no. of internal nodes with GTDB taxon label), followed by the percentage of decorated internal nodes. Nodes were calculated from the ar122.r89 tree.

total no. of nodes incl. species     no. of nodes with GTDB taxon label           % of nodes with GTDB taxon label
2,495                                1,558                                        62.4%

total no. of internal nodes excl. species  no. of internal nodes with GTDB taxon label  % of nodes with GTDB taxon label
1,247                                      310                                          24.9%
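The two percentages in Table S5 are related through the number of species (1,248, as given for the species representatives elsewhere in this supplement). The short Python sketch below recomputes them; it assumes that every species (leaf) counts as a labelled node, which is consistent with the reported totals, and the helper name is hypothetical.

```python
def percent_labelled(labelled_nodes, total_nodes):
    """Percentage of nodes carrying a GTDB taxon label, to one decimal place."""
    return round(100 * labelled_nodes / total_nodes, 1)


species = 1248            # extant taxa (leaves)
internal_nodes = 1247     # internal nodes excluding species
labelled_internal = 310   # internal nodes with a GTDB taxon label

# Assumption: each species counts as one labelled node.
total_nodes = species + internal_nodes
labelled_total = species + labelled_internal

print(total_nodes, labelled_total, percent_labelled(labelled_total, total_nodes))
print(internal_nodes, labelled_internal, percent_labelled(labelled_internal, internal_nodes))
```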

Table S6 | SSU rRNA genes detected in species representatives. The 1,248 genomes in GTDB R04-RS89 are ranked by their NCBI genome category (ncbi_genome_category). Note that the NCBI category “none” is likely to include exclusively isolates, since it comprises all genomes not derived from single cells, metagenomes, or environmental samples. (a) Genomes with SSU genes and gene fragments; (b) genomes with SSU genes >900 bp (trimmed). Note that the majority of genomes without a detected SSU fall into the “derived from metagenome” category (91.9% of all genomes without SSU). The majority of genomes without a trimmed SSU >900 bp, i.e. 95.3% (2.2% + 91.9% + 1.2%), are derived from environmental samples, MAGs and SAGs. Subtracting the SAGs leaves 94.1% of genomes most likely derived from MAGs. Abbreviations: “% SSU cat.” = percentage of genomes within an NCBI genome category with an SSU gene; “% no SSU cat.” = percentage of genomes within an NCBI genome category without an SSU gene; “% no SSU 578” = percentage of genomes without an SSU gene in a given category relative to all 578 genomes without an SSU gene.

(a) Genomes with SSU genes and gene fragments

                                   ssu-copies
ncbi_genome_category               0    1    2    3    4    5    9    Total
derived from environmental_sample  12   10   1    -    -    -    -    23
derived from metagenome            351  347  46   7    1    -    -    752
derived from single cell           6    16   1    -    -    -    -    23
none (i.e. likely an isolate)      8    275  97   50   16   3    1    450
Grand Total                        377  648  145  57   17   3    1    1248

(b) Genomes with the longest SSU genes > 900bp (trimmed)

ncbi_genome_category               SSU  % SSU cat.  no SSU  % no SSU cat.  % no SSU 578  Total
derived from environmental_sample  10   43.5        13      56.5           2.2           23
derived from metagenome            221  29.4        531     70.6           91.9          752
derived from single cell           16   69.6        7       30.4           1.2           23
none (i.e. likely an isolate)      423  94.0        27      6.0            4.7           450
All categories                     670  53.7        578     46.3           n.a.          1248
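The three percentage columns of panel (b) follow directly from the counts of genomes with and without a trimmed SSU gene >900 bp. The Python sketch below recomputes them from the counts in the table; the dictionary layout and function name are hypothetical choices made for illustration.

```python
# (genomes with SSU, genomes without SSU), copied from Table S6 (b).
COUNTS = {
    "derived from environmental_sample": (10, 13),
    "derived from metagenome": (221, 531),
    "derived from single cell": (16, 7),
    "none (i.e. likely an isolate)": (423, 27),
}


def ssu_percentages(counts):
    """Return {category: (% SSU cat., % no SSU cat., % no SSU of all without SSU)}."""
    total_without = sum(without for _, without in counts.values())
    rows = {}
    for category, (with_ssu, without_ssu) in counts.items():
        n = with_ssu + without_ssu
        rows[category] = (
            round(100 * with_ssu / n, 1),
            round(100 * without_ssu / n, 1),
            round(100 * without_ssu / total_without, 1),
        )
    return rows


for category, row in ssu_percentages(COUNTS).items():
    print(category, row)
```

The denominator of the last column is the sum of the "no SSU" counts, i.e. the 578 genomes referred to in the caption.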

Table S7 | 16 ribosomal proteins (rp1). This marker set was defined in Hug et al., 2016 [2].

Marker ID        Name            Description  Length (aa)
PFAM_PF00238.14  Ribosomal_L14   Ribosomal protein L14p/L23e  122
PFAM_PF00827.12  Ribosomal_L15e  Ribosomal L15  192
PFAM_PF00252.13  Ribosomal_L16   Ribosomal protein L16p/L10e  133
PFAM_PF00861.17  Ribosomal_L18p  Ribosomal L18p/L5e family  119
PFAM_PF00181.18  Ribosomal_L2    Ribosomal Proteins L2, RNA binding domain  77
PFAM_PF00237.14  Ribosomal_L22   Ribosomal protein L22p/L17e  105
PFAM_PF03947.13  Ribosomal_L2_C  Ribosomal Proteins L2, C-terminal domain  130
PFAM_PF00297.17  Ribosomal_L3    Ribosomal protein L3  263
PFAM_PF00573.17  Ribosomal_L4    Ribosomal protein L4/L1 family  192
PFAM_PF00281.14  Ribosomal_L5    Ribosomal protein L5  56
PFAM_PF00673.16  Ribosomal_L5_C  ribosomal L5P family C-terminus  95
PFAM_PF00366.15  Ribosomal_S17   Ribosomal protein S17  69
PFAM_PF00203.16  Ribosomal_S19   Ribosomal protein S19  81
PFAM_PF00189.15  Ribosomal_S3_C  Ribosomal protein S3, C-terminal domain  85
PFAM_PF00410.14  Ribosomal_S8    Ribosomal protein S8  129
TIGR_TIGR01046   uS10_euk_arch   ribosomal protein uS10  99
TIGR_TIGR01080   rplX_A_E        ribosomal protein uL24  116
TIGR_TIGR03653   uL6_arch        ribosomal protein uL6  170

Table S8 | 23 ribosomal proteins (rp2). This table is based on a subset of proteins used by Rinke et al. [3].

Marker ID        Name             Description  Length (aa)
PFAM_PF00687.16  Ribosomal_L1     Ribosomal protein L1p/L10e family  220
PFAM_PF00466.15  Ribosomal_L10    Ribosomal protein L10  100
PFAM_PF00298.14  Ribosomal_L11    Ribosomal protein L11, RNA binding domain  69
PFAM_PF03946.9   Ribosomal_L11_N  Ribosomal protein L11, N-terminal domain  60
PFAM_PF00238.14  Ribosomal_L14    Ribosomal protein L14p/L23e  122
PFAM_PF00252.13  Ribosomal_L16    Ribosomal protein L16p/L10e  133
PFAM_PF00861.17  Ribosomal_L18p   Ribosomal L18p/L5e family  119
PFAM_PF00181.18  Ribosomal_L2     Ribosomal Proteins L2, RNA binding domain  77
PFAM_PF00237.14  Ribosomal_L22    Ribosomal protein L22p/L17e  105
PFAM_PF03947.13  Ribosomal_L2_C   Ribosomal Proteins L2, C-terminal domain  130
PFAM_PF00297.17  Ribosomal_L3     Ribosomal protein L3  263
PFAM_PF00573.17  Ribosomal_L4     Ribosomal protein L4/L1 family  192
PFAM_PF00281.14  Ribosomal_L5     Ribosomal protein L5  56
PFAM_PF00673.16  Ribosomal_L5_C   ribosomal L5P family C-terminus  95
PFAM_PF00338.17  Ribosomal_S10    Ribosomal protein S10p/S20e  97
PFAM_PF00411.14  Ribosomal_S11    Ribosomal protein S11  110
PFAM_PF00416.17  Ribosomal_S13    Ribosomal protein S13/S18  107
PFAM_PF00312.17  Ribosomal_S15    Ribosomal protein S15  83
PFAM_PF00366.15  Ribosomal_S17    Ribosomal protein S17  69
PFAM_PF00203.16  Ribosomal_S19    Ribosomal protein S19  81
PFAM_PF00189.15  Ribosomal_S3_C   Ribosomal protein S3, C-terminal domain  85
PFAM_PF00333.15  Ribosomal_S5     Ribosomal protein S5, N-terminal domain  67
PFAM_PF03719.10  Ribosomal_S5_C   Ribosomal protein S5, C-terminal domain  74
PFAM_PF00177.16  Ribosomal_S7     Ribosomal protein S7p/S5e  148
PFAM_PF00410.14  Ribosomal_S8     Ribosomal protein S8  129
PFAM_PF00380.14  Ribosomal_S9     Ribosomal protein S9/S16  121
PFAM_PF00164.20  Ribosom_S12_S23  Ribosomal protein S12/S23  122

Table S9 | GTDB lineages not resolved with alternative inference methods. Shown are the taxa that were not recovered as monophyletic or operationally monophyletic under different alignments and inference methods (F measure < 0.95). A value of 1 indicates that a taxon was not recovered as monophyletic or operationally monophyletic.

Taxon                     Grand total (number of trees with a value of 1)
G__MGIIB-O3               8
G__HALOGRANUM             7
O__DESULFUROCOCCALES      7
G__ACIDIANUS              6
G__UBA71                  6
C__METHANOCELLIA          4
C__THERMOPLASMATA         4
F__NZ13-MGT               4
O__METHANOCELLALES        4
F__GW2011-AR4             3
G__ACIDILOBUS             3
G__HALOBIFORMA            3
G__MGIIA-L1               3
G__UBA493                 3
O__SCGC-AAA252-I15        3
O__THERMOFILALES          3
O__UBA9212                3
O__WOR-SM1-SCG            3
P__EURYARCHAEOTA          3
C__BATHYARCHAEIA          2
C__E2                     2
C__HEIMDALLARCHAEIA       2
C__LOKIARCHAEIA           2
C__NITROSOSPHAERIA        2
C__SYNTROPHOARCHAEIA      2
F__METHANOREGULACEAE      2
F__THERMOCLADIACEAE       2
F__UBA12501               2
F__UBA233                 2
F__UBA93                  2
F__WOR-SM1-SCG            2
G__GEOGLOBUS              2
G__HALOARCULA             2
G__HALOBELLUS             2
G__HALOMICROBIUM          2
G__METHANOBACTERIUM_C     2
G__METHANOBACTERIUM_D     2
G__METHANOCALDOCOCCUS_A   2
G__METHANOMASSILIICOCCUS  2
G__METHANOTORRIS          2
G__MGIIB-O5               2
G__NATRINEMA              2
G__NITROSOARCHAEUM        2
G__WOR-SM1-SCG            2
O__B26-1                  2
O__SCGC-AAA011-G17        2
P__ASGARDARCHAEOTA        2
C__METHANOSARCINIA        1
C__NANOARCHAEIA           1
F__ALTIARCHAEACEAE        1
F__ARS49                  1
F__GW2011-AR1             1
F__HALOARCULACEAE         1
F__HALOCOCCACEAE          1
F__METHANOMICROBIACEAE    1
G__ALTIARCHAEUM           1
G__METHANOBREVIBACTER_D   1
G__NATRIALBA              1
G__NATRONORUBRUM          1
G__NITROSOPELAGICUS       1
O__ALTIARCHAEALES         1
O__UBA10117               1
P__HALOBACTEROTA          1
P__NANOARCHAEOTA          1
GRAND TOTAL               153

Tree (column)                     Grand total
14.3 SSU IQTREE 900BP             42
14.1 SSU FASTTREE 900BP           39
04 FASTTREE 1248X WAG G RP1       11
05 RP1 IQTREE C10 PMSF RP1        14
06 FASTTREE 1248 1225X WAG G RP2  9
07 RP2 IQTREE C10 PMSF            10
18.1 ASTRAL AR122 LPP             5
18.2 ASTRAL 253X PHYLOPHLAN LPP   12
11.1 BMGE FASTTREE                4
11.2 BMGE IQTREE                  6
11.5 DIEVVIER MINCOL_4 IQTREE     1
GRAND TOTAL                       153

Table S10 | Inference methods and alignments used to validate the R04-RS89 archaeal GTDB taxonomy. Abbreviations: “ali” = alignment including 122 markers (122), 16 ribosomal proteins (rp1), 23 ribosomal proteins (rp2), and full-length (>900 bp) 16S rRNA sequences (SSU); “para” = parameters used to run the inference; “taxa” = number of taxa (genomes) in the alignment; “ali_length” = alignment length; “threads” = number of threads used by the inference software; “walltime” = total actual time taken from the start of the inference to the end; “MB RAM” = amount of memory in MB used by the inference software.

inference method                  tree name                                 tree ID  ali  para       taxa  ali_length  threads*  walltime       MB RAM*
FastTree 2.1.10 SSE3, OpenMP      01fasttree1248xwagg.tree                  1        122  WAG, G     1248  5124        ~10       460.10 seconds n.a.
IQ-TREE multicore version 1.6.12  ar122.r89                                 2.1      122  C10, PMSF  1248  5124        9/10      16h:38m:48s    43,640/4,014
IQ-TREE multicore version 1.6.12  2.3.iqtree.c10.slow                       2.3      122  C10        1248  5124        26        132h:44m:32s   43,640
IQ-TREE multicore version 1.6.12  2.4.iqtree.c60.pmsf.fasttreeStart         2.4      122  C60        1248  5124        10/9      68h:13m:56s    242,003/4,014
IQ-TREE multicore version 1.6.12  3.iqtree.c10.pmsf.fasttreeStart.BS        3        122  C10, PMSF  1248  5124
FastTree 2.1.10 SSE3, OpenMP      04.fasttree.1248x.wag.g.rp1               4        rp1  WAG, G     1179  1174        n.a.      365.15 seconds n.a.
IQ-TREE multicore version 1.6.12  05.rp1.iqtree.c10.pmsf.fasttreeStart.rp1  5        rp1  C10, PMSF  1179  1174        13/19     27h:42m:21s    14,378/1,323
FastTree 2.1.10 SSE3, OpenMP      06.fasttree.1248.1225x.wag.g.rp2          6        rp2  WAG, G     1225  2377        n.a.      256.82 seconds n.a.
IQ-TREE multicore version 1.6.12  07.rp2.iqtree.c10.pmsf.fasttreeStart      7        rp2  C10, PMSF  1225  2377        12/5      45h:24m:25s    19,969/1,837
phylobayes mpi version 1.8        10.phylobayes                             10       122  CAT, GTR   96    5124        40        3months+       n.a.
FastTree 2.1.10 SSE3, OpenMP      11.1.bmge.FastTree                        11.1     122  WAG, G     5124  7859        n.a.      n.a.           n.a.
IQ-TREE multicore version 1.6.12  11.2.bmge.IQTree                          11.2     122  C10, PMSF  5124  7859        4/13      15h:9m:57s     66,545/6,122
IQ-TREE multicore version 1.6.12  11.5.dievvier.mincol_4.IQtree             11.5     122  C10, PMSF  5124  32061       17/17     415h:23m:37s   271,680/24,996
ExaML version 3.0.20              12.ExaML_JTT_G                            12       122  WAG, G     1248  5124        n.a.      ~5h:30m        n.a.
FastTree 2.1.10 SSE3, OpenMP      14.1.SSU.FastTree.900bp                   14.1     SSU  WAG,G      670   1382        ~17       63.71 seconds  n.a.
IQ-TREE multicore version 1.6.12  14.3.SSU.IQtree.900bp                     14.3     SSU  SYM+R10    670   1382        1/1       1h:33m:13s     291/291
IQ-Tree v 1.6.9, ASTRAL v5.6.3    18.1.astral.ar122.lpp.tree                18.1     122  C10, PMSF  1248  various     n.a.      n.a.           n.a.
IQ-Tree v 1.6.9, ASTRAL v5.6.3    18.2.astral.253x.phylophlan.lpp.tree      18.2     253  C10, PMSF  1248  various     n.a.      n.a.           n.a.

*IQ-TREE has two phases (estimate mixture model + conduct analysis)

Table S11 | Taxa with a conflicting taxonomy in the tree comparisons. Shown are the ranks at which the decorated trees disagreed, i.e. the columns “phylum”, “class” etc. indicate the highest rank at which the taxonomy disagreed for each genome. For example, if taxa cluster in different phyla, then only the highest rank (phylum) is counted, not subsequent changes at lower ranks. Abbreviations: “total mismatch” = total number of genomes with a conflicting taxonomy in this tree; “total genomes” = total number of genomes in this tree; “pct mismatch” = percentage of genomes with conflicting taxonomy in this tree. Note that the data in this table formed the basis for the tree comparison maps in Figure S14.

tree                                      phylum  class  order  family  genus  species  total mismatch  total genomes  pct mismatch
04.fasttree.1248x.wag.g.rp1               2       23     6      4       10     0        45              1179           3.82
05.rp1.iqtree.c10.pmsf.fasttreeStart.rp1  2       26     11     2       21     0        62              1179           5.26
06.fasttree.1248.1225x.wag.g.rp2          0       2      4      2       12     0        20              1225           1.63
07.rp2.iqtree.c10.pmsf.fasttreeStart      5       1      6      7       12     0        31              1225           2.53

14.1.SSU.FastTree.900bp                   10      42     5      14      16     0        87              670            12.99
14.3.SSU.IQtree.900bp                     131     10     25     9       19     0        194             670            28.96

01.fasttree.1248x.wag.g                   0       0      0      0       0      0        0               1248           0
12.ExaML_JTT_G                            0       0      0      0       0      0        0               1248           0
2.3.iqtree.c10.slow                       1       0      0      0       0      0        1               1248           0.08
2.4.iqtree.c60.pmsf.fasttreeStart         0       0      0      0       0      0        0               1248           0

11.1.bmge.FastTree                        0       3      27     11      1      0        42              1248           3.37
11.2.bmge.IQTree                          0       20     25     1       2      0        48              1248           3.85
11.5.dievvier.mincol_4.IQtree             0       1      4      1       1      0        7               1248           0.56

18.1.astral.ar122.lpp.tree                48      17     3      2       6      0        76              1248           6.09
18.2.astral.253x.phylophlan.lpp.tree      55      2      5      9       7      0        78              1213           6.43
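The counting rule described in the caption of Table S11 (only the highest rank at which two taxonomies disagree is counted for each genome) can be sketched in Python as follows. The lineage strings, taxon names and function names are hypothetical placeholders and not data from this study; lineages are assumed to be semicolon-separated and to start at the phylum rank.

```python
from collections import Counter

RANKS = ("phylum", "class", "order", "family", "genus", "species")


def highest_conflicting_rank(reference, alternative):
    """Return the highest rank at which two lineages differ, or None if identical."""
    for rank, ref_taxon, alt_taxon in zip(RANKS, reference.split(";"), alternative.split(";")):
        if ref_taxon != alt_taxon:
            return rank  # lower ranks are deliberately not inspected
    return None


def summarise(pairs):
    """Tally mismatches per rank and the percentage of mismatching genomes."""
    counts = Counter()
    for reference, alternative in pairs:
        rank = highest_conflicting_rank(reference, alternative)
        if rank is not None:
            counts[rank] += 1
    mismatches = sum(counts.values())
    return counts, mismatches, round(100 * mismatches / len(pairs), 2)


# Hypothetical lineages for four genomes: (reference tree, alternative tree).
pairs = [
    ("p__P1;c__C1;o__O1;f__F1;g__G1;s__G1 a", "p__P1;c__C1;o__O1;f__F1;g__G1;s__G1 a"),
    ("p__P1;c__C1;o__O1;f__F1;g__G1;s__G1 b", "p__P2;c__C9;o__O9;f__F9;g__G9;s__G1 b"),
    ("p__P1;c__C1;o__O1;f__F1;g__G2;s__G2 a", "p__P1;c__C1;o__O1;f__F1;g__G3;s__G3 a"),
    ("p__P1;c__C2;o__O2;f__F2;g__G4;s__G4 a", "p__P1;c__C2;o__O2;f__F2;g__G4;s__G4 a"),
]
counts, mismatches, pct = summarise(pairs)
print(dict(counts), mismatches, pct)
```

The same percentage calculation applied to the first table row (45 mismatches among 1179 genomes) gives 3.82.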

Table S12 | Ranks in the archaeal GTDB taxonomy. The ranks in the derived archaeal GTDB taxonomy release R04-RS89 are shown for the 1,248 designated archaeal species. For comparison, we included the 23,458 bacterial species (more details at https://gtdb.ecogenomic.org/).

Rank     Archaea  Bacteria  Total
Phylum   16       112       128
Class    36       296       332
Order    96       816       912
Family   238      1,969     2,207
Genus    534      7,372     7,906
Species  1,248    23,458    24,706

References

1. Parks DH, Rinke C, Chuvochina M, Chaumeil P-A, Woodcroft BJ, Evans PN, et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat Microbiol 2017; 2: 1533–1542.
2. Hug LA, Baker BJ, Anantharaman K, Brown CT, Probst AJ, Castelle CJ, et al. A new view of the tree of life. Nat Microbiol 2016; 1: 16048.
3. Rinke C, Schwientek P, Sczyrba A, Ivanova NN, Anderson IJ, Cheng J-F, et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature 2013; 499: 431–437.
