DOCTORAL THESIS 2020

Doctoral Programme of Environmental and Biomedical Microbiology

PHYLOGENOMICS OF BACTEROIDETES; HALOPHILIC BACTEROIDETES AS AN EXAMPLE ON HOW THEIR GENOMES INTERACT WITH THE ENVIRONMENT

Raúl Muñoz Jiménez

Thesis Supervisor: Ramon Rosselló Móra

Thesis Supervisor: Rudolf Amann

Thesis Tutor: Elena I. García-Valdés Pukkits

Doctor by the Universitat de les Illes Balears

Publications resulting from this thesis

Munoz, R., Rosselló-Móra, R., & Amann, R. (2016). Revised phylogeny of Bacteroidetes and proposal of sixteen new taxa and two new combinations including Rhodothermaeota phyl. nov. Systematic and Applied Microbiology, 39(5), 281–296.

Munoz, R., Rosselló-Móra, R., & Amann, R. (2016). Corrigendum to “Revised phylogeny of Bacteroidetes and proposal of sixteen new taxa and two new combinations including Rhodothermaeota phyl. nov.” [Syst. Appl. Microbiol. 39 (5) (2016) 281–296]. Systematic and Applied Microbiology, 39, 491–492.

Munoz, R., Amann, R., & Rosselló-Móra, R. (2019). Ancestry and adaptive radiation of Bacteroidetes as assessed by comparative genomics. Systematic and Applied Microbiology, 43(2), 126065.

Dr. Ramon Rosselló Móra, of the Institut Mediterrani d’Estudis Avançats, Esporles, and

Dr. Rudolf Amann, of the Max-Planck-Institut für Marine Mikrobiologie, Bremen

WE DECLARE:

That the thesis titled Phylogenomics of Bacteroidetes; halophilic Bacteroidetes as an example on how their genomes interact with the environment, presented by Raúl Muñoz Jiménez to obtain a doctoral degree, has been completed under our supervision and meets the requirements to opt for an International Doctorate.

For all intents and purposes, we hereby sign this document.

Signatures

Esporles, 15 May 2020

Bremen, 15 May 2020

To my parents, who, to give their children the best, devoted themselves to baking bread and scrubbing floors. Without knowing it, they were growing and killing microorganisms. They never imagined that a generation later those bugs would go so far. Thank you for walking with me this far.

Acknowledgments

This research was funded by the Max Planck Society; by the Spanish Ministry of Economy and Competitiveness through projects CGL2012-39627-C03-03, CLG2015-66686-C3-1-P and PGC2018-096956-B-C41, which were also supported with European Regional Development Fund (FEDER) funds; by the preparatory phase of the Microbial Resource Research Infrastructure (MIRRI), funded by the EU (grant number 312251); and by Deep Blue Sea Enterprise S.L.

The PhD candidate would like to thank the MPI for Marine Microbiology in Bremen for its support and welcome during this research; the IMEDEA for making its premises and resources available; and the UIB for providing the appropriate environment for my education. Thanks also to the SEM (Spanish Society of Microbiology) for awarding the presentation of the phylogeny of Bacteroidetes as the best oral presentation at its 2016 congress. And especially to REDEX (the Spanish Network on Extremophiles), which has brilliantly promoted excellence in microbial research while creating strong bonds between young talents during the years of my membership.

Many thanks, of course, to my supervisors. The experience of completing a thesis would not be complete without some bumps and bruises, but I was lucky to count on two individuals who happen to be excellent people in and out of academia. Thank you ever so much for placing your trust in me and carrying on no matter what. It was a tight finish, but we made it. Thank you, Dr. Rudolf Amann, for your perspectives and appraisals. And thanks to Dr. Ramon Rosselló-Móra for everything, the good and the bad, the whole roller coaster. I have learned a lot from you, and not only about science. Above all, I have learned about caring for what you do and caring for the people you depend on. Many thanks to Dr. Hanno Teeling. We have met in person on too few occasions because of adverse circumstances, but your presence in the last period of this research has been invaluable. Your detailed input and empathy have had a deep impact on me. I only hope I can become a little like you someday. In this group, although he has not co-authored any of the research presented here, I would like to include Dr. Pablo Yarza, who introduced me to database management and to the curation of the All-Species Living Tree Project. It all started there and then, at the Marine Microbiology Group (MMG): the terminal command line, the ARB graphical user interface, the cumbersome data collection. Thank you, Pablo.

Also from the MMG, I would like to thank Dr. Arantxa López, the person who picked me out of a class of over 30 students to tell Ramon I could lend a good hand in the lab. You are responsible for this thesis in the first place. Thanks to the senior PhD candidates in the MMG, Dr. Jocelyn Brito and Dr. Ana Suarez, for leading the way. Special thanks to Dr. Ana Suarez: you kept reaching out to me, during your post-doctoral contract you invited me to Newcastle University to present my first research and, above all, you kept reaching out as a close friend. Thanks also to Dr. Ana Cifuentes, another role model during my MMG days: you rocked it hard, you made me think, you opened my mind and you taught me that we could agree or not but still understand each other.

Thanks to the ‘next generation MMG’. It has been fun and educational to share the lab routines with you: Dr. Bartomeu Viver, Xisca Font, Joan Gago and Sara Díaz. Let me personally thank Carlota Alejandre, my ‘verruguita’, my ally, confidante and fellow MPI grant holder. We will laugh over a glass of wine telling our stories. Personal thanks also to Dr. Merit Mora (de-la-Mo-ra), my princess, and Dr. Carlos Diaz, my second ‘extremeño’. When two talents like you collided, the bond had to be covalent. Thank you, Princess, for so many moments of uncontrolled laughter, empowered tears and sharp humor. Thank you, Carlos, for adding so much. And finally, a very emotional acknowledgment to ‘The Rock’ and ‘The Mom’ of the MMG, Mercedes Urdiain. It goes without saying what a good professional you are and what great technical support you give us, but your human quality and your enormous heart must be highlighted. It saddens me that Pepón will not read these lines. It is only fair to thank him too for how he looked after us and for what he taught us. I raise a glass to him as I write these lines, even if it is not an 18-year-old Lagavulin. And I raise a glass to you and to what is yet to come.

During my working years in the MMG I was also honored to meet people who stayed for a limited time and contributed to both my microbiological knowledge and my professional growth: Neus, Roberto, Diego, Luis, Bai. And above all Dr. Pablo Gallardo and Dr. Nayaret Chamorro, my two favorite Chileans and soul mates for better and for worse. I love you, ‘weones’.

Many thanks to the Microbiology teaching staff at the UIB, who shed light on my understanding and satisfied my thirst for knowledge: Dr. Jorge Lalucat and Dr. Elena García-Galmés, tutors of this thesis, Dr. Balbina Nogales, Dr. Rafael Bosch, Dr. Antoni Bennasar, Dr. Margarita Gomila and Dr. Arantxa Peña.

Thanks also to Dr. Eduardo Pastor from the CAB (Centro de Astrobiología) at INTA, Madrid. It was my pleasure to work with you in the years before this thesis research and to coincide with you at some academic events. Somehow, I could relate to you and learn that a person like me had a voice in the crowded academic world. Thank you so much for being so genuine and for sharing.

Last in the microbiology-related list of acknowledgments, I would like to thank everyone at the Microbiology department of the West Hertfordshire NHS Trust, United Kingdom. I was 20 when I left Spain in pursuit of an independent lifestyle and English proficiency, quitting a biology degree for which I felt I was not competent enough. Lucky me, I happened to get a job as an administrative clerk at the Microbiology lab of Watford General Hospital. You observed me, you noticed I was curious about those little bugs, and you offered me the chance of my life: a promotion grant with full education to become a biomedical lab technician. My life took a turn and I returned to Spain with my academic training unfinished. But I came back as a 24-year-old Spaniard with fine practical training in classical microbiological techniques and skills, sure of his vocation. You made me fall in love with microbiology. Special thanks to my mentors, Dr. Deborah Surridge and Dr. Phillip Spears. And of course to my dearest Gillian Adlington-Graham, who was and still is there to remind me not to wear my heart on my sleeve. Lastly, from Hemel Hempstead General Hospital, I would like to add Dr. Deborah Nolan and Dr. Jeanette Allen to my list of acknowledgments for boosting my self-esteem as a scientist. And, from St. Albans City Hospital, Dr. Asunya Padayachee, for her life coaching and complicity.

The acknowledgments in English close with massive thanks to Jennifer McNish, my ‘English Mom’ and a serious nominee for the International Hosting Mom award that the United Nations will hand out someday. I remember arriving at Oakland College in St. Albans and being assigned as a guest to a widowed lady with two cats. I felt shivers down my spine. I was ever so wrong to judge a book by its cover. You hosted me, cared for me, taught me and even fed me when I needed it, welcoming me into your family forever. And, funnily enough, it was you who told me about the position at Watford General Hospital’s Microbiology Lab, encouraged me to apply and even got the application form, threw it on the dinner table and told me: ‘do it, you never know’. You can plead guilty to getting me to this position. I know what you’d say: ‘I don’t think so, whatever, loser’.

In Spanish, I would first like to thank someone who dignifies her profession and whom, moreover, I believe it is only fair to thank in a doctoral thesis: my eighth-grade E.G.B. teacher, María Teresa Píriz. Besides our keeping in touch and the affection that lasts to this day, you were essential in my life when you crossed paths with it. Curricular content aside, you worked hard to open our eyes to what would really be expected of us during our future academic journey and to how relevant this would be for our professional future. In today's context it seems outlandish that a thirteen-year-old kid could get that into his head, but you managed it. You truly are a Teacher.

Now entirely outside the academic sphere, but counting from the beginning of my peculiar voyage towards becoming a microbiologist, thanks to my family and friends (sometimes one does not know where to separate one from the other), who have kept lifting me up and giving me life. I believe that, from the twenty-something at the start to the forty-something of today, I have been migrating from an extroverted pole to an introspective one, which is not the same as introverted. I also believe this has been closely tied to the process of learning to be a scientist: sinking into a state of permanent reflection that isolates you from whatever happens beyond your most immediate present. For that reason, thanks to all those I know I can count on despite abandoning you for longer or shorter spells.

Thanks to Eva Alsina and Ana Agud, who guard the most naive and mischievous memories of my youth and embrace the present with the same affection. Thanks to Yolanda Chamero: you were there with me in the lion's den, and look at us, look how much we still laugh. Many thanks to Ramona Piñero, cousin Ramona; what a great life lesson you have given me. How much I owe you, Ramona!

There was a moment in my life when I thought I had touched the sky, but everything collapsed in the most painful way. Tino Zarza, Javier Valero and William Quiróz were there to put my pieces back together. Then came David ‘De Dios’ (if I give your surname they will confuse you with my fiancé), like an elephant in a scrapyard, literally. Gosh, David, you are so generous and selfless that I feel deeply indebted to you. I have to spoil you more, but I could not love you more.

Also there were ‘my wifey’ Cassandra C. Santos and my reconquering Aragonese, Dr. María Elisa Cartagena. The Tramuntana awaits us. Cass, we need to make up our minds. Elisa, you feed my mind and my soul with delicacies I do not measure up to, and you always know how to make me feel worthy. In the end we are doing too well, and how good it is that we are. Get me out of this introspection, please.

The latest on the timeline are Chanel and Lola, the untamable little shrews who have restored my faith in humanity. We are not naive; we allow ourselves to be because we want life to surprise us as it does a child's gaze. Cheers!

I would never forget my Mai, María Teresa Amores. If soul mates exist, they are you and me. No matter where the wind blows from, it will always bring us together. Thanks to Oscar Soldado for cementing our friendship and helping us grow so well. Thanks to ‘Estercica’ for looking after him for us. And thanks to Eduardo for mooring the lines in a safe harbor. Thank you for being my family and for always blowing in my favor.

Some friendships are family, as Aina, Rosa, Isabel, Juan, Antonia María, Iván and little Mar know well. Blessed be the hour. Well met. You have endured my thesis first-hand and granted me the gift of perspective and distance. Touching ground. Contact with reality. The escape valve. Whistle whenever you like.

Others were family before they were friends, like Xisco and Antonio. Now we are the resistance, together with Luis, Santi and Elena. Dance, damn you! Love seeps out of our pores and I do not want to stop smelling it. Without you, the burden of the thesis would have sunk me.

Sometimes one's real family, the blood family, occupies a satellite space in our lives. That is my case, but I do not consider it any less important for that. When nothing else is left, you will remain, and not out of mere obligation. I am aware that, from the outside, my family may seem distant. That is false. Our distance is our way of knowing that everything is fine. When things go wrong, I know that my siblings and I are a pack. We bite even one another, but never to kill.

Seeking what is best for me, my family has not always managed to understand my ambitions, this profession and what it entails. This has been hard to bear; I need not hide it because it is well known. However, I have come to understand that I should not lean on my family to push myself forward. It is far more important to know that they are behind me, watching. The core of my family is my parents, Ignacio and María Dolores, to whom this manuscript is dedicated; my siblings Nacho, Cati, Loli and Miguel Ángel; and our future: Miguel, Ignasi, Guiem M., Guiem C. and Marina. At times I have thought that the former missed opportunities to know me better, but the most serious thing is that I have denied the latter the opportunity to do the same. Children, you will always find me. I love you all. Thank you, family, for laying the foundations of this achievement. It would not be fair if I did not attribute part of this merit to my extended family: Pedro, Cati, ‘Pedrito’, Juan, Susana, Isaac, Rebeca and Tomeu. You too are an important part of my personal journey.

Shortly after I started this thesis, a family-in-law appeared: Diego, María, Manoli, Tomás, Iván, Ana, Mari, Rafa, Isaac and Noa. I am deeply grateful for your ‘tact’ and good judgment. I never asked permission to step inside your home; nor do you need mine to make my home yours.

The one to blame for much of my present happiness is you, David Jiménez. My fiancé (allow me to revel in using that word). I have been with my thesis for as long as I have been with you and, look, everything comes to an end. I hope the current circumstances allow us to marry in October as planned, but that is neither our goal nor our ending. It will be the celebration of the road travelled. During this time, as you well know because it has hurt you, I have called into question your contribution to my attaining the degree of Doctor. I believe I am responsible for your aversion to this degree. You have seen me suffer, give up, despair, everything you do not wish for me. And yet there you remain, watching over the well-being of a kamikaze. Please, make peace with my profession. This thesis is as much yours as it is mine. I love you.

INDEX

Resum
Resumen
Summary
Introduction
Objectives
Methods
Results
  1. Selection and curation of sequence data
    1.1. Species with standing in nomenclature and associated ribosomal sequences
      1.1.1. SSU ribosomal sequences
      1.1.2. LSU ribosomal sequences
    1.2. Non-ribosomal sequences
      1.2.1. Dataset for Multilocus Sequence Alignment
      1.2.2. Accessory ATP synthase beta subunit and Alanine synthase sequences
    1.3. Sequenced genomes for comparative genomics
  2. Analysis of bacteroidetal 16S rRNA sequences
    2.1. Building of the proposed bacteroidetal 16S rRNA phylogenetic tree
    2.2. Updated 16S rRNA phylogeny of the Bacteroidetes
      2.2.1. Coherence of taxonomic clades
    2.3. Signatures in the 16S rRNA sequence
    2.4. Inconsistencies with current
  3. Analyses of auxiliary molecular clocks
    3.1. The phylogeny of the bacteroidetal 23S rDNA
    3.2. Phylogeny of prokaryotic SSU and LSU ribosomal sequences combined
    3.3. Individual phylogenies of 29 orthologous gene products
    3.4. Multilocus Sequence Analysis of 29 orthologous gene products
    3.5. Phylogeny of the F-type ATP synthase beta subunit
    3.6. AtpD and AlaS indel prints
  4. Phylogenomic trends in the protein pool of Bacteroidetes
    4.1. Reciprocal Best Match (RBM) analysis
    4.2. Highly conserved proteins: core genome, exclusive sets and pertinent proteins
    4.3. Median sequence identity of conserved sequences
    4.4. Synonymy and synteny
    4.5. Expected sus and flx genes
    4.6. Most ubiquitous gene-clusters in Bacteroidetes
  5. Adaptive radiation driven by the environment
    5.1. Genes of the Complex I predict halophily
    5.2. Variants of the aerobic respiratory chain
    5.3. Adaptive radiation explained by the aerobic respiratory chain
Discussion
Conclusions
References
Glossary
Annex

Resum

The use of ribosomal markers in bacterial systematics closely linked the Cytophaga-Flavobacteria-Bacteroides groups even though their phenotype classified them independently. In view of the phylogenetic evidence they were grouped into the phylum Bacteroidetes, although they continue to be studied mainly separately because there seems to be no phenotypic coherence among them. The phenotypes recurrently attributed to the Bacteroidetes are the decomposition of complex organic matter, gliding motility and pigmentation by flexirubins and carotenoids. The recent boom in culture-independent methods and the development of polynucleotide sequencing techniques have revealed the prevalence of the Bacteroidetes in many relevant ecosystems. For example, certain saline environments apparently select for these bacteria over those of other phyla. The objectives of this doctoral thesis are (1) to revise the taxonomy of the Bacteroidetes, (2) to find the coding genes that may manifest an identifying phenotype and (3) to discover which genes are selected in saline environments.

For the taxonomic revision of the phylum, not only has the phylogeny based on the 16S rDNA gene been updated, but also that of the 23S rDNA, the signature nucleotides of the 16S rRNA sequence, the insertion and deletion patterns in two proteins and a phylogeny obtained by multilocus analysis of 29 genes. This new taxonomy, besides allowing the naming of a new phylum, three classes, three orders, eight families and one genus (all of them with standing in bacterial nomenclature), has enabled a precise phylogenomic analysis of the Bacteroidetes, which in earlier studies included members of the new phylum Rhodothermaeota. Nevertheless, this reduced the chances of finding genes selected by salinity, since most extreme halophiles belong to this new phylum.

Through the genomic comparison of 89 carefully selected Bacteroidetes genomes, it was found that the most conserved genes in the phylum (in 90% of the genomes), presumably not housekeeping genes, are those encoding the type IX secretion system, so far described only in this phylum. This secretion system is related to the gliding movement of some Bacteroidetes, but it is also an excretion system for hydrolytic enzymes. Next come the housekeeping genes encoding the respiratory supercomplex Alternative Complex III – cytochrome caa3 oxidoreductase, found in 83% of the genomes and in all aerobic organisms. The anaerobic lineage of the Bacteroidales can complete aerobic respiration thanks to a cytochrome bd oxidoreductase with high affinity for oxygen that helps them remove it from the environment. Complex II of the respiratory chain, succinate dehydrogenase, is single-copy in all Bacteroidetes genomes, and in the Bacteroidales there are vestiges of other genes for enzymes of the tricarboxylic acid cycle that attest to their aerobic past. Finally, the genes of Complex I of the respiratory chain do not follow a phylogenomic pattern in the Bacteroidetes. These genes are replaced by those of an alternative complex I that pumps sodium into the intermembrane space instead of protons, the sodium-pumping NADH:quinone oxidoreductase (Na+-NQR), only in marine organisms or members of the gut microbiota. In addition, we found evidence that the complex was invented by the Bacteroidetes and has been transmitted horizontally within saline environments.

The joint expression of the type IX secretion system and the respiratory chain of the Bacteroidetes, adapted to each of the environments they inhabit, is very probably responsible for the adaptive radiation of the Bacteroidetes. Some of these genes are transferred horizontally, but taken together they predict that the original bacteroidete was an aerobic heterotroph feeding on complex organic matter that possibly lived in freshwater.

Resumen

The use of ribosomal markers in bacterial systematics closely related the Cytophaga-Flavobacteria-Bacteroides groups, even though their phenotype classified them independently. Given the phylogenetic evidence, they were grouped into the phylum Bacteroidetes, but they are still studied mainly separately, since there appears to be no phenotypic coherence among them. The phenotypes most recurrently attributed to Bacteroidetes are the decomposition of complex organic matter, gliding movement and pigmentation by flexirubin and carotenoids. The recent rise of culture-independent methods and the development of polynucleotide sequencing techniques have revealed the prevalence of the Bacteroidetes in many relevant ecosystems. As an example, certain saline environments seem to select for these bacteria over those of other phyla. The objectives set by this thesis are (1) to revise the taxonomy of Bacteroidetes, (2) to find coding genes that may express an identifying phenotype and (3) to determine which genes are selected in saline environments.

For the taxonomic revision of the phylum, the update covered not only the phylogeny based on the 16S rDNA gene but also that of the 23S rDNA, the signature nucleotides of the 16S rRNA sequence, the insertion and deletion patterns in two proteins and a phylogeny by multilocus sequence analysis of 29 genes. This new taxonomy, besides allowing the naming of a new phylum, three classes, three orders, eight families and one genus (all of them with standing in bacterial nomenclature), has enabled a precise phylogenomic analysis of the Bacteroidetes, which in earlier studies included members of the new phylum Rhodothermaeota. It also, however, reduced the chances of finding genes selected by salinity, given that most extreme halophiles belong to this new phylum.

The genomic comparison of 89 carefully selected Bacteroidetes genomes revealed that the most conserved genes of the phylum (in 90% of the genomes), presumably not housekeeping genes, are those coding for the type IX secretion system, to date described only in this phylum. This secretion system is related to the gliding movement of some Bacteroidetes, but it is also an excretion system for hydrolytic enzymes. Next come the housekeeping genes coding for the respiratory supercomplex Alternative Complex III – cytochrome caa3 oxidoreductase, in 83% of the genomes, all from aerobic organisms. The anaerobic lineage of the Bacteroidales can complete aerobic respiration thanks to a cytochrome bd oxidoreductase with high affinity for oxygen that helps them remove it from their surroundings. The genes of Complex II of the respiratory chain, succinate dehydrogenase, are single-copy in all Bacteroidetes genomes, and in the Bacteroidales there are vestiges of other genes for enzymes of the tricarboxylic acid cycle that testify to their aerobic past. Lastly, the genes of Complex I of the respiratory chain do not reproduce a phylogenomic pattern in the Bacteroidetes. These genes are replaced by those of an alternative complex I that pumps sodium into the intermembrane space instead of protons, the sodium-pumping NADH:quinone oxidoreductase (Na+-NQR), only in marine organisms or members of the gut microbiota. Furthermore, we found evidence that the complex was invented by the Bacteroidetes and has been transmitted horizontally in saline environments.

The joint expression of the type IX secretion system and the respiratory chain of Bacteroidetes, adapted to each of the environments they occupy, is very probably responsible for the adaptive radiation of the Bacteroidetes. Some of these genes are horizontally transferred, but together they predict that the original bacteroidete was an aerobic heterotroph feeding on complex organic matter that possibly lived in freshwater.

Summary

The use of ribosomal markers in bacterial systematics intimately linked the Cytophaga-Flavobacteria-Bacteroides groups even though phenotypic studies classified them apart. In view of the phylogenetic evidence, they were circumscribed as the phylum Bacteroidetes, but they are still studied mainly as independent clades owing to a lack of phenotypic coherence. The phenotypes most often attributed to Bacteroidetes are the ability to decompose complex organic matter, gliding motility and pigmentation with flexirubins and carotenoids. The recent surge of culture-independent methodologies and the development of polynucleotide sequencing techniques have unveiled the prevalence of Bacteroidetes in many relevant ecosystems. For example, some saline environments seem to select for these bacteria, which out-compete those of other phyla. The goals of this doctoral thesis are (1) to review the taxonomy of the Bacteroidetes, (2) to find coding genes that could translate into an identifying phenotype and (3) to find which of their genes are selected in saline environments.

For the taxonomic review of the phylum, not only was the 16S rDNA gene phylogeny updated, but also that of the 23S rDNA, the signature nucleotides in the 16S rRNA sequence, the insertion/deletion patterns in two proteins and a phylogeny based on the multilocus sequence analysis of 29 genes. This new taxonomy, besides naming a new phylum, three classes, three orders, eight families and a genus (all of them with standing in bacterial nomenclature), paved the way for a precise phylogenomic analysis of the Bacteroidetes, which in earlier studies included members of the new phylum Rhodothermaeota. However, it curtailed the chances of finding salinity-related genes, since most extreme halophiles belong to the new phylum.

Through the comparative genomics of 89 meticulously selected genomes we found that the most conserved non-housekeeping genes in the phylum (present in 90% of the genomes) are those coding for the type IX secretion system, hitherto seen only in the Bacteroidetes. This secretion system is involved in the gliding motility of some Bacteroidetes, but it is also an excretion route for hydrolytic enzymes. Next were the genes coding for the respiratory supercomplex Alternative Complex III – cytochrome caa3 oxidoreductase, present in 83% of the genomes, all from aerobic species. The anaerobic lineage of the Bacteroidales can perform aerobic respiration thanks to a cytochrome bd oxidoreductase with high oxygen affinity that helps detoxify their surroundings. Genes of Complex II of the respiratory chain, succinate dehydrogenase, are single-copy throughout all Bacteroidetes genomes, and in the Bacteroidales other vestigial genes of the tricarboxylic acid cycle denote their aerobic past. Last, the genes of the respiratory Complex I do not follow a phylogenomic pattern in the Bacteroidetes. These genes are replaced by those of an alternative complex I that pumps sodium instead of protons into the intermembrane space, the sodium-pumping NADH:quinone oxidoreductase (Na+-NQR), only in marine species or gut microbiota species. Moreover, we found evidence that this complex was invented by the Bacteroidetes and transmitted horizontally within saline environments.

The simultaneous expression of the type IX secretion system and the respiratory chain of the Bacteroidetes, which is adapted to the environments they live in, is likely responsible for the adaptive radiation of the phylum. Some of these genes are horizontally transferred, but altogether they predict that the original bacteroidete was an aerobic heterotrophic decomposer of complex organic matter, likely living in freshwater.

Introduction

The questions this thesis aims to solve.

Bacterial taxonomy currently follows a hierarchical structure outlined by phylogenies based on biomolecular sequences. This hierarchy is organized in taxonomic ranks of increasing level: species, genera, families, orders and classes. Intermediate categories exist to accommodate prokaryotic groups circumscribed in between the main categories. Above the rank of class, the rank of phylum is widely used, but it is not yet formally accepted (Whitman et al., 2018) in the International Code of Nomenclature of Prokaryotes (ICNP; Parker et al., 2019), supervised by the International Committee on Systematics of Prokaryotes (ICSP). Groups of prokaryotes that can be circumscribed within the same taxonomic category form a taxon, and they share a name. Related taxa usually share phenotypes due to their phenotypic and genomic coherence (Rosselló-Móra and Amann, 2015). This structure helps microbiologists to recognize basic biological traits of prokaryotes by their name, which can refer to their morphology (e.g., -bacillus, -coccus, Actino-) or metabolism (e.g., denitrificans, -fermentans, Anaero-).

The first description of a Bacteroidetes species dates back to the end of the 19th century (Veillon et al., 1898). The phylum was ultimately named after the genus Bacteroides. The Bacteroides are Gram-negative, non-spore-forming, anaerobic rods. Species classified within the class Bacteroidia conserve these traits; they are often isolated from the feces, gut or oral cavity of animals and are opportunistic pathogens. However, the phylum Bacteroidetes also circumscribes the classes Flavobacteriia, Sphingobacteriia and Cytophagia, all aerobes. Many are free-living in brines, seas or fresh waters. The Bacteroidia are not even the most species-rich Bacteroidetes; that title is held by the Flavobacteriia.

The heterogeneity of the Bacteroidetes has long been known, because bacterial phylogenies affiliated its high-rank taxa close together. It was circumvented for many years by referring to the bacteroidetal lineage as the Cytophaga-Flavobacteria-Bacteroides group, or CFB group for short. The name Bacteroidetes can be traced back to the nineties (Olsen et al., 1994), when its monophyletic origin had long gone unquestioned. For coherence with the ICNP, the etymological root of the phylum’s name had to be Bacteroides, since it was the first genus name in the taxon with standing in nomenclature (Lapage et al., 1992). But it was only in 2010, after modern sequencing techniques had enriched nucleotide sequence databases, that Krieg et al. (2010) produced a sophisticated phylogenetic analysis of the phylum and classified it ‘on the basis of phylogenetic analysis of 16S rRNA sequences’. They described it as a ‘phenotypically diverse group of Gram-stain-negative rods that do not form endospores.’ Its protologue warned that the genera Rhodothermus, Salinibacter and Thermonema ‘appear to represent deep groups of the phylum that cannot be readily assigned to any of the four classes’ (Krieg et al., 2010).

The unsharpened taxonomic outline of the phylum coincided with a surge in the detection and estimated abundance of bacteroidetal sequences in environmental samples thanks to metagenomic techniques and fluorescence in situ hybridization (FISH) (Eilers et al., 2001; Manz et al., 1996; Weller et al., 2000). The discovery of a bacteroidetal diversity that had gone under the radar of culture protocols and clone libraries motivated the isolation of many novel species and the study of their ecological roles. Ever since, bacteroidetal communities have been known to be relevant in competitive ecosystems like bacterioplankton, soil and gut microbiota, and to be related to some autoimmune conditions like Crohn’s disease (Vidal et al., 2015). But principally, they are of utmost importance to the carbon cycle due to their capacity to remineralize complex carbon compounds (e.g., starch, chitin, hemicellulose, and pectin) (Wexler, 2007). Consequently, the phylum is nowadays perceived as a group of high molecular weight (HMW) organic matter decomposers. Some authors attribute to Bacteroidetes the capacity of moving by gliding. The gliding phenotype is not widely observed in the phylum, but the gliding machinery seems to be genetically coded in the vast majority of species (McBride and Zhu, 2013). The expression of gliding motility has been linked to low nutrient contents (Bernardet and Bowman, 2011), which implies they would attach to and glide upon polymeric surfaces to feed. Lack of gliding has been explained through mutant strains that cannot express functional gliding proteins even though their genes are conserved. Many bacteroidetal species are also pigmented with flexirubins or carotenoids, making colonies look yellow, orange or pink. These pigments should protect free-living cells from UV light, or even help them harvest light energy for transmembrane proton transport (Oren, 2011).

A well-known bacteroidetal light harvester is the hyperhalophilic genus Salinibacter (Oren, 2013). Salinibacter thrives in hypersaline shallow brines exposed to sunlight, being the predominant bacterium in crystallizer ponds of solar salterns. The increasing salinity in contiguous saltern ponds enriches the bacteroidetal community as opposed to other phylotypes (Gomariz et al., 2015). Consequently, moderate halophiles (with optimal growth at 3-15% salts) are enriched first and extreme halophiles (with optimal growth above 15% salts) later. Diverse taxa within the Bacteroidetes exhibit halophily. Related to the Salinibacter phylotype are the moderate halophiles Aliifodinibius, Fodinibius, Gracilimonas and Salisaeta. The Cytophagia affiliate moderate halophilic species of the genera and . In the Bacteroidia, the species Anaerophaga thermohalophila has its optimal growth at 6% salts. And in the Flavobacteriia are classified the extreme halophile Psychroflexus salinarum and other moderate halophiles of the genera Flaviramulus, Gramella, Psychroflexus, Salegentibacter and Salinimicrobium (optimal growth conditions can be consulted at: www.bacdive.dsmz.de). There is no answer yet as to why bacteroidetal species prevail against other bacterial halophiles in some saline environments. Whether it is due to the available carbon sources (i.e., polymeric carbohydrates) or to taxon-specific osmotic mechanisms, comparative genomics should provide insight into which bacteroidetal genes are selected by the environment. Furthermore, phylogenetic inferences based on genomic prints (phylogenomics) should tell whether those genes are transmitted vertically within a lineage or horizontally within the environment.

Comparative genomics has already proven useful in the study of bacteroidetal phenotypes. As protein units of the complex gliding machinery were identified, their annotation in bacteroidetal genomes demonstrated their wide distribution in the phylum regardless of their expression (McBride and Zhu, 2013). As for their ability to degrade complex matter, synteny around bacteroidetal genes coding hydrolytic enzymes suggests that most Bacteroidetes code for polysaccharide utilization loci (PULs) (Larsbrink et al., 2014; Reeves et al., 1997). The basic structure of a PUL operon requires the presence of the consecutive susC-susD genes followed by genes coding hydrolases. The protein SusD is an outer membrane protein that binds oligosaccharides and presents them to SusC, a TonB-dependent outer membrane pore that takes up the oligosaccharides. Once in the periplasmic space, bond-specific hydrolases coded in the PUL digest the molecule into monomers that enter the cytoplasm. The susC/D tandem has become a genomic predictor of the phenotype (Terrapon et al., 2015, 2018), but no phylogenomic inferences on the susC/D print have been presented yet. However, signs of horizontal transmission of PULs exist (Thomas et al., 2011).

The comparative study of bacteroidetal genomes faces the adversity of their poor representation in genomic databases. Despite the Bacteroidetes being one of the most cultured phyla together with , and , the genomes from these other taxa drastically outnumber the bacteroidetal ones (Hugenholtz, 2002). This translates into a poor representation of the bacteroidetal taxonomic diversity that is also biased by the prolific sequencing of Flavobacteriia and Bacteroidia genomes. Nevertheless, since 2007 the Genomic Encyclopedia of Bacteria and Archaea (GEBA) project (Kyrpides et al., 2013; Whitman et al., 2015), an initiative of the Joint Genome Institute (JGI), has been sequencing genomes of broader phylogenetic diversity. At the beginning of this thesis, the Ph.D. candidate cooperated in the listing of diverse bacteroidetal species to be sequenced by the GEBA project. Hence, an improved dataset for comparative genomics was expected to become available during this doctoral research.

In the meantime, the abundance of bacteroidetal small subunit (SSU or 16S rRNA) ribosomal sequences in curated databases allowed the phylogenetic revision of the taxonomic outline of the phylum. At the beginning of this doctoral research, the SSU rRNA-based phylogeny by Krieg et al. for the Bergey’s ® Manual of Systematic Bacteriology was the taxonomic outline in force (Krieg et al., 2010). However, the number of validly published bacteroidetal names doubled, with a consequent increase in type strain SSU rRNA sequences. Although 16S rRNA stands as the reference molecule for phylogenetic inferences, to the detriment of the longer 23S rRNA due to past technical limitations, the analysis of non-ribosomal sequences is desirable to prove the differentiation of genetic lineages. The multilocus sequence analysis (MLSA) technique compares chimeric concatenations of preferably more than four core genes (Glaeser and Kämpfer, 2015). Sets of 16S rRNA and core protein sequences can, in addition, be explored for taxon-specific signature nucleotides/amino acids. In 1987, Carl R. Woese, father of the ribosomal RNA-based phylogenies, described a series of taxon-specific signature nucleotides in the SSU ribosomal sequence (Woese, 1987). Soon after, Rudolf Amann et al. described sequence insertions in the F-type ATP synthase β subunit protein characteristic of the phylum (Amann et al., 1988). Later, Radhey S. Gupta, an expert in protein sequence alignments, described signature indels (insertion-deletion patterns) for the , Chlorobi and Bacteroidetes in the sequence of the alanyl-tRNA synthetase (Gupta, 2004).

Data resources.

This doctoral thesis is, in essence, an exercise in sequence analysis. The sophistication of sequence analysis methodologies in the last decades has accompanied the growth of sequence databases. Initially, every sequence was linked to the name of the organism it belonged to. Now, additional information on taxonomy, date and source of strain isolation, manuscript where the sequence was published, geographic coordinates of the sampling location, etc., makes up what is called the ‘metadata.’ It is important to know the history behind modern databases to prevent poor results caused by poor data sampling.

Since the 1980s, the description of a new bacterial species requires its axenic culture, a valid name designation, and a ribosomal or genomic sequence that allows its phylogenetic classification. Publishing the new species’ 16S rRNA gene sequence became almost compulsory (exceptions exist; e.g., endosymbionts and species are often SSU sequence orphans), and the complete genome is now desired. The axenic culture must be deposited in at least two independent public culture collections and designated as the type strain. The name must be validly published (i.e., listed by the IJSEM in either a notification list or a validation list). And the sequence must be deposited in a public database, usually NCBI, ENA or DDBJ. The culture is the type material of reference for studying the species, and the sequence stands in for the type material in off-bench analyses, although it is not type material per se; such sequences are, at most, reference sequences. This system to warrant the traceability of a species’ identity remains unchanged, although there is a debate on whether sequences should be considered type material given their accessibility (Rosselló-Móra and Whitman, 2019).

Modern sequencing techniques have surpassed our capacity to grow and isolate bacterial strains. This translates into enormous sequence databases where most of the sequences are ‘environmental’, i.e., sequenced from an environmental sample, not from an axenic culture. Naming these sequences requires their phylogenetic analysis and, depending on the researcher’s skills and/or need for accuracy, names are only approximate. Environmental sequences are often named after their closest relative within species similarity boundaries. But with sequences from type material diluted in bulk databases, homonymous sequences diverging beyond the species similarity boundary abound. It is also common to find sequences with different names that are synonyms (identical or highly similar). Furthermore, sequences without a validly published name have also flourished (e.g., named “candidatus”, “unclassified”). This complicated access to reference sequences for taxonomists and urged the emergence of catalogs of the prokaryotic diversity. The most relevant, sorted chronologically and checked up to the date of this writing, are:

• The Ribosomal Database Project (RDP), 1992. Compilation of ribosomal sequences and related data sorted phylogenetically, with the database’s own taxonomic classification. Over the years it has included different tools like a sequence aligner, a hierarchical taxonomic browser or a DNA probe matcher, but initially it inherited the chaotic classification of sequences as “environmental sample” or “unclassified” from genetic databases. It was last updated in September 2016 and included 1,315 ribosomal sequences from isolated type Bacteroidetes, classified according to the RDP taxonomy. https://rdp.cme.msu.edu/

• List of Prokaryotic Names with Standing in Nomenclature (LPSN), 1997. This is an initiative of Professor Jean-Paul Euzéby that was handed over to Dr. Aidan C. Parte in 2012. The database at http://www.bacterio.net/ is a list of all species names that have ever been validly published. It can be browsed hierarchically. It keeps track of species names, their new combinations and reclassifications, with references to the original articles or changes in the International Code of Nomenclature of Prokaryotes. Since outdated names are kept for historical record, names with standing in nomenclature must be filtered manually. On February 17th, 2020, LPSN merged with the Prokaryotic Nomenclature Up-to-date (PNU) database, maintaining the World Wide Web address www.bacterio.net. The refreshed website can be browsed by taxonomic ranks and explicitly tells the nomenclatural and taxonomic status of each name.

• Genomes Online Database (GOLD), 2001. This is an initiative of the JGI. It is a comprehensive database of genomes, metagenomes and associated metadata. As of March 2019, it included 16,011 type-strain genomes, of which 1,327 were Bacteroidetes. Its current richness in type species genomes is the result of the GEBA project, launched in 2007 with the sequencing of 250 genomes. Since 2011, GEBA has filled the gaps in sequencing many branches of the Tree of Life. In a second phase (2015), they sequenced 1,000 type-strain genomes, and in a third (since 2015) they are sequencing newly described type strains. https://gold.jgi.doe.gov/

• MBGD: Microbial Genome Database for Comparative Analysis, 2002. Dr. Uchiyama published the database in 2002; it consists of a genome database with precomputed all-against-all gene similarities. The user can select the genomes they wish to compare and the similarity threshold to apply. Ready-made similarity tables are also available. The sets of similar genes between genomes are linked to relevant metadata like their annotation and functional classification. Since 2015 it has allowed users to upload their own genomes, but it is an online tool with a slow interface (it supports the transit of large data) and is not user-friendly. Its hierarchical taxonomic classification mirrors the NCBI taxonomy database and cannot be customized. It was last updated in September 2018, containing 6,318 genomes, 200 of which are Bacteroidetes including ‘candidatus’ species.

• GreenGenes, 2006. This was the first 16S rRNA sequence database that presented some level of quality control by checking for the presence of chimeras in the public data repositories. It combined multiple taxonomies from major curators (NCBI, RDP, Wolfgang Ludwig, Phil Hugenholtz and Norman Pace) and was compatible with the software ARB, which allows the desktop maintenance of sequence databases. Although it was last updated in June 2017, its latest ARB release dates from May 2011. http://greengenes.lbl.gov/Download/

• SILVA, 2007. This project is a group of ribosomal databases, an initiative of Ribocon GmbH, an associate of the Max Planck Institute for Marine Microbiology in Bremen. It was the first to include sequences of the small subunit (SSU: 16S rRNA) and large subunit (LSU: 23S rRNA) of the three domains of life. The largest SSU and LSU databases are the “Parc”, which include all publicly available sequences longer than 300 nucleotides. A shorter version named “Ref” reduces the number of partial sequences by filtering out those shorter than 1900 nucleotides for the LSU, 1200 nucleotides for bacterial/eukaryal SSU and 900 nucleotides for archaeal SSU. The shortest is the non-redundant (NR) version, which keeps the best sequence of each OTU at 99% similarity. Among the metadata accompanying each sequence are the taxonomies of other platforms (e.g., RDP, Greengenes, NCBI), but SILVA also produces its own classification based on a phylogeny that uses its own aligner, SINA. This aligner considers the secondary structure of the RNA and divides the alignment into helix and loop regions, which comes in handy when manually curating the alignment. Sequences that produce a bad alignment are flagged to warn the user. The criterion to flag a bad alignment is an index called ‘sequence alignment quality’, which tells how well the sequence aligns with neighbor sequences. The database also evaluates the sequence itself, counting how many ambiguous nucleotides it contains (represented by letters other than A, G, C or U) and gaps or unresolved 5’ or 3’ sequence ends. This is reported as the ‘sequence quality’ index. The databases are compatible with ARB, after which the project is named (ARB from Latin arbor = tree, SILVA from Latin silva = forest). As of March 2019 (release r132), SILVA Ref contained 611,653 SSU sequences and 8,795 LSU sequences of the phylum Bacteroidetes. https://www.arb-silva.de

• The All-Species Living Tree Project (LTP), 2008. This is an initiative of five partners (Elsevier, IMEDEA, TUM, MPI and SBSV) together with the journal Systematic and Applied Microbiology (SAM), addressed to microbial taxonomists. In cooperation with the editors of ARB, SILVA and LPSN, LTP prepares rRNA gene datasets in synchrony with SILVA but with only one sequence (the best in length and alignment) per type strain, with up-to-date names and a manually curated alignment. Its latest update in June 2018 (s132) included 1,566 SSU sequences and 130 LSU sequences of the Bacteroidetes. https://www.arb-silva.de/projects/living-tree/

• StrainInfo, 2010-2019. This was a catalog of strains built upon the catalogs of Biological Resources Centers (BRCs). The integration of all catalogs listed all synonym strains for each species, their availability from culture collections and associated information (e.g., publications, 16S rRNA sequence and validly published name). Unfortunately, the database has been permanently closed since January 2019.

• EzBioCloud (formerly EzTaxon), 2012. The 16S rRNA database EzTaxon was launched for the rapid identification of SSU rRNA sequences using QIIME and mothur pipelines. It included only validly published species and phylotypes that could result in new species. Nowadays, the database has been absorbed by the EzBioCloud project, which includes genomic databases. The current 16S rRNA gene database is oriented to the identification of microbiomes and includes 7,952 Bacteroidetes sequences (as of March 2019). The genomic database contains 3,056 Bacteroidetes genomes. https://www.ezbiocloud.net/

• BacDive, 2013. The German Collection of Microorganisms and Cell Cultures (Leibniz Institute, DSMZ) hosts this database, which at its launch integrated information on the taxonomy, morphology, physiology, sampling, molecular biology and concomitant condition of 23,458 strains. In April 2019 it contained 2,464 strains of Bacteroidetes. https://bacdive.dsmz.de/
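The length thresholds and the ambiguity count mentioned in the SILVA entry above can be illustrated with a minimal sketch. The function names are hypothetical, the choice to keep a sequence that sits exactly at the threshold is an assumption, and the code does not reproduce how SILVA actually computes its ‘sequence quality’ index; it only counts the letters other than A, G, C or U.

```python
# Sketch only: thresholds are the ones quoted in the SILVA entry;
# function names are hypothetical and this is not SILVA's own code.
UNAMBIGUOUS = set("ACGU")
GAP_CHARS = set("-.")


def min_ref_length(subunit, domain):
    """Minimum length (nucleotides) for the 'Ref' subset, as quoted in the text."""
    if subunit == "LSU":
        return 1900
    if subunit == "SSU":
        return 900 if domain == "archaea" else 1200
    raise ValueError(f"unknown subunit: {subunit!r}")


def passes_ref_length(sequence, subunit, domain):
    """Assumption: a sequence exactly at the threshold is kept."""
    residues = [c for c in sequence if c not in GAP_CHARS]
    return len(residues) >= min_ref_length(subunit, domain)


def ambiguity_fraction(sequence):
    """Fraction of residues that are not A, C, G or U (T is read as U)."""
    normalized = sequence.upper().replace("T", "U")
    residues = [c for c in normalized if c not in GAP_CHARS]
    if not residues:
        return 0.0
    return sum(c not in UNAMBIGUOUS for c in residues) / len(residues)
```

For instance, `ambiguity_fraction("ACGTNNRY")` counts N, N, R and Y as ambiguous, i.e. half of the eight residues.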

Sequence analysis of nucleic sequences.

The richness in quantity and quality accumulated over the decades by 16S rRNA gene sequence databases has contributed to making this molecule the most significant in phylogenetic inferences. Although the current trend is to use genome sequences for a more comprehensive approach, 16S rRNA gene phylogenies are still relevant. The comparison of ribosomal SSU sequences has been technically tested, improved and standardized over the years, which provides a solid contextual legacy for the interpretation of 16S rRNA gene-based phylogenetic trees. The molecule has been intensively studied and compared across taxa. Its secondary structure can be integrated in its nucleotidic alignment, and ranges of percentage divergence in its primary structure have been linked to taxonomic categories (Yarza et al., 2014). Single nucleotide variations with taxonomic significance have been identified too (Woese, 1987). A similar level of sophistication and standardization is yet to be reached by genomic comparisons. The heterogeneity of compared genomic loci and mathematical approaches between studies currently hinders the creation of a ‘gold standard’ in phylogenomics like 16S rRNA gene phylogenies are to phylogenetics. As of this writing, we are aware that genomic taxonomy will soon be adopted by influential texts (e.g., Bergey’s ® Manual, NCBI), mainly guided by the Genome Taxonomy Database (GTDB) (Parks et al., 2018). However, taxonomists need more time to implement it and produce enough literature to evaluate it.

The last phylogenetic review of the phylum Bacteroidetes before this study was performed by Wolfgang Ludwig for the Bergey’s ® Manual in 2010 (Ludwig et al., 2010). The updated phylogeny in this thesis uses the strategy described in The Prokaryotes’ 14th edition (Munoz et al., 2014), which developed from W. Ludwig’s recommendations (Ludwig and Klenk, 2005). This strategy combines (1) curated sequences and alignment, (2) two evolution models, (3) two branch length corrections, and (4) three filters for hypervariable sequence positions of the alignment, to reproduce a very robust phylogenetic tree.

A 16S rRNA nucleotidic sequence can be treated as a thread of four combined information units (nucleotides) numbered from position 1 to ~1,500. Every position is homologous between taxa except for insertions or deletions of nucleotides. Positional changes of nucleotides between sequences are the true pieces of information that a phylogeny articulates into a tree-like diagram. The raw data produce artefacts in the final diagram, mainly due to uneven representation of all positions (normally at the beginning and end of the sequence) and nucleotidic variability of some positions (intrinsic due to natural variability constrained by the secondary structure of the rRNA, or extrinsic due to ambiguous readings during the sequencing). A robust phylogeny of the ribosomal SSU rRNA sequence should be calculated upon complete sequences with no ambiguities. Their alignment should consider their secondary structure (stable stem regions and unstable loop regions), and hypervariable positions of the alignment that only cause distortion in the analysis should be removed. The sequences that are not complete, but need to be present in the final phylogenetic tree, can be inserted parsimoniously into a solid reconstruction of a backbone diagram based on complete sequences (Ludwig and Klenk, 2005).

The effect of hypervariable positions of the alignment on a ribosomal SSU phylogeny can be evaluated by testing different positional filters. Each position is composed of one to four nucleotides in varying proportions. A uniform composition scores 100% for a single nucleotide. An even composition scores 25% for each nucleotide. But if ambiguities, deletions, and insertions are taken into account, the rate of the most repeated nucleotide can be lower than 25%. For each position of the alignment the rate of the most conserved nucleotide can be calculated, and it will vary from 100% to less than 25%. Restrictive filters that select too few positions of the alignment produce meshed phylogenetic trees. Permissive/relaxed filters that select many positions tend to be redundant (Munoz et al., 2014).
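The rate of the most conserved nucleotide per alignment position, and a filter built on it, can be sketched as follows. This is only an illustration under stated assumptions: the function names and the threshold value are hypothetical, gaps and ambiguity codes are counted in the denominator (which is why a rate can fall below 25%), and the filters actually used in this thesis are computed within ARB and are not reproduced here.

```python
from collections import Counter


def conservation_profile(alignment):
    """Per column, the frequency of the most common unambiguous nucleotide.

    Gaps and ambiguity codes stay in the denominator, so the rate of the
    most repeated nucleotide can drop below 25%.
    """
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(seq) != width for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(width):
        column = [seq[col].upper().replace("T", "U") for seq in alignment]
        counts = Counter(column)
        best = max(counts[base] for base in "ACGU")
        profile.append(best / len(alignment))
    return profile


def positional_filter(alignment, min_conservation):
    """Keep only the columns whose top-nucleotide rate reaches the threshold."""
    profile = conservation_profile(alignment)
    kept = [i for i, rate in enumerate(profile) if rate >= min_conservation]
    filtered = ["".join(seq[i] for i in kept) for seq in alignment]
    return kept, filtered
```

With the toy alignment `["ACGU", "ACGA", "ACCU", "AGN-"]` the profile is 100%, 75%, 50% and 50%; a filter at 0.75 keeps the first two columns only, whereas a more permissive filter at 0.5 keeps all four.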

When comparing a curated dataset of sequences, microbiologists maintain a veiled dispute on which phylogenetic algorithm should prevail (Huelsenbeck, 1995; Ludwig and Schleifer, 1994; Tamura et al., 2004). Neighbor-joining and maximum likelihood are the most valued, and the former is the more popular. Neighbor-joining is based on positional identity, which is translated into branch length on the final tree-like diagram. This is very convenient, for it makes phylogenetic tree interpretation easy. Identity is interpreted as yes, it is the same nucleotide, or no, it is not. This approach is simplistic, but it also translates into an agile computation that allows multiple iterations producing hundreds or thousands of trees. Iterations can be fused into a consensus dichotomous diagram where each bifurcation is trusted as much as the percentage of times it was reproduced over the total number of iterations. This is called a bootstrapped tree, and it is visually powerful to the untrained eye. On the other hand, maximum likelihood is based on positional divergence, but instead of considering whether the nucleotides in the same position are different or not, it evaluates the probability of (1) which nucleotide was originally there and (2) how likely the observed nucleotidic change is to occur. The molecular conformation of the ribosomal nucleotides divides them into double-ring purines (adenine and guanine) and single-ring pyrimidines (cytosine and uracil). Substitution of a purine by a purine or a pyrimidine by a pyrimidine is called a transition. Substitution of a purine by a pyrimidine or vice versa is called a transversion. In transversions the conformational re-accommodation of adjacent and complementary nucleotides in the RNA molecule is energetically more costly than in transitions. Maximum likelihood penalizes transversions against transitions.
Therefore, branch lengths in the maximum likelihood diagrams are not directly proportional to sequence divergence but related to mutational rate as a composite of different transition and transversion probabilities. This is visually more complex to understand but more genuine. Trained taxonomists prefer maximum likelihood over neighbor-joining for this reason. The computational cost of maximum likelihood phylogenies is higher compared to neighbor-joining, and much higher for bootstrapped topologies. An alternative to bootstrapping is multifurcation. Some taxonomists prefer to compare multiple tree iterations under their expert eye and represent untrusted branching orders as multifurcations in the final diagram, i.e., more than two branches diverging from the same point. Normally, multifurcations correspond to low bootstrap scoring regions of a bootstrapped tree (Ludwig and Schleifer, 1994). Moreover, they summarize the educated opinion of the taxonomist, which raises concern among bootstrap supporters, who prefer statistical figures (Ludwig and Klenk, 2005).
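The difference between counting identities and weighting the type of change can be illustrated with a small sketch. It is not a maximum likelihood computation: it contrasts the plain proportion of differing positions (p-distance) with the Kimura two-parameter (K2P) distance correction, a technique swapped in here only because it is the simplest formula that treats transitions and transversions separately. The function names are hypothetical, and positions with gaps or ambiguity codes are skipped by assumption.

```python
import math

PURINES = set("AG")
PYRIMIDINES = set("CU")


def classify(a, b):
    """Classify a pair of unambiguous nucleotides."""
    if a == b:
        return "identical"
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def count_substitutions(seq1, seq2):
    """Return (compared positions, transitions, transversions)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    s1 = seq1.upper().replace("T", "U")
    s2 = seq2.upper().replace("T", "U")
    compared = transitions = transversions = 0
    for a, b in zip(s1, s2):
        if a not in "ACGU" or b not in "ACGU":
            continue  # skip gaps and ambiguity codes
        compared += 1
        kind = classify(a, b)
        if kind == "transition":
            transitions += 1
        elif kind == "transversion":
            transversions += 1
    return compared, transitions, transversions


def p_distance(seq1, seq2):
    """Plain proportion of differing positions (identity-based view)."""
    n, ts, tv = count_substitutions(seq1, seq2)
    return (ts + tv) / n


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance: -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q))."""
    n, ts, tv = count_substitutions(seq1, seq2)
    p, q = ts / n, tv / n
    if 1 - 2 * p - q <= 0 or 1 - 2 * q <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

For two ten-nucleotide sequences differing by one transition and one transversion, the p-distance is 0.2 while the corrected distance is slightly larger, which shows why branch lengths under a substitution model are not directly proportional to the observed divergence.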

The theory behind the neighbor-joining and maximum likelihood algorithms causes their topologies to be distorted per se. When comparing sequences of similar taxa, neighbor-joining phylogenies tend to expand whereas maximum likelihood phylogenies tend to collapse. The neighbor-joining approach starts by setting every sequence as equidistantly different, to then bind them stepwise as it finds common nucleotides. Conversely, the maximum likelihood approach sets all sequences as potentially similar and separates them in the most probable way. Polynucleotides are labile and there is no fossil record of intermediate mutations, so plesiomorphic residues (false identities) can distort tree topologies (Ludwig and Schleifer, 1994). Neighbor-joining overestimates divergence when similarities are not redundant in the databases, whereas maximum likelihood underestimates divergence due to the lack of represented intermediate mutational steps. Neighbor-joining phylogenies benefit from a contextual dataset that places the studied taxa among others to avoid overrated distances or branch expansion. In contrast, the collapsing effect of the maximum likelihood algorithm depends on the inner taxonomic diversity represented in the analysis. This can be balanced with taxa-specific contextual datasets that represent many sequence variants (Ludwig and Klenk, 2005; Munoz et al., 2014). Sequences that help in constraining the topology of a phylogenetic tree work as a scaffold around its structure and are removed from the final construction.

Combining methodologies is possible by merging individual phylogenetic trees into a consensus tree to test its confidence (Ludwig et al., 1998). Bootstrap values on a consensus tree would not be significant, since a limited number of trees are normally merged and because merging maximum likelihood trees with neighbor-joining trees prevents the interpretation of branch distances as sequence similarity guides. Instead, branching orders with low confidence can be joined at the same divergent node by manipulating the figure. Besides, internodal vector lengths correlate with branching order significance; poorly resolved branch orders are connected by short internodes. If in doubt about drawing a multifurcation, nodes outside areas of 0.1% and 0.2% divergence rate (calculated from the tree’s scale bar) around their closest nodes are trusted to be confidently placed (Ludwig and Klenk, 2005).
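The idea of joining poorly resolved branching orders at the same node can be sketched programmatically, although in practice it is done by the taxonomist on the figure. The sketch below assumes a tree stored as nested dictionaries with branch lengths in substitutions per site, so that 0.1% and 0.2% divergence correspond to 0.001 and 0.002. The data structure, the function names and the decision to add the collapsed internode to the descendant branches are all assumptions made for illustration.

```python
def leaf(name, length):
    """A terminal node (sequence) with its branch length."""
    return {"name": name, "length": length, "children": []}


def clade(length, *children):
    """An unnamed internal node; `length` is the internode leading to it."""
    return {"name": None, "length": length, "children": list(children)}


def collapse_short_internodes(node, min_length):
    """Return a copy of the tree where internodes shorter than `min_length`
    are dissolved, so their descendants hang from the parent node
    (a multifurcation). Root-to-tip distances are preserved."""
    new_children = []
    for child in node["children"]:
        child = collapse_short_internodes(child, min_length)
        if child["children"] and child["length"] < min_length:
            for grandchild in child["children"]:
                moved = dict(grandchild)
                moved["length"] = grandchild["length"] + child["length"]
                new_children.append(moved)
        else:
            new_children.append(child)
    return {"name": node["name"], "length": node["length"], "children": new_children}


def to_newick(node):
    """Minimal Newick-like rendering with three decimals (no trailing ';')."""
    label = node["name"] or ""
    if node["children"]:
        inner = ",".join(to_newick(c) for c in node["children"])
        label = f"({inner}){label}"
    return f"{label}:{node['length']:.3f}"
```

As a usage example, in a tree `((A:0.05,B:0.04):0.001,(C:0.03,D:0.06):0.02)` a threshold of 0.002 dissolves the internode above A and B, leaving a trifurcation at the root, while the well-supported internode above C and D is kept.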

Amino acidic and genomic comparative analyses

Molecular clocks are biomolecules, polymers of a limited number of subunits, which accumulate sequence changes or mutations over time. These can be nucleotidic or amino acid sequences, but ultimately they are all coded in the chromosome, which is replicated and transferred to offspring cells (vertical transmission). The chromosome is divided into genes, coding regions or loci that can be compared between different species if they are always coded in their chromosomes. Genes can be duplicated in a chromosome, probably to enhance their expression level. But copies of the initial gene can, through mutation, come to code for a different protein with a new function (neofunctionalization). Matching genes between species are homologs. Genes vertically transmitted along a lineage are orthologs, but those that originate from a common ancestor gene that was duplicated are paralogs.

It is important to discern between orthologs and paralogs in sequence analyses. Paralogous sequences cannot be aligned and compared, since the separate mutational drift of the sequences initiates a different phylogenetic path that cannot be represented in the same evolutionary tree (Altenhoff et al., 2016; Chen and Zhang, 2012). On the contrary, orthologous sequences can be aligned and compared since they have followed the same evolutionary path. Sequence identity is the first diagnostic for sequence homology, but it is necessary to check the sequence annotation and number of copies in the chromosome to ensure orthology (Addou et al., 2009). When describing the similarity of true orthologs, the term ‘sequence identity’ normally prevails. To ensure orthology, phylogenies rely on single-copy genes. In addition, aligned orthologs can be concatenated into longer chimeric sequences to produce a phylogeny that combines the evolutionary signal captured in sequences involved in different cell functions. This last technique is an MLSA (De Vos, 2011; Glaeser and Kämpfer, 2015; Zeigler, 2003).
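The concatenation step of an MLSA can be sketched as follows, assuming each gene has already been aligned separately and is stored as a dictionary from taxon name to aligned sequence. The function names are hypothetical, and padding a missing gene with gap characters is an assumption of this sketch, not necessarily the treatment applied in this thesis.

```python
def is_single_copy(copies_per_genome):
    """True when every genome carries exactly one copy of the gene."""
    return all(n == 1 for n in copies_per_genome.values())


def concatenate_alignments(gene_alignments, gap="-"):
    """Join per-gene alignments into one chimeric sequence per taxon.

    gene_alignments: list of dicts {taxon: aligned sequence}, one per gene.
    A taxon lacking a gene receives a run of gap characters of that
    gene's alignment width.
    """
    taxa = sorted(set().union(*[aln.keys() for aln in gene_alignments]))
    parts = {taxon: [] for taxon in taxa}
    for aln in gene_alignments:
        widths = {len(seq) for seq in aln.values()}
        if len(widths) != 1:
            raise ValueError("sequences of one gene must share the same aligned length")
        width = widths.pop()
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, gap * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}
```

For example, concatenating `{"A": "MK", "B": "MR"}` with `{"A": "LLV", "C": "LIV"}` gives five-column sequences for the three taxa, with gaps where a taxon lacks a gene. A gene found in two copies in any genome would be excluded beforehand with `is_single_copy`.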

Genes encoding ribosomal RNA have proven to be good molecular clocks. Furthermore, their sequence identity has been linked to the taxonomy of prokaryotes (Yarza et al., 2014). Although the rRNA operon is known to occur in multiple copies in many chromosomes, in only a few species do the 16S rDNA copies diverge beyond the species identity threshold (Klappenbach et al., 2001). Neofunctionalization of these divergent copies has not been described. They are thought to be expressed under different environmental conditions for molecular stability. The larger gene, 23S rDNA, has not been used phylogenetically because of technical limitations, as explained before. But modern computing power allows a revision of the Bacteroidetes’ 23S rDNA phylogeny and even the phylogeny of the combined 16S and 23S rDNA sequences.

Protein-coding genes as nucleotidic sequences could also be concatenated to rRNA genes in a large MLSA phylogeny for depth of analysis. However, rRNA genes are expressed via transcription, whereas protein-coding genes undergo transcription and translation. In the process of translation, every three nucleotides (a codon) are translated into one amino acid. Most of the 20 amino acids are coded by several codons that differ in their third nucleotide. A mutation in the third position of the codon can therefore frequently be silent in translation, because the codon still codes for the same amino acid. Silent mutations are not significant in the phylogeny of protein-coding genes; rather, they distort their phylogenetic signal. To prevent this, the analysis of amino acid sequences is preferred. Thus, MLSA of coding genes is conducted in this thesis aside from the analysis of the ribosomal sequences (Das et al., 2014).
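The effect of silent third-position mutations can be shown with a short sketch that uses the standard genetic code. The function names and the example sequences are hypothetical and serve only to illustrate why nucleotide differences may vanish at the amino acid level.

```python
# Standard genetic code, built in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a coding sequence (length divisible by 3); '*' marks a stop."""
    dna = dna.upper().replace("U", "T")
    if len(dna) % 3:
        raise ValueError("coding sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def compare_coding(dna1, dna2):
    """Return (nucleotide differences, amino acid differences)."""
    if len(dna1) != len(dna2):
        raise ValueError("sequences must have the same length")
    nt_diff = sum(a != b for a, b in zip(dna1.upper(), dna2.upper()))
    aa_diff = sum(a != b for a, b in zip(translate(dna1), translate(dna2)))
    return nt_diff, aa_diff


def amino_acids_without_third_position_synonyms():
    """Amino acids for which no two codons differ only in the third base."""
    by_prefix = {}
    for codon, aa in CODON_TABLE.items():
        by_prefix.setdefault((codon[:2], aa), []).append(codon)
    with_synonyms = {aa for (_, aa), codons in by_prefix.items() if len(codons) > 1}
    return set(AMINO_ACIDS) - with_synonyms - {"*"}
```

The sequences `GCTAAAGGT` and `GCCAAGGGA` differ in three nucleotides, all at third codon positions, yet both translate to the same tripeptide (Ala-Lys-Gly). Under the standard code only methionine and tryptophan lack a codon that differs solely in the third base.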

Comparative genomics on distant organisms.

The first goal of comparisons of distant genomes (above the taxonomic rank genus) was to find the minimal gene-set of a living organism in order to reconstruct the genome of the last universal common ancestor (LUCA). In 2003, Eugene V. Koonin listed 63 ubiquitous genes with housekeeping functions (essential to cell sustainability) in genomes of Bacteria and Archaea (Koonin, 2003). The set of genes that are shared among a number of genomes is called their core genome. The more evolutionarily distant the genomes are, the fewer genes they share and the higher their ratio of essential genes. Furthermore, Koonin identified phyletic patterns in the distribution of some genes and defined them as ‘the pattern of presence or absence (represented by orthologs) of a gene in different lineages across the species tree’ (Koonin, 2003). Contrary to core genomes, phyletic patterns are not defined by the ubiquity of a gene in a lineage but by its abundance. A year later, R. L. Charlebois and W. F. Doolittle pointed out some problems in the comparison of distant genomes (Charlebois and Doolittle, 2004). After finding that the prokaryotes’ core genome would contain fewer than 50 genes, they argued for relaxing the core genome criteria. They proposed the definition of the prokaryotic core by ‘methods that do not demand ubiquity or assert some arbitrary definition of ubiquity, but do retain the requirement that genes of the core be (1) very common, and (2) distributed as broadly as possible, phylogenetically’ (Charlebois and Doolittle, 2004).

Consequently, Charlebois and Doolittle argued that a phylogenetically balanced core (PBC) is better suited than a strict core genome for comparing distant genomes. They conducted a reciprocal best match (RBM) (Ward and Moreno-Hagelsieb, 2014) comparison of 147 genomes of prokaryotes and explained how the analysis is affected by some parameters. The study concluded that variations in the orthology threshold for the RBM comparison did not affect the PBC significantly. Nevertheless, they observed that false negatives are inherent to the RBM approach, as sequences that should be orthologous (based on their annotation) were classified in different groups. The mistaken classification of some sequences was difficult to correct, since lowering the threshold caused false positives due to spurious matches. The correction of synonymous orthologous groups depended on the accuracy and consistency of genome annotations. Moreover, they explored how the set of common genes changed in cross-phylum analyses by lowering the phyletic pattern cut-off to 90% and 80% of the genomes in each phylum. At 100% they found 34 essential genes, a criterion of 90% added 26, and a lower criterion of 80% added another 11. They were encouraged that the PBC increased by less than 20% in size when the prevalence cut-off dropped from 90% to 80%, because it means that the core tends to stabilize (Charlebois and Doolittle, 2004). The PBC approach overcomes two types of bias: (1) the reduction of the core set due to sequencing and assembly errors as new genome sequences appear, and (2) the disproportionate weighting of popular taxa (Charlebois and Doolittle, 2004). As a further advantage, I would add that a relaxed core criterion also accommodates the deletion of a gene during the evolution of some lineages. In another study where distant genomes were compared, Ikuo Uchiyama explored the relevance of synteny in the definition of the core gene-set, based on the idea that a horizontally transferred gene is unlikely to be inserted in an equivalent chromosome position (Uchiyama, 2008). This criterion helps to correct orthologs mistakenly classified by RBM analysis by adding a synteny test on those groups that could be synonyms.
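The growth of the PBC reported by Charlebois and Doolittle can be checked with simple arithmetic. The short Python sketch below assumes that the quoted gene counts are cumulative, that is, the 26 genes are added to the 34 and the 11 genes to the resulting 60. The variable names are only illustrative.

```python
core_100 = 34   # genes found in 100% of the genomes of each phylum
added_90 = 26   # genes added when the cut-off is relaxed to 90%
added_80 = 11   # genes added when the cut-off is relaxed to 80%

size_90 = core_100 + added_90   # size of the PBC at the 90% cut-off
size_80 = size_90 + added_80    # size of the PBC at the 80% cut-off

# Relative growth of the PBC when moving from the 90% to the 80% cut-off.
growth = added_80 / size_90
print(size_90, size_80, f"{growth:.1%}")  # 60 71 18.3%
```

Under this reading the core grows by about 18%, in line with the statement that the increase stays below 20%.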

The genomic comparison of Bacteroidetes in this work will adopt these strategies since its taxonomic scope is closer to the rank Phylum than to the ranks Genus or Species, for which the literature would support other methods.

Precedent taxonomy and sequenced genomes.

By July 2014, when this work launched, the inventory of validly published names for bacteroidetal taxa comprised 4 classes, 4 equivalent orders, 19 families, 275 genera and 1,149 species. The NCBI RefSeq genomic repository comprised 102 completed genomes and 496 in a permanent draft, of which 57 completed genomes belonged to type strains. Summary tables are provided to be consulted along the reading of the following chapters as a reference of the initial status of this topic (Tables 1 and 2).

Table 1: Taxonomic categories above the rank genus in the phylum Bacteroidetes inventoried in July 2014.
Class | Order | Family
Flavobacteriia | | Cryomorphaceae; Blattabacteriaceae; Schleiferiaceae
Bacteroidia | Bacteroidales | Marinilabiliaceae; Porphyromonadaceae; Prevotellaceae; Rikenellaceae; Prolixibacteraceae; Unclassified Bacteroidales
Sphingobacteriia | | Saprospiraceae; Unclassified Sphingobacteriales
Cytophagia | | Catalimonadaceae; Cytophagaceae; Flammeovirgaceae; Mooreiaceae; Rhodothermaceae; Unclassified Cytophagales

Table 2: Inventory of bacteroidetal genomes in NCBI’s RefSeq repository by July 2014. The total number of prokaryotic genomes in the repository was 49,640.
Categories | Total strains | Type strains
Completed genomes | 102 | 57
Permanent drafts | 496 | 30
Incomplete genomes | 860 | -
Targeted genomes | 62 | -
Total | 1,520 | 87

Objectives

This doctoral research pursues three goals: (1) to provide a sharp taxonomic outline of the Bacteroidetes that would be the ideal road map for a taxonomic comparison of diverse bacteroidetal genomes in (2) the search for the genetic blueprint of the phylum. A second comparison, under an environmental categorization, should (3) find which bacteroidetal genes are selected by high salt concentrations.

Methods

Tree reconstructions based on ribosomal sequences.

The neighbor-joining (NJ) (Saitou and Nei, 1987) and maximum likelihood RAxML (Stamatakis, 2006) algorithms were used for phylogenetic reconstructions in the ARB environment (Ludwig et al., 2004). Silva ready alignments were curated manually paying attention to sequence ends and helix regions. Reconstructions by NJ were corrected with the Jukes–Cantor substitution model (Jukes and Cantor, 1969) for nucleotide sequences and the RAxML reconstructions were corrected with the substitution rate model GTR-GAMMA (Lanave et al., 1984; Yang, 1994) for nucleotide sequences.
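The Jukes–Cantor correction mentioned above converts the observed proportion of differing positions, p, into an estimated number of substitutions per site, d = -(3/4) ln(1 - (4/3) p). The following Python sketch illustrates the formula only. It is not the ARB implementation, and the function name and the handling of gaps (gapped positions are skipped) are assumptions made for the example.

```python
import math

def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor corrected distance between two aligned nucleotide sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Compare only positions where neither sequence has a gap (assumption).
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)

# One mismatch in ten positions: the corrected distance is slightly above 0.1.
print(jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAA"))
```

The corrected distance is always at least as large as the observed proportion of differences, since it accounts for multiple substitutions at the same site.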

Three NJ and three RAxML backbone trees for each rRNA gene were calculated using three maximum frequency conservational filters computed ad hoc, comparing sequences from core and support datasets. Any sequences of poor quality were added to the backbone trees using the parsimony insertion described by Ludwig and Schleifer (Ludwig and Schleifer, 2005) and by applying the same filter used in their reconstruction. Consensus topologies were drawn according to the evaluation of the six backbone trees. Unstable branches occurring in 50% or less of the cases were shown as multifurcations.

Phylogenies based on amino acidic sequences.

Selected sequences of translated orthologous genes were imported to ARB (Ludwig et al., 2004). All the positions of their alignment were included in neighbor-joining and RAxML reconstructions of individual (per orthologous sequences) phylogenetic trees. Substitution rates were corrected with the Kimura model (Kimura, 1980) for NJ topologies and PROT-GAMMA (Lanave et al., 1984; Yang, 1994) model for RAxML. Both trees were merged in a consensus topology using ARB. Unstable branches occurring in 50% or less of the cases were shown as multifurcations.

Phylogenomic trees of concatenated sequences were reconstructed by neighbor-joining with Kimura correction (Kimura, 1980) and 1,000 bootstrap iterations using ARB (Ludwig et al., 2004).

Construction of chimeric sequence concatenates.

For the first MLSA constructions, an ad hoc protocol, now outdated, used three freely available Perl scripts: (1) the script splitfasta.pl (John Nash, https://code.google.com/p/nash-bioinformatics-codelets/downloads/detail?name=splitfasta.pl, November the 27th 2015), which served to split the multifasta files of each individual alignment downloaded from the MBGD database; (2) the script fastaConcat.pl (Naoki Takebayashi, http://raven.iab.alaska.edu/ntakebay/teaching/programming/perl-scripts/perl-scripts.html, December 2nd 2015, no longer available), which was used to concatenate the same-strain fasta files; and (3) the script FastaCat.pl (James Estill, https://code.google.com/p/jperl/source/browse/trunk/scripts/FastaCat.pl, November the 27th 2015, no longer available), which served to create a multifasta file containing the concatenated sequences. Later in the thesis research, as bioinformatic skills were acquired, a more proficient procedure was used to produce the phylogenomic MLSA datasets: standard Linux commands such as grep and cat served to manipulate the data and produce the concatenates.
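The concatenation step can be sketched in a few lines of Python. This is an illustrative equivalent, not the Perl scripts or shell commands actually used. The function names are hypothetical, and padding with gaps for strains missing from a gene is an assumption of the example.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dictionary {strain: sequence}."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()[0]  # first word of the header line
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records

def concatenate_alignments(alignments):
    """Join per-gene alignments strain by strain, in the given gene order.

    Each alignment is a dictionary {strain: aligned sequence}. A strain
    missing from a gene is padded with gaps (assumption of this sketch).
    """
    strains = sorted(set().union(*(aln.keys() for aln in alignments)))
    concatenated = {strain: "" for strain in strains}
    for aln in alignments:
        width = len(next(iter(aln.values())))
        for strain in strains:
            concatenated[strain] += aln.get(strain, "-" * width)
    return concatenated

gene1 = parse_fasta(">A\nMK-L\n>B\nMKRL\n")
gene2 = parse_fasta(">A\nGG\n>B\nGA\n>C\nGT\n")
for strain, seq in concatenate_alignments([gene1, gene2]).items():
    print(f">{strain}\n{seq}")
```

Because each gene is aligned before concatenation, all concatenated sequences end up with the same length.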

Alignments and filters.

SSU rRNA gene sequences were aligned using the SINA aligner (Pruesse et al., 2012). The alignment was manually curated and the Bacteroidetes were searched for signature nucleotides by exporting the alignment and adapting it to a table format. The ready LSU rRNA gene alignment from SILVA was used without further manual curation. Protein sequences were individually aligned using the MUSCLE v3 program implemented in the ARB 6.0 program package (Ludwig et al., 2004). Concatenated gene/protein sequences were aligned prior to their combination.

The ARB package was also used to calculate positional base/amino acid frequency filters for each alignment as proposed by Munoz et al. (2014). Filters were calculated from 70% to 10% positional frequency at 10% intervals, and ARB was set to omit positions where a gap most often occurred. The optimal taxonomic resolution from the phylum to family ranks in individual phylogenetic reconstructions was found at the 50% to 30% frequency interval in nucleotidic alignments, and at the 40% to 20% frequency interval in amino acidic alignments. Setting higher frequencies caused non-resolved branching orders, whereas setting lower frequencies did not add a significant number of positions to the analyses.
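The logic of a positional frequency filter can be sketched as follows. The sketch is one reading of the description above, not the ARB implementation: a column is kept when its most frequent character is a residue rather than a gap and reaches the minimum frequency. The function name is hypothetical.

```python
from collections import Counter

def frequency_filter(alignment, min_freq):
    """Return the column indices kept by a maximum-frequency filter.

    alignment: list of aligned sequences of equal length
    min_freq:  minimum frequency (0 to 1) of the most common character
    """
    keep = []
    n_seqs = len(alignment)
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment)
        symbol, count = counts.most_common(1)[0]
        # Omit columns where a gap is the most frequent character.
        if symbol != "-" and count / n_seqs >= min_freq:
            keep.append(col)
    return keep

alignment = ["ACGT-", "ACGA-", "ACGC-", "A--GA"]
print(frequency_filter(alignment, 0.5))
```

Lowering `min_freq` admits more variable columns, which matches the observation that stricter filters leave too few informative positions to resolve the branching order.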

Indel analyses.

ATPase type F beta-subunit (AtpD) and alanyl-tRNA synthetase (AlaS) gene products from Bacteroidetes were obtained from the MBGD database (genomic sequences) or from GenBank (Sanger sequences). The protein sequences were aligned in ARB (Ludwig et al., 2004) with MUSCLE v3 (Edgar, 2004) and their alignment was manually curated. Insertions and deletions (indels) were searched by exporting the alignments into a table format.

Genomic data curation.

As of June 2016, more than 800 Bacteroidetes genomes were available at NCBI’s RefSeq database. By excluding genomes of Blattabacteriaceae endosymbionts to avoid biases due to genome reduction, and redundant genomes from popular genera, such as Bacteroides, Prevotella, Porphyromonas, Flavobacterium or Hymenobacter, we obtained a phylogenetically more balanced list of 478 complete genomes. The genomes of type strains were retained. For genomes of species from the same genus, we performed average nucleotide identity (ANI) calculations using ANIm as implemented in pyani v0.1.3.9 (Pritchard et al., 2016) to confirm species affiliations. The final dataset comprised 89 high-quality Bacteroidetes genomes, each consisting of fewer than five contigs. We chose 48 Flavobacteriia, 16 Bacteroidia, 15 Cytophagia, 5 Chitinophagia and 5 Sphingobacteriia in order to maintain phylogenetic representativeness. Two members of the Saprospiria were included within the Chitinophagia as classified by Munoz et al. (2016). The sphingobacterial branch of the spp. was not represented because the most complete genome still consisted of seven contigs. Five non-bacteroidetal genomes of the Fibrobacter-Chlorobi-Bacteroidetes (FCB) lineage complemented the database. Genome length (megabases) and G+C %mol content were used as metrics to statistically compare the full dataset of 478 genomes to the reduced subset of 89 genomes using R v3.6.0 (R Development Core Team, 2011).

Comparative genomics.

Using scripts from the enveomics suite (Rodriguez-R and Konstantinidis, 2016), BLAST+ v2.2.28 (Altschul et al., 1990) reciprocal best matches were calculated for the predicted protein sequences encoded by all 89 Bacteroidetes genomes. A minimum of 50% amino acid identity within 50% of the sequence length (Ussery et al., 2009) was used as the threshold for orthology, based on systematic evaluations that showed that core genomes remained stable using 40% to 60% sequence identity-coverage combinations. Thousands of groups of orthologous genes were obtained and the corresponding amino acids were compared at the phylum and class levels as classified in Munoz et al. (2016). The resulting groups of orthologous sequences were divided in three sets. (1) Sequences encoded in all genomes of a taxon (core genome). (2) Sequences exclusively encoded in genomes of a taxon and not in genomes of other taxa (exclusive sequences). (3) Sequences encoded in the majority of genomes of a taxon, but also present in a smaller number of other genomes of other taxa (henceforth referred to as prevalent sequences). The phylogenetic coverage of prevalent sequences in this study ranged from 80% in classes Chitinophagia and Sphingobacteriia (encoded in 4 out of 5 genomes and 1 allowed outlier ortholog), to 96% in the Bacteroidetes phylum (encoded in 86 out of 89 genomes and/or 3 allowed outlier genomes).
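The three sets described above can be expressed as a small classification rule. The Python sketch below is one possible reading of those definitions, not the scripts used in this work. The function name and the parameters `max_missing` and `max_outliers` are hypothetical. The defaults correspond to the class-level example of 4 out of 5 genomes with one allowed outlier.

```python
def classify_group(members, taxon, max_missing=1, max_outliers=1):
    """Classify one group of orthologs with respect to a taxon.

    members: set of genome identifiers that encode the group
    taxon:   set of genome identifiers that belong to the taxon
    Returns a set of labels, since a group can be both core and exclusive.
    """
    inside = members & taxon      # genomes of the taxon encoding the group
    outside = members - taxon     # genomes of other taxa encoding the group
    missing = len(taxon) - len(inside)
    labels = set()
    if inside == taxon:
        labels.add("core")        # encoded in all genomes of the taxon
    if inside and not outside:
        labels.add("exclusive")   # not encoded in any other taxon
    relaxed = missing > 0 or len(outside) > 0
    if inside and relaxed and missing <= max_missing and len(outside) <= max_outliers:
        labels.add("prevalent")   # majority of the taxon, few outliers
    return labels

taxon = {"g1", "g2", "g3", "g4", "g5"}
print(classify_group({"g1", "g2", "g3", "g4", "x1"}, taxon))
```

For the phylum-level comparison, the same rule would be called with `max_missing=3` and `max_outliers=3`, following the 86 out of 89 genomes mentioned above.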

Genomic sequence identity and annotation.

Exclusive and prevalent sequences were aligned with Clustal Omega v1.2.2 (Goujon et al., 2010) using default parameters. From the resulting identity matrices, we calculated the median identity of each sequence with its orthologs and extracted the maximal and minimal identities within every group of orthologous sequences. Median sequence identities (m.s.i.) were compiled in a table that was transformed into a heatmap using plotly v2.6.0 (Plotly Technologies, 2015). Genome sequence annotations were updated by sequence similarity searches against the UniProt (Consortium, 2016) and KEGG (Kanehisa et al., 2019) databases. Sequences in the core genome of Bacteroidetes were searched against the NCBI database using BLAST with default parameters in search of homologies outside the phylum by excluding the Bacteroidetes (organism filter taxid: 976).
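The calculation of median sequence identities from an identity matrix can be sketched as follows. The matrix values and sequence names are invented for the example, and the function names are hypothetical. A real matrix would come from the Clustal Omega output.

```python
from statistics import median

def median_identities(names, matrix):
    """Median identity of each sequence to all other members of its group."""
    result = {}
    for i, name in enumerate(names):
        others = [matrix[i][j] for j in range(len(names)) if j != i]
        result[name] = median(others)
    return result

def identity_range(matrix):
    """Minimal and maximal off-diagonal identity within a group."""
    size = len(matrix)
    values = [matrix[i][j] for i in range(size) for j in range(size) if i != j]
    return min(values), max(values)

# Hypothetical percent identity matrix for a group of three orthologs.
names = ["seq_a", "seq_b", "seq_c"]
matrix = [
    [100, 80, 60],
    [80, 100, 70],
    [60, 70, 100],
]
print(median_identities(names, matrix))
print(identity_range(matrix))
```

The diagonal is excluded because the identity of a sequence with itself carries no information about the group.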

Sequence synteny and synonyms.

For each taxon we selected reference genomes and extracted multiheaded FASTA files for each group of orthologs containing the conserved protein sequence plus eight adjacent sequences (four genes up- and downstream). These files were searched for recurrent gene arrangements against all 89 genomes using MultiGeneBlast v1.1.13 (Medema et al., 2013) for visual identification of syntenies. The sequence homology cut-off was set to 20% identity and 30% coverage allowing the detection of homologies below the reciprocal best match threshold. The maximum distance between homologs was set to 10 genes. When syntenic arrangements extended past the nine query genes, extended multiheaded fasta files were generated and queried against the database for confirmation. Ideally, every gene in a syntenic arrangement could be matched to a single group of orthologous sequences indicating that the groups of orthologs represented adjacent genes. However, genes represented by multiple groups of orthologs were also depicted. These groups of orthologs representing the same gene and syntenic position were synonymous and, furthermore, their annotations also coincided (Charlebois and Doolittle, 2004).
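The selection of the nine query genes (the conserved gene plus four genes on each side) can be illustrated with a short Python sketch. It assumes that the genes of a reference genome are available as an ordered list. The function name is hypothetical, and the sketch simply truncates the window at the ends of the list instead of handling circular chromosomes.

```python
def neighborhood(gene_order, target, flank=4):
    """Return the target gene plus up to `flank` genes on each side.

    gene_order: list of gene identifiers in chromosomal order
    target:     identifier of the conserved gene
    """
    idx = gene_order.index(target)
    start = max(0, idx - flank)   # do not run past the start of the list
    return gene_order[start: idx + flank + 1]

genes = [f"g{i}" for i in range(12)]
print(neighborhood(genes, "g5"))
```

The protein sequences of the returned genes would then be written to one multiheaded FASTA file and used as the query.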

Results

1. Selection and curation of sequence data.

1. 1. Species with standing in nomenclature and associated ribosomal sequences.

1. 1. 1. SSU ribosomal sequences.

The first Approved List of Bacterial Names was published in June 1980, but the monophyly of the CFB branch was not proposed until 1985 (Paster et al., 1985). The first phylogenetic tree of the Bacteroidetes was reconstructed from the rRNA sequences of 24 species (Figure 1). Later, Olsen et al. (1994) reproduced it with 26 CFB species. The first catalog of Bacteroidetes species was published by Garrity et al. in The Prokaryotes Xth Edition (Garrity et al., 2004) with 229 names with standing in nomenclature. The first phylogenetic assessment of the phylum by Krieg et al. (2010) for Bergey’s® Manual of Bacteriology 2010 encompassed the formal description of the phylum and included around 500 species.

Figure 1: First representation of the CFB monophyletic relationship as extracted from Paster et al. 1985. Rights granted by Elsevier Ltd. for display in this doctoral thesis.

New species names are validly published in the IJSEM every month. Most of the new descriptions are published by the same journal and listed two months later in a ‘Notification List’. Species names and descriptions published elsewhere must be reported by the IJSEM in a ‘Validation List’ that is published every two months. New species names, new combinations of names (basonyms or reclassifications) and new names of higher categories (genus, family, etc.) have no standing in nomenclature until their publication in those lists. Neither the IJSEM nor the ICSP maintains a public catalog of names with standing in nomenclature. The best resource for checking names with standing in nomenclature is the database LPSN, where type strains are also reported.

Although LPSN reports the 16S rRNA sequence associated with each name (type strain), other databases are more specialized in cataloguing ribosomal sequences. The database LTPs119 was consulted for this work and its list of Bacteroidetes names and 16S rRNA sequences was compared to the names in LPSN and the sequences in the StrainInfo, GOLD and NCBI databases. In this comparison two names were found to be missing in the LPSN list and were reported to the list curator: Aquimarina amphilecti (Kennedy et al., 2014) and Gramella flava (Liu et al., 2014).

Some names had to be excluded from this work because their reference 16S rRNA gene sequence did not affiliate to Bacteroidetes phylogenetically. These species are outliers; they stay formally classified as Bacteroidetes but they should be either reclassified or re-sequenced to confirm affiliation. Outlier species were often described before the 1990s and were either (1) correctly classified in a contemporary phylogeny but not reclassified ever since, (2) misclassified phylogenetically, or (3) possibly assigned to a ribosomal sequence from a contaminant culture. The curation of the available sequence data on Bacteroidetes detected 11 outliers (Table 3), adding two names to the previous list by Yarza et al. (2013).

Table 3: Type strains classified as Bacteroidetes whose SSU rRNA sequence affiliates to a different phylum. Author and year of description, 16S rRNA sequence accession number/s and phyla (or class) where the sequences affiliate.
type strain | author | acc | affiliation (phylum)
Acetomicrobium faecale DSM 20678 | Winter et al. 1988 | FR749980-83 | Clostridia
Acetomicrobium flavidum DSM 20664 | Soutschek et al. 1985 | FR733692 |
Anaerorhabdus furcosa ATCC 25662 | Shah & Collins 1986 | GU585668 | Firmicutes
Bacteroides coagulans DSM 20705 | Eggerth & Gagnon 1933 | AB547639, HF558387 | Clostridia
Bacteroides galacturonicus DSM 3978 | Jensen & Canale-Parola 1987 | DQ497994 | Clostridia
Bacteroides pectinophilus ATCC 43243 | Jensen & Canale-Parola 1987 | ABVQ01000036 | Clostridia
Bacteroides xylanolyticus X5-1 | Scholten-Koerselman et al. 1988 | HF558386 | Firmicutes
Flavobacterium acidificum ATCC 8366 | Steinhaus 1941 | JX986959 | Gammaproteobacteria
Flavobacterium devorans ATCC 10829 | Bergey et al. 1923 | HF558376 | Alphaproteobacteria
Flavobacterium oceanosedimentum ATCC 31317 | Canter & Litchfield 1978 | GU269547, EF592577 | Actinobacteria
Flavobacterium thermophilum BKM 1325 | Loginova and Egorova 1982 | HF558369 | Firmicutes

Five species were also excluded from our list because there is no ribosomal SSU sequence available for their type material. Prokaryotic species under this circumstance have been called orphan (Yarza et al., 2013). Two of the bacteroidetal orphan species were described before DNA sequencing techniques existed and no type material is conserved to represent them. Therefore, they have never been sequenced. These are Blattabacterium cuenoti (Hollande, 1931) and Toxothrix trichogenes (Beger, 1953). The other three bacteroidetal orphan species cannot be sequenced because their type material has been lost: Acetofilamentum rigidum DSM 20769T (Dietrich et al., 1988), paucivorans DSM 20768T (Dietrich et al., 1988) and Bacteroides polypragmatus GP4T (Patel, 1981). These organisms might have been later re-isolated or sequenced and classified under a new name, but the ICSP does not withdraw the original names.

Last, the status of the species Algoriphagus machipongonensis (Alegado et al., 2013) was doubtful and it was excluded from this work. Its type strain PR1 = ATCC BAA-2233 = DSM 24695 was co-isolated with the choanoflagellate Salpingoeca rosetta (ATCC 50818) and was being sequenced in 2014 under the project accession number AAXU00000000. As of 2019, its genomic sequence was CM001023, but still no 16S rRNA sequence was available from public 16S rRNA sequence repositories, although it could be extracted from its genomic sequence. Strictly, it was not an orphan species at the moment of closing our list of names.

In total, 1,142 names of Bacteroidetes with standing in nomenclature were listed as of May 2014 (Supplementary file 01). Their 16S rRNA reference sequences reported in the databases LPSN, LTP and StrainInfo were compared to find the best sequence, and the names were searched in the GOLD and NCBI genome databases for next-generation genomic sequences. Discrepancies in 16S rRNA Sanger accession numbers for reference sequences were detected in 123 species. The best sequence for each was selected by comparing their length, number of ambiguities and overall sequence quality (Quast et al., 2012) (Supplementary file 02). The same criteria served for choosing between SSU sequences from sequenced chromosomes and previous Sanger sequences (Supplementary file 03), as happened for 12 species. Hence, it was assured that the 1,142 species in the list would be represented by the best 16S rRNA reference sequence available to date.

Figure 2: Diversity of cultured and validly named Bacteroidetes (left) versus the diversity represented by whole genomes of type strains (right) by July 2014.

Taxonomically, the species in the list represented 4 Classes, 4 Orders, 19 Families and 275 genera. The selected out-group for phylogenetic purposes was the phylum Chlorobi by phylogenetic proximity, represented in the SSU dataset by 13 reference sequences. The number of bacteroidetal type strain complete genomes by July 2014 was 57 (Table 2) and their taxonomic representativity was strongly biased towards genomes of the class Bacteroidia, order Bacteroidales, when compared to the classification of bacteroidetal species (Figure 2).

1. 1. 2. LSU ribosomal sequences.

Conventionally, ribosomal phylogenies have been based on SSU sequences. Before the currently affordable computing capacity existed, taxonomists had to forgo the ribosomal LSU sequence and use the SSU instead, which is shorter and, therefore, yields less information. Nowadays the trend in systematics is to use whole-genome sequences for phylogenetic reconstructions, generally overlooking the information captured by the LSU sequence. In this work, LSU sequences will be tested to resolve the phylogeny of the Bacteroidetes.

Table 4: Amount of bacteroidetal taxa represented in sequence datasets selected for this study.
Alignment | Classes | Orders | Families | Genera | Species
SSU | 4 | 4 | 19 | 275 | 1,142
LSU | 4 | 4 | 16 | 78 | 145
MLSA | 4 | 4 | 10 | 43 | 53

Since the SSU sequence databases are broader in number of sequences and species diversity, no LSU sequence was expected to lack a counterpart SSU sequence in the databases. Therefore, names from the full 1,142 species list were checked against the LSURef_r119 database. The search produced a non-redundant list of 153 sequences (Supplementary file 04), of which 145 were Bacteroidetes and 8 were Chlorobi (out-group). The diversity they represented is compiled in Table 4.

1. 2. Non-ribosomal sequences.

Although ribosomal sequences have proved to be excellent molecular clocks, they strictly reconstruct the mutation history captured by ribosomal nucleotidic structures. The cell is also built from other polymers, proteins, which are vital to the survival of species and have varied functionalities (membrane proteins, translation factors, etc.). The phylogenies of these polymers could substantiate a novel narrative on the evolutionary history of the phylum and, ultimately, explain their phenotypes. For the phylogenetic review of the phylum, a multi-locus sequence alignment of core proteins was needed. In a second step, whole genomes were compared to search for protein-coding regions exclusive to the phylum.

1. 2. 1. Dataset for Multilocus Sequence Alignment.

In 2015, in a first exploration of the bacteroidetal core-genome we mined the MBGD database. We selected 60 completed genomes (Supplementary file 05) that affiliated to the phyla Bacteroidetes and Chlorobi keeping a balanced representation of taxonomic high-rank diversity in Bacteroidetes. The dataset included 53 bacteroidetal genomes of diverse taxa (Table 4). Seven chlorobial genomes served as the out-group.

The search for homologous sequences among them used pairwise sequence identities with an e-value < 10⁻³ plus other MBGD default parameters and found 329 shared genes. For each gene, a cluster of homologs gathered all 60 specific variants of the locus. MBGD included a functional classification of the gene clusters. The list of clusters was reduced to 273 by removing clusters that included more than one gene for some genomes (paralogues) and those of unknown classification. This reduced their functional categories to 13. From each category, we selected the largest genes (most phylogenetic information), reducing each category to 10% of its initial size. Hence, we obtained a list of 31 genes representing 13 functional categories. The capacity for taxonomic resolution of each translated gene was tested by individual phylogenetic analyses. Two phylogenies per sequence, one by neighbor-joining and the other by maximum likelihood, were merged into a consensus phylogeny. The phylogenies based on sequences of the translated genes aspS (aspartyl-tRNA synthetase) and ribA (GTP cyclohydrolase II) reproduced overly distorted relationships of the families Saprospiraceae, Odoribacteraceae and Rhodothermaceae with the rest of the Bacteroidetes (Annex A). Based on the concept that sequences with phylogenetic signal should, to some extent, reproduce the known taxonomy, the two sequences were rejected from the final dataset.
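The selection of the largest genes per functional category can be sketched as follows. The rounding rule (round to the nearest integer, keeping at least one gene per category) is an assumption of this example, since the text only states that each category was reduced to 10% of its initial size. The function name and the toy data are hypothetical.

```python
def select_longest(genes_by_category, fraction=0.10):
    """Keep the longest genes of each functional category.

    genes_by_category: {category: {gene name: median length in amino acids}}
    fraction:          share of each category to keep (assumed rounding rule)
    """
    selected = {}
    for category, lengths in genes_by_category.items():
        n_keep = max(1, round(len(lengths) * fraction))
        ranked = sorted(lengths, key=lengths.get, reverse=True)
        selected[category] = ranked[:n_keep]
    return selected

# Toy data: one category with 20 hypothetical genes, another with 3.
toy = {
    "category_1": {f"gene{i}": 100 + 10 * i for i in range(20)},
    "category_2": {"geneA": 300, "geneB": 250, "geneC": 400},
}
print(select_longest(toy))
```

Ranking by length favors the genes that contribute the most alignment positions, which is the rationale given above for choosing the largest genes.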

The 29 selected proteins for the reconstruction of an MLSA based phylogeny (Table 5) represented the initial 13 functional categories with at least one sequence each. The shortest sequence was the translated gene of the transacylase FabD with 294 ±6 amino acids, and the longest was the RNA polymerase β’ subunit (RpoC) with 1,432 ±22 amino acids.

Table 5: List of 29 orthologous coding-genes in 60 bacteroidetal and chlorobial genomes as extracted from the database MBGD. Length of their translated protein, protein name, and functional classification are also given to illustrate the information incorporated to the MLSA analysis.
Gene name | median length (aa) | Protein full name | Function
aroC | 358±18.3 | chorismate synthase | amino acid transport and metabolism
glyA | 426±6.6 | serine hydroxymethyltransferase | amino acid transport and metabolism
guaA | 510±13.5 | GMP synthase | nucleotide transport and metabolism
pyrG | 540±14.4 | CTP synthetase | nucleotide transport and metabolism
fabD | 293.5±6.2 | malonyl CoA-ACP transacylase | lipid transport and metabolism
folC | 428.5±17.3 | folylpolyglutamate synthase | coenzyme transport and metabolism
glmM | 462±9.9 | phosphoglucosamine mutase | carbohydrate transport and metabolism
pgk | 397±12.1 | Phosphoglycerate kinase | carbohydrate transport and metabolism
dnaK | 637±7.7 | chaperone Hsp70, with co-chaperone DnaJ | posttranslational modifications
atoC | 421±20.1 | two-component response regulator in acetoacetate metabolism | posttranslational modifications
ppiD | 704±9.8 | peptidylprolyl isomerase D | posttranslational modifications
dnaE | 1,262.5±143.7 | DNA polymerase III alpha subunit | replication, recombination, repair
polA | 943.5±21 | DNA polymerase I | replication, recombination, repair
uvrA | 947±15.9 | excinuclease ABC subunit A | replication, recombination, repair
rpoC | 1,432±22.4 | RNA polymerase, beta prime subunit | transcription
alaS | 876±17.4 | alanyl-tRNA synthetase | translation and ribosomal structure
argS | 594±25.6 | arginyl-tRNA synthetase | translation and ribosomal structure
fusA | 705±14.7 | elongation factor G | translation and ribosomal structure
ileS | 1,134±31.8 | isoleucyl-tRNA synthetase | translation and ribosomal structure
infB | 982±63.5 | translation initiation factor IF-2 | translation and ribosomal structure
pnp | 729.5±20.8 | polynucleotide phosphorylase | translation and ribosomal structure
rpsA | 600±47.4 | 30S ribosomal protein S1 | translation and ribosomal structure
thrS | 648±6.6 | threonyl-tRNA synthetase | translation and ribosomal structure
valS | 877±14.9 | valyl-tRNA synthetase | translation and ribosomal structure
secA | 1,114±32.5 | preprotein translocase subunit SecA | intracellular trafficking
ftsI | 700±28.1 | peptidoglycan synthetase FtsI | cell wall/membrane biosynthesis
alr | 817.5±213.1 | alanine racemase | signal transduction
ftsK | 827.5±48.6 | DNA translocase at septal ring sorting daughter chromosomes | cell cycle control
lolC | 412.5±36.1 | lipoprotein releasing system transmembrane protein |

1. 2. 2. Accessory ATP synthase beta subunit (AtpD) and alanyl-tRNA synthetase (AlaS) sequences.

We found two previous phylogenetic assessments of the phylum by non-ribosomal phylogenetic markers. One is a phylogenetic reconstruction based on the ATP synthase beta subunit by Amann et al. in 1988, and the second is the description of signature insertions and deletions in the sequence of the alanyl-tRNA synthetase by Gupta and Lorenzini in 2007. Sequences of the translated genes atpD and alaS were downloaded from the MBGD database when possible. AlaS translated genes were found in all genomes, but AtpD translated genes were not found in some type strain genomes of the class Bacteroidia. Thus, we included non-type strain sequences from diverse bacteroidial species recruited from NCBI’s RefSeq database to compensate for the missing taxa in the AtpD dataset.

1. 3. Sequenced genomes for comparative genomics.

By 2016 more than 800 bacteroidetal genomes were deposited at the NCBI RefSeq database, but their taxonomic bias remained even though the GEBA had already improved the taxonomic diversity of available bacteroidetal genomic data. In the planning of the comparative analysis of bacteroidetal genomes, we first needed to know how many genomes we would be able to compare with the computational power at hand. By testing comparisons of increasing numbers of genomes and observing the increase in response time, a sample size of around 100 genomes, including an out-group sample, was established. Such a reduction of data (approximately eightfold) had to be thoughtfully selective. The factors considered for data selection are listed below.

Figure 3: Example of genome synonymy despite different names, detected by ANIm analysis. Here, genomes of the genus Elizabethkingia are confronted. Red tones indicate ANIm values above the species threshold of 94% similarity. Three genomes named after E. meningoseptica clustered within the E. anophelis cluster.

Redundancy: Research on the phylum Bacteroidetes is prolific on certain genera of special interest, either pathogens or of ecological relevance, e.g., Bacteroides, Prevotella, Porphyromonas, and Flavobacterium. This produces the mentioned taxonomic bias in the full list of sequenced genomes. Their representation was reduced by removing multiple strains of the same species, yet allowing a balanced redundancy to mirror the redundancy observed in less popular genera. This filter alone reduced the dataset to 478 genomes (Supplementary file 06).

Type strains: They were preferred since they are the type material of their species. Frequently, their genomes and physiology are thoroughly studied. They are normally best represented in the literature, included in interdisciplinary research, and their genome annotation is more elaborate.

Endosymbionts: The endosymbiotic lifestyle produces a severe genomic reduction in bacteria after millions of years of co-evolution with their hosts. Metabolic routes and cell processes are lost, limiting the size of their core genome. This loss of information was not desired for the search of common traits of the phylum, which is not typically endosymbiotic.

Diversity: We desired to represent as many bacteroidetal lineages as possible. Also, it was important to respect the relative abundance of high taxa. For instance, nearly half the bacteroidetal species are classified as members of the family Flavobacteriaceae. If any genomic trait was to be found to differentiate subgroups of Flavobacteriaceae, they must be richly represented in the database. Hence, the database was preferred to be phylogenetically balanced (Annex B) instead of taxonomically balanced. At the class rank level, this translated into the selection of 48 Flavobacteriia, 16 Bacteroidia, 15 Cytophagia, 5 Chitinophagia and 5 Sphingobacteriia.

Figure 4: Boxplot metrics distribution of genomic length (left) and G+C %mol content (right) in the non-redundant genomes dataset of 478 genomes (top) against the reduced dataset of 89 genomes plus 5 out-group genomes (bottom). The boxplot distributes values in quartiles. The box represents the second and third quartiles split by a horizontal line representing the median. Vertical lines are the first and fourth quartiles, whereas dots are atypical values.

Synonyms: Two genomes can belong to the same species despite being annotated under different species names. Errors in affiliation can result from poor handling of data and methods, but they are also likely when databases grow fast and curated databases struggle to keep up to date. Hence, synonymous genomes can be annotated with different names if, for instance, one of the sequences was not meant to describe a new species but to describe new physiological properties, and a correct genus affiliation was considered sufficient. Also, if the correct species name was not yet accessible in curated databases, a new genome could be named as a new strain of a closely related species. We prevented synonyms in our dataset by pairwise ANIm comparisons of genomes within the same genus (Figure 3) when many strains had been sequenced (47 intra-genus analyses plus a comparison of B. fragilis strains). Overall, ANIm comparisons of pre-selected genomes helped to optimize the final dataset (e.g., Annex C).
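The synonym screening described above can be sketched in a few lines. This is a minimal illustration, assuming that ANIm values were computed beforehand with an external tool and stored in a dictionary keyed by genome pairs. The data structure, the function name and the example values are hypothetical; only the 94% species threshold is taken from this chapter.

```python
from itertools import combinations

ANIM_SPECIES_THRESHOLD = 94.0  # percent; species-level ANIm value quoted in this chapter


def find_putative_synonyms(anim, threshold=ANIM_SPECIES_THRESHOLD):
    """Return genome pairs that carry different species names but reach the threshold.

    `anim` maps frozenset({genome_a, genome_b}) to an ANIm value (%).
    Each genome is a (species_name, strain) tuple. Both are assumed structures.
    """
    genomes = sorted({genome for pair in anim for genome in pair})
    flagged = []
    for a, b in combinations(genomes, 2):
        value = anim.get(frozenset((a, b)))
        if value is None:
            continue  # this pair was never compared
        if a[0] != b[0] and value >= threshold:
            flagged.append((a, b, value))
    return flagged


# Invented values, for illustration only.
example = {
    frozenset({("Genus alpha", "S1"), ("Genus beta", "S2")}): 97.8,
    frozenset({("Genus alpha", "S1"), ("Genus gamma", "S3")}): 82.4,
    frozenset({("Genus beta", "S2"), ("Genus gamma", "S3")}): 81.9,
}
synonyms = find_putative_synonyms(example)
```

In this invented example only the first pair is flagged, so one of its two genomes would be dropped from the dataset.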

Completeness: Genome sequencing projects follow a bottom-up strategy in which reads are linked together by their overlapping segments. Depending on the sequencing depth achieved, longer or shorter fragments can be assembled into larger segments called contigs. Sometimes re-sequencing is needed to fill the gaps between contigs. Many genomes are deposited as assemblies of several contigs. Missing information in their gaps could impede the detection of loci of interest in a comparative analysis. Therefore, complete genomes were preferred.

Table 6: Summary of the length and nucleotide composition of the non-redundant database of 478 genomes versus the reduced set of 89 bacteroidetal genomes. (a) Microscilla marina ATCC 23134T, (b) endosymbiont of Llaveia axin axin, of the family Blattabacteriaceae, (c) Chitinophaga pinensis DSM 2588T, (d) Riemerella anatipestifer ATCC 11845T, (e) DSM 13606T, (f) Blattabacterium sp. (endosymbiont of Cryptocercus punctulatus Cpu), (g) Hymenobacter sp. APR13, and (h) Polaribacter sp. Hel I 88.

Statistic | Size (Mb), 478 genomes | Size (Mb), 89 genomes | G+C %mol, 478 genomes | G+C %mol, 89 genomes
Maximum | 9.77 (a) | 9.13 (c) | 62.1 (e) | 61.9 (g)
Minimum | 0.31 (b) | 2.16 (d) | 23.8 (f) | 30.0 (h)
Median ± SD | 3.96 ± 1.40 | 4.14 ± 1.38 | 39.4 ± 6.8 | 38.1 ± 7.1

Length and nucleotide composition: Ranges of genome length and guanine plus cytosine content (G+C %mol) are genotypic traits often reported in the description of high taxa. When a high taxon is represented by a low number of species, a (narrower) range of genome length and G+C %mol content can reveal a biased selection. Since the initial list of over 800 bacteroidetal genomes was already biased, we calculated the length and G+C %mol ranges of taxa from the less redundant and taxonomically balanced database of 478 genomes. The data were represented in boxplots (Figure 4), and the analysis was then repeated with the selected genomes. This comparison served to prevent avoidable bias and to make us aware of unavoidable bias. Table 6 summarizes the initial and final composition of the two databases.
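The comparison of the two databases reduces to computing G+C content per genome and the quartile summary that a boxplot draws. The following is a minimal sketch using only the Python standard library. The function names and the example values are illustrative assumptions, and the quartile convention ("inclusive") is one of several possible choices, not necessarily the one behind Figure 4.

```python
from statistics import quantiles


def gc_percent(sequence):
    """G+C content (mol %) of a nucleotide string, ignoring symbols other than A, C, G, T."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def boxplot_summary(values):
    """Minimum, quartile cut points and maximum of a list of numbers."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": q2, "q3": q3, "max": max(values)}


# Invented genome lengths (Mb), for illustration only.
full_set = [2.0, 3.0, 4.0, 5.0, 6.0]
reduced_set = [3.0, 4.0, 5.0]
full_summary = boxplot_summary(full_set)
reduced_summary = boxplot_summary(reduced_set)
```

Comparing `full_summary` with `reduced_summary` shows at a glance whether the reduced selection narrows the range or shifts the median.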

Positive control: The comparative genome analysis was computationally costly and occupied computational resources for an extended time (months). To verify that the analysis was successful, we introduced two pairs of closely related genomes that should share multiple orthologous loci with each other and not with the rest of the genomes. The large family Flavobacteriaceae was suitable for introducing the positive controls because the relatedness of two genomes could be compensated by other distant genomes in the same taxon. In particular, we were interested in comparing strains with 16S rDNA sequence identity above the species threshold (98.7%) but ANIm similarity below or just at the species threshold (94%) (Arahal, 2014; Kim et al., 2014; Konstantinidis and Tiedje, 2005). We selected Flavobacterium psychrophilum Z2 versus F. psychrophilum JIP02/86, which shared 99.5% SSU rDNA identity and 83% ANIm similarity, and Myroides odoratimimus PR63039 versus M. profundi D25, with 99.2% SSU rDNA identity and 94% ANIm similarity.
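The selection criterion for the positive controls can be written as a simple predicate. The sketch below uses the two thresholds and the two pairs quoted in the paragraph above. The function name and the dictionary layout are illustrative assumptions.

```python
SSU_SPECIES_THRESHOLD = 98.7   # % 16S rDNA identity, as quoted above
ANIM_SPECIES_THRESHOLD = 94.0  # % ANIm similarity, as quoted above


def is_positive_control(ssu_identity, anim):
    """True when 16S identity is above the species threshold while ANIm is at or below it."""
    return ssu_identity > SSU_SPECIES_THRESHOLD and anim <= ANIM_SPECIES_THRESHOLD


# (SSU rDNA identity %, ANIm %) for the two pairs reported in the text.
pairs = {
    "F. psychrophilum Z2 / F. psychrophilum JIP02/86": (99.5, 83.0),
    "M. odoratimimus PR63039 / M. profundi D25": (99.2, 94.0),
}
selected = [name for name, (ssu, ani) in pairs.items() if is_positive_control(ssu, ani)]
```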

Overall, we picked 89 genomes of Bacteroidetes plus five out-group genomes (Supplementary file 07) representing the full dataset of 478 genomes in terms of median genome length (3.96 ± 1.4 Mbp) and G+C %mol content (39.4 ± 6.8%). Minimum values were higher in the reduced dataset due to the exclusion of Blattabacteriaceae (Table 6). The mean genome length in the Chitinophagia increased by 2 Mbp, but its G+C %mol content remained representative. The reason is that the selected Chitinophagia-Saprospiria genomes with the fewest contigs were also the largest in this taxon (Table 6, Figure 4). The balanced representation of species along the full phylogenetic tree of the phylum (Annex B), and the coherent metrics explained above, indicated that the reduced dataset was as diverse and representative as the repositories allowed.

2. Analysis of bacteroidetal 16S rRNA sequences.

Bacteroidetal 16S rRNA sequence analyses in this thesis merged neighbor-joining and maximum likelihood phylogenies constrained by scaffolding sequence sets specific to each algorithm. The neighbor-joining support dataset of 757 prokaryotic sequences proposed in Munoz et al. (2014) was incorporated into the neighbor-joining analysis. For the maximum likelihood analysis, an ad hoc set of sequences was selected from the SilvaRef SSU database following Ludwig and Klenk’s guidelines (Ludwig and Klenk, 2005), namely, sequences from non-type Bacteroidetes of at least 1,450 nucleotides, with no ambiguities, starting no later than position 40 and finishing no earlier than position 1,480 of the Escherichia coli alignment by Brosius et al. (1978). In addition, sequences with a SILVA quality score lower than 96% were filtered out. Continuing with Ludwig and Klenk’s protocol, a maximum parsimony tree was computed with the selected sequences. Abnormally long branches identified problematic sequences that could unbalance the final sequence analysis. The resulting maximum likelihood scaffold (Supplementary file 08) consisted of 228 bacteroidetal sequences of high confidence.
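The selection criteria for the maximum likelihood scaffold can be expressed as a record filter. The following is a minimal sketch, assuming that the sequence metadata were exported into simple records. The `SsuRecord` fields and the example accessions are hypothetical, while the numeric cut-offs are those stated in the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class SsuRecord:
    accession: str        # hypothetical identifier
    length: int           # nucleotides
    ambiguities: int      # number of ambiguous bases
    start: int            # first covered position of the E. coli alignment
    stop: int             # last covered position of the E. coli alignment
    quality: float        # SILVA quality score (%)
    is_type_strain: bool


def passes_scaffold_filter(rec):
    """Apply the length, ambiguity, coverage and quality criteria listed in the text."""
    return (
        not rec.is_type_strain
        and rec.length >= 1450
        and rec.ambiguities == 0
        and rec.start <= 40
        and rec.stop >= 1480
        and rec.quality >= 96.0
    )


# Invented records, for illustration only.
records = [
    SsuRecord("EX000001", 1490, 0, 12, 1500, 98.2, False),
    SsuRecord("EX000002", 1400, 0, 12, 1500, 98.2, False),  # too short
    SsuRecord("EX000003", 1490, 0, 55, 1500, 98.2, False),  # starts too late
    SsuRecord("EX000004", 1490, 0, 12, 1500, 95.0, False),  # quality too low
]
kept = [rec.accession for rec in records if passes_scaffold_filter(rec)]
```

The subsequent step in the protocol, the removal of sequences on abnormally long branches of a parsimony tree, requires tree inspection and is not covered by this sketch.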

2.1. Building of the proposed bacteroidetal 16S rRNA phylogenetic tree.

The phylogenetic tree of 1,142 Bacteroidetes plus 13 out-group Chlorobi presented here is the product of a robust data analysis that merges six already stable topologies: three neighbor-joining trees and three maximum likelihood trees. After their curated alignment, the 1,155 SSU sequences were first divided into a set of 831 high-quality sequences (Supplementary file 09), with length >1,400 nucleotides, no ambiguities, and a sequence alignment quality >90% (Quast et al., 2012), and a set of 324 SSU sequences of “poor quality” (Supplementary file 10). The latter could cause unstable branches in the phylogenetic tree.

With the 831 high-quality sequences, positional filters at 30%, 40% and 50% were applied to test topologies, producing three trees per algorithm. The numbers of positions selected by each filter are compiled in Table 7. The computation was assisted by scaffolding datasets for neighbor-joining or maximum likelihood, which were removed after completion of each phylogeny.

Table 7. The alignments used in each topology are named after the sequences they contain. Their total length in positions and the aligner (in brackets) are displayed next. The following numbers indicate how many positions of each alignment were selected by each filter used in the final topologies. The last column contains the number of positions selected by the filter for the domain Bacteria in the LTP database. Filter columns are given by % monomer frequency.

Alignment | Length (aligner) | 50% | 40% | 30% | 20% bacteria filter
16S rDNA | 43,594 (SINA) | 1,324 | 1,418 | 1,456 | 1,532
23S rDNA | 130,508 (SINA) | 2,580 | 2,739 | 2,842 | 3,164
SSU-LSU rDNA | 5,081 (SINA) | 3,848 | 4,179 | 4,352 | -
MLSA | 28,818 (Muscle) | 16,636 | 18,784 | 20,646 | -
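A positional filter of this kind keeps only the alignment columns in which the most frequent monomer reaches a minimum frequency. The sketch below is a simplified illustration and not the filter implementation used in this work: it assumes that gaps count towards the total but cannot themselves be the conserved monomer, and the function names and the toy alignment are invented.

```python
from collections import Counter


def conserved_columns(alignment, min_frequency):
    """Indices of columns whose most frequent monomer reaches min_frequency (0 to 1)."""
    width = len(alignment[0])
    if any(len(seq) != width for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    total = len(alignment)
    keep = []
    for col in range(width):
        counts = Counter(seq[col] for seq in alignment if seq[col] not in "-.")
        top = counts.most_common(1)[0][1] if counts else 0
        if top / total >= min_frequency:
            keep.append(col)
    return keep


def apply_filter(alignment, columns):
    """Reduce every sequence to the selected columns."""
    return ["".join(seq[c] for c in columns) for seq in alignment]


# Toy alignment, for illustration only.
aln = ["ACGU-", "ACGA-", "ACCCU", "AUCGU"]
loose = conserved_columns(aln, 0.50)   # keeps columns 0, 1, 2 and 4
strict = conserved_columns(aln, 0.75)  # keeps columns 0 and 1
filtered = apply_filter(aln, strict)
```

As in Table 7, a stricter threshold retains fewer positions.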

The six trees were compared against each other and against former bacteroidetal phylogenies. Moreover, earlier trees produced for data curation (e.g., testing positional filters or selecting scaffold sequences for maximum likelihood calculations) helped to identify expected branching distributions. Conclusions were graphically illustrated by modifying the neighbor-joining tree computed over positions with more than 40% conservation, since it needed the fewest alterations. Untrustworthy bifurcations were suppressed by placing their diverging node at the same level as the preceding node, causing a multifurcation. Into the final tree, the remaining 324 sequences of poorer quality were inserted using the parsimony tool implemented in ARB (Ludwig et al., 2004).
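Suppressing an untrustworthy bifurcation amounts to removing the internal node and attaching its children directly to the preceding node. The sketch below illustrates the operation on a tree stored as nested dictionaries. This representation, the `trusted` flag and the example tree are assumptions made for illustration, since in this work the trees were edited in ARB.

```python
def collapse_untrusted(node):
    """Return a copy of the tree in which untrusted internal nodes are dissolved.

    A node is a dict with a "name", an optional "trusted" flag (default True)
    and an optional list of "children". The input tree is left unchanged.
    """
    new_children = []
    for child in node.get("children", []):
        child = collapse_untrusted(child)
        is_internal = bool(child.get("children"))
        if is_internal and not child.get("trusted", True):
            # Lift the grandchildren one level up, creating a multifurcation.
            new_children.extend(child["children"])
        else:
            new_children.append(child)
    result = {key: value for key, value in node.items() if key != "children"}
    if new_children:
        result["children"] = new_children
    return result


# Invented topology: node X is not supported by all six trees.
tree = {
    "name": "root",
    "children": [
        {"name": "A"},
        {"name": "X", "trusted": False, "children": [
            {"name": "B"},
            {"name": "Y", "trusted": True, "children": [{"name": "C"}, {"name": "D"}]},
        ]},
    ],
}
collapsed = collapse_untrusted(tree)
child_names = [child["name"] for child in collapsed["children"]]
```

After collapsing, the root holds three branches (A, B and the trusted clade Y) instead of the original bifurcation.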

2.2. Updated 16S rRNA phylogeny of the Bacteroidetes.

The consensual phylogenetic tree based on the ribosomal SSU sequences of the Bacteroidetes was out-grouped with representatives of the sister phylum Chlorobi, placed at its bottom (Figure 5). From the intersection of both phyla, the root of the tree connects it to other bacteria. In known trees of prokaryotic life (Munoz et al., 2011; Parks et al., 2018), the branch of the Chlorobi diverges deeper than that of the Bacteroidetes. Hence, in the description of the updated phylogeny of the Bacteroidetes, lineages will be described upwards from the Chlorobi, as if following their presumed appearance in evolution.

Figure 5. Consensus topology of the 16S rRNA based phylogeny. Scale bar indicates mutation/substitution rate. Bold letters are new taxa. Grey branches represent parsimonious insertions. Numbers are identity values, to be compared with the taxonomic thresholds summarized in the top-left corner.

Close to the root of the tree, the Bacteroidetes diverged into three branches originating at the same point: the Rhodothermales ord. nov. (formerly classified as Cytophagia), the Balneolales ord. nov. (formerly classified as Chitinophagia) and the rest of the known Bacteroidetes. The ribosomal SSU sequence could not resolve the relative branching order between Rhodothermales and Balneolales, but they clearly diverged from the other Bacteroidetes at a deeper position in the phylogenetic tree. The third branch of known Bacteroidetes diverged into five lineages with short internodes, suggesting an adaptive radiation of their ancestral lineage.

The deepest lineage, which also had the shortest internode, corresponded to the class Cytophagia. It in turn radiated into seven lineages with unresolved branching order. The location of the type genera for the family rank identified two of the branches as the families Cytophagaceae (genus Cytophaga) and Cyclobacteriaceae (genus Cyclobacterium). The other five branches belonged to (1) the species Luteivirga sdotyamensis, (2) a short branch radiating into the families Catalimonadaceae, Flammeovirgaceae and Persicobacteraceae fam. nov., (3) three species of the genus Flexibacter including F. litoralis, (4) a branch diverging into the Thermonemataceae fam. nov. and the Mooreiaceae (Mooreia alkaloidigena), and (5) the family Hymenobacteraceae fam. nov.

The second lineage corresponded to the Chitinophagales ord. nov., comprising the families Chitinophagaceae and Saprospiraceae. The third lineage was the family Sphingobacteriaceae, the sole family in the order Sphingobacteriales and class Sphingobacteriia. The fourth was the lineage of the Bacteroidales, branching into eight families including Odoribacteraceae fam. nov. The last was the Flavobacteriales, with three families (Blattabacteriaceae was not represented): the Flavobacteriaceae, Cryomorphaceae (engulfing the family Schleiferiaceae) and Crocinitomicaceae fam. nov.

2.2.1 Coherence of taxonomic clades.

This consensual topology called for a new organization of the taxonomy, which was assessed by applying the minimum 16S rRNA sequence identity thresholds described by Yarza et al. (2014) for taxonomic ranks higher than genus (Figure 5). The species Luteivirga sdotyamensis (78.7% with respect to the closest Cytophagales) and all the bacteroidetal orders except the Sphingobacteriales (83.9%) scored below or within the threshold range for the rank phylum (74.95-79.9%). So did the Rhodothermales and Balneolales taken together, with a combined sequence identity of 77.9%. Since the threshold ranges overlap, Luteivirga sdotyamensis also falls into the identity range for the class rank (78.55-82.5%), and the Sphingobacteriales fit the range of the order rank (82.25-84.8%).

The families with minimum sequence identity within the family rank interval (86.6-88.4%) were Crocinitomicaceae fam. nov., Marinilabiliaceae, Prolixibacteraceae, Persicobacteraceae fam. nov., Salinibacteraceae fam. nov. and Rubricoccaceae fam. nov. Other families scored within the interval for the rank order: Cryomorphaceae, Bacteroidaceae, Odoribacteraceae fam. nov., , Rikenellaceae, Chitinophagaceae, Cyclobacteriaceae, Hymenobacteraceae fam. nov., the Flexibacter litoralis clade, Catalimonadaceae, Balneolaceae and Rhodothermaceae. Others scored within the interval for the rank class: Prevotellaceae, Porphyromonadaceae, Saprospiraceae, the Thermonemataceae fam. nov.–Mooreiaceae clade, and Flammeovirgaceae.
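Because the threshold ranges overlap, a single minimum identity value can be compatible with more than one rank. The following minimal sketch looks up the ranks whose range contains a given value, using the ranges quoted above from Yarza et al. (2014). The function name is an illustrative assumption.

```python
# Minimum 16S rRNA identity ranges (%) per rank, as quoted in the text.
RANK_RANGES = {
    "phylum": (74.95, 79.9),
    "class": (78.55, 82.5),
    "order": (82.25, 84.8),
    "family": (86.6, 88.4),
}


def ranks_matching(min_identity):
    """Ranks whose threshold range contains the minimum identity of a clade."""
    return [rank for rank, (low, high) in RANK_RANGES.items() if low <= min_identity <= high]


luteivirga = ranks_matching(78.7)         # phylum and class ranges overlap here
sphingobacteriales = ranks_matching(83.9)
rhodo_balneo = ranks_matching(77.9)
```

Values falling between two ranges, such as 85.5%, match no range and return an empty list.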

This classification suggested that the distance between taxonomic ranks in the Bacteroidetes had been underestimated. Clustering the 1,141 bacteroidetal ribosomal SSU sequences into operational taxonomic units (OTUs) by applying the high-rank taxonomic thresholds with uclust (Edgar, 2010) suggested that the phylum Bacteroidetes should be divided into 4 phyla, 7 classes, 20 orders and 59 families. Nevertheless, the internal species composition of the OTUs did not agree with their phylogeny. Thus, the OTU approach substantiates a more complex taxonomic structure of the phylum, but the simpler phylogenetic approach is preferred as long as a more complex taxonomic structure is not needed to allocate new taxa.

Overall, the phylogenetic tree showed a good categorization of genera. Among species, six pairs shared a ribosomal SSU sequence identity of at least 99.8% (Table 8). Four of the six putative synonym species pairs were published within less than 12 months of each other. One of the pairs was published in independent studies in the same issue of the International Journal of Systematic and Evolutionary Microbiology (Volume 63, Issue 9): Flavobacterium tructae and F. spartansii shared 99.9% 16S rRNA gene identity, and their descriptions differed in the abundance of one fatty acid and in their gliding ability. Both Flavobacterium type strains were isolated from the entrails of fish species of the genus Oncorhynchus. As of February 2019, they are still classified as different species.

Table 8. Similar 16S rRNA sequences and the species they represent, with author, year of publication, and pairwise sequence identity percentage. The paired species of the genera Muricauda and Flavobacterium were published within weeks of each other and did not include each other in their descriptions. The remaining species with similar 16S rRNA sequences remain different species after their review, or after the description of a close species, by the authors in the “Sequence 2” column.

Sequence 1 | Sequence 2 | Identity
FN421478, Fontibacter flavus (Kämpfer et al., 2010) | JQ348962, Fontibacter ferrireducens (Zhang et al., 2013) | 99.93%
JN166984, Muricauda antarctica (Wu et al., 2013) | JF714702, Muricauda taeanensis (Kim et al., 2013) | 99.93%
AB198089, Dokdonia diaphoros (Khan et al., 2006, rev. by Yoon et al., 2012) | AB681255, Dokdonia eikasta (Khan et al., 2006, rev. by Yoon et al., 2012) | 99.93%
HE612100, Flavobacterium tructae (Zamora et al., 2014) | JX287799, Flavobacterium spartansii (Loch & Faisal, 2014) | 99.93%
X81877, Hallella seregens (Moore & Moore, 1994, rev. Willem & Collins, 1995) | CP003369, Prevotella dentalis (Haapasalo, 1986, rev. Willem & Collins, 1995) | 99.86%
HM448033, antarcticus (Farfán et al., 2014, comb. nov.) | AJ438174, Pedobacter piscium (Farfán et al., 2014, later synonym) | 99.80%

2.3. Signature nucleotides in the 16S rRNA sequence

The sequence alignment that produced the 16S rRNA based phylogeny confirmed the signature nucleotides reported by Woese in 1987 for the Flavobacteriia and Bacteroidia. It also reported the most frequent nucleotides in the same positions for the clades Sphingobacteriia, Chitinophagia, Cytophagia, Rhodothermaeota phyl. nov. and Chlorobi.

Table 9. Update of the signature nucleotides in the 16S rDNA sequence of the Bacteroidetes. From left to right: positions of the 16S rRNA gene alignment (Van de Peer et al., 1996) where C. Woese found signature nucleotides for the Bacteroides-Flavobacteria clade; nucleotide normally found in the rest of Bacteria at each position; signature nucleotide for Bacteroides-Flavobacteria reported by C. Woese in 1987; nucleotide at each position in the studied database, distributed in modern taxonomic groups. Small letters indicate the second most frequent nucleotide. When both nucleotides are written in small letters they are co-abundant. (1) One species shows a different nucleotide, (*) position 1,532 of the alignment is poorly represented in databases. “Flav.” stands for Flavobacteriia and “Bactd.” for Bacteroidia. Bold letters highlight the positions that characterize the Rhodothermaeota apart from the Bacteroidetes and were found in this research. The columns Flavobacteriia to Cytophagia give the composition of the Bacteroidetes.

Position | Consensus in Bacteria | Bacteroidetes signature (Woese, 1987) | Flavobacteriia | Bacteroidia | Sphingobacteriia | Chitinophagia | Cytophagia | Rhodothermaeota | Chlorobi
570 | G | U | U | U | U | U | U(1) | G | G
812 | G | c (Flav.) | C | G | C | C | G | G | G
975 | A | G (this study) | G | G | G | G | G | Ag | A
995 | C | A | A | A | A | A | Au | A | au
1,198 | G | Ag (Flav.); A (Bactd.) | Ag | A | G | G | G | C | C
1,202 | U | G (Bactd.) | U | G | U | U | U | U | Ug
1,224 | U | c (Flav.); g (Bactd.) | cu | gu | U | U | gu | U | Cu
1,225 | A | A (this study) | A | A | A | A | A | G | A
1,234 | C | U (Bactd.) | Ca | U | C | C | C | C | A
1,410 | A | G | G | Ga | G | Ga | G | Gc | G
1,532* | U | Au | Au | Au | Au | Au | Au | U | U

Furthermore, two positions of the alignment were identified as distinctive but had not been reported before: a guanine in position 975 and a guanine in position 1,225 of the E. coli alignment identify Bacteroidetes and Rhodothermaeota, respectively (Table 9). This finding is encouraging for the development of new bacteroidetal and rhodothermaeotal probes to detect them in environmental samples. Other sequence particularities and signature nucleotides for lower ranks were identified in the Chlorobi-Rhodothermaeota-Bacteroidetes clade, as summarized in Figure 6.
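The search for signature nucleotides can be illustrated with a small sketch that profiles one alignment column per clade and reports the columns in which a clade is invariant and differs from a comparison group. This is a deliberately strict simplification, since Table 9 reports most frequent nucleotides rather than invariant ones. The function names and the toy sequences are invented for illustration.

```python
from collections import Counter


def column_profile(sequences, position):
    """Most frequent and second most frequent nucleotide at a 0-based alignment column."""
    counts = Counter(seq[position] for seq in sequences if seq[position] in "ACGU")
    return [base for base, _ in counts.most_common(2)]


def signature_positions(ingroup, outgroup):
    """Columns where the ingroup shows one nucleotide that never occurs in the outgroup."""
    hits = []
    for pos in range(len(ingroup[0])):
        in_bases = {seq[pos] for seq in ingroup}
        out_bases = {seq[pos] for seq in outgroup}
        if len(in_bases) == 1 and in_bases <= set("ACGU") and in_bases.isdisjoint(out_bases):
            hits.append((pos, next(iter(in_bases))))
    return hits


# Toy aligned fragments, for illustration only.
ingroup = ["GAUC", "GAUU", "GACC"]
outgroup = ["AAUC", "AAGC"]
signatures = signature_positions(ingroup, outgroup)
profile_last = column_profile(ingroup, 3)
```

In the toy data only the first column qualifies as a signature, a G in the ingroup against an A in the outgroup.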

Figure 6. Signature 16S rRNA nucleotides of the Rhodothermaeota. In contrast to the Bacteroidetes, the Rhodothermaeota and the Chlorobi ‘sensu stricto’ showed a nucleotide inversion at complementary positions 370 and 391 (helix 16) of the E. coli alignment (Van de Peer et al., 1996). The Balneolaceae exhibited a deletion at position 993. Two species of the Salinibacteraceae (Salinibacter iranicus and S. luteus) exhibited two substitutions of complementary G-C by A-U at positions 577-764 and 1,050-1,208 (helices 23 and 38, respectively). In addition, all Salinibacteraceae had a cytosine in position 957 instead of a uracil, and a uracil in position 1,109 instead of a cytosine. The Chlorobi differed from the consensus SSU rRNA sequence in Bacteria at the indicated 14 positions, but these were not relevant to the taxonomic revision of this study. (*) Poorly represented positions in the studied database.

The recently classified S. iranicus and S. luteus (Makhdoumi-Kakhki et al. 2012) exhibited a 16S rRNA gene sequence identity with the type species of the genus, S. ruber, of 93.3%, which is well below the genus threshold calculated for this category in Yarza et al. (2014). We were also able to identify distinctive nucleotide signatures in their SSU alignment at helix 23 (Figure 6). These results, together with the phenotypic distinctiveness published in the classification and the ANI and AAI values of <80% and <71%, respectively (Viver et al., 2018), allowed us to consider these two species as a new genus within the family, for which we propose the name Salinivenus gen. nov.

2.4. Inconsistencies with the taxonomy in use.

The careful inspection of every branch in the consensual 16S rRNA gene phylogenetic tree at the genus rank revealed a series of species with controversial taxonomic affiliation and genera of questionable monophyly.

Within the family Flavobacteriaceae the genera Algibacter, Flaviramulus, Gaetbulibacter and Bizionia could not be circumscribed monophyletically (Figure 7). Every genus could be reclassified into at least two genera, or up to five in the case of Algibacter. Also, the type genus Flavobacterium was paraphyletic because (1) four species affiliated next to genus Myroides and (2) F. mizutaii affiliates to , class Sphingobacteriia.

Figure 7. Phylogenetic affiliations of Algibacter spp. (squares), Flaviramulus spp. (5-pointed star), Gaetbulibacter spp. (6-pointed star), Bizionia spp. (8-pointed star), and Flavobacterium (bottom diagram). Branches and names in grey indicate phylogenetic positions estimated by parsimonious insertion of aligned sequences upon the final tree topology. Scale bar indicates nucleotidic substitution rate.

Within the Bacteroidia, the genus Porphyromonas of the Porphyromonadaceae showed an unstable structure, represented as a multifurcation, indicating its possible future reclassification into different genera (Figure 8). For example, the single-species genus Falsiporphyromonas was consistently affiliated within Porphyromonas, substantiating the polyphyly of the genus Porphyromonas. In the family Odoribacteraceae, the two genera could be divided into three, since Odoribacter denticanis affiliated aside from the Odoribacter ‘sensu stricto’ clade and away from Butyricimonas, in an independent branch. In the Porphyromonadaceae, the single-species genus Macelibacteroides affiliated within the genus Parabacteroides, sharing 99.7% 16S rRNA sequence identity with P. chartae as reported by Sakamoto (2014). Their branch connects to other Parabacteroides laterally and, judging from its length, this possibly means that P. chartae could be reclassified as a Macelibacteroides species. Last, the single-species genus Hallella of the Prevotellaceae affiliated with Prevotella dentalis and shared 99.9% sequence identity (Table 8). Hallella seregens should be reclassified as a Prevotella species to preserve the monophyly of Prevotella.

Figure 8: Affiliations of Falsiporphyromonas spp., Odoribacter spp., Macelibacteroides spp., and Hallella spp. Branches and names in grey indicate phylogenetic positions estimated by parsimonious insertion of aligned sequences upon the final tree topology. Scale bar indicates nucleotidic substitution rate.

Within the class Sphingobacteriia, Pseudosphingobacterium domesticum affiliated inside the genus in the family Sphingobacteriaceae. In the same family, genus Pedobacter was paraphyletic as reported by Lambiase (2014). Species diverging from the Pedobacter ‘sensu stricto’ clade could be reclassified into four new genera; (1) P. tournemirensis, (2) P. bauzanensis, (3) P. lentus, P. terricola, P. daechungensis and P. arcticus, and (4) P. glucosidilyticus (Figure 9).

Figure 9. Affiliations of Pseudosphingobacterium spp. and Pedobacter spp. Branches and names in grey indicate phylogenetic positions estimated by parsimonious insertion of aligned sequences upon the final tree topology. Scale bar indicates nucleotidic substitution rate.

Two species of the class Cytophagia were outliers affiliating to other classes; Cytophaga xylanolytica to the family Marinilabiliaceae of the Bacteroidia, and Flexibacter aurantiacus to the Flavobacteriia (Figure 10). In addition, the rest of the genus Flexibacter was paraphyletic. The type species F. flexilis and species F. ruber and F. elegans affiliated independently to a branch of the Cytophagaceae shared with the genera Ohtaekwangia, Chryseolinea, Microscilla and Rhodocytophaga. Outside the family Cytophagaceae, F. polymorphus, F. litoralis and F. roseolus emerged in deep branches at the family rank level (Figure 11).

Figure 10: Affiliation of the Cytophagaceae outliers C. xylanolytica and Flexibacter aurantiacus. Branches and names in grey indicate phylogenetic positions estimated by parsimonious insertion of aligned sequences upon the final tree topology. Scale bar indicates nucleotidic substitution rate.

Figure 11: Affiliation of the Flexibacter species. Branches and names in grey indicate phylogenetic positions estimated by parsimonious insertion of aligned sequences upon the final tree topology. Scale bar indicates nucleotidic substitution rate.

3. Analyses of auxiliary molecular clocks.

3.1. The phylogeny of the bacteroidetal 23S rDNA.

The ribosomal LSU rDNA dataset of 145 bacteroidetal type strain sequences ranged in length from 2,125 nucleotides (Gillisia limnaea DSM 15749) to 2,961 nucleotides (Rhodothermus marinus DSM 4252). Their average length was 2,788 nucleotides and their median length was 2,833 ± 164 nucleotides. For the backbone structure of the phylogenetic tree, 130 bacteroidetal reference sequences plus 8 out-group sequences with sequence and alignment quality indexes of at least 85% were selected (Supplementary file 11). Only 5 sequences measured less than 2,300 nucleotides, but all of them were of high quality (Quast et al., 2012). Of the 15 unselected sequences, 13 displayed an alignment quality below 85% and 2 a sequence quality below 85% (Table 10).

Table 10: Subset of 23S rDNA sequences of poor quality inserted by parsimony in the backbone phylogenetic structures. The sequence and alignment quality were obtained from the Silva database (Quast et al., 2012).

Name with standing in nomenclature | Accession number | Sequence length (nucleotides) | Sequence quality (%) | Alignment quality (%)
Fibrella aestuarina BUZ 2T | HE796683 | 2,846 | 90 | 84
Flavobacterium columnare ATCC 23463T | EF554787 | 2,503 | 83 | 92
Flexibacter flexilis ATCC 23079T | M62806 | 2,817 | 77 | 100
Flexibacter litoralis DSM 6794T | CP003345 | 2,885 | 93 | 84
Gracilimonas tropica DSM 19535T | AQXG01000022 | 2,890 | 91 | 80
Prevotella bergensis DSM 17361T | ACKS01000101 | 2,816 | 92 | 84
Prevotella buccae ATCC 33574T | AEPD01000024 | 2,511 | 91 | 84
Prevotella multisaccharivorax DSM 17128T | AFJE01000009 | 2,942 | 95 | 82
Prevotella oralis ATCC 33269T | AEPE02000001 | 2,952 | 93 | 81
Salinibacter ruber DSM 13855T | CP000159 | 2,927 | 96 | 79
Salisaeta longa DSM 21114T | ATTH01000001 | 2,918 | 98 | 78
Saprospira grandis str. LewinT | CP002831 | 2,811 | 94 | 83
Spirosoma luteum DSM 19990T | ARFC01000061 | 2,145 | 92 | 83
Spirosoma panaciterrae DSM 21099T | ARFA01000070 | 2,586 | 91 | 83
Spirosoma spitsbergense DSM 19989T | ARFD01000199 | 2,816 | 90 | 84

Like the phylogenetic tree built upon 16S rDNA sequences, the ribosomal LSU tree compiles six trees produced by combining two methodologies and three positional filters, this time using 138 reference sequences. In addition, neighbor-joining phylogenies were supported by 200 diverse prokaryotic sequences from the SilvaRef LSU database longer than 2,300 nucleotides and with quality parameters above 94% (Supplementary file 12). RAxML phylogenies were supported by 143 bacteroidetal sequences from the same database that were longer than 2,200 nucleotides and had quality parameters above 97% (Supplementary file 13). After the removal of the supporting sequences, the six trees were combined in a consensual topology, and the 15 sequences not included in the backbone topology were inserted by maximum parsimony (Figure 12).

Figure 12: Consensual phylogeny of the Bacteroidetes as inferred by the alignment of the 23S rRNA gene. Grey branches and names are parsimonious insertions. Scale bar indicates mutational/substitution rate.

The consensual topology of the 23S rDNA based tree divided the Bacteroidetes into 4 major clades. From the base of the tree, the first clade grouped the Balneolales and the Rhodothermales. The second clade grouped the Chitinophagales. Relatively long internodes separate these first two lineages. However, the third and fourth clades cluster within a very short internodal space. The third clade groups the Sphingobacteriales and Cytophagales. The fourth contains the Bacteroidales and the Flavobacteriales.

In the Rhodothermales, the families Balneolaceae, Salinibacteraceae and Rhodothermaceae were monophyletically circumscribed, which did not oppose the classification of the Balneolales ord. nov. The same happened with the families Chitinophagaceae and Saprospiraceae in the Chitinophagales. The Sphingobacteriales and the Cytophagales were also circumscribed monophyletically, but whereas the Sphingobacteriales only contained the family Sphingobacteriaceae, the Cytophagales were represented by five branches. The branching order of the Cytophagales families was not resolved in agreement with the SSU rRNA phylogeny, partly because of the few LSU sequences available. Furthermore, the monophyly of the Cytophagaceae could not be delineated. In the fourth deep lineage, the origin of the family Rikenellaceae could not be discerned from the divergent node of the Bacteroidales and Flavobacteriales; hence, it was placed in a multifurcation with these two orders. At the deepest node of the independent Bacteroidales branch, the families Marinilabiliaceae and Odoribacteraceae multifurcated with a third branch that contained the families Porphyromonadaceae, Prevotellaceae and Bacteroidaceae. Porphyromonadaceae was paraphyletic, whereas Prevotellaceae and Bacteroidaceae sequences reproduced a monophyletic branch where their genera were combined. In the Flavobacteriales branch, the family Flavobacteriaceae was monophyletic, whereas the Cryomorphaceae genera were paraphyletic, supporting the description of Crocinitomicaceae fam. nov.

3.2. Phylogeny of prokaryotic SSU and LSU ribosomal sequences combined.

Ribosomal SSU and LSU bacteroidetal phylogenies reproduced deep branching orders of low confidence, i.e., with very short internodes. Given the current species richness of the phylum in the databases and the sophistication of their alignments, this is unlikely to be an effect of poor data. Instead, either a poor resolution power of the ribosomal sequences alone or the effects of adaptive radiation events were suspected. Nevertheless, the order Rhodothermales affiliated in a remarkably independent branch in both SSU and LSU phylogenies. Furthermore, the minimum internal SSU sequence identity of the Rhodothermales suggested that they are an independent phylum.

Figure 13: Consensus topology of a bacterial tree based on the concatenation of the SSU and LSU ribosomal sequences. Scale bar indicates substitution rate, interpreted as divergence rate to delineate confidence areas: white circles denote 0.02 divergence areas, grey circles denote 0.01 divergence areas.

To test these two hypotheses, a phylogeny based on concatenated SSU and LSU sequences (for higher resolution power) with members of other phyla (for context at the domain level) was built.

Aligned sequences of 85 prokaryotic species, 23 of them from Bacteroidetes, contextualized the affiliation of the phylum in the ribosomal tree of bacterial life (Figure 13). The dataset was strongly reduced in species number but balanced in the representation of phyla to minimize undesired branch attraction-repulsion effects. The six phylogenetic analyses with two different methods and three positional filters showed a stable monophyletic origin of the incertae sedis groups of Balneolaceae and Rhodothermaceae and an adaptive radiation process that led to the divergence of five classes of Bacteroidetes.

Following Ludwig’s recommendations (Ludwig et al., 1998), we drew areas of 1% and 2% nucleotide divergence rate around the root nodes of bacteroidetal classes/orders and the out-group Chlorobi. The only bacteroidetal order that was confidently independent from any other branch was that of the Rhodothermales (Order II. Incertae sedis). This, together with the evidence listed before and more to be explained, led us to classify them as the Rhodothermaeota phyl. nov. with two phylogenetic lineages: the class Rhodothermia class. nov., order Rhodothermales, and the class Balneolia class. nov., order Balneolales ord. nov. The remaining five lineages, which from now on we consider Bacteroidetes, overlapped in the confidence areas around their roots, thus substantiating their divergence by adaptive radiation (Figure 13).

3.3. Individual phylogenies of 29 orthologous gene products.

Like other phylogenies in this research, the analyses of 29 orthologous coding genes of varied functionalities in the Bacteroidetes merged 6 phylogenetic trees (combining 2 computation methods with 3 positional filters) into a consensual phylogeny. This produced 174 phylogenetic trees that will not be shown, but the 29 consensual topologies are available in the Annex. By circumscribing the known high taxa (families, orders and classes) we classified the 29 phylogenetic trees into three categories:

1. Trees that reproduced the classification of higher taxa and families: trees based on the sequences of proteins involved in post-translational modifications (Ppid), replication (PolA), transcription (RpoC and AlaS), translation and ribosomal structure (IleS, InfB, Pnp and RpsA).
2. Trees that reproduced ribosomal affiliation at the family rank but not higher ranks: included phylogenies of proteins involved in nucleotide metabolism (PyrG), carbohydrate metabolism (GlmM), post-translational modifications (DnaK and AtoC), translation (FusA), DNA repair (UvrA), intracellular trafficking (SecA), cell wall biosynthesis (FtsI), signal transduction (Alr) and cell cycle control (FtsK and LolC).
3. Trees that did not reproduce higher taxa inferred from ribosomal phylogenies: included the phylogenies of the majority of sequences involved in biomolecule transport and metabolism (AroC, FabD, FolC, GlyA, GuaA and Pgk), and some involved in replication (DnaE) and synthesis of amino acids (ArgS, ThrS and ValS).

Each of these proteins was encoded by a single-copy gene in each genome.

As the available number of genomes increased and we engaged in a more proficient identification of bacteroidetal orthologous genes, we became aware that horizontal transfer events of the sequences reproducing topologies in categories 2 and 3 would be arguable.

3.4. Multilocus Sequence Analysis of 29 single copy gene products.

The concatenation of 29 single-copy amino acid sequences produced a chimeric sequence of 20,857 ± 350 residues aligned in 28,818 homologous positions. The species diversity represented comprised 53 Bacteroidetes and 7 Chlorobi, limited by the availability of fully sequenced genomes of type strains at the time.
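The concatenation step can be illustrated with a minimal Python sketch. It assumes that each gene alignment is available as a dictionary mapping a genome name to its aligned sequence. The function name `concatenate_alignments`, the toy sequences and the choice of padding a missing gene with gaps are hypothetical and only serve the illustration; they are not the procedure used in this study.

```python
def concatenate_alignments(alignments):
    """Join per-gene alignments (genome -> aligned sequence) into one supermatrix."""
    genomes = sorted({g for aln in alignments for g in aln})
    concatenated = {g: [] for g in genomes}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment must share a length")
        width = lengths.pop()
        for g in genomes:
            # Assumption: a genome lacking this gene is padded with gaps.
            concatenated[g].append(aln.get(g, "-" * width))
    return {g: "".join(parts) for g, parts in concatenated.items()}


# Toy data, purely illustrative.
gene_a = {"genome1": "MK-L", "genome2": "MKAL"}
gene_b = {"genome1": "GT", "genome2": "G-"}
supermatrix = concatenate_alignments([gene_a, gene_b])
```

Every genome ends up with a sequence of the same length, so that the columns of the concatenated alignment remain homologous positions.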

Their consensual phylogeny demonstrated an overall good resolution power, with a dichotomic distribution of high taxa circumscribed by phylogenies of ribosomal markers (Figure 14). The deepest branch was the Rhodothermaeota, emerging from a node closer to Chlorobi than to Bacteroidetes. Next, the Chitinophagales reunited the Chitinophagaceae and Saprospiraceae in a monophyletic branch. From the next bifurcation, the Sphingobacteriales and Cytophagales diverged from the Bacteroidales and Flavobacteriales. All orders and families affiliated monophyletically, except for the Cytophagaceae and the Porphyromonadaceae, which were paraphyletic due to the deeper affiliation of F. litoralis and Paludibacter propionicigenes, respectively. Although the number of aligned positions selected for the MLSA was one order of magnitude higher than in ribosomal sequence analyses (Table 7), the MLSA-based phylogeny also drew short internodal spaces between the five bacteroidetal orders (Figure 14).

Figure 14: Consensus phylogeny as inferred from the multilocus sequence alignment of 29 varied translated genes of the Bacteroidetes and sister phylum Chlorobi. Family names in bold characters, two of them faded in grey to indicate paraphyletism. Scale bar indicates the estimated sequence divergence rate.

3.5. Phylogeny of the F-type ATP synthase beta subunit.

Some level of bacteroidetal signature sequence variation of the F-type ATP synthase beta subunit gene (atpD) has been described in the past (Amann et al. 1988). This signal could help in delimiting bacteroidetal taxa. Nevertheless, when compiling its orthologs from bacteroidetal genomes it was found that some genomes did not encode the sequence. Instead, genomes of Alistipes finegoldii DSM 17242T, Porphyromonas asaccharolytica DSM 20707T and P. gingivalis ATCC 33277T encoded a vacuolar-type ATPase. Furthermore, in the genomes of ATCC 25285T, B. helcogenes P 36-108T, B. thetaiotaomicron VPI-5482T, B. vulgatus ATCC 8482T, Odoribacter splanchnicus DSM 20712T and Tannerella forsythia ATCC 43037T, forms of F-type ATPases coexisted with vacuolar-type ATPases. Hence, the specific diversity represented in the phylogenetic analysis had to be reduced, but the representation of the Rhodothermaeota was complemented including non-type strains of the genera Balneola, Gracilimonas, Rhodothermus, Salinibacter and Salisaeta (Figure 15).

Figure 15: Consensual tree topology reconstructed from the alignment of AtpD protein sequences. Scale bar indicates estimated sequence divergence. In this phylogeny the Rhodothermales are paraphyletic due to a short internode. Also, the Bacteroidales are divergent from other Bacteroidetes.

The length of the translated gene (AtpD) varied from 441 amino acids in Leadbetterella byssophila DSM 17132T to 518 amino acids in Owenweeksia hongkongensis DSM 17368T. Their average length was 497 amino acids and their median 502 ± 16 amino acids. The alignment of 56 bacteroidetal AtpD sequences against seven out-group Chlorobi occupied 541 positions.

The phylogeny of the AtpD sequences (Figure 15) drew the Rhodothermaeota deeply branched in two independent lineages, the Balneolales and the Rhodothermales. The rest of the groups were divided into two clades: (i) the Bacteroidales, and (ii) the other orders multifurcating from the same node. In the order Bacteroidales, Odoribacter splanchnicus was paraphyletic to the Porphyromonadaceae, which was a new reason to classify the Odoribacteraceae fam. nov. In the next clade, four lineages of mixed taxonomy multifurcated. (1 & 2) The families Cytophagaceae and Cyclobacteriaceae of the Cytophagales rooted at the multifurcation. (3) The Chitinophagales and Sphingobacteriales formed a multifurcated joint branch where the affiliation of P. saltans impeded the monophyly of the Sphingobacteriales. (4) Last, two sequences of the Cytophagales affiliated near the origin of the Flavobacteriales.

3.6. AtpD and AlaS indel prints

Figure 16: Fragment of the AtpD protein alignment where significant insertions in the sequences of the Rhodothermaeota (grey) and Bacteroidetes (top) can be seen, taking the chlorobial sequence (bottom) as the reference with two visible gaps in the alignment.

Figure 17: Summarized alignment of the AlaS protein sequence of the phyla Bacteroidetes (top), Rhodothermaeota (grey) and Chlorobi (bottom), compiling all insertions described in the manuscript.

The alignment of the F-type ATP synthase β subunit (AtpD) sequences (Figure 16) confirmed the presence of two insertions in most of the Bacteroidetes described by Amann et al. (1988). A large insertion appeared in our alignment between positions 231–259 and a short one between positions 293–298. The large insertion was present in Bacteroidetes and Rhodothermaeota, and absent in Chlorobi. The small insert of Bacteroidetes was absent in the Rhodothermaeota, Odoribacter splanchnicus DSM 20712T and in Chlorobi.

The alignment of alanyl-tRNA synthetase (AlaS) amino acid sequences (Figure 17) included sequences from 53 bacteroidetal type strains and 7 Chlorobi type strains as the out-group. Sequences ranged from 846 amino acids in Leadbetterella byssophila DSM 17132T to 966 amino acids in S. ruber DSM 13855T. Their average length was 880 amino acids with a median of 876 ± 17 amino acids. Their alignment expanded to 1,046 positions.

Considering the chlorobial sequences as the reference, two characteristic insertions in the Rhodothermaeota could be identified: (i) 4 amino acids between positions 314–317, and (ii) 44–46 amino acids between positions 458 and 503. All Bacteroidetes and Rhodothermaeota showed an 8 aa insertion between positions 169–177, a 4–6 aa insertion between positions 228–232 and a deletion of positions 682–697. This last deletion corresponded to the insertion in Chlorobi previously described by Gupta (2004). There was a deletion of positions 559–567 in Bacteroidetes.
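The way insertions are read against a reference sequence can be sketched in a few lines of Python. The sketch assumes two rows of the same alignment, with `-` as the gap character, and reports the alignment columns where the query has residues and the reference has gaps. The function name and the toy sequences are hypothetical and do not come from the AtpD or AlaS alignments.

```python
def insertions_relative_to(reference, query):
    """Return 1-based (start, end) alignment columns where `query` has
    residues and `reference` has gaps, i.e. insertions in the query."""
    if len(reference) != len(query):
        raise ValueError("aligned sequences must have equal length")
    runs, start = [], None
    for col, (r, q) in enumerate(zip(reference, query), start=1):
        inserted = r == "-" and q != "-"
        if inserted and start is None:
            start = col                     # a new insertion opens here
        elif not inserted and start is not None:
            runs.append((start, col - 1))   # the insertion closed one column back
            start = None
    if start is not None:
        runs.append((start, len(reference)))
    return runs


# Toy alignment rows, purely illustrative.
chlorobi_row = "MKT----LAG--V"
bacteroidetes_row = "MKTAQLELAGSPV"
inserts = insertions_relative_to(chlorobi_row, bacteroidetes_row)
```

Swapping the two arguments reports the complementary case, that is, positions deleted in the query relative to the reference.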

Current taxonomy of the Bacteroidetes.

Three years after the publication of the revised phylogeny of the phylum in this research, fourteen of the sixteen new names that were proposed remain in use. Fifteen of them are represented in Table 11 highlighted in bold characters (the sixteenth is genus Salinivenus). The name Balneolaceae was published one month before our phylogeny (Xia et al., 2016), and the name Odoribacteraceae was substituted by Marinifilaceae in an emended description published one month later (Ormerod et al., 2016), since it engulfed the genera Butyricimonas and Odoribacter. The genus name is previous to Odoribacter.

Table 11: Taxonomic changes in the phylum Bacteroidetes from 2010 to March 2020 in reference to our classification published in 2016.

Precedent Reclassifications in this study (bold) Current taxonomic route Class Order Family Phylum Class Order Family Phylum Class Order Family Flavobacteriia Flavobacteriales Flavobacteriaceae Bacteroidetes Flavobacteriia Flavobacteriales Flavobacteriaceae Bacteroidetes Flavobacteriia Flavobacteriales Flavobacteriaceae Cryomorphaceae Cryomorphaceae Cryomorphaceae Blattabacteriaceae Blattabacteriaceae Blattabacteriaceae Schleiferiaceae Schleiferiaceae Schleiferiaceae Crocinitomicaceae Crocinitomicaceae Ichtyobacteriaceae Bacteroidia Bacteroidales Bacteroidaceae Bacteroidia Bacteroidales Bacteroidaceae Bacteroidia Bacteroidales Bacteroidaceae Marinilabiliaceae Balneicellaceae Porphyromonadaceae Muribaculaceae Porphyromonadaceae Porphyromonadaceae Prevotellaceae Prevotellaceae Rikenellaceae Rikenellaceae Williamwhitmaniaceae Prevotellaceae Marinilabiliaceae Marinilabiliales Marinilabiliaceae Rikenellaceae Odoribacteraceae Marinifilaceae Prolixibacteraceae Prolixibacteraceae Prolixibacteraceae Unclassified Bacteroidales Salinivirgaceae Sphingobacteriia Sphingobacteriales Sphingobacteriaceae Sphingobacteriia Sphingobacteriales Sphingobacteriaceae Sphingobacteriia Sphingobacteriales Sphingobacteriaceae Filobacteriaceae Chitinophagaceae
Chitinophagia Chitinophagales Chitinophagaceae Chitinophagia Chitinophagales Chitinophagaceae Saprospiraceae Saprospiraceae Saprospiria Saprospirales Saprospiraceae Unclassified Sphingobacteriales Haliscomenobacteraceae Lewinellaceae Cytophagia Cytophagales Catalimonadaceae Cytophagia Cytophagales Cytophagaceae Cytophagia Cytophagales Cytophagaceae Cyclobacteriaceae Cyclobacteriaceae Cyclobacteriaceae Cytophagaceae Catalimonadaceae Catalimonadaceae Flammeovirgaceae Flammeovirgaceae Flammeovirgaceae Mooreiaceae Mooreiaceae Mooreiaceae Hymenobacteraceae Hymenobacteraceae Persicobacteraceae Persicobacteraceae Thermonemaceae Thermonemataceae Bernadetiaceae Microscillaceae Raineyaceae Rhodothermaceae Rhodothermaeota Rhodothermia Rhodothermales Rhodothermaceae Rhodothermaeota Rhodothermia Rhodothermales Rhodothermaceae Unclassified Cytophagales Rubricoccaceae Rubricoccaceae Salinibacteraceae Salinibacteraceae Salisaetaceae Balneolia Balneolales Balneolaceae Balneolaeota Balneolia Balneolales Balneolaceae Soortiaceae

4. Phylogenomic trends in the protein pool of Bacteroidetes.

The pairwise comparison of coding loci in the 89 selected genomes of Bacteroidetes and 5 out-group chlorobial species produced an unsorted table of similar sequences at 50% similarity in at least 50% of the sequence. The new bacteroidetal taxonomic and phylogenetic outline described in chapters 2 and 3 served to arrange the genomes accordingly (Supplementary file 14). The arrangement unveiled phylogenetic gene clusters that might encode common phenotypes of the Bacteroidetes and its subgroups. This chapter deals with the search for and description of these clusters.

Comparing translated sequences was preferred for the same reason explained in chapter 3: to reduce the potential sequence similarity bias caused by the degeneracy of the genetic code. The largest genome in the analysis, Chitinophaga pinensis DSM 2588T, was 9.13 megabases (Mb: million nucleic bases) long and coded for 7,063 predicted proteins, and the shortest, Riemerella anatipestifer ATCC 11845T, was 2.16 Mb long and coded for 1,977 predicted proteins. However, the genome containing the most predicted proteins was koreensis GR20-10, with 7,107 in 9.03 Mb, and the one with the fewest predicted proteins was Porphyromonas asaccharolytica DSM 20707T, with 1,649 in 2.19 Mb. Although the median protein content of the tested genomes was 3,461 ± 1,037 predicted proteins, the maximum set of shared coding genes among Bacteroidetes was limited by P. asaccharolytica to 1,649 genes.

4.1. Reciprocal Best Match (RBM) analysis.

Pairwise comparison of a total of 331,912 translated genes at the 50:50 similarity threshold returned 31,265 groups of orthologous sequences (Supplementary file 14). The input consisted of the 94 translated genomes in multifasta format and the output was a plain tabbed file of 94 columns, one per genome, and 31,266 rows, one per orthologous group plus the header with genome names. Columns and rows were the coordinates to explore data. The file could be opened with a spreadsheet editor, each cell containing the sequence header of a gene (or various genes) clustering into an orthologous group. It was observed that top lines gathered more fasta headers than bottom lines, thus organizing orthologous clusters from richer to poorer.
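The logic of a reciprocal best match at the 50:50 threshold can be summarized with a minimal Python sketch. It is not the software used in this study: the function name, the tuple layout of the hits and the use of a score to rank them are assumptions made for illustration. Each hit links a query gene to a subject gene in another genome, and a pair is kept only when each gene is the best-scoring hit of the other and both directions pass the similarity and coverage thresholds.

```python
def reciprocal_best_matches(hits, min_similarity=50.0, min_coverage=50.0):
    """hits: iterable of (query, subject, similarity_pct, coverage_pct, score),
    where query and subject are (genome, gene) tuples.
    Returns the set of reciprocal best-match pairs passing both thresholds."""
    best = {}  # (query, subject genome) -> (score, subject)
    for query, subject, similarity, coverage, score in hits:
        if similarity < min_similarity or coverage < min_coverage:
            continue  # the hit fails the 50:50 rule
        if query[0] == subject[0]:
            continue  # within-genome hits are ignored in this sketch
        key = (query, subject[0])
        if key not in best or score > best[key][0]:
            best[key] = (score, subject)
    pairs = set()
    for (query, _), (_, subject) in best.items():
        back = best.get((subject, query[0]))
        if back is not None and back[1] == query:
            pairs.add(frozenset((query, subject)))
    return pairs


# Toy hits, purely illustrative.
hits = [
    (("A", "a1"), ("B", "b1"), 72.0, 90.0, 310.0),
    (("B", "b1"), ("A", "a1"), 72.0, 88.0, 305.0),
    (("A", "a2"), ("B", "b1"), 55.0, 60.0, 120.0),  # one-way: b1 prefers a1
    (("A", "a3"), ("B", "b3"), 45.0, 95.0, 200.0),  # fails 50% similarity
    (("B", "b3"), ("A", "a3"), 45.0, 95.0, 200.0),
]
pairs = reciprocal_best_matches(hits)
```

In the toy data only the pair a1/b1 survives, since a2 is not the best match of b1 and the a3/b3 pair falls below the similarity threshold.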

The success of the analysis was evaluated with the two positive controls inserted in the dataset. The genome pair F. psychrophilum JIP02/86 – F. psychrophilum Z2 (99.5% 16S rDNA identity and 83% ANIm) shared 1,365 genes, which is 56% of the genetic endowment of JIP02/86 and 38% of that of Z2. The second control pair, M. odoratimimus PR63039 – M. profundi D25 (99.2% 16S rDNA identity and 94% ANIm), shared 2,920 genes, that is, 78% and 84.5% of their respective coding genes, encoded in genomes 3.732 Mb and 3.453 Mb long. These differences fit their relative phylogenetic positions in a consensual 16S rRNA-based tree where, despite their high sequence identities, they affiliated at distant positions (Figure 18). This was particularly remarkable for the two genomes named F. psychrophilum.

Figure 18: Relative 16S rRNA sequence affiliations of RBM positive control genomes (bold characters) of Flavobacterium and Muricauda species. The reconstruction followed the strategy designed for consensual trees. Scale bar indicates estimated divergence.

We also considered that differences between the two F. psychrophilum strains could be influenced by their lifestyle (Thomas et al., 2011) and compared them to other Flavobacterium genomes. We observed the expected tendency of genomes to be larger in free-living strains, but differences in shared genes between them seemed more influenced by phylogenetic distance than by lifestyle (Table 12).

Table 12: Shared genes between Flavobacterium strains in our RBM analysis. Flavobacterium strains are sorted phylogenetically as in Figure 18. First column reports the environment where the strain was found.

Environment          Genomes in growing   Coding     Shared genes with F. p.   Shared genes with F. p.
                     length order         proteins   JIP02/86 (pathogen)       Z2 (free-living)
fish pathogen        F. p. JIP02/86       2,446      2,446 / 100%              1,365 / 56%
fish pathogen        F. b. FL-15          3,028      1,393 / 46%               1,366 / 45%
fresh water          F. j. UW101          5,099      1,522 / 30%               1,834 / 36%
fish pathogen        F. c. ATCC 49512     2,632      1,520 / 58%               1,323 / 50%
Warm-spring water    F. i. DSM 17447      2,671      1,403 / 53%               1,409 / 53%
soil                 F. p. Z2             3,553      1,365 / 38%               3,553 / 100%
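The percentages in Table 12 are the shared genes divided by the number of coding proteins of the genome in each row. The short Python sketch below recomputes them from the values in the table; the function name `shared_percentage` and the dictionary layout are only illustrative.

```python
def shared_percentage(shared_genes, coding_proteins):
    """Shared genes as a rounded percentage of a genome's coding proteins."""
    return round(100 * shared_genes / coding_proteins)


# strain: (coding proteins, shared with JIP02/86, shared with Z2), from Table 12
table_12 = {
    "F. p. JIP02/86": (2446, 2446, 1365),
    "F. b. FL-15": (3028, 1393, 1366),
    "F. j. UW101": (5099, 1522, 1834),
    "F. c. ATCC 49512": (2632, 1520, 1323),
    "F. i. DSM 17447": (2671, 1403, 1409),
    "F. p. Z2": (3553, 1365, 3553),
}

percentages = {
    strain: (shared_percentage(with_jip, proteins),
             shared_percentage(with_z2, proteins))
    for strain, (proteins, with_jip, with_z2) in table_12.items()
}
```

The same 1,365 shared genes therefore represent 56% of the JIP02/86 proteins but only 38% of the Z2 proteins, because Z2 codes for more proteins.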

Another concern about the RBM output was whether orthologous groups were clearly defined at the tested similarity threshold. If the threshold was too low, the output would be populated with paralogs, that is, orthologous groups (rows) with more than one sequence belonging to the same genome (column). If the threshold was too high, the orthologous groups would not distribute in phyletic patterns at high ranks (class, order).

Homologous genes from the same genome were separated by commas in the tabbed file. In the file, 1,589 rows contained commas, which is 5% of the total. By counting the times a comma occurred in each row, we evaluated how often those genes were duplicated. In most cases, the genes were duplicated in only one or two genomes. Less than 10% of the cases corresponded to genes duplicated in several genomes (Figure 19).
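The comma count can be reproduced with a small Python sketch. It assumes the layout described for the RBM output (tab-separated, one header row, one column per genome, a dash for "no hit" and commas between headers of the same genome); the function name and the toy table are hypothetical.

```python
import csv
import io


def paralog_summary(table_text):
    """Return (number of groups, {row number: genomes with redundant hits})."""
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    next(reader)  # skip the header row with genome names
    total = 0
    redundant = {}
    for row_number, row in enumerate(reader, start=1):
        total += 1
        # A cell with a comma holds more than one gene of the same genome.
        genomes_with_duplicates = sum(1 for cell in row if "," in cell)
        if genomes_with_duplicates:
            redundant[row_number] = genomes_with_duplicates
    return total, redundant


# Toy table, purely illustrative.
table = (
    "genomeA\tgenomeB\tgenomeC\n"
    "a1\tb1\tc1\n"
    "a2,a7\tb2\t-\n"
    "a3,a8\tb3,b9\tc3\n"
)
total, redundant = paralog_summary(table)
share = 100 * len(redundant) / total
```

Applied to the figures reported above, 1,589 rows out of 31,265 groups correspond to about 5% of the orthologous groups.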

Figure 19: Percentile distribution of the 1,589 groups of orthologs that recruited any paralogous sequence. Paralogs are defined here as genome redundancy within a group of orthologs. No true paralogy was tested with phylogenetic methods. The orthology threshold used in the reciprocal best match analysis was 50% similarity in 50% of the amino acidic sequence. Chart created with plotly v1.1.13.

By sorting the output columns in phylogenetic order, we were able to evaluate phyletic patterns at a glance (Figure 20). From top to bottom, we first observed the bulk of core genes in the database and in the phylum Bacteroidetes. Next came a cluster of abundant genes in aerobic Bacteroidetes (not in Bacteroidales), then another cluster with a decreasing number of genes that belonged to genomes of the orders Flavobacteriales and Bacteroidales. In between, an evident non-phyletic pattern stood out and belonged to respiratory genes.

Figure 20: Joined sections of the table of orthologs produced by the RBM analysis showing its principal phyletic patterns. The image depicts a spreadsheet where dark cells contain a gene header, whereas light cells contain a dash indicating “no hit”. The phylogenetic order followed goes from left to right as: Flavobacteriales, Bacteroidales, Sphingobacteriales, Chitinophagales, Cytophagales and out-group genomes.

Judging from the evidence, the RBM analysis was fit for further interrogation: the comparison of positive controls returned coherent expected values, the incidence of paralogy was low, and orthologs clustered conspicuously in phyletic patterns. Hence, we continued searching for taxonomic sub-clusters of genes by using logical operators upon the file to filter data.

4.2. Highly conserved proteins: core genome, exclusive sets and pertinent proteins.

The collection of 94 genomes compared in the RBM analysis shared 65 housekeeping coding genes (Supplementary file 15), much like in previous reports on the core genome of the prokaryotes (Gil et al., 2004; Koonin, 2003). The 89 bacteroidetal genomes shared an additional 155 coding genes with a predominance of ribosomal proteins (Supplementary file 16). BLAST® searches found homologs of the 155 core genes in other phyla with similarities above 30%. Although they had homologs in other phyla, 31 of the 155 core genes were not coded in out-group genomes of the Rhodothermaeota and Chlorobi. Among them, aminoacyl-tRNA synthesis enzymes prevailed. Gupta and Lorenzini (2007) reported a list of 27 proteins specific for the Bacteroidetes. Due to the limited number of genomes available at the time, they searched in three Bacteroidia and one Flavobacteriia. From that list, 13 proteins did not find a reciprocal best match in our analysis and many others were not ubiquitous in our analysis (Supplementary file 17). Only two proteins coincided with our list of 31 “exclusive” proteins: a uroporphyrinogen-III synthase (PF02602.15) and a hypothetical protein (DUF1599; Domain of Unknown Function).

Sequences in the core genome with orthologs in other phyla can be used in prokaryotic phylogenies, as in the previous MLSA (chapter 3). In the RBM analyses, of the 29 sequences used in the previous MLSA phylogenetic evaluation of the bacteroidetal taxonomy, 14 belonged to the bacteroidetal core genome, 3 were prevalent and 12 were not highly conserved in Bacteroidetes. From the individual phylogenies based on each sequence, we did not find a direct relation between their phylogenetic agreement with ribosomal phylogenies and their level of conservation according to the RBM analysis.

Table 13: Uncharacterized yet conserved sequences of the Bacteroidetes. For sequences not found in the UniProt database, accession numbers from the respective genomes are provided. The genome of Spirosoma radiotolerans DG5AT was used as the reference genome, since it contained all core-set proteins. Putative annotations and pathways were compiled from the UniProt and KEGG databases or inherited from non-bacteroidetal proteins with similarities above 60% in BLAST searches.

Accession         Putative annotation                               Putative pathway
WP_046375230.1    RidA family protein                               RNA processing
WP_046376045.1    Recombination protein A                           Recombination
A0A0E3ZUE1        Methyltransferase                                 Unknown
A0A0E3V5X8        Uncharacterized protein. Putative gene ribH       Biosynthesis of secondary metabolites
A0A0E3ZUU2        HIT family hydrolase. Putative gene rplJ          Ribosome
A0A0E3ZTX4        Zinc ribbon domain protein                        Unknown
A0A0E3ZV03        NGG1p interacting factor 3 protein, NIF3          Unknown
A0A0E3V763        Tyrosine recombinase XerC                         Homologous recombination
WP_046573655.1    tRNA dihydrouridine synthase                      RNA processing
WP_046574203.1    Unmapped. Putative gene recG                      Homologous recombination
A0A0E3V803        ABC transporter ATP-binding protein               Homologous recombination
A0A0E3V8I4        Phosphoesterase. Putative gene rplW               Ribosome
A0A0E3ZXS5        ATPase AAA                                        Unknown
A0A0E3ZXT0        2-hydroxyhepta-2,4-diene-1,7-dioate isomerase     Aminoacyl-tRNA biosynthesis
A0A0E3V8H2        Pyrophosphatase. Putative gene rsfS               Translation
A0A0E3ZYL4        Uncharacterized protein. Putative gene xerC       Homologous recombination
A0A0E3ZYP4        Uncharacterized protein. Putative gene adk        Nucleotide metabolism
WP_046578056.1    tRNA-specific adenosine deaminase                 RNA processing
A0A0E4A0W5        Polypeptide deformylase                           Unknown
A0A0E4A0D9        Putative pre-16S rRNA nuclease                    Unknown
WP_046580219.1    Translational GTPase TypA                         Translation

Exclusive genes can predict unique phenotypes, but the 31 sequences we found did not encode any peculiar metabolism. To enhance the chances of predicting an exclusive phenotype of the phylum, we also selected genes that could have been deleted in some genomes or that were present in some out-group genomes, as by lateral gene transfer events. We will refer to these proteins as prevalent. Since the smallest taxonomic sets of genomes had 5 genomes each (Chitinophagia, Sphingobacteriia and the out-group), we allowed 3 (fewer than 5) deletions in the 89 bacteroidetal genomes, as well as hits in the out-group, to avoid recruiting proteins that were lost by a deep branching lineage. Thus, we defined a set of 87 prevalent proteins in the Bacteroidetes: the 31 exclusive proteins, 7 core genes (with orthologs in the out-group) and 49 prevalent proteins that were not ubiquitous (Supplementary file 18). The last 49 were enriched in replication/recombination proteins and proteins of unknown function (Figure 21). As many as 21 prevalent sequences were uncharacterized predicted proteins of unknown function. Independent BLAST® searches indicated that the majority might correspond to housekeeping functions, but six remained difficult to classify (Table 13). Two pairs of unknown proteins consistently encoded together were of particular interest due to their synteny: Pfam entries PF02591/PF01784 and PF01327/PF03652.
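The kind of logical filter applied to the table of orthologs can be sketched in Python as follows. The sketch is a simplification under stated assumptions: each orthologous group is reduced to the set of genomes where it has a hit, and the function name, the category labels and the toy genomes are hypothetical. It is not meant to reproduce the exact partition into 31, 7 and 49 proteins, which also depended on manual curation.

```python
def classify_group(present_in, ingroup, outgroup, max_missing=3):
    """Classify one orthologous group from the set of genomes where it has a hit."""
    missing = len(ingroup - present_in)          # deletions within the in-group
    in_outgroup = bool(present_in & outgroup)    # any hit in the out-group
    if missing == 0 and not in_outgroup:
        return "exclusive"
    if missing == 0:
        return "core"
    if missing <= max_missing:
        return "prevalent, not ubiquitous"
    return "other"


# Toy genomes, purely illustrative; max_missing is lowered to fit their number.
ingroup = {"B1", "B2", "B3", "B4", "B5", "B6"}
outgroup = {"O1", "O2"}
groups = {
    "g1": ingroup | outgroup,            # hits everywhere
    "g2": set(ingroup),                  # all in-group genomes, no out-group hit
    "g3": (ingroup - {"B2"}) | {"O1"},   # one deletion plus an out-group hit
    "g4": {"B1", "B2"},                  # too many deletions
}
labels = {
    name: classify_group(present, ingroup, outgroup, max_missing=1)
    for name, present in groups.items()
}
```

With the 89 bacteroidetal genomes as the in-group, the default `max_missing=3` corresponds to the tolerance described above.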

Figure 21: Composition of the Bacteroidetes’ conserved sequences. (A) Concentric pie charts represent the core genome (outer), exclusive plus prevalent sequences (mid) and exclusive proteins only (inner). (B) Composition of the 49 proteins that are prevalent in the Bacteroidetes and do not belong to the core genome. KEGG and UniProt classifications were compared to produce the final categories (legend).

At the taxonomic rank order/class we searched for exclusive proteins as “not shared with other Bacteroidetes” (Figure 22). Again, the exclusive sets of proteins were very small and poor in non-housekeeping genes. The largest taxonomic group, the Flavobacteriia, only had four exclusive genes. The mid-sized Bacteroidia and Cytophagia had 16 and 20 exclusive genes respectively. But the small-sized Chitinophagia and Sphingobacteriia differed greatly: the Chitinophagia had seven exclusive genes, denoting evolutionary remoteness of its genomes (as suggested by the later classification of the class Saprospiria (Hahnke et al., 2016)), and the Sphingobacteriia had 100 exclusive genes, probably denoting phylogenetic proximity between its genomes.

Figure 22: Exclusive protein pools in the Bacteroidetes. FLV: Flavobacteriia, BTD: Bacteroidia, SPH: Sphingobacteriia, CHT: Chitinophagia, CYT: Cytophagia. The exclusive proteins shared between taxonomic classes are virtually zero. However, when the Bacteroidia are excluded, 19 proteins are shared among the other four groups. The 31 exclusive genes of the phylum were defined as not shared with out-group genomes of the phyla Rhodothermaeota and Chlorobi.

The exclusive set of proteins among combined taxonomic classes was almost zero, except for the combination of all but the Bacteroidia. The Bacteroidia is the only anaerobic lineage of the Bacteroidetes and they missed 19 bacteroidetal genes that mostly belonged to carbohydrate metabolism, and significantly to the tricarboxylic acid (TCA) cycle. Only sequences of the succinate dehydrogenase gene were kept and were orthologous in all 89 bacteroidetal genomes. They also kept sequences of the fumarase gene, but these diverged beyond the 50:50 rule into a different orthologous group. Sequences of other enzymes involved in the TCA cycle remained as remnant genes in a few of the 16 Bacteroidales genomes.

Conserved sequences at the taxonomic ranks class/order were later searched applying the criteria for prevalent sequences. This enlarged the sets of conserved proteins to 44 in Flavobacteriia, 45 in Bacteroidia, 20 in Chitinophagia, 112 in Sphingobacteriia, 25 in Cytophagia and 49 in all but the Bacteroidia. Adding the 87 prevalent sequences in the phylum, a total of 382 translated genes were more carefully studied by updating their annotations and exploring their conserved identity and synteny (Supplementary file 19).

4.3. Median sequence identity of conserved sequences.

The 382 groups of prevalent sequences were individually aligned and compared to create similarity matrices. Median sequence identities (m.s.i.) of prevalent orthologs ranged from 45% to 81%. As many as 61% of the 382 groups had m.s.i. within 50–60%, 26% of them within 60–70%, and the remaining 13% were evenly distributed above 70% or just below 50%. High internal m.s.i. did not correlate with taxonomic distribution, metabolic pathways or sequence length. However, representation of the presence/absence pattern of orthologs and their m.s.i. in a heatmap revealed some underlying trends (Figure 23) that corresponded to evolutionary traits of Bacteroidetes. An extra set of 26 orthologous groups distributed in a non-phyletic, but complementary, pattern was included in the analysis (Figure 23D). All of them are related to the aerobic respiratory chain of the Bacteroidetes and are explained in the next chapter for clarity.
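As an illustration of how a median sequence identity can be derived from one aligned group of orthologs, the following Python sketch computes all pairwise identities and takes their median. How identity was computed in this study is not restated here, so the choice of counting only columns where both sequences have a residue is an assumption, and the function names and toy sequences are hypothetical.

```python
from itertools import combinations
from statistics import median


def pairwise_identity(a, b):
    """Percent identity over columns where both aligned sequences have a residue."""
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    return 100.0 * sum(x == y for x, y in compared) / len(compared)


def median_sequence_identity(aligned):
    """Median of all pairwise identities within one group of aligned orthologs."""
    return median(pairwise_identity(a, b) for a, b in combinations(aligned, 2))


# Toy alignment of three orthologs, purely illustrative.
group = ["MKTAL", "MKSAL", "MRSA-"]
msi = median_sequence_identity(group)
```

In the toy group the three pairwise identities are 80%, 50% and 75%, so the median is 75%.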

Conserved sequences of the Flavobacteriia were consistently less similar in genomes of the branch (Hahnke et al., 2016) (Figure 23A). The genomes of Fluviicola taffensis DSM 16823T (family Crocinitomicaceae) and Owenweeksia hongkongensis DSM 17368T (family Cryomorphaceae) lacked many of the conserved sequences of the Flavobacteriia. The lack of conserved flavobacterial genes in the Chryseobacterium branch and in non-Flavobacteriaceae genomes was also observable in Figure 20. Likewise, the genomes of deep-branching Bacteroidia, such as Alistipes finegoldii DSM 17242T, Draconibacterium orientale FH5T, and Odoribacter splanchnicus DSM 20712, frequently encoded some of the 49 proteins missing from most other Bacteroidia (Figure 23B). Finally, conserved sequences of the Chitinophagia and Sphingobacteriia held low m.s.i., indicating phylogenetic remoteness (Figure 23C).

Figure 23. Median sequence identity of orthologs by taxonomic rank class. A: lesser identity of the Chryseobacterium branch with other Flavobacteriaceae shows genomic drift. B: Genes lost in the Bacteroidia, some of them still coded in deep branching species of the class. C: Great overall divergence between genomes of the classes Chitinophagia and Sphingobacteriia. D: non-phyletic distribution of genes coding the complexes I of the respiratory chain.

4.4. Synonymy and synteny.

We explored the pathways and biochemical reactions that involved the 382 prevalent sequences in different taxonomic categories, especially among the most conserved ones (higher m.s.i.), with no success in identifying diagnostic phenotypes. The individual synteny analysis of each orthologous group found some recurrent loci distribution around these genes, although most were theoretical. Hardly any of the proteins involved had reviewed annotations or had been characterized in members of the Bacteroidetes for certainty. Unpublished syntenic organizations found in this study are listed next, with numeric reference (group numbers) to the prevalent orthologs reported in Supplementary file 19.

• Prevalent in Bacteroidetes (89 genomes).
  ◦ Adenine glycosylase + single-strand DNA binding protein + hemolysin in 58 genomes. Group 173.
  ◦ UDP-3-O-acylglucosamine transforming genes in the synthesis of lipid-A (UniProt route P21645) in 75 genomes. Group 180.
  ◦ Assembly protein + polymerase + decarboxylase + DUF + RedN in 66 genomes. Group 191.
  ◦ Peptidoglycan glycosyl-transferase + ligase + transferase in 86 genomes. Group 200.
  ◦ ParA + ParB + hypothetical protein + reductase + peptidase (cell wall biosynthesis pathway) in 62 genomes. Group 205.
  ◦ UDP-N-acetylmuramoyl-L-alanyl-D-glutamate ligase + cell wall division protein in 85 genomes. Group 206.
  ◦ 50S ribosomal complex proteins in 84 genomes. Group 261.
  ◦ Ribosomal silencing loci in 72 genomes. Group 268.

• Prevalent in anaerobic Bacteroidia (16 genomes).
  ◦ Flavoprotein + A-coA dehydrogenase in 22 genomes (with synonyms in other taxa). Groups 2,461 and 2,456. Candidate gateways into their aerobic respiratory chain (chapter 5).
  ◦ Ketoisovalerate oxidoreductase loci vorBAA in the 16 genomes, non-exclusive fermentative pathway. Groups 2,591 to 2,593.

• Prevalent in aerobic Bacteroidetes (73 genomes).
  ◦ Three consecutive hypothetical proteins in 67 genomes. Group 438.
  ◦ 2-oxoglutarate dehydrogenase subunits E1 and E2 in 70 genomes. TCA cycle. Groups 411 and 436.
  ◦ Electron transfer flavoprotein subunits in 66 genomes. Groups 425 and 437.

• Prevalent in Flavobacteriia (48 genomes).
  ◦ Protein translocase gene-cluster (of 5 genes) in 42 genomes with synonym cluster in Sphingobacteriales. Group 734.
  ◦ HMG-CoA reductase + hypothetical protein in 31 genomes. Group 736.
  ◦ Diphosphomevalonate decarboxylase + hypothetical protein in 37 genomes. Group 737.
  ◦ Holliday junction protein + peptide deformylase + hypothetical protein in 33 genomes. Group 760.
  ◦ Sodium-proton antiporter + nitrogen fixation NifU in 47 genomes. Group 795.
  ◦ Peptidase M14 + transcriptional regulator + hypothetical protein in 34 genomes. Group 818.

• Prevalent in Chitinophagia (5 genomes).
  ◦ Two-component histidine kinase sensor system + DNA binding protein with synonyms in Flavobacteriia (5 genomes). Group 7,260.

• Prevalent in Sphingobacteriia (5 genomes).
  ◦ Aminofutalosine synthase + histidine phosphatase + radical SAM protein in the 5 genomes. Menaquinone biosynthetic pathway. Group 9,902.
  ◦ Uncharacterized proteins next to housekeeping genes: groups 9,883, 9,886, 9,890, and 9,924.

• Prevalent in Cytophagia (15 genomes).
  ◦ Response regulator + hypothetical protein + hydrolase + hypothetical protein in 11 genomes. Group 2,833.

However, recurrent annotations of quinol:cytochrome oxidoreductase and gliding subunits were prevalent in different taxa. In their syntenic analysis we found that some were synonyms and others were adjacent genes in gene clusters with phyletic distributions. Two gliding motility proteins were conserved in most Flavobacteriia (GldC), Sphingobacteriia (GldC, GldL) and Cytophagia (GldL) as synonym groups of orthologs constrained to the class rank (Supplementary file 20).

Figure 24: Summary of the locus architectures of the type IX secretion system gene cluster gldKLMN recovered from Bacteroidetes. Genes gldN and gldO are homologous, both represented in blue. MultiGeneBlast v1.1.13 compared the query gene cluster (top) against 89 genomes with homology cut-offs of 20% identity and 30% coverage, and a maximum inter-gene distance of 10 genes.

The gldC gene was not conserved in a recognizable gene cluster across Bacteroidetes and was located adjacent to gldB in only 43 Flavobacteriia. In contrast, gldL belonged to the cluster gldKLMN or gldKLMO, recognizable in 80 genomes (Figure 24), 90% of the represented Bacteroidetes (Supplementary file 20). Only Flavobacteriia encoded the cluster gldKLMO. Species missing the gldKLMN cluster were pathogens or symbionts except for soli.

Figure 25: Summary of the locus architectures of the Alternative Complex III plus cytochrome caa3COX recovered from the Bacteroidetes. Genes of the cytochrome oxidase subunit III in the Flavobacteriales were encoded further downstream and are not represented here. Bacteroidales known to be strict anaerobes did not encode this locus. MultiGeneBlast v1.1.13 compared the query gene cluster (top) against 89 genomes with homology cut-offs of 20% identity and 30% coverage, and a maximum inter-gene distance of 10 genes.

Quinol:cytochrome c oxidoreductase subunits were annotated in all taxonomic groups except for the Bacteroidia. They belonged to the Act subunits of the alternative complex III (ACIII) of the aerobic respiratory chain (Refojo et al., 2010). Their encoding genes were organized in an actABCDEF gene-cluster present in 83% of the studied bacteroidetal genomes, all the aerobes. The synteny of this gene-cluster extended downstream to some cytochrome oxidase subunits that coded for an oxygen-reducing caa3-type cytochrome oxidase (caa3COX) (Figure 25). In flavobacterial species, the ACIII and caa3COX have been described to form a respiratory supercomplex equivalent to the canonical complexes III and IV (Sun et al., 2018). BLAST® searches found the ACIII-caa3COX supercomplex also encoded in genomes of the Blattabacteriaceae, which were discarded from our database because of their genomic reduction. Under our RBM parameters, the subunit ActC was the most conserved sequence, with orthologs in 49 genomes and synonym orthologs in the Flavobacteriia. The other Act subunit sequences were divided into at least four orthologous groups with phyletic distributions, ActE being the most disaggregated, or most variable, sequence.
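The homology cut-offs quoted in the captions of Figures 24 and 25 (20% identity, 30% coverage and an inter-gene distance of 10 genes) can be illustrated with a minimal Python sketch. It is not MultiGeneBlast code: the `Hit` record, the function names and the toy values are hypothetical, and the grouping rule (neighbouring hits no more than 10 gene positions apart are chained into one locus) is an assumption about how such a distance limit could be applied.

```python
from dataclasses import dataclass

# Thresholds quoted in the figure captions.
MIN_IDENTITY = 20.0   # percent identity
MIN_COVERAGE = 30.0   # percent coverage
MAX_GENE_GAP = 10     # maximum distance, in genes, between neighbouring hits


@dataclass
class Hit:
    """Hypothetical record for one homologous gene found in a target genome."""
    query_gene: str
    gene_index: int   # ordinal position of the target gene along the genome
    identity: float
    coverage: float


def passes_cutoffs(hit):
    """True when a hit reaches both the identity and the coverage cut-off."""
    return hit.identity >= MIN_IDENTITY and hit.coverage >= MIN_COVERAGE


def group_into_loci(hits):
    """Keep hits above the cut-offs and chain close neighbours into candidate loci."""
    kept = sorted((h for h in hits if passes_cutoffs(h)), key=lambda h: h.gene_index)
    loci = []
    for hit in kept:
        if loci and hit.gene_index - loci[-1][-1].gene_index <= MAX_GENE_GAP:
            loci[-1].append(hit)
        else:
            loci.append([hit])
    return loci


# Toy example (invented values, not data from this study).
hits = [
    Hit("gldK", 100, 45.0, 90.0),
    Hit("gldL", 101, 38.0, 85.0),
    Hit("gldM", 102, 30.0, 70.0),
    Hit("gldN", 103, 18.0, 80.0),   # below the identity cut-off, discarded
    Hit("gldN", 500, 25.0, 60.0),   # too far from the others, forms its own locus
]
loci = group_into_loci(hits)
locus_genes = [[h.query_gene for h in locus] for locus in loci]
```

In this toy case the first locus would be reported as a partial gldKLM cluster, and the distant gldN homolog would not be counted as part of it.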

4.5. Expected sus and flx genes

PUL predictor SusC/D sequences

TonB-dependent uptake systems are widely distributed among Gram-negative bacteria and have been shown to transport a variety of large substrates including iron-siderophores, nickel, vitamin B12 and oligosaccharides (Noinaj et al., 2010; Schauer et al., 2008).

Figure 26: Sequence identity heatmap of the 20 most abundant susC/ragA translated genes in Bacteroidetes against the reference susC translated sequence in Bacteroides thetaiotaomicron (black pixel). Groups of orthologous sequences are distributed along the y axis and genomes along the x axis in phyletic order from Flavobacteriia (left) to Cytophagia and out-group genomes (right). The last two columns are the minimum and maximum identity within each OG. Scale bar on the right translates color shades into percentage identities. Heatmap created with plotly v1.1.13.

Bacteroidetes have evolved a variant that features a SusD-like substrate-binding protein. Genes for SusD are usually co-located with genes for SusC in a characteristic tandem. These tandems are often part of PULs, where SusD acts as initial glycan-binding protein that interacts with the SusC TonB-dependent pore protein for uptake across the outer membrane. However, there are also SusD homologs that might have alternate functions, e.g., in iron acquisition.

SusC/D proteins are homologous to RagA/B proteins (Hall et al., 2005) and are therefore often annotated as such. In this study, a total of 452 orthologous groups contained annotated SusC/RagA sequences, and 160 contained SusD/RagB sequences. SusC/RagA homologues were more abundant than SusD/RagB because not all bacteroidetal TonB-dependent transporters feature SusD, which is specific for oligosaccharides (Noinaj et al., 2010). Nevertheless, their sequence similarity was low, causing the RBM analysis to split them into multiple groups of orthologs with no apparent phyletic pattern (Figure 26). The classes Cytophagia, Chitinophagia and Sphingobacteriia showed the highest absolute SusD abundances, but the SusD distribution revealed neither a phylogenetic nor an environmental pattern. SusC/SusD ratios were high in the genera Bacteroides and Prevotella in comparison to the phylum’s general trend; 13 genomes did not contain SusD sequences, and five contained no Sus sequences at all (Figure 27).
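The ortholog grouping discussed above rests on matching sequences between pairs of genomes. Assuming that RBM stands for reciprocal best match, the core rule can be sketched in Python as follows: two proteins are paired only when each is the best-scoring hit of the other. The function name, the score tables and the protein identifiers are hypothetical, and the identity and coverage cut-offs used in this study are not reproduced here.

```python
def reciprocal_best_matches(scores_ab, scores_ba):
    """Return pairs (a, b) where b is the best hit of a and a is the best hit of b.

    scores_ab maps each protein of genome A to {protein of genome B: score};
    scores_ba holds the same information in the opposite direction.
    """
    best_ab = {a: max(hits, key=hits.get) for a, hits in scores_ab.items() if hits}
    best_ba = {b: max(hits, key=hits.get) for b, hits in scores_ba.items() if hits}
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy scores (invented): two SusC-like proteins of genome A compete for one of genome B.
scores_ab = {
    "A_susC1": {"B_susC1": 310.0, "B_susC2": 120.0},
    "A_susC2": {"B_susC1": 150.0},
}
scores_ba = {
    "B_susC1": {"A_susC1": 305.0, "A_susC2": 148.0},
    "B_susC2": {"A_susC1": 118.0},
}
pairs = reciprocal_best_matches(scores_ab, scores_ba)
```

Here only A_susC1 and B_susC1 are paired. Divergent homologs such as A_susC2 stay unpaired, which is one way in which low sequence similarity could fragment a protein family into many small groups.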

Figure 27: Orthologous groups containing SusC/D-like translated proteins in Bacteroidetes’ reference genomes. Circles: genomes with no sus sequences. From left to right: Flavobacterium indicum DSM 17447T, Flavobacterium psychrophilum JIP02/86, Fluviicola taffensis DSM 16823T, Saprospira grandis str. LewinT, and Flexibacter litoralis DSM 6794T. Stars: Maximum susD counts per class. From left to right: (Flavobacteriia) profunda SMA-87T, Zobellia galactanivorans DsiJT, (Bacteroidia) Draconibacterium orientale FH5T, (Sphingobacteriia) Pedobacter heparinus DSM 2366T, Sphingobacterium sp. ML3W, Sphingobacterium sp. 21, (Chitinophagia) Chitinophaga pinensis DSM 2588T, Niastella koreensis GR20-10T, (Cytophagia) Dyadobacter fermentans DSM 18053T, vietnamensis DSM 17526T, Cyclobacterium marinum DSM 745T, and Cyclobacterium amurskyense KCTC 12363T. Pentagon: Bacteroides and Prevotella genomes encode few susD copies compared to their susC copies. Chart created with plotly v1.1.13.

Synthesis of the flexirubin, flx sequences.

In order to identify flexirubin synthesis genes, we used the flx genes from Flavobacterium johnsoniae UW101 (Sousa et al., 2012) as queries for MultiGeneBlast searches in other genomes. Only 22 genomes of the entire dataset contained at least half of the flx cluster (Annex D), and no homologous sequences were found in genomes of the Saprospiria and Sphingobacteriia classes. When homologous gene clusters could be recognized outside the Flavobacteriia, these were shorter, mainly due to the absence of predicted hypothetical proteins.

4.6. Most ubiquitous gene-clusters in Bacteroidetes

Besides the ACIII-caa3COX supercomplex conserved in aerobic Bacteroidetes, we found Complex II sequences SdhA and SdhB in the core-genome of the phylum (Supplementary file 20). The canonical Complex II is a succinate dehydrogenase that can shuttle electrons into the respiratory chain through two pathways: acquiring them from Complex I or participating in the TCA cycle as the catalyst of succinate’s oxidation to fumarate. It is also called succinate:quinol oxidoreductase (SQR). A syntenic analysis of sdhA and sdhB sequences in bacteroidetal genomes found them contiguous in all genomes and preceded by an sdhC sequence that did not qualify as orthologous for all Bacteroidetes in the RBM comparison of genomes. This genomic organization, sdhCAB, is typical of the type B SQR complex (Lemos et al., 2002). Although it has been reported that the SQR complex of Rhodothermus marinus could be type B (Lemos et al., 2002), we did not find homologous sdhCAB gene-clusters in Rhodothermus or other Rhodothermaeota genomes, but we did find them in the phyla Chlorobi and Fibrobacteres. The subunits SdhAB of the type B SQR complex are the membrane anchors of the SdhC, the catalytic subunit. SdhAB contains the iron-sulfur centers that funnel electrons into the respiratory chain.

Thus, the sdh, act and gld sequences were the only conserved genomic loci of Bacteroidetes that our methodology could detect. To determine whether their phylogenetic signal agreed with the established taxonomic route of the phylum, and to confirm their vertical transmission, we conducted an MLSA of each locus based on the sequences with the highest similarities (Figure 28). A phylogenetic reconstruction based on concatenated SdhAB sequences of respiratory complex II could not resolve the six Bacteroidetes classes and suggested a horizontal gene transfer (HGT) event between the common ancestor of the Bacteroidia and the anaerobic green sulfur bacterium Chlorobium limicola. In contrast, the reconstruction based on concatenated ActBCD sequences of the ACIII reproduced the commonly accepted Bacteroidetes phylogeny, although the only representative of the Bacteroidia was placed closer to Saprospiria than to Flavobacteriia. Last, GldKLMN/O sequences of the gliding machinery reproduced a phylogeny that also agreed with the known Bacteroidetes phylogeny, but placed F. taffensis (Crocinitomicaceae) outside the Flavobacteriia. Act and Gld phylogenies both reproduced the distinctness of the Chryseobacterium branch from other Flavobacteriaceae.

Figure 28: MLSA based phylogenies of most conserved Act, Gld and Sdh proteins. Concatenated sequences measured 1,675 a.a. ± 36 (Act), 1,502 a.a. ± 87 (Gld) and 911 a.a. ± 11 (Sdh). Sequences were aligned with Muscle (Edgar, 2004) and phylogenetic trees were computed using neighbor-joining with 1,000 iterations and Kimura correction in ARB (Ludwig et al., 2004). Percentages represent how often a bifurcation was reproduced over the 1,000 replicates. Scale bars represent the estimated substitution rate.
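The caption of Figure 28 mentions a Kimura correction of the distances used for neighbor-joining. As a hedged illustration, the Python sketch below uses the approximation of Kimura (1983) for protein sequences, d = -ln(1 - p - 0.2p²), where p is the fraction of differing aligned positions. Whether ARB applies exactly this form is an assumption, and the function names and toy sequences are invented.

```python
import math


def p_distance(seq_a, seq_b):
    """Fraction of differing residues over aligned columns without gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    if not columns:
        raise ValueError("no comparable columns")
    return sum(x != y for x, y in columns) / len(columns)


def kimura_protein_distance(p):
    """Kimura (1983) approximation: d = -ln(1 - p - 0.2 * p**2)."""
    inner = 1.0 - p - 0.2 * p * p
    if inner <= 0:
        raise ValueError("sequences too divergent for this correction")
    return -math.log(inner)


# Toy aligned fragments (invented, not sequences from this study).
a = "MKTLLVAG-SQ"
b = "MKSLLIAGQSQ"
p = p_distance(a, b)            # 2 differences over 10 gap-free columns
d = kimura_protein_distance(p)  # corrected distance, slightly larger than p
```

The correction inflates the observed distance to account for multiple substitutions at the same site, so the corrected value always exceeds the raw fraction of differences.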

5. Bacteroidetal adaptive radiation and halophily explained by aerobic respiratory chain genes.

From SQR sequences to ACIII-caa3COX sequences, the bacteroidetal equivalents of the canonical complexes II, III, and IV were identified as highly conserved in most genomes. This would be unremarkable were it not that the ACIII-caa3COX complex has yet to be reported as exclusive to the phylum. However, sequences of Complex I did not pass the homology cut-offs set for this phylogenetic study and deserved extra attention.

5.1. Genes of the Complex I predict halophily.

The bacterial NADH:quinone oxidoreductase is widely distributed in the domain Bacteria. It might have 11 subunits in its ancestral form, or 14 when it is complete (Figure 29). The complex is divided into three modules: the P-module (NuoAHJKLMN subunits) is the proton translocation module, the Q-module (NuoBCDI) transfers electrons to the quinone via iron-sulfur centers, and the N-module (NuoEFG) is the NADH dehydrogenase module that binds NADH produced by the TCA cycle (Moparthi and Hägerhäll, 2011).

Figure 29: Representation of the canonical complex I modules according to Moparthi and Hägerhäll 2011.

Exploring the annotations of translated genes in various genomes, we found Nuo subunits in both aerobic and anaerobic Bacteroidetes. A synteny analysis of those sequences identified the full 14-subunit gene-cluster nuoABCDEFGHIJKLMN in aerobic Bacteroidetes, whereas the anaerobic Bacteroidia encoded a 10-subunit cluster nuoABDHIJKLMN. The lack of subunits CEFG indicates that the Bacteroidia have presumably lost their N-module during their transition to an anaerobic lifestyle. This transformation makes the NADH:quinone oxidoreductase of the Bacteroidia very similar to the ancestral complex, whose mechanism is unknown but which probably interacts with varied electron donor or acceptor proteins (Moparthi and Hägerhäll, 2011). A candidate electron donor could be the conserved flavoprotein in Bacteroidia that was encoded next to an Acetyl-CoA dehydrogenase (Chapter 4), on which we found no specific literature and which might be worth studying (proteins PF00441, PF00766, and PF01012).
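The comparison of the two nuo clusters against the module composition given above can be written as a short set operation. The Python sketch below is only an illustration of that bookkeeping; the dictionary and function names are hypothetical, and the module assignments are those quoted in the text from Moparthi and Hägerhäll (2011).

```python
# Module composition as described in the text (Moparthi and Hägerhäll, 2011).
MODULES = {
    "P": set("AHJKLMN"),  # proton translocation module
    "Q": set("BCDI"),     # electron transfer to the quinone
    "N": set("EFG"),      # NADH dehydrogenase module
}


def missing_by_module(cluster):
    """Map each module to the sorted Nuo subunits absent from a cluster string."""
    present = set(cluster)
    return {name: sorted(subunits - present) for name, subunits in MODULES.items()}


aerobic = missing_by_module("ABCDEFGHIJKLMN")   # full 14-subunit cluster
anaerobic = missing_by_module("ABDHIJKLMN")     # 10-subunit cluster of the Bacteroidia
```

For the 10-subunit cluster, the whole N-module (NuoEFG) is reported as absent and the Q-module lacks NuoC, while the P-module is complete.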

Most importantly, we did not find the full construction of the nuo cluster in all aerobic species. Checking for their concordance with the phyletic distribution of nuo translated genes in the RBM analysis, we found a clear non-phyletic pattern that was complementary to the non-phyletic pattern of other respiratory genes identified before (Chapter 4). These other genes encoded a sodium pumping NADH:quinone oxidoreductase (Na+-NQR). The Na+-NQR is a six-protein membrane complex that was described in marine bacteria (Unemoto and Hayashi, 1993). It uses the metabolic redox power to pump sodium ions across the inner membrane, coordinating the respiratory chain with osmotic regulation (Figure 30). Thus, part of the pumped sodium is destined for oxidative phosphorylation and part for sustaining the intracellular ionic balance. The Na+-NQR is associated with the respiration of pathogens like Vibrio, Klebsiella or Haemophilus spp. It is the only known enzyme that uses riboflavin directly as a redox cofactor and is very efficient in stabilizing unpaired electrons on its flavin molecules (Barquera, 2014). From a phylogenomic analysis and a comparison of gene cluster organization, Reyes-Prieto et al. proposed that the Na+-NQR originated in the common ancestor of Bacteroidetes and Chlorobi via duplication of the rnf operon (NADH:ferredoxin dehydrogenase). They postulated that the acquisition of the Na+-NQR genes by different horizontal transfer events allowed different types of bacteria to adapt to the abundance of Na+ ions in habitats like the marine, alkaline and intracellular environments (Reyes-Prieto et al., 2014). In the duplication event, the copied rnf operon (coding for a group of redox-linked sodium pumps) would have lost the RnfB protein involved in the electron uptake from the reduced ferredoxin and later recruited an AMOr subunit (aromatic monooxygenase) to become NqrF, the electron uptake subunit of the Na+-NQR. They called the intermediate complex with no NqrF subunit the ancestral Na+-NQR.

Our results invite a reconsideration of this hypothesis. Reyes-Prieto et al. represented the Bacteroidetes in their analysis with species of the class Bacteroidia. We observed that Rnf proteins are common only to the Bacteroidia. In other classes, only two species of Flavobacteriia presented rnf orthologs. On the other hand, the nqr cluster was incomplete in the Cytophagia, lacking the NqrF subunit and thus reproducing the ancestral Na+-NQR (Supplementary file 20, Figure 31). If the events that originated the Na+-NQR proposed by Reyes-Prieto et al. were correct, the recruitment of the NqrF subunit should have happened within the Bacteroidetes after the divergence of the Cytophagia branch. A plausible explanation would be that the Na+-NQR originated as a transformation of the original rnf genes in Bacteroidetes, not a duplication.

Figure 30: Representation of the NQR complex according to Barquera (2014). NqrF is colored grey to indicate the only subunit not homologous to the original Rnf complex.

This dichotomy in the distribution of Complex I types fitted the environmental distribution of Bacteroidetes according to salinity (Supplementary file 20). When the metadata were not specific about the salinity at the isolation site, we consulted the composition of the growth media for each strain. It was clear that all species endowed with the NQR complex needed sodium to grow or could resist seawater salt concentrations.
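The dichotomy described above can be expressed as a simple rule on the subunits detected in a genome. The Python sketch below is a hypothetical illustration, not the procedure used in this study: the subunit letters A to F for the six Nqr proteins, the function names and the label strings are assumptions, and treating only a complete NQR as sodium-linked is one possible reading of the observation.

```python
FULL_NUO = set("ABCDEFGHIJKLMN")   # 14-subunit nuo cluster named in the text
FULL_NQR = set("ABCDEF")           # six Nqr subunits; the letters A-F are assumed here


def complex_i_profile(nuo, nqr):
    """Summarise which NADH:quinone oxidoreductase genes a genome encodes."""
    nuo_set, nqr_set = set(nuo), set(nqr)
    labels = []
    if nuo_set == FULL_NUO:
        labels.append("full NUO")
    elif nuo_set:
        labels.append("partial NUO")
    if nqr_set == FULL_NQR:
        labels.append("full NQR")
    elif nqr_set == FULL_NQR - {"F"}:
        labels.append("ancestral NQR (no NqrF)")
    elif nqr_set:
        labels.append("partial NQR")
    return labels or ["none detected"]


def sodium_linked(nqr):
    """Assumed rule of thumb: a complete NQR goes with sodium need or tolerance."""
    return set(nqr) == FULL_NQR


# Toy inputs (invented subunit strings, not genomes from this study).
only_nuo = complex_i_profile("ABCDEFGHIJKLMN", "")
only_nqr = complex_i_profile("", "ABCDEF")
ancestral = complex_i_profile("ABCDEFGHIJKLMN", "ABCDE")
```

A genome profiled as carrying only a full NQR would be flagged as sodium-linked, whereas one carrying the ancestral form without NqrF would not.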

5.2. Variants of the aerobic respiratory chain.

By surveying the genetic organization of respiratory genes in Bacteroidetes (Supplementary table 20) we observed that in the class Cytophagia it was common to see duplicated nuo gene-clusters, many of them degenerated (incomplete). In five cases, an incomplete nuo gene-cluster was the only copy of complex I genes found: Myroides odoratimimus PR63039, Elizabethkingia sp. BM10, Spirosoma radiotolerans DG5A, Dyadobacter fermentans DSM 18053, and Leadbetterella byssophila DSM 17132. The last three lacked the genes of the N-module, suggesting interaction with alternative electron donors. Also, the multiple variants of nuo gene organization and the presence of the ancestral NQR complex in this taxon (Figure 31) led us to think that adaptation to variable redox environments influenced the divergence of the Cytophagia from other Bacteroidetes.

Following the SQR in the aerobic respiratory chain, the equivalents of Complex III and IV in aerobic Bacteroidetes were always an ACIII bound to a caa3-type cytochrome c oxidase, forming the supercomplex ACIII-caa3COX recently described for F. johnsoniae (Sun et al., 2018). The supercomplex accepts electrons from the quinone and conducts them to the reduction of O2 to H2O, completing the respiratory chain (Figure 31) while contributing protons to the electrochemical gradient. The electron transport rate of the canonical respiratory chain is limited by the capacity of soluble cytochrome c to connect the terminal oxidase with other membrane-integral redox complexes. The supercomplex ACIII-caa3COX solves this problem by controlling the electron flux that is transmitted directly to the c-type haem electron carrier fused to the caa3COX (Sun et al., 2018). In its description, the gene cluster that codes for the ACIII-caa3COX in F. johnsoniae did not include the gene of the caa3-type cytochrome oxidase subunit III because it is encoded upstream. We found the same cluster construction in all Flavobacteriia except those in the Chryseobacterium branch (Figure 31), in which we did not find homologs of the caa3-type cytochrome oxidase genes. This difference between the Flavobacteriia and other aerobic Bacteroidetes could indicate a different control of the expression of the supercomplex in response to their environment. By contrast, the ACIII complex is well conserved in all aerobic Bacteroidetes (including the endosymbiotic Blattabacteriaceae; Figure 31).

Although the Bacteroidia are anaerobes, substrate-level phosphorylation is not absent in the Bacteroidia: all their genomes coded for at least one ATP synthase. Half of the Bacteroidia had both an F-type and a V-type ATPase, and only three (of the genera Porphyromonas and Alistipes) had only a V-type ATPase, coinciding with our earlier report of indels in the beta subunit. Further study of the expression of both ATPases should explain their utility, but our guess is that they are related to their performance with a transmembrane sodium-motive force (Mulkidjanian et al., 2008). Sequences of the F-type ATP synthase (sometimes called complex V) were pertinent to the phylum, but lost in the Bacteroidia. A detailed inspection (Supplementary file 21) revealed that subunits A, B, C, δ, α, and γ were encoded in the same gene cluster across most genomes of Bacteroidetes. Genomes of the Bacteroidia encoded a predicted gamma subunit of the F-type ATP synthase classified in a group of orthologs synonym to that of the rest of Bacteroidetes, plus the rest of the cluster. However, no F-type ATPase subunits were present in the Bacteroidia members Porphyromonas asaccharolytica, P. gingivalis and Alistipes finegoldii. These species feature a V-type ATPase instead. Eight Bacteroidia genomes encoded both types of ATPases, corroborating previous findings.

Conserved sequences of the Bacteroidia comprised a predicted cytochrome D ubiquinol oxidase subunit II or cytochrome bd (Supplementary file 20). Its gene was always preceded by a gene coding for a DUF4492-containing protein (DUF: domain of unknown function) and followed by the cytochrome D ubiquinol oxidase subunit I gene (Supplementary file 20). The cytochrome D ubiquinol oxidases I and II corresponded to the cytochrome bd subunits CydAB involved in the aerobic respiration of Bacteroides species. Like the ACIII-caa3COX of aerobic Bacteroidetes, the cytochrome bd of the Bacteroidia would accept electrons from the quinone and reduce O2 to H2O (Figure 31). This was already explained by Baughn and Malamy (2004) for B. fragilis in a letter where they proposed the term ‘nanaerobes’ to name organisms capable of this type of metabolism. The gene cluster cydAB coding for the cytochrome was also present in some other Bacteroidetes, more prevalently in Flavobacteriia but with representatives in all classes (Supplementary file 20). The high oxygen affinity of the cytochrome bd might allow the Bacteroidia to remove oxygen from the environment to help colonization. Aerobic Bacteroidetes might keep the CydAB complex to face oxidative stress. Our proposed aerobic respiratory chain in the Bacteroidia coincides with the respiratory chain proposed for Porphyromonas gingivalis by Meuric et al. (2010) when oxygen is the final electron acceptor. A recent study links the mechanism of a caa3-type cytochrome oxidase with a cytochrome c4 in Pseudomonas pseudoalcaligenes KF707 (Sandri et al., 2018). We also found a prevalent cytochrome c4 in Bacteroidetes (orthologs group 212, Pfam accession PF01012). Similar research on Bacteroidetes should confirm whether the prevalent cytochrome c4 in Bacteroidetes can interact with their caa3COX (in aerobiosis) or cytochrome bd (in nanaerobiosis) and under which conditions.

Figure 31: Aerobic respiratory chains in Bacteroidetes. (A) Brief representation of gene clusters in 15 genomes that represent the variety of compositions described in this study. The nucleotide positions on both gene ends indicate locations in the respective genomes. Accession numbers precede genome names. The legend summarizes protein names and their color code. (B) Proposed aerobic respiratory chain in halophilic aerobic Bacteroidetes. (C) Proposed aerobic respiratory chain in mesophilic aerobic Bacteroidetes. (D and E) Proposed aerobic respiratory chain in nanaerobe Bacteroidia: (D) in all except the Porphyromonadaceae, (E) in Porphyromonadaceae. ‘Q’ stands for quinone. Structures of the complexes are based on representations found in the literature except for the cytochrome bd that is represented as a symmetric dimer for convenience to represent the subunits CydAB.

An earlier survey of respiratory proteins in bacterial genomes by Marreiros et al. (2016) enumerated the frequency of different respiratory proteins across the domain Bacteria. They reported a substantial prevalence of the ACIII complex in Bacteroidetes, present in 69% of the 108 genomes explored. Still, 30% of the genomes had no equivalent proteins, and we believe they account for the anaerobic Bacteroidia. In the same study, Bacteroidetes were found to have two main types of terminal electron acceptor reductases: 57% of the genomes had a cytochrome bd, and 72% had an HCO family reductase (that reportedly included the cytochrome oxidases caa3 or ba3). The number of caa3 (HCO) cytochrome oxidases fits with our model, whereas the number of cytochromes bd should be around 30%; its higher frequency is explained by the presence of orthologous cydAB sequences outside the Bacteroidia (Supplementary file 20).

5.3. Adaptive radiation explained by the aerobic respiratory chain.

Based on the evidence obtained in this work, it is likely that the Bacteroidetes emerged as a new lineage of organic matter decomposers. Their success in mining complex carbon compounds may have led them to pursue carbon sources beyond their original habitat, adjusting their respiratory chain to new redox potentials that depend on substrates and environmental conditions. Their initial advantage probably caused an outburst of new bacteroidetal forms, of which presumably five lineages have survived until the present.

Figure 32: Transmission route of ancestral genes in Bacteroidetes and later transformations in the aerobic respiratory chain. Circles: classes Cytophagia (Cyt), Chitinophagia (Cht), Sphingobacteriia (Sph), Bacteroidia (Btd), and Flavobacteriia (Flv). Grey filled geometries correspond to events transcending to modern genomes as explained in dotted lines. Dashes represent lateral gene transfer events. The tree topology is the radial representation of the consensual SSU ribosomal phylogeny in this work.

The phylogenetic distribution of respiratory loci variants suggests transforming events that were vertically transmitted in the phylum. When superposed on the consensual phylogeny of the phylum (Figure 32), they divide into two groups. The birth of the phylum, endowed with the T9SS and the ACIII-caa3COX supercomplex, was soon followed by the transformation of rnf genes into the NQR complex. In contrast to the ancestral form still kept by the Cytophagia, a lineage with a full NQR appeared that might be extinct but caused the lateral spread of the nqr genes to contemporary lineages up to the present. That the acquisition of nqr genes caused nuo genes to degenerate, and never vice versa, is clear in our database (Supplementary file 20). Only Haliscomenobacter hydrossis DSM 1100T encoded full copies of both complexes, suggesting a recent acquisition of the nqr genes. The lack of evidence of a full NQR complex laterally transferred to the Cytophagia is striking. The only protein shared by all Bacteroidetes except the Cytophagia (Figure 22) was a putative phosphoserine aminotransferase (Pfam PF00266), occasionally also annotated as a major facilitator superfamily (MFS) transporter involved in the response to chemiosmotic gradients (InterPro IPR036259). This protein alone was insufficient to interpret why the Cytophagia never acquired the complete NQR complex.

The second group of events comprised the substitution of the ACIII-caa3COX supercomplex by a cytochrome bd and the loss of the complex I N-module in the Bacteroidia, which we can link to their nanaerophily, and the reorganization of the ACIII-caa3COX supercomplex genes in the Flavobacteriia, which theoretically could be related to a novel modulation of gene expression. A gap between these two groups of events does not necessarily mean the absence of intermediate variants of the respiratory chain, but their extinction.

Discussion

Redistribution of large phylogenetic clades.

The updated phylogeny of the Bacteroidetes allowed an updated and accurate review of the phylum’s taxonomy in the light of different genealogical markers. The data selection of ribosomal sequences was exhaustive, gathered the best reference sequences and improved the selection of 16S rDNA sequences compiled in the LTP_s119_SSU database. The LSU phylogeny could only assess the affiliation of, mainly, species whose genomes were sequenced. Nevertheless, its coincidence with the MLSA phylogenetic approach lets us predict that the resolving power of the LSU sequence alone is adequate for bacteroidetal systematics. The MLSA of 29 concatenated sequences for depth included genetic markers in disagreement with ribosomal phylogenies. In perspective, they might correspond to laterally transferred genes, like those we identified in the core-genome sdh genes (not included in the MLSA) further studied in the genomic comparison of the Bacteroidetes. However, the MLSA based phylogeny could reproduce a phylogeny congruent with ribosomal markers. Only four sequences agreed with ribosomal phylogenies and belonged to conserved gene sets evaluated with more genomes: alaS, ileS, infB, and polA. This suggests that LGT events in the phylogenetic path of the Bacteroidetes were frequent and dispersed in time. Thus, a lineage-linked cooperation in the adaptation to new environmental pressures (like salinity) seems likely.

Overall, the metabolic gene sequences did not reconstruct phylogenies like those inferred from replicative machinery genes, but they exhibited sufficient resolution at the family rank. By contrast, the genes involved in cell wall synthesis, signal trafficking and the cell cycle produced topologies like the ones obtained from transcription, translation and ribosomal genes. To our understanding, this suggests that metabolic sequences are LGT prone, and is consistent with the high dispersion and variability of PULs in the Bacteroidetes.

The largest phylogenies in this study (based on SSU, SSU+LSU, LSU sequences and MLSA) agreed in discerning six large lineages. From their root, they diverged into the phylum Rhodothermaeota and the bacteroidetal classes Cytophagia, Chitinophagia, Sphingobacteriia, Bacteroidia and Flavobacteriia. Apart from the Rhodothermaeota and Flavobacteriia, the other lineages varied their position across phylogenies due to their short branch insertions into backbone phylogenetic structures. This sustained the description of the new class Chitinophagia and the description of the phylum Bacteroidetes as a group of radially divergent lineages. Other evidence supporting the description of the new phylum Rhodothermaeota includes its minimum sequence identity against bacteroidetal sequences, which lies below the 75% phylum threshold (Yarza et al., 2014), characteristic indels in two essential proteins (AtpD and AlaS) and its signature nucleotide in position 1,225 of the 16S rDNA sequence. Precedent observations by various authors were also useful in the delineation of the new phylum (Antón et al., 2002; Ludwig et al., 2010; Park et al., 2014; Soria-Carrasco et al., 2007).
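The threshold argument above amounts to comparing the lowest pairwise 16S rRNA identity between two groups with the 75% value cited from Yarza et al. (2014). The Python sketch below illustrates that comparison under simple assumptions: identity is counted over aligned columns without gaps, the function names are hypothetical, and the sequences and identity values are invented.

```python
PHYLUM_THRESHOLD = 75.0  # percent 16S rRNA identity cited in the text (Yarza et al., 2014)


def percent_identity(seq_a, seq_b):
    """Percent of identical positions over aligned columns where neither has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(x, y) for x, y in zip(seq_a.upper(), seq_b.upper())
               if x != "-" and y != "-"]
    if not columns:
        raise ValueError("no comparable columns")
    return 100.0 * sum(x == y for x, y in columns) / len(columns)


def below_phylum_threshold(identities, threshold=PHYLUM_THRESHOLD):
    """True when the minimum pairwise identity falls below the threshold."""
    return min(identities) < threshold


# Toy values (invented, not measurements from this study).
example_identity = percent_identity("ACGUACGUAC", "ACGUACGAAC")  # 9 of 10 columns match
candidate = below_phylum_threshold([73.8, 74.5, 76.1])
```

With these invented values the minimum identity (73.8%) is below 75%, so the pair of groups would be a candidate for separation at the phylum rank, to be weighed together with the other lines of evidence listed above.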

Rhodothermaeotal species are frequently extremophiles: barophiles or halophiles. Initially, we expected to find mechanisms of halophily exclusive to the Bacteroidetes, including extreme halophiles like the genera Salinibacter (Antón et al., 2002) or Fodinibius (Wang et al., 2012). The classification of these extremophiles in a new phylum seemed contrary to our purpose, but in the light of true bacteroidetal species being enriched in saline environments (Gomariz et al., 2015; Kalwasińska et al., 2018) we remained firm in our goals. It turned out that bacteroidetal halophiles shared an exclusive halophilic strategy not found in the Rhodothermaeota, which further supports its classification as a different phylum.

Rhodothermaeota lineages were also reclassified as follows. Based on 16S rRNA and 23S rRNA gene topologies, the rank thresholds inside Rhodothermaeota, and the deletion of position 993 in the 16S rRNA gene alignment, we proposed the Balneola group to be classified as class Balneolia, order Balneolales, family Balneolaceae. One month before effective publication, the name Balneolaceae was published by Xia et al. (2016), and one month after our publication, Hahnke et al. (2016) classified the Balneolia as the new phylum Balneolaeota. Our AtpD based phylogeny agreed with the classification of the Balneolaeota, but its divergence from the Rhodothermales was very small. Considering that they were monophyletic in the phylogeny of the concatenated SSU and LSU ribosomal sequences, that their divergence in the SSU phylogeny was not conclusive, and that they shared an identical insertion in the AtpD sequence, we were prudent to classify them into one phylum.

At the genus level, we decided to reclassify Salinibacter iranicus and S. luteus (Makhdoumi-Kakhki et al. 2012) as the new genus Salinivenus after evaluating their 16S rRNA similarity with Salinibacter species and identifying a nucleotide signature on helix 23 of the SSU sequence, together with ANI and AAI values provided by Viver et al. (2018).

New bacteroidetal taxa.

Although genome-based taxonomies are overtaking classic genetic markers (Parks et al., 2018; Zhu et al., 2019), as of 2020 the phylogeny of the SSU sequence in Bacteroidetes remains the most representative of species diversity, since more genomes need to be sequenced. Hence, the findings in this study concerning the delineation of lower taxa at the family and genus ranks remain solid, except for the designation of the family Odoribacteraceae, reclassified in 2016 as a synonym of Marinifilaceae (Iino et al., 2014; Ormerod et al., 2016) but not validated as a synonym until January 2020. We proposed the new family Odoribacteraceae in 2016 by means of its distinctiveness in SSU, LSU, MLSA, and AtpD phylogenies, and in agreement with previous remarks by Sakamoto (2014).

The deep branching order of the SSU phylogeny permitted the proposal of the new class Chitinophagia with its new order Chitinophagales to allocate the families Chitinophagaceae and Saprospiraceae, which are unrelated to Sphingobacteriaceae. The circumscription of this new class was sustained by the MLSA and LSU phylogenies. Some evidence of the existence of this new clade was previously presented in McIlroy and Nielsen (2014). Consequently, the class Sphingobacteriia was limited to the family Sphingobacteriaceae only. The 23S rRNA and MLSA phylogenies suggested that the class Sphingobacteriia was likely related to the class Cytophagia, but the 16S rRNA gene phylogeny did not support a reclassification of this branch. The combined SSU-LSU topology suggested that the five classes of the Bacteroidetes root so close that they are likely to have emerged by radial adaptation rather than consecutive divergence. The same effect was observed at the family rank in some areas of the bacteroidetal phylogeny, most significantly in the order Cytophagales with the SSU based multifurcation of seven families supported by auxiliary topologies. The new family Hymenobacteraceae was proposed to host species that were conflictive in the circumscription of the family Cytophagaceae as addressed by McBride et al. (2014). The new Thermonemataceae contained only two species and one genus, and was only represented in the SSU gene phylogeny because their genomes had not been sequenced yet. Its 16S rRNA gene affiliation was highly variable among the SSU gene backbone trees but always emerged as a very long independent branch. The new family Persicobacteraceae was also not represented in other topologies besides the SSU gene tree. It recruited former members of the Flammeovirgaceae which affiliated close to Catalimonadaceae or Flammeovirgaceae on backbone topologies but stood out monophyletically in the consensual SSU phylogeny. Another multifurcation also occurred in the Bacteroidales among the Marinilabiliaceae, Prolixibacteraceae and Marinifilaceae.

Inside the class Flavobacteriia we proposed the new family Crocinitomicaceae fam. nov. on the basis of its 16S rRNA gene affiliation. This cluster, distinct from the family Cryomorphaceae, was also supported by LSU- and MLSA-based tree reconstructions, although only one sequence represented each of the families Cryomorphaceae and Crocinitomicaceae. The proposal of this family agrees with the observations of Bowman in his review of the family Cryomorphaceae (Bowman, 2014). The family Blattabacteriaceae was not treated in this phylogeny because no type material is available. The family Schleiferiaceae had a single species (Schleiferia thermophila), whose best 16S rRNA sequence is of poor quality for phylogenetic inference. Its affiliation was determined only by parsimony, and although it affiliated within the family Cryomorphaceae, a reclassification could not be proposed.

Our proposal of new taxa in Bacteroidetes was conservative in the sense that it only amended misleading taxonomic classifications that did not match the current phylogeny, although, as shown in chapter 2, taxonomic thresholds predicted 4 phyla, 7 classes, 20 orders and 59 families of Bacteroidetes. Our taxonomy, together with the one published soon after by Hahnke et al. (2016), achieved the disambiguation of “unclassified” clades and set the mark for the future reclassifications recapitulated in the current taxonomic route of the phylum (previous point). Thirteen of the new taxa proposed in our research retain standing in nomenclature.

The difference between our classification and Hahnke’s (Hahnke et al., 2016) lay in the methodological approach: while Hahnke’s was a forward-looking genomic analysis, our classification was a review of classic methods with a refined treatment. The agreement of both studies on the major reclassifications proved that, although genomic analyses are potentially more informative, classic sequence analyses should not be underestimated.

In this respect, our effort in updating the analyses of previous marker sequences produced significant findings. The novel description of nucleotide signatures in 16S rRNA sequences for the taxa Bacteroidetes (position 975), Rhodothermaeota (pos. 370, 391, and 1225), Balneolaeota (pos. 993), and Salinibacteraceae (pos. 577, 764, 957, 1050, 1109, and 1208) could help in the synthesis of new probes to detect them in environmental samples. In addition, the description of new indels in the sequences of the proteins AlaS and AtpD supported the circumscription of the Rhodothermaeota, thereby providing biological proof beyond mathematical models.
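As an illustration of how such signatures could be screened in practice, the following minimal Python sketch checks whether an aligned 16S rRNA sequence carries a given set of signature nucleotides. It is not the procedure used in this thesis. The positions are those reported above for Rhodothermaeota; the bases in `PLACEHOLDER_SIGNATURE`, the function names and the toy sequence are hypothetical, and the sketch assumes the sequence has already been aligned so that string columns coincide with the reference numbering.

```python
from typing import Dict

# Positions taken from the text (Rhodothermaeota); the bases are PLACEHOLDERS,
# not the real signature nucleotides, which are not given in this section.
PLACEHOLDER_SIGNATURE: Dict[int, str] = {370: "G", 391: "C", 1225: "U"}


def matches_signature(aligned_seq: str, signature: Dict[int, str]) -> bool:
    """Return True if every signature position carries the expected base.

    Positions are 1-based columns of an alignment that follows the
    reference numbering (an assumption of this sketch).
    """
    for pos, base in signature.items():
        if pos < 1 or pos > len(aligned_seq):
            return False  # sequence too short to cover this position
        if aligned_seq[pos - 1].upper() != base.upper():
            return False
    return True


def make_toy_sequence(length: int, overrides: Dict[int, str]) -> str:
    """Build an artificial sequence of 'A's with chosen bases at 1-based positions."""
    seq = ["A"] * length
    for pos, base in overrides.items():
        seq[pos - 1] = base
    return "".join(seq)


if __name__ == "__main__":
    carrier = make_toy_sequence(1500, PLACEHOLDER_SIGNATURE)
    non_carrier = make_toy_sequence(1500, {})
    print(matches_signature(carrier, PLACEHOLDER_SIGNATURE))
    print(matches_signature(non_carrier, PLACEHOLDER_SIGNATURE))
```

A probe design workflow would additionally have to consider the sequence context around each position, which this sketch deliberately ignores.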

Below the rank of family, most of our observations about taxonomic classifications that conflict with the updated phylogeny remain unaddressed by the taxonomic community. In the genomic classification by Hahnke et al. (2016), the independent Flexibacter cluster containing the species F. litoralis, F. roseolus, and F. polymorphus was proposed as a lineage of three different genera, named as a homage to important taxonomists: Bernardetia litoralis, Hugenholtzia roseola, and Garritya polymorpha, all validated in 2017. Also, Pseudosphingobacterium domesticum was renamed as Olivibacter domesticus in 2018, as we suggested. We also noticed that the updated LPSN database classifies Flexibacter aurantiacus as a synonym of Flavobacterium johnsoniae, resolving the conflict that we wrongly highlighted, possibly due to a misinterpretation of the nomenclature changes recorded in the former LPSN format. Hence, further reclassifications of species in the genera Algibacter, Bizionia, Cytophaga, Flaviramulus, Flavobacterium, Flexibacter, Gaetbulibacter, Hallella, Muricauda, Odoribacter, Parabacteroides, Pedobacter, and Porphyromonas are expected according to our results and the literature (Willems and Collins, 1995; Yarza et al., 2013).

Another affiliation discussed in the publication of our revised taxonomy was that of the cytophagial genera Litoribacter (with 2 species at the time) and Rhodonellum (with one species). Out of the three reference sequences, only that of L. ruber was good enough to be included in the backbone structure of the 16S rRNA phylogeny. The parsimonious addition of the other two sequences agreed that both genera should be classified in the family Cyclobacteriaceae. In 2011 Rhodonellum was already included in the description of the candidatus family Cyclobacteriaceae (Nedashkovskaya and Ludwig, 2015) that was accepted on Validation List number 143. Therefore, Rhodonellum’s affiliation should not have been questioned. Nevertheless, Litoribacter was published in 2010 as a member of the Cytophagaceae. In 2014, Pinnaka and Tanuku (2014) already classified both genera into the Cyclobacteriaceae in the chapter of The Prokaryotes ® dedicated to that family. However, even now in 2020, different taxonomies disagree in their classification. LPSN reports Litoribacter as a member of the Cytophagaceae, NCBI reports Rhodonellum as a member of the Cytophagaceae, and BacDive reports both as Cyclobacteriaceae.

All the reclassifications proposed in this doctoral research, except Rhodothermaeota phyl. nov. and Balneolaceae fam. nov., were validly published in Validation Lists 172, 173 and 183. The name Rhodothermaeota cannot be validly published since the rank of phylum is not covered by the ICNP, but it might be in the near future. The name Balneolaceae fam. nov. was published by Xia et al. (2016) in the IJSEM one month prior to us; thus, the name was published two months later in their Notification List.

Further considerations can be made after the analysis of conserved sequences identified in the genomic comparison of the phylum. Phylogenetic reconstructions based on the conserved GldKLMN/O and ActCDE sequences reproduced the currently accepted major Bacteroidetes taxa regardless of branching order. Both topologies consistently placed the Chryseobacterium branch apart from other Flavobacteriaceae. They also agreed that F. taffensis (Crocinitomicaceae) is only distantly related to other Flavobacteriia. The Gld tree even placed Bacteroidia between F. taffensis and the rest of the Flavobacteriia. However, the position of the Bacteroidia was reproduced in only 39% of the replicates. The positions of the Chryseobacterium and Crocinitomicaceae branches, combined with the low similarities of their conserved proteins within the Flavobacteriia, make them candidates for future reclassifications. The Chryseobacterium branch could constitute a new family of the Flavobacteriales, presumably the ‘Riemerellaceae’, and the family Crocinitomicaceae could become a different order of the Flavobacteriia, presumably the ‘Crocinitomicales’. The rank of the Cryomorphaceae (O. hongkongensis) would be debatable. No change is required according to the SdhAB inferred topology, but low similarities within the Flavobacteriia conserved sequences plus the Act and Gld phylogenies support their reclassification as the order ‘Cryomorphales’.

Comparative genomics.

Our genomic comparison of 89 bacteroidetal genomes, corrected with syntenic analyses, achieved a satisfactory grouping of genes into phyletic patterns. However, the low number of sequences exclusive to Bacteroidetes and their involvement in housekeeping functions prohibit the delineation of a genetic blueprint of the phylum on this basis alone. Future functional characterization of 21 as yet unknown predicted proteins could provide further insights into the common biology and ancestry of the Bacteroidetes. For now, no common phenotype can be ascribed to all Bacteroidetes, and their definition as a phylum is hence only supported by phylogenetic reconstruction. Nevertheless, the phylum status of the Bacteroidetes has recently been questioned in a proposal to base the assignment of taxonomic ranks consistently on comparable evolutionary distances in the prokaryotic tree of life, which could lead to a unification of the FCB group into a single phylum (Parks et al., 2018). The 94 FCB genomes analyzed in this study shared a small core genome (65 proteins) resembling a core genome of organisms from different domains (Gil et al., 2004; Koonin, 2003). Based on comparable evolutionary distances, the Actinobacteria represent a phylum-level taxon (Parks et al., 2018) with a core genome of 123 genes (Ventura et al., 2007) - a size similar to the Bacteroidetes core genome of 155 genes. While a formal status of the phylum category in the ICNP is pending implementation by the ICSP (Oren et al., 2015), the classification of the Bacteroidetes as an independent phylum seems justified based on our analyses in terms of phylogenetic coherence, independently of how distant the emergence of the Bacteroidetes branch is in the prokaryotic tree of life, and thus deserves a stable nomenclatural status.

The virtual absence of genes shared between combinations of classes of Bacteroidetes supported our previous observation of their radial adaptation from an early stage, which caused different lineages to diverge relatively soon after each other. According to our analysis, it is probable that this divergence was influenced more by the adaptation of primordial genes than by the acquisition of novel genetic assets. The exception to this hypothesis is the class Bacteroidia, which lost at least nineteen genes. It makes sense that most of these lost genes coded for enzymes of the TCA cycle, which are no longer needed in a fermentative metabolism. In general, we observed the expected reduction in core genomes at the rank of class as the classes were represented by more genomes. However, the class Sphingobacteriia shared 100 genes despite being represented by only 5 genomes. As a consequence of the limited number of genomes available when our database was assembled, the average length of the genomes representing this class was reduced by 1 Mb. This also happened with the Chitinophagia, which were also represented by 5 genomes in our database but shared only 7 core genes. On reviewing their phylogenetic distances, these genomes did not belong to closely related branches either. Thus, neither abnormally long genomes nor close relatedness was responsible for such a large core genome. We did not identify particularly enriched metabolic pathways among the 100 core genes of the Sphingobacteriia; therefore, we suspect they could be synonymous to other genes in the Bacteroidetes but with a remarkable sequence divergence in this lineage.

The comparison of 89 genomes representing a nomenclatural diversity of 1,142 species forced the comparison of very distant genomes. Core gene sets of distant genomes are usually small and dominated by essential housekeeping functions (Charlebois and Doolittle, 2004; Koonin, 2003). More relaxed criteria can in addition recruit genes that are common yet not ubiquitous, with as broad a phylogenetic distribution as possible (Charlebois and Doolittle, 2004). A problem in comparing distant genomes is posed by groups of distant but still orthologous genes. A high identity threshold for orthology can split such a group of distant orthologs into multiple synonymous groups (false negatives), while a low threshold can pick up spurious matches (false positives). In this study, we applied a high identity threshold, and to detect false negative groupings with taxonomic coherence we also searched for conserved sequences and their syntenies at lower taxonomic ranks. In this way we recovered sequences of the respiratory chain and gliding machinery, which are orthologous and widely distributed in the phylum.
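The difference between a strict core gene set and the more relaxed criterion mentioned above can be sketched in a few lines of Python. This is only an illustrative sketch, not the pipeline used in this study: the function name `conserved_groups`, the parameter `min_fraction`, and the genome and ortholog group labels in the toy data are all hypothetical.

```python
from typing import Dict, Set


def conserved_groups(presence: Dict[str, Set[str]], min_fraction: float = 1.0) -> Set[str]:
    """Return the ortholog groups present in at least `min_fraction` of genomes.

    `presence` maps a genome name to the set of ortholog groups it encodes.
    min_fraction = 1.0 gives the strict core; lower values also recruit
    groups that are common yet not ubiquitous.
    """
    if not presence:
        return set()
    n_genomes = len(presence)
    counts: Dict[str, int] = {}
    for groups in presence.values():
        for og in groups:
            counts[og] = counts.get(og, 0) + 1
    return {og for og, c in counts.items() if c / n_genomes >= min_fraction}


if __name__ == "__main__":
    # Toy data with hypothetical labels: one ubiquitous group, one found in
    # three of four genomes, and one found in a single genome.
    toy = {
        "genome_1": {"OG_sdhA", "OG_gldK", "OG_rare"},
        "genome_2": {"OG_sdhA", "OG_gldK"},
        "genome_3": {"OG_sdhA", "OG_gldK"},
        "genome_4": {"OG_sdhA"},
    }
    print(sorted(conserved_groups(toy)))                     # strict core
    print(sorted(conserved_groups(toy, min_fraction=0.75)))  # relaxed criterion
```

Note that the sketch takes the ortholog groups as given; the false negatives and false positives discussed above arise earlier, when sequences are assigned to groups with a given identity threshold.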

Annotation of conserved loci was a delicate task since many publicly available annotations for bacteroidetal genes are unreviewed and are not supported by the careful annotation of a closely related reference genome. We focused on the description of conserved loci with relevance to the phylum and with literature to back up our interpretations. This premise made us discard further exploration of an interesting locus in the Flavobacteriia coding for a sodium-proton antiporter next to a nitrogen-fixing protein. Not enough evidence was found in the literature to interpret it as a functional nitrogen-fixing mechanism of the taxon.

Gliding machinery and Type IX secretion system

Gliding motility has long been thought to be exclusive to the Bacteroidetes. Gliding genes are widespread throughout the Bacteroidetes, but the phenotype is not always expressed (McBride and Zhu, 2013). The exact mechanism of gliding is still unknown, but two sets of proteins seem to be essential: the membrane Gld subunits B, D, H and J, which might be effectors of movement, and a type IX secretion system (T9SS) formed by GldKLMN plus SprA, SprE, and SprT (McBride and Zhu, 2013). Gliding in F. johnsoniae also requires an ABC-type transporter formed by Gld subunits A, F, and G, but this transporter is absent from other gliding bacteria (McBride and Zhu, 2013). GldC is not essential to gliding, and is known not to be produced by all gliding cells (Hunnicutt and McBride, 2000). The T9SS (also referred to as PorSS or PerioGate) is the most recently discovered secretion system and has so far been found only in Bacteroidetes (Sato et al., 2005). The genes of the T9SS are called porKLMN, sov, porW and porT, and are synonyms of gldKLMN, sprA, sprE, and sprT. In gliding motility, the T9SS secretes the proteins SprB and RemA, which are believed to be adhesins, but the T9SS also secretes hydrolytic enzymes like chitinase and cellulase (Lasica et al., 2017; McBride and Zhu, 2013). Thus, the T9SS is also involved in the acquisition of external complex carbon sources. We found that the cluster gldKLMN/O (gldO is thought to be a paralog of gldN) is encoded in 90% of the analyzed genomes. Except for Niabella soli, the genomes that lacked the T9SS belonged to pathogenic or symbiotic species that likely feed on small, low molecular weight organic molecules rather than complex macromolecules. By virtue of their conservation, phylogenetic signal and uniqueness to the phylum, the T9SS gldKLMN genes represent suitable genomic markers for Bacteroidetes, although they are lacking in some pathogenic or symbiotic species.
The T9SS is the anchor of the known gliding machinery, but while other gld genes seem accessory and are furthermore distributed in different loci, hiding a complex phylogenetic pattern if any, the T9SS genes remain well conserved regardless of the gliding capacity of the cell. Since the T9SS also translocates enzymes, such as chitinase and cellulase, in a two-step process across both membranes, it connects the two most notable phenotypes of the phylum: gliding and the degradation of complex organic matter. This suggests that the T9SS was part of the ancestral bacteroidete and might, to some extent, be responsible for the biological success of the phylum.
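Because the same T9SS genes appear in annotations under either the por/sov or the gld/spr names, screening genomes for the gldKLMN genes first requires a normalization of the gene names. The following Python sketch shows one way this could be done, using the synonymies stated above. It is an illustrative sketch rather than the method of this study: the function names and the toy genomes are hypothetical, and it only tests for the presence of the four genes, not for their physical clustering or for gldO.

```python
from typing import Dict, Iterable

# Synonymies as given in the text: porKLMN = gldKLMN, sov = sprA,
# porW = sprE, porT = sprT.
POR_TO_GLD: Dict[str, str] = {
    "porK": "gldK", "porL": "gldL", "porM": "gldM", "porN": "gldN",
    "sov": "sprA", "porW": "sprE", "porT": "sprT",
}
CORE_T9SS = {"gldK", "gldL", "gldM", "gldN"}


def normalize(gene: str) -> str:
    """Map a por/sov gene name to its gld/spr synonym; leave other names unchanged."""
    return POR_TO_GLD.get(gene, gene)


def has_core_t9ss(genes: Iterable[str]) -> bool:
    """True if all four gldKLMN genes are annotated (presence only, not synteny)."""
    return CORE_T9SS <= {normalize(g) for g in genes}


def fraction_with_core_t9ss(genomes: Dict[str, Iterable[str]]) -> float:
    """Fraction of genomes that encode gldKLMN under either nomenclature."""
    if not genomes:
        return 0.0
    return sum(has_core_t9ss(g) for g in genomes.values()) / len(genomes)


if __name__ == "__main__":
    # Hypothetical annotations, for illustration only.
    toy = {
        "gliding_genome": ["gldK", "gldL", "gldM", "gldN", "sprA"],
        "por_annotated_genome": ["porK", "porL", "porM", "porN", "sov"],
        "symbiont_genome": ["sdhA", "sdhB"],
    }
    print(fraction_with_core_t9ss(toy))
```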

Aerobic respiratory chains in Bacteroidetes and their link to osmotic regulation in halophily.

Complex II SQR is the only respiratory complex encoded in all analyzed Bacteroidetes. A phylogenetic reconstruction based on SQR SdhAB was not able to delineate the classes Cytophagia, Chitinophagia, and Sphingobacteriia as monophyletic. Substitution rates were low and of insufficient resolution considering the length of the protein sequences. This also caused many branches to have low bootstrap support. In this context, the affiliation of C. limicola with the Bacteroidia in the Sdh tree could be an artifact, even though an HGT between them would explain the similarity of their Sdh, Rnf, and Nqr protein sequences as xenologues.

The alternative complex III (ACIII) was first described in Rhodothermus marinus, and although it is found in many bacteria, it is prevalent in Bacteroidetes (Marreiros et al., 2016). Association of ACIII with a caa3-type cytochrome oxidase in a super-complex has so far only been described in F. johnsoniae (Sun et al., 2018). We found that a corresponding super-complex operon is conserved in all aerobic Bacteroidetes and is thus not constrained to the genus Flavobacterium. The super-complex accepts electrons from the quinone pool and funnels them to oxygen as terminal acceptor while translocating protons across the cytoplasmic membrane, thereby contributing to the proton-motive force. The rate of electron transport of the canonical respiratory chain is limited by the capacity of the electron-shuttling soluble cytochrome c to connect the terminal cytochrome oxidase with other redox complexes. The ACIII-caa3COX super-complex solves this problem by controlling the flux of electrons, which are transmitted directly from complex III to the c-type heme electron carrier fused to the caa3COX (Sun et al., 2018). The high efficiency of the ACIII-caa3COX might be advantageous for competing with other bacteria for the most efficient utilization of intermittently available energy-rich complex organic matter.

The gateways into the respiratory chain are complexes I and II. While a complex II type-B SQR is present in all Bacteroidetes, two types of complex I divide the phylum. Strains isolated from freshwater environments featured the canonical NADH:quinone oxidoreductase with 14 subunits (NuoABCDEFGHIJKLMN) coded in the same nuo gene cluster. The cluster was incomplete in some genomes, specifically in Cytophagia and Bacteroidia, but the incomplete cluster of the Bacteroidia resembled the ancestral complex (Moparthi and Hägerhäll, 2011) as it lacked nuoEFG (the so-called N-module with dehydrogenase activity) and nuoC (part of the Q-module with electron transfer activity). The mechanism of the ancestral complex I is unknown, but it probably interacts with various electron donor or acceptor proteins (Moparthi and Hägerhäll, 2011). In other cases, incomplete nuo clusters appeared to be paralogous to full nuo clusters within the same genome or residual upon the acquisition of the Nqr complex.

Another complex I, the Nqr complex (Na+-NQR), a sodium-pumping NADH:quinone oxidoreductase, was present in strains isolated from seawater or from animal and human gut microbiota. Nqr consists of a six-protein membrane complex that was described in marine bacteria (Unemoto and Hayashi, 1993). It translocates sodium ions across the inner membrane, thereby linking the respiratory chain to osmotic regulation (Reyes-Prieto et al., 2014). The Nqr is associated with the respiration of pathogens like Vibrio, Klebsiella or Haemophilus spp. (Barquera, 2014), lineages that reportedly acquired the nqr genes horizontally during their adaptation to sodium-rich marine, alkaline or intracellular habitats (Reyes-Prieto et al., 2014). Based on phylogenomic analysis and comparison of gene cluster layout, it has been proposed that the Na+-NQR originated in the common ancestor of Bacteroidetes and Chlorobi via duplication and subsequent neofunctionalization of the rnf operon (NADH:ferredoxin dehydrogenase) (Reyes-Prieto et al., 2014). The copied rnf operon would have lost the RnfB protein involved in electron uptake from the reduced ferredoxin and later recruited an AMOr subunit (aromatic monooxygenase) to become NqrF, the electron uptake subunit of the Na+-NQR (Reyes-Prieto et al., 2014). The intermediate complex with no NqrF subunit is the ancestral Na+-NQR (Reyes-Prieto et al., 2014). Our results, however, suggest that the Nqr complex evolved solely in the Bacteroidetes. Rnf proteins prevail in the Bacteroidia, with two outliers in the Flavobacteriia and homologous sequences in Chlorobi. Reyes-Prieto et al. represented the Bacteroidetes with Bacteroidia species in their reconstruction of the Nqr dispersion (Reyes-Prieto et al., 2014); hence, the interpreted common origin of the Nqr of Bacteroidetes-Chlorobi could be the consequence of an HGT between the Bacteroidia and Chlorobi, as is suggested by the phylogeny of the SQR complex.

The nqr cluster was incomplete in the Cytophagia, lacking the nqrF subunit and thus reproducing the ancestral Na+-NQR. This suggests that the original rnf operon was transformed in the ancestor of the Bacteroidetes, and that the acquisition of the NqrF subunit happened after the divergence of the Cytophagia. NADH:quinone oxidoreductase (Nuo complex for short) must have been the original complex I, whereas the ancestral Nqr was an adaptation to salinity that remains in genomes of the genus Flexibacter and the families Catalimonadaceae and Cyclobacteriaceae. The complete Nqr complex radiated to lineages adapting to saline stress. Therefore, the radiation of the nqr genes was not phylogenetic but environmental, and Nqr became abundant in predominantly marine or gut taxa like the Saprospiria, Bacteroidia, Cryomorphaceae and a substantial part of the Flavobacteriaceae. Since full copies of the Nqr complex coexist with partial copies of the Nuo complex in most of the analyzed genomes, but never vice versa, the acquisition of the Nqr may render the Nuo complex obsolete. Only H. hydrossis, isolated from activated sludge, encodes full Nuo and Nqr complexes, possibly representing a recent Nqr acquisition. The order in which nuo genes are deleted seems random, but for some as yet unknown reason the Bacteroidia kept an ancestral-like model of the Nuo complex.
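The categories of complex I distinguished in the preceding paragraphs can be summarized as a simple decision rule over the annotated gene content of a genome. The Python sketch below is one possible formalization, offered as an assumption-laden illustration and not as the procedure followed in this study: the function name and category labels are hypothetical, the six Nqr subunits are assumed to be annotated as nqrA to nqrF, and only gene presence is evaluated, not cluster integrity or expression.

```python
from typing import Iterable

# Canonical complex I: 14 subunits, nuoA to nuoN (from the text).
NUO = {"nuo" + letter for letter in "ABCDEFGHIJKLMN"}
# Na+-NQR: six subunits, assumed here to be annotated as nqrA to nqrF.
NQR = {"nqr" + letter for letter in "ABCDEF"}


def classify_complex_i(genes: Iterable[str]) -> str:
    """Assign a genome to a complex I category from its annotated genes."""
    annotated = set(genes)
    full_nuo = NUO <= annotated
    full_nqr = NQR <= annotated
    # Ancestral Na+-NQR: all Nqr subunits except the electron uptake subunit NqrF.
    ancestral_nqr = (NQR - {"nqrF"}) <= annotated and "nqrF" not in annotated

    if full_nuo and full_nqr:
        return "full Nuo and full Nqr"  # the situation described for H. hydrossis
    if full_nqr:
        return "full Nqr"  # possibly with remnants of the nuo cluster
    if ancestral_nqr:
        return "ancestral Nqr (no nqrF)"
    if full_nuo:
        return "full Nuo"
    return "incomplete or absent"


if __name__ == "__main__":
    # Hypothetical gene contents, for illustration only.
    print(classify_complex_i(NUO))
    print(classify_complex_i(NQR | {"nuoA", "nuoB"}))
    print(classify_complex_i(NQR - {"nqrF"}))
```

Under this rule, a genome with a full Nqr and remnants of the nuo cluster is reported as "full Nqr", which mirrors the observation that partial Nuo copies accompany full Nqr copies but not the reverse.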

The respiratory chain of the Bacteroidia has been extensively described for the anaerobe Porphyromonas gingivalis (Meuric et al., 2010). Despite its anaerobic lifestyle, it can use oxygen as terminal electron acceptor using a cytochrome bd (subunits CydAB) with high oxygen affinity. Such anaerobes, or nanaerobes (that can benefit from nanomolar concentrations of O2 but cannot grow at higher concentrations) (Baughn and Malamy, 2004), use this mechanism to scavenge harmful oxygen thereby facilitating colonization. In Bacteroidia, the cydAB constituted a conserved gene pair together with a conserved DUF4492-containing protein. Some Flavobacteriia also encoded the gene pair cydAB. Explaining the role of the DUF4492-containing protein (if expressed together with cydAB) and the role of the cytochrome bd in Flavobacteriia are two future challenges.

Finally, ATPases (complex V) in the phylum were most commonly F-type ATPases with homologous subunits that served as molecular clocks. In the class Bacteroidia, however, we consistently observed a substitution of F-type ATPases by V-type ATPases, likely influenced by their adaptation to an anaerobic lifestyle.

Polysaccharide Utilization Loci and flexirubin genes.

The expected phylum-level genomic markers, the susD and flx genes, were not recruited in any of the sets of conserved sequences. The flx genes were not common enough, and susD was split into many different groups of orthologs, none of which reproduced a phylogenetic pattern. PULs in Bacteroidetes frequently feature susCD-like gene tandems, although some are functional only with susC (Terrapon et al., 2015). Therefore, SusD sequences can be used as a first rough approximation of the minimum number of PULs in a bacteroidetal genome (Terrapon et al., 2015). Only five of the Bacteroidetes genomes in our dataset did not contain any annotated susD-like sequences, and only 5% of the SusCD OGs contained paralogs. Although we found few copies of susD in some genomes, we do not possess information about their expression. However, PULs are not only frequent and diverse, but are also believed to be subject to frequent HGT among Bacteroidetes, as has been shown for a porphyran-targeting PUL transferred from marine Flavobacteriia to anaerobic gut Bacteroides (Thomas et al., 2011). SusC/D sequences furthermore carry a strong substrate-specific signal that obscures their phylogeny (Kappelmann et al., 2019). Such sequence diversity combined with a weak phylogenetic signal impedes the use of SusCD-like sequences to explain the adaptive radiation of the phylum.

Bacteroidetes do not only excel in decomposing complex organic matter, but their phylogenetic position in the prokaryotic tree of life suggests they could have been Gram-negative pioneers in this regard. Contrary to Gram-positive decomposers, the diderm Bacteroidetes feature a periplasmic space that provides a protected area for degradation, e.g., of oligosaccharides, without diffusive loss of both enzymes and degradation products. Recent studies on the Bacteroidia link their success as members of the human gut microbiota to their PULs (Foley et al., 2016; Wexler and Goodman, 2017) that are varied in CAZymes. Functionally, their SusCDs capture and translocate oligosaccharides that are further decomposed in the periplasmic space, keeping them away from competitors (Larsbrink et al., 2014; Reintjes et al., 2017). Unfortunately, we could not identify an environmental pattern explaining the influence of the SusCD lateral transfer in the adaptive radiation of the phylum, or prove they were likely to exist in the ancestral bacteroidete. Yet, they seem relevant to the essence of what constitutes a bacteroidete.

SusD proteins, which form part of the SusCD protein complex encoded in PULs (polysaccharide utilization loci), are also coded in Rhodothermaeota genomes, but we did not find literature describing their relevance in the clade. It would be interesting to explore the utilization of complex carbon sources by members of the Rhodothermaeota and gain a deeper insight into the origin and differentiation of these PULs.

Final considerations on bacteroidetal blueprint genes selected by their environment.

The Bacteroidetes conserved the energy-efficient ACIII-caa3-COX respiratory super-complex, improving their fitness to compete for high molecular weight carbon compounds. The substitution of the ACIII-caa3COX by a cytochrome bd is recorded only once in their genomes, at the origin of the Bacteroidia. The Bacteroidia broke the T9SS/ACIII-caa3COX association and yet succeeded, which hampers a circumscription of the Bacteroidetes by common traits. Still, many Bacteroidia conserve the T9SS and only pathogens or symbionts have disposed of it. Therefore, we believe the common ancestor of the phylum was a free-living aerobic decomposer, whose bioenergetic efficiency allowed a broad adaptive radiation that ultimately caused some lineages to dispense with the ancestral genes coding T9SS and/or ACIII-caa3-COX. We would like to end with some thoughts on the ancestor of the Bacteroidetes. Did it live in a marine or rather in a freshwater environment? All species that have so far been isolated from freshwater encode the canonical complex I (Nuo), whereas species isolated from salt-rich environments encode the full Nqr complex and often some remnants of the nuo gene cluster. The presence of the ancestral Nqr and incomplete Nuo complexes in the Cytophagia suggests that the initially present Nuo complex gradually degraded after the Nqr cluster was acquired. This would rather support the view that the Bacteroidetes radiated from a Gram-negative, diderm freshwater ancestor whose successors later adapted to the marine environment based on novel mechanisms.

In this work, we have instead explained the intra-phylum genomic and phenotypic diversity despite the absence of a common phenotype, an absence which remains a hallmark of Bacteroidetes notwithstanding their evolutionary origin. Hence, carbon source limitations and salinity were pivotal to the origin of the phylum and responsible for its adaptive radiation. The initial hypothesis of a selection of halophilic Bacteroidetes along salinity gradients has been refuted by the reclassification of the extreme halophiles into a different phylum, Rhodothermaeota. Conversely, it has been shown that halophilic Bacteroidetes undergo a drastic change in their respiratory chain that helps them cope with saline stress. Moreover, the genes coding for this change are not transmitted vertically but horizontally, possibly as part of a marine mobilome, and cause the deletion of other genes. Thus, we have explained how the genomes of halophilic Bacteroidetes interact with their environment. Future studies to support these new hypotheses would need to explain whether the T9SS could have been paired with an oligosaccharide uptake mechanism and to confirm that the nqr genes are part of a marine mobilome of the Bacteroidetes, from which they could also have been appropriated by known Nqr-encoding marine proteobacteria (Frost et al., 2005; Toussaint and Chandler, 2012).

Conclusions

 The catalog of bacteroidetal species should be corrected by amending the taxonomic classification of 11 species that, according to their reference 16S rRNA sequence, belong to the taxa Clostridia, Synergistetes, Firmicutes, Gammaproteobacteria, Alphaproteobacteria and Actinobacteria.

 The phylogenetic tree based on 16S rRNA bacteroidetal sequences and sequence similarities predicts that many diverse species are still to be described affiliating near the lineages Flammeovirgaceae, Catalimonadaceae and Persicobacteraceae in the Cytophagales, and the Marinifilaceae, Prolixibacteraceae and Marinilabiliaceae in the Bacteroidales.

 All the phylogenies and sequence signatures revised in this thesis suggested the circumscription of a new phylum composed of species hitherto classified as Bacteroidetes. The Rhodothermaeota phyl. nov. includes the Rhodothermia class. nov., Rhodothermales ord. nov., Rhodothermaceae, Salinibacteraceae fam. nov., Rubricoccaceae fam. nov., as well as the Balneolia class. nov., Balneolales ord. nov. and family Balneolaceae. This reclassification excluded the extreme halophilic Salinibacter from the scope of our research.

 Within the Salinibacteraceae, two species of the genus Salinibacter have been reclassified into the novel genus Salinivenus by means of genetic and genomic divergence besides signature nucleotides in the helix 23 of their 16S rRNA sequence.

 The revised phylogeny of the phylum, by means of topologies based on ribosomal sequences and Multilocus Sequence Alignment of 29 genes, led to the novel classification of the class Chitinophagia and order Chitinophagales, the family Crocinitomicaceae of the Flavobacteriales, the Odoribacteraceae in the Bacteroidales, and the novel families Hymenobacteraceae, Thermonemataceae and Persicobacteraceae of the Cytophagales. Thirteen of the sixteen new names in this thesis retain standing in nomenclature.

 Tree topologies reconstructed with the 23S rRNA sequence and with the Multilocus Sequence Alignment of 29 translated genes of Bacteroidetes were remarkably similar, indicating that the phylogenetic signal of the 23S rRNA sequence alone would be sufficient for the precise affiliation of bacteroidetal species if LSU rRNA databases were updated.

 The median sequence identity between orthologs in the class Flavobacteriia and their phylogenomics based on genes of their type-IX secretion system also predicted the classification of the new family ‘Riemerellaceae’ and the novel orders ‘Crocinitomicales’ and ‘Cryomorphales’.

 The type-IX secretion system genes, hitherto only described in the Bacteroidetes, are well conserved in homologous gene clusters in 90% of the analysed bacteroidetal genomes. The other 10% belonged to pathogenic or symbiotic species (except for Niabella soli) that can dispense with gliding motility or polysaccharide digestion.

 Living at the expense of complex carbon sources like polymers is costly, and the Bacteroidetes seem to balance the energetic cost with the efficient trafficking of electrons in their respiratory chain by joining complexes III and IV into a super-complex, the Alternative Complex III – cytochrome aa3 cytochrome oxidase. The super-complex is missing in anaerobic Bacteroidia.

 Despite the lack of super-complex, the Bacteroidia code for a cytochrome bd oxidase with high affinity to O2 allowing them to complete the aerobic respiration. According to literature, this respiration serves to detoxify nanomolar concentrations of oxygen making them nanaerobes instead of strict anaerobes.

 All Bacteroidetes share homologous genes coding the respiratory complex II or succinate dehydrogenase, an electron gateway into the respiratory chain. In the Bacteroidia, vestiges of other genes involved in the tricarboxylic acid cycle confirm their aerobic past and common ancestry with other Bacteroidetes.

 The second electron gateway to the respiratory chain is complex I, which in halophilic Bacteroidetes is substituted by a sodium-pumping NADH:quinone oxidoreductase. Vestigial genes of complex I in halophilic Bacteroidetes bear witness to this substitution.

 The ancestral form of the sodium pumping NADH:quinone oxidoreductase is still coded in cytophagial genomes, indicating that they were the inventors of this new complex, not the Bacteroidetes-Chlorobi common ancestor.

 The presence of sodium-pumping NADH:quinone oxidoreductase genes only in genomes from halophilic Bacteroidetes suggests their horizontal transmission. Their occurrence would thus be restricted to saline environments, including animal guts, where physiological salinity challenges bacterial osmoregulation.

 Gut-associated Bacteroidales encode both complex I and Na+-NQR. Only the Porphyromonadaceae encode Na+-NQR exclusively. Complex I of the Bacteroidales lacks the NADH dehydrogenase module and thus resembles its ancestral form, which accepted electrons from varied electron donors. Nevertheless, we do not know of any proof of its expression or mechanism.
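The genotypes described in the preceding conclusions (complex I only, Na+-NQR only, both, or Na+-NQR with vestigial complex I genes) can be told apart with a simple presence rule. The sketch below is an assumption-laden illustration: the 0.8 completeness threshold, the category labels and the use of the nuoA to nuoN and nqrA to nqrF gene names as sole evidence are choices made for this example, not criteria taken from this thesis.

```python
# Gene sets for complex I (nuoA-nuoN) and Na+-NQR (nqrA-nqrF).
NUO = {f"nuo{c}" for c in "ABCDEFGHIJKLMN"}
NQR = {f"nqr{c}" for c in "ABCDEF"}


def nadh_gateway_genotype(genes, min_fraction=0.8):
    """Classify a genome by the NADH:quinone oxidoreductase genes it carries.

    min_fraction is a hypothetical completeness cut-off: a complex counts
    as present when at least that share of its genes is annotated.
    """
    genes = set(genes)
    nuo_fraction = len(genes & NUO) / len(NUO)
    nqr_fraction = len(genes & NQR) / len(NQR)
    has_nuo = nuo_fraction >= min_fraction
    has_nqr = nqr_fraction >= min_fraction
    if has_nuo and has_nqr:
        return "both"
    if has_nqr:
        # Leftover nuo genes below the cut-off are treated as vestigial.
        return "Na+-NQR with vestigial nuo" if nuo_fraction > 0 else "Na+-NQR only"
    if has_nuo:
        return "complex I only"
    return "neither"


print(nadh_gateway_genotype(NQR | {"nuoA", "nuoB"}))  # Na+-NQR with vestigial nuo
print(nadh_gateway_genotype(NUO | NQR))               # both
```

A rule based on annotation names alone cannot detect truncated modules, such as a complex I lacking its NADH dehydrogenase module; that would require inspecting the individual subunits.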

 Horizontal transfer of the Na+-NQR complex genes has crossed the phylum boundary, and they are found in other Gram-negative marine bacteria. They were also transferred from the Bacteroidia to the phylum Chlorobi.

 Complex V, or F-type ATP synthase, is conserved in many Bacteroidetes, and the gene of its beta subunit contains signature insertions for the phyla Rhodothermaeota and Bacteroidetes. However, some species of the Bacteroidia already encode a V-type ATP synthase (which performs best under anaerobic conditions) instead of an F-type ATP synthase.

 Although none of the common genes found in this research can be postulated as the genetic blueprint of the phylum Bacteroidetes, the phylogenomics, taxonomic distribution and synteny of respiratory genes explain their adaptive radiation into different environments, where oxygen and salinity selected certain genotypes.

 Tracing back the dispersion of conserved genes of the bacteroidetal respiratory chain and type IX secretion system, we postulate that the ancestral bacteroidete was a heterotroph feeding on complex carbon compounds, most likely from fresh water rather than from marine environments.

 We produced a list of 21 uncharacterized genes that are candidates to explain the essence of the Bacteroidetes in the future.

 We can conclude that the Bacteroidetes likely interact with their environment as important organic matter decomposers, and that halophilic Bacteroidetes adapt to osmotic pressure by the lateral acquisition of nqr genes, which causes the sequential loss of the complex I nuo genes.

References

Addou, S., Rentzsch, R., Lee, D., Orengo, C.A., 2009. Domain-Based and Family-Specific Sequence Identity Thresholds Increase the Levels of Reliable Protein Function Transfer. J. Mol. Biol. 387, 416–430. https://doi.org/10.1016/j.jmb.2008.12.045

Alegado, R.A., Grabenstatter, J.D., Zuzow, R., Morris, A., Huang, S.Y., Summons, R.E., King, N., 2013. Algoriphagus machipongonensis sp. nov., co-isolated with a colonial choanoflagellate. Int. J. Syst. Evol. Microbiol. 63, 163–168. https://doi.org/10.1099/ijs.0.038646-0

Altenhoff, A.M., Boeckmann, B., Capella-Gutierrez, S., Dalquen, D.A., DeLuca, T., Forslund, K., Huerta-Cepas, J., Linard, B., Pereira, C., Pryszcz, L.P., Schreiber, F., Da Silva, A.S., Szklarczyk, D., Train, C.M., Bork, P., Lecompte, O., Von Mering, C., Xenarios, I., Sjölander, K., Jensen, L.J., Martin, M.J., Muffato, M., Gabaldón, T., Lewis, S.E., Thomas, P.D., Sonnhammer, E., Dessimoz, C., 2016. Standardized benchmarking in the quest for orthologs. Nat. Methods 13, 425–430. https://doi.org/10.1038/nmeth.3830

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J., 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403–410. https://doi.org/10.1016/S0022-2836(05)80360-2

Amann, R., Ludwig, W., Schleifer, K.H., 1988. Beta-subunit of ATP-synthase: a useful marker for studying the phylogenetic relationship of eubacteria. J. Gen. Microbiol. 134, 2815–2821. https://doi.org/10.1099/00221287-134-10-2815

Antón, J., Oren, A., Benlloch, S., Rodríguez-Valera, F., Amann, R., Rosselló-Mora, R., 2002. Salinibacter ruber gen. nov., sp. nov., a novel, extremely halophilic member of the Bacteria from saltern crystallizer ponds. Int. J. Syst. Evol. Microbiol. 52, 485–91. https://doi.org/10.1099/00207713-52-2-485

Arahal, D.R., 2014. Whole-genome analyses: Average nucleotide identity, 1st ed, Methods in Microbiology. Elsevier Ltd. https://doi.org/10.1016/bs.mim.2014.07.002

Barquera, B., 2014. The sodium pumping NADH:quinone oxidoreductase (Na+-NQR), a unique redox-driven ion pump. J. Bioenerg. Biomembr. 46, 289–298. https://doi.org/10.1007/s10863-014-9565-9

Baughn, A.D., Malamy, M.H., 2004. The strict anaerobe Bacteroides fragilis grows in and benefits from nanomolar concentrations of oxygen. Nature 427, 441–444. https://doi.org/10.1038/nature02285

Beger H, B.G., 1953. Die Scheidenstruktur des Abwasserbakteriums Sphaerotilus und des Eisenbakteriums Leptothrix im elektronenmikroskopischen Bilde und ihre Bedeutung fur die Systematik dieser Gattungen. Zentralblatt fur Bakteriol. Parasitenkunde, Infekt. und Hyg. II, 318–334.

Bernardet, J.F., Bowman, J.P., 2011. Genus I. Flavobacterium Bergey et al. 1923. In Bergey’s Manual of Systematic Bacteriology, 2nd ed., vol. 4, pp. 112–154. The Williams & Wilkins Co., Baltimore.

Bowman, J.P., 2014. The Family Cryomorphaceae, in: Rosenberg, E., Stackebrandt, E., Thompson, F., Lory, S., DeLong, E.F. (Eds.), The Prokaryotes. Springer-Verlag Berlin Heidelberg, pp. 539–550. https://doi.org/10.1007/978-3-642-30194-0

Brosius, J., Palmer, M.L., Kennedy, P.J., Noller, H.F., 1978. Complete nucleotide sequence of a 16S ribosomal RNA gene from Escherichia coli. Proc. Natl. Acad. Sci. U. S. A. 75, 4801–4805. https://doi.org/10.1073/pnas.75.10.4801

Charlebois, R.L., Doolittle, W.F., 2004. Computing prokaryotic gene ubiquity: rescuing the core from extinction. Genome Res. 14, 2469–77. https://doi.org/10.1101/gr.3024704

Chen, X., Zhang, J., 2012. The Ortholog Conjecture Is Untestable by the Current Gene Ontology but Is Supported by RNA Sequencing Data. PLoS Comput. Biol. 8. https://doi.org/10.1371/journal.pcbi.1002784

Consortium, T.U., 2016. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 45, D158–D169. https://doi.org/10.1093/nar/gkw1099

Das, S., Dash, H.R., Mangwani, N., Chakraborty, J., Kumari, S., 2014. Understanding molecular identification and polyphasic taxonomic approaches for genetic relatedness and phylogenetic relationships of microorganisms. J. Microbiol. Methods 103, 80–100. https://doi.org/10.1016/j.mimet.2014.05.013

De Vos, P., 2011. Multilocus sequence determination and analysis, in: Rainey, F., Oren, A. (Eds.), Methods in Microbiology, Taxonomy of Prokaryotes, vol. 38, first ed., Elsevier Ltd., pp. 385–407. https://doi.org/10.1016/B978-0-12-387730-7.00017-6

Dietrich G, Weiss N, Fiedler F, W.J., 1988. Acetofilamentum rigidum gen. nov., sp. nov., a strictly anaerobic bacterium from sewage sludge. Syst. Appl. Microbiol. 10 (3), 273–278.

Dietrich G, Weiss N, W.J., 1988. Acetothermus paucivorans, gen. nov., sp. nov., a strictly anaerobic, thermophilic bacterium from sewage sludge, fermenting hexoses to acetate, CO2 and H2. Syst. Appl. Microbiol. 10 (2), 174–179.

Edgar, R.C., 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32 (5), 1792–1797.

Edgar, R.C., 2010. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461. https://doi.org/10.1093/bioinformatics/btq461

Eilers, H., Pernthaler, J., Peplies, J., Glockner, F.O., Gerdts, G., Amann, R., 2001. Isolation of Novel Pelagic Bacteria from the German Bight and Their Seasonal Contributions to Surface Picoplankton. Appl. Environ. Microbiol. 67, 5134–5142. https://doi.org/10.1128/AEM.67.11.5134-5142.2001

Foley, M.H., Cockburn, D.W., Koropatkin, N.M., 2016. The Sus operon: a model system for starch uptake by the human gut Bacteroidetes. Cell. Mol. Life Sci. 73, 2603–17. https://doi.org/10.1007/s00018-016-2242-x

Frost, L.S., Leplae, R., Summers, A.O., Toussaint, A., 2005. Mobile genetic elements: The agents of open source evolution. Nat. Rev. Microbiol. 3, 722–732. https://doi.org/10.1038/nrmicro1235

Garrity, G.M., Bell, J.A., Lilburn, T.G., Lansing, E., 2004. Taxonomic Outline of the Prokaryotes, in: Bergey’s Manual of Systematic Bacteriology. Springer New York, pp. 1–399. https://doi.org/10.1007/bergeysoutline200405

Gil, R., Silva, F.J., Peretó, J., Moya, A., 2004. Determination of the core of a minimal bacterial gene set. Microbiol. Mol. Biol. Rev. 68, 518–37. https://doi.org/10.1128/MMBR.68.3.518-537.2004

Glaeser, S.P., Kämpfer, P., 2015. Multilocus sequence analysis (MLSA) in prokaryotic taxonomy. Syst. Appl. Microbiol. 38, 237–245. https://doi.org/10.1016/j.syapm.2015.03.007

Gomariz, M., Martínez-García, M., Santos, F., Rodriguez, F., Capella-Gutiérrez, S., Gabaldón, T., Rosselló-Móra, R., Meseguer, I., Antón, J., 2015. From community approaches to single-cell genomics: the discovery of ubiquitous hyperhalophilic Bacteroidetes generalists. ISME J. 9, 16–31. https://doi.org/10.1038/ismej.2014.95

Goujon, M., McWilliam, H., Li, W., Valentin, F., Squizzato, S., Paern, J., Lopez, R., 2010. A new bioinformatics analysis tools framework at EMBL-EBI. Nucleic Acids Res. 38, W695–W699. https://doi.org/10.1093/nar/gkq313

Gupta, R.S., 2004. The phylogeny and signature sequences characteristics of Fibrobacteres, Chlorobi, and Bacteroidetes. Crit. Rev. Microbiol. 30, 123–143. https://doi.org/10.1080/10408410490435133

Gupta, R.S., Lorenzini, E., 2007. Phylogeny and molecular signatures (conserved proteins and indels) that are specific for the Bacteroidetes and Chlorobi species. BMC Evol. Biol. 7, 71. https://doi.org/10.1186/1471-2148-7-71

Hahnke, R.L., Meier-Kolthoff, J.P., García-López, M., Mukherjee, S., Huntemann, M., Ivanova, N.N., Woyke, T., Kyrpides, N.C., Klenk, H.P., Göker, M., 2016. Genome-based taxonomic classification of Bacteroidetes. Front. Microbiol. 7. https://doi.org/10.3389/fmicb.2016.02003

Hall, L.M.C., Fawell, S.C., Shi, X., Faray-Kele, M.-C., Aduse-Opoku, J., Whiley, R.A., Curtis, M.A., 2005. Sequence Diversity and Antigenic Variation at the rag Locus of Porphyromonas gingivalis. Infect. Immun. 73, 4253–4262. https://doi.org/10.1128/IAI.73.7.4253-4262.2005

Hollande A.C., F.R., 1931. La structure cytologique de Blattabacterium cuenoti (Mercier) N.G., symbiote du tissu adipeux des Blattides. Comptes Rendus des Séances la Société Biol. 752–754.

Huelsenbeck, J.P., 1995. The robustness of two phylogenetic methods: four-taxon simulations reveal a slight superiority of maximum likelihood over neighbor joining. Mol. Biol. Evol. 12, 843–849. https://doi.org/10.1093/oxfordjournals.molbev.a040261

Hugenholtz, P., 2002. Exploring prokaryotic diversity in the genomic era. Genome Biol. 3, REVIEWS0003. https://doi.org/10.1186/gb-2002-3-2-reviews0003

Hunnicutt, D.W., McBride, M.J., 2000. Cloning and Characterization of the Flavobacterium johnsoniae Gliding-Motility Genes gldB and gldC. J. Bacteriol. 182, 911–918. https://doi.org/10.1128/JB.182.4.911-918.2000

Iino, T., Mori, K., Itoh, T., Kudo, T., Suzuki, K.I., Ohkuma, M., 2014. Description of Mariniphaga anaerophila gen. nov., sp. nov., a facultatively aerobic marine bacterium isolated from tidal flat sediment, reclassification of the Draconibacteriaceae as a later heterotypic synonym of the Prolixibacteraceae and description of the family Marinifilaceae fam. nov. Int. J. Syst. Evol. Microbiol. 64, 3660–3667. https://doi.org/10.1099/ijs.0.066274-0

Plotly Technologies Inc., 2015. Collaborative Data Science. Plotly Technologies Inc., Montreal, QC. https://plot.ly

Jukes, T.H., Cantor, C.R., 1969. Evolution of protein molecules. In: Munro, H.N. (Ed.), Mammalian Protein Metabolism, III. Academic Press, New York, pp. 21–132.

Kalwasińska, A., Deja-Sikora, E., Burkowska-But, A., Szabó, A., Felföldi, T., Kosobucki, P., Krawiec, A., Walczak, M., 2018. Changes in bacterial and archaeal communities during the concentration of brine at the graduation towers in Ciechocinek spa (Poland). Extremophiles 22, 233–246. https://doi.org/10.1007/s00792-017-0992-5

Kanehisa, M., Sato, Y., Furumichi, M., Morishima, K., Tanabe, M., 2019. New approach for understanding genome variations in KEGG. Nucleic Acids Res. 47, D590–D595. https://doi.org/10.1093/nar/gky962

Kappelmann, L., Krüger, K., Hehemann, J.-H., Harder, J., Markert, S., Unfried, F., Becher, D., Shapiro, N., Schweder, T., Amann, R.I., Teeling, H., 2019. Polysaccharide utilization loci of North Sea Flavobacteriia as basis for using SusC/D-protein expression for predicting major phytoplankton glycans. ISME J. 13, 76–91. https://doi.org/10.1038/s41396-018-0242-6

Kennedy, J., Margassery, L.M., O’Leary, N.D., O’Gara, F., Morrissey, J., Dobson, A.D.W., 2014. Aquimarina amphilecti sp. nov., isolated from the sponge Amphilectus fucorum. Int. J. Syst. Evol. Microbiol. 64, 501–505. https://doi.org/10.1099/ijs.0.049650-0

Kim, M., Oh, H.S., Park, S.C., Chun, J., 2014. Towards a taxonomic coherence between average nucleotide identity and 16S rRNA gene sequence similarity for species demarcation of prokaryotes. Int. J. Syst. Evol. Microbiol. 64, 346–351. https://doi.org/10.1099/ijs.0.059774-0

Kimura, M., 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16, 111–120. https://doi.org/10.1007/BF01731581

Klappenbach, J.A., Saxman, P.R., Cole, J.R., Schmidt, T.M., 2001. rrndb: the Ribosomal RNA Operon Copy Number Database. Nucleic Acids Res. 29 (1), 181–184.

Konstantinidis, K.T., Tiedje, J.M., 2005. Genomic insights that advance the species definition for prokaryotes. Proc. Natl. Acad. Sci. U. S. A. 102, 2567–2572. https://doi.org/10.1073/pnas.0409727102

Koonin, E.V., 2003. Comparative genomics, minimal gene-sets and the last universal common ancestor. Nat. Rev. Microbiol. 1, 127–136. https://doi.org/10.1038/nrmicro751

Krieg, N.R., Ludwig, W., Euzeby, J., Whitman, W.B., 2010. Phylum XIV. Bacteroidetes phyl. nov., in: Bergey’s Manual of Systematic Bacteriology. Springer New York, New York, pp. 25–469.

Kyrpides, N.C., Woyke, T., Eisen, J.A., Garrity, G., Lilburn, T.G., Beck, B.J., Whitman, W.B., Hugenholtz, P., Klenk, H.-P., 2013. Genomic Encyclopedia of Type Strains, Phase I: The one thousand microbial genomes (KMG-I) project. Stand. Genomic Sci. 9, 1278–1284. https://doi.org/10.4056/sigs.5068949

Lambiase, A., 2014. The Family Sphingobacteriaceae, in: DeLong, E.F., Lory, S., Stackebrandt, E., Thompson, F. (Eds.), The Prokaryotes. Springer-Verlag Berlin Heidelberg, pp. 907–914. https://doi.org/10.1007/978-3-642-30194-0

Lanave, C., Preparata, G., Sacone, C., Serio, G., 1984. A new method for calculating evolutionary substitution rates. J. Mol. Evol. 20, 86–93. https://doi.org/10.1007/BF02101990

Lapage, S.P., Sneath, P.H.A., Lessel, E.F., Skerman, V.B.D., Seeliger, H.P.R., Clark, W.A., 1992. International Code of Nomenclature of Bacteria: Bacteriological Code, 1990 Revision. ASM Press, Washington (DC), USA.

Larsbrink, J., Rogers, T.E., Hemsworth, G.R., McKee, L.S., Tauzin, A.S., Spadiut, O., Klinter, S., Pudlo, N.A., Urs, K., Koropatkin, N.M., Creagh, A.L., Haynes, C.A., Kelly, A.G., Cederholm, S.N., Davies, G.J., Martens, E.C., Brumer, H., 2014. A discrete genetic locus confers xyloglucan metabolism in select human gut Bacteroidetes. Nature 506, 498–502. https://doi.org/10.1038/nature12907

Lasica, A.M., Ksiazek, M., Madej, M., Potempa, J., 2017. The Type IX Secretion System (T9SS): Highlights and Recent Insights into Its Structure and Function. Front. Cell. Infect. Microbiol. 7, 215. https://doi.org/10.3389/fcimb.2017.00215

Lemos, R.S., Fernandes, A.S., Pereira, M.M., Gomes, C.M., Teixeira, M., 2002. Quinol:fumarate oxidoreductases and succinate:quinone oxidoreductases: phylogenetic relationships, metal centres and membrane attachment. Biochim. Biophys. Acta - Bioenerg. 1553, 158–170. https://doi.org/10.1016/S0005-2728(01)00239-0

Liu, K., Li, S., Jiao, N., Tang, K., 2014. Gramella flava sp. nov., a member of the family Flavobacteriaceae isolated from seawater. Int. J. Syst. Evol. Microbiol. 64, 165–168. https://doi.org/10.1099/ijs.0.051987-0

Ludwig, W., Strunk, O., Westram, R., Richter, L., Meier, H., Yadhukumar, Buchner, A., Lai, T., Steppi, S., Jobb, G., Förster, W., Brettske, I., Gerber, S., Ginhart, A.W., Gross, O., Grumann, S., Hermann, S., Jost, R., König, A., Liss, T., Lüssmann, R., May, M., Nonhoff, B., Reichel, B., Strehlow, R., Stamatakis, A., Norbert, S., Vilbig, A., Lenke, M., Ludwig, T., Bode, A., Schleifer, K.H., 2004. ARB: a software environment for sequence data. Nucleic Acids Res. 32 (4), 1363–1371.

Ludwig, W., Euzéby, J., Whitman, W.B., 2010. Road map of the phyla Bacteroidetes, Spirochaetes, Tenericutes (Mollicutes), Acidobacteria, Fibrobacteres, Fusobacteria, Dictyoglomi, Gemmatimonadetes, Lentisphaerae, Verrucomicrobia, Chlamydiae, and Planctomycetes. In: Krieg, N.R., Staley, J.T., Brown, D.R., Hedlund, B.P., Paster, B.J., Ward, N.L., Ludwig, W., Whitman, W.B. (Eds.), Bergey’s Manual® of Systematic Bacteriology, Springer, New York, pp. 1–24.

Ludwig, W., Klenk, H.P., 2005. Overview: A Phylogenetic Backbone and Taxonomic Framework for Procaryotic Systematics. In: Brenner, D.J., Krieg, N.R., Staley, J.T., Garrity, G.M. (Eds.), Bergey’s Manual® of Systematic Bacteriology. Springer, Boston, MA. https://doi.org/10.1007/0-387-28021-9_8

Ludwig, W., Schleifer, K.H., 2005. Molecular phylogeny of bacteria based on comparative sequence analysis of conserved genes. Microb. Phylogeny Evol., concepts and controversies, 70–98.

Ludwig, W., Schleifer, K.H., 1994. Bacterial phylogeny based on 16S and 23S rRNA sequence analysis. FEMS Microbiol. Rev. 15 (2-3), 155–173.

Ludwig, W., Strunk, O., Klugbauer, S., Klugbauer, N., Weizenegger, M., Neumaier, J., Bachleitner, M., Schleifer, K.H., 1998. Bacterial phylogeny based on comparative sequence analysis. Electrophoresis 19, 554–568. https://doi.org/10.1002/elps.1150190416

Makhdoumi-Kakhki, A., Amoozegar, M.A., Ventosa, A., 2012. Salinibacter iranicus sp. nov. and Salinibacter luteus sp. nov., isolated from a salt lake, and emended descriptions of the genus Salinibacter and of Salinibacter ruber. Int. J. Syst. Evol. Microbiol. 62 (7), 1521–1527. https://doi.org/10.1099/ijs.0.031971-0

Manz, W., Amann, R., Ludwig, W., Vancanneyt, M., Schleifer, K.H., 1996. Application of a suite of 16S rRNA-specific oligonucleotide probes designed to investigate bacteria of the phylum cytophaga-flavobacter-bacteroides in the natural environment. Microbiology 142 (5), 1097–1106. https://doi.org/10.1099/13500872-142-5-1097

Marreiros, B.C., Calisto, F., Castro, P.J., Duarte, A.M., Sena, F.V., Silva, A.F., Sousa, F.M., Teixeira, M., Refojo, P.N., Pereira, M.M., 2016. Exploring membrane respiratory chains. Biochim. Biophys. Acta - Bioenerg. 1857, 1039–1067. https://doi.org/10.1016/J.BBABIO.2016.03.028

McBride, M.J., Liu, W., Lu, X., Zhu, Y., Zhang, W., 2014. The Family Cytophagaceae, in: Rosenberg, E., Stackebrandt, E., Thompson, F., Lory, S., DeLong, E.F. (Eds.), The Prokaryotes. Springer-Verlag Berlin Heidelberg, pp. 577–593. https://doi.org/10.1007/978-3-642-30194-0

McBride, M.J., Zhu, Y., 2013. Gliding Motility and Por Secretion System Genes Are Widespread among Members of the Phylum Bacteroidetes. J. Bacteriol. 195 (2), 270–278. https://doi.org/10.1128/JB.01962-12

McIlroy, S.J., Nielsen, P.H., 2014. The Family Saprospiraceae, in: Rosenberg, E., Stackebrandt, E., Thompson, F., Lory, S., DeLong, E.F. (Eds.), The Prokaryotes. Springer-Verlag Berlin Heidelberg, pp. 863–889. https://doi.org/10.1007/978-3-642-30194-0

Medema, M.H., Takano, E., Breitling, R., 2013. Detecting sequence homology at the gene cluster level with multigeneblast. Mol. Biol. Evol. 30 (5), 1218–1223. https://doi.org/10.1093/molbev/mst025

Meuric, V., Rouillon, A., Chandad, F., Bonnaure-Mallet, M., 2010. Putative respiratory chain of Porphyromonas gingivalis. Future Microbiol. 5 (5), 717–734.

Moparthi, V.K., Hägerhäll, C., 2011. The evolution of respiratory chain complex i from a smaller last common ancestor consisting of 11 protein subunits. J. Mol. Evol. 72, 484–497. https://doi.org/10.1007/s00239-011-9447-2

Mulkidjanian, A.Y., Dibrov, P., Galperin, M.Y., 2008. The past and present of sodium energetics: May the sodium-motive force be with you. Biochim. Biophys. Acta - Bioenerg. 1777 (7-8), 985–992. https://doi.org/10.1016/j.bbabio.2008.04.028

Munoz, R., Rosselló-Móra, R., Amann, R., 2016. Revised phylogeny of Bacteroidetes and proposal of sixteen new taxa and two new combinations including Rhodothermaeota phyl. nov. Syst. Appl. Microbiol. 39 (5), 281–296. https://doi.org/10.1016/J.SYAPM.2016.04.004

Munoz, R., Rosselló-Móra, R., Amann, R., 2016. Corrigendum to “Revised phylogeny of Bacteroidetes and proposal of sixteen new taxa and two new combinations including Rhodothermaeota phyl. nov.” [Syst. Appl. Microbiol. 39 (5) (2016) 281–296]. Syst. Appl. Microbiol. 39, 491–492. https://doi.org/10.1016/j.syapm.2016.08.006

Munoz, R., Yarza, P., Ludwig, W., Euzéby, J., Amann, R., Schleifer, K.-H., Oliver Glöckner, F., Rosselló-Móra, R., 2011. Release LTPs104 of the All-Species Living Tree. Syst. Appl. Microbiol. 34 (3), 169–170. https://doi.org/10.1016/j.syapm.2011.03.001

Munoz, R., Yarza, P., Rosselló-Móra, R., 2014. Harmonized Phylogenetic Trees for The Prokaryotes, in: Rosenberg, E., DeLong, E.F., Lory, S., Stackebrandt, E., Thompson, F.L. (Eds.), The Prokaryotes. Springer Berlin Heidelberg, pp. 1–3. https://doi.org/10.1007/978-3-642-30138-4_415

Nedashkovskaya, O.I., Ludwig, W., 2015. Cyclobacteriaceae fam. nov. In: Whitman, W.B., Rainey, F., Kämpfer, P., Trujillo, M., Chun, J., DeVos, P., Hedlund, B., Dedysh, S. (Eds.), Bergey's Manual of Systematics of Archaea and Bacteria. https://doi.org/10.1002/9781118960608.fbm00063

Noinaj, N., Guillier, M., Barnard, T.J., Buchanan, S.K., 2010. TonB-dependent transporters: regulation, structure, and function. Annu. Rev. Microbiol. 64 (1), 43–60. https://doi.org/10.1146/annurev.micro.112408.134247

Olsen, G.J., Woese, C.R., Overbeek, R., 1994. The winds of (evolutionary) change: Breathing new life into microbiology. J. Bacteriol. 176 (1), 1–6.

Oren, A., 2013. Salinibacter: An extremely halophilic bacterium with archaeal properties. FEMS Microbiol. Lett. 342 (1), 1–9. https://doi.org/10.1111/1574-6968.12094

Oren, A., 2011. Characterization of Pigments of Prokaryotes and Their Use in Taxonomy and Classification, in: Methods in Microbiology. Academic Press, 38 (1), 261–282. https://doi.org/10.1016/B978-0-12-387730-7.00012-7

Oren, A., da Costa, M.S., Garrity, G.M., Rainey, F.A., Rosselló-Móra, R., Schink, B., Sutcliffe, I., Trujillo, M.E., Whitman, W.B., 2015. Proposal to include the rank of phylum in the International Code of Nomenclature of Prokaryotes. Int. J. Syst. Evol. Microbiol. 65 (11), 4284–4287. https://doi.org/10.1099/ijsem.0.000664

Ormerod, K.L., Wood, D.L.A., Lachner, N., Gellatly, S.L., Daly, J.N., Parsons, J.D., Dal’Molin, C.G.O., Palfreyman, R.W., Nielsen, L.K., Cooper, M.A., Morrison, M., Hansbro, P.M., Hugenholtz, P., 2016. Genomic characterization of the uncultured Bacteroidales family S24-7 inhabiting the guts of homeothermic animals. Microbiome 4 (36), 1–17. https://doi.org/10.1186/s40168-016-0181-2

Park, S., Akira, Y., Kogure, K., 2014. The Family Rhodothermaceae, in: Rosenberg, E., Stackebrandt, E., Thompson, F., Lory, S., DeLong, E.F. (Eds.), The Prokaryotes. Springer-Verlag Berlin Heidelberg, pp. 849–856. https://doi.org/10.1007/978-3-642-30194-0

Parker, C.T., Tindall, B.J., Garrity, G.M., 2019. International code of nomenclature of Prokaryotes. Int. J. Syst. Evol. Microbiol. 69 (1A), S1–S111. https://doi.org/10.1099/ijsem.0.000778

Parks, D.H., Chuvochina, M., Waite, D.W., Rinke, C., Skarshewski, A., Chaumeil, P.-A., Hugenholtz, P., 2018. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat. Biotechnol. 36, 996–1004. https://doi.org/10.1038/nbt.4229

Paster, B.J., Ludwig, W., Weisburg, W.G., Stackebrandt, E., Hespell, R.B., Hahn, C.M., Reichenbach, H., Stetter, K.O., Woese, C.R., 1985. A phylogenetic grouping of the Bacteroides, Cytophagas, and certain Flavobacteria. Syst. Appl. Microbiol. 6 (1), 34–42. https://doi.org/10.1016/S0723-2020(85)80008-4

Patel, G.B., Breuil, C., 1981. Isolation and characterization of Bacteroides polypragmatus sp. nov., an isolate which produces carbon dioxide, hydrogen and acetic acid during growth on various organic substrates, Advances in Biotechnology International Symposium (conference paper), Pergamon Press, Toronto; Oxford; Paris, pp. 291–296.

Pinnaka, A.K., Tanuku, N.R.S., 2014. The Family Cyclobacteriaceae, in: Rosenberg, E., Stackebrandt, E., Thompson, F., Lory, S., DeLong, E.F. (Eds.), The Prokaryotes. Springer-Verlag Berlin Heidelberg, pp. 551–575. https://doi.org/10.1007/978-3-642-30194-0

Pritchard, L., Glover, R.H., Humphris, S., Elphinstone, J.G., Toth, I.K., 2016. Genomics and taxonomy in diagnostics for food security: Soft-rotting enterobacterial plant pathogens. Anal. Methods 8, 12–24. https://doi.org/10.1039/c5ay02550h

Pruesse, E., Peplies, J., Glöckner, F.O., 2012. SINA: Accurate high-throughput multiple sequence alignment of ribosomal RNA genes. Bioinformatics 28, 1823–1829. https://doi.org/10.1093/bioinformatics/bts252

Quast, C., Pruesse, E., Yilmaz, P., Gerken, J., Schweer, T., Yarza, P., Peplies, J., Glöckner, F.O., 2012. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 41, D590–D596. https://doi.org/10.1093/nar/gks1219

R Core Team, 2013. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/

Reeves, A.R., Wang, G.R., Salyers, A.A., 1997. Characterization of four outer membrane proteins that play a role in utilization of starch by Bacteroides thetaiotaomicron. J. Bacteriol. 179, 643–649.

Refojo, P.N., Sousa, F.L., Teixeira, M., Pereira, M.M., 2010. The alternative complex III: A different architecture using known building modules. Biochim. Biophys. Acta - Bioenerg. 1797 (12), 1869–1876. https://doi.org/10.1016/J.BBABIO.2010.04.012

Reintjes, G., Arnosti, C., Fuchs, B.M., Amann, R., 2017. An alternative polysaccharide uptake mechanism of marine bacteria. ISME J. 11, 1640–1650. https://doi.org/10.1038/ismej.2017.26

Reyes-Prieto, A., Barquera, B., Juárez, O., 2014. Origin and Evolution of the Sodium-Pumping NADH:Ubiquinone Oxidoreductase. PLOS ONE 9 (5), e96696. https://doi.org/10.1371/journal.pone.0096696

Rodriguez-R, L.M., Konstantinidis, K.T., 2016. The enveomics collection: a toolbox for specialized analyses of microbial genomes and metagenomes. PeerJ Preprints 4, e1900v1. https://doi.org/10.7287/peerj.preprints.1900v1

Rosselló-Móra, R., Amann, R., 2015. Past and future species definitions for Bacteria and Archaea. Syst. Appl. Microbiol. 38, 209–216. https://doi.org/10.1016/j.syapm.2015.02.001

Rosselló-Móra, R., Whitman, W.B., 2019. Dialogue on the nomenclature and classification of prokaryotes. Syst. Appl. Microbiol. https://doi.org/10.1016/j.syapm.2018.07.002

Saitou, N., Nei, M., 1987. The neighbour-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406–425.

Sakamoto, M., 2014. The Family Porphyromonadaceae, in: Rosenberg, E., DeLong, E.F., Lory, S., Stackebrandt, E., Thompson, F. (Eds.), The Prokaryotes. Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 811–824. https://doi.org/10.1007/978-3-642-30194-0

Sandri, F., Musiani, F., Selamoglu, N., Daldal, F., Zannoni, D., 2018. Pseudomonas pseudoalcaligenes KF707 grown with biphenyl expresses a cytochrome caa3 oxidase that uses cytochrome c4 as electron donor. FEBS Lett. 592, 901–915. https://doi.org/10.1002/1873-3468.13001

Sato, K., Sakai, E., Veith, P.D., Shoji, M., Kikuchi, Y., Yukitake, H., Ohara, N., Naito, M., Okamoto, K., Reynolds, E.C., Nakayama, K., 2005. Identification of a new membrane-associated protein that influences transport/maturation of gingipains and adhesins of Porphyromonas gingivalis. J. Biol. Chem. 280, 8668–77. https://doi.org/10.1074/jbc.M413544200

Schauer, K., Rodionov, D.A., De Reuse, H., 2008. New substrates for TonB-dependent transport: do we only see the “tip of the iceberg”? Trends Biochem. Sci. 33, 330–8. https://doi.org/10.1016/j.tibs.2008.04.012

Soria-Carrasco, V., Valens-Vadell, M., Peña, A., Antón, J., Amann, R., Castresana, J., Rosselló-Mora, R., 2007. Phylogenetic position of Salinibacter ruber based on concatenated protein alignments. Syst. Appl. Microbiol. 30, 171–179. https://doi.org/10.1016/j.syapm.2006.07.001

Sousa, F.L., Alves, R.J., Ribeiro, M.A., Pereira-Leal, J.B., Teixeira, M., Pereira, M.M., 2012. The superfamily of heme–copper oxygen reductases: Types and evolutionary considerations. Biochim. Biophys. Acta - Bioenerg. 1817, 629–637. https://doi.org/10.1016/J.BBABIO.2011.09.020

Stamatakis, A., 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688–2690. https://doi.org/10.1093/bioinformatics/btl446

Sun, C., Benlekbir, S., Venkatakrishnan, P., Wang, Y., Hong, S., Hosler, J., Tajkhorshid, E., Rubinstein, J.L., Gennis, R.B., 2018. Structure of the alternative complex III in a supercomplex with cytochrome oxidase. Nature 557, 123–126. https://doi.org/10.1038/s41586-018-0061-y

Tamura, K., Nei, M., Kumar, S., 2004. Prospects for inferring very large phylogenies by using the neighbor-joining method. Proc. Natl. Acad. Sci. U. S. A. 101, 11030–11035. https://doi.org/10.1073/pnas.0404206101

Terrapon, N., Lombard, V., Drula, É., Lapébie, P., Al-Masaudi, S., Gilbert, H.J., Henrissat, B., 2018. PULDB: the expanded database of Polysaccharide Utilization Loci. Nucleic Acids Res. 46, D677–D683. https://doi.org/10.1093/nar/gkx1022

Terrapon, N., Lombard, V., Gilbert, H.J., Henrissat, B., 2015. Automatic prediction of polysaccharide utilization loci in Bacteroidetes species. Bioinformatics 31, 647–655. https://doi.org/10.1093/bioinformatics/btu716

Thomas, F., Hehemann, J.H., Rebuffet, E., Czjzek, M., Michel, G., 2011. Environmental and gut Bacteroidetes: The food connection. Front. Microbiol. 2, 1–16. https://doi.org/10.3389/fmicb.2011.00093

Toussaint, A., Chandler, M., 2012. Genome Fluidity: Toward a System Approach of the Mobilome. Springer, New York, NY, pp. 57–80. https://doi.org/10.1007/978-1-61779-361-5_4

Uchiyama, I., 2008. Multiple genome alignment for identifying the core structure among moderately related microbial genomes. BMC Genomics 9, 515. https://doi.org/10.1186/1471-2164-9-515

Unemoto, T., Hayashi, M., 1993. Na+-translocating NADH-quinone reductase of marine and halophilic bacteria. J. Bioenerg. Biomembr. 25, 385–391. https://doi.org/10.1007/BF00762464

Ussery, D.W., Wassenaar, T.M., Borini, S., 2009. Microbial Communities: Core and Pan-Genomics, in: Computing for Comparative Microbial Genomics. Springer London, London, pp. 213–228. https://doi.org/10.1007/978-1-84800-255-5_12

Van de Peer, Y., Chapelle, S., De Wachter, R., 1996. A quantitative map of nucleotide substitution rates in bacterial rRNA. Nucleic Acids Res. 24, 3381–3391. https://doi.org/10.1093/nar/24.17.3381

Veillon, A., Zuber, A., 1898. Recherches sur quelques microbes strictement anaérobies et leur rôle en pathologie. Arch. Med. Exp. 10, 517–545.

Ventura, M., Canchaya, C., Tauch, A., Chandra, G., Fitzgerald, G.F., Chater, K.F., Van Sinderen, D., 2007. Genomics of Actinobacteria: tracing the evolutionary history of an ancient phylum. Microbiol. Mol. Biol. Rev. 71, 495–548. https://doi.org/10.1128/MMBR.00005-07

Vidal, R., Ginard, D., Khorrami, S., Mora-Ruiz, M., Munoz, R., Hermoso, M., Díaz, S., Cifuentes, A., Orfila, A., Rosselló-Móra, R., 2015. Crohn associated microbial communities associated to colonic mucosal biopsies in patients of the western Mediterranean. Syst. Appl. Microbiol. 38, 442–52. https://doi.org/10.1016/j.syapm.2015.06.008

Viver, T., Orellana, L., González-Torres, P., Díaz, S., Urdiain, M., Farías, M.E., Benes, V., Kaempfer, P., Shahinpei, A., Ali Amoozegar, M., Amann, R., Antón, J., Konstantinidis, K.T., Rosselló-Móra, R., 2018. Genomic comparison between members of the Salinibacteraceae family, and description of a new species of Salinibacter (Salinibacter altiplanensis sp. nov.) isolated from high altitude hypersaline environments of the Argentinian Altiplano. Syst. Appl. Microbiol. 41, 198–212. https://doi.org/10.1016/j.syapm.2017.12.004

Wang, Y.X., Liu, J.H., Xiao, W., Zhang, X.X., Li, Y.Q., Lai, Y.H., Ji, K.Y., Wen, M.L., Cui, X.L., 2012. Fodinibius salinus gen. nov., sp. nov., a moderately halophilic bacterium isolated from a salt mine. Int. J. Syst. Evol. Microbiol. 62, 390–396. https://doi.org/10.1099/ijs.0.025502-0

Ward, N., Moreno-Hagelsieb, G., 2014. Quickly finding orthologs as reciprocal best hits with BLAT, LAST, and UBLAST: How much do we miss? PLoS One 9, 1–6. https://doi.org/10.1371/journal.pone.0101850

Weller, R., Glöckner, F.O., Amann, R., 2000. 16S rRNA-targeted oligonucleotide probes for the in situ detection of members of the phylum Cytophaga-Flavobacterium-Bacteroides. Syst. Appl. Microbiol. 23, 107–114. https://doi.org/10.1016/S0723-2020(00)80051-X

Wexler, A.G., Goodman, A.L., 2017. An insider’s perspective: Bacteroides as a window into the microbiome. Nat. Microbiol. 2, 17026. https://doi.org/10.1038/NMICROBIOL.2017.26

Wexler, H.M., 2007. Bacteroides: the good, the bad, and the nitty-gritty. Clin. Microbiol. Rev. 20, 593–621. https://doi.org/10.1128/CMR.00008-07

Whitman, W.B., Oren, A., Chuvochina, M., da Costa, M.S., Garrity, G.M., Rainey, F.A., Rossello-Mora, R., Schink, B., Sutcliffe, I., Trujillo, M.E., Ventura, S., 2018. Proposal of the suffix –ota to denote phyla. Addendum to ‘proposal to include the rank of phylum in the international code of nomenclature of prokaryotes.’ Int. J. Syst. Evol. Microbiol. 68, 967–969. https://doi.org/10.1099/ijsem.0.002593

Whitman, W.B., Woyke, T., Klenk, H.-P., Zhou, Y., Lilburn, T.G., Beck, B.J., De Vos, P., Vandamme, P., Eisen, J.A., Garrity, G.M., Hugenholtz, P., Kyrpides, N.C., 2015. Genomic Encyclopedia of Bacterial and Archaeal Type Strains, Phase III: the genomes of soil and plant-associated and newly described type strains. Stand. Genomic Sci. 10, 26. https://doi.org/10.1186/s40793-015-0017-x

Willems, A., Collins, M.D., 1995. 16S rRNA gene similarities indicate that Hallella seregens (Moore and Moore) and Mitsuokella dentalis (Haapasalo et al.) are genealogically highly related and are members of the genus Prevotella: Emended description of the genus Prevotella (Shah and Collins) and description of Prevotella dentalis comb. nov. Int. J. Syst. Bacteriol. 45 (4), 832–836.

Woese, C.R., 1987. Bacterial Evolution. Microbiology 51, 221–271. https://doi.org/10.1139/m88-093

Xia, J., Ling, S.K., Wang, X.Q., Chen, G.J., Du, Z.J., 2016. Aliifodinibius halophilus sp. nov., a moderately halophilic member of the genus Aliifodinibius, and proposal of Balneolaceae fam. nov. Int. J. Syst. Evol. Microbiol. 66, 2225–2233. https://doi.org/10.1099/ijsem.0.001012

Yang, Z., 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J. Mol. Evol. 39, 306–314. https://doi.org/10.1007/BF00160154

Yarza, P., Spröer, C., Swiderski, J., Mrotzek, N., Spring, S., Tindall, B.J., Gronow, S., Pukall, R., Klenk, H.P., Lang, E., Verbarg, S., Crouch, A., Lilburn, T., Beck, B., Unosson, C., Cardew, S., Moore, E.R.B., Gomila, M., Nakagawa, Y., Janssens, D., De Vos, P., Peiren, J., Suttels, T., Clermont, D., Bizet, C., Sakamoto, M., Iida, T., Kudo, T., Kosako, Y., Oshida, Y., Ohkuma, M., R.
Arahal, D., Spieck, E., Pommerening Roeser, A., Figge, M., Park, D., Buchanan, P., Cifuentes, A., Munoz, R., Euzéby, J.P., Schleifer, K.H., Ludwig, W., Amann, R., Glöckner, F.O., Rosselló-Móra, R., 2013. Sequencing orphan species initiative (SOS): Filling the gaps in the 16S rRNA gene sequence database for all species with validly published names. Syst. Appl. Microbiol. 36, 69–73. https://doi.org/10.1016/j.syapm.2012.12.006 Yarza, P., Yilmaz, P., Pruesse, E., Glöckner, F.O., Ludwig, W., Schleifer, K.-H., Whitman, W.B., Euzéby, J., Amann, R., Rosselló-Móra, R., 2014. Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences. Nat. Rev. Microbiol. 12, 635–645. https://doi.org/10.1038/nrmicro3330 Zeigler, D.R., 2003. Gene sequences useful for predicting relatedness of whole genomes in bacteria. Int. J. Syst. Evol. Microbiol. 53, 1893–1900. https://doi.org/10.1099/ijs.0.02713-0 Zhu, Q., Mai, U., Pfeiffer, W., Janssen, S., Asnicar, F., Sanders, J.G., Belda-Ferre, P., Al-Ghalith, G.A., Kopylova, E., McDonald, D., Kosciolek, T., Yin, J.B., Huang, S., Salam, N., Jiao, J.Y., Wu, Z., Xu, Z.Z., Cantrell, K., Yang, Y., Sayyari, E., Rabiee, M., Morton, J.T., Podell, S., Knights, D., Li, W.J., Huttenhower, C., Segata, N., Smarr, L., Mirarab, S., Knight, R., 2019. 96

Phylogenomics of 10,575 genomes reveals evolutionary proximity between domains Bacteria and Archaea. Nat. Commun. 10:5477. https://doi.org/10.1038/s41467-019-13443-4 97

Glossary

AAI: Average amino acid identity.
ANI: Average nucleotide identity.
ANIm: Average nucleotide identity based on the MUMmer algorithm.
BRC: Biological Resources Centers.
DDBJ: DNA Data Bank of Japan.
DUF: Domain of Unknown Function.
ENA: European Nucleotide Archive.
FCB: Fibrobacteres-Chlorobi-Bacteroidetes superphylum.
FISH: Fluorescence in situ hybridization.
GEBA: The Genomic Encyclopedia of Bacteria and Archaea.
GOLD: Genomes OnLine Database.
GTDB: Genome Taxonomy Database.
HCO: Heme-copper oxygen reductase.
HMW: High molecular weight.
ICSP: International Committee on Systematics of Prokaryotes.
IJSEM: International Journal of Systematic and Evolutionary Microbiology.
IMEDEA: Institut Mediterrani d’Estudis Avançats (SP).
JGI: Joint Genome Institute.
LPSN: List of prokaryotic names with standing in nomenclature.
LSU: Large subunit.
LTP: Living-Tree Project.
LUCA: Last universal common ancestor.
MBGD: Microbial Genome Database for Comparative Analysis.
m.s.i.: Median sequence identity.
MFS: Major facilitator superfamily.
ML: Maximum likelihood.
MLSA: Multi-locus sequence alignment.
MPI: Max-Planck-Institute für marine Mikrobiologie (GE).
NCBI: National Center for Biotechnology Information (US).
NJ: Neighbor-joining.
OTU: Operational taxonomic unit.
PBC: Phylogenetically balanced core (of genes).
PNU: Prokaryotic nomenclature up-to-date.
RDP: Ribosomal Database Project.
SAM: Systematic and Applied Microbiology (Journal).
SBSV: Société de Bactériologie Systématique et Vétérinaire (FR).
SINA: SILVA incremental aligner.
SQR: Succinate:quinone oxidoreductase.
SSU: Small subunit.
TCA: Tricarboxylic acid.
TUM: Technische Universität München (GE).

ANNEX

A. Individual phylogenies of orthologous proteins in Bacteroidetes, Rhodothermaeota and Chlorobi evaluated for inclusion in the phylogeny based on a multilocus sequence alignment (MLSA). The phylogenies of proteins AspS and RibA are shown, although both proteins were discarded from the MLSA.

Each topology is the consensus of its neighbor-joining (Saitou and Nei, 1987) and RAxML (Stamatakis, 2006) reconstructions. The sequences were aligned with MUSCLE (Edgar, 2004). All positions of the alignment were included in the analysis.

The scale bar indicates sequence divergence. The first figure shows the phylogenies of the discarded proteins AspS and RibA. The remaining topologies are displayed in alphabetical order of their sequence names.
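The consensus step mentioned in caption A can be illustrated with a minimal sketch. It assumes a strict consensus of rooted trees (a clade is kept only if both reconstructions recover it); the caption does not state which consensus rule was applied, and the toy trees and leaf names below are hypothetical.

```python
def clades(tree):
    """Collect every clade (as a frozenset of leaf names) of a nested-tuple tree."""
    found = set()

    def walk(node):
        # A leaf is a plain string; an internal node is a tuple of children.
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    walk(tree)
    return found


# Hypothetical toy topologies standing in for the NJ and ML reconstructions.
nj_tree = ((("A", "B"), "C"), ("D", "E"))
ml_tree = ((("A", "C"), "B"), ("D", "E"))

# Strict consensus (assumed rule): keep only clades recovered by both methods.
shared = clades(nj_tree) & clades(ml_tree)

for clade in sorted(shared, key=lambda c: (len(c), sorted(c))):
    print(sorted(clade))
```

In this toy case the two methods disagree on the branching order inside the clade A, B, C, so a strict consensus would keep that clade but collapse its internal branching into a multifurcation.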

B. Schematic representation of the genomes selected for genomic comparison, based on the revised phylogeny of the Bacteroidetes by Munoz et al. (2016). Red lines outline the coverage and representativeness of the 89 genomes (and the out-group genomes of the Rhodothermaeota and Chlorobi) that were compared in this study. The Sphingobacteriia lineage of the Mucilaginibacter spp. could not be represented because the available genome was still divided into seven contigs when we closed our data collection.

C. ANIm of the pre-selected genomes revealed specific synonyms. Yellow circles: synonyms Elizabethkingia anophelis NUHP1 versus Elizabethkingia meningoseptica FMS-007 (98% identity, 81% coverage) and Cellulophaga lytica DSM 7489T versus Cellulophaga lytica HI1 (99% identity, 98% coverage). Black circle: false positive due to highly similar regions between Cyclobacterium amurskyense KCTC 12363T and Pedobacter heparinus DSM 2366T (97% identity, 0.001% coverage). Such similar fragments produce the red pixels scattered off the diagonal of the plot.
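The distinction drawn in caption C, between true synonyms and a false positive with high identity but negligible coverage, can be sketched as a simple two-threshold rule. The three genome pairs and their values come from the caption; the cut-offs `MIN_IDENTITY` and `MIN_COVERAGE` and the function `classify` are assumptions made for illustration only and are not taken from the thesis.

```python
from typing import NamedTuple


class AnimHit(NamedTuple):
    genome_a: str
    genome_b: str
    identity: float  # percent identity over the aligned regions
    coverage: float  # percent of the genome covered by the alignment


# Assumed cut-offs for illustration; not values reported in the thesis.
MIN_IDENTITY = 95.0
MIN_COVERAGE = 50.0


def classify(hit: AnimHit) -> str:
    """Label a genome pair from its ANIm identity and alignment coverage."""
    if hit.identity < MIN_IDENTITY:
        return "distinct"
    if hit.coverage < MIN_COVERAGE:
        # High identity over a tiny aligned fraction is not evidence of synonymy.
        return "false positive"
    return "synonym candidate"


hits = [
    AnimHit("Elizabethkingia anophelis NUHP1",
            "Elizabethkingia meningoseptica FMS-007", 98.0, 81.0),
    AnimHit("Cellulophaga lytica DSM 7489T",
            "Cellulophaga lytica HI1", 99.0, 98.0),
    AnimHit("Cyclobacterium amurskyense KCTC 12363T",
            "Pedobacter heparinus DSM 2366T", 97.0, 0.001),
]

for hit in hits:
    print(f"{hit.genome_a} vs {hit.genome_b}: {classify(hit)}")
```

Checking coverage alongside identity is what separates the two yellow-circled pairs from the black-circled one, whose 97% identity applies to only 0.001% of the genome.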

D. List of the best gene cluster homologies in 89 bacteroidetal genomes against the reference flx gene cluster of Flavobacterium johnsoniae UW101T. The reference cluster contains 35 genes. Only 22 other genomes conserved at least half of the cluster (total score ≥ 17). MultiGeneBlast v1.1.13 (Medema et al., 2013) compared the query gene cluster against the 89 genomes with homology cut-offs of 20% identity and 30% coverage, and a maximum inter-gene distance of 10 genes.
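The selection criterion in caption D (a total score of at least 17 for a reference cluster of 35 genes) can be written as a short filter. This is a sketch under stated assumptions: the function `conserved_genomes`, the genome labels and the scores are hypothetical placeholders, not results from the thesis, and the way MultiGeneBlast computes its total score is not reproduced here.

```python
from typing import Dict, List

REFERENCE_CLUSTER_SIZE = 35                      # genes in the reference cluster
MIN_TOTAL_SCORE = REFERENCE_CLUSTER_SIZE // 2    # 17, the cut-off used in the caption


def conserved_genomes(total_scores: Dict[str, float]) -> List[str]:
    """Return genomes whose total score reaches the cut-off, best score first."""
    kept = [name for name, score in total_scores.items() if score >= MIN_TOTAL_SCORE]
    return sorted(kept, key=lambda name: total_scores[name], reverse=True)


# Hypothetical scores for illustration only.
example_scores = {
    "genome_A": 34.0,
    "genome_B": 17.0,
    "genome_C": 16.5,
    "genome_D": 3.0,
}

print(conserved_genomes(example_scores))
```

With these placeholder values only genome_A and genome_B pass, since a score of exactly 17 meets the inclusive cut-off while 16.5 does not.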