bioRxiv preprint doi: https://doi.org/10.1101/2020.07.16.207365; this version posted July 17, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 1 2 Early acquisition of conserved, lineage-specific proteins currently lacking functional 3 predictions were central to the rise and diversification of archaea 4 5 Raphaël Méheust+1,4, Cindy J. Castelle1,2, Alexander L. Jaffe5 and Jillian F. Banfield+,1,2,3,4,5 6 7 +Corresponding authors:
[email protected], raphael.meheust@berkeley,edu 8 9 1Department of Earth and Planetary Science, University of California, Berkeley, CA 10 2Chan Zuckerberg Biohub, San Francisco, CA 11 3Department of Environmental Science, Policy, and Management, University of California, Berkeley, CA 12 4Innovative Genomics Institute, University of California, Berkeley, CA 13 5Department of Plant and Microbial Biology, University of California, Berkeley, CA 14 15 16 Abstract 17 Recent genomic analyses of Archaea have profoundly reshaped our understanding of their 18 distribution, functionalities and roles in eukaryotic evolution. Within the domain, major 19 supergroups are Euryarchaeota, which includes many methanogens, the TACK, which includes 20 Thaumarchaeaota that impact ammonia oxidation in soils and the ocean, the Asgard, which 21 includes lineages inferred to be ancestral to eukaryotes, and the DPANN, a group of mostly 22 symbiotic small-celled archaea. Here, we investigated the extent to which clustering based on 23 protein family content recapitulates archaeal phylogeny and identified the proteins that distinguish 24 the major subdivisions. We also defined 10,866 archaeal protein families that will serve as a 25 community resource.