bioRxiv preprint doi: https://doi.org/10.1101/2020.12.23.424215; this version posted December 26, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 1 Shedding Light on Microbial Dark Matter with A 2 Universal Language of Life 1,* 2 3 2,* 3 Hoarfrost A , Aptekmann, A , Farfañuk, G , Bromberg Y 4 1 Department of Marine and Coastal Sciences, Rutgers University, 71 Dudley Road, New Brunswick, NJ 08873, USA 5 2 Department of Biochemistry and Microbiology, 76 Lipman Dr, Rutgers University, New Brunswick, NJ 08901, USA 6 3 Department of Biological Chemistry, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Argentina 7 8 * Corresponding author:
[email protected];
[email protected] 9 Running Title: LookingGlass: a universal language of life 10 11 Abstract 12 The majority of microbial genomes have yet to be cultured, and most proteins 13 predicted from microbial genomes or sequenced from the environment cannot be 14 functionally annotated. As a result, current computational approaches to describe 15 microbial systems rely on incomplete reference databases that cannot adequately 16 capture the full functional diversity of the microbial tree of life, limiting our ability to 17 model high-level features of biological sequences. The scientific community needs a 18 means to capture the functionally and evolutionarily relevant features underlying 19 biology, independent of our incomplete reference databases. Such a model can form 20 the basis for transfer learning tasks, enabling downstream applications in 21 environmental microbiology, medicine, and bioengineering.