bioRxiv preprint doi: https://doi.org/10.1101/530188; this version posted January 24, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Robust taxonomic classification of uncharted microbial sequences and bins with CAT and BAT F.A. Bastiaan von Meijenfeldt1,†, Ksenia Arkhipova1,†, Diego D. Cambuy1, Felipe H. Coutinho2,3, Bas E. Dutilh1,2,* 1 Theoretical Biology and Bioinformatics, Science for Life, Utrecht University, The Netherlands. 2 Centre for Molecular and Biomolecular Informatics, Radboud University Medical Centre, Nijmegen, The Netherlands. 3 Instituto de Biologia, Universidade Federal do Rio de Janeiro, Rio de Janeiro, RJ, Brazil. * To whom correspondence should be addressed. Tel: +31 30 253 4212; Email:
[email protected]. † These authors contributed equally to this work. Present Address: [Felipe H. Couthinho], Evolutionary Genomics Group, Departamento de Produccíon y Microbiología, Universidad Miguel Hernández, Campus San Juan, San Juan, Alicante 03550, Spain. ABSTRACT Current-day metagenomics increasingly requires taxonomic classification of long DNA sequences and metagenome-assembled genomes (MAGs) of unknown microorganisms. We show that the standard best-hit approach often leads to classifications that are too specific. We present tools to classify high- quality metagenomic contigs (Contig Annotation Tool, CAT) and MAGs (Bin Annotation Tool, BAT) and thoroughly benchmark them with simulated metagenomic sequences that are classified against a reference database where related sequences are increasingly removed, thereby simulating increasingly unknown queries. We find that the query sequences are correctly classified at low taxonomic ranks if closely related organisms are present in the reference database, while classifications are made higher in the taxonomy when closely related organisms are absent, thus avoiding spurious classification specificity.