Downloaded from genome.cshlp.org on October 1, 2021 - Published by Cold Spring Harbor Laboratory Press

Genome-scale phylogenetic function annotation of large and diverse protein families

1∗ Barbara E Engelhardt
Electrical Engineering and Computer Science Department, University of California, Berkeley, Berkeley, CA, USA
Current address: Computer Science Department, University of Chicago, Chicago, IL, USA

2 Michael I Jordan
Statistics Department and Electrical Engineering and Computer Science Department, University of California, Berkeley, Berkeley, CA, USA

3 John R Srouji
Plant & Microbial Biology, University of California, Berkeley, Berkeley, CA, USA
Current address: Molecular and Cellular Biology Department, Harvard University, Cambridge, MA, USA

4 Steven E Brenner
Plant and Microbial Biology Department, University of California, Berkeley, Berkeley, CA, USA

∗ Corresponding author:
[email protected]

Running Title: Phylogenetic function annotation

Keywords: Protein molecular function annotation, statistical model, protein function evolution, Gene Ontology, whole-genome annotation.

Abstract

The Statistical Inference of Function Through Evolutionary Relationships (SIFTER) framework uses a statistical graphical model that applies phylogenetic principles to automate precise protein function prediction (Engelhardt et al., 2005, 2006). Here we present a significant revision of the approach (SIFTER version 2.0) that allows annotations to be made on a genomic scale. We confirm that SIFTER 2.0 produces predictions as precise as those of the earlier version of SIFTER on a carefully studied family and on a collection of one hundred protein families with limited functional diversity. We have added an approximation method to SIFTER 2.0, and show a 500-fold improvement in speed with minimal impact on prediction results in the functionally diverse sulfotransferase protein family.