Function Annotation Algorithm Based on Sequence Spectral Features: Evaluation on Human Transcription Factors

Gemović Branislava*, Davidović Radoslav, Šumonja Neven, Perović Vladimir, and Veljković Nevena

Center for Multidisciplinary Research, Institute of Nuclear Sciences Vinča, University of Belgrade, Mike Petrovića Alasa 14-16, 11001 Belgrade, Serbia

*Correspondence to: [email protected]


MOTIVATION

• The gap between sequence information and knowledge of protein function poses a major challenge in bioinformatics: automatically predicting protein functions from sequence.
• Developing protein function models requires a trusted set of known function annotations to learn from, such as those stored in the Gene Ontology (GO).
• The incomplete state of biological ontologies affects the development of prediction models, as well as realistic estimation of model performance.

FUNCTION ANNOTATION ALGORITHM

ISM (Informational Spectrum Method) is a virtual spectroscopy method for structure/function analysis of protein sequences, based on the Fourier transform of a numerical representation of the protein sequence. This numerical representation is obtained by replacing each amino acid with its calculated Electron Ion Interaction Potential value.

[Workflow diagram. Labels: Protein X; ISM classifier; Protein 1 ... Protein N with values P 1 ... P N; Threshold filter (first K); Protein 1 ... Protein K; DINGO; p-value filter; GO 1 ... GO M]

Maere S, Heymans K, Kuiper M (2005) BiNGO: a Cytoscape plugin to assess overrepresentation of Gene Ontology categories in biological networks. Bioinformatics 21, 3448-3449.

DATASET

Human proteins previously annotated in BPO (10672)
AND
Proteins predicted with recall >= 0.9 (282)
AND
Proteins predicted with precision >= 0.5 (26)
AND
Proteins previously annotated with transcription-related GO terms (14)

VALIDATING FALSE POSITIVES

We analyzed 10,672 human proteins and predicted their annotations in the GO sub-ontology Biological Processes (BPO). We singled out a subset of proteins for which our algorithm correctly predicted the majority of annotations in BPO, with no or only a small number of false negatives (Recall≥0.9). From this subset, we further selected 14 transcription factors and validated their false positive (FP) predictions (Table 1).

Table 1. Performance of ISM-based algorithm on the selected dataset

PROTEIN | PRECISION | RECALL | F
ZNF367 | 0.758 | 1 | 0.862
KLF7 | 0.562 | 1 | 0.719
RGMA | 0.520 | 1 | 0.684
HSF2 | 0.542 | 0.985 | 0.699
TBX19 | 0.591 | 0.981 | 0.738
FEV | 0.508 | 0.968 | 0.667
SOX10 | 0.586 | 0.962 | 0.729
MEF2B | 0.525 | 0.961 | 0.679
DRAP1 | 0.520 | 0.955 | 0.674
VPS72 | 0.504 | 0.955 | 0.660
MYBL2 | 0.805 | 0.943 | 0.868
RNF25 | 0.643 | 0.926 | 0.759
ARNTL2 | 0.557 | 0.926 | 0.696
CCDC85B | 0.553 | 0.907 | 0.687

RESULTS OF MANUAL CURATION

There were 69 predicted leaf terms and 449 corresponding parent terms that were considered FP according to the GO ground truth for the dataset. Manual curation of the available biomedical literature identified evidence for 68% of leaf terms. When we could not find evidence for a leaf term, we extended the search to its parent terms. In this way we found evidence for 86% of terms in the GO slims generated from the predicted terms for each protein (Table 2).

Table 2. Manual curation of biomedical literature used for validation of FP

PROTEIN | EVIDENCE FOR LEAVES | EVIDENCE FOR ALL TERMS
ZNF367 | 1/3 (33%) | 10/13 (77%)
KLF7 | 3/3 (100%) | 33/33 (100%)
RGMA | 2/6 (33%) | 45/55 (82%)
HSF2 | 6/7 (86%) | 29/33 (88%)
TBX19 | 1/2 (50%) | 30/31 (97%)
FEV | 2/3 (67%) | 18/22 (82%)
SOX10 | 2/5 (40%) | 20/33 (61%)
MEF2B | 5/6 (83%) | 54/66 (82%)
DRAP1 | 6/7 (86%) | 53/55 (96%)
VPS72 | 6/7 (86%) | 58/61 (95%)
MYBL2 | 4/5 (80%) | 11/12 (92%)
RNF25 | 2/4 (50%) | 22/30 (73%)
ARNTL2 | 2/3 (67%) | 19/23 (83%)
CCDC85B | 4/7 (57%) | 42/51 (82%)

NEW FUNCTIONS OF TRANSCRIPTION FACTORS

When the algorithm recognizes the majority of terms associated with the target protein, it also efficiently predicts new functional annotations (Table 3).

Table 3. New functions of transcription factors.

PROTEIN | GO TERM | GO ID
ZNF367 | Generation of precursor metabolites; Intracellular transport | GO:0006091; GO:0046907
RGMA | UDP-N-acetylgalactosamine metabolic process; I-kappaB phosphorylation; androgen receptor signaling pathway | GO:0019276; GO:0007252; GO:0030521
HSF2 | Epidermal growth factor receptor signaling pathway | GO:0007173
TBX19 | Skeletal system development | GO:0001501
FEV | Mitochondrion organization | GO:0007005
SOX10 | Protein lipidation; Membrane fusion | GO:0006497; GO:0061025
MEF2B | Immunoglobulin V(D)J recombination | GO:0033152
DRAP1 | Histone H2A K63-linked deubiquitination | GO:0070537
VPS72 | Adenylate cyclase-inhibiting G-protein coupled receptor signaling pathway | GO:0007193
MYBL2 | Generation of precursor metabolites and energy | GO:0006091
RNF25 | Histone H3 acetylation; segregation | GO:0043966; GO:0007059
ARNTL2 | Positive regulation of cytokine biosynthetic process | GO:0042108
CCDC85B | Ubiquitin-dependent protein catabolic process; adenylate cyclase-inhibiting G-protein coupled receptor signaling pathway; DNA packaging | GO:0006511; GO:0007193; GO:0006323

MYBL2 (B-MYB) EXAMPLE

The MYBL2 protein is a transcription factor involved in cancer initiation and progression. QuickGO generated a GO slim with 70 terms, of which our model correctly predicted 66. There were 16 FP terms: 1 redundant term, 3 transcription-related terms and 12 newly predicted annotations (5 leaves). Validation of the false positives found evidence for 11/12 annotations (4/5 leaves) in the current biomedical literature (Table 4, Fig. 1).

Table 4. Validating FP predictions for MYBL2 (B-MYB).

GO ID | GO TERM | EVIDENCE | REFERENCE
GO:0006260 | DNA replication | “…B-Myb is critical for proper DNA duplication during an unperturbed S phase in mouse embryonic stem cells…” | PMID: 20715180
GO:0008283 | cell proliferation | “…B-myb antisense oligonucleotides inhibit proliferation of human hematopoietic cell lines…” | PMID: 1586718
GO:0007165 | signal transduction | “…B-MYB positively regulates serine-threonine kinase receptor-associated protein (STRAP) activity through direct interaction…B-MYB enhances STRAP-mediated inhibition of TGF-β signalling pathways…” | PMID: 21148321
GO:0006974 | cellular response to DNA damage stimulus | “…B-MYB is required for recovery from the DNA damage-induced G2 checkpoint in p53 mutant cells…” | PMID: 19383908
GO:0006091 | generation of precursor metabolites and energy | ? |

Figure 1. Functions predicted for MYBL2

ACKNOWLEDGEMENT

This work was supported by the Ministry of Education, Science and Technological Development of the Republic of Serbia [ON173001].
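The ISM encoding described under FUNCTION ANNOTATION ALGORITHM (replace each amino acid with its EIIP value, then take the Fourier transform) can be sketched as follows. This is a minimal illustration only, not the poster's classifier: the EIIP table is the one commonly tabulated in the ISM literature and is not given on the poster, and the function names `encode` and `informational_spectrum` are hypothetical.

```python
import cmath

# Assumption: EIIP values (in Rydbergs) as commonly tabulated in the ISM
# literature; they do not appear on the poster, so verify before reuse.
EIIP = {
    "L": 0.0000, "I": 0.0000, "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}


def encode(sequence):
    """Replace each amino acid letter with its EIIP value."""
    return [EIIP[aa] for aa in sequence.upper()]


def informational_spectrum(sequence):
    """Return (frequency, energy) pairs for frequencies in (0, 0.5].

    The energy is the squared magnitude of the discrete Fourier transform
    of the EIIP-encoded sequence; the zero-frequency term is skipped.
    """
    signal = encode(sequence)
    n = len(signal)
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * m / n)
            for m, x in enumerate(signal)
        )
        spectrum.append((k / n, abs(coeff) ** 2))
    return spectrum


# Toy sequence with period 2, so its dominant peak sits at frequency 0.5.
toy_spectrum = informational_spectrum("DL" * 8)
peak_frequency, peak_energy = max(toy_spectrum, key=lambda pair: pair[1])
print(peak_frequency, peak_energy)
```

The direct summation is quadratic in sequence length, which keeps the sketch dependency-free; an FFT routine would be the usual choice for a dataset of thousands of proteins.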
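The precision, recall and F values in Table 1 can be illustrated with the MYBL2 counts given in the text (a GO slim of 70 terms, 66 of them predicted, plus 16 FP terms). The sketch below assumes that F is the harmonic mean of precision and recall (F1), which the poster does not state explicitly; the function name and the placeholder term identifiers are hypothetical.

```python
def precision_recall_f(predicted_terms, annotated_terms):
    """Compare predicted GO terms against the annotated (ground truth) set.

    Assumption: F is the F1 score, i.e. the harmonic mean of precision
    and recall.
    """
    predicted = set(predicted_terms)
    annotated = set(annotated_terms)
    true_positives = len(predicted & annotated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(annotated) if annotated else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Placeholder identifiers standing in for real GO terms (hypothetical).
annotated = {f"known_{i}" for i in range(70)}           # GO slim, 70 terms
predicted = {f"known_{i}" for i in range(66)}           # 66 correctly predicted
predicted |= {f"unconfirmed_{i}" for i in range(16)}    # 16 FP terms

precision, recall, f_score = precision_recall_f(predicted, annotated)
print(round(precision, 3), round(recall, 3), round(f_score, 3))
```

With these counts the rounded values agree with the MYBL2 row of Table 1, which supports reading F as F1.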