bioRxiv preprint doi: https://doi.org/10.1101/2020.12.22.423916; this version posted December 22, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.
Self-Supervised Representation Learning of Protein Tertiary Structures (PtsRep): Protein Engineering as a Case Study

Junwen Luo1†, Yi Cai1†, Jialin Wu1, Hongmin Cai2, Xiaofeng Yang1* and Zhanglin Lin1*

1School of Biology and Biological Engineering, 2School of Computer Science and Engineering, South China University of Technology, University Park, Guangzhou, Guangdong 510006, China.
* To whom correspondence should be addressed: School of Biology and Biological Engineering, South China University of Technology, 382 East Outer Loop Road, University Park, Guangzhou, Guangdong 510006, China; Tel: +86 (20) 3938-0680; Fax: +86 (20) 3938-0601; Email: [email protected] (Z.L.); [email protected] (X.Y.).
† These authors contributed equally to this work.
Abstract

In recent years, deep learning has been increasingly used to decipher the relationships among protein sequence, structure, and function. Thus far deep learning of proteins has mostly utilized protein primary sequence information, while the vast amount of protein tertiary structural information remains unused. In this study, we devised a self-supervised representation learning framework to extract the fundamental features of unlabeled protein tertiary structures (PtsRep), and the embedded representations were transferred to two commonly recognized protein engineering tasks, protein stability and GFP fluorescence prediction. On both tasks, PtsRep significantly outperformed the two benchmark methods (UniRep and TAPE-BERT), which are based on protein primary sequences. Protein clustering analyses demonstrated that PtsRep can capture the structural signals in proteins. PtsRep reveals an avenue for general protein structural representation learning, and for exploring protein structural space for protein engineering and drug design.
Introduction

Self-supervised learning is a successful method for learning general representations from unlabeled samples1-5. In natural language processing (NLP), it takes the forms of Word2Vec (the continuous skip-gram model and the continuous bag-of-words model)1, next-token prediction2, and masked-token prediction3. Recently these NLP-based techniques have been applied to representation learning of protein primary sequences6-10. Two representative examples are UniRep10, which is based on next-token prediction, and a BERT11 model (denoted as TAPE-BERT), which is based on masked-token prediction. Through protein transfer learning, both methods showed good performance in predicting the protein stability landscape and the green fluorescent protein (GFP) activity landscape. These tasks used two datasets, one from Rocklin et al.12 and the other from Sarkisyan et al.13. The former includes the sequences and chymotrypsin stability scores of more than 69,000 de novo designed proteins, natural proteins, and their mutants or other derivatives. The latter contains more than 50,000 mutants of GFP from Aequorea victoria.
Structural biology has made the tertiary structures of many proteins accessible14, and the recent breakthrough of AlphaFold 215 will likely render protein structures even more readily available. Thus far, representation learning of proteins has mostly processed protein primary sequence information, whereas the vast amount of protein tertiary structural information available in the PDB16 database remains unused. How to utilize protein structural information in deep learning remains a fundamentally important question.
In this work, we present a self-supervised learning framework to learn embedded representations of protein tertiary structures (PtsRep) (Fig. 1). The learned representations summarized known protein structures into fixed-length vectors (738 dimensions). These were then transferred to three tasks: (1) protein fold classification, (2) prediction of protein stability, and (3) prediction of GFP fluorescence (Fig. 1C). We found that PtsRep performed well for fold reclassification of unlabeled structures, but most importantly, it significantly outperformed the benchmark methods (UniRep and TAPE-BERT) in the two protein engineering tasks. Thus, this self-supervised representation learning approach for protein structures provides an avenue to approximate or capture the fundamental features of protein structures, and it holds promise for advancing protein engineering and drug design.
Methods

1. Encoding of protein tertiary structures by K nearest residues

To encode the structural information for each residue in a given protein, we adopted a strategy from an algorithm in which each residue is represented by its K nearest residues (KNR) in Euclidean space17. Each single residue was represented by its bulkiness18, hydrophobicity19, flexibility20, relative spatial distance, relative sequential residue distance, and spatial position based on a spherical coordinate system (Fig. 1A).
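The KNR encoding described above can be illustrated with a minimal sketch. Note that `knr_encode`, `to_spherical`, and the property tables below are hypothetical names with placeholder values, not the published bulkiness, hydrophobicity, or flexibility scales, and the feature ordering is an assumption for illustration only:

```python
import numpy as np

# Placeholder per-residue property tables (illustrative values only,
# not the published bulkiness/hydrophobicity/flexibility scales).
BULKINESS = {"A": 11.5, "G": 3.4}
HYDROPHOBICITY = {"A": 1.8, "G": -0.4}
FLEXIBILITY = {"A": 0.36, "G": 0.54}

def to_spherical(vec):
    """Convert a 3D displacement vector to spherical coordinates (r, theta, phi)."""
    r = np.linalg.norm(vec)
    theta = np.arccos(vec[2] / r) if r > 0 else 0.0
    phi = np.arctan2(vec[1], vec[0])
    return r, theta, phi

def knr_encode(coords, sequence, k=3):
    """Encode each residue by features of its k nearest residues in Euclidean space."""
    coords = np.asarray(coords, dtype=float)
    features = []
    for i in range(len(sequence)):
        dists = np.linalg.norm(coords - coords[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]  # skip the residue itself
        feat = []
        for j in neighbors:
            r, theta, phi = to_spherical(coords[j] - coords[i])
            feat.extend([
                BULKINESS[sequence[j]],       # bulkiness of the neighbor
                HYDROPHOBICITY[sequence[j]],  # hydrophobicity
                FLEXIBILITY[sequence[j]],     # flexibility
                r,                            # relative spatial distance
                j - i,                        # relative sequential residue distance
                theta, phi,                   # position in spherical coordinates
            ])
        features.append(feat)
    return np.asarray(features)
```

Each residue thus maps to a fixed-length vector of 7k values, one block of seven features per spatial neighbor.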
2. Bidirectional language model for self-supervised representation learning of protein structures

Given a protein of $N$ residues $x_1, x_2, \ldots, x_N$, for each residue $p$ is the probability of the predicted residue corresponding to the actual residue, and the probability of the sequence was calculated by modeling the probability of the residue at position $t$ ($x_t$), given the history $x_1, x_2, \ldots, x_{t-1}$:

$$p(x_1, x_2, \ldots, x_N) = \prod_{t=1}^{N} p(x_t \mid x_1, x_2, \ldots, x_{t-1})$$
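In practice this factorized probability is evaluated in log space: the model emits one conditional distribution per position, and the sequence log-likelihood is the sum of the log-probabilities assigned to the actual residues. A minimal sketch (the function name and array layout are illustrative assumptions, not part of the PtsRep implementation):

```python
import numpy as np

def sequence_log_likelihood(cond_probs, targets):
    """Log-likelihood of a residue sequence under an autoregressive model:
    log p(x_1..x_N) = sum_t log p(x_t | x_1..x_{t-1}).

    cond_probs: (N, V) array; row t is the model's predicted distribution
                over the residue vocabulary at position t, conditioned on
                the preceding residues.
    targets:    length-N sequence of integer residue indices (the actual x_t).
    """
    cond_probs = np.asarray(cond_probs, dtype=float)
    picked = cond_probs[np.arange(len(targets)), targets]  # p(x_t | history)
    return float(np.sum(np.log(picked)))
```

A bidirectional model adds a second pass conditioned on the future residues $x_{t+1}, \ldots, x_N$ and trains both directions jointly.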