
bioRxiv preprint doi: https://doi.org/10.1101/2020.12.22.423916; this version posted December 22, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Self-Supervised Representation Learning of Tertiary Structures (PtsRep): Protein Engineering as A Case Study

Junwen Luo1†, Yi Cai1†, Jialin Wu1, Hongmin Cai2, Xiaofeng Yang1* and Zhanglin Lin1*

1School of Biology and Biological Engineering, 2School of Computer Science and Engineering, South China University of Technology, University Park, Guangzhou, Guangdong 510006, China.

* To whom correspondence should be addressed: School of Biology and Biological Engineering, South China University of Technology, 382 East Outer Loop Road, University Park, Guangzhou, Guangdong 510006, China; Tel: +86 (20) 3938-0680; Fax: +86 (20) 3938-0601; Email: [email protected] (Z.L.); [email protected] (X.Y.).

† These authors contributed equally to this work.


Abstract

In recent years, deep learning has been increasingly used to decipher the relationships among protein sequence, structure, and function. Thus far, deep learning of proteins has mostly utilized protein primary sequence information, while the vast amount of protein tertiary structural information remains unused. In this study, we devised a self-supervised representation learning framework to extract the fundamental features of unlabeled protein tertiary structures (PtsRep), and the embedded representations were transferred to two commonly recognized protein engineering tasks, protein stability and GFP fluorescence prediction. On both tasks, PtsRep significantly outperformed the two benchmark methods (UniRep and TAPE-BERT), which are based on protein primary sequences. Protein clustering analyses demonstrated that PtsRep can capture the structural signals in proteins. PtsRep reveals an avenue for general protein structural representation learning, and for exploring protein structural space for protein engineering and drug design.


Introduction

Self-supervised learning is a successful method for learning general representations from unlabeled samples1-5. In natural language processing (NLP), it takes the forms of word embedding (the continuous skip-gram and continuous bag-of-words models)1, next-token prediction2, and masked-token prediction3. Recently these NLP-based techniques have been applied to representation learning of protein primary sequences6-10. Two representative examples are UniRep10, which is based on next-token prediction, and a BERT11 model (denoted as TAPE-BERT), which is based on masked-token prediction. Through protein transfer learning, both methods showed good performance for predicting the protein stability landscape and the green fluorescent protein (GFP) activity landscape. In these tasks, two datasets, one from Rocklin et al12 and the other from Sarkisyan et al13, were used. The former includes the sequences and chymotrypsin stability scores of more than 69,000 de novo designed proteins, natural proteins, and their mutants or other derivatives. The latter contains more than 50,000 mutants of GFP from Aequorea victoria.

Cryo-electron microscopy has made the tertiary structures of many proteins accessible14, and the recent breakthrough of AlphaFold 2 (ref. 15) will likely render protein structures even more readily available. Thus far, representation learning of proteins has mostly processed protein primary sequence information, whereas the vast amount of protein tertiary structural information available in the PDB16 database remains unused. How to utilize protein structural information in deep learning remains a fundamentally important question.


In this work, we present a self-supervised learning framework to learn embedded representations of protein tertiary structures (PtsRep) (Fig. 1). The learned representations summarized known protein structures into fixed-length vectors (768 dimensions). These were then transferred to three tasks: (1) protein fold classification, (2) prediction of protein stability, and (3) prediction of GFP fluorescence (Fig. 1C). We found that PtsRep performed well for fold reclassification of unlabeled structures, but most importantly, it significantly outperformed the benchmark methods (UniRep and TAPE-BERT) in the two protein engineering tasks. Thus, this self-supervised approach to representation learning of protein structures provides an avenue to approximate or capture the fundamental features of protein structures, and has the promise to advance protein engineering and drug design.


Methods

1. Encoding of protein tertiary structures by K nearest residues

To encode the structural information for each residue in a given protein, we adopted a strategy from an algorithm which enables each residue to be represented by its K nearest residues (KNR) in the Euclidean space17. Each single residue was represented by its bulkiness18, hydrophobicity19, flexibility20, relative spatial distance, relative sequential residue distance, and spatial position based on a spherical coordinate system (Fig. 1A).
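As an illustration, a KNR-style encoding along these lines could be computed from Cα coordinates as sketched below; the truncated property tables, the exact feature ordering, and the coordinate conventions are assumptions for illustration rather than the authors' implementation.

```python
import numpy as np

# Truncated placeholder property tables; the full 20-residue bulkiness,
# hydrophobicity, and flexibility scales come from refs 18-20.
BULKINESS = {"A": 11.5, "G": 3.4, "L": 21.4}
HYDROPHOBICITY = {"A": 1.8, "G": -0.4, "L": 3.8}
FLEXIBILITY = {"A": 0.36, "G": 0.54, "L": 0.37}

def knr_encode(sequence, ca_coords, k=15):
    """Represent each residue by the properties of its k nearest residues.

    sequence  : one-letter amino acid string of length N
    ca_coords : (N, 3) array of C-alpha coordinates
    returns   : (N, k, 7) array of [bulkiness, hydrophobicity, flexibility,
                sequence separation, r, theta, phi] for each neighbor
    """
    coords = np.asarray(ca_coords, dtype=float)
    n = len(sequence)
    features = np.zeros((n, k, 7))
    for i in range(n):
        delta = coords - coords[i]                # vectors to all residues
        dist = np.linalg.norm(delta, axis=1)
        neighbors = np.argsort(dist)[1 : k + 1]   # k nearest, excluding self
        for j, idx in enumerate(neighbors):
            aa = sequence[idx]
            r = dist[idx]
            dx, dy, dz = delta[idx]
            theta = np.arccos(dz / r)             # polar angle
            phi = np.arctan2(dy, dx)              # azimuthal angle
            features[i, j] = (
                BULKINESS.get(aa, 0.0),
                HYDROPHOBICITY.get(aa, 0.0),
                FLEXIBILITY.get(aa, 0.0),
                idx - i,                          # relative sequence distance
                r,                                # relative spatial distance
                theta,
                phi,
            )
    return features
```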

2. Bidirectional language model for self-supervised representation learning of protein structures

Given a protein of N residues $(x_1, x_2, \ldots, x_N)$, for each residue, $p$ is the probability of the predicted residue corresponding to the actual residue, and the probability for the sequence was calculated by modeling the probability of the residue at position $t$ ($x_t$), given the history $(x_1, x_2, \ldots, x_{t-1})$21,22:

$$p(x_1, x_2, \ldots, x_N) = \prod_{t=1}^{N} p(x_t \mid x_1, x_2, \ldots, x_{t-1})$$

At each position $t$, the forward long short-term memory (LSTM) network output a context-dependent representation $\overrightarrow{h}_t$, which was used to predict the next residue $x_{t+1}$ with a Softmax layer21. We additionally trained a reverse LSTM to provide a bidirectional context for each residue, which produced the representation $\overleftarrow{h}_t$ for $x_t$, given the history $(x_{t+1}, x_{t+2}, \ldots, x_N)$:


$$p(x_1, x_2, \ldots, x_N) = \prod_{t=1}^{N} p(x_t \mid x_{t+1}, x_{t+2}, \ldots, x_N)$$

Our formulation maximized the sum of the log likelihoods in both the forward and backward directions as follows:

$$\sum_{t=1}^{N} \left[ \log p(x_t \mid x_1, \ldots, x_{t-1}; \Theta_x, \overrightarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s) + \log p(x_t \mid x_{t+1}, \ldots, x_N; \Theta_x, \overleftarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s) \right]$$

where we tied the parameters for the residue representation ($\Theta_x$), the LSTM layer ($\Theta_{\mathrm{LSTM}}$), and the Softmax layer ($\Theta_s$) in both the forward and backward directions. At each position $t$, only the history information was given, to avoid leakage of subsequent sequence position information23. We used the cross entropy as the loss function21. We then defined $d$ as the distance from the current position $t$, and $R(t, d)$ as the sum of the log likelihoods in both the forward and backward directions:

$$R(t, d) = \log p(x_t \mid x_1, \ldots, x_{t-d}; \Theta_x, \overrightarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s) + \log p(x_t \mid x_{t+d}, x_{t+d+1}, \ldots, x_N; \Theta_x, \overleftarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s)$$

Similarly, we tied the parameters for the residue representation ($\Theta_x$), the LSTM layer ($\Theta_{\mathrm{LSTM}}$), and the Softmax layer ($\Theta_s$) in both directions. Considering that the immediately adjacent amino acid residues may be easy to identify from the input structural information, we omitted these two residues (i.e., $d = 1$) in both directions. The final loss was the average of $R(t, d)$ calculated for all positions $t$ and the remaining distances ($d = 2, 3$):

$$L = \frac{1}{2N} \sum_{t=1}^{N} \sum_{d=2}^{3} R(t, d)$$
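A minimal PyTorch sketch of this objective is given below, assuming flattened per-residue KNR inputs (k × 7 features) and a LongTensor of residue-type indices as targets; the layer sizes, the single shared Softmax head, and the way the two LSTM directions are combined are illustrative assumptions rather than the exact PtsRep architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PtsRepSketch(nn.Module):
    """Forward and reverse LSTMs over per-residue KNR features, trained to
    predict the identities of noncontiguous neighbors (offsets 2 and 3)."""

    def __init__(self, knr_dim=15 * 7, hidden=768, n_aa=20):
        super().__init__()
        self.embed = nn.Linear(knr_dim, hidden)        # residue representation
        self.fwd = nn.LSTM(hidden, hidden, batch_first=True)
        self.bwd = nn.LSTM(hidden, hidden, batch_first=True)
        self.softmax_head = nn.Linear(hidden, n_aa)    # tied across directions

    def forward(self, knr):                            # knr: (1, N, knr_dim)
        x = self.embed(knr)
        h_fwd, _ = self.fwd(x)                         # left-to-right context
        h_bwd, _ = self.bwd(torch.flip(x, dims=[1]))   # right-to-left context
        return h_fwd, torch.flip(h_bwd, dims=[1])

def noncontiguous_loss(model, knr, residue_ids, offsets=(2, 3)):
    """Cross entropy for residues at distance d in each direction, averaged
    over positions; immediate neighbors (d = 1) are skipped."""
    h_fwd, h_bwd = model(knr)
    losses = []
    for d in offsets:
        # forward state at position t predicts the residue at t + d
        logits_f = model.softmax_head(h_fwd[:, :-d]).squeeze(0)
        losses.append(F.cross_entropy(logits_f, residue_ids[d:]))
        # backward state at position t predicts the residue at t - d
        logits_b = model.softmax_head(h_bwd[:, d:]).squeeze(0)
        losses.append(F.cross_entropy(logits_b, residue_ids[:-d]))
    return torch.stack(losses).mean()
```

In this reading, each position contributes four prediction targets (positions -3, -2, +2, and +3), matching the description given in the Results.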

3. Additional training details


We chose 768 as the dimension of the embedding, following the Bidirectional Encoder Representations from Transformers (BERT)3. Because the length of each protein sequence was different, we used a batch size of 1 with a constant learning rate of 2e-3. The Swish24 activation was applied to each layer as follows:

$$f(x) = x \cdot \mathrm{sigmoid}(\beta x)$$

Considering that the activations of each layer had different distributions, layer normalization25 was applied to each layer.

For these stacked networks, the Adam26 optimizer was selected. The model was trained for 2.5 million weight updates, corresponding to ~1 epoch. Convergence was determined as no improvement in the validation loss for 5 epochs.
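A minimal sketch of the per-layer configuration described above (Swish activation followed by layer normalization, optimized with Adam at a constant learning rate of 2e-3) is shown below; the number of stacked layers and how they attach to the Bi-LSTM are not specified in the text, so the two-layer stack here is only illustrative.

```python
import torch
import torch.nn as nn

class SwishLayer(nn.Module):
    """One stacked layer: linear transform, Swish activation
    (f(x) = x * sigmoid(beta * x)), then layer normalization."""
    def __init__(self, dim=768, beta=1.0):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        self.beta = beta
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        h = self.linear(x)
        h = h * torch.sigmoid(self.beta * h)   # Swish activation
        return self.norm(h)                    # normalize per-layer activations

model = nn.Sequential(SwishLayer(), SwishLayer())
# Batch size 1 (variable-length proteins) with a constant learning rate of 2e-3.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-3)
```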

4. Datasets

1) Self-supervised learning dataset

We used ProteinNet27 as the training and validation set of the self-supervised model. We first collected the 90% thinning version of ProteinNet12 (ref. 27), which had 49,600 protein chains with tertiary structures. We then excluded the protein chains (i) which had missing Cα coordinates, or (ii) which were longer than 700 aa or shorter than 30 aa. The resulting dataset had 35,568 protein chains. We used 95% of the data as the training dataset, and 5% as the validation dataset.
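The filtering and split described above can be expressed as in the sketch below; the record format (a dict with 'sequence' and 'ca_coords' keys) and the random shuffling are assumptions, and the parsing of ProteinNet12 records is not shown.

```python
import random

def filter_chains(chains, min_len=30, max_len=700, val_fraction=0.05, seed=0):
    """Filter ProteinNet-style records and split into train/validation.

    `chains` is assumed to be an iterable of dicts with keys 'id', 'sequence',
    and 'ca_coords' (a list of C-alpha coordinates, None where missing).
    """
    kept = [
        c for c in chains
        if all(xyz is not None for xyz in c["ca_coords"])      # no missing C-alpha
        and min_len <= len(c["sequence"]) <= max_len           # 30-700 residues
    ]
    random.Random(seed).shuffle(kept)
    n_val = int(len(kept) * val_fraction)
    return kept[n_val:], kept[:n_val]                          # train, validation
```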

2) Stability landscape prediction dataset

We used the dataset from Rocklin et al12. This set includes the sequences and chymotrypsin stability scores of 69,034 proteins (17,773 de novo designed proteins, 10,674 mutants of these


designed proteins, 1,193 natural proteins, 2,423 mutants of these natural proteins, 24,900 scrambled versions of the proteins, and 12,071 control inactivated sequences where a buried aspartate residue was inserted) (Supplementary Table 1). Among these proteins or mutants, we used a total of 16,431 proteins or mutants with structural information (16,159 de novo designed mini proteins and 272 natural proteins) for training and validation, and 12,851 point mutants of 14 de novo designed proteins and 3 natural proteins with structural information for testing12. For training and validation, we used an 80%/20% splitting strategy.

3) Fluorescence landscape prediction dataset

We used the dataset from Sarkisyan et al13, which contains more than 50,000 mutants of green fluorescent protein from Aequorea victoria. The number of mutations in the mutants ranged from 1 to 15 (ref. 28). Following a previous study11, we used the mutants with 1~3 mutations for training and validation, and the mutants with 4 or more mutations for testing. For training and validation, we used an 80%/20% splitting strategy (Supplementary Table 1).

4) Benchmark representations

We used UniRep, which is an mLSTM model trained on about 24 million protein sequences, as a benchmark representation10. We downloaded the trained weights and obtained the representations using the code in https://github.com/churchlab/UniRep. We chose the 1900-dimensional UniRep representation, which was reported to perform best among the different dimensions10.


We also used TAPE-BERT, trained on about 32 million protein sequences, as a second benchmark representation3. We downloaded the trained weights and obtained the representations using the code in https://github.com/songlab-cal/tape.


Results

1. How PtsRep learned the representations of protein tertiary structures

The framework used in our work consists of three steps, as shown in Fig. 1. (A) To feed in the protein tertiary structural information, we adopted an algorithm17 which enables each residue to be represented by the properties of its K nearest residues (KNR) in the Euclidean space. We varied K and set it at 15 after optimization (data not shown). A dataset of 35,568 protein chains was selected from the PDB, with 95% used for training and 5% for validation. (B) To learn the structural representations, we first used a Bi-LSTM to extract embedded features of protein tertiary structures, and then trained the network by maximizing the accuracy of predicting the four nearest noncontiguous residues of each amino acid residue of a given protein, i.e., positions -3, -2, +2, and +3 (the present residue is defined as position 0), after preliminary optimization (data not shown). This yielded an optimal model, which summarized any protein tertiary structure into 768-dimensional vectors along the protein sequence length. (C) Finally, the trained PtsRep was applied to protein engineering tasks through a downstream network, as shown in Fig. 1C. To better align with the benchmark TAPE-BERT11, we used the same linear regression model.
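A minimal sketch of this transfer step is shown below, assuming the per-residue PtsRep embeddings are mean-pooled into a single protein-level vector before the regression head; the pooling choice and the learning rate are assumptions, since the text only states that the same linear regression model as TAPE-BERT was used.

```python
import torch
import torch.nn as nn

class DownstreamRegressor(nn.Module):
    """Pool the per-residue PtsRep embeddings into one protein-level vector and
    regress the property of interest (stability score or fluorescence)."""
    def __init__(self, dim=768):
        super().__init__()
        self.regressor = nn.Linear(dim, 1)

    def forward(self, embeddings):           # embeddings: (N_residues, 768)
        pooled = embeddings.mean(dim=0)      # fixed-length protein representation
        return self.regressor(pooled)

# Usage: embeddings come from the frozen PtsRep encoder for one protein.
head = DownstreamRegressor()
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)  # assumed learning rate
```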

2. PtsRep performed well for protein fold classification

To examine what PtsRep learned from the PDB dataset, we used t-distributed stochastic neighbor embedding (t-SNE)29 to test its unsupervised clustering ability on 15,444 semantically related but unlabeled proteins, which have been classified by the Structural Classification of


Proteins (SCOP)30. We found that PtsRep separated the five different types of proteins well (Fig. 2, right column). Using the Davies-Bouldin Index31 (DBI), a general index for evaluating clustering algorithms, we found that the DBI for PtsRep was only 1.27. In contrast, for the baseline representation KNR, which utilized the same amount of structural information but without the training of PtsRep, the DBI was 11.62 (Fig. 2, left column). Therefore, the trained PtsRep model improved the clustering by 9.2 times in this SCOP classification task.
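This evaluation can be reproduced in outline with scikit-learn as sketched below; the file names are placeholders, and computing the DBI on the embedding vectors themselves (rather than on the 2D t-SNE projection) is an assumption.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics import davies_bouldin_score

# protein_vectors: (n_proteins, d) protein-level embeddings (e.g., pooled PtsRep
# or raw KNR features); scop_classes: integer SCOP class labels.
protein_vectors = np.load("embeddings.npy")      # hypothetical file name
scop_classes = np.load("scop_labels.npy")        # hypothetical file name

# 2D coordinates of the kind plotted in Fig. 2.
projection = TSNE(n_components=2, random_state=0).fit_transform(protein_vectors)

# Lower DBI means tighter, better-separated clusters; the paper reports
# 1.27 for PtsRep versus 11.62 for KNR on the 15,444 SCOP-classified proteins.
dbi = davies_bouldin_score(protein_vectors, scop_classes)
print(f"Davies-Bouldin Index: {dbi:.2f}")
```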

3. PtsRep outperformed benchmarks on protein stability prediction

We then used the downstream tasks shown in Fig. 1C to further evaluate the protein structural embeddings learned by PtsRep. As shown in Table 1, we compared Spearman's ρ, accuracy (ACC), and mean squared error (MSE). Protein stability is a key concern in protein engineering, as it influences production yields32 and reaction rates33. In this study, the dataset from Rocklin et al12 consists of the following four protein topologies: ααα, βαββ, αββα, and ββαββ, where α denotes a helix and β denotes a sheet. It should be noted that the training/validation dataset for the benchmarks UniRep and TAPE-BERT contains 56,180 proteins (Supplementary Table 1), among which only 16,431 with structural information were suitable for PtsRep in this study. Thus, PtsRep was trained on only 29.3% of the data used for UniRep and TAPE-BERT. However, all the proteins in the test dataset for the benchmarks have solved structures, thus the same test dataset was used for UniRep, TAPE-BERT, and PtsRep in this study.

We found that PtsRep significantly outperformed the best benchmark TAPE-BERT (Spearman's ρ = 0.79 versus 0.73), and much more so against the baseline KNR (Spearman's ρ =


0.79 versus 0.37). PtsRep also showed generally better performance on accuracy and MSE, compared with the benchmarks. Interestingly, the benchmark UniRep performed differently for different protein topologies (e.g., αββα, ααα), but PtsRep performed more evenly in this respect, with one of the smallest standard deviations among all the methods, which was significantly better than that of the benchmark UniRep (0.11 versus 0.25) and similar to that of BERT (0.11 versus 0.08) (Supplementary Table 2).
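The metrics reported in Table 1 can be computed as sketched below; the accuracy definition here (thresholding predicted and measured stability scores at zero) is an assumption, since the paper does not state how ACC was derived.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import mean_squared_error

def evaluate(pred, true, threshold=0.0):
    """Spearman's rho, MSE, and a two-class accuracy (stable vs. unstable),
    thresholding scores at `threshold` -- the threshold is an assumption."""
    pred, true = np.asarray(pred), np.asarray(true)
    rho, _ = spearmanr(pred, true)                  # rank correlation
    mse = mean_squared_error(true, pred)
    acc = np.mean((pred > threshold) == (true > threshold))
    return {"spearman_rho": rho, "acc": acc, "mse": mse}
```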

4. PtsRep improved the efficiency of green fluorescent protein engineering

Fluorescent proteins are useful for conveniently assessing protein fitness landscapes in terms of activity13. As described in a previous study11 using the GFP mutant dataset of Sarkisyan et al13, we allocated the mutants with 1~3 mutations as the dataset for training and validation, and the mutants with 4 or more mutations as the dataset for testing. As shown in Table 1, PtsRep outperformed the benchmarks on the test dataset (Spearman's ρ = 0.70 versus 0.67).

We wanted to explore what these metrics mean for the actual practice of protein engineering. To this end, we directly probed how much "testing budget" PtsRep would take to arrive at (or "prioritize") target mutants. We selected the top 0.1% of the GFP mutants in the test dataset in terms of brightness, or 26 mutants in total (with 4 to 15 mutations), and tested the ability of each method to prioritize this set of brightest GFP mutants under different testing budgets. We trained models on the one-hot and KNR representations as baselines. As shown in Fig. 3A, PtsRep performed better than the benchmarks UniRep and BERT, and significantly better than the baselines. For example, PtsRep


achieved a 70% recall rate for these top 0.1% mutants with about 380 mutants searched, compared with 559 for BERT, 737 for UniRep, and 1,609 for the baseline KNR (Fig. 3A).

Subsequently, we probed how many trials PtsRep would take to identify the single brightest GFP protein. As shown in Fig. 3B, the brightest protein was found at the 28th trial with PtsRep. The next best approach arrived at it around the 70th trial, or at a cost of 150% more. It took 497 trials for the baseline KNR.
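The budget analysis can be framed as in the sketch below, which counts how many mutants must be tested in predicted-brightness order before a target fraction of the truly brightest set is recovered; the recall definition and tie handling used for Fig. 3 are assumptions.

```python
import numpy as np

def budget_to_recall(pred_scores, true_scores, top_fraction=0.001, recall_target=0.7):
    """Number of mutants that must be tested, in order of predicted brightness,
    before `recall_target` of the truly brightest `top_fraction` is recovered
    (cf. Fig. 3A). A simplified sketch of the analysis described in the text."""
    pred = np.asarray(pred_scores)
    true = np.asarray(true_scores)
    n_top = max(1, int(len(true) * top_fraction))      # e.g., top 0.1% = 26 mutants
    top_set = set(np.argsort(true)[-n_top:])           # ground-truth brightest
    ranking = np.argsort(pred)[::-1]                   # test in predicted order
    found = 0
    for budget, idx in enumerate(ranking, start=1):
        if idx in top_set:
            found += 1
            if found >= recall_target * n_top:
                return budget                          # mutants searched so far
    return len(true)
```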

We again used t-SNE to test the ability of PtsRep to cluster the GFP mutants. As shown in Fig. 4, for PtsRep the mutants with bright fluorescence were more clustered, with a DBI of 2.53, compared to KNR, which gave a more scattered distribution with a DBI of 4.62 (Supplementary Table 3).


Discussion

In this work, we presented a self-supervised learning-based framework to learn general representations from unlabeled protein tertiary structures in the PDB database (PtsRep), and transferred the learned model to two presently well recognized protein engineering benchmark tasks: prediction of protein stability and of GFP fluorescence, based on two commonly available datasets. On both tasks, PtsRep performed strongly and outperformed the protein sequence-based benchmark methods UniRep and TAPE-BERT.

As far as protein engineering is concerned, while our study showed that one can extrapolate from mutants with 1-3 mutations to mutants with 4-15 mutations, an optimal experimental search strategy that uses deep learning to reduce the experimental burden remains to be defined for a better practice of protein engineering. It is also possible to combine PtsRep with simulated annealing34 and Bayesian optimization35, or to use PtsRep for disease diagnosis by estimating the effects of nonsynonymous single-nucleotide polymorphisms10,36,37.

While what deep learning learns from data is often considered a black box, we strove to partially dissect what PtsRep has learned from the PDB database. We surmise that PtsRep has at least learned which protein structures are stable, since unstable protein structures would not have appeared in the PDB database; unstable protein structures also lead to inactive protein mutants. We also know from the practice of directed evolution of enzymes that about 30% of randomly mutated protein variants are inactive38. This explains in part the better performance of PtsRep shown in Fig. 3. This is also consistent with the fact that


PtsRep is able to perform much better than KNR in protein fold classification, as shown in Fig. 2.

However, what remains to be understood is how PtsRep is able to better cluster GFP variants with higher activities (Fig. 4). A similar observation was found for TAPE-BERT, which is based on protein sequences. It is noteworthy that the KNR format input for PtsRep in this study contains protein primary sequence information; thus it is likely that the better clustering of GFP seen for PtsRep was also sequence-based.

Last but not least, we wish to point out that, compared to the best benchmark TAPE-BERT, which used semi-supervised learning with 96 million parameters to train a deep contextual language model on 32 million sequences, our model was built with only 2.6% of the parameters, and trained on only 0.11% of the protein entries (i.e., those with structural information).


Acknowledgments

This work was supported by the National Key R&D Program of China (2018YFA0901000), and the Guangzhou Science and Technology Program key projects (201904020016). We thank Xing Zhang for technical assistance.


References

1 Mikolov T, Chen K, Corrado G et al. Efficient estimation of word representations in vector space. arXiv, 1301.3781 (2013).
2 Peters M E, Neumann M, Iyyer M et al. Deep contextualized word representations. arXiv, 1802.05365 (2018).
3 Devlin J, Chang M-W, Lee K et al. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv, 1810.04805 (2019).
4 Gidaris S, Singh P, Komodakis N. Unsupervised representation learning by predicting image rotations. arXiv, 1803.07728 (2018).
5 Goyal P, Mahajan D, Gupta A et al. Scaling and benchmarking self-supervised visual representation learning. International Conference on Computer Vision, 6400-6409 (2019).
6 Heinzinger M, Elnaggar A, Wang Y et al. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics 20, 723 (2019).
7 Yang K K, Wu Z, Bedbrook C N et al. Learned protein embeddings for machine learning. Bioinformatics 34, 2642-2648 (2018).
8 Rives A, Meier J, Sercu T et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. bioRxiv, 622803 (2020).
9 Bepler T, Berger B. Learning protein sequence embeddings using information from structure. arXiv, 1902.08661 (2019).
10 Alley E C, Khimulya G, Biswas S et al. Unified rational protein engineering with


sequence-based deep representation learning. Nature Methods 16, 1315-1322 (2019).
11 Rao R S, Bhattacharya N, Thomas N et al. Evaluating protein transfer learning with TAPE. Advances in Neural Information Processing Systems 32, 9689-9701 (2019).
12 Rocklin G J, Chidyausiku T M, Goreshnik I et al. Global analysis of protein folding using massively parallel design, synthesis, and testing. Science 357, 168-174 (2017).
13 Sarkisyan K S, Bolotin D A, Meer M V et al. Local fitness landscape of the green fluorescent protein. Nature 533, 397-401 (2016).
14 Jin P, Bulkley D, Guo Y et al. Electron cryo-microscopy structure of the mechanotransduction channel NOMPC. Nature 547, 118-122 (2017).
15 Jumper J, Evans R, Pritzel A et al. High accuracy protein structure prediction using deep learning. Fourteenth Critical Assessment of Techniques for Protein Structure Prediction (Abstract Book), 22-24 (2020).
16 Berman H M, Kleywegt G J, Nakamura H et al. The Protein Data Bank archive as an open data resource. Journal of Computer-aided Molecular Design 28, 1009-1014 (2014).
17 Wang J, Cao H, Zhang J Z H et al. Computational protein design with deep learning neural networks. Scientific Reports 8, 6349 (2018).
18 Zimmerman J M, Eliezer N, Simha R. The characterization of amino acid sequences in proteins by statistical methods. Journal of Theoretical Biology 21, 170-201 (1968).
19 Kyte J, Doolittle R F. A simple method for displaying the hydropathic character of a protein. Journal of Molecular Biology 157, 105-132 (1982).


20 Huang F, Nau W M. A conformational flexibility scale for amino acids in peptides. Angewandte Chemie International Edition 42, 2269-2272 (2003).
21 Bengio Y, Ducharme R, Vincent P et al. A neural probabilistic language model. Journal of Machine Learning Research 3, 1137-1155 (2003).
22 Jozefowicz R, Vinyals O, Schuster M et al. Exploring the limits of language modeling. arXiv, 1602.02410 (2016).
23 Peters M E, Ammar W, Bhagavatula C et al. Semi-supervised sequence tagging with bidirectional language models. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 1756-1765 (2017).
24 Ramachandran P, Zoph B, Le Q V. Searching for activation functions. arXiv, 1710.05941 (2017).
25 Ba J L, Kiros J R, Hinton G E. Layer normalization. arXiv, 1607.06450 (2016).
26 Kingma D P, Ba J L. Adam: A method for stochastic optimization. arXiv, 1412.6980 (2015).
27 AlQuraishi M. ProteinNet: a standardized data set for machine learning of protein structure. BMC Bioinformatics 20, 311 (2019).
28 Huang P S, Boyken S E, Baker D. The coming of age of de novo protein design. Nature 537, 320-327 (2016).
29 van der Maaten L, Hinton G. Visualizing data using t-SNE. Journal of Machine Learning Research 9, 2579-2605 (2008).
30 Fox N K, Brenner S E, Chandonia J M. SCOPe: Structural Classification of Proteins -


extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Research 42, D304-309 (2014).
31 Davies D L, Bouldin D W. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence 1, 224-227 (1979).
32 Kwon W S, Da Silva N A, Kellis J T, Jr. Relationship between thermal stability, degradation rate and expression yield of barnase variants in the periplasm of Escherichia coli. Protein Engineering, Design and Selection 9, 1197-1202 (1996).
33 Bommarius A S, Paye M F. Stabilizing biocatalysts. Chemical Society Reviews 42, 6534-6565 (2013).
34 Biswas S, Kuznetsov G, Ogden P J et al. Toward machine-guided design of proteins. bioRxiv, 337154 (2018).
35 Yang K K, Chen Y X, Lee A et al. Batched stochastic Bayesian optimization via combinatorial constraints design. International Conference on Artificial Intelligence and Statistics 89 (2019).
36 Fang J. A critical review of five machine learning-based algorithms for predicting protein stability changes upon mutation. Briefings in Bioinformatics 21, 1285-1292 (2020).
37 Li M, Kales S C, Ma K et al. Balancing protein stability and activity in cancer: a new approach for identifying driver mutations affecting CBL ubiquitin ligase activation. Cancer Research 76, 561-571 (2016).
38 Romero P A, Arnold F H. Exploring protein fitness landscapes by directed evolution. Nature


Reviews Molecular Cell Biology 10, 866-876 (2009).


Figure 1. Workflow of PtsRep for learning and applying protein tertiary structure representations. (A) Protein structures from the PDB were encoded with the KNR (K nearest residues) algorithm, with each amino acid represented by the nearest 15 amino acids and their features. (B) PtsRep was trained to perform contextual noncontiguous residue prediction using a cross-entropy loss function, and internally represent proteins. (C) A regression neural network was used to transfer the embedded representations for downstream protein engineering tasks.


Figure 2. t-SNE (t-distributed stochastic neighbor embedding) representations of KNR (left) and PtsRep (right) for 15,444 proteins classified by the Structural Classification of Proteins (SCOP)30, respectively. t-SNE projections from the embedded space onto a low-dimensional representation are shown, representing sequences from SCOP colored by ground-truth structural classes of five types: alpha, beta, alpha/beta, alpha+beta, and small proteins.


Figure 3. (A) The brightest 0.1% of GFP mutants observed versus the testing budget for each representation method. The abscissa uses an exponential scale. (B) The trials taken to identify the brightest GFP mutant for each representation method.


Figure 4. t-SNE visualization of KNR (left) and PtsRep (right) for GFP mutants with 4 to 15 mutations, respectively.


Table 1. Embedding performances of downstream tasks, using outlined benchmarks or baselines.

Method      Param    Data     Stability (Spearman's ρ / ACC / MSE)    Fluorescence (Spearman's ρ / ACC / MSE)
UniRep10    182M     24M      0.73 / 0.69 / 0.15                      0.67 / 0.96 / 0.20
BERT11      92M      32M      0.73 / 0.70 / 0.12                      0.68 / 0.96 / 0.22
Bepler9     19M      0.02M    0.64 / 0.67 / -                         0.33 / - / 2.17
One-Hot     0        32M      0.19 / 0.58 / 0.70                      0.14 / 0.86 / 2.69
KNR         0        0.04M    0.37 / 0.71 / 0.38                      0.67 / 0.92 / 0.66
PtsRep      2.5M     0.04M    0.79 / 0.73 / 0.09                      0.70 / 0.97 / 0.57

* taken from ref 11.
