
bioRxiv preprint doi: https://doi.org/10.1101/2020.10.08.332346; this version posted October 9, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

A deep learning framework for elucidating whole-genome chemical interaction space

Tian Cai1, Hansaim Lim2, Kyra Alyssa Abbu3, Yue Qiu4, Ruth Nussinov5,6, and Lei Xie1,2,3,4,7,*

1 Ph.D. Program in Computer Science, The Graduate Center, The City University of New York, New York, 10016, USA
2 Ph.D. Program in Biochemistry, The Graduate Center, The City University of New York, New York, 10016, USA
3 Department of Computer Science, Hunter College, The City University of New York, New York, 10065, USA
4 Ph.D. Program in Biology, The Graduate Center, The City University of New York, New York, 10016, USA
5 Computational Structural Biology Section, Basic Science Program, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA
6 Department of Molecular Genetics and Biochemistry, Sackler School of Medicine, Tel Aviv University, Tel Aviv, Israel
7 Helen and Robert Appel Alzheimer’s Disease Research Institute, Feil Family Brain & Mind Research Institute, Weill Cornell Medicine, Cornell University, New York, 10021, USA

*Corresponding Author: Lei Xie, Address: 1008N 695 Park Avenue, New York City, NY 10065, USA, Phone: 212-396-6550, Email: [email protected]

October 9, 2020

Keywords— machine learning, self-supervised learning, sequence representation, chemical-protein interaction, G-Protein Coupled Receptor, drug discovery


Abstract

Molecular interaction is the foundation of biological processes. Elucidating the genome-wide binding partners of a biomolecule will address many questions in biomedicine. However, the ligands of a vast number of proteins remain elusive. Existing methods mostly fail when the protein of interest is dissimilar from those with known functions or structures. We develop a new deep learning framework, DISAE, that incorporates biological knowledge into self-supervised learning techniques for predicting ligands of novel, unannotated proteins on a genome scale. In rigorous benchmark studies, DISAE outperforms state-of-the-art methods by a significant margin. The interpretability analysis of DISAE suggests that it learns biologically meaningful information. We further use DISAE to assign ligands to human orphan G-Protein Coupled Receptors (GPCRs) and to cluster the human GPCRome by integrating their phylogenetic and ligand relationships. The promising results of DISAE open an avenue for exploring the chemical landscape of entire sequenced genomes.


1 Introduction

Small molecules like metabolites and drugs play an essential role in modulating physiological and pathological processes. The chemical modulation of a biological system results from its interaction with biomolecules, largely proteins. Thus, the genome-wide identification of chemical-protein interactions (CPIs) will not only address many fundamental questions in biology (e.g., microbiome-host interactions mediated by metabolites) but also provide new opportunities in drug discovery and precision medicine [1]. In spite of tremendous advances in genomics, the function of a vast number of proteins, particularly their ligands, is to a great extent unknown. Among approximately 3,000 druggable genes that encode human proteins, only 5% to 10% have been targeted by an FDA-approved drug [2]. Proteins that lack ligand information are called orphan proteins in biology, or are considered unlabeled data in terms of machine learning. It is a great challenge to assign ligands to orphan proteins, especially when they are dissimilar from proteins with known structures or functions. Many experimental approaches have been developed to deorphanize orphan proteins such as G-protein coupled receptors (GPCRs) [3]. However, they are costly and time-consuming. A great deal of effort has been devoted to developing computational approaches, which may provide an efficient way to generate testable hypotheses for elucidating the ligands of orphan proteins.

With the increasing availability of solved crystal structures of proteins, homology modeling and protein-ligand docking are major methods for deorphanization [4]. However, the quality of homology models deteriorates significantly when the sequence identity between query and template is low. When a homology model of acceptable quality is unavailable, structure-based methods, either physics-based or machine learning-based [5], can be unfruitful. Moreover, protein-ligand docking suffers from a high rate of false positives because it cannot accurately model conformational dynamics, solvation effects, crystallized water molecules, and other physical phenomena. Considering that around half of Pfam families do not have any structural information [6], it is necessary to develop new sequence-based machine learning approaches to deorphanization, especially for proteins that are not close homologs of, or are even evolutionarily unrelated to, those with solved crystal structures or known ligands.

Many machine learning methods have been developed to predict CPIs from protein sequences. Early works relied on feature engineering to create representations of proteins and ligands and then used classifiers such as support vector machines [7] or matrix factorization [8][9] to make the final prediction. End-to-end deep learning approaches have recently gained momentum [10][11]. Various neural network architectures, including Convolutional Neural Networks (CNNs) [12][13], seq2seq [14][15], and Transformer [16], have been applied to represent protein sequences. These works mainly focused on filling in missing CPIs for existing drug targets. When the sequence of an orphan protein is significantly different from those in the database or training data, the performance of many existing machine learning methods deteriorates significantly [8].
To extend the scope of the state of the art to the prediction of chemical binding to orphan proteins in entire sequenced genomes, we propose a new deep learning framework that improves protein representation learning so that relationships between remote or even evolutionarily unrelated proteins can be detected. Inspired by the success of self-supervised learning on unlabeled data in Natural Language Processing (NLP) [17] and its application to biological sequences [18][19][20], we propose a new protein representation method, DIstilled Sequence Alignment Embedding (DISAE), for the deorphanization of remote orphan proteins. Although the number of labeled proteins with known ligands is limited, DISAE can utilize all available unlabeled protein sequences without knowledge of their functions or structures. By incorporating biological knowledge into the sequence representation, DISAE can learn functionally important information about protein families that span a wide range of protein space. Furthermore, we devise a module-based pre-training-fine-tuning strategy using ALBERT [17]. To our knowledge, this is the first time a pre-trained-fine-tuned protein sequence model has been applied to the deorphanization of novel proteins on a genome scale.

In the benchmark study, DISAE significantly outperforms other state-of-the-art methods for the prediction of chemical binding to dissimilar orphan proteins by a large margin. Furthermore, the interpretability analysis of DISAE suggests that it learns biologically meaningful information. We apply DISAE to the deorphanization of G-protein coupled receptors (GPCRs). GPCRs play a pivotal role in numerous physiological and pathological processes. Due to their associations with many human diseases and high druggability, GPCRs are the most studied drug targets [21]. Around one-third of FDA-approved drugs target GPCRs [21]. In spite of intensive studies of GPCRs, the endogenous and surrogate ligands of a large number of GPCRs remain unknown [3]. Using DISAE, we confidently assign at least one ligand to 649 orphan GPCRs in Pfam. For 106 of the orphan GPCRs, at least one approved GPCR-targeted drug is predicted as a ligand with an estimated false positive rate lower than 0.05. These predictions merit further experimental validation. In addition, we cluster the human GPCRome by integrating sequence and ligand relationships. The promising results of DISAE open an avenue for exploring the chemical landscape of all sequenced genomes. The code of DISAE is available from GitHub1.

1 https://github.com/XieResearchGroup/DISAE


2 Results and Discussion

2.1 Overview of methodology

As illustrated in Figure 1A, the proposed method is designed to predict chemical binding to remote orphan proteins that do not have detectable relationships to annotated (i.e., labeled) proteins with known structures or functions. It differs from most current works, which focus on the assignment of functions to proteins that are homologous to annotated proteins. Our method mainly consists of two stages (Figure 1B, C). The first stage is unsupervised learning of protein representations using only sequence data from all non-redundant sequences in Pfam-A families [22], without the need for any annotated structural or functional information. We develop a new algorithm, DIstilled Sequence Alignment Embedding (DISAE), for the self-supervised learning (a special form of unsupervised learning) of the protein representation. Different from existing sequence pre-training strategies that use original protein sequences as input [18][19][20], DISAE distills the original sequence into an ordered list of triplets by excluding evolutionarily unimportant positions from a multiple sequence alignment (Figure 1B). The purpose of the sequence distillation is twofold: improving the efficiency of sequence pre-training, in which long sequences cannot be handled, and reducing the noise in the input sequence. Long-range residue-residue interactions are then learned via the Transformer module in ALBERT. A self-supervised masked language modeling (MLM) approach is used at this stage: 15% of the triplets are randomly masked and assumed to be unknown, and the remaining triplets are used to predict the masked ones. In the second stage, a supervised learning model is trained to predict whether a chemical-protein pair interacts, using known CPIs collected from chemical genomics databases [23][24][25][26]. The input of the supervised learning includes both the representation of the chemical structure from a neural fingerprint [27] and the pre-trained protein sequence embedding from the first stage. We develop a module-based fine-tuning strategy to balance the information learned from the unsupervised and supervised stages, and apply attention pooling [28] to model the interaction between chemical substructures and protein residues. More details on the sequence representation learning, the architecture of the neural network, benchmark data sets, training and evaluation procedures, and data analysis can be found in the Methods section.
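To make the masked language modeling step concrete, the sketch below shows one way the 15% triplet masking could be implemented; the function name and mask token are illustrative assumptions, not the authors' code.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed mask symbol; the actual vocabulary token may differ

def mask_triplet_sentence(triplets, mask_prob=0.15, seed=None):
    """Randomly mask ~15% of the triplets in a distilled protein sentence.

    Returns the masked sentence and the (position, original triplet) labels
    that the MLM objective is trained to recover.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for pos, tok in enumerate(triplets):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append((pos, tok))
        else:
            masked.append(tok)
    return masked, labels

# Example: a toy distilled sentence of ordered triplets (gaps allowed)
sentence = ["MKT", "TLL", "LIA", "AVV", "---", "GQR"]
masked_sentence, mlm_labels = mask_triplet_sentence(sentence, seed=0)
```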

2.2 Pre-trained DISAE triplet vector is biochemically meaningful

When pre-training the ALBERT model with masked triplets of proteins, the masked word prediction accuracy reached 0.982 and 0.984 at the 80,000th and the 100,000th training step, respectively. For comparison, the pre-training accuracy when using BERT [29] was 0.955 at the 200,000th training step. This may be because a larger batch size can be used in ALBERT than in BERT. The DIstilled Sequence Alignment Embedding (DISAE) from ALBERT pre-training is subsequently used for the CPI prediction. To evaluate whether the pre-trained DISAE vectors are biochemically meaningful, we extracted the pre-trained DISAE vector for all possible triplets. We then used t-SNE to project the triplet vectors into a 2D space. As shown in Supplementary Figure 1, the triplet vectors formed distinct clusters according to the side-chain properties of the third amino acid of each triplet, especially at levels 3 and 4. Triplets containing any ambiguous or uncommon amino acids, such as U for selenocysteine or X for any unresolved amino acid, formed one large cluster that did not split into smaller groups (the large group of black dots in each scatter plot), suggesting that information regarding such rare amino acids is scarce in the pre-training dataset. When there is no ambiguity in the triplets, they form clearly separated clusters, meaning that the pre-training process indeed allows the model to extract biochemically meaningful feature vectors from proteins treated as sentences. The same clustering trends were also observed when triplets were grouped by individual amino acids rather than by the physicochemical properties of their side chains (Supplementary Figure 2).
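The 2D projection described above can be reproduced in outline with scikit-learn; the embedding file name and the perplexity setting below are illustrative assumptions about how the per-triplet vectors are stored and projected.

```python
import numpy as np
from sklearn.manifold import TSNE

# One pre-trained DISAE embedding per triplet in the vocabulary, assumed to be
# stored as a dense array of shape (num_triplets, embedding_dim).
triplet_vectors = np.load("disae_triplet_embeddings.npy")  # hypothetical file

# Project to 2D for visualization; perplexity and seed are illustrative choices.
coords_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(triplet_vectors)

# coords_2d can then be scattered and colored by the side-chain property of the
# third amino acid of each triplet, as in Supplementary Figure 1.
```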

2.3 DISAE significantly outperforms state-of-the-art models for predicting ligands of remote orphan proteins

To evaluate the performance of DISAE for the prediction of ligands of remote orphan proteins, we perform experiments on a dissimilar protein benchmark. In this benchmark, the proteins in the testing data (i.e., query proteins) are significantly dissimilar from those in the training data (i.e., template proteins). As shown in Figure 2, the sequence identities of protein pairs between the query and the template are all less than 40%. The pairs with an alignment length larger than 150 (around 50% coverage of the protein) and a sequence identity larger than 30% constitute less than 10% of all pairs. Thus, homology-based methods cannot be applied to the majority of cases in the benchmark. We first evaluate the performance of two baseline models, LSTM and Transformer, on the dissimilar protein benchmark. In this conventional setting, the ALBERT model in Figure 1C is replaced by an LSTM or a Transformer. In addition, there is no pre-training stage; only labeled data are used directly for supervised learning. As shown in Figures 3A and 3B, the performance of the conventional LSTM model is close to random, with a ROC-AUC of 0.476 and a PR-AUC of 0.261. The Transformer-based architecture achieved state-of-the-art performance in recently published work [16]. Indeed, it outperforms the LSTM model with a ROC-AUC of 0.648 and a PR-AUC of 0.380. Interestingly, the ALBERT pre-trained model significantly improves on the state of the art. The ALBERT pre-trained model with the optimal configuration (transformer-frozen)


has a ROC-AUC of 0.725 and a PR-AUC of 0.589. Furthermore, as shown in the inset of Figure 3A, the ALBERT pre-trained transformer-frozen model not only has the highest ROC-AUC and PR-AUC but also far exceeds the baseline models in the range of low false positive rates. For example, at a false positive rate of 0.05, the number of true positives detected by the ALBERT pre-trained transformer-frozen model is almost 6 times that correctly predicted by the Transformer or the LSTM.

When the train/test set is split at random, i.e., the testing set contains proteins similar to those in the training set, the performance of the LSTM models is slightly better than that of the ALBERT pre-trained models, as shown in Table 1. However, the performance of the LSTM model deteriorates significantly when the proteins in the testing set are different from those in the training set. Its ROC-AUC drops from over 0.9 to ~0.4-0.6, and its PR-AUC decreases from over 0.8 to ~0.2-0.3, while the ALBERT pre-trained models still maintain a ROC-AUC of ~0.7 and a PR-AUC of ~0.5. A significant performance drop for the prediction of dissimilar proteins is also observed for the Transformer model. These results suggest that supervised learning alone is prone to overfitting, as observed in the Transformer and LSTM models, and thus cannot generalize to modeling remote orphan proteins. The pre-training of DISAE, which uses a large number of unlabeled proteins, improves the generalization power. The training curves in Supplementary Figures 3, 4, and 5 further support that the ALBERT frozen-transformer model is generalizable and can thus reliably maintain its high performance when used for the deorphanization of dissimilar proteins. When evaluated on the dissimilar protein benchmark, the training accuracy keeps increasing with the number of epochs, and the training performance of the ALBERT pre-trained transformer-frozen model is slightly worse than that of most models. However, the PR-AUC of the ALBERT pre-trained transformer-frozen model is relatively stable and significantly higher than that of the other models on the testing data. These observations further support that DISAE is capable of predicting ligand binding to novel proteins.

2.4 DISAE significantly outperforms state-of-the-art models for predicting ligand binding promiscuity across gene families

We further evaluate the performance of the ALBERT pre-trained model and compare it with the baseline Transformer and LSTM models on the prediction of ligand binding promiscuity across gene families. Specifically, we predict the binding of new kinase inhibitors to new GPCRs with a model trained using only GPCR sequences that are not known to bind kinase inhibitors and chemicals that have not been annotated as kinase inhibitors. In other words, all GPCRs that bind kinase inhibitors and all chemicals that are kinase inhibitors are excluded from the supervised training set. Although kinases and GPCRs belong to two completely different gene families in terms of sequence, structure, and function, a number of kinase inhibitors can bind to GPCRs as off-targets. We use these annotated GPCR-kinase inhibitor pairs as the test set. Interestingly, although DISAE is not extensively trained on a comprehensive chemical data set, the ALBERT pre-trained transformer-frozen model outperforms the LSTM and the Transformer (distilled triplets). As shown in Figures 3C and 3D, both the sensitivity and the specificity of the ALBERT pre-trained transformer-frozen model exceed those of the other models. The ROC-AUCs of the ALBERT pre-trained transformer-frozen model, the Transformer, and the LSTM model are 0.690, 0.687, and 0.667, respectively. Their PR-AUCs are 0.673, 0.630, and 0.590, respectively. The improvement is most obvious in the region of low false positive rates (the inset of Figure 3C). At a false positive rate of 0.05, the number of true positives detected by the ALBERT pre-trained transformer-frozen model is about 50% higher than that detected by the Transformer and LSTM models. This observation implies that the sequence pre-training captures certain ligand binding site information across gene families. It is noted that it is infeasible to apply homology modeling-based methods to infer ligand binding promiscuity across gene families.

2.5 The effect of distilled sequence representation, pre-training, and fine-tuning

To understand the contribution of each component of the ALBERT pre-training-fine-tuning procedure to the performance, we conduct a series of ablation studies, as shown in Table 1, under the dissimilar protein and cross-gene-family benchmarks. Of the major pre-training-fine-tuning configurations of DISAE, the combination of the distilled-triplet sequence representation, pre-training on the whole Pfam, and fine-tuning with a frozen transformer ("ALBERT frozen transformer" in Table 1) is recommended as the best-performing model for predicting ligands of remote orphan proteins.

(a) Triplet is preferred over singlet on the ALBERT model

Under the same pre-training and fine-tuning settings, both the ROC-AUC and PR-AUC of the ALBERT pre-trained model that uses distilled triplets are significantly higher than those obtained when distilled singlets are used. However, for the LSTM model, distilled singlets sometimes outperform distilled triplets. This could be because the triplet encodes more relational information for remote proteins than for close homologs.

(b) Pre-training on a larger dataset is preferred

With the same fine-tuning strategy (frozen transformer) and the use of distilled triplets, the model pre-trained on the whole Pfam performs better than the one pre-trained on the GPCR family alone in terms of both ROC-AUC and PR-AUC.

(c) Partially frozen transformer is preferred

With the same pre-training on the whole Pfam and fine-tuning on distilled triplets, the ALBERT pre-trained transformer-frozen model outperforms all other models that have only the embedding layer frozen or both the transformer and embedding layers frozen.

2.6 DISAE learns biologically meaningful information

Interpretation of deep learning models is critical for their real-world applications. To understand whether the trained DISAE model is biologically meaningful, we perform a model explainability analysis using SHapley Additive exPlanation (SHAP) [30]. SHAP is a game-theoretic approach to explaining the output of any machine learning model, and Shapley values can be interpreted as feature importance. We use this tool to take a closer look into the internal decision making of DISAE's predictions by calculating the Shapley value of each triplet of a protein sequence. The average Shapley value over the CPIs of a protein is used to highlight important positions for that protein.

Figure 4 shows the distribution on the structure of 5-hydroxytryptamine receptor 2B of a number of residues that are among the 21 (10%) residues with the highest SHAP values. Among them, 6 residues (T140, V208, M218, F341, L347, Y370) are located in the binding pocket. L378 is centered in the functionally conserved NPxxY motif that connects transmembrane helix 7 and cytoplasmic helix 8 and plays a critical role in the activation of GPCRs [31][32]. P160 and I161 are part of intracellular loop 2, while I192, G194, I195, and E196 are located in extracellular loop 2. Intracellular loop 2 interacts with the P-loop of the G-protein [33]. It has been proposed that extracellular loop 2 may play a role in the selective switch of ligand binding and determine ligand binding selectivity and efficacy [34][35][36][37]. The functional impact of the other residues is unclear. Nevertheless, more than half of the top 21 residues ranked by SHAP values can explain the trained model. The enrichment of ligand binding site residues is statistically significant (p-value = 0.01). These results suggest that the predictions from DISAE can provide biologically meaningful interpretations.

2.7 Application to the hierarchical classification and deorphanization of human GPCRs

Given the established generalization power of DISAE, we use the ALBERT transformer-frozen model pre-trained on the whole Pfam with distilled triplets to tackle the challenge of deorphanization of human GPCRs, because of its consistently excellent performance across all evaluation metrics in the benchmarks.

We define orphan GPCRs as those that do not have known small-molecule binders. 649 human GPCRs annotated in Pfam families PF00001, PF13853, and PF03402 are identified as orphan receptors.

Studies have suggested that the classification of GPCRs should be inferred by combining sequence and ligand information [38]. The protein embedding of the DISAE model after pre-training-fine-tuning satisfies this requirement. Therefore, we use the cosine similarity between the embedded protein vectors as a metric to cluster the human GPCRome, which includes both non-orphan and orphan GPCRs. The hierarchical clustering of GPCRs in Pfam PF00001 is shown in Figure 5. Supplementary Figures 6 and 7 show the hierarchical clustering of PF13853 and PF03402, respectively.

Table 2 provides examples of approved drugs predicted by DISAE to bind orphan GPCRs with high confidence (false positive rate < 5.0e-4). The complete list of orphan human GPCRs paired with 555 approved GPCR-targeted drugs is in Supplementary Table 1. The predicted potential interactions between the approved drugs and the orphan GPCRs will not only facilitate the design of experiments to deorphanize GPCRs but also provide new insights into the mode of action of existing drugs for drug repurposing, polypharmacology, and side effect prediction.

3 Conclusion

Our primary goal in this paper is to address the challenge of predicting ligands of orphan proteins that are significantly dissimilar from proteins with known ligands or solved structures. To address this challenge, we introduce new techniques for protein sequence representation by pre-training on distilled sequence alignments and module-based fine-tuning using labeled data. Our approach, DISAE, is inspired by state-of-the-art algorithms in NLP. However, our results suggest that the direct adaptation of NLP may be less fruitful. The successful application of NLP to biological problems requires the incorporation of domain knowledge in both the pre-training and fine-tuning stages. In this regard, DISAE significantly improves the state of the art in the deorphanization of dissimilar orphan proteins. Nevertheless, DISAE can be further improved in several aspects. First, more biological knowledge can be incorporated into the pre-training and fine-tuning at both the molecular level (e.g., protein structure and ligand binding site information) and the system level (e.g., protein-protein interaction networks). Second, within the framework of self-supervised learning, a wide array of techniques can be adapted to address the problems of bias, sparsity, and noise in the training


data. Taken together, new machine learning algorithms that can predict endogenous or surrogate ligands of orphan proteins open up a new avenue for deciphering biological systems, drug discovery, and precision medicine.

4 Methods

4.1 Protein sentence representation from distilled sequence alignment

A protein sequence is converted into an ordered list of amino acid fragments (words) with the following steps, as illustrated in Figure 1B.

1. Given a sequence of interest S, a multiple sequence alignment is constructed from a group of sequences similar to S. In this paper, the pre-computed alignments in Pfam [22] are used.
2. Amino acid conservation at each position is determined.
3. Starting from the most conserved positions as defined in Pfam, a pre-defined number of positions in the alignment are selected by the ranking of conservation. In this study, the number of positions is set to 210. The number of positions was not optimized.
4. A word, either a single amino acid or a triplet of amino acids in the sequence, is selected at each position. The triplet may include gaps.
5. Finally, all selected words from the sequence form an ordered list following the sequence order, i.e., the sentence representation of the protein.

The rationale for the distilled multiple sequence alignment representation is to use only functionally or evolutionarily important residues and ignore the others. It can be considered a feature selection step. In addition, the use of multiple sequence alignments allows us to correlate the functional information with the position encoding. The distilled sequence will not only reduce the noise but also increase the efficiency of model training, since the memory and time complexity of the language model is O(n^2). Note that we use conservation to select residues in a sequence because it is relevant to protein function and ligand binding. Other criteria (e.g., co-evolution) could be used depending on the downstream application (e.g., protein structure prediction).
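The distillation procedure above can be sketched as follows; the inputs (one aligned sequence with gaps plus per-column conservation scores) and the exact way each triplet is built from its neighboring columns are assumptions for illustration, not the released implementation.

```python
def distill_alignment(aligned_seq, conservation, n_positions=210):
    """Turn one aligned sequence (with '-' gaps) into an ordered triplet sentence.

    aligned_seq  : the protein's row in the Pfam multiple sequence alignment
    conservation : one conservation score per alignment column (higher = more conserved)
    n_positions  : number of columns to keep (210 in the paper)
    """
    # Rank alignment columns by conservation, keep the top n_positions,
    # then restore alignment order so the sentence follows the sequence order.
    ranked = sorted(range(len(aligned_seq)), key=lambda c: conservation[c], reverse=True)
    kept = sorted(ranked[:n_positions])

    # One word per kept column: the residue at that column plus its two
    # alignment neighbors, so gaps ('-') can appear inside a triplet.
    sentence = []
    for c in kept:
        left = aligned_seq[c - 1] if c > 0 else "-"
        right = aligned_seq[c + 1] if c + 1 < len(aligned_seq) else "-"
        sentence.append(left + aligned_seq[c] + right)
    return sentence

# Toy example with a short "alignment" and made-up conservation scores
triplets = distill_alignment("MK-TLLIAVV", [5, 1, 0, 4, 3, 2, 6, 1, 2, 0], n_positions=4)
```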

4.2 Distilled sequence alignment embedding (DISAE) using ALBERT

It has been shown that Natural Language Processing (NLP) algorithms can extract biochemically meaningful vectors by pre-training BERT [29] on 86 billion amino acids (words) in 250 million protein sequences [39]. A recently released light version of BERT, called ALBERT [17], boasts a significantly smaller memory footprint with better or equivalent performance compared to BERT [29]. We extend the idea of unsupervised pre-training of proteins using the ALBERT algorithm with the distilled multiple sequence alignment representation. The distilled ordered list of triplets is used as the input for ALBERT pre-training. In this work, only the masked language model (MLM) objective is used for pre-training.

4.3 Architecture of deep learning model for whole-genome CPI prediction

The deep learning model for whole-genome CPI prediction is mainly composed of three components, as shown in Figure 1C: protein embedding by DISAE, chemical compound embedding, and attentive pooling with a multilayer perceptron (MLP) to model CPIs. DISAE is described in the previous section. Note that, once processed through ALBERT, each protein is represented as a matrix of size 210 by 312, where each triplet is a vector of length 312. During the protein-ligand interaction prediction task, the protein embedding matrix is compressed using a ResNet [40]. Once processed through the ResNet layers, each protein is represented as a vector of length 256, which contains compressed information for the whole 210 input triplets of the corresponding protein.

A neural molecular fingerprint [41] is used for the chemical embedding. A small molecule is represented as a 2D graph, where vertices are atoms and edges are bonds. We use a popular graph convolutional neural network to process ligand molecules [27].

The attentive pooling is similar to the design in [11]. For each putative chemical-protein pair, the corresponding embedding vectors are fed to the attentive pooling layer, which in turn produces the interaction vector.

More details on the neural network model configurations can be found in Supplementary Tables 2, 3, and 4.
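A minimal PyTorch-style sketch of how the three components described above could be wired together is shown below. The tensor sizes follow the text (210x312 protein matrix, length-256 protein vector), but the module names, the simplified attentive pooling, and the stand-in compressor are illustrative assumptions rather than the published architecture.

```python
import torch
import torch.nn as nn

class CPIClassifier(nn.Module):
    """Sketch of the CPI head: DISAE protein embedding + chemical fingerprint
    -> attentive pooling -> MLP score."""
    def __init__(self, prot_dim=312, prot_len=210, chem_dim=256, hidden=256):
        super().__init__()
        # Stand-in for the ResNet compressor: 210 x 312 matrix -> length-256 vector
        self.prot_compress = nn.Sequential(
            nn.Linear(prot_dim * prot_len, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        # Bilinear attention between the two embeddings (simplified attentive pooling)
        self.attn = nn.Bilinear(hidden, chem_dim, 1)
        self.mlp = nn.Sequential(nn.Linear(hidden + chem_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, prot_matrix, chem_fp):
        # prot_matrix: (batch, 210, 312) from pre-trained DISAE/ALBERT
        # chem_fp:     (batch, 256) neural molecular fingerprint
        prot_vec = self.prot_compress(prot_matrix.flatten(1))
        weight = torch.sigmoid(self.attn(prot_vec, chem_fp))    # interaction weight
        pair = torch.cat([prot_vec * weight, chem_fp], dim=-1)  # interaction vector
        return self.mlp(pair)                                   # active / inactive logits

model = CPIClassifier()
logits = model(torch.randn(4, 210, 312), torch.randn(4, 256))
```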

4.4 Module-based fine-tuning strategy

When applying ALBERT to a supervised learning task, fine-tuning [17] is a critical step of task-specific training following the pre-training. The pre-trained ALBERT has already learned to generate protein representations in a meaningful way. However, it is a design choice whether to allow ALBERT to be updated and trained together with the other components of the classification system during fine-tuning [17][42]. Updating ALBERT during fine-tuning allows the pre-trained protein encoder to better capture knowledge from the training data, whereas keeping it frozen minimizes the risk of a significant loss of the knowledge obtained from the pre-training. Hence, to find the right balance, we experiment with ALBERT models that are partially unfrozen [42]. To be specific, the major modules of the ALBERT model, the embedding or the transformer [43], are unfrozen separately as different variants of the model. The idea of "unfrozen" layers is widely used in Natural Language Processing (NLP), e.g., in ULMFiT [42], where the model consists of several layers that are gradually unfrozen as training proceeds to learn task-specific knowledge while safeguarding the knowledge gained from pre-training. However, this is not straightforwardly applicable to ALBERT because ALBERT is not a linearly layered-up architecture. Hence, we apply a module-based unfreezing strategy.
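As a sketch of the module-based strategy, the snippet below freezes either the embedding module or the transformer module of a Hugging Face ALBERT encoder; the checkpoint name and the attribute paths (albert.embeddings, albert.encoder) are assumptions about the model object and may differ from the authors' implementation.

```python
from transformers import AlbertModel

def freeze_module(module):
    """Disable gradient updates for every parameter in a module."""
    for param in module.parameters():
        param.requires_grad = False

albert = AlbertModel.from_pretrained("albert-base-v2")  # placeholder checkpoint

# "ALBERT frozen transformer": keep the transformer fixed, fine-tune the embeddings
freeze_module(albert.encoder)

# Alternatively, "ALBERT frozen embedding": keep the embeddings fixed instead
# freeze_module(albert.embeddings)

trainable = [name for name, p in albert.named_parameters() if p.requires_grad]
```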

4.5 Experiment design

The purpose of this study is to build a model that predicts chemical binding to novel orphan proteins. Therefore, we design experiments to examine the model's generalization capability to data that are not only unseen but also from significantly dissimilar proteins. We split the training/validation/testing data sets to assess the performance of the algorithms in three scenarios:

1. The proteins in the testing data set are significantly different from those in the training and validation data sets based on sequence similarity.
2. The ligands in the testing data are from a different gene family than those in the training/validation data.
3. The whole data set is split randomly, similar to most existing work.

To examine the effect of different pre-training-fine-tuning algorithms, we organize the experiments into three categories of comparison, as shown in Table 1:

(a) THE EFFECT OF VOCABULARY: Taking a protein sequence as a sentence, its vocabulary could be built in many ways. We compare taking the singlet as the vocabulary against taking the triplet as the vocabulary.

(b) THE EFFECT OF PRE-TRAINING: We assess how unlabeled protein sequences affect the performance of the classification. We compare ALBERT pre-trained on the whole Pfam against one pre-trained on GPCRs alone and one without pre-training.

(c) THE EFFECT OF FINE-TUNING: We compare three ALBERT models: ALBERT all unfrozen, ALBERT frozen embedding, and ALBERT frozen transformer [43]. All of these models are pre-trained on the whole Pfam.

We use LSTM [44] and Transformer (distilled triplets) as baselines. Two variants of the LSTM model are tested to compare with the above three groups of experiments: LSTM with distilled triplets and LSTM with distilled singlets.

4.6 Dataset

Our task is to learn from large-scale CPI data to predict unexplored interactions. The quality and quantity of the training samples are critical for biologically meaningful predictions. Despite continuous efforts in the community, a single data source typically curates an incomplete list of our knowledge of protein-ligand activities. Thus, we integrated multiple high-quality, large-scale, publicly available databases of known protein-ligand activities. We extracted, split, and represented protein-ligand activity samples for training and evaluation of the machine learning-based predictions.

• Sequence data for ALBERT pre-training

Protein sequences are first collected from the Pfam-A [22] database. The sequences are then clustered at 90% sequence identity, and a representative sequence is selected from each cluster. The resulting 40,282,439 sequences (including 138,288 GPCRs) are used for ALBERT pre-training. To construct protein sentences from the distilled sequence alignment, the original alignment and conservation scores from each Pfam-A family are used. As a comparison, the 35,181 GPCR sequences from GPCR family PF00001 alone are used to pre-train a separate ALBERT model.

We used the Pfam alignments directly. From the consensus alignment sequence, we collected the positions of high-confidence and low-confidence conserved amino acids together with the conservatively substituted ones. We picked these positions from each of the target GPCRs. As a result, each GPCR is represented by 210 amino acids, which may contain gaps. The 210 amino acids are then treated as a sentence of triplets. The triplets are used to train the ALBERT model with the following parameters: maximum sequence length = 256, maximum predictions per sentence = 40, word masking probability = 0.15, and duplication factor = 10. Note that the order of different GPCRs in a multiple sequence alignment may not be biologically meaningful. We therefore did not apply the next-sentence prediction task during pre-training.

• Binding assay data for supervised learning

We integrated protein-ligand activities involving any GPCR from ChEMBL [23] (ver. 25), BindingDB [24] (downloaded Jan 9, 2019), GLASS [25] (downloaded Nov 26, 2019), and DrugBank [26] (ver. 5.1.4). Note that BindingDB also contains samples drawn from multiple sources, including PubChem, PDSP Ki, and U.S. patents. From the ChEMBL, BindingDB, and GLASS databases, protein-ligand activity assays measured in three different unit types, pKd, pKi, and pIC50, are collected. Log-transformations were performed for activities reported in Kd, Ki, or IC50. For consistency, we did not convert between different activity types. For instance, activities reported in IC50 are converted only to pIC50, and not to any other activity type. The activities in log scale were then binarized based on thresholds of the activity values.


Protein-ligand pairs were considered active if pIC50 > 5.3, pKd > 7.3, or pKi > 7.3, and inactive if pIC50 < 5.0, pKd < 7.0, or pKi < 7.0, respectively.

$$\mathrm{pIC_{50}} = -\log(\mathrm{IC_{50}}), \qquad \mathrm{pK_i} = -\log(K_i), \qquad \mathrm{pK_d} = -\log(K_d)$$

$$Y \in \mathbb{R}^{m \times n}$$

$$Y_{i,j} = 1 \ \text{ if } \ \frac{\sum_{k=1}^{|A_1(i,j)|} A_1(i,j)_k}{|A_1(i,j)|} \ge 5.3, \ \text{ or } \ \frac{\sum_{k=1}^{|A_2(i,j)|} A_2(i,j)_k}{|A_2(i,j)|} \ge 7.3, \ \text{ or } \ \frac{\sum_{k=1}^{|A_3(i,j)|} A_3(i,j)_k}{|A_3(i,j)|} \ge 7.3$$

$$Y_{i,j} = -1 \ \text{ if } \ \frac{\sum_{k=1}^{|A_1(i,j)|} A_1(i,j)_k}{|A_1(i,j)|} \le 5.0, \ \text{ or } \ \frac{\sum_{k=1}^{|A_2(i,j)|} A_2(i,j)_k}{|A_2(i,j)|} \le 7.0, \ \text{ or } \ \frac{\sum_{k=1}^{|A_3(i,j)|} A_3(i,j)_k}{|A_3(i,j)|} \le 7.0$$

In the above equations, m and n are the total numbers of unique proteins and ligands, respectively. A1(i,j), A2(i,j), and A3(i,j) are the lists of all activity values for the ith protein and jth ligand in pIC50, pKd, and pKi, respectively. |A| denotes the cardinality of the set A. Note that there are gray areas between the activity thresholds. Protein-ligand pairs falling in the gray areas were considered undetermined and were not used for training. If multiple activities were reported for a protein-ligand pair, their log-scaled activity values were averaged for each activity type and binarized accordingly. In addition, we collected active protein-ligand associations from DrugBank and integrated them with the binarized activities mentioned above. Inconsistent activities (e.g., protein-ligand pairs that appear both active and inactive) were removed. There are in total 9,705 active and 25,175 inactive pairs in the benchmark set.
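A small sketch of this labeling rule is shown below; the per-pair activity dictionary and the use of None for undetermined or inconsistent pairs are illustrative assumptions about the data structures, not the authors' pipeline.

```python
# Thresholds from the text: (active >=, inactive <=) per activity type
THRESHOLDS = {"pIC50": (5.3, 5.0), "pKd": (7.3, 7.0), "pKi": (7.3, 7.0)}

def binarize_pair(activities):
    """Label one protein-ligand pair from its log-scale activities.

    activities: dict mapping activity type -> list of log-scale values,
                e.g. {"pIC50": [5.6, 5.9], "pKi": [7.4]}
    Returns 1 (active), -1 (inactive), or None (undetermined / gray area / inconsistent).
    """
    active, inactive = False, False
    for unit, values in activities.items():
        if not values:
            continue
        mean_value = sum(values) / len(values)   # average per activity type
        hi, lo = THRESHOLDS[unit]
        if mean_value >= hi:
            active = True
        elif mean_value <= lo:
            inactive = True
    if active and inactive:
        return None   # inconsistent evidence is removed in the paper
    if active:
        return 1
    if inactive:
        return -1
    return None       # gray area: neither threshold met

label = binarize_pair({"pIC50": [5.6, 5.9], "pKi": [7.4]})  # -> 1 (active)
```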

4.7 Benchmark

To test the model's generalization capability, a protein-similarity-based data splitting strategy is implemented. First, the pair-wise protein similarity based on bit-score is calculated using BLAST [45] for all GPCRs in the data set. The similarity between protein i and protein j is defined as:

$$\mathrm{SimilarityScore}(i,j) = \frac{\mathrm{BitScore}(i,j)}{\sqrt{\mathrm{BitScore}(i,i) \times \mathrm{BitScore}(j,j)}}$$

Then, according to the similarity distribution, a similarity threshold is set for the split. The bit-score similarity threshold is 0.035. The sequences are clustered such that the sequences in the testing set are significantly dissimilar from those in the training/validation set, as shown in Figure 2. After the split, there are 25,114, 6,278, and 3,488 samples for training, validation, and testing, respectively. The distribution of protein sequence similarity scores can be found in Supplementary Figure 8.
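The normalized bit-score similarity can be computed as below from pre-parsed BLAST bit-scores; the dictionary-of-bit-scores input is an assumption about how the BLAST output is stored, not part of the published code.

```python
import math

def similarity_score(bitscores, i, j):
    """Normalized BLAST bit-score similarity between proteins i and j.

    bitscores: dict keyed by (query, subject) with raw BLAST bit-scores,
               including the self-hits (i, i) and (j, j).
    """
    return bitscores[(i, j)] / math.sqrt(bitscores[(i, i)] * bitscores[(j, j)])

def is_dissimilar(bitscores, i, j, threshold=0.035):
    """True if the pair falls below the split threshold used in the paper."""
    return similarity_score(bitscores, i, j) < threshold

# Toy example with made-up bit-scores
scores = {("A", "A"): 600.0, ("B", "B"): 550.0, ("A", "B"): 15.0}
print(is_dissimilar(scores, "A", "B"))  # True: 15 / sqrt(600*550) ~ 0.026 < 0.035
```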

4.8 Ensemble model for the GPCR deorphanization

All annotated GPCR-ligand binding pairs are used to build the prediction model for GPCR deorphanization. To reduce possible overfitting, an ensemble model is constructed using the chosen ALBERT model. Following the strategy of cross-validation [46], three DISAE models are trained. As in the benchmark experiments, a hold-out set is selected based on protein similarity and is used for early stopping at the preferred epoch for each individual model. Max-voting [47] is used to make the final prediction for the orphan human GPCRs.

To estimate the false positive rate of the predictions on the orphan-chemical pairs, prediction score distributions are collected for the known positive and negative pairs of the testing data. Assuming that the prediction scores of the orphan pairs have the same distribution as those of the testing data, a false positive rate can be estimated for each prediction score based on the score distributions of the true positives and negatives.
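One simple way to turn a score threshold into an estimated false positive rate, under the stated assumption that orphan-pair scores follow the test-set score distribution, is sketched below; the empirical-fraction estimator and the toy negative-score sample are illustrative choices, not the authors' exact procedure.

```python
import numpy as np

def estimated_fpr(score, negative_scores):
    """Estimate the false positive rate at a given prediction score.

    negative_scores: prediction scores of the known negative test pairs.
    The FPR at `score` is the fraction of negatives scored at least that high.
    """
    negative_scores = np.asarray(negative_scores)
    return float(np.mean(negative_scores >= score))

def majority_vote(predictions):
    """Max-voting over the three ensemble members' binary (0/1) predictions."""
    return 1 if sum(predictions) >= 2 else 0

# Example: an orphan pair scored 0.97 by the ensemble; toy negative-score sample
neg = np.random.beta(2, 5, size=10_000)   # stand-in for test-set negative scores
print(estimated_fpr(0.97, neg))           # small value -> high-confidence prediction
print(majority_vote([1, 1, 0]))           # -> 1
```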

4.9 SHAP analysis

Kernel SHAP [30] is used to calculate SHAP values for the distilled protein sequences. It is a specially weighted local linear regression for estimating SHAP values for any model. The model used here is the classifier that fine-tunes the ALBERT frozen transformer, pre-trained on the whole Pfam with distilled triplets, under the remote GPCR setting. The data used is the testing set generated under the same remote GPCR setting. Although the input features to the classifier consist of both the distilled protein sequence and the chemical neural fingerprint, only the protein features are of interest here. Hence, when calculating the base value (the value that would be predicted if we did not know any features) required by the SHAP analysis, all protein sequences in the testing set are masked with the same token, while the chemical neural fingerprints remain untouched. Therefore, the base value is the average prediction score without the protein features, relying solely on the chemical features. Since the distilled sequences are all of length 210, the SHAP values are the feature importances for each of the 210 positions.
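A hedged sketch of this analysis with the shap package is shown below; the prediction wrapper, the mask token id, and the all-masked background construction are assumptions about how the fine-tuned classifier is wrapped, not the authors' script.

```python
import numpy as np
import shap

SEQ_LEN, MASK_ID = 210, 0   # assumed mask token id for the distilled-triplet vocabulary

def predict_fn(protein_tokens):
    """Wrapper expected by KernelExplainer: takes an array of shape
    (n_samples, 210) of triplet token ids and returns prediction scores.
    The chemical fingerprint is held fixed inside this closure."""
    # return model.score(protein_tokens, fixed_chemical_fp)  # model is assumed
    return np.zeros(len(protein_tokens))                      # placeholder for the sketch

# Background: every protein position masked, so the base value reflects
# the chemical feature alone, as described in the text.
background = np.full((1, SEQ_LEN), MASK_ID)

explainer = shap.KernelExplainer(predict_fn, background)

# Token ids of the protein sequences in the pairs being explained (assumed shape)
test_proteins = np.full((5, SEQ_LEN), MASK_ID)
shap_values = explainer.shap_values(test_proteins, nsamples=200)
# Averaging shap_values over a protein's CPIs gives per-position importances.
```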


4.10 Hierarchical clustering of the human GPCRome

The pairwise distance between two proteins is determined by the cosine similarity of their DISAE embedding vectors after pre-training-fine-tuning, which incorporates both sequence and ligand information. The R packages "ape" (Analyses of Phylogenetics and Evolution) [48] and treeio [49] are used to convert the protein distance matrix into the Newick tree format. The ggtree [50][51] package is used to plot the tree in a circular layout.
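The paper performs the tree conversion and plotting in R; as an illustrative Python sketch of the upstream step, the snippet below builds the cosine distance matrix and a hierarchical clustering from the embeddings. The embedding file name and the average-linkage choice are assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, to_tree

# One fine-tuned DISAE vector per GPCR, assumed shape (n_receptors, dim)
embeddings = np.load("gpcr_disae_embeddings.npy")   # hypothetical file

# Cosine distance = 1 - cosine similarity, as used for the GPCRome clustering
dist_condensed = pdist(embeddings, metric="cosine")
tree = to_tree(linkage(dist_condensed, method="average"))  # linkage method is a choice

# squareform() gives the full distance matrix that can be exported to R
# (ape/treeio/ggtree) for Newick conversion and circular plotting.
dist_matrix = squareform(dist_condensed)
```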

References

[1] Oprea, T. Exploring the dark genome: implications for precision medicine. Mammalian Genome 30 (2019).

[2] Rodgers, G. et al. Glimmers in illuminating the druggable genome. Nature Reviews Drug Discovery 17 (2018).

[3] Laschet, C., Dupuis, N. & Hanson, J. The G protein-coupled receptors deorphanization landscape. Biochemical Pharmacology 153 (2018).

[4] Ngo, T. et al. Identifying ligands at orphan GPCRs: current status using structure-based approaches. British Journal of Pharmacology 173 (2016).

[5] Zheng, S., Li, Y., Chen, S., Xu, J. & Yang, Y. Predicting drug–protein interaction using quasi-visual question answering system. Nature Machine Intelligence 2, 134–140 (2020).

[6] Lewis, T. E. et al. Genome3D: exploiting structure to help users understand their sequences. Nucleic Acids Research 43, D382–D386 (2014). URL https://doi.org/10.1093/nar/gku973.

[7] Cheng, Z. et al. Effectively identifying compound-protein interactions by learning from positive and unlabeled examples. IEEE/ACM Transactions on Computational Biology and Bioinformatics PP, 1–1 (2016).

[8] Lim, H. et al. Large-scale off-target identification using fast and accurate dual regularized one-class collaborative filtering and its application to drug repurposing. PLOS Computational Biology 12 (2016).

[9] Lim, H., Gray, P., Xie, L. & Poleksic, A. Improved genome-scale multi-target virtual screening via a novel collaborative filtering approach to cold-start problem. Scientific Reports 6, 38860 (2016).

[10] Wen, M. et al. Deep-learning-based drug-target interaction prediction. Journal of Proteome Research 16 (2017).

[11] Gao, K. Y. et al. Interpretable drug target prediction using deep neural representation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, 3371–3377 (International Joint Conferences on Artificial Intelligence Organization, 2018). URL https://doi.org/10.24963/ijcai.2018/468.

[12] Nguyen, T., Le, H. & Venkatesh, S. GraphDTA: prediction of drug-target binding affinity using graph convolutional networks (2019).

[13] Lee, I., Keum, J. & Nam, H. DeepConv-DTI: prediction of drug-target interactions via deep learning with convolution on protein sequences. PLOS Computational Biology 15, e1007129 (2019).

[14] Sutskever, I., Vinyals, O. & Le, Q. Sequence to sequence learning with neural networks (2014).

[15] Karimi, M., Wu, D., Wang, Z. & Shen, Y. DeepAffinity: interpretable deep learning of compound-protein affinity through unified recurrent and convolutional neural networks (2018).

[16] Chen, L. et al. TransformerCPI: improving compound-protein interaction prediction by sequence-based deep learning with self-attention mechanism and label reversal experiments. Bioinformatics (Oxford, England) (2020).

[17] Lan, Z. et al. ALBERT: a lite BERT for self-supervised learning of language representations (2019).

[18] Rao, R. et al. Evaluating protein transfer learning with TAPE. CoRR abs/1906.08230 (2019). URL http://arxiv.org/abs/1906.08230.

[19] Bepler, T. & Berger, B. Learning protein sequence embeddings using information from structure. CoRR abs/1902.08661 (2019). URL http://arxiv.org/abs/1902.08661.

[20] Min, S., Park, S., Kim, S., Choi, H.-S. & Yoon, S. Pre-training of deep bidirectional protein sequence representations with structural information (2019).

[21] Hauser, A., Attwood, M. M., Rask-Andersen, M., Schiöth, H. & Gloriam, D. Trends in GPCR drug discovery: new agents, targets and indications. Nature Reviews Drug Discovery 16 (2017).

[22] El-Gebali, S. et al. The Pfam protein families database in 2019. Nucleic Acids Research 47 (2018).

[23] Gaulton, A. et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Research 40, D1100–7 (2011).

[24] Lin, Y., Liu, M. & Gilson, M. The Binding Database: data management and interface design. Bioinformatics (Oxford, England) 18, 130–9 (2002).

[25] Chan, W. et al. GLASS: a comprehensive database for experimentally validated GPCR-ligand associations. Bioinformatics (2015).

[26] Wishart, D. et al. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Research 34, D668–72 (2006).

[27] Duvenaud, D. K. et al. Convolutional networks on graphs for learning molecular fingerprints, 2224–2232 (2015). URL http://papers.nips.cc/paper/5954-convolutional-networks-on-graphs-for-learning-molecular-fingerprints.pdf.

[28] dos Santos, C. N., Tan, M., Xiang, B. & Zhou, B. Attentive pooling networks. CoRR abs/1602.03609 (2016). URL http://arxiv.org/abs/1602.03609.

[29] Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186 (Association for Computational Linguistics, Minneapolis, Minnesota, 2019). URL https://www.aclweb.org/anthology/N19-1423.

[30] Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. In Guyon, I. et al. (eds.) Advances in Neural Information Processing Systems 30, 4765–4774 (Curran Associates, Inc., 2017). URL http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf.

[31] Fritze, O. et al. Role of the conserved NPxxY(x)5,6F motif in the rhodopsin ground state and during activation. Proceedings of the National Academy of Sciences of the United States of America 100, 2290–5 (2003).

[32] Trzaskowski, B. et al. Action of molecular switches in GPCRs - theoretical and experimental studies. Current Medicinal Chemistry 19, 1090–109 (2012).

[33] Hilger, D., Masureel, M. & Kobilka, B. Structure and dynamics of GPCR signaling complexes. Nature Structural & Molecular Biology 25 (2018).

[34] Woolley, M. & Conner, A. Understanding the common themes and diverse roles of the second extracellular loop (ECL2) of the GPCR super-family. Molecular and Cellular Endocrinology 449 (2016).

[35] Peeters, M., Van Westen, G., Li, Q. & IJzerman, A. Importance of the extracellular loops in G protein-coupled receptors for ligand recognition and receptor activation. Trends in Pharmacological Sciences 32, 35–42 (2010).

[36] Seibt, B. et al. The second extracellular loop of GPCRs determines subtype-selectivity and controls efficacy as evidenced by loop exchange study at A2 adenosine receptors. Biochemical Pharmacology 85 (2013).

[37] Perez-Aguilar, J. M., Shan, J., LeVine, M., Khelashvili, G. & Weinstein, H. A functional selectivity mechanism at the serotonin-2A GPCR involves ligand-dependent conformations of intracellular loop 2. Journal of the American Chemical Society 136 (2014).

[38] Scholz, N., Langenhan, T. & Schöneberg, T. Revisiting the classification of adhesion GPCRs. Annals of the New York Academy of Sciences 1456, 80 (2019).

[39] Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences (2019).

[40] He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. CoRR abs/1512.03385 (2015). URL http://arxiv.org/abs/1512.03385.

[41] Duvenaud, D. K. et al. Convolutional networks on graphs for learning molecular fingerprints, 2224–2232 (2015). URL http://papers.nips.cc/paper/5954-convolutional-networks-on-graphs-for-learning-molecular-fingerprints.pdf.

[42] Howard, J. & Ruder, S. Universal language model fine-tuning for text classification, 328–339 (2018).

[43] Vaswani, A. et al. Attention is all you need (2017).

[44] Van Houdt, G., Mosquera, C. & Nápoles, G. A review on the long short-term memory model. Artificial Intelligence Review (2020).

[45] Altschul, S. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25, 3389–402 (1997).

[46] Kohavi, R. A study of cross-validation and bootstrap for accuracy estimation and model selection, 14 (2001).

[47] Leon, F., Floria, S.-A. & Badica, C. Evaluating the effect of voting methods on ensemble-based classification, 1–6 (2017).

[48] Paradis, E. & Schliep, K. ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics 35, 526–528 (2019).

[49] Wang, L.-G. et al. treeio: an R package for phylogenetic tree input and output with richly annotated and associated data. Molecular Biology and Evolution, accepted (2019).

[50] Yu, G., Smith, D., Zhu, H., Guan, Y. & Lam, T. T.-Y. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution 8(1), 28–36 (2017).

[51] Yu, G., Lam, T. T.-Y., Zhu, H. & Guan, Y. Two methods for mapping and visualizing associated data on phylogeny using ggtree. Molecular Biology and Evolution 35(12), 3041–3043 (2018).

Acknowledgement

This project has been funded in whole or in part with federal funds from the National Institute of General Medical Sciences of the National Institutes of Health (R01GM122845), the National Institute on Aging of the National Institutes of Health (R01AD057555), and the National Cancer Institute of the National Institutes of Health, under contract HHSN261200800001E. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the US Government. This research was supported [in part] by the Intramural Research Program of the NIH, National Cancer Institute, Center for Cancer Research, and the Intramural Research Program of the NIH Clinical Center.

Author Contributions

LX conceived and planned the experiments. TC and HL developed and implemented the algorithm. TC, HL, KAA, and YQ carried out the experiments. LX, TC, HL, and KAA contributed to the interpretation of the results. TC, HL, KAA, RN, and LX wrote the manuscript. All authors provided critical feedback and helped shape the research, analysis, and manuscript.

Competing Interests Statement

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that there is no professional or other personal interest of any nature or kind in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, this manuscript.


Figures and Tables


Figure 1: (A) Comparison of the scope of this study with that of current works. In existing methods, only annotated proteins (labeled data) are used in model training. They work well when an orphan protein is homologous to an annotated protein, but mostly fail when the orphan protein is dissimilar from the annotated proteins. By contrast, this study uses both annotated and orphan proteins in a self-supervised-learning-fine-tuning framework, thus extending the scope to genome-wide remote orphan proteins that are out of reach of the state of the art. (B) Illustration of the protein sequence representation from distilled sequence alignment. Highly and moderately conserved positions are marked in red and orange, respectively. (C) Architecture of the deep learning model for whole-genome CPI prediction.


Figure 2: Distribution of sequence alignment length and percentage of sequence identity between proteins in the testing set (query) and those in the training/validation set (template) in the dissimilar protein benchmark. The sequence length of each protein is around 300.


Figure 3: Performance comparison of the ALBERT pre-trained transformer-frozen model, the Transformer model, and the LSTM model. (A, B): ROC and PR curves for the prediction of ligand binding to remote GPCRs (more than 90% between any query protein and any template protein). (C, D): ROC and PR curves for the classification on the testing set in the cross-gene-family kinase inhibitor benchmark. All models are trained on distilled triplets.


Figure 4: (A) Top and (B) side views of the structure of the 5-hydroxytryptamine receptor 2B (UniProt id: 5HT2B_HUMAN, PDB id: 4IB4). Among the residues with the top 21 ranked SHAP values, those in the binding pocket, the NPxxY motif, the extracellular loop, and the intracellular loop are shown in CPK representation and colored dark green, blue, yellow, and light green, respectively.



Figure 5: Hierarchical clustering of human GPCRs in the Pfam family PF00001. Non-orphan and orphan GPCRs are labeled in blue and red, respectively.


Tables

ROC-AUC (by benchmark)

Goal of Comparison | Model | Dissimilar Protein | Kinase Inhibitor | Random
DISAE | ALBERT frozen transformer | 0.725 | 0.690 | 0.889
the effect of pre-training | ALBERT frozen transformer (pre-trained on GPCR) | 0.441 | / | 0.849
the effect of distilled sequence | ALBERT frozen transformer (distilled singlets) | 0.583 | / | 0.656
the effect of fine-tuning | ALBERT frozen embedding | 0.585 | / | 0.889
the effect of fine-tuning | ALBERT all unfrozen | 0.680 | / | 0.891
baseline against ALBERT | Transformer (distilled triplets) | 0.648 | 0.687 | 0.836
baseline against ALBERT | LSTM (distilled singlets) | 0.652 | 0.642 | 0.907
baseline against ALBERT | LSTM (distilled triplets) | 0.476 | 0.667 | 0.908

PR-AUC (by benchmark)

Goal of Comparison | Model | Dissimilar Protein | Kinase Inhibitor | Random
DISAE | ALBERT frozen transformer | 0.589 | 0.673 | 0.783
the effect of pre-training | ALBERT frozen transformer (pre-trained on GPCR) | 0.215 | / | 0.728
the effect of distilled sequence | ALBERT frozen transformer (distilled singlets) | 0.370 | / | 0.477
the effect of fine-tuning | ALBERT frozen embedding | 0.278 | / | 0.783
the effect of fine-tuning | ALBERT all unfrozen | 0.418 | / | 0.785
baseline against ALBERT | Transformer (distilled triplets) | 0.380 | 0.630 | 0.719
baseline against ALBERT | LSTM (distilled singlets) | 0.372 | 0.614 | 0.798
baseline against ALBERT | LSTM (distilled triplets) | 0.261 | 0.590 | 0.804

Table 1: Test set performance under three benchmark settings, evaluated by ROC-AUC and PR-AUC. The ALBERT pre-trained transformer-frozen model outperforms the other models, and its performance is stable across all settings; hence, it is recommended as the optimal configuration for the pre-trained ALBERT model. Four variants of DISAE are compared with the transformer-frozen model. Unless specified in the parentheses, ALBERT is pre-trained on whole Pfam proteins in the form of distilled triplets. The four DISAE variants are organized into three groups based on the goal of comparison. Two state-of-the-art models, Transformer and LSTM, are compared with the ALBERT pre-trained models as baselines. The protein-similarity-based splitting uses a similarity score threshold of 0.035 (Figure 2).

Orphan Receptor (UniProt Id) | Pfam | Drug | Drug Target
A0A0C4DFX5 | PF13853 | Isoetharine | ADRB1, ADRB2
A0A126GVR8 | PF13853 | Ganirelix | GNRHR
A0A1B0GTK7 | PF00001 | Levallorphan | OPRM1
A0A1B0GVZ0 | PF00001 | Xamoterol | ADRB1, ADRB2
A0A286YFH6 | PF13853 | Degarelix | GNRHR
A3KFT3 | PF13853 | Degarelix | GNRHR
C9J1J7 | PF00001 | Levallorphan | OPRM1
C9JQD8 | PF00001 | Lixisenatide | GLP1R
E9PH76 | PF00001 | Ganirelix | GNRHR
E9PPJ8 | PF13853 | Xamoterol | ADRB1, ADRB2

Table 2: Examples of deorphanization predictions from the DISAE ensemble models.


Supplementary Information

October 9, 2020


Supplementary Figure 1: Clustering of pre-trained triplet DISAE vectors at (A) Level 1, (B) Level 2, (C) Level 3, and (D) Level 4 of ALBERT. Each triplet is colored by the physicochemical properties of its third amino acid.


Supplementary Figure 2: The same clustering trends as in Supplementary Figure 1 are also observed when triplets are grouped by individual amino acid types rather than by the physicochemical properties of their side chains.

Supplementary Figure 3: The three major models corresponding to the AUC curves in the paper. All x axes show the number of training epochs. The ALBERT frozen transformer shows consistently better performance, which becomes the critical advantage of DISAE in deorphanization, where remote orphan proteins are significantly different from the training data and overfitting cannot be controlled because the true classification labels are unknown. With DISAE, the predictions on orphan proteins remain reliable even if the model is trained for too many or too few epochs.


Supplementary Figure 4: The effect of pre-training: the ALBERT frozen transformer, i.e., the model pre-trained on the whole Pfam as proposed in DISAE, shows robust and consistently better performance. All x axes show the number of training epochs.


Supplementary Figure 5: The effect of different fine-tuning strategies that freeze parts of ALBERT: although the other settings can be as good as the ALBERT frozen transformer in some epochs, they all suffer from large performance variance across epochs. The ALBERT frozen transformer shows consistently robust high performance. All x axes show the number of training epochs.


Supplementary Figure 6: Phylogenetic tree for PF03402



Supplementary Figure 7: Phylogenetic tree for PF13853


Supplementary Figure 8: Pairwise protein similarity distribution of the benchmark data. To simulate a realistic deorphanization scenario, in which a remote orphan protein can be significantly different from the data used to train the models, we designed a protein-similarity-based data splitting strategy. First, pairwise bit-scores are calculated with the standard BLAST+ package; their distribution is shown in this figure. Then, using a threshold of 0.035, proteins whose pairwise similarity is below the threshold are kept in two separate groups. The protein-chemical activity pairs whose proteins fall in the smaller group are used as testing data, and the remaining samples are split into training and validation sets.
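As an illustration only, the sketch below shows one way such a similarity-based split could be realized. The function name similarity_split, the pre-computed pairwise_sim dictionary of normalized BLAST+ bit-scores, and the cpi_records list of (protein, chemical, label) tuples are assumptions made for this example, not the authors' code.

```python
# Minimal sketch of a protein-similarity-based split (assumptions, not the authors' pipeline).
# pairwise_sim[(p, q)] holds the normalized BLAST+ bit-score for a protein pair;
# cpi_records is a list of (protein_id, chemical_id, label) tuples.
import itertools
import random

def similarity_split(proteins, pairwise_sim, cpi_records,
                     threshold=0.035, valid_frac=0.1, seed=0):
    # Single-linkage grouping: proteins whose similarity reaches the threshold stay together,
    # so every cross-group pair is below the threshold.
    parent = {p: p for p in proteins}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for p, q in itertools.combinations(proteins, 2):
        sim = pairwise_sim.get((p, q), pairwise_sim.get((q, p), 0.0))
        if sim >= threshold:
            parent[find(p)] = find(q)

    groups = {}
    for p in proteins:
        groups.setdefault(find(p), set()).add(p)

    # The smaller side becomes the held-out (remote) test group.
    ordered = sorted(groups.values(), key=len)
    test_proteins = set().union(*ordered[:-1]) if len(ordered) > 1 else set()

    test = [r for r in cpi_records if r[0] in test_proteins]
    rest = [r for r in cpi_records if r[0] not in test_proteins]
    random.Random(seed).shuffle(rest)
    n_valid = int(len(rest) * valid_frac)
    return rest[n_valid:], rest[:n_valid], test   # train, validation, test
```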


Orphan Receptor Predicted Drug [Known binding protein] A0A087WY02 XXPANQJNYNUNES-UHFFFAOYSA-N [’P23975’, ’Q05940’, ’P31645’] A0A0A0MQW8 BYJAVTDNIXVSPW-UHFFFAOYSA-N [’Q96RJ0’, ’P35348’, ’P43140’] A0A0B4J1V8 RGCVKNLCSQQDEP-UHFFFAOYSA-N [’P33261’, ’P18901’, ’P08684’] A0A0C4DFX5 MBUVEWMHONZEQD-UHFFFAOYSA-N [’P11509’, ’P33261’, ’Q16873’] A0A126GVR8 BPZSYCZIITTYBL-YJYMSZOUSA-N [’P13945’, ’P11509’, ’P07550’] A0A126GWK9 VQODGRNSFPNSQE-DVTGEIKXSA-N [’P46721’, ’P04083’, ’P20309’] A0A126GWS4 VMWNQDUVQKEIOC-CYBMUJFWSA-N [’P18901’, ’P08684’, ’P50226’] A0A1B0GTK7 ATALOFNDEOCMKK-OITMNORJSA-N [’P29371’, ’P11712’, ’P33261’] A0A1B0GVZ0 OWQUZNMMYNAXSL-UHFFFAOYSA-N [’Q01959’, ’P31390’, ’P35367’] A0A286YF86 BGDKAVGWHJFAGW-UHFFFAOYSA-N [’P08485’, ’P20309’, ’P17200’] A0A286YF92 IRSCQMHQWWYFCW-UHFFFAOYSA-N [”, ’Q13255’, ’Q96FL8’] A0A286YFH6 BUGYDGFZZOZRHP-UHFFFAOYSA-N [’O00591’, ’P11509’, ’Q96FL8’] A0A2C9F2M5 CYQFCXCEBYINGO-IAGOWNOFSA-N [’P47746’, ’P47936’, ’P11712’] A3KFT3 KKZJGLLVHKMTCM-UHFFFAOYSA-N [”, ’Q16678’, ’P33527’] A6NFC9 BYJAVTDNIXVSPW-UHFFFAOYSA-N [’Q96RJ0’, ’P35348’, ’P43140’] A6NH00 BGDKAVGWHJFAGW-UHFFFAOYSA-N [’P08485’, ’P20309’, ’P17200’] A6NMS3 IYIKLHRQXLHMJQ-UHFFFAOYSA-N [’P11509’, ’P07550’, ’P51589’] A6NMU1 BARDROPHSZEBKC-OITMNORJSA-N [’P08684’, ’P11712’, ’P25103’] C9J1J7 YFGHCGITMMYXAQ-LJQANCHMSA-N [’P20815’, ’P11712’, ’P05177’] C9JQD8 LWAFSWPYPHEXKX-UHFFFAOYSA-N [’P10635’, ’P07550’, ’P08588’] C9JW47 BYBLEWFAAKGYCD-UHFFFAOYSA-N [’Q6PIU1’, ’P22001’, ’P33261’] E7ENI1 OCJYIGYOJCODJL-UHFFFAOYSA-N [’P10635’, ’P02768’, ’P35367’] E9PH76 ZSCDBOWYZJWBIY-UHFFFAOYSA-N [’P07550’, ’P33261’, ’P11229’] E9PPJ8 OZYUPQUCAUTOBP-QXAKKESOSA-N [’P33535’, ’P35372’] F8VUV1 CYQFCXCEBYINGO-IAGOWNOFSA-N [’P47746’, ’P47936’, ’P11712’] G3V4Q5 XXPANQJNYNUNES-UHFFFAOYSA-N [’P23975’, ’Q05940’, ’P31645’] H0Y622 UCHDWCPVSPXUMX-TZIWLTJVSA-N [’P11509’, ’P07550’, ’P21452’] J3KQU7 IZTQOLKUZKXIRV-YRVFCXMDSA-N [’Q9NPD5’, ’P30553’, ’Q63931’] O00590 JOATXPAWOHTVSZ-UHFFFAOYSA-N [’P13945’, ’P07550’, ’P18089’] O15218 BYJAVTDNIXVSPW-UHFFFAOYSA-N [’Q96RJ0’, ’P35348’, ’P43140’] O43869 UCHDWCPVSPXUMX-TZIWLTJVSA-N [’P11509’, ’P07550’, ’P21452’] O60431 XXPANQJNYNUNES-UHFFFAOYSA-N [’P23975’, ’Q05940’, ’P31645’] O76099 SHGAZHPCJJPHSC-YCNIQYBTSA-N [’P31025’, ’P11509’, ’Q02928’] P04000 LUZRJRNZXALNLM-JGRZULCMSA-N [’P08684’, ’P34969’, ’P35462’] P04001 XXPANQJNYNUNES-UHFFFAOYSA-N [’P23975’, ’Q05940’, ’P31645’] P0C604 DGBIGWXXNGSACT-UHFFFAOYSA-N [’O00591’, ’O14764’, ’P48169’] P0C628 HTIQEAQVCYTUBX-UHFFFAOYSA-N [’Q02641’, ’P08684’, ’P04798’] P0C645 IRSCQMHQWWYFCW-UHFFFAOYSA-N [”, ’Q13255’, ’Q96FL8’] P0C7N5 VMWNQDUVQKEIOC-CYBMUJFWSA-N [’P18901’, ’P08684’, ’P50226’] P0DMS8 DGBIGWXXNGSACT-UHFFFAOYSA-N [’O00591’, ’O14764’, ’P48169’] P0DN77 XXPANQJNYNUNES-UHFFFAOYSA-N [’P23975’, ’Q05940’, ’P31645’] P0DN78 XXPANQJNYNUNES-UHFFFAOYSA-N [’P23975’, ’Q05940’, ’P31645’] P47804 QWAXKHKRTORLEM-UGJKXSETSA-N [’P21452’, ’P08684’, ’P35462’] P58173 VHYCDWMUTMEGQY-UHFFFAOYSA-N [’P13945’, ’P07550’, ’P08588’] Q15612 OIRDTQYFTABQOQ-KQYNXXCUSA-N [’P29275’, ’P55263’, ’P25099’] Q15617 ZSCDBOWYZJWBIY-UHFFFAOYSA-N [’P07550’, ’P33261’, ’P11229’] Q6IFH4 KWTSXDURSIMDCE-QMMMGPOBSA-N [’P25100’, ’Q8HZ64’, ’P10635’] Q7Z5H5 IYIKLHRQXLHMJQ-UHFFFAOYSA-N [’P11509’, ’P07550’, ’P51589’] Q8N0Y5 COUYJEVMBVSIHV-SFHVURJKSA-N [’P10632’, ’P07550’, ’P11712’] Q8N6U8 BYJAVTDNIXVSPW-UHFFFAOYSA-N [’Q96RJ0’, ’P35348’, ’P43140’] Q8NG76 IZTQOLKUZKXIRV-YRVFCXMDSA-N [’Q9NPD5’, ’P30553’, ’Q63931’] Q8NG83 CJOFXWAVKWHTFT-XSFVSMFZSA-N [’Q9HB55’, ’P22086’, ’P10635’] Q8NG85 IYIKLHRQXLHMJQ-UHFFFAOYSA-N [’P11509’, 
’P07550’, ’P51589’] Q8NGB2 BARDROPHSZEBKC-OITMNORJSA-N [’P08684’, ’P11712’, ’P25103’] Q8NGC2 MEFKEPWMEQBLKI-AIRLBKTGSA-N [’P19623’, ’P31153’, ’P25099’] Q8NGC8 COUYJEVMBVSIHV-SFHVURJKSA-N [’P10632’, ’P07550’, ’P11712’] Q8NGD3 OCJYIGYOJCODJL-UHFFFAOYSA-N [’P10635’, ’P02768’, ’P35367’]


Table 1 continued from previous page Orphan Receptor Predicted Drug [Known binding protein] Q8NGD4 UCTWMZQNUQWSLP-VIFPVBQESA-N [’P07550’, ’O15244’, ’P08684’] Q8NGE9 WNTYBHLDCKXEOT-UHFFFAOYSA-N [’P14416’, ’P31388’, ’P10275’] Q8NGG2 BGDKAVGWHJFAGW-UHFFFAOYSA-N [’P08485’, ’P20309’, ’P17200’] Q8NGI3 CYQFCXCEBYINGO-IAGOWNOFSA-N [’P47746’, ’P47936’, ’P11712’] Q8NGI4 IRSCQMHQWWYFCW-UHFFFAOYSA-N [”, ’Q13255’, ’Q96FL8’] Q8NGI7 OCJYIGYOJCODJL-UHFFFAOYSA-N [’P10635’, ’P02768’, ’P35367’] Q8NGJ5 JSWZEAMFRNKZNL-UHFFFAOYSA-N [’P11712’, ’P41595’, ’P05177’] Q8NGJ9 GEFQWZLICWMTKF-CDUCUWFYSA-N [’P25100’, ’P18089’, ’P08913’] Q8NGK2 LUZRJRNZXALNLM-JGRZULCMSA-N [’P08684’, ’P34969’, ’P35462’] Q8NGK6 ZSCDBOWYZJWBIY-UHFFFAOYSA-N [’P07550’, ’P33261’, ’P11229’] Q8NGL9 UCHDWCPVSPXUMX-TZIWLTJVSA-N [’P11509’, ’P07550’, ’P21452’] Q8NGM9 BYJAVTDNIXVSPW-UHFFFAOYSA-N [’Q96RJ0’, ’P35348’, ’P43140’] Q8NGN7 XXPANQJNYNUNES-UHFFFAOYSA-N [’P23975’, ’Q05940’, ’P31645’] Q8NGR4 BGDKAVGWHJFAGW-UHFFFAOYSA-N [’P08485’, ’P20309’, ’P17200’] Q8NGS1 XXPANQJNYNUNES-UHFFFAOYSA-N [’P23975’, ’Q05940’, ’P31645’] Q8NGT5 ATALOFNDEOCMKK-OITMNORJSA-N [’P29371’, ’P11712’, ’P33261’] Q8NGU2 UCHDWCPVSPXUMX-TZIWLTJVSA-N [’P11509’, ’P07550’, ’P21452’] Q8NGW1 DRHKJLXJIQTDTD-OAHLLOKOSA-N [’P08684’, ’P34969’, ’P35462’] Q8NGX9 URKOMYMAXPYINW-UHFFFAOYSA-N [’P31389’, ’P07550’, ’P33261’] Q8NGY9 UCTWMZQNUQWSLP-VIFPVBQESA-N [’P07550’, ’O15244’, ’P08684’] Q8NGZ2 IZTQOLKUZKXIRV-YRVFCXMDSA-N [’Q9NPD5’, ’P30553’, ’Q63931’] Q8NGZ4 RUDATBOHQWOJDD-BSWAIDMHSA-N [’P52895’, ’Q96RI1’, ’P08684’] Q8NH04 BYBLEWFAAKGYCD-UHFFFAOYSA-N [’Q6PIU1’, ’P22001’, ’P33261’] Q8NH05 KKGQTZUTZRNORY-UHFFFAOYSA-N [’P21453’, ’O95977’, ’P43004’] Q8NH41 OZVBMTJYIDMWIL-AYFBDAFISA-N [’P07550’, ’P08684’, ’P34969’] Q8NH48 ZSCDBOWYZJWBIY-UHFFFAOYSA-N [’P07550’, ’P33261’, ’P11229’] Q8NH51 BUGYDGFZZOZRHP-UHFFFAOYSA-N [’O00591’, ’P11509’, ’Q96FL8’] Q8NH57 XXPANQJNYNUNES-UHFFFAOYSA-N [’P23975’, ’Q05940’, ’P31645’] Q8NH61 RUDATBOHQWOJDD-BSWAIDMHSA-N [’P52895’, ’Q96RI1’, ’P08684’] Q8NH74 BYJAVTDNIXVSPW-UHFFFAOYSA-N [’Q96RJ0’, ’P35348’, ’P43140’] Q8NH87 GJPICJJJRGTNOD-UHFFFAOYSA-N [’P25101’, ’P11712’, ’O95342’] Q8NH89 ZSCDBOWYZJWBIY-UHFFFAOYSA-N [’P07550’, ’P33261’, ’P11229’] Q8NH95 DERZBLKQOCDDDZ-JLHYYAGUSA-N [’P11509’, ’O60840’, ’P11229’] Q8TCB6 HTIQEAQVCYTUBX-UHFFFAOYSA-N [’Q02641’, ’P08684’, ’P04798’] Q8TDV2 XXPANQJNYNUNES-UHFFFAOYSA-N [’P23975’, ’Q05940’, ’P31645’] Q8WZA6 ZFXYFBGIUFBOJW-UHFFFAOYSA-N [’P29275’, ’P25099’, ’Q60614’] Q96KK4 XXPANQJNYNUNES-UHFFFAOYSA-N [’P23975’, ’Q05940’, ’P31645’] Q96R54 VMWNQDUVQKEIOC-CYBMUJFWSA-N [’P18901’, ’P08684’, ’P50226’] Q96R67 ZSCDBOWYZJWBIY-UHFFFAOYSA-N [’P07550’, ’P33261’, ’P11229’] Q9BZJ7 UCHDWCPVSPXUMX-TZIWLTJVSA-N [’P11509’, ’P07550’, ’P21452’] Q9BZJ8 KWTSXDURSIMDCE-QMMMGPOBSA-N [’P25100’, ’Q8HZ64’, ’P10635’] Q9GZM6 PVNIIMVLHYAWGP-UHFFFAOYSA-N [’P49019’, ’Q15274’, ’Q80Z39’] Q9H1Y3 FIVSJYGQAIEMOC-ZGNKEGEESA-N [’P10635’, ’Q9UNQ0’, ’P25103’] Q9H210 IYIKLHRQXLHMJQ-UHFFFAOYSA-N [’P11509’, ’P07550’, ’P51589’] Q9H339 IYIKLHRQXLHMJQ-UHFFFAOYSA-N [’P11509’, ’P07550’, ’P51589’] Q9NZP2 IQVRBWUUXZMOPW-PKNBQFBNSA-N [’P29275’, ’Q96FL8’, ’P25099’] Q9UGF5 GHOSNRCGJFBJIB-UHFFFAOYSA-N [’P30556’, ’P11712’, ’P23219’] Q9UGF6 BTCSSZJGUNDROE-UHFFFAOYSA-N [’Q7Z2H8’, ’Q9Z0U4’, ’Q9UBS5’] Q9UHM6 BGDKAVGWHJFAGW-UHFFFAOYSA-N [’P08485’, ’P20309’, ’P17200’]

Supplementary Table 1: Examples of predicted approved drugs: 649 human orphan GPCRs were each paired with 555 approved GPCR-targeted drugs as novel chemical-protein pairs. Listed here are 106 of the orphan proteins paired with at least one approved GPCR-targeted drug at an estimated false positive rate lower than 0.05. Proteins are identified by UniProt id and chemicals by InChIKey.


ALBERT MODEL CONFIGURATION (values are shared by the triplet and singlet forms unless noted)

Pre-training related:
"attention_probs_dropout_prob" | 0
"hidden_act" | "gelu"
"hidden_dropout_prob" | 0
"embedding_size" | 128
"hidden_size" | 312
"initializer_range" | 0.02
"intermediate_size" | 1248
"max_position_embeddings" | 512
"num_attention_heads" | 12
"num_hidden_layers" | 4
"num_hidden_groups" | 1
"net_structure_type" | 0
"gap_size" | 0
"num_memory_blocks" | 0
"inner_group_num" | 1
"down_scale_factor" | 1
"type_vocab_size" | 2
"vocab_size" | 19686 (triplet form); 32 (singlet form)
"ln_type" | / (triplet form); "postln" (singlet form)

Fine-tuning related:
sequence embedding hidden units | 256
protein sequence post-tokenization length | 210

Supplementary Table 2: ALBERT configuration. ALBERT is implemented with the Transformers package by Hugging Face (https://github.com/huggingface/transformers). The authors installed the package in January 2020.
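For illustration, the snippet below shows how the triplet-form hyperparameters above could be passed to the Hugging Face AlbertConfig; only the fields supported by AlbertConfig are included (entries such as net_structure_type, gap_size, and ln_type belong to the original TensorFlow ALBERT configuration). This is a minimal sketch, not the authors' training script.

```python
# Sketch: an ALBERT encoder with the triplet-form hyperparameters from Supplementary Table 2,
# built with the Hugging Face Transformers package.
from transformers import AlbertConfig, AlbertModel

config = AlbertConfig(
    vocab_size=19686,                 # triplet vocabulary; 32 for the singlet form
    embedding_size=128,
    hidden_size=312,
    num_hidden_layers=4,
    num_hidden_groups=1,
    num_attention_heads=12,
    intermediate_size=1248,
    inner_group_num=1,
    hidden_act="gelu",
    hidden_dropout_prob=0.0,
    attention_probs_dropout_prob=0.0,
    max_position_embeddings=512,
    type_vocab_size=2,
    initializer_range=0.02,
)
model = AlbertModel(config)           # randomly initialized; weights come from pre-training
```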

NEURAL-FINGERPRINT CONFIGURATION
dropout | 0.1
convolution layer size | 20
convolution layer number | 4
hidden units | 128
atomic connectivity degrees for chemical molecules | [0, 1, 2, 3, 4, 5]

INTERACTION PREDICTION CONFIGURATION
design | two linear layers with batch normalization and ReLU
attentive pooling dropout | 0.3
attentive pooling hidden units | 64

LSTM CONFIGURATION
embedding hidden units | 128
number of LSTM layers | 1
dropout | 0.2

Supplementary Table 3: There are three major components in the fine-tuning model architecture: ALBERT to extract the protein embedding, the neural fingerprint to extract the chemical embedding, and the interaction prediction layers. The LSTM serves as the baseline.
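A minimal PyTorch sketch of the interaction prediction component is given below, assuming the protein and chemical embeddings are simply concatenated. The class name InteractionHead, the embedding dimensions, and the omission of the attentive pooling step are simplifying assumptions for illustration, not the authors' exact implementation.

```python
# Sketch of the interaction-prediction component in Supplementary Table 3: two linear layers
# with batch normalization and ReLU applied to combined protein and chemical embeddings.
import torch
import torch.nn as nn

class InteractionHead(nn.Module):
    def __init__(self, protein_dim=256, chemical_dim=128, hidden_dim=64, dropout=0.3):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(protein_dim + chemical_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, 2),   # binding vs. non-binding
        )

    def forward(self, protein_emb, chemical_emb):
        # protein_emb: (batch, protein_dim); chemical_emb: (batch, chemical_dim)
        return self.classifier(torch.cat([protein_emb, chemical_emb], dim=-1))

# Example usage with random embeddings:
# logits = InteractionHead()(torch.randn(8, 256), torch.randn(8, 128))
```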

OPTIMIZATION CONFIGURATION
training epochs | 100
batch size | 64
optimizer | Adam
scheduler to adjust learning rate | cosine annealing
initial learning rate | 2.00E-05
L2 regularization weight | 1.00E-04

HARDWARE
deep learning server | Supermicro SuperServer 4028GR-TR
GPU | NVIDIA Tesla V100 with 32 GB per GPU (256 GB total GPU memory)
CPU | Intel Xeon E5-2650 v4, 2.2 GHz, 12-core (48 cores total)

Supplementary Table 4: Optimization-related configuration and hardware.
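The optimization settings above map directly onto standard PyTorch utilities, as in the sketch below. The model, loss_fn, and train_loader objects are placeholders standing in for the actual fine-tuning model and chemical-protein training data; they are assumptions for this example only.

```python
# Sketch of the optimization setup in Supplementary Table 4: Adam with an initial learning rate
# of 2e-5, L2 weight 1e-4, cosine-annealing schedule, 100 epochs, batch size 64.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data; in DISAE these would be the fine-tuning model and the CPI training set.
model = torch.nn.Linear(384, 2)
loss_fn = torch.nn.CrossEntropyLoss()
dataset = TensorDataset(torch.randn(256, 384), torch.randint(0, 2, (256,)))
train_loader = DataLoader(dataset, batch_size=64, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=2e-5, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):                       # training epochs
    for features, labels in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(features), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                           # anneal the learning rate once per epoch
```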
