Supplementary Information
Total Page:16
File Type:pdf, Size:1020Kb
1 Supplementary Information 2 3 July 2020 1 Supplementary Figure 1: Protein pair wise similarity distribution of benchmark data:To simulate realistic deorphanization scenario, where remote orphan protein could be signifi- cantly different from data used to train models, we design a protein similarity based data splitting strategy. First, pair wise bit-scores are calculated with standard BLAST+ pack- age. A distribution could be found in Figure 1. Then, setting a threshold of 0.035, proteins with pair-wise similarity lower than the threshold are keep separately in two groups. The protein-chemical activity pairs with protein in the relative smaller group will be used as testing data. The left samples are then split into training and validation. 2 Supplementary Figure 2: Clustering of pre-trained triplet DISAE vectors at (A) Level 1, (B) Level 2, (C) Level 3, and (D) Level 4 of ALBERT. The triplet is colored by the physiochemical properties of the third amino acid. 3 Supplementary Figure 3: The same clustering trends as in Supplementary Figure 2 were also observed when triplets are grouped by individual amino acid types, rather than their physicochemical properties of side chains. 4 Supplementary Figure 4: The three major models as in the AUC curves in the paper. All x axis are the number of epochs trained. ALBERT frozen transformer proves consistent better performance, which becomes the critical advantage of DISAE in deorphanization, whhere remote orphan proteins are significantly different from training data and overfitting is impossible to control given the unknown true labels for classification. Using DISAE, we could be confident that even we train the model for too long or too short epochs, the prediction reliability on orphans will be robust. 5 Supplementary Figure 5: The effect of pre-training: ALBERT frozen transformer, i.e. the one pre-trained on whole pfams as proposed in DISAE, show robust and consistent better performance. All x axis are the number of epochs trained. 6 Supplementary Figure 6: The effect of different fine-tuning strategy with freezing parts of ALBERT: Although the other setting could be as good as ALBERT frozen transformer in some epochs, but they all suffer from large performance variance by epoch. ALBERT frozen transformer proves consistent and robust high performance. All x axis are the number of epochs trained 7 Supplementary Figure 7: Phylogenetic tree for PF03402 8 PF13853 A0A286YF25 A0A286YF06 OR2M3 OR2M2 OR2M5 OR2M7 O10H3 O11A1 OR6V1 OR9A1 O10J5 OR2M4 OR9K2 OR5A2 OR6K6 OR2L3 OR2L8 OR1E3 O52J3 O52B6 OR6P1 OR7G3 OR2A2 OR6N1 O52E4 OR6C4 OR1L4 OR5K3 O10A4 OR6X1 OR2T7 OR4A8 O2T27 OR4DB OR2V1 ● ● ● ● ● ● ● ● O10Q1 ● ● ● ● ● OR2G6 ● ● ● ● OR5K4 ● OR4KD OR6C2 O11L1 ● ● A0A0G2JL33 OR5MB ● ● ● O2AG2 ● ● ● OR9A4 ● ● ● OR1S2 OR4K2 ● ● OR4Q3 A0A286YF86 OR4DA ● ● ● O52K2 ● ● O52K1 O10AD ● ● ● OR7C1 ● ● OR5C1 ● ● O51L1 OR1I1 ● ● ● O10J4 O12D3 ● ● OR2B2 ● ● O10J6 ● ● OR8H2 A0A2C9F2M2 OR5T2 ● ● ● A0A087WZH1 OR4A4 ● ● OR2V2 ● ● O2AJ1 OR5I1 ● ● O52N5 O10D3 ● ● O6C65 O4A47 ● ● OR5P3 ● ● OR5K1 ● A0A286YFA5 A0A286YEZ2 OR5T3 ● ● OR5G3 ● ● OR9G4 O5AN1 ● OR1L6 ● A0A0C4DFP2 OR8D2 ● ● ● ● O51B4 ● ● O51I1 O10R2 ● ● O52E5 OR7D2 ● ● OR5R1 O2A25 ● ● OR9I1 ● ● O52W1 O10T2 ● O13G1 A0A2C9F2M8 OR6C1 ● ● OR6C3 ● O51H1 ● ● O2T10 O2AP1 ● ● O10H5 ● ● OR5B2 O2AK2 ● ● ● OR8K3 O2T35 ● ● OR1F2 A0A286YFH6 OR2T2 ● ● OR8A1 O11H6 ● ● A0A286YEW5 OR2Z1 ● ● OR5BH ● ● OR5B3 OR4KF ● OR8K5 ● ● O13C5 A0A126GW94 OR2L5 ● ● O13C2 O52N4 ● ● O13C9 OR1J1 ● ● ● OR4CC ● OR1FC A0A286YF40OR1N2 ● ● ● OR6S1 O13C4 ● ● O14K1 ● ● O10G7 O13C3 ● ● O10G4 OR6B3 ● ● O10G9 A0A126GWK9OR6B2 ● ● OR6B1 OR5L1 ● ● O10A6 ● O51F2 ● ● OR2T1 ● A0A2C9F2M5 ● O51E1 OR2B6 ● ● OR7D4 ● O11H2 ● A0A2C9F2Y1 OR2I1 ● ● ● O11H1 OR8B4 ● ● O11HC OR8D4 ● ● O11H7 OR5DE ● ● O52R1 OR8B3 ● ● O51G2 ● OR8B2 ● O2AE1 A0A126GVR6 ● OR4C3 ● O4A15 ● ● A0A286YF59 ● OR1L8 ● OR2A1 ● OR5P2 ● OR6K3 A0A126GVR8 ● O5AS1 ● A0A0C4DFU5 ● ● OR2H2 ● OR4E2 ● OR1E2 ● O13C7 ● OR1E1 ● ● O4C45 OR5BL ● ● O10J3 OR8D1 ● ● O51J1 A0A126GW95OR8G1 ● ● O2A14 ● ● O52A1 OR8G5 ● ● O13F1 O10AG ● ● OR7AA O10A2 ● ● O52B4 O10A5 ● ● OR8I2 OR2A5 ● ● O52M1 OR2A7 ● ● OR2K2 ● OR2A4 ● A0A0C4DFP3 ● OR8BC ● OR1F1 ● OR4C5 ● OR7AH ● OR7A5 ● OR4P4 ● O51M1 ● O51G1 ● OR1J2 ● OR4KE ● OR5W2 ● A0A087WUB2 ● OR5J2 ● O51B5 ● OR8B8 ● OR5DD ● OR5DG ● ● O51B6 ● O5AC1 ● O51S1 ● OR1Q1 ● O10V1 A0A286YF57 ● ● OR5M8 ● OR1S1 ● OR2W3 ● O2A12 ● OR2F1 ● OR8U8 ● OR2F2 ● OR8U9 ● OR2D2 ● OR8U1 ● OR2W6 ● OR7C2 ● OR3A2 ● A0A286YF92 ● A0A286YFF0 ● OR4E1 ● OR1D2 ● ● H0Y3U5 O10D4 ● ● O6C76 O2AT4 ● ● E9PPJ8 OR4S1 ● ● OR2T4 A0A126GWK8 ● ● A0A2C9F2M9 ●a O52I2 ● Orphan ● O10AC J3KQU7 ● ● O52A5 O52I1 ● ● O10H1 O12D1 ● ● O10H2 OR6Y1 ● ● OR2T5 O10G6 ● ● O2T29 ● A0A286YF24 ● OR4A5 ● O5AU1 ● O10K1 ● OR9A2 ● O10K2 ● O51A7 ● O14CZ ● OR4CG ● OR4N5 ● O56B2 ● O13C8 ● OR4F4 ● O52E8 ● O4F17 ● O52D1 ● OR4F5 ● OR7G2 ● OR4D1 ● A0A2C9F2M1 ● OR4D9 ● OR7A2 ● O52Z1 ● OR8J2 ● ● O5AL1 OR4KH ● ● OR4K3 ● A0A126GVZ4 ● O52E2 OR5BC ● ● O51T1 O6C74 ● ● OR4M2 OR4N2 ● ● A0A0G2JMP0 ● ● A0A0X1KG70 A0A0G2JNH3 ● ● OR4M1 A0A096LPK9OR4N4 ● ● OR1K1 OR6K2 ● ● OR4X1 ● OR6A2 ● O51D1 ● OR6T1 ● O14A2 ● O52B2 ● O52A4 ● ● OR5L2 ● A0A286YET0O51V1 ● OR6F1 ● O52L2 ● OR9G9 ● O52L1 ● OR9G1 ● O51B2 ● O4F15 ● ● ● OR4D2 O56A4 ● A0A2C9F2M6 ● O52N2 ● ● O14J1 ● OR4CF ● O2AG1 A0A2C9F2M4 ● O52E1 ● OR5M1 ● OR4K5 ● OR5MA ● O13H1 ● O51E2 ● O8G2P ● O51F1 ● O4C46 ● A6NLW9 ● OR8J3 ● OR5H2 ● ● A0A2C9F2N2 O51A2 ● O51A4 ● O5H14 ● OR5K2 O6C70 ● ● OR5F1 OR6Q1 ● ● ● OR1L3 OR4CB ● ● O5H15 J3KPH0 ● ● OR5H1 O52E6 ● ● O7E24 OR8H3 ● ● OR1P1 OR5M3 ● ● O10H4 O4A16 ● ● OR8K1 O6C75 ● ● O10X1 ● A0A286YEU6 OR4CD ● ● OR4Q2 ● O10W1 ● O52N1 O5AK2 ● ● ● ● O10S1 O11H4 ● A0A286YF54 ● OR5DI A0A286YFI7O51Q1 ● ● ● O5AR1 OR3A4 ● ● ● O10A7 ● ● OR2C1 OR2W1 ● OR7G1 OR83P ● A0A087X172 ● ● OR1G1 OR1A2 ● ● OR5M9 ● ● A0A286YEX6 OR1N1 ● ● OR4C6 OR2B8 ● ● OR3A1 O14AG ● ● O10G2 A0A087WXB1 ● ● OR3A3 OR4D5 ● A0A126GWB3 OR2T6 ● ● ● OR6C6 ● OR4B1 OR1L1 ● ● ● ● OR5V1 OR5A1 ● ● O56A5 A0A0B4J1S0 OR4L1 ● ● O56B4 O12D2 ● ● OR2B3 ● ● O5AK3 ● ● OR4K1 O56A3 ● ● A0A126GWT2 ● ● OR2J3 ● ● A0A140T964 A0A140T9F1 O56A1 ● ● A0A126GWS4 A0A140T931 OR2C3 ● ● OR2J2 ● ● OR2J1 OR2G3O14L1 ● ● OR2D3 A0A286YFC4 ● ● OR2G2 OR6J1 ● ● OR4F6 ● OR2Y1 OR2LD ● ● OR4X2 ● ● OR6M1 ● ● O14I1 OR9Q2 ● ● OR8J1 OR1J4 ● ● OR8S1 ● ● OR6N2 ● ● OR9Q1 ● ● OR1D5 ● ● ● OR1D4 Q6IFQ5 ● ● OR1C1 ● ● OR2S1 ● ● O5AP2 ● ● ● OR2BB O10C1 ● ● O10A3 O13C6 O51I2 ● ● OR5H6 A0A140T9J0 ● ● ● A0A126GW86 ● ● ● OR5H8 ● ● ● OR1B1 A0A140TA55 ● ● ● O52P1 ● ● ● ● ● ● ● O11G2 ● ● A0A126GWS8 OR8H1 O52H1 A0A087X0G6 O10Z1 A0A126GWQ9 OR1M1 OR5T1 O5AC2 O13A1 OR2L2 A0A0G2JL13 O56B1 O2T11 A0A087WY02 O6C68 OR2H1 O10G3 OR4D6 O10P1 O2T12 O2T33 OR2T8 O13J1 O10G8 A0A126GWQ6 O4F21 OR4S2 O2T34 OR4F3 OR2T3 O13D1 OR1A1 A0A286YF94 Supplementary Figure 8: Phylogenetic tree for PF13853 9 ALBERT MODEL CONFIGURATION TRIPLETS FORM SINGLET FORM "attention probs dropout prob" 0 "hidden act": "gelu" "gelu" "hidden dropout prob" 0 "embedding size" 128 "hidden size" 312 "initializer range" 0.02 "intermediate size" 1248 "max position embeddings" 512 "num attention heads" 12 pre-training related "num hidden layers" 4 "num hidden groups" 1 "net structure type" 0 "gap size" 0 "num memory blocks" 0 "inner group num" 1 "down scale factor" 1 "type vocab size" 2 "vocab size" 19686 32 "ln type" / "postIn" sequence embedding hidden units 256 fine-tuning related protein sequence post-tokenization length 210 Supplementary Table 1: ALBERT configuration. ALBERT is using the package Trans- formers by Huggingface (https://github.com/huggingface/transformers). The author in- stalled the package in Jan 2020. 10 Neural-fingerprint CONFIGURATION Dropout 0.1 converlution layer size 20 converlution layer number 4 hidden units 128 Atomic connectivity degrees for chemical molecules [0,1,2,3,4,5] INERACTION PREDICTION CONFIGURATION design two linear layer with batch normlization and ReLU attentive pooling drop out 0.3 attentive pooling hidden units 64 LSTM CONFIGURATION embedding hidden units 128 number of LSTM layers 1 dropout 0.2 Supplementary Table 2: There are three major components of the fine tuning model ar- chitecture: ALBERT to extract protein embedding, Neural-fingerprint to extract chemical embedding, and interaction prediction layers. LSTM serves as the baseline. 11 OPTIMIZATION CONFIGURATION training epochs 100 batch size 64 optimizer Adam scheduler to adjust learning rate cosineannealing initial learnin rate 2.00E-05 L2 regularization weight 1.00E-04 Deep Learning Server Supermicro SuperServer 4028GR-TR GPU NVIDIA Tesla R V100 with 32 GB per GPU (256 GB total) of GPU memory CPU Intel Xeon E5-2650 v4 2.2 GHz 12-Core (48-core total) Supplementary Table 3: Optimization related configuration 12 Orphan Receptor Predicted Drug [Known binding protein] A0A087WY02 XXPANQJNYNUNES-UHFFFAOYSA-N ['P23975', 'Q05940', 'P31645'] A0A0A0MQW8 BYJAVTDNIXVSPW-UHFFFAOYSA-N ['Q96RJ0', 'P35348', 'P43140'] A0A0B4J1V8 RGCVKNLCSQQDEP-UHFFFAOYSA-N ['P33261', 'P18901', 'P08684'] A0A0C4DFX5 MBUVEWMHONZEQD-UHFFFAOYSA-N ['P11509', 'P33261', 'Q16873'] A0A126GVR8 BPZSYCZIITTYBL-YJYMSZOUSA-N ['P13945', 'P11509', 'P07550'] A0A126GWK9 VQODGRNSFPNSQE-DVTGEIKXSA-N ['P46721', 'P04083', 'P20309'] A0A126GWS4 VMWNQDUVQKEIOC-CYBMUJFWSA-N ['P18901', 'P08684', 'P50226'] A0A1B0GTK7 ATALOFNDEOCMKK-OITMNORJSA-N ['P29371', 'P11712', 'P33261'] A0A1B0GVZ0 OWQUZNMMYNAXSL-UHFFFAOYSA-N ['Q01959', 'P31390', 'P35367'] A0A286YF86 BGDKAVGWHJFAGW-UHFFFAOYSA-N ['P08485', 'P20309', 'P17200'] A0A286YF92 IRSCQMHQWWYFCW-UHFFFAOYSA-N [", 'Q13255', 'Q96FL8'] A0A286YFH6 BUGYDGFZZOZRHP-UHFFFAOYSA-N ['O00591', 'P11509', 'Q96FL8'] A0A2C9F2M5 CYQFCXCEBYINGO-IAGOWNOFSA-N