Supporting Information
Relating Essential Proteins To Drug Side-Effects Using Canonical Component Analysis: A Structure-Based Approach
Tianyun Liu and Russ B Altman*
Affiliations:
Tianyun Liu Department of Genetics, Stanford University Email: [email protected]
Russ B Altman Department of Genetics and Department of Bioengineering, Stanford University Email: [email protected]
1 Figure S1. We calculated the PF-affinity scores of celecoxib against 563 proteins in Essential Protein Dataset. Of the 563 proteins, we have found experimental IC50 recorded in IC50 Assay Dataset for six unique proteins, shown as color squares in the PDF plot of all 563 PF-affinity scores. The bottom panel shows a clear correlation between the PF-affinity scores and log (IC50) values. It suggests that the predicted PF- affinity score can be an indicator of binding affinity of celecoxib to essential proteins.
2 Figure S2. Performance of SVD-CCA We have created two subsets of the side-effect data from SIDER: SE-subset-50 includes side-effects that are only observed in fewer than 108 (50%) of the 216 drugs. SE- subset-20 includes side-effects that are only observed in fewer than 43 (20%) of the 216 drugs. We evaluated performance by using leave-one-out cross validations.
3 Figure S3-A. Characteristics of side-effect data from Yamanish’s publication We have downloaded side-effect data from Yamanishi’s published work [reference #2 in the manuscript]. This figure shows the histogram of the number of associated drugs (percentage of the 674 drugs) for each of the 1343 side-effects.
4 Figure S3-B. The drug molecular information data (known drug targets from KEGG) is sparse for the 674 drugs published by Yamanishi [REF]. The drug target profile includes a total of 9180 drug-protein interactions, involving 674 drugs and 2080 unique proteins. The average number of interactions per drug is 13 (of a total of 2080 proteins).
2000 s t e g r
a 1500 t
g u r d
1000 f o
r e
b 500 m u (from 2080 targets)2080 (from n 0 0 100 200 300 400 500 600 700 674 drugs published by Yamanishi et al
5 Figure S4-A. Performance of SVD-CCA on the dataset from Yamanishi’s publication The figure shows the AUC (under the ROC curve) scores for side-effect predictions for individual drugs in the leave-one-out cross validation. We compared the performance of SVD-CCA when using different sets of CCs. The average performance is not improved when the number of CCs used in predicting side-effects increases. This suggests that information stored in the first canonical component plays the dominant role.
6 Figure S4-B. We calculated the frequency of side-effect observed in side-effect data from Yamanishi’s publication. We then plotted the weights assigned to each of the 1343 side-effects in the first canonical component (CC#1) against the frequency of side- effects. It shows that the weight assigned in CC#1 correlates well with the frequency of side-effect.
7
Figure S5. The attribute weights from SVD-CCA process is sparse. We are also interested in the biological interpretability of the outputs to understand the relationship between off-targets and side-effects. Because the weights (alpha and beta) from SVD-CCA satisfy the following: