Identifying Interacting Environmental Factor – Gene Pairs

Chad Kimmel, MS¹, Jonathan Lustgarten, PhD³, An-Kwok Ian Wong, MS², Shyam Visweswaran, MD, PhD¹˒²
¹Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA
²Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA
³University of Pennsylvania School of Veterinary Medicine, Philadelphia, PA

ABSTRACT

Most diseases are caused by a combination of environmental and genetic etiological factors, and these factors may interact in causing disease. We conjectured that an environmental factor and a gene that are associated with the same set of diseases are likely to interact. We extracted environmental factor – disease associations and gene – disease associations from freely accessible online databases and the research literature, and identified environmental factor – gene pairs that were similar in being associated with a common set of diseases. For several of these pairs that we examined, we found evidence for plausible biological interactions in the literature. We postulate that the remaining pairs may represent novel interactions between an environmental factor and a gene.

INTRODUCTION

The etiological factors associated with the development of human disease are broadly categorized into environmental and genetic factors. While infections like community-acquired pneumonia are predominantly influenced by environmental factors and Mendelian diseases like sickle cell anemia are predominantly influenced by genetic factors, many of the common diseases like coronary heart disease are influenced by both environmental and genetic factors that likely interact with each other.

Several freely accessible online databases are now available that collate information on genetic factors associated with disease, such as the Online Mendelian Inheritance in Man (OMIM), the Genetic Association Database [1], and GeneCards. Fewer sources of information are available on environmental factors associated with diseases. One freely accessible database is the CHE Toxicant and Disease Database [2], which contains curated information on chemicals/toxins associated with disease, and another source is the dataset provided by Liu et al. in their paper on the “etiome” [3].

Collated data on environmental factor – disease associations and gene – disease associations can provide a wealth of information. For example, identifying diseases that are etiologically similar to a disease of interest may be useful in unraveling the biological mechanisms underlying the disease (based on the already characterized mechanisms in the similar diseases) or in investigating new therapies (based on the therapies in use in the similar diseases).

Several papers have described innovative methods for extracting environmental factors of disease and for analyzing environmental factors in combination with genetic factors. Liu et al. extracted a comprehensive list of associations between disease and environmental factors using the Medical Subject Headings (MeSH) annotations of MEDLINE articles and combined it with genetic factors of disease to characterize the “etiome” profile of over 800 diseases [3]. Patel et al. developed a new method called an Environment-Wide Association Study (EWAS), similar to a Genome-Wide Association Study (GWAS), to extract environmental factors of disease and applied it to type 2 diabetes mellitus [4]. Gohlke et al. extracted environmental factors of disease by identifying key molecular pathways that are jointly associated with genetic and environmental factors using a gene-centric database, and validated their new-found associations with known chemical-disease relationships and transcriptional regulation data [5]. We followed the strategies used by Liu et al. to compile datasets of environmental factor – disease and gene – disease associations.

We conjectured that an environmental factor and a gene that are associated with the same set of diseases are likely to have a biological interaction. In this paper, we computed a similarity measure for an environmental factor and a gene. We then investigated the most similar environmental factor – gene pairs for plausible biological interactions. We defined an interaction as either the environmental factor having a direct influence on the gene, or the environmental factor and the gene (along with its product) having a direct influence on a common biological molecule (e.g., gene, protein, metabolite).

METHODS

In this section, we briefly describe the extraction of environmental and genetic factors of diseases, the similarity measure we used, and the method we followed for the selection of promising environmental factor – gene pairs.
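As a minimal sketch of the similarity measure and pair selection that this section goes on to describe, the Python below computes the Jaccard similarity between the disease set of an environmental factor and the disease set of a gene (shared diseases divided by diseases associated with either one), keeps only factors with at least three associated diseases, and reports pairs scoring 0.4 or higher. It is an illustration under stated assumptions, not the authors' implementation: the function names `jaccard` and `similar_pairs`, the dictionary layout, and the toy associations are all hypothetical.

```python
from itertools import product


def jaccard(diseases_e, diseases_g):
    """Jaccard similarity of two disease sets: S11 / (S11 + S10 + S01)."""
    s11 = len(diseases_e & diseases_g)  # diseases associated with both E and G
    s10 = len(diseases_e - diseases_g)  # diseases associated with E but not G
    s01 = len(diseases_g - diseases_e)  # diseases associated with G but not E
    total = s11 + s10 + s01
    return s11 / total if total else 0.0


def similar_pairs(env_to_diseases, gene_to_diseases, min_diseases=3, threshold=0.4):
    """Return (factor, gene, score, shared diseases) for pairs at or above threshold."""
    # Keep only factors associated with at least `min_diseases` diseases.
    envs = {e: d for e, d in env_to_diseases.items() if len(d) >= min_diseases}
    genes = {g: d for g, d in gene_to_diseases.items() if len(d) >= min_diseases}
    hits = []
    for (e, de), (g, dg) in product(envs.items(), genes.items()):
        score = jaccard(de, dg)
        if score >= threshold:
            hits.append((e, g, score, sorted(de & dg)))
    # Highest-scoring pairs first.
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Toy, made-up associations for illustration only (not data from the study).
env_to_diseases = {
    "factor_A": {"d1", "d2", "d3"},
    "factor_B": {"d1", "d4"},  # dropped: fewer than three diseases
}
gene_to_diseases = {
    "GENE_X": {"d1", "d2", "d5"},
    "GENE_Y": {"d6", "d7", "d8"},
}

for e, g, score, shared in similar_pairs(env_to_diseases, gene_to_diseases):
    print(f"{e} / {g}: J = {score:.2f}, shared = {shared}")
```

In the toy data, factor_A and GENE_X share two of four distinct diseases, so their similarity is 0.5 and the pair passes the threshold. The exhaustive loop over all pairs is a design choice made for clarity; at the scale reported in this paper (2,459 environmental factors and 1,469 genes, about 3.6 million pairs) it should remain practical.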

Environmental Factors. We obtained environmental factors of human diseases from two sources. We obtained chemical-disease associations from the CHE Toxicant and Disease Database [2]. In addition, we extracted environmental factor – disease associations from the MeSH annotations of MEDLINE articles using the strategies developed by Liu et al. [3]. MeSH is a comprehensive controlled vocabulary that contains over 25,000 descriptors (also known as subject headings) and over 80 qualifiers (also known as subheadings). An article in MEDLINE is typically annotated with several descriptor/qualifier pairs. For example, "peptic ulcer" is a descriptor, "chemically induced" is a qualifier, and "peptic ulcer/chemically induced" describes articles on peptic ulcers that are chemically induced (e.g., by a drug like indomethacin). Liu et al. identified several patterns of pairs of MeSH qualifiers to infer environmental factor → disease associations. For example, the association indomethacin → peptic ulcer is inferred from the following two annotations of an article: peptic ulcer / chemically induced and indomethacin / adverse effects (in each annotation, the term before the slash is the MeSH descriptor and the term after it is the MeSH qualifier).

Genetic Factors. We obtained genetic factors of human diseases from the Genetic Association Database (GAD), which contains gene-disease associations that have been curated from the literature [1]. The GAD uses standardized identifiers for genes (NCBI Gene identifiers) and diseases (Medical Subject Headings (MeSH) identifiers) that enable easy machine processing.

Similarity Measure. A similarity measure indicates the strength of commonality between a pair of entities (e.g., an environmental factor and a genetic factor) based on their properties (e.g., associated diseases). We used the Jaccard similarity measure to calculate the similarity between an environmental factor E and a genetic factor G, as follows:

$$J(E, G) = \frac{S_{11}}{S_{11} + S_{10} + S_{01}}$$

where $S_{11}$ is the number of diseases associated with both G and E, $S_{10}$ is the number of diseases associated with E but not G, and $S_{01}$ is the number of diseases associated with G but not E. The Jaccard similarity measure varies between 0 and 1, where 1 denotes that E and G are associated with exactly the same set of diseases, and 0 denotes that E and G have no common associated diseases. We computed the Jaccard similarity measure only for those genetic and environmental factors that were each associated with at least three diseases.

Environmental Factor – Gene Pairs. We examined environmental factor – gene pairs with a Jaccard similarity measure of 0.4 or greater for plausible interactions between members of the pair. We chose a threshold of 0.4 rather arbitrarily so that we had a manageable number of high-scoring environmental factor – gene pairs to examine. We used the research literature and Ingenuity (Ingenuity Systems©, www.ingenuity.com) to identify interactions. Ingenuity is a systems biology database that contains manually annotated relationships between biological agents such as chemicals, drugs, , and . In Ingenuity, we searched for evidence that the environmental factor in a pair directly influences the gene. We also searched for biological molecules (genes, metabolites, etc.) that are related to both the environmental factor and the gene in a pair.

RESULTS

We extracted 51,994 disease-environmental associations between 1,911 diseases and 5,801 environmental agents from the CHE Toxicant and Disease Database and MEDLINE. We extracted 8,872 disease-gene associations between 889 diseases and 1,891 genes from the GAD.

The CHE Toxicant and Disease Database and MEDLINE annotations were downloaded in May 2009. After etiological factors with fewer than three associated diseases were excluded, the number of environmental factors decreased to 2,459 and the number of genes decreased to 1,469. This resulted in more than three and a half million environmental factor – gene pairs for which we computed the Jaccard similarity measure.

We identified a total of 63 environmental factor – gene pairs with a similarity measure of 0.4 or greater. Table 1 gives a list of the 63 pairs and the common associated diseases. The maximum number of common diseases for a pair was three.

We found evidence for an interaction for 10 of the 63 environmental factor – gene pairs. We now summarize some of this evidence. For the sodium salicylate – KCNQ4 gene pair, Wu et al. [6] showed that salicylate blocks the action of the KCNQ4 gene, resulting in hearing loss. For the antithrombins – F3 pair, antithrombin inhibits the complex consisting of coagulation factors III and VII that are produced by genes F3 and F7 [7].

Several of the environmental factor – gene pairs had an influence on the same biological molecule. For instance, in the pair piperonyl butoxide – SMYD3, both piperonyl butoxide and SMYD3 have an influence on the c-Myc oncogene, which has been implicated in the genesis of diverse human tumors. Piperonyl butoxide has been found to increase the activation of the c-Myc oncogene [8], and the interference of human Smyd3 microRNA has been found to decrease the binding of the promoter between the Tert and the c-Myc genes [9].

Of note, the antihistamine meclizine and the HLA-DQB1 gene were present in eleven and eight of the top 63 pairs, respectively. This indicates that meclizine shares a relatively large number of diseases in common with several genes. Similarly, HLA-DQB1 shares a relatively large number of diseases in common with several environmental factors. This may be because the pathways which are perturbed by meclizine and HLA-DQB1 are similar to the ones perturbed by the other environmental factors and genes. Unfortunately, the annotation coverage of pathways perturbed by genes and environmental factors from databases like the Kyoto Encyclopedia of Genes and Genomes (KEGG) [10] is incomplete and we were unable to identify any pathways associated with these factors.

Table 1. List of environmental factor – gene pairs with similarity score of 0.4 or higher. The pairs that had evidence of plausible biological interactions in the literature are shown in bold font.

Environmental factor / Gene | Common associated diseases
Aminopyridines / HLA-DQB1 | Asthma / Occupational Diseases
Androstenes / F13B | Thromboembolism / Venous Thrombosis
Antithrombins / F3 | Hemorrhage / Coronary Thrombosis
Carbon Compounds, Inorganic / LAPTM4B | Lung Neoplasms / Stomach Neoplasms
Carmine / HLA-DQB1 | Asthma / Occupational Diseases
Cellulase / CYSLTR2 | Asthma / Rhinitis
Cellulase / HLA-DQB1 | Asthma / Occupational Diseases
Cellulase / IL4R1 | Asthma / Rhinitis
Cereals / BIRC5 | Cell Transformation, Neoplastic / Uterine Cervical Neoplasms
Cereals / KIR3DL2 | Cervical Intraepithelial Neoplasia / Uterine Cervical Neoplasms
Cereals / KIR3DL3 | Cervical Intraepithelial Neoplasia / Uterine Cervical Neoplasms
Egg Proteins / HLA-DQB1 | Asthma / Occupational Diseases
Escin / HLA-DQB1 | Asthma / Occupational Diseases
Fertility Agents / NBS1 | Melanoma / Breast Neoplasms / Ovarian Neoplasms / Skin Neoplasms
Food, Formulated / DIO1 | Atrophy / Alzheimer Disease
Glycyrrhizic Acid / CYP11B1 | Hypertension / Hyperaldosteronism
Meclizine / LEF1 | Cleft Lip / Cleft Palate
Meclizine / BHMT2 | Cleft Lip / Cleft Palate
Meclizine / C6orf105 | Cleft Lip / Cleft Palate
Meclizine / COL4A4 | Cleft Lip / Cleft Palate
Meclizine / DLX3 | Cleft Lip / Cleft Palate
Meclizine / GLI2 | Cleft Lip / Cleft Palate
Meclizine / HDAC4 | Cleft Lip / Cleft Palate
Meclizine / MLPH | Cleft Lip / Cleft Palate
Meclizine / SCN3B | Cleft Lip / Cleft Palate
Meclizine / SHH | Cleft Lip / Cleft Palate
Meclizine / SP100 | Cleft Lip / Cleft Palate
Methenamine / HLA-DQB1 | Asthma / Occupational Diseases
Methyl n-Butyl Ketone / ATP8B1 | Cholestasis / Cholestasis, Intrahepatic
Methylenebis (chloroaniline) / ZNF350 | Urinary Bladder Neoplasms / Carcinoma, Transitional Cell
Neurotransmitter Uptake Inhibitors / INPP1 | Stress Disorders, Post-Traumatic / Bipolar Disorder
Ninhydrin / CYSLTR2 | Asthma / Rhinitis
Nitrilotriacetic Acid / RGS6 | Lung Neoplasms / Urinary Bladder Neoplasms
Noise / KCNQ4 | Hearing Loss / Hearing Loss, Noise-Induced
O-Phthalaldehyde / HLA-DQB1 | Asthma / Occupational Diseases
Oxalates / CLCN5 | Nephrocalcinosis / Kidney Calculi / Kidney Diseases
Papain / IL4R1 | Conjunctivitis / Asthma / Rhinitis
Parabens / CYP24A1 | Breast Neoplasms / Asthma
Parabens / KDR | Breast Neoplasms / Asthma
Pectins / HLA-DQB1 | Asthma / Occupational Diseases
Pentylenetetrazole / KCNMB3 | Epilepsies, Myoclonic / Epilepsy, Generalized / Epilepsy, Absence
Pesticide Synergists / CXCL14 | Carcinoma, Hepatocellular / Liver Neoplasms
Pesticide Synergists / PTK2 | Carcinoma, Hepatocellular / Liver Neoplasms
Pesticide Synergists / SMYD3 | Carcinoma, Hepatocellular / Liver Neoplasms
Pesticide Synergists / UGT1A4 | Carcinoma, Hepatocellular / Liver Neoplasms
Pesticide Synergists / UGT1A8 | Carcinoma, Hepatocellular / Liver Neoplasms
Phenylurea Compounds / HSD11B2 | Diabetic Nephropathies / Diabetes Mellitus, Type 1
Phenylurea Compounds / TCF2 | Diabetic Nephropathies / Diabetic Neuropathies / Diabetes Mellitus, Type 1
Piperonyl Butoxide / CXCL14 | Carcinoma, Hepatocellular / Liver Neoplasms
Piperonyl Butoxide / SMYD3 | Carcinoma, Hepatocellular / Liver Neoplasms
Pipobroman / CSF3R | Leukemia / Myelodysplastic Syndromes / Anemia, Aplastic
Pipobroman / CSF3R | Leukemia / Myelodysplastic Syndromes / Anemia, Aplastic
Platelet-Derived Growth Factor / CCNH | Mouth Neoplasms / Precancerous Conditions
Potassium Channel Blockers / KCNJ2 | Long QT Syndrome / Arrhythmias, Cardiac
Potassium Channel Blockers / CAV3 | Long QT Syndrome / Arrhythmias, Cardiac
Potassium Channel Blockers / KCNE2 | Long QT Syndrome / Arrhythmias, Cardiac / Torsades de Pointes
Prostaglandin-Endoperoxide Synthases / IRAK3 | Pouchitis / Crohn Disease / Colitis, Ulcerative
Psoralens / APAF1 | Melanoma / Skin Neoplasms
Sodium Azide / PRND | Nervous System Diseases / Alzheimer Disease
Sodium Salicylate / KCNQ4 | Deafness / Hearing Loss
Tartrazine / HRH2 | Angioedema / Asthma / Urticaria
Thionucleotides / RP2 | Retinitis Pigmentosa / Retinal Diseases
Tuberculin / HRH1 | Angioedema / Urticaria

CONCLUSIONS

We presented a method to identify environmental factor – gene pairs that may have plausible biological interactions from knowledge about environmental factors and genetic factors of disease extracted from freely accessible online databases and the research literature. Among the top-scoring environmental factor – gene pairs on the similarity measure, we found evidence for plausible biological interactions in the literature for 10 of the pairs we examined. The pairs for which we found no existing relationships or interactions may represent novel environmental factor – gene interactions.

There are several limitations to our study. We used a single similarity measure, namely the Jaccard similarity measure. We also used an arbitrary similarity measure threshold of 0.4 to select pairs to examine. In future work, we plan to use other similarity measures and also to analyze all pairs comprehensively for known interactions.

ACKNOWLEDGEMENTS

This research was funded by a Predoctoral Fellowship in Clinical and Translational Research from the Clinical and Translational Science Institute (CTSI) at the University of Pittsburgh and the National Library of Medicine grant T15 LM007059 to the University of Pittsburgh Biomedical Informatics Training Program.

REFERENCES

1. Becker, K.G., et al., The genetic association database. Nat Genet, 2004. 36(5): p. 431-2.
2. Davis, A.P., et al., Comparative Toxicogenomics Database: a knowledgebase and discovery tool for chemical-gene-disease networks. Nucleic Acids Res, 2009. 37(Database issue): p. D786-92.
3. Liu, Y.I., P.H. Wise, and A.J. Butte, The "etiome": identification and clustering of human disease etiological factors. BMC Bioinformatics, 2009. 10 Suppl 2: p. S14.
4. Patel, C.J., J. Bhattacharya, and A.J. Butte, An Environment-Wide Association study (EWAS) on type 2 diabetes mellitus. PLoS One, 2010. 5(5): p. e10746.
5. Gohlke, J.M., et al., Genetic and environmental pathways to complex diseases. BMC Syst Biol, 2009. 3: p. 46.
6. Wu, T., et al., Effect of Salicylate on KCNQ4 of the Guinea Pig Outer Hair Cell. J Neurophysiol, 2010.
7. Dellinger, R.P., Inflammation and coagulation: implications for the septic patient. Clin Infect Dis, 2003. 36(10): p. 1259-65.
8. Kawai, M., et al., Elevation of cell proliferation via generation of reactive oxygen species by piperonyl butoxide contributes to its liver tumor-promoting effects in mice. Arch Toxicol, 2010. 84(2): p. 155-64.
9. Liu, C., et al., The telomerase reverse transcriptase (hTERT) gene is a direct target of the histone methyltransferase SMYD3. Cancer Res, 2007. 67(6): p. 2626-31.
10. Kanehisa, M. and S. Goto, KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res, 2000. 28(1): p. 27-30.