Biological Applications of Multi-Relational Data Mining

David Page and Mark Craven
Dept. of Biostatistics and Medical Informatics and Dept. of Computer Sciences
University of Wisconsin, Madison, WI 53706

ABSTRACT

Biological databases contain a wide variety of data types, often with rich relational structure. Consequently, multi-relational data mining techniques are frequently applied to biological data. This paper presents several applications of multi-relational data mining to biological data, taking care to cover a broad range of multi-relational data mining techniques.

1. INTRODUCTION

Biological databases contain a wide variety of data. Consider storing in a database the operational details of a single-cell organism. At minimum one would need to encode the following.

• Genome: DNA sequence and gene locations

• Proteome: the organism's full complement of proteins, not necessarily a direct mapping from its genes

• Metabolic pathways: linked biochemical reactions involving multiple proteins, small molecules and protein-protein interactions

• Regulatory pathways: the mechanism by which the expression of some genes into proteins, such as transcription factors, influences the expression of other genes (includes protein-DNA interactions)

In fact, such a database exists for part of what is known of the widely studied model organism E. coli: EcoCyc [28]. Besides the items listed above, this database contains other information as well, such as operons (see Section 6).

Recording the diversity of data in the previous paragraph obviously requires a rich relational schema with multiple, interacting relational tables. In fact, even recording one type of data, such as metabolic pathways, requires multiple relational tables because of the graphical nature of pathways. And biological data can grow much more involved still. For example, when we move to multi-cellular organisms we need to include cell signaling and other aspects of one cell's interaction with other cells. Furthermore, because at present we do not have complete knowledge of the workings of any organism, biological databases often encode a variety of types of laboratory data that provide further insight into organism behavior. This may include data from the following high-throughput techniques:

• Gene expression microarrays: measure the degree to which each gene is being transcribed into messenger RNA (mRNA) under various conditions

• Proteomics (primarily mass spectrometry): provides insight into which proteins are present in significant concentrations, as well as protein-protein interactions

• Metabolomics: measures the concentrations of small molecules (i.e., those with low molecular weight) in a sample

• Single-nucleotide polymorphism (SNP) measurements: SNP patterns are used as a surrogate for complete DNA sequences of multiple individuals, e.g. human patients; SNPs are single-base positions in DNA where individuals commonly differ

It is possible to gain insights from using any one of the data types described so far by itself. For example, standard machine learning and data mining algorithms have been applied to gene expression microarray data for a variety of purposes, as surveyed by Molla et al. [37]. Nevertheless, researchers in both data mining and biology are realizing that, often, stronger results can be obtained by using many of these data types together. Mining a database that contains several, or all, of these data types necessarily requires multi-relational techniques, either to analyze the data directly or to convert it into a single table in a useful way. The last two KDD Cup competitions have focused on biological databases and, not coincidentally, have highlighted the need for multi-relational data mining tools [9; 11].

The data types discussed so far, though many and diverse, only scratch the surface for biological databases. As researchers attempt to modulate the behavior of biological processes, for example through development of novel pharmaceuticals, still more types of data arise. Each year millions of compounds (molecules) are tested in vitro (in the "test tube") and in vivo (in organisms) to see if they will bind to a target protein or have a desired biological effect. Even if a molecule has a desired effect, such as killing a harmful bacterium, it may be useless as a drug because it may be toxic to humans, it may break down too quickly or too slowly within the human body, it may not diffuse from the gut to the bloodstream (so it has to be taken by injection, which is unacceptable for some target activities), or it may not cross the blood-brain barrier (unacceptable, for example, for an anti-depressant or migraine medication). Therefore, molecules also are tested for these other factors, giving rise to volumes of diverse data within pharmaceutical companies and university laboratories. Furthermore, the three-dimensional structures of molecules and target proteins, where known or estimated, are important items of information that require multiple relational tables to fully represent.

Both the amount and diversity of biological data will continue to grow rapidly because of a coming paradigm shift in biology, toward systems biology. As Hood and Galas (2003) note, whereas in the past biologists could study a "complex system only one gene or one protein at a time," the "systems approach permits the study of all elements in a system in response to genetic (digital) or environmental perturbations." Hood and Galas go on to state:

    The study of cellular and organismal biology using the systems approach is at its very beginning. It will require integrated teams of scientists from across disciplines: biologists, chemists, computer scientists, engineers, mathematicians and physicists. New methods for acquiring and analyzing high-throughput biological data are needed [24].

In addition to data describing our current knowledge of organisms, novel types of high-throughput data, and data that arises from our attempts to alter organism behavior, another major type of biological data is semi-structured or text data. For example, the MEDLINE database [41] indexes more than 11 million articles in the biomedical literature, and contains abstracts and pointers to electronic versions for many of these articles. As another example, one of the web sites used most widely by biologists is GeneCards [46], which presents, for each human gene, information mined from biological text. Text data also has been integrated into the process of mining other types of data such as gene expression microarray data [53; 36].

Discussing in detail the variety of data types in biological databases is beyond the scope of any single paper. The goal of the present paper is to look more closely at a few examples of multi-relational data mining applied to biological data. The paper seeks to present a diversity of data types and of approaches, not focusing on any one particular method for multi-relational data mining.

2. BRIEF SURVEY OF BIOLOGICAL APPLICATIONS OF MULTI-RELATIONAL DATA MINING

Before focusing on a small number of biological applications of multi-relational data mining with which we are most familiar, we first present an overview of a variety of such applications. The earliest such work involved the application of inductive logic programming (ILP) to molecular data. Molecular data is a natural application for multi-relational data mining because no method has been found for representing molecules in ordinary feature-vector form without some loss of information.

In the first successful application of ILP to molecular data [29], the GOLEM program [39] was used to model the structure-activity relationships of trimethoprim analogues binding to dihydrofolate reductase (see footnote 1). The training data consisted of 44 trimethoprim analogues and their observed inhibition of E. coli dihydrofolate reductase. Eleven additional compounds were used as unseen test data. GOLEM obtained rules that were statistically more accurate on the training data and on the test data than a previously published linear regression model. It was possible to run linear regression on this set because the molecules shared a common structure. Figure 1 shows the shared structure, or template, for all the molecules, as well as one particular instance of the template.

Figure 1: The family of analogues in the first ILP study. A) Template of 2,4-diamino-5(substituted-benzyl)pyrimidines: R3, R4, and R5 are the three possible substitution positions. B) Example compound: R3 = Cl; R4 = NH2; R5 = CH3

Following shortly after this work [30], the two-dimensional atom-and-bond molecular descriptions of 229 aromatic and heteroaromatic nitro compounds [15] were given to the ILP system PROGOL [38]. Such compounds frequently are mutagenic. The study was confined to the problem of obtaining structural descriptions that discriminate molecules with positive mutagenicity from those which have zero or negative mutagenicity. A set of eight optimally compact rules was automatically discovered by PROGOL. These rules suggested three previously unknown features leading to mutagenicity. Attempts to force the multi-relational representation of the atom-and-bond structures into feature vectors resulted in the generation of examples containing millions of features each. Hence, unlike in the earlier molecular application, a claim can be made that this application truly needed multi-relational data mining. The mutagenicity data set remains a widely used testbed for multi-relational learning algorithms.

A number of interesting lines of research were motivated by this work on mutagenicity. First, this work led to a competition called the Predictive Toxicology Challenge 2000-2001 [22], in which both ordinary and multi-relational data mining tools were applied to the task of predicting rodent

Footnote 1: [...] to accurately predict which molecules have a particular biological activity (or to predict their degrees of the activity)
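To make the atom-and-bond representation discussed in Section 2 concrete, the following is a minimal sketch, assuming a toy encoding in which each molecule is stored as rows of an atom table and a bond table. The molecules, identifiers, and the structural test `has_nitro_group` are hypothetical illustrations; they are not taken from the mutagenicity data set and are not rules learned by PROGOL. The point is only that a structural feature is naturally expressed as a query joining the two tables, which is awkward to capture in a fixed-length feature vector.

```python
from collections import defaultdict

# Hypothetical toy data: two small molecular fragments (illustrative only).
# Atom table: (molecule_id, atom_id, element)
ATOMS = [
    ("m1", "a1", "c"), ("m1", "a2", "c"), ("m1", "a3", "n"),
    ("m1", "a4", "o"), ("m1", "a5", "o"),
    ("m2", "b1", "c"), ("m2", "b2", "c"), ("m2", "b3", "o"),
]
# Bond table: (molecule_id, atom_id_1, atom_id_2, bond_type)
BONDS = [
    ("m1", "a1", "a2", "aromatic"), ("m1", "a2", "a3", "single"),
    ("m1", "a3", "a4", "double"), ("m1", "a3", "a5", "single"),
    ("m2", "b1", "b2", "single"), ("m2", "b2", "b3", "double"),
]


def elements(mol):
    """Map each atom id of one molecule to its element symbol."""
    return {atom: elem for m, atom, elem in ATOMS if m == mol}


def neighbours(mol):
    """Build an undirected adjacency map from the bond table."""
    adj = defaultdict(set)
    for m, x, y, _bond_type in BONDS:
        if m == mol:
            adj[x].add(y)
            adj[y].add(x)
    return adj


def has_nitro_group(mol):
    """Relational test: some nitrogen is bonded to a carbon and to at least two oxygens."""
    elem = elements(mol)
    adj = neighbours(mol)
    for atom, symbol in elem.items():
        if symbol != "n":
            continue
        bonded = [elem[other] for other in adj[atom]]
        if bonded.count("o") >= 2 and "c" in bonded:
            return True
    return False


if __name__ == "__main__":
    for mol in ("m1", "m2"):
        print(mol, has_nitro_group(mol))
```

In this sketch the test joins the atom and bond tables on atom identifiers, so it applies unchanged to molecules of any size; a feature-vector encoding would instead need one column for every substructure that might matter.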
