Data-Driven Theory Refinement Algorithms for Bioinformatics Jihoon Yang HRL Laboratories

Zoology and Genetics Proceedings and Presentations, Zoology and Genetics, Iowa State University, 1999

Authors: Jihoon Yang (HRL Laboratories), Rajesh Parekh (Allstate Research & Planning Center), Vasant Honavar (Iowa State University), Drena Dobbs (Iowa State University)

Recommended Citation: Yang, Jihoon; Parekh, Rajesh; Honavar, Vasant; and Dobbs, Drena, "Data-Driven Theory Refinement Algorithms for Bioinformatics" (1999). Zoology and Genetics Proceedings and Presentations. 1. http://lib.dr.iastate.edu/zool_conf/1
Data-Driven Theory Refinement Algorithms for Bioinformatics

Jihoon Yang
Information Sciences Lab, HRL Laboratories
Malibu Canyon Road, Malibu, CA
yang@wins.hrl.com

Rajesh Parekh
Data Mining Group, Allstate Research & Planning Center
Middlefield Road, Menlo Park, CA
rpare@allstate.com

Vasant Honavar
AI Lab and Computational Biology Lab, Computer Science Department
Iowa State University, Ames, IA
honavar@cs.iastate.edu

Drena Dobbs
Computational Biology Lab, Zoology and Genetics Department
Iowa State University, Ames, IA
ddobbs@iastate.edu

This research was partially supported by grants from the National Science Foundation (IRI) and the John Deere Foundation to Vasant Honavar, and a grant from the Carver Foundation to Drena Dobbs and Vasant Honavar.

Abstract

Bioinformatics and related applications call for efficient algorithms for knowledge-intensive learning and data-driven knowledge refinement. Knowledge-based artificial neural networks offer an attractive approach to extending or modifying incomplete knowledge bases or domain theories. We present results of experiments with several such algorithms for data-driven knowledge discovery and theory refinement in some simple bioinformatics applications. Results of experiments on the ribosome binding site and promoter site identification problems indicate that the performance of the KBDistAl and Tiling-Pyramid algorithms compares quite favorably with that of substantially more computationally demanding techniques.

I. Introduction

Inductive learning systems attempt to learn a concept description from a sequence of labeled examples. Artificial neural networks, because of their massive parallelism and potential for fault and noise tolerance, offer an attractive approach to inductive learning. Such systems have been successfully used for data-driven knowledge acquisition in several application domains. However, these systems generalize from the labeled examples alone. The availability of domain-specific knowledge (domain theories) about the concept being learned can potentially enhance the performance of the inductive learning system. Hybrid learning systems that effectively combine domain knowledge with inductive learning can potentially learn faster and generalize better than those based on purely inductive learning (learning from labeled examples alone). In practice, the domain theory is often incomplete or even inaccurate.

Inductive learning systems that use information from training examples to modify an existing domain theory, either by augmenting it with new knowledge or by refining the existing knowledge, are called theory refinement systems. Theory refinement systems can be broadly classified into the following categories:

- Approaches based on Rule Induction, which use decision tree or rule learning algorithms for theory revision. Examples of such systems include RTLS, EITHER, PTR, and TGCI.
- Approaches based on Inductive Logic Programming, which represent knowledge using first-order logic (or restricted subsets of it). Examples of such systems include FOCL and FORTE.
- Connectionist Approaches using Artificial Neural Networks, which typically operate by first embedding domain knowledge into an appropriate initial neural network topology and then refining it by training the resulting neural network on the set of labeled examples. The KBANN system, as well as related approaches, offer examples of this approach.

In experiments involving datasets from the Human Genome Project (these datasets are available at ftp://ftp.cs.wisc.edu/machine-learning/shavlik-group/datasets), KBANN has been reported to have outperformed symbolic theory refinement systems (such as EITHER) and other learning algorithms such as backpropagation and ID3. KBANN is limited by the fact that it does not modify the network's topology: theory refinement is conducted solely by updating the connection weights. This prevents the incorporation of new rules and also restricts the algorithm's ability to compensate for inaccuracies in the domain theory. Against this background, constructive neural network learning algorithms, because of their ability to modify the network architecture by dynamically adding neurons in a controlled fashion, offer an attractive approach to data-driven theory refinement. Available domain knowledge is incorporated into an initial network topology (e.g., using the rules-to-network algorithm or by other means). Inaccuracies in the domain theory are compensated for by extending the network topology using training examples. Figure 1 depicts this process.

[Fig. 1. Theory Refinement using a Constructive Neural Network. Diagram labels: Input Units, Domain Theory, Constructive Neural Network.]

II. Constructive Theory Refinement Using Knowledge-Based Neural Networks

This section briefly describes the constructive theory refinement systems that are experimentally compared in this paper on some sample data-driven knowledge refinement tasks in bioinformatics.

Fletcher and Obradovic designed a constructive learning method for dynamically adding neurons to the initial knowledge-based network. Their approach starts with an initial network representing the domain theory and modifies this theory by constructing a single hidden layer of threshold logic units (TLUs) from the labeled training data using the HDE algorithm. The HDE algorithm divides the feature space with hyperplanes. Fletcher and Obradovic's algorithm maps these hyperplanes to a set of TLUs and then trains the output neuron using the pocket algorithm.

The RAPTURE system is designed to refine domain theories that contain probabilistic rules represented in the certainty-factor format. RAPTURE's approach to modifying the network topology differs from that used in KBDistAl as follows: RAPTURE uses an iterative algorithm to train the weights and employs the information gain heuristic to add links to the network. KBDistAl is simpler than RAPTURE in that it uses [...]

[...] networks are generated by strategically adding nodes at different locations within the best network selected. These networks are trained and inserted into the queue, and the process is repeated.

The REGENT algorithm uses a genetic search to explore the space of network architectures. It first creates a diverse initial population of networks from the KBANN network constructed from the domain theory. Genetic search uses the classification accuracy on a cross-validation set as a fitness measure. REGENT's mutation operator adds a node to the network using the TopGen algorithm. It also uses a specially designed crossover operator that maintains the network's rule structure. The population of networks is subjected to fitness-proportionate selection, mutation, and crossover for many generations, and the best network produced during the entire run is reported as the solution.

The KBDistAl algorithm constructs a single hidden layer. It uses the computationally efficient DistAl algorithm, which constructs the entire network in one pass through the training set instead of relying on the iterative approach used by Fletcher and Obradovic, which requires a large number of passes through the training set. The key idea behind DistAl is to add hyperspherical hidden neurons one at a time based on a greedy strategy which ensures that each hidden neuron that is added correctly classifies a maximal subset of training patterns belonging to a single class. Correctly classified examples can then be eliminated from further consideration. The process is repeated until the network correctly classifies the entire training set or some other suitable termination criterion (e.g., based on cross-validation) is met. It is straightforward to show that DistAl, which sets the hidden-to-output layer weights without going through an iterative process, is guaranteed to converge to 100% classification accuracy on any finite training set in time that is quadratic in the number of training patterns. Reported experiments show that DistAl, despite its simplicity, yields classifiers that compare quite favorably with those generated using more sophisticated and substantially more computationally demanding learning algorithms.

KBDistAl uses a very simple approach to incorporating prior knowledge into DistAl. The input patterns are classified using the rules, and the resulting outputs are used to augment the pattern before it is fed to the neural network, which is constructed using DistAl. That is, DistAl is used as the constructive algorithm in Figure 1. This eliminates the need for translating the rules into a neural network.

The Tiling-Pyramid algorithm uses a novel combina[...]
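
The greedy DistAl construction and the KBDistAl input augmentation described in this section can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the published implementation: it assumes Euclidean distance, single-radius balls centered on training patterns, and a "first neuron that fires wins" output rule with a nearest-ball fallback. The names `augment_with_rules`, `fit_distal`, and `predict_distal` are hypothetical, and the sketch favors clarity over the quadratic running time stated for DistAl.

```python
import numpy as np


def augment_with_rules(X, rules):
    """Append each rule's 0/1 output to every pattern (KBDistAl-style input).

    `rules` is a hypothetical list of callables mapping a pattern to a bool.
    """
    X = np.asarray(X, dtype=float)
    rule_out = np.array([[float(rule(x)) for rule in rules] for x in X])
    return np.hstack([X, rule_out])


def fit_distal(X, y):
    """Greedily add hyperspherical neurons as (center, radius, label) triples.

    Assumption: each ball is centered on a training pattern and must contain
    only remaining patterns of the center's class.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances between all training patterns.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    remaining = np.ones(len(X), dtype=bool)
    neurons = []
    while remaining.any():
        idx = np.flatnonzero(remaining)
        best = None  # (patterns covered, center index, radius, covered indices)
        for i in idx:
            order = idx[np.argsort(dist[i, idx], kind="stable")]
            d = dist[i, order]
            same = y[order] == y[i]
            # Count the nearest remaining patterns that share the center's class.
            k = len(order) if same.all() else int(np.argmin(same))
            # Back off past ties so the ball excludes the other-class pattern.
            while 0 < k < len(order) and d[k - 1] == d[k]:
                k -= 1
            if k > 0 and (best is None or k > best[0]):
                radius = d[k - 1] if k == len(order) else (d[k - 1] + d[k]) / 2
                best = (k, i, radius, order[:k])
        if best is None:
            raise ValueError("identical patterns with conflicting labels")
        _, i, radius, covered = best
        neurons.append((X[i].copy(), float(radius), y[i]))
        # Correctly classified patterns are eliminated from consideration.
        remaining[covered] = False
    return neurons


def predict_distal(neurons, X):
    """Label each pattern with the first neuron (in order of addition) that fires."""
    X = np.asarray(X, dtype=float)
    centers = np.array([c for c, _, _ in neurons])
    radii = np.array([r for _, r, _ in neurons])
    labels = [label for _, _, label in neurons]
    out = []
    for x in X:
        d = np.linalg.norm(centers - x, axis=1)
        inside = np.flatnonzero(d <= radii)
        # Assumed fallback: if no ball contains x, use the nearest ball surface.
        j = inside[0] if inside.size else int(np.argmin(d - radii))
        out.append(labels[j])
    return np.array(out)
```

To use the sketch in the KBDistAl style, one would pass the training patterns through `augment_with_rules` with the domain-theory rules, call `fit_distal` on the augmented patterns, and apply the same augmentation to new patterns before calling `predict_distal`. Ordering neurons by when they were added is what lets earlier, larger balls take precedence over later ones on the training set.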
