Zoology and Genetics Proceedings and Zoology and Genetics Presentations

1999 Data-Driven Theory Refinement Algorithms for Jihoon Yang HRL Laboratories

Rajesh Parekh Allstate & Planning Center

Vasant Honavar

Drena Dobbs Iowa State University, [email protected]

Follow this and additional works at: http://lib.dr.iastate.edu/zool_conf Part of the Bioinformatics Commons, Cell and Developmental Biology Commons, and the Commons

Recommended Citation Yang, Jihoon; Parekh, Rajesh; Honavar, Vasant; and Dobbs, Drena, "Data-Driven Theory Refinement Algorithms for Bioinformatics" (1999). Zoology and Genetics Proceedings and Presentations. 1. http://lib.dr.iastate.edu/zool_conf/1

This Conference Proceeding is brought to you for free and open access by the Zoology and Genetics at Iowa State University Digital Repository. It has been accepted for inclusion in Zoology and Genetics Proceedings and Presentations by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected].

DataDriven Theory Renement Algorithms for

Bioinformatics

Jiho on Yang Ra jesh Parekh

Information Sciences Lab Group

HRL Lab oratories Allstate Research Planning Center

Malibu Canyon Road Middleeld Road

Malibu CA Menlo Park CA

yangwinshrlcom rpareallstatecom

VasantHonavar Drena Dobbs

AI Lab Computational Biology Lab Computational Biology Lab

Computer Science Department Zo ology Genetics Department

Iowa State University Iowa State University

Ames IA Ames IA

honavarcsiastateedu ddobbsiastateedu

Abstract Bioinformatics and related applications call for either augmenting it with new knowledge or by rening the

ecient algorithms for knowledgeintensive learning and

existing knowledge are called theory renement systems

datadriven knowledge renement Knowledge based arti

Theory renement systems can b e broadly classied into

cial neural networks oer an attractive approach to ex

the following categories

tending or mo difying incomplete knowledge bases or do

main theories We present results of exp eriments with sev

Approaches based on Rule Induction which use

eral such algorithms for datadriven knowledge discovery

decision tree or rule learning algorithms for theory re

and theory renement in some simple bioinformatics appli

vision Examples of such systems include RTLS

cations Results of exp eriments on the rib osome binding site

and promoter site identication problems indicate that the

EITHER PTR and TGCI

p erformance of KBDistAl and TilingPyramid algorithms com

Approaches based on Inductive Logic Program

pares quite favorably with those of substantially more com

ming which representknowledge using rstorder

putationally demanding techniques

logic or restricted subsets of it Examples of such

systems include FOCL and FORTE

I Introduction

Connectionist Approaches using Articial Neu

Inductive learning systems attempt to learn a concept

ral Networks whichtypically op erate by rst emb ed

description from a sequence of lab eled examples

ding domain knowledge into an appropriate initial neu

Articial neural networks b ecause of their massive paral

ral network top ology and rene it by training the re

lelism and p otential for fault and noise tolerance oer an

sulting neural network on the set of lab eled examples

attractive approach to inductive learning Such

The KBANN system as well as related ap

systems have b een successfully used for datadriven knowl

proaches and oer examples of this approach

edge acquisition in several application domains However

In exp eriments involving datasets from the Human

these systems generalize from the lab eled examples alone

Genome Pro ject KBANN has b een rep orted to have out

The availability of domain sp ecic knowledge domain the

p erformed symb olic theory renement systems suchasEI

ories ab out the concept b eing learned can p otentially en

THER and other learning algorithms suchasbackpropa

hance the p erformance of the inductive learning system

gation and ID KBANN is limited bythefactthat

Hybrid learning systems that eectively combine domain

it do es not mo dify the networks top ology and theory re

knowledge with the inductive learning can p otentially learn

nement is conducted solely by up dating the connection

faster and generalize b etter than those based on purely in

weights This prevents the incorp oration of new rules and

ductive learning learning from lab eled examples alone

also restricts the algorithms ability to comp ensate for in

In practice the domain theory is often incomplete or even

accuracies in the domain theory Against this background

inaccurate

constructive neural network learning algorithms b ecause

Inductive learning systems that use information from

of their ability to mo dify the network architecture bydy

training examples to mo dify an existing domain theory by

namically adding neurons in a controlled fashion

oer an attractive approach to datadriven theory re

This researchwas partially supp orted by grants from the National

Science Foundation IRI and the John Deere Foundation to

These datasets are available at ftpftpcswiscedumachine Vasant Honavar and a grant from the Carver Foundation to Drena

learningshavlikgroupdatasets Dobbs and Vasant Honavar

networks are generated by strategically adding no des at nement Available domain knowledge is incorp orated into

dierent lo cations within the b est network selected These an initial network top ology eg using the rulestonetwork

networks are trained and inserted into the queue and the algorithm of or by other means Inaccuracies in the

pro cess is rep eated domain theory are comp ensated for by extending the net

work top ology using training examples Figure depicts

The REGENT algorithm uses a genetic search to explore

this pro cess

the space of network architectures It rst creates a

diverse initial p opulation of networks from the KBANN net

work constructed from the domain theory Genetic search alidation set as

Constructive Neural Network uses the classication accuracy on a crossv

a tness measure REGENTs mutation op erator adds a

no de to the network using the TopGen algorithm It also

uses a sp ecially designed crossover op erator that maintains

the networks rule structure The p opulation of networks is

Domain

sub jected to tness prop ortionate selection mutation and

Theory

crossover for many generations and the b est network pro

duced during the entire run is rep orted as the solution The

KBDistAl algorithm constructs a single hidden layer

It uses a computationally ecient DistAl algorithm which

tire network in one pass through the train

Input Units constructs the en

ing set instead of relying on the iterative approachusedby

Fletcher and Obradovc which requires a large number of

Fig Theory Renement using a Constructive Neural Network

passes through the training set The key idea b ehind Dis

tAl is to add hyperspherical hidden neurons

I I Constructive Theory Refinement Using

one at a time based on a greedy strategy which ensures

KnowledgeBased Neural Networks

that each hidden neuron that is added correctly classies a

maximal subset of training patterns b elonging to a single

This section briey describ es the constructive theory re

class Correctly classied examples can then b e eliminated

nement systems which are exp erimentally compared in

from further consideration The pro cess is rep eated un

this pap er on some sample datadriven knowledge rene

til the network correctly classies the entire training set

ment tasks in bioinformatics

or some other suitable termination criterion eg based

Fletcher and Obradovic designed a constructive

on crossvalidation is met It is straightforward to show

learning metho d for dynamically adding neurons to the

that DistAl which sets the hidden to output layer weights

initial knowledge based network Their approach starts

without going through an iterative pro cess is guaranteed to

with an initial network representing the domain theory and

converge to classication accuracy on any nite train

mo dies this theory by constructing a single hidden layer

ing set in time that is quadratic in the numb er of training

of threshold logic units TLUs from the lab eled training

patterns Exp eriments rep orted in show that Dis

data using the HDE algorithm The HDE algorithm

tAl despite its simplicity yields classiers that compare

divides the feature space with hyp erplanes Fletcher and

quite favorably with those generated using more sophisti

Obradovics algorithm maps these hyp erplanes to a set of

cated and substantially more computationally demanding

TLUs and then trains the output neuron using the p o cket

learning algorithms KBDistAl uses a very simple approach

algorithm

to incorp orating prior knowledge into DistAl The input

The RAPTURE system is designed to rene domain the

patterns are classied using the rules and the resulting out

ories that contains probabilistic rules represented in the

puts are used to augment the pattern b efore it is fed to the

certaintyfactor format RAPTUREs approachtomod

neural network which is constructed using DistAl That is

ifying the network top ology diers from that used in KB

DistAl is used as the constructive algorithm in Figure

DistAl as follows RAPTURE uses an iterative algorithm to

This eliminates the need for translating the rules into a

train the weights and employs the information gain heuris

neural network

tic to add links to the network KBDistAl is simpler

The TilingPyramid algorithm uses a novel combina than RAPTURE in that it uses a noniterative constructive

tion of the Tiling and Pyramid constructive learning algo learning algorithm to augment the initial domain theory

rithms It employs a symb olic knowledge enco ding Opitz and Shavlik have extensively studied connectionist

pro cedure to translate a domain theory into a set of prop o theory renement systems that overcome the xed top ology

sitional rules using a pro cedure that is based on the rules limitation of the KBANN algorithm The TopGen

tonetworks algorithm of Towell and Shavlik whichis algorithm uses a heuristic search through the space

used in KBANN TopGenandREGENTItyieldsasetof of p ossible expansions of a KBANN network constructed

rules each of which has only one antecedent The rule set from the initial domain theory TopGen maintains a queue

is then mapp ed to an ANDOR graph whichinturnisdi of candidate networks ordered by their test accuracy on a

rectly translated into a neural network The TilingPyramid crossvalidation set At each step TopGen picks the b est

algorithm uses the Tiling algorithm to construct a faithful network and explores p ossible ways of expanding it New

representation of the pattern set A faithful representation

TABLE I

of the pattern set is one in whichnotwo patterns b elonging

Results of Rib osome dataset

to dierent output classes has the same output represen

tation The Pyramid algorithm is used to successively add

Test Size

new layers of TLUs to the network Each newly added

Rules alone 

layer b ecomes the networks new output layer The neu

KBDistAl no pruning  

rons of the new layer are connected to the input layer and

KBDistAl with pruning  

all previously added layers of the network Exp eriments

TilingPyramid  

have shown that the hybrid algorithm outp erforms b oth

TopGen 

Pyramid and Tiling algorithms

REGENT 

In the exp eriments rep orted in this pap er the Tiling

Pyramid constructive learning algorithm was used to aug

TABLE I I

ment the initial domain knowledge The hybrid net

Results of Promoters dataset

work was trained using the Thermal Perceptron learning

rule Each TLU was trained for ep o chs with the

Test Size

initial weights chosen randomly b etween and and the

Rules alone 

initial temp erature for the thermal p erceptron T set to



KBDistAl no pruning  

KBDistAl with pruning  

I I I Experimental Results

TilingPyramid  

TopGen 

This section rep orts results of exp eriments using KBDis

REGENT 

tAl and TilingPyramid on datadriven theory renement

on the rib osome binding site and promoter site prediction

used by Shavliks group

ting of training data in the case of the Rib osome data In

Rib osome

fact when the network pruning pro cedure was applied in

This data is from the Human Genome Pro ject It com

KBDistAl to avoid overtting the generalization accuracy

prises of a domain theory and a set of lab eled exam

b ecame comparable to that of TilingPyramid

ples The input is a short segmentofDNAnucleotides

The time taken by b oth KBDistAl and TilingPyramid is

and the goal is to learn to predict whether the DNA

signicantly less than that of the other approaches KBDis

segments contain a rib osome binding site There are

tAl and TilingPyramid take a fraction of a minute to a few

rules in the domain theory and examples in

minutes of CPU time on each dataset used in the exp eri

the dataset

ments In contrast TopGen and REGENT were rep orted to

Promoters

have taken several days to obtain the results rep orted in

This data is also from the Human Genome Pro ject

It is also worth noting that the networks generated by

and consists of a domain theory and a set of lab eled

KBDistAl and TilingPyramid are substantially more com

examples The input is a short segment of DNA nu

pact than those generated by TopGen and REGENT

cleotides and the goal is to learn to predict whether

the DNA segments contain a promoter site There are

IV Summary and Discussion

rules in the domain theory and examples in the

dataset

Theory renementtechniques oer an attractiveap

The rep orted results are based on a fold cross proach to exploiting available domain knowledge to en

hance the p erformance of datadriven knowledge acquisi

validation The average training and test accuracies of

tion systems Neural networks have b een used extensively

the rules in domain theory alone were  and

 for Rib osome dataset and  and in theory renement systems that have b een prop osed in

the literature Most of such systems translate the domain

 for Promoters dataset resp ectivelyTable I

and I I shows the average generalization accuracy and the theory into an initial neural network architecture and then

train the network to rene the theoryTheKBANN algo

average network size along with the standard deviations

rithm is demonstrated to outp erform several other learning

where available for Rib osome and Promoters datasets

resp ectively algorithms on some domains However a signi

cant disadvantage of KBANN is its xed network top ology

Table I and I I compare the p erformance of KBDistAl and

TopGen and REGENT algorithms were prop osed to elimi

TilingPyramid with that of some of the other approaches

nate this limitation and attempt to mo dify the network ar

that have b een rep orted in the literature TilingPyramid

chitecture Exp erimental results demonstrate that TopGen

substantially outp erforms TopGen REGENT and KBDis

and REGENT outp erform KBANN on several applications

tAl in terms of generalization accuracy on the Promoters

dataset TilingPyramids generalization accuracy is com

parable to that of TopGen and REGENT and substantially Exp erimental results presented in this pap er demon

b etter than that of KBDistAl on the Rib osome dataset We strate that approaches to datadriven theory renement

conjecture that KBDistAl mighthave suered from overt using constructive neural network training algorithms such

ded in mobile software agents that travel from one as KBDistAl and TilingPyramid are comp etitive in terms

data source to another carrying with them only the cur of of generalization accuracy with several of the more com

rentknowledge base as opp osed to approaches rely on ship putationally exp ensive algorithms Additional exp eriments

ping large volumes of data to a centralized rep ository where are needed to conclusively determine the extent to which

knowledge acquisition is p erformed Thus datadriven the incorp oration of prior knowledge improves classication

knowledge renement algorithms constitute one of the key accuracy andor reduces learning time in realworld knowl

comp onents of distributed knowledge network environ edge discovery applications Examination of the changes to

ments for knowledge discovery in bioinformatics and re domain knowledge that are induced by training data is also

lated applications of signicantinterest

In several application domains knowledge acquired on

It can b e argued that KBDistAl and TilingPyramid are

one task can often b e utilized to accelerate knowledge ac

not theory renement algorithms in a strict sense They

quisition on related tasks Datadriven theory renement

make use of the domain knowledge in inductive learning

is particularly attractive in applications that lend them

as opp osed to truly rening the knowledge Perhaps KB

selves to suchcumulativemultitask learning The use

DistAl TilingPyramid and related approaches should b e

of KBDistAl TilingPyramid or similar algorithms in such

more accurately describ ed as a know ledge guided inductive

scenarios remains to b e explored

theory construction systems

There are several extensions and variants of KBDistAl

References

that are worth exploring Given the fact that DistAl relies

T Mitchell McGraw Hill New York

on interpattern distances to induce classiers from data

V Honavar R Parekh and J Yang Machine learning Prin

it is straightforward to extend it so as to handle a much

ciples and applications in Encyclopedia of Electrical and Elec

broader class of problems including those that involvepat

tronics JWebster Ed WileyNewYork In

press

terns of variable sizes eg strings or symb olic structures

M Hassoun Fundamentals of Articial Neural Networks MIT

as long as suitable interpattern distance metrics can b e

Press Boston MA

dened Some steps toward rigorous denitions of distance

B Ripley Pattern Recognition and Neural NetworksCambridge

University Press New York

metrics based on information theory are outlined in

J W Shavlik A framework for combining symb olic and neural

Variants of DistAl and KBDistAl that utilize such distance

learning in Articial Intel ligence and Neural Networks Steps

metrics are currently under investigation

Toward Principled Integration V Honavar and L Uhr Eds

pp Academic Press New York

Several authors haveinvestigated approaches to rule ex

A Ginsb erg Theory reduction theory revision and retransla

traction from neural networks in general and connectionist

tion in Proceedings of the Eighth National ConferenceonAr

ticial Intel ligence Boston MA pp AAAIMIT

theory renement systems in particular One

Press

goal of suchwork is to represent the learned knowledge in

D Ourston and R J Mo oney Theory renementCombining

a form that is comprehensible to humans In this context

analytical and empirical metho ds Articial Intel ligencevol

pp

rule extraction from classiers induced by KBDistAl and

M Kop el R Feldman and A Serge Biasdriven revision of

TilingPyramid is of some interest

logical domain theories Journal of Articial Intel ligenceRe

Extensive exp eriments on a wide range of knowledge dis

searchvol pp

S Donoho and L Rendell Representing and restructuring do

covery tasks in bioinformatics eg protein structure pre

main theopries A constructive induction approach Journal of

diction identication of molecular structurefunction rela

Articial Intel ligenceResearchvol pp

tionships using a broad range of datadriven knowledge re

M Pazzani and D Kibler The utilityofknowledge in inductive

learning Machine Learningvol pp

nementtechniques including constructive neural network

B Richards and R Mo oney Automated renement of rst

approaches suchasKBDistAl and TilingPyramid decision

order hornclause domain theories Machine Learningvol

tree and rule induction techniques Bayesian networks and

pp

G Towell J Shavlik and M No ordwier Renementofapprox

related approaches are needed in order to characterize the

imate domain theories byknowledgebasedneuralnetworks in

relative strengths and limitations of dierenttechniques

Proceedings of the Eighth National ConferenceonArticial In

tel ligence Boston MA pp

In several practical applications of interest all of the

G Towell and J Shavlik Knowledgebased articial neural

data needed for synthesizing reasonably precise classiers

networks Articial Intel ligencevol no pp

is not available at once This calls for incremental algo

L M Fu Integration of neural heuristics into knowledgebased

rithms that continually rene knowledge as more and more

inference Connection Sciencevol pp

data b ecomes available Computational eciency consid

B F Katz Ebl and sbl A neural network synthesis in

erations argue for the use of datadriven theory renement

Proceedings of the Eleventh Annual Conference of the Cognitive

ScienceSociety pp

systems as opp osed to storing large volumes of data and

R Parekh J Yang and V Honavar Constructiveneural

rebuilding the entire classier from scratch as new data b e

network learning algorithms for multicategory realvalued pat

comes available Some preliminary steps in this direction

tern classication Tech Rep ISUCSTR Department

of Science Iowa State University

are describ ed in

V Honavar J Yang and R Parekh Structural learning in

A related problem involves knowledge discovery from

Encyclopedia of Electrical and Electronics EngineeringJWeb

large physically distributed dynamic data sources in a

ster Ed WileyNewYork In press

J Yang R Parekh and V Honavar DistAlAninterpattern

networked environment eg data in genome databases

distancebased constructive learning algorithm in Proceedings

Given the large volumes of data involved this argues for

of the International Joint ConferenceonNeural NetworksAn

chorage Alaska pp the use of datadriven theory renement algorithms embed

J Fletcher and Z Obradovic Combining prior symbolic knowl

edge and constructive neural network learning Connection Sci

encevol no pp

E Baum and K Lang Constructing hidden units using exam

ples and queries in Advances in Neural Information Process

ing Systems vol R Lippmann J Mo o dy and D Touretzky

Eds San Mateo CA pp Morgan Kaufmann

S Gallant Perceptron based learning algorithms IEEE

Transactions on Neural Networksvol no pp

June

J Mahoney and R Mo oney Comparing metho ds for ren

ing certaintyfactor rulebases in Proceedings of the Eleventh

International Conference on Machine Learning pp

R Quinlan Induction of decision trees Machine Learning

vol pp

D W Opitz and J W Shavlik Dynamically adding symbol

ically meaningful no des to knowledgebased neural networks

Know ledgeBased Systemsvol no pp

D W Opitz and J W Shavlik Connectionist theory rene

ment Genetically searching the space of network top ologies

Journal of Articial Intel ligenceResearchvol pp

J Yang R Parekh V Honavar and D Dobbs Datadriven

theory renement using KBDistAl in Proceedings of the Third

Symposium on Intel ligent Data Analysis In press

J Yang R Parekh and V Honavar DistAl An inter

pattern distancebased constructive learning algorithm Intel

ligent Data Analysis To app ear

R Parekh and V Honavar Constructive theory renementin

knowledge based neural networks in Proceedings of the In

ternational Joint Conference on Neural NetworksAnchorage

Alaska pp

M Frean A thermal p erceptron learning rule Neural Com

putationvol pp

D Lin An informationtheoretic denition of similarity in

International Conference on Machine Learning

G Towell and J Shavlik Extracting rules from knowledge

based neural networks Machine Learningvol pp

L M Fu Knowledge based connectionism for revising domain

theories IEEE Transactions on Systems Man and Cybernet

icsvol no pp

M Craven Extracting Comprehensible Models from Trained

Neural Networks PhD thesis Department of Computer Sci

ence University of Wisconsin Madison WI

V Honavar L Miller and J Wong Distributed knowledge net

works in Proceedings of IEEE Information Technology Con

ference Piscataway NJ pp

J White Mobile agents in SoftwareAgents J Bradshaw

Ed MIT Press Cambridge MA

S Thrun Lifelong learning A case study Tech Rep CMU

CS Carnegie Mellon University Pittsburgh PA

© 1999 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

View publication stats