HPP-72-5

Reprinted from Biochemical Applications of Mass Spectrometry, edited by George R. Waller, John Wiley & Sons, 1972.

CHAPTER 7

USE OF A COMPUTER TO IDENTIFY UNKNOWN COMPOUNDS: THE AUTOMATION OF SCIENTIFIC INFERENCE*

JOSHUA LEDERBERG
Department of Genetics, School of Medicine, Stanford University, Stanford, California

A. Introduction
B. Motivation
C. Implementation
   1. Generator
   2. All the Ways to Build a Molecule
   3. Graphs of Ring Compounds
   4. Heuristics
D. Commentary
E. Example

A. INTRODUCTION

The Argentinian writer Jorge Luis Borges, in a short story called "The Library of Babel," showed that all knowledge can be reduced to a problem of selection. He portrayed a library of infinite dimensions filled with books printed in an obscure code in which familiar phrases occasionally appeared. Eventually, a mathematician-inhabitant of this space surmised that each book was one of all possible random concatenations of letters. After a few centuries of discouragement, the inhabitants were inspired by a new revelation: that the library must in fact contain all knowledge. The problem was merely one of selecting the proper texts.

The identification of an unknown compound presents a similar challenge. If the universe of possibilities were infinite, the problem might not be rigorously soluble. Practical solutions depend upon the ingenuity with which the domain of acceptable solutions can be narrowed within a particular experimental context and the efficiency with which tentative solutions can be tested against the data.

The previous chapter deals with the pragmatics of searching the index to a finite library, i.e., the catalog of mass spectra of previously studied molecules, with occasional extensions to related structures. The present chapter deals with chemical structures in more theoretical terms, as part of an effort to embody scientific inference in a computer program.

Instead of listing known structures, this program, DENDRAL,† incorporates rules by which all conceivable structures can be generated and encoded into a fairly legible but computer-compatible notation (1). In the general case, the generator is constrained only by the elementary rules of valence of the various atoms. In practice it also includes many heuristics that limit its speculations to plausibly stable structures, and further to those of particular interest to the line of chemistry in which it is applied.

Besides allowing for the exhaustive enumeration of all possible structures, DENDRAL is also devised to be irredundant: it allows for the presentation of a given structure in a single standardized, or canonical, notation. The program is also prospectively efficient, so that most redundancies are anticipated and prevented, rather than having to be weeded out after having been formulated.

The primary motivation of the Heuristic DENDRAL project is to study and model processes of inductive inference in science, in particular, the formation of hypotheses that best explain given sets of empirical data. The task chosen for detailed study is the structure determination of organic molecules, and this has been advanced furthest with MS data (1-8). However, the principles are readily generalized to other data for which some chemical theory can be formulated.

The motivation and a general outline of the approach are presented first. Next, a sketch is given of how the program works and how good its performance is at this stage. Last, an example, taken from our group's recent work on aliphatic ethers (2), is shown.

B. MOTIVATION

The DENDRAL project aims at emulating in a computer program the inductive behavior of the scientist in an important but sharply limited area of science, organic chemistry. Most of our work is addressed to the following problem: Given the data of the mass spectrum of an unknown compound, induce a workable number of plausible solutions, i.e., a small list of candidate molecular structures. In order to complete the task, the DENDRAL program then deduces the mass spectrum predicted by the theory of mass spectrometry for each of the candidates and selects the most productive hypothesis, i.e., the structure whose predicted spectrum most closely matches the data.

We have designed, engineered, and demonstrated a computer program that manifests many aspects of human problem-solving techniques. It also works faster than human intelligence in solving problems chosen from an appropriately limited domain of types of compounds, as illustrated in the cited publications (1,2).

Some of the essential features of the DENDRAL program include the following:

1. Conceptualizing organic chemistry in terms of topological graph theory, i.e., a general theory of ways of combining atoms.
2. Embodying this approach in an exhaustive hypothesis generator. This is a program that is capable, in principle, of "imagining" every conceivable molecular structure.
3. Organizing the generator so that it avoids duplication and irrelevancy and moves from structure to structure in an orderly and predictable way.

The key concept is that induction becomes a process of efficient selection from the domain of all possible structures. Heuristic search and evaluation is used to implement this "efficient selection." Most of the ingenuity in the program is devoted to heuristic modifications of the generator. Some of these modifications result in early pruning of unproductive or implausible branches of the search tree. Other modifications require that the program consult the data for cues (feature analysis) that can be used by the generator as a plan for a more effective order of priorities during hypothesis generation. The program incorporates a memory of solved subproblems that can be consulted to look up a result rather than compute it over and over again. The program is aimed at facilitating the entry of new ideas by the chemist when discrepancies are perceived between the actual functioning of the program and his expectation of it.

C. IMPLEMENTATION

1. Generator

As just noted (11, 13-15), the DENDRAL program contains a structure generator as its core, abundantly constrained by a set of relevant heuristics. The generator is built upon a consideration of the conventional structure representation as a topological graph, i.e., the connectivity relations of a set of chemical atoms taken as nodes. We recognize more than one type of connection: double, triple, and noncovalent bonds, as well as single bonds. From an electronic standpoint, however, the special bonds could just as well be denoted as special atoms. The structural graph does not specify the bond distances and bond angles of the molecule. In fact, these are known for only a small proportion of the enormous number of organic molecules whose structure is very well known from a topological standpoint.

Most of the syllabus of elementary organic chemistry thus comprises a survey of the topological possibilities for the distinct ways in which sets of atoms may be connected, subject to the rules of chemical valence. The student then also learns rules that prohibit some configurations as unstable or unrealizable. (He may later earn his scientific reputation by justifying or overturning one of these rules.) But the field of organic chemistry has reached its present stature without many benefits from any general analysis of molecular topology. These benefits might arise in applications at two extremes of sophistication: teaching chemical principles to college undergraduates and teaching them to electronic computers. They may also apply to the vexatious problems of nomenclature and systematic methods of information retrieval.

[...] However, each tree has a unique center. In 1869 Jordan showed that any tree has two kinds of center, a mass center and a radius center. Each center has a unique place in any tree; the two may coincide.

To find the radius center, the tree is pruned one level at a time; cut back one link from every terminal at each level. This will leave, finally, an ultimate node or node pair (in effect, an edge) as the center. The radius then reflects the levels of pruning needed to reach the center.

To identify the mass center of a tree, we must consider the two or more branches that join to each nonterminal node. The center is the node whose branches have the most evenly balanced allocation of the remaining mass (node count) of the tree. This is the same as saying that none of the pendant branches exceeds half of the total mass. If the structure is a union of equal halves, the center is the bond or edge that joins them.

Each of the centers (Fig. 7-1) is unique and so could solve our problem of defining a canonical starting point

*This report is a summary of the current status of the Heuristic DENDRAL project conducted jointly by the Departments of Chemistry, Computer Science, and Genetics at Stanford University under the direction of Professors Carl Djerassi, Edward A. Feigenbaum, and Joshua Lederberg. This research was financed by the Advanced Research Projects Agency (Contract SD-183), the National Aeronautics and Space Administration (Grant NGR-05-020-004), and the National Institutes of Health (Grant AM-04257). Most of the programming reported here was done by Dr. Bruce Buchanan, Mrs. Georgia Sutherland, Mr. Allan Delfino, and Dr. Armand Buchs.

†The program is called DENDRAL (for DENDRitic ALgorithm). It is written in the list-processing language LISP. It requires 40,000 or more words of memory, depending on the number of atoms in the composition and the speed with which one wants to see the answers. Many options are available to the chemist at the teletype console; for instance, he can revise the program's theory of chemical instability (BADLIST), he can restrict structure generation to molecules of a specified class (GOODLIST), or he can monitor the structure-generation process through a dialogue with the program. Programming details are available (9).