
ZOBODAT - www.zobodat.at

Zoologisch-Botanische Datenbank/Zoological-Botanical Database

Digitale Literatur/Digital Literature

Zeitschrift/Journal: Sydowia

Jahr/Year: 1990

Band/Volume: 42

Autor(en)/Author(s): Petrini Orlando, Rusca C. V., Szabo I.

Artikel/Article: ASCUS: an error-tolerant mycological classification system. 273-285 ©Verlag Ferdinand Berger & Söhne Ges.m.b.H., Horn, Austria, download unter www.biologiezentrum.at

ASCUS: an error-tolerant mycological classification system*

O. PETRINI¹, C. V. RUSCA² & I. SZABO²

¹ Mikrobiologisches Institut, ETH-Zentrum, 8092 Zürich, Switzerland
² Institut de Microtechnique, DMT, EPFL, 1015 Lausanne, Switzerland

O. PETRINI, C. V. RUSCA & I. SZABO (1990). ASCUS: an error-tolerant mycological classification system. - SYDOWIA 42: 273-285.

ASCUS, an error-tolerant classification system to be used in the identification of fungal taxa, is described. ASCUS is a hybrid system: it combines a connexionist with a rule-based expert system, to be used by experts for the preparation of identification keys and by novices for the identification of fungi. The system is tolerant and is not too sensitive to mistakes by the user. It also has a built-in mechanism to deal with user uncertainty and vague qualifiers.

The recent development of powerful, yet comparatively inexpensive hardware and increasingly user-friendly software now allows most mycologists to organize their collections in databases, to analyze morphological and ecological data with complex statistical packages, and eventually to write monographs and research papers with sophisticated word processors at home on their personal computers.

The introduction of databases that collect and apply the knowledge of experts has led to the development of computer systems that assist in the identification of organisms by scientists (e.g. plant pathologists, ecologists) who are not taxonomists but have to rely on taxonomic techniques (e.g. SHAPIRO & al., 1974; ESTEP & al., 1989). In , the use of computer keys for the identification of fungi is comparatively recent (KORF & ZHUANG, 1985; POLONELLI & al., 1985; MARGOT, 1980; MARGOT & al., 1984). All applications developed so far, however, are computer versions of synoptic or dichotomous keys; they can neither deal with any kind of uncertainty nor tolerate mistakes by the user. Moreover, they have no, or very modest, graphics capabilities and do not allow the preparation of a graphically oriented user interface. The recent introduction of hypermedia (see below) has overcome the problem of adding

* Paper based on a talk given at the Fourth International Mycological Congress, Symposium G-2, Computers and Information Systems, held in Regensburg, FRG, 28th August - 3rd September 1990.


graphics to databases and using them either as identification information or as on-line help to illustrate the scientific jargon used in the key. For instance, HyperCard has been successfully used to produce a key to guide identification of zooplankton in the North Sea (ESTEP & al., 1989) and several prototypes written in HyperCard and SuperCard exist to help mycologists in the identification of selected fungal genera (H. CLÉMENÇON, unpublished; O. PETRINI & L. PETRINI, unpublished). All these applications, however, are computerized synoptic or dichotomous keys: in no case has a mechanism for reasoning with uncertainty been embedded in such applications, although recently an expert system has been developed which makes use of HyperCard's excellent graphical capabilities (EVANS, 1990).

Subjectively defined characters (e.g. shapes and size of or , frequency of occurrence of a given taxon) are very common in mycology. Conjunctive (AND), disjunctive (OR), and mixed (AND/OR) sorting is also currently used in most fungal descriptions. The presence of linguistic descriptors ("rare", "often", "more or less") makes the identification of a a difficult task for many novices. All these features lead to vague or uncertain definitions. Thus, knowledge-based classification systems able to deal with uncertainty become a necessary tool for the preparation of robust (able to withstand intrinsic contradictions) computer-assisted identification systems in mycology (PETRINI & RUSCA, 1989). We report here on a project (ASCUS) currently underway in our laboratories to develop an error-tolerant classification system to be used in the identification of fungal taxa. A startup prototype (ASCUS-0), currently under pre-release (alpha) testing, is described in detail.

Some terminology

Confusion exists among biologists on the meaning of terms such as hypermedia (hypertext) and expert systems, as well as on the terminology used to describe some of their properties. Some definitions are given below; terms which are not exhaustively explained here can be found in SHAPIRO & ECKROTH (1987).

Hypermedia is "... an approach to information management in which data is stored in a network of nodes connected by links. Nodes can contain text, graphics, audio, video, as well as source code [of computer applications] or other forms of data. The nodes ... are meant to be viewed through a structure editor." (SMITH & WEISS, 1988). The hypermedia environment is one which allows information (the "nodes") to be linked and accessed by association (the links), in a


similar way to that in which human beings rapidly access diverse types of information. Dealing with uncertainty is not directly possible with hypermedia tools, but methods can be found to simulate it, or inference engines (see below) prepared with external expert systems can be activated as external commands by the hypermedia. Software packages based on the hypermedia philosophy include HyperCard and SuperCard (both for the Macintosh) and HyperPAD (for the IBM PC and compatibles). The existing hypermedia packages generally work well for building front ends (user interfaces) for other software (e.g. expert systems) or tutorial systems.

An expert system is a computer programme able to do a well-defined kind of reasoning by using a database (the so-called knowledge base) that may incorporate facts and rules (GRAHAM & JONES, 1988). The expert system's knowledge is based not only on formal textbook information but also on judgmental or heuristic knowledge derived from the experience of a specialist. An expert system usually consists of five main components:
- a user interface
- a knowledge base
- an inference engine
- an explanation module
- a knowledge elicitation module

Knowledge can be represented as verbal descriptors and graphical illustrations, as a set of production rules, or in a connexionist way.

The knowledge base (KB) contains an abstract representation of the knowledge used by an expert in a given area to solve a family of problems such as classification, diagnosis, advisory, tutoring, or planning. Knowledge can be stored as a collection of verbal descriptors or graphical illustrations (collectively also called objects) in nodes, connected by arcs (the equivalent of hypermedia links in the expert system terminology), which represent the relationships between objects or their characterizations. Knowledge is also often represented as a set of production rules. The following is an example of a production rule:

Rule: "IF an ascus is present in the specimen THEN the specimen is an ascomycete"
Fact: "the specimen has an ascus"

From this rule and this fact the inference engine (see below) will deduce: "the specimen is an ascomycete".
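The deduction in this example can be sketched in a few lines of Python. This is only an illustration of how a production rule fires on a matching fact; the function name `forward_chain` and the way rules are stored as (conditions, conclusion) pairs are assumptions made for the sketch and are not taken from ASCUS.

```python
from typing import FrozenSet, List, Set, Tuple

# A rule is a pair: (set of conditions, conclusion). Hypothetical encoding.
Rule = Tuple[FrozenSet[str], str]


def forward_chain(rules: List[Rule], facts: Set[str]) -> Set[str]:
    """Apply the rules repeatedly until no new fact can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            # A rule fires when all of its conditions are known facts.
            if conditions <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived


# The single rule from the text, written as data.
RULES: List[Rule] = [
    (frozenset({"ascus present"}), "specimen is an ascomycete"),
]

result = forward_chain(RULES, {"ascus present"})
print(sorted(result))
```

Keeping the rules as plain data rather than as program code means that an expert could, in principle, add or change rules without touching the engine.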


Fig. 1. - a. The formal neuron. - b. The perceptron (two-layered neural network). - c. A three-layered network. - d. A fully connected network ('Hopfield net'). - For further explanations see text. (Labels in the figure: sums, thresholds, inputs (characters), outputs (species), connection weights, input layer, output layer.)

Knowledge need not be represented exclusively as a collection of character symbols. One of the most common symbolic representations is the use of numbers. Contingency tables, for example, and many other tools used in statistics (e.g. correlation coefficients, regression lines, tables of covariance) can be seen as numerical representations of a particular kind of knowledge.

Knowledge can also be represented in a connexionist way, a paradigm increasingly used in many scientific fields (e.g. biology, psychology, physics; LIPPMANN, 1987). The basic element of this representation is the formal neuron (Fig. 1a), a processing element which roughly attempts to mimic the structural and functional unit of the nervous system in animals. A formal neural network results from the interconnexion of many formal neurons. In a two-layered network (Fig. 1b) each input (in this case a taxon's character) is potentially connected to each output (a taxon). In connexionist jargon this is called the perceptron topology. Fig. 1c is the graphical representation of a three-layered neural network. The formal neurons of the intermediate layer, invisible to the user and called hidden units, transform the input in such a way that problems not addressable by a two-layer topology may be handled. In the fully connected network (Fig. 1d) every formal neuron is connected to all others. There is no distinction between input and output units, and the user decides case by case whether to use a unit as input, as output, or as both. This network type is generally used as a classifier. Its main disadvantage is its limited storage capacity.

For each network topology, algorithms exist that calculate the connexion strengths, which are real numbers that express the weight of a connexion. A formal neuron combines all its inputs with the corresponding connexion weights. If the result of the combination is above a fixed threshold, a "+1" output value is generated. If, on the other hand, the value is below the threshold, the output is a value between "-1" and "+1". The output generated is then used as an input to the formal neurons connected to it. In Fig. 2, for instance, the observed characters will be evaluated by the neural network, which considers the contextual relationships and produces a numerical output value for each species.

Without going into further details, the principal interest in using a connexionist approach to solve problems resides in the intrinsic robustness of neural networks. A neural classifier can suggest a correct answer to a problem even if some input characters are unknown or even wrong, provided, of course, that the majority of the remaining input characters are correct. A dichotomous key can loosely be equated with a rule-based expert system, whereas a synoptic key can be compared to the neural network approach, where minor mistakes are inherently well tolerated.
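The behaviour of a formal neuron and the tolerance of a two-layered classifier can be illustrated with a small Python sketch. Everything specific in it is an assumption made for illustration: the taxa (`taxon_A`, `taxon_B`), the characters (`char_1` to `char_4`), the weights, the threshold, and the choice of a linear, clipped output below the threshold. The text only states that the sub-threshold output lies between -1 and +1.

```python
from typing import Dict, Optional


def formal_neuron(weighted_sum: float, threshold: float) -> float:
    """Return +1 at or above the threshold, otherwise a graded value in [-1, +1)."""
    if threshold <= 0:
        raise ValueError("this sketch assumes a positive threshold")
    if weighted_sum >= threshold:
        return 1.0
    # Assumption: below the threshold the output grows linearly, clipped at -1.
    return max(-1.0, weighted_sum / threshold)


# Hypothetical connection weights: +1 if the taxon shows the character, -1 if not.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "taxon_A": {"char_1": 1.0, "char_2": 1.0, "char_3": -1.0, "char_4": 1.0},
    "taxon_B": {"char_1": -1.0, "char_2": 1.0, "char_3": 1.0, "char_4": -1.0},
}
THRESHOLD = 4.0


def classify(observed: Dict[str, Optional[bool]]) -> Dict[str, float]:
    """Score every taxon; True = present, False = absent, None = not known."""
    scores = {}
    for taxon, weights in WEIGHTS.items():
        total = 0.0
        for character, weight in weights.items():
            answer = observed.get(character)
            if answer is None:
                continue  # unknown answers contribute nothing to the sum
            total += weight * (1.0 if answer else -1.0)
        scores[taxon] = formal_neuron(total, THRESHOLD)
    return scores


# All four answers correct for taxon_A.
complete = {"char_1": True, "char_2": True, "char_3": False, "char_4": True}
# One unknown answer (char_2) and one wrong answer (char_4).
noisy = {"char_1": True, "char_2": None, "char_3": False, "char_4": False}

print(classify(complete))
print(classify(noisy))
```

With the noisy answers the score for `taxon_A` drops, but it still ranks above `taxon_B`, which is the kind of tolerance the connexionist approach is meant to provide.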

Fig. 2. - Example of a two-layered network in mycology. (Labels in the figure: input characters "stromata immersed", "stromata only partly immersed" and "stromata superficial"; output genera Anthostomella, Biscogniauxia, Hypocopra, Hypoxylon, Lopadostoma and Poronia.)

The inference engine scans the KB ("search space") and provides the user with an answer to the problem. In Artificial Intelligence (AI) jargon, the scanning of the search space is called "reasoning". Rule-based systems that reason as described previously (p. xx) are said to work in "forward-chaining" mode: conclusions are derived from rules whose conditions are satisfied directly by the input data. "Backward-chaining", on the contrary, is used by inference engines that start from a hypothetical conclusion and try to confirm it by verifying the truth of the conditions contained in the KB. A "mixed-chaining" mode is also often used, where part of the rules work in forward-chaining, part in backward-chaining and part in both modes.

An explanation module is necessary to enable the user to ask the system HOW it came to a conclusion and WHY a question is being asked. Very often this module is itself an expert system.

The knowledge elicitation module is used to generate suggestions for the construction of the KB by the expert.

Finally, a user interface is used to open files and create reports. Considerable effort has usually gone into the implementation of this component.

Two basic requirements must be met by a classification system to be used in mycology (PETRINI & RUSCA, 1989):
- it must be tolerant, i.e. a limited number of wrong or incomplete answers by the user should still allow a correct identification;
- it should support the treatment of linguistic ("fuzzy") qualifiers. Linguistic qualifiers are particularly frequent in mycology. A typical example is contained in the expression "appendages mostly present". In this case, "mostly" is a linguistic or vague qualifier: how much is "mostly"? It can be anything between 80% and 100% of all cases, depending on the subjective judgement of the expert. The expert system must deal with this problem by the use of appropriate algorithms and knowledge representation techniques.
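One simple way to make a vague qualifier such as "mostly" usable by a program is to store it as an interval of frequencies and to grade how well an observed frequency fits it. The Python sketch below does this under stated assumptions: only the 80-100% range for "mostly" comes from the text, while the other qualifier ranges, the `slack` parameter and the linear fall-off outside the interval are hypothetical choices and not the technique used in ASCUS.

```python
from typing import Dict, Tuple

# (lower, upper) bounds on the fraction of cases in which a character occurs.
# Only "mostly" is taken from the text; the other entries are hypothetical.
QUALIFIERS: Dict[str, Tuple[float, float]] = {
    "mostly": (0.80, 1.00),
    "often": (0.50, 0.90),   # hypothetical expert definition
    "rarely": (0.00, 0.20),  # hypothetical expert definition
}


def membership(qualifier: str, observed: float, slack: float = 0.10) -> float:
    """Degree in [0, 1] to which an observed fraction fits a qualifier.

    Inside the expert's interval the fit is 1. Outside it, the fit falls
    linearly to 0 over a margin of width `slack` (an assumed parameter).
    """
    low, high = QUALIFIERS[qualifier]
    if low <= observed <= high:
        return 1.0
    distance = low - observed if observed < low else observed - high
    return max(0.0, 1.0 - distance / slack)


for fraction in (0.90, 0.75, 0.50):
    print(fraction, membership("mostly", fraction))
```

A graded fit of this kind avoids rejecting a taxon outright when an observation falls just outside the expert's range, which is in keeping with the tolerance requirement stated in the first list item.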

ASCUS: an overview

ASCUS is a hybrid system: it combines a connexionist with a rule-based expert system, to be used by experts for the preparation of identification keys and by novices for the identification of fungi. The system is tolerant and is not too sensitive to mistakes by the user; it also has a built-in mechanism to deal with user uncertainty and vague qualifiers. The inference engine is basically a neural network classifier guided by a rule-based control level, called the meta-level. The system simulates common sense reasoning by using processed information extracted from the neural network. To help in the knowledge elicitation process, a learning module allows both for analysis of a collection of observations and for suggestions for the construction of dichotomous and synoptic keys. A simple explanation module, which also contains on-line help, is under development.

The first prototype (ASCUS-0) was developed on a Macintosh IIx and is written in Common Lisp. This prototype runs on all Macintoshes, including the Mac Plus, provided that 4 MB of RAM are available. A more compact version, running on machines with less available RAM, can be produced. The system is MultiFinder compatible and supports all basic Macintosh operations included in the basic FILE and EDIT menus. Portability to DOS machines is not excluded.

A novice and an expert mode are available. In the novice mode, the user is confronted with the familiar Mac interface with multiple windows, pull-down menus and dialogue boxes. In the expert mode, the Mac user interface is more powerful and provides a family of specialized editors to update or to define the KB, which includes the taxon's characters and their type (numeric, boolean, qualitative, mutually exclusive, picture, etc.), their linguistic qualifiers, the taxa, the rules, and the classification attributes. KBs are constructed independently, but different hierarchical levels can be linked: for example, KB A, which contains the data related to the genera of a given fungal family, can be hierarchically linked to a number of KBs A1, A2, ..., AN, each containing the data pertaining to the species of a given genus in A (taxonomic tree structure). All files are editable with the conventional Macintosh text or picture editors (e.g. MacWrite, MacPaint).

The "Print" command allows transfer of part or all of the content of a KB onto paper or to a user-specified ASCII file. An option is provided to print the KB files in a format suitable for publications and monographs. The KB can also be exported to, or imported from, current spreadsheets or database packages. A help system, composed of a system help and a database-related help (to be prepared by the expert), will be available in the first release (ASCUS-1).

Design of ASCUS

The ASCUS architecture is typical of a conventional expert system and is composed of the following modules (Fig. 3):
- the user (novice and expert) interface
- the knowledge base hierarchy
- the inference engine
- the explanation module
- the knowledge elicitation module
- the import/export module

The complete knowledge base is hierarchical and isomorphic to the structure of the corresponding taxonomy. It is a tree of simple KBs, each constructed to classify a set of taxa (e.g. all genera of the family Xylariaceae will be contained in KB1, all species of Rosellinia in KB1a, KB1 and KB1a being hierarchically related). Each simple KB consists of the group description, the synoptic key, the explanation documents specification, the meta-rules, and the user answer slots. While the first four components are defined by the expert, the last one will be activated by the novice through the novice user interface.

The group description contains the group vocabulary, its position in the KB hierarchy and the specifications related to the linguistic qualifiers (expert's definitions of "often", "rarely", "red to brown", etc.). The group vocabulary specifies the group name (e.g. "Xylariaceae"), the characters and their type, the character groups (a set of related characters, e.g. all ascospore attributes) and the taxa.

The synoptic key is stored symbolically (pictures, characters, etc.) in the KB and is translated in the inference engine into a numerical representation, which is a neural network connecting characters and taxa (Fig. 1).
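The tree of simple KBs and the translation of a symbolic synoptic key into numbers can be pictured with a short Python sketch. It is a guess at one possible data layout, not the ASCUS implementation (which is written in Common Lisp): the class `KnowledgeBase`, its field names, the +1/-1/0 encoding and the placeholder groups and characters (`family_K`, `genus_X`, `char_1`, ...) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class KnowledgeBase:
    """One simple KB: a group name, its characters and its synoptic key."""
    group_name: str
    characters: List[str]
    # Synoptic key: taxon -> character -> expected state (True or False).
    synoptic_key: Dict[str, Dict[str, bool]]
    # Child KBs, keyed by the taxon they refine (e.g. a genus -> its species).
    children: Dict[str, "KnowledgeBase"] = field(default_factory=dict)

    def weight_matrix(self) -> Dict[str, List[float]]:
        """Translate the symbolic key into numbers.

        Assumed encoding: +1 = character expected, -1 = character not
        expected, 0 = the key says nothing about this character.
        """
        matrix: Dict[str, List[float]] = {}
        for taxon, states in self.synoptic_key.items():
            row = []
            for character in self.characters:
                state = states.get(character)
                row.append(0.0 if state is None else (1.0 if state else -1.0))
            matrix[taxon] = row
        return matrix

    def child_for(self, taxon: str) -> Optional["KnowledgeBase"]:
        """Return the KB one level down for this taxon, if one is linked."""
        return self.children.get(taxon)


# Placeholder data only; none of these values are mycological facts.
species_kb = KnowledgeBase(
    group_name="genus_X",
    characters=["char_4"],
    synoptic_key={"species_X1": {"char_4": True}},
)
family_kb = KnowledgeBase(
    group_name="family_K",
    characters=["char_1", "char_2", "char_3"],
    synoptic_key={
        "genus_X": {"char_1": True, "char_2": False},
        "genus_Y": {"char_1": False, "char_3": True},
    },
    children={"genus_X": species_kb},
)

print(family_kb.weight_matrix())
```

Once a genus has been proposed at the family level, `child_for` gives access to the linked species-level KB, which mirrors the hierarchical relation between KB1 and KB1a described above.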

Fig. 3. - Diagrammatic representation of the ASCUS architecture. (Labels in the figure: rules, properties, knowledge base, user interface, tolerant inference.)

The explanation documents are a list of textual and pictorial documents used by the explanation module.

The meta-level may encode
- the co-relevance of some characters when their individual relevance cannot be specified by a simple numerical coefficient. This is typically the case when the relevance of a character depends on the presence or absence of a group of other characters;
- criteria to suggest the most useful next question;
- criteria to score possible alternative taxa;
- criteria to terminate the identification session.

Most meta-rules are not accessible to the expert. The meta-rules related to the character co-relevance are editable by the expert through the meta-rule editor of the expert interface and may be considered an addition to the synoptic key. Explanation-related rules are encoded in the explanation module.

The user answer slots are filled by the answers provided by the user. Partial or even complete uncertainty ("I-do-not-know" answers) is accepted (Tab. 1).

The inference engine consists of a tree of neural networks, which is the numerical representation of the synoptic key, and a conventional rule-based inference engine, which interprets the meta-rules of the knowledge base. Its task is to control the neural networks, to suggest the next most relevant question and to perform the reasoning required by the explanation module.

The explanation module is composed of a system and an area on-line help (both used as hypermedia), and of the rules encoding the explanation-related knowledge necessary to answer the WHYs and HOWs of the user.

The knowledge elicitation module is used to generate suggestions for the construction of dichotomous or synoptic keys which would include the majority of the collections entered in the KB by the expert. Its main component is an inductive algorithm. Original to our approach is the fact that the algorithm "knows its limits": the key generated may propose for some character values a taxon "UNKNOWN" or a group of taxa instead of a single taxon, thus pointing out possible flaws and fallacies in the expert's observations. The knowledge elicitation module therefore acts as a consultant to the expert and is not necessarily intended to provide the best mycological classification key.

The import/export module provides a facility to print the KB on paper or to save it in an ASCII file. Special options to provide formatted printouts for publications are provided. In addition, the import/export command allows easy transfer to and from external spreadsheets or databases.
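The idea of a generated key that "knows its limits" can be shown with a deliberately naive Python sketch. The inductive algorithm of ASCUS is not specified in the text, so the lookup-table approach below is a stand-in chosen only to make the two special outcomes visible: an "UNKNOWN" answer for character combinations never observed, and a group of taxa where the observations do not separate them. The function names and the observations are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# One observation: (character -> observed value, taxon assigned by the expert).
Observation = Tuple[Dict[str, str], str]
Key = Dict[Tuple[str, ...], Set[str]]


def induce_key(observations: List[Observation], characters: List[str]) -> Key:
    """Group the expert's observations by their combination of character values."""
    key: Dict[Tuple[str, ...], Set[str]] = defaultdict(set)
    for values, taxon in observations:
        signature = tuple(values.get(c, "?") for c in characters)
        key[signature].add(taxon)
    return dict(key)


def look_up(key: Key, characters: List[str], values: Dict[str, str]) -> List[str]:
    """Return a single taxon, a group of taxa, or ["UNKNOWN"]."""
    signature = tuple(values.get(c, "?") for c in characters)
    taxa = key.get(signature)
    if not taxa:
        return ["UNKNOWN"]  # combination never seen in the observations
    return sorted(taxa)     # more than one entry means the key cannot decide


CHARACTERS = ["char_1", "char_2"]
# Hypothetical observations, not real collection data.
OBSERVATIONS: List[Observation] = [
    ({"char_1": "yes", "char_2": "no"}, "taxon_A"),
    ({"char_1": "yes", "char_2": "no"}, "taxon_B"),
    ({"char_1": "no", "char_2": "no"}, "taxon_C"),
]

key = induce_key(OBSERVATIONS, CHARACTERS)
print(look_up(key, CHARACTERS, {"char_1": "yes", "char_2": "no"}))
print(look_up(key, CHARACTERS, {"char_1": "no", "char_2": "yes"}))
```

A group such as `taxon_A` and `taxon_B` sharing one entry tells the expert that the recorded characters do not distinguish these taxa, which is the consultant role described above.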

Tab. 1. Results of a classification session with ASCUS-0. The hierarchical databases contained the information related to ten genera of Xylariaceae and to seven species of Rosellinia. The user has attempted to identify a specimen of Rosellinia helvetica to the genus only, in the first case (a) by giving correct answers to all questions asked by the system, in the second (b) by giving one wrong and four "I-do-not-know" answers. The results of the identification session are expressed by two coefficients: the certainty expresses the confidence that the system is correct in proposing a given taxon as the outcome of the identification procedure, whilst the possibility indicates the probability that a given taxon is the correct one, given the user's answers. Ideally, when all answers have been given correctly and completely by the user, both coefficients are 100%. Certainty and possibility thresholds have been given default values by the system, but they can also be manually selected by the user.

a) all answers given by the user are correct

Taxon              Certainty (%)   Possibility (%)
Rosellinia              100             100
R. helvetica             80             100
R. britannica            69             100
R. mammaeformis          69             100
R. morthieri             69             100
Camarops                 35              81

b) one wrong answer and four unanswered questions by the user

Taxon              Certainty (%)   Possibility (%)
Rosellinia               68              98
R. helvetica             55              98
R. britannica            47              98
R. mammaeformis          47              98
R. morthieri             47              98
Camarops                 23              85
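The role of the user-selectable thresholds mentioned in the caption of Tab. 1 can be illustrated with a few lines of Python. The values below are copied from part (b) of the table; the threshold values (50 and 90) and the function `shortlist` are hypothetical, since the default thresholds and the way ASCUS applies them are not stated.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    taxon: str
    certainty: int    # in %
    possibility: int  # in %


# Values from Tab. 1b (one wrong answer, four unanswered questions).
SESSION_B: List[Result] = [
    Result("Rosellinia", 68, 98),
    Result("R. helvetica", 55, 98),
    Result("R. britannica", 47, 98),
    Result("R. mammaeformis", 47, 98),
    Result("R. morthieri", 47, 98),
    Result("Camarops", 23, 85),
]


def shortlist(results: List[Result], min_certainty: int, min_possibility: int) -> List[Result]:
    """Keep the taxa that reach both thresholds, best certainty first."""
    kept = [
        r for r in results
        if r.certainty >= min_certainty and r.possibility >= min_possibility
    ]
    return sorted(kept, key=lambda r: r.certainty, reverse=True)


# Hypothetical thresholds chosen for the example.
for result in shortlist(SESSION_B, min_certainty=50, min_possibility=90):
    print(result.taxon, result.certainty, result.possibility)
```

With these assumed thresholds the shortlist still contains Rosellinia and R. helvetica despite the wrong and missing answers, while Camarops is excluded.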

The choice of Macintosh and Common Lisp for ASCUS

We have opted for the Macintosh family of personal computers for two main reasons. Macintosh is provided with a well-defined user interface, complete and consistent throughout the whole Mac family and widely accepted by a large number of users. The primitives necessary to


build up this interface are accessible to almost all programming languages. Moreover, we decided to use Common Lisp as the programming language. A choice of Lisp packages which do not require any special co-processor board is available for the Mac. At least one of them (Procyon Common Lisp, ExperTelligence, Goleta, CA, USA) already provides the CLOS (Common Lisp Object System) standard.

After careful evaluation of several expert system shells available for the Mac, including Nexpert Object (Neuron Data, Palo Alto, CA, USA), which is considered a good expert system shell, we decided to refrain from using any shell for ASCUS. The classification system we envisage would require the ASCUS architecture (user interface and inference engine) to be programmed as an external package in C to run with Nexpert Object. The use of Nexpert Object would also have meant the purchase of additional software, involving the ASCUS user in considerable extra expense.

The most reasonable choice therefore seems to be an object-oriented, all-purpose AI language, since the ASCUS design foresaw a hybrid (rule-based and numerical) inference engine, coupled with a hierarchical KB. The advantage of using an AI language for development is the flexibility for prototyping when coding in an AI environment. There are, however, two drawbacks: a Lisp programme is usually slower and larger than the corresponding application written in a non-AI language such as C. Procyon Common Lisp has been used for the development of ASCUS-0. We are currently evaluating several other languages for use in the final release.

Acknowledgment

We are deeply indebted to Drs I. H. CHAPELA, Basle, Switzerland, S. M. FRANCIS, Almen, The Netherlands, and W. GAMS, Baarn, The Netherlands, for many helpful suggestions and discussions that have improved the original manuscript.

References

ESTEP, K. W., A. HASLE, L. OMLI & F. MACINTYRE (1989). Linnaeus: Interactive taxonomy using the Macintosh Computer and HyperCard. - BioScience 39(9): 635-638.
EVANS, R. (1990). Expert systems and HyperCard. - Byte, January 1990: 317-324.
GRAHAM, I. & P. L. JONES (1988). Expert systems. - Chapman & Hall Computing, London.
KORF, R. P. & W. Y. ZHUANG (1985). A synoptic key to the species of Lambertella (Sclerotiniaceae) with comments on a version prepared for TAXADAT, Anderegg's computer program. - Mycotaxon 24: 361-386.
LIPPMANN, R. P. (1987). An introduction to computing with neural nets. - IEEE ASSP Magazine 4(2): 4-22.
MARGOT, P. (1980). On-line identification program of poisonous and hallucinogenic . - In: OLIVER, J. (ed.). Forensic Toxicology. - Croom Helm, London: 221-234.
MARGOT, P., G. FARQUHAR & R. WATLING (1984). Identification of toxic mushrooms and toadstools (Agarics) - an on-line identification program. - In: ALLKIN, R. & F. A. BISBY (eds.). Databases in Systematics. - Systematics Association Special Vol. No. 26. Academic Press, London & Orlando: 249-261.
PETRINI, O. & C. V. RUSCA (1989). Knowledge-based Expert Systems in mycology: tolerance is a necessary attribute. - The Mycologist 3(3): 128-130.
POLONELLI, L., M. CASTAGNOLA, D. V. ROSSETTI & G. MORACE (1985). Use of killer toxins for computer-aided differentiation of Candida albicans strains. - Mycopathologia 91: 175-179.
SHAPIRO, B., P. LEMKIN & L. LIPKIN (1974). The application of artificial intelligence techniques to biologic identification. - J. Histochem. Cytochem. 22(7): 741-750.
SHAPIRO, S. C. & D. ECKROTH (1987). Encyclopedia of Artificial Intelligence. - John Wiley & Sons, New York.
SMITH, J. B. & S. F. WEISS (1988). Hypertext. - Communications of the ACM 31(7): 816-819.
