A Computer Program for Semantic Information Retrieval
Total Page:16
File Type:pdf, Size:1020Kb
, -1 -, , -, i I T f -1 1:11 1 !, I I - i I 1. w:, . .- I 4 I L S IR: A COMPUTER PROGRAM FOR SEMANTIC INFORMATION RETRIEVAL by BERTRAM RAPHAEL B.S., Rensselaer Polytechnic Institute (1957) M.S., Brown University (1959) SUBMITTED N PARTIAL FULFILLMENI' OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHTLOSOPHY at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY June, 1964 Signature of Author ............................................. Department of Mathematics, April 1964 Certified by * . 0 - * . 0 ID 0 0 0 a * * * 0 0 0 . a a 0 0 V 1 0 .9 a Thesis Supervisor . a . 0 . 1 . v P , a '. P 0 0 . V F * * . 0 a * a Chairman, Departmental Committee on Graduate Students 2 Acknowledgement The work reported herein was supported in part by the MIT Compu- tation Center) and in part by Project MAC, an MIT research program sponsored by the Advanced Research Projects Agency, Department of Defense, under Office of Naval Research Contract Number Nonr-4102(01). Associated preliminary research was supported 'in part by the RAND Corporation, Santa Monica, California, and in part by Bolt, Beranek, and Newman, Inc.2 Cambridge, Massachusetts. The author expresses his gratitude to Prof. Marvin Minsky, for his supervision of this thesis; to Dr. Victor Yngve and Prof. Hartley Rogers, Jr., for their criticisms and suggestions; and to hs ife Anne for her unfailing confidence and encouragement, without which this thesis could not have been completed. III S IR - A COMPUTER PROGRAM FOR SEMANTIC INFORMATION RETRIEVAL by BERTRAM RAPHAEL Submitted to the Department of Mathematics on April 8, 1964, in-partial fulfillment of the requirements-for the degree of Doctor of Philosophy. ABSTRACT SIR is a computer system, programmed in the LISP language, which. accepts information and answers questions expressed in a restricted form of English. Thi's system demonstrates what can reasonably becalled an ability to "understand" semantic information. SIR's semantic and deductive ability is based on the construction of -an internal model,. which uses word associations and property lists, for the relational information normally conveyed in conversational statements.. A format-matching' procedure extracts semantic content from English sentences. If an input sentence 'is declarative) the system adds appropriate information to the model. -If an input sentence is a question, the system searches the model until t either fnds the answer or determines why t cannot find the answer. In all cases SIR reports its conclusions. The system has some capacity to recognize exceptions to general rules, resolve certain semantic ambiguities, and modify its model structure in order to save computer memory space. Judging from its conversational ability, SIR is more "intelligent" than any other existing question-answering system. The author describes how this ability was developed and how the basic features of SIR com- pare with those of other systems. The working system, SIR, is a first step toward intelligent man- machine communication, The author proposes a next step by describing how to construct a more general system which is less complex and yet more powerful than SIR. This proposed system contains a generalized version of the SIR model, a formal logical system called SIRI, and a computer program for testing the truth of SIR1 statements with respect to the generalized model by using partial proof procedures in the predicate calculus. The thesis also describes the formal properties of SIR1 and how they relate to the logical structure of SIR. Thesis Supervisor: Marvin L Minsky Title: Professor of Electrical Engineering., ;,-----,.- --I.- ---- III Li .1 , III 5 TABLE OF CONTENTS Chapter Page I* INTRODUCTION A# The Problem B. Where the Problem Arises .1-0 II. SEMANTIC INFORMATION RETRIEVAL SYSTEMS 13 A* Semantics '3 B* Models 21 C. Some Existing Question-Answering Systems 24 III. REPRESENTATIONS FOR SEMANTIC INFORMATION *40*40*90000 j4 At Symbol-Manipulating Computer Languages 090,.0. J4 B. Word-Association Models C. Semantics and Logic 42 D. The SIR Model ,oos .......... 44 IV* SIR TREATMENT OF RESTRICTED NATURAL LANGUAGE .00040** 52 r) A. Background 54. B. Input Sentence Recognition 54 C. Output: Formation and Importance of Responses 60 Vo BEHAVIOR AND OPERATION OF SIR 6/ A. Relations and Functions 64 B. Special Features VI, FORMALIZATION AND GENERALIZATION OF SIR 92 A. Properties and Problems of SIR 92 B. Formalism for a General System 1-01J C. Implementation of the General Question- Answering System I'-,- 3 VII* CONCLUSIONS 127 A* Results J_2 B. Extensions of SIR 31 C* Concerning Programming 136 D. Subjects for Future Experiments IIt,0 - I BIBLIOGRAPHY 4 , III Page APPENDIX I: Notation 1-46 II: Derivations of SIR Deduction Procedures 148 III: Full-Response Output for Figure 15 7 FIGURE 1: Sample Conversation 8 2: A Word-Association Model 40 3: Sentence Recognition Rules 59 4: Sample Conversation 'in Full-Response Mode 62 5: Selected Conversations ............ 0008060 6,"-) 6: Special Features 86 TABLE a: Relational Notation 97 b: Deduction Procedures in SIR Subprograms cl: Basic Relations of SIRI 1.0 c 2 : 8IR Predicates Expressed in SIRI load d: SIRI Axioms BIOGRAPHICAL NOTE 7 Chapter I: Introduction A. The Problem. This report discusses the problem of developing a computer* which "understands." The method of study involved the construction of a computer system having certain cognitive abilities and exhibiting some human-like conversational behavior. This computer system 'is called "SIR" -- Semantic Information Retriever. The conversation presented in Fig. I between a person (identified by and SIR illustrates some of the system's capabilities. Although it is capable of understanding" statements dealing with set relations, part-whole relations, ownership, and certain spatial relations, SIR has no initial built-in associations for nouns like "John," "boy," and "table." "Understanding" is dfficult to define precisely. The present study makes the following specific assumptions and observations: O Understanding" can be demonstrated by dialogue, which requires the participants to 'indicate an awareness of the meaning of their subject matter. Therefore a computer should be considered able to "understand" if it can converse intelligently, i.e., if it can remem- ber what it is told, answer questions and make responses which a human observer considers reasonable. The term "computer" refers to the combination of digital computer hardware and operating computer program. ------ --I- III EVERY BOY IS A PERSON) (I UNDERSTAND) A FINGER IS PART OF A HANDI (I UNDERSTAND) EACH PERSON HAS TWO HANDS) (THE AVE SENTENCE IS AMBIGUUUS 0 PLEASE RE-PHRASE IT) THERE ARE TWO HANDS ON EACH PERSON) (I UNDERSTAND) HOW MANY FINGERS DOES JOHN HAVE 1 (THE ABOVE SENTENCE IS AMBIGUOUS ** BUT I ASSUME (HAS) MEANS (HAS AS ARTS)) (I DON*T KNUW WHETHER FINGER IS PART OF JOHN) JOHN IS A BOY) (I UNDERSTAND) HOW.MANY FINGERS DOES JOHN HAVE Q) ([HE ABOVE SENTENCE IS AMBIGUOUS U I ASSUME (HAS) MEANS (HAS AS PARTS)) ((HOW MANY FINGER PER HAND ) EVERY HAND HAS FINGERS) (THE ABOVE SENTENCE IS AMBIGUOUS ** BUT I ASSUME (WAS) MEANS (HAS AS PARTS)) (I UNDERSTAND) NOW MANY FINGERS DOES JOHN HAVE Ql (THE ABOVE SENTENCE IS AMBIGUOUS ** BUT I ASSUME (HAS) MEANS (HAS AS PARTS)) (THE ANSWER IS 10) HOW MANY AUTOMOBILES DOES JOHN HAVE Q) (THE ABOVE SENTENCE IS AMBIGUOUS PLEASE RE-PHRASE ITI WHO IS PRESIDENT OF THE UNITED STATES Q) (STATEMENT FORM NOT RECOGNIZED) to**. THE BOY IS JUST TO THE LEFT OF THE TABLE) (GO2840 IS A TABLE) (I UNDERSTAND) THE LAMP IS JUST TO THE LEFT OF THE TABLE) (GO2841 IS A LAMP) (THE ABOVE STATEMENT IS IMPOSSIBLE) THE TABLE IS TO THE RIGHT F THE CHAIR) (GO2842 IS A CHAIR) (I UNDERSTAND) WHAT IS THE RELATIVE POSITION OF A PRSON Q) (THE LEFT-TU-RIGHT ORDER IS AS FOLLOWS) (CHAIR (BOY TABLE)) FIGURE 1: SAMPLE CONVERSATION -III 9 Note: I am concerned here with the computer's 'internal information representation and retrieval techniques. For this purpose I assume that abstract "words" are the basic signal unit. There is no need to be concerned with speech recognitions sensory receptors, or other problems involving the physical nature of the communication channel and signals. ii) In addition to echoing, upon request, the facts it has been given, a machine which "understands" must be able to recognize the logical implications of those facts. It also must be able to identify (from a large data store) facts which are relevant to a particular question. iii) The most important prerequisite for the ability to "understand" is a suitable internal representation, or model, for stored information. This model should be structured so that information relevant for question-answering is easily accessible. Direct storage of English text 'is not suitable since the structure of an English statement gener- ally is not a good representation of the meaning of the statement, On the other hand, models which are direct representations of certain kinds of relational information usually are unsuited for use with other relations. A general-purpose "understanding" machine should utilize a model which can represent semantic content for a wde variety of subject areas. SIR is a prototype of an "understanding" machine. It demonstrates how these conversational and deductive abilities can be obtained through use of a suitable model. Later chapters will describe the model and the SIR program, how they were developed, how they are used, and how they can be extended for future applications. III 10 B. Where the Problem Arises. The need for computers which understand" arises in several areas of computer research. Some examples follow. 1) Information retrieval: The high speeds and huge memory capacities of present computers could be of great aid in scanning scientific literature. Unfortunately, high-speed search is useless unless the searcher is capable of recognizing what is being searched for; and existing computer systems for information retrieval use too crude'techniques for specifying and identifying the objects of the search* Information retrieval systems generally provide either document retrieval or fact retrieval.