A PRELIMINARY SKETCH TO DEFINE RESEARCH OPPORTUNITIES RELEVANT TO ARPA'S PLANS FOR KNOWLEDGE PROCESSING IN . Gio Wiederhold, Earl Sacerdoti, Daniel Sagalowicz, Ramez ElMasri, and Gordon Novak. Stanford, September 27, 1977.

TITLE Use of Context Knowledge to Query Large Databases

OBJECTIVE We see a need to improve the utility of large databases. In order to make progress in that direction it seems to be potentially profitable to develop an integrated set of methods which will combine the high level query capability developed in applications with the access mechanisms to large quantities of data which are embodied in management systems. Typical users of the systems envisaged here include planners and analysts who want to obtain information from large, multifaceted databases. The proposed methods will include as major components:

* Heuristic procedures to determine the focus of user interest during sessions of active database interaction.
* Extraction of knowledge relevant to current database processing from the database itself.
* Combination of stored knowledge about the database with user- or session-specific as well as with transient knowledge.
* Processing of natural language queries with the aid of the combined knowledge.
* Establishment of rapid access to data-records selected by the query processing using access structures established according to the combined knowledge.

PROBLEM TO BE ADDRESSED While an ever larger number of databases are now in operation [Steel74], it has not been easy to extract the knowledge contained in them. The physical management of these large databases has proven sufficiently difficult to force the concentration of efforts into issues of reliability, economy, and integrity [Everest74]. These efforts have paid off to the extent that large databases are now in extensive use for routine data-processing tasks. The design alternatives which are available to database implementors have received widespread debate [Sibley76]. The alternatives presented, however, stress opposing approaches: structural simplicity versus use of structure to encode relationships. Proposals to improve database management include greater formalization [Codd70], and hence a reduction of knowledge implicit in the database structure itself. The conceptually simple structures advocated can be manipulated using the operations defined by the relational algebra, and consistency and integrity problems can be avoided. Since relationships among entities are not encoded within the database, redundancy is removed. This normal, tabular view is just the opposite of the view which sees intelligent processing of queries driven by a base of data encoded as a semantic net. Many operational commercial systems also stress an encoded network structure to achieve rapid access to records known to be related in some sense (Honeywell IDS, DEC DBMS, IBM IMS, etc.). It is interesting to note that proposals from commercially based developers [Bachman72] have stressed a semantically bound structure and viewed the user as a navigator within it. This issue of structural simplicity versus structural encoding of relationships among data can be viewed as a binding problem [Wiederhold77b]. Lack of binding promotes integrity, whereas binding establishes relationships of interest to a database user.
Relationships may be predefined or may be conceived during the process of extracting information out of the database. A database that is not bound at all defers all binding required to answer queries to the time that the database is to be used, and also leads to a situation where the binding operations are performed again and again. This mode of operation increases retrieval cost at the critical query processing time. On the other hand, any predefined access paths are apt to be constraining to a database user who has an imaginative and interactive approach to the extraction of knowledge out of the database. Furthermore, a database with predefined access paths for the most common queries will be least useful precisely when information is most needed: in a crisis situation. Crisis situations are, by definition, uncommon, and the common queries will be insufficient.

Symbolic modelling, a well-known technique in Artificial Intelligence, holds promise for enabling databases to be both efficient and easy to use. Data access can be made efficient by utilizing links among records that reflect relationships implicit among those records. Data access can be made easy by explicating those implicit relationships in a symbolic model of the database structure.

At least two current efforts are exploring the utility of symbolic modelling for databases. The structure of the database is described using a semantic net technology in Torus [Mylopoulos75, Roussopoulos77], and by a simple property-list structure in IDA [Sagalowicz77]. While in Torus the description and the database are intertwined (the semantic nodes can carry names or values), the approach used by IDA is that the database is distinct, in fact remote. The former approach becomes untenable for large databases; the latter implies that data which are relevant to query processing and are themselves database components are maintained redundantly. As queries are processed by IDA, additional knowledge, which pertains to the current user and the current session, is encoded into the local model. Maintenance of all of the conceptual database description outside of the database can present problems of long term integrity since changes in the database will eventually affect the relationships existing within the database.

RESEARCH OPPORTUNITIES In order to resolve the problems seen in the merging of AI techniques with the use of databases which cover multiple related domains, we find that a number of interfaces have to be addressed. It has been demonstrated [Grosz77] that interactions in a task-oriented dialogue can be used to establish the focus, or current area of interest, within the task. We believe that, in an analogous manner, information obtained from the current database query interactions can be used to establish the focus that defines the context, or the current area of interest. Data from the database in the domains on which interest is focused can be extracted, indexed, or otherwise made easily available to help with query parsing and database access. Typical stored elements to be made available to help with query parsing are the defined relationships among entities of interest that were already specified in the stored database schema, the terms used in lexical files [Wiederhold77], and information about their usage according to the database schema. The stored relationships provide an initial semantic description of the database. The terms will constitute auxiliary lexicons for assistance in query interpretation.

As an example of the use of a lexical file from the database, consider the query "Where is (the) Ford ?" A lexical file in the database will contain the names of all the objects in the organization's inventory. A search in this file will determine that Ford is an automobile with licence XTHI3B which is checked out to a given customer. Without access to lexical files during parsing, this query would have to have been stated as "Where is inventory item = Ford ?" In this case, the user requires knowledge of the structure of the database. In a large database, however, access to all lexical files can take much effort and create ambiguities. For example, "Ford" might exist in a presidents file, in a geographic file, in a ships file, etc. Knowledge of the context disambiguates the queries by reducing the search space.
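
The effect of context on the lexical lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any existing system: the lexical file names, their contents, and the function `lookup_term` are hypothetical, and the focus is assumed to be available as a set of domain names.

```python
# Hypothetical lexical files: each maps a term to the entity it names
# within one domain of the database. Contents are invented for illustration.
LEXICAL_FILES = {
    "inventory": {"ford": "automobile, licence XTHI3B"},
    "presidents": {"ford": "president"},
    "geographic": {"ford": "place name"},
    "ships": {"ford": "ship"},
}


def lookup_term(term, focus=None):
    """Return (domain, entity) pairs for a term.

    If `focus` (a set of domain names) is given, only the lexical files
    of those domains are searched; otherwise every file is searched.
    """
    key = term.lower()
    domains = sorted(LEXICAL_FILES) if focus is None else sorted(focus)
    return [
        (domain, LEXICAL_FILES[domain][key])
        for domain in domains
        if key in LEXICAL_FILES.get(domain, {})
    ]


# Without a focus the term is ambiguous across four files.
all_senses = lookup_term("Ford")

# With the focus on the inventory domain only one sense remains.
focused = lookup_term("Ford", focus={"inventory"})
```

In this sketch the focus both removes the ambiguity and reduces the number of lexical files that have to be consulted.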

Most stored relationships and terms can be considered to be global, that is, available to all processes using the database. Other knowledge about the database may pertain to one group of users, to one specific user, or may be of utility only during one user-session. Information about attributes of current interest will be used to construct indexes, select subsets of access files, reorganize parts of the database, or to combine index data and references [Burkhard76, Gosh75, Wiederhold75, Rivest76, and Schkolnick77]. Such techniques promise rapid access to databases, but are costly to maintain over long periods.

Questions to be answered within this research frame include:

* What is a good global semantic description of the database?
* How does such a semantic description fit within the so-called conceptual schema [Steel75]?
* How can database facilities be used to maintain database descriptions of a quality adequate for Artificial Intelligence procedures?
* Does modularization of a large database into domains that match the user's focus and interest help with query processing?
* Can we recognize such a focus sufficiently rapidly so that domain selection and rejection can be automatic?
* How can semantic and schema information from different sources be managed?
* How can consistency problems occurring when database descriptions are joined be resolved?
* How can new knowledge developed during database usage be added dynamically to the conceptual schema?
* How can one ensure that the resulting semantic description is acceptable for every user?

The effectiveness of the techniques will have to be evaluated on a small scale model. Such a model will however have to cover several distinct domains of user interest. To allow reliable extrapolation to large databases, careful selection of parameters of the model will be required.

RATIONALE Query Processing. A number of query subsystems have had significant success in the processing of natural language queries, and have been able to retrieve data selected via complex attributes from modestly sized databases [LUNAR: Woods73, MYCIN: Shortliffe73, PLANES: Waltz75, LADDER: Sacerdoti77]. These systems have demonstrated that natural language access to databases in a limited domain can be done economically, both in terms of development costs and in terms of computer time for query processing. Research is currently in progress at SRI to extend the size of the language definition that can be handled, handle linguistic transformations which extend the number of ways in which similar ideas can be expressed, and develop methods for representing complex questions in a computationally efficient manner. A parallel research effort [Grosz77] is investigating the use of focus in understanding dialogs. Focus is important for a number of linguistic processes, including the disambiguation of multiple word-senses and proper handling of anaphora and ellipsis (e.g., the interpretation of pronoun references, or the implicit reference to objects that were previously explicitly referenced). In addition, the focus representation can be used to restrict accesses to a database to a particular area of interest.

Access to Large Databases. When a database becomes large (greater than can be managed by one individual) or very large (impossible to structurally alter without negatively affecting the operation of the enterprise using the database), then in current systems the retrieval requests to be processed have to be well structured and match the database structure. The use of natural language techniques has been considered unworkable by many database experts [Montgomery72]. Requests which match the expectations of the database designer are processed rapidly, but those that require exhaustive processing of data files are rejected or serviced so poorly that the user will cease asking for the data. Some research in relational databases is oriented towards the parsing of queries to minimize sequential scanning of the database. Auxiliary access structures are found in systems which support on-line data retrieval; typical are the index structures of TDMS [Bleier67], the directly accessed multi-attribute 'inverted' files of the DATACOMPUTER [Marill75], and the hierarchical ring structures implied by the CODASYL proposal [Taylor74].

Domains of Interest. In many uses of a database [McDermott75] the attention of the inquirer will be focused on one or a few of the many domains of knowledge contained in a large database. In a military situation one may deal successively with a threat assessment, with the availability of supplies, then with logistics routing, and so on. Similar scenarios can be developed for economic planning, natural resource exploitation, marketing, and other activities that rely on a broad availability of data. The databases which are needed to support such activities are large, since without a large quantity of data the quality of the information which can be obtained is apt to be negligible. The failure to effectively marshal the available data resources is a reason for the continued reliance on intuitive approaches in these areas.

Our Hypothesis. The central hypothesis on which our proposal for joining the achievements of artificial intelligence and database organization is based can be shown through the following scenario:

From the interaction of the inquirer with the database, the focus of current interest can be established. The database contains within itself data descriptions and data values that can be extracted to provide knowledge to assist in the processing of queries. This global knowledge can be combined with knowledge which is specific to the user. The combined knowledge can be used to interpret the queries and to create efficient access paths to relevant data. The combination of improved understanding of queries and rapid retrieval will allow larger databases to become interactively utilized in an effective manner.

Example Dialogue Using Focus Representation. To illustrate how the use of a focus representation can aid both in understanding the user's queries and in defining a subset of the database which is to be used in answering the queries, the following hypothetical dialogue is included. This dialogue concerns a database of information on naval ships such as the database used for the INLAND system developed at SRI. In addition, the dialogue presumes an AI-based 'scenario' system which is able to answer complex questions about a proposed scenario with the aid of information in the database; such a system is currently under development at SRI. Following the dialogue, the nature of the focus representations which could be constructed and their uses will be discussed.

1. Q: How long would it take for Task Force ALPHA to reach Naples?
2. A: 34 hours at cruising speed, 22 hours at maximum speed.
3. Q: Do they have enough fuel?
4. A: At cruising speed, there is enough fuel. At maximum speed, an additional 14000 gallons are required.
5. Q: What ships would run out?
6. A: The cruiser Biddle and the carrier Saratoga.
7. Q: What oilers could meet them en route?
8. A: The Ashtabula could meet them near Sicily.
9. Q: Could they get there at maximum speed?
10. A: Yes.

There are several levels of focus implicit in such a dialogue. The overall focus, established in the first sentence, is a proposed trip by Task Force ALPHA from its current position to Naples. The initial question requires the construction of a tentative course for such a trip, using the positions of Naples and of the task force from the database; then, using this course, it requires the computation of the travel time using information on the speeds of ships in the task force from the database. This results in an answer to the user's question, and also in a partially complete model of the trip. This model of the trip is an important part of the focus representation, and is referenced throughout the remainder of the dialogue in understanding and answering the user's questions. For example, in the question of line 3, the model is used to find the referent of 'they' (the ships in the task force) and to fill the implicit elliptical reference (i.e., the answer to the question, "Enough fuel for what?"). This use of a focus representation developed from the answers to previous questions can be seen throughout the remainder of the dialogue.
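
As a rough illustration of how a trip model might supply the referent of 'they' and the elided purpose in the question of line 3, consider the following Python sketch. It is not the representation used in the SRI work; the class `TripFocus`, the function `resolve`, the slot names, and the assumed parser output are all hypothetical, and only two ships of the task force are listed.

```python
from dataclasses import dataclass, field


@dataclass
class TripFocus:
    """Hypothetical focus model built while answering question 1."""
    group: str
    destination: str
    ships: list = field(default_factory=list)


def resolve(question_slots, focus):
    """Fill pronouns and omitted slots of a parsed question from the focus.

    `question_slots` is an assumed parser output: a dict in which a
    pronoun appears as the string "they" and an elided slot as None.
    The input dict is left unchanged.
    """
    resolved = dict(question_slots)
    if resolved.get("subject") == "they":
        resolved["subject"] = list(focus.ships)
    if resolved.get("purpose") is None:
        resolved["purpose"] = f"trip of {focus.group} to {focus.destination}"
    return resolved


focus = TripFocus("Task Force ALPHA", "Naples", ["Biddle", "Saratoga"])

# "Do they have enough fuel?" parsed with a pronoun and an elided purpose.
question_3 = {"subject": "they", "quantity": "fuel", "purpose": None}
answerable = resolve(question_3, focus)
```

The point of the sketch is only that the focus model, once built, turns an underspecified question into one that names its ships and its purpose explicitly.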

Research in progress at SRI is currently investigating the representation and use of scenario models and focus in the sense described in the previous paragraph. We believe that a different kind of focus representation, which deals with those subsets of a large database which are considered to be likely to be accessed based on the pattern of past access areas, can be developed and used to improve the efficiency of database access. The following paragraph illustrates how such a database focus might be used in conjunction with a dialog such as that given in the above example.

In addition to establishing a conceptual focus for the dialogue, the initial questions can also establish a database focus which defines a relatively small subset of the total database which is sufficient for answering the user's questions. The first question establishes a focus on travel on a certain course by a specified group of naval ships, and the second establishes an interest in the ships' fuel state. These foci are represented by a relatively small part of the total database. In addition to the explicit definition of focus from the user's questions, implicit focus may be inferred to further restrict the portion of the database which is of interest. For example, it would be reasonable to infer that the user is interested only in naval ships (not merchants) and that a relatively small geographic area (the Mediterranean) is of interest; these restrictions might make it possible to subset a large file (e.g., the positions of all ships in the world) into a file small enough to be retrieved for local processing. This local file would contain the information necessary to answer question 7, for example, without further reference to the large database file containing all ship positions.
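
The subsetting step described above can be sketched as a simple filter. The following Python fragment is a minimal illustration under stated assumptions: the rows of the ship-position file, the bounding box taken to approximate the Mediterranean, and the function `focus_subset` are all invented for this example.

```python
# Hypothetical rows of a world-wide ship-position file:
# (name, kind, latitude, longitude). Values are invented for illustration.
SHIP_POSITIONS = [
    ("Ashtabula", "naval", 37.0, 14.0),
    ("Saratoga", "naval", 38.5, 9.0),
    ("Merchant One", "merchant", 36.0, 15.0),
    ("Pacific Oiler", "naval", 21.0, -158.0),
]

# Assumed rough bounding box for the Mediterranean, in degrees.
MEDITERRANEAN = {"lat": (30.0, 46.0), "lon": (-6.0, 36.0)}


def focus_subset(rows, kind, box):
    """Select the rows matching the inferred focus: ship kind and area."""
    (lat_lo, lat_hi), (lon_lo, lon_hi) = box["lat"], box["lon"]
    return [
        row for row in rows
        if row[1] == kind
        and lat_lo <= row[2] <= lat_hi
        and lon_lo <= row[3] <= lon_hi
    ]


# The local file holds only naval ships inside the area of interest.
local_file = focus_subset(SHIP_POSITIONS, "naval", MEDITERRANEAN)
local_names = [row[0] for row in local_file]
```

A question such as question 7 could then be answered from `local_file` alone, without a further scan of the complete position file.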

An Analogy. There is a close analogy here between the proposed information processing methods, and the data processing methods now commonly implemented in computer system hardware and operating systems. The methods referred to are those that have provided programmers with the ability to manage large programs and data areas without themselves managing the actual references and access to the memories. The use of paging and cache memories allows operations to be performed rapidly although the access to most of the hardware is slow and tedious [Denning70, Kaplan73]. In both instances advantage is taken of locality of processing to reduce the address space and increase the accessibility of data by defining dynamically the current working space. The primary difference between the operating systems approach and what we are proposing is one of the conceptual level of the domains where locality is exploited. This does, of course, mean that simple mechanical techniques such as Least Recently Used page replacement are not adequate to determine focus.

Distributed Databases. An additional benefit of focus determination may be that this approach will interface well with the concept of distributed databases, a subject of particular interest to us and many others. Distribution costs are apt to be highly dependent on the extent to which query processing is distributed. A distribution which matches the domains established by the focus of the users will lessen the load on a computer network; if the domains cross subsystem boundaries, costs will be greater.

Scaling. Problems of scale have often been neglected, both in AI and in database proposals. Our understanding of the processing effort expended as the knowledge spaces to be addressed increase is improving to the extent that the effect of size parameters on performance can be assessed using concepts such as linear, logarithmic, squared, or exponential effects. We are hence aware that when working with small scale models an evaluation of the appropriateness of the techniques for a large database has to be included. A quantitative assessment of the proposed methods will be an important complement to the qualitative evaluation of the research results.
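
To make the extrapolation problem concrete, the following Python sketch compares how three of the growth classes named above behave when a small model is scaled up. The record counts, the function `cost_estimates`, and the operations named in the comments are assumptions chosen only for illustration; exponential growth is omitted because its values are too large to tabulate usefully.

```python
import math


def cost_estimates(n):
    """Return illustrative operation counts for a database of n records."""
    return {
        "logarithmic": math.ceil(math.log2(n)),  # e.g. an index probe
        "linear": n,                             # e.g. a sequential scan
        "squared": n * n,                        # e.g. an all-pairs comparison
    }


small_model = cost_estimates(1_000)
large_database = cost_estimates(1_000_000)

# Factor by which each cost grows when the model is scaled up 1000 times.
growth = {name: large_database[name] / small_model[name] for name in small_model}
```

Under these assumptions a thousandfold increase in size merely doubles the logarithmic cost, multiplies the linear cost by a thousand, and multiplies the squared cost by a million, which is why a technique that appears adequate on a small model may not extrapolate.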

Databases for Research. The choice of databases to be used for experimentation is an open question. It seems prohibitive and unwise to populate a database de novo for this work. Databases to which we have access around Stanford and SRI include:

* Medical: There are now separate databases in clinics, the hospital, and in laboratories at Stanford [Cohen74].
* Naval: Fleet-level command/control data have been used for work at SRI.
* Administrative: space, class, student, and terminal allocations. The move to the new CS building may provide an interesting micro model.

The need in a research database for this area is always that it encompasses multiple domains, and that there is moderately loose coupling between these domains.

SUMMARY The problems addressed above range over many issues in the AI-database interface. While they are all interrelated, it is not yet clear in which sequence the issues should be addressed and what milestones are suitable to measure progress. We are confident that research in this area will lead to innovative application of past efforts in the two disciplines, which have as yet seen insufficient interaction.

REFERENCES

[Bachman72] C. W. Bachman, "The Evolution of Storage Structures", CACM Vol. 15, Num. 7, July 1972, pp. 628-634.

[Bleier67] R. E. Bleier, "Treating Hierarchical Data Structures in the SDC Time-shared Data Management System (TDMS)", Proc. 22nd ACM National Conference 1967, MDI Pub. Wayne, Pa., pp. 41-49.

[Burkhard76] W. A. Burkhard, "Hashing and Trie Algorithms for Partial Match Retrieval", TODS Vol. 1, Num. 2, June 1976, pp. 175-187.

[Codd70] E. F. Codd, "A Relational Model of Data for Large Shared Data Banks", CACM Vol. 13, Num. 6, June 1970, pp. 377-387.

[Cohen74] S. N. Cohen et al., "Computer Monitoring and Reporting of Drug Interactions", Proc. MEDINFO 1974, IFIP, Anderson and Forsythe, Editors, Congress, North Holland 1974.

[Denning70] P. J. Denning, "Virtual Memory", Computing Surveys, Vol. 2, Num. 3, September 1970, pp. 153-189.

[Everest74] G. C. Everest, "The Objectives of Data Base Management", Proc. 4th International Symposium on Computer and Information Sciences, Plenum Press, 1974, pp. 1-35.

[Gosh75] S. P. Gosh, "Consecutive Storage of Relevant Records with Redundancy", CACM Vol. 18, Num. 8, August 1975, pp. 464-471.

[Grosz77] B. J. Grosz, "The Representation and Use of Focus in a System for Understanding Dialogs", Proc. 5th IJCAI, Cambridge, Mass., 1977, Vol. 1, pp. 67-76.

[Kaplan73] K. R. Kaplan and R. O. Winder, "Cache-based Computer Systems", Computer, March 1973, pp. 31-36.

[Marill75] T. Marill and D. Stern, "The Datacomputer, A Network Data Utility", Proc. 1975 NCC, AFIPS, Vol. 44, pp. 389-395.

[McDermott75] D. V. McDermott, "Very Large Planner-type Data Bases", M.I.T. Artificial Intelligence Laboratory Memo 339, September 1975.

[Montgomery72] C. A. Montgomery, "Is Natural Language an Unnatural Query Language?", Proc. 27th Nat. Conf., ACM, 1972, pp. 1075-1078.

[Mylopoulos75] J. Mylopoulos and N. Roussopoulos, "Using Semantic Networks for Data Base Management", Proc. International Conf. on Very Large Data Bases, Framingham, Massachusetts, Sept. 1975, pp. 144-172.

[Rivest76] R. L. Rivest, "On Self-Organizing Sequential Search Heuristics", CACM, Vol. 19, Num. 2, February 1976, pp. 63-67.

[Roussopoulos77] N. Roussopoulos, "A Semantic Network Model of Databases", PhD Thesis, Computer Science Dept., University of Toronto, 1977.

[Sacerdoti77] E. Sacerdoti, "Language Access to Distributed Data With Error Recovery", SRI Artificial Intelligence Center Technical Note 140, February 1977.

[Sagalowicz77] D. Sagalowicz, "IDA: An Intelligent Data Access Program", SRI Artificial Intelligence Center Technical Note 145, June 1977.

[Schkolnick77] M. Schkolnick, "A Clustering Algorithm for Hierarchical Structures", ACM-TODS, Vol. 2, Num. 1, March 1977, pp. 27-44.

[Shortliffe73] E. Shortliffe, S. G. Axline, B. G. Buchanan, T. C. Merigan and S. N. Cohen, "An Artificial Intelligence Program to Advise Physicians Regarding Antimicrobial Therapy", Comp. and Biomed. Res., Vol. 6, Num. 6, December 1973, pp. 544-560.

[Sibley76] E. H. Sibley and J. P. Fry, "Evolution of Data-Base Management Systems", ACM Computing Surveys, Vol. 8, Num. 1, March 1976, pp. 7-42.

[Steel74] T. B. Steel Jr., "Data Base Systems - Implications for Commerce and Industry", in "Data Base Management Systems", D. A. Jardine, editor, North Holland 1974.

[Steel75] T. B. Steel Jr., "ANSI/X3/SPARC Study Group on Data Base Management Systems, Interim Report 75-02-08", FDT (Pub. ACM-SIGMOD), Vol. 7, Num. 2, 1975.

[Taylor74] R. W. Taylor, "When Are Pointer Arrays Better Than Chains", Proc. 1974 ACM Nat. Conf., November 1974.

[Waltz75] D. L. Waltz, "Natural Language Access to a Large Database: An Engineering Approach", Adv. Papers 4th Intl. Joint Conf. on Artificial Intelligence, Tbilisi, U.S.S.R., September 1975, pp. 868-872.

[Wiederhold75] G. Wiederhold, J. F. Fries and S. Weyl, "Structured Organization of Clinical Data Bases", Proc. 1975 NCC, AFIPS Vol. 44, pp. 479-486.

[Wiederhold77a] G. Wiederhold, "Database Design", McGraw-Hill, 1977, Chapter 7.


[Wiederhold77b] G. Wiederhold, "Binding in Information Systems", in preparation.

[Woods73] W. A. Woods, "Progress in Natural Language Understanding, An Application to Lunar Geology", Proc. 1973 NCC, AFIPS, Vol. 42, pp. 441-450.
