A PRELIMINARY SKETCH TO DEFINE RESEARCH OPPORTUNITIES RELEVANT TO ARPA'S PLANS FOR KNOWLEDGE PROCESSING IN DATABASES

Gio Wiederhold, Earl Sacerdoti, Daniel Sagalowicz, Ramez ElMasri, and Gordon Novak. Stanford, September 27, 1977.

TITLE

Use of Context Knowledge to Query Large Databases

OBJECTIVE

We see a need to improve the utility of large databases. In order to make progress in that direction it seems to be potentially profitable to develop an integrated set of methods which will combine the high-level query capability developed in artificial intelligence applications with the access mechanisms to large quantities of data which are embodied in database management systems. Typical users of the systems envisaged here include planners and analysts who want to obtain information from large, multifaceted databases. The proposed methods will include as major components:

* Heuristic procedures to determine the focus of user interest during sessions of active database interaction.
* Extraction of knowledge relevant to current database processing from the database itself.
* Combination of stored knowledge about the database with user- or session-specific as well as with transient knowledge.
* Processing of natural language queries with the aid of the combined knowledge.
* Establishment of rapid access to data-records selected by the query processing, using access structures established according to the combined knowledge.

PROBLEM TO BE ADDRESSED

While an ever larger number of databases are now in operation [Steel74], it has not been easy to extract the knowledge contained in them. The physical management of these large databases has proven sufficiently difficult to force the concentration of efforts into issues of reliability, economy, and integrity [Everest74]. These efforts have paid off to the extent that large databases are now in extensive use for routine data-processing tasks.
The design alternatives available to database implementors have received widespread debate [Sibley76]. The alternatives presented, however, stress opposing approaches: structural simplicity versus the use of structure to encode relationships. Proposals to improve database management include greater formalization [Codd70], and hence a reduction of the knowledge implicit in the database structure itself. The conceptually simple structures advocated can be manipulated using the operations defined by the relational algebra, and consistency and integrity problems can be avoided. Since relationships among entities are not encoded within the database, redundancy is removed. This normal, tabular view is just the opposite of the view which sees intelligent processing of queries driven by a base of data encoded as a semantic net. Many operational commercial systems also stress an encoded network structure to achieve rapid access to records known to be related in some sense (Honeywell IDS, DEC DBMS, IBM IMS, etc.). It is interesting to note that proposals from commercially based developers [Bachman72] have stressed a semantically bound structure and viewed the user as a navigator within it.

This issue of structural simplicity versus structural encoding of relationships among data can be viewed as a binding problem [Wiederhold77b]. Lack of binding promotes integrity, whereas binding establishes relationships of interest to a database user. Relationships may be predefined or may be conceived during the process of extracting information out of the database. A database that is not bound at all defers all binding required to answer queries to the time the database is used, and also leads to a situation where the binding operations are performed again and again. This mode of operation increases retrieval cost at the critical query-processing time.
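The binding trade-off above can be made concrete with a small illustrative sketch (not part of the original proposal; all table and attribute names are invented). An unbound, tabular store re-performs the join on every query, while a bound store follows access paths established once, ahead of time:

```python
# Unbound (relational) view: two flat tables, no stored relationships.
employees = [{"id": 1, "dept": "D1", "name": "Smith"},
             {"id": 2, "dept": "D2", "name": "Jones"}]
departments = [{"dept": "D1", "site": "Stanford"},
               {"dept": "D2", "site": "Menlo Park"}]

def site_of(name):
    """Deferred binding: the join is computed again on each query."""
    for e in employees:
        if e["name"] == name:
            for d in departments:
                if d["dept"] == e["dept"]:
                    return d["site"]

# Bound (network) view: the relationship is encoded once as direct links.
links = {e["name"]: d["site"]
         for e in employees for d in departments
         if d["dept"] == e["dept"]}

print(site_of("Smith"))   # join repeated at query time
print(links["Smith"])     # prebound access path: cheap, but rigid
```

The bound form answers quickly but constrains the user to the relationships someone chose to encode, which is exactly the tension the proposal describes.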
On the other hand, any predefined access paths are apt to be constraining to a database user who has an imaginative and interactive approach to the extraction of knowledge out of the database. Furthermore, a database with predefined access paths for the most common queries will be least useful precisely when information is most needed: in a crisis situation. Crisis situations are, by definition, uncommon, and the common queries will be insufficient.

Symbolic modelling, a well-known technique in Artificial Intelligence, holds promise for enabling databases to be both efficient and easy to use. Data access can be made efficient by utilizing links among records that reflect relationships implicit among those records. Data access can be made easy by explicating those implicit relationships in a symbolic model of the database structure. At least two current efforts are exploring the utility of symbolic modelling for databases. The structure of the database is described using a semantic net technology in Torus [Mylopoulos75, Roussopoulos77], and by a simple property-list structure in IDA [Sagalowicz77]. While in Torus the description and the database are intertwined (the semantic nodes can carry names or values), the approach used by IDA is that the database is distinct, in fact remote. The former approach becomes untenable for large databases; the latter implies that data which are relevant to query processing and are themselves database components are maintained redundantly. As queries are processed by IDA, additional knowledge which pertains to the current user and the current session is encoded into the local model. Maintenance of all of the conceptual database description outside of the database can present problems of long-term integrity, since changes in the database will eventually affect the relationships existing within the database.
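A symbolic model kept apart from the data, in the spirit of the IDA-style property lists discussed above, can be sketched as follows (a hypothetical illustration; the file names, attributes, and helper function are invented, not taken from IDA itself):

```python
# A property-list description of the database's structure, held
# separately from the (possibly remote) data it describes.
schema = {
    "inventory": {"attributes": ["item", "licence", "customer"],
                  "key": "item",
                  "links": {"customer": "customers"}},
    "customers": {"attributes": ["customer", "address"],
                  "key": "customer",
                  "links": {}},
}

def files_mentioning(attribute):
    """Plan which files a query must touch using only the model,
    without accessing the stored data itself."""
    return [f for f, desc in schema.items()
            if attribute in desc["attributes"]]

print(files_mentioning("customer"))
```

Because the model duplicates facts about the database's contents, any reorganization of the real database must eventually be reflected here, which is the long-term integrity problem the text raises.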
RESEARCH OPPORTUNITIES

In order to resolve the problems seen in merging AI techniques with the use of databases which cover multiple related domains, we find that a number of interfaces have to be addressed. It has been demonstrated [Grosz77] that interactions in a task-oriented dialogue can be used to establish the focus, or current area of interest, within the task. We believe that, in an analogous manner, information obtained from the current database query interactions can be used to establish the focus that defines the context, or the current area of interest. Data from the database in the domains on which interest is focused can be extracted, indexed, or otherwise made easily available to help with query parsing and database access. Typical stored elements to be made available to help with query parsing are the defined relationships among entities of interest that were already specified in the stored database schema, the terms used in lexical files [Wiederhold77], and information about their usage according to the database schema. The stored relationships provide an initial semantic description of the database; the terms will constitute auxiliary lexicons for assistance in query interpretation.

As an example of the use of a lexical file from the database, consider the query "Where is (the) Ford?" A lexical file in the database will contain the names of all the objects in the organization's inventory. A search in this file will determine that Ford is an automobile with licence XTHI3B which is checked out to a given customer. Without access to lexical files during parsing, this query would have to have been stated as "Where is inventory item = Ford?" In this case, the user requires knowledge of the structure of the database. In a large database, however, access to all lexical files can take much effort and create ambiguities. For example, "Ford" might exist in a presidents file, in a geographic file, in a ships file, etc.
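The "Ford" example can be sketched in a few lines (an illustrative reconstruction, not code from the proposal; the file contents and the resolve function are invented). A lexical lookup across all files surfaces every reading of the term, while a current focus restricts the search to the domains of interest:

```python
# Hypothetical lexical files: term -> (category, associated data).
lexical_files = {
    "inventory":  {"Ford": ("automobile", "licence XTHI3B")},
    "presidents": {"Ford": ("person", "38th President")},
    "geographic": {"Ford": ("place", "river crossing")},
}

def resolve(term, focus=None):
    """Look the term up in all lexical files, or only in those
    belonging to the current focus of user interest."""
    domains = focus if focus else lexical_files.keys()
    return [(d, lexical_files[d][term])
            for d in domains if term in lexical_files[d]]

print(resolve("Ford"))                       # ambiguous: three readings
print(resolve("Ford", focus=["inventory"]))  # focus disambiguates
```

With the focus established from prior interactions, the parser receives a single interpretation instead of three, which is the search-space reduction described in the following paragraph.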
Knowledge of the context disambiguates the queries by reducing the search space. Most stored relationships and terms can be considered to be global, that is, available to all processes using the database. Other knowledge about the database may pertain to one group of users, to one specific user, or may be of utility only during one user-session. Information about attributes of current interest will be used to construct indexes, select subsets of access files, reorganize parts of the database, or to combine index data and references [Burkhard76, Gosh75, Wiederhold75, Rivest76, and Schkolnick77]. Such techniques promise rapid access to databases, but are costly to maintain over long periods.

Questions to be answered within this research frame include:

* What is a good global semantic description of the database?
* How does such a semantic description fit within the so-called conceptual schema [Steel75]?
* How can database facilities be used to maintain database descriptions of a quality adequate for Artificial Intelligence procedures?
* Does modularization of a large database into domains that match the user's focus and interest help with query processing?
* Can we recognize such a focus sufficiently rapidly so that domain selection and rejection can be automatic?
* How can semantic and schema information from different sources be managed?
* How can consistency problems occurring when database descriptions are joined be resolved?
* How can new knowledge developed during database usage be added dynamically to the conceptual schema?
* How can one ensure that the resulting semantic description is acceptable for every user?

The effectiveness of the techniques will have to be evaluated on a small-scale model. Such a model will, however, have to cover several distinct domains of user interest. To allow reliable extrapolation to large databases, careful selection of the parameters of the model will be required.

RATIONALE

Query Processing.