On Semantic Issues Connected with Incomplete Information Databases

WITOLD LIPSKI, JR.
Institute of Computer Science, Polish Academy of Sciences

Various approaches to interpreting queries in a database with incomplete information are discussed. A simple model of a database is described, based on attributes which can take values in specified attribute domains. Information incompleteness means that instead of having a single value of an attribute, we have a subset of the attribute domain, which represents our knowledge that the actual value, though unknown, is one of the values in this subset. This extends the idea of Codd’s null value, corresponding to the case when this subset is the whole attribute domain. A simple query language to communicate with such a system is described and its various semantics are precisely defined. We emphasize the distinction between two different interpretations of the query language: the external one, which refers the queries directly to the real world modeled in an incomplete way by the system, and the internal one, under which the queries refer to the system’s information about this world, rather than to the world itself. Both external and internal interpretations are provided with the corresponding sets of axioms which serve as a basis for equivalent transformations of queries. The technique of equivalent transformations of queries is then extensively exploited for evaluating the interpretation of (i.e. the response to) a query.

Key Words and Phrases: database, incomplete information, query language semantics, modal logic, relational model, null values
CR Categories: 3.50, 3.70, 4.33, 5.21

1. INTRODUCTION

The notion of information incompleteness seems to be inherent in the domain of databases. However, very little has been done toward clarifying the problems connected with incomplete information and creating a theoretical background for studying them.
This is probably one of the reasons why present database products provide little or no support for information incompleteness, though the situation when data are incomplete is quite common.

In this paper we propose a mathematical model of a database with incomplete information, which we call an information system. Basically, such a system stores information concerning properties of some objects. The information may be incomplete in that it may not be known whether or not an object has a property.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
This work was supported by the Polish Academy of Sciences under Contract MR 1.3. A version of this paper was presented at the Third International Conference on Very Large Data Bases, Tokyo, Japan, October 1977.
Author’s address: Institute of Computer Science, Polish Academy of Sciences, P.O. Box 22, 06-901 Warsaw PKiN, Poland.
© 1979 ACM 0362-5915/79/0900-0262 $00.75
ACM Transactions on Database Systems, Vol. 4, No. 3, September 1979, Pages 262-296.

We describe a simple language for formulating queries to such a system. A query can either define a property of objects (the response should then be the set of objects satisfying this property), or express some property of the system as a whole (the expected response is then “yes” or “no”). Simple queries can be combined into more complex ones by using “not,” “or,” and “and.” We first define the semantics of our language in the special case when the information is complete.
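To make the complete-information case concrete, here is a minimal Python sketch. The object names, the `color` table, and the helpers `holds`, `q_not`, `q_or`, and `q_and` are illustrative assumptions, not notation from this paper. When every attribute value is known, a property query denotes a set of objects, and “not,” “or,” and “and” reduce to set complement, union, and intersection.

```python
# Hypothetical complete-information system: every object has one known color.
OBJECTS = {"a", "b", "c", "d"}
color = {"a": "red", "b": "blue", "c": "green", "d": "red"}

def holds(value):
    """Set of objects whose color equals the given value."""
    return {x for x in OBJECTS if color[x] == value}

def q_not(s):
    """Objects that do not satisfy the query denoted by s."""
    return OBJECTS - s

def q_or(s, t):
    """Objects satisfying at least one of the two queries."""
    return s | t

def q_and(s, t):
    """Objects satisfying both queries."""
    return s & t

red_or_blue = q_or(holds("red"), holds("blue"))
not_red = q_not(holds("red"))
print(sorted(red_or_blue))  # ['a', 'b', 'd']
print(sorted(not_red))      # ['b', 'c']
```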
This semantics is intuitively evident, and is “the only natural one.” It is no longer so when the information is incomplete. For instance, what should be the response to the query “list all objects which are red or blue”? Should we list only those objects known to be red and those known to be blue, or should we also include the objects whose color is determined only to the extent that it is known to be red or blue? And what about objects for which the possibility of being red or blue cannot be ruled out?

The need for a precise formal semantics is evident here. It is also clear that a query can be interpreted in many different ways, of which none is distinguished as “the only natural one.” This fact raises the problem of making sure that the user’s intention expressed in a query and the system’s understanding of that query coincide. Of course, in order to solve this problem, it is not sufficient to merely realize that there may be many different interpretations of a query. Rather, we should understand these differences.

As we shall see, there are two essentially different ways of interpreting a query: the external interpretation and the internal interpretation. The external interpretation refers the queries directly to the real world modeled (in an incomplete way) by the system, whereas under the internal interpretation the queries refer to the system’s information about this world, rather than to the world itself.

Let us consider the following simple example. Suppose that an information system contains four objects a, b, c, d, and that the set of objects which are known to be red consists of the single object a, whereas the only object known not to be red is d (it is not known whether or not b and c are red). Then the external interpretation of the query red (i.e. “list all red objects”) is the set of all the objects which are red in reality, that is, it may be {a}, {a, b}, {a, c}, or {a, b, c}.
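The four candidate answers can be reproduced by enumerating every state of the world consistent with the stored information. The sketch below is only illustrative; the variable names are assumptions introduced here.

```python
from itertools import combinations

objects = {"a", "b", "c", "d"}
known_red = {"a"}        # objects known to be red
known_not_red = {"d"}    # objects known not to be red

# Objects whose redness is undetermined by the system.
unknown = sorted(objects - known_red - known_not_red)

# Each choice of undetermined objects that are red in reality gives one
# possible external interpretation of the query "red".
candidates = []
for k in range(len(unknown) + 1):
    for extra in combinations(unknown, k):
        candidates.append(known_red | set(extra))

for s in candidates:
    print(sorted(s))
# ['a']
# ['a', 'b']
# ['a', 'c']
# ['a', 'b', 'c']
```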
Of course, the information contained in the system is not sufficient to exactly determine this set. In this sense our interpretation is “external” with respect to the system. However, we may consider, for any query Q, the following two bounds on the external interpretation of Q:

(i) $\|Q\|_*$, the set of objects for which we can conclude, from the information available in the system, that they are in the external interpretation of Q, and
(ii) $\|Q\|^*$, the set of objects for which we cannot rule out the possibility of belonging to the external interpretation of Q.

(If Q is a “yes-no” query then the above definitions should be modified in a natural way.) In other words, $\|Q\|_*$ and $\|Q\|^*$ are the best possible bounds on the external interpretation of Q logically derivable from the system. In our example $\|\mathrm{red}\|_* = \{a\}$ and $\|\mathrm{red}\|^* = \{a, b, c\}$.

It is very easy to compute $\|Q\|_*$ and $\|Q\|^*$ when Q expresses an elementary property (e.g. red), since in such a case $\|Q\|_*$ and $\|Q\|^*$ are usually explicitly contained in the system. The situation becomes much more complicated when Q is a Boolean combination of elementary properties. We are then to derive $\|Q\|_*$ and $\|Q\|^*$ from the bounds on elementary properties occurring in Q. To illustrate what kind of problems may arise in this connection, let us consider the following example.

Suppose that a medical database contains information concerning the blood group (O, A, B, or AB) for a large collection of individuals. Sometimes it may happen that we have only partial information on the blood group of a person, derived from the blood groups of the parents, or from some partial tests of the blood. For instance, if one of the parents has group AB, then group O is excluded; if the parents’ groups are A and O then the only possible groups are O and A, etc.
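For an elementary condition, both bounds can be read directly off a set-valued representation in which each person is stored together with the set of blood groups still considered possible. The sketch below assumes such a representation; the records and the functions `lower` and `upper` are hypothetical names used only for illustration.

```python
# Hypothetical records: person -> set of blood groups not yet ruled out.
possible = {
    "p1": {"A"},                  # group known exactly
    "p2": {"O", "A"},             # parents' groups are A and O
    "p3": {"A", "B", "AB"},       # one parent has group AB, so O is excluded
    "p4": {"O", "A", "B", "AB"},  # nothing is known about the group
}

def lower(value):
    """Persons who certainly have the given blood group."""
    return {p for p, groups in possible.items() if groups == {value}}

def upper(value):
    """Persons for whom the given blood group cannot be ruled out."""
    return {p for p, groups in possible.items() if value in groups}

print(sorted(lower("A")))  # ['p1']
print(sorted(upper("A")))  # ['p1', 'p2', 'p3', 'p4']
print(sorted(lower("O")))  # []
print(sorted(upper("O")))  # ['p2', 'p4']
```

Person `p2` appears in neither lower bound, although both upper bounds contain that person.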
Now suppose that we are looking for candidates for transfusion with group O or A. Then any person with parents’ groups O and A is appropriate, even if we do not exactly know his group; it is sufficient that we know this group to be in the set {O, A}. In other words, every such person is in the set $\|\mathrm{BLOODGROUP} = A\ \mathrm{OR}\ \mathrm{BLOODGROUP} = O\|_*$, and consequently in the external interpretation of the query BLOODGROUP = A OR BLOODGROUP = O. Of course, not every such person need be in $\|\mathrm{BLOODGROUP} = A\|_* \cup \|\mathrm{BLOODGROUP} = O\|_*$. This shows that usually $\|Q_1\ \mathrm{OR}\ Q_2\|_*$ cannot be found by simply taking the union of $\|Q_1\|_*$ and $\|Q_2\|_*$.

Notice that partial information on the value of an attribute is usually not expressible in the approach based on null values (see Codd [3]). Indeed, if we use null values, then we are restricted to the following two extreme cases:

(i) Everything is known about the value of an attribute.
(ii) Nothing is known about the value of an attribute.

Let us consider another example, involving a “yes-no” query. Suppose that we classify objects a, b, c, d, e with respect to color. Assume that the color of no object is known, but our partial knowledge about the colors is as follows:

    possibly red objects      a, b
    possibly blue objects     a, b, c, d, e
    possibly green objects    a, c
    possibly white objects    a, b, c
    possibly black objects    b, c

(We do not exclude the possibility of an object having a color other than those colors listed above.) Now consider the query “Are there objects of all colors in our collection?” We may provide a definite response, though the color of no object is known.
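One way to check that a definite response exists is to enumerate every assignment of colors consistent with the table. The sketch below assumes that each object has exactly one color and that “all colors” means the five colors listed; it uses a placeholder value `other` for any unlisted color. These names and assumptions are introduced here for illustration only.

```python
from itertools import product

possibly = {
    "red":   {"a", "b"},
    "blue":  {"a", "b", "c", "d", "e"},
    "green": {"a", "c"},
    "white": {"a", "b", "c"},
    "black": {"b", "c"},
}
objects = ["a", "b", "c", "d", "e"]

# Colors each object may have, plus a placeholder for any unlisted color.
options = {
    x: [c for c, objs in possibly.items() if x in objs] + ["other"]
    for x in objects
}

# A world assigns one admissible color to every object.
worlds = list(product(*(options[x] for x in objects)))

# For each world, test whether all five listed colors occur.
all_colors_present = [set(possibly) <= set(w) for w in worlds]

if all(all_colors_present):
    answer = "yes"
elif not any(all_colors_present):
    answer = "no"
else:
    answer = "unknown"
print(len(worlds), answer)  # 500 no
```

Under these assumptions the response is “no” in every consistent world: red, green, white, and black can only be supplied by a, b, and c, and three objects cannot carry four different colors.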
