On Semantic Issues Connected with Incomplete Information Databases

WITOLD LIPSKI, JR.
Institute of Computer Science, Polish Academy of Sciences

Various approaches to interpreting queries in a database with incomplete information are discussed. A simple model of a database is described, based on attributes which can take values in specified attribute domains. Information incompleteness means that instead of having a single value of an attribute, we have a subset of the attribute domain, which represents our knowledge that the actual value, though unknown, is one of the values in this subset. This extends the idea of Codd’s null value, corresponding to the case when this subset is the whole attribute domain. A simple query language to communicate with such a system is described and its various semantics are precisely defined. We emphasize the distinction between two different interpretations of the query language: the external one, which refers the queries directly to the real world modeled in an incomplete way by the system, and the internal one, under which the queries refer to the system’s information about this world, rather than to the world itself. Both external and internal interpretations are provided with the corresponding sets of axioms which serve as a basis for equivalent transformations of queries. The technique of equivalent transformations of queries is then extensively exploited for evaluating the interpretation of (i.e. the response to) a query.

Key Words and Phrases: database, incomplete information, query language semantics, modal logic, relational model, null values

CR Categories: 3.50, 3.70, 4.33, 5.21

1. INTRODUCTION

The notion of information incompleteness seems to be inherent in the domain of databases. However, very little has been done toward clarifying the problems connected with incomplete information and creating a theoretical background for studying them. This is probably one of the reasons why present database products provide little or no support for information incompleteness, though the situation when data are incomplete is quite common. In this paper we propose a mathematical model of a database with incomplete information, which we call an information system. Basically, such a system stores information concerning properties of some objects. The information may be incomplete in that it may not be known whether or not an object has a property.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
This work was supported by the Polish Academy of Sciences under Contract MR 1.3. A version of this paper was presented at the Third International Conference on Very Large Data Bases, Tokyo, Japan, October 1977.
Author’s address: Institute of Computer Science, Polish Academy of Sciences, P.O. Box 22, 06-901 PKiN.
© 1979 ACM 0362-5915/79/0900-0262 $00.75
ACM Transactions on Database Systems, Vol. 4, No. 3, September 1979, Pages 262-296.

We describe a simple language for formulating queries to such a system. A query can either define a property of objects (the response should then be the set of objects satisfying this property), or express some property of the system as a whole (the expected response is then “yes” or “no”). Simple queries can be combined into more complex ones by using “not,” “or,” and “and.”

We first define the semantics of our language in the special case when the information is complete. This semantics is intuitively evident, and is “the only natural one.” It is no longer so when the information is incomplete. For instance, what should be the response to the query “list all objects which are red or blue”? Should we list only those objects known to be red and those known to be blue, or should we also include the objects whose color is determined only to the extent that it is known to be red or blue? And what about objects for which the possibility of being red or blue cannot be ruled out? The need for a precise formal semantics is evident here. It is also clear that a query can be interpreted in many different ways, of which none is distinguished as “the only natural one.” This fact raises the problem of making sure that the user’s intention expressed in a query and the system’s understanding of that query coincide. Of course, in order to solve this problem, it is not sufficient to merely realize that there may be many different interpretations of a query. Rather, we should understand these differences.

As we shall see, there are two essentially different ways of interpreting a query: the external interpretation and the internal interpretation. The external interpretation refers the queries directly to the real world modeled (in an incomplete way) by the system, whereas under the internal interpretation the queries refer to the system’s information about this world, rather than to the world itself. Let us consider the following simple example.
Suppose that an information system contains four objects a, b, c, d, and that the set of objects which are known to be red consists of the single object a, whereas the only object known not to be red is d (it is not known whether or not b and c are red). Then the external interpretation of the query red (i.e. “list all red objects”) is the set of all the objects which are red in reality, that is, it may be $\{a\}$, $\{a, b\}$, $\{a, c\}$, or $\{a, b, c\}$. Of course, the information contained in the system is not sufficient to exactly determine this set. In this sense our interpretation is “external” with respect to the system. However, we may consider, for any query Q, the following two bounds on the external interpretation of Q:

(i) $\|Q\|_*$, the set of objects for which we can conclude, from the information available in the system, that they are in the external interpretation of Q, and
(ii) $\|Q\|^*$, the set of objects for which we cannot rule out the possibility of belonging to the external interpretation of Q.

(If Q is a “yes-no” query then the above definitions should be modified in a natural way.) In other words, $\|Q\|_*$ and $\|Q\|^*$ are the best possible bounds on the external interpretation of Q logically derivable from the system. In our example $\|\text{red}\|_* = \{a\}$, $\|\text{red}\|^* = \{a, b, c\}$.
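The four-object example can be checked by brute force. The sketch below is only an illustration: the dictionary encoding and all variable names are assumptions made here, not part of the model. It enumerates every way of resolving the unknown colors, collects the candidate external interpretations of red, and takes their intersection and union as the lower and upper bounds.

```python
from itertools import product

# Hypothetical encoding: for each object, the truth values "is red" may take.
possible = {
    "a": {True},          # known to be red
    "b": {True, False},   # not known
    "c": {True, False},   # not known
    "d": {False},         # known not to be red
}
objects = sorted(possible)

# Each way of resolving the unknowns yields one candidate external interpretation.
candidates = set()
for choice in product(*(sorted(possible[x]) for x in objects)):
    candidates.add(frozenset(x for x, is_red in zip(objects, choice) if is_red))

lower = set.intersection(*(set(c) for c in candidates))  # red in every candidate
upper = set.union(*(set(c) for c in candidates))         # red in some candidate

print(sorted(sorted(c) for c in candidates))
print(sorted(lower), sorted(upper))
```

The four candidates are exactly the sets listed in the text, and the two bounds come out as {a} and {a, b, c}.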

It is very easy to compute $\|Q\|_*$ and $\|Q\|^*$ when Q expresses an elementary property (e.g. red), since in such a case $\|Q\|_*$ and $\|Q\|^*$ are usually explicitly contained in the system. The situation becomes much more complicated when Q is a Boolean combination of elementary properties. We are then to derive $\|Q\|_*$ and $\|Q\|^*$ from the bounds on elementary properties occurring in Q. To illustrate what kind of problems may arise in this connection, let us consider the following example.

Suppose that a medical database contains information concerning the blood group (O, A, B, or AB) for a large collection of individuals. Sometimes it may happen that we have only partial information on the blood group of a person, derived from the blood groups of the parents, or from some partial tests of the blood. For instance, if one of the parents has group AB, then group O is excluded; if the parents’ groups are A and O then the only possible groups are O and A, etc. Now suppose that we are looking for candidates for transfusion with group O or A. Then any person with parents’ groups O and A is appropriate, even if we do not exactly know his group: it is sufficient that we know this group to be in the set $\{O, A\}$. In other words, every such person is in the set $\|\text{BLOODGROUP} = A \text{ OR } \text{BLOODGROUP} = O\|_*$, and consequently in the external interpretation of the query BLOODGROUP = A OR BLOODGROUP = O. Of course, not every such person need be in $\|\text{BLOODGROUP} = A\|_* \cup \|\text{BLOODGROUP} = O\|_*$. This shows that usually $\|Q_1 \text{ OR } Q_2\|_*$ cannot be found by simply taking the union of $\|Q_1\|_*$ and $\|Q_2\|_*$.

Notice that partial information on the value of an attribute is usually not expressible in the approach based on null values (see Codd [3]). Indeed, if we use null values, then we are restricted to the following two extreme cases:

(i) Everything is known about the value of an attribute.
(ii) Nothing is known about the value of an attribute.
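The blood-group argument can be made concrete with a few invented records. In the sketch below the person identifiers and the helper names `lower` and `upper` are hypothetical; the only rule used is the one implied by the definition of the bounds, namely that a person certainly satisfies a condition on one attribute exactly when every value still possible for that person satisfies it.

```python
GROUPS = {"O", "A", "B", "AB"}

# Hypothetical records: the set of blood groups still possible for each person.
possible = {
    "p1": {"O", "A"},   # parents' groups are A and O
    "p2": {"A"},        # fully known
    "p3": set(GROUPS),  # nothing known (the null-value case)
    "p4": {"B", "AB"},
}

def lower(wanted):
    """Persons whose group is certainly in `wanted`."""
    return {p for p, groups in possible.items() if groups <= wanted}

def upper(wanted):
    """Persons whose group may be in `wanted`."""
    return {p for p, groups in possible.items() if groups & wanted}

surely_a_or_o = lower({"A", "O"})          # lower bound of the disjunction
union_of_lowers = lower({"A"}) | lower({"O"})

print(sorted(surely_a_or_o), sorted(union_of_lowers))
```

Person p1 is in the lower bound of the disjunction but in neither of the two separate lower bounds, which is the gap the text describes. Person p3 shows the null-value extreme: it appears in every upper bound and in no lower bound other than that of the whole domain.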
Let us consider another example, involving a “yes-no” query. Suppose that we classify objects a, b, c, d, e with respect to color. Assume that the color of no object is known, but our partial knowledge about the colors is as follows:

possibly red objects: a, b
possibly blue objects: a, b, c, d, e
possibly green objects: a, c
possibly white objects: a, b, c
possibly black objects: b, c

(We do not exclude the possibility of an object having a color other than those colors listed above.) Now consider the query “Are there objects of all colors in our collection?” We may provide a definite response, though the color of no object is known. Indeed, there are only three objects a, b, and c to represent the four colors red, green, white, and black. Consequently, for our query we have $\|Q\|^* = $ “no”.
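For a collection this small the definite “no” can be confirmed by exhaustive enumeration. The sketch below is a brute-force check, not the method of distinct representatives that the paper defers to [11]; the value "other" is a stand-in introduced here for any color outside the five listed ones.

```python
from itertools import product

COLORS = ["red", "blue", "green", "white", "black"]
possibly = {
    "red": {"a", "b"},
    "blue": {"a", "b", "c", "d", "e"},
    "green": {"a", "c"},
    "white": {"a", "b", "c"},
    "black": {"b", "c"},
}
objects = ["a", "b", "c", "d", "e"]

# Colors each object may have; "other" models a color not listed above.
candidates = {
    x: [c for c in COLORS if x in possibly[c]] + ["other"] for x in objects
}

# Evaluate the yes-no query in every possible assignment of colors.
answers = set()
for assignment in product(*(candidates[x] for x in objects)):
    answers.add(set(COLORS) <= set(assignment))

print(answers)
```

Only 500 assignments are examined, and in none of them are all five colors present, so both bounds of the query are “no”. The enumeration grows exponentially with the number of objects, which is why a combinatorial method is needed in general.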

We see from the above examples that determining $\|Q\|_*$ and $\|Q\|^*$ has the flavor of extracting as much (implicit) information as possible from incomplete data. A uniform method for computing $\|Q\|_*$ and $\|Q\|^*$ for the case where Q is not a “yes-no” query is presented in Section 5. It is interesting that the analogous problem for “yes-no” queries has quite a different combinatorial flavor. A method for computing $\|Q\|_*$ and $\|Q\|^*$ for such queries, which relies on the classical combinatorial problem of distinct representatives of subsets, is given in a subsequent paper [11].

Now let us turn to the internal interpretation. The internal interpretation of query Q will be denoted by $\|Q\|$. A query of the form of an elementary property, say red, is understood as known to be red. We define $\|Q_1 \text{ OR } Q_2\| = \|Q_1\| \cup \|Q_2\|$, $\|Q_1 \text{ AND } Q_2\| = \|Q_1\| \cap \|Q_2\|$, and $\|\text{NOT } Q\|$ is the complement of $\|Q\|$. This means that the internal interpretation behaves as the usual interpretation for the case of complete information. Notice that Q is understood as known to have property Q only when Q expresses an elementary property. For instance, red OR blue is understood as known to be red or known to be blue, rather than known to be [red or blue]. Let us consider the following example. Suppose that objects a, b, c, d, e, f, and g are classified with respect to color, which may be only white, black, or red for our collection, and assume that our information on the color of objects is as follows:

possibly white objects: a, b, c, d
possibly black objects: c, d, e, f
possibly red objects: b, c, f, g

We may easily determine

$$\|\text{white OR black}\|_* = \{a, d, e\},$$
$$\|\text{white OR black}\|^* = \{a, b, c, d, e, f\},$$
$$\|\text{white OR black}\| = \|\text{white}\| \cup \|\text{black}\| = \{a\} \cup \{e\} = \{a, e\}.$$

(It is not always true that $\|Q\| \subseteq \|Q\|_*$; for instance, $\|\text{NOT white}\| = \{b, c, d, e, f, g\}$, $\|\text{NOT white}\|_* = \{e, f, g\}$.)

The internal interpretation is not very interesting as long as we use the same language as for the case of complete information. However, we may enrich the language by introducing an additional unary operator surely with the following interpretation: $\|\text{surely } Q\|$ is the set of objects which belong to $\|Q\|$ in every possible completion of our present knowledge (for “yes-no” queries the definition should be suitably modified; this case is deferred to [11]). As we shall see in Section 7, the operation surely considerably increases the expressive power of the language.

An important role in our approach to incomplete information is played by suitably chosen sets of axioms which serve as a basis for equivalent transformations of queries (of course, the notion of equivalence depends on the interpretation we consider). Equivalent transformations of queries are the main tool exploited in our methods of evaluating $\|Q\|_*$, $\|Q\|^*$, and $\|Q\|$. This technique seems to be especially suitable when a database is very large, since it is then profitable to transform any query entering the system into a form which minimizes the cost of determining the response. In other words, equivalent transformations of queries enable us to shift as much of the work as possible to the level of the language.

It should be emphasized that we do not consider any measure of the lack of information. This fact excludes all probabilistic interpretations. This “yes-no-not known” approach might suggest the use of the three-valued logic, as is done, for instance, in Codd [3], Hajek [5], or Hajek et al. [6]. However, we shall argue that the three-valued logic is not an appropriate tool for our purposes.
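Returning to the classification of objects a through g by color given earlier in this section, the three values of white OR black and the two values of NOT white can be recomputed mechanically. The sketch below assumes the lists white = a, b, c, d; black = c, d, e, f; red = b, c, f, g, and uses helper names (`known`, `lower`, `upper`) invented for this illustration.

```python
X = set("abcdefg")
possibly = {
    "white": set("abcd"),
    "black": set("cdef"),
    "red": set("bcfg"),
}

# For each object, the set of colors it may have.
colors_of = {x: {c for c, objs in possibly.items() if x in objs} for x in X}

def known(color):
    """Internal interpretation of an elementary property: known to have `color`."""
    return {x for x in X if colors_of[x] == {color}}

def lower(colors):
    """Objects whose color is certainly one of `colors`."""
    return {x for x in X if colors_of[x] <= colors}

def upper(colors):
    """Objects whose color may be one of `colors`."""
    return {x for x in X if colors_of[x] & colors}

internal_or = known("white") | known("black")
internal_not_white = X - known("white")
lower_not_white = lower({"black", "red"})

print(sorted(lower({"white", "black"})), sorted(upper({"white", "black"})))
print(sorted(internal_or), sorted(internal_not_white), sorted(lower_not_white))
```

Object d illustrates the difference between the two readings of the disjunction: it is certainly white or black, yet it is neither known to be white nor known to be black.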
Let us also mention that problems connected with “fuzzy sets” are outside the scope of our study. Usually situations involving the use of fuzzy sets are related to an inherent fuzziness of the classification, and have nothing to do with information incompleteness: even knowing the height of a man exactly to a millimeter, we still may not be able to say whether or not he is tall. In our study we allow the data to be incomplete, but we insist that they are correct; at least, we do not expect the response to be correct when data are incorrect.

Some of the ideas which we develop in this paper were presented in a preliminary version [12, 14]. Our study is an attempt to extend the model of an information storage and retrieval system proposed by Marek and Pawlak [19] (also investigated by Lipski and Marek [16]) to the case of incomplete information. Another important study of the problems related to incomplete information has been undertaken by Jaegermann [7]. His work differs both in the methods used (algebraic rather than the logical ones used here) and in the results obtained. His paper reveals the relation of pseudo-Boolean algebras and intuitionistic logic to the problems of incomplete information, whereas in our approach the relevant notions turn out to be those of a topological Boolean algebra and modal logic S4 (see [11]).

In order to make this paper more readable, especially for those who are not very much interested in the mathematical aspects of the theory, the proofs of many of the results are accumulated in the Appendix. On the other hand, a more theoretically oriented reader is referred to [11] for parts of the theory which required too much mathematical machinery to be included here.

2. BASIC NOTIONS

We give below a mathematical model of an information system. The model will then be studied in the rest of the paper.

One of the basic components of an information system is a finite set X of objects. The objects are classified with respect to a finite number of attributes. The set of these attributes will usually be denoted by I. Associated with every attribute $i \in I$ is a nonempty attribute domain $D_i$, consisting of all possible values this attribute can take. For instance, if i is “length” then we can take as $D_i$ the (infinite) set of nonnegative reals; if i is “sex” then $D_i = \{\text{male}, \text{female}\}$. Of course, different attributes can have the same domains. The basic assumption concerning our classification is that, for any fixed object in X, to every attribute $i \in I$ there corresponds a unique value of this attribute. However, this value need not be known in the system. In the model to be presented in this paper, the classification (strictly speaking, our incomplete knowledge of the values of attributes for any particular object) is represented by a function U. This function associates with any $i \in I$ and $a \in D_i$ a set of objects $U(i, a) \subseteq X$. Intentionally, $U(i, a)$ is the set of objects for which attribute i possibly takes value a. In other words, the complement $X \setminus U(i, a)$ is exactly the set of objects for which attribute i is known not to take value a. Since $D_i$ contains all possible values of attribute i,

$$\bigcup \{U(i, a) : a \in D_i\} = X, \tag{1}$$

for every $i \in I$. From the function U we can determine the set, denoted by $u(i, a)$, of all the objects for which the value of attribute i is known to be a:

$$u(i, a) = X \setminus \bigcup \{U(i, b) : b \in D_i \wedge b \neq a\}. \tag{2}$$

Indeed, we know that attribute i takes value a exactly when we know that it is not possible for it to take any other value $b \in D_i \setminus \{a\}$. It should be noted that u does not uniquely determine U. For instance, all functions U with the property that for every $i \in I$ there are at least two $a \in D_i$ such that $U(i, a) = X$ determine the same function u, namely $u(i, a) = \emptyset$ for all $i \in I$, $a \in D_i$.

From eqs. (1) and (2) we easily obtain the following two intuitively evident facts:

$$u(i, a) \subseteq U(i, a), \tag{3}$$
$$u(i, a) \cap U(i, b) = \emptyset, \tag{4}$$

for all $a \neq b$, $a, b \in D_i$ (taking i to be “color”: if something is known to be red then it cannot be blue).

Let us denote by D the “disjoint union” of attribute domains,

$$D = \{(i, a) : i \in I \wedge a \in D_i\},$$

and by $\mathcal{P}(X)$ the set of all subsets of X. We are now ready to summarize a formal definition of an information system.

Definition 1. An information system (or a system for short) is a triple $\mathcal{S} = (X, (D_i)_{i \in I}, U)$ where
(i) X is a finite set of objects,
(ii) I is a finite set of attributes,
(iii) $D_i$ is a nonempty set called the domain of attribute i,
(iv) U is a function, $U: D \to \mathcal{P}(X)$, such that for every $i \in I$, $\bigcup \{U(i, a) : a \in D_i\} = X$.

According to the interpretation of function U, we may determine, for every $x \in X$ and every $i \in I$, the set

$$\rho_i(x) = \{a \in D_i : x \in U(i, a)\} \tag{5}$$

of all possible values attribute i can take for object x. Conversely, U can be obtained from functions $\rho_i$, $i \in I$, by the formula

$$U(i, a) = \{x \in X : a \in \rho_i(x)\}. \tag{6}$$

Indeed, this follows from the fact that, by (5),

$$x \in U(i, a) \iff a \in \rho_i(x).$$

It is clear from these remarks that we can uniquely represent a system by X, $(D_i)_{i \in I}$ and $(\rho_i)_{i \in I}$. We shall call $(\rho_i)_{i \in I}$ the classification associated with system $\mathcal{S} = (X, (D_i)_{i \in I}, U)$. The reader may easily prove that $\rho_i(x) \neq \emptyset$ (use (1)), and

$$u(i, a) = \{x \in X : \rho_i(x) = \{a\}\} \tag{7}$$

(use (2)).

Let us notice that, very roughly speaking, a tuple $(\rho_i(x))_{i \in I}$ has a similar interpretation to a tuple belonging to a relation of Codd’s relational model [2]. However, $\rho_i(x)$ is usually a set of values, rather than a single value. We defer a more detailed discussion of the connection with the relational model to Section 7. Here we only remark that a null value (see Codd [3]) for attribute i corresponds to $\rho_i(x) = D_i$.

Now suppose that the information about the objects increases, while the objects themselves remain invariant. To deal with such a situation we introduce the following definition.

Definition 2. Let $\mathcal{S}_1 = (X, (D_i)_{i \in I}, U_1)$ and $\mathcal{S}_2 = (X, (D_i)_{i \in I}, U_2)$ be two systems. We say that $\mathcal{S}_2$ is an extension of $\mathcal{S}_1$, and we write $\mathcal{S}_1 \le \mathcal{S}_2$ (or $\mathcal{S}_2 \ge \mathcal{S}_1$), if

$$U_2(i, a) \subseteq U_1(i, a), \tag{8}$$

for all $i \in I$ and $a \in D_i$. Equivalently, we can say that $\mathcal{S}_1 \le \mathcal{S}_2$ if

$$\rho_i^2(x) \subseteq \rho_i^1(x), \tag{9}$$

for all $i \in I$ and $x \in X$, where $(\rho_i^1)_{i \in I}$ and $(\rho_i^2)_{i \in I}$ are the classifications corresponding to $\mathcal{S}_1$ and $\mathcal{S}_2$, respectively. Indeed, if (8) holds then, by (5),

$$\rho_i^2(x) = \{a \in D_i : x \in U_2(i, a)\} \subseteq \{a \in D_i : x \in U_1(i, a)\} = \rho_i^1(x),$$

and conversely, (9) implies, by (6),

$$U_2(i, a) = \{x \in X : a \in \rho_i^2(x)\} \subseteq \{x \in X : a \in \rho_i^1(x)\} = U_1(i, a).$$

This formal definition of extension has a very simple intuitive explanation. If we think of $\mathcal{S}_1$ and $\mathcal{S}_2$ as corresponding to states of knowledge about the objects, then $\mathcal{S}_1 \le \mathcal{S}_2$ means that this knowledge is either more complete in $\mathcal{S}_2$ than in $\mathcal{S}_1$, or else $\mathcal{S}_1 = \mathcal{S}_2$. To increase information about an object means to decrease the uncertainty concerning the values of attributes, i.e. to decrease (some of) the sets $\rho_i(x)$, which is exactly what (9) says.

To illustrate the notion of an extension, let us consider the following situation. Suppose four objects x, y, z, and t are given, and we have a balance to determine their respective masses. First assume that only a single 1-kilogram weight is available. We are then able to determine, for each object, whether its mass is less than, equal to, or greater than 1 kilogram. By weighing all the objects we might obtain the following information:

$\mathcal{S}_1$:
$\rho_i^1(x) = \{w \in R^+ : w < 1\} = [0, 1)$
$\rho_i^1(y) = \{w \in R^+ : w > 1\} = (1, \infty)$
$\rho_i^1(z) = \{w \in R^+ : w > 1\} = (1, \infty)$
$\rho_i^1(t) = \{w \in R^+ : w < 1\} = [0, 1)$

(where i is “mass” and $R^+$ denotes the set of nonnegative reals). When an arbitrary number of 1-kilogram weights are available, our information increases. Now weighing the objects might, for instance, yield the following information:

$\mathcal{S}_2$:
$\rho_i^2(x) = [0, 1)$
$\rho_i^2(y) = (3, 4)$
$\rho_i^2(z) = (2, 3)$
$\rho_i^2(t) = [0, 1)$.

When 10-dekagram weights are used, we can obtain, say

$\mathcal{S}_3$:
$\rho_i^3(x) = [0, 0.1)$
$\rho_i^3(y) = (3.6, 3.7)$
$\rho_i^3(z) = (2.0, 2.1)$
$\rho_i^3(t) = (0.8, 0.9)$.

Of course, $\mathcal{S}_1 \le \mathcal{S}_2 \le \mathcal{S}_3$.

It is obvious that $\le$ is a partial order¹ on the set of all systems with fixed X and $(D_i)_{i \in I}$. It will play a fundamental role in our further considerations. The partial order $\le$ has the least element (i.e. a system of which all systems are extensions), viz. the system with $U(i, a) = X$ for all $i \in I$, $a \in D_i$. This system will be called minimal. It contains no information at all, except for the mere fact of which attributes refer to the objects. It is easy to see that in a minimal system $\rho_i(x) = D_i$ for all $x \in X$, $i \in I$.

A system is called complete if it is a maximal element in the order $\le$, i.e. if it has no extensions $\mathcal{S}' \ge \mathcal{S}$ except for $\mathcal{S}' = \mathcal{S}$. It is easy to see that a system $\mathcal{S} = (X, (D_i)_{i \in I}, U)$ is complete if and only if it satisfies one of the following three equivalent conditions:

$$u = U, \tag{10}$$
$$U(i, a) \cap U(i, b) = \emptyset, \quad \text{for all } i \in I,\ a, b \in D_i,\ a \neq b, \tag{11}$$
$$\rho_i(x) \text{ consists of a single element of } D_i, \quad \text{for all } i \in I,\ x \in X. \tag{12}$$

In a complete system we know exactly the unique value attribute i takes for object x, for any $i \in I$ and $x \in X$. A system which is not complete is called incomplete. Complete systems are (a slight generalization of) information storage and retrieval systems introduced and investigated by Marek and Pawlak (see [19], and also [16]).
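The definitions of this section translate directly into set operations. The following is a minimal sketch for a single hypothetical attribute “color” with a two-element domain; the function names and the two example systems `U1` and `U2` are assumptions made for illustration, and each function cites the equation or condition it mirrors.

```python
X = {"x", "y", "z"}
DOMAIN = {"red", "blue"}   # domain of one hypothetical attribute, "color"

def known(U):
    """Eq. (2): u(i, a) is X minus every U(i, b) with b different from a."""
    return {a: X - set().union(*(U[b] for b in DOMAIN if b != a)) for a in DOMAIN}

def possible_values(U):
    """Eq. (5): for each object x, the values a with x in U(i, a)."""
    return {x: {a for a in DOMAIN if x in U[a]} for x in X}

def is_extension(U1, U2):
    """Condition (8): U2(i, a) is a subset of U1(i, a) for every a."""
    return all(U2[a] <= U1[a] for a in DOMAIN)

def is_complete(U):
    """Condition (12): exactly one possible value for every object."""
    return all(len(vals) == 1 for vals in possible_values(U).values())

U1 = {"red": {"x", "y", "z"}, "blue": {"y", "z"}}   # an incomplete system
U2 = {"red": {"x", "y"}, "blue": {"z"}}             # one of its extensions

print(known(U1), is_extension(U1, U2), is_complete(U2))
```

In this example `known(U2) == U2`, which is condition (10), while `known(U1)` contains only object x under "red", in line with facts (3) and (4).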

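¹ That is, (a) $\mathcal{S} \le \mathcal{S}$; (b) if $\mathcal{S}_1 \le \mathcal{S}_2$ and $\mathcal{S}_2 \le \mathcal{S}_3$, then $\mathcal{S}_1 \le \mathcal{S}_3$; (c) if $\mathcal{S}_1 \le \mathcal{S}_2$ and $\mathcal{S}_2 \le \mathcal{S}_1$, then $\mathcal{S}_1 = \mathcal{S}_2$.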

Any complete extension of a system $\mathcal{S}$ will be called a completion of $\mathcal{S}$. Having defined the notion of a completion, we can give the following simple interpretation to an information system $\mathcal{S}$: the objects of $\mathcal{S}$ can in reality be described by a completion of $\mathcal{S}$, though we do not know by which one. A similar interpretation of incomplete information, in terms of “earthly” and “heavenly” models, was considered by Hajek et al. [6].
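When all attribute domains are finite, the completions of a small system can simply be listed. The sketch below assumes a single attribute and represents a system by the set of possible values of each object; the generator `completions` and the sample data are hypothetical.

```python
from itertools import product

def completions(possible):
    """Yield every completion: one definite value chosen for each object."""
    objs = sorted(possible)
    for choice in product(*(sorted(possible[x]) for x in objs)):
        yield dict(zip(objs, choice))

# Hypothetical system: x is known to be red, y and z may be red or blue.
possible = {"x": {"red"}, "y": {"red", "blue"}, "z": {"red", "blue"}}
all_completions = list(completions(possible))

# Objects that are red in every completion, and in at least one completion.
surely_red = {x for x in possible if all(c[x] == "red" for c in all_completions)}
maybe_red = {x for x in possible if any(c[x] == "red" for c in all_completions)}

print(len(all_completions), sorted(surely_red), sorted(maybe_red))
```

Reality is described by one of the four completions, and quantifying over all of them recovers the two bounds on the external interpretation introduced in Section 1. The number of completions is the product of the sizes of the sets of possible values, so this is a way to check small examples rather than a practical evaluation method.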

3. QUERY LANGUAGE AND ITS SEMANTICS IN A COMPLETE SYSTEM

The main task of an information system is to answer queries submitted by the user(s). To communicate with the system (more specifically, to formulate queries) the user has a query language at his or her disposal. We describe such a language below. It is an inessential modification of a language considered by Jaegermann [7].

The language has two levels. Expressions of these levels are called terms and formulas, respectively. Intensionally, terms describe subsets of the set of objects, whereas formulas (“yes-no” queries) express some facts concerning the system as a whole. By a query we shall mean a term or a formula.

We begin with terms. These are built up from certain elementary parts called descriptors, the symbols 0, 1, $-$, $+$, $\cdot$, $\to$ for Boolean operations, and parentheses. More exactly, the set $\mathcal{T}$ of terms is defined to be the least set $\mathcal{T}'$ with the following two properties:

(i) 0, 1 and all descriptors are in $\mathcal{T}'$.
(ii) $-t$, $(t + s)$, $(t \cdot s)$, $(t \to s)$ are in $\mathcal{T}'$ whenever $t, s \in \mathcal{T}'$.

Terms will usually be denoted by t, s, ... (possibly with indices). We use the usual rules for saving parentheses and we abbreviate finite sums and products as $\sum_{j \in J} t_j$ and $\prod_{j \in J} t_j$, respectively.

Every descriptor is of the form $(i, A)$ where $i \in I$ and A is a (usually not arbitrary) subset of $D_i$. In a complete system, descriptor $(i, A)$ is used to denote the set of objects for which the value of attribute i is in A. We treat descriptors as nondecomposable elements without any internal structure. But in any practical implementation of the language we need a method to constructively represent subsets of attribute domains. To this end we usually use, for some attributes $i \in I$, certain auxiliary languages for describing subsets of the corresponding attribute domains $D_i$. Clearly, if $D_i$ is infinite, e.g. is an interval of reals, then it is not possible to constructively represent all subsets of $D_i$ (since $\mathcal{P}(D_i)$ is then nondenumerable).
But we shall always assume that the class of describable subsets of $D_i$ forms a Boolean algebra, i.e. it is closed under union, intersection, and complementation. This class will be denoted by $\mathcal{B}_i$. Unless stated otherwise, X, $(D_i)_{i \in I}$ and $(\mathcal{B}_i)_{i \in I}$ will always be fixed throughout our considerations.

Example A. Let i be “length” and let $D_i$ be the set of nonnegative reals. To represent the term $(i, A) + (i, B)$, where $A = \{x : x \le 100\}$, $B = \{x : x \le 50 \vee x > 200\}$, we might, in some implementation of the language, use the expression

(LENGTH ≤ 100) + (LENGTH IN [0, 50] + (200, ∞)).

We may assume that $\mathcal{B}_i$ is the Boolean algebra of subsets of $D_i$ generated by intervals with rational endpoints.

Example B. Let i be “position” and let $D_i$ be the set of all pairs of reals representing latitude and longitude, $D_i = \{(x, y) : -90 \le x \le 90 \wedge -180 \le y \le 180\}$. To represent $(i, A)$ where $A = \{p \in D_i : \mathrm{dist}(p_0, p) \le 100\}$, $p_0 = (52.11, 21.00)$, we may use the expression

(DIST(52.11, 21.00) ≤ 100)

($\mathrm{dist}(p_0, p)$ denotes the distance between positions $p_0$ and p; it can be computed from $p_0$, p by using some spherical trigonometric formulas).

Notice that in order to define a subset of an attribute domain we used an additional structure on this domain, an order structure in Example A, and a metric structure in Example B. As our study is intended to have a foundational character, we shall never be interested in the exact form of the language for specifying the subsets of attribute domains: it is irrelevant to the basic problems connected with the incompleteness of information. We shall often use the “incompletely specified” language with descriptors written in the form $(i, A)$. We shall informally write green instead of (color, {green}), (LENGTH ≤ 30) instead of (length, $\{x \in R^+ : x \le 30\}$), male or (SEX = M) instead of (sex, {male}), etc.

Let us note that our model, in the case of complete information, corresponds to the relational model [2] with only one relation, say R. (An object x corresponds to the tuple $(\rho_i(x))_{i \in I} \in R$.) A term t corresponds to the following query in SEQUEL 2 (see [1]):

SELECT * FROM R WHERE t.

Now we pass to the second level of the language. Expressions of this level, called formulas, are built up from atomic formulas of the form $t = s$ (t, s are arbitrary terms) and logical constants T, F (truth and falsity, respectively) by means of logical connectives $\neg$, $\vee$, $\wedge$, $\Rightarrow$. More exactly, the set $\mathcal{F}$ of formulas is defined to be the least set $\mathcal{F}'$ with the following two properties:

(i) T, F and all atomic formulas are in $\mathcal{F}'$.
(ii) $\neg\varphi$, $(\varphi \vee \psi)$, $(\varphi \wedge \psi)$, and $(\varphi \Rightarrow \psi)$ are in $\mathcal{F}'$ whenever $\varphi, \psi \in \mathcal{F}'$.

Formulas will usually be denoted by $\varphi$, $\psi$, ... (possibly with indices).
As an example, consider a system describing a set of employees, with attributes EMPNO (employee number), NAME, MGR (manager’s name), and SAL (salary). The query “Does each of the employees having Lipski as manager earn more than 20000?” is expressed by the following (atomic) formula:

$$(\text{MGR} = \text{LIPSKI}) \cdot -(\text{SAL} > 20000) = 0. \tag{13}$$

Now we shall define the semantics of our query language in a complete system. For any query Q we shall define its value, denoted by $\|Q\|$, which, intensionally,

is the response of the system to query Q. It is intuitively clear what the value of a query is, if we think of 0, 1, $-$, $+$, $\cdot$, $\to$ as corresponding to $\emptyset$, X, and the set-theoretical operations of complementation, union, intersection, and the operation “$(X \setminus A) \cup B$”, respectively, and if we understand F, T, $\neg$, $\vee$, $\wedge$, $\Rightarrow$ as the usual logical constants and connectives. More formally, we have the following definition.

Definition 3. The value of query Q in a complete system $\mathcal{S}$, denoted by $\|Q\|_{\mathcal{S}}$ (or $\|Q\|$ when $\mathcal{S}$ is understood), is defined inductively by

(i) $\|(i, A)\| = \bigcup_{a \in A} U(i, a)$,
(ii) $\|0\| = \emptyset$, $\|1\| = X$,
(iii) $\|-t\| = X \setminus \|t\|$,
(iv) $\|t + s\| = \|t\| \cup \|s\|$,
(v) $\|t \cdot s\| = \|t\| \cap \|s\|$,
(vi) $\|t \to s\| = (X \setminus \|t\|) \cup \|s\|$,
(vii) $\|t = s\| = T$ if $\|t\| = \|s\|$, and $\|t = s\| = F$ otherwise,
(viii) $\|T\| = T$, $\|F\| = F$,
(ix) $\|\neg\varphi\| = \neg\|\varphi\|$,
(x) $\|\varphi \vee \psi\| = \|\varphi\| \vee \|\psi\|$,
(xi) $\|\varphi \wedge \psi\| = \|\varphi\| \wedge \|\psi\|$,
(xii) $\|\varphi \Rightarrow \psi\| = \|\varphi\| \Rightarrow \|\psi\|$.

Note: T, F, $\neg$, $\vee$, $\wedge$, $\Rightarrow$ denote symbols of the query language on the left-hand sides, and the corresponding operations defined in the natural way on truth values T, F on the right-hand sides.

In practice it may also be useful to consider atomic formulas of the type $t \le s$, with $\|t \le s\| = T$ if and only if $\|t\| \subseteq \|s\|$. For instance, formula (13) could be rewritten as

(MGR = LIPSKI) ≤ (SAL > 20000).

However, since $t \le s$ can always be expressed by $t \cdot -s = 0$, we shall not explicitly introduce $\le$ into the language.

Two queries Q, P are called externally equivalent (in symbols $Q \overset{e}{=} P$) if $\|Q\|_{\mathcal{S}} = \|P\|_{\mathcal{S}}$ for every complete system $\mathcal{S}$ (here $(D_i)_{i \in I}$ is fixed but X is arbitrary). A query entering a complete system can be replaced by any other externally equivalent query. In practice, it is reasonable to transform the query into some externally equivalent query which minimizes the cost (e.g. time) of retrieval. We want the transformation to be carried out by purely syntactical means without accessing the file which stores the information about the objects. To this end we can use during the transformation process the following axioms (rules of transformation).

AXIOMS FOR QUERIES TO COMPLETE SYSTEMS. The set B of axioms consists of the following.

B1. Substitutions of terms into the axioms of Boolean algebra (e.g. $t \cdot s = s \cdot t$, $-(t + s) = -t \cdot -s$, $t \to s = -t + s$, etc.), and the axioms of equality (see e.g. Rasiowa and Sikorski [21]).

B2. The following axioms concerning descriptors:
(i) $(i, \emptyset) = 0$, $(i, D_i) = 1$,
(ii) $(i, A) + (i, B) = (i, A \cup B)$,
(iii) $(i, A) \cdot (i, B) = (i, A \cap B)$,
(iv) $-(i, A) = (i, D_i \setminus A)$,
for all $i \in I$, $A, B \in \mathcal{B}_i$.

B3. Substitutions of formulas into the propositional calculus axioms (e.g. $\varphi \vee \neg\varphi$, $\varphi \Rightarrow \neg\neg\varphi$, etc.).

It can easily be proved that the axiomatization given by B is complete, in the sense that two queries are externally equivalent if and only if the equality $t = s$ or the formula $(\varphi \Rightarrow \psi) \wedge (\psi \Rightarrow \varphi)$ (depending on whether our queries are terms t, s or formulas $\varphi$, $\psi$) can be derived from the axioms in B (see [14]).

We shall extensively exploit the technique of externally equivalent transformation of queries in subsequent sections.
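The term clauses of Definition 3 amount to a short recursive evaluator. The sketch below is one possible encoding, not the paper’s: terms are nested tuples, a descriptor carries a Python predicate standing in for the subset A, and the employee records are invented. It evaluates formula (13) and spot-checks two of the Boolean identities listed under B1 on this one table.

```python
def value(term, table):
    """Value of a term in a complete system, following Definition 3 (i)-(vi)."""
    X = set(table)
    op = term[0]
    if op == "desc":                       # descriptor (i, A); A given as a predicate
        _, attr, test = term
        return {x for x in X if test(table[x][attr])}
    if op == "0":
        return set()
    if op == "1":
        return X
    if op == "-":                          # complement
        return X - value(term[1], table)
    left, right = value(term[1], table), value(term[2], table)
    if op == "+":                          # union
        return left | right
    if op == ".":                          # intersection
        return left & right
    if op == "->":                         # (X \ A) union B
        return (X - left) | right
    raise ValueError(f"unknown operator: {op!r}")

# Invented data: employee number -> attribute values (a complete system).
employees = {
    1: {"MGR": "LIPSKI", "SAL": 25000},
    2: {"MGR": "LIPSKI", "SAL": 21000},
    3: {"MGR": "MAREK", "SAL": 15000},
}
mgr = ("desc", "MGR", lambda v: v == "LIPSKI")
sal = ("desc", "SAL", lambda v: v > 20000)

# Formula (13): the term on the left must have the same value as 0.
formula_13 = value((".", mgr, ("-", sal)), employees) == value(("0",), employees)
print(formula_13)
```

The same answer is obtained from the inclusion form, since `value(mgr, employees) <= value(sal, employees)` holds for this table. Checking an identity on a single table does not prove external equivalence, which quantifies over all complete systems; the axioms in B are what make such transformations safe without consulting the data.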

4. EXTERNAL VERSUS INTERNAL INTERPRETATION OF QUERIES

In this section we shall briefly discuss two approaches to interpreting queries to an incomplete system, namely the external interpretation and the internal interpretation, already mentioned in the Introduction. The distinction between these two approaches seems to be fundamental for understanding problems related to information incompleteness.

The basis for our considerations will be the view of an information system as an “incomplete model” of reality. We assume that the system refers to a fragment of reality consisting of a collection of real objects (say, a collection of employees of an enterprise, to use a favorite example of most authors in the domain of databases) and that some properties of objects are mapped, in an incomplete way, into the system. A query entering the system can be viewed in two different ways. One way is to think of the query as expressing an external property (of objects or of the collection as a whole, depending on whether it is a term or a formula), that is, a property of the real world modeled by the system. The adjective “external” reflects the fact that the interpretation of the query is external with respect to the system. The other way of interpreting a query consists of referring it to the system’s information about the external world, rather than to the world itself. In such a case we say that the query expresses an internal property. The two interpretations described above can be depicted as follows:

[Figure: diagram relating a Query to its internal interpretation and its external interpretation.]

Let the query be, for instance, the term consisting of a single descriptor (AGE < 30). Its external interpretation is the set of persons who in reality are of age less than 30, whereas the internal interpretation consists of persons known (in the system) to be under 30. Alternatively, we may have defined the internal interpretation to consist of persons possibly under 30. What is important is that for the external property “age < 30” we have its internal substitute “known to be of age < 30” (or “possibly of age < 30”) which approximates it.

ACM Transactions on Database Systems, Vol. 4, No. 3, September 1979.

We see that under the external interpretation a query refers directly to the real world. Since the system’s information about an external property is, in general, incomplete, the external interpretation of a query is usually not accessible for the system, and consequently for the user. The information contained in the system enables us only to give some bounds on the external interpretation. These bounds will be studied in the next section. Contrary to this, our information about an internal property is, trivially, complete (since we know exactly what we know). The internal interpretation of a term is nothing else but the set of objects for which the information contained in the system satisfies the conditions expressed by this term. Very loosely speaking, the internal interpretation of a term is equal to the external interpretation of this term in an “artificial reality” where any external property expressed by a descriptor $(i, A)$ is replaced by the property “known in the system to have the value of attribute $i$ in $A$.” We assume that under both interpretations the symbols $0$, $1$, $-$, $+$, $\cdot$, $\to$ and $F$, $T$, $\neg$, $\lor$, $\land$, $\Rightarrow$ have their usual set-theoretical and logical interpretations, respectively. However, in the case of internal interpretation we shall introduce an additional unary operation and an additional unary connective, which will considerably increase the expressive power of the language. We defer a more detailed explanation of the internal interpretation to Section 6. Here we give only one more example. Consider the term (AGE < 30) + (AGE $\geq$ 30). It is clear that the external interpretation of this term is always $X$, the set of all objects. (In fact, using the axioms in B, we have (AGE < 30) + (AGE $\geq$ 30) $=_e$ (AGE < 30) + $-$(AGE < 30) $=_e$ 1, where $=_e$ denotes external equivalence.) But the internal interpretation of this term consists of persons who are known to be under 30 or known not to be under 30. Of course, this set is, in general, a proper subset of $X$; for instance, it does not contain a person whose age is known only to be between 25 and 35.
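The contrast in this last example can be made concrete with a minimal sketch. It assumes a hypothetical three-person system in which each person is mapped to the set of ages the system considers possible; the names `possible_age` and `known_to_satisfy` are illustrative and do not come from the paper.

```python
# Hypothetical toy data: each person maps to the set of ages the system
# considers possible (a one-element set means the age is known exactly).
possible_age = {
    "ann": {22},                # known to be 22
    "bob": set(range(25, 36)),  # known only to be between 25 and 35
    "eve": {41},                # known to be 41
}

def known_to_satisfy(pred):
    """Internal value of a descriptor: objects whose every possible age satisfies pred."""
    return {x for x, ages in possible_age.items() if all(pred(a) for a in ages)}

under_30 = known_to_satisfy(lambda a: a < 30)
at_least_30 = known_to_satisfy(lambda a: a >= 30)

# Internal interpretation of (AGE < 30) + (AGE >= 30): union of the two internal values.
internal_sum = under_30 | at_least_30

# The external interpretation of the same term is always the whole set X.
everyone = set(possible_age)
```

Here `everyone - internal_sum` contains exactly the person whose age is only known to lie between 25 and 35, which is the gap between the two interpretations described above.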

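5. EXTERNAL INTERPRETATION: BOUNDS DERIVED FROM A SYSTEM

In Section 2 we gave the following interpretation of a system $\mathcal{S}$: the objects of $\mathcal{S}$ are in reality described by a completion of $\mathcal{S}$, though we do not know by which one. Having this in mind, we may conclude that $x$ has in reality property $t$ only if $x$ has property $t$ in every completion of $\mathcal{S}$. This leads in a natural way to the following definition.

Definition 4. (i) The lower value of term $t$ in system $\mathcal{S}$:

$$\|t\|_{*\mathcal{S}} = \{x \in X : \text{for every completion } \mathcal{S}' \text{ of } \mathcal{S},\ x \in \|t\|_{\mathcal{S}'}\} = \bigcap_{\substack{\mathcal{S}' \geq \mathcal{S} \\ \mathcal{S}' \text{ complete}}} \|t\|_{\mathcal{S}'}.$$

(ii) The upper value of term $t$ in system $\mathcal{S}$:

$$\|t\|^{*}_{\mathcal{S}} = \{x \in X : \text{for some completion } \mathcal{S}' \text{ of } \mathcal{S},\ x \in \|t\|_{\mathcal{S}'}\} = \bigcup_{\substack{\mathcal{S}' \geq \mathcal{S} \\ \mathcal{S}' \text{ complete}}} \|t\|_{\mathcal{S}'}.$$

(iii) The lower value of formula $\varphi$ in system $\mathcal{S}$:

$$\|\varphi\|_{*\mathcal{S}} = \inf\{\|\varphi\|_{\mathcal{S}'} : \mathcal{S}' \text{ is a completion of } \mathcal{S}\} = \begin{cases} T & \text{if for every completion } \mathcal{S}' \text{ of } \mathcal{S},\ \|\varphi\|_{\mathcal{S}'} = T \\ F & \text{otherwise.} \end{cases}$$

(iv) The upper value of formula $\varphi$ in system $\mathcal{S}$:

$$\|\varphi\|^{*}_{\mathcal{S}} = \sup\{\|\varphi\|_{\mathcal{S}'} : \mathcal{S}' \text{ is a completion of } \mathcal{S}\} = \begin{cases} T & \text{if for some completion } \mathcal{S}' \text{ of } \mathcal{S},\ \|\varphi\|_{\mathcal{S}'} = T \\ F & \text{otherwise.} \end{cases}$$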

(The terms inf and sup refer to the ordering $F < T$; the subscript $\mathcal{S}$ will usually be omitted.) The above definitions are well founded since the value of a query in a complete system has already been defined in Section 3 (see Definition 3). According to our definitions, we have the following inclusions for the external interpretation $E$ of a term $t$:

$$\|t\|_{*\mathcal{S}} \subseteq E \subseteq \|t\|^{*}_{\mathcal{S}}.$$

An object $x$ is in $\|t\|_{*\mathcal{S}}$ if and only if from the information available in $\mathcal{S}$ we can conclude that $x$ has (external) property $t$, or, putting it still another way, $x$ could not fail to have property $t$, no matter what the objects turned out to be like. Similarly, $x \in \|t\|^{*}_{\mathcal{S}}$ whenever $x$ may turn out to have property $t$, i.e. whenever the information available in $\mathcal{S}$ does not rule out the possibility for $x$ to have property $t$. The interpretation of $\|\varphi\|_{*}$ and $\|\varphi\|^{*}$ is analogous. The reader can easily verify the following simple properties of $\|\cdot\|_{*}$ and $\|\cdot\|^{*}$.

THEOREM 5.
(a) If $\mathcal{S}_1 \leq \mathcal{S}_2$ then
$\|t\|_{*\mathcal{S}_1} \subseteq \|t\|_{*\mathcal{S}_2} \subseteq \|t\|^{*}_{\mathcal{S}_2} \subseteq \|t\|^{*}_{\mathcal{S}_1}$,
$\|\varphi\|_{*\mathcal{S}_1} \leq \|\varphi\|_{*\mathcal{S}_2} \leq \|\varphi\|^{*}_{\mathcal{S}_2} \leq \|\varphi\|^{*}_{\mathcal{S}_1}$.
(b) If $\mathcal{S}$ is complete then for every query $Q$, $\|Q\|_{*\mathcal{S}} = \|Q\|^{*}_{\mathcal{S}} = \|Q\|_{\mathcal{S}}$.
(c) If $Q =_e P$ then for every system $\mathcal{S}$, $\|Q\|_{*\mathcal{S}} = \|P\|_{*\mathcal{S}}$ and $\|Q\|^{*}_{\mathcal{S}} = \|P\|^{*}_{\mathcal{S}}$.
(d) Let $Y$, $Z$ be arbitrary sets such that $\|t\|_{*\mathcal{S}} \subseteq Y \subseteq Z \subseteq \|t\|^{*}_{\mathcal{S}}$. Then there exists an extension $\mathcal{S}' \geq \mathcal{S}$ such that $\|t\|_{*\mathcal{S}'} = Y$, $\|t\|^{*}_{\mathcal{S}'} = Z$. In particular, for any set $S$ satisfying

$$\|t\|_{*\mathcal{S}} \subseteq S \subseteq \|t\|^{*}_{\mathcal{S}} \qquad (14)$$

there is a completion $\mathcal{S}'$ with $\|t\|_{\mathcal{S}'} = S$.
(e) If $t \cdot s =_e 0$ then for every system $\mathcal{S}$,

$$\|t\|_{*\mathcal{S}} \cap \|s\|^{*}_{\mathcal{S}} = \|t\|^{*}_{\mathcal{S}} \cap \|s\|_{*\mathcal{S}} = \emptyset.$$

Let us notice that (a) can be viewed as a generalization of (3) and (8); similarly, (e) is a generalization of (4). The last part of (d) means that, in a sense, the bounds $\|t\|_{*\mathcal{S}}$ and $\|t\|^{*}_{\mathcal{S}}$ are all we can say about the external interpretation of $t$: it may turn out to be any set $E$ satisfying (14). While the intuitive meaning of Definition 4 should be clear, it does not explicitly provide any reasonably effective method to compute $\|t\|_{*}$, $\|t\|^{*}$, $\|\varphi\|_{*}$, and $\|\varphi\|^{*}$. In other words, we know what the bounds $\|\cdot\|_{*}$ and $\|\cdot\|^{*}$ are, but we do not know how to compute them. Now we shall study some properties of $\|\cdot\|_{*}$ and $\|\cdot\|^{*}$ which will lead to such effective algorithms.

THEOREM 6. In any system $\mathcal{S} = (X, (D_i)_{i \in I}, U)$ we have
(a) $\|0\|_{*} = \|0\|^{*} = \emptyset$
(b) $\|1\|_{*} = \|1\|^{*} = X$
(c) $\|(i, A)\|_{*} = X \setminus \bigcup_{a \in D_i \setminus A} U(i, a)$
(d) $\|(i, A)\|^{*} = \bigcup_{a \in A} U(i, a)$
(e) $\|-t\|_{*} = X \setminus \|t\|^{*}$
(f) $\|-t\|^{*} = X \setminus \|t\|_{*}$
(g) $\|t + s\|_{*} \supseteq \|t\|_{*} \cup \|s\|_{*}$
(h) $\|t + s\|^{*} = \|t\|^{*} \cup \|s\|^{*}$
(i) $\|t \cdot s\|_{*} = \|t\|_{*} \cap \|s\|_{*}$
(j) $\|t \cdot s\|^{*} \subseteq \|t\|^{*} \cap \|s\|^{*}$
(k) $\|t \to s\|_{*} \supseteq (X \setminus \|t\|^{*}) \cup \|s\|_{*}$
(l) $\|t \to s\|^{*} = (X \setminus \|t\|_{*}) \cup \|s\|^{*}$
(m) $\|F\|_{*} = \|F\|^{*} = F$
(n) $\|T\|_{*} = \|T\|^{*} = T$
(o) $\|t = s\|_{*} = T$ if $\|t \cdot -s + s \cdot -t\|^{*} = \emptyset$, and $F$ otherwise
(p) [...]
(q) $\|\neg\varphi\|_{*} = \neg\|\varphi\|^{*}$
(r) $\|\neg\varphi\|^{*} = \neg\|\varphi\|_{*}$
(s) $\|\varphi \lor \psi\|_{*} \geq \|\varphi\|_{*} \lor \|\psi\|_{*}$
(t) $\|\varphi \lor \psi\|^{*} = \|\varphi\|^{*} \lor \|\psi\|^{*}$
(u) $\|\varphi \land \psi\|_{*} = \|\varphi\|_{*} \land \|\psi\|_{*}$
(v) $\|\varphi \land \psi\|^{*} \leq \|\varphi\|^{*} \land \|\psi\|^{*}$
(w) $\|\varphi \Rightarrow \psi\|_{*} \geq (\neg\|\varphi\|^{*}) \lor \|\psi\|_{*}$
(x) $\|\varphi \Rightarrow \psi\|^{*} = (\neg\|\varphi\|_{*}) \lor \|\psi\|^{*}$.
For the proof see the Appendix.

Neither the inclusions in (g), (j), and (k) nor the inequalities in (s), (v), and (w) can, in general, be replaced by equalities. The interpretation of this fact is very simple: $\|t + s\|_{*}$ may contain objects known to have (external) property $t$ or (external) property $s$, though it is not known which one. For example, $\|(\text{SEX} = \text{M}) + (\text{SEX} = \text{F})\|_{*} = \|1\|_{*} = X$, while $\|(\text{SEX} = \text{M})\|_{*} \cup \|(\text{SEX} = \text{F})\|_{*}$ contains only the objects for which the value of SEX is known. $\|t\|^{*} \cap \|s\|^{*}$ contains, in general, objects which may turn out to have property $t$ and may turn out to have property $s$, but cannot have both properties at a time. For example, $\|(\text{SEX} = \text{M})\|^{*} \cap \|(\text{SEX} = \text{F})\|^{*}$ is exactly the set of objects for which the value of SEX is unknown, while of course $\|(\text{SEX} = \text{M}) \cdot (\text{SEX} = \text{F})\|^{*} = \|0\|^{*} = \emptyset$. $\|t \to s\|_{*}$ may contain objects for which we know that if they have property $t$ then they have property $s$, but which are neither known to have property $s$ nor known not to have property $t$. For example, $\|(\text{SEX} = \text{M}) \to (\text{SEX} = \text{M})\|_{*} = \|1\|_{*} = X$, but $(X \setminus \|(\text{SEX} = \text{M})\|^{*}) \cup \|(\text{SEX} = \text{M})\|_{*} = \|(\text{SEX} = \text{F})\|_{*} \cup \|(\text{SEX} = \text{M})\|_{*}$. The interpretation of the inequalities in (s), (v), and (w) (from Theorem 6) is similar.

It will be important in our method of determining $\|t\|_{*}$ and $\|t\|^{*}$ that in some special cases, specified by the next lemma, the equalities in (g), (j), and (k) (from Theorem 6) do hold. An attribute $i$ will be said to be represented in term $t$ if $t$ contains an occurrence of a descriptor $(i, A)$, $A \in \mathcal{B}_i$.

LEMMA 7. Let $t$, $s$ be two terms and suppose that no attribute is represented in both $t$ and $s$. Then, in any system,
(a) $\|t + s\|_{*} = \|t\|_{*} \cup \|s\|_{*}$
(b) $\|t \cdot s\|^{*} = \|t\|^{*} \cap \|s\|^{*}$
(c) $\|t \to s\|_{*} = (X \setminus \|t\|^{*}) \cup \|s\|_{*}$.
For the proof see the Appendix. By this lemma, we have, for instance,

$$\|(\text{SEX} = \text{M}) + (\text{AGE} \geq 30)\|_{*} = \|(\text{SEX} = \text{M})\|_{*} \cup \|(\text{AGE} \geq 30)\|_{*},$$

$$\|(\text{PUBL} = \text{NORTH-HOLLAND}) \cdot (\text{YEAR IN } \{1973, \ldots, 1978\}) \cdot (\text{LANG} \neq \text{ENGLISH})\|^{*} = \|(\text{PUBL} = \text{NORTH-HOLLAND})\|^{*} \cap \|(\text{YEAR IN } \{1973, \ldots, 1978\})\|^{*} \cap \|(\text{LANG} \neq \text{ENGLISH})\|^{*}.$$
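For small finite systems, Definition 4 can be evaluated literally by enumerating all completions. The following is a brute-force sketch for illustration only, not the method developed in this paper; the tuple encoding of terms and the function names `completions`, `value`, `lower`, and `upper` are assumptions made here.

```python
from itertools import product

def completions(system):
    """Yield every completion: one concrete value per (object, attribute) pair."""
    keys = [(x, i) for x in system for i in system[x]]
    for choice in product(*(sorted(system[x][i]) for x, i in keys)):
        comp = {x: {} for x in system}
        for (x, i), v in zip(keys, choice):
            comp[x][i] = v
        yield comp

def value(term, comp):
    """Value of a term in a complete system (ordinary set semantics)."""
    op = term[0]
    if op == "desc":                      # descriptor (i, A)
        _, i, A = term
        return {x for x in comp if comp[x][i] in A}
    if op == "not":
        return set(comp) - value(term[1], comp)
    if op == "or":
        return value(term[1], comp) | value(term[2], comp)
    if op == "and":
        return value(term[1], comp) & value(term[2], comp)
    raise ValueError(op)

def lower(term, system):
    """Lower value: objects in the value of the term in every completion."""
    return set.intersection(*(value(term, c) for c in completions(system)))

def upper(term, system):
    """Upper value: objects in the value of the term in some completion."""
    return set.union(*(value(term, c) for c in completions(system)))

# Hypothetical system: SEX of x2 is unknown, AGE of x2 is 20 or 40.
system = {
    "x1": {"SEX": {"M"}, "AGE": {25}},
    "x2": {"SEX": {"M", "F"}, "AGE": {20, 40}},
}
male = ("desc", "SEX", {"M"})
female = ("desc", "SEX", {"F"})
age40 = ("desc", "AGE", {40})
```

With this data, `lower(("or", male, female), system)` is the whole set of objects while `lower(male, system) | lower(female, system)` is only `{"x1"}`, so the inclusion in Theorem 6(g) is strict; for `male` and `age40`, which share no attribute, the two sides coincide as Lemma 7(a) states. The enumeration is exponential in the number of unknown values, which is why an effective method is needed.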

Definition 8. (i) A term is primitive if it is of the form

$$\prod_{j \in J} (i_j, A_j)$$

where $i_p \neq i_q$ for $p \neq q$, and $\emptyset \neq A_j \neq D_{i_j}$ for all $j \in J$.
(ii) A term is in additive normal form (ANF) if it is of the form

$$\sum_{k \in K} t_k$$

where all $t_k$'s are primitive.
(iii) A term is coprimitive if it is of the form

$$\sum_{j \in J} (i_j, A_j)$$

where $i_p \neq i_q$ for $p \neq q$, and $\emptyset \neq A_j \neq D_{i_j}$ for all $j \in J$.
(iv) A term is in multiplicative normal form (MNF) if it is of the form

$$\prod_{k \in K} t_k$$

where all $t_k$'s are coprimitive.

We shall always interpret vacuous ($J = \emptyset$) sums and products as 0 and 1, respectively. In particular, 1 is a primitive term (but not coprimitive), 0 is a coprimitive term (but not primitive), and both 0 and 1 are in ANF and MNF.

THEOREM 9. (a) For every term $t$ there is a term $s$ in ANF such that $s =_e t$. (b) For every term $t$ there is a term $s$ in MNF such that $s =_e t$.

PROOF. We shall show a constructive method of transforming $t$ into ANF. Although the method is quite standard, we shall describe it in some detail, since it is crucial for computing $\|t\|^{*}$.
Step 1. Eliminate “$\to$” by using axiom (iv) of group B2.
Step 2. Transform the resulting term, by exploiting De Morgan’s Laws

$$-(t + s) = -t \cdot -s, \qquad -(t \cdot s) = -t + -s,$$

and $--t = t$, $-0 = 1$, $-1 = 0$, to the form in which every occurrence of “$-$” precedes a descriptor.
Step 3. Eliminate “$-$” by using axiom (iv) of group B2.
Step 4. Transform the resulting term into a sum of products of descriptors, 0’s and 1’s, by using the Distributive Laws $t(s + r) = ts + tr$, $(t + s)r = tr + sr$.
Step 5. Transform each of the products to a primitive term or 0, by using axioms (iii) and (i) of group B2.
Step 6. Delete zero summands or reduce the whole term to a 0.
For transforming into MNF we can use a similar algorithm. Alternatively, we can transform $-t$ into ANF, say $\sum_j \prod_k (i_{jk}, A_{jk})$. Then

$$t =_e --t =_e -\sum_j \prod_k (i_{jk}, A_{jk}) =_e \prod_j \sum_k (i_{jk}, D_{i_{jk}} \setminus A_{jk}). \qquad \square$$
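The steps of the proof can be sketched in code for finite attribute domains. This is a simplified illustration and not the PASCAL implementation reported later in this section: terms are encoded as nested tuples, negation is pushed inward on the fly rather than in a separate pass, the constants 0 and 1 and the operation “$\to$” are omitted, and no attempt is made to merge or minimize summands. The names `descriptor` and `anf` are hypothetical.

```python
def descriptor(i, A, domains):
    """Normalise one descriptor using (i, empty set) = 0 and (i, D_i) = 1."""
    A = frozenset(A)
    if not A:
        return []        # the term 0: a sum with no summands
    if A == frozenset(domains[i]):
        return [{}]      # the term 1: a single vacuous product
    return [{i: A}]

def anf(term, domains, neg=False):
    """Return an ANF as a list of primitive terms (dict: attribute -> value set).

    `neg` records that the subterm sits under an odd number of negations.
    """
    op = term[0]
    if op == "desc":
        _, i, A = term
        if neg:                                   # -(i, A) = (i, D_i \ A)
            A = frozenset(domains[i]) - frozenset(A)
        return descriptor(i, A, domains)
    if op == "not":                               # double negation cancels
        return anf(term[1], domains, not neg)
    if op not in ("or", "and"):
        raise ValueError(op)
    left = anf(term[1], domains, neg)
    right = anf(term[2], domains, neg)
    if (op == "or") != neg:                       # a sum, or a negated product (De Morgan)
        return left + right
    out = []                                      # a product, or a negated sum: distribute
    for p in left:
        for q in right:
            merged = dict(p)
            for i, A in q.items():                # (i, A) . (i, B) = (i, A n B)
                merged[i] = merged.get(i, frozenset(domains[i])) & A
            if all(merged.values()):              # delete zero summands
                out.append(merged)
    return out

# Hypothetical finite domains and the term -((SEX = M) . (AGE = 20)) . (AGE IN {20, 30}).
domains = {"SEX": {"M", "F"}, "AGE": {20, 30, 40}}
t = ("and",
     ("not", ("and", ("desc", "SEX", {"M"}), ("desc", "AGE", {20}))),
     ("desc", "AGE", {20, 30}))
result = anf(t, domains)
```

For this term the sketch yields two summands, corresponding to (SEX = F) · (AGE IN {20, 30}) + (AGE = 30). Distribution can produce exponentially many summands, in line with the remark later in this section that minimizing their number is difficult.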

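As an example, consider the following term $t$:

$$-\big((\text{SAL} > 20000) \cdot -(\text{SEX} = \text{F}) + (\text{AGE IN } [20, 30)) \cdot (\text{SEX} = \text{F}) \cdot -(\text{SAL IN } (25000, 30000])\big) \cdot (\text{AGE} \geq 25). \qquad (15)$$

The process of transforming $t$ into ANF is shown as follows:

$$\begin{aligned}
t &=_e \big(-(\text{SAL} > 20000) + --(\text{SEX} = \text{F})\big) \cdot \big(-(\text{AGE IN } [20, 30)) + -(\text{SEX} = \text{F}) + --(\text{SAL IN } (25000, 30000])\big) \cdot (\text{AGE} \geq 25) \\
&=_e \big((\text{SAL} \leq 20000) + (\text{SEX} = \text{F})\big) \cdot \big((\text{AGE IN } [0, 20) + [30, \infty)) + (\text{SEX} = \text{M}) + (\text{SAL IN } (25000, 30000])\big) \cdot (\text{AGE} \geq 25) \qquad (16) \\
&=_e (\text{SAL} \leq 20000) \cdot (\text{AGE IN } [0, 20) + [30, \infty)) \cdot (\text{AGE} \geq 25) \\
&\quad + (\text{SAL} \leq 20000) \cdot (\text{SEX} = \text{M}) \cdot (\text{AGE} \geq 25) \\
&\quad + (\text{SAL} \leq 20000) \cdot (\text{SAL IN } (25000, 30000]) \cdot (\text{AGE} \geq 25) \\
&\quad + (\text{SEX} = \text{F}) \cdot (\text{AGE IN } [0, 20) + [30, \infty)) \cdot (\text{AGE} \geq 25) \\
&\quad + (\text{SEX} = \text{F}) \cdot (\text{SEX} = \text{M}) \cdot (\text{AGE} \geq 25) \\
&\quad + (\text{SEX} = \text{F}) \cdot (\text{SAL IN } (25000, 30000]) \cdot (\text{AGE} \geq 25) \\
&=_e (\text{AGE} \geq 30) \cdot (\text{SAL} \leq 20000) + (\text{AGE} \geq 25) \cdot (\text{SAL} \leq 20000) \cdot (\text{SEX} = \text{M}) \\
&\quad + (\text{AGE} \geq 30) \cdot (\text{SEX} = \text{F}) + (\text{AGE} \geq 25) \cdot (\text{SAL IN } (25000, 30000]) \cdot (\text{SEX} = \text{F}).
\end{aligned}$$

Notice that in passing we incidentally transformed $t$ into MNF (16). The following theorem provides a basis for the promised method of evaluating $\|t\|_{*}$ and $\|t\|^{*}$.

THEOREM 10. Let $\mathcal{S} = (X, (D_i)_{i \in I}, U)$ be a system.
(a) If $t$ is a term in ANF,

$$t = \sum_{j=1}^{m} \prod_{k=1}^{m_j} (i_{jk}, A_{jk}),$$

then

$$\|t\|^{*}_{\mathcal{S}} = \bigcup_{j=1}^{m} \bigcap_{k=1}^{m_j} \bigcup_{a \in A_{jk}} U(i_{jk}, a). \qquad (17)$$

(b) If $t$ is a term in MNF,

$$t = \prod_{j=1}^{m} \sum_{k=1}^{m_j} (i_{jk}, A_{jk}),$$

then

$$\|t\|_{*\mathcal{S}} = \bigcap_{j=1}^{m} \bigcup_{k=1}^{m_j} \bigcap_{a \in D_{i_{jk}} \setminus A_{jk}} \big(X \setminus U(i_{jk}, a)\big). \qquad (18)$$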

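PROOF. (a) By (a repeated application of) Theorem 6(h),

$$\|t\|^{*}_{\mathcal{S}} = \bigcup_{j=1}^{m} \Big\| \prod_{k=1}^{m_j} (i_{jk}, A_{jk}) \Big\|^{*}_{\mathcal{S}}.$$

Using Lemma 7(b) and the fact that for any $j$ the attributes $i_{j1}, \ldots, i_{jm_j}$ are distinct, we get

$$\Big\| \prod_{k=1}^{m_j} (i_{jk}, A_{jk}) \Big\|^{*}_{\mathcal{S}} = \bigcap_{k=1}^{m_j} \|(i_{jk}, A_{jk})\|^{*}_{\mathcal{S}}.$$

Finally, by Theorem 6(d),

$$\|(i_{jk}, A_{jk})\|^{*}_{\mathcal{S}} = \bigcup_{a \in A_{jk}} U(i_{jk}, a).$$

Part (b) of Theorem 10 can be proved similarly, by using Theorem 6(i), Lemma 7(a), and Theorem 6(c). $\square$

Our method of evaluating $\|t\|^{*}$ and $\|t\|_{*}$ is now evident: First, using the algorithm described in the proof of Theorem 9, we transform $t$ into ANF (MNF) and then we exploit Theorem 10(a) (10(b)). It is useful to derive equivalent expressions for $\|t\|_{*}$ and $\|t\|^{*}$ in terms of $(\beta_i)_{i \in I}$ rather than $U$. To this end let us note that by eq. (6) and Theorem 6(d) we have

$$\begin{aligned}
\|(i, A)\|^{*} &= \bigcup_{a \in A} U(i, a) = \bigcup_{a \in A} \{x \in X : a \in \beta_i(x)\} \\
&= \{x \in X : (\exists a \in A)\ a \in \beta_i(x)\} \qquad (19) \\
&= \{x \in X : A \cap \beta_i(x) \neq \emptyset\},
\end{aligned}$$

$$\begin{aligned}
\|(i, A)\|_{*} &= X \setminus \|-(i, A)\|^{*} = X \setminus \|(i, D_i \setminus A)\|^{*} \\
&= \{x \in X : (D_i \setminus A) \cap \beta_i(x) = \emptyset\} \qquad (20) \\
&= \{x \in X : \beta_i(x) \subseteq A\}.
\end{aligned}$$

Hence, (17) and (18) can be rewritten as

$$\|t\|^{*}_{\mathcal{S}} = \bigcup_{j=1}^{m} \bigcap_{k=1}^{m_j} \{x \in X : \beta_{i_{jk}}(x) \cap A_{jk} \neq \emptyset\}, \qquad (21)$$

$$\|t\|_{*\mathcal{S}} = \bigcap_{j=1}^{m} \bigcup_{k=1}^{m_j} \{x \in X : \beta_{i_{jk}}(x) \subseteq A_{jk}\}. \qquad (22)$$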
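Formulas (21) and (22) reduce to simple set tests on the sets $\beta_i(x)$. The sketch below assumes that a term has already been brought into ANF or MNF and is given as a list of dictionaries mapping attributes to value sets; the function names, the second object `q`, and the three-element STATUS domain are hypothetical, chosen to resemble a tuple with a null value.

```python
def upper_from_anf(anf, beta):
    """(21): x qualifies if, in some summand, every descriptor set overlaps beta[x][i]."""
    return {x for x, b in beta.items()
            if any(all(b[i] & A for i, A in summand.items()) for summand in anf)}

def lower_from_mnf(mnf, beta):
    """(22): x qualifies if, in every factor, some descriptor set contains beta[x][i]."""
    return {x for x, b in beta.items()
            if all(any(b[i] <= A for i, A in factor.items()) for factor in mnf)}

# Assumed finite domain for STATUS; the full domain stands for a null value.
D_STATUS = {10, 20, 30}
beta = {
    "p": {"NAME": {"JONES"}, "STATUS": D_STATUS, "CITY": {"PARIS"}},
    "q": {"NAME": {"SMITH"}, "STATUS": D_STATUS, "CITY": {"PARIS"}},
}

# (NAME = JONES) . (STATUS = 10) + (STATUS != 10) . (CITY = PARIS), already in ANF.
anf_t = [
    {"NAME": {"JONES"}, "STATUS": {10}},
    {"STATUS": D_STATUS - {10}, "CITY": {"PARIS"}},
]
# An MNF of the same term, with the factor (STATUS = 10) + (STATUS != 10) = 1 dropped.
mnf_t = [
    {"NAME": {"JONES"}, "STATUS": D_STATUS - {10}},
    {"NAME": {"JONES"}, "CITY": {"PARIS"}},
    {"STATUS": {10}, "CITY": {"PARIS"}},
]

upper_value = upper_from_anf(anf_t, beta)
lower_value = lower_from_mnf(mnf_t, beta)
```

In this toy system `p` lands in the lower value although its STATUS is unknown, while `q` is only in the upper value. Once the normal forms are computed, each additional object costs only a few subset and intersection tests.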

These formulas are especially useful when our model is viewed as an extension of the relational model. We give below a convenient formulation of (21) and (22).

THEOREM 11. (a) $x \in \|t\|^{*}$ if and only if after transforming $t$ into ANF there is a summand $(i_1, A_1) \cdot \,\cdots\, \cdot (i_k, A_k)$ of this ANF such that $\beta_{i_j}(x) \cap A_j \neq \emptyset$ for $j = 1, \ldots, k$. (b) $x \in \|t\|_{*}$ if and only if after transforming $t$ into MNF, for every factor $((i_1, A_1) + \cdots + (i_k, A_k))$ of this MNF there is an attribute $i_j$ such that $\beta_{i_j}(x) \subseteq A_j$.

Now we shall show how to apply Theorem 11 to properly evaluate responses to queries in the relational model with null values (see Codd [3]). Let $t$ be the query (term) considered by Grant [4]:

$$(\text{NAME} = \text{JONES}) \cdot (\text{STATUS} = 10) + (\text{STATUS} \neq 10) \cdot (\text{CITY} = \text{PARIS}) \qquad (23)$$

which refers to a relation S with domains S#, NAME, STATUS, CITY. Again as in [4], let us consider the tuple $p$ = (25, JONES, @, PARIS), where @ indicates a null value for STATUS. This tuple corresponds, in our model, to an object $x$ with

$\beta_{\text{S\#}}(x) = \{25\}$
$\beta_{\text{NAME}}(x) = \{\text{JONES}\}$
$\beta_{\text{STATUS}}(x) = D_{\text{STATUS}}$ = the set of all possible values of STATUS
$\beta_{\text{CITY}}(x) = \{\text{PARIS}\}$.

Since (23) is already in ANF, we easily see, using Theorem 11(a), that $x \in \|t\|^{*}$, which means that $p$ is in the union of the “true set” and the “maybe set” of tuples corresponding to (23). Indeed, for the first summand of (23) we have

$\beta_{\text{NAME}}(x) \cap \{\text{JONES}\} = \{\text{JONES}\} \neq \emptyset$,
$\beta_{\text{STATUS}}(x) \cap \{10\} = \{10\} \neq \emptyset$.

In order to decide whether $p$ is in the “true set” of tuples corresponding to (23), which intensionally means $x \in \|t\|_{*}$, we transform $t$ into MNF (by using one of the Distributive Laws):

$$\begin{aligned}
t &=_e \big((\text{NAME} = \text{JONES}) + (\text{STATUS} \neq 10)\big) \cdot \big((\text{NAME} = \text{JONES}) + (\text{CITY} = \text{PARIS})\big) \\
&\quad \cdot \big((\text{STATUS} = 10) + (\text{STATUS} \neq 10)\big) \cdot \big((\text{STATUS} = 10) + (\text{CITY} = \text{PARIS})\big) \\
&=_e \big((\text{NAME} = \text{JONES}) + (\text{STATUS} \neq 10)\big) \cdot \big((\text{NAME} = \text{JONES}) + (\text{CITY} = \text{PARIS})\big) \cdot \big((\text{STATUS} = 10) + (\text{CITY} = \text{PARIS})\big). \qquad (24)
\end{aligned}$$

By using Theorem 11(b), we conclude that $p$ is in the “true set” of tuples corresponding to (23). Indeed,
$\beta_{\text{NAME}}(x) \subseteq \{\text{JONES}\}$ in the first factor of (24),
$\beta_{\text{NAME}}(x) \subseteq \{\text{JONES}\}$ in the second factor of (24),
$\beta_{\text{CITY}}(x) \subseteq \{\text{PARIS}\}$ in the third factor of (24).

As noted by Grant [4], if we apply to query (23) and tuple $p$ the approach of Codd [3], which is based on a three-valued logic, the result is incorrect, at least if we understand the “true set” of tuples determined by a query $t$ as the set of those tuples which always are in the response to $t$, no matter how we replaced the null values by arbitrary values in the corresponding attribute domains (in other words, if we understand it as $\|t\|_{*}$). Codd uses a “null substitution principle” which says: “A truth-valued expression has the value @ if and only if both of the following conditions hold: (1) Each occurrence of @ in the expression can be replaced by a nonnull value (possibly a distinct one for every occurrence) so as to yield the value T for the expression. (2) Each occurrence of @ in the expression can be replaced by a nonnull value (possibly a distinct one for every occurrence) so as to yield the value F for the expression.” This principle is consistent with the usual three-valued logic (see Kleene [8]). Its inadequacy for our purpose can be explained by the following simple example. Consider the expression P AND NOT P, and assume that P has value @. Then, according to the null substitution principle, the value of our expression is @, since (1) T AND NOT F = T, (2) F AND NOT F = F. In order to obtain value T we were bound to substitute different truth values for different occurrences of P. However, it would be very hard to give any intuitive explanation of substituting different truth values for different occurrences of the same variable in the same expression. It seems that a much more natural approach is to restrict the null substitution principle by the requirement that different occurrences of the same variable are always replaced by a value which is equal for all occurrences. Notice that it is a similar approach that led us to Definition 4 (one may think of a variable P as a two-valued attribute with domain {T, F}). However, adding this restriction causes the resulting three-valued logic no longer to be truth functional (i.e. the truth value of an expression is not determined by the truth values of its subexpressions). This fact is closely related to the lack of equalities in Theorem 6(g), (j), (k), (s), (v), and (w). Let us consider, as an example, a system describing a set of employees, where nothing is known about the value of AGE for any employee. In such a system we obviously have

$$\|(\text{AGE} < 30)\|_{*} = \|(\text{AGE} \geq 30)\|_{*} = \emptyset, \qquad \|(\text{AGE} < 30)\|^{*} = \|(\text{AGE} \geq 30)\|^{*} = X,$$

but

$$X = \|1\|_{*} = \|(\text{AGE} < 30) + (\text{AGE} \geq 30)\|_{*} \neq \|(\text{AGE} < 30) + (\text{AGE} < 30)\|_{*} = \|(\text{AGE} < 30)\|_{*} = \emptyset.$$
This indicates that the lower value of the term (AGE < 30) + (AGE $\geq$ 30) is not determined by the lower and upper values of the descriptors it contains: we cannot replace the occurrence of (AGE $\geq$ 30) by (AGE < 30), though the lower values of these two descriptors are equal, as are their upper values. In other words, the truth value of “$x$ is of age < 30 or $\geq$ 30” is not determined by the truth values of “$x$ is of age < 30” and “$x$ is of age $\geq$ 30” (the truth value of “$x$ is of age < 30” is T if $x \in \|(\text{AGE} < 30)\|_{*}$, F if $x \notin \|(\text{AGE} < 30)\|^{*}$, and @ if $x \in \|(\text{AGE} < 30)\|^{*} \setminus \|(\text{AGE} < 30)\|_{*}$). Putting it still another way, there cannot exist any inductive method for computing the lower value of an arbitrary term from the lower and upper values of all descriptors it contains, as in the case of complete systems (see Definition 3). It is similar with the upper value. What Theorem 10 says is that for terms of a special form (ANF for $\|\cdot\|^{*}$ and MNF for $\|\cdot\|_{*}$) such a method does exist, and is given by Theorem 6 with the inclusions in (g) and (j) replaced by equalities. The fact that the three-valued logic arising in connection with the null values is not truth functional was also noted in the paper of Grant [4]. He proposes a method of deciding whether $x \in \|t\|_{*}$ which essentially involves substituting, in all possible ways, a value in the corresponding attribute domain for any occurrence of a null value in the tuple corresponding to $x$. It is also shown how to simplify this method, especially when the descriptors corresponding to an attribute with a null value in $x$ determine a partition of the domain of this attribute.
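The P AND NOT P example discussed above can be checked mechanically. The sketch below contrasts substituting truth values independently for each occurrence of P with substituting one value consistently; the function name `expr` is an illustrative assumption.

```python
from itertools import product

def expr(first_p, second_p):
    """P AND NOT P, with the two occurrences of P passed as separate arguments."""
    return first_p and not second_p

# Independent substitution: each occurrence of the null may receive its own value.
independent = {expr(a, b) for a, b in product([True, False], repeat=2)}

# Consistent substitution: both occurrences of P receive the same value.
consistent = {expr(a, a) for a in [True, False]}
```

Under independent substitution both truth values are reachable, so the expression would be assigned @; under consistent substitution only F is reachable, which agrees with the lower-value and upper-value reading of Definition 4.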
Our method of evaluating $\|t\|_{*}$ based on Theorem 11(b) seems to be more efficient, especially if we take into account the fact that once a term is transformed into MNF, we can very easily check whether $x \in \|t\|_{*}$ for any number of $x$’s, while in Grant’s approach every new $x$ (tuple) involves additional computations. An interesting problem which arises in connection with transforming terms into ANF (MNF) is that of minimizing the number of summands (factors). This problem seems to be very difficult, as it contains the classical problem of the minimization of Boolean functions. Let us also note that for the very special case of two real-valued attributes, a closely related problem is that of representing a union of rectangles on the plane (with sides parallel to the axes) by a union of the minimal possible number of rectangles. Questions of this type are treated, with quite different motivation, in [la, 20]. An efficient algorithm for finding a decomposition of a union of rectangles into the minimal possible number of disjoint rectangles is given in [15]. The algorithm described in the proof of Theorem 9 has been implemented in PASCAL and tested on CDC 6000. Queries containing both discrete and real-valued attributes are processed by the program. In the latter case a subset of the attribute domain specified by a descriptor can be any disjoint union of intervals. The implementation is described in more detail in [10]. Below we give an example of an output of the program:

[Program output: the input term t, followed by its MULTIPLICATIVE NORMAL FORM and its ADDITIVE NORMAL FORM; the printed terms are illegible in the source. EXECUTION TIME: 0.058 s]


The transformation has been carried out under the assumption that $D_{\text{DEPT\#}} = \{1, 2, 3, 4, 5\}$, $D_{\text{HIREYEAR}} = \{70, 71, \ldots, 80\}$, $D_{\text{AGE}} = D_{\text{SAL}} = [0, \infty)$. Consider a system represented by the following table:

Object   AGE            DEPT#          HIREYEAR         SAL
x1       [60, 70]       {1, ..., 5}    {70, ..., 75}    (illegible)
x2       [5?, 56]       {2}            {72, ..., 76}    (0, 20000]
x3       {30}           {3}            {70, 71}         (0, ∞)
x4       (illegible)    {2, 3}         {70, ..., 74}    (illegible)
x5       {32}           {4}            {75}             (illegible)

By applying Theorem 11, we get

$$\|t\|_{*} = \{x_3, x_4\}, \qquad \|t\|^{*} = \{x_2, x_3, x_4\}.$$

Consider for example $x_2$. We have $x_2 \notin \|t\|_{*}$, since

$\beta_{\text{HIREYEAR}}(x_2) = \{72, \ldots, 76\} \not\subseteq \{70, 71, 72\}$,
$\beta_{\text{SAL}}(x_2) = (0, 20000] \not\subseteq [10000, \infty)$

in the second factor of the MNF of $t$. On the other hand

$\beta_{\text{DEPT\#}}(x_2) \cap \{2, 3\} = \{2\} \neq \emptyset$,
$\beta_{\text{SAL}}(x_2) \cap [15000, \infty) = [15000, 20000] \neq \emptyset$

in the second summand of the ANF of $t$, and consequently $x_2 \in \|t\|^{*}$. The following is another example of an output of the program:

[Program output: the input term, followed by its MULTIPLICATIVE NORMAL FORM and its ADDITIVE NORMAL FORM; the printed terms are illegible in the source. EXECUTION TIME: 0.094 s]


6. INTERNAL INTERPRETATION OF QUERIES

The way in which our query language can be used is as follows. The user formulates a query (term) $t$, and the system responds by listing first the objects in $\|t\|_{*}$ as those which surely have (external) property $t$, and then the objects in $\|t\|^{*} \setminus \|t\|_{*}$ as those for which the information available in the system is insufficient to decide whether or not they have property $t$. Similarly, upon receipt of a formula $\varphi$, the system’s response is “yes” when $\|\varphi\|_{*} = T$, “no” when $\|\varphi\|^{*} = F$, and “?” when $\|\varphi\|^{*} = \|\neg\varphi\|^{*} = T$. The external interpretation is especially appropriate for a “naive user” who may not be aware of the fact that the information is incomplete. Moreover, it seems to be very natural if we consider the fact that a user is primarily interested in external properties: an information system is for him merely a means for the cognition of the real world modeled by the system. On the other hand, the internal interpretation, which we describe in this section, will turn out to provide the user with a much greater expressive power. Under this interpretation it is possible to formulate more sophisticated queries, for instance of the type “list all employees of age surely less than 33 who possibly earn more than 15000.” Roughly speaking, we can apply, at a time, $\|\cdot\|_{*}$ to some parts of a query and $\|\cdot\|^{*}$ to other parts of the same query. Moreover, it will turn out that the bounds on the external interpretation are, in some natural sense, expressible within the internal interpretation. It should be emphasized that the internal interpretation is intended for a user who is aware of the fact that the information available in the system may be incomplete, and who may explicitly refer to this incompleteness in his queries. Now we shall formally define the internal interpretation of terms (the internal interpretation of formulas is deferred to [11]).
We shall consider a broader class of terms: we enrich our query language by introducing an additional symbol for a unary operation, surely. More exactly, the set $\mathcal{T}'$ of special terms is defined to be the least set with the following two properties: (i) 0, 1 and every descriptor is in $\mathcal{T}'$; (ii) $-t$, surely $t$, $(t + s)$, $(t \cdot s)$, $(t \to s)$ are in $\mathcal{T}'$ whenever $t, s \in \mathcal{T}'$. Now we shall formally define the internal interpretation of a special term.

Definition 12. Let $\mathcal{S} = (X, (D_i)_{i \in I}, U)$ be an arbitrary system and let $t$ be a special term. The value of $t$ in $\mathcal{S}$, denoted by $\|t\|_{\mathcal{S}}$ (or $\|t\|$ when $\mathcal{S}$ is understood), is defined inductively as follows:

(i) $\|(i, A)\| = \|(i, A)\|_{*}$ (see Theorem 6(c) and (20))
(ii) $\|0\| = \emptyset$, $\|1\| = X$
(iii) $\|-t\| = X \setminus \|t\|$
(iv) $\|t + s\| = \|t\| \cup \|s\|$
(v) $\|t \cdot s\| = \|t\| \cap \|s\|$
(vi) $\|t \to s\| = (X \setminus \|t\|) \cup \|s\|$
(vii) $\|\text{surely } t\|_{\mathcal{S}} = \{x \in X : \text{for every completion } \mathcal{S}' \geq \mathcal{S},\ x \in \|t\|_{\mathcal{S}'}\} = \bigcap_{\mathcal{S}' \geq \mathcal{S},\ \mathcal{S}' \text{ complete}} \|t\|_{\mathcal{S}'}$.


If we compare Definitions 3 and 12 then we see that for complete systems (and terms not containing surely) the values defined by these two definitions coincide; this justifies the use of the same symbol $\|\cdot\|$. Let us denote $-$ surely $- t$ by possibly $t$ for any special term $t$. Then it follows easily from Definitions 4 and 12(vii) that

$$\|\text{surely } t\| = \|t\|_{*}, \qquad (25)$$
$$\|\text{possibly } t\| = \|t\|^{*}, \qquad (26)$$

for any term $t$ not containing surely, i.e. the bounds on the external interpretation can be expressed within the internal interpretation. Two special terms $t$ and $s$ are said to be internally equivalent (in symbols $t =_i s$) if $\|t\|_{\mathcal{S}} = \|s\|_{\mathcal{S}}$ for every (not necessarily complete) system $\mathcal{S}$. We shall now sketch a method for determining the internal interpretation of an arbitrary special term. The method is based on internally equivalent transformation of special terms, by using the following axioms.

AXIOMS FOR SPECIAL TERMS UNDER INTERNAL INTERPRETATION. The set S of axioms consists of the following.
S1. Substitutions of special terms into the axioms of Boolean algebra and the axioms
(i) surely $(t \cdot s)$ = surely $t$ $\cdot$ surely $s$
(ii) surely surely $t$ = surely $t$
(iii) surely 0 = 0
(iv) surely 1 = 1.
S2. The following axioms concerning descriptors:
(v) $(i, \emptyset) = 0$
(vi) $(i, D_i) = 1$
(vii) $(i, A) \cdot (i, B) = (i, A \cap B)$
for all $i \in I$, $A, B \in \mathcal{B}_i$.
S3. The axiom (viii) surely [...]

[...] + surely ((NAME $\neq$ SMITH) $\cdot$ (DEPT# = 2))) $\cdot$ surely ((SAL < 15000) + (#CHILDREN > 2) $\cdot$ (SAL IN (10000, 20000)) $\cdot$ (STATUS = MARRIED)). (27)

Using axioms in S we obtain

$$\begin{aligned}
t &=_i -\,\text{surely}\, -\big((\text{DEPT\#} = 3) \cdot (\text{NAME} = \text{JONES}) + (\text{NAME} \neq \text{SMITH}) \cdot (\text{DEPT\#} = 2)\big) \\
&\quad \cdot \text{surely}\, \big(((\text{SAL} < 15000) + (\text{\#CHILDREN} > 2)) \cdot ((\text{SAL} < 15000) + (\text{SAL IN } (10000, 20000))) \cdot ((\text{SAL} < 15000) + (\text{STATUS} = \text{MARRIED}))\big) \\
&=_i -\big(((\text{DEPT\#} \neq 3) + (\text{NAME} \neq \text{JONES})) \cdot ((\text{NAME} = \text{SMITH}) + (\text{DEPT\#} \neq 2))\big) \\
&\quad \cdot ((\text{SAL} < 15000) + (\text{\#CHILDREN} > 2)) \cdot (\text{SAL} < 20000) \cdot ((\text{SAL} < 15000) + (\text{STATUS} = \text{MARRIED})) \\
&=_i \big(-(\text{DEPT\#} \neq 3) \cdot -(\text{NAME} \neq \text{JONES}) + -(\text{NAME} = \text{SMITH}) \cdot -(\text{DEPT\#} \neq 2)\big) \\
&\quad \cdot ((\text{SAL} < 15000) + (\text{\#CHILDREN} > 2)) \cdot (\text{SAL} < 20000) \cdot ((\text{SAL} < 15000) + (\text{STATUS} = \text{MARRIED})) \\
&=_i \big(-(\text{DEPT\#} \neq 3) + -(\text{NAME} = \text{SMITH})\big) \cdot -(\text{DEPT\# IN } \{1, 4, 5\}) \cdot -(\text{NAME} = \text{SMITH}) \cdot \big(-(\text{NAME} \neq \text{JONES}) + -(\text{DEPT\#} \neq 2)\big) \\
&\quad \cdot ((\text{SAL} < 15000) + (\text{\#CHILDREN} > 2)) \cdot (\text{SAL} < 20000) \cdot ((\text{SAL} < 15000) + (\text{STATUS} = \text{MARRIED})).
\end{aligned}$$

(In the transformation process we assumed that $D_{\text{DEPT\#}} = \{1, 2, 3, 4, 5\}$.) The term which we obtained does not contain surely, and is a product of sums of descriptors or negations of descriptors. However, unlike in the case of the MNF considered in Section 5 (see Definition 8(iv)), we are in general not able either to eliminate “$-$” or to guarantee that in each product every attribute appears at most once. This is so because the internal equivalences $-(i, A) =_i (i, D_i \setminus A)$ and $(i, A) + (i, B) =_i (i, A \cup B)$ do not hold, except for trivial cases. Indeed, in general

$$\|-(i, A)\| = X \setminus \|(i, A)\| = X \setminus \|(i, A)\|_{*} = \|-(i, A)\|^{*} = \|(i, D_i \setminus A)\|^{*} \neq \|(i, D_i \setminus A)\|_{*} = \|(i, D_i \setminus A)\|,$$

$$\|(i, A) + (i, B)\| = \|(i, A)\| \cup \|(i, B)\| = \|(i, A)\|_{*} \cup \|(i, B)\|_{*} \neq \|(i, A) + (i, B)\|_{*} = \|(i, A \cup B)\|_{*} = \|(i, A \cup B)\|.$$

Notice that we can guarantee that every attribute appears in a negated descriptor at most once in each product, since $-(i, A) + -(i, B) =_i -((i, A) \cdot (i, B)) =_i -(i, A \cap B)$.
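The internal value of Definition 12 can be sketched for small finite systems, with surely evaluated by brute-force enumeration of completions as in clause (vii). This is an illustrative sketch, not the transformation-based method of this section; the tuple encoding of special terms, the function names, and the example data are assumptions, and the operation “$\to$” and the constants are omitted.

```python
from itertools import product

def completions(system):
    """Yield every completion, itself a system whose value sets are singletons."""
    keys = [(x, i) for x in system for i in system[x]]
    for choice in product(*(sorted(system[x][i]) for x, i in keys)):
        comp = {x: {} for x in system}
        for (x, i), v in zip(keys, choice):
            comp[x][i] = {v}
        yield comp

def internal(term, system):
    """Internal value of a special term, following clauses (i)-(v) and (vii)."""
    op, X = term[0], set(system)
    if op == "desc":                      # (i): objects known to have attribute i in A
        _, i, A = term
        return {x for x in X if system[x][i] <= set(A)}
    if op == "not":
        return X - internal(term[1], system)
    if op == "or":
        return internal(term[1], system) | internal(term[2], system)
    if op == "and":
        return internal(term[1], system) & internal(term[2], system)
    if op == "surely":                    # (vii): intersect over all completions
        return set.intersection(*(internal(term[1], c) for c in completions(system)))
    raise ValueError(op)

# Hypothetical system with D_AGE = {20, 30, 40}: the age of b is 20 or 30.
system = {"a": {"AGE": {20}}, "b": {"AGE": {20, 30}}}
age20 = ("desc", "AGE", {20})
age30 = ("desc", "AGE", {30})
possibly_20 = ("not", ("surely", ("not", age20)))
```

On this data `internal(("not", age20), system)` is `{"b"}` whereas the descriptor for the complementary set {30, 40} has an empty internal value, and the plain sum `("or", age20, age30)` has value `{"a"}` while wrapping it in surely gives `{"a", "b"}`, the same as the single descriptor for {20, 30}. Both observations match the failed equivalences shown above.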

Now it should be clear that any special term $t$ can be transformed, in a similar way as we did it for term (27), into an internally equivalent product of sums, each sum being of the form

$$\sum_{p} \Big( -(i_p, A_p) + \sum_{q} (i_p, B_{pq}) \Big) \qquad (28)$$

[...]

THEOREM 13. For any special term $t$ and any system, $x \in \|t\|$ if and only if after transforming $t$ into WMNF, for every factor (28) of this WMNF there is an attribute $i_p$ such that $\beta_{i_p}(x) \not\subseteq A_p$ or, for some $q$, $\beta_{i_p}(x) \subseteq B_{pq}$.

As an example, consider again term (27), and suppose that a system contains the following two objects:

Object   NAME           SAL                      STATUS       #CHILDREN    DEPT#
x        (illegible)    [15000, (illegible)]     {MARRIED}    {3, 4, 5}    {2, 3, 4}
y        (illegible)    {18000}                  {MARRIED}    {2}          (illegible)

By using Theorem 13 we easily conclude that $x \in \|t\|$, $y \notin \|t\|$. Consider, for instance, object $y$ and the fifth factor of the WMNF of $t$. We have $\beta_{\text{SAL}}(y) = \{18000\} \not\subseteq [0, 15000)$, $\beta_{\text{\#CHILDREN}}(y) = \{2\} \not\subseteq \{3, 4, \ldots\}$, and hence $y \notin \|t\|$. An alternate way to determine $\|t\|$ is to transform $t$ into a dual weak additive normal form (WANF), and then to use a dual form of Theorem 13. (In WANF, we allow several occurrences of the same attribute in a product only in negated descriptors.) The algorithm of transforming any special term into WANF has been implemented in PASCAL and tested on CDC 6000 (see [10] for more detail). The WANF of (27) computed by the program (S and P denote surely and possibly, respectively) is shown on page 289. A more thorough treatment of the internal interpretation, including an explanation of its connection to a certain modal logic, is deferred to a subsequent paper [11].

7. FINAL REMARKS

In this section we indicate areas for further research and we briefly describe some topics which could not be treated in full detail in either this paper or in [11].

7.1 Representability of Knowledge

Consider an incomplete system and suppose that we have a new piece of information about the objects. This information is usually expressed in some language, e.g. it can be a formula $\varphi$ of our query language (with the following meaning: we learned that $\varphi$ is true in reality). To avoid formal difficulties with the language in which the information is expressed, let us think of a piece of information as a subset $\mathcal{J}$ of the set of all complete systems (recall that $X$ and $(D_i)_{i \in I}$ are fixed). Information $\mathcal{J}$ represents our knowledge of the fact that the complete system $\mathcal{S}^{*}$ describing the reality is in $\mathcal{J}$. For instance, if the information is expressed by a formula $\varphi$, then $\mathcal{J} = \{\mathcal{S} : \|\varphi\|_{\mathcal{S}} = T \land \mathcal{S} \text{ complete}\}$. In some cases we can include this new information into a system $\mathcal{S}$ by simply changing $\mathcal{S}$ to an extension $\mathcal{S}' \geq \mathcal{S}$. More exactly, information $\mathcal{J}$ is said to be representable in $\mathcal{S}$ if there is an extension $\mathcal{S}' \geq \mathcal{S}$ such that

$$\mathcal{C}(\mathcal{S}') = \mathcal{C}(\mathcal{S}) \cap \mathcal{J} \qquad (29)$$


($\mathcal{C}(\mathcal{S})$ denotes the set of all completions of $\mathcal{S}$). Obviously, if $\mathcal{I}$ is representable in $\mathcal{S}$, then it is consistent with $\mathcal{S}$, i.e. $\mathcal{I} \cap \mathcal{C}(\mathcal{S}) \neq \emptyset$. Let us call $\mathcal{I}$ absolutely representable if it is representable in every system with which it is consistent. It is easy to prove that $\mathcal{I}$ is absolutely representable if and only if $\mathcal{I} = \mathcal{C}(\mathcal{S})$ for some $\mathcal{S}$. A simple fact which is needed in the proof is that $\mathcal{I}_1 \cap \mathcal{I}_2$ is absolutely representable whenever $\mathcal{I}_1$, $\mathcal{I}_2$ are absolutely representable (if $\mathcal{I}_1 \cap \mathcal{I}_2 \neq \emptyset$ then we obtain $\mathcal{S}_3$ such that $\mathcal{C}(\mathcal{S}_3) = \mathcal{I}_1 \cap \mathcal{I}_2 = \mathcal{C}(\mathcal{S}_1) \cap \mathcal{C}(\mathcal{S}_2)$ by putting $\beta_i^3(x) = \beta_i^1(x) \cap \beta_i^2(x)$, where $(\beta_i^j)_{i \in I}$ is the classification corresponding to $\mathcal{S}_j$, $j = 1, 2, 3$). It is a rather trivial fact that the family $\{\mathcal{I} : \mathcal{I} \text{ absolutely representable}\}$ partially ordered by inclusion is isomorphic to the set of all systems ordered by the relation of extension $\leq$. The isomorphism is established by the mapping $\mathcal{S} \mapsto \mathcal{C}(\mathcal{S})$.

It is easy to see that the information expressed by any formula $\sum_j (i_j, A_j) = 0$ is absolutely representable. For example, the information $10000 \leq \mathrm{SAL} \leq 50000$, which is a typical example of an integrity constraint, is absolutely representable (it corresponds to (SAL IN $[0, 10000) + (50000, \infty)$) = 0). In order to obtain the system $\mathcal{S}'$ occurring in (29) it is sufficient to put $\beta'_{\mathrm{SAL}}(x) = \beta_{\mathrm{SAL}}(x) \cap [10000, 50000]$. It is not difficult to prove that the information $\mathrm{TAX} \leq \mathrm{SAL}$ (with the obvious meaning) is not absolutely representable. Neither is the information “at least one of the employees makes less than 10000.” One more example can be obtained if we recall the explanation of the relation of extension given in Section 2. We considered the process of weighing some objects x, y, z, and t in the situation when more and more weights are available. Notice that even with no weights we may determine that, for instance, y is heavier than x. However, this information is, in general, not representable (in some special cases it is, e.g. if the mass of x or y is known).
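As an illustration of the integrity-constraint case, the following is a minimal sketch, assuming a finite encoding in which each $\beta_{\mathrm{SAL}}(x)$ is a Python `frozenset` of candidate salaries. The function name `restrict` and the sample data are hypothetical and not part of the paper; the sketch only mirrors the rule $\beta'_{\mathrm{SAL}}(x) = \beta_{\mathrm{SAL}}(x) \cap [10000, 50000]$.

```python
def restrict(beta, allowed):
    """Intersect every candidate set with the values accepted by `allowed`.

    `beta` maps each object to a frozenset of possible attribute values.
    Returns the refined mapping (an extension of the system), or None when
    some candidate set becomes empty, i.e. the information is inconsistent
    with the system.
    """
    refined = {}
    for obj, candidates in beta.items():
        kept = frozenset(v for v in candidates if allowed(v))
        if not kept:
            return None  # no completion satisfies the constraint
        refined[obj] = kept
    return refined


def in_range(value):
    # The integrity constraint 10000 <= SAL <= 50000 from the text.
    return 10000 <= value <= 50000


# Hypothetical incomplete salary data.
beta_sal = {
    "x": frozenset({5000, 12000, 60000}),
    "y": frozenset({20000}),
    "z": frozenset({8000, 30000}),
}

refined = restrict(beta_sal, in_range)
print(refined)
```

The refinement works object by object, which is exactly why this kind of information is representable: the admissible completions still form a product of per-object candidate sets.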
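The case of nonrepresentable information $\mathcal{I}$ gives rise to the problem of determining the following modified lower value and upper value:

$\|t\|_{*\mathcal{S},\mathcal{I}} = \{x \in X : \text{for every } \mathcal{S}' \in \mathcal{C}(\mathcal{S}) \cap \mathcal{I},\ x \in \|t\|_{\mathcal{S}'}\}$,

$\|t\|^{*}_{\mathcal{S},\mathcal{I}} = \{x \in X : \text{for some } \mathcal{S}' \in \mathcal{C}(\mathcal{S}) \cap \mathcal{I},\ x \in \|t\|_{\mathcal{S}'}\}$.

This is a difficult problem which heavily depends on the form of $\mathcal{I}$.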
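For very small finite systems the two modified values can be computed directly from their definitions by enumerating completions. The sketch below is a brute-force illustration only, not an algorithm from the paper. It assumes a system is a dict mapping objects to dicts of candidate-value sets, a term is a predicate on one object's completed record, and the information $\mathcal{I}$ is a predicate on a whole completed system. All names (`completions`, `modified_bounds`, the MASS data) are hypothetical.

```python
from itertools import product


def completions(system):
    """Yield every completion: one concrete value per (object, attribute)."""
    slots = [(obj, attr) for obj, rec in system.items() for attr in rec]
    choices = [sorted(system[obj][attr]) for obj, attr in slots]
    for combo in product(*choices):
        complete = {obj: {} for obj in system}
        for (obj, attr), value in zip(slots, combo):
            complete[obj][attr] = value
        yield complete


def modified_bounds(system, term, info):
    """Return (lower, upper) over the completions that satisfy `info`."""
    admissible = [c for c in completions(system) if info(c)]
    if not admissible:
        raise ValueError("information is inconsistent with the system")
    lower = {x for x in system if all(term(c[x]) for c in admissible)}
    upper = {x for x in system if any(term(c[x]) for c in admissible)}
    return lower, upper


# Hypothetical weighing example: masses are only partially known.
system = {
    "x": {"MASS": frozenset({1, 2, 3})},
    "y": {"MASS": frozenset({2, 3})},
}


def light(record):
    # The term (MASS IN {1, 2}).
    return record["MASS"] <= 2


def y_heavier_than_x(complete):
    # Nonrepresentable information: it relates two objects to each other.
    return complete["y"]["MASS"] > complete["x"]["MASS"]


plain = modified_bounds(system, light, lambda complete: True)
informed = modified_bounds(system, light, y_heavier_than_x)
print(plain, informed)
```

With no extra information neither object is surely light; knowing that y is heavier than x moves x into the lower value. The enumeration is exponential in the number of unknown values, so it is only useful for checking small examples.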

7.2 Extensions of the Query Language

One can extend our simple query language by allowing queries of the type “how many objects with property t are there in our collection?” (numerical term, denoted by #t), or “are there more objects with property t than with property s?” (numerical formula, denoted by #s < #t). The theory of such queries leads to some nontrivial combinatorial problems, much more complicated than the problem of distinct representatives encountered during the evaluation of $\|\varphi\|_*$. This extension is treated, in the case of complete information, by Lipski and Marek [17].

Another extension is to allow “binary descriptors” of the form $(i, R, j)$, $i, j \in I$, $R \subseteq D_i \times D_j$, interpreted in any complete system $\mathcal{S}$ as the set of objects whose values of attributes $i$ and $j$ are related by $R$. Examples of such descriptors are (LENGTH < HEIGHT), (FIRSTNAME = FATHERSFIRSTNAME), (EXPENSES > SAL), etc. Most of the theory presented in this paper can be extended to the language involving binary descriptors (see Konikowska [9]).

A natural internal property which the user may wish to ask for is “the value of attribute i is known.” We may introduce into the query language the “descriptor” known(i), for every $i \in I$, with $\|\mathrm{known}(i)\| = \{x \in X : |\beta_i(x)| = 1\}$. Notice that if $D_i$ is finite (and $\mathcal{B}_i = \mathcal{P}(D_i)$) then known(i) is expressible by the formula $\sum_{a \in D_i} (i, \{a\})$, for instance

known(SEX) = (SEX = F) + (SEX = M).

The operator known, which behaves as a kind of a (nonclassical) quantifier, will be treated in more detail elsewhere.
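The identity for known(SEX) can be checked on a toy system. The sketch below assumes, as the definition of known(i) suggests, that under the internal interpretation a descriptor $(i, A)$ selects the objects with $\beta_i(x) \subseteq A$. The helper names `known` and `surely` and the sample objects are hypothetical.

```python
def known(system, attr):
    """Objects whose candidate set for `attr` has exactly one element."""
    return {obj for obj, rec in system.items() if len(rec[attr]) == 1}


def surely(system, attr, values):
    """Assumed internal reading of (attr IN values): beta_attr(x) is a subset of values."""
    return {obj for obj, rec in system.items() if rec[attr] <= values}


# Hypothetical system with a finite domain D_SEX = {"F", "M"}.
system = {
    "a": {"SEX": frozenset({"F"})},
    "b": {"SEX": frozenset({"M"})},
    "c": {"SEX": frozenset({"F", "M"})},  # a null value: nothing is known
}

# known(SEX) = (SEX = F) + (SEX = M), with + read as set union.
union_of_singletons = (
    surely(system, "SEX", frozenset({"F"}))
    | surely(system, "SEX", frozenset({"M"}))
)
print(known(system, "SEX"), union_of_singletons)
```

The construction relies on the domain being finite: one singleton descriptor is needed per domain element.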

7.3 Privacy and Views

Some aspects of privacy, views, and security of databases can be expressed in the framework of our theory. This is not strange, if we notice the fact that a user of a shared database has only “incomplete information” about the content of the whole database. Let us consider the following example. Suppose that some user is not allowed to know the exact value of some attributes, e.g. SAL, but is allowed to have partial information concerning these values. For instance, he or she may be allowed to know whether $\mathrm{SAL} \leq 10000$, $10000 < \mathrm{SAL} \leq 20000$, or $\mathrm{SAL} > 20000$. This user may think of the database (more exactly, of his view of the database) as an incomplete system $\mathcal{S}$ where, for any $x \in X$, $\beta_{\mathrm{SAL}}(x)$ can be one of the intervals $[0, 10000]$, $(10000, 20000]$, $(20000, \infty)$. The partial information on the values of attributes should satisfy the following condition: If it is confidential whether x has property t, then $x \in \|t\|^{*}_{\mathcal{S}} \setminus \|t\|_{*\mathcal{S}}$. In other words, $x \in \|t\|_{*\mathcal{S}}$ or $x \in \|-t\|_{*\mathcal{S}}$ means that we let the user “know too much.” Similarly, we should have $\|\varphi\|_{*\mathcal{S}} = F$, $\|\varphi\|^{*}_{\mathcal{S}} = T$ whenever the truth value of $\varphi$ is confidential. We see that in this context the methods of evaluating $\|t\|_*$ and $\|\varphi\|_*$ can be treated as methods for checking whether confidential information can be deduced by an unauthorized user.

7.4 Relational Model of Data

Our theory, or at least a major part of it, can be rephrased within Codd’s relational model of data [2]. This possibility has already been mentioned in several places in this paper. We noticed that our model, especially when expressed in terms of the classification $(\beta_i)_{i \in I}$, can be treated as a relational model with only one relation. In the relational model our knowledge is represented by a set of tuples of the form $(a_1, \ldots, a_n)$ where $a_i$ is an element of $D_i$ (we identify here attributes with the integers $1, \ldots, n$). In our model we consider tuples $(A_1, \ldots, A_n)$ where $A_i$ is a nonempty subset of $D_i$ ($A_i = \beta_i(x)$).
Recall that a “null value” (see Codd [3]) for attribute i corresponds to $A_i = D_i$. Of course, usually we must restrict the class of admissible subsets $A_i$, since we need a standard representation for them. The theory of “incomplete relational models” which may be developed along these lines heavily depends on the choice of the classes of admissible subsets of attribute domains, especially when these classes do not form Boolean algebras. As an example, consider the attribute LENGTH with $D_{\mathrm{LENGTH}} = R_+$, and let the class of admissible subsets of $D_{\mathrm{LENGTH}}$ consist of nonempty intervals (of course it does not form a Boolean algebra). The following are examples of differences from the case of arbitrary systems where $\beta_{\mathrm{LENGTH}}(x)$ can be any subset of $R_+$:

(1) In our incomplete relational model we have equalities of the following type, which do not hold in arbitrary systems: $\|(\mathrm{LENGTH\ IN\ } (20, 30) + (40, 50))\|_* = \|(\mathrm{LENGTH\ IN\ } (20, 30))\|_* \cup \|(\mathrm{LENGTH\ IN\ } (40, 50))\|_*$. Indeed, since $\beta_{\mathrm{LENGTH}}(x)$ is an interval, it is contained in $(20, 30) \cup (40, 50)$ if and only if it is contained in $(20, 30)$ or in $(40, 50)$.

(2) A piece of information representable in an incomplete system need not be representable in our incomplete relational model. For instance, suppose that we know that the length of x is not in the interval $(50, 100)$. If we consider arbitrary systems then we can represent this fact by putting $\beta_{\mathrm{LENGTH}}(x) = [0, 50] \cup [100, \infty)$. However, if we are restricted to intervals as values of $\beta_{\mathrm{LENGTH}}$ then this information is not representable, since we are bound to put $\beta_{\mathrm{LENGTH}}(x) = [0, \infty)$ (i.e. a null value).

Notice that in cases where the above properties of the incomplete relational model are inconvenient, we may partition $D_{\mathrm{LENGTH}}$ into some number of disjoint intervals, and then treat LENGTH as a discrete-valued attribute, with any subset of the intervals as a possible value of $\beta_{\mathrm{LENGTH}}$.
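Both observations about interval-valued attributes can be tried out with a small sketch. It assumes, for simplicity, that every interval is closed and encoded as a `(lo, hi)` pair, with `math.inf` as an unbounded right end; the paper's examples use open intervals, which behave the same way here. The helper names and the sample lengths are hypothetical.

```python
import math


def contained(inner, outer):
    """True when the closed interval `inner` lies inside `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def lower_value(beta, parts):
    """Objects surely inside a union of pairwise separated intervals.

    Because each beta(x) is itself an interval, it fits inside the union
    exactly when it fits inside one of the separated parts.
    """
    return {x for x, iv in beta.items() if any(contained(iv, p) for p in parts)}


def hull(intervals):
    """Smallest single interval covering all the given intervals."""
    return (min(lo for lo, _ in intervals), max(hi for _, hi in intervals))


# Hypothetical LENGTH data: every candidate set is an interval.
beta_length = {"x": (22, 28), "y": (25, 45), "z": (41, 49)}

# Observation (1): the lower value of a sum splits into a union.
both = lower_value(beta_length, [(20, 30), (40, 50)])
split = lower_value(beta_length, [(20, 30)]) | lower_value(beta_length, [(40, 50)])

# Observation (2): "length not in (50, 100)" collapses to a null value.
forced = hull([(0, 50), (100, math.inf)])
print(both, split, forced)
```

The `hull` step shows where information is lost: the only interval that covers both admissible pieces is the whole domain.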
An important topic not treated in this paper is that of functional dependencies (by the way, notice that a functional dependency is one more example of nonrepresentable information). Recall that a functional dependency $i_1, \ldots, i_n \to j_1, \ldots, j_m$ represents the information that if two objects (tuples) agree (in reality) on attributes $i_1, \ldots, i_n$ then they agree on attributes $j_1, \ldots, j_m$. Suppose that in an incomplete relational model with $i_1, \ldots, i_n \to j_1, \ldots, j_m$ as one of the functional dependencies, we have $\beta_{i_k}(x) = \beta_{i_k}(y)$, $|\beta_{i_k}(x)| = 1$, for $1 \leq k \leq n$. By the interpretation of our functional dependency, it is then obvious that we may replace both $\beta_{j_l}(x)$ and $\beta_{j_l}(y)$ by $\beta_{j_l}(x) \cap \beta_{j_l}(y)$, for $1 \leq l \leq m$. In this way the sets $\beta_{j_l}(x)$ and $\beta_{j_l}(y)$ are, in general, decreased. We may repeat this process, using all functional dependencies, until no sets $\beta_i(x)$ can be decreased. The classical theory of decomposing a relation with respect to a set of functional dependencies can be extended to the case of incomplete information, under suitable restrictions concerning the set of attributes for which information is allowed to be incomplete. However, this will not be treated here.

8. CONCLUSIONS

We have presented a mathematical theory of incomplete information databases, which we call information systems. This theory is intended to provide a logical background for studying problems connected with information incompleteness.


In order to precisely define the semantics of queries in an information system we consider all possible completions of the present knowledge contained in the system. It should be noted that these completions may have nothing to do with the real process of increasing the system’s information: this real process involves time, and the properties of objects represented in the system may change (e.g. age or weight of a person), new objects may be added, old ones deleted, etc. The semantics of queries which we have defined refers to the objects as they are at the moment of submitting the query.

We have shown that usually in the case of incomplete information there are many equally natural interpretations of the same query. Two essentially different interpretations have been studied in this paper: the external interpretation and the internal interpretation. The external interpretation refers the queries directly to the real world modeled by the system, and so in general we are not able to determine this interpretation from the incomplete information contained in the system. However, we have described an effective method of computing the best possible lower bound and upper bound on this interpretation derivable from the system (the case of “yes-no” queries being deferred to a subsequent paper [11]). Under the internal interpretation we treat a query as referring to our knowledge about reality rather than reality itself. We have presented an effective algorithm for computing the internal interpretation of queries from a broad class which includes most queries that are likely to arise in practice.

It seems that for a casual user the most suitable interpretation is the external one; usually he is interested in the real world modeled by the system and not in the system itself. However, for a more sophisticated user the internal interpretation may be more useful, since it gives a much greater expressive power to the query language. For instance, a person responsible for collecting information for the system may prefer to use the internal interpretation to test whether there are any gaps in the information contained in the system; in this case he is directly interested in the system’s information and not in the real world modeled by the system.

An important role in our theory is played by suitably chosen axioms which serve as a basis for equivalent transformations of queries. Both external and internal interpretations are provided with the corresponding complete sets of axioms. These axioms constitute the basic tool for developing algorithms to determine the response to an arbitrary query, under both external and internal interpretations. Some of these algorithms are of considerable computational complexity: the number of steps may, in the worst case, grow exponentially with the size of a query. This does not seem to be very important from the practical point of view, since a user can understand the meaning of a query only if it is of reasonable size, and only such queries are likely to be submitted to the system.

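APPENDIX

PROOF OF THEOREM 6. The proof of parts (a) and (b) is obvious.

(d) By Definition 4(ii), $x \in \|(i, A)\|^{*}_{\mathcal{S}}$ if and only if there is a completion $\mathcal{S}' = (X, (D_i)_{i \in I}, U') \geq \mathcal{S}$ such that $x \in \|(i, A)\|_{\mathcal{S}'}$, i.e. $x \in U'(i, a)$ for some $a \in A$ (see Definition 3(i)). It is easy to see that such a completion exists if and only if $x \in \bigcup_{a \in A} U(i, a)$. Indeed, if $x \in U(i, a_0)$, $a_0 \in A$, then we may first define

$U''(i, a) = \begin{cases} U(i, a) & \text{if } a = a_0, \\ U(i, a) \setminus \{x\} & \text{otherwise,} \end{cases}$

and then take as $\mathcal{S}'$ any completion of $(X, (D_i)_{i \in I}, U'')$. Then of course $x \in U'(i, a_0)$. Conversely, if $x \in U'(i, a_0)$, $a_0 \in A$, for some completion $\mathcal{S}' = (X, (D_i)_{i \in I}, U') \geq \mathcal{S}$, then, by Definition 2, $x \in U'(i, a_0) \subseteq U(i, a_0) \subseteq \bigcup_{a \in A} U(i, a)$.

(e) Let $\mathcal{C}(\mathcal{S})$ be the set of all completions of $\mathcal{S}$. Then

$\|-t\|_{*\mathcal{S}} = \bigcap_{\mathcal{S}' \in \mathcal{C}(\mathcal{S})} \|-t\|_{\mathcal{S}'} = \bigcap_{\mathcal{S}' \in \mathcal{C}(\mathcal{S})} (X \setminus \|t\|_{\mathcal{S}'}) = X \setminus \bigcup_{\mathcal{S}' \in \mathcal{C}(\mathcal{S})} \|t\|_{\mathcal{S}'} = X \setminus \|t\|^{*}_{\mathcal{S}}$.

We made use of the fact that for any complete system $\mathcal{S}'$, $\|-t\|_{\mathcal{S}'} = X \setminus \|t\|_{\mathcal{S}'}$ (see Definition 3(iii)).

(f) By (e), $\|t\|_{*\mathcal{S}} = \|--t\|_{*\mathcal{S}} = X \setminus \|-t\|^{*}_{\mathcal{S}}$, that is, $\|-t\|^{*}_{\mathcal{S}} = X \setminus \|t\|_{*\mathcal{S}}$.

(c) By virtue of axiom (iv) of group B2 and (e), we have

$\|(i, A)\|_{*\mathcal{S}} = \|-(i, D_i \setminus A)\|_{*\mathcal{S}} = X \setminus \|(i, D_i \setminus A)\|^{*}_{\mathcal{S}} = X \setminus \bigcup_{a \notin A} U(i, a)$.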

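(g) $\|t + s\|_{*\mathcal{S}} = \bigcap_{\mathcal{S}' \in \mathcal{C}(\mathcal{S})} \|t + s\|_{\mathcal{S}'} = \bigcap_{\mathcal{S}' \in \mathcal{C}(\mathcal{S})} (\|t\|_{\mathcal{S}'} \cup \|s\|_{\mathcal{S}'}) \supseteq \bigcap_{\mathcal{S}' \in \mathcal{C}(\mathcal{S})} \|t\|_{\mathcal{S}'} \cup \bigcap_{\mathcal{S}' \in \mathcal{C}(\mathcal{S})} \|s\|_{\mathcal{S}'} = \|t\|_{*\mathcal{S}} \cup \|s\|_{*\mathcal{S}}$.

(h) $\|t + s\|^{*}_{\mathcal{S}} = \bigcup_{\mathcal{S}' \in \mathcal{C}(\mathcal{S})} \|t + s\|_{\mathcal{S}'} = \bigcup_{\mathcal{S}' \in \mathcal{C}(\mathcal{S})} (\|t\|_{\mathcal{S}'} \cup \|s\|_{\mathcal{S}'}) = \bigcup_{\mathcal{S}' \in \mathcal{C}(\mathcal{S})} \|t\|_{\mathcal{S}'} \cup \bigcup_{\mathcal{S}' \in \mathcal{C}(\mathcal{S})} \|s\|_{\mathcal{S}'} = \|t\|^{*}_{\mathcal{S}} \cup \|s\|^{*}_{\mathcal{S}}$.

(i) and (j) can be proved similarly. An alternative method is to derive them from (g), (h) by using De Morgan’s Laws:

$\|t \cdot s\|_{*\mathcal{S}} = \|-(-t + -s)\|_{*\mathcal{S}} = X \setminus \|-t + -s\|^{*}_{\mathcal{S}} = X \setminus (\|-t\|^{*}_{\mathcal{S}} \cup \|-s\|^{*}_{\mathcal{S}}) = (X \setminus \|-t\|^{*}_{\mathcal{S}}) \cap (X \setminus \|-s\|^{*}_{\mathcal{S}}) = \|--t\|_{*\mathcal{S}} \cap \|--s\|_{*\mathcal{S}} = \|t\|_{*\mathcal{S}} \cap \|s\|_{*\mathcal{S}}$.

$\|t \cdot s\|^{*}_{\mathcal{S}} = \|-(-t + -s)\|^{*}_{\mathcal{S}} = X \setminus \|-t + -s\|_{*\mathcal{S}} \subseteq X \setminus (\|-t\|_{*\mathcal{S}} \cup \|-s\|_{*\mathcal{S}}) = (X \setminus \|-t\|_{*\mathcal{S}}) \cap (X \setminus \|-s\|_{*\mathcal{S}}) = \|t\|^{*}_{\mathcal{S}} \cap \|s\|^{*}_{\mathcal{S}}$.

(k) $\|t \to s\|_{*\mathcal{S}} = \|-t + s\|_{*\mathcal{S}} \supseteq \|-t\|_{*\mathcal{S}} \cup \|s\|_{*\mathcal{S}} = (X \setminus \|t\|^{*}_{\mathcal{S}}) \cup \|s\|_{*\mathcal{S}}$.

(l) $\|t \to s\|^{*}_{\mathcal{S}} = \|-t + s\|^{*}_{\mathcal{S}} = \|-t\|^{*}_{\mathcal{S}} \cup \|s\|^{*}_{\mathcal{S}} = (X \setminus \|t\|_{*\mathcal{S}}) \cup \|s\|^{*}_{\mathcal{S}}$.

(m) and (n) are trivial.

(o) First notice that in any complete system $\mathcal{S}' \geq \mathcal{S}$

$\|t = s\|_{\mathcal{S}'} = T \Leftrightarrow \|t\|_{\mathcal{S}'} = \|s\|_{\mathcal{S}'} \Leftrightarrow (\|t\|_{\mathcal{S}'} \cap (X \setminus \|s\|_{\mathcal{S}'})) \cup (\|s\|_{\mathcal{S}'} \cap (X \setminus \|t\|_{\mathcal{S}'})) = \|t \cdot -s + s \cdot -t\|_{\mathcal{S}'} = \emptyset$.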


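Hence, $\|t = s\|_{*\mathcal{S}} = T$ $\Leftrightarrow$ for every completion $\mathcal{S}'$ of $\mathcal{S}$, $\|t = s\|_{\mathcal{S}'} = T$ $\Leftrightarrow$ for every completion $\mathcal{S}'$ of $\mathcal{S}$, $\|t \cdot -s + s \cdot -t\|_{\mathcal{S}'} = \emptyset$ $\Leftrightarrow$ $\|t \cdot -s + s \cdot -t\|^{*}_{\mathcal{S}} = \emptyset$.

(p) $\|t = s\|^{*}_{\mathcal{S}} = T$ $\Leftrightarrow$ for some completion $\mathcal{S}'$ of $\mathcal{S}$, $\|t = s\|_{\mathcal{S}'} = T$ $\Leftrightarrow$ for some completion $\mathcal{S}'$ of $\mathcal{S}$, $\|t \cdot -s + s \cdot -t\|_{\mathcal{S}'} = \emptyset$ $\Leftrightarrow$ $\|t \cdot -s + s \cdot -t\|_{*\mathcal{S}} = \emptyset$.

(q) through (x) can be proved in a similar way as (e) through (l). As an example we prove (q):

$\|\neg\varphi\|_{*\mathcal{S}} = \inf\{\|\neg\varphi\|_{\mathcal{S}'} : \mathcal{S}' \in \mathcal{C}(\mathcal{S})\} = \inf\{\neg\|\varphi\|_{\mathcal{S}'} : \mathcal{S}' \in \mathcal{C}(\mathcal{S})\} = \neg\sup\{\|\varphi\|_{\mathcal{S}'} : \mathcal{S}' \in \mathcal{C}(\mathcal{S})\} = \neg\|\varphi\|^{*}_{\mathcal{S}}$

(recall that inf and sup refer to the ordering F < T). $\square$

PROOF OF LEMMA 7. (b) Let $I_1$ and $I_2$ be the sets of attributes represented in t and s, respectively. By the assumptions of the lemma, $I_1 \cap I_2 = \emptyset$. In virtue of Theorem 6(j), it suffices to show that $\|t\|^{*} \cap \|s\|^{*} \subseteq \|t \cdot s\|^{*}$. Let $x \in \|t\|^{*} \cap \|s\|^{*}$. Then there are completions $\mathcal{S}_1 = (X, (D_i)_{i \in I}, U_1)$ and $\mathcal{S}_2 = (X, (D_i)_{i \in I}, U_2)$ of $\mathcal{S}$ such that $x \in \|t\|_{\mathcal{S}_1}$ and $x \in \|s\|_{\mathcal{S}_2}$. Define a completion $\mathcal{S}' = (X, (D_i)_{i \in I}, U')$ by

$U'(i, a) = \begin{cases} U_1(i, a) & \text{if } i \in I_1, \\ U_2(i, a) & \text{if } i \in I_2 \end{cases}$

(for $i \in I \setminus (I_1 \cup I_2)$ the values of $U'(i, a)$ are immaterial, we can put, e.g. $U'(i, a) = U_1(i, a)$). It is easy to see that $\|t\|_{\mathcal{S}'} = \|t\|_{\mathcal{S}_1}$ and $\|s\|_{\mathcal{S}'} = \|s\|_{\mathcal{S}_2}$. Indeed, from Definition 3 it immediately follows that the value of a term depends only on the sets $U(i, a)$ for attributes i represented in this term. Consequently,

$x \in \|t\|_{\mathcal{S}_1} \cap \|s\|_{\mathcal{S}_2} = \|t\|_{\mathcal{S}'} \cap \|s\|_{\mathcal{S}'} = \|t \cdot s\|_{\mathcal{S}'} \subseteq \|t \cdot s\|^{*}$,

which completes the proof of (b).

(a) By using De Morgan’s Law and Theorem 6(e), we get $\|t + s\|_{*} = \|-(-t \cdot -s)\|_{*} = X \setminus \|-t \cdot -s\|^{*}$. By (b) and Theorem 6(f),

$X \setminus \|-t \cdot -s\|^{*} = X \setminus (\|-t\|^{*} \cap \|-s\|^{*}) = X \setminus ((X \setminus \|t\|_{*}) \cap (X \setminus \|s\|_{*})) = X \setminus (X \setminus (\|t\|_{*} \cup \|s\|_{*})) = \|t\|_{*} \cup \|s\|_{*}$.

(c) follows easily from (a). $\square$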
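The difference between Theorem 6(j) and Lemma 7(b) can be observed on a small example. The sketch below is a brute-force check, not part of the proofs. It assumes finite candidate sets and uses the fact that whether an object belongs to the value of a term depends only on that object's own record, so completions are enumerated per object. The names `lower`, `upper`, and the AGE/SEX data are hypothetical.

```python
from itertools import product


def object_completions(record):
    """Yield every way of picking one value per attribute of one object."""
    attrs = sorted(record)
    for combo in product(*(sorted(record[a]) for a in attrs)):
        yield dict(zip(attrs, combo))


def lower(system, term):
    """Objects that satisfy `term` in every completion."""
    return {x for x, rec in system.items()
            if all(term(c) for c in object_completions(rec))}


def upper(system, term):
    """Objects that satisfy `term` in at least one completion."""
    return {x for x, rec in system.items()
            if any(term(c) for c in object_completions(rec))}


system = {
    "x": {"AGE": frozenset({30, 40}), "SEX": frozenset({"F", "M"})},
    "y": {"AGE": frozenset({30}), "SEX": frozenset({"M"})},
}


def age_30(r):
    return r["AGE"] == 30


def age_40(r):
    return r["AGE"] == 40


def female(r):
    return r["SEX"] == "F"


# Lemma 7(b): terms over disjoint attributes give equality.
disjoint_product = upper(system, lambda r: age_30(r) and female(r))
disjoint_meet = upper(system, age_30) & upper(system, female)

# Theorem 6(j): with a shared attribute only inclusion holds.
shared_product = upper(system, lambda r: age_30(r) and age_40(r))
shared_meet = upper(system, age_30) & upper(system, age_40)

# Theorem 6(e): the lower value of a complement is the complement of the upper value.
dual_left = lower(system, lambda r: not age_30(r))
dual_right = set(system) - upper(system, age_30)
print(disjoint_product, shared_product, shared_meet)
```

In the shared-attribute case x may be 30 and may be 40, but never both in the same completion, which is where the inclusion in (j) becomes strict.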

ACKNOWLEDGMENTS

The author is indebted to many individuals for their assistance at various stages of the development of this paper. His special thanks are due to W. Marek and Z. Pawlak who encouraged him to work on the problem of incomplete information. He has also benefited greatly from discussions with M. Jaegermann, J. Łoś, C. Rauszer, and K. Segerberg. The remarks and suggestions of Ron Fagin of the San Jose IBM Research Center were invaluable in preparing the final version of this paper.

REFERENCES

(Note. Reference [13] is not cited in the text.)

1. CHAMBERLIN, D.D., ASTRAHAN, M.M., ESWARAN, K.P., GRIFFITHS, P.P., LORIE, R.A., MEHL, J.W., REISNER, P., AND WADE, B.W. SEQUEL 2: A unified approach to data definition, manipulation, and control. IBM J. Res. Develop. 20 (1976), 569-575.
2. CODD, E.F. A relational model of data for large shared data banks. Comm. ACM 13, 6 (June 1970), 377-387.
3. CODD, E.F. Understanding relations (Installment #7). FDT Bull. of ACM SIGMOD 7, 3-4 (1975), 23-28.
4. GRANT, J. Null values in a relational data base. Inform. Process. Lett. 5 (1977), 156-157.
5. HÁJEK, P. Automatic listing of important observational statements III. Kybernetika 10 (1974), 95-124.
6. HÁJEK, P., BENDOVÁ, K., AND RENC, Z. The GUHA method and the three-valued logic. Kybernetika 7 (1971), 421-435.
7. JAEGERMANN, M. Information storage and retrieval systems with incomplete information. I. Fund. Inform. 2 (1978), 17-41. (A preliminary version available as CC PAS Rep. 214, Warsaw, Poland, 1975.)
8. KLEENE, S.C. Introduction to Metamathematics. North-Holland Pub. Co., Amsterdam, 1952.
9. KONIKOWSKA, B. Data bases with incomplete information: On queries involving binary descriptors. To appear.
10. KONIKOWSKA, B., LIPSKI, W., MICHALEWICZ, Z., AND SENDOVA, E. An implementation of a data base with incomplete information I. To appear.
11. LIPSKI, W. On data bases with incomplete information. To appear.
12. LIPSKI, W. Informational systems with incomplete information. Proc. 3rd Int. Symp. on Automata, Languages and Programming, Edinburgh, Scotland, 1976, pp. 120-130.
13. LIPSKI, W. On the logic of incomplete information. Proc. 6th Int. Symp. on Math. Foundations of Comptr. Sci., Tatranska Lomnica, Czechoslovakia, Sept. 1977, pp. 374-381.
14. LIPSKI, W. Informational systems: Semantic issues related to incomplete information, Part I. CC PAS Rep. 275, Warsaw, Poland, 1977.
15. LIPSKI, W., LODI, E., LUCCIO, F., MUGNAI, C., AND PAGLI, L. On two dimensional data organization II. Fund. Inform. To appear.
16. LIPSKI, W., AND MAREK, W. On information storage and retrieval systems. In Mathematical Foundations of Computer Science, A. Mazurkiewicz and Z. Pawlak, Eds., Banach Center Publications, Vol. 2, Polish Scientific Publishers, Warsaw, Poland, 1977, pp. 215-259.
17. LIPSKI, W., AND MAREK, W. Information systems: On queries involving cardinalities. Inform. Syst. To appear.
18. LODI, E., LUCCIO, F., MUGNAI, C., AND PAGLI, L. On two dimensional data organization I. Fund. Inform. To appear.
19. MAREK, W., AND PAWLAK, Z. Information storage and retrieval systems: Mathematical foundations. Theoret. Comptr. Sci. 1 (1976), 331-354.
20. MASEK, W. J. Some NP-complete set covering problems. Comptr. Sci. Lab., M.I.T., Cambridge, Mass., May 1978.
21. RASIOWA, H., AND SIKORSKI, R. The Mathematics of Metamathematics. Polish Scientific Publishers, Warsaw, Poland, 1963.

Received July 1977; revised August 1978

ACM Transactions on Database Systems, Vol. 4, No. 3, September 1979.