Handling Missing Values in the SQL Procedure
Total Page:16
File Type:pdf, Size:1020Kb
Handling Missing Values in the SQL Procedure Danbo Yi, Abt Associates Inc., Cambridge, MA Lei Zhang, Domain Solutions Corp., Cambridge, MA non-missing numeric values. A missing ABSTRACT character value is expressed and treated as a string of blanks. Missing character values PROC SQL as a powerful database are always same no matter whether it is management tool provides many features expressed as one blank, or more than one available in the DATA steps and the blanks. Obviously, missing character values MEANS, TRANSPOSE, PRINT and SORT are not the smallest strings. procedures. If properly used, PROC SQL often results in concise solutions to data In SAS system, the way missing Date and manipulations and queries. PROC SQL DateTime values are expressed and treated follows most of the guidelines set by the is similar to missing numeric values. American National Standards Institute (ANSI) in its implementation of SQL. This paper will cover following topics. However, it is not fully compliant with the 1. Missing Values and Expression current ANSI Standard for SQL, especially · Logic Expression for the missing values. PROC SQL uses · Arithmetic Expression SAS System convention to express and · String Expression handle the missing values, which is 2. Missing Values and Predicates significantly different from many ANSI- · IS NULL /IS MISSING compatible SQL databases such as Oracle, · [NOT] LIKE Sybase. In this paper, we summarize the · ALL, ANY, and SOME ways the PROC SQL handles the missing · [NOT] EXISTS values in a variety of situations. Topics 3. Missing Values and JOINs include missing values in the logic, · Inner Join arithmetic and string expression, missing · Left/Right Join values in the SQL predicates such as LIKE, · Full Join ANY, ALL, JOINs, missing values in the 4. Missing Values and Aggregate aggregate functions, and missing value Functions conversion. · COUNT Functions · SUM, AVG and STD Functions INTRODUCTION · MIN and MAX Functions 5. Missing value conversion SQL procedure follows the SAS® System convention for handling missing values. The In order to see how differently the missing way it expresses missing numeric values data are handled in SQL procedure, we will and character values are totally different. A produce examples based on a very small missing numeric value is usually expressed data set that can be generated by following as a period (.), but it also can be stated as SAS codes. one of other 27 special missing value expressions based on the underscore (_) data ABC; and letters A, B,…,Z, that is, ._, .A, .B,…, input x1 1-2 x2 4-5 .Z. In SAS SQL procedure, a particular missing value is equal to itself, but those 28 y1 $7-9 y2 $11-12; missing values are not equal to each other. datalines; They actually have their own order, that is ._ -1 2 ABC 12 < . < .A< .B, …< .Z. When missing numeric 0 -3 DE 34 values are compared to non-missing 1 numeric value, the missing numeric values are always less than or smaller than all the 1 1 CDE 56 ._ .A 78 ARITHMETIC EXPRESSIONS . .Z ER When you use a missing numeric value in .A 2 ABC 90 an arithmetic expression (+,- ,*, /), the SQL procedure always set the result of the ; expression to period (.) missing value. If you run; use that result in another expression, the next result is also period (.) missing value. The small data set has four variables, X1 This method of treating missing values is and X2 are numeric variables, and Y1 and called propagation of missing values. For Y2 are character variables. example, 1. MISSING VALUES AND EXPRESSION Proc SQL; Since SAS system uses different ways to Select x1+1 as z1 from ABC express the missing numeric values and where x1<0; missing character values, the missing values are treated differently in the logic, arithmetic, Result: and string expression. LOGIC EXPRESSION Z1 When a missing numeric value appear in the ------ logic expression (not, and, or). The missing 0 numeric value is regarded as 0 or FALSE. In another word, In SAS system, 0 and missing values are regarded as FALSE, and . non missing numeric values except 0 are . regarded as TRUE. Consider this example, Notice that all special missing values are Proc sql; changed into period (.) missing value after Select not x1 as z1, x1 or . as z2, the calculation. x1 and . as z3 from ABC; STRING EXPRESSIONS When you use a missing character value in Result: a string expression concatenated by ||, the SQL procedure sets the missing character value to a string of blanks, the length of Z1 Z2 Z3 which is equal to the length of the variable. ---------------------- Here is an example. 0 -1 0 Proc SQL; 1 0 0 Select ‘AA’||y1||y2 from ABC; 0 1 0 Result: 0 1 0 1 _ 0 Z1 1 . 0 ------- 1 A 0 AAABC12 AADE 34 Notice that missing numeric values behave AA like FALSE or 0 in NOT and AND logical AACDE56 expression, but in the logical OR expression, the missing values was simply dropped AA 78 because of the nature of OR operation. AAER AAABC90 2. MISSING VALUES AND PREDICATES where y1 like ' '; In SQL procedure predicates test some conditions that have the truth-value TRUE, /* Two Blanks */ FALSE and NULL, and returns TRUE and select x1, y1 FALSE. It also has special predicates to handle missing values, that is IS NULL and from ABC IS MISSING. Although missing character where y1 like ' '; values are expressed as a string of blanks, but a string of blanks are always regarded /* Three Blanks */ as missing values. You will find this situation select x1, y1 in a LIKE predicate below. from ABC IS [NOT] NULL /IS [NOT] MISSING where y1 like ' '; IS [NOT] NULL and IS [NOT] MISSING predicates are two predicates especially Same result: designed to deal with missing values. IS [NOT] NULL is a SQL standard predicate X1 Y1 and IS [NOT] MISSING is SAS SQL ------ predicate. They are generic because they - can handle both numeric and character variables. For example, 1 ._ Proc SQL; Select x1, y1 from ABC However, the following SQL statement Where x1 is not null and y1 is not missing; returns nothing. Result: proc sql; /* Four Blanks */ X1 Y1 select x1, y1 -------- from ABC -1 ABC where y1 like ' '; 0 DE 1 CDE If you use [NOT] IN Predicate, missing character values are always equal to one or more blanks no matter what size of the LIKE variables. For example, following SQL As mentioned above, missing character statement returns the same results as those values are always the same but there is one of first three SQL statements with LIKE exception. In SAS SQL LIKE predicate, predicate even though the size of blanks are missing values are regarded as a string of more than 3. blanks, they are not missing values. If the length of a character column variable is 3, Proc SQL; then a missing character value is like one Select x1, y1 blank, two blanks, and three blanks, but not from ABC like four blanks or more than four blanks. where y1 in (‘ ‘); /* Consider following SQL statement, which Four Blanks */ returns the same results because the length of y1 is 3. For two character variables, say A, and B, with different size, if size(A) > size(B), then proc sql; predicate ‘A like B’ is TRUE, but predicate /* One Blank */ ‘B like A’ is FALSE when they both have select x1, y1 missing values. Note that predicate ‘A=B’ is from ABC always TRUE when both have missing Proc SQL; values. For example, select x2 from ABC Proc sql; where x2 > ALL select x1, y1, y2 (select x1 from ABC from ABC where y1 like y2; where x1 >3); select x2 will return the following results: from ABC where x2 < ALL X1 Y1 Y2 (select x1 from ABC ------------- where x1 >3); 1 select x2 from ABC However, where x2 NE ALL Proc sql; (select x1 from ABC select x1, y1, y2 where x1 >3); from ABC where y2 like y1; would produce the entire list of X2 values as following results. Returns nothing. X2 --- ALL, ANY, AND SOME 2 ALL, ANY and SOME are special predicates -3 that are oriented around subqueries. They . are used in conjunction with relational operator. Actually, there are only two 1 because ANY and SOME in SQL procedure A are the same. ALL, ANY, and SOME are Z very similar to the IN predicate when it is 2 used with subqueries; they take all the values produced by the subquery and treat Whereas those queries them as a unit. However, unlike IN, they can be used only with subqueries. Proc SQL; In SQL procedure, ALL and ANY differ from select x2 each other in how they react if the subquery from ABC procedures have no values to use in a where x2 > SOME comparison. These differences can give (select x1 from ABC your queries unexpected results if you do not account for them. One significant where x1 >3); difference between ALL and ANY is the way select x2 they deal with the situation in which the from ABC where x2 < SOME subquery returns no values. In SQL (select x1 from ABC procedure, whenever a legal subquery fails where x1 >3); to produce output, the comparison operator modified with ALL is automatically TRUE, select x2 and the comparison operator modified with from ABC ANY is automatically FALSE. This means where x2 NE SOME that the following queries (select x1 from ABC where x1 >3); would produce no output. Of course, neither of these comparisons is very meaningful. [NOT] EXISTS Result: In [NOT] EXISTS predicate, the way missing values are dealt with in SQL procedure is Row X1 X2 different from that in most SQL databases ------------------------ because missing values are regarded as 1 1 1 comparable values in SQL procedures.