SUGI 24: How and When to Use WHERE
Total Page:16
File Type:pdf, Size:1020Kb
Beginning Tutorials How and When to Use WHERE J. Meimei Ma, Quintiles, Research Triangle Park, NC Sandra Schlotzhauer, Schlotzhauer Consulting, Chapel Hill, NC INTRODUCTION Large File Environment This tutorial explores the various types of WHERE In large file environments choosing an efficient as methods for creating or operating on subsets. programming strategy tends to be important. A large The discussion covers DATA steps, Procedures, and file can be defined as a file for which processing all basic use of WHERE clauses in the SQL procedure. records is a significant event. This may apply to files Topics range from basic syntax to efficiencies when that are short and wide, with relatively few accessing large indexed data sets. The intent is to observations but a large number of variables, or long start answering the questions: and narrow, with only a few variables for many observations. The exact size that qualifies as large • What is WHERE? depends on the computing environment. In a • How do you use WHERE? mainframe environment a file may need to contain • Why is WHERE sometimes more efficient? millions of records before being considered large. • When is WHERE appropriate? For a microcomputer, the threshold will always be lower even as processing power increases on the The primary audience for this presentation includes desktop. Batch processing is used more frequently programmers, statisticians, data managers, and for large file processing. Data warehouses typically other people who often need to process subsets. involve very large files. Only basic knowledge of the SAS system is assumed, in particular no experience beyond Base Types of WHERE SAS is expected. However, people with more experience may still discover new reasons for using Several types of “WHERE” exist in Release 6.08 or WHERE as well as potential pitfalls. The simple later of SAS software. The most common are: examples provided to show syntax and potential complications apply to all operating systems running • WHERE statement (DATA step or Procedure) Release 6.08 or later. • WHERE data set option • WHERE clause in PROC SQL TERMINOLOGY The original source of WHERE probably stems from Efficiency Elements SQL (Structured Query Language). Other areas that support WHERE commands or clauses include SAS/FSP, SAS/ASSIST, SAS/CONNECT. Issues There are two primary categories of elements to related to these products are not addressed in this consider when evaluating efficiency: machine and paper. human. Machine efficiency elements include computer processing time, often called CPU, and processing time for reading or writing computer data, THE WHERE STATEMENT called I/O for Input/Output. Efficiency for humans is measured by considering programmer time, level of Syntax expertise required, or clarity of final code. Major components of programmer time include planning The WHERE statement is used for selecting programming strategy, writing new code, testing observations from a SAS data set by specifying a programs, running production programs, and simple or complex conditional clause. In addition to revising existing code. standard comparison and logical operators such as EQ or AND, special operators are available. In Always consider both machine and human elements general, SAS functions can be included. The syntax when choosing between options for efficiency is identical whether the statement is used in a DATA reasons. The choice may be difficult, or at least step or a Procedure: ambiguous, since the more machine efficient option can require additional human effort or vice-versa. WHERE where-expression ; Beginning Tutorials For example, to create a subset of a permanent SAS as in the example above, can be used more than data set that includes all observations with a certain once in an expression. value of a categorical variable: /* Select body system starting with Gastro */ DATA subset; WHERE bodysys LIKE 'Gastro%' ; SET libref.indata; "Gastrointestinal" and "GastroIntestinal" would be WHERE catvar = value; selected but not "GASTROINTESTINAL" because LIKE is case-sensitive. /* other SAS statements */ /* Select values not equal to 65 */ OUTPUT; /* OPTIONAL */ WHERE spdlimit <> 65 ; RETURN; /* OPTIONAL */ WHERE spdlimit NE 65; RUN; /* OPTIONAL */ In this case,the two WHERE statements are If no DATA step is required, then you should simply equivalent. Version 6 documentation incorrectly process a subset directly, as in the following: states that the <> operator is interpreted as “maximum” when it is actually interpreted as “not PROC FREQ DATA=libref.indata; equals.” WHERE catvar = value; TABLES var1 * var2; /* Select age >= 18 AND age <= 34 */ RUN; WHERE age BETWEEN 18 AND 34 ; Special Operators The BETWEEN operator can be easier for a programmer to understand. In ACCESS view descriptors, using BETWEEN is more machine Several special operators are available for where- efficient than the traditional operators. expressions. The examples here provide a taste of the possibilities. Execution Logic of WHERE Statement /* Select drug names starting with A */ WHERE drugname CONTAINS 'A'; In a DATA step, the WHERE statement is not considered a standard executable statement. "Asprin" and "Advil" would be selected, but so would Observations are selected before data values are "VOLMAX." However, the WHERE statement would moved into the program data vector (PDV). The not select "Zithromax" because the CONTAINS selection occurs immediately after input data set operator is case-sensitive. options are applied and before other DATA step statements are executed. As a result, certain /* Select for missing values */ automatic variables cannot be used in where- WHERE treatmnt IS MISSING; expressions and any variables created in the DATA WHERE treatmnt IS NULL; step cannot be included. Other implications of WHERE processing logic become clearer when All missing values for the variable TREATMNT compared with Subsetting IF logic. would be selected. The advantage of using either IS MISSING or IS NULL is that you do not need to The WHERE statement can only be used in DATA know whether a variable is numeric or character. steps that use existing SAS data set(s) as input, i.e., a SET, MERGE, or UPDATE statement must exist. If /* Select states with capital C in name */ you are using an INPUT statement to read in “raw” WHERE state LIKE '%C%' ; files, then you cannot use WHERE. A single WHERE statement can apply to multiple data sets. The states "North Carolina", "South Carolina", Of course, the assumption is that all the data sets "Connecticut", "California" and "DC" would be found. include the variables referenced in the where- However, the WHERE statement would not select expression. In a complex DATA step with more than "Kentucky", "Wisconsin", and so on because the one SET, MERGE, or UPDATE statement, you LIKE operator is case-sensitive. The % sign should combine the selection criteria into one substitutes for any combination of characters, and WHERE statement. If more than one WHERE exists Beginning Tutorials then only the last statement is applied and no error THE WHERE DATA SET OPTION message appears. Syntax Using WHERE is incompatible with the OBS= data set option and the POINT= option of the SET The syntax of the WHERE data set option, called statement. These statements select observations by WHERE=, is a combination of standard data set observation number. Also, FIRSTOBS= may not option parentheses and a where-expression. The have a value other than one. For instance, you selection criteria only apply to the last data set if cannot use a testing strategy based on using more than one is listed in the same statement. The OBS=100 when your program includes WHERE. syntax is the same for either DATA steps or Procedures: If you use the MODIFY statement with BY processing, then dynamic WHERE processing datasetname(WHERE=(where-expression)); occurs. There may be an efficiency concern for large files that are not sorted and have not been indexed For example: as a result. DATA subset; Comparison to Subsetting IF MERGE libref.indata(WHERE=(age > 35)) libref.trtment(WHERE=(trt='T')) ; BY idvar; The “traditional” method of creating subsets in a /* SAS statements */ DATA step relies on the subsetting IF statement RUN; (SubIF). In many cases, you can produce the same subset using either WHERE or SubIF. However, PROC FREQ situations exist in which different results will be DATA=libref.indata(WHERE=(catvar=value)); obtained. The values of FIRST. and LAST. may be TABLES var1 * var2; different since WHERE selections are made before RUN; BY groups are defined. With a MERGE or UPDATE based on a BY statement, WHERE selects observations before the merge while SubIF selects Comparison to WHERE Statement after the merge (see Example 2). The WHERE data set option has the same incom- Advantages of WHERE Statement patibilities as the WHERE statement with the OBS= data set option and the POINT= option of the SET • Uses index if appropriate statement. The limitation that FIRSTOBS=1 must be • Efficient (before transfer to PDV) true also applies. • Special operators available • Can apply directly to Procedures If both a WHERE statement and WHERE= are used together in the same DATA step or Procedure, then Advantages of Subsetting IF the data set option takes precedence. In this situa- tion, the WHERE statement is ignored. • Executable conditionally (IF-THEN) • Any input file type In many situations, the differences in using a • Use any automatic variables WHERE statement or the WHERE= data set option • Compatible with OBS=, POINT=, FIRSTOBS= are minimal. However, known issues in some • Same in all versions/releases Version 6 releases are: Because of the overhead required for WHERE • With the CNTLIN option in PROC FORMAT, you processing, using SubIF can be more machine need to use WHERE=. efficient when a data set is relatively narrow (short • PROC COPY and PROC CPORT do not support record lengths). In any case, SubIF should appear the WHERE statement. as close to the beginning of a DATA step to avoid • In PROC GANNO, you should use WHERE= if unnecessary processing of records that will you want the program code to be portable ultimately be deleted.