NESUG 2012 Coders' Corner

Careful Use of Input Delimiters

Howard Schreier, Howles Informatics, Arlington VA

ABSTRACT

The DLM= option provides a lot of flexibility, but there are pitfalls which can cause unexpected results. This paper illustrates one of these, caused by reliance on default variable attributes, and recommends a practice which can prevent such difficulties.

INTRODUCTION

In the DATA step, the INFILE statement provides many features that govern the behavior of the INPUT statement as it reads external data. In particular, the DLM= option specifies the character or characters which terminate fields in the course of list-style (free-form) input.

The DLM= option is not often needed. The default delimiter is usually the space (blank) character, and in many cases that is appropriate and adequate.

Another common situation is comma-separated values (CSV). When processing CSV files, one usually wants consecutive delimiters to be treated as significant; the INFILE statement’s DSD option turns on this behavior. A side effect of DSD is that it makes the comma the default delimiter, so this is another circumstance in which explicitly coding the DLM= option is unnecessary.

When DLM= is needed, it is usually pretty simple. Typically there is just one character that establishes all of the field boundaries in all of the records. If, for example, that character is the pound sign (#), the INFILE statement would follow this pattern:

    infile . . . DLM = '#' . . . ;
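As a concrete illustration of the two simple cases just described, here is a minimal sketch. The variable names and data values are invented for the example and do not come from the paper. The first step names a single delimiter with DLM=; the second relies on DSD alone.

```sas
/* Hypothetical data: one delimiter character, named with DLM= */
data _null_ ;
   infile cards dlm = '#' ;
   input item $ qty price ;
   put item= qty= price= ;
   cards ;
widget#3#4.50
;

/* Hypothetical data: DSD makes the comma the default delimiter */
/* and treats consecutive commas as enclosing a missing value   */
data _null_ ;
   infile cards dsd ;
   input a $ b $ c $ ;
   put a= b= c= ;
   cards ;
x,,z
;
```

In the second step, the empty field between the two commas should leave B blank instead of shifting the z into it, which is the behavior DSD is meant to provide.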

IT CAN GET COMPLICATED

The DLM= option has even more versatility, but there are pitfalls in its use. Consider, for example, a data stream consisting of just one record and four fields:

    A B|C c|D

The four fields are “A”, “B”, “C c”, and “D”. (In the original figure, blue horizontal bars mark the span of each field; they are not part of the data stream.)

Notice that there are two spaces. One (following A) is intended as a delimiter, terminating the first field. The other (between C and c) is the middle character of a three-character field, and must not be treated as a delimiter.

To successfully read the four fields, we have to define the space character as the delimiter, read one field so delimited (A), change the delimiter to the vertical pipe character (|), and read the three remaining fields.

Changing the delimiter on the fly seems like a tall order, but SAS® has features that make it feasible.

• The assignment in the DLM= option can be a variable instead of a character literal. So the delimiters are in effect dynamic (code-driven or data-driven).

• When an at sign (@) is the last element of an INPUT statement (that is, the element preceding the closing semicolon), the record is said to be “held”. A subsequent INPUT statement will reference the same record, and the pointer into the record will be unchanged. In other words, the second INPUT will begin processing exactly where the first INPUT left off. In the absence of the trailing @, the behavior is different: the buffer is flushed, a new record is loaded, and the pointer is returned to the first (leftmost) position.

Together, these two capabilities enable us to change delimiters on the fly. Here is a first attempt at coding it.

    data _null_ ;
       infile cards dlm=mydlm ;
       mydlm = ' ' ;
       input V_1 $ @ ;
       mydlm = '|' ;
       input (V_2-V_4) ($) ;
       put (v:) (=) ;
       cards ;
    A B|C c|D
    ;

MYDLM is the variable assigned to contain the current delimiter(s). The first assignment statement loads it with a blank, then the first INPUT statement begins reading from the data line, storing the letter “A” in variable V_1. The trailing @ holds the record and leaves the pointer at column position 3, so that after the second assignment statement changes the delimiter to the pipe (|), the second INPUT statement picks up where the first left off. It reads the “B” into V_2, advances the pointer, and continues reading. We expect the string “C c” to be loaded into V_3 and “D” into V_4.

The PUT statement is there to display the results in the log so that we can confirm things. We get

    V_1=A V_2=B V_3=C V_4=c

Somehow, the INPUT statement continues to treat the blank as a delimiter, so that scanning for V_3 terminates right after the upper case “C”, and V_4 ends up with the lower case “c”. Scanning stops there, so at least the pipe is treated as a delimiter. There are no more variables, so processing of the statement goes no further and the “D” is not picked up at all.

This is not what we expected. What happened?

Think about the declaration of DLM=MYDLM in the INFILE statement. MYDLM is a character variable. SAS does not know what length is appropriate, so it makes the length 8, which is the default. When the first assignment statement gives MYDLM a single space as a delimiter, that is padded with 7 more spaces and stored in the 8-character variable. The repetition is harmless. But when the same process occurs for the second assignment, the padded value is one pipe followed by 7 blanks. Consequently, SAS will persist in treating the space as a delimiter as it executes the second INPUT statement.

The solution is simple: just place a LENGTH statement before the INFILE statement.

    data _null_ ;
       length mydlm $ 1 ;
       infile cards dlm=mydlm ;
       . . .
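For convenience, the pieces above can be assembled into one complete step. This is a sketch put together from the paper's own statements. The only addition is the variable LEN, which is hypothetical and not in the original: it uses the VLENGTH function to write the allocated length of MYDLM to the log as a check on the LENGTH statement.

```sas
data _null_ ;
   length mydlm $ 1 ;          /* declared before INFILE, so no blank padding */
   infile cards dlm=mydlm ;
   mydlm = ' ' ;               /* a blank terminates the first field          */
   input V_1 $ @ ;             /* trailing @ holds the record                 */
   mydlm = '|' ;               /* from here on, only the pipe is a delimiter  */
   input (V_2-V_4) ($) ;
   len = vlength(mydlm) ;      /* LEN is an added diagnostic, not in the paper */
   put (v:) (=) len= ;
   cards ;
A B|C c|D
;
```

With the length fixed at 1, the second INPUT statement should deliver the result that was intended from the start, with “C c” in V_3 and “D” in V_4. Removing the LENGTH statement and rerunning should show LEN at the default of 8 and bring back the faulty result.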

We can greatly generalize what we’ve just learned. The default length of 8 for character variables is utterly arbitrary. It is almost always a very good idea to explicitly declare lengths for character variables, rather than relying on SAS’s defaults and heuristics.

TRICKIER SOLUTIONS

A somewhat more complicated situation can arise when alternative delimiters are to be recognized. Consider this input:

    111/112/121|122
    211\212\221|222

DLM='/|\' works, but it allows any of the three characters to terminate any of the fields. Suppose that we want to be more restrictive, allowing the first two fields of each record to be terminated by either a forward or backward slash, but the third field to be ended only by a pipe.

Our LENGTH statement will make the delimiter variable two characters long, to accommodate the forward and backward slashes. But when we assign the pipe character by itself (MYDLM = '|'), the blank-padding, which we saw earlier, will occur. We don’t want that.

The solutions exploit the harmless duplication mentioned above. One way is to just code

    data _null_ ;
       length mydlm $ 2 ;
       infile cards dlm=mydlm ;
       mydlm = '/\' ;
       input v11 v12 @ ;
       mydlm = '||' ;
       input v21 v22 ;
       put (v:) (=) ;
       cards ;
    111/112/121|122
    211\212\221|222
    ;

The essential thing is the repetition of the pipe to preclude the need for blank-padding.

Here is another solution, one that does not require us to figure out the appropriate length.

    data _null_ ;
       infile cards dlm=mydlm ;
       mydlm = repeat('/\',1234) ;
       input v11 v12 @ ;
       mydlm = repeat('|' ,1234) ;
       input v21 v22 ;
       put (v:) (=) ;
       cards ;
    111/112/121|122
    211\212\221|222
    ;

The 1234 is merely an arbitrarily large number, chosen for purposes of illustration. The REPEAT function generates a long, redundant string, which is truncated to length 8 in the process of being stored in MYDLM. There is simply no room for a space to squeeze in.

Notice that this actually violates the rule proposed above: that character variables’ lengths always be declared. In this situation it’s safe to skip the LENGTH statement because the repetition-truncation technique addresses the blank-padding issue in a different way.

CONCLUSION

The lesson specific to the topic of this paper: when coding a variable (rather than a literal) on the right hand side of the DLM= option, avoid the unintentional introduction of the blank as a delimiter.

The general lesson: don’t let SAS determine the length of a newly created character variable. Instead, declare it explicitly with a LENGTH or ATTRIB statement.

REFERENCES

SAS Institute Inc. 2011. SAS® 9.3 Statements: Reference. Cary, NC: SAS Institute Inc.

SAS Institute Inc. 2011. SAS® 9.3 Functions and CALL Routines: Reference. Cary, NC: SAS Institute Inc.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author:

Howard Schreier
Howles Informatics
Arlington VA

703-979-2720

hs AT howles DOT com
http://howles.com/saspapers/
http://sascommunity.org/wiki/Howard_Schreier

Or visit the wiki page that has been established for discussing this paper:

http://www.sascommunity.org/wiki/Careful_Use_of_Input_Delimiters

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
