
Reading MORE Difficult Raw Data
Matthew Cohen, Wharton Research Data Services


NESUG 2010 Foundations and Fundamentals


ABSTRACT
Raw data from outside sources is often messy and can be difficult to read into SAS®. This paper will cover many of the features and creative uses of the INFILE and INPUT statements that make this task easier. It will also include several case studies using SAS I/O functions and CALL EXECUTE. This is intended for beginner or intermediate programmers, or anyone who reads raw data into SAS.

Since all of the examples in this paper are based on real situations, most require the use of multiple options.

FIXED WIDTH
Column input – Formatted column input – Inline raw data

This first example reads from a fixed width file. The value for each variable can be found between the same starting and ending columns throughout the input file. The unusual aspect of this file is that the person's name is on one line and their age, sex, income, and education are on the next. This sample file has 6 variables and 4 records. Below is the data followed by the code to read it into SAS.

Joe   Smith
20 M 35000   BS
Jane  Doe
24 F 50000   MS
Matt  Brown
30 M 65500   HS
Mike  Johnson
55 M 85000   PhD

data new;
  infile "c:\myfile.txt";
  length fname $6 lname $10 sex $1 education $3;
  informat lname $10. income 7.;
  input fname 1-6 @7 lname
      / age 1-3 sex +1 income @14 education;
run;

There are several things to note in this code:
• The default length is 8 for numeric variables. I declare a length for each of the character variables, but leave the numeric variables unchanged.
• The INPUT statement actually reads in the data. In column input, the variable name is followed by the beginning and ending columns. In formatted column input, the first column of the variable is written following an @, and the informat tells SAS how many characters to read. SAS will accept any combination of formatted and unformatted column input.
• +1 moves the pointer one column to the right; more generally, +n moves it n columns.
• / moves the pointer to the next line.

Data may also be written directly into the SAS editor for testing purposes using the DATALINES statement.

data new;
  infile datalines;
  length fname $6 lname $10 sex $1 education $3;
  informat lname $10. income 7.;
  input fname 1-6 @7 lname
      / age 1-3 sex +1 income @14 education;
datalines;
Joe   Smith
20 M 35000   BS
Jane  Doe
24 F 50000   MS
Matt  Brown
30 M 65500   HS
Mike  Johnson
55 M 85000   PhD
;
run;
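The pointer controls described above can also be combined with plain column input. The following is a minimal sketch rather than part of the original program: it reads the same inline data using the #n line pointer (which moves to line n of the current group of records) and explicit column ranges for every variable. The dataset name new_cols and the column ranges for lname and income are assumptions based on the layout implied by the INPUT statement above.

```sas
/* Sketch: pure column input with #n line pointers.              */
/* Assumed layout: names on line 1, remaining values on line 2.  */
data new_cols;
  infile datalines;
  input #1 fname $ 1-6 lname $ 7-16
        #2 age 1-3 sex $ 4 income 6-12 education $ 14-16;
datalines;
Joe   Smith
20 M 35000   BS
Jane  Doe
24 F 50000   MS
;
run;
```

Because every variable names its own columns, the order in which the variables are listed within each line no longer matters.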

In rare cases, external files will behave differently than datalines. The rest of this paper reads the raw data from external files, the method that is more common when reading large amounts of data from outside sources.

DELIMITERS
List input – DLM option – RETAIN – Date handling

This example reads from a pipe-delimited ( | ) text file. A delimiter is a character used to denote the end of one value and the beginning of the next. It is needed because the values for each variable may be in different locations on each record. The advantage of the pipe over comma or space delimiters is that the pipe is rarely used in the actual data values.

ID | ticker | trade date | trade time | price | sequence num

1|abc|20080101|093000|10.23|123456|
251|def.g|20080229|170000|5.02|654321|
255|bls|20080915|123000|100.00|987654|
260|xyz|20080916|091500|15.75|555132|

data new;
  infile "c:\myfile.txt" dlm="|";
  retain id ticker trdate trtime price seq_num;
  length ticker $5 seq_num $6 trtime_temp $6;
  informat trdate yymmdd8.;
  format trdate yymmdds10. trtime time8.;
  input id ticker trdate trtime_temp price seq_num;

  /* build a SAS time value from the hhmmss text */
  trtime = hms(input(substr(trtime_temp,1,2),2.),
               input(substr(trtime_temp,3,2),2.),
               input(substr(trtime_temp,5,2),2.));
  drop trtime_temp;
run;

This is a fairly simple example, but still takes advantage of several techniques.
• The INFILE statement includes the delimiter (DLM) option and sets it equal to pipe.
• The RETAIN statement has many uses. In this case, it orders the variables in the output dataset.
• SAS needs an INFORMAT to correctly read date values, and a FORMAT to display date and time values in a human-readable form. The informats can be written directly on the INPUT line instead of as a separate statement. I find it easier to read when they are separate.
• Notice that the variables on the INPUT statement are listed one after the other, in the same order as the raw data. This is known as list input, the alternative to column input. This raw data file cannot be used with column input because the values do not start in the same column on every record.
• The time values are actually read into a character variable because they do not match any SAS informats. I use the SUBSTR function to read the hours, minutes, and seconds separately, and feed them into the HMS function, which creates a SAS time.
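For comparison, here is a minimal sketch of the alternative mentioned in the notes above, with the informats written directly on the INPUT line. With list input this is usually done with the colon (:) modifier, which tells SAS to read up to the next delimiter and then apply the informat. The dataset name new_inline is an assumption, the data is supplied inline for testing, and the time value is left as text for brevity.

```sas
/* Sketch: list input with informats on the INPUT statement itself. */
data new_inline;
  infile datalines dlm="|";
  format trdate yymmdds10.;
  input id
        ticker      :$5.
        trdate      :yymmdd8.
        trtime_temp :$6.
        price
        seq_num     :$6.;
datalines;
1|abc|20080101|093000|10.23|123456|
251|def.g|20080229|170000|5.02|654321|
;
run;
```

The colon modifier also sets the length of the character variables from the informat width, so no separate LENGTH statement is used in this version.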

Even a very simple input file requires some knowledge of SAS options and functions.


LINE WRAPPING
FLOWOVER – MISSOVER – TRUNCOVER – LIST

EXAMPLE 1
This example is very similar to the previous one, but introduces a small data error: the last value on the third record is on a new line. For simplicity, the remaining examples do not transform the time variables.

1|abc|20080101|093000|10.23|123456|
251|def.g|20080229|170000|5.02|654321|
255|bls|20080915|123000|100.00|
987654|
260|xyz|20080916|091500|15.75|555132|

data new;
  infile "c:\myfile.txt" dlm="|" flowover;
  retain id ticker trdate trtime_temp price seq_num;
  length ticker $5 trtime_temp $6 seq_num $6;
  informat trdate yymmdd8.;
  format trdate yymmdds10.;
  input id ticker trdate trtime_temp price seq_num;
run;

• The FLOWOVER option is the default for the INFILE statement, and does not have to be written in the code. In cases where the record ends before all input values have been read, it tells SAS to look on the next line for the remaining values.

SAS correctly reads the sequence number even though it is on the next line.

EXAMPLE 2
The alternatives to FLOWOVER are MISSOVER and TRUNCOVER. Both MISSOVER and TRUNCOVER will set variables to missing if they are not found on the original input record; neither wraps to the next line. The difference is that MISSOVER will set a variable to missing if it cannot read the entire value. TRUNCOVER will write out any part of the value that it can read, and then stop. It is actually very difficult to find a situation where MISSOVER and TRUNCOVER produce different results since SAS can almost always read the whole value.

MISSOVER and TRUNCOVER give the same results with list input, including the style used in this paper where a pointer control is combined with a separate INFORMAT statement. They can differ with column input, and also when an informat with a width is written directly on the INPUT statement. Even then, there is only a difference if the record length is shorter than expected. Using the first input file as an example:

Joe   Smith
20 M 35000   BS
Jane  Doe
24 F 50000   MS
Matt  Brown
30 M 65500   HS
Mike  Johnson
55 M 85000   PhD

data new;
  infile "c:\myfile.txt" truncover;   /* swap in missover to compare the two */
  length fname $6 lname $10 sex $1 education $3;
  informat lname $10. income 7. education $3.;
  input fname 1-6 @7 lname
      / age 1-3 sex 4 +1 income education 14-16;
  list;
run;

• The education variable is now read in by stating the exact columns rather than using an informat.
• MISSOVER will set education to missing for the first three records because they only use 2 of the 3 allotted characters. The value "PhD" is read correctly.
• TRUNCOVER will store all four values as they appear in the raw data.
• The LIST statement writes the input line to the SAS Log along with a ruler, which may help to debug programs.
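The difference between the two options is easiest to see on a very small file. The sketch below is an illustration under stated assumptions, not part of the original example: it writes three short records to a temporary file (the fileref tmp and the dataset names miss and trunc are hypothetical) and then reads them with column input under each option. An external file is used because inline DATALINES records are padded with blanks, which hides the effect.

```sas
/* Sketch: MISSOVER vs TRUNCOVER on records shorter than the column range. */
filename tmp temp;

data _null_;
  file tmp;            /* records are written without trailing blanks */
  put "1";
  put "22";
  put "333";
run;

data miss;
  infile tmp missover;
  input x $ 1-3;       /* needs 3 columns; shorter records become missing */
run;

data trunc;
  infile tmp truncover;
  input x $ 1-3;       /* keeps whatever part of the value is present */
run;

proc print data=miss;  run;
proc print data=trunc; run;
```

With records written this way, MISSOVER should leave x blank for the first two records, while TRUNCOVER should keep all three values.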


DSD DOES IT ALL

EXAMPLE 1
Here, a new variable named flag has been added to the input data.

1|abc|20080101|093000|10.23||123456|
251|def.g|20080229|170000|5.02||654321|
255|bls|20080915|123000|100.00|y|
987654|
260|xyz|20080916|091500|15.75||555132|

data new;
  infile "c:\myfile.txt" dlm="|" dsd flowover;
  retain id ticker trdate trtime_temp price flag seq_num;
  length ticker $5 trtime_temp $6 flag $1 seq_num $6;
  informat trdate yymmdd8.;
  format trdate yymmdds10.;
  input id ticker trdate trtime_temp price flag seq_num;
run;

• The flag variable is missing in some cases, causing two pipes to be read in a row. The DSD option tells SAS to treat them as separate delimiters and set the value in between to missing.

With the DSD option turned on, SAS no longer reads the wrapped value correctly. It sees the pipe after the flag variable as one delimiter and the end of the record acts as another delimiter. Two delimiters in a row cause the sequence number on the third record to be set to missing. Even worse, SAS treats the sequence number as the first variable in the next observation. From there, the rest of the file is incorrect. FLOWOVER can be useful, but can also have unintended consequences.

EXAMPLE 2
The vendor is helpful and quoted the character variables. But how do you read them? What length do you specify?

"abc"|'Abc Corp'|20080101|100|50|40|
"def"|"David, Ernie, & Frank's, llc"|20080229|115.02|60|50.2|
"booz"|"Booz | Allen | Hamilton"|20080915|975|600.00|354|
"xyz"|"'Miscellaneous' company"|20080916|25|5.75|18.9|

data new;
  infile "c:\myfile.txt" dlm="|" dsd missover;
  retain ticker name date revenue cogs net_income;
  length ticker $4 name $50;
  /* the informat and format must name the variable that is actually read */
  informat date yymmdd8.;
  format date yymmdds10.;
  input ticker name date revenue cogs net_income;
run;

• The DSD option takes care of the quotes for you. They will not appear in the SAS dataset, and are not included in the length.
• It allows single or double quotes, apostrophes (unbalanced quotes), and nested quotes as long as the inner quotes are of a different type (single vs. double).
• DSD also allows embedded delimiters if the character variables are quoted.

HOLDING THE INPUT RECORD
@ – @@ – OUTPUT

EXAMPLE 1
Here I introduce a new data error. There is a blank row after the 3rd record.

1|abc|20080101|093000|10.23||123456|
251|def.g|20080229|170000|5.02||654321|
255|bls|20080915|123000|100.00|y|987654|

260|xyz|20080916|091500|15.75||555132|


data new;
  infile "c:\myfile.txt" dlm="|" dsd flowover;
  retain id ticker trdate trtime_temp price flag seq_num;
  length ticker $5 trtime_temp $6 seq_num $6 flag $1;
  informat trdate yymmdd8.;
  format trdate yymmdds10.;
  input id @;                 /* read the ID and hold the record */
  if id = . then delete;      /* skip blank rows */
  input ticker trdate trtime_temp price flag seq_num;
run;

• The trailing @ option tells SAS to hold the current input record.

While the input record is being held, I perform validation of the ID variable. If it is blank, the record is deleted, otherwise SAS continues with the input.

EXAMPLE 2
In some ways, the trailing @ is the opposite of the FLOWOVER/MISSOVER options. FLOWOVER tells SAS what to do if there are fewer values in a raw data record than there are input variables. The @ and @@ options tell SAS what to do if there are more raw data values than input variables.

In this example, each record has two observations. For simplicity, only 3 variables are used.

1|abc|20080101|251|def.g|20080229|
255|bls|20080915|260|xyz|20080916|

data new;
  infile "c:\myfile.txt" dlm="|";
  retain id ticker trdate;
  length ticker $5;
  informat trdate yymmdd8. id 3.;
  format trdate yymmdds10.;
  input id ticker trdate @;
  output;
  input id ticker trdate;
  output;
run;

• The first INPUT statement reads the first 3 values as usual. The trailing @ holds the pointer after the third value.
• The first OUTPUT writes them out to the first observation.
• The second INPUT tells SAS to read the next 3 values, which are then written to the second observation. SAS moves on to the next input record and starts again.

EXAMPLE 3
The input in this case is similar, but the ID is only listed once and is followed by two sets of observations.

1|abc|20080101|def.g|20080229|
255|bls|20080915|xyz|20080916|

data new;
  infile "c:\myfile.txt" dlm="|";
  retain id ticker trdate;
  length ticker $5;
  informat trdate yymmdd8. id 3.;
  format trdate yymmdds10.;
  input id @;
  input ticker trdate @;
  output;
  input ticker trdate;
  output;
run;

• Three inputs are used: one for the ID and one more for each observation.

This technique can get complicated, but it is flexible enough to handle many difficult raw data situations.

EXAMPLE 4
To read streaming data, where the observations are listed on the same line, one after the other, use the trailing @@ option. Here, 4 observations appear on the same raw data record.

1|abc|20080101|251|def.g|20080229|255|bls|20080915|260|xyz|20080916

data new;
  infile "c:\myfile.txt" dlm="|" dsd flowover;
  retain id ticker trdate;
  length ticker $5;
  informat trdate yymmdd8.;
  format trdate yymmdds10.;
  input id ticker trdate @@;
run;

• The trailing @@ option tells SAS to keep reading the current input record until all fields have been read. SAS loops through the variables on the INPUT statement until the end of the record.
• Notice that there is no ending delimiter after the last value. If one exists, SAS continues to look for more input data. SAS still reads the data correctly, but notes a "Lost Card" and gives an error.

SPECIAL CHARACTERS
Tab delimiter – LRECL

This example is from a different data source that tracks product sales. The input file is tab delimited, and the lines are very long and filled with characters that have special meaning in SAS. Even so, this is a relatively easy file to read.

Dell	225	1	17" LCD monitor
Sony	1499.50	1	Brand new 2008 Sony 50 inch plasma television; 1080i resolution; speakers on each side of the unit; 10 ports including HDMI, component, composite, and s-video located on the front and back of the unit; swivel base included; Energy Star compliant.
Canon	150	2	Canon A280 digital camera

data new;
  infile "c:\myfile.txt" dlm="09"x dsd missover lrecl=1000;
  retain company price quantity description;
  length company $20 description $1000;
  input company price quantity description;
run;

• Tab delimiters are specified using a hexadecimal constant. "09" in hexadecimal is a tab.
• LRECL is the logical record length. The default is 256, meaning that SAS will read the first 256 characters of a record unless directed otherwise. SAS prints the note "One or more lines were truncated." if a record is longer than the LRECL.

If the length of a long character variable is not given in the data’s documentation, it will take 2 or more passes through the data to determine an appropriate length in SAS. If the SAS length is too short, the data will be truncated. A variable length that is too long wastes memory. This is because SAS pads unused portions of text variables with blanks.

To read in the data initially, set a length that is likely to be too big for the variable. Once it is in a SAS dataset, use the LENGTH function to determine the maximum actual length used. The ideal is one character less than the defined length. The single blank character ensures that nothing was truncated, but wastes minimal space. If the maximum length equals the defined length, increase the defined length and read in the data again.
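The length check described above can be sketched as follows. This is a minimal illustration that assumes the dataset new and the variable description from the previous example; the column alias max_used is hypothetical.

```sas
/* Sketch: find the longest value actually stored in DESCRIPTION. */
proc sql;
  select max(length(description)) as max_used
  from new;
quit;
```

If max_used equals the defined length (1000 in the example above), the value may have been truncated, and the read should be repeated with a larger length.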


READING FROM ZIPPED FILES
FILENAME – Pipe

The raw data files I receive are getting significantly bigger over time. I often receive them in a compressed (zipped) format. Decompressing the file and then reading it in SAS requires an extra manual step in what might otherwise be an automated process, and also takes extra disk space. SAS can read directly from compressed files with the FILENAME statement.

filename zipfile PIPE "unzip -p /home/myfile.zip";
data new;
  infile zipfile firstobs=2 obs=11;
  input;   /* placeholder: list the variables for the file here */
run;

• The FILENAME statement defines a raw data file or directory in SAS. FILENAME and INFILE are very similar; in this example the pipe is defined on the FILENAME statement and INFILE refers to it by name.
• The PIPE option tells SAS to read the results of the following command. This example reads from a zipped file on a Unix-like system and writes directly to SAS. This will also work on Windows with the right software.
• The FIRSTOBS option specifies the first raw data record to read, while the OBS option specifies the last record to read. FIRSTOBS=2 is useful when the variable names appear in the first line of the raw data. In this case, only records 2-11 (10 records) are read. It is often easier to test and debug programs using a small number of records.

filename flist PIPE "ls /home/datadir/*.txt | perl -pe 's/\s+/\n/'";
data new;
  infile flist;
  length a $ 100;
  input a;
run;

• This is a more complicated version of a piped FILENAME statement. It lists the names of all the text files in the directory and writes them out to SAS on separate lines.
• They can be read into a dataset like any text file with a single variable.

SAS I/O FUNCTIONS
DOPEN – DNUM – DREAD – FEXIST – DCLOSE

Another method of obtaining directory and file information is with the SAS I/O functions. Some of the I/O functions work on SAS datasets, others on raw files or directories.

filename testDir "c:\mydir";
filename readme "c:\mydir\readme.txt";
data _null_;
  dirID = dopen("testDir");
  if dirID = 0 then do;
    put "Could not open directory";
    stop;
  end;

  dirNum = dnum(dirID);
  put dirNum=;
  fname = dread(dirID, 15);
  put fname=;
  if fexist("readme") then put "Yes";
  dirID = dclose(dirID);
run;

• The FILENAME statement defines a raw data file or directory in SAS.
• Most I/O functions require you to open the file or directory and retrieve an ID number.
• The DNUM function returns the number of files in the directory.
• The DREAD function returns the name of a file. 15 is a sequence number: the 15th file read from the directory.
• The FEXIST function reports whether or not a specified file exists.
• DCLOSE closes the directory.

This example can be used to read an entire directory of raw data files, or a subset of files if a little extra logic is applied. I use the FEXIST function to look for new documentation from the data vendor. The I/O functions can accomplish some of the same tasks as the piped FILENAME statement, using SAS commands instead of Perl.
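The "little extra logic" could look something like the sketch below, which loops over every member of the directory and keeps only the .txt files. It is an illustration under assumptions rather than the author's program: the dataset name file_list, the 256-character name length, and the choice to filter on the .txt extension are all hypothetical.

```sas
/* Sketch: build a dataset of the .txt file names in a directory. */
filename testDir "c:\mydir";

data file_list;
  length fname $256;
  dirID = dopen("testDir");
  if dirID = 0 then do;
    put "Could not open directory";
    stop;
  end;

  do i = 1 to dnum(dirID);          /* one pass per directory member */
    fname = dread(dirID, i);
    /* keep only names whose last dot-separated part is "txt" */
    if lowcase(scan(fname, -1, ".")) = "txt" then output;
  end;

  rc = dclose(dirID);
  keep fname;
  stop;
run;
```

Each name in file_list can then be used to drive a read of the corresponding file, for example with the CALL EXECUTE technique in the next section.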


CALL EXECUTE
CALL EXECUTE – ATTRIB – CATS – CATX

The data vendor is very helpful and provides a dictionary file that describes exactly what the data looks like. How do you use it?

column_name      data_type       max_length   column_label
AnnualMtg        smalldatetime   NULL         Annual Meeting Date
AnnualMtgPlace   varchar         300          Annual Meeting Location
AuditFees        bigint          NULL         Fees Paid to Auditor
CompanyName      varchar         255          Company Name
Ticker           varchar         10           Ticker

Ticker|CompanyName|AnnualMtg|AnnualMtgPlace|AuditFees
A|Agilent Technologies, Inc.|2010-03-15|New York City, NY|75000
AA|Alcoa, Inc.|2010-01-30|Sheraton Hotel. Pittsburgh, PA|100000
AAI|AirTran Holdings, Inc.|2010-05-01|Tampa, FL|43000

For only 5 variables, you can write the length, format, and input statements by hand. For 600 variables (as is the case with this database), that is much too difficult.

data _null_;
  call execute('data auditors;');
  call execute('infile "c:\2010_auditors.txt" delimiter="|" missover DSD firstobs=2 lrecl=32767;');

  length input_statement $32767;
  input_statement = "input";
  length attr_statement $120;

  do until (last_field);
    set dictionary_table end=last_field;
    /* add this variable to the growing INPUT statement */
    input_statement = catx(' ', input_statement, column_name);

    /* build one ATTRIB statement per variable; a blank delimiter keeps the pieces apart */
    attr_statement = catx(' ', 'attrib', column_name);
    if data_type = "$" then
      attr_statement = catx(' ', attr_statement, cats('length=$', max_length));
    attr_statement = catx(' ', attr_statement, cats('label="', column_label, '"'));
    call execute(cats(attr_statement, ';'));
  end;

  call execute(cats(input_statement, ';'));
  call execute('run;');

  stop;
run;

• Some additional work not shown above needs to be done first:
  o Read the dictionary file into a SAS dataset, named dictionary_table in this example.
  o Convert SQL datatypes (varchar, bigint, etc.) into SAS datatypes (char, num). I use a "$" in the data_type field for character variables.
  o Sort dictionary_table so the variable names are in the same order as they appear in the raw data.
• CALL EXECUTE "builds" a new DATA step which is run immediately after the current one ends. Look closely at the text in parentheses and quotes. First is the DATA statement, then INFILE, and so on, very similar to the previous examples.
• Instead of listing the length for every variable then the label for every variable, I use the ATTRIB statement, which lists all the attributes for one variable at a time.
• The INPUT and ATTRIB statements are built inside a loop. Using the CATX function, new variable information is added to temporary character variables ("input_statement" and "attr_statement") one at a time as each observation in dictionary_table is read.
• The ATTRIB statement is executed inside the loop because there is one statement per variable. The INPUT statement is executed after all the variables have been read.
• The CATS function strips leading and trailing spaces from each string before joining them, while CATX does the same and also inserts a delimiter between the strings, in this case a single space.
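The datatype conversion listed among the preliminary steps might be sketched as follows. The input dataset name dictionary_raw and the particular list of SQL types treated as character are assumptions for illustration; the real mapping depends on the vendor's database.

```sas
/* Sketch: mark character columns with "$" and leave numeric columns blank. */
data dictionary_table;
  set dictionary_raw;                      /* hypothetical: dictionary file as first read */
  if lowcase(data_type) in ("varchar", "char", "nvarchar", "nchar", "text")
    then data_type = "$";                  /* character variable */
  else data_type = " ";                    /* everything else is treated as numeric */
run;
```

Date columns such as AnnualMtg are treated as numeric here, so they would still need an informat added to the generated code before the values can be read as SAS dates.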


CONCLUSION
This paper has attempted to explain, by example, many of the options on the INFILE and INPUT statements, as well as SAS I/O functions and CALL EXECUTE that can be used to read difficult raw data. Below is a summary of the options and a description of each. Please note that these are not the official SAS descriptions, and may leave out some details.

INFILE options:
  DATALINES – Used to write raw data directly into the SAS editor.
  DSD – 1. Removes quotes surrounding values in the raw data. 2. Treats consecutive delimiters as separate, and sets the value between them to missing. 3. Sets the default delimiter to comma.
  DLM – Sets the delimiter.
  FLOWOVER – If the input record ends before all variables are read, tells SAS to look on the next record.
  MISSOVER – If the input record ends before all variables are read, tells SAS to set any remaining variables that cannot be read completely to missing.
  TRUNCOVER – If the input record ends before all variables are read, writes out any part of the remaining variables that have been read and sets the rest to missing.
  LRECL – Logical record length. Sets the maximum number of characters to read on a raw data record.
  FIRSTOBS – Sets the first line of an input file that SAS should read.
  OBS – Sets the last line of an input file that SAS should read.

FILENAME options:
  PIPE – Reads the results of the following command, rather than a file.

INPUT options:
  +n – Moves the pointer forward n spaces.
  @n – Moves the pointer to column n.
  / – Moves the pointer to the next line.
  #n – Moves the pointer to line n.
  (trailing) @ – Holds the pointer at its current location on a raw data record.
  (trailing) @@ – Holds the pointer at its current location on a raw data record and continues until the end of the record.

I/O functions:
  DOPEN – Open a directory for reading.
  DNUM – Return the number of files in a directory.
  DREAD – Return the name of a given file.
  FEXIST – Report if a given file exists.
  DCLOSE – Close an open directory.

Statements:
  LIST – Writes the input line to the SAS Log along with a ruler.

Call Routines:
  EXECUTE – Builds the code of a DATA step to be run immediately after the current DATA step.

ACKNOWLEDGEMENTS
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

CONTACT INFORMATION
Matthew Cohen
IT Director
Wharton Research Data Services
[email protected]
http://wrds.wharton.upenn.edu
