Host Systems and Environments

USING THE SAS DATA STEP LANGUAGE TO READ dBASE III FILES
Michael C. Harris
ARC Information Systems Division

Abstract
The dBASE III file format is arguably the most enduring in the PC/MS-DOS world. Originated by Ashton-Tate in the 1980s, it has won widespread support among third-party vendors of both end-user products and application development tools. The term "xBase" has been coined to describe third-party products supporting the dBASE file format. Today, specialized research databases are available in the dBASE .DBF format. They may contain many tens of thousands of records and occupy multiple megabytes of disk space. This paper discusses a technique for reading DBF files into SAS data sets in the PC/MS-DOS and UNIX Sun-OS environments. The concepts may be applicable to other operating systems as well.

Introduction
Native support for the dBASE file format exists in PC-SAS as PROC DBF. This procedure reads and writes both dBASE III and dBASE IV files. However, it cannot read files containing more than 32,767 observations, a constraint that limits the usefulness of the procedure when reading files several megabytes in size. The data step code we will discuss was originally written to overcome this limitation in the PC/MS-DOS environment. An inquiry from a version 6.07 SAS user operating under Sun-OS prompted us to port the code to that environment. Briefly, the task at hand was to read large (ten megabytes or more) files in dBASE III .DBF format from CD-ROMs into SAS data sets. Note that PROC DBF exists only in PC-SAS.

dBASE Field Types
Fields in the xBase world are analogous to variables in SAS. While only character and numeric data exist in SAS, xBase supports a rich variety of data types. With one exception, however, they can all be represented in SAS as characters or numerics. Here are the data types that most xBase products support, along with some of their characteristics.

• Character fields contain alphanumeric data. The maximum length is implementation dependent.

• Numeric fields hold up to 19 bytes including positions for the sign and decimal point.

• Date fields hold data stored in YYYYMMDD format.

• Logical fields hold Boolean data. They are represented in xBase as one byte character fields that may contain only the values "t", "T", "y", "Y", "n", "N", "f" or "F".

• Memo fields are variable length character strings and are not stored in the .DBF file. The .DBF file itself contains a ten byte pointer into another specially structured file having the same name as the .DBF file and a .DBT extension. The xBase language automatically creates and manages .DBT files and many users are not aware of their existence. Because the actual data can have maximum sizes of four kilobytes to sixty-four kilobytes depending upon implementation, handling memo fields in SAS can be very difficult.

The dBASE III File Structure
A dBASE file consists of a variable length binary header followed by fixed length data records. The header consists of two types of records:

• One thirty-two byte record containing a file identifier, the date the file was last modified, the number of records, the size of each record and an unused area.

• A variable number of thirty-two byte field descriptor records containing field names, types, lengths, and decimal places for numeric fields.

Each data record is preceded by a one-byte field xBase uses internally to mark records as logically deleted. dBASE files are unlike most data we are accustomed to reading with SAS in that there are no "lines" of data: the files are simply strings of bytes. They are not raw data files. They resemble SAS data sets in that they contain data and all the information you need to read it correctly, provided you know how to decipher the header. Tables 1 and 2 show the structure of each component of the dBASE header.
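As an illustration outside SAS, the header layout given in Tables 1 and 2 can be unpacked with a few lines of Python. This is a minimal sketch, not the program discussed in this paper: the names `DbfHeader`, `DbfField` and `parse_dbf_header` are hypothetical, the demo bytes are synthetic, and the check for a 0x0D byte closing the descriptor list is an assumption taken from common descriptions of the format rather than from the text above.

```python
import struct
from typing import NamedTuple


class DbfHeader(NamedTuple):
    file_id: int        # 0x03 or 0x83 per Table 1
    year: int
    month: int
    day: int
    num_records: int
    header_len: int
    record_len: int


class DbfField(NamedTuple):
    name: str
    type: str
    length: int
    decimals: int


# "<" pins little-endian byte order with no padding.
# Table 1: four 1-byte values, a 4-byte count, two 2-byte lengths, 20 unused bytes.
HEADER_FMT = "<4BIHH20x"
# Table 2: 11-byte name, 1-byte type, 4 unused, length, decimals, 14 unused.
FIELD_FMT = "<11sc4xBB14x"


def parse_dbf_header(data: bytes):
    """Return (DbfHeader, list of DbfField) parsed from the start of a .DBF file."""
    header = DbfHeader(*struct.unpack_from(HEADER_FMT, data, 0))
    fields = []
    pos = 32  # first field descriptor follows the 32-byte record
    # Assumption: a 0x0D byte marks the end of the descriptor list.
    while pos + 32 <= header.header_len and data[pos] != 0x0D:
        raw_name, ftype, length, decimals = struct.unpack_from(FIELD_FMT, data, pos)
        # Field names are null-terminated; keep only the part before the first null.
        name = raw_name.split(b"\x00", 1)[0].decode("ascii")
        fields.append(DbfField(name, ftype.decode("ascii"), length, decimals))
        pos += 32
    return header, fields


# Demo on a synthetic two-field header (not real data).
demo = (
    struct.pack(HEADER_FMT, 0x03, 92, 6, 15, 2, 32 + 2 * 32 + 1, 19)
    + struct.pack(FIELD_FMT, b"NAME", b"C", 10, 0)
    + struct.pack(FIELD_FMT, b"AMOUNT", b"N", 8, 2)
    + b"\x0d"
)
header, fields = parse_dbf_header(demo)
print(header.num_records, header.header_len, header.record_len)
print([(f.name, f.type, f.length, f.decimals) for f in fields])
```

Because the format strings state the byte order explicitly, the same unpacking gives the same result on little-endian and big-endian hosts.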

NESUG '92 Proceedings

Table 1: Initial 32 Byte Header

Offset (hex)  Contents                  Length
00            .DBF ID (03 or 83 hex)    1
01            year of last update       1
02            month of last update      1
03            day of last update        1
04            number of records         4
08            header length             2
0A            record size               2
0C            unused                    20

Table 2: Field Descriptors

Offset (hex)  Contents                  Length
00            field name                11
0B            field type                1
0C            unused                    4
10            field length              1
11            field decimals            1
12            unused                    14

Design Goals
Our primary goal was to create a program that would read dBASE files and represent data types not native to SAS in the same manner as PROC DBF. Here are the representations we chose:

• dBASE character fields become SAS character variables.

• dBASE numeric fields become SAS numeric variables.

• Memo fields will not be supported. A warning will be issued if memo fields are found, but the fields will not become part of the SAS data set.

Also, we chose to ignore the field indicating logical deletion and read every record into the new SAS data set.

The Strategy
This program consists of two parts. The first section reads the first thirty-two bytes of the header and determines how many SAS variables need to be created, including variables to hold the entire header, which will be discarded. The second section is a code generator. It reads each field descriptor record, extracts variable names and adjusts them if necessary, extracts field widths and types and constructs informats, and writes a data step program to a file. The .DBF header is read into one or more variables and dropped, along with the field indicating logical deletion. We initially attempted to ignore the header and simply begin reading the dBASE file beyond the header using @ pointer controls. That didn't work, because it wasn't possible to advance the pointer beyond byte 32,767.

After writing all the necessary statements to a file, that file is %INCLUDEd into the current program and run. We considered writing macro code but decided this approach would be more flexible. If the user is going to be processing several files having the same structure, which is not unusual, the data step code contained in the file we wrote can easily be reused by changing the names of the input and output files. It isn't necessary to run the entire program again.

Implementation Notes
Below are two sets of source code, one for PC/MS-DOS, and the other for Sun OS. The most obvious difference between them is in the handling of binary fields two or more bytes in length. The 80x86 family of microprocessors for which dBASE was originally written stores integers with the least significant byte at the lower address, the reverse of Sun's family of RISC processors. Therefore, header fields containing the number of records, the header length, and the record length must have their byte ordering reversed on the Sun workstation.

dBASE files of seven or more fields have headers greater than two hundred bytes long. We check the header length and, as the listings show, read a header longer than 200 bytes in sections of at most 200 bytes, each held in its own character variable.

Field names in dBASE can be up to ten bytes long, while valid SAS variable names are limited to eight bytes. We handle this incompatibility by checking the length of each field name. If it is greater than eight bytes, we truncate it and compare it to a list of all variable names processed thus far. If truncation creates a duplicate name, we dynamically create a new name by concatenating the character string 'VAR' to the alphanumeric representation of the variable's position in the program data vector. We notify the user of any changes in variable names.
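The field name rule just described can be sketched in a few lines of Python, shown here only as an illustration outside SAS. The function name `adjust_names` and the sample field names are hypothetical, and the sketch assumes that a variable's position is its 1-based index among the fields, as the loop counter in the listings suggests.

```python
def adjust_names(field_names):
    """Map dBASE field names to names of at most eight characters.

    Returns (sas_names, notes), where notes records every change made.
    """
    sas_names = []
    notes = []
    for position, original in enumerate(field_names, start=1):
        name = original
        if len(name) > 8:
            # Truncate first, then check the names processed so far.
            name = name[:8]
            if name in sas_names:
                # Truncation produced a duplicate: build 'VAR' + position instead.
                name = f"VAR{position}"
            notes.append(f"{original} -> {name}")
        sas_names.append(name)
    return sas_names, notes


# Hypothetical field names for demonstration only.
sas_names, notes = adjust_names(["CUSTOMER_A", "CUSTOMER_B", "CITY"])
print(sas_names)
print(notes)
```

Keeping the list of changes separate from the renaming makes it easy to report them to the user, as the program does with its notes in the log.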


Conclusion
It has been said that the SAS data step language can be used for reading any kind of data. We developed a program for reading structured binary data in the form of dBASE III files using PC-SAS and ported it to a new operating system with minimal effort. The techniques employed may be useful for reading other types of files as well.

Acknowledgements
SAS is a registered trademark of SAS Institute Inc., Cary NC, USA.

dBASE and dBASE III are registered trademarks of Ashton-Tate.

Further Information
For further information, the author can be reached at:

Atlantic Research Corp.
Information Systems Division
1301 Piccard Drive, 2nd floor
Rockville, MD 20850
(301) 258-5300


/* NAME: DBF.SAS                                               */
/* AUTHORSHIP: ARC/PSG                                         */
/* FUNCTION: This program reads dBase III files containing any */
/*   number of records. It is a replacement for PROC DBF when  */
/*   the number of records is greater than 32,767              */
/* ASSUMPTIONS: The user must enter a fully qualified filename */
/*   and the file must exist and be a valid dBase file         */
/* INPUTS: dBase III or dBase III Plus files                   */
/* OUTPUTS: A SAS data step program and a SAS data set         */
/* REFERENCE: Michael C. Harris                                */

/* the user supplies this information */
%let fname='';
%let dsname='';

filename dbfile &fname;
filename pinc 'pinc.sas';  /* the file to be created by this data step */

options source2;           /* show %INCLUDED code */

/* Get information about the file we want to read */
data _null_;
  infile dbfile unbuffered recfm=n;

  /* Make sure we have enough room later */
  length varname $ 8 hedfmt infmt $ 9 droplist $ 200;

  /* The structure of dBase files is documented by Ashton-Tate in
     the user manuals and in several third party publications */
  input @1  has_memo ib1.   /* memo file exists?              */
        @2  dbyear   ib1.   /* year of last update            */
        @3  dbmon    ib1.   /* month of last update           */
        @4  dbday    ib1.   /* day of last update             */
        @5  numrecs  ib4.   /* number of records in file      */
        @9  headlen  ib2.   /* length of header portion of file */
        @11 reclen   ib2.;  /* length of each record          */

  /* Show information about the file being read.
     has_memo can have values of 3 hex and 83 hex; 3 means no memo
     field exists. This program does not support memo fields. */
  if has_memo ne 3 then do;
    put 'WARNING: The input file has a memo file associated with it!';
    put 'WARNING: Data will be lost!';
  end;

  put dbyear= dbmon= dbday= numrecs= headlen= reclen=;

  /* calculate number of dbf fields - they will become SAS variables */
  numflds=(headlen-34)/32;
  put numflds=;

  pos=33;  /* pointer variable to read field descriptors */

  /* Now that we know something about the dBase file, write a data
     step program to read it properly */
  file pinc;  /* fileref for new data step program */

  /* read and ignore the header */
  if headlen le 200 then do;
    /* if we can get the entire header into one SAS variable */
    /* trim(left()) keeps a blank out of the informat when headlen < 100 */
    hdfmt='$char' || trim(left(put(headlen, 3.))) || '.;';
    put 'data ' &dsname '(drop=header del);';
    put 'infile "' &fname '" unbuffered recfm=n eof=eof;';
    put 'input header ' hdfmt;
  end;
  else do;
    /* This section handles dBase files with headers > 200 bytes long */
    put 'data ' &dsname ';';
    put 'length del $ 1;';
    put 'infile "' &fname '" unbuffered recfm=n eof=eof;';
    put 'input';               /* start input statement */
    hedfmt='$char' || '200.';  /* all except the last section will be 200 bytes long */

    /* calculate the number of variables we'll need to read the entire header */
    numsects=int(headlen/200)+1;

    /* calculate length of the last section */
    lenlast=headlen-((numsects-1)*200);

    /* start list of variables to be dropped from data set */
    droplist='del';

    /* create variable names for each chunk of the header */
    do pcs=1 to numsects-1;
      /* create variable names for header */
      hedname='h' || left(put(pcs, 1.));
      /* build list of header variables to drop */
      droplist=trim(droplist) || ' ' || hedname;
      put hedname hedfmt;      /* write to file */
    end;

    hedfmt='$char' || trim(left(put(lenlast, 3.))) || '.';
    /* ... */

  /* loop through field descriptors ... */
  do flds=1 to numflds;
    input @pos fld_name $8. +3 fld_type $1. +4 fld_len ib1. fld_dec ib1.;

    /* trim away unwanted nulls */
    varname=scan(fld_name, 1, nul);

    /* if a dBase character, logical or date variable ... */
    /* ... */
      infmt=trim(put(fld_len, 3.)) || '.' || left(put(fld_dec, 2.));
      /* don't keep the memo field */
      droplist=trim(droplist) || ' ' || varname;
    end;
    put ' ' varname infmt;

    /* set pointer to next field descriptor */
    pos+32;
  end;

  /* write non-data dependent code to file */
  /* ... */


/* NAME: SUNDBF.SAS              */
/* AUTHORSHIP: ARC/PSG           */
/* FUNCTION: This program reads  */
/* ... */
        @4  dbday ib1.
        /* number of records in file */
        @5  num   $char4.
        /* length of header portion of file */
        @9  head  $char2.
        /* length of each record */
        @11 rec   $char2.;

  /* THIS IS CRITICAL: reverse byte order of multiple byte binary fields */
  numrecs=input(reverse(num),  pib4.);
  headlen=input(reverse(head), pib2.);
  reclen =input(reverse(rec),  pib2.);
  /* ... */

    /* ... 200 bytes long */
    put 'data ' &dsname ';';
    put 'length del $ 1;';
    put 'infile "' &fname '" unbuffered recfm=n eof=eof;';
    put 'input';  /* start input statement */

    /* all except the last section will be 200 bytes long */
    hedfmt='$char' || '200.';

    /* calculate the number of variables we'll need to read the entire header */
    numsects=int(headlen/200)+1;

    /* calculate length of the last section */
    lenlast=headlen-((numsects-1)*200);

    /* start list of variables to be dropped from data set */
    droplist='del';

    /* create variable names for each chunk of the header */
    do pcs=1 to numsects-1;
      /* create variable names for header */
      hedname='h' || left(put(pcs, 3.));
      /* build list of header variables to drop */
      droplist=trim(droplist) || ' ' || hedname;
      /* write to file */
      put hedname hedfmt;
    end;

    hedfmt='$char' || trim(left(put(lenlast, 3.))) || '.';
    /* create variable name for last section of header */
    hedname='h' || left(put(numsects, 3.));
    droplist=trim(droplist) || ' ' || hedname;
    /* write to file */
    put hedname hedfmt ';';
  end;

  put 'do while(1);';
  put ' input';
  /* the first byte of each record indicates whether or not it has
     been logically deleted. An asterisk ("*") means deleted. We
     choose to ignore this field and read every record */
  put ' del $char1. ';

  /* Field names are C strings - ASCIIZ, or null delimited. Create a
     null variable to be used as a delimiter in the SCAN() function */
  nul=put(0, ib1.);

  /* loop through field descriptors ... */
  do flds=1 to numflds;
    /* per Table 2 the name occupies 11 bytes: read 10, skip 1 */
    input @pos fld_name $10. +1 fld_type $1. +4 fld_len ib1. fld_dec ib1.;
    varname=scan(fld_name, 1, nul);  /* trim away unwanted null characters */

    /* dBASE field names can be up to 10 bytes long. Check length,
       adjust and rename to avoid duplicates if necessary */
    if (length(varname) > 8) then do;
      file log;
      put '**Note: The field name ' varname ' is not a valid SAS variable name.';
      /* reduce to 8 bytes */
      varname=put(varname, $8.);
      if (flds ne 1) then do;
        do ndx=1 to flds;
          if (varname = _names{ndx}) then  /* if duplicate name ... */
            varname='VAR' || trim(left(put(flds, 4.)));
        end;
      end;
      put '**Note: The name has been changed to ' varname '.';
    end;

    /* update array with latest variable name */
    _names{flds}=varname;
    file pinc;

    /* if a dBASE character, logical or date variable */
    if (fld_type='C' | fld_type='L' | fld_type='D') then
      infmt='$char' || trim(left(put(fld_len, 3.))) || '.';  /* create informat */
    else
    if fld_type='N' then  /* if a dBASE numeric variable */
      infmt=trim(put(fld_len, 3.)) || '.' || left(put(fld_dec, 2.));
    else
    if fld_type='M' then do;  /* the dreaded memo field */
      infmt=trim(put(fld_len, 3.)) || '.' || left(put(fld_dec, 2.));
      /* don't keep the memo field */
      droplist=trim(droplist) || ' ' || varname;
    end;

    put ' ' varname infmt;
    pos + 32;  /* set pointer to next field descriptor */
  end;

  /* write non-data dependent code to file */
  put ' ;';
  /* end list of variables to drop */
  put ' output;' / ' end;';
  put 'drop ' droplist ';';
  put ' eof: stop;' / 'run;';
  stop;  /* stop this data step */
run;

%include 'pinc.sas';  /* include the code we just generated! */
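The code-generation step that the listings perform can also be sketched outside SAS. The Python below is an illustration only: `build_informat` and `generate_data_step` are hypothetical names, the field list is made up, and the generated text merely mirrors the shape of the statements the listings write to pinc.sas; it has not been run through SAS. Unlike the listings, the sketch sizes header sections with a running remainder, an assumption made here to avoid a zero-length last section when the header length is an exact multiple of 200.

```python
def build_informat(ftype, length, decimals):
    """Return an informat string for one field, following the listings' rules."""
    if ftype in ("C", "L", "D"):
        # character, logical and date fields are read as characters
        return f"$char{length}."
    # numeric fields (and the memo pointer, which is later dropped) use w.d
    return f"{length}.{decimals}"


def generate_data_step(dsname, path, header_len, fields):
    """Build the text of a data step that skips the header and reads every record.

    fields is a list of (name, type, length, decimals) tuples.
    """
    lines = [
        f"data {dsname};",
        "length del $ 1;",
        f'infile "{path}" unbuffered recfm=n eof=eof;',
        "input",
    ]
    drop = ["del"]
    # Read the header into character variables of at most 200 bytes each.
    remaining, section = header_len, 0
    while remaining > 0:
        section += 1
        size = min(200, remaining)
        lines.append(f"  h{section} $char{size}.")
        drop.append(f"h{section}")
        remaining -= size
    lines.append(";")
    # One INPUT per record: the deletion flag first, then every field.
    lines += ["do while(1);", "  input", "    del $char1."]
    for name, ftype, length, decimals in fields:
        lines.append(f"    {name} {build_informat(ftype, length, decimals)}")
        if ftype == "M":
            drop.append(name)  # memo pointers are read but not kept
    lines += ["  ;", "  output;", "end;", f"drop {' '.join(drop)};", "eof: stop;", "run;"]
    return "\n".join(lines)


# Hypothetical three-field file for demonstration only.
program = generate_data_step(
    "work.sample", "sample.dbf", 450,
    [("NAME", "C", 10, 0), ("AMOUNT", "N", 8, 2), ("NOTES", "M", 10, 0)],
)
print(program)
```

Writing the program text to a file, as the listings do, is what allows it to be reused for other files with the same structure by changing only the data set name and the path.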
