CODY I Data Conversion

SAS(R) TUTORIAL SESSION: CONVERTING DATA BETWEEN "FOREIGN" FORMATS AND SAS(R) SYSTEM FILES

Dr. Ronald Cody
Robert Wood Johnson Medical School

This paper discusses ways to move data from formats such as ASCII, Lotus(r), and dBASE(r) to a SAS system file. It also covers moving SAS system files from other platforms (such as UNIX) to system files on PCs. PROC DIF, DBF, CPORT, and CIMPORT are discussed, as well as a non-SAS Institute package, DBMS/COPY, which translates data between a variety of formats, including SAS system files. The special problems of missing values and incompatible formats are also addressed.

I. Reading Data from an External ASCII file.

One common way for users to enter data into a micro-computer is with a word processing package. Several such packages write directly in ASCII format, such as PCWRITE(r) and WordStar(r) (non-document mode only). Others, such as Word Perfect(r) and Multimate(r), use their own proprietary format. These latter packages contain translation routines which can convert their internal format to standard ASCII. In Word Perfect, the choice "Save to DOS" will write ASCII files, while in Multimate, you must run a translate program.

Another way to create ASCII files is to have a spread sheet program or a database program "print to a file." This technique is similar to sending data to the printer except that the resulting text will reside in a disk file. Care must be exercised here so that the package you are using does not format the text by adding margins or placing page breaks in the file. In Lotus, be sure to select the "Unformatted" and "Margin" (set left to zero) options in the Print menu before writing out the file. ASCII is a good "common denominator" between other packages and SAS system files when all else fails.

Regardless of how the ASCII file was created, let us now see how a SAS program can read such a file. The ASCII file that was used in the program example which follows is listed below:

FILE ASCII.TXT

001M2368160      ID is in cols 1-3, SEX in col 4, AGE in 5-6
002F4462 99      HEIGHT in col 7-8 and WEIGHT in col 9-11
003M29  200
004F2765         Note: This is a short record
005M6672220
006F6060100

The SAS program to read this file is shown next:


DATA ASCII;
   INFILE 'ASCII.TXT' MISSOVER;
   INPUT ID 1-3 SEX $ 4 AGE 5-6 HEIGHT 7-8 WEIGHT 9-11;
RUN;
PROC PRINT NOOBS;
   TITLE 'SAMPLE DATA SET';
   VAR ID--WEIGHT;
RUN;

Special care must be taken when reading this file. Notice that subject 004 has a short record (i.e., the carriage return was pressed immediately after the "65" was entered; no blanks were typed). Unlike files on mainframe systems, this file is not padded on the right with blanks. Without the "MISSOVER" option on the INFILE statement, the program would move to the next record in an attempt to read a value for WEIGHT even though the INPUT statement specifies columns 9-11 for this variable. Below is a listing of the SAS data set that was produced without the MISSOVER option:

Result of PROC PRINT when Option MISSOVER was Not Specified

ID    SEX    AGE    HEIGHT    WEIGHT

 1     M      23      68        160
 2     F      44      62         99
 3     M      29       .        200
 4     F      27      65          5
 6     F      60      60        100

Notice first that the SAS pointer went to record five to look for a value for WEIGHT and read the first three columns, which actually held the ID number of the next subject. The SAS pointer then moved to the next record, causing the rest of the data in record five to be skipped and the values in the last record (six) to appear in the fifth observation. The SAS LOG below shows that the SAS pointer went to a new line to read data and that the minimum record length was 8.

SAS LOG where Option MISSOVER was Not Specified

NOTE: The infile 'ASCII.TXT' is file D:\SASDATA\ASCII.TXT.
NOTE: 6 records were read from the infile D:\SASDATA\ASCII.TXT.
      The minimum record length was 8.
      The maximum record length was 11.
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.ASCII has 5 observations and 5 variables.

If you see this NOTE in a SAS LOG and you did not intend for the SAS pointer to go to a new line to read data (as with INPUT statements with @@), be sure to think about this short record problem and the MISSOVER option to solve the problem. It is a good idea to include the option MISSOVER when reading ASCII files with SAS/PC.
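The short-record behavior can be tried out without a word processor. The following is a minimal sketch, not part of the original example: it assumes a hypothetical fileref DEMO, writes the six sample records (including the 8-character record for subject 004) with a DATA _NULL_ step, and then reads them back with MISSOVER.

```sas
/* DEMO is a hypothetical fileref; the file name follows the example above */
FILENAME DEMO 'ASCII.TXT';

/* Write the six sample records; record 004 is deliberately short */
DATA _NULL_;
   FILE DEMO;
   PUT '001M2368160';
   PUT '002F4462 99';
   PUT '003M29  200';
   PUT '004F2765';
   PUT '005M6672220';
   PUT '006F6060100';
RUN;

/* Read the file back; MISSOVER keeps the pointer on the current record */
DATA ASCII;
   INFILE DEMO MISSOVER;
   INPUT ID 1-3 SEX $ 4 AGE 5-6 HEIGHT 7-8 WEIGHT 9-11;
RUN;

PROC PRINT DATA=ASCII NOOBS;
   TITLE 'SAMPLE DATA SET READ WITH MISSOVER';
RUN;
```

With MISSOVER in place, the data set should contain six observations, with WEIGHT missing for subject 4 and HEIGHT missing for subject 3. Removing MISSOVER from the INFILE statement should reproduce the five-observation result discussed above.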

II. Reading Data from a Lotus(r) Spreadsheet Via DIF Format

One way of converting a Lotus spreadsheet into a SAS system file is via a DIF (Data Interchange Format) file. Once your spreadsheet has been translated to a DIF file, you may use PROC DIF to convert the DIF file into a SAS system file. Let's first discuss the format of the original spreadsheet. You may simply have columns of variables, with the first row of the spreadsheet containing the values for your first observation. Below is a sample of a simple spreadsheet (containing the same values as the sample ASCII file above):

Lotus Spreadsheet Example 1

       A      B      C      D      E
 1     1      M      23     68     160
 2     2      F      44     62      99
 3     3      M      29            200
 4     4      F      27     65
 5     5      M      66     72     220
 6     6      F      60     60     100

Notice that the numbers are right justified and the character variables are left justified. When this spreadsheet is converted to a SAS system file, the numbers will be SAS numeric (8 byte) variables and the characters will become character variables of length 20. Character values longer than 20 bytes will be truncated when we use the DIF format for our translation.

Another form of the spreadsheet is to have the first row contain column headings. This form is shown next:

Lotus Spreadsheet Example 2

       A      B      C      D        E
 1     ID     SEX    AGE    HEIGHT   WEIGHT
 2     1      M      23     68       160
 3     2      F      44     62        99
 4     3      M      29              200
 5     4      F      27     65
 6     5      M      66     72       220
 7     6      F      60     60       100


Finally, you may have one or more lines of text or comments in your spreadsheet. An example of this is shown next:

Lotus Spreadsheet Example 3

       A      B      C      D        E
 1     These lines contain comments that we do not
 2     want to include with our data.
 3     ------
 4     ID     SEX    AGE    HEIGHT   WEIGHT
 5     1      M      23     68       160
 6     2      F      44     62        99
 7     3      M      29              200
 8     4      F      27     65
 9     5      M      66     72       220
10     6      F      60     60       100

There are two ways of dealing with examples 2 and 3. First, before we enter the Lotus translate program, we can use the "RANGE" command of Lotus and name a range that includes only the data. We can then translate only the range and create a DIF file that will be identical to the one from example 1. The other alternative is to translate the spreadsheet intact and use the SKIP option of PROC DIF to skip the first n lines of the spreadsheet.

Now that we have translated our WK1 file to a DIF file, we are ready to see how PROC DIF works. The syntax for PROC DIF is:

PROC DIF DIF=fileref OUT=sas_file SKIP=n;

where fileref = a file reference to the .DIF file

sas_file = name of the newly created SAS system file

n = number of lines of the spreadsheet to skip

For example, suppose our original worksheet file was called LOTUS.WK1. The translated DIF file will be named LOTUS.DIF (the .DIF is added automatically by the translate routine). If we want our SAS system file to be called LOTUSAS, we would write our PROC statements as follows:

FILENAME IN 'LOTUS.DIF';
PROC DIF DIF=IN OUT=LOTUSAS;
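For a spreadsheet laid out like Example 3 and translated intact, the SKIP option described above might be used as sketched here. The file name LOTUS3.DIF and the fileref IN3 are hypothetical, and SKIP=4 assumes that the two comment rows, the row of dashes, and the row of column headings should all be skipped.

```sas
/* LOTUS3.DIF is a hypothetical DIF file translated from Example 3 */
FILENAME IN3 'LOTUS3.DIF';

/* Skip the 4 non-data rows: 2 comment rows, the dashes, the headings */
PROC DIF DIF=IN3 OUT=LOTUSAS SKIP=4;
RUN;
```

The count given to SKIP should match the number of non-data rows at the top of your own spreadsheet.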

The variables in the resulting SAS data set would be named COL1, COL2, COL3, etc. You could rename these variables using PROC DATASETS such as:


PROC DATASETS;
   MODIFY LOTUSAS;
   RENAME COL1=ID COL2=SEX COL3=AGE COL4=HEIGHT COL5=WEIGHT;

III. Reading Data From a Lotus Spreadsheet via DBF Format

An alternate method of converting a Lotus spreadsheet to a SAS system file is to first convert the spreadsheet to DBF format (choose dBase III from the Lotus translate screen) and then use PROC DBF to create the SAS system file. There are advantages and disadvantages to this method. First, the Lotus translate routine expects that the first row of the spreadsheet contains variable names and subsequent rows contain data values. If there are extraneous rows or columns in your spreadsheet, use the "range" command in Lotus to name a range where your variable names and values are located. The Lotus to DBF conversion is more particular than the Lotus to DIF conversion. The translate routine insists that the second row of the spreadsheet (the first row of data) either contains data values or is formatted.

After the conversion is completed, the resulting SAS system file will have the same variable names as the column headings. Therefore, we must be careful to choose column headings which are valid SAS variable names. If we have column headings that are too long, they will be truncated to 8 bytes. If they contain invalid characters (such as blanks or -), they will be converted by PROC DBF into valid SAS variable names. The resulting SAS system file will use 8 bytes for the numeric variables and the Lotus spreadsheet column width for the character variables.

Empty cells will be converted to zeros by the Lotus to DBF conversion. Therefore, if zero is a valid value for a variable, you must put a missing value code (such as 9999) in the empty cells of these variables. After the SAS system file is created, you can convert the missing value codes back to SAS missing values. The syntax for PROC DBF is shown next:

PROC DBF DB3=dbase_fileref OUT=sas_file;

where dbase_fileref = the fileref for the .DBF file, created with the FILENAME statement

sas_file = the name of the SAS system file

For our example, suppose that the Lotus spreadsheet was converted to a .DBF file with the name LOTUS.DBF. The SAS programming statements to convert a DBF file to a SAS system file are shown below:

FILENAME GARFIELD 'LOTUS.DBF';
PROC DBF DB3=GARFIELD OUT=LOTUSAS;

If we had missing values in our spreadsheet and zero was not a valid data value, we would next have to convert the zeros to SAS missing values. One method, using arrays, is shown here:


DATA NEWSAS;
   SET LOTUSAS;
   ARRAY XXX[*] _NUMERIC_;
   DO I = 1 TO DIM(XXX);
      IF XXX[I] = 0 THEN XXX[I] = .;
   END;
   DROP I;
RUN;

Notice that missing values for character variables are converted correctly to SAS missing values. Also, had zero been a valid data value, you would have had to place a unique missing value code (9999, for example) in the spreadsheet and convert that value to missing in the same way that zero was converted in the example above. In case you are not familiar with the DIM function, it returns the length of the array (i.e., the number of elements in the array), so we do not have to count how many numeric variables we have in our data set.

IV. Converting dBase III Files to SAS System Files

Converting dBase III files to SAS system files is accomplished the same way we converted Lotus to SAS via DBF format, starting from the point at which the spreadsheet had been converted to a dBase file. The cautions concerning missing values still pertain.

V. Converting Lotus Spreadsheets to SAS System Files via DBMS/COPY

A third party software package called DBMS/COPY, available from Conceptual Software (Houston, TX, telephone (800) STAT-WOW), can be used to convert system files between most of the major database, spreadsheet, and statistical packages, including SAS. Some of the more common packages supported are: Lotus, Quattro, Clipper, Dataease, dBase, Informix, Ingres, Oracle, Paradox, PC-File, Prodas, Rbase, Reflex, Smart, ASCII, ACT!, Datalex, ABstat, Bass, BMDP, CSS, 4CaST/2, Forecast Pro, MicroStat-II, Ness, Probe, RATS, SigmaPlot, StatGraphics, Sygraph, Excel, Autobox, Gauss, Glim, Minitab, NCSS, SAS, SCA, Soritex, SPSS, Stata, StatPac, and Systat. In addition, DBMS/COPY can convert SAS transportable files to SAS system files (more about this later).

DBMS/COPY knows which formats you want to convert between by the file extensions of the two files named in the DBMSCOPY command. But, since several systems use the same extensions, DBMS/COPY uses what it calls pseudo-extensions.
Thus, to convert from SPSS to SAS the command would be:

spss_filename.SPSS sas_filename.SSD

even though your SPSS file extension is .SYS.

A very minor problem exists in the conversion to SAS system files with DBMS/COPY: the file name for the SAS system file can only be a single letter. The reason is that SAS system files contain a header which encrypts the file name with the date and time, and SAS Institute will not share this algorithm with the DBMS/COPY people. However, the programmers at Conceptual Software managed to figure out the encryption algorithm for one-character file names. Thus, when we use DBMS/COPY to convert to SAS system files, we will most probably want to use PROC DATASETS to rename the single-letter SAS system file to a more meaningful file name. We will show this in our example.

To convert our Lotus spreadsheet called LOTUS.WK1 to a SAS system file using DBMS/COPY, we would enter a single command at the normal DOS prompt. For example:

C> DBMSCOPY LOTUS.WK1 A.SSD

The resulting SAS system file, called A, will use the variable names from the first row of the spreadsheet (if you had them) or will assign the variable names A, B, C, etc. if the first row of your spreadsheet did not contain variable names. Also, the missing values of the spreadsheet will be converted correctly to SAS missing values! This is by far the easiest way to convert spreadsheets to SAS system files.

In the example above, to rename the SAS system file called A to a file called LOTUSAS, we would use PROC DATASETS as follows:

LIBNAME libref 'C:\';
PROC DATASETS LIBRARY=libref;
   CHANGE A=LOTUSAS;
RUN;

VI. Converting dBase III Files to SAS System Files via DBMS/COPY

To convert a dBase III file called TEST.DBF to a SAS system file called X.SSD, the command would be:

C> DBMSCOPY TEST.DBF X.SSD

VII. Converting SAS System Files from UNIX to SAS/PC:

SAS system files are not compatible across hardware platforms (such as UNIX to SAS/PC). To migrate a SAS system file from one platform to another, you must first convert the system file to a transportable file using PROC CPORT, transfer the file to your PC (over a network or telecommunications line), and then run PROC CIMPORT to convert the transportable file back to a system file on the PC. We will show how this is done with a simple example of a single SAS system file on a UNIX system which we want to move to a PC (Note: PROC CPORT and CIMPORT are capable of transporting an entire library or catalog as well). Suppose our original UNIX file is called MYSAS.SSD in a subdirectory called SASDATA. Here are the steps we should follow:

1. Program to run on the UNIX system.

LIBNAME XXX '/USERS/CODY/SASDATA'; *Note: the actual name of this subdirectory may be slightly different;
FILENAME TRANS '/USERS/CODY/SASDATA/MYTRANS';
PROC CPORT DATA=XXX.MYSAS FILE=TRANS;
RUN;

2. Download the transportable file (MYTRANS) to your PC. We use KERMIT, which provides an error-free transmission protocol.

3. Run the following program on your PC:

LIBNAME YYY 'C:\SASDATA'; *The subdirectory where you want the SAS system file to reside;
FILENAME IN 'MYTRANS';
PROC CIMPORT DATA=YYY.NEWSAS INFILE=IN;
RUN;

The DATA= option of CPORT provides the name of the SAS system file that you want to convert to a transportable format. FILE= defines the filename of the transportable file you want to create. PROC CIMPORT is similar to CPORT except that DATA= provides the name of the SAS system file you want to recreate on the PC and INFILE= gives the filename of the transportable file you downloaded to your PC. Information about CPORT and CIMPORT can be found in the SAS Procedures Guide, Release 6.03, or, in more detail, in Technical Report P-176, Using the SAS System.

VIII. Using DBMS/COPY to Translate SAS Transportable Files to SAS System Files

An alternate procedure to running PROC CIMPORT is to use DBMS/COPY to translate a SAS transportable file to a SAS system file. To do this, we must make sure the transportable file has the extension .SSP. This is easily done with a REN (rename) command in DOS, if the file does not already have this extension. Next, enter the DBMS/COPY command:

DBMSCOPY trans_name.SASPORT sas_name.SSD

where trans_name = the name of the transportable file without the .SSP extension (SASPORT is the DBMS/COPY pseudo-extension discussed earlier)

sas_name = SAS system file name (must be a single letter as discussed earlier)

No doubt, you will want to use PROC DATASETS to rename the single letter SAS system file.
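The rename step follows the same pattern as in Section V. As a sketch only, assuming the converted file is the single-letter data set A stored in C:\SASDATA and that NEWSAS is the name you want (the libref MYLIB and both names are hypothetical):

```sas
/* MYLIB is a hypothetical libref pointing at the directory holding A.SSD */
LIBNAME MYLIB 'C:\SASDATA';

/* CHANGE renames the data set itself, not the variables inside it */
PROC DATASETS LIBRARY=MYLIB;
   CHANGE A=NEWSAS;
RUN;
QUIT;
```

Note that CHANGE renames a member of the library, whereas the MODIFY and RENAME statements shown in Section II rename variables within a data set.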

IX. To Convert from Almost Anything to Anything Else

Use DBMS/COPY.
