Paper 74881-2011

Creating SAS® Datasets from Varied Sources

Mansi Singh and Sofia Shamas, MaxisIT Inc, NJ

ABSTRACT

Often SAS® programmers find themselves dealing with data coming from multiple sources, usually in different formats. Steps have to be taken to logically relate the process and convert the variety of data into SAS data sets before the data can be analyzed. Since these sources do not follow a similar pattern, this paper serves as a collection of examples illustrating the conversion of data coming from various sources, such as eXtensible Markup Language (XML), comma separated values (CSV), Microsoft Excel (XLS), or tab delimited (TXT) files, to SAS data sets.

INTRODUCTION

Data often comes from a variety of sources, and these different formats have to be put together in a cohesive way so that the data can be used for further analysis. This responsibility usually falls on the shoulders of a programmer, and every programmer has their own way of tackling the issue. The optimal approach depends upon the needs of the project and the programmer's preference.

There are various tools at our disposal, such as the IMPORT procedure, the Import Wizard, and the DATA step, which help programmers convert data coming from different sources to SAS data sets. Although these are very useful and widely used tools, they come with some limitations. PROC IMPORT gives no control over the field attributes because it scans the input file to automatically determine the name, type, and length of each variable. The DATA STEP - INFILE approach can be more programming-intensive when the data set contains many variables. Even though it is a more primitive approach, it increases a programmer's control over the data: the programmer can be precise in the variable definitions by specifying variable names and their attributes as the file is read through the INPUT statement, and data manipulations can be done directly within the same DATA step, which cannot be done using PROC IMPORT. The CDISC procedure is used for XML files based on the ODM structure and gives the user more control over the metadata content.

In this paper, the DATA STEP - INFILE method will be used to convert:

1. Comma Separated Values (CSV) file to SAS data set.
2. Tab delimited (TXT) file to SAS data set.
3. Microsoft Excel (XLS) file to SAS data set.

PROC CDISC will be used to convert:

4. eXtensible Markup Language (XML) file for ODM structure to SAS data set.

SASHELP.SHOES is used as the data source to illustrate these conversions.
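As a point of comparison (not part of the paper's own steps), a minimal PROC IMPORT call for the same CSV file might look like the sketch below; the &PATH macro variable and the output data set name SHOES_IMPORT are assumptions used only for illustration.

*Sketch for comparison only. PROC IMPORT scans the file and picks variable;
*names, types, and lengths itself, so the programmer has little control over;
*the resulting attributes. Path and data set name are assumed;
proc import datafile="&path\shoes.csv"
            out=shoes_import
            dbms=csv
            replace;
    getnames=yes;   *read variable names from the first row;
    datarow=2;      *data values begin on the second row;
run;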
CONVERTING CSV FILE TO SAS DATA SET

A Comma Separated Values (CSV) file is used to store tabular data in which numbers and text are stored in plain-text form. Plain text in such files is delimited by a symbol, the comma: traditionally, lines in the text file represent rows of a table, and commas separate the columns. CSV files are a common medium of data transfer, especially when dealing with external vendors. The code used to create the SAS data set is split into three main steps, and the main features of the code are explained along the way. Let's take the CSV file SHOES.CSV as an example, which will later be converted to a SAS data set.

STEP I: CREATE THE VARIABLE NAMES

Programmers are familiar with the way the traditional DATA step and INFILE statement are used to read external files. Even though the DATA step gives more control to the programmer when reading the data into SAS, it can be a tedious job to type all the variable names, particularly if the data contains a lot of variables. This part of the code illustrates an innovative way of reading and creating variable names for a data set.

*Creating dataset with variable names from csv file;
data all_attb;
  *Reading in the line containing variable names;
  infile "&path\shoes.csv" pad firstobs=1 obs=1 lrecl=32767;
  *Storing variable names as one random string;
  length randstr $500.;
  input @1 randstr $char500.;
  *Creating variables needed in the dataset;
  array a{*} $50. a1-a7;
  do i=1 to dim(a);
    a{i}=scan(randstr,i,',');
  end;
run;

1. The INFILE statement reads in the CSV file. The row where the variable names are stored within the file is selected through the FIRSTOBS and OBS options.
2. A character variable, RANDSTR, is created that contains the variable names as one long string separated by the file delimiter, which in this case is a comma (',').
3. An ARRAY is used to create the number of variables that will be in the data set. In this example there are seven variables, so the ARRAY dimension ranges from 1 to 7.
4. The SCAN function is used to read the string created in step 2. It scans for each variable name stored in the character string, separated by the delimiter, and creates the individual variables. Here seven different variable names are created.

STEP II: MACRO VARIABLES THAT STORE THE INFORMATION FOR INPUT AND ATTRIB STATEMENTS

Next, the attributes for the variable names created in STEP I need to be defined. It can again be tiresome to type all the variable names and their attributes. This part of the code demonstrates how this can be achieved in a more efficient way.

*Creating the strings for INPUT & ATTRIB statements;
data _null_;
  set all_attb;
  length inpt attb $32000. label $35.;
  array a{*} $50. a1-a7;
  do i=1 to dim(a);
    *Creating the string for INPUT and ATTRIB statements;
    *For character variables;
    if i^=4 then do;
      inpt=trim(inpt)||' '||trim(a{i})||' $ ';
      var=a{i};
      length='length=$15.';
      if i=1 then label='label="Region" ';
      else if i=2 then label='label="Product" ';
      else if i=3 then label='label="Subsidiary" ';
      else if i=5 then label='label="Total Sales" ';
      else if i=6 then label='label="Total Inventory" ';
      else if i=7 then label='label="Total Returns" ';
      attb=trim(attb)||' '||trim(var)||' '||trim(length)||' '||trim(label);
    end;
    *For numeric variables;
    else do;
      inpt=trim(inpt)||' '||trim(a{i})||' ';
      var=a{i};
      length='length=8.';
      if i=4 then label='label="Number of Stores" ';
      attb=trim(attb)||' '||trim(var)||' '||trim(length)||' '||trim(label);
    end;
  end;
  call symput("inpt", trim(inpt));
  call symput("attb", trim(attb));
run;

1. Creates the string of variable names for the INPUT statement along with the type identifier ($) for character variables in the data set.
2. For each variable defined by the ARRAY, a variable containing the length information and a variable containing the label information are created using IF-ELSE logic. These variables are then concatenated to build the information needed for the ATTRIB statement.
3. The logic used in 1 and 2 is then repeated for the numeric variables.
4. The variables created for the INPUT and ATTRIB statements are converted into macro variables. These macro variables (INPT and ATTB) will be used in the next step.
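Before moving on, the two macro variables can be checked in the log (a small sketch, not part of the paper's original steps); the exact strings depend on the header row of SHOES.CSV.

*Sketch: write the generated INPUT and ATTRIB strings to the log so they can;
*be reviewed before they are used in STEP III;
%put INPT = &inpt;
%put ATTB = &attb;

With the seven SHOES variables, the INPT string is simply the list of variable names with a $ after each character variable, and the ATTB string pairs each name with its length= and label= text.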
STEP III: READ IN DATA FROM CSV FILE

This part of the code uses all the information and variables created in STEP I and STEP II to read in the data.

*Read in all the data from csv file;
data shoes;
  infile "&path\shoes.csv" delimiter=',' pad missover firstobs=2 lrecl=32767;
  attrib &attb;
  input &inpt;
run;

1. This INFILE statement now reads in the data from the CSV file. The FIRSTOBS option points to the row where the data begins.
2. The macro variables (INPT and ATTB) created in the previous step are now used in the ATTRIB and INPUT statements to create the final SAS data set.

CONVERTING TXT FILE TO SAS DATA SET

A tab delimited (TXT) file is a plain text file which uses a tab stop as the separator between the data fields. Each line of the text file is a record of the data table. TXT is a widely supported file format which is often used to move data between various sources. The code used to create the SAS data set is again split into three main steps, and the main features of the code are explained along the way. Let's take the TXT file SHOES.TXT as an example, which will later be converted to a SAS data set.

STEP I: CREATE THE VARIABLE NAMES

This step is similar to STEP I of the CSV file conversion process. The only difference is the delimiter that is defined for the TXT file.

*Creating dataset with variable names from txt file;
data all_attb;
  infile "&path\shoes.txt" dsd firstobs=1 obs=1 lrecl=32767;
  ...
  *Creating variables needed in the dataset;
  array a{*} $50. a1-a7;
  do i=1 to dim(a);
    a{i}=scan(randstr,i,'09'x);
  end;
run;

1. The INFILE statement reads in the TXT file. The row where the variable names are stored within the file is selected through the FIRSTOBS and OBS options.
2. The SCAN function is used to read the string created in the previous steps. It scans for each variable name stored in the character string, separated by the delimiter (the tab character, '09'x).

STEP II: MACRO VARIABLES THAT STORE THE INFORMATION FOR INPUT AND ATTRIB STATEMENTS

This step is similar to STEP II of the CSV file conversion process.

STEP III: READ IN DATA FROM TXT FILE

This part of the code uses all the information and variables created in STEP I and STEP II to read in the data.

*Read in all the data from txt file;
data shoes;
  infile "&path\shoes.txt" delimiter='09'x dsd missover firstobs=2 lrecl=32767;
  attrib &attb;
  input &inpt;
run;

1. This INFILE statement now reads in the data from the TXT file.
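As a quick check that either conversion worked as intended (a sketch, not part of the paper's steps), the attributes and the first few records of the resulting SHOES data set can be reviewed.

*Verification sketch: confirm that the names, types, lengths, and labels from;
*the ATTRIB statement carried over, and look at the first few records;
proc contents data=shoes varnum;
run;

proc print data=shoes (obs=5) label;
run;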