Creating SAS® Datasets from Varied Sources
Mansi Singh and Sofia Shamas, MaxisIT Inc, NJ

Paper 74881-2011

ABSTRACT

SAS® programmers often find themselves dealing with data that comes from multiple sources, usually in different formats. The data has to be converted into SAS data sets, in a logically consistent way, before it can be analyzed. Since these sources do not follow a common pattern, this paper serves as a collection of examples illustrating the conversion of data from various sources, such as eXtensible Markup Language (XML), comma separated values (CSV), Microsoft Excel (XLS), or tab delimited (TXT) files, to SAS data sets.

INTRODUCTION

Data often comes from a variety of sources. These different formats of data have to be put together in a cohesive way so that the data can be used for further analysis, and this responsibility often falls on the shoulders of a programmer. Every programmer has their own way of programming and of tackling this issue. The optimal way depends upon the needs of the project and the programmer's preference.

There are various tools at our disposal, such as the IMPORT procedure, the Import Wizard, and the DATA STEP, which help programmers convert data coming from different sources to SAS data sets. Although these are useful and widely used tools, they come with some limitations. PROC IMPORT gives no control over the field attributes, because it scans the input file to determine the name, type, and length of each variable automatically. The DATA STEP with INFILE becomes more programming-intensive as the number of variables in the data set grows. Even though it is the more primitive approach, it increases the programmer's control over the data: variable names and their attributes can be specified precisely as the file is read through the INPUT statement. Data manipulations can also be done directly within the same DATA STEP, which cannot be done using PROC IMPORT. The CDISC procedure is used for XML files based on the ODM structure, and it gives the user more control over the metadata content.

In this paper, the DATA STEP with INFILE will be used to convert:
1. Comma Separated Value (CSV) file to SAS data set.
2. Tab delimited (TXT) file to SAS data set.
3. Microsoft Excel (XLS) file to SAS data set.
PROC CDISC will be used to convert:
4. eXtensible Markup Language (XML) file for ODM structure to SAS data set.

SASHELP.SHOES is used as the data source to illustrate these conversions.

CONVERTING CSV FILE TO SAS DATA SET

A Comma Separated Values (CSV) file stores tabular data, both numbers and text, in plain-text form. The fields in such a file are delimited by a symbol (a comma). Traditionally, lines in the text file represent rows in a table, and commas separate the columns. CSV files are a common medium of data transfer, especially when dealing with external vendors.

The code used to create the SAS data set is split into three main steps, and the main features of the code are explained along the way. Let's take a look at the CSV file SHOES.CSV as an example, which will later be converted to a SAS data set.

STEP I: CREATE THE VARIABLE NAMES

Programmers are familiar with the way the traditional DATA STEP and INFILE are used to read external files. Even though the DATA STEP gives the programmer more control when reading the data into SAS, it can be tedious to type all the variable names, particularly if the data contains a lot of variables. This part of the code illustrates an innovative way of reading and creating variable names for a data set.

    *Creating dataset with variable names from csv file;
    data all_attb;
      infile "&path\shoes.csv" pad firstobs=1 obs=1 lrecl=32767;   /* (1) */
      *Reading in the line containing variable names;
      length randstr $500.;                                        /* (2) */
      *Storing variable names as one random string;
      input @1 randstr $char500.;
      *Creating variables needed in the dataset;
      array a{*} $50. a1-a7;                                       /* (3) */
      do i=1 to dim(a);
        a{i}=scan(randstr,i,',');                                  /* (4) */
      end;
    run;

(1) The INFILE statement reads in the CSV file. The row in which the variable names are stored is selected through the FIRSTOBS option (together with OBS=1, so that only that row is read).
(2) A character variable, RANDSTR, is created that will contain the variable names as one long string separated by the file delimiter, which in this case is a comma (',').
(3) ARRAY is used to create the number of variables that will be in the data set. In this example there are seven variables, so the ARRAY dimension ranges from 1 to 7.
(4) The SCAN function is used to read the string created in (2). It scans for each variable name that was stored in the character string, separated by the delimiter, and then creates individual variables. Here seven different variable names are being created.

STEP II: MACRO VARIABLES THAT STORE THE INFORMATION FOR INPUT AND ATTRIB STATEMENTS

Next, the attributes have to be defined for the variable names created in STEP I. It can again be tiresome to type all the variable names and their attributes. This part of the code demonstrates how this can be achieved in a more efficient way.

    *Creating the strings for INPUT & ATTRIB statements;
    data _null_;
      set all_attb;
      length inpt attb $32000. label $35.;
      array a{*} $50. a1-a7;
      do i=1 to dim(a);
        *Creating the string for INPUT and ATTRIB statements;
        *For character variables;
        if i^=4 then do;
          inpt=trim(inpt)||' '||trim(a{i})||' $ ';                 /* (1) */
          var=a{i};
          length='length=$15.';
          if i=1 then label='label="Region" ';
          else if i=2 then label='label="Product" ';
          else if i=3 then label='label="Subsidiary" ';
          else if i=5 then label='label="Total Sales" ';
          else if i=6 then label='label="Total Inventory" ';
          else if i=7 then label='label="Total Returns" ';
          attb=trim(attb)||' '||trim(var)||' '||trim(length)||' '||trim(label);   /* (2) */
        end;
        *For numeric variables;
        else do;
          inpt=trim(inpt)||' '||trim(a{i})||' ';
          var=a{i};
          length='length=8.';                                      /* (3) */
          if i=4 then label='label="Number of Stores" ';
          attb=trim(attb)||' '||trim(var)||' '||trim(length)||' '||trim(label);
        end;
      end;
      call symput("inpt", trim(inpt));                             /* (4) */
      call symput("attb", trim(attb));
    run;

(1) Creates the string of variable names for the INPUT statement, along with the type identifier ($) for the character variables in the data set.
(2) For each variable defined by the ARRAY, a variable containing the length information and a variable containing the label information are created using IF-ELSE logic. These variables are then concatenated to create the information needed for the ATTRIB statement.
(3) The logic used in (1) and (2) is then repeated for the numeric variables. In this example only the fourth variable is treated as numeric.
(4) The strings built for the INPUT and ATTRIB statements are then converted into macro variables with CALL SYMPUT. These macro variables (INPT and ATTB) will be used in the next step.

STEP III: READ IN DATA FROM CSV FILE

This part of the code uses all the information and variables created in STEP I and STEP II to read in the data.

    *Read in all the data from csv file;
    data shoes;
      infile "&path\shoes.csv" delimiter=',' pad missover firstobs=2 lrecl=32767;   /* (1) */
      attrib &attb;
      input &inpt;                                                                  /* (2) */
    run;

(1) This INFILE statement now reads in the data from the CSV file. The FIRSTOBS option points to the row where the data begins.
(2) The macro variables (INPT and ATTB) created in the previous step are now used in the INPUT and ATTRIB statements to create the final SAS data set.

CONVERTING TXT FILE TO SAS DATA SET

A tab delimited (TXT) file is a plain text file which uses a tab stop as a separator between the data fields. Each line of the text file is a record of the data table. TXT is a widely supported file format which is often used to move data between various sources.

The code used to create the SAS data set is split into three main steps, and the main features of the code are explained along the way. Let's take a look at the TXT file SHOES.TXT as an example, which will later be converted to a SAS data set.

STEP I: CREATE THE VARIABLE NAMES

This step is similar to STEP I of the CSV file conversion process. The only difference is in how the delimiter for the TXT file is defined.

    *Creating dataset with variable names from txt file;
    data all_attb;
      infile "&path\shoes.txt" dsd firstobs=1 obs=1 lrecl=32767;   /* (1) */
      .
      *Creating variables needed in the dataset;
      array a{*} $50. a1-a7;
      do i=1 to dim(a);
        a{i}=scan(randstr,i,'09'x);                                /* (2) */
      end;
    run;

(1) The INFILE statement reads in the TXT file. The row in which the variable names are stored is selected through the FIRSTOBS option.
(2) The SCAN function is used to read the string created in the previous steps. It scans for each variable name that was stored in the character string, separated by the delimiter (TAB, written as the hexadecimal constant '09'x).

STEP II: MACRO VARIABLES THAT STORE THE INFORMATION FOR INPUT AND ATTRIB STATEMENTS

This step is similar to STEP II of the CSV file conversion process.

STEP III: READ IN DATA FROM TXT FILE

This part of the code uses all the information and variables created in STEP I and STEP II to read in the data.

    *Read in all the data from txt file;
    data shoes;
      infile "&path\shoes.txt" delimiter='09'x dsd missover firstobs=2 lrecl=32767;   /* (1) */
      attrib &attb;
      input &inpt;                                                                    /* (2) */
    run;

(1) This INFILE statement now reads in the data from the TXT file.
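
The three TXT steps can be tried end to end with a small test harness. The following is a minimal sketch and not part of the original paper. It assumes that SASHELP.SHOES is available in the session, that the macro variable PATH points to a writable Windows folder (C:\temp is a hypothetical choice), and that PROC EXPORT with DBMS=TAB is an acceptable way to produce the test file. The closing statements only inspect what STEP II and STEP III produced.

```sas
/* Hypothetical folder, chosen for illustration; adjust to a writable location. */
%let path = C:\temp;

/* Write SASHELP.SHOES as a tab-delimited file with a header row of names. */
proc export data=sashelp.shoes
            outfile="&path\shoes.txt"
            dbms=tab
            replace;
run;

/* ... run STEP I, STEP II, and STEP III for the TXT file here ... */

/* Show the strings that STEP II stored in the macro variables. */
%put INPT=&inpt;
%put ATTB=&attb;

/* Check the names, types, lengths, and labels of the data set built in STEP III. */
proc contents data=shoes;
run;
```

Writing the macro variables to the log before STEP III runs is a cheap way to catch a misread header row, since a wrong delimiter in the SCAN call typically shows up there as a single long name followed by blanks. The exported values may carry the formats stored with SASHELP.SHOES, which is compatible with the paper's choice of reading every column except the fourth as character.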