Converting Data Between "Foreign" Formats and SAS System Files

SAS(R) TUTORIAL SESSION: CONVERTING DATA BETWEEN "FOREIGN" FORMATS AND SAS(R) SYSTEM FILES

Dr. Ronald Cody
Robert Wood Johnson Medical School

This paper discusses ways to move data from formats such as ASCII, Lotus(R), and dBASE(R) to a SAS system file. It also covers moving SAS system files from other platforms (such as UNIX) to system files on PCs. PROC DIF, DBF, CPORT, and CIMPORT are discussed, as well as a non-SAS Institute package, DBMS/COPY, which translates data between a variety of formats, including SAS system files. The special problems of missing values and incompatible formats are also addressed.

I. Reading Data from an External ASCII File

One common way for users to enter data into a microcomputer is with a word-processing package. Several such packages, such as PCWRITE(R) and WordStar(R) (non-document mode only), write directly in ASCII format. Others, such as Word Perfect(R) and Multimate(R), use their own proprietary format. These latter packages contain translation routines that convert their internal format to standard ASCII. In Word Perfect, the choice "Save to DOS Text File" writes ASCII files, while in Multimate you must run a translate program.

Another way to create ASCII files is to have a spreadsheet program or a database program "print to a file." This technique is similar to sending data to the printer, except that the resulting text resides in a disk file. Take care that the package you are using does not format the text by adding margins or placing page breaks in the file. In Lotus, be sure to select the "Unformatted" and "Margin" (set left to zero) options in the Print menu before writing out the file. ASCII is a good "common denominator" between other packages and SAS system files when all else fails.

Regardless of how the ASCII file was created, let us now see how a SAS program can read such a file.
The ASCII file used in the program example that follows is listed below:

FILE ASCII.TXT

    001M2368160     ID is in cols 1-3, SEX in col 4, AGE in 5-6
    002F4462 99     HEIGHT in col 7-8 and WEIGHT in col 9-11
    003M29  200
    004F2765        Note: This is a short record
    005M6672220
    006F6060100

The SAS program to read this file is shown next:

    DATA ASCII;
       /* MISSOVER keeps SAS on the current record when a line is short */
       INFILE 'ASCII.TXT' MISSOVER;
       /* Column input: each variable is read from fixed columns */
       INPUT ID 1-3 SEX $ 4 AGE 5-6 HEIGHT 7-8 WEIGHT 9-11;
    RUN;
    PROC PRINT NOOBS;
       TITLE 'SAMPLE DATA SET';
       VAR ID--WEIGHT;
    RUN;

Special care must be taken when reading this file. Notice that subject 004 has a short record (i.e., the carriage return was pressed immediately after the "65" was entered; no blanks were typed). Unlike files on mainframe systems, this file is not padded on the right with blanks. Without the MISSOVER option on the INFILE statement, the program would move to the next record in an attempt to read a value for WEIGHT, even though the INPUT statement specifies columns 9-11 for this variable. Below is a listing of the SAS data set that was produced without the MISSOVER option:

Result of PROC PRINT when Option MISSOVER was Not Specified

    ID    SEX    AGE    HEIGHT    WEIGHT
     1     M      23      68        160
     2     F      44      62         99
     3     M      29                200
     4     F      27      65          5
     6     F      60      60        100

Notice first that the SAS pointer went to record five to look for a value for WEIGHT and read the first three columns, which actually hold the ID number of the next subject. The SAS pointer then moved to the next record, causing the data in record five to be skipped and the values in the last record (six) to appear in the 5th observation. The SAS LOG below shows that the SAS pointer went to a new line to read data and that the minimum record length was 8.

SAS LOG where Option MISSOVER was Not Specified

    NOTE: The infile 'ASCII.TXT' is file 0:\SASDATA\ASCII.TXT.
    NOTE: 6 records were read from the infile 0:\SASDATA\ASCII.TXT.
          The minimum record length was 8.
          The maximum record length was 11.
    NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
    NOTE: The data set WORK.ASCII has 5 observations and 5 variables.

If you see this NOTE in a SAS LOG and you did not intend for the SAS pointer to go to a new line to read data (as with INPUT statements with @@), be sure to think about this short record problem and the MISSOVER option to solve it. It is a good idea to include the MISSOVER option when reading ASCII files with SAS/PC.

II. Reading Data from a Lotus(R) Spreadsheet Via DIF Format

One way of converting a Lotus spreadsheet into a SAS system file is via a DIF (Data Interchange Format) file. Once your spreadsheet has been translated to a DIF file, you may use PROC DIF to convert the DIF file into a SAS system file. Let's first discuss the format of the original spreadsheet. You may simply have columns of variables, with the first row of the spreadsheet containing the values for your first observation. Below is a sample of a simple spreadsheet (containing the same values as the sample ASCII file above):

Lotus Spreadsheet Example 1

           A    B      C     D      E
    1      1    M     23    68    160
    2      2    F     44    62     99
    3      3    M     29          200
    4      4    F     27    65
    5      5    M     66    72    220
    6      6    F     60    60    100

Notice that the numbers are right justified and the character variables are left justified. When this spreadsheet is converted to a SAS system file, the numbers will be SAS numeric (8 byte) variables and the characters will become character variables of length 20. Character values longer than 20 bytes will be truncated when we use the DIF format for our translation.

Another form of the spreadsheet is to have the first row contain column headings. This form is shown next:

Lotus Spreadsheet Example 2

         A     B      C      D         E
    1    ID    SEX    AGE    HEIGHT    WEIGHT
    2     1    M       23        68       160
    3     2    F       44        62        99
    4     3    M       29                 200
    5     4    F       27        65
    6     5    M       66        72       220
    7     6    F       60        60       100

Finally, you may have one or more lines of text or comments in your spreadsheet.
An example of this is shown next:

Lotus Spreadsheet Example 3

          A     B      C      D         E
     1    These lines contain comments that we do not
     2    want to include with our data.
     3    -------------------------------------------
     4    ID    SEX    AGE    HEIGHT    WEIGHT
     5     1    M       23        68       160
     6     2    F       44        62        99
     7     3    M       29                 200
     8     4    F       27        65
     9     5    M       66        72       220
    10     6    F       60        60       100

There are two ways of dealing with examples 2 and 3. First, before we enter the Lotus translate program, we can use the "RANGE" command of Lotus and name a range that includes only the data. We can then translate only the range and create a DIF file that will be identical to the one from example 1. The other alternative is to translate the spreadsheet intact and use the SKIP option of PROC DIF to skip the first n lines of the spreadsheet.

Now that we have translated our WK1 file to a DIF file, we are ready to see how PROC DIF works. The syntax for PROC DIF is:

    PROC DIF DIF=fileref OUT=sas_file SKIP=n;

where

    fileref  = a file reference to the .DIF file
    sas_file = name of the newly created SAS system file
    n        = number of lines of the spreadsheet to skip

For example, suppose our original worksheet file was called LOTUS.WK1. The translated DIF file will be named LOTUS.DIF (the .DIF is added automatically by the translate routine). If we want our SAS system file to be called LOTUSAS, we would write our PROC statements as follows:

    FILENAME IN 'LOTUS.DIF';
    PROC DIF DIF=IN OUT=LOTUSAS;

The variables in the resulting SAS data set would be named COL1, COL2, COL3, etc. You could rename these variables using PROC DATASETS, for example:

    PROC DATASETS;
       MODIFY LOTUSAS;
       RENAME COL1=ID COL2=SEX COL3=AGE COL4=HEIGHT COL5=WEIGHT;

III. Reading Data From a Lotus Spreadsheet via DBF Format

An alternate method of converting a Lotus spreadsheet to a SAS system file is to first convert the spreadsheet to DBF format (choose dBase III from the Lotus translate screen) and then use PROC DBF to create the SAS system file.
There are advantages and disadvantages to this method. First, the Lotus translate routine expects that the first row of the spreadsheet contains variable names and subsequent rows contain data values. If there are extraneous rows or columns in your spreadsheet, use the "range" command in Lotus to name a range where your variable names and values are located. The Lotus to DBF conversion is more particular than the Lotus to DIF conversion. The translate routine insists that the second row of the spreadsheet (the first row of data) either contains data values or is formatted. After the conversion is completed, the resulting SAS system file will have the same variable names as the column headings.
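
The PROC DBF step that this section leads up to might look like the minimal sketch below. It assumes the translated file is named LOTUS.DBF and reuses the fileref IN and the data set name LOTUSAS from the DIF example. These names, and the use of the DB3= option to identify a dBASE III file, are assumptions for illustration and are not taken from the paper. Check the PROC DBF documentation for your SAS release before relying on them.

```sas
/* Sketch only: LOTUS.DBF, IN, and LOTUSAS are hypothetical names    */
/* carried over from the PROC DIF example for illustration.          */
FILENAME IN 'LOTUS.DBF';

/* DB3= (assumed here) points PROC DBF at a dBASE III file;          */
/* OUT= names the SAS data set to create.                            */
PROC DBF DB3=IN OUT=LOTUSAS;
RUN;

/* List the result to confirm that the column headings became        */
/* variable names, so no PROC DATASETS rename step is needed.        */
PROC PRINT DATA=LOTUSAS NOOBS;
RUN;
```

Because the DBF route carries the column headings through as variable names, the rename step used after PROC DIF should not be needed here.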