Reading From Alternate Sources: What To Do When The Input Is Not a Flat File
Michael Davis, Bassett Consulting Services, North Haven, Connecticut
ABSTRACT protocol, employed for both corporate network and dial-up connections. Most SASÒ programmers are comfortable creating applications when the input is in a SAS data set or The general syntax for a FILENAME statement that flat file on the same computer. However, what do employs the FTP Access Method is: you do when your data is to be accessed from a web page or TCP socket? Can you read data on another FILENAME fileref FTP 'external-file' computer accessible via FTP without first
The baseball.csv file is found in the saspaper subdirectory on an FTP server referenced by the IP informat DIVISION $4. ; address 207.244.93.41. Being a common-separated informat POSITION $2. ; file, the length of each record varies, hence the informat NO_OUTS best32. ; variable-length record format (recfm=v). The debug informat NO_ASSTS best32. ; parameter is optional. It is included in this example informat NO_ERROR best32. ; informat SALARY best32. ; for instructional purposes and is usually omitted format NAME $20. ; from production code. format TEAM $12. ; format NO_ATBAT best12. ; A comma-separated value example is shown in part format NO_HITS best12. ; because it allows for the columns names to be format NO_HOME best12. ; passed within the file. Also, the SAS code to read format NO_RUNS best12. ; the file can be automatically generated by the format NO_RBI best12. ; Import Wizard. The Import Wizard is available from format NO_BB best12. ; the File pmenu in SAS Release 6.12 and later format YR_MAJOR best12. ; versions. An edited version of the code it generated format CR_ATBAT best12. ; is shown below. format CR_HITS best12. ; format CR_HOME best12. ; format CR_RUNS best12. ; data WORK.baseball ; format CR_RBI best12. ; infile myfile delimiter = ',' format CR_BB best12. ; MISSOVER DSD lrecl=32767 format LEAGUE $8. ;
firstobs=2 ; format DIVISION $4. ; informat NAME $20. ; format POSITION $2. ; informat TEAM $12. ; format NO_OUTS best12. ; informat NO_ATBAT best32. ; format NO_ASSTS best12. ; informat NO_HITS best32. ; format NO_ERROR best12. ; informat NO_HOME best32. ; format SALARY best12. ; informat NO_RUNS best32. ; input informat NO_RBI best32. ; NAME $ TEAM $ NO_ATBAT informat NO_BB best32. ; NO_HITS NO_HOME NO_RUNS informat YR_MAJOR best32. ; NO_RBI NO_BB YR_MAJOR informat CR_ATBAT best32. ; CR_ATBAT CR_HITS CR_HOME informat CR_HITS best32. ; CR_RUNS CR_RBI CR_BB informat CR_HOME best32. ; LEAGUE $ DIVISION $ informat CR_RUNS best32. ; POSITION $ NO_OUTS NO_ASSTS informat CR_RBI best32. ; NO_ERROR SALARY ; informat CR_BB best32. ; run; informat LEAGUE $8. ; informat DIVISION $4. ; informat POSITION $2. ; informat NO_OUTS best32. ; informat NO_ASSTS best32. ; informat NO_ERROR best32. ; informat SALARY best32. ; Although it is beyond the scope of this paper, the format NAME $20. ; FTP file access method can be used to create files format TEAM $12. ; as well as read them. However, the anonymous format NO_ATBAT best12. ; login supplied in the example FILENAME statement format NO_HITS best12. ; only allows read-access. format NO_HOME best12. ; format NO_RUNS best12. ; Fans of “big iron” might be thinking at this point format NO_RBI best12. ; whether the FTP Access Method is available to format NO_BB best12. ; them. The good news is that users of SAS on format YR_MAJOR best12. ; format CR_ATBAT best12. ; mainframes are often able to use the FTP Access format CR_HITS best12. ; Method as well. On the following page, an example format CR_HOME best12. ; is shown to demonstrate the use of the FTP Access format CR_RUNS best12. ; Method is shown for an IBM OS/390 (MVS) format CR_RBI best12. ; computer. format CR_BB best12. ; format LEAGUE $8. ;
the contents of the file character by character, we /* set the FTP fileref */ use the $ebcdic and S370fpd informats. filename rsrawdat ftp "''" In our MVS FTP Access Method example, a SAS user='
Similar to the FTP Access Method, the URL Access baseball SAS data set used earlier in this paper. He Method is defined through a FILENAME statement. used ODS to create the HTML page. The page can The syntax for the FILENAME statement employing be viewed at the following URL: the URL Access Method is: http://www.bassettconsulting.com/baseball.html FILENAME fileref URL 'external-file' http:// understanding of HTML tables and other formatting if input(buffer,$3.) eq ' quotation marks as the delimiter. The loop variable “i” shows the position of each word. retain name team no_atbat no_hits no_home no_runs no_rbi no_bb The test program shows that for the rows that yr_major cr_atbat cr_hits contain data to be extracted, the word “#000000”, cr_home cr_runs cr_rbi which set the color of the displayed font to black, is cr_bb league division the thirteenth word extracted. The fourteenth word position no_outs no_assts is always the data to be extracted. However, in the no_error salary ; case of the player name, the fourteenth word is the retain count 0 ; player’s last name, with a comma. The player’s first format salary 8.2 ; name is the fifteenth word. set testhtml ; if count eq 0 then name = So the first part of our program to extract the table trim(word) || ' ' || word2 ; values from the baseball.html page should be else if count eq 1 then revised as shown in the following version: team = word ; else if count eq 2 then no_atbat = input(word, 8.) ; else if count eq 3 then filename readhtml url no_hits = input(word, 8.) ; 'http://www.bassettconsulting.com/ baseball.html'; else if count eq 4 then no_home = input(word, 8.) ; data testhtml(drop=buffer) ; else if count eq 5 then no_runs = input(word, 8.) ; length buffer $ 200 word word2 $ 25 ; else if count eq 6 then infile readhtml lrecl=200 pad ; no_rbi = input(word, 8.) ; input @1 buffer 200. ; else if count eq 7 then if input(buffer,$3.) eq ' name $ 18 else if count eq 20 then team $ 12 no_error = input(word, 8.) ; no_atbat no_hits no_home else if count eq 21 then no_runs no_rbi no_bb yr_major salary = input(word, 8.2) ; cr_atbat cr_hits cr_home count+1 ; cr_runs cr_rbi cr_bb 8 if count ge 22 then do ; league division position $ 8 count= 0 ; output ; no_outs no_assts no_error end ; salary 8 ; run ; The preceding code was customized for the The syntax for using the Socket Access Method in a baseball data set and would have to be adapted for client (reading data) mode is: different web pages. If the web page had been produced by a different technique, different FILENAME fileref SOCKET 'hostname:portno' “landmarks” may be required. However, the general For those whose requirements cannot be met by the Open up a SAS session and enter the following FTP and URL Access Methods, another way to program: communicate with web hosts follows. filename web socket ':80' server Socket Access Method termstr=CRLF ; At Northeast SAS Users Conference held in data _null_ ; Philadelphia last year, David Ward presented a infile web ; wonderful paper on using the Socket Access input ; Method, which is cited in the bibliography. Those file log ; who are interested in experimenting with and put _infile_ ; perhaps using the Socket Access Method are run ; commended to seek a copy of David’s paper, along with consulting the the SAS OnlineDoc. Notice in the above code that the hostname was not The portion of the SAS OnlineDoc relevant to the specified. In this example, the host would default to Socket Access Method can be found by selecting -> localhost. Base SAS Software -> SAS Language Reference: Dictionary -> Dictionary of Language Elements -> Next open up a web browser and type in a URL. If Statements -> FILENAME, SOCKET Access you switch back to the SAS session, the SAS Log Method. will show GET request and other communication between the browser and web host. When you are finished, hit When the author first tried this exercise, it did not filename dummycat catalog work on his PC until he was reminded that Microsoft 'work.mycat.dummydat.source' ; Personal Web Server, running as a NT service, was also listening on Port 80. When he temporarily data _null_ ; stopped the service, the example worked as file dummycat ; anticipated. put 'here is some sample data' ; run ; CATALOG Access Method data stuff ; By way of background, a catalog is a special type of length buffer $ 20 ; file that allows SAS to store different types of infile dummycat ; information in partitions called catalog entries. Each input @1 buffer $20. ; entry type identifies the purpose of the entry to the run ; SAS system. proc print; run ; Many, if not most, SAS users would find it relatively easy to write a program to read a straight-forward flat text file as program input. However, were one to The DATA _null_ step that follows the FILENAME take the contents of that file and convert it to a SAS statement writes a character string to the dummycat catalog source entry, some of those users might fileref. The filref was opened by the preceding hesitate or not be able to complete the task. FILENAME statement employing the Catalog Access Method. However, the principles behind reading a SAS catalog entry are not very different from those The following DATA step creates the SAS data set employed in reading flat text files. The Catalog stuff which reads the source catalog entry created Access Method allows one to read log, output, and written by the DATA _null_ step. The final source and catams SAS catalog entries. PRINT procedure shows that the preceding steps worked as designed. The syntax for the Catalog Access Method is: One might ask, “Why would anyone want to read FILENAME fileref CATALOG 'catalog' Reading Data from Serial Communication Ports One way review the byte stream being sent is to use a terminal program such as Hyperterminal, supplied If the ability to read data from FTP and web pages with Microsoft Windows. A second way to capture using SAS seems fantastic, consider that SAS can the byte stream for review is to pipe the also read data from (and write data to) other communication port to a file and then use a hex computers using an RS-232 serial communication editor to examine the captured byte stream. ports. For most personal computer users, that would be the 25 pin or 9 pin port that is used to SAS/ACCESS LIBNAME Engines connect external modems and palm-size personal organizers. In the preceding sections, all of the techniques offered involve making data not stored as a flat text However, serial interfaces are probably more useful file appear to SAS as if it were actually in text file to the users of laboratory and service equipment format. If one reviews the syntax used in the who need a way to automatically record test results. previous examples, each program begins with a Serial communication ports are also valuable when FILENAME statement. it is necessary to interface SAS with older computers whose operating system or disk file However, if one takes an expansive view of the title structure make difficult, if not impossible to obtain of this paper, one of the most common sources of the data they contain in any other manner. program input not stored as a flat text file is data stored in a database management system (DBMS). The details for reading data from serial SAS users who can read text files and SAS data communication ports can be found in the SAS sets with confidence sometimes lose their OnlineDoc. The relevant section can be found by confidence when they must get data from a DBMS. selecting -> Base SAS Software -> Host Specific Information -> Microsoft Windows Environment -> One of the more useful features introduced with the Getting Started -> Using External Files -> Reading Nashville release of SAS software (Versions 7 & 8) Data from the Communications Port. is the introduction of the LIBNAME access engines. If you recall from the earlier sections, we can In order for serial communication to work, it is specify that a FILENAME statement points to necessary match the port setting of the SAS client something other than a flat text file by the use of computer with the other computer with which one key words such as FTP and URL. wishes to connect. For a Microsoft Windows computer, the ports are set through the Control Similarly, we can specify that a LIBNAME statement Panel. points to something other than a SAS data set or view by the use of key words such as ODBC or The serial communications port is associated with ORACLE. Such a LIBNAME statement tells SAS SAS using a FILENAME statement. The syntax is: that the data is already stored as columns and rows and to use specified SAS/ACCESS to make the FILENAME fileref COMMPORT “ port: ; DBMS appear as if it were a SAS data set. The following program, taken from the OnlineDoc The general syntax for using the SAS/ACCESS section cited, reads in data, byte by byte until an LIBNAME engines is: end-of-file (hex ‘1a’x) is encountered: LIBNAME libref SAS/ACCESS-engine-name data acquire; Some SAS sites do not license any SAS/ACCESS has been defined, ordinary DATA step code can be software products. Also, the availability of used to access the DBMS data. In most situations, SAS/ACCESS for specific DBMS products varies by SAS performs the same optimization that previously platform. Last, for some of the common DBMSs, a required the use of pass-thru coding. full discussion of the various options could supply enough content for another SAS conference paper. The IMPORT, DBF, and DIF procedures can be So a “follow-along” example is not provided for this good tools to convert files from Microsoft AccessÒ section. and ExcelÒ into SAS data sets. However, the SAS data set from converted data does not change when However, to give a flavor of how SAS/ACCESS the source data changes. By contrast, the LIBNAME engines are used, the following example LIBNAME Engine provides an updated view when was culled from a program in use at one of the the source data changes. author’s clients to access data from OracleÒ. CONCLUSION libname ora_prod oracle user=userid password=password path=”@path” The author hopes this survey of SAS tools to read schema=schema ; data in formats other than flat text files will increase the reader’s appreciation of beauty and flexibility of SAS software. He also hopes that he has helped to Since SAS/ACCESS for Oracle at this site is only demystify the various access methods that have licensed and installed on an NT SAS Server, the been covered. preceding code is either submitted from the console of the NT SAS Server or from a desktop instance of One last tip. This paper directs the reader to the SAS via SAS/CONNECT. SAS OnlineDoc for details about the subjects covered. However, OnlineDoc may not be installed Another variation of this technique is to define a at every site. Fortunately, SAS OnlineDoc is libref to a desktop instance of SAS using a remote available from SAS on the World-Wide Web. The engine. A SAS/CONNECT session between the NT URL is: SAS Server and the desktop instance of SAS must be started before submitting a libname statement http://v8doc.sas.com/sashtml with the follow syntax. BIBLIOGRAPHY libname ora_prod rengine=oracle server=server roptions= “user=user password=password Ward, David L., “You Can Do THAT with SAS Software? Using the socket access method to unite SAS with path=’@path’ schema=schema” ; the Internet,” Proceedings of the 2000 Northeast SAS User Conference, 2000. 405-11 One important caveat. If you are using a SAS/ACCESS LIBNAME remote engine to read the ACKNOWLEDGEMENTS schema used for an Enterprise Reporting System (ERP) with hundreds or thousands of tables and you SAS, SAS/ACCESS, SAS/CONNECT, SAS/IntrNet accidentally double-click on the SAS Explorer icon and OnlineDoc are registered trademarks of for the libref (ora_prod in this example), you could SAS Institute Inc. Microsoft, Access, Excel, and be in for a lengthy wait for the screen to refresh. Windows are registered trademarks of the Microsoft Corporation. Oracle is a registered trademark of However, the SAS/ACCESS LIBNAME engines Oracle Corporation. offer some attractive features over alternative DBMS access methods. The LIBNAME engine The author would like to thank the Hartford Area makes all of the tables in the schema instantly SAS User Group Steering Committee, which available. By contrast, SAS/ACCESS views only encouraged him to prepare this paper. define a single table at a time. Please note that the code supplied in this paper is When compared to PROC SQL Pass-Thru, the designed only to illustrate the concepts being SAS/ACCESS LIBNAME saves one the trouble of discussed and may need to be modified to work in learning and using a different dialect of SQL tailored other applications. Modified code is not supported to the DBMS being queried. Further, once the libref by the author of this paper. CONTACT INFORMATION The author may be contacted as follows: Michael L. Davis Bassett Consulting Services, Inc. 10 Pleasant Drive North Haven CT 06473-3712 E-Mail: [email protected] Web: http://www.bassettconsulting.com Telephone: (203) 562-0640 Facsimile: (203) 498-1414 and http:// tags contain “ALIGN=LEFT” The most common HTTPD port is 80. We will when the cell contains character data and return to this subject at the end of URL Access “ALIGN=RIGHT” when the data is numeric. While Method section. useful in some applications, this information was not needed for the SAS coding shown subsequently. The URL Access Method is often used without any options. For a complete list of options that may be One of the techniques that the author uses to “fine- used with the URL Access Method, consult the SAS tune his programming approach is a “quick and OnlineDoc. Navigate to the relevant portion by dirty” program such as the following example to selecting -> Base SAS Software -> SAS Language generate a data set to show the position of all the Reference: Dictionary -> Dictionary of Language relevant words. Elements -> Statements -> FILENAME, URL Access Method. filename readhtml url Unlike the FTP Access Method, the tricky part is not 'http://www.bassettconsulting.com/ making the connection but parsing the returned byte baseball.html'; stream and converting it into a SAS data set. Fortunately, HTML pages contain tags, denoted by data testhtml(drop=buffer) ; the “<” and “>” characters. Scanning the URL byte length buffer $ 200 word $ 25 ; infile readhtml lrecl=200 pad ; streams for selected tags, coupled with an input @1 buffer 200. ; "') ; information as a SAS data set. if word eq 'TD' then do i = 1 to 20 ; The heart of the typical page returned from a CGI word= scan(buffer, i,' <>"') ; application, such as SAS/IntrNet, or from the SAS output ; web-formatting macros, or from the SAS Output end ; Delivery System (ODS) is an HTML table. Within run ; an HTML page, tables are enclosed between and
tags. Each row in a table is enclosed by and tags. Within each In this example, the baseball.html file is opened as row, each cell is enclosed between and a URL fileref. Input rows that do not begin with the tags. tag are immediately dropped. The resulting SAS data set, testhtml, shows the contents of each For the purpose of demonstrating the URL Access “word” identified by the Scan function. The Scan Method, the author created an HTML page from the function is using blanks, “<”, “>”, and double "') ; yr_major = input(word, 8.) ; if word eq '#000000' then do ; else if count eq 9 then word= scan(buffer,14,' <>"') ; cr_atbat = input(word, 8.) ; word2=scan(buffer,15,' <>"') ; else if count eq 10 then output ; cr_hits = input(word, 8.) ; else if count eq 11 then end ; cr_home = input(word, 8.) ; end ; run ; else if count eq 12 then cr_runs = input(word, 8.) ; else if count eq 13 then cr_rbi = input(word, 8.) ; The resulting data set, testhtml, has 7084 else if count eq 14 then observations, one observation for each of the 22 cr_bb = input(word, 8.) ; variables in the baseball data set. However, what else if count eq 15 then we really wanted was just one observation with 22 league = word ; variables for each row of data shown in on the web else if count eq 16 then page. The following program transforms the division = word ; testhtml data into the desired format. else if count eq 17 then position = word ; else if count eq 18 then data testhtml2(drop=word word2 count) no_outs = input(word, 8.) ; else if count eq 19 then ; length no_assts = input(word, 8.) ;