Writing Portable Code: The SAS Institute Experience

Richard D. Langston, SAS Institute Inc.

This paper illustrates how SAS Institute Inc. went about converting a large software system from PL/I and IBM® 370 assembly language into a new version written almost entirely in C. The paper briefly discusses the history of software development at the Institute, then explains the rationale behind the company's decision to convert to a new language. From there, the Institute's experience in writing portable software is discussed. In addition, the paper touches on the development process the Institute used, including the adoption of software coding standards.

Let us begin by tracing the history of programming language usage at the Institute. SAS® software was originally implemented for OS-type operating systems (MVT, MFT, MVS, VS1). The SAS System was implemented using a combination of PL/I and assembly language. The assembly language portion was known as the "supervisor" and was implemented as such for efficiency, speed, code size, and any other characteristic for which assembler is best suited. The PL/I portion of the system was that of the procedures (programs written to access and create data using the supervisor through a common set of routines). PL/I was chosen here because the procedures were performing a higher-level set of functions, such as statistical analysis and report writing. Procedure writers could spend more time concentrating on the implementation of complex algorithms instead of concerning themselves with how many base registers to have available.

This hybrid of languages served the Institute well until it was decided to transport the SAS System to other operating systems. Recall that at the time the system was first implemented, the IBM OS family of operating systems was much more widespread than the VM family, so the emphasis was placed on performance within OS systems instead of portability. Of course, the user community quickly recognized the usefulness of the VM family and began an allegiance to those operating systems that precipitated the Institute's need to convert the SAS System to run under VM.

This effort was accomplished while maintaining the aforementioned hybrid of languages. Very little PL/I code was added to support VM; most of the system modifications were to the supervisor. Therefore, most modifications were made in assembly language. The conversion goal was to change as small an amount of procedure code as possible and allow the supervisor to recognize the environment and deal with its eccentricities, freeing the procedure writers to once again concentrate on their specialties.

However, certain aspects of the VM operating system certainly made their way to the procedure writer's level. For example, the MVS operating system fills dynamically allocated memory with zeros before the application receives the pointer to the memory. The PL/I programmers were sometimes unknowingly taking advantage of this fact by not initializing variables (whose initial values would be zero nonetheless). Only when their procedures were tested under VM did this fact arise. This was the first of many occurrences of recognizing the difference an operating system can make on the performance of an application that is supposed to be operating system-independent.

The conversion for VM was accomplished primarily by "OS simulation"; that is, simulating OS SVCs and control blocks when system calls were made. However, when the Institute converted to
the DOS/VSE system, the system developers recognized the need for isolating the system-dependent portions of the code so that these could be replaced when moving to a new environment. Once again, very little of the procedure code had to be changed to function under DOS/VSE.

The SAS System remained an IBM mainstay through the 82.4 release of the product. The system did not function on any computers other than those within the 370 family (using the operating systems VS/1, MVS, VM/CMS, DOS/VSE). However, just as the Institute recognized a growing force in the user community for the VM family, there was acceptance of the fact that even more operating systems seemed likely candidates for conversion. The user community had been asking for minicomputer support, primarily on Digital Equipment Corp.'s VMS®, Data General's AOS/VS, and Prime's PRIMOS® operating systems.

But converting to those operating systems would be much different than converting from MVS to VM/CMS. After all, the IBM operating systems all use a common assembly language and PL/I compiler (not a completely true statement but, relatively speaking, close enough). The three minicomputer operating systems all had their own assembly languages and different PL/I compilers (all different implementations of the ANSI standard), and they were architecturally far removed from the IBM 370 family. A totally new implementation of the SAS System would be needed. And, to simplify this implementation, the largest possible subset of the system needed to be implemented in a portable language so that the same code would function on all three minicomputers.

The final outcome of the design was that the supervisor for the system was written in PL/I. This language was chosen for several reasons. First, almost all of the Institute programming staff was familiar with it, having written procedures in this language. Second, the intention was to once again have the procedures be as portable as possible across operating systems (IBM mainframes and the minicomputers). At the time, PL/I was the language with the most power and flexibility for which compilers were available for all the systems. Third, conversion would not be as painful since the procedures were already in PL/I, albeit using the PL/I Optimizing Compiler as a base language.

The programming staff quickly discovered that each vendor's implementation of the PL/I language was different. As the staff gained experience with the compilers, a set of coding standards was adopted to ensure that the code would compile on any of the machines. Some of the simpler decisions made were that such characters as '#' and '@' could not be used in the names of variables because the ANSI Standard did not include them. The use of # and @ in variables was a common practice among the developers using the more amiable Optimizing Compiler. Another common practice was incrementing pointers by overlaying an integer on the pointer and incrementing the integer. This became verboten in the portable implementation due to pointer sizes on some machines. These are but two of many examples of differences that were dealt with. Striving for portability across operating systems became a common goal among all developers, and this virtue carried on throughout the minicomputer implementation and is still apparent in the latest design effort using the C language.

The minicomputer and mainframe implementations merged with the advent of the Version 5 SAS System. The procedure code implemented for the minicomputers was also functional on the mainframe system. Only the supervisor was different between the two systems. The supervisor on all three minicomputer systems was the same, with the exception of small amounts of host-dependent code, necessary to perform the low-level tasks demanded of a complex system.

As the Version 5 SAS System made its way into the user community, the IBM PC was quickly evolving into a viable environment for a system as complex and expansive as the SAS System. An experimental project began on this microcomputer, using C as the implementation language. C was chosen because there was not a fully acceptable PL/I compiler available for the machine. C has many characteristics of PL/I, so its choice was not an unreasonable one. And its functionality on a microcomputer was well recognized.

Once again, the Institute was faced with needs of the user community that would require massive changes to system implementation. Was it better to develop a system written in C solely for the PC environment, similar to the Version 5 PL/I system written for the minicomputers? It was decided that the best approach was to rewrite the entire system in C in such a way that it would be portable across all operating systems, so that minimal reimplementation would be necessary when still more computer systems became viable as environments for the SAS System.

The biggest problem with this decision was compiler availability. All of the minicomputers on which the SAS System ran had good C compilers, and there were several available on the IBM PC. However, such was not the case for the IBM mainframe. The compiler from [...] Ltd. was tested, but for various reasons, it did not meet the needs for MVS development. Neither did the Waterloo compiler, which at the time was strictly a VM compiler. The solution was to transfer a PC DOS C compiler to the IBM mainframe so that sufficient control could be ensured.

The outgrowth of this solution was the agreement with Lattice, Inc. to implement their C compiler on the mainframe. After these efforts ensued, Lattice, Inc. agreed to be acquired by the Institute. After the acquisition, further enhancements were made to the PC DOS version of the compiler, concurrent with the mainframe development. Examples of these enhancements include the introduction of built-in functions to produce in-line code instead of subroutine calls, better diagnostic messages to ensure that a function declaration or prototype is always present, and improved floating-point processing to improve [...]

[...]tion's NOS/VE. The Institute also realized that many other operating systems had C compilers, and if a new implementation would function successfully on all of the systems currently available to the Institute, new operating systems could be added expediently and with much less "pain" than with VM/CMS, DOS/VSE, or the minicomputers using PL/I.

Most developers accepted the C language quite readily. As stated earlier, there are similarities between PL/I and C. The most obvious are the use of semicolons as statement delimiters and /* */ as comment delimiters. My own experience with learning C was that it was initially a "step down" from PL/I; the string-handling capabilities of C are minimal, and it has the feel of a low-level language, unlike PL/I. However, as more developers began to convert to the new language, more subroutines and macros were created and C began to have the look and feel of a high-level language like PL/I, but with the portability that C enjoys.

To ensure that this portability would truly exist and that the SAS System would indeed be transferable to many different operating systems, a more rigid design was constructed for Version 6 of the SAS System. This design, known as Multi-Vendor Architecture (MVA), is a three-tiered approach steeped in the experience of previous reimplementations. The lowest tier was the host level. A set of formally specified and documented routines would be implemented by a host development team on each host machine. The choice of implementation would be up to that team, although the desire was to use C whenever possible. Examples of functions at the host level were memory acquisition and freeing, opening and closing files, task management, and conversion of generic global representations to native forms. The end result of the host level was a host supervisor. The second tier of [...]
(the interactive editor and output manager used by the SAS System) or in the form of parsing program input, to name but two. Another characteristic of the core supervisor would be to accept calls from the procedures for services, such as managing output, reading/writing observations to SAS data sets, or managing pools of memory.

The third tier of the design was the application. The application would also be written in C. Most applications are procedures that are either part of the base product or one of the add-on products licensed by the Institute. In any case, the application would consist of a set of calls made to the core supervisor, along with the real work of the application, such as factor analysis, tabular reporting, full-screen editing of SAS data sets, and so on. The point is, as has always been the case in the SAS System design, the application writer can concentrate on implementing the task using his specialized skills; the work of interfacing to the system is handled by the core supervisor.

With this three-tiered system in place, the SAS System operates and has a similar look and feel across all operating systems. SAS code (that is, SAS program statements issued by the SAS user) can function the same regardless of the operating environment. There is ease of use for both the user of the system and the implementer of the system at SAS Institute.

The development staff, being aware of the three-tiered approach, set out on the monumental task of implementing a new system and converting existing software yet again, this time from PL/I to C. Naturally, a large number of programmers, especially those who were not conversant in C, needed a stringent set of guidelines to follow to properly code in a style of C that would be acceptable to the widest range of compilers. Developers throughout the Institute experimented with the peculiarities of the various compilers, studied the Kernighan and Ritchie standards, and came up with a set of house standards. The house standards were intended to minimize compiler complaints on all systems. The Institute had much experience with the concept of implementation standards, having found similar difficulty in dealing with several vendors' PL/I compilers in the project mentioned earlier in this paper.

The following are some examples of the house standards:

1) All external names must be seven characters or fewer. This is because some systems have a restriction on the size of external names. The C language itself and the PC version of the compiler do not have this restriction, but some systems do. (The PL/I Optimizing Compiler actually has this restriction, in that it chooses the first four and last three characters of the name and then fashions external names with this seven-character root).

2) Included H files must have names that are eight characters or fewer. (Once again, due to some systems' limitations).

3) Character range checking should only be performed by C library functions or by Institute-supplied functions. For example, if x is a numeric digit, the condition

    (x >= '0')

will be true on an EBCDIC-based system only for the digits 0-9 (and the unlikely characters 0xFA through 0xFF), but on an ASCII system this condition will be true for almost all characters. The proper way to test for numeric digits is

    (isdigit (x))

4) Always use the sizeof operator to get the size of any object. This is because different compilers pad structures differently, and some systems even have unconventional lengths for simple elements. For example, PRIMOS has a 6-byte char pointer in addition to its 4-byte pointer.

5) Do not use the & operator on non-lvalues in function calls. In the example below

    char x[50];
    abc(&x);

the argument to abc should not be passed with '&'. Some compilers get confused here and do not put
the address of x on the stack. Omitting the & will remove the confusion.

The house standards document was very useful in assisting neophyte C programmers to adjust to the new language. However, there were still limitations due to the inherent nature of the C language. One such limitation was in the use of external functions.

In the C language, an external function being used in a program is declared as being external. However, there is no mention of the attributes of the arguments to that function. For example, a function abc that expects an int and a pointer as arguments and a function def that expects two longs will both be declared as

    extern int abc(), def();

If the caller does not use the correct arguments when calling the routines (for example, an int instead of a long), there is no validity checking that can be done by the compiler. Furthermore, type differences such as int vs. long may be acceptable on one machine (when an int and a long are the same thing) and may cause serious problems on another where they are different.

Most C developers over the years have used the LINT utility to assist them in finding these argument mismatches. LINT is given the source for all interrelated programs, and it will determine if there are mismatches. However, the Institute did not find LINT to be acceptable for a variety of reasons. For one, we had to rely on a different implementation of LINT for every machine, and the inconsistency in performance was apparent. Some versions would abend quite readily. Also, we felt it was much more desirable to have the compiler tell us if we were calling the routines incorrectly. We as developers appreciated the functionality of PL/I on this issue: any deviation from the function declaration was stated in compilation warnings.

Our solution was to use function prototyping, which had been specified and approved by the ANSI C Standards Committee. The Lattice PC DOS compiler (preacquisition) had function prototyping as a feature. However, certain compilers within our target operating system range did not have this feature implemented. In order to incorporate prototyping where possible, we introduced macros to be expanded on each host as appropriate. So, for example,

    int abc U_PARMS((long));

could be expanded into a function prototype (if the compiler supports it) or simply a function declaration (if the compiler does not).

Simultaneously, we introduced the concept of structured macros. With structured macros, if a program were accessing routines from a given subsystem, the appropriate #include statement would be given for that subsystem. The "h" file included would have the necessary structure declarations within, along with the function prototypes for all the routines of the subsystem. A drawback, of sorts, was that there was no requirement that the prototypes be given for a function that was called. The compiler would accept external function calls without any declaration. We did not want to depart fully from a standard in forcing complete compliance.

Another decision made in trying to develop this large portable system was a decreased reliance on the C library functions. We found that many of them did not completely meet our needs. For example, the strcpy function requires a null terminator at the end of the source string, and that terminator is copied to the target string. We wanted to have our own copy function that did not rely on null terminators since that concept was somewhat foreign to our coding methods. The strlen function was found to sometimes give incorrect results with null strings. Granted, some problems may have been due to a given implementation of the compiler, but it was necessary for us to work within those constraints. The less we would have to contend with peculiarities of a given vendor's compiler, the better.

Included in the automated process to allow for portable
code was old-fashioned human intervention: the code review process. Each subsystem of the core supervisor was scrutinized to ensure that coding standards were followed. Other benefits of the review were to check for more efficient ways of coding algorithms, how to cut down on memory usage (a crucial factor in PC DOS), clarity of code, adequate in-line documentation, and the recognition of esoteric features that could be used by other members of the development team.

More human intervention was apparent in host committee meetings. Each of the subsystems within the host supervisor had a corresponding committee, composed of host developers from all hosts for which implementation was proceeding. The developers would discuss proposed changes to their routines, such as calling sequence, documentation enhancements, or elaboration of functionality. These committees submitted their subsystems' changed specifications periodically, resulting in continually updated host supervisor implementation guides. This allowed for a robust development environment where feedback was quickly accepted and acted upon.

In summary, we at SAS Institute have had a great deal of experience in dealing with a changing user community, and have redesigned and reimplemented our systems accordingly. One important underlying characteristic that has been maintained throughout the years is that as much code as possible should be usable on as many operating systems as possible with little or no change. PL/I served this design well in the past; the C language is serving well for Version 6 and the foreseeable future.

Furthermore, the peculiarities of each vendor's compilers and operating systems must be taken into consideration when developing a portable system, regardless of whether the language or operating system is said to be portable. To accomplish this, coding standards are absolutely essential. And, the system design must encourage the majority of the development to be totally system-independent, with the system-dependent portion layered for easy reimplementation on new hosts.

We at SAS Institute have accomplished this with the three-tiered design, implemented following our house standards. We recognized the knowledge gained from previous reimplementation projects and from the experience brought by expertise in many different operating systems. We have found that this approach has been most productive in having our software functioning properly on as many different operating systems as possible.

SAS is a registered trademark of SAS Institute Inc., Cary, NC, USA.

IBM is a registered trademark of International Business Machines Corporation, Armonk, NY, USA.

PRIMOS is a registered trademark of Prime Computer, Inc.

VMS is a trademark of Digital Equipment Corp.
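
House standards 3 through 5 described in this paper (character tests through the library, sizeof for every size, and no '&' on an array argument) can be illustrated with a minimal sketch. This is not the Institute's code: the names is_digit_char, obs_bytes, fill_blanks, and demo_fill and the struct obs layout are hypothetical and exist only for the illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical record type, used only to show sizeof on a padded struct. */
struct obs {
    char   flag;
    double value;
};

/* Standard 3: classify characters with the library instead of a range
   comparison that depends on ASCII or EBCDIC code order. */
int is_digit_char(char c)
{
    return isdigit((unsigned char)c) != 0;
}

/* Standard 4: let the compiler report sizes; padding differs by host. */
size_t obs_bytes(size_t count)
{
    return count * sizeof(struct obs);
}

/* Fills len bytes of buf with blanks. */
void fill_blanks(char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < len; i++)
        buf[i] = ' ';
}

/* Standard 5: the array name already yields the address of its first
   element, so x is passed without '&'. In ANSI C, &x is a pointer to the
   whole array, which is a different type. */
char demo_fill(void)
{
    char x[50];

    fill_blanks(x, sizeof x);
    return x[sizeof x - 1];     /* last byte written */
}
```

The cast to unsigned char before isdigit is a precaution for hosts where plain char is signed; it is an addition of this sketch and is not taken from the house standards.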

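
The prototype macro technique described in this paper, in which a declaration expands to a full prototype on hosts that support prototypes and to a plain declaration elsewhere, might look like the following sketch. The macro name follows the declaration quoted in the paper, but its definition here is an assumption, as are the HOST_HAS_PROTOTYPES switch, the body given to abc, and the helpers sign_of_int and check_abc. The function definition uses prototype syntax, so the sketch as written assumes an ANSI-conforming compiler.

```c
#include <assert.h>

/* HOST_HAS_PROTOTYPES is a hypothetical per-host switch; __STDC__ is
   defined by compilers that conform to the ANSI standard. */
#if defined(__STDC__) || defined(HOST_HAS_PROTOTYPES)
#define U_PARMS(args) args      /* keeps the parameter list: (long)  */
#else
#define U_PARMS(args) ()        /* falls back to a plain declaration */
#endif

/* The declaration form quoted in the paper. The doubled parentheses pass
   the whole parameter list to the macro as a single argument. */
int abc U_PARMS((long));

/* Hypothetical body for abc: the sign of a long value as -1, 0, or 1. */
int abc(long n)
{
    return (n > 0L) - (n < 0L);
}

/* With the prototype in scope, the int argument is converted to long
   before the call, even on a host where int and long differ in size. */
int sign_of_int(int i)
{
    return abc(i);
}

/* Small self-check that exercises the declaration. */
void check_abc(void)
{
    assert(abc(-5L) == -1);
    assert(abc(0L) == 0);
    assert(abc(70000L) == 1);
}
```

Keeping the parameter list in one macro argument means the same header line serves both kinds of compiler, so only the macro definition has to change per host.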
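
The paper mentions wanting a copy function that does not rely on null terminators, without showing one. A minimal sketch of such a counted copy follows, with the caveat that copy_bytes and trimmed_len are hypothetical names and that the blank-padded field convention in trimmed_len is an assumption made for the illustration, not something the paper states.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counted copy: moves exactly len bytes and neither looks
   for nor writes a null terminator. The regions must not overlap. */
char *copy_bytes(char *dst, const char *src, size_t len)
{
    size_t i;

    assert(dst != NULL && src != NULL);
    for (i = 0; i < len; i++)
        dst[i] = src[i];
    return dst;
}

/* Hypothetical length routine for a fixed-width field that is padded
   with trailing blanks (an assumed convention): returns the number of
   bytes before the trailing blanks. */
size_t trimmed_len(const char *field, size_t width)
{
    assert(field != NULL);
    while (width > 0 && field[width - 1] == ' ')
        width--;
    return width;
}
```

Because the caller always supplies the length, a routine of this kind behaves the same for empty fields and for data that happens to contain a zero byte, which is where terminator-based functions are most likely to differ between library implementations.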