Porting to the IBM® Mainframe Environment
Gary H. Merrill, SAS Institute Inc., Cary, NC

ABSTRACT

This paper deals with issues encountered in porting existing C programs to the mainframe environment. Following a discussion concerning what it means to say that a program is portable, this paper emphasizes program features that are likely to inhibit portability to and from the mainframe.

THE MEANING OF PORTABILITY

What does it mean to say that a program, or a body of source code, is portable? For most practical purposes (and portability is a practical concept), this means that if you have written and debugged the program so that it runs correctly on one system, when you compile and link the same code for a second system, the program will (without modification) run correctly on that system as well. Thus, portability is a relation among a body of source code and two computer systems. The two computer systems can be

• the same operating system (for example, UNIX®) on two different architectures (for example, a MC68000® and an Intel® 80286).
• different operating systems on the same architecture (for example, CMS and MVS, or UNIX and VMS™).
• two different operating systems on two different architectures (for example, MS-DOS® on an Intel 80286, and MVS on an IBM 3090™).

Each of the first two possibilities presents its own problems regarding portability, and the third combines those problems.

However, portability can be an even more complex issue than this because it not only involves a body of source code and two computer systems, but it also involves two compilers (and two run-time libraries) as well. If there is a standard for the language in which the program is implemented (as there is in the case of C), then some sense can be made of this concept of portability. In the absence of such a standard, compilers are free to differ in how they interpret a program, so the notion of portability has no content.

Given this view of portability, you may have noticed that virtually no interesting program is truly portable. It makes sense, however, to speak of degrees of portability and to recognize that some programs are more portable than others. In short, a program's portability is inversely proportional to the amount of code that must be changed in order to move it from the initial system to the new (target) system. In C, the portability of programs can be measured along the following spectrum of classes (from most portable to least portable):

Highly portable programs
• Class 1 contains programs that conform strictly to the ANSI standard in coding, that use no I/O, and that use no library functions. (Such programs appear to be of little practical value.)
• Class 2 contains programs that conform strictly to the ANSI standard in coding, use only ANSI standard I/O, and use ANSI standard library functions.
• Class 3 contains programs that take advantage in inessential ways of various features of the machine architecture (for example, the size of ints or size of pointers).
• Class 4 contains programs that take advantage in inessential ways of various features of the host and file system (for example, the format of file names, the maximum length of file names, or library functions specific to that compiler or operating system).

Potentially portable
• Class 5 contains programs that make use of certain common, but not universally available, library interfaces (for example, curses or XWINDOWS).
• Class 6 contains programs that intentionally take advantage of features of the machine architecture (for example, the segment/offset format of pointers on the Intel 8086, or the fact that byte ordering on a MC68000 is from left to right).
• Class 7 contains programs that intentionally take advantage of features of the host operating system (for example, the format of time and date information).

Nonportable
• Class 8 contains programs that make use of specific assembly language interfaces for a given machine or assembler.
• Class 9 contains programs that are coded to interface with a specific device such as an IBM 3270 terminal, a DEC VT100™ terminal, an IBM/PC display, an async port, or a particular brand of laser printer.

Most programs commonly thought of as portable fall into either class 2, 3, or 4. Once any inessentially unportable code is converted to functionally equivalent portable code, or once the minor system dependencies are isolated into just a few modules, such programs are considered highly portable. In these cases, porting the program to a new environment requires only rewriting clearly identified and isolated portions of the code.

Programs in classes 5, 6, and 7 can be considered as potentially portable, depending upon their structure and functionality. If such programs are specifically designed for a particular system, then the question of porting them may not arise. If it does and the program is otherwise well designed so that system dependencies can either be eliminated or isolated, then the program can be rewritten into a portable program. Unless such a program has specifically been written to accomplish a task tied tightly to the operating system or architecture (such as converting files to different record formats under CMS), then the chances are good that the program can easily be converted to a highly portable program. In the case of a program in class 5, porting the program may require implementing an entire library on the new target system. This is more work than most programmers are willing to do while maintaining that the underlying program is portable. But once that work is done, the resulting program (together with its new library, which should have been implemented as portably as possible) can be considered highly portable.

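To make the idea of isolating system dependencies in a few modules concrete, here is a minimal sketch in C. It assumes a hypothetical build switch TARGET_CMS and hypothetical function names sys_make_filename() and open_listing_name(); the CMS-style name layout is an illustrative assumption and is not taken from any particular compiler's library. Only the first function knows what a file name looks like on the host, so a port would touch that function alone.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical build-time switch: define TARGET_CMS when compiling for CMS.
 * Both naming layouts below are illustrative assumptions. */
#ifdef TARGET_CMS
#define SYS_NAME_FORMAT "%s %s A"   /* filename filetype filemode */
#else
#define SYS_NAME_FORMAT "%s.%s"     /* name.extension */
#endif

/* System-specific module: the only code that knows the file-name layout.
 * Returns 0 on success, -1 if the name might not fit in buf. */
int sys_make_filename(char *buf, size_t size, const char *name, const char *ext)
{
    assert(buf != NULL && name != NULL && ext != NULL);
    /* Conservative bound: the format's own length covers separators and NUL. */
    if (strlen(name) + strlen(ext) + sizeof(SYS_NAME_FORMAT) > size)
        return -1;
    sprintf(buf, SYS_NAME_FORMAT, name, ext);
    return 0;
}

/* Portable code: asks the system module for names, never builds them by hand. */
int open_listing_name(char *buf, size_t size, const char *program)
{
    return sys_make_filename(buf, size, program, "lst");
}
```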
Programs falling into classes 8 and 9 are generally considered nonportable, but even this is a matter of degree. Certainly, a full-screen debugger designed to run under MVS cannot be expected to run on a PC because both its I/O and its interactions with the operating system are at too low a level to carry over to another environment. Nonetheless, this kind of program can be designed in such a way that all calls to I/O routines and operating system services are done through relatively high-level functions, so that a substantial amount of portable code can be retained. Porting such a program then involves rewriting those modules specific to a given system.

The moral of the portability story up to this point is this: portability is a matter of degree, and the degree to which a program is portable can be enhanced by careful attention to the language standard, avoidance of system-specific code, and isolation of system-specific code to a small number of separate modules.

Let us now look at some specific factors affecting the portability of C programs.

ISSUES AFFECTING PORTABILITY

Although the topic of this paper concerns porting C to the mainframe environment, other issues that have an effect on portability in general should be considered because in modifying a program for a new environment, it is preferable that the program continue to work under the old environment as well. The goal should be to enhance the portability of the program and to create a single body of code suitable for multiple environments, rather than simply to create multiple versions of the same program, one for each environment.

Maximizing the amount of code shared in all of the target environments results in significant savings in development time and in testing and debugging efforts, and also results in a decrease in the bug level of the code. My experience with the SAS/C® and Lattice® compilers, cross compilers, and associated products has shown repeatedly that the reliability and robustness of the code is enhanced greatly by the process of porting it to diverse systems. I will say more of this later.

The next sections deal with the most common issues involved in portability of C code.

The Character Set

The character sets with which programmers are normally concerned are ASCII and EBCDIC. It is uncommon to see genuine character set dependencies in C code nowadays, and most problems can be avoided if you use explicit character constants rather than numeric values (for example, hex) to represent characters. A problem does arise if you want your program to do one of the following, as may be the case with a CMS-hosted cross compiler for a PC target:

• You want the program to run on an EBCDIC machine but do its input or output in ASCII.
• You want your program to run on an ASCII machine but do its input or output in EBCDIC.

In such a situation, the program is expected to read EBCDIC files and write EBCDIC error messages and listings, but write ASCII strings into the object files it creates for the PC. Moreover, if you want to use the same code as a CMS-hosted, CMS-targeted native compiler and also as a PC-hosted, PC-targeted native compiler, you must carefully think through the way that you want to handle character and string constants.

A couple of approaches are possible in these cases. First, you can define a number of preprocessor symbols such as the following to stand for input character (IC_) and output character (OC_) sets.

• IC_a
• OC_a

The definitions look something like this:

    #if EBCDIC_IN
    #define IC_a 0x81
    #define IC_A 0xC1
    /* More source lines */
    #else /* ASCII_IN */
    #define IC_a 0x61
    #define IC_A 0x41
    /* More source lines */
    #endif

    #if EBCDIC_OUT
    #define OC_a 0x81
    #define OC_A 0xC1
    /* More source lines */
    #else /* ASCII_OUT */
    #define OC_a 0x61
    #define OC_A 0x41
    /* More source lines */
    #endif

In your code you would then use the manifest constants IC_a, OC_a, and so on, rather than explicit character constants. By defining the symbols EBCDIC_IN and EBCDIC_OUT, you can control which character set the program uses for input and which it uses for output.

String constants (and strings in general) are a bit more problematic to handle, and for output you may want to implement a translation function called string_out(). This function can be written portably by making use of the IC_ and OC_ symbols just defined. In this way you can easily control the input and output character sets of your program and have a single set of string and character handling functions that are fully portable.

The result of writing portable code, which can be shared by all the implementations of a program, is a decrease in maintenance cost, testing time, and bug level. I once had the task of writing a code generator for a specialized controller chip and porting the Lattice compiler to run under a CMS host and to generate assembly code output for this special chip. The CMS development system available to me was an AT®/370, which was not particularly fast (it took about eight hours to compile and to link the entire compiler), so I did my development under MS-DOS on a normal AT. For testing purposes, this required running the compiler under MS-DOS, transferring the assembly code to the CMS system, and running the CMS-hosted assembler on that code. In turn, this required that my AT-hosted cross compiler accept its input in ASCII (from MS-DOS source files), that its text output also be in ASCII, but that hex constants used to initialize character and string types be the EBCDIC hex values. My technique to accomplish this was similar to the one just outlined.

I did all preliminary testing of the code generator on MS-DOS and then ported the entire compiler to a Sun Workstation®. Forcing the program to run on an MC68000 under UNIX flushed out a number of bugs not caught under MS-DOS, and when the program was ported to the CMS host, there was only one bug related to portability and input/output. I will mention what this bug was later when I discuss UNIX assumptions in C code. The point I want to make here is that writing for portability makes your code more robust and can save you significant time in debugging.

A problem involving character sets also occurs when porting C code that has been generated by using such tools as the UNIX YACC parser generator. In a grammar specification it is common to see productions such as the following, where explicit character constants are used:

    expr : expr '+' expr
         | '(' expr ')'
         ;

This does tend to make the grammar a bit more intelligible, but it leads to difficulties if YACC is run on an ASCII system (which most UNIX systems are). This is because the generated C code contains arrays that are initialized via integer constants, so the ASCII values of '+' (43), '(' (40), and ')' (41) appear in the table. Even if this C code is then compiled on an EBCDIC machine, the resulting parser will not correctly parse EBCDIC input. The way out of this is to never use explicit character constants in your YACC code. Instead, use the YACC %token feature to define token types, as in this example:

    %token PLUS LPAR RPAR

    expr : expr PLUS expr
         | LPAR expr RPAR
         ;

Your lexical analyzer (which, of course, must be sensitive to the correct character set) returns the appropriate token representation to the parser, and YACC generates its parsing tables using a representation of these tokens independent of the character set.

Sizes and Representations of Fundamental Types

In general, porting code to the mainframe does not encounter problems concerning the sizes of fundamental types. If you are porting code from the mainframe to a smaller machine, you may encounter such difficulties if the programmer has made one of the following assumptions, because both of these assumptions are false on such machines as the 8086 and 80286:

• Integers are as big as longs.
• Integers are as big as pointers.

Code such as the following can still commonly be found in many UNIX programs, which were originally implemented on a VAX machine:

    unsigned *u;
    int count;
    /* increment unsigned int pointer by */
    /* 'count' number of bytes           */
    u = (unsigned *)((int) u + count);

Here, a pointer to an unsigned integer is cast to (that is, treated as) an integer, the integer is incremented by a byte count, and the result is cast back to a pointer to an unsigned integer. Why is this done? Usually there is a misguided striving for some kind of efficiency behind such code because it amounts to circumventing the manner in which the C compiler scales additions to pointer variables. However, the code is nonportable and will not work at all in any environment where pointers and integers are not the same size, or where pointers have such a sufficiently odd representation that conversion to and from int is virtually meaningless. Such environments are found on the 8086 and 80286 architectures with their segment/offset representation of pointers, and it is possible that in the future these problems could arise on larger machines as well. For the most part, it is wise to avoid such clever programming tricks and to let a good optimizing compiler handle code generation for you.

You should be aware that there is no completely standard representation of floating-point types (float and double) across all machines. The IEEE format is perhaps the most popular, but this is not the format employed on the mainframe, and the VAX has yet other floating-point representations. Size of the data objects is not usually an issue here, but representation is because it affects the range that can properly be covered by a floating-point variable. Careful use of the ANSI symbols defined in the float.h header (FLT_RADIX, DBL_DIG, FLT_MAX_EXP, and so on) goes a long way toward making your program portable, but pre-ANSI C programs will not have been able to take advantage of this approach. Be sensitive to this fact when porting such programs.

Signed Characters

In some C programs, character variables are used where you might otherwise expect int variables. For example, if you want to define a flag that can have one of the values -1, 0, or 1, you may choose to use a character variable for the flag since you know its value will never be outside the range of such a variable. This kind of approach is usually an attempt to save space, but the result is almost always a false economy.

A more serious use of signed characters may occur if the program needs large arrays (to represent tables or matrices, for example), but the elements of these arrays never exceed the range of a signed character. Then, you may reasonably define such arrays to be of signed character type rather than int type. On a PC, this saves you 1 byte per element, while on a VAX, 68000, or mainframe it saves you 3 bytes per element. If your tables are several thousand elements (or more) in size, the space saving can be substantial.

As an example of this, I have recently seen a parser written in YACC that, when generated with the Berkeley YACC tool, results in tables (of shorts) almost 50K in size. If these tables can be redefined as tables of chars (and some of them can), the resulting program shrinks by almost 50 percent.

But signed characters should be avoided at all possible cost on the mainframe. This is because the mainframe instruction set does not support many useful operations on signed characters, so operations or comparisons involving signed characters typically require conversion to int with sign extension. Accordingly, code involving signed characters is extraordinarily inefficient on the mainframe, and so signed chars should be replaced by ints wherever possible.

If you are seized by the desire to reduce your amount of static data by using signed chars instead of ints, then you must carefully weigh the resulting decrease in data size against the increase in code size and degradation of performance.

A good mainframe compiler treats all characters as unsigned by default, with a compile-time switch available if this is not desired. The SAS/C Compiler takes this approach. It is therefore possible to port a program that has worked perfectly well on a variety of machines (such as the PC, the 68000, and the VAX) and have it fail in mysterious ways on the mainframe. I have just had such an experience in porting the GNU EMACS editor to AIX™/370. Everything worked well, except in certain esoteric areas where the editor was supposed to compile certain of its text files into an internal representation. Some characters simply were not being translated correctly, and this was traced to the assumption that char meant signed char. Because this assumption is reasonable on most machines but not on the mainframe, you should be sensitive to it when porting to that environment.

Use of the Int Type

The int type is a deceptively dangerous element of C. Its real value lies in its contribution to the efficiency of portable programs since it is intended to be implemented as the most efficiently sized of the
integral types. Normally this means that the size of an int is the size of a register on the host machine. Thus, on the mainframe, VAX, and 68000 ints are 32 bits, while on smaller machines such as the 8086 they are 16 bits.

The idea is that if you define a particular variable (for example, a loop counter) as an int, the code using that variable will be efficient whether it is compiled for a 32-bit or a 16-bit machine. However, if you define that same variable as a short or a long, on some machines the code will be efficient, while on others it will not.

A consequence of this reasoning is that you should never use an int if you are concerned about possible overflow or underflow problems. You should never use an int if you are saying to yourself, "Well, ints are the same as longs, so this will work just fine." An int is not the same as a long; it just happens to be the same size on some machines. Beware of this confusion.

Again, this is not a problem when porting code to the mainframe, but in making necessary modifications to accomplish such a port, you should take care not to decrease the portability of the resulting code.

Byte Ordering

The byte ordering on the mainframe (and the 68000) is left-to-right (that is, high byte at the lowest address). On the VAX and the PC it is right-to-left. That is, in memory, the hex number 0xAABBCCDD looks like the following on the mainframe:

    AA BB CC DD

It looks like this on the VAX and PC:

    DD CC BB AA

Which of these is more natural depends upon how you were brought up.

It may seem unlikely that any C code would run afoul of differences of byte ordering, but I have seen this happen more than once in the case of porting tools to the mainframe (or 68000) that were originally implemented on the PC. Cross compilers, cross assemblers, and cross linkers are some good cases in point. For example, your cross tool may take advantage of the host arithmetic operations (as is normally the case) and then need to emit resulting values to an output file to be used on the target system. If the value 0xAABBCCDD computed on the mainframe is simply written directly to the output file for the PC, it will be in the wrong byte ordering and will be interpreted by the PC as 0xDDCCBBAA. For programs such as these, the bytes of arithmetic items must be reordered on output if the correct result is to be achieved.

It is also possible, of course, to run afoul of byte ordering differences through poor program design. I still recall such a case when porting a popular link editor for the PC to run on a 68000 UNIX system. The program used an array of longs as offsets into a section of code, or as fixups. At one point, the author of the program wanted to update the values of these longs, but he chose to do so by using a pointer to the array that he cast as a short pointer so he could access the individual shorts (for example, one part of the long might have been the segment and the other part the offset). The manner in which this was done, of course, depended upon the backwards ordering of longs on the PC, and it did not work well on the 68000 at all. It would have had the same problem in porting to the mainframe.

Byte-ordering problems are not common when porting code to the mainframe environments, but you should be aware that they can exist.

UNIX Assumptions

Another fairly large class of portability problems falls under the general description of UNIX assumptions. Since C was developed on and for UNIX systems, and since most C code has been written for such systems, it is likely that you will encounter such assumptions in porting code.

One assumption could be called the a-file-is-a-file assumption. All files are the same to UNIX programmers. They know nothing of records. Text files are composed of lines (separated by the newline character), and binary files are just uninterrupted streams of bytes. In fact, there is really no difference between text and binary files except that when you display the text file, you can make sense of what you see.

On systems other than UNIX, it is common to encounter a number of different kinds of files: text files, binary files, files composed of fixed-length records, files composed of variable-length records, and so on. Tools originally designed to run on systems similar to UNIX often need to have their file I/O rewritten to work correctly (or efficiently) on such systems as MVS, CMS, or VMS. An excellent treatment of the potential difficulties in this area can be found in Chapter 8 of the SAS/C Library Reference, Second Edition, Volume 2, but here is a summary of the relevant points:

• Use text access only for those files that truly contain text (printable characters). Even so, you may want to use binary access if you will be seeking about in the file.
• If your program accesses a binary file and needs to know exactly where the end of this file is, adopt a convention (such as a control character or sequence of control characters) to be used as an end-of-file marker. Then have the program write such a marker each time it needs to mark the end of the file. Without this approach, the file may be padded with zeros after the program writes the last byte. And when your program (or another one intended to work with it) reads or seeks to the end of the file, it will go right past what you thought was the last byte, gobbling up the extra padding. In such a case you need to do your own end-of-file determination.
• Beware of a program that writes a sequence of trailing blanks at the ends of its text lines and expects to read these blanks back in when the file is read. The ANSI standard permits a C implementation to pad with blanks when a line is written (for example, to an F format file) and to remove trailing blanks when the file is read. Thus, trailing blanks on a text line cannot be treated as meaningful by the program.
• Some systems (such as CMS) do not support lines containing no characters. The ANSI standard permits a C implementation to write a line containing a single blank in place of an empty line. Thus, it is not possible to distinguish an empty line from a line containing only one blank. Some programs depend upon such a distinction and need to be modified as part of the porting process.
• Programs depending upon single-character buffered terminal I/O (in the manner of common screen editors for UNIX and the PC) need to be rewritten because full-duplex protocol is not supported by CMS and MVS. A similar fate befalls programs which assume that screen formatting is to be handled by standard control sequences in the manner of the VT100.

• Programs designed to share their files with other programs, or to use pipes, need to be modified because the mainframe operating systems offer little or no support in such areas.
• Programs (such as the UNIX make) that expect to be able to tell when a file was last accessed or modified, in general, will fail or require significant revision. Under MVS, such information is simply not available. Some editors store such information in a control area of the file, but this is difficult to access and is not reliable because access could be made to the file through some means other than such an editor.

There are also other UNIX assumptions that may better be called UNIX bugs. I mentioned earlier that after having developed a cross compiler on an MS-DOS system and having ported it successfully to a 68000 UNIX system, I encountered only one portability bug when the program was ported to CMS. This came about because the program attempted to open the same file twice! Of course, this was an error in the program, and it turned out to be a harmless one. But both MS-DOS and UNIX allowed the second open. CMS did not.

In a similar recent case, the Berkeley version of YACC was ported to a variety of environments, including CMS and MVS. Having previously ported the tool to MS-DOS and various UNIX systems, I did not expect much trouble and was somewhat surprised that the program dramatically crashed just before it attempted a successful termination. It seems that an attempt was being made to unlink (that is, remove) certain temporary files prior to closing them. Though systems similar to UNIX do not object to such silliness, the mainframe operating systems do.

Two other problems arose in this port as well. The first stemmed from the assumption that you can ask malloc() to allocate 0 bytes of memory and it will return to you a meaningful pointer without raising an error condition. This is true on most UNIX systems, but false on the mainframe. Finally, there was the assumption (actually more of an oversight or coding error) that you can meaningfully reference deallocated memory if you do it quickly enough. This involves a technique where in deallocating a linked list, you first deallocate a node, and then use the link pointer of that node to reference the next node. The mainframe operating systems appear to object to such references to deallocated memory.

WRITING PORTABLE CODE

Writing portable code in C (or at least writing code that can be ported easily) is not difficult. It requires only a bit of attention to detail and a sensitivity to those issues that can affect portability. In addition, you should not expect certain kinds of code (for example, screen-oriented programs, programs closely related to file system structure, and so on) to have a high degree of portability.

The best guide to writing portable code in C is the ANSI standard (known officially as the American National Standard for Information Systems - Programming Language C, X3.159-1989). Programs that conform to the language and library specifications in this document will be maximally portable, and any remaining system dependencies can be handled through the usual technique of conditionally compiled code and by isolating such dependencies in a small number of system-specific modules.

REFERENCES

American National Standards Committee (1988), American National Standard for Information Systems - Programming Language C, Document Number X3.159-1989, Washington, DC: X3 Secretariat: Computer and Business Equipment Manufacturers Association.

SAS Institute Inc. (1989), SAS/C Library Reference, Second Edition, Volume 2, Cary, NC: SAS Institute Inc.

AT is a registered trademark of International Business Machines Corporation, Armonk, NY.
AIX is a trademark of International Business Machines Corporation, Armonk, NY.
DEC VT100 is a trademark of Digital Equipment Corporation.
IBM is a registered trademark of International Business Machines Corporation, Armonk, NY.
3090 is a trademark of International Business Machines Corporation, Armonk, NY.
Intel is a registered trademark of Intel Corporation, Hillsboro, OR.
Lattice is a registered trademark of Lattice, Inc., Lombard, IL.
MC68000 is a registered trademark of Motorola, Inc.
SAS/C is a registered trademark of SAS Institute Inc., Cary, NC.
Sun Workstation is a registered trademark of Sun Microsystems, Inc.
UNIX is a registered trademark of AT&T, New York, New York.
VAX is a trademark of Digital Equipment Corporation, Maynard, MA.
VMS is a trademark of Digital Equipment Corporation, Maynard, MA.

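As a supplement to the Byte Ordering discussion, the following is a minimal sketch in C of reordering the bytes of an arithmetic value on output, as a cross tool hosted on the mainframe might do for a PC target. The function names put_u32_low_first() and emit_u32_low_first() are hypothetical, and the sketch assumes the target wants 32-bit values with the low-order byte first. Because the bytes are extracted by shifting the value rather than by inspecting memory, the output should be the same whatever the host's own ordering.

```c
#include <assert.h>
#include <stdio.h>

/* Store the low 32 bits of value into out[0..3], low-order byte first
 * (the PC and VAX ordering). Shifts act on the value, not on its
 * in-memory layout, so the host byte ordering does not matter. */
void put_u32_low_first(unsigned char out[4], unsigned long value)
{
    assert(out != NULL);
    out[0] = (unsigned char)(value & 0xFF);
    out[1] = (unsigned char)((value >> 8) & 0xFF);
    out[2] = (unsigned char)((value >> 16) & 0xFF);
    out[3] = (unsigned char)((value >> 24) & 0xFF);
}

/* Write the four reordered bytes to a stream opened for binary output.
 * Returns 0 on success, -1 on a short write. */
int emit_u32_low_first(FILE *fp, unsigned long value)
{
    unsigned char bytes[4];

    put_u32_low_first(bytes, value);
    return fwrite(bytes, 1, sizeof bytes, fp) == sizeof bytes ? 0 : -1;
}
```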
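The linked-list deallocation error described under UNIX Assumptions can be avoided by saving the link pointer before the node is freed. The sketch below is one way to do that in C; the struct node layout and the function names list_push() and list_free() are hypothetical and serve only to illustrate the ordering of the two steps.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical singly linked list node, for illustration only. */
struct node {
    int value;
    struct node *next;
};

/* Push a new node on the front of the list.
 * Returns 0 on success, -1 if memory could not be allocated. */
int list_push(struct node **headp, int value)
{
    struct node *n;

    assert(headp != NULL);
    n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->value = value;
    n->next = *headp;
    *headp = n;
    return 0;
}

/* Free every node and return the number freed. The link is copied out
 * of each node BEFORE that node is released, so no deallocated memory
 * is ever referenced. */
size_t list_free(struct node *head)
{
    size_t freed = 0;

    while (head != NULL) {
        struct node *next = head->next;  /* read the link first */
        free(head);                      /* then release the node */
        head = next;
        freed++;
    }
    return freed;
}
```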