Habits that Help: Developing Good Programming Style
Casey Cantrell, Clarion Consulting, Los Angeles, CA

ABSTRACT

The first goal of most new programmers is to write programs that run. Often, more time is spent checking and debugging programs than is spent writing them. As programmers mature, they typically find it easier to test, debug and document their programs. This occurs in large part because they have developed a consistent programming style and acquired habits that expedite error detection and documentation. Putting into practice a few simple procedures will yield significant benefits: greater accuracy, fewer hours invested and less aggravation. This paper discusses practical programming ‘habits’ that will make the programmer’s work easier.

INTRODUCTION

One evening many years ago, while I was talking shop with some workmates, the conversation turned to the question of "What makes a good program?" Without thinking much about it, I stated imperiously that a good program was one that could complete a task with the fewest lines of code. My colleague, and senior partner, considered this for a while before offering his definition. "A good program," he said, "is one that can complete a task with the fewest lines of code that someone else can understand."

In the intervening years, I've come to realize how right he was and would delete only the words “someone else” from his definition. While it is certainly important to write code that others can work with, the programmer who does so also profits from the effort. Well-written programs are easier to maintain, modify and debug. Programs that are well organized and conform to a consistent style facilitate documentation. Carefully planned programs provide for efficient use of the most important resource of all: the programmer’s time. Time invested in writing good programs will pay off handsomely for everyone in the long run.

Although each programmer will develop his or her own style, there are habits common to all good programmers. First, they write with clarity and consistency, opting for simplicity whenever possible. Second, they test their programs and meticulously check results. Third, they write flexibility into their programs. Because their programs are "data-driven", they require little or no modification when the data change. Finally, they document extensively. In short, good programmers write with an eye towards simplicity and flexibility. Their programs work, require little maintenance and are organized to expedite documentation.

GOOD HABIT #1: WRITE CLEAR CODE

Writing clear code involves not only writing programs that are easy to understand, but also writing programs that are easy to read. Get in the habit of using adequate spacing, blank lines and indentation. SAS® is free format, meaning you may start your program on any column of any line, continue your program over several lines, and include as many blank columns and blank lines as you want. Like a good designer, use white space to your advantage. Do not try to squeeze your programs into as few lines as possible, and do not put multiple statements on a single line. Remember that other programmers may read your programs on their computers, so format your code to fit comfortably on the screen. The programs below differ only in layout, but Program 1b is far easier to read.

   DATA BOTHGEND; MERGE MEN (RENAME=(POP = MENPOP)) WOMEN; BY RACE AGE;
   PCT_MEN = (MENPOP/(MENPOP+ POP))* 100; Run;

Program 1a

   DATA BOTHGEND;
      MERGE MEN (RENAME=(POP = MENPOP))
            WOMEN;
      BY RACE AGE;

      PCT_MEN = (MENPOP / (MENPOP + POP)) * 100;
   RUN;

Program 1b (easier to read)

These two programs are also identical. Program 2b uses spacing and indentation to improve legibility.

   Data bothgend; set men (in=menfile) women (in=womfile); if menfile then
   sex='M'; else if womfile then sex='F'; run;

Program 2a

   data bothgend;
      set men   (in=menfile)
          women (in=womfile);

      if menfile then sex = 'M';
      else if womfile then sex = 'F';
   run;

Program 2b (easier to read)

Get in the habit of beginning each program with a program header. The header should include your name, dates the program was written and modified, the program file name and location, and a description of what the program does. You may also include input and output files and any pertinent notes. An example follows.

   /*****************************************************************************/
   /* MDsample.sas    CC 2/21/10                                                */
   /* :/CMC3/Baseline/SAS Programs                                              */
   /*                                                                           */
   /* Reads:  c:/CMC/Baseline/Data/PatientWave1                                 */
   /* Writes: c:/CMC/Baseline/Analyis/PilotSample                               */
   /* Create a sample of patients who completed the survey representing         */
   /* 7.5% of each MD's patients.                                               */
   /*                                                                           */
   /* Modified 3/5/10 to increase sample to 10%.                                */
   /* Note: Dr. Smith retired 12/31/09. Dr. Jones has taken over the practice   */
   /*****************************************************************************/

Get in the habit of using comments throughout your programs. Explain what you are doing and why. Describe problems, anomalies and complicated logic. If there is anything unusual about your program or data, make note of that. What may seem perfectly clear to you at 3am tonight may not be as clear six months from now. Comments are also a valuable way to organize and clarify your own thinking.

   /* Bookend style comment */
   * This is also a comment ;

SAS pays no attention to semicolons used within bookend style comments, so you may block out entire sections of code you do not want SAS to execute. This is especially helpful when testing or debugging a program.
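As a small illustration of blocking out code, the sketch below disables a PROC PRINT step while leaving the DATA step active. The data set name `demo` and its variables are made up for this example.

```sas
/* Hypothetical data set, created only for this illustration */
data demo;
   dog_age  = 5;
   humanyrs = dog_age * 7;
run;

/* The PROC PRINT step below is blocked out while testing.
   The semicolons inside this comment are ignored by SAS.

proc print data=demo;
   title "Dog age in human years";
run;

*/
```

Bookend comments do not nest, so a section that already contains `/* ... */` comments cannot be wrapped this way without first removing or converting the inner comments.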

Use the LABEL= data set option on the DATA statement to identify and describe files. If the file is a subset of another file, make note of that on the label. Later, this information may prove invaluable when you are searching for a particular file or trying to figure out how one file differs from several others just like it. Program 3 uses the LABEL= option, and the result is shown in Figure 1.

   data analysis1 (label="Subset of COMPLETED. Deletes records for MDs who did not complete baseline interview.");
      set newstat.allwave2;
   run;

Program 3

Figure 1

Separate steps and conceptually distinct tasks for clarity. SAS will end and execute a step when it encounters the key words DATA or PROC, which designate the start of a new step. Nonetheless, get in the habit of including the "run;" statement to delineate steps in your programs. This will make it easier to follow your program logic and help you resist the temptation to mix DATA and PROC steps inappropriately. Program 3b illustrates the use of the “run;” statement, which is not used in Program 3a.

   DATA ages;
      dog_age=5;
      humanyrs=dog_age * 7;

   PROC PRINT data=ages;
      TITLE "Dog's age in human years";

Program 3a

   DATA ages;
      dog_age=5;
      humanyrs=dog_age * 7;
   run;

   PROC PRINT data=ages;
      TITLE "Dog age in human years";
   run;

Program 3b (includes RUN; statements)

Simplify complex expressions. Break tasks into steps. Use multiple statements if necessary to clarify your logic. Programs 4a and 4b do exactly the same thing: read a text variable (patient) and write out a new variable (name) in a different format, as illustrated in Figure 2. Which program is easier to read? Which would be easier to modify?

Figure 2

   data mailing;
      set patnames;
      name=substr(patient,index(patient,',')+2,length(patient)-index(patient,',')-1) || ' ' || substr(patient,1,index(patient,',')-1);
   run;

Program 4a

   data mailing;
      set patnames;

      comma = index(patient,',');
      longn = length(patient);

      lname = substr(patient,1,comma-1);
      fname = substr(patient,comma+2,longn-comma-1);

      name = trim(fname) || ' ' || lname;
   run;

Program 4b (easier to read and follow)

Avoid compound expressions when possible, and be particularly stingy in your use of NOTs and nested ANDs/ORs. Use DO groups (IF-THEN/DO blocks) instead to clarify your logic. Compound expressions can also be less efficient, since SAS may evaluate every condition even when the first one is false. Evaluations stated in the affirmative are also easier to follow. Consider the following pairs of programs. The programs in each pair perform the same task. Which is easier to follow?

   if sex='M' AND race NE 'B' AND race NE 'W' …;

Program 5a

   if sex = 'M' then do;
      if race ne 'B' and race ne 'W' …
   end;

Program 5b (DO group)

   if elisa=1 & wb24 >1 & wb24 <4 & (wb120=1 or wb160=1) then westblot='1';

Program 6a

   if elisa = 1 then do;
      if 1 < wb24 < 4 then do;
         if wb120=1 or wb160=1 then westblot='1';
      end;
   end;

Program 6b (DO groups)

GOOD HABIT #2: DEVELOP A CONSISTENT STYLE

Get in the habit of organizing your programs using a consistent scheme. Define arrays, create labels and declare length attributes at the same place in your programs and group similar statements together. This makes it easier to see what you've done and easier to modify. Programs 7a and 7b again perform the same tasks.

   Data forlogit;
      set analyfile;

      length preteen $1;
      if 0 < age < 12 then preteen ='1';
      else preteen = '0';
      label preteen = 'Age < 12 years';
      format preteen $teenf.;

      length senior $1;
      if 65 < age < 99 then senior ='1';
      else senior = '0';
      label senior = 'Age > 65';
      format senior $seniorf.;
   run;

Program 7a (LENGTH, LABEL and FORMAT assignments scattered)

   Data forlogit;
      set analyfile;

      length preteen senior $1;

      if 0 < age < 12 then preteen ='1';
      else preteen = '0';

      if 65 < age < 99 then senior ='1';
      else senior = '0';

      label preteen = 'Age < 12 years'
            senior  = 'Age > 65';

      format preteen $teenf. senior $seniorf.;
   run;

Program 7b (LENGTH, LABEL and FORMAT assignments grouped)

Keep SAS default actions in mind. Remember that new variables built from existing ones will inherit the attribute type of the original variable. Unless you declare it, the length of a new character variable is set the first time the variable appears, here by the three-character value 'Kip'. Use explicit LENGTH statements when necessary and define them before using the variable in your code. Figure 3 illustrates the output generated by Program 8a, in which the longer names are truncated. Use of the LENGTH statement in Program 8b corrects the truncation, as shown in Figure 4.

   Data litter;
      If AKC_reg='7785' then name='Kip';
      else if AKC_reg='7181' then name='Presto';
      else if AKC_reg='9876' then name='Blaze';
   run;

Program 8a

Figure 3

   Data litter;
      length name $10;     /* LENGTH assignment */
      If AKC_reg='7785' then name='Kip';
      else if AKC_reg='7181' then name='Presto';
      else if AKC_reg='9876' then name='Blaze';
   run;

Program 8b

Figure 4
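The effect of the LENGTH statement can be tried with a self-contained sketch. The input data set `litter_in` and its registration numbers are hypothetical, entered inline so that the example runs on its own.

```sas
/* Hypothetical registration numbers, entered inline for illustration */
data litter_in;
   input akc_reg $;
   datalines;
7785
7181
9876
;
run;

data litter_fixed;
   set litter_in;
   length name $10;   /* declared before the first assignment to NAME */
   if akc_reg = '7785' then name = 'Kip';
   else if akc_reg = '7181' then name = 'Presto';
   else if akc_reg = '9876' then name = 'Blaze';
run;

proc print data=litter_fixed;
   title "Names stored with an explicit LENGTH of 10";
run;
```

Removing the LENGTH statement would give NAME a length of 3, taken from 'Kip', so the last two names would be stored as 'Pre' and 'Bla'.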


Pay attention to the location of subsetting IF statements. Don't bury them in your programs.

   Data forlogit;
      set analyfile;

      length preteen senior $1;

      if 0 < age < 12 then preteen ='1';
      else preteen = '0';

      if 65 < age < 99 then senior ='1';

      label preteen = 'Age < 12 years'
            senior  = 'Age > 65';

      if plan ne 5;
   run;

Program 9a (subsetting IF buried at the end)

   Data forlogit;
      set analyfile;
      if plan ne 5;        /* subsetting IF statement */

      length preteen senior $1;

      if 0 < age < 12 then preteen ='1';
      else preteen = '0';

      if 65 < age < 99 then senior ='1';

      label preteen = 'Age < 12 years'
            senior  = 'Age > 65';
   run;

Program 9b (subsetting IF immediately after SET)

Devise a consistent system for naming files and variables. For example, you might use a prefix to identify date variables (d_birth, d_test, d_terminate) or variables collected on an exit survey: (x_tbtest, x_bloodtest, x_hivtest). Whenever possible, use meaningful and descriptive variable names and be certain to use a consistent coding scheme for similar variables, e.g., 1=True and 0=False. When naming like variables, give them common roots and consecutively numbered suffixes, so you can later work with them in abbreviated format using series notation.

   Variable list:     test1 test2 test3 test4 test5 test6 test7 test8 .....
   Abbreviated list:  test1 - test8
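A brief sketch of how the abbreviated list can be used once the variables share a root and numbered suffixes. The data set `scores` and its values are invented for the example.

```sas
/* Hypothetical scores for two subjects, entered inline for illustration */
data scores;
   input id test1-test8;
   datalines;
1 80 85 90 75 88 92 79 84
2 70 65 . 72 68 74 71 69
;
run;

data summary;
   set scores;
   /* Numbered range list: test1 through test8 */
   avg_test = mean(of test1-test8);
   n_tests  = n(of test1-test8);
run;

proc means data=summary n mean min max;
   var test1-test8;
run;
```

Variables that share a prefix can be referenced in a similar way with a colon, for example `d_:` for every variable whose name begins with `d_`.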

Always specify data set names in PROCs and DATA steps. By default, when no data set name is specified, SAS will use the last one created. This can be dangerous. Get in the habit of naming input data sets as shown in Program 10b. Your programs will be much easier to follow and far less likely to contain errors if you do.

   DATA mysubset;
      set master;

   PROC freq;
      Tables gender * status;

   DATA subset2;
      set;

   PROC means;
      var weight1;
   run;

Program 10a

   DATA mysubset;
      set master;
   run;

   PROC freq DATA=mysubset;
      Tables gender * status;
   run;

   DATA subset2;
      set mysubset;
   run;

   PROC means DATA=subset2;
      var weight1;
   run;

Program 10b (data set names specified)

If you are running your programs interactively, do not depend entirely on the LIBRARY icon. Include fully defined LIBNAME statements in your programs to permanently document where your input files came from and where your output files were stored. The library shown in Figure 5 is better defined using the fully qualified LIBNAME statement that follows.

Figure 5

   libname baseline "//katzen/e/NHIS2002";     /* fully defined LIBRARY name */

Select variables to include in your files using KEEP= and DROP= data set options instead of using KEEP and DROP statements, so you can easily find them in your programs. Since well-written programs facilitate documentation, use KEEP= instead of DROP= whenever possible. Knowing what variables were included in your file is far more useful than knowing what variables were not.
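A minimal sketch of the KEEP= data set option on both the input and the output file. SASHELP.CLASS, a sample data set shipped with SAS, stands in for a real input file, and the output name `teens` is hypothetical.

```sas
/* KEEP= on the input file limits what is read;
   KEEP= on the output file documents what is written. */
data teens (keep=name age);
   set sashelp.class (keep=name age height);
   if age >= 13 and height > 60;
run;

/* Confirm which variables ended up in the output file */
proc contents data=teens;
run;
```

Because both lists sit on the DATA and SET statements, a reader can see at a glance which variables enter and leave the step.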

GOOD HABIT #3: CHECK YOUR PROGRAMS

PROC CONTENTS provides a plethora of information about SAS files. Use it to acquaint yourself with your input data so you know what to expect from your output files. Consider attribute type and variable ranges. Are your data stored in character or numeric formats? What do the distributions look like? How are missing values stored? Are they stored as SAS missing values or as "9"s?
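For readers who have not used it, a PROC CONTENTS call is short. The sketch below runs it on SASHELP.CLASS, a sample data set shipped with SAS that is used here only as a stand-in for your own input file.

```sas
/* Default listing: variables in alphabetical order */
proc contents data=sashelp.class;
run;

/* VARNUM lists variables in the order they are stored in the file */
proc contents data=sashelp.class varnum;
run;
```

The listing reports each variable's type, length, format and label, along with the number of observations in the file.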

Get in the habit of reading the SAS program log. New programmers all too often scan the log looking only for errors. Just because your program has run successfully does not mean it has done what you intended for it to do. The log provides invaluable diagnostic information. Learn to use it.

Pay attention to WARNINGS as well as error messages. A WARNING message indicates a potential problem, but your program will not abort. SAS will underline the problem and provide a message as to its cause. Review these carefully to make sure your program is in fact working. Sometimes the problem is as simple as a misspelled word that SAS nonetheless recognizes (see Figure 6), but other times the warning may involve a legitimate error (as shown in Figure 7).

Figure 6


Figure 7

Make a habit of reading NOTES in the SAS log. When your program creates a new SAS file, the log will specify how many variables and records were written to it. If you are reading raw data into SAS, the log will provide information about input file attributes as well as the location of the input file. Check that you have read the correct input file and that your output file contains the number of records you expect. Look carefully for what may be systematic errors, such as notes about invalid data or missing values.

Figure 8 (SAS log annotated to show input file attributes, records read, records written and variables written)

Figure 9 shows a program that ran successfully. Or did it? The “Invalid data” note may indicate a problem as does the “NOTE: SAS went to a new line when INPUT statement reached past the end of a line.” Both warrant closer scrutiny. Finally, the LOG indicates three observations were written to the output file. Is this correct?

Figure 9 (SAS log with possible errors flagged)

Look closely at your output file. Get in the habit of running PROC CONTENTS and reviewing the listing with some care. Is the sample size what it should be? Are the variable attributes what you expect? Have they been named correctly? Was the file written to the correct output destination?

Browsing the alphabetical variable listing is a good way to check for spelling errors. See Figure 10.

Figure 10 (possible errors flagged)

Print several records and look at them. Run frequencies and review variable distributions. Does anything look unusual?

Figure 11 (possible errors flagged)

Figure 12 (possible errors flagged)
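Printing a few records and running frequencies takes only a handful of statements. The sketch below uses SASHELP.CLASS, a sample data set shipped with SAS, in place of a real output file.

```sas
/* Print only the first 5 records */
proc print data=sashelp.class (obs=5);
   title "First 5 records";
run;

/* One-way frequencies for two variables */
proc freq data=sashelp.class;
   tables sex age;
   title "Distributions of SEX and AGE";
run;

title;   /* clear the title so it does not carry over to later output */
```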


Investigate missing values. Run cross tabs and be alert for anomalies.

Figure 13 (possible errors flagged)
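One way to bring missing values into view is the MISSING option on the TABLES statement of PROC FREQ. The data set `survey` and its values below are hypothetical.

```sas
/* Hypothetical survey records; a period marks a missing numeric value */
data survey;
   input id gender $ status;
   datalines;
1 F 1
2 M .
3 F 2
4 M 1
5 F .
;
run;

/* MISSING treats missing values as a category in the table */
proc freq data=survey;
   tables gender * status / missing;
run;
```

With the MISSING option, records with a missing STATUS appear as their own column of the cross tab instead of being reported only as a count beneath the table.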

Generate univariate statistics and check for unreasonable values. There are myriad SAS tools that will help you locate possible errors in your program logic. Make it a habit to use them. If you have acquainted yourself with your input data, you will be better prepared to identify potential problems.

Figure 14 lists statistics for puppies’ weights in pounds at 1 week and at 6 weeks. Would these be suspicious if the unit of measurement were ounces?

Figure 14 (possible errors flagged)
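A quick range check of this kind can be sketched with PROC MEANS. The data set `pups` and its weights are invented, and one value is deliberately implausible.

```sas
/* Hypothetical puppy weights in pounds at 1 week and 6 weeks */
data pups;
   input name $ weight1 weight6;
   datalines;
Kip 1.2 7.5
Presto 1.1 6.9
Blaze 1.3 75
Jet . 7.1
;
run;

/* N and NMISS show how many values are present and missing;
   MIN and MAX expose values outside the expected range */
proc means data=pups n nmiss min max mean;
   var weight1 weight6;
run;
```

Here the maximum of 75 for WEIGHT6 stands out against the other six-week weights and would warrant a look at the source record.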

GOOD HABIT #4: CODE FLEXIBILITY INTO YOUR PROGRAMS

Your programs will be much easier to maintain, modify and correct if you learn to use the tools available in SAS to write programs that work even when your data change. Avoid hard coding value assignments and be extremely judicious in changing original data values. Learn to use macros, which make your programs easier to read, minimize problems caused by typing errors and eliminate the need to make the same changes in several files. Take advantage of the enormous library of SAS functions.
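A macro variable is perhaps the simplest macro tool for this purpose. In the sketch below, the names `cutoff` and `infile` are hypothetical, and SASHELP.CLASS (a sample data set shipped with SAS) stands in for real data.

```sas
/* Values defined once, at the top of the program */
%let cutoff = 13;
%let infile = sashelp.class;

data older;
   set &infile;
   if age >= &cutoff;
run;

/* Macro variables resolve inside double quotes */
proc print data=older;
   title "Subjects aged &cutoff or older";
run;
```

Changing the cutoff or the input file then means editing one line rather than searching the program for every place the value appears.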

Program 11a may require several hundred lines of code, depending on the dates being evaluated. Changes would be tedious and error prone.

   * Organize data by 6 month intervals;
   if date1 < mdy(01,01,80) then halfyr = 790;
   else if date1 ge mdy(01,01,80) and date1 < mdy(07,01,80) then halfyr=801;
   else if date1 ge mdy(07,01,80) and date1 < mdy(01,01,81) then halfyr=802;
   else if date1 ge mdy(01,01,81) and date1 < mdy(07,01,81) then halfyr=811;
   ………………………….
   else if date1 ge mdy(07,01,92) and date1 < mdy(01,01,93) then halfyr=922;

Program 11a

Program 11b does the same thing using functions, so it requires minimal coding and will work even when the dates change.

   * Build six month variable;
   quarter = qtr(date1);
   if quarter=1 or quarter=2 then halfyr='1st';
   else halfyr='2nd';
   halfyear = trim(left(year(date1))) || ' ' || halfyr;

Program 11b (flexible code using FUNCTIONS)

Resist the temptation to hard code text variable values as shown in Program 12a. Use PROC FORMAT instead and leave original coded values intact.

   Data names;
      Set litters;
      length kname $15;
      If kennel ='1' then kname='Aarowag';
      Else if kennel='2' then kname='Blazeaway';
      Else if kennel='3' then kname='Clarion';
   run;

Program 12a

   Proc format;
      Value $kname
         '1' = 'Aarowag'
         '2' = 'Blazeaway'
         '3' = 'Clarion';
   Run;

   Proc print data=litters;
      Format kennel $kname.;
   run;

Program 12b (flexible code using PROC FORMAT)

Using formats to group data also provides greater flexibility and, if your program is well organized, makes it easy to quickly see what you’ve done. PROC FORMAT is far more efficient, less prone to error and easier to change.

   DATA model1;
      set analyfile;
      if age < 15 then agegroup = 1;
      else if 15 <= age <= 19 then agegroup = 2;
      else if 20 <= age <= 24 then agegroup = 3;
      … etc …

Program 13a

   proc format;
      value ageA
         low-14 = '< 15 years'
         15-19  = '15-19 years'
         20-24  = '20-24 years';    /* etc. */

      value ageB
         low-24 = '< 25 years'
         25-34  = '25-34 years'
         35-44  = '35-44 years';    /* etc. */
   run;

   proc freq data=analyfile;
      tables age;
      format age ageA.;
   run;

   proc freq data=analyfile;
      tables age;
      format age ageB.;
   run;

Program 13b (flexible code using PROC FORMAT)

Use the IN operator rather than IF/OR comparisons when evaluating a series of constants. Which of these two programs would be easier to change?

   DATA malepups;
      Set litter1;
      If name='Jet' OR name='Kip' OR name = 'Spike';

Program 14a

   DATA malepups;
      set litter1;
      If name in ('Jet', 'Kip', 'Spike');

Program 14b (flexible code using the IN operator)

GOOD HABIT #5: TEST YOUR PROGRAMS

Get in the habit of testing your programs. Use RUN CANCEL; to first check your syntax, and then use OBS= to test on a subset. Program 15 reads only the first 100 records from the input file.

   /* Read the first 100 records */
   Data test;
      set disk.tbfile (OBS = 100);
   run;

Program 15 (SUBSET using OBS=)
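For the syntax check that precedes the OBS= test, a minimal sketch of RUN CANCEL; follows. SASHELP.CLASS, a sample data set shipped with SAS, and the variable `teen` are used here only for illustration.

```sas
/* The step is compiled and checked for syntax errors,
   but RUN CANCEL stops it from executing. */
data test;
   set sashelp.class;
   if age > 12 then teen = 1;
   else teen = 0;
run cancel;
```

Once the log shows no syntax errors, replace `run cancel;` with `run;` and add OBS= to try the step on a small subset.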

OBS= used in conjunction with FIRSTOBS= allows you to subset from within a file and to specify which records to read. Program 16 reads observations 100 through 120.

   /* Read observations 100 through 120 */
   Data test;
      set tbfile (FIRSTOBS = 100 OBS = 120);
   run;

Program 16 (SUBSET using FIRSTOBS= and OBS=)

Use the RANUNI or UNIFORM function to generate a random sample or a stratified random sample to test your code.

   /* Approx 10% random sample */
   Data test;
      set disk.tbfile;
      if RANUNI (-1) le 0.10;
   run;

Program 17a (SUBSET on a random sample)

   /* Approx 10% stratified random sample */
   Data ny_samp;
      set allcity;
      if city = 'New York' then do;
         if UNIFORM (0) < .10 then output;
      end;
   run;

Program 17b (SUBSET on a stratified random sample)

Test and debug your programs incrementally, checking results at intermediate steps. Your programs will be much easier to debug if you do not delete intermediate files or exclude intermediate variables from the output file before checking results.

GOOD HABIT #6: DOCUMENT EXTENSIVELY

Documentation is extremely important, both for you and for those who work with your programs. Get in the habit of documenting what you've done as you do it. If you've been generous in using comments and labels and have organized your programs in a logical way, half of your documentation work will already be done. If you have used PROC FORMAT, you will have already documented your discrete variable values. Don't forget to use as many TITLE statements as necessary to describe output listings and make sure your variables have meaningful labels.
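As a short sketch of titles and labels working together, the step below prints a few records with descriptive column headings. It uses SASHELP.CLASS, a sample data set shipped with SAS, and the label text is invented for the example.

```sas
/* The LABEL option tells PROC PRINT to show labels as column headings */
proc print data=sashelp.class (obs=5) label;
   var name age height;
   label name   = 'Student name'
         age    = 'Age in years'
         height = 'Student height';
   title1 "Class roster, first 5 records";
   title2 "Source: SASHELP.CLASS sample data";
run;

title;   /* clear titles so they do not carry over to later output */
```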

CONCLUSION

There are probably as many ways to approach a programming task as there are programmers. However, the programmer whose code conforms to a consistent style will find it far easier to determine later what a program was written to do. Grouping similar statements together and putting them in the same place in every program makes them easy to find without having to search through several lines of code. A consistent scheme for naming variables, arrays and files often enables the coder to deduce meaning from names alone. Logic is far easier to follow if DO LOOPs, assignment statements and evaluations are coded the same way across programs.

There are, however, programming practices shared by good programmers, regardless of individual style. First, their programs are easy to read. Programs are well organized so that logic is easy to follow, and the good programmer is willing to sacrifice brevity for clarity when necessary. Second, good programmers write programs that are easy to maintain and change. They let their programs do the ‘heavy lifting’, avoiding tedious, repetitive coding when alternatives that provide flexibility are available. Finally, the good programmer thoroughly tests, checks and documents his/her programs and does not consider the work done until these tasks have been completed.

Not all habits are bad! The programmer who makes these habits his own will find his job easier and will certainly earn the gratitude of those who inherit his programs.

TRADEMARK

SAS is a registered trademark of the SAS Institute, Inc. in the USA and other countries.

AUTHOR CONTACT

Casey Cantrell
Clarion Consulting
4404 Grand View Blvd.
Los Angeles, CA 90066
