US Statistical Programming Standards
Programming Standards
A. Conventions for program names
Standard Reporting and Dataset Program names
In general, all program names except those of validation programs should be limited to eight (8) consecutive characters to ensure that most operating systems will be able to accommodate them.
D#xxxxxx    Derived dataset creation programs; # indicates the order in which the programs should be run; xxxxxx should be as close as possible to the name of the dataset created by the program
D#Axxxxx    Analysis dataset creation programs; # indicates the order in which the programs should be run; xxxxx should be as close as possible to the name of the dataset created by the program
Ixxxxxxxx   Import programs
Exxxxxxx    Export programs
Txxxxxxx    Table programs; xxxxxxx should indicate the dataset that the table is based on or a portion of the table title
Lxxxxxxx    Patient data listing programs; xxxxxxx should indicate the dataset that the listing is based on or a portion of the listing title
Gxxxxxxx    Graph programs; xxxxxxx should indicate the dataset that the graph is based on or a portion of the graph title
Cxxxxxxx    Data checking programs; xxxxxxx should indicate the dataset that the program is checking
Mxxxxxxx    Macro programs
Vxxxxxxxx   Validation programs; xxxxxxxx indicates the name of the program being validated
Kxxxxxxx    Internal checking programs to be kept for future reference
Xxxxxxxx    Miscellaneous testing programs to be deleted at the end of a project
B. General Programming Standards
Remember that, more often than not, you will not be maintaining the programs that you have written. What seems most obvious to you may not be so obvious to the next programmer.
Program documentation should be included in the program as the code is being developed and tested, not after coding is completed.
All programs should contain enough documentation to describe their processing; how much is needed is left to the programmer's judgment and depends on the complexity of the code. Specific attention should be given to derived variables, transformations and calculations.
Within code use a standard formatting structure:
The standard header must be used in all programs.

   /*------------------------------------------------------------------------**
   ** PROGRAM:    program name
   ** CREATED:    mm/dd/yyyy
   ** PURPOSE:    description of what the program does
   ** PROGRAMMER: your name
   ** INPUT:      libname.dsname
   ** OUTPUT:     libname.dsname or brief table title
   ** PROTOCOL:   ABC-123
   ** MODIFIED:   DATE      BY      NOTE
   ** ------------------------------------------------------------------------**
   **------------------------------------------------------------------------**
   ** PROGRAMMED USING SAS VERSION x.x.x                                     **
   ** COPYRIGHT (C) YYYY BY COMPANY, ANYTOWN, USA --- ALL RIGHTS RESERVED    **
   **------------------------------------------------------------------------*/

Put liberal comments in the programs to describe what you are doing.
Use one statement per line. This may be a procedure or data step statement.
Place two blank lines between each program module (“major” section of logic).
Place one blank line between procedures and/or data steps.
Use indentation and blank lines to enhance readability, especially in if-then-else statements, select statements, and do loops. Be consistent throughout your code.
Use meaningful, self-documenting data set and variable names so that they help the reader understand the program.
Avoid using /* */ for comments. This should only be used for blocking out sections of code when testing the program.
Write programs with ease of modification in mind.
All DATA steps and PROCs must end with a RUN statement. Validation and debugging cannot be done correctly without RUN statements.
All PROC SQL and PROC DATASETS steps must end with a QUIT statement.
All options that are set and used in the program should be placed at the top of the program.
Use indentation to align all DO/END sections and indent code within sections to represent levels of nested logic.
Always start procedure statements or data step statements in column one. All other code should be indented within the data step or procedure.
Reference arrays with curly braces “{ }” rather than parentheses “( )” to avoid confusion with function references.
Avoid mixed data type calculations. Use INPUT or PUT functions prior to calculation to avoid automatic conversion messages in the SAS log.
When using IF-THEN-ELSE-IF statements, arrange the conditions in descending order of likelihood of being true, from highest to lowest, to ensure efficient execution of code.
Always use ‘DATA=dataset’ in PROC statements. Use ‘WHERE=condition’ instead of subsetting IFs when reading data in; this keeps unneeded observations out of the PDV and thus saves processing time.
Use ‘KEEP=’ and ‘DROP=’ options on SET statements to minimize the variables read into the PDV. Also, use KEEP or DROP statements at the end of DATA step processing to prevent unneeded variables from being kept.
Avoid repeating SAS code over and over again. Use macros or arrays when appropriate but don’t overdo it.
Data-specific code, or hardcoding, should only occur in highly exceptional situations and must be approved and documented by the appropriate project team leader(s), e.g.

   if patient = '00065' then delete;

or

   if dose = '20' then dose = '40';

When modifying someone else’s program, differentiate, as much as possible, your modifications from the existing code in some way, preferably with embedded comments, your initials and the date of the change.
ENDSAS statements must be removed from all finalized programs.
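Several of the rules above (WHERE= and KEEP= on the SET statement, curly braces for arrays, explicit type conversion, a KEEP statement at the end of the step) can be seen together in one short step. The following is only a minimal sketch: the data set WORK.RAWVS and all variable names are hypothetical and exist purely for illustration.

```sas
* Hypothetical input data. All names here are illustrative assumptions;
data work.rawvs;
   input patient $ visit weightc $ sbp1 sbp2 sbp3;
   datalines;
00001 1 70.5 120 118 122
00001 2 71.0 125 .   121
00002 . 65.2 110 112 111
;
run;

data work.vitals;
   * KEEP= and WHERE= limit what is read in from the input data set;
   set work.rawvs (keep=patient visit weightc sbp1-sbp3
                   where=(visit ne .));
   array sbp{3} sbp1-sbp3;

   * Explicit conversion with INPUT avoids an automatic conversion note;
   weight = input(weightc, best12.);

   * Curly braces mark SBP as an array reference, not a function call;
   sbpmean = mean(of sbp{*});

   keep patient visit weight sbpmean;
run;
```

The record with a missing VISIT is filtered out by WHERE= before the step processes it, and the character variable WEIGHTC is never used directly in a calculation.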
C. Program Validation Standards
The programmer who is assigned to validation must validate all programs, and proof of validation must be filed with the Lead Programmer for both draft and final tables. Proof of validation includes, but is not limited to, the following:
1. General
Copies of completed QC Checklists should be given to the Lead Programmer, along with copies of the validated program and any other materials in the validation folder.
2. Production Programs
The following are guidelines for validation elements to include in production programs.
A. All validation printing that does not identify specific data problems should be subset with a WHERE=(&PRINTME) statement. Conversely, any validation that does identify data issues should not be subset with a WHERE=(&PRINTME) statement. This will allow the Lead Programmer to “turn off” superfluous printing at the time of final production and only allow true data issues to be printed in the validation output.
B. The programmer is in the best position to know which subjects have problem data for a particular program. Select 5% of the population, or 5 subjects, whichever is higher, and PROC PRINT the original input data set(s) and the derived output data sets. Be sure that problem subject information is included in the chosen percentage.
C. When your program has critical expectations about the data, your SAS program code should be robust and flexible enough to check that these expectations are met, and to provide warning when erroneous values appear. Remember that the data may change, so just because your expectations are valid today, don’t assume they will be valid forever. In general, you should try to program such that the program does the validation, not the programmer. It is recommended to create temporary datasets in your programming and use the %mwhoops macro to print offending records and to write a warning to the .LOG file. When the data problem is corrected, the message will automatically be removed.
D. The following statements are suggested as a tool for checking logic for creating variable X from existing variables Y and Z, especially checking for missing values:

   proc freq data=libname.dsname;
      tables X*Y*Z / list missing nocum nopercent;
      title "&progname: CHECK VARIABLE X";
   run;
E. For table programs, create cell indexes where appropriate.
F. For derived data set programs, check a PROC CONTENTS of the derived dataset for unlabeled or unneeded variables, and be sure that the dataset has a label as well.
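Items A and C above can be combined in one program. The sketch below is illustrative only: it sets &PRINTME inline (the standard sets it in MSETUP.SAS), uses a hypothetical data set WORK.DERIVED, assumes an upper limit of 120 for AGE, and writes its own warning with a PUT statement in place of the %mwhoops macro, whose interface is not described here.

```sas
* Hypothetical switch. The standard sets it in MSETUP.SAS, here it is set inline;
%let printme = 1;

* Hypothetical derived data with one missing and one implausible AGE;
data work.derived;
   input patient $ age;
   datalines;
00001 34
00002 .
00003 151
;
run;

* Routine validation print, can be turned off by setting printme to 0;
proc print data=work.derived (where=(&printme));
   title "CHECK: routine print of derived data";
run;

* Expectation check. Offending records are kept and flagged in the log;
data work.check;
   set work.derived (where=(age = . or age > 120));
   put "WARNING: unexpected AGE value " patient= age=;
run;

* Data-issue print, never subset by printme;
proc print data=work.check;
   title "CHECK: records with unexpected AGE";
run;
```

Once the data are corrected, WORK.CHECK is empty, so no warning is written and nothing is printed by the last step.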
3. SAS LOGs
A. Any ERROR messages are prohibited. All warnings and notes must be avoided, where possible.
B. SAS data manipulation notes are not permitted in program logs. These include, but are not limited to:
i. Character variables converted to numeric…
ii. Numeric variables converted to character…
C. The note “MERGE statement has more than one data set with repeats of BY values” indicates a serious merging problem. This note is not permitted and SAS code must be revisited to correct the problem.
D. In the rare instance where a message that is normally not permitted is found to be correct, put a note in the program where the message is generated and write on the QC Checklist the reason it is correct.
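One way to find the cause of the merge note in item C is to look for repeated BY values before merging. The following is a rough sketch under assumed names: WORK.DEMO is a hypothetical data set that is expected to hold one record per PATIENT.

```sas
* Hypothetical data set. DEMO should have one record per patient;
data work.demo;
   input patient $ sex $;
   datalines;
00001 M
00002 F
00002 F
;
run;

* Count records per BY value and keep the repeats for review;
proc sql;
   create table work.dupdemo as
   select patient, count(*) as nrec
   from work.demo
   group by patient
   having count(*) > 1;
quit;

proc print data=work.dupdemo;
   title "CHECK: repeated BY values in DEMO";
run;
```

If WORK.DUPDEMO is empty for at least one of the two data sets being merged, the merge cannot produce the note.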
4. Validation Programs
A. In general, keep it simple. For TLF programs, do NOT try to duplicate the “cosmetics” of the output.
B. For tables that include counts and percents, manually verify several percents in each column but program the counts.
5. Manual Review Validation
A. For derived data sets, in general you could manually validate small data sets (fewer than 20 observations and fewer than 15 derived variables). However, if there are a number of variables that are complicated to calculate or the data set will need to be validated often (say, monthly for DMC reports) it would be preferable to create a program to do the validation.
B. For data listings, manual review is preferred. Programs can be useful if there are items in the listing that are calculated in the reporting program or if the listing is a complicated subset of the main data.
C. For tables, if the total counts are small (n<20) validation by manual comparison to the data listing is preferred unless the table will need to be validated often.
6. How much do you check?
A. To do a “spot check” – check the “cosmetics” quickly; check about 2% of the output.
B. To do a “full check” – check the “cosmetics” carefully; check between 10% and 20% of the output.
C. For tables that include counts and percents, manually verify several percents in each column but program the counts.
D. Original Programming Guidelines
1) Validation
A. If the SAP changes after it is finalized, make a reasonable effort to get approval of those changes in writing (memos, email, etc.); at the very least, document what changed, why it was needed, and who authorized it within the program.
B. Include the name of the program in validation output titles.
C. A Table/Listing/Figure program is considered final when the author feels that the output is ready for initial delivery to the client (internal or external). The derived data set program that the TLF program is based on must be final before the TLF program can be considered final.
D. Once the program author determines that a program is final, the program author must complete a QC checklist. If separate validation needs to be completed, the QC checklist must be completed by the validator before delivery of the final output (Refer to programming SOPs).
E. If modifications are made to a program after it has been finalized and the QC checklist was completed, these modifications must be listed in both the header of the program and on the Program Modification checklist.
F. If a modification needs to be made to a validated program after final delivery of output, notify the Lead Programmer on the project.
G. The following should be included in the validation folders:
i) A completed QC checklist
ii) A printout of the program as it existed when the QC checklist was filled out (initialed and dated)
iii) Any emails or other documentation related to the output being created
2) Derived Data Sets
A. Data set structures, data set names and variable names should match company standards (based on CDISC) unless otherwise specified by the client.
B. All flags and calculated variables needed for tables and listings should be created and stored on the derived data set.
C. Do not attach any user-defined SAS formats to any variables in derived data sets unless specifically requested by the client.
D. Each user-defined format should be named the same as the variable that it describes, when possible.
E. Derived data sets should have the variables ordered meaningfully through a PROC SQL statement followed by a PROC COMPARE to ensure that the resulting data set is the same as the original.
F. Derived data sets should be stored and sorted by the minimum number of variables needed to identify a unique record.
G. There should be sufficient validation code built into the program to prove that the resulting data set is correct. This includes, but is not limited to, the following:
i. Cross-tabs of old vs. new variables.
ii. Print of select patients from the raw data set and then the derived data set.
iii. Ensure that merges were performed correctly using in= and printing cases that do not meet your merging expectations (e.g. if ina and not inb then output check;).
iv. Frequencies of categorical variables to ensure that there are no unexpected values and that any custom formats are comprehensive.
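The merge check in item G.iii and the reordering step in item E might look like the sketch below. The data sets WORK.DEMO and WORK.DOSE and their variables are hypothetical, and the expectation that every patient appears in both data sets is an assumption made for the example.

```sas
* Hypothetical inputs, sorted by PATIENT;
data work.demo;
   input patient $ sex $;
   datalines;
00001 M
00002 F
00003 M
;
run;

data work.dose;
   input patient $ dose;
   datalines;
00001 20
00002 40
00004 20
;
run;

* Merge with IN= flags and route unexpected cases to a check data set;
data work.derived work.check;
   merge work.demo (in=ina)
         work.dose (in=inb);
   by patient;
   if ina and inb then output work.derived;
   else output work.check;
run;

proc print data=work.check;
   title "CHECK: patients not found in both DEMO and DOSE";
run;

* Reorder variables with PROC SQL, then confirm the values are unchanged;
proc sql;
   create table work.final as
   select patient, dose, sex
   from work.derived
   order by patient;
quit;

proc compare base=work.derived compare=work.final listall;
run;
```

PROC COMPARE matches variables by name, so a change in variable order alone does not produce differences; any difference it reports points to a real change in content.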
3) Listings
A. In general, there should be no significant manipulation of incoming data sets – including flagging.
4) Tables
A. There should not be any re-coding or meaningful flagging done; only statistical calculations.
B. There should be sufficient validation code built into the program to prove that the resulting table values are correct. This can include, but is not limited to, the following:
i. Print of output directly from the SAS procedure
ii. Cell indexes
5) Exporting Data Sets (assuming the structure is different from the existing company data structure)
A. Review client specifications for each output data set and note in which existing data set each variable is located. Note what new variables need to be derived.
B. Check that all formats conform to client specifications.
C. Program the new derived data sets, reformatting and renaming where required. Check the contents and format against the client specifications and against the original data. Fill out a derived data set QC checklist. Program should be stored in SAS/Export subdirectory.
D. If the data to be exported is not the direct result of a SAS program (e.g., a SAS data set converted to an Excel spreadsheet), be sure to check the final result against the original SAS files.
6) Importing Data (assuming the end result is a SAS data set)
A. There should be sufficient validation code to prove that you actually received what was sent (provided the sender sends adequate documentation with the files being imported). This should include, but is not limited to, the following:
i. A PROC CONTENTS of all uploaded SAS data sets
ii. A PROC PRINT of select observations from each uploaded SAS data set
B. Where possible, you should also add checks to see if the content of what you received is what you were expecting. This can include, but is not limited to, the following:
i. A PROC COMPARE between previous data and the new data
ii. PROC FREQ of categorical variables
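The import checks listed in A and B could be strung together as in the following sketch. WORK.PREV and WORK.NEW are hypothetical stand-ins for a previous and a new transfer of the same data set, and the choice of five observations to print is arbitrary.

```sas
* Hypothetical previous and new transfers of the same data set;
data work.prev;
   input patient $ sex $ age;
   datalines;
00001 M 34
00002 F 51
;
run;

data work.new;
   input patient $ sex $ age;
   datalines;
00001 M 34
00002 F 52
00003 M 47
;
run;

* Structure of what was received;
proc contents data=work.new;
run;

* A few observations from the received data;
proc print data=work.new (obs=5);
   title "CHECK: first observations of the imported data";
run;

* Differences between the previous and the new transfer, matched by patient;
proc compare base=work.prev compare=work.new listall;
   id patient;
run;

* Unexpected values in categorical variables;
proc freq data=work.new;
   tables sex / list missing nocum nopercent;
   title "CHECK: SEX values in the imported data";
run;
```

The ID statement requires both data sets to be sorted by PATIENT, which holds for the sample data here.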
E. Complete TLF Deliverable (at DRAFT or FINAL File)
a. Run mchkvlen.sas macro (S:\SASUtils\MacLib) to make sure that length of all variable names is ≤ 8, and labels don’t exceed 40 characters.
b. Make sure all programs have been validated and QC checklists completed and signed off. For Draft deliveries, only Production programmer must have signed off; for Final deliveries, both Production and Validation programmers must have signed off.
c. Make sure all data set and TLF programs are in a RUN*.SAS batch program.
d. Clean up MTITLNUM.SAS and FORMATS.SAS (remove any code that is commented out).
e. Delete all derived data sets from the DATA\SUBMISSION directory.
f. Delete all files from the OUTPUT directory (including the GRAPHS subdirectory).
g. Make sure that &printme=0 in MSETUP.SAS.
h. Create and run RUNDATA.SAS and check the .LOG file for problems.
i. Create and run all other RUN*.SAS programs (RUNLIST, RUNTABS, etc.) and check the logs for problems.
j. Run all graph programs and check the logs for problems.
k. Do a quick on-line check of all .LST files to make sure nothing unexpected happened, if appropriate; check for “WHOOPS” output.
l. Run all validation programs separately and check derived data set compares, spot-check tables, listings, and figures.
m. Concatenate all tables and listings into one PDF file using PDF utility, create Table of Contents, check pagination and bookmarks.
n. Print all TLF output and review vs. SAP and validation output.
o. Check Table of Contents vs. TLF output to ensure consistency.
p. Copy all programs and output files into a DRAFT subdirectory in every folder (Pgms/DRAFT, Data/DRAFT etc.) and make them ‘read-only’.
q. Create password protected ZIP file.
r. If for some reason there is a need to modify a program, unlock *.sas, *.log and *.lst, fix and rerun the program, lock *.sas, *.log, *.lst and sign the Program Modification Form the same day.
F. Partial TLF Re-issue (FINAL or Post-FINAL File; select TLFs added or re-issued)
a. Run the new/additional programs and make sure logs are error-free
b. Make all files in the PGMS directory read-only
c. Make any data sets and/or output files read-only
d. Run the appropriate validation programs and check
e. Create a PDF file with just the new/additional output
f. Update the Table of Contents, if applicable
g. Create a PDF file with the new/additional TLFs integrated into the original PDF file with all TLFs
h. Copy all relevant files into a FINAL subdirectory and make it ‘read-only’
G. Study Documentation
b. Create Table of Contents
c. Create Program Directory
d. Create data set directory
e. Create cover memo
f. Create labels for binders
g. Create labels for CD/Diskette
h. Create binder tabs and section cover sheets
i. Compile final study binder
H. Final Data Delivery
a. Create a directory under SAS\Export with directories for DATA, PGMS, OUTPUT, MACROS and DOCUMENTS.
b. Copy all programs, final run *.log files, and final run *.lst files from the main study directory to the EXPORT\CD\PGMS directory.
c. Copy all output (*.RTF) from the main study directory to the EXPORT\CD\OUTPUT (and OUTPUT\GRAPHS, if applicable) directory.
d. Create a program that converts SAS datasets to SAS transport files in the SAS\EXPORT directory (see example S:\SASUtils\Program Examples\Import-Export\etrnsprt.sas). The program should:
1. Create transport files from the DATA\ORIGINAL and DATA\SUBMISSION data sets
2. Create transport files from the DATAMGMT\EXPORT\yyyy-mm-dd FINAL\AUDIT TRAIL data sets
3. Extract all data back into the EXPORT\DATA directories and compare the extracted files to the original files
e. Create PDF versions of the data dictionaries for both the ORIGINAL and SUBMISSION libraries and copy them to the EXPORT\CD\DOCUMENTS directory.
f. Create a PDF version of the SAS Program Directory and copy it to the EXPORT\CD\DOCUMENTS directory.
g. Copy a PDF version of the complete TLFs (including TOC) to the EXPORT\CD\DOCUMENTS directory.
h. Create the CD/DVD and delivery binder.
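The create, extract and compare cycle described in step d might be sketched as follows. This is not the content of etrnsprt.sas, which is not reproduced here; it assumes the XPORT engine is used for the transport files (the actual example program may use another method such as PROC CPORT), and the data set WORK.ADSL and the file name adsl.xpt are placeholders.

```sas
* Hypothetical data set standing in for a SUBMISSION data set;
data work.adsl;
   input patient $ age;
   datalines;
00001 34
00002 51
;
run;

* Write a transport file with the XPORT engine. The path is a placeholder;
libname xptout xport "adsl.xpt";

proc copy in=work out=xptout memtype=data;
   select adsl;
run;

libname xptout clear;

* Extract the transport file into a separate data set and compare;
libname xptin xport "adsl.xpt";

data work.adsl_back;
   set xptin.adsl;
run;

libname xptin clear;

proc compare base=work.adsl compare=work.adsl_back listall;
run;
```

The XPORT engine accepts only names of up to eight characters and labels of up to 40 characters, which is consistent with the limits checked by the mchkvlen.sas macro in the TLF deliverable steps.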