SAS REPORT WRITING OVER THE YEARS

An Exploration Into the Difficult Task of Meeting Everyone's Needs.

Written by Richard C Chiofolo, Ph. D. Kaiser Foundation Health Plan Oakland, California 5 July 1 \ 2000

Writing reports is not a very esoteric or intellectual task. It lacks glamour. The ability to produce reports isn't something people cite on their resumes. However, it may be one of the most useful functions in any language package, because programmer/analysts often need only to put information on a piece of paper as a list or simple comparison of some business function or event. Whether the data is character fields in a list or data points put on a graph, the intent is the same: put information in a format for presentation. Some level of summarization may be desired, but people are rarely willing to look at final statistical results without seeing some of the detailed information first. Communicating information should be the goal, but in many cases presentation is as important as the information itself. How information is presented counts, a lot. Programmers and analysts also have their biases about how they want to do their work. At my age a non-graphic PROC CHART is fine; my assistants want to play with Web-ready graphics.

We should look at the SAS procedures designed for making reports, but only those that give us the simple ability to present raw or summarized lists and comparisons. How well do they help us produce reports: not esoteric statistics, only the mundane listings of detailed or summarized records in our workplace? SAS was designed originally by statisticians to prepare data for statistical comparisons. In the 70's, SPSS, BMDP and SAS were strong rivals for the market of statisticians and scientific experiments. By the 80's SAS became a more generally used package for "users" and other non-statistical applications, competing with FOCUS and CICS. Report generation became a more important component of the overall SAS package. Although the purpose is relatively simple, attempts by SAS Institute to produce a flexible, coherent, easy-to-use and efficient SAS procedure for generating reports have been fraught with difficulties and mistakes.

Since we are concerned only with "reports," not summary statistics or high level analysis, we need to define which SAS Procedures generate reports. Clearly PROC FREQ is less statistical than PROC REG, yet both are analytical in nature. A report is output that lists detailed or summarized data, either as a listing or as a multi-field comparison, called cross-tabulation. It must show actual data points, not just our conclusions about some data, even if the data points are summarized data rather than detailed. Cross-tabulation should be included, since it also displays detailed information. Most businesses do not use refined statistics, like the variance procedures, to test and compare factors affecting their business. Business analysts are prone to more basic and obvious comparisons of data, so cross-tabulation is a more common feature in business analysis. In essence, a report either lists values or compares two or more lists of data values.

Let's look at several SAS procedures as they have evolved since the early stages, and measure them not by what they accomplish but by what affects their use, rating them on "Report Writer" dimensions. Here are three dimensions we can use to rate SAS Procedures as Report Writers.

1. A procedure is either algorithmic or textual.

Different people prefer one type or the other. People with statistical or heavy third GL backgrounds often prefer very concise, algebraic languages, terse with short names. SAS overall is wordy, like early BASIC, as seen in the logic and controls of a data step, while still retaining the algebraic logic of FORTRAN and PL/I, the languages SAS was first written in. Fourth GLs are usually more wordy than third GLs. SAS has tried to be both.

2. A procedure is either compact and standardized or modular and flexible.

Most SAS procedures are standardized packages of code that users never see but will produce consistent preformatted output, as long as the design covers all possible options. Users are constrained by the available options but also aided by simple designs that can produce complicated results. Sometimes users want more flexibility for very customized output, but most procedures follow standard approaches for ease of use. The need for flexibility leads to modular designs that offer more but are harder to master.

3. A procedure may be limited in scope or very comprehensive.

Most SAS procedures are limited. SAS is written to be a collection of data steps and Procedures, each completing a more limited task. Over time, SAS has become more comprehensive, becoming more 4th GL with longer, more complicated procedures that do more for you. You may or may not include various optional statements that make a report more complex, but you can opt for the very simple and minimal statements. A comprehensive procedure forces users to provide more detailed coding, with a higher learning curve, but allows more features and controls. SAS has always provided both types of procedures.

In fact, SAS provides all kinds of methods to complete a report, and we shall find procedures that represent all combinations on these dimensions. It is up to the user, not the product, to pick what works best for them. There is almost always more than one procedure and more than one method to generate any report. Let's now look at some of those methods used by SAS programmers over the years.

The Early Years: Using PRINT and NULL Data Steps

The first attempt to use SAS for generating reports was found in PROC PRINT. This very basic procedure was found in the earliest versions of SAS and was probably the most used. PRINT is essentially a "dump" of raw data points. Each row of the output is a single record or observation in the data. Each field or variable is put on the paper output as a single column. PRINT takes care of several needed features: automated spacing between the columns, variable labels that become column headings, observation counts, double spacing, and enough room to handle wide field values. The ID and BY statements provide report breaks, and SUM provides subtotals (with BY) and totals. PAGEBY and SUMBY were added later to allow control over page breaks and to specify which BY variables are used for subtotals.

The key was simplicity. PRINT is textual, very standard and limited. You simply specify the fields you want listed, in order, on a VAR statement. SAS takes care of the formatting of columns on the paper, and adjusts for pages with wider values. You can standardize column widths for the entire report using UNIFORM. PRINT is comprehensive, even using minimal code, as long as you stay with a simple need to dump the detailed records and fields of a dataset. For example, just:

PROC PRINT;

Will provide almost entirely what you need. Without any other statement, SAS defaults to dumping the last used dataset, all fields, in the order in which they appear in the dataset. If there are too many fields to fit on a page, SAS automatically "wraps" records, so that you get to see something like a multi-panel spreadsheet. PRINT is so useful as a debugging tool, that it is essential to writing even the most complicated SAS application. It is, on our dimensions, a comprehensive standardized textual procedure. Easy to use, it has filled so many needs it has become the procedure most taken for granted. It is closest to what most analysts are already familiar with, EXCEL and LOTUS.
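As an illustration of the PRINT statements described above, here is a minimal sketch. The dataset SASHELP.CLASS (a sample table shipped with most SAS installations) and the work dataset name are assumptions chosen for the example, not anything used in the original reports.

```sas
/* Assumption: SASHELP.CLASS is available; WORK.CLASS is a hypothetical copy. */
proc sort data=sashelp.class out=work.class;
  by sex;                      /* BY processing needs sorted input      */
run;

proc print data=work.class label uniform;
  by sex;                      /* report break for each value of SEX    */
  id name;                     /* NAME replaces the observation number  */
  var age height weight;       /* columns, in the order listed          */
  sum height weight;           /* subtotals per BY group, then totals   */
run;
```

Dropping every statement except PROC PRINT and RUN gives the plain "dump" of all fields; each added statement buys one more feature without changing the standard layout.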

Because PRINT was so standardized, it did not meet many specialized needs. Very often we needed to create a more customized report. There were ways to accomplish this using PRINT, mostly by preparing other records for the report BEFORE running PRINT. For example, if we wanted means on the subtotal lines instead of sums, we:

1. Ran PROC SUMMARY first to create subtotal records with Mean values;
2. Marked the new subtotal records with special keys to make them sort correctly when we:
3. Merged the subtotal and original records together.
4. Printed the resulting file, without any SUM statement, since the subtotals already exist.
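The four steps can be sketched roughly as follows. This is only one way to do it; the dataset SASHELP.CLASS and the names SORTKEY, ROWTYPE, WORK.SUBTOT and WORK.COMBINED are hypothetical, introduced for illustration.

```sas
/* Assumption: SASHELP.CLASS stands in for the detail file. */
/* Step 1: one subtotal record per group, holding means instead of sums. */
proc summary data=sashelp.class nway;
  class sex;
  var height weight;
  output out=work.subtot(drop=_type_ _freq_) mean=;
run;

/* Steps 2 and 3: mark each record with an invented sort key and combine. */
data work.combined;
  length rowtype $ 6;
  set sashelp.class(in=indetail) work.subtot;
  if indetail then do;
    sortkey = 1;               /* detail lines sort first in each group */
    rowtype = 'Detail';
  end;
  else do;
    sortkey = 2;               /* subtotal line sorts after its details */
    rowtype = 'Mean';
  end;
run;

proc sort data=work.combined;
  by sex sortkey;
run;

/* Step 4: print without a SUM statement; the subtotal rows already exist. */
proc print data=work.combined noobs;
  by sex;
  var rowtype name height weight;
run;
```

The invented key exists only to force each "Mean" row to follow the detail rows of its group.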

This method took time, and made the programming more complicated, since we added a SUMMARY procedure to create the subtotal records and a DATA step to merge them back in with the original detail records. False keys had to be invented just so we could sort each subtotal line after the detail lines it belongs to. In sum, PRINT could not be both simple and customizable. Whenever the users requested a more precise layout, we turned to a method that was the opposite on all dimensions: using NULL data steps to generate customized reports.

Report Writing a la Carte

Report Writing is often so complex, with such specialized needs, that SAS had a manual dedicated to using data step processing to produce a custom report. Using a combination of BY statements to keep track of the current record and some sophisticated features of PUT, such as the cursor's specific line number and column on the output page, it was possible to do almost anything COBOL or EASYTRIEVE could produce. The technique is well described in the SAS Applications Programming manual. This technique treats DATA steps as blocks of output paper. You control directly which record you are reading, and the way in which data is placed on the paper. As in the days when COBOL was king, this technique allows infinite options and direct control of where each value is placed on the output device. Using our dimensions, this technique is textual, highly customized and very comprehensive, particularly if you are good at algebraic formulas designed to make patterns of fields across a two-dimensional space. For a very skilled technician, used to laborious coding to produce a custom product, this technique is ideal. This is an APL programmer's lovefest. For the rest of us, this method is too complicated and too much work.
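A small sketch of the NULL data step technique, assuming the sample dataset SASHELP.CLASS; the column positions and the accumulator names HTSUM, N and HTMEAN are hypothetical choices made for the example.

```sas
/* Assumption: SASHELP.CLASS stands in for the detail file. */
proc sort data=sashelp.class out=work.detail;
  by sex;
run;

data _null_;                   /* no output dataset, only printed lines  */
  set work.detail;
  by sex;
  file print notitles;         /* route PUT output to the listing        */
  if first.sex then do;        /* start of a group: reset and head it    */
    htsum = 0;
    n = 0;
    put / @1 'Sex: ' sex;
    put @1 'Name' @15 'Height' @25 'Weight';
  end;
  put @1 name $10. @15 height 6.1 @25 weight 6.1;  /* explicit columns   */
  htsum + height;              /* sum statements retain across records   */
  n + 1;
  if last.sex then do;         /* end of a group: custom subtotal line   */
    htmean = htsum / n;
    put @1 'Mean height' @15 htmean 6.1;
  end;
run;
```

Every position on the page is the programmer's responsibility here, which is exactly the source of both the flexibility and the maintenance burden.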

Using MEANS and FREQ as Report Writers

In the early period, when there were no specific Report Writing tools, we found many creative ways to create a report. Most often PROC PRINT was combined with PROC SUMMARY. SUMMARY provided a way to create subtotal records that were often added back to the original records for printing. Another approach was to use PROC MEANS instead. SUMMARY produces data records, but does not list them on paper. MEANS, now almost identical to SUMMARY, is a procedure that defaults to paper. By selecting the fields and the needed statistics, you can create a report directly using MEANS. However, MEANS only shows summary data, not the original details. FREQ is even better. FREQ's output defaults to paper and provides cross-tabulation between two or more fields. The numbers are actually counts of records, unless you use a WEIGHT statement to shift the counts to another measure, like dollars. FREQ allows sending the output to a file, which can then be printed via PRINT. You gain the "grouping" feature of FREQ, which is the equivalent of the grouping that a CLASS statement provides in MEANS or SUMMARY.
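The following sketch shows MEANS and FREQ used this way. It assumes the sample dataset SASHELP.CARS; WORK.COUNTS is a hypothetical output name.

```sas
/* Assumption: SASHELP.CARS is available as sample data. */
/* MEANS prints summary statistics directly; CLASS supplies the grouping. */
proc means data=sashelp.cars n mean max;
  class origin;
  var msrp;
run;

/* FREQ cross-tabulates two fields and also saves the counts to a file. */
proc freq data=sashelp.cars;
  tables origin*type / out=work.counts;
run;

/* WEIGHT shifts the cells from record counts to dollars. */
proc freq data=sashelp.cars;
  weight msrp;
  tables origin*type;
run;

/* The saved counts can then be listed with PRINT. */
proc print data=work.counts;
run;
```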

In fact, FREQ is very powerful, but is rarely used with the optional parameters that make it so useful. If you use only one field, FREQ's output is a list, and LIST is an optional parameter that you can use to turn cross-tabulations into lists as well. The default values in each table cell when you use two or more fields are:

Actual Frequency of Records that have that unique combination of field values.
The Frequency as a Percent of the Column Total.
The Frequency as a Percent of the Row Total.
The Frequency as a Percent of the Table Grand Total.

Remember that FREQ provides only "Frequency" measures (without a WEIGHT); it tabulates how many records have the unique combination of Row and Column values. However, frequencies are sometimes the most basic way to demonstrate a relationship between two or more factors that affect a business process or product. FREQ is used a lot, but rarely outside the standard layout of four values: the frequency and the three percents. It is often useful to point out what frequency should have appeared (EXPECTED), and how far off the actual is from that value (DEVIATION). Without bothering a non-statistical analyst with Chi-square values, I can simply show my users which cells are "off" or "unexpected," and base the conclusions on counts alone. It has become one of my most powerful tools. However, the standardized and limited output of FREQ has often lost me the attention of managers: they don't like it. The frequent question is "Which dimension (row or column) is for which field?"
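A brief sketch of the EXPECTED, DEVIATION and LIST options discussed above, assuming the sample dataset SASHELP.CARS.

```sas
/* Assumption: SASHELP.CARS is available as sample data. */
proc freq data=sashelp.cars;
  /* Each cell shows the count, the expected count, and the deviation;  */
  /* the three default percents are suppressed to keep the cells small. */
  tables origin*type / expected deviation norow nocol nopercent;

  /* The same cross-tabulation printed as a list rather than a grid. */
  tables origin*type / list;
run;
```

Cells with large positive or negative deviations are the ones to point out to users as "unexpected," with no test statistics on the page.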

The real choice is not which procedure to use, but "How acceptable is the standard layout?" I have often had requests for very detailed reports, in which the user specifies exactly how the page layout should look. I usually give this response: "If you want it in a day, I have to use the standard layout of PRINT or FREQ. If you will give me a month, I can customize the report to your specific needs." Most often, users will take the quicker, though less adequate, alternative. Over time, this has become difficult, since other languages, like FOCUS, have concentrated on providing more support for comprehensive and flexible report writing. SAS has had to respond to the need for more flexible control while still remaining "easy to use."

Adequate Support for a Report Writer

At some point SAS stopped being primarily a statistical analysis tool. Although its reputation as a package hinges largely on the completeness, robustness, and reliability of its statistical procedures, SAS has evolved in the market. The "statistical" image is now a partial, not a complete, picture of the usage of SAS in today's market. Many users, with no background in programming, use SAS as a report writer, ignoring most of the statistical procedures in the package. They want speed and ease, but also elaboration and complex structure. These goals may be incompatible. Nevertheless, SAS has tried several procedures designed to generate reports. They vary on our dimensions and they have been received with mixed results. All were designed to:

1. List actual data points onto an output device.
2. Provide some measure of summarization in the same report at "break" points in the data.
3. Allow summary measures other than subtotals, like means or maximum values.
4. Allow titles for report pages, columns and notes in the body of the report.
5. Allow some form of simple comparison between two or more fields to demonstrate a "relationship."

This has never been easy, since designing a general use Report Writer has to take into account the dimension preferences and choices we have noted. More important, which procedure is used has more to do with the programmer than the package. Whether the Procedure should be algorithmic or textual is a "preference" that should have nothing to do with the expected outcome. For example, programmers with scientific and mathematical backgrounds prefer algorithmic procedures, like TABULATE. Users-turned-programmers prefer REPORT. Less experienced SAS programmers tend to prefer standardized procedures to the more flexible and elaborate ones.

TABULATE as a Report Writer

For many of us TABULATE was a gift from Heaven. After struggling with combinations of SUMMARY and PRINT, or the use of the NULL step technique, many SAS programmers were getting frustrated. SAS made a large mistake with COMPUTAB, which turned out to be an insufficient copy of spreadsheet techniques. It never became popular, although SAS Institute touted it as an innovative report writer. Using cell addresses was not a step up for seasoned programmers. It led to sloppy and very specific code that could not be readily documented, the flaws of every spreadsheet. Another example of a serious mistake was the introduction of PROC MATRIX, which was an inadequate copy of the APL language. Although slick, and containing many of the elaborate and stunning features of APL and matrix algebra, like APL's "Domino," it did not stand on its own. Its most difficult hurdle was that the entire dataset had to reside in memory. Fourth Generation Languages were all "record readers." They did not have the true array indexing of a matrix or array language like BASIC or C++. You could not go back a record. SAS, like FOCUS, only read and wrote one record at a time, conserving mainframe memory and resources; MATRIX deviated from this underlying basis for SAS popularity on the mainframe.

TABULATE was flexible, algorithmic and comprehensive. You could vary the report layout and statistics shown. Relationships were easy to demonstrate. It could be used for simple or very elaborate listings as well as complex comparisons with summary statistics. Its "box" options, an extension of FREQ's tables, allowed for fancy, almost graphic displays. Many algebraic programmers became so fond of TABULATE that they use it for all the output they produce. They are TABULATE bigots, totally loyal to the capacities of this procedure. They stay with the procedure usually because they love algebraic code. With very slight changes to the details a completely different layout can be produced. For experienced programmers, TABULATE met a lot of needs.
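To show how terse and algebraic TABULATE code is, here is a minimal sketch assuming the sample dataset SASHELP.CARS; the box text is an arbitrary label chosen for the example.

```sas
/* Assumption: SASHELP.CARS is available as sample data. */
proc tabulate data=sashelp.cars;
  class origin type;
  var msrp;
  /* Rows: ORIGIN plus a total row. Columns: N and MEAN of MSRP by TYPE. */
  table origin all, type*msrp*(n mean) / box='Origin by Type';
run;

/* A slight change to the TABLE expression gives a different layout:     */
/* TYPE moves from the column dimension into the rows, nested in ORIGIN. */
proc tabulate data=sashelp.cars;
  class origin type;
  var msrp;
  table origin*type, msrp*(n mean);
run;
```

The comma separates dimensions, the asterisk nests, and a blank concatenates, which is why one edited operator can reshape the whole report.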

CHART, the Graphic List

I have always liked CHART. For a non-graphic procedure used on a mainframe it is very graphic. Almost identical to a complex form of FREQ, it allows for FREQ's output and redisplays a distribution of data points as bar charts. Since it allowed multiple "layers" with BY, GROUP and SUBGROUP, you could display several breakdowns, and combining one typology field with a SUMVAR field covered all bases. The HBAR was always the better choice, since users could verify the data's "image" with real values on the same page. It allowed most of the basic layout controls for displaying ranges of values (AXIS, DISCRETE), broken down various ways (MIDPOINTS, LEVELS). The best was the two-way simulated three-dimensional BLOCK chart, a trick as old as the first paintings done with vanishing points to simulate 3D in art.
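A short sketch of the HBAR options mentioned above, assuming the sample dataset SASHELP.CLASS.

```sas
/* Assumption: SASHELP.CLASS is available as sample data. */
proc chart data=sashelp.class;
  /* One bar per distinct AGE, repeated for each value of SEX; the      */
  /* frequency table printed beside the bars lets users verify them.    */
  hbar age / discrete group=sex;

  /* Bar length is the mean HEIGHT at each AGE rather than a count. */
  hbar age / discrete sumvar=height type=mean;
run;
```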

Ultimately PLOT and CHART failed in popularity. By comparison with GPLOT and GCHART, they were primitive and too standardized to maintain anyone's interest. FOCUS on the mainframe was even more disappointing by comparison. In the end PC graphics were so far ahead in display and software that mainframe graphics terminals, with their fat volumes of SAS Graphics manuals, foreign language character sets, fine tuning of ANNOTATE elements, and multiple device support, could not begin to compete.

PROC REPORT, the Ultimate Report Writer

SAS continued to develop reporting procedures. Although TABULATE became very popular, it failed with a certain market: those who want textual procedures, with greater standardization, less effort, and a more comprehensive approach. For a while, FOCUS became very popular with non-programmer users in Business environments. The competition was stiff, with FOCUS salesmen poking fun at the SAS statistical image and its algebraic formats. A "user friendly" textual and comprehensive Report Writer Procedure was still needed.

SAS came up with PROC REPORT. It has become the most popular Report Writer in the language, so much so that I no longer use TABULATE. It combines features from PRINT (COLUMN is equivalent to the VAR statement) with data step logic in COMPUTE sections that allow computational logic (COMPUTAB did as well). In some ways the Procedure is deceptively simple. For elaborate layouts and computations, REPORT does not always work the way it is expected to work, but over time, one absorbs the intricate details necessary to make REPORT meet very specific reporting needs.

Again, programmer background and bias have a lot to do with attitudes toward REPORT. Seasoned programmers often have a vested interest in their elaborate Data step logic, SUMMARY steps and TABULATE features. TABULATE can do a lot in only a few lines of code, but REPORT is very wordy. The new combination matches a very comprehensive procedure to very standard syntax in a textual format.

REPORT is a return to basics. It looks like a data step, SUMMARY and PRINT combined. The features of PRINT are reused in the COLUMN and BREAK/RBREAK statements, but now you have a DEFINE for each variable that distinguishes the displayed values (ANALYSIS and DISPLAY) from SUMMARY's CLASS values (GROUP and ACROSS). A simple shift from GROUP to ACROSS turns a list into a cross-tabulation, just as in TABULATE, where you shift the dimension (page, row or column) of a field and add CLASS statements. In fact, SAS has simply repackaged the SUMMARY/PRINT combination as a new form of text based TABULATE. The key advantage of REPORT is the use of data step logic that allows new field calculations and exception annotations on a report list. It is sometimes very tricky to use, largely because the programmer must grasp, almost intuitively, what values are currently found in a field inside the COMPUTE logic. Just calculating a percent column can be tricky, when it was standard in TABULATE. As with all choices, flexible code is often the opposite of simple code.
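The REPORT statements described above can be sketched as follows. The dataset SASHELP.CARS and the computed column MARGIN are assumptions made for illustration.

```sas
/* Assumption: SASHELP.CARS is available; MARGIN is a hypothetical column. */
proc report data=sashelp.cars nowd;
  column origin type msrp invoice margin;      /* like VAR in PRINT      */
  define origin  / group;                      /* like CLASS in SUMMARY  */
  define type    / group;
  define msrp    / analysis mean format=dollar10.;
  define invoice / analysis mean format=dollar10.;
  define margin  / computed format=percent8.1 'Margin';

  /* Data step logic: inside COMPUTE, an analysis variable is referenced */
  /* as variable.statistic, and it holds the value for the current row.  */
  compute margin;
    margin = (msrp.mean - invoice.mean) / msrp.mean;
  endcomp;

  break after origin / summarize;              /* subtotal per ORIGIN    */
  rbreak after / summarize;                    /* grand total line       */
run;

/* Shifting TYPE from GROUP to ACROSS turns the list into a cross-tabulation. */
proc report data=sashelp.cars nowd;
  column origin type,msrp;
  define origin / group;
  define type   / across;
  define msrp   / analysis mean format=dollar10.;
run;
```

The `msrp.mean` reference is a small instance of the trickiness noted above: on a summary line it holds the group-level mean, not any single record's value.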

Report Writer History as a Dimensional Space

Now let's summarize the chronology of SAS report writing evolution as points in a three-dimensional space, so we can see if there is a definitive pattern of development. Here is my high level guesstimate comparing SAS Report Writing procedures on three dimensions: Style (Text or Algebra), Standardization (Standard Compact or Flexible Modular) and Scope (Limited or Comprehensive):

[Figure: SAS Report Writing Procedures Dimensionalized. Horizontal axis: Comprehensiveness (Limited to Comprehensive).]

Make two assumptions about the chart: that the top is more Standard, and the capped Procedure names are text based, while the lower case procedures are algorithmic.

The chart illustrates that SAS started with a highly standard and very limited text procedure, PRINT. Its first developments were to enhance its cross-tabulation techniques in FREQ with a graphic version called CHART, which enlarged the scope of simple PRINT listings. Pressure to open up a very flexible procedure for fully customized reports fell back on the open ended processing of NULL data steps. This did not last, since all standardization was lost, and the code looked too algebraic and unreadable. SAS shifted to a completely flexible TABULATE procedure that reintroduced a standard method for laying out the report cells, one that looked almost like FREQ's output but with infinite flexibility. The push for a textual procedure that pulled back in the flexibility of data step manipulation produced REPORT, the ultimate SAS report writer.

What did we learn?

Hopefully some lessons were learned. The primary one is simple: use text, not formulas. The market of SAS users is larger than the market of SAS programmers. They may be less loyal, and less committed, but there are more of them buying the SAS product, and the SAS image has moved into their world. The other is that it is still a good idea to have both limited and comprehensive procedures, but all of them must apply strong standards. The NULL report is a nightmare to understand, change, document and maintain. PRINT and FREQ have not died. They are not obsolete, but we can now choose different levels of scope, depending on the assignment. What will the future hold? Probably more standards, as REPORT becomes more the first choice of SAS users. It still lacks the elegance of a well thought out TABULATE report, but it will always be more popular in the general market.
